Abstract
Most shared-memory architectures provide hardware support for making mutually exclusive accesses to shared data structures. This support usually consists of instructions that atomically read and then write a single memory location. These atomic instructions are used to manipulate locks; when a processor is accessing a shared data structure, its lock is busy, and other processors needing access must wait. For small critical sections, spinning (or "busy waiting") for a lock to be released is more efficient than relinquishing the processor to do other work. Unfortunately, spin-waiting can slow other processors by consuming communication bandwidth. This paper examines the question: are there efficient algorithms for software spin-waiting given hardware support for atomic instructions, or are more complex kinds of hardware support needed for performance?
We consider the performance of a number of software spin-waiting algorithms. Arbitration for control of a lock is in many ways similar to arbitration for control of a network connecting a distributed system. We apply several of the static and dynamic arbitration methods originally developed for networks to spin locks. We also propose a novel method for explicitly queueing spinning processors in software by assigning each a unique number when it arrives at the lock. Control of the lock can then be passed to the next processor in line with minimal effect on other processors. Finally, we examine the performance of several hardware solutions that reduce the cost of spin-waiting.
Background
In shared-memory multiprocessors, each processor can directly address memory that can be addressed by all other processors. This uniform access requires some method for ensuring mutual exclusion: the logically atomic execution of operations (critical sections) on a shared data structure. Consistency of the data structure is guaranteed by serializing the operations done on it.
Because pure software mutual exclusion is expensive, virtually all shared-memory multiprocessors provide some form of hardware support for making mutually exclusive accesses to shared data structures. This support usually consists of instructions that atomically read and then write a single memory location.
Atomic instructions serve two purposes. First, if the operations on the shared data are simple enough, they can be encapsulated into single atomic instructions. Mutual exclusion is directly guaranteed in hardware. If a number of processors simultaneously attempt to update the same location, each waits its turn without returning control back to software.
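As a concrete, hypothetical illustration (C11 atomics; the counter and names are ours, not the paper's), a simple shared counter can be updated with a single atomic instruction and needs no software lock:

#include <stdatomic.h>

static atomic_int hits;                /* shared data structure: one counter */

void record_hit(void)
{
    /* one atomic read-modify-write; the hardware serializes concurrent callers */
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}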
Second, a lock is needed for critical sections that take more than one instruction. Atomic instructions are used to arbitrate between simultaneous attempts to acquire the lock, but if the lock is busy, waiting is done in software. When a lock is busy, the waiting process can either block, relinquishing the processor to do other work, or spin ("busy-wait") until the lock is released. Even though spin-waiting wastes processor cycles, it is useful in two situations: if the critical section is small, so that the expected wait is less than the cost of blocking and resuming the process, or if no other work is available.
Common Hardware Support for Mutual Exclusion
Most architectures support mutual exclusion by providing instructions that atomically read, modify, and write memory. These atomic instructions are straightforward to implement. Conceptually, they require four services that might need inter-processor communication: the read and write, some method of arbitration between simultaneous requests, and some state that prevents further accesses from being granted while the instruction is being executed. Most multiprocessors are able to collapse these services into one or two bus or network transactions.
Multistage networks connect multiple processors to multiple memory modules. Memory requests are forwarded through a series of switches to the correct memory module. When a value is read from memory as part of an atomic instruction, any cached copies of the location (recorded in the directory associated with the memory module) must be invalidated, and subsequent accesses to that memory module, or at least to that location, must be delayed (or refused and retried) while the new value is being computed. To minimize this delay, the computation can be done remotely by an ALU attached to each memory module.
In single bus multiprocessors, the bus can be used for arbitration between simultaneous atomic instructions. Before starting an atomic instruction, a processor acquires the bus and raises a line (the atomic bus line). This line is held while the new memory value is being computed to prevent further atomic requests from being started, but the bus can be released to allow other normal memory requests to proceed. Waiting atomic requests are delayed and re-arbitrate for the bus only when the line is dropped.
In systems that do not cache shared data, the bus transaction used to acquire the atomic bus line can be overlapped with the read request for the data. Similarly, with invalidation-based coherence, even if the lock value is cached, acquiring the atomic bus line can be overlapped with the signal to invalidate other cache copies. Note that normally the invalidation occurs even if the instruction does not change the value of the location, because it is done before the instruction executes.
Write-back invalidation-based coherence avoids an extra bus transaction to write the data. In this protocol, the new value is temporarily stored in the processor's cache. When another processor needs the value (for instance, as part of an atomic instruction), it gets the value at the same time it invalidates the first processor's copy.
With distributed-write write-back coherence, the initial read is usually not needed. Because copies in all caches are updated instead of invalidated when a processor changes a memory value, the cache block needed by an atomic instruction will often already be in the cache. In this case it would be wasteful of bus cycles to piggy-back the arbitration for the atomic bus line on top of the arbitration for the bus; this is one reason to provide a separate arbitration mechanism for the atomic bus line. A bus cycle is still usually needed at the end of the atomic instruction to update copies in other caches.
Motivation
Are there efficient algorithms for software spin-waiting for busy locks given hardware support for atomic instructions, or are more complex kinds of hardware support needed for performance?
Goals
We show that the simple approaches to spin-waiting for busy locks have poor performance. Spinning processors can slow processors doing useful work, including the one holding the lock, by consuming communication bandwidth. This performance penalty occurs if processors spin by continuously trying to acquire the lock; it also occurs for small critical sections if processors spin reading the (cached) lock value and try to acquire the lock only when it is released.
While spinning processors can slow busy processors on any multiprocessor where spin-waiting consumes communication bandwidth, the precise performance of spin-waiting varies along several architectural dimensions: how processors are connected to memory, whether or not each processor has a hardware-managed coherent private cache, and if so, the coherence protocol. This paper will consider six types of architectures from within this design space:
- multistage interconnection network without coherent private caches
- multistage interconnection network with invalidation-based cache coherence using remote directories
- bus without coherent private caches
- bus with snoopy write-through invalidation-based cache coherence
- bus with snoopy write-back invalidation-based cache coherence
- bus with snoopy distributed-write cache coherence
We assume for all these that processors block when making a read request to memory.
Design
We consider the performance of several software spin-waiting alternatives. Although the analogy is not perfect, arbitration for control of a lock is in many ways similar to arbitration for permission to transmit on a carrier-sense multiple access (CSMA) network. In both there is a cost when either zero or more than one waiting processor attempts to acquire the resource. A number of arbitration mechanisms have been proposed for CSMA networks, including statically assigned slots, static delays, and dynamic backoff; we discuss the performance of these methods when applied to spin-waiting.
We propose a novel method for explicitly queueing spinning processors. As processors arrive at a lock, they each acquire a unique sequence number specifying the order that they will execute the critical section. When the lock is released, control can be directly passed to the next processor in line with no further synchronization and minimal effect on other processors.
We also examine the performance of several hardware solutions. We propose an addition to snoopy cache protocols that exploits the semantics of spin lock requests to obtain better performance.
Contributions
Performance of Simple Approaches to Spin-Waiting
Given atomic read-modify-write instructions, it is relatively straightforward to develop a correct spin lock. It is more difficult to devise an efficient spin lock.
- Performance when there is contention for the lock depends on minimizing the communication bandwidth used by spinning processors, since this can slow processors doing useful work.
- The delay between when a lock is released and when it is reacquired by a spinning processor must also be minimized, since no processor is executing the critical section during this time.
- Latency, the time for a processor to acquire a lock in the absence of contention, is also important, especially for applications with no bottleneck lock. A complex algorithm that reduces the cost of spin-waiting could degrade overall performance if it takes longer to acquire the lock when there is no contention.
This appears to pose a trade-off: the more frequently a processor tries to acquire a lock, the faster it will be acquired, but the more the other processors will be disrupted.
It might seem that the behavior of multiprocessors when there is contention for a spin-lock is not important. A highly parallel application will by definition have no lock with significant amounts of contention, since that would imply a sequential component. If an application has a lock that is a bottleneck, the best alternative would be to redesign the application's algorithms to eliminate the contention. In no case does it make sense to add processors to an application if they end up only spin-waiting.
There are, however, several situations where spin-lock performance when there is contention is important. Poor contention performance may prevent an application with a heavily utilized lock from reaching its peak performance, because the average number of spin-waiting processors will become nontrivial as the lock approaches saturation. Furthermore, if processors arrive at a lock in a burst, queue lengths can be temporarily long, resulting in bad short-term performance, without the lock being a long-term bottleneck.
Alternatively, it may not always be possible to tune a program to use the optimal number of processors. An operating system, for instance, has little control over the rate at which users make operating system calls. At high load, locks that are normally not a problem could become sources of contention. Similarly, on a multiprogrammed multiprocessor, a naive user can inadvertently ruin performance for all other users by combining a bottleneck critical section, lots of processors, and an inefficient spin lock.
A. Spin on Test-and-Set
init: lock = 0;
lock: while (test_and_set(lock) == 1);
unlock: lock = 0;
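A minimal runnable C11 sketch of the same algorithm (our rendering, not the paper's code; atomic_flag serves as the test-and-set primitive):

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;                  /* init: lock = 0 */

void acquire(void)
{
    /* every failed test-and-set still costs a bus or network transaction */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;                                                    /* spin */
}

void release(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release); /* unlock: lock = 0 */
}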
The simplest spin-waiting algorithm is for each processor to repeatedly execute a test-and-set instruction until it succeeds at acquiring the lock. The performance of spinning on test-and-set degrades badly as the number of spinning processors increases due to two factors:
- In order to release the lock, the lock holder must contend with spinning processors for exclusive access to the lock location. Most multiprocessor architectures have no way of giving priority to the clear request of the lock holder, requiring it to wait behind test-and-sets of spinning processors, even though these cannot succeed until the lock is released.
- On architectures where test-and-set requests share the same bus or network as normal memory references, the requests of spinning processors can slow accesses to other locations by the lock holder or by other busy processors. On multistage network architectures, spin-waiting can cause a "hot-spot", delaying accesses to the memory module containing the lock location as well as to other modules. On bus-structured multiprocessors, each test-and-set consumes at least one bus transaction, regardless of whether the lock value is changed; these can saturate the bus.
B. Spin on Read (Test-and-Test-and-Set)
init: lock = 0;
lock: while (lock == 1 || test_and_set(lock) == 1);
unlock: lock = 0;
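A hedged C11 sketch of spin-on-read (again our rendering; atomic_exchange stands in for test-and-set): plain loads poll the cached copy, and the atomic operation is attempted only when the lock looks free:

#include <stdatomic.h>

static atomic_int lock;                                      /* 0 = free, 1 = held */

void acquire(void)
{
    for (;;) {
        /* spin in the local cache; no bus traffic while the lock is held */
        while (atomic_load_explicit(&lock, memory_order_relaxed) == 1)
            ;
        /* the lock looked free: now try the expensive atomic operation */
        if (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 0)
            return;
    }
}

void release(void)
{
    atomic_store_explicit(&lock, 0, memory_order_release);
}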
Intuitively, coherent caches should be able to reduce the cost of spin-waiting. Spinning processors loop reading the value of the lock and execute a test-and-set instruction only when the lock is free; this eliminates the need to repeatedly test-and-set while the lock is held.
While the lock is busy, spinning is done in the cache without consuming bus or network cycles. When the lock is released, each copy is updated to the new value (distributed-write) or invalidated, causing a cache read miss that obtains the new value. The waiting processor sees the change in state and performs a test-and-set; if someone acquired the lock in the interim, the processor can resume spinning on its cache.
When the critical section is small, spinning on a read has almost as much effect on busy processors as spinning directly on a test-and-set instruction. The reason is that transient behavior can dominate: when the lock is released and reacquired by one of the waiting processors, it takes some time for the remaining processors to resume looping in their caches. During this time, most spinning processors have pending memory requests, delaying the requests of busy processors. This behavior is most pronounced for systems with invalidation-based cache coherence, but it also occurs with distributed-write. The transient period increases linearly with the number of processors; as a result, even a few spinning processors can adversely impact the execution speed of a moderate-sized critical section.
There are several factors that cause the performance of spinning on a memory read to be worse than expected:
- There is a separation between detecting that the lock has been released and attempting to acquire it with a test-and-set instruction. This separation allows more than one processor to notice that the lock has been released, pass by that test, and proceed to try a test-and-set. Ideally, if one processor could notice the change and acquire the lock before any other processor committed to doing a test-and-set, the performance would be better.
- Cache copies of the lock value are invalidated by a test-and-set instruction even if the value is not changed. If this were not the case, invalidations would occur only when the lock is released and then again when it is reacquired.
- Invalidation-based cache coherence requires O(P) bus or network cycles to broadcast the released lock value to P waiting processors. This occurs despite the fact that, after an invalidation, they each request exactly the same data.
While a solution to any of these three problems by itself would result in better performance, any single solution would still require bus activity that grows linearly with the number of processors.
Ideally, performance initially improves as processors are added, due to increased parallelism, but as the critical section becomes a bottleneck, performance levels out. The ideal performance (free spin-waiting) can be determined from the time to execute the critical section and the mean delay between lock accesses.
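As a rough sketch (the notation is ours, not the paper's): if the critical section takes c time units and each processor computes for t units between lock requests, then with P processors the ideal rate of critical-section executions is bounded by

ideal rate = min( P / (c + t), 1 / c )

so the curve flattens once P >= (c + t) / c, the point at which the lock saturates.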
Critical sections, since their purpose is to manipulate shared data structures, typically have higher memory access rates than non-critical sections. As a critical section becomes a bottleneck, the spinning processors slow the lock holder's execution, both in absolute terms and relative to non-critical sections, making it more of a bottleneck, resulting in more spinning processors.
Performance of More Complex Approaches to Spin-waiting
A. Delay after spinning processor notices lock has been released.
init: lock = 0;
lock: while (lock == 1 || test_and_set(lock) == 1)
begin
while (lock == 1);
delay();
end;
unlock: lock = 0;
We can reduce the number of unsuccessful test-and-sets when spinning on a read by inserting a delay between when a processor reads that the lock has been released and when it commits to trying the test-and-set. If some other processor acquires the lock during the delay, the processor can resume spinning; if not, it tries the test-and-set, with a greater likelihood that the lock will be acquired. The number of unsuccessful test-and-sets, and thus of invalidations, can be reduced.
Static Delay
Each processor can be statically assigned an amount of time to delay. This ensures that at most one processor times out at any instant, and it performs well when there are many spinning processors. When there are only a few spinning processors, however, the processor that times out first is unlikely to have been assigned a short delay, leaving the lock unacquired for a relatively long time.
Dynamic Delay
Each processor chooses a random delay and adapts it dynamically. In designing a back-off scheme for spin locks (back-off is used only if a lock is initially busy), there are a number of details that affect performance; a sketch in C follows the list:
- When a processor merely observes that the lock has been acquired by another processor, it should not increase or decrease its mean delay.
- There needs to be a maximum bound on the mean delay, on the order of the number of processors, so that back-off performs the same as statically assigned slots when there are many spinning processors.
- The initial delay of an arriving processor should be some fraction of its delay the last time it was at the lock, e.g., half the previous delay.
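A hypothetical C11 sketch combining spin-on-read with the bounded, randomized back-off just described (NPROC, delay_cycles(), rand(), and halving the previous delay are illustrative choices, not the paper's exact code):

#include <stdatomic.h>
#include <stdlib.h>

#define NPROC 32                       /* assumed machine size; caps the mean delay */

static atomic_int lock;                /* 0 = free, 1 = held */
static _Thread_local unsigned last_bound = 2;

static void delay_cycles(unsigned n)   /* crude busy-wait delay */
{
    for (volatile unsigned i = 0; i < n; i++)
        ;
}

void acquire(void)
{
    unsigned bound = last_bound / 2 > 0 ? last_bound / 2 : 1;   /* fraction of last delay */

    for (;;) {
        while (atomic_load_explicit(&lock, memory_order_relaxed) == 1)
            ;                                       /* spin on the cached copy */
        delay_cycles(rand() % bound + 1);           /* delay before committing to the test-and-set */
        if (atomic_load_explicit(&lock, memory_order_relaxed) == 1)
            continue;                               /* someone else got it: keep the same mean delay */
        if (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 0) {
            last_bound = bound;                     /* remember the delay for the next arrival */
            return;
        }
        if (bound < NPROC)
            bound *= 2;                             /* the test-and-set failed: back off further */
    }
}

void release(void)
{
    atomic_store_explicit(&lock, 0, memory_order_release);
}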
B. Delay between each memory reference.
init: lock = 0;
lock: while (lock == 1 || test_and_set(lock) == 1)
delay();
unlock: lock = 0;
This can be used on architectures without coherent caches or with invalidation-based coherence to limit the communication (bus) bandwidth consumed by spinning processors. We assume that a test-and-set instruction consumes more bandwidth than a simple read.
More frequent polling improves performance when there are few spinning processors and worsens performance when there are many.
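A hedged C11 sketch of this variant (our rendering; poll_interval and delay_cycles() are illustrative), which spaces out every probe of the lock since each one crosses the bus or network on these machines:

#include <stdatomic.h>

static atomic_int lock;                /* 0 = free, 1 = held */

static void delay_cycles(unsigned n)   /* crude busy-wait delay */
{
    for (volatile unsigned i = 0; i < n; i++)
        ;
}

void acquire(unsigned poll_interval)
{
    /* each read and each test-and-set is a bus/network transaction, so every
       failed probe is followed by a delay to limit the bandwidth consumed */
    while (atomic_load_explicit(&lock, memory_order_relaxed) == 1 ||
           atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 1)
        delay_cycles(poll_interval);
}

void release(void)
{
    atomic_store_explicit(&lock, 0, memory_order_release);
}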
C. Queueing in Shared Memory
init: flags[0] = HAS_LOCK;
flags[1 ... P-1] = MUST_WAIT;
queue_last = 0;
lock: my_place = read_and_increment(queue_last);
while (flags[my_place mod P] == MUST_WAIT);
flags[my_place mod P] = MUST_WAIT;
unlock: flags[(my_place + 1) mod P] = HAS_LOCK;
Maintain an explicit queue of spinning processors. Each arriving processor enqueues itself and then spins on a separate flag. When the processor finishes with the critical section, it dequeues itself and sets the flag of the next processor in the queue. If each processor's flag is kept in a separate cache block, then only one cache read miss is needed to notify the next processor.
A method for queueing spin-waiting processors that requires only a single atomic operation per execution of the critical section has been developed. Each arriving processor does an atomic read-and-increment to obtain a unique sequence number. When a processor finishes with the lock, it taps the processor with the next highest sequence number; that processor now owns the lock. Since processors are sequenced, no atomic read-modify-write instruction is needed to pass control of the lock.
Because a spinning processor automatically gets control of the critical section when its flag is set, the time between when one processor finishes and the next processor starts executing the critical section is reduced. In some sense, this exploits parallelism: the spinning processor does the time-consuming work of the atomic operation before the lock is released, decreasing the amount of serial work required to pass control. Thus, throughput actually increases as a critical section becomes a bottleneck.
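A hedged C11 sketch of this queueing lock (our rendering of the pseudocode above; NPROC is an assumed machine size, and the per-flag cache-block padding the text calls for is omitted for brevity):

#include <stdatomic.h>

#define NPROC 32
enum { MUST_WAIT = 0, HAS_LOCK = 1 };

static atomic_int flags[NPROC] = { HAS_LOCK };   /* flags[1..NPROC-1] start as MUST_WAIT */
static atomic_uint queue_last;
static _Thread_local unsigned my_place;

void acquire(void)
{
    /* the only atomic operation per critical-section execution: take a number */
    my_place = atomic_fetch_add_explicit(&queue_last, 1, memory_order_relaxed);
    while (atomic_load_explicit(&flags[my_place % NPROC], memory_order_acquire) == MUST_WAIT)
        ;                                        /* spin on a private flag */
    atomic_store_explicit(&flags[my_place % NPROC], MUST_WAIT,
                          memory_order_relaxed); /* reset the slot for its next use */
}

void release(void)
{
    atomic_store_explicit(&flags[(my_place + 1) % NPROC], HAS_LOCK,
                          memory_order_release); /* pass the lock to the next processor in line */
}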
Queueing has some bad aspects:
- It increases lock latency. Each processor must increment a counter, check a location, zero that location, and set another location; in other methods, when there is no contention, the first test-and-set acquires the lock. Thus, when there is contention, queueing is better; when there is no contention, back-off or simple spin-waiting is better.
- Queueing interacts badly with processor preemption: if a processor waiting in the queue is preempted, the processors behind it cannot acquire the lock until it runs again.
- Queueing makes it difficult to wait/poll for multiple events.
Conclusions
- Simple methods for spin-waiting for mutually exclusive access to shared data structures degrade overall performance as the number of spinning processors increases.
- Lock design must be tailored to the coherence protocol for best performance.
- Optimized protocols attempt to minimize the number of unfruitful test-and-set operations, in order to reduce the number of bus cycles consumed and to avoid excessive invalidation/update operations.
- For multiprocessors without special support for spin-waiting beyond implementing atomic instructions, software queueing and a variant of Ethernet back-off have good performance even for large numbers of spinning processors. Back-off, being simpler, has better performance when there is no contention for the lock; queueing, by parallelizing the lock hand-off, performs best when there are waiting processors.
- Whether real workloads will have significant enough amounts of spin-waiting to make hardware support for spin-lock requests worthwhile remains an open question.
Future work