How do you implement a semaphore in assembler taking into account the fact that you may have multiple CPUs with each CPU possibly executing multiple threads of code at any time?
We use the
xchg
instruction, which performs atomic exchange of its arguments.
;acquire semaphore
xor ax, ax ; set AX to zero
L1:
xchg ax, sema ; atomic test and set. sema = 0, ax = sema
or ax, ax ; is the value we obtained from sema 0?
jz L1 ; yes, wait for someone to release semaphore
dec ax ; no, we got the semaphore.
xchg ax, sema ; decrease by one and store
;process stuff
; release semaphore
xchg ax, sema ; sema value may have been changed by other threads
inc ax ; bump to notify we have released it
xchg ax, sema ; atomic store
This is why there is an atomic exchange instruction in many processors. An atomic test and set instruction is indispensable for a variety of systems programming, semaphores being one of them. On systems with multiple processors, you probably want to lock the bus before executing the
xchg
. The code becomes:
lock xchg ax, sema
In case you have forgotten, the
lock
instruction locks the memory bus for the duration of the next instruction, which ensures that another processor does not read or write to sema when we are writing to it.
The sequence of instructions
L1:
xchg ax, sema
or ax, ax
jz L1
form an anti-pattern called "busy-waiting" and burns up CPU resources when waiting to acquire the semaphore. In real life systems programming, it is often replaced by a spin lock, (which is a "design pattern")
A spin lock busy waits for a few instruction cycles and then goes to sleep. The idea behind this is that context switches are expensive, and for many operations, semaphores are held briefly enough that the majority of the time, the waiting thread acquires the semaphore without the added cost of a context switch.
;acquire semaphore
L0:
mov cx, 0800h ; spin the test loop so many times ...
xor ax, ax ; ... choose 1/2 cost of context switch and tweak
L1:
xchg ax, sema
or ax, ax
jnz L2 ; got the semaphore, go to update
loop L1 ; repeat till cx is zero
call monitor_wait ; context switch
jmp L0 ; need to retest after notify
L2:
dec ax
xchg ax, sema
As you can see this implementation is very efficient and wastes very few cycles. A semaphore variable needs to be initialized with the number of simultaneous threads that can access it. For a mutex (equivalent to java's synchronized) we set the value to 1.
The Linux kernel approaches the problem in a similar manner (albeit in C).
The java equivalent looks like this:
synchronized(lock) // sema is an invisible variable inside object lock
{ // sema acquired if you get here
// process
} // release sema
Given that mutexes are nothing but special case of semaphores which have been in use in systems programming since antiquity and the relative simplicity of the machine code to implement them, its surprising that the
monitorenter
and
monitorexit
instructions were prohibitively expensive in early JVM implementations.
No comments:
Post a Comment