Thursday, August 04, 2005

Semaphores in assembler and java

How do you implement a semaphore in assembler taking into account the fact that you may have multiple CPUs with each CPU possibly executing multiple threads of code at any time? We use the xchg instruction, which performs atomic exchange of its arguments.

;acquire semaphore
    xor     ax, ax      ; set AX to zero
L1:
    xchg    ax, sema    ; atomic test and set. sema = 0, ax = sema
    or      ax, ax      ; is the value we obtained from sema 0?
    jz      L1          ; yes, wait for someone to release semaphore
    dec     ax          ; no, we got the semaphore.
    xchg    ax, sema    ; decrease by one and store

;process stuff

; release semaphore
    xchg    ax, sema    ; sema value may have been changed by other threads
    inc     ax          ; bump to notify we have released it
    xchg    ax, sema    ; atomic store
This is why there is an atomic exchange instruction in many processors. An atomic test and set instruction is indispensable for a variety of systems programming, semaphores being one of them. On systems with multiple processors, you probably want to lock the bus before executing the xchg. The code becomes:

    lock xchg  ax, sema
In case you have forgotten, the lock instruction locks the memory bus for the duration of the next instruction, which ensures that another processor does not read or write to sema when we are writing to it. The sequence of instructions

L1:
    xchg    ax, sema
    or      ax, ax
    jz      L1
form an anti-pattern called "busy-waiting" and burns up CPU resources when waiting to acquire the semaphore. In real life systems programming, it is often replaced by a spin lock, (which is a "design pattern") A spin lock busy waits for a few instruction cycles and then goes to sleep. The idea behind this is that context switches are expensive, and for many operations, semaphores are held briefly enough that the majority of the time, the waiting thread acquires the semaphore without the added cost of a context switch.

;acquire semaphore
L0:
    mov     cx, 0800h    ; spin the test loop so many times ...
    xor     ax, ax       ; ... choose 1/2 cost of context switch and tweak
L1:
    xchg    ax, sema
    or      ax, ax
    jnz     L2           ; got the semaphore, go to update
    loop    L1           ; repeat till cx is zero
    call    monitor_wait ; context switch
    jmp     L0           ; need to retest after notify
L2:
    dec     ax
    xchg    ax, sema

As you can see this implementation is very efficient and wastes very few cycles. A semaphore variable needs to be initialized with the number of simultaneous threads that can access it. For a mutex (equivalent to java's synchronized) we set the value to 1. The Linux kernel approaches the problem in a similar manner (albeit in C). The java equivalent looks like this:

synchronized(lock) // sema is an invisible variable inside object lock
{                  // sema acquired if you get here
    // process
}                  // release sema
Given that mutexes are nothing but special case of semaphores which have been in use in systems programming since antiquity and the relative simplicity of the machine code to implement them, its surprising that the monitorenter and monitorexit instructions were prohibitively expensive in early JVM implementations.

No comments: