Monday, August 22, 2005

Gates' Law

"The speed of processors doubles every 18 months" -- Moore's law

"The speed of software halves every 18 months" -- Gates' law

I'm not certain whether Gates' Law was meant as a joke, but I did notice that Lotus 1-2-3 on an old 80286 ran at about the same speed as Excel runs on your modern PC.

Some of it is certainly because newer hardware makes more things possible, such as graphical interfaces and tooltips. Newer software also tries to be sexier by doing flashy things such as the talking paperclip. The result is a tendency for software to bloat to the point of saturation, i.e. until it performs at barely acceptable speeds.

But application programming seems to get worse with faster processors. Even with more memory, faster disks and multiple CPUs, your average web application runs slower. Is it possible that the faster hardware has made us worse programmers?

In the old days, to set a register to the value 1, even an average assembly programmer would do something like this:


    xor ax, ax      ; zero the register
    inc ax          ; ax = 1

rather than:

    mov ax, 1

The reason was simple. The latter carries the constant as an immediate operand, making the instruction longer, so more bytes must be fetched from memory, and memory is generally ten times slower than registers. Modern programming languages hide details like this from the programmer, so he never learns the real cost of bad programming.

With garbage collection now part of many languages, we see many applications having to deal with memory problems. Why? Because the average programmer does not know the cost of object creation since he does not have to deal with cleanup.
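As a crude illustration, consider a loop that creates a fresh object on every pass versus one that reuses a single instance. This is a hedged micro-benchmark sketch, not a rigorous one: the class name and iteration count are mine, and a clever JIT may skew the numbers.

import java.lang.StringBuilder;

// Hypothetical micro-benchmark; exact numbers depend entirely on
// your JVM and garbage collector.
public class ChurnDemo {
    static final int ITERATIONS = 10000000;

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        for (int i = 0; i < ITERATIONS; i++) {
            StringBuilder sb = new StringBuilder(64);  // fresh object every pass
            sb.append(i);
        }
        long t1 = System.nanoTime();

        StringBuilder reused = new StringBuilder(64);
        for (int i = 0; i < ITERATIONS; i++) {
            reused.setLength(0);                       // reuse the same buffer
            reused.append(i);
        }
        long t2 = System.nanoTime();

        System.out.println("allocate per pass: " + (t1 - t0) / 1000000 + " ms");
        System.out.println("reuse one buffer:  " + (t2 - t1) / 1000000 + " ms");
    }
}

The first loop also leaves ten million corpses for the collector to sweep up later, a cost the timing above does not even capture.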

Contrast, for example, stack and heap allocation in C++:


// case 1: allocate on the stack
int a = 1;
// other code
DirEnt de;

// case 2: allocate on the heap (and remember to delete later)
int *a = new int;
// other code
DirEnt *de = new DirEnt;

In the first case, the variables are allocated on the stack and disappear when execution leaves the scope. The compiler calculates the required space at compile time and generates a simple instruction such as:

sub sp, 2+248        ; int is 2 bytes, DirEnt is 248 bytes
In the second case, the variable is allocated on the heap, which is a comparatively slow operation. First, a check is made for sufficient free memory. The heap is then searched for a contiguous free block of the desired size; the block is initialized, marked "in-use", and a pointer to it is returned. If no contiguous block is available, the system may choose to compact/defragment memory and retry.
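To make those steps concrete, here is a toy first-fit allocator in Java. It is only a sketch under my own assumptions: real allocators use free lists, size classes and locks, none of which appear here.

// Toy first-fit allocator over a boolean "in-use" map. Everything here
// (names, sizes) is illustrative, not how any real allocator is laid out.
public class ToyHeap {
    static final int SIZE = 1024;                 // pretend heap of 1024 cells
    static final boolean[] inUse = new boolean[SIZE];

    // Search for a contiguous free run of n cells; mark it in-use and
    // return its start offset (our "pointer"), or -1 if none exists.
    static int alloc(int n) {
        int run = 0;
        for (int i = 0; i < SIZE; i++) {
            run = inUse[i] ? 0 : run + 1;
            if (run == n) {
                int start = i - n + 1;
                for (int j = start; j <= i; j++) inUse[j] = true;
                return start;
            }
        }
        return -1;    // no contiguous block: a real system might compact here
    }

    static void free(int start, int n) {
        for (int j = start; j < start + n; j++) inUse[j] = false;
    }
}

Even this toy version has to scan, mark and bookkeep on every call, while the stack case above is a single subtract.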

Note two characteristics of heap versus stack in the above example. First, the memory allocation routine must be called twice, since the compiler has no way of knowing whether the two allocations can be combined when other code sits between them. Second, while each thread has its own stack, multiple threads compete for the same heap allocator (apart from a small thread-local heap, which is not really a factor in large memory footprints).

In languages such as Java you cannot really allocate anything except primitives on the stack, even when you don't intend to use an object beyond the immediate scope. This places a big load on the memory management system: objects that properly belong on the stack now crowd the heap, straining the entire system.
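A hedged illustration (the Point class is hypothetical): every object below is used once and discarded, yet each one lands on the heap and must eventually be collected.

// Hypothetical example: every Point here is heap-allocated even though
// none of them outlives a single loop iteration.
public class Temporaries {
    static class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
        double distanceTo(Point o) {
            double dx = x - o.x, dy = y - o.y;
            return Math.sqrt(dx * dx + dy * dy);
        }
    }

    public static void main(String[] args) {
        Point origin = new Point(0, 0);
        double total = 0;
        for (int i = 0; i < 1000000; i++) {
            Point p = new Point(i, i);      // used once, discarded at once...
            total += p.distanceTo(origin);  // ...yet the GC must clean each one up
        }
        System.out.println(total);
    }
}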

"But how slow can memory allocation be? Isn't memory accessible in nanoseconds?"

In most operating systems, for practically all applications, the memory you allocate is virtual memory drawn from a theoretical address space, not actual physical memory. Unless you have huge amounts of physical memory, you will take page faults, and page faults run at the speed of storage, which is measured in milliseconds unless you have a solid state disk.

With fragmented memory and a large memory footprint, we end up dealing with multiple page faults, causing the system to thrash. This is especially apparent in multi-threaded programming where the locality of reference principle does not save you from multiple simultaneous page faults.

Unfortunately, even a programmer who is aware of these issues is helpless if the language itself is restrictive. What Java could have done is allocate objects on the stack when they are not passed to other methods:


// new translates to an allocation on the stack
DirEnt e = new DirEnt(".");
e.list();

// passed to another object, so allocate on the heap
DirEnt e2 = new DirEnt("..");
stack.add(e2);

Theoretically, the Java compiler could follow the object through the call tree and check whether anyone retains a reference to it (what is known as escape analysis), but that would make the compiler prohibitively slow, and perhaps incorrect when reflection is used.

It is therefore up to the programmer not to trust the much-touted features of any language (or any technology) without sufficient experience. Failing that, your programs end up testifying for Gates' Law.

Monday, August 08, 2005

The problem with DTDs and schemas

Many would agree with me that XML is too bulky for its intended purposes. But would you say the same about HTML? Probably not.

And why should that be? Isn't XML just a glorified, genericized HTML? Despite what the historians say about both being children of SGML, the fact of the matter is that XML came into being only after the spectacular success of HTML. If you remember the original marketing hype, it was all about how web documents would be XML in the future, and how they would be rendered or used differently by other consumers.

One of the important factors for the failure of XML is the reliance on DTDs and schemas. HTML succeeded simply because there was no predefined format for any given web page. Browsers ignored what they did not understand, and did the best with what they could. HTML was a language that browsers and servers spoke. There was a syntax, but no schema. The schema came later, long after HTML was refined through multi-year, multi-user testing.

Your typical enterprise application does not have that luxury.

Consider this snippet of XML:


<orders>
 <date>Dec-24</date>
 <order>
  <customer>Joe Schlubbs</customer>
  <customer-address>10 drowning st</customer-address>
  <customer-city>London</customer-city>
  <order-line>
   <quantity>100</quantity>
   <item-number>J-350</item-number>
   <item-name>extra large onions</item-name>
   <price-per-item>12.35</price-per-item>
   <!-- other fields -->
  </order-line>
 </order>
 <order>
  <customer>Joe Schlubbs</customer>
  <customer-address>10 drowning st</customer-address>
  <customer-city>London</customer-city>
  <order-line>
   <quantity>100</quantity>
   <item-number>J-007</item-number>
   <item-name>extra large bullets</item-name>
   <price-per-item>1.99</price-per-item>
  </order-line>
 </order>
</orders>

Let's assume we are writing an application that needs a list of all customers who spent more than a thousand bucks in the last month, so we can mail them some coupons.

An SQL programmer would simply write something like:

SELECT customer, SUM(price * qty)
FROM orders            -- plus a WHERE clause restricting to last month
GROUP BY customer
HAVING SUM(price * qty) > 1000

The typical XML programmer, on the other hand, has to fetch all the orders for the last month and skim through large amounts of text to produce the same result. Depending on the number of tags, close to 90 percent of the input will be useless to the application.
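For contrast, here is roughly what the XML side of that query looks like in Java DOM. This is a sketch under assumptions of mine: the file name "orders.xml" is invented, the date filter is omitted, and one order-line per order is assumed, as in the sample above.

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hedged sketch: walks the <orders> document above and totals
// price-per-item * quantity per customer.
public class CouponList {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("orders.xml"));

        Map<String, Double> totals = new HashMap<String, Double>();
        NodeList orders = doc.getElementsByTagName("order");
        for (int i = 0; i < orders.getLength(); i++) {
            Element order = (Element) orders.item(i);
            String customer = text(order, "customer");
            double qty = Double.parseDouble(text(order, "quantity"));
            double price = Double.parseDouble(text(order, "price-per-item"));
            double prev = totals.containsKey(customer) ? totals.get(customer) : 0;
            totals.put(customer, prev + qty * price);
        }

        for (Map.Entry<String, Double> e : totals.entrySet())
            if (e.getValue() > 1000)
                System.out.println(e.getKey());   // mail this customer a coupon
    }

    // first matching descendant's text; note how much of the document we skip
    static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }
}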

Could the server simply omit the unnecessary data? Probably not, because the tags are not optional. That's because when they designed the XML, they could not foresee that business would be so good that they would send out coupons.

It is simply impossible to foresee all possible future needs, so to really future-proof your schema, you would need to make nearly every tag optional. This pretty much voids the purpose of a schema.
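In XML Schema terms, that means sprinkling minOccurs="0" on nearly everything. A hedged fragment, using element names from the sample above:

<!-- illustrative fragment only; in XSD, minOccurs defaults to 1 -->
<xs:element name="customer-address" type="xs:string" minOccurs="0"/>
<xs:element name="customer-city"    type="xs:string" minOccurs="0"/>
<xs:element name="item-name"        type="xs:string" minOccurs="0"/>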

A schema where every tag is optional could still be useful to describe which tags go where and what they mean -- sort of like HTML.

Or you could just deal with the bulk.

Thursday, August 04, 2005

Semaphores in assembler and Java

How do you implement a semaphore in assembler, taking into account that you may have multiple CPUs, with each CPU possibly executing multiple threads of code at any time? We use the xchg instruction, which performs an atomic exchange of its operands.

;acquire semaphore
    xor     ax, ax      ; set AX to zero
L1:
    xchg    ax, sema    ; atomic test and set. sema = 0, ax = sema
    or      ax, ax      ; is the value we obtained from sema 0?
    jz      L1          ; yes, wait for someone to release semaphore
    dec     ax          ; no, we got the semaphore.
    xchg    ax, sema    ; decrease by one and store

;process stuff

; release semaphore
    xor     ax, ax      ; sema may have been changed by other threads,
    xchg    ax, sema    ; so claim the counter again atomically
    inc     ax          ; bump to signal that we have released it
    xchg    ax, sema    ; atomic store

This is why many processors provide an atomic exchange instruction. An atomic test-and-set is indispensable for a variety of systems programming tasks, semaphores being one of them. On systems with multiple processors, you probably want to lock the bus before executing the xchg. The code becomes:

    lock xchg  ax, sema

In case you have forgotten, the lock prefix locks the memory bus for the duration of the instruction it prefixes, which ensures that another processor does not read or write sema while we are writing to it. (On x86, xchg with a memory operand asserts the bus lock implicitly, but the explicit prefix does no harm.) The sequence of instructions

L1:
    xchg    ax, sema
    or      ax, ax
    jz      L1

forms an anti-pattern called "busy-waiting", which burns CPU resources while waiting to acquire the semaphore. In real-life systems programming it is often replaced by a spin lock (which is a "design pattern"). A spin lock busy-waits for a few instruction cycles and then goes to sleep. The idea is that context switches are expensive, and many semaphores are held so briefly that most of the time the waiting thread acquires the semaphore without paying for a context switch.

;acquire semaphore
L0:
    mov     cx, 0800h    ; spin the test loop so many times ...
    xor     ax, ax       ; ... choose 1/2 cost of context switch and tweak
L1:
    xchg    ax, sema
    or      ax, ax
    jnz     L2           ; got the semaphore, go to update
    loop    L1           ; repeat till cx is zero
    call    monitor_wait ; context switch
    jmp     L0           ; need to retest after notify
L2:
    dec     ax
    xchg    ax, sema

As you can see, this implementation is very efficient and wastes very few cycles. A semaphore variable needs to be initialized with the number of threads that may hold it simultaneously. For a mutex (equivalent to Java's synchronized) we set the value to 1. The Linux kernel approaches the problem in a similar manner (albeit in C). The Java equivalent looks like this:

synchronized(lock) // sema is an invisible variable inside object lock
{                  // sema acquired if you get here
    // process
}                  // release sema

Given that mutexes are just a special case of semaphores, which have been used in systems programming since antiquity, and given the relative simplicity of the machine code needed to implement them, it is surprising that the monitorenter and monitorexit instructions were prohibitively expensive in early JVM implementations.
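For counting semaphores proper (rather than mutexes), Java 5's java.util.concurrent finally exposes one directly. A minimal sketch, with the permit count and the work body as placeholders of mine:

import java.util.concurrent.Semaphore;

public class SemaDemo {
    static final Semaphore sema = new Semaphore(3); // 3 threads may hold it at once

    static void work() throws InterruptedException {
        sema.acquire();           // blocks until a permit is available
        try {
            // process stuff
        } finally {
            sema.release();       // always hand the permit back
        }
    }
}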

Wednesday, August 03, 2005

SEXML

XML has the unfortunate side effect of being bulky. While bulk is good for the large intestine, it's pretty much useless in computing. Which is why I devised SEXML: Simple Elemental XML. This is how it was derived. First, we get rid of attributes. What good are attributes? They can easily be replaced by elements:
<order qty="12" price="13.5" brokerage="15%">
<order-type>buy</order-type>
<effective-time>market-close</effective-time>
</order>
can be rewritten as:
<order>
<qty>12</qty>
<price>13.5</price>
<brokerage>15%</brokerage>
<order-type>buy</order-type>
<effective-time>market-close</effective-time>
</order>
Next, we get rid of the irritating double tags. We accomplish this by turning <qty>12</qty> into qty<12>. The XML above becomes:
order<
qty<12>
price<13.5>
brokerage<15%>
order-type<buy>
effective-time<market-close>
>
To avoid confusion between XML and SEXML, and to enable embedding one in the other, let's replace <> with something else, say {}, and while we are at it, put everything on one line:
order{qty{12}price{13.5}brokerage{15%}order-type{buy}effective-time{market-close}}
Perhaps we should separate the items with a comma, in case there are empty elements:
order{qty{12},price{13.5},brokerage{15%},order-type{buy},effective-time{market-close}}
There we go. If we had used parentheses, it would look too much like Lisp and scare people off. Of course, you would put everything on one line only for use in URLs and the like.

The above has the same informational content as the XML version, but is much more compact and simpler to parse.
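To back up the "simpler to parse" claim, here is a complete recursive-descent parser sketch in Java. All names here are mine, and it deliberately ignores escaping, empty elements and commas inside text, just to show how little machinery the format needs.

import java.util.ArrayList;
import java.util.List;

public class Sexml {
    public static class Node {
        public final String name;
        public final String text;                    // non-null only for leaves
        public final List<Node> children = new ArrayList<Node>();
        Node(String name, String text) { this.name = name; this.text = text; }
    }

    private final String src;
    private int pos;

    private Sexml(String src) { this.src = src; }

    public static Node parse(String src) { return new Sexml(src).element(); }

    private Node element() {
        skipWs();
        int start = pos;
        while (src.charAt(pos) != '{') pos++;        // element name runs up to '{'
        String name = src.substring(start, pos).trim();
        pos++;                                       // consume '{'
        int peek = pos;                              // leaf or container?
        while (src.charAt(peek) != '{' && src.charAt(peek) != '}') peek++;
        if (src.charAt(peek) == '}') {               // leaf: text up to '}'
            Node leaf = new Node(name, src.substring(pos, peek).trim());
            pos = peek + 1;
            return leaf;
        }
        Node node = new Node(name, null);            // container: children
        while (true) {                               // separated by commas
            node.children.add(element());
            skipWs();
            if (src.charAt(pos++) == '}') return node;
        }
    }

    private void skipWs() {
        while (Character.isWhitespace(src.charAt(pos))) pos++;
    }
}

With this, Sexml.parse("order{qty{12},price{13.5}}") hands you the whole tree, and root.children.get(0).text is "12". Try doing that in forty-odd lines with a DTD-validating XML parser.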

"But wait!" I hear screams. "What about schemas? What about the XML header? Character encoding? Entities?"

Nothing prevents you from using a schema. After all, it only describes structure and type information, not whether you use braces or angle brackets. Your schema just won't mention any attributes. As an exercise, try writing a schema for the above in SEXML.

As for the header, I don't see why you can't SEXML it up.

sexml {
 version{1.0}, encoding{utf-8},
 order{
     qty{12},price{13.5},brokerage{15%},order-type{buy},effective-time{market-close}
 }
}
Entities such as &lt; can live as is.

I believe CDATA is quite unnecessary for SEXML and in fact, seems to be rarely used in XML. If you believe otherwise, please enlighten me.

/*
Since the original reason the XML comment became the unwieldy <!-- --> was to maintain compatibility with HTML, we can break with that here and use C++ style comments. (Do you really need 4 chars to start a comment?)
*/

Let's see what it looks like with parentheses:

sexml (
 version(1.0), encoding(utf-8),
 order (
     qty(12),price(13.5),brokerage(15%),order-type(buy),effective-time(market-close)
 )
)

Guess the Lisp folks were right. Everything is a list of lists.