Monday, August 22, 2005

Gates' Law

"The speed of processors doubles every 18 months" -- Moore's law

"The speed of software halves every 18 months" -- Gates' law

I'm not sure whether Gates' Law was meant as a joke, but I did notice that Lotus 1-2-3 ran at about the same speed on an old 80286 as Excel runs on your modern PC.

Some of it is certainly because newer hardware makes more things possible, such as the graphical interface and tooltips. Newer software does try to be sexier by doing flashy things such as the talking paperclip. This causes a tendency for the software to bloat to the point of saturation, i.e. until it performs at barely acceptable speeds.

But application programming seems to get worse with faster processors. Even with more memory, faster disks and multiple CPUs, your average web application runs slower. Is it possible that the faster hardware has made us worse programmers?

In the old days, to set a register to the value 1, even an average assembly programmer would do something like this:


xor ax,ax
inc ax

rather than:

mov ax,1

The reason was simple. The programmer knew that the immediate operand in the latter alternative had to be fetched from memory along with the instruction, and memory is generally ten times slower than registers. Modern programming languages hide details like this from the programmer, so he does not know the real cost of bad programming.

With garbage collection now part of many languages, we see many applications having to deal with memory problems. Why? Because the average programmer does not know the cost of object creation since he does not have to deal with cleanup.

For example in C++ you would do this:


// case 1: allocate on stack
int a = 1;
// other code
DirEnt de;

// case 2: allocate on heap
int *a = new int;
// other code
DirEnt *de = new DirEnt;

In the first case, the variables are allocated on the stack and disappear when execution leaves the scope. The compiler calculates the required space at compile time and generates a simple instruction such as:

sub sp, 2+248        ; int is 2 bytes, DirEnt is 248 bytes
In the second case, the variable is allocated on the heap, which is a comparatively slow operation. First, a check is made for sufficient free memory. The heap is then searched for contiguous free memory of the desired size and then initialized. The region is marked "in-use" and a pointer to the memory is returned. If there is no contiguous memory, the system may choose to compact/defragment memory.
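To make those steps concrete, here is a toy first-fit allocator sketched in Java over a plain int array. This is purely an illustration of the search/split/mark bookkeeping; the ToyHeap layout is invented and no real runtime works this way:

class ToyHeap {
    private final int[] mem;

    ToyHeap(int size) {
        mem = new int[size];
        mem[0] = size;                       // one big free block: [size, usedFlag, payload...]
        mem[1] = 0;                          // 0 = free, 1 = in use
    }

    // returns an index into mem (the "pointer"), or -1 if no hole is big enough
    int alloc(int want) {
        int p = 0;
        while (p < mem.length) {             // walk the block list
            int size = mem[p];
            boolean used = mem[p + 1] == 1;
            if (!used && size >= want + 2) { // found a big enough contiguous free block
                int rest = size - (want + 2);
                if (rest >= 2) {             // split off the remainder as a new free block
                    mem[p] = want + 2;
                    mem[p + want + 2] = rest;
                    mem[p + want + 3] = 0;
                }
                mem[p + 1] = 1;              // mark it in-use
                return p + 2;                // skip the 2-word header
            }
            p += size;                       // try the next block
        }
        return -1;                           // no contiguous space; a real heap might compact here
    }

    void free(int payload) {
        mem[payload - 1] = 0;                // mark the block free (no coalescing in this toy)
    }

    public static void main(String[] args) {
        ToyHeap heap = new ToyHeap(64);
        int a = heap.alloc(10);              // first allocation
        int b = heap.alloc(10);              // second allocation lands after the first
        System.out.println(a + " " + b);     // prints 2 14
        heap.free(a);
    }
}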

Two things distinguish the heap from the stack in the C++ example above. First, the memory allocation routine has to be called twice, since the compiler has no way of knowing whether the two allocations can be combined when other code sits between them. Second, while each thread has its own stack, all threads compete for the same heap allocator (apart from a small thread-local portion of the heap, which is not really a factor in large memory footprints).

In languages such as Java you cannot really allocate anything except primitives on the stack even if you don't intend to use them beyond the immediate scope. This represents a big load on the memory management system. Objects that properly belong on the stack now crowd the heap, putting a strain on the entire system.
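As a rough Java illustration (the AllocDemo and Point classes here are made up), the two methods below compute the same sums, but the first creates a throwaway heap object on every iteration, work that a stack allocation would have made essentially free:

class AllocDemo {
    static final class Point { int x, y; }

    // allocates a new Point per iteration; every one becomes garbage immediately
    static long sumAllocating(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            Point p = new Point();
            p.x = i;
            p.y = i;
            sum += p.x + p.y;
        }
        return sum;
    }

    // one allocation, reused; this is roughly what a stack-allocated local would cost
    static long sumReusing(int n) {
        long sum = 0;
        Point p = new Point();
        for (int i = 0; i < n; i++) {
            p.x = i;
            p.y = i;
            sum += p.x + p.y;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumAllocating(1000000) == sumReusing(1000000)); // true
    }
}

Every one of those throwaway objects also has to be collected later, which is exactly the cost the language hides.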

"But how slow can memory allocation be? Isn't memory accessible in nanoseconds?"

In most operating systems, for practically all applications, the memory you allocate is virtual memory carved out of the address space rather than actual physical memory. Unless you have huge amounts of physical RAM, some of those pages will not be resident, and touching them causes page faults. Page faults are serviced at the speed of storage, which is measured in milliseconds unless you have a solid state disk.

With fragmented memory and a large memory footprint, we end up dealing with multiple page faults, causing the system to thrash. This is especially apparent in multi-threaded programming where the locality of reference principle does not save you from multiple simultaneous page faults.

Unfortunately, even a programmer who is aware of these issues is helpless if the language itself is restrictive. What Java could have done is allocate objects on the stack when they are not passed to other methods.


// new translates to alloc on stack
DirEnt e = new DirEnt(".");
e.list();

// passed to other class, alloc on heap
DirEnt e = new DirEnt("..");
stack.add(e);
Theoretically, the Java compiler could follow the object through the call tree and see whether anyone retains a reference to it, but that would make the compiler prohibitively slow, and perhaps incorrect if reflection is used.

It is therefore up to the programmer not to trust the much-touted features of any language (or any technology) without sufficient experience. Otherwise, your programs end up as testimony for Gates' Law.

Monday, August 08, 2005

The problem with DTDs and schemas

Many would agree with me that XML is too bulky for its intended purposes. But would you say the same about HTML? Probably not.

And why should that be? Isn't XML a glorified, genericized HTML? Despite what the historians say about both being children of SGML, the fact of the matter is that XML only came into being after the spectacular success of HTML. If you remember the original marketing hype, it was all about how web documents would be XML in the future and how they would be rendered or used differently by other consumers.

One of the important factors for the failure of XML is the reliance on DTDs and schemas. HTML succeeded simply because there was no predefined format for any given web page. Browsers ignored what they did not understand, and did the best with what they could. HTML was a language that browsers and servers spoke. There was a syntax, but no schema. The schema came later, long after HTML was refined through multi-year, multi-user testing.

Your typical enterprise application does not have that luxury.

Consider this snippet of XML:


<orders>
 <date>Dec-24</date>
 <order>
  <customer>Joe Schlubbs</customer>
  <customer-address>10 drowning st</customer-address>
  <customer-city>London</customer-city>
  <order-line>
     <quantity>100</quantity>
     <item-number>J-350</item-number>
     <item-name>extra large onions</item-name>
     <price-per-item>12.35</price-per-item>
     <!-- other fields -->
  </order-line>
 </order>
 <order>
   <customer>Joe Schlubbs</customer>
   <customer-address>10 drowning st</customer-address>
   <customer-city>London</customer-city>
   <order-line>
      <quantity>100</quantity>
      <item-number>J-007</item-number>
      <item-name>extra large bullets</item-name>
      <price-per-item>1.99</price-per-item>
   </order-line>
 </order>
</orders>

Let's assume we are writing an application that needs a list of all customers who spent more than a thousand bucks in the last month, so we can mail them some coupons.

An SQL programmer would simply write SELECT customer, SUM(price*qty) FROM orders GROUP BY customer HAVING SUM(price*qty) > 1000. The typical XML programmer, on the other hand, has to fetch all orders for the last month and skim through large amounts of text to produce the same result. Depending on the number of tags, close to 90 percent of the input will be useless to the application.
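For comparison, the XML programmer's version might look something like this sketch, which uses the standard DOM API to walk every element just to add up two numbers per order (the file name and class name are made up):

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.util.HashMap;
import java.util.Map;

public class BigSpenders {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse("orders.xml");
        Map<String, Double> totals = new HashMap<String, Double>();
        NodeList orders = doc.getElementsByTagName("order");
        for (int i = 0; i < orders.getLength(); i++) {
            Element order = (Element) orders.item(i);
            String customer = text(order, "customer");
            NodeList lines = order.getElementsByTagName("order-line");
            for (int j = 0; j < lines.getLength(); j++) {
                Element line = (Element) lines.item(j);
                double qty = Double.parseDouble(text(line, "quantity"));
                double price = Double.parseDouble(text(line, "price-per-item"));
                Double prev = totals.get(customer);
                totals.put(customer, (prev == null ? 0 : prev) + qty * price);
            }
        }
        for (Map.Entry<String, Double> e : totals.entrySet()) {
            if (e.getValue() > 1000) {
                System.out.println(e.getKey());   // send this customer a coupon
            }
        }
    }

    // text content of the first descendant element with the given tag name
    static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }
}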

Could the server simply omit the unnecessary data? Probably not, because the tags are not optional. That's because when they designed the XML, they could not foresee that business would be so good that they would send out coupons.

It is simply impossible to foresee all possible future needs, so to really future-proof your schema you would need to make nearly every tag optional. This pretty much voids the purpose of a schema.

A schema where every tag is optional could still be useful, to describe what tags go where and what they mean -- sort of like HTML.

Or you could just deal with the bulk.

Thursday, August 04, 2005

Semaphores in assembler and java

How do you implement a semaphore in assembler, taking into account the fact that you may have multiple CPUs, with each CPU possibly executing multiple threads of code at any time? We use the xchg instruction, which performs an atomic exchange of its arguments.

;acquire semaphore
    xor     ax, ax      ; set AX to zero
L1:
    xchg    ax, sema    ; atomic test and set. sema = 0, ax = sema
    or      ax, ax      ; is the value we obtained from sema 0?
    jz      L1          ; yes, wait for someone to release semaphore
    dec     ax          ; no, we got the semaphore.
    xchg    ax, sema    ; decrease by one and store

;process stuff

; release semaphore
    xchg    ax, sema    ; sema value may have been changed by other threads
    inc     ax          ; bump to notify we have released it
    xchg    ax, sema    ; atomic store
This is why many processors provide an atomic exchange instruction; an atomic test-and-set is indispensable for a variety of systems programming tasks, semaphores being one of them. On systems with multiple processors, you probably want to lock the bus before executing the xchg (strictly, x86 asserts the lock signal implicitly for xchg with a memory operand, but the explicit prefix makes the intent clear). The code becomes:

    lock xchg  ax, sema
In case you have forgotten, the lock prefix locks the memory bus for the duration of the following instruction, which ensures that another processor does not read or write sema while we are writing to it. The sequence of instructions

L1:
    xchg    ax, sema
    or      ax, ax
    jz      L1
forms an anti-pattern called "busy waiting" and burns up CPU resources while waiting to acquire the semaphore. In real-life systems programming it is often replaced by a spin lock (which is a "design pattern"): a spin lock busy-waits for a few instruction cycles and then goes to sleep. The idea is that context switches are expensive, and for many operations semaphores are held so briefly that, most of the time, the waiting thread acquires the semaphore without the added cost of a context switch.

;acquire semaphore
L0:
    mov     cx, 0800h    ; spin the test loop so many times ...
    xor     ax, ax       ; ... choose 1/2 cost of context switch and tweak
L1:
    xchg    ax, sema
    or      ax, ax
    jnz     L2           ; got the semaphore, go to update
    loop    L1           ; repeat till cx is zero
    call    monitor_wait ; context switch
    jmp     L0           ; need to retest after notify
L2:
    dec     ax
    xchg    ax, sema

As you can see, this implementation is very efficient and wastes very few cycles. A semaphore variable needs to be initialized with the number of threads that may hold it simultaneously. For a mutex (equivalent to Java's synchronized) we set the value to 1. The Linux kernel approaches the problem in a similar manner (albeit in C). The Java equivalent looks like this:

synchronized(lock) // sema is an invisible variable inside object lock
{                  // sema acquired if you get here
    // process
}                  // release sema
Given that mutexes are nothing but a special case of semaphores, which have been used in systems programming since antiquity, and given the relative simplicity of the machine code needed to implement them, it's surprising that the monitorenter and monitorexit instructions were prohibitively expensive in early JVM implementations.
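For counting semaphores, as opposed to the mutex case above, Java 5's java.util.concurrent provides the equivalent directly. A minimal sketch, with an invented Pool class, where the permit count plays the same role as the initial value of sema:

import java.util.concurrent.Semaphore;

public class Pool {
    // allow at most three threads into the critical section at once
    private static final Semaphore sema = new Semaphore(3);

    public static void process() throws InterruptedException {
        sema.acquire();          // blocks, much like the spin-then-wait loop above
        try {
            // process stuff
        } finally {
            sema.release();      // always release, even if processing throws
        }
    }
}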

Wednesday, August 03, 2005

SEXML

XML has the unfortunate side effect of being bulky. While bulk is good for the large intestine, it's pretty much useless in computing. Which is why I devised SEXML. That's Simple Elemental XML. This is how it was derived: First, we get rid of attributes. What are attributes good for? They can easily be replaced by elements:
<order qty="12" price="13.5" brokerage="15%">
<order-type>buy</order-type>
<effective-time>market-close</effective-time>
</order>
can be rewritten as:
<order>
<qty>12</qty>
<price>13.5</price>
<brokerage>15%</brokerage>
<order-type>buy</order-type>
<effective-time>market-close</effective-time>
</order>
Next, we get rid of the irritating double tags. We accomplish this by turning <qty>12</qty> into qty<12>. The XML above becomes:
order<
qty<12>
price<13.5>
brokerage<15%>
order-type<buy>
effective-time<market-close>
>
To avoid confusion between XML and SEXML, and to enable embedding one in the other, let's replace <> with something else, say {}, and while we are at it, put everything on just one line:
order{qty{12}price{13.5}brokerage{15%}order-type{buy}effective-time{market-close}}
Perhaps we should separate the items with a comma, in case there are empty elements:
order{qty{12},price{13.5},brokerage{15%},order-type{buy},effective-time{market-close}}
There we go. If we had used parentheses, it would look too much like Lisp and scare people off. Of course, you would put everything on one line only for use in URLs and the like.

The above has the same informational content as the XML above, but is much more compact and simpler to parse.
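To back up the "simpler to parse" claim, here is a hedged sketch of a recursive-descent parser for the brace form, in Java. The Sexml and Element names are invented and error handling is omitted:

import java.util.ArrayList;
import java.util.List;

public class Sexml {
    public static class Element {
        final String name;
        final String text;              // non-null for leaf elements
        final List<Element> children;   // non-null for nested elements
        Element(String name, String text, List<Element> children) {
            this.name = name; this.text = text; this.children = children;
        }
        public String toString() {
            if (text != null) return name + "{" + text + "}";
            StringBuilder sb = new StringBuilder(name).append('{');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(children.get(i));
            }
            return sb.append('}').toString();
        }
    }

    private final String in;
    private int pos;

    public Sexml(String in) { this.in = in; }

    // element := name '{' ( text | element (',' element)* ) '}'
    public Element parse() {
        int open = in.indexOf('{', pos);
        String name = in.substring(pos, open).trim();
        pos = open + 1;
        int nextOpen = in.indexOf('{', pos);
        int nextClose = in.indexOf('}', pos);
        if (nextOpen == -1 || nextClose < nextOpen) {        // leaf element
            String text = in.substring(pos, nextClose);
            pos = nextClose + 1;
            return new Element(name, text, null);
        }
        List<Element> children = new ArrayList<Element>();   // nested elements
        while (true) {
            children.add(parse());
            while (Character.isWhitespace(in.charAt(pos))) pos++;
            if (in.charAt(pos) == ',') { pos++; continue; }
            pos++;                                           // consume the closing '}'
            break;
        }
        return new Element(name, null, children);
    }

    public static void main(String[] args) {
        String s = "order{qty{12},price{13.5},brokerage{15%},"
                 + "order-type{buy},effective-time{market-close}}";
        System.out.println(new Sexml(s).parse());
    }
}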

"But wait!" I hear screams. "What about schemas? What about the XML header? Character encoding? Entities?"

Nothing prevents you from using a schema. After all, it only describes structure and type information, not whether you use braces or angle brackets. Your schema just won't mention any attributes. As an exercise, try writing a schema for the above in SEXML.

As for the header, I don't see why you can't SEXML it up.

sexml {
 version{1.0}, encoding{utf-8},
 order{
     qty{12},price{13.5},brokerage{15%},order-type{buy},effective-time{market-close}
 }
}
Entities such as &lt; can live as is.

I believe CDATA is quite unnecessary for SEXML and in fact, seems to be rarely used in XML. If you believe otherwise, please enlighten me.

/*
Since the original reason the comment in XML became the unwieldy <!-- --> was to maintain compatibility with HTML, we can break with that here and use C++ style comments. (Do you really need 4 chars to start a comment?)
*/

Let's see what it looks like with parentheses:

sexml (
 version(1.0), encoding(utf-8),
 order (
     qty(12),price(13.5),brokerage(15%),order-type(buy),effective-time(market-close)
 )
)

Guess the Lisp folks were right. Everything is a list of lists.

Tuesday, February 01, 2005

Assembling a home theatre PC

I've wanted to build an AMD64 machine for some time, and I finally got started with the following parts. All items were located on Pricewatch. For some items, I did not automatically choose the cheapest vendor if that vendor had less-than-best ratings.

The general plan is to have a media box such as this.

Item                          Price (w/ shipping)   Qty  Amount
AMD Athlon 64 939 3500+       $262.70               1    $262.70
Asus A8N SLI skt939           $194.00               1    $194.00
Ahanix D4                     $229.58               1    $229.58
Zalman CNPS7000B-CU Cooler    $52.00                1    $52.00
Samsung PC3200 512MB          $82.00                2    $164.00
Hitachi 250GB SATA            $120.00               2    $240.00
Hauppauge WinPVR-250          $128.00               1    $128.00
PCI Express eVGA 6600 256MB   $129.00               1    $129.00
Arctic Silver 5               $6.00                 1    $6.00

In addition, I bought some Everclear 95% grain alcohol from a liquor store for cleaning the computer parts. (Never use anything other than pure grain alcohol!)

In general, stores where shipping is free tend to ship via snail mail, so when looking at the price, factor that in.

Most items were packed perfectly. I was impressed with the packing of the hard drives and the Ahanix case. Other items were packaged adequately. The only exception was the Zalman cooler, which had some copper cooling blades bent because the box was not sealed. Copper is a soft metal and bends easily. The fan also looked very slightly scuffed. I carefully straightened out the fins, verified that the fan ran silently, and decided not to ask for an exchange.

First impressions:

The Ahanix D4 case looks like a stereo component. The front metal face is about a quarter inch thick and feels luxurious. A hinged door that covers the 5.25 inch bays is held shut by magnets and conveys quality. The entire case is aluminum and very well constructed. The included power supply is silent, and the case is designed to stay cool. I won't review the case fully here since it's been reviewed elsewhere, but suffice it to say this is the best case you can get that holds a full ATX board, and it is worth every penny. The D5 is cooler looking, but does not seem to have front USB/FireWire ports, and its infrared remote control sensor plugs into a USB port. I was not sure how it would work with lirc, so I did not buy it. The only negative about this box is that smudges show easily when you touch it.

The AMD Athlon 64 CPU looked like it had an extremely thin layer of gunk on it (a returned item?), but I washed it with alcohol and it evaporated.

The Zalman CNPS7700-Cu cooler is a thing of beauty. It looks like a copper flower. The fan is large and silent. It comes with a fan control device, some thermal grease and various brackets. I cleaned its bottom with alcohol too.

The Asus A8N SLI Deluxe motherboard truly deserves the Deluxe tag. It comes with every imaginable cable and fixture, including a SATA fixture that lets you plug SATA drives in at the back. The chipset fan is reasonably quiet. The optical SPDIF out makes this board attractive for home theater projects. The 8 SATA RAID interfaces don't hurt either. The most important feature, however, is the Nvidia chipset, which has good Linux support from the vendor.

The Arctic silver 5 compound comes in a tiny syringe with a rubber dome top.

The Hitachi SATA hard drives run great but are not as quiet as their IBM counterparts (from before IBM sold its hard drive division to Hitachi). It could be because I have two SATA drives in a RAID config (more on that later) and they make the seek sound simultaneously. I need to investigate silencing the drives a bit more.

The eVGA GeForce 6600 with 256MB seems to be the fastest card at this price. This card comes with a shrouded fan. If you are a serious gamer (I'm not), it's a good idea to go with the 6800 GT and get the Arctic cooler for it.

The RAM looked like ... RAM. The DIMMs had copper heat spreaders on them and the word RAMSES printed. I complained to the vendor that he had sent me the Ramses brand when I had requested Micron. He told me that the heat spreader was Ramses and the RAM was actually Samsung, which was allegedly better than Micron. I decided to keep the RAM if it ran stable. Fortunately, it did.

I had an old Plextor 712A DVD Writer, which I plugged into the system.

Hardware installation:

First, I removed the motherboard backplate provided by Ahanix; it simply did not match my motherboard. I replaced it with the one included by ASUS. The motherboard comes with a bracket and baseplate for the stock CPU cooler; I removed these and attached the baseplate that came with the Zalman cooler (using the nipples provided). I placed the motherboard on a flat surface, lifted the ZIF socket's lever, and aligned the CPU's golden triangle with the triangular mark on the ZIF socket. It fell into place easily. Placing the CPU on the ZIF socket and pushing the lever back to the horizontal position is about the easiest thing to do.

Next, I took the Arctic Silver 5 syringe, punctured the top with a pin, and squeezed about one and a half rice-grain lengths onto the CPU. I then gently lowered the Zalman cooler and aligned the screw holes with the nipples. Once the cooler sits on the CPU, suction prevents you from easily picking it up, but you can turn it slightly. I then screwed the cooler down tight, alternating between the screws.

I then lifted the motherboard with the CPU and cooler into the case and screwed it to the base. (Some folks prefer to attach the board first and then install the CPU and heatsink, but given that Arctic Silver is slightly capacitive, I needed to be absolutely sure that not a single particle would fall onto the motherboard.)

Next, I installed the two Samsung 512MB PC3200 DIMMs with their copper heat spreaders into the appropriate banks (A1 and B1) to make use of the dual-channel feature.

Next, I installed the Hitachi SATA hard drives. You only get access to one side of the bay, so I screwed them in.

Next, I installed the PCI-Express eVGA card into the blue PCIe slot (as recommended if using a single slot)

Last, I connected all the case connections, such as the power and hard drive LEDs.

Linux installation: Most folks would do well to just install the KnoppMyth distribution, but I'm a Gentoo fanatic, so I downloaded the AMD64 ISO and installed Gentoo from stage3. Standard procedure, with some hints for RAID.

The only things I need to tell you are: for the sound card, I chose snd-intel8x0 (ALSA driver), and for the network I chose forcedeth. Although I got nvidia's nvsound and nvnet to work after applying patches, I saw no advantage over the standard kernel drivers. For the video, I emerged nvidia-kernel and added nvidia to modules.autoload. I managed to set up SPDIF using instructions here. I can see that my SPDIF is properly set up:

# cat /proc/asound/pcm
00-00: Intel ICH : NVidia CK804 : playback 1 : capture 1
00-01: Intel ICH - MIC ADC : NVidia CK804 - MIC ADC : capture 1
00-02: Intel ICH - IEC958 : NVidia CK804 - IEC958 : playback 1
So I should be able to play it like this:
# mplayer -ao alsa:device=hw=0.2 mozart.mp3
But either my receiver has no support for it or I need to do something like this or this. So I went back to analog sound output.

The video was tricky. I finally managed to get the right xorg.conf file, but I had two serious problems: first, the cursor was a transparent square with a cluster of black dots; second, icons and text would disappear when the mouse moved over them (or sometimes just by themselves). Fortunately, I found a patch on the internet for xorg and recompiled with it. All I did was emerge it; when it said 'Source unpacked' I pressed Ctrl+Z, applied the patch, and resumed the job. I really should do something like this (go to bottom), but I'm guessing they will have the bugfix in the next version anyway.

It's important to autoload the nvidia module, or you get a "failure to load nvidia" error when you start X.

I struggled to set up the RAID until I realized that the onboard RAID controllers are nothing more than a place in the BIOS to store your settings. Linux still sees the physical hard drives. Grub probes the BIOS, so it sees the real drives; you cannot install GRUB with fakeraid. The best option was to use mdadm and Linux software RAID, which is just as fast if not faster. The mdadm utility was on the installation CD, so I used it to set up the RAID. The trick, of course, is to set your boot partition as RAID-1 so that grub will load off it. Install grub in both partitions as shown here. I could probably have used the hardware RAID using the instructions here.

I got the Vacuum Fluorescent Display (part of the D4 case) working using these instructions. The VFD displays y-with-umlaut when it tries to display a filled square, for which I found a patch here.

Wealth of info here

Noise

In general, I can hear the constant whoosh of the fans at the back (I've disconnected one). The motherboard chipset fan was the noisiest; I noticed it was running at 8000+ RPM. I took the fan speed regulator that came with the CPU cooler, attached it to the chipset fan, and turned it fully down. The fan now runs at 4000+ RPM but makes almost negligible noise. The hard disks are silent unless there is activity, which I figure is the sound echoing off the case. The activity is occasional even when compiling the kernel, thanks to the 8MB caches.

Infrared

I purchased the parts from Radio Shack and soldered my own infrared receiver. I attached the IR sensor on a long wire so I could squeeze it into the space provided by the D4 case (just above the HDD light, left of the VFD).

Heat

In general, the CPU idles at 35C with the case open and 43C with the case closed. Under heavy load, it goes up to 55C. The motherboard and case ambient temperatures are generally about 5C and 15C below the CPU temperature, respectively.

Conclusion

All in all, I'm happy with the power and performance of this machine. I aim to install the Arctic cooler on my VGA card, and I also need to figure out how to reduce the noise of the fans and the hard disks.

Monday, January 24, 2005

Historical: Typewriter

When I was a kid, I wrote an assembly program that would play with the VGA fonts (I wrote a whole bunch of programs that messed with the VGA fonts). One that comes to mind shifts a character's glyph upward by a couple of scan lines, making it look like it was typed on a typewriter with an errant key.


SCANLINES	equ	16
SHIFT		equ	 2

.model small
.code
	org	100h
main	proc
	mov	ax,1130h	; read char vector
	mov	bh,06		; 8*16 vga/mcga
	int	10h

	mov	ah,0		; read char
	int	16h
	cbw
	push	ax
	
		
	mov	cl,4		; map of offending char
	shl	ax,cl
	add	bp,ax
	add	bp,SHIFT	
	
	mov	cx,SCANLINES-SHIFT	; save char bitmap
	push	ds		; swap DS and ES so that
	push	es		; DS:SI -> BIOS font table and
	pop	ds		; ES:DI -> our buffer
	pop	es
	mov	si,bp		; ds:si->original map
	lea	di,charmap	; es:di->our buf
	cld
	rep	movsb	
			
	mov	ax,1100h	; set char
	mov	cx,1
	pop	dx
	mov	bh,10h		; 8*16 vga/mcga
	lea	bp,charmap
	int	10h

	mov	ax,4c00h
	int	21h
main	endp

charmap	db SCANLINES dup(0)	; init to 0 as last lines must be blank
	end	main

History: Executing a program in DOS

A common trick in the old days was to execute an application from within another. Here is a code example of how that was normally done:

.model tiny
.code

ExecBlk	Struc
	Psp	dw	?	; environment segment (0 = inherit ours)
	Cmdline	dw	?	; offset of command line
	CmdSeg	dw	?	; segment of command line
	FCB1	dw	?	; offset of first FCB
	FCB1seg	dw	?	; segment of first FCB
	FCB2	dw	?	; offset of second FCB
	FCB2seg	dw	?	; segment of second FCB
ExecBlk	ends

	org	100h

main	proc
	lea	bx,pspblk
	mov	sp,bx
	call	resize	

	mov	execb.psp,0		; copy our environment
	mov	execb.cmdline,80h	; pass command line as is
	mov	execb.FCB1,5ch
	mov	execb.FCB2,6ch
	mov	execb.cmdseg,cs
	mov	execb.FCB1seg,cs
	mov	execb.FCB2seg,cs
	lea	dx,execf
	push	ds
	pop	es
	lea	bx,execb
	mov	ax,4b00h
	int	21h
merr:
	mov	ax,4c00h
	int	21h
main	endp

resize	proc
	add	bx,15		; round up to a paragraph boundary
	mov	cl,4
	shr	bx,cl		; convert bytes to paragraphs
	mov	ah,4ah		; resize memory block (ES = our PSP, BX = new size)
	int	21h
	jc	merr
	ret
resize	endp


execf	db	"PAT.COM",0		; ASCIIZ pathname of file to be execd
execb	ExecBlk	<>
_stack	dw	1000h dup(?)

pspblk	label	byte

	end	main