Monday, August 22, 2005

Gates' Law

"The speed of processors doubles every 18 months" -- Moore's law

"The speed of software halves every 18 months" -- Gates' law

I'm not sure whether Gates' Law was meant as a joke, but I did notice that Lotus 1-2-3 ran at about the same speed on an old 80286 as Excel runs on a modern PC.

Some of this is certainly because newer hardware makes more things possible, such as graphical interfaces and tooltips. Newer software also tries to be sexier by doing flashy things such as the talking paperclip. The tendency is for software to bloat to the point of saturation, i.e. until it performs at barely acceptable speeds.

But application programming seems to get worse as processors get faster. Even with more memory, faster disks and multiple CPUs, the average web application runs slower. Is it possible that faster hardware has made us worse programmers?

In the old days, to set a register to the value 1, even an average assembly programmer would do something like this:


xor ax, ax
inc ax

rather than:

mov ax, 1

The reason was simple. The programmer knew the latter alternative required access to memory: the immediate operand 1 makes the instruction encoding longer, so more bytes have to be fetched from memory, which is generally ten times slower than accessing registers. Modern programming languages hide details like this from the programmer, so he does not know the real cost of bad programming.

With garbage collection now part of many languages, we see many applications having to deal with memory problems. Why? Because the average programmer does not know the cost of object creation since he does not have to deal with cleanup.
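To make that cost concrete, here is a minimal Java sketch (a hypothetical micro-benchmark; the class and method names are invented for illustration) contrasting a loop that allocates a fresh object per iteration with one that reuses a single object:

```java
// Hypothetical micro-benchmark: per-iteration allocation vs. reuse.
// Timings are illustrative only; JIT warm-up and GC behaviour vary
// widely between JVMs, so this is a sketch, not a measurement.
public class AllocDemo {
    static final class Point {
        int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Creates a new Point on every iteration: each one is
    // heap-allocated and becomes garbage immediately.
    static long sumWithAllocation(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            Point p = new Point(i, i + 1);
            sum += p.x + p.y;
        }
        return sum;
    }

    // Reuses one Point: no per-iteration allocation, so the
    // garbage collector has nothing extra to do.
    static long sumWithReuse(int n) {
        long sum = 0;
        Point p = new Point(0, 0);
        for (int i = 0; i < n; i++) {
            p.x = i;
            p.y = i + 1;
            sum += p.x + p.y;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1000000;
        System.out.println(sumWithAllocation(n) == sumWithReuse(n)); // true
    }
}
```

Both loops compute the same sum; the difference is the million short-lived objects the first one hands to the garbage collector.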

For example in C++ you would do this:


// case 1: allocate on stack
int a = 1;
// other code
DirEnt de;

// case 2: allocate on heap
int *a = new int;
// other code
DirEnt *de = new DirEnt;   // must eventually be freed with delete

In the first case, the variables are allocated on the stack and disappear when execution leaves the scope. The compiler calculates the required space at compile time and generates a simple instruction such as:

sub sp, 2+248        ; int is 2 bytes, DirEnt is 248 bytes
In the second case, the variable is allocated on the heap, which is a comparatively slow operation. First, a check is made for sufficient free memory. The heap is then searched for contiguous free memory of the desired size and then initialized. The region is marked "in-use" and a pointer to the memory is returned. If there is no contiguous memory, the system may choose to compact/defragment memory.

Two characteristics of heap vs. stack stand out in the above example. First, the memory allocation routine must be called twice, since the compiler has no way of knowing whether the two allocations can be combined given that there is other code between them. Second, while each thread has its own stack, multiple threads compete for the same heap allocator (except for a small thread-local portion of the heap, which is not really a factor with large memory footprints).
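The second point can be sketched with a toy model (this is not how any real JVM or malloc implementation works; `allocate` here just bumps a counter under a global lock, standing in for a shared heap allocator that every thread must go through):

```java
// Toy model of a single shared heap allocator: every allocation
// must take the same lock, so concurrent threads serialize on it.
public class SharedAllocatorDemo {
    static final Object heapLock = new Object();
    static long bumpPointer = 0;

    // Stand-in for a heap allocation: bump a shared pointer
    // under a global lock, the way a naive allocator would.
    static long allocate(int size) {
        synchronized (heapLock) {
            long addr = bumpPointer;
            bumpPointer += size;
            return addr;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 100000; i++) {
                    allocate(16); // all four threads contend on heapLock here
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();
        // 4 threads * 100000 allocations * 16 bytes each
        System.out.println(bumpPointer); // 6400000
    }
}
```

Stack allocation has no equivalent of `heapLock`: each thread adjusts its own stack pointer without coordinating with anyone.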

In languages such as Java, you cannot really allocate anything except primitives on the stack, even if you don't intend to use an object beyond the immediate scope. This puts a big load on the memory management system: objects that properly belong on the stack now crowd the heap, straining the entire system.

"But how slow can memory allocation be? Isn't memory accessible in nanoseconds?"

In most operating systems, for practically all applications, the memory allocated is virtual memory from the process's address space rather than actual physical memory. Unless you have enough physical RAM to back every page you touch, you will take page faults, and page faults are serviced at the speed of storage, which is measured in milliseconds unless you have a solid state disk.

With fragmented memory and a large memory footprint, we end up dealing with multiple page faults, causing the system to thrash. This is especially apparent in multi-threaded programming where the locality of reference principle does not save you from multiple simultaneous page faults.

Unfortunately, even a programmer who is aware of these issues is helpless if the language itself is restrictive. What Java could have done is allocate objects on the stack when they are not passed to other methods:


// new translates to alloc on stack
DirEnt e = new DirEnt(".");
e.list();

// passed to another object, alloc on heap
DirEnt f = new DirEnt("..");
stack.add(f);

Theoretically, the Java compiler could follow the object through the call tree and see if anyone retains a reference to it, but that would render the compiler prohibitively slow, and perhaps incorrect if reflection is used.
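The distinction such a compiler would have to draw can be sketched as follows (hypothetical code; `Point`, `localOnly` and `escapes` are invented names for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class EscapeDemo {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // The Point never leaves this method: no reference escapes, so a
    // compiler could allocate it on the stack (or keep it in registers).
    static int localOnly(int a, int b) {
        Point p = new Point(a, b);
        return p.x + p.y;
    }

    static final List<Point> retained = new ArrayList<>();

    // The Point is stored in a field that outlives the call: the
    // reference escapes, so the object must live on the heap.
    static void escapes(int a, int b) {
        retained.add(new Point(a, b));
    }

    public static void main(String[] args) {
        System.out.println(localOnly(1, 2)); // 3
        escapes(1, 2);
        System.out.println(retained.size()); // 1
    }
}
```

In `localOnly` the object dies with the method call; in `escapes` it is reachable afterwards through `retained`, so the two cannot be treated the same way.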

It is therefore up to the programmer not to trust the much-touted features of any language (or any technology) without sufficient experience. Failing that, your programs end up as testimony to Gates' Law.

23 comments:

Anonymous said...

Sadly, so true.

Anonymous said...

There's actually a law called Wirth's law: Software gets slower faster than hardware gets faster.

Ref http://en.wikipedia.org/wiki/Wirth%27s_law

Anonymous said...

> xor ax,ax
> inc ax

> rather than:

> mov ax,1

> The reason was simple.
> The programmer knew the latter
> alternative required access to
> memory

Actually, even an average programmer should know that there is no memory access involved. Yes, if I remember correctly, the first variant is one byte shorter, but that is all.

OTOH, starting with the Pentium, the simple variant (mov ax, 1) will run faster in most cases, as the other one introduces dependencies and cannot be parallelized by a superscalar machine.

Maybe it is rather this "smart approach" that makes the software run slow? :)

Anonymous said...

It's all about bloat, man.

The absolute worst? Adobe. Man, CS2 apps are slow to start up and memory hogs; even switching to Photoshop from another program is slow, and this on a 3GHz P4 with 1GB RAM.

And indeed they stuffed it with all kinds of stuff; some of it even runs as a service...

Anonymous said...

On x86, the inc instruction potentially updates the flags register and therefore creates a dependency that will cause the hardware to stall.

Anonymous said...

Hmm, wasn't memory allocation in Java supposed to be dirt cheap? From what I've heard, it's just a pointer increment.

And the fragmentation issue is rather elegantly solved too. The most important technique in Java is not to keep your objects hanging around needlessly.

cadaver said...

Actually, on an 8086, for the instruction "mov ax, 1", the processor needs to fetch the word at CS:IP and move it to AX. So yes, there is a memory access. Irrespective of the implementation specifics, the point was that the programmer knew that "xor ax,ax; inc ax" was faster than "mov ax,1". Yes, with newer Pentiums it was the other way around, but by then hardly anyone was programming in assembler.

Anonymous said...

This doesn't only have to do with bad programming practices. It also has to do with "slower" compilers.

As an experiment, try compiling the same code with Visual C++ 4 or 6 versus a newer release of Visual Studio; I guarantee you that the code compiled with the old release will be faster AND much smaller in size.

I suspect this has to do with Microsoft's contracts with Intel...

Anonymous said...

This article makes no sense. Remember that "new" in Java isn't a direct mapping to malloc either; the amount of memory used by the JVM doesn't change much unless the program needs more than the initial size.

Bad coding can happen in memory-managed systems, but that is because people hang onto objects for too long.

Most objects that don't survive between garbage collections get destroyed by it, and that is virtually free. I think you need to read up on how a VM works before coming to such conclusions.

Anonymous said...

This article is unbelievably wrong.

Others have mentioned that the mov instruction is not quite as you've described.

I'm not sure what you mean about "defragmenting the heap". There's no way in C or C++ to move stuff around, since it's not possible to know if a given memory address holds a pointer to another one. (There isn't enough introspection for that.) The only thing remotely close to this is combining of contiguous empty regions within free().

You definitely don't understand virtual memory. There's a mapping layer. A large virtual memory space does not mean you get many page faults. You only get many page faults if you have a lot more allocated than you have physical memory and your accesses are non-local.

Java's heap allocation is also considerably different (faster) than C++ in the common case due to the memory layout allowed by generational garbage collection. The short version is that it can just keep a pointer to the unused region, go until it runs out, then run the garbage collector which compacts things. (The process I just described as impossible in C or C++.)

The theoretical Java stack optimization you've referred to does not render the compiler prohibitively slow or incorrect. It's called "escape analysis" and it's in beta now.

- Scott Lamb

Anonymous said...

True or false, it isn't the general case. Sure, from a developer's point of view, compiling a program takes much less time on a newer processor. Also, mathematics, computation and graphics rendering have seen insane speed improvements on newer processors.

Another issue:

When using Java, your aim is to reduce development effort, which is the cost of the insane performance you could acquire by using good assembly.

Anonymous said...

You all should aim for Motorola processors. Maybe since these have the stack allocation flipped upside-down compared to Intel's... you all may see something much different :)). Wouldn't that be amazing??? Have a good day and happy programming.


This was intended as a joke. I will now make room for smarter guys to share their thoughts.

Anonymous said...

Despite what people here in the comments say, the creation of objects in Java is *not* cheap. It is true that most JVMs have a compacting (actually a copying) garbage collector, which is in most cases also generational. That does mean allocating memory on the heap is not expensive; it comes down to incrementing a pointer. But that does not mean the creation of an object is cheap. The creation of an object is really expensive and probably the most expensive operation in most Java programs.

An object that has been created does take up space on your heap, and even though the deallocation of such an object when it is no longer used is in fact free with copying collectors (copying collectors copy the "live" objects, they never touch "dead" objects), it does mean that the more objects you create, the more often you're gonna trigger a garbage collection. And believe me, that is expensive.

A couple of days ago I reviewed a small bit of Java code. This Java code was a little parsing routine for a very simple text format. The problem was that it needed over 15 minutes to parse a 3MB file (I never let it finish though). In any case, for each parsed line 3 Scanner objects were created, which were disregarded almost immediately. This is bad. I replaced it with simple manual parsing of a String for each line, avoiding the creation of numerous objects. Loading time went down to 7 seconds. This is an extreme case, but it just shows that if you carelessly create objects, you'll pay a price.
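The rewrite described above might have looked roughly like this (a hypothetical reconstruction; the commenter's actual code and text format are not shown, so the class, methods and three-integers-per-line format are invented for illustration):

```java
import java.util.Scanner;

public class ParseDemo {
    // Original style: a fresh Scanner object per line,
    // discarded immediately after use.
    static int[] parseWithScanner(String line) {
        Scanner s = new Scanner(line); // new object on every call
        int[] out = new int[3];
        for (int i = 0; i < 3; i++) out[i] = s.nextInt();
        return out;
    }

    // Rewritten style: manual parsing of the String,
    // no intermediate parser objects created.
    static int[] parseManually(String line) {
        int[] out = new int[3];
        int field = 0, value = 0;
        boolean inNumber = false;
        for (int i = 0; i <= line.length(); i++) {
            char c = i < line.length() ? line.charAt(i) : ' ';
            if (c >= '0' && c <= '9') {
                value = value * 10 + (c - '0');
                inNumber = true;
            } else if (inNumber) {   // end of a number: store it
                out[field++] = value;
                value = 0;
                inNumber = false;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "10 20 30";
        System.out.println(java.util.Arrays.equals(
            parseWithScanner(line), parseManually(line))); // true
    }
}
```

Both methods produce the same result; applied over thousands of lines, the manual version avoids one short-lived Scanner object per call.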

@Scott Lamb: Technically, Java's memory allocation is not faster because of "generational" garbage collection, but because of "compacting" garbage collection, which would be either mark-slide or copying collection. Generational garbage collection is implemented on top of that to make collection in general more efficient and it can also reduce pause times. I assume you say that allocation is faster in Java in the sense that it just has a contiguous "free" region, while the C/C++ implementation will probably just use a free-list. I don't think that's gonna make such a big difference though.

Anonymous said...

Scott doesn't tell the half of it, and his anonymous interlocutor doesn't either. No analysis of the cost of allocation means anything without including the full-cycle cost -- that is, of freeing too, and checking if it's free, and of pushing other useful stuff out to swap to make room. If all memory were the same speed, and nothing got swapped out, and no other programs were competing for memory, that measurement might not be so hard.

In the Real World (TM), cache locality makes memory items not swapped out accessible in a cycle, or three, or a hundred, where items swapped out take millions or even billions of cycles to get to. Unfortunately, garbage collection, particularly when more than one GC program is running, makes it overwhelmingly more likely for items to be swapped out.

A result is that artificial benchmarks of GC performance are guaranteed to be drastically misleading. At the same time, malloc() allocation performance varies radically from one platform to the next; those that keep "freelists" that must be searched remain distressingly common.

All that can be said confidently, given all the uncertainties, is that (1) where performance matters, dynamic allocation (off-stack) loses; and (2) where dynamic allocation is unavoidable, getting reliably tolerable performance from garbage-collected allocation is way, way harder than from malloc()/free().

(A final note: assigning the result of C++ "new" into a raw pointer is almost always a mistake; use some sort of smart-pointer object. Any program with a "delete" statement is suspect.)

Anonymous said...

One other thing: on AMD processors, both code sequences -- like almost any other example of that ilk -- run at exactly the same speed.

Anonymous said...

To noocyte.

Anonymous said:
"As an experiment try to compile the same code in visual c++ 4 or 6 toward a new release visual studio whatever; i guarantee you that the code compiled with the old rel will be faster AND much smaller in size."

So it's not about the newer compilers being slower.
It's about newer compilers producing larger and slower executables than the old compilers did.

Anonymous said...

> xor ax,ax
> inc ax

versus

> mov ax,1

xor reg,reg takes : 2(286) 2(386) 1(486 and above) clocks

inc reg takes: 2(286) 2(386) 1(486 and above) clocks

mov reg,immed takes: 2(286) 2(386) 1(486 and above) clocks

so:
on 286 2+2 clocks VS 2 clocks
on 386 2+2 clocks VS 2 clocks
on 486 1+1 clocks VS 1 clock

Anonymous said...

This article is living proof of why Java is better than you think: people who think they know, but don't. This has all been discussed a long time ago:
http://www-128.ibm.com/developerworks/java/library/j-jtp01274.html

And his whole "Java could put it on the stack" thing.
From the article:
The JIT compiler can perform additional optimizations that can reduce the cost of object allocation to zero. Consider the code in Listing 2, where the getPosition() method creates a temporary object to hold the coordinates of a point, and the calling method uses the Point object briefly and then discards it. The JIT will likely inline the call to getPosition() and, using a technique called escape analysis, can recognize that no reference to the Point object leaves the doSomething() method. Knowing this, the JIT can then allocate the object on the stack instead of the heap or, even better, optimize the allocation away completely and simply hoist the fields of the Point into registers.

This is why we have bad code: people like the writer. People like the writer are why we need Java. To protect us against unmitigated arrogance.

Anonymous said...

> Scott doesn't tell the half of it

Indeed, I did not. My point was not to explain all the intricacies of Java garbage collection but to point out that it (along with everything else) is not as the original poster described.

For those who want to start learning more about generational GC and escape analysis, this article is a decent starting point. You will never really understand the performance of a system (especially one varying version to version as much as Java) without your own experiments (and there are some good tools for doing so), but it at least has some more depth of analysis.

- Scott Lamb

Anonymous said...

We have a lot more software today, because more people write it.
But today everybody seems to believe they are a programmer, so it's inevitable that a lot of software has poor performance/quality.
There is also a concept called "time to market" which unfortunately has priority over performance and sometimes over quality.
Just don't blame the languages for the mistakes programmers make.

Anonymous said...

@noocyte:

> And parsing strings in java could very easily end up being just as
> costly as creating new Scanner objects, since strings are immutable.. :)
> Hence you get a new string each time you change it.

True, but when you use Scanner, you also create Strings (in addition to the Scanner objects). I mean, either way (manual parsing and Scanner), you'll have the original Strings and the Strings you need. The difference was really the Scanner objects. (it was a very simple parsing routine)

> I can't speak for java, but in .net it really is cheap to allocate small objects.

"temporary" objects? Maybe the .Net compiler allocates those on the stack (or maybe even in regions). This is what escape analysis can give you. (which is gonna be in Mustang according to the IBM article)

@anonymous: you mean swap out of the cache, or swap out to disk? I don't think swap out to disk should be considered relevant anymore.

> No analysis of the cost of allocation means anything without including
> the full-cycle cost -- that is, of freeing too, and checking if it's
> free, and of pushing other useful stuff out to swap to make room.

This is certainly true, and something I was trying to point out.

> All that can be said confidently, given all the uncertainties, is that
> (1) where performance matters, dynamic allocation (off-stack) loses; and
> (2) where dynamic allocation is unavoidable, getting reliably tolerable
> performance from garbage-collected allocation is way, way harder than
> from malloc()/free().

I totally agree with (2). I don't agree with (1) though. It's not always true. If you consider the full-cycle cost, then a copying collector can be faster than malloc/free. It's very hard to make general comments about this. There is no one collector which gives the best performance in all cases, and you can obtain better performance with a collector by tuning it for a specific application.

cadaver said...

Thanks everyone for the feedback.
Scott, in case you are unaware, there ARE garbage collectors for C/C++ http://www.hpl.hp.co.uk/personal/Hans_Boehm/gc/ and some of them definitely do some defragmenting.

Also, you seem not to have followed my statement on virtual memory. There are some applications that use huge (64GB+) RAM systems with paging disabled to guarantee nanosecond access performance. Unless you have one of those, please don't believe that page faults never occur if you allocate less than the available RAM. If enough applications are running, you WILL encounter page faults. Go ahead, try it.

As for mov vs xor, only those old enough to have programmed a REAL application for an 8086 know what I'm talking about. We have spent hours changing applications to save that one byte (embedded/BIOS programmers) or that one millisecond of execution time (device drivers). The fact of the matter is that mov IS slower on an 8086. You can find this idiom all over Peter Norton's books (Guide to the IBM PC, etc.) from the early 80s.

Unknown said...

People want to delete the Wikipedia article "Gates' Law".

The article:
http://en.wikipedia.org/wiki/Gates%27_law
The article before it was messed with:
http://en.wikipedia.org/w/index.php?title=Gates%27_law&oldid=236567039

I would like there to be contributors either for or against. I don't think there is a balanced vote right now. You can vote here:
http://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletion/Gates%27_law

Thanks again.