Ace's Hardware

The Wayback Machine - https://web.archive.org/web/20030224080428/http://www.aceshardware.com:80/read.jsp?id=153

Binaries Vs Byte-Codes
By Chris Rijk
Friday, June 2, 2000 12:00 PM EDT

Runtime Vs Static Compilation - Performance Comparison

For some time I have been working on an article on Sun's upcoming MAJC processor architecture, which has some pretty significant enhancements to improve the speed of Java code. The MAJC is a high-performance multi-purpose embedded processor, and though a full C/C++ compiler and development toolkit will be provided, for many markets Sun hopes/expects that developers will use Java for most/all of the custom programming. Given that the embedded market is quite sensitive to price/performance, which is not considered one of Java's strengths, for my MAJC article (which I still haven't published, though it very nearly complete) I thought I might as well look at this subject a bit. However, it rather ballooned out into something much bigger than I initially anticipated, and I figured I might as well do this as a separate article.

An introduction to Java execution and static/dynamic compilation. This can be read before continuing, or later.

Objectives

There are several reasons why Java could be thought to be inherently slower than C for pure performance. For starters, Java is optimized at run-time, instead of compile-time, and Java's standard libraries also have to work the same way on all platforms, which requires an extra layer of software. Java is also relatively new - don't forget that it took some time before C compilers got good (they are still being improved) - and the same goes for C++. On top of this, even for low level programs which make little or no use of libraries, there are several things that get in the way - in Java, all array access has to be checked to make sure it's legal (within the bounds of the array), the standard string library is 16-bit Unicode instead of 8-bit ASCII, which means all strings are twice the size, all arrays have to be initialised, dynamic memory allocation is handled through a general garbage collector, no mainstream CPUs currently have any Java optimizations, and it's hard to take advantage of special multi-media type instructions. In theory, the garbage collection and dynamic/run-time optimizations can be a strength as well, and for the most part the above difficulties can be optimized away with a good enough optimizer - except strings being 16-bit.

This article is not supposed to be an attempt to fully quantify the speed difference between Java and C - it's too simple and incomplete for that, for one thing. This will possibly be the first article in a series (though given the horrible amounts of time it's taken just to do this, I almost hope not!) and I'll start off with "does Java require static compilation for good performance?" and "is dynamic/run-time optimization Java's performance Achilles heel?". For both of these, I'm limiting the test to algorithms which require little or no use of standard libraries - incidentally, for the SPEC2000 CPU performance tests it's required that at most %5 of time is spent outside the code being benchmarked. Also, the implementation of the code for the Java and C versions is pretty much as close as I could make it - neither version has much in the way of language specific optimizations. This is partly to try to achieve a more "apples to apples" comparison, though this does somewhat bias against C which has much more in the way of little tweaks and nudges that programmer can provide the compiler with - however, I have yet to find any language extensions or similar that make any difference - for example "inline" had no effect.

Testing Methodology

All the tests were run on my personal computer (Athlon 500 on Asus K7M motherboard with 128MB PC100 SDRAM), with Windows 98SE - Windows currently has the most advanced shipping JVMs (J2SE1.3 and HotSpot 2.0 isn't yet available on Solaris or Linux for once off reasons) as well as a decent choice of C compilers. For each program, several benchmarks are run with increasing complexity/size for the problem - for each of these, the same setup is run 5 times, the fastest and slowest scores are discarded, with the final result being the average of the middle three. For each test, a two quick pre-trials are performed by the software to estimate how many loops are required to make the test run for approximately 10 seconds, to have a consistent error on the timing. 10 seconds is also enough to give the JVMs enough time to fully optimize, though there is also an automatic self-adjust to correct for code that gets faster as you run it. I'm not particularly happy with Windows 98's timing accuracy, threading/multi-tasking, interrupt handing (not nearly as good as some tests I've done under Solaris, FreeBSD and Linux), but the results are generally very predictable and stable, and often there is little/no variance between the 5 runs.

For the results below, the performance is given as a relative score to the base test, which is the C program compiled with only simple optimization. For every test, the results are normalised such that 100 is the score for the base test. A result of 110 would be 10% faster than the base test, while a result of 90 would be 10% slower. Other tests were done with C compiled with much more aggressive optimization, though the same rules were used for all the C programs.

Key for the Graphs

base-C: This is the C version compiled with simple optimization under gcc (-O). This is the baseline, and all other results are relative to it. Cygwin was used to run GCC under Windows.
Version: gcc version 2.95.2 19991024 (release)
Optimization flags: -lm -O
max-C: This is the C version compiled with a stack load of optimizations, so slightly dodgy. I have done a slightly weaker version, which should be quite safe, and is only about 10% worse, but it makes the graph a little cluttered. Same version of gcc as above.
Optimization flags: ( -O3 -lm -s -static -fomit-frame-pointer -mpentiumpro -march=pentiumpro -malign-functions=4 -fu nroll-all-loops -fexpensive-optimizations -malign-double -fschedule-insns2 -mwide-multiply -finline-function s -fstrict-aliasing)
HotSpot: This is the Java program run with Sun's HotSpot Server 2.0.
Version: Java HotSpot(TM) Server VM (build 2.0fcs-E, mixed mode)
Optimization flags: none
IBM: This is IBM's latest Java version 1.1.8 compiler - version 3.5, from March 2000. It's the latest official production release, but I did find a 1.2.2 release (on IBM's AlphaWorks site in a rather obscure place), though I'm not sure if it'd make much difference for these benchmarks.
Version: JDK 1.1.8 IBM build n118p-20000322 (JIT enabled: ibmjitc V3.5-IBMJDK1.1-20000322)
Optimization flags: none
MS C: Microsoft Visual C compiler with heavy optimization.
Version: MSVC 6 Enterprise
Optimization flags: "Maximum Optimizations" setting: /nologo /ML /W3 /GX /O2 /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /Fp"Release/fib.pch" /YX /Fo"Release/" /Fd"Release/" /FD /c

Tested Algorithms

"Game of Life" with light load: This is fairly advanced implementation of J.H. Conway's simple cellular automaton - 2D grid of cells, which are either alive or dead, with deaths and births depending on the number of immediately surrounding alive cells. This version is all static, highly dependant on array access, and has a fixed-size board with just one "glider" (a repeating pattern with 5 alive cells which moves in a diagonal) so the bigger the board gets, the more dependant the algorithm is on simple array reading. This is the longest and most complex code in the test with 9 procedures.
"Game of Life" with heavy load: Identical code to the above, but this time the number of gliders scales with the size of the board - as many are squeezed in as possible. This makes the algorithm somewhat less dependant on array accesses.
"Game of Life" with 'infinite' board: This is a completely different implementation which uses a hash to simulate a sparse matrix, which actually has a maximum size of 2³²x2³², but it could be extended reasonably easily. The implementation requires dynamic memory allocation, and has a custom hashing algorithm (to keep the implementation identical) but is much smaller/simpler (2 procedures) than the static memory version. For this test the 'load' is a square block of gliders, tightly packed, the length of each side being a multiple of 5. This implementation is much slower than the above version, by the way.
Fibonacci series (Page 2): The Fibonacci series is a very simple function that calls itself recursively twice on every iteration (except the terminating one) and computation time increases exponentially. Not much for a compiler here to do (unless it can detect the static nature of the results) except inline the recursive calls. I came up with this when trying to think of an algorithm with no array access. Results are kinda interesting but have little reflection on almost all real world programs.
Simple FFT implementation (Page 2): There's lots of Fast Fourier Transform algorithms out there, but this is a fairly simple one with just 4 procedures though has nested loops. While it uses double-precision floating point, it is still rather affected by array access.

C source \| Java source

HotSpot and Microsoft's C compiler start off pretty well, though drop off at the bigger array sizes. For HotSpot I would imagine it's because it hasn't been optimized for array access (bounds checking) yet, and the latter tests are rather dominated by this. IBM's JDK is well optimized for bounds checking it seems. Not quite sure why Microsoft's C compiler drops off though. GCC just edges out IBM's JVM here.

C source \| Java source
(Same as above, just run with the first argument being '-heavy')

HotSpot's better optimizations help it here as with heavy load, more real calculations are done, though there is still much array accessing. GCC wins this one again though (and by some margin too), and MSVC doesn't do quite so poorly.

C source \| Java source

Here it appears obvious that GCC's default dynamic memory allocation is limiting it quite severely. Though array access are still significant, it looks like HotSpot's memory allocation is significantly better than for IBM's JVM. The two JVMs certainly beat GCC here, though MSVC's good showing prevent a clear Java victory.

HotSpot suddenly slumps at 75/100, which turns out to be because of the default sizes of the memory pools. I found this out based on the unsupported options mentioned on this page which can be used to change the defaults. If these are made bigger, the HotSpot scores do not dip at all. Also, HotSpot actually does correct for this automatically, but isn't given enough time with the current methodology.

Fibonacci, FFT, and Conclusion

4 ms

Jan	FEB	Mar
	24
2002	2003	2004