Binaries Vs Byte-Codes
By
Chris Rijk
Friday, June 2, 2000 12:00 PM EDT
|
Runtime Vs Static Compilation - Performance Comparison
For some time I have been working on an article on Sun's upcoming MAJC processor
architecture, which has some pretty significant enhancements to
improve the speed of Java code. The MAJC is a high-performance
multi-purpose embedded processor, and though a full C/C++ compiler and
development toolkit will be provided, for many markets Sun
hopes/expects that developers will use Java for most/all of the custom
programming. Given that the embedded market is quite sensitive to
price/performance, which is not considered one of Java's strengths,
for my MAJC article (which I still haven't published, though it very
nearly complete) I thought I might as well look at this subject a bit.
However, it rather ballooned out into something much bigger than I
initially anticipated, and I figured I might as well do this as a
separate article.
An introduction to Java execution and
static/dynamic compilation. This can be read before continuing, or
later.
Objectives
There are several reasons why Java could be thought to be inherently
slower than C for pure performance. For starters, Java is optimized at
run-time, instead of compile-time, and Java's standard libraries also
have to work the same way on all platforms, which requires an extra
layer of software. Java is also relatively new - don't forget that it
took some time before C compilers got good (they are still being
improved) - and the same goes for C++. On top of this, even for low
level programs which make little or no use of libraries, there are
several things that get in the way - in Java, all array access has to
be checked to make sure it's legal (within the bounds of the
array), the standard string library is 16-bit Unicode instead of 8-bit
ASCII, which means all strings are twice the size, all arrays have to
be initialised, dynamic memory allocation is handled through a general
garbage collector, no mainstream CPUs currently have any Java
optimizations, and it's hard to take advantage of special multi-media
type instructions. In theory, the garbage collection and
dynamic/run-time optimizations can be a strength as well, and for the
most part the above difficulties can be optimized away with a good
enough optimizer - except strings being 16-bit.
This article is not supposed to be an attempt to fully quantify the
speed difference between Java and C - it's too simple and incomplete
for that, for one thing. This will possibly be the first article in a
series (though given the horrible amounts of time it's taken just to
do this, I almost hope not!) and I'll start off with "does Java
require static compilation for good performance?" and "is
dynamic/run-time optimization Java's performance Achilles heel?". For
both of these, I'm limiting the test to algorithms which require
little or no use of standard libraries - incidentally, for the
SPEC2000 CPU performance tests it's required that at most %5 of time
is spent outside the code being benchmarked. Also, the implementation
of the code for the Java and C versions is pretty much as close as I
could make it - neither version has much in the way of language
specific optimizations. This is partly to try to achieve a more
"apples to apples" comparison, though this does somewhat bias against
C which has much more in the way of little tweaks and nudges that
programmer can provide the compiler with - however, I have yet to find
any language extensions or similar that make any difference - for
example "inline" had no effect.
Testing Methodology
All the tests were run on my personal computer (Athlon 500 on Asus K7M
motherboard with 128MB PC100 SDRAM), with Windows 98SE - Windows
currently has the most advanced shipping JVMs (J2SE1.3 and HotSpot 2.0
isn't yet available on Solaris or Linux for once off reasons) as well
as a decent choice of C compilers. For each program, several
benchmarks are run with increasing complexity/size for the problem -
for each of these, the same setup is run 5 times, the fastest and
slowest scores are discarded, with the final result being the average
of the middle three. For each test, a two quick pre-trials are
performed by the software to estimate how many loops are required to
make the test run for approximately 10 seconds, to have a consistent
error on the timing. 10 seconds is also enough to give the JVMs
enough time to fully optimize, though there is also an automatic
self-adjust to correct for code that gets faster as you run it. I'm
not particularly happy with Windows 98's timing accuracy,
threading/multi-tasking, interrupt handing (not nearly as good as some
tests I've done under Solaris, FreeBSD and Linux), but the results are
generally very predictable and stable, and often there is little/no
variance between the 5 runs.
For the results below, the performance is given as a relative
score to the base test, which is the C program compiled with only
simple optimization. For every test, the results are normalised such
that 100 is the score for the base test. A result of 110 would be 10%
faster than the base test, while a result of 90 would be 10% slower.
Other tests were done with C compiled with much more aggressive
optimization, though the same rules were used for all the C
programs.
Key for the Graphs
- base-C
- This is the C version compiled with
simple optimization under gcc (-O). This is the baseline, and all
other results are relative to it. Cygwin was used to run
GCC under Windows.
Version: gcc version 2.95.2 19991024
(release) Optimization flags: -lm -O
- max-C
- This is the C version
compiled with a stack load of optimizations, so slightly dodgy. I have
done a slightly weaker version, which should be quite safe, and is
only about 10% worse, but it makes the graph a little cluttered. Same
version of gcc as above.
Optimization flags: ( -O3 -lm -s
-static -fomit-frame-pointer -mpentiumpro -march=pentiumpro
-malign-functions=4 -fu nroll-all-loops -fexpensive-optimizations
-malign-double -fschedule-insns2 -mwide-multiply -finline-function s
-fstrict-aliasing)
- HotSpot
- This is the Java
program run with Sun's HotSpot Server 2.0.
Version: Java
HotSpot(TM) Server VM (build 2.0fcs-E, mixed mode)
Optimization flags: none
- IBM
- This is IBM's latest Java
version 1.1.8 compiler - version 3.5, from March 2000. It's the latest
official production release, but I did find a 1.2.2 release (on IBM's
AlphaWorks site in a rather obscure place), though I'm not sure if
it'd make much difference for these benchmarks.
Version: JDK
1.1.8 IBM build n118p-20000322 (JIT enabled: ibmjitc
V3.5-IBMJDK1.1-20000322) Optimization flags: none
- MS C
- Microsoft Visual C
compiler with heavy optimization.
Version: MSVC 6
Enterprise Optimization flags: "Maximum Optimizations"
setting: /nologo /ML /W3 /GX /O2 /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /Fp"Release/fib.pch" /YX /Fo"Release/" /Fd"Release/" /FD /c
Tested Algorithms
- "Game of Life" with light load
-
This is fairly advanced implementation of J.H. Conway's simple
cellular automaton - 2D grid of cells, which are either alive or dead,
with deaths and births depending on the number of immediately
surrounding alive cells. This version is all static, highly dependant
on array access, and has a fixed-size board with just one "glider" (a
repeating pattern with 5 alive cells which moves in a diagonal) so the
bigger the board gets, the more dependant the algorithm is on simple
array reading. This is the longest and most complex code in the test
with 9 procedures.
- "Game of Life" with heavy load
-
Identical code to the above, but this time the number of gliders
scales with the size of the board - as many are squeezed in as
possible. This makes the algorithm somewhat less dependant on array
accesses.
- "Game of Life" with 'infinite' board
- This is a completely different implementation which uses a
hash to simulate a sparse matrix, which actually has a maximum size of
232x232, but it could be extended reasonably
easily. The implementation requires dynamic memory allocation, and has
a custom hashing algorithm (to keep the implementation identical) but
is much smaller/simpler (2 procedures) than the static memory
version. For this test the 'load' is a square block of gliders,
tightly packed, the length of each side being a multiple of 5. This
implementation is much slower than the above version, by the
way.
- Fibonacci series (Page 2)
- The Fibonacci
series is a very simple function that calls itself recursively twice
on every iteration (except the terminating one) and computation time
increases exponentially. Not much for a compiler here to do (unless it
can detect the static nature of the results) except inline the
recursive calls. I came up with this when trying to think of an
algorithm with no array access. Results are kinda interesting but have
little reflection on almost all real world programs.
- Simple FFT implementation (Page 2)
- There's lots
of Fast Fourier Transform algorithms out there, but this is a fairly
simple one with just 4 procedures though has nested loops. While it
uses double-precision floating point, it is still rather affected by
array access.

C source \| Java source
HotSpot and Microsoft's C compiler start off pretty well, though drop
off at the bigger array sizes. For HotSpot I would imagine it's
because it hasn't been optimized for array access (bounds checking)
yet, and the latter tests are rather dominated by this. IBM's JDK is
well optimized for bounds checking it seems. Not quite sure why
Microsoft's C compiler drops off though. GCC just edges out IBM's JVM
here.

C source \| Java source
(Same as above, just run with the first argument being '-heavy')
HotSpot's better optimizations help it here as with heavy load, more
real calculations are done, though there is still much array
accessing. GCC wins this one again though (and by some margin too),
and MSVC doesn't do quite so poorly.

C source \| Java source
Here it appears obvious that GCC's default dynamic memory allocation
is limiting it quite severely. Though array access are still
significant, it looks like HotSpot's memory allocation is
significantly better than for IBM's JVM. The two JVMs certainly beat
GCC here, though MSVC's good showing prevent a clear Java victory.
HotSpot suddenly slumps at 75/100, which turns out to be because of
the default sizes of the memory pools. I found this out based on the
unsupported options mentioned on this
page which can be used to change the defaults. If these are made
bigger, the HotSpot scores do not dip at all. Also, HotSpot actually
does correct for this automatically, but isn't given enough time with
the current methodology.
All Content is Copyright (C) 1998-2003 Ace's Hardware. All Rights Reserved.
|