Barton: 512 KB Athlon XP Reviewed
By
Johan De Gelas
Monday, February 10, 2003 12:09 AM EST
The upcoming Opteron and Athlon 64 are constantly in the
limelight of the hardware community. No other AMD processor has created
so much hype, high hopes, and discussion. In the shadow of its big brothers
is "Barton," the first AMD processor with 512 KB of L2-cache integrated
in the die.
This exclusive 512 KB L2-cache works together with the 128 KB
L1-cache (64 KB data, 64 KB instruction)
to form one impressive 640 KB on-die cache. According to AMD, the extra 256 KB cache
boosts a 2170 MHz Athlon XP from a 2700+ level to a 3000+ one. The 54.3 million
transistor 2.17 GHz Barton Athlon XP will thus take on the mighty 55 million transistor
3.06 GHz Pentium 4 with Hyperthreading. Will 256 KB extra cache and a clockrate of 2.17 GHz be
enough to compete with the fastest Intel CPU available today? Well, we'll find out in a
moment. But before we look at the benchmarks, I'd like to discuss the
different L2-caches, as caches are extremely important for modern CPUs.
A 512 KB L2 for the Athlon
L2-cache has often been a drag on the performance of AMD's
processors. The K6 was a sixth generation architecture but it came
with a fifth generation off-die L2-cache running at only 66 or 100 MHz. The L2
cache of the K6-III was pretty impressive, but the clock frequency of
the K6-III did not scale past 450 MHz. The Athlon was a very
impressive seventh generation architecture, but it was launched with
a sixth generation L2-cache system.
In contrast, the L2-cache made the Intel processors really shine. The PII had a 512 KB
half speed, back side bus cache, which gave Intel's CPU a considerable advantage
over competitors like the Cyrix MII and AMD K6. The most important
reason why the Coppermine Pentium III could somewhat keep up with the more
advanced Athlon was its low latency, high bandwidth cache. Extremely
impressive for its time, as the 256 KB cache was not only accessed
via a 256-bit data path, but it could also respond in an amazing 4
clockcycles (total L2-cache latency was 7 cycles).
Back to today: as the Pentium 4 was built to reach very high clockspeeds,
a 4 cycle L2-cache
latency was not possible. The L2-cache of the Pentium 4 is
still pretty impressive, though, as you can see below. ScienceMark
2.0 tells us what Intel's engineers have been capable of. We tested
with the 3.06 GHz Pentium 4. The most accurate numbers are the 32 byte
to 256 byte step numbers (columns) in the rows between the 32768 byte and 131072 byte
dataflows, as we are sure that these measurements happen in the L2-cache.
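The reasoning behind picking those rows can be sketched in a few lines. This is a deliberately simplified model that only compares the size of the dataflow with the Pentium 4 cache sizes quoted in this article (8 KB L1 data cache, 512 KB L2-cache). The function name and the assumption that a working set is served by the smallest cache it fits in are ours; real caches are messier (associativity, prefetching, code and data sharing the L2).

```python
# Simplified model: a working set is served by the smallest cache level
# it fits in. Sizes are the Pentium 4 (Northwood) figures from this article.
P4_L1_DATA_BYTES = 8 * 1024
P4_L2_BYTES = 512 * 1024


def serving_level(dataflow_bytes, l1_bytes=P4_L1_DATA_BYTES, l2_bytes=P4_L2_BYTES):
    """Return the level expected to serve a working set of the given size."""
    if dataflow_bytes <= l1_bytes:
        return "L1"
    if dataflow_bytes <= l2_bytes:
        return "L2"
    return "memory"


if __name__ == "__main__":
    # The 32768 to 131072 byte rows are too big for L1 and well inside L2.
    for size in (4096, 32768, 65536, 131072, 1048576):
        print(f"{size:>8} bytes -> {serving_level(size)}")
```

Rows close to either boundary are the least trustworthy, which is why the middle of the range is the safest place to read the L2 numbers.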

A latency of 8 cycles (10 including the latency of the L1-cache) to
17 (total latency of 19) cycles is still very impressive for a CPU that
runs at 3 GHz. Eight 3 GHz cycles equal roughly 2.7 ns, faster than the
Pentium III's L1-cache has ever been! Let us take a look at the bandwidth of
the L2-cache.
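To put those cycle counts into absolute time, a small conversion helps. The sketch below only assumes that one cycle lasts 1/f nanoseconds at a clock of f GHz; the helper name is ours, and the cycle counts are the ones quoted above. Eight cycles come out at roughly 2.6 to 2.7 ns, depending on whether you use the rounded 3 GHz or the actual 3.06 GHz clock.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in clock cycles to nanoseconds (1 GHz = 1 cycle per ns)."""
    return cycles / clock_ghz


if __name__ == "__main__":
    # Latencies in cycles as quoted in the text: L2 only, and L1 + L2 combined.
    for label, cycles in (("best L2", 8), ("best total", 10),
                          ("worst L2", 17), ("worst total", 19)):
        print(f"{label:>11}: {cycles:2d} cycles = "
              f"{cycles_to_ns(cycles, 3.0):.2f} ns at 3.0 GHz, "
              f"{cycles_to_ns(cycles, 3.06):.2f} ns at 3.06 GHz")
```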

Although 19 GB/s is nowhere near the theoretical 96 GB/s (3 GHz x 32
bytes per cycle), the Pentium 4 has a very fast L2-cache. One of the reasons for
this big gap between theory and practice is the fact that only SSE(-2)
instructions can move more than 8 bytes per cycle. And it is very unlikely
that the Pentium 4 can sustain those 128-bit instructions at a rate higher
than 1 per cycle.
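The gap between theory and practice is easier to see with the numbers side by side. The sketch below computes peak bandwidth as clock times bytes moved per cycle, using the rounded 3 GHz clock and the 19 GB/s measurement from the text. The 16 and 8 byte cases are our own illustration of what one 128-bit or one 64-bit move per cycle would allow; they are not measurements.

```python
def peak_bandwidth_gb_s(clock_ghz, bytes_per_cycle):
    """Peak bandwidth in GB/s (1 GB = 10^9 bytes) for a given transfer width."""
    return clock_ghz * bytes_per_cycle


MEASURED_GB_S = 19.0  # ScienceMark result quoted in the text

if __name__ == "__main__":
    cases = (
        ("full 32 byte path every cycle", 32),
        ("one 128-bit SSE2 move per cycle", 16),  # illustrative assumption
        ("one 64-bit move per cycle", 8),         # illustrative assumption
    )
    for label, width in cases:
        peak = peak_bandwidth_gb_s(3.0, width)
        print(f"{label:<32} peak {peak:5.1f} GB/s, "
              f"measured is {MEASURED_GB_S / peak:.0%} of that")
```

Seen this way, the measured figure is far from the 32 byte ceiling but not far from what a single 64-bit move per cycle could deliver.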
Let us see how important cache is for performance. When the Pentium 4 was upgraded to
a 512 KB L2 cache instead of a 256 KB one, performance was between 6%
and 61% higher.
The 61% higher performance in 3DSMax may surprise you, but it can be explained.
The tiny 8 KB data cache can be accessed in 2 cycles by the integer units, but only in 6 cycles by the FPU/SSE-2 units of the Pentium 4. As the data cache is so small and relatively slow to access, the L2-cache is of the utmost
importance to the Pentium 4 when crunching through FPU intensive apps. That
is also the reason why integer intensive applications see a smaller boost.
Modern games, which also tend to be FPU intensive, showed an impressive
15 to 17% boost thanks to the larger L2-cache. Only the older games
(like Unreal Tournament) did not perform much better as their critical
loops were satisfied with 256 KB.
Now let us see what the new AMD CPU has in store. AMD has finally caught
up to the Pentium 4, and has even more cache on board than the fastest
CPU from Santa Clara. I'd like to point out again what a marvelously efficient architecture
the Athlon is: even Barton with 640 KB cache onboard is only 101 mm²,
which is still a lot smaller than Intel's Northwood (130 mm²). Of
course, the slightly larger die size is no problem for Intel, given its huge fab capacity
and 300 mm wafers.
Back to Barton, though. How good is Barton's cache? Well, latency is identical
to the cache of Thoroughbred, the other 130 nanometer Athlon XP. Take
a look below...

The L2-cache seems to have a latency between 15 (+L1-cache latency =
18) and 21 (24) cycles. The 24 cycles are a bit odd, as AMD's technical
documentation talks about a (total) latency between 11 and 20 cycles and
other cache programs (cachemem) confirmed the maximum of 20 cycles. Nevertheless,
the important point is that the total L2-cache latency of the Athlon is
higher than the Pentium 4's. What about bandwidth?
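Because the two CPUs run at very different clock speeds, it is worth expressing the total (L1 + L2) latencies in nanoseconds as well. The sketch below uses the measured cycle ranges from the text (10 to 19 cycles for the 3.06 GHz Pentium 4, 18 to 24 cycles for the 2.17 GHz Barton); the function and dictionary names are ours.

```python
def latency_range_ns(clock_ghz, best_cycles, worst_cycles):
    """Convert a (best, worst) latency range from cycles to nanoseconds."""
    return best_cycles / clock_ghz, worst_cycles / clock_ghz


# Total L2 latency (L1 latency included), in cycles, as measured in the text.
MEASURED = {
    "Pentium 4 3.06 GHz": (3.06, 10, 19),
    "Athlon XP (Barton) 2.17 GHz": (2.17, 18, 24),
}

if __name__ == "__main__":
    for cpu, (clock, best, worst) in MEASURED.items():
        lo, hi = latency_range_ns(clock, best, worst)
        print(f"{cpu:<28} {best}-{worst} cycles = {lo:.1f}-{hi:.1f} ns")
```

In absolute time the gap is wider than the cycle counts suggest: the Athlon's best case (a bit over 8 ns) is still slower than the Pentium 4's worst case (a bit over 6 ns).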

The 64-bit 2.17 GHz L2-cache offers up to 5.5 GB/s to the CPU core,
roughly 3 to 4 times less than the 3 GHz Pentium 4. However, you should not
immediately conclude that the Athlon's L2-cache is very slow and hampering
the performance of the Athlon. In contrast to the Pentium 4, the Athlon's L1-cache
will deliver a lot of the bandwidth needed. Just imagine an FPU intensive
application that runs 85% in the L1-cache and 15% in the L2-cache (ignoring
the memory subsystem for the moment). As the Pentium 4 only searches and uses its L2-cache,
it will have a 19 GB/s pipe to its FPU pipeline. The Athlon
will have a pipe of about 17 GB/s (0.85 x 19 GB/s + 0.15 x 5.5 GB/s) to the FPU
unit. In most applications, especially the FPU intensive ones, the Athlon
needs its L2-cache much less than the Pentium 4.
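The back-of-the-envelope calculation above is a simple weighted average, and it is easy to play with. The sketch below mirrors the linear weighting used in the text, including its assumption that the Athlon's L1-cache delivers the same 19 GB/s as the Pentium 4's L2-cache; the function name and the extra hit rates in the loop are ours.

```python
def blended_bandwidth_gb_s(l1_share, l1_gb_s, l2_gb_s):
    """Weighted average of L1 and L2 bandwidth; l1_share is the L1 fraction."""
    return l1_share * l1_gb_s + (1.0 - l1_share) * l2_gb_s


P4_L2_GB_S = 19.0      # measured, from the text
ATHLON_L1_GB_S = 19.0  # assumption taken over from the article's example
ATHLON_L2_GB_S = 5.5   # measured, from the text

if __name__ == "__main__":
    print(f"L2 bandwidth ratio: {P4_L2_GB_S / ATHLON_L2_GB_S:.1f}x")
    # 0.85 is the article's example; the other shares are for comparison only.
    for share in (0.50, 0.70, 0.85, 0.95):
        athlon = blended_bandwidth_gb_s(share, ATHLON_L1_GB_S, ATHLON_L2_GB_S)
        print(f"{share:.0%} in L1 -> Athlon sees about {athlon:.1f} GB/s")
```

The lower the share served by the L1-cache, the more the slower L2-cache drags the average down, which is exactly why the Athlon's large L1 data cache matters so much.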
Therefore, we can already say that the performance increase from Thoroughbred
(384 KB cache) to Barton (640 KB) will be much less than what we have witnessed
with the transition from Willamette (256 KB) to Northwood (512 KB) for
the following reasons:
- The latency of the Athlon's L2-cache is higher
- The Pentium 4's L1 data cache is very small in integer applications, and non-existent in FP applications
- The Athlon, in contrast, relies heavily on its huge 64 KB L1 data cache in all applications
- Barton has 67% more cache than Thoroughbred; Northwood had 100% more cache than Willamette
So we cannot expect too much of Barton's L2-cache increase...
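The percentages in the last point follow directly from the cache sizes. The sketch below counts the Athlon's exclusive caches as L1 plus L2 (128 KB + 256 KB for Thoroughbred, 128 KB + 512 KB for Barton) and the Pentium 4 by its L2-cache alone, as the text does; the helper names are ours.

```python
ATHLON_L1_KB = 128  # 64 KB data + 64 KB instruction


def exclusive_total_kb(l2_kb, l1_kb=ATHLON_L1_KB):
    """An exclusive L2 holds no copies of L1 lines, so the capacities add up."""
    return l1_kb + l2_kb


def growth_percent(old_kb, new_kb):
    """Relative increase from old_kb to new_kb, in percent."""
    return (new_kb - old_kb) / old_kb * 100.0


if __name__ == "__main__":
    thoroughbred = exclusive_total_kb(256)
    barton = exclusive_total_kb(512)
    print(f"Thoroughbred {thoroughbred} KB -> Barton {barton} KB: "
          f"+{growth_percent(thoroughbred, barton):.0f}%")
    print(f"Willamette 256 KB -> Northwood 512 KB: "
          f"+{growth_percent(256, 512):.0f}%")
```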
Now, let's first take a look at the various Athlon core revisions, including the newest one, "Barton."
All Content is Copyright (C) 1998-2003 Ace's Hardware. All Rights Reserved.