The Quest for Better Performance, Part 2
By Johan De Gelas
Sunday, April 23, 2000 3:39 PM EDT
Performance: Chipset, FSB, and Memory
In our previous article, we hunted for performance bottlenecks and discovered quite a few interesting things. The bottlenecks we found were the bandwidth and latency of the memory and of the L2 cache. In this article we will try to remove these bottlenecks. One of our solutions is faster memory, so we take a closer look at the performance of different types of memory. Is Rambus really that bad? What about DDR SDRAM? VC SDRAM? How will future technology affect the performance of new CPUs like Thunderbird and Willamette? Is a 200 MHz, 1.6 GB/s FSB useless without fast memory? Let us find out!
Latency and Bandwidth
Latency, expressed in clock cycles, is the time between a request for data and the moment that data is sent out. It takes some time before the memory address is decoded and the data is ready to be sent back. Bandwidth is, of course, the number of megabytes that can be sent per second over the memory bus. Contrary to popular belief, latency and bandwidth are very closely related. If the latency for the first word (4 bytes) is very high, the bandwidth is lower, especially with random memory accesses.
For example, the peak bandwidth of RAMBUS PC800 is 1600 MB/s. But with random memory accesses the first 4 bytes arrive after 11 cycles, and a typical 32-byte transfer (one 32-byte cache line sent to the CPU) takes 11-1-1-1 cycles, or 14 cycles in total. If the FSB runs at 133 MHz, the bandwidth for random memory accesses to the CPU is 32 bytes x 133 MHz / 14 = 304 MB/s.
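The calculation above can be written as a small Python sketch. It is a simplified model, assuming every number in a timing string such as "11-1-1-1" counts FSB clock cycles and that one whole 32-byte cache line is delivered per burst with no overlap between bursts (the worst case, typical of random accesses). The function names are ours, for illustration only.

```python
def burst_cycles(pattern: str) -> int:
    """Total FSB clock cycles for a burst timing such as "11-1-1-1"."""
    return sum(int(c) for c in pattern.split("-"))


def effective_bandwidth_mb_s(bytes_per_burst: int, fsb_mhz: float, pattern: str) -> float:
    """Bytes per burst x bursts per second (FSB MHz / cycles per burst), in MB/s.

    Assumes one burst completes before the next begins (no pipelining),
    which models isolated, random memory accesses.
    """
    return bytes_per_burst * fsb_mhz / burst_cycles(pattern)


if __name__ == "__main__":
    # RAMBUS PC800 seen through a 133 MHz FSB, one random 32-byte cache line
    print(round(effective_bandwidth_mb_s(32, 133, "11-1-1-1")))
```

Note that the 1600 MB/s peak of the RAMBUS channel never enters the formula: for isolated accesses, the initial latency dominates.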
PC133 SDRAM does better in those circumstances (random accesses). It takes 7-1-1-1 cycles to transfer a 32-byte line to the CPU's cache, so the CPU receives 32 bytes x 133 MHz / 10 cycles = about 426 MB/s. If the memory accesses are more sequential, however, the initial latency is not so important. For example, if we can read 64 bytes sequentially, RAMBUS needs 11-1-1-1 cycles for the first 32 bytes but only 4 cycles (1-1-1-1, simplified) for the next 32 bytes. The bandwidth then comes closer to the peak: 64 bytes x 133 MHz / 18 cycles = 473 MB/s. Bursts of memory traffic with sequential accesses lower the influence of the initial latency, and the average bandwidth to the CPU rises.
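A sketch that reproduces the three cases discussed above (RAMBUS random, PC133 random, RAMBUS with two sequential lines). It assumes, as the text does, that every 32-byte line after the first costs 1-1-1-1 = 4 cycles and that the FSB runs at 133 MHz; the helper names are hypothetical.

```python
LINE_BYTES = 32  # one cache line


def transfer_cycles(first_line: list[int], lines: int, follow_on: int = 4) -> int:
    """Cycles to read `lines` consecutive cache lines.

    The first line pays the full timing (e.g. [11, 1, 1, 1]); each following
    line is assumed to cost `follow_on` cycles (1-1-1-1, simplified).
    """
    return sum(first_line) + (lines - 1) * follow_on


def bandwidth_mb_s(lines: int, first_line: list[int], fsb_mhz: float = 133.0) -> float:
    """Average bandwidth to the CPU for one burst of `lines` cache lines."""
    return LINE_BYTES * lines * fsb_mhz / transfer_cycles(first_line, lines)


RDRAM = [11, 1, 1, 1]  # RAMBUS PC800, timing as seen from the CPU
SDRAM = [7, 1, 1, 1]   # PC133 SDRAM

if __name__ == "__main__":
    print(f"RAMBUS, random 32 B:     {bandwidth_mb_s(1, RDRAM):.0f} MB/s")
    print(f"PC133,  random 32 B:     {bandwidth_mb_s(1, SDRAM):.0f} MB/s")
    print(f"RAMBUS, sequential 64 B: {bandwidth_mb_s(2, RDRAM):.0f} MB/s")
```

Raising `lines` spreads the 11-cycle start-up cost over more data, which is why sequential bursts push the average bandwidth toward the peak.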
Why RAMBUS Fails!
Astute readers have already figured out why systems with RAMBUS fail
to show better performance than systems with PC133 SDRAM.
First of all, the FSB of the PIII is limited to 133 MHz, which limits the bandwidth to the CPU to 1066 MB/s. The only way the higher bandwidth of RAMBUS PC800 can be used is through AGP texturing. As we have shown in previous articles, AGP texturing is not very popular with game developers because it is incredibly slow, even with AGP 4x! In other words, the bandwidth of PC800 RAMBUS that reaches the CPU is limited to 1066 MB/s, in theory the same as PC133 SDRAM.
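A minimal sketch of this "slowest link" argument: the bandwidth that can reach the CPU is the minimum of the memory's peak and the front side bus peak. It assumes a 64-bit (8-byte) PIII front side bus, a 64-bit PC133 DIMM, a 16-bit (2-byte) PC800 RAMBUS channel at 800 million transfers per second, and the real 133.33 MHz clock behind "133 MHz"; `peak_mb_s` is a hypothetical helper.

```python
FSB_133 = 400 / 3  # the "133 MHz" clock is really 133.33 MHz


def peak_mb_s(mt_per_s: float, bus_bytes: int) -> float:
    """Theoretical peak: million transfers per second x bytes per transfer = MB/s."""
    return mt_per_s * bus_bytes


pc800_rdram = peak_mb_s(800, 2)      # 16-bit RAMBUS channel
pc133_sdram = peak_mb_s(FSB_133, 8)  # 64-bit SDRAM DIMM
piii_fsb = peak_mb_s(FSB_133, 8)     # 64-bit PIII front side bus

# The CPU can never receive more than the narrowest link in the path delivers.
to_cpu_rdram = min(pc800_rdram, piii_fsb)
to_cpu_sdram = min(pc133_sdram, piii_fsb)

if __name__ == "__main__":
    print(int(pc800_rdram), int(to_cpu_rdram), int(to_cpu_sdram))
```

The model deliberately ignores AGP traffic, the one path that could use the remaining RAMBUS bandwidth.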
Second, lower latency always improves performance, for sequential memory accesses (less important) as well as for random memory accesses (very important), and seen from the CPU, RAMBUS has a higher initial latency than SDRAM. This latency can become even higher than the numbers we quoted (11 cycles versus 7 cycles) when RAMBUS powers down because of heat issues.
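To see what a few extra cycles of initial latency cost, this sketch sweeps the first-word latency in the same random-access model (one 32-byte line per request, 1-1-1 for the remaining words, 133 MHz FSB). The values 7 and 11 are the SDRAM and RAMBUS figures quoted above; 14 and 17 are hypothetical stand-ins for a RAMBUS device waking from power-down, not measurements.

```python
FSB_MHZ = 133.0
CYCLE_NS = 1000 / FSB_MHZ  # duration of one FSB clock in nanoseconds


def random_access_mb_s(first_word_cycles: int) -> float:
    """Bandwidth for isolated 32-byte line fetches timed first-1-1-1."""
    return 32 * FSB_MHZ / (first_word_cycles + 3)


if __name__ == "__main__":
    # 7 = PC133 SDRAM, 11 = RAMBUS PC800; 14 and 17 are hypothetical
    for cycles in (7, 11, 14, 17):
        print(f"{cycles:2d} cycles ({cycles * CYCLE_NS:5.1f} ns) -> "
              f"{random_access_mb_s(cycles):5.0f} MB/s")
```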
Is High Bandwidth Useless?
I have been emphasizing the importance of low latency so much that you might get the impression that memory bandwidth is useless. As we pointed out in our previous article, the only big dataflow that really matters is the flow from the memory via the chipset to the CPU and vice versa. Again, bandwidth and latency are very closely related. Even in applications that are not bandwidth critical, bursts of memory activity occur. In other words, even non-bandwidth-critical applications (like most applications) have small pieces of code with a lot of cache-missing, memory-hungry instructions. If those instructions depend on each other's results and the memory bandwidth cannot keep up, it takes longer before the data arrives.
Indeed, low bandwidth increases the latency that the CPU sees! So the last thing I would claim is that high bandwidth is useless. What I do claim is that high-bandwidth memory is useless if the front side bus cannot cope with it. And now we have the benchmarks to prove it! Let us see how much the front side bus and the memory bandwidth and latency affect performance. You will be amazed, I can tell you that.
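Why low bandwidth raises the latency the CPU sees can be shown with a toy queueing sketch. It assumes a burst of cache misses issued at the same time, a fixed initial latency, and a memory channel that delivers one 32-byte line after another at its peak bandwidth. All parameter values (8 misses, 50 ns, the bandwidth list) are hypothetical, chosen only to show the trend.

```python
LINE_BYTES = 32


def arrival_times_ns(misses: int, latency_ns: float, bandwidth_mb_s: float) -> list[float]:
    """Arrival time of each line when `misses` requests are queued at once.

    1 MB/s moves 0.001 bytes per nanosecond, so one line occupies the
    channel for LINE_BYTES * 1000 / bandwidth_mb_s nanoseconds.
    """
    line_ns = LINE_BYTES * 1000 / bandwidth_mb_s
    return [latency_ns + (k + 1) * line_ns for k in range(misses)]


def average_latency_ns(misses: int, latency_ns: float, bandwidth_mb_s: float) -> float:
    """Mean wait per miss, i.e. the latency an instruction stream experiences."""
    times = arrival_times_ns(misses, latency_ns, bandwidth_mb_s)
    return sum(times) / len(times)


if __name__ == "__main__":
    # Hypothetical burst: 8 misses, 50 ns initial latency
    for bw in (533, 800, 1066):
        print(f"{bw:4d} MB/s -> average wait {average_latency_ns(8, 50, bw):.0f} ns")
```

With the same initial latency, halving the bandwidth roughly doubles the queueing part of the wait, which is the effect described above for dependent, cache-missing instructions.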
Memory Affects Performance!
I had two memory types at my disposal: SDRAM and VC SDRAM. VC SDRAM is supposed to improve the effective bandwidth of SDRAM. I included the VC SDRAM figures because it is interesting to see how the different types of memory react, but we will be focusing on the relation between the memory and the front side bus.
In the table under the graph you see three numbers above each result.
The first number indicates the CAS latency.
The second number indicates the speed of the memory stick.
The third number indicates the front side bus speed, the speed between the chipset and the CPU.

Well, VC SDRAM offers between 4 and 7% more bandwidth, nothing to write home about. If a bandwidth-specific benchmark like Stream shows so little improvement, you can imagine that real-world benchmarks will show even smaller benefits.
Much more interesting is the amount of bandwidth offered by the different FSB settings. The 224 MHz FSB setting with the memory clocked at 112 MHz (CPU at 112x8 = 896 MHz) offers slightly more bandwidth than the 180 MHz FSB setting with the RAM clocked at 120 MHz (CPU at 90x10 = 900 MHz). That is odd, as you would expect the 120 MHz SDRAM to always beat the 112 MHz SDRAM. It seems that the speed of the memory stick is the most important factor, but that the FSB also plays a smaller but significant role in determining memory bandwidth.
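For comparison, here are the theoretical ceilings of the two settings, computed in a sketch that assumes a 64-bit (8-byte) double-pumped Athlon FSB and a 64-bit SDRAM DIMM. The `Setting` class is hypothetical, and no measured Stream numbers are reproduced. The point is that a naive "slowest link wins" model predicts the 120 MHz memory setting should come out ahead, which is exactly what the benchmark contradicts.

```python
from dataclasses import dataclass


@dataclass
class Setting:
    fsb_mt_s: int    # DDR front side bus, effective MT/s (twice the base clock)
    mem_mhz: int     # SDRAM clock
    multiplier: int  # CPU multiplier applied to the FSB base clock

    @property
    def cpu_mhz(self) -> float:
        return self.fsb_mt_s / 2 * self.multiplier

    @property
    def fsb_peak(self) -> int:
        return self.fsb_mt_s * 8  # 8 bytes per transfer

    @property
    def mem_peak(self) -> int:
        return self.mem_mhz * 8   # 64-bit SDRAM, one transfer per clock

    @property
    def naive_limit(self) -> int:
        """Bandwidth predicted if only the slowest link mattered."""
        return min(self.fsb_peak, self.mem_peak)


a = Setting(fsb_mt_s=224, mem_mhz=112, multiplier=8)
b = Setting(fsb_mt_s=180, mem_mhz=120, multiplier=10)

if __name__ == "__main__":
    for name, s in (("224 FSB / 112 RAM", a), ("180 FSB / 120 RAM", b)):
        print(f"{name}: CPU {s.cpu_mhz:.0f} MHz, FSB {s.fsb_peak} MB/s, "
              f"RAM {s.mem_peak} MB/s, naive limit {s.naive_limit} MB/s")
```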
Let us see another Stream benchmark.

VC SDRAM doesn't shine; the differences are small. Interestingly, though, VC SDRAM confirms what we saw in the previous benchmark: a higher front side bus speed can help in some cases. SDRAM seems to prefer the faster memory speed.
Right now, we only have a few indications that the speed of the FSB might be important. Unfortunately, this is not easy to study, as I could only vary the FSB between 180 and 224 MHz (90 to 112 MHz DDR). I would have loved to set the FSB of the Athlon to 133 MHz and increase it to 200 MHz to see what happens, but that is not possible. Nevertheless, let us take a look at how Quake 3 reacts to different FSB and memory speeds.
All Content is Copyright (C) 1998-2003 Ace's Hardware. All Rights Reserved.