RSX

FIXME DRAFT

The RSX is the graphics rendering chip of the PS3. It has access to 256MB of dedicated local GDDR3 memory. It also has access to system memory through the high-speed FlexIO bus of the Cell using DMA operations. The RSX chip is produced by NVIDIA and is similar to a GeForce 7800 GPU. However, neither Sony nor NVIDIA has published documentation on how to program this chip.

This page gathers the information obtained so far by experimenting with the Hypervisor API and the Linux framebuffer driver. It summarizes information relevant to the PS3 RSX only. For a better understanding of NVIDIA GPUs in general see the nouveau project. In particular, this page (of the nouveau wiki) introduces many concepts needed to understand NVIDIA GPU programming.

Video RAM

The video RAM is mainly used to store pixel data. Up to 252MB can be reserved via the lv1_gpu_memory_allocate hypervisor call. The last 4MB are reserved to store the GPU state and for GPU object instantiation and bookkeeping.

The layout of the video RAM is as follows:

Video RAM Region	Usage
`0000000-fbfffff`	Framebuffer
`fc00000-fffffff`	GPU data
`ff80000-fffffff`	instance memory (RAMIN)
`ff90000-ff93fff`	hash table (RAMHT)
`ffa0000-ffa0fff`	fifo context (RAMFC)
`ffc0000-ffcffff`	DMA objects
`ffd0000-ffdffff`	graphic objects
`ffe0000-fffffff`	graphic context (GRAPH)

FIFO

The FIFO is a buffer used for sending commands to the GPU in an efficient way. It is located in system memory (XDR). The lv1_gpu_context_attribute(FB_SETUP) call sets the location of the FIFO in XDR memory. The CPU tells the GPU that commands are available for processing by writing to the FIFO put register. The FIFO get register tells the CPU how far the GPU has progressed through the queued commands. When the get and put registers are equal, the GPU is waiting for more commands to process. Direct access to the FIFO get and put registers is obtained from the lv1_gpu_context_allocate call, in the dma_control_lpar return parameter. The following is the mapping of the dma_control region, with all other registers being zero:

Address	Register
0x40	FIFO put register (owned by CPU)
0x44	FIFO get register (owned by GPU)
0x54	unknown register (owned by GPU)

As of firmware 2.10, the FIFO must be at least 2MiB large.
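
The handshake between the two pointer registers can be modelled in a few lines. This is a minimal Python sketch, not driver code: the `ctrl` buffer stands in for the ioremapped dma_control region, the helper names are hypothetical, and both big-endian register access and the meaning of the pointer values are assumptions.

```python
import struct

# Offsets inside the dma_control region, taken from the table above.
CPU_OWNED_REG = 0x40  # written by the CPU to announce new commands
GPU_OWNED_REG = 0x44  # advanced by the GPU as it consumes commands


def read_reg(ctrl, offset):
    """Read one 32-bit register (big-endian access is an assumption)."""
    return struct.unpack_from(">I", ctrl, offset)[0]


def write_reg(ctrl, offset, value):
    """Write one 32-bit register."""
    struct.pack_into(">I", ctrl, offset, value & 0xFFFFFFFF)


def gpu_is_idle(ctrl):
    """The GPU has nothing left to process when both pointers are equal."""
    return read_reg(ctrl, CPU_OWNED_REG) == read_reg(ctrl, GPU_OWNED_REG)


# Simulated control region; on real hardware this would be the mapped
# dma_control_lpar area, not a local buffer.
ctrl = bytearray(0x80)
idle_before = gpu_is_idle(ctrl)

write_reg(ctrl, CPU_OWNED_REG, 0x20)  # CPU announces commands up to 0x20
busy_after_submit = not gpu_is_idle(ctrl)

write_reg(ctrl, GPU_OWNED_REG, 0x20)  # pretend the GPU caught up
idle_after_catch_up = gpu_is_idle(ctrl)
```

A real client would poll `gpu_is_idle` (or the DMA notifiers) before reusing FIFO space.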

Context Objects

To perform operations on the screen, the GPU uses objects. Objects can be written to using FIFO commands, and when an object is fully filled in, the operation is executed. For example, executing a blit operation requires filling in the source and destination offset, coordinates, width and height, color format (ARGB, YUYV, ...), etc. Not all objects can be written to at the same time: to be written to, an object must first be bound to a so-called FIFO subchannel. The RSX supports 8 subchannels, to which objects can be attached using the special command tag 0. Once an object is attached to a subchannel, it can be written to by specifying the address and the number of consecutive dwords to fill in. Here is an example showing the format of FIFO commands:

FIFO address	opcode	subchannel	size	tag	operation/data
0x00	0x0004e000	7	1	0	bind an object to subchannel 7
0x04	0x3137c0de				handle of the object to bind to subchannel 7
0x08	0x0010e30c	7	4	0x30c	write 4 consecutive dwords to object 0x3137c0de at address 0x30c
0x0c	0x00000000				data written at address 0x30c
0x10	0x0d000000				data written at address 0x310
0x14	0x00001000				data written at address 0x314
0x18	0x00001000				data written at address 0x318
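
The command header packs the size, subchannel and tag into one dword. The sketch below shows one way to encode and decode it in Python. The bit positions are inferred from the bind header in the table (0x0004e000 = subchannel 7, size 1, tag 0); the field widths and the function names are assumptions, not documented facts.

```python
def encode_header(subchannel, size, tag):
    """Build a FIFO command header from its three fields."""
    if not 0 <= subchannel < 8:
        raise ValueError("the RSX has 8 subchannels (0-7)")
    if not 0 <= size < 0x800:
        raise ValueError("size is assumed to be an 11-bit dword count")
    if tag & 3 or not 0 <= tag < 0x2000:
        raise ValueError("tag is assumed to be a dword-aligned 13-bit address")
    return (size << 18) | (subchannel << 13) | tag


def decode_header(word):
    """Split a FIFO command header back into its fields."""
    return {
        "subchannel": (word >> 13) & 0x7,
        "size": (word >> 18) & 0x7FF,
        "tag": word & 0x1FFC,
    }


def bind_object(subchannel, handle):
    """Commands that bind `handle` to `subchannel` (tag 0, one data dword)."""
    return [encode_header(subchannel, 1, 0), handle]


bind_cmds = bind_object(7, 0x3137C0DE)
bind_fields = decode_header(bind_cmds[0])
```

Decoding a captured FIFO dump with `decode_header` is a quick way to check which subchannel a command actually targets.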

Which action is performed by writing to a given address depends on the object class. There are two kinds of objects: graphics objects and DMA objects. Graphics objects perform operations such as blitting, drawing triangles, and so on. DMA objects describe memory regions and are referenced by graphics objects, for example to specify the source and destination of the operation. Among them, the DMA notify objects are a special case used by the GPU to report completion of commands. A DMA notify object can be attached to a graphics object so that when the graphics operation is finished, an address specified within the DMA notify object is zeroed. Using this, the CPU can wait for the completion of the GPU operation by polling that address until the value is zero.

RAMIN

The instance memory (RAMIN) is where context objects are stored. This memory resides at the end of the video RAM. Object records consist of a few dwords describing their properties. In particular, the class field (first byte) specifies the purpose of the object. The following classes are available on the RSX:

Class	Description
0x02	DMA (for 2D?)
0x03	DMA Notifier
0x30	NULL object
0x3d	DMA (for 3D?)
0x39	Memory to Memory format
0x62	2D Context Surfaces
0x89	Scaled Image From Memory
0x8a	Image from CPU
0x4097	0x97 = 3D Transform, Clipping and Lighting Engine, 0x40 = chip version
0x9e	Swizzled Surface

As DMA objects specify zones of memory, their properties include the address and the size of the region, whether it can be read or written to, and the type of memory (video RAM or system memory).

RAMHT

Objects are referenced by a handle, which is translated using a hash table (called RAMHT) to the RAMIN offset where the object resides. This hash table is maintained by the Hypervisor and consists of 2048 8-byte entries. Each entry contains the handle as first dword, followed by the address of the instance (an offset from start of RAMIN):

hash table entry (bits)
63 - 32	31 - 20	19 - 0
handle	engine	address
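
The bit layout above can be unpacked with plain shifts and masks. This Python sketch treats an entry as a single 64-bit integer; how the two dwords are ordered in memory is not covered here, and the sample engine and address values are made up for illustration.

```python
def unpack_ramht_entry(entry):
    """Split a 64-bit RAMHT entry into (handle, engine, address)."""
    handle = (entry >> 32) & 0xFFFFFFFF   # bits 63-32
    engine = (entry >> 20) & 0xFFF        # bits 31-20
    address = entry & 0xFFFFF             # bits 19-0
    return handle, engine, address


def pack_ramht_entry(handle, engine, address):
    """Inverse of unpack_ramht_entry."""
    return ((handle & 0xFFFFFFFF) << 32) | ((engine & 0xFFF) << 20) | (address & 0xFFFFF)


# Hypothetical entry: real engine and address values were not recorded here.
sample_entry = pack_ramht_entry(0xFEED0000, 0x001, 0x12340)
sample_fields = unpack_ramht_entry(sample_entry)

# 2048 entries of 8 bytes each fill the 16KB RAMHT region exactly.
RAMHT_SIZE = 2048 * 8
```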

Indices into the hash table are obtained by hashing the object handle and the hardware channel with the following function:

hash = (handle ^ (handle >> 11) ^ (handle >> 22) ^ (channel << 7)) & 0x7ff
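
The same function written out in Python, together with the byte offset of the resulting slot inside RAMHT. The function names are illustrative; the 8-byte entry size comes from the description above.

```python
RAMHT_ENTRY_SIZE = 8  # bytes per hash table entry


def ramht_hash(handle, channel):
    """Index of the RAMHT slot for an object handle on a hardware channel."""
    return (handle ^ (handle >> 11) ^ (handle >> 22) ^ (channel << 7)) & 0x7FF


def ramht_slot_offset(handle, channel):
    """Byte offset of that slot from the start of RAMHT."""
    return ramht_hash(handle, channel) * RAMHT_ENTRY_SIZE


# Handles that differ only in their low bits land in neighbouring slots.
for handle in (0xFEED0000, 0xFEED0001):
    print(hex(handle), hex(ramht_hash(handle, 0)))
```

The sketch does not model what happens when two handles hash to the same slot.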

Channels are a hardware concept allowing the GPU to be shared by multiple users, in a similar way as multiple processes share the CPU. For more details, see the nouveau wiki page on context switching.

Hypervisor objects

Graphic objects

On the PS3, the following objects are created by a call to lv1_gpu_context_allocate:

handle	class	description
0x00000000	NULL object
0x3137af00	Scaled Image From Memory	Blitter with stretching and color conversion capability
0x31337a73	Swizzled Surface	Generally used for textures
0x31337808	Image from CPU	Used to upload images from the FIFO line by line
0x3137c0de	Memory to Memory Format	Used to download from video memory to system memory
0x313371c3	Context Surfaces 2D	Describes the display surface
0x31337303	Memory to Memory Format	Used to upload from system memory to video memory
0x31337000	NULL object

DMA objects

handle	class	target	address	limit
0x56616661	DMA Class 0x02	video memory	0x0ff10000	0x00000fff
0x56616660	DMA Class 0x3d	video memory	0x0ff10000	0x00000fff
0x66626660	DMA Class 0x3d	video memory	0x0fe01400	0x00007fff
0x66616661	DMA Class 0x02	video memory	0x0fe00000	0x00000fff
0x66606660	DMA Class 0x3d	video memory	0x0fe00000	0x00000fff
0xfeed0001	DMA Class 0x3d	system memory	0x80000000	0x0fffffff
0xfeed0000	DMA Class 0x3d	video memory	0x00000000	0xffffffff
0x66604200	DMA Notifier	video memory	0x0fe01000	0x0000003f
0x66604201	DMA Notifier	video memory	0x0fe01040	0x0000003f
0x66604202	DMA Notifier	video memory	0x0fe01080	0x0000003f
0x66604203	DMA Notifier	video memory	0x0fe010c0	0x0000003f
0x66604204	DMA Notifier	video memory	0x0fe01100	0x0000003f
0x66604205	DMA Notifier	video memory	0x0fe01140	0x0000003f
0x66604206	DMA Notifier	video memory	0x0fe01180	0x0000003f
0x66604207	DMA Notifier	video memory	0x0fe011c0	0x0000003f
0x66604208	DMA Notifier	video memory	0x0fe01200	0x0000003f
0x66604209	DMA Notifier	video memory	0x0fe01240	0x0000003f
0x6660420a	DMA Notifier	video memory	0x0fe01280	0x0000003f
0x6660420b	DMA Notifier	video memory	0x0fe012c0	0x0000003f
0x6660420c	DMA Notifier	video memory	0x0fe01300	0x0000003f
0x6660420d	DMA Notifier	video memory	0x0fe01340	0x0000003f
0x6660420e	DMA Notifier	video memory	0x0fe01380	0x0000003f
0x6660420f	DMA Notifier	video memory	0x0fe013c0	0x0000003f

Note that all DMA notifiers target video RAM at address 0x0fe01000 and up. The lpar_reports and lpar_reports_size return values of lv1_gpu_context_allocate define a region accessible from the Cell which corresponds to VRAM address 0x0fe00000. Therefore, the DMA notifiers are available from the Cell by ioremapping the lpar_reports address and can be used to monitor the progress of GPU operations. Upon completion of a GPU operation for which notification is enabled, the following 128-bit value is written at the DMA notifier target address:

DMA notifier data
dword 0	dword 1	dword 2	dword 3
timestamp MSB	timestamp LSB	return value	error code/state
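
Putting the notifier addresses and the 128-bit layout together, a notifier can be located and decoded inside the mapped reports region roughly as follows. This is a Python sketch under stated assumptions: `reports` is a stand-in buffer for the ioremapped lpar_reports area, the dwords are assumed to be read big-endian from the Cell side, and the function names are hypothetical.

```python
import struct

REPORTS_VRAM_BASE = 0x0FE00000   # VRAM address that lpar_reports maps to
NOTIFIER_VRAM_BASE = 0x0FE01000  # notifier 0 (handle 0x66604200)
NOTIFIER_STRIDE = 0x40           # spacing between consecutive notifiers


def notifier_offset(index):
    """Offset of notifier `index` (0-15) inside the reports mapping."""
    if not 0 <= index < 16:
        raise ValueError("only 16 notifiers are listed")
    return NOTIFIER_VRAM_BASE - REPORTS_VRAM_BASE + index * NOTIFIER_STRIDE


def parse_notifier(reports, index):
    """Decode the four dwords written by the GPU on completion."""
    ts_msb, ts_lsb, ret, status = struct.unpack_from(">4I", reports, notifier_offset(index))
    return {
        "timestamp": (ts_msb << 32) | ts_lsb,
        "return_value": ret,
        "status": status,
    }


# Simulated mapping with made-up notifier contents for notifier 2.
reports = bytearray(0x2000)
struct.pack_into(">4I", reports, notifier_offset(2), 0x1, 0x2, 0x0, 0x0)
parsed = parse_notifier(reports, 2)
```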

Subchannel bindings

By calling lv1_gpu_context_attribute(FB_SETUP), some of the graphic objects are bound to the following subchannels:

subchannel	object handle	description
1	0x31337303	DMA upload
2	0x3137c0de	DMA download
3	0x313371c3	display surface format
4	0x31337a73	texture
5	0x31337808	image from FIFO
6	0x3137af00	blit

The hypervisor sends commands to subchannels 3 and 6 using the FIFO to perform the lv1_gpu_context_attribute(FB_BLIT) call.

Linux framebuffer driver

The Linux framebuffer resides in system memory (XDR). Upon initialization, a GPU context is created using the lv1_gpu_memory_allocate and lv1_gpu_context_allocate calls. Then the initial objects are bound to FIFO subchannels using the lv1_gpu_context_attribute(FB_SETUP) call.

Upon each vsync interrupt, the framebuffer in system memory is blitted to video memory using the lv1_gpu_context_attribute(FB_BLIT) call. This call is performed by the Hypervisor using the Scaled Image From Memory object already bound to subchannel 6 by the lv1_gpu_context_attribute(FB_SETUP) initialization code. The destination has also already been initialized to use the 0xfeed0000 DMA object corresponding to video memory. The source is set for each FB_BLIT call to 0xfeed0001, corresponding to system memory. Since the Cell FlexIO interface controls all transactions from any external IO device to system RAM, a window to system memory is opened for the GPU during initialization using the lv1_gpu_context_iomap call. The DMA notify object 0x66604200 is attached to the blit operation so that the Hypervisor can wait for the operation to finish by polling the notifier value at address 0x0fe01000 in video memory. However, the Hypervisor can be instructed not to wait for blit completion by removing the L1GPU_FB_BLIT_WAIT_FOR_COMPLETION flag from the FB_BLIT call.

Workarounds

FIFO workaround

The hack consists of asking the Hypervisor to return without waiting for a blit to end. After the Hypervisor returns, there is a short window during which the FIFO or the FIFO registers can be modified before the GPU has finished reading the commands. This window exists when a large blit is decomposed into many smaller 1024×1024 blits by the Hypervisor. The last operation pushed to the FIFO by the Hypervisor is a wait for the GPU engine to go idle. By skipping this operation, it is possible to enqueue more commands to the FIFO for the GPU to execute. So the hack consists of either patching the last operation with a NOP, or changing the FIFO write pointer to stop earlier.

Upper VRAM workaround

Once arbitrary commands can be sent to the FIFO using the FIFO workaround, it is possible to enqueue a blit command from video RAM to video RAM. This command can be used to copy the end of video memory to lower regions of the video memory, where it can be accessed directly from the Cell (direct access to memory above 252MB from the Cell is blocked by the Hypervisor). As this memory contains the RAMIN, RAMHT, RAMFC and GRAPH information, it can be analyzed to observe what the GPU-related Hypervisor calls do. Changing the direction of the blit (low video RAM to end of video RAM) allows a program outside of the Hypervisor to write to this reserved region (although the destination address must be 256-byte aligned due to a restriction in the GPU blitter). To achieve a single unaligned write, it is possible to read a large block, modify the desired value, and then write the entire block back.
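
The read-modify-write trick for unaligned writes can be sketched as follows. This Python model only shows the address arithmetic: `read_block` and `write_block` are hypothetical callbacks standing in for the two VRAM-to-VRAM blits, and the demo runs against a plain buffer instead of real video memory.

```python
ALIGN = 256  # destination alignment required by the blitter


def aligned_span(addr, length):
    """Return (start, size) of the smallest aligned block covering the range."""
    start = addr & ~(ALIGN - 1)
    end = (addr + length + ALIGN - 1) & ~(ALIGN - 1)
    return start, end - start


def patch_unaligned(read_block, write_block, addr, data):
    """Fetch the covering aligned block, patch it, and write it back whole."""
    start, size = aligned_span(addr, len(data))
    block = bytearray(read_block(start, size))
    block[addr - start:addr - start + len(data)] = data
    write_block(start, bytes(block))


# Demo against a buffer standing in for the reserved VRAM region.
fake_vram = bytearray(1024)


def demo_read(start, size):
    return fake_vram[start:start + size]


def demo_write(start, block):
    if start % ALIGN:
        raise ValueError("blitter destination must be 256-byte aligned")
    fake_vram[start:start + len(block)] = block


patch_unaligned(demo_read, demo_write, 0x1FE, b"\xde\xad\xbe\xef")
```

Note that the block read back may be modified by the GPU or the Hypervisor between the read and the write; the sketch ignores that race.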

However, the first parameter to the lv1_gpu_memory_allocate call sets a limit on the video RAM zone which can be DMA’d from/to. Setting this value to zero amounts to ‘no limit’. This is probably due to a missing check in the Hypervisor, as the size of a DMA object is specified using a limit, equal to the size minus 1. Thus a size of zero is actually interpreted as a limit of 2^32-1 bytes (~4GB) by the GPU.
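
The wrap-around is easy to reproduce with 32-bit arithmetic. The Python sketch below assumes the limit is computed as an unsigned 32-bit subtraction, which is what the observed behaviour suggests; the function name is illustrative.

```python
def dma_limit(size):
    """Limit field of a DMA object: size minus 1, as an unsigned 32-bit value."""
    return (size - 1) & 0xFFFFFFFF


print(hex(dma_limit(0x1000)))  # 0xfff, as in the DMA object table
print(hex(dma_limit(0x40)))    # 0x3f, the limit of each DMA notifier
print(hex(dma_limit(0)))       # 0xffffffff, i.e. no effective limit
```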

At first glance this would appear to be a security hole. But in actuality it is a feature Sony knows about and has used in official code: size = 0 is the default value in the Linux framebuffer driver, and were this issue properly resolved (0 = no RAM, no access) the official driver would be unable to work.

[Discussion: This is not strictly true - Sony could easily make the hypervisor interpret 0 as 252MB in a later release, which would close the hole but still preserve compatibility.]

The 2.10 firmware released on 2007-12-18 does appear to have changed this call so that 0 is now rejected.

This patch http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3fb-fix-gpu-cmd-buff-size-2.10-linux-2.6.23-20071023.diff has an interesting quote - “As of PS3 firmware version 2.10, the GPU command buffer size must be at least 2 MiB large”... why would they increase the minimum size unless they were expecting to fill it with a lot of data? Maybe there is some support for 3D now after all...

Enabling 3D

This section is based on code found here.

The program needs Glaurung's kernel patch applied to run. The output of the program is a triangle rendered by RSX to the display.

Technically, this program has two parts:

3D Class Instancing

Inserting a 3D class into the GPU control area: I used user handle 0xfeed0003, so the <key,value> pair is placed in RAMHT just after handles 0xfeed0001 and 0xfeed0000. The value is constructed from the channel id and the offset of an area in RAMIN unused by the Hypervisor. This area was filled with the object’s instance data. The binary layout of the object was:

offset	data	description
0	0x00004097	nvidia chip version, 3D engine
1	0x00000000	notifier, will be filled by FIFO
2	0x01000000	endianness
3	0x31337a73	unused ?
4	0x31337a73	unused ?
5	0x3137af00	unused ?
6	0x31337303	unused ?
7	0x3137c0de	unused ?

[Discussion: Are values 3-7 a bug? I filled them with zeroes. Peter.]

A 3D object was created this way and can be attached to a subchannel.
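
For illustration, the eight instance dwords from the table can be serialized like this. The Python sketch assumes the little-endian RAMIN byte order described in the endianness section of this page; `build_instance` is a hypothetical helper, and writing the bytes into RAMIN still requires the upper VRAM workaround.

```python
import struct

# Dwords 0-7 exactly as listed in the table above.
INSTANCE_DWORDS = [
    0x00004097,  # class: chip version 0x40, 3D engine 0x97
    0x00000000,  # notifier, filled in later through the FIFO
    0x01000000,  # endianness flag
    0x31337A73,  # unused?
    0x31337A73,  # unused?
    0x3137AF00,  # unused?
    0x31337303,  # unused?
    0x3137C0DE,  # unused?
]


def build_instance(dwords):
    """Serialize instance dwords as little-endian bytes (assumed RAMIN order)."""
    return struct.pack("<%dI" % len(dwords), *dwords)


instance = build_instance(INSTANCE_DWORDS)
```

Following the discussion note above, dwords 3 to 7 could equally be replaced with zeroes.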

Rendering

The following steps are performed to draw a triangle on the RSX.

Set up DMA sources to perform DMA operations. I used only video memory, referred to as the 0xfeed0000 object.
Send some initialization code, found in nouveau. What this code does is unknown; it is marked in the source as “voodoo.”
Set up render states: stencil, alpha, Z tests, blending, fill and color op modes. Some states are missing, like fog.
Set up the render target. We need to set up scissor, viewport, pitch and surface format.
Set up the Z buffer.

Pixel (fragment) programs are stored in video memory and must be referenced by address. We need to tell the RSX the number of temporary registers used by the program. The data for a fragprog with a big-endian Class 3D must be swizzled by words (the two halves of each dword must be swapped).

Vertex programs are stored in cache memory inside RSX and are referenced by handles. There are two registers, vp_in and vp_out, filled by some bitfields. Each entry (e.g., vertex position, color, texcoordN) has a bit in this mask. My vertex program has position as input and position as output. Position must be written in the vertex program, so there is no bit that needs to be set.

Some noise texture is created in memory. It has A8R8G8B8 linear format without mips. Do not use this format in your production code. It is better to use mipmapped swizzled DXT textures.

Once setup is complete, we can send data to the GPU. We send the triangle in a glBegin()...glEnd() manner. A more complete program would use vertex and index buffers.

Endianness and fragment programs

RAMIN data is stored in little-endian format (the same as x86 processors use), so this format seems to be “native” for the RSX chip. PPC is a big-endian processor.

There is an endianness flag in a context object. The Hypervisor’s objects (the context surface, for example) have this flag set. These objects are big-endian.

We can set this flag OFF for the 3D class. Fragment programs do work, but the byte order is incompatible with the context surface: the image is visually swizzled from ARGB to BGRA. This mode is also incompatible with PPC endianness.

We can set this flag ON for the 3D class. Fragment programs do not work with big-endian microcode order: we get black rendering. The solution is to change the byte ordering in the fragment program data. The data for a fragprog with a big-endian Class 3D must be swizzled by words (the two halves of each dword must be swapped).
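
The word swizzle itself is a swap of the two 16-bit halves of every 32-bit dword. A minimal Python sketch follows; the function names are illustrative and the sample dwords are placeholders, not real fragment program microcode.

```python
def swap_halves(dword):
    """Exchange the upper and lower 16 bits of a 32-bit value."""
    return ((dword << 16) | (dword >> 16)) & 0xFFFFFFFF


def swizzle_fragprog(dwords):
    """Apply the half-word swap to every dword of a fragment program."""
    return [swap_halves(d) for d in dwords]


# Placeholder data standing in for fragment program microcode.
program = [0x11223344, 0xAABBCCDD, 0x00000001]
swizzled = swizzle_fragprog(program)
```

Applying the swizzle twice returns the original data, so the same routine can be used to inspect a program read back from video memory.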

Known issues

TILE and ZCOMP setup

This setup cannot be done through the FIFO interface. We think this is a performance issue.
