for draw in renderPass:
    for primitive in draw:
        for vertex in primitive:
            execute_vertex_shader(vertex)
        for fragment in primitive:
            execute_fragment_shader(fragment)
Tile
# Pass one
for draw in renderPass:
    for primitive in draw:
        for vertex in primitive:
            execute_vertex_shader(vertex)
        append_tile_list(primitive)

# Pass two
for tile in renderPass:
    for primitive in tile:
        for fragment in primitive:
            execute_fragment_shader(fragment)
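To make the first pass concrete, here is a minimal sketch of what append_tile_list might do, assuming 16x16 pixel tiles and simple bounding-box binning; real tilers use more exact coverage tests, so this is illustrative only.

```python
# Minimal sketch of bounding-box binning for a tile-based first pass.
# Assumes 16x16 pixel tiles; real hardware uses tighter coverage tests.

TILE_SIZE = 16

def append_tile_list(primitive, tile_lists, screen_w, screen_h):
    """Add a primitive to the list of every tile its bounding box touches."""
    xs = [v[0] for v in primitive]
    ys = [v[1] for v in primitive]
    x0 = max(int(min(xs)) // TILE_SIZE, 0)
    y0 = max(int(min(ys)) // TILE_SIZE, 0)
    x1 = min(int(max(xs)) // TILE_SIZE, screen_w // TILE_SIZE - 1)
    y1 = min(int(max(ys)) // TILE_SIZE, screen_h // TILE_SIZE - 1)
    for ty in range(y0, y1 + 1):
        for tx in range(x0, x1 + 1):
            tile_lists.setdefault((tx, ty), []).append(primitive)

tiles = {}
triangle = [(5, 5), (40, 10), (20, 30)]   # screen-space vertex positions
append_tile_list(triangle, tiles, 256, 256)
print(sorted(tiles))   # → [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
```

The second pass then walks each tile's list, shading only the fragments of primitives that actually touch that tile.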
Advantages
The main advantage of the immediate mode approach is that the output of the vertex shader (and other geometry-related shaders) can remain on-chip inside the GPU. The output of these shaders can simply be stored in a FIFO until the next stage in the pipeline is ready to consume the data, which means that little external memory bandwidth is spent storing and retrieving intermediate geometry state.
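The hand-over can be pictured as a small producer-consumer queue; the sketch below models it in Python, with the FIFO depth and the shader stand-ins purely illustrative.

```python
# Sketch of the on-chip hand-over in an immediate mode pipeline: vertex
# shader outputs are queued in a small FIFO and consumed directly by the
# next stage, so intermediate geometry never touches external memory.
from collections import deque

fifo = deque(maxlen=64)            # small on-chip buffer (depth assumed)

def vertex_stage(vertices):
    for v in vertices:
        fifo.append(v * 2.0)       # stand-in for execute_vertex_shader

def next_stage():
    results = []
    while fifo:
        results.append(fifo.popleft())   # consumed straight from the FIFO
    return results

vertex_stage([1.0, 2.0, 3.0])
print(next_stage())                # → [2.0, 4.0, 6.0]
```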
Disadvantages
The major downside of the immediate mode approach is that any triangle in the stream may cover any part of the screen. This means that the framebuffer working set which must be maintained is large: typically a full-screen color buffer and depth buffer, and possibly a stencil buffer too. A framebuffer for a modern device will usually be 32 bits-per-pixel (bpp) color, plus 32bpp packed depth/stencil. A 1440p smartphone therefore has a working set of around 30MB, which is far too large to keep on chip, so it must be stored off-chip in DRAM.
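The 30MB figure is easy to check with a quick calculation, assuming a 2560x1440 panel and no multisampling:

```python
# Working set for a 1440p framebuffer: 32bpp color + 32bpp packed
# depth/stencil = 8 bytes per pixel. Assumes 2560x1440, no MSAA.
width, height = 2560, 1440
bytes_per_pixel = 4 + 4          # color + depth/stencil
working_set = width * height * bytes_per_pixel
print(working_set / 1e6)         # → 29.4912 (MB), i.e. roughly 30MB
```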
Every blending, depth testing, and stencil testing operation requires the current value of the data for the current fragment's pixel coordinate to be fetched from this working set. All fragments shaded will typically touch this working set, so at high resolutions the bandwidth load placed on this memory can be exceptionally high, with multiple read-modify-write operations per fragment, although caching can mitigate this slightly. This need for high bandwidth access in turn drives the need for a wide memory interface with lots of pins, as well as specialized high-frequency DRAM devices, both of which result in external memory accesses which are particularly energy intensive. For mobile and embedded electronics where battery life and passive cooling are important design requirements, this bandwidth to off-chip memory is a significant overall cost.
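The scale of that bandwidth load can be sketched with some back-of-the-envelope numbers; the overdraw factor and per-fragment byte counts below are assumptions chosen for illustration, not measured figures.

```python
# Illustrative estimate of framebuffer traffic for an immediate mode GPU.
# Overdraw factor and per-fragment byte counts are assumed for the example.
width, height, fps = 2560, 1440, 60
overdraw = 2.0                   # average fragments shaded per pixel
# Per fragment: depth read + depth write + color write (4 bytes each),
# ignoring caches and any color reads for blending.
bytes_per_fragment = 4 + 4 + 4
traffic = width * height * overdraw * bytes_per_fragment * fps
print(traffic / 1e9)             # → ~5.3 (GB/s of framebuffer traffic alone)
```

Even with generous caching, traffic of this order has to be served by the external memory system, which is why the pin count and DRAM frequency requirements follow.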
Tile
Advantages: Bandwidth
The main advantage of tile-based rendering is that a tile is only a small fraction of the total screen area, so it is possible to keep the entire framebuffer working set (color, depth, and stencil) in a fast on-chip RAM which is tightly coupled to the GPU shader core. The intermediate framebuffer states needed for depth/stencil testing and for blending transparent fragments are therefore readily available without needing an external memory access. Reducing the number of external memory accesses needed for common framebuffer operations makes fragment-heavy content significantly more energy efficient.
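The contrast with the immediate mode working set is easy to quantify; assuming a 16x16 pixel tile (the tile size is an assumption here) with the same 8 bytes per pixel as before:

```python
# On-chip memory needed per tile, assuming a 16x16 pixel tile with
# 32bpp color and 32bpp packed depth/stencil (no MSAA).
tile_w, tile_h = 16, 16
bytes_per_pixel = 4 + 4
tile_working_set = tile_w * tile_h * bytes_per_pixel
print(tile_working_set)          # → 2048 (bytes, vs ~30MB for full screen)
```

A few kilobytes of tightly coupled RAM is trivially cheap to place next to each shader core, which is what makes the tile-local framebuffer practical.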
In addition, a significant proportion of content has a depth and stencil buffer which is transient and only needs to exist for the duration of a single render pass. If developers tell the Mali drivers that depth and stencil buffers do not need to be preserved[1] – ideally via a call to glDiscardFramebufferEXT (OpenGL ES 2.0), glInvalidateFramebuffer (OpenGL ES 3.0), or using appropriate storeOp settings (Vulkan) – then the depth and stencil attachments are never written back to main memory at all.
Further framebuffer bandwidth saving optimizations are possible because Mali only has to write the color data for a tile back to memory once it is complete, at which point we know its final state. We can compare the content of a tile with the current data already in main memory via a CRC check – a process called Transaction Elimination – skipping the tile write to external memory completely if the tile contents are the same. This doesn't help performance in most situations – the fragment shaders still have to run to build the tile content – but it will reduce the external memory bandwidth considerably for many common use cases, such as UI rendering and casual gaming where screen regions will be unchanged across multiple frames, and therefore reduce system power consumption.
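The Transaction Elimination logic can be sketched as a signature compare on write-back; in this illustrative model zlib.crc32 stands in for whatever signature scheme the hardware actually uses.

```python
# Sketch of Transaction Elimination: skip the tile write-back when the
# tile's signature matches the one recorded at the previous write.
# zlib.crc32 is a stand-in for the hardware's actual signature scheme.
import zlib

signatures = {}                  # (tile_x, tile_y) -> CRC of last written tile

def write_back_tile(coord, tile_bytes, dram):
    crc = zlib.crc32(tile_bytes)
    if signatures.get(coord) == crc:
        return False             # unchanged: eliminate the memory transaction
    signatures[coord] = crc
    dram[coord] = tile_bytes     # changed: pay for the external write
    return True

dram = {}
tile = bytes(2048)               # an all-black 16x16 tile, 8 bytes/pixel
print(write_back_tile((0, 0), tile, dram))   # → True (first write)
print(write_back_tile((0, 0), tile, dram))   # → False (write eliminated)
```

For static UI regions the second case dominates, which is where the bandwidth saving comes from.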
In addition we can also compress the color data for the tiles which are written out using a lossless compression scheme called ARM Frame Buffer Compression (AFBC), which allows us to lower the bandwidth and power consumed even further. This compression can be applied to render-to-texture outputs, which can be read back as textures by the GPU in subsequent render passes, as well as the main window surface, provided there is an AFBC compatible display controller such as Mali-DP650 in the system. Framebuffer compression therefore saves bandwidth multiple times; once on write out from the GPU and once each time that framebuffer is read by another processor.
Advantages: Algorithms
In addition to the basic bandwidth saving for framebuffer related operations, tile-based renderers also enable some algorithms which would otherwise be too expensive.
A tile is sufficiently small that Mali can store enough samples locally in the tile memory to allow multi-sample anti-aliasing[2], and the hardware can resolve the MSAA samples to a single pixel color during tile writeback to external memory without needing a separate resolve pass. This allows very low overhead anti-aliasing, both in terms of shading performance overhead and bandwidth cost.
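The resolve step itself is simple; a minimal sketch for 4xMSAA, assuming a plain box filter (an average of the samples), looks like this:

```python
# Sketch of resolving 4xMSAA samples to one pixel during tile write-back.
# Each pixel's color is the average of its samples; the samples only ever
# exist in on-chip tile memory, so no resolve pass touches external DRAM.

def resolve_pixel(samples):
    """Average a list of RGBA sample tuples into a single pixel color."""
    n = len(samples)
    return tuple(sum(ch) / n for ch in zip(*samples))

samples = [(1.0, 0.0, 0.0, 1.0),   # two red samples
           (1.0, 0.0, 0.0, 1.0),
           (0.0, 0.0, 1.0, 1.0),   # two blue samples (an edge pixel)
           (0.0, 0.0, 1.0, 1.0)]
print(resolve_pixel(samples))      # → (0.5, 0.0, 0.5, 1.0)
```

Because only the resolved color leaves the chip, the bandwidth cost of 4xMSAA stays close to that of non-antialiased rendering.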
Some more advanced techniques, such as deferred lighting, can benefit from fragment shaders being able to programmatically access the current value stored in the framebuffer by previous fragments. Traditional algorithms might execute in multiple passes, first rendering to a texture in main memory to create the deferred lighting geometry buffer, and then reading that as a texture in a second render pass, at the cost of perhaps 128bpp of bandwidth per G-Buffer read and write. Tile-based renderers can enable lower bandwidth approaches where intermediate per-pixel data is shared directly from the tile-memory, and only the final lit pixels are written back to memory. This functionality is exposed using extensions for OpenGL ES[3], or via the subpass feature in Vulkan.
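The bandwidth saving can be put in rough numbers; the 128bpp G-buffer figure is from the text above, while the resolution and frame rate below are assumptions for illustration.

```python
# Illustrative bandwidth cost of a multi-pass deferred G-buffer at 1440p,
# versus keeping the intermediate per-pixel data in tile memory.
# The 128bpp G-buffer figure is from the text; frame rate is assumed.
width, height, fps = 2560, 1440, 60
gbuffer_bytes_per_pixel = 128 // 8        # 128bpp = 16 bytes
# Multi-pass: write the G-buffer out, then read it back in the lighting pass.
multipass = width * height * gbuffer_bytes_per_pixel * 2 * fps
# Tile-local: intermediate data stays on-chip; only final 32bpp color leaves.
tile_based = width * height * 4 * fps
print(multipass / 1e9, tile_based / 1e9)  # → ~7.1 vs ~0.9 (GB/s)
```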
Disadvantages
It is clear from the sections above that tile-based rendering carries a number of advantages, in particular giving very significant reductions in the bandwidth and power associated with framebuffer data, as well as being able to provide low-cost anti-aliasing. Nothing ever comes for free, so what is the downside?
The principal additional overhead of any tile-based rendering scheme is the point of hand-over from the geometry processing to the fragment processing. The output of the geometry processing stage – the per-vertex varying data and tiler intermediate state – must be written out to main memory and then subsequently read by the fragment processing stage. There is therefore a balance to be struck between the extra bandwidth costs related to geometry, and the bandwidth savings for the framebuffer data.
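That balance can also be estimated; the vertex count and per-vertex varying size below are assumptions picked purely to illustrate the shape of the trade-off.

```python
# Illustrative cost of the geometry hand-over in a tile-based design.
# Vertex count and per-vertex varying size are assumed for the example.
vertices = 500_000                 # vertices submitted in the frame
bytes_per_vertex = 32              # position + packed varyings (assumed)
# Varyings are written once by vertex processing and read once by fragment
# processing (ignoring re-reads when a primitive spans multiple tiles).
geometry_traffic = vertices * bytes_per_vertex * 2
print(geometry_traffic / 1e6)      # → 32.0 (MB per frame)
```

For typical content this geometry traffic is far smaller than the framebuffer traffic it replaces, but geometry-heavy workloads can tip the balance the other way.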
It is also important for developers to note that some rendering operations, such as tessellation, are disproportionately expensive on a tile-based architecture; they were designed around the strengths of the immediate mode architecture, where the explosion in geometry data can be buffered inside the on-chip FIFO rather than written back to main memory.