The goal of this technical article is to give you insight into one of the cornerstones of PopcornFX: its scripting system, and to shed some light on how we manage to keep performance high.
Before we start diving into the technical details of the PopcornFX script execution and memory model, here is a quick history of how the scripts evolved through time:
PopcornFX scripting history
First version, v0.1 (2008)
Node-based, part of a larger in-house game engine cheesily named “HellHeaven”.
- Scripting seems like the natural solution to make a powerful and versatile FX system, that doesn’t limit artists to predefined hardcoded sliders and properties.
- Scripts are at first a simple expression that takes any number of nodal input “pins”, referenced in the expression as “$1”, “$2”, “$3”, …, “$n”, and evaluates to a single (vector) value, with no function structure. This is when the first version of the parallel vector VM that executes scripts is written.
The “Blast.pkfx” effect, in nodal view
After halting development for two years, the decision is made in 2010 to turn it into an FX middleware, named “HellHeavenFX”.
v1.0 is released (2011)
It keeps the nodal representation under the hood, but displays it as a simpler treeview to hide the underlying complexity; all subsequent 1.x versions keep working this way.
The same “Blast.pkfx” effect, displayed as a treeview in v1.0
v1.5.4 is released (2013)
“HellHeavenFX” is rebranded to “PopcornFX”. Gives us an excuse to stuff our faces with popcorn every time there’s a party at the office!
v1.9 is released (2015)
It introduces a GPU backend in addition to the old (2008) CPU backend: we find out that making popcorn run on the GPU is actually a piece of cake, thanks to the initial design.
- Initial GPU simulator written from scratch by one developer in about two weeks, then iterated on and tweaked from time to time over the following months.
- Although unintended when initially designed back in 2008, the script execution model translates naturally to the way current-gen (2015) GPUs work, and performance is excellent from the start.
On the left, “Blast.pkfx” again; on the right, an effect running on the GPU.
Future (2017)
v2.0 goes back to a nodal view, quite different from the original 2008 nodal system: nodes and script nodes will be merged together under the hood into a unified script. It condenses all we’ve learned from more than 8 years of almost exclusively working and experimenting on particle effects, as well as community feedback.
Different approaches for execution
A PopcornFX script is usually a simple function that reads and writes particle properties, and performs a bunch of computations. Like shaders, where you manipulate a single pixel or vertex, PopcornFX scripts let you manipulate a single particle. However, we usually don’t have a single particle to run the script on; we have many.
Here’s an example of a simple script that scales a ‘float4 Color’ particle-field by a scalar ‘float Brightness’ particle-field, and preserves the alpha of the original color:
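The script itself appears in the original article as an editor screenshot. As a rough C++ sketch of what it computes for a single particle (the float4 type and function name here are illustrative, not part of the PopcornFX API):

// What the script computes for one particle, expressed in C++ (illustrative):
struct float4 { float x, y, z, w; };

float4 ScaleColor(float4 color, float brightness)
{
    // scale rgb by brightness; multiplying alpha by 1.0 preserves it,
    // which is exactly what the compiled swizzle + multiply below do
    return { color.x * brightness,
             color.y * brightness,
             color.z * brightness,
             color.w * 1.0f };
}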
This script is compiled to the following low-level instructions:
Namely:
- two loads to load ‘Color’ and ‘Brightness’ into temporary registers (a scratch memory area that’s used to store intermediate computation results)
- a swizzle to broadcast ‘float Brightness’ to a float4 with the alpha component set to 1.0
- a float4 multiplication to scale the color by the broadcasted brightness
- a store to store the color back to the particle storage.
There are a few ways we could approach the problem of running that script on many particles, some more efficient than others:
SISD execution model
SISD stands for Single Instruction, Single Data.
This is the execution model used by regular CPU instructions.
If we transpose this to particles: for each particle, we run the whole script on that particle’s data. Each instruction processes single values (ex: multiplying a color by a brightness); see the sketch below.
- Pros: simple
- Cons: Excessive per-particle overhead when executing the script, poor usage of the CPU’s execution pipeline and resources. Slow.
SISD execution from the first particle (P0) to the last (P1023)
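In C++ terms, SISD execution boils down to a scalar loop over every particle, with the script’s operations applied one value at a time; a minimal sketch, with illustrative types and layout:

// SISD sketch: one particle at a time, one scalar operation at a time
// (the per-particle VM dispatch overhead is not shown).
struct float4 { float x, y, z, w; };

void RunScript_SISD(float4 *colors, const float *brightness, int count)
{
    for (int i = 0; i < count; ++i)
    {
        colors[i].x *= brightness[i];
        colors[i].y *= brightness[i];
        colors[i].z *= brightness[i];
        // alpha is preserved
    }
}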
Classic SIMD execution model
SIMD stands for Single Instruction, Multiple Data.
In this model, we run the script on blocks of 4 or 8 particles to take advantage of the CPU’s SIMD vector instructions. Each instruction processes 4 particles if running on an SSE-enabled CPU, 8 particles on AVX-enabled CPUs.
- Pros: better usage of CPU resources
- Cons: Still a lot of per-particle overhead
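For the example script, the core of the work could look like this with SSE, assuming a pure-SOA layout of the color lanes and a particle count that is a multiple of 4 (all names are illustrative):

#include <xmmintrin.h> // SSE intrinsics

// Classic SIMD sketch: 4 particles per iteration. The script is still
// interpreted block by block, so per-block VM overhead remains.
void ScaleColors_SSE(float *r, float *g, float *b,
                     const float *brightness, int count)
{
    for (int i = 0; i < count; i += 4)
    {
        const __m128 br = _mm_loadu_ps(brightness + i);
        _mm_storeu_ps(r + i, _mm_mul_ps(_mm_loadu_ps(r + i), br));
        _mm_storeu_ps(g + i, _mm_mul_ps(_mm_loadu_ps(g + i), br));
        _mm_storeu_ps(b + i, _mm_mul_ps(_mm_loadu_ps(b + i), br));
        // alpha lane left untouched
    }
}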
Streamed SIMD execution model
Run each instruction on all particles, then jump to the next instruction.
Initial model of v0.1 (2008)
- Pros: good usage of CPU resources, the code that executes each instruction can be efficiently unrolled and use SIMD instructions, very low per-instruction VM overhead.
- Cons: For large numbers of particles, it thrashes the CPU caches, and while it still gives better performance than any of the above, it is not optimal.
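A sketch of the inverted loop structure, with illustrative types: each instruction is decoded once, then a tight inner loop sweeps the whole particle stream:

// Streamed SIMD sketch: the instruction loop is now the OUTER loop, so
// per-instruction VM overhead is paid once per instruction, not once
// per particle or per block.
struct Instruction
{
    void (*Execute)(int first, int count); // unrolled SIMD kernel
};

void RunScript_Streamed(const Instruction *script, int instructionCount,
                        int particleCount)
{
    for (int op = 0; op < instructionCount; ++op)
        script[op].Execute(0, particleCount); // one sweep over all particles
}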
Page-based streamed SIMD execution model
Split the particle storage into smaller, fixed-size chunks so that, on average, all the data needed by the script fits into the CPU caches, then run each instruction on all the particles of a chunk (see the sketch after the list below). Model of v0.2 (2008), still used today.
- Pros: very efficient, naturally parallelizable on worker threads. Solves all the performance issues of the previous methods.
- Cons: dynamic flow control (if/else/loops) is harder to implement, and not yet supported.
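A sketch of the paged variant, reusing the Instruction type from the previous sketch (kPageSize is illustrative; actual page sizes vary):

// Page-based streamed sketch: the instruction sweep runs page by page,
// so the working set stays cache-resident. Pages are independent, which
// is what makes this model easy to dispatch to worker threads.
const int kPageSize = 1024;

void RunScript_Paged(const Instruction *script, int instructionCount,
                     int particleCount)
{
    for (int base = 0; base < particleCount; base += kPageSize)
    {
        const int count = (particleCount - base < kPageSize)
                        ? (particleCount - base) : kPageSize;
        for (int op = 0; op < instructionCount; ++op)
            script[op].Execute(base, count);
    }
}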
Different approaches for storage
The actual data layout of the particles is extremely important for performance. There are also various ways to store the particle data:
Naive “object oriented bullshit” way
One C++ class/object per particle, with virtual update and render methods.
- Pros: Relatively easy to add / remove particles, simple to code.
- Cons: This is probably the worst possible way to implement a particle system. Don’t laugh, I mention this because I’ve seen it actually used in production a couple of years ago, in two games that had their own in-house particle engine (and, unsurprisingly, particle-related performance issues). One of them used a linked-list of particles (!), the other one used an array of pointers to those objects. Please don’t do that.
AOS (Array Of Structures)
Each particle is a tightly packed data structure containing position, size, life, velocity, etc. The update and render functions loop over this array and process the data for each particle.
- Pros: Efficient if running the whole script sequentially for each particle
- Cons: Inefficient if running a single script instruction for all particles: loads can’t be coalesced together with SIMD instructions (except on the rare platforms that support strided/gather loads), and the CPU pays the cost of touching all cache-lines even if some particle data is “colder” than the rest (“cold” data meaning it isn’t needed as often as “hot” data). It is also harder to make flexible code-wise: usually, either a big fat “uber-particle” structure is used, containing all the data needed by all particle types in the game, which makes the cache issues worse due to even more cold data; or the code needs to handle multiple structures for different particle types, which leads to code duplication, and in turn to maintainability issues, bugs, and potentially sub-optimal instruction-cache usage.
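For illustration, an AOS layout could look like this in C++ (the field set is made up):

// AOS sketch: one packed struct per particle, stored in a flat array.
struct Particle
{
    float position[3];
    float velocity[3];
    float color[4];
    float size;
    float life;
};

Particle particles[1024]; // update/render walk this array particle by particle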
SOA (Structure Of Arrays)
Each particle has a list of “properties” (which we call “particle fields” in PopcornFX), like “Position”, “Size”, “Velocity”, “Color”, etc. Each of these is a tightly packed array of values in memory. For example, the “Size” field of 1024 particles will be an array of 1024 * sizeof(float) = 4096 bytes, containing the 1024 sizes of the 1024 particles. Vector fields like “Position” have their x, y, and z components stored in 3 separate arrays: xxxxx yyyyy zzzzz
- Pros: Efficient with the stream execution model, allows aggressive unrolling and very efficient data processing.
- Cons: When reading/writing all the lanes of a multi-lane field like “Position”, the use of more x86 load/store buffers and write-combine buffers might affect performance.
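The same made-up field set in pure SOA form, with one array per field and one array per vector lane:

// SOA sketch: xxxxx yyyyy zzzzz, each lane tightly packed and
// independently streamable.
struct ParticleStream
{
    float positionX[1024], positionY[1024], positionZ[1024];
    float velocityX[1024], velocityY[1024], velocityZ[1024];
    float size[1024];
    float life[1024];
};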
Hybrid SOA / AOS
Also sometimes dubbed “SOAOS”. Here, vector lanes are not split into different arrays: the float3s are stored contiguously in memory, in an xyzxyzxyzxyzxyz fashion, instead of xxxxx yyyyy zzzzz like in pure SOA.
Initial model of v0.1 (2008), still used today.
- Pros: makes parts of the code easier to maintain. It might also help the cache when loading many data streams in parallel. Operations that do not require knowledge of the different lanes can treat the whole stream as a flat array of independent values (ex: multiplication, addition, etc.), which makes them slightly more efficient than processing three separate arrays.
- Cons: makes some other parts of the code less easy to maintain. To be efficient, some computations need to transpose the data to pure SOA, compute their stuff, then transpose it back to the hybrid form to store it. This is inefficient.
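And the hybrid form, where vector fields keep their lanes interleaved:

// Hybrid SOA/AOS sketch: one array per field, vector lanes interleaved.
// Lane-agnostic operations (multiply, add, ...) can treat 'position' as
// a flat array of 3 * 1024 independent floats.
struct float3 { float x, y, z; };

struct HybridStream
{
    float3 position[1024]; // xyzxyzxyz...
    float3 velocity[1024];
    float  size[1024];
    float  life[1024];
};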
Hybrid SOA / AOS (transposed)
Like the Hybrid SOA/AOS format, there is a single array for vector values; however, the xyzw lanes are pre-transposed inside the array, in smaller batches whose size is usually equal to the hardware’s SIMD vector-width.
- Pros: avoids transpositions when loading the data for processing, and merges the benefits of both SOA and hybrid SOA, by making the unroll of processing loops even easier.
- Cons: random access is more complex, for example when deleting particles (see the addressing sketch after the figure below). The end of the stream is also less straightforward than with any previous method when the number of active elements is not a multiple of the SIMD width used for the transposition. For example, a stream with 5 active elements would look like:
[xxxx yyyy zzzz][x___ y___ z___]
Pre-transposed SOA/AOS with SIMD width=4
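To illustrate the random-access cost, here is a sketch of how one could address the y lane of element i in such a pre-transposed float3 stream (SIMD width of 4 assumed, names illustrative):

// Pre-transposed layout: [xxxx yyyy zzzz][xxxx yyyy zzzz]...
// Each 4-wide batch of float3s occupies 12 consecutive floats.
const int kSimdWidth = 4;

float LoadLaneY(const float *stream, int i)
{
    const int batch = i / kSimdWidth; // which 4-wide batch
    const int sub   = i % kSimdWidth; // index inside the batch
    const int laneY = 1;              // 0 = x, 1 = y, 2 = z
    return stream[(batch * 3 + laneY) * kSimdWidth + sub];
}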
PopcornFX execution and memory models
PopcornFX has used the Hybrid SOA/AOS memory model since its first version, mainly for legacy reasons. We might switch to one of the two other SOA methods in the future if we see a real, measurable performance benefit in doing so.
The execution model is a combination of multiple things:
SIMD Stream processing model
This is the same execution model GPUs use. First, let’s quickly see how GPUs work:
In a current-gen GPU, each shader-core / compute-unit runs shaders on batches of 32 (NVIDIA) or 64 (AMD) elements, for example 64 pixels or vertices simultaneously. These batches are named “Warps” on NVIDIA GPUs and “Wavefronts” on AMD: two different names, but they’re the same thing. Each element of a warp or wavefront is called a “thread”.
The shader controls a single element (pixel/vertex/etc), and inside a single wavefront, all threads execute the same instruction at the same time. There is a single instruction pointer for all of the 32/64 threads, and there effectively is a single instruction executed, one that processes 32/64 elements at once.
When processing a wavefront, the GPU has two kinds of registers available to hold the intermediate values computed by the shader: SGPRs (Scalar General Purpose Registers) and VGPRs (Vector General Purpose Registers). Unlike traditional CPU-style SIMD, where you’d store a float4 color inside a 4-wide SIMD register, GPUs adopt an orthogonal view (see “Streamed SIMD execution model” above), where each float4 is actually stored inside 4 VGPRs across the 64 elements of the wavefront, one for each of the xyzw lanes. This matches exactly the pure SOA memory model (see above). So unlike on PC, where SIMD registers contain 4 (SSE), 8 (AVX), or 16 (AVX512/KNC) floats, each VGPR contains 32 floats (NV) or 64 floats (AMD).
Dynamic flow control (if/else/loops) is more complex due to the notion of “divergence” between threads of the same wavefront, which breaks the “single instruction pointer” paradigm; we won’t cover it here. If you’re interested in the subject, see the excellent article about wavefront divergence written by Tim Foley.
The PopcornFX execution backend has the exact same execution model as a GPU, except PopcornFX’s wavefronts and registers are 1024-wide instead of 64-wide, meaning a single PopcornFX wavefront processes 1024 particles at once. (The actual wavefront size is equal to the storage page size, so it varies from 256 to 2048, and is usually around 1024.)
You can basically see the PopcornFX script CPU backend as a sort of simple GPU emulator, with even fatter wavefronts.
Parallel execution
Where GPUs have compute-units, CPUs have cores. PopcornFX uses threads to parallelize script execution on each CPU core, just like the GPU parallelizes wavefront execution on multiple SIMD units and compute units.
Natural batching
Function calls to C++ are naturally batched. A call to a C++ function, like a sampler, or scene.intersect(), is run as a single instruction, and where you’d write:
scene.intersect(Position, direction, rayLength);
and reason about a single per-particle position, ray direction, and ray length, the actual C++ intersect function gets called with arrays of positions, directions, and lengths, where each of those three arrays has the size of the current wavefront.
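On the C++ side, the batched entry point could look something like this (a sketch; the names, signatures, and the RaycastScene helper are illustrative, not PopcornFX’s actual API):

// Hypothetical batched binding: one call processes a whole wavefront.
struct float3 { float x, y, z; };

float RaycastScene(const float3 &pos, const float3 &dir, float maxLen); // placeholder

void Intersect(const float3 *positions,    // one entry per particle
               const float3 *directions,
               const float  *rayLengths,
               float        *hitDistances, // output, one entry per particle
               int           count)        // current wavefront size
{
    for (int i = 0; i < count; ++i)
        hitDistances[i] = RaycastScene(positions[i], directions[i], rayLengths[i]);
}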
Meta-Types / Frequencies
Each value and operation in a script has a type (ex: float, float3, int4, ..), but also an independent “Meta-Type” (also sometimes named “Frequency”).
This meta-type represents the variation frequency of each value. For example, the Meta-Type of “4.5” is “Constant”. The Meta-Type of “scene.Time” is “Normal”. The Meta-Type of an effect attribute is “Instance”. The Meta-Type of a particle field, like “Position”, is “Stream”.
- Constant (Lowest frequency)
Does not change during execution, can be precomputed at compile time.
Ex: numerical constants, scene.axisUp(), scene.axisForward(), etc.
- Normal
Can change, but not within a single execution of a script (ie: changes per frame).
Ex: scene.Time, dt, view.position(), view.axisForward(), etc.
- Instance
Can change, but stays the same for all particles that come from the same effect instance.
Ex: effect attributes, attribute samplers, instance position, etc.
- Stream (Highest frequency)
Changes for each particle.
Ex: particle fields, rand() functions, etc.
Meta-Types allow the script backend to perform fewer computations. For example, all ‘Normal’ operations can be run once per frame or once per wavefront, instead of once per particle. This is the equivalent of the GPU’s SGPRs, whereas the ‘Stream’ Meta-Type is the equivalent of VGPRs.
Meta-Types are propagated through operations: the Meta-Type of the result of an operation is at least the highest Meta-Type of its input values. Therefore, as soon as you’re doing operations with a stream value, everything downstream can very quickly become a stream as well.
For this reason, the script compiler also tries to reorder operations based on their Meta-Types to optimize computations even more. This can be seen as the equivalent of a regular compiler’s optimizer pulling loop-invariant computations out of inner loops.
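In C++ terms, the effect of this reordering is the same as manual loop-invariant hoisting; a sketch, with a made-up expression mixing ‘Normal’ and ‘Stream’ values:

// Script-side expression (illustrative): Position += view.axisForward() * scene.Time
struct float3 { float x, y, z; };

void ApplyOffset(float3 *positions, int count,
                 float3 axisForward, float sceneTime)
{
    // 'Normal' frequency: computed once per frame/wavefront
    const float3 offset = { axisForward.x * sceneTime,
                            axisForward.y * sceneTime,
                            axisForward.z * sceneTime };
    // 'Stream' frequency: once per particle
    for (int i = 0; i < count; ++i)
    {
        positions[i].x += offset.x;
        positions[i].y += offset.y;
        positions[i].z += offset.z;
    }
}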
Each instruction’s Meta-Type is visible in the ‘disassembly’ tab of the script editor, on the leftmost column.
It’s the combination of all those techniques that keeps overall performance high.
The compilation pipeline: from source text to execution
This section is not directly related to the execution model, but it is worth mentioning, as it is the backbone of the scripting system and what gives it its flexibility. It is also where optimizations take place: arithmetic simplifications, combining constant values, etc.
When you create a PopcornFX script, the compilation pipeline goes through 3 different stages before producing the final executable representation of the script that can be run by the runtime: the front-end, the IR (Intermediate Representation), and the backend.
Let’s consider this simple script as an example:
Front-end (2008, v0.1)
- Parses the script, builds an AST (Abstract Syntax Tree), that’s basically a node-graph of all operations typed in the script.
- Performs optimizations on the AST directly:
  - Arithmetic transformations
    - Combines and simplifies math expressions.
      Ex: 1 / (1 / x) –> x
    - Bubbles lower meta-types up the evaluation branches whenever possible.
      Ex: (((x + 1) + 2) + 3) –> (1 + 2 + 3) + x
  - Constant propagation & folding
    - Computes operations with constant operands at compile-time.
      Ex: (1 + 2 + 3) + x –> x + 6
    - Able to call C++ functions that have constant parameters at compile-time.
Visual representation of the script’s internal AST:
IR (2016, v1.10)
- Before v1.10, there was no real IR: the AST was sent directly to the backends, which built their final executable bytecode from it.
- Since v1.10, an IR is generated in SSA form (Static Single Assignment).
- In v1.11, a bunch of powerful optimization passes were added to the IR (several of which did not exist in the old AST optimizer):
- Load/store elimination (dead store elimination, also applied to loads, removes redundant loads & stores)
- Copy propagation (removes unnecessary copies and assignments)
- Dead code elimination (removes computations whose results are unused)
- Constant folding (folds constant registers and propagates values through the IR)
- Instruction combine (combines and morphs instructions)
- Common subexpression elimination (matches and removes duplicate computations)
Here is what the IR of the script above looks like, when printed in text form:
Backend (2008, v0.1)
- CBEM: CPU backend (2008, v0.1)
Stands for “Compiler Backend EMulation”. It is given the IR (v1.10 and above), or the AST (before v1.10), by the previous stage in the compilation pipeline, and builds a runtime representation of the script as a sequence of instructions. It performs a simple register allocation to minimize the total number of registers needed, and stores those instructions in a condensed bytecode format. The CPU backend also contains the VM (Virtual Machine) that decodes the bytecode and executes the instructions.
- CBD3D11: GPU backend (2015, v1.9)
Stands for “Compiler Backend D3D11”. Like CBEM, it is given the IR, and generates a D3D11 compute-shader.
Here is what the CBEM instructions look like:
Here is what the CBD3D11 Compute-Shader looks like:
// PopcornFX GPU Backend output (Kernel inputs)
RWByteAddressBuffer __r_pk_Color : register(u0);
ByteAddressBuffer __r_pk_Brightness : register(t0);

[numthreads(256U, 1, 1)]
void Eval(uint3 aStreamSize : SV_DispatchThreadID)
{
    const int i = aStreamSize.x;
    float4 _pk_Color = asfloat(__r_pk_Color.Load4(i * 16));
    const float _pk_Brightness = asfloat(__r_pk_Brightness.Load(i * 4));
    const float4 sr2 = float4((float3)(_pk_Brightness), 1);
    const float4 sr3 = (_pk_Color * sr2);
    _pk_Color = sr3;
    __r_pk_Color.Store4(i * 16, asuint(_pk_Color));
}
This “multiple frontends > common IR > multiple backends” model makes it easy to extend the system by adding new custom backends (for example: a GNM backend for PS4, a D3D12 backend, a Metal backend, a Vulkan/SPIR-V backend, why not a CPU JIT backend, whatever…)
With v2.0, the nodal graph counts as a different front-end that produces the common IR and can be merged with the IR generated from script nodes. This allows very powerful inter-evolver optimizations that were not possible before.