NonBorn
10-06-2007, 13:24
Programming the Cell Processor
It may be tricky, but the performance gains are worth the effort
Thanks to nine processors on a single silicon die, the Cell Broadband Engine—a processor jointly designed by IBM, Sony, and Toshiba and used in the PlayStation 3—promises lots of power. The good news is that the Cell is really fast: It provides enough computational power to replace a small high-performance cluster. The bad news is that it's difficult to program: Software that exploits the Cell's potential requires a development effort significantly greater than on traditional platforms. If you expect to port your application efficiently to the Cell via simple recompilation or threading, think again.
In this article, we present strategies we've used to make a Breadth-First Search on graphs as fast as possible on the Cell, reaching a performance 22 times higher than Intel's Woodcrest, comparable to a 256-processor BlueGene/L supercomputer—and all this with just a single Cell processor! Some techniques (loop unrolling, function inlining, SIMDization) are familiar; others (bulk-synchronous parallelization, DMA traffic scheduling, overlapping of computation and transfers) are less so.
Computing Is Changing
In the last 10 years, processors have gotten faster mainly through increasing clock frequencies and more complex architectures. The trend can't continue because fabrication technologies are reaching physical limits. Transistors are getting so small that a gate is only a few atoms thick. Additionally, smaller circuits mean higher heat density: It's more and more difficult to remove heat fast enough to keep circuits from burning out.
This is why the computing community is so interested in multicore architectures: IBM is pushing the Cell, while AMD and Intel are pushing quad-core processors. Intel has also shown its TeraScale prototype, a single chip with 80 cores. Architectures are changing fast, and developers have to keep up.
What's Under the Hood
The Cell contains:
One general-purpose 64-bit processor, the Power Processing Element (PPE).
Eight simpler processors, the Synergistic Processing Elements (SPEs).
A bus, the Element Interconnect Bus (EIB), that connects the PPE and the SPEs.
The PPE is a 64-bit processor with a PowerPC instruction set, 64 KB of L1 cache memory, and 512 KB of L2. Like Intel processors with HyperThreading, it supports simultaneous multithreading, but it is remarkably simpler than Pentiums or Opterons.
SPEs are different. They have 128-bit registers and SIMD (single instruction, multiple data) instructions that can simultaneously process the four 32-bit words inside each register. Plus, there are so many registers (128) that you can unroll loops many times before running out of them. This is ideal for dataflow-based applications.
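To give the flavor (this example is ours, not from the article; spu_add and the vector types come from the SDK's spu_intrinsics.h), a SIMDized, unrolled loop on an SPE looks roughly like this:

#include <spu_intrinsics.h>

/* Add two arrays of floats, four 32-bit words per instruction.
   Unrolled 4x: with 128 registers, many independent adds can be
   kept in flight without spilling. */
void vadd(vector float *a, vector float *b, vector float *out, int nvec)
{
    int i;
    for (i = 0; i + 3 < nvec; i += 4) {
        out[i]     = spu_add(a[i],     b[i]);
        out[i + 1] = spu_add(a[i + 1], b[i + 1]);
        out[i + 2] = spu_add(a[i + 2], b[i + 2]);
        out[i + 3] = spu_add(a[i + 3], b[i + 3]);
    }
    for (; i < nvec; i++)   /* leftover iterations */
        out[i] = spu_add(a[i], b[i]);
}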
But the most radical peculiarity for programmers is that SPEs have no cache memory. Rather, they have a 256-KB scratchpad memory called the "local store" (LS). This makes SPEs small and efficient because caches cost silicon area and electrical power, but it complicates things for programmers. All the variables you declare are allocated in the LS and must fit there. Larger data structures in main memory can be accessed one block at a time; it is your responsibility to load/store blocks from/to main memory via explicit DMA transfers. You have to design your algorithms to operate on one small block of data at a time, small enough to fit in the LS: When they are finished with a block, they commit the results to main memory and fetch the next block. In a way, this feels like the old DOS days, when everything had to fit in the (in)famous 640 KB. On the other hand, an SPE's 256-KB local store is much larger than most L1 data caches (a Xeon has just 32 KB). This is one of the reasons why a single SPE outperforms the highest-clocked Xeon core by a factor of three on many benchmarks.
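A sketch of what such blocked access looks like (again ours; mfc_get() and the tag-status calls come from the SDK's spu_mfcio.h):

#include <spu_mfcio.h>

#define BLOCK 4096  /* bytes per DMA block; must fit in the 256-KB LS */
static char buf[BLOCK] __attribute__((aligned(128)));

/* Fetch one block of a large main-memory array into the local store. */
void fetch_block(unsigned long long ea, unsigned int tag)
{
    mfc_get(buf, ea, BLOCK, tag, 0, 0);  /* start the DMA transfer */
    mfc_write_tag_mask(1 << tag);        /* select the tag to wait on */
    mfc_read_tag_status_all();           /* block until it completes */
    /* ...process buf[], mfc_put() the results back, fetch the next block */
}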
The PPE, SPEs, and memory controllers are connected by the EIB bus. The EIB contains four data rings, two of which run clockwise and two counter-clockwise. It operates at 1.6 GHz and reaches aggregate transfer rates in excess of 200 GB/second. It employs point-to-point connections, similar to networks in high-performance clusters and supercomputers. Therefore, Cell programmers face issues of process mapping and congestion control—traditional problems of parallel computing. Additionally, the larger the blocks are, the higher their EIB transfer efficiency. So programmers are pressured to keep data structures small enough to fit the LS, but large enough to be transferred efficiently on the EIB.
Unfortunately, the compiler won't help you with parallelization, choice of optimal data structure size, scheduling of transfers, SIMDization, loop unrolling, and the like. You have to do that manually.
The quickest way to get started with Cell programming is with the Cell SDK (www.ibm.com/developerworks/power/cell), which contains a full system simulator. To profile applications (including data transfers), you need a real system—a Mercury Computer Systems development board (www.mc.com) or Sony PlayStation 3. Mercury's board has two DD3 Cell processors clocked at 3.2 GHz, running a Linux kernel 2.6.16 with the GCC 4 compiler set. The PlayStation 3 has a single Cell, and the Fedora Core 5 distribution reportedly has been running on it (ps3.qj.net).
The Problem
To illustrate the peculiarities of Cell programming, we use the Breadth-First Search (BFS) on a graph. Despite its simplicity, this algorithm is important because it is a building block of many applications in computer graphics, artificial intelligence, astrophysics, national security, genomics, robotics, and the like.
Listing One is a minimal BFS implementation in C. Variable G contains the graph in the form of an array of adjacency lists. G[i].length tells how many neighbors the i-th vertex has; they are in G[i].neighbors[0], G[i].neighbors[1], and so on. The vertex from which the visit starts is in variable root. A BFS visit proceeds in levels: First the root is visited, then its neighbors, then its neighbors' neighbors, and so on. At any time, queue Q contains the vertices to visit in the current level. The algorithm scans every vertex in Q, fetches its neighbors, and adds each neighbor to the list of vertices to visit in the next level, Qnext. To avoid being caught in cycles, the algorithm never visits a vertex twice; to that end, it maintains an array of Boolean flags, marked. Neighbors are added to Qnext only when they are not already marked, and then they get marked. At the end of each level, Q and Qnext swap, and Qnext is emptied.
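Listing One itself isn't reproduced in this post; here is a minimal sketch consistent with the description above (our reconstruction, not the actual listing, which is at ddj.com):

#include <stdlib.h>

typedef struct {
    int  length;      /* number of neighbors of this vertex */
    int *neighbors;   /* their vertex identifiers */
} adj_list_t;

adj_list_t *G;        /* the graph: one adjacency list per vertex */
char *marked;         /* one Boolean flag per vertex, initially 0 */

void bfs(int root, int nvertices)
{
    int *Q     = malloc(nvertices * sizeof(int));
    int *Qnext = malloc(nvertices * sizeof(int));
    int Q_size = 1, Qnext_size = 0, Q_index, i;

    Q[0] = root;
    marked[root] = 1;
    while (Q_size > 0) {                       /* one level per iteration */
        for (Q_index = 0; Q_index < Q_size; Q_index++) {
            int v = Q[Q_index];
            for (i = 0; i < G[v].length; i++) {
                int n = G[v].neighbors[i];
                if (!marked[n]) {              /* enqueue unseen neighbors */
                    marked[n] = 1;
                    Qnext[Qnext_size++] = n;
                }
            }
        }
        int *tmp = Q; Q = Qnext; Qnext = tmp;  /* swap Q and Qnext */
        Q_size = Qnext_size;
        Qnext_size = 0;                        /* empty Qnext */
    }
    free(Q);
    free(Qnext);
}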
On a Pentium 4 HT running at 3.4 GHz, this algorithm checks 24 million edges per second. On the Cell, at the end of our optimization, we achieved a performance of 538 million edges per second. This is an impressive result, but it came at the price of an explosion in code complexity: While the algorithm in Listing One fits in 60 lines of source code, our final algorithm on the Cell measures 1200 lines of code.
Let's Get Parallel
The first step in adapting a program to a multicore architecture is making it parallel. The basic idea is to split the loop for (Q_index=0; Q_index<Q_size; Q_index++)... among different SPEs. Then, accesses to shared structures such as Qnext and the marked array must happen under the protection of a synchronization mechanism, typically a lock. Locks work fine in cache-coherent shared-memory machines with uniform memory access and a limited number of threads, but they scale poorly on multicore systems. Instead, we partition both the vertex space and the marked array evenly among the SPEs. Each SPE explores only the vertices it owns, and forwards the others to their respective owners. Function which_owner() returns the owner of a given vertex identifier.
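A minimal sketch of the partition (ours; the article doesn't spell out the exact function, so the even block split below is an assumption):

#define NSPE 8  /* number of SPEs in use */

/* Even block partition of the vertex space: the i-th SPE owns a
   contiguous range of vertex identifiers. */
int which_owner(int v, int nvertices)
{
    int block = (nvertices + NSPE - 1) / NSPE;  /* ceil(nvertices/NSPE) */
    return v / block;
}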
Rather than synchronizing at a fine grain, we adopt a Bulk-Synchronous Parallel (BSP) approach. In BSP, an algorithm is split into steps, and all the cores execute the same step at the same time. After each step there is a barrier; see barrier() in Listing Two (available at http://www.ddj.com/code/). At a barrier, whoever finishes first waits for all the others to complete before proceeding to the next step. The BSP approach is very common in the parallel programming community because it greatly simplifies the control flow and the communication protocols, at the expense of a negligible performance penalty.
Listing Two is a pseudo-C rendition of the algorithm in BSP form. The code is executed by each of the SPEs, numbered from 0 to Nspe-1. Each SPE examines the neighbor lists of each of the vertices in its Q, encountering neighbors that belong to different SPEs. It dispatches them, putting those owned by the i-th SPE in queue Qout[i]. Then, an all-to-all exchange takes place, which routes the vertices to their respective owners. Each Qout[i] is sent to the i-th SPE, which receives it into Qin[s], where s is the identifier of the sender SPE. Next, each SPE examines the incoming adjacency lists, and marks and adds vertices to its private Qnext as before. The entire algorithm is complete when all the SPEs find their Qs empty. This is done via a reduce_all operation, which performs a distributed addition of all the variables Q_size among all the SPEs.
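Putting the pieces together, each SPE's control flow looks roughly like this (our pseudo-C sketch of Listing Two's structure; the real listing is at ddj.com, and enqueue(), exchange(), barrier(), and reduce_all() stand in for its helpers):

/* Per-SPE BSP skeleton; "for each" is pseudo-code shorthand. */
while (reduce_all(Q_size) > 0) {   /* does any SPE still have work? */
    /* Step 1: scan my Q, dispatch neighbors to their owners. */
    for (Q_index = 0; Q_index < Q_size; Q_index++)
        for each neighbor n of Q[Q_index]
            enqueue(Qout[which_owner(n)], n);
    barrier();

    /* Step 2: all-to-all exchange; my Qout[i] becomes SPE i's Qin[s],
       where s is my own SPE identifier. */
    exchange(Qout, Qin);
    barrier();

    /* Step 3: mark and keep the incoming vertices not seen before. */
    for (s = 0; s < Nspe; s++)
        for each vertex v in Qin[s]
            if (!marked[v]) { marked[v] = 1; enqueue(Qnext, v); }
    barrier();

    swap(Q, Qnext);
    Q_size = Qnext_size; Qnext_size = 0;   /* empty Qnext */
}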