NonBorn
10-06-2007, 13:24
Programming the Cell Processor
It may be tricky, but the performance gains are worth the effort
Thanks to nine processors on a single silicon die, the Cell Broadband Engine—a processor jointly designed by IBM, Sony, and Toshiba and used in the PlayStation 3—promises lots of power. The good news is that the Cell is really fast: It provides enough computational power to replace a small high-performance cluster. The bad news is that it's difficult to program: Software that exploits the Cell's potential requires a development effort significantly greater than on traditional platforms. If you expect to port your application efficiently to the Cell via simple recompilation or threading, think again.
In this article, we present strategies we've used to make a Breadth-First Search on graphs as fast as possible on the Cell, reaching a performance 22 times higher than Intel's Woodcrest, comparable to a 256-processor BlueGene/L supercomputer—and all this with just a single Cell processor! Some techniques (loop unrolling, function inlining, SIMDization) are familiar; others (bulk-synchronous parallelization, DMA traffic scheduling, overlapping of computation and transfers) are less so.
Computing Is Changing
In the last 10 years, processors have gotten faster mainly through increasing clock frequencies and more complex architectures. The trend can't continue because fabrication technologies are reaching physical limits. Transistors are getting so small that a gate is only a few atoms thick. Additionally, smaller circuits mean higher heat density: It's more and more difficult to remove heat fast enough to keep circuits from burning out.
This is why the computing community is so interested in multicore architectures: IBM is pushing the Cell, while AMD and Intel are pushing quad-core processors. Intel has also shown its TeraScale prototype, a single chip with 80 cores. Architectures are changing fast, and developers have to keep up.
What's Under the Hood
The Cell contains:
One general-purpose 64-bit processor, the Power Processing Element (PPE).
Eight simpler processors, the Synergistic Processing Elements (SPEs).
A bus, the Element Interconnect Bus (EIB), that connects the PPE and the SPEs.
The PPE is a 64-bit processor with a PowerPC instruction set, 64 KB of L1 cache memory, and 512 KB of L2. Like Intel processors with HyperThreading, it supports simultaneous multithreading, but it is remarkably simpler than Pentiums or Opterons.
SPEs are different. They have 128-bit registers and SIMD (single instruction, multiple data) instructions that can simultaneously process the four 32-bit words inside each register. Plus, there are so many registers (128) that you can unroll loops many times before running out of them. This is ideal for dataflow-based applications.
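To give the flavor (this example is ours, not from the article; spu_add and the vector types come from the SDK's spu_intrinsics.h), a SIMDized, unrolled loop on an SPE looks roughly like this:

#include <spu_intrinsics.h>

/* Add two arrays of floats, four 32-bit words per instruction.
   Unrolled 4x: with 128 registers, many independent adds can be
   kept in flight without spilling. */
void vadd(vector float *a, vector float *b, vector float *out, int nvec)
{
    int i;
    for (i = 0; i + 3 < nvec; i += 4) {
        out[i]     = spu_add(a[i],     b[i]);
        out[i + 1] = spu_add(a[i + 1], b[i + 1]);
        out[i + 2] = spu_add(a[i + 2], b[i + 2]);
        out[i + 3] = spu_add(a[i + 3], b[i + 3]);
    }
    for (; i < nvec; i++)   /* leftover iterations */
        out[i] = spu_add(a[i], b[i]);
}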
But the most radical peculiarity for programmers is that SPEs have no cache memory. Rather, they have a 256-KB scratchpad memory called the "local store" (LS). This makes SPEs small and efficient because caches cost silicon area and electrical power, but it complicates things for programmers. All the variables you declare are allocated in the LS and must fit there. Larger data structures in main memory can be accessed one block at a time; it is your responsibility to load/store blocks from/to main memory via explicit DMA transfers. You have to design your algorithms to operate on one small block of data at a time, small enough to fit in the LS: When they are finished with a block, they commit the results to main memory and fetch the next block. In a way, this feels like the old DOS days, when everything had to fit in the (in)famous 640 KB. On the other hand, an SPE's 256-KB local store is much larger than most L1 data caches (a Xeon has just 32 KB). This is one of the reasons why a single SPE outperforms the highest-clocked Xeon core by a factor of three on many benchmarks.
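A sketch of what such blocked access looks like (again ours; mfc_get() and the tag-status calls come from the SDK's spu_mfcio.h):

#include <spu_mfcio.h>

#define BLOCK 4096  /* bytes per DMA block; must fit in the 256-KB LS */
static char buf[BLOCK] __attribute__((aligned(128)));

/* Fetch one block of a large main-memory array into the local store. */
void fetch_block(unsigned long long ea, unsigned int tag)
{
    mfc_get(buf, ea, BLOCK, tag, 0, 0);  /* start the DMA transfer */
    mfc_write_tag_mask(1 << tag);        /* select the tag to wait on */
    mfc_read_tag_status_all();           /* block until it completes */
    /* ...process buf[], mfc_put() the results back, fetch the next block */
}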
The PPE, SPEs, and memory controllers are connected by the EIB bus. The EIB contains four data rings, two of which run clockwise and two counter-clockwise. It operates at 1.6 GHz and reaches aggregate transfer rates in excess of 200 GB/second. It employs point-to-point connections, similar to networks in high-performance clusters and supercomputers. Therefore, Cell programmers face issues of process mapping and congestion control—traditional problems of parallel computing. Additionally, the larger the blocks are, the higher their EIB transfer efficiency. So programmers are pressured to keep data structures small enough to fit the LS, but large enough to be transferred efficiently on the EIB.
Unfortunately, the compiler won't help you with parallelization, choice of optimal data structure size, scheduling of transfers, SIMDization, loop unrolling, and the like. You have to do that manually.
The quickest way to get started with Cell programming is with the Cell SDK (www.ibm.com/developerworks/power/cell), which contains a full system simulator. To profile applications (including data transfers), you need a real system—a Mercury Computer Systems development board (www.mc.com) or Sony PlayStation 3. Mercury's board has two DD3 Cell processors clocked at 3.2 GHz, running a Linux kernel 2.6.16 with the GCC 4 compiler set. The PlayStation 3 has a single Cell, and the Fedora Core 5 distribution reportedly has been running on it (ps3.qj.net).
The Problem
To illustrate the peculiarities of Cell programming, we use the Breadth-First Search (BFS) on a graph. Despite its simplicity, this algorithm is important because it is a building block of many applications in computer graphics, artificial intelligence, astrophysics, national security, genomics, robotics, and the like.
Listing One is a minimal BFS implementation in C. Variable G contains the graph in the form of an array of adjacency lists. G[i].length tells how many neighbors the i-th vertex has; they are in G[i].neighbors[0], G[i].neighbors[1], and so on. The vertex from which the visit starts is in variable root. A BFS visit proceeds in levels: First the root is visited, then its neighbors, then its neighbors' neighbors, and so on. At any time, queue Q contains the vertices to visit in the current level. The algorithm scans every vertex in Q, fetches its neighbors, and adds each neighbor to the list of vertices to visit in the next level, Qnext. To avoid being caught in cycles, the algorithm never visits a vertex twice; to that end, it maintains an array of Boolean flags, marked. Neighbors are added to Qnext only when they are not already marked, and then they get marked. At the end of each level, Q and Qnext swap, and Qnext is emptied.
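Listing One itself isn't reproduced in this post; here is a minimal sketch consistent with the description above (our reconstruction, not the actual listing, which is at ddj.com):

#include <stdlib.h>

typedef struct {
    int  length;      /* number of neighbors of this vertex */
    int *neighbors;   /* their vertex identifiers */
} adj_list_t;

adj_list_t *G;        /* the graph: one adjacency list per vertex */
char *marked;         /* one Boolean flag per vertex, initially 0 */

void bfs(int root, int nvertices)
{
    int *Q     = malloc(nvertices * sizeof(int));
    int *Qnext = malloc(nvertices * sizeof(int));
    int Q_size = 1, Qnext_size = 0, Q_index, i;

    Q[0] = root;
    marked[root] = 1;
    while (Q_size > 0) {                       /* one level per iteration */
        for (Q_index = 0; Q_index < Q_size; Q_index++) {
            int v = Q[Q_index];
            for (i = 0; i < G[v].length; i++) {
                int n = G[v].neighbors[i];
                if (!marked[n]) {              /* enqueue unseen neighbors */
                    marked[n] = 1;
                    Qnext[Qnext_size++] = n;
                }
            }
        }
        int *tmp = Q; Q = Qnext; Qnext = tmp;  /* swap Q and Qnext */
        Q_size = Qnext_size;
        Qnext_size = 0;                        /* empty Qnext */
    }
    free(Q);
    free(Qnext);
}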
On a Pentium 4 HT running at 3.4 GHz, this algorithm checks 24 million edges per second. On the Cell, at the end of our optimization, we achieved a performance of 538 million edges per second. This is an impressive result, but it came at the price of an explosion in code complexity: While the algorithm in Listing One fits in 60 lines of source code, our final algorithm on the Cell measures 1200 lines of code.
Let's Get Parallel
The first step in adapting a program to a multicore architecture is making it parallel. The basic idea is to split the loop for (Q_index=0; Q_index<Q_size; Q_index++)... among different SPEs. Then, accesses to shared structures such as Qnext and the marked array must happen under the protection of a synchronization mechanism, typically a lock. Locks work fine in cache-coherent shared-memory machines with uniform memory access and a limited number of threads, but they scale poorly on multicore systems. Instead, we partition both the vertex space and the marked array evenly among the SPEs. Each SPE explores only the vertices it owns, and forwards the others to their respective owners. Function which_owner() returns the owner of a given vertex identifier.
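A minimal sketch of the partition (ours; the article doesn't spell out the exact function, so the even block split below is an assumption):

#define NSPE 8  /* number of SPEs in use */

/* Even block partition of the vertex space: the i-th SPE owns a
   contiguous range of vertex identifiers. */
int which_owner(int v, int nvertices)
{
    int block = (nvertices + NSPE - 1) / NSPE;  /* ceil(nvertices/NSPE) */
    return v / block;
}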
Rather than synchronizing at a fine grain, we adopt a Bulk-Synchronous Parallel (BSP) approach. In BSP, an algorithm is split into steps, and all the cores execute the same step at the same time. After each step there is a barrier; see barrier() in Listing Two (available at http://www.ddj.com/code/). At a barrier, whoever finishes first waits for all the others to complete before proceeding to the next step. The BSP approach is very common in the parallel programming community because it greatly simplifies the control flow and the communication protocols, at the expense of a negligible performance penalty.
Listing Two is a pseudo-C rendition of the algorithm in BSP form. The code is executed by each of the SPEs, numbered from 0 to Nspe-1. Each SPE examines the neighbor lists of each of the vertices in its Q, encountering neighbors that belong to different SPEs. It dispatches them, putting those owned by the i-th SPE in queue Qout[i]. Then, an all-to-all exchange takes place, which routes the vertices to their respective owners. Each Qout[i] is sent to the i-th SPE, which receives it into Qin[s], where s is the identifier of the sender SPE. Next, each SPE examines the incoming adjacency lists, and marks and adds vertices to its private Qnext as before. The entire algorithm is complete when all the SPEs find their Qs empty. This is done via a reduce_all operation, which performs a distributed addition of all the variables Q_size among all the SPEs.
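Putting the pieces together, each SPE's control flow looks roughly like this (our pseudo-C sketch of Listing Two's structure; the real listing is at ddj.com, and enqueue(), exchange(), barrier(), and reduce_all() stand in for its helpers):

/* Per-SPE BSP skeleton; "for each" is pseudo-code shorthand. */
while (reduce_all(Q_size) > 0) {   /* does any SPE still have work? */
    /* Step 1: scan my Q, dispatch neighbors to their owners. */
    for (Q_index = 0; Q_index < Q_size; Q_index++)
        for each neighbor n of Q[Q_index]
            enqueue(Qout[which_owner(n)], n);
    barrier();

    /* Step 2: all-to-all exchange; my Qout[i] becomes SPE i's Qin[s],
       where s is my own SPE identifier. */
    exchange(Qout, Qin);
    barrier();

    /* Step 3: mark and keep the incoming vertices not seen before. */
    for (s = 0; s < Nspe; s++)
        for each vertex v in Qin[s]
            if (!marked[v]) { marked[v] = 1; enqueue(Qnext, v); }
    barrier();

    swap(Q, Qnext);
    Q_size = Qnext_size; Qnext_size = 0;   /* empty Qnext */
}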