PS4’s Asynchronous Fine-Grain Compute GPU [Archive]

Leon_DiZ

02-07-2013, 01:35

PS4’s Asynchronous Fine-Grain Compute GPU: What the Heck Can it Do?

http://cdn.dualshockers.com/wp-content/uploads/2013/06/PS4_AsynchronousFineGrain-670x376.jpg (http://cdn.dualshockers.com/wp-content/uploads/2013/06/PS4_AsynchronousFineGrain.jpg)

Mark Cerny’s “The Road to PS4” presentation at GameLab 2013 in Barcelona was one of the best I have heard in sixteen years of writing about games. He touched on some very specific points, decorating them with a lovely story of friendship and personal initiative, in what can easily be described as a feat of storytelling performed by a master narrator.

I’m not surprised to hear that he gave eight hour-long presentations before, because I have never heard anyone else talk about GDDR5, Cell and cold hard silicon components while making it sound like an epic bedtime story (if you don’t believe me just check the full video embedded at the bottom of this post). He also managed not to sound overly technical, allowing even laymen to understand what he was talking about.

Yet, like every passionate geek, he still slipped into techie talk at least once, leaving many of his listeners with a lot of questions. That happened when he was talking about the GPU that will power the graphics of the PS4, and its future-proof features grouped under that extremely technical-sounding umbrella named “Asynchronous Fine-Grain Compute.”

This is actually a very relevant topic, as it empowers that “rich feature set” that Cerny envisioned for the future of the console and that will give developers the tools to create better games over the years. But what does “Asynchronous Fine-Grain Compute” even mean?

Normally, a GPU is used to execute graphics commands, rendering what we see on our screen, while compute commands, which simulate the world around us, drive the artificial intelligence and prompt the software to react to our actions, are handled by the CPU.

The GPU of the PS4 has been optimized to break that barrier, thanks to the shared memory pool and to a secondary bus that allows it to read from and write to system memory directly. The number of compute command queues that the architecture supports has also been dramatically increased (to 64), so that a relatively large number of small programs can run simultaneously, or nearly so (fine-grain).

http://cdn2.dualshockers.com/wp-content/uploads/2013/06/PS4FeatureSet-670x376.jpg (http://cdn2.dualshockers.com/wp-content/uploads/2013/06/PS4FeatureSet.jpg)

In layman’s terms, this means that the GPU can directly support the CPU in performing compute tasks whenever it has free resources, which would normally remain unused in a traditional system with separate CPU and GPU memory. If you have a PC and like to monitor your GPU usage while you game, you have probably noticed that there are plenty of times when far less than 100% of the available resources are being used. This system makes sure that those resources are put to good use, while still handling the graphics commands at the same time (asynchronous).
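The idea of compute jobs filling otherwise idle GPU resources can be sketched as a toy scheduler. This is only an illustration under stated assumptions, not how the PS4 hardware actually arbitrates work: the function name `fill_idle_units`, the per-slice graphics load figures, the job names and the round-robin policy are all hypothetical, and each compute job is assumed to occupy one unit for one time slice.

```python
from collections import deque

def fill_idle_units(graphics_load, queues, total_units=18):
    """Toy model: in each time slice graphics takes what it needs first,
    then pending compute jobs are pulled round-robin from the queues
    to occupy whatever units are left idle.

    graphics_load: units busy with graphics in each time slice (hypothetical numbers)
    queues: list of job lists, one per compute queue (hypothetical job names)
    Returns the list of compute jobs started in each slice.
    """
    queues = [deque(q) for q in queues]
    timeline = []
    for busy in graphics_load:
        free = total_units - busy
        ran = []
        # Keep cycling over the queues until units or jobs run out.
        while free > 0 and any(queues):
            for q in queues:
                if free == 0:
                    break
                if q:
                    ran.append(q.popleft())
                    free -= 1
        timeline.append(ran)
    return timeline

if __name__ == "__main__":
    load = [18, 16, 10]  # graphics saturates the GPU only in the first slice
    jobs = [["audio1", "audio2"], ["physics1"], ["unzip1", "unzip2", "unzip3"]]
    for t, ran in enumerate(fill_idle_units(load, jobs)):
        print(f"slice {t}: graphics={load[t]:2d} units, compute jobs={ran}")
```

In this sketch no compute job ever displaces graphics work; the jobs simply wait until a slice has spare units, which is the intuition behind running them "asynchronously" alongside rendering.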

One of the most interesting yet obscure parts of the presentation was the set of examples Cerny gave of what those GPU-run applications could be. He mentioned ray casting for audio, decompression, physics simulation, collision detection and world simulation. Some of those are almost self-explanatory, but some definitely aren’t. Let’s examine each element in detail.

Ray Casting for Audio
Ray casting, or more commonly ray tracing, is normally a technique used in computer graphics to generate an image by tracing the path of light through the pixels on a plane (represented by the screen) and accurately simulating the effects the light would create when encountering the objects drawn. It normally renders an extremely high quality picture, but its hardware costs are prohibitive, and that’s why it’s not normally used in game design.

http://cdn2.dualshockers.com/wp-content/uploads/2013/06/raytracingaudio3-209x425.jpg (http://cdn2.dualshockers.com/wp-content/uploads/2013/06/raytracingaudio3.jpg)http://cdn.dualshockers.com/wp-content/uploads/2013/06/raytracingaudio2-1024x705.jpg (http://cdn2.dualshockers.com/wp-content/uploads/2013/06/raytracingaudio2.jpg)

A similar technique can be used for real time audio synthesis with unparalleled levels of realism. When a sound is generated in the virtual environment, the sound engine simulates the sound waves by tracing rays from the origin of the sound in all directions. Those rays can then intersect objects in the 3D scene, and those objects have sound altering properties.

Those properties are then applied to each ray, and a new indirect, modified ray is traced from the obstacle itself. The process is repeated over and over until all the objects on the way are cleared or the ray finally intersects the position of the listener. When that happens, the sound components are combined by the sound engine and the listener finally hears the sound realistically modified by the surrounding environment.
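The loop described above can be sketched in a few lines. This is a heavily simplified sketch, not the method from the patent or from any real audio engine: it assumes a 2D rectangular room, purely mirror-like reflections, a single absorption factor for every wall, and rays advanced in small fixed steps. The function name `cast_audio_rays` and all parameter values are made up for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, in air at room temperature

def cast_audio_rays(source, listener, room=(10.0, 6.0), absorption=0.3,
                    n_rays=360, listener_radius=0.25, max_bounces=4, step=0.05):
    """Trace rays from the source in all directions inside a rectangular room.
    Each wall bounce removes a fraction of the ray's energy. A ray that passes
    within listener_radius of the listener contributes (delay_seconds, energy)."""
    width, height = room
    hits = []
    for i in range(n_rays):
        angle = 2.0 * math.pi * i / n_rays
        x, y = source
        dx, dy = math.cos(angle), math.sin(angle)
        energy, travelled, bounces = 1.0, 0.0, 0
        while bounces <= max_bounces:
            x += dx * step
            y += dy * step
            travelled += step
            if x < 0.0 or x > width:       # left or right wall: mirror the ray
                dx = -dx
                x = min(max(x, 0.0), width)
                energy *= 1.0 - absorption
                bounces += 1
            if y < 0.0 or y > height:      # bottom or top wall: mirror the ray
                dy = -dy
                y = min(max(y, 0.0), height)
                energy *= 1.0 - absorption
                bounces += 1
            if math.hypot(x - listener[0], y - listener[1]) <= listener_radius:
                hits.append((travelled / SPEED_OF_SOUND, energy))
                break
    return hits

if __name__ == "__main__":
    hits = cast_audio_rays(source=(2.0, 3.0), listener=(8.0, 3.0))
    first = min(hits, key=lambda h: h[0])
    print(f"{len(hits)} rays reached the listener; first arrival after {first[0] * 1000:.1f} ms")
```

Every ray is independent of the others, which is why this kind of workload maps well onto many small GPU programs: more free resources simply means more rays and more bounces per frame.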

Of course the more resources are available, the more rays can be cast and the more complex and realistic the simulation can be. If you want to read about this technique in full detail, you can find a patent describing it here (http://www.google.com/patents/US8139780). Of course it’s quite complicated, but also extremely interesting.

Decompression
Games use texture compression in order to save rendering bandwidth. In most cases if they used raw textures, bandwidth would run out way too fast and creating richly detailed worlds would basically be impossible.

Of course the use of compression is normally very lossy. For instance, the popular DXT5 format compresses every block of 16 pixels to a fixed size of 128 bits. This isn’t ideal, as some blocks are more complex than others, and having a fixed ratio means that the more complex blocks will generate artifacts, while the blocks that could be compressed more (as they are simpler) aren’t compressed as much as they could be.
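The fixed ratio is easy to work out. The short sketch below computes the footprint of a texture at the DXT5 rate quoted above (128 bits per block of 4x4 pixels) and compares it with an uncompressed version. The 32 bits per pixel for the raw texture (8-bit RGBA) is an assumption made for the comparison, and `dxt5_footprint` is a hypothetical helper name.

```python
import math

def dxt5_footprint(width, height, raw_bits_per_pixel=32):
    """Return (raw_bytes, compressed_bytes, ratio) for a texture stored at the
    fixed DXT5 rate of 128 bits per 4x4 block. The raw size assumes 8-bit RGBA."""
    blocks = math.ceil(width / 4) * math.ceil(height / 4)
    compressed_bytes = blocks * 128 // 8
    raw_bytes = width * height * raw_bits_per_pixel // 8
    return raw_bytes, compressed_bytes, raw_bytes / compressed_bytes

if __name__ == "__main__":
    raw, packed, ratio = dxt5_footprint(1024, 1024)
    print(f"raw: {raw / 2**20:.0f} MiB, DXT5: {packed / 2**20:.0f} MiB, ratio {ratio:.0f}:1")
```

Under these assumptions the ratio is 4:1 for every block, whether it contains a flat colour or a detailed pattern, which is exactly the limitation the paragraph describes.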

http://cdn2.dualshockers.com/wp-content/uploads/2013/06/PS4MarkCerny-670x366.jpg (http://cdn2.dualshockers.com/wp-content/uploads/2013/06/PS4MarkCerny.jpg)

In addition to this, most games’ full texture sets are often too big to be used at the same time, and are switched in and out of memory during loading screens. When loading screens are not possible, textures need to be compressed very heavily to avoid or mitigate that annoying pop-in effect that you see in many open world games.

Using compute resources on the GPU for decompression opens interesting possibilities, such as Variable Bit Rate compression/decompression algorithms that compress textures more efficiently (spending fewer bits on the less complex pixel blocks and more on the blocks that show more complexity). This has the two-pronged effect of increasing final quality and of removing or reducing the need for loading screens while mitigating pop-in.
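To make the difference from a fixed rate concrete, here is a toy bit-allocation sketch. It is not a real texture codec and is not taken from the presentation: it only shows how the same total budget could be distributed unevenly, giving each block a small guaranteed minimum and sharing the rest in proportion to a per-block "complexity" score. The function name, the complexity scores and the 32-bit floor are all hypothetical.

```python
def allocate_bits(complexities, total_bits, min_bits=32):
    """Split total_bits across blocks: every block gets min_bits, and the
    remainder is shared in proportion to each block's complexity score."""
    n = len(complexities)
    spare = total_bits - n * min_bits
    if spare < 0:
        raise ValueError("budget is too small for the per-block minimum")
    total_complexity = sum(complexities)
    if total_complexity == 0:
        # All blocks are equally simple: fall back to an even split.
        return [total_bits / n] * n
    return [min_bits + spare * c / total_complexity for c in complexities]

if __name__ == "__main__":
    scores = [0, 1, 3]           # flat block, mild detail, heavy detail (made-up scores)
    budget = 128 * len(scores)   # same total a fixed 128-bit-per-block format would use
    print("fixed rate:   ", [128] * len(scores))
    print("variable rate:", allocate_bits(scores, budget))
```

With the same total budget, the flat block shrinks to the minimum and the detailed block receives almost twice the fixed-rate allowance. The price is that blocks no longer sit at predictable offsets, so decoding takes more work, which is where spare GPU compute becomes useful.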

Of course textures aren’t the only asset that is compressed and needs to be decompressed. Audio and animation data are other examples that already normally use Variable Bit Rate compression. By moving the decompression process to the GPU when available, the resources of the CPU (that on the PS4 isn’t exactly a monster) can be used more efficiently.

Physics Simulation
Simulating physics is normally done by the CPU, and it’s quite a resource-intensive activity, especially in games that feature destructible environments with a lot of fragments bouncing all over the place (assuming, of course, that those fragments are actually physically simulated and not just particles). The ability to simulate physics on the GPU already exists on PC. If you have an Nvidia video card you have probably heard of PhysX.

What PhysX primarily does is offload to the GPU physics calculations that would normally burden the CPU, in order to free resources on the CPU itself. The most interesting element, though, is that it also allows developers to create effects that would normally be impractical to simulate on the CPU, increasing realism and allowing for more visual glitz in games.

Above you can see an example with Borderlands 2, but you can see a lot more here (http://physxinfo.com/). Just click on the “i” near each title to check the related comparison video.

Of course the GPU of the PS4 won’t run PhysX, as the engine is proprietary to Nvidia cards, and Sony has chosen the rival AMD as its component vendor. Even so, the unified architecture will facilitate the implementation of similar effects despite the difference in brand.

As a final note on this topic, solid objects aren’t the only ones benefiting from this kind of GPU-based simulation. Fluids are another relevant example, alongside fire, smoke and so forth.

Collision Detection
Collision detection is linked to physical simulation, as it’s the ability to detect the intersection of two models and prevent it with the appropriate physical effect. It’s quite resource-intensive, and more so if the models (or their hitboxes) are very complex. That’s why many games of the current generation have only environmental collision detection (preventing players and NPCs from clipping into elements of the environment) and no collision detection between mobile models.

Even when collision detection is enabled between all models, it’s often “a posteriori”, meaning that it’s calculated only after two bodies have collided. It’s less hardware-intensive but also less precise and stable, and it’s also normally late in the reaction by a frame or more.

Offloading collision detection to the GPU allows more resources to be allocated to it, so that it can be enabled between all models and possibly performed “a priori”, that is, precisely calculated before the moment of collision by analyzing the trajectories of physical objects. This ensures better fluidity and fidelity, with no delay in the reaction.
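The difference between the two approaches can be shown with two circles moving at constant velocity during one frame. This is a minimal sketch with hypothetical function names and made-up numbers, not code from any particular engine. The “a posteriori” test only looks at positions after the objects have moved, while the “a priori” test solves for the exact moment the two circles first touch: with relative position p, relative velocity v and combined radius R, it finds the smallest t in the frame for which |p + v t| = R.

```python
import math

def overlaps(p1, r1, p2, r2):
    """A posteriori test: are the two circles overlapping right now?"""
    return math.dist(p1, p2) <= r1 + r2

def time_of_impact(p1, v1, r1, p2, v2, r2, dt):
    """A priori test: earliest time in [0, dt] at which the circles touch,
    or None if they do not touch during this frame."""
    px, py = p2[0] - p1[0], p2[1] - p1[1]   # relative position
    vx, vy = v2[0] - v1[0], v2[1] - v1[1]   # relative velocity
    a = vx * vx + vy * vy
    b = 2.0 * (px * vx + py * vy)
    c = px * px + py * py - (r1 + r2) ** 2
    if c <= 0.0:
        return 0.0            # already touching at the start of the frame
    if a == 0.0:
        return None           # no relative motion, so the gap never closes
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None           # the trajectories never come close enough
    t = (-b - math.sqrt(disc)) / (2.0 * a)
    return t if 0.0 <= t <= dt else None

if __name__ == "__main__":
    dt = 0.1
    bullet, bullet_v, target = (0.0, 0.0), (100.0, 0.0), (5.0, 0.0)
    moved = (bullet[0] + bullet_v[0] * dt, bullet[1] + bullet_v[1] * dt)
    print("a posteriori, after the step:", overlaps(moved, 0.5, target, 0.5))
    print("a priori, time of impact:", time_of_impact(bullet, bullet_v, 0.5, target, (0.0, 0.0), 0.5, dt))
```

In this example the fast object ends the frame on the far side of the target, so the after-the-fact check reports no contact at all, while the trajectory-based check finds the impact partway through the frame. The extra arithmetic per pair of objects is the cost that additional compute resources would pay for.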

http://cdn.dualshockers.com/wp-content/uploads/2013/06/PS4CallofFish-670x376.jpg (http://cdn.dualshockers.com/wp-content/uploads/2013/06/PS4CallofFish.jpg)

World Simulation
This is a much less specific topic, as compute resources can easily be applied to almost every element behind the simulation of the game’s world. This doesn’t just include elements governed by physics, but also those driven by the artificial intelligence and by other factors.

NPCs populating a city with complex action and reaction patterns, fish in the water (yes, even something like the allegedly super-advanced fish from Call of Duty: Ghosts), birds in the sky, weather simulation, traffic flow and quite a lot of other details that can make the world more realistic and immersive can be offloaded to the GPU in order to use its resources and those of the CPU more efficiently. You can find a slightly old but interesting (and not too complex) paper illustrating one example among many here (http://isis.dia.unisa.it/papers/CASA09_EFS.pdf).

The features mentioned above cover a large variety of aspects of game development, and we don’t even yet know if they’re all Cerny is planning, as they could have been brought just as examples of a larger picture. One thing is for sure, though: the ability to flexibly use the CPU or the GPU for compute actions might save developers a lot of headaches in allocating CPU resources that on consoles always tend to be rather limited.

They also have the potential to give us worlds that not only look better, but also sound better, act more naturally on their own and react in more realistic ways to our actions, feeling more alive and immersive. Obviously, at least for now, this is all theory, but it’s a theory that I can’t wait to see applied a few years from now.

Leon_DiZ

02-07-2013, 01:50

Ooh, magnificent :P

Slay

02-07-2013, 01:59

I think it deserved its own thread. The article basically analyzes the changes Cerny made to the GPU so that it performs better as a GPGPU: he added one more bus and significantly enlarged the command pipe. The article also gives some examples of how this can have significant practical application.

Mokelle

02-07-2013, 10:12

From the looks of it, hardware has become so powerful (and power-hungry) by now that raw power no longer matters as much, and the game is played on the efficiency of the chips and of the system as a whole. This genius called Mark Cerny (seriously, how does Sony manage to fish these people out?) has designed a system built around smooth cooperation between the components and asynchronous processing, with as few bottlenecks and stalls as possible. In other words, a system in which, theoretically, every component can run at ~100% without one waiting on another (high efficiency).

Combined with easy programming, the ability to program at a low level on a closed system, plus the tools that Sony will provide (see the performance analyzer from the PS2 era), I believe all this will give the PS4 awesome results without needing the raw power of a top-end 200W GPU.

P.S. It is important to note that most, if not all, of what the article mentions could theoretically also be done by the Cell via its SPUs.

Slay

02-07-2013, 11:43

P.S. It is important to note that most, if not all, of what the article mentions could theoretically also be done by the Cell via its SPUs.
Indeed it can, but using the GPU is much more efficient, both in terms of cost/energy and in terms of performance: just 2 of the 18 CUs have more computing power than the Cell.

Mokelle

17-07-2013, 11:17

PS4: 14 things we learned at Develop 2013

http://www.guardian.co.uk/technology/gamesblog/2013/jul/15/ps4-develop-2013-playstation-sony

Much of what is mentioned there is already known, but something that caught my interest is the following:

"the GPU can handle lots of other tasks apart from graphics rendering, including physics simulation, collision detection, ray-casting for audio and decompression – all allegedly with little impact on those polygon and effects calculations"

In other words, moving calculations from the CPU to the GPU will not have a significant impact on the GPU's performance.

It will be really interesting to see how harmoniously this asynchronous architecture can work. The whole PS4 has been designed around this logic (especially the buses, the GPU modifications, the unified memory, etc.). I believe it is the big bet for this generation of consoles, which no longer aim for high specs but for high hardware efficiency, with as little idling, queuing and so on as possible. We will see whether it turns out to be a marketing game in practice, although I doubt it very much, because efficiency is not something digestible enough to be sold to the masses.

I had read in the past that the 360's GPU was about 60% efficient. For PCs I won't even go there, the number is much lower. If they manage to get the PS4 to 90%+, combined with the increase in power over the PS3, then I expect amazing results, on the level of Dark Sorcerer or Deep Down.

Slay

17-07-2013, 18:16

There is no such thing as "no impact", and of course if you spend resources on physics you may come up short elsewhere. It is just that Cerny has made changes to the GPU so that it runs this kind of code much more efficiently than a traditional off-the-shelf card. Unless the old rumors about 14+4 CUs are true, but if I'm not mistaken Cerny was asked about this a while ago and said that there is no such split, or something along those lines anyway.

Mokelle

17-07-2013, 18:47

^^It has indeed been confirmed that all the CUs are identical and can be used at each dev's discretion.

My own interpretation of how using the CUs for general purpose code "doesn't have a big impact on graphics" is that the GPU is never utilized at 100% and, as I mentioned earlier, a large percentage of its resources is sitting idle at any given moment.

So it is possible that Sony manages to exploit this 30-40% of idling resources through smart tools (performance analyzer) and smart hardware design, such as the 20 GB/s bus between CPU and GPU that bypasses the L1 and L2 caches, tags for command priority, 64 queues for better distribution of general purpose code on the GPU, unified GDDR5 whose 176 GB/s is shared according to the needs of the CPU and GPU, plus dedicated chips for video and audio decoding to take load off the system.

Of course, badly written code leaves no free resources anywhere, so once again the skill of each studio will play a big role.

psx3

17-07-2013, 18:58

They are identical in terms of ALUs. That is, it's not as if the 4 CUs would have extra ALUs, otherwise the 1.84 Tflops figure would be higher. 18 CUs x 64 ALUs = 1152 ALUs. The chip normally has 20 CUs, but 2 are disabled to improve yields (as with the 8th SPU of the Cell).
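The arithmetic behind the 1.84 Tflops figure can be checked in a couple of lines. Two numbers below are not stated in this thread and are assumptions for the sake of the calculation: an 800 MHz GPU clock, and 2 floating-point operations per ALU per clock (one fused multiply-add). `peak_tflops` is just a hypothetical helper name.

```python
def peak_tflops(cus=18, alus_per_cu=64, clock_hz=800e6, flops_per_alu_per_clock=2):
    """Theoretical single-precision peak. The clock speed and the
    2 FLOPs per ALU per clock (one multiply-add) are assumptions."""
    return cus * alus_per_cu * clock_hz * flops_per_alu_per_clock / 1e12

if __name__ == "__main__":
    print(f"ALUs with 18 CUs: {18 * 64}")
    print(f"18 CUs enabled: {peak_tflops(cus=18):.3f} TFLOPS")
    print(f"20 CUs (none disabled): {peak_tflops(cus=20):.3f} TFLOPS")
```

Under those assumptions the 18 active CUs come out at roughly 1.84 TFLOPS, which matches the quoted figure with 64 ALUs in every CU.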

The important change is what you mentioned about the 64 queues, together with the scheduling offered by the GCN architecture (which reduces the dependence on a good compiler), making it an ideal platform for GPGPU programming. That is how efficiency goes up.

The XBOX360's GPU may have been groundbreaking for 2005, but today it is considered ancient. Hence the low efficiency. The same goes for the Wii U's GPU, which is based on a VLIW architecture.

Mokelle

17-07-2013, 19:20

^^Yup, we agree completely. I mentioned the 360's GPU as the only reference point I know of, and of course I am talking about 60% efficiency only for the graphics code it can run. Modern GPGPUs are an entirely different chapter, but I don't think that with the limitations of PCIe they can do all that much with general purpose code (see TressFX), so unless they are exploited extremely well for graphics, we will again have plenty of resources going unused at any given moment.

Mokelle

09-08-2013, 22:25

I stumbled by chance upon the diagram below, which Nvidia showed in 2012 in a Kepler presentation.

http://www.nvidia.com/content/kepler-compute-architecture/images/emeai/large-video-hyper-q-2-en.jpg

This simple, schematic diagram shows how much headroom modern GPUs have for running general compute code (the Hyper-Q feature, in Nvidia's terms) with little or no impact on graphics processing. This is what we were discussing earlier and what Cerny had also mentioned. That is where all the emphasis has been placed on the PS4, and it remains to be seen how it will be implemented in practice...

psx3

09-08-2013, 23:40

The question is how easy this will be to do in practice... if it is as hard as programming the Cell, then very few games will take advantage of it.

Slay

10-08-2013, 13:44

I think there is a foundation for it through OpenCL.