Speeding Up PS1 Emulation with OMAP’s DSP

For a long time, most developers could not make good use of the DSP included in the OMAP architecture of the Pandora’s SoC. Sure, TI’s dspbridge driver had been in the Pandora firmware for a long time, and a few developers like MH-T or Hdonk tried to make use of it, but the key issue was always the latency of calling the DSP or high CPU usage. Then BSP came along with his deep knowledge of the architecture, and was able to package a kernel driver, a DSP component and an ARM library to make better use of it. Unfortunately, while the DSP can certainly relieve your CPU of certain tasks, it requires you to refactor your code and split it into parallel tasks that run concurrently with the main program thread. With the exception of a few demos by BSP, MH-T and Hdonk, no one had yet fully completed that kind of work. Notaz’s latest PCSXReARMed build now makes full use of that architecture.

His latest build uses the OMAP’s DSP mainly to accelerate the audio part of the emulation, but it took Notaz a while to find this use case.

Notaz: I’ve started playing with the DSP soon after bsp released his tools. I wrote some code to test things and get the hang of it and gave bsp some feedback about his library, some of which he incorporated into his code. After that I took a quick look at software I’ve released, but couldn’t think of any good use of the DSP at that time so moved to other projects.

Actually, Notaz had at first considered Mupen, the Nintendo 64 emulator, as a potential candidate for DSP optimization…

Notaz: Later in 2014 while running Mupen64plus I took a look at the profiler and saw that the Rice plugin could make good use of NEON in its vertex processing. After finishing that code and handing it over to ptitSeb, I had also looked at potentially moving the RSP sound list processing to the DSP, however I decided that it would be too complicated to do without breaking the emulator… and potential benefits for moving that part to DSP were too small. Fast forward some more months to some time in December I’ve decided to update PCSX ReARMed for its 4-year release anniversary. One of the (many) issues in the TODO list for that emulator were sound issues in some games. While debugging those I’ve noticed that some processing can be moved to another thread, so decided to do that after first releasing the anniversary release (r20) with fixes only.

And so he did. The speed gains were remarkable in some cases, with up to a 20% speed-up when the audio part of a PS1 game is rather complex (for example, many channels used at once, or numerous large samples playing simultaneously). Conversely, a game that uses CD Audio for its background music does not really benefit from it. However, there are still many games out there where the speed increase is noticeable, to say the least. Before we move into more details regarding the implementation, I wanted to share some benchmarks I ran on different games I own, using a special version of PCSXReARMed modified by Notaz to log the performance results to a text file.

Gran Turismo 1

  • Settings: Enhanced Resolution + Speed Hack / Frameskip OFF / 1000 MHz / Show FPS ON
  • Benchmark Method: Replay of Grand Valley East / 1st Lap
  • Measure: Average CPU activity, plotted with R’s boxplot function.
  • Pandora Model: 1 GHz
  • Variable: DSP OFF or ON

[Image: Gran Turismo 1]

This is what the performance in CPU utilization looked like on this track’s replay:

[Figure: boxplot of CPU usage in Gran Turismo 1, DSP OFF vs. DSP ON]

As you can see, the difference is quite noticeable (and actually statistically significant): on average about 5.7% less CPU utilization. This is only an average, but it makes a huge difference if you intend to play with frameskip OFF. You can see that there are peaks of CPU usage in that particular replay, going up to 100% when the DSP is not used. When this occurs, you get slowdowns, dropped frames and dropped audio, but if you use the DSP, you get an almost entirely frameskip-free experience.

[Figure: CPU usage over time in Gran Turismo 1, DSP OFF vs. DSP ON]

Only at one point in the circuit does the CPU fail to keep up, but as shown on the graph this is clearly an outlier. So while 5% may not sound like a huge deal, in this particular case it’s quite significant.

Gran Turismo 2

  • Settings: Enhanced Resolution + Speed Hack / Frameskip OFF / 1000 MHz / Show FPS ON
  • Benchmark Method: Replay of Laguna Seca / 1st Lap
  • Measure: Average CPU activity, plotted with R’s boxplot function.
  • Pandora Model: 1 GHz
  • Variable: DSP OFF or ON

[Image: Gran Turismo 2]

GT2 is noticeably more demanding than the first game. You can see that the developers were really trying to push the limits of the PS1, and the emulator consumes more CPU accordingly (the minimum CPU usage in the replay is about 8% higher than GT1’s minimum).

[Figure: boxplot of CPU usage in Gran Turismo 2, DSP OFF vs. DSP ON]

Here the gains from the DSP are even larger: on average you get a 6.4% reduction in CPU usage with the DSP, in high-res mode.

[Figure: CPU usage over time in Gran Turismo 2, DSP OFF vs. DSP ON]

While the DSP is very effective in this game as well, there’s just too much to do at once when you turn frameskip off, and several times you will reach 100% CPU utilization even with the DSP on. But it’s worth noting that, again, this happens far more often when the DSP is OFF. What the timeplot above shows is that the DSP gains are not always consistent. There are stretches where the two lines run parallel to each other, and others where there’s almost no separation between the two tests. This probably happens when the in-game audio workload is low. Plotting the difference between the two timeplots shows that the DSP gains can be locally pretty high: 25% of the time the CPU reduction exceeds 8.5%, and it peaks at 17% at one point during the bench.
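For readers who want to run the same kind of comparison on their own logs, here is a minimal Python sketch of the difference summary described above. The original analysis was done in R; the function name `summarize_gain`, the variable names and above all the sample values are made up for illustration and are not the benchmark data. The sketch also assumes that both runs were sampled at the same points of the replay.

```python
from statistics import mean, quantiles

def summarize_gain(cpu_off, cpu_on):
    """Summarize the per-sample CPU reduction (in percentage points) with the DSP on."""
    if len(cpu_off) != len(cpu_on):
        raise ValueError("both runs must have the same number of samples")
    diff = [off - on for off, on in zip(cpu_off, cpu_on)]
    # Quartiles of the differences; q3 is the gain exceeded 25% of the time.
    q1, median, q3 = quantiles(diff, n=4, method="inclusive")
    return {"mean": mean(diff), "median": median, "q3": q3, "peak": max(diff)}

# Made-up CPU usage samples (%), NOT the measurements from this article.
cpu_dsp_off = [82.0, 90.0, 100.0, 88.0, 95.0]
cpu_dsp_on = [78.0, 81.0, 98.0, 80.0, 83.0]

summary = summarize_gain(cpu_dsp_off, cpu_dsp_on)
print(summary)
```

The upper quartile (`q3`) is the figure that corresponds to a statement like "25% of the time the reduction exceeds X", and `peak` to the largest local gain.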

Chrono Cross

  • Settings: Enhanced Resolution + Speed Hack / Frameskip OFF / 1000 MHz / Show FPS ON
  • Benchmark Method: Running demo after the title screen
  • Measure: Average CPU activity, plotted with R’s boxplot function.
  • Pandora Model: 1 GHz
  • Variable: DSP OFF or ON

[Screenshot: Chrono Cross]

This game was actually suggested by Notaz: it does not stream music but uses a lot of samples and channels, so you would expect significant differences there.

[Figure: boxplot of CPU usage in Chrono Cross, DSP OFF vs. DSP ON]

The CPU usage reduction is actually pretty good, with almost 6% difference on average. Not as much as what you get in GT2, but still pretty good. You can see in the above boxplot that CPU usage swings between lows and highs all the time depending on the complexity of the scene to render, and the running demo has a number of very different scenes: it’s far from being as homogeneous as a replay in Gran Turismo, hence the great number of outliers in the data.

[Figure: CPU usage over time in Chrono Cross, DSP OFF vs. DSP ON]

While there is some overlap especially in the CPU peaks, most of the time you can clearly see both CPU usage lines running in parallel. The effect of the DSP is very, very clear in the Chrono Cross test.

Tekken 2

  • Settings: Normal Resolution / Frameskip OFF / 1000 MHz / Show FPS ON
  • Benchmark Method: Demo fight of the game
  • Measure: Average CPU activity, plotted with R’s boxplot function.
  • Pandora Model: 1 GHz
  • Variable: DSP OFF or ON

[Image: Tekken 2]

That was a tricky one in terms of performance evaluation.

[Figure: boxplot of CPU usage in Tekken 2, DSP OFF vs. DSP ON]

Here the results are less significant. I used the normal (non-enhanced) resolution mode because there was virtually no difference in Enhanced resolution. The gains in CPU usage here are about 4.3%. It seems that some of the audio in the game is streamed, and the other audio channels are only used for sound effects like punches and kicks. That would explain why the DSP can only help so much in such a game.

Overall the performance gains provided by the DSP are in line with what Notaz expected in the first place:

Notaz: It’s more or less as expected, before starting I’ve checked how much time can be saved in the best case by disabling SPU processing completely and checked it on games that can run without SPU. Luckily I think I ended pretty close to that best case.

DSP Implementation

As mentioned earlier, the DSP implementation takes care of the audio part of the emulator in the latest version. In order to do that, Notaz had to modify some aspects of the emulator:

Notaz: I had to change the processing to happen less often and in larger batches, to avoid needing to sync with DSP too often. That however caused accuracy issues (like missed sounds), so the emulator was changed to sync when it’s needed, like on important SPU register writes, instead of just doing it in small periods like before. I think this should improve overall accuracy, but may have introduced some possible corner cases that are not correctly handled.
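The idea of processing in large batches while syncing on important register writes can be sketched with a toy model. This is not code from PCSX ReARMed: the class `BatchedSpu`, its single `volume` "register" and the cycle numbers are all hypothetical, and the real emulator is of course far more involved.

```python
class BatchedSpu:
    """Toy model: audio is generated lazily, in batches, up to a requested cycle."""

    def __init__(self):
        self.volume = 1       # stands in for an "important" SPU register
        self.done_until = 0   # emulated cycle up to which output already exists
        self.output = []      # one (start_cycle, end_cycle, volume) tuple per batch

    def sync(self, cycle):
        """Process everything between the last sync point and `cycle` in one batch."""
        if cycle > self.done_until:
            self.output.append((self.done_until, cycle, self.volume))
            self.done_until = cycle

    def write_register(self, cycle, volume):
        """Sync first, so that audio before the write still uses the old value."""
        self.sync(cycle)
        self.volume = volume

    def end_of_frame(self, cycle):
        """Without register writes, the whole frame is handled as a single batch."""
        self.sync(cycle)

spu = BatchedSpu()
spu.write_register(300, 5)  # an important write in the middle of the frame
spu.end_of_frame(1000)      # the rest of the frame is one large batch
print(spu.output)           # [(0, 300, 1), (300, 1000, 5)]
```

If the write did not force a sync, the whole frame would be rendered with the new value, which is the kind of accuracy issue mentioned in the quote.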

In his post regarding the DSP implementation, Notaz mentioned that most of the issues with the DSP usage “were related to syncing the caches between ARM and DSP”. This is actually very much linked to how the hardware is designed in the Pandora’s OMAP.

Notaz: Some time ago processors used to be connected directly to memory (RAM) and there was no cache. As time passed faster and faster processors were created, and RAM sizes increased, however RAM could not keep up with processor speeds, because it takes time to address the right chip on the RAM module and select correct location inside the chip. Today’s computers (and mobile devices too) can run at least an order of magnitude faster than RAM can supply data, especially if access is random (today’s RAM, despite its name, can do sequential access faster than random). For that reason cache was invented, which is a small amount of memory that’s inside of the CPU.

To make it easier for the CPU to address the cache and because RAM can handle sequential reads/writes faster, the cache is split in blocks called cachelines. On pandora both ARM and the DSP have 64-byte cachelines (128-byte for DSP in some cases).

You can see below the overall architecture of the DM3730 powering the 1 GHz Pandora. The ARM core is a Cortex-A8, and it can communicate through a common interface with the DSP present in the IVA2 module (at the top of the diagram). As Notaz mentioned, both the ARM and the DSP have their own integrated caches, organized in 64-byte cachelines.

[Figure: AM37x block diagram]

The issue comes when the caches’ contents are not synchronized across the ARM and the DSP.

Notaz: So basically how this works is that when the CPU (ARM or DSP) needs to process some data in RAM, even if it’s only 1 byte, it reads the whole cacheline (64 bytes) to cache and keeps it there for fast processing. The CPU keeps things in cache as long as it can to speed up processing with that data in future, and doesn’t send it back to RAM to save time, unless a special command is executed or the CPU runs out of space in cache, in which case processed data is sent to RAM to make room for other data.

However while data is in cache, you can get in a situation where the RAM contains the same old data that was there before, and at the same time the ARM has some other version in its cache, the DSP has its own version in its cache. So this presents some extra complexity for the programmer, but it can be solved by being careful and using special commands to forcefully update the RAM or re-read it to exchange data between the CPUs (ARM and DSP) as intended. So one of the harder problems was that I had some data intended for only one of CPUs in those shared cachelines by mistake, so had one CPU destroying other’s data.

You can see that particular issue described in the documentation excerpt below, which covers cacheline handling on the C64x DSP.

[Figure: excerpt from the C64x DSP documentation on cacheline handling]
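The "one CPU destroying the other’s data" problem can be reproduced with a toy model of two write-back caches sharing the same RAM. This is only a sketch under simplifying assumptions: the class `WriteBackCache` and the function `run` are hypothetical, whole lines are always written back, and none of the real cache maintenance commands are modelled.

```python
LINE = 64  # cacheline size in bytes, as quoted above

class WriteBackCache:
    """Toy write-back cache: whole lines are copied in, modified, and copied out."""

    def __init__(self, ram):
        self.ram = ram
        self.lines = {}  # line index -> private copy of that line

    def _line(self, addr):
        index = addr // LINE
        if index not in self.lines:  # miss: fetch the whole 64-byte line
            start = index * LINE
            self.lines[index] = bytearray(self.ram[start:start + LINE])
        return self.lines[index]

    def write(self, addr, value):
        self._line(addr)[addr % LINE] = value  # RAM is not updated yet

    def writeback(self):
        """Flush every cached line to RAM, overwriting all 64 bytes of each."""
        for index, data in self.lines.items():
            self.ram[index * LINE:(index + 1) * LINE] = data
        self.lines.clear()

def run(arm_addr, dsp_addr):
    """Let the 'ARM' and the 'DSP' each write one byte, then flush both caches."""
    ram = bytearray(256)
    arm, dsp = WriteBackCache(ram), WriteBackCache(ram)
    arm.write(arm_addr, 0xAA)
    dsp.write(dsp_addr, 0xDD)
    arm.writeback()
    dsp.writeback()  # the DSP's stale copy of the line goes out last
    return ram[arm_addr], ram[dsp_addr]

print(run(0, 4))   # same cacheline: the ARM's byte is lost
print(run(0, 64))  # separate cachelines: both writes survive
```

In the first call both bytes live in the same 64-byte line, so the DSP’s write-back overwrites the ARM’s update with its stale copy. Keeping data that belongs to only one of the CPUs on its own cacheline, as in the second call, avoids the clash.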

For each frame rendered, the DSP and the ARM need to work in sync to ensure each of their tasks is handled on time. This is driven by the ARM process itself, checking the DSP progress from time to time:

Notaz: On each frame ARM decides how far DSP should go, so it can’t “run away” too much, if DSP reaches the target early it goes to sleep. There are some variables in shared memory that DSP updates according to its progress. There is also a function in bsp’s lib to wait for the DSP to finish without wasting ARM CPU time, in case the DSP lags behind. Once the ARM comes back after doing some other work (like MIPS emulation or rendering), in almost all cases the DSP has already done its SPU work and is back to sleep.
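The handshake described in this quote (a target set by the ARM, progress reported by the DSP, sleeping instead of busy-waiting) can be imitated with two ordinary threads. The sketch below is an analogy only: it uses Python’s `threading.Condition` rather than bsp’s library, and `SharedState`, `dsp_worker`, `run_frames` and the step size are hypothetical names and values.

```python
import threading

class SharedState:
    """Stand-in for the variables kept in memory shared by the ARM and the DSP."""

    def __init__(self):
        self.cond = threading.Condition()
        self.target = 0     # written by the "ARM": how far the worker may go
        self.progress = 0   # written by the "DSP": how far it has got
        self.quit = False

def dsp_worker(state):
    while True:
        with state.cond:
            # Sleep until there is work to do; never run past the target.
            state.cond.wait_for(lambda: state.quit or state.progress < state.target)
            if state.quit:
                return
            goal = state.target
        # ... the SPU work for this range would happen here, outside the lock ...
        with state.cond:
            state.progress = goal
            state.cond.notify_all()

def run_frames(frames, step=735):
    """Drive the worker for a number of frames; `step` is an arbitrary amount of work."""
    state = SharedState()
    worker = threading.Thread(target=dsp_worker, args=(state,))
    worker.start()
    for _ in range(frames):
        with state.cond:
            state.target += step  # the ARM decides how far the DSP should go
            state.cond.notify_all()
        # ... CPU emulation and rendering would happen here ...
        with state.cond:          # block without busy-waiting if the worker lags
            state.cond.wait_for(lambda: state.progress >= state.target)
    with state.cond:
        state.quit = True
        state.cond.notify_all()
    worker.join()
    return state.progress

print(run_frames(3))
```

The worker never advances beyond `target`, and the main thread only blocks when the worker has not caught up yet, which mirrors the behaviour Notaz describes.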

Beyond the tools BSP initially developed to exploit the DSP, Notaz needed almost nothing else to build this additional functionality into PCSXReARMed:

Notaz: There were no special tools used in development. In some cases I used gdb to look what’s going on on ARM’s side, for the DSP side I used bsp’s printf function and extra debug code to put internal variables on shared memory for ARM to see. Cache issues were the most time consuming to debug.

While in this particular use case the DSP was dedicated to the audio part, it could have been used for other purposes, as you can easily imagine:

Notaz: It should be possible to move parts of rendering work, or maybe even whole of it to the DSP, but it’s real difficult. Right now I don’t have any plans for it.

Note that you are unlikely to see such feats (DSP usage) on the upcoming Pyra (the Pandora successor designed to run with an OMAP5 from TI), at least for current emulators:

Notaz: On the Pyra, there is nothing that could be easily used, except having 2 cores might be helpful (DraStic can already take advantage of multiple cores, for example). A specially designed emulator could perhaps make use of hardware virtualization, 2D accelerators and even the DSP, but it’s not really practical. Existing emulators (that the Pandora can already handle) don’t really need those things, and newer systems to emulate are way too complex to write an emulator designed around OMAP5’s hardware for them.

While Notaz’s work on the Pyra is progressing in parallel, he may not be completely done with his DSP experiments…

Notaz: At the moment I don’t have any other project using the DSP, but I still hope to find some more use for it.

I’m all for more experiments! This just shows how far you can go even on hardware which is quite obsolete in 2015.

Many thanks to Notaz for taking the time to make this article possible.
