Box86: Run X86 code (and games!) on ARM

This is a guest post by PtitSeb from the Open Pandora/Pyra community. Box86 is a new Linux Usermode X86 Emulator. While still young (read: full of bugs) and missing a JIT (read: slow), it’s already able to run a few games, some of them at full speed even on a slow device like the Pandora (more games run perfectly on powerful devices like the ODroid XU4).

https://i.ytimg.com/vi/PX_Oy-lSix8/maxresdefault.jpg
the Odroid XU4

The secret to speed is that most of the system libraries games use are not emulated but run natively. Why emulate the function “memcpy” when you can use the native version instead? Why not also use the native OpenGL functions? Or SDL?

And that’s exactly what is happening with Box86. This leads to smooth gameplay on Airline Tycoon Deluxe, even with an interpreted CPU emulation!

Why a new emulator?

There are more and more Linux machines running on ARM processors. There are of course Raspberry PIs, but also ODroid boards, BananaPIs, OrangePIs… and ultra portable devices like the Pandora (and the upcoming Pyra). Let’s not forget laptops like the LibrePC or Chromebooks. This is an interesting platform, with relatively cheap CPUs (compared to their x86 counterparts) and broad software support: most Open Source Software is usually available. However, when it comes to running closed source software (like games), that’s a completely different story, and most of the time you are stuck with no good solution.

For such a piece of software you need some kind of x86 emulator, and probably some OpenGL support, yet most ARM boards only have GLES hardware and software support.

For the OpenGL part, I have already worked on gl4es, which converts OpenGL calls to OpenGLES ones. But for the x86 part, there are not many options. “Qemu Userspace emulation” seems to be the only open source solution, and on the proprietary side Exagear has stopped offering its application. So I thought: “Hey, now that I have gl4es, why not create some x86 emulation that can push gl4es to its limits!”

Enter Box86

The main idea was, at first, to be able to run some commercial software using gl4es for the OpenGL part. After doing some work with FNA games (they use C# and run on linux, so no need for a CPU emulator, and I had a few success stories like Stardew Valley, FEZ and several others), I wanted to go to the next level, with a real x86 emulator.

My first plan was to modify WINE: use WINE for ARM to get access to all the native libraries (including gl4es) and add an x86 emulator (because Wine Is Not an Emulator) to emulate only the minimum required. That strategy required writing a special wrapper for when the guest program calls a system function, as you need to jump from the x86 world to the ARM world, and back to the x86 world when the function is complete. There are of course some difficulties in doing that, but I got started with it. The main issue is that the WINE scene has become very active and its source code is quite complex, while all the x86 emulator and wrapper code still needed to be written (and tested). So I decided to put that project on hold and start a “simpler” one, or at least one that I would write from beginning to end. And so Box86 was born: a Linux Userspace x86 Emulator (it could have been called LUXE!) in the spirit of Dosbox, as “plug’n play” as possible (not requiring a complete x86 linux chroot somewhere).

Also, now that Wine-Hangover is out, I may never go back to my initial idea of WineBox, as this seems to be based on the same concept.

A First Program

The first step was to create the first program Box86 would execute. That program is this one:

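#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char msg[] = "Hello World!\n";
    //syscall(4, STDOUT_FILENO, msg, sizeof(msg)-1);
    asm (
        "movl $4, %%eax \n"   // i386 syscall number 4: write
        "movl $1, %%ebx \n"   // file descriptor 1: stdout
        "movl %0, %%ecx \n"   // address of the message
        "movl $13, %%edx \n"  // message length, same as sizeof(msg)-1
        "int $0x80 \n"        // enter the kernel
        :
        :"r" (msg)
        :"%eax","%ebx","%ecx","%edx","memory"  // "memory": the kernel reads msg through the pointer
    );
    return 0;
}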

It’s a simple “hello world”, with no specific library, just using a syscall and exit. From there on I could start writing Box86. First a custom ELF Loader. Then a basic x86 emulator, with special handling of the syscall instruction (“int 0x80”) to use an actual native syscall. And after a few days of work, I finally got the “Hello World” message on the Pandora screen. Success!
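
To make the idea concrete, here is a minimal sketch of an interpreter loop that forwards “int 0x80” to a native syscall. It is not Box86’s actual code: the guest_t structure, the flat 128-byte guest memory, the function names and the use of HLT as a stop marker are all assumptions made for illustration, and only two opcodes are decoded.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical guest state: four registers, a program counter, flat memory. */
enum { EAX, ECX, EDX, EBX, NREGS };   /* same order as the x86 register encoding */

typedef struct {
    uint32_t reg[NREGS];
    uint32_t eip;
    uint8_t  mem[128];
} guest_t;

/* Read a little-endian 32-bit immediate at the current instruction pointer. */
static uint32_t fetch32(guest_t *g)
{
    assert(g->eip <= sizeof g->mem - 4);
    const uint8_t *p = g->mem + g->eip;
    g->eip += 4;
    return p[0] | p[1] << 8 | p[2] << 16 | (uint32_t)p[3] << 24;
}

/* "int 0x80": forward the i386 syscall to the host. Only write (4) is handled. */
static void guest_syscall(guest_t *g)
{
    if (g->reg[EAX] == 4) {
        uint32_t buf = g->reg[ECX], len = g->reg[EDX];
        assert(buf <= sizeof g->mem && len <= sizeof g->mem - buf);
        /* Guest addresses are offsets into g->mem in this sketch. */
        g->reg[EAX] = (uint32_t)write((int)g->reg[EBX], g->mem + buf, len);
    } else {
        g->reg[EAX] = (uint32_t)-ENOSYS;
    }
}

/* Interpret until HLT (0xF4). Returns 0 on HLT, -1 on an unsupported opcode. */
int run(guest_t *g)
{
    for (;;) {
        assert(g->eip < sizeof g->mem);
        uint8_t op = g->mem[g->eip++];
        if (op >= 0xB8 && op <= 0xBB) {          /* mov r32, imm32 */
            g->reg[op - 0xB8] = fetch32(g);
        } else if (op == 0xCD) {                 /* int imm8 */
            assert(g->eip < sizeof g->mem);
            if (g->mem[g->eip++] != 0x80) return -1;
            guest_syscall(g);
        } else if (op == 0xF4) {                 /* hlt: stop marker for this sketch */
            return 0;
        } else {
            return -1;
        }
    }
}

/* Load a hand-assembled equivalent of the first test program. */
void load_hello(guest_t *g)
{
    static const uint8_t code[] = {
        0xB8, 4, 0, 0, 0,       /* mov eax, 4     (write)  */
        0xBB, 1, 0, 0, 0,       /* mov ebx, 1     (stdout) */
        0xB9, 0x40, 0, 0, 0,    /* mov ecx, 0x40  (msg)    */
        0xBA, 13, 0, 0, 0,      /* mov edx, 13    (length) */
        0xCD, 0x80,             /* int 0x80                */
        0xF4                    /* hlt                     */
    };
    memset(g, 0, sizeof *g);
    memcpy(g->mem, code, sizeof code);
    memcpy(g->mem + 0x40, "Hello World!\n", 13);
}
```

Calling load_hello(&g) followed by run(&g) goes through the same four register loads as the inline assembly above, then hands the write to the host kernel. A real emulator would of course load the code from an ELF file and decode far more instructions.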

Then it was time for the tricky part: to use a native function from within an emulated program. So it was time for the second test:

#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char msg[] = "Hello World!\n";
    syscall(4, STDOUT_FILENO, msg, sizeof(msg)-1);
    return 0;
}

This is similar to the first program, but this time using a function call. Using a special instruction to create the jump, a wrapper to carry the function’s parameters from the x86 world to the ARM world, and the custom ELF loader to link to wrapper functions instead of x86 functions, the second test was soon printing “hello world” too!
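
The wrapper idea can be sketched roughly as follows. This is an illustration under assumptions, not Box86’s real mechanism: the cpu_t structure, the wrapper names and the bridges table are hypothetical, the lookup is done by symbol name rather than through a special instruction, and guest pointers are treated as offsets into a small guest memory array.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical emulated CPU: only what a cdecl call needs. */
typedef struct {
    uint32_t eax;        /* return value register */
    uint32_t esp;        /* guest stack pointer (offset into mem) */
    uint8_t  mem[128];   /* flat guest memory */
} cpu_t;

/* Read the n-th 32-bit cdecl argument; [esp] itself holds the return address. */
static uint32_t stack_arg(const cpu_t *c, unsigned n)
{
    uint32_t addr = c->esp + 4 + 4 * n;
    assert(addr <= sizeof c->mem - 4);
    const uint8_t *p = c->mem + addr;
    return p[0] | p[1] << 8 | p[2] << 16 | (uint32_t)p[3] << 24;
}

typedef void (*wrapper_t)(cpu_t *c, void (*native)(void));

/* Wrapper for native functions shaped like: int f(int) */
static void wrap_i_i(cpu_t *c, void (*native)(void))
{
    int (*fn)(int) = (int (*)(int))native;
    c->eax = (uint32_t)fn((int32_t)stack_arg(c, 0));
}

/* Wrapper for: size_t f(const char *), translating the guest pointer. */
static void wrap_u_p(cpu_t *c, void (*native)(void))
{
    size_t (*fn)(const char *) = (size_t (*)(const char *))native;
    uint32_t gptr = stack_arg(c, 0);
    assert(gptr < sizeof c->mem);
    c->eax = (uint32_t)fn((const char *)c->mem + gptr);
}

/* One entry per bridged symbol: which wrapper to use, which native to call. */
typedef struct {
    const char *name;
    wrapper_t   wrap;
    void      (*native)(void);
} bridge_t;

static const bridge_t bridges[] = {
    { "abs",    wrap_i_i, (void (*)(void))abs    },
    { "strlen", wrap_u_p, (void (*)(void))strlen },
};

/* What the emulator would do when guest code calls a bridged symbol. */
int call_bridge(cpu_t *c, const char *symbol)
{
    for (size_t i = 0; i < sizeof bridges / sizeof bridges[0]; i++) {
        if (strcmp(bridges[i].name, symbol) == 0) {
            bridges[i].wrap(c, bridges[i].native);
            return 0;
        }
    }
    return -1;   /* not bridged: would have to be emulated instead */
}
```

Grouping wrappers by signature rather than by function keeps the amount of hand-written glue small: many library functions share the same argument shape, so one wrapper serves all of them.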

More tests, more functions, more instructions!

The Proof Of Concept stage was over, and it was time to expand Box86 with more tests, more wrapped functions and more x86 instructions. Remember my initial objective was to use gl4es, so the goal I set this time was to run “sdlgears”. It’s akin to the famous glxgears, using SDL instead of GLX to create the window and the OpenGL context. It’s more useful to target SDL (used in numerous commercial games, and similar to SDL2 used in even more games) than to use GLUT.

So I started working on all required functions. I needed to add floating point support, meaning x87 emulation. That one is a bit of an odd co-processor. It is now included in all x86 CPUs, but there was a time (back in pre-Pentium days) when it was an optional, separate co-processor. What is special about x87 is its use of a virtual stack. On x86 (or ARM or PPC), main registers have a name (like EAX or R0) and you can access any register in mostly any way you want. On x87 you have 8 registers of 80 bits, but you don’t access them directly; instead, you use a stack.

So at first the stack is empty, and when you load a value, you push it onto the stack. Let’s say you want to add two doubles that are somewhere in memory, and store the sum somewhere else. You load the first value, then the second. Your stack now holds 2 elements. Then you add both values using a single instruction that adds the top-of-stack value to the second one, pops the stack and leaves the result in the new top. Finally you use another instruction that stores the top of the stack to memory and pops it (the stack is now empty).
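
The sequence just described (load, load, add and pop, store and pop) can be modelled in a few lines. This is a simplified sketch, not Box86’s emulation code: the x87_t type and function names are made up for the example, registers are plain doubles instead of 80-bit values, and stack overflow or underflow simply trips an assert.

```c
#include <assert.h>

/* Simplified x87 register stack: 8 slots addressed relative to a moving top. */
typedef struct {
    double   st[8];   /* real registers are 80-bit; double is a simplification */
    unsigned top;     /* physical index of ST(0); decremented on push */
    unsigned depth;   /* number of live entries */
} x87_t;

void x87_init(x87_t *f) { f->top = 0; f->depth = 0; }

/* ST(i) is the i-th entry counted from the current top of the stack. */
static double *st(x87_t *f, unsigned i) { return &f->st[(f->top + i) & 7u]; }

static void pop(x87_t *f)
{
    assert(f->depth > 0);
    f->top = (f->top + 1u) & 7u;
    f->depth--;
}

/* FLD: push a value; it becomes the new ST(0). */
void fld(x87_t *f, double v)
{
    assert(f->depth < 8);
    f->top = (f->top - 1u) & 7u;
    f->depth++;
    *st(f, 0) = v;
}

/* FADDP: ST(1) = ST(1) + ST(0), then pop, so the sum ends up in the new ST(0). */
void faddp(x87_t *f)
{
    assert(f->depth >= 2);
    *st(f, 1) += *st(f, 0);
    pop(f);
}

/* FSTP: store ST(0) to memory, then pop. */
void fstp(x87_t *f, double *dst)
{
    *dst = *st(f, 0);
    pop(f);
}
```

Adding two doubles a and b then reads fld(&f, a); fld(&f, b); faddp(&f); fstp(&f, &sum); and the stack is empty again afterwards.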

After having some of the x87 opcodes in place, it was time to wrap the “printf” function. This is a function using a variable set of arguments, and it needed special care. After that, I had to add a few SDL functions, and OpenGL functions… And I was able to launch sdlgears.

That one was the first “real” program (not a specific test I came up with) to run using Box86.

Still need more functions

This done, I was thinking that all fundamental bricks were in place, and all I needed was simply to write more wrapped functions to get real games to run…

Most games use C++ nowadays. And C++ libraries (like “libstdc++”) cannot be wrapped like C libraries (because the vtable needs to be altered), so I added support for emulated libs to be able to launch more games.

Of course this is when the real trouble started. First, I realized my custom ELF Loader was far too simplistic, and the various overloads occurring when loading a game and its libraries were not always correct (and still are not, it seems).

Then I realized there are more callbacks than I thought. Callbacks are functions that you pass as arguments to other functions. The issue here is that the called functions live in the native (so ARM) world while the callback comes from the program, so it is x86 code. Sometimes, callbacks are not really explicit either.

Take SDL_RWops. This is a structure of 4 callbacks used for file access. When you use SDL_LoadWAV(…) to load a wav file using its file name, you are in fact creating an SDL_RWops using SDL_RWFromFile(…) and then calling SDL_LoadWAV_RW(…). So I needed to develop a solid mechanism to support callbacks, and one that is easy to use, because callbacks are everywhere and in numerous libs.
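
One possible way to bridge a callback, sketched here purely as an illustration, is a small pool of native trampolines: each trampoline remembers one guest address and re-enters the emulator when the native library calls it. Nothing below is taken from Box86. emu_call2 is a stand-in that fakes the interpreter with two hard-coded “guest functions” at the made-up addresses 0x1000 and 0x2000, and the pool size is arbitrary. The libc qsort function is used as the native caller because its comparator has no user-data argument, which is exactly the awkward case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef int (*native_cmp_t)(const void *, const void *);

/* Stand-in for "run the interpreter at guest_eip with two arguments".
   The two addresses are pretend guest comparators, not real x86 code. */
static int emu_call2(uint32_t guest_eip, const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    switch (guest_eip) {
    case 0x1000: return (x > y) - (x < y);   /* pretend guest code: ascending  */
    case 0x2000: return (y > x) - (y < x);   /* pretend guest code: descending */
    default:     assert(!"unknown guest address"); return 0;
    }
}

#define SLOTS 4
static uint32_t slot_eip[SLOTS];   /* 0 means the slot is free */

/* Each trampoline is a genuine native function bound to one slot. */
#define TRAMPOLINE(n) \
    static int tramp##n(const void *a, const void *b) \
    { return emu_call2(slot_eip[n], a, b); }
TRAMPOLINE(0) TRAMPOLINE(1) TRAMPOLINE(2) TRAMPOLINE(3)

static const native_cmp_t tramp[SLOTS] = { tramp0, tramp1, tramp2, tramp3 };

/* Return a native function pointer that will invoke the guest callback.
   Reuses the slot if the guest address is already bridged; NULL if full. */
native_cmp_t bridge_callback(uint32_t guest_eip)
{
    assert(guest_eip != 0);
    for (size_t i = 0; i < SLOTS; i++)
        if (slot_eip[i] == guest_eip) return tramp[i];
    for (size_t i = 0; i < SLOTS; i++)
        if (slot_eip[i] == 0) { slot_eip[i] = guest_eip; return tramp[i]; }
    return NULL;
}
```

A wrapper for qsort would then pass bridge_callback(guest_address) to the native qsort instead of the raw x86 address. When a library does offer a user-data pointer, the guest address can travel there instead and no pool is needed.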

Along with the addition of yet more functions, I stumbled upon more specific cases, like functions that have a structure in the parameters list instead of pointers, or functions returning a structure…

I’m not finished dealing with all special cases, but I get more and more applications to start and run. Check below…

World of Goo
FTL
Airline Tycoon
NOT A HERO
BIT TRIP RUNNER

Debugging the beast

The previous screenshots look nice, and it’s always nice to have a working program. But that’s not always the case.

Debugging Box86 can be tricky. There are many areas where things can go wrong: it can come from the ELF Loader, from some wrapping mechanism, from a typo or error in a wrapped function, or from bugs in the CPU Emulator. Finding the source of a bug is complicated!

Box86 doesn’t (yet?) include a debugger, but it can trace what it is currently executing, or which wrapped functions are called.

Here is a small sample of the trace Box86 can generate:

EAX=00000018 ECX=00000000 EDX=0000009a EBX=00000001 FLAGS=-CPA-S
ESP=404adb8c EBP=404adbc8 ESI=000000ff EDI=00000000 EIP=0816a510 55 push ebp => STACK_TOP: 0x8126f9e
EAX=00000018 ECX=00000000 EDX=0000009a EBX=00000001 FLAGS=-CPA-S
ESP=404adb88 EBP=404adbc8 ESI=000000ff EDI=00000000 EIP=0816a511 89 E5 mov ebp, esp
EAX=00000018 ECX=00000000 EDX=0000009a EBX=00000001 FLAGS=-CPA-S
ESP=404adb88 EBP=404adb88 ESI=000000ff EDI=00000000 EIP=0816a513 53 push ebx
EAX=00000018 ECX=00000000 EDX=0000009a EBX=00000001 FLAGS=-CPA-S
ESP=404adb84 EBP=404adb88 ESI=000000ff EDI=00000000 EIP=0816a514 83 EC 14 sub esp, 0x14
EAX=00000018 ECX=00000000 EDX=0000009a EBX=00000001 FLAGS=??????
ESP=404adb70 EBP=404adb88 ESI=000000ff EDI=00000000 EIP=0816a517 0F B6 45 10 movzx eax, byte ptr [ebp+0x10]
EAX=00000001 ECX=00000000 EDX=0000009a EBX=00000001 FLAGS=??????
ESP=404adb70 EBP=404adb88 ESI=000000ff EDI=00000000 EIP=0816a51b C7 04 24 00 00 00 00 mov dword ptr [esp], 0x00
EAX=00000001 ECX=00000000 EDX=0000009a EBX=00000001 FLAGS=??????
ESP=404adb70 EBP=404adb88 ESI=000000ff EDI=00000000 EIP=0816a522 89 44 24 08 mov [esp+0x08], eax
EAX=00000001 ECX=00000000 EDX=0000009a EBX=00000001 FLAGS=??????
ESP=404adb70 EBP=404adb88 ESI=000000ff EDI=00000000 EIP=0816a526 8B 45 08 mov eax, [ebp+0x08]
EAX=0000009a ECX=00000000 EDX=0000009a EBX=00000001 FLAGS=??????
ESP=404adb70 EBP=404adb88 ESI=000000ff EDI=00000000 EIP=0816a529 89 44 24 04 mov [esp+0x04], eax
EAX=0000009a ECX=00000000 EDX=0000009a EBX=00000001 FLAGS=??????
ESP=404adb70 EBP=404adb88 ESI=000000ff EDI=00000000 EIP=0816a52d E8 5E FC FF FF call 0x0816A190

It’s a small sample; the issue is that a game quickly executes thousands of millions of instructions (yes, literally), so a full trace is not always the best way to debug.

When World Of Goo first launched, I tried out a game on the Pandora, and while the game was somewhat slow, something else was wrong. Do you see it?

Some of the numbers were not correctly displayed!

It’s more obvious something is really, really wrong when you complete a level.

Using the trace showing the wrapped function calls, I noticed there were some calls to “vswprintf” (that function prints a formatted wide-char string to a wide-char buffer) before the OpenGL functions.

And then I suddenly realized where my bug was: stack alignment differs between x86 and ARM, and I had written a function to realign arguments when wrapping functions like “printf”, by analyzing the format string itself and realigning doubles or removing long doubles (long doubles are 80-bit values that are native on x86 but simply do not exist on ARM). I simply used the same function for vswprintf, and I forgot that the format string was wide-char and not simple char! So I created a new alignment function for wide-char strings… et voilà! Problem fixed!
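
For illustration, here is a rough sketch of what such a realignment pass can look like for a narrow-char format string. It is not the function used in Box86. It assumes the guest arguments are packed in 4-byte slots (as on i386), that the native side wants every double to start on an even 32-bit word (8-byte alignment, with the output buffer itself 8-byte aligned), and it only understands simple conversions: no length modifiers, no “*” widths and no long doubles. The name realign_args is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy packed guest variadic arguments from `in` to `out`, inserting one
   padding word wherever a double would otherwise start on an odd word.
   Returns the number of 32-bit words written to `out`. */
size_t realign_args(const char *fmt, const uint32_t *in,
                    uint32_t *out, size_t out_cap)
{
    size_t o = 0;
    for (const char *p = fmt; *p; p++) {
        if (*p != '%') continue;
        p++;
        if (*p == '%') continue;                     /* literal "%%" */
        /* Skip flags, width and precision. */
        while (*p && strchr("-+ #0123456789.", *p)) p++;
        if (*p == '\0') break;                       /* truncated conversion */
        switch (*p) {
        case 'f': case 'F': case 'e': case 'E':
        case 'g': case 'G': case 'a': case 'A':      /* double: two words */
            if (o & 1) {                             /* odd index: pad first */
                assert(o < out_cap);
                out[o++] = 0;
            }
            assert(o + 2 <= out_cap);
            out[o++] = *in++;
            out[o++] = *in++;
            break;
        default:                                     /* int or 32-bit pointer */
            assert(o < out_cap);
            out[o++] = *in++;
            break;
        }
    }
    return o;
}
```

With the format "%d %f %s", the four packed input words become five output words, the extra one being the padding in front of the double. A wide-char variant has to walk the format string as wchar_t elements rather than bytes, which is precisely the detail that went wrong with vswprintf.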

What’s next

As you can imagine, Box86 is not finished. There are still some bugs in the CPU Emulation: FTL starts and works but the music does not render correctly. Super Meat Boy runs but there are graphical issues and the main character doesn’t behave correctly, kind of digging inside the ground… Probably some x87 or SSE emulation bugs somewhere. Some games segfault at launch because the ELF Loader is not getting things ready as it should.

Finally, some games are really slow (most games are slow on the Pandora), because the CPU emulator is only an interpreter at this stage. There is no JIT/Dynarec/Hotspot mechanism to get more speed. The JIT will probably come last, when all other aspects of the emulation are stable enough.

Neverwinter Nights running on the Odroid XU4

In the meantime, some games, like Airline Tycoon Deluxe, run truly full speed on the Open Pandora, even with the interpreted CPU emulation, and many games run full speed on more powerful boards (like the Odroid XU4), as I said in the introduction.

You can find sources for Box86 here: https://github.com/ptitSeb/box86 and contributions are of course welcome! Please feel free to correct typos in Help/Readme, or contribute wrapped functions or library fixes. I won’t have any objection if you contribute a full-blown JIT!

anonym674
Guest

what is the advantage over QEMU userspace emulation?

ptitSeb
Member

It’s a different approach: qemu needs a complete x86 chroot to run, and everything is emulated (so you get a high generic compatibility, but you need to emulate everything, and accessing hardware can be tricky). With Box86, the minimum is emulated. The compatibility can be lower (for now at least), but many things run native, including things like OpenGL or SDL, leading to some boost in speed. And you don’t need a full chroot, just a handful of x86 libs to run stuff.