Monday, May 11, 2015

Performance of the Em-DOSBox CPU interpreter

When I first got DOSBox to run in a web browser, performance was terrible. The problem was the CPU interpreter. A single function fetches, decodes and executes most x86 instructions. Most of the function consists of a big switch statement with many cases. It is big because there are many x86 instructions.

The first problem was Emscripten converting the switch statement into a long chain of else if comparisons. Actually, a switch statement was used, but in most cases it merely set a variable which was later tested via comparisons. Instruction decoding, which needs to be done for every instruction, changed from O(1) into O(n).

Emscripten could generate a much better switch statement with a patch. This made DOSBox run fast in Firefox, but it was much too slow to use in Chrome. When I profiled it, I saw a warning triangle by the CPU interpreter function, telling me it's not optimized because the switch statement is too big. There was already v8 bug filed about this issue.

I solved this problem by transforming the cases of the big switch statement into functions using extractfun.py. This reduces function size and allows a function pointer to be used instead of a switch statement. The process is somewhat convoluted because the switch statement is normally built using the preprocessor. First, preprocessor output files need to be produced. In order to get Automake to create them using proper dependencies, it needs to create a library which is otherwise unnecessary. Then the Python script parses the preprocessor files. It stores functions into a function store, removing duplicates and fixing name collisions. Finally, it creates header files which are used when building the final version of the CPU interpreters. Three CPU interpreters are processed this way: the simple, normal and prefetch cores.

Since then, the Emscripten bug has been fixed, I assume by the switch to Fastcomp. When Chrome started using TurboFan for asm.js, it could finally get good performance with an un-transformed CPU interpreter. This led me to check whether extractfun.py is still necessary.

Safari 8.0.6 and Internet Explorer 11 still get terrible performance without the transformation. Use of --llvm-opts '["-lowerswitch"]' doesn't seem to help. Looking at the JavaScript, I can confirm that it changes the big switch into a binary search, so this probably means the problem is due to the size or complexity of the function, and not just due to switch statements. I also experimented with the Emscripten outlining limit, with or without -lowerswitch. I assume that transforming switch cases into functions is a more efficient split than what's done by Emscripten outlining.

No comments: