Saturday, January 29, 2011

A JIT Backend for ARM Processors

In the past few months, I have been developing as a part of my master thesis the ARM backend for the the PyPy JIT, in the arm-backend branch. Currently, it is still work in progress: all integer and object operations are working and the support for floating point is also under development.
ARM processors are very widely used, beeing deployed in servers, some netbooks and mainly mobile devices such as phones and tablets. One of our goals is to be able to run PyPy on phones, specially on Android. Currently is not yet possible to translate and compile PyPy for Android automatically, but there has been some work on using Android's NDK to compile PyPy's generated C code.
The JIT Backend targets the application profile of the ARMv7 instruction set architecture which is found for example in the Cortex-A8 processors used in many Android powered devices and in Apple's A4 processors built into the latest iOS devices. To develop and test the backend we are using a BeagleBoard-xM which has a 1 GHz ARM Cortex-A8 and 512 MB of RAM running the ARM port of Ubuntu 10.10.
Currently on Linux it is possible to translate and cross-compile PyPy's Python interpreter as well as other interpreters with the ARM JIT backend enabled using Scratchbox 2 to provide a build environment and the GNU ARM cross compilation toolchain. So far the backend only supports the Boehm garbage collector which does not produce the best results combined with the JIT, but we plan to add support for the other GCs in the future, doing so should increase the performance of PyPy on ARM.
While still debugging the last issues with the backend we already can run some simple benchmarks on Pyrolog, a prolog interpreter written in RPython. Even using Boehm as the GC the results look very promising. In the benchmarks we compare Pyrolog to SWI-Prolog, a prolog interpreter written in C, which is available from the package repositories for Ubuntu's ARM port.
The benchmarks can be found in the pyrolog-bench repository.
BenchmarkSWI-Prolog in ms.Pyrolog in ms.Speedup
iterate60.06.010.0
iterate_assert130.06.021.67
iterate_call3310.05.0662.0
iterate_cut60.0359.00.16713
iterate_exception4950.0346.014.306
iterate_failure400.0127.03.1496
iterate_findall740.0No res.
iterate_if140.06.023.333
The iterate_call benchmark, which constructs a predicate and calls it at runtime, with a speedup of 662 times over SWI-Prolog is an example where the JIT can show its strength. The Pyrolog interpreter and the JIT treat dynamically defined predicates as static ones and can generate optimezed code in both cases. Whereas SWI only compiles statically defined rules and has to fall back to interpretation on dynamic ones.
For simple benchmarks running on PyPy's Python intepreter we see some speedups over CPython, but we still need to debug the backend bit more before we can show numbers on more complex benchmarks. So, stay tuned.

Friday, January 21, 2011

PyPy wants you!

If you ever considered contributing to PyPy, but never did so far, this is a good moment to start! :-)

Recently, we merged the fast-forward branch which brings Python 2.7 compatibility, with the plan of releasing a new version of PyPy as soon as all tests pass.

However, at the moment there are still quite a few of failing tests because of new 2.7 features that have not been implemented yet: many of them are easy to fix, and doing it represents a good way to get confidence with the code base, for those who are interested in it. Michael Foord wrote a little howto explaining the workflow for running lib-python tests.

Thus, if you are willing to join us in the effort of having a PyPy compatible with Python 2.7, probably the most sensible option is to come on the #PyPy IRC channel on Freenode, so we can coordinate each other not to fix the same test twice.

Moreover, if you are a student and are considering participating in the next Google Summer of Code this is a good time to get into pypy. You have the opportunity to get a good understanding of pypy for when you decide what you would like to work on over the summer.

Tuesday, January 11, 2011

Loop invariant code motion

Recently, the jit-unroll-loops branch was merged. It implements the idea described in Using Escape Analysis Across Loop Boundaries for Specialization. That post does only talk about virtuals, but the idea turned out to be more far reaching. After the metainterpreter produces a trace, several optimizations are applied to the trace before it is turned into binary code. Removing allocations is only one of them. There are also for instance
  • Heap optimizations that removes memory accesses by reusing results previously read from or written to the same location.
  • Reusing of the results of pure operations if the same pure operation is executed twice.
  • Removal of redundant guards.
  • ...
A lot of these optimizations are in one way or another removing operations form the trace and/or reusing previous results. All of these optimizations could benefit from being able to operate across loop boundaries. Not only in the sense that operations operating on loop invariants could be moved out of the loop entirely. But also that results produced at the end of an iteration could be reused at the beginning of the next even if there are no loop invariants involved.

This is achieved by unrolling the trace into two iterations, and letting the optimizer work on this two-iteration-trace. The optimizer will now be able to optimize the second iteration more than the first since it can reuse results from the first iteration. The optimized version of the first iteration we call the preamble and the optimized version of the second iteration we call the loop. The preamble will end with a jump to the loop, while the loop will end with a jump to itself. This means that the preamble will be executed once for the first iteration, the loop will be executed for all following iterations.

Sqrt example

Here is an example of a Python implementation of sqrt using a fairly simple algorithm

def sqrt(y, n=10000):
    x = y / 2
    while n > 0:
        n -= 1
        x = (x + y/x) / 2
    return x

If it is called with sqrt(1234.0), a fairly long trace is produced. From this trace the optimizer creates the following preamble (Loop 1) and loop (Loop 0)

Looking at the preamble, it starts by making sure that it is not currently being profiled, the guard on i5, and that the function object have not been changed since the trace was made, the guard on p3. Somewhat intermixed with that, the integer variable n is unboxed, by making sure p11 points to an integer object and reading out the integer value from that object. These operations are not needed in the loop (and have been removed from it) as emitting the same guards again would be redundant and n becomes a virtual before the end of the preamble.

        guard_value(i5, 0, descr=<Guard6>) 
        guard_nonnull_class(p11, ConstClass(W_IntObject), descr=<Guard7>) 
        guard_value(p3, ConstPtr(ptr15), descr=<Guard8>) 
        i16 = getfield_gc_pure(p11, descr=<W_IntObject.inst_intval>)
Next comes a test and a guard implementing the while statement followed by the decrementing of n. These operation appear both in the preamble and in the loop
        i18 = int_gt(i16, 0)
        guard_true(i18, descr=<Guard9>) 
        i20 = int_sub(i16, 1)
After that the two floating point variables x and y are unboxed. Again this is only needed in the preamble. Note how the unboxed value of y, called f23, is passed unchanged from the preamble to the loop in arguments of the jump to allow it to be reused. It will not become a virtual since it is never changed within the loop.
        guard_nonnull_class(p12, 17652552, descr=<Guard10>) 
        guard_nonnull_class(p10, 17652552, descr=<Guard11>) 
        f23 = getfield_gc_pure(p10, descr=<W_FloatObject.inst_floatval>)
        f24 = getfield_gc_pure(p12, descr=<W_FloatObject.inst_floatval>)
Following that is the actual calculations performed in the loop in form of floating point operations (since the function was called with a float argument). These appear in both the loop and the preamble.
        i26 = float_eq(f24, 0.000000)
        guard_false(i26, descr=<Guard12>) 
        f27 = float_truediv(f23, f24)
        f28 = float_add(f24, f27)
        f30 = float_truediv(f28, 2.000000)
Finally there are some tests checking if a signal was received (such as when the user presses ctrl-C) and thus should execute some signal handler or if we need to hand over to another thread. This is implemented with a counter that is decreased once every iteration. It will go below zero after some specific number of iterations, tunable by sys.setcheckinterval. The counter is read from and written to some global location where it also can be made negative by a C-level signal handler.
        i32 = getfield_raw(32479328, descr=<pypysig_long_struct.c_value>)
        i34 = int_sub(i32, 2)
        setfield_raw(32479328, i34, descr=<pypysig_long_struct.c_value>)
        i36 = int_lt(i34, 0)
        guard_false(i36, descr=<Guard13>) 
        jump(p0, p1, p2, p4, p10, i20, f30, f23, descr=<Loop0>)

Bridges

When a guard fails often enough, the meta-interpreter is started again to produce a new trace starting at the failing guard. The tracing is continued until a previously compiled loop is entered. This could either be the the same loop that contains the failing guard or some completely different loop. If it is the same loop, executing the preamble again maybe be unnecessary. It is preferable to end the bridge with a jump directly to the loop. To achieve this the optimizer tries to produce short preambles that are inlined at the end of bridges allowing them to jump directly to the loop. Inlining is better than jumping to a common preamble because most of the inlined short preamble can typically be removed again by the optimizer. Creating such a short preamble is however not always possible. Bridges jumping to loops for which no short preamble can be generated have to end with a jump to the full preamble instead.

The short preamble is created by comparing the operations in the preamble with the operations in the loop. The operations that are in the preamble but not in the loop are moved to the short preamble whenever it is safe to move them to the front of the operations remaining. In other words, the full preamble is equivalent to the short preamble followed by one iteration of the loop.

This much has currently been implemented. To give the full picture here, there are two more features that hopefully will be implemented in the near future. The first is to replace the full preamble, used by the interpreter when it reaches a compiled loop, with the short preamble. This is currently not done and is probably not as straight forward as it might first seem. The problem is where to resume interpreting on a guard failure. However, implementing that should save some memory. Not only because the preamble will become smaller, but mainly because the guards will appear either in the loop or in the preamble, but not in both (as they do now). That means there will only be a single bridge and not potentially two copies once the guards are traced.

The sqrt example above would with a short preamble result in a trace like this

If it is executed long enough, the last guard will be traced to form a bridge. The trace will inherit the virtuals from its parent. This can be used to optimize away the part of the inlined short preamble that deals with virtuals. The resulting bridge should look something like
    [p0, p1, p2, p3, p4, f5, i6]
    i7 = force_token()
    setfield_gc(p1, i7, descr=<PyFrame.vable_token>)
    call_may_force(ConstClass(action_dispatcher), p0, p1, descr=<VoidCallDescr>)
    guard_not_forced(, descr=<Guard19>) 
    guard_no_exception(, descr=<Guard20>) 

    guard_nonnull_class(p4, 17674024, descr=<Guard21>) 
    f52 = getfield_gc_pure(p4, descr=<W_FloatObject.inst_floatval>)
    jump(p1, p0, p2, p3, p4, i38, f53, f52, descr=<Loop0>)
Here the first paragraph comes from the traced bridge and the second is what remains of the short preamble after optimization. The box p4 is not a virtual (it contains a pointer to y which is never changed), and it is only virtuals that the bridge inherit from it's parents. This is why the last two operations currently cannot be removed.

Each time the short preamble is inlined, a new copy of each of the guards in it is generated. Typically the short preamble is inlined in several places and thus there will be several copies of each of those guards. If they fail often enough bridges from them will be traced (as with all guards). But since there typically are several copies of each guard the same bridge will be generated in several places. To prevent this, mini-bridges from the inlined guards are produced already during the inlining. These mini-bridges contain nothing but a jump to the preamble.

The mini-bridges needs the arguments of the preamble to be able to jump to it. These arguments contain among other things, boxed versions of the variables x and y. Those variables are virtuals in the loop, and have to be allocated. Currently those allocations are placed in front of the inlined guard. Moving those allocations into the mini-bridges is the second feature that hopefully will be implemented in the near future. After this feature is implemented, the result should look something like

Multiple specialized versions

Floating point operations were generated in the trace above because sqrt was called with a float argument. If it is instead called with an int argument, integer operations will be generated. The somewhat more complex situations is when both int's and float's are used as arguments. Then the jit need to generate multiple versions of the same loop, specialized in different ways. The details, given below, on how this is achieved is somewhat involved. For the casual reader it would make perfect sense to skip to the next section here.

Consider the case when sqrt is first called with a float argument (but with n small enough not to generate the bridge). Then the trace shown above will be generated. If sqrt is now called with an int argument, the guard in the preamble testing that the type of the input object is float will fail:

        guard_nonnull_class(p12, 17652552, descr=<Guard10>) 
It will fail every iteration, so soon enough a bridge will be generated from this guard in the preamble. This guard will end with a jump to the same loop, and the optimizer will try to inline the short preamble at the end of it. This will however fail since now there are two guards on p12. One that makes sure it is an int and and one that makes sure it is a float. The optimizer will detect that the second guard will always fail and mark the bridge as invalid. Invalid loops are not passed on to the backend for compilation.

If a loop is detected to be invalid while inlining the short preamble, the metainterpreter will continue to trace for yet another iteration of the loop. This new trace can be compiled as above and will produce a new loop with a new preamble that are now specialized for int arguments instead of float arguments. The bridge that previously became invalid will now be tried again. This time inlining the short preamble of the new loop instead. This will produce a set of traces connected like this

(click for some hairy details)

The height of the boxes is this figure represents how many instructions they contain (presuming the missing features from the previous section are implemented). Loop 0 is specialized for floats and it's preamble have been split into two boxes at the failing guard. Loop 2 is specialized for ints and is larger than Loop 0. This is mainly because the integer division in python does not map to the integer division of the machine, but have to be implemented with several instructions (integer division in python truncates its result towards minus infinity, while the the machine integer division truncates towards 0). Also the height of the bridge is about the same as the height of Loop 2. This is because it contains a full iteration of the loop.

A More Advanced Example

Let's conclude with an example that is a bit more advanced, where this unrolling approach actually outperforms the previous approach. Consider making a fixed-point implementation of the square root using 16 bit's of decimals. This can be done using the same implementation of sqrt but calling it with an object of a class representing such fixed-point real numbers:

class Fix16(object):
    def __init__(self, val, scale=True):
        if isinstance(val, Fix16):
            self.val = val.val
        else:
            if scale:
                self.val = int(val * 2**16)
            else:
                self.val = val

    def __add__(self, other):
        return  Fix16(self.val + Fix16(other).val, False)

    def __sub__(self, other):
        return  Fix16(self.val - Fix16(other).val, False)

    def __mul__(self, other):
        return  Fix16((self.val >> 8) * (Fix16(other).val >> 8), False)

    def __div__(self, other):
        return  Fix16((self.val << 16) / Fix16(other).val, False)

Below is a table comparing the runtime of the sqrt function above with different argument types on different python interpreters. Pypy 1.4.1 was released before the optimizations described in this post were in place while they are in place in the nightly build from January 5, denoted pypy in the table. There are also the running time for the same algorithms implemented in C and compiled with "gcc -O3 -march=native". Tests were executed on a 2.53GHz Intel Core2 processor with n=100000000 iterations. Comparing the integer versions with C may be considered a bit unfair because of the more advanced integer division operator in python. The left part of this table shows runtimes of sqrt in a program containing a single call to sqrt (i.e. only a single specialized version of the loop is needed). The right part shows the runtime of sqrt when it has been called with a different type of argument before.

First callSecond call
floatintFix16   floatintFix16
cpython 28.18 s 22.13 s 779.04 s 28.07 s 22.21 s 767.03 s
pypy 1.4.1 1.20 s 6.49 s 11.31 s 1.20 s 6.54 s 11.23 s
pypy 1.20 s 6.44 s 6.78 s 1.19 s 6.26 s 6.79 s
gcc 1.15 s 1.82 s 1.89 s 1.15 s 1.82 s 1.89 s

For this to work in the last case, when Fix16 is the argument type in the second type, the trace_limit had to be increased from its default value to prevent the metainterpreter from aborting while tracing the second version of the loop. Also sys.setcheckinterval(1000000) were used to prevent the bridge from being generated. With the bridge the performance of the last case is significantly worse. Maybe because the optimizer currently fails to generate a short preamble for it. But the slowdown seems too big for that to be the only explanation. Below are the runtimes numbers with checkinterval set to its default value of 100:

First callSecond call
floatintFix16   floatintFix16
cpython 28.71 s 22.09 s 781.86 s 28.28 s 21.92 s 761.59 s
pypy 1.4.1 1.21 s 6.48 s 11.22 s 1.72 s 7.58 s 12.18 s
pypy 1.21 s 6.27 s 7.22 s 1.20 s 6.29 s 90.47 s

Conclusions

Even though we are seeing speedups in a variety of different small benchmarks, more complicated examples are not affected much by these optimizations. It might partly be because larger examples have longer and more complicated loops, and thus allowing optimizations to operate across loop boundary will have a smaller relative effect. Another problem is that with more complicated examples there will be more bridges, and bridges are currently not handled very well (most of the time all virtuals are forced at the end of the bridge as explained above). But moving those forcings into the mini bridges should fix that.