Monday, November 30, 2009

Using CPython extension modules with PyPy, or: PyQt on PyPy

If you have ever wanted to use CPython extension modules on PyPy, we want to announce that there is a solution that should be compatible to quite a bit of the available modules. It is neither new nor written by us, but works nevertheless great with PyPy.

The trick is to use RPyC, a transparent, symmetric remote procedure call library written in Python. The idea is to start a CPython process that hosts the PyQt libraries and connect to it via TCP to send RPC commands to it.

I tried to run PyQt applications using it on PyPy and could get quite a bit of the functionality of these working. Remaining problems include regular segfaults of CPython because of PyQt-induced memory corruption and bugs because classes like StandardButtons behave incorrectly when it comes to arithmetical operations.

Changes to RPyC needed to be done to support remote unbound __init__ methods, shallow call by value for list and dict types (PyQt4 methods want real lists and dicts as parameters), and callbacks to methods (all remote method objects are wrapped into small lambda functions to ease the call for PyQt4).

If you want to try RPyC to run the PyQt application of your choice, you just need to follow these steps. Please report your experience here in the blog comments or on our mailing list.

  1. Download RPyC from the RPyC download page.
  2. Download this patch and apply it to RPyC by running patch -p1 < rpyc-3.0.7-pyqt4-compat.patch in the RPyC directory.
  3. Install RPyc by running python install as root.
  4. Run the file rpyc/servers/ using CPython.
  5. Execute your PyQt application on PyPy.

PyPy will automatically connect to CPython and use its PyQt libraries.

Note that this scheme works with nearly every extension library. Look at pypy/lib/ on how to add new libraries (you need to create such a file for every proxied extension module).

Have fun with PyQt

Alexander Schremmer

Wednesday, November 18, 2009

Some benchmarking


Recently, thanks to the surprisingly helpful Unhelpful, also known as Andrew Mahone, we have a decent, if slightly arbitrary, set of performances graphs. It contains a couple of benchmarks already seen on this blog as well as some taken from The Great Computer Language Benchmarks Game. These benchmarks don't even try to represent "real applications" as they're mostly small algorithmic benchmarks. Interpreters used:

  1. PyPy trunk, revision 69331 with --translation-backendopt-storesink, which is now on by default
  2. Unladen swallow trunk, r900
  3. CPython 2.6.2 release

Here are the graphs; the benchmarks and the runner script are available

And zoomed in for all benchmarks except binary-trees and fannkuch.

As we can see, PyPy is generally somewhere between the same speed as CPython to 50x faster (f1int). The places where we're the same speed as CPython are places where we know we have problems - for example generators are not sped up by the JIT and they require some work (although not as much by far as generators & Psyco :-). The glaring inefficiency is in the regex-dna benchmark. This one clearly demonstrates that our regular expression engine is really, really, bad and urgently requires attention.

The cool thing here is, that although these benchmarks might not represent typical python applications, they're not uninteresting. They show that algorithmic code does not need to be far slower in Python than in C, so using PyPy one need not worry about algorithmic code being dramatically slow. As many readers would agree, that kills yet another usage of C in our lives :-)


Friday, November 13, 2009

Düsseldorf Sprint Report

While the Düsseldorf is dwindling off, we put our minds to the task of retelling our accomplishments. The sprint was mostly about improving the JIT and we managed to stick to that task (as much as we managed to stick to anything). The sprint was mostly filled with doing many small things.


Carl Friedrich and Samuele started the sprint trying to tame the JIT's inlining. Until now, the JIT would try to inline everything in a loop (except other loops) which is what most tracing JITs actually do. This works great if the resulting trace is of reasonable length, but if not it would result in excessive memory consumption and code cache problems in the CPU. So far we just had a limit on the trace size, and we would abort tracing when the limit was reached. This would happen again and again for the same loop, which is not useful at all. The new approach introduced is to be more clever when tracing is aborted by marking the function with the largest contribution to the trace size as non-inlinable. The next time this loop is traced, it usually then gives a reasonably sized trace.

This gives a problem because now some functions that don't contain loops are not inlined, which means they never get assembler code for them generated. To remedy this problem we also make it possible to trace functions from their start (as opposed to just tracing loops). We do that only for functions that can not be inlinined (either because they contain loops or they were marked as non-inlinable as described above).

The result of this is that the Python version telco decimal benchmark runs to completion without having to arbitrarily increase the trace length limit. It's also about 40% faster than running it on CPython. This is one of the first non-tiny programs that we speed up.

Reducing GC Pressure

Armin and Anto used some GC instrumentation to find places in pypy-c-jit that allocate a lot of memory. This is an endlessly surprising exercise, as usually we don't care too much about allocations of short-lived objects when writing RPython, as our GCs usually deal well with those. They found a few places where they could remove allocations, most importantly by making one of the classes that make up traces smaller.

Optimizing Chains of Guards

Carl Friedrich and Samuele started a simple optimization on the trace level that removes superfluous guards. A common pattern in a trace is to have stronger and stronger guards about the same object. As an example, often there is first a guard that an object is not None, later followed by a guard that it is exactly of a given class and then even later that it is a precise instance of that class. This is inefficient, as we can just check the most precise thing in the place of the first guard, saving us guards (which take memory, as they need resume data). Maciek, Armin and Anto later improved on that by introducing a new guard that checks for non-nullity and a specific class in one guard, which allows us to collapse more chains.

Improving JIT and Exceptions

Armin and Maciek went on a multi-day quest to make the JIT and Python-level exceptions like each other more. So far, raising and catching exceptions would make the JIT generate code that has a certain amusement value, but is not really fast in any way. To improve the situation, they had to dig into the exception support in the Python interpreter, where they found various inefficiencies. They also had to rewrite the exceptions module to be in RPython (as opposed to just pure Python + an old hack). Another problems is that tracebacks give you access to interpreter frames. This forces the JIT to deoptimize things, as the JIT keeps some of the frame's content in CPU registers or on the CPU stack, which reflective access to frames prevents. Currently we try to improve the simple cases where the traceback is never actually accessed. This work is not completely finished, but some cases are already significantly faster.

Moving PyPy to use py.test 1.1

Holger worked on porting PyPy to use the newly released py.test 1.1. PyPy still uses some very old support code in its testing infrastructure, which makes this task a bit annoying. He also gave the other PyPy developers a demo of some of the newer py.test features and we discussed which of them we want to start using to improve our tests to make them shorter and clearer. One of the things we want to do eventually is to have less skipped tests than now.

Using a Simple Effect Analysis for the JIT

One of the optimization the JIT does is caching fields that are read out of structures on the heap. This cache needs to be invalidated at some points, for example when such a field is written to (as we don't track aliasing much). Another case is a call in the assembler, as the target function could arbitrarily change the heap. This of course is imprecise, since most functions don't actually change the whole heap, and we have an analysis that finds out which sorts of types of structs and arrays a function can mutate. During the sprint Carl Friedrich and Samuele integrated this analysis with the JIT, to help it invalidate caches less aggressively. Later Anto and Carl Friedrich also ported this support to the CLI version of the JIT.


Samuele (with some assistance of Carl Friedrich) set up a buildbot slave on a Mac Mini at the University. This should let us stabilize on the Max OS X. So far we still have a number of failing tests, but now we are in a situation to sanely approach fixing them.

Anto improved the CLI backend to support the infrastructure for producing the profiling graphs Armin introduced.

The guinea-pigs that were put into Carl Friedrich's care have been fed (which was the most important sprint task anyway).

Samuele & Carl Friedrich

Friday, November 6, 2009

Düsseldorf Sprint Started

The Düsseldorf sprint starts today. Only Samuele and me are there so far, but that should change over the course of the day. We will mostly work on the JIT during this sprint, trying to make it a lot more practical. For that we need to decrease its memory requirements some more and to make it use less aggressive inlining. We will post more as the sprint progresses.

Tuesday, November 3, 2009

PyPy on RuPy 2009


It's maybe a bit late to announce, but there will be PyPy talk at Rupy conference this weekend in Poznan. Precisely, I'll be talking mostly about PyPy's JIT and how to use it. Unfortunately the talk is on Saturday, at 8:30 in the morning.

EDIT: Talk is online, together with examples


Sunday, November 1, 2009

Logging and nice graphs

Hi all,

This week I worked on improving the system we use for logging. Well, it was not really a "system" but rather a pile of hacks to measure in custom ways timings and counts and display them. So now, we have a system :-)

The system in question was integrated in the code for the GC and the JIT, which are two independent components as far as the source is concerned. However, we can now display a unified view. Here is for example pypy-c-jit running pystone for (only) 5000 iterations:

The top long bar represents time. The bottom shows two summaries of the total time taken by the various components, and also plays the role of a legend to understand the colors at the top. Shades of red are the GC, shades of green are the JIT.

Here is another picture, this time on pypy-c-jit running 10 iterations of richards:

We have to look more closely at various examples, but a few things immediately show up. One thing is that the GC is put under large pressure by the jit-tracing, jit-optimize and (to a lesser extent) the jit-backend components. So large in fact that the GC takes at least 60-70% of the time there. We will have to do something about it at some point. The other thing is that on richards (and it's likely generally the case), the jit-blackhole component takes a lot of time. "Blackholing" is the operation of recovering from a guard failure in the generated assembler, and falling back to the interpreter. So this is also something we will need to improve.

That's it! The images were generated with the following commands:

PYPYLOG=/tmp/log pypy-c-jit
python pypy/tool/ draw-time /tmp/log --mainwidth=8000 --output=filename.png
EDIT: nowadays the command-line has changed to:
python rpython/tool/ draw-time /tmp/log --mainwidth=8000 filename.png