Wednesday, May 28, 2008

Threads and GCs

Hi all,

We can now compile a pypy-c that includes both thread support and one of our semi-advanced garbage collectors. This means that threaded Python programs can now run not only with better performance, but also without the annoyances of the Boehm garbage collector. (For example, Boehm doesn't much like seeing large numbers of __del__() methods, and our implementation of ctypes uses them everywhere.)

Magic translation command (example): --thread --gc=hybrid targetpypystandalone --faassen --allworkingmodules

Note that multithreading in PyPy is based on a global interpreter lock, as in CPython. I imagine that we will get rid of the global interpreter lock at some point in the future -- I can certainly see how this might be done in PyPy, unlike in CPython -- but it will be a lot of work nevertheless. Given our current priorities, it will probably not occur soon unless someone steps in.
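
As a reminder of what a global interpreter lock does and does not give a threaded program, here is a minimal sketch in plain Python. It is not PyPy-specific, and the names (add_many, run, counter_lock) are made up for this illustration. The GIL only ensures that one thread executes interpreter code at a time; an update such as counter += 1 is still several steps, so shared state needs its own lock:

```python
import threading

counter = 0
counter_lock = threading.Lock()  # protects the shared counter


def add_many(n):
    """Increment the shared counter n times, one locked step at a time."""
    global counter
    for _ in range(n):
        # "counter += 1" is a read, an add and a write; the GIL alone does
        # not make that sequence atomic, so we guard it explicitly.
        with counter_lock:
            counter += 1


def run(num_threads=4, n=10000):
    """Start num_threads workers and return the final counter value."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(n,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Calling run(4, 10000) should return 40000 on any interpreter with working thread support, with or without a GIL.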

Progress on the CLI JIT backend front

In the last few months, I've been actively working on the CLI backend for PyPy's JIT generator, whose goal is to automatically generate JIT compilers that produce .NET bytecode on the fly.

The CLI JIT backend is far from complete, and there is still a lot of work to be done before it can handle PyPy's full Python interpreter; nevertheless, yesterday I finally got the first .NET executable that contains a JIT for a very simple toy language called tlr, which implements an interpreter for a minimal register-based virtual machine with only 8 operations.

To compile the tlr VM, follow these steps:

  1. get a fresh checkout of the oo-jit branch, i.e. the branch where the CLI JIT development goes on:

    $ svn co
  2. go to the oo-jit/pypy/jit/tl directory, and compile the tlr VM with the CLI backend and JIT enabled:

    $ cd oo-jit/pypy/jit/tl/
    $ ../../translator/goal/ -b cli --jit --batch targettlr

The goal of our test program is to compute the square of a given number. Since the only arithmetic operations supported by the VM are addition and negation, we compute the result by repeated addition. I won't describe the exact meaning of all the tlr bytecodes here, as they are quite self-documenting:

ALLOCATE,    3,   # make space for three registers
MOV_A_R,     0,   # i = a
MOV_A_R,     1,   # copy of 'a'

SET_A,       0,
MOV_A_R,     2,   # res = 0

# 10:
SET_A,       1,
NEG_A,
ADD_R_TO_A,  0,
MOV_A_R,     0,   # i--

MOV_R_A,     2,
ADD_R_TO_A,  1,
MOV_A_R,     2,   # res += a

MOV_R_A,     0,
JUMP_IF_A,  10,   # if i!=0: goto 10

MOV_R_A,     2,
RETURN_A          # return res
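
For readers who want to follow the program without building the VM, here is a small sketch of an interpreter for it in plain Python. It is an illustration written for this post, not the code of the tlr module: the opcode numbering is an assumption, and the decrement step is spelled as SET_A 1, a negation (named NEG_A here), then ADD_R_TO_A 0, since adding 1 alone would count upwards.

```python
# Opcode numbers are arbitrary choices for this sketch.
(MOV_A_R, MOV_R_A, JUMP_IF_A, SET_A,
 ADD_R_TO_A, RETURN_A, ALLOCATE, NEG_A) = range(1, 9)


def interpret(bytecode, a):
    """Run a tlr-style program; `a` is the accumulator, regs the registers."""
    regs = []
    pc = 0
    while True:
        opcode = bytecode[pc]
        pc += 1
        if opcode == RETURN_A:          # no operand
            return a
        if opcode == NEG_A:             # no operand
            a = -a
            continue
        arg = bytecode[pc]              # every other opcode takes one operand
        pc += 1
        if opcode == ALLOCATE:
            regs = [0] * arg
        elif opcode == MOV_A_R:
            regs[arg] = a
        elif opcode == MOV_R_A:
            a = regs[arg]
        elif opcode == SET_A:
            a = arg
        elif opcode == ADD_R_TO_A:
            a += regs[arg]
        elif opcode == JUMP_IF_A:
            if a:
                pc = arg
        else:
            raise ValueError("unknown opcode %r" % (opcode,))


SQUARE = [
    ALLOCATE, 3,      # make space for three registers
    MOV_A_R, 0,       # i = a
    MOV_A_R, 1,       # copy of 'a'
    SET_A, 0,
    MOV_A_R, 2,       # res = 0
    # index 10:
    SET_A, 1,
    NEG_A,
    ADD_R_TO_A, 0,
    MOV_A_R, 0,       # i--
    MOV_R_A, 2,
    ADD_R_TO_A, 1,
    MOV_A_R, 2,       # res += a
    MOV_R_A, 0,
    JUMP_IF_A, 10,    # if i != 0: goto 10
    MOV_R_A, 2,
    RETURN_A,         # return res
]
```

With this sketch, interpret(SQUARE, 16) returns 256. Note that the loop body runs before the test, so the program only terminates for positive inputs.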

You can also find the program at the end of the tlr module; to get an assembled version of the bytecode, ready to be interpreted, run this command:

$ python assemble > square.tlr

Now we are ready to execute the code through the tlr VM. If you are using Linux/Mono, you can simply execute the targettlr-cli script that has been created for you; if you use Windows, however, you have to manually fish the executable out of the targettlr-cli-data directory:

# Linux
$ ./targettlr-cli square.tlr 16

# Windows
> targettlr-cli-data\main.exe square.tlr 16

Cool, our program computed the result correctly! But how can we be sure that it really JIT-compiled our code instead of interpreting it? To inspect the code generated by our JIT compiler, we simply set the PYPYJITLOG environment variable to a filename, so that the JIT will create a .NET assembly containing all the code it has generated:

$ PYPYJITLOG=generated.dll ./targettlr-cli square.tlr 16
$ file generated.dll
generated.dll: MS-DOS executable PE  for MS Windows (DLL) (console) Intel 80386 32-bit

Now we can inspect the DLL with any IL disassembler, such as ildasm or monodis; here is an excerpt of the disassembled code that shows how our square.tlr bytecode has been compiled to .NET bytecode:

.method public static  hidebysig default int32 invoke (object[] A_0, int32 A_1)  cil managed
    .maxstack 3
    .locals init (int32 V_0, int32 V_1, int32 V_2, int32 V_3, int32 V_4, int32 V_5)

    ldc.i4 -1
    ldc.i4 0
    IL_0010:  ldloc.1
    brfalse IL_003b

    ldc.i4 -1
    stloc.s 4
    stloc.s 5
    ldloc.s 5
    ldloc.s 4
    starg 1

    br IL_0010

    IL_003b:  ldloc.2
    br IL_0042


If you know a bit of IL, you can see that the generated code is not optimal, as there are some redundant operations like all those stloc/ldloc pairs; however, while not optimal, it is still quite good code, not much different from what you would get by writing the square algorithm directly in e.g. C#.

As I said before, all of this is still work in progress and there is still much to be done. Stay tuned :-).

Monday, May 26, 2008

More Windows support

Recently, thanks to Amaury Forgeot d'Arc and Michael Schneider, Windows became more of a first-class platform for PyPy's Python interpreter. Most RPython extension modules are now considered working (apart from some POSIX-specific modules). Even ctypes now works on Windows!

The next step would be to have better buildbot support for all supported platforms (Windows, Linux and OS X), so that we can detect regressions and react to them quickly. (Buildbot is maintained by JP Calderone.)


Friday, May 23, 2008

S3-Workshop Potsdam 2008 Writeup

Here are some notes about the S3 Workshop in Potsdam that several PyPyers and Spies (Armin, Carl Friedrich, Niko, Toon, Adrian) attended before the Berlin sprint. We presented a paper about SPy there. Below are some mostly random notes about my (Carl Friedrich's) impressions of the conference and some talk notes. Before that, I'd like to thank the organizers, who did a great job. The workshop was well organized and the social events were wonderful (a very relaxing boat trip on the many lakes around Potsdam and a conference dinner).

Video recordings of all the talks can be found on the program page.

Invited Talks

"Late-bound Object Lambda Architectures" by Ian Piumarta was quite an inspiring talk about VPRI's attempt at writing a flexible and understandable computing system in 20K lines of code. The talk was lacking a bit in technical details, so while it was inspiring I couldn't really say much about their implementation. Apart from that, I disagree with some of their goals, but that's the topic of another blog post.

"The Lively Kernel – A Self-supporting System on a Web Page" by Dan Ingalls. Dan Ingalls is one of the inventors of the original Smalltalk and of Squeak. He talked about his latest work, an attempt to bring a Squeak-like system to the web browser using JavaScript and SVG. To get a feel for what exactly The Lively Kernel is, it is easiest to just try it out (it only works in Safari and Firefox 3 above Beta 5, though). I guess in a sense the progress of the Lively Kernel over Squeak is not that great, but Dan seems to be having fun. Dan is an incredibly enthusiastic, friendly and positive person; it was really great meeting him. He even seemed to like some of the ideas in SPy.

"On Sustaining Self" by Richard P. Gabriel was a sort of deconstructivist multi-media-show train wreck of a presentation that was a bit too weird for my taste. There was a lot of music, and there were sections in the presentation where Richard held a discussion with an alter ego, whose part he had recorded in advance and mangled with a sound editor. There was a large bit of a documentary about Levittown. Even the introduction and the questions were weird, with Pascal Costanza staring down the audience without saying a word (nobody dared to ask questions). I am not sure I saw the point of the presentation, apart from getting the audience to think, which probably worked. It seems that there are people (e.g. Christian Neukirchen) who liked the presentation, though.

Research Papers

"SBCL - A Sanely Bootstrappable Common Lisp" by Christophe Rhodes described the bootstrapping process of SBCL (Steel Bank Common Lisp). SBCL can be bootstrapped by a variety of Common Lisps, not just by itself. SBCL contains a complete blueprint of the initial image instead of always getting the new image by carefully mutating the old one. This bootstrapping approach is sort of similar to that of PyPy.

"Reflection for the Masses" by Charlotte Herzeel, Pascal Costanza, and Theo D'Hondt retraced some of the work of Brian Smith on reflection in Lisp. The talk was not very good: it was way too long (40 min) and quite hard to understand because Charlotte Herzeel was talking in a very low voice. The biggest mistake in her talk was, in my opinion, that she spent too much time explaining a more or less standard meta-circular interpreter for Lisp and then ran out of time when she was trying to explain the modifications. I guess it would have been a fair assumption that large parts of the audience know such interpreters, so glossing over the details would have been fine. A bit of a pity, since the paper seems interesting.

"Back to the Future in One Week - Implementing a Smalltalk VM in PyPy" by Carl Friedrich Bolz, Adrian Kuhn, Adrian Lienhard, Nicholas D. Matsakis, Oscar Nierstrasz, Lukas Renggli, Armin Rigo and Toon Verwaest, the paper with the longest author list. We just made everybody an author who was at the sprint in Bern. Our paper had more authors than all the other papers together :-). I gave the presentation at the workshop, which went quite well, judging from the feedback I got.

"Huemul - A Smalltalk Implementation" by Guillermo Adrián Molina. Huemul is a Smalltalk implementation that doesn't contain an interpreter but directly compiles all methods to assembler (and also saves the assembler in the image). In addition, as much functionality as possible (such as threading and the GUI) is delegated to libraries instead of being reimplemented in Smalltalk (as e.g. Squeak does). The approach seems to suffer from the usual problems of manually writing a JIT, e.g. the VM seems to segfault pretty often. Also, I don't agree with some of the design decisions of the threading scheme: there is no automatic locking of objects at all; instead, the user code is responsible for preventing concurrent accesses from messing things up (which even seems to lead to segfaults in the default image).

"Are Bytecodes an Atavism?" by Theo D'Hondt argued that AST-based interpreters can be as fast as bytecode-based interpreters, which he showed by writing two AST interpreters, one for Pico and one for Scheme. Both of these implementations seem to perform pretty well. Theo seems to share many views with PyPy, for example that writing simple, straightforward interpreters is often preferable to writing complex (JIT-)compilers.

Berlin Sprint Finished

The Berlin sprint is finished. Below are some notes on what we worked on during the last three days:

  • Camillo worked tirelessly on the gameboy emulator, with some occasional input from various people. He is making good progress: some test ROMs now run on the translated emulator. However, the graphics are still not completely working, for unclear reasons. Since PyBoy is already taken as a project name, we considered calling it PyGirl (another name proposition was "BoyBoy", but the implementation is not circular enough for that).
  • On Monday Armin and Samuele fixed the problem with our multimethods so that the builtin shortcut works again (the builtin shortcut is an optimization that speeds up all operations on builtin non-subclassed types quite a bit).
  • Antonio and Holger (who hasn't been on a sprint in a while, great to have you back!) worked on writing a conftest file (the plugin mechanism of py.test) that would allow us to run Django tests using py.test, which seems to be not completely trivial. They also fixed some bugs in PyPy's Python interpreter, e.g. related to dictionary subclassing.
  • Karl started adding sound support to the RPython SDL-bindings, which will be needed both by the Gameboy emulator and eventually by the SPy VM.
  • Armin and Maciek continued the work that Maciek had started a while ago on improving the speed of PyPy's IO operations. In the past, doing IO usually involved copying lots of memory around, which should be better now. Armin and Maciek improved and then merged the first of the two branches that contained IO improvements, which speeds up IO on non-moving GCs (mostly the Boehm GC). Then they continued working on the hybrid-io branch, which is supposed to improve IO on the hybrid GC (which was partially designed exactly for this).
  • Toon and Carl Friedrich finished cleaning up the SPy improvement branch and fixed all warnings that occur when you translate SPy there. An obscure bug in an optimization prevented them from getting working executables, which at this moment blocks the merging of that branch.

By now everybody is home again (except for Anto, who booked his return flight two days too late, accidentally) and mostly resting. It was a good sprint, with some interesting results and several new people joining. And it was definitely the most unusual sprint location ever :-).

Sunday, May 18, 2008

Berlin Sprint Day 1 + 2

After having survived the S3-Workshop which took place in Potsdam on Thursday and Friday (a blog-post about this will follow later) we are now sitting in the c-base in Berlin, happily sprinting. Below are some notes on what progress we made so far:

  • The Gameboy emulator in RPython that Camillo Bruni is working on for his Bachelor project at Uni Bern now translates. It took him (assisted by various people) a while to figure out the translation errors (essentially because he wrote nice Python code that passed bound methods around, which the RTyper doesn't completely like). Now that is fixed, and the Gameboy emulator translates and runs a test ROM. You cannot really see anything yet, because there is no graphics support in RPython.
  • To get graphics support in RPython Armin and Karl started writing SDL bindings for RPython, which both the Gameboy emulator and the SPy VM need. They have basic stuff working, probably enough to support the Gameboy already.
  • Alexander, Armin, Maciek and Samuele discussed how to approach separate compilation for RPython, which isn't easy because the RPython type analysis is a whole-program analysis.
  • Stephan, Peter and Adrian (at least in the beginning) worked on making PyPy's stackless module more complete. They added channel preferences which change details of the scheduling semantics.
  • Toon, Carl Friedrich and Adrian (a tiny bit) worked on SPy. There is a branch that Toon started a while ago which contains many improvements but is also quite unclear in many respects. There was some progress in cleaning that up. This involved implementing the Smalltalk process scheduler (Smalltalk really is an OS). There is still quite some work left, though. While doing so, we discovered many funny facts about Squeak's implementation details (most of which are exposed to the user). I guess we should collect them and blog about them eventually.
  • Samuele and Maciek improved the ctypes version of pysqlite that Gerhard Häring started.
  • Armin, Samuele and Maciek found an obscure bug in the interaction between the builtin-type-shortcut that Armin recently implemented and our multimethod implementation. It's not clear which of the two is to blame, and it is just as unclear how to fix the problem: Armin and Samuele have been stuck in a discussion about how to approach a solution for a while now and are hard to talk to.
  • Stijn Timbermont, a Ph.D. student at the Vrije Universiteit Brussel who is visiting the sprint for two days, was first looking at how our GCs are implemented to figure out whether he can use PyPy for some experiments. The answer to that seems to be no. Today he was hacking on a Pico interpreter (without knowing too much about Python) and is making some nice progress, it seems.

Will try to blog more as the sprint progresses.

Saturday, May 10, 2008

General performance improvements

Hi all,

During the past two weeks we invested some more effort in the baseline performance of pypy-c. Some of the tweaks we did were just new ideas, and others were based on actual profiling. The net outcome is that we now expect PyPy to be, in the worst case, twice as slow as CPython on real applications. Here are some small-to-medium-size benchmark results. The number is the execution time, normalized to 1.0 for CPython 2.4:

  • 1.90 on templess (a simple templating language)
  • 1.49 on gadfly (pure Python SQL database)
  • 1.49 on (pypy's own translation toolchain)
  • 1.44 on mako (another templating system)
  • 1.21 on pystone
  • 0.78 on richards

(This is all without the JIT, as usual. The JIT is not ready yet.)
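
To read these numbers as slowdowns or speedups, a small sketch may help. It only uses the figures quoted above (the entry whose name is missing is left out), and the helper name describe is made up for this example. A normalized time above 1.0 means PyPy takes that many times as long as CPython; below 1.0, the inverse gives the speedup:

```python
# Normalized execution times from the list above (CPython 2.4 = 1.0).
normalized_time = {
    "templess": 1.90,
    "gadfly": 1.49,
    "mako": 1.44,
    "pystone": 1.21,
    "richards": 0.78,
}


def describe(name, t):
    """Turn a normalized time into a human-readable comparison."""
    if t > 1.0:
        return "%s: takes %.2fx CPython's time" % (name, t)
    return "%s: %.2fx faster than CPython" % (name, 1.0 / t)


# The slowest benchmark stays under the "twice as slow" bound.
worst = max(normalized_time.values())

for name in sorted(normalized_time, key=normalized_time.get, reverse=True):
    print(describe(name, normalized_time[name]))
```

So richards, at 0.78, runs about 1.28 times faster than on CPython, and the worst listed case (1.90) is still below the factor of two mentioned above.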

You can build yourself a pypy-c with this kind of speed with the magic command line (gcrootfinder is only for a 32-bit Linux machine):

    pypy/translator/goal/ --gc=hybrid --gcrootfinder=asmgcc targetpypystandalone --allworkingmodules --faassen

The main improvements come from:

  • A general shortcut for any operation between built-in objects: for example, a subtraction of two integers or floats now dispatches directly to the integer or float subtraction code, without looking up the '__sub__' in the class.
  • A shortcut for getting attributes out of instances of user classes when the '__getattribute__' special method is not overridden.
  • The so-called Hybrid Garbage Collector is now a three-generations collector. More about our GCs...
  • Some profiling showed bad performance in our implementation of the built-in id() -- a trivial function to write in CPython, but a lot more fun when you have a moving GC and your object's real address can change.
  • The bytecode compiler's parser had a very slow linear search algorithm that we replaced with a dictionary lookup.
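
The first item, the shortcut for operations between built-in objects, can be pictured with a rough sketch in plain Python. This is only an illustration of the idea, not how PyPy implements it: the table FAST_SUB and the functions sub and generic_sub are hypothetical names, and the slow path is a simplified version of the usual __sub__/__rsub__ protocol.

```python
# Hypothetical fast-path table: exact (type, type) pairs map straight to
# the built-in implementation, with no method lookup on the class.
FAST_SUB = {
    (int, int): int.__sub__,
    (float, float): float.__sub__,
}


def generic_sub(left, right):
    """Slow path: look up __sub__, then __rsub__, on the operand classes."""
    result = NotImplemented
    method = getattr(type(left), "__sub__", None)
    if method is not None:
        result = method(left, right)
    if result is NotImplemented:
        rmethod = getattr(type(right), "__rsub__", None)
        if rmethod is not None:
            result = rmethod(right, left)
    if result is NotImplemented:
        raise TypeError("unsupported operand types for -")
    return result


def sub(left, right):
    """Subtract, taking the shortcut when both operands are exact built-ins."""
    fast = FAST_SUB.get((type(left), type(right)))
    if fast is not None:
        return fast(left, right)
    return generic_sub(left, right)


class MyInt(int):
    """A subclass that overrides __sub__; it must not hit the shortcut."""
    def __sub__(self, other):
        return "custom"
```

Because the table is keyed on the exact type, sub(7, 2) takes the shortcut, while sub(MyInt(7), 2) falls back to the generic lookup and honours the overridden method. This is why such a shortcut only applies to non-subclassed built-in types.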

These benchmarks are doing CPU-intensive operations. You can expect a similar blog post soon about the I/O performance, as the io-improvements branch gets closer to being merged :-) The branch could also improve the speed of string operations, as used e.g. by the templating systems.

Sunday, May 4, 2008

Next Sprint: Berlin, May 17-22

Our next PyPy sprint will take place in the crashed c-base space station, Berlin, Germany, Earth, Solar System. This is a fully public sprint: newcomers (from all planets) are welcome. Suggested topics (other topics are welcome too):

  • work on PyPy's JIT generator: we are refactoring parts of the compiling logic, in ways that may also allow generating better machine code for loops (people or aliens with knowledge of compilers and SSA are welcome)
  • work on the SPy VM, PyPy's Squeak implementation, particularly the graphics capabilities
  • work on PyPy's GameBoy emulator, which also needs graphics support
  • trying some large pure-Python applications or libraries on PyPy and fixing the resulting bugs. Possibilities are Zope 3, Django and others.

For more information, see the full announcement.