Python-VM: Parallel Computing with Python in the Browser

This is a showcase/proof of concept of a Python Virtual Machine with parallel execution that runs entirely in the browser.

Run Your Own Script

You can upload your own Python script to the browser and run it. The file must be a compiled ".pyc" file containing Python bytecode. Since most of the interpreter is not yet implemented, your script might simply fail without warning. If the filename starts with mpi, an MPI-ish interpreter will be used to execute it (i.e. you will get four instances of your program running in parallel).
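As a sketch of what such an mpi-prefixed script might look like, here is a rank-based partial sum in the style of mpi4py (which the MPI-ish interface is modelled after). The fallback stub is purely illustrative and not part of any real API; it merely lets the snippet run on a machine without MPI installed. Only a small subset of mpi4py is actually available here.

```python
# mpi_partial_sum.py -- sketch of an mpi4py-style script.
# The stub below is a hypothetical stand-in so the example also runs
# in a single ordinary Python process without mpi4py installed.
try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
except ImportError:
    class _StubComm:                    # single-process stand-in
        def Get_rank(self):
            return 0
        def Get_size(self):
            return 1
    comm = _StubComm()

rank = comm.Get_rank()                  # which of the parallel copies am I?
size = comm.Get_size()                  # how many copies run in total?

# Split the sum over 0..99 across the available ranks:
total = sum(range(rank, 100, size))
print(f"rank {rank}/{size} computed partial sum {total}")
```

With four workers, each rank computes a quarter of the terms; with the stub (one process), the single rank computes the whole sum.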

This showcase is part of a proposed research project on parallelism for (scientific) web applications. Its aim is to run Python programs with full support for concurrency features (multi-threading and multi-processing) and SIMD (e.g., NumPy), eventually allowing you to have, e.g., a Jupyter notebook run entirely inside your browser, without the need for a server backend to do any computation.

The actual Python code here runs inside one (or several) web workers. This has several consequences. First, the UI remains fully responsive even under large workloads. Second, any output (such as to the console) is visible immediately, because the main thread is free to update the UI. Third, we can run several interpreters (as independent processes) in parallel. Fully decoupling the interpreter from the user interface also allows us to have a debugger, and thus to pause a program or stop it at any time.

Important: we collect data about performance of the scripts presented here. Whenever a script runs, we record the type of browser and operating system, together with the running time. This is used to study the behaviour of the interpreter in different environments. Absolutely no personal information is collected! See below.

FAQ

How does it work?

The interpreter has a Python virtual machine that executes Python bytecode (the same as you find in Python 3.0–3.7). In that regard, the basic approach is similar to Beeware's Batavia, and it allows us to concentrate on the virtual machine first without the need for an actual compiler.
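You can inspect the kind of bytecode the virtual machine interprets using CPython's own dis module (a standard-library tool, independent of this project):

```python
import dis

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# Print the bytecode instructions CPython compiles this function to;
# the exact opcodes vary slightly across Python versions.
dis.dis(fib)
```

Each printed line is one instruction of the kind the virtual machine executes, e.g. loading a variable, calling a function, or returning a value.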

Code is executed as tasks/tasklets that are scheduled in the JavaScript event queue in such a way that no task runs much longer than about a tenth of a second. Moreover, the entire interpreter is internally based on the principles of continuation-passing style and callbacks, and is thus fully asynchronous. The main thread runs a manager that starts and controls new workers, and handles all input and output.

Why does my own code crash?

The underlying interpreter is an actual interpreter, in principle capable of running any Python program. However, it is a work in progress and far from complete; at this point it is rather a skeleton. We have implemented just enough for the above examples to run, and there is little available outside this small set. The full interpreter will become available once the project has progressed/matured enough. We aim to release the entire project as open source.

What kind of code is actually executed when I hit `run'?

The examples have first been compiled with standard Python 3.6. The resulting bytecode files are then loaded into the browser just like the source code you see. The actual interpreter executes this bytecode in much the same way as standard Python does. Hence, the code you see being executed really is Python. However, not all bytecode instructions are fully supported yet, and only a small part of the built-in functions and classes is available at the moment.
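Producing such a bytecode file yourself takes one call to the standard-library py_compile module. Note that this sketch uses whatever CPython you run it with; since the page targets bytecode up to Python 3.7, you may need a matching interpreter version for the result to be accepted.

```python
import importlib.util
import pathlib
import py_compile
import tempfile

# Compile a tiny script to a .pyc file -- the format this page accepts.
src = pathlib.Path(tempfile.mkdtemp()) / "hello.py"
src.write_text('print("hello from bytecode")\n')
pyc = pathlib.Path(py_compile.compile(str(src), cfile=str(src.with_suffix(".pyc"))))

# Every .pyc starts with a magic number identifying the bytecode version:
header = pyc.read_bytes()[:4]
print(pyc.name, header == importlib.util.MAGIC_NUMBER)
```

The resulting hello.pyc is exactly the kind of file you can upload via "Run Your Own Script" above.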

NumPy, Matplotlib, mpi4py, etc.

These extension modules are not really implemented. I merely reuse their interfaces to show that (a) they could be implemented, and that (b) even this unoptimised mock implementation already brings some impressive speed-ups (in the case of NumPy and mpi4py, say). Finally, reusing these interfaces improves comparability with existing systems: you can basically run the same program here as on your own computer.

You will notice that the visuals of matplotlib, for instance, are very crude and rudimentary. There are no labels or ticks, and the data might even move out of the figure. NumPy arrays are limited to integers and floats in one or two dimensions, and their output is often not as nicely pretty-printed as in the original. These are not shortcomings of the design itself, but reflect the minimal amount of time and effort I have put into these things for this proof of concept.
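Within those limits, a typical snippet that the mock NumPy interface is meant to cover looks like the following (shown here with the real NumPy for reference):

```python
import numpy as np

# Stay within the stated limits: int/float arrays, one or two dimensions.
a = np.arange(6, dtype=float).reshape(2, 3)   # a 2-D float array
b = a * 2.0 + 1.0                             # elementwise arithmetic
print(b)
print(b.sum())
```

Elementwise operations and reductions like these are exactly where the mock implementation already yields its speed-ups, since the bulk of the work happens inside the library rather than in interpreted bytecode.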

Why do we need yet another Python implementation?

Indeed, there already are a number of excellent Python implementations that run Python code inside your browser, for instance Skulpt, Brython, Batavia, PyPy.js, and Pyodide. Moreover, companies like Anvil offer a full stack for running Python-based web applications. So, what could yet another implementation possibly offer?

In contrast to these existing systems, our project focuses on parallel and asynchronous execution inside the browser rather than Python as such. We want to know how to best map Python code to JavaScript's asynchronous execution environment. As it turns out, this requires a radically different approach to the very core of how code is scheduled and executed, resulting in a new (experimental) implementation. It might well be that, in the end, some of the insights from our project can be adapted to other implementations as well. But for the time being, we need a different `interpreter core' (virtual machine) as basis for our research.

Why is it so slow?

Keep in mind that the present interpreter is a proof of concept with an emphasis on concurrency. We have not optimised the interpreter yet, as that currently has low priority. Our goal is to explore parallel execution first, and to optimise single-threaded code execution afterwards. In other words, the important thing is that parallel execution (using MPI) does run faster and gives us a speed-up over single-threaded execution.

Furthermore, execution time differs quite significantly, depending on the browser. We found that execution times are typically within a factor of 2x–5x when compared to Skulpt. While Brython is as fast as Skulpt for some examples, our implementation is up to 5x faster than Brython for other examples. Hence, the interpreter already achieves performance that is more or less comparable to other systems (but does not block for longer running programs).

In case of the debugging examples, speed is not an issue at all, and the interpreter needs to be slowed down considerably so that the execution steps can be observed at all. There is a (configurable) setting that determines the minimum delay every time the trace function has been called before execution resumes.
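CPython exposes the same mechanism through sys.settrace: the trace function receives a "line" event before each line of the traced code, and a debugger's delay is simply applied before execution resumes from that call. A minimal stand-alone illustration (standard CPython, not this interpreter):

```python
import sys

# Collect the line numbers visited while demo() runs, mirroring the
# "line" events the interpreter's trace hook reports.
lines = []

def tracer(frame, event, arg):
    if event == "line":
        lines.append(frame.f_lineno)
    return tracer          # keep tracing inside this frame

def demo():
    x = 1
    y = 2
    return x + y

sys.settrace(tracer)
result = demo()
sys.settrace(None)
print("visited lines:", lines, "result:", result)
```

A debugger built on this hook would sleep (or wait for a "step" command) inside the tracer before returning.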

Why do you use MPI rather than Multi-Processing?

Python's multiprocessing module basically forks the interpreter (process) to start a new process and communicates with it through sockets. MPI, on the other hand, starts the desired number of processes from the outset, assigns each one a specific index (the rank), and communication can be handled through a common broadcast channel (the communicator). Both systems require substantial engineering work to be fully supported, but I found it easier to begin with the concept of starting a fixed number of workers/interpreters than to implement enough of the multiprocessing API to get a basic example running. The MPI interface is modelled after mpi4py, although most parts are still missing, of course.

What is the difference between MPI and Multi-Threading?

Concerning this project: MPI starts a number of individual web workers/interpreters, which then communicate through a channel, but can run in parallel. Multi-threading, on the other hand, is implemented basically the same way as in CPython: the individual threads are executed sequentially in a round-robin manner, one after the other, with switches occurring every few milliseconds to give the illusion of parallel execution.

As such, the current implementation of multi-threading does not give you any performance improvements, but it demonstrates the strength of the interpreter's asynchronous approach. Since it relies on short-lived tasks that are scheduled through the event queue, adding a multi-threading API is almost for free. And even though the threads are actually sequentialised, there can still be a benefit from using multi-threading as demonstrated by the multi-threaded turtle example.
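In user code, this multi-threading API looks just like CPython's threading module. A small sketch; the time.sleep calls make each thread yield control, so the two threads' work interleaves rather than running one strictly after the other:

```python
import threading
import time

# Two threads take turns: round-robin in this interpreter, GIL-based
# switching in CPython.  Each appends its progress to a shared log.
log = []

def worker(name):
    for i in range(3):
        log.append((name, i))
        time.sleep(0.01)   # yield so the other thread can run

threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(log)
```

Even without true parallelism, this kind of interleaving is what lets, e.g., several turtles draw "simultaneously" in the multi-threaded turtle example.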

Note that this description does not hold in general, but is limited to the present system and implementation. More generally, the difference is rather that multiple threads all share the same memory, whereas MPI is based on distributed execution where each process has its own dedicated memory. Since web workers do not share memory, but are rather fully isolated, the MPI design is a more natural fit (although SharedArrayBuffers might offer an interesting avenue to explore effective multi-threading).

What is wrong with having a server backend?

Running a web application with a server backend to do computations or database queries can offer a lot of performance and even a fully fledged Unix backend, say. On the flip side, it also means that your application is severely crippled when you lose the connection, e.g., because your train enters a tunnel, your plane has just taken off, or your neighbours are all streaming the latest TV series. Successful applications (e.g. Google Docs, Dropbox, etc.) thus favour a hybrid approach that allows you to work locally and synchronise your work to a server—when it is available. So, in that sense, we want to find out how much can actually be done locally before we need to bring in the powerful server backend.

We can also observe that mobile devices are getting ever more powerful, supporting augmented and virtual reality as well as video games with sophisticated graphics. These devices obviously offer a great deal of performance that we also want to use for "scientific" web applications, data analysis, machine learning, etc. Our project sets out to explore how best to support running Python programs in the browser so that they can utilise as much of this performance as possible.

Finally, running a server backend is not cheap. If you develop a web application, you need to think about the costs involved in running your clients' code (not to mention issues of security). In comparison, having a simple HTTP server deliver a JavaScript file to your clients and letting them run their code locally is much cheaper and scales much better.

However, all that being said, it might turn out that the design of our system is even perfectly suited for working with a server backend. All Python code is executed `remotely' in web workers. In principle, this could be extended to code actually running on the server instead of in a web worker—fully transparently to the programmer or user.

Why do I want to run my code in parallel in the first place?

Here's the problem. Making a processor fast has two major downsides: it needs a lot of power and it gets really hot. Simply plugging a fast processor into your mobile device means that it is great for making eggs sunny side up, and that the battery will be drained before you could say "Jack Robinson". A much better approach is to have two processors, each with half the speed, do the same amount of work—they require less power and get less hot. But! This means that we need to write our applications in such a way that the work can be distributed to the different processors, and this is far from easy!

In the context of web applications, there is only one thread of execution doing all the work for a webpage. Hence, while your web application is busy doing some computation, it also keeps the browser from updating the user interface. It becomes sluggish and unresponsive, and any output meant to inform you about progress only becomes visible once the entire computation has run its course (which rather spoils the idea of a progress indicator). By moving the computation to another process/web worker, however, we free the main thread to take care of the user interface and keep it fully responsive.

The premise of the entire project is that you as a web app developer should not have to worry about any of this. Rather, let the system take care of finding the best way to run your application (i.e. as fast and/or power-efficient as possible) and keep the user interface responsive.

What does it mean to do "research"?

In this context, `research' has two components. First, we develop, try out, and evaluate new and differing approaches to tackle the issue. This is not about just building a web application to run Python in parallel, but to compare various ways this could be done and investigate whether any of them actually brings any benefit at all. For instance, if we just naively spread the work of an application to two different cores, we might find that the necessary communication between the cores to synchronise their work introduces so much overhead that the overall result consumes much more power and runs slower. Something similar could happen here, where we find that the communication overhead negates any advantage from running Python code in web workers.

The second component of research is the documentation. While trying out different approaches, we keep track of what worked and what did not, what issues could be solved, and which are hard limits imposed by the design and nature of the underlying system.

In other words: the primary outcome of this project is not the virtual machine itself, but the insights into the challenges and benefits of running Python code parallelised in the browser.

What information is collected and why?

We use this webpage to study the performance of various parts of the interpreter across different platforms and browsers. To that end, the webpage records, for each script executed, the overall time it took, together with the type of browser and platform. There is no personal information of any kind, no IP tracking, no cookies, etc.

The record as sent to the server contains information as in the following example:
Name: Fibonacci-Sample, Browser: Firefox, Platform: Linux i386, Worker-Count: 1, Time: 45ms, Session: 23A982X7.

The session is a random string generated whenever you load the page that lets us compare runtimes from a single machine. In the end, we want to answer questions like: does the MPI version of a program run faster on every machine or are there differences?

The more examples you run, the more you help with our study of performance—thank you :).

Why are there turtle graphics and a debugger?

Although the primary aim of this project is to build a Python virtual machine for scientific computing and explore how web applications could make use of parallelism in the browser, I have a strong background and interest in education (also see the online Python environment that a student of mine has developed, using Skulpt). It is important to me to demonstrate that the proposed system clearly has applications in education as well as science.

What programming language do you use?

The entire interpreter is written in Scala and then compiled using Scala.js. However, the interface is carefully designed to be fully compatible with JavaScript. As an example, here is the code for the trace-function used to track the currently executed line in the ACE editor.

interpreter.onTrace = function (frame, state, arg) {
    if (state === "line") {
        editor.gotoLine(frame.f_lineno);
    }
};

How is the running label animated/updated?

When you run a program, you will find a label next to the `stop' button with a moving dot. Every few milliseconds, the working interpreter sends an update about its status (processing time, number of active threads, etc.) to the main thread. Whenever these messages arrive, the running label is updated to animate it. Hence, if the running interpreter gets really stuck, the dot would immediately stop moving.

You will notice that the dot moves faster when running an MPI example—in that case several web workers are constantly sending updates to the main/UI thread. With the NumPy interface, on the other hand, the updates arrive more slowly: as the bulk of the work is done inside library calls, fewer Python lines are executed.

What libraries and other projects do you use?

This page uses the ACE Editor, and the code is compiled with Scala.js.

About

This prototype/proof of concept is written entirely by Tobias Kohn, currently a Research Associate in the Computer Architecture Group at the University of Cambridge, UK, and a Research Fellow at Hughes Hall College.