500 Python Interpreters

🐍 No Steppy On Threads 🐍

17 minutes

As we approach the final release date for Python 3.13, I’ve seen an uptick in discussion regarding 3.13’s introduction of an optional GIL. While removing the GIL has been a long time coming for the average user (I’ve dreamt of this for nearly 20 years), there have actually been two concurrent efforts to improve Python’s performance for multithreading. The first is the optional GIL, specified in PEP 703, and the second is the introduction of a per-interpreter GIL, specified in PEP 684 and introduced in Python 3.12.

In this post, I’ll explain some of the history of the GIL, how it’s affected the C API design of Python, and what steps were taken at the C API level to get us to where we are today. I’ll also briefly discuss how PEP 684 and PEP 703 will be able to work in tandem. And to explain all of this, we’re going to start with my personal history on the matter and my first run-in (and faceplant) with Python’s GIL. Learning to make

✨video games ✨

A Dark Seed Is Planted

In the summer of 2005, I attended a several-week-long game development summer camp at Stanford University. This was my first real introduction to programming. I don’t count the C++ class I took freshman year of high school, because we simply copied text from JPEGs of a scan of a C++ book that had been scaled down so two pages could fit on an 8.5x11" paper. At the time, this was (for me) no different than reading the Voynich Manuscript or the Codex Seraphinianus.

I was being taught Python 2.4, a whole year before the 2.5 release. At the time, PyGame was at around version 1.7.1, and it required (both then and now) some knowledge of Python to be used effectively. I would argue that even today, teaching someone Python via PyGame is not a good introduction to Python the language. It is an intermediate step, for after you’ve gotten the basics of objects and for loops and the like down. However, we were using a slightly modified fork of a library named livewires, a wrapper that greatly simplified a lot of the logic behind PyGame so the class could focus on teaching.

This fork had been written by my camp instructor, a programmer, TV show writer, and teacher by the name of Michael Dawson. For those of you who read the section title and are also familiar with ancient e l d r i t c h video game lore, you might recognize his name as the lead programmer, writer, and star of the 1992 DOS game Dark Seed1. This was, for me, a huge deal, as it lent an air of authority to his teaching: he’d made an actual game, and mostly by himself. Surely, I could do the same thing one day. Right? πŸ™‚

During this instructional period, I was using a large number of sprites. I was too inexperienced to know what a sprite atlas was, let alone to implement it with the API we were using under livewires. My load times were atrocious. I had maybe 100+ sprites, each with about 20 animations, that I’d pulled from various geocities websites. I was quite honestly at a loss. Outside of these instructional classes, we had additional activities where we could socialize with other people attending the camp. One of these people was very on the up and up in terms of open source, Linux, and the like (they owned a GP2X and managed to get their game to work on the device) compared to myself. I explained to them the issue I was having, and they showed me how to profile my code, and we saw that because I was loading in a sprite one at a time, it would take forever to open and read each and every individual sprite. They then suggested I use the threading module to at least speed things up. After spending some time tinkering we got it working and… my load times had increased 😭.

The next day, I asked Michael what the issue could be, and with a heavy sigh he had to explain the problem with threads and the Python interpreter. I had stumbled into the oldest performance issue Python has had to date2: that damned GIL.

I’m also not the only person who ran into this issue within the context of games. Sid Meier’s Civilization IV used Python for its game logic and scripting systems. Games would run slower the more units were on the map and the more turns had elapsed. It was a known issue that Civ IV had (and one might even argue still has) performance problems. With the release of Sid Meier’s Civilization V, the team switched to Lua, and this was touted as a big deal because it meant a huge performance improvement. And it was! Fewer lockups, and games didn’t take forever to process an AI turn (usually).

Alas! The beef I’ve had with the GIL was planted as a dark seed about 20 years ago. For many years I’ve been filled with jealousy towards Lua, even going so far as to refuse to use it unless forced to for the last 9 years, though this was mostly because I had to operate directly on a stack and worry about things like upvalues. What’re upvalues? I don’t know dawg, what’s up with you?

Erm, what the sigma GIL?

OK, so there’s a general misunderstanding of what the GIL is, in addition to why it even exists in the first place. To better comprehend the situation, we’re going to talk specifically about the evolution of CPython and its interpreter.

A Short History of Threading

Threading was first introduced in Python with the release of Python 1.5. This release brought along with it the threading module, and it is also where the PyInterpreterState and PyThreadState objects were first introduced, bringing the GIL along with them. Prior to this, Python had a simple evaluation step in a single function named eval_code that would in some cases recurse into itself. The introduction of PyInterpreterState and PyThreadState was a big deal, as it allowed Python to finally begin moving towards an embeddable state where users could not only extend Python, but also embed it. This might not seem like such a big deal, but it was 1998. There were many scripting languages in use at the time, but each had its own limitations with regard to how C could interact with it in custom programs. Hell, go and look at Perl’s C API today and you can see a very 90s API design that might be viewed as anti-embedding, but that really was just a pragmatic assumption: if you were embedding Perl as a C library, you were the Perl interpreter.

This newly introduced PyInterpreterState had a single instance, and even with the upcoming release of Python 3.13 there is still a static PyInterpreterState* that represents the main interpreter. Even if this main PyInterpreterState is never used, it still exists.

This is why the multiprocessing module has historically been the faster option for Python. We’re creating a whole new interpreter instance as a subprocess, and thus it does not need to worry about contending for a GIL. However, this comes at the cost of an entire subprocess and all the work involved in communicating between the two. For Windows especially, this is a very heavy approach to a speedup, yet it manages to still be faster in high-contention cases.

Thus, with the release of 3.13, multiprocessing is technically only needed in cases where the GIL is still needed, and even then it can still carry a cost compared to threading in some specific use cases.

Big GIL, So What?

OK, so if you’re coming from other scripting languages, this approach might not make much sense. For those of you coming from lua_State*s, JSContextRefs, v8::Isolates, and more, we need to take a look at how Python was evolving. There was no real line in the sand to explain the direction Python would be taking. Call me one to speculate, but it seemed like features were being added because someone wanted them, rather than with some goal in mind. The 1.5 release predates both the Python Software Foundation and PEP 1, which was introduced only with the release of Python 2.0 (alongside the PEP process itself). Thus, there was no real community direction Python could go in beyond whatever work someone was willing to contribute, and therefore these changes were made because… well, what else was there at the time? We had Perl, Tcl, and a bunch of completely forgettable implementations of Scheme. While Lua 3.0 was out, it wouldn’t be until about 5 months later, in July of 1998, that Lua 3.1 was released along with its now well-known lua_State* (and even then, there were semantic issues that weren’t resolved until the 4.0 release).

In fact, if we look at what was available as existing API designs to pull from, the only real stable designs for execution were Tcl (which people have gone to great lengths to avoid copying, because when you do try to copy its design you get CMake3) and OpenGL. As OpenGL has a “context” object that you must swap to/make current on a given thread, it should be no surprise that Python implemented a similar interface while eschewing the isolated single-thread interpreter design.

It’s extremely important to keep in mind that options for these decisions were limited at the time. People were more or less stumbling around in the dark on the C API side of dynamically typed scripting languages. We didn’t have decades of mistakes and real-world use to learn from. Even today, when we know there are “better” options, sometimes the worse one wins out. For example, Lua has constantly been heralded in various spaces for its register-based VM. Yet WASM, which is shaping up to be the cross-platform VM du jour of the 21st century, went with a stack-based VM. This decision was driven entirely by ✨ reasons ✨ that are outside the scope of this post, that I don’t actually think hold water, and if you think they do you can shut up nerd.

How Can She Interpretβ€½β€½

We’re near the end of this explanation, so let’s take a brief look at how the GIL works in practice at the C API level, prior to the GIL becoming optional. For starters, users create something called a sub-interpreter, which at the C API level you mostly interact with through its PyThreadState. This is done via the poorly named Py_NewInterpreter. When this is called, it takes whatever is set as the global interpreter and adds a new PyThreadState to the list that the PyInterpreterState keeps track of. Users are then expected to acquire the GIL on the PyThreadState, evaluate some Python code, and then release the GIL from the PyThreadState. At that point they are free to either destroy the PyThreadState, or keep it around and keep executing as needed. No matter what happens, there is going to be a lock involved. No two PyThreadStates can execute Python bytecode at the same time. However, they can execute multiple C calls at the same time, which is why, for long-running pure C operations, extension and embedding developers are encouraged to release the GIL temporarily. Technically speaking, this is where something like ThreadSanitizer can come in handy.

This approach to extensions is also sort of why we have the GIL in the first place. Up until Python 3.5, which introduced so-called multi-phase extension module initialization via PEP 489, we didn’t actually have a way to really isolate modules once they were imported. All objects, exceptions, methods, etc. were added at the same time and inserted into the central main interpreter. The real purpose of an extension is to effectively separate the native code (written in whatever language you want, though most folks are using C++, C, Zig, or Rust these days) from the actual binding. Thus “when in Python, there is a lock”, and extension developers are given a level of trust that the lock can be disengaged and re-engaged as long as Python objects are not being touched4.
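For reference, here’s a minimal sketch of the multi-phase shape that PEP 489 introduced; the module name demo and its exec function are invented for illustration, and this skeleton is a definition fragment rather than a runnable program (it only does something once CPython loads it). Instead of PyInit_* building and returning a finished module, it returns only the definition, and CPython runs the Py_mod_exec slot later, once per importing interpreter. The Py_mod_multiple_interpreters slot shown here is the 3.12+ way to declare per-interpreter-GIL support.

```cpp
#include <Python.h>

// Runs once per importing (sub-)interpreter, so every interpreter gets its
// own module object and its own constants, rather than sharing one copy.
static int demo_exec(PyObject* module) {
  return PyModule_AddIntConstant(module, "answer", 42);
}

static PyModuleDef_Slot demo_slots[] = {
  {Py_mod_exec, reinterpret_cast<void*>(demo_exec)},
  // New in 3.12: declare this module safe to load under a per-interpreter GIL.
  {Py_mod_multiple_interpreters, Py_MOD_PER_INTERPRETER_GIL_SUPPORTED},
  {0, nullptr},
};

static PyModuleDef demo_def = {
  PyModuleDef_HEAD_INIT,
  "demo",     // hypothetical module name
  nullptr,    // docstring
  0,          // size of per-module state (none here)
  nullptr,    // methods
  demo_slots, // the multi-phase part: slots instead of a ready-made module
  nullptr, nullptr, nullptr,
};

// Phase one: hand CPython the definition and nothing else. No objects are
// created here, which is what makes per-interpreter isolation possible.
PyMODINIT_FUNC PyInit_demo(void) {
  return PyModuleDef_Init(&demo_def);
}
```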

As far as I know, Rust’s lifetimes can’t be entirely enforced across these dynamic library extensions unless they’re encoded at the C ABI level in some manner. To my knowledge, PyO3 and other wrappers do not expose raw locking/unlocking operations at the type system level, where the compiler could warn users. They prefer to instead provide an allow_threads method that handles the locking/unlocking around the execution of a user-provided closure that performs the actual work (an operation that any “RAII”-capable language might provide for sanity and safety’s sake).

With the release of Python 3.12, however, we actually saw the first vestiges of a Python C API that allows us to create a sub-interpreter with no connection to the global interpreter. This API exists in the form of Py_NewInterpreterFromConfig, where a PyInterpreterConfig structure is passed in. This structure allows users to set a few “hard” settings. For example, you can disable the threading module, os.execv(), and process forking (though subprocess does not create forked processes, launching entirely brand new ones instead). The most important of these settings, however, is the PyInterpreterConfig.gil member, which lets users either share the main interpreter’s GIL with the sub-interpreter, or give the sub-interpreter its own GIL, thus keeping it isolated from the rest of the system. There are a few caveats here. Although each sub-interpreter’s sys.stdin object is a unique PyObject*, it refers to the same underlying handle as the other interpreters’ sys.stdin objects. This can be semi-resolved by just changing it.

500 Interpreter States

OK, so with everything we’ve mentioned up until now you’re more or less prepared to understand why we’ve got these wacky shenanigans. So let’s make an extreme example. Let’s make:

500 Interpreter States

This code example is going to be very “high level”. We won’t be doing proper due diligence on API design, because this is meant to be an example, not production-ready code. With that in mind, we have a few things we need to do before we even create our Python interpreters. The first of these is to create an isolated configuration for the default interpreter. This is extremely important if you’re doing a proper embed of CPython, as you might not want things like user directories and the like modifying how Python executes.

PyConfig config {};
PyConfig_InitIsolatedConfig(&config);
if (auto status = Py_InitializeFromConfig(&config); PyStatus_IsError(status)) {
    /* handle error here */
}
PyConfig_Clear(&config);

There is additional work one might want to do with the PyPreConfig set of APIs, but I… don’t care about handling that. Read the documentation. You can figure it out πŸ˜‰

This configuration creation does the very basic steps for initializing Python into an isolated state, which is intended for embedding situations where the surrounding environment and user configuration shouldn’t influence how Python runs.

Next, we’re going to just spin up all our threads. When new PyThreadState instances are created, they are set as the current thread state. This means we more or less need to initialize them inside of a thread. We can get creative with the situation if we’re using C++, where thread_local exists as a keyword + storage qualifier. This also lets us check whether a given thread has already had a thread state initialized. We won’t be doing any fancy checking here, since this is “babby’s first example”, but this sort of design paradigm might come in handy. We’ll also just define the maximum number of states as a constant to save us some time.

// Mark this as inline if you're placing it in a header. Save your sanity. πŸ™‚
static thread_local PyThreadState* state = nullptr;
// 500 interpreter states 😌
static constexpr auto MAXIMUM_STATES = 463;

Next we’ll create our default PyInterpreterConfig. Because the threads capture this configuration object by value, we don’t need to make it a global in the function we’re executing.

PyInterpreterConfig config = {
  .use_main_obmalloc = 0,
  .allow_fork = 0,
  .allow_exec = 0,
  .allow_threads = 0,
  .allow_daemon_threads = 0,
  .check_multi_interp_extensions = 1,
  .gil = PyInterpreterConfig_OWN_GIL,
};

There are a few details here that are more accurately explained in the Python documentation; however, the two biggest things that matter are the check_multi_interp_extensions value and gil. The check_multi_interp_extensions value effectively forces all extensions to be what is known as a multi-phase extension. This is where you initialize some of the module data, but don’t hook up objects, exceptions, and the actual Python bindings until a separate function is called. This was a huge stumbling block prior to both the per-interpreter GIL and the optional GIL. By forcing all extensions to be multi-phase, you can create per-interpreter instances of all your extension’s objects. This can get hairy if you actually have some hidden globals within the native code that you’re binding, but surely you wouldn’t lie about that ever, right? πŸ™‚

Moving on, we’ll generate all our threads; each will create an interpreter state, execute some Python code that prints to sys.stdout, and then shut down its interpreter. Easy peasy, lemon squeezy.

// We want the threads to join on destruction, so std::jthread it is
std::vector<std::jthread> tasks { };
tasks.reserve(MAXIMUM_STATES);
for (auto count = 0zu; count < tasks.capacity(); count++) {
  tasks.emplace_back([config, count] {
    if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status)) {
        std::println("Failed to initialize thread state {}. Received error {}", count, status.err_msg);
        return;
    }
    auto text = std::format(R"(print("Hello, world! From Thread {}"))", count);
    auto globals = PyDict_New();
    auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
    // Py_DecRef is the NULL-safe function form of Py_XDECREF, so a failed
    // compile or evaluation falls through the cleanup below without issue.
    auto result = code ? PyEval_EvalCode(code, globals, globals) : nullptr;
    Py_DecRef(result);
    Py_DecRef(code);
    Py_DecRef(globals);
    Py_EndInterpreter(state);
    state = nullptr;
  });
}

Now, we can just let the threads go out of scope and call Py_Finalize(). What could possibly go wrong?

FUCK YOU, ASSHOLE!

IF YOU’RE A DUMB ENOUGH TO TRY TO RUN 500 463 PYTHON INTERPRETER STATES IN A DEBUG BUILD, YOU’RE A BIG ENOUGH SHMUCK TO COME ON DOWN TO BIG BILL’S HELL INTERPRETERS.

IF YOU THOUGHT THIS C AND C++ CODE WAS GONNA RUN JUST FINE WITHOUT ANY ISSUE

YOU CAN KISS MY ASS

YOU HEARD ME RIGHT

YOU CAN KISS MY ASS

OK, hold up, what’s happening?

So, unfortunately, during the writing of this post I ran into memory corruption issues, which is why it’s been delayed by so much (I wanted to post it last weekend 😭). I thought, “Hey, you know, it’s been so long since I’ve used C++’s std threading API that I probably fucked up.”

Unfortunately, I did not fuck up, and my C++ code was correct. It seems there are some latent memory corruption issues if you are creating and then destroying 500 Python interpreter states in debug mode. I unfortunately wasn’t able to nail it down to any one specific issue beyond “at some point after about 10 PyThreadStates have been constantly created and destroyed”, you get memory corruption, and there’s no way to recover.

On closer inspection, this is a case of memory reuse, where destroyed PyThreadStates are not fully zeroed out, and this causes Python’s memory debug assertions to fire. Effectively, because we’re creating and destroying these PyThreadStates as we execute, we’re breaking some assumptions within Python’s C code. I do not know if this is a security issue or not, but it most certainly IS something that should be resolved, even if the reason I found it was by writing a shitpost πŸ˜…. I collected what little information I could, and have submitted an issue on the Python GitHub repository, which you can find here.

The Dream Of The Child

So here we are in 2024, and what I’d hoped and dreamed would eventually happen in my childhood (the GIL going away) is finally on the horizon. I know the current state of things is not great in the Python space, but for a language that’s been running as long as it has, it should be no surprise that there will be significant challenges along the way.

The only thing that pains me is that these days there are better solutions than using Python for an embedded scripting language (that aren’t named Lua). From JS implementations like V8, to .NET Core being self hostable, to so, so many WASM implementations, to CPU emulators like libriscv. Sadly, even alternative Python VM implementations like pocketpy all seem like better options.


  1. Yes, I was taught how to program by Michael Dawson. ↩︎

  2. No, I don’t think that a JIT or inlining byte code with native calls has been as much of a problem. It’s a scripting language. You knew what you were getting into. If you wanted speed and a JIT, you’d use a precompiled language until the V8 engine came out. ↩︎

  3. Assuming you’re not new to my site, did you really think I wouldn’t mention CMake? Seriously? ↩︎

  4. What if “Trusting Trust” was also extended to every single API you work with? πŸ™‚ ↩︎