As we approach the final release date for Python 3.13, I’ve seen an uptick in discussion regarding 3.13’s introduction of an optional GIL. While removing the GIL has been a long time coming for the average user (I’ve dreamt of this for nearly 20 years), there have actually been two concurrent efforts to improve Python’s performance for multithreading. The first is the optional GIL, specified in PEP 703, and the second is the introduction of a per-interpreter GIL, specified in PEP 684 and introduced in Python 3.12.
In this post, I’ll explain some of the history of the GIL, how it’s affected the C API design of Python, and what steps were taken at the C API level to get us to where we are today. I’ll also briefly discuss how PEP 684 and PEP 703 will be able to work in tandem. And to explain all of this, we’re going to start with my personal history on the matter and my first run-in (and faceplant) with Python’s GIL.
Learning to Make ✨ Video Games ✨
A Dark Seed Is Planted
In the summer of 2005, I attended a several-week-long game development summer camp held at Stanford University. This was my first real introduction to programming. I don’t count the C++ class I took in freshman year of high school, because we simply copied text from JPEGs of a scan of a C++ book that were scaled down so that two pages could fit on an 8.5x11" paper. At the time, this was (for me) no different than reading the Voynich Manuscript or Codex Seraphinianus.
I was being taught Python 2.4, a whole year before the 2.5 release. At the time, PyGame was at around version 1.7.1, and it required then (and still requires now) some knowledge of Python to be used effectively. I would argue that even today teaching someone Python via PyGame is not a good introduction to Python the language. It is an intermediate step after you’ve gotten the basics of objects and for loops and the like down. However, we were using a (slightly modified) fork of a library named livewires, a wrapper that greatly simplified a lot of the logic behind PyGame so you could focus on teaching.
This fork had been written by my camp instructor, a programmer, TV show writer, and teacher by the name of Michael Dawson. For those of you who read the section title, and are also familiar with ancient e l d r i t c h video game lore, you might recognize his name as the lead programmer, writer, and star of the 1992 DOS game Dark Seed¹. This was, for me, a huge deal, as it lent an air of authority to his teaching: he’d made an actual game, and mostly by himself. Surely, I could do the same thing one day. Right?
During this instructional period, I was using a large number of sprites. I was
too inexperienced to know what a sprite atlas was, let alone to implement it
with the API we were using under livewires. My load times were atrocious. I had
maybe 100+ sprites, each with about 20 animations, that I’d pulled from various
geocities websites. I was quite honestly at a loss. Outside of these
instructional classes, we had additional activities where we could socialize
with other people attending the camp. One of these people was very on the up
and up in terms of open source, Linux, and the like (they owned a GP2X and
managed to get their game to work on the device) compared to myself. I
explained to them the issue I was having, and they showed me how to profile my
code, and we saw that because I was loading in a sprite one at a time, it would
take forever to open and read each and every individual sprite. They then
suggested I use the threading module to at least speed things up. After
spending some time tinkering we got it working and… my load times had
increased.
The next day, I asked Michael what the issue could be, and with a heavy sigh he had to explain the issue with threads and the Python interpreter, as I had stumbled into the oldest performance issue Python has had to date²: that damned GIL.
I’m also not the only person who ran into this issue within the context of games. Sid Meier’s Civilization IV used Python for its game logic and scripting systems. Games would run slower the more units were on the map and the more turns had elapsed. It was a known issue that Civ IV had (and one might even argue still has) performance issues. With the release of Sid Meier’s Civilization V, the team switched to Lua, and this was touted as a big deal because it meant a huge performance improvement. And it was! Fewer lockups, and games didn’t take forever to process an AI turn (usually).
Alas! The beef I’ve had with the GIL was planted as a dark seed about 20 years ago. For many years I’ve been filled with jealousy towards Lua, even going so far as to refuse to use it for the last 9 years unless forced to, though this was mostly because I had to operate directly on a stack and worry about things like upvalues. What’re upvalues? I don’t know dawg, what’s up with you?
Erm, what the sigma GIL?
OK, so there’s a general misunderstanding of what the GIL is, in addition to why it even exists in the first place. To better comprehend the situation, we’re going to talk specifically about the evolution of CPython and its interpreter.
A Short History of Threading
Threading was first introduced with the release of Python 1.5. This release brought along with it the threading module, and is also where the PyInterpreterState and PyThreadState objects were first introduced, bringing the GIL along with them. Prior to this, Python had a simple evaluation execution step in a single function named eval_code that would in some cases recurse into itself. The introduction of PyInterpreterState and PyThreadState was a big deal, as it allowed Python to finally begin moving towards an embeddable state where users could not only extend Python, but also embed it. This might not seem like such a big deal, but it was 1998. Scripting languages in use at the time were many, but each had its own limitations with regard to how C code could interact with it in custom programs. Hell, go and look at Perl’s C API today and you can see a very 90s API design that might be viewed as anti-embedding, but really was just a pragmatic approach: it assumed that if you were trying to embed Perl as a C library, you were the Perl interpreter.
This newly introduced PyInterpreterState had a single instance, and even with
the upcoming release of Python 3.13, there is still a static PyInterpreterState* object that represents the main PyInterpreterState. Even
if this PyInterpreterState is not used, it still exists.
This is why the multiprocessing module has historically been faster for Python: it creates a whole new interpreter instance in a subprocess, and thus does not need to worry about a shared GIL. However, this comes at the cost of an entire subprocess and all the work involved in communicating between the two. On Windows especially, this is a very heavy approach for a speedup, yet it still manages to be faster in the high-contention cases.
Thus, with the release of 3.13, multiprocessing is technically only needed in cases where the GIL is still enabled, and even then it can still carry a cost compared to threading in some specific use cases.
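To make the contrast concrete, here’s a minimal pure-Python sketch (an illustration, not a benchmark; exact timings depend on your machine and interpreter build): on a GIL-enabled build, CPU-bound threads all produce correct results, but their bytecode execution is serialized, which is exactly the gap multiprocessing has historically papered over.

```python
import threading

# A deliberately CPU-bound task. Under a GIL-enabled build, running
# several of these in threads gives correct results but little to no
# speedup, because only one thread executes bytecode at a time.
def count_down(n, results, index):
    total = 0
    while n > 0:
        total += n
        n -= 1
    results[index] = total

results = [None] * 4
threads = [
    threading.Thread(target=count_down, args=(100_000, results, i))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All four threads computed the same (correct) sum; on a GIL build the
# wall-clock time is roughly that of four serial runs, not one.
print(results)
```

Run the same workload with multiprocessing (or a free-threaded 3.13 build) and the four tasks can actually execute in parallel, at the cost of process spawn and IPC overhead.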
Big GIL, So What?
OK, so if you’re coming from other scripting languages this approach might not
make much sense. For those of you coming from lua_State*s, JSContextRefs,
v8::Isolates, and more, we need to take a look at how Python was evolving.
There was no real line in the sand to explain the direction Python would be
taking. Call me one to speculate, but it seemed like features were being added
because someone wanted them, rather than because of some goal in mind. This 1.5
release predates both the Python Software Foundation and PEP 1, which was
introduced only with the release of Python 2.0 (alongside the PEP process).
Thus, there was no real community direction that Python could go in, beyond
whatever work someone was willing to contribute, and therefore these changes
were made because… well, what else was there at the time? We had Perl, Tcl, a
bunch of completely forgettable Scheme implementations, and while
Lua 3.0 was out, it wouldn’t be until about 5 months later in July of 1998 that
Lua 3.1 was released along with its now well known lua_State* (and even then
there were semantic issues that weren’t resolved until the 4.0 release).
In fact, if we look at what was available for existing API designs to pull from, the only real stable designs for execution were Tcl (which people have gone to great lengths to avoid copying, because when you do try to copy its design you get CMake³), and OpenGL. As OpenGL has a “context” object that you must swap to/make current on the given thread, it should be no surprise that Python implemented a similar interface while eschewing an isolated single-thread interpreter design.
It’s extremely important to keep in mind that options for these decisions were limited at the time. People were more or less stumbling around in the dark on the C API side of dynamically typed scripting languages. We didn’t have decades of mistakes and real world use to learn from. Even today, when we know there are “better” options, sometimes the worse one wins out. For example, Lua has constantly been heralded in various spaces for its register-based VM. Yet WASM, which is shaping up to be the cross-platform VM du jour of the 21st century, went with a stack-based VM. This decision was driven entirely because of ✨ reasons ✨ that are outside the scope of this post and that I don’t actually think hold water, and if you think they do you can shut up, nerd.
How Can She Interpret‽‽
We’re near the end of this explanation, so let’s take a brief look at how the GIL works in practice at the C API level prior to making the GIL optional. For starters, users create something called a sub-interpreter, which is what a PyThreadState technically is. This is done via the poorly named Py_NewInterpreter. When this is called, it uses whatever is set as the global interpreter and adds itself to the list of PyThreadStates that the PyInterpreterState keeps track of. Users are then expected to acquire the GIL at some point on the PyThreadState, evaluate some Python code, and then release the GIL from the PyThreadState; at this point they are free to either destroy the PyThreadState or keep it around and keep executing as needed. No matter what happens, there is going to be a lock occurring. No two PyThreadStates can execute Python bytecode at the same time. However, they can execute multiple C calls at the same time, which is why, for long-running pure C operations, extension and embedding developers are encouraged to release the GIL temporarily. Technically speaking, this is where something like ThreadSanitizer can come in handy.
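You can actually observe this “C calls can overlap” behavior from pure Python, since blocking calls like time.sleep release the GIL internally; it’s the Python-level face of the Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS dance extension authors are encouraged to do. A minimal sketch:

```python
import threading
import time

# Blocking C-level calls such as time.sleep release the GIL for their
# duration, so threads can overlap while "in C" even though Python
# bytecode execution itself is serialized.
def sleeper():
    time.sleep(0.3)  # the GIL is released while sleeping

start = time.perf_counter()
threads = [threading.Thread(target=sleeper) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# Four 0.3 s sleeps overlap instead of serializing to ~1.2 s.
print(f"elapsed: {elapsed:.2f}s")
```

Swap the sleep for a tight Python loop and the threads serialize again, which is the whole point of the lock.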
This approach to extensions is also sort of why we have the GIL in the first place. Up until Python 3.5, which introduced so-called multi-phase extension module initialization via PEP 489, we didn’t actually have a way to really isolate modules once they were imported. All objects, exceptions, methods, etc. were added at the same time and inserted into the central main interpreter. The real purpose is to effectively separate the native code (written in whatever language you want, though most folks are using C++, C, Zig, or Rust these days) from the actual binding. Thus “when in Python, there is a lock”, and extension developers are given a level of trust that the lock can be disengaged and re-engaged as long as Python objects are not being touched⁴.
As far as I know, Rust’s lifetimes can’t entirely be enforced across these dynamic library extensions unless they’re encoded at the C ABI level in some manner, and to my knowledge PyO3 and other wrappers do not expose this at the type system level for the compiler to warn users about raw locking/unlocking operations. They instead prefer to provide an allow_threads method that handles the locking/unlocking within the execution of a user-provided closure that performs the actual work (an operation that any “RAII”-capable language might provide for sanity and safety’s sake).
With the release of Python 3.12, however, we actually saw the first steps towards a Python C API that allows us to create a sub-interpreter that has no connection to the global interpreter. This API exists in the form of Py_NewInterpreterFromConfig, where a PyInterpreterConfig structure is passed in. This structure allows users to set a few “hard” settings. For example, you can disable the threading module, os.execv(), and process forking (though subprocess does not create forked processes, launching entirely brand new ones instead). The most important of these settings, however, is the PyInterpreterConfig.gil member, which allows users to either share the main interpreter’s GIL with the sub-interpreter, or give the sub-interpreter its own GIL, thus keeping it isolated from the rest of the system. There are a few caveats here. Although the sys.stdin object is a unique PyObject* in each sub-interpreter, it refers to the same underlying handle as the other sys.stdin objects. This can be semi-resolved by just reassigning it.
500 Interpreter States
OK, so with everything we’ve mentioned up until now you’re more or less prepared to understand why we’ve got these wacky shenanigans. So let’s make an extreme example. Let’s make:

500 Interpreter States
This code example is going to be very “high level”. We won’t be doing proper due diligence on API design, because this is meant to be an example, not production-ready code. With that in mind, we have a few things to do before we even create our Python interpreters. The first of these is to create an isolated configuration for the default interpreter. This is extremely important if you’re doing a proper embed of CPython, as you might not want things like user directories and the like modifying how Python executes.
PyConfig config {};
PyConfig_InitIsolatedConfig(&config);
if (auto status = Py_InitializeFromConfig(&config); PyStatus_IsError(status)) {
    /* handle error here */
}
PyConfig_Clear(&config);
There is additional work one might want to do with the PyPreConfig set of APIs, but I… don’t care about handling that. Read the documentation. You can figure it out.
This configuration creation does the very basic steps for initializing Python into an isolated state, which is intended for situations where users might not be shipping the Python stdlib alongside their executable.
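If you want to poke at what “isolated” means without writing any C, the -I command-line flag is the closest Python-level relative of PyConfig_InitIsolatedConfig: it ignores PYTHON* environment variables and the user site directory, much like an embedded isolated config does. A quick sketch:

```python
import subprocess
import sys

# Launch a child interpreter in isolated mode (-I) and inspect its
# flags: both "isolated" and "no_user_site" are set to 1.
out = subprocess.run(
    [sys.executable, "-I", "-c",
     "import sys; print(sys.flags.isolated, sys.flags.no_user_site)"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # -> "1 1"
```

The C-level isolated config goes further (it also controls path computation and more), but the spirit is the same: the embedded interpreter shouldn’t be reconfigurable by the surrounding environment.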
Next, we’re going to just spin up all our threads. When new PyThreadState instances are created, they are set as the current thread state after creation. This means we more or less need to initialize them inside of a thread. We can get creative with the situation if we’re using C++, where thread_local exists as a keyword + storage qualifier. This also lets us know whether a given thread has had a thread state initialized. We won’t be doing any fancy checking here, since this is “babby’s first example”, but this sort of design paradigm might come in handy. We’ll also just define the maximum number of states as a constant to save us some time.
// Mark this as inline if you're placing it in a header. Save your sanity.
static thread_local inline PyThreadState* state = nullptr;
// 500 interpreter states
static constexpr auto MAXIMUM_STATES = 463;
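For readers who know Python better than C++: threading.local is the rough Python-level analogue of the thread_local slot above. Each thread sees its own independent copy of the attribute, which is the same trick we’re using to keep one PyThreadState* per thread. A minimal sketch:

```python
import threading

# Each thread gets its own independent "value" attribute on this
# object, just like each C++ thread gets its own thread_local pointer.
state = threading.local()

def worker(name, seen):
    state.value = name       # only visible to this thread
    seen.append(state.value)

seen = []
threads = [
    threading.Thread(target=worker, args=(f"tstate-{i}", seen))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(seen))
# The main thread never assigned state.value, so it has no such slot.
print(hasattr(state, "value"))
```

The C++ version has the extra nicety that the slot doubles as an “is this thread initialized yet?” flag, which is exactly how we’ll treat the nullptr check.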
Next we’ll create our default PyInterpreterConfig. Because we’re relying on
capturing this configuration object we don’t need to make it global in the
function we’re executing.
PyInterpreterConfig config = {
    .use_main_obmalloc = 0,
    .allow_fork = 0,
    .allow_exec = 0,
    .allow_threads = 0,
    .allow_daemon_threads = 0,
    .check_multi_interp_extensions = 1,
    .gil = PyInterpreterConfig_OWN_GIL,
};
There are a few details here that are more accurately explained in the Python documentation; however, the two biggest things that matter are the check_multi_interp_extensions value and gil. The check_multi_interp_extensions value effectively forces all extensions to be what is known as a multi-phase extension. This is where you initialize some of the module data, but don’t hook up objects, exceptions, and actual Python bindings until a separate function is called. This was a huge stumbling block prior to both the per-interpreter GIL and the optional GIL. By forcing all extensions to be multi-phase, you can create per-extension instances of all your objects. This can get hairy if you actually have some hidden globals within your native code that you’re binding, but surely you wouldn’t lie about that, ever, right?
Moving on, we’ll generate all our threads; each will create an interpreter state, execute some Python code that prints to sys.stdout, and then shut down its interpreter. Easy peasy, lemon squeezy.
// We want the threads to join on destruction, so std::jthread it is
std::vector<std::jthread> tasks { };
tasks.reserve(MAXIMUM_STATES);
for (auto count = 0zu; count < tasks.capacity(); count++) {
    tasks.emplace_back([config, count] {
        if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status)) {
            std::println("Failed to initialize thread state {}. Received error {}", count, status.err_msg);
            return;
        }
        auto text = std::format(R"(print("Hello, world! From Thread {}"))", count);
        auto globals = PyDict_New();
        auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
        auto result = PyEval_EvalCode(code, globals, globals);
        Py_DecRef(result);
        Py_DecRef(code);
        Py_DecRef(globals);
        Py_EndInterpreter(state);
        state = nullptr;
    });
}
Now, we can just let the threads go out of scope and call Py_Finalize(). What
could possibly go wrong?
FUCK YOU, ASSHOLE!
IF YOU’RE A DUMB ENOUGH TO TRY TO RUN 500 463 PYTHON INTERPRETER STATES IN A
DEBUG BUILD, YOU’RE A BIG ENOUGH SHMUCK TO COME ON DOWN TO BIG BILL’S HELL
INTERPRETERS.
- MEMORY CORRUPTIONS!
- SYNCHRONIZATION ISSUES
- THE MESSAGE “OUCH” IN YOUR TERMINAL FROM PYTHON’S MEMORY TRACKING
- THIEVES
IF YOU THOUGHT THIS C AND C++ CODE WAS GONNA RUN JUST FINE WITHOUT ANY ISSUE
YOU CAN KISS MY ASS
YOU HEARD ME RIGHT
YOU CAN KISS MY ASS
OK, hold up, what’s happening?
So unfortunately, during the writing of this post (and the reason it’s been delayed by so much: I wanted to post it last weekend), I was running into memory corruption issues. I thought “Hey, you know, it’s been so long since I’ve used C++’s std threading API that I probably fucked up.” Unfortunately, I did not fuck up and my C++ code was correct. It seems there are some latent memory corruption issues if you are creating and then destroying 500 Python interpreter states in debug mode. I unfortunately wasn’t able to nail it down to any one specific issue beyond “at some point after about 10 PyThreadStates have been created and destroyed constantly, you get memory corruption”, and there’s no way to recover.
On closer inspection, this is a case of memory reuse, where destroyed PyThreadStates are not fully zeroed out, and this causes Python’s memory debug assertions to fire off. Effectively, because we’re creating and destroying these PyThreadStates as we execute, we’re breaking some assumptions within Python’s C code. I do not know if this is a security issue or not, but it most certainly IS something that should be resolved, even if the reason I found it was by writing a shitpost. I collected what little information I could, and have submitted an issue on the Python GitHub repository, which you can find here.
The Dream Of The Child
So here we are in 2024, and what I’d hoped and dreamed would eventually happen in my childhood (the GIL going away) is finally on the horizon. I know the current state of things is not great in the Python space, but for a language that’s been running as long as it has, it should be no surprise that there will be significant challenges along the way.
The only thing that pains me is that these days there are better solutions than Python for an embedded scripting language (that aren’t named Lua): from JS implementations like V8, to .NET Core being self-hostable, to so, so many WASM implementations, to CPU emulators like libriscv. Sadly, even alternative Python VM implementations like pocketpy all seem like better options.
1. Yes, I was taught how to program by Michael Dawson. ↩︎
2. No, I don’t think that a JIT or inlining byte code with native calls has been as much of a problem. It’s a scripting language. You knew what you were getting into. If you wanted speed and a JIT, you’d use a precompiled language until the V8 engine came out. ↩︎
3. Assuming you’re not new to my site, did you really think I wouldn’t mention CMake? Seriously? ↩︎
4. What if “Trusting Trust” was also extended to every single API you work with? ↩︎