
Why isn't your multi-threaded Python code faster?

Updated
5 min read

This article was not written by AI. It is human-written, and I would appreciate any feedback on it.

We have all heard that Python isn't truly parallel. Why is that? What actually is the Global Interpreter Lock (GIL), and how is it the culprit behind the lack of true parallelism in Python? I try to explain it here, along with a few internal workings of the language, using a very high-level oversimplification of the actual mechanics.

In short, the GIL is a lock that allows only one thread at a time to execute Python bytecode within the CPython interpreter.

Prerequisites

To understand the GIL, we first need to understand the following concepts.

  1. IO Bound v/s CPU Bound

    The core reason for building parallel systems is a performance bottleneck, which arises either from IO waits or from the processor's limits.

    • An IO-bound task spends most of its time waiting for something outside the CPU to respond. For e.g.: querying a database (the wait depends on the database), downloading a file from the internet, etc.

    • A task is CPU-bound if its speed depends entirely on how fast the processor can crunch numbers. That means a faster processor completes the task faster, unlike in the IO-bound case. E.g.: image processing, mathematical calculations, etc.

In an IO-bound task, processor speed isn't the bottleneck; in a CPU-bound task, it is.
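The distinction above can be sketched with two toy functions (the function names here are my own, purely for illustration):

```python
import time

def io_bound_task():
    # Simulates waiting on an external resource (network, disk, database).
    # The CPU sits idle during the wait; a faster processor would not help.
    time.sleep(0.1)
    return "response"

def cpu_bound_task(n):
    # Pure number crunching: completion time scales with processor speed.
    total = 0
    for i in range(n):
        total += i * i
    return total

print(io_bound_task())         # time dominated by the wait
print(cpu_bound_task(10_000))  # time dominated by CPU work
```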

  2. Threads v/s Processes

    • Threads

      A thread is a lightweight unit of execution that lives inside a process. Every process starts with at least one thread. Multiple threads in a process share the same memory, the same interpreter, and the same GIL.

    • Process

      A process is an independent running instance of a program. It has its own Python interpreter, its own memory space, and its own GIL. If we spawn a separate process for each task, we achieve true parallelism, because each task gets a GIL of its own. But this isn't always an efficient approach, and we'll see why later.
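A quick sketch of the "threads share the same memory" point: two threads writing into one list object, which both of them can see.

```python
import threading

shared = []  # a single list, visible to every thread in the process

def worker(tag):
    # Each thread appends to the SAME list object, because all threads
    # in one process share one memory space.
    for _ in range(3):
        shared.append(tag)

threads = [threading.Thread(target=worker, args=(t,)) for t in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared))  # 6: both threads wrote into the same list
```

With processes, each worker would instead get its own copy of `shared`, and the parent's list would stay empty.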

How does Python manage memory?

Python uses a technique called reference counting: each object has a reference count that tracks the number of references pointing to it. When the reference count reaches zero, the memory occupied by the object is freed.
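We can watch the reference count move with `sys.getrefcount` (note that the count it reports is one higher than you might expect, because the function's own argument is itself a temporary reference):

```python
import sys

obj = []                      # a fresh list; `obj` is a reference to it
print(sys.getrefcount(obj))   # includes one extra ref: the function argument

alias = obj                   # a second name now points to the same object
print(sys.getrefcount(obj))   # count goes up by one

del alias                     # dropping a reference decrements the count
print(sys.getrefcount(obj))   # back to the original count
```

When the last reference goes away (here, if we also deleted `obj`), the count hits zero and the list's memory is released.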

Python also uses garbage collection (for reference cycles) as a way to manage memory, but the problem the GIL addresses arises from reference counting. Why reference counting was preferred in the first place is a separate topic of discussion, which we won't go into in this article.

Problem that GIL solves

The reference count variable is susceptible to race conditions if two threads are allowed to read and write it simultaneously. This can leak memory, that is, memory which is never released even after the program ends. Or it can incorrectly release memory while a reference to that object still exists.

So the GIL is a single lock inside the interpreter (built on top of OS-level locking primitives) that a thread must acquire before executing Python bytecode. This way, only one thread at a time can modify reference counts. For this reason, even if we spawn multiple threads, only one thread can perform CPU-bound work at a time. During IO, however, the GIL is released, and in that case threads can effectively work in parallel.
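A classic way to see this (in the spirit of David Beazley's demos) is to time a CPU-bound function run sequentially versus split across two threads. On standard CPython the threaded version is typically no faster, and often slower, because the threads take turns holding the GIL:

```python
import threading
import time

def count_down(n):
    # Pure CPU-bound loop; never releases the GIL voluntarily.
    while n > 0:
        n -= 1

N = 2_000_000

# Sequential: one thread does all the work.
start = time.perf_counter()
count_down(N)
count_down(N)
sequential = time.perf_counter() - start

# Two threads: still interleaved, not parallel, under the GIL.
start = time.perf_counter()
t1 = threading.Thread(target=count_down, args=(N,))
t2 = threading.Thread(target=count_down, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

Exact numbers vary by machine and Python version, so the timings here are illustrative, not a benchmark.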

How threads acquire the GIL is a different story. In the old scheme (before Python 3.2), it was based on ticks: every 100 ticks, a check occurred to decide whether the running thread should give up the lock. The modern scheme is based on a timed wait: a thread runs until a second thread requests the lock; after making the request, the second thread waits for a fixed interval and then forces the first thread to drop the lock.
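That wait interval is exposed through `sys.getswitchinterval` and `sys.setswitchinterval`; the default is 5 milliseconds:

```python
import sys

# The switch interval (in seconds): how long a thread holding the GIL
# may keep running after another thread has requested it.
print(sys.getswitchinterval())  # 0.005 by default

# It is tunable, e.g. longer intervals mean fewer forced switches:
sys.setswitchinterval(0.01)
print(sys.getswitchinterval())

sys.setswitchinterval(0.005)    # restore the default
```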

We can achieve true parallelism in Python by spawning multiple processes instead of threads. But creating a process is an expensive operation, and even with parallelism it may not be faster, due to the overhead of launching a separate interpreter and memory space for each process. In some cases, a single-threaded program can be faster than a program using multiple processes.

What could have been an alternative solution to GIL?

  • One idea would be to give each data structure shared across threads a lock of its own, so that it can't be modified inconsistently. But this solution can result in a deadlock, a condition where each lock waits for another to be freed, in a loop. It can also reduce efficiency through the repeated acquisition and release of many locks.

  • Another viable approach is a priority scheme: penalize threads that use up their full time slice, and give a bonus to those that give up the CPU voluntarily (IO-bound tasks).
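The per-structure-lock idea from the first bullet can be sketched with `threading.Lock`. The `Account` class and `transfer` function are hypothetical; the point of interest is that acquiring the two locks in a globally fixed order (here, by `id`) is one classic way to dodge the deadlock that this scheme risks:

```python
import threading

class Account:
    # Each shared structure carries its own lock, as in the
    # per-object-lock alternative to the GIL described above.
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Always lock in a fixed global order so two concurrent transfers
    # in opposite directions cannot wait on each other forever.
    first, second = sorted((src, dst), key=id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

a, b = Account(100), Account(50)
threads = [
    threading.Thread(target=transfer, args=(a, b, 10)),
    threading.Thread(target=transfer, args=(b, a, 5)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(a.balance, b.balance)  # 95 55
```

Even when done correctly, every shared access now pays a lock acquire/release, which is the efficiency cost the bullet mentions.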

Why was GIL chosen?

The major reason was to provide thread safety when integrating C extensions. C extensions weren't thread-safe by default, but Python gave a way to make them reliable and thread-safe. These C extensions became one of the primary reasons for the adoption of Python by the dev community.

To learn more about the technicalities of the GIL, the reader can refer to this talk by David Beazley from PyCon 2010: https://www.youtube.com/watch?v=Obt-vMVdM8s

Current Status of GIL

Many attempts have been made to remove the GIL, one of the most famous being the Gilectomy.

As of Python 3.13, an experimental free-threaded build is available. It uses advanced memory management techniques such as the mimalloc allocator and biased reference counting. In spite of this, the GIL remains the default.

Unorganized points

  • In CPython, the reference interpreter, the GIL is exposed to C extensions through macro definitions, among them:

    • Py_BEGIN_ALLOW_THREADS: releases the GIL, letting other threads run (typically around blocking IO).

    • Py_END_ALLOW_THREADS: re-acquires the GIL before returning to Python code.