
Post Mortem of a Sneaky Bug Whilst Using Python Multiprocessing
Tyler Neylon
35.2023


While working on a prototype machine learning system in Python (v3.10.6), I hit a concurrency deadlock that seemed at first to be impossible. This is the story of that bug.

1 The Architecture

The problem I’m solving involves iterating over more data than can fit into memory, doing some non-trivial processing of each data item — call this step 1 — and then integrating those outputs together in a non-trivial way — call this step 2. Both memory and cpu cycles are potential bottlenecks.

I could shape the system in different ways, but I’m trying out an architecture where a pool of worker processes does step 1, which is cpu-bound, and then uses a shared queue to send its data to another process to do step 2, integrating the output together. This is inspired by the actor model of handling concurrency: Basically, there is no shared memory outside of message passing, which is done by process-safe queues.

Here’s the process and data flow layout: a pool of worker processes handles step 1, and each worker sends its output through a shared queue to a single writer process that handles step 2.

Here’s some pseudocode to capture this architecture:

  import multiprocessing as mp
  Process = mp.Process

  # Load the data.
  data_list = read_input()

  # Make the writer process/queue.
  writer_q = mp.Queue()
  def writer_f():
      global writer_q
      # Process each message in the
      # queue, stopping when we get
      # a message == 'STOP'.
      for msg in iter(writer_q.get, 'STOP'):
          write_msg_into_data(msg)
  writer_proc = Process(target=writer_f)
  writer_proc.start()

  # Process the data.
  with mp.Pool(num_cores) as p:
      # The process_an_input() fn
      # processes an input datum
      # and put()s the output to
      # writer_q.
      p.map(process_an_input, data_list)

  # Before we use the result,
  # ensure the queues are done.
  writer_q.put('STOP')
  writer_proc.join()

That’s pretty close to the actual code I was running, leaving out extraneous code.

Hint 1: If you deeply grok Python’s multiprocessing module, then the above is enough information for you to see the bug.

Hint 2: Otherwise, I suspect you can’t figure it out by looking at that code. It is, in my opinion, not a simple typo or silly mistake. For example, sharing writer_q as a global is legitimate.

2 What Wasn’t Working

Sometimes the code would work as intended, and other times — often — it froze on the last line:

  writer_proc.join()

3 Investigating the Problem

I tried many things before fully understanding what was broken. I’ll start by listing the truly helpful steps I took, then explaining the source of the problem, and finally — since I think this is also helpful to other coders — I’ll list other steps I took that turned out to be unnecessary, but might be helpful in other situations.

3.1 Helpful Steps

  # In your shell, print out locations
  # where python finds library imports:

  $ python3 -c 'import sys; print("\n".join(sys.path))'

(prints out a list of directories)

  # One of those dirs, probably near the
  # top of the list, contains your
  # standard library files.

  # In Python's REPL, see the location
  # actually used for a particular
  # module import:

  $ python3 -v

(lots of output you can ignore)

  >>> import multiprocessing

  # /usr/lib/python3.10/multiprocessing/
    __pycache__/__init__.cpython-310.pyc
    matches /usr/lib/python3.10/
    multiprocessing/__init__.py
  …

  # The example above tells us my
  # standard library is (mostly) at
  # /usr/lib/python3.10/MODULE_NAME.

  # Turn on multiprocessing's built-in
  # debug logging (requires import logging):

  logger = mp.log_to_stderr()
  logger.setLevel(logging.DEBUG)

  # In synchronize.py, edit the
  # _make_methods() method of class Lock
  # like so:

  def _make_methods(self):

      def dbg_acquire(
              blocking=True,
              timeout=None):
          res = self._semlock.acquire(blocking, timeout)
          if res:
              print('Lock acquired')
          return res

      self.acquire = dbg_acquire

      def dbg_release():
          self._semlock.release()
          print('Lock released')

      self.release = dbg_release

  # In context.py, you can also add a
  # factory for a separate debug lock
  # class (a copy of Lock, here called
  # Lock2, defined in synchronize.py),
  # mirroring the existing Lock() method:

  def Lock2(self):
      '''Return a non-recursive lock.'''
      from .synchronize import Lock2
      ctx = self.get_context()
      return Lock2(ctx=ctx)

Print out the stack trace of any thread, such as the current thread:

  import sys, threading, traceback

  frames = sys._current_frames()
  thread = threading.current_thread()
  frame  = frames[thread.ident]
  trace  = traceback.format_stack(frame)
  print(''.join(trace))

The above idiom can be extended to print out stack traces for other threads; in fact, the main use case for sys._current_frames() is to enable you to do so even in a deadlock situation. Here’s a quote from the sys docs for _current_frames():

“This is most useful for debugging deadlock: this function does not require the deadlocked threads’ cooperation, and such threads’ call stacks are frozen for as long as they remain deadlocked. The frame returned for a non-deadlocked thread may bear no relationship to that thread’s current activity by the time calling code examines the frame.”

For the record, here’s a simpler way to print out the current stack trace:

  traceback.print_stack()

Print out a list of all threads in the current process, and their stacks:

  frames = sys._current_frames()
  for thread in threading.enumerate():
      print(thread.name)
      if thread.ident in frames:
          frame = frames[thread.ident]
          traceback.print_stack(frame)

Print out the name / identifier / pid for the current process and thread.

In this example, I’ll also show one way to package up multiple strings into a single print. This is useful to reduce the likelihood of prints overlapping each other across different processes.

  frames = sys._current_frames()
  thread = threading.current_thread()
  frame  = frames[thread.ident]
  trace  = traceback.format_stack(frame)
  stack  = ''.join(trace)
  # inspect.cleandoc() trims some
  # whitespace for you.
  msg = inspect.cleandoc(f'''
      Lock acquired
      proc {process.current_process()}
      pid {os.getpid()}
      thrd {threading.current_thread()}
      Stack:
      {stack}
  ''')
  print(msg, flush=True)

  # I call process.current_process()
  # above since I'm within the
  # multiprocessing source; outside of
  # it, you'd instead call
  # multiprocessing.current_process().

It’s useful to include flush=True because when you redirect to a file, flushing is not automatic. Without the flush, you might end up seeing a side effect that happened after a debug print line, but never seeing that debug print in your output.

This example is from queues.py, in the _feed() method, which I’ll explain further below.

I replaced the single line:

  wacquire()

with this:

  # Try to lock, timeout after 2 sec.
  if not wacquire(True, 2):
      # Debug prints.
      print('I appear to be stuck.')
      print('Threads:')
      frames = sys._current_frames()
      for th in threading.enumerate():
          print(th.name)
          if th.ident in frames:
              fr = frames[th.ident]
              traceback.print_stack(fr)
      wacquire()

  # Run the program until it freezes; the
  # debug prints from the steps above show
  # which pids hold the lock.

  $ python3 my_program.py

  # Then, in another shell, list the living
  # Python pids. If a dead (missing) pid
  # holds the lock, that's the deadlock.

  $ pgrep python

3.2 The Source of the Problem

Internally, Python’s multiprocessing queues rely on OS-level pipes for communication. A pipe is a FIFO (first-in, first-out) buffer of bytes. When two processes have the same pipe open, either one can write bytes into it or read bytes out of it. It’s already similar to a queue, except that:

  1. a pipe has a fixed capacity, so writing into a full pipe blocks until a reader drains it; and
  2. a pipe carries raw bytes, so two processes writing at the same time can interleave their bytes and mangle the messages.
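As a quick concrete picture of that FIFO behavior, here’s a minimal sketch using os.pipe() directly:

  import os

  # os.pipe() returns a pair of file
  # descriptors: (read_end, write_end).
  r, w = os.pipe()
  os.write(w, b'hello ')
  os.write(w, b'world')
  # The bytes come back out in exactly
  # the order they were written.
  print(os.read(r, 1024))  # b'hello world'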

In particular, multiprocessing.Queue handles the capacity limit by placing a process-local buffer in front of the pipe. Each process that put()s into a Queue gets its own background thread, called the feeder thread, which is started lazily on the first put() in that process. When you call Queue.put(), your message goes into the process-local buffer before hitting the pipe; it works this way so that put() calls never block, even if the pipe is full. It’s the feeder thread’s job to asynchronously move data out of the process-local buffer and into the pipe as the pipe has capacity.
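Here’s a minimal sketch that makes the feeder thread visible; on CPython 3.10 it shows up under the name QueueFeederThread once the first put() happens in a process:

  import multiprocessing as mp
  import threading
  import time

  q = mp.Queue()
  # Typically just ['MainThread'] so far.
  print([t.name for t in threading.enumerate()])
  q.put('hello')   # returns immediately, into the local buffer
  time.sleep(0.1)  # give the feeder thread a moment to start
  # Now the list includes 'QueueFeederThread'.
  print([t.name for t in threading.enumerate()])
  print(q.get())   # 'hello'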

The feeder thread is smart enough to use a write lock on the pipe. This way, no two messages will be interwoven and thus mangled if two processes execute a put() at the same time.

The Queue implementation is also smart enough to flush the process-local buffer into the pipe when a process shuts down normally. Essentially, when the process-local copy of a Queue is garbage collected, it waits for the feeder thread to empty the process-local buffer. This ensures that (a) no data is lost, and (b) the locks don’t end up in a broken state.
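For reference, a Queue will also do that flush on request; its close() and join_thread() methods are the explicit version of what a normal shutdown does implicitly. A minimal sketch:

  import multiprocessing as mp

  q = mp.Queue()
  q.put('some message')
  # No more put()s from this process; the
  # feeder thread flushes the local buffer
  # into the pipe and then exits.
  q.close()
  # Block until that flush has finished.
  q.join_thread()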

So far, so good, right? What can go wrong?

Well, the problem is in the combination of:

  1. A Queue that depends on clean process shutdown, along with
  2. a Pool that is managed by a with statement.

A multiprocessing.Pool object is normally quite friendly to its workers. However, it turns out that when you manage the lifetime of a pool via the with statement, the workers will be terminated with prejudice. That is, killed without a chance to clean themselves up.
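You can see this directly in the standard library; in CPython 3.10, the relevant part of multiprocessing/pool.py boils down to roughly this:

  # In class Pool: leaving the with block
  # calls terminate(), not close()/join().
  def __exit__(self, exc_type, exc_val, exc_tb):
      self.terminate()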

So a worker can acquire _wlock (the write lock that guards the queue’s underlying pipe), be unceremoniously killed, and then the next thread that wants to acquire _wlock is stuck. In my case, the feeder thread is stuck when I call writer_q.put('STOP'). The 'STOP' message never gets through, but I don’t see any bad behavior until I try to join the writer process, which never completes as it’s waiting forlornly for its long-lost sentinel signal (the 'STOP' message).

The timeline of events leading to the deadlock looks like this:

  1. A worker process put()s a message on writer_q, and that worker’s feeder thread acquires _wlock so it can write the message into the pipe.
  2. The with block around the Pool exits and terminates the worker while it still holds _wlock, so the lock is never released.
  3. The main process calls writer_q.put('STOP'); its own feeder thread blocks forever waiting to acquire _wlock.
  4. The 'STOP' message never reaches the writer process, so writer_proc.join() never returns.

That’s the whole bug. I’ll explain in a minute how I decided to fix it.

3.3 Unnecessary but Interesting Steps

Before explaining the fix, I wanted to give some tips for building cpython from source. Why? Well, at one point I thought I’d have to debug at the C level, so I took that red pill temporarily. It turns out, I didn’t really need to do that, but it was easier than I had expected. And kind of fun.

Here are some tips if you’d like to build cpython from source on Ubuntu, which is the platform I’ve been using:

Build from source.


Find the version you want and download the gzipped tarball from cpython’s github tags page.

  # For example, this would work in a
  # dedicated directory:

  $ wget https://github.com/python/cpython/archive/refs/tags/v3.10.6.tar.gz
  $ gunzip *.gz
  $ tar xvf *
  $ cd cpython-3.10.6
  $ ./configure
  $ make

That’s actually enough already to build from source. To run your freshly baked interpreter, do this:

  $ ./python

Remind yourself you’re using an edited version of Python.


I highly recommend editing Programs/python.c by inserting the following line at the start of main():

  printf("narf\n");

This will ensure that you are constantly reminded that you’re running your own personal narfed version of cpython every time you run. Please be sure to spell “narf” correctly; otherwise you may be confused by a clearly misspelled nonsense word.

Get the ffi and openssl libraries linked in.


  # For some reason, my run of
  # ./configure did not find these
  # on my box. I fixed this like so:

  $ sudo apt-get install libffi-dev
  $ sudo apt-get install libssl-dev
  $ ./configure
  $ make

Get pip up and running.


I didn’t get pip with my wget from github. Go figure. But it’s not too hard to get:

  $ ./python -m ensurepip

From there you can run:

  $ ./python -m pip install whatever

Have fun!

4 Fixing the Problem

4.1 Non-Solutions

Cleanup Messages

At first, I tried to modify my mapper function, which runs on the workers, so that it would recognize a special message to flush the feeder thread. If all workers flushed their feeder exhaust ports before expiring, then none of the workers would try to acquire _wlock when the pool decided to become the terminator.

However, this is not a great idea. Why not? Because the pool does not distribute tasks in a round-robin fashion. And it shouldn’t, really, because you basically want to keep all cores busy for as long as possible, so it makes sense to let workers take tasks on a first-come, first-served basis. When I implemented this, some workers got two cleanup messages, and some got none.
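Here’s a minimal experiment along those lines: it prints which pool worker handled each task, and the uneven task durations make the imbalance easy to see (the exact split will vary from run to run):

  import multiprocessing as mp
  import time

  def which_worker(i):
      # Uneven durations, so fast workers
      # grab more than their "fair share".
      time.sleep(0.01 * (i % 3))
      return (i, mp.current_process().name)

  if __name__ == '__main__':
      with mp.Pool(4) as p:
          results = p.map(which_worker, range(12))
          p.close()
          p.join()
      for i, name in results:
          print(i, name)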

Other Queue Types

I also thought about using one of the other queue types provided by multiprocessing: SimpleQueue, which has no buffer; or JoinableQueue, which supports feedback to a message producer that messages have been completely processed (not just received).

Both of these are non-starters because they don’t solve the fundamental problem: As soon as you rely on a lock, you can’t dance with processes that are terminated. And all of these queue types have at least one lock.

4.2 A Two-Line Fix

The fix is almost painfully simple:

Instead of this:

  with mp.Pool(num_cores) as p:
      p.map(process_an_input, data_list)

do this:

  with mp.Pool(num_cores) as p:
      p.map(process_an_input, data_list)
      p.close()
      p.join()

Calling p.join() gives all workers the time to get their affairs in order before their demise. We only call p.close() first because the class requires this before we can join().

Another alternative is to simply avoid using a with clause. However, if there’s a reasonable way for me to use with clauses, I prefer them because it’s an unmissable cue to future developers that there is some context here which we probably want to ensure is cleaned up.

A subtle consequence of this design is that, if an exception is raised by the p.map() call, then the workers may still be terminated. In my case, exceptions mean we should stop the program; but it’s good to keep in mind this possible issue for other applications.
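If an application does want the workers to shut down cleanly even when p.map() raises, one possible variation (reusing the names from the pseudocode above, and assuming the remaining tasks can still finish so that join() returns) is to do the cleanup in a finally block:

  with mp.Pool(num_cores) as p:
      try:
          p.map(process_an_input, data_list)
      finally:
          # Runs even if map() raises, so the
          # workers still get a chance to exit
          # cleanly before __exit__ terminates
          # the pool.
          p.close()
          p.join()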

5 Hints in the Docs

To what degree was this bug my fault? First, to give some credit to the developers of multiprocessing, there is a relevant warning in the docs, which I’ll include below. Second, to be fair to myself, I don’t see why terminating the workers is the right behavior for Pool.__exit__(); it’s surprising behavior that seems fundamentally incompatible with the design of the rest of the module. It seems more natural for the end of a with clause to close and join the pool. If terminate() is necessary, it can be called after joining.

Here’s the warning from the docs for multiprocessing:

Avoid terminating processes

Using the Process.terminate method to stop a process is liable to cause any shared resources (such as locks, semaphores, pipes and queues) currently being used by the process to become broken or unavailable to other processes.

Therefore it is probably best to only consider using Process.terminate on processes which never use any shared resources.

6 How can I help other developers?

I don’t usually write up post mortems like this — otherwise I wouldn’t get much done! But this was a fun bug, and I learned enough while debugging that I wanted to take notes and share these with other coder friends.

The notes above cover tactical tips, like flushing your prints immediately to promote accurate logs, but I also have a small collection of higher-level strategic considerations.

Concurrency debugging suggestions from myself and friends:

