Gunicorn Application Preloading

As we saw before, running more Gunicorn worker processes multiplies your application's memory use. If memory becomes the constraining factor for your app, whether by causing out-of-memory errors, forcing you onto more expensive servers, or hurting performance (since small == fast), you might want to use --preload.


What is Preloading?

During application startup, the master Gunicorn process spawns worker processes. Normally, the worker processes each individually import your WSGI application, which loads it into memory. Since each process has its own memory, your application and dependencies are duplicated in the memory of each worker process. Using the preload setting makes this import happen in the master process, before worker processes are spawned.
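
You can pass --preload on the command line, as in the examples below, or set it in a configuration file. As a sketch (the values mirror the examples in this post; adjust them for your setup), the config-file equivalent is the preload_app setting:

# gunicorn.conf.py -- illustrative config; equivalent to the flags used below
bind = "127.0.0.1:8000"
workers = 5
worker_class = "sync"
preload_app = True   # import the WSGI app in the master before forking workers

$ gunicorn --config=gunicorn.conf.py wsgi:application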

What effect does preloading have on memory use?

I added some unused imports to our toy WSGI app to increase its memory footprint.

# wsgi.py
# import something to increase memory footprint
import matplotlib.pyplot as plt
import numpy as np

def application(environ, start_response):
    start_response("204 No Content", [])
    return [b""]
$ # start 5 workers
$ gunicorn --bind=127.0.0.1:8000 --workers=5 --worker-class=sync wsgi:application
[2021-01-21 13:21:49 -0500] [14415] [INFO] Starting gunicorn 20.0.4
[2021-01-21 13:21:49 -0500] [14415] [INFO] Listening at: http://127.0.0.1:8000 (14415)
[2021-01-21 13:21:49 -0500] [14415] [INFO] Using worker: sync
[2021-01-21 13:21:49 -0500] [14418] [INFO] Booting worker with pid: 14418
[2021-01-21 13:21:49 -0500] [14419] [INFO] Booting worker with pid: 14419
[2021-01-21 13:21:49 -0500] [14423] [INFO] Booting worker with pid: 14423
[2021-01-21 13:21:49 -0500] [14430] [INFO] Booting worker with pid: 14430
[2021-01-21 13:21:49 -0500] [14431] [INFO] Booting worker with pid: 14431

$ # in another shell
$ # warm up the workers
$ hey -n 30 -c 30 -t 0 http://127.0.0.1:8000
...
Status code distribution:
  [204] 30 responses

$ smem -t -k --processfilter="wsgi:application"
  PID User     Command                         Swap      USS      PSS      RSS
14415 joel     /home/joel/.virtualenvs/gun        0     9.7M    10.9M    22.6M
14431 joel     /home/joel/.virtualenvs/gun        0    34.3M    36.9M    54.8M
14418 joel     /home/joel/.virtualenvs/gun        0    34.3M    36.9M    54.8M
14419 joel     /home/joel/.virtualenvs/gun        0    34.3M    36.9M    54.8M
14423 joel     /home/joel/.virtualenvs/gun        0    34.3M    36.9M    54.8M
14430 joel     /home/joel/.virtualenvs/gun        0    34.3M    36.9M    54.8M

Each worker process uses ~37 MB of memory (again, PSS is the best number to measure this).

Now with preloading:

$ # use --preload
$ gunicorn --bind=127.0.0.1:8000 --workers=5 --worker-class=sync --preload wsgi:application
[2021-01-21 13:44:44 -0500] [14751] [INFO] Starting gunicorn 20.0.4
[2021-01-21 13:44:44 -0500] [14751] [INFO] Listening at: http://127.0.0.1:8000 (14751)
[2021-01-21 13:44:44 -0500] [14751] [INFO] Using worker: sync
[2021-01-21 13:44:44 -0500] [14758] [INFO] Booting worker with pid: 14758
[2021-01-21 13:44:44 -0500] [14759] [INFO] Booting worker with pid: 14759
[2021-01-21 13:44:45 -0500] [14760] [INFO] Booting worker with pid: 14760
[2021-01-21 13:44:45 -0500] [14761] [INFO] Booting worker with pid: 14761
[2021-01-21 13:44:45 -0500] [14762] [INFO] Booting worker with pid: 14762

$ # in another shell
$ # warm up the workers
$ hey -n 30 -c 30 -t 0 http://127.0.0.1:8000
...
Status code distribution:
  [204] 30 responses

$ smem -t -k --processfilter="wsgi:application"
  PID User     Command                         Swap      USS      PSS      RSS
14762 joel     /home/joel/.virtualenvs/gun        0     4.0M    10.1M    45.0M
14759 joel     /home/joel/.virtualenvs/gun        0     4.1M    10.2M    45.0M
14760 joel     /home/joel/.virtualenvs/gun        0     4.1M    10.2M    45.0M
14758 joel     /home/joel/.virtualenvs/gun        0     4.1M    10.2M    45.0M
14761 joel     /home/joel/.virtualenvs/gun        0     4.1M    10.2M    45.0M
14751 joel     /home/joel/.virtualenvs/gun        0    11.2M    17.7M    56.6M

Now each worker process uses ~10 MB of memory, and total memory use (summing the PSS column) drops from about 195 MB to about 69 MB.
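
As a quick check on those totals, just sum the PSS column from the smem outputs above:

# totals are the sum of the PSS column from smem, in MB
no_preload = 10.9 + 5 * 36.9           # master + 5 workers ≈ 195 MB
preload    = 17.7 + 10.1 + 4 * 10.2    # master + 5 workers ≈ 69 MB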

How does preloading reduce memory use?

To understand how preloading reduces memory use, we have to learn more about the operating system.

(I would link to the wonderful site where I originally learned this, but I can't find it. If you find a great visualization of fork and copy-on-write, let me know!)

Operating systems give us two primitives for running programs: fork, which creates a new process, and exec, which replaces the current process's program with a new one (launching a program is typically a fork followed by an exec). Forking immediately creates a child process that is nearly identical to the parent process. Process internals, including the contents of memory, are the same between the two processes. After calling fork you need to do a little detective work to find out whether you're running in the child process or the parent: fork returns the child's PID in the parent and 0 in the child. From the Gunicorn project (comments are mine):

# gunicorn/arbiter.py
def spawn_worker(self):
    # ...

    pid = os.fork()
    if pid != 0:
        # we're in the parent process, keep track of the new worker
        worker.pid = pid
        self.WORKERS[pid] = worker
        return pid

    # this line and below are run only by the worker process
    # ...
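
If you haven't used fork directly, here is a minimal sketch of the same pattern using only the standard library (not Gunicorn code, just an illustration): the return value of os.fork() is the only thing that tells each process which side of the fork it is on.

# fork_demo.py -- minimal illustration of os.fork(), not part of Gunicorn
import os

pid = os.fork()
if pid != 0:
    # parent: fork returned the child's PID
    print(f"parent {os.getpid()}: spawned child {pid}")
    os.waitpid(pid, 0)  # wait for the child so it doesn't linger as a zombie
else:
    # child: fork returned 0; its memory starts as a copy-on-write view of the parent's
    print(f"child {os.getpid()}: my parent is {os.getppid()}")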

The operating system wants to avoid needlessly copying the parent process's memory to a new location for the child process to use. In many cases the child process will never need to mutate certain values in memory, so keeping duplicate copies would waste space. One example is the memory used by imported Python dependencies: most likely these libraries will sit in memory and be used by your application without ever changing their internal state.

Copy-on-write is a technique where the child process will continue to share the same physical memory locations with the parent process until it wants to overwrite those memory values. Only then will the child get its own copy of that page in memory.

When your application code and dependencies are preloaded before worker processes are spawned, they can share many memory pages because of copy-on-write. You can see this in the sharp decrease in USS memory when using --preload. USS measures the memory unique to a process; preload reduces this by allowing processes to share more memory.
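
You can watch copy-on-write happen with a rough sketch like the one below. It assumes psutil is installed (it isn't used elsewhere in this post) and a Linux machine, where memory_full_info() reports USS. While the child only reads the buffer it inherited, its USS stays small; once it starts writing, the kernel copies the touched pages and USS jumps by roughly the size of the buffer.

# cow_demo.py -- rough sketch of copy-on-write; assumes `pip install psutil` and Linux
import os

import psutil


def uss_mb():
    # USS: memory unique to this process, not shared with any other process
    return psutil.Process().memory_full_info().uss / 1024 / 1024


# ~80 MB buffer, allocated in the parent before forking
data = bytearray(80 * 1024 * 1024)

pid = os.fork()
if pid == 0:
    # child: reading one byte per 4 KiB page leaves the pages shared with the parent
    _ = sum(data[::4096])
    print(f"child USS after reading: {uss_mb():5.1f} MB")

    # child: writing one byte per page forces the kernel to copy each page
    for i in range(0, len(data), 4096):
        data[i] = 1
    print(f"child USS after writing: {uss_mb():5.1f} MB")
    os._exit(0)

os.waitpid(pid, 0)

A bytearray is used here because reading its bytes doesn't modify any Python object state; in CPython, even reading ordinary objects updates their reference counts, which dirties pages and triggers some copying on its own.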

When not to preload

You cannot combine Gunicorn's preload and reload settings. --reload is a convenience for development that automatically replaces the worker processes when your application code changes. When your code is preloaded into the master process, however, replacement workers are forked from that master and so still carry the old code in memory.

Similarly, you cannot reload your code in production "on the fly" if it is preloaded:

$ gunicorn --bind=127.0.0.1:8000 --workers=5 --worker-class=sync wsgi:application
[2021-01-21 14:37:36 -0500] [16505] [INFO] Starting gunicorn 20.0.4
[2021-01-21 14:37:36 -0500] [16505] [INFO] Listening at: http://127.0.0.1:8000 (16505)
[2021-01-21 14:37:36 -0500] [16505] [INFO] Using worker: sync
[2021-01-21 14:37:36 -0500] [16508] [INFO] Booting worker with pid: 16508
[2021-01-21 14:37:36 -0500] [16512] [INFO] Booting worker with pid: 16512
[2021-01-21 14:37:36 -0500] [16513] [INFO] Booting worker with pid: 16513
[2021-01-21 14:37:36 -0500] [16517] [INFO] Booting worker with pid: 16517
[2021-01-21 14:37:36 -0500] [16518] [INFO] Booting worker with pid: 16518

$ # in a separate terminal
$ hey -n 10 -c 10 -t 0 http://127.0.0.1:8000
$ # we don't see any console writes in the gunicorn output because
$ # there are no print statements in our handler

$ # add a print("request received") to our app and save the file
$ vim wsgi.py

$ hey -n 10 -c 10 -t 0 http://127.0.0.1:8000
$ # we still don't see any console writes because the old
$ # worker processes are still running

$ # find the PID of the gunicorn master process, here it's 16505
$ kill -SIGHUP 16505

$ # we see the workers reload in the gunicorn output
[2021-01-21 14:38:08 -0500] [16505] [INFO] Handling signal: hup
[2021-01-21 14:38:08 -0500] [16505] [INFO] Hang up: Master
[2021-01-21 14:38:09 -0500] [16561] [INFO] Booting worker with pid: 16561
[2021-01-21 14:38:09 -0500] [16562] [INFO] Booting worker with pid: 16562
[2021-01-21 14:38:09 -0500] [16563] [INFO] Booting worker with pid: 16563
[2021-01-21 14:38:09 -0500] [16513] [INFO] Worker exiting (pid: 16513)
[2021-01-21 14:38:09 -0500] [16512] [INFO] Worker exiting (pid: 16512)
[2021-01-21 14:38:09 -0500] [16508] [INFO] Worker exiting (pid: 16508)
[2021-01-21 14:38:09 -0500] [16570] [INFO] Booting worker with pid: 16570
[2021-01-21 14:38:09 -0500] [16518] [INFO] Worker exiting (pid: 16518)
[2021-01-21 14:38:09 -0500] [16517] [INFO] Worker exiting (pid: 16517)
[2021-01-21 14:38:09 -0500] [16571] [INFO] Booting worker with pid: 16571


$ hey -n 10 -c 10 -t 0 http://127.0.0.1:8000
request received
request received
request received
request received
request received
request received
request received
request received
request received
request received

If we use --preload, the new workers don't pick up the changes or print the message: they are forked from the master process, which still has the old code in memory.

Note that this doesn't always matter: if your app runs on a PaaS like Google App Engine, new versions are released by deploying the update to new machines and routing traffic to the new deployment. In that environment, you don't need on-the-fly reloading anyway.

If you aren't going to make use of on-the-fly reloading, consider preloading your application code to reduce its memory footprint.