Profiling gevent in Python
Inspecting greenlets and tracing coroutines for optimal performance.
The effects of greenlet contention can be difficult to measure and investigate. It sporadically affects unrelated tasks that just happen to be waiting in the queue and can have an outsized effect on response latency, causing spikes in the high percentiles.
At Affirm, we learned the importance of instrumenting greenlets in order to observe their switching behaviors under different combinations of workloads. This article briefly describes how we use coroutines, some of the problems we encountered, and how we solved them.
Using gevent — a greenlet framework
Python has a Global Interpreter Lock, which means threads run concurrently but not in parallel. The introduction of Stackless and greenlets gave developers granular control via coroutine switching, offering performance gains on some types of workloads. When using greenlets, one can control exactly when the code runs, taking and releasing control of the main thread deliberately, without implicit scheduling.
gevent is a networking library built on top of greenlets. It is particularly useful for I/O-bound workloads, such as web services that issue multiple SQL queries for each HTTP request. In these workloads, the thread spends much of request latency waiting for database results and RPC responses. gevent monkey-patches library modules, such as socket, psycopg2 and python-memcached, so that greenlets yield whenever they are blocked on network I/O. This releases the main thread to other greenlets, and keeps the main thread productive.
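As a concrete sketch of cooperative blocking (hedged: the sleep durations and greenlet names are invented for illustration), the following shows how, after `patch_all()`, three blocking sleeps overlap instead of running back to back:

```python
# patch_all() must run before other modules are imported, so it goes
# at the very top of the process's entry point.
from gevent import monkey
monkey.patch_all()

import time
import gevent

def worker(name, delay):
    # After patch_all(), time.sleep is gevent's cooperative sleep: the
    # greenlet yields the main thread instead of blocking it.
    time.sleep(delay)
    return name

start = time.time()
jobs = [gevent.spawn(worker, n, 0.2) for n in ("a", "b", "c")]
gevent.joinall(jobs)
elapsed = time.time() - start
# The three 0.2 s sleeps overlap, so total elapsed time is ~0.2 s, not ~0.6 s.
print([j.value for j in jobs], round(elapsed, 1))
```

Because each sleep yields to the hub, the total elapsed time stays close to a single sleep's duration.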
However, monkey-patching has its flaws. Applying patches too late in a process’s lifecycle often renders them ineffective. Furthermore, gevent is suboptimal for CPU-intensive computations. A CPU-intensive greenlet can hold up the main thread for a long time, starving other tasks. We call this greenlet contention.
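To make the contention concrete, here is a toy, hedged example (the durations are invented): a pure-CPU greenlet never reaches a monkey-patched blocking call, so a concurrently scheduled ticker stalls until the computation finishes:

```python
from gevent import monkey
monkey.patch_all()

import time
import gevent

timestamps = []

def ticker():
    # Wakes every 50 ms; under contention the gaps stretch far beyond that.
    for _ in range(3):
        timestamps.append(time.monotonic())
        gevent.sleep(0.05)

def cpu_hog():
    # Pure computation never hits a monkey-patched blocking call, so this
    # greenlet holds the main thread until the loop finishes.
    deadline = time.monotonic() + 0.3
    while time.monotonic() < deadline:
        pass

gevent.joinall([gevent.spawn(ticker), gevent.spawn(cpu_hog)])
gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
print("max gap between ticks: %.3fs" % max(gaps))
```

The largest gap between ticks approaches the hog's full 300 ms run, even though the ticker only asked to sleep for 50 ms.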
Observing Contention
To gain more visibility into greenlet switching, we adapted a tracer from a patch submitted to Mozilla. When our celery workers or gunicorn workers initialize, they register a callback on greenlet switches using `settrace`.
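The `settrace` callback looks roughly like the sketch below. The original snippet isn't reproduced in this copy, so this is a hedged reconstruction: the 100 ms threshold, the module-level bookkeeping, and print-based logging are illustrative choices, not our actual code.

```python
import time
import greenlet

SWITCH_THRESHOLD = 0.1  # illustrative: flag holds of the thread over 100 ms

_last = {"wall": None, "cpu": None}
slow_switches = []  # (wall_seconds, cpu_seconds) pairs over the threshold

def trace(event, args):
    if event not in ("switch", "throw"):
        return
    # "throw" fires when an exception causes a greenlet to exit.
    origin, target = args
    # time.process_time is the Python 3 replacement for time.clock.
    wall_now, cpu_now = time.time(), time.process_time()
    if _last["wall"] is not None:
        wall = wall_now - _last["wall"]
        cpu = cpu_now - _last["cpu"]
        if wall > SWITCH_THRESHOLD:
            slow_switches.append((wall, cpu))
            print("held main thread for %.3fs wall, %.3fs CPU" % (wall, cpu))
    _last["wall"], _last["cpu"] = wall_now, cpu_now

greenlet.settrace(trace)

# A CPU-bound greenlet trips the threshold on the switch back out of it.
def busy():
    deadline = time.time() + 0.15
    while time.time() < deadline:
        pass

greenlet.greenlet(busy).switch()
```

Switching into the CPU-bound greenlet only records a baseline (it is the first event); the switch back out reports roughly 150 ms of both wall and CPU time.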
Notice that we are logging switch and throw events; the latter happens when an exception causes a greenlet to exit. This example calculates the time between callbacks in wall clock time and in CPU ticks. If the duration between switches exceeds some threshold, we log it.
From this data, we can determine:
- number of switches per interval that exceed the configured threshold;
- duration between switches, which can be an indicator of CPU-intensive workloads; and
- difference in duration between switches in CPU ticks vs wall clock time, which can be a symptom of insufficient monkey-patch coverage.
Aside: Measuring Time
This example implementation uses both `time.clock` and `time.time` to measure duration. If the measurement by `time.clock` diverges from that by `time.time`, it could indicate that some greenlet was blocked on I/O but did not yield the main thread to other greenlets. `time.clock` also does not include time elapsed during sleep.
You can observe the difference between the two by running the script below with `netcat`:
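The `sleep.py` script itself did not survive into this copy, so the sketch below is a hedged reconstruction. To stay runnable on its own, it falls back to an in-process silent listener when `netcat` isn't running, and the 3-second timeout that bounds the `recv` is an assumption:

```python
import socket
import time

# time.clock was removed in Python 3.8; time.process_time is its
# replacement and likewise counts CPU time only.
cpu_clock = getattr(time, "clock", time.process_time)

# Connect to the netcat listener from the shell snippet; if none is
# running, fall back to an in-process stand-in that accepts the
# connection but never sends any data.
server = None
try:
    s = socket.create_connection(("127.0.0.1", 8080), timeout=1)
except OSError:
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    s = socket.create_connection(server.getsockname())

clock_start = cpu_clock()
time_start = time.time()

s.settimeout(3)  # assumption: bound the wait instead of blocking forever
try:
    s.recv(1024)  # blocks on network I/O; no data ever arrives
except socket.timeout:
    pass

cpu_elapsed = cpu_clock() - clock_start
wall_elapsed = time.time() - time_start

s.close()
if server is not None:
    server.close()

print("time.clock: %f" % cpu_elapsed)
print("time.time: %f" % wall_elapsed)
```

On Python 2 the CPU-time figure comes from `time.clock`, as in the output below; on modern Python 3 the same measurement comes from `time.process_time`.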
```shell
$ nc -l 127.0.0.1 8080   # in a separate shell
$ python sleep.py
time.clock: 0.000156
time.time: 2.973633
```
By simulating a long I/O wait with `netcat`, we can observe that `time.clock` did not account for the duration spent waiting for data at `s.recv`. In Python 3, `time.clock` has been replaced by `time.process_time`.
The insights from this instrumentation data have informed our optimization efforts. It was immediately obvious that some celery tasks (e.g. credit report parsing, underwriting) were CPU hogs better suited for multiprocessing than cooperative multitasking. Later, it helped us tune the number of greenlets per process, make capacity estimates, and refactor branches that spent more time on the CPU than expected. We are also exploring other options, such as using RPC to offload compute-intensive work to other hosts, or to sidecar processes on the same host.
There is still more work to be done. For instance, the tracer currently shows the stack frames that yielded the main thread, but not the function calls that were hogging it. That feature would let us narrow down the CPU-bound functions and refactor them. We see more opportunity to further optimize our web services by using such tools to improve observability. If this challenge interests you, we are actively hiring for this team.