Profiling gevent in Python

James Lim
Published in Affirm Tech Blog · 4 min read · Dec 14, 2018


Inspecting greenlets and tracing coroutines for optimal performance.

The effects of greenlet contention can be difficult to measure and investigate. It sporadically affects unrelated tasks that just happen to be waiting in the queue and can have an outsized effect on response latency, causing spikes in the high percentiles.

At Affirm, we learned the importance of instrumenting greenlets in order to observe their switching behaviors under different combinations of workloads. This article briefly describes how we use coroutines, some of the problems we encountered, and how we solved them.

Using gevent — a greenlet framework

Python has a Global Interpreter Lock, which means threads run concurrently but not in parallel. The introduction of Stackless Python and greenlets gave developers granular control via coroutine switching, offering performance gains on some types of workloads. When using greenlets, one controls exactly when code runs, taking and releasing the main thread deliberately, without implicit scheduling.
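As a minimal illustration (this is not code from the original post, and the task names are made up), raw greenlets hand control back and forth only when explicitly asked to:

```python
# A minimal sketch of explicit coroutine switching with the greenlet package.
from greenlet import greenlet

def task_a():
    print("task_a: step 1")
    gr_b.switch()        # deliberately hand the main thread to task_b
    print("task_a: step 2")

def task_b():
    print("task_b: step 1")
    gr_a.switch()        # deliberately hand control back to task_a

gr_a = greenlet(task_a)
gr_b = greenlet(task_b)
gr_a.switch()            # nothing runs until we explicitly switch in
```

Nothing is scheduled implicitly: each `switch()` call is the only point at which control moves between the two tasks.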

These matches are concurrent but not parallel. (Simultaneous chess exhibit vs. Judit Polgár, 1992. Photo by Ed Yourdon, CC BY-NC-SA 2.0.)

gevent is a networking library built on top of greenlets. It is particularly useful for I/O-bound workloads, such as web services that issue multiple SQL queries for each HTTP request. In these workloads, the thread spends much of the request latency waiting for database results and RPC responses. gevent monkey-patches library modules, such as socket, psycopg2 and python-memcached, so that greenlets yield whenever they block on network I/O. This yields the main thread to other greenlets and keeps it productive.
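For example (a sketch, not the original post's code; the URL and timeout below are placeholders), an I/O-bound fan-out with gevent might look like this:

```python
# Patch the standard library before anything else imports socket and friends.
from gevent import monkey
monkey.patch_all()

import gevent
import urllib.request   # now backed by the patched, cooperative socket module

def fetch(url):
    # The greenlet yields to the hub whenever this call blocks on network I/O.
    return urllib.request.urlopen(url).read()

urls = ["http://example.com"] * 3          # placeholder endpoints
jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs, timeout=10)
print([len(job.value or b"") for job in jobs])
```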

However, monkey-patching has its flaws. Applying patches too late in a process’s lifecycle often renders them ineffective. Furthermore, gevent is suboptimal for CPU-intensive computations. A CPU-intensive greenlet can hold up the main thread for a long time, starving other tasks. We call this greenlet contention.

Comparing late_patch.py and eager_patch.py, notice how the former failed to replace `time.sleep`.
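The embedded gist is not reproduced here; a rough reconstruction of the two scripts conveys the point (the file contents below are an assumption, not the original code):

```python
# eager_patch.py -- patch before binding any names, so the cooperative
# implementation is the one everybody ends up calling.
from gevent import monkey
monkey.patch_all()

import time
from time import sleep
print(sleep is time.sleep)   # True: both names refer to the patched sleep
```

```python
# late_patch.py -- `sleep` is bound before the patch is applied, so this
# module keeps calling the original, blocking implementation.
from time import sleep       # captured too early

from gevent import monkey
monkey.patch_all()

import time
print(sleep is time.sleep)   # False: the patch came too late for this name
```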

Observing Contention

To gain more visibility into greenlet switching, we adapted a tracer from a patch submitted to Mozilla. When our celery or gunicorn workers initialize, they register a callback on greenlet switches using `settrace`.

The settrace callback looks like:

Using greenlet.settrace to listen on switches and throws.
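The original gist is not embedded here; a simplified sketch of such a callback (the threshold and logging details are assumptions, not Affirm's exact code) might look like:

```python
# Registered once at worker startup.
import logging
import time

import greenlet

THRESHOLD_SECONDS = 0.1           # hypothetical "held the CPU too long" threshold

_last_wall = time.time()
_last_cpu = time.clock()          # CPU time; time.process_time() on Python 3

def _trace_greenlets(event, args):
    global _last_wall, _last_cpu
    if event not in ("switch", "throw"):
        return
    origin, target = args
    now_wall, now_cpu = time.time(), time.clock()
    elapsed_wall = now_wall - _last_wall
    elapsed_cpu = now_cpu - _last_cpu
    _last_wall, _last_cpu = now_wall, now_cpu
    if elapsed_wall > THRESHOLD_SECONDS:
        logging.warning(
            "greenlet %r ran %.3fs wall / %.3fs CPU before %s to %r",
            origin, elapsed_wall, elapsed_cpu, event, target,
        )

greenlet.settrace(_trace_greenlets)
```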

Notice that we are logging switch and throw events; the latter happens when an exception causes a greenlet to exit. This example calculates the time between callbacks in wall clock time and in CPU ticks. If the duration between switches exceeds some threshold, we log it.

From this data, we can determine:

  • number of switches per interval that exceed the configured threshold;
  • duration between switches, which can be an indicator of CPU-intensive workloads; and
  • difference in duration between switches in CPU ticks vs wall clock time, which can be a symptom of insufficient monkey-patch coverage.

p99 of greenlet switches. By grouping metrics by version and host, we can observe how our software behaves in simulated and production workloads.

Aside: Measuring Time

This example implementation uses both time.clock and time.time to measure duration. If the measurement by time.clock diverges from that by time.time, it could indicate that some greenlet was blocked on I/O but did not yield the main thread to other greenlets. time.clock also does not include time elapsed during sleep.

You can observe the difference between the two by running the script below with netcat:

Simulating I/O blocking with `socket` and `netcat`.
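The original script is not reproduced here; a rough reconstruction (the 3-second socket timeout is an assumption used to bound the wait) is:

```python
# sleep.py -- a reconstruction of the missing gist, not the original code.
import socket
import time

start_cpu = time.clock()         # CPU time (time.process_time() on Python 3.8+)
start_wall = time.time()         # wall clock time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("127.0.0.1", 8080))   # netcat is listening on this port
s.settimeout(3)
try:
    s.recv(1024)                 # blocks on network I/O until data or timeout
except socket.timeout:
    pass
finally:
    s.close()

print("time.clock: %f" % (time.clock() - start_cpu))
print("time.time: %f" % (time.time() - start_wall))
```
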
$ nc -l 127.0.0.1 8080  # in a separate shell
$ python sleep.py
time.clock: 0.000156
time.time: 2.973633

By simulating a long I/O wait with netcat, we can observe that time.clock did not account for the time spent waiting for data at s.recv. In Python 3, time.clock is deprecated (and removed in 3.8) in favor of time.process_time.

The insights from this instrumentation data have informed our optimization efforts. It was immediately obvious that some celery tasks (e.g. credit report parsing, underwriting) were CPU hogs better suited to multiprocessing than to cooperative multitasking. Later, it helped us tune the number of greenlets per process, make capacity estimates, and refactor branches that spent more time on the CPU than expected. We are also exploring other options, such as using RPC to offload compute-intensive work to other hosts or to sidecar processes on the same host.

There is still more work to be done. For instance, the tracer currently shows the stack frames that yielded the main thread, but not the function calls that were hogging it; that feature would let us narrow down the CPU-bound functions and refactor them. We see more opportunities to further optimize our web services by using such tools to improve observability. If this challenge interests you, we are actively hiring for this team.
