Defining Data Models with IDOL

Neil Vyas
Published in Affirm Tech Blog
6 min read · May 13, 2019

The integrity and availability of Affirm’s data is critical in providing many of our services, such as underwriting and servicing. This data may represent all sorts of things, from user payments to merchant configurations, and its definition may evolve over time as we add new products or augment existing ones.

Unfortunately, because our infrastructure is a large distributed system, its pieces can fall out of sync on their data definitions, which can result in failures. These failures are a major impediment to scaling our business and committing to increasingly aggressive SLAs, so Affirm needed a system to prevent them. Our solution was to build a system we call IDOL.

What is IDOL?

IDOL is a tool used at Affirm to define schemas for data models intended to be persisted across ownership or process boundaries. It is a layer on top of protobuf and provides the following capabilities:

  • Acceptably performant serialization and deserialization, in time and space.
  • A well-defined and use-case-specific notion of “schema version,” along with the automatic assignment of version numbers.
  • A canonical reference to and well-defined notion of schema identity.

Put pithily, IDOL lets you forget about providing and maintaining your public interfaces. IDOL is currently used for the following use cases:

  • Historical models, or models which will be persisted and used for training machine learning models, possibly for years to come;
  • RPC interfaces;
  • Online blob-store models; and
  • Cached models.

Using IDOL in these systems ensures that we never introduce incompatibilities between readers and writers, or clients and servers, that would cause downtime. Historically, such incompatibilities have been the root cause of many headaches at Affirm, in critical and non-critical components.

Behind the implementation (which is not that interesting, although we did manage to use some techniques from the excellent paper “Trees that Grow”) is the much more interesting specification, which is what enables us to provide such strong guarantees with confidence.

I won’t detail the whole spec here, but we can consider a small and hopefully familiar example: why adding a required field to a historical model is a breaking schema change.

Consider the following actor diagram, in which writers are nodes recording historical models and the reader is the model-training job:

This actor diagram depicts the distributed system that is model training.

Say that the red schema, at version two, introduces a new required field. When the reader, which is using the schema at version two, attempts to deserialize the models written by the blue writers, which are at version one, it will find that their data is missing the new required field and will therefore fail.

So, it should be clear that adding a required field to a historical model is a breaking schema change — and indeed, if we attempt to make this change to a historical IDOL schema that’s in production, IDOL will yell at us!

This is the CLI output when attempting to commit a breaking change.
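To make the failure concrete, here is a minimal Python sketch of the scenario above. It is not IDOL's implementation or API: the `Field` type, the `deserialize` and `breaking_changes` helpers, and the payment model are all hypothetical, and the check covers only the one rule discussed here (a newly added required field).

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Field:
    """Hypothetical schema field; not an IDOL type."""
    name: str
    required: bool


def deserialize(record: Dict[str, object], schema: Sequence[Field]) -> Dict[str, object]:
    """Read a record with the given schema, failing on a missing required field."""
    for field in schema:
        if field.required and field.name not in record:
            raise ValueError(f"missing required field '{field.name}'")
    # Optional fields that are absent from the record come back as None.
    return {field.name: record.get(field.name) for field in schema}


def breaking_changes(old: Sequence[Field], new: Sequence[Field]) -> List[str]:
    """List required fields in `new` that records written under `old` cannot contain."""
    old_names = {field.name for field in old}
    return [
        f"new required field '{field.name}' is absent from existing records"
        for field in new
        if field.required and field.name not in old_names
    ]


# Hypothetical historical model at two schema versions.
payment_v1 = [Field("user_id", True), Field("amount_cents", True)]
payment_v2 = payment_v1 + [Field("currency", True)]        # adds a required field
payment_v2_safe = payment_v1 + [Field("currency", False)]  # adds an optional field

# A record persisted by a writer that is still at version one.
old_record = {"user_id": "u1", "amount_cents": 500}
```

Calling `deserialize(old_record, payment_v2)` raises a `ValueError`, which is the reader failure described above, while `breaking_changes(payment_v1, payment_v2)` flags the problem before any data is written. Making the new field optional, as in `payment_v2_safe`, passes the check and keeps old records readable.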

What is IDOL’s business impact?

IDOL improves on our previous system for persisting historical data in a number of ways. It is significantly more performant, by virtue of doing much less. Performance increases allow us to build further features, like increasing the scope and coverage of our risk validation system. It provides very strong schema evolution guarantees at compile-time and also automatically takes remediation actions if a breaking change is detected.

In scaling IDOL usage, we’ve also reinforced valuable process lessons:

  • Maintainers should be abstracted from users. Talking to the maintainers doesn’t scale, and should be reserved only for escalations and exploratory work.
  • Users should use documentation. If a question is spuriously escalated, refactor the documentation and have the user review the changes. Ideally, the library should make incorrect usage impossible.
  • All bug reports, feature discussions and review requests should be well-specified tickets, even if they stemmed from in-person or online conversations.
  • Users should be abstracted from maintainers. If you can’t hold a user’s hand, then your documentation and other feedback mechanisms must be up to par. We don’t get alerted when new IDOL usage is committed, although we have tools for helping us find and audit all usage.

These lessons have proved useful to other teams (e.g., the AXP team, which builds and maintains our experiments platform) that find themselves authoring internal libraries and services, and will only prove more valuable as we continue to scale the engineering organization.

What’s the use of a versioning scheme?

Being able to version schemas means that we can version other components in a straightforward fashion. For example, in order to version our RPC interfaces, we have a handful of simple rules that address wholesale addition or deletion of routes. Then, the remaining rules check that the request and response schema didn’t experience a breaking version change.
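As a rough illustration of how such rules could compose, the sketch below checks an RPC interface in two passes. The specific rules are assumptions made for the example (deleting a route is treated as breaking, adding one as safe), and `Route`, `interface_breaks`, and the schema reference strings are hypothetical names. The schema-level check is passed in as a function, standing in for whatever verdict the schema versioning system provides.

```python
from typing import Callable, Dict, List, NamedTuple


class Route(NamedTuple):
    """Hypothetical RPC route, described by references to its two schemas."""
    request_schema: str
    response_schema: str


def interface_breaks(
    old: Dict[str, Route],
    new: Dict[str, Route],
    is_breaking: Callable[[str, str], bool],
) -> List[str]:
    """Return the breaking changes between two versions of an RPC interface."""
    problems = []
    for name, old_route in old.items():
        # Route-level rule (assumed): deleting a route breaks existing clients.
        if name not in new:
            problems.append(f"route '{name}' was deleted")
            continue
        # Schema-level rules: defer to the schema compatibility check.
        new_route = new[name]
        if is_breaking(old_route.request_schema, new_route.request_schema):
            problems.append(f"route '{name}': breaking request schema change")
        if is_breaking(old_route.response_schema, new_route.response_schema):
            problems.append(f"route '{name}': breaking response schema change")
    # Routes that exist only in `new` are additions and are assumed to be safe.
    return problems


# Stand-in for the schema versioning system's verdicts.
known_breaks = {("ChargeResponse@1", "ChargeResponse@2")}


def is_breaking(old_schema: str, new_schema: str) -> bool:
    return (old_schema, new_schema) in known_breaks


old_api = {
    "charge": Route("ChargeRequest@1", "ChargeResponse@1"),
    "refund": Route("RefundRequest@1", "RefundResponse@1"),
}
new_api = {
    "charge": Route("ChargeRequest@1", "ChargeResponse@2"),
    "status": Route("StatusRequest@1", "StatusResponse@1"),
}

problems = interface_breaks(old_api, new_api, is_breaking)
```

Here `problems` reports the breaking response change on `charge` and the deletion of `refund`, while the new `status` route passes silently. The point of the design is that the route-level rules stay tiny because all of the hard compatibility reasoning lives in the schema check.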

Versioning other components can have implications not only for correctness but also for performance. For example, by including the version number in the cache key, we can achieve optimum cache performance in the presence of multiple versions of a cached model being used concurrently, as they are in our current deploy process. Other proxies for model version, like release version, can be extremely suboptimal. Consider an endpoint that a user hits once per day and that is released more than once a day: keyed on release version, every request would miss the cache, even if the cached model's schema never changed. If we instead didn’t use any versioning in the cache key and just considered deserialization failures to be cache misses, then we’d see significant performance degradation during deploys.

This actor diagram depicts the distributed system that is an online request cache.
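A minimal sketch of the cache-key idea follows, assuming an in-memory dictionary in place of a real cache and a made-up key layout. `cache_key` and `VersionedCache` are hypothetical names for illustration, not part of IDOL.

```python
from typing import Dict, Optional


def cache_key(model_name: str, schema_version: int, entity_id: str) -> str:
    """Build a key that keeps entries written at different schema versions apart."""
    return f"{model_name}:v{schema_version}:{entity_id}"


class VersionedCache:
    """In-memory stand-in for a shared cache, keyed by schema version."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def put(self, model_name: str, schema_version: int, entity_id: str, payload: bytes) -> None:
        self._store[cache_key(model_name, schema_version, entity_id)] = payload

    def get(self, model_name: str, schema_version: int, entity_id: str) -> Optional[bytes]:
        return self._store.get(cache_key(model_name, schema_version, entity_id))


# During a deploy, servers at schema versions 1 and 2 share one cache.
cache = VersionedCache()
cache.put("user_profile", 1, "u1", b"payload-written-at-v1")

# The new server never sees the version-one bytes, so it gets a clean miss
# rather than a deserialization failure.
miss_for_new_server = cache.get("user_profile", 2, "u1")
cache.put("user_profile", 2, "u1", b"payload-written-at-v2")

# The old server's entry is untouched, so it keeps hitting the cache.
hit_for_old_server = cache.get("user_profile", 1, "u1")
```

Because the key changes only when the schema version changes, a release that leaves the model untouched keeps all of its cached entries, which is the advantage over keying on release version.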

A versioning scheme can also give you safe schema lifecycle management, enabling you to safely delete old schema versions. This sort of schema hygiene is important both for consumers of data, like analytics and data science teams, and for producers. For consumers, it gives them confidence in their data sources (especially since IDOL encourages documenting fields and models). Other processes, like onboarding and spec review, can be made simpler and more precise as well. For producers, deleting unused schema versions reduces their maintenance burden, and having a canonical, documented schema reference means fewer interruptions and questions.

It’s critical to have a good versioning scheme for your persisted data models. Not having one is like writing code without being allowed to delete or rename anything you’ve committed, while being expected to maintain a rapid development pace and correctness, and while old servers may still be serving requests. That’s certainly too difficult for me!

The future of IDOL

IDOL started as a relatively minor commitment. There was an immediate problem to be solved; we specified it, solved it satisfactorily, and moved on. However, the problem grew in scope and consequence, and I returned to it a year later as part of a new team and with new perspectives. Today, scaling IDOL involves not only fun technical play but also lots of process-focused and cross-functional work. Not only is IDOL’s growth an exciting piece of my career at Affirm, but it also nicely reflects some of our views regarding engineering investments. If developing and scaling solutions to issues like these excites you, we’re hiring!

I’m a software engineer at Affirm in San Francisco, CA, among other things.