Cryptography and Platform Security

Kevin Wang
Published in Affirm Tech Blog, Apr 17, 2019

As a financial technology company, we handle a variety of sensitive information that our customers share with us. We take data security and compliance very seriously and we protect this information using strong encryption.

The objective of the crypto service is to abstract away the implementation of cryptography, including algorithm implementation, key storage and key lifecycle management across our platform. Its primary purpose is to serve as an encrypted key-value store, but it also provides cryptographic signing, keyed hashing (HMAC) and password hashing as a service.

The crypto service is the only service in our platform with access to our cryptographic keys. Other services, such as our frontend web app, Celery workers and batch job nodes, send requests to the crypto service via RPC. The crypto service authenticates each request and uses its keys to perform the requested cryptographic operation.

The main endpoints the service exposes include encrypt(keyset_name, obj) and decrypt(id). encrypt() takes the name of a keyset (a set of successively versioned crypto keys) and a serializable object, encrypts and stores the object, and returns an id referring to the stored ciphertext. decrypt() takes that id, fetches the encrypted data from the data store, and returns the decrypted plaintext.
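
To make the shape of these two endpoints concrete, here is a minimal in-memory sketch. It is not the service's implementation: the class name ToyCryptoStore, the JSON serialization, the random hex ids and the record layout are all assumptions made for illustration. The cipher is a toy, unauthenticated keystream built from HMAC-SHA256 so that the example runs with only the standard library. A real service would use a vetted authenticated cipher instead.

```python
import hashlib
import hmac
import json
import secrets


def _keystream(key, nonce, length):
    """Toy keystream: HMAC-SHA256 over nonce || counter (stand-in cipher only)."""
    out = b""
    counter = 0
    while len(out) < length:
        block = nonce + counter.to_bytes(8, "big")
        out += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return out[:length]


class ToyCryptoStore:
    """Hypothetical model of encrypt(keyset_name, obj) and decrypt(id)."""

    def __init__(self, keysets):
        self._keysets = keysets  # keyset name -> key bytes (single version here)
        self._records = {}       # id -> stored ciphertext record

    def encrypt(self, keyset_name, obj):
        key = self._keysets[keyset_name]
        plaintext = json.dumps(obj).encode("utf-8")
        nonce = secrets.token_bytes(16)  # fresh nonce per record
        stream = _keystream(key, nonce, len(plaintext))
        ciphertext = bytes(p ^ s for p, s in zip(plaintext, stream))
        record_id = secrets.token_hex(16)
        self._records[record_id] = {
            "keyset": keyset_name,
            "nonce": nonce,
            "ciphertext": ciphertext,
        }
        return record_id  # the caller only ever holds this opaque id

    def decrypt(self, record_id):
        record = self._records[record_id]
        key = self._keysets[record["keyset"]]
        stream = _keystream(key, record["nonce"], len(record["ciphertext"]))
        plaintext = bytes(c ^ s for c, s in zip(record["ciphertext"], stream))
        return json.loads(plaintext.decode("utf-8"))


store = ToyCryptoStore({"pii": secrets.token_bytes(32)})
record_id = store.encrypt("pii", {"name": "Ada", "last4": "4242"})
assert store.decrypt(record_id) == {"name": "Ada", "last4": "4242"}
```

The point of the design is that callers never see keys or ciphertext. They pass in an object and get back an id, which is all they need to store.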

In 2018, we rewrote the crypto service from scratch with a new, more future-proof design that allows us to adapt to changing compliance and technical requirements. In this post, I will describe some of the major features of the redesign, including support for key rotations and mutual TLS to secure our network traffic, and how these changes improve the security, stability and operational ease of maintaining our platform.

Crypto service architecture diagram

Key rotation

One major feature of the new version of the service is support for key rotations. Key rotation refers to the process of switching to a new key for encryption while retaining old keys for decryption only. By performing key rotations, we limit the amount of data encrypted with any single key, which limits the damage a single compromised key can do. Additionally, it allows us to migrate to faster or stronger algorithms over time. Key rotations are mandated by various compliance regimes including the Payment Card Industry Data Security Standard (PCI DSS).

Our implementation of key rotation additionally includes support for “live” key rotation and creation. In the old crypto service, adding a new key involved manually restarting the service. Because of the way we stored our crypto keys, each key on disk was encrypted with a different password that would need to be input on startup via SSH by one of a small number of privileged individuals in order to bootstrap the service.

In the new service, we now encrypt all keys with the same master password and the service watches the filesystem for new or updated keys. By keeping the master key in memory, the service can load and immediately begin using new keys without requiring a redeploy. Additionally, these key files are pulled down from a central key store by each instance periodically, so the process to rotate a key boils down to the following:

  1. Generate a new key and add it to the file for the particular keyset being rotated. This new key is marked as “decrypt-only” for the time being.
  2. Upload the updated file to the key store and wait a few minutes for all of the crypto nodes to pull down and load the new key.
  3. Update the file again, removing the “decrypt-only” restriction from the key.
  4. Upload this new file to the key store. Within a few minutes, all crypto nodes will begin encrypting with the new key.

The purpose of the “decrypt-only” flag is to avoid a situation where some servers have already begun encrypting with the new key, but other servers have not yet loaded the key and thus are unable to decrypt data.
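
The effect of the flag can be sketched in a few lines. This is an illustrative model, not the service's code: the KeyVersion type, the helper names, and the rule that a node encrypts with the newest version not marked decrypt-only are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyVersion:
    version: int
    decrypt_only: bool = False


def encryption_key(keyset):
    """Return the newest key version that is allowed to encrypt."""
    usable = [k for k in keyset if not k.decrypt_only]
    if not usable:
        raise ValueError("keyset has no key enabled for encryption")
    return max(usable, key=lambda k: k.version)


def can_decrypt(keyset, version):
    """Any loaded version, decrypt-only or not, can be used to decrypt."""
    return any(k.version == version for k in keyset)


# A node that has not yet pulled the first update only knows v1.
stale = [KeyVersion(1)]
# Steps 1-2: v2 is distributed, but marked decrypt-only.
staged = [KeyVersion(1), KeyVersion(2, decrypt_only=True)]
# Steps 3-4: the flag is removed once every node has loaded v2.
promoted = [KeyVersion(1), KeyVersion(2)]

# While the update propagates, updated nodes still encrypt with v1,
# so a stale node can read everything they write.
assert encryption_key(staged).version == 1
assert can_decrypt(stale, encryption_key(staged).version)

# After promotion, nodes encrypt with v2. Every node already holds v2 by then.
assert encryption_key(promoted).version == 2
assert can_decrypt(staged, encryption_key(promoted).version)
```

Without the staging step, a node holding the promoted keyset could write v2 ciphertext that a stale node has no key for.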

Our cryptographic algorithm implementations are modular by design, making it easy to add new algorithms and swap out implementations for existing ones. This also allows us to write unit tests to automatically test the service against all possible algorithms.
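
One common way to get this kind of modularity is a registry that maps algorithm names to implementations, with tests that loop over every entry. The sketch below uses keyed hashing (HMAC) from Python's standard library as the example operation. The registry, the decorator and the algorithm names are hypothetical and are not taken from the service.

```python
import hashlib
import hmac

# Hypothetical registry: algorithm name -> function(key, message) -> tag bytes.
KEYED_HASHES = {}


def register(name):
    """Decorator that adds an implementation to the registry under `name`."""
    def decorator(fn):
        KEYED_HASHES[name] = fn
        return fn
    return decorator


@register("hmac-sha256")
def hmac_sha256(key, message):
    return hmac.new(key, message, hashlib.sha256).digest()


@register("hmac-sha512")
def hmac_sha512(key, message):
    return hmac.new(key, message, hashlib.sha512).digest()


def check_all_algorithms():
    """Run the same property checks against every registered algorithm."""
    key, other_key, message = b"k" * 32, b"j" * 32, b"hello"
    for name, fn in KEYED_HASHES.items():
        tag = fn(key, message)
        assert hmac.compare_digest(tag, fn(key, message)), name  # deterministic
        assert tag != fn(other_key, message), name               # depends on key
        assert tag != fn(key, message + b"!"), name              # depends on message
    return sorted(KEYED_HASHES)


check_all_algorithms()
```

Adding an algorithm is then a matter of registering one more function. The shared checks pick it up without any change to the tests.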

Unifying RPC protocols

Prior to the rollout of the new version of the service, crypto was the only part of our platform that ran on zerorpc, one of a hodgepodge of RPC protocols used throughout our platform. We wanted to move off of zerorpc for a number of reasons:

  1. Security: Our zerorpc traffic was unencrypted, which is ironic for an encryption service.
  2. Capacity: Our in-house zerorpc broker would sometimes get overwhelmed by elevated load from batch jobs.
  3. Reliability: zerorpc relies on long-lived TCP connections to brokers, which can be disrupted by transient network failures, causing errors.

Then in 2018, another team rolled out a standardized RPC framework intended to be used by all new services. This new framework uses HTTP and protobufs, making it much easier to use industry-standard infrastructure like AWS Elastic Load Balancers and industry-standard security architectures like mutual TLS. Unifying our RPC protocols reduces the number of technologies we must support, and makes it easier to ship improvements to a single RPC framework that benefit all of its consumers.

Mutual TLS

We also rolled out a public key infrastructure (PKI) that uses mutual TLS to provide mutually authenticated, encrypted communication between services, along with host-level access controls. In ordinary TLS, the server authenticates itself to the client using its certificate, but not vice versa. With mutual TLS, both the client and the server possess certificates, which allows the server to verify the authenticity of the client before establishing a connection.
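
The difference between the two modes comes down to one setting on the server side. The sketch below uses Python's standard ssl module purely as an illustration. It is not the configuration of our RPC framework, and the function names and file-path parameters are hypothetical.

```python
import ssl


def make_server_context(cert_file=None, key_file=None, ca_file=None):
    """Server-side context that requires a client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # A server context defaults to CERT_NONE (ordinary TLS). Requiring a
    # certificate is what makes the connection mutually authenticated: the
    # handshake fails unless the client presents one signed by a trusted CA.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)  # CA that signs client certs
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def make_client_context(cert_file=None, key_file=None, ca_file=None):
    """Client-side context that verifies the server and presents its own cert."""
    # A client context already verifies the server certificate and hostname.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)  # sent to server
    return ctx


server_ctx = make_server_context()
client_ctx = make_client_context()
assert server_ctx.verify_mode == ssl.CERT_REQUIRED
assert client_ctx.verify_mode == ssl.CERT_REQUIRED
```

In a real deployment, both functions would be called with the instance's certificate, its private key and the CA bundle.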

Because each certificate includes information about its host, a server can determine the identity of the requesting host and apply access controls accordingly. For instance, we can enforce that only certain types of servers are allowed to perform decrypt operations and that these servers may only perform operations using certain keys.
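
A rule like this can be expressed as a small policy table keyed by host type. The sketch below is hypothetical throughout: the role names, the keyset names, and the choice of the certificate's organizationalUnitName field to carry the role are assumptions. The only real API it relies on is the dictionary layout that Python's ssl.SSLSocket.getpeercert() returns.

```python
# Hypothetical policy: host role -> operation -> keysets the role may use.
POLICY = {
    "web": {"encrypt": {"pii", "cards"}},
    "batch": {"encrypt": {"pii"}, "decrypt": {"pii"}},
}


def role_from_peercert(cert):
    """Read the role from a getpeercert()-style dict.

    The subject is a tuple of RDNs, each a tuple of (name, value) pairs.
    Using organizationalUnitName for the role is an assumption.
    """
    for rdn in cert.get("subject", ()):
        for name, value in rdn:
            if name == "organizationalUnitName":
                return value
    return None


def is_allowed(role, operation, keyset_name, policy=POLICY):
    """Deny by default: unknown roles, operations and keysets are rejected."""
    return keyset_name in policy.get(role, {}).get(operation, set())


peer_cert = {
    "subject": (
        (("organizationalUnitName", "batch"),),
        (("commonName", "batch-01.internal"),),
    )
}
role = role_from_peercert(peer_cert)
assert is_allowed(role, "decrypt", "pii")
assert not is_allowed(role, "decrypt", "cards")
assert not is_allowed("web", "decrypt", "pii")
```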

Mutual TLS architecture diagram

Every instance in our infrastructure is loaded with a certificate that can be used as either a client or a server certificate. These certificates are issued by HashiCorp Vault (which serves as a certificate authority) and are valid for seven days. Authentication to Vault is performed using the AWS IAM auth method. In this scheme, an instance authenticates with Vault by preparing a signed request to the AWS API using the AWS secret access key from its instance profile and AWS Signature v4. Once logged in, a client can request a new certificate and private key. An hourly cron job on each instance checks the certificate and renews it if it is within one day of expiration. Short-lived certificates are preferable (versus one-year certificates, for instance), because they limit how long a compromised private key remains useful.
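
The renewal rule itself is a simple time comparison. Here is a minimal sketch of the check the hourly job performs, using the seven-day validity and one-day renewal window described above. The function name and constants are illustrative, and the actual request to Vault is left out.

```python
from datetime import datetime, timedelta, timezone

CERT_LIFETIME = timedelta(days=7)  # validity period described in this post
RENEW_WINDOW = timedelta(days=1)   # renew once expiry is this close


def needs_renewal(not_after, now, window=RENEW_WINDOW):
    """True if the certificate expires within `window` (or already has)."""
    return not_after - now <= window


issued = datetime(2019, 4, 17, tzinfo=timezone.utc)
not_after = issued + CERT_LIFETIME

# Five days in: two days of validity left, so the job does nothing.
assert not needs_renewal(not_after, issued + timedelta(days=5))
# Six days and one hour in: inside the one-day window, so the job renews.
assert needs_renewal(not_after, issued + timedelta(days=6, hours=1))
```

Because the job runs hourly and the window is a full day, a single failed renewal attempt leaves many retries before the certificate actually expires.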

The crypto service was the first service to enable mutual TLS and since then, it has been enabled on most of our RPC services.

Building for the future

The crypto service was designed with the future in mind. By switching to our standard RPC framework and building in first-class support for key and algorithm rotations, we can handle changes to our infrastructure as well as advances in cryptography and security best practices. A core tenet of our engineering philosophy at Affirm is to build for the future rather than the now. If projects like these are interesting to you and you’re looking to join an impactful and strategic engineering team, we’re hiring!
