The evolution of ZipRecruiter’s Single Job Store
ZipRecruiter is America’s #1 job board with nearly 9 million active jobs at any given moment. Storing, analyzing, and serving all those opportunities is at the core of our service.
Over the past 14 years, our job store database has been built and rebuilt as the company scaled and our needs changed. An architecture intended to quickly serve a couple of thousand jobs was no longer suitable at a scale of millions of records and billions of versions. And when new capabilities like big data processing and machine learning data enrichment were required, hitting online data stores for inherently offline, non-user-facing workloads became inefficient and costly.
This is the evolutionary (and cautionary) tale of databases used for mixed access patterns.
One job store to record them all (but actually two)
Our job store was initially built to serve job seekers the ‘job details’ pages of open positions not hosted directly on ZipRecruiter.com. This let users preview jobs before applying offsite.
The second iteration came about a few years later, fueled by a desire to reduce the time it took for external jobs to reach our search engine and add access patterns for active records.
As time went by, however, we began running into inconsistencies between the jobs in our relational data model in MySQL, which held mostly jobs hosted on ZipRecruiter.com, and our online DynamoDB (DDB) job store, which held both ZipRecruiter.com jobs and those hosted on other sites. One reason for this was that our data in DDB included ML-based data enrichment on jobs, which MySQL did not.
Having two databases made the presentation of opportunities in the ‘job details’ pages dependent on the database from which they were populated.
We wanted a single place to store all the jobs with service interfaces that a single job details page could use. It was time for a third iteration.
An initial solution created unintended problems
Our first approach was to make the DDB job store better. We added versioning of jobs to register changes (e.g. a rephrasing of the job description or an update to compensation) and to retain what a jobseeker would have seen before and after the change, the kind of capability one might find addressed by a feature store today. To make search and filtering faster, we added support for various read patterns and indexing. And to keep data as “fresh” as possible everywhere, we published DDB CDC (change data capture) streams to Kafka for other consumers to receive.
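The versioning idea can be sketched as an append-only history per job: every write produces a new immutable version, so both the before and after states of a change remain queryable. This is a minimal in-memory illustration; the class and field names (`JobVersion`, `VersionedJobStore`) are hypothetical, not ZipRecruiter's actual schema.

```python
from dataclasses import dataclass

@dataclass
class JobVersion:
    job_id: str
    version: int
    description: str
    compensation: str

class VersionedJobStore:
    """Append-only version history, loosely modeling the DDB versioning table."""

    def __init__(self):
        self._versions: dict[str, list[JobVersion]] = {}

    def put(self, job_id: str, description: str, compensation: str) -> None:
        # Every write appends a new version rather than overwriting,
        # preserving what a jobseeker would have seen at any point.
        history = self._versions.setdefault(job_id, [])
        history.append(JobVersion(job_id, len(history) + 1, description, compensation))

    def latest(self, job_id: str) -> JobVersion:
        return self._versions[job_id][-1]

    def history(self, job_id: str) -> list[JobVersion]:
        return list(self._versions[job_id])
```

The cost of this design is exactly what the next section describes: history only ever grows.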
This created new problems.
- First and foremost, the DDB job store was continuously growing and becoming quite expensive. A Kafka event was sent for every DDB record change, which was far more often than every logical change to a job: even a small change, like an updated job URL, produced a new version. The versioning table quickly swelled into our largest online database. In addition, search and recommendation engines needed to batch-dump from online data stores regularly, which was costly and inefficient to do in DDB.
- As Spark became more prevalent, so did DDB’s limitations. More and more teams wanted to run non-user-facing workloads, such as offline batch processing and machine learning enrichment with Spark, which are cumbersome to build on random-access calls to an online database. In many cases, these applications wanted to read old versions via offline batch lookups, which amplified costs.
- Finally, consumers became tightly coupled to the DDB stream implementation. Many business-critical applications were built to rely on the versioning and the Kafka stream, and we reached a point where we could not change the DDB table or its interface without severely affecting consumers.
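To illustrate the first problem above: versioning on every physical record change means a URL-only update creates a new version, while a comparison limited to jobseeker-visible fields would not. A hedged sketch, with illustrative field names only:

```python
# Fields a jobseeker actually sees; names are hypothetical.
SIGNIFICANT_FIELDS = {"title", "description", "compensation", "location"}

def is_logical_change(old: dict, new: dict) -> bool:
    """True only if a jobseeker-visible field changed, not e.g. a tracking URL."""
    return any(old.get(f) != new.get(f) for f in SIGNIFICANT_FIELDS)

old = {"title": "Engineer", "compensation": "$100k", "url": "https://a.example/1"}
new = {"title": "Engineer", "compensation": "$100k", "url": "https://a.example/2"}

# URL-only change: a per-record scheme versions it anyway; a logical-change
# check would skip it and keep the versioning table from swelling.
assert not is_logical_change(old, new)
```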
Finding ourselves in a quagmire, we had to dig a smart way out.
From online-first to datalake-first, starting with new changes
When your other systems are tightly coupled to a datastore (the online DDB in our case), it is very difficult to re-design or change the underlying store, let alone build datalake tables.
A plan was set in motion to rebuild our data store entirely, datalake-first. To do this without leaving 25 million monthly users unserved, we started emitting all change and new-job events directly to Kafka, from which we would both maintain the online store and build our datalake tables. Kafka effectively decoupled data storage from data processing, thereby freeing us to simplify and optimize the DDB and achieve a more flexible and scalable system overall.
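The fan-out can be sketched as one topic with independent consumers: each job event is published once, one subscriber keeps the online store at the latest version, and another appends full history to the datalake. The `Topic` class and stores below are in-memory stand-ins for Kafka, DDB, and the datalake; all names are illustrative.

```python
from collections import defaultdict

class Topic:
    """Toy pub/sub topic standing in for a Kafka topic."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler) -> None:
        self._subscribers.append(handler)

    def publish(self, event: dict) -> None:
        for handler in self._subscribers:
            handler(event)

online_store: dict = {}          # latest version only, like the slimmed-down DDB
datalake = defaultdict(list)     # full version history, for offline batch use

job_events = Topic()
job_events.subscribe(lambda e: online_store.__setitem__(e["job_id"], e))
job_events.subscribe(lambda e: datalake[e["job_id"]].append(e))

job_events.publish({"job_id": "j1", "version": 1, "title": "Engineer"})
job_events.publish({"job_id": "j1", "version": 2, "title": "Senior Engineer"})
```

Because neither consumer knows about the other, either store can be rebuilt or replaced by replaying the topic, which is the decoupling the paragraph above describes.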
We also decided to present only the latest version of a job to online users via the expensive DDB access pattern, removing the option of historical version lookups. While this barely impacted the user experience, it saved a lot of storage and compute.
Now, internal teams can work easily with job data (e.g. quick querying, big joins, and batch processing) along with clicks, impressions, etc. in a common Spark framework. Versioning is available for offline uses via batch patterns in the datalake, and search and recommendation engines can efficiently load all the jobs they ‘desire’ in batches rather than expensively dumping data from DDB.
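The batch access pattern amounts to reading all versions from the datalake in one pass and reducing to the latest per job, with no per-item online lookups. In production this would be a Spark job; the sketch below shows the same reduction in plain Python, with an illustrative record shape:

```python
from itertools import groupby
from operator import itemgetter

# Toy stand-in for datalake rows; in practice these come from a Spark read.
versions = [
    {"job_id": "a", "version": 1, "title": "Engineer"},
    {"job_id": "a", "version": 2, "title": "Senior Engineer"},
    {"job_id": "b", "version": 1, "title": "Analyst"},
]

def latest_per_job(rows: list[dict]) -> dict:
    """Collapse a full version history into one latest record per job."""
    rows = sorted(rows, key=itemgetter("job_id", "version"))
    return {job_id: list(group)[-1]
            for job_id, group in groupby(rows, key=itemgetter("job_id"))}

snapshot = latest_per_job(versions)
```

A search engine consuming `snapshot` gets every active job in one batch, while the full `versions` history stays available for offline analysis.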
Sometimes, starting over is the smart way to go.
About the Author

Daniel Arias
Engineering Manager
