Lyft Speeds Up Data Discovery with Tool Using Neo4j
Data is at the heart of every decision at Lyft. Once decisions are made, their impact is evaluated using data.
Given the vital role of data and analytics across the company, the speed with which users can find data, understand it, analyze it and gain insights is critical.
Data discovery – finding the right data and understanding it – was slow and inefficient. Tables might have similar names, like driver_rides_completed and rides_driver_total.lifetime_completed. Users asked coworkers for help, reached out on Slack channels or looked at GitHub to see how a table was generated. They often pulled the first 100 rows to get a feel for the contents.
Lyft’s growth exacerbated the challenge of data discovery. Lyft already had about 10 petabytes of data in thousands of tables across a variety of data stores, according to Tamika Tannis, a Lyft software engineer. Growth meant even more data generated by the mobile app and other services. As new talent was hired, the number of users doing data discovery also grew.
Lyft needed a better way to support data discovery for everyone in the company. To quantify the problem and get a baseline, Tannis’s team looked at the impact on data scientists and found that data discovery consumed about a third of their time.
Lyft engineers decided to build a tool to simplify data discovery. Their first target audience would be the most frequent users of data: analysts and data scientists.
Named Amundsen, the tool would offer three complementary ways to do data discovery: search-based, lineage-based and network-based.
Effective search was a top priority, with results ranked by popularity and relevance. Lineage-based discovery traces connections among datasets. Network-based discovery connects data with people, which is particularly valuable for new team members.
“You might want to see what data resources your manager or your coworkers are using so you can use trusted data resources that everyone else is already using for similar purposes,” said Tannis.
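To make the network-based idea concrete, here is a minimal sketch in Python of suggesting tables based on what a person's teammates already use. It is an illustration only, not Amundsen's implementation: the user names, table names, the USES and TEAMMATES structures and the tables_used_by_teammates function are all hypothetical.

```python
from collections import Counter

# Hypothetical usage edges: (user, table) means the user has queried the table.
USES = [
    ("alice", "rides"),
    ("alice", "drivers"),
    ("bob", "rides"),
    ("bob", "payments"),
    ("carol", "rides"),
]

# Hypothetical team membership: user -> teammates (manager included).
TEAMMATES = {"dana": ["alice", "bob", "carol"]}


def tables_used_by_teammates(user):
    """Return (table, teammate_count) pairs, most widely used first."""
    teammates = set(TEAMMATES.get(user, []))
    counts = Counter(table for person, table in USES if person in teammates)
    # Sort by count descending, then by table name so ties are stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


suggestions = tables_used_by_teammates("dana")
print(suggestions)
```

In this sketch a new team member such as "dana" sees the table most of her teammates rely on at the top of the list, which is the kind of trusted starting point Tannis describes.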
Amundsen uses a microservice architecture. The Databuilder service ingests data into two services: the search service, which is backed by Elasticsearch, and the metadata service, which is backed by the Neo4j graph database. Elasticsearch powers the search by providing relevance based on search terms, the user’s position in the company and the popularity of the tables. All of those connections are first made in Neo4j.
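As a rough illustration of ranking by both relevance and popularity, the Python sketch below weights a text-relevance score by log-scaled query counts. The formula, the field names (relevance, query_count) and the table names are assumptions made for this example; they are not Elasticsearch's scoring model or Amundsen's actual ranking, and the user's position in the company is left out for brevity.

```python
import math


def rank(results):
    """Order results by text relevance weighted by log-scaled popularity."""
    def score(result):
        # log1p dampens very large query counts so that popularity
        # boosts a result without completely overriding relevance.
        return result["relevance"] * (1.0 + math.log1p(result["query_count"]))

    return sorted(results, key=score, reverse=True)


# Hypothetical candidates returned for one search term.
candidates = [
    {"table": "rides_a", "relevance": 0.9, "query_count": 3},
    {"table": "rides_b", "relevance": 0.8, "query_count": 5000},
    {"table": "rides_c", "relevance": 0.2, "query_count": 9000},
]

ranked = [result["table"] for result in rank(candidates)]
print(ranked)
```

Here the heavily used, closely matching table comes first, while a very popular but poorly matching table still ranks last.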
Lyft chose Neo4j because it captures the shape of their data ecosystem, which is naturally expressed as a graph. Neo4j’s flexibility also lets the team iterate quickly on new features.
“When we have a new use case and a new piece of metadata to represent, we just have to create a new node and create that relationship,” said Tannis.
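The flexibility Tannis describes can be sketched with a tiny in-memory property graph in Python. This is not the Neo4j API or Amundsen's schema: the Graph class, the node keys, the labels (Table, User, QualityCheck) and the relationship types (USES, HAS_CHECK) are hypothetical names chosen for illustration.

```python
class Graph:
    """A minimal in-memory property graph, for illustration only."""

    def __init__(self):
        self.nodes = {}   # key -> {"label": str, "props": dict}
        self.edges = []   # (source_key, relationship_type, target_key)

    def add_node(self, key, label, **props):
        self.nodes[key] = {"label": label, "props": props}

    def add_edge(self, source, rel_type, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.edges.append((source, rel_type, target))

    def neighbors(self, source, rel_type):
        """Return keys of nodes reachable from source via rel_type."""
        return [t for s, r, t in self.edges if s == source and r == rel_type]


graph = Graph()
graph.add_node("table:rides", "Table", name="rides")
graph.add_node("user:alice", "User", name="alice")
graph.add_edge("user:alice", "USES", "table:rides")

# A new use case (here, a data quality check) only needs a new node
# and a new relationship; existing nodes and edges are left untouched.
graph.add_node("check:rides_not_null", "QualityCheck", status="passing")
graph.add_edge("table:rides", "HAS_CHECK", "check:rides_not_null")

print(graph.neighbors("table:rides", "HAS_CHECK"))
```

The point of the sketch is that adding the quality check required no change to how tables or users were already stored, which is the property that makes a graph model convenient for evolving metadata.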
At Lyft, Neo4j is an important component of Amundsen’s architecture; it serves as the source of truth for editable metadata. Neo4j also provides a foundation for new projects like compliance and data quality. “The future, as I see it, is that we’ve got a full-fledged metadata repository on which we’re building many applications,” said Mark Grover, a product manager at Lyft.