What is Data Lineage?
Data lineage is the ability to track how data flows through your enterprise. To understand lineage is to understand where data comes from, where it goes, and what happens to it along the way. This is critical if you’re trying to get better insight into your assets and how they affect the pipelines in your organization.
Why is Data Lineage Difficult to Achieve?
There are many reasons data lineage is difficult to achieve.
- Business vs. technical view: Many users, often business users, want to see high-level information about how data flows from one application to another. For example, data might flow from a division in one continental geography to another. It could also come from a corporate data lake and flow into a risk system.
- Level of detail: The fact that business users want to see data lineage at that high level can cause vocabulary conflicts. Some people are completely overwhelmed when they see individual databases, schemas, and SQL syntax. Yet technical people want to see all of that so they can understand what’s actually in the code, whether they wrote it themselves or, as is often the case, inherited legacy code that has been around for a long time.
- Scope: Code changes often. Looking at individual SQL statements from a database application that was written in the year 2000, or at each individual process within ETL jobs that perhaps now run in the cloud, is different from seeing data at that high level. Sometimes I just want to see the data lake and how data flows from it.
- Company evolution (technologies, people migration, mergers and acquisitions): Large organizations often have legacy applications from the year 2000 or even before. Many of these applications are the result of mergers and acquisitions. Through all this, people move on or get promoted, and oftentimes no one has documented things. So, you’re left with code that’s scattered all around. Some of it is still running in production with no one actually working on it anymore. No one wants to touch those production processes. We certainly don’t want to have to open them up in order to figure out how data is flowing. Still, having solutions that can read through that code, parse it, and figure it out can be very helpful.
Use Case 1: DataOps
DataOps is all the work that goes into making sure that your data pipelines are reliable and remain up and running. You don’t want your executives to look at reports on a Monday morning and have them turn up blank with no data. An architect or a DBA can look at lineage to see what depends on what. If someone is going to change a particular algorithm, they can use lineage to make sure the change doesn’t bust something downstream.
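To make that downstream check concrete, here is a minimal sketch of impact analysis as a graph walk. The lineage map and asset names are made up for illustration; this is not how any particular lineage tool stores or queries its data.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds data into the assets listed as its value.
LINEAGE = {
    "crm_db.orders": ["etl.load_orders"],
    "etl.load_orders": ["warehouse.fact_orders"],
    "warehouse.fact_orders": ["report.weekly_revenue", "report.inventory"],
    "hr_db.parking": ["report.parking_passes"],
}

def downstream(graph, start):
    """Return every asset reachable from `start`, in breadth-first order."""
    seen = []
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited via another path
        seen.append(node)
        queue.extend(graph.get(node, []))
    return seen

# Which assets could break if someone changes the orders ETL job?
impacted = downstream(LINEAGE, "etl.load_orders")
print(impacted)
# ['warehouse.fact_orders', 'report.weekly_revenue', 'report.inventory']
```

The parking pass report does not show up in the result, so a change to the orders job cannot blank it out on Monday morning.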
Use Case 2: (Cloud) Migrations
Cloud migrations are about trying to get an inventory of all your systems. If we have legacy applications, we need to understand them before we go and lift everything to the cloud. It might be that out of 76 reports, there are only four that people really use anymore. Well, let’s do a lineage on those so we can see which aspects of that application people really care about.
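As a rough sketch of that idea, the snippet below walks lineage upstream from the reports people still use and collects only the assets those reports depend on. The edges, the asset names, and the `USED_REPORTS` list are all hypothetical.

```python
# Hypothetical edges (source -> consumers) and the reports people still open.
LINEAGE = {
    "legacy.orders": ["etl.orders_job"],
    "legacy.faxes": ["etl.fax_job"],
    "etl.orders_job": ["report.sales"],
    "etl.fax_job": ["report.fax_volume"],
}
USED_REPORTS = ["report.sales"]

def upstream(graph, target):
    """Return every asset that `target` depends on, directly or indirectly."""
    # Invert the edges so we can walk from a consumer back to its sources.
    parents = {}
    for source, consumers in graph.items():
        for consumer in consumers:
            parents.setdefault(consumer, []).append(source)
    needed, stack = set(), list(parents.get(target, []))
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(parents.get(node, []))
    return needed

# Collect everything the used reports rely on; the rest is a candidate to leave behind.
to_migrate = set()
for report in USED_REPORTS:
    to_migrate |= upstream(LINEAGE, report)
print(sorted(to_migrate))
# ['etl.orders_job', 'legacy.orders']
```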
Use Case 3: Data Quality
For data quality, lineage is about trying to find the source of data quality problems within your infrastructure.
Use Case 4: Governance and Regulatory Compliance
What I really want to focus on is governance and regulatory compliance. Compliance confronts every company with the challenge of answering to government or industry regulators. With new privacy rules, every industry is responding to regulatory issues. Companies need to understand exactly how and where privacy data flows through the organization. If someone asks to be forgotten, you’ll need to know exactly which databases to look in. We don’t want the company to get a bad reputation or show up in the headlines because it caused a breach that it should’ve known about.
Lineage is typically done in a manual fashion, which makes it difficult to meet regulator deadlines. Plus, even if we are doing everything correctly, we can’t prove it.
Many sites are bringing lineage into governance solutions to do things like provide glossaries and common understanding for data stewardship. Governance is typically a three-legged stool. And good governance requires that three-legged stool to be solid. The first leg requires you to understand your data and define it correctly. The second leg is laying out the data quality so that data can be checked and accurately profiled. The third leg is lineage, and if you don’t have lineage, your governance solution can’t easily stand on its own.
Governance is about trust. Trusting data means having confidence that it’s being handled correctly and is reliable when I look at it within my data science models or standard reports.
If I can identify data quality issues within the context of a lineage pipeline, I can help prioritize fixing them. Everybody has data quality issues, but if I have a data quality issue that I can see in a data lineage, and that issue is affecting a report that goes to my VPs of revenue for sales and resourcing decisions, that’s a big problem. It is a bigger data quality problem than one in the pipeline affecting parking passes in a human resources application. If all my data quality issues are flagged within the context of lineage, I’ll have a better chance of prioritizing which ones I should fix first.
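Here is one way that prioritization could look in code, as a small sketch. It assumes you already have lineage edges and some notion of how important each report is; the names and priority numbers below are invented for the example.

```python
# Hypothetical lineage edges (source -> consumers) and report priorities.
LINEAGE = {
    "sales_db.deals": ["warehouse.revenue"],
    "warehouse.revenue": ["report.vp_revenue"],
    "hr_db.parking": ["report.parking_passes"],
}
REPORT_PRIORITY = {"report.vp_revenue": 10, "report.parking_passes": 1}

def reachable(graph, start):
    """Collect every asset downstream of `start` with an iterative depth-first walk."""
    found, stack = set(), list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(graph.get(node, []))
    return found

def issue_score(graph, priorities, asset):
    """Score an issue by the most important report downstream of it (0 if none)."""
    return max((priorities.get(n, 0) for n in reachable(graph, asset)), default=0)

# Assets where a data quality issue has been flagged, in no particular order.
issues = ["hr_db.parking", "sales_db.deals"]
ranked = sorted(issues, key=lambda a: issue_score(LINEAGE, REPORT_PRIORITY, a), reverse=True)
print(ranked)
# ['sales_db.deals', 'hr_db.parking']
```

The issue feeding the VP revenue report rises to the top, and the parking pass issue waits its turn.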
Having trust in that pipeline is one of the biggest goals for governance and compliance. When I’m looking at a report or a spreadsheet and see something in red that I don’t agree with, I want to be able to challenge it. Who do I talk to? Where did the data come from? How recently was it updated? Data lineage provides those answers critical to the governance message.
We need to more accurately define the scope of governance initiatives. Too often, we see sites that want to govern everything. They bite off more than they can chew and try to boil the whole ocean. With lineage, we can go to the teams that are most in need of governance, find their reports, and then use lineage to determine where we start. We think about which groups in the company to start with, and then we perform lineage to lay out what resources to work on.
I’ve been discussing upstream lineage for trust, but when we go downstream, we’ll find the potentially hidden nightmares around exposed privacy data.
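To picture that downstream search, consider a request to be forgotten. The sketch below assumes column-level lineage written as `database.table.column` strings, which is purely an illustrative convention, and returns every database that holds a privacy column or anything derived from it.

```python
# Hypothetical column-level lineage: "database.table.column" -> columns derived from it.
COLUMN_LINEAGE = {
    "crm.customers.email": ["lake.raw_customers.email"],
    "lake.raw_customers.email": [
        "warehouse.dim_customer.email",
        "marketing.campaigns.recipient",
    ],
    "crm.customers.signup_date": ["warehouse.dim_customer.signup_date"],
}

def databases_holding(lineage, source_column):
    """Return the sorted databases that store `source_column` or anything derived from it."""
    visited, stack = set(), [source_column]
    while stack:
        column = stack.pop()
        if column not in visited:
            visited.add(column)
            stack.extend(lineage.get(column, []))
    # The database name is the first dotted part of each column identifier.
    return sorted({column.split(".")[0] for column in visited})

print(databases_holding(COLUMN_LINEAGE, "crm.customers.email"))
# ['crm', 'lake', 'marketing', 'warehouse']
```

The marketing database is the kind of place nobody would think to check without lineage.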
What do we do at MANTA?
At MANTA, we help our customers by providing end-to-end lineage. When we say end-to-end, it means being able to go back and even look at your mainframe sources, as well as operational systems and data warehouses. Perhaps data from these was pulled over the years into places like Hadoop, where it’s studied before it goes off to a cloud database like Snowflake, where it’s reported on. End-to-end lineage means being able to see all of that visually.
MANTA looks at code, crunches your custom SQL, your ETL jobs, your business intelligence reports, and documents the lineage along the way. We visualize that lineage with an interactive color-coded map that allows you to see how that data is flowing as you click through it. Alternatively, where people have third-party governance tools, we can push lineage information into those solutions and create that three-legged stool.
That wraps up my quick run-down on lineage! I hope this blog provides you with an introduction to how data lineage is used for governance and regulatory compliance, as well as why it is so important for building trust in your data, which can only be done if your data is accurate and reliable.