The adoption of the European Union’s General Data Protection Regulation (GDPR) in May 2018 has forever changed the way enterprises have to think about their data. At any given moment, a data subject can approach your company and request any and all data that relates to them, which requires your enterprise to be able to quickly pull up, trace, and confirm the accuracy of that data across data management architectures that are becoming increasingly complex.
Luckily, there are tools available through both Neo4j and Pitney Bowes that allow enterprises to harness graph database technology and address these new challenges head on. From modeling your customer data as a graph to data discovery, data mapping, whiteboarding, mapping the single customer view, and virtualizing the graph, Neo4j and Pitney Bowes’ Spectrum provide the agility and flexibility your enterprise needs in order to make data traceable and meet the requirements of GDPR, helping you avoid millions in fines.
Full Presentation: Moving From the Static Grid to the Virtualized Graph Database
What we’re going to be talking about today is how enterprises can harness graph database technology to help ensure GDPR compliance:
Aaron Wallace: We’re going to start with a brief overview of GDPR, a relatively new regulation in Europe that’s creating a lot of stress for organizations working out how to come into compliance.
Next we’re going to cover how single view of customer applies to GDPR compliance, and how enterprise metadata serves as a foundational element to single customer views. And finally, we’ll explore GDPR compliance as an enterprise metadata problem, because it requires knowing where all of your customer information resides within your enterprise at any given time.
Andrew Chumney is going to provide the GDPR introduction and overview, and then I’ll jump back in to go over the single view metadata discussions.
Andrew Chumney: Let’s start off with a quick overview of Pitney Bowes. We’re one of the top 100 software companies in the industry – and one of Neo4j’s first graph database partners.
Most people know us for our solutions in the geospatial, address recognition, and mailing and shipping logistics realms, but for the last 10 years we’ve also built master data management and enterprise-wide software solutions outside of this traditional logistics space.
GDPR, which went into effect May 25, 2018, is a European Union regulation that allows any data subject to approach your company and request all the information you have about them. Many of you may think it doesn’t apply to your US-based company, but the regulation applies to any company that transacts with a European destination. And the stakes are high. The regulatory regime that predated GDPR ended up resulting in a $20 billion fine for Google. Just to put that in perspective, that’s more than the gross domestic product of 17 of the world’s countries.
GDPR allows the European Union to levy fines of up to 4% of your company’s global annual revenue or €20 million, whichever is higher, for each violation it finds. Again, you are subject to these regulations if you have any information about a data subject who resides in the EU, or if any of your transactions occur in the EU.
Here’s a high-level overview of what GDPR covers:
Data accuracy is crucial. A consumer needs to be able to ask, “Are you representing me properly? Do you actually know who I am, and have you recorded my data as accurately as you can?” They can also ask for all the data you have about them, and for that data to be rectified or erased. To me, this is the scariest part of GDPR because of all the referential integrity issues that can come up when deleting a person’s data.
Think about where a single name appears across your entire enterprise right now, using my name as an example. I’m in your data warehouses, transactional systems, secondary systems, logs and all kinds of non-traditional database sources. If you’re in violation in any of these places, the regulator can fine you for each instance where my name turns up in your organization – and that gets really expensive really quickly.
Many people also interpret GDPR as now requiring permission from data subjects to stand up additional data stores. For this reason, most organizations I work with have put the brakes on any new data persistence capabilities and data stores.
Addressing GDPR Challenges
Aaron: A study performed by Forrester Research around six months ago found that 50% of organizations were not even aware of GDPR, which is a pretty staggering statistic. Recent research has looked at how close to compliance cloud-reliant enterprise companies are (and keep in mind that most organizations employ roughly 1,000 cloud services throughout their enterprise) and found that almost 75% of those companies have not achieved the basic level of GDPR compliance.
On the heels of that, let’s explore the high-level next steps to coming into compliance:
The first step is data discovery. If a customer asks you to tell them everything you know about them, you’re going to need to evaluate hundreds or thousands of data sources that contain customer and PII information.
The natural first step is to create a map of your enterprise so you can understand where this information lives, which is where single customer view comes into play.
Next you “prepare” in the sense that you tag, identify or classify data elements across your different systems that contain PII.
And the final step is to act, whether that means setting up processes that enable you to find relevant information quickly and easily at the request of a customer, or the basic step of resolving entities across different databases – which is a huge problem in almost every organization, regardless of size.
For example, I’ve had the same credit card for the last 25 years and still get pieces of mail every week that have my name represented incorrectly. It’s staggering how much these organizations still struggle with data quality, and it’s a big deal to get it fixed – especially in the context of GDPR.
Below are some of the forces acting on almost every enterprise today:
This includes the cloud and its prevalence, elastic computing, the desire to move from a CAPEX to an OPEX model, big data (which is both a problem and an opportunity), and the need for rapid prototyping and innovation in collaborative environments. We also have data generated from the IoT and smart devices, which makes it easier to understand our customers but also creates a lot of challenges in the context of GDPR. Analytics and data science have been trendy for a while now, but I think a lot of organizations are just starting to get a handle on how this can help their overall organization.
All of these forces come together and present an opportunity for enabling agility within the organization.
The Role of Graph Databases In GDPR Compliance
So what does any of this have to do with graph databases?
We really think it’s the perfect tool for addressing these challenges. Our goal is to provide real-time intelligence in a modern data management platform that allows us to understand enterprise metadata. In the context of GDPR, we need to be able to find and access customer information wherever it lives, and make sure that it’s properly secured. This is why we think GDPR is an enterprise metadata problem at its core.
According to another Forrester study, a very large number of companies are still struggling to understand the scope of their data assets as it relates to customers. The adoption of data classification solutions is on the rise, with about 50% already there, 5% working towards that goal and 35% that aren’t really thinking about it yet.
As an enterprise metadata and single view of customer problem, we need to know where all of our data lives and how to access it. Again, the first step is to understand where it lives:
Find trusted and relevant sources of data, allow customers to discover and profile new data sources, evaluate impacts of system changes and enable traceability.
And if we’re looking at reports: How do I trace that customer information back to all the systems it came from?
We really believe that there’s a defined best practice around building up a single customer view. It begins with modeling your customer data, discovering where that customer data lives, mapping your high-level graph models to those existing data sources, mastering the appropriate data and, finally, virtualizing the data.
Master data management has been around for a long time, and our view has evolved to what is traditionally referred to as a registry pattern for MDM, which means we only master certain portions of data. Being able to create virtual pointers for other systems enables you to access data in real time without having to worry about the mechanics of continually running batch jobs to master and resolve data.
Step 1: Modeling Your Customer Data as a Graph
Let’s walk through these best practices, beginning with modeling your customer data and metadata as a graph:
It’s important to understand the connections between your customer data and their data elements. How are customers connected to one another, as well as to various aspects inside the metadata environment? Being able to answer these questions provides data lineage and traceability abilities. It’s not enough to understand how the actual data records are connected to one another. In the metadata landscape, you also have to understand how the elements are connected.
There’s also an extremely broad landscape that includes reports, databases, users and dashboards, along with the various transformations and processes that run on your data, all of which represent a densely connected network. And the graph model is absolutely perfect for understanding all of this.
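To make that concrete, here’s a minimal sketch in plain Python (the node names, labels and relationship types are our own invention, not Spectrum’s or Neo4j’s model) showing how data-level connections and metadata-level connections can live in one graph and be traversed together:

```python
# Minimal property-graph sketch: nodes carry labels and properties,
# relationships are (source, type, target) triples.
nodes = {
    "c1": {"label": "Customer", "name": "Ada Lovelace"},
    "addr1": {"label": "Address", "city": "London"},
    "col_email": {"label": "Column", "name": "EMAIL", "table": "CRM.CONTACTS"},
    "rpt1": {"label": "Report", "name": "Quarterly Churn"},
}
rels = [
    ("c1", "LIVES_AT", "addr1"),          # data-level connection
    ("c1", "STORED_IN", "col_email"),     # data element -> metadata element
    ("rpt1", "READS_FROM", "col_email"),  # metadata-level connection
]

def neighbors(node_id, rel_type):
    """Follow relationships of one type out of (or into) a node."""
    out = [t for s, r, t in rels if s == node_id and r == rel_type]
    inc = [s for s, r, t in rels if t == node_id and r == rel_type]
    return out + inc

# Traceability question: which reports touch the columns where this
# customer's data lives?
cols = neighbors("c1", "STORED_IN")
reports = {r for col in cols for r in neighbors(col, "READS_FROM")}
print(sorted(nodes[r]["name"] for r in reports))  # ['Quarterly Churn']
```

The same two-hop question asked against rows and foreign keys in a relational store would require joins across every system involved; in the graph it is just a walk over typed relationships.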
When you undertake these projects, it’s critical to start with the business’s view of how they want to address these problems.
In a lot of traditional MDM solutions, we see the purchasing and evaluation of tools purely from an IT perspective, without enough engagement from other stakeholders in the company. This can lead to progressing several years down a path toward a final result that doesn’t match what your business users were looking for. A graph is great because it provides agility through its schema-less nature, which allows you to quickly enhance your model and bring in new aspects.
Step 2: Discover and Profile Data Assets
After we’ve built our model, we have to understand all the ways in which we access our data:
This starts with going out, discovering, profiling, and classifying key data assets – especially customer and PII information – and typically includes millions if not billions of relevant data records that span across the enterprise landscape.
We have to document the metadata and understand where entities and graph relationships live, and the role of the information and data steward is critically important for this. This is the role that tends to bridge a lot of the business requirements over to the IT team that is actually implementing the tool.
The model we went over in the prior step defines the semantic aspects of how we do this discovery. We can use semantic tagging from our model to discover, profile, and tag our data sources, which allows us to set up more effective mapping – which is our next step in the process.
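As a rough illustration of semantic tagging, the sketch below classifies sampled column values against simple regular-expression patterns. The patterns, tag names and match threshold are hypothetical stand-ins for a real profiling engine:

```python
import re

# Illustrative PII classifier: tag a column by sampling its values.
# These patterns are deliberately simple, not production-grade rules.
PII_PATTERNS = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "PHONE": re.compile(r"^\+?[\d\s\-()]{7,15}$"),
}

def classify_column(sample_values, threshold=0.8):
    """Return the PII tag whose pattern matches enough sampled values."""
    for tag, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in sample_values if pattern.match(v))
        if sample_values and hits / len(sample_values) >= threshold:
            return tag
    return None

print(classify_column(["ada@example.com", "bob@example.org"]))  # EMAIL
print(classify_column(["+44 20 7946 0958", "555-0199"]))        # PHONE
print(classify_column(["red", "blue"]))                         # None
```

Run against every column in every discovered source, a pass like this produces the tagged inventory that the mapping step consumes.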
Step 3: Data Mapping
In our experience, mapping is where things can get a bit messy. There’s no getting around the fact that mapping a simple model to a complex data environment involves a lot of different connections and relationships. Again, graph is perfect for this kind of thing.
This starts with mapping a whiteboard-friendly enterprise model to the real world:
Again, the models are great, but there’s no getting around this sort of messy business of mapping your existing data sources. Automated semantic detection can really help with this process, along with tagging, which allows you to set up some auto-mapping aspects. If you have a model that defines a customer entity, you can go out and tag certain data elements in advance. This includes defining which tables contain data related to customers, products and relationships, as well as which columns within those tables contain which properties. If you know you’re going to be working on this project, why not send somebody out now to explore, discover and understand those data assets?
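The tag-driven auto-mapping idea can be sketched like this, assuming the discovery step has already tagged physical columns (all table and column names here are made up):

```python
# Illustrative auto-mapping: once discovery has tagged physical columns,
# map each logical model property to every column carrying its tag.
model = {"Customer": ["EMAIL", "PHONE"]}         # logical entity -> properties
tagged_columns = [                               # output of the discovery step
    {"table": "CRM.CONTACTS", "column": "EMAIL_ADDR", "tag": "EMAIL"},
    {"table": "BILLING.ACCTS", "column": "PHONE_NO", "tag": "PHONE"},
    {"table": "WEB.EVENTS", "column": "USER_EMAIL", "tag": "EMAIL"},
]

def auto_map(model, tagged_columns):
    """Propose model-property -> physical-column mappings from tags."""
    mapping = {}
    for entity, props in model.items():
        for prop in props:
            cols = [f"{c['table']}.{c['column']}"
                    for c in tagged_columns if c["tag"] == prop]
            mapping[f"{entity}.{prop}"] = cols
    return mapping

print(auto_map(model, tagged_columns))
# {'Customer.EMAIL': ['CRM.CONTACTS.EMAIL_ADDR', 'WEB.EVENTS.USER_EMAIL'],
#  'Customer.PHONE': ['BILLING.ACCTS.PHONE_NO']}
```

The proposed mappings still need human review, but they turn the messy many-to-many mapping exercise into a confirmation task rather than a blank page.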
Step 4: Create the Single Customer View, Master Relevant Sources of Data
Now that our mapping is set up, the next step is to create the single customer view:
After you determine where all your data lives, you have to decide what data you need to master versus what data you can treat more as a registry-type aspect, or pointer, for other data systems. This is where you get into mastering data from all your different sources.
In today’s enterprise, you’re not just talking about Oracle and SQL Server instances. You’re looking at things like Mongo, Cassandra and Couchbase, as well as enterprise systems like SAP and Salesforce – all of which are typically relevant to a single customer view. Bringing this data into a mastered or virtualized view of your customer data can assist with GDPR compliance.
And what does “mastering” really mean? It can include resolving entities across systems as well as data quality, which means cleaning up names and addresses across systems and adding context to customer information.
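A toy version of that mastering pass might look like the following; real entity resolution uses fuzzy matching and survivorship rules, while this sketch just normalizes names and merges records on an exact key (the records shown are invented):

```python
# Toy "mastering" pass: normalize names, then resolve records that
# share a normalized key into one golden record per customer.
def normalize(name):
    """Crude data-quality step: lowercase, strip periods, squash spaces."""
    return " ".join(name.lower().replace(".", "").split())

records = [
    {"source": "CRM", "name": "Aaron  Wallace", "email": "aw@example.com"},
    {"source": "Billing", "name": "aaron wallace.", "email": None},
]

def resolve(records):
    """Merge records with the same normalized name into a golden record."""
    golden = {}
    for rec in records:
        key = normalize(rec["name"])
        merged = golden.setdefault(key, {"name": key, "sources": [], "email": None})
        merged["sources"].append(rec["source"])        # keep provenance
        merged["email"] = merged["email"] or rec["email"]  # first non-null wins
    return golden

print(resolve(records))
# one golden record, sourced from both CRM and Billing
```

Keeping the `sources` list on the golden record matters for GDPR: when a customer asks to be erased, it tells you which upstream systems hold copies.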
Think about the importance of location as an aspect of customer information. We want to understand where people live, shop, undertake recreational activities, etc. Adding that context can be part of the process of mastering data from different systems.
Again, you have to determine what data you want to master. It’s not appropriate, performant or scalable to throw data from 1,000 data sources into a graph database. There are certain aspects of how you need to design your graph data platform to maximize performance and scalability.
Step 5: Virtualize the Graph
Virtualizing the graph can be a big driver of that scalability, which brings us to federating access, via virtual interfaces, to the data we don’t want to master.
We find this to be the best practice for creating faster time to value, an equivalent to a registry pattern for MDM, and highly useful for sources that have data that changes frequently – for example, website and social media data. Virtualizing and reaching out for this data automatically is much more performant than running a batch process every few minutes.
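The registry pattern can be sketched as a node that stores only a pointer (system plus key) and resolves it at query time; `fetch_fns` here is a hypothetical stand-in for real connectors to source systems:

```python
# Registry-pattern sketch: mastered nodes hold data locally; virtual
# nodes hold only a pointer and reach out to the source on access.
fetch_fns = {
    "social": lambda key: {"handle": key, "followers": 1234},  # stub connector
}

class VirtualNode:
    def __init__(self, system, key):
        self.system, self.key = system, key   # pointer only, no payload stored

    def resolve(self):
        """Fetch the live data from the source system at query time."""
        return fetch_fns[self.system](self.key)

profile = VirtualNode("social", "@ada")
print(profile.resolve())  # {'handle': '@ada', 'followers': 1234}
```

Because nothing is copied, fast-changing sources like social media are always read fresh, and there is no batch job to keep in sync.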
There’s also an ongoing iterative aspect to this. It’s not just “one-and-done,” because you have to constantly review any new data sources you bring online, and because data lineage is extremely important in the context of GDPR:
Impact analysis, which starts from a data source and looks forward to everything that depends on it, is the flip side of lineage.
If I have a relational database and I want to change a column name, this can take a lot of red tape to get done in a big enterprise. We really need to be able to understand the impact of changing the column name, since it impacts processes that write data to these other systems and it helps us pinpoint who we need to notify about that change – and potentially become more streamlined in that process.
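Impact analysis over lineage edges is essentially a downstream graph traversal, which a few lines of Python can sketch (the system and report names are invented):

```python
from collections import deque

# Lineage edges point downstream: "X feeds Y". Impact analysis is a
# breadth-first walk from the thing you want to change.
feeds = {
    "CRM.CONTACTS.EMAIL": ["etl_load_contacts"],
    "etl_load_contacts": ["DW.DIM_CUSTOMER"],
    "DW.DIM_CUSTOMER": ["Quarterly Churn Report", "Marketing Dashboard"],
}

def impacted(start):
    """Everything downstream of `start`, i.e. what a change would touch."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in feeds.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(impacted("CRM.CONTACTS.EMAIL")))
```

Renaming that one column shows up as an ETL job, a warehouse dimension and two reports to notify, which is exactly the list you need before filing the change request.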
Again, in the ongoing governance of building a single view, you need to always be looking towards the next step and the ongoing aspects of both lineage and impact analysis across the enterprise.
The Role of Graph
Now let’s touch back on the relevance of graph.
We’ve been a partner with Neo4j for the last seven years, and used the software as an underpinning for what began as an MDM product and is now a single view of customer product. We chose Neo4j for a number of reasons, many of which I’ve already touched on:
It’s a great tool for both understanding and efficiently querying connections between data, as well as providing an initial modeling exercise that bridges that gap between business and IT. It’s very agile, so you don’t have to define your entire model upfront, and you can continuously evolve data structures because you can connect data across multiple existing systems sources and platforms. It’s also a really powerful tool for setting up natural language processing queries.
We have a number of capabilities that allow you to explore the basic natural language structure around your model elements and relationships. If you know the basic context of your metadata model at a high level, it’s easy to get access to and explore that information, which leads to search engine simplicity.
Again, we find graph to be the key piece of this transformative approach. If you look at the industry as a whole through resources like Forrester and Gartner, you’ll find that these graphy stories are really catching on. We were out in front of it, but it’s great to see widespread adoption across industries.