In this blog post, we’ll use Neo4j to turn the European Gas Network into a knowledge graph and then analyze it.
The crisis between Ukraine and Russia caused relations between Russia and the E.U. to fall to the lowest point since the Cold War. The U.S. and E.U. imposed some sanctions on Russia over the Ukraine invasion. The financial measures are designed to damage Russia’s economy and penalize President Putin, his high-ranking officials, and the people who have benefited from his regime.
Europe relies on Russia to keep warm, and Russia needs revenue from the gas trade, so both still need each other despite the conflict. The Foreign Minister of Germany recently announced that Germany would stop all Russian oil imports by the end of 2022.
The main focus of this post is to turn the European Gas Network into a knowledge graph with Neo4j, and then explore and visualize it. If you’re a developer and are not familiar with Neo4j, you should start here to acclimate yourself. In short, Neo4j is one of the industry-standard graph databases: it stores data as nodes and relationships rather than as tables. Products include Neo4j Desktop, AuraDB, AuraDS, Bloom, Graph Data Science, etc.
The above picture clearly shows Russia’s major gas pipelines to Europe. Russian natural gas arrives on the European continent via pipelines, and it makes up about a third of all gas used there. Russian natural gas therefore plays a significant role in the energy mix of European nations. The below picture describes essential transportation routes in detail, and the legend of the image contains the number of elements for each component.
In this article, we will cover:
- Definitions of Components and Element Structure
- Creation of Knowledge Graph
- Exploratory Data Analysis and Some Queries
- Visualization via NeoDash
Definitions of Components and Element Structure
You can access the dataset that I will use to create a knowledge graph from this link. For simplicity, I copied all the related data files to my GitHub account and loaded them from there while creating the graph. To understand the fields inside the dataset, I will use the “SciGRID_gas: The raw EMAP data set” report published by the “DLR Institute for Networked Energy Systems.” I will share the definitions in general terms in this post, so readers don’t have to go through the report themselves.
Gas transmission networks consist of different components, such as pipelines, compressors, LNGs, etc. With the help of the main report, let’s briefly describe these components.
Nodes: In a gas network, gas flows from one point to another, and each point is given by its coordinates. Elements of all other components (such as compressor stations and power plants) have an associated node, which allows for the geo-referencing of each element. The term Nodes will be used throughout this blog post, as it aligns with graph theory.
PipeLines: PipeLines allow for the transmission of gas from one node to another. PipeLines are georeferenced by an ordered list of nodes.
PipeSegments: PipeSegments are almost identical to PipeLines. However, they are only allowed to connect two nodes. Hence, any PipeLines element (with three or more nodes) can easily be converted into multiple PipeSegments elements.
Compressors: Compressors represent compressor stations, which increase the pressure of the gas and thus allow the gas to flow from one node to another. A gas compressor station contains several gas compressor units (turbines).
LNGs: LNG is the acronym for liquefied natural gas. There are several LNG terminals and LNG storage facilities in Europe, as some gas gets transported to Europe via ships.
Storages: Storages are another network component. Surplus gas can be stored underground (e.g. in old gas fields or salt caverns) and used during periods of low supply or high demand.
Consumers: Consumers is the term used for gas users, which can include households, industries, and commercial users. This data set will be generated through a master’s project, and it excludes power plants.
PowerPlants: PowerPlants is the term used for power plants that consume gas; they are kept separate from Consumers.
Productions: Productions are sites, such as wells, where gas is pumped out of the ground within a country. Most of the gas used in Europe comes from outside of the EU. However, there are several smaller gas production sites scattered throughout Europe.
BorderPoints: BorderPoints are facilities at borders between countries, which are mainly used to meter the gas flow from one country to another.
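The PipeLines-to-PipeSegments conversion mentioned above is simple enough to sketch in a few lines of Python. This is only an illustration: the function name `pipeline_to_segments` and the node ids are made up for the example and do not come from the SciGRID_gas tooling.

```python
from typing import List, Tuple


def pipeline_to_segments(node_ids: List[str]) -> List[Tuple[str, str]]:
    """Split an ordered list of node ids into consecutive two-node segments."""
    if len(node_ids) < 2:
        raise ValueError("a pipeline needs at least two nodes")
    # Pair every node with its successor: [a, b, c] -> [(a, b), (b, c)]
    return list(zip(node_ids, node_ids[1:]))


# Hypothetical node ids, used only for illustration.
segments = pipeline_to_segments(["N_1", "N_2", "N_3"])
print(segments)
```

A pipeline with n nodes always yields n - 1 segments, which is why the segment form maps so naturally onto graph relationships.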
As mentioned above, elements describe individual facilities, such as compressors or LNG terminals. The overall structure is the same for all elements of all components, and is described as follows:
- id: A string that is the ID of the element, and must be unique.
- name: A string that is the name of the facility, such as “Compressor Radeland.” In most cases this is not supplied.
- source_id: A list of strings that are the data sources of the element. As several elements from different sources could have been combined into a single element, one might need to know the original data sources.
- node_id: The ID of the geo-referenced node with which an element of the network is associated. For a compressor, this will be just a single node_id. However, for a gas pipeline this entry would be a list of at least two node_id values: the start node ID and the end node ID.
- lat: The latitude value of an element. For elements of type PipeLines and PipeSegments, lat is a list of latitude values. Throughout the SciGRID_gas project, the World Geodetic System 1984 (EPSG:4326) is used as the coordinate reference system.
- long: The longitude analog to lat.
- country_code: A string indicating the two-letter ISO country code (Alpha-2 code; see Chapter 10.6 of the report for the list of countries and their codes) of the associated node of an element, or a list of codes in the case of PipeLines or PipeSegments.
- comment: An arbitrary comment that is associated with the element. In most cases this is not supplied.
- tags: This dictionary is reserved for OpenStreetMap data. It contains all associated key:value-pairs of an OpenStreetMap item.
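To make the element structure above more tangible, here is a minimal sketch of it as a Python dataclass. It is my own simplification, not code from the SciGRID_gas project: I store `node_id`, `lat`, `long`, and `country_code` as lists for every element so that point-like elements and pipelines share one shape, and the coordinate values below are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Element:
    """One facility of any component (compressor, LNG terminal, pipeline, ...)."""
    id: str                      # unique id of the element
    node_id: List[str]           # one entry for point elements, two or more for pipes
    lat: List[float]
    long: List[float]
    country_code: List[str]      # ISO Alpha-2 codes
    name: Optional[str] = None   # often not supplied
    source_id: List[str] = field(default_factory=list)
    comment: Optional[str] = None
    tags: Dict[str, str] = field(default_factory=dict)  # OpenStreetMap key:value pairs

    def is_point(self) -> bool:
        """True for elements tied to a single node, such as a compressor."""
        return len(self.node_id) == 1


# Illustrative values only; they are not taken from the data set.
compressor = Element(id="C_1", node_id=["N_1"], lat=[52.0], long=[13.4],
                     country_code=["DE"], name="Compressor Radeland")
pipe = Element(id="P_1", node_id=["N_1", "N_2"], lat=[52.0, 52.5],
               long=[13.4, 13.9], country_code=["DE", "DE"])
print(compressor.is_point(), pipe.is_point())
```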
Creating the Gas Network Knowledge Graph
Before running this section, you need to create a Neo4j sandbox so that you can run the code in the browser or connect to it from a notebook. First, we will define the constraints, and then we will jump into creating the components. I will create all the components whether or not they are part of the pipeline, since I will use all of them for visualization purposes.
You can find the code snippets for all components in the notebook; only “border points” is shown here, as an example of a component that is not part of the pipelines.
Now, we will create the “nodes” as the “junction points” of the pipelines. Then we will connect them to form the pipelines.
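For readers who want a feel for this step, the following is a rough sketch of what the constraint and loading statements could look like, written as Cypher strings held in Python. The label `Node`, the relationship type `PIPE_SEGMENT`, the constraint name, and the property names are my assumptions for illustration; the notebook may use different ones. The `REQUIRE` form of the constraint syntax needs Neo4j 4.4 or later.

```python
# Hypothetical label, relationship type, and property names; adjust to your data model.
CONSTRAINT = (
    "CREATE CONSTRAINT node_id_unique IF NOT EXISTS "
    "FOR (n:Node) REQUIRE n.id IS UNIQUE"
)

# Create one junction point per input row.
CREATE_NODES = """
UNWIND $rows AS row
MERGE (n:Node {id: row.id})
SET n.lat = row.lat, n.long = row.long, n.country_code = row.country_code
"""

# Connect two existing junction points with one pipe segment.
CREATE_SEGMENTS = """
UNWIND $rows AS row
MATCH (a:Node {id: row.start_id})
MATCH (b:Node {id: row.end_id})
MERGE (a)-[r:PIPE_SEGMENT {id: row.id}]->(b)
SET r.length_km = row.length_km
"""


def segment_rows(segments):
    """Turn (id, [start, end], length_km) tuples into parameter rows for CREATE_SEGMENTS."""
    return [
        {"id": seg_id, "start_id": nodes[0], "end_id": nodes[-1], "length_km": km}
        for seg_id, nodes, km in segments
    ]


rows = segment_rows([("S_1", ["N_1", "N_2"], 12.5)])
print(rows)
```

With the official Python driver, statements like these can be executed with `session.run(CREATE_SEGMENTS, rows=rows)`. Passing the rows as a parameter and unwinding them keeps the load to one query per batch instead of one per segment.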
Exploratory Data Analysis (EDA) and Graph Data Science (GDS)
In the EDA section, the first question could be “How many nodes does the knowledge graph consist of, and what are their types?” The second would be the same question for relationships. For this example, the relationship breakdown is not very informative, since there is only one relationship type among the nodes, but I want to cover this part to share the related code snippets.
You can also use the following “meta stats” code snippet to get the same picture in a different output format, including the type and number of the nodes and relationships.
Neo4j Graph Data Science (GDS)
Neo4j released the 2.0 version of Graph Data Science on March 24, 2022. You can check out the release notes through the link below:
I decided to use GDS 2.0 for multiple graph-based analyses as a part of EDA, such as PageRank, degree centrality, etc. First, we need to create a GDS object; we will then run the graph-based queries through this instance and check out the results.
Before jumping into the GDS algorithms, I want to share some details about the execution modes. Once you have created a named graph projection, there are four different execution modes provided for each algorithm:
- stream: Returns the results of the algorithm as a stream of records without altering the database
- write: Writes the results of the algorithm to the Neo4j database and returns a single record of summary statistics
- mutate: Writes the results of the algorithm to the projected graph and returns a single record of summary statistics
- stats: Returns a single record of summary statistics but does not write to either the Neo4j database or the projected graph
In addition to the above four modes, it is possible to use estimate to forecast how much memory a given algorithm will use.
A special note on mutate mode: When it comes time for feature engineering, you will likely want to include some quantities calculated by GDS in your graph projection, and this is where mutate comes in. It does not change the database itself but writes the calculation results to each node within the projected graph for future calculations. This behavior is useful when using more complicated graph algorithms or pipelines. It’s beyond the scope of this blog but is covered in more detail in the API docs.
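The difference between the four modes is easier to see in a toy model. The sketch below is not the GDS API: it uses plain dictionaries to stand in for the database and the projected graph, and a hypothetical `run_degree` function that computes a simple degree score, just to show where each mode puts its results.

```python
from collections import Counter


def run_degree(mode, edges, projection, database):
    """Toy model of the execution modes; `projection` and `database` map node -> properties."""
    scores = Counter()
    for a, b in edges:
        scores[a] += 1
        scores[b] += 1
    stats = {"nodeCount": len(scores), "maxScore": max(scores.values(), default=0)}

    if mode == "stream":
        # Return one record per node and leave both stores untouched.
        return sorted(scores.items())
    if mode == "mutate":
        # Store results on the in-memory projection only.
        for node, score in scores.items():
            projection.setdefault(node, {})["degree"] = score
    elif mode == "write":
        # Store results in the database.
        for node, score in scores.items():
            database.setdefault(node, {})["degree"] = score
    elif mode != "stats":
        raise ValueError(f"unknown mode: {mode}")
    # mutate, write, and stats all return summary statistics only.
    return stats


edges = [("A", "B"), ("B", "C")]
projection, database = {}, {}
print(run_degree("stream", edges, projection, database))
print(run_degree("mutate", edges, projection, database), projection, database)
```

After the mutate call, only `projection` carries the new `degree` property, which mirrors why mutate is the mode of choice when a later algorithm needs the result as an input.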
There are many ways to determine centrality, or importance, but one of the most popular is PageRank. The PageRank (PR) algorithm measures the significance of each node within the graph. PR ranks the nodes based on the number of incoming relationships (links) and on the importance of the nodes those links come from. It was originally designed to rank web pages. Generally speaking, the underlying assumption is that a web page is only as important as the web pages that link to it.
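To show the idea behind PageRank without a database, here is a small pure-Python power-iteration sketch. It is a textbook variant whose scores sum to 1, so the numbers are not directly comparable to what GDS returns; the function name and the tiny example graph are my own, and the damping factor of 0.85 is the commonly used value.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Textbook PageRank by power iteration on a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source in nodes:
            targets = out_links[source]
            for target in targets:
                # Each node passes an equal share of its rank along every outgoing link.
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Toy graph: C has two incoming links, B has none.
ranks = pagerank([("A", "C"), ("B", "C"), ("C", "A")], iterations=100)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

In this toy graph, C ends up on top because it receives links from both A and B, while B, which nobody links to, keeps only the baseline score.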
The related code snippet to get the top 10 nodes in the gas pipelines according to PageRank:
and the result is:
According to the results, the most critical node is node number 5305. It is worth noting that this “id” is not the id we set when creating the nodes in our actual graph. The “id” shown in the results is the internal ID that the Neo4j engine assigns to each node by default when it is created. So, how can we find this node in the graph?
To solve that problem, we will slightly modify our GDS PR code. First, we will have the GDS algorithm write a “pagerank” property to each node, and then we will query the nodes by their “pagerank” value.
As a result, the most important junction points of the gas pipelines of Europe can be found by using the GDS Page Rank function, and we also showed their corresponding incoming max annual gas volume.
The Degree Centrality algorithm measures the number of both incoming and outgoing relationships from a node to find popular nodes within a graph. For example, we can identify the most influential users on Twitter or help separate fraudsters from legitimate users of an online auction. Degree Centrality is an essential component of any attempt to determine the most critical nodes in a network. The core part of this algorithm is the orientation parameter that shapes its working principles. There are three types of orientation:
- UNDIRECTED: scores both incoming and outgoing relationships
- REVERSE: scores only incoming relationships
- NATURAL: scores only outgoing relationships
I will use the UNDIRECTED orientation to count all incoming and outgoing relationships for this use case. If I were trying to detect the influencers on Twitter, I would use the REVERSE orientation to count each user’s followers.
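The effect of the three orientations can be sketched in plain Python. The function below is an illustration of the counting rule only, not the GDS implementation, and the example edges are made up.

```python
from collections import Counter


def degree_centrality(edges, orientation="NATURAL"):
    """Count relationships per node for a list of directed (source, target) edges."""
    if orientation not in ("NATURAL", "REVERSE", "UNDIRECTED"):
        raise ValueError(f"unknown orientation: {orientation}")
    # Start every node at zero so nodes without matching relationships still appear.
    scores = Counter({node: 0 for edge in edges for node in edge})
    for source, target in edges:
        if orientation in ("NATURAL", "UNDIRECTED"):
            scores[source] += 1   # outgoing relationship
        if orientation in ("REVERSE", "UNDIRECTED"):
            scores[target] += 1   # incoming relationship
    return dict(scores)


edges = [("A", "B"), ("A", "C"), ("B", "C")]
for orientation in ("NATURAL", "REVERSE", "UNDIRECTED"):
    print(orientation, degree_centrality(edges, orientation))
```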
The Betweenness Centrality algorithm measures centrality within a graph based on shortest paths. According to graph theory, in a connected graph there exists at least one shortest path between every pair of nodes. The betweenness centrality of a node is the number of these shortest paths that pass through it; when a pair of nodes has several shortest paths, each one counts as a fraction.
It is widely used to find the bridge nodes that connect one part of a graph to another. In network theory, the betweenness centrality algorithm applies to a wide range of problems related to social networks, biology, transportation, telecommunications, etc.
For example, a node with higher Betweenness Centrality in a transportation network would have more control over the network because more items/passengers will pass through that node. Decision makers can use Betweenness Centrality scores to determine the hubs in a transportation network.
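As a rough illustration of how these scores come about, here is a compact pure-Python version of Brandes’ algorithm for an unweighted, undirected graph. I am not claiming this is how GDS computes its scores internally; the adjacency-list input format and the four-node example are my own choices.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm; `adj` maps each node to a list of its neighbours."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma) to every node.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating each node's share of paths.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # In an undirected graph every pair is visited from both ends, so halve the totals.
    return {v: value / 2.0 for v, value in score.items()}


# A simple chain A - B - C - D: the two middle nodes act as bridges.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(betweenness(chain))
```

In the chain, B and C each sit on two shortest paths between other nodes, while the end points A and D sit on none.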
Cluster Detection Via Louvain Modularity
This method is used to find communities in large networks. It is one of the fastest modularity-based algorithms and works well on large networks. Modularity measures how well a graph has been partitioned into clusters: it compares the density of relationships inside each community with the density you would expect if relationships were placed at random. The Louvain algorithm repeatedly merges each community into a single node and reruns the modularity clustering on the condensed graph.
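The quantity that Louvain tries to maximize can be computed by hand for a small graph. The sketch below calculates modularity for a given partition of an unweighted, undirected graph; it is only the scoring function, not the Louvain algorithm itself, and the example graph of two triangles joined by one edge is invented for the illustration.

```python
from collections import Counter


def modularity(edges, community):
    """Modularity of a partition; `community` maps each node to a community label."""
    m = len(edges)
    internal = Counter()     # number of edges with both ends inside a community
    degree_sum = Counter()   # sum of node degrees per community
    for a, b in edges:
        degree_sum[community[a]] += 1
        degree_sum[community[b]] += 1
        if community[a] == community[b]:
            internal[community[a]] += 1
    # Observed share of internal edges minus the share expected at random.
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Two triangles (1-2-3 and 4-5-6) joined by the single edge 3-4.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
by_triangle = {1: "left", 2: "left", 3: "left", 4: "right", 5: "right", 6: "right"}
all_in_one = {node: "all" for node in range(1, 7)}
print(modularity(edges, by_triangle), modularity(edges, all_in_one))
```

Splitting the graph along the bridge scores clearly higher than putting every node into one community, which is the kind of improvement Louvain searches for at each step.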
I will use the “pipes” projected graph created at the beginning of this section.
To make the country codes of the clusters easier to read, the full names for some of the codes are listed below. Exploring this list, we can see that the most prominent community consists of junction points in Russia, the natural gas provider, and in nearby countries such as Estonia, Belarus, and Ukraine. The junction points in these countries play an essential role in distributing natural gas to Europe.
- EE: Estonia
- BY: Belarus
- UA: Ukraine
- RU: Russia
- NL: Netherlands
- BE: Belgium
- XX: No country — under sea
- DE: Germany
- AT: Austria
- CH: Switzerland
Like all other graph algorithm categories we have explored, there are several alternatives for pathfinding. Predominantly, the goal of pathfinding algorithms is to find the shortest path between two or more nodes. In the case of our natural gas pipeline graph, a pathfinding algorithm helps us determine which junction points (nodes) a route has to pass through to keep the overall distance to a minimum.
Dijkstra’s algorithm is one of the most common shortest path algorithms used. A* and Yen’s algorithm are the other alternative guns in the Neo4j GDS arsenal.
Unlike the previous GDS examples, we will need a weighted graph projection, since we want to minimize distance, and Dijkstra’s algorithm supports weighted graphs with positive relationship weights. It starts at the source node and records the total weight needed to reach each of its neighbors. It then moves to the unvisited node with the lowest total weight from the source, updates the totals of that node’s neighbors, and so on, until the target node is reached.
As you can see in the above query, we specify a source and target node and use the relationshipWeightProperty of length_km. At this point, I arbitrarily chose two nodes (the source node is “INET_N_856” and the target node is “NutsCons_1003”) and ran the pathfinding algorithm. The result includes several fields, among them the total distance and a listing of the junction points along this path. In this case, we see that the shortest route is 18 hops long, and the total distance is minimized.
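To make the mechanics of Dijkstra’s algorithm concrete, here is a minimal pure-Python sketch using a priority queue. It is independent of GDS; the graph, the node names, and the distances are invented, with the weights playing the role of length_km.

```python
import heapq


def dijkstra(graph, source, target):
    """Shortest path on `graph`: node -> list of (neighbour, weight), weights >= 0."""
    best = {source: 0.0}     # lowest known total weight from the source
    previous = {}            # predecessor of each node on its best path
    visited = set()
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node in visited:
            continue         # stale queue entry
        visited.add(node)
        if node == target:
            break
        for neighbour, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(queue, (candidate, neighbour))
    if target not in best:
        return float("inf"), []   # target is unreachable
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[target], path[::-1]


# Hypothetical junction points with made-up distances in km.
pipes = {
    "A": [("B", 10.0), ("C", 3.0)],
    "C": [("B", 4.0)],
    "B": [("D", 2.0)],
    "D": [],
}
total_km, route = dijkstra(pipes, "A", "D")
print(total_km, route)
```

Note that the direct link from A to B is not part of the best route here: going through C adds a hop but shortens the total distance, which is exactly the trade-off a weighted shortest path resolves.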
Visualization Via NeoDash
NeoDash is a graph app for building dashboards from Neo4j graphs in minutes. You can create multi-page visualizations using the graph database. There are multiple options to present the data, such as a map, table, bar chart, pie chart, graph, line chart, etc. It also allows setting a dynamic parameter that affects other visualizations. After creating the dashboard, you can save it to your graph database as a node and fire it up again later. To learn more, check out the resources below.
- Building Interactive Neo4j Dashboards with NeoDash 1.1
- NeoDash 2.0 — A Brand New Way to Visualize Neo4j
Here is one example of a dashboard created with NeoDash.
The notebook we’ve worked through can be found here. I hope you fork it and modify it to meet your needs. Pull requests are always welcome!
Exploring the European Natural Gas Network as a Knowledge Graph was originally published in Neo4j Developer Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.