Datasets

To help you get up and running quickly with graphdatascience, the library can project several interesting datasets into server-side GDS with a single call. Some of these are small and built into the library; others are larger, hosted elsewhere, and fetched automatically under the hood upon loading.

1. Built-in datasets

For convenience, the library is shipped with a few smaller built-in datasets that are very fast to load. These are easily imported to GDS to get a graph object representing the dataset.

Each of these datasets comes with a loader method that takes two optional parameters: graph_name, which assigns a name to the graph, and undirected, a boolean that loads the graph as undirected if set to True.

If a graph is loaded as undirected = True, then it will have twice the number of relationships compared to its directed version. The default value for undirected varies for each dataset.
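As a rough sketch of the doubling rule above, the following helper derives the relationship count for either orientation from the default counts listed in the statistics tables of this section. The helper name and the dictionary are illustrative and not part of the library; only the loader parameters graph_name and undirected come from the text above, and the sketch assumes the doubling rule holds exactly.

```python
# Illustrative only: (relationship_count, default value of `undirected`) per
# built-in dataset, copied from the statistics tables in this section.
DEFAULT_STATS = {
    "cora": (5_429, False),
    "karate_club": (78, False),
    "imdb": (37_288, True),
    "lastfm": (584_060, True),
}


def expected_relationship_count(dataset: str, undirected: bool) -> int:
    """Return the relationship count expected for the requested orientation."""
    default_count, default_undirected = DEFAULT_STATS[dataset]
    if undirected == default_undirected:
        return default_count
    # An undirected graph holds twice as many relationships as its directed version.
    return default_count * 2 if undirected else default_count // 2


# With a live connection `gds`, the count could be compared like this:
#   G = gds.graph.load_cora(graph_name="cora_undirected", undirected=True)
#   assert G.relationship_count() == expected_relationship_count("cora", True)
```

The commented lines at the end show how the helper might be used against a running server; they are left as comments because they require a GDS connection.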

For example:

G = gds.graph.load_cora()

assert G.node_count() == 2_708
assert G.node_labels() == ["Paper"]
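The assertions above can be generalized into a small sanity check. The sketch below is not part of the library: graph_stats and diff_stats are hypothetical helper names, and the only Graph methods used are those that appear in the examples on this page.

```python
def graph_stats(G) -> dict:
    """Collect basic statistics using the Graph methods shown on this page."""
    return {
        "name": G.name(),
        "node_count": G.node_count(),
        "relationship_count": G.relationship_count(),
        # Sort so that comparisons do not depend on label or type ordering.
        "node_labels": sorted(G.node_labels()),
        "relationship_types": sorted(G.relationship_types()),
    }


def diff_stats(actual: dict, expected: dict) -> dict:
    """Return {key: (actual, expected)} for every expected key that differs."""
    return {
        key: (actual.get(key), value)
        for key, value in expected.items()
        if actual.get(key) != value
    }


# Expected values for the default (directed) Cora graph, taken from Table 1.
CORA_EXPECTED = {
    "name": "cora",
    "node_count": 2_708,
    "relationship_count": 5_429,
    "node_labels": ["Paper"],
    "relationship_types": ["CITES"],
}

# With a live connection `gds`, an empty result means everything matches:
#   G = gds.graph.load_cora()
#   assert diff_stats(graph_stats(G), CORA_EXPECTED) == {}
```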

1.1. Cora

A well-known citation network introduced by Automating the Construction of Internet Portals with Machine Learning and used in many node classification or link prediction publications.

The default is to load Cora as undirected = False.

Table 1. Cora graph statistics

Name                     Value
name                     cora
node_count               2_708
relationship_count       5_429
node_labels              ['Paper']
relationship_types       ['CITES']
node_properties          Paper: [features, subject]
relationship_properties  CITES: []

1.2. Karate club

A well-known social network introduced by Zachary. The default is to load Karate club as undirected = False.

Table 2. Karate club graph statistics

Name                     Value
name                     karate_club
node_count               34
relationship_count       78
node_labels              ['Person']
relationship_types       ['KNOWS']
node_properties          Person: []
relationship_properties  KNOWS: []

1.3. IMDB

A heterogeneous graph that is used to benchmark node classification or link prediction models such as Heterogeneous Graph Attention Network, MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding and Graph Transformer Networks. The graph contains Actors, Directors, Movies (and UnclassifiedMovies) as nodes, with relationships between actors and the movies they acted in, and between directors and the movies they directed.

The default is to load IMDB dataset as undirected = True. If loaded as directed, it will have half the relationships.

Table 3. IMDB graph statistics

Name                     Value
name                     imdb
node_count               12_772
relationship_count       37_288
node_labels              ['Movie', 'Actor', 'Director', 'UnclassifiedMovie']
relationship_types       ['ACTED_IN', 'DIRECTED_IN']
node_properties          Movie: [plot_keywords, genre], Actor: [plot_keywords], Director: [plot_keywords], UnclassifiedMovie: [plot_keywords]
relationship_properties  ACTED_IN: [], DIRECTED_IN: []

1.4. LastFM

A heterogeneous graph that is used to benchmark link prediction models, for example by MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding. The data and licensing are from HetRec 2011. The original raw data is from LastFM.

The graph contains User and Artist nodes, each with a rawId property that corresponds to the ids from the raw data. The rawId could be used, for example, to check the artist names by referring back to the HetRec'11 .dat files.

There are three types of relationships:

  • IS_FRIEND exists between User nodes, representing user-to-user friendships.

  • LISTEN_TO exists between User and Artist nodes, indicating the artists a user listens to.

      • The weight property represents the number of times the user listened to the artist.

  • TAGGED exists between User and Artist nodes, indicating artists that a user has tagged.

      • The tagID property represents a genre.

      • The day, month, and year properties represent the time when the tag was created.

      • The timestamp property represents the same date as milliseconds since the epoch.

Note that a user can tag the same artist multiple times with different tagID.
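To illustrate how the timestamp property relates to the day, month and year properties on TAGGED relationships, here is a minimal conversion sketch. It assumes the timestamp marks midnight UTC of the tagging date, which the dataset description does not specify; the function names are hypothetical and not part of the library.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timestamps count milliseconds from the Unix epoch in UTC.
_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tag_date_to_timestamp_ms(year: int, month: int, day: int) -> int:
    """Convert a tagging date to milliseconds since the epoch (midnight UTC)."""
    moment = datetime(year, month, day, tzinfo=timezone.utc)
    return (moment - _EPOCH) // timedelta(milliseconds=1)


def timestamp_ms_to_tag_date(timestamp_ms: int) -> tuple:
    """Convert milliseconds since the epoch back to (year, month, day) in UTC."""
    moment = _EPOCH + timedelta(milliseconds=timestamp_ms)
    return moment.year, moment.month, moment.day
```

If the raw timestamps turn out to carry a time of day or a different time zone, the first function will not reproduce them exactly, while the second still recovers the UTC calendar date.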

The default is to load LastFM dataset as undirected = True. If loaded as directed, it will have half the relationships.

Table 4. Default LastFM graph statistics

Name                     Value
name                     lastfm
node_count               19_914
relationship_count       584_060
node_labels              ['Artist', 'User']
relationship_types       ['TAGGED', 'LISTEN_TO', 'IS_FRIEND']
node_properties          Artist: [rawId], User: [rawId]
relationship_properties  TAGGED: [year, month, day, tagID, timestamp], LISTEN_TO: [weight], IS_FRIEND: []

2. Datasets from the Open Graph Benchmark

There are convenience methods for loading Open Graph Benchmark datasets in the library. In order to use these methods, one has to install OGB support for the graphdatascience library:

pip install graphdatascience[ogb]

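The two methods that expose the OGB dataset loading functionality are:

  • gds.graph.ogbn.load for datasets used for node property prediction

  • gds.graph.ogbl.load for datasets used for link property prediction

These methods both return a Graph object, and take four arguments: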

Table 5. OGB dataset loading method arguments

Name               Type                Description
dataset_name       str                 The name of the OGB dataset
dataset_root_path  str = "./dataset"   An optional path for where the downloaded OGB dataset will be stored
graph_name         Optional[str]       An optional name of the created graph. Defaults to dataset_name
concurrency        int = 4             An optional number of threads to use

The OGB loading methods use the OGB library under the hood. As such, when a new OGB dataset is loaded, it is downloaded from the internet, so an internet connection is required. If the dataset is already present in dataset_root_path, however, no internet connection is needed.
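A minimal sketch of how the four arguments might be passed is shown below. The wrapper and its helper are hypothetical and not part of the library; the sketch assumes that the dataset name prefix ("ogbn-" or "ogbl-", as in the examples on this page) identifies the right loader, and that the arguments can be passed by the keyword names listed in Table 5.

```python
def ogb_loader_namespace(dataset_name: str) -> str:
    """Pick "ogbn" or "ogbl" from the dataset name prefix (an assumption)."""
    prefix = dataset_name.split("-", 1)[0]
    if prefix not in ("ogbn", "ogbl"):
        raise ValueError(f"Not an OGBN or OGBL dataset name: {dataset_name!r}")
    return prefix


def load_ogb_dataset(
    gds,
    dataset_name: str,
    dataset_root_path: str = "./dataset",
    graph_name=None,
    concurrency: int = 4,
):
    """Call gds.graph.ogbn.load or gds.graph.ogbl.load with explicit arguments."""
    loader = getattr(gds.graph, ogb_loader_namespace(dataset_name))
    return loader.load(
        dataset_name,
        dataset_root_path=dataset_root_path,
        graph_name=graph_name,  # None falls back to the dataset name
        concurrency=concurrency,
    )


# With a live connection `gds` and OGB support installed:
#   G = load_ogb_dataset(gds, "ogbn-arxiv", dataset_root_path="./dataset")
```

Reusing the same dataset_root_path across runs means the download only has to happen once.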

All OGB datasets are loaded into GDS with orientation DIRECTED. However, some of the datasets are effectively undirected, in the sense that for each relationship, one in the opposite direction between the same nodes also exists. The Open Graph Benchmark website states for which datasets this is the case. If you want a projected graph to be UNDIRECTED in the Neo4j GDS sense of the word, you can modify it with gds.graph.relationships.toUndirected.
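The notion of "effectively undirected" can be made concrete with a small check over a list of (source, target) pairs. This is only an illustration of the definition, written in plain Python with a hypothetical function name; it does not query GDS and is not how the library decides orientation.

```python
def is_effectively_undirected(relationships) -> bool:
    """Return True if every (source, target) pair also appears reversed."""
    pairs = set(relationships)
    return all((target, source) in pairs for source, target in pairs)
```

For example, the pairs (0, 1) and (1, 0) together pass the check, while (0, 1) alone does not.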

If the targeted GDS server has an enterprise license and the Arrow Flight Server enabled, OGB datasets load quickly. Otherwise, loading is much slower: it can take several orders of magnitude longer and require substantially more memory. Indeed, one may need to increase the maximum heap size of the Neo4j database which hosts GDS in order to have enough memory for the loading.

At this time, the loading methods do not support loading of relationship features that are in the OGB datasets. Whenever a dataset that includes relationship features is loaded, those features will not be included in the projection, and a warning will be issued.

Depending on the dataset and task, the graphs projected into server-side GDS will look a bit different.

2.1. OGBN graphs

Datasets used for node property prediction.

2.1.1. Homogeneous

These graphs will, when projected into server-side GDS, have:

  • Up to three disjoint node labels representing the dataset split: "Train", "Valid" and "Test"

  • One relationship type "R" for all relationships

  • A "classLabel" property on all nodes, which is represented by a single number

  • If provided by the dataset, a node property "features", which is represented by an array of numbers for all nodes

Let’s see an example of loading and inspecting one of these datasets:

Example of loading the 'ogbn-arxiv' dataset
G = gds.graph.ogbn.load("ogbn-arxiv")

assert G.name() == "ogbn-arxiv"
assert G.node_count() == 169_343
assert G.node_labels() == ["Train", "Valid", "Test"]
assert G.node_properties()["Train"] == ["features", "classLabel"]
assert G.relationship_count() == 1_166_243
assert G.relationship_types() == ["R"]
assert G.relationship_properties()["R"] == []

2.1.2. Heterogeneous

These graphs are heterogeneous, so by definition they have multiple node labels and relationship types. These labels and types are named in the graph projection according to their names in the original dataset. In addition, the projected graph will have:

  • Up to three disjoint node labels representing the dataset split: "Train", "Valid" and "Test". This implies that nodes might have multiple labels

  • A "classLabel" property on the nodes targeted for prediction, which is represented by a single number

  • If provided by the dataset, a node property "features", which is represented by an array of numbers for some or all of the nodes

Let’s see an example of loading and inspecting one of these datasets:

Example of loading the 'ogbn-mag' dataset
G = gds.graph.ogbn.load("ogbn-mag")

assert G.name() == "ogbn-mag"
assert G.node_count() == 1_939_743
assert set(G.node_labels()) == {
	"Train",
	"Test",
	"Valid",
	"institution",
	"field_of_study",
	"paper",
	"author",
}
assert G.node_properties()["paper"] == ["features", "classLabel"]
assert G.node_properties()["institution"] == []
assert G.relationship_count() == 21_111_007
assert G.relationship_types() == ["cites", "writes", "affiliated_with", "has_topic"]
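Because split labels and entity labels share the same label space in heterogeneous OGBN graphs, it can help to separate them. The following is a small sketch with a hypothetical helper name; the label names are the ones listed in this section.

```python
# Node labels that encode the dataset split, as described above.
SPLIT_LABELS = {"Train", "Valid", "Test"}


def partition_node_labels(labels):
    """Split node labels into (split labels, labels from the original dataset)."""
    labels = set(labels)
    return labels & SPLIT_LABELS, labels - SPLIT_LABELS


# With the ogbn-mag graph loaded as G:
#   split_labels, entity_labels = partition_node_labels(G.node_labels())
```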

2.2. OGBL graphs

Datasets used for link property prediction.

2.2.1. Homogeneous

These graphs are used for link prediction. When projected into server-side GDS, they will have:

  • One node label "N" for all nodes

  • Up to six disjoint relationship types representing the dataset split: "TRAIN_POS", "TRAIN_NEG", "VALID_POS", "VALID_NEG", "TEST_POS", "TEST_NEG"

  • If provided by the dataset, a node property "features" which is represented by an array of numbers for all nodes

Let’s see an example of loading and inspecting one of these datasets:

Example of loading the 'ogbl-ddi' dataset
G = gds.graph.ogbl.load("ogbl-ddi")

assert G.name() == "ogbl-ddi"
assert G.node_count() == 4_267
assert G.node_labels() == ["N"]
assert G.node_properties()["N"] == []
assert G.relationship_count() == 1_334_889 + 197_481  # Positive + negative counts
assert G.relationship_types() == ["TRAIN_POS", "VALID_POS", "VALID_NEG", "TEST_POS", "TEST_NEG"]

ogbl-wikikg2 is a homogeneous dataset according to the OGB API, but it has multiple relationship types. Hence we load it in the same way as the heterogeneous ones. For example, we suffix the relationship types with the dataset split type (see below). Please also note that since the number of negative relationships in this particular dataset is so large, only positive relationships will be loaded.

2.2.2. Heterogeneous

These are heterogeneous graphs used for knowledge graph completion, so by definition they have multiple node labels and relationship types. The node labels are named in the graph projection according to their names in the original dataset. Complementing the naming in the original dataset, the relationship type names are suffixed with an underscore followed by the kind of set they appear in within the dataset split ("TRAIN", "VALID" or "TEST").

In addition, the projected graph will have:

  • A "classLabel" property on all relationships, which is represented by a single non-negative integer. This integer maps one-to-one with the relationship types of the original dataset

  • If provided by the dataset, a node property "features" which is represented by an array of numbers for some or all of the nodes

Let’s see an example of loading and inspecting one of these datasets:

Example of loading the 'ogbl-biokg' dataset
G = gds.graph.ogbl.load("ogbl-biokg")

assert G.name() == "ogbl-biokg"
assert G.node_count() == 93_773
assert G.node_labels() == ["disease", "protein", "drug", "sideeffect", "function"]
assert G.node_properties()["protein"] == []
assert G.relationship_count() == 5_088_434
# For each of the train, valid and test sets: number of rel types
assert len(G.relationship_types()) == 51 * 3
assert G.relationship_properties()["drug-drug_polycystic_ovary_syndrome_TRAIN"] == ["classLabel"]
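The suffix convention for relationship types in heterogeneous OGBL graphs can be undone with plain string handling. The helpers below are hypothetical and not part of the library; they assume the split is always the part after the last underscore, as described in this section.

```python
from collections import defaultdict

SPLITS = ("TRAIN", "VALID", "TEST")


def split_relationship_type(type_name: str) -> tuple:
    """Split "<original type>_<SPLIT>" into (original type, split)."""
    base, _, split = type_name.rpartition("_")
    if not base or split not in SPLITS:
        raise ValueError(f"No split suffix found in {type_name!r}")
    return base, split


def group_by_split(type_names) -> dict:
    """Group original relationship type names by the split they belong to."""
    grouped = defaultdict(list)
    for type_name in type_names:
        base, split = split_relationship_type(type_name)
        grouped[split].append(base)
    return dict(grouped)


# With the ogbl-biokg graph loaded as G:
#   by_split = group_by_split(G.relationship_types())
```

Using the last underscore matters because the original type names, such as drug-drug_polycystic_ovary_syndrome, may contain underscores themselves.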

2.3. Limitations on community edition

For users of the GDS community edition, performance can be impacted for large graphs. It is possible that the socket connection with the database times out. If this happens, a possible workaround is to modify the server configuration settings server.bolt.connection_keep_alive or server.bolt.connection_keep_alive_probes. However, be aware of side effects, such as a genuine connection issue taking longer to be detected.