Common datasets

For convenience, the library is shipped with a few common datasets. These are easily imported to GDS to get a graph object representing the dataset.

The common datasets comes with a loader method that takes two optional parameters: graph_name which assigns a graph name, undirected which takes a boolean and will load the graph as undirected if set to true.

If a graph is loaded as undirected = True, then it will have twice the number of relationships compared to its directed version. The default value for undirected varies for each dataset.

For example:

G = gds.graph.load_cora()

assert G.node_count() == 2708
assert G.node_labels() == ["Paper"]

1. Datasets

1.1. Cora

A well known citation network introduced by Automating the Construction of Internet Portals with Machine Learning and used in many node classification or link prediction publications.

The default is to load Cora as undirected = False.

Table 1. Cora graph statistics
Name Value

name

cora

node_count

2708

relationship_count

5429

node_labels

['Paper']

relationship_types

['CITES']

node_properties

Paper: [features, subject]

relationship_properties

CITES: []

1.2. Karate club

A well known social network introduced by Zachary. The default is to load Karate club as undirected = False.

Table 2. Karate club graph statistics
Name Value

name

karate_club

node_count

34

relationship_count

78

node_labels

['Person']

relationship_types

['KNOWS']

node_properties

Person: []

relationship_properties

KNOWS: []

1.3. IMDB

A heterogeneous graph that is used to benchmark node classification or link prediction models such as Heterogeneous Graph Attention Network, MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding and Graph Transformer Networks. The graph contains Actors, Directors, Movies (and UnclassifiedMovies) as nodes, and relationships between actors and movies that they acted in, and between directors and movies which they directed for.

The default is to load IMDB dataset as undirected = True. If loaded as directed, it will have half the relationships.

Table 3. IMDB graph statistics
Name Value

name

imdb

node_count

12772

relationship_count

37288

node_labels

['Movie', 'Actor', 'Director', 'UnclassifiedMovie']

relationship_types

['ACTED_IN', 'DIRECTED_IN']

node_properties

Movie: [plot_keywords, genre], Actor: [plot_keywords], Director: [plot_keywords]

relationship_properties

ACTED_IN: [], DIRECTED_IN: []