Additional Information

To close this course, here is some additional information to keep in mind when designing your graph analytics workflow.

Labs algorithms

This course covered only the production-quality graph algorithms in the GDSL, but more than 40 additional algorithms are available in the alpha and beta tiers at the time of this writing. Take a look at the GDSL documentation to explore all the supported graph algorithms. You will find Link Prediction, Pathfinding, and Node Embedding algorithms that we have not explored in this course.

Types of graph

It is vital to understand the type of graph you want to analyze. Remember, most of the Community Detection and Centrality algorithms are designed to run on a monopartite graph. If you start with a bipartite graph, you might want to use the Similarity algorithms or Cypher projection to infer a monopartite network before moving on to other graph algorithms.

Always include the Weakly Connected Components algorithm in your graph analytics workflow to learn how the graph is connected. Disconnected components might skew the results of other graph algorithms, so it is critical to understand how well your graph is connected.

Another thing to keep in mind is the direction of relationships. Differentiating between directed and undirected networks is of great importance, as it has a significant influence on the algorithm’s results. Finally, consider whether to include relationship weights as an input to the algorithm. Sometimes weights improve the results, and sometimes they do not. Test several configuration options and see what works best.
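The first two points above, inferring a monopartite network from a bipartite one and then checking how it is connected, can be illustrated outside the database. The following is a minimal Python sketch, not the GDSL implementation: the data, the function names, and the 0.5 similarity threshold are all hypothetical, and Jaccard similarity is used here as one example of a similarity measure.

```python
from itertools import combinations


def jaccard_projection(memberships, threshold=0.5):
    """Infer weighted person-person edges from a person -> set(items) mapping."""
    edges = []
    for a, b in combinations(sorted(memberships), 2):
        union = memberships[a] | memberships[b]
        if not union:
            continue  # two empty sets have no defined similarity here
        score = len(memberships[a] & memberships[b]) / len(union)
        if score >= threshold:
            edges.append((a, b, score))
    return edges


def weakly_connected_components(nodes, edges):
    """Group nodes into components, ignoring direction (union-find)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving keeps trees shallow
            n = parent[n]
        return n

    for a, b, *_ in edges:
        parent[find(a)] = find(b)

    components = {}
    for n in nodes:
        components.setdefault(find(n), set()).add(n)
    return list(components.values())


# Hypothetical bipartite data: people and the movies they like.
likes = {
    "alice": {"matrix", "inception"},
    "bob": {"matrix", "inception", "alien"},
    "carol": {"titanic"},
    "dave": {"titanic", "notebook"},
}

edges = jaccard_projection(likes, threshold=0.5)
components = weakly_connected_components(likes, edges)
print(len(components), "components:", components)
```

In this toy data the inferred person-to-person network falls into two disconnected components, which is exactly the kind of structure you want to know about before running Centrality or Community Detection algorithms on it.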

Seed parameters

The first step in most Community Detection algorithms is to initialize each node in its own community. In practice, each node gets assigned a unique community id. In the next step, algorithms use various techniques to search for communities within the network. Community Detection algorithms can return different community ids when executed multiple times on the same graph. This can be a bit of a nuisance if you want to track how communities evolve over time. The seed property parameter allows you to define the initial community id for each node. Using a seed property can be thought of as semi-supervised community detection, where you provide the initial community id for some nodes based on prior domain knowledge.

Imagine you are running a daily batch process where you search for communities in your Neo4j graph. Each day, you want to provide the communities from the previous day as the seed values. There are two reasons for this. The first is that it is easier to track a specific community through time: by providing the communities from the day before as a seed property, you make sure that the community ids do not change, unless a community disintegrates. The second is that the algorithm runs faster: most of the communities are already calculated from the day before, so the algorithm needs fewer iterations.
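To make the effect of seeding concrete, here is a small Python sketch of a deterministic, simplified label propagation that accepts seed values. It is an illustration only and does not reproduce how the GDSL implements seeding: the function name, the tie-breaking rule, the choice to number unseeded nodes above the largest seed, and the example graph are all assumptions.

```python
from collections import Counter


def seeded_label_propagation(adjacency, seeds, max_iterations=10):
    """Return (labels, passes) for a simplified, deterministic label propagation."""
    # Seeded nodes start with yesterday's community id. Unseeded nodes get
    # fresh ids above the largest seed so they cannot collide with it.
    next_id = max(seeds.values(), default=-1) + 1
    labels = {}
    for node in sorted(adjacency):
        if node in seeds:
            labels[node] = seeds[node]
        else:
            labels[node] = next_id
            next_id += 1

    for iteration in range(1, max_iterations + 1):
        changed = False
        for node in sorted(adjacency):
            counts = Counter(labels[n] for n in adjacency[node])
            if not counts:
                continue  # isolated node keeps its label
            best = max(counts.values())
            if counts.get(labels[node], 0) == best:
                continue  # current label is already among the most common
            # Break ties by taking the smallest of the most common labels.
            labels[node] = min(l for l, c in counts.items() if c == best)
            changed = True
        if not changed:
            return labels, iteration
    return labels, max_iterations


# Hypothetical graph: two triangles joined by c-d, plus a new node g
# that appeared today and is connected to d and e.
adjacency = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f", "g"],
    "e": ["d", "f", "g"],
    "f": ["d", "e"],
    "g": ["d", "e"],
}
# Yesterday's communities serve as seeds; g has no seed yet.
seeds = {"a": 10, "b": 10, "c": 10, "d": 20, "e": 20, "f": 20}

labels, passes = seeded_label_propagation(adjacency, seeds)
print(labels, "after", passes, "passes")
```

In this sketch the existing nodes keep the ids 10 and 20 from the previous day, the new node g is absorbed into community 20, and the loop stops after two passes because only g needed to move.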

Neo4j Causal Cluster

Avoid writing to core cluster members. Instead, run the graph algorithms on a read replica and write the results to CSV (or export them in some other way). If you want to store the results back in the cluster, you can load them into the leader through a CSV import, or use Kafka to stream the results back. Another alternative is to create and detach a read replica from the cluster. You can then run the graph algorithms on that instance.
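As a rough illustration of the export step, the sketch below writes algorithm results to a CSV file using only the Python standard library. It assumes the results have already been fetched from the read replica as (node id, community id) pairs. The column names, the file name, and the sample rows are hypothetical, and connecting to Neo4j is left out on purpose.

```python
import csv
import os
import tempfile


def write_results_csv(rows, path):
    """Write (node_id, community_id) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["nodeId", "communityId"])
        writer.writerows(rows)


# Hypothetical algorithm output collected from the read replica.
results = [(0, 10), (1, 10), (2, 20)]

csv_path = os.path.join(tempfile.mkdtemp(), "communities.csv")
write_results_csv(results, csv_path)

# Read the file back to confirm what an import on the leader would see.
with open(csv_path, newline="", encoding="utf-8") as handle:
    exported = list(csv.reader(handle))
print(exported)
```

Keeping the node id in the file is what lets a later import on the leader match each row back to its node.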

Common concerns

Try to avoid running graph algorithms on a transactional graph. If an algorithm takes a very long time to run, try an approximation variant of the algorithm, or run it on only a subset of the graph. In production, prefer the Native projection and avoid the Cypher projection.