Online Course: Using a Machine Learning Workflow for Link Prediction

Exploratory Data Analysis

About this module

In the previous section, you set up your Neo4j Desktop environment and loaded the citation dataset. In this module, you will explore that data by querying Neo4j and processing the results with tools from the Python ecosystem.

At the end of this module, you should be able to:

  • Query a database for its schema.
  • Return and chart the number of node labels and relationship types using matplotlib.
  • Build and plot a histogram of papers and their citations using pandas and matplotlib.

Tools

You will be using the following Python libraries in this course:

py2neo

The py2neo driver enables data scientists to easily integrate Neo4j with tools in the Python Data Science ecosystem. It does this by providing functions that translate the results of queries into data structures used by these tools. You will be using this library to execute Cypher queries against Neo4j.
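
As a rough sketch of what running a Cypher query through py2neo can look like, the snippet below counts nodes per label. The connection URI, the credentials, and the helper names connect and count_nodes_by_label are assumptions made for illustration; adjust them to match your own Neo4j Desktop setup.

```python
# Hypothetical helpers; the URI and credentials below are placeholders.
LABEL_COUNT_QUERY = """
MATCH (n)
UNWIND labels(n) AS label
RETURN label, count(*) AS count
ORDER BY count DESC
"""


def connect(uri="bolt://localhost:7687", user="neo4j", password="password"):
    """Open a connection to a running Neo4j instance (assumed to be local)."""
    # Imported here so the module can be loaded without py2neo or a database.
    from py2neo import Graph
    return Graph(uri, auth=(user, password))


def count_nodes_by_label(graph):
    """Run the query and return a list of {'label': ..., 'count': ...} dicts."""
    return graph.run(LABEL_COUNT_QUERY).data()


# Usage, assuming a database is running and the credentials are correct:
# graph = connect()
# print(count_nodes_by_label(graph))
```

The query is kept in a constant so that it can be read and tested separately from the connection code.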

pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. You will be using this library to do post-processing of the data that you query from Neo4j.
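
To illustrate the kind of post-processing pandas makes easy, here is a small sketch that summarises a column of citation counts. The column name citations, the helper name summarise_citations, and the sample values are made up for illustration and are not taken from the citation dataset.

```python
import pandas as pd


def summarise_citations(records):
    """Build a DataFrame from query-style records and summarise 'citations'."""
    df = pd.DataFrame(records)
    # describe() reports count, mean, std, min, quartiles, and max
    return df["citations"].describe()


# Made-up records, shaped like rows returned by a Cypher query
sample = [
    {"article": "a1", "citations": 0},
    {"article": "a2", "citations": 1},
    {"article": "a3", "citations": 2},
    {"article": "a4", "citations": 9},
]
summary = summarise_citations(sample)
print(summary)
```

A summary like this is a quick way to see whether a distribution is skewed before you plot it.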

Matplotlib

Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hard-copy formats and interactive environments across platforms. You will be using this library to create charts from the data you query.
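
The two chart types used in this module, a bar chart of counts and a histogram, can be sketched as follows. The helper names plot_counts and plot_histogram, the axis labels, and the placeholder values are assumptions for illustration only; the real numbers come from your own queries.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a Jupyter notebook
import matplotlib.pyplot as plt


def plot_counts(counts, title):
    """Draw a bar chart from a {name: count} mapping and return the Axes."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(list(counts.keys()), list(counts.values()))
    ax.set_ylabel("Count")
    ax.set_title(title)
    return ax


def plot_histogram(values, bins=10):
    """Draw a histogram of per-article citation counts and return the Axes."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.hist(values, bins=bins)
    ax.set_xlabel("Citations per article")
    ax.set_ylabel("Number of articles")
    return ax


# Placeholder data, not taken from the citation dataset
bar_ax = plot_counts({"LabelA": 5, "LabelB": 3}, "Nodes per label (placeholder data)")
hist_ax = plot_histogram([0, 0, 1, 2, 9], bins=3)
```

In a notebook, the figures render inline once the cell finishes; outside a notebook, you could call savefig on each figure instead.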

Citation Dataset

Now you are ready to start exploring the data.

Here is the graph model for the dataset:

[Figure: Graph Model]

You want to better understand the data you will be working with, including the distribution of authors, papers, and citations. You will then be able to use this knowledge to help build a recommendation engine and make predictions on the data.

Exercise 1: Exploring the data

In this exercise, you will use the Jupyter notebook you set up previously.

Open the 02_EDA.ipynb notebook to complete the first exercise.

Once you have attempted the exercises, you can see the answers by launching the 02_EDA_Solutions.ipynb notebook.

Check your understanding

Question 1

What is the name of the procedure that returns the node labels in the database?

Select the correct answer.

  • db.labels
  • db.nodeLabels
  • db.nodes
  • dbms.labels

Question 2

Which node label is the most popular one in this dataset?

Select the correct answer.

  • Article
  • Author
  • Venue

Question 3

What is the mean number of articles published by an author?

Select the correct answer.

  • 2.064
  • 89.000
  • 1.751
  • 3.000

Summary

You should now be able to:

  • Query a database for its schema.
  • Return and chart the number of node labels and relationship types using matplotlib.
  • Build and plot a histogram of papers and their citations using pandas and matplotlib.
