Online Course Data Science with Neo4j Setting Up your Development Environment Exploratory Data Analysis Recommendations Predictions Summary: Data Science with Neo4j Want to Speak? Get $ back. Exploratory Data Analysis About this module In the previous section, we setup our… Read more →

Exploratory Data Analysis

About this module

In the previous section, we setup our Neo4j Sandbox environment containing the citation dataset. In this module we’re going to explore that data. We’ll be querying Neo4j and processing the results using tools in the Python ecosystem.

At the end of this module, you should be able to:

  • Query a database for its schema.
  • Return and chart the number of node labels and relationship types using matplotlib.
  • Build and plot a histogram of papers and their citations using pandas and matplotlib.

Tools

We’ll be using the following Python libraries in this course:

py2neo

The py2neo driver enables data scientists to easily integrate Neo4j with tools in the Python Data Science ecosystem. It does this by providing functions that translate the results of queries into data structures used by these tools. We’ll be using this library to execute Cypher queries against Neo4j.

pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. We’ll be using this library to do post processing of the data that we query from Neo4j.

Matplotlib

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. We’ll be using this library to create charts based on our data.

Citation Dataset

Now you are ready to start exploring the data. The graph model for the dataset is shown below:

Graph Model

You want to better understand the data you will be working with, including the distribution of authors, papers, and citations. You will then be able to use this knowledge to help build a recommendation engine and make predictions on the data.

Exercise 1: Exploring the data

In this course you use the Jupyter notebook you set up previously.

Click the button below to launch the notebook and perform the steps for exploring the data. When you launch this notebook, you will enter the same credentials you entered when you tested your connection to the Neo4j Sandbox.

Exercise 1

Once you have attempted the exercises, you can see the answers by launching the following notebook:

See answers

Check your understanding

Question 1

What is the name of the procedure that returns the node labels in the database?

Select the correct answer.

  • db.labels
  • db.nodeLabels
  • db.nodes
  • dbms.labels

Question 2

Which node label is the most popular one in this dataset?

Select the correct answer.

  • Article
  • Author
  • Venue

Question 3

What is the mean number of articles published by an author?

Select the correct answer.

  • 2.064
  • 89.000
  • 1.751
  • 3.000

Summary

You should now be able to:

  • Query a database for its schema.
  • Return and chart the number of node labels and relationship types using matplotlib.
  • Build and plot a histogram of papers and their citations using pandas and matplotlib.

Stay Connected

Sign up to find out more about Neo4j's upcoming events & meetups.