Announcing the Neo4j Connector for AWS Glue


Stu Moore

Product Manager

I’m thrilled to announce the official release of the Neo4j Connector for AWS Glue, now available as a custom connector for use in AWS Glue.

For data engineers and developers building graphs in the AWS cloud, this connector simplifies graph creation from tabular data. Whether you’re running Neo4j AuraDB on AWS or managing self-hosted Neo4j instances on Amazon EC2, you can now use the power of serverless AWS Glue to transform and load data from Amazon services like S3, DynamoDB, RDS, Redshift, and Kinesis Data Streams, as well as popular data stores like Apache Kafka, Snowflake, and MySQL, directly into Neo4j.

This blog will provide best practices and demonstrate how easy it is to transform Transport for London (TfL) cycling data, published as CSV files, into a graph of cycle stations in areas of London and the trips between stations.

The connector uses the Neo4j JDBC Driver’s powerful SQL to Cypher translator to convert AWS Glue’s SQL commands into efficient Cypher statements, ensuring optimal performance for graph creation. The translator maps relational tables (or, in this case, CSV files) either to the nodes they describe or to the relationships that connect source and target nodes. Check out Michael Simons’ blog post on the JDBC Driver for more information.

For example, stations.csv describes the nodes with the label Station, and station_in_area.csv describes the relationships between nodes with the Station label and nodes with the Area label.
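To make this concrete, here’s a rough sketch of the kind of translation involved when AWS Glue writes a row to the Station “table” (both the SQL and the Cypher here are illustrative, not captured from the connector):

-- SQL issued by AWS Glue against the Station "table" (illustrative)
INSERT INTO Station (stationId, stationName, longitude, latitude)
VALUES (1, 'River Street, Clerkenwell', -0.10997, 51.5292)

// Cypher the translator might generate (illustrative)
MERGE (station:Station {stationId: 1, stationName: 'River Street, Clerkenwell', longitude: -0.10997, latitude: 51.5292})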

Best Practices

Neo4j is a schema-free or schema-optional database, which makes it flexible for developers creating graphs. AWS Glue expects a schema to be present in all databases, so we recommend the following when using the connector:

  • Create a blueprint* that defines the labels (types of nodes), relationship types, and properties.
  • Create two jobs. The first loads the nodes; multiple labels can be loaded in the same job, but each label must use a separate connector target. The second loads the relationship types; again, each type must use a separate connector target.
  • Use a consistent naming style; you can use either camel case (e.g., areaName) or underscores (e.g., station_name).**

*I avoided the use of the word schema because what’s created isn’t fixed in the way a traditional schema is. However, in order for AWS Glue to be able to write to the database, it needs to be able to query a schema to know what tables and columns exist.

**I used both styles in this blog because I was testing that underscores and camel case did not confuse how AWS Glue used the SQL to Cypher translator.

This blog was written using Neo4j Connector for AWS Glue v1.3, which didn’t permit the same property name on nodes with different labels (e.g., name on both Station and Area). From v1.4 onward, the renames from name to stationName and from name to areaName described below are no longer required and can be ignored.

Why This Connector Matters for TfL-Style Data

TfL publishes rich transport datasets that are ideal for graph modeling — station locations, time-series flows, and relationships to road segments or stations. Using the Neo4j Connector for AWS Glue, you can:

  • Treat Neo4j as a JDBC source/target inside AWS Glue jobs, including the visual ETL tool.
  • Perform streaming/online ETL patterns from S3 or APIs and write node/relationship updates directly into Neo4j without building separate ingestion services.
  • Keep analytics and operational graph stores (self-managed Neo4j or Aura on AWS) in sync with live TfL feeds for route analytics, anomaly detection, and richer queries.

Data Chosen for the Example

The data in this blog is powered by TfL Open Data: Transport for London’s cycling datasets, which track trips between bike stations in the London area. The dataset contains OS data © Crown copyright and database rights 2016 and Geomni UK Map data © and database rights [2019]. I’ll use it with Explore’s coordinate-based layout to create a map of London.

Note: The data provided by TfL includes additional information like the bike and route used. I haven’t included it because I wanted the blog to focus on the ETL capabilities, but it’s available if you want to create a richer graph.

Source Data and Table Schemas

The AWS Glue job will read six CSV files, stored in Amazon S3. The relational schema maps to the graph schema as follows.

Reminder: The blog was written with the Neo4j Connector for AWS Glue v1.3. Starting from v1.4, you don’t need to change the name column/properties to be unique, reducing the transformations required.

stations.csv

  • Provides the source of the :Station node
  • Columns are station_id (PK string), name (string), location (string)
  • Use AWS Glue to transform location (string) into two columns: longitude (float) and latitude (float)
  • Transform to Neo4j properties stationId (int), stationName (string), longitude (float), and latitude (float)

areas.csv

  • Provides the source of the :Area node
  • Columns are area_id (PK string), name (string)
  • Transform to Neo4j properties areaId (int) and areaName (string)

trips.csv

  • Source of the :Trip node
  • Columns are trip_id (PK string), duration_ms (string), start_time (string), and end_time (string)
  • Transform to Neo4j properties tripId (string), duration_ms (int), start_time (datetime), and end_time (datetime)

station_in_area.csv

  • JOIN table that defines the relationship :LOCATED_IN
  • Columns are station_id (string) and area_id (string)
  • Transform to Neo4j properties stationId (int) and areaId (int)
  • Connects (:Station) via stationId to (:Area) via areaId

trip_from_station.csv

  • JOIN table that defines the relationship :STARTS_AT
  • Columns are trip_id (string) and station_id (string)
  • Transform to Neo4j properties tripId (string) and stationId (int)
  • Connects (:Trip) via tripId to (:Station) via stationId

trip_to_station.csv

  • JOIN table that defines the relationship :ENDS_AT
  • Columns are trip_id (string) and station_id (string)
  • Transform to Neo4j properties tripId (string) and stationId (int)
  • Connects (:Trip) via tripId to (:Station) via stationId

The Graph Model (Blueprint)

The resulting graph will look like the following (I’ll provide the Cypher statement later).

Blueprint of (:Station)-[:LOCATED_IN]-(:Area), (:Trip)-[:STARTS_AT]-(:Station), (:Trip)-[:ENDS_AT]-(:Station)

It can answer questions like:

  • “What are the most popular :Trip routes ([:STARTS_AT] -> [:ENDS_AT]) that start in the Westminster :Area?”
  • “Which :Station has the most outgoing :Trips and is connected to the Central London :Area?” (see the sketch below)
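Once the graph is loaded, the second question might look like this in Cypher (a sketch, assuming the labels, relationship directions, and property names used later in this blog):

MATCH (s:Station)-[:LOCATED_IN]->(:Area {areaName: 'Central London'})
MATCH (s)-[:STARTS_AT]->(t:Trip)
RETURN s.stationName AS station, count(t) AS outgoingTrips
ORDER BY outgoingTrips DESC
LIMIT 1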

Architecture and Deployment Notes

  • Download the Neo4j Connector for AWS Glue JAR from the Neo4j Deployment Center and place it in an S3 bucket that AWS Glue can access.
  • Create an AWS Glue custom JDBC connector using the Neo4j JDBC Driver’s class name, org.neo4j.jdbc.Neo4jDriver, and add the URL parameter enableSQLTranslation=true — this will translate AWS Glue’s SQL-style reads and writes to Cypher at runtime (see the example URL after this list).
  • Store Neo4j credentials in AWS Secrets Manager and configure the AWS Glue connection to fetch them; the connector expects the standard credential property names user and password. Don’t use the neo4j_username or neo4j_password properties that appear in the Neo4j connection file.
  • Ensure that AWS Glue can reach Neo4j Aura (VPC, subnets, and security groups configured with the correct networking). Test connectivity with a small job first.
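For reference, the JDBC URL for an Aura instance would look something like the following (the host is a placeholder; the +s scheme enables the encrypted connection that Aura requires):

jdbc:neo4j+s://xxxxxxxx.databases.neo4j.io?enableSQLTranslation=true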

Operational Tips and Best Practices

  • Start with a subset of the data that describes the nodes and relationships you want to create.
  • If you make a mistake and need to re-run, ensure that the database is empty — nodes and/or relationships that are not part of the blueprint will confuse AWS Glue when it uses the SQL to Cypher translator. Delete small graphs with MATCH (n) DETACH DELETE n; for large graphs, use MATCH (n) CALL (n) { DETACH DELETE n } IN TRANSACTIONS.
  • Create a job to load the nodes first. Within the job, there should be an S3 Source and a Neo4j Target pair for each node label — you will have three. Load a single label first to test the connection details and confirm that the property names and types are created as expected in the Neo4j database.
  • Add the remaining node labels; each must have its own S3 Source and Target. You will end up with three flows that will be executed in parallel.
  • Relationship strategy: create an S3 Source to Neo4j Connector Target flow for each relationship type. Build and test the flow for a single relationship type first, then repeat for the remaining relationship types.*
  • Between each S3 Source and Target, use AWS Glue’s transforms to clean the data — remove duplicates and incomplete records — to ensure a high-quality graph. The translator will use MERGE, but it’s best to create unique constraints in Neo4j (a sketch follows below).
  • If the property types or names aren’t created correctly on the nodes, relationship creation will fail with messages like “table does not exist.”
  • Inspect AWS Glue job logs and Neo4j query logs to find translator mismatches; the JDBC translator will surface unsupported SQL constructs in exceptions.

*Not all data models and datasets will permit you to load the relationship types in this way. When multiple relationship types link the same nodes, you may experience deadlocks. If this is the case, move each of the relationship types to different AWS Glue jobs.
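As recommended in the tips above, it’s worth creating unique constraints on the key properties before loading; here’s a sketch using the property names from this blog:

CREATE CONSTRAINT station_id_unique IF NOT EXISTS FOR (s:Station) REQUIRE s.stationId IS UNIQUE;
CREATE CONSTRAINT area_id_unique IF NOT EXISTS FOR (a:Area) REQUIRE a.areaId IS UNIQUE;
CREATE CONSTRAINT trip_id_unique IF NOT EXISTS FOR (t:Trip) REQUIRE t.tripId IS UNIQUE;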

Building the TfL Cycle Graph Using AWS Glue’s Visual ETL

The steps below walk through the creation of a reproducible visual-ETL flow that ingests the CSV sources hosted on S3 to create the nodes and relationships in Neo4j.

I’ve assumed readers are familiar with AWS Glue’s transforms and have skipped most of the intermediate steps that clean the data and produce the column/property names and types described in Source Data and Table Schemas above.

For detailed instructions for the Neo4j Connector for AWS Glue, please refer to the documentation. For detailed instructions for using AWS Glue, please refer to the AWS documentation.

Top Tips: Preview, Iterate With a Sample Dataset, Run

If you’re going to follow along, I’ll share these tips now rather than at the end, because being aware of them from the start can save you time and effort:

  • Use AWS Glue’s schema and data preview to make sure field mappings are correct.
  • Run with a small dataset first and check that the results in Neo4j are expected.
  • AWS Glue’s error messages will be quite cryptic because it only deals with tables and columns and doesn’t understand what labels, properties, and relationships are. Expect to see messages like “table does not exist” when trying to create a relationship. Sadly, it doesn’t tell you the table name.
  • When writing to Neo4j, you’ll use the connector as a Target. There’s also an option to use Neo4j as a Source to extract data from Neo4j; this can be used to test that you can read the blueprint data and get the expected results (see the sketch below).
Create a Data source and select Neo4jAWSGlueConnector (or whatever you called it) to test that the blueprint is interpreted correctly
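For a read test like this, the translator turns AWS Glue’s SELECT into a MATCH, roughly along these lines (illustrative):

-- SQL sent by AWS Glue when reading the Station "table" (illustrative)
SELECT * FROM Station

// Cypher the translator might generate (illustrative)
MATCH (station:Station)
RETURN station.stationId AS stationId, station.stationName AS stationName, station.longitude AS longitude, station.latitude AS latitude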

1. Preparation

  • Upload the CSVs into S3.
Upload the files you need to S3; the blog uses a subset of these
  • Download the neo4j-aws-glue connector JAR from Neo4j or from GitHub — be sure to get the latest version, which will include the latest version of the JDBC driver.
  • Upload the jar to an S3 path.
  • Create an AWS Secrets Manager secret with Neo4j credentials and reference it in the AWS Glue connection.
Note the use of user and password as keys — do not use neo4j_username or neo4j_password

Create an instance in Aura and save the credentials from its connection file in Secrets Manager as well.

Create an instance in Aura

Create the blueprint schema of the nodes, relationships, and properties that we will import:

MATCH (n) DETACH DELETE n;

// Create one placeholder node per label, with each property set to a value of the expected type.
// The extra :Temp label marks the placeholders so they can be removed later.
CREATE (a:Area:Temp {areaName: 'area'})
CREATE (s:Station:Temp {stationName: 'station', longitude: -0.10997100174427032, latitude: 51.52920150756836, stationId: 1})
CREATE (t:Trip:Temp {tripId: 'uuid', start_time: localdatetime(), end_time: localdatetime(), duration_ms: 1})
// Create one placeholder relationship per relationship type.
MERGE (t)-[:ENDS_AT]->(s)
MERGE (s)-[:LOCATED_IN]->(a)
MERGE (s)-[:STARTS_AT]->(t)

Note: Make sure you provide values of the expected type and within the range of the real data. This is especially important for longitude and latitude: if you put in values like 0 and 1, you’ll throw off the scaling in Explore and won’t get a cool map (transposed longitude and latitude values for a single station will do the same).
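The :Temp label exists so the placeholder blueprint data is easy to remove later (for example, once the load has completed and you no longer need AWS Glue to read the schema) without touching the real data:

MATCH (n:Temp) DETACH DELETE n;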

2. Create a Custom Connector

  • Select Custom connectors to create a custom JDBC connector in AWS Glue using the JAR in S3.
  • Set the driver class to org.neo4j.jdbc.Neo4jDriver and add enableSQLTranslation=true to the JDBC URL so the connector will translate supported SQL to Cypher at runtime.
  • Point credentials to the Secrets Manager secret (so jobs pick them up at runtime).
Create custom connector

Important: When you need to update the version of the JAR to the latest version, it’s best to create a new custom connector with the latest version. I found that updating an existing connector to use the latest version left me running jobs with an out-of-date JAR.

3. Create a Connection to Aura

  • Select the Connector and provide a description (e.g., Connection to Aura Instance using the latest Connector).
  • Set the credential type to default.
  • Provide user and password parameters as ${user} and ${password} respectively*.
  • Save changes.

*Make sure they were stored like this in the Secrets Manager.

Create connection

4. Build the AWS Glue Visual ETL Job for Nodes

In the AWS Glue visual authoring canvas, you need to create two jobs: the first loads the nodes, the second loads the relationships.

The approach, assuming you’re going to build a graph with the areas, stations, and trips, is as follows:

  • Sources: Create S3 sources — one for each of the nodes: areas.csv, trips.csv, stations.csv.
  • Transformations: For each source, create a series of transformations to clean the data (drop duplicates), change the schema from all the strings in the CSV file to the types and column names specified in Source Data and Table Schemas above, and drop any columns that aren’t required.
  • Target: For each node label, add the Neo4j Connector for AWS Glue as a target. You need three because each CSV file describes a different label, and each label maps to a table name specified in the connector.

Example using the stations.csv on S3:

Setting up the source for the station nodes using stations.csv

Note: With v1.4, the following is no longer required.

To avoid confusion between the Area and Station labels both having a property called name, I changed name to stationName (you could use station_name instead). The same goes for the import of areas.csv; you’ll need to change name to areaName or area_name.

The change schema step ensures that the columns in the CSV file map to the properties we want in the graph (refer to Source Data and Table Schemas for details). To convert the string containing the location data to separate fields for longitude and latitude:

  • Use the Split transform to split the location using the regex pattern [,\[\]] into a new array column (e.g., long-lat). For example, a location value like [-0.10997,51.52920] splits into an empty leading token, -0.10997, 51.52920, and an empty trailing token.
  • Use the Array to Column transform to separate out into four columns (e.g., junk, longitude, latitude, junk2).
  • Use Drop to remove the location, junk, and junk2 columns.
Change the source schema to the target schema for the Neo4j database

When you configure the data target (the Neo4j Connector for AWS Glue), select the connection you created and specify the node label as the Table name. For stations.csv, this will be Station.

The table name is set to Station, which will be the name of the label in the Neo4j database

Next, test that the Station nodes and their properties are created correctly in Neo4j.

Repeat the steps above to create the Area and Trip nodes; the table names will be Area and Trip. Pay close attention to the column names and types to ensure they’re as described in Source Data and Table Schemas above. The resulting job should look similar to the following image.

Load nodes job — source target pairs for (:Station) from stations.csv, (:Area) from areas.csv, (:Trip) from trips.csv

5. Build the AWS Glue Visual ETL Job for Relationships

Next, create a new job for the relationships. Use a similar approach to the nodes:

  • Sources: Create S3 sources — one for each of the relationship files: station_in_area.csv, trip_to_station.csv, and trip_from_station.csv.
  • Transformations: For each source, create a series of transformations to clean the data (drop duplicates), change the schema from all the strings in the CSV file to the types and column names specified in Source Data and Table Schemas above, plus drop any columns that aren’t required.
  • Target: For each relationship type, add the Neo4j Connector for AWS Glue as a target. You need three because each CSV file describes a different relationship type. AWS Glue thinks of each as a table, but the SQL to Cypher translator will turn it into a relationship.
station_in_area.csv

To create relationships, we use the following pattern: SourceLabel_RELATIONSHIP_NAME_TargetLabel. For example, to connect Station to Area via a relationship LOCATED_IN, we specify the table name Station_LOCATED_IN_Area.

Creating the relationship LOCATED_IN by specifying the table Station_LOCATED_IN_Area
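Conceptually, a write to the relationship “table” resolves the two node keys and merges the relationship between them, roughly as follows (both the SQL and the Cypher here are illustrative, not captured from the connector):

-- SQL issued by AWS Glue against the relationship "table" (illustrative)
INSERT INTO Station_LOCATED_IN_Area (stationId, areaId) VALUES (1, 12)

// Cypher the translator might generate (illustrative)
MATCH (s:Station {stationId: 1})
MATCH (a:Area {areaId: 12})
MERGE (s)-[:LOCATED_IN]->(a)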

Test that the relationship is created correctly. If you get an error stating the table name doesn’t exist, check that the column-to-property mapping is correct, and make sure you haven’t mistyped the table name (e.g., the plural Stations_LOCATED_IN_Area).

Then repeat to create the remaining relationship tables: Station_STARTS_AT_Trip for the relationship [:STARTS_AT] and Trip_ENDS_AT_Station for [:ENDS_AT].

Station_STARTS_AT_Trip, Station_LOCATED_IN_Area, Trip_ENDS_AT_Station

6. The Results in Query and Explore

Query showing the Trips, Stations, and Areas
Coordinate-based layout in Explore uses longitude and latitude to position the stations

Sample Cypher Query to Run After Ingestion

Busiest station for departing trips: Hyde Park Corner, Hyde Park, with 228 trips:

MATCH (s:Station)-[:STARTS_AT]->(t:Trip)
RETURN s.stationName AS station, count(t) AS departingTrips
ORDER BY departingTrips DESC
LIMIT 1

Summary

The Neo4j Connector for AWS Glue, with the built-in SQL to Cypher translator provided by the JDBC Driver, is a great way of transforming tabular data into a graph. AWS Glue’s powerful transformations enable you to clean and prepare the data into the required types, while the connector handles the SQL to Cypher translation that creates the data in graph form. The connector can be used to export graph data as well: just select the connector as a source, and you can export to other databases or to files like Parquet on S3.

