Turn Your CSVs Into Graphs Using LLMs

Developer Experience Engineer at Neo4j

How do LLMs fare when attempting to create graphs from flat CSV files?
A large part of my job is improving users’ experience with Neo4j. Often, getting data into Neo4j and modeling it efficiently is a key challenge for users, especially in the early days. Although the initial data model is important and needs contemplation, it can be easily refactored to improve performance as the data size or number of users grows.
So, as a challenge to myself, I thought I would see if an LLM could help with the initial data model. If nothing else, it would demonstrate how things are connected and provide the user with some quick results they can show others.
Intuitively, I know data modeling is an iterative process, and certain LLMs can be easily distracted by large amounts of data, so this presented a good opportunity to use LangGraph to work through the data in cycles.
Let’s dive into the prompts that made it happen.
Graph Modeling Fundamentals
The Graph Data Modeling Fundamentals course on GraphAcademy guides you through the basics of modeling data in a graph, but as a first pass, I use the following rules of thumb:
- Nouns become labels — They describe the thing that the node represents.
- Verbs become relationship types — They describe how things are connected.
- Everything else becomes properties (particularly adjectives and adverbs) — you have a name and may drive a gray car.
Verbs can also be nodes; you may be happy to know that a person has ordered a product, but that basic model doesn’t allow you to know where and when the product was ordered. In this case, order becomes a new node in the model.
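To make the rules of thumb concrete, here is a minimal Cypher pattern sketch. The labels and properties are purely illustrative and are not part of the dataset used later in this post:

```cypher
// nouns become labels, verbs become relationship types, the rest become properties
(:Person {name: "Alice"})-[:DRIVES]->(:Car {color: "gray"})

// promoting the verb "ordered" to an Order node lets it carry the where and when
(:Person {name: "Alice"})-[:PLACED]->(:Order {placedAt: datetime(), city: "London"})
    -[:CONTAINS]->(:Product {name: "Coffee"})
```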
I’m sure this could be distilled into a prompt to create a zero-shot approach to graph data modeling.
An Iterative Approach
I attempted this briefly a few months ago and found that the model I was using became easily distracted when dealing with larger schemas, and the prompts quite quickly reached the LLM’s token limits.
I thought I’d try an iterative approach this time, taking the keys one at a time. This should help avoid distraction because the LLM only needs to consider one item at a time.
The final approach used the following steps (a condensed sketch of the whole loop follows the list):
- Load the CSV file into a Pandas dataframe.
- Analyze each column in the CSV and append it to a data model loosely based on JSON Schema.
- Identify and add missing unique IDs for each entity.
- Review the data model for accuracy.
- Generate Cypher statements to import the nodes and relationships.
- Generate the unique constraints that underpin the import statements.
- Create the constraints and run the import.
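In outline, the whole loop looks something like the sketch below. The function names are placeholders for the chains built in the rest of this post, so treat it as pseudocode rather than working code:

```python
import pandas as pd

def csv_to_graph(csv_file: str):
    df = pd.read_csv(csv_file)                              # 1. load the CSV
    model = []
    for column in df.columns:                               # 2. grow the model one column at a time
        model = propose_model_update(model, column, df[column])
    model = add_unique_identifiers(model)                   # 3. add missing unique IDs
    model = review_model(model)                             # 4. review the model for accuracy
    import_statements = generate_import_cypher(model)       # 5. Cypher import statements
    constraints = generate_constraints(import_statements)   # 6. unique constraints
    run_import(constraints, import_statements, df)          # 7. create constraints and import
```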
The Data
I took a quick look on Kaggle for an interesting dataset. The dataset that stood out was Spotify Most Streamed Songs.
```python
import pandas as pd

csv_file = '/Users/adam/projects/datamodeller/data/spotify/spotify-most-streamed-songs.csv'

df = pd.read_csv(csv_file)
df.head()
```

| | track_name | artist(s)_name | artist_count | released_year | released_month | released_day | in_spotify_playlists | in_spotify_charts | streams | in_apple_playlists | … | key | mode | danceability_% | valence_% | energy_% | acousticness_% | instrumentalness_% | liveness_% | speechiness_% | cover_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Seven (feat. Latto) (Explicit Ver.) | Latto, Jung Kook | 2 | 2023 | 7 | 14 | 553 | 147 | 141381703 | 43 | … | B | Major | 80 | 89 | 83 | 31 | 0 | 8 | 4 | Not Found |
| 1 | LALA | Myke Towers | 1 | 2023 | 3 | 23 | 1474 | 48 | 133716286 | 48 | … | C# | Major | 71 | 61 | 74 | 7 | 0 | 10 | 4 | https://i.scdn.co/image/ab67616d0000b2730656d5… |
| 2 | vampire | Olivia Rodrigo | 1 | 2023 | 6 | 30 | 1397 | 113 | 140003974 | 94 | … | F | Major | 51 | 32 | 53 | 17 | 0 | 31 | 6 | https://i.scdn.co/image/ab67616d0000b273e85259… |
| 3 | Cruel Summer | Taylor Swift | 1 | 2019 | 8 | 23 | 7858 | 100 | 800840817 | 116 | … | A | Major | 55 | 58 | 72 | 11 | 0 | 11 | 15 | https://i.scdn.co/image/ab67616d0000b273e787cf… |
| 4 | WHERE SHE GOES | Bad Bunny | 1 | 2023 | 5 | 18 | 3133 | 50 | 303236322 | 84 | … | A | Minor | 65 | 23 | 80 | 14 | 63 | 11 | 6 | https://i.scdn.co/image/ab67616d0000b273ab5c9c… |
5 rows × 25 columns
It’s relatively simple, but I can see straight away that there should be relationships between tracks and artists.
There are also data cleanliness challenges to overcome, in terms of column names and artists being comma-separated values within the artist(s)_name column.
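As a quick illustration of the second issue, the comma-separated artist names have to be split and trimmed before they can become individual Artist nodes. A rough check against the dataframe loaded above might look like this (a sketch, not from the original post):

```python
# tracks with more than one artist keep every name in a single comma-separated string
print(df[df["artist_count"] > 1]["artist(s)_name"].head(3).tolist())
# e.g. ['Latto, Jung Kook', ...]

# splitting on the comma and trimming whitespace yields one candidate Artist per name
unique_artists = (
    df["artist(s)_name"]
    .str.split(",")
    .explode()
    .str.strip()
    .drop_duplicates()
)
print(f"{len(unique_artists)} unique artist names")
```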
Choosing an LLM
I really wanted to use a local LLM for this, but I found out early on that Llama 3 wouldn’t cut it. If in doubt, fall back on OpenAI:
```python
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from typing import List
from langchain_core.output_parsers import JsonOutputParser

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")
```
Creating a Data Model
I used an abridged set of modeling instructions to create the data modeling prompt. I had to engineer the prompt a few times to get a consistent output.
The zero-shot example worked relatively well, but I found that the output was inconsistent. Defining a structured output to hold the JSON output really helped:
```python
class JSONSchemaSpecification(BaseModel):
    notes: str = Field(description="Any notes or comments about the schema")
    jsonschema: str = Field(description="A JSON array of JSON schema specifications that describe the entities in the data model")
```
Few-Shot Example Output
The JSON itself was also inconsistent, so I ended up defining a schema based on the movie recommendations dataset.
Example output:
```python
example_output = [
    dict(
        title="Person",
        type="object",
        description="Node",
        properties=[
            dict(name="name", column_name="person_name", type="string", description="The name of the person", examples=["Tom Hanks"]),
            dict(name="date_of_birth", column_name="person_dob", type="date", description="The date of birth for the person", examples=["1987-06-05"]),
            dict(name="id", column_name="person_name, date_of_birth", type="string", description="The ID is a combination of name and date of birth to ensure uniqueness", examples=["tom-hanks-1987-06-05"]),
        ],
    ),
    dict(
        title="Director",
        type="object",
        description="Node",
        properties=[
            dict(name="name", column_name="director_names", type="string", description="The name of the directors. Split values in column by a comma", examples=["Francis Ford Coppola"]),
        ],
    ),
    dict(
        title="Movie",
        type="object",
        description="Node",
        properties=[
            dict(name="title", column_name="title", type="string", description="The title of the movie", examples=["Toy Story"]),
            dict(name="released", column_name="released", type="integer", description="The year the movie was released", examples=["1990"]),
        ],
    ),
    dict(
        title="ACTED_IN",
        type="object",
        description="Relationship",
        properties=[
            dict(name="_from", column_name="od", type="string", description="Person found by the `id`. The ID is a combination of name and date of birth to ensure uniqueness", examples=["Person"]),
            dict(name="_to", column_name="title", type="string", description="The movie title", examples=["Movie"]),
            dict(name="roles", type="string", column_name="person_roles", description="The roles the person played in the movie", examples=["Woody"]),
        ],
    ),
    dict(
        title="DIRECTED",
        type="object",
        description="Relationship",
        properties=[
            dict(name="_from", type="string", column_name="director_names", description="Director names are comma separated", examples=["Director"]),
            dict(name="_to", type="string", column_name="title", description="The label of the node this relationship ends at", examples=["Movie"]),
        ],
    ),
]
```
I had to deviate from strict JSON Schema and add the column_name field to the output to help the LLM generate the import script. Providing examples of descriptions also helped in this regard; otherwise, the properties used in the MATCH clause were inconsistent.
The Chain
Here is the final prompt:
````python
model_prompt = PromptTemplate.from_template("""
You are an expert Graph Database administrator.
Your task is to design a data model based on the information provided from an existing data source.

You must decide where the following column fits in with the existing data model. Consider:
* Does the column represent an entity, for example a Person, Place, or Movie? If so, this should be a node in its own right.
* Does the column represent a relationship between two entities? If so, this should be a relationship between two nodes.
* Does the column represent an attribute of an entity or relationship? If so, this should be a property of a node or relationship.
* Does the column represent a shared attribute that could be interesting to query through to find similar nodes, for example a Genre? If so, this should be a node in its own right.


## Instructions for Nodes

* Node labels are generally nouns, for example Person, Place, or Movie
* Node titles should be in UpperCamelCase

## Instructions for Relationships

* Relationships are generally verbs, for example ACTED_IN, DIRECTED, or PURCHASED
* Examples of good relationships are (:Person)-[:ACTED_IN]->(:Movie) or (:Person)-[:PURCHASED]->(:Product)
* Relationships should be in UPPER_SNAKE_CASE
* Provide any specific instructions for the field in the description. For example, does the field contain a list of comma separated values or a single value?

## Instructions for Properties

* Property names should be in lowerCamelCase
* Prefer the shorter name where possible, for example "person_id" and "personId" should simply be "id"
* If you are changing the property name from the original field name, mention the column name in the description
* Do not include examples for integer or date fields
* Always include instructions on data preparation for the field. Does it need to be cast as a string or split into multiple fields on a delimiting value?
* Property keys should be letters only, no numbers or special characters.

## Important!

Consider the examples provided. Does any data preparation need to be done to ensure the data is in the correct format?
You must include any information about data preparation in the description.

## Example Output

Here is an example of a good output:
```
{example_output}
```

## New Data:

Key: {key}
Data Type: {type}
Example Values: {examples}

## Existing Data Model

Here is the existing data model:
```
{existing_model}
```

## Keep Existing Data Model

Apply your changes to the existing data model but never remove any existing definitions.
""", partial_variables=dict(example_output=dumps(example_output)))

model_chain = model_prompt | llm.with_structured_output(JSONSchemaSpecification)
````
Executing the Chain
To build the model up iteratively, I looped over the keys in the dataframe and passed each key, its datatype, and the first five unique values to the prompt:
```python
from json_repair import dumps, loads

existing_model = {}

for i, key in enumerate(df):
    print("\n", i, key)
    print("----------------")
    try:
        # try_chain is a small helper (not shown in the post) that wraps the chain
        # invocation with retries/error handling
        res = try_chain(model_chain, dict(
            existing_model=dumps(existing_model),
            key=key,
            type=df[key].dtype,
            examples=dumps(df[key].unique()[:5].tolist())
        ))
        print(res.notes)
        existing_model = loads(res.jsonschema)

        print([n['title'] for n in existing_model])
    except Exception as e:
        print(e)
        pass

existing_model
```
Console output:
```
 0 track_name
----------------
Adding 'track_name' to an existing data model. This represents a music track entity.
['Track']

 1 artist(s)_name
----------------
Adding a new column 'artist(s)_name' to the existing data model. This column represents multiple artists associated with tracks and should be modeled as a new node 'Artist' and a relationship 'PERFORMED_BY' from 'Track' to 'Artist'.
['Track', 'Artist', 'PERFORMED_BY']

 2 artist_count
----------------
Added artist_count as a property of Track node. This property indicates the number of artists performing in the track.
['Track', 'Artist', 'PERFORMED_BY']

 3 released_year
----------------
Add the released_year column to the existing data model as a property of the Track node.
['Track', 'Artist', 'PERFORMED_BY']

 4 released_month
----------------
Adding the 'released_month' column to the existing data model, considering it as an attribute of the Track node.
['Track', 'Artist', 'PERFORMED_BY']

 5 released_day
----------------
Added a new property 'released_day' to the 'Track' node to capture the day of the month a track was released.
['Track', 'Artist', 'PERFORMED_BY']

 6 in_spotify_playlists
----------------
Adding the new column 'in_spotify_playlists' to the existing data model as a property of the 'Track' node.
['Track', 'Artist', 'PERFORMED_BY']

 7 in_spotify_charts
----------------
Adding the 'in_spotify_charts' column to the existing data model as a property of the Track node.
['Track', 'Artist', 'PERFORMED_BY']

 8 streams
----------------
Adding a new column 'streams' to the existing data model, representing the number of streams for a track.
['Track', 'Artist', 'PERFORMED_BY']

 9 in_apple_playlists
----------------
Adding new column 'in_apple_playlists' to the existing data model
['Track', 'Artist', 'PERFORMED_BY']

 10 in_apple_charts
----------------
Adding 'in_apple_charts' as a property to the 'Track' node, representing the number of times the track appeared in the Apple charts.
['Track', 'Artist', 'PERFORMED_BY']

 11 in_deezer_playlists
----------------
Add 'in_deezer_playlists' to the existing data model for a music track database.
['Track', 'Artist', 'PERFORMED_BY']

 12 in_deezer_charts
----------------
Adding a new property 'inDeezerCharts' to the existing 'Track' node to represent the number of times the track appeared in Deezer charts.
['Track', 'Artist', 'PERFORMED_BY']

 13 in_shazam_charts
----------------
Adding new data 'in_shazam_charts' to the existing data model. This appears to be an attribute of the 'Track' node, indicating the number of times a track appeared in the Shazam charts.
['Track', 'Artist', 'PERFORMED_BY']

 14 bpm
----------------
Added bpm column as a property to the Track node as it represents a characteristic of the track.
['Track', 'Artist', 'PERFORMED_BY']

 15 key
----------------
Adding the 'key' column to the existing data model. The 'key' represents the musical key of a track, which is a shared attribute that can be interesting to query through to find similar tracks.
['Track', 'Artist', 'PERFORMED_BY']

 16 mode
----------------
Adding 'mode' to the existing data model. It represents a musical characteristic of a track, which is best captured as an attribute of the Track node.
['Track', 'Artist', 'PERFORMED_BY']

 17 danceability_%
----------------
Added 'danceability_%' to the existing data model as a property of the Track node. The field represents the danceability percentage of the track.
['Track', 'Artist', 'PERFORMED_BY']

 18 valence_%
----------------
Adding the valence percentage column to the existing data model as a property of the Track node.
['Track', 'Artist', 'PERFORMED_BY']

 19 energy_%
----------------
Integration of the new column 'energy_%' into the existing data model. This column represents an attribute of the Track entity and should be added as a property of the Track node.
['Track', 'Artist', 'PERFORMED_BY']

 20 acousticness_%
----------------
Adding acousticness_% to the existing data model as a property of the Track node.
['Track', 'Artist', 'PERFORMED_BY']

 21 instrumentalness_%
----------------
Adding the new column 'instrumentalness_%' to the existing Track node as it represents an attribute of the Track entity.
['Track', 'Artist', 'PERFORMED_BY']

 22 liveness_%
----------------
Adding the new column 'liveness_%' to the existing data model as an attribute of the Track node
['Track', 'Artist', 'PERFORMED_BY']

 23 speechiness_%
----------------
Adding the new column 'speechiness_%' to the existing data model as a property of the 'Track' node.
['Track', 'Artist', 'PERFORMED_BY']

 24 cover_url
----------------
Adding a new property 'cover_url' to the existing 'Track' node. This property represents the URL of the track's cover image.
['Track', 'Artist', 'PERFORMED_BY']
```
After a few tweaks to the prompt to handle use cases, I ended up with a model I was quite happy with. The LLM had managed to determine that the dataset consisted of Track, Artist, and a PERFORMED_BY relationship to connect the two:
```json
[
  {
    "title": "Track",
    "type": "object",
    "description": "Node",
    "properties": [
      { "name": "name", "column_name": "track_name", "type": "string", "description": "The name of the track", "examples": ["Seven (feat. Latto) (Explicit Ver.)", "LALA", "vampire", "Cruel Summer", "WHERE SHE GOES"] },
      { "name": "artist_count", "column_name": "artist_count", "type": "integer", "description": "The number of artists performing in the track", "examples": [2, 1, 3, 8, 4] },
      { "name": "released_year", "column_name": "released_year", "type": "integer", "description": "The year the track was released", "examples": [2023, 2019, 2022, 2013, 2014] },
      { "name": "released_month", "column_name": "released_month", "type": "integer", "description": "The month the track was released", "examples": [7, 3, 6, 8, 5] },
      { "name": "released_day", "column_name": "released_day", "type": "integer", "description": "The day of the month the track was released", "examples": [14, 23, 30, 18, 1] },
      { "name": "inSpotifyPlaylists", "column_name": "in_spotify_playlists", "type": "integer", "description": "The number of Spotify playlists the track is in. Cast the value as an integer.", "examples": [553, 1474, 1397, 7858, 3133] },
      { "name": "inSpotifyCharts", "column_name": "in_spotify_charts", "type": "integer", "description": "The number of times the track appeared in the Spotify charts. Cast the value as an integer.", "examples": [147, 48, 113, 100, 50] },
      { "name": "streams", "column_name": "streams", "type": "array", "description": "The list of stream IDs for the track. Maintain the array format.", "examples": ["141381703", "133716286", "140003974", "800840817", "303236322"] },
      { "name": "inApplePlaylists", "column_name": "in_apple_playlists", "type": "integer", "description": "The number of Apple playlists the track is in. Cast the value as an integer.", "examples": [43, 48, 94, 116, 84] },
      { "name": "inAppleCharts", "column_name": "in_apple_charts", "type": "integer", "description": "The number of times the track appeared in the Apple charts. Cast the value as an integer.", "examples": [263, 126, 207, 133, 213] },
      { "name": "inDeezerPlaylists", "column_name": "in_deezer_playlists", "type": "array", "description": "The list of Deezer playlist IDs the track is in. Maintain the array format.", "examples": ["45", "58", "91", "125", "87"] },
      { "name": "inDeezerCharts", "column_name": "in_deezer_charts", "type": "integer", "description": "The number of times the track appeared in the Deezer charts. Cast the value as an integer.", "examples": [10, 14, 12, 15, 17] },
      { "name": "inShazamCharts", "column_name": "in_shazam_charts", "type": "array", "description": "The list of Shazam chart IDs the track is in. Maintain the array format.", "examples": ["826", "382", "949", "548", "425"] },
      { "name": "bpm", "column_name": "bpm", "type": "integer", "description": "The beats per minute of the track. Cast the value as an integer.", "examples": [125, 92, 138, 170, 144] },
      { "name": "key", "column_name": "key", "type": "string", "description": "The musical key of the track. Cast the value as a string.", "examples": ["B", "C#", "F", "A", "D"] },
      { "name": "mode", "column_name": "mode", "type": "string", "description": "The mode of the track (e.g., Major, Minor). Cast the value as a string.", "examples": ["Major", "Minor"] },
      { "name": "danceability", "column_name": "danceability_%", "type": "integer", "description": "The danceability percentage of the track. Cast the value as an integer.", "examples": [80, 71, 51, 55, 65] },
      { "name": "valence", "column_name": "valence_%", "type": "integer", "description": "The valence percentage of the track. Cast the value as an integer.", "examples": [89, 61, 32, 58, 23] },
      { "name": "energy", "column_name": "energy_%", "type": "integer", "description": "The energy percentage of the track. Cast the value as an integer.", "examples": [83, 74, 53, 72, 80] },
      { "name": "acousticness", "column_name": "acousticness_%", "type": "integer", "description": "The acousticness percentage of the track. Cast the value as an integer.", "examples": [31, 7, 17, 11, 14] },
      { "name": "instrumentalness", "column_name": "instrumentalness_%", "type": "integer", "description": "The instrumentalness percentage of the track. Cast the value as an integer.", "examples": [0, 63, 17, 2, 19] },
      { "name": "liveness", "column_name": "liveness_%", "type": "integer", "description": "The liveness percentage of the track. Cast the value as an integer.", "examples": [8, 10, 31, 11, 28] },
      { "name": "speechiness", "column_name": "speechiness_%", "type": "integer", "description": "The speechiness percentage of the track. Cast the value as an integer.", "examples": [4, 6, 15, 24, 3] },
      { "name": "coverUrl", "column_name": "cover_url", "type": "string", "description": "The URL of the track's cover image. If the value is 'Not Found', it should be cast as an empty string.", "examples": ["https://i.scdn.co/image/ab67616d0000b2730656d5ce813ca3cc4b677e05", "https://i.scdn.co/image/ab67616d0000b273e85259a1cae29a8d91f2093d"] }
    ]
  },
  {
    "title": "Artist",
    "type": "object",
    "description": "Node",
    "properties": [
      { "name": "name", "column_name": "artist(s)_name", "type": "string", "description": "The name of the artist. Split values in column by a comma", "examples": ["Latto", "Jung Kook", "Myke Towers", "Olivia Rodrigo", "Taylor Swift", "Bad Bunny"] }
    ]
  },
  {
    "title": "PERFORMED_BY",
    "type": "object",
    "description": "Relationship",
    "properties": [
      { "name": "_from", "type": "string", "description": "The label of the node this relationship starts at", "examples": ["Track"] },
      { "name": "_to", "type": "string", "description": "The label of the node this relationship ends at", "examples": ["Artist"] }
    ]
  }
]
```
Adding Unique Identifiers
I noticed that the schema didn’t contain any unique identifiers, which could become a problem when it comes to importing relationships. It stands to reason that different artists may release songs with the same name, and two artists may share the same name.
For this reason, it was important to create an identifier for Tracks so they could be differentiated within a larger dataset:
````python
# Add primary key/unique identifiers
uid_prompt = PromptTemplate.from_template("""
You are a graph database expert reviewing a single entity from a data model generated by a colleague.
You want to ensure that all of the nodes imported into the database are unique.

## Example

A schema contains Actors with a number of properties including name and date of birth.

Two actors may have the same name, so add a new compound property combining the name and date of birth.
If combining values, include the instruction to convert the value to slug case. Call the new property 'id'.

If you have identified a new property, add it to the list of properties leaving the rest intact.
Include in the description the fields that are to be concatenated.


## Example Output

Here is an example of a good output:
```
{example_output}
```

## Current Entity Schema
```
{entity}
```
""", partial_variables=dict(example_output=dumps(example_output)))

uid_chain = uid_prompt | llm.with_structured_output(JSONSchemaSpecification)
````
This step is only really required for nodes, so I extracted the nodes from the schema, ran the chain for each and then combined the relationships with the updated definitions:
```python
# extract nodes and relationships
nodes = [n for n in existing_model if "node" in n["description"].lower()]
rels = [n for n in existing_model if "node" not in n["description"].lower()]

# generate a unique id for nodes
with_uids = []

for entity in nodes:
    res = uid_chain.invoke(dict(entity=dumps(entity)))
    json = loads(res.jsonschema)

    with_uids = with_uids + json if type(json) == list else with_uids + [json]

# combine nodes and relationships
with_uids = with_uids + rels
```
Data Model Review
As a sanity check, it is worth reviewing the model for optimizations. The model_prompt did a good job of identifying the nouns and verbs, but things could go astray with a more complex model.
One iteration treated the `*_playlists` and `*_charts` columns as IDs and attempted to create Stream nodes and IN_PLAYLIST relationships. I assume this was because counts over 1,000 are formatted with a comma (e.g., 1,001), presumably making them look like comma-separated lists of IDs.
Nice idea, but maybe a little too clever. It does, however, show the importance of having a human in the loop who understands the data structure.
````python
# Review the data model
review_prompt = PromptTemplate.from_template("""
You are a graph database expert reviewing a data model generated by a colleague.

Your task is to review the data model and ensure that it is fit for purpose.
Check for:


## Check for nested objects

Remember that Neo4j cannot store arrays of objects or nested objects.
These must be converted into separate nodes with relationships between them.
You must include the new node and a reference to the relationship to the output schema.

## Check for Entities in properties

If there is a property that represents an array of IDs, a new node should be created for that entity.
You must include the new node and a reference to the relationship to the output schema.

# Keep Instructions

Ensure that the instructions for the nodes, relationships, and properties are clear and concise.
You may improve them but the detail must not be removed in any circumstances.


## Current Entity Schema

```
{entity}
```
""")

review_chain = review_prompt | llm.with_structured_output(JSONSchemaSpecification)

review_nodes = [n for n in with_uids if "node" in n["description"].lower()]
review_rels = [n for n in with_uids if "node" not in n["description"].lower()]

reviewed = []

for entity in review_nodes:
    res = review_chain.invoke(dict(entity=dumps(entity)))
    json = loads(res.jsonschema)

    reviewed = reviewed + json

# add relationships back in
reviewed = reviewed + review_rels

len(reviewed)

# the review was a little too eager for this dataset (see above), so keep the
# pre-review definitions instead
reviewed = with_uids
````
In a real-world scenario, I’d want to run this a few times to iteratively improve the data model. I would set a maximum number of iterations, then repeat until that limit is reached or the data model object no longer changes.
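A minimal sketch of what that loop could look like, reusing the review_chain above and stopping after a fixed number of passes or once the model stops changing (the pass limit is an assumption, not something from the original run):

```python
MAX_REVIEW_PASSES = 3  # assumed upper bound

current = with_uids

for _ in range(MAX_REVIEW_PASSES):
    next_model = []

    # review each node definition individually
    for entity in [n for n in current if "node" in n["description"].lower()]:
        res = review_chain.invoke(dict(entity=dumps(entity)))
        output = loads(res.jsonschema)
        next_model += output if isinstance(output, list) else [output]

    # carry the relationships over untouched
    next_model += [n for n in current if "node" not in n["description"].lower()]

    # stop once a pass produces no changes
    if dumps(next_model) == dumps(current):
        break

    current = next_model
```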
Generate Import Statements
By this point, the schema should be robust enough and include as much information as possible to allow an LLM to generate a set of import scripts.
In line with Neo4j data-importing recommendations, the file should be processed in several passes, each pass importing a single node label or relationship type to avoid eager operations and locking.
````python
import_prompt = PromptTemplate.from_template("""
Based on the data model, write a Cypher statement to import the following data from a CSV file into Neo4j.

Do not use LOAD CSV as this data will be imported using the Neo4j Python Driver, use UNWIND on the $rows parameter instead.
You are writing a multi-pass import process, so concentrate on the entity mentioned.
When importing data, you must use the following guidelines:
* Follow the instructions in the description when identifying primary keys.
* Use the instructions in the description to determine the format of properties when finding a match.
* When combining fields into an ID, use the apoc.text.slug function to convert any text to slug case and toLower to convert the string to lowercase - apoc.text.slug(toLower(row.`name`))
* If you split a property, convert it to a string and use the trim function to remove any whitespace - trim(toString(row.`name`))
* When combining properties, wrap each property in the coalesce function so the property is not null if one of the values is not set - coalesce(row.`id`, '') + '--' + coalesce(row.`title`)
* Use the `column_name` field to map the CSV column to the property in the data model.
* Wrap all column names from the CSV in backticks - for example row.`column_name`.
* When you merge nodes, merge on the unique identifier and nothing else. All other properties should be set using `SET`.
* Do not use apoc.periodic.iterate, the files will be batched in the application.

Data Model:
```
{data_model}
```

Current Entity:
```
{entity}
```
""")
````
This chain requires a different output object from the previous steps. In this case, the cypher member is the most important, but I also wanted to include a chain_of_thought key to encourage chain-of-thought reasoning:
```python
from typing import Optional

class CypherOutputSpecification(BaseModel):
    chain_of_thought: str = Field(description="Any reasoning used to write the Cypher statement")
    cypher: str = Field(description="The Cypher statement to import the data")
    notes: Optional[str] = Field(description="Any notes or closing remarks about the Cypher statement")

import_chain = import_prompt | llm.with_structured_output(CypherOutputSpecification)
```
The same process then applies to iterate over each of the reviewed definitions and generate the Cypher:
```python
import_cypher = []

for n in reviewed:
    print('\n\n------', n['title'])
    res = import_chain.invoke(dict(
        data_model=dumps(reviewed),
        entity=n
    ))
    import_cypher.append((
        res.cypher
    ))
    print(res.cypher)
```
Console output:
```
------ Track
UNWIND $rows AS row
MERGE (t:Track {id: apoc.text.slug(toLower(coalesce(row.`track_name`, '') + '-' + coalesce(row.`released_year`, '')))})
SET t.name = trim(toString(row.`track_name`)),
    t.artist_count = toInteger(row.`artist_count`),
    t.released_year = toInteger(row.`released_year`),
    t.released_month = toInteger(row.`released_month`),
    t.released_day = toInteger(row.`released_day`),
    t.inSpotifyPlaylists = toInteger(row.`in_spotify_playlists`),
    t.inSpotifyCharts = toInteger(row.`in_spotify_charts`),
    t.streams = row.`streams`,
    t.inApplePlaylists = toInteger(row.`in_apple_playlists`),
    t.inAppleCharts = toInteger(row.`in_apple_charts`),
    t.inDeezerPlaylists = row.`in_deezer_playlists`,
    t.inDeezerCharts = toInteger(row.`in_deezer_charts`),
    t.inShazamCharts = row.`in_shazam_charts`,
    t.bpm = toInteger(row.`bpm`),
    t.key = trim(toString(row.`key`)),
    t.mode = trim(toString(row.`mode`)),
    t.danceability = toInteger(row.`danceability_%`),
    t.valence = toInteger(row.`valence_%`),
    t.energy = toInteger(row.`energy_%`),
    t.acousticness = toInteger(row.`acousticness_%`),
    t.instrumentalness = toInteger(row.`instrumentalness_%`),
    t.liveness = toInteger(row.`liveness_%`),
    t.speechiness = toInteger(row.`speechiness_%`),
    t.coverUrl = CASE row.`cover_url` WHEN 'Not Found' THEN '' ELSE trim(toString(row.`cover_url`)) END


------ Artist
UNWIND $rows AS row
WITH row, split(row.`artist(s)_name`, ',') AS artistNames
UNWIND artistNames AS artistName
MERGE (a:Artist {id: apoc.text.slug(toLower(trim(artistName)))})
SET a.name = trim(artistName)


------ PERFORMED_BY
UNWIND $rows AS row
UNWIND split(row.`artist(s)_name`, ',') AS artist_name
MERGE (t:Track {id: apoc.text.slug(toLower(row.`track_name`)) + '-' + trim(toString(row.`released_year`))})
MERGE (a:Artist {id: apoc.text.slug(toLower(trim(artist_name)))})
MERGE (t)-[:PERFORMED_BY]->(a)
```
This prompt took some engineering to achieve consistent results:
- Sometimes the Cypher would include a MERGE statement with multiple fields defined, which is suboptimal at best. If any of the columns are null, the entire import will fail.
- At times, the result would include apoc.periodic.iterate, which is no longer required, and I wanted code I could execute with the Python driver.
- I had to reiterate that the specified column name should be used when creating relationships.
- The LLM just wouldn’t use the unique identifier on the nodes at each end of the relationship, so it took a few attempts to get it to follow the instructions in the description. There was some back and forth between this prompt and the model_prompt.
- Backticks were needed for the column names that include special characters (e.g., `energy_%`).
It would also be beneficial to split this into two prompts — one for nodes and one for relationships. But that is a task for another day.
Create the Unique Constraints
Next, the import scripts can be used as a basis to create unique constraints in the database:
````python
constraint_prompt = PromptTemplate.from_template("""
You are an expert graph database administrator.
Use the following Cypher statement to write a Cypher statement to create unique constraints on any properties used in a MERGE statement.

The correct syntax for a unique constraint is:
CREATE CONSTRAINT movie_title_id IF NOT EXISTS FOR (m:Movie) REQUIRE m.title IS UNIQUE;

Cypher:
```
{cypher}
```
""")

constraint_chain = constraint_prompt | llm.with_structured_output(CypherOutputSpecification)

constraint_queries = []

for statement in import_cypher:
    res = constraint_chain.invoke(dict(cypher=statement))

    statements = res.cypher.split(";")

    for cypher in statements:
        constraint_queries.append(cypher)
````
Console output:
```
CREATE CONSTRAINT track_id_unique IF NOT EXISTS FOR (t:Track) REQUIRE t.id IS UNIQUE

CREATE CONSTRAINT stream_id IF NOT EXISTS FOR (s:Stream) REQUIRE s.id IS UNIQUE

CREATE CONSTRAINT playlist_id IF NOT EXISTS FOR (p:Playlist) REQUIRE p.id IS UNIQUE

CREATE CONSTRAINT chart_id IF NOT EXISTS FOR (c:Chart) REQUIRE c.id IS UNIQUE

CREATE CONSTRAINT track_id_unique IF NOT EXISTS FOR (t:Track) REQUIRE t.id IS UNIQUE

CREATE CONSTRAINT stream_id_unique IF NOT EXISTS FOR (s:Stream) REQUIRE s.id IS UNIQUE

CREATE CONSTRAINT track_id_unique IF NOT EXISTS FOR (t:Track) REQUIRE t.id IS UNIQUE

CREATE CONSTRAINT playlist_id_unique IF NOT EXISTS FOR (p:Playlist) REQUIRE p.id IS UNIQUE

CREATE CONSTRAINT track_id_unique IF NOT EXISTS FOR (track:Track) REQUIRE track.id IS UNIQUE

CREATE CONSTRAINT chart_id_unique IF NOT EXISTS FOR (chart:Chart) REQUIRE chart.id IS UNIQUE
```
Sometimes this prompt would return statements for both indexes and constraints, hence the split on the semicolon.
Run the Import
With everything in place, it was time to execute the Cypher statements:
```python
from os import getenv
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    getenv("NEO4J_URI"),
    auth=(
        getenv("NEO4J_USERNAME"),
        getenv("NEO4J_PASSWORD")
    )
)

with driver.session() as session:
    # truncate the db
    session.run("MATCH (n) DETACH DELETE n")

    # create constraints
    for q in constraint_queries:
        if q.strip() != "":
            session.run(q)

    # import the data
    for q in import_cypher:
        if q.strip() != "":
            res = session.run(q, rows=rows).consume()
            print(q)
            print(res.counters)
```
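One thing the snippet above glosses over is where rows comes from; it isn't defined in the code shown here. Presumably it is the dataframe converted to a list of dictionaries and, for larger files, sent in batches. A minimal sketch, assuming the df loaded earlier:

```python
# assumed: one parameter dictionary per CSV row
rows = df.to_dict("records")

def batches(items, size=1000):
    """Yield successive chunks so each UNWIND works on a bounded number of rows."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# e.g. run each import statement once per batch:
# for chunk in batches(rows):
#     session.run(q, rows=chunk)
```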
QA on the Dataset
This post wouldn’t be complete without some QA on the dataset using the GraphCypherQAChain:
```python
from langchain.chains import GraphCypherQAChain
from langchain_community.graphs import Neo4jGraph

graph = Neo4jGraph(
    url=getenv("NEO4J_URI"),
    username=getenv("NEO4J_USERNAME"),
    password=getenv("NEO4J_PASSWORD"),
    enhanced_schema=True
)

qa = GraphCypherQAChain.from_llm(
    llm,
    graph=graph,
    allow_dangerous_requests=True,
    verbose=True
)
```
Most Popular Artists
Who are the most popular artists in the database?
```
qa.invoke({"query": "Who are the most popular artists?"})

> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (:Track)-[:PERFORMED_BY]->(a:Artist)
RETURN a.name, COUNT(*) AS popularity
ORDER BY popularity DESC
LIMIT 10

Full Context:
[{'a.name': 'Bad Bunny', 'popularity': 40}, {'a.name': 'Taylor Swift', 'popularity': 38}, {'a.name': 'The Weeknd', 'popularity': 36}, {'a.name': 'SZA', 'popularity': 23}, {'a.name': 'Kendrick Lamar', 'popularity': 23}, {'a.name': 'Feid', 'popularity': 21}, {'a.name': 'Drake', 'popularity': 19}, {'a.name': 'Harry Styles', 'popularity': 17}, {'a.name': 'Peso Pluma', 'popularity': 16}, {'a.name': '21 Savage', 'popularity': 14}]

> Finished chain.

{
  "query": "Who are the most popular artists?",
  "result": "Bad Bunny, Taylor Swift, and The Weeknd are the most popular artists."
}
```
The LLM seemed to judge popularity by the number of tracks an artist has appeared on rather than their overall number of streams.
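If popularity should instead mean total streams, a query along these lines would get closer. This is a sketch rather than output from the chain; it assumes the streams property holds a numeric string that toInteger can cast, and rows where it can't are skipped:

```cypher
MATCH (t:Track)-[:PERFORMED_BY]->(a:Artist)
WHERE toInteger(t.streams) IS NOT NULL
RETURN a.name AS artist, sum(toInteger(t.streams)) AS totalStreams
ORDER BY totalStreams DESC
LIMIT 10
```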
Beats Per Minute
Which track has the highest BPM?
```
qa.invoke({"query": "Which track has the highest BPM?"})

> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (t:Track)
RETURN t
ORDER BY t.bpm DESC
LIMIT 1

Full Context:
[{'t': {'id': 'seven-feat-latto-explicit-ver--2023'}}]

> Finished chain.

{
  "query": "Which track has the highest BPM?",
  "result": "I don't know the answer."
}
```
Improving the Cypher Generation Prompt
In this case, the Cypher looks fine and the correct result was included in the prompt, but gpt-4o couldn’t interpret the answer. It looks like the CYPHER_GENERATION_PROMPT passed to the GraphCypherQAChain could do with additional instructions to make the column names more verbose.
Always use verbose column names in the Cypher statement using the label and property names. For example, use ‘person_name’ instead of ‘name’.
GraphCypherQAChain with custom prompt:
```python
CYPHER_GENERATION_TEMPLATE = """Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.

Always use verbose column names in the Cypher statement using the label and property names. For example, use 'person_name' instead of 'name'.
Include data from the immediate network around the node in the result to provide extra context. For example, include the Movie release year, a list of actors and their roles, or the director of a movie.
When ordering by a property, add an `IS NOT NULL` check to ensure that only nodes with that property are returned.

Examples: Here are a few examples of generated Cypher statements for particular questions:

# How many people acted in Top Gun?
MATCH (m:Movie {{name:"Top Gun"}})
RETURN COUNT {{ (m)<-[:ACTED_IN]-() }} AS numberOfActors

The question is:
{question}"""

CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"], template=CYPHER_GENERATION_TEMPLATE
)

qa = GraphCypherQAChain.from_llm(
    llm,
    graph=graph,
    allow_dangerous_requests=True,
    verbose=True,
    cypher_prompt=CYPHER_GENERATION_PROMPT,
)
```
Tracks Performed By the Most Artists
Graphs excel at returning a count of the number of relationships by type and direction.
```
qa.invoke({"query": "Which tracks are performed by the most artists?"})

> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (t:Track)
WITH t, COUNT { (t)-[:PERFORMED_BY]->(:Artist) } as artist_count
WHERE artist_count IS NOT NULL
RETURN t.id AS track_id, t.name AS track_name, artist_count
ORDER BY artist_count DESC

Full Context:
[{'track_id': 'los-del-espacio-2023', 'track_name': 'Los del Espacio', 'artist_count': 8}, {'track_id': 'se-le-ve-2021', 'track_name': 'Se Le Ve', 'artist_count': 8}, {'track_id': 'we-don-t-talk-about-bruno-2021', 'track_name': "We Don't Talk About Bruno", 'artist_count': 7}, {'track_id': 'cayï-ï-la-noche-feat-cruz-cafunï-ï-abhir-hathi-bejo-el-ima--2022', 'track_name': None, 'artist_count': 6}, {'track_id': 'jhoome-jo-pathaan-2022', 'track_name': 'Jhoome Jo Pathaan', 'artist_count': 6}, {'track_id': 'besharam-rang-from-pathaan--2022', 'track_name': None, 'artist_count': 6}, {'track_id': 'nobody-like-u-from-turning-red--2022', 'track_name': None, 'artist_count': 6}, {'track_id': 'ultra-solo-remix-2022', 'track_name': 'ULTRA SOLO REMIX', 'artist_count': 5}, {'track_id': 'angel-pt-1-feat-jimin-of-bts-jvke-muni-long--2023', 'track_name': None, 'artist_count': 5}, {'track_id': 'link-up-metro-boomin-don-toliver-wizkid-feat-beam-toian-spider-verse-remix-spider-man-across-the-spider-verse--2023', 'track_name': None, 'artist_count': 5}]

> Finished chain.

{
  "query": "Which tracks are performed by the most artists?",
  "result": "The tracks \"Los del Espacio\" and \"Se Le Ve\" are performed by the most artists, with each track having 8 artists."
}
```
Summary
The CSV analysis and modeling is the most time-intensive part; generating the data model could take more than five minutes.
The costs themselves were low. In eight hours of experimentation, I must have sent hundreds of requests, and I ended up spending a dollar or so.
There were a number of challenges to get to this point:
- The prompts took several iterations to get right. This problem could be overcome by fine-tuning the model or providing few-shot examples.
- JSON responses from GPT-4o can be inconsistent. I was recommended the json-repair library, which was better than trying to get the LLM to validate its own JSON output (a small example of the kind of repair it does follows this list).
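For reference, here is a minimal illustration of the kind of fix json-repair handles; the malformed input below is an invented example, not output from this project:

```python
from json_repair import loads

# a trailing comma would make the strict json module raise an error;
# json_repair quietly fixes it and returns the parsed object
malformed = '{"title": "Track", "type": "object", "properties": [],}'
print(loads(malformed))
# {'title': 'Track', 'type': 'object', 'properties': []}
```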
I can see this approach working well in a LangGraph implementation where the operations are run in sequence, giving an LLM the ability to build and refine the model. As new models are released, they may also benefit from fine-tuning.
Learn More
Check out Harnessing Large Language Models With Neo4j for more information about streamlining the knowledge graph creation process with LLMs. Read Create a Neo4j GraphRAG Workflow Using LangChain and LangGraph for more about LangGraph and Neo4j. And to learn more about fine-tuning, check out Knowledge Graphs and LLMs: Fine-Tuning vs. Retrieval-Augmented Generation.
The Developer’s Guide: How to Build a Knowledge Graph
This ebook gives you a step-by-step walkthrough on building your first knowledge graph.