Discover AuraDB Free: Week 19 – The Wordle Graph

Head of Product Innovation & Developer Strategy, Neo4j

February 14, 2022

14 min read

I know I’m late to the game as Wordle was sold to the New York Times last week. I haven’t played it much myself just saw it popping up on my Twitter stream a lot.

If you haven’t played Wordle, it is a guessing game similar to mastermind, only with (5-letter) English words (there are now many clones in other languages). Correctly guessed letters in the right position appear green, wrong position yellow and incorrect letters gray. The goal is to guess the word in 6 attempts, i.e. make all 5 letters green.

Our creative folks even created a Neo4j version of it, see if you can spot the mistake.

So I thought one late night that it would be fun to represent the Wordle world as a graph. And here we go.

I’m probably late to the game now that #wordle has been sold to the @nytimes.
But you can make a graph of all words and characters and have #Neo4j Cypher solve it for you. Includes load script and example query.https://t.co/EUGEVypt7Y pic.twitter.com/NYDgXYINcq

— Michael Hunger (@mesirii) February 1, 2022

If you missed our live-stream here is the recording, we had a lot of fun both playing the game but also exploring the graph version of it:

That Wordle of the day was “elder”, which we got to in attempt 3, after someone from the stream hinting at it.

You can find past live sessions and deep dive blog write-ups.

All the Cypher statements and the source CSV can be found in this GitHub repository:

GitHub – jexp/wordle-graph: Wordle Solver as Neo4j graph database

But let’s create our database instance first, so that you can follow the modeling and import.

Create a Neo4j AuraDB Free Instance

Go to https://dev.neo4j.com/neo4j-aura to register or log into the service (you might need to verify your email address).

After clicking Create Database you can create a new Neo4j AuraDB Free instance. Select a Region close to you and give it a name, e.g. {db-name}.

Choose the “blank database” option as we want to import our data ourselves.

On the Credentials popup, make sure to save the password somewhere safe. The default username is always neo4j.

Then wait 2–3 minutes for your instance to be created.

Afterwards you can connect via the Open Button with Neo4j Browser (you’ll need the password).

Then also the connection URL: neo4j+s://xxx.databases.neo4j.io is available and you can copy it to your credentials as you might need it later.

If you want to see examples for programmatically connecting to the database go to the “Connect” tab of your instance and pick the language of your choice

Dataset

I found a scraped Wordle list in this repository of an R solver, which I turned into a CSV with 12k entries.

Here are the first few words, none of which I would have associated with English 🙂

aahed
aalii
aargh
aarti
abaca

The CSV is available on GitHub, it doesn’t have a header and only a single column.

Import

We can load the CSV data directly with LOAD CSV into Word nodes. First we create a constraint to make this and future lookups fast and the import repeatable.

// create constraint
CREATE CONSTRAINT word_name IF NOT EXISTS ON (w:Word) ASSERT w.name IS UNIQUE;

// load csv and turn each row into a Word node

LOAD CSV FROM "https://raw.githubusercontent.com/jexp/wordle-graph/main/wordle.csv" AS row
MERGE (:Word {name:row[0]});

That’s a lot of 5-letter words (12k) nodes.

I also put a dump file on s3: https://data.neo4j.com/wordle.dump that you can load into your AuraDB or Neo4j Desktop.

Tale of 2 Models

The idea is to split the words into their constituent letters and represent those “character positions” in the graph, sharing letters and positions as needed.

That will allow us to later have fun with things like:

Character frequency
Follow frequencies
Top-starter words
Possible solution words if we have already some positive and negative clues
Playing the game

Model 1 – Letter Positions as Nodes

Initially I represented characters at positions with dedicated nodes labeled CharAtPos with a char and an idx property connected to the word via HAS and to each other with an NEXT relationship and the first node (idx=0) having an extra STARTS relationship.

As it turns out the model might have been a bit overengineered 🙂

Here is the code to split the words into characters and then create the first CharAt pos and subsequent (1..4) nodes and connect them.

To make it work with memory-constrained environments I wrapped it in a CALL {} IN TRANSACTIONS subquery, but even in AuraDB Free that was actually not necessary.

MATCH (w:Word)
CALL { WITH w
WITH w, split(w.name,"") AS chars
MERGE (start:CharAtPos {idx:0, char:chars[0]})
MERGE (w)-[:STARTS]->(start)
MERGE (w)-[:HAS]->(start)
WITH *
UNWIND range(1,size(chars)-1) AS idx
MERGE (next:CharAtPos {idx:idx, char:chars[idx]})
MERGE (w)-[:HAS]->(next)
WITH *
MATCH (prev:CharAtPos {idx:idx-1, char:chars[idx-1]})
MERGE (prev)-[:NEXT]->(next)
} IN TRANSACTIONS OF 1000 ROWS;

IF we query a word like crash now based on our model:

match p=(:Word {name:"crash"})--()
return p

Wordle Solver v1

To solve a word, you pass in the letters you know with their positions and the letters that you don’t have the right position for and match any words that fit this pattern.

MATCH (c1:CharAtPos {idx:0, char:'c'}),
      (c5:CharAtPos {idx:4, char:'h'}),
      (c:CharAtPos {char:'a'})
match (w:Word)-[:HAS]->(c1),
      (w)-[:HAS]->(c5),
      (w)-[:HAS]->(c)
return w.name;

╒════════╕
│"w.name"│
╞════════╡
│"clach" │
├────────┤
│"clash" │
├────────┤
│"caneh" │
├────────┤
│"coach" │
├────────┤
│"catch" │
├────────┤
│"crash" │
└────────┘

If we have more information, then we can extend the query by excluding letters or positions and get a smaller result set. It takes a bit longer to query due to the exclusions.

match (c1:CharAtPos {idx:0, char:'c'}), // correct
      (c2:CharAtPos {idx:1, char:'a'}), // wrong pos
      (c3:CharAtPos {char:'l'}),  // incorrect
      (c4:CharAtPos {char:'i'}),  // incorrect
      (c5:CharAtPos {idx:4, char:'h'}), // correct
      (c:CharAtPos {char:'a'})
match (w:Word)-[h1:HAS]->(c1),
      (w)-[h2:HAS]->(c5), (w)-[h3:HAS]->(c)
WHERE not exists { (w)-[:HAS]->(c2) } and not exists { (w)-[:HAS]->(c3) } and not exists { (w)-[:HAS]->(c4) }
return *

Model – Positions in Relationships

An alternative model represents just the 26 characters and puts the position onto the relationship either as a property or as the rel-type.

Because we know we have 5 letters we can just spell it out.

MATCH (w:Word)
WITH w, split(w.name,"") AS chars
MERGE (c0:Char {char:chars[0]})
MERGE (w)-[p0:POS0]->(c0) SET p0.idx = 0
MERGE (c1:Char {char:chars[1]})
MERGE (w)-[p1:POS1]->(c1) SET p1.idx = 1
MERGE (c2:Char {char:chars[2]})
MERGE (w)-[p2:POS2]->(c2) SET p2.idx = 2
MERGE (c3:Char {char:chars[3]})
MERGE (w)-[p3:POS3]->(c3) SET p3.idx = 3
MERGE (c4:Char {char:chars[4]})
MERGE (w)-[p4:POS4]->(c4) SET p4.idx = 4;

Model 2 Exploration

We can first look at the representation of a word (“diver”) in this model.

match (w:Word {name:"diver"})-[r]->(c:Char)
return *;

Then see how shared characters between two words (“diver” and “elder”) look like, here c are the shared letters and c1, c2 the other letters respectively.

match path = (c1:Char)<--(:Word {name:"diver"})-->(c:Char)
match path2 = (c:Char)<--(:Word {name:"elder"})-->(c2:Char)
return path, path2;

Looking at that frequent suffix of er, we wanted to see what the letter frequencies look like.

Letter Frequencies

With this model we can also easy look at character frequencies and follow probabilities.

For the letter frequencies we just count the number of relationships pointing to a character node (aka the in-degree).

MATCH (c:Char)
RETURN c.char, size((c)<--()) as deg
ORDER BY deg DESC;

As expected for the English language the vowels, and R,S,T are pretty high up.

╒════════╤═════╕
│"c.char"│"deg"│
╞════════╪═════╡
│"s"     │6665 │
├────────┼─────┤
│"e"     │6662 │
├────────┼─────┤
│"a"     │5990 │
├────────┼─────┤
│"o"     │4438 │
├────────┼─────┤
│"r"     │4158 │
├────────┼─────┤
│"i"     │3759 │
├────────┼─────┤
│"l"     │3371 │
├────────┼─────┤
│"t"     │3295 │
├────────┼─────┤
│"n"     │2952 │
├────────┼─────┤
│"u"     │2511 │
├────────┼─────┤
│"d"     │2453 │
├────────┼─────┤
│"y"     │2074 │
├────────┼─────┤
│"c"     │2028 │
├────────┼─────┤
│"p"     │2019 │
├────────┼─────┤
│"m"     │1976 │
├────────┼─────┤
│"h"     │1760 │
├────────┼─────┤
│"g"     │1644 │
├────────┼─────┤
│"b"     │1627 │
├────────┼─────┤
│"k"     │1505 │
├────────┼─────┤
│"f"     │1115 │
├────────┼─────┤
│"w"     │1039 │
├────────┼─────┤
│"v"     │694  │
├────────┼─────┤
│"z"     │434  │
├────────┼─────┤
│"j"     │291  │
├────────┼─────┤
│"x"     │288  │
├────────┼─────┤
│"q"     │112  │
└────────┴─────┘

We can now use that information to find good starting words, by summing up the character frequencies for each word.

MATCH (w:Word)
MATCH (w)-->(c:Char)
RETURN w.name, sum(size((c)<--())) as total
ORDER BY total DESC LIMIT 5;

We see there are quite a lot of “cheater” words which contain the high frequency characters multiple times.

╒════════╤═══════╕
│"w.name"│"total"│
╞════════╪═══════╡
│"esses" │33319  │
├────────┼───────┤
│"sasse" │32647  │
├────────┼───────┤
│"sessa" │32647  │
├────────┼───────┤
│"asses" │32647  │
├────────┼───────┤
│"eases" │32644  │
└────────┴───────┘

If we want to avoid that we can state, that we want to only look at words with 5 distinct characters.

MATCH (w:Word)
MATCH (w)-->(c:Char)
RETURN w.name, sum(size((c)<--())) as total, count(distinct c) = 5 as uniques
ORDER BY uniques DESC, total DESC LIMIT 10;

╒════════╤═══════╤═════════╕
│"w.name"│"total"│"uniques"│
╞════════╪═══════╪═════════╡
│"arose" │27913  │true     │
├────────┼───────┼─────────┤
│"soare" │27913  │true     │
├────────┼───────┼─────────┤
│"aeros" │27913  │true     │
├────────┼───────┼─────────┤
│"serai" │27234  │true     │
├────────┼───────┼─────────┤
│"arise" │27234  │true     │
├────────┼───────┼─────────┤
│"reais" │27234  │true     │
├────────┼───────┼─────────┤
│"aesir" │27234  │true     │
├────────┼───────┼─────────┤
│"raise" │27234  │true     │
├────────┼───────┼─────────┤
│"aloes" │27126  │true     │
├────────┼───────┼─────────┤
│"stoae" │27050  │true     │
└────────┴───────┴─────────┘

Still not perfect, most of these are just variations of the same set of letters. Let’s group them by the sorted set of letters and show the first few.

MATCH (w:Word)
MATCH (w)-->(c:Char)
RETURN apoc.coll.sort(split(w.name,'')) as letters, sum(size((c)<--())) as total, count(distinct c) = 5 as uniques, collect(w.name)[0..2] as words
ORDER BY uniques DESC, total DESC LIMIT 10;

╒═════════════════════╤═══════╤═════════╤═════════════════╕
│"letters"            │"total"│"uniques"│"words"          │
╞═════════════════════╪═══════╪═════════╪═════════════════╡
│["a","e","r","s","t"]│348010 │true     │["aster","arets"]│
├─────────────────────┼───────┼─────────┼─────────────────┤
│["a","e","p","r","s"]│331422 │true     │["apres","apers"]│
├─────────────────────┼───────┼─────────┼─────────────────┤
│["a","e","l","s","t"]│311796 │true     │["salet","taels"]│
├─────────────────────┼───────┼─────────┼─────────────────┤
│["a","e","l","p","s"]│247070 │true     │["pales","lapse"]│
├─────────────────────┼───────┼─────────┼─────────────────┤
│["d","e","o","r","s"]│243760 │true     │["doser","deros"]│
├─────────────────────┼───────┼─────────┼─────────────────┤
│["a","e","l","r","s"]│241614 │true     │["arles","earls"]│
├─────────────────────┼───────┼─────────┼─────────────────┤
│["a","c","e","r","s"]│229527 │true     │["acres","acers"]│
├─────────────────────┼───────┼─────────┼─────────────────┤
│["d","e","i","l","s"]│229100 │true     │["delis","diels"]│
├─────────────────────┼───────┼─────────┼─────────────────┤
│["a","i","r","s","t"]│214803 │true     │["artis","astir"]│
├─────────────────────┼───────┼─────────┼─────────────────┤
│["a","e","l","s","v"]│210438 │true     │["avels","valse"]│
└─────────────────────┴───────┴─────────┴─────────────────┘

Follow Frequencies

For example, in position 1 we can use this for the follow frequency

MATCH (c1:Char)<-[:POS0]-(w)-[:POS1]->(c2:Char)
RETURN c1.char, c2.char, count(*) as freq
ORDER BY freq DESC LIMIT 5;

╒═════════╤═════════╤══════╕
│"c1.char"│"c2.char"│"freq"│
╞═════════╪═════════╪══════╡
│"c"      │"o"      │220   │
├─────────┼─────────┼──────┤
│"m"      │"a"      │193   │
├─────────┼─────────┼──────┤
│"r"      │"e"      │187   │
├─────────┼─────────┼──────┤
│"s"      │"t"      │183   │
├─────────┼─────────┼──────┤
│"b"      │"o"      │175   │
└─────────┴─────────┴──────┘

Globally we can either do a big union and sum them up as we did on the stream. Or with the positions on the relationships, we can also generalize our statement from above.

MATCH (c1:Char)<-[p1]-(w)-[p2]->(c2:Char)
WHERE p1.idx +1 = p2.idx
RETURN c1.char, c2.char, count(*) as freq
ORDER BY freq DESC LIMIT 10;

╒═════════╤═════════╤══════╕
│"c1.char"│"c2.char"│"freq"│
╞═════════╪═════════╪══════╡
│"e"      │"s"      │867   │
├─────────┼─────────┼──────┤
│"e"      │"r"      │765   │
├─────────┼─────────┼──────┤
│"a"      │"r"      │659   │
├─────────┼─────────┼──────┤
│"r"      │"e"      │646   │
├─────────┼─────────┼──────┤
│"e"      │"d"      │642   │
├─────────┼─────────┼──────┤
│"r"      │"a"      │562   │
├─────────┼─────────┼──────┤
│"i"      │"n"      │556   │
├─────────┼─────────┼──────┤
│"a"      │"n"      │544   │
├─────────┼─────────┼──────┤
│"a"      │"l"      │536   │
├─────────┼─────────┼──────┤
│"a"      │"s"      │532   │
└─────────┴─────────┴──────┘

These can help us finding that missing letter, but I guess you have internalized it already from your language knowledge, except for some rare combinations.

One interesting question we wanted to answer is, which letters appear frequently in succession as double-letters.

MATCH (c:Char)<-[r1]-(w:Word)-[r2]->(c)
WHERE r1.idx = r2.idx +1
RETURN c.char, count(*) as freq
ORDER BY freq DESC LIMIT 8

As expected o and e lead the pack, l was a bit surprising to me.

╒════════╤══════╕
│"c.char"│"freq"│
╞════════╪══════╡
│"o"     │328   │
├────────┼──────┤
│"e"     │308   │
├────────┼──────┤
│"l"     │209   │
├────────┼──────┤
│"t"     │114   │
├────────┼──────┤
│"f"     │111   │
├────────┼──────┤
│"r"     │108   │
├────────┼──────┤
│"s"     │105   │
├────────┼──────┤
│"n"     │91    │
└────────┴──────┘

For generic, repeats and re-occurrences one could generalize that to

Wordle Solver v2

For resolving our wordle puzzle (v2) we could use this Cypher using this time the relationships as structuring means.

MATCH (c:Char {char:'c'}),
      (h:Char {char:'h'}),
      (a:Char {char:'a'})
MATCH (wordle:Word)-[p0:POS0]->(c),
      (wordle)-[p4:POS4]->(h),
      (wordle)-[px]->(a)
WHERE not exists { (wordle)-[:POS1]->(a) }
  AND not exists { (wordle)-[:POS2]->(:Char {char:'l'}) }
  AND not exists { (wordle)-[:POS3]->(:Char {char:'i'}) }
RETURN *;

If we have more information, then we can extend the query by excluding letters or positions and get a smaller result set.

MATCH (c:Char {char:'c'}),
      (h:Char {char:'h'}),
      (a:Char {char:'a'})
MATCH (wordle:Word)-[p0:POS0]->(c),
      (wordle)-[p4:POS4]->(h),
      (wordle)-[px]->(a)
WHERE not exists { (wordle)-[:POS1]->(a) }
  AND not exists { (wordle)-[:POS2]->(:Char {char:'l'}) }
  AND not exists { (wordle)-[:POS3]->(:Char {char:'i'}) }
RETURN *;

We can even go so far as to implement a generic solver, that takes an structured input (we could also split an parse a marked up word as shown in the 2nd statement) and for each position, includes or excludes the letter (at position) as needed.

// Laser 🟩🟨⬜🟩⬜
WITH [{char:'l',match:true},{char:'a'},{char:'s',match:false},{char:'e',match:true},{char:'r',match:false}] as input
MATCH (w:Word)
CALL {
    WITH w, input
    UNWIND range(0,size(input)-1) as idx
    WITH size(input) as total, idx, w, input[idx].match as m, input[idx].char as char
    // for matching must match position
    WHERE (m AND exists { (w)-[{idx:idx}]->(:Char {char:char}) })
    // for non-matching must not contain
      OR (m = false AND NOT exists { (w)-->(:Char {char:char}) })
    // existing must contain somewhere
      OR (m IS NULL AND exists { (w)-->(:Char {char:char}) })
    // all conditions need to match for this word
    WITH total, count(*) as count WHERE count = total
    RETURN true AS found
}
RETURN w.name;

For ‘laser’ this returns 9 suggestions for the next word:
label, laced, lacet, lacey, laded, laden, laked, lamed, lapel, lated, laten, latex, laved, lawed, laxed, layed, lazed, lutea, lycea.

We could now rank them by the information gain, i.e. which of those have high frequency new characters or reduce most of the uncertainty.

I leave that for you valued reader for now, so that we can have a look at (the much simpler) wordle game implemented using Cypher.

Playing Wordle in Your Terminal

If you just want to play, run ./wordle-neo4j.sh in your terminal, it sends a Cypher query to a wordle database in demo.neo4j.labs.com (username, password, database = wordle) to see if your guesses were right.

https://medium.com/media/9165d67097601f14e45d193aba698df5/href

The statement that it’s running with the guesses in a loop (via http but could also do via cypher-shell) is:

match (w:Word)
with w skip $word limit 1
// turn guess and word into lists
with split($guess,'') as guessed, split(w.name,'') as letters, w.name as name
// iterate and combine for each position
return reduce(res='', idx in range(0,size(letters)-1) | res +
  // when correct
  case when guessed[idx] = letters[idx] then '🟩'
  // when contained
  when name contains guessed[idx] then '🟨'
  // otherwise
  else '⬜' end) as res

Happy guessing and playing with the Wordle data!

Let us know if you have more ideas / suggestions on how to have more fun with this or other datasets on AuraDB Free.

Get started with Neo4j AuraDB Free now. No download required.

Try Neo4j AuraDB Free Now

Discover AuraDB Free: Week 19 – The Wordle Graph was originally published in Neo4j Developer Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.