Making Relations on the Titanic



Picture the Titanic, a floating palace that sank and killed two-thirds of those aboard. Whole families perished, children were left orphaned, wives watched their husbands drown. A close inspection of the passenger manifest might reveal the relationships among the people who died or survived, but each instance takes a great deal of deduction: deduction that a graph and Cypher can facilitate.

Exploring the Manifest

The passenger manifest records passenger details, which means we can sort passengers into groups by gender, age, class, and also by ticket. Rather frustratingly, the manifest also records the number of parents or children (parch) that traveled with each passenger, combined as a single number. The same conflation has occurred for siblings and spouses (sibsp). From a cursory look at the passenger manifest, it appears that for each person, these numbers refer to only one of the possible relations (i.e., that ‘parch’ refers to either parents or children but not both; that if a person is a parent, they do not have their own parents aboard). The same appears to hold for ‘sibsp’, where those who are married do not have any of their siblings aboard.

Understanding the specific relationships among passengers on the Titanic allows us to ask interesting questions of the graph that would be extremely difficult and time-consuming to answer with just a tabular database like the manifest. Such questions include: Did those who survived have siblings aboard? If yes, were their siblings male or female? Did they have their mother, father, or both aboard? What proportion of surviving women were single or married? How many survivors had their children aboard, and did those children also survive?

To parse the relationships from the data, it was necessary to think about what the numbers in the database represented and whether a unique set of clauses could be defined to determine relationships from that data.

Defining the Relationship Categories

The first step was to define the categories of relationships we were interested in.

Here are the three relationships I had to define: MARRIED_TO, SIBLING_OF, PARENT_OF.

Both MARRIED_TO and SIBLING_OF would imply the same relationship in the other direction between the same nodes. PARENT_OF would imply a reverse relationship of CHILD_OF.
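
Because relationships in the graph are always stored with a direction, one way to treat MARRIED_TO as symmetric is to leave the arrowhead off when querying, and CHILD_OF can be read by traversing PARENT_OF backwards instead of storing a second relationship. A minimal sketch, assuming each Person node carries a name property (an assumption; that property is not shown in this article):

```cypher
// Undirected pattern: matches the marriage whichever way it was stored.
// DISTINCT collapses duplicate rows if the relationship exists in both directions.
// Assumption: Person nodes have a `name` property.
MATCH (a:Person)-[:MARRIED_TO]-(b:Person)
RETURN DISTINCT a.name AS person, b.name AS spouse;

// CHILD_OF read as PARENT_OF in reverse.
MATCH (child:Person)<-[:PARENT_OF]-(parent:Person)
RETURN child.name AS child, collect(parent.name) AS parents;
```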

I had to assume that family members would be traveling on the same ticket, and to be sure not to marry children to their parents or vice-versa, I needed to know their ages. Fortunately, the data on ticket numbers and ages is fairly complete in the passenger manifest, so I began with the following:

MATCH (person:Person)
WHERE person.age IS NOT NULL
MATCH (person:Person)-[:TRAVELED_ON]->(ticket:Ticket)<-[:TRAVELED_ON]-(other:Person)

To avoid including servants and other people on the tickets who were not family members in the search, I added the condition that the total number of family members for each person in the relationship had to be the same:

WHERE other.age IS NOT NULL AND person.family = other.family
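
The family property is not defined in the snippets here. If, as the description suggests, it holds the sum of sibsp and parch (an assumption on my part), a read-only check along these lines would list any passengers for whom that does not hold. The name property is also assumed:

```cypher
// Assumption: family = sibsp + parch.
// Returns the passengers that break that assumption; an empty result means it holds.
MATCH (p:Person)
WHERE p.family <> p.sibsp + p.parch
RETURN p.name AS passenger, p.sibsp AS sibsp, p.parch AS parch, p.family AS family
```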

With those parameters in place, I began defining the relationship for [:MARRIED_TO]. It was likely that if a MARRIED_TO relationship existed on a ticket, it would exist between the eldest people in the family, so I ordered the other people on the ticket by descending age and collected them together as a list called familyMembers:

WITH person, other
ORDER BY other.age DESC
WITH person as p1, collect(other) as familyMembers

The following parameters defined the rest of the MARRIED_TO relationship.

That they would have one spouse:

p1.sibsp = 1

That their potential spouse would also have one spouse:

p2.sibsp = 1

They would have at least one family member on the ticket:

p2.family >= 1

They would be the opposite sex of their spouse:

p2.sex <> p1.sex

That the potential spouse was the eldest other family member on the ticket; this assumes that there is no more than one married couple on each ticket and that any family members older than the married couple would not have any siblings aboard, which would register as sibsp:

p2 = familyMembers[0]

And that if there were other family members on the ticket apart from the spouse, that the person would be older than them; this is to prevent the children being married off to their second eldest parent (this took a lot of trial and error):

(size(familyMembers) = 1 OR p1.age > familyMembers[1].age)

The Queries

Eventually, this was the query I drew up to create the MARRIED_TO relationships:

MATCH (person:Person)
WHERE person.age IS NOT NULL
MATCH (person:Person)-[:TRAVELED_ON]->(ticket:Ticket)<-[:TRAVELED_ON]-(other:Person)
WHERE other.age IS NOT NULL AND person.family = other.family
WITH person, other
ORDER BY other.age DESC
WITH person as p1, collect(other) as familyMembers
WITH p1, familyMembers, [p2 in familyMembers WHERE
p1.sibsp = 1 AND
p2.sibsp = 1 AND
p2.family >= 1 AND
p2.sex <> p1.sex AND
p2 = familyMembers[0] AND
(size(familyMembers) = 1 OR p1.age > familyMembers[1].age)
] as spouses
FOREACH (p in spouses | CREATE (p1)-[:MARRIED_TO]->(p))
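
Since these rules rest on several assumptions, it may be worth checking the result before building on it. The following sketch looks for two symptoms: passengers who ended up with more than one spouse, and marriages that were created in one direction only (which the strict age comparison could produce when two family members share an age). The name property is an assumption:

```cypher
// Passengers with more than one spouse (the rules assume at most one).
MATCH (p:Person)-[:MARRIED_TO]->(spouse:Person)
WITH p, count(spouse) AS spouses
WHERE spouses > 1
RETURN p.name AS passenger, spouses;

// Marriages recorded in one direction only.
MATCH (a:Person)-[:MARRIED_TO]->(b:Person)
WHERE NOT EXISTS { (b)-[:MARRIED_TO]->(a) }
RETURN a.name AS person, b.name AS unreciprocatedSpouse;
```

Because the query above uses CREATE, running it twice would duplicate every relationship; swapping CREATE for MERGE inside the FOREACH is one way to make it repeatable.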

Next, I worked on creating the SIBLING_OF relationship as a list that could only be created from people who were not spouses, then finally on the PARENT_OF relationship, where I assumed parents could not also be siblings of people onboard:

MATCH (person:Person)
WHERE person.age IS NOT NULL
MATCH (person:Person)-[:TRAVELED_ON]->(ticket:Ticket)<-[:TRAVELED_ON]-(other:Person)
WHERE other.age IS NOT NULL AND person.family = other.family
WITH person, other
ORDER BY other.age DESC
WITH person as p1, collect(other) as familyMembers
WITH p1, familyMembers, [p2 in familyMembers WHERE
NOT (p2)-[:MARRIED_TO]->() AND
NOT (p1)-[:MARRIED_TO]->() AND
p2.sibsp >= 1 AND
p2.sibsp = p1.sibsp AND
p2.family >= 1 AND
(p2.parch = 1 OR p2.parch = 2) AND
NOT p2 = familyMembers[0] AND
NOT p1 = familyMembers[0]
] as siblings
WITH p1, familyMembers, siblings, [p2 in familyMembers WHERE
NOT (p2)-[:MARRIED_TO]->() AND
NOT p2 IN siblings AND
NOT p1 IN siblings AND 
p2.family >= 1 AND
(p2.parch = 1 OR p2.parch = 2) AND
p1.parch >= 1 AND
p1.age > p2.age
] as children
FOREACH (p in siblings | CREATE (p1)-[:SIBLING_OF]->(p))
FOREACH (p in children | CREATE (p1)-[:PARENT_OF]->(p))
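
A similar sanity check can be sketched for these two relationships. The first query lists children linked to more than two parents; the second lists anyone recorded as both sibling and parent of the same person, which the rules above are meant to exclude. The name property is an assumption:

```cypher
// Children linked to more than two parents.
MATCH (parent:Person)-[:PARENT_OF]->(child:Person)
WITH child, count(parent) AS parents
WHERE parents > 2
RETURN child.name AS child, child.age AS age, parents;

// Passengers recorded as both sibling and parent of the same person.
MATCH (a:Person)-[:SIBLING_OF]->(b:Person)
WHERE EXISTS { (a)-[:PARENT_OF]->(b) }
RETURN a.name AS person, b.name AS relative;
```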

With these relationships created in the graph, I could then ask questions such as ‘of the people who died, how many of them had siblings aboard?’

MATCH (p:Person {fate: 'Died'})
RETURN EXISTS {(p)-[:SIBLING_OF]-()} as hasSibling, count(p) as died

Query to return the number of people without siblings that died or survived:

MATCH (p:Person)
WHERE NOT EXISTS {(p)-[:SIBLING_OF]-()}
WITH count(p) as totalnosib, sum(CASE WHEN p.fate = 'Died' THEN 1 ELSE 0 END) as nosibdied
RETURN totalnosib, nosibdied, 1.0 * nosibdied / totalnosib as pdied, totalnosib - nosibdied as nosibsurvived, 1 - (1.0 * nosibdied / totalnosib) as psurvived

Total people without siblings = 1204
People with no siblings who died = 754
Proportion of those without siblings who died = 0.626
People with no siblings who survived = 450
Proportion of those without siblings who survived = 0.374

Query to return the number of people with spouses that died or survived:

MATCH (p:Person)
WHERE EXISTS {(p)-[:MARRIED_TO]-()}
WITH count(p) as totalsp, sum(CASE WHEN p.fate = 'Died' THEN 1 ELSE 0 END) as spdied
RETURN totalsp, spdied, 1.0 * spdied / totalsp as pdied, totalsp - spdied as spsurvived, 1 - (1.0 * spdied / totalsp) as psurvived

Total people with a spouse = 202
People with a spouse who died = 100
Proportion of those with a spouse who died = 0.495
People with a spouse who survived = 102
Proportion of those with a spouse who survived = 0.505
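
The same relationship also makes it possible to look at couples rather than individuals. As a sketch, the query below counts couples by the combination of the two spouses' fates; it relies only on the sex and fate properties used elsewhere in this article, and on the rule that spouses are of opposite sex:

```cypher
// One row per combination of fates, counted over couples.
// Starting from the male spouse avoids counting each couple twice.
MATCH (husband:Person {sex: 'male'})-[:MARRIED_TO]->(wife:Person)
RETURN husband.fate AS husbandFate, wife.fate AS wifeFate, count(*) AS couples
ORDER BY couples DESC
```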

And finally, a query to return the number of people with children aboard that died or survived:

MATCH (p:Person)
WHERE EXISTS {(p)-[:PARENT_OF]->()}
WITH count(p) as totalpar, sum(CASE WHEN p.fate = 'Died' THEN 1 ELSE 0 END) as pardied
RETURN totalpar, pardied, 1.0 * pardied / totalpar as pdied, totalpar - pardied as parsurvived, 1 - (1.0 * pardied / totalpar) as psurvived

The returns from these queries suggest that of those who had children, more than half survived. That figure does not account for age: some passengers were too young to have children, so a fairer comparison group is adults without children. Restricting the query to that group is simply a matter of adding another clause to the WHERE:

MATCH (p:Person)
WHERE p.ageClass = 'Adult' AND NOT EXISTS {(p)-[:PARENT_OF]->()}
WITH count(p) as totalnopar, sum(CASE WHEN p.fate = 'Died' THEN 1 ELSE 0 END) as nopardied
RETURN totalnopar, nopardied, 1.0 * nopardied / totalnopar as pdied, totalnopar - nopardied as noparsurvived, 1 - (1.0 * nopardied / totalnopar) as psurvived

totalnopar = 1075
nopardied = 703
pdied = 0.654
noparsurvived = 372
psurvived = 0.346

As survivorship for men was significantly lower than for women, we could also add a clause to see the difference between survival chances for men vs. women, both with and without children:

MATCH (p:Person)
WHERE p.ageClass = 'Adult' AND
p.sex = 'male' AND
EXISTS {(p)-[:PARENT_OF]->()}
WITH count(p) as totalpar, sum(CASE WHEN p.fate = 'Died' THEN 1 ELSE 0 END) as pardied
RETURN totalpar, pardied, 1.0 * pardied / totalpar as pdied, totalpar - pardied as parsurvived, 1 - (1.0 * pardied / totalpar) as psurvived
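
Rather than editing the WHERE clause for each combination, the four groups (men and women, with and without children) could be returned in one pass by grouping on sex and on whether a PARENT_OF relationship exists. This is a sketch built from the same properties as the queries above, not a query I ran for the figures in this article:

```cypher
// Non-aggregated columns in WITH (sex, hasChildren) act as grouping keys.
MATCH (p:Person)
WHERE p.ageClass = 'Adult'
WITH p.sex AS sex,
     EXISTS { (p)-[:PARENT_OF]->() } AS hasChildren,
     count(p) AS total,
     sum(CASE WHEN p.fate = 'Died' THEN 1 ELSE 0 END) AS died
RETURN sex, hasChildren, total, died,
       1.0 * died / total AS pdied,
       total - died AS survived,
       1 - (1.0 * died / total) AS psurvived
ORDER BY sex, hasChildren
```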

The Findings

There is little difference between the survivorship of the genders according to whether they have children. However, this example shows that the graph is flexible enough to draw on multiple properties of an item, both in constructing the graph and in the queries you can ask of it.

The data about familial relationships existed in the passenger manifest, but was only readable in that format by careful cross-referencing of passengers on each ticket. The clausal capacity of the graph, however, allowed me to extract these relationships automatically and ask questions of the data that would have been impractical to answer from a tabular database.

Information About the Dataset

The passenger manifest for this dataset was originally downloaded from GitHub, but has been added to and changed through references to the following websites:

Two passengers have been added, and several hundred are missing ages. Some of the data on these websites is uncertain and based on speculation. Where ambiguity exists, the most likely or simplest option was chosen to best fill out the CSV as completely as possible. This version of the passenger manifest should, therefore, not be taken as an accurate or complete representation of the actual passenger manifest of the Titanic.

The final dataset is available as a GitHub gist.