Modeling Data From the Titanic



The sinking of the Titanic is a tragedy that still fascinates the world. Questions over why some people survived and others didn’t are haunted by the context of class, gender, and age. I use the passenger manifest of the Titanic to help my science and technology students understand what social implications can be drawn from such data. In the past, I simply used pivot tables to introduce them to such critiques, but I was intrigued to see what modeling the data in a graph could reveal about survivorship across multiple simultaneous factors in the passenger list.

Consider the Data

Faced with a fairly complete tabular data set like a passenger manifest, it may appear that transferring such data to the graph would be a straightforward matter of creating nodes for each passenger and the objects they are attached to, like tickets and departure points, and making the rest of the columns into properties.

However, the flexibility of the graph both affords and requires some thought about how to categorize the data you want to work with. In the graph, data can be nodes, relationships, or properties of either. Depending on what kinds of queries you want to ask of the graph, you may want to structure it differently.

Designing a graph is an iterative process where thinking about what kinds of relationships you have information about influences what kinds of queries to ask and, therefore, how the data should be modeled in the graph. This leads to more thinking about what kinds of data and relationships you have available. Neo4j provides arrows.app as a platform with which to visualize relationships across the graph before uploading any specific data. With arrows.app, it is possible to think about and play with the different ways one might model the data. To demonstrate this, I will show an example of a graph of the data that was far from ideal.

Initially, I tried to make a graph that used the capacity of Cypher to find paths through the graph to conduct my analysis. This led me to create nodes out of any relationships that were widely shared between passengers on the Titanic.

While this graph does create long chains, categorizing some data as nodes rather than as properties of nodes makes certain questions quite awkward to ask. For instance, asking who paid the most for their ticket among all the people who escaped on a particular lifeboat required a long string of node-relationship-node clauses.

Simplify the Query

So, to make this sort of query simpler, I restructured the graph to make more of the data into properties of nodes so that queries that required sorting by nodes were far more straightforward. The resulting graph looked like this.
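As a rough sketch of how the lifeboat question reads once the fare is a property rather than a node, the query below finds the highest-paying passenger on one boat. The Person properties `name` and `fare` are assumptions made for illustration. The source data may name them differently or keep the fare on the ticket instead.

```cypher
// Who paid the most among the people who escaped on lifeboat 14?
// Assumes hypothetical Person properties `name` and `fare` (numeric).
MATCH (p:Person)-[:ESCAPED_ON]->(b:Boat {boat: '14'})
WHERE p.fare IS NOT NULL
RETURN p.name AS passenger, p.fare AS fare
ORDER BY fare DESC
LIMIT 1
```

The sorting happens on a property of the matched node, so the pattern stays at a single hop no matter which property is used for ranking.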

Several relationships in this graph do not appear in the CSV of the data. How I extracted the familial relationships is the subject of my other post. For this post, I want to focus on the side property, which refers to the side of the ship on which both cabins and lifeboats were located.

To begin, I had to import the passenger manifest CSV into Neo4j. I used the importer tool in Aura to map relationships between the categories of data that already existed as properties of each passenger (i.e., the relationships between each (:Person) and the (:Boat) they [:ESCAPED_ON], the (:Cabin) they were [:ASSIGNED_TO], the (:Ticket) they [:TRAVELED_ON], and the (:embarkationPoint) they [:EMBARKED_FROM]).
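The importer handles this mapping without any code. For readers who prefer to script the step, a rough Cypher equivalent for one of these relationships might look like the sketch below. The URL and the column names `name`, `sex`, and `boat` are placeholders chosen for illustration, not the actual location or headers of the CSV.

```cypher
// Sketch only: the URL and the column names (name, sex, boat) are placeholders.
LOAD CSV WITH HEADERS FROM 'https://example.com/titanic.csv' AS row
// One Person node per manifest row; CREATE avoids merging passengers who share a name.
CREATE (p:Person {name: row.name, sex: row.sex})
// Carry on only for rows that actually record a lifeboat.
WITH p, row
WHERE row.boat IS NOT NULL AND row.boat <> ''
// MERGE reuses an existing Boat node instead of creating one per passenger.
MERGE (b:Boat {boat: row.boat})
CREATE (p)-[:ESCAPED_ON]->(b)
```

The other three relationships would follow the same pattern, each in its own pass so that a missing value in one column does not filter out rows needed for another.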

Through investigating some of the details of how the Titanic was laid out, I discovered that all even-numbered cabins and lifeboats were on the port side of the ship, and all the odd-numbered ones on the starboard side. I thought it might be interesting to see if people assigned cabins on one side were any more likely to escape and whether they escaped on a boat on the same or opposite side to their cabin.

Extract the Data

Extracting the side from the CSV involved identifying all the cabin and boat names that ended with an even digit, plus boats B and D, and setting their side property to Port. Then I did the same for all the odd-numbered cabins and boats, plus boats A and C, setting their side property to Starboard.

MATCH (b:Boat) WHERE b.boat =~ '.*[02468]$' SET b.side = 'Port' RETURN b.boat, b.side

MATCH (b:Boat) WHERE b.boat =~ '.*[BD]$' SET b.side = 'Port' RETURN b.boat, b.side

MATCH (b:Boat) WHERE b.boat =~ '.*[13579]$' SET b.side = 'Starboard' RETURN b.boat, b.side

MATCH (b:Boat) WHERE b.boat =~ '.*[AC]$' SET b.side = 'Starboard' RETURN b.boat, b.side

MATCH (c:Cabin) WHERE c.cabin =~ '.*[02468]$' SET c.side = 'Port' RETURN c.cabin, c.side

MATCH (c:Cabin) WHERE c.cabin =~ '.*[13579]$' SET c.side = 'Starboard' RETURN c.cabin, c.side
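Because Cypher's `=~` operator has to match the whole string, each pattern above starts with `.*` so that only the final character decides the side. One way to check that no boat slipped through all of the patterns is to group the boats by the side they ended up with. This is a sketch of such a check, and the `Unclassified` label is just a display value for boats with no `side` property.

```cypher
// List boats by assigned side; boats matched by none of the patterns show as 'Unclassified'.
MATCH (b:Boat)
RETURN coalesce(b.side, 'Unclassified') AS side,
       count(b) AS boats,
       collect(b.boat) AS labels
ORDER BY side
```

The same query with `(c:Cabin)` and `c.cabin` would do the same job for cabins.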

Unfortunately, the list of assigned cabins is missing significant amounts of information. Despite this, the graph can still support analysis of the data that is available.
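To get a sense of how large the gap is, a query along these lines counts passengers with and without an assigned cabin. It is a minimal sketch that assumes a missing cabin simply means the passenger has no [:ASSIGNED_TO] relationship.

```cypher
// Count passengers who have at least one assigned cabin versus none.
MATCH (p:Person)
OPTIONAL MATCH (p)-[:ASSIGNED_TO]->(c:Cabin)
WITH p, count(c) AS cabins
RETURN cabins > 0 AS hasCabin, count(p) AS passengers
```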

With the graph structured this way, it’s possible to ask about which side of the ship people escaped from in relation to the cabin they were assigned:

MATCH (c:Cabin {side: 'Starboard'})<-[a:ASSIGNED_TO]-(p:Person)-[r:ESCAPED_ON]->(b:Boat {side: 'Starboard'}) 
RETURN c, a, p, r, b

And even add conditions about the age and gender of the passengers:

MATCH (c:Cabin {side: 'Port'})<-[a:ASSIGNED_TO]-(p:Person {sex: 'male'})-[r:ESCAPED_ON]->(b:Boat {side: 'Starboard'}) 
WHERE p.ageClass = 'Child'
RETURN c, a, p, r, b
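The two queries above return the matching paths for one combination of sides at a time. To compare all four combinations at once, the same pattern can be aggregated instead of drawn. The following is a sketch that uses only the labels and properties already introduced and ignores passengers whose cabin or boat has no `side` set.

```cypher
// Cross-tabulate cabin side against lifeboat side for passengers with both recorded.
MATCH (c:Cabin)<-[:ASSIGNED_TO]-(p:Person)-[:ESCAPED_ON]->(b:Boat)
WHERE c.side IS NOT NULL AND b.side IS NOT NULL
RETURN c.side AS cabinSide,
       b.side AS boatSide,
       count(DISTINCT p) AS passengers
ORDER BY cabinSide, boatSide
```

`count(DISTINCT p)` keeps a passenger with more than one assigned cabin on the same side from being counted twice in a row.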



The Findings

These queries show that females had a much higher survival rate than males and that children had a very high survival rate regardless of their gender. Additionally, there was little difference in survivorship related to which side of the ship each passenger had been assigned a cabin. However, there is a difference in survivorship depending on which side the passengers tried to escape from. Witness accounts describe the officer in charge of filling lifeboats on the port side as enforcing a policy not of “women and children first” but of “women and children only.” The majority of the men who escaped from the port side did so in boats 14, B, and D, boats that were launched only moments before the Titanic sank. As we can see, the data above supports this account, as do the results of the query below, which names the boats on which adult males escaped:

MATCH (p:Person {sex: 'male'})-[r:ESCAPED_ON]->(b:Boat {side: 'Port'}) 
WHERE p.ageClass = 'Adult'
RETURN p, r, b.boat
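A tally per boat makes the same comparison easier to read than a list of paths. The variation below is a sketch that counts adult men per lifeboat on both sides, again using only the labels and properties from the queries above.

```cypher
// Count adult male escapees per lifeboat, grouped by the side the boat was on.
MATCH (p:Person {sex: 'male'})-[:ESCAPED_ON]->(b:Boat)
WHERE p.ageClass = 'Adult' AND b.side IS NOT NULL
RETURN b.side AS side, b.boat AS boat, count(p) AS men
ORDER BY side, men DESC
```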

Further Investigation

Modeling a graph is not just a matter of deciding what data to make into nodes, relationships, or properties. Even understanding what kind of data you have is a process that takes a lot of thought and continual investigation. For this example, I used the schematics of the ship and eyewitness accounts to augment and extract more information from my data. Likewise, modeling your data may involve looking beyond the data you already have and investigating other sources that provide context for the results you get from queries of your graph.

Information About the Dataset

The passenger manifest for this dataset was originally downloaded from GitHub, but has been added to and changed through references to the following websites:

Two passengers have been added, and several hundred are missing ages. Some of the data on these websites is uncertain and based on speculation. Where ambiguity exists, the most likely or simplest option was chosen to fill out the CSV as completely as possible. This version of the passenger manifest should, therefore, not be taken as an accurate or complete representation of the actual passenger manifest of the Titanic. The final dataset is available as a GitHub gist.