Cypher & GQL

Spreadsheets Are Graphs Too!

Felienne Hermans, Assistant Professor at Delft University of Technology

Assistant Professor, Delft University of Technology

August 26, 2015

10 min read

Listen to Felienne Hermans’ Talk at GraphConnect Europe on Why Spreadsheets Are Graphs Too

Editor’s Note: Last May at GraphConnect Europe, Felienne Hermans – Assistant Professor at Delft University of Technology – gave this engaging talk on why you shouldn’t overlook the power of the humble spreadsheet. Listen to or read her presentation below.

Register for GraphConnect San Francisco to hear more speakers like Felienne present on the emerging world of graph database technologies.

People often ask me, ‘How is it possible that you research spreadsheets? Did you actually write a dissertation on spreadsheets?’

The answer is, Yes, I did. I research spreadsheets because they matter to a lot of people.

Ninety-five percent of all U.S. companies still use spreadsheets for financial reporting, so spreadsheets run the financial domain. And fifty percent of spreadsheets form the basis for strategic decisions within companies regardless of sector.

Analysts decide the strategy of their company based on spreadsheets. For example, they might look at a spreadsheet and ask, ‘Are we selling enough blue cars in France?’ and if they’re not selling enough blue cars, something will happen. They will stop buying blue paint or start a marketing campaign saying blue is the new black.

Either way, analysts make decisions that steer the company based on the data in their spreadsheets.

What You Don’t Know about Your Spreadsheets Is the Most Horrifying Part

Spreadsheets often exist under the radar. If you ask a random company what software systems they use, it might take a day or so but they’ll eventually be able to tell you who uses what software in the company. This is a lot harder for spreadsheets.

When I was in grad school, I worked with an investment bank doing spreadsheet research. On my first day, I went to the head of the Excel team.

I said, ‘Hello, can I have a list of all your spreadsheets?’

There was no such thing.

‘We don’t have a list of all the spreadsheets,’ he said. ‘You could ask Frank in Accounting or maybe Harry over at Finance. He’s always talking about spreadsheets. I don’t really know, but I think we might have 10,000 spreadsheets.’

10,000 spreadsheets was a gold mine of research, so I went to the IT department and conducted my first spreadsheet scan with root access in Windows Explorer.

Within one second, it had already found 10,000 spreadsheets. Within an hour, it was still finding more, with over one million Excel files located. Eventually, we found 2.5 million spreadsheets.

In short, spreadsheets run the world.

Spreadsheet Horror Stories

If spreadsheets are so common in decision-making but no one really knows where they are, things are bound to go wrong – and they do. There is even a European spreadsheet risk interest group. It has a long list of spreadsheet horror stories.

I’ve selected my three favorite ones to share with you:

You would expect that with a gazillion pounds in budget the 2012 London Olympics team would invest in a very sophisticated ticketing system. They didn’t. They used spreadsheets to keep track of ticket sales.

They made a small mistake in one of their spreadsheets causing one of their swimming stadiums to be overbooked by 10,000 tickets. There wasn’t any money lost but it was extremely embarrassing.

The University of Toledo once lost $2.4 million because of a single typo in a spreadsheet, and Canadian power company TransAlta lost $24 million because of a classic copy-paste error in a spreadsheet.

The End of Spreadsheets?

So, if spreadsheets are so bad, should we even use them anymore?

In his viral blog post, professor Daniel Lemire said we should not use spreadsheets anymore. He pointed out the example of a few economic scientists who used a spreadsheet for an economic model which turned out to be wrong. Lemire said we shouldn’t use spreadsheets anymore because they’re horrible.

First of all, I think this is pretty useless advice.

People are using spreadsheets, and we can’t ignore that. It would be like if your house was on fire and someone told you, ‘Well, you shouldn’t have made it out of wood.’ You would respond, ‘Yes, I know, but my house is on fire, can you help?’ At this point, spreadsheet cynics would hand you a brick.

We might agree that spreadsheets aren’t the right solution, but saying we should stop using them isn’t helping.

If we didn’t use spreadsheets, we’d probably turn to software. And we know that software is always perfect. Nothing wrong ever happens with software – right? Right?

In reality, we know that error rates in spreadsheets are similar to error rates in software. About one in a hundred lines of code (or formulas) will have an error. Human brains just aren’t better than that. Unless we get machines to program for us, we will always have similar error rates between spreadsheets and source code.

The source of this problem is that spreadsheets are mislabeled. People think of spreadsheets as data whereas spreadsheets are actually a means of programming.

Spreadsheets Are Code

My research motto is Spreadsheets are Code.

Spreadsheets are not data; they’re code. There are three reasons why you should consider spreadsheets as actual pieces of programming:

First of all, spreadsheets and source code are used for similar problems. Consider a typical investment calculation. You put in some parameters about principal, interest and length of investment and the calculator tool produces an output.
You could solve a typical programming problem like this with Java, maybe a web app or something on an iPad – or you could do it in a spreadsheet. The problem space is that similar.

So why do so many people use a spreadsheet for this kind of calculation instead of asking IT to create software for them? It’s because IT answers their request saying it will be done in 12 months and cost $3 million. In lieu of waiting – and in lieu of going past schedule and over budget – they use a spreadsheet, because the problem space is similar.
Second, spreadsheets are just as powerful as other programming languages. I’ve gone a long way to make this point. In fact, I’ve implemented a Turing machine using only spreadsheet formulas to prove that spreadsheet formulas are Turing complete.
You cannot say that spreadsheets are not powerful, because now we have proof that they’re as powerful as any programming language out there.
The third reason that spreadsheets are really source code is that they suffer from typical software engineering problems. For example, consider these facts:

These are typical software engineering problems. These are the problems that we struggled with in the ’70s and ’80s, when we realized that software we built on the mainframe a decade ago was still being used. So spreadsheets face the same problems that software does.

Applying a Software Engineering Approach

We know how to solve a lot of source code problems. Modern IDEs have all sorts of embedded features like debuggers, testing, code metrics and other tools that help us write responsible, error-free code.

The question is, then, If spreadsheets are code, could we apply software engineering methods to make them better?

Yes, we can.

Methods from software engineering are easily transferrable to the world of spreadsheets.

Let’s dive into the example of code smells: What code smell is this in the image below?

It’s called the feature envy smell.

That MethodXY is in class B, but it’s only using fields from class A. You can say, in a sense, that the method is envious of all the nice fields in class A, and it would make more sense to put that method in class A because that’s where the relevant fields are.

Conclusion: The design of this code is not optimal because MethodXY isn’t placed in the best location.

You can easily see how this would transfer to a spreadsheet. If you have a formula on a certain worksheet that only uses cells from a different worksheet, then that formula might be better suited on the first worksheet where all the original cells are located.

Modeling Our Spreadsheet Data for a Database

Maybe some of you are wondering, Where do graph databases come in? We are almost there, bear with me.

To analyze these code smells – what formulas are on what worksheet – we have to analyze the entire worksheet and save the information to a database in order to then query what cells and what formulas are not located in the optimal worksheet.

This is what our potential data model looks like:

Spreadsheet Code Smell Data Model Example

Let’s break down the model: You have a spreadsheet. It has worksheets and in each worksheet, there are cells.

Then, cells can have references, but the situations is more complicated than just simple references, as in =A1+A6. You can also have a reference that is a range, like SUM(A1:A5). Then, this range also has cells, which makes the model even more complicated.

Now that we have our model (complicated as it is), we’re ready to save it to a database.

Why SQL Databases Fall Short

When I was a young researcher, I didn’t think about what database would be most appropriate. Like most people, I just thought SQL was the only database.

In the beginning, using SQL was okay. For instance, writing the join for the number of worksheets within a spreadsheet was pretty simple. You just take the spreadsheets, and you count the number of worksheets like in the example below.

An SQL Query for the Number of Worksheets in a Spreadsheet

No problem, right? Totally smooth sailing.

Now, let’s write the join for the number of cells in a spreadsheet.

An SQL Query for the Number of Cells in a Spreadsheet

In this case, we had to do two joins – cells to worksheets and worksheets to spreadsheets. This is still reasonably okay, though.

But in order to detect the feature envy smell, we need to know the number of cells a given formula refers to. Then we need to calculate the ratio between which references are in the most local worksheet and which are located in other worksheets.

This should be a fairly simple query, right?

Not so much. The result is a query something like what we have below:

An SQL Query to Detect a Feature Envy Code Smell

Bit by bit, our query has become not-so-readable. Right now, I could probably figure this out, but if I looked back at it two weeks later, I wouldn’t necessarily know what it means anymore.

It gets even worse if you try to put this into SQL Server Management Studio.

SQL Server Management Studio Error Message

It says the query is so complicated that it can’t visualize the query for you. If your tools are telling you something like this, then you know you’re not on the right path.

Graphing Spreadsheet Data

This is where I found out about Neo4j.

I realized that spreadsheet information is actually very graphy. All the cells are connected to references to each other and they happen to be in a worksheet or on the spreadsheet, but that’s not really what matters. What matters is the connections.

So I started to think, maybe I should use Neo4j because it fits my domain a lot better.

This is what my graph database model looks like:

Each spreadsheet has worksheets; each worksheet has cells; cells each have connections between them; and those cell-to-cell connections can go through ranges (A4:A7) or be direct (A2). It’s a data model that’s totally decoupled from the idea of tables.

What about the horrible SQL query we had from before? Using Neo4j and Cypher, it turns into this:

The SQL Query Transformed into a Cypher Query

You just navigate from one cell to another with either a single direct hop or two hops to navigate a range. Awesome.

I must admit, though, that this Cypher query was not my first attempt. Why? Because it took me a while to get rid of my SQL addiction, to get rid of thinking in terms of tables and joins.

For example, here’s my first attempt at the same Cypher query above.

An Example of a Cypher Query that Was Too Much Like an SQL Query

For me, it was hard to start thinking ‘Cyphery’ after so many years working with SQL. This is essentially an SQL query expressed in Cypher because I was still thinking about joins in the database.

Conclusion

If we have to recap my talk in just a few points, here they are:

Most importantly, spreadsheets are code and not data. They are a serious programming language that is Turing complete, and they are running the world at this point.
I built a spreadsheet analysis tool and used SQL to store the relational information. But my queries got so large and complex that my programs couldn’t even visualize them any longer.
I switched to Neo4j which simplified my queries considerably, from 16 lines of SQL to a single line of Cypher.

If you want to learn more about my work you can go to my website, which includes the code smell detection tool, a refactoring tool and a test tool. My team (it’s not just me anymore) is essentially trying to build an IDE for spreadsheets. And if you want to connect, feel free to send me a tweet or drop me an email.

Inspired by Felienne’s talk? Click below to register for GraphConnect San Francisco on October 21, 2015 at Pier 27 to learn more about the emerging world of graph databases — from enterprise customer stories to hands-on training and workshops.

Register for GraphConnect