What You Don’t Know about Your Spreadsheets Is the Most Horrifying PartSpreadsheets often exist under the radar. If you ask a random company what software systems they use, it might take a day or so but they’ll eventually be able to tell you who uses what software in the company. This is a lot harder for spreadsheets. When I was in grad school, I worked with an investment bank doing spreadsheet research. On my first day, I went to the head of the Excel team. I said, ‘Hello, can I have a list of all your spreadsheets?’ There was no such thing. ‘We don’t have a list of all the spreadsheets,’ he said. ‘You could ask Frank in Accounting or maybe Harry over at Finance. He’s always talking about spreadsheets. I don’t really know, but I think we might have 10,000 spreadsheets.’ 10,000 spreadsheets was a gold mine of research, so I went to the IT department and conducted my first spreadsheet scan with root access in Windows Explorer. Within one second, it had already found 10,000 spreadsheets. Within an hour, it was still finding more, with over one million Excel files located. Eventually, we found 2.5 million spreadsheets. In short, spreadsheets run the world.
Spreadsheet Horror StoriesIf spreadsheets are so common in decision-making but no one really knows where they are, things are bound to go wrong – and they do. There is even a European spreadsheet risk interest group. It has a long list of spreadsheet horror stories. I’ve selected my three favorite ones to share with you: You would expect that with a gazillion pounds in budget the 2012 London Olympics team would invest in a very sophisticated ticketing system. They didn’t. They used spreadsheets to keep track of ticket sales. They made a small mistake in one of their spreadsheets causing one of their swimming stadiums to be overbooked by 10,000 tickets. There wasn’t any money lost but it was extremely embarrassing. The University of Toledo once lost $2.4 million because of a single typo in a spreadsheet, and Canadian power company TransAlta lost $24 million because of a classic copy-paste error in a spreadsheet.
The End of Spreadsheets?So, if spreadsheets are so bad, should we even use them anymore? In his viral blog post, professor Daniel Lemire said we should not use spreadsheets anymore. He pointed out the example of a few economic scientists who used a spreadsheet for an economic model which turned out to be wrong. Lemire said we shouldn’t use spreadsheets anymore because they’re horrible. First of all, I think this is pretty useless advice. People are using spreadsheets, and we can’t ignore that. It would be like if your house was on fire and someone told you, ‘Well, you shouldn’t have made it out of wood.’ You would respond, ‘Yes, I know, but my house is on fire, can you help?’ At this point, spreadsheet cynics would hand you a brick. We might agree that spreadsheets aren’t the right solution, but saying we should stop using them isn’t helping. If we didn’t use spreadsheets, we’d probably turn to software. And we know that software is always perfect. Nothing wrong ever happens with software – right? Right? In reality, we know that error rates in spreadsheets are similar to error rates in software. About one in a hundred lines of code (or formulas) will have an error. Human brains just aren’t better than that. Unless we get machines to program for us, we will always have similar error rates between spreadsheets and source code. The source of this problem is that spreadsheets are mislabeled. People think of spreadsheets as data whereas spreadsheets are actually a means of programming.
Spreadsheets Are CodeMy research motto is Spreadsheets are Code. Spreadsheets are not data; they’re code. There are three reasons why you should consider spreadsheets as actual pieces of programming:
- First of all, spreadsheets and source code are used for similar problems. Consider a typical investment calculation. You put in some parameters about principal, interest and length of investment and the calculator tool produces an output. You could solve a typical programming problem like this with Java, maybe a web app or something on an iPad – or you could do it in a spreadsheet. The problem space is that similar. So why do so many people use a spreadsheet for this kind of calculation instead of asking IT to create software for them? It’s because IT answers their request saying it will be done in 12 months and cost $3 million. In lieu of waiting – and in lieu of going past schedule and over budget – they use a spreadsheet, because the problem space is similar.
- Second, spreadsheets are just as powerful as other programming languages. I’ve gone a long way to make this point. In fact, I’ve implemented a Turing machine using only spreadsheet formulas to prove that spreadsheet formulas are Turing complete. You cannot say that spreadsheets are not powerful, because now we have proof that they’re as powerful as any programming language out there.
- The third reason that spreadsheets are really source code is that they suffer from typical software engineering problems. For example, consider these facts:
- Only one in three spreadsheets has a manual.
- An average spreadsheet is used by 12 different people.
- Spreadsheets have an average lifespan of five years.
Applying a Software Engineering ApproachWe know how to solve a lot of source code problems. Modern IDEs have all sorts of embedded features like debuggers, testing, code metrics and other tools that help us write responsible, error-free code. The question is, then, If spreadsheets are code, could we apply software engineering methods to make them better? Yes, we can. Methods from software engineering are easily transferrable to the world of spreadsheets. Let’s dive into the example of code smells: What code smell is this in the image below?
Modeling Our Spreadsheet Data for a DatabaseMaybe some of you are wondering, Where do graph databases come in? We are almost there, bear with me. To analyze these code smells – what formulas are on what worksheet – we have to analyze the entire worksheet and save the information to a database in order to then query what cells and what formulas are not located in the optimal worksheet. This is what our potential data model looks like:
Why SQL Databases Fall ShortWhen I was a young researcher, I didn’t think about what database would be most appropriate. Like most people, I just thought SQL was the only database. In the beginning, using SQL was okay. For instance, writing the join for the number of worksheets within a spreadsheet was pretty simple. You just take the spreadsheets, and you count the number of worksheets like in the example below.
Graphing Spreadsheet DataThis is where I found out about Neo4j. I realized that spreadsheet information is actually very graphy. All the cells are connected to references to each other and they happen to be in a worksheet or on the spreadsheet, but that’s not really what matters. What matters is the connections. So I started to think, maybe I should use Neo4j because it fits my domain a lot better. This is what my graph database model looks like: Cypher, it turns into this:
ConclusionIf we have to recap my talk in just a few points, here they are:
- Most importantly, spreadsheets are code and not data. They are a serious programming language that is Turing complete, and they are running the world at this point.
- I built a spreadsheet analysis tool and used SQL to store the relational information. But my queries got so large and complex that my programs couldn’t even visualize them any longer.
- I switched to Neo4j which simplified my queries considerably, from 16 lines of SQL to a single line of Cypher.
About the Author
Felienne Hermans , Assistant Professor, Delft University of Technology
Felienne Hermans is an assistant professor at Delft University of Technology, where she researches the application of software engineering methods to spreadsheets. On that, she wrote a number of publications and her Ph.D.dissertation. She is regularly asked to speakabout her research at conferences, including her GraphConnect Europe talk onNeo4j and spreadsheets.