Spreadsheets are for humans or machines. Not both!

Data practitioners blame a lot of problems on spreadsheets. They say that spreadsheets should be rectangular, perfectly formatted datasets with a single row of column headers and no untidyness like colours or notes. Machines can easily read and understand this sort of structure but problems occur when less technical people start making changes that machines can’t understand.

Here at Register Dynamics we regularly encounter spreadsheets produced by a wide range of users for a wide range of purposes. We’ve found that whilst some spreadsheets are neat and tidy, the reality of many spreadsheets is that they are tools by humans and for humans – people successfully run processes, operations and entire organisations from spreadsheets they’ve designed and built themselves. Naturally, they use all of the communication features available like grouping data, comments, colours, and highlights.

We’ve long since held the theory that spreadsheets designed for machines and spreadsheets designed for humans mix like oil and water – that is, not well – and all of the ire directed at spreadsheets comes from trying to apply techniques that work for one user group to the other.

So what is a spreadsheet, really? Is it a machine-readable format for sending structured data between databases? Or is it a communication and record-keeping tool, designed to flexibly hold structured information and communicate it back to a human visually? And crucially, how could we make it easy to extract data from a human-friendly spreadsheet automatically with a machine? We set our best data experts the mission of finding out.

We analysed open spreadsheets from across Government

We gathered 28,752 spreadsheets with over 121 million rows collectively (in Excel and CSV formats) from data.gov.uk to serve as our corpus of test data. We developed a spreadsheet analyser which enabled us to learn much about the structure of these files and create a model to understand how human-friendly or machine-friendly they were.

Here are our main findings:

Spreadsheets that are useful to humans are bad for machines. Human-friendliness and machine-friendliness are very different things. They are not compatible and are almost opposites of each other.
42% of spreadsheets could be difficult to interpret by a computer. Services that consume spreadsheets should expect data to be “untidy” about half of the time, and will need to handle some common issues to serve all of their users.
Of that 42%, about a quarter are CSV files. Whether it’s a Comma Separated Values (CSV) or Excel file, both formats can be difficult for a machine to process.
Eight key features humans use affect machine readability. Services that consume spreadsheets need to handle most of them to serve all of their users.

Spreadsheets are human-friendly or machine-friendly, not both

This grid shows a representative sample of 25 spreadsheets drawn from our 28k strong corpus. Each box is a heat map of the spreadsheet structure – a zoomed out, bird's-eye view that is grey where cells are empty and coloured where cells have data or content. In each case we are showing only the first 100 rows of the sheet (though only five have more rows than this).

The top few spreadsheets are useful for humans and have been structured in a visually simple way by their authors – your eye can spot semantic structure in the spreadsheet, even from this very reduced view. But the bottom spreadsheets appear to a human to seem messy or amorphous, and difficult to intuitively understand.

The colour scale shows which spreadsheets are easily understood by machines. The greener spreadsheets contain few features that hamper machine understanding, all the way up to the orange and red spreadsheets which are very difficult to process automatically. Only the dark green spreadsheets are considered “tidy” and will be easily consumable by popular data tools.

So we see a clear distinction between sheets made for humans and sheets made for machines. Many of the sheets which are more machine-unfriendly (those in red, orange or purple) may look nicely structured or simple to human eyes, whereas those that are more machine-friendly (those in dark green) look like intimidating walls of complicated data.

Our conclusion: human-friendly and machine-friendly spreadsheets are not the same nor are they compatible. Well-designed human-readable spreadsheets are often harder for machines to read. Understanding this is essential when trying to collect data produced by humans.

How we quantified machine unfriendliness

Our analysis and categorisation showed eight common features that people use to make spreadsheets easier on the human eye that also negatively affected machine processing:

Empty rows: adding extra rows in the middle of data, for splitting up groups of rows into more visually distinct chunks
Title row(s): adding a title above the column headers that explains what the sheet is about or contains
Merged cells: spanning sheet titles, column group headers or other non-data content across many cells to show humans that it applies to multiple columns
Empty top rows: gaps between the top of the file and the beginning of the tabular data
Empty bottom rows: gaps between the end of the tabular data and the bottom of the file.
Starting away from the top left cell: leaving padding columns or rows between the edge of the sheet and the content, which aids visual presentation
Subtitles: adding extra titles or more guidance information below the sheet title, which helps humans understand the content
Multiple tables: having multiple distinct but related tables on the same sheet, which may help humans compare and contrast data without flipping back and forth between sheets

So why do these features cause problems for machines? Machine readers are typically designed to interpret a single block of data with a single row of column names, and normally can’t handle data that is structured into sections, has gaps or headings, or uses merged cells. A machine reader encountering these features will in the best case raise an error requiring human involvement and in the worst case may silently misrepresent the data.

In our investigation, we assigned each sheet a Machine Unfriendliness (MU) score based on how many of the 8 features above we were able to detect. The score ranges from 0 (machine friendly) to 8 (human friendly).

Many spreadsheets are unfriendly (and it’s not just Excel)

The chart above shows the distribution of machine unfriendliness across the 28k data.gov.uk spreadsheets we collected, for all sheets (purple, 1st bar) and individually for CSV (light green, 2nd bar) or Excel (dark green, 3rd bar) spreadsheets.

There are some really interesting features in the chart which challenge common understanding about how people use spreadsheets:

42% of all spreadsheets aren’t easily machine readable. A spreadsheet with a score of 0 will be easily understood by a machine. Anything more than that and a spreadsheet will require either manual transformation or a more specialist data processing tool.
Excel sheets are more likely to be made for humans. Nearly two thirds (64%) of Excel sheets have an MU score greater than zero. There is a clear peak in the distribution around an MU score of 3 (16%) and 4 (21%) meaning that an Excel sheet is more likely to have a few human-friendliness features than none at all.
20% of CSVs still have issues that hinder machine processing. It’s commonly believed that whilst Excel spreadsheets created by humans can be messy and hard for machines to understand, CSVs are a more standardised data format that are only for machines. In fact, 1 in 5 of CSVs contain features for humans that a standard spreadsheet parser will struggle to deal with.

If you’re trying to collect data from people using spreadsheets, there’s no easy way around the problem: nearly half of your users will have features in their spreadsheet that require special handling, and requiring CSVs instead of Excel sheets will not make the problem go away.

We’ve spoken before about why trying to get users to wrangle their data to your spreadsheet template doesn’t work, frustrates your users, and results in low data quality. The successful answer is to build services that work with users' data as it is. So what features do you need to handle in your data processing tools? Is accounting for empty rows and titles enough?

Getting machines to read unfriendly spreadsheets is not easy

You might imagine that updating your machine reader to handle a couple of common issues would make most spreadsheets readable, but our data shows that’s not true.

Unfortunately, there aren’t one or two features that make up the bulk of machine unfriendly spreadsheets. Across our spreadsheet corpus there’s a fairly even distribution of the eight different features that comprise our MU score. Tackling the single most-common problem (empty rows) by itself increases the number of sheets you can read, but only by a modest 8%.

If you were going to attempt to tackle the issues one by one, what would be the most efficient order? We analysed what the best order to address features would be, based on which feature delivers the biggest improvement over the last in the number of spreadsheets that are now readable. If that were your strategy, this would be the most efficient order to handle them in:

“Tidy” spreadsheets with an MU score of zero accounts for 58%
Empty rows → allows machines to read 66% (+8%) of all sheets
Merged cells → 68% (+2%)
Empty bottom rows → 71% (+3%)
Multiple tables → 78% (+7%)
Subtitles → 86% (+6%)
Title rows → 94% (+8%)
Data not in A1 → 95% (+1%)
Empty top rows → 100% (+5%)

So, no feature is significantly more valuable than any other, and the biggest gains are only unlocked when you’ve covered four to six features in total. That’s because whilst most machine-unfriendly sheets only have three or four of the eight MU features, they don’t all use the same three or four. Making human-friendly spreadsheets available to machines requires a level of commitment to addressing all of the eight MU features.

Help is available in extracting data from human-friendly sheets

Summary

Our analysis of over 28k government spreadsheets has led us to understand that:

Human-friendly and machine-friendly spreadsheets are not the same, nor are they compatible.
Gathering data from human-designed spreadsheets requires handling at least eight different human-friendliness features. A spreadsheet is likely to have three or four of them present, and they will be a different three or four from sheet to sheet.
Reading the majority of human-designed spreadsheets requires understanding where tabular data starts and ends, and successfully identifying and ignoring all of the presentational features such as titles, subtitles and merged cells.

We recognise that this is a hard challenge! Successfully reading human-friendly spreadsheets is not something that many teams trying to deliver services can commit to, which is why so many services fall back to using templates.

Solution

We’ve previously suggested that reusable data tools should help with this problem, and now we’re building them. We’re working on an open-source Data Upload Design Kit that can handle all of the human-friendliness features we’ve identified. It’s a drop-in pattern and plugin that allows government services to process both machine-friendly and human-friendly spreadsheets in a user-centred way, without writing any code themselves. If you find yourself needing to gather data from your users, give it a try and send us feedback!

What’s next for this research? As well as expanding our analysis corpus to more than 300k spreadsheets available on data.gov.uk and gov.uk, we’d like to examine more niche human-friendliness features, such as colours and formatting with semantic meaning, categorising sheets as guidance or reference lists, and prevalence of date formats. If you’d like to get involved with that work or have suggestions, please get in touch with us on Bluesky!