Untangling the mess: How to stop being overwhelmed by your data

Do you really know what data you have, and where it all is? Are you confident about how to interpret the data you have? And when you want some data, do you know where to look?

If so, congratulations! Please share your good practices that help you keep on top of it all. For everyone else, here are our best practices for not being overwhelmed by data.

It’s complexity that matters, not volume

Don’t worry about the size of your data in bytes or records. Computers are generally good at managing large files just as well as small files. As soon as a file or table has too many records to scroll through in a sitting, the burden on you to understand it is the same - whether it’s a thousand records or a trillion.

The challenge of managing data comes from the number of tables and the number of columns, so that’s the metric you want to try and keep down.
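
As a rough illustration of tracking that metric, here is a minimal sketch that counts tables and columns. It assumes a SQLite database purely for convenience (most other database systems expose the same information through their own catalogue views), and the example schema is hypothetical.

```python
import sqlite3


def complexity(conn: sqlite3.Connection) -> tuple[int, int]:
    """Return (number of tables, total number of columns)."""
    tables = [
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
        if not name.startswith("sqlite_")  # skip SQLite's internal tables
    ]
    columns = 0
    for table in tables:
        # PRAGMA table_info returns one row per column of the table.
        columns += len(conn.execute(f'PRAGMA table_info("{table}")').fetchall())
    return len(tables), columns


# Hypothetical example schema, held in memory.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    """
)
table_count, column_count = complexity(conn)
print(f"{table_count} tables, {column_count} columns")
```

Recording these two numbers over time gives a simple signal of whether the data landscape is growing faster than anyone can keep up with.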

Avoid unnecessary complexity

So the first step is to keep the number of tables and columns down.

1. Remove old things

Don’t keep outdated data around. Sure, when you migrate from an old data format or database to a new one, there’s the temptation to keep the old one around in case you realise you missed something in the migration. But you can put such “dead but just might be needed again one day” data into some kind of archival storage (which is often cheaper!), or at least clearly label it as stale and outdated. That way you avoid the danger of somebody coming across this stale data and thinking it’s current - or, at best, having to waste time finding out it’s stale and disregarding it in their search for current data.

But don’t forget any legal or ethical obligations you have to retire outdated data - you shouldn’t keep archived old things longer than you have a need, and legal right, to do so. Archiving things is just a temporary step to keep them around in case of emergency; they still need to be deleted promptly.

2. Duplicate only when necessary

Often, data will be duplicated in a system as a kind of cache to improve performance, or to maintain system operation in the case of component failure. This is fine, as long as the duplicates are clearly marked as such (see the next point about documentation). However, sometimes data gets duplicated because of a lack of understanding - somebody might not be aware that some data is already in the system somewhere, and start recording it themselves. This happens when people are already overwhelmed with the amount of data in the system, so it creates a vicious circle.

3. Audit what you have

In a team with multiple people, and even more so in an organisation with multiple teams, not everybody can keep on top of what everyone else is doing, so it’s natural for team members to have an outdated mental model of what’s where. Although good documentation means that they can look things up when they’re not sure, it’s still important for at least some team members to occasionally review the overall data landscape: identify areas that are getting overcomplicated, find duplication that has snuck in, and make sure the overall documentation is up to date.
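
Part of such an audit can be automated. The sketch below, again assuming SQLite and a hypothetical schema, lists column names that appear in more than one table. A shared name is only a hint that the same data may be recorded twice, not proof, so the output is a starting point for a human review. The `ignore` parameter is an assumption of this sketch, for names such as `id` that are expected everywhere.

```python
import sqlite3
from collections import defaultdict


def shared_column_names(conn, ignore=frozenset({"id"})):
    """Map each column name used in more than one table to those tables."""
    tables = [
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
        if not name.startswith("sqlite_")
    ]
    seen = defaultdict(list)
    for table in tables:
        for row in conn.execute(f'PRAGMA table_info("{table}")'):
            column = row[1].lower()  # row[1] is the column name
            if column not in ignore:
                seen[column].append(table)
    return {col: sorted(tbls) for col, tbls in seen.items() if len(tbls) > 1}


# Hypothetical schema in which an email address is recorded in two places.
audit_conn = sqlite3.connect(":memory:")
audit_conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, postcode TEXT);
    CREATE TABLE newsletter_signups (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    """
)
candidates = shared_column_names(audit_conn)
print(candidates)
```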

4. Document the complexity you have

Documentation is vital to keeping track of any non-trivial organisation’s data, but that definitely doesn’t mean weighty, verbose manuals that are always out of date.

The most important place to put documentation is in the names of fields and tables (and of higher-level structures, such as schemas or datasets). Everything should be clearly named, so people can tell what they’re looking at, at first glance. However, that doesn’t mean names need to be long and descriptive - names need to be short so they can be easily read. If the subtle nuances of a table or field’s meaning can’t fit into a nice short name, then choose a name that captures the most important part of its meaning, and put the rest into a description.

Most modern database systems let you store a longer description alongside a table or field, where you can put a sentence or two - no more than a paragraph - of descriptive text that’s too long to fit in the name. Don’t just repeat the name. A table called “User_Addresses” with a description of “Addresses of users” only shows that somebody had to tick an “All tables must have descriptions” box in a checklist somewhere. It does nothing to help, and wastes everyone’s time and mental effort when they read the description and find nothing of value in it. If the name contains everything you need to know, leave the description blank.
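
A check like this can be sketched in a few lines. The function below flags descriptions that merely restate the name, using a deliberately naive word comparison; the filler-word list and the plural handling are assumptions of this sketch, not a complete solution.

```python
import re

# Assumed filler words that add no meaning to a description.
FILLER = {"of", "the", "a", "an", "for", "table", "list"}


def words(text):
    """Lower-case words in text, with a naive trailing-'s' plural strip."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens}


def adds_nothing(name, description):
    """True if a non-blank description only repeats words from the name."""
    if not description.strip():
        return False  # a blank description is fine
    extra = words(description) - words(name) - FILLER
    return not extra


print(adds_nothing("User_Addresses", "Addresses of users"))
print(adds_nothing("User_Addresses", "Postal addresses only, not billing"))
```

The first call flags the redundant description from the example above; the second description passes because it says something the name does not.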

Make sure there’s an easy way to browse your organisation’s databases and find those names and descriptions. If the tools you use to interact with data don’t let you easily find and explore it, consider finding better tools, or at least supplement them with something that automatically and regularly generates standalone documentation from the metadata in the database.
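
Generating such documentation does not need much machinery. Here is a minimal sketch, assuming SQLite and a hypothetical schema. SQLite has no built-in description field, so this sketch takes descriptions from a plain dictionary keyed by `table` or `table.column`; in a system that stores descriptions natively you would read them from its catalogue instead.

```python
import sqlite3


def render_docs(conn, descriptions=None):
    """Return a plain-text listing of every table, column, type and description."""
    descriptions = descriptions or {}
    names = [
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        if not name.startswith("sqlite_")
    ]
    lines = []
    for table in names:
        lines.append(table)
        if table in descriptions:
            lines.append(f"  {descriptions[table]}")
        # table_info rows are (cid, name, type, notnull, dflt_value, pk).
        for _cid, col, col_type, *_rest in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            note = descriptions.get(f"{table}.{col}", "")
            lines.append(f"  - {col} ({col_type})" + (f": {note}" if note else ""))
        lines.append("")
    return "\n".join(lines)


# Hypothetical schema and descriptions.
docs_conn = sqlite3.connect(":memory:")
docs_conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
doc = render_docs(
    docs_conn,
    {"users.email": "Login address, unique per user"},
)
print(doc)
```

Run on a schedule, with the output published somewhere everyone can reach, a script along these lines keeps the browsable documentation in step with the database itself.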

Some important information isn’t specific to a single table, field or whatever, because it’s more “overall” in nature or pertains to a relationship between those objects, rather than any particular one of them. If there isn’t a natural place to document that inside the database itself, then you might need some kind of shared editable guide such as a wiki or a shared document. This might also be a good place to give top-level information to new starters, such as how to log into database systems and who to ask for help.

5. Communicate

Finally, the most impactful way to share information about data with the other users in the organisation is a regular, well-attended show-and-tell session, where teams that have added new kinds of data to the system can spend five to ten minutes explaining their change and what it means. While documentation exists as a reference forever more, a show-and-tell session gets everyone thinking about the impact this new data can have on their own work - and then it gets them talking in the questions afterwards.

This shouldn’t just be for technical staff who will be hands-on with the data; anybody with an interest in data, including your most senior stakeholders, should be there so they’re up to date and involved in the conversations.

These sessions should happen at least weekly so that new things can be shared promptly. If you don’t usually have enough new development to show and tell each week, roll the session into an existing weekly meeting that already has the right people present.

Also see: How to encourage better data sharing

Conclusion

Try to keep your data landscape as simple and tidy as is practical - and then use the tools available to you to make it easy to explore and keep the team updated with changes to it. Following these practices will keep your organisation’s data under control, a beautiful garden to explore instead of a tangled jungle of thorns that everyone is lost in!

We hope you find these pointers practical and helpful for handling your organisation’s data more effectively without getting overwhelmed. For more support with this, we would be delighted to hear from you, or feel free to learn more about how we support people with their Data management and Data discovery and cataloging.

Author

Alaric Snell-Pym

Alaric is an engineer specialising in understanding complex problems and producing simple solutions. They have a wide range of experience implementing everything from line of business systems to distributed databases comprising thousands of nodes.