Why merging services is so difficult

Those of us of a certain age may remember a time when new Web-based services had been springing up for a while - and heavyweights like Google started buying them up. We were all too familiar with receiving emails saying how our favourite services were now part of the Google family, and we had to complete a few simple steps to connect our logins with our Google Accounts.

The reason for this, of course, is that Google’s main business is behaviourally profiling us as users - and in order to do that, they need to build a single picture of us, from our various interactions. If you have separate accounts on two different services, they want to work out that those two accounts are actually the same person, so they can combine data from those services into a single user profile.

If you’re involved in a merger between services due to corporate acquisitions, new partnerships, or just a need to link together different user-facing services within a larger organisation, you will have the same basic need to match records across different datasets.

It seems like a simple task, but it can be a lot harder than it seems…

The cost of mistakes

Before you can decide how to proceed, you need to estimate the cost of making a mistake. And there are two kinds of mistake you can make.

False negatives are when you fail to match two records that refer to the same person. At the very least, this means that you can’t link data for analytics purposes, which just reduces the accuracy of your insights, but if the data is being merged for important service purposes, this can cause problems for users - things tied to one account, that are valued by the user, won’t appear in their other account. If it’s a credit balance, or some kind of points, or a record of a qualification or other achievement, they may not be too happy. This might cost you manpower in a customer service role clearing the problem up, or may lose you a customer. In the absolute worst cases, if a system contains links to criminal records and the system it’s being linked to is used to recruit people for positions of trust, it could lead to catastrophic consequences…

False positives, on the other hand, are when you match two records that don’t actually refer to the same person. Again, if you’re just doing analytics, they may not be particularly dangerous, but if you are using the data for service purposes, they are usually very expensive. At the very least, your service will behave incorrectly for the affected users, and if data from the records is ever shown or sent to users, you may reveal the private data of one user to another, mistakenly believing they’re the same person - and that usually carries direct penalties.

You need to work out the costs of false negatives and false positives in your case, because different approaches to matching records will have different likelihoods of either - so you need to know the relative costs to decide which approach is right for you.
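One way to make that comparison concrete is to price each approach by the errors you expect it to produce. The sketch below is purely illustrative: the `Approach` class, the `expected_cost` helper, the error rates and the per-error costs are all made-up assumptions, not figures from any real merger.

```python
from dataclasses import dataclass


@dataclass
class Approach:
    name: str
    false_positive_rate: float  # assumed share of candidate pairs wrongly merged
    false_negative_rate: float  # assumed share of candidate pairs wrongly left unlinked


def expected_cost(approach, pairs, cost_fp, cost_fn):
    """Expected total cost of errors across `pairs` candidate record pairs."""
    return pairs * (
        approach.false_positive_rate * cost_fp
        + approach.false_negative_rate * cost_fn
    )


# Hypothetical numbers: strict matching rarely merges wrongly but misses more.
strict = Approach("strict", false_positive_rate=0.001, false_negative_rate=0.10)
loose = Approach("loose", false_positive_rate=0.02, false_negative_rate=0.01)

# Hypothetical costs: a privacy breach (false positive) costs far more
# than a support call to fix a missed link (false negative).
for approach in (strict, loose):
    cost = expected_cost(approach, pairs=10_000, cost_fp=500.0, cost_fn=20.0)
    print(f"{approach.name}: {cost:.0f}")
```

With these invented figures the strict approach comes out cheaper, because false positives are priced so much higher; swap the costs around and the conclusion flips, which is exactly why you need your own numbers.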

Finding links

The first step is to directly compare the two databases to find possible matches based on known data. Perhaps you have an email address in both databases - a fairly common thing to store about users, and reasonably good for matching identities. Perhaps you have a name and a date of birth. Perhaps you have some third-party identifier such as a national tax or social security ID.

You might think it’s a simple matter of matching records based on such information - especially if it’s something nice and immutable like a social security number - but this is fraught with difficulties:

You might not have full data to begin with (eg, records missing some of the IDs).
You might already have duplicates in the databases themselves (eg, the same ID value appearing in multiple records in the same database).
Data may be incorrect, either due to mistakes in data entry or outright attempts at fraud by users.
Data might be outdated: people’s names and email addresses change, and even national identifiers might change in extreme cases such as witness protection schemes.
Identifiers might be less unique than you imagine. Personal email addresses are often shared by couples, and professional addresses may refer to a role that is handed on to another employee, or to a team.
Identifiers might be hard to compare. Is “johndoe@gmail.com” the same person as “John Doe@gmail.com”? Is “John Doe” the same person as “Jonathan Döe”? This is compounded when identifiers are stored differently in different databases. Should a record with firstname="John" middlename="Percival" lastname="Doe" match with a record with name="John P. Doe"?
Those IDs may have subtleties you’re not aware of to begin with. The Internet is full of articles titled Falsehoods Programmers Believe about… and you really shouldn’t be processing any kind of important data without having found out what common misunderstandings others have documented for you!
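Some of the comparison problems above can be reduced by normalising values before comparing them. The following is a minimal sketch, not a complete solution: the function names are hypothetical, and the rules (ignore case, whitespace, accents and punctuation; let an initial stand for a full name) are assumptions you would need to check against your own data. Note that the local part of an email address is technically case-sensitive, even though most providers ignore case.

```python
import unicodedata


def normalise_email(email):
    """Remove all whitespace and lower-case (assumes case-insensitive mailboxes)."""
    return "".join(email.split()).lower()


def normalise_name(name):
    """Strip accents and punctuation, lower-case, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(cleaned.lower().split())


def name_from_parts(first, middle, last):
    """Join separately stored name fields into one comparable string."""
    return normalise_name(" ".join(part for part in (first, middle, last) if part))


def _token_matches(x, y):
    """Tokens match if equal, or if one is an initial of the other."""
    if x == y:
        return True
    short, long_ = sorted((x, y), key=len)
    return len(short) == 1 and long_.startswith(short)


def names_compatible(a, b):
    """True if nothing in the two names contradicts them being the same."""
    ta, tb = normalise_name(a).split(), normalise_name(b).split()
    if len(ta) < 2 or len(tb) < 2:
        return False  # too little information to compare
    if not (_token_matches(ta[0], tb[0]) and _token_matches(ta[-1], tb[-1])):
        return False
    middle_a, middle_b = ta[1:-1], tb[1:-1]
    if len(middle_a) != len(middle_b):
        return True  # one side omits middle names; nothing contradicts
    return all(_token_matches(x, y) for x, y in zip(middle_a, middle_b))


print(normalise_email("John Doe@gmail.com") == normalise_email("johndoe@gmail.com"))
print(names_compatible(name_from_parts("John", "Percival", "Doe"), "John P. Doe"))
print(names_compatible("John Doe", "Jonathan Döe"))
```

Under these rules the two email spellings and the two "John P. Doe" forms are treated as compatible, while "John Doe" and "Jonathan Döe" are not. Whether that last answer is right is a judgement call about your data, and "compatible" is only evidence towards a match, never proof of one.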

Generally, at this stage you can classify possible matches into two buckets - “Matched” (where the correspondence in the data makes you confident that the odds of a false positive are low enough to be acceptable given its cost) and “Maybe” (where you’re not confident enough to risk a false positive, but too confident to discard the match and risk a false negative). If false positives are very expensive, you might only match on very definite correspondence of multiple identifiers; if you can take a slightly looser attitude to false positives, you might develop a scheme where matches earn “points” for how closely values correspond (maybe even using sounds-like comparisons for names) and set point thresholds for “Maybe” and “Matched” results.
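A points scheme of this kind might look something like the sketch below. Everything specific in it is an assumption for illustration: the field names (`email`, `dob`, `name`), the weights and the two thresholds. It also uses the standard library's `difflib.SequenceMatcher` as a stand-in for name similarity; that measures spelling similarity, not sounds-like similarity, so a phonetic algorithm would need to be swapped in if that is what you want.

```python
from difflib import SequenceMatcher

# Assumed thresholds: tune these against your own cost estimates.
MATCHED_THRESHOLD = 8
MAYBE_THRESHOLD = 4


def _norm(value):
    """Treat missing values as empty; ignore case and surrounding spaces."""
    return (value or "").strip().lower()


def score_pair(a, b):
    """Award points for each field on which two records correspond."""
    points = 0
    email_a, email_b = _norm(a.get("email")), _norm(b.get("email"))
    if email_a and email_a == email_b:
        points += 5
    dob_a, dob_b = a.get("dob"), b.get("dob")
    if dob_a and dob_a == dob_b:
        points += 3
    name_a, name_b = _norm(a.get("name")), _norm(b.get("name"))
    if name_a and name_b:
        similarity = SequenceMatcher(None, name_a, name_b).ratio()
        if similarity == 1.0:
            points += 3  # identical names
        elif similarity >= 0.8:
            points += 1  # close spelling only
    return points


def classify(a, b):
    """Sort a candidate pair into "Matched", "Maybe" or "No match"."""
    points = score_pair(a, b)
    if points >= MATCHED_THRESHOLD:
        return "Matched"
    if points >= MAYBE_THRESHOLD:
        return "Maybe"
    return "No match"


left = {"email": "jd@example.com", "dob": "1980-01-01", "name": "John Doe"}
right = {"email": "JD@example.com", "dob": "1980-01-01", "name": "john doe"}
print(classify(left, right))
```

With these weights, no single identifier is enough for "Matched": an email match alone lands in "Maybe", and it takes a second corresponding field to cross the higher threshold. Raising the thresholds trades false positives for more work in the "Maybe" pile.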

Duplication inside the existing databases, or false positives, may result in a record in one database matching multiple records in the other. If you think source-data duplication is likely, you can use this opportunity to try to merge records within the databases as well as merging across the databases; if not, or if the cost of false positives is too high, you can downgrade those matches to “Maybe”s.
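Spotting those one-to-many matches is mechanical once you have a list of matched pairs. This sketch assumes the matches are held as `(id_in_a, id_in_b)` tuples, and `downgrade_ambiguous` is a hypothetical helper name; it keeps only pairs where both records appear exactly once and sends everything else back to the “Maybe” pile.

```python
from collections import Counter


def downgrade_ambiguous(matches):
    """Split matched pairs into (confident, maybes).

    A pair stays confident only if neither of its records appears in any
    other matched pair.
    """
    count_a = Counter(a for a, _ in matches)
    count_b = Counter(b for _, b in matches)
    confident, maybes = [], []
    for a, b in matches:
        if count_a[a] == 1 and count_b[b] == 1:
            confident.append((a, b))
        else:
            maybes.append((a, b))
    return confident, maybes


matches = [
    ("a1", "b1"),  # clean one-to-one match
    ("a2", "b2"),  # a2 matches two records in the other database
    ("a2", "b3"),
    ("a3", "b4"),  # b4 is matched by two records
    ("a4", "b4"),
]
confident, maybes = downgrade_ambiguous(matches)
print(confident)
print(len(maybes))
```

If you instead decide the duplicates are genuine, the same counts tell you which records within a single database are candidates for merging with each other.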

But hopefully, after this exercise, you have a bunch of “Matched” records you can merge together! Yippee!

But what about those “Maybe”s?

Handling the Maybes

So, we have some possible matches between records that we can’t confidently merge, but which we can’t afford to just ignore and risk being false negatives.

And if we only had somewhat suspect identifiers to compare on originally, we might not be able to find any confident matches at all, only “Maybe” matches. This is the situation companies like Google found themselves in when merging accounts for multiple services, as just matching on email address wasn’t reliable enough.

What comes next? Well, there are three broad classes of approach to take.

1. Find external matching data

Sometimes you can find some third database that might help you better link your records. This might be expensive, and even if it isn’t, introducing a third database (with its own scope for inaccuracy and being out of date) adds more chances for error, so it’s best to consult it only in those cases where the primary data wasn’t able to confirm or reject a match.

2. Ask the user

If we have contact details in either or both datasets, we can contact the users and ask them to identify themselves from the other dataset. In the case of merging online services, both datasets will have some kind of login credentials, so this can be done by asking the user to log into both services from the same Web session, thereby allowing us to unambiguously link them. The risk is that somebody who holds the user’s login credentials for one service, whether stolen or shared, is now identified as them on the new combined service, with access to both sets of data. (Think about the implications if a bank combined accounts with an insurance provider, for instance - users might have been willing to share login details for home insurance with an abusive partner, but they don’t want them to have access to their bank accounts.) Some users will not get around to performing this process, so their matches will remain, effectively, false negatives until they respond to the requests and complete it. Others may forget they already have an account in one system and, thus, never link that account.

But most importantly, this bothers the end user by asking them to perform boring tasks - and that might not be acceptable.

3. Manual inspection

The last resort is to train a team of temporary staff to make sufficiently good human value judgements about the quality of matches, and set them sorting through the Maybe pile for manual classification. This runs the risk of human error, and if the consequences of false positives or negatives are lucrative to criminals, it also runs the risk of inviting corruption of the temporary staff.

And sometimes, possible matches just can’t be resolved with the information available; so you may still have to resort to asking the user for maybes that even manual inspection can’t resolve.

Now what happens?

Now you’ve identified records to merge, you need to do that. And you have two questions to face - how to implement that internally, and how to present it to the users.

Implementing the merge

Perhaps you can change the implementations of your services to store common data about the user in a new, shared, database; but perhaps that’s too much work, or they are third-party software components you’re not able to change. Putting common data in a shared location avoids duplication and results in a much simpler system going forward, as duplicating the data in two places curses you to forever keep them synchronised - and deal with them getting inconsistent when those systems fail for whatever reason.
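As a rough illustration of the shared-location option, the sketch below keeps one shared profile per person and a link table mapping each service's own user ID to it. The table names, column names, service names and helper functions are all hypothetical, and an in-memory SQLite database stands in for whatever store you would really use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shared_profile (
    profile_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    email      TEXT NOT NULL
);
CREATE TABLE account_link (
    service       TEXT NOT NULL,
    local_user_id TEXT NOT NULL,
    profile_id    INTEGER NOT NULL REFERENCES shared_profile(profile_id),
    PRIMARY KEY (service, local_user_id)
);
""")


def link_accounts(conn, name, email, accounts):
    """Create one shared profile and attach each (service, local id) to it."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT INTO shared_profile (name, email) VALUES (?, ?)",
            (name, email),
        )
        profile_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO account_link (service, local_user_id, profile_id) "
            "VALUES (?, ?, ?)",
            [(service, user_id, profile_id) for service, user_id in accounts],
        )
    return profile_id


def profile_for(conn, service, local_user_id):
    """Look up the shared details for one service's own user ID."""
    return conn.execute(
        "SELECT p.name, p.email FROM shared_profile p "
        "JOIN account_link l ON l.profile_id = p.profile_id "
        "WHERE l.service = ? AND l.local_user_id = ?",
        (service, local_user_id),
    ).fetchone()


pid = link_accounts(
    conn, "John Doe", "jd@example.com", [("orders", "42"), ("billing", "u-7")]
)
with conn:
    conn.execute(
        "UPDATE shared_profile SET email = ? WHERE profile_id = ?",
        ("john@example.com", pid),
    )
print(profile_for(conn, "orders", "42"))
print(profile_for(conn, "billing", "u-7"))
```

Because both services read the same row, a single update is visible to both, and there is no second copy to drift out of sync. The primary key on the link table also stops one service account from being attached to two profiles.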

What to tell the users?

Given that your services are now sharing data, it might be least confusing for users if they appear to merge into a single, larger, service. This might be achieved without actually combining them, if both services can be modified to have consistent look and feel, have shared navigation, and use the same login system so the user can just follow links between them seamlessly.

But sometimes, the services are really quite separate in the users’ minds, even if they benefit greatly from sharing the data behind the scenes. In which case, you might want to keep them as clearly separate services. You might even maintain separate login details for each service - because users will get confused if the same login details “work” on two different login screens (and telling them that’s OK invites them to type the same login details into scammers’ fake login screens, too). Or you might have them share the login system, by making them both part of a larger “meta-service” that users log into and then navigate from to the desired service, or by following a “Log into the FooBar Ordering Service using your unified FooBar account” pattern, like how people can log into services using their Facebook accounts.

A subtler issue is how to handle the common user details shared between the services. Sure, for read access, it’s no problem that multiple services know your name once you’ve logged in - but do you still maintain “Update my details” screens in every service to modify them, or create a separate “My Profile” service that users must visit to update cross-service data? The latter is a potential source of confusion for users - but having multiple different interfaces that update the same data might also be confusing.

Conclusion

Merging service databases is complicated - and we’ve only looked at user accounts here (variants on the same problems crop up when merging data about buildings, companies, livestock, books, products, and many more, but user accounts are the most common instance of this problem). There are a lot of decisions to be made that don’t have a right answer; this is a world of tradeoffs. Finding the best tradeoffs to make requires a deep understanding of the subtleties, so if you are facing these challenges, get in touch with us and tap into our years of experience!

Author

Alaric Snell-Pym

Alaric is an engineer specialising in understanding complex problems and producing simple solutions. They have a wide range of experience implementing everything from line of business systems to distributed databases comprising thousands of nodes.