Find and Merge Duplicate Contacts Optimization

Opublikowane
2010-08-10 16:49
Written by
jbertolacci - member of the CiviCRM community - view blog guidelines

Being able to efficiently identify and merge duplicate contacts and related data is be an important -- and often time-consuming -- task for organizations getting data from multiple sources. The current CiviCRM dedupe and merge process does not scale well beyond a moderate number of contacts and consequently there has been dicussion on the forums and IRC about optimizing the dedupe and merge code in CiviCRM.

Niro Solutions and the International Mountain Bicycling Association (IMBA) and have jointly made funds available to sponsor this work and we are opening up the project scope for discussion before work begins. The aim is to to have the enhanced functionalty in core for the CiviCRM 3.2.2 release.

Donald Lobo laid out the high-level goals of the project:

  • The ability to scale the dedupe process to at least one million contact records
  • Significantly improve import/event registration/contribution/other workflows that use dedupe
  • Be extensible and pluggable
  • Provide caching to prevent expensive recalculations

To Lobo's list I would add:

  • Automated batch dedupe and merge processing
  • Option for merging memberships of same type on merged contacts

To dig down a bit deeper on the scalability issues, Dave Hansen-Lange offered up a proof of concept for optimizing the dedupe rule selects. In addition to the SQL optimization, caching would be especially helpful for the find dedupes -> merge screen, so a dedupe is not repeated. Lobo and team will implement other enhancements that will allow scalability to the one million contact record goal. Read more of team's comments on the Scalability board.

Once the dedupe and merge code is optimized, automation is the next step. I propose we add a new "Find and Merge Batches" screen which would allow a user to name a batch, schedule a reocurring time period, select a dedupe rule to run at that time, and specify a "Batch Merge Template" to automatically merge the duplicates found using the dedupe rule.

The new "Batch Merge Template" would be nearly identical to the current merge duplicates screen when merging a single pair of contacts. It would have the data fields for the original (oldest) contact, the duplicate (newer) contact, and the far right column would give options for how to handle data, but instead of applying the merge to just one pair of duplicates the template would be named and saved to be used on multiple of matches as part of an automated "Find and Merge Batch" as described above.

Alternatively, a "Batch Merge Template" would be available to apply to multiple duplicate contact sets from the current find dedupes -> merge screen.

Often there will be multiple duplicates for a given contact, so the the template should be applied to each pair of duplicates in sequence. If contact id 2000 is a duplicate of contact id 1000, and 3000 is a duplicate of 2000, if the surviving contact from the first pair is contact id 1000, then when the duplicate pair containing contact id 3000 is reached, 3000 should be merged with 1000 by the rules in the template.

The final minor piece to this project is to have the option to merge memberships of the same membership type when merging contacts. Currently when two contacts are merged, if each contact has a membership of the same type, the resulting merged contact will have two memberships of identical types, but with different dates and status. This can cause trouble with membership renewal notices and break CiviCRM Membership to Drupal Roles sync.

Just like there is an "add new" option for emails and addresses when merging contacts or creating a new merge template, and option for "add new" could be added for memberships when a membership of the same type is present on each contact. The default behavior (when "add new" is not selected) would be to merge the two memberships into one. Memberships not of the same type would be brought over as-is.

How do others feel about these enhancements? Are we on the right track?
Filed under

Comments

Of course automation is desirable... deduplicating is dull work. But my experience is that it's essential for a human who knows the contacts to do the merge.

I imagine that, if we were to do the analysis, the duplicates that were most amenable to deduplication are those created during event registration, etc (ie, the contacts created without human intervention). These are also the duplicates that should be not created in the first place.

My vote would be that the effort proposed for automating deduplication would be better spent on preventing duplicates in the first place.

Ken

That's a good idea, but we probably need to introduce some extra rules to loose as little as possible, eg.

- always merge the relationships+participant+membership+...,
- always consider the higher contact id as the duplicate (unless a user is attached to the contact)
- don't merge if the two contacts have external identifier
- ...

(ie. merging is a tricky business, doing it automatically without screwing up the db is not trivial)

And it should add a "automatic merge using rule X" in the log, so we can at least know what happened.

I am not sure a full brand new interface is necessary to set up the the batch. Wouldn't it be easier to mimic the other crons, where the frequency is defined in the crons and not in civicrm ?

ie: on the dedupe rules, add a new flag/column "runs automatically", and create a bin/dedupe.php that works so:
- without any parameter, runs all the "automated" dedupe rules (and it's put in the weekly cron for instance)
- with a parameter ruleid, runs only that rule (and it's put in the daily cron as you want to run that specific rule more often).

Maybe introduce a second threshold ? (ie beside the existing "if > 10 puts it the list", add a "if > 20 merge automatically" ?)

This being said, bit scared of having a dangerous tool like dedupe running in a cron. Given the fuck up we've had with imports, can imagine that one of the automated rules is going to be changed to a single "same country". The result is going to be lovely...

But love the idea of finding a way of simplifying dedupes.
X+

Quint (niezweryfikowany)
2010-08-11 - 07:33

As has been pointed out already, the de-dupe process is fraught with danger. I think it would be great to have some sort of db snapshot that can be restored, or some way of "undoing" a de-duplication. Of course the db should be backed up before doing this, but many organizations don't have the knowledge to do this, so making it dummy proof would be a very valuable, if very optional, feature.

2010-08-11 - 14:18

Dodane przez Quint (niezweryfikowany) w odpowiedzi na

if you would like to sponsor another project which allows backup and recovery of the entire DB that can be used in scenarios like this please contact us on IRC

snapshotting a DB is not a cheap operation especially on any reasonable sized DB

lobo

Hi,

I just want to flag that the Dedupe process (3.2) soft-deletes contacts that have been merged - ie. they are still in the DB which helps mitigate the dangers of this approach. Perhaps there is a case for some other flag to identify soft-deleted contacts as having been deleted in a merge.

Also, on the UI side of the suggestion - adding the UserID would be helpful

Eileen

If contacts are soft-deleted in 3.2, could a field in the contact table point to the contact id of the surviving contact? Then a tab in the contact summary could list all contacts which have been merged into the contact being viewed.

Anonymous (niezweryfikowany)
2010-08-13 - 08:20

There are cases when two individuals have the same last name and first name, and a simple dedupe rule based only on last and first names will always want to merge them.

This could be the case even for more complicated rules.

When a person is checking those duplicates, it would be nice to have a way to tell "Contact A is never a duplicate of contact B", so that this pair would not appear in any merge again, but if in any case a contact C appears with the same last and first name, both A and B would be listed as potential duplicates for contact C.

And add it into your dedupe rules ("not duplicate" 1000, and +1000 on your matching).

By default, that's empty so they both match, got 1000 in the total and the other rules apply

If you decide that John Doe and John Doe are not duplicate, put their internal id (or whatever that is going to be different between the two contacts). Then the rule on this field won't be applied, you won't add the 1000 to the total and the duplicates will never match (assuming the sum of the other criteria will never be > 1000).

Instead of a new custom field, you might want to use the external identifier.

(idea suggested by someone else, can't remember who I should credit)

Thanks for the suggestion of the work around - a checkbox on the (new and updated once it can be funded) Merge Screen so that we can just tick a load of them and hit a 'mark as NOT matches' would be a good move. Maybe it could also be set in the 'More Actions' list (yes i realise it is getting rather big already)

The dedupe process as I understand it has a basic mathematical problem in that it grows with the square of the number of constituents. It seems to me that the key problem to make this process scalable is to make this more like a linear or logarithmic growth.

It reminds me somewhat of an old algorithm described by Donalth Knuth called the soundex (http://en.wikipedia.org/wiki/Soundex). I don't know what it was originally designed for, but it converted a word to a magic number, kind of like an md5 type algorithm, but with the nice effect that words that sounded similar ended up close together in value.

Taking that approach: if we had an algorithm like this, then each constituent would be assigned a number to the effect that numbers close to each other would be similar in the ways defined by the rule. Then a dedupe process wouldn't have to compare each constituent against each other, but instead it would just sort all these magic constituent numbers and look for values that were close together. I'm guessing somewhat here, but I think that would result in a more linear growth of calculations, it certainly sounds more efficient. The key here is that an expensive part of the comparison calculation can be done independently of the other constituent, and therefore only done once.

The only challenge here is whether such an algorithm could be found, and how the current dedupe rules would translate in terms of what 'close together' might mean.

On the other hand, having written it out, it seems rather too obvious, so maybe this ground has already been covered...

At present there is no indication if one or both records have a Drupal User account - so the User ID would be a good step. Not sure how you can work this in to a Batch system unless you had the option to say 'keep the record that logged in most recently' - won't help if neither users have logged in but guess that isn't so much of a concern.

I haven't looked into this yet but one thing I have just hit is a situation where external modules that interact with CiviCRM are not being updated when a sync is done. Having a hook of some sort would be an appropriate response to this - not sure where it would be added - probably after the de-dupe is done. Or do existing hooks cover it adequately?