Dedupe Monitor: Easy Dedupe Scanning Across All Contacts

Published 2024-09-25 17:13
Written by AllenShaw, member of the CiviCRM community

Regular scanning for duplicates is an important part of good data hygiene, but it's certainly no fun; this extension does it for you in your sleep.

Duplicate contacts in your CRM are a source of trouble. They can lead to inaccurate reports, poor communications, and staff inefficiency.

CiviCRM has great features for limiting accidental creation of duplicates -- through its configurable Dedupe Rules -- and for merging duplicate contacts once you've identified them.

CiviCRM also provides a way to conduct periodic scans of your data to identify potential duplicates. Just pick any one of your existing Dedupe Rules, or configure a new one, and use it to scan all contacts in your system against all others.

Unfortunately, this scanning process can get to be pretty painful.

You might have experienced some of these problems already, especially if you've grown to have a substantial number of contacts:

  • Scanning the full set of contacts takes forever, and it can even lock up or crash your site.
  • You can of course set the scan to compare only a portion of your contacts, but breaking them up into small chunks is a tedious process.
  • The scanning itself is rather tedious; it's a bit of a chore to sit there waiting for the scan results to load in your browser -- and if you can't process them all in one sitting, you'll have to run that scan again later.

I've heard from numerous organizations that, owing to these challenges, tend to avoid scanning for duplicates on a regular basis. Meanwhile, the number of duplicates keeps growing, which just makes the scans take longer.

But what if we could get around these problems?

The Dedupe Monitor extension for CiviCRM aims to do just that.

When you're not looking, the extension applies your configured Dedupe Rules against all of your contacts, packages up any identified duplicate contacts into little batches, and alerts you if any are found.

Your staff can then review each of these batches and use CiviCRM's existing duplicate-merge features to either merge them or mark them as "not a duplicate."

The extension is careful not to lock up your site during these background scans, and it's much more convenient than sitting in front of a screen, scanning with each Dedupe Rule against small subsets of your contacts.
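If it helps to picture the general idea, here's a rough Python sketch of scanning in small chunks and packaging matches into review batches. To be clear, this is not the extension's actual code (the extension runs inside CiviCRM); the names `name_rule` and `scan_in_chunks`, the contact fields, and the chunk and batch sizes are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Contact = Dict[str, object]   # hypothetical shape: {"id": int, "first": str, "last": str}
Pair = Tuple[int, int]        # (contact id, possible duplicate id)


def name_rule(a: Contact, b: Contact) -> bool:
    """Hypothetical dedupe rule: first and last name match, ignoring case."""
    return (str(a["first"]).lower(), str(a["last"]).lower()) == (
        str(b["first"]).lower(),
        str(b["last"]).lower(),
    )


def scan_in_chunks(
    contacts: List[Contact],
    rule: Callable[[Contact, Contact], bool],
    chunk_size: int,
    batch_size: int,
) -> Iterator[List[Pair]]:
    """Compare one small chunk of contacts at a time against all contacts,
    yielding candidate duplicate pairs in batches of at most batch_size."""
    batch: List[Pair] = []
    # Assumption: in a real background job, each chunk would be handled in its
    # own scheduled run, so no single run holds the site up for long.
    for start in range(0, len(contacts), chunk_size):
        chunk = contacts[start:start + chunk_size]
        for a in chunk:
            for b in contacts:
                # Only compare against higher ids so each pair is reported once.
                if a["id"] < b["id"] and rule(a, b):
                    batch.append((a["id"], b["id"]))
                    if len(batch) == batch_size:
                        yield batch
                        batch = []
    if batch:
        yield batch  # final, possibly smaller, batch
```

Each yielded batch stands in for one of the "little batches" a staff member would review; a caller could loop over `scan_in_chunks(contacts, name_rule, chunk_size=500, batch_size=20)` and store each batch for later review.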

It also provides a dashlet titled "Dedupe Monitor Alert" (shown above), available for display on the CiviCRM front page.

This dashlet will inform you of any batches awaiting review; if no batches currently exist, it will reassure you that "You're all caught up!"

This extension has been in use by several organizations in production for over a year now. After presenting a demo at the recent UK Sprint 2024, I've been encouraged by the positive feedback to release it for public use.

If you're struggling to stay on top of duplicate contact records, I encourage you to give this extension a try, and to comment with "@twomice" (that's me) in the extensions channel on chat.civicrm.org with any feedback you may have.

You can find the Dedupe Monitor extension here.

Comments

Thanks for sharing Allen! This looked great when you demoed it. We'll give it a whirl and let you know how we get on.

Nice work Allen. I particularly like the small batches. We have a lot of groups that dedupe their entire database, deadlocking it for hours. Very thoughtful design.

Thanks for this Allen - I'm gonna have a play. I see it has unit tests!!

On the issue of large batches crashing your site: we are currently running a patch in production, from an open PR (https://github.com/civicrm/civicrm-core/pull/30591), that has made a huge difference for us (from an unknown number of hours down to around 2 minutes). As most readers will know, our database is fairly large, so our subset of 5000 contacts finds nearly 25k full first name + last name matches when you do a first + last name search, most of which are NOT actually duplicates.

I haven't been pushing anyone to help me get that PR merged because I have some thoughts coming out of https://lab.civicrm.org/dev/core/-/issues/5433 that I was going to work through.

I also listed the dedupe extensions I found on that GitLab issue, so I will add a note about your extension to it.

This is super exciting, I know many folks who will benefit from this and am excited to spread the word!