Dedupe rules are a very useful feature of CiviCRM, but one that can cause a fair bit of confusion to new or less technical users. The documentation on them is fairly spartan — sufficient for developers or experienced users, but not necessarily enough for those who may need a bit more easing in, so this post is intended to help new users get up to speed. As with my other blog posts, this one is brought to you by Freeform Solutions, a not-for-profit organization providing IT consulting to other not-for-profits.
The first thing you may be wondering is: just what are dedupe rules, and why do we need them? The short answer: “dedupe” is a contraction of de-duplication — that is, the process of preventing, or fixing, duplication of contacts.
Duplicate contacts can arise in a number of different ways — maybe a member makes a donation without having logged into your site as a member, so their contact information is recorded a second time with their donation, even though it’s already in there from their membership. Or maybe a volunteer is entering names and e-mail addresses from a signup sheet, and some of them were already in the system from past involvement with your organization. Remembering to always check before entering anyone’s contact info to see if they are already in your system is a nuisance, and even if you do, you might miss them in a search if there was a slight difference in the information — maybe their e-mail address has changed, or they used their full name one time and a nickname another time.
So it’s best to have a way of automating these checks — and configuring them intelligently to have the best chance of catching duplicates and allowing them to be merged, with the least chance of accidentally merging contacts that aren’t actually the same.
And that’s exactly what CiviCRM’s dedupe rules are for. They’re a built-in feature of CiviCRM, so they’re always present. Even if you haven’t configured them for your site yet, or even looked at them, CiviCRM comes with some basic rules preconfigured — but those rules may not be what you want, so it’s best to get familiar with how they work, so that you can set up rules that will do exactly what you want.
To get started, go to Contacts > Find and Merge Duplicate Contacts. Note that unlike a lot of admin settings, this one is not found under the Administer menu, but the Contacts menu.
The first thing you’ll notice here is that the rules are divided into three groups: Individual Rules, Organization Rules and Household Rules. If you’ve spent much time entering or otherwise handling contacts in CiviCRM, you probably already understand these groups, and even if you haven’t, they’re pretty self-evident. At the simplest level, individual contacts are people, organization contacts are companies or other groups of people, and household contacts are places where people live. So two people living together would be two individual contacts sharing a relationship to a common household contact (their home address). Two co-workers would be individual contacts sharing a relationship to an organization contact (their workplace). And you can set up different rules for different types of contacts.
The next thing you’ll notice is that for each type of contact, you have an unsupervised rule, a supervised rule, and possibly one or more general rules. So, what are those?
As I mentioned earlier, you will probably want to change the preconfigured rules that come with CiviCRM — each site’s needs may be different, due to the types of contact data they handle, so it’s best to be able to have full control over how your rules work. I’ll give you some examples of suggested rules for different circumstances later on.
Each of the three categories — individual, household and organization — gets to have one unsupervised and one supervised rule. They can’t have more than one of either of those, but they can have as many general rules as you want.
If you create a new rule, and mark it supervised or unsupervised, it will replace the existing rule of that type for that category, and the previous rule will be made a general rule instead.
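If it helps to see that replacement behaviour spelled out, here is a small Python sketch of it. This is only an illustration of the logic described above, not CiviCRM's actual code; the `add_rule` function, the dictionary keys and the rule names are all made up for the example.

```python
def add_rule(rules, name, contact_type, usage):
    """Add a rule to the list.

    If the new rule is Supervised or Unsupervised, any existing rule with
    the same usage for the same contact type is demoted to General first,
    so each contact type keeps at most one rule of each kind.
    """
    if usage in ("Supervised", "Unsupervised"):
        for rule in rules:
            if rule["contact_type"] == contact_type and rule["usage"] == usage:
                rule["usage"] = "General"
    rules.append({"name": name, "contact_type": contact_type, "usage": usage})
    return rules


# Hypothetical starting point: one rule of each automatic kind for individuals.
rules = [
    {"name": "Default unsupervised", "contact_type": "Individual", "usage": "Unsupervised"},
    {"name": "Default supervised", "contact_type": "Individual", "usage": "Supervised"},
]

add_rule(rules, "Email AND First Name AND Last Name", "Individual", "Unsupervised")

for rule in rules:
    print(rule["name"], "->", rule["usage"])
```

After the call, the old unsupervised rule shows up as General, the supervised rule is untouched, and the new rule is the only unsupervised one for individuals.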
Some rules — notably the ones that come with CiviCRM — are marked as “Reserved”, meaning that you can’t delete them, or change anything about them other than their usage (i.e. whether they’re supervised, unsupervised or general). In fact, you can’t even see most of their settings. But you can replace them with other rules, as we mentioned.
When you click one of the “Add rule” buttons, you’ll see a form with the following fields:
This is where things begin to get a bit complex, but also highly customizable. Each field can be assigned a weight, which counts against a weight threshold that you assign to the rule as a whole. This allows you to have a very fine degree of control over exactly how the various fields work together.
In brief, a field’s weight represents how important it is in calculating whether two contact records might be duplicates. The weight threshold determines how much similarity there has to be before the records are flagged as duplicates — and it uses the field weights to calculate that. So how does it work in practice? Let’s look at a few examples:
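To make the arithmetic concrete, here is a minimal Python sketch of the idea: add up the weights of the fields that match, and flag the pair if the total reaches the threshold. The function names, field names and the weights (10 and 5) are assumptions chosen for illustration, and the comparison is a plain exact match, which is a simplification of whatever CiviCRM does internally.

```python
def match_score(contact_a, contact_b, field_weights):
    """Sum the weights of every field whose values are present and equal."""
    total = 0
    for field, weight in field_weights.items():
        value_a = contact_a.get(field)
        value_b = contact_b.get(field)
        if value_a and value_a == value_b:
            total += weight
    return total


def is_possible_duplicate(contact_a, contact_b, field_weights, threshold):
    """A pair is flagged when its score reaches the rule's weight threshold."""
    return match_score(contact_a, contact_b, field_weights) >= threshold


# Illustrative rule: e-mail alone is enough, phone alone is not.
weights = {"email": 10, "phone": 5}
threshold = 10

a = {"email": "pat@example.org", "phone": "555-0100"}
b = {"email": "pat@example.org", "phone": "555-0199"}
c = {"email": "other@example.org", "phone": "555-0100"}

print(is_possible_duplicate(a, b, weights, threshold))  # True: e-mail matches, 10 >= 10
print(is_possible_duplicate(a, c, weights, threshold))  # False: only phone matches, 5 < 10
```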
As mentioned earlier, which rules will work best, especially for the unsupervised rules, depends a lot on what kind of contact data you handle.
If your contacts consist only of unrelated individuals, as might be the case for a professional association, for example, then it’s probably safe to assume that no two people will have the same e-mail address. In that case, you could stick with a very simple unsupervised rule, such as one that checks just e-mail address, with the field weight and threshold being the same.
On the other hand, if you’re likely to have couples and families amongst your contacts, who might be sharing the same e-mail account, then just using e-mail is not a foolproof method. Similarly, phone number and address might be the same. In this case, the rule will have to be more complex. For example, you could have one called “Email AND First Name AND Last Name”, which requires all three of those fields to match (see the settings below):
Note that in this case, we haven’t used any length limit on the first name because it might happen that a couple by the names of Jane Doe and John Doe use the same email address, and you don’t want them to be automatically merged just because their names start with the same letter. We’ll do things differently in our supervised rule, as you’ll see in a moment. By the way, if you’re wondering why the weights are all different in the above example, instead of just making them all 10 or something, compare it to the supervised rule example that follows: the field weights are the same ones used there, which makes it easy to switch from an “and” rule to an “or” rule just by changing the threshold. This can be useful when you’re still experimenting with your rules, trying to see which formula will work best.
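As a rough sketch of how an “and” rule like this plays out, the Python below uses three different weights and sets the threshold to their sum, so that all three fields must match. The weights (10, 7 and 3) are hypothetical values picked for the illustration, not the settings from the original screenshot, and the exact-match comparison is a simplification.

```python
# Hypothetical weights; the threshold equals their sum, so every field must match.
WEIGHTS = {"email": 10, "last_name": 7, "first_name": 3}
AND_THRESHOLD = sum(WEIGHTS.values())  # 20


def score(contact_a, contact_b):
    """Add up the weights of the fields on which the two contacts agree exactly."""
    return sum(
        weight
        for field, weight in WEIGHTS.items()
        if contact_a[field] == contact_b[field]
    )


jane = {"first_name": "Jane", "last_name": "Doe", "email": "thedoes@example.org"}
john = {"first_name": "John", "last_name": "Doe", "email": "thedoes@example.org"}
jane_again = {"first_name": "Jane", "last_name": "Doe", "email": "thedoes@example.org"}

# Shared e-mail and last name, but different first names: 10 + 7 = 17, below 20.
print(score(jane, john), score(jane, john) >= AND_THRESHOLD)

# All three fields match: 10 + 7 + 3 = 20, which reaches the threshold.
print(score(jane, jane_again), score(jane, jane_again) >= AND_THRESHOLD)
```

With these numbers, Jane and John are left alone even though they share an e-mail address, while the second Jane record is treated as a duplicate.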
For supervised rules, while again it depends on the needs of your specific collection of contacts, one option that works well for many sites is to check for either a matching e-mail, or first initial and last name (first initial rather than full first name to catch variants of the same name, as with the Catherine/Cathy example we used earlier - it’s safe to do that with this rule, as we’ll have a chance to say yes or no to the match). We used something similar as an example of complex rules in the previous section. Remember, with supervised rules, false positives are less of an issue because you get to approve or reject each match. So, to set up a rule called “Email OR (First Initial AND Last Name)”, you could use the following settings:
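Here is a small Python sketch of how an “or” rule of this shape could work. It assumes illustrative weights of 10 for e-mail, 7 for last name and 3 for first name, a length limit of 1 on the first name, and a threshold of 10, so that either a matching e-mail or a matching initial plus last name is enough. These numbers and names are made up for the example and are not the settings from the original screenshot.

```python
# Each entry is (field, length limit or None, weight). All values are hypothetical.
RULE_FIELDS = [
    ("email", None, 10),
    ("last_name", None, 7),
    ("first_name", 1, 3),  # length limit 1: compare only the first initial
]
OR_THRESHOLD = 10  # e-mail alone (10) or initial + last name (3 + 7) reaches it


def score(contact_a, contact_b):
    """Sum weights of fields that match, comparing only up to the length limit."""
    total = 0
    for field, length, weight in RULE_FIELDS:
        value_a = contact_a[field][:length]  # [:None] keeps the whole string
        value_b = contact_b[field][:length]
        if value_a and value_a == value_b:
            total += weight
    return total


catherine = {"first_name": "Catherine", "last_name": "Smith", "email": "catherine@example.org"}
cathy = {"first_name": "Cathy", "last_name": "Smith", "email": "cathy.smith@example.com"}
robert = {"first_name": "Robert", "last_name": "Smith", "email": "rsmith@example.org"}
kate = {"first_name": "Kate", "last_name": "Jones", "email": "catherine@example.org"}

for other in (cathy, robert, kate):
    total = score(catherine, other)
    print(other["first_name"], total, total >= OR_THRESHOLD)
```

Cathy is flagged on initial plus last name, Kate is flagged on e-mail alone, and Robert (last name only) stays below the threshold. Since this is a supervised rule, each flagged pair would still be shown to a person to accept or reject.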
We’ve already explained the use of supervised and unsupervised rules — both of those run automatically when contact information is entered, with unsupervised rules running when members of the public enter info in the front end of the site, and supervised rules running when staff enter info in the back end. As a reminder, unsupervised rules merge contacts they detect as duplicates automatically, while supervised ones display a pop-up that asks whether or not you want to merge the contacts in question.
But what about general rules? For those, just click the “Use rule” link next to the rule you want to use.
Note that you can also click “Use rule” for supervised or unsupervised rules, to have them search through the entire database or a specific group. Why would you want to do that if they already run automatically when you enter contact info? Because you may have changed the rule setup since you entered the contacts, in which case you might want to apply the new improved rules to all the contacts you’ve already entered, so as to weed out any duplicates your old rules didn’t catch.
However, running supervised rules manually like that may not be the best idea, since they’re typically defined broadly enough that running them on your entire database at once will probably bring up a huge number of false positives. They’re meant for suggesting possible duplicates while you’re entering one contact record at a time, rather than using en masse.
On the next screen, you can choose whether you want to apply the rule to all contacts, or just to a specific group. Choose which you want, and click Continue. One note: for large databases, you may want to start with a subset of contacts, instead of all contacts, to speed up this process and prevent the site from running out of memory, because de-duping is fairly resource-intensive. If you get a white screen or an error, just start over, but this time select a smaller set of contacts to run the rule against.
If the rule detects any matches in the group you’ve chosen, you will be presented with a list of possible duplicates. For each pair of possible matches, the names will be shown linked to their contact records, so that if you’re not sure, you can open each contact record and see if they look similar enough to you that you want to merge them.
Once you’ve decided, go back to the original tab, and click either “merge” or “not a duplicate”. If you choose to merge, you’ll have some decisions to make, which we’ll deal with in the next section.
So you’ve decided to merge two contacts. As you’ll see when you click that link, it’s not quite as simple as just clicking “merge” and moving on. Instead, you’re going to be presented with a table showing the information on file for each contact.
CiviCRM will try to guess which record is the original (the one it assumes you’ll want to keep) and which is the duplicate to be deleted. The one it considers the original will be shown on the right, and the duplicate on the left. If you decide the one on the left is closer to what you want, you can click the link that says “Flip between original and duplicate contacts” and they’ll be reversed.
Once you have them in the order you want, you can choose the fields (if any) for which you want to move data from the duplicate (the one on the left) into the original (the one on the right) before the duplicate is deleted. This option is there because sometimes the two records may have partly different information, and you’ll want to be able to determine which is kept. Click the checkboxes in the middle column (which appear in the middle of a somewhat awkwardly-rendered arrow) to mark any fields for which you want the data from the one on the LEFT to be kept. Any that are not marked will keep the data from the one on the RIGHT instead.
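If the left/right logic is easier to follow as code, the following Python sketch mimics it: the result starts as a copy of the original (right-hand) record, and only the fields you ticked are overwritten with the duplicate's (left-hand) values. The `merge_contacts` function and the sample records are hypothetical, and a real merge in CiviCRM also deals with related data that this sketch ignores.

```python
def merge_contacts(duplicate, original, fields_to_move):
    """Return the record that survives a merge.

    duplicate      -- the left-hand record, deleted after the merge
    original       -- the right-hand record, which is kept
    fields_to_move -- fields whose checkbox was ticked in the middle column
    """
    merged = dict(original)  # unticked fields keep the right-hand values
    for field in fields_to_move:
        if field in duplicate:
            merged[field] = duplicate[field]  # ticked fields take the left-hand value
    return merged


duplicate = {"first_name": "Cathy", "last_name": "Smith", "email": "cathy@new.example.org"}
original = {"first_name": "Catherine", "last_name": "Smith", "email": "catherine@old.example.org"}

# Only the e-mail checkbox is ticked, so the name stays as it was on the right.
merged = merge_contacts(duplicate, original, fields_to_move=["email"])
print(merged)
```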
Of course, how much data there is to merge depends on the complexity of your database — the example below shows two records with just the very basics (name and e-mail). But in practice, the contacts in your database may have groups, activities, memberships and more associated with them, so you may have quite a bit more information to review and decide whether to merge.
In general, when merging contacts, it’s highly recommended to merge their activities, memberships, groups, etc. However, with memberships, you should be sure to select the “create new record” option, because the same person may have multiple memberships from different years.
When you’re ready, you can click any of the buttons on the bottom, depending on what you want to do now:
Another use for dedupe rules is during data imports. Sometimes you may want to import a set of contacts from a CSV file, perhaps during the process of moving from some other contact management system to CiviCRM. You can do this under Contacts > Import Contacts.
A full tutorial on importing contacts would be a whole other blog post — for this one, I’m just going to note that there is a section under “Import Options” where you can choose a dedupe rule to apply to the contacts being imported. That will cause that rule to be applied automatically to all incoming contacts in the CSV file.
Note that contacts will be merged automatically according to this rule, so don’t make it too broad! Make sure you use a rule that won’t result in any false positives.
You’ll notice a button at the top of the Find and Merge Duplicate Contacts page that says “View Dedupe Exceptions”. When you click on it, it may not be immediately apparent what’s going on there, as it will show you a blank table (if you haven’t done much merging of contacts yet), or a list of contacts if you have, but no indication of how they got on that list!
The answer is: remember how any time a possible duplicate is detected (either via a supervised rule, or running a general rule manually), you can click on either “merge” or “not a duplicate”? Every time you click “not a duplicate”, that pair of contacts is stored in the dedupe exceptions list. So that button is there in case you ever want to view the list of exceptions that have been saved.
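Conceptually, the exceptions list can be pictured as a set of contact pairs that should not be offered as duplicates again. The Python below is a sketch of that idea only; the function names and contact IDs are invented, and it assumes that a saved exception suppresses the pair in later results regardless of which contact is listed first.

```python
# Each exception is an unordered pair of contact IDs.
exceptions = set()


def mark_not_duplicate(id_a, id_b):
    """Remember that these two contacts are not duplicates of each other."""
    exceptions.add(frozenset((id_a, id_b)))


def remaining_candidates(candidate_pairs):
    """Drop any candidate pair that was previously marked 'not a duplicate'."""
    return [pair for pair in candidate_pairs if frozenset(pair) not in exceptions]


candidates = [(101, 205), (101, 310), (412, 413)]

# The order of the IDs does not matter when recording the exception.
mark_not_duplicate(205, 101)

print(remaining_candidates(candidates))  # [(101, 310), (412, 413)]
```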
So, hopefully you’ve now got a better grasp on the use of dedupe rules. If you still have questions, please feel free to post your question as a comment here, or in the CiviCRM Community Forums, or contact Freeform Solutions for help.