Understanding CiviCRM Dedupe Rules

Published
2014-07-31 08:08
Written by
spidersilk - member of the CiviCRM community

Contents:

  1. Introduction
  2. Understanding the Structure of Dedupe Rules
  3. Creating Dedupe Rules
  4. Basic Dedupe Rule Attributes
  5. Field Weights, and Weight Threshold
  6. Some Recommended Rule Recipes
  7. Using Dedupe Rules
  8. Merging Contacts
  9. Using Dedupe Rules when Importing Contacts into CiviCRM
  10. Dedupe Exceptions
  11. Tips for Setting up Effective Dedupe Rules
  12. In Conclusion

1. Introduction

Dedupe rules are a very useful feature of CiviCRM, but one that can cause a fair bit of confusion to new or less technical users. The documentation on them is fairly spartan — sufficient for developers or experienced users, but not necessarily enough for those who may need a bit more easing in, so this post is intended to help new users get up to speed. As with my other blog posts, this one is brought to you by Freeform Solutions, a not-for-profit organization providing IT consulting to other not-for-profits.

The first thing you may be wondering is: just what are dedupe rules, and why do we need them? The short answer: “dedupe” is a contraction of de-duplication — that is, the process of preventing, or fixing, duplication of contacts.

Duplicate contacts can arise in a number of different ways — maybe a member makes a donation without having logged into your site as a member, so their contact information is recorded a second time with their donation, even though it’s already in there from their membership. Or maybe a volunteer is entering names and e-mail addresses from a signup sheet, and some of them were already in the system from past involvement with your organization. Remembering to always check before entering anyone’s contact info to see if they are already in your system is a nuisance, and even if you do, you might miss them in a search if there was a slight difference in the information — maybe their e-mail address has changed, or they used their full name one time and a nickname another time.

So it’s best to have a way of automating these checks — and configuring them intelligently to have the best chance of catching duplicates and allowing them to be merged, with the least chance of accidentally merging contacts that aren’t actually the same.

And that’s exactly what CiviCRM’s dedupe rules are for. They’re a built-in feature of CiviCRM, so they’re always present. Even if you haven’t configured them for your site yet, or even looked at them, CiviCRM comes with some basic rules preconfigured — but those rules may not be what you want, so it’s best to get familiar with how they work, so that you can set up rules that will do exactly what you want.

To get started, go to Contacts > Find and Merge Duplicate Contacts. Note that unlike a lot of admin settings, this one is not found under the Administer menu, but the Contacts menu.

Screenshot showing dedupe rules menu location, in Contacts menu

 

2. Understanding the Structure of Dedupe Rules

The first thing you’ll notice here is that the rules are divided into three groups: Individual Rules, Organization Rules and Household Rules. If you’ve spent much time entering or otherwise handling contacts in CiviCRM, you probably already understand these groups, and even if you haven’t, they’re pretty self-evident. At the simplest level, individual contacts are people, organization contacts are companies or other groups of people, and household contacts are places where people live. So two people living together would be two individual contacts sharing a relationship to a common household contact (their home address). Two co-workers would be individual contacts sharing a relationship to an organization contact (their workplace). And you can set up different rules for different types of contacts.

Screenshot showing main screen for dedupe rules

The next thing you’ll notice is that for each type of contact, you have an unsupervised rule, a supervised rule, and possibly one or more general rules. So, what are those?

  • Unsupervised rules are rules that are automatically checked when a user enters contact info into a form on the front end of the site (for example, when registering for an event, making a contribution, etc.). If a match is found, CiviCRM automatically merges the new contact into the old one, without checking with the user. Because of this, unsupervised rules should be defined very narrowly and carefully, so as to avoid accidental merges of contacts that are not actual duplicates (“false positives”). So, for example, you would not want to have an unsupervised rule that matched on first and last name alone, because then if two people named, say, Jane Smith submitted information to your site, it would merge them even if one of them was Jane C. Smith and lived in Vancouver while the other was Jane Q. Smith and lived in Nova Scotia. The merge would happen automatically, without any human being involved in the process or even notified.
  • Supervised rules, on the other hand, are rules that are automatically checked when an admin or staff member enters info into the back end of the site (i.e., the admin interface). They do not automatically merge the contacts; they just trigger a pop-up suggesting that the contact may be a duplicate. The person entering the data can choose whether or not to merge them, so these can be defined fairly broadly. For example, in this case you can have a rule that matches on first and last name, because a quick glance at the two Jane Smiths’ information, to return to the example I just used, would show that they are not the same person, and you could then choose not to merge them.
  • General rules are not triggered automatically at all. They can be run manually from the "Find and Merge Duplicate Contacts" page, or selected to run during a data import, but they do not ever just run on their own. So these can be as general or as specific as you like.
     

3. Creating Dedupe Rules

As I mentioned earlier, you will probably want to change the preconfigured rules that come with CiviCRM — each site’s needs may be different, due to the types of contact data they handle, so it’s best to be able to have full control over how your rules work. I’ll give you some examples of suggested rules for different circumstances later on.

Each of the three categories — individual, household and organization — gets to have one unsupervised and one supervised rule. They can’t have more than one of either of those, but they can have as many general rules as you want.

If you create a new rule, and mark it supervised or unsupervised, it will replace the existing rule of that type for that category, and the previous rule will be made a general rule instead.
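If it helps to see that replacement behaviour spelled out, here is a small Python sketch of it. It only models the bookkeeping as described in this post; it is not CiviCRM's actual code, and the function name, dictionary keys and rule names are all made up for illustration.

```python
def add_rule(rules, new_rule):
    """Add new_rule to the list of rules.

    If new_rule is supervised or unsupervised, any existing rule with the
    same contact type and the same usage is demoted to a general rule,
    so each contact type keeps at most one rule of each of those usages.
    """
    if new_rule["usage"] in ("supervised", "unsupervised"):
        for rule in rules:
            same_type = rule["contact_type"] == new_rule["contact_type"]
            same_usage = rule["usage"] == new_rule["usage"]
            if same_type and same_usage:
                rule["usage"] = "general"
    rules.append(new_rule)
    return rules


# Hypothetical starting point: one unsupervised and one supervised rule.
rules = [
    {"name": "Old unsupervised rule", "contact_type": "Individual", "usage": "unsupervised"},
    {"name": "Old supervised rule", "contact_type": "Individual", "usage": "supervised"},
]

add_rule(rules, {
    "name": "Email AND First Name AND Last Name",
    "contact_type": "Individual",
    "usage": "unsupervised",
})

for rule in rules:
    print(rule["name"], "->", rule["usage"])
```

After the call, the old unsupervised rule shows up as general, the supervised rule is untouched, and the new rule is the only unsupervised rule for individuals.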

Some rules — notably the ones that come with CiviCRM — are marked as “Reserved”, meaning that you can’t delete them, or change anything about them other than their usage (i.e. whether they’re supervised, unsupervised or general). In fact, you can’t even see most of their settings. But you can replace them with other rules, as we mentioned.

 

4. Basic Dedupe Rule Attributes

When you click one of the “Add rule” buttons, you’ll see a form with the following fields:

Screenshot showing a sample dedupe rule

  • Rule Name: give it a descriptive name that makes clear what this rule does. It’s best to include in the name what fields that rule is using, as in the example above. See the preconfigured rules for examples.
  • Usage: unsupervised, supervised or general.
  • Reserved: use with caution! You can’t undo this once it’s done, so really, it’s better not to use it at all unless you really need to for some reason.
  • Field(s): this is where you get into the specifics. You can choose any of the fields available for CiviCRM contacts of that type, and can combine more than one field if you want — for example, checking on both name and e-mail. You might be wondering whether that means matching on both e-mail and name, or just one or the other — we’ll be dealing with that momentarily.
  • Field Length: this is an optional field, which you can use if you only want to match on part of a field, for example the first few letters of someone’s name. Why would you want to do that? Well, perhaps you want to allow for the possibility that some people might be in there twice, with two different forms of their first name: Cathy and Catherine, Will or William, etc. Obviously it wouldn’t get all nicknames, but it would get some. Another example would be to set a length of one for the middle name, in order to match on middle initial; that would have prevented the first-and-last-name rule we used as an example earlier from matching Jane C. Smith and Jane Q. Smith.
  • Field Weight, and Weight Threshold: this is a complex enough topic to require its own heading, so read on!
     

5. Field Weights, and Weight Threshold

This is where things begin to get a bit complex, but also highly customizable. Each field can be assigned a weight, which counts against a weight threshold that you assign to the rule as a whole. This allows you to have a very fine degree of control over exactly how the various fields work together.

In brief, a field’s weight represents how important it is in calculating whether two contact records might be duplicates. The weight threshold determines how much similarity there has to be before the records are flagged as duplicates — and it uses the field weights to calculate that. So how does it work in practice? Let’s look at a few examples:

  • First, a very simple one: if you just want to match on one field — for example, e-mail — just assign it any weight you want, and then make the weight threshold the same, so it either matches or it doesn’t.
  • Matching on two fields? Say, e-mail and phone number? You can do that two ways, which are sometimes described as “and” and “or”, for reasons that should become clear in a moment. Let’s say you assign each one a weight of 10. If you also set the weight threshold to 10, then it will register a match if either field matches — i.e. if the e-mail address OR the phone number are the same. But if you set the threshold to 20, it will count as a match only if both fields match — i.e. if the e-mail AND the phone number are the same — because it takes two 10s to make 20. Just matching one or the other would not meet the threshold.
  • You can do the same with three or more fields, if you want to get ambitious — with three fields, depending on how you set the threshold, you could have it match if one of the three matches, if two out of three fields match, or only if all three do.
  • Of course, the fields don’t all have to have the same weight. What if you want a rule that checks first name, last name, and e-mail, and counts it as a match if the e-mail matches, or if the first and last name do, but not if it’s just first name or just last name? For that, you could set the threshold to 20, and give e-mail a weight of 20 — but set first name and last name’s weights to 10. That way, matching on e-mail would hit the threshold — 20 — but for the names, it would take both first and last to hit the threshold.
  • And you can get far more complex if you want to, combining several different fields, all with different weights, so that there might be a variety of different combinations that would trigger the threshold. But we’ll leave that up to you to experiment with.
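The arithmetic in the examples above can be sketched in a few lines of Python. This is just an illustration of the idea (add up the weights of the fields that match, then compare the total to the threshold), not CiviCRM's own implementation; the function and variable names are invented for the example.

```python
def is_duplicate(matched_fields, weights, threshold):
    """Return True when the summed weights of the matching fields reach the threshold."""
    score = sum(weights[field] for field in matched_fields)
    return score >= threshold


# Two fields with equal weights: the threshold decides between "or" and "and".
weights = {"email": 10, "phone": 10}

or_match = is_duplicate({"email"}, weights, threshold=10)                 # one field is enough
and_one_field = is_duplicate({"email"}, weights, threshold=20)            # one field falls short
and_both_fields = is_duplicate({"email", "phone"}, weights, threshold=20)

# Unequal weights: e-mail alone, or first AND last name together.
name_weights = {"email": 20, "first_name": 10, "last_name": 10}

email_only = is_duplicate({"email"}, name_weights, threshold=20)
names_only = is_duplicate({"first_name", "last_name"}, name_weights, threshold=20)
first_only = is_duplicate({"first_name"}, name_weights, threshold=20)

print(or_match, and_one_field, and_both_fields)  # True False True
print(email_only, names_only, first_only)        # True True False
```

Playing with the numbers this way, on paper or in code, is a quick way to check which combinations of fields a rule will actually accept before you save it.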
     

6. Some Recommended Rule Recipes

As mentioned earlier, which rules will work best, especially for the unsupervised rules, depends a lot on what kind of contact data you handle.

If your contacts consist only of unrelated individuals, as might be the case for a professional association, for example, then it’s probably safe to assume that no two people will have the same e-mail address. In that case, you could stick with a very simple unsupervised rule, such as one that checks just e-mail address, with the field weight and threshold being the same.

Screenshot showing settings for an e-mail only rule, with field weight and threshold set to the same value

On the other hand, if you’re likely to have couples and families amongst your contacts, who might be sharing the same e-mail account, then just using e-mail is not a foolproof method. Similarly, phone number and address might be the same. In this case, the rule will have to be more complex. For example, you could have one called “Email AND First Name AND Last Name”, which requires all three of those fields to match (see the settings below):

Screenshot showing settings for an unsupervised rule that checks e-mail, first name and last name. Field weights are 15, 5 and 10 respectively, and threshold is 30.

Note that in this case, we haven’t used any length limit on the first name because it might happen that a couple by the names of Jane Doe and John Doe use the same email address, and you don’t want them to be automatically merged just because their names start with the same letter. We’ll do things differently in our supervised rule, as you’ll see in a moment. By the way, if you’re wondering why the weights are all different in the above example, instead of just making them all 10 or something, compare it to the supervised rule example that follows: the field weights are the same ones used there, which makes it easy to switch from an “and” rule to an “or” rule just by changing the threshold. This can be useful when you’re still experimenting with your rules, trying to see which formula will work best.

For supervised rules, while again it depends on the needs of your specific collection of contacts, one option that works well for many sites is to check for either a matching e-mail, or first initial and last name (first initial rather than full first name to catch variants of the same name, as with the Catherine/Cathy example we used earlier - it’s safe to do that with this rule, as we’ll have a chance to say yes or no to the match). We used something similar as an example of complex rules in the previous section. Remember, with supervised rules, false positives are less of an issue because you get to approve or reject each match. So, to set up a rule called “Email OR (First Initial AND Last Name)”, you could use the following settings:

Screenshot showing settings for a supervised rule that checks e-mail, OR first initial + last name. Field weights are 15, 5 and 10, as in the previous screenshot, but length of the first name is set to 1 and threshold is set to 15.
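To see how the two recipes behave with the same weights but different thresholds, here is a rough Python sketch. It is a simplified model rather than CiviCRM's real matching code: the helper names and sample contacts are invented, and comparing values case-insensitively is an assumption of the sketch.

```python
def field_matches(a, b, length=None):
    """Compare two values, optionally looking only at their first `length` characters."""
    if not a or not b:
        return False
    a, b = a.strip().lower(), b.strip().lower()
    if length is not None:
        a, b = a[:length], b[:length]
    return a == b


def rule_matches(rule, contact_1, contact_2):
    """Sum the weights of all matching fields and compare the total to the threshold."""
    score = sum(
        weight
        for field, weight, length in rule["fields"]
        if field_matches(contact_1.get(field), contact_2.get(field), length)
    )
    return score >= rule["threshold"]


# Each field is (name, weight, length); the weights are the ones from the recipes above.
UNSUPERVISED = {  # "Email AND First Name AND Last Name"
    "threshold": 30,
    "fields": [("email", 15, None), ("first_name", 5, None), ("last_name", 10, None)],
}
SUPERVISED = {  # "Email OR (First Initial AND Last Name)"
    "threshold": 15,
    "fields": [("email", 15, None), ("first_name", 5, 1), ("last_name", 10, None)],
}

cathy = {"first_name": "Cathy", "last_name": "Jones", "email": "cathy@example.org"}
catherine = {"first_name": "Catherine", "last_name": "Jones", "email": "cj@example.com"}
jane = {"first_name": "Jane", "last_name": "Doe", "email": "does@example.org"}
john = {"first_name": "John", "last_name": "Doe", "email": "does@example.org"}

print(rule_matches(UNSUPERVISED, jane, john))        # False: 15 + 10 = 25, below 30
print(rule_matches(SUPERVISED, cathy, catherine))    # True: 5 + 10 = 15
print(rule_matches(UNSUPERVISED, cathy, catherine))  # False: only last name matches
```

Jane and John share an e-mail address and a last name, but the unsupervised rule leaves them alone, while the supervised rule is loose enough to offer Cathy and Catherine for a human to review.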

 

7. Using Dedupe Rules

We’ve already explained the use of supervised and unsupervised rules — both of those run automatically when contact information is entered, with unsupervised rules running when members of the public enter info in the front end of the site, and supervised rules running when staff enter info in the back end. As a reminder, unsupervised rules merge contacts they detect as duplicates automatically, while supervised ones display a pop-up that asks whether or not you want to merge the contacts in question.

But what about general rules? For those, just click the “Use rule” link next to the rule you want to use.

Note that you can also click “Use rule” for supervised or unsupervised rules, to have them search through the entire database or a specific group. Why would you want to do that if they already run automatically when you enter contact info? Because you may have changed the rule setup since you entered the contacts, in which case you might want to apply the new improved rules to all the contacts you’ve already entered, so as to weed out any duplicates your old rules didn’t catch.

However, running supervised rules manually like that may not be the best idea, since they’re typically defined broadly enough that running them on your entire database at once will probably bring up a huge number of false positives. They’re meant for suggesting possible duplicates while you’re entering one contact record at a time, rather than using en masse.

On the next screen, you can choose whether you want to apply the rule to all contacts, or just to a specific group. Choose which you want, and click Continue. One note: for large databases, you may want to start with a subset of contacts, instead of all contacts, to speed up this process and prevent the site from running out of memory, because de-duping is fairly resource-intensive. If you get a white screen or an error, just start over, but this time select a smaller set of contacts to run the rule against.

Screenshot showing the screen where you select which group of contacts to apply the rule to

If the rule detects any matches in the group you’ve chosen, you will be presented with a list of possible duplicates. For each pair of possible matches, the names will be shown linked to their contact records, so that if you’re not sure, you can open each contact record and see if they look similar enough to you that you want to merge them.

Screenshot showing a list of matching contacts, with options to merge or not merge each pair

Once you’ve decided, go back to the original tab, and click either “merge” or “not a duplicate”. If you choose to merge, you’ll have some decisions to make, which we’ll deal with in the next section.

 

8. Merging Contacts

So you’ve decided to merge two contacts. As you’ll see when you click that link, it’s not quite as simple as just clicking “merge” and moving on. Instead, you’re going to be presented with a table showing the information on file for each contact.

CiviCRM will try to guess which is the original contact (the one it assumes you’ll want to keep) and which is the duplicate to be deleted. The one it considers the original will be shown on the right, and the duplicate on the left. If you decide the one on the left is closer to what you want, you can click the link that says “Flip between original and duplicate contacts” and they’ll be reversed.

Once you have them in the order you want, you can choose for which fields (if any) you want to move data from the duplicate (the one on the left) into the original (the one on the right) before the duplicate is deleted. This option is there because sometimes the two records may have partly different information, and you’ll want to be able to determine which is kept. Click the checkboxes in the middle column (which appear in the middle of a somewhat awkwardly-rendered arrow) to mark any fields for which you want the data from the one on the LEFT to be kept. Any that are not marked will keep the data from the one on the RIGHT instead.

Screenshot showing the screen for merging two contacts, with a list of non-matching fields and the option to choose which field to keep the data for in each case
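If you find it easier to think of the checkboxes in code terms, the logic is roughly the following. This Python sketch only models the choice described above (checked fields come from the duplicate, everything else stays as it is on the original); the function name and sample data are invented, and it is not how CiviCRM itself is written.

```python
def merge_contacts(original, duplicate, checked_fields):
    """Return the merged record.

    `original` is the contact being kept (shown on the right),
    `duplicate` is the one to be deleted (shown on the left), and
    `checked_fields` are the fields whose checkbox was ticked.
    """
    merged = dict(original)  # start from the original, leaving it unmodified
    for field in checked_fields:
        if field in duplicate:
            merged[field] = duplicate[field]
    return merged


original = {"first_name": "Catherine", "last_name": "Jones", "email": "cj@example.com"}
duplicate = {"first_name": "Cathy", "last_name": "Jones", "email": "cathy@example.org"}

# Only the e-mail checkbox is ticked, so only the e-mail comes from the duplicate.
merged = merge_contacts(original, duplicate, {"email"})
print(merged)
# {'first_name': 'Catherine', 'last_name': 'Jones', 'email': 'cathy@example.org'}
```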

Of course, how much data there is to merge depends on the complexity of your database. The screenshot above shows two records with just the very basics (name and e-mail). But in practice, the contacts in your database may have groups, activities, memberships and more associated with them, so you may have quite a bit more information to review and decide whether to merge.

In general, when merging contacts, it’s highly recommended to merge their activities, memberships, groups, etc. However, with memberships, you should be sure to select the “create new record” option, because the same person may have multiple memberships from different years.

When you’re ready, you can click any of the buttons on the bottom, depending on what you want to do now:

  • The first three are fairly self-evident: merge the records and go on to the next pair of matches, merge them and go back to the list, or merge them and view the new merged record.
  • “Cancel” will cancel the whole operation, with nothing being merged, and take you back to the list.
  • “Next” will move on to the next pair of matches without doing anything with this one.
     

9. Using Dedupe Rules when Importing Contacts into CiviCRM

Another use for dedupe rules is during data imports. Sometimes you may want to import a set of contacts from a CSV file, perhaps during the process of moving from some other contact management system to CiviCRM. You can do this under Contacts > Import Contacts.

A full tutorial on importing contacts would be a whole other blog post — for this one, I’m just going to note that there is a section under “Import Options” where you can choose a dedupe rule to apply to the contacts being imported. That will cause that rule to be applied automatically to all incoming contacts in the CSV file.

Note that contacts will be merged automatically according to this rule, so don’t make it too broad! Make sure you use a rule that won’t result in any false positives.

 

10. Dedupe Exceptions

You’ll notice a button at the top of the Find and Merge Duplicate Contacts page that says “View Dedupe Exceptions”. When you click on it, it may not be immediately apparent what’s going on there, as it will show you a blank table (if you haven’t done much merging of contacts yet), or a list of contacts if you have, but no indication of how they get on that list!

The answer is: remember how any time a possible duplicate is detected (either via a supervised rule, or running a general rule manually), you can click on either “merge” or “not a duplicate”? Every time you click “not a duplicate”, that pair of contacts is stored in the dedupe exceptions list. So that button is there in case you ever want to view the list of exceptions that have been saved.
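One way to picture the exceptions list is as a set of contact pairs. The Python sketch below is only a mental model: the names and contact IDs are invented, and it assumes that a stored pair is skipped the next time a rule is run, which this post doesn't spell out.

```python
# Each exception is an unordered pair of contact IDs.
exceptions = set()


def mark_not_duplicate(contact_id_a, contact_id_b):
    """Remember that these two contacts were reviewed and are not duplicates."""
    exceptions.add(frozenset((contact_id_a, contact_id_b)))


def filter_candidates(candidate_pairs):
    """Drop any candidate pair that is already on the exceptions list (assumed behaviour)."""
    return [pair for pair in candidate_pairs if frozenset(pair) not in exceptions]


# Hypothetical contact IDs: 101 and 202 are the two Jane Smiths.
mark_not_duplicate(101, 202)

candidates = [(202, 101), (303, 404)]
print(filter_candidates(candidates))  # [(303, 404)]
```

Using an unordered pair means it doesn't matter which of the two contacts was listed first when you clicked “not a duplicate”.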

 

11. Tips for Setting up Effective Dedupe Rules

  • We’ve touched on this before, but it bears repeating, because it’s a really common, and risky, mistake: don’t make unsupervised rules too broad. If, for example, you make matching on e-mail alone an unsupervised rule, be aware that any two contacts with the same e-mail address — for example, a married couple using the same e-mail account, as occasionally happens — will be merged into one, and treated as one person! Likewise with first and last names — remember Jane C. Smith from Vancouver and Jane Q. Smith from Nova Scotia? For unsupervised rules, you want to be really, really sure that they’re not going to flag anything as a duplicate that isn’t, because they will merge those contacts automatically in the background, without either you or the contact(s) in question being aware that it’s happened.
  • Supervised rules, on the other hand, can be much more general, because you will have the chance to say yes or no to merging any contacts they detect as possible duplicates. However, if you make supervised rules too broad, they’ll be a major nuisance. You probably don’t, for example, want to have a pop-up asking you if you want to merge contacts every time two people have the same last name…
  • General rules are optional. Not everyone uses them, because you have to remember to go into the Find and Merge Duplicate Contacts area and trigger them manually if you want them to run, which is more trouble than many people want to go to. But general rules can be useful if you discover a situation where some duplicates have gotten into the database without being detected by the other rules, and you want to create and run a specific rule to weed them out.
  • Although really, if duplicates are getting in undetected, you may want to revisit how your supervised and unsupervised rules are set up, and see if you can make them more effective.
  • Now, you might be thinking at this point “All this makes sense, but why can’t I have more than one supervised rule set up? I don’t want to have to remember to run a bunch of different general rules — I want to be able to have supervised rules that match if two contacts have the same first and last name, OR the same e-mail and first name, OR the same phone number and e-mail, OR… etc.” Well, that’s where the weights and weight thresholds come in! With a bit of effort and experimentation — and math — you can make one single supervised rule that checks for a whole lot of different things.
     

12. In Conclusion

So, hopefully you’ve now got a better grasp on the use of dedupe rules. If you still have questions, please feel free to post your question as a comment here, or in the CiviCRM Community Forums, or contact Freeform Solutions for help.

Comments

Hey there, amazing post and great coverage of this important and not very well understood topic - thanks for contributing. We'd love to have this accessible in our reference: http://book.civicrm.org/user. We do have a chapter in there called 'Deduping and Merging' http://book.civicrm.org/user/common-workflows/deduping-and-merging/ but this post is much better. Are you up for replacing it with your post?

Sure, that would be fine!

Wow, my writing going into the Actual Book - now I feel important! :)

Agree this is a great write-up. One complexity that I think would be great to include is the issue of how a rule handles multiple email address locations.

E.g. if I have pd@fuzion as both my home and my main or billing email (which often happens in the DBs we look at), then the rule will give both emails the 'weight'. So if a rule for First AND Last AND Email is set, e.g., at

Email 10
First 5
Last 5
Threshold 20

this would mean that having 2 emails on a record will cause it to count as a match even though the First and Last both do not match!

Hence the recipe we usually recommend 'requires' email but gives it a much smaller weighting, e.g.

Email 3
First 5
Last 5
Threshold 13

i.e. it would require 5 equal emails on one record to count as a match, which is pretty unlikely.

I expect you can explain/describe it more clearly than I have, but we do find the above catches people out often enough in forum to be worth the effort.
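The numbers in this comment can be checked with a short Python sketch. It assumes, as the comment describes, that every matching email record on a contact adds the email weight again; the function name is made up for the example.

```python
def emails_needed(email_weight, threshold):
    """Smallest number of matching email records whose combined weight reaches the threshold."""
    # Integer ceiling division: round threshold / email_weight up.
    return -(-threshold // email_weight)


# Email 10, threshold 20: two matching emails reach the threshold by themselves.
print(emails_needed(10, 20))  # 2

# Email 3, threshold 13: five matching emails would be needed without any name match.
print(emails_needed(3, 13))   # 5
```

With the second recipe, one email (3) plus first name (5) plus last name (5) adds up to exactly 13, so all three fields are needed in the normal case.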

Excellent write-up on a topic that needs it - functionality that should be used by most organizations, but that is poorly documented, complex, and important.

Great job!

Let's say that Linda Kane signed up for an event using her email linda123@mailinator.com.

For her first name, she enters "Linda and John" and for last name "Kane".

I want the system to recognize that she's the same linda123@mailinator.com so it should "merge", but I don't want it to update her contact info in the system.  How do I do that?