Published
Saturday, August 14, 2010 - 06:18
Hi all,

We probably all have the same problem: the same contact is present several times in our CRM. This isn't a trivial problem to solve, as the same name can be spelt differently, e.g. "José Manuel Durão Barroso", president of the "European Commission", is the same as "Jose Manuel Barrosso", employed by the "EC", and is very likely to be found as "Barroso" or "Durao Barosso" as well. For a computer, "José", "Jose" and "José Manuel" aren't the same first name, and you will always need human beings to decide whether they are duplicates, but CiviCRM tries to spot the really obvious matches.

CiviCRM has several dedupe rules that help avoid creating these duplicates. In a nutshell, a dedupe rule is a list of fields that must all match for two contacts to be considered duplicates. By default, if two individuals have the same email address, they are considered duplicates, but you can change the rule so that they also need the same last name, same first name, same birth date and same country, or whatever combination of fields you want. For each type of contact (Individual, Household, Organisation), you have one and only one "strict default" dedupe rule, and possibly several other strict rules and other fuzzy rules.

CiviCRM has a deduplicate tool to help find and merge these duplicates, and a previous blog post discussed how to improve that process. Besides helping you get rid of duplicates, it also tries to prevent creating them in the several places where you can create new contacts:

For a CiviCRM user (eg. staff or volunteer)

When you Create new contact

When you have entered all the contact details and want to save the contact, CiviCRM applies the "default fuzzy" rule. If another contact matches, it displays an error and suggests that you update the existing contact instead of creating a duplicate. This means you have "lost" all the details you typed (phone number, address, birthdate...) and have to enter them again by modifying the existing contact.

As of 3.2, when you create a new Individual, an automatic ajax check looks for other contacts with the same last name as soon as you have finished typing a last name. These might not be real duplicates; it is more a quick early warning that when you create a new "John Doe", you might already have a "J. Doe" in your CRM who is the same person.
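To picture the early last-name warning, here is a minimal sketch. It is not CiviCRM's implementation: `fold` and `last_name_candidates` are hypothetical names, and ignoring case and accents is an assumption about what a useful early warning might do.

```python
import unicodedata


def fold(name: str) -> str:
    """Lowercase a name and strip accents so "Durão" and "durao" compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.casefold().strip()


def last_name_candidates(typed_last_name, contacts):
    """Return existing contacts whose folded last name equals the typed one."""
    wanted = fold(typed_last_name)
    return [c for c in contacts if fold(c["last_name"]) == wanted]


# Hypothetical existing contacts.
existing = [
    {"first_name": "J.", "last_name": "Doe"},
    {"first_name": "José Manuel", "last_name": "Durão Barroso"},
    {"first_name": "Jane", "last_name": "Smith"},
]

# Typing "doe" for a new "John Doe" surfaces "J. Doe" as a possible duplicate.
for contact in last_name_candidates("doe", existing):
    print(contact["first_name"], contact["last_name"])
```

The candidates are only hints: a human still decides whether "J. Doe" and "John Doe" are the same person.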

Import contacts

When you import contacts (and if you choose an option that checks for duplicates), it applies the "default strict" rule. This means that if the "default strict" rule is too permissive (e.g. every contact from the same country is a duplicate), the import will overwrite the same contact over and over, and if it's too strict (e.g. it needs to match on "birthdate", which you don't have in your imported file), it will always create duplicate contacts.
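The two failure modes above can be reproduced with a small sketch. It assumes a rule is simply a list of field names that must all be present and equal; `is_duplicate` and `import_rows` are hypothetical helpers, not CiviCRM functions, and real rules are more elaborate than this.

```python
def is_duplicate(rule_fields, new, existing):
    """Two contacts match when every rule field is present and equal."""
    return all(
        new.get(f) is not None and new.get(f) == existing.get(f)
        for f in rule_fields
    )


def import_rows(rows, contacts, rule_fields):
    """Update the first matching contact, otherwise create a new one."""
    updated = created = 0
    for row in rows:
        match = next((c for c in contacts if is_duplicate(rule_fields, row, c)), None)
        if match is not None:
            match.update(row)  # overwrite the existing contact's fields
            updated += 1
        else:
            contacts.append(dict(row))
            created += 1
    return updated, created


def fresh_contacts():
    """A hypothetical CRM holding a single contact."""
    return [{"first_name": "Ann", "email": "a@example.org", "country": "BE"}]


rows = [
    {"email": "a@example.org", "country": "BE"},
    {"email": "b@example.org", "country": "BE"},
]

# Reasonable rule: one row updates Ann, the other creates a contact.
print(import_rows(rows, fresh_contacts(), ["email"]))
# Too permissive: both rows overwrite the same contact.
print(import_rows(rows, fresh_contacts(), ["country"]))
# Too strict: the file has no birth_date, so every row creates a contact.
print(import_rows(rows, fresh_contacts(), ["email", "birth_date"]))
```

With the country-only rule, Ann's record ends up carrying the second row's email address; with the birthdate rule, the CRM ends up with two contacts sharing `a@example.org`.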

For an anonymous visitor

Online registration (to an event, donation or using profiles)

Visitors can create contacts themselves when registering for an event, making a donation or filling in any end-user form CiviCRM manages. As you can create an event registration without any profile, the only field you can match a contact on is the email (otherwise, it would always create duplicates). This is one of the reasons the "default strict" dedupe rule uses only the email field.

Mailing list registrations & user registration

I think the "default strict" rule is applied, but I'm not 100% sure.

Possible improvements

Better names

There is a naming problem: "strict" vs. "fuzzy" are categories that aren't meaningful (i.e. you can have a fuzzy rule that is stricter than a strict one, and vice versa). 1) Instead of having 2 variables (default & category), why not one, with 3 possible values ("default for anonymous", "default for user", "")? 2) Being able to have a different default rule for subtypes (e.g. students each have a distinct "student ID", which calls for a different rule than the one for Individual). Not sure that's very useful, at least until we can more easily create subtypes using profiles.
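The single-setting idea (one value out of three instead of two variables) could look something like the sketch below. This is only an illustration of the proposal, not existing CiviCRM code: `DefaultFor`, `DedupeRule`, `default_rule` and the example rule names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DefaultFor(Enum):
    ANONYMOUS = "default for anonymous"
    USER = "default for user"
    NONE = ""  # not a default; only applied when chosen explicitly


@dataclass(frozen=True)
class DedupeRule:
    name: str
    contact_type: str
    fields: tuple
    default_for: DefaultFor = DefaultFor.NONE


def default_rule(rules, contact_type, audience) -> Optional[DedupeRule]:
    """Return the rule flagged as default for this contact type and audience."""
    for rule in rules:
        if rule.contact_type == contact_type and rule.default_for is audience:
            return rule
    return None


# Hypothetical rule set for Individuals.
rules = [
    DedupeRule("email only", "Individual", ("email",), DefaultFor.ANONYMOUS),
    DedupeRule("name and email", "Individual",
               ("first_name", "last_name", "email"), DefaultFor.USER),
    DedupeRule("member card", "Individual", ("card_number",)),
]

print(default_rule(rules, "Individual", DefaultFor.ANONYMOUS).name)
print(default_rule(rules, "Individual", DefaultFor.USER).name)
```

The names then say where a rule applies rather than how "strict" it is supposed to be.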

Being able to choose the rule on the imports.

On the import, the user should be able to choose a dedupe rule instead of applying the "default strict" one. So if you import a list of members, you know that you can use the "member" rule, which matches contacts on the card number (a custom field) and will work better than the default one. Having to choose also serves as an implicit warning ("you might either create lots of duplicates or overwrite lots of contacts"), so the user picks more carefully based on the file being imported.

Being able to choose the rule on the profiles

If you use a profile for an online registration (event, donation...) and you choose to match existing contacts, you should also be able to choose which rule to apply. That's the same logic as for the import: if you create an online registration for an event with a profile that contains a mandatory "student ID" field, you might want to choose the rule "dedupe on the student ID".

Warn when a rule doesn't make sense in the context

If you choose a rule that tests the "last name", but you don't have the last name field in your profile or as a column in your csv file, it should warn you that the dedupe test will fail.
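Such a check is cheap to do before the import or the form is ever used. A minimal sketch, assuming the rule is a plain list of field names and the CSV header uses the same names (`missing_rule_fields` is a hypothetical helper):

```python
import csv
import io


def missing_rule_fields(rule_fields, available_fields):
    """Return the rule fields that the profile or CSV file does not provide."""
    available = {name.strip().lower() for name in available_fields}
    return [f for f in rule_fields if f.strip().lower() not in available]


# Hypothetical import file without a last_name column.
csv_text = "first_name,email,country\nJohn,john@example.org,BE\n"
header = next(csv.reader(io.StringIO(csv_text)))

missing = missing_rule_fields(["last_name", "email"], header)
if missing:
    print("Warning: this rule cannot match, the file lacks:", ", ".join(missing))
```

The same function works for a profile by passing the list of the profile's field names instead of the CSV header.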

Over to you...

This is a quick description of the automatic dedupe. These are only suggestions, the result of a previous discussion, to see if we mostly agree about the problem and how to improve it. Please comment if you have better ideas; this is still very much open to discussion. And if you think it's important enough, please consider contributing financially (or in code), as none of this is planned or budgeted (lobo has estimated it needs 30 hours, but we might come up with better solutions that are easier or more complicated to implement).

Comments

Personally, I would like to see an out-of-the-box solution to the issue of a Drupal user being checked for an associated Civi contact. Then, if there isn't one, one is created. Also, existing contacts in CiviCRM getting an associated Drupal account. I would donate to a project of this nature.

But something that could be useful to others indeed. Could you write down the spec in the wiki and post in the forum, to see who wants to write it (and contribute to the dev cost)?

One issue to define is how to give the created user a login and password, and what role/ACL, including for contacts without emails.

Might it be better to make more use of the checksum token, so most of the changes can be done by the contact without having to create a user account every time?

This part happens currently (and has for a long time). Any time a new Drupal user registers and logs in, Civi checks for a matching contact and creates one if not found.

I think you got this description above wrong.
"For a CiviCRM user (eg. staff or volunteer)
When you Create new contact

When you have entered all the contact details and want to save the contact, CiviCRM applies the "default strict" rule."

My understanding and expectation is that doing this at the backend applies the 'default fuzzy' rule.

Yes, this blog post is not correct. Any time a contact is created from within the CiviCRM back-end (i.e. by an administrator of some variety), the default fuzzy rule is used. When a contact is created by some process on the front-end (i.e. by anyone), the default strict rule is used. If a contact is created via the API, the fuzzy rule is used, but there's a key to bypass that.

updated the post.

Hi,

You haven't mentioned what impact permissions do (or should) have. For example, if you are adding a contact that matches another contact in Civi that you don't have permission to access, what happens? I suppose the logic should possibly be the same as for an event or other type of front-end registration, and it should 'quietly' use the default rule?

Of course, if you were really sneaky you could use this behaviour to get access to a contact you shouldn't have access to.

I've found that the dupe rules are quite useful, but it takes a lot of experimentation to tune them.
Suggestion #1: What if one could click "more strict" or "less strict" and quickly see the result instead of burrowing into the specifics?

I've found that once I have found my dupes, I often have a big list. The current system makes it awkward to address each one. Too many clicks.
Suggestion #2: Create a system that cycles quickly through the dupes so a human can make quick decisions (delete, merge, etc.) and then move immediately on to the next. Make it easy for human intelligence, which is likely to be better than the machine, to operate quickly.