We probably all have the same problem: the same contact is present several times in our CRM.
This isn't a trivial problem to solve, as the same name can be spelt differently, e.g. "José Manuel Durão Barroso", president of the "European Commission", is the same person as "Jose Manuel Barrosso", employed by the "EC", and is very likely to be found under variants such as "Barroso" or "Durao Barosso" as well. For a computer, "José", "Jose" and "José Manuel" aren't the same first name, and you will always need human beings to decide whether they are duplicates, but CiviCRM tries to spot the really obvious matches.
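To see why a computer treats these spellings as different, here is a minimal Python sketch (not CiviCRM code; `normalize_name` is a hypothetical helper made up for this illustration). It shows that stripping accents and case catches the easy variants, while a genuine misspelling or a missing middle name still needs a human:

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase a name, strip accents and collapse extra spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    # After NFKD, an accented letter is a base letter plus combining marks.
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


print("José" == "Jose")                                          # False
print(normalize_name("José") == normalize_name("Jose"))          # True
print(normalize_name("José Manuel") == normalize_name("Jose"))   # False
print(normalize_name("Barroso") == normalize_name("Barosso"))    # False
```

Normalisation only removes the most mechanical differences; the last two comparisons are exactly the cases where someone has to look at the two records.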
CiviCRM has several dedupe rules that help avoid creating these duplicates.
In a nutshell, a dedupe rule is a list of fields that must all match for two contacts to be considered the same. By default, two individuals with the same email address are considered duplicates, but you can change the rule so that they also need the same last name, first name, birth date and country, or whatever combination of fields you want.
For each type of contact (Individual, Household, Organisation), you have one and only one "strict default" dedupe rule, and possibly several other strict rules and other fuzzy rules.
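The "list of fields" idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (exact matching, made-up field names and sample contacts), not how CiviCRM implements its rules, which may be more elaborate:

```python
from typing import Mapping, Sequence

Contact = Mapping[str, str]


def is_duplicate(a: Contact, b: Contact, rule_fields: Sequence[str]) -> bool:
    """Duplicates only if every field in the rule is filled in and equal."""
    if not rule_fields:
        return False  # an empty rule should not match everything
    for field in rule_fields:
        value_a = a.get(field, "").strip().lower()
        value_b = b.get(field, "").strip().lower()
        if not value_a or value_a != value_b:
            return False
    return True


# Hypothetical rules: the email-only default and a stricter combination.
email_only = ["email"]
stricter = ["email", "last_name", "first_name"]

jose = {"first_name": "José", "last_name": "Barroso", "email": "jmb@example.org"}
jose2 = {"first_name": "Jose Manuel", "last_name": "Barroso", "email": "jmb@example.org"}

print(is_duplicate(jose, jose2, email_only))  # True
print(is_duplicate(jose, jose2, stricter))    # False
```

Adding fields to a rule makes it harder for two contacts to match, which is why the same pair is a duplicate under one rule and not under the other.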
CiviCRM has a deduplicate tool to help find and merge these duplicates, and a previous blog post discussed how to improve that process. Besides helping you get rid of duplicates, it also tries to prevent you from creating them in the several places where you can create new contacts:
For a CiviCRM user (e.g. staff or volunteer)
When you create a new contact
When you have entered all the contact details and want to save the contact, CiviCRM applies the "default fuzzy" rule. If another contact matches, it displays an error and suggests that you update the existing contact instead of creating a duplicate (but that means you have "lost" all the details you typed, such as phone number, address and birth date, and have to enter them again by modifying the existing contact).
As of 3.2, when you create a new Individual, an automatic Ajax check looks for other contacts with the same last name as soon as you have finished typing the last name. These might not be real duplicates; it is more a quick early warning that when you create a new "John Doe", you might already have a "J. Doe" in your CRM who is the same person.
When you import contacts (and if you choose an option that checks for duplicates), it applies the "default strict" rule.
It means that if the "default strict" rule is too permissive (e.g. every contact from the same country is a duplicate), the import will overwrite the same contact over and over, and if it's too strict (e.g. it needs to match on "birthdate", which you don't have in your imported file), it will always create duplicate contacts.
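Both failure modes are easy to reproduce with a toy import. The sketch below is plain Python with hypothetical helpers (`matches`, `import_contacts`, `demo`) and made-up data; it only mimics the behaviour described above and is not the CiviCRM import code:

```python
from typing import Dict, List, Sequence, Tuple

Contact = Dict[str, str]


def matches(a: Contact, b: Contact, rule_fields: Sequence[str]) -> bool:
    """True if every rule field is filled in on both contacts and equal."""
    return bool(rule_fields) and all(
        a.get(f) and a.get(f) == b.get(f) for f in rule_fields
    )


def import_contacts(existing: List[Contact], rows: List[Contact],
                    rule_fields: Sequence[str]) -> Tuple[int, int]:
    """Simulate an import; returns (overwritten, created), mutates `existing`."""
    overwritten = created = 0
    for row in rows:
        match = next((c for c in existing if matches(c, row, rule_fields)), None)
        if match is not None:
            match.update(row)  # the imported row overwrites the matched contact
            overwritten += 1
        else:
            existing.append(dict(row))
            created += 1
    return overwritten, created


def demo(rule_fields: Sequence[str]) -> Tuple[int, int, int]:
    """Run the same import on a fresh database: (overwritten, created, total)."""
    database = [{"first_name": "Ann", "email": "ann@example.org", "country": "BE"}]
    rows = [
        {"first_name": "Bob", "email": "bob@example.org", "country": "BE"},
        {"first_name": "Ann", "email": "ann@example.org", "country": "BE"},
    ]
    overwritten, created = import_contacts(database, rows, rule_fields)
    return overwritten, created, len(database)


print(demo(["country"]))     # (2, 0, 1)  too permissive: one contact overwritten twice
print(demo(["birth_date"]))  # (0, 2, 3)  too strict: Ann is now in the database twice
print(demo(["email"]))       # (1, 1, 2)  Bob created, Ann updated
```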
For an anonymous visitor
Online registration (to an event, donation or using profiles)
Visitors can create contacts themselves when registering for an event, making a donation or filling in any end-user form CiviCRM manages.
As you can create an event registration without any profile, the only field you can match a contact on is the email (otherwise, it would always create duplicates).
This is one of the reasons the "default strict" dedupe rule is only the email field.
Mailing list registrations & user registration
I think the "default strict" rule is applied, but not 100% sure.
1) There is a naming problem: "strict" vs. "fuzzy" are categories that aren't meaningful (i.e. you can have a fuzzy rule that is stricter than a strict one, and vice versa). Instead of having 2 variables (default & category), why not have one with 3 possible values ("default for anonymous", "default for user", "")?
2) Being able to have a different default rule for subtypes (e.g. students are different contacts if they have a different "student ID", which is a different rule than for Individual). Not sure that's very useful, at least until we can more easily create subtypes using profiles.
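To make the naming idea above concrete (one setting with three possible values instead of "default" plus "strict/fuzzy"), here is a rough Python sketch. Everything in it (`DefaultFor`, `DedupeRule`, `pick_default`, the sample rules) is hypothetical and only illustrates the suggestion; nothing like this exists in CiviCRM as far as this post is concerned:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class DefaultFor(Enum):
    ANONYMOUS = "default for anonymous"
    USER = "default for user"
    NONE = ""


@dataclass
class DedupeRule:
    name: str
    match_fields: List[str]
    default_for: DefaultFor = DefaultFor.NONE


def pick_default(rules: List[DedupeRule], context: DefaultFor) -> DedupeRule:
    """Return the single rule flagged as the default for the given context."""
    if context is DefaultFor.NONE:
        raise ValueError("NONE marks non-default rules, it is not a context")
    candidates = [r for r in rules if r.default_for is context]
    if len(candidates) != 1:
        raise ValueError(
            f"expected exactly one rule for {context.value!r}, "
            f"found {len(candidates)}"
        )
    return candidates[0]


rules = [
    DedupeRule("email only", ["email"], DefaultFor.ANONYMOUS),
    DedupeRule("name and email", ["first_name", "last_name", "email"], DefaultFor.USER),
    DedupeRule("member card", ["card_number"]),
]

print(pick_default(rules, DefaultFor.ANONYMOUS).name)  # email only
print(pick_default(rules, DefaultFor.USER).name)       # name and email
```

The point of the single setting is that the name says where the rule is used, rather than suggesting how strict it is.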
Being able to choose the rule on the imports.
On the import, the user should be able to choose a dedupe rule instead of applying the "default strict" one.
So if you import a list of members, you know that you can use the "member" rule, which matches contacts on the card number (a custom field) and will work better than the default one.
Having to choose will also serve as an implicit warning ("you might either create lots of duplicates or overwrite lots of contacts"), so that you choose more carefully based on the file you import.
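As a rough illustration of why choosing the rule per import helps, the Python sketch below looks up existing contacts using a rule picked by name. The rule names, the `card_number` field and the helpers `rule_key` and `find_matches` are all assumptions made for this example, not CiviCRM features:

```python
from typing import Dict, List, Optional, Sequence, Tuple

Contact = Dict[str, str]

# Hypothetical named rules the person importing could choose from.
RULES: Dict[str, List[str]] = {
    "default strict": ["email"],
    "member": ["card_number"],
}


def rule_key(contact: Contact, fields: Sequence[str]) -> Optional[Tuple[str, ...]]:
    """Build the comparison key, or None if any rule field is empty or missing."""
    values = tuple(str(contact.get(f, "")).strip().lower() for f in fields)
    return values if values and all(values) else None


def find_matches(existing: List[Contact], rows: List[Contact],
                 rule_name: str) -> List[Tuple[Contact, Optional[Contact]]]:
    """Pair each imported row with the existing contact it matches, if any."""
    fields = RULES[rule_name]
    index: Dict[Tuple[str, ...], Contact] = {}
    for contact in existing:
        key = rule_key(contact, fields)
        if key is not None:
            index.setdefault(key, contact)  # keep the first contact per key
    paired = []
    for row in rows:
        key = rule_key(row, fields)
        paired.append((row, index.get(key) if key is not None else None))
    return paired


existing = [{"name": "Ann", "email": "ann@old.example.org", "card_number": "M-001"}]
rows = [{"name": "Ann", "email": "ann@new.example.org", "card_number": "M-001"}]

print(find_matches(existing, rows, "member")[0][1] is existing[0])  # True
print(find_matches(existing, rows, "default strict")[0][1])         # None
```

Here Ann has changed her email address, so the email rule would create a duplicate, while the card number still identifies her.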
Being able to choose the rule on the profiles
If you use a profile for an online registration (event, donation...) and you choose to match existing contacts, you can also choose what rule to apply.
That's the same logic as for the import: if you create an online registration for an event with a profile that contains a mandatory "student ID" field, you might want to choose the rule "dedupe on the student ID".
Warn when a rule doesn't make sense in the context
If you choose a rule that tests the "last name", but you don't have the last name field in your profile or as a column in your CSV file, it should warn you that the dedupe test will fail.
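Such a check could be as simple as comparing the rule's fields with the columns (or profile fields) that are actually available. A minimal Python sketch, with a hypothetical `missing_rule_fields` helper and a made-up CSV header:

```python
import csv
import io
from typing import Iterable, List, Sequence


def missing_rule_fields(rule_fields: Sequence[str],
                        available_fields: Iterable[str]) -> List[str]:
    """Return the rule fields that the profile or file does not provide."""
    available = {f.strip().lower() for f in available_fields}
    return [f for f in rule_fields if f.strip().lower() not in available]


# Hypothetical import file that has no last_name column.
sample = io.StringIO("first_name,email\nAnn,ann@example.org\n")
header = next(csv.reader(sample))

missing = missing_rule_fields(["last_name", "email"], header)
if missing:
    print(f"Warning: the rule tests {missing}, which this file does not contain")
```

The same function would work for a profile by passing the names of its fields instead of the CSV header.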
Over to you...
This is a quick description of the automatic dedupe. These are only suggestions, the result of a previous discussion, to see if we mostly agree about the problem and how to improve it. Please comment if you have better ideas, this is still very much open to discussion, and...
If you think that's important enough, please consider contributing financially (or in code), as none of this is planned or budgeted (lobo has estimated it needs 30 hours, but we might come up with better solutions that are easier or more complicated to implement).