De-dupe Rules: Lengths, Weights, Thresholds, Oh My!

Published
2007-06-27 07:57
The new de-dupe functionality, introduced in CiviCRM 1.8, is configurable under Administer CiviCRM » Duplicate Contact Rules. This post describes what the fields there mean and how their contents affect the de-dupe search engine.

The first decision to make on that screen is which rule to edit. For CiviCRM 1.8 we decided to allow one rule per contact type (individual, organization and household); in future versions this can be extended to an arbitrary list of rules. Once you have chosen a rule, the next screen displays its properties.

Every de-dupe rule consists of several criteria that any two contacts (of a given type) must meet to be considered ‘similar’ (this is an approach introduced in David Strauss’s CiviCluster module). Every criterion, in turn, consists of three parameters: a contact field (like Last Name or Email), an optional length of the field, and the criterion’s weight. If two contacts have the same value in the field in question (up to the field’s length, if one is specified), the criterion’s weight is counted; if, for the two given contacts, the sum of the matching criteria’s weights reaches the rule’s threshold (is equal to or greater than it), the two contacts are considered matching.

Hopefully an example will make this a bit clearer: the default de-dupe rule for individuals is First Name (weight 5), Last Name (weight 7) and Email (weight 10) with a threshold of 20. This means that for two contacts to be considered duplicates, their First Name, Last Name and Email must all match (5 + 7 + 10 = 22; any two of them add up to at most 17, which is below 20). If the threshold were lowered to 17, it would be enough for Last Name and Email to match; if the threshold were 12, any two of the three fields would be enough.

The default rule presented above does not use the fields’ length property, but it could; if the matching should group contacts whose phone numbers share the same prefix, one of the criteria could be Phone, length 3.
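The weight-and-threshold scoring described above can be sketched in a few lines of Python. This is only an illustration, not CiviCRM's actual implementation (which runs inside the database): the names `Criterion`, `score` and `is_duplicate`, the dictionary keys, and the weight given to the Phone criterion are all made up for this example. The comparison uses `>=` because the examples in this post (Last Name plus Email sufficing at a threshold of 17) only work if reaching the threshold counts as a match. Treating empty values as non-matching and comparing case-sensitively are further assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Criterion:
    field: str                    # contact field, e.g. "last_name" (hypothetical key)
    weight: int                   # counted when the field matches
    length: Optional[int] = None  # if set, compare only this many leading characters


def score(a: Dict[str, str], b: Dict[str, str], criteria: List[Criterion]) -> int:
    """Sum the weights of all criteria on which contacts a and b agree."""
    total = 0
    for c in criteria:
        va, vb = a.get(c.field), b.get(c.field)
        if not va or not vb:
            continue  # assumption: missing or empty values never match
        if c.length is not None:
            va, vb = va[:c.length], vb[:c.length]
        if va == vb:
            total += c.weight
    return total


def is_duplicate(a, b, criteria, threshold) -> bool:
    # Assumption: reaching the threshold exactly is enough.
    return score(a, b, criteria) >= threshold


# Default rule for individuals, as described in the post.
individual_rule = [
    Criterion("first_name", 5),
    Criterion("last_name", 7),
    Criterion("email", 10),
]

ann = {"first_name": "Ann", "last_name": "Lee", "email": "ann@example.org"}
anne = {"first_name": "Anne", "last_name": "Lee", "email": "ann@example.org"}

print(score(ann, anne, individual_rule))             # 17: Last Name + Email
print(is_duplicate(ann, anne, individual_rule, 20))  # False
print(is_duplicate(ann, anne, individual_rule, 17))  # True

# A criterion with a length: only the first 3 characters of Phone are compared.
# The weight of 4 is invented for the example.
phone_rule = [Criterion("phone", 4, length=3)]
print(score({"phone": "555-1234"}, {"phone": "555-9876"}, phone_rule))  # 4
```

With the default threshold of 20, the two sample contacts are not flagged because their first names differ; lowering the threshold to 17 flags them.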
(Note: For the de-dupe searches to be efficient, the database must index the queried columns; CiviCRM has indices on the columns queried by the default rules, but it’s up to the administrator to create them for other queried columns, like civicrm_phone.phone for the above example.)

The default de-dupe rule for organizations and households is Organization/Household Name (weight 5) and Email (weight 5) with a threshold of 10; this means that both fields must have the same values for two contacts to be considered matching. These default values match the ones we use for matching on contact creation and import.

Stay tuned for more de-dupe news in the future, as well as a 1.8.alpha sandbox sporting the de-dupe functionality, which should be up next week.