Matching contacts with the dedupe hook

Published
2011-09-19 14:21
CiviCRM 3.3 introduced a new hook that allows you to interact with and alter dedupe queries. Unfortunately, it was something of a "hidden hook," as it lacked documentation for quite a while. But it can be quite useful and powerful, and so deserves a review.

In versions 3.3 and 3.4/4.0, the dedupe tools in CiviCRM received several notable improvements, including a restructuring of how the dedupe queries are built to improve performance, and the introduction of a paginated display with table caching that significantly improves the dedupe review and merging process. However, running a rule can still take quite some time on a large database (250k+ contacts), and the matching mechanism is quite basic: it simply compares field data with field data.

That's where the dedupe hook comes into play. It allows you to rebuild the queries to improve performance, and to add logic and comparison intelligence that increase successful matching. In our work with the New York State Senate, those two goals became very important: each district's database runs somewhere between 160k and 300k records, making duplicate reduction and performance of prime importance.

For those interested, our dedupe module can be reviewed here: https://github.com/nysenatecio/Bluebird-CRM/tree/master/modules/nyss_dedupe

The module has evolved quite a bit over the life of the project. We began by focusing on the individual strict and individual fuzzy rules, with the aim of improving import matching and increasing performance on internally run rules. Since then we've modified another rule commonly used for imports, and rebuilt all the queries used for internal deduping to make them as efficient as possible. Kudos to Graylin Kim from the Senate for refactoring the module and finding better ways of handling various matching conditions.

The import matching improvement piece in our module is interesting (nyss_dedupe_indiv1_record).
There are a couple of fields that are problematic when doing a straight data match, such as street address. It's very easy to have slight variations in the same street address that a simple matching mechanism will not treat as a match. For example, using the standard CiviCRM matching, "123 Main Street" and "123 Main St." will not match. Yes, we can use the length value in the rule definition to match only on a certain number of characters from the left of the string. But that impedes performance and will still miss many valid matches. So we identified the main causes of missed matches and performed data normalization on both the incoming record (stored as an array) and the database query. This included such operations as removing punctuation, trimming the values, removing ordinals, and condensing street suffixes. After performing that work, we saw a significant increase in legitimate matches during import.

There are several things to note when using this hook. There are four places where the dedupe rules are called, and the queries are constructed differently for each scenario. That means you need to condition your code around the intended usage and construct your queries accordingly. Those four places are: when a CMS contact is created and linked to a CiviCRM contact, during import, when a contact record is saved, and when running a find duplicates rule. There's enough information in the object passed to the hook to handle these conditions.

Because this hook is geared toward altering the existing dedupe queries, you must first have the dedupe rule group defined in your system before you can modify it. Also, you may want to unset the queries constructed by the system before you create your modified query. But if you do that, make sure your query still meets the threshold required to flag a duplicate.

Documentation for this hook can be found here: http://wiki.civicrm.org/confluence/display/CRMDOC40/CiviCRM+hook+specification#CiviCRMhookspecification-hookcivicrmdupeQuery
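
To make the street address normalization described above more concrete, here is a minimal sketch in Python. It is not the code from the nyss_dedupe module (which is written in PHP); the function name `normalize_street` and the suffix table are hypothetical, chosen only to illustrate the four operations mentioned: removing punctuation, trimming, removing ordinals, and condensing street suffixes. The same normalization would need to be applied to both the incoming record and the stored values being compared.

```python
import re

# Hypothetical suffix table for illustration; the list used by the
# real module is not given in the post.
SUFFIXES = {
    "street": "st",
    "avenue": "ave",
    "road": "rd",
    "boulevard": "blvd",
    "drive": "dr",
    "lane": "ln",
}

# Matches a number followed by an ordinal ending, e.g. "5th" or "22nd".
ORDINAL = re.compile(r"\b(\d+)(st|nd|rd|th)\b")
# Matches anything that is neither a word character nor whitespace.
PUNCTUATION = re.compile(r"[^\w\s]")


def normalize_street(address: str) -> str:
    """Return a canonical form of a street address for comparison."""
    text = address.lower()
    text = PUNCTUATION.sub("", text)   # remove punctuation: "st." -> "st"
    text = ORDINAL.sub(r"\1", text)    # remove ordinals: "5th" -> "5"
    words = text.split()               # split() also trims and collapses spaces
    words = [SUFFIXES.get(word, word) for word in words]  # condense suffixes
    return " ".join(words)


if __name__ == "__main__":
    print(normalize_street("123 Main Street"))
    print(normalize_street("123 Main St."))
```

With this kind of normalization, the two variants from the example above reduce to the same string, so an exact comparison on the normalized values can succeed where a raw comparison fails.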

Comments

Hey, thanks for updating the documentation and providing an example. Nice work!

1. Can you give us approximate performance numbers, so folks know the magnitude of the improvements possible via a hook?

2. I'm thinking that maybe for the default rules that we ship, we don't allow any modifications and instead have an optimized query for the ones that are inefficient. Specifically, I suspect the strict email rule can be optimized a lot.

3. The current code uses an IN clause for contacts in a group. This does not scale at all for groups with any reasonable number of contacts.

lobo

Running our indiv1 rule, which is strict individual and matches on fname, mname, lname, suffix, address, and postal code, and recording the time with the devel module:

total contacts in db: 269465

with dedupe hook (single query):
3363 matches
9.49178 sec

without dedupe hook (and less the postal code, as the native dedupe only accommodates 5 fields):
199 matches
292.87303 sec