Keeping your DB clean: how about a hook_civicrm_dedupe?

Gepubliceerd
2011-09-04 14:50
Written by
colemanw - member of the CiviCRM community and Core Team member - about the Core Team

Like many others who have no doubt gone before me, I found myself spending a large chunk of time this week finding and merging duplicate contacts. Civi's deduping UI has gotten a lot better lately, and I love having the ability to go through it more quickly than before. But deduping is only as good as the queries it runs, and of course prevention is better than cleanup.

Here are some observations of how all those dupes in my DB got created:

  • People treat nicknames and first names interchangeably -- Bob might call himself Robert on one form and Bob on another, and not even notice he's doing it.
  • People change email addresses like they change their shoes!
  • Ditto for phone numbers and addresses!
  • Phone number is currently just a textfield, so deduping on phone is almost useless (555.5555 doesn't match 555-5555 or 5555555 or 555 5555 and if you include area codes there's even more permutations)
  • A slightly mistyped name (it's amazing how many people can't spell their own name!) will result in a dupe that might never be discovered, since even your fuzzy rule won't catch it!
  • Setting your strict rule strict enough that you don't get any false positives results in way too many false negatives (creating loads of dupes in the DB)
  • Setting your fuzzy rule fuzzy enough to clean up all those dupes means sifting through thousands of "possible dupes" who are unquestionably different people.

In order to solve these and other problems, our dedupe queries need to be able to:

  • Use OR logic (match nick_name to first_name and vice-versa)
  • Use SOUNDS LIKE or wildcards to catch misspellings
  • Intelligently choose fields to match based on what data is available (the current "weight threshold" system is too simplistic)
  • Group matching data by category (email, phone, and street address all help to locate a person, but none of those things uniquely identify them -- members of households or businesses frequently share all three)
  • Strip irrelevant characters such as phone number punctuation

For example:

Because we have a few custom modules running for our school to register people for programs and such, we save a lot of contacts via API. I've created my own wrapper that handles matching and saving contacts, and before it passes data to the Civi API, it first runs a query to check to see if that contact already exists in the database. The logic for the query goes like this:

IF first_name OR nick_name match (or cross-match)
AND last_name matches
AND email OR street address OR phone number match

Then we have a match!

It's a great query, and almost guarantees no false positives or negatives. You could use it too, or write a better one (since it doesn't tackle the misspelling issue). You or I could also write a great fuzzy query that really nails those dupes without making us comb through thousands of false-positives. But in order for that to happen outside my own little custom API function, we need...

hook_civicrm_dedupe()

Lobo and I were discussing this on IRC the other day, and came up with a couple possibilities. See what you think:

/**
* The "hijack and replace query" method
*/
hook_civicrm_dedupe($contactType, $rule, &$query, $groupID=NULL, $params=NULL) {
  // In this version, the dedupe query has been built and is about to be run
  // $contactType might be Individual, Household, etc.
  // $rule array or object containing the rule being invoked (rule ID, strict or fuzzy, name, label, is_default, etc)
  // $query is the SQL string that we can modify or overwrite
  // $groupID limits the search to a particular group
  // $params will be empty for whole-database deduping, or will contain an array of user-entered data from a webform if we are matching a single contact (i.e. from a contribution page).

  return 'SELECT some other query';

  // CiviCRM will run the new query instead of the one it was about to run.

}

Another possibility would be to for the hook to let you register a new dedupe rule programatically, rather than just hijaking an existing one. (that hook would be called when giving the user-admin dedupe options, so it would show up as a rule that could be made, say, the site's default strict rule.

/**
* The "register query-builder" method
*/
hook_civicrm_dedupe() {
  // existing dedupe rules have numeric id's, so we should probably use non-numeric keys to avoid conflicts

  return array(
    'my_great_rule' => array(
      'level' => 'strict',
      'contact_type' => 'Individual',
      'name' => 'One Rule to Rule them All',
      'callback' => 'my_great_rule_query_builder',
      'file' => 'my_module_dedupe.inc',
    ),
    'my_other_rule' => array(
      //etc...
    ),
  );

  // CiviCRM will display these new rules as options which can be used or made default.
  // When actually running the query from this rule, it will load the specified include file and call the specified function, much in the same way as the first example (but without the $query param), and expect a query string in return.

}

I think this may be be preferable for a number of reasons, namely portability and efficiency. It's more portable because you can define the rule, package it in a module, and it is not at all site-specific. It's also more efficient because we're not asking CiviCRM to build a query only to have it get discarded and overwritten by the hijack-method hook.

In either case, we need to be aware of the security implications of this hook, and avoid making it too easy for newbies to directly concat user-entered data into a query.

Comments

this already exists (though slightly different than your suggestion:

     * This hook allows modification of the queries constructed from dupe rules.
     * @param string $obj object of rulegroup class
     * @param string $type type of queries e.g table / threshold
     * @param array  $query set of queries
     *
     * @access public
     */
    static function dupeQuery( $obj, $type, &$query ) {...

 

we use it extensively, and it works much like your initial suggestion -- the hook receives the rules defined in a rule group, and allows you to reconstruct/enhance them.

the one weakness is as you describe toward the end -- it uses the "hijack" method, which means there needs to be a rule group with at least one rule defined in the system. for our purposes, we will create a basic group that has a single rule with weight 1 and threshold 1. those values can be set within the hook, so they only exist in the db for the sake of registering the rule in the system so it can be modified in the hook.

we're doing some nice algorithms in the hook, such as normalizing the street address (strip spaces, remove ordinals, condense street name suffix (street > st, road > rd, etc), removed punctuation, etc.) -- which has helped up pick up quite a few more dupes.

there are a couple gotchas with the hook, such as making sure its not run on user account creation, and you need to construct the queries differerntly to account for when the rule is run during import vs. when the rule is run from the interface. but it gives you the access you're looking for.

here is what we are doing with that hook:

https://github.com/nysenatecio/Bluebird-CRM/blob/master/modules/nyss_dedupe/nyss_dedupe.module

I like the second option (being able to introduce new rules) better as well, mostly because it means being able to choose it explicitely (eg. in the import).

 

Modifying the query for dedupe is quite a hidden feature where the rule as displayed isn't the one applied (eg. you can change the threshold and add more fields matching without any impact). This is likely to confuse the users, isn't it?

 

And it seems that (according to brian's example) that you match on the dedupe name to know when to apply. using as a key something users can change is recipes for interesting issues ;)

 

Otherwise, the module introduces nice features (eg st == street), looks great. 

X+

i think the rule group ID is passed in the object. and we should probably alter to use the id at some point in our implementation. 

i agree that it would be useful to have the flexibility to dynamically create a rule group, and perhaps change the interface so the rule group is not editable. but maybe that should be handled with a new hook. so one hook can create the rule group, and a second alters the queries. the only gotcha with doing that is the dedupe code handles three scenarios right now -- user account creation, import, and find dupes interface -- each of which constructs queries in slightly different ways because of the different data being compared. the fact that you currently have to construct a dummy group/rule in the system before modifying it with the hook, provides something of a fallback should you not account for all three scenarios.

another comment about our implementation --

the way the dedupe works on import (which was our primary usage), it compares the incoming file row (converted to an array) with the table in the db. implementing data normalization to the array is fairly straightfoward. but is a bit of a pain with the mysql side. we accomplished with a cyclical REPLACE on the fields in question, which is fairly inefficient (though functional).

we've recently installed a regex library into mysql, with the goal of improving some of our algorithms further -- both in terms of flexibility and efficiency. might be nice to figure out a better way to structure some of the dedupe code so that we have more flexibility to perform the data conversion strictly in PHP.

lcdweb: this is great stuff, I really appreciate you sharing your code. I'm going to give implimenting the existing hook a try before spouting off any more about creating a new one. I do agree that having a query already constructed is nice so your hook has the option to do nothing.

I also agree about the pain of replacing strings in mysql. My phone number stripping query looks like this: "LEFT(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(phone.phone, '.', ''), ' ', ''), '-', ''), '(', ''), ')', ''), 10) = '%s'" yep, it's ugly. Let me know how your regex experiments go.

I wonder why Lobo didn't mention the existing hook when he told me to go write a blog article about this...

and his brain seems to have developed a few too many memory leaks :(

sorry about making u do the work, but on the flip side a good discussion was had by all!

lobo

That would be another nice outcome of this discussion.

 

Might even possibly somewhat balance the terrible discovery that lobo isn't omniscient (or that marmite is his kryptonite ;)

 

X+