Deduping for Guppies

Published

2016-02-05 23:37

Written by

Eileen - member of the CiviCRM community and Core Team member - about the Core Team

Welcome to the second in my blog series 'Performance for Guppies'. If you don't remember the first one that's OK - you're among guppies here. If you do want to refresh your memory it's here - I talked about what database indexes are and some instances when they don't get used.

I hit a situation recently when a customer asked me why it was impossible to run dedupe rules on the street_address field. I tried his rules out and indeed they were server-destroyingly slow. As in my previous blog there are things developers should do differently to prevent this - but there are things that an admin can avoid in order to get better performance. Deduping is inherently prone to slow queries so understanding what rules will or won't perform well can be handy.

The way deduping works is it creates a table of matches based on the field you are checking. So if you were checking surname and your table had an index on surname it would build a table internally something like this

-

John Kennedy	Kennedy	Kennedy	Bobby Kennedy
John Kennedy	Kennedy	Kennedy	Jack Kennedy
Bobby Kennedy	Kennedy	Kennedy	Jack Kennedy
George Bush	Bush	Bush	Jeb Bush

Now if you were to add Maria Kennedy to that table above you would have to add another 3 more rows. And for the next Kennedy you have to add another 4 and so one. Probably you have a manageable number of last name matches in your database. Matches on city, postcode or state are a different kettle of guppies. Try to imagine how many individuals matches you can make based on city from your database and you can understand why that doesn't work well in a dedupe rule.

The database will build an internal table like the one above for EVERY field in your dedupe rule - regardless of it's weighting. Then it will go through that table and find the rows that meet the criteria and put them in the dedupe temp table.

Veda consulting have done some dedupe improvements in 4.7 (as have Alex and John from University of Cambridge). In addition to UI improvements, Veda offers an extension with a specific rule to optimise a specific post code dedupe. If you must dedupe based on postcode I advise you to have a small database or give that extension a go. Avoid City and State.

Right, back to indexes. If your table did NOT have an index it the database internal table would look like this!

-

John Kennedy	Kennedy	Kennedy	Bobby Kennedy
John Kennedy	Kennedy	Kennedy	Jack Kennedy
Bobby Kennedy	Kennedy	Kennedy	Jack Kennedy
George Bush	Bush	Bush	Jeb Bush
George Bush	Bush	Kennedy	John Kennedy
George Bush	Bush	Kennedy	Bobby Kennedy
George Bush	Bush	Kennedy	Jack Kennedy
Jeb Bush	Bush	Kennedy	John Kennedy
Jeb Bush	Bush	Kennedy	Bobby Kennedy
Jeb Bush	Bush	Kennedy	Jack Kennedy

Not only did that get a lot bigger pretty quickly ... the more politicians you add the crazier it will get! (I'm sure there is some wisdom in there somewhere). Last time we added Maria another 3 rows got added - but this time around another 10 will be added because she gets matched with the Bushs as well as the Kennedys.

So, providing we have an index on the street address field - it should produce a sensible table of matches. Not too many people will have exact matches on street address. So, why was it so slow? Was there no index? In fact there was not one but three indexes as shown by the show create table command

    INDEX `dedupe_index_street_address_10` (`street_address`(10)),
    INDEX `dedupe_index_street_address_20` (`street_address`(20)),
    INDEX `dedupe_index_street_address_12` (`street_address`(12)),

But using the EXPLAIN command told me none of them were being used ..why?

Looking at the query I could see it was using a join on SUBSTR(street_address, 0, 10) = SUBSTR(street_address, 0, 10) - in other words it was using a database function to get the first 10 characters to compare. A quick consult with the wizard of the web confirmed my suspicions - if you use a database function it won't use an index.

Now when I say this query was server-destroyingly slow I'm talking about a query that we had to manually kill after 20 minutes. By simply changing the rule configuration in the UI to NOT have length set I was able to get the result (~11,000 matches) returned in less than 1/4 of a second.

So moral of the story - be careful when configuring your dedupe rules. Avoid specifying the length of the match in the rule. And avoid fields where lots of people might match each other.

Filed under

Guppies

Deduping for Guppies

Streamlining CiviCRM Hosting and Development with DevPanel and DrupalForge

How we use and contribute to CiviCRM at the Tor Project

Stripe now supports TWINT/iDEAL payments

CiviCRM 6.4 Release

SYSTOPIA Extension of the Month #18: CiviSEPA

CiviCon 2025 Retrospective

CiviCamp and CiviSprint Europe coming up and open for registration

WORK WITH A TRUSTED EXPERT
TO SETUP YOUR CIVICRM

Languages

Deduping for Guppies

WORK WITH A TRUSTED EXPERT TO SETUP YOUR CIVICRM

WORK WITH A TRUSTED EXPERT
TO SETUP YOUR CIVICRM