Deduping for Guppies

Published
2016-02-05 23:37
Written by
Welcome to the second in my blog series 'Performance for Guppies'. If you don't remember the first one that's OK - you're among guppies here. If you do want to refresh your memory it's here -  I talked about what database indexes are and some instances when they don't get used.

I hit a situation recently when a customer asked me why it was impossible to run dedupe rules on the street_address field. I tried his rules out and indeed they were server-destroyingly slow. As in my previous blog there are things developers should do differently to prevent this - but there are things that an admin can avoid in order to get better performance. Deduping is inherently prone to slow queries so understanding what rules will or won't perform well can be handy.

 The way deduping works is it creates a table of matches based on the field you are checking. So if you were checking surname and your table had an index on surname it would build a table internally something like this

 -
John Kennedy Kennedy Kennedy Bobby Kennedy
John Kennedy Kennedy Kennedy Jack Kennedy
Bobby Kennedy Kennedy Kennedy Jack Kennedy
George Bush Bush Bush Jeb Bush


Now if you were to add Maria Kennedy to that table above you would have to add another 3 more rows. And for the next Kennedy you have to add another 4 and so one. Probably you have a manageable number of last name matches in your database. Matches on city, postcode or state are a different kettle of guppies. Try to imagine how many individuals matches you can make based on city from your database and you can understand why that doesn't work well in a dedupe rule.

 

The database will build an internal table like the one above for EVERY field in your dedupe rule - regardless of it's weighting. Then it will go through that table and find the rows that meet the criteria and put them in the dedupe temp table.

Veda consulting have done some dedupe improvements in 4.7 (as have Alex and John from University of Cambridge). In addition to  UI improvements, Veda offers an extension with a specific rule to optimise a specific post code dedupe.   If you must dedupe based on postcode I advise you to have a small database or give that extension a go. Avoid City and State.

Right, back to indexes. If your table did NOT have an index it the database internal table would look like this!

 -

John Kennedy Kennedy Kennedy Bobby Kennedy
John Kennedy Kennedy Kennedy Jack Kennedy
Bobby Kennedy Kennedy Kennedy Jack Kennedy
George Bush Bush Bush Jeb Bush
George Bush Bush Kennedy John Kennedy
George Bush Bush Kennedy Bobby Kennedy
George Bush Bush Kennedy Jack Kennedy
Jeb Bush Bush Kennedy John Kennedy
Jeb Bush Bush Kennedy Bobby Kennedy
Jeb Bush Bush Kennedy Jack Kennedy


Not only did that get a lot bigger pretty quickly ... the more politicians you add the crazier it will get! (I'm sure there is some wisdom in there somewhere).  Last time we added Maria another 3 rows got added - but this time around another 10 will be added because she gets matched with the Bushs as well as the Kennedys.

So, providing we have an index on the street address field - it should produce a sensible table of matches. Not too many people will have exact matches on street address. So, why was it so slow? Was there no index? In fact there was not one but three indexes as shown by the show create table command

    INDEX `dedupe_index_street_address_10` (`street_address`(10)),
    INDEX `dedupe_index_street_address_20` (`street_address`(20)),
    INDEX `dedupe_index_street_address_12` (`street_address`(12)),

But using the EXPLAIN command told me none of them were being used ..why?

Looking at the query I could see it was using a join on SUBSTR(street_address, 0, 10) =  SUBSTR(street_address, 0, 10) - in other words it was using a database function to get the first 10 characters to compare. A quick consult with the wizard of the web confirmed my suspicions - if you use a database function it won't use an index.

Now when I say this query was server-destroyingly slow I'm talking about a query that we had to manually kill after 20 minutes. By simply changing the rule configuration in the UI to NOT have length set I was able to get the result (~11,000 matches) returned in less than 1/4 of a second.

So moral of the story - be careful when configuring your dedupe rules. Avoid specifying the length of the match in the rule. And avoid fields where lots of people might match each other.

 

Filed under