DeDupe Optimization in v3.3

Published
2010-10-30 13:15
Written by Deepak

I have been working on dedupe optimization, part of the 3.3 release and a Make It Happen project, and we are quite happy with the results. A fuzzy rule (first name + last name + email) which used to take 4.3 minutes on a 65K-contact database now takes 1.02 seconds (tested on an iCore5 machine with 4 GB of RAM). On a 1.45 million contact database the same rule, which used to take forever (I had to quit after 1 hour), now takes 13 seconds. Below are some more stats.

 

                        65K Contacts               1.45M Contacts
                        Old code     New code      Old code     New code
Fuzzy Rule              4.3 mins     0.6 sec       > 1 hr       13 sec
Strict Rule             0.34 sec     0.34 sec      10.5 sec     10 sec
Params Strict Search    0.21 sec     0.21 sec      0.22 sec     0.21 sec
Params Fuzzy Search     0.23 sec     0.2 sec       5.7 sec      0.22 sec

The significant gain comes from the new query execution pattern: the results found by the initial queries are used to reduce the burden on the subsequent queries. You will notice that there isn't much gain for the strict rule, because the strict rule here was made up of only one field (email), and the same is the case with the dupesByParams search. The new dedupe optimizer comes into effect when more than one field is involved in a rule, e.g. a fuzzy rule with first-name + last-name + email. That said, if we added more fields to the strict rule we would start observing the gains, and likewise with the dupesByParams search (assuming dupesByParams uses the strict rule).

There have also been improvements to individual queries, but the gains are not as large as those from handling multiple queries, and might only be visible on large data sets. The query improvements that were made are -

  • The UNION clause has been removed. So a query that used to look like -

CREATE TEMPORARY TABLE dedupe SELECT t1.id id1, t2.id id2, 5 weight FROM civicrm_contact t1 JOIN civicrm_contact t2 USING (first_name) WHERE t1.id < t2.id AND t1.first_name IS NOT NULL

UNION ALL

SELECT t1.id id1, t2.id id2, 7 weight FROM civicrm_contact t1 JOIN civicrm_contact t2 USING (last_name) WHERE t1.id < t2.id AND t1.last_name IS NOT NULL

UNION ALL

SELECT t1.contact_id id1, t2.contact_id id2, 10 weight FROM civicrm_email t1 JOIN civicrm_email t2 USING (email) WHERE t1.contact_id < t2.contact_id AND t1.email IS NOT NULL;

now looks like -
INSERT INTO dedupe (id1, id2, weight) SELECT t1.id id1, t2.id id2, 5 weight FROM civicrm_contact t1 JOIN civicrm_contact t2 USING (first_name) WHERE t1.id < t2.id AND t1.first_name IS NOT NULL GROUP BY id1, id2 ON DUPLICATE KEY UPDATE weight = weight + VALUES(weight);

INSERT INTO dedupe (id1, id2, weight) SELECT t1.id id1, t2.id id2, 7 weight FROM civicrm_contact t1 JOIN civicrm_contact t2 USING (last_name)  WHERE t1.id < t2.id AND t1.last_name IS NOT NULL GROUP BY id1, id2 ON DUPLICATE KEY UPDATE weight = weight + VALUES(weight);

INSERT INTO dedupe (id1, id2, weight) SELECT t1.contact_id id1, t2.contact_id id2, 10 weight FROM civicrm_email t1 JOIN civicrm_email t2 USING (email) WHERE t1.contact_id < t2.contact_id AND t1.email IS NOT NULL GROUP BY id1, id2 ON DUPLICATE KEY UPDATE weight = weight + VALUES(weight);

with each query executed one at a time. Check this post for more details.

  • GROUP BY and INSERT ... ON DUPLICATE KEY UPDATE are used at the query level, eliminating the GROUP BY and aggregate function from the threshold query and keeping the weight computation discrete (a sketch of the resulting threshold query appears after this list).
  • Joining with the dedupe temp table in order to reduce the number of rows to be scanned, as required by the new query execution pattern (discussed in the next section). For example -

INSERT INTO dedupe (id1, id2, weight) SELECT t1.id id1, t2.id id2, 5 weight FROM civicrm_contact t1 JOIN civicrm_contact t2 USING (first_name) JOIN dedupe_copy ON dedupe_copy.id1 = t1.id AND dedupe_copy.id2 = t2.id WHERE t1.id < t2.id AND t1.first_name IS NOT NULL GROUP BY id1, id2 ON DUPLICATE KEY UPDATE weight = weight + VALUES(weight)
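As noted in the second bullet above, accumulating the weight per (id1, id2) pair at insert time lets the final threshold check stay simple. A minimal sketch, assuming the same dedupe temp table and a threshold of 20 (the actual CiviCRM query may differ) -

-- old scheme: weights summed at threshold time over the UNIONed rows
SELECT id1, id2, SUM(weight) total FROM dedupe GROUP BY id1, id2 HAVING total >= 20;

-- new scheme: each (id1, id2) row already carries its accumulated weight
SELECT id1, id2 FROM dedupe WHERE weight >= 20;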

 

Query execution pattern

 

Consider, for example, a rule with first-name + last-name + email with weights of 5, 7 and 10 and a threshold of 20.
For a pair of contacts to be declared duplicates, all the queries have to be satisfied (5 + 7 + 10 = 22, and dropping any one weight falls below 20). In other words, the failure of any one query is enough to indicate that the pair is not going to make it to the threshold. If the first-name query can't find any duplicates, there is no need to invest time running the last-name and email queries.
Now say the first-name query finds 5 pairs of contacts as duplicates; it is only this set of contacts that could potentially make it to the threshold. It is therefore a good decision to run the last-name and email queries for this set of contacts only, instead of scanning through all the contact records again. What's even faster is starting with the query for a field that belongs to the table with the fewest records. For example, if the email table (civicrm_email) has fewer records than the contact table (civicrm_contact), executing the queries in the order email -> first/last name is much faster than the order first/last name -> email. And that's what the optimizer takes care of.
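To make the ordering concrete, here is a hedged sketch of the sequence the optimizer could follow for the 5/7/10 rule when civicrm_email is the smaller table; it reuses the INSERT statements shown earlier and is not the exact CiviCRM code -

-- 1. run the email query (weight 10) first, unrestricted, since civicrm_email is the smallest table
INSERT INTO dedupe (id1, id2, weight) SELECT t1.contact_id id1, t2.contact_id id2, 10 weight FROM civicrm_email t1 JOIN civicrm_email t2 USING (email) WHERE t1.contact_id < t2.contact_id AND t1.email IS NOT NULL GROUP BY id1, id2 ON DUPLICATE KEY UPDATE weight = weight + VALUES(weight);

-- 2. if it affected no rows, stop: the remaining weights (5 + 7 = 12) cannot reach the threshold of 20
SELECT ROW_COUNT();

-- 3. otherwise snapshot the pairs found so far into dedupe_copy (see Bottlenecks below) and run the remaining queries restricted to those pairs
INSERT INTO dedupe (id1, id2, weight) SELECT t1.id id1, t2.id id2, 5 weight FROM civicrm_contact t1 JOIN civicrm_contact t2 USING (first_name) JOIN dedupe_copy ON dedupe_copy.id1 = t1.id AND dedupe_copy.id2 = t2.id WHERE t1.id < t2.id AND t1.first_name IS NOT NULL GROUP BY id1, id2 ON DUPLICATE KEY UPDATE weight = weight + VALUES(weight);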

If the threshold were 10 for the same rule in the above example, the situation would be slightly different, as at least two queries would be required to run:
set 1 -> first-name query (5) AND last-name query (7)
set 2 -> email query (10)
In set 1, the last-name query is executed only when the first-name query finds any dupes. The email query from set 2 is executed in any case.
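Put as arithmetic with the same weights of 5, 7 and 10, the grouping simply reflects which combinations of queries can reach the threshold of 10: set 1 gives 5 + 7 = 12 >= 10 and set 2 gives 10 >= 10, so both sets have to be attempted, while within set 1 the last-name query can be skipped if the first-name query comes back empty. The final check stays the same simple form sketched earlier, e.g. SELECT id1, id2 FROM dedupe WHERE weight >= 10;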

 

Bottlenecks

 

A worst-case rule would be first-name, last-name, email with weights of 10, 10 and 10 and a threshold of 10, since this requires dedupe to run all the queries and could still give dedupe a tough time on large data sets.

The new query execution pattern requires looking back at the already-found dupes stored in the temp table. This requires us to do an expensive operation of copying the temp table's content to another temp table just for reference, because at the moment MySQL doesn't allow referencing a temporary table more than once in the same query.
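A minimal sketch of that workaround, assuming the table names used above (the exact statements in CiviCRM may differ) -

DROP TEMPORARY TABLE IF EXISTS dedupe_copy;
CREATE TEMPORARY TABLE dedupe_copy SELECT * FROM dedupe;

The snapshot is rebuilt before each restricted query, so the INSERT ... SELECT can join against dedupe_copy while writing into dedupe.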

 

Special thanks


Thanks for all the ideas generated in this forum topic. And special thanks to Roland B (independent consultant in Melbourne, AU) and Jason B (from IMBA) for sponsoring this work.


Comments

This is great, Deepak. De-duping can be quite painful & the response to the two make-it-happen initiatives on de-duping was really strong. I've just set up a contribution page for the next one on the list: Adding previous / next buttons to speed up dedupe navigation (and hence the de-dupe process):

 

http://civicrm.org/civicrm/contribute/transact?reset=1&id=10

Will we be able to merge contacts faster? Unless I am missing something, I'm having to do this one at a time.

I have a large contact list and am doing a strict match on email. When I get the list, I click on merge next to a name and it goes through the merge workflow... then I have to start the entire dedupe process all over again (actually run the query again).

 

jA

and in the meantime, if you just open the records to merge in a new window, you don't have to rebuild the list of people who match the rule

Wow! Great to see the proof-of-concept taken to the next level. Yet another critical barrier to CiviCRM adoption in medium and large organizations being blown away.

Anonymous (not verified)
2010-12-02 - 02:36

Firstly, congratulations Deepak. This has exceeded our expectations of the response times. We've got a database of around 200k contacts, with many of the duplicates caused by integrations with other systems (and the online registrations for seminars). So this will certainly reduce the number of duplicates in the system we currently maintain.

 

Also, I just wanted to note that 'Niro Solutions' is sponsoring the work (not Roland B). We have a few in the team here and we all contribute to the community.

 

But well done. Looking forward to the final release of 3.3.

 

Cheers,

Roland.