CiviCRM Dedupe Plans and a Plea for Large Databases

Published
2008-04-09 21:38
Written by
shot - member of the CiviCRM community
We’re currently planning various improvements to the duplicate contact finding (and merging) engine for CiviCRM 2.1. Among other things, we plan to make the mechanism more responsive by caching the dedupe search results more effectively, to add the ability to restrict deduping to a certain group, and to move at least parts of the dedupe out of PHP and into MySQL (now that we require MySQL 5 anyway). Thus, a plea: if you have a large real-world database that you could share with us, please do so, either by sending it to shot@civicrm.org encrypted with my GPG key (0xD128F14A) or by mailing me to co-ordinate some other way to share it. Your database will not be disclosed to anyone, will be used solely for CiviCRM 2.1 dedupe profiling, and will be destroyed once the profiling is done.
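To give a rough idea of what "moving the dedupe into the database" and "restricting it to a group" could look like, here is a minimal sketch. It is not the CiviCRM 2.1 implementation: the `contact` and `group_contact` tables, their columns, and the `find_duplicate_pairs` helper are all hypothetical, the only matching rule is a case-insensitive email comparison, and Python's built-in `sqlite3` stands in for MySQL so the example runs on its own.

```python
import sqlite3


def find_duplicate_pairs(conn, group_id=None):
    """Return (id_a, id_b) pairs of contacts that share an email address.

    The matching is done by a SQL self-join rather than by looping over
    contacts in application code. Table and column names are hypothetical.
    """
    sql = """
        SELECT a.id, b.id
        FROM contact AS a
        JOIN contact AS b
          ON lower(a.email) = lower(b.email) AND a.id < b.id
    """
    params = ()
    if group_id is not None:
        # Optional restriction: both contacts must belong to the given group.
        sql += """
        WHERE a.id IN (SELECT contact_id FROM group_contact WHERE group_id = ?)
          AND b.id IN (SELECT contact_id FROM group_contact WHERE group_id = ?)
        """
        params = (group_id, group_id)
    sql += " ORDER BY a.id, b.id"
    return conn.execute(sql, params).fetchall()


def demo():
    """Build a tiny in-memory database and run both variants of the search."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE contact (
            id INTEGER PRIMARY KEY,
            first_name TEXT,
            last_name TEXT,
            email TEXT
        );
        CREATE TABLE group_contact (group_id INTEGER, contact_id INTEGER);
        INSERT INTO contact VALUES
            (1, 'Ann', 'Lee', 'ann@example.org'),
            (2, 'Ann', 'Lee', 'ANN@example.org'),
            (3, 'Bob', 'Ray', 'bob@example.org'),
            (4, 'Robert', 'Ray', 'bob@example.org');
        INSERT INTO group_contact VALUES (7, 1), (7, 2), (7, 3);
    """)
    return find_duplicate_pairs(conn), find_duplicate_pairs(conn, group_id=7)


if __name__ == "__main__":
    all_pairs, group_pairs = demo()
    print("all contacts:", all_pairs)
    print("group 7 only:", group_pairs)
```

In this toy data set, contact 4 is not in group 7, so the group-restricted search leaves the pair (3, 4) out. On a real database the join column would need an index, and the matching rule would be considerably richer than a single email comparison.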

Comments

Anonymous (not verified)
2008-04-10 - 09:14

We only have 18,000 records in our database now so this is probably too small a sample. But we are getting ready to do a project with 13.1 million records, all the registered voters in Texas. We had great success in installing and configuring CiviCRM with both Joomla and Drupal on our Ubuntu Linux server. Contact terry@spring.net or call 512.581.9617

That's great to hear that you're making improvements to the deduping mechanism. I often run into timeout problems with the deduping.

One request -- rather than have a single option for deduping rules for each contact type, I'd like the flexibility to create multiple deduping rules. For example, sometimes I want to run a dedup on individuals' email addresses only, and then another dedup on individuals' firstname/lastname. It's a bit of a pain to have to go back and readjust the rules every time.

Re: the plea for a large database. What do you define as large? I may have one I can give you, but I'm not sure if it's large enough for your needs.

Anonymous (not verified)
2008-04-18 - 00:12

we have 2500 contacts. I have always run into a timeout when de-duping with that - on a shared server.

will send if that helps.

Anonymous (not verified)
2008-05-05 - 11:45

We are currently working with a database of about 35K records, and will soon be importing another 5K or so. Once that is complete, we will import a statewide checklist of about 425K records. If that is useful, please e-mail morgan@progressiveparty.org.