Scaling the contact import process in CiviCRM, a perspective and solution

Published
2010-06-08 20:53
Written by
A couple month ago I raised the question about CiviCRM importing scalability, and received mixed answers.
  1. CiviCRM should be mostly used for data import, not for data cleaning. Most importing scalability issues stem from people's reliance on the system to perform both data clean up and import.
  2. CiviCRM has a great data cleaning process that should be taken advantage of when doing a large import.

The truth is, the importing system performs a variety of tasks that can be cumbersome when dealing with large data set import. Moreover, the work flow (4 step process) often can be time consuming when importing a large contact file. Specifically from step 2 to step 3 will take forever if you have a large file to import. In addition, you have to keep the importing browser window or tab open the whole time during the import since the importing script is ran from the browser. As always, my interest comes from problems I have encountered. Specific problem in my scenario:

  • Infrastructure Limitation:
    • Our sites are hosted by Rackspace Cloud Hosting which imposes a notorious limit of 30 seconds timeout/redirect from the load balancer so it doesn't matter if the php_timeout is set to 0 it will give the end user a website unavailable message when script runs longer than 30 seconds. One interesting information tough is that even if the script is running on a specific server in the cloud longer than 30 seconds and the user receives the time out message, the server will still finish running the script in most cases.
    • This problem is specifically severe in the contact importing process because the contact importing process is a multi-step process that performs logic in each step, our sites usually hangs up on the Preview stage which means that the temporary import job table has been created and populated, however the import process was not finished.
  • Import File Size:
    • We generally do import job of size 50,000 to 500,000 in each contact file, fields in those files can vary from 10 to 25, even given a hypothetic scenario that our infrastructure restriction from the cloud host does not apply to everyone. CiviCRM's 8MB limit would generally limit a import file to under 100,000 contact per file, not to mention the waiting period for the importing process the end user experiences (The slow progress bar) Other problems with this method is also the session integrity. Someone could mistakenly navigate to another page in the browser tab or basically open another browser tab on the site and went to another area to work on other things.

To address those shortcomings of CiviCRM, the proposal I have come up are

  1. Simplify the work flow of the import process by creating a 1 step form
  2. Use a import job queue based system to fetch import jobs and run them on CRON
  3. Quick data validation of the first xxx number of rows to give user a snapshot of their data set
  4. Use file logging for errors rather than using database to reduce query cost

The first alpha of this project is now available on github. Although it is still under heavy testing, I'm proud to announce I have been able to use the script to run 50,000 records contact files and it should be able to handle up to 100,000 record CSV files.

Theoretically, one of the main advantage this process can offer in the future is to run multiple import jobs at the same time. Each instance of the cron will pick up an import job and run it independently from the other import jobs, one can also apply their own data validation rules. (For example: a record must have first name, last name, email address to be imported)

Some disadvantages right now:

  • The heart of this script uses a cocktail of CiviCRM APIs (contact, location, group, tag). Each API call can be costly. Specifically 2 addresses for a contact record import will require 2 location API call.
  • Lack of feedback and exceptional handling: If the script runs into PHP memory exhaustion or other fatal error, the user will not receive any feedback, the import job goes into a limbo stage.
  • No relationship import as well as some other fields: Because of my use case of importing just individual contacts, I have neglected fields that I do not need.

What's next?

I'm still working on fine tuning this queue concept and multi processing to kick this thing into over-drive, a new batch import API is being developed by CiviCRM team at http://svn.civicrm.org/civicrm/branches/v3.2.import/ which I will switch to (from the API cocktails). Ultimately I hope to stabilize this script for more broad use case and other type of importing needs.

 

Please contact me if you run into trouble using the script or would like to provide feedback or contribute to the project as well as the new import API

 

Cheers!

Filed under

Comments

very interesting

I am shocked that someone could import 50,000 contacts in a single batch. I have never had success going over 5000 contacts per batch. I always get a timeout over 5000. Normally I split my imports into batches of 2000

I think that with enough memory allocated to PHP, it is possible to crunch 100,000 a batch. With de-duping option turned off I was able to import the 50,000 contacts in 2 hours which is not bad at all.