Published
2010-06-14 19:10
While doing a deployment for a large organization, our good friends at Rayogram hit upon a pretty severe export scalability issue. The primary export was failing for approximately 40K rows. They contacted us to see if we could figure out what was happening and why. Their initial assumption was that the big issue was a custom feature (merge same address / merge household address), and they wanted us to look at it and potentially optimize it to fix the issue.
In general, dealing with memory issues in PHP is not something I look forward to (it comes a close second to debugging core dumps in PHP). Basically, you cannot afford to leak much memory in every iteration. For the export, if you leak 10K per iteration, you will need 400M for a 40K export, not including any of the memory used before the export was called. On my initial tracing of the code, we were leaking roughly 15K of memory per iteration. On my local laptop, I could process approximately 5K contacts before running out of memory :(
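To put numbers on a leak like that, you can sample memory_get_usage() inside the loop and watch the per-row delta. This is just a minimal sketch, not the actual export code; $dao stands in for whatever result set you are walking:

// Sketch: measure how much memory each iteration retains.
// $dao is a placeholder for the object being iterated over.
$last = memory_get_usage( );
$i    = 0;
while ( $dao->fetch( ) ) {
    // ... process the row, as the export loop does ...
    $i++;
    if ( $i % 1000 == 0 ) {
        $now = memory_get_usage( );
        // average bytes retained per row over the last 1000 rows
        printf( "rows: %d, retained/row: %d bytes, total: %d bytes\n",
                $i, ( $now - $last ) / 1000, $now );
        $last = $now;
    }
}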
It took me some time to get my head around the code and figure out what the potential problems were. I first eliminated our good friend DB_DataObject (which has a tendency to store things in static global variables). I also ensured that we were freeing various objects as soon as possible. I spent a few hours on this and realized that I was not making any significant progress.
We were collecting the name/value pairs for each contact record in memory as an associative array before writing them out as a CSV file (using code borrowed from PHPMyAdmin). I spent some time trying to figure out if I could use plain indexed arrays rather than an associative array, but the code changes for this looked a wee bit prohibitive. I then figured out how to make the CSV write incremental so we could write chunks of CSV at a time, thus avoiding storing the entire file in memory (in an inefficient manner). I was a bit sad when I only managed to get up to 8K contacts before running out of memory.
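The incremental idea boils down to appending each chunk of rows to the output file as it is generated, instead of accumulating the whole thing in one big array. The following is only a sketch using PHP's fputcsv(); $fileName, $headerRow and getNextChunkOfRows() are illustrative placeholders, not the actual export code:

// Sketch: write CSV out in chunks instead of building the whole file in memory.
$fh = fopen( $fileName, 'w' );
fputcsv( $fh, $headerRow );                 // write the header once
while ( $rows = getNextChunkOfRows( ) ) {   // e.g. 1000 rows at a time
    foreach ( $rows as $row ) {
        fputcsv( $fh, $row );
    }
    unset( $rows );                         // release the chunk before fetching the next one
}
fclose( $fh );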
I also realized that the query was designed in a manner not conducive to 40K contacts (to put it in a very nice way). Basically, the query was constructing a GIANT IN clause with all 40K contact IDs. While this is a great optimization for 50-100 contacts, I'm not sure how well it performs with 40K contacts. It also made debugging a pain, so I decided to fix it at the same time. I used a MySQL temporary table on the HEAP engine to store these contact IDs and then did an inner join in the final SQL query. While this cleaned up the code, it did not do much for the memory usage (which was a wee bit disappointing).
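The temp-table approach looks roughly like the sketch below. The table and column names here are illustrative only, not the actual schema changes that went into the export code:

// Sketch: replace a huge "IN (id1, id2, ...)" clause with a HEAP temp table join.
CRM_Core_DAO::executeQuery( "
    CREATE TEMPORARY TABLE export_contact_ids (
        contact_id INT UNSIGNED NOT NULL PRIMARY KEY
    ) ENGINE = HEAP
" );

// insert the selected contact ids in manageable batches
foreach ( array_chunk( $contactIDs, 1000 ) as $batch ) {
    $values = implode( '),(', $batch );
    CRM_Core_DAO::executeQuery( "INSERT INTO export_contact_ids (contact_id) VALUES ({$values})" );
}

// join against the temp table instead of using a giant IN clause
$query = "
    SELECT     c.*
    FROM       civicrm_contact c
    INNER JOIN export_contact_ids t ON t.contact_id = c.id
";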
At this point, I was stuck and had to resort to isolating memory usage for every function. If you have XDebug enabled, there is a nice utility function, CRM_Utils_System::xMemory, which will print out memory stats for your current PHP script. Using that function, I narrowed the culprit down to mysql_fetch_array. I'm not exactly sure why it keeps using more and more memory, but it seems like it caches the previous results.
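Bracketing the suspect sections with xMemory calls and comparing the numbers is enough to bisect a script. The label argument below is an assumption on my part (check the signature in your CiviCRM version); plain memory_get_usage() works just as well if it differs:

// Sketch: log memory at a few checkpoints to narrow down where it grows.
// The label argument to xMemory is assumed, not confirmed.
CRM_Utils_System::xMemory( 'before building the query' );
// ... build and run the export query ...
CRM_Utils_System::xMemory( 'after executeQuery' );
// ... loop over the result set ...
CRM_Utils_System::xMemory( 'after the fetch loop' );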
Our initial code was:

$query = 'real complex query here';
$dao   = CRM_Core_DAO::executeQuery( $query );
while ( $dao->fetch( ) ) {
    // process dao object and generate values for csv line
}

I rewrote the query to be:
$query    = 'real complex query here';
$rowCount = 1000;
$offset   = 0;
while ( 1 ) {
    $limitQuery = "{$query} LIMIT {$offset}, {$rowCount}";
    $dao        = CRM_Core_DAO::executeQuery( $limitQuery );
    if ( $dao->N <= 0 ) {
        break;
    }
    while ( $dao->fetch( ) ) {
        // process dao object and generate values for csv line
    }
    $dao->free( );          // this basically recycles the memory for the next cycle
    $offset += $rowCount;   // advance to the next chunk
}

The above fix (along with the incremental CSV patch) basically allows export to scale to a LARGE number of contacts (potentially infinite, as long as the query returns in a reasonable amount of time). Yet another learning experience in our CiviCRM adventure.

The next part of the project will be to optimize the merge "same address" / "same household address" features and make them scalable for a large number of records. I don't think we can do this in memory, so we might need a file-based scheme here. More on this in a future blog post.
Comments
Hi,
Definitely a great investigation and fix; it will help every installation, not only big ones.
I've seen that there is work under way so the import can be run from the shell; is it planned to do the same for the exports?
X+
But I think once we have import "batch API'able", export will be the next step. If someone would like to help make this happen and/or help sponsor it, please contact us :)
lobo
I think CLI would be a good idea, although it would not apply to a lot of people who are using a shared hosting service.
I believe using a cron job and a batch/queue import and export can help solve a lot of the scalability issues for everyone.
@Lobo: Debugging memory leaks is a pain in the @ from the limited experience I've had. I guess getting away from DB_DataObject and the PEAR libraries in general is a good idea going forward.
I have a couple of projects that need to be wrapped up; after that, I have renewed confidence in helping tackle the import/export batch API.
Chang