Pubblicato
2008-01-22 20:31
v2.0 has significant schema changes, and hence a simple sql script could no longer serve as our primary upgrade mechanism. We are introducing a new upgrade system in 2.0 which will run in a couple of steps and upgrade the user's database. It has been an interesting exercise to come up with the upgrade script. Here are a few things we learned as we went through the process.
To simplify things, we split the upgrade process into multiple small steps. One step for each of the four major schema changes (contact, location, activity and custom), one step for the other minor schema changes and a final step to get rid of all the old columns and tables. With the exception of custom, most of the others were packaged as simple sql scripts that could be executed against the old database.
Upgrading Custom group/field/value required some amount of php code. We had to create new tables for each old custom group with columns based on the fields in each custom group. The custom option values had to be transferred to the global option value table that potentially allows reuse among different custom fields (a v2.0 feature). Finally we had to transfer the custom values to the newly created tables.
We wrote a fairly simple PHP script which worked quite nicely and did the job for our small test database. We were fortunate and managed to get a fairly decent sized 1.9 database with approx 70K contacts, 3 custom groups, 21 custom fields and 300K custom values. We fixed a few bugs with the code but pretty soon ran into memory issues when transferring the custom values. This helped us clean up the core code a bit more and ensure we free'd up all the database objects (which are a big memory sink). However the process took a bit more than 15 minutes to run :( This definitely was not acceptable and we wanted to do much better. The goal was to get this to less than a minute (at the least!)
The code was looping through all the 300K custom values and then deciding what table/column each value needs to live in. Optimizing the code would not give us a significant speed boost. So we decided to restructure it completely and let MySQL do most of the hard work. A quick look at the mysql documentation introduced us to the wonders of the INSERT ... SELECT ... ON DUPLICATE KEY UPDATE statement. Some quick experimentation resulted in the following code for each custom field
INSERT INTO civicrm_value_1_test_group ( domain_id, entity_id, col_1 )
SELECT 1, cv.entity_id, cv.int_data
FROM civicrm_custom_value cv
WHERE cv.custom_field_id = 1
ON DUPLICATE KEY UPDATE col_1 = cv.int_data;
So rather than looping over a 300K table (and hence doing 300K * x queries, we looped over 21 fields and did 21 * y queries). This worked wonders and got the running time for that section of the code to a bit under 2 minutes. So we are pretty close and will continue our profiling to get the time to under a minute. We debated a bit over whether it was worth trying to do multiple columns with one insert statement. We decided not to, since we this would involve a "LEFT JOIN" of the civicrm_custom_value table which would probably be inefficient.
So for the above DB, the CiviCRM v2.0 has now created 3 new tables and the custom fields are now columns in those tables. the 3 tables have 32K, 4K and 25K rows. A search (or export) for 5 custom values in the old schema would involve a LEFT JOIN of civicrm_custom_value (a 300K table) 5 times with different aliases. In the new scheme, it would involve a LEFT JOIN of maximum 3 tables (if we had picked fields from all 3 tables), BUT each table is SIGNIFICANTLY smaller than the original civicrm_custom_value table. I suspect the speedup from search and other associated operations will be quite significant. We are running some benchmarks with the above db and i'll write them up in a later blog post.
The upgrade will be part of our alpha release later this week. If you have a significantly large v1.9 db (as large or larger than the above), we'd love to try our upgrade script on it and potentially improve it. We will destroy the db when we are done and will respect the privacy of the data. Please send me mail if you can share your database.
Comments
Thanks for the interesting read dlobo! I think 2 minutes is really impressive. If you manage to get it under one minute that will be a massive achievement!