Caching Smart Groups (and optimizing Search / ACL's / Nested groups)

Published
2008-05-13 20:06
Written by
One of the core features of CiviCRM is the ability to store a query as a group (smart group). This allows folks to create groups of contacts that share the same attribute(s). For e.g. A group of all the people in California. Thus contacts can be added/edited/deleted from the database, and the smart group will always give you all the contacts who live in CA at that point in time. I suspect this feature is heavily used by quite a few installs. Hierarchical organizations / political parties can separate their contacts based on electorate / voting segment / branch etc by using smart groups. For the NZ green party, this results in approx 80 smart groups. More details on their setup can be read in this blog post While smart groups are very convenient, the current implementation within CiviCRM is quite inefficient. The complexity of the queries increase quite significantly as we try to string multiple smart groups together. Thus searching for all the households in the db gives this query:
SELECT count(DISTINCT contact_a.id)  
FROM     civicrm_contact contact_a 
WHERE  ( contact_a.contact_type IN ('Household') )  AND  ( 1 )
Lets make that into a smart group, Household. Search for all the contacts in the Household smart group that live in CA gives us this query. Note that the query is a wee bit more complicated since we also allow contacts to be added / removed from a smart group independent of the query, i.e. a smart group is a combination of a query and a static group. This extension is needed in most real life cases, where a query gives you most of your contacts BUT not all
SELECT     count(DISTINCT contact_a.id)  
FROM         civicrm_contact contact_a 
LEFT JOIN civicrm_address ON ( contact_a.id = civicrm_address.contact_id AND civicrm_address.is_primary = 1 ) 
LEFT JOIN civicrm_state_province ON civicrm_address.state_province_id = civicrm_state_province.id   
LEFT JOIN civicrm_group_contact `civicrm_group_contact-7` ON contact_a.id = `civicrm_group_contact-7`.contact_id   
WHERE      ( ( ( `civicrm_group_contact-7`.group_id IN (7) AND `civicrm_group_contact-7`.status IN ("Added") ) OR (  
                            ( contact_a.id IN ( 
  SELECT  DISTINCT(contact_a.id) as id  
  FROM      civicrm_contact contact_a 
  WHERE  ( contact_a.contact_type IN ('Household') )  AND  ( 1 )  AND 1    )  
       AND contact_a.id NOT IN ( 
         SELECT contact_id 
         FROM     civicrm_group_contact 
         WHERE  civicrm_group_contact.group_id = 7 
               AND  civicrm_group_contact.status = 'Removed' ) ) ) )
  AND LOWER(civicrm_state_province.name) = 'california' )  AND  ( 1 )
If you convert households from a smart group to a static group, you get this query
SELECT     count(DISTINCT contact_a.id)  
FROM         civicrm_contact contact_a 
LEFT JOIN civicrm_address ON ( contact_a.id = civicrm_address.contact_id AND civicrm_address.is_primary = 1 ) 
LEFT JOIN civicrm_state_province ON civicrm_address.state_province_id = civicrm_state_province.id       
LEFT JOIN civicrm_group_contact `civicrm_group_contact-1` ON contact_a.id = `civicrm_group_contact-1`.contact_id   
WHERE  ( `civicrm_group_contact-1`.group_id IN (1) AND `civicrm_group_contact-1`.status IN ("Added") AND LOWER(civicrm_state_province.name) = 'california' )  AND  ( 1 )
Note how much cleaner and more efficient the above query looks. We also avoid using subselects whose performance is suspect in current versions of mysql. So we want the functionality and power of smart groups along with the efficiency of static groups? We can accomplish this by introducing a smart group cache table and cache the results there. Here is the query with a caching table introduced
SELECT count(DISTINCT contact_a.id)  
FROM civicrm_contact contact_a
LEFT JOIN civicrm_address ON ( contact_a.id = civicrm_address.contact_id AND civicrm_address.is_primary = 1 ) 
LEFT JOIN civicrm_state_province ON civicrm_address.state_province_id = civicrm_state_province.id   
LEFT JOIN civicrm_group_contact `civicrm_group_contact-5` ON contact_a.id = `civicrm_group_contact-5`.contact_id   
LEFT JOIN civicrm_group_contact_cache `civicrm_group_contact_cache_5` ON contact_a.id = `civicrm_group_contact_cache_5`.contact_id   
WHERE     ( ( ( `civicrm_group_contact-5`.group_id IN (5) AND `civicrm_group_contact-5`.status IN ("Added") ) 
          OR    ( `civicrm_group_contact_cache_5`.group_id = 5 ) ) 
       AND    LOWER(civicrm_state_province.name) = 'california' )  
       AND    ( 1 )
Notice how close the above two queries are. The cached query above uses one more LEFT JOIN than the static group query. This is because a smart group is a combination of a query AND a static group. The query is cached in the civicrm_group_contact_cache while the static group is present in the civicrm_group_contact table. We can optimize it further by checking the static group table when generating the cache and hence eliminate one more LEFT JOIN. I'll try to figure out how to do that over the next few days So now the only problem left to solve is when do you invalidate the cache? The cache needs to be reset when data has changed that could potentially alter the query. This could be a great use of MySQL triggers
foreach smart group find all the tables and fields that could affect it. In our current example, this would be the civicrm_contact table and the contact_type field
Add a trigger to the database to reset the cache entries for that smart group if:
   a. A record was created or deleted in the civicrm_contact table
   b. A record was edited and the the contact_id column changed.
If a smart group is made up of other smart groups, we will need to do this recursively. We can further optimize the above by letting the trigger fire only when the value changes to/from the value we are interested in (i.e. if a record was created/delete/edited with either the old/new value of household type). All this gets potentially quite complex, but i suspect is the right way to go about doing so in the medium/long term. For the short term we can rely on CiviCRM hooks and ensure that they have good coverage. if we ensure that all contact objects have hooks, we can reset the cache whenever a create/edit/delete operation is fired on a contact. Since for most db's read outnumber writes, resetting the cache aggresively is not a bad idea. We also populate the cache only when it is used, i.e. on a on-needed basis Most of the above changes are now available in our trunk svn. If you any holes or missing pieces, please do let us know before we roll this out as part of 2.1 :). I also suspect the performance improvements with this change will be huge. We definitely need a few "real" database and scenarios to test and benchmark things against. Maybe we'll use the NZ voter database as our test db. If you can share some large dbs with us (preferably public domain), please let us know. Unlike previous occasions we will need to keep the db around for a long period of time The sandbox has been updated with the latest code and database changes. So feel free to experiment there (http://sandbox.civicrm.org/). Note that the sandbox is also running Drupal 6 :) lobo UPDATE: I've added the LEFT JOIN optimization described above, which should help query performance a bit more if group contact status is 'Added'
Filed under