A common question on the mailing list has been about CiviCRM search, scalability and speed. Currently basic/advanced search has been designed to try to get as many matches as possible from your contact data. However trying to get as many matches as possible results in some major inefficiencies which does not work for data sets of reasonable size
Currently the default search does the following:
1. For any value typed in the name field, it searches the civicrm_contact table and the civicrm_email table. It connects the two tables using two LEFT JOIN's via the civicrm_location table. LEFT JOIN's are not very efficient sql operators
2. To capture as many results as possible, it uses the LIKE "%value%" sql operator for the sort_name and email field. This results in a sequential scan of the civicrm_contact table and the civicrm_email table
3. It needs to also retrieve the total count of all the contacts that match the query. Thus a SELECT COUNT(*) query is needed with the above where clause
4. For the alphabetical pager, its needs to do a SELECT DISTINCT UPPER(LEFT(contact.sort_name, 1) ) with the above where clause
5. To retrieve all the values that is displayed in the selector it needs to do a multiple LEFT JOIN with the civicrm_location, civicrm_address, civicrm_phone, civicrm_email, civicrm_state_province and civicrm_country tables. It also does an ORDER BY and LIMIT clause based on user input. (in the default case, we sort by contact.sort_name and limit the results to the first 50 entries).
Thus search issues 3 queries with some fairly complex LEFT JOIN's dependent on what the user has typed in. When groups/tags are involved we also need to throw in a DISTINCT operator to avoid getting multiple results for the same contact.
Note that the query builder builds all the sql code for the above dynamically based on user entererd values. The same code is used for search builder/contribution/member/profile search and as such is fairly complex and has to handle a wide variety of cases (so the issue is bit more complex than just optimizing one fixed query). Note that we do not consider ourselves as sql experts, so any sql experts out there please let us know if we could be doing stuff in a more optimized manner.
How could we potentially optimize the above and make things scale for organizations with a reasonable number of contacts ( > 10K) and avoid the sequential scans. (We hope to incorporate some of the below ideas along with new discoveries in CiviCRM v1.7). CiviCRM v1.5 has some hidden configuration variables which allow you to speed up the search significanly in exchange for some loss of functionality.
SELECT sort_name, first_name, street_address, email, phone FROM civicrm_contact LEFT JOIN civicrm_location LEFT JOIN civicrm_address LEFT JOIN civicrm_email LEFT_JOIN civicrm_phone WHERE COMPLEX_WHERE_CLAUSE ORDER BY sort_name LIMIT 0,50;
For large data sets the above query is ridiculously inefficient (since the LEFT JOIN's take forever) A more optimized query is:
SELECT contact_id FROM civicrm_contact WHERE COMPLEX_WHERE_CLAUSE ORDER BY sort_name LIMIT 0,50; SELECT sort_name, first_name, street_address, email, phone FROM civicrm_contact LEFT JOIN civicrm_location LEFT JOIN civicrm_address LEFT JOIN civicrm_email LEFT_JOIN civicrm_phone WHERE contact_id IN ( LIST_GENERATED BY ABOVE QUERY ) ORDER BY sort_name
Note the absence of the LEFT JOIN in the first query (there might be a few depending on the search criteria) and the change in the WHERE clause for the second query. mysql has an incredibly crappy sub-select implementation (for independent sub-select clauses) which are best avoided. If your querying on address/email/state values all at the same time the above might not be very optimal
The queries that we generate with Search Builder and Search views are along the lines of the above. We do want to address and solve some of the issues raised in this post in future releases. SQL experts, please feel free to offer advice/critique either as comments or on our mailing list