CiviCRM Search Demystified

Pubblicato
2006-11-14 06:57
Written by
A common question on the mailing list has been about CiviCRM search, scalability and speed. Currently basic/advanced search has been designed to try to get as many matches as possible from your contact data. However trying to get as many matches as possible results in some major inefficiencies which does not work for data sets of reasonable size Currently the default search does the following: 1. For any value typed in the name field, it searches the civicrm_contact table and the civicrm_email table. It connects the two tables using two LEFT JOIN's via the civicrm_location table. LEFT JOIN's are not very efficient sql operators 2. To capture as many results as possible, it uses the LIKE "%value%" sql operator for the sort_name and email field. This results in a sequential scan of the civicrm_contact table and the civicrm_email table 3. It needs to also retrieve the total count of all the contacts that match the query. Thus a SELECT COUNT(*) query is needed with the above where clause 4. For the alphabetical pager, its needs to do a SELECT DISTINCT UPPER(LEFT(contact.sort_name, 1) ) with the above where clause 5. To retrieve all the values that is displayed in the selector it needs to do a multiple LEFT JOIN with the civicrm_location, civicrm_address, civicrm_phone, civicrm_email, civicrm_state_province and civicrm_country tables. It also does an ORDER BY and LIMIT clause based on user input. (in the default case, we sort by contact.sort_name and limit the results to the first 50 entries). Thus search issues 3 queries with some fairly complex LEFT JOIN's dependent on what the user has typed in. When groups/tags are involved we also need to throw in a DISTINCT operator to avoid getting multiple results for the same contact. Note that the query builder builds all the sql code for the above dynamically based on user entererd values. The same code is used for search builder/contribution/member/profile search and as such is fairly complex and has to handle a wide variety of cases (so the issue is bit more complex than just optimizing one fixed query). Note that we do not consider ourselves as sql experts, so any sql experts out there please let us know if we could be doing stuff in a more optimized manner. How could we potentially optimize the above and make things scale for organizations with a reasonable number of contacts ( > 10K) and avoid the sequential scans. (We hope to incorporate some of the below ideas along with new discoveries in CiviCRM v1.7). CiviCRM v1.5 has some hidden configuration variables which allow you to speed up the search significanly in exchange for some loss of functionality.
  • Let the "name" field represent sort_name only. Avoid the "LEFT JOIN" with civicrm_email table. This skips the two LEFT JOIN's and makes the first two queries significantly faster (CRM_Core_Config::$includeEmailInSearch)
  • For the sort_name,use the clause LIKE "value%" instead of LIKE "%value%". This allows mysql to use an index rather than the sequential scan.
  • If the alphabetical pagination display is not very important, suppress it completely (CRM_Core_Config::includeAlphabeticalPager)
  • If the ordering is not important, suppress the ORDER BY clause (CRM_Core_Config::$includeOrderByClause)
  • If you are running only one civicrm install (i.e. one domain id), suppressing the domain where clause also speeds things up (CRM_Core_Config::includeDomainID)
  • For large data sets it is significanly faster to compute the list of contacts you want information on and then use that list to restrict the results. As an example our current query looks like:
    
    SELECT sort_name, first_name, street_address, email, phone
    FROM civicrm_contact
    LEFT JOIN civicrm_location
    LEFT JOIN civicrm_address
    LEFT JOIN civicrm_email
    LEFT_JOIN civicrm_phone
    WHERE COMPLEX_WHERE_CLAUSE
    ORDER BY sort_name
    LIMIT 0,50;
    
    
    For large data sets the above query is ridiculously inefficient (since the LEFT JOIN's take forever) A more optimized query is:
    
    SELECT contact_id
    FROM civicrm_contact
    WHERE COMPLEX_WHERE_CLAUSE
    ORDER BY sort_name
    LIMIT 0,50;
    
    SELECT sort_name, first_name, street_address, email, phone
    FROM civicrm_contact
    LEFT JOIN civicrm_location
    LEFT JOIN civicrm_address
    LEFT JOIN civicrm_email
    LEFT_JOIN civicrm_phone
    WHERE contact_id IN ( LIST_GENERATED BY ABOVE QUERY )
    ORDER BY sort_name
    
    
    Note the absence of the LEFT JOIN in the first query (there might be a few depending on the search criteria) and the change in the WHERE clause for the second query. mysql has an incredibly crappy sub-select implementation (for independent sub-select clauses) which are best avoided. If your querying on address/email/state values all at the same time the above might not be very optimal
The queries that we generate with Search Builder and Search views are along the lines of the above. We do want to address and solve some of the issues raised in this post in future releases. SQL experts, please feel free to offer advice/critique either as comments or on our mailing list

Comments

So my understanding is that you create a complex query and then do variations on that query to retrieve the pager etc.

I think the idea of splitting the initial complex query up is a good start, but IN statements are quite inefficient. Have you thought of creating a temp table with the initial results instead of using a WHERE cid IN (list_of_cids). This might gain some ground. Generally it's more efficient to do your filtering in the FROM statement, rather than the WHERE statement.

My initial reaction is actually to try sub-queries. You mention that you've had bad experiences with that route, but I think in the past you've added the sub-queries in the WHERE statement, not in the FROM statement. From my experience with MS DB engines, sub-queries in the FROM statement is the most efficient way to go for virtually all complex queries. I'm assuming the results would be similar with MySQL, though I haven't tested thoroughly.

The following approach is very inefficient:

SELECT c.contact_id, e.email, p.phone
FROM (
civicrm_contact AS c
LEFT JOIN civicrm_phone AS p ON contact_id
)
LEFT JOIN civicrm_email AS e ON contact_id
WHERE contact_id IN (
SELECT contact_id
FROM civicrm_email
WHERE email LIKE "johndoe%"
);

The better way to go is something like:

SELECT c.contact_id, sqry.email, p.phone
FROM (
civicrm_contact AS c INNER JOIN (
SELECT contact_id, email
FROM civicrm_email
WHERE email LIKE "johndoe%"
)
AS sqry ON contact_id
)
LEFT JOIN civicrm_phone AS p ON contact_id;

MySQL basically creates a temp table of the sub-query and uses the temp table in the master query. This is naturally better than calling the email table twice. This method gets rid of some left joins too.

In this example you might also try to subquery the phone table to return only the primary phone, then you wouldn't need to use any left joins.

You could probably do your pager in php too. This would mean that you're only using one query.

It has the potential to not work out at all, but it might be something to look into.

We've got about 100K contacts in our database.

Searches were taking 20-100 seconds to complete (basic searches with a
query term and default options).

After reading http://civicrm.org/node/99 , I implemented some of the
suggestions and here's what I found.

All of my times were gathered with a pageload timer in FF and are based
on 2-3 tests of each optimization, each using the same query. Our site
implements eaccelerator opcode cache.

Changes to settings were made in CRM/Core/Config.php

Baseline: With all optimizations off
- 2 queries averaged 95 seconds.

Set $includeEmailInSearch = 0;
- 3 queries averaged 5.3 seconds
- search term still matched on email addresses.
- search for user@example.com returned a match on email (this is most likely because the user did not have a first/last name and hence their sort/display name is their email address)

Set $includeAlphabeticalPager = 0;
- 3 queries averaged 3.7 seconds

Set $includeOrderByClause = 0;
- 3 queries averaged 2.6 seconds

Set $includeDomainID = 0;
- 3 queries averaged 2.4 seconds

dave:

Would be awesome if you could run some tests of your suggested queries above vs what our optimized queries are and document the time difference between the two approaches. Will help us significantly and allow us to make a more uniformed decision

thanx

lobo