Full-text Search for CiviCRM: Initial Thoughts ...

Published
2006-11-17 12:47
Written by
Dave Greenberg - member of the CiviCRM community
It is important for CiviCRM to have a full-fledged unstructured search engine in addition to the current structured query. I don't think MySQL full-text searching (MFTS) is a good model, for a couple of reasons. First, MFTS is restricted to MyISAM tables, and CiviCRM uses InnoDB tables. Second, MFTS is still a table-level search, and I don't think it can handle hierarchical data. CiviCRM contacts are hierarchical data sets.

It would be great to integrate something like Lucene into CiviCRM. A potential workflow could be as follows:

1. Publish an XML specification of the CiviCRM data model. We have done a fair amount of this work for the Branner project. We could extend and automate this quite nicely using our code generator. XML is also a good fit because it can represent hierarchical data.

2. Extend the logging functionality so we are aware of all modifications to any part of a contact record. Currently our logging framework is restricted to changes to the civicrm_contact, civicrm_individual, civicrm_household, and civicrm_organization records. We need to decide which tables are part of the "contact" data and make appropriate modifications (e.g. civicrm_location is directly connected to a contact, while civicrm_email is indirectly connected via civicrm_location, so we need a fairly efficient system to record such changes as contact-level changes).

3. On a periodic basis (triggered by a cron job), incrementally update the XML entries of all the contacts that have been modified since the last cron run, and reset their status.

4. Incorporate these new changes into Lucene's search index.

5. Link up CiviCRM search to Lucene search. We can use the Zend Framework port of Lucene to PHP to accomplish this.

6. Give users some mechanism to see where the contact record matched the search criteria (potentially display the contact's XML definition?).

Please do send us email / get in touch if you have a better understanding of this issue and can help us design / develop this further.
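
To make the workflow a bit more concrete, here is a minimal Python sketch of steps 2, 3, 4 and 6, written under heavy assumptions. It is not CiviCRM or Lucene code. The CONTACTS dictionary, the record_change, run_cron and search functions, and the tiny in-memory inverted index are all hypothetical stand-ins for the real tables, logging framework, cron job and Lucene index. The point is only to show how a contact-level "dirty" flag can drive an incremental rebuild of each contact's XML, and how indexing by element path lets a search report where in the hierarchy a match occurred.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory stand-in for contact data; real CiviCRM tables differ.
CONTACTS = {
    1: {"display_name": "Ada Lovelace",
        "locations": [{"city": "London", "emails": ["ada@example.org"]}]},
    2: {"display_name": "Grace Hopper",
        "locations": [{"city": "Arlington", "emails": ["grace@example.org"]}]},
}

modified_ids = set()  # step 2: contact-level change log (hypothetical)
xml_store = {}        # step 3: contact id -> XML string
index = {}            # step 4: token -> set of (contact id, element path)


def record_change(contact_id):
    """Flag a contact as modified, whichever underlying table changed."""
    modified_ids.add(contact_id)


def contact_to_xml(contact_id, contact):
    """Serialize one contact, with its nested locations and emails, to XML."""
    root = ET.Element("contact", id=str(contact_id))
    ET.SubElement(root, "display_name").text = contact["display_name"]
    for loc in contact["locations"]:
        loc_el = ET.SubElement(root, "location")
        ET.SubElement(loc_el, "city").text = loc["city"]
        for email in loc["emails"]:
            ET.SubElement(loc_el, "email").text = email
    return root


def index_contact(contact_id, root):
    """Replace this contact's postings in the toy inverted index."""
    # Drop stale postings left over from the previous version of the contact.
    for postings in index.values():
        postings.difference_update({p for p in postings if p[0] == contact_id})

    def walk(el, path):
        path = f"{path}/{el.tag}"
        if el.text and el.text.strip():
            for token in el.text.lower().split():
                index.setdefault(token, set()).add((contact_id, path))
        for child in el:
            walk(child, path)

    walk(root, "")


def run_cron():
    """Steps 3 and 4: rebuild XML and index entries for modified contacts only."""
    updated = sorted(modified_ids)
    for cid in updated:
        root = contact_to_xml(cid, CONTACTS[cid])
        xml_store[cid] = ET.tostring(root, encoding="unicode")
        index_contact(cid, root)
    modified_ids.clear()  # reset the "modified" status
    return updated


def search(term):
    """Step 6: return (contact id, element path) pairs showing where the term matched."""
    return sorted(index.get(term.lower(), set()))
```

In this sketch, calling record_change(1) and then run_cron() touches only contact 1, and search("london") returns the contact id together with the path /contact/location/city, which is the kind of "where did it match" information step 6 asks for. A real implementation would hand the XML documents to Lucene rather than maintain its own index.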

Comments

I would make contacts full-fledged nodes. Then, we can use Drupal's built-in indexing and search system, which can use the backend of the user's choice.

Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_D

You don't need to know Java at all. Search results can be retrieved via HTML/XML/JSON.

It's a much more scalable approach than MySQL's limited built-in search.

The free version has a 30M limit, but I think that's big enough for most Drupal or CiviCRM instances.

Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com

This came up on the mailing list earlier today, and I started looking around at various options out of curiosity. dbsight is definitely an attractive option; however, not being open source is a major limitation, so it is highly unlikely that we will go with it. My current thinking is to use Solr and Lucene, which seems relatively easy.

CHX was singing the praises of http://www.sphinxsearch.com/ on the Drupal developer list a couple of days ago...