Sharing experiences with ETL tools

Publié
2011-06-06 09:30
Written by
JoeMurray - member of the CiviCRM community - view blog guidelines

Are you interested in open source collaboration on how to use ETL (Extract, Transform and Load) software for data migrations into CiviCRM?

There was a good discussion on the LinkedIn CiviCRM group (http://www.linkedin.com/groups?mostPopular=&gid=1418647) last week about importing voter files into CiviCRM. It seems a number of organizations are using specialized tools for ETL (Extract, Transform, and Load), including Pentaho Kettle, Talend, and SQLManager.net. I see advantages in terms of scalability, ease of use, builtin cleanup and transformation functionality,  decreased cost and improved performance when doing large and complex data migrations in comparison with the import tools currently available in CiviCRM. The latter continue to be best in my view for smaller and simpler migrations.

We're thinking that it might be useful to 'open-source' how we are using these tools in terms of both code and documentation. Initially these might be more like the shared code snippets and recipes in the developer documentation. In some cases there will be opportunities for significant reuse, e.g., migrations from popular platforms like Raiser's Edge, Convio, Salesforce.com and probably some specialized tools like Constant Contact. Perhaps down the road there might be some areas where joint development would make sense, for example, CiviCRM extensions to support importing from a particular platform.

If people could comment here if they are interested in contributing to such an effort or making use of it that would help us identify the possible demand. It would also be useful to know who is using these tools for what sorts of migrations, and where they see opportunities for collaboration. Finally, it would good to get feedback on where we should try to facilitate this sharing...would the wiki and github be best, or somewhere else?

Comments

Yes, I totally agree on this. ETL has been very powerful for us in the past - for example using Kettle with OpenERP. They have even developed a toolkit called TerminatOOR for OpenERP that allows you to readily build simple scripts to map data and synchronize/import with the entire OpenERP API by brute-forcing the database. This is even being used in production to synchronize product inventories with Point of Sale systems. They are even talking about an ETL helper module being built into core, with a UI for mapping objects and fields. As soon as these tools came along, adoption by large companies grew as data migration and synchronization costs dropped precipitously.

So this could be a huge step forward for the CiviCRM community. I am almost certain that folks are already doing it, and especially as the US election season is ramping up, this would make for a great collaborative effort. I would be glad to help in building at least some kind of toolkit to aid this kind of work on github or somewhere similar. Potentially it could involve into something analagous to TerminatOOR that could act as a common dependency for extensions as you suggest!

I think voter data is a good starting project for this initiative since it is simple in terms of structure and thus coding requirements but also benefits from scalability and performance of these tools. Because of the regular release of new versions of voter lists, it also has a one-way data synchronization requirement. I am hoping we can see some re-use.

I agree that this is an accelerator for CiviCRM adoption and deserves attention.

I think this is great!  Comment is here so I can get updates...

I don't know much about ETL, but I'm a SQL nerd and interested in helping on that side of things. I know the Raiser's Edge data model pretty much back and forth and can help write conversion 'recipes' to get 95% of the data out in the best format for Civi, either with scripts or by using their front end tools.

Yeah, I worked there for 7 years. Raiser's Edge still beats everyone else in a client/server, internal-only setup, but it has the price tag to match and they don't understand internet stuff or donor/patron self-service. Just email or find me via the forums or my website, and put me to work where you need me.

Hi,

 

If I can manage to export as separate csv with the column name I want (usually with an intermediate onliner to replace "contact_name,surname,.." but "sort_name,last_name") and get an external unique id for the contact, I then use the csv import tool

If the format is too complex (eg too many relations between entities), I code a few lines of php using the api to put the right stuff into the right place (the relationship API could welcome some patch to deal with other fields than only the internal id)

If I need to manually massage the data before I use google refine.

I always set the source and often add a tag/group to be able to undo, or at least know where the mess comes from.

When some items are likely to be on errors, I add a tag "To be verified"

 

I stay away of the "regular" import because the wizard takes a minimum of 2 minutes to run through each step, as opposed to 1 second to arrow up and run the import command again on the shell.

 

And, of course, lots of coffee and cursing.

X+

I agree that for more complex imports, custom scripts help a lot, and run much faster. I think the ETL tools are trying to make this even easier and to make it more re-usable as well as less subject to typos and errors. Google Refine is another excellent product, and I see advantages to integrating the technologies used for data migration. As Charles mentions, these tools will be extremely useful when they provide full support for an external product like Raiser's Edge.

What I'm a bit reluctant to see is a lot of investment in replicating inside CiviCRM the kind of excellent, polished ETL functionality already available in other open source projects. We need enough inside CiviCRM to make small quick one-offs by certain types of users possible. For more sophisticated and better resourced users, either low-level custom approaches like you advocate or ones that leverage the best available open source tools seem more appropriate.

I like the idea of trying to link some of these other tools into the api rather than directly into the database. I'm not sure how many are set up for that, but I'm pretty sure they allow arbitrary code to be executed per row, which should do the job.

As I mentioned in the LinkedIn discussion, I have been leveraging Pentaho's data integration tool, Kettle, in order to migrate a client from a custom VBscript + MS SQL Server application over to the CiviCase component of CiviCRM to facilitate developing a Case Management/EMR application. My client has a number of forms which I created Activity types for and there are a 1K+ custom fields throughout all the custom Activity types I created.

I have primarily been using Kettle transformations in order to import into the 1K+ custom fields that exist in my app. First I created some custom views in MS SQL that transformed the source data to match the CiviCRM data schemas (e.g. civicrm_contacts, civicrm_activities, etc). I then exported those to CSV's and used the CiviCRM import interface to populate the civicrm_contacts table. Then I used the MySQL data import capabilities (phpMyAdmin) to import into the civicrm_activities table. I also exported all my custom field data into an Excel document.

That populated some of, what I refer to as, some of the "core entity" tables (Contacts, Activities) which was the foundation for my Kettle transformations because I could now use Kettle steps to facilitate populating the "civicrm_value_" tables (e.g. Activity type custom fields) by using the "merge join" steps to join my custom field data stored in the Excel document with the Activities stored in my civicrm_activity tables.

In addition to populating my custom field tables I've learned how to attach documents to Activities at the database level (e.g. populate the civicrm_file and civicrm_entity_file tables).

I think Pentaho would be a great platform to leverage given its extensibility and open source license. Over time I imagine custom ETL steps developed using the Kettle API specifically for pushing data either through the CiviCRM API or directly into the database. Or, as mentioned, developing more custom ETL steps based on the platform a user is migrating from such as Raiser's Edge, etc.

I would love to contribute further to the knowledge base.

Anonymous (non vérifié)
2011-06-07 - 19:49

 

We've been using Talend tools in various data migration projects.
 
Talend has many DB and file connectors which facilitate the field mapping,
missing link discovery, etc.
 
What I can see is some custom CiviCRM connectors at the CiviCRM module-level rather than CiviCRM table-level.
That should greatly simplify the data import mapping and job orchestrating from other systems to CiviCRM.

Yes, I think the value will really become apparent when the objects being imported are not table specific but deal with the data for a whole component, or even multiple components. That's when handling the foreign keys that isn't done so easily via csv really comes to the fore.

Dina (non vérifié)
2011-06-10 - 03:09

Good initiative. I'm working for years with Talend open source ETL tool and PowerCenter as well. Most of the open source ETL tools do not support the principle of reusability, but some commercial do. Such a principle can support your initiative, because it's easy then to save a code snippet (documentation and annotattions included) and let it reused by another ETL project. How do you find out if an ETL tool is supporting reusability? Maybe the ETL tools comparison is worth looking at.

Hi Dina,

That reports looks like it may contain some useful data. I'm very interested in re-usability, and making it easy and feasible to reuse previous efforts. At $825 USD its a bit expensive just for a quick scan. Do you have access to it, perhaps through your employer? I'm particularly interested in evaluations of open source tools.

Roly (non vérifié)
2011-06-16 - 18:28

Hi, I have been communicating with Dave and Lobo over the past ... wow... probably nearly 2 years now about an appropriate ETL for CiviCRM (at least for our needs).

Just to give you some background, we receive massive CSV files from an external system (currently once a week but will move to daily) - total size of all csv files together is around 2GB!!

There are approximately 3 million records processed each time and around 300 custom fields in the CRM that we need to map to. There are too many translations occurring as the system we are receiving the data from is some 30 years old and some of the values within their fields cannot be easily identified unless you are familiar with their data model and products/services.

We investigating many ETL applications and after an exhaustive research, we settled on Jasper ETL.

The primary reason for this choice was the fact that it could process the files very quickly and the user interface is extremely easy to use. It only took half a day to work out how the whole system would work for our needs (obviously setting up the mapping took longer).

We also had restrictions with CiviCRM Reports not being as flexible for what we needed to do, so it was a natural to implement Jasper Reports as well.

Anyway, after a good 2 years of "data clean up" in the external system, we now have the whole process completely automated where the external system places a password protected zip file onto our server (over a secure connection) and our code unzips and processes the files.

The whole process takes around 3-4 hours to complete which is generally completed very early in the morning.

We would like to share our experience with the community as well as some of the code we used. The only issue we see is that it is fairly specific to the financial services industry and because we have some sensitivity about that from our client. However, what I think is important is the concept of what we did and in particular, the documentation of the agreed "data dictionaries" and mapping documentation.

Our developers had to look at the "guts" of CiviCRM to find out how what core fields should be used and what custom fields were required eg should the field be at the membership level or the contact level. If it is at the membership level then you cannot extract the data out of the system easily via the "advanced search" functionality. These types of issues were the main ones and we spent many hours trying to get this right first.

We will be submitting a "case study" to Dave and Lobo within the next few weeks and hopefully that will help others in the community supporting this fantastic system.

Anyway, I hope the above gives you some insight to what we have achieved and stand by for further details. If there is any specific information you would like to know about Jasper ETL and our experiences, please let me know and I will ensure it is covered in our case study.

Sincerely,
Roly.

Has anyone since 2011 proposed building a CiviCRM ETL as an extension (as opposed to using Kettle/Talend outside of CiviCRM))? Not only is there a need to transform data but also for less brittle Reports (like a tool that doesn't require hacking about with templates)? Jasper makes nice reports but how could it be integrated /used in an extension, requires a JVM, right?