The Need of Data Integration Tools

Lets say you have a CiviCRM database and you want to perform a massive change. For instance, lets say you have the names uppercase and you want to fix it. You may be tempted to execute a SQL query but you shouldn't. CiviCRM has a complex database where, when you modify a value, a lot of things are happening. If you modify the last name of a contact, the display name and addressee fields will be, probably, modified too and a new record will be added to the log history to keep track of who made the change and when.

An alternative to perform massive updates over your database, is using the API. The advantage it offers is that it will take care of database consistency, logging and all this complex relations between fields. The disadvantage is that you'll need write some code. So, what are the possibilities you've with CiviCRM if you want to do this kind of updates without programming?

An Alternative for Simple Tasks

ETL Process
CiviCRM offers out-of-the-box some import/export functionalities you may be interested in. If the number of records you want to update are a few thousand, this tools can perfectly do the job. You will only need to export to a csv file the set of rows you want to change, use your preferred tool to update them and, finally, import the updated rows. Of course, you'll have to take care about keeping a column (typically the id), during the whole process, that allows you to identify each row so you'll be able to import the results updating the corresponding records, not inserting them as new rows.

This three steps are commonly known as an ETL process, that stands for Extract, Transform and Load. ETL processes are part of a more general type of tools and techniques called Data Integration.

A New Tool for Advanced Data Integration Tasks

CiviCRM standard functionalities for import and export data are very useful but they are limited if you want to work with a big number of records. Plus, they have been designed to be manual, so it's not easy to schedule them, or programmatically run them. But there's a new -and very powerful- tool.

Amnistía Internacional and StrateBi

We -at Amnesty International Spain-, in collaboration with StrateBi, have just launched a first release of a new Data Integration tool for CiviCRM. One of the most known and used open source Data Integration platforms is the Pentaho Data Integration (aka Kettle). What we have done, briefly, is to develop a CiviCRM plugin for Pentaho Data Integration.

How Does it Work?

Let me explain you with a practical example what can this tool do for you. I'll tell you how to install it, create our first Data Integration job and execute it. So, let's go!

Download and Install

  1. First, you'll need to download Pentaho Data Integration from Pentaho Business Analyticss space at Source Forge and uncompress it. You won't need to install it but, before running it, you'll need to get Java if you didn't already have it. I'll assume you've installed Java and uncompressed the Data Integration package into a folder called /data-integration.
  2. Download our project from GitHub and uncompress it too. This time, I'll assume you've uncompressed it into a folder called /civicrm-data-integration.
  3. Copy the content of the /bin folder (from /civicrm-data-integration) into /data-integration/plugins/steps.
  4. Finally, copy the json-3.1.1.jar file (from /civicrm-data-integration/src/main/CiviCRMUtil/lib) into /data-integration/lib (or data-integration/libext for Pentaho Data Integration 4.4.x and previous versions).

Create our First Process

Once you've already installed the plugin, you can launch your Pentaho Data Integration tool and start working. To execute it, run the script (Spoon.bat if you're using Windows). Then, open the menu option File > New > Transformation.

Creating a New Transformation with Spoon

Creating a New Transformation with Spoon

There are two tabs at the left panel, View and Design. Open the second one. What you see there, is a lot of tools you can use for your Data Integration processes. The textbox at the top allows you to filter them. If you write “civicrm”, you should see two options: “CiviCRM Input” and “CiviCRM Output”.
The former, allows you to read information from a CiviCRM database. The second one, allows you to write data into a CiviCRM instance. For our first example, we'll use the input component. Drag “CiviCRM Input” and drop it over the white canvas at the right side. A new icon will be created in the canvas. Now, double click it and fill it with your REST URL (something like http://your-host-name-here/civicrm/sites/all/modules/civicrm/extern/rest...), your site key and your API key. Don't have to say that it's much better if your server supports the use of https instead of the use of http.

First View of the New CiviCRM Plugin for Pentaho Data Integration

First View of the New CiviCRM Plugin for Pentaho Data Integration

If you click “Get entity list” the list of entity types will be filled with all the possible entities in your database. An entity, in case you don't know it, can be everything in CiviCRM: a contact, an activity, an event or a contribution, for instance. Select “Contact” as Entity Type and add the fields id and email, then click Ok. You can preview the results, clicking the CiviCRM input icon with the right button.

Go, once again, to the left sidebar and filter using “Mail Validator”. Then, drag and drop it to the white canvas. Drag and drop with the middle button of the mouse, from CiviCRM Input to Mail Validator. Then, double click the Mail Validator icon, select “email” as the field to validate and check the options “SMTP Check” and “Result is a string”.

SMTP check for email addresses

SMTP check for email addresses

Finally, add a “Microsoft Excel Output” to the canvas and, using the middle button of your mouse, connect the “Mail Validator” with it. Then, double click it and set “${Internal.Transformation.Filename.Directory}/mail-validations” as filename. Then, go to the fields tab and click “Get fields”. You are ready to click OK and run your script.

What have we done?

We have created a process that we can save for launching whenever we want. The process will obtain a list of our CiviCRM contacts. Then, it will validate the email in a very powerful way, because we have activated the SMTP validation. It means that it won't only validate the structure of the email but it will “try” to connect the server and will ask it for the existence of the account.

If you have a non-existent email like, this script will validate it and will generate an Excel file, located in the same folder of your transformation (you'll be asked to save it before running the process) with you CiviCRM contacts, including its email addresses, and a column validating them.

The process we've created is really simple and it lacks a lot of good practises. In fact, if you execute it, you'll realise that it's very slow. It's normal, as long as it will create a real SMTP connection for each email address. Anyway, its a good introduction to the possibilities of Pentaho Data Integration. We've learnt how to Extract information from a CiviCRM database. How to Transform it, adding two new columns about the validity of the email address in our database. And how to Load it into a final resource, in this case, an Excel file.

A View of the New Possibilities

There is a lot of things you can do with Pentaho Data Integration. Almost every thing you can imagine will be done adding an icon to the white canvas. This icons are called steps and there are steps for doing an incredible amount of tasks. We've seen three of them: read information from a CiviCRM database, validate an email and write an Excel file. But there are much more options: read a text file, download a file from an FTP server, execute a SQL query to update a MySQL table, create a new column concatenating the content of two existing columns and many, many more, as you can see at the Pentaho Data Integration Steps list in the official wiki page.

And remember, the power of Pentaho Data Integration isn't the big amount of steps you have, but the fact that you can combine the however you want! Now, you can very easily create complex processes like obtaining a csv file from an SFTP server, validating it's content, importing it into your CiviCRM database and sending you an email with the result of the process. Plus, once you've created your job, you can execute it from a command line, what means that you can also schedule it to be executed every night.

Ok, I Want to Start Now!

GitHub CiviCRM Data Integration project

GitHub CiviCRM Data Integration project

If you want to know more about Pentaho Data Integration, I suggest to start reading its FAQ for begginers. In particular, it will be useful for you to have an idea about the difference between transformations and jobs.

Finally, a last advice: as you can see at the GitHub account of the project, what we've released is just a beta version. There's already a few issues we've found so, before using the CiviCRM connector in a production environment, test it in a development one.

If you find any bug, or you have ideas to improve the project, feel free to open new issues in GitHub, or send us your pull requests. If you have doubts about how to use it or, in general, you want to talk about it, please, open a discussion at the CiviCRM extensions forum.


This is looking really cool!

Take into account that we've already found a few issues.

We hope we'll release a new version in January fixing them!

There's a bunch of us in the Civi community who already use Kettle (especially for migrations) - but up until now, I've relied on direct SQL connections (for reading Civi data) and pushing CSVs to the server and running the CLI API import tool for writing data to Civi.

I'm excited to try this plugin and see how it stacks up - not having to create SSH tunnels to pull Civi data from a remote database is enticing!  Though I'm worried that we'll lose Kettle's SQL caching in the process, which will hurt performance more than it helps.  I'll definitely check it out and provide feedback though!

Thanks so much, this is very helpful for our community! I am going to start evaluating this immediately for use in client projects.

Looks interesting. Have been a fan of the Pentaho ETL tools for years. One thing I'll mention for anyone put off by Java, it's really unlikely anything you want to accomplish needs java programming skills. For example, custom scripting of data transformations between steps in the chain can be done with javascript within the UI editor.


We had the chance to get it demoed first hand at the sprint in the UK, it's absolutely magic, thanks for sharing!

On the actions that update civi, how do you manage the requests? Is there a queue or everything goes in parallel?

Had quite a bit of struggle with node.js as a civi client and without a queue, it killed the server.

Once there's a few people using this we should pool our jobs and transformations into a repositry (Git hub?).

There's likely to be crossover in what we want to achieve especially for things like data cleansing

Xavier, Kettle (Pentaho Data Integration) takes care of the petitions queue. I've launched a test transformation for creating 200.000 contacts into a clean CiviCRM instance from a CSV file and it worked fine (took a few hours, but worked!).

Owen, totally agree! Having a space dedicated to collect and explain our jobs and transformations is a great idea!


considering the interest in this post, might be good for you to do a google hangout demo of the work you've done so far and the next steps. Seems like quite a few folks might be interested and good to get that conversation happening real-time


Good idea! Is it something you use to do? I mean, is there a forum where I should announce it, or something?

and also blog / tweet about it. For events that we know well in advance, we also add them to the newsletter

I suspect we'll get enough of an audience by using the forums / event / blog / tweet. Wanna ping us on IRC and we can coordinate dates etc


Very nice tool, I didn't know Kettle and it's amazing. Better with CiviCRM connector.

I think we should open a forum brach to discuss about the tool.

For example, I have a question: the CiviCRM Output is capable of logging it's own output (return messages)? I tried to connect it to the Log step and to the Text File Output, but either of them don't recognize CiviCRM Output. So, I gone into trouble to debug the connection.



I totally agree with the idea of creating a forum branch! :-)

And what you say about the output, maybe considered a bug. I'll take a look at it. Feel free to open issues in GitHub, if you find bugs or you have improvement ideas.

Aaah, I didn't make the connection between this tool and kettle. Now I know why it looked familiar ;)

For those that tried and failed to find a download link from pentaho's site:

Kettle isn't too bad for a free tool (not as robust/good as MS SSIS but that isn't free) - what about collecting a repository of sample Jobs/Transformations for sharing, for those who don't have the time to invest in learning Kettle/Spoon (the UI is not so user-friendly IMO)?