Data Mapper Pattern: A CiviCRM Architecture Proposal

Published
2009-07-15 06:57
Written by

Current CiviCRM architecture pitfalls

rasantiago has proposed a new CiviCRM architecture with details of the ORM layer. Torenware commented on the latter and mentioned some particular scalability issues.  Our own experience with the Active Record design pattern proposed by rasantiago is that it works well for small projects but doesn't scale well.  We believe that CiviCRM is now facing serious scaling problems in several areas, to wit:
  • Domain (business) logic is co-mingled with the database schema objects.  When making a change in one of these it is necessary to get involved with code for the other.
  • Overly complex class hierarchies are hard to understand.  This appears to be the result of incorporating packages developed in a variety of different environments.
  • It is difficult to isolate components and inject dependencies for testing.  This in turn makes testing a burden rather than a help during development.
  • There is no defined, stable interface for developers to code to.  Community contributed software breaks with every minor release, thus discouraging contributions.  CiviCRM currently does not have an API in the sense that many large software projects do; what it refers to as an API is a thin layer over schema objects that change with every release.

For a simple practical example of the issues, see class CRM_Event_BAO which extends CRM_Event_DAO which extends CRM_Core_DAO which extends DB_DataObject which extends DB_DataObject_Overload, which provides the service of supporting PHP 4 (which the rest of the software doesn't support).  In particular, see static function CRM_Event_BAO::add( ).  This simply calls static function CRM_Core_DAO::copyValues() to copy whatever is passed in the argument to add() into the columns of the civicrm_event table, thus tying the add() function tightly to the current schema.  Since all these methods are static, it is difficult to insert mock objects for testing, which leads to the practice of testing the entire software stack as a unit, which is slow, cumbersome, and of little or no help during development.

Co-mingling of logic and poorly defined interfaces are not a problem for a small group of developers working closely together.  On the other hand, these practices don't support expansion of the development effort to a larger, more diverse group of developers.

Goals

To support a large, diverse developer community and provide coordination without rigidity, our proposal has the following goals:

  • Provide more isolation of function.  In particular, isolate the domain (business) logic from the database schema and the forms mechanism.  This will support simultaneous independent development efforts.
  • Provide greater testability.  To test successfully, the tests must run quickly and conveniently and must rapidly isolate the problem.  Tests that are slow and cumbersome to run and give ambiguous results will not be run by programmers under schedule pressure.  Therefore, a new architecture must make testing simpler and faster than it is now.
  • Support high performance.  The new architecture must allow a function to be done where it can be done best, i.e. if it can be done faster in SQL then the architecture must not prevent doing the function that way.


Methods

  • Move the domain logic out of the current BAOs and forms controllers into Domain Model and Data Mapper objects.  The Domain Model object(s) will contain the domain logic but will be isolated from the database internals by the Data Mapper object(s).  SQL will belong to the Data Mapper.  The Domain Model will not contain SQL or any direct table reference.  The Data Mapper will be isolated from the Domain Model by not making any direct reference to the Domain Model.  The Data Mapper must expose methods that permit the Domain Model to do whatever it needs to do, efficiently.  Therefore if the Domain Model needs aggregate information that is currently distributed across several tables, the Data Mapper must provide a method that can carry out the necessary SQL join(s) without forcing SQL into the Domain Model.  If the Data Mapper needs to call up to the Domain Model, this is to be arranged by a Publish-Subscribe design so that the Domain Model can provide a callback to the Data Mapper without exposing its internal details.
  • Use Dependency Injection to  define the object(s) referenced by the various Domain Model and Data Mapper objects, replacing the existing require_once statements.  This will make it possible to substitute a mock object during testing, which will make the tests faster, more thorough and more useful.
A start in this direction has been made, using the CiviCRM 2.2.7 CiviEvent component as a testbed.  This component was chosen because it is small enough to convert in a reasonable amount of time.  The code can be retrieved from the DharmaTech Subversion repository anonymously at https://svn.dharmatech.org/svn/civicrm/2.2/branches/newarch
The new work is in CRM/Event/DataMapper.php, CRM/Event/Table/ and tests/CRM/Event/ .  The tests are designed to run under PHPUnit 3.3.17 and require the PDO SQLite adapter.  The Zend Framework has been added to provide integrated common functions.  Note that this code is under rapid development and the strategy used will change with experience.

Discussion

The current CiviCRM architecture ties domain logic closely to the database schema in a design pattern similar to Active Record.  This is an adequate architecture for a small software package but becomes cumbersome as the package grows.  When the database schema is changed, the domain logic must be reorganized even in areas that should not be directly impacted by the schema change. When the domain logic is changed, the change must be distributed over several modules associated with various parts of the schema.  In comparison to the Active Record design pattern, the Data Mapper pattern promotes greater separation between domain logic and the database.
Any solution to this problem must guarantee that there is a good place for complex SQL joins.  An over-simplifed architecture may hurt performance by forcing the retrieval of unnecessary records, or by substituting multiple database calls when a single join would be better. The Data Mapper can include any complex SQL necessary for optimal performance.
Given a desired change, say for example the addition of mother's maiden name to the contact information, it is of course necessary to change several parts of the system.  The new information must be input on a form, stored, searched for and retrieved.  The goal of isolating functionality is to protect areas of the software that are not directly impacted by this change.
The current static object hierarchy does not provide a good way to inject mock objects during testing.  Therefore testing is currently done with the entire hierarchy of production objects, which makes testing slow and test results difficult to interpret.  By injecting the object dependencies in a suitable way, the production configuration of objects can be easily adjusted during testing to insert mock objects where appropriate.  This will lead to faster tests with better isolation of defects.  It becomes possible, for example, to replace the production MySQL database with an SQLite database implemented entirely in RAM, which can be set up and torn down much more quickly and does not require a username/password.
When software changes its interface with every minor release, a developer outside the small group of core developers is unable to write interfacing software with assurance that it will still run in the next release.   Most large software projects solve this problem by defining and documenting an API.  See for example Drupal's API definition.  It is customary to make a serious effort to protect backward compatibility in new releases to protect the investment in existing software.  In order to implement this policy, the system architecture must provide enough stability to define an interface, and there must be enough regression tests in place to verify that the interface still works after a change.  The decision to break backward compatibility is a major decision, not to be undertaken lightly, since it destroys working software.
DharmaTech
Filed under

Comments

Great post! It reinforces most of raSANTIAGO's current proposal and prototypes. Everything you call out as desirable is already provided through CiviBASE, our application of PHP Doctrine to CiviCRM plus some of our own development (e.g. Dependency Injection, robust API, etc.). Where we disagree is on the ORM layer.

There are two important points which are completely ignored in the objections to ActiveRecord being made by you and Torenware. These likely come from limited experience with Active Record implementations and large scale application architecting:

1. There is nothing within ActiveRecord which prevents the use of custom SQL to optimize batch and multiobject CRUD operations. To wit, the PHP Doctrine implementations expects there will be times when handcrafted SQL will be needed. Like good ActiveRecord implementations it has a facility to insert this SQL and to manage it separately from business logic. This is why for each object there are up to three files (the Object File, the Base Files and the Table File) which are transparent to the developer working at the business logic level. This is described in our previous post and well documented in the Doctrine documentation. I think its worth gaining some experience with a decent Active Record implementation (Doctrine specifically) before suggesting something like the Data Mapper pattern, which IMHO would be a huge step backwards for CiviCRM.

2. Scaling issues usually have nothing to do with the ORM layer. They are usually artifacts of poor object design, not understanding how to use the ORM layer or from trying to perform ETL or Data Analytics on an Application architecture. ETL and Data Analytics architectures are very different in nature from Application architectures. So I have yet to see any scaling issues that have anything to do with Active Record. To wit, you can look at large applications like Twitter and anything from 37 signals and see that scaling with Active Record is not an issue.

That said, I think its best that we put the Data Mapper idea on hold till you get a chance to learn about how Active Record works and how Doctrine works. We will be shooting out a series of blog posts in rapid succession to step through a number of demonstrations.

In these posts we will directly talk about those issues you highlight that we are 100% behind solving (e.g. demingling business logic, dependency injection, testing, API, etc.). Moreover, we want to integrate our new architecture with your excellent work on testing. We think they go hand in hand.

I would make one request, though. One of our major foci will be "scaling and complexity issues". What would help immensely is if you could provide a real CiviCRM example of an operation you think ActiveRecord will just bog down on. I think this will help put to rest these objections. (Everyone is welcome to contribute on that one)

Roberto, since Active Record has been around for a few years now we think you can assume most professionals know something about its pros and cons. We believe there is general agreement in the industry that Active Record, especially in its Ruby on Rails implementation, is a good way to put up a small web site fast. The issue is how well you will be able to make that approach scale. We look forward to seeing your code to do this. As of now your Github repository contains many files that are only skeletons, e.g. civibase/models/CivicrmAcl.php, and relatively few that have working code, e.g. civibase/models/ActivityRecordBehavior.php. No doubt we can always learn from the work of others. As we continue developing our Data Mapper architecture, you will be able to study it in our public repository. If you do the same with your Doctrine Project AR implementation, we will all be able to compare them and learn from them.

I don't want to prolong this point because its really just ignorant to make sweeping claims like the ones you are making about Active Record and RoR. So here is a good blog post, one of many, that connects to different case studies where RoR scales just fine to very large sites with lots of data and users. There are some very recognizable names in that set of case studies. I suggest reading through this information and then try again at an analysis of Active Record and RoR. If you need further references just shoot me an email. My team would also be happy to take your team through a set of tutorials on Active Record, RoR and the real pros and cons.

We are preparing our next post right now with different "complex" examples as Lobo has suggested. I think what would be good in the interest of this discussion is to provide two things:

1. A defintion of what you mean by "scaling"
2. An example of how you see Active Record not being able to handle it

My hope is that with the sureness of your statements that this should be easy to produce.

Looking forward to your response.

I was just thinking something similar to Lobo's last comment. My other thought is that there seems to be a lot of enthusiasm & good ideas from both Dharmatech & RSantiago & there might be a point in actually workshopping it face-to-face.

Maybe the two camps, and anyone else who has another proposal, could put together proposals and demos for the Advisory Board.

Going through the exorcise of writing the proposals and demos and timlines for an audience that ranges from developer to user would probably be be worth it. Not to mention a good test for the parties to gauge if they are ready and willing to lead the charge on such a big project.

Kinda cool to see these proposals coming up on the blog. accompanied with sample code of your thoughts and ideas along with tests and documentation

Did an svn checkout of the branch and it seems like the Table/DataMapper is self contained and not referenced by the other CiviEvent code. Would be cool if you'll could take it to the next step and migrate say the Manage Event (Edit/Update/View) to follow the new pattern. It has the advantage of mainly being isolated to CiviEvent

In an answer to roberto's queries about complex sql conditions, the 3 most complex processes within CiviCRM are:

1. The Search Query: applies across all tables directly or indirectly connected to a contact. This also applied to all component searches

2. The "Mailing List Query": The query which figures out what contacts will be sent the mail

3. The dedupe process from finding the list of contacts that can be potentially merged to the actual selective merging of data from one contact to the other

4. (a bonus) Import is quite complex from a sql perspective especially if you want to batch and optimize it.

For an active record implementation, i'd vote for the dedupe process

Also interested on your thoughts and comments on the Zend Framework and how suitable it is for CiviCRM going forward?

lobo

We're just beginning this effort. CiviEvent was chosen for the first test because it is a small component and therefore approachable. So far, ZF has been working well with PHPUnit to do test driven development. It's really nice to write a line or two of code, push F6 and run all the unit tests in a second or two. We're learning a lot about Zend Framework and the process of developing with it and PHPUnit. As you noticed, there is as yet no integration with the CiviEvent user interface. There is a lot of support for forms in Zend Framework that we have not yet explored. As we do, perhaps we can learn enough to give an intelligent answer to your question about whether Zend Framework is suitable for CiviCRM.

ZF has been high on our list of potential frameworks that we can/should adopt. However their forms package still need the ability to handle multiple form pages in a clean AND easy AND transparent manner. Building a multi-page form within CiviCRM is crucial, in prior version ZF support for this was not great. However there were a couple of proposals that were planning to address it, so punting that to a later stage might help

that said, i took a quick glance at 1.8.4 and it does seem that sub-forms are part of the main package now :)

lobo

Have you looked at Symfony at all? There is a lot to like about that framework (one of them being the use of Doctrine as the ORM layer). Here is a tutorial on multi page forms in Symfony. Tell me what you think. I am not up on server side generated multi page forms.

That brings me to my question: is there any real need for a special multiple form library? If so, why? We have been doing multiple form interfaces and never required any. Not sure I am understanding the issue.

At this point in time we are not evaluating or looking at frameworks. We want to get 2.3 out the door. for some future release we will take a look at what framework we can potentially use, but we just dont have the time to carry out the exercise right now. If you'll have a fair amount of free time, please help out by QA'ing 2.3

IMO, it is not very polite to call other folks irresponsible when you have no idea what their schedule / work load / obligations are.