Developers: we want your input. Database layer improvement in sight (?)

Gepubliceerd
2011-07-12 23:27
Written by
xavier - member of the CiviCRM community - view blog guidelines

Hi,

 

We are having a discussion on the API group about the current limitations of the framework used in civicrm for the ORM and DB layer (pear DB_DataObject and DB... ).

If you are not familiar with the BAO/DAO structure, you might want to read this blog  post (written in 2006) first.

Beside a general dislike of the existing framework that show its age, we have identified several issues:

  • DB_Object as an ORM wasn't able to handle the complex queries needed, so they are various extra query classes. 
  • At various places, the developers worked around the ORM and generated the sql and used the DAO only to execute the query.
  • We aren't that often using the result as attributes of the object, but more often directy convert them to an array.
  • And anyway given all the limitations and shared global variables, we tend to avoid using a dao for more than one request (or freeing it).
  • Most of the BAO have created static delete  methods that have different names (deleteLineItems, deleteParticipantPayment, del...), making it harder to find the right one without looking at the code.

2 years ago, a new framework has been evaluated and the proposal was to switch to doctrine.

Today, the database abstration layer of drupal7 is an independant project that could be used by civicrm too.

As this is a fairly big change that will last for years, we would like to have as many input as possible so we are choosing the best available solution today, and that will still be a good one in 2019.

I know that most ORM are able to handle the trivial CRUD, but we need one that would allow us to handle more complex queries.

 

The starting point was the contribution get API. We need, based on the requested fields to be returned to be able to join the contribution to the contact, to the address, email, phone of the contact, to the membership linked to the contribution, to the event linked to the contribution.

Given the complex database shema, would it be possible to generate the needed query with only the needed jointures, without having to write the sql directly?

What is your experience with existing frameworks (propel, doctrine, redbean, Kohana ) ? Anything you'd like to promote as the best solution?

 

X+

 

 

Comments

I have no experience with any framework or ORM, but I imagine it will be a huge improvement. I have been hit a number of times with the current inconsistencies when developing, fixing issues or upgrading API. And I am lazy enough to use the DAO just to execute queries...I admit, guilty! Although I have become better at using the DAO thanks to assisting with the API development....something every Civi developer should do for some time hint hint).

Question: is the database abstraction layer of Drupal7 robust enough to give us all that we need? What were the reasons to choose Doctrine at the time, and what is the answer if we take the same reasons and project them against today's framework world?

Erik

Luciano (niet gecontroleerd)
2011-07-13 - 03:16

Well this topic I have to say it's the most common discussion in any software development.
DB relation-oriented vs Source Code object-oriented. Which way is the best to connect these two different worlds?
And in my experience the right answer is: There is no right answer.

Depends on the project, the technology/frameworks available at the moment, performance, the business model, manpower, and so many variables.
I don't really think that the option you choose today will be as effective in 2019. Things change too quick in software.
I would say that the primary concern here is the programming language. PHP is not a fully object-oriented language as the OO definition was conceived.
I've developed in the last 10 years maybe around 50-60 of PHP projects, and I've never found that including an ORM in the DAL was worth it. Basically because the time I had to spend including and adapting it to the PHP way was considarable, and the benefits I got were not that much.
But the scenery it's totally different when programming in languages like JAVA, .NET or any fully Object oriented language

So is it really worth to integrate ORM framework or develop a new one, change the whole Data Access Layer, spends hours in a re-factoring, when the same goal can be achieved by a single SQL query executed in 2 lines of code?

Of course this won't follow the standards and patterns that every developer want (or should want) to achieve to write a "pure" code, but it will be easier, faster and reaching the same goal.

So, as always in life, it's a matter of balance.
Anyway I will be glad to cooperate in this discussion and try to find the best approach for this issue

Hi,

 

I'm not convinced by the ORM we use, as we seem to spend more code walking around it than it would be to be closer to the SQL.

 

This being said, with an ORM like django's (the only "serious" one I know), I've been more or less able to generate the queries I wanted by simply using the ORM, without too much fight (mostly to understand the syntax, find the magic option...). But this was for simplier DBs than civicrm.

 

Right now, the contribution API is using the contact query builder, that is definitely not a part that could be re-factored easily. beside, we would need a contribution query as the starting point of the request, not a contact query.

 

I'm looking for a solution to solve that. If it doesn't solve it, that's probably not good enough to be worthwhile the switch to a complete ORM and the option to get rid of the ORM might be a good one. Really not sure, either way

 

X+

In a previous role, my group was tasked with researching ORMs, SQL frameworks and the like. One of the discoveries we made is that 80% of applications in that organization did not need or desire a true ORM. 80% just needed a simplified, cross-vendor means of interacting with the database. The 20% that needed a full ORM were teams that had the luxury of starting fresh, where they allowed the tool to generate the db tables and all needed SQL based on the objects in their application. (what many call top down style).

So we narrowed our evaluation focus to "wrapper" type db frameworks, rather than a ORM tool. So the main criteria was:
- elimination of boilerplate code when accessing the db
- being able to swap out databases from different vendors, such as MySQL, Oracle, DB2, etc. without changing code.
- Simplified error/exception handling.
- Simplified mapping result sets to objects/arrays.
- Standard data mappings done automatically, such as mapping SQL dates to programming language dates.

Just to be devil's advocate, the only programmers interacting with the database layer at all should be the core team, and the API team. Everybody else should use the APIs exclusively. Of course in the real world, we often need to directly access the database layer.

The DB abstration layer is only an issue for those working on the core (and thankfully that's not only the API and core team).

The vast majority of extensions have no valid reasons NOT to use the API, and plenty to use it indeed.

However, some (custom search/custom report/ACL hook) are going to expose the DB more than I wish (but with valid reasons).

X+

I think this is a really important decision, and I agree with the "ORM is overkill" crowd here.  Which makes me inclined to the endorse the DBTNG option.  I worry about the future of MYSQL with Oracle at the helm, case in point Haskel/Jenkins.  So being able to plug into another db option would be very useful.

 

In my experience, there are plenty of reasons to need to use the ORM/SQL building tool when you are writing custom modules.  In fact we rarely have been in a situation where the API solves all the problems we're looking to solve.  We often find ourselves using the API 90% of the time but needing to call BAO/DAO (usually executeQuery) to get that last 10% done.

Sarah Gladstone (niet gecontroleerd)
2011-07-13 - 03:20

Is the Drupal Database layer a true ORM? Or is it just a thin layer over PDO? My impression is that its the later.

Also, how would this impact writing custom searches or custom reports?

indeed, that's more like a simple PDO wrapper.

As for impacting custom searches (or anything for that matter), I don't have a clue of what are the options yet, bit too early to tell, but definitely we shouldn't brake the existing external code (without a good reason and a clear way to fix it at least)

 

X+

The Drupal 7 DB layer project referenced is a backport of the D7 approach to Drupal 6.  It's not really a library is it?  Would that require forking the D6 backport into a CiviCRM specific solution?

Some random thoughts... which may or may not be consistent... or repetitive...

 

* Before working on CiviCRM, I developed several systems with JEE 5, and I really liked Hibernate, so the idea of adopting a robust ORM sounds great...

 

* An ORM or similar data layer can provide several different values and benefits:

 

 + By binding SQL types/contracts to the language's native types/contracts, programmers get the benefit of native contracts -- i.e. compilers/interpreters/IDEs can provide better guidance and errors. Of course, based on experiences with Drupal/CiviCRM, I'm pretty pessimistic about realizing this benefit in PHP.

 + By providing reflection, the data-layer can facilitate model-driven applications -- e.g. when composing the "Edit Contact" screen, one can enumerate the custom data fields, get their types, and prepare suitable HTML widgets. This is very important in CiviCRM.

 + By providing a query-builder, it can enable dynamic queries that are shaped by end-user choices (e.g. "Advanced Search" fields), by administrative policies (e.g. ACLs), by contributed modules, etc.

 

* There is a schism in the data layer: some tables and fields are managed in the xml/gencode/DAO subsystem, and others are managed in the CustomGroup/CustomField/CustomValue subsystem, and considerable effort goes into writing logic that works with both. There should only be one data layer with one model, but it should load its configuration from multiple sources (i.e. XML files or SQL rows in civicrm_custom_field).

 

* While deficiencies in DB_DataObject are probably the root cause, the sickness is the multiplication of similar-but-different, slightly-incompatible data-access mechanisms and query-builders. To be clear:

 

 + DAO::executeQuery

 + DAO (build query with attributes, whereAdd, etc)

 + BAO (static helpers)

 + BAO_Query

 + APIv2 / APIv3

 + Custom Search

 + Custom Report

 + Ad hoc Selectors

 + Misc AJAX

 + ACL hooks

 + Drupal db_*

 + Smarty-based data tables

 + Javascript-based data tables

 

To demonstrate the value of a more robust query-mapper, we should show how these layers would do a better job of working together.

 

* There's going to be considerable effort in plotting a migration path -- e.g. which functions/classes/contracts are replaced; do we immediately rip out/update all the dependent code, or do we provide a compataibility layer; etc. It would be really interesting to get some hard-data about the scope of this effort and the tradeoffs in it -- e.g. setup some kind of function-trace and assignment-trace for DB_DataObject and its descendents.

On your list, you are mixing UI issues and data access issues, related and probably too many of each, but don't need be addressed at the same time.

I think some are unavoildable, and being able to shoot sql directly should still be an option, but civicrm (core, not extensions) should probably do a better job at keeping them in the same place.

Not sure how to trace calls to DB_DataObject. If you have an idea, could you start a thread in the dev forum?

The forum thread on tracing is here:

 

http://forum.civicrm.org/index.php/topic,20749.0.html

 

I agree that my post mixes a list of things which aren't apples-to-apples -- the idea was to identify areas of the CiviCRM architecture which have been negatively influenced (i.e. overly-complicated, sometimes reinventing the same concepts) by weaknesses in DB_DataObject.

 

If we focus too narrowly and if we -only- swap DB_DataObject with something else, then what do we have? A new API plus adapters for legacy code? That sounds more complex than the status quo. To ensure a net-improvement, shouldn't we take advantage of a new tool by improving other parts of the system?

Still not sure why you are putting things like ajax in the list.

Part of the problem of the weak DBO is that everything is calling everywhere at every level. A saner DBO would allow to enforce that only the BAO calls the DAO, that BAO override CRUD methods instead of adding new ones...

 

(and whould be good to move that to the forum, following a discussion in the forum is a PITA).

I'm strongly in favour of making a move along these lines.

I think DBTNG from Drupal should be given a very serious look. It's a bit apples to oranges, since there would be additional strong benefits provided that would not be provided by the other frameworks in terms of being able to leverage the Drupal as a development framework stack more easily. Personally, my bias is towards being able to do more bulk queries more easily, to push more business rules into database stored procedures, and to eliminate techniques that prevent a good DBMS from doing its job like the entity_table/entity_id approach that precludes the database from properly managing FK constraints and cascades.

I like the idea of DB provider abstraction that PDO provides, and which most of the other candidates likely also provide.

I would recommend that, as in any open source project evaluation, the strength and diversity of the developer and user community, as well as documentation and other eco-system metrics be included in the evaluation. I'm less concerned about theoretical purity of approach and more about how it will help improve developer productivity and make it easier for people to climb the learning curve to contributing to core code. A large developer base that might start working in CiviCRM would be a big plus. Scalability of the underlying engine and performance are important, but I'm reluctant to move towards NoSQL for our core needs.

I have a few larger installations (500k - 700k contacts) and may shortly be doing a 3M contact instance. I'd be happy to volunteer to do some testing of different approaches on them.

Anonymous (niet gecontroleerd)
2011-07-13 - 23:27

Coming from an SqlAlchemy background I think there are two main features of an ORM:

  • 1. It let's you get rid of simple but repetitive tasks like:
  •   simple queries involving only one or a few tables
  •   handle datatype conversions
  •   support form creation and validation
  •   …

 

2. It helps with some tedious tasks:

  •   DBMS independence
  •   schema migration
  •   Since queries are built in a structured form modules could be allowed to hook-into and modify queries before they are executed.

 

But for all it's goodness it has also drawbacks:

  •   There are limits to it's capabilities. CiviCRM uses queries that are more complex than any ORM can handle. That's the point where SQL needs to be used directly. So the question is: Is the ORM able to integrate custom queries? How big is the gap?
  •   For complex queries the ORM is bound to be not much simpler than SQL. Everything that's (nearly) as capable as SQL also needs it's complexity. What benefits does the ORM provide in these cases? Are they worth learning another as-complex-as-SQL language?

Some of these issues do not only depend on the ORM-framework itself but also on how it is used.

 

Drupals „ORM“ is a rather thin layer that fits it's not so OO approach to programming. CiviCRM is a lot more DB-intensive. So I guess drupal's ORM alone won't suffice. A yet stronger integration with Drupal would be nice from a e-campaigning perspective though.

 

But first things first: What CiviCRM needs is unification as there are currently so many ways to interact with the database. A new framework won't solve this.

Hi,

 

I think knowing all the different places that touch directly the DB is useful to identify the missing features of the existing and help choosing the right one.

 

As they all use DAO, that's not that bad and we *might* be able to add an executeQuery method and cover more or less all these case.

 

Don't have experience with SqlAlchemy, but could it be used to generate a query for the Contribution (fetch all the contributions, the contact & membership associated, the address of the contact, email...) "easily" and be smart enough not to join with the email table if the email isn't requested in the result list?

Is that's as complicated as the existing Contact Query ?

I agree with torotil that xavier is almost surely over-optimistic in what an ORM can do with regard solving the issues with CiviCRM's interactions with its data store. It will be able to handle the simple stuff, but to the extent it is used to handle everything it will become as complex as SQL. Database independence is a big thing for me, but object relationship management per se is not for an application like CiviCRM. We'd do better to enhance the Data layer so that fewer calls were made to MySQL that bypassed DAO. ORM is one way, PDO another, DGTNG a third. Personally, I'd prefer to push more functionality in the BAO layer into stored procedures and triggers.

Obviously I've campaigned for DBTNG for a while. But to be fair, since CiviCRM has never been concerned with supporting other databases than MySQL, PDO alone may be good enough, and may have a slightly easier learning curve, though it's not actually an ORM. If it's desirable to truly open the door to supporting other DB backends, then DB:TNG is the way to go, IMO. If we also used SchemaAPI (which is not really an API, but just a format for defining DB schema as PHP arrays) then we'd have everything we'd get from a 'real' ORM. I have no experience with Doctrine, and I haven't actually used Kohana, but have reviewed it, and I think it's more than Civi needs, but if DBTNG/PDO loses to something, Kohana would probably be my second choice based on what I know of it.

Hi,

So far, the issue is rather than the ORMs wouldn't be offering all what we need. Do you think kohana could manage generating the complex queries for the contribution get for instance?

 

As for schemaAPI, that's a drupal only thing, or is this standalone (like DBTNG)?

Erich Schulz (niet gecontroleerd)
2011-07-18 - 03:07

I'm a complete amateur, but I struggle with the benefits of the api for reading the database, also most writes are probably quite safe provided any 'gotcha's' are documented... and there can't be that many gotchas

so imho, documenting the database (which tables and fields it is safe to attack and which ones aren't) will provide the best balance of power and flexibility...

and API should be seen as value adding - not as guardian of data integrity - sql is standard and ubiquitous... why not unlock it's power?

the disclaimer like the one here
http://wiki.civicrm.org/confluence/display/CRMDOC40/CiviCRM+data+architecture

just seems to be an uneccessary "barrier to entry" for civicrm newbies - if there are a few transactions that it's dumb to do in sql (and there may well be a few legitimate examples) then can't they just be identified and explained??

sorry the above is spoken out of ignorance of all the wonder of the current apis - but i've spent at least 30 hours trying to digest all this stuff and I can't help but wondering if its not all been made more complicated than it needs to be...

but you did ask for opinions

:-)

Hi,

Could you post on the forum api what doesn't work for you?

Beside being almost 100% of the case shorter to code via the API, it allows to abstract the DB schema, that changes.

Going directly to the DB, you are almost sure to have your code braking or corrupting data at one point or another. Tried that, it's a pain in the...

X+

I agree that the API can provide more concise code in several circumstances (e.g. some mundane data-conversions are built-in), and I've often used the PHP API on the theory that it's more reliable/future-proof than SQL. But, to my mind, the facts don't fully support that theory. The biggest headaches I've encountered when upgrading have stemmed from subtle, under-documented grey areas in the API -- things which would have been clearer or irrelevant in SQL.

 

For example, consider locational data (emails/phone numbers). In the past, we wrote some code which accessed locational data with SQL and db_query(), and we wrote other code which used api/v2/Location. In leaping from CiviCRM 2.2 to 3.4, the SQL code continued to work -- but the code which used api/v2/Location suffered subtle regressions that wouldn't be noticed without conscientious manual testing because gray areas in the API were changed (hint: one version creates blank placeholder records in civicrm_address when writing phone numbers; the other version doesn't; and, depending on the behavior, subsequent updates might use civicrm_location_add or civicrm_location_update). At this point, it wasn't worth documenting or fixing the issue because (a) api/v2 was deprecated and (b) the code which consumed api/v2/Location was doing crazy backflips in order to satisfy its contract. Eventually, I had to rewrite the API-consumer with api/v3.

 

In sum, the PHP-based API is a pain, too.

 

The fact is that, regardless of whether the published interface is SQL or PHP, the published interface will eventually change, and that will eventually suck for downstream developers. What matters is how often changes happen -- and how we handle those changes.

 

One can reasonably argue that SQL handles those better than the historical/current PHP approach -- it's explicitly-typed, performs basic data-validation on every single field, clearly tracks schema revisions (just run a diff on xml/schema), and benefits from a de-facto commitment from core (i.e. lots of core code and lots of reporting uses SQL, so there's a built-in incentive to avoid changes).

 

(Aside: For clarity, I'm only talking about SQL as a mental-model/data-model. For writing queries, it's better to use a query-builder like DBTNG or Doctrine.)

And we have been bitten too.

API v3 is build with a stronger test case and more logical and simple function (CRUD on separated entities), so it will/should be way more stable. History will tell.

 

The core doesn't make a de facto commitment on the schema. We explicitely says that the schema will change and API will not (or if it does that we will handle better the communication around it).

 

This being said, of course that's open source and one can change manipulate data at any level (and I did hammer SQL too). Just that we encourage people doing it through the API.

Personally, I think the larger priority should be pushing less and less to mysql functions and instead move the processing to PHP. We should be able to support having tables on different database systems and types--mysql for transactional and some nosql variant for everything else. We should not be relying on foreign keys to cause cascades. We should not be relying on db triggers. We should not be relying on temp tables in the DB. All of these things make it much harder for a developer to accurately track down relevant code. For example, if you import data in to a database that does not support foreign keys, production data starts getting really hairy really fast.

Considering how much easier it is to scale app servers than it is to scale a database, this would greatly improve general Civi scalability.

It just so happens that moving excessivly complex functionality out of the DB and in to code simplifies the process of replacing a database abstraction layer. There is a good reason that many of these layers don't support much past basic queries.

I would trust MUCH more a relational db to delete related data or trigger an aditionnal action than into the coder that would need to implement it in code anyway above a dumber database.

 

Moreover, it might be possible with a lot of brain and sweat to scale better when having loads of app server and no relational database, but more than 99% of the install are way below that line, and on simple one shared server app+db, a relational db is going to perform better than something written above (arguably).

 

I agree that some nosql db features would be useful (eg. sync between servers with couch db would allow local storage that'd be awesome), but as for being a good idea to move up having to deal with coherency and relations and so on, we'll have to disagree.

 

X+

I would respectfully disagree.

If a database gets a corrupted trigger or FK, it is impossible for the average developer to figure out what is going on without taking a very deep dive in to how civi works and how the database is structured, and why. Almost every time I hire a new engineer, it takes them a long time to come to terms with how civi FKs/triggers are set up and why. PHP code immediately makes sense to new developers and is instantly testable.

Doing a cascading delete inside a transaction all from PHP is not really any more time consuming than having mysql take care of it. Espically considering most people are on shared hosts with way over-stuffed database servers, the "dumber" the query can be written, the quicker the database can deal with it. Easy queries should not require a join buffer or creation of temp tables -- which for all anyone knows, the db server has long exausted resources for. Whenever I have tried to run Civi on a shared host, the database has always been the limiting factor, and in every single case, simplifying queries would have shaved seconds off of page load.

I would argue that by moving the more complicated logic to code (where logic should be IMHO), you make things run smoother and more predictabally for the shared hoster and far more scalable and flexible for the higher end types. After all, isn't the ultimate goal of replacing a DB abstraction layer to make it a better product for everyone?

I would be thrilled to see Civi's underlying architecture updated using a modern ORM/MVC framework. I have quite a bit of experience with DBTNG. I love DBTNG, but it's not the right tool to fix Civi's shortcomings. Civi's complex entity relationships are a great candidate for some solid domain driven design, and an ORM framework would help facilitate that nicely. I understand this is a departure for SQL-centric developers, but in the long run I think it will lead to great benefits to the developers. Specifically it facilitates helping solve these problems:

* Separation of concerns -- far too much in Civi depends on far to many other objects. Tightly coupled complex applications break and regress easily and are difficult to change. Over time that entrenches problems; the longer the tight coupling exists, the more code is built to work around that coupling, the more work to re-engineer away from it.

 

It also makes testing relatively quite difficult generally -- unit tests are impossible, functional tests have to suffice, and they're much slower, disrupting a TDD workflow, and less likely to demonstrate actual issues.

 

* Rampant mixing of application layers: a sure sign of a complex app in trouble is business logic code interspersed with BOTH database queries AND presentation.

 

My biggest hestitation is that this move needs to be a sea-change to be successful, and there's a huge feature set that would need to get re-envisioned in order to make this work. I'm not sure I could recommend a good transition strategy that would allow for substantial re-use of existing code. This is a CiviCRM 5.0 kind of approach.