Published
2009-12-09 10:36
As we get closer and closer to the release of CiviCRM 3.1, we have begun thinking about what to schedule for CiviCRM 3.2. Since one of the features we want to add is the ability to undelete certain CiviCRM entities, we want to discuss with the community the initial approach to undelete that we came up with – along with a sketch of logging functionality that we are considering for a future release.
General Remarks
To make both of these features successful, we need something that’s both generic (i.e., works in the same way – or, at least, similarly – for all or most of the CiviCRM tables) and transparent from the perspective of the existing codebase. The experience with multilingual support done in that manner convinced us that it’s a sane approach: move as much as possible to the database layer, sprinkle with views and triggers where needed, add a bit of logic to CRM_Core_DAO, and keep the existing codebase blissfully unaware that there’s hidden functionality beneath (with the clueful codebase parts able to see through the smoke and mirrors, of course).
Fortunately, both triggers and views have been supported since MySQL 5.0; unfortunately, creating triggers in 5.0 requires the SUPER privilege (5.1 introduced a separate privilege for this). Still, we can declare that the undelete/logging functionality requires either MySQL 5.0 with the SUPER privilege or MySQL 5.1 (which went stable over a year ago) and simply not offer undelete/logging where that’s not possible, automagically falling back to the current schema.
Undelete
The simplest way to introduce undelete is to add a boolean is_deleted column to the relevant tables, add a trigger that will catch DELETE calls and turn them into UPDATE calls that set the is_deleted column for the affected rows to true, and then either (a) fix DAO-originating and CRM_Core_DAO’s SELECT and UPDATE queries to not return/alter is_deleted=true rows (unless the code in question explicitly says it knows about the is_deleted column and doesn’t need any hand-holding) or (b) for every affected table, create a view that does not have the is_deleted column and exposes only is_deleted=false rows. (The (b) solution is how we handle the multilingual stuff.)
The benefit of (a) is that we keep current table names for ‘real’ data and only add a column to them (plus some trigger and DAO logic).
The benefit of (b) – assuming that we move all the ‘real’ data to new table names (e.g. civicrm_contact_all) and expose the is_deleted=false rows in views under the ‘old’ names (e.g. civicrm_contact) – is that hard-coded queries that do not pass through CRM_Core_DAO (like the ones in Joomla!’s XML view files) work transparently (because they work on the view, not the ‘real’ table).
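Option (b) might be sketched as follows – this is an illustration only, with the column list abbreviated and the table names assumed as described above:

```sql
-- 'Real' data lives in civicrm_contact_all, which gains the new flag:
ALTER TABLE civicrm_contact_all
  ADD COLUMN is_deleted TINYINT(1) NOT NULL DEFAULT 0;

-- The old name becomes a view exposing only non-deleted rows
-- (in practice the SELECT would list every column except is_deleted):
CREATE VIEW civicrm_contact AS
  SELECT id, display_name
  FROM civicrm_contact_all
  WHERE is_deleted = 0;
```

Hard-coded queries against civicrm_contact would then keep working unchanged, since they hit the view rather than the underlying table.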
Our vote for now would be to go with (a), possibly fixing the few hard-coded queries that do not pass through CRM_Core_DAO by making them skip is_deleted=true rows (if the is_deleted column is present) – but we’re open to suggestions.
With either scenario, the only thing left to decide is how to handle the places where we depend on ON DELETE cascading; the cascading should either fire the relevant DELETE triggers in the cascaded tables or, if it does not, we should be able to take care of them with a generic ON DELETE trigger template that would generate the triggers en masse based on the ON DELETE rules.
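Such a generated cascade trigger might look like the following sketch, assuming (hypothetically) that civicrm_email also carried an is_deleted column and should be soft-deleted along with its contact:

```sql
-- Generated from the ON DELETE CASCADE rule on civicrm_email.contact_id;
-- soft-deletes a contact's emails when the contact itself is soft-deleted:
CREATE TRIGGER civicrm_contact_soft_cascade
AFTER UPDATE ON civicrm_contact
FOR EACH ROW
  UPDATE civicrm_email
  SET is_deleted = 1
  WHERE contact_id = NEW.id
    AND NEW.is_deleted = 1
    AND OLD.is_deleted = 0;
```

A single-statement trigger body like this needs no DELIMITER juggling, which keeps mass-generation of such triggers simple.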
Logging
Logging/change tracking is not currently slated for CiviCRM 3.2, and we have thought about it mainly in the context of how it would interact with undelete – on one hand, whether an implemented logging solution wouldn’t simply provide undelete out of the box; on the other, whether implementing undelete now won’t complicate implementing logging later.
The executive summary is that (a) logging could provide undelete, but the simplest approach to logging wouldn’t provide an undelete as simple as the above is_deleted solution, and (b) implementing undelete in the way described above should not alter how logging is done in the slightest (the is_deleted column would simply be another column, with its changes logged over time).
Now, onto logging/change tracking. Once again we should at least consider doing it transparently (from the point of view of the current codebase).
The solution that is the most appealing to us is to leave the ‘current’ tables the way they are – by which we mean they would keep reflecting the current state of CiviCRM, optionally with the undelete functionality – and do all the logging in separate tables.
These log tables (e.g. civicrm_contact_log) would have the same columns as the ‘original’ CiviCRM tables they track, plus two new columns, log_time and log_action, and the primary keys altered from (id) to (id, log_time). All the work would be done with triggers on the ‘original’ tables catching INSERT, UPDATE and DELETE operations; these triggers would simply insert into the log table the new state of the original table’s row, adding on the fly the (current) log_time and setting the log_action column to either ‘I’, ‘U’ or ‘D’.
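A minimal sketch of such a log table and one of its triggers might look like this – the column list is abbreviated and the names are illustrative only:

```sql
-- Hypothetical log table shadowing civicrm_contact:
CREATE TABLE civicrm_contact_log (
  id           INT UNSIGNED NOT NULL,      -- same as civicrm_contact.id
  display_name VARCHAR(128),               -- ...plus all other tracked columns...
  log_time     DATETIME NOT NULL,
  log_action   ENUM('I', 'U', 'D') NOT NULL,
  PRIMARY KEY (id, log_time)
);

-- On update, record the row's new state (AFTER INSERT would be analogous,
-- while AFTER DELETE would insert the OLD.* values with log_action = 'D'):
CREATE TRIGGER civicrm_contact_log_au
AFTER UPDATE ON civicrm_contact
FOR EACH ROW
  INSERT INTO civicrm_contact_log (id, display_name, log_time, log_action)
  VALUES (NEW.id, NEW.display_name, NOW(), 'U');
```

The trigger bodies are mechanical enough that they could be generated from the schema for every tracked table.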
The pros of this approach are as follows:
- the original tables are untouched, and only their triggers know there’s logging going on;
- even if there are many updates, the original tables do not grow, as they only carry the most recent state – what grows are the log tables, which can be purged as needed with simple DELETE FROM table WHERE log_time < X queries;
- with the logs in separate tables, there’s no need for any DAO/CRM_Core_DAO voodoo to hide logging-related columns – and any code aware of logging will simply query the log tables at will; likewise, primary and foreign keys in the ‘current’ tables would simply work the same as they do now, and foreign keys in the log tables will also keep working (while the primary keys would simply extend to require only id+log_time uniqueness).
The cons of this approach are as follows:
- it’s not trivial to track *who* made the changes (which is something we definitely want to do), so we might want to add an updated_by column (referring to the civicrm_contact.id matching the current user) to all the ‘current’ tables that we want to log; once this contact id is handled properly on the PHP side, the log tables would simply inherit the column and track its changes over time without any additional logic;
- because MySQL’s DATETIME and TIMESTAMP column types have a resolution of one second, we must either assume that a single row will never be altered twice in the same second (an ugly assumption, but in practice not an insane one) or use floats for time tracking and pass the microseconds all the way from PHP;
- one can’t trivially aggregate trans-table changes over time; to do this, one needs to write some SQL logic that would cluefully scan the log tables.
Comments
If I understand correctly, the triggers approach for undelete will require SUPER privileges and therefore not work in a shared hosting environment. I would strongly suggest you consider that as a criterion. Civi is targeting non-profits who often need to start (or stay) with shared hosting. Of course you might disable that feature for shared environments (reasonable) but if there is a way to avoid it, you should consider that.
Perhaps a naive question, but if logging might use mirrored (relevant) tables, why not undelete also? Effectively, deletes would be "moves" to another database. Definitely not performance friendly, but you could keep a separate database for logging, undelete or other "future" meta information about each transaction .... just a wild thought.
Finally, whatever the "solution", please consider that if civi maintains its growth rate, it will likely be a much more ubiquitous tool in the near future. While views might make legacy code more manageable, please make sure you build a solid architecture for the future.
Great thinking and excellent post!
Shawn
While I agree that most current shared hosting solutions offer non-SUPER MySQL 5.0, I do think that this is temporary; our stats show that 16% of CiviCRM installs are already using MySQL 5.1, and this percentage has been growing steadily in the past year. Doing this kind of thing on the PHP side (rather than with triggers) would be IMHO both very error-prone and inefficient, not to mention not easily scalable across tables.
As for the undelete vs. logging split – I agree that logging could provide undelete, but I’d rather not couple them too tightly unless it’s either necessary or very beneficial. First, we need undelete sooner than logging (and it’s easier to implement); second, they serve different purposes and have different target audiences – undelete should be usable by anyone with edit/delete permissions, while logging/revisioning should be tailored more towards admins (though perhaps usable as a historical view by ‘regular’ users with a separate permission for this).
I agree that at some point we might want to offer the ‘bringing back’ of any past state of a contact, but I doubt it’s trivial (and so I doubt this is doable for CiviCRM 3.2); also, if we do end up doing that and it turns out to cover undelete without cluttering the UI/workflows, we can easily retire undelete at that time.
Finally, I agree we should think deep about our designs, but this is also the reason why I much rather see this kind of functionality as decoupled from other things as possible (and transparent to these parts of CiviCRM that shouldn’t be bothered whether there’s any undelete and/or logging functionality at all).
I would just reiterate that civi is targeting lower cost infrastructure deployments, and shared hosting has been an important vehicle for some organizations. Frankly, finding a good, inexpensive provider with innodb has been challenging enough. At some point the shared hosting companies might catch up, but they are traditionally very slow (years?). It’s also a question of existing environment and potentially changing vendors. Please consider this in evaluating a solution that depends on superuser SQL privileges.
We already use this approach for the multilingual implementation. Both undelete and logging are fairly complex, and using advanced mysql features makes things a lot easier
The hosting / VPS / cloud provider market is growing at a nice rate and finding one that meets the requirements should not be too hard. We do think that spending more than cheap shared hosting is definitely worth it for a civi install. Please continue this discussion on the forums :)
lobo
Technically (and from our point of view) we don’t require the SUPER privilege, we just want to use triggers (as implementing undelete/logging otherwise seems insanely complicated in comparison); the problem is that MySQL 5.0 does not have a separate CREATE TRIGGER privilege. This is fixed in MySQL 5.1, so this functionality (as the multilingual one does now) will require either MySQL 5.0 + SUPER privilege *or* MySQL 5.1.
As mentioned above, we see a steady rise of MySQL 5.1-based CiviCRM installs, and I think the majority of shared hosting vendors will catch up over time. (Also, it’s not like CiviCRM 3.x will *require* this privilege; if we can’t use triggers, the undelete/logging will be simply turned off.)
If every logged object has a start_date, wouldn't that enable you to find the state of a contact at a particular point in time by:
(a) determining if the object's start_date is before the relevant point in time - if yes, the current values are the right ones for that point in time; and
(b) if no, looking through the log table for that object where the start_date is before the relevant point in time and the log_date is after - that record will have the right values for that point in time.
There’s no need for a separate start_date (and I can’t envision how you’d implement it, given that (a) you don’t want to refer to the log table’s contents in the trigger and (b) the start of the log table can be truncated at will to save space); all this can be done based solely on the log_time and the log_action columns. What I meant is that it’s not *trivial* to fetch cross-table dependencies, because for every joined table you need to analyse both log_time and log_action. Still, this is preferable to storing ‘validity window’ start+end dates for rows in the main database.
I support option (a) for the undelete flag, with one modification that will be a step towards full logging support without adding hurdles getting the basics in to 3.2
Proposal:
Instead of having an is_deleted column, use an integer column called "setID". SetID stores the following values:
- If record is added, setID defaults to 0
- if a record is updated, create a copy first and assign setID to the primary key of the original record. Then update the original row
- If record is deleted assign setID = -1
Example (omitting the modifiedBy/timestamp columns for clarity):

id  name   setID
1   David   -1
2   dave     1
3   david    1
4   David    1
This indicates that a record went through four states. When it was first added the name was "dave", then it was changed to "david", then to "David", then it was deleted.
As with the is_deleted flag, returning the HEAD version is simple: WHERE setID=0. If the record is deleted it will not show up.
If we require to undelete, simply look for setID=-1 and set it back to 0.
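In SQL terms, the two operations above might be sketched as follows (table and column names are hypothetical, following the example):

```sql
-- Fetch only the HEAD revision of each record:
SELECT * FROM civicrm_contact WHERE setID = 0;

-- Undelete record 1 by restoring its HEAD marker:
UPDATE civicrm_contact
SET setID = 0
WHERE id = 1 AND setID = -1;
```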
Notes:
- a little counterintuitive is that the HEAD row has an id that is always less than all of its revisions. But that's not too hard to get used to, especially since this only concerns logging.
- this scheme assumes that all objects requiring logging have integer primary keys. If not, a CHAR key could be used. VARCHAR is no good for performance reasons.
- Timestamp has to have higher resolution (per millisecond or better) or we'll get into trouble running scripts that may update the same records at the same time. In mysql a long is the only option.
If time is of the essence for 3.2, we don't need to concern ourselves with storing the log row. What we do accomplish is having the right column type, and introducing the idea of a HEAD revision in all of our views/queries and the scheme will be ready for logging.
A further enhancement is to add a second column - "revision" - which is akin to revisions in SVN but requires more application logic - i.e. if I go through and tag several contacts and do a bulk update, they will all get their setIDs updated, and will also get a common revision number. This allows rollback - that's a much bigger project for the application layer.
Regarding multiple tables:
- I lean towards not creating separate log tables - harder to maintain changes, more tables to manage, harder to build queries that span both HEAD and log data.
- if performance is of the essence logging can be turned off entirely or kept to a certain period, as described in the original post. Indexing is obviously key.
- if someone needs logging in perpetuity and the data volume is huge, they will run into similar issues of manageability with the log tables, i.e. they will be useless unless properly optimized.
- if there is a use-case for large installations with perpetual logging, it would be easy to extend the "truncate after X days" function with an option to move to archive.
- And there is always log replication.
All that to say that single table seems easier to manage, and there are workarounds for large-scale installations.
Other 3.x features:
- switches for non-logged operations (for certain tasks such as bulk import/update)
- roll-back functionality (requires "revision")
- audit log interface. Needs to be designed to be useful - on a table/row basis changes are easy to show for a db admin, however providing a list of only the changes in human readable form is not that easy. Alternative is to populate civi's existing edit log feature with additional information.
Thanks a lot for your comments! Let me address the most important ones:
I like your setID model, but I see two main drawbacks – first, the table bloat that dalin mentions; second, additional logic required to get an archival state of cross-table relations.
I thought through the idea of extending the log tables’ primary key with log_time, and at Dave Greenberg’s suggestion I think the sanest way is to add an auto-increment revision column to the log tables, make the primary key id+revision, and keep log_time useful only for fetching relation state at a given point in time.
With the separate log tables, any operations on non-current state (which I see as quite a separate kind of operation from whatever CiviCRM can do now) can be done on that data, and with the id+revision primary keys and the log_time column we end up side-stepping MySQL’s DATETIME’s low resolution while still being quite trivially able to answer the question of ‘what was contact number 102’s last name and their marital status (which is custom data) on July 27th?’ – all you’d need to do would be to query the relevant custom data value related to a contact_id of 102, fetching the row with the highest revision earlier than July 28th.
(I agree this is doable with your model, by simply doing round-trips through the current – or last-before-deletion – ids, but it’s not as simple.)
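Under the id+revision scheme sketched above, such a point-in-time lookup might be expressed as follows (the custom-data log table and its columns are hypothetical):

```sql
-- Marital status of contact 102 as of the end of July 27th:
SELECT marital_status, log_action
FROM civicrm_value_demographics_log  -- hypothetical custom-data log table
WHERE contact_id = 102
  AND log_time < '2009-07-28'
ORDER BY revision DESC
LIMIT 1;
-- a returned log_action of 'D' would mean the value was already deleted then
```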
I also think that dalin might be right that the separate log tables can be optimised to be insert-only (I think I originally got that notion from Rob Thorne; IIRC he pondered such solutions for archiving CiviMail events). There are also some musings on logging SELECTs (to be able to say who accessed given data), which can’t (and most probably shouldn’t) be done via triggers, but rather need to be derived from query logs. I have a gut feeling that if one wants to log SELECTs, the table bloat will be huge and separate log tables (with INSERT-only optimisations) might be very beneficial.
+1 for Triggers and Views. Shared hosting will catch up soon enough. They are not using it now because while MySQL 5.1 is labelled as stable it is not actually production ready. There are performance issues that are getting in the way of any large deployments. Though it's close.
The challenge with (a) is that rewriting query strings is a world of hurt. In D7, now that we have a really good database API using PDO, db_rewrite_sql() and hook_db_rewrite_sql() are gone (hurrah).
I can't say that I'm a big fan of Stantale's proposal. Having an extra record in the table each time a record is edited makes for a big table for no reason. This would be really bad for high-performance sites for two reasons. First, this means an update query and an insert query for each edit. I think with a trigger system writing to an additional table this hit would be lessened. Still not really ideal. Something that works with MySQL binary logs would be better, but that would exclude shared hosting altogether. Secondly, simple SELECT queries to a table with 10 million rows are much slower than the same query to .5 million rows. You also need more memory to avoid temp tables swapping to disk.

Stantale objects to multiple tables on the grounds that a query spanning both the main table and the log table would be complex. I can't offhand think of a use case for this. Perhaps if you want to do some reporting about how things were a year ago vs. how they are now, or the like. But I think the frequency of such a use case would be rare enough that the general performance issues out-rank it. What we're talking about here is essentially table sharding, and I think that even a small installation will find the performance degradation of having an order of magnitude more rows in a table to be unacceptable. Keep in mind that we wouldn't need a duplicate for *every* table; the contact tables and custom field tables would probably suffice.
A one second granularity is only good enough if the system can gracefully handle cases where multiple changes happen within the same second because it will happen often enough.
I'd recommend that the initial release only focus on the back-end and allow the ability to turn the feature off. Once the back-end is in production for a release cycle a UI could be constructed to show revisions and perhaps allow rollbacks.
Hi Dalin,
I completely agree with you that performance will suffer somewhat, and partitioning would be a better alternative in that respect, particularly if we are only tracking a few tables. Multiple writes to the same table, as you mentioned, are the major drawback for write-intensive sites - the amount of data is less important in my mind, but it is obviously also a consideration.
However, I would suggest that what is and isn't logged should be left up to admin control, and my assumption was that the system would be able to track all changes to all tables (activities/groups/cases/events...). For example, tracking group membership over time could be an interesting feature for many organizations.
The perfect solution will always be case dependent. My assumption - which may be incorrect - is that large sites with huge amounts of records are probably less likely to change frequently percentage-wise (e.g. public donation sites, event registration), whereas smaller environments with actively managed constituents/recordsets will have a lot more revisions.
I am definitely not against managing duplicate tables, it might just be overkill for most instances and the order of magnitude can generally be addressed quite well with indices.
This is not an easy decision and needs to be thought through carefully. The following are the features I would like to see:
- logging of all CiviCRM business objects
- ability to enable/disable what objects are logged (this impacts how triggers are designed)
- ability to enable/disable logging for user changes vs. automated updates (imports/crons).
Any which way this will be a great feature!
Thanks for your comments!
Another reason against this kind of ‘current’ table bloat is that our dedupe mechanism is based on db-side joins; I think making the civicrm_contact table a couple of times larger just because we want logging might mean the dedupe mechanism needs a rewrite, and I’d rather not tackle both at the same time. ;)
(This is one of the examples of why I’d rather have this kind of functionality be as orthogonal to the ‘core’ schema and functionality as possible – hence the separate log tables.)
One option would be to allow the log tables to use the ARCHIVE or even MyISAM storage engine. It would require benchmarking, but I think there would be a performance gain because the tables would only be inserted to and never updated.
I think these sound like terrific features to develop.
Regarding the logging functionality, I like the cleanliness of the approach, and I know at least one other open-source framework which uses a similar table design (Hibernate Envers). You may be able to find some interesting ideas by looking at their implementation.
Some more detailed comments regarding the "cons" of this approach:
1. I'm surprised that the identity of the current user is non-trivial. It seems that you could track it in a MySQL variable. See, e.g., this discussion. Specifically, as part of civicrm_initialize(), one might execute:
$dao->query(sprintf("set @current_user = %d", $session->get('userID')));
2. I agree that using a primary key of id+log_time (where log_time is measured in seconds) is problematic. However, I don't see the benefit of using id+log_time as the primary key. Why not use a surrogate key? What use-cases require that id+log_time be unique?
Another angle: Is there _anything_ that Civi does that requires sub-second accuracy? For example, suppose an end-user fills out a profile form at the same time that an admin edits the contact record. They both hit save at "2009-12-10 07:49:20" -- specifically, the user's request is processed at "2009-12-10 07:49:20.00206"; the admin's request, at "2009-12-10 07:49:20.00601". In this case, the log should show the concurrent revisions (because this implies a data integrity problem), and it should show which revision wins, but it probably doesn't matter if the window between revisions is 5ms, 100ms, or 603ms.
3. I don't specifically understand the last con.
From where I'm sitting, the con's don't seem very bad.
1. Thanks for pointing out the solution for tracking the current user’s contact id. I’m not certain it’ll be trivial, but you’re definitely right that a right combination of per-db-connection and per-CiviCRM-session updates to a MySQL variable might be quite enough to accomplish this.
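If the session-variable approach works out, a logging trigger could record that variable alongside each change – a sketch only, with the modified_id column and the abbreviated column list being assumptions:

```sql
-- Record who made the change by reading the per-connection variable
-- set from PHP at session initialization:
CREATE TRIGGER civicrm_contact_log_ai
AFTER INSERT ON civicrm_contact
FOR EACH ROW
  INSERT INTO civicrm_contact_log
    (id, display_name, modified_id, log_time, log_action)
  VALUES
    (NEW.id, NEW.display_name, @current_user, NOW(), 'I');
```

One caveat worth testing is connection pooling/persistent connections, where the variable would need to be reset for every request rather than once per connection.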
2. As I mentioned in replies below, I think rather than extending the primary key to id+log_time we should go with a separate, auto-increment revision column and extend the primary key to id+revision, with the log_time column being there only for time-based queries. An otherwise-agnostic revision integer would also take care of updates happening during the same second, of course.
3. The last con means that you can’t *trivially* ask for the state of the database at moment X; for every id-unique row you’ll need to check its last state with log_time < X (if it exists) and the first state with log_time > X (if it exists) and do some logic based on the action done at these points (whether it was an INSERT, an UPDATE or a DELETE). The other solution is to implement ‘windows of time validity’ for every row, which say that the values in the given row are valid between start_date and end_date – but I do believe such a solution (especially applied to our ‘main’, or ‘current’, database) would be overkill.
Hi,
Great ideas !
How does the current log (on contacts) work? Is it going to be replaced by this much-improved log?
As for the delete, I'd love to see that as an option to replace the cumbersome "are you sure you want to delete" with a nicer direct delete + a msg "contact has been deleted, would you like to undo?"
X+
I agree with Xavier that this should be part of the UI. This approach is becoming standard "best practice" from a usability point of view.
I've started drafting specs for this at http://wiki.civicrm.org/confluence/display/CRMDOC/Logging - edits and comments welcome.
I think the above URL should be http://wiki.civicrm.org/confluence/display/CRM/Contact+Info+Logging