Cached Data Invalidation: An Introduction

Posted on 11 April 2010 | Comments Off

So, here we are — the first real post.  I was planning to just jump in and discuss my current design problem, but, in attempting to write it up, I’ve come to realize I don’t really understand it well enough to talk about it.  I have a fairly simple approach, at the moment, but I have a nagging suspicion that it’s wrong, and I can’t figure out how.  But I still want to get going on this blog, so instead I’ll just talk about some of the context, and hopefully post more when I have a better grasp on things.

My current problem relates to cached values for derived fields and sets. 

SchemaForm allows you to define derived (calculated) fields, much as you would in a spreadsheet or programming class.  Such calculations can use any function, any relational operator, and any in-scope variable.  They can produce any datatype — including relation types.  In fact, the only thing a derived field can’t do is reference itself (either directly or indirectly).  Derived fields are first-class elements in SchemaForm.  You can use them in other calculations (including other derived fields), query on them, and generally use them just like any other field.  The only real difference is that they are read-only (although you can get around that by using a calculated default value for an optional, stored field). 

SchemaForm also allows you to define derived sets — roughly equivalent to “views” in SQL parlance — but that can have keys and access patterns all their own.  Such derived sets are defined in terms of relational operations, and can be as simple or as complicated as SchemaForm’s relational math allows (and that’s pretty complicated).

In the original version of SchemaForm, all derived fields and sets were calculated on demand — the value was constructed in queries and included in generated procedural code.  This was simple to manage, but ultimately very expensive in terms of runtime cost — especially if the calculation included data from multiple tables (via joins).

The new SchemaForm aims to cache current values of derived fields and sets in appropriate places in the underlying schema, choosing instead to keep track of what needs to be invalidated and recreated when the source data is updated.  In this way, SchemaForm chooses to favour the cost of reading over the cost of writing, and takes over the (potentially huge) job of making sure this denormalized schema is always maintained in a valid state.

To complicate matters, we don’t want to have to recalculate an entire derived “column” of a table (or an entire table) any time we change a single source record.  Ideally, we want to identify not just the columns to be recalculated, but also a minimal range of rows within that column to recalculate.  If a change will only affect a derived field in one record, ideally, we should only update the derived field in that one record.  Of course, the minimal set may prove too complicated or costly to calculate, so SchemaForm takes a best-effort approach to that identification — as long as no invalid records are missed, there is no harm in updating extra records, if we have to.

Some Examples

1) A simple derived field might contain a record count of subordinate records.  Let’s say you have a one-to-many relationship between two classes, and you want easy access to the number of records on the many-side for each on the one-side.  You could add a derived field that joins the two sets together on the link field(s), and then counts the number of rows.  Such a field would need to be recalculated any time a record is added or removed from that subordinate set, and any time a subordinate record is updated to change its ownership — two master records will have to be updated, in that case: the one losing the record, and the one gaining it.  The count field would also need to be updated if the master record was updated to change its copy of the link field(s), as that would completely change the membership of the subordinate set.  Presumably, in such a case, the orphaned records will either end up in another record’s set, or will be deleted from the database.

2) Another simple derived field might summarize a set of data into a string.  Let’s say you had a books database, and for each book there could be one or more authors.  The books class might contain a derived field called “authors” that contains all author records related to the current book (drawn from the underlying authors data via a join).  A second derived field might then project out the author names from that set, convert them to a list, and the pass them through a conversion function that produces a natural-language version of the author list (ie. “John Stag” or “Jane Doe & John Stag” or “Jane Doe, John Stag, & James Fawn”).  Ultimately, this example devolves to much the same criteria as our first example, except that invalidation now propagates through two levels, and we must also trigger invalidation if the author name fields change.

3) A (much) more complicated example might be a derived set that scores some underlying data (say, a set of webfictionguide.com listings) by way of a complicated algorithm using data drawn from 4 or 5 sources (say, reviews, ratings, recommendations, all weighted by member karma, plus vote counts from topwebfiction.com, and additional weighting factors for recent mentions in articles or recently updated RSS feeds).  In such a case, the system would need to analyze how the underlying data is drawn together, and keep track of (join and restriction) conditions that limit the scope of a change in the underlying data.  Ultimately, the goal would be to produce a map of these changes so that we (ideally) update just one record in the derived set when a new review is posted, or a new rating entered, or a new RSS article arrives — all without any programmer planning or intervention.


Responses are closed for this post.

SchemaForm Development Blog © 2005-2010 Chris Poirier  ·  Header image © Doughnuts64 | Dreamstime.com
Powered by WordPress  ·  SubtleFlux Theme  ·  Site admin  ·  Log in  ·  Valid XHTML