Cached Data Invalidation: First Attempt
Posted on 12 April 2010 | 4 responses
First, I’m going to admit to the lie. I’m not actually going to discuss my first attempt at figuring out cached data invalidation. Nor the second. Nor the third. Let’s just say they didn’t even come close to working and leave it at that.
My first workable attempt at building a map of which derived fields to invalidate for any given change was to start from the other end of the problem — working in from the derived fields, descending recursively through the formulas, I built SourceMaps — a tree of data structures that map out the sources for the derived field, and their purposes.
As discussed in the previous article, there are a lot of ways to invalidate a derived field. Changes to fields that are directly used in the calculation obviously should trigger invalidation, but other values may also interact. If data is pulled into a calculation by way of a join, the derived field becomes dependent on the join fields themselves, as any change in those fields can move data into or out of scope with respect to the calculation. Additionally, the derived field is also dependent on any fields used to restrict those joins (the “where clause”, in SQL parlance), as those restriction sources will help narrow the scope of an invalidation.
As it turns out, these SourceMaps were actually very easy to build, as the language Model is already structured to provide the information. It was simply a matter of building a new pass to walk the code for the derived field, and pull data from variables and relational operators. The next step was then to invert the SourceMaps — to walk the SourceMap and carefully turn it “inside-out”, keeping all dependencies intact, with the goal of building a TargetMap for each source field — something that would point the way to the target from the sources.
And that’s where things started to get hairy.
For my first attempt at this, I decided that each SourceMap would transit intermediate derived fields, and go always to fundamental sources — fields that were user-supplied data (not other derived fields). This seemed a natural choice, as only such fields would ever be directly updated by the user, and so trigger invalidation. The problem was that this immediately created an ordering problem. If a derived field (directly or indirectly) referenced another derived field, recalculating it could potentially pull stale data from that referenced field, if it was also invalidated by the same update (possibly due to the same source field, possibly due to another field in the same update). It’s a mismatch problem — the SourceMaps were dealing in only fundamental sources, while the derived field calculations on the other end didn’t have to be so pure.
This sent me trying to figure out a way to build ordering data into the constructed TargetMaps, but I now conclude that that was a mistake. And so another couple of weeks disappeared from my life. Yay.
An obvious alternative is to stop the SourceMap build at the first line of referenced fields — even if they are derived (instead of transiting through to fundamental sources). The problem is that any such implementation would depend on intermediate fields being cached. If they were, then we could run an invalidation operation on update of the cache, and so propagate changes in the correct order, from source to intermediates to targets. But caching decisions are not made at this point in the compiler run, and the system is free to determine that any given field should not be cached at all. And so we could end up with (permanently) stale data.
The solution I’m working on now is a hybrid one. When the Mapper encounters an intermediate field, it will build a special node to hold it — one that can act both as a terminus of the SourceMap, and as a bridge to the next layer of SourceMap. Then, when inverting the SourceMap into the TargetMap, we will use the whole thing, but those intermediate nodes will be converted to pruning points. After caching decisions are made, we can then run through all TargetMaps and ask them to prune off any parts of the map that are rendered unnecessary by the presence of a cached intermediate field. If cached, we will let the invalidation wave cascade from source to intermediates to target; if not, the larger TargetMap will be used, and the wave will cascade directly from source to target.
At the moment (I only came up with this this afternoon, over a beer) I think this will work.
Still To Do
The process of inverting the SourceMap into a TargetMap is not quite as trivial as I’ve implied in the discussion above. It’s not even remotely trivial, actually, and I’m still working out how to do it. And, yes, I still have that nagging suspicion that I’m missing something important, in terms of constraints on the SourceMap (and, therefore, the TargetMap).
Will know (and post) more soon.
4 responses to Cached Data Invalidation: First Attempt
Um, Chris . . . WTF?
Oh, be quiet Chris. We already know all of this is beyond you.
Oh, don’t be silly, Chris.
You can do this!
Well, I can — but he can’t.