Back on it
Posted on 2 August 2010 | Comments Off
Well, it’s been a number of months — an unfortunate side effect of a new job. I’ve finally started back in on TargetMap construction this weekend, and have made some progress. I think I’ve also figured out what I was missing before. Still a long way to go before I have a working solution, but I should be able to write up something new this week, once I get a little farther.
Status Update
Posted on 25 April 2010 | Comments Off
Well, I spent the weekend beating my head against inverting the SourceMaps into TargetMaps. Made some progress, but, mostly, ended up with a sore head. Currently trying to figure out exactly what the TargetMap needs to look like — I’m thinking the SourceMaps will need a little massaging, but mostly seem correct. The TargetMaps, though . . . so far, my first couple of attempts haven’t worked.
More soon.
Cached Data Invalidation: First Attempt
Posted on 12 April 2010 | 4 responses
First, I’m going to admit to the lie. I’m not actually going to discuss my first attempt at figuring out cached data invalidation. Nor the second. Nor the third. Let’s just say they didn’t even come close to working and leave it at that.
My first workable attempt at building a map of which derived fields to invalidate for any given change was to start from the other end of the problem — working in from the derived fields, descending recursively through the formulas, I built SourceMaps — a tree of data structures that map out the sources for the derived field, and their purposes.
As discussed in the previous article, there are a lot of ways to invalidate a derived field. Changes to fields that are directly used in the calculation obviously should trigger invalidation, but other values may also interact. If data is pulled into a calculation by way of a join, the derived field becomes dependent on the join fields themselves, as any change in those fields can move data into or out of scope with respect to the calculation. Additionally, the derived field is also dependent on any fields used to restrict those joins (the “where clause”, in SQL parlance), as those restriction sources will help narrow the scope of an invalidation.
As it turns out, these SourceMaps were actually very easy to build, as the language Model is already structured to provide the information. It was simply a matter of building a new pass to walk the code for the derived field, and pull data from variables and relational operators. The next step was then to invert the SourceMaps — to walk the SourceMap and carefully turn it “inside-out”, keeping all dependencies intact, with the goal of building a TargetMap for each source field — something that would point the way to the target from the sources.
And that’s where things started to get hairy.
For my first attempt at this, I decided that each SourceMap would transit intermediate derived fields, and go always to fundamental sources — fields that were user-supplied data (not other derived fields). This seemed a natural choice, as only such fields would ever be directly updated by the user, and so trigger invalidation. The problem was that this immediately created an ordering problem. If a derived field (directly or indirectly) referenced another derived field, recalculating it could potentially pull stale data from that referenced field, if it was also invalidated by the same update (possibly due to the same source field, possibly due to another field in the same update). It’s a mismatch problem — the SourceMaps were dealing in only fundamental sources, while the derived field calculations on the other end didn’t have to be so pure.
This sent me trying to figure out a way to build ordering data into the constructed TargetMaps, but I now conclude that that was a mistake. And so another couple of weeks disappeared from my life. Yay.
An obvious alternative is to stop the SourceMap build at the first line of referenced fields — even if they are derived (instead of transiting through to fundamental sources). The problem is that any such implementation would depend on intermediate fields being cached. If they were, then we could run an invalidation operation on update of the cache, and so propagate changes in the correct order, from source to intermediates to targets. But caching decisions are not made at this point in the compiler run, and the system is free to determine that any given field should not be cached at all. And so we could end up with (permanently) stale data.
The solution I’m working on now is a hybrid one. When the Mapper encounters an intermediate field, it will build a special node to hold it — one that can act both as a terminus of the SourceMap, and as a bridge to the next layer of SourceMap. Then, when inverting the SourceMap into the TargetMap, we will use the whole thing, but those intermediate nodes will be converted to pruning points. After caching decisions are made, we can then run through all TargetMaps and ask them to prune off any parts of the map that are rendered unnecessary by the presence of a cached intermediate field. If cached, we will let the invalidation wave cascade from source to intermediates to target; if not, the larger TargetMap will be used, and the wave will cascade directly from source to target.
At the moment (I only came up with this this afternoon, over a beer) I think this will work.
Still To Do
The process of inverting the SourceMap into a TargetMap is not quite as trivial as I’ve implied in the discussion above. It’s not even remotely trivial, actually, and I’m still working out how to do it. And, yes, I still have that nagging suspicion that I’m missing something important, in terms of constraints on the SourceMap (and, therefore, the TargetMap).
Will know (and post) more soon.
Cached Data Invalidation: An Introduction
Posted on 11 April 2010 | Comments Off
So, here we are — the first real post. I was planning to just jump in and discuss my current design problem, but, in attempting to write it up, I’ve come to realize I don’t really understand it well enough to talk about it. I have a fairly simple approach, at the moment, but I have a nagging suspicion that it’s wrong, and I can’t figure out how. But I still want to get going on this blog, so instead I’ll just talk about some of the context, and hopefully post more when I have a better grasp on things.
My current problem relates to cached values for derived fields and sets.
SchemaForm allows you to define derived (calculated) fields, much as you would in a spreadsheet or programming class. Such calculations can use any function, any relational operator, and any in-scope variable. They can produce any datatype — including relation types. In fact, the only thing a derived field can’t do is reference itself (either directly or indirectly). Derived fields are first-class elements in SchemaForm. You can use them in other calculations (including other derived fields), query on them, and generally use them just like any other field. The only real difference is that they are read-only (although you can get around that by using a calculated default value for an optional, stored field).
SchemaForm also allows you to define derived sets — roughly equivalent to “views” in SQL parlance — but that can have keys and access patterns all their own. Such derived sets are defined in terms of relational operations, and can be as simple or as complicated as SchemaForm’s relational math allows (and that’s pretty complicated).
In the original version of SchemaForm, all derived fields and sets were calculated on demand — the value was constructed in queries and included in generated procedural code. This was simple to manage, but ultimately very expensive in terms of runtime cost — especially if the calculation included data from multiple tables (via joins).
The new SchemaForm aims to cache current values of derived fields and sets in appropriate places in the underlying schema, choosing instead to keep track of what needs to be invalidated and recreated when the source data is updated. In this way, SchemaForm chooses to favour the cost of reading over the cost of writing, and takes over the (potentially huge) job of making sure this denormalized schema is always maintained in a valid state.
To complicate matters, we don’t want to have to recalculate an entire derived “column” of a table (or an entire table) any time we change a single source record. Ideally, we want to identify not just the columns to be recalculated, but also a minimal range of rows within that column to recalculate. If a change will only affect a derived field in one record, ideally, we should only update the derived field in that one record. Of course, the minimal set may prove too complicated or costly to calculate, so SchemaForm takes a best-effort approach to that identification — as long as no invalid records are missed, there is no harm in updating extra records, if we have to.
Some Examples
1) A simple derived field might contain a record count of subordinate records. Let’s say you have a one-to-many relationship between two classes, and you want easy access to the number of records on the many-side for each on the one-side. You could add a derived field that joins the two sets together on the link field(s), and then counts the number of rows. Such a field would need to be recalculated any time a record is added or removed from that subordinate set, and any time a subordinate record is updated to change its ownership — two master records will have to be updated, in that case: the one losing the record, and the one gaining it. The count field would also need to be updated if the master record was updated to change its copy of the link field(s), as that would completely change the membership of the subordinate set. Presumably, in such a case, the orphaned records will either end up in another record’s set, or will be deleted from the database.
2) Another simple derived field might summarize a set of data into a string. Let’s say you had a books database, and for each book there could be one or more authors. The books class might contain a derived field called “authors” that contains all author records related to the current book (drawn from the underlying authors data via a join). A second derived field might then project out the author names from that set, convert them to a list, and the pass them through a conversion function that produces a natural-language version of the author list (ie. “John Stag” or “Jane Doe & John Stag” or “Jane Doe, John Stag, & James Fawn”). Ultimately, this example devolves to much the same criteria as our first example, except that invalidation now propagates through two levels, and we must also trigger invalidation if the author name fields change.
3) A (much) more complicated example might be a derived set that scores some underlying data (say, a set of webfictionguide.com listings) by way of a complicated algorithm using data drawn from 4 or 5 sources (say, reviews, ratings, recommendations, all weighted by member karma, plus vote counts from topwebfiction.com, and additional weighting factors for recent mentions in articles or recently updated RSS feeds). In such a case, the system would need to analyze how the underlying data is drawn together, and keep track of (join and restriction) conditions that limit the scope of a change in the underlying data. Ultimately, the goal would be to produce a map of these changes so that we (ideally) update just one record in the derived set when a new review is posted, or a new rating entered, or a new RSS article arrives — all without any programmer planning or intervention.
The Plan
Posted on 28 February 2010 | Comments Off
Back in 2004/2005, I started working on a high-level database programming language that could do all the grunt work I normally do when turning a high-level design into something that actually works. That language was called SchemaForm.
I did get the language working, at the time, but the generated code did not perform to my expectations, and I eventually shelved the project. Well, now, armed with a much more practical approach to database systems than I had then, SchemaForm is back in development, and I hope to release a working (and performant) language in the coming months.
Eventually, this site will be the home for the language. Until then, I’ve decided I’ll blog about current design problems as I continue to develop the compiler. Might prove interesting to look back on it, later.
However, until I get a first post up, feel free to giggle at the project’s original website, courtesy of the Internet Archive’s Wayback Machine:
Schema Deformations
Posted on 10 August 2006 | Comments Off
I initially posted this article to my old blog in August of 2006, and then later on the blog for another project (which I have since shelved). Now that SchemaForm is back in development and has its own website, I figure it belongs here.
Please note that I’ve included this here mostly for its historical value — while the ideas have gone on to influence the new SchemaForm, the language has changed somewhat since this was written.
And I apologize in advance for the long-windedness.
Okay, bear with me here: I’ve got something I think is worthwhile to say, but I’m not sure how to pack it into a little space.
Actually, knowing me, I’m probably not even going to try.
I’ve always hated database systems. Not the idea of them, not the practicality of them — I’ve always liked those things — no, it’s the PITA involved in making them work.
The problem is normalization. Now, don’t get me wrong — normalization is great for programmer sanity and system maintainability. It is, frankly, the only way to design. But if you develop a normalized data model for your system, you are going to have to deal with two fundamental problems. 1) Your system will most likely run like molasses. Uphill. On a brisk November morning. And 2) You are probably going to have to spend a lot of time writing glue code in queries and representation classes to pull all that data back together into useful structures.
This is the joy and pain of normalization. You store any particular datum only once, so it stays minimal and consistent. But putting it back together into the assembled, summarized views we humans find so useful is really costly.
A couple of years ago, I spent a year trying to solve the problem. I had an impossibly large project I convinced myself I could pull off in no time and on no budget if I could just write a programming language that would do all that bookkeeping/reassembly for me.
Well, I was wrong about the project. (Enough said on that topic.) For the sake of this discussion, the language is the important thing. SchemaForm was it’s name. I wrote it in Ruby. I know, I know — what kind of crazy person tries to write a compiler in a scripting language. Well, my kind of crazy person.
Anyway, SchemaForm’s aim was to make it simple to declare and manipulate not just the data itself, but also the relationships between the data. Instead of declaring your “physical” data in one place (the tables) and your assemblies of data in another (views and procedural code), and your validations in another (field and table constraints, procedural code), SchemaForm let you declare them together, using a common language. The SchemaForm compiler then took those declarations and used them to build SQL and object-oriented library code to create the database and retrieve and manipulate its data.
Perhaps an example would be helpful. Here’s a simple problem: you want to store address information for your customers. Your staff work in English (they want to read “Quebec City”), but some of your customers speak and read primarily French (and expect their mail addressed “Ville de Québec”). A normalized design might have a City table which references a Strings table, which has a one-to-many relationship with a Translations table. To get the French name of a particular city, you join from City to String on CityNameID, then from String to Translations on StringID and restrict on the French language code. Simple, right? Here’s what it might look like in SchemaForm (with a few conveniences for this particular application):
(class City Cities
(oid CityID )
(field NameStringID String)
(derived EnglishName (dereference $NameStringID EnglishText))
(derived FrenchName (dereference $NameStringID FrenchText))
(key NameStringID)
(key EnglishName)
)
(class String
(oid StringID)
(derived Description (dereference (lookup Translation Language $StringID "en-ca") Data ""))
(derived EnglishText (dereference (lookup Translation Language $StringID "en-ca") Data ""))
(derived FrenchText (dereference (lookup Translation Language $StringID "fr-ca") Data ""))
)
(class Translation
(oid TranslationID )
(owner StringID String )
(field LanguageID Language )
(field Data text:4000)
(derived Language (dereference $LanguageID Code ))
(derived LanguageDescription (dereference $LanguageID Description))
(key StringID LanguageID (name LanguageID))
(key StringID Language (name Language))
)
Okay, no that isn’t tiny. But, what it does is give SchemaForm enough information to do most of the assembly for you. For instance, each City record has a calculated field called EnglishName, which is the English name of the city. If you happen to be holding a City object, you access its EnglishName field, and you’ve got the value. If you are holding a set of Cities, you can query on it. In fact, if you look carefully, City actually gets its EnglishName from the calculated EnglishText field of String, which is defined as a dereference of a lookup on a Translations table candidate key (StringID and Language). And if you look even closer, you’ll find that even Language is a derived field — the stored value in Translations is a LanguageID, which references a code table (not shown). Whew! Imagine constructing all those joins by hand!
SchemaForm does a lot more than what is shown here, actually. For instance, it provides field- and type-level validations and filters; it provides a relational algebra for building operations and retrievers; it integrates object-oriented ideas of type extensions into the the schema definition language (for instance. your Invoices reference a Customer, which may be either a Person or a Company); and it provides a definition language for rendering query results into hierarchical, summarized structures (you might call them XML-based reports).
So, why isn’t every one using SchemaForm, I hear you asking. Well, because the language sucks. Okay, maybe that’s unfair. The language doesn’t suck. But the implementation does. As it turns out, a lot of that great stuff SchemaForm does for you doesn’t translate well into SQL. Performant SQL, in any event. Slap as may indices as you want on that beautifully normalized database, and it will still perform terribly, because all that joining is expensive. And my experience with SQL Server, at least, is that the query optimizer actually makes the problem about an order of magnitude worse.
Oh, and from a development standpoint, placing a compiler run between your thought process and the running system turns out to be less than optimal. Surprise, surprise.
But it wasn’t all a waste, methinks, for writing SchemaForm got me thinking about something that I have come to believe is really worthwhile.
Denormalization.
Okay, okay, maybe I should qualify that a bit.
Denormalization involves caching known and/or stored data in more than one place; generally a place were it doesn’t naturally belong. It is done because it can often (but by no means always) improve the performance of your database operations, by keeping data in the place where it will be used. The cached data might be a verbatim copy of a field from another table, in order to save a join operation in the average case (or to be used in a multi-column index). It might be a summary field, like a count of the records on the other end of a one-to-many relationship. Or it might be the result of a scalar calculation that is too expensive to do on a regular basis (a conversion of Markdown text to HTML, for instance).
But there is a cost to all this performance enhancement. You end up with multiple whole and part copies of data elements, and this can quickly become a nightmare for the poor programmer who has to keep all of those copies in sync. Imagine, for instance, if you decided to denormalize the simple three-class schema I quoted above. Imagine if you wanted to cache every one of those derived fields directly in the underlying table, so that you wouldn’t have to take the join hit every time you used them in a query, or so that you could use those fields in a SQL index. Imagine the bookkeeping work you’d have to do, the dozens of places you’d have to carefully insert the extra code to keep all those copies in sync.
Okay, now imagine this. It’s Monday morning. Your boss comes to you and tells you that e has a new requirement for your beautifully normalized, 2.5 million record inventory system. E wants to be able to search the database for Customers over the age of 35 whose parents are both Customers, who drive a particular make of car, and who have bought over some dollar threshold of some inventory part, or of any part that contains that part, or of any part that has the same manufacturer as that part. Those were variables, not constants you heard in there, BTW. And E wants the system to supply those answers instantly, with data that is always completely up-to-date. And E needs it working yesterday (isn’t that always the way?) and it can’t have any impact on any other part of the system (which, being beautifully normalized, has been working flawlessly all this time).
Ready? Set? Go!
What?
Okay, actually, I’m sure you can get the thing functional in short order. All the data is in there. You already have relationship information between your Customer records. You have optional data on Customer birthdate. You can get a good approximation of the make of car the Customer drives based on the invoicing information (you don’t generally buy Honda parts for your Ferrari, after all). Your parts table shows relationships between the parts, and a transitive closure will discover all of the equivalents, and from there you can pull in all the unrelated parts with the same manufacturer. With SchemaForm’s relational algebra and dereference function, you can probably whip up something that will work.
There’s just one small problem. At this point, each query your boss runs on your still-beautifully normalized database is probably going to run for a couple of days. You could build a reporting database with the data, but nightly runs aren’t going to supply that “up-to-date” thing your boss was so adamant about.
So, it’s intractable, right?
But what if you could go into the schema, add your calculated fields, add the necessary relational functions (views), and then mark them all as cacheable? You don’t do the denormalization — you just suggest to the database system that perhaps it should do it for you! You still use the system like it’s beautifully normalized, but under the covers, the database runtime figures out how to propagate data changes to all the dependent locations within the system, and does so. Because, after all, it has the functional description of how those calculations are to be done. It is, in fact, the only entity that is truly qualified to determine all those dependencies, and to propagate the changes in the right order. You use that calculated field or relational function just like normal, except that now it runs like stink! Now, all of sudden, you are back in the hero seat, because your system did not require any changes to data entry screens or model classes, but, nonetheless, your boss gets eir instant results to eir insane questions.
It’s frickin’ Utopia, man!
Okay, maybe I’m getting too excited.
So why am I writing about this today, after not writing about it for most of two years? Well, I finally got around to looking at Ruby on Rails today. I looked at it a few years back, when it first came out, but I wasn’t really doing web development at the time, so I let it slide. Anyway, today I was feeling frustrated and exhausted with work stuff, and, being the geek that I am, I thought playing with a new toy — especially a Ruby toy — would cheer me up. It sort of worked. Turns out I’ve been building Ruby on Rails for PHP for the last three months, only RoR has done it better, and I’ve been avoiding Ruby for no reason. Yes, it’s depressing for both reasons.
In particular, their schema migration stuff works better than my PHP version, because they actually update the schema with the changes you make in the migrations. Nice job, guys!
Anyway, suddenly I’m thinking that maybe the schema deformation layer can be written without the compiler pass, and without an actual new type of database server (the path I have been considering). Maybe — I’m not sure yet. This idea isn’t even half-baked yet. But maybe, I could build it for Rails. Not the full version, probably — I’ve already learned my lesson about Ruby’s compilation/graph-analysis prowess (ie. it doesn’t have any) — but a reasonable hand-drawn facsimile thereof might be possible. And it would certainly make my database programming life more enjoyable.
So, there you have it. Yes, I rambled, and I probably rushed over some of the important stuff, too. But I hope you aren’t feeling I’ve wasted your time with this. If the idea interests you, drop me a line.
Thanks, Chris.
