Database applications can be made much simpler if you maintain a body of data that describes your tables. Every single programming task in a database application needs to know something about the tables it is working with, so every program in a framework and many programs in an application can benefit from a central store of information about the database.
Introducing the Data Dictionary
The term "data dictionary" is used by many, including myself, to denote a separate set of tables that describes the application tables. The Data Dictionary contains such information as column names, types, and sizes, but also descriptive information such as titles, captions, primary keys, foreign keys, and hints to the user interface about how to display the field.
A super-simple beginning Data Dictionary might look like this:
TABLE | COLUMN | TYPE | PREC* | SCALE | PK | AUTO** |DESCRIPTION ---------+---------+------+-------+-------+----+--------+------------------- students | student | int | - | - | Y | IDENT | Student ID students | firstnm | char | 20 | - | | | First Name students | lastnm | char | 20 | - | | | Last Name students | city | char | 20 | - | | | City students | gpa | num | 2 | 1 | | | Grade Point Avg * Precision: Needed for chars, varchars, and numerics ** automation: to be described more below
The First Question: Where to Put It
It might seem that the first question about a data dictionary would be "What do we put in it?" We will get to this question in a moment, but the most important question is usually, "Where do we put it?" By this I mean do you put into an XML file, or do you encode it into classes in program code? Or is it somehow directly entered into database tables? There are of course many opinions on this.
The first opinion is that you do not really need a data dictionary per se because you can get this information from the database server (if this is news to you, google "information_schema" and your favorite database). There are two major drawbacks when you depend on the server. First, you cannot build a database out of your data dictionary, because of course you don't have one until the database is built. The second drawback is much worse, which is that you cannot put any extended properties into the dictionary, and without those the dictionary is of little use.
Another method is to put the dictionary into classes, using the typical one-class-per-table approach and storing the data as properties of the class. There are multiple drawbacks to this approach:
- Data that is "trapped" in code is very difficult to deal with efficiently, so many operations with the dictionary will be much harder to code and will be slower.
- The dictionary is spread out into many files.
- Ironically, a good data dictionary allows you to generate most CRUD forms, and so you don't need those one-class-per-table files filling up your directory. It seems silly to make a class that contains data that makes the class unnecessary.
The option I prefer is a plaintext file, which can be generated by a GUI or typed in by hand. The only requirement for the file is that it be easy to type and read, and easy to parse by a program. These requirements are well met by two formats: YAML and JSON (XML is a bletcherous horror to work with manually, so it is disqualified before the race starts). Both YAML and JSON enjoy parsers and writers in nearly all popular languages, so if you create your data dictionary in one of those you have a file that is human readable and writable, machine readable and writable, useful in nearly every language on every platform, easily placed in source control, and very flexible in what it can represent.
First Basic Use: Building and Upgrading Databases
A data dictionary is a wonderful way to build and upgrade databases. It is a real thrill to write your first SQL generator that scans a table of column definitions, writes out a CREATE TABLE statement, and executes it. The next step in the evolution of this process is to have the program query the INFORMATION_SCHEMA of the database, then work out which columns have been added to your data dictionary that are not in the table, and then upgrade the table with an ALTER TABLE statement.
This basic approach can easily be extended to include indexes and keys.
Sidebar: Data Dictionary Versus Upgrade Scripts
Many programmers use upgrade scripts to alter their schemas. The idea is that programmer Sax Russell adds a feature. Along with his code he writes a script that makes the necessary alterations to the database. Then comes Ann Clayborne who does the same thing, followed by Hiroko Ai, and then Sax again. When a customer upgrades their system, they run all four scripts in order. This can lead to horrible upgrade experiences in cases where multiple scripts are upgrading a large table several times. A data dictionary is far superior because it can simply examine the tables as they are, examine the data dictionary, work out a diff, and execute the smallest set of commands to bring the database current. This approach does require of course that the data dictionary be in source control like any other application file.
Second Basic Use: HTML Code Generation
A rich data dictionary can provide you with everything you need to provide "free" CRUD screens for most of your tables. Now to be sure, there are always those tasks that require special screens for the users, but there is just no reason to sit down and code up a screen to managae a table of customer types, zip codes, or even a general ledger chart of accounts. Here is an example of an expanded data dictionary in YAML format that contains enough information to generate screens with zero application code:
table customer_types: # Use this to generate menus! module: sales # Use this for the menu and the page title description: Customer Types column customer_type: type_id: char column_precision: 10 # This is crucial for code generation description: Customer Type primary_key: "Y" column description: type_id: char column_precision: 40 description: Description
The use of most of those properties should be pretty obvious, but I would like to point out one particular clever trick you can do. The "primary_key" flag can be used to enable a column during insert mode (if it is a user-entered key), and then to gray out the column in edit mode. When you consider this basic example it starts to become clear that nearly all of the code in most CRUD screens can be reduced to a few routines that read the data dictionary and generate HTML.
I would like to repeat that I do not mean to say that every single CRUD form in an application will work this way. In my experience 9 out of 10 tables can use "free" generated forms, but about 1 in 10 are used so often by users that you end up making special forms with shortcuts and wizards to speed up their work. The dictionary can help you there if you use it to generate the inputs, but it is no use trying to expand the dictionary to cover every conceivable case, there is always one more that the next customer needs that would just plain go faster if you coded it up by hand.
Third Basic Use: Trigger Generation
Generating triggers is a very powerful thing you can do with a data dictionary. We saw last week that the most complete encapsulation of code and data occurs when a trigger is placed on a table. Imagine you had a data dictionary that looked like this:
table orderlines: column extended_price: type_id: numeric column_precision: 10 column_scale: 2 calculate: @price * @qty
That magic little "calculate" value can be used to generate a trigger and put it onto the table. The trigger code might look something like this (The SQL version is PostgreSQL):
CREATE OR REPLACE FUNCTION example RETURNS TRIGGER AS $$ BODY -- direct assignments are an error if new.extended_price is not null then raise error 'Direct assignment forbidden: extended_price'; else new.extended_price = new.price * new.qty; end if; END $$ SECURITY DEFINER LANGUAGE PLPGSQL;
It is not my purpose here to explain how to generate that trigger, but simply to suggest that it is a very doable thing. Because I am not tracing out the steps, I do need to explain that I slipped those "@" signs into the "calculate" value so that when the trigger is built the builder program could detect column names and put the appropriate "old." and "new." prefixes in front of them.
Fourth Basic Use: SQL Access
A database application is full of code that reads and writes to tables. All frameworks and even very modest websites end up with a handful of routines that handle the mundane tasks of sending inserts and updates to the tables.
The Data Dictionary is the perfect resource for these routines. It can be used to properly format data, put quotes where they belong, clip overlong values (or throw an error), prevent the user from changing a primary key, and many other things.
Fifth Basic Use: Documentation
If you have a "rich" data dictionary, one that contains lots of extended properties that describe everything about columns and tables, and if you build your database and generate your forms out of that dictionary, then the next and final natural step is to generate technical documentation out of it as well.
In a first pass, technical documentation is limited to simply displaying the columns that go into a table, which admittedly does not mean much even to technical users. But once you get past that basic task you can begin to layer on lists of parent and child tables (as links of course), descriptions of formulas, and so on.
In a similar vein, Tooltip descriptions go well in a data dictionary.
Conclusion: True Synchronization
The holy grail of database programming is synchronization of code and data. Sychronization means a lot more than simply making sure you delivered the right code and upgraded the tables. Complete synchronization means that the framework is intrinsically incapable of mistakenly accessing missing columns or failing to consider important columns. At the framework level, if you make use of a data dictionary as described above then your framework will be organically synchronized, which is to say that the synchronization is built into the code itself.