Wednesday, February 18, 2009

The Data Driven Conspiracy

In a previous post called An Irrational Love of the Relational I asked whether the relational database (RDB) is the best data store for an object-oriented application. Along the way I mentioned that RDBs tend to make coders adopt a data driven design approach instead of the more object-oriented domain driven design

That got me thinking: "Why has the RDB remained at the heart of our technology stack despite a steadily growing demand for a domain-centric design methodology?"

I think the answer to that comes in two parts.
  1. Historical
  2. Conspiratorial
1. Historical
Edgar Codd, an employee of IBM in the 1970's, was getting frustrated at the lack of a search feature that would allow him to quickly retrieve information from his IBM mainframe hard-drive array. So he worked out a cool way of structuring the data on the drive that would allow queries to be constructed and so allow for ad hoc data retrieval. He released a paper describing this work called A Relational Model of Data for Large Shared Data Banks [1]. 

In this paper he described, for the first time, all the features of an RDB that we find so familiar today. Data normalisation, columns , rows, tables, foreign keys etc. Codd's rules laid out clearly what constituted a true relational database and allowed manufacturers to create their own RDB management systems. 

This tech was very cool and so much better than anything that had gone before. It allowed for data-integrity, which helped the coders and it facilitated a whole new class of program that relied on queries that sliced and diced huge data-sets revealing subtle and surprising inter-connections. 

2. Conspiratorial
By the 1980's most young coders were convinced that relational data and the RDB was the answer to life, the universe and everything. New companies grew up around the RDB, Microsoft got in on the act with SqlServer and began, slowly, to dominate the market. Time passed and eventually those young turks grew older, their beards grew greyer and, like the RDB software they depended on, they became bloated. Then, just when people were forgetting that data could be stored in any other way, the first cracks began to appear in the RDB monolith. 

It started when object-orientation finally broke free of the university labs and escaped into the wider coding world. Programs written as a hierarchical collection objects (an object domain) had been around since the 1960's when a pair of Norwegian academics invented a language called Simula-67. But this technology did not really take off until the 1990's with the widespread adoption of C++.

As soon as business coders started to regularly cast complex business systems into objects they began to notice a fundamental problem. Hierarchical object domains of de-normalised state data do not look anything like the relational, normalised data used by RDBs.

But the companies who provided the RDB monolith software and the generation of coders who had invested their youth evangelising RDB tech could not accept the cognitive dissonance this observation created. It became imperative to the profits of the RDB suppliers and the reputations of the RDB evangelists that some way to ignore the problem was found.  Thus the data-driven design methodology was born.

Data-driven design demands that the relational database must be designed first. Only then can the database be translated into the radically different structure of the object domain and to perform this translation we must write not one but two additional layers of logic.
Stored Procedures
Codd's original stored procedure language was called SEQUEL, which IBM reworked and cheekily named the 'new' language SQL.

SQL is an excellent language for constructing data queries and hence mining data sets. SQL also has a range of management features - you  can Add, Edit and Delete the structured data on the disk and the RDB data itegrity rules will help prevent the data getting messed up. These management features are nice but they are not the core purpose of SQL, which remains the ability to perform complex data retrieval by exploiting the relationships created in the structured data. 

Yet it is precisely these data management features (add, delete etc), along with some typically trivial queries, that are used 99% by object-oriented coders in order to save and retrieve (serialize) their object state data. 

Data Access Layer
Communicating with sprocs from code is a complicated business and requires its own set of coding techniques, objects libraries and tools. These all come together in the data access layer (DAL). The DAL is the code that marshals object state to and from the Sprocs. 

This code is typically complex and fragile. As the application requirements evolve, or bugs are discovered and fixed, so the RDB tables, columns etc must be updated to reflect those design changes. This is the heart of data driven design. It forces a cascade of changes to the sprocs and then changes to the DAL code before finally allowing the class design to be updated.

Data-Driven Design = Change  DB -> Sprocs -> DAL -> Class

The Enlightenment
At last things are changing. Leading the way is Domain Driven Design. This is, at its heart, simply a statement of the obvious - that the best way to design an object hierarchy is to design the constituent objects. 

Domain Driven Design = Change  Class

The domain driven enlightenment has been born out of the fundamental realisation that the old ways of writing software just do not work. Replacing them is Agile, a coding methodology that demands that we refactor, evolve and simplify. These Agile concepts are the very antithesis of data driven design, with its multi-layered, stultifying, baroque complexity. 

Thus Agile demands that we throw out the DAL, the Sprocs and the RDB because they are not, and never were, an appropriate minimal solution to the problem of object state serialisation.

Object state serialisation does not imply a relational database

References
  1. Codd, E.F. (1970). "A Relational Model of Data for Large Shared Data Banks" In: Communications of the ACM 13 (6): 377–387.

3 comments:

  1. Nice article, and still today a few years after it is written most people as far as I can tell are heavily using RDBMS for OO software.

    One comment I would like to make to support the shift to NoSQL is that in theory (depending on the specific design), a relational database can store the same amount of data using less disk space due to normalization.

    "Back in the day" this was a major benefit as space cost a lot of money. Nowadays with space being so cheap and Software Developers still being expensive, I would argue that the time & money you save using say a Document Database over a Relational Database will pay off that extra space usage easily.

    ReplyDelete
  2. You are right about the space. Interestingly RavenDB, the Document db I am currently using, goes some way to fixing this with support for 'foreign key' like id properties that allows for loading sub-documents efficiently, so it uses more _memory_ to realise the denormalised data but less disk space to hold the (partially) normalised data.

    Anyway, as you point out, even the concept of 'disk space' shows that we are getting old. There is probably a generation of coders who don't even know the 'cloud' has disks :-)

    ReplyDelete