Friday, February 27, 2009

Generic Lists of Anonymous Type

[cross-posted to StormId blog]

Anonymous types can be very useful when you need a few transient classes for use in the middle of a process.

Of course you could just write a class in the usual way but this can quickly clutter up your domain with class definitions that have little meaning beyond the scope of their transient use as part of another process.

For example, I often use anonymous types when I am generating reports from my domain. The snippet below shows me using an anonymous type to store data values that I have collected from my domain.

for (var k = 0; k < optionCount; k++)
{
    var option = options[k];
    var optionTotal = results[option.Id];
    var percent = (questionTotal > 0) ? ((optionTotal/(float)questionTotal) * 100): 0;
    reportList.Add(new
        {
            Diagnostic = diagnostic.Name, 
            Question = question.Text, 
            Option = option.Text, 
            Count = optionTotal, 
            Percent = percent
        });
}

Here I am generating a report on the use of diagnostics (a type of survey). It shows how often each option of each question in each diagnostic has been selected by a user, both count and percent.

You can see that the new anonymous type instance is being added to a list called reportList. This list is strongly typed, as can be seen in the next bit of code, where I order the list using LINQ.

reportList = reportList
    .OrderBy(x => x.Diagnostic)
    .ThenBy (x => x.Question)
    .ThenBy (x => x.Percent)
    .ToList();

This is where the problem comes in: how is it possible to create a strongly typed (generic) list for an anonymous type? The answer is to use a type-inference trick with generics, as the following code snippet shows.

public static List<T> MakeList<T>(T example)
{
    // The example parameter is never used; it exists purely so that
    // the compiler can infer T from the argument passed in.
    return new List<T>();
}

The MakeList method takes a parameter of type T and returns a generic list of the same type. Since this method accepts any type, we can pass it an anonymous type instance with no problems. The next snippet shows this happening.

var exampleReportItem = new
    {
        Diagnostic = string.Empty, 
        Question = string.Empty, 
        Option = string.Empty, 
        Count = 0, 
        Percent = 0f
    };
var reportList = MakeList(exampleReportItem);

So here is the context for all these snippets. The following code gathers my report data and stores it in a strongly typed list containing a transient anonymous type.

var exampleReportItem = new
    {
        Diagnostic = string.Empty, 
        Question = string.Empty, 
        Option = string.Empty, 
        Count = 0, 
        Percent = 0f
    };
var reportList = MakeList(exampleReportItem);
for (var i = 0; i < count; i++)
{
    var diagnostic = diagnostics[i];
    var questionCount = diagnostic.Questions.Count;
    for (var j = 0; j < questionCount; j++)
    {
        var question = diagnostic.Questions[j];
        var questionTotal = results[question.Id];
        var options = question.Options;
        var optionCount = options.Count;
        for (var k = 0; k < optionCount; k++)
        {
            var option = options[k];
            var optionTotal = results[option.Id];
            var percent = (questionTotal > 0) ? ((optionTotal/(float)questionTotal) * 100): 0;
            reportList.Add(new
                {
                    Diagnostic = diagnostic.Name, 
                    Question = question.Text, 
                    Option = option.Text, 
                    Count = optionTotal, 
                    Percent = percent
                });
        }
    }
}

Perhaps you are wondering how the type of the anonymous exampleReportItem can be the same as the type of the anonymous object I add to the reportList?

This works because of the way type identity is assigned to anonymous types. If two anonymous types in the same assembly share the same signature, that is, if their properties have the same names and types, declared in the same order (you can't have methods on anonymous types), then the compiler treats them as the same type.

This is how the MakeList method can do its job. The exampleReportItem instance sent to the MakeList function has exactly the same properties, in the same order, as the anonymous type added to the generic reportList. Because they have the same signature, they are recognised as the same anonymous type and all is well.
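
If you want to convince yourself of the rule, here is a minimal sketch (the property names are hypothetical, chosen just for the illustration): two anonymous instances with identical property names, types and ordering report the same runtime type, while reordering the properties produces a different one.

var a = new { Name = "Widget", Count = 1 };
var b = new { Name = "Gadget", Count = 2 };
var c = new { Count = 3, Name = "Gizmo" };

Console.WriteLine(a.GetType() == b.GetType()); // True  - same names, types and order
Console.WriteLine(a.GetType() == c.GetType()); // False - same members, different order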


Wednesday, February 18, 2009

The Data Driven Conspiracy

In a previous post called An Irrational Love of the Relational I asked whether the relational database (RDB) is the best data store for an object-oriented application. Along the way I mentioned that RDBs tend to make coders adopt a data driven design approach instead of the more object-oriented domain driven design.

That got me thinking: "Why has the RDB remained at the heart of our technology stack despite a steadily growing demand for a domain-centric design methodology?"

I think the answer to that comes in two parts.
  1. Historical
  2. Conspiratorial
1. Historical
Edgar Codd, working at IBM around 1970, was getting frustrated at the lack of a search facility that would allow him to quickly retrieve information from the company's mainframe disk stores. So he worked out a cool way of structuring the data on the drive that would allow queries to be constructed against it, and so allow for ad hoc data retrieval. He published a paper describing this work called A Relational Model of Data for Large Shared Data Banks [1].

In this paper he described, for the first time, the features of an RDB that we find so familiar today: data normalisation, columns, rows, tables, foreign keys and so on. Codd's rules laid out clearly what constituted a true relational database and allowed manufacturers to create their own RDB management systems.

This tech was very cool and so much better than anything that had gone before. It allowed for data integrity, which helped the coders, and it facilitated a whole new class of program that relied on queries to slice and dice huge data sets, revealing subtle and surprising inter-connections.

2. Conspiratorial
By the 1980s most young coders were convinced that relational data and the RDB were the answer to life, the universe and everything. New companies grew up around the RDB; Microsoft got in on the act with SQL Server and began, slowly, to dominate the market. Time passed and eventually those young turks grew older, their beards grew greyer and, like the RDB software they depended on, they became bloated. Then, just when people were forgetting that data could be stored in any other way, the first cracks began to appear in the RDB monolith.

It started when object-orientation finally broke free of the university labs and escaped into the wider coding world. Programs written as a hierarchical collection of objects (an object domain) had been around since the 1960s, when a pair of Norwegian academics invented a language called Simula 67. But the technology did not really take off until the 1990s, with the widespread adoption of C++.

As soon as business coders started to regularly cast complex business systems into objects they began to notice a fundamental problem. Hierarchical object domains of de-normalised state data do not look anything like the relational, normalised data used by RDBs.

But the companies who provided the RDB monolith software, and the generation of coders who had invested their youth evangelising RDB tech, could not accept the cognitive dissonance this observation created. It became imperative, for the profits of the RDB suppliers and the reputations of the RDB evangelists, that some way be found to ignore the problem. Thus the data-driven design methodology was born.

Data-driven design demands that the relational database be designed first. Only then can the database be translated into the radically different structure of the object domain, and to perform this translation we must write not one but two additional layers of logic.
Stored Procedures
The query language built on Codd's relational model was originally called SEQUEL. It was developed at IBM by Donald Chamberlin and Raymond Boyce and was later shortened to SQL because the name SEQUEL was already trademarked.

SQL is an excellent language for constructing data queries and hence for mining data sets. SQL also has a range of management features: you can add, edit and delete the structured data on disk, and the RDB's data integrity rules will help prevent the data getting messed up. These management features are nice, but they are not the core purpose of SQL, which remains the ability to perform complex data retrieval by exploiting the relationships built into the structured data.

Yet it is precisely these data management features (add, delete and so on), along with some typically trivial queries, that account for 99% of what object-oriented coders use SQL for: saving and retrieving (serializing) their object state data.

Data Access Layer
Communicating with sprocs from code is a complicated business and requires its own set of coding techniques, object libraries and tools. These all come together in the data access layer (DAL). The DAL is the code that marshals object state to and from the sprocs.
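
To make that concrete, here is a minimal sketch of the kind of hand-rolled DAL method I mean, written with plain ADO.NET; the Report class, table, sproc name and parameters are hypothetical, invented purely for the illustration.

using System.Data;
using System.Data.SqlClient;

public class ReportDal
{
    private readonly string connectionString;

    public ReportDal(string connectionString)
    {
        this.connectionString = connectionString;
    }

    // Every property on the Report class must be mapped, by hand,
    // to a parameter on the equivalent stored procedure.
    public void Save(Report report)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("Report_Save", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            command.Parameters.AddWithValue("@Name", report.Name);
            command.Parameters.AddWithValue("@CreatedOn", report.CreatedOn);

            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}

Multiply that by every class and every save, load, update and delete operation and the scale of the problem becomes clear.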

This code is typically complex and fragile. As the application requirements evolve, or bugs are discovered and fixed, so the RDB tables, columns etc must be updated to reflect those design changes. This is the heart of data driven design. It forces a cascade of changes to the sprocs and then changes to the DAL code before finally allowing the class design to be updated.

Data-Driven Design = Change DB -> Sprocs -> DAL -> Class

The Enlightenment
At last things are changing. Leading the way is Domain Driven Design. This is, at its heart, simply a statement of the obvious - that the best way to design an object hierarchy is to design the constituent objects. 

Domain Driven Design = Change Class

The domain driven enlightenment has been born out of the fundamental realisation that the old ways of writing software just do not work. Replacing them is Agile, a coding methodology that demands that we refactor, evolve and simplify. These Agile concepts are the very antithesis of data driven design, with its multi-layered, stultifying, baroque complexity. 

Thus Agile demands that we throw out the DAL, the Sprocs and the RDB because they are not, and never were, an appropriate minimal solution to the problem of object state serialisation.

Object state serialisation does not imply a relational database

References
  1. Codd, E.F. (1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–387.

Friday, February 13, 2009

An Irrational Love of the Relational


Is a relational database the best data-store for serializing object-graphs?


Caveats
  1. This applies to coders who employ Agile / Domain Driven Design
  2. The application you are building is not a data mining app
Imagine building an object-oriented application. You find that you need to save and subsequently retrieve an object's state (serialize and de-serialize your object graph). Is a relational database (RDB) the best data store for you?

The answer seems pretty clear to me - relational databases are just about the worst kind of data store for a hierarchical object-graph.

Of course an RDB can store object state. Mapping-tables, foreign keys and normalisation can all be used to project your natural object hierarchy into tables, rows and columns in the way that a globe can be projected onto the flat page of an atlas. 

But the map is not the terrain: the RDB projection is a distortion of the object graph. Translating between the object-graph and its distorted, relational representation requires work - and not just computer cycles, although it needs plenty of those, but also the hand-rolled code (and the attendant bugs) required to manage the transformation of data to and from the RDB object-store.

Traditionally this code was encapsulated in a bespoke Data Access Layer (DAL). It makes me shudder to remember just how much of my life I have wasted writing and debugging DAL code. But all that wasted time is not the critical problem; the real kicker is that a DAL severely limits how Agile you can be.

Take a typical agile process, the quick refactoring of a class definition, say the addition of a new public property.
  • Add new property to class
  • Add equivalent mapping to DAL class
  • Add equivalent parameter to SPROC
  • Add equivalent column to RDB table
Notice how that list feels back to front? Surely life would be easier if I did things the other way around?
  • Add new column to RDB table
  • Add equivalent parameter to SPROC
  • Add equivalent mapping to DAL class
  • Add equivalent property to class
This shows up another problem with using RDBs - it promotes data-driven design.
To me the natural way to design a domain is to play with the objects, but an RDB + DAL approach flips this natural flow on its head and makes you design backwards. It makes you design the data-store before the domain, from tables -> objects.

Data driven design means that you do all the work up-front (table, sproc, DAL and finally object), which greatly increases the cost of experimentation. This ossifies the design process. Data-driven design strongly inhibits a successful design evolving out of a series of cheap experiments. This is why agile coders tend to use domain driven design.

So why suffer all this RDB pain? Why not use a data store whose intrinsic architecture fits the structure of an object graph and does not require piles of buggy DAL code just to satisfy the basics of object serialization? DBAs often cite two main reasons:
  1. You can run reports that cut across the object graph
  2. You can keep your data application agnostic, allowing future applications to use the data
Reason 1 - Because I am not writing a data-mining app I know my report designs in advance. Therefore I don't need ad-hoc, dynamic reports. Since my reports are pre-defined they can be represented in my object-graph as a collection of serializable report objects. Reports are just filtered collections of report objects. 
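
For instance, under that approach a pre-defined report is just a LINQ filter over the stored report objects (the reportItems collection and its properties here are hypothetical, named only for the example):

// A 'report' is nothing more than a query over serialised report objects.
var diagnosticUsage = reportItems
    .Where(r => r.Diagnostic == "Stress Survey")
    .OrderByDescending(r => r.Percent)
    .ToList();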

Reason 2 - I adhere to strict YAGNI (you aren't going to need it) principles, therefore I only write code to do the job. No matter how tempting it is, I do not write code just in case it might be needed in the future.

So what can the modern object-oriented coder do to make life a bit easier?

Use an Object Database
You can use a database that has been specifically designed to store object state data: an object database. I have played around with the open source object database DB4Objects (db4o), which I found to be very fast and easy to use (as easy as NHibernate), but when I was playing with it its development was in a state of rapid flux. If you have used an object database then please leave a comment.
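
For flavour, here is a rough sketch of what storing and querying an object looks like with db4o. This is from memory and the API changed between versions (Store, for example, used to be called Set), so treat it as illustrative only; the Report class is hypothetical.

using System.Collections.Generic;
using Db4objects.Db4o;

// Open (or create) the embedded database file.
using (IObjectContainer db = Db4oFactory.OpenFile("reports.db4o"))
{
    // Persist the object graph - no tables, sprocs or mapping code required.
    db.Store(new Report { Name = "Diagnostic usage" });

    // Retrieve with a native query (a plain predicate over the class).
    IList<Report> reports = db.Query<Report>(r => r.Name.StartsWith("Diagnostic"));
}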

Scrap the DAL
Object Relational Mappers (ORMs), e.g. NHibernate, allow you to scrap the DAL. ORMs automate, as far as possible, the cruddy DAL code and get you closer to the agile ideal of fast and cheap refactoring. You can create new classes, add and remove interface elements, and the ORM will take care of adding new tables and columns. You need never write another object persistence save / update / delete SPROC again (almost).
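
As a flavour of what this looks like in practice, here is a minimal sketch of persisting an object through an NHibernate session. It assumes the configuration and mappings are already in place, and the Report class is hypothetical.

using NHibernate;
using NHibernate.Cfg;

// Built once at application start-up, from hibernate.cfg.xml plus the mappings.
ISessionFactory sessionFactory = new Configuration().Configure().BuildSessionFactory();

using (ISession session = sessionFactory.OpenSession())
using (ITransaction transaction = session.BeginTransaction())
{
    // No DAL and no sproc: the ORM generates the SQL from the mapping.
    session.Save(new Report { Name = "Diagnostic usage" });
    transaction.Commit();
}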

Whilst much better than writing DAL code, ORMs are not perfect and are not pain free. Every once in a while you have to go to the DB and mess around. This might seem trivial but you will be amazed how quickly all that DB experience evaporates. 

Also, to ORM-enable your objects you need to somehow provide the mapping information that links the objects to their equivalent tables. These days this can be handled by marking up your classes with fairly simple [attribute] metadata but it is still a surprising amount of work to keep the attributes up to date.

So ORMs are not the final answer because they are really just papering over the underlying problem: a relational database is not the best data store for serializing object-graphs.

Wednesday, February 11, 2009

Code Coverage: How much is enough?

How much code should you cover with unit tests? How much code coverage is enough?

This was a question that came up in our regular technical team meeting down the pub the other day. I have been pretty evangelical about unit tests ever since the penny dropped about a year ago but it has to be said that I am struggling to stick to the high moral standards set by the 100% code-coverage purists.

You all know my grubby little excuses. Unless you live in some coding Utopia there is always time pressure, there are last minute changes that need to be deployed now, and there are flashes of code-god inspiration that demand attention right now because they are just too exciting to wait. I admit it - my flesh is weak.

But should I be so hard on myself? Is 100% really worth aiming for? At what point does the law of diminishing returns kick in? Nobody I have spoken to seems to have any evidence (as opposed to anecdotal opinion) one way or the other. It boils down to unsupported assertions like "you should at least aim for > 95% code coverage".

I wonder if there is a more efficient way to balance the time spent on unit tests against the value those tests return. Fighting entropy requires a lot of work, so perhaps we can reduce the work by breaking the problem into smaller pieces and creating a hierarchy of code coverage.
  • Public Interfaces 
    These should have unit tests created for each possible interaction with the interface members. There should be an iron rule that public interfaces have 100% code coverage (a sketch of what I mean follows this list).

  • Black Box Code
    The plumbing code that supports the public interface functionality will be covered by tests as required by TDD (test driven development), but new tests should only be written for bug fixes and any obvious TDD-style development.
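
To illustrate the first rule, here is a minimal sketch of the kind of interface-level test I have in mind, written with NUnit; the IReportBuilder interface and its trivial implementation are hypothetical, invented for the example.

using System.Collections.Generic;
using NUnit.Framework;

public interface IReportBuilder
{
    int Count { get; }
    void Add(string optionText, float percent);
}

// Trivial implementation, included only so the example compiles.
public class ReportBuilder : IReportBuilder
{
    private readonly List<string> lines = new List<string>();
    public int Count { get { return lines.Count; } }
    public void Add(string optionText, float percent)
    {
        lines.Add(optionText + ": " + percent + "%");
    }
}

[TestFixture]
public class ReportBuilderTests
{
    [Test]
    public void Add_IncrementsCount()
    {
        // Exercise the behaviour through the public interface, not the plumbing behind it.
        IReportBuilder builder = new ReportBuilder();
        builder.Add("Option A", 25f);
        Assert.AreEqual(1, builder.Count);
    }
}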

In a perfect world I accept that 100% is the ideal, but in a world where I need to make money for my company, 100% code coverage 100% of the time is surely too much.

100% coverage for selected, critical code and significantly less elsewhere might just be the way to go.

Tuesday, February 10, 2009

Agile Technical Specifications

A technical specification document attempts to bridge the gap between a business-driven functional specification and the raw code pouring from a developer's mind.

Yet technical specifications now belong to a bygone, pre-Agile age. Back then business people and technical managers collaborated to produce a functional specification intended to describe all the features of the final system. The technical specification was then derived from this functional specification. It was a structured document that described the system's architecture and intended features in unambiguous terms.

It is now understood that this top-down process is unrealistic. Agile is, in part, a response to the divergence between pragmatic coding and 'command-economy' management. Agile accepts that nobody can really specify a software system before the first versions of the code are written. Nowadays we realise that code must evolve alongside the specification with each informing the other and the whole informing the client in a system of mutual feedback loops. The rigid, top-down information flows of yesterday have given way to a much more flexible and dynamic approach to system specification, planning and construction.

But you don't know what you have until you lose it. Now that technical specifications have gone the way of the Dodo, I can't help thinking that they were not entirely evil after all. If we imagine a utopian technical specification that really did describe a complete system with no mistakes then this would surely be a very useful document.

Is there any way we can get to an Agile version of the technical specification? Can we achieve a technical specification that delivers the usefully specific technical information whilst still being agile, flexible and evolutionary?

I think so. I think that if I were given a set of automatically generated, pre-defined class interfaces and a corresponding set of unit tests, then together these would constitute an Agile Technical Specification.

[Public Interfaces] + [Unit Tests] = [Agile Technical Specification]

Given a set of interfaces, my job as a coder would be to create an equivalent set of domain classes that implemented those interface contracts.

Given a set of unit tests my job would be to use all my creative powers to flesh out my domain classes until those unit tests passed.

If all the unit tests passed then I would have satisfied the functional requirements and this functionality would, perforce, be presented via a public API specified by the interfaces. In other words I would have translated my Agile Technical Specification into working code that both defined and self-certified its own features.

So how do we 'generate' the interfaces and unit tests that comprise an Agile Technical Specification? Well we certainly don't want to be writing these by hand. That would just put us back into the old boat where time constraints cause the code to relentlessly drift away from the specification. Instead these artefacts need to be automatically generated if they are to be of any use.

If they are to be auto-generated then what specifies and drives the auto-generation? The answer is an Agile Functional Specification. Technical specifications are always derived from functional specifications; the difference is that now this process will be automatic.

Fortunately Agile Functional Specifications already exist. They are the sets of user stories that evolve in conjunction with the client or their business representatives. This implies that as the user stories evolve so the Technical Specification will evolve as an automatic derivative.

To achieve this, the user stories must be written in a context-aware, structured syntax - in other words, a domain specific language (DSL). Then it will be possible to consume the user stories and from them auto-generate both the interfaces and the unit tests required to create an Agile Technical Specification. The interfaces are derived from the user story setups and the unit tests from the user story 'Where' clauses (constraints).
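
To make the idea concrete, here is a sketch of what such a derivation might look like. The story syntax, the IBasket interface and the generated test below are all hypothetical, invented for the illustration; the point is only the shape of the mapping from user story to interface plus unit test (shown here in NUnit syntax).

// User story (in some structured DSL):
//   Given a basket
//   When I add an item priced at 10.00
//   Then the basket total is 10.00

using NUnit.Framework;

// Auto-generated interface, derived from the story's setup.
public interface IBasket
{
    decimal Total { get; }
    void AddItem(decimal price);
}

// Auto-generated unit test, derived from the story's constraint.
[TestFixture]
public class BasketSpecification
{
    [Test]
    public void Adding_an_item_priced_at_10_makes_the_total_10()
    {
        IBasket basket = CreateBasket(); // my job as a coder: supply the implementation
        basket.AddItem(10.00m);
        Assert.AreEqual(10.00m, basket.Total);
    }

    private static IBasket CreateBasket()
    {
        // The hand-written part: return the domain class that implements IBasket.
        throw new System.NotImplementedException();
    }
}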
