RDF as a data modelling language

This is a challenge to the traditional design paradigm that we often see used in content management systems across the Internet.

You have an encumbered data model, continually inflated by business demands, that slows down development, frustrates collaboration between teams and makes feature-toggling error-prone and difficult to revert.

You want rapid, feature-driven development across multiple teams and a rock-solid data model that preserves data in a timeless format that can be understood by tomorrow's applications.

Here are two tensions that describe challenges you may face when constructing your data model, and two tenets that suggest ways to remedy them.

In the following discussion I will use the Drupal CMS as a reference point, because it offers powerful and well-integrated tools for building your data models and coupling them to your application. As such, much of my critique may be specific to the way Drupal does data modelling, but even if you are unfamiliar with Drupal, you should still be able to see the point I make and how it may apply to your data modelling tools and methodologies.

Tension 1: The shallow data model

You may be familiar with the term “format rot”, which deals with the problem of old file formats that become obsolete and unsupported over time, leaving data using that format unreadable.

A similar kind of rotting happens when the data model is shallow, or in other words, when the data model only serves the immediate needs of the application. We can call this “data model rot”.

This first tension deals with how traditional data modelling uses non-interchangeable content types that lack contextual information.

In Drupal, data modelling allows for one level of abstraction: a generic template (called ‘node’) and one or more specialized content types (called ‘node types’) from which the user can create content. A typical arrangement of content types is depicted in Fig 1. Each content type inherits the properties from the template.

Even though content types share a common template, there is no built-in way of recasting content to other types in Drupal. For example, once a piece of content of type Article has been created it cannot later be changed to, say, content of type Blog post. Much of Drupal's inner workings are built around the expectation of immutable content types.

Loss of context

Data only has meaning to us through context. When we call a collection of words an ‘article’, it becomes valuable to us only with an understanding of what ‘article’ means to us. This applies to all data we deal with. Because we are humans, we must be able to apply some context to the data in order to make sense of it.

If context never changed, Drupal's content types would be a perfect fit to encompass all that an article is. But this is not the case: context does change over time. What the word ‘article’ means to us changes over time, and maybe it changes to such a degree that it no longer fits what we originally thought of as an ‘article’.

So, while the data remains the same, the context for how we see the data has changed. This doesn’t mean that the data is irrelevant, only that we must see it in a new light. And to reflect that change in the data model, we must not insist on immutable content types: maybe it isn’t an article anymore, but something different.

Oversimplified hierarchy of abstraction

In Drupal, due to having only one level of abstraction, data model templates can only contain fields that are shared between all content types. This tends to leave the template very light and each content type very heavy with specialized fields. It also promotes duplicate fields across content types, because no intermediate abstraction can collect the fields that are common to, for example, newspaper articles and web articles.
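The duplication described above can be sketched in a few lines of Python. This is only a model of the idea, not Drupal's API: the field names and content types are hypothetical, chosen to mirror the newspaper article and web article example.

```python
# Hypothetical fields; a model of single-level inheritance, not Drupal's API.
TEMPLATE_FIELDS = {"title", "created", "status"}

# Each content type must declare every field that is not shared by ALL types.
CONTENT_TYPES = {
    "newspaper_article": {"body", "byline", "print_edition"},
    "web_article": {"body", "byline", "canonical_url"},
    "event": {"start_date", "location"},
}


def fields_for(content_type):
    """All fields of a content type: the template's plus its own."""
    return TEMPLATE_FIELDS | CONTENT_TYPES[content_type]


def duplicated_fields(type_a, type_b):
    """Fields that two content types each had to declare separately."""
    return CONTENT_TYPES[type_a] & CONTENT_TYPES[type_b]


print(sorted(duplicated_fields("newspaper_article", "web_article")))
# prints ['body', 'byline']
```

Because "event" has neither a body nor a byline, those fields cannot move up into the template, and with no intermediate "article" level they end up declared twice.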

Fig 1: Drupal only allows a single level of inheritance between a template and the actual content type.

Tenet 1: The ideal content

Fig 2: Plato (depicted left) posited that all things have an ideal form (wikimedia.org)

While the kind of abstraction that Drupal's data modelling tools provide is a good start, it does not offer nearly enough depth to describe data models accurately.

When constructing your data models, ask yourself how many fields you can strip from your content type without losing the ‘pure shape’ or perfect representation. If you are building a data model for an article, is there a perfect article? For example, is a piece of content an article if it has an author? Is it an article if it has no author? What fields are required to make an article?

And once stripped of all superfluous fields, does your data model still describe an article or does it describe something more generic and abstract?

When we look for the perfect representation of an article, perhaps we end up with something that could more generically be called ‘a work’. This work could be an article, but only because we choose to cast it in that light. It could also be presented as a blog post. Perhaps, in 50 years, the term ‘article’ makes no sense to us, yet the data of the article is still relevant. What will we call it then?

For an example of this, see the schema.org data model schema:CreativeWork. This is a ‘pure shape’ data model containing fields that are relevant to any creative work, such as a song, a written text or a television episode. From the ‘CreativeWork’ data model, all sorts of more specific data models (such as schema:Article) can be derived. This approach allows for as many layers of abstraction as you need.
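As a rough illustration, the layering can be modelled with two small lookup tables. The class names and subclass links below follow schema.org, but the property lists are a small, non-exhaustive sample, and the `recast` helper is a hypothetical function written for this sketch, not part of any library.

```python
# Subclass links as published by schema.org (class -> parent class).
PARENT = {
    "CreativeWork": "Thing",
    "Article": "CreativeWork",
    "NewsArticle": "Article",
    "SocialMediaPosting": "Article",
    "BlogPosting": "SocialMediaPosting",
}

# A small, non-exhaustive sample of the properties each class introduces.
OWN_PROPERTIES = {
    "Thing": {"name", "description", "url"},
    "CreativeWork": {"author", "headline", "datePublished"},
    "Article": {"articleBody", "wordCount"},
    "NewsArticle": {"dateline", "printEdition"},
    "SocialMediaPosting": {"sharedContent"},
}


def lineage(cls):
    """The class followed by all of its ancestors, most specific first."""
    chain = [cls]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def properties(cls):
    """Own properties plus everything inherited from the ancestors."""
    props = set()
    for ancestor in lineage(cls):
        props |= OWN_PROPERTIES.get(ancestor, set())
    return props


def recast(work, new_type):
    """Hypothetical helper: relabel a work if its data fits the new type."""
    unknown = set(work) - {"type"} - properties(new_type)
    if unknown:
        raise ValueError(f"{new_type} has no place for {sorted(unknown)}")
    return {**work, "type": new_type}


work = {"type": "Article", "headline": "On forms", "articleBody": "..."}
print(lineage("BlogPosting"))
print(recast(work, "BlogPosting")["type"])
```

The data in `work` never changes; only the label under which it is presented does. That is the sense in which a work can be "cast" as an article today and as something else later.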

Fig 3: Using the schema.org hierarchy of classes allows for as many layers of abstraction as needed. Because it is clearly mapped out what is unique and what is shared, you get a concise and uncompromising definition of your content types.

Tension 2: Barren fields

Traditionally, a lot of faith is put in the name of a field as the primary source of explanation for the field's use. For example, the author field of an article is regarded as self-explanatory.

But what if the article had both a writer and an author field? The intention is no longer clear simply from looking at the field names. We need context to determine which field we need, yet applications often suppress context in favor of connotation: unspoken assumptions about the real-world counterpart of the data model (such as an article). In other words, traditional data models tend to describe real-world things while at the same time being clean-room implementations. One website's idea of what an article is may differ completely from another website's. This lack of a common data model language makes the intention of a data model a matter of guesswork, even among its creators.

Another dimension to this is the sometimes arcane data types of fields, which confuse third-party developers consuming the application's API. Both input and output may be in non-standard formats, often influenced by the application's inner workings.
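To make the problem concrete, here is a minimal sketch of two hypothetical sites that expose the same field names with different meanings and formats. The payloads, the site labels and the assumption that site A's dates are in UTC are all invented for illustration.

```python
from datetime import datetime, timezone

# Two hypothetical API payloads describing an article published the same day.
site_a = {"author": "Jane Doe", "created": "03/04/2015"}  # name, DD/MM/YYYY
site_b = {"author": 42, "created": 1428019200}  # user id, Unix timestamp


def created_at(payload, site):
    """Decode 'created'; the consumer must already know each site's habits."""
    if site == "a":
        # Assumption: site A means day/month/year and UTC. Nothing in the
        # payload says so; a US reader could take it for 4 March.
        parsed = datetime.strptime(payload["created"], "%d/%m/%Y")
        return parsed.replace(tzinfo=timezone.utc)
    if site == "b":
        return datetime.fromtimestamp(payload["created"], tz=timezone.utc)
    raise ValueError(f"unknown site: {site}")


# Same moment in time, but only site-specific knowledge reveals that.
print(created_at(site_a, "a") == created_at(site_b, "b"))
```

The field names match, yet neither the author values nor the dates can be compared without knowledge that lives outside the data itself.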

What if there were a way of describing data models and data types that was unambiguous, standardized and commonly used?

Tenet 2: RDF to the rescue