Using triplestores as a CMS backend storage solution

Content management systems are widely used and available as prepackaged, ready-to-use solutions for authoring, editing and publishing content to the web.

Requirements

A storage backend for a modern, fully fledged CMS will, at a minimum, support these features:

  1. Content cross-references
  2. Content translation
  3. Content versioning
  4. Customizable data models

Traditional solutions

To support all of the features outlined above, CMS backend storage solutions often need to adopt a complex, application-specific schema on top of the abstractions offered by the storage solution.

As an example, Drupal, a widely used CMS, must maintain a higher level of abstraction on top of the SQL schema to keep track of all of this information. This is visible when inspecting a SQL database used by Drupal: it contains dozens of tables that keep all these dimensions, such as language, versions and custom fields, in sync.

A potential problem with abstractions layered on top of the storage backend's schema language is that it usually becomes impossible, or extremely difficult, for other applications to consume the data outside of the CMS. The CMS thereby becomes the gatekeeper of its own data – a situation that is bad for future-proofing and makes decoupling your application more expensive.

Using triplestores

A triplestore is a storage application specialized for RDF. It is called a ‘triplestore’ because it stores RDF triples of the form subject → predicate → object. Triplestores are queried with SPARQL, an SQL-like query language that offers many features not found in SQL-based storage solutions.
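
To give a first taste of SPARQL, here is a minimal query (the article IRI is illustrative, reused from the examples below) that retrieves every predicate/object pair stored about a single resource:

SELECT ?predicate ?object
WHERE {
    <http://example.org/my-article> ?predicate ?object .
}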

Using triplestores and RDF overcomes the problems of traditional backend storage solutions on multiple fronts. Below, I go through each feature requirement and show how triplestores meet it.

1. Content cross-references

RDF has built-in support for cross-references: the IRI primitive, written with the <…> notation, denotes a reference to another resource:

<http://example.org/my-article> → schema:isPartOf → <http://example.org/my-other-article>

This language-level cross-reference notation is human-friendly, but it only offers a direct reference between two resources. What if we want to describe the cross-reference itself? The schema.org vocabulary offers contextual cross-references through the schema.org/Role class, which allows a cross-reference to carry information about the reference itself. For example, to describe that a person is connected to an organization through the role of CEO, we could write:

<http://example.org/my-profile-page> → schema:memberOf → <http://example.org/my-role>
<http://example.org/my-role> → schema:roleName → "CEO"
<http://example.org/my-role> → schema:memberOf → <http://example.org/my-organization>

For more information, see Introducing ‘Role’ on the schema.org website.
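
In SPARQL, the Role triples above can be traversed in a single query. This is a sketch; the prefix declaration is standard, while the IRIs are the illustrative ones from the example:

PREFIX schema: <http://schema.org/>

# Find the role name and organization attached to the profile page,
# following the intermediate Role resource.
SELECT ?roleName ?organization
WHERE {
    <http://example.org/my-profile-page> schema:memberOf ?role .
    ?role schema:roleName ?roleName .
    ?role schema:memberOf ?organization .
}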

2. Content translation

RDF has built-in language support: any literal can carry a language tag (language tags follow BCP 47, which builds on the ISO 639-1 language codes):

<http://example.org/my-article> → schema:headline → "my article"@en
<http://example.org/my-article> → schema:headline → "min artikel"@da
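
Because the language tag is part of the data itself, translations are queryable with standard SPARQL and need no extra schema. A sketch of a query that fetches only the Danish headline, using the built-in lang() function:

PREFIX schema: <http://schema.org/>

SELECT ?headline
WHERE {
    <http://example.org/my-article> schema:headline ?headline .
    FILTER (lang(?headline) = "da")
}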

3. Content versioning

RDF datasets have built-in support for storing sets of RDF triples in their own named graphs. These graphs can be used to distinguish one state from another, i.e. between two revisions.

Figure 1: Two graphs, each containing a document revision.

This allows each piece of content to have multiple revision states. An open question remains as to the extent of state that each graph should record. (For example, if one document references another document, should the revision state include the referenced document? This issue exists for other backend storage solutions as well.)
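
As a sketch of how this could look, each revision lives in its own named graph (the graph IRI scheme below is an assumption, not a standard), and a specific revision is read back with the GRAPH keyword:

PREFIX schema: <http://schema.org/>

# Read the headline as it was in revision 2 of the article.
SELECT ?headline
WHERE {
    GRAPH <http://example.org/my-article/revision/2> {
        <http://example.org/my-article> schema:headline ?headline .
    }
}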

For another example of a triplestore revision implementation, see R43ples: Revisions for Triples.

4. Customizable data models

A CMS typically uses one or more data models to describe the document type. For example, a document type called ‘article’ can include a headline, author and publishing date field, whereas a document type called ‘note’ may just have a title and body field.

A modern CMS can create document types on the fly to meet new use cases. The backend storage must be able to store these document types, which typically introduces yet another layer of abstraction. For example, Drupal stores its document types in a SQL database, necessitating high levels of abstraction on top of SQL.

Triplestores avoid this issue by using the same language for data and data models. The schema.org website offers a large body of document types (articles, events, persons, etc.), all written in RDF. Defining your document types in the backend storage language brings a multitude of advantages: not only can third-party developers consume data directly from the backend storage without the CMS as mediator, they can also read the data model itself from the same storage!
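
As a sketch, assuming the schema.org vocabulary has been loaded into the same store as the content, the data model itself can be read back with an ordinary query (schema:domainIncludes is the property schema.org uses to attach properties to types):

PREFIX schema: <http://schema.org/>

# List every property whose declared domain includes schema:Article.
SELECT ?property
WHERE {
    ?property schema:domainIncludes schema:Article .
}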

Storing both the data and the data model in the same place (and having that place use a non-proprietary, standardized and commonly used language) makes it much easier for new and third-party developers to access, understand and develop your application.

For more information about using RDF for data modelling, see the RDF Schema specification on the W3C website.

Bonus features of triplestores

Triplestores offer more than just the basic requirements of a modern CMS storage solution. Here, I discuss a few bonus features.

Bidirectional references

While Drupal supports one-directional references, reverse (bidirectional) references are typically expensive, as they require an extra lookup in the database. Triplestores do not have this limitation: reverse references are supported at the language level.

A normal reference seeking the content that an article is part of:

<http://example.org/my-article> → schema:isPartOf → ?parent

A reverse reference uses the ^ operator to denote a reverse relationship. Here we request the article that has a schema:isPartOf property referencing <http://example.org/my-other-article>:

<http://example.org/my-other-article> → ^schema:isPartOf → ?article
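
In full SPARQL, the ^ operator is part of the property path syntax, so the reverse lookup can be written as an ordinary query, with no separate lookup step:

PREFIX schema: <http://schema.org/>

# Find every article that declares itself part of my-other-article.
SELECT ?article
WHERE {
    <http://example.org/my-other-article> ^schema:isPartOf ?article .
}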

Connectivity with other triplestores

Using the SERVICE keyword, triplestores can talk to each other out of the box. Within the context of a CMS, this means that we can connect to content stores such as Wikidata natively from the backend storage. Because of the common language of SPARQL, a triplestore can easily draw on information from across the Internet and connect it to its own content.

The SPARQL query below searches Wikidata for house cats and returns local content of type schema:Article that mentions any of them:

PREFIX schema: <http://schema.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>

SELECT ?article
WHERE {
    ?article schema:mentions ?cat .
    ?article rdf:type schema:Article .
    SERVICE <https://query.wikidata.org/bigdata/namespace/wdq/sparql> {
        ?cat wdt:P31 wd:Q146 .
    }
}

Readability of the backend storage

While bitrot and format rot are familiar concepts, it is also reasonable to speak of ‘data model’ rot. With time, traditional data models lose relevance and context. For example, the concept of a ‘newspaper article’ may not be relevant in one hundred years, and the contextual information (that is, inferred or unspoken knowledge about, for example, a field in your data model) will likely be lost.

This final point is about readability and making your data future-proof. As an example, imagine that in 100 years you attempt to restore data from your 100-year-old CMS. Further, imagine that you are unable to run the application to extract the data. You are then left extracting the data directly from the backend storage.

With a traditional CMS like Drupal, you will likely have to read up on how data is stored and reverse-engineer the data models in order to assemble the data. But even then, much of the context of the data model may have been lost – while it is possible to document the data model in Drupal, developers typically do not bother with annotating each field and document type with exhaustive contextual information, because the data model is typically only used within the scope of that specific Drupal application.

With RDF, you get ready-to-use vocabularies that are richly documented and provide context for your data model. Further, since the data model is part of the backend storage data, you will not need the application to figure out the data model.
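
As a concrete sketch, assuming the schema.org vocabulary is stored alongside your data, the human-readable documentation of any part of the data model can be read straight out of the backend storage:

PREFIX schema: <http://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Read the documentation string attached to the headline property.
SELECT ?documentation
WHERE {
    schema:headline rdfs:comment ?documentation .
}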