The Data Catalog Vocabulary (DCAT) defines a standard way to publish machine-readable metadata about a dataset.
The simplest way to publish a description of your dataset is to publish DCAT metadata using RDFa. RDFa allows machine-readable metadata to be embedded in a webpage. This means that publishing your dataset metadata can be easily achieved by updating the HTML for your dataset homepage.
This guide provides a short introduction to publishing DCAT metadata using RDFa. For more advanced use cases, including publishing data in other formats, take a look at the official W3C documentation for DCAT. The RDFa primer may also be useful background reading.
The Open Data Certificates application supports reading DCAT published as RDFa. So as well as providing machine-readable metadata for data consumers, using DCAT will simplify the process of certifying your dataset as the application will be able to automatically populate some of the answers for you.
Getting started
The first thing to do is to let applications know that your web page is describing a dataset. To do this we need to declare the metadata schemas we will be using to describe the dataset and then indicate the type of thing being described.
Here is a fragment of HTML that provides a starting point. Replace {url} with the URL of your dataset page.
<html prefix="dct: http://purl.org/dc/terms/
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
dcat: http://www.w3.org/ns/dcat#
foaf: http://xmlns.com/foaf/0.1/">
<body>
<div typeof="dcat:Dataset" resource="{url}"> ... </div>
</body>
</html>
The html element has a prefix attribute that declares the prefixes for the metadata schemas. The div element uses the resource and typeof attributes to declare which resource it is describing and the type of that resource.
The rest of the metadata about the dataset will then be added to HTML elements nested inside this container <div>. You don't have to use a div element; it could be any HTML element, so adapt this to the structure of your dataset's page.
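As a rough sketch of that flexibility, the same declaration could sit on an article element instead. The choice of article is only an illustration, and the prefix declarations on the html element are assumed to be in place as shown in the starting fragment.

```html
<body>
  <!-- Any element can carry typeof and resource; article is just one option -->
  <article typeof="dcat:Dataset" resource="{url}">
    <!-- the rest of the dataset metadata is nested in here -->
  </article>
</body>
```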
The following sections each illustrate how to add extra metadata elements that flesh out the description of your dataset. Try to provide as full a description of your dataset as possible.
Again, the HTML elements used in the examples are just a suggestion. They can be anything you want, so adapt the examples to your existing page. You may need to add some extra elements, e.g. to wrap dates, titles, etc. that are already in the page as plain text.
The important parts of the examples are the RDFa attributes: about, property, content, datatype, etc. These attributes define what property of the dataset is being described and provide the machine-readable metadata.
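For instance, suppose your page already shows a date as plain text inside a sentence. The sketch below shows one way to wrap it, assuming the paragraph sits inside the dataset container; the sentence wording and the date are made up for illustration.

```html
<!-- Before: the date is plain text with no metadata attached -->
<p>Last updated on 25th October 2010.</p>

<!-- After: a span wraps only the date and carries the RDFa attributes -->
<p>Last updated on
  <span property="dct:modified" content="2010-10-25T09:00:00+00:00" datatype="xsd:dateTime">25th October 2010</span>.
</p>
```

The visible text is unchanged; only the extra span and its attributes are new.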
Title
Specify the title for your dataset using the dct:title property:
<pre><code><h1 property="dct:title">Example Dataset</h1> </code></pre>
Date created
Specify the date your dataset was created using the dct:created property:
<pre><code><p property="dct:created" content='2010-10-25T09:00:00+00:00' datatype='xsd:dateTime'>25th October 2010</p> </code></pre>
In this case the human-readable text for the property is contained within the paragraph tag. Its value can be anything, but the machine-readable date (specified in the content attribute) must use a defined data type so it can be easily parsed.
It's recommended that you use the XML Schema date or XML Schema dateTime format.
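If you only know the day and not the time, a date-only value may be a better fit. The following is a minimal sketch of the same property using the XML Schema date format. It assumes your RDFa processor resolves the xsd prefix; if it does not, adding `xsd: http://www.w3.org/2001/XMLSchema#` to the prefix attribute on the html element should cover it.

```html
<!-- Date-only variant: content holds YYYY-MM-DD and datatype is xsd:date -->
<p property="dct:created" content="2010-10-25" datatype="xsd:date">25th October 2010</p>
```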
Date modified
Specify the date your dataset was last updated using the dct:modified property:
<pre><code><p property="dct:modified" content='2010-10-25T09:00:00+00:00' datatype='xsd:dateTime'>25th October 2010</p> </code></pre>
See above for notes on the date formats.
Description
Mark up the description of your dataset using the dct:description property:
<pre><code><p property="dct:description">This is the description.</p> </code></pre>
License
The markup for declaring your dataset license is slightly more complex. You need to declare the license property (dct:license) as well as the name and URL for the license.
Replace the {license URL} and {license name} placeholders with the values that apply to your dataset:
<pre><code><div property="dct:license" resource="{license URL}"> <a href="{license URL}"> <span property="dct:title">{license name}</span> </a> </div> </code></pre>
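As an illustration of the placeholders filled in, here is a sketch for a dataset released under the Creative Commons Attribution 4.0 licence. The choice of licence is an assumption made purely for the example; use whichever licence actually applies to your data.

```html
<div property="dct:license" resource="https://creativecommons.org/licenses/by/4.0/">
  <!-- The link carries the same URL, so the nested title describes the licence -->
  <a href="https://creativecommons.org/licenses/by/4.0/">
    <span property="dct:title">Creative Commons Attribution 4.0 International</span>
  </a>
</div>
```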
For a more detailed guide on publishing a comprehensive rights statement for your dataset, including license, copyright statements and preferred form of attribution, read the Publishers Guide to the Open Data Rights Statement Vocabulary.
Publisher
Declare the publisher of your dataset using the dct:publisher property. Again, there are several elements to declare here including the name and homepage URL for the publisher.
Replace the {publisher URL} and {publisher name} placeholders with the appropriate values.
<pre><code><div property="dct:publisher" resource="{publisher URL}"> <a href="{publisher URL}" about="{publisher URL}" property="foaf:homepage"> <span property="foaf:name">{publisher name}</span> </a> </div> </code></pre>
Keywords
Keywords can be attached to a dataset using the dcat:keyword property. The property values are simple labels or tags. You can have as many or as few (or none!) of these as you want.
<pre><code><span property="dcat:keyword">Examples</span>, <span property="dcat:keyword">DCAT</span> </code></pre>
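Keywords do not have to be inline spans. If your page already lists tags in a list, a sketch like the following should work equally well, assuming the list sits inside the dataset container. The keyword values are hypothetical.

```html
<ul>
  <!-- Each list item becomes one dcat:keyword value for the dataset -->
  <li property="dcat:keyword">transport</li>
  <li property="dcat:keyword">timetables</li>
  <li property="dcat:keyword">buses</li>
</ul>
```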
Update frequency
DCAT reuses the Dublin Core dct:accrualPeriodicity property to define how often a dataset is updated. The values for the property are URIs taken from a simple controlled vocabulary.
<pre><code><a href="{frequency}" property="dct:accrualPeriodicity">{frequency (human readable)}</a> </code></pre>
Replace the {frequency} placeholder with one of the following URIs:
- http://purl.org/linked-data/sdmx/2009/code#freq-A - Annual
- http://purl.org/linked-data/sdmx/2009/code#freq-B - Every working day (Mon - Fri)
- http://purl.org/linked-data/sdmx/2009/code#freq-D - Daily (7 days a week)
- http://purl.org/linked-data/sdmx/2009/code#freq-M - Monthly
- http://purl.org/linked-data/sdmx/2009/code#freq-N - Every minute
- http://purl.org/linked-data/sdmx/2009/code#freq-Q - Every quarter
- http://purl.org/linked-data/sdmx/2009/code#freq-S - Half yearly
- http://purl.org/linked-data/sdmx/2009/code#freq-W - Weekly
Distributions
A dataset can have a number of distributions. Distributions describe how a dataset is packaged and released. Your dataset may have several distributions, e.g. if you publish a series of data over a period of time as separate packages, or if it is available in different formats.
The markup here is a little more complex. It defines a new resource (a dcat:Distribution) and associates that with your dataset. The nested markup then provides metadata about the new resource, e.g. its format, size, publication date, etc.
<pre><code><div property='dcat:distribution' typeof='dcat:Distribution'>
  <span property="dct:title">{Distribution title}</span>
  <ul>
    <li><strong>Format</strong> <span content='{format}' property='dcat:mediaType'>{format (human readable)}</span></li>
    <li><strong>Size</strong> <span content='{size in bytes}' datatype='xsd:decimal' property='dcat:byteSize'>{size (human readable)}</span></li>
    <li><strong>Issued</strong> <span property='dct:issued' content='{date issued}' datatype='xsd:date'>{date issued (human readable)}</span></li>
  </ul>
  <p><a href='{link to data}' property='dcat:accessURL'>Download the full dataset</a></p>
</div>
</code></pre>
The {format} placeholder should be a recognised MIME type (for example, text/csv or application/json).
The dct:issued property specifies the date that the distribution was published. The property should follow the same guidelines as for the dct:created property outlined above.
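Where a dataset is available in more than one format, each format gets its own distribution block. The sketch below shows two sibling distributions nested inside the dataset container; the example.org URLs and the titles are placeholders invented for illustration, and size and issue date are left out for brevity.

```html
<!-- First distribution: the CSV version -->
<div property="dcat:distribution" typeof="dcat:Distribution">
  <span property="dct:title">CSV download</span>
  <span property="dcat:mediaType" content="text/csv">CSV</span>
  <a href="http://example.org/data.csv" property="dcat:accessURL">Download CSV</a>
</div>

<!-- Second distribution: the same data as JSON -->
<div property="dcat:distribution" typeof="dcat:Distribution">
  <span property="dct:title">JSON download</span>
  <span property="dcat:mediaType" content="application/json">JSON</span>
  <a href="http://example.org/data.json" property="dcat:accessURL">Download JSON</a>
</div>
```

Because each block has its own typeof, the nested properties describe that distribution rather than the dataset as a whole.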
Putting it all together
Here is a complete example of an HTML page marked up using DCAT. It provides all of the core metadata for the dataset, including a description of a single distribution:
<pre><code><!DOCTYPE html>
<html prefix="dct: http://purl.org/dc/terms/
              rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
              dcat: http://www.w3.org/ns/dcat#
              foaf: http://xmlns.com/foaf/0.1/">
<head>
  <title>DCAT in RDFa</title>
</head>
<body>
  <div typeof="dcat:Dataset" resource="http://gov.example.org/dataset/finances">
    <h1 property="dct:title">Example DCAT Dataset</h1>
    <p property="dct:created" content='2010-10-25T09:00:00+00:00' datatype='xsd:dateTime'>25th October 2010</p>
    <p property="dct:modified" content='2013-05-10T13:39:36+00:00' datatype='xsd:dateTime'>10th May 2013</p>
    <p property="dct:description">This is the description.</p>
    <div property="dct:license" resource="http://reference.data.gov.uk/id/open-government-licence">
      <a href="http://reference.data.gov.uk/id/open-government-licence">
        <span property="dct:title">UK Open Government Licence (OGL)</span>
      </a>
    </div>
    <div property="dct:publisher" resource="http://example.org/publisher">
      <a href="http://example.org/publisher" about="http://example.org/publisher" property="foaf:homepage">
        <span property="foaf:name">Example Publisher</span>
      </a>
    </div>
    <div>
      <span property="dcat:keyword">Examples</span>, <span property="dcat:keyword">DCAT</span>
    </div>
    <div>
      <a href="http://purl.org/linked-data/sdmx/2009/code#freq-W" property="dct:accrualPeriodicity">Weekly</a>
    </div>
    <div property='dcat:distribution' typeof='dcat:Distribution'>
      <span property="dct:title">CSV download</span>
      <ul>
        <li><strong>Format</strong> <span content='text/csv' property='dcat:mediaType'>CSV</span></li>
        <li><strong>Size</strong> <span content='240585277' datatype='xsd:decimal' property='dcat:byteSize'>240.6MB</span></li>
        <li><strong>Issued</strong> <span property='dct:issued' content='2012-01-01' datatype='xsd:date'>1st January 2012</span></li>
      </ul>
      <p><a class='btn btn-primary' href='http://example.org/distribution.csv.zip' property='dcat:accessURL'>Download the full dataset</a></p>
    </div>
  </div>
</body>
</html>
</code></pre>
You can also see an example 'in the wild' at http://smtm.labs.theodi.org/download/