Packaging Darwin Core data items

A data item in Darwin Core is what you might put in the blanks in a data entry form, as shown below for a highly simplified occurrence record.


There are different ways to package data items for storing, sharing and further processing. When you download a Darwin Core archive from an IPT through GBIF, the data items are usually packaged as a tab-separated table (TSV) with the field names in the first line. The result looks like this for a couple of my simplified records:
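As a minimal sketch of this layout, the Python below writes two invented records to TSV and reads them back. The field names are real Darwin Core terms, but the values are made up for illustration and are not the records from the original post.

```python
import csv
import io

# Two invented occurrence records. Field names are Darwin Core terms;
# the values are illustrative assumptions only.
records = [
    {"occurrenceID": "occ-001", "scientificName": "Vombatus ursinus",
     "eventDate": "2021-03-14", "country": "Australia"},
    {"occurrenceID": "occ-002", "scientificName": "Tachyglossus aculeatus",
     "eventDate": "2021-03-15", "country": "Australia"},
]

# Write a tab-separated table with the field names in the first line.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0]),
                        delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(records)
tsv_text = buffer.getvalue()
print(tsv_text)

# Read the table back. Every value comes back as a string.
parsed = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
```

Note that a TSV table carries no type information: dates, numbers and names all come back as plain strings.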


A second packaging method uses XML, which “nests” each of the data items within markup tags, and encloses each record separately with additional tags:
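A rough sketch of this kind of nesting, built with Python's standard `xml.etree.ElementTree` module. The wrapper tag names `records` and `record` are hypothetical choices for illustration, and the record values are invented.

```python
import xml.etree.ElementTree as ET

# Two invented records (values are illustrative assumptions).
records = [
    {"occurrenceID": "occ-001", "scientificName": "Vombatus ursinus",
     "country": "Australia"},
    {"occurrenceID": "occ-002", "scientificName": "Tachyglossus aculeatus",
     "country": "Australia"},
]

# "records" and "record" are hypothetical wrapper tag names.
root = ET.Element("records")
for rec in records:
    node = ET.SubElement(root, "record")       # one enclosing tag per record
    for field, value in rec.items():
        ET.SubElement(node, field).text = value  # one tag per data item

if hasattr(ET, "indent"):  # pretty-printing helper, available in Python 3.9+
    ET.indent(root)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```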


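A third, widely used packaging method is JSON (JavaScript Object Notation). As with XML, data items in JSON are individually assigned to their respective fields. There are slightly complicated rules for writing JSON correctly, though: for example, field names and text values must be enclosed in double quotes, and a comma after the last item in a record is not allowed.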
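Here is a small sketch of JSON packaging using Python's standard `json` module, with invented record values. The helper `is_valid_json` is a hypothetical name introduced here to show two common ways of getting the syntax wrong.

```python
import json

# Two invented records (values are illustrative assumptions).
records = [
    {"occurrenceID": "occ-001", "scientificName": "Vombatus ursinus"},
    {"occurrenceID": "occ-002", "scientificName": "Tachyglossus aculeatus"},
]

# Each data item is assigned to its field name inside a JSON object.
json_text = json.dumps(records, indent=2)
print(json_text)

def is_valid_json(text):
    """Return True if the text parses as JSON (hypothetical helper)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

print(is_valid_json('{"country": "Australia"}'))   # double quotes: accepted
print(is_valid_json("{'country': 'Australia'}"))   # single quotes: rejected
print(is_valid_json('{"country": "Australia",}'))  # trailing comma: rejected
```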

Yet another way to package data items is with a triple. This is a statement that links a subject with an object by means of a predicate which explains the relationship of the subject and object. Triples for the two records might look like this, with angle brackets enclosing subject, predicate and object:
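One way to sketch this in Python, assuming that each record's `occurrenceID` serves as the subject, each remaining field name as the predicate, and each value as the object. That choice of subject is an assumption made for illustration, and the record values are invented.

```python
# Two invented records (values are illustrative assumptions).
records = [
    {"occurrenceID": "occ-001", "scientificName": "Vombatus ursinus",
     "country": "Australia"},
    {"occurrenceID": "occ-002", "scientificName": "Tachyglossus aculeatus",
     "country": "Australia"},
]

def to_triples(record, subject_field="occurrenceID"):
    """Return one '<subject> <predicate> <object>' line per remaining field."""
    subject = record[subject_field]
    return [
        f"<{subject}> <{field}> <{value}>"
        for field, value in record.items()
        if field != subject_field
    ]

for rec in records:
    print("\n".join(to_triples(rec)))
```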


The four packaging methods described above are all human-readable, and all of them have variations. For example, triples can be rewritten as RDF triples, in which the subject and predicate (and sometimes the object) each contain a special identifier that defines what kind of data item is represented.[1]
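As a hedged illustration of the RDF variation, the sketch below writes simple triples in N-Triples syntax. The Darwin Core term namespace `http://rs.tdwg.org/dwc/terms/` is real, but the `https://example.org/occurrence/` prefix for record identifiers is a hypothetical placeholder, the record values are invented, and the literal escaping is deliberately simplified.

```python
DWC = "http://rs.tdwg.org/dwc/terms/"     # Darwin Core term namespace
BASE = "https://example.org/occurrence/"  # hypothetical record IRI prefix

# One invented record (values are illustrative assumptions).
record = {"occurrenceID": "occ-001", "scientificName": "Vombatus ursinus",
          "country": "Australia"}

def escape_literal(text):
    """Escape backslashes and double quotes (simplified; ignores newlines)."""
    return text.replace("\\", "\\\\").replace('"', '\\"')

def to_ntriples(record, subject_field="occurrenceID"):
    """Return N-Triples lines: IRI subject, IRI predicate, literal object."""
    subject = f"<{BASE}{record[subject_field]}>"
    return [
        f'{subject} <{DWC}{field}> "{escape_literal(value)}" .'
        for field, value in record.items()
        if field != subject_field
    ]

print("\n".join(to_ntriples(record)))
```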

Which of the four packaging methods is best? For Darwin Core data, it probably doesn’t matter. Software is available to convert just about every kind of packaging to every other kind of packaging without error, and today’s computers work so fast that even “slow” methods work efficiently.[2]
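To make the conversion point concrete, here is a minimal round-trip sketch in Python: TSV to JSON and back again, checking that the text survives unchanged. The function names `tsv_to_json` and `json_to_tsv` are hypothetical, the sample records are invented, and the sketch assumes every row has the same fields.

```python
import csv
import io
import json

def tsv_to_json(tsv_text):
    """Parse a TSV table (header in first line) into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return json.dumps(rows, indent=2)

def json_to_tsv(json_text):
    """Write a JSON array of flat objects back out as a TSV table."""
    rows = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Invented sample table (values are illustrative assumptions).
tsv_text = (
    "occurrenceID\tscientificName\tcountry\n"
    "occ-001\tVombatus ursinus\tAustralia\n"
    "occ-002\tTachyglossus aculeatus\tAustralia\n"
)

round_trip = json_to_tsv(tsv_to_json(tsv_text))
print(round_trip == tsv_text)
```

The round trip is lossless here because every value stays a string. Real datasets with embedded tabs, quotes or line breaks inside data items would need more careful handling.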

But the key advantage of the four methods shown here, in my view, is that they are all plain text. Plain text can be read and understood in an email, in a text editor, in a word-processing program, in a command-line shell and so on. Plain text is the most widely used format for storing and sharing data, and it isn’t subject to versioning: plain text 1.0 = plain text 1000000.0, and plain text annum 1758 = plain text annum 2023. Plain text is the most portable data format and is completely independent of patents, licensing and other commercial obstacles. You can re-package plain text data items in many different ways (see above) and still have the same plain text data items.

Which packaging method is worst? Speaking as a data auditor, I vote for Microsoft Excel. Spreadsheets are responsible for numerous problems in Darwin Core datasets, and spreadsheet software is usually needed to view the spreadsheet’s plain text content. Please don’t keep or share Darwin Core data as Excel files. Instead, please export, save-as or convert-to a plain text format like TSV.

Robert Mesibov (“datafixer”)

  1. There are also non-human-readable packaging methods where data items are stored or shared in their binary form (“01100001” instead of “a”). The number of data storage and data exchange formats is really very large, and the ones that get used in everyday practice will depend on databasing or other requirements. (Or sometimes fashion. The cool kids these days insist on Parquet.)

  2. The JSON you get when using GBIF APIs can be tricky to turn into something more useful, like a table. On the command line, I’ve found the gron JSON-flattening utility to be indispensable for dealing with API output.
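For readers without gron to hand, the flattening idea can be sketched in a few lines of Python. This is not gron itself: it prints only leaf values, in a path style loosely modelled on gron's output, and the nested response below is an invented fragment rather than real GBIF API output.

```python
import json

def flatten(value, path="json"):
    """Yield (path, leaf_value) pairs for a nested JSON structure."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{path}.{key}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{path}[{index}]")
    else:
        yield path, value

# Invented nested response (structure and values are assumptions).
response_text = """
{"count": 2, "results": [
  {"key": 1001, "scientificName": "Vombatus ursinus", "country": "Australia"},
  {"key": 1002, "scientificName": "Tachyglossus aculeatus", "country": "Australia"}
]}
"""

flat = list(flatten(json.loads(response_text)))
for path, value in flat:
    print(f"{path} = {json.dumps(value)};")
```

Once every data item sits on its own line with its full path, ordinary text tools can filter and regroup the lines into a table.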
