Please share your dates correctly

Events in biodiversity data usually have a date. Suppose the event was on 3 May 2021. How would that date be represented numerically in your records or (if you’re at a museum or herbarium) in your CMS? Here are some possibilities:

5/3/21
5-3-21
05/03/21
05-03-21
5/3/2021
5-3-2021
05/03/2021
05-03-2021
3/5/21
3-5-21
03/05/21
03-05-21
3/5/2021
3-5-2021
03/05/2021
03-05-2021
2021/5/3
2021-5-3
2021/05/03
03052021
20210305
v.3.21
v.3.2021
3.v.21
3.v.2021

OK, now which of the above is the correct format for sharing biodiversity data?

Answer: None of the above. The international standard way of representing that date is 2021-05-03.

This YYYY-MM-DD format was adopted as a date standard (ISO 8601) more than 30 years ago. You can use whatever date format you prefer in your own records and in your own institution, but if you want to share your dates with the rest of the world, please use YYYY-MM-DD.

Advantages

The biggest advantage of YYYY-MM-DD is that it avoids ambiguity. Is “5/3/21” meant to be 3 May 1921? 3 May 2021? 5 March 1921? 5 March 2021?

The next biggest advantage comes when your computer sorts YYYY-MM-DD dates. By default they sort chronologically, oldest first.

Parochialism

USA-based readers of this post might like to point out that everyone in the USA writes their dates with month first, then day, then year (MDY), so that “5/3/2021” is perfectly understandable as May 3 2021.

In response, I’d like to point out (politely) that the USA is almost completely isolated in using MDY. Other countries prefer DMY or YMD and would see “5/3/2021” as 5 March 2021.

Spreadsheets

The nasty behaviour of spreadsheets with dates (and things that might be dates) is so well known that I won’t give Horrible Examples here. But there are two simple ways to enter “2021-05-03” in a spreadsheet cell without it becoming something else:

  • Format the column as “Text” instead of “Date” or “General”
  • Type an apostrophe before the YYYY-MM-DD. This forces the spreadsheet to see the entry as text. The apostrophe won’t show in the cell (but you can see it in the formula bar), and if you copy-and-paste the cell into a text file, the apostrophe disappears.

Even better, avoid unwanted changes in date format and other spreadsheet disasters by not using a spreadsheet.

Missing elements

Suppose you only know the year of the event, or only the year and month. The ISO 8601 (2004) way to represent year-only dates is YYYY, and YYYY-MM for year-and-month-only dates.

Please do not share dates like “2021-00-00” or "2021-?? ", and avoid sharing “1950-01-01” for “sometime in 1950”.

If you only know the month, or only the month and day of an event, in Darwin Core you can use the month, day and verbatimEventDate fields to enter what you know. Leave eventDate blank, because “??-07-23” or “–07-23” will not be correctly parsed as ISO 8601.

Date ranges

ISO 8601 also has a simple way to represent intervals like “3 May 2021 to 16 May 2021”. It’s “2021-05-03/2021-05-16”, or alternatively “2021-05-03/16”. Note that there is no space on either side of the forward slash “/”.

An interval such as “sometime between 1869 and 1873” is most simply represented as “1869/1873”, not as “1869-01-01/1873-12-31”.

More on uncertainties

ISO 8601 was extended in 2019 to introduce standard ways to deal with uncertainty in dates. These have not yet been widely adopted. For an overview of the extensions and some skeptical comments, see this blog post.

iso_8601

An XKCD webcomic by Randall Munroe.


Robert Mesibov (“datafixer”); robert.mesibov@gmail.com

4 Likes

Now riddle me this, what is the best way to store (not share) your dates in your database, i.e. before you share them?

@matt, you’re lucky if your RDBMS database software offers you a choice. If it does, a text field with YYYY-MM-DD should work fine. If you’re stuck with something awful like Excel for compiling data, librarycarpentry says (https://librarycarpentry.org/lc-spreadsheets/03-dates-as-data/)

Due to the issues caused by the way that spreadsheet programs handle and display dates, it is safer to store dates with YEAR, MONTH, DAY in separate columns or as YEAR and DAY-OF-YEAR in separate columns. This practice will make your data less ambiguous, will guard against changes that may be caused if the spreadsheet is opened by other software, and generally make your data table more accessible and interoperable.

But if you have a choice and you don’t have to use a proper database, YYYY-MM-DD or YYYY|MM|DD (separate fields) in ModernCSV (https://www.moderncsv.com/) is an excellent option.

I store my own data as plain-text tables with primary/foreign key linkages, and with YYYY-MM-DD in an events-table field. The data are managed by command-line operations, mainly using AWK. I moved to this system from SQLite in 2012. It’s “relational databasing without a database”.

I am not a RDBMS expert, but since I learned to use it, for dates and time I usually store them as timestamp because the textbook told me it’s more efficient in storage and sorting. With timestamps I can format them with various functions provided by the system to yyyy-mm-dd, if that’s the target requirement.

1 Like

@cjk, so what is the timestamp format in your database? And yes, it’s very good that your system allows you to choose a date or date/time export format.

Getting back to the topic of the post above, data publishers can use GBIF’s “issues” flags to check for date problems, including date format - one such flag is “Recorded date invalid”. For example, in an occurrence dataset harvested yesterday, 7942 records out of 7983 were flagged “Recorded date invalid”. The records I looked at had eventDate entries in the format MM/DD/YY.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.