100 GBIF datasets, improved

clean

Biodiversity data papers are a special kind of scientific publication. As explained by Chavan and Penev (2011):

A data paper is a journal publication whose primary purpose is to describe data, rather than to report a research investigation. As such, it contains facts about data, not hypotheses and arguments in support of those hypotheses based on data, as found in a conventional research article. Its purposes are threefold: to provide a citable journal publication that brings scholarly credit to data publishers; to describe the data in a structured human-readable form; and to bring the existence of the data to the attention of the scholarly community.

In recent years many publishers of GBIF data have also published data papers describing their GBIF datasets. If the data paper manuscript was submitted to a Pensoft journal, the dataset was carefully audited (Penev et al. 2019). Manuscripts were not passed to reviewers until dataset errors in structure, format or compliance with Darwin Core recommendations were corrected.

As of 2021-07-13, 100 GBIF datasets had passed through the Pensoft audit process and in many cases had been substantially improved. In this blog post I give an overview of the data errors found in the GBIF datasets.

In this overview I’ve used data problem categories that overlap to some extent, and although I avoided double-counting an error, the numbers should not be regarded as exact sample statistics. The numbers are also minima, as I haven’t counted very minor errors that occurred only once or twice in a dataset.

Please also note that the 100 datasets audited were original datasets, not the datasets after processing by GBIF. These original datasets were downloaded as Darwin Core “source archives” from IPTs or from the GBIF website. There is, of course, some overlap between the data quality checks carried out independently by Pensoft and by GBIF. From the data publisher’s point of view, an important difference is that GBIF does not require publishers to fix the problems it identifies.

The cost of the data auditing service provided by Pensoft is included in the article processing charges for the data paper. Pensoft also offers data auditing and cleaning as a stand-alone service (Mesibov 2020).

Structural problems

Three datasets contained records with field shifts, where blocks of data items were shifted “right” or “left” into fields in which they did not belong.

Four datasets had incompletely joined “event.txt” and “occurrence.txt” files. In other words, there were eventID entries in “occurrence.txt” that were not found in “event.txt”. In a database, this would be called a referential integrity failure. Another dataset contained an “extendedmeasurementorfact.txt” table with no joining field (foreign key) at all.

File content problems

Five occurrence-record datasets contained records that were exact duplicates apart from the unique code in occurrenceID. Some partial-duplicate records were also queried during audits and removed by authors after inspection.

Two occurrence-record datasets contained non-unique identifiers, i.e. duplicate codes in occurrenceID.

Sixty-two of the 100 datasets lacked one or more of 17 fields where the data context required that field to be present. These missing fields included eventDate (seven datasets) and occurrenceID (one dataset), which GBIF requires in occurrence record datasets. The most frequently omitted field was coordinateUncertaintyInMeters (21 datasets).

Between-field content problems

In 23 of the datasets, there were valid entries in the wrong field, with 25 fields affected. A simple example would be the entry “Holotype of Aus bus Smith, 1900” in the type field rather than in typeStatus.

Nearly half the datasets (49) contained disagreements between fields within a single record. Most of the disagreements were in taxon name fields, for example the genus or specificEpithet entries disagreeing with the entry in scientificName. Other common disagreements were between eventDate and year, month or day. This category of problems also includes minimum depth or elevation greater than maximum depth or elevation, and date anomalies such as dateIdentified before eventDate.

Spatial data formed a sub-category of between-field content problems. In many datasets, it was apparent that the data compilers did not clearly understand the relationship between coordinate values, coordinate uncertainty and coordinate precision.

Within-field content problems

This category holds most of the problems identified in Pensoft audits.

Seventy-one of the 100 datasets had invalid or incorrect entries in one or more of 57 fields. The nature of the error varied with the field. In scientificName, for example, invalid entries ranged from “Sponge massive violet” to “Hesperiidae sp.”, while 33°58’18.69" was a data error in decimalLongitude.

Forty-six datasets had missing-but-expected entries in one or more of 41 fields (completeness failure, in data science). Often the fields had valid entries in some records, but blanks in others.

A common problem was pseudo-duplication, where the same entry appeared in a field in more than one format (consistency failure). Pseudo-duplications were found in 33 fields across 36 datasets.

Two datasets had an incrementing fill-down error, in which an entry such as “WGS84” in geodeticDatum appeared as “WGS84”, “WGS85”, “WGS86”, “WGS87” etc.

Data item formatting problems

Sixty-five datasets had non-recommended, incorrect or inconsistent formats for data items in one or more of 31 fields. The eventDate field, for example, should contain dates in YYYY-MM-DD format, but data publishers used a wide range of other formats including Microsoft’s serial date numbers, and sometimes mixed the date formats within a single dataset.

Thirty datasets had NITS as the sole entry in one or more of 29 fields. NITS (Nothing Interesting To Say) are entries like “-”, “??”, “nodata” and “N/A” that fill in for missing data and can be replaced by blanks.

Three datasets had truncated data items in four fields. These typically arise when data items exceed a character-count limit in a database field.

Six datasets had unnecessary quoting around data items in nine fields. Excess quoting often appears in a dataset built from a CSV file in which quotes are used to surround data items containing commas or spaces.

Character problems in data items

Thirty-four datasets had encoding conversion failures in a range of fields, but most often in scientificName. Characters not in the basic ASCII set of letters and punctuation had been replaced by “?”, �, invisible control characters or mojibake (e.g. “Lefèbvre” became “Lefèbvre”) by the time the data had been UTF-8 encoded for GBIF use. Another two datasets had combining characters in place of characters with diacritical marks.

A common character issue was the replacement of plain spaces with no-break spaces (NBSPs). An NBSP tells a Web browser or word-processing program that the words on either side of the NBSP are not to be spread across lines, and must remain together on the same line. NBSPs are invisible and a human reader cannot see the difference between Aus bus with a plain space and Aus bus with an NBSP. A text-parsing program will not only see the difference but will treat the two names as different. NBSPs were found in 17 fields across 39 datasets and were particularly abundant in scientificName.

What to do about dataset quality?

GBIF datasets account for about 40% of the audits for Pensoft data papers since 2017. The non-GBIF datasets had the same range of problems as the GBIF ones, but between-field and within-field problems were much less common in datasets which did not use the Darwin Core framework.

It would be great if GBIF dataset compilers

  • knew the basics of Darwin Core
  • knew GBIF’s data quality requirements and recommendations
  • knew the basics of georeferencing
  • knew the basics of character encoding
  • knew about spreadsheet hazards (if using spreadsheets to compile data) such as field shifts, fill-down errors, empty end-of-record fields and unwanted re-formatting of dates and numbers
  • knew how to efficiently check for between-field and within-field data problems

The Pensoft experience, however, has been that many compilers do not have the necessary knowledge or skills to produce tidy, complete, consistent and Darwin Core-compliant datasets. For those Pensoft authors, a post-compilation data-checking service has been both welcome and worthwhile.

3 Likes

Ergh. This is the exact opposite of what you’d expect an exchange standard to engender. It would be worth investigating this further to learn if this was:

  • a statistical artifact
  • mismatch between a DwC term definition and use of it
  • mismatch between data models (i.e. technical gymnastics to create an Occurrence record created edge cases)

Whatever the cause, GBIF and the TDWG community should be very concerned about it and then take remedial action to better connect with data publishers.

1 Like

@francois in you career, you have direct experience with exactly what @datafixer outlines so clearly. Note his statement “many compilers do not have the necessary knowledge or skills to produce tidy, complete, consistent and Darwin Core-compliant datasets.”

  1. What other skills would you add to this list?
  2. This experience is happening at the level of the data publisher. What are your thoughts about where best to address these skillset / knowledge issues? (e.g. at the level of the CMS? undergraduate education? via publishers? in graduate school? workshops?)
  3. In your post to the Workforce and capacity development thread, you note the need for communities of practice. It would be great to understand and learn from you – where (at what levels) these inclusive communities need to be put in place, and what groups could lead the way.

Thanks @datafixer for addressing some of the questions asked in this thread about Workforce capacity development and inclusion.

Here’s a bit on what funders like #NSF could do to potentially help raise the importance and value of these types of data products. In theory anyway, increased value (perceived and actual) may help encourage engagement in this data curation realm.

@dshorthouse

Most of the non-GBIF datasets were simple data tables that were neither checklists nor lists of occurrences, so the DwC standard wouldn’t apply.