The vexed question of missing data in Darwin Core

datafixer · July 4, 2022, 4:20am

It may not be possible to fill every field in a Darwin Core occurrence record. Some data items may be missing. In a GBIF dataset I recently audited from a Canadian museum, the proportion of blank data items was 26%.
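The blank-item proportion is simply blank cells divided by all cells. A minimal sketch of that arithmetic, assuming the occurrence table has been read into a list of dicts of strings (for example with `csv.DictReader`); the function name `blank_proportion` is hypothetical, not part of any Darwin Core tool:

```python
def blank_proportion(rows, fields):
    """Fraction of data items (cells) that are empty or whitespace-only.

    rows   -- list of dicts mapping Darwin Core field names to strings
    fields -- the field names (columns) to include in the count
    """
    total = len(rows) * len(fields)
    if total == 0:
        return 0.0
    blanks = sum(
        1
        for row in rows
        for field in fields
        if not (row.get(field) or "").strip()
    )
    return blanks / total
```

Note that this counts only true blanks, so filler values in a field are counted as data.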

The missing-data proportion is often actually higher than the blank-item proportion, because the data compiler has used “filler” in place of blanks. I call these filler items NITS, for Nothing Interesting To Say. Here are some of the NITS I’ve found in Darwin Core datasets:

Missing data items aren’t a problem for many users of GBIF occurrence records. In the extreme case of a species distribution modeler, only three data items are needed: scientificName, decimalLatitude and decimalLongitude. Modelers typically discard records with any or all of these three blank or erroneous, and may in addition filter out records with blank or suboptimal coordinateUncertaintyInMeters, or records dated earlier than a particular year. For grid-based modeling, the remaining records are then trimmed to leave only one occurrence per grid cell.
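The modeler's filtering steps can be sketched roughly as follows. This is an illustration under stated assumptions, not any particular modeling package's workflow: records are dicts of strings, the Darwin Core `year` field carries the date cut-off, "erroneous" is reduced to "coordinates not numeric or out of range", and the function name, thresholds and grid size are all hypothetical.

```python
import math

REQUIRED = ("scientificName", "decimalLatitude", "decimalLongitude")

def filter_for_modeling(records, max_uncertainty_m=None, min_year=None, cell_deg=None):
    """Keep only records usable for distribution modeling (hypothetical helper)."""
    kept = []
    seen_cells = set()
    for rec in records:
        # Discard records with any of the three key items blank.
        if any(not (rec.get(f) or "").strip() for f in REQUIRED):
            continue
        # Discard records whose coordinates are not numeric or are out of range.
        try:
            lat = float(rec["decimalLatitude"])
            lon = float(rec["decimalLongitude"])
        except ValueError:
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue
        # Optional: drop blank or too-large coordinate uncertainty.
        if max_uncertainty_m is not None:
            unc = (rec.get("coordinateUncertaintyInMeters") or "").strip()
            try:
                if not unc or float(unc) > max_uncertainty_m:
                    continue
            except ValueError:
                continue
        # Optional: drop records with no year or dated before min_year.
        if min_year is not None:
            year = (rec.get("year") or "").strip()
            if not year.isdigit() or int(year) < min_year:
                continue
        # Optional: keep one occurrence per species per grid cell.
        if cell_deg is not None:
            cell = (
                rec["scientificName"].strip(),
                math.floor(lat / cell_deg),
                math.floor(lon / cell_deg),
            )
            if cell in seen_cells:
                continue
            seen_cells.add(cell)
        kept.append(rec)
    return kept
```

None of the other fields in the record is consulted, which is why blanks elsewhere do not matter to this kind of user.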

Missing data items, however, seem to be a problem for many data publishers, especially those using collection management systems. Sometimes the CMS won’t allow a record to be saved unless something (even a NITS) is entered in every field. There could also be differing reasons for the “missingness” of a data item in a CMS, among them:

the item is known but hasn’t been entered yet, perhaps because the collection manager needs to recheck the specimen label
the item is not known because it was not originally recorded, and it cannot be inferred or determined
the item is not currently known because it was not originally recorded, but it could be inferred or determined in future
the item is known to have one of several possible values, but the database only allows one value to be entered

Collection managers might prefer that the CMS hold the reason a particular data item is missing, and Groom et al. (2019) have suggested a standardised set of “missing data” values for CMSes, with the following examples:

unknown = Empty value in a digital record of unknown provenance
unknown:undigitized = Empty value in a skeletal record to which data still need to be added from the label
unknown:missing = A value of S.D. used by transcription platforms to indicate the absence of a date value
unknown:indecipherable = An indication made by a transcriber that they failed to transcribe the information
known:withheld = A georeferenced record for which coordinate data are available but withheld for conservation considerations

The authors add:

…generic “unknown” indicates that the information is indeed not available. The additives “undigitized”, “missing” and “indecipherable” allow elaboration as to why the data are unavailable, if this reason is known. “known:withheld” indicates that the data are digitally available in a more primary source and could potentially be retrieved after contacting the data provider.
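To show how such values could be read by software, here is a minimal sketch that splits them into a status and an optional reason. It recognises only the five example values listed above; the function name and the returned tuple are assumptions for illustration, not something Groom et al. specify.

```python
# The five example values quoted above; nothing else is treated as a missing-data code.
MISSING_DATA_VALUES = {
    "unknown",
    "unknown:undigitized",
    "unknown:missing",
    "unknown:indecipherable",
    "known:withheld",
}

def parse_missing_value(value):
    """Return (status, reason) for a recognised missing-data value, else None.

    reason is None for the generic "unknown".
    """
    cleaned = value.strip()
    if cleaned not in MISSING_DATA_VALUES:
        return None  # ordinary data, or a filler this sketch does not know
    status, _, reason = cleaned.partition(":")
    return status, (reason or None)
```

A status of `known` then signals that the data might be obtained by contacting the data provider, while `unknown` with a reason says why the item is unavailable.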

Whether or not collection managers avoid blanks and adopt standardised values for missing data, the question remains: what should be done with missing data in Darwin Core datasets built from a CMS?

The Darwin Core recommendations don’t provide a lot of guidance. The entry “unknown” is recommended when footprintSRS, geodeticDatum, verticalDatum or verbatimSRS isn’t known. On the other hand, the recommendation for coordinateUncertaintyInMeters is “Leave the value empty if the uncertainty is unknown, cannot be estimated, or is not applicable (because there are no coordinates)”.

Here’s a possible answer to the “What to do with missing data?” question, and it’s one I regularly propose to the compilers whose Darwin Core data tables I audit: If a data item is missing, leave it blank. If you have a reason for the “missingness”, put it in a …Remarks field.

Some examples:

scientificName blank; in identificationRemarks, “indeterminate juvenile”
recordedBy blank; in occurrenceRemarks, “collector’s name unclear on label, could be H. Neisse or A. Meisse”
sex blank; in organismRemarks, “rear of specimen missing, could be male or female”
eventDate blank; in eventRemarks, “date not yet transcribed from label”
decimalLatitude, decimalLongitude and verbatimCoordinates blank; in georeferenceRemarks, “coordinates yet to be determined from collector’s field notes”
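The same pattern can be sketched in code for a Darwin Core export. This is only an illustration under assumptions: the filler list is a small hypothetical sample (three of the strings are quoted later in this thread), field values are strings, and the function name and the " | " separator are my own choices. The four fields for which Darwin Core recommends “unknown” are left untouched.

```python
# Hypothetical sample of filler strings (NITS); extend to suit the dataset.
NITS = {"unknown", "not specified", "not applicable", "n/a", "?"}

# Fields where Darwin Core itself recommends the entry "unknown".
UNKNOWN_ALLOWED = {"footprintSRS", "geodeticDatum", "verticalDatum", "verbatimSRS"}

def blank_nits(record, reasons=None):
    """Return a copy of record with NITS blanked and reasons moved to remarks.

    reasons -- optional dict: field name -> (remarks field name, reason text);
               the reason is added only if the field ends up blank.
    """
    cleaned = dict(record)
    for field, value in record.items():
        if field in UNKNOWN_ALLOWED:
            continue
        if value.strip().lower() in NITS:
            cleaned[field] = ""
    for field, (remarks_field, reason) in (reasons or {}).items():
        if cleaned.get(field, "").strip() == "":
            existing = cleaned.get(remarks_field, "").strip()
            cleaned[remarks_field] = f"{existing} | {reason}" if existing else reason
    return cleaned
```

For instance, passing `{"eventDate": ("eventRemarks", "date not yet transcribed from label")}` as `reasons` reproduces the fourth example above for any record whose eventDate is blank or a listed filler.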

This is not a CMS solution, but Darwin Core is not a CMS. It’s a data model for sharing certain key elements of occurrence records in a standardised way. A blank in a Darwin Core field means “this data item is not currently available”. To add information to a record about the “missingness” of a data item, you can use an existing field other than the one containing the blank.

And to prevent data users and data cleaners from becoming seriously annoyed with your dataset, please don’t enter NITS or “see …Remarks”.

Robert Mesibov (“datafixer”); robert.mesibov@gmail.com

kcopas · July 4, 2022, 3:37pm

Thanks for the post, Bob. Seems worth mentioning that, while the focus will be on the current work on the data model, reps from various CMSs will take part in our webinar this Thursday, 7 July.

Anyone interested can find the registration link on our event page.

jegelewicz · July 8, 2022, 1:02pm

Bob,

I think you are bringing up an important concept that the community in general should be considering. I don’t have any good answers and I know that in Arctos we do not do this in a consistent way. I am especially NOT fond of the fact that in date fields we have to enter SOMETHING when the data is unknown. There are a LOT of misleading dates associated with Arctos records because we can’t seem to deal with NULL in dates of collection or attribute determination. I’ve started up a discussion in our community about this, the “unknown” issue for collectors, and the “No specific locality recorded.” recommendation (and its lack of consistent use) in locality. I’ll do what I can with our community, but this really is a broader community issue that should be considered by TDWG, GBIF, and everyone who either provides biodiversity data or makes use of it.

Thanks for your data reviews - they always point out things we could be doing better!
Teresa

Debbie · July 14, 2022, 6:46pm

Hi Bob,

Succinct synopsis. Thank you! Your post sent me down memory lane in early days developing Morphbank. I asked the very same types of questions when I would see “unknown” or “not specified” or “not applicable.”

Does unknown mean no one has looked up the information yet or just didn’t find anything? Could it be knowable? Does not specified mean they don’t want to or it’s missing? (One could infer meaning if a label image is present). Etc.

As @jegelewicz suggests, from the collections point-of-view, some key pieces (functions) for tracking record “completeness” and “knowability” (as in maybe it’s empty but we just haven’t gotten to it yet) are not in most CMS. In TaxonWorks we are developing ways in which you can better understand record completeness and what’s left to do.

The use case examples of rapid digitization to produce a “skeletal” data record, or the workflow of transcribing data from an image, raise questions of institutional knowledge, data management and vocabulary (controlled?), as well as scoping challenges.

How does a collection doing skeletal record digitization plan to fill in the rest of the data?
From an institutional knowledge standpoint, who (and it’s usually a person) knows which records (or groups of records) are skeletal? We don’t really have a way for the software to tell us this information either by soft validation or metadata methods. When the person leaves, this knowledge of “what’s left to do” likely also goes with them.
From a database point-of-view again, how could one decide a record is as “complete” as possible (for the moment), where all that’s currently known has been added? How often might this be reviewed for potential new information?

For all these reasons, I think you find these NITS bits. People are trying to track what’s been done, left to be done, and cannot be done or known. The MIDS work could possibly help with this somewhat. But as I understand MIDS – it’s designed to focus on the needs of the researcher who needs certain fields to have data for their particular research use case. If data are missing (= a lower MIDS level) the researcher can ask for “digitization on demand” to fill in the missing needed bits. Again, are they, the missing bits, even knowable?

We need various methods (flags, metadata, soft validation, data visualization, controlled vocabs) in our CMS to help us manage all of these issues. All of these will contribute to producing datasets with fewer NITS bits and to easier, more strategic curation of a growing data pile.

Here’s a visualization example from TaxonWorks, the INHS Insect Collection. The graph shows where there are label images and if label information in those images has been entered or not into the database. In TaxonWorks speak, buffered fields hold verbatim information that is then parsed into appropriate fields. Each of those hexagons is a record that can be clicked to work on.

Example 1.

Example 2.

Of course, it’s not addressing your point about NITS, but it is getting at helping collections understand what’s been done, what needs to be done, without resorting to searching for NITS-like terms.

@matt or @tmcelrath may want to add more thoughts here. It would be great to know what other software folks are doing to help collections understand “completeness” and to manage their NITS as you call them. And perhaps we can also learn from what others are doing about these data challenges? @abentley @arbolitoloco @emeyke @tkarim @JessUtrup

Progress is being made in that we are both talking about these issues in broad daylight and sharing what we’re trying to do / thinking about / needing. So thanks, Bob.

waddink · July 15, 2022, 6:57am

Interesting topic and interesting visualisation in TaxonWorks. In DiSSCo we are just starting to visualise the status of our network in simple terms of number of specimens, issue flags and progress in time. But presenting more information on completeness is high on our wishlist as it can support further curation and enrichment of the data, one of our main objectives.

Deb, I think you misunderstood the purpose of MIDS a little: it is meant to help with digitisation, not so much to focus on the needs of a researcher for particular use cases. There would be better options for that. MIDS is the minimum amount of information made digitally and openly available online following each of the major stages of digitisation. Level 1 is the minimum needed to enable the attachment of other information about the specimen; level 2 enables scientific use (for at least some common scientific use cases, but not all).

datafixer · July 15, 2022, 7:56am

@waddink, it’s great to see you write “as it can support further curation and enrichment of the data”, which I know was an early goal of DiSSCo. In other words, DiSSCo hopes to provide an improved version of an original CMS record - with (I imagine) a link back to the “authoritative” record in the CMS.

This is not very different from a Darwin Core version of a CMS record being a derived version in which improvement is possible. When auditing museum/herbarium records I try to make that point: you can fix the DwC version you are sharing with the world today without fixing the CMS version, which you can fix later. (But keep a record of changes made between CMS and DwC!)

In the post above I pointed out that DwC has an easy solution to the problem of how to explain incomplete records - just use a blank and an explanation in a …Remarks field. The corresponding method in a CMS will vary with the CMS and may not be so easy.

In fact, it may be easier to fill out skeletal records in DwC than in the CMS. There may be many fewer fields than in the CMS, and much easier global editing. So maybe DiSSCo should aim for “fixed and filled” records in the DES?

waddink · July 15, 2022, 12:02pm

Our aims go a bit further than that, but yes, that is the idea. We have planned to do an early demonstration of a working implementation at the next TDWG conference. However, to enable machine curation to help the human curation, the explanation should be machine actionable, not just human readable.

Debbie · July 19, 2022, 3:49pm

Thanks @waddink, I’m looking to see how collections will implement MIDS. I’m guessing that @ehaston is working on that? For a CMS to use MIDS, they’ll need to be able to group records that meet different MIDS-level criteria, yes? Once implemented, I can see another possible use of MIDS (depending on how it’s done in the database): to track / find skeletal records associated with a given project that need more transcription / data entry.

If people put NITS values in some of the MIDS fields, then as Bob points out, they are not of the greatest use and give the researcher the sense there’s more in the package than is there in reality.

system · August 19, 2022, 1:50am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
