It may not be possible to fill every field in a Darwin Core occurrence record. Some data items may be missing. In a GBIF dataset I recently audited from a Canadian museum, the proportion of blank data items was 26%.
The missing-data proportion is often actually higher than the blank-item proportion, because the data compiler has used “filler” in place of blanks. I call these filler items NITS, for Nothing Interesting To Say. Here are some of the NITS I’ve found in Darwin Core datasets:
Missing data items aren’t a problem for many users of GBIF occurrence records. In the extreme case of a species distribution modeler, only three data items are needed: scientificName, decimalLatitude and decimalLongitude. Modelers typically discard records in which any of these three is blank or erroneous, and may in addition filter out records with blank or suboptimal coordinateUncertaintyInMeters, or records dated earlier than a particular year. For grid-based modeling, the remaining records are then trimmed to leave only one occurrence per grid cell.
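To make that filtering concrete, here is a minimal Python sketch of the steps just described. It is an illustration, not any particular modeler’s workflow. The function name, the parameter names and the thresholds are hypothetical, and the sketch makes three assumptions: each record is a dictionary of strings keyed by Darwin Core term, eventDate begins with a four-digit year, and “one occurrence per grid cell” means one per species per cell on a simple latitude/longitude grid.

```python
import math


def parse_coord(text, limit):
    """Return a float within [-limit, limit], or None if blank or invalid."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    if math.isnan(value) or abs(value) > limit:
        return None
    return value


def filter_for_modeling(records, max_uncertainty_m=None, min_year=None,
                        cell_size_deg=None):
    """Keep only records a grid-based modeler could use (hypothetical helper)."""
    kept = []
    seen_cells = set()
    for rec in records:
        # The three essential items: a name and a valid coordinate pair.
        name = (rec.get("scientificName") or "").strip()
        lat = parse_coord(rec.get("decimalLatitude"), 90)
        lon = parse_coord(rec.get("decimalLongitude"), 180)
        if not name or lat is None or lon is None:
            continue
        # Optional: drop blank or too-large coordinate uncertainty.
        if max_uncertainty_m is not None:
            try:
                unc = float(rec.get("coordinateUncertaintyInMeters") or "")
            except ValueError:
                continue
            if not (unc <= max_uncertainty_m):
                continue
        # Optional: drop records dated before min_year
        # (assumes eventDate starts with YYYY).
        if min_year is not None:
            year_text = (rec.get("eventDate") or "")[:4]
            if not year_text.isdigit() or int(year_text) < min_year:
                continue
        # Optional: keep the first record per species per grid cell.
        if cell_size_deg is not None:
            cell = (name,
                    math.floor(lat / cell_size_deg),
                    math.floor(lon / cell_size_deg))
            if cell in seen_cells:
                continue
            seen_cells.add(cell)
        kept.append(rec)
    return kept
```

Note that nothing in this sketch looks at any field other than the handful named above, which is why blanks elsewhere in the record don’t matter to this kind of user.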
Missing data items, however, seem to be a problem for many data publishers, especially those using collection management systems. Sometimes the CMS won’t allow a record to be saved unless something (even a NITS) is entered in every field. There could also be different reasons for the “missingness” of a data item in a CMS, among them:
- the item is known but hasn’t been entered yet, perhaps because the collection manager needs to recheck the specimen label
- the item is not known because it was not originally recorded, and it cannot be inferred or determined
- the item is not currently known because it was not originally recorded, but it could be inferred or determined in future
- the item is known to have one of several possible values, but the database only allows one value to be entered
Collection managers might prefer that the CMS hold the reason a particular data item is missing, and Groom et al. (2019) have suggested a standardised set of “missing data” values for CMSes, with the following examples:
- unknown = Empty value in a digital record of unknown provenance
- unknown:undigitized = Empty value in a skeletal record to which data still need to be added from the label
- unknown:missing = A value of S.D. used by transcription platforms to indicate the absence of a date value
- unknown:indecipherable = An indication made by a transcriber that they failed to transcribe the information
- known:withheld = A georeferenced record for which coordinate data are available but withheld for conservation considerations
The authors add:
…generic “unknown” indicates that the information is indeed not available. The additives “undigitized”, “missing” and “indecipherable” allow elaboration as to why the data are unavailable, if this reason is known. “known:withheld” indicates that the data are digitally available in a more primary source and could potentially be retrieved after contacting the data provider.
Whether or not collection managers avoid blanks and adopt standardised values for missing data, the question remains: what should be done with missing data in Darwin Core datasets built from a CMS?
The Darwin Core recommendations don’t provide a lot of guidance. The entry “unknown” is recommended when footprintSRS, geodeticDatum, verticalDatum or verbatimSRS isn’t known. On the other hand, the recommendation for coordinateUncertaintyInMeters is “Leave the value empty if the uncertainty is unknown, cannot be estimated, or is not applicable (because there are no coordinates)”.
Here’s a possible answer to the “What to do with missing data?” question, and it’s one I regularly propose to the compilers whose Darwin Core data tables I audit: if a data item is missing, leave it blank. If you have a reason for the “missingness”, put it in a …Remarks field.
Some examples:
- scientificName blank; in identificationRemarks, “indeterminate juvenile”
- recordedBy blank; in occurrenceRemarks, “collector’s name unclear on label, could be H. Neisse or A. Meisse”
- sex blank; in organismRemarks, “rear of specimen missing, could be male or female”
- eventDate blank; in eventRemarks, “date not yet transcribed from label”
- decimalLatitude, decimalLongitude and verbatimCoordinates blank; in georeferenceRemarks, “coordinates yet to be determined from collector’s field notes”
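If a CMS export already carries “missing data” codes like those suggested by Groom et al. (2019), the blank-plus-remarks approach could be applied automatically when building the Darwin Core table. The Python sketch below is one hedged way to do it. The function name, the wording of the generated remark and the choice to write nothing for a bare “unknown” are my assumptions; the pairing of fields with …Remarks fields simply follows the examples above.

```python
# Field -> remarks field, following the examples in the list above.
REMARKS_FIELD = {
    "scientificName": "identificationRemarks",
    "recordedBy": "occurrenceRemarks",
    "sex": "organismRemarks",
    "eventDate": "eventRemarks",
    "decimalLatitude": "georeferenceRemarks",
    "decimalLongitude": "georeferenceRemarks",
    "verbatimCoordinates": "georeferenceRemarks",
}

# The example values quoted from Groom et al. (2019).
MISSING_CODES = {
    "unknown",
    "unknown:undigitized",
    "unknown:missing",
    "unknown:indecipherable",
    "known:withheld",
}


def blank_with_remarks(record):
    """Return a copy with missing-data codes blanked and reasons moved
    to the matching remarks field (hypothetical helper)."""
    out = dict(record)
    for field, remarks_field in REMARKS_FIELD.items():
        value = (out.get(field) or "").strip().lower()
        if value not in MISSING_CODES:
            continue
        out[field] = ""  # a blank means "not currently available"
        # Keep only the part after the colon as the reason, if there is one.
        _, _, reason = value.partition(":")
        if reason:
            note = f"{field} {reason}"  # assumed wording for the remark
            existing = (out.get(remarks_field) or "").strip()
            out[remarks_field] = f"{existing}; {note}" if existing else note
    return out
```

A generated note such as “eventDate undigitized” is terser than the hand-written remarks in the examples, so a compiler might prefer to map each code to a fuller phrase.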
This is not a CMS solution, but Darwin Core is not a CMS. It’s a data model for sharing certain key elements of occurrence records in a standardised way. A blank in a Darwin Core field means “this data item is not currently available”. To add information to a record about the “missingness” of a data item, you can use an existing field other than the one containing the blank.
And to prevent data users and data cleaners from becoming seriously annoyed with your dataset, please don’t enter NITS or “see …Remarks”.
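For compilers who want to check a table before publishing, here is a rough Python sketch that flags fields containing filler or a “see …Remarks” pointer. The filler strings are illustrative assumptions of mine, not a definitive list of NITS, and the function name is hypothetical. The sketch leaves “unknown” alone in the four fields where Darwin Core recommends it.

```python
import re

# Illustrative filler strings only; real datasets will need a longer list.
ASSUMED_FILLERS = {"unknown", "n/a", "na", "none", "null",
                   "no data", "not recorded", "-", "?"}

# Fields for which Darwin Core recommends the entry "unknown".
UNKNOWN_ALLOWED = {"footprintSRS", "geodeticDatum",
                   "verticalDatum", "verbatimSRS"}

# Matches entries such as "see eventRemarks" or "See remarks".
SEE_REMARKS = re.compile(r"^see\b.*remarks$", re.IGNORECASE)


def find_filler(record):
    """Return the names of fields holding filler instead of a blank."""
    flagged = []
    for field, value in record.items():
        text = (value or "").strip()
        lowered = text.lower()
        if lowered == "unknown" and field in UNKNOWN_ALLOWED:
            continue
        if lowered in ASSUMED_FILLERS or SEE_REMARKS.match(text):
            flagged.append(field)
    return flagged
```

Blank fields are deliberately not flagged, since a blank is the preferred way to show that a data item is missing.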
Robert Mesibov (“datafixer”); robert.mesibov@gmail.com