I’m helping a collection at my institution update their database and data publishing pipeline with our IPT, which we self-host.
In this process, I’ve discovered a number of really bad issues with data quality from the collection’s current GBIF snapshot (now about 8 years old) that need to be fixed when we push a new update. The main issues are that many of the identifiers for records in the current GBIF snapshot are incorrect. Many of the dwc:catalogNumbers are erroneous, and the dwc:occurrenceIDs use a deprecated URN format that no longer resolves to anything in our home institution’s domain namespace (essentially, it uses a domain name for a college within our institution that no longer exists and we do not wish to maintain this as the occurrenceID).
The data has been cited in GBIF hundreds of times. Some of these records need to be deleted because they’re completely wrong, but for the ones that merely have identifiers that will be changed, what is the currently accepted best practice for updating the identifiers?
For catalogNumber we will be putting the former catalogNumber in the dwc:otherCatalogNumbers field.
But is there a place for us to put the prior occurrenceID that preserves the link to the existing GBIF records and will be intelligible to data users?
GBIF aims to maintain identifiers wherever possible. We have a blog post here that explains the basic process. There is also a video from our technical support hour for GBIF nodes (about 13 minutes long).
If you can maintain a reference between old and new identifiers in a simple two-column file as the blog post explains, we will be able to migrate identifiers so that the pre-existing records can be updated under the same GBIF URL, rather than creating one new record while deleting an old one. This way, earlier external references are preserved.
If this kind of change affects a major part of your dataset, the update will automatically be held during ingestion at GBIF, and an issue will be created that triggers a manual follow-up, during which you can supply your reference file of old and new IDs. If only a minor portion of your dataset's IDs is affected, the threshold settings may not hold the update automatically, so it is better to check with helpdesk@gbif.org beforehand so that the update can be held before it reaches the GBIF index.
The issue does concern a major part of the dataset, as every single record in the current GBIF dataset has the old occurrenceID. It’s unfortunate that the old namespace is no longer supported by our institution, but I’m glad to hear that GBIF will be able to update those existing records. Is there a place where GBIF will store the old occurrenceIDs then, within the dataset itself? Or is the link solely going to be preserved through the prior published version of the dataset, such that someone would need to do a join of the old and new datasets by GBIF ID to find what the old occurrenceID was for a given record?
I already have the old identifiers in our master database update CSV. Each row has the old occurrenceID, the new occurrenceID, and the current GBIF ID for the record if it already exists in GBIF, so doing the update as you suggested shouldn’t be a problem.
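For anyone following along, here is a rough sketch of how a master CSV like the one described above could be reduced to the simple two-column old/new file. The column names (`old_occurrenceID`, `new_occurrenceID`, `gbifID`) are made up for illustration, and the comma-delimited, header-less output is only an assumption; the blog post and the helpdesk are the authority on the exact layout GBIF expects.

```python
import csv
import io


def build_id_mapping(master_rows,
                     old_col="old_occurrenceID",   # hypothetical column name
                     new_col="new_occurrenceID",   # hypothetical column name
                     gbif_col="gbifID"):           # hypothetical column name
    """Return (old, new) occurrenceID pairs for rows that need migrating.

    Rows are skipped when the record is not yet in GBIF (no GBIF ID),
    when either identifier is missing, or when the identifier is unchanged.
    """
    pairs = []
    seen_old = set()
    for row in master_rows:
        old = (row.get(old_col) or "").strip()
        new = (row.get(new_col) or "").strip()
        gbif_id = (row.get(gbif_col) or "").strip()
        if not gbif_id or not old or not new or old == new:
            continue
        if old in seen_old:
            # An old ID mapped twice would make the migration ambiguous.
            raise ValueError(f"duplicate old occurrenceID: {old}")
        seen_old.add(old)
        pairs.append((old, new))
    return pairs


def write_mapping(pairs, handle):
    """Write the pairs as two columns; delimiter and lack of header are assumptions."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerows(pairs)


# Usage with in-memory rows; with a real file, pass csv.DictReader(open(path, newline="")).
rows = [
    {"old_occurrenceID": "urn:old:1", "new_occurrenceID": "https://example.org/1", "gbifID": "123"},
    {"old_occurrenceID": "", "new_occurrenceID": "https://example.org/2", "gbifID": ""},
]
buffer = io.StringIO()
write_mapping(build_id_mapping(rows), buffer)
```

The duplicate check is there because a one-to-one mapping is what lets an existing record keep its GBIF URL; a repeated old ID is better caught before the file is sent.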
We do not maintain the old ID in the records. You would still be able to find them in the quarterly snapshots, if relevant, but there is no data element specifically dedicated to tracing that in a machine-readable way.
If you would like to supply them for the convenience of human users, you could consider adding them to a notes field within the published record (occurrenceRemarks or dynamicProperties are likely the closest fit).
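As a minimal sketch of the dynamicProperties option: Darwin Core recommends a key:value encoding such as JSON for that term, so the old identifier could be stored under a key of your choosing. The key name `previousOccurrenceID` below is purely hypothetical, not a Darwin Core or GBIF term, and the record is represented as a plain dictionary for illustration.

```python
import json


def with_previous_id(record, old_id, key="previousOccurrenceID"):
    """Return a copy of the record with the old ID added to dynamicProperties.

    `key` is a hypothetical name chosen for this example. Any existing
    dynamicProperties value is assumed to be a JSON object and is preserved.
    """
    updated = dict(record)
    existing = (updated.get("dynamicProperties") or "").strip()
    props = json.loads(existing) if existing else {}
    if not isinstance(props, dict):
        raise ValueError("dynamicProperties is not a JSON object")
    props[key] = old_id
    updated["dynamicProperties"] = json.dumps(props, sort_keys=True)
    return updated


# Usage: the original record is left untouched, the copy carries the old ID.
record = {"occurrenceID": "https://example.org/1", "dynamicProperties": ""}
published = with_previous_id(record, "urn:old:1")
```

If the free-text route is preferred instead, the same value could simply be appended to occurrenceRemarks as a short sentence.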