Dear GBIF community,
I am preparing to publish a curated occurrence dataset through IPT/GBIF and would appreciate expert guidance before doing so.
The dataset is a consolidated corpus of freshwater and diadromous fish occurrence records from western Patagonia. It merges more than ten source datasets into one, totalling several thousand records. It includes occurrences from original field surveys, private collections, museum collections, digital repositories, intensive literature reviews, iNaturalist, and records previously available through GBIF. Curation involved spatial and taxonomic filtering, taxonomic and georeferencing corrections, Darwin Core (DwC) homologation, batch aggregation of individual-level observations, deduplication, and identification of related records, among other enhancements. The result therefore contains substantial original information and hundreds of hours of curatorial work, but it also contains records that have already appeared in GBIF or other digital sources, many of which have been corrected. This is the primary reason I am seeking advice before publication.
I have reviewed the GBIF/IPT guidance and forum discussions I could find on DwC term definitions, duplicate occurrences, identifiers, derived datasets, and GBIF clustering, and I have also discussed the issue with our national IPT node representative. We have followed general recommendations by preserving stable identifiers wherever possible, including occurrenceID, catalogNumber, and other legacy identifiers. I also understand that GBIF does not automatically deduplicate occurrences across datasets, and that republishing existing records can create duplication if not handled carefully.
My main question is: what is the recommended way to publish a curated synthesis dataset of this type while preserving clear per-record provenance?
More specifically:
- To begin with, should this be treated as a new curated occurrence dataset, with its own datasetName and datasetID, given that it includes substantial original data and extensive curatorial work in addition to pre-existing records?
- Assuming we publish it as a new dataset, should datasetName refer only to the new curated dataset, while legacy source names are preserved elsewhere or not at all? I wish there were a DwC term for legacy dataset names, something like "otherDatasetNames", but none exists. We have even contemplated concatenating current and legacy dataset names as necessary, separated by " | ", which would be long but informative.
- More generally, what are the preferred DwC fields for recording the dataset provenance of each occurrence when a record may have one or more legacy sources, without that information being buried in obscure, seldom-used fields?
- How should we flag occurrence-level enhancements, such as corrected coordinates or added metadata? I suppose occurrenceRemarks is appropriate, but is there a better practice for occurrence-level version control, so that users or machines can recognize curated records as improved versions of previously published occurrences?
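For concreteness, the datasetName concatenation we have contemplated could look like the minimal Python sketch below. The function name and the sample dataset names are hypothetical placeholders, and this only illustrates an idea under discussion, not an established GBIF practice.

```python
SEPARATOR = " | "  # the separator mentioned above


def join_dataset_names(current_name, legacy_names):
    """Concatenate the current and legacy dataset names.

    Blank names and repeated names are skipped; order is preserved,
    with the current (curated) dataset name first.
    """
    names = []
    for name in [current_name, *legacy_names]:
        name = name.strip()
        if name and name not in names:
            names.append(name)
    return SEPARATOR.join(names)


# Hypothetical names, for illustration only.
combined = join_dataset_names(
    "Curated regional corpus",
    ["Legacy source A", "Legacy source B", "Legacy source A"],
)
print(combined)
```

With many legacy sources the resulting string grows quickly, which is the main drawback of this option.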
At present, we preserve permanent identifiers such as occurrenceID and catalogNumber unaltered whenever possible, and we mint new occurrenceID values only when strictly necessary. When available, we also store the original gbifID in otherCatalogNumbers, formatted as, for example, "gbifID:6179536516", since a new gbifID will be generated upon publication of the curated dataset. However, I am unsure whether this is best practice.
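To illustrate the identifier handling described above, here is a minimal Python sketch. It is not our actual pipeline: the function name, the dictionary representation of a record, and all sample values except the gbifID example are assumptions made for illustration. It joins multiple otherCatalogNumbers values with " | ", the separator Darwin Core recommends for list-valued terms.

```python
def carry_over_gbif_id(record, original_gbif_id=None):
    """Return a copy of a DwC record with the previous gbifID appended
    to otherCatalogNumbers as "gbifID:<value>".

    occurrenceID and catalogNumber are left untouched.
    """
    curated = dict(record)
    if original_gbif_id is None:
        # Original record that was never published through GBIF.
        return curated

    tag = f"gbifID:{original_gbif_id}"
    raw = curated.get("otherCatalogNumbers") or ""
    existing = [v.strip() for v in raw.split("|") if v.strip()]
    if tag not in existing:  # avoid adding the same tag twice
        existing.append(tag)
    curated["otherCatalogNumbers"] = " | ".join(existing)
    return curated


# Hypothetical record; only the gbifID value comes from the example above.
source = {
    "occurrenceID": "urn:example:occurrence:0001",
    "catalogNumber": "EX-0001",
    "otherCatalogNumbers": "FIELD-17",
}
result = carry_over_gbif_id(source, "6179536516")
print(result["otherCatalogNumbers"])
```

The function returns a new dictionary rather than modifying the input, so the source record stays available for comparison with the curated version.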
I am particularly interested in whether there is a standard guideline for publishing curated synthesis occurrence datasets that can include both original records and records previously published elsewhere, while remaining useful as a cohesive, corrected regional corpus. I must be missing something because this does not strike me as an uncommon problem.
I'd appreciate it if you could point me to truly relevant resources, and any advice on best practice before publication would be very welcome.
Thank you!
