Traceability and version control when publishing a curated regional occurrence dataset with mixed original and previously published records

Dear GBIF community,

I am preparing to publish a curated occurrence dataset through IPT/GBIF and would appreciate expert guidance before doing so.

The dataset is a consolidated corpus of freshwater and diadromous fish occurrence records from western Patagonia. It consolidates more than ten datasets into a single one, with several thousand records. It includes occurrences from original field surveys, private collections, museum collections, digital repositories, records from intensive literature reviews, iNaturalist, and records previously available through GBIF. The consolidated dataset was curated to apply spatial and taxonomic filtering, taxonomic and georeferencing corrections, Darwin Core (DwC) homologation, batch aggregation of individual-level observations, deduplication, and identification of related records, among other enhancements. Therefore, the result includes substantial original information and hundreds of hours of curatorial work, but it also includes records that have already appeared in GBIF or other digital sources, many of which have been corrected. This is the primary reason I am seeking advice before publication.

I have reviewed the GBIF/IPT guidance and forum discussions I could find on DwC term definitions, duplicate occurrences, identifiers, derived datasets, and GBIF clustering, and I have also discussed the issue with our national IPT node representative. We have followed general recommendations by preserving stable identifiers wherever possible, including occurrenceID, catalogNumber, and other legacy identifiers. I also understand that GBIF does not automatically deduplicate occurrences across datasets, and that republishing existing records can create duplication if not handled carefully.

My main question is: what is the recommended way to publish a curated synthesis dataset of this type while preserving clear per-record provenance?

More specifically:

  1. To begin with, should this be treated as a new curated occurrence dataset, with its own datasetName and datasetID, given that it includes substantial original data and extensive curatorial work in addition to pre-existing records?

  2. Assuming we publish it as a new dataset, should datasetName refer only to the new curated dataset, while legacy source names are preserved elsewhere or not at all? I wish there were a DwC field to keep legacy dataset names, something like an ‘otherDatasetNames’ which does not exist. We have even contemplated concatenating current and legacy dataset names as necessary, separated by " | ", which would be long but informative.

  3. More generally, what are the preferred DwC fields for recording the dataset provenance of each occurrence when a record may have one or more legacy sources, without that information getting hidden in obscure, seldom-used fields?

  4. How should we flag occurrence-level enhancements, such as corrected coordinates or added metadata? I suppose occurrenceRemarks is appropriate, but is there a better practice for occurrence-level version control, so users or machines can recognize curated records as improved versions of previously published occurrences?

At present, we are preserving permanent identifiers unaltered whenever possible, such as occurrenceID and catalogNumber. We mint new occurrenceID values only when strictly necessary. When available, we are also relocating the original gbifID to otherCatalogNumbers, formatted as, for example, “gbifID:6179536516”, since a new gbifID will be generated upon publication of the curated dataset. However, I am unsure whether this is the best practice.
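To make the convention concrete, here is a minimal sketch of how the legacy gbifID could be written to and read back from otherCatalogNumbers. The function names are hypothetical, and the " | " separator is an assumption based on the usual Darwin Core recommendation for list-valued terms; it is not an established GBIF practice for this purpose.

```python
SEPARATOR = " | "   # assumed list separator for otherCatalogNumbers
PREFIX = "gbifID:"  # prefix used in the example above


def add_legacy_gbif_id(other_catalog_numbers: str, gbif_id: int) -> str:
    """Append 'gbifID:<id>' to an otherCatalogNumbers string, skipping repeats."""
    entries = [e.strip() for e in other_catalog_numbers.split("|") if e.strip()]
    tag = f"{PREFIX}{gbif_id}"
    if tag not in entries:
        entries.append(tag)
    return SEPARATOR.join(entries)


def extract_legacy_gbif_ids(other_catalog_numbers: str) -> list[int]:
    """Return every legacy gbifID stored in an otherCatalogNumbers string."""
    ids = []
    for entry in other_catalog_numbers.split("|"):
        entry = entry.strip()
        value = entry[len(PREFIX):]
        if entry.startswith(PREFIX) and value.isdigit():
            ids.append(int(value))
    return ids
```

Keeping the prefix explicit means other legacy identifiers can share the same field without being mistaken for a gbifID.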

I am particularly interested in whether there is a standard guideline for publishing curated synthesis occurrence datasets that can include both original records and records previously published elsewhere, while remaining useful as a cohesive, corrected regional corpus. I must be missing something because this does not strike me as an uncommon problem.

I’d appreciate it if you could point me to truly relevant resources, and any advice on best practice before publication would be very welcome.

Thank you!

Dear @Cristian,

If you have curated some datasets, the best way to approach it will be to share your curated data with the original publishers, so they can update their datasets on GBIF and benefit from your work.

With the original records that you have, you could publish those as a new occurrence dataset if you have the permits from the original data holders.

As a rule, we prefer not to publish data that is already public on GBIF and that contains known duplicates. Republishing occurrence records duplicates them in GBIF, and we currently have no way to de-duplicate for data users, so it will be difficult for users to (a) detect the duplication and (b) decide which version of a record to use. So, the best way is to update the original sources.

And if you need to publish the whole dataset, we recommend uploading the cleaned version as a DwC-A to a different repository (like Zenodo) as a “derived dataset”, including metadata that specifies the original data used (download DOI, source datasets, citation), and generating a metadata-only dataset in GBIF that describes the intent and methodology behind the external dataset, with the two pointing at each other so that they are mutually discoverable. As for the provenance in this “derived dataset”, you can just use the datasetName and datasetID elements to document the source of each record. In these elements you usually don’t put the information of the current dataset, so you don’t need different terms.

Dear @estebanmhGBIF,

Thank you very much for your clear and helpful advice. It helped us avoid publishing the full consolidated occurrence dataset directly through GBIF, which could have created known duplicates and made the data harder for users to interpret.

Based on your recommendation, we have revised our strategy. We will first publish only the high-quality original records that have not previously been made available through GBIF, as separate IPT/GBIF occurrence datasets. Once published, we will bring the new GBIF dataset names, dataset IDs and GBIF-generated identifiers back into our consolidated master database, so that each record can be linked to its corresponding GBIF source where applicable.

Records already obtained from GBIF will retain their original GBIF identifiers and source dataset information. Some additional subsets will remain only in the consolidated master database for practical, technical or data-quality reasons. We will then publish the complete curated, deduplicated and analysis-ready master database as a derived dataset in Zenodo, and create a GBIF metadata-only dataset pointing to it, so the derived product remains discoverable through GBIF without duplicating occurrence records in the GBIF occurrence index.

This seems to us a more technically correct and transparent pathway. However, one unresolved issue remains: this strategy does not directly solve the problem of correcting erroneous or incomplete records that are already published in GBIF by their original sources. As you suggested, the ideal route would be to contact each original publisher and share the corrections with them, although in practice this may become a substantial additional curation effort in itself.

Thanks again for your guidance. It has been very useful in helping us define a more responsible publication strategy.

I look forward to hearing opinions from others.

@Cristian I want to suggest an alternative approach, because I think the response from @estebanmhGBIF, while the standard response from GBIF, is, in my opinion, the wrong response.

Given the offer of a high quality, curated dataset that potentially adds value to GBIF’s users (i.e., the scientific community using GBIF data), the response cannot be “well, we don’t want that because it duplicates existing records, but please send your corrections to our data providers”.

In my opinion what should happen is:

  1. GBIF accepts the high quality dataset (it doesn’t try to shunt it off to the oblivion of “metadata-only”)
  2. It uses its clustering algorithms to identify possible duplicate records and clusters them
  3. When it detects differences in values between new and existing records (e.g., a different taxonomic identification) GBIF notifies the original publishers of the differences. This should be GBIF’s job, it has all the required information to automate this.
  4. The original publisher (e.g., a museum), if willing and able, accepts/rejects the edits, and updates their data accordingly

The contributor of the high value data gets it published, users get better data, the publisher gets automated feedback on their data, so everybody wins. The mantra of scientists curating data having to send corrections to the original provider simply doesn’t scale. GBIF is already full of massively duplicated data: once GBIF started aggregating from sequence databases such as BOLD and EMBL, and Plazi started publishing literature-based records of existing specimens, that became inevitable. So the trick is to accept this and make use of it to enhance the quality of the data in GBIF.

Thank you for sharing this contrasting view and philosophy, @rdmpage. Very interesting points indeed, and in line with our original intuition.

Contacting original providers is an auditing job on its own that is probably out of the scope of most projects, and a burden that shouldn’t be inflicted on contributors. Besides, GBIF’s algorithms, which will likely continue to evolve and grow more sophisticated, will be better equipped to offer the type of services you outlined. Future developments could also include some sort of version control mechanism at the occurrence level.

I’m surprised that the GBIF community didn’t jump back at us with a resonant unifying answer. I would have guessed this was a classic FAQ, but it seems it isn’t, is it? I’d love to hear more opinions.

Thanks for your comments, @rdmpage, you are making some very good points.

In general, we try to discourage re-upload of filtered/aggregated data from GBIF. A case like this, where significant improvements have been made to the records themselves, might require a different approach and policy in future.

Concerning the follow-up services that would be required to make the duplication transparent to data users and informative to publishers, unfortunately, we are just not there yet. At this point in time, none of the required automation exists that would allow us to

  • cluster or consolidate all versions of the same record during data ingestion (the above-mentioned clustering mechanism still being in a trial phase, with a primary goal of connecting related records relating to the same organism),
  • inform data users of existing duplicates and the preferred/recommended version in their data downloads, or
  • inform data publishers of annotations to or edited versions of their contribution, and provide a suggestion for revision.

All these are desirable goals that we need to consider and discuss, but their implementation is not directly around the corner, requiring more exploration of options and planning.

Meanwhile, the core of any recommendation for derived data content is, as @estebanmhGBIF outlined above and I hope we can agree on, to maintain the original identifiers of both source and records as faithfully as possible, so that the future option to connect the derived/enriched version of records to their original source remains open.

@Cristian, I faced a similar situation not long ago. GBIF shared a 2019 dataset of mine from ALA: https://www.gbif.org/dataset/834442e7-c3ba-4ff9-bd1f-3288989f7725. Errors in that dataset were fixed and new records added in a 2025 dataset shared with GBIF that did not pass through ALA: https://www.gbif.org/dataset/9b37920c-8d47-41a0-8f90-064b792b6a15. To avoid users having to deal with pseudo-duplicated records, GBIF deleted the 2019 dataset (but left its webpage online) with a note saying it had been replaced by the 2025 dataset.

That’s a much simpler case than yours, but it did “directly solve the problem of correcting erroneous or incomplete records that are already published in GBIF by their original sources”.

What’s lacking both in my case and in the general one is documentation of corrections. If original records are edited, the new record should contain the edited data item; the original, unedited data item; who did the edit, and when; and some explanation of the edit. If there are further edits, these need separate documentation.
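As a rough illustration of what such documentation could look like, the sketch below compares an original and an edited record and emits one log entry per changed term, carrying the old value, the new value, the editor, the timestamp and the reason. The `FieldEdit` structure and `log_edits` function are hypothetical names used only for this example, not an existing tool or standard.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class FieldEdit:
    """One documented change to a single term of a single occurrence."""
    occurrence_id: str
    term: str
    original_value: str
    edited_value: str
    edited_by: str
    edited_at: str  # ISO 8601 timestamp
    reason: str


def log_edits(original: dict, edited: dict, edited_by: str, reason: str,
              edited_at: Optional[str] = None) -> list:
    """Return one FieldEdit per term whose value differs between the two records."""
    stamp = edited_at or datetime.now(timezone.utc).isoformat()
    edits = []
    for term in sorted(set(original) | set(edited)):
        before = original.get(term, "")
        after = edited.get(term, "")
        if before != after:
            edits.append(FieldEdit(original["occurrenceID"], term,
                                   before, after, edited_by, stamp, reason))
    return edits
```

Each further round of edits would call `log_edits` again against the previous version, so every change keeps its own entry.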

I’m borrowing here (incompletely) from everyday practice in large corporate and public databases, but the lack of attention to this matter in biodiversity informatics has long bothered me. GBIF “interprets” data items in contributed datasets, and flags edited data items as “altered”. That’s good, but it leaves out the “who” (which algorithm, and what triggered its action), the “when” and the “why”.

@rdmpage @datafixer I’ve been thinking about feedback mechanisms like the ones you’ve proposed, and I’d like to suggest a variation: instead of asking GBIF (or other networks) to build feedback tools for their specific systems, we could create a standard for sending feedback to data providers.

The idea is essentially a “pull request” for Darwin Core. Here’s what I’m imagining:

  1. A structured data file specifying: (1) record identifier(s), (2) field(s) to be modified, (3) old value(s), and (4) new value(s).

  2. A structured metadata file recording: (1) attribution for the person submitting the PR, and (2) the version of the DwC-A resource being modified.

  3. A standalone package combining the two. It could be submitted to the curator of a DwC-A, who can choose to incorporate it and republish the updated dataset, or it can live as a separate publication that downstream users can selectively apply when reusing the data.
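To sketch how the data file in item 1 might work in practice, the example below reads a small CSV patch and applies each row only when the record still holds the stated old value, so stale patches are rejected rather than silently overwriting newer data. The column names (`occurrenceID`, `term`, `oldValue`, `newValue`) and both function names are assumptions made for illustration; no such standard exists yet.

```python
import csv
import io

# Hypothetical column names; no existing standard defines them.
PATCH_COLUMNS = ("occurrenceID", "term", "oldValue", "newValue")


def read_patch(text: str) -> list:
    """Parse a CSV patch into a list of row dicts, checking the header."""
    reader = csv.DictReader(io.StringIO(text))
    missing = set(PATCH_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"patch is missing columns: {sorted(missing)}")
    return list(reader)


def apply_patch(records: dict, patch_rows: list) -> tuple:
    """Apply patch rows in place to records keyed by occurrenceID.

    A row is applied only if the record exists and its current value
    matches oldValue; otherwise it is returned in the rejected list.
    """
    applied, rejected = [], []
    for row in patch_rows:
        record = records.get(row["occurrenceID"])
        if record is None or record.get(row["term"], "") != row["oldValue"]:
            rejected.append(row)
            continue
        record[row["term"]] = row["newValue"]
        applied.append(row)
    return applied, rejected
```

The rejected rows are what a curator would review by hand, much like a merge conflict in a code pull request.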

The benefit: QA work done by data re-users gets shared in a standard, reusable format, even when the original providers lack the capacity to incorporate it. Ideally, though, providers do merge the changes back into the source data, with full provenance attached.

One challenge is that many data providers don’t work directly in DwC-A. But I suspect that once people pick themselves up off the floor after encountering this brilliant idea, they’ll be inspired to build the tooling needed to bridge that gap and render the concern moot :grin: .

@sformel, sounds good to me. The (1) file is what I already build for DwC edits (Darwin Core checker: Special topics), but my script would need to be tweaked to allow for various structural changes in the original data. (2) is also a good idea and could have lots of additional info if desired (metadata files can be expanded infinitely…).

It’s (1) and (2) that should be publicly available to end users, as you say. When you write “curator of a DwC-A, who can choose to incorporate it and republish the updated dataset”, and “Ideally, though, providers do merge the changes back into the source data”, I think you’re dreaming. This happens so rarely and with such little enthusiasm that I can only shake my head when I see a GBIF portal GitHub issue with an error to be corrected, and the GBIF response “Publisher informed by email”.

Readers of this thread might be interested in our experimental rule-based annotations approach. In this approach users wouldn’t re-upload a curated dataset, but would rather save the rules used to generate the finished product. It might not currently handle every edge case, but it is something that already exists.

Thanks @jwaller! This came up at BioMonWeek and I definitely see its utility as a “bird in the hand.”

Thanks, @jwaller, great tool, very useful indeed! Thank you for sharing.
Although it doesn’t sufficiently address the subject of this discussion, it is definitely a useful tool moving in the right direction by letting the community flag potential errors, alert the publishers holding editor rights on datasets, and enabling end users to harness expert wisdom, regardless of publishers’ responsiveness to flagged cases.

Thanks @datafixer for sharing that script. Quite useful to modify tables at will and then generate a formal log of changes. Pieces that could fit in @sformel’s scheme outlined above.