Calculating collection date --> GBIF upload date lag times

Hi everyone,
I am trying to estimate, for specimen-based data, the lag time between the collection of a specimen (eventDate) and the date it was first uploaded to GBIF.

The goal is to generate upload time curves based on collection-to-upload lag times, so that I can generate estimates of how many records will end up on GBIF for recent years once all recently collected records have been databased, processed & uploaded to GBIF.

Is there something available like an upload date / accession date / date of first appearance in GBIF, at the record level (by gbifID), which I could use for this analysis?

I have been attempting this using some exports from the GBIF snapshot tables, but without much luck so far, either at the dataset level or record level. Challenges include fluctuating counts by dataset (records being both added and removed) and unstable gbifID/occurrenceID fields, which make inference quite challenging.

Ideally we would want to work with records added to a dataset after its inception/initial upload, since we are interested in ongoing upload trends representative of incremental updates, rather than inception of datasets/big initial uploads. This is something I could plausibly do post-hoc by filtering out the known first upload dates of each dataset.

Not sure if this is feasible at all, but I would appreciate any advice or pointers in the directions of other threads which have thought about similar issues re: upload dates to GBIF.

Many thanks in advance for your advice and support!

The main reason a dataset’s occurrence count decreases is that the publisher deleted records (either by accident or on purpose).

Sometimes there are simple mistakes, like mis-formatting the Darwin Core Archive, or there can be a bug in the ingestion process.

Another source might be dataset migrations where part of the dataset is moved to a new dataset.

Thanks John - that is helpful. Thanks for your email too - I am happy for this thread to be closed (unsure how to close it myself, apologies)

@of2 We can leave the issue open in case others in the community may want to comment.

The main issue with working with GBIF historical snapshot data is that gbifids are not very stable over time.

Some GBIF publishers change occurrenceIds and so-called triplet codes of

  1. Institution code
  2. Collection code
  3. Catalog number

meaning that gbifids can change over time.

To calculate lag time, the general approach, given this instability, would be to work with aggregated data and make some assumptions about publication date.

The way that I would work is by looking for the “first appearance” of a datasetKey, and treating all of those occurrences as published on that “snapshot date”. Later appearances of a datasetKey could contain new occurrences or the same occurrences with new gbifids.

You could perhaps subtract the occurrence count of the previous appearance from the current one and treat the difference as new occurrences, if you want just a global estimate. This might not be 100% accurate for the reasons stated above, mainly deletion and migration events.
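
A rough sketch of that differencing idea in Python, in case it helps. The input layout here is an assumption for illustration, not the actual schema of the snapshot tables: it expects counts already aggregated per snapshot date, datasetKey and collection year. The function name and the "ds-a" key are made up. Drops in count are clamped to zero, so deletions and migrations are ignored rather than counted as negative uploads, and the lag is only resolved to whole years.

```python
from collections import defaultdict
from datetime import date


def estimate_new_records(snapshot_counts):
    """Estimate newly published records per snapshot by differencing counts.

    snapshot_counts: iterable of (snapshot_date, dataset_key, event_year, count).
    Returns a list of tuples:
    (snapshot_date, dataset_key, event_year, new_count, lag_years, first_appearance)
    """
    # Regroup as {dataset_key: {snapshot_date: {event_year: count}}}
    by_dataset = defaultdict(lambda: defaultdict(dict))
    for snap, key, year, count in snapshot_counts:
        by_dataset[key][snap][year] = count

    rows = []
    for key, snaps in by_dataset.items():
        previous = {}
        first = True  # True only for the dataset's first appearance
        for snap in sorted(snaps):
            current = snaps[snap]
            for year, count in sorted(current.items()):
                # A drop in count is treated as deletion/migration, not as
                # negative publication, so the difference is clamped at zero.
                new = max(count - previous.get(year, 0), 0)
                if new:
                    # Coarse lag: snapshot year minus collection year.
                    rows.append((snap, key, year, new, snap.year - year, first))
            previous = current
            first = False
    return rows


# Hypothetical counts for one made-up dataset across two snapshots.
example_counts = [
    (date(2020, 1, 1), "ds-a", 2015, 100),
    (date(2020, 1, 1), "ds-a", 2018, 40),
    (date(2021, 1, 1), "ds-a", 2015, 90),  # count dropped: ignored
    (date(2021, 1, 1), "ds-a", 2018, 70),  # 30 more than before
    (date(2021, 1, 1), "ds-a", 2020, 10),  # not seen before
]
rows = estimate_new_records(example_counts)
```

Filtering on the last element of each row (`first_appearance`) would let you set aside the big initial uploads and keep only the incremental additions.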

In summary, only an estimate of the lag time is possible, since the publication date on GBIF is not known and certain assumptions have to be made in order to account for non-stable gbifids over time.

Going forward, gbifids will likely be much more stable as GBIF has implemented a new occurrenceID checking system upon dataset ingestion.