I agree with the premise of @dshorthouse’s suggestion: the optimal solution would be to have individual records cited rather than datasets. I think this makes conceptual sense from everyone’s perspective. If someone cites 10 records from one bird dataset, 10 from Kew, 10 from Harvard, and so on, the main focus from a scientific perspective is the individual birds, not the datasets from which they originated. Having a dataset receive a citation for 1 or 2 birds, while technically true, doesn’t really capture the essence of what was done.
So, that would leave a citation that is a giant list of occurrences, each referenced individually by some identifier. Since GBIF is the one calculating and distributing the citations, it would make sense to use the gbifID, since it is guaranteed to be common across all contributing publishers. Great in theory, but how would that work for someone who wants to cite 10+ million records? For datasetKeys, the maximum length of the list is bounded by the number of datasets in GBIF, which currently stands at around 109,000. This DOI highlights how, even for 2.5+ billion occurrence records, the list of datasetKeys stays manageable. A list of individual occurrences would be orders of magnitude larger, a problem not only from a storage perspective but from a computational one.
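A rough back-of-envelope sketch of the difference, assuming gbifIDs are stored as 64-bit integers and datasetKeys as 36-character UUID strings (both assumptions, for illustration only):

```python
# Back-of-envelope comparison of citation-list sizes.
# Assumptions: gbifIDs as 64-bit integers (8 bytes each),
# datasetKeys as 36-character UUID strings.

n_occurrences = 2_500_000_000   # ~2.5 billion records in a worst-case download
n_datasets = 109_000            # roughly all datasets currently in GBIF

occurrence_list_bytes = n_occurrences * 8
dataset_list_bytes = n_datasets * 36

print(f"gbifID list:     {occurrence_list_bytes / 1e9:.1f} GB")  # ~20.0 GB
print(f"datasetKey list: {dataset_list_bytes / 1e6:.1f} MB")     # ~3.9 MB
```

Gigabytes versus megabytes, before even counting the cost of computing and serving those lists for every download.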
As David suggests, perhaps some of the computation could be shifted to the client side, or even distributed among those most interested in getting granular views of citations. For the massive DOI referenced earlier, the ideal function from an institution’s perspective (in my view) would be the ability to download only the occurrences of a particular datasetKey (or keys) from within a particular download. Through the API we can already do the following:
```
# List the datasets present in a download
/occurrence/download/{prefix}/{suffix}/datasets

# Retrieve a previously created download
/occurrence/download/request/{key}
```
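As a minimal sketch of the first action, here is what listing the contributing datasets might look like against the key-based form of the endpoint; the download key is a made-up placeholder, and the response field names follow GBIF’s usual paged-response shape but should be verified against the API docs:

```python
# Sketch: list the datasets that contributed to an existing download.
import requests

DOWNLOAD_KEY = "0012345-230810091245214"  # hypothetical download key
url = f"https://api.gbif.org/v1/occurrence/download/{DOWNLOAD_KEY}/datasets"

resp = requests.get(url, params={"limit": 100})
resp.raise_for_status()
page = resp.json()

for record in page.get("results", []):
    # Each entry ties a contributing dataset to the count of its
    # records that ended up in this download.
    print(record.get("datasetKey"), record.get("numberRecords"))
```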
The main pain point (as @sformel mentions) is executing the download for millions of occurrences just to subset them to the datasetKeys of interest; it is wasteful to send all that data over the network only to discard 90% of it. However, what if, when using `/occurrence/download/request/{key}`, we could append a `?datasets=` parameter to download only a subset of that citation, i.e. the occurrences from that list of datasets? To lighten the load even further, an `?idOnly=` parameter could return just the identifiers. We are only looking to tally and link, so the rest of the data is unnecessary (@jhpoelen may disagree, since the information in the record could have drifted). From there it is up to the investigator/institution to process and link the citation to individual records in whichever system they use. I believe Katie Pearson gave a demo at SPNHC-TDWG 2024 on how Symbiota can link occurrences to citations, and Specify has similar functionality.
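To make the proposal concrete, a client call might look like the sketch below. To be clear, `datasets` and `idOnly` are proposed parameters, not current API features, and the download key and datasetKey are placeholders:

```python
# Hypothetical sketch of the proposed parameters -- neither `datasets`
# nor `idOnly` exists on this endpoint today.
import requests

DOWNLOAD_KEY = "0012345-230810091245214"            # placeholder download key
MY_DATASETS = ["00000000-0000-0000-0000-000000000000"]  # placeholder datasetKey

url = f"https://api.gbif.org/v1/occurrence/download/request/{DOWNLOAD_KEY}"
resp = requests.get(url, params={
    "datasets": ",".join(MY_DATASETS),  # proposed: subset to these datasets
    "idOnly": "true",                   # proposed: return only gbifIDs
})

# Under this proposal, the response might be a plain newline-separated
# list of gbifIDs, cheap to parse and tally:
ids = resp.text.splitlines()
print(f"{len(ids)} cited occurrences from our datasets")
```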
In summary:
GBIF minimizes compute by calculating the subset of records only on request, never storing long lists of identifiers or transmitting them between systems. The minimum amount of information is transmitted to the data analyst, saving them from downloading the full length and width of a derived dataset, and the system stays flexible to however the end user wishes to link citations to individual records.
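And to make that last-mile linking step concrete, a toy sketch of how an institution might match a citation’s ID list against its own records (all file names and formats here are hypothetical):

```python
# Toy sketch of client-side linking: tally which of our institution's
# records appear in a citation's ID list.

# gbifIDs returned by the (proposed) idOnly subset request
with open("cited_gbif_ids.txt") as f:
    cited_ids = {line.strip() for line in f if line.strip()}

# A local export: one "catalog_number<TAB>gbifID" pair per line
links = []
with open("our_records.tsv") as f:
    for line in f:
        catalog_number, gbif_id = line.rstrip("\n").split("\t")
        if gbif_id in cited_ids:
            links.append((catalog_number, gbif_id))

print(f"{len(links)} of our records are cited in this download")
```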