I was reading through the most recent instalment of the Canadian Museum of Nature’s annual Science Review (albeit from 2022), modelled after GBIF’s own excellent Annual Science Review, and I was struck by a paragraph in its Methods on p. 7:
That’s a problem. But whose is it to solve?
On the one hand, museums should receive signals of the impact that their digitization & data publication activities have engendered through the GBIF network. And perhaps that requires some modest technical skill to use an API. But on the other hand, a desirable signal ought not to be swamped by a firehose; it should have a proper sieve.
What sorts of things should GBIF do with its literature tracking services such that data publishers receive the signals that work for them? It’d be a shame if other organizations like the Canadian Museum of Nature elected to use manual methods via Google Scholar instead of making the best use of GBIF’s automated services, which do in fact require considerable effort to maintain at GBIF’s end.
Thanks for this comment, David. I see a few challenges that might need addressing in different ways:
- users download and cite much more data than they actually use
- users keep using (parts of) the same old download for multiple studies
- no easy way of quantifying an individual dataset’s or publisher’s contribution to a citation
The latter is probably something we can try to address to give publishers a more nuanced idea of the “impact” of sharing their data through GBIF. If you have suggestions or ideas, please let us know.
@dnoesgaard I had an opportunity to think about this recently when I was asked to demonstrate the linkage of 33 citations to four datasets published by the MARINe/PISCO projects.
Here are a few suggestions to make it easier to parse how integral the cited data were to the paper:
- Your first bullet: Flagging citations that are for DOIs/queries in which no filters were included, e.g. https://doi.org/10.15468/dl.3c2mmc. It’d be nice if they could easily be set to the side while reviewing citations (a rough sketch of how this might be checked via the API is below, after this list).
- Your third bullet: I use the proportion of records contributed to a download as a simple score to help me understand the relative contributions of a dataset. I don’t think there is an easy way to see this on gbif.org, and I couldn’t find a way to do this simply in the API. It’s not that hard to calculate, but a function in rgbif might make it easier for people who aren’t using the API as often as I do. It would look something like this:
library(dplyr)
# Share of a download's records contributed by each dataset, via the datasets export
readr::read_tsv("https://api.gbif.org/v1/occurrence/download/0001572-220831081235567/datasets/export?format=TSV") %>%
  group_by(dataset_key) %>%
  summarise(number_records = sum(number_records)) %>%
  mutate(contribution_score = number_records / sum(number_records)) %>%
  arrange(desc(contribution_score))
- Your third bullet: A method for identifying which occurrences were included in the download without executing the download (my apologies if there is already a way to do this, I just couldn’t figure it out). Knowing the gbifid or occurrenceID would be enough for those who want to go the extra mile. This would help them understand (1) relevance (e.g. was a lizard from their data cited in a paper on mammals), and (2) which occurrences are most frequently cited (e.g. rare specimens that seem to be extra valuable to science).
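On that first point, a rough check seems possible already: if I read the occurrence download API correctly, the metadata for a download key keeps the original request, so a missing filter predicate would suggest an unfiltered “everything” download. A minimal sketch, reusing the download key from the example above; the field names reflect my reading of the API response, so treat them as assumptions:
library(jsonlite)
# Fetch the metadata for one download and flag it when the stored request
# appears to carry no filter predicate (i.e. an unfiltered "everything" download).
meta <- fromJSON("https://api.gbif.org/v1/occurrence/download/0001572-220831081235567")
unfiltered <- is.null(meta$request$predicate)
data.frame(key = meta$key, records = meta$totalRecords, unfiltered = unfiltered)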
@dshorthouse @dnoesgaard @sformel Great to read your discussion on data citation.
I am still confused about the language used around GBIF “download DOIs.”
As far as I understand, these so-called GBIF data download DOIs convey a loosely defined relation between a digital identifier (the DOI) and a user query, via the DOI’s human-readable HTML landing page.
This user query is, in turn, loosely coupled with some temporary copy of underlying GBIF-mediated data that happened to fit the query criteria for some implicit version of the GBIF corpus, as seen from the perspective of an implicitly defined version of the GBIF backbone taxonomy.
I’d like to think of it as an expression of interest by the user. And I wouldn’t be surprised if there’s quite some structure to these expressions of interest, ranging from I-don’t-really-know-what-I-am-looking-for-so-I-am-going-to-ask-for-everything-and-do-some-post-processing-myself to I-am-interested-in-these-specific-type-specimens-from-this-particular-collection.
Wouldn’t it make sense to start referring to “Expression of User Interest” DOIs or Query DOIs, instead of GBIF Data Download DOIs? In my mind, this change in language would invite a more descriptive view of the weight that these DOIs carry, and perhaps keep large institutions from getting confused about what these “data” citation (i.e., not-data-but-query citation) metrics actually mean...
Curious to hear your thoughts on this...
PS Note that machine-readable signed data citations offer a framework/method to trace actual data use at scale. For a description of signed citations, please see (disclaimer: I am a co-author):
Elliott, M.J., Poelen, J.H. & Fortes, J.A.B. (2023) Signing data citations enables data verification and citation persistence. Sci Data. hash://sha256/f849c870565f608899f183ca261365dce9c9f1c5441b1c779e0db49df9c2a19d
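The core idea, roughly: the citation carries a cryptographic hash of the exact bytes that were analysed, so the reference can be verified and resolved independently of any one location. A minimal sketch of computing such a content identifier in R; the file name is just a placeholder for the analyst’s local copy of the data actually used, and this illustrates the idea rather than the tooling from the paper:
library(digest)
# Compute a sha256 content hash of the exact file that was analysed and express it
# as a hash URI, so it can be cited (and later verified) alongside a download DOI.
# "occurrence.txt" is a placeholder for the local copy of the data actually used.
content_id <- paste0("hash://sha256/",
                     digest(object = "occurrence.txt", algo = "sha256", file = TRUE))
content_id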
@jhpoelen very interesting paper! I share your opinion that, at present, a GBIF download DOI often represents what the user was originally interested in rather than what was actually used in the analysis. I am not sure (nor do I think anyone is) exactly what percentage fall more on “expression of interest” compared to “this is a representation of the records that were used in our analysis”, but answering this question would be a great problem to solve.
In the past year, I have noticed two instances in which researchers downloaded the entirety of GBIF (or at least that is what they cite) despite the manuscript not conducting an analysis anywhere close to that scale. I am not sure that hashing would solve this: if their interpretation was that citing all of GBIF was appropriate, then they may just use the hash of all of GBIF at that time. It seems to be a communication problem as much as a technical one.
Perhaps one piece of low-hanging fruit would be a flag for review of citations that use more than a threshold of records. If a paper is citing more than 50 million records, it is incredibly unlikely (though possible) that its analysis was done on all 50 million. Naturally, this raises the question of which entity should be in charge of reviewing the citations. Is it the journal’s responsibility? The peer reviewers’? Or those that receive credit for the citation (either as an institution or as an individual)? Once the citation is made and the manuscript has been published, I am not sure what can be done. What stings is that these erroneous citations then end up in statistics and tallies and undermine the authority of other, valid citations.
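To make the idea concrete, here is a toy sketch of such a flag, assuming the download keys linked to a paper are known and that the download metadata exposes a total record count (the key and threshold are purely examples):
library(jsonlite)
# For each download cited by a paper, look up its record count and mark it for
# review when it exceeds a threshold (50 million here, just as an example).
cited_downloads <- c("0001572-220831081235567")   # placeholder download key(s)
records <- sapply(cited_downloads, function(k)
  fromJSON(paste0("https://api.gbif.org/v1/occurrence/download/", k))$totalRecords)
data.frame(download = cited_downloads, records = records, needs_review = records > 50e6)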
To help researchers more accurately cite the data they use, I think it would be helpful to map out how a research unit actually conducts its analysis. Critically, if we expect all filtering to be done before the download is made (to generate an accurate DOI, or even to get the hashes to match if that were implemented), then are all operations that an analyst could do supported? Say a researcher wanted all occurrences whose catalogNumber matches a particular regular expression: is that possible through the download API (not to my knowledge)? For research teams that aren’t as technical, what can they do from the website? Certainly the range of operations that can be performed in Excel, R, Python, etc. exceeds the filtering capabilities available at the point of download.
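For example, a step like the following happens after the download, so it is invisible to the DOI (the pattern and file name are made up, just to illustrate the gap):
library(dplyr)
# Filter a local copy of a download to records whose catalogNumber matches a
# regular expression -- a selection that, as far as I know, cannot be expressed
# as a GBIF download filter. Pattern and file name are hypothetical.
occ <- readr::read_tsv("occurrence.txt")
used <- occ %>% filter(grepl("^UF-[0-9]{5}$", catalogNumber))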