When does evidence of impact become too onerous to track?

I was reading through the most recent instalment of the Canadian Museum of Nature’s annual Science Review (albeit from 2022), modelled after GBIF’s own excellent Annual Science Review, and I was struck by a paragraph in its Methods on p. 7:

That’s a problem. But whose is it to solve?

On the one hand, museums should receive signals of the impact that their digitization & data publication activities have engendered through the GBIF network, and perhaps that might require some modest technical skills to use an API. On the other hand, a desirable signal ought not be swamped by a firehose; it should have a proper sieve.

What sorts of things should GBIF do with its literature tracking services such that data publishers receive the signals that work for them? It’d be a shame if organizations like the Canadian Museum of Nature elected to use manual methods enabled by Google Scholar instead of making best use of GBIF’s automated services, which do in fact require considerable effort to maintain at GBIF’s end.


Thanks for this comment, David. I see a couple of challenges that might need addressing in different ways:

  • users download and cite much more data than they actually use
  • users keep using (parts of) the same old download for multiple studies
  • no easy way of quantifying an individual dataset’s or publisher’s contribution to a citation

The latter is probably something we can try to address to give publishers a more nuanced idea of the “impact” that sharing their data through GBIF has. If you have suggestions or ideas, please let us know :slight_smile:

@dnoesgaard I had an opportunity to think about this recently when I was asked to demonstrate the linkage of 33 citations to four datasets published by the MARINe/PISCO projects.

Here are a few suggestions to make it easier to parse how integral the cited data were to the paper:

  1. Your first bullet: flagging citations of DOIs/queries in which no filters were included, e.g. https://doi.org/10.15468/dl.3c2mmc. It’d be nice if these could easily be set to the side while reviewing citations (see the sketch after this list).

  2. Your third bullet: I use the proportion of records contributed to a download as a simple score to help me understand the relative contribution of a dataset. I don’t think there is an easy way to see this on gbif.org, and I couldn’t find a way to do it simply in the API. It’s not hard to calculate, but a function in rgbif might make it easier for people who don’t use the API as often as I do. It would look something like this:

library(dplyr)

# Fetch the per-dataset record counts for a download, then score each
# contributing dataset by its share of the download's records.
readr::read_tsv("https://api.gbif.org/v1/occurrence/download/0001572-220831081235567/datasets/export?format=TSV") %>% 
  mutate(total_records = sum(number_records)) %>% 
  group_by(dataset_key) %>% 
  summarise(contribution_score = number_records / total_records) %>% 
  arrange(desc(contribution_score))
  3. Your third bullet: a method for identifying which occurrences were included in the download without executing the download (my apologies if there is already a way to do this; I just couldn’t figure it out). Knowing the gbifID or occurrenceID would be enough for those who want to go the extra mile. This would help them understand (1) relevance (e.g., was a lizard from their data cited in a paper on mammals?) and (2) which occurrences are most frequently cited (e.g., rare specimens that seem to be extra valuable to science).
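Returning to the first suggestion: a minimal sketch of how such a flag could work, assuming (as I believe is the case, though worth verifying) that unfiltered downloads are stored with a null predicate in the public download metadata endpoint:

library(httr)
library(jsonlite)

# Sketch for suggestion 1: a download whose stored request carries no
# predicate was an unfiltered, everything-matching query.
is_unfiltered <- function(download_key) {
  meta <- fromJSON(content(GET(
    paste0("https://api.gbif.org/v1/occurrence/download/", download_key)
  ), as = "text", encoding = "UTF-8"))
  is.null(meta$request$predicate)
}

is_unfiltered("0001572-220831081235567")  # download key reused from the code above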

@dshorthouse @dnoesgaard @sformel Great to read your discussion on data citation.

I am still confused about the language used around GBIF “download DOIs.”

As far as I understand, these so-called GBIF data download DOIs convey a loosely defined relation between a digital identifier (the DOI) and a user query, through a human-readable HTML landing page.

Then, this user query is loosely coupled with some temporary copy of underlying GBIF-mediated data that happened to fit its criteria, for some implicit version of the GBIF corpus as seen from the perspective of an implicitly defined version of the GBIF backbone taxonomy.

I’d like to think of it as an expression of interest by the user. And, I wouldn’t be surprised if there’s quite some structure to these expressions of interest, ranging from I-dont-really-know-what-I-am-looking-for-so-I-am-going-to-ask-for-everything-and-do-some-post-processing-myself to I-am-interested-in-these-specific-types-specimen-from-this-particular-collection.

Wouldn’t it make sense to start referring to “Expression of User Interest” DOIs, or Query DOIs, instead of GBIF Data Download DOIs? In my mind, this change in language would invite a more descriptive view of the weight that these DOIs carry, and perhaps keep large institutions from getting confused about what these “data” citation (i.e., not-data-but-query citation) metrics actually mean...

Curious to hear your thoughts on this...

PS Note that machine-readable signed data citations offer a framework/method to trace actual data use at scale. For a description of signed citations, please see (disclaimer: I am a co-author):

Elliott M.J., Poelen J.H. & Fortes J.A.B. (2023) Signing data citations enables data verification and citation persistence. Sci Data. hash://sha256/f849c870565f608899f183ca261365dce9c9f1c5441b1c779e0db49df9c2a19d

@jhpoelen very interesting paper! I share your opinion that, at present, a GBIF download DOI is in some cases more a representation of what the user was originally interested in than of what was actually used in the analysis. I am not sure (nor do I think anyone is) exactly what percentage falls more on the “expression of interest” side compared to “this is a representation of the records that were used in our analysis”, but answering this question would be a great problem to solve.

In the past year, there have been two instances I have noticed in which researchers downloaded the entirety of GBIF (or at least that is what they cite) despite the manuscript not conducting an analysis anywhere close to that scale. I am not sure that hashing would solve this: if their interpretation was that citing all of GBIF was appropriate, then they might just use the hash of all of GBIF at that time. It seems to be a communication problem as much as it is a technical problem.

Perhaps one piece of low-hanging fruit would be a flag for review of citations that use more than a threshold of records. If a paper is citing more than 50 million records, it is incredibly unlikely (though possible) that its analysis was done on all 50 million records. Naturally, from this comes the question of what entity should be in charge of reviewing the citations. Is it the journal’s responsibility? The peer reviewers’? Or those that receive credit for the citation (either as an institution or as an individual)? Once the citation is made and the manuscript has been published, I am not sure what can be done. What stings is that these erroneous citations then end up in statistics and tallies and undermine the authority of other, valid citations.
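A rough sketch of what such a flag could look like, using the totalRecords field from the public GBIF download metadata endpoint (the field name is my reading of that API, so worth double-checking; the 50-million threshold is just the example above):

library(httr)
library(jsonlite)

# Flag a citation's download for human review when it covers more records
# than any plausible analysis would use.
flag_for_review <- function(download_key, threshold = 50e6) {
  meta <- fromJSON(content(GET(
    paste0("https://api.gbif.org/v1/occurrence/download/", download_key)
  ), as = "text", encoding = "UTF-8"))
  meta$totalRecords > threshold
}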

To help researchers more accurately cite the data they use, I think it would be helpful to map out the process by which a research unit conducts analysis. Critically, if we expect all filtering to be done before the download is made (to generate an accurate DOI, or even to get the hashes to match, if that were implemented), then are all operations that an analyst could do supported? Let’s say that a researcher wanted to get all occurrences with a catalogNumber that fits a particular regular expression pattern: is that possible through the download API (not to my knowledge)? For research teams that aren’t as technical, what functionality is available from the website? Certainly the range of operations that can be performed in Excel, R, Python etc. exceeds the filtering capabilities available at the point of download.

Edit: It is indeed possible to use regular expressions through the download API. The general point, about whether most researchers will go this far, I believe still applies.
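For anyone curious, one way to pattern-match catalogNumber at download time is the like predicate; a minimal sketch (the pattern is made up, credentials come from the environment, and note that, as far as I know, like uses * and ? wildcards rather than full regular expressions):

library(httr)
library(jsonlite)

# Request a download filtered server-side by a catalogNumber pattern.
body <- list(
  creator = Sys.getenv("GBIF_USER"),
  format = "SIMPLE_CSV",
  predicate = list(type = "like", key = "CATALOG_NUMBER", value = "MVZ:Mamm:*")
)

POST("https://api.gbif.org/v1/occurrence/download/request",
     authenticate(Sys.getenv("GBIF_USER"), Sys.getenv("GBIF_PWD")),
     content_type_json(),
     body = toJSON(body, auto_unbox = TRUE))  # returns the new download key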

Thanks for the interesting comments, all!

We’ve certainly been successful in getting people to start citing DOIs when they use GBIF-mediated data, through our #CiteTheDOI campaign and a lot of direct engagement with authors. These days, more publications include DOIs in their citations of GBIF than not. That being said, we might do better at making people understand why we’re asking for this and how the information is being used.

Getting people to use more filters before downloading is key to avoiding the citation of data that isn’t going to be used at all. But I think the challenge here might be trust in the quality of the data. People might need species x in country y, but they end up downloading the parent genus (or even family) for the entire continent or world, just to be sure they got everything. I’m not quite sure how to solve this.

However, we do have the derived dataset as a solution to most of these problems. Sure, it’s an extra step for researchers, but it allows them to download as much or as little data in any manner they prefer, because at any point they can summarize their derived dataset and create a new DOI which reflects this, and only this.
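Registering one can even be scripted; here’s a minimal sketch using the derived_dataset() function in rgbif (check ?derived_dataset for the current signature and expected column layout; the dataset keys, counts, and URL below are placeholders):

library(rgbif)

# Two contributing datasets (placeholder keys) and the number of records
# actually used from each after filtering.
cited <- data.frame(
  datasetKey = c("50c9509d-22c7-4a22-a47d-8c48425ef4a7",
                 "4fa7b334-ce0d-4e88-aaae-2e0c138d049e"),
  count = c(9876, 1234)
)

derived_dataset(
  citation_data = cited,
  title = "Filtered occurrence data used in our analysis",
  description = "Records retained after cleaning the original GBIF download.",
  source_url = "https://example.org/archived-subset",  # where the subset is deposited
  user = Sys.getenv("GBIF_USER"),
  pwd = Sys.getenv("GBIF_PWD")
)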

Finally, I’d like to mention that we’ve started tagging papers according to the scope of their citations, flagging “over”-citations as well as “under”-citations, where the cited data doesn’t fit what is described as used in the paper. It’s still a work in progress, and the tag is not yet indexed or searchable via APIs.

I’d like to explore the idea of relative contributions a bit more to see if there’s anything we can do on our side. Adding metadata to downloads should be simple enough, but extracting and decorating literature to make it usable for publishers is trickier!


How might derived datasets (that’s essentially a copy/paste of unique datasetKeys, right?) also include relative contributions? Could you instead bite the bullet and replace datasetKeys in derived datasets with the more granular gbifIDs for each of the occurrences? That might make for enormous lists (zip them client-side, maybe?), but you’d have all the information you’d need to then derive relative contributions by whichever aggregating term is relevant.

Hey @mark-pitblado -

Thanks for engaging in this discussion about expression of interest, DOI, and actual data use.

In the 2020 Elliott et al. paper [1], we showed that biodiversity data is likely to change or disappear. In the 2023 Elliott et al. paper [2], we offered a cheap way (a “hack” perhaps :wink: ) to augment existing (data) references with identifiers that allow independent verification of the associated data and make it easier to keep dataset copies around in different places, on whatever digital storage media are available.

In my mind, the next logical step, as hinted at in the discussion of [2], is to apply a machine-readable workflow description to carefully describe how datasets are related. And, it so happens that we’ve applied such an approach from the start [1] through readily available standards like RDF/N-Quads [3] and PROV-O/PAV [4].

With this, a machine-readable workflow representation can be made to connect a source dataset (as provided by the publisher) to an “interpreted” dataset (as provided by GBIF or another data indexer/processor) to a dataset used in research. In addition, individual workflow steps (or PROV-O “activities”) can be documented to specify how the data was processed, in addition to specifying what data was used to generate which derived dataset.

So, what I am trying to say is that, yes, we can be a little more deliberate about describing the origins of data so that researchers can be a little more specific about how they ended up using some provided dataset. And, some work is needed to generate these workflow descriptions along with the datasets that are flung into this world. I’d be happy to share examples of how I am benefiting from these data provenance tracking techniques in ongoing projects.

Hope this helps, and curious to hear your take on ways to better document how (biodiversity) data is used in derived works like research papers, or intermediate datasets.

-jorrit

References

[1] Elliott M.J., Poelen J.H. & Fortes J.A.B. (2020) Toward reliable biodiversity dataset references. Ecological Informatics. hash://sha256/136c3c1808bcf463bb04b11622bb2e7b5fba28f5be1fc258c5ea55b3b84f482c

[2] Elliott M.J., Poelen J.H. & Fortes J.A.B. (2023) Signing data citations enables data verification and citation persistence. Sci Data. hash://sha256/f849c870565f608899f183ca261365dce9c9f1c5441b1c779e0db49df9c2a19d

[3] W3C Contributors. 2014. RDF 1.1 N-Quads: A line-based syntax for RDF datasets. W3C Recommendation. Accessed on 2024-10-08 at https://www.w3.org/TR/n-quads/. https://linker.bio/hash://sha256/0e0f395f9956e97cc4477a54c07c69747a4a8ecfc29d6d9ccabb9b662c8efc91

[4] W3C Contributors. 2013. PROV-O: The PROV Ontology. W3C Recommendation. Accessed on 2024-10-08 at https://www.w3.org/TR/prov-o/. https://linker.bio/hash://sha256/6b96671ab84faf12ce3f041aca12c3f93a6df2ed242348810743179a68e69555

I agree with the premise of @dshorthouse’s suggestion: the optimal solution would be to have individual records cited rather than datasets. I think this makes conceptual sense from everyone’s perspective. If someone cites 10 records from one bird dataset, 10 from Kew, 10 from Harvard, and so on, the main focus from a scientific perspective is the individual birds, not the datasets from which they originated. Having a dataset get a citation for 1 or 2 birds, while technically true, isn’t really capturing the essence of what was done.

So, that would leave a citation that is a giant list of occurrences, each referenced individually by some identifier. Since GBIF is the one calculating and distributing the citations, it would make sense to use the gbifID, since it is guaranteed to be common among all contributing publishers. Great in theory, but how would that work for someone who wants to cite 10+ million records? For datasetKeys, the maximum length of the list is the number of datasets in GBIF, which currently stands at around 109,000. This DOI highlights how, even for 2.5+ billion occurrence records, the list of datasetKeys stays manageable. Individual occurrences would be orders of magnitude more, not only from a storage perspective but from a computational one.

As David suggests, perhaps some of the computation could be shifted to the client side, or even distributed among those who are most interested in getting granular views into citations. For the massive DOI referenced earlier, the ideal function from an institution’s perspective (in my view) would be the ability to download only occurrences of a particular datasetKey (or keys) from within a particular download. Already through the API we can do the following:

# List the datasets present in a download
/occurrence/download/{prefix}/{suffix}/datasets
# Retrieve the file for a previously created download
/occurrence/download/request/{key}

The main pain point (as @sformel mentions) is executing the download for millions of occurrences just to subset them for the datasetKeys of interest; it is wasteful to send all that data over the network only to discard 90% of it. However, what if, when using /occurrence/download/request/{key}, we could append a ?datasets= parameter to only download a subset of that citation, those occurrences from that list of datasets? To lighten the load even further, an ?idOnly= parameter could return just the identifiers. We are only looking to tally and link, so the rest of the data is unnecessary (@jhpoelen may disagree, as the information in the record could have drifted). From there it is up to the investigator/institution to process and link the citation to individual records in whichever system they use. I believe Katie Pearson did a demo at SPNHC-TDWG 2024 on how Symbiota can link occurrences to citations; Specify has similar functionality.
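Until something like that exists, the workaround is entirely client-side. A sketch in R that streams an already-fetched occurrence.txt and keeps only rows from the datasetKeys of interest (the column names assume the standard Darwin Core download headers):

library(readr)
library(dplyr)

keep <- c("50c9509d-22c7-4a22-a47d-8c48425ef4a7")  # placeholder datasetKey(s)

# Stream the file in chunks so the full download never sits in memory,
# retaining only the identifiers needed to tally and link.
linked <- read_tsv_chunked(
  "occurrence.txt",
  callback = DataFrameCallback$new(function(chunk, pos) {
    chunk %>% filter(datasetKey %in% keep) %>% select(gbifID, datasetKey)
  }),
  chunk_size = 100000
)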

In summary

GBIF minimizes compute by only calculating the subset of records on request, and never storing long lists of identifiers or transmitting them between systems. The minimum amount of information is transmitted to the data analyst, saving them from having to download the full length and width of a derived dataset, and the system is flexible to however the end user wishes to link citations up to individual records.

After reading through those papers, I have been thinking about hashing, provenance, record keeping, and everything in between on my bus rides to and from work. I don’t want to give the impression that I ignored these ideas in my post above; rather, they are so big that I am still working through them and don’t have a concrete idea to add yet. Recently I have been doing a deep dive into UUIDs, and I think that hashing presents the mirror image of how I have been thinking of record permanence (one identifier that people can use to track an object across systems and time).

With hashes, we can point users to a particular object, at a particular point in time, that also carries a self-check for validity (see the sketch just below). That is really cool. One thought I had was whether the hash could be included in the data for the object itself, and whether this creates a paradox of sorts.
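As a quick illustration of that self-check: fetch content addressed by its sha256 and confirm the bytes reproduce the hash (the example hash is the N-Quads spec copy from reference [3] in the post above):

library(digest)

expected <- "0e0f395f9956e97cc4477a54c07c69747a4a8ecfc29d6d9ccabb9b662c8efc91"
con <- url(paste0("https://linker.bio/hash://sha256/", expected), open = "rb")
bytes <- readBin(con, what = "raw", n = 50e6)  # n just needs to exceed the file size
close(con)
identical(digest(bytes, algo = "sha256", serialize = FALSE), expected)  # TRUE if intact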

Let’s say I have a record composed of three fields that I wish to publish:

scientificName   eventDate    hash
Vulpes vulpes    2024-10-09   sha256:6a6f41ec226cfbc7c48fdd96bfce9a2fca00a9cfd729f8218704f50335a99068

That hash represents Vulpes vulpes, 2024-10-09. The hash cannot be a hash of itself, otherwise the hash would change, which would change the hash, and so on. So if hashes were to be shared by publishers, they would represent a hash of everything but the hash field?
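One common answer is exactly that: compute the hash over a canonical serialization of every field except the hash field itself. A toy sketch (the tab separator and field order are arbitrary choices of mine; any scheme works as long as it is fixed and documented):

library(digest)

record <- list(scientificName = "Vulpes vulpes", eventDate = "2024-10-09")

# Serialize the non-hash fields deterministically, then store the digest in
# the hash field; verifiers recompute it the same way, ignoring that field.
canonical <- paste(record$scientificName, record$eventDate, sep = "\t")
record$hash <- paste0("sha256:", digest(canonical, algo = "sha256", serialize = FALSE))
record$hash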

In terms of the steps taken in an analysis of a dataset, I love it when researchers share code for this reason: it unambiguously shows what was done. However, recognizing that a lot of analysis is done in programs like Excel, it would be nice if the user could just hit “record” and have the program keep a log of all steps taken, even if those steps are done through a GUI. Perhaps there is room for either of those approaches in a workflow like that.

Hey @mark-pitblado -

Thanks for taking a peek at the references I shared.

I’ve created an example, https://github.com/jhpoelen/cite-the-bunnies, a proof of concept for citing bunny records found in a version of GBIF/iDigBio such that you can trace the results back to the original dataset in a machine-readable manner. With this, you can scale the operation of tracing the use of records by deploying specialized bots (or programs) designed to do so.

The first two records (which you should be able to reproduce) are:

Note that each record provides an exact location of the record as found (e.g., https://linker.bio/line:zip:hash://sha256/69c839dc05a1b22d2e1aac1c84dec1cfd7af8425479053c74122e54998a1ddc2!/occurrence.txt!/L1657 ), as well as the version of the corpus (i.e., the “versionAnchor”) in which the record was found. This anchor is a resource that describes the process by which the dataset was found and retrieved from some location.

In this case, the first bunny data record was found on line 1657 of the occurrence.txt file in the resource identified by hash://sha256/69c839dc05a1b22d2e1aac1c84dec1cfd7af8425479053c74122e54998a1ddc2 .

Or, including the header for readability:

curl 'https://linker.bio/line:zip:hash://sha256/69c839dc05a1b22d2e1aac1c84dec1cfd7af8425479053c74122e54998a1ddc2!/occurrence.txt!/L1,L1657' | mlr --itsvlite --oxtab cat

id                             acfe0bdd7341bbd26bb77d45cb619382
type                           Event
modified                       2021-03-19
language                       es
license                        CC-BY
rightsHolder                   Comisión Nacional para el Conocimiento y Uso de la Biodiversidad (CONABIO)
bibliographicCitation          González-Cózatl F. 2012. Computarización de la colección de mamíferos del Centro de Educación Ambiental e Investigación Sierra de Huautla (CEAMISH) de la Universidad Autónoma del Estado de Morelos (UAEM). Colección CMC. Centro de Investigación en Biodiversidad y Conservación. Universidad Autónoma del Estado de Morelos. Bases de datos SNIB-CONABIO, proyecto HC022. México, D. F.
institutionID                  1911
collectionID                   4004
datasetID                      9d9e5f6c31f9e22b4fa07dcc4684c69e
institutionCode                CIByC-UAEM
collectionCode                 CMC
datasetName                    Computarización de la colección de mamíferos del Centro de Educación Ambiental e Investigación Sierra de Huautla (CEAMISH) de la Universidad Autónoma del Estado de Morelos (UAEM)
ownerInstitutionCode           Centro de Investigación en Biodiversidad y Conservación, Universidad Autónoma del Estado de Morelos
basisOfRecord                  PreservedSpecimen
dynamicProperties              TipoVegetacion / Bosque mesófilo de montaña Características de Mamíferos / LongitudTotal: 185 mm Características de Mamíferos / LongitudCola: 35 mm Características de Mamíferos / LongitudPataPosterior: 36 mm Características de Mamíferos / LongitudOreja: 40 mm Características de Mamíferos / Peso: 84 g Tipo de Material / Tejido / Sí / Hígado Tipo de Material / Tejido / Sí / Corazón Tipo de Material / Tejido / Sí / Riñón Tipo de Material / Sangre (Nobutos) / Sí Tipo de Material / Ectoparásitos / No Tipo de Material / Contenido Estomacal / No Características de Mamíferos / Condición Reproductiva / Utero / No Inflamado Otros Datos Ejemplar / Capturista: edith
occurrenceID                   acfe0bdd7341bbd26bb77d45cb619382
catalogNumber                  2011
recordedBy                     RMV
individualCount                1
sex                            Hembra
lifeStage                      
georeferenceVerificationStatus 
preparations                   Taxidermia
associatedReferences           
associatedTaxa                 
occurrenceRemarks              
previousIdentifications        
fieldNumber                    192
eventDate                      2006-04-23
eventTime                      00:00:00
startDayOfYear                 113
endDayOfYear                   
year                           2006
month                          4
day                            23
habitat                        
samplingProtocol               
locationID                     89
country                        MEXICO
countryCode                    MX
stateProvince                  PUEBLA
county                         ZACATLAN
locality                       Racho 22 de Marzo Km 75.8 carretera Ahuazotepec-Zacatlán.
minimumElevationInMeters       2270.0
maximumElevationInMeters       0.0
verbatimElevation              
minimumDepthInMeters           0.0
maximumDepthInMeters           0.0
verbatimDepth                  
decimalLatitude                19.9294881
decimalLongitude               -97.9890139
geodeticDatum                  NAD27
coordinateUncertaintyInMeters  
verbatimLatitude               19 55 45.12 N
verbatimLongitude              97 59 20.52 W
verbatimCoordinateSystem       grados minutos segundos
georeferencedBy                
georeferencedDate              
georeferenceProtocol           
georeferenceSources            Geoposicionador, GARMIN GPS 12XL, 10
georeferenceRemarks            
identificationID               1733
identificationQualifier        
typeStatus                     NO APLICA
identifiedBy                   RMV
taxonID                        d84a1fd0577d2fc5ea68f9cf95d3d613
scientificNameID               1734
acceptedNameUsageID            
originalNameUsageID            
scientificName                 Sylvilagus floridanus
acceptedNameUsage              Sylvilagus floridanus
parentNameUsage                Sylvilagus
originalNameUsage              
nameAccordingTo                Ramírez-Pulido, González-Ruiz, Gardner & Arroyo-Cabrales, 2014
namePublishedIn                
namePublishedInYear            0
higherClassification           Animalia; Chordata; Mammalia; Lagomorpha; Leporidae; Sylvilagus
kingdom                        Animalia
phylum                         Chordata
class                          Mammalia
order                          Lagomorpha
family                         Leporidae
genus                          Sylvilagus
subgenus                       
specificEpithet                floridanus
infraspecificEpithet           
taxonRank                      especie
scientificNameAuthorship       (J. A. Allen, 1890)
vernacularName                 conejo serrano, conejo, conejo serrano, rowi (Yuto-nahua), tochtli (Yuto-nahua)
taxonomicStatus                válido
nomenclaturalStatus            

And, this resource was described in the versioned corpus identified by hash://sha256/37bdd8ddb12df4ee02978ca59b695afd651f94398c0fe2e1f8b182849a876bb2 as part of activity urn:uuid:53c83e98-2cf8-4bc4-97d4-09b2b21d2b46, which is described by the following statements, retrieved via curl 'https://linker.bio/line:hash://sha256/37bdd8ddb12df4ee02978ca59b695afd651f94398c0fe2e1f8b182849a876bb2!/L1044,L1048':

<urn:uuid:53c83e98-2cf8-4bc4-97d4-09b2b21d2b46> <http://www.w3.org/ns/prov#generatedAtTime> "2024-04-01T21:33:37.506Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:53c83e98-2cf8-4bc4-97d4-09b2b21d2b46> .
<https://www.snib.mx/iptconabio/archive.do?r=SNIB-HC022> <http://purl.org/pav/hasVersion> <hash://sha256/69c839dc05a1b22d2e1aac1c84dec1cfd7af8425479053c74122e54998a1ddc2> <urn:uuid:53c83e98-2cf8-4bc4-97d4-09b2b21d2b46> .

which indicates that the resource was retrieved from https://www.snib.mx/iptconabio/archive.do?r=SNIB-HC022 on 2024-04-01T21:33:37.506Z (or April 1, 2024, no pun intended :wink: ).

With this, you’d have a concise way to reference one or more individual records as found in a well-defined, versioned, and clonable corpus of data.

I hope I gave you some more food for thought on tracing the provenance (and use) of existing records... and I am curious to hear where it takes you.

Looking forward,
-jorrit
