Why (pushing) data citations (still) matter

The path to automatic discovery of scientific studies making use of GBIF-mediated data is long and slow. The infrastructure is in place, but uptake is lacking. I have been focusing a lot on reaching out to authors when their papers fail to cite GBIF data correctly, and I’m happy to say that things are looking slightly brighter. In 2018, among all papers using GBIF data, this is the percentage of authors who got the data citation right:

These numbers, however, merely represent the presence of a GBIF DOI in a paper. For automatic linking between paper and data, this DOI has to be included as a reference, deposited with Crossref in the article metadata. Every time this happens, an event is triggered which we (or others) can “consume” to update our citation (and thus, data usage) statistics.

So how often does this happen? Not very often. In fact, if you search for events of papers citing GBIF-prefixed DOIs in 2018, you get 48 events representing 27 unique articles. That’s less than a third of the papers that cite data use correctly–not to mention, less than 5 per cent of the papers that used GBIF data in 2018. One publisher is worth noting, as Pensoft are responsible for more than half of these 27 articles.
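To make the counting concrete, here is a minimal sketch of how a batch of citation events could be reduced to "events" and "unique articles". It assumes each event is a record with `subj_id` (the citing article) and `obj_id` (the cited DOI), which is how I understand Crossref events to be shaped. The sample records and the `summarize_events` helper are hypothetical; `10.15468` is the GBIF DOI prefix.

```python
GBIF_DOI_PREFIX = "10.15468"  # DOI prefix used by GBIF


def summarize_events(events):
    """Return (event count, unique citing article count) for GBIF-prefixed objects.

    `events` is assumed to be a list of dicts with "subj_id" and "obj_id" keys.
    """
    gbif_events = [
        e for e in events
        if GBIF_DOI_PREFIX + "/" in e["obj_id"].lower()
    ]
    # One article can cite several GBIF DOIs, so events >= unique articles.
    articles = {e["subj_id"].lower() for e in gbif_events}
    return len(gbif_events), len(articles)


# Hypothetical sample data, not real events.
sample_events = [
    {"subj_id": "https://doi.org/10.9999/article.1", "obj_id": "https://doi.org/10.15468/dl.aaaaaa"},
    {"subj_id": "https://doi.org/10.9999/article.1", "obj_id": "https://doi.org/10.15468/dl.bbbbbb"},
    {"subj_id": "https://doi.org/10.9999/article.2", "obj_id": "https://doi.org/10.15468/dl.cccccc"},
    {"subj_id": "https://doi.org/10.9999/article.3", "obj_id": "https://doi.org/10.5555/other"},
]

n_events, n_articles = summarize_events(sample_events)
print(n_events, n_articles)
```

This is why 48 events can correspond to only 27 articles: a single paper that references several GBIF DOIs triggers one event per reference.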

In other words, authors seem to be doing better, but (most) journals have a long way to go.

With that being said, below are the 50 most cited(*) datasets in GBIF (as of 14 December 2018). There are probably interesting lessons to be learned here, but I’ll let you speculate and draw conclusions yourself–but please do share them :slight_smile:

| No | Dataset | # citations | # occurrences |
|---|---|---|---|
| 1 | iNaturalist Research-grade Observations | 199 | 6,326,638 |
| 2 | Natural History Museum (London) Collection Specimens | 180 | 3,803,921 |
| 3 | naturgucker | 152 | 8,947,938 |
| 4 | Tropicos Specimen Data | 147 | 4,440,020 |
| 5 | The vascular plants collection (P) at the Herbarium of the Muséum national d’Histoire Naturelle (MNHN - Paris) | 132 | 5,464,719 |
| 6 | Royal Botanic Gardens, Kew - Herbarium Specimens | 128 | 924,700 |
| 7 | MEL AVH data | 127 | 5,481,472 |
| 8 | Royal Botanic Garden Edinburgh Herbarium (E) | 126 | 903,681 |
| 9 | Geographically tagged INSDC sequences | 124 | 7,715,696 |
| 10 | Naturalis Biodiversity Center (NL) - Botany | 123 | 4,892,987 |
| 11 | Bernice P. Bishop Museum | 120 | 836,971 |
| 12 | CSIC-Real Jardín Botánico-Colección de Plantas Vasculares (MA) | 120 | 761,311 |
| 13 | Phanerogamic Botanical Collections (S) | 114 | 1,030,627 |
| 14 | NMNH Extant Specimen Records | 112 | 7,112,112 |
| 15 | Field Museum of Natural History (Botany) Seed Plant Collection | 108 | 603,745 |
| 16 | The New York Botanical Garden Herbarium (NY) | 107 | 3,693,643 |
| 17 | Herbarium Berolinense | 106 | 380,448 |
| 18 | CONN | 106 | 172,098 |
| 19 | Botany (UPS) | 105 | 706,679 |
| 20 | Harvard University Herbaria | 103 | 554,014 |
| 21 | Museum of Comparative Zoology, Harvard University | 99 | 1,922,632 |
| 22 | Natural History Museum, Vienna - Herbarium W | 98 | 270,776 |
| 23 | Herbarium Senckenbergianum (FR) | 92 | 109,588 |
| 24 | CAS Botany (BOT) | 92 | 532,485 |
| 25 | University of British Columbia Herbarium (UBC) - Vascular Plant Collection | 91 | 180,614 |
| 26 | EURISCO, The European Genetic Resources Search Catalogue | 91 | 976,457 |
| 27 | MBM - Herbário do Museu Botânico Municipal | 91 | 288,559 |
| 28 | Lund Botanical Museum (LD) | 89 | 1,006,725 |
| 29 | Allan Herbarium (CHR) | 88 | 272,389 |
| 30 | University of Vienna, Institute for Botany - Herbarium WU | 87 | 162,573 |
| 31 | Vascular Plant Herbarium, Oslo (O) | 87 | 858,156 |
| 32 | Royal Botanic Garden Edinburgh Living Plant Collections (E) | 84 | 106,179 |
| 33 | Biologiezentrum Linz | 83 | 2,243,034 |
| 34 | RB - Rio de Janeiro Botanical Garden Herbarium Collection | 82 | 720,885 |
| 35 | Collections and observation data National Museum of Natural History Luxembourg | 81 | 1,756,490 |
| 36 | Geneva Herbarium – General Collection (G) | 80 | 188,801 |
| 37 | Paleobiology Database | 79 | 764,816 |
| 38 | Banco de Datos de la Biodiversidad de la Comunitat Valenciana | 78 | 1,954,698 |
| 39 | Instituto de Botánica Darwinion | 78 | 317,934 |
| 40 | MGC Herbarium of University of Malaga (Spain): MGC-Cormof dataset | 77 | 78,930 |
| 41 | R. L. McGregor Herbarium Vascular Plants Collection | 77 | 268,202 |
| 42 | Artportalen (Swedish Species Observation System) | 76 | 64,482,477 |
| 43 | Plantae, TAIF (Taiwan e-Learning and Digital Archives Program, TELDAP) | 76 | 268,160 |
| 44 | Colección de plantas vasculares del herbario de la Universitat de València (VAL). | 75 | 215,335 |
| 45 | Institut Botanic de Barcelona (IBB-CSIC-ICUB), BC-Plantae | 74 | 128,961 |
| 46 | Herbarium specimens of Université de Montpellier 2, Institut de Botanique (MPU) | 74 | 876,216 |
| 47 | Marie-Victorin Herbarium (MT) - Plantes vasculaires | 73 | 138,467 |
| 48 | Fairchild Tropical Botanic Garden Virtual Herbarium Darwin Core format | 73 | 78,283 |
| 49 | Herbario de Plantas Vasculares de la Universidad de Salamanca: SALA | 73 | 135,628 |
| 50 | Carnet en Ligne | 73 | 333,761 |

(*) Disclaimer: this list represents the 50 datasets with the highest number of “citations”–a citation in this case means that a paper cited substantive use of the specific dataset or an aggregate of records to which the dataset contributed at least one record.
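The counting rule in the disclaimer can be sketched roughly as follows. This is only an illustration under my own assumptions: each paper is linked to one or more downloads, each download lists how many records every dataset contributed, and a paper counts at most once per dataset. The function name and the sample data are hypothetical.

```python
from collections import Counter


def count_dataset_citations(papers):
    """Count citations per dataset.

    `papers` maps a paper id to a list of downloads; each download maps a
    dataset name to the number of records that dataset contributed.
    """
    citations = Counter()
    for downloads in papers.values():
        cited = set()
        for download in downloads:
            for dataset, n_records in download.items():
                # A single contributed record is enough to count as cited.
                if n_records >= 1:
                    cited.add(dataset)
        # Assumption: one citation per paper per dataset, however many downloads.
        citations.update(cited)
    return citations


# Hypothetical sample data.
papers = {
    "paper A": [{"dataset X": 3000, "dataset Y": 1}],
    "paper B": [{"dataset X": 12}, {"dataset X": 5, "dataset Z": 0}],
}

counts = count_dataset_citations(papers)
print(counts["dataset X"], counts["dataset Y"], counts["dataset Z"])
```

Under this rule, a dataset that contributes a handful of records to many downloads collects as many citations as one that contributes millions of records to the same downloads.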

Happy holidays!

Great info, thank you! Interesting to see iNaturalist dataset on the top.

Indeed. iNaturalist has impressive breadth–both taxonomically and geographically. Their contribution to each download might not be massive, but they contribute to many, many downloads…

What if you standardized # of citations by the date a dataset was first published on GBIF? That might incentivize data holders to publish data by illustrating the average turn-around time between publication and first citation & subsequent rate of accrual.
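A minimal sketch of that normalization, assuming the only inputs are a citation count, the date the dataset was first published on GBIF, and the date the counts were taken (14 December 2018 for the list above). The function name and the example dataset values are hypothetical; publication dates are not part of the list.

```python
from datetime import date


def citations_per_year(citations, first_published, as_of):
    """Average number of citations per year since first publication."""
    days = (as_of - first_published).days
    if days <= 0:
        raise ValueError("as_of must be later than first_published")
    return citations / (days / 365.25)


AS_OF = date(2018, 12, 14)  # date of the list above

# Hypothetical dataset: 100 citations, first published four years earlier.
rate = citations_per_year(100, date(2014, 12, 14), AS_OF)
print(round(rate, 2))
```

A rate like this would let a recently published dataset with few citations be compared fairly against one that has been accruing citations for a decade.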

Thanks, that’s not a bad idea. I’ll add it to my pile :slight_smile:

How’s the DOI citation uptake progressing? Anecdotally (based on Google Scholar searches and the daily summary I get from them) iNaturalist is getting dozens (hundreds?) of data uses that aren’t citing GBIF exports, so we can’t track them well at all. This is definitely on us, since I just recently added a section to the FAQs about citing a GBIF export.

Hi Carrie,

Here’s a quick graph showing the percentage of papers citing use of GBIF-mediated data with a DOI:

We continue to work with authors, reviewers and editors to improve this. If you have any suggestions or ideas, let me know. I’m also happy to share some of the (semi-automatic) workflows we have in place for tracking citations.

Thanks! I’m interested to know more about the workflows you have in place. I just came across this interesting paper on invasive geckos and Dengue fever that is included in the 394 iNaturalist citations, but digging into the paper I see they just cite GBIF generally rather than a DOI (argh!). I’m certain there’s (lots of) iNat data in this paper given how many gecko records we have, but it’s a bummer to lack the clarity of a DOI in making the connection. Do you happen to know if you reached out to those authors?

Weterings, R., Barbetti, M. & Buckley, H.L. Biol Invasions (2019). https://doi.org/10.1007/s10530-019-02066-x (do you think altmetrics pick up forum posts?)

I failed to notice earlier that that paper is connected to two data downloads:


(I went straight to the paper and expected to find the citations there.)

I found this explanation of the literature tracking system overall (cool, btw), but not how you make a match when they don’t cite a DOI, as in this case. Did you get this from asking the authors? Or finding downloads associated with their account?

Hi Carrie,

I reach out to all authors of papers that appear to use GBIF-mediated data but fail to include a DOI in the citation. In this case, the authors were able to retrace their steps and find the relevant downloads and associated DOIs so I could link them. The authors also pointed out that they ended up discarding one dataset, but the Gekko data should be accurate, including the 3k records from iNaturalist.

Btw, I just added a bunch of new papers pushing the iNat counter up to 400. Congrats! :smiley:

/Daniel

Awesome! Thanks for reaching out to the authors and prompting that detective work. I just saw the count go to 400 and we tweeted about it!

Hi @dnoesgaard! Is there any chance you could update the list above to see if there have been significant changes from December 2018 to now? Also, is there a way to filter by country? :slight_smile: I’m just wondering what the most cited Spanish datasets are.
Thank you!

Hi Cristina,

I’ll see what I can do—will get back to you perhaps tomorrow or Friday. But I know that this is the most cited one: https://www.gbif.org/dataset/834c9918-f762-11e1-a439-00145eb45e9a

/Daniel

Thank you Daniel! Knowing the most cited one is already useful :slight_smile:

Cristina

Hi Cristina,

Sorry for the late reply. Here are the “top” datasets from Spain with more than 100 citations:

| Citations | Dataset |
|---|---|
| 258 | CSIC-Real Jardín Botánico-Colección de Plantas Vasculares (MA) |
| 198 | Banco de Datos de la Biodiversidad de la Comunitat Valenciana |
| 166 | Institut Botanic de Barcelona (IBB-CSIC-ICUB), BC-Plantae |
| 161 | Colección de plantas vasculares del herbario de la Universitat de València (VAL) |
| 160 | Herbario de Plantas Vasculares de la Universidad de Salamanca: SALA |
| 158 | MGC Herbarium of University of Malaga (Spain): MGC-Cormof dataset |
| 156 | Herbario de la Universidad de Sevilla |
| 152 | SANT Herbarium vascular plants collection |
| 146 | VIT Herbarium - Vascular Plants (The Natural History Museum of Alava) |
| 138 | Herbario ABH (Universidad de Alicante) |
| 132 | CSIC-Real Jardín Botánico-Anthos. Sistema de Información de las Plantas de España |
| 130 | Jardín Botánico de Córdoba: Herbarium COA |
| 128 | Universidad de Oviedo. Departamento de Biología de Organismos y Sistemas: FCO |
| 128 | Herbarium of Vascular Plants Collection of the University of Extremadura (Spain) |
| 122 | Herbario HSS Finca La Orden-Valdesequera (CICYTEX). Gobierno de Extremadura |
| 119 | Universidad de Oviedo. Departamento de Biología de Organismos y Sistemas: FCO-Briof |
| 117 | Herbario EMMA. Herbario de la Escuela Técnica Superior de Ingenieros de Montes. UPM |
| 109 | Base de datos de plantas vasculares del País Vasco: ARAN-EH |
| 108 | Sistema de Información de la vegetación Ibérica y Macaronésica |
| 107 | Cartografía de vegetación a escala de detalle 1:10.000 de la masa forestal de Andalucía |
| 106 | Aranzadi Zientzi Elkartea |
| 106 | Herbario COFC de la Universidad de Córdoba: colección general de plantas vasculares |
| 105 | Universidad de Navarra, Herbarium: PAMP-Vascular Plants |
| 103 | Herbario de la Universidad de Almería |
| 100 | Herbario de Universidad de Murcia: MUB |

Here’s the full list (datasetKeys):
all_cited_datasets_es.txt (11.3 KB)

Hi Daniel. Please, don’t be sorry! It is great, thank you so much for the effort. It is very useful for our reports and good also to spread it to our providers. Thanks again!