Why (pushing) data citations (still) matter


#1

The path to automatic discovery of scientific studies making use of GBIF-mediated data is long and slow. The infrastructure is in place, but uptake is lacking. I have been focusing a lot on reaching out to authors when their papers fail to cite GBIF data correctly, and I’m happy to say that things are looking slightly brighter. In 2018, among all papers using GBIF data, this is the percentage of the people who got the data citation right:

These numbers, however, merely represent the presence of a GBIF DOI in a paper. For automatic linking between paper and data, this DOI has to be included as a reference, deposited with Crossref in the article metadata. Every time this happens, an event is triggered which we (or others) can “consume” to update our citation (and thus, data usage) statistics.

So how often does this happen? Not very often. In fact, if you search for events of papers citing GBIF-prefixed DOIs in 2018, you get 48 events representing 27 unique articles. That’s fewer than a third of the papers that cite data use correctly, and fewer than 5 per cent of the papers that used GBIF data in 2018. One publisher is worth noting: Pensoft is responsible for more than half of these 27 articles.
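To make the difference between "events" and "unique articles" concrete, here is a minimal sketch of how such a tally could be done once the events have been fetched. It assumes each event is a dict with `subj_id` (the citing paper) and `obj_id` (the cited DOI), which is how I understand Crossref Event Data to be shaped, and it assumes `10.15468` as the GBIF DOI prefix. The function name and the sample events are hypothetical, made up purely for illustration.

```python
from collections import defaultdict


def summarize_events(events, doi_prefix="10.15468"):
    """Return (number of events, number of unique citing articles)
    for events whose cited DOI falls under doi_prefix.

    Assumption: every event has 'subj_id' (citing work) and
    'obj_id' (cited DOI), possibly written as https://doi.org/ URLs.
    """
    cited_by_article = defaultdict(set)
    n_events = 0
    for event in events:
        # Drop the resolver part so only the bare DOI is compared.
        doi = event["obj_id"].lower().split("doi.org/")[-1]
        if not doi.startswith(doi_prefix + "/"):
            continue
        n_events += 1
        cited_by_article[event["subj_id"].lower()].add(doi)
    return n_events, len(cited_by_article)


# Hypothetical events: paper A cites two GBIF DOIs, paper B cites one,
# paper C cites a non-GBIF DOI and is therefore ignored.
sample_events = [
    {"subj_id": "https://doi.org/10.9999/paper.a",
     "obj_id": "https://doi.org/10.15468/dl.aaaaaa"},
    {"subj_id": "https://doi.org/10.9999/paper.a",
     "obj_id": "https://doi.org/10.15468/bbbbbb"},
    {"subj_id": "https://doi.org/10.9999/paper.b",
     "obj_id": "https://doi.org/10.15468/dl.aaaaaa"},
    {"subj_id": "https://doi.org/10.9999/paper.c",
     "obj_id": "https://doi.org/10.5061/dryad.zzzz"},
]

n_events, n_articles = summarize_events(sample_events)
print(n_events, n_articles)
```

One article citing several GBIF DOIs produces several events, which is why the event count can be noticeably higher than the article count.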

In other words, authors seem to be doing better, but (most) journals have a long way to go.

With that being said, below are the 50 most cited(*) datasets in GBIF (as of 14 December 2018). There are probably interesting lessons to be learned here, but I’ll let you speculate and draw conclusions yourself–but please do share them :slight_smile:

| No. | Dataset | Citations | Occurrences |
|---|---|---|---|
| 1 | iNaturalist Research-grade Observations | 199 | 6,326,638 |
| 2 | Natural History Museum (London) Collection Specimens | 180 | 3,803,921 |
| 3 | naturgucker | 152 | 8,947,938 |
| 4 | Tropicos Specimen Data | 147 | 4,440,020 |
| 5 | The vascular plants collection (P) at the Herbarium of the Muséum national d’Histoire Naturelle (MNHN - Paris) | 132 | 5,464,719 |
| 6 | Royal Botanic Gardens, Kew - Herbarium Specimens | 128 | 924,700 |
| 7 | MEL AVH data | 127 | 5,481,472 |
| 8 | Royal Botanic Garden Edinburgh Herbarium (E) | 126 | 903,681 |
| 9 | Geographically tagged INSDC sequences | 124 | 7,715,696 |
| 10 | Naturalis Biodiversity Center (NL) - Botany | 123 | 4,892,987 |
| 11 | Bernice P. Bishop Museum | 120 | 836,971 |
| 12 | CSIC-Real Jardín Botánico-Colección de Plantas Vasculares (MA) | 120 | 761,311 |
| 13 | Phanerogamic Botanical Collections (S) | 114 | 1,030,627 |
| 14 | NMNH Extant Specimen Records | 112 | 7,112,112 |
| 15 | Field Museum of Natural History (Botany) Seed Plant Collection | 108 | 603,745 |
| 16 | The New York Botanical Garden Herbarium (NY) | 107 | 3,693,643 |
| 17 | Herbarium Berolinense | 106 | 380,448 |
| 18 | CONN | 106 | 172,098 |
| 19 | Botany (UPS) | 105 | 706,679 |
| 20 | Harvard University Herbaria | 103 | 554,014 |
| 21 | Museum of Comparative Zoology, Harvard University | 99 | 1,922,632 |
| 22 | Natural History Museum, Vienna - Herbarium W | 98 | 270,776 |
| 23 | Herbarium Senckenbergianum (FR) | 92 | 109,588 |
| 24 | CAS Botany (BOT) | 92 | 532,485 |
| 25 | University of British Columbia Herbarium (UBC) - Vascular Plant Collection | 91 | 180,614 |
| 26 | EURISCO, The European Genetic Resources Search Catalogue | 91 | 976,457 |
| 27 | MBM - Herbário do Museu Botânico Municipal | 91 | 288,559 |
| 28 | Lund Botanical Museum (LD) | 89 | 1,006,725 |
| 29 | Allan Herbarium (CHR) | 88 | 272,389 |
| 30 | University of Vienna, Institute for Botany - Herbarium WU | 87 | 162,573 |
| 31 | Vascular Plant Herbarium, Oslo (O) | 87 | 858,156 |
| 32 | Royal Botanic Garden Edinburgh Living Plant Collections (E) | 84 | 106,179 |
| 33 | Biologiezentrum Linz | 83 | 2,243,034 |
| 34 | RB - Rio de Janeiro Botanical Garden Herbarium Collection | 82 | 720,885 |
| 35 | Collections and observation data National Museum of Natural History Luxembourg | 81 | 1,756,490 |
| 36 | Geneva Herbarium – General Collection (G) | 80 | 188,801 |
| 37 | Paleobiology Database | 79 | 764,816 |
| 38 | Banco de Datos de la Biodiversidad de la Comunitat Valenciana | 78 | 1,954,698 |
| 39 | Instituto de Botánica Darwinion | 78 | 317,934 |
| 40 | MGC Herbarium of University of Malaga (Spain): MGC-Cormof dataset | 77 | 78,930 |
| 41 | R. L. McGregor Herbarium Vascular Plants Collection | 77 | 268,202 |
| 42 | Artportalen (Swedish Species Observation System) | 76 | 64,482,477 |
| 43 | Plantae, TAIF (Taiwan e-Learning and Digital Archives Program, TELDAP) | 76 | 268,160 |
| 44 | Colección de plantas vasculares del herbario de la Universitat de València (VAL). | 75 | 215,335 |
| 45 | Institut Botanic de Barcelona (IBB-CSIC-ICUB), BC-Plantae | 74 | 128,961 |
| 46 | Herbarium specimens of Université de Montpellier 2, Institut de Botanique (MPU) | 74 | 876,216 |
| 47 | Marie-Victorin Herbarium (MT) - Plantes vasculaires | 73 | 138,467 |
| 48 | Fairchild Tropical Botanic Garden Virtual Herbarium Darwin Core format | 73 | 78,283 |
| 49 | Herbario de Plantas Vasculares de la Universidad de Salamanca: SALA | 73 | 135,628 |
| 50 | Carnet en Ligne | 73 | 333,761 |

(*) Disclaimer: this list represents the 50 datasets with the highest number of “citations”. A citation in this case means that a paper cited substantive use of the specific dataset or of an aggregate of records to which the dataset contributed at least one record.

Happy holidays!


#2

Great info, thank you! Interesting to see iNaturalist dataset on the top.


#3

Indeed. iNaturalist has impressive breadth–both taxonomically and geographically. Their contribution to each download might not be massive, but they contribute to many, many downloads…


#4

What if you standardized # of citations by the date a dataset was first published on GBIF? That might incentivize data holders to publish data by illustrating the average turn-around time between publication and first citation & subsequent rate of accrual.
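A rough sketch of what that standardization could look like: citations divided by the number of years a dataset has been available, measured up to the 14 December 2018 cut-off used for the list above. The function name and the example publication dates are hypothetical; real dates would have to come from each dataset's metadata.

```python
from datetime import date


def citations_per_year(citations, published, as_of=date(2018, 12, 14)):
    """Citations divided by years elapsed between publication and as_of."""
    years = (as_of - published).days / 365.25
    if years <= 0:
        raise ValueError("publication date must be earlier than as_of")
    return citations / years


# Hypothetical datasets: same citation count, different publication dates.
datasets = {
    "dataset A": (100, date(2008, 12, 14)),
    "dataset B": (100, date(2016, 12, 14)),
}

rates = {name: citations_per_year(count, published)
         for name, (count, published) in datasets.items()}

# Rank by citation rate rather than by raw count.
for name in sorted(rates, key=rates.get, reverse=True):
    print(f"{name}: {rates[name]:.1f} citations per year")
```

With equal raw counts, the more recently published dataset comes out ahead, which is the kind of signal that a ranking by raw citations hides.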


#5

Thanks, that’s not a bad idea. I’ll add it to my pile :slight_smile: