The path to automatic discovery of scientific studies making use of GBIF-mediated data is long and slow. The infrastructure is in place, but uptake is lacking. I have been focusing a lot on reaching out to authors when their papers fail to cite GBIF data correctly, and I’m happy to say that things are looking slightly brighter. In 2018, among all papers using GBIF data, this is the percentage of authors who got the data citation right:
These numbers, however, merely represent the presence of a GBIF DOI in a paper. For automatic linking between paper and data, this DOI has to be included as a reference, deposited with Crossref in the article metadata. Every time this happens, an event is triggered which we (or others) can “consume” to update our citation (and thus, data usage) statistics.
So how often does this happen? Not very often. In fact, if you search for events of papers citing GBIF-prefixed DOIs in 2018, you get 48 events representing 27 unique articles. That’s less than a third of the papers that cite data use correctly–not to mention, less than 5 per cent of the papers that used GBIF data in 2018. One publisher is worth noting, as Pensoft are responsible for more than half of these 27 articles.
In other words, authors seem to be doing better, but (most) journals have a long way to go.
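For anyone who wants to reproduce this kind of count, here is a minimal Python sketch of the tallying step. It is not the actual GBIF workflow. It assumes events have already been fetched as dictionaries with `subj_id` (the citing article), `obj_id` (the cited DOI) and `occurred_at` fields, in the style of Crossref Event Data records, and that GBIF DOIs carry the `10.15468` prefix. The function names and all sample DOIs are made up for illustration.

```python
from typing import Iterable, Mapping, Tuple

# DOI prefix assumed for GBIF datasets and downloads.
GBIF_DOI_PREFIX = "10.15468"

_RESOLVERS = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def bare_doi(identifier: str) -> str:
    """Lowercase an identifier and strip a doi.org resolver prefix, if any."""
    doi = identifier.strip().lower()
    for resolver in _RESOLVERS:
        if doi.startswith(resolver):
            return doi[len(resolver):]
    return doi


def summarize_gbif_events(
    events: Iterable[Mapping[str, str]], year: int
) -> Tuple[int, int]:
    """Return (number of events, number of unique citing articles)
    for events in `year` whose cited object is a GBIF-prefixed DOI."""
    gbif_events = [
        e for e in events
        if bare_doi(e["obj_id"]).startswith(GBIF_DOI_PREFIX + "/")
        and e["occurred_at"].startswith(f"{year}-")
    ]
    articles = {bare_doi(e["subj_id"]) for e in gbif_events}
    return len(gbif_events), len(articles)


# Invented sample records; none of these DOIs are real.
sample_events = [
    {"subj_id": "https://doi.org/10.1234/article.1",
     "obj_id": "https://doi.org/10.15468/dl.aaaaaa",
     "occurred_at": "2018-03-01T00:00:00Z"},
    {"subj_id": "https://doi.org/10.1234/article.1",
     "obj_id": "https://doi.org/10.15468/dl.bbbbbb",
     "occurred_at": "2018-03-01T00:00:00Z"},
    {"subj_id": "https://doi.org/10.1234/article.2",
     "obj_id": "10.15468/dl.aaaaaa",
     "occurred_at": "2018-07-15T00:00:00Z"},
    # Cites a non-GBIF DOI, so it is ignored.
    {"subj_id": "https://doi.org/10.1234/article.3",
     "obj_id": "https://doi.org/10.9999/other.data",
     "occurred_at": "2018-08-01T00:00:00Z"},
    # Outside the requested year, so it is ignored.
    {"subj_id": "https://doi.org/10.1234/article.4",
     "obj_id": "https://doi.org/10.15468/dl.cccccc",
     "occurred_at": "2017-11-30T00:00:00Z"},
]

if __name__ == "__main__":
    n_events, n_articles = summarize_gbif_events(sample_events, 2018)
    print(f"{n_events} events from {n_articles} unique articles")
```

One article can trigger several events when it cites more than one GBIF DOI, which is why the number of events is higher than the number of unique articles.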
With that being said, below are the 50 most cited(*) datasets in GBIF (as of 14 December 2018). There are probably interesting lessons to be learned here, but I’ll let you speculate and draw conclusions yourself. Please do share them.
| No | Dataset | # citations | # occurrences |
|---|---|---|---|
| 1 | iNaturalist Research-grade Observations | 199 | 6,326,638 |
| 2 | Natural History Museum (London) Collection Specimens | 180 | 3,803,921 |
| 3 | naturgucker | 152 | 8,947,938 |
| 4 | Tropicos Specimen Data | 147 | 4,440,020 |
| 5 | The vascular plants collection (P) at the Herbarium of the Muséum national d’Histoire Naturelle (MNHN - Paris) | 132 | 5,464,719 |
| 6 | Royal Botanic Gardens, Kew - Herbarium Specimens | 128 | 924,700 |
| 7 | MEL AVH data | 127 | 5,481,472 |
| 8 | Royal Botanic Garden Edinburgh Herbarium (E) | 126 | 903,681 |
| 9 | Geographically tagged INSDC sequences | 124 | 7,715,696 |
| 10 | Naturalis Biodiversity Center (NL) - Botany | 123 | 4,892,987 |
| 11 | Bernice P. Bishop Museum | 120 | 836,971 |
| 12 | CSIC-Real Jardín Botánico-Colección de Plantas Vasculares (MA) | 120 | 761,311 |
| 13 | Phanerogamic Botanical Collections (S) | 114 | 1,030,627 |
| 14 | NMNH Extant Specimen Records | 112 | 7,112,112 |
| 15 | Field Museum of Natural History (Botany) Seed Plant Collection | 108 | 603,745 |
| 16 | The New York Botanical Garden Herbarium (NY) | 107 | 3,693,643 |
| 17 | Herbarium Berolinense | 106 | 380,448 |
| 18 | CONN | 106 | 172,098 |
| 19 | Botany (UPS) | 105 | 706,679 |
| 20 | Harvard University Herbaria | 103 | 554,014 |
| 21 | Museum of Comparative Zoology, Harvard University | 99 | 1,922,632 |
| 22 | Natural History Museum, Vienna - Herbarium W | 98 | 270,776 |
| 23 | Herbarium Senckenbergianum (FR) | 92 | 109,588 |
| 24 | CAS Botany (BOT) | 92 | 532,485 |
| 25 | University of British Columbia Herbarium (UBC) - Vascular Plant Collection | 91 | 180,614 |
| 26 | EURISCO, The European Genetic Resources Search Catalogue | 91 | 976,457 |
| 27 | MBM - Herbário do Museu Botânico Municipal | 91 | 288,559 |
| 28 | Lund Botanical Museum (LD) | 89 | 1,006,725 |
| 29 | Allan Herbarium (CHR) | 88 | 272,389 |
| 30 | University of Vienna, Institute for Botany - Herbarium WU | 87 | 162,573 |
| 31 | Vascular Plant Herbarium, Oslo (O) | 87 | 858,156 |
| 32 | Royal Botanic Garden Edinburgh Living Plant Collections (E) | 84 | 106,179 |
| 33 | Biologiezentrum Linz | 83 | 2,243,034 |
| 34 | RB - Rio de Janeiro Botanical Garden Herbarium Collection | 82 | 720,885 |
| 35 | Collections and observation data National Museum of Natural History Luxembourg | 81 | 1,756,490 |
| 36 | Geneva Herbarium – General Collection (G) | 80 | 188,801 |
| 37 | Paleobiology Database | 79 | 764,816 |
| 38 | Banco de Datos de la Biodiversidad de la Comunitat Valenciana | 78 | 1,954,698 |
| 39 | Instituto de Botánica Darwinion | 78 | 317,934 |
| 40 | MGC Herbarium of University of Malaga (Spain): MGC-Cormof dataset | 77 | 78,930 |
| 41 | R. L. McGregor Herbarium Vascular Plants Collection | 77 | 268,202 |
| 42 | Artportalen (Swedish Species Observation System) | 76 | 64,482,477 |
| 43 | Plantae, TAIF (Taiwan e-Learning and Digital Archives Program, TELDAP) | 76 | 268,160 |
| 44 | Colección de plantas vasculares del herbario de la Universitat de València (VAL). | 75 | 215,335 |
| 45 | Institut Botanic de Barcelona (IBB-CSIC-ICUB), BC-Plantae | 74 | 128,961 |
| 46 | Herbarium specimens of Université de Montpellier 2, Institut de Botanique (MPU) | 74 | 876,216 |
| 47 | Marie-Victorin Herbarium (MT) - Plantes vasculaires | 73 | 138,467 |
| 48 | Fairchild Tropical Botanic Garden Virtual Herbarium Darwin Core format | 73 | 78,283 |
| 49 | Herbario de Plantas Vasculares de la Universidad de Salamanca: SALA | 73 | 135,628 |
| 50 | Carnet en Ligne | 73 | 333,761 |
(*) Disclaimer: this list represents the 50 datasets with the highest number of “citations”. A citation in this case means that a paper cited substantive use of the specific dataset or of an aggregate of records to which the dataset contributed at least one record.
Happy holidays!
Great info, thank you! Interesting to see iNaturalist dataset on the top.
Indeed. iNaturalist has impressive breadth–both taxonomically and geographically. Their contribution to each download might not be massive, but they contribute to many, many downloads…
What if you standardized # of citations by the date a dataset was first published on GBIF? That might incentivize data holders to publish data by illustrating the average turn-around time between publication and first citation & subsequent rate of accrual.
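A rough Python sketch of that standardization, assuming "standardized" simply means citations divided by the number of years a dataset has been published on GBIF. The function name is hypothetical, and the dataset labels, counts and publication dates in the example are placeholders, not real GBIF values.

```python
from datetime import date


def citations_per_year(citations: int, first_published: date, as_of: date) -> float:
    """Citations divided by the years elapsed since first publication."""
    days = (as_of - first_published).days
    if days <= 0:
        raise ValueError("as_of must be later than first_published")
    return citations / (days / 365.25)


if __name__ == "__main__":
    as_of = date(2018, 12, 14)  # snapshot date of the list above
    # Placeholder datasets: (label, citations, hypothetical publication date).
    examples = [
        ("dataset A", 100, date(2016, 12, 14)),
        ("dataset B", 100, date(2008, 12, 14)),
    ]
    for label, citations, published in examples:
        rate = citations_per_year(citations, published, as_of)
        print(f"{label}: {rate:.1f} citations per year")
```

Two datasets with the same raw count can end up with very different rates, which is the effect the suggestion is after.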
Thanks, that’s not a bad idea. I’ll add it to my pile
How’s the DOI citation uptake progressing? Anecdotally (based on Google Scholar searches and the daily summary I get from them) iNaturalist is getting dozens (hundreds?) of data uses that aren’t citing GBIF exports, so we can’t track them well at all. This is definitely on us, since I just recently added a section to the FAQs about citing a GBIF export.
Hi Carrie,
Here’s a quick graph showing the percentage of papers citing use of GBIF-mediated data with a DOI:
We continue to work with authors, reviewers and editors to improve this. If you have any suggestions or ideas, let me know. I’m also happy to share some of the (semi-automatic) workflows we have in place for tracking citations.
Thanks! I’m interested to know more about the workflows you have in place. I just came across this interesting paper on invasive geckos and Dengue fever that is included in the 394 iNaturalist citations, but digging into the paper I see they just cite GBIF generally rather than a DOI (argh!). I’m certain there’s (lots of) iNat data in this paper given how many gecko records we have, but it’s a bummer to lack the clarity of a DOI in making the connection. Do you happen to know if you reached out to those authors?
Weterings, R., Barbetti, M. & Buckley, H.L. Biol Invasions (2019). https://doi.org/10.1007/s10530-019-02066-x (do you think altmetrics pick up forum posts?)
I failed to notice earlier that that paper is connected to two data downloads:
(I went straight to the paper and expected to find the citations there.)
I found this explanation of the literature tracking system overall (cool, btw), but not how you make a match when they don’t cite a DOI, as in this case. Did you get this from asking the authors? Or finding downloads associated with their account?
Hi Carrie,
I reach out to all authors of papers that appear to use GBIF-mediated data but fail to include a DOI in the citation. In this case, the authors were able to retrace their steps and find the relevant downloads and associated DOIs so I could link them. The authors did also highlight that they ended up discarding one dataset, but the Gekko data should be accurate, including the 3k records from iNaturalist.
Btw, I just added a bunch of new papers pushing the iNat counter up to 400. Congrats!
/Daniel
Awesome! Thanks for reaching out to the authors and prompting that detective work. I just saw the count go to 400 and we tweeted about it!
Hi @dnoesgaard! Is there any chance to update the list above to see if there have been significant changes from December 2018 to now? Also, is there a way to filter by country? I just wonder what the most cited Spanish datasets are.
Thank you!
Hi Cristina,
I’ll see what I can do—will get back to you perhaps tomorrow or Friday. But I know that this is the most cited one: https://www.gbif.org/dataset/834c9918-f762-11e1-a439-00145eb45e9a
/Daniel
Thank you Daniel! Knowing the most cited one is already useful
Cristina
Hi Daniel. Please, don’t be sorry! It is great, thank you so much for the effort. It is very useful for our reports and good also to spread it to our providers. Thanks again!