GBIF Literature tracking (GBIF technical support hour for Nodes)

The GBIF technical support hour for Nodes is back!

For this session, we have a guest, Daniel Noesgaard from the Communication team, who will explain how GBIF tracks literature citations of GBIF data: how the citations are found, labelled, and integrated into the GBIF index.

The Data Product team will also be there to answer any helpdesk questions, whether or not they relate to the topic of the presentation.

We will begin every session with a five-minute practical topic, and then open the floor for technical questions including, but not limited to, the presented topic.


Thank you all for the session last week!

The video is now available here: Literature tracking on Vimeo

Here is the transcript from the questions during the session:

Regarding the citation button on the dataset page, sometimes the articles don’t seem related to the data (for example, an article on marine biology citing data from a herbarium). It seems as if the authors just downloaded and cited all of GBIF and not the relevant data. If we notice this, whom at the GBIF Secretariat should we notify, and how?

We trust that what is cited is what is used in the work described by the authors. If somebody cites a DOI, we trust that this is the DOI for the data that they use.
We see a bit of over-citation happening, where people cite all of GBIF when they really use a small subset. We aren’t sure why this is happening. Maybe the users are being very cautious and download more than they need before filtering the data themselves. Sometimes the authors use a GBIF snapshot (available on cloud services) and automatically get a DOI citing all of GBIF data.

When Daniel notices that the article doesn’t seem to correspond at all to the data cited, he will reach out to ask for a bit more information. Sometimes authors download the whole of GBIF but use data from only 2-3 species. We advise them to replace the download citation with a derived dataset citation (see the Citation guidelines).

How do you assign a citation to a dataset when no download DOI is cited (for example, when the authors only mention working with GBIF data)?

We don’t make any assumptions and we don’t try to derive information based on the content of an article. For example, if someone writes that they used data from a country, we don’t try to find the relevant data ourselves. We will send an automatic email to the corresponding author asking them to provide more specific information and to use the DOI for their downloads if they have one.

Do you use Artificial Intelligence for citation tracking?

We are looking at all options but we aren’t using AI for citation tracking.

There is a mismatch between the number of citations on the country page and the number shown when looking at citations by publisher.

The numbers displayed here: Ecuador are the number of publications by authors affiliated with an institution in the country, not the number of citations of Ecuadorian datasets.
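To make the distinction concrete, here is a minimal sketch that builds two different literature search queries: one filtered by the country of the researchers (what the country page counts) and one filtered by the publisher whose data were cited. It assumes the public literature search endpoint and the filter names `countriesOfResearcher` and `publishingOrganizationKey` as I understand them; please check the API documentation before relying on them. The organization key is a made-up placeholder.

```python
from urllib.parse import urlencode

# Assumed base URL of the GBIF literature search API.
LITERATURE_SEARCH = "https://api.gbif.org/v1/literature/search"


def literature_search_url(**filters):
    """Build a literature search URL from keyword filters."""
    return f"{LITERATURE_SEARCH}?{urlencode(filters)}"


# Publications by authors affiliated with an institution in Ecuador.
by_researcher_country = literature_search_url(countriesOfResearcher="EC")

# Publications citing data from one publisher (hypothetical placeholder key).
by_publisher = literature_search_url(
    publishingOrganizationKey="00000000-0000-0000-0000-000000000000"
)

print(by_researcher_country)
print(by_publisher)
```

The two queries answer different questions, which is why their counts do not have to match.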

We have now logged the idea of indexing literature by the country of the publishers whose data were cited. If you want to follow up, see: Index literature by publishing country of the dataset cited · Issue #4962 · gbif/portal-feedback · GitHub

I would like to know a bit more about the work you started on identifying citation of GBIF-mediated data in policy documents. The policy documents often cite articles that might have used and cited GBIF-mediated data. How do you identify those?

What we are doing at the moment is looking for direct citations of GBIF-mediated data in policy documents. There aren’t as many as in research papers but there are some. The added layer of reports like the ones published by IPCC or IPBES, which cite thousands of publications that may rely on GBIF data, doesn’t make the tracking process easy. We are looking at how we could expose this layer but it isn’t clear at the moment.
In the reports, we try to identify whether any of the cited publications are ones that we have already identified as citing GBIF data.

We are working with IPBES to improve the citation of the data, so that a sentence in the executive summary can be linked to the data that it relies on. In general, it can be challenging in large reports to know exactly which data contributed to certain parts of the report, as so many publications are cited.

Do those cited publications in reports cite the DOIs for GBIF downloads? Can you detect the citations automatically?

Sometimes they do. We scrape all the DOIs and try to find out whether any of them are GBIF DOIs. There are also many papers that don’t cite DOIs.
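As an illustration only (this is not the actual GBIF pipeline), scraping DOIs from a text and keeping the GBIF ones could look roughly like the sketch below. It assumes that the DOIs of interest use the `10.15468` prefix, uses a deliberately simple regular expression, and the DOIs in the sample text are made up.

```python
import re

# Simple DOI pattern: "10.", a registrant code, "/", then a suffix.
# Real-world DOI matching has more edge cases than this covers.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

# Assumption: GBIF dataset and download DOIs use this prefix.
GBIF_PREFIXES = ("10.15468/",)


def extract_dois(text):
    """Return the unique DOIs found in text, lowercased and sorted."""
    found = set()
    for match in DOI_PATTERN.finditer(text):
        # Drop trailing punctuation that belongs to the sentence, not the DOI.
        doi = match.group(0).rstrip(".,;:)]}").lower()
        found.add(doi)
    return sorted(found)


def gbif_dois(text):
    """Keep only the DOIs that start with an assumed GBIF prefix."""
    return [doi for doi in extract_dois(text) if doi.startswith(GBIF_PREFIXES)]


# Hypothetical reference list snippet with invented DOIs.
sample = (
    "GBIF.org (2023) Occurrence download https://doi.org/10.15468/dl.abc123. "
    "See also doi:10.1000/xyz456, and (10.15468/DL.ABC123)."
)
print(gbif_dois(sample))
```

DOIs are case-insensitive, so lowercasing before deduplicating avoids counting the same download twice.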

What is the best way to cite GBIF in policy documents so it can be findable?

We discover publications via systems like Google Scholar or Overton, so we will find policy documents that are published in a way that lets those systems pick them up.

Another way is to notify us directly. Anyone who has generated a GBIF download can use the notification option on the download page. They can tell us up front that some data has been/will be used in a publication. This also has the advantage of making sure that we don’t delete the data.

Is there a slide or a one-page document that we can send/give people that summarises the citation tracking at GBIF?

Not at the moment, but we will look into it. See this GitHub issue for follow-up: Make material summarising the citation tracking and data use at GBIF · Issue #4963 · gbif/portal-feedback · GitHub

Could there be an equivalent on the website?

We have this page: Data use. It explains roughly what we do with literature tracking. It is updated often and has the latest data use articles. Some ideas were logged to make this page more visible: Link to the Data Use About page instead of just the literature search from the homepage · Issue #4964 · gbif/portal-feedback · GitHub

Sometimes I get approached by researchers who want to publish their data along with their research article. They don’t want to publish the data on GBIF before they publish their research article but they need a DOI to cite the data in their article.

If you have a DataCite account associated with your IPT, you can reserve a DOI.

You can also publish and register a dataset as metadata-only (which will create a DOI) and add records afterwards to the same dataset. The DOI won’t change.

When a dataset is migrated from one publisher to another, the citations remain visible on the previous publisher page. Is there a way to migrate the citations along with the dataset under the new publisher?

If a dataset is published by one publisher and the same dataset changes publisher, the citation should automatically be transferred.

If you delete a dataset and republish it as a new one, the citations will remain on the previous dataset page. This is because changing the information there would change facts. However, there are a few GitHub issues discussing how we could make the citations on deleted datasets more discoverable. See:

What do the values in the relevance filter in the literature search index mean? “GBIF cited”, “GBIF discussed”, etc.?

You can find the definitions here: FAQ
This is our attempt at determining the contribution of GBIF to the publication. For example, if someone writes “this species occurs at that location”, the paper doesn’t make substantial use of GBIF data, so it would fall under the category “GBIF cited”: it is a qualitative fact derived from the data.
There are ambiguous cases of course. For example, we recently had a paper that looked at the number of records available. In that case, is it using the GBIF data? It doesn’t correspond to any particular record. In that case, what made the difference was the volume of data used.

We appreciated being able to search citations by project identifier but a dataset can only have one projectID. This is a bit limiting for projects with multiple funders for example.

We are currently working on making it possible to have multiple values in the projectID field: Make the metadata projectID field multivalue · Issue #1927 · gbif/ipt · GitHub as well as to have project identifiers at the occurrence level: Allow adding a projectId to individual occurrences · Issue #115 · gbif/ · GitHub

Regarding PLIC extensions under test - When will they be released to the production version?

When it is ratified by TDWG. Note that we aren’t planning to do anything more with that extension than making the content mapped to the extension available on
See also Is PlinianCore ratified by TDWG? · Issue #24 · tdwg/PlinianCore · GitHub for the status of the extension.

Alternatives to the RSS feed when there are several IPTs. I would like to know a little more about RSS and how to monitor the entry of new records in a data resource, new resources and removal of resources. Does RSS provide all of this information?

RSS doesn’t monitor the entry of new records in a data resource.
RSS can help find new resources but doesn’t help find removal of resources.

I am managing 8 IPTs and I would like to know what are the ways I can use to detect changes and monitor the activity on these IPTs.

You can use the RSS feed of each IPT you manage (for example: GBIF Africa). This will show you the 25 most recently changed datasets. Otherwise, you could also consider automatically checking publisher, dataset or IPT activity with the GBIF Registry API, but this will only work for GBIF-registered datasets.
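As a rough sketch of the RSS approach, the snippet below parses a feed and lists the resources changed after a given date. It assumes standard RSS 2.0 items with `title`, `link` and `pubDate` elements; the sample feed, its URLs and the `changed_since` helper are made up for illustration. In practice you would download each IPT's feed (for example with `urllib.request.urlopen`) and pass the content to the same function.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Invented sample feed standing in for what one IPT might return.
SAMPLE_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example IPT</title>
    <item>
      <title>Herbarium specimens</title>
      <link>https://ipt.example.org/resource?r=herbarium</link>
      <pubDate>Mon, 06 Nov 2023 10:15:00 +0000</pubDate>
    </item>
    <item>
      <title>Bird observations</title>
      <link>https://ipt.example.org/resource?r=birds</link>
      <pubDate>Tue, 03 Oct 2023 08:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>
"""


def changed_since(rss_xml, since):
    """Return (date, title, link) for items published after `since`, newest first."""
    root = ET.fromstring(rss_xml)
    changes = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        if not pub_date:
            continue  # skip items without a publication date
        when = parsedate_to_datetime(pub_date)
        if when > since:
            title = item.findtext("title", default="")
            link = item.findtext("link", default="")
            changes.append((when, title, link))
    return sorted(changes, reverse=True)


cutoff = datetime(2023, 11, 1, tzinfo=timezone.utc)
recent = changed_since(SAMPLE_FEED, cutoff)
for when, title, link in recent:
    print(when.date(), title, link)
```

Running this on a schedule for each of the IPTs and storing the last cutoff date would give a simple change log. It would not reveal removed resources or record-level changes, since the feed does not carry that information.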
