Adding GBIF identifiers to NCBI BioSample data during NCBI data upload

heidimeudt · March 21, 2024, 3:16am

Hello, I have a question about linking sequence data between NCBI and GBIF. Over the next few weeks we will be uploading a lot of sequence data to NCBI as BioSamples, using the Plant; version 1.0 package template. A host of other attributes can be used to annotate this kind of data; they are listed here: https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/

However, there is no attribute specifically for a GBIF identifier of the voucher specimen, which is what I want (in addition, of course, to using the voucher specimen field for the institution and accession number). This is surprising and disappointing, and it means not many people are attempting to add GBIF identifiers when uploading their BioSample data to NCBI. When I search for “GBIF” in the NCBI BioSample database, I get only 122 hits, which is very low and represents sequences from only a handful of projects. So some researchers are trying to add GBIF links to their BioSamples, but a BioSample search for “gbif” shows that they are adding them to all sorts of different attributes.

(Going the other way, once submitted I will add the NCBI identifiers to our herbarium database, and then I think these can be added to the GBIF record using “associated_sequences”.)

I’m sure some of you have experience with NCBI/GBIF, so at the very least, any suggestions as to where best to add the GBIF identifiers to my NCBI BioSamples would be greatly appreciated (or suggest a different forum to ask this question…).

rdmpage · March 21, 2024, 9:09am

Here be dragons. GBIF occurrence ids are not stable, hence these links can (and do) break. NCBI may be reluctant to add links that are known not to be persistent. As you state, there is a way to have the sequence linked to the record in the herbarium (which also makes assumptions about how persistent those links are; hint: not very).

tfroeslev · March 21, 2024, 9:40am

About linking from GBIF to BioSample:
For DNA-associated records (e.g. metabarcoding data), we recommend using the Material Sample ID to link to BioSamples and Associated sequence to link to read files, as in this example: Occurrence Detail 3890532237

The DNA publishing guide mentions this in Table 2.
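
As a minimal sketch of that recommendation, the snippet below assembles the two Darwin Core fields (`materialSampleID` and `associatedSequences`) for a single occurrence record. The accession numbers and URLs are placeholders rather than real records, and the helper name `dwc_sequence_links` is made up for illustration. The " | " separator follows the usual Darwin Core convention for lists of values.

```python
def dwc_sequence_links(occurrence_id, biosample_url, sequence_urls):
    """Build the Darwin Core fields that point from an occurrence to INSDC.

    materialSampleID holds the BioSample link; associatedSequences holds
    the read/sequence links, joined with " | ".
    """
    return {
        "occurrenceID": occurrence_id,
        "materialSampleID": biosample_url,
        "associatedSequences": " | ".join(sequence_urls),
    }


# Placeholder identifiers only; substitute the accessions NCBI assigns.
record = dwc_sequence_links(
    "urn:catalog:EXAMPLE:Herbarium:0000001",
    "https://www.ncbi.nlm.nih.gov/biosample/SAMN00000000",
    [
        "https://www.ncbi.nlm.nih.gov/sra/SRR0000001",
        "https://www.ncbi.nlm.nih.gov/sra/SRR0000002",
    ],
)
```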

rdmpage · March 21, 2024, 10:25am

@tfroeslev I think @heidimeudt wants to go in the other direction, NCBI to GBIF. There are mechanisms for linking sequences to natural history collections. For example, searching NCBI Biocollections for L[Institution code] finds institutions with code “L”; there is a page for the National Herbarium of the Netherlands (NHN), and these are the sequences GenBank knows about from specimens in L: "collection_7264"[Properties] - Nucleotide - NCBI

I had hoped this would then make a live link between the sequence and the specimen online in L, but that doesn’t seem to be the case. I’ve seen this functionality for some other collections.

In any discussion between NCBI and GBIF (and indeed any individual natural history collection) I suspect the key issue will be “how persistent are these links to specimens?”.

tfroeslev · March 21, 2024, 12:41pm

Sorry for not being specific. I was responding to the "Going the other way, once submitted I will add the NCBI identifiers to our herbarium database, and then I think these can be added to the GBIF record using “associated_sequences”" part.

tfroeslev · March 21, 2024, 12:46pm

I can add that there is a plan to show BioSample records in GBIF (some developments in ENA have made this possible). Currently GBIF exposes the INSDC sequences (the so-called flat files), but not BioSamples.
As for the INSDC sequences in GBIF, when there are sample/voucher IDs on the records, the GBIF clustering algorithm will be able to group such associated records.

heidimeudt · March 21, 2024, 10:44pm

Thanks Rod. Apologies for posting such a naive question! But I will follow it with another: how do I know which links and identifiers are persistent? Instead of using a GBIF identifier/link in my NCBI upload metadata, should I use one directly from the herbarium/institution (but are those persistent, and how can I tell)?

heidimeudt · March 21, 2024, 10:46pm

Thank you, this is exactly what I was looking for (answering the second part of my question)! Wonderful.

heidimeudt · March 21, 2024, 10:49pm

By the way, I also wrote the same question to the NCBI BioSamples team directly, and they replied that I can simply create my own attribute “GBIF identifier” and add it in (without further discussion of persistent links or why they don’t already have an attribute for this)…
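
A minimal sketch of what that could look like in practice, assuming the BioSample submission is prepared as a tab-separated file with a `sample_name` column. The attribute name follows the suggestion quoted above; the sample names, organism, and GBIF URL are placeholders, and `add_custom_attribute` is a hypothetical helper, not part of any NCBI tool.

```python
import csv
import io


def add_custom_attribute(tsv_text, attribute, values_by_sample):
    """Return tsv_text with one extra column appended.

    values_by_sample maps sample_name -> attribute value; samples
    without an entry get an empty cell.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames) + [attribute]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        row[attribute] = values_by_sample.get(row["sample_name"], "")
        writer.writerow(row)
    return out.getvalue()


# Placeholder rows; a real template has many more columns.
template = "sample_name\torganism\nS1\tExample species\nS2\tExample species\n"
updated = add_custom_attribute(
    template,
    "GBIF identifier",
    {"S1": "https://www.gbif.org/occurrence/0000000000"},
)
```

Whatever value goes in that column (occurrence URL, dataset DOI, or download DOI) is a separate choice, and the persistence caveats raised in this thread still apply.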

mgrosjean · March 22, 2024, 7:21am

@heidimeudt GBIF Occurrence URLs aren’t persistent and we have a video and a blogpost explaining why and what we do to try to improve it.

Generally, everything associated with a DOI is persistent on GBIF. This means that you can either cite the DOI of a dataset registered on GBIF (for example https://doi.org/10.15468/ib5ypt) or generate a download of occurrences via our interface or API. Each download is associated with a citable DOI.
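
For the API route, here is a rough sketch of the JSON body for an occurrence download request, narrowed to a single specimen by institution code and catalogue number. It only builds the payload; sending it is left out because it needs a GBIF account. The endpoint and field names reflect my understanding of the GBIF occurrence download API and should be checked against the current API documentation; the user name, email, and catalogue number are placeholders.

```python
import json

# Endpoint as I understand it; requests are POSTed with GBIF account credentials.
GBIF_DOWNLOAD_ENDPOINT = "https://api.gbif.org/v1/occurrence/download/request"


def build_download_request(creator, email, institution_code, catalog_number):
    """Build the request body for a download of one specimen's occurrence(s)."""
    return {
        "creator": creator,
        "notificationAddresses": [email],
        "format": "SIMPLE_CSV",
        "predicate": {
            "type": "and",
            "predicates": [
                {"type": "equals", "key": "INSTITUTION_CODE", "value": institution_code},
                {"type": "equals", "key": "CATALOG_NUMBER", "value": catalog_number},
            ],
        },
    }


# Placeholder values throughout.
payload = build_download_request(
    "example_user", "user@example.org", "L", "L.0000000"
)
body = json.dumps(payload, indent=2)
```

Once the download has been prepared, its DOI is what would go into the NCBI metadata, rather than the occurrence URL itself.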

As a side note, you are welcome to use the institution and collection codes and identifiers that are found on GRSciColl: https://scientific-collections.gbif.org/. We aim to keep the institution and collection URLs stable there as well (the specimen records come from GBIF, so you will encounter the same challenges).

rdmpage · March 22, 2024, 9:38am

@heidimeudt The short answer is that unless it’s a DOI, it is very hard to judge the persistence of a link. DOIs have a mechanism to facilitate persistence (redirection) and an infrastructure that supports persistence (including customer support for links that break), and they cost money (which gives those using them an incentive to keep them working).

Some herbaria aim to provide persistent links, but with varying success. Royal Botanic Gardens, Kew, for example, have broken all the direct links to their specimens, and don’t seem to care.

While we wait for the promised land of persistent links, perhaps the simplest strategy is to link to the GBIF record but also add that link to the Internet Archive’s Wayback Machine, so that someone in the future will be able to figure out what record you were linking to even if the original link has broken.
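
A small sketch of that strategy, assuming the Wayback Machine’s “save” URL pattern and its availability lookup behave as they do at the time of writing. It only constructs the URLs; visiting the first asks the Internet Archive to capture the page, and the second returns JSON describing the closest existing snapshot. The occurrence number is the one linked earlier in this thread, and the function names are illustrative.

```python
from urllib.parse import urlencode


def wayback_save_url(url):
    """URL that asks the Wayback Machine to capture `url`."""
    return "https://web.archive.org/save/" + url


def wayback_availability_url(url):
    """URL of the availability lookup for existing snapshots of `url`."""
    return "https://archive.org/wayback/available?" + urlencode({"url": url})


gbif_record = "https://www.gbif.org/occurrence/3890532237"
save_link = wayback_save_url(gbif_record)
check_link = wayback_availability_url(gbif_record)
```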

heidimeudt · April 23, 2024, 10:53pm

Thank you @mgrosjean, that video and blog post were really informative!
