DNA Data Publishing (GBIF technical support hour for Nodes)

The theme of the February session of the Technical Support Hour for Nodes is DNA data publishing. For this session, we welcome Tobias Guldberg Frøslev (@tfroeslev) from the Participation and Engagement team. Tobias will present the resources currently available for sharing DNA-derived data on GBIF, with a focus on a new tool for formatting environmental DNA datasets, and answer all your questions on the topic.

You are welcome to read more about DNA and GBIF here

The Data Product team will join as well. We will be happy to answer any question, whether or not it relates to the topic.

The event will be on the 7th of February 2024 at 4pm CET (UTC+1). The invitation with the registration link will be sent to the GBIF Nodes. If you are interested in attending, you can reach out to your local node.

The edited recording and the transcript of the questions will be made available here.


Would love to join this session. How do I connect to it?

@gambleb you need to contact the Node Manager in your country (for the US it is @sformel) and ask them to forward you the invitation.

@mgrosjean I am very sorry to ask this stupid question but what is the definition of a Node?

A Node is defined as:

a team designated by a Participant to coordinate a network of people and institutions that produce, manage and use biodiversity data, collectively building an infrastructure for delivering biodiversity information. They are supported by organizational arrangements and informatics solutions, working to improve the availability and usefulness of biodiversity data for research, policy and decision-making.

You can read more about how GBIF is organised here: https://www.gbif.org/the-gbif-network and about nodes here: Nodes

@gambleb as the saying goes, there are no stupid questions! You can learn more about the US node and the community we serve at gbif.us. One of our goals this year is to try to strengthen our network, so it might be worth us getting together to update each other on our work and priorities.

I’m glad you reached out about this, but I wasn’t able to easily find your email address online. Could you please send me an email (sformel@usgs.gov)? I will forward you the invitation to the technical hour.

If you’re generally interested in DNA data, I’ve been working with Tobias and a variety of groups in the US on developing strategy and workflows for publishing DNA data. I’d be happy to invite you to any of these groups, since it’s a complicated conversation with many needs to be served.


@sformel Thanks Steve. I sent you an email. Yes, we are very interested in the DNA publishing topic, especially around Environmental Samples and their DNA. I am not a biologist but rather a computer scientist and database administrator. I come from the perspective of what data needs to be captured, what system it should be captured in, and how to share it. Thanks for reaching out!
Beth :slight_smile:


The video is available here: DNA data publishing on Vimeo

Here is the transcript of the questions during the session.

The eDNA tool presented generates a “classical” Darwin Core Archive (DwCA). Will it create a file that works directly with the new Data Model?
Yes, that is why the tool first formats the data using the BIOM format (https://biom-format.org). These intermediate files are then used to create the DwCA. Future implementations of the new data model could directly ingest the BIOM files or a sampling event core with a DNA extension.
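To illustrate the idea of going from a BIOM-style table to Darwin Core rows, here is a minimal sketch. It is not the eDNA tool's code: the `table` dictionary only imitates the sparse layout of a BIOM file (sequences as rows, samples as columns, read counts as data), and the function names, sample values and choice of fields are assumptions made for the example.

```python
import csv
import io

# Hypothetical BIOM-like structure: sparse (row, column, count) triples,
# with rows as sequences (ASVs/OTUs) and columns as samples.
table = {
    "rows": [
        {"id": "asv1", "metadata": {"DNA_sequence": "ACGTACGT", "scientificName": "Fungi"}},
        {"id": "asv2", "metadata": {"DNA_sequence": "TTGACCGA", "scientificName": "Ascomycota"}},
    ],
    "columns": [
        {"id": "sampleA", "metadata": {"eventDate": "2023-06-01"}},
        {"id": "sampleB", "metadata": {"eventDate": "2023-06-02"}},
    ],
    "data": [[0, 0, 120], [0, 1, 4], [1, 1, 57]],
}


def to_dwc_rows(table):
    """Turn each non-zero (sequence, sample) cell into one occurrence row
    plus one matching row for a DNA-derived extension file."""
    occurrences, dna_rows = [], []
    for row_idx, col_idx, count in table["data"]:
        seq = table["rows"][row_idx]
        sample = table["columns"][col_idx]
        occurrence_id = f"{sample['id']}:{seq['id']}"  # assumed ID scheme
        occurrences.append({
            "occurrenceID": occurrence_id,
            "eventID": sample["id"],
            "eventDate": sample["metadata"]["eventDate"],
            "scientificName": seq["metadata"]["scientificName"],
            "organismQuantity": count,
            "organismQuantityType": "DNA sequence reads",
        })
        dna_rows.append({
            "occurrenceID": occurrence_id,
            "DNA_sequence": seq["metadata"]["DNA_sequence"],
        })
    return occurrences, dna_rows


def to_tsv(rows):
    """Serialise a list of dicts as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


occurrences, dna_rows = to_dwc_rows(table)
print(to_tsv(occurrences))
print(to_tsv(dna_rows))
```

A real archive would also need a meta.xml descriptor and dataset metadata, which this sketch leaves out.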

Is the eDNA tool presented public and in production mode? Can we (the Nodes) advertise it to the community?
The tool is available online for testing but is still in development. It can generate Darwin Core Archives but cannot publish them to GBIF.org. Publishers have to download the archive and publish it from an IPT or another publishing tool.

Where can I access this tool?
You can use the tool online. See also Invitation to share DNA metabarcoding data to test early pilot of data-publishing tool.

What is the registry that is being used for the assigned taxonomy choice in the eDNA tool?
The taxonomy is assigned to sequences using this other tool: Sequence ID. When you upload and process sequences in the eDNA tool, you need to specify the marker genes that were used (in the demonstration, the ITS2 region was used). This will then determine the reference database for taxonomic assignment. The higher taxonomy is based on the GBIF backbone taxonomy, which might not always correspond to the taxonomy from other reference databases.
You are very welcome to provide your own taxonomy associated with your sequences. You don’t have to use the built-in function.

In the presentation, you mentioned that the sequences from BOLD and MGnify go to GBIF automatically. Should we (Nodes) push data providers to use the eDNA tool or IPT to format and publish their data on GBIF? Or is it enough that they publish their data on one of those platforms? What is the difference between the two paths?

  • BOLD is for single sequences associated with specimens (which could also be referred to as “DNA-associated” data). If you have barcoded specimens associated with one of the barcodes supported by BOLD, you should share your data in BOLD, which will share rich data on GBIF. Note that BOLD doesn’t support all taxa and gene groups.
  • Some of the data shared on GenBank (or any of the International Nucleotide Sequence Database Collaboration’s platforms) are single sequences (flat files) associated with some data like coordinates and specimen voucher numbers (but not images). Those are automatically shared on GBIF.
  • The eDNA datasets are a bit different. The European Nucleotide Archive (ENA) offers a tool which can process the output of eDNA studies (fastq files). This tool, called MGnify, uses a standardised bioinformatics pipeline to (re-)analyse data shared on the ENA platforms. It cleans the data and assigns a taxonomy based on the latest version of the reference databases. The processed data generated by MGnify is then shared on GBIF. However, MGnify doesn’t process all the data deposited in the ENA platform (and only processes 16S data). Publishing eDNA data in ENA doesn’t mean that the data will be shared on GBIF. The only way to ensure that the eDNA data is shared on GBIF is for the data holders to publish it themselves. One advantage is that they know their data best and how to process them.

In general, publishing directly on GBIF allows the data to be cited and attributed to the original data provider.

Would someone know if their data were processed by MGnify?
No, they wouldn’t be notified.

Are there any plans to add additional filters in the GBIF portal for some genetic fields? Maybe something like target gene, primers, DNA sequence, etc.
Occurrence records can already be filtered by their associated extension(s). For example, here is a selection of occurrences published with a DNA derived extension. The dwc:associatedSequences field is also now indexed and searchable. See this example.
However, there is no short-term plan to index specific fields in the DNA derived data extension, such as primers.
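As a rough sketch of filtering by extension programmatically, the snippet below builds an occurrence search URL for the GBIF API. The `dwcaExtension` parameter name and the extension row type URI are written here from memory and should be treated as assumptions to check against the current API documentation; `dna_occurrence_search_url` is a hypothetical helper, and the snippet only builds the URL without sending a request.

```python
from urllib.parse import urlencode

API = "https://api.gbif.org/v1/occurrence/search"
# Assumed row type of the DNA derived data extension.
DNA_EXTENSION = "http://rs.gbif.org/terms/1.0/DNADerivedData"


def dna_occurrence_search_url(limit=5, **extra_filters):
    """Build a search URL for occurrences published with the DNA extension.

    Extra keyword arguments are passed through as additional filters.
    """
    params = {"dwcaExtension": DNA_EXTENSION, "limit": limit, **extra_filters}
    return f"{API}?{urlencode(params)}"


url = dna_occurrence_search_url(country="DK")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client.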

Will this tool be embedded in the IPT? From our (Nodes) perspective it would be really useful if it is the same tool.
We are exploring options. One option could be to embed the tool in the IPT; another would be to offer a mechanism so the IPT can fetch an archive from a tool. The idea would be that the archive is directly updated from the tool.
The current form of the tool is unlikely to be baked directly in the IPT.

Do you need additional help from the nodes for this model? We have made some tests, but if you need something more specific we can try to do it.
If you are interested in testing and helping, please let us know at dna@gbif.org, see also: Invitation to share DNA metabarcoding data to test early pilot of data-publishing tool. We have some instructions and links we are happy to send you.

Is there a general estimated date for the eDNA tool release?
Some challenges need to be addressed before the tool can be released to all. However, you are very welcome to use the online tool currently available.

Have people experienced changes following the recent eventDate interpretation change on GBIF? GBIF API: Supporting ranges in occurrence eventDate
The informatics team has been fixing a few issues that were found after the changes (many records got flagged unnecessarily and, for a little while, API facets on date ranges didn’t work). If you notice more issues, please report them.

We had a problem with GBIF interpreting the data without a time stamp. The default value for the time of day is midnight. This is misleading: it looks like people are collecting butterflies in the middle of the night.
The new interpretation of the data should no longer infer a default time of day. This issue should now be solved. Don’t hesitate to contact us if you notice anything odd.
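The difference between the two interpretations can be shown with a small sketch. This is not GBIF's implementation: `interpret_event_date` is a hypothetical helper that treats a date without a time as a range covering the whole day, instead of collapsing it to a single instant at midnight.

```python
from datetime import date, datetime, time


def interpret_event_date(value):
    """Return (start, end) datetimes covering everything the string could mean.

    A value with a time component is a single instant; a date-only value
    spans the whole day, so no time of day is invented.
    """
    if "T" in value:
        moment = datetime.fromisoformat(value)
        return moment, moment
    day = date.fromisoformat(value)
    return datetime.combine(day, time.min), datetime.combine(day, time.max)


print(interpret_event_date("2024-02-07"))
print(interpret_event_date("2024-02-07T14:30:00"))
```

With a range, a butterfly record dated only to the day is no longer presented as a midnight observation.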

I’ve been helping people working on eDNA data and we were wondering where to put some information such as the name of the company/lab doing the bioinformatics analysis?
There is no dedicated field for this type of information but publishers like MGnify have been using the dwc:identificationRemarks and the dwc:dynamicProperties fields to specify the bioinformatics pipeline used to process the data. dwc:dynamicProperties allows structured information.
Note that in the new model, such data could be formatted as another agent with a specific role.
You could also put it in the dataset metadata.
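As an illustration of structured information in dwc:dynamicProperties, the sketch below serialises the bioinformatics details as JSON. The key names, the lab name and the pipeline name are all made up for the example; they are not a GBIF or MGnify convention.

```python
import json

# Hypothetical keys and values describing who ran the analysis and how.
pipeline_info = {
    "bioinformaticsProvider": "Example Lab",
    "pipeline": "example-pipeline",
    "pipelineVersion": "1.0",
}

record = {
    "occurrenceID": "sampleA:asv1",
    # Free-text version for human readers.
    "identificationRemarks": "Taxonomy assigned by Example Lab with example-pipeline 1.0",
    # Machine-readable version as a JSON string.
    "dynamicProperties": json.dumps(pipeline_info, sort_keys=True),
}

print(record["dynamicProperties"])
```

Keeping the same keys across all records in a dataset makes the field easier to parse later.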
