The video is available here: DNA data publishing on Vimeo
Here is the transcript of the questions during the session.
The eDNA tool presented generates a “classical” Darwin Core Archive (DwCA), will it create a file that works directly with the new Data Model?
Yes, that’s why the tool first formats the data using the BIOM format (https://biom-format.org). These intermediate files are then used to create the DwCA. Future implementations of the new data model could directly ingest the BIOM files, or a sampling event core with a DNA extension.
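As a rough illustration of that intermediate step, the sketch below flattens a toy OTU-table-like structure (a simplified stand-in for the BIOM sample × observation matrix; the field choices and IDs are hypothetical, not the tool’s actual output) into Darwin Core occurrence rows attached to sampling events:

```python
def biom_like_to_occurrences(otu_table, taxonomy):
    """Flatten an OTU-table-like dict (a simplified stand-in for a BIOM
    matrix) into Darwin Core occurrence rows. Field choices are
    illustrative only."""
    rows = []
    for otu_id, samples in otu_table.items():
        for sample_id, count in samples.items():
            if count == 0:
                continue  # absent OTUs produce no occurrence record
            rows.append({
                "occurrenceID": f"{sample_id}:{otu_id}",
                "eventID": sample_id,  # links to a sampling event core
                "scientificName": taxonomy.get(otu_id, ""),
                "organismQuantity": count,
                "organismQuantityType": "DNA sequence reads",
            })
    return rows

# Tiny example: two OTUs observed across two sampling events
otu_table = {"OTU_1": {"sample_A": 120, "sample_B": 0},
             "OTU_2": {"sample_A": 3, "sample_B": 57}}
taxonomy = {"OTU_1": "Amanita muscaria", "OTU_2": "Boletus edulis"}
occurrences = biom_like_to_occurrences(otu_table, taxonomy)
```

One occurrence is emitted per non-zero OTU-by-sample cell, which is the general shape a DwCA built from a BIOM file would take.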
Is the eDNA tool presented public and in production mode? Can we (the Nodes) advertise it to the community?
The tool is available online for testing but still in development. It allows you to generate Darwin Core Archives, but it does not publish them on GBIF.org. Publishers have to download the archive and publish it from an IPT or another publishing tool.
Where can I access this tool?
You can use the tool online. See also Invitation to share DNA metabarcoding data to test early pilot of data-publishing tool.
What is the registry that is being used for the assigned taxonomy choice in the eDNA tool?
The taxonomy is assigned to sequences using this other tool: Sequence ID. When you upload and process sequences in the eDNA tool, you need to specify the marker genes that were used (in the demonstration, the ITS2 region was used). This will then determine the reference database for taxonomic assignment. The higher taxonomy is based on the GBIF backbone taxonomy, which might not always correspond to the taxonomy from other reference databases.
You are very welcome to provide your own taxonomy associated with your sequences. You don’t have to use the built-in function.
In the presentation, you mentioned that the sequences from BOLD and MGnify go to GBIF automatically. Should we (Nodes) push data providers to use the eDNA tool or IPT to format and publish their data on GBIF? Or is it enough that they publish their data on one of those platforms? What is the difference between the two paths?
- BOLD is for single sequences associated with specimens (which could also be referred to as “DNA-associated” data). If you have barcoded specimens associated with one of the barcodes supported by BOLD, you should share your data in BOLD, which will share rich data on GBIF. Note that BOLD doesn’t support all taxa and gene groups.
- Some of the data shared on GenBank (or any of the International Nucleotide Sequence Database Collaboration platforms) are single sequences (flat files) associated with some data, such as coordinates and specimen voucher numbers (but not images). Those are automatically shared on GBIF.
- The eDNA datasets are a bit different. The European Nucleotide Archive (ENA) offers a tool, MGnify, which can process the output of eDNA studies (FASTQ files). MGnify uses a standardised bioinformatics pipeline to (re-)analyse data shared on the ENA platforms: it cleans the data and assigns a taxonomy based on the latest version of the reference databases. The processed data generated by MGnify are then shared on GBIF. However, MGnify doesn’t process all the data deposited on the ENA platform (it only processes 16S data), so publishing eDNA data in ENA doesn’t mean that the data will be shared on GBIF. The only way to ensure that eDNA data are shared on GBIF is for the data holders to publish them themselves. One advantage is that they know their data best, including how to process them.
In general, publishing directly on GBIF allows the data to be cited and attributed to the original data provider.
Would someone know if their data were processed by MGnify?
No, they wouldn’t be notified.
Are there any plans to add additional filters in the GBIF portal for some genetic fields, such as target gene, primers, or DNA sequence?
Occurrence records can already be filtered by their associated extension(s). For example, here is a selection of occurrences published with a DNA derived extension. The dwc:associatedSequences field is also now indexed and searchable. See this example.
However, there is no short-term plan to index specific fields in the DNA-derived data extension, such as primers.
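For reference, such filtered searches can also be run against the GBIF occurrence API. The sketch below only builds the query URLs (no request is made); the `dwcaExtension` parameter name and the accession string are my reading of the current API and should be checked against the GBIF API documentation:

```python
from urllib.parse import urlencode

BASE = "https://api.gbif.org/v1/occurrence/search"

def search_url(**params):
    """Build a GBIF occurrence search URL (no HTTP request is made)."""
    return f"{BASE}?{urlencode(params)}"

# Occurrences published with the DNA-derived data extension.
# `dwcaExtension` is my understanding of the parameter name; verify it
# against the GBIF API documentation before relying on it.
url = search_url(dwcaExtension="http://rs.gbif.org/terms/1.0/DNADerivedData",
                 limit=5)

# Full-text search that can match indexed dwc:associatedSequences values
# ("AB123456" is a hypothetical accession used purely as an example).
url2 = search_url(q="AB123456", limit=5)
```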
Will this tool be embedded in the IPT? From our (Nodes) perspective, it would be really useful if it were the same tool.
We are exploring options. One option could be to embed the tool in the IPT; another would be to offer a mechanism so the IPT can fetch an archive from the tool, with the archive updated directly from the tool.
The current form of the tool is unlikely to be baked directly into the IPT.
Do you need additional help from the nodes for this model? We have made some tests, but if you need something more specific we can try to do it.
If you are interested in testing and helping, please let us know at email@example.com, see also: Invitation to share DNA metabarcoding data to test early pilot of data-publishing tool. We have some instructions and links we are happy to send you.
Is there a general estimated date for the eDNA tool release?
Some challenges need to be addressed before the tool can be released to all. However, you are very welcome to use the online tool currently available.
Have people experienced changes following the recent eventDate interpretation change on GBIF? See GBIF API: Supporting ranges in occurrence eventDate.
The informatics team has been fixing a few issues that were found after the changes (many records got flagged unnecessarily, and for a little while, API facets on date ranges didn’t work). If you notice more issues, please report them.
We had a problem with GBIF interpreting dates without a time stamp: the default value for the time of day was midnight. This is misleading; it looks like people are collecting butterflies in the middle of the night.
The new interpretation should no longer infer a default time of day, so this issue should now be solved. Don’t hesitate to contact us if you notice anything odd.
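A minimal sketch of how a publisher-side script might handle the range form of dwc:eventDate without inventing a spurious midnight timestamp (it assumes complete ISO dates only, not partial values like "2021-03"):

```python
from datetime import date

def parse_event_date(value):
    """Parse a dwc:eventDate that may be a single ISO 8601 date or a
    range like '2021-03-01/2021-03-15', which GBIF now supports.
    Sketch only: handles complete dates, not partial ones."""
    parts = value.split("/")
    start = date.fromisoformat(parts[0])
    end = date.fromisoformat(parts[1]) if len(parts) > 1 else start
    # Keeping date objects (not datetimes) avoids inferring a default
    # time of day when none was recorded.
    return start, end

start, end = parse_event_date("2021-03-01/2021-03-15")
```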
I’ve been helping people working with eDNA data, and we were wondering where to put some information, such as the name of the company/lab doing the bioinformatics analysis?
There is no dedicated field for this type of information, but publishers like MGnify have been using the dwc:identificationRemarks and dwc:dynamicProperties fields to specify the bioinformatics pipeline used to process the data. dwc:dynamicProperties allows structured information.
Note that in the new model, such data could be formatted as another agent with a specific role.
You could also put it in the dataset metadata.
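Following the practice described above, a record could carry the lab and pipeline details as a JSON object in dwc:dynamicProperties. This is a hypothetical example (the lab name, key names, and pipeline string are all illustrative, not a prescribed vocabulary):

```python
import json

# Hypothetical occurrence record: pipeline details go into
# dwc:identificationRemarks (free text) and dwc:dynamicProperties
# (JSON), mirroring how publishers like MGnify use these fields.
record = {
    "occurrenceID": "sample_A:OTU_1",
    "identificationRemarks": "Taxonomy assigned with the MGnify v5.0 pipeline",
    "dynamicProperties": json.dumps({
        "bioinformaticsLab": "Example Sequencing Ltd.",  # hypothetical name
        "pipeline": "MGnify v5.0",
    }),
}
```

Because dynamicProperties holds a JSON string, downstream users can parse it back into structured key–value pairs rather than scraping free text.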