Sequence-based data on GBIF - What you need to know before analyzing data - GBIF Data Blog

As I mentioned in my previous post, a lot more sequence-based data has been made available on GBIF this past year. MGnify alone published 295 datasets for a total of 13,285,109 occurrences. Even though most of these occurrences are Bacteria or Chromista, more than a million of them are animals and more than 300,000 are plants. So chances are that, even if you are not interested in bacteria, you might encounter sequence-based data on GBIF.


This is a companion discussion topic for the original entry at https://data-blog.gbif.org/post/gbif-molecular-data-quality/

Many thanks, @dodobot, for this post. It’s good to see the topic explored. I’d like to quickly add two comments.

First, if a Material Sample does indeed relate to organisms cultured in a dish (or similar), these may be just as robustly identified as any observable specimen. A subset of data in our global network has always come from living culture collections, where live strains of microorganisms are maintained for long-term reference and use. In such cases, even if BasisOfRecord is MaterialSample, we’d be better off treating it the same as a specimen.

Secondly, you are quite correct to give guidance on recognising and filtering records that derive purely from DNA/RNA, but it would be sad if readers simply assumed that such data are always worse than data from other sources. (You don’t say this!) In particular, all methods vary in their ability to detect all species in a sample or community, so that is not a unique weakness of metagenomics, and all methods lead to some proportion of misidentifications. As we proceed, it will be good to make the basis of evidence clear rather than just the basis of record, and for us to develop models of uncertainty and repeatability around different methods/protocols (including selection of different sequencing platforms, primers, etc.).


Hi! To find out what I can do, say @dodobot display help.

Thanks for the quick reply, GBIF Dodo! I should have acknowledged @mgrosjean instead, so thanks, Marie.


Thank you @dhobern for your comments.
I tried to find some examples of datasets with Material Sample occurrences that are not inferred from sequence analysis. Surprisingly, I mostly found non-bacterial occurrences (although some were microorganisms).
Here are a few examples:

In all three cases, the basis of record could perhaps be observation, but this is not clear.
I agree, it would be good to emphasise the basis of evidence instead of just having the basis of record.

Here is an example of a data quality issue associated with sequence-based occurrences: 90 occurrences of tobacco (Nicotiana attenuata Torr. ex S.Watson) all over the planet from MGnify.
Are these true positives? Maybe.
Does Nicotiana attenuata grow in the sea? Probably not; maybe the Tara ship had some smokers.
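
For anyone who wants to set such records aside before an analysis, here is a minimal sketch in Python. It assumes the occurrences are already loaded as a list of dictionaries; the field names (`species`, `basisOfRecord`, `publisher`), the helper `split_material_samples` and the sample values are all made up for illustration and are not taken from an actual download. Since a Material Sample is not necessarily sequence-derived (see the culture collection comment above), the sketch only separates these records for review and does not discard them.

```python
from collections import Counter

# Hypothetical records with Darwin Core-style field names; values are invented.
records = [
    {"species": "Nicotiana attenuata", "basisOfRecord": "MATERIAL_SAMPLE", "publisher": "MGnify"},
    {"species": "Nicotiana attenuata", "basisOfRecord": "PRESERVED_SPECIMEN", "publisher": "Example Herbarium"},
    {"species": "Nicotiana attenuata", "basisOfRecord": "MATERIAL_SAMPLE", "publisher": "MGnify"},
    {"species": "Nicotiana attenuata", "basisOfRecord": "HUMAN_OBSERVATION", "publisher": "Example Survey"},
]


def split_material_samples(records):
    """Return (to_review, others): MATERIAL_SAMPLE records versus the rest."""
    to_review = [r for r in records if r.get("basisOfRecord") == "MATERIAL_SAMPLE"]
    others = [r for r in records if r.get("basisOfRecord") != "MATERIAL_SAMPLE"]
    return to_review, others


to_review, others = split_material_samples(records)

# Counting by publisher shows which sources contribute the records to check.
by_publisher = Counter(r["publisher"] for r in to_review)
print(len(to_review), len(others), dict(by_publisher))
```

Whether the records in `to_review` are kept, dropped or checked one by one depends on the question being asked; the point is only to make the basis of record visible before deciding.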