Metagenomics and metacrap

I’ve written a blog post looking at a case where metagenomic data results in GBIF’s map for the parasitic plant Rafflesia (famous for its giant flowers) showing occurrences in the ocean:

Turns out the identification is based on a short sequence for a picoplankton being matched to a flowering plant. I worry that these sorts of errors may be widespread, and that they are hard to track down (I had to get original sequence data and do a BLAST search).

I’m not arguing against including metagenomic data, quite the opposite, but this stuff has errors, and those errors may “leak” into unexpected places. GBIF already has enough issues with data quality, so perhaps we could think of ways to minimise the impact of spurious identifications of metagenomic data.


As a further example of the impact of “metacrap”, here is the GBIF page for the cricket Paroecanthus, native to Central and South America. The bulk of the data comes from metagenomics, and from multiple datasets.

I BLASTed one “Paroecanthus” sequence


and got back “Uncultured eukaryote clones” (e.g., AY605189.1) and Apicomplexa (e.g., HQ876008.1), the latter from a paper entitled "Identification of a divergent environmental DNA sequence clade using the phylogeny of gregarine parasites (Apicomplexa) from crustacean hosts" PMID: 21483868.

There are some systemic issues here; maybe these can be addressed by using stricter filters on identification (e.g., requiring a closer match before accepting a taxonomic identification). Looking forward to seeing how this is resolved.
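To make the "stricter filter" idea concrete, here is a minimal sketch, not anything GBIF or the data publishers actually run. It assumes each BLAST hit has been reduced to a taxon name, a percent identity and a query coverage; the `BlastHit` class, the `accept_identification` function and both thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlastHit:
    # Hypothetical, simplified view of one BLAST hit.
    taxon: str
    percent_identity: float  # 0-100
    query_coverage: float    # 0-100, share of the query that aligned


def accept_identification(
    hits: List[BlastHit],
    min_identity: float = 97.0,   # illustrative threshold, not a recommendation
    min_coverage: float = 90.0,   # illustrative threshold, not a recommendation
) -> Optional[str]:
    """Return the best-matching taxon only if it clears both thresholds."""
    passing = [
        h for h in hits
        if h.percent_identity >= min_identity and h.query_coverage >= min_coverage
    ]
    if not passing:
        # No hit is close enough: leave the sequence unidentified
        # rather than assign a weakly supported name.
        return None
    best = max(passing, key=lambda h: (h.percent_identity, h.query_coverage))
    return best.taxon


# Made-up values: a weak match is rejected, a strong one is accepted.
weak = [BlastHit("Some flowering plant", 88.0, 60.0)]
strong = [BlastHit("Some picoplankton", 99.2, 100.0)]
print(accept_identification(weak))    # None
print(accept_identification(strong))  # Some picoplankton
```

The point of returning `None` is that an unidentified sequence is less harmful than a confident but wrong name on a map.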


Note also that these errors have implications for citation counts (e.g., work by @dnoesgaard). The TARA dataset has three citations; two are publications, and these have nothing to do with marine organisms. So in effect this metagenomics dataset is getting credit for spurious data.



Thank you for raising this, which prompted a discussion with the data publishers. To improve matters, we applied some further filters in the data preparation, removing Metazoa (Animalia) and all Viridiplantae that are not Chlorophyta.
As an example, the following map image illustrates the result you reference in the blog for Rafflesia:

And the map for Paroecanthus now looks like this:


Hi @thomasstjerne, many thanks for following this up.

I’ve two quick comments. If this is after-the-fact filtering, then that doesn’t really address the problem - what if there are valid Metazoa and Viridiplantae in the original data? They have now been a priori excluded.

Furthermore, whatever filtering has been applied doesn’t seem to work completely. For example, looking at the taxon “wheel” for the TARA dataset, there are still a few animals, in this case the genus Minorissa, which is another orthopteran that is apparently marine.

Hi Rod
Thanks again for your input.
There might be valid Metazoa and Viridiplantae in the data, but the outcome of the discussion is that in those groups the 16S and 18S regions alone are not enough evidence to claim a species occurrence.

For the second point, you are right. This is because not all lineages produced by the metagenomic classification pipeline have a kingdom rank in them. Minorissa comes out with this classification: Eukaryota::::::Minorisa (i.e. only the domain and the genus). So it passes through the filter. On the next ingestion we could further remove data with an unknown kingdom.
The filtering is applied before the DwC-A files are generated for ingestion into GBIF.
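As a rough illustration of the filter described above and of why a lineage without a kingdom slips through, here is a minimal sketch. It is not the actual preparation code: it assumes the lineage string is colon-separated in the order domain, kingdom, phylum, class, order, family, genus (inferred from the Minorisa example), and the function names and the `drop_unknown_kingdom` option are hypothetical.

```python
# Assumed rank order of the colon-separated lineage string.
RANKS = ("domain", "kingdom", "phylum", "class", "order", "family", "genus")


def parse_lineage(lineage: str) -> dict:
    """Split a lineage string into a rank -> name mapping (empty string if missing)."""
    parts = lineage.split(":")
    parts += [""] * (len(RANKS) - len(parts))  # pad short lineages
    return dict(zip(RANKS, parts))


def keep_record(lineage: str, drop_unknown_kingdom: bool = False) -> bool:
    """Apply the filter: drop Metazoa and all Viridiplantae except Chlorophyta."""
    taxa = parse_lineage(lineage)
    kingdom = taxa["kingdom"]
    if kingdom in ("Metazoa", "Animalia"):
        return False
    if kingdom == "Viridiplantae" and taxa["phylum"] != "Chlorophyta":
        return False
    if drop_unknown_kingdom and not kingdom:
        # Proposed extra step: also drop records with no kingdom at all.
        return False
    return True


# The Minorisa lineage has no kingdom, so the basic filter lets it through.
print(keep_record("Eukaryota::::::Minorisa"))                             # True
print(keep_record("Eukaryota::::::Minorisa", drop_unknown_kingdom=True))  # False
print(keep_record("Eukaryota:Metazoa:::::Paroecanthus"))                  # False
```

Dropping unknown-kingdom records is a blunt instrument, since it would also remove correctly identified protists such as Minorisa, so it trades recall for fewer spurious matches downstream.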


Ok, makes sense. I understand that this is a challenging task.

FYI I think the issue with “Minorissa” is a taxonomic mismatch (note the “Taxon match fuzzy” flag on these occurrences). The sequences are from “Minorisa” (one “s”), which is described here:

Del Campo, J., Not, F., Forn, I., Sieracki, M. E., & Massana, R. (2012). Taming the smallest predators of the oceans. The ISME Journal, 7(2), 351–358. doi:10.1038/ismej.2012.85, PMC3554395

Thanks Rod. Yes the species match provides that. Fuzzy matching is necessary to handle all the various misspellings etc. which are (or were) numerous. However, it is also making mistakes, as we see here. One idea is to avoid fuzzy matching where no higher taxa are provided in the occurrence record - we’re running some numbers now. More soon…

Following an analysis of data, we’ve opened this issue whereby GBIF will stop fuzzy matching records with no higher taxa. This won’t fix everything but will fix the Minorisa issue and stop GBIF from introducing as many errors. Publishers will be approached with a suggestion to add some higher taxa, and default values will be injected on datasets (the registry supports this) where it makes sense. I suggest any necessary discussion on fuzzy matching continue on the GitHub issue, leaving this thread to focus on metagenomic-specific discussion.
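For readers who want to see the rule in miniature, the sketch below shows the idea of skipping fuzzy matching when a record carries no higher taxa. It is only an illustration under stated assumptions and not GBIF's species matching code: the record is a plain dictionary, the backbone is a plain list of names, `difflib` stands in for the real fuzzy matcher, and the `match_name` function, the cutoff and the match labels are hypothetical.

```python
import difflib
from typing import List, Optional, Tuple

HIGHER_RANKS = ("kingdom", "phylum", "class", "order", "family")


def match_name(
    record: dict,
    backbone_names: List[str],
    cutoff: float = 0.9,  # illustrative similarity cutoff
) -> Tuple[Optional[str], str]:
    """Return (matched name, match type); fuzzy matching needs higher taxa."""
    name = record["scientificName"]
    if name in backbone_names:
        return name, "EXACT"
    has_higher_taxa = any(record.get(rank) for rank in HIGHER_RANKS)
    if not has_higher_taxa:
        # Nothing to sanity-check a fuzzy candidate against, so do not guess.
        return None, "NONE"
    close = difflib.get_close_matches(name, backbone_names, n=1, cutoff=cutoff)
    if close:
        return close[0], "FUZZY"
    return None, "NONE"


backbone = ["Minorissa"]  # toy backbone holding only the orthopteran genus
# No higher taxa supplied: the protist name is left unmatched.
print(match_name({"scientificName": "Minorisa"}, backbone))
# Higher taxa supplied: a misspelling can still be rescued by fuzzy matching.
print(match_name({"scientificName": "Minorisa", "kingdom": "Animalia"}, backbone))
```

A real matcher would presumably also compare the fuzzy candidate's classification with the higher taxa on the record; this sketch only covers the "no higher taxa, no fuzzy match" rule.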


Very interesting post! Thank you, very instructive!
