Metagenomics and metacrap

I’ve written a blog post looking at a case where metagenomic data results in GBIF’s map for the parasitic plant Rafflesia (famous for its giant flowers) showing occurrences in the ocean.

Turns out the identification is based on a short sequence for a picoplankton being matched to a flowering plant. I worry that these sorts of errors may be widespread, and that they are hard to track down (I had to get original sequence data and do a BLAST search).

I’m not arguing against including metagenomic data, quite the opposite, but this stuff has errors, and those errors may “leak” into unexpected places. GBIF already has enough issues with data quality, so perhaps we could think of ways to minimise the impact of spurious identifications of metagenomic data.


As a further example of the impact of “metacrap”, here is the GBIF page for the cricket Paroecanthus, native to Central and South America. The bulk of the data comes from metagenomics, and from multiple datasets.

I BLASTed one “Paroecanthus” sequence


and got back “Uncultured eukaryote clones” (e.g., AY605189.1) and Apicomplexa (e.g., HQ876008.1), the latter from a paper entitled "Identification of a divergent environmental DNA sequence clade using the phylogeny of gregarine parasites (Apicomplexa) from crustacean hosts" (PMID: 21483868).

There are some systemic issues here. Maybe these can be addressed by using stricter filters on identification (e.g., requiring a closer match before accepting a taxonomic identification). Looking forward to seeing how this is resolved.
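To make the "stricter filter" idea concrete, here is a rough sketch of what I have in mind. It assumes the matches are available as BLAST+ tabular output (`-outfmt 6`, default columns) and only accepts an identification when the best hit clears both a percent-identity and an alignment-length threshold. The thresholds (97% and 200 bp), the function names, and the example rows are all made up for illustration; I don't know what the datasets in question actually use, and sensible cut-offs will depend on the marker.

```python
from dataclasses import dataclass

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


@dataclass
class Hit:
    query: str
    subject: str
    pident: float   # percent identity of the alignment
    length: int     # alignment length in bases


def parse_hits(text):
    """Parse tab-separated BLAST hits, skipping blank and comment lines."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        row = dict(zip(FIELDS, line.split("\t")))
        hits.append(Hit(row["qseqid"], row["sseqid"],
                        float(row["pident"]), int(row["length"])))
    return hits


def accept_identifications(hits, min_identity=97.0, min_length=200):
    """Map each query to its best hit passing both thresholds, else None.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    best = {}
    for h in hits:
        best.setdefault(h.query, None)  # remember queries with no good hit
        if h.pident < min_identity or h.length < min_length:
            continue
        current = best[h.query]
        if current is None or (h.pident, h.length) > (current.pident, current.length):
            best[h.query] = h
    return best


# Hypothetical rows: read_1 has only a short, weak match; read_2 a close one.
EXAMPLE = "\n".join([
    "\t".join(["read_1", "subject_A", "84.2", "120", "19", "0",
               "1", "120", "5", "124", "1e-20", "95.0"]),
    "\t".join(["read_2", "subject_B", "99.1", "310", "3", "0",
               "1", "310", "10", "319", "1e-150", "540"]),
])

if __name__ == "__main__":
    for query, hit in accept_identifications(parse_hits(EXAMPLE)).items():
        label = hit.subject if hit else "unidentified"
        print(query, "->", label)
```

The point is that a read with only a short, low-identity match would be left unidentified (or assigned to a higher rank) rather than given the name of whatever it happened to hit.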


Note also that these errors have implications for citation counts (e.g., work by @dnoesgaard). The TARA dataset has three citations; two are publications, and these have nothing to do with marine organisms. So in effect this metagenomics dataset is getting credit for spurious data.
