Hey y’all:
In working with biodiversity datasets containing millions of names (e.g., Global Biotic Interactions [1]), I continue to look for tools that help match taxonomic names.
Many such tools are available (e.g., taxize [2], globalnames resolver, taxadb [3]). However, many of them (e.g., taxize, globalnames resolver) rely on web APIs, whereas others (e.g., taxadb) operate in an R [4] environment only.
For me, the use of web APIs is limited by my poor internet connection, access controls, rate limiting, and/or external processing delays. Also, requiring a full R installation just to match taxonomic names seems a bit too much.
In the end, I often end up writing my own tools that meet my needs.
For instance, in collaboration with, or with input/contributions from, folks like:
- Nicolas Le Guillarme (nleguillarme on GitHub, ORCID, TaxoNERD [5]),
- José Augusto Salim (zedomel on GitHub, https://br.linkedin.com/in/joseasalim, Apiagri), and
- Carl Boettiger (cboettig on GitHub, ORCID, https://carlboettig.info),
I’ve contributed to Nomer, a (taxonomic) name/id matcher, for some time now.
With today’s release of Nomer v0.2.0, I can match about 1M names in less than a minute (at ~20k names/s) using a locally indexed version of the GBIF backbone taxonomy on a ~10-year-old laptop running Ubuntu 18.04.
With this, I no longer have to rely on an internet connection or worry about stressing/overloading some external server (for details, see "support offline GBIF backbone matcher", Issue #40 in globalbioticinteractions/nomer on GitHub). Also, the version of the GBIF backbone taxonomy only changes when I choose to update it. However, I do realize that Nomer has plenty of room for improvement, and that these improvements take time and effort to implement.
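To make the offline idea concrete, here is a minimal Python sketch of exact name matching against a local, in-memory index. This is not how Nomer is implemented; the function names (`normalize`, `build_index`, `match_names`), the `EX:` ids, and the tab-separated `id<TAB>name` layout are all assumptions made for illustration.

```python
import io


def normalize(name):
    """Collapse runs of whitespace and lowercase, so trivial variants still match."""
    return " ".join(name.split()).lower()


def build_index(lines):
    """Build a name -> id lookup from tab-separated 'id<TAB>name' lines."""
    index = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        taxon_id, name = line.split("\t", 1)
        # Keep the first id seen for a name; a real tool would handle homonyms.
        index.setdefault(normalize(name), taxon_id)
    return index


def match_names(names, index):
    """Yield (provided name, matched id or None) using exact lookups only."""
    for name in names:
        yield name, index.get(normalize(name))


# Hypothetical sample data; a real index would be loaded from a local taxonomy dump.
SAMPLE = "EX:1\tHomo sapiens\nEX:2\tCanis lupus\n"

index = build_index(io.StringIO(SAMPLE))
results = list(match_names(["homo  sapiens", "Canis lupus", "Foo bar"], index))
for provided, matched in results:
    print(f"{provided}\t{matched}")
```

Once the index is built, each lookup is a single dictionary access with no network round trip. Synonyms, homonyms, fuzzy matching, and indexes too large for memory are left out of this sketch.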
Because name matching is such a common operation, I continue to look for existing tools with similar features (offline-enabled, simple to install, >10k matches/s): I’d rather re-use an existing tool than spend significant time maintaining yet another tool, as I do now.
My questions are:
1. Can you recommend an existing tool that helps to do fast, offline-enabled, taxonomic name matching on a run-of-the-mill laptop?
and,
2. What tools do you use for taxonomic name/id matching?
Looking forward to your insights and suggestions,
-jorrit
References
[1] Jorrit H. Poelen, James D. Simons and Chris J. Mungall. (2014). Global Biotic Interactions: An open infrastructure to share and analyze species-interaction datasets. Ecological Informatics.
[2] Scott Chamberlain, Eduard Szocs (2013). “taxize - taxonomic search and retrieval in R.” F1000Research.
[3] Kari E. A. Norman, Scott Chamberlain, and Carl Boettiger (2020). taxadb: A high-performance local taxonomic database interface. Methods in Ecology and Evolution, 11(9), 1153-1159. https://doi.org/10.1111/2041-210X.13440.
[4] R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/
[5] Le Guillarme, N., & Thuiller, W. (2021). TaxoNERD: deep neural models for the recognition of taxonomic entities in the ecological and evolutionary literature. bioRxiv. https://doi.org/10.1101/2021.06.08.444426.
PS @Debbie @JoeMiller - I believe that my question is related to your interest in finding ways to better collaborate (11. Partnerships to collaborate more effectively). Why is it that we do not seem to have an equivalent of grep for taxonomic name matching yet? What can we do to help reduce expensive re-work?