Looking for offline-enabled name/id lookup in GBIF taxonomy backbone with >10k matches/s

jhpoelen · August 19, 2021, 6:31pm

Hey y’all:

In working with biodiversity datasets containing millions of names (e.g., Global Biotic Interactions [1]), I continue to look for tools that help match taxonomic names.

Many such tools are available (e.g., taxize [2], globalnames resolver, taxadb [3]). However, many (e.g., taxize, globalnames resolver) rely on web APIs, whereas others (e.g., taxadb) operate in an R [4] environment only.

For me, the use of web APIs is limited due to my poor internet connection, access controls, rate limiting, and/or external processing delays. Also, having to install a full R environment for a taxonomic name match seems a bit too much.

In the end, I often end up writing my own tools that meet my needs.

For instance, in collaboration with, or with input/contributions from, folks like:

Nicolas Le Guillarme (nleguillarme · GitHub, ORCID, TaxoNERD [5]),
José Augusto Salim (zedomel (José Augusto Salim) · GitHub, https://br.linkedin.com/in/joseasalim, Apiagri), and
Carl Boettiger (cboettig (Carl Boettiger) · GitHub, ORCID, https://carlboettig.info),

I’ve contributed to Nomer, a (taxonomic) name/id matcher, for some time now.

With today’s release of Nomer v0.2.0, I can match about 1M names in less than a minute (at ~20k names/s) using a locally indexed version of the GBIF backbone taxonomy on a ~10 year old laptop running Ubuntu 18.04.
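To illustrate the general idea of offline matching against a local index (a minimal sketch only, and not Nomer's actual implementation): names are loaded once into an in-memory hash map, after which each lookup is a single dictionary access with no network involved. The tab-separated `id<TAB>name` layout, the sample rows with their placeholder ids, and the function names `normalize`, `build_index`, and `match_names` are all assumptions made for this example.

```python
import io
from typing import Dict, Iterable, Iterator, Optional, Tuple


def normalize(name: str) -> str:
    """Collapse runs of whitespace and lowercase the name."""
    return " ".join(name.split()).lower()


def build_index(tsv_lines: Iterable[str]) -> Dict[str, str]:
    """Build a name -> id map from 'id<TAB>name' lines (hypothetical layout)."""
    index: Dict[str, str] = {}
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        taxon_id, name = line.split("\t", 1)
        # Keep the first id seen for a name; a real backbone needs homonym handling.
        index.setdefault(normalize(name), taxon_id)
    return index


def match_names(
    index: Dict[str, str], names: Iterable[str]
) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (provided name, matched id or None) for each input name."""
    for name in names:
        yield name, index.get(normalize(name))


# Placeholder rows: the ids are invented for illustration, not real GBIF keys.
SAMPLE = "EX:1\tHomo sapiens\nEX:2\tPuma concolor\n"
index = build_index(io.StringIO(SAMPLE))
results = list(match_names(index, ["homo  sapiens", "Puma concolor", "Felis catus"]))
for name, taxon_id in results:
    print(f"{name}\t{taxon_id or 'NONE'}")
```

A plain dictionary only covers exact matches after normalization; synonyms, homonyms, and misspellings would need extra handling on top of this.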

With this, I no longer have to rely on an internet connection or worry about stressing/overloading some external server (for details, see support offline GBIF backbone matcher · Issue #40 · globalbioticinteractions/nomer · GitHub). Also, the version of the GBIF backbone taxonomy only changes when I choose to update it. However, I do realize that Nomer has plenty of room for improvement, and that these improvements take time and effort to realize.

Because name matching is such a common operation, I continue to look for existing tools with similar features (offline-enabled, simple to install, >10k matches/s): I’d rather re-use an existing tool than spend significant time maintaining yet another tool, as I do now.

My questions are:

1. Can you recommend an existing tool that helps with fast, offline-enabled, taxonomic name matching on a run-of-the-mill laptop?

and,

2. What tools do you use for taxonomic name/id matching?

Looking forward to your insights and suggestions,

-jorrit

References

[1] Jorrit H. Poelen, James D. Simons and Chris J. Mungall. (2014). Global Biotic Interactions: An open infrastructure to share and analyze species-interaction datasets. Ecological Informatics.

[2] Scott Chamberlain, Eduard Szocs (2013). “taxize - taxonomic search and retrieval in R.” F1000Research.

[3] Kari E. A. Norman, Scott Chamberlain, and Carl Boettiger (2020). taxadb: A high-performance local taxonomic database interface. Methods in Ecology and Evolution, 11(9), 1153-1159. https://doi.org/10.1111/2041-210X.13440.

[4] R Core Team (2018). R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria.
URL https://www.R-project.org/

[5] Le Guillarme, N., & Thuiller, W. (2021). TaxoNERD: deep neural models for the recognition of taxonomic entities in the ecological and evolutionary literature. bioRxiv. https://doi.org/10.1101/2021.06.08.444426v1.full.

PS @Debbie @JoeMiller - I believe that my question is related to your interest in finding ways to better collaborate (11. Partnerships to collaborate more effectively). Why is it that we do not seem to have the equivalent of grep for taxonomic name matching yet? What can we do to help reduce expensive re-work?

dshorthouse · August 20, 2021, 2:14am

See gnames on GitHub, in particular gnfinder, gnparser, and gnverifier.

nleguillarme · August 20, 2021, 10:11am

Hi Jorrit

TaxoNERD is dedicated to taxon name recognition and linking in textual documents.
As such, it has to cope with detection errors (over/under-estimated entity boundaries), misspellings, unexpected naming variants, etc.

Consequently, it cannot rely on exact matching, so I resorted to the approximate nearest neighbor search approach implemented in scispacy’s Entity Linker.

This works quite well, and currently taxon names can be linked to the GBIF Backbone Taxonomy, TAXREF, or the NCBI Taxonomy. It is also pretty easy to add new taxonomies. It is offline-enabled, at the cost of high memory requirements (the entire list of names must be loaded into memory, so memory use depends directly on the size of the taxonomy). I don’t think there is any other tool that does approximate name matching.
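As a rough illustration of approximate name matching (a sketch only, and not scispacy's or TaxoNERD's implementation): each name is broken into character trigrams, and the candidate sharing the largest fraction of trigrams with the query is returned. This brute-force version compares the query against every name held in memory, whereas an approximate nearest neighbor index avoids that full scan. The function names, the Jaccard similarity measure, and the 0.5 threshold are assumptions chosen for this example.

```python
from typing import FrozenSet, Iterable, Optional, Tuple


def trigrams(name: str) -> FrozenSet[str]:
    """Return the set of character trigrams of a lowercased, padded name."""
    padded = f"  {name.lower().strip()} "
    return frozenset(padded[i:i + 3] for i in range(len(padded) - 2))


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity: shared trigrams divided by all distinct trigrams."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(
    query: str, names: Iterable[str], threshold: float = 0.5
) -> Optional[Tuple[str, float]]:
    """Return (name, score) of the most similar name, or None below threshold."""
    q = trigrams(query)
    scored = [(jaccard(q, trigrams(n)), n) for n in names]
    if not scored:
        return None
    score, name = max(scored)
    return (name, score) if score >= threshold else None


NAMES = ["Puma concolor", "Homo sapiens", "Panthera leo"]
print(best_match("Puma concolr", NAMES))   # misspelled query
print(best_match("Quercus robur", NAMES))  # no similar name in the list
```

The memory cost mentioned above shows up here too: every candidate name, plus its trigram set if cached, has to be held in memory for the comparison.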

However, I do not think that it meets your >10k matches/s criterion (although I have not done a performance evaluation yet). Also, I had to come up with an entity linker very quickly, and scispacy’s was straightforward to reuse. But I guess there is plenty of room for improvement (and even better approaches for linking entities out there).

I plan to spend more time on this topic, but for now I am focusing on another exciting task: the extraction of trophic information from text.

Anyway, I am also very interested in this topic, and as a regular nomer user, I’d be happy to help in any way I can.

Nicolas

PS: thank you for publicizing TaxoNERD, I really think it’s pretty decent work

jhpoelen · August 20, 2021, 2:34pm

Hey @dshorthouse -

Thanks for sharing the links to Dmitry Mozzherin et al.’s new incarnation of the globalnames.org tools at gnames · GitHub, implemented in Google’s language “Go”.

You are right that these tools are a great fit for processing millions of names efficiently. I’ve been in touch with Dmitry over the years and have had a recent exchange with him on the topic.

Of the tools you mentioned, gnverifier fits the use case I described in this post. Unfortunately, the tool is not (yet?) primarily designed to work offline (try running the tool after turning off your internet connection). As far as I understand, gnverifier (gnames/gnverifier on GitHub) makes web requests to a server running a gnmatcher (gnames/gnmatcher on GitHub) instance.

While Dmitry has been quite helpful in outlining how gnmatcher and associated data can be installed on a server, the process is quite involved and has some heavy dependencies (e.g., docker, postgres, data archives). Also, I am not quite sure how I can select a specific version of taxonomic name lists (e.g., GBIF backbone taxonomy, https://itis.gov) to match against. Perhaps with some tweaks, the installation process can be simplified by reducing the complexity of the dependencies and allowing for installing specific versions of taxonomic name lists.

I’ve notified Dmitry about this thread and hope to learn from his insights and architectural considerations especially given his extensive experience with building taxonomic name tools.

And, I am curious to hear comments and experiences related to the topic of name matching from you and others.

-jorrit

jhpoelen · August 20, 2021, 3:00pm

hey @nleguillarme Nicolas -

Thanks for sharing your notes on TaxoNERD.

It seems to me that you and Dmitry Mozzherin might have some shared interests. If you haven’t already done so, I hope you’ll get a chance to exchange ideas with him and his colleagues one of these days.

Also, how exciting that you are experimenting with extracting trophic information (e.g., “man eats dog”) from texts! Let me know when you are ready to let GloBI index your extracted trophic interaction records along with their origins (e.g., scholarly communications, papers, grey literature).

-jorrit

jhpoelen · September 14, 2021, 11:34am

I heard back from Dmitry Mozzherin (developer of Global Names, https://globalnames.org) by email, and this is what he said (posted with permission):

"I think it is great that you solved your need for offline verification, and that it works so fast for you. Hopefully your solution will be useful for other use cases as well.

At some point I am going to make gnfinder offline-compatible, but first I have to make a harvester that will allow cherry-picking data according to users’ needs. I still think that preston might be useful for that.

Performance-wise I have 20k/sec for name matching on a high-end laptop, and sources lookup at 2.5k/sec. I doubt I can make matching much faster (and I doubt it is needed), but sources lookup can be much faster when I switch it to a key-value store. I use Postgres at the moment until the process is stable; it is easier to tweak SQL than rethink how key-values should interact."

Also, about @nleguillarme's TaxoNERD -

“I looked at taxonerd, and will definitely try it. Thank you for letting me know about it. The devil is in the details with such projects. Deep learning is an important tool and there are a few projects that start using it now for name finding. What I saw so far required much work though, hopefully taxonerd is more mature. From the scientific names perspective I would concentrate on using deep learning not for finding names, but on removing false positives. Tons of names use geographical entities, human names, common words, there is even a snail Casus belli and a moth La cucaracha :). I hope TDWG will be a good time to talk about parsers, finders and resolvers!”

system · October 14, 2021, 9:35pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
