Using gnparser to check scientificName entries

Data publishers sometimes forget that the Darwin Core term verbatimIdentification heads a field in which you can put “A string representing the taxonomic identification as it appeared in the original record”. That original string might or might not be a proper scientific name. It’s the identification on a specimen label and/or in a local database.

The scientificName field, on the other hand, is strictly for proper scientific names. It isn’t the place for informal names (“lichen”, “Carabus sp. A”), qualified names (“Mus cf. musculus”, “Eucalyptus aff. amygdalina”), comments (“Indet”, “Not a fungus”) or nonsense (“???”).

Checking a long list of scientific names by eye for formal correctness can be tedious, but there are excellent tools available for doing the job programmatically. GBIF’s online species name matcher is a good one; it relies on GBIF’s backbone taxonomy for matching, fuzzy-matching and non-matching name strings.

My personal favourite is gnparser. Written by Dmitry Mozzherin and colleagues, it’s a phenomenally fast tool that not only breaks up scientific names into name, author and year (when given), but also rates the quality of the name string, with well-defined quality categories. For example, “lichen” has quality 0 (Parsing failed), “Carabus sp. A” and “Eucalyptus aff. amygdalina” have quality 4 (Name approximate), and “Aartsenia candida (Møller, 1842)” will earn you a quality 1 (Parsing finished without detecting any problems).
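To make the quality ratings concrete, here is a minimal Python sketch that reads parser output in CSV form and keeps only the rows whose quality is not 1. The two-column sample is a hand-made stand-in built from the three examples above, and the column names `Verbatim` and `Quality` are assumptions about the output layout; check them against the output of your own gnparser version.

```python
import csv
import io

# Quality codes as described in the text above (only these three are listed)
QUALITY_LABELS = {
    0: "Parsing failed",
    1: "Parsing finished without detecting any problems",
    4: "Name approximate",
}

def flag_problem_names(csv_text):
    """Return (verbatim, quality) pairs for rows whose quality is not 1."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["Verbatim"], int(row["Quality"]))
        for row in reader
        if int(row["Quality"]) != 1
    ]

# Hand-made stand-in for parser output; real output has more columns
sample_output = '''Verbatim,Quality
lichen,0
Carabus sp. A,4
"Aartsenia candida (Møller, 1842)",1
'''

for verbatim, quality in flag_problem_names(sample_output):
    print(quality, QUALITY_LABELS.get(quality, "other"), verbatim, sep="\t")
```

Only the well-formed name is dropped from the flagged list; the other two strings are kept for review.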

gnparser is available both as an online service with a limit of 5000 names per batch and as a downloadable command-line program with no upper limit. I use the CLI version, which is currently (2023-09-12) at version 1.7.4.

I needed gnparser for a recent check of the Mollusca dataset shared with GBIF by Naturalis Biodiversity Center. The dataset (ca 730K records) is very messy, and the scientificName field is really a verbatimIdentification field. I also noticed an apparently high level of pseudo-duplication in scientificName. In other words, a single name might appear in several different formats.

To shed more light on the pseudo-duplication problem I first did a tally of the unique name strings in scientificName. There were 52011 unique strings in 738016 records.
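A tally like this can be sketched in a few lines of Python. This is not the command used for the Naturalis check, just one way to do it; the sample list is hypothetical, and stripping surrounding whitespace before counting is a design choice you may not want if whitespace differences are themselves of interest.

```python
from collections import Counter

def tally_names(names):
    """Count how often each distinct name string occurs, ignoring blank entries."""
    cleaned = (n.strip() for n in names)
    return Counter(n for n in cleaned if n)

# Hypothetical sample standing in for a scientificName column
sample = [
    "Lambis scorpius",
    "Lambis scorpius (Linnaeus, 1758)",
    "Lambis scorpius (Linnaeus, 1758)",
    "",
    "Lambis scorpius ",
]

tally = tally_names(sample)
for name, count in tally.most_common():
    print(f"{count}\t{name}")
print(len(tally), "unique strings in", sum(tally.values()), "records")
```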

I then used grep and regular expressions to separate out the invalid and badly formatted entries. I passed the apparently OK name strings to gnparser and looked for entries with a quality rating other than “1”. Where appropriate, I added these to my list of rejects. The final list had 1913 rejected strings.
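The kind of regex pre-screen described above might look like the following Python sketch. The patterns are illustrative assumptions, not the grep expressions actually used, and they deliberately cover only a few obvious cases (open nomenclature, question marks, lower-case first letter, "Indet"). Anything that slips through is what the parser's quality rating is for.

```python
import re

# Illustrative patterns only; a real pre-screen would need many more
REJECT_PATTERNS = [
    re.compile(r"\b(?:sp|spp|cf|aff)\.(?:\s|$)"),  # open nomenclature: sp., cf., aff.
    re.compile(r"\?"),                              # question marks
    re.compile(r"^[a-z]"),                          # starts with a lower-case letter
    re.compile(r"^indet\b", re.IGNORECASE),         # "Indet", "indet."
]

def prescreen(names):
    """Split names into (ok, rejects) using the patterns above."""
    ok, rejects = [], []
    for name in names:
        target = rejects if any(p.search(name) for p in REJECT_PATTERNS) else ok
        target.append(name)
    return ok, rejects

names = [
    "lichen",
    "Carabus sp. A",
    "Mus cf. musculus",
    "Eucalyptus aff. amygdalina",
    "Indet",
    "???",
    "Aartsenia candida (Møller, 1842)",
    "Lambis scorpius",
]

ok, rejects = prescreen(names)
print("OK:", ok)
print("Rejected:", rejects)
```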

The “OK” list of tallied name strings (50099 unique strings from 730154 records) could still be wrong in a variety of ways. Even though the form of a string might be correct, the taxon names, authors and authorship years could be wrong, and the taxon and author names could be misspelled.

Nevertheless, I was interested in pseudo-duplication, errors or no, so I used some AWK voodoo to generate from the gnparser output a separated list of pseudo-duplicate sets. There were 3851 such sets, mostly pairs with or without authorship or subgenus:

Lambis scorpius [7 records]
Lambis scorpius (Linnaeus, 1758) [76]

Melampus (Detracia) monile (Bruguière, 1789) [2 records]
Melampus monile (Bruguière, 1789) [68]
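The set-building idea can be sketched in Python as a grouping on the canonical name. This is a swapped-in illustration, not the AWK code used for the check. It assumes each row already carries a canonical name without authorship or subgenus (typed in by hand here, not taken from real parser output), and the Aartsenia row with its record count is a hypothetical singleton added to show that unduplicated names drop out.

```python
from collections import defaultdict

def pseudo_duplicate_sets(rows):
    """Group verbatim strings by canonical name; keep groups with 2+ variants.

    rows: iterable of (canonical, verbatim, record_count) tuples.
    """
    groups = defaultdict(dict)
    for canonical, verbatim, count in rows:
        groups[canonical][verbatim] = groups[canonical].get(verbatim, 0) + count
    return {c: v for c, v in groups.items() if len(v) > 1}

rows = [
    ("Lambis scorpius", "Lambis scorpius", 7),
    ("Lambis scorpius", "Lambis scorpius (Linnaeus, 1758)", 76),
    ("Melampus monile", "Melampus (Detracia) monile (Bruguière, 1789)", 2),
    ("Melampus monile", "Melampus monile (Bruguière, 1789)", 68),
    ("Aartsenia candida", "Aartsenia candida (Møller, 1842)", 1),  # hypothetical count
]

sets = pseudo_duplicate_sets(rows)
for canonical, variants in sets.items():
    for verbatim, count in variants.items():
        print(f"{verbatim} [{count}]")
    print()  # blank line between sets
```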

There were also numerous sets with authorship with or without parentheses:

Leptothyra laeta (Montrouzier, 1863) [11 records; correct]
Leptothyra laeta Montrouzier, 1863 [1; incorrect]
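Variants that differ only in the parentheses around the authorship can be picked out with a simple normalisation, sketched here in Python. The regular expression is an assumption about how such authorships are written (a final parenthesised string ending in a four-digit year), so subgenus parentheses are left alone. The check only finds the pairs; deciding which form is correct still needs taxonomic knowledge, because in zoological names the parentheses carry meaning.

```python
import re
from collections import defaultdict

# Matches a final "(Author, 1863)" style authorship ending in a four-digit year
PAREN_AUTHORSHIP = re.compile(r"\(([^()]*\d{4})\)\s*$")

def strip_author_parens(name):
    """Remove the parentheses around a trailing authorship, if present."""
    return PAREN_AUTHORSHIP.sub(r"\1", name.strip())

def parens_only_variants(names):
    """Return groups of names that are identical apart from authorship parentheses."""
    groups = defaultdict(set)
    for name in names:
        groups[strip_author_parens(name)].add(name.strip())
    return [sorted(v) for v in groups.values() if len(v) > 1]

names = [
    "Leptothyra laeta (Montrouzier, 1863)",
    "Leptothyra laeta Montrouzier, 1863",
    "Melampus (Detracia) monile (Bruguière, 1789)",
    "Melampus monile (Bruguière, 1789)",
]

variants = parens_only_variants(names)
for group in variants:
    print(group)
```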

Two of the sets had serious incremental fill-down errors:

Conus coronatus Gmelin, 1791 [478 records; correct]
Conus coronatus Gmelin, 1792 [1]
Conus coronatus Gmelin, 1793 [1]
Conus coronatus Gmelin, 1794 [1]
Conus coronatus Gmelin, 1795 [1]
Conus coronatus Gmelin, 1796 [1]
Conus coronatus Gmelin, 1797 [1]
Conus coronatus Gmelin, 1798 [1]
Conus coronatus Gmelin, 1799 [1]

Conus eburneus Hwass, 1792 [204 records; correct]
Conus eburneus Hwass, 1793 [1]
Conus eburneus Hwass, 1794 [1]
Conus eburneus Hwass, 1795 [1]
Conus eburneus Hwass, 1796 [1]
Conus eburneus Hwass, 1797 [1]
Conus eburneus Hwass in Bruguière, 1792 [2]

and one set had five versions of the same name:

Vitrea crystallina [93 records]
Vitrea crystallina (Müller, 1774) [213]
Vitrea crystallina (O.F. Müller, 1774) [1]
Vitrea (Crystallus) crystallina [23]
Vitrea (Crystallus) crystallina (Müller, 1774) [39]
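The incremental fill-down pattern in the two Conus sets above lends itself to an automated check. The Python sketch below groups name strings by everything before a trailing four-digit year and reports any name and author combination with a run of consecutive years. The regular expression and the `min_run` threshold of 3 are assumptions chosen for illustration, and a run is only a hint to look more closely, not proof of an error.

```python
import re
from collections import defaultdict

# Everything before a trailing four-digit year (optionally followed by ")")
YEAR_AT_END = re.compile(r"^(?P<stem>.*?)(?P<year>\d{4})\)?\s*$")

def fill_down_suspects(names, min_run=3):
    """Return {name-plus-author: longest run of consecutive years} for suspect sets."""
    years_by_stem = defaultdict(set)
    for name in names:
        m = YEAR_AT_END.match(name)
        if m:
            years_by_stem[m.group("stem")].add(int(m.group("year")))
    suspects = {}
    for stem, years in years_by_stem.items():
        ordered = sorted(years)
        run = [ordered[0]]
        best = run
        for y in ordered[1:]:
            # Extend the current run if the year follows on, else start a new run
            run = run + [y] if y == run[-1] + 1 else [y]
            if len(run) > len(best):
                best = run
        if len(best) >= min_run:
            suspects[stem.rstrip(" ,")] = best
    return suspects

names = [f"Conus coronatus Gmelin, {year}" for year in range(1791, 1800)]
names += ["Lambis scorpius (Linnaeus, 1758)", "Thalotia conica (Gray, 1827)"]

suspects = fill_down_suspects(names)
for stem, run in suspects.items():
    print(f"{stem}: {run[0]} to {run[-1]} ({len(run)} consecutive years)")
```

In both Conus sets above the earliest year in the run was the correct one, which is what you would expect if a spreadsheet fill-down incremented the year in each successive row.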

The sets also revealed authorship and formatting errors, e.g.

Cheilea papyracea Adams [1 record]
Cheilea papyracea (Linnaeus, 1758) [1]
Cheilea papyracea (Reeve, 1858) [2; correct]

Thalotia conica (Gray 1827) [1 record]
Thalotia conica (Gray, 1827) [25; correct]
Thalotia conica [33]

This set-building would not have been possible without the excellent gnparser, which I highly recommend for data publishers and data checkers!

Robert Mesibov (“datafixer”)


You can also parse names in R, for example with the name_parse function in the rgbif package:
rgbif::name_parse("Aartsenia candida (Møller, 1842)")