Century errors

Many of the records shared with GBIF from collections (museums, herbaria) contain occurrence date errors. I suspect that most of these originate when label data are digitised.

For example, a specimen label with “coll. May 1958” might have been digitised as “May 1985” or “Mar 1958” by mistake. In Darwin Core’s eventDate field, the correct “1958-05” then appears as “1985-05” or “1958-03”.

A particular category of date mistake is the “century error”, in which the occurrence date is off by exactly 100 years. The label says “6.ix.95” and means “6 September 1895”, but the person entering the data interprets “95” as “1995”.

One way to look for century errors is to see if the date is possible within the lifetime of the collector, and a great source for collector (and identifier) data is the Bionomia project managed by David Shorthouse. The Bionomia page for the American entomologist William Morton Wheeler, for example, has a graph with collecting decades from 1900 to 1930, which fits with Wheeler’s lifespan, 1865-1937.

The page also shows 1,971 identifications attributed to Wheeler after 2010, which is impossible given that he died in 1937. Going to the source of the records in the Museum of Comparative Zoology’s MCZbase, it turns out that in these records “Determination year is a proxy for unknown data. It will be updated when year is known Determinations made by Wheeler”. Example: https://MCZbase.mcz.harvard.edu/guid/MCZ:Ent:21081

I use a variation of the collector-based method to check for century errors in whole datasets. The code is partly based on commands described in the Darwin Core table checker, and if you’re a command-line user you can write to me for details. The output is a list of collectors, each with the range of their collecting years, sorted from the widest range to the narrowest. Where the range is very large, e.g. greater than 100 years, something is wrong.
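For readers who don't work on the command line, the idea can be sketched in Python. This is not the command-line code mentioned above, only an illustration of the same approach. It assumes a tab-separated table with Darwin Core recordedBy and eventDate columns, that eventDate begins with a four-digit year, and that each recordedBy value is a single, consistently spelled name. The sample rows and collector names are invented.

```python
import csv
import io
from collections import defaultdict

def collector_year_ranges(rows):
    """Return (collector, first_year, last_year, span) tuples, widest span first."""
    years = defaultdict(list)
    for row in rows:
        collector = (row.get("recordedBy") or "").strip()
        year = (row.get("eventDate") or "")[:4]  # assumes dates start with YYYY
        if collector and len(year) == 4 and year.isdigit():
            years[collector].append(int(year))
    ranges = [(c, min(y), max(y), max(y) - min(y)) for c, y in years.items()]
    return sorted(ranges, key=lambda r: r[3], reverse=True)

# Hypothetical sample data, invented for illustration only.
SAMPLE = (
    "recordedBy\teventDate\n"
    "A. Collector\t1901-05-23\n"
    "A. Collector\t2001-06-15\n"
    "B. Collector\t1995-09-06\n"
    "B. Collector\t1998-03-01\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
for collector, first, last, span in collector_year_ranges(reader):
    print(f"{collector}\t{first}-{last}\t{span}")
```

Collectors with an implausibly wide span come out at the top of the list, and those are the ones worth checking by hand.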

A check of William Morton Wheeler in the records MCZbase shares with GBIF turned up 20 records with collections after Wheeler’s death, and some genuine century errors:

catalogNumber  verbatimEventDate  eventDate
21493          5.6.04             2004-05-06
21641          Oct. 14 '14        2014-10-14
556705         VI.2.07            2007-06-02
556747         6-14-07            2007-06-14
556748         6-14-07            2007-06-14
556749         6-14-07            2007-06-14
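A lifespan check like the one that found these records might look something like the Python sketch below. It uses three of the Wheeler rows above and his lifespan of 1865 to 1937. The function name and the tuple layout are my own choices for illustration, and the last row is a made-up control record based on the “6.ix.95” example.

```python
def outside_lifespan(records, born, died):
    """Return the records whose eventDate year is outside born..died (inclusive)."""
    flagged = []
    for catalog_number, verbatim, event_date in records:
        year = int(event_date[:4])  # assumes eventDate starts with YYYY
        if not born <= year <= died:
            flagged.append((catalog_number, verbatim, event_date))
    return flagged

WHEELER = [
    ("21493", "5.6.04", "2004-05-06"),
    ("21641", "Oct. 14 '14", "2014-10-14"),
    ("556705", "VI.2.07", "2007-06-02"),
    ("hypothetical-1", "6.ix.95", "1895-09-06"),  # invented control row
]

for record in outside_lifespan(WHEELER, born=1865, died=1937):
    print(*record, sep="\t")
```

Birth and death years are only outer limits. A collector's known active years, where available, give a tighter test.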

My method also picked up century errors in MCZbase for the living biologist David J. Lohman, whose collections are mainly from 1991 to 2012:

catalogNumber  verbatimEventDate  eventDate
210303         23 V 01            1901-05-23
210325         16 VI 01           1901-06-16
210357         16 VI 01           1901-06-16
210358         16 VI 01           1901-06-16
210362         16 VI 01           1901-06-16

even though similar verbatimEventDate entries for Lohman are correctly interpreted, e.g.

catalogNumber  verbatimEventDate  eventDate
210349         15 VI 01           2001-06-15
210373         15 VI 01           2001-06-15
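One way to separate a likely century error from some other kind of date problem is to shift the suspect year by 100 in each direction and see whether it then falls within the collector's known active years. The Python sketch below does this for three of the Lohman rows above, taking 1991 to 2012 as his active range. The function name is hypothetical, and a match is only a hint that still needs checking against the label.

```python
def century_shift_candidates(year, active_from, active_to):
    """Return the years 100 before/after `year` that fall inside the active range."""
    return [y for y in (year - 100, year + 100) if active_from <= y <= active_to]

LOHMAN_ACTIVE = (1991, 2012)  # "mainly from 1991 to 2012"
RECORDS = [
    ("210303", "23 V 01", "1901-05-23"),
    ("210325", "16 VI 01", "1901-06-16"),
    ("210349", "15 VI 01", "2001-06-15"),
]

for catalog_number, verbatim, event_date in RECORDS:
    year = int(event_date[:4])
    if LOHMAN_ACTIVE[0] <= year <= LOHMAN_ACTIVE[1]:
        continue  # date is plausible as it stands
    candidates = century_shift_candidates(year, *LOHMAN_ACTIVE)
    print(catalog_number, verbatim, event_date, "->", candidates)
```

The two 1901 records are printed with 2001 as the candidate year, and the 2001 record is skipped.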

Unfortunately, century errors are usually buried among other errors (such as wrong collector and wrong verbatim collecting date) and my command-line method requires tidy data. If the recordedBy field has a range of variations such as

William Morton Wheeler; William M. Wheeler; W.M. Wheeler; W. M. Wheeler; Wm. M. Wheeler; WMW; W.M.W.; Wheeler, William Morton; Wheeler, W.M.

and so on, not to mention spelling mistakes, then checking for century errors this way becomes very tedious for large datasets.
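Some of that variation can be collapsed automatically before checking year ranges. The Python sketch below reduces each name to a crude key made of the surname plus the first initial. It is a rough illustration with a hypothetical function name, not a substitute for proper name reconciliation: it assumes one person per recordedBy value, it cannot resolve initials-only entries such as “WMW”, it will merge different people who share a surname and first initial, it does not handle hyphenated or multi-word surnames, and it does nothing about spelling mistakes.

```python
import re
from collections import defaultdict

def name_key(recorded_by):
    """Return 'surname initial' in lower case, or None if no surname is evident."""
    name = recorded_by.strip()
    if "," in name:  # "Wheeler, William Morton" -> "William Morton Wheeler"
        surname, _, given = name.partition(",")
        name = f"{given} {surname}"
    tokens = re.findall(r"[^\W\d_]+", name)  # runs of letters only
    if len(tokens) < 2 or len(tokens[-1]) < 2:
        return None  # initials only, e.g. "WMW" or "W.M.W."
    return f"{tokens[-1].lower()} {tokens[0][0].lower()}"

VARIANTS = [
    "William Morton Wheeler", "William M. Wheeler", "W.M. Wheeler",
    "W. M. Wheeler", "Wm. M. Wheeler", "WMW", "W.M.W.",
    "Wheeler, William Morton", "Wheeler, W.M.",
]

groups = defaultdict(list)
for variant in VARIANTS:
    groups[name_key(variant)].append(variant)

for key, names in groups.items():
    print(key, "<-", "; ".join(names))
```

Seven of the nine variants above share the key “wheeler w”. The two initials-only forms get no key and would have to be matched by hand.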

Robert Mesibov (“datafixer”); robert.mesibov@gmail.com