In a GBIF forum post in 2022 I described a programmatic check to find anomalous occurrence records where the same collector was recorded as being in two very widely separated places on the same day.
That 2022 check was for a particular auditing job. I’ve now generalised the check for clean Darwin Core datasets. By “clean” I mean that recordedBy is filled in and free of pseudo-duplicates, eventDate is in ISO 8601 format and both decimalLatitude and decimalLongitude are filled in and valid. The check also allows me to set a threshold distance between sites said to be visited by the same collector on the same day.
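The logic of the check can be sketched in Python. This is an illustration only, not the script used for the results below. It assumes a tab-separated file with standard Darwin Core column names, an eventDate that is a single full date (no ranges or times), and great-circle distances on a spherical Earth of radius 6371 km. The function names `load_rows`, `haversine_km` and `find_anomalous_blocks` are made up for this example.

```python
import csv
import math
from collections import defaultdict
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, spherical model


def load_rows(path):
    """Read a tab-separated Darwin Core occurrence file into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def find_anomalous_blocks(rows, threshold_km=1000.0):
    """Return {(recordedBy, eventDate): [rows]} for every collector-day
    block containing two sites more than threshold_km apart."""
    blocks = defaultdict(list)
    for row in rows:
        # Assumption: eventDate is a single full date such as 2023-08-01
        blocks[(row["recordedBy"], row["eventDate"])].append(row)

    anomalous = {}
    for key, members in blocks.items():
        # Compare distinct sites only, so repeated coordinates count once
        sites = {(float(r["decimalLatitude"]), float(r["decimalLongitude"]))
                 for r in members}
        if any(haversine_km(*a, *b) > threshold_km
               for a, b in combinations(sites, 2)):
            anomalous[key] = members
    return anomalous
```

Calling `find_anomalous_blocks(load_rows("occurrence.txt"), threshold_km=1000)` would return every collector-and-day block that spans more than the threshold; the file name here is a placeholder. Comparing every pair of distinct sites within a block is simple and adequate when a collector has only a handful of sites per day.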
With a threshold of 1000 km, the Smithsonian’s Extant Specimen Records, updated 2025-05-02, returned more than 3600 anomalous blocks of occurrence records. Most of these are the result of data entry errors in lat/lon and could be flagged by GBIF, for example as “presumed negative latitude” or “country coordinate mismatch”. Here are three such blocks:
recordedBy | eventDate | decLat | decLon | catalogNumber |
---|---|---|---|---|
T. Mendietta & B. Taylor | 2023-08-01 | 33.5232 | -103.863 | US 3760262 |
T. Mendietta & B. Taylor | 2023-08-01 | 33.6727 | -4.40142 | US 3760274 |
T. Mendietta & B. Taylor | 2023-08-01 | 33.6421 | -104.357 | US 3760278 |
D. Rodriguez | 2019-10-07 | 89.4074 | -14.3674 | US 3753480 |
D. Rodriguez | 2019-10-07 | 14.3641 | -89.4055 | US 3760898 |
B. Wallnöfer | 2006-05-01 | 48.2517 | 16.2325 | US 3520318 |
B. Wallnöfer | 2006-05-01 | 48.2517 | 48.2517 | US 3520319 |
B. Wallnöfer | 2006-05-01 | 48.2517 | 16.2325 | US 3520320 |
B. Wallnöfer | 2006-05-01 | 48.2506 | 16.2708 | US 3520321 |
Other anomalies are more subtle and suggest that curators should re-check the accession details, as in this block:
recordedBy | eventDate | decLat | decLon | catalogNumber |
---|---|---|---|---|
E. Jenkins | 2000-01-25 | 68.1912 | -135.917 | USNM 1385987 |
E. Jenkins | 2000-01-25 | 68.1912 | -135.917 | USNM 1385988 |
E. Jenkins | 2000-01-25 | 68.1912 | -135.917 | USNM 1400750 |
E. Jenkins | 2000-01-25 | 51.5918 | -116.061 | USNM 1400751 |
E. Jenkins | 2000-01-25 | 51.5918 | -116.061 | USNM 1400752 |
The collector was the same person (E.J. Jenkins) and the records are from the same country (Canada), but they’re ca 2100 km apart. Possible on the same day?
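As a rough cross-check of that distance, the two coordinate pairs from the table can be run through the haversine formula. The sketch below assumes a spherical Earth of radius 6371 km, and the helper name `haversine_km` is illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, spherical model


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# The two E. Jenkins sites recorded for 2000-01-25
distance_km = haversine_km(68.1912, -135.917, 51.5918, -116.061)
print(f"{distance_km:.0f} km")
```

Under these assumptions the result comes out a little over 2100 km, in line with the figure quoted above.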
I’ve done this particular check for a number of large Darwin Core datasets. The strangest result was for an entomology dataset where Collector X seemed to be in two places at once quite often. The collection manager told me that
> (Collector X) often had others collecting for him, and he was a bit ornery and sometimes did not acknowledge those other collectors on labels (used his own name), especially several that he paid to collect and didn’t really consider to be “bona fide entomologists.” Not nice. And so (Collector X) sometimes does seem to be in multiple countries at once (when he is in fact traveling abroad and active and another one of his local USA collectors is active) or in multiple states (when he is traveling around USA). For better or worse, his specimen labels say what the labels say, and those discrepancies cannot really be placed at the feet of data entry personnel, these would need quite a bit of extra curatorial thinking/research/validation (smile).
Robert Mesibov (“datafixer”); mesibov@datafix.com.au