Most of the data tests currently listed by the TDWG Biodiversity Data Quality Interest Group are single-field checks. Is there an entry in that field? Is it a valid entry?
A few of the tests involve multiple fields. For example, do the geographic coordinates fall on or within the boundaries of the territory given in dwc:countryCode or its Exclusive Economic Zone? If not, the record has what GBIF calls a “country coordinate mismatch”. For some other multiple-field checks, see this forum post.
For a recent project I’ve been doing a multiple-field, multiple-record check, and I’m finding data problems in seemingly OK records. The check finds the distance between sites visited by a single collector on a single day, or on two consecutive days. Sometimes the distance is highly suspicious, such as two sites thousands of km apart with records from the same collector on the same day.
There are various possible explanations. Maybe the record has the wrong collector, or there are two collectors with the same name who were active on the same day. Maybe the georeferencing was wrong. Maybe the wrong date was entered by carrying-forward in a database, or by copying down in the date column in a spreadsheet.
In any case, a suspicious same-day or consecutive-day result deserves to be investigated.
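A minimal Python sketch of this kind of check might look like the following. It is an illustration only, not the method used for the project. It assumes each record is a dictionary keyed by the Darwin Core terms recordedBy, eventDate, decimalLatitude and decimalLongitude, with dates already parsed and coordinates already validated. The `suspicious_pairs` function, the 500 km threshold, the mean Earth radius and the sample records are all invented for the example.

```python
from datetime import date
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; other values are in use


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km between two decimal-degree points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def suspicious_pairs(records, max_km=500.0, max_day_gap=1):
    """Yield (collector, rec_a, rec_b, km) for record pairs by one collector
    that are at most max_day_gap days apart but more than max_km apart.
    Both thresholds are arbitrary assumptions to be tuned per dataset."""
    by_collector = {}
    for rec in records:
        by_collector.setdefault(rec["recordedBy"], []).append(rec)
    for collector, recs in by_collector.items():
        for a, b in combinations(recs, 2):
            gap = abs((a["eventDate"] - b["eventDate"]).days)
            if gap > max_day_gap:
                continue
            km = great_circle_km(
                a["decimalLatitude"], a["decimalLongitude"],
                b["decimalLatitude"], b["decimalLongitude"],
            )
            if km > max_km:
                yield collector, a, b, km


# Invented sample records, for illustration only.
records = [
    {"id": "r1", "recordedBy": "A. Collector", "eventDate": date(2001, 3, 4),
     "decimalLatitude": -42.88, "decimalLongitude": 147.33},
    {"id": "r2", "recordedBy": "A. Collector", "eventDate": date(2001, 3, 4),
     "decimalLatitude": -12.46, "decimalLongitude": 130.84},
    {"id": "r3", "recordedBy": "A. Collector", "eventDate": date(2001, 3, 5),
     "decimalLatitude": -42.90, "decimalLongitude": 147.30},
    {"id": "r4", "recordedBy": "B. Collector", "eventDate": date(2001, 3, 4),
     "decimalLatitude": -12.46, "decimalLongitude": 130.84},
]

flagged = list(suspicious_pairs(records))
for collector, a, b, km in flagged:
    print(f"{collector}: {a['id']} and {b['id']} are {km:.0f} km apart")
```

Comparing every pair of records for a collector is simple but slow for prolific collectors. Sorting each collector's records by date and comparing only records on the same or adjacent dates would scale better.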
Here’s an example, from a dataset suggested by David Shorthouse. The EH Strickland Entomological Museum at the University of Alberta holds numerous specimens collected by Robin Ernest Leech (1937-2016).
A check of same-day records shows three specimen lots collected on 22 September 1983 at Oregon Bend on the Kentucky River near Salvisa, Kentucky, coordinates 37.917 -84.825. The record for one of these is:
On the same day, 22 September 1983, Leech collected three specimen lots near or in Drake Park in Bend, Oregon, at coordinates 44.059 -121.32. Example:
There’s seemingly nothing wrong with these records when looked at individually, but the two localities are ca 3109 km apart (great circle distance)!
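The separation can be roughly reproduced with the haversine formula. The short Python sketch below uses the two coordinate pairs quoted above and assumes a spherical Earth with a mean radius of 6371 km. That radius and the function name `great_circle_km` are choices made for this example, so the result should come out at roughly 3100 km but may differ by a few km from the figure quoted, depending on the Earth model used.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical model)


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km between two decimal-degree points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Coordinates as given in the two sets of records
kentucky = (37.917, -84.825)  # Oregon Bend, Kentucky River
oregon = (44.059, -121.32)    # near Drake Park, Bend, Oregon

distance_km = great_circle_km(*kentucky, *oregon)
print(f"{distance_km:.0f} km")
```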
In this case, I’m guessing that the georeferencer for the second lot of records used an online gazetteer and matched “Oregon Bend” to Bend, Oregon, and that all six specimen lots are really from a Leech visit to Kentucky. I haven’t seen any images of original specimen labels.
Sometimes a big separation between same-day records is real. In my current project I’m in contact with the Australian collector, who writes of four very well-separated same-day records, many years ago:
The above four are on a drive to Darwin, so quite possible as roadside stops in one day’s drive of about 600 km.
However, most of the anomalies I’ve found with this particular check are very large and stand out dramatically. I use the command line for the check but it could also be done in a spreadsheet (email me directly for suggestions, if interested). Before doing the check, it is essential that event dates and coordinates be valid and correctly formatted.
Robert Mesibov (“datafixer”); email@example.com