Checking geographical assignments below the country level

GBIF helpfully checks occurrence records to see whether the country or countryCode entry agrees with the decimalLatitude and decimalLongitude entries. If it doesn’t, GBIF flags the record with “country coordinate mismatch”.

In my data auditing I often look at other location categories for disagreements. For example, this check may find more than one stateProvince entry for exactly the same coordinates.
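A minimal sketch of that kind of check in Python, assuming the records are available as dictionaries keyed by Darwin Core field names. The function name and the sample records are made up for illustration, and coordinates are compared as exact strings:

```python
from collections import defaultdict

def conflicting_state_provinces(records):
    """Return {(lat, lon): set of stateProvince values} for coordinate
    pairs that carry more than one distinct stateProvince entry."""
    seen = defaultdict(set)
    for rec in records:
        lat = (rec.get("decimalLatitude") or "").strip()
        lon = (rec.get("decimalLongitude") or "").strip()
        state = (rec.get("stateProvince") or "").strip()
        if lat and lon and state:
            seen[(lat, lon)].add(state)
    # Keep only coordinate pairs with disagreeing entries
    return {xy: names for xy, names in seen.items() if len(names) > 1}

# Made-up records, for illustration only
sample = [
    {"decimalLatitude": "32.22170", "decimalLongitude": "-110.92650", "stateProvince": "Arizona"},
    {"decimalLatitude": "32.22170", "decimalLongitude": "-110.92650", "stateProvince": "Sonora"},
    {"decimalLatitude": "33.44840", "decimalLongitude": "-112.07400", "stateProvince": "Arizona"},
]
print(conflicting_state_provinces(sample))
```

The output lists the coordinate pairs that need review, but not which entry is right.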

However, that check doesn’t tell me which of the two or more stateProvince entries is the correct one for those coordinates. In other words, it isn’t a “stateProvince coordinate mismatch” test.

I recently did a “stateProvince coordinate mismatch” test on a plant dataset from a European country, using QGIS. I found that 3-4% of ca. 98000 records had an apparent mismatch. Note the “apparent” in that result. It might be that the records were close to a stateProvince boundary, and the GIS stateProvince layer I was using might not have had the boundary located properly. (I got administrative layers for the country from the DIVA-GIS website.)

To demonstrate this method I’ll work with a different dataset, the University of Arizona Insect Collection. As of 2024-01-02 the UAIC dataset in GBIF had 128117 occurrence records. The unprocessed dataset was remarkably untidy (see the GBIF flags for examples of issues) but had good coverage of the US state of Arizona, which is what I wanted. I wanted to look at county field entries, and Arizona has only 15 counties.

From the full dataset I selected the 17148 records for which

  • stateProvince contained something like Arizona (entries were arizona, Arizona, ARIZONA and “Arizona [? abbreviated ““Ar.”””)
  • decimalLatitude and decimalLongitude were both filled
  • geodeticDatum had “WGS84”
  • coordinateUncertaintyInMeters was less than 500
  • county was filled

From these 17148 records I built a text file with just the fields occurrenceID, county, decimalLatitude, decimalLongitude, geodeticDatum and coordinateUncertaintyInMeters.
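The selection and field extraction above could be sketched in Python as follows. This is only an illustration, not necessarily how the file was actually built: it assumes a tab-separated occurrence file with Darwin Core headers, and the function names are hypothetical.

```python
import csv
import io

KEEP = ["occurrenceID", "county", "decimalLatitude", "decimalLongitude",
        "geodeticDatum", "coordinateUncertaintyInMeters"]

def val(rec, key):
    """Field value as a stripped string ('' if missing)."""
    return (rec.get(key) or "").strip()

def keep_record(rec):
    """Apply the five selection criteria listed above."""
    if "arizona" not in val(rec, "stateProvince").lower():
        return False
    if not val(rec, "decimalLatitude") or not val(rec, "decimalLongitude"):
        return False
    if "WGS84" not in val(rec, "geodeticDatum"):
        return False
    try:
        uncertainty = float(val(rec, "coordinateUncertaintyInMeters"))
    except ValueError:  # blank or non-numeric uncertainty
        return False
    if not uncertainty < 500:
        return False
    return bool(val(rec, "county"))

def select_records(tsv_text):
    """Return the selected records, reduced to the six fields in KEEP."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [{f: val(rec, f) for f in KEEP} for rec in reader if keep_record(rec)]
```

For a real download, `tsv_text` would be the contents of the occurrence file; for a large file it would be better to stream rows from the open file than to read it all into memory.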

To tidy the file I deleted the 19 records with invalid county entries (Brown, Catron, Clark, General Plutarco Elías Calles, San Juan, St. Johns), leaving 17129 records. I normalised the remaining entries (e.g., “Cochise”, “Cochise co.” and “Cochise County” all became “Cochise”) and I rounded or expanded all coordinate entries to 5 decimal places.
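A sketch of this tidying step in Python, assuming the county variants differ only in case, spacing and a trailing “co.” or “County” (real data may need more rules). The function names are hypothetical; the list is the 15 Arizona counties.

```python
import re

ARIZONA_COUNTIES = {
    "Apache", "Cochise", "Coconino", "Gila", "Graham", "Greenlee", "La Paz",
    "Maricopa", "Mohave", "Navajo", "Pima", "Pinal", "Santa Cruz",
    "Yavapai", "Yuma",
}

def normalise_county(raw):
    """Return the bare county name, or None if it is not an Arizona county."""
    # Drop a trailing "co", "co." or "County" (any case)
    name = re.sub(r"\s+(co\.?|county)$", "", raw.strip(), flags=re.IGNORECASE)
    # Collapse internal whitespace and standardise capitalisation
    name = " ".join(name.split()).title()
    return name if name in ARIZONA_COUNTIES else None

def format_coord(value):
    """Round or pad a coordinate string to exactly 5 decimal places."""
    return f"{float(value):.5f}"

print(normalise_county("Cochise co."))     # Cochise
print(normalise_county("Cochise County"))  # Cochise
print(normalise_county("Catron"))          # None
print(format_coord("32.2"))                # 32.20000
```

Records for which `normalise_county` returns None correspond to the invalid entries that were deleted.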

I loaded the tidied table into QGIS and converted it to a shapefile, mainly to ensure the points layer had a spatial index. I then added to the GIS project a (free) county boundaries polygon layer from the Arizona Geographic Information Council, and coloured the occurrence markers using “county” in the points table as a category.

If every point had been assigned to its correct county, each county polygon would have contained points of just one colour. Unfortunately, this wasn’t the case.

To select the records with incorrect county names, I intersected the points and polygon layers, retaining the 6 fields from the points layer but just the county name field (“NAME”) from the polygon layer. I then calculated a new field, “mismatch”, set to 1 where “county” IS NOT “NAME”, and visualised the points with “mismatch” = 1 by giving them a new colour. (As a GIS user you might prefer a different filtering method.)
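Outside a GIS, the same logic can be illustrated with a plain-Python point-in-polygon test. This is a swapped-in technique (ray casting), not the QGIS intersection itself, and the two square “counties” and the points are invented for the example. Real county polygons would come from a boundaries layer and may have multiple parts or holes, which this sketch does not handle.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test; ring is a list of (lon, lat) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the point's latitude?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def flag_mismatches(points, polygons):
    """Add NAME (containing polygon) and mismatch (0 or 1) to each point.
    Unlike a GIS intersection, points outside every polygon are kept,
    with NAME = None and mismatch = 1."""
    flagged = []
    for p in points:
        name = next((n for n, ring in polygons.items()
                     if point_in_polygon(p["lon"], p["lat"], ring)), None)
        flagged.append({**p, "NAME": name, "mismatch": int(p["county"] != name)})
    return flagged

# Invented polygons and points, for illustration only
polygons = {
    "West": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "East": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
points = [
    {"occurrenceID": "a", "county": "West", "lon": 0.5, "lat": 0.5},
    {"occurrenceID": "b", "county": "West", "lon": 1.5, "lat": 0.5},
]
for row in flag_mismatches(points, polygons):
    print(row["occurrenceID"], row["NAME"], row["mismatch"])
```

Here record “a” falls in the polygon matching its county entry, while record “b” is flagged because its coordinates fall in “East”.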

Of the 17129 records, 143 had the wrong county, which is an apparent error rate of less than 1% in a fairly tidy, much-reduced dataset. I write “apparent” again because some of the mismatches might be close to a county border. Selecting these from the QGIS map view, I found 37 of the 143 were close to a county border, and in each of those cases the county entry in the occurrence record was indeed one of the two bordering counties.
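The “close to a border” judgement here was made by eye in the map view. One way to approximate it in code is to measure the distance from each mismatched point to the nearest polygon edge. The sketch below is an assumption-laden illustration: it uses a local flat-earth approximation with rough metres-per-degree constants, and any threshold for “close” (for example, the record's coordinateUncertaintyInMeters) would be a choice for the auditor.

```python
import math

def metres_to_boundary(lon, lat, ring):
    """Approximate distance in metres from a point to the nearest edge of
    a ring of (lon, lat) vertices, using a local flat projection centred
    on the point. Adequate only over short distances."""
    kx = 111320.0 * math.cos(math.radians(lat))  # approx. metres per degree of longitude
    ky = 110540.0                                # approx. metres per degree of latitude
    best = math.inf
    n = len(ring)
    for i in range(n):
        # Edge endpoints in metres, relative to the point (the origin)
        ax, ay = (ring[i][0] - lon) * kx, (ring[i][1] - lat) * ky
        bx, by = (ring[(i + 1) % n][0] - lon) * kx, (ring[(i + 1) % n][1] - lat) * ky
        dx, dy = bx - ax, by - ay
        seg2 = dx * dx + dy * dy
        # Position along the edge of the closest point to the origin, clamped to the edge
        t = 0.0 if seg2 == 0 else max(0.0, min(1.0, -(ax * dx + ay * dy) / seg2))
        best = min(best, math.hypot(ax + t * dx, ay + t * dy))
    return best

# Invented square polygon; the point sits 0.001 degrees north of its southern edge
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(round(metres_to_boundary(0.5, 0.001, square), 2))
```

A mismatched point whose distance is smaller than the chosen threshold would be a “borderline” case to set aside before reviewing the rest.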

Ignoring those literally “borderline” cases, the georeferencing of 106 of the 17129 records needs checking: is the county assignment wrong, or are the coordinates incorrect? Given the unfortunate state of the dataset as a whole, this is a minor issue, but if county names appear in an occurrence record, they should be correct for the location.

Readers interested in the GIS and other details of this exercise are welcome to email me directly.


Robert Mesibov (“datafixer”); robert.mesibov@gmail.com

Very cool, thanks for showing off.

What if the stateProvince moved or changed after the record was created? Borders are very prone to changing over time.

@pieter, my aim in checking is to find discrepancies or disagreements that need to be reviewed by the data compiler. You are correct that the geographical/administrative unit may have changed its position or size over time. Or its name: the label might say “Burma”, but we would enter “Myanmar” in country and (possibly) put “label says Burma” in locationRemarks or georeferenceRemarks.

And if the data labels were imaged, you’d also be able to tell if “it was wrong on the label” or if someone selected the wrong county from a dropdown list (so easy to do).