How I check decimalLatitude/Longitude against verbatimCoordinates

Some of the datasets I audit have degrees-minutes-seconds (DMS) coordinates in verbatimCoordinates and decimal-degree (DD) coordinates in decimalLatitude and decimalLongitude. Checking that the DMS-to-DD conversion was done correctly isn’t easy, because the DMS might be in any one of several formats:

  • the degree number might be followed by the degree symbol (°), the masculine ordinal symbol (º), the modifier small letter o (ᵒ), the ring above character (˚), the letter “d” or even an asterisk (*)
  • minutes might be followed by an apostrophe ('), a right single quote (’) or the letter “m”
  • seconds might be followed by a quote ("), a right double quotation mark (”), two apostrophes (''), two right single quotes (’’) or the letter “s”
  • the seconds item might be a whole number (like “31”) or a decimal (like “30.975”)
  • spacing between DMS items can be variable, from no space to one or more spaces
  • there may or may not be a comma between the latitude and longitude elements of the DMS

To convert DMS to 4-decimal-place DD I use GNU AWK. For readers familiar with BASH and the command line, the command is stored in the function “DMStoDD4”:

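DMStoDD4() {
    # FPAT (GNU AWK only) defines fields by content: integers, decimals and N/E/W/S.
    # "[.]" is a literal dot that does not depend on backslash escapes surviving -v.
    awk -v FPAT="[0-9]+|[0-9]+[.][0-9]+|[NEWS]" '{
        lat = sprintf("%0.4f", $1 + $2/60 + $3/3600)
        lon = sprintf("%0.4f", $5 + $6/60 + $7/3600)
        # South latitudes and west longitudes are negative
        if ($4 == "S") printf("%0.4f ", -lat); else printf("%0.4f ", lat)
        if ($8 == "W") printf("%0.4f\n", -lon); else printf("%0.4f\n", lon)
    }'
}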

Here AWK is defining fields by matching a pattern (FPAT) rather than by splitting on a separator. The patterns to match are one or more digits, digits with a “.” between them, and the direction letters N, E, W and S. In effect, AWK is ignoring punctuation and spaces. All it sees as fields are the six numbers and the two letters in the DMS.

The command works well even with “pathological” DMS formatting.

DMStoDD4 will fail if the decimal separator is a comma rather than a “.”, or if the direction letters are other than N, E, W and S, as with the Spanish “Oeste” for “west”.

I do find disagreements between verbatimCoordinates and decimalLatitude/decimalLongitude in Darwin Core datasets, and in those cases the disagreements go back to the data compiler for a fix. The fix proceeds if the dataset is referenced in a Pensoft publication like Biodiversity Data Journal. Less often the disagreement is in a museum or herbarium database, and nothing happens. As a museum data manager recently wrote to me:

You are right that there are … issues with [our] datasets. The main reason for not fixing them is lack of time. Personally, for me is data quality very important, but other issues with data management keep us (the data managers) busy. The collection managers are also extremely busy, and also have no time.

To “Linnean shortfall” and “Wallacean shortfall” as obstacles to better biodiversity knowledge we should add “data management shortfall”.


Robert Mesibov (“datafixer”); mesibov@datafix.com.au