Exact duplicate records in a Darwin Core table have identical entries in every field. They’re easy to find and remove, even in Microsoft Excel.
Strict duplicate records have identical entries in every field except the one or more fields that give each record a unique ID. Those unique-ID fields might be eventID or occurrenceID. These duplicates should also be removed, because they contribute no additional information to the table.
Finding strict duplicate records isn’t so easy. Unfortunately, strict duplicates do exist in datasets shared with GBIF, and they aren’t detected by GBIF’s Data Validator.
The best way I know to find strict duplicates is in a BASH shell, using the GNU AWK command suggested here on the Darwin Core table checker website. The simplified table below (named “event.txt”) will show how the command works, but please note: I’ve used a semicolon as field separator only so that you can copy and paste the table into a text editor from this forum page. Once it’s in the editor, please replace semicolons with tabs and delete any blank lines, so that the table is tab-separated (TSV); a quick way to do that is sketched after the table.
eventID;locality;decimalLatitude;decimalLongitude;geodeticDatum;eventDate
ev001;Towells;-37.6520;154.7603;WGS84;2024-06-10
ev002;Gramdun;-37.6502;154.7612;WGS84;2024-07-14
ev003;Gramdun;-37.6502;154.7612;WGS84;2024-07-15
ev004;One Tree Hill;-37.6491;154.7318;WGS84;2024-06-03
ev005;Fernoaks;-37.6513;154.7598;WGS84;2024-07-29
ev006;Towells;-37.6520;154.7603;WGS84;2024-06-10
ev007;Lees Paddock;-37.6501;154.7702;WGS84;2024-06-25
ev008;Fernoaks;-37.6513;154.7598;WGS84;2024-07-29
ev009;Towells;-37.6520;154.7603;WGS84;2024-06-10
ev010;One Tree Hill;-37.6491;154.7318;WGS84;2024-06-05
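For that conversion, something along these lines should work (a sketch only; “pasted.txt” is just a stand-in name for the file holding the pasted table):

tr ';' '\t' < pasted.txt | sed '/^$/d' > event.txt

Here tr turns every semicolon into a tab and sed deletes the blank lines, leaving a clean TSV named “event.txt”.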
The AWK command groups the strict duplicates and neatly separates the groups with a blank line.
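As a rough sketch of the kind of one-liner involved (the version suggested on the table checker website may differ), assuming “event.txt” is tab-separated with a one-line header and eventID as the first field:

awk 'BEGIN {FS=OFS="\t"} NR==1 {next} {rec=$0; $1=""; count[$0]++; group[$0]=group[$0] rec "\n"} END {for (key in group) if (count[key] > 1) print group[key]}' event.txt

Blanking $1 removes the eventID, so the rest of the record becomes the array key; the END block prints only those groups of records whose key occurs more than once, and print adds the blank line after each group. On the TSV version of the table above, this prints ev001, ev006 and ev009 as one group and ev005 and ev008 as another (the order of the groups can vary, because AWK’s array traversal order is undefined).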
If you’re familiar with BASH tools, you may be wondering if you can do the same job with sort and uniq. Yes, but:
The method fails if the unique-ID field to be ignored isn’t the first field in the record, or if the table has more than one unique-ID field and they sit in different positions. The sort-and-uniq approach also doesn’t separate the strict duplicates into groups.
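As a sketch of that alternative (again assuming a TSV named “event.txt” with a one-line header and eventID as the first field), GNU sort and uniq can list the strict duplicates like this:

tail -n +2 event.txt | sort -t $'\t' -k2 | uniq -f 1 -D

tail drops the header, sort orders the records by everything from the second field onwards, and uniq -f 1 -D compares adjacent records while skipping the first field, printing all the repeated ones. The same duplicated records come out, but in sorted order and with no blank lines between the groups.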
The AWK command can also be used to find duplicates where one of the data fields is missing an entry, or where the field entries have not been properly normalised. In the table below, one of the “Towells” entries is missing and another is misspelled, so the command (shown after the table) ignores both eventID and locality. (A locality field should have normalised entries. Original entries can go in a verbatimLocality field.)
eventID;locality;decimalLatitude;decimalLongitude;geodeticDatum;eventDate
ev001;Towells;-37.6520;154.7603;WGS84;2024-06-10
ev002;Gramdun;-37.6502;154.7612;WGS84;2024-07-14
ev003;Gramdun;-37.6502;154.7612;WGS84;2024-07-15
ev004;One Tree Hill;-37.6491;154.7318;WGS84;2024-06-03
ev005;Fernoaks;-37.6513;154.7598;WGS84;2024-07-29
ev006;;-37.6520;154.7603;WGS84;2024-06-10
ev007;Lees Paddock;-37.6501;154.7702;WGS84;2024-06-25
ev008;Fernoaks;-37.6513;154.7598;WGS84;2024-07-29
ev009;Towels;-37.6520;154.7603;WGS84;2024-06-10
ev010;One Tree Hill;-37.6491;154.7318;WGS84;2024-06-05
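A sketch of that command, under the same assumptions about “event.txt”, simply blanks the first two fields when building the key:

awk 'BEGIN {FS=OFS="\t"} NR==1 {next} {rec=$0; $1=$2=""; count[$0]++; group[$0]=group[$0] rec "\n"} END {for (key in group) if (count[key] > 1) print group[key]}' event.txt

On this second table it groups ev001, ev006 and ev009 (the blank and misspelled “Towells” records) and, separately, ev005 and ev008.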
If there are “pythonistas” or “useRs” who follow this forum, I would be interested to hear about simple Python or R methods for finding and grouping strict duplicates.
Robert Mesibov (“datafixer”); robert.mesibov@gmail.com