I distinguish three different kinds of duplicate Darwin Core occurrence records:
- exact duplicates have the same entries in all fields, including unique identifiers like occurrenceID.
- strict duplicates have the same entries in all fields except those which give the record a unique ID, like the occurrenceID field.
- relaxed duplicates have the same entries in key occurrence fields, such as taxon, coordinates, date and collector/observer.
Exact duplicates are very rare in biodiversity datasets. They probably arise from bookkeeping errors when records are compiled, or when records are converted to the Darwin Core format. Because exact duplicates are easy to find (see below), they’re easy to eliminate as well.
More common are strict duplicates. I suspect that many of these are generated when minor fields are dropped during conversion to Darwin Core. For example, suppose a collections management system (CMS) has these two (simplified) records:
ID | COLLDATE | COLLTOR | SITE | CABINET | DRAWER
Leptanena depressa | 16 May 1923 | H. Murchison | Hamilton Quarry | 3BH | 15
Leptanena depressa | 16 May 1923 | H. Murchison | Hamilton Quarry | 3BH | 16
The minor fields “CABINET” and “DRAWER” are dropped in the Darwin Core version, resulting in two strictly duplicate records:
occurrenceID | scientificName | eventDate | recordedBy | locality
2133 | Leptanena depressa | 1923-05-16 | H. Murchison | Hamilton Quarry
2134 | Leptanena depressa | 1923-05-16 | H. Murchison | Hamilton Quarry
Strict duplicates might also be the result of delayed identifications. Two records might have exactly the same observation or collection details, but one might have a definite ID (call it Aus bus) while the other ID might be uncertain. If the uncertain ID is later firmed up as Aus bus, the two records become strict duplicates and can be merged.
In the example shown below, there are only two strict duplicates in a ca 13000-record dataset. I’ve seen datasets containing thousands of strict duplicates, mostly the result of bookkeeping errors.
Unlike exact and strict duplicates, relaxed duplicates are quite common in Darwin Core datasets and generally don’t arise from compilation or bookkeeping problems. For example, a single plant specimen might be divided and distributed to several herbaria; individual specimens of a certain species from a particular collecting event might be separately registered in a collection; or the same taxon+location+date+collector might come from two different collection methods (different samplingProtocol).
The code I use to detect duplicates relies on a dataset being tidy: entries must be consistent and consistently formatted. “1989-06-05” and “5/6/89” might be the same date in your mind, but they’re different for a computer routine that processes data character by character.
Similarly, “C. Darwin”, “Darwin, C.” and “Darwin, C” are different and won’t be recognised as the same by the duplicate-hunting methods described below. More not-the-same examples:
- “B. Abbott; L. Costello” vs “B. Abbott | L. Costello” in recordedBy
- “1989-06-05” vs “1989-06-05 ” in eventDate (note the trailing space in the second data item)
- “40.0618924337825” vs “40.06189” in decimalLatitude
- “K. Alexandrov” vs “K. Аlexandrov” (the “A” in the second data item is the Cyrillic “A”, not the Latin one)
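A quick way to see that such pairs really are different to a program is a character-by-character string comparison in the shell. This is a minimal sketch for illustration only:

[ "1989-06-05" = "1989-06-05 " ] && echo "same" || echo "different"    # trailing space: prints "different"
[ "K. Alexandrov" = "K. Аlexandrov" ] && echo "same" || echo "different"    # Cyrillic "А": prints "different"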
It’s possible to write duplicate-hunting code that allows for variations like these, by first normalising all entries to a selected format. This is an awkward solution to the problem: normalising a wide range of formats (think of dates) is very code-heavy, automated normalisation may fail for some “pathological” entries, and the original dataset is left in its untidy state.
Tidy data is a blessing for end-users, and tidying also makes it much easier for both the data compiler and the end-user to find duplicates!
exact duplicates
Easily found in a BASH shell with sort [file] | uniq -D. It’s best to first pipe the result to a line count with wc -l, to avoid flooding the terminal in case there are a lot of duplicates. The following example comes from a dataset of psyllid occurrences with 1975 records and 28 fields. Only the first two exact duplicates are shown, and I’ve piped the output for clarity to barsep, an alias that replaces tabs with space-bar-space.
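A minimal sketch of that pipeline, assuming the occurrence file is named psyllids.txt and defining barsep as a simple sed wrapper (both the filename and the definition are assumptions, not the original setup):

# Assumed filename for the tab-separated psyllid dataset
file="psyllids.txt"

# Count the exactly duplicated lines first, to avoid flooding the terminal
sort "$file" | uniq -D | wc -l

# Then inspect the duplicates, replacing each tab with " | " for readability;
# this barsep function stands in for the alias mentioned above
barsep() { sed $'s/\t/ | /g'; }
sort "$file" | uniq -D | barsep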
strict duplicates
More information here and here
This and the next code recipe use GNU AWK 4 (or 5) in a BASH shell, work very fast and will easily process large numbers of records (millions) with large numbers of fields (hundreds). For more information on these and related methods, see A Data Cleaner’s Cookbook.
The example here is occurrence.txt and it comes from the source archive for this very tidy GBIF dataset (accessed 2022-11-30). The tab-separated file has 13742 records and 35 fields, and the fields that uniquely identify a record are id and occurrenceID, which are fields 1 and 7, respectively. The command I used is
awk -F"\t" 'FNR==NR {$1=$7=1; a[$0]++; next} {x=$0; $1=$7=1} a[$0]>1 {print x}' occurrence.txt occurrence.txt
The two records shown are identical apart from the id and occurrenceID entries “Pteridophytes_of_the_Urals_02516” and “Pteridophytes_of_the_Urals_02517”. If there are numerous strict duplicates they can be sorted afterwards to group the duplicates by their non-unique fields of choice.
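Spread over several lines with comments, the same command can be read like this (an annotated restatement of the one-liner above, not a different method):

# Pass 1 (FNR==NR is true only while reading the first copy of the file):
# overwrite the two unique-ID fields (id = $1, occurrenceID = $7) with a
# constant, then tally each resulting ID-free record string in array a.
# Pass 2: save the untouched line in x, blank the same two fields, and print
# the saved line whenever its ID-free version was tallied more than once.
awk -F"\t" 'FNR==NR {$1=$7=1; a[$0]++; next}
{x=$0; $1=$7=1}
a[$0]>1 {print x}' occurrence.txt occurrence.txt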
relaxed duplicates
More information here and here
This example is from a just-published GBIF dataset (source archive accessed 2022-12-01), with 7310 records and 92 fields, and no exact or strict duplicates. The fields of interest are scientificName (field 16), decimalLatitude (68), decimalLongitude (69), eventDate (34) and recordedBy (32). In the source Darwin Core archive, occurrences are in occurrence.csv, which I converted to tab-separated oc1 from the horrible CSV format. A preliminary check showed that many records lacked coordinates or date or collector, so I added conditions to the AWK command that these fields should be non-empty. The core command is then
awk -F"\t" 'FNR==NR {a[$16,$68,$69,$34,$32]++; next} $68!="" && $69!="" && $34!="" && $32!="" && a[$16,$68,$69,$34,$32]>1' oc1 oc1
and it finds 1123 relaxed duplicates (top of screenshot below).
The command was modified to print these key fields together with occurrenceID (field 7) and catalogNumber (8), and the output was piped to sort, grouping the duplicates by taxon, coordinates, date and collector (in that order). The first 10 relaxed duplicates are shown in the lower part of the screenshot, and a sketch of the modified command follows below.
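One way to write that modified command (a sketch only; the final head simply limits the display to the first 10 output lines):

# Pass 1: tally each combination of scientificName ($16), decimalLatitude ($68),
# decimalLongitude ($69), eventDate ($34) and recordedBy ($32) in array a.
# Pass 2: for records with non-empty coordinates, date and collector, print those
# five key fields plus occurrenceID ($7) and catalogNumber ($8) whenever the
# key-field combination occurs more than once; sort then groups the relaxed
# duplicates by taxon, coordinates, date and collector.
awk -F"\t" -v OFS="\t" 'FNR==NR {a[$16,$68,$69,$34,$32]++; next}
$68!="" && $69!="" && $34!="" && $32!="" &&
a[$16,$68,$69,$34,$32]>1 {print $16, $68, $69, $34, $32, $7, $8}' oc1 oc1 | sort | head -n 10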
Robert Mesibov (“datafixer”); robert.mesibov@gmail.com