Duplicate occurrence records

I distinguish three different kinds of duplicate Darwin Core occurrence records:

  • exact duplicates have the same entries in all fields, including unique identifiers like occurrenceID.
  • strict duplicates have the same entries in all fields except those which give the record a unique ID, like the occurrenceID field.
  • relaxed duplicates have the same entries in key occurrence fields, such as taxon, coordinates, date and collector/observer.

Exact duplicates are very rare in biodiversity datasets. They probably arise from bookkeeping errors when records are compiled, or when records are converted to the Darwin Core format. Because exact duplicates are easy to find (see below), they’re easy to eliminate as well.

More common are strict duplicates. I suspect that many of these are generated when minor fields are dropped during conversion to Darwin Core. For example, suppose a collection management system (CMS) has these two (simplified) records:

    ID | COLLDATE | COLLTOR | SITE | CABINET | DRAWER
    Leptaena depressa | 16 May 1923 | H. Murchison | Hamilton Quarry | 3BH | 15
    Leptaena depressa | 16 May 1923 | H. Murchison | Hamilton Quarry | 3BH | 16

The minor fields “CABINET” and “DRAWER” are dropped in the Darwin Core version, resulting in two strictly duplicate records:

    occurrenceID | scientificName | eventDate | recordedBy | locality
    2133 | Leptaena depressa | 1923-05-16 | H. Murchison | Hamilton Quarry
    2134 | Leptaena depressa | 1923-05-16 | H. Murchison | Hamilton Quarry

Strict duplicates might also be the result of delayed identifications. Two records might have exactly the same observation or collection details, but one might have a definite ID — call it Aus bus — while the other ID might be uncertain. When the uncertain ID is later firmed up as Aus bus, the two records become strict duplicates and can be merged.

In the example shown below, there are only two strict duplicates in a ca. 13,000-record dataset. I’ve seen datasets containing thousands of strict duplicates, mostly the result of bookkeeping errors.

Unlike exact and strict duplicates, relaxed duplicates are quite common in Darwin Core datasets and generally don’t arise from compilation or bookkeeping problems. For example:

  • a single plant specimen might be divided and distributed to several herbaria
  • individual specimens of a certain species from a particular collecting event might be separately registered in a collection
  • the same taxon+location+date+collector combination might come from two different collection methods (different samplingProtocol)

The code I use to detect duplicates relies on a dataset being tidy: entries must be consistent and consistently formatted. “1989-06-05” and “5/6/89” might be the same date in your mind, but they’re different for a computer routine that processes data character by character.

Similarly, “C. Darwin”, “Darwin, C.” and “Darwin, C” are different and won’t be recognised as the same by the duplicate-hunting methods described below. More not-the-same examples:

  • “B. Abbott; L. Costello” vs “B. Abbott | L. Costello” in recordedBy
  • “1989-06-05” vs “1989-06-05 ” in eventDate (note the trailing space in the second data item)
  • “40.0618924337825” vs “40.06189” in decimalLatitude
  • “K. Alexandrov” vs “K. Аlexandrov” (the “A” in the second data item is the Cyrillic “A”, not the Latin one)

It’s possible to write duplicate-hunting code that allows for variations like these by first normalising all entries to a selected format. This is an awkward solution to the problem: normalising a wide range of formats (think of dates) is very code-heavy, automated normalisation may fail for some “pathological” entries, and the original dataset is left in its untidy state.

Tidy data is a blessing for end users, and tidying also makes it much easier for both the data compiler and the end user to find duplicates!


exact duplicates

Easily found in a BASH shell with sort [file] | uniq -D. It’s best to first pipe the result to a line count with wc -l, to avoid flooding the terminal in case there are a lot of duplicates. The following example comes from a dataset of psyllid occurrences with 1975 records and 28 fields. Only the first two exact duplicates are shown, and for clarity I’ve piped the output to barsep, an alias that replaces tabs with space-bar-space; the commands are sketched below.
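In outline the workflow looks like this (the filename psyllids.txt is a placeholder, and the barsep definition shown is an assumption, not taken from the original post):

    # "barsep" is assumed to be an alias like this (GNU sed replaces each tab with " | ")
    alias barsep="sed 's/\t/ | /g'"

    # count the exactly duplicated lines first, to avoid flooding the terminal
    sort psyllids.txt | uniq -D | wc -l

    # then inspect the first two duplicated lines in a readable, bar-separated form
    sort psyllids.txt | uniq -D | head -n 2 | barsep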


strict duplicates
More information here and here

This and the next code recipe use GNU AWK 4 (or 5) in a BASH shell, work very fast and will easily process large numbers of records (millions) with large numbers of fields (hundreds). For more information on these and related methods, see A Data Cleaner’s Cookbook.

The example here is occurrence.txt and it comes from the source archive for this very tidy GBIF dataset (accessed 2022-11-30). The tab-separated file has 13742 records and 35 fields, and the fields that uniquely identify a record are id and occurrenceID, which are fields 1 and 7, respectively. The command I used is

    awk -F"\t" 'FNR==NR {$1=$7=1; a[$0]++; next} {x=$0; $1=$7=1} a[$0]>1 {print x}' occurrence.txt occurrence.txt

The two records shown are identical apart from the id and occurrenceID entries “Pteridophytes_of_the_Urals_02516” and “Pteridophytes_of_the_Urals_02517”. If there are numerous strict duplicates they can be sorted afterwards to group the duplicates by their non-unique fields of choice.
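For example, the output could be piped to sort with keys on whichever fields are wanted. In this sketch the key numbers 6 and 30 are placeholders only, not the real positions of any particular fields in occurrence.txt:

    awk -F"\t" 'FNR==NR {$1=$7=1; a[$0]++; next} {x=$0; $1=$7=1} a[$0]>1 {print x}' occurrence.txt occurrence.txt \
    | sort -t $'\t' -k6,6 -k30,30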


relaxed duplicates
More information here and here

This example is from a just-published GBIF dataset (source archive accessed 2022-12-01), with 7310 records and 92 fields, and no exact or strict duplicates. The fields of interest are scientificName (field 16), decimalLatitude (68), decimalLongitude (69), eventDate (34) and recordedBy (32). In the source Darwin Core archive, occurrences are in occurrence.csv, which I converted from the horrible CSV format to the tab-separated file oc1. A preliminary check showed that many records lacked coordinates, a date or a collector, so I added conditions to the AWK command requiring these fields to be non-empty. The core command is then

    awk -F"\t" 'FNR==NR {a[$16,$68,$69,$34,$32]++; next} $68!="" && $69!="" && $34!="" && $32!="" && a[$16,$68,$69,$34,$32]>1' oc1 oc1

and it finds 1123 relaxed duplicates (top of screenshot below).

The command was modified to print these key fields together with occurrenceID (field 7) and catalogNumber (8), and the output was piped to sort to group the duplicates by taxon, coordinates, date and collector (in that order); a sketch of one way to do this is given below. The first 10 relaxed duplicates are shown in the lower part of the screenshot.
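Because the key fields are printed first and in the sorting order just described, a plain sort after the modified AWK command is enough to group the duplicates. The print list below is my reconstruction from that description, not the exact command used:

    awk -F"\t" -v OFS="\t" '
        # First pass: count each taxon + coordinates + date + collector combination
        FNR==NR {a[$16,$68,$69,$34,$32]++; next}
        # Second pass: for records with non-empty key fields whose combination occurs
        # more than once, print the key fields plus occurrenceID (7) and catalogNumber (8)
        $68!="" && $69!="" && $34!="" && $32!="" && a[$16,$68,$69,$34,$32]>1 {
            print $16, $68, $69, $34, $32, $7, $8
        }
    ' oc1 oc1 | sort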


Robert Mesibov (“datafixer”); robert.mesibov@gmail.com


I wonder how to make these de-duplication efforts more widely available. Ideally de-duplication could be done on the backend using some QC checks like these. However, DwC data processing features like this can also be published as standalone packages.


@7yl4r, please note that the code shown in the post does not do de-duplication. Data compilers and publishers have the responsibility to de-duplicate, or rather, to review and fix (where necessary) any duplication issues detected by the code.

Please also note that the code relies on data being tidy. If data aren’t tidy, neither the code shown above nor a rewrite in a language suitable for backend use will detect duplicates.

I’m not aware of any backend or external packages (e.g. in R or Python) that can tidy a Darwin Core dataset. The OBIS list of checks is shorter than the latest TDWG and GBIF lists of checks, and all of these are shorter and less sophisticated than the checks routinely done for authors of Pensoft data papers, although the checks do not overlap perfectly (they complement each other). All of these checks simply flag data issues, which then should be (OBIS, GBIF) or have to be (Pensoft) fixed by the data compiler.

Fortunately or unfortunately, Darwin Core datasets are complex beasts that don’t lend themselves to automated checking and fixing. For this reason, people (not backend routines) are the best Darwin Core data cleaners. The code recipes I use are freely available on the Web, and I (and now others) are happy to provide training in their use.

I would be happy to discuss this further with you off-list.
