How to find strict duplicates in a Darwin Core table

Exact duplicate records in a Darwin Core table have identical entries in every field. They’re easy to find and remove, even in Microsoft Excel.

Strict duplicate records have identical entries in every field except the one or more fields that give each record a unique ID, such as eventID or occurrenceID. These duplicates should be removed, because they contribute no additional information to the table.

Finding strict duplicate records isn’t so easy. Unfortunately, strict duplicates do exist in datasets shared with GBIF, and they aren’t detected by GBIF’s Data Validator.

The best way I know to find strict duplicates is in a BASH shell, with the GNU AWK command suggested here on the Darwin Core table checker website. The simplified table below (named “event.txt”) shows how the command works, but please note: I’ve used a semicolon as field separator only so that you can copy and paste the table into a text editor from this forum page. Once it’s in the editor, please replace the semicolons with tabs so that the table is tab-separated (TSV).
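One way to do the conversion in the shell (a sketch only; “pasted.txt” is just a placeholder name for your pasted copy):

# turn every semicolon into a tab
tr ';' '\t' < pasted.txt > event.txt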

eventID;locality;decimalLatitude;decimalLongitude;geodeticDatum;eventDate
ev001;Towells;-37.6520;154.7603;WGS84;2024-06-10
ev002;Gramdun;-37.6502;154.7612;WGS84;2024-07-14
ev003;Gramdun;-37.6502;154.7612;WGS84;2024-07-15
ev004;One Tree Hill;-37.6491;154.7318;WGS84;2024-06-03
ev005;Fernoaks;-37.6513;154.7598;WGS84;2024-07-29
ev006;Towells;-37.6520;154.7603;WGS84;2024-06-10
ev007;Lees Paddock;-37.6501;154.7702;WGS84;2024-06-25
ev008;Fernoaks;-37.6513;154.7598;WGS84;2024-07-29
ev009;Towells;-37.6520;154.7603;WGS84;2024-06-10
ev010;One Tree Hill;-37.6491;154.7318;WGS84;2024-06-05

The AWK command groups the strict duplicates and neatly separates the groups with a blank line.
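An AWK command along these lines does that job (a sketch; the exact command suggested on the checker website may differ). It ignores the first field (eventID) when comparing records:

# skip the header, build a key from fields 2 onward, collect records by key,
# then print only the record groups whose key was seen more than once
awk -F"\t" 'NR > 1 {key = ""; for (i = 2; i <= NF; i++) key = key "\t" $i;
    count[key]++; group[key] = group[key] $0 "\n"}
    END {for (k in group) if (count[k] > 1) print group[k]}' event.txt

For the table above this prints the ev001/ev006/ev009 group and the ev005/ev008 group, each followed by a blank line (the order of the two groups may vary).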

If you’re familiar with BASH tools you may be wondering whether you can do the same job with sort and uniq. You can, but there are catches.
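A minimal sketch of that approach, assuming the unique-ID field is the first field in each record:

# sort on everything from the second field onward, then have uniq skip the
# first field (-f1) and print all repeated records (GNU uniq's -D option)
sort -k2 event.txt | uniq -f1 -D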

The method fails if the unique-ID field to be ignored isn’t the first field in the record, or if there is more than one unique-ID field scattered through the table. The sort-uniq approach also doesn’t separate the strict duplicates into groups.

The AWK command can also be used to find duplicates where one of the data fields is missing an entry, or where field entries have not been properly normalised. In the version of the table below, one of the “Towells” entries is missing and another is misspelled; the command shown after the table ignores both eventID and locality. (A locality field should have normalised entries. Original entries can go in a verbatimLocality field.)

eventID;locality;decimalLatitude;decimalLongitude;geodeticDatum;eventDate
ev001;Towells;-37.6520;154.7603;WGS84;2024-06-10
ev002;Gramdun;-37.6502;154.7612;WGS84;2024-07-14
ev003;Gramdun;-37.6502;154.7612;WGS84;2024-07-15
ev004;One Tree Hill;-37.6491;154.7318;WGS84;2024-06-03
ev005;Fernoaks;-37.6513;154.7598;WGS84;2024-07-29
ev006;;-37.6520;154.7603;WGS84;2024-06-10
ev007;Lees Paddock;-37.6501;154.7702;WGS84;2024-06-25
ev008;Fernoaks;-37.6513;154.7598;WGS84;2024-07-29
ev009;Towels;-37.6520;154.7603;WGS84;2024-06-10
ev010;One Tree Hill;-37.6491;154.7318;WGS84;2024-06-05
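A sketch of that version of the command (again, the exact wording may differ), building the comparison key from field 3 onward so that both eventID and locality are ignored:

# ignore fields 1 and 2 (eventID and locality) when building the comparison key
awk -F"\t" 'NR > 1 {key = ""; for (i = 3; i <= NF; i++) key = key "\t" $i;
    count[key]++; group[key] = group[key] $0 "\n"}
    END {for (k in group) if (count[k] > 1) print group[k]}' event.txt

Here the ev001, ev006 and ev009 records come out as one group despite the blank and misspelled locality entries, along with the ev005/ev008 group as before.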

If there are “pythonistas” or “useRs” who follow this forum, I would be interested to hear about simple Python or R methods for finding and grouping strict duplicates.


Robert Mesibov (“datafixer”); robert.mesibov@gmail.com


In pandas (Python), one can easily identify duplicates considering only a subset of columns, using the subset argument of the duplicated method on a DataFrame. If you wish to drop them, drop_duplicates is a very convenient way to do so.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html

I have copied out the example provided in the documentation below for convenience.

Mark Duplicates

Create example DataFrame

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
    brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

Find duplicates considering all columns

>>> df.duplicated()
0    False
1     True
2    False
3    False
4    False
dtype: bool

Find duplicates considering only the brand column

subset takes a list as an argument, so you can pass as many or as few columns as you would like to consider.

>>> df.duplicated(subset=['brand'])
0    False
1     True
2    False
3     True
4     True
dtype: bool

Choose which duplicate to keep

Using the keep argument, you may choose which occurrence is left unmarked (the first or the last), or mark every duplicate as True with keep=False. In the context of a DwCA, keeping the first or last would likely make sense, rather than marking all duplicates.

>>> df.duplicated(keep='last')
0     True
1    False
2    False
3    False
4    False
dtype: bool

Drop duplicates

drop_duplicates() takes the same subset and keep arguments as duplicated() for acting on a subset of columns.

>>> df.drop_duplicates(subset=['brand'])
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5

@mark-pitblado, many thanks for the Python duplication examples. So you need to import pandas and list in “subset”, one by one, the fields to be checked (everything except the unique-ID fields to be ignored)? That could be a little tedious with a big DwC table! Is there a simple Python way to specify the ignore/check fields?

The Python example only has 1 pair of duplicate records, whereas real-world DwC tables usually have multiple sets. How would you group these (and preferably, separate them) in Python?

Also, I suggest you might want to “keep” (Python parameter) all duplicates in a data-cleaning exercise, so that the source data can be edited.
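For example (if I’ve understood the keep parameter correctly), something like this sketch on your toy DataFrame would flag every member of each duplicate group, so that the whole group can be inspected before editing:

>>> # keep=False marks every row that has a duplicate, not just the later occurrences
>>> df[df.duplicated(subset=['brand', 'style'], keep=False)]
    brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
3  Indomie  pack    15.0
4  Indomie  pack     5.0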