Identifying authors of iNaturalist observations within GBIF download data

The following portion of a Python script reveals the positions and names of keys within a GBIF download of occurrences of Quercus velutina:

with open("0071073-231120084113126.csv", "r", encoding="utf-8") as f:
    lines = f.readlines()

# strip the trailing newline so the last column name prints cleanly
header = lines[0].rstrip("\n").split("\t")

for ind, item in enumerate(header):
    print(ind, item)

Output:

0 gbifID
1 datasetKey
2 occurrenceID
3 kingdom
4 phylum
5 class
6 order
7 family
8 genus
9 species
10 infraspecificEpithet
11 taxonRank
12 scientificName
13 verbatimScientificName
14 verbatimScientificNameAuthorship
15 countryCode
16 locality
17 stateProvince
18 occurrenceStatus
19 individualCount
20 publishingOrgKey
21 decimalLatitude
22 decimalLongitude
23 coordinateUncertaintyInMeters
24 coordinatePrecision
25 elevation
26 elevationAccuracy
27 depth
28 depthAccuracy
29 eventDate
30 day
31 month
32 year
33 taxonKey
34 speciesKey
35 basisOfRecord
36 institutionCode
37 collectionCode
38 catalogNumber
39 recordNumber
40 identifiedBy
41 dateIdentified
42 license
43 rightsHolder
44 recordedBy
45 typeStatus
46 establishmentMeans
47 lastInterpreted
48 mediaType
49 issue

Among the output, the following three keys might correspond to the identity of the authors of observations from iNaturalist:

40 identifiedBy
43 rightsHolder
44 recordedBy

In most cases, the values associated with these three keys are identical, but sometimes they differ. Which of these keys would be most reliable for identifying the iNaturalist user who created the observation?
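
One way to see how often the three fields actually disagree is to count the rows where they are not all equal. This is only a minimal sketch: it assumes the download is the tab-delimited simple CSV shown above and treats it as unquoted, and the function name `count_disagreements` is made up for illustration.

```python
import csv
import io

FIELDS = ("identifiedBy", "rightsHolder", "recordedBy")

def count_disagreements(handle):
    """Return (total_rows, rows_where_the_three_fields_are_not_all_equal)."""
    # assumption: tab-delimited file with no quoting, as in a GBIF simple download
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = differing = 0
    for row in reader:
        total += 1
        # collect the distinct values; more than one means the fields disagree
        values = {(row.get(field) or "").strip() for field in FIELDS}
        if len(values) > 1:
            differing += 1
    return total, differing

# tiny made-up sample standing in for the real download
sample = io.StringIO(
    "gbifID\tidentifiedBy\trightsHolder\trecordedBy\n"
    "1\tA. Person\tA. Person\tA. Person\n"
    "2\tB. Other\tA. Person\tA. Person\n"
)
print(count_disagreements(sample))  # prints (2, 1)
```

To run it on the real file, open the download with `open("0071073-231120084113126.csv", newline="", encoding="utf-8")` and pass the file object instead of `sample`.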

These three fields mean different things. identifiedBy is a list of who provided identifications for an observation, and it can contain multiple names. rightsHolder is the ‘owner’ of the observation, and recordedBy is the term for who saw it. In your case you probably want rightsHolder. However, depending on the information on the iNaturalist user’s profile, you’ll actually get a few different things here: not necessarily a username, sometimes a person’s name!

Take this observation for example:

You’ll notice both recordedBy and rightsHolder contain my real name, not my iNaturalist username. Because personal names are not unique, this might be an issue for you. In that case you could use Recorded by ID (recordedByID), which contains my ORCID (because I’ve provided one), but this is only the case for a small number of iNaturalist users.

To conclude, you could use either recordedBy or rightsHolder, but you’ll possibly have some issues with people with common names. If you want to be 100% foolproof, you’d need to go back to iNaturalist (via the API) and fetch the username.

I had this issue a while back, and decided to just use the rightsHolder instead and live with the possibility of people having the same name.
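
For anyone who does want to go back to iNaturalist for the username, here is a rough sketch of what that lookup might look like. It rests on several assumptions that should be checked against your own data: that occurrenceID in the download is the observation URL ending in the numeric observation ID, that the iNaturalist API v1 observation endpoint is used, and that its JSON response carries the observer under `results[0]["user"]["login"]`. The helper names are invented for this example.

```python
import json
import re
import urllib.request

# assumption: iNaturalist API v1 endpoint for a single observation
API_ROOT = "https://api.inaturalist.org/v1/observations/"

def observation_id(occurrence_id):
    """Return the trailing numeric ID of an iNaturalist observation URL, or None."""
    match = re.search(r"/observations/(\d+)/?$", occurrence_id.strip())
    return match.group(1) if match else None

def login_from_response(payload):
    """Pull the observer's login out of a decoded API response, or None."""
    results = payload.get("results") or []
    if not results:
        return None
    return (results[0].get("user") or {}).get("login")

def fetch_login(occurrence_id):
    """Look up the observer's login online (one request per observation)."""
    obs_id = observation_id(occurrence_id)
    if obs_id is None:
        return None
    with urllib.request.urlopen(API_ROOT + obs_id, timeout=30) as response:
        return login_from_response(json.load(response))
```

Calling `fetch_login(occurrence["occurrenceID"])` makes one request per observation, so for a large download it would be worth caching results and pausing between requests.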

Thanks! That seems a good option. Checking for both that and the presence of the institutionCode for iNaturalist within an occurrence might be sufficient for minimizing the possibility of confusing the observations of different individuals, particularly where there may be contributions from individuals from other institutions whose usernames coincide with those of individuals on iNaturalist.

An additional wrinkle is that identifiedBy in the DwC download for iNaturalist records is the first identifier, and that may not be what you’re after. I had this problem when looking at “following” in iNaturalist - where identifier N agrees with identifier N-1 or some other earlier identifier, because N thinks the earlier identifier should know the correct ID. See this post.

I think you might be better off filtering on the specific datasetKey:

The primary intent of this thread was to develop a means of identifying authors of iNaturalist observations as they are represented within GBIF occurrence data. This identity might be a name or some other unique identifier of a person. The replies to the original post and some of my experimentation with the Quercus velutina download data suggest that there are multiple challenges with this. Filtering on the datasetKey of the iNaturalist Research-grade Observations dataset might be the most realistic path to take.
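
Selecting on datasetKey could be sketched roughly as below. The function name `rows_for_dataset` is hypothetical, and the keys in the sample are placeholders: the real key has to be copied from the dataset's page on GBIF or from the datasetKey column of the download itself.

```python
import csv
import io

def rows_for_dataset(handle, dataset_key):
    """Return the rows of a tab-delimited download whose datasetKey matches."""
    # assumption: tab-delimited file with no quoting, as in a GBIF simple download
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row.get("datasetKey") == dataset_key]

# made-up rows; "key-A" and "key-B" are placeholders, not real dataset keys
sample = io.StringIO(
    "gbifID\tdatasetKey\trightsHolder\n"
    "1\tkey-A\tQuercitron\n"
    "2\tkey-B\tQuercitron\n"
    "3\tkey-A\tSomeone Else\n"
)
for row in rows_for_dataset(sample, "key-A"):
    print(row["gbifID"], row["rightsHolder"])
```

Compared with matching on institutionCode, this ties the selection to one published dataset rather than to a free-text code.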

A broader and potentially more useful redefinition of the stated task would be to find a means to identify authors of occurrences, in general, within GBIF occurrences data, which would include data that arrives through all institutions that are represented here. For instance, some person might wish to retrieve all GBIF data that represents observations that they performed. See the discussion GBIF Community Forum: My first PyGbif script.

As already noted, there are multiple wrinkles with the broadened definition of the original task. These include the fact that the name of a person is not necessarily unique. A combination of an institutionCode and the username of a person from that institution would presumably be unique, but as we have observed, that username is not necessarily represented within the downloaded GBIF data.

I believe the institutionCode should refer to the institution that has custody or ownership of the information published on GBIF. So even if I belong to institution A, if the dataset my record is published in is owned by institution B, then the value for institutionCode should be B. See Darwin Core Quick Reference Guide - Darwin Core.

The ideal solution is that datasets populate recordedByID: Darwin Core Quick Reference Guide - Darwin Core
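
As a rough illustration of how recordedByID could be used when it is present, the sketch below builds a grouping key that prefers the identifier and falls back to institutionCode plus recordedBy. Note that recordedByID is not among the columns listed in the header output above, so this assumes a download format that includes it. The function names are invented, and the identifier in the sample is a placeholder, not a real ORCID.

```python
from collections import defaultdict

def person_key(occurrence):
    """Return a hashable key for whoever recorded this occurrence."""
    identifier = (occurrence.get("recordedByID") or "").strip()
    if identifier:
        # a persistent identifier, when provided, is the safest key
        return ("id", identifier)
    # fallback: not guaranteed unique, since personal names can collide
    institution = (occurrence.get("institutionCode") or "").strip()
    name = (occurrence.get("recordedBy") or "").strip()
    return ("name", institution, name)

def group_by_person(occurrences):
    """Map each person key to the list of gbifIDs recorded under it."""
    groups = defaultdict(list)
    for occurrence in occurrences:
        groups[person_key(occurrence)].append(occurrence.get("gbifID"))
    return dict(groups)

# made-up records for illustration only
records = [
    {"gbifID": "1", "institutionCode": "iNaturalist", "recordedBy": "A. Person",
     "recordedByID": "placeholder-id-1"},
    {"gbifID": "2", "institutionCode": "iNaturalist", "recordedBy": "A. Person"},
    {"gbifID": "3", "institutionCode": "iNaturalist", "recordedBy": "A Person",
     "recordedByID": "placeholder-id-1"},
]
print(group_by_person(records))
```

In the sample, records 1 and 3 end up together despite the differently spelled name, while record 2 stays separate because it has no identifier to match on.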

You might be interested in this publication: The disambiguation of people names in biological collections

Telling people apart at a large scale, with many different datasets, is quite difficult. Finding your own records is much easier, because you of course have intimate knowledge of your own name, what identifiers/usernames you have used, where, when, etc.

For collections, it’s often a combination of time and place that is used as well. We can tell two collectors apart because we know when they lived, and we might know collector 1 was in country A at time x, but collector 2 wasn’t…

Interesting stuff this!

Yes, indeed it is, even if there is no perfect solution.

Thanks for all the very interesting information and insights, @pieter and @datafixer.

For the record, below is some of the Python code and output that formed part of my experimentation with this effort. It involves a Python dictionary of GBIF occurrences, with each occurrence represented by a Python dictionary contained within.

# January 24, 2024

# load the GBIF occurrence data for Quercus velutina
with open("0071073-231120084113126.csv", "r", encoding="utf-8") as f:
    lines = f.readlines()

# split the header into a Python list, dropping the trailing newline first
header = lines[0].rstrip("\n").split("\t")
occurrences = {} # initialize the occurrences Python dictionary

# populate the occurrences dictionary
for line in lines[1:]:
    if not line.strip():
        continue # skip blank lines, such as one at the end of the file
    occurrence = line.rstrip("\n").split("\t") # split the individual occurrence line into a Python list
    # build a Python dictionary for this occurrence using header items as keys; gbifID as key to it
    occurrences[occurrence[0]] = dict(zip(header, occurrence))

keys = list(occurrences.keys()) # get list of all occurrences dictionary keys

# display selected occurrence data for the person ("iNaturalist", "Quercitron")
targetPersonID = ("iNaturalist", "Quercitron") # target this person
print(f'{"gbifID":12}  {"institutionCode":15}  {"rightsHolder":12}  {"scientificName":24}  {"dateIdentified":20}')
for key in keys:
    occurrence = occurrences[key]
    currentPersonID = (occurrence["institutionCode"], occurrence["rightsHolder"])
    if currentPersonID == targetPersonID: # show only occurrences from this person
        print(f'{occurrence["gbifID"]:12}  {occurrence["institutionCode"]:15}  {occurrence["rightsHolder"]:12}  {occurrence["scientificName"]:24}  {occurrence["dateIdentified"]:20}')

Output:

gbifID        institutionCode  rightsHolder  scientificName            dateIdentified      
4510320810    iNaturalist      Quercitron    Quercus velutina Lam.     2024-01-03T19:20:29 
4510285318    iNaturalist      Quercitron    Quercus velutina Lam.     2024-01-08T17:18:50 
4510284313    iNaturalist      Quercitron    Quercus velutina Lam.     2024-01-08T17:44:33 
4510082014    iNaturalist      Quercitron    Quercus velutina Lam.     2024-01-04T16:33:31 
4507956844    iNaturalist      Quercitron    Quercus velutina Lam.     2023-12-31T14:35:34 
4507896514    iNaturalist      Quercitron    Quercus velutina Lam.     2023-12-30T14:56:32 
4507789456    iNaturalist      Quercitron    Quercus velutina Lam.     2024-01-02T18:33:19 

Note that, although it is highly unlikely, the above would have given misleading results if another iNaturalist user had provided “Quercitron” as their personal full name, because that user's observations would have been listed as well.

Hi @quercitron, note that you get the same output in the GBIF portal with this query: Search

Thanks. I’ll experiment more with the GBIF portal.

One of the reasons for bringing Python into this was to be able to download occurrence data of potential interest regarding a taxon, then experiment with it offline. For example, before going online again, we could select for specified users, geographic locations, or anything else, then generate KML, GPX, graphs, charts, and maps, or other file formats for use in various applications.
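
As one small example of that kind of offline processing, the sketch below turns occurrence dictionaries into a GPX 1.1 document with one waypoint per record that has coordinates. It uses only the standard library. The function name `occurrences_to_gpx`, the creator string, and the sample coordinates are made up for illustration.

```python
import xml.etree.ElementTree as ET

GPX_NS = "http://www.topografix.com/GPX/1/1"

def occurrences_to_gpx(occurrences):
    """Build a GPX 1.1 string with one waypoint per georeferenced occurrence."""
    ET.register_namespace("", GPX_NS)  # write GPX as the default namespace
    gpx = ET.Element(f"{{{GPX_NS}}}gpx",
                     {"version": "1.1", "creator": "gbif-offline-sketch"})
    for occ in occurrences:
        lat = (occ.get("decimalLatitude") or "").strip()
        lon = (occ.get("decimalLongitude") or "").strip()
        if not lat or not lon:
            continue  # skip records without coordinates
        wpt = ET.SubElement(gpx, f"{{{GPX_NS}}}wpt", {"lat": lat, "lon": lon})
        # label each waypoint with its gbifID
        ET.SubElement(wpt, f"{{{GPX_NS}}}name").text = occ.get("gbifID", "")
    return ET.tostring(gpx, encoding="unicode")

# made-up records; the second has no coordinates and is skipped
sample_occurrences = [
    {"gbifID": "1", "decimalLatitude": "41.5", "decimalLongitude": "-72.7"},
    {"gbifID": "2", "decimalLatitude": "", "decimalLongitude": ""},
]
print(occurrences_to_gpx(sample_occurrences))
```

With the occurrences dictionary built earlier in this thread, the call would be `occurrences_to_gpx(occurrences.values())`, optionally after filtering for a particular rightsHolder or area.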