The following portion of a Python script reveals the positions and names of keys within a GBIF download of occurrences of Quercus velutina:
with open("0071073-231120084113126.csv", "r", encoding="utf-8") as f:
    lines = f.readlines()
header = lines[0].rstrip("\n").split("\t")  # drop the trailing newline before splitting
for ind, item in enumerate(header):
    print(ind, item)
Among the output, the following three keys might correspond to the identity of the authors of observations from iNaturalist:
40 identifiedBy
43 rightsHolder
44 recordedBy
In most cases, the values associated with these three keys are identical, but sometimes they differ. Which of these keys would be most reliable for identifying the iNaturalist user who created the observation?
These three fields mean different things. identifiedBy is a list of who provided identifications for an observation; it can contain multiple names. rightsHolder is the "owner" of the observation, and recordedBy is the term for who saw it. In your case you probably want rightsHolder. However, depending on the information in the iNaturalist user's profile, you'll actually get a few different things here: not necessarily a username, sometimes a person's name!
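One way to see how often the three keys actually differ is to tally the agreement pattern across all rows. The following is only a minimal sketch: the function names `read_occurrences` and `compare_author_fields` are made up for illustration, and it assumes the download is a tab-separated file without quoted fields.

```python
import csv
from collections import Counter

FIELDS = ("identifiedBy", "rightsHolder", "recordedBy")


def read_occurrences(path):
    """Yield each occurrence as a dict keyed by the header names."""
    with open(path, newline="", encoding="utf-8") as f:
        # Assumption: tab-separated values with no quoting.
        yield from csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)


def compare_author_fields(rows):
    """Tally how the three candidate fields agree or disagree across rows."""
    tally = Counter()
    for row in rows:
        # Missing or empty values are treated as empty strings.
        identified, holder, recorded = (
            (row.get(field) or "").strip() for field in FIELDS
        )
        if identified == holder == recorded:
            tally["all three identical"] += 1
        elif holder == recorded:
            tally["only identifiedBy differs"] += 1
        else:
            tally["rightsHolder and recordedBy differ"] += 1
    return tally


# Usage with the download from this thread:
# print(compare_author_fields(read_occurrences("0071073-231120084113126.csv")))
```

If nearly all disagreements fall in the "only identifiedBy differs" bucket, that would suggest the choice between rightsHolder and recordedBy matters little for that particular download.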
Take this observation for example:
You'll notice both recordedBy and rightsHolder contain my real name, not my iNaturalist username. Because person names are not unique, this might be an issue for you. In that case you could use Recorded by ID, which contains my ORCID (because I've provided one), but this is only the case for a small number of iNaturalist users.
So to conclude, you could use either recordedBy or rightsHolder, but you'll possibly have some issues with people with common names. If you want to be 100% foolproof, you'd need to go back to iNaturalist (via the API) and fetch the username.
I had this issue a while back, and decided to just use the rightsHolder instead and live with the possibility of people having the same name.
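For anyone who does want to go back to iNaturalist for the username, the lookup might be sketched roughly as below. This rests on two assumptions that should be checked against your own data: that the occurrenceID column of an iNaturalist record holds the observation URL, and that the iNaturalist v1 API response lists the observer under results[0]["user"]["login"]. The function names are made up for illustration.

```python
import json
import re
import urllib.request

INAT_API = "https://api.inaturalist.org/v1/observations/{obs_id}"


def observation_id(occurrence_id):
    """Extract the numeric observation id from an iNaturalist observation URL."""
    # Assumption: occurrenceID looks like https://www.inaturalist.org/observations/12345
    match = re.search(r"inaturalist\.org/observations/(\d+)", occurrence_id)
    return int(match.group(1)) if match else None


def login_from_response(payload):
    """Return the observer's login from a decoded API response, or None."""
    results = payload.get("results") or []
    if not results:
        return None
    return (results[0].get("user") or {}).get("login")


def fetch_login(obs_id, timeout=30):
    """Request one observation from the iNaturalist API and return the login."""
    url = INAT_API.format(obs_id=obs_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return login_from_response(json.load(response))


# Usage (performs a network request):
# obs_id = observation_id(occurrence["occurrenceID"])
# if obs_id is not None:
#     print(fetch_login(obs_id))
```

Looking up every record this way means one request per observation, so for a large download it would be worth pausing between requests and caching the results.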
Thanks! That seems a good option. Checking for both that and the presence of the institutionCode for iNaturalist within an occurrence might be sufficient for minimizing the possibility of confusing the observations of different individuals, particularly where there may be contributions from individuals from other institutions whose usernames coincide with those of individuals on iNaturalist.
An additional wrinkle is that identifiedBy in the DwC download for iNaturalist records is the first identifier, and that may not be what you're after. I had this problem when looking at "following" in iNaturalist, where identifier N agrees with identifier N-1 or some other earlier identifier because N thinks the earlier identifier should know the correct ID. See this post.
The primary intent of this thread was to develop a means of identifying authors of iNaturalist observations as they are represented within GBIF occurrences data. This identity might be some name or other unique identifier of a person. The replies to the original post and some of my experimentation with the Quercus velutina download data suggest that there are multiple challenges with this. Going with the specific iNaturalist Research-grade Observations might be the most realistic path to take.
A broader and potentially more useful redefinition of the stated task would be to find a means to identify authors of occurrences, in general, within GBIF occurrences data, which would include data that arrives through all institutions that are represented here. For instance, some person might wish to retrieve all GBIF data that represents observations that they performed. See the discussion GBIF Community Forum: My first PyGbif script.
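As a rough illustration of retrieving one's own records, the GBIF occurrence search API accepts a recordedBy filter. The sketch below uses plain HTTP rather than pygbif, the helper names are made up, and "Jane Doe" is a placeholder. Because the filter works on the name string, it inherits the non-uniqueness problem discussed in this thread.

```python
import json
import urllib.parse
import urllib.request

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def search_url(recorded_by, limit=20, offset=0, **filters):
    """Build an occurrence search URL filtered on recordedBy plus optional extras."""
    params = {"recordedBy": recorded_by, "limit": limit, "offset": offset, **filters}
    return GBIF_SEARCH + "?" + urllib.parse.urlencode(params)


def fetch_page(url, timeout=30):
    """Fetch one page of search results and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Usage (performs a network request; "Jane Doe" is a placeholder name):
# page = fetch_page(search_url("Jane Doe", institutionCode="iNaturalist"))
# for record in page["results"]:
#     print(record.get("gbifID"), record.get("scientificName"))
```

Results come back in pages, so collecting everything means repeating the request with an increasing offset. For large result sets, a regular GBIF download is probably the better route.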
As already noted, there are multiple wrinkles with the broadened definition of the original task. These include the fact that the name of a person is not necessarily unique. A combination of an institutionCode and the username of a person from that institution would presumably be unique, but as we have observed, that username is not necessarily represented within the downloaded GBIF data.
I believe the institutionCode should refer to the institution that has custody/ownership of the information published on GBIF. So even if I belong to institution A, if the dataset my record is published in is owned by institution B, then the value for institutionCode should be B. See Darwin Core Quick Reference Guide - Darwin Core.
Telling people apart at a large scale, across many different datasets, is quite difficult. Finding your own records is much easier, because you of course have intimate knowledge of your own name, what identifiers/usernames you have used, where, when, etc.
For collections, it's often a combination of time and place that is used as well: we can tell two collectors apart because we know when they lived, and we might know collector 1 was in country A at time x, but collector 2 wasn't…
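The time-and-place idea can be roughed out by summarizing, for each name, which countries and which span of years its records cover. This is only a sketch: `profile_by_name` is a made-up name, and it assumes the rows are dictionaries with recordedBy, countryCode, and year keys.

```python
from collections import defaultdict


def profile_by_name(rows, name_field="recordedBy"):
    """For each name, collect the countries and the span of years in its records."""
    profiles = defaultdict(lambda: {"countries": set(), "years": set()})
    for row in rows:
        name = (row.get(name_field) or "").strip()
        if not name:
            continue  # skip records with no name at all
        country = (row.get("countryCode") or "").strip()
        year = (row.get("year") or "").strip()
        if country:
            profiles[name]["countries"].add(country)
        if year.isdigit():
            profiles[name]["years"].add(int(year))
    return {
        name: {
            "countries": sorted(p["countries"]),
            "first_year": min(p["years"]) if p["years"] else None,
            "last_year": max(p["years"]) if p["years"] else None,
        }
        for name, p in profiles.items()
    }
```

A name whose records span, say, a century or several distant countries is a hint that more than one person may be hiding behind it. It is a hint only, not proof.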
Yes, indeed it is, even if there is no perfect solution.
Thanks for all the very interesting information and insights, @pieter and @datafixer.
For the record, below is some of the Python code and output that formed part of my experimentation with this effort. It involves a Python dictionary of GBIF occurrences, with each occurrence represented by a Python dictionary contained within.
# January 24, 2024
# load the GBIF occurrence data for Quercus velutina
with open("0071073-231120084113126.csv", "r", encoding="utf-8") as f:
    lines = f.readlines()
header = lines[0].rstrip("\n").split("\t")  # split the header into a Python list
occurrences = {}  # initialize the occurrences Python dictionary
# populate the occurrences dictionary
for line in lines[1:]:
    # strip the trailing newline, then split the occurrence line into a Python list
    occurrence = line.rstrip("\n").split("\t")
    occurrences[occurrence[0]] = {}  # initialize a Python dictionary for this occurrence; gbifID as key to it
    # iterate through values in occurrence to populate individual occurrence dictionary using header items as keys
    for i, val in enumerate(occurrence):
        occurrences[occurrence[0]][header[i]] = val
keys = list(occurrences.keys())  # get list of all occurrences dictionary keys
# display selected occurrence data for the person ("iNaturalist", "Quercitron")
targetPersonID = ("iNaturalist", "Quercitron")  # target this person
print(f'{"gbifID":12} {"institutionCode":15} {"rightsHolder":12} {"scientificName":24} {"dateIdentified":20}')
for key in keys:
    occurrence = occurrences[key]
    currentPersonID = (occurrence["institutionCode"], occurrence["rightsHolder"])
    if currentPersonID == targetPersonID:  # show only occurrences from this person
        print(f'{occurrence["gbifID"]:12} {occurrence["institutionCode"]:15} {occurrence["rightsHolder"]:12} {occurrence["scientificName"]:24} {occurrence["dateIdentified"]:20}')
Thanks. I'll experiment more with the GBIF portal.
One of the reasons for bringing Python into this was to be able to download occurrence data of potential interest regarding a taxon, then experiment with it offline. For example, before going online again, we could select for specified users, geographic locations, or anything else, then generate KML, GPX, graphs, charts, and maps, or other file formats for use in various applications.
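As one illustration of that offline workflow, the sketch below turns a list of occurrence dictionaries into a KML string using only the standard library. The function name `occurrences_to_kml` is made up, and it assumes each occurrence carries decimalLatitude and decimalLongitude values; records without usable coordinates are skipped.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def occurrences_to_kml(rows, name_field="scientificName"):
    """Build a KML document with one Placemark per occurrence that has coordinates."""
    ET.register_namespace("", KML_NS)  # serialize KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for row in rows:
        try:
            lat = float((row.get("decimalLatitude") or "").strip())
            lon = float((row.get("decimalLongitude") or "").strip())
        except ValueError:
            continue  # no usable coordinates for this record
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        label = row.get(name_field) or row.get("gbifID") or ""
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = label
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML expects longitude first, then latitude
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


# Usage with the occurrences dictionary built earlier in this thread:
# selected = [o for o in occurrences.values() if o["rightsHolder"] == "Quercitron"]
# with open("quercitron.kml", "w", encoding="utf-8") as f:
#     f.write(occurrences_to_kml(selected))
```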