Retrieving iNaturalist Observations for a region / observers with PyGbif

Good evening everyone :smiley:

Following up on my first message, I’ve made a little progress in my exploration of pygbif and the GBIF API.
With the help of ChatGPT, I’ve written a script that works for my needs: retrieving iNaturalist observations for a given territory and a list of users (the aim being to integrate them into another local naturalist database).

I’m sharing the script with you, and I’d be grateful for any feedback or suggestions for improvements.
For example, as far as I can tell it isn’t possible to pass a list of observers as a search parameter: I therefore retrieve all the observations from the bounding box and then filter them afterwards (I work in SQL once the data has been integrated into the database).

Here’s the script:

import csv
from pygbif import occurrences as occ

# Bounding box (latitude/longitude)
min_latitude = 48.175391
max_latitude = 48.977037
min_longitude = -0.867335
max_longitude = 0.98335

# Search params
search_params = {
    'country': 'FR',         # France
    'decimalLatitude': f'{min_latitude},{max_latitude}',
    'decimalLongitude': f'{min_longitude},{max_longitude}',
    'datasetKey': '50c9509d-22c7-4a22-a47d-8c48425ef4a7', # iNaturalist dataset
    'limit': 300,            # Limit of 300 occurrences per page
}

# Function to retrieve all occurrences with pagination
def get_all_occurrences(params):
    all_occurrences = []
    offset = 0
    while True:
        params['offset'] = offset
        occurrences = occ.search(**params)
        results = occurrences['results']
        if not results:
            break
        all_occurrences.extend(results)
        offset += len(results)
        print(f"{offset} occurrences récupérées...")
    return all_occurrences

# Retrieve all occurrences
all_occurrences = get_all_occurrences(search_params)


# List all available fields
all_fields = set()
for occurrence in all_occurrences:
    all_fields.update(occurrence.keys())


# Save occurrences in a CSV file
output_file = 'occurrences_GBIF_iNaturalist.csv'

with open(output_file, mode='w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=list(all_fields))
    writer.writeheader()
    for occurrence in all_occurrences:
        writer.writerow({field: occurrence.get(field, '') for field in all_fields})

print(f"iNaturalist occurrences have been recorded in the {output_file} file.")

Thanks for the code. I’m executing it as this is being written. The screen output so far is:

14400 occurrences récupérées…
14700 occurrences récupérées…

Roughly how many occurrences should we expect before the data is written to the output file and the program terminates?

Aha! Here is the answer to my question:

15365 occurrences récupérées...

iNaturalist occurrences have been recorded in the occurrences_GBIF_iNaturalist.csv file.

Yes, that’s roughly the amount of data to be retrieved.

I’ve noticed that the first pages of data are retrieved quickly, but then it gradually slows down.
Perhaps this is a limit to prevent abuse?
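
Incidentally, the total can be checked up front: the GBIF search response includes a 'count' field with the number of matching records (and an 'endOfRecords' flag that the pagination loop could also use as a stop condition). A minimal sketch, reusing the search_params defined above:

# One lightweight request just to read the expected total for these filters
first_page = occ.search(**{**search_params, 'limit': 1, 'offset': 0})
print(f"Expected total: {first_page['count']} occurrences")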

(By the way, sorry, I’ve realised that I haven’t translated all the French comments/texts)

Regarding your list of users, is the 'identifiedBy' key the one of interest?

With the following modification for a single user, 1579 occurrences were retrieved:

# Search params
id_by = 'Sylvain Montagner'
search_params = {
    'identifiedBy': id_by,
    'country': 'FR',         # France
    'decimalLatitude': f'{min_latitude},{max_latitude}',
    'decimalLongitude': f'{min_longitude},{max_longitude}',
    'datasetKey': '50c9509d-22c7-4a22-a47d-8c48425ef4a7', # iNaturalist dataset
    'limit': 300,            # Limit of 300 occurrences per page
}

Given a list of users, you could reorganize your code to loop through that list, running a similar search for one user at a time and accumulating the results. On each iteration, the loop would update the id_by value and perform a new search.

Give the following a try for working with a list of observers:

import csv
from pygbif import occurrences as occ

# Bounding box (latitude/longitude)
min_latitude = 48.175391
max_latitude = 48.977037
min_longitude = -0.867335
max_longitude = 0.98335

# List of observers
observer_list = ['Sylvain Montagner', 'Clément Maouche', 'Quentin Benet-Cibois']

# Search params
search_params = {
    'country': 'FR',         # France
    'decimalLatitude': f'{min_latitude},{max_latitude}',
    'decimalLongitude': f'{min_longitude},{max_longitude}',
    'datasetKey': '50c9509d-22c7-4a22-a47d-8c48425ef4a7', # iNaturalist dataset
    'limit': 300,            # Limit of 300 occurrences per page
}

# Function to retrieve all occurrences with pagination
def get_all_occurrences(params, observers):
    all_occurrences = []
    for observer in observers:
        offset = 0
        params['identifiedBy'] = observer
        while True:
            params['offset'] = offset
            occurrences = occ.search(**params)
            results = occurrences['results']
            if not results:
                break
            all_occurrences.extend(results)
            offset += len(results)
            print(f"{offset} occurrences récupérées for observer {observer} ...")
    return all_occurrences

# Retrieve all occurrences
all_occurrences = get_all_occurrences(search_params, observer_list)


# List all available fields
all_fields = set()
for occurrence in all_occurrences:
    all_fields.update(occurrence.keys())


# Save occurrences in a CSV file
output_file = 'occurrences_GBIF_iNaturalist.csv'

with open(output_file, mode='w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=list(all_fields))
    writer.writeheader()
    for occurrence in all_occurrences:
        writer.writerow({field: occurrence.get(field, '') for field in all_fields})

print(f"iNaturalist occurrences have been recorded in the {output_file} file.")

Why not filter the SIMPLE_CSV download with something like AWK before integrating it into the database, rather than having to devise SQL queries?

Thank you @quercitron :smiley:

I’ve replaced identifiedBy with recordedBy for my purposes, and it works really well! :+1:
It avoids downloading tens of thousands of records that we don’t need here.

I hadn’t tested this because ChatGPT had told me:

The recordedBy parameter is not directly available in the GBIF API search parameters via pygbif. The GBIF API allows hits to be filtered by many criteria, but recordedBy is not a filter parameter supported directly by the API in standard hit searches.

I can therefore see that it had misled me.
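
Concretely, the only change is the parameter set inside the loop, i.e. a one-line swap in the script above (recordedBy filters on the observer, whereas identifiedBy filters on whoever made the identification):

# Inside the for-observer loop, instead of params['identifiedBy'] = observer:
params['recordedBy'] = observer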

So far, the script meets my needs! :smiley:
I may come back with more questions/improvements, but don’t hesitate to let me know if you see any potential improvements.

Simply because I don’t know AWK, and I’m fairly comfortable with SQL.
In any case, data management is done in SQL in our local databases.

In fact, I’m working on an improvement to the script that doesn’t use a CSV export, but instead loads the data directly into the database (tested OK with a local SQLite database, but eventually it will be on a PostgreSQL server).
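
When that switch happens, the database part should carry over almost as is. A rough sketch of what the connection side might look like with psycopg2 (placeholder credentials; also note that psycopg2 uses %s placeholders in parameterised queries rather than sqlite3’s ?):

import psycopg2

# Placeholder connection settings: adjust to the real server
conn = psycopg2.connect(
    host='localhost',
    dbname='gbif',
    user='gbif_user',
    password='change_me',
)
cursor = conn.cursor()

# With psycopg2, parameterised queries use %s instead of ?, e.g.:
# cursor.execute('INSERT INTO my_table (gbifID) VALUES (%s)', (value,))

conn.commit()
conn.close()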


@Sylvain_M, many thanks for your explanation. I would be very interested to learn how you imported the data (format?) into SQLite. If you don’t think this forum is the appropriate place for technical details, please feel free to email me directly: robert.mesibov@gmail.com

Don’t worry: I think this forum is suitable for this kind of discussion.
It’s just that I won’t have enough time to explain in detail (I’m doing this work voluntarily, and I’m making very slow progress).

Here’s the part of the code that handles the connection to the SQLite database (SpatiaLite).
But I ran into problems with the field names, with unsupported characters and reserved words, hence the rather convoluted code (suggested by ChatGPT, and probably not as good as what a “real developer” would have written).

import re
import sqlite3

# Example table name (defined elsewhere in the full script)
table_name = 'gbif_occurrences'

# Clean column names: replace non-word characters (and a leading digit)
# with '_' so that every field becomes a valid SQLite identifier
def clean_column_name(name):
    return re.sub(r'\W|^(?=\d)', '_', name)

# Convert values to SQLite-supported types (lists and dicts are stored as their string representation)
def convert_value(value):
    if isinstance(value, (int, float, str)):
        return value
    else:
        return str(value)

# Create and connect to the Spatialite database
conn = sqlite3.connect('gbif.db')
cursor = conn.cursor()

# Collect all fields from all occurrences
all_fields = set()
for occurrence in all_occurrences:
    all_fields.update(occurrence.keys())

# Sort fields alphabetically for consistent order
ordered_fields = sorted(all_fields)

# Cleaned column names
cleaned_fields = [clean_column_name(field) for field in ordered_fields]
fields_definition = ', '.join([f'"{field}" TEXT' for field in cleaned_fields])
create_table_query = f'CREATE TABLE IF NOT EXISTS "{table_name}" ({fields_definition});'
cursor.execute(create_table_query)

# Insert occurrences into the database (the column list and placeholders are
# identical for every row, so they are built once outside the loop)
columns = ', '.join([f'"{field}"' for field in cleaned_fields])
placeholders = ', '.join(['?' for _ in ordered_fields])
insert_query = f'INSERT INTO "{table_name}" ({columns}) VALUES ({placeholders})'
for occurrence in all_occurrences:
    cleaned_occurrence = {clean_column_name(k): convert_value(v) for k, v in occurrence.items()}
    values = [cleaned_occurrence.get(clean_column_name(field), '') for field in ordered_fields]
    cursor.execute(insert_query, tuple(values))

# Commit and close the database connection
conn.commit()
conn.close()

For the moment, all the fields are stored as text, which is not optimal: the script still needs a better way of recognising the type of each field.
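
One possible direction, as a rough sketch only (not tested against the full dataset): look at the Python types pygbif returns for each field and derive a SQLite column type from them, falling back to TEXT whenever the values are mixed or complex. This reuses all_occurrences, ordered_fields and clean_column_name from above.

# Sketch of per-field type inference; anything ambiguous stays TEXT
def infer_column_type(field, occurrences):
    python_types = {type(o[field]) for o in occurrences if o.get(field) is not None}
    if python_types == {int}:
        return 'INTEGER'
    if python_types and python_types <= {int, float}:
        return 'REAL'
    return 'TEXT'

column_types = {field: infer_column_type(field, all_occurrences) for field in ordered_fields}
fields_definition = ', '.join(
    f'"{clean_column_name(field)}" {column_types[field]}' for field in ordered_fields
)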


@Sylvain_M, many thanks again. Your regex substitution is a little surprising, but I don’t have a list of your raw column names (note that \W does not match the underscore). And I guess you already know that typing Darwin Core fields will generate large numbers of exceptions, even after GBIF processing, so to enforce typing you will need to do a significant amount of cleaning.

Here are all the fields retrieved (original name, then cleaned name):

original cleaned
acceptedScientificName acceptedScientificName
acceptedTaxonKey acceptedTaxonKey
basisOfRecord basisOfRecord
catalogNumber catalogNumber
class class
classKey classKey
collectionCode collectionCode
continent continent
coordinateUncertaintyInMeters coordinateUncertaintyInMeters
country country
countryCode countryCode
crawlId crawlId
datasetKey datasetKey
datasetName datasetName
dateIdentified dateIdentified
day day
decimalLatitude decimalLatitude
decimalLongitude decimalLongitude
endDayOfYear endDayOfYear
eventDate eventDate
eventTime eventTime
extensions extensions
facts facts
family family
familyKey familyKey
gadm gadm
gbifID gbifID
gbifRegion gbifRegion
genericName genericName
genus genus
genusKey genusKey
geodeticDatum geodeticDatum
hostingOrganizationKey hostingOrganizationKey
http://unknown.org/captive http___unknown_org_captive
http://unknown.org/nick http___unknown_org_nick
identificationID identificationID
identificationRemarks identificationRemarks
identifiedBy identifiedBy
identifiedByIDs identifiedByIDs
identifier identifier
identifiers identifiers
informationWithheld informationWithheld
infraspecificEpithet infraspecificEpithet
installationKey installationKey
institutionCode institutionCode
isInCluster isInCluster
isSequenced isSequenced
issues issues
iucnRedListCategory iucnRedListCategory
key key
kingdom kingdom
kingdomKey kingdomKey
lastCrawled lastCrawled
lastInterpreted lastInterpreted
lastParsed lastParsed
license license
lifeStage lifeStage
media media
modified modified
month month
occurrenceID occurrenceID
occurrenceRemarks occurrenceRemarks
occurrenceStatus occurrenceStatus
order order
orderKey orderKey
phylum phylum
phylumKey phylumKey
protocol protocol
publishedByGbifRegion publishedByGbifRegion
publishingCountry publishingCountry
publishingOrgKey publishingOrgKey
recordedBy recordedBy
recordedByIDs recordedByIDs
references references
relations relations
reproductiveCondition reproductiveCondition
rightsHolder rightsHolder
scientificName scientificName
sex sex
species species
speciesKey speciesKey
specificEpithet specificEpithet
startDayOfYear startDayOfYear
stateProvince stateProvince
taxonID taxonID
taxonKey taxonKey
taxonRank taxonRank
taxonomicStatus taxonomicStatus
verbatimEventDate verbatimEventDate
verbatimLocality verbatimLocality
year year