Retrieving iNaturalist Observations for a region / observers with PyGbif

Good evening everyone :smiley:

Following up on my first message, I’ve made a little progress in my exploration of pygbif and the GBIF API.
With the help of ChatGPT, I’ve written a script that works for my needs: retrieving iNaturalist observations for a given territory and a list of users (the aim being to integrate them into another local naturalist database).

I’m sharing the script with you, and I’d be grateful for any feedback or suggestions for improvements.
For example, as far as I can tell it isn’t possible to pass a list of observers as a search parameter: I therefore retrieve all the observations from the bounding box and then filter them afterwards (I work in SQL once the data has been integrated into the database).

Here’s the script:

import csv
from pygbif import occurrences as occ

# Bounding box (latitude/longitude)
min_latitude = 48.175391
max_latitude = 48.977037
min_longitude = -0.867335
max_longitude = 0.98335

# Search params
search_params = {
    'country': 'FR',         # France
    'decimalLatitude': f'{min_latitude},{max_latitude}',
    'decimalLongitude': f'{min_longitude},{max_longitude}',
    'datasetKey': '50c9509d-22c7-4a22-a47d-8c48425ef4a7', # iNaturalist dataset
    'limit': 300,            # Limit of 300 occurrences per page
}

# Function to retrieve all occurrences with pagination
def get_all_occurrences(params):
    all_occurrences = []
    offset = 0
    while True:
        params['offset'] = offset
        occurrences = occ.search(**params)
        results = occurrences['results']
        if not results:
            break
        all_occurrences.extend(results)
        offset += len(results)
        print(f"{offset} occurrences récupérées...")
    return all_occurrences

# Retrieve all occurrences
all_occurrences = get_all_occurrences(search_params)


# List all available fields
all_fields = set()
for occurrence in all_occurrences:
    all_fields.update(occurrence.keys())


# Save occurrences in a CSV file
output_file = 'occurrences_GBIF_iNaturalist.csv'

with open(output_file, mode='w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=list(all_fields))
    writer.writeheader()
    for occurrence in all_occurrences:
        writer.writerow({field: occurrence.get(field, '') for field in all_fields})

print(f"iNaturalist occurrences have been recorded in the {output_file} file.")

Thanks for the code. I’m executing it as this is being written. The screen output so far is:

14400 occurrences récupérées…
14700 occurrences récupérées…

Roughly how many occurrences should we expect before the data is written to the output file and the program terminates?

Aha! Here is the answer to my question:

15365 occurrences récupérées...

iNaturalist occurrences have been recorded in the occurrences_GBIF_iNaturalist.csv file.

Yes, that’s roughly the amount of data to be retrieved.

I’ve noticed that the first pages of data are retrieved quickly, but then it gradually slows down.
Perhaps this is a limit to prevent abuse?
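
Incidentally, the total can be checked up front: the GBIF search response includes a 'count' field with the number of matching records (and an 'endOfRecords' flag that the pagination loop could also use as a stop condition). A minimal sketch, reusing the search_params defined above:

# One lightweight request just to read the expected total for these filters
first_page = occ.search(**{**search_params, 'limit': 1, 'offset': 0})
print(f"Expected total: {first_page['count']} occurrences")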

(By the way, sorry, I’ve realised that I haven’t translated all the French comments/texts)

Regarding your list of users, is the 'identifiedBy' key the one of interest?

With the following modification for a single user, 1579 occurrences were retrieved:

# Search params
id_by = 'Sylvain Montagner'
search_params = {
    'identifiedBy': id_by,
    'country': 'FR',         # France
    'decimalLatitude': f'{min_latitude},{max_latitude}',
    'decimalLongitude': f'{min_longitude},{max_longitude}',
    'datasetKey': '50c9509d-22c7-4a22-a47d-8c48425ef4a7', # iNaturalist dataset
    'limit': 300,            # Limit of 300 occurrences per page
}

Given a list of users, you could reorganize your code to loop through that list, running a similar search for one user at a time and accumulating the results. On each iteration, the loop would update the id_by value and perform a new search.

Give the following a try for working with a list of observers:

import csv
from pygbif import occurrences as occ

# Bounding box (latitude/longitude)
min_latitude = 48.175391
max_latitude = 48.977037
min_longitude = -0.867335
max_longitude = 0.98335

# List of observers
observer_list = ['Sylvain Montagner', 'Clément Maouche', 'Quentin Benet-Cibois']

# Search params
search_params = {
    'country': 'FR',         # France
    'decimalLatitude': f'{min_latitude},{max_latitude}',
    'decimalLongitude': f'{min_longitude},{max_longitude}',
    'datasetKey': '50c9509d-22c7-4a22-a47d-8c48425ef4a7', # iNaturalist dataset
    'limit': 300,            # Limit of 300 occurrences per page
}

# Function to retrieve all occurrences with pagination
def get_all_occurrences(params, observers):
    all_occurrences = []
    for observer in observers:
        offset = 0
        params['identifiedBy'] = observer
        while True:
            params['offset'] = offset
            occurrences = occ.search(**params)
            results = occurrences['results']
            if not results:
                break
            all_occurrences.extend(results)
            offset += len(results)
            print(f"{offset} occurrences récupérées for observer {observer} ...")
    return all_occurrences

# Retrieve all occurrences
all_occurrences = get_all_occurrences(search_params, observer_list)


# List all available fields
all_fields = set()
for occurrence in all_occurrences:
    all_fields.update(occurrence.keys())


# Save occurrences in a CSV file
output_file = 'occurrences_GBIF_iNaturalist.csv'

with open(output_file, mode='w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=list(all_fields))
    writer.writeheader()
    for occurrence in all_occurrences:
        writer.writerow({field: occurrence.get(field, '') for field in all_fields})

print(f"iNaturalist occurrences have been recorded in the {output_file} file.")

Why not filter the SIMPLE_CSV download with something like AWK before integrating it into the database, rather than having to devise SQL queries?

Thank you @quercitron :smiley:

I’ve replaced identifiedBy with recordedBy for my purposes, and it works really well! :+1:
It avoids downloading tens of thousands of records that we don’t need here.

I hadn’t tested this because ChatGPT had told me:

The recordedBy parameter is not directly available in the GBIF API search parameters via pygbif. The GBIF API allows hits to be filtered by many criteria, but recordedBy is not a filter parameter supported directly by the API in standard hit searches.

I can therefore see that it had misled me.
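
Concretely, the only change is the parameter set inside the loop, i.e. a one-line swap in the script above (recordedBy filters on the observer, whereas identifiedBy filters on whoever made the identification):

# Inside the for-observer loop, instead of params['identifiedBy'] = observer:
params['recordedBy'] = observer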

So far, the script meets my needs! :smiley:
I may come back with more questions/improvements, but don’t hesitate to let me know if you see any potential improvements.

Simply because I don’t know AWK, and I’m fairly comfortable with SQL.
In any case, data management is done in SQL in our local databases.

In fact, I’m working on an improvement to the script that doesn’t use a CSV export, but instead loads the data directly into the database (tested OK with a local SQLite database, but eventually it will be on a PostgreSQL server).
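
When that switch happens, the database part should carry over almost as is. A rough sketch of what the connection side might look like with psycopg2 (placeholder credentials; also note that psycopg2 uses %s placeholders in parameterised queries rather than sqlite3’s ?):

import psycopg2

# Placeholder connection settings: adjust to the real server
conn = psycopg2.connect(
    host='localhost',
    dbname='gbif',
    user='gbif_user',
    password='change_me',
)
cursor = conn.cursor()

# With psycopg2, parameterised queries use %s instead of ?, e.g.:
# cursor.execute('INSERT INTO my_table (gbifID) VALUES (%s)', (value,))

conn.commit()
conn.close()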


@Sylvain_M, many thanks for your explanation. I would be very interested to learn how you imported the data (format?) into SQLite. If you don’t think this forum is the appropriate place for technical details, please feel free to email me directly: robert.mesibov@gmail.com

Don’t worry: I think this forum is suitable for this kind of discussion.
It’s just that I won’t have enough time to explain in detail (I’m doing this work voluntarily, and I’m making very slow progress).

Here’s the part of the code that handles the connection to the SQLite database (SpatiaLite).
But I ran into problems with the field names, with unsupported characters and reserved words, hence the rather convoluted code (suggested by ChatGPT, and probably not as good as what a “real developer” would have written).

import re
import sqlite3

# Example table name (defined elsewhere in the full script)
table_name = 'gbif_occurrences'

# Clean column names: replace non-word characters (and a leading digit)
# with '_' so that every field becomes a valid SQLite identifier
def clean_column_name(name):
    return re.sub(r'\W|^(?=\d)', '_', name)

# Convert values to SQLite-supported types (lists and dicts are stored as their string representation)
def convert_value(value):
    if isinstance(value, (int, float, str)):
        return value
    else:
        return str(value)

# Create and connect to the Spatialite database
conn = sqlite3.connect('gbif.db')
cursor = conn.cursor()

# Collect all fields from all occurrences
all_fields = set()
for occurrence in all_occurrences:
    all_fields.update(occurrence.keys())

# Sort fields alphabetically for consistent order
ordered_fields = sorted(all_fields)

# Cleaned column names
cleaned_fields = [clean_column_name(field) for field in ordered_fields]
fields_definition = ', '.join([f'"{field}" TEXT' for field in cleaned_fields])
create_table_query = f'CREATE TABLE IF NOT EXISTS "{table_name}" ({fields_definition});'
cursor.execute(create_table_query)

# Insert occurrences into the database (the column list and placeholders are
# identical for every row, so they are built once outside the loop)
columns = ', '.join([f'"{field}"' for field in cleaned_fields])
placeholders = ', '.join(['?' for _ in ordered_fields])
insert_query = f'INSERT INTO "{table_name}" ({columns}) VALUES ({placeholders})'
for occurrence in all_occurrences:
    cleaned_occurrence = {clean_column_name(k): convert_value(v) for k, v in occurrence.items()}
    values = [cleaned_occurrence.get(clean_column_name(field), '') for field in ordered_fields]
    cursor.execute(insert_query, tuple(values))

# Commit and close the database connection
conn.commit()
conn.close()

For the moment, all the fields are stored as text, which is not optimal: the script still needs a better way of recognising the type of each field.
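
One possible direction, as a rough sketch only (not tested against the full dataset): look at the Python types pygbif returns for each field and derive a SQLite column type from them, falling back to TEXT whenever the values are mixed or complex. This reuses all_occurrences, ordered_fields and clean_column_name from above.

# Sketch of per-field type inference; anything ambiguous stays TEXT
def infer_column_type(field, occurrences):
    python_types = {type(o[field]) for o in occurrences if o.get(field) is not None}
    if python_types == {int}:
        return 'INTEGER'
    if python_types and python_types <= {int, float}:
        return 'REAL'
    return 'TEXT'

column_types = {field: infer_column_type(field, all_occurrences) for field in ordered_fields}
fields_definition = ', '.join(
    f'"{clean_column_name(field)}" {column_types[field]}' for field in ordered_fields
)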


@Sylvain_M, many thanks again. Your regex substitution is a little surprising, but I don’t have a list of your raw column names (note that \W does not match the underscore). And I guess you already know that typing Darwin Core fields will generate large numbers of exceptions, even after GBIF processing, so to enforce typing you will need to do a significant amount of cleaning.

Here are all the fields retrieved (original name, then cleaned name):

original cleaned
acceptedScientificName acceptedScientificName
acceptedTaxonKey acceptedTaxonKey
basisOfRecord basisOfRecord
catalogNumber catalogNumber
class class
classKey classKey
collectionCode collectionCode
continent continent
coordinateUncertaintyInMeters coordinateUncertaintyInMeters
country country
countryCode countryCode
crawlId crawlId
datasetKey datasetKey
datasetName datasetName
dateIdentified dateIdentified
day day
decimalLatitude decimalLatitude
decimalLongitude decimalLongitude
endDayOfYear endDayOfYear
eventDate eventDate
eventTime eventTime
extensions extensions
facts facts
family family
familyKey familyKey
gadm gadm
gbifID gbifID
gbifRegion gbifRegion
genericName genericName
genus genus
genusKey genusKey
geodeticDatum geodeticDatum
hostingOrganizationKey hostingOrganizationKey
http://unknown.org/captive http___unknown_org_captive
http://unknown.org/nick http___unknown_org_nick
identificationID identificationID
identificationRemarks identificationRemarks
identifiedBy identifiedBy
identifiedByIDs identifiedByIDs
identifier identifier
identifiers identifiers
informationWithheld informationWithheld
infraspecificEpithet infraspecificEpithet
installationKey installationKey
institutionCode institutionCode
isInCluster isInCluster
isSequenced isSequenced
issues issues
iucnRedListCategory iucnRedListCategory
key key
kingdom kingdom
kingdomKey kingdomKey
lastCrawled lastCrawled
lastInterpreted lastInterpreted
lastParsed lastParsed
license license
lifeStage lifeStage
media media
modified modified
month month
occurrenceID occurrenceID
occurrenceRemarks occurrenceRemarks
occurrenceStatus occurrenceStatus
order order
orderKey orderKey
phylum phylum
phylumKey phylumKey
protocol protocol
publishedByGbifRegion publishedByGbifRegion
publishingCountry publishingCountry
publishingOrgKey publishingOrgKey
recordedBy recordedBy
recordedByIDs recordedByIDs
references references
relations relations
reproductiveCondition reproductiveCondition
rightsHolder rightsHolder
scientificName scientificName
sex sex
species species
speciesKey speciesKey
specificEpithet specificEpithet
startDayOfYear startDayOfYear
stateProvince stateProvince
taxonID taxonID
taxonKey taxonKey
taxonRank taxonRank
taxonomicStatus taxonomicStatus
verbatimEventDate verbatimEventDate
verbatimLocality verbatimLocality
year year