Since my first message, I’ve made a little progress in exploring PyGBIF and the GBIF API.
With the help of ChatGPT, I’ve written a working script for my needs: retrieving iNaturalist observations from a given territory, for a list of users (the aim being to integrate them into another local naturalist database).
I’m sharing the script with you, and I’d be grateful for any feedback or suggestions for improvements.
One limitation, for example: it doesn’t seem possible to pass a list of observers as a search parameter, so I retrieve all the observations from the bounding box and filter them afterwards (in SQL, after loading the data into the database).
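(For illustration only: the same post-download filtering could also be done in Python before the data reaches the database. This is just a sketch; it assumes the observer name appears in each record’s recordedBy field, and the names below are placeholders.)

# Hypothetical post-download filter (placeholder observer names): keep only
# the occurrences whose 'recordedBy' value is one of the observers of interest.
wanted_observers = {'Observer One', 'Observer Two'}

def filter_by_observer(occurrences, observers):
    return [o for o in occurrences if o.get('recordedBy') in observers]

# Once all_occurrences has been built by the script below:
# selected = filter_by_observer(all_occurrences, wanted_observers)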
Here’s the script:
import csv
from pygbif import occurrences as occ

# Bounding box (latitude/longitude)
min_latitude = 48.175391
max_latitude = 48.977037
min_longitude = -0.867335
max_longitude = 0.98335

# Search parameters
search_params = {
    'country': 'FR',  # France
    'decimalLatitude': f'{min_latitude},{max_latitude}',
    'decimalLongitude': f'{min_longitude},{max_longitude}',
    'datasetKey': '50c9509d-22c7-4a22-a47d-8c48425ef4a7',  # iNaturalist dataset
    'limit': 300,  # limit of 300 occurrences per page
}

# Function to retrieve all occurrences with pagination
def get_all_occurrences(params):
    all_occurrences = []
    offset = 0
    while True:
        params['offset'] = offset
        occurrences = occ.search(**params)
        results = occurrences['results']
        if not results:
            break
        all_occurrences.extend(results)
        offset += len(results)
        print(f"{offset} occurrences retrieved...")
    return all_occurrences

# Retrieve all occurrences
all_occurrences = get_all_occurrences(search_params)

# List all available fields
all_fields = set()
for occurrence in all_occurrences:
    all_fields.update(occurrence.keys())

# Save occurrences in a CSV file
output_file = 'occurrences_GBIF_iNaturalist.csv'
with open(output_file, mode='w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=list(all_fields))
    writer.writeheader()
    for occurrence in all_occurrences:
        writer.writerow({field: occurrence.get(field, '') for field in all_fields})

print(f"iNaturalist occurrences have been recorded in the {output_file} file.")
Thanks for the code. I’m running it as I write this. The screen output so far is:
…
14400 occurrences retrieved...
14700 occurrences retrieved...
…
Roughly how many occurrences should we expect before the data is written to the output file and the program terminates?
Aha! Here is the answer to my question:
15365 occurrences retrieved...
iNaturalist occurrences have been recorded in the occurrences_GBIF_iNaturalist.csv file.
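(Side note, not part of the original script: the total can also be checked before paginating, because the GBIF search response carries a count field alongside the results. A minimal sketch, reusing the search_params defined above:)

from pygbif import occurrences as occ

# Request a single record just to read the response metadata: 'count' holds the
# total number of matching occurrences, without downloading them all.
probe = occ.search(**{**search_params, 'limit': 1})
print(f"Expected total: {probe['count']} occurrences")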
Given a list of users, you could reorganize your code to loop over that list, performing a similar search for one user at a time and accumulating the results. On each iteration, the loop would update the identifiedBy parameter before running the search.
Give the following a try for working with a list of observers:
import csv
from pygbif import occurrences as occ

# Bounding box (latitude/longitude)
min_latitude = 48.175391
max_latitude = 48.977037
min_longitude = -0.867335
max_longitude = 0.98335

# List of observers
observer_list = ['Sylvain Montagner', 'Clément Maouche', 'Quentin Benet-Cibois']

# Search parameters
search_params = {
    'country': 'FR',  # France
    'decimalLatitude': f'{min_latitude},{max_latitude}',
    'decimalLongitude': f'{min_longitude},{max_longitude}',
    'datasetKey': '50c9509d-22c7-4a22-a47d-8c48425ef4a7',  # iNaturalist dataset
    'limit': 300,  # limit of 300 occurrences per page
}

# Function to retrieve all occurrences with pagination, one observer at a time
def get_all_occurrences(params, observers):
    all_occurrences = []
    for observer in observers:
        offset = 0
        params['identifiedBy'] = observer
        while True:
            params['offset'] = offset
            occurrences = occ.search(**params)
            results = occurrences['results']
            if not results:
                break
            all_occurrences.extend(results)
            offset += len(results)
            print(f"{offset} occurrences retrieved for observer {observer}...")
    return all_occurrences

# Retrieve all occurrences
all_occurrences = get_all_occurrences(search_params, observer_list)

# List all available fields
all_fields = set()
for occurrence in all_occurrences:
    all_fields.update(occurrence.keys())

# Save occurrences in a CSV file
output_file = 'occurrences_GBIF_iNaturalist.csv'
with open(output_file, mode='w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=list(all_fields))
    writer.writeheader()
    for occurrence in all_occurrences:
        writer.writerow({field: occurrence.get(field, '') for field in all_fields})

print(f"iNaturalist occurrences have been recorded in the {output_file} file.")
I’ve replaced identifiedBy with recordedBy for my purposes, and it works really well!
It avoids downloading tens of thousands of records that we don’t need here.
I hadn’t tested this because ChatGPT had told me:
The recordedBy parameter is not directly available in the GBIF API search parameters via pygbif. The GBIF API allows hits to be filtered by many criteria, but recordedBy is not a filter parameter supported directly by the API in standard hit searches.
I can therefore see that it had misled me.
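(A quick way to confirm the server-side filter really is being applied, sketched here with a placeholder observer name: compare the matching count with and without recordedBy, using the same iNaturalist dataset key as above.)

from pygbif import occurrences as occ

dataset_key = '50c9509d-22c7-4a22-a47d-8c48425ef4a7'  # iNaturalist dataset, as above

# 'Some Observer' is a placeholder; a real observer name should return a much
# smaller count than the unfiltered search, showing the filter works server-side.
total = occ.search(datasetKey=dataset_key, limit=1)['count']
filtered = occ.search(datasetKey=dataset_key, recordedBy='Some Observer', limit=1)['count']
print(f"All records: {total}, for this observer: {filtered}")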
So far, the script meets my needs!
I may come back with more questions/improvements, but don’t hesitate to let me know if you see any potential improvements.
Simply because I don’t know AWK, and I’m fairly comfortable with SQL.
In any case, data management is done in SQL in our local databases.
In fact, I’m working on an improvement to the script that skips the CSV export and loads the data directly into the database (tested OK with a local SQLite database, though eventually it will run against a PostgreSQL server).
@Sylvain_M, many thanks for your explanation. I would be very interested to learn how you imported the data (format?) into SQLite. If you don’t think this forum is the appropriate place for technical details, please feel free to email me directly: robert.mesibov@gmail.com
Don’t worry: I think this forum is suitable for this kind of discussion.
It’s just that I won’t have enough time to explain in detail (I’m doing this work voluntarily, and I’m making very slow progress).
Here’s the part of the code that handles the connection to the SQLite database (SpatiaLite).
But I ran into problems naming the fields (unsupported characters, reserved names), hence the rather convoluted code (suggested by ChatGPT, and probably not what a “real developer” would have written).
import re
import sqlite3

# Clean column names (GBIF keys can contain characters that are not valid in
# SQLite identifiers, or start with a digit)
def clean_column_name(name):
    return re.sub(r'\W|^(?=\d)', '_', name)

# Convert values to supported types
def convert_value(value):
    if isinstance(value, (int, float, str)):
        return value
    else:
        return str(value)

# Create and connect to the SpatiaLite database
conn = sqlite3.connect('gbif.db')
cursor = conn.cursor()

# Collect all fields from all occurrences
# (all_occurrences comes from the retrieval step shown earlier)
all_fields = set()
for occurrence in all_occurrences:
    all_fields.update(occurrence.keys())

# Sort fields alphabetically for a consistent column order
ordered_fields = sorted(all_fields)

# Cleaned column names
cleaned_fields = [clean_column_name(field) for field in ordered_fields]
fields_definition = ', '.join([f'"{field}" TEXT' for field in cleaned_fields])

# table_name is defined earlier in the full script
create_table_query = f'CREATE TABLE IF NOT EXISTS "{table_name}" ({fields_definition});'
cursor.execute(create_table_query)

# Build the INSERT statement once (the columns are the same for every row)
columns = ', '.join([f'"{field}"' for field in cleaned_fields])
placeholders = ', '.join(['?' for _ in ordered_fields])
insert_query = f'INSERT INTO "{table_name}" ({columns}) VALUES ({placeholders})'

# Insert occurrences into the database
for occurrence in all_occurrences:
    cleaned_occurrence = {clean_column_name(k): convert_value(v) for k, v in occurrence.items()}
    values = [cleaned_occurrence.get(field, '') for field in cleaned_fields]
    cursor.execute(insert_query, tuple(values))

# Commit and close the database connection
conn.commit()
conn.close()
For the moment, all the fields have been converted to text, which is not optimal: we need to improve the way we recognise the type of each field.
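(One possible direction, sketched and untested against the full range of GBIF fields: infer a SQLite column type from the Python values actually seen in each field, falling back to TEXT. The names ordered_fields, all_occurrences and clean_column_name are reused from the snippet above.)

def sqlite_type(values):
    """Guess a SQLite column type from the non-null Python values of one field."""
    types = {type(v) for v in values if v is not None}
    if not types:
        return 'TEXT'
    if types <= {bool, int}:
        return 'INTEGER'
    if types <= {bool, int, float}:
        return 'REAL'
    return 'TEXT'

# Build the CREATE TABLE definition from the observed values instead of forcing TEXT.
column_types = {
    field: sqlite_type([record.get(field) for record in all_occurrences])
    for field in ordered_fields
}
fields_definition = ', '.join(
    f'"{clean_column_name(field)}" {column_types[field]}' for field in ordered_fields
)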
@Sylvain_M, many thanks again. Your regex substitution is a little surprising, but I don’t have a list of your raw column names (note that \W matches any non-word character, but not the underscore itself). And I guess you already know that typing Darwin Core fields will generate large numbers of exceptions, even after GBIF processing, so to enforce typing you will need to do a significant amount of cleaning.
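(For reference, here is what that substitution does on a few made-up column names: non-word characters become underscores, a leading digit gets an underscore prefixed, and existing underscores pass through untouched.)

import re

def clean_column_name(name):
    return re.sub(r'\W|^(?=\d)', '_', name)

# Made-up examples of raw column names and their cleaned versions:
for name in ['decimalLatitude', 'Coordinate Uncertainty (m)', '2ndField', 'taxon_key']:
    print(f"{name!r} -> {clean_column_name(name)!r}")
# 'decimalLatitude' -> 'decimalLatitude'
# 'Coordinate Uncertainty (m)' -> 'Coordinate_Uncertainty__m_'
# '2ndField' -> '_2ndField'
# 'taxon_key' -> 'taxon_key'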