Hello,
After the September webinar on using cloud services and the GBIF snapshot, I was wondering whether it would be possible to get some support on how to optimise my BigQuery queries so that the amount of data processed is reduced.
When using the API, I use the GADM code to query only my region of interest. In the table accessible through BigQuery I can only identify the country column, but I would rather retrieve just the data for the region I am interested in.
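To make it concrete, below is a minimal sketch of the kind of filter I have in mind, written as the query string I would pass from Python. The table path is the public snapshot as I remember it from the console, and the level1gid column is purely a guess on my part (I could not find any GADM field in the schema), so please treat these names as placeholders.

# Sketch only: "level1gid" is a hypothetical GADM column I have NOT confirmed
# in the snapshot schema, and the GADM code is just an example value.
gadm_query = """
SELECT gbifid, species, decimallatitude, decimallongitude
FROM `bigquery-public-data.gbif.occurrences`  -- double-check the exact table path
WHERE level1gid = 'ESP.1_1'
"""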
I have tried using a polygon that represents my region, but the query becomes huge and takes forever.
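For reference, the polygon attempt looked roughly like the sketch below, except that my real WKT string has many more vertices, which is what makes the query text so large. I am assuming decimallatitude / decimallongitude are the interpreted coordinate columns.

# Rough shape of the polygon attempt; the real WKT polygon is far longer.
polygon_query = """
SELECT gbifid, species, decimallatitude, decimallongitude
FROM `bigquery-public-data.gbif.occurrences`  -- double-check the exact table path
WHERE ST_CONTAINS(
    ST_GEOGFROMTEXT('POLYGON((-3.9 40.2, -3.5 40.2, -3.5 40.6, -3.9 40.6, -3.9 40.2))'),
    ST_GEOGPOINT(decimallongitude, decimallatitude)
)
"""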
I would also like to know if there is a way to optimise the amount of data that a query processes. I have the following Python function, which gives me an estimate of how much a query would cost once I run out of free credits.
from google.cloud import bigquery

def estimate_query_cost(query):
    # Initialize the BigQuery client
    client = bigquery.Client(project="my-project")

    # dry_run=True validates the query and reports how many bytes it would
    # scan, without actually processing the data (and without being billed)
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    query_job = client.query(query, job_config=job_config)

    # Estimate the cost from the on-demand rate of USD 6.25 per TiB, see
    # https://cloud.google.com/bigquery/pricing#:~:text=using%20Reservations.-,On%2Ddemand%20compute%20pricing,-By%20default%2C%20queries
    bytes_processed = query_job.total_bytes_processed
    tebibytes_processed = bytes_processed / (1024**4)
    estimated_cost = tebibytes_processed * 6.25

    return {
        "bytes_processed": bytes_processed,
        "tebibytes_processed": tebibytes_processed,
        "estimated_cost_usd": estimated_cost,
    }
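As a usage example, this is roughly how I call it (the query is a simplified stand-in for my real ones, using the country column I mentioned above; adjust the column and table names if they do not match the schema):

# Dry-run a simple country-level query against the snapshot to see the
# estimated scan size and on-demand cost before actually running it.
example_query = """
SELECT gbifid, species, decimallatitude, decimallongitude
FROM `bigquery-public-data.gbif.occurrences`
WHERE countrycode = 'ES'
"""
print(estimate_query_cost(example_query))
# -> {'bytes_processed': ..., 'tebibytes_processed': ..., 'estimated_cost_usd': ...}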
I get the feeling that the whole snapshot is always going to be processed, but I wanted to double-check before giving up.
So, in summary, two questions:
- Is it possible to query the Google Cloud snapshot by GADM ID (gadm_id) using BigQuery?
- Is there any optimisation strategy to reduce the cost of querying the snapshot via BigQuery?
Thank you!