Hello,
After the September webinar on using cloud services and the GBIF snapshot, I was wondering whether it would be possible to get some support on how to optimise my BigQuery queries so that the amount of data processed is reduced.
When using the API, I use the GADM code to query only my region of interest. In the table accessible through BigQuery I can only identify the country column, but I would rather retrieve just the data for the region I am interested in.
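To make it concrete, below is a minimal sketch of the kind of filter I have in mind, written as the query string I would pass from Python. The table path is the public snapshot as I remember it from the console, and the level1gid column is purely a guess on my part (I could not find any GADM field in the schema), so please treat these names as placeholders.

# Sketch only: "level1gid" is a hypothetical GADM column I have NOT confirmed
# in the snapshot schema, and the GADM code is just an example value.
gadm_query = """
SELECT gbifid, species, decimallatitude, decimallongitude
FROM `bigquery-public-data.gbif.occurrences`  -- double-check the exact table path
WHERE level1gid = 'ESP.1_1'
"""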
I have tried using a polygon that represents my region, but the query becomes huge and takes forever.
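For reference, the polygon attempt looked roughly like the sketch below, except that my real WKT string has many more vertices, which is what makes the query text so large. I am assuming decimallatitude / decimallongitude are the interpreted coordinate columns.

# Rough shape of the polygon attempt; the real WKT polygon is far longer.
polygon_query = """
SELECT gbifid, species, decimallatitude, decimallongitude
FROM `bigquery-public-data.gbif.occurrences`  -- double-check the exact table path
WHERE ST_CONTAINS(
    ST_GEOGFROMTEXT('POLYGON((-3.9 40.2, -3.5 40.2, -3.5 40.6, -3.9 40.6, -3.9 40.2))'),
    ST_GEOGPOINT(decimallongitude, decimallatitude)
)
"""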
I would also like to know if there is a way to optimise the amount of data that a query processes. I have the following Python function, which gives me an estimate of how much a query would cost once I run out of free credits.
from google.cloud import bigquery

def estimate_query_cost(query):
    # Initialize the BigQuery client
    client = bigquery.Client(project="my-project")

    # dry_run=True validates the query and reports how many bytes it would
    # scan, without actually processing the data (and without being billed)
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    query_job = client.query(query, job_config=job_config)

    # Estimate the cost from the on-demand rate of USD 6.25 per TiB, see
    # https://cloud.google.com/bigquery/pricing#:~:text=using%20Reservations.-,On%2Ddemand%20compute%20pricing,-By%20default%2C%20queries
    bytes_processed = query_job.total_bytes_processed
    tebibytes_processed = bytes_processed / (1024**4)
    estimated_cost = tebibytes_processed * 6.25

    return {
        "bytes_processed": bytes_processed,
        "tebibytes_processed": tebibytes_processed,
        "estimated_cost_usd": estimated_cost,
    }
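As a usage example, this is roughly how I call it (the query is a simplified stand-in for my real ones, using the country column I mentioned above; adjust the column and table names if they do not match the schema):

# Dry-run a simple country-level query against the snapshot to see the
# estimated scan size and on-demand cost before actually running it.
example_query = """
SELECT gbifid, species, decimallatitude, decimallongitude
FROM `bigquery-public-data.gbif.occurrences`
WHERE countrycode = 'ES'
"""
print(estimate_query_cost(example_query))
# -> {'bytes_processed': ..., 'tebibytes_processed': ..., 'estimated_cost_usd': ...}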
I get the feeling that the whole snapshot is always going to be processed, but I wanted to double-check before giving up.
So, in summary, two questions:
- Is it possible to query the Google Cloud snapshot by GADM ID (gadm_id) using BigQuery?
- Is there any optimisation strategy to reduce the cost of querying the snapshot via BigQuery?
Thank you!