The recording of the session is available here: API SQL downloads on Vimeo
Links mentioned in the presentation:
- Blog Post: GBIF SQL Downloads - GBIF Data Blog
- Documentation: API SQL Downloads :: Technical Documentation
- Predicate Downloads:
- Facets:
Questions and answers:
Aside from the DOI that I could get using the API SQL download function, what are the differences between the API SQL download and what I can do with the GBIF snapshots on Google BigQuery, AWS and Azure (GBIF exports as public datasets in cloud environments)?
The snapshots only contain a subset of the verbatim columns. The SQL download allows you to access all of these columns: API SQL Downloads :: Technical Documentation (you need to expand the list to display them). Accessing data via the SQL download API doesn’t cost any money, while the cloud services have charges.
However, working with the snapshots will be faster than generating a download with the SQL download API.
In addition to that, you can always upload your own tables (like environmental layers) to the cloud-based systems and combine them with the GBIF snapshots. This isn’t possible with the GBIF SQL download API.
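For reference, here is a minimal sketch of submitting a SQL download through the API with Python’s requests library. The endpoint and body format follow the SQL download documentation linked above; the credentials and the query itself are placeholders.

```python
import requests

# Placeholder credentials -- replace with your own GBIF account details.
GBIF_USER = "your_gbif_username"
GBIF_PWD = "your_password"

# Request body as described in the SQL download documentation.
request_body = {
    "sendNotification": True,
    "format": "SQL",
    "sql": "SELECT countryCode, COUNT(*) FROM occurrence GROUP BY countryCode",
}

# Submitting returns a download key; GBIF notifies you when the file is ready.
response = requests.post(
    "https://api.gbif.org/v1/occurrence/download/request",
    json=request_body,
    auth=(GBIF_USER, GBIF_PWD),
)
response.raise_for_status()
print("Download key:", response.text)
```

Once the download is ready, the key can be used to fetch the result from GBIF.org or via the occurrence download API.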
During the presentation you mentioned “gridded joins”, could you explain a bit more?
Currently, the SQL download API only allows querying one table, the occurrence table (no extension or event tables). So no joins are possible, as these operations could be very costly.
What was mentioned in the presentation were the aggregate functions, which let you compute counts or sums over grouped data. Four different grids are also supported, allowing you to aggregate data per grid cell. This lets you condense a lot of data into a lighter, more digestible download. You can read more about the grid functions here; a sketch of such a query follows below.
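As an illustration, here is a hedged sketch of what an aggregating query with a grid function could look like, using the EEA reference grid function described in the documentation. The grid size, country filter and grouping columns are assumptions for the example; check the linked docs for the exact function names and signatures.

```python
# SQL string to submit as the "sql" field of the download request shown above.
# Counts occurrences per species per 1 km EEA grid cell in Denmark.
sql = """
SELECT
  GBIF_EEARGCode(1000, decimalLatitude, decimalLongitude,
                 COALESCE(coordinateUncertaintyInMeters, 1000)) AS eeaCellCode,
  speciesKey,
  COUNT(*) AS occurrences
FROM occurrence
WHERE countryCode = 'DK'
GROUP BY eeaCellCode, speciesKey
"""
```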
Can someone point me to the facet API please?
The facet functions are part of the occurrence search API. See the documentation here: Occurrence API :: Technical Documentation and an example here: https://api.gbif.org/v1/occurrence/search?facet=country. Note that if you are using rgbif, you can use the following syntax: rgbif::occ_count(facet="country")
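For example, here is a small Python snippet querying the facet counts directly; setting limit=0 skips the occurrence records themselves and returns only the facets.

```python
import requests

# Ask the occurrence search API for counts per country, without records.
resp = requests.get(
    "https://api.gbif.org/v1/occurrence/search",
    params={"facet": "country", "limit": 0},
)
resp.raise_for_status()

# Each facet has a "field" and a list of name/count pairs.
for facet in resp.json()["facets"]:
    for count in facet["counts"]:
        print(count["name"], count["count"])
```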
We would like to translate our GBIF hosted portal into Spanish and need volunteer translators; could anyone help?
Is it possible to put hosted portals in Crowdin, as is the case for GRSciColl?
Technically, the GBIF hosted portals can be translated in Crowdin, but the setup isn’t straightforward. If you only have a few pages to translate and no resources to figure out the setup, it might be easier to handle the translations directly in GitHub.
We manage several IPTs, and one of them is giving us issues. We make that IPT available to anyone who asks, and people are free to upload and manage their data however they see fit. This IPT seems to crash once or twice per week and we have to restart it. Are others experiencing the same issue? Do you have any advice for us?
If this happens again, please send the log files to helpdesk@gbif.org so we can investigate. We host a number of IPTs at GBIF and they don’t usually crash. You can also consider emailing the IPT mailing list, as others might have encountered (and fixed) the same issue. You can find it here: IPT Info Page
From another participant in the call:
I am pretty sure it isn’t the same issue, but let me share my experience with our IPT. Our IPT was crashing once in a while because it was running out of disk space. It might be helpful to check whether this is the case with your IPT.
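If you want to rule out disk space, here is a minimal sketch of a check you could run on the server. The /srv/ipt path is a hypothetical data directory; use your IPT’s actual location.

```python
import shutil

# Check free space on the volume holding the IPT data directory.
usage = shutil.disk_usage("/srv/ipt")
free_gb = usage.free / 1024**3
print(f"Free space: {free_gb:.1f} GB")

# Warn when less than 5% of the disk remains.
if usage.free / usage.total < 0.05:
    print("Warning: less than 5% of disk space remaining")
```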
Will the material entity / sample core (Darwin Core Quick Reference Guide - Darwin Core) be available for publishing on GBIF?
The material entity core is currently available in the test environment, but we don’t have an estimated date for making it available in production. For more information, please follow this GitHub thread: Interpret material core as MATERIAL SAMPLE occurrence · Issue #885 · gbif/pipelines · GitHub
One of our publishers lost their IPT installation and didn’t have backups available. In order to recover their data, I generated a download (in the Darwin Core Archive format) from GBIF.org to make an archive as close as possible to the original data. When looking at the verbatim.txt file, there were a lot of empty fields. I wrote a Python script to delete the empty columns, change the meta.xml file and rearrange the files. Are there other ways to get clean Darwin Core Archives?
In general, you can write to helpdesk@gbif.org and see if we have a copy of the original archive (in this particular case, we didn’t have any).
Note that GBIF had an initiative which aimed to identify and archive datasets that were no longer online. You can read more about it here: FAQ. There was also a Java program that could be used as a last resort to do what you described.
From another participant in the call:
I would like to see the script you wrote, as I am often sent files with a lot of empty columns (publishers sometimes download data or templates from GBIF, fill in only some columns and send them to the Node).
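For anyone who wants to try this in the meantime, here is a minimal sketch of the empty-column cleanup described above. The file names are illustrative, and meta.xml still needs to be updated (by hand or with similar code) so its field indices match the new column positions.

```python
import csv

# Read the tab-delimited verbatim file from the Darwin Core Archive.
with open("verbatim.txt", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter="\t"))

header, data = rows[0], rows[1:]

# Keep a column only if at least one row has a non-empty value in it.
keep = [
    i for i in range(len(header))
    if any(i < len(row) and row[i].strip() for row in data)
]

# Write a copy of the file containing only the kept columns.
with open("verbatim_clean.txt", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    for row in rows:
        writer.writerow([row[i] if i < len(row) else "" for i in keep])
```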
What happens if two tables are mapped to the same Darwin Core core? We have a case where we aren’t able to publish the dataset.
You should be able to have several tables mapped to the same core. If all these tables have the mandatory fields and contain occurrences with different occurrenceIDs, they will be concatenated into one big occurrence file in the Darwin Core Archive.
I wanted to publish a “DNA-derived” dataset and try out the new eDNA tool ( https://discourse.gbif.org/t/dna-data-publishing-gbif-technical-support-hour-for-nodes/ ) developed by GBIF. My dataset also had images associated with the samples. The dataset should really be a sampling event dataset, but if I format it that way, I can’t use the DNA extension (because of the limitation of the star schema). I put my data through the eDNA tool and got the occurrence file out. I then transformed this occurrence file before adding it to the IPT together with the other extensions (including the multimedia extension).
Have you tried the eDNA tool and had a similar experience?
Response from a node: We used the eDNA tool and then still made a sampling event dataset in the IPT. It doesn’t get indexed, but at least people can download the data at the source.
Would it be possible to modify the welcome messages that publishers receive after endorsement? Some of our publishers are a bit confused by the message.
The Secretariat will explore options and get back to the Nodes later.