Parquet data schema

dmi3k · August 9, 2023, 1:41pm

Hi,
I am not sure where the best place to ask this question is, so forgive me if it belongs elsewhere (e.g. the gbif repo).
Who decides on the AWS schema?
It seems like stateprovince is not systematically recorded and is generally worse than level1Gid, which is missing from the schema. I understand it might be an inferred field in GBIF, but querying stateprovince is a major pain compared to well-structured GADM subdivisions.

I think the most valuable fields are the taxonomy keys and GADM keys.

MattBlissett · August 9, 2023, 3:05pm

Hi,

The schema is the same as the “Simple” download format available on GBIF.org — this is our (GBIF’s) responsibility, not Amazon’s, though we did discuss it with them.

level1Gid is inferred from the coordinates, if present. stateprovince is not altered at all, so it varies widely depending on the data provided to us.
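
To illustrate the difference in practice, here is a minimal sketch in plain Python. The records and their values are invented for illustration, and the `level1gid` key is a hypothetical column name (the thread notes the field is not in the current schema). It only shows why grouping on an inferred code tends to be cleaner than grouping on unaltered free text.

```python
from collections import Counter

# Hypothetical records: every value below is made up for illustration.
# "level1gid" is an assumed column name, not part of the published schema.
records = [
    {"stateprovince": "California", "level1gid": "USA.5_1"},
    {"stateprovince": "CA", "level1gid": "USA.5_1"},
    {"stateprovince": "california ", "level1gid": "USA.5_1"},
    {"stateprovince": None, "level1gid": "USA.5_1"},
    {"stateprovince": "Calif.", "level1gid": None},  # no coordinates, so nothing to infer from
]


def count_by(rows, field):
    """Count rows per distinct value of `field`, keeping None as its own bucket."""
    return Counter(row[field] for row in rows)


by_text = count_by(records, "stateprovince")
by_gid = count_by(records, "level1gid")

print(len(by_text))  # number of distinct free-text spellings (None included)
print(len(by_gid))   # number of distinct inferred codes (None included)
```

The trade-off runs both ways: the free-text field fragments into many spellings, while the inferred code is empty whenever a record has no coordinates.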

sformel · August 14, 2023, 1:13pm

@MattBlissett related to this, I’ve always been curious, why are these fields changed to all lower case instead of the camelCase of Darwin Core? It seems like such a trivial change that could be avoided, but I’m guessing it serves a purpose.

MattBlissett · August 14, 2023, 2:11pm

I think it was to be more SQL-like, and was possibly also due to a limitation in Apache HBase which we used to use to hold the occurrence data.
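
For anyone who wants the Darwin Core spelling back, a small lookup is enough. This is only a sketch: it assumes the column names are a plain lower-casing of the Darwin Core terms, the term list is a short illustrative sample, and the helper names are hypothetical.

```python
# A small illustrative sample of Darwin Core terms; extend it as needed.
DWC_TERMS = ["stateProvince", "decimalLatitude", "decimalLongitude", "scientificName"]


def to_column_name(term):
    """Lower-case a Darwin Core term, assuming that is the only change applied."""
    return term.lower()


# Reverse lookup from lower-case column name to the camelCase term.
COLUMN_TO_TERM = {to_column_name(term): term for term in DWC_TERMS}


def to_dwc_term(column):
    """Return the camelCase term for a column, or the column unchanged if unknown."""
    return COLUMN_TO_TERM.get(column, column)


print(to_column_name("stateProvince"))
print(to_dwc_term("decimallatitude"))
```

Lower-casing discards the word boundaries, so the reverse direction needs an explicit list of terms like the one above.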

sformel · August 14, 2023, 2:52pm

Gotcha. Thanks for the quick answer.
