Parquet data schema

I am not sure where the best place to ask this question is, so forgive me if this belongs elsewhere (e.g. the gbif repo).
Who decides on the AWS schema?
It seems like stateprovince is not systematically recorded and is generally worse than level1Gid, which is missing from the schema. I understand it might be an inferred field in GBIF, but querying stateprovince is a major pain compared to well-structured GADM subdivisions.

I think the most valuable fields are the taxonomy keys and GADM keys.


The schema is the same as the “Simple” download format available on — this is our (GBIF’s) responsibility, not Amazon’s, though we did discuss it with them.

level1Gid is inferred from the coordinates, if present. stateprovince is not altered at all, so it varies widely depending on the data provided to us.
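To illustrate the difference in practice, here is a minimal Python sketch. The records are made up for illustration (they are not real GBIF data), and the GADM identifiers are only example values. It assumes both fields are available on each record.

```python
# Hypothetical records: stateprovince is free text as supplied by publishers,
# level1Gid is the GADM level-1 identifier inferred from the coordinates.
records = [
    {"stateprovince": "California", "level1Gid": "USA.5_1"},
    {"stateprovince": "CA", "level1Gid": "USA.5_1"},
    {"stateprovince": "Calif.", "level1Gid": "USA.5_1"},
    {"stateprovince": None, "level1Gid": "USA.5_1"},
    {"stateprovince": "Nevada", "level1Gid": "USA.29_1"},
]

# Filtering on the free-text field only catches one spelling.
by_name = [r for r in records if r["stateprovince"] == "California"]

# Filtering on the GADM identifier catches every record in that subdivision.
by_gid = [r for r in records if r["level1Gid"] == "USA.5_1"]

# Distinct free-text spellings that all fall inside the same GADM unit.
spellings = {r["stateprovince"] for r in by_gid}

print(len(by_name), len(by_gid), len(spellings))
```

With the sample above, the exact-match filter on stateprovince finds one record, while the level1Gid filter finds all four, including the one with no stateprovince at all.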

@MattBlissett related to this, I’ve always been curious: why are these fields changed to all lower case instead of the camelCase of Darwin Core? It seems like such a trivial change that could be avoided, but I’m guessing it serves a purpose.

I think it was to be more SQL-like, and was possibly also due to a limitation in Apache HBase which we used to use to hold the occurrence data.
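For anyone working from Darwin Core term names, here is a rough sketch of a lookup that tolerates the case difference. It assumes the column names are simply the lower-cased term names, which is what this thread suggests but is not checked here against the actual schema. `to_column` and `get_term` are hypothetical helper names.

```python
def to_column(term: str) -> str:
    """Map a camelCase Darwin Core term to its assumed lower-case column name."""
    return term.lower()


def get_term(row: dict, term: str):
    """Look up a value by Darwin Core term name in a row keyed by lower-case columns."""
    return row.get(to_column(term))


# Hypothetical row keyed by lower-case column names.
row = {"stateprovince": "California", "decimallatitude": 36.5}

state = get_term(row, "stateProvince")
latitude = get_term(row, "decimalLatitude")
print(state, latitude)
```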

Gotcha. Thanks for the quick answer.