First, thanks for providing this open discussion forum in addition to maintaining GBIF's expansive biodiversity data universe.
Second, apologies in advance for the long and rather detailed post below.
The executive summary is that I am trying to figure out why I can find type specimen CASTYPE1652 in a filtered query Download, but not in the open-access GBIF data product Download.
The text below describes how I got to the datasets and ends with specific questions.
As I am tracking (versioned) digital traces associated with type specimen CASTYPE1652 (see https://beehind.org), I downloaded the open access data product (all :
GBIF.org (01 March 2023) GBIF Occurrence Download Download
to produce ~260G of digital content with id hash://sha256/c8bac8acb28c8524c53589b3a40e322dbbbdadf5689fef2e20266fbf6ddf6b97 .
Then, I used a streaming query to count all lines in the “simple” table that was included in the file. In addition, I attempted to filter the data to include only records with collectionCode CASTYPE, the collection code of the collection that keeps the type specimen with catalog number CASTYPE1652.
After 5h15m of processing at a rate of about 100k lines/s, I counted 2.07 billion lines. I also found no records with collectionCode CASTYPE.
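For anyone who wants to repeat the count, here is a minimal Python sketch of the kind of streaming scan described above. It is not the tool I used; it assumes the “simple” table is a tab-separated file with a header row that contains a `collectionCode` column, stored as a member of a zip archive. The function names and the `zip_path` and `member` parameters are hypothetical.

```python
import csv
import io
import zipfile

# Occurrence rows can contain very long fields; raise the parser's default limit.
csv.field_size_limit(2**31 - 1)


def count_and_filter(lines, collection_code="CASTYPE"):
    """Count data rows and collect the rows whose collectionCode matches.

    `lines` is any iterable of text lines; the first line must be the
    tab-separated header row.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = 0
    matches = []
    for row in reader:
        total += 1
        if row.get("collectionCode") == collection_code:
            matches.append(row)
    return total, matches


def scan_zip(zip_path, member, collection_code="CASTYPE"):
    """Stream one member of a zip archive without unpacking it to disk.

    Both `zip_path` and `member` are placeholders for the local download
    and the name of the table inside it.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return count_and_filter(text, collection_code)
```

The scan reads one row at a time, so memory use stays small as long as the number of matching rows is modest. The match is an exact, case-sensitive comparison on the `collectionCode` column only, so records that carry CASTYPE in another column (for example in the catalog number) are not caught by this filter.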
To confirm that the collectionCode CASTYPE was actually used in associated records, and existed on and prior to 1 March 2023, I checked two versions of the Darwin Core archive provided by the California Academy of Sciences via http://ipt.calacademy.org:8080/archive.do?r=type. Both the 1 March 2023 version (https://linker.bio/zip:hash://sha256/ffffe616beab7b4a04e46162cdbd2584f986e3f5f5b56258f9737ee31f36b6b6!/occurrence.txt) and the 1 January 2023 version (https://linker.bio/zip:hash://sha256/110f398aa4c8a4be870c7b3c1d698c32eb2c8dad878b614fe8e8f7a153251a43!/occurrence.txt) included records with collection code CASTYPE.
Also, I logged in to the GBIF web portal and created a “download” with citation:
GBIF.org (24 March 2023) GBIF Occurrence Download Download
This download was filtered to include only records from the GBIF dataset CAS Entomology Type (TYPE), which is associated with the CASTYPE collection.
Using the same method as before, I selected records that mention collectionCode CASTYPE. Contrary to the earlier results, records with collectionCode CASTYPE now appeared, including CASTYPE1652.
So, given the contradictory results, I was wondering:
- Can anybody confirm that CASTYPE records (including CASTYPE1652) do not appear in GBIF.org (01 March 2023) GBIF Occurrence Download Download?
- Can someone explain why the GBIF front page claims to have over 2.2 billion records indexed, whereas GBIF.org (01 March 2023) GBIF Occurrence Download Download appears to include about 200M fewer records?
Most likely, I don’t fully understand what to expect to be included in GBIF.org (01 March 2023) GBIF Occurrence Download Download, so I would very much appreciate your insights to help me better understand these valuable datasets.
Again, apologies for the long and detailed post. I am curious to hear anyone’s thoughts on how I should proceed.
PS. The overarching use case is to document associations between GBIF occurrence identifiers and their associated institution code, collection code, and catalog number. I need this to establish links between CASTYPE1652 (or other specimens) and their digital traces in GBIF, and, indirectly, in Bionomia. Because Bionomia uses GBIF identifiers to link people to their associated records, I need to “speak” GBIF identifiers to resolve the wealth of knowledge about the people behind collections, as facilitated and enriched by @dshorthouse via https://bionomia.net. fyi @Debbie @seltmann
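To make the use case concrete, here is a minimal Python sketch of such a lookup. It assumes a tab-separated table with a header row that contains `gbifID`, `institutionCode`, `collectionCode` and `catalogNumber` columns; the function name is hypothetical and the sketch is not part of any GBIF or Bionomia tooling.

```python
import csv
from collections import defaultdict


def index_by_specimen_key(lines):
    """Map (institutionCode, collectionCode, catalogNumber) to gbifIDs.

    `lines` is any iterable of text lines whose first line is the
    tab-separated header row. Missing values are treated as empty strings.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    index = defaultdict(list)
    for row in reader:
        key = (
            row.get("institutionCode") or "",
            row.get("collectionCode") or "",
            row.get("catalogNumber") or "",
        )
        index[key].append(row["gbifID"])
    return index
```

A list of identifiers is kept per key because the same specimen triple may be attached to more than one GBIF occurrence record, for example when a specimen is published through several datasets.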
PS2. After deriving a five-column table (gbifID, occurrenceID, collectionCode, institutionCode, catalogNumber), I was able to find only 4 records (out of an estimated 17k specimens held in the CASTYPE collection) that included CASTYPE in any of these columns:
| gbifID | occurrenceID | institutionCode | collectionCode | catalogNumber |
|---|---|---|---|---|
| 2275276454 | 03E987E2FE8B2B6EFF3ED117FB5AFBDC.mc.3B283CA9FE8A2B6DFDBDD12EFD82F845 | CAS | - | CASTYPE19452, MA-02-14A-35 |
| 2275275513 | 03E987E2FE7D2B9BFF3ED39FFB57FE0C.mc.3B283CA9FE7C2B9BFDA8D6F3FD25FE98 | CAS | - | CASTYPE19463 |
| 2275274939 | 03E987E2FDBD285BFF3ED282FA68FD74.mc.3B283CA9FDBC285BFDB3D7F4FCE1FD9C | CAS | - | CASTYPE19467, MA-02-08A-16 |
| 2275275452 | 03E987E2FE692B8FFF3ED056FA71FD2C.mc.3B283CA9FE682B8FFE56D793FC02FDB8 | CAS | - | CASTYPE19451 |
with related GBIF occurrence HTML landing pages: