Type Specimen CASTYPE1652 found via filtered query https://doi.org/10.15468/dl.xf6ahb, but not in open-access GBIF data product https://doi.org/10.15468/dl.pk3trq

Hi!

First, thanks for providing this open discussion forum in addition to maintaining the expansive biodiversity data universe that is GBIF.

Second, apologies in advance for the long and rather detailed post below.

The executive summary is that I am trying to figure out why I can find type specimen CASTYPE1652 in the filtered query download (https://doi.org/10.15468/dl.xf6ahb), but not in the open-access GBIF data product download (https://doi.org/10.15468/dl.pk3trq).

The text below describes how I got to the datasets, and ends with specific questions.

As I am tracking (versioned) digital traces associated with type specimen CASTYPE1652 (see https://beehind.org), I downloaded the open-access data product:

GBIF.org (01 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.pk3trq

via https://api.gbif.org/v1/occurrence/download/request/0015281-230224095556074.zip
to produce ~260 GB of digital content with content id hash://sha256/c8bac8acb28c8524c53589b3a40e322dbbbdadf5689fef2e20266fbf6ddf6b97 .
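For the record, the retrieval and verification amounted to something like this (a sketch; sha256sum is from GNU coreutils):

$ # download the ~260 GB snapshot and confirm its content id
$ curl -L -o 0015281-230224095556074.zip 'https://api.gbif.org/v1/occurrence/download/request/0015281-230224095556074.zip'
$ sha256sum 0015281-230224095556074.zip
# expected: c8bac8acb28c8524c53589b3a40e322dbbbdadf5689fef2e20266fbf6ddf6b97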

Then, I used a streaming query to count all lines in the “simple” table included in the file. In addition, I attempted to filter the data to include only records with collectionCode CASTYPE, the collection code of the collection that keeps the type specimen with catalog number CASTYPE1652.

After 5h15m of processing at a rate of about 100k lines/s, I counted 2.07 billion lines. Also, no records with collectionCode CASTYPE were found.
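The gist of the count/filter was something like the sketch below (simplified, not the exact pipeline; the actual tooling turns out to matter, as discussed later in this thread):

$ # count all rows in the simple table while scanning for the collection code
$ unzip -p 0015281-230224095556074.zip 0015281-230224095556074.csv \
    | awk '/CASTYPE/ { hits++ } END { print NR, "lines,", hits+0, "CASTYPE mentions" }'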

To confirm that the collectionCode CASTYPE was actually used in associated records, and existed on and prior to 1 March 2023, I verified that the 1 March 2023 (https://linker.bio/zip:hash://sha256/ffffe616beab7b4a04e46162cdbd2584f986e3f5f5b56258f9737ee31f36b6b6!/occurrence.txt) and 1 January 2023 (https://linker.bio/zip:hash://sha256/110f398aa4c8a4be870c7b3c1d698c32eb2c8dad878b614fe8e8f7a153251a43!/occurrence.txt) versions of the Darwin Core archive provided by the California Academy of Sciences via http://ipt.calacademy.org:8080/archive.do?r=type included records with collectionCode CASTYPE.
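For example, the 1 March 2023 version can be checked with something like (assuming linker.bio serves the archive content at that hash):

$ # confirm the CAS IPT archive mentioned the collection code on 1 March 2023
$ curl -sL 'https://linker.bio/zip:hash://sha256/ffffe616beab7b4a04e46162cdbd2584f986e3f5f5b56258f9737ee31f36b6b6!/occurrence.txt' \
    | grep -c CASTYPE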

Also, I logged in to the GBIF web portal and created a “download” with citation:

GBIF.org (24 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.xf6ahb

This download was filtered to include only records from the GBIF dataset CAS Entomology Type (TYPE), which is associated with the CASTYPE collection.

Using the same methods as earlier, I selected records mentioning collectionCode CASTYPE. Contrary to the earlier results, records with collectionCode CASTYPE now appeared, including CASTYPE1652.

So, given the contradictory results, I was wondering:

  1. Can anybody confirm that CASTYPE records (including CASTYPE1652) do not appear in GBIF.org (01 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.pk3trq ?
  2. Can someone explain why the GBIF front page claims to have over 2.2 billion records indexed, whereas that same download appears to include about 200M fewer records?

Most likely, I don’t fully understand what to expect to be included in GBIF.org (01 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.pk3trq , so I very much appreciate your insights to better understand these valuable datasets.

Again, apologies for the long and detailed post; I am curious to hear anyone’s thoughts on how I should proceed.

thx,
-jorrit

PS. The overarching use case is to document associations between GBIF occurrence identifiers and their associated institution code, collection code, and catalog number. I need this to establish links between CASTYPE1652 (or other specimens) and their digital traces in GBIF, and, indirectly, in Bionomia. Because Bionomia uses GBIF identifiers to link people to their associated records, I need to “speak” GBIF identifiers to resolve the wealth of knowledge about the people behind collections as facilitated/enriched by @dshorthouse https://bionomia.net . fyi @Debbie @seltmann

PS2. After deriving a five-column table (gbifID, occurrenceID, collectionCode, institutionCode, catalogNumber), as sketched below the table, I was able to find only 4 records (out of an estimated 17k specimens held in the CASTYPE collection) that included CASTYPE across these column values:

gbifID     occurrenceID                                                         institutionCode collectionCode catalogNumber
2275276454 03E987E2FE8B2B6EFF3ED117FB5AFBDC.mc.3B283CA9FE8A2B6DFDBDD12EFD82F845 CAS             -              CASTYPE19452, MA-02-14A-35
2275275513 03E987E2FE7D2B9BFF3ED39FFB57FE0C.mc.3B283CA9FE7C2B9BFDA8D6F3FD25FE98 CAS             -              CASTYPE19463
2275274939 03E987E2FDBD285BFF3ED282FA68FD74.mc.3B283CA9FDBC285BFDB3D7F4FCE1FD9C CAS             -              CASTYPE19467, MA-02-08A-16
2275275452 03E987E2FE692B8FFF3ED056FA71FD2C.mc.3B283CA9FE682B8FFE56D793FC02FDB8 CAS             -              CASTYPE19451
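The derivation was essentially the following (a simplified sketch using Miller; column names as in the simple table header):

$ # derive the five columns from the simple TSV and scan them for the collection code
$ unzip -p 0015281-230224095556074.zip 0015281-230224095556074.csv \
    | mlr --tsv cut -o -f gbifID,occurrenceID,institutionCode,collectionCode,catalogNumber \
    | grep CASTYPE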

with related GBIF occurrence HTML landing pages.


Hi Jorrit,

I’ve checked the download https://doi.org/10.15468/dl.pk3trq, and I find all the expected records with collectionCode CASTYPE.

To show CASTYPE1652 exists:

$ # stream the TSV out of the zip, tracking progress with pv, and split it into ~1 GiB zstd-compressed chunks
$ unzip -p 0015281-230224095556074.zip 0015281-230224095556074.csv | \
  pv -s 1221527907781 | \
  split --suffix-length=6 --numeric-suffixes --line-bytes=1024M --filter='zstd > $FILE.zst' - 0015281-230224095556074-part-

$ # search the chunks in parallel; -l prints the name of the matching chunk
$ ls -1 *.zst | parallel zstdgrep -l CASTYPE1652
0015281-230224095556074-part-001080.zst

$ # show the matching row
$ zstdgrep '\bCASTYPE1652\b' 0015281-230224095556074-part-001080.zst
2238760764      6ec3c7f5-6233-48f6-b36a-06b867edbadd    urn:catalog:CAS:TYPE:1652…

It’s the 2_191_315_308th row in the TSV, out of 2_302_252_496 rows in total (expected number of occurrences + header row).

$ # count lines per chunk in parallel, then sum the per-chunk counts
$ (ls -1 *.zst | parallel zstdcat {} '|' wc -l ) | tee counts | numsum
2302252496

From your script, you might be streaming the decompression of the Zip file. I think libarchive has a bug when stream-decompressing large Zip archives produced by streaming compression in Java, although it loses only the last few kB of data, not the many megabytes that a total row count of 2.07 billion would suggest. Could something like that have happened?
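One way to test this is to compare a streaming read against a central-directory read (a sketch; bsdtar is libarchive’s command-line reader and is forced into streaming mode when fed via a pipe, while unzip seeks using the central directory):

$ # streaming read via libarchive
$ cat 0015281-230224095556074.zip | bsdtar -xOf - 0015281-230224095556074.csv | wc -l
$ # central-directory read via unzip; a lower count from the first command points at the streaming path
$ unzip -p 0015281-230224095556074.zip 0015281-230224095556074.csv | wc -l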


@MattBlissett thanks for taking the time to review my observations. I appreciate it, especially given the size and computational resources needed to do the work. Also, fun to see the tools you are using to optimize the search (e.g., parallel, split, zstdgrep).

I’ve opened an issue related to your suggestion that the root cause may be in the streaming method I am using. Like I mentioned in the issue, I am hoping to get to the bottom of this sooner or later.

For me, this example does seem to suggest that reviewing large datasets can be quite labor/resource intensive and may not be within reach for the typical (data) paper reviewer. I wonder what can be done to help make it easier to review these “big data” processing methods and associated datasets.

Thanks again for taking the time to respond,
-jorrit

https://jhpoelen.nl

I wonder what can be done to help make it easier to review these “big data” processing methods and associated datasets.

I suspect formats like Apache Parquet/Avro will increasingly become mainstream for reasons such as these (plus the schema management). A lot of tools like Google BigQuery, CARTO, etc. already support Parquet, and the Open Geospatial Consortium is looking to add geospatial indexes. As tooling adopts it, hopefully the technical threshold drops.

For Preston, perhaps using the Parquet / Avro versions of GBIF might be an option to consider. There’s a blog showing an example of using Parquet here.

Edited to add: There is a summary of all the snapshots, including the GBIF Avro/Parquet ones here if anyone reading wants to explore.
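To illustrate, something like the following should work against the public snapshots (a sketch; the bucket layout, snapshot date, and lowercase column name are assumptions to check against the snapshot documentation):

$ # fetch one Parquet part of a GBIF occurrence snapshot from the AWS Open Data bucket
$ aws s3 cp --no-sign-request s3://gbif-open-data-us-east-1/occurrence/2023-03-01/occurrence.parquet/000000 part-000000.parquet
$ # count CASTYPE records in that part with DuckDB
$ duckdb -c "SELECT count(*) FROM 'part-000000.parquet' WHERE collectioncode = 'CASTYPE';"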

@trobertson thanks for taking the time to respond.

Yes, Parquet is a neat format specifically designed to facilitate parallel processing. We used it extensively in the 2018 Thessen et al. paper [1] . Great to see that you are offering your interpreted data products prepackaged in this format!

And, with a Parquet file, as well as with any resource offered, I would like to:

  1. independently verify the authenticity of a referenced resource after retrieving it, especially with large datasets where transfer issues are to be expected.
  2. trace the origin of the data (what original data versions and workflows were used to compile the dataset?)

I find (1) useful when reviewing / re-using datasets to make sure I am working with the original copy. And (2) helps me understand how the resource was compiled so I can trace errors (debugging workflows), do error analysis (how do errors propagate?), and compare notes with data contributors (pointing to an exact version of contributed data).
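As a sketch of (1) with the snapshot discussed in this thread (assuming the content is available to Preston locally or via a configured remote):

$ # retrieve the snapshot by content id (location-agnostic) and re-verify it on the way out
$ preston cat hash://sha256/c8bac8acb28c8524c53589b3a40e322dbbbdadf5689fef2e20266fbf6ddf6b97 \
    | tee snapshot.zip \
    | sha256sum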

If you have any ideas on how to do this with GBIF snapshots, please do share. I think it would help my attempt to build a resource for finding digital traces of CASTYPE specimens in the GBIF universe in the context of https://beehind.org .

One of the reasons I am trying to integrate with GBIF is that projects like bionomia.net rely on GBIF specimen id proxies (e.g., occurrenceIDs or gbifIDs) to make connections between people and the specimens they worked with.

In other words, I’d like to be able to answer the question: which GBIF id is associated with some catalog number (e.g., CASTYPE1652), and what associations does bionomia.net keep on this GBIF id?
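For a single record, the public occurrence search API can answer the first half of that question (a sketch; the jq field selection assumes the documented search response structure):

$ # find the gbifID(s) associated with a catalog number
$ curl -s 'https://api.gbif.org/v1/occurrence/search?catalogNumber=CASTYPE1652' \
    | jq '.results[] | {gbifID: .key, occurrenceID, institutionCode, collectionCode, catalogNumber}'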

And, I’d like to end up with a resource that can be copied and archived without losing the ability to have its authenticity independently verified. Some sort of data publication published across various platforms (e.g., Zenodo, Internet Archive).

And, while I think switching file formats might help optimize some processing workflows, I am not sure whether items 1 and 2 will be addressed.

Curious to hear your thoughts,
-jorrit

PS I am still in the process of finding the root cause of why I initially wasn’t able to find the CASTYPE1652 traces in the downloaded resource associated with the 1 March 2023 snapshot [2]. I noticed that @MattBlissett confirmed that I am working with the same copy of the snapshot as he was (content appears incomplete on streaming large files from zip · Issue #228 · bio-guoda/preston · GitHub), namely the resource with content identifier hash://sha256/c8bac8acb28c8524c53589b3a40e322dbbbdadf5689fef2e20266fbf6ddf6b97 .

References

[1] Thessen AE, Poelen JH, Collins M, Hammock J. 20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration. PeerJ Comput Sci. 2018 Sep 17;4:e164. doi: 10.7717/peerj-cs.164. PMID: 33816817; PMCID: PMC7924439.

[2] GBIF.org (01 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.pk3trq

After some time analyzing why I was initially unable to find CASTYPE1652 in the 1 March 2023 GBIF snapshot (see content appears incomplete on streaming large files from zip · Issue #228 · bio-guoda/preston · GitHub), I was able to narrow down the root cause to a third-party command-line tool; the GBIF snapshot appeared to be consistent after all :smile: .

Thanks for humoring me by reproducing my results and confirming the content id (or content hash) of the open access GBIF data product in question, and for guiding me to the root cause of my seemingly inconsistent results associated with https://doi.org/10.15468/dl.pk3trq .

I’ve reported a separate issue of the apparent root cause of it all at when processing tsv file from ~260GB zip file, mlr stops processing abruptly while filtering stream · Issue #1251 · johnkerl/miller · GitHub .

thx,
-jorrit

Excellent - thanks for letting us know @jhpoelen

Regarding:

  1. independently verify the authenticity of a referenced resource after retrieving it, especially with large datasets where transfer issues are to be expected.

Is this as simple as wishing us to provide a checksum or suchlike to accompany all files, please?

Supplying an md5 and sha256 hash for all files would be excellent! That would help me answer my own question: did I get what I asked for?

My dream would be to also have access to a hash of the description of the provenance of the files, sort of like a shipping manifest. This manifest would reference the included files (and their hashes), their upstream resources (e.g., original datasets incl. hashes) and description of the processes (tools used to transform the original data) that turned the “raw” material (the original datasets) into the “processed” or “synthesized” dataset.

And once you have a manifest, you can generate a hash of that . . . and that hash can be embedded in a DOI. With that, you’d have the joys of working with resolvable DOIs as well as identifiers that help verify the associated resources once retrieved.
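A minimal sketch of that manifest idea with standard tools (file names hypothetical):

$ # per-file checksums collected into a manifest . . .
$ sha256sum occurrence.tsv citations.txt metadata.xml > manifest.sha256
$ # . . . and a single hash of the manifest that identifies the whole shipment
$ sha256sum manifest.sha256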

So with a minor change in the way DOIs are generated, instead of something like:

https://doi.org/10.15468/dl.pk3trq

you’d have

https://doi.org/10.15468/md5:abc123...

or

https://doi.org/10.15468/sha256:abc123...

or shorten the hashes to a more printable version:

https://doi.org/10.15468/abc123...

Thanks for elaborating @jhpoelen

@trobertson @MattBlissett one thing I was wondering, and likely a very silly question: From https://api.gbif.org/v1/occurrence/download/0015281-230224095556074 with hash://sha256/d061de217c7cb898bf86c480685fc764c2c9296891fed771afe0bea121a3a87d , I found that:

{
  "key": "0015281-230224095556074",
  "doi": "10.15468/dl.pk3trq",
  "license": "http://creativecommons.org/licenses/by-nc/4.0/legalcode",
  "request": {
    "sendNotification": true,
    "format": "SIMPLE_CSV",
    "type": "OCCURRENCE",
    "verbatimExtensions": []
  },
  "created": "2023-03-01T07:25:20.843+00:00",
  "modified": "2023-03-01T07:57:48.221+00:00",
  "eraseAfter": "2023-09-01T07:25:20.774+00:00",
  "status": "SUCCEEDED",
  "downloadLink": "https://api.gbif.org/v1/occurrence/download/request/0015281-230224095556074.zip",
  "size": 278474479481,
  "totalRecords": 2302252495,
  "numberDatasets": 55107
}
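(For reference, that field can be pulled straight from the API with curl and jq:)

$ curl -s 'https://api.gbif.org/v1/occurrence/download/0015281-230224095556074' | jq .eraseAfter
"2023-09-01T07:25:20.774+00:00"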

Does this mean that the data associated with https://doi.org/10.15468/dl.pk3trq will be deleted after "eraseAfter": "2023-09-01T07:25:20.774+00:00", or Sept 1, 2023 ?

Hi @jhpoelen,

This should help answer your question: if a download is (found to be) cited, e.g., in a journal article, this value is removed and the download will be retained for as long as possible.

Best,
Daniel

Does this mean that the data associated with https://doi.org/10.15468/dl.pk3trq will be deleted after "eraseAfter": "2023-09-01T07:25:20.774+00:00", or Sept 1, 2023 ?

It means it is eligible for deletion, not necessarily that it will be. Downloads cited in papers aren’t deleted, and users can extend the retention period as described here.

@jhpoelen and @trobertson +1 for including the hash.

However, as recommended by the DOI Foundation (https://www.doi.org/the-identifier/resources/handbook/2_numbering), I would not embed the hash in the DOI name. This should be in the associated metadata. The reasoning behind this is that we avoid adding semantics at the identifier level. We keep the DOI name as opaque as possible and add meaning/semantics in the metadata (this could be inserted in the PID record, which is part of the Handle system and de-coupled from the repository).

Thanks for your prompt replies. I cited https://doi.org/10.15468/dl.pk3trq in:

Poelen, Jorrit. (2023). Global Biodiversity Informatics Facility (GBIF): an exhaustive list of gbif record ids, dataset keys, and their associated Occurrence IDs, Institution Code, Collection Codes and Catalog Numbers (0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7789866

Assuming that this counts as a reference, how will I be notified that the data associated with the download DOI will be preserved? If my assumption is incorrect, what are your criteria for counting a reference toward archiving the requested data?

See also the attached screenshots.


Hi @sharif.islam thanks for taking the time to respond and for referencing an applicable section in the DOI manual. Please do note, however, that many DOI members do embed semantics in their DOI entries.

Some examples include:

  1. https://doi.org/10.5281/zenodo.7789866
    where “zenodo” is included to (presumably) make it easier to recognize Zenodo DOIs, even though their DOI prefix 10.5281 can be recognized by a trained eye (or robot). Also, the sequential publication number has meaning, and can serve as a proxy for publication date (e.g., https://doi.org/10.5281/zenodo.7789866 was published after https://doi.org/10.5281/zenodo.7761832).

  2. https://doi.org/10.1093/jme/tjad009
    where jme is an acronym for the Journal of Medical Entomology

  3. https://doi.org/10.1111/ele.13966
    where ele is an acronym for Ecology Letters.

  4. https://doi.org/10.1126/science.abn4012
    where science refers to Science magazine.

Also, note that (now Sir) Tim Berners-Lee realized early on that ideally, URIs should be “Cool” [1]: “[…] There are no reasons at all in theory for people to change URIs (or stop maintaining documents), but millions of reasons in practice. […]”. Retracting a download URL like the snapshot archive at https://api.gbif.org/v1/occurrence/download/request/0015281-230224095556074.zip (aka hash://sha256/c8bac8acb28c8524c53589b3a40e322dbbbdadf5689fef2e20266fbf6ddf6b97) would make it un-Cool.

So yes, I appreciate specifications, recommendations, and insights that others may have to help guide our decisions on documenting links between the knowledge we keep or mediate. And, I often find myself improvising within the constraints of existing infrastructures to work towards a more stable digital knowledge universe. But, I also realize that with improvisation comes the risk of hitting that weird note.

In summary - I appreciate the DOI specification and their suggestions, and, given the current use of DOIs, I still see a compelling case for embedding content ids in DOIs.

Curious to hear your thoughts,
-jorrit

References

[1] Berners-Lee T. 1998. Cool URIs don’t change. W3C. Accessed at https://www.w3.org/Provider/Style/URI on 2023-04-03.

There is no notification per se, unless you track the event data of the Zenodo resource, in which case you might find that there’s now a reciprocal reference from the download DOI to the Zenodo dataset. The download landing page now also shows “1 citation”.

We usually pick these up within a few days, but feel free to let us know using the “tell us about usage” button on the download landing page, e.g.:

@dnoesgaard very neat to see that GBIF is picking up the Zenodo data publication I made only a few days ago (see attached screenshot)!

Your pointer to the “1 citation” remark helped answer that question.

Thank you for being patient with me.

For me, this encourages using the “derived from” annotations offered by Zenodo as well as keeping derived copies around.

In addition, I see a lot of opportunity for a more integrated data ecosystem, making it possible to compile highly complex datasets whose reconstruction is facilitated by DOI-enabled tracking, while not requiring https://doi.org to exist in order to archive/retrieve the associated data until the sun explodes.

-jorrit

PS. I was able to confirm that the eraseAfter entry was apparently removed from the download request metadata obtained via https://api.gbif.org/v1/occurrence/download/0015281-230224095556074 on 2023-04-03 with content id hash://sha256/9ae196d00e7251ae72c74b3ec68b0a3ae53ca4acc44e507a54e6474e36bd95fd :

{
  "key": "0015281-230224095556074",
  "doi": "10.15468/dl.pk3trq",
  "license": "http://creativecommons.org/licenses/by-nc/4.0/legalcode",
  "request": {
    "sendNotification": true,
    "format": "SIMPLE_CSV",
    "type": "OCCURRENCE",
    "verbatimExtensions": []
  },
  "created": "2023-03-01T07:25:20.843+00:00",
  "modified": "2023-04-03T13:41:27.651+00:00",
  "status": "SUCCEEDED",
  "downloadLink": "https://api.gbif.org/v1/occurrence/download/request/0015281-230224095556074.zip",
  "size": 278474479481,
  "totalRecords": 2302252495,
  "numberDatasets": 55107
}

with the difference from the 2023-03-01 version being:

$ diff <(preston cat hash://sha256/d061de217c7cb898bf86c480685fc764c2c9296891fed771afe0bea121a3a87d | jq .) <(preston cat hash://sha256/9ae196d00e7251ae72c74b3ec68b0a3ae53ca4acc44e507a54e6474e36bd95fd | jq .)
12,13c12
<   "modified": "2023-03-01T07:57:48.221+00:00",
<   "eraseAfter": "2023-09-01T07:25:20.774+00:00",
---
>   "modified": "2023-04-03T13:41:27.651+00:00",

Hi @jhpoelen

I agree that these are valid points. It’s true that DOI usage often doesn’t follow the guidelines in the handbook. Still, it’s generally a good idea to keep the meaning embedded in the identifier string to a minimum. Additionally, since the hash is machine-generated, using it can help reduce the risk of human error.

However, I have some concerns about coupling the hash creation step with the PID generation step. This could add another dependency to the implementation and maintenance processes. It might be better to have the hash as a separate metadata element, which could still be useful for various purposes.
