Downloads failing to include all files in the archive

I’ve been trying to download the GBIF data for NMNH Extant Specimen Records (USNM, US). Each time I try I get a (large) DwC-A file that only has the occurrence data; the other files mentioned in metadata.xml (multimedia.txt and verbatim.txt) are missing. I’m especially after the multimedia file. Here’s a link to an example download: Download.

What am I doing wrong?

Hey @rdmpage ,

Sorry to hear that you are unable to access the data you asked for.

Any particular reason you are using the GBIF interpreted data, instead of the original data provided by the USNM themselves?

-jorrit

@jhpoelen Several reasons:

  • GBIF download speeds are often faster than original providers (once GBIF has created the download)
  • Original providers are often offline
  • I want the GBIF occurrence ids

Turns out the missing file problem also affects data from the Missouri Botanical Garden. I’ve managed to get data from both sources, and occurrence data from GBIF.

In case you are wondering, I’m getting this data to make a mapping between “barcodes” often used to identify plant specimens (e.g., “BM000944668”), and the corresponding records in GBIF (and, by extension, the original data providers). Turns out this is a lot harder than it should be because data providers often fail to provide the barcodes, or put them in bizarre places in the data (e.g., media file names). We are some way away from having citable plant specimens.

Thanks for clarifying the context of your request.

  • GBIF download speeds are often faster than original providers (once GBIF has created the download)
  • Original providers are often offline

Yes, accessing resources through web resource locators (i.e., URLs) can be quite unpredictable [1]. Over the years (as you know), I realized that I can better control access to valuable resources by making verifiable copies while keeping track of their origin. Without this controlled access, integrating large collections of digital resources becomes difficult, if not impossible.

If I wanted to access the USNM Extant Specimen Records, I’d search for them in a versioned corpus of DwC-A files and pick whatever version I’d like. In this case, I’d pick the most recent version I have: the version reported by urn:uuid:16fd3921-70b6-4695-b40d-410a511d168b, a download event described in https://linker.bio/line:hash://sha256/a755a6ac881e977bc32f11536672bfb347cf1b7657446a8a699abb639de59419!/L1014791-L1014797 .

The last line, https://linker.bio/line:hash://sha256/a755a6ac881e977bc32f11536672bfb347cf1b7657446a8a699abb639de59419!/L1014797 ,

describes the claim that https://collections.nmnh.si.edu/ipt/archive.do?r=nmnh_extant_dwc-a produced, at 2023-09-02T14:11:47.258Z, content with identifier hash://sha256/dccf6783c48610f9745399ada7b17f7d0121580a8efdf668466ee6ac3e1ea2e7 . Then, I’d get the associated content from the local hard disk holding my versioned DwC-A corpus. Or, I could ask someone else for content with that exact content identifier. To get the versioned data to you, I’d either copy a hard disk (just send me a self-addressed 4 TB hard disk with appropriate return postage), or, if you are ok with using the internet instead, you can use services like DataONE, Zenodo, Wikimedia Commons, or Software Heritage that allow content to be looked up by its content identifier [2].

And then, after retrieving a copy of the content, I’d verify that I got what I asked for.
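
For instance, a minimal sketch of that retrieve-then-verify step, using the content identifier above (sha256sum is assumed to be on your PATH; on macOS, shasum -a 256 behaves the same way):

# the content identifier doubles as the expected checksum of the downloaded bytes
HASH="dccf6783c48610f9745399ada7b17f7d0121580a8efdf668466ee6ac3e1ea2e7"
curl -L "https://linker.bio/hash://sha256/${HASH}.zip" > usnm-dwca.zip
echo "${HASH}  usnm-dwca.zip" | sha256sum --check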

Anyways, a long way of saying: yes, the current way DwC-A are distributed can be optimized just like Netflix does it, by keeping copies close to those who ask for them using a content delivery network (CDN) or similar. This works especially well for immutable content, like specific versions of a DwC-A, or that episode of Game of Thrones.

To make a long story short, here’s a verifiable recent copy of the USNM archive: https://linker.bio/hash://sha256/dccf6783c48610f9745399ada7b17f7d0121580a8efdf668466ee6ac3e1ea2e7.zip . How long did it take you to download it? Too slow?

  • I want the GBIF occurrence ids

I want them too. And it took a little time to build an extensive list of all available occurrence ids out there for a specific version of “all” GBIF occurrences (which makes me wonder why this mapping isn’t published by GBIF themselves). I published the results at [3]. I wonder whether “BM000944668” is in there. Also, I wonder how many plant specimens do have their “barcodes” . . . sounds like a nice project; I am curious how you’re going to tackle your data access issues.
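
For example, assuming the list in [3] has been fetched from Zenodo as a gzipped, tab-separated file named gbif-ids.tsv.gz (the filename and compression are assumptions on my part), a quick check for that barcode is a one-liner:

# print the first row (if any) that mentions the barcode, then stop
zcat gbif-ids.tsv.gz | grep -m1 'BM000944668'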

References

[1] Elliott, M.J., Poelen, J.H. & Fortes, J.A.B. (2020). Toward Reliable Biodiversity Dataset References. Ecological Informatics. hash://sha256/136c3c1808bcf463bb04b11622bb2e7b5fba28f5be1fc258c5ea55b3b84f482c

[2] Elliott, M.J., Poelen, J.H. & Fortes, J.A.B. (2023). Signing data citations enables data verification and citation persistence. Scientific Data 10, 419. hash://sha256/f849c870565f608899f183ca261365dce9c9f1c5441b1c779e0db49df9c2a19d

[3] Poelen, J.H. (2023). Global Biodiversity Information Facility (GBIF): an exhaustive list of gbif record ids, dataset keys, and their associated Occurrence IDs, Institution Code, Collection Codes and Catalog Numbers (0.1) [Data set]. Zenodo. hash://sha256/ea88f03a7bfd1ba853fdbea3203d54ab81ac3cdc8e8da7c96bbbba9c4b05d933 hash://md5/c49fe34785354847b37ea4509261e130


I think the issue is more than simply verifiable URLs. The link to the “verifiable recent copy of USNM” suffers from the same problem as the GBIF download: it lacks a big chunk of the data that GBIF demonstrably has (e.g., images), hence I suspect there’s some other issue that’s preventing the correct downloads being generated.

I ended up getting the data I need from the NMNH’s IPT site, curiously from a dataset (NMNH_Botany) apparently not shared with GBIF (though I’m assuming it is part of the larger NMNH dataset that I tried to download).

I understand the argument for hash-based identifiers, and am a fan of the idea. It would be nice to (a) have some sort of discovery mechanism where I can search for datasets (in the absence of providers making hashes available), and (b) know, given a hash URI, whether I’m about to download a terabyte of data or not. Identifiers like DOIs tend to resolve to metadata about the data, so I can then make a decision about whether to download the data. Are there any semantics for hash URIs that can tell me what to expect? For example, ARKs have inflection features where appending “?” to the URL retrieves metadata. LSIDs had a metadata/data distinction baked in as well.

At the moment https://linker.bio by itself returns 403 Forbidden. Maybe a welcome page, or, even better, a search function would be helpful?

Lastly, the list of GBIF ids is nice, but most of this dataset isn’t directly relevant to my needs, as the bulk of GBIF data comes from eBird and I’m focussed on museums and herbaria. The other issue is that often the identifiers I’m looking for are not in the “Darwin Core triplet” but scattered elsewhere in the record, or indeed not in GBIF at all. There isn’t a culture of providing citable identifiers for specimens, probably in part because there is no obvious value in individual specimen citations; the value of GBIF is seen as the aggregation (hence datasets and downloads get DOIs).

Thanks for taking the time to reply and for sharing ideas.

A. Through your observation we now know that the original data as provided by USNM seems to lack the media records that you are looking for. So, this (verifiable) evidence suggests that the issue of the missing media records is present before the USNM data gets indexed by GBIF.

So, now, if I were looking for this data, I’d go back in time and investigate exactly when these media records stopped being produced, if they were ever present at all.

B. Thanks for the suggestion. If you know someone who is willing to build a web front end (I am assuming you’d like to click on stuff) on top of the already existing DwC-A corpus, please do let me know. All the data is there; it’s just waiting for folks like you to imagine reusing it in meaningful ways. My initial focus is on developing a methodology to reduce the chances of the raw data getting lost forever: without data, no websites. I do realize this lack of snazzy UIs is not . . . very snazzy. Luckily, web UIs can be added more easily than recreating the historic data of some collection that lost funding or simply forgot to pay its monthly Amazon cloud bill.

In an attempt to appeal to your curiosity and interest in triple stores, please note that the corpus uses nquads to document the provenance of the DwC-A records, so you can push these into a triple store as is, and write some SPARQL to select the statements of interest.

The bash code below streams these data into [your triple store] .

preston ls --anchor hash://sha256/a755a6ac881e977bc32f11536672bfb347cf1b7657446a8a699abb639de59419 --remote https://linker.bio \
  | [your triple store]
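
For example, a sketch using Apache Jena’s command-line tools as a stand-in for [your triple store] (the choice of store is just an illustration, and the query assumes the provenance log links URLs to content hashes via the pav:hasVersion predicate):

# stream the provenance log to disk
preston ls --anchor hash://sha256/a755a6ac881e977bc32f11536672bfb347cf1b7657446a8a699abb639de59419 --remote https://linker.bio > provenance.nq

# load the nquads into a local TDB2 store
tdb2.tdbloader --loc ./provenance-db provenance.nq

# ask which URLs resolved to which content identifiers (assumes pav:hasVersion is the linking predicate)
cat > versions.rq <<'EOF'
SELECT ?url ?content WHERE { GRAPH ?g { ?url <http://purl.org/pav/hasVersion> ?content } } LIMIT 10
EOF
tdb2.tdbquery --loc ./provenance-db --query versions.rq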

C. Yes, just like in B, services can be built on top of this DwC-A corpus to provide this kind of information (btw, was the download fast enough?). Again, building these services is limited only by the imagination of those using biodiversity data, and their willingness to invest time/resources in it. Any idea who may be interested? (btw, the GBIF DOIs associated with their registered datasets are embedded in the corpus as RDF triples.)

D. Right now, some semantics are added by a basic web application built into Preston (you can start it, jekyll-style, using preston s, short for preston server). These semantics include Apache VFS-style URI notation to do things like selecting a range of lines in a text (e.g., as shared earlier, https://linker.bio/line:hash://sha256/a755a6ac881e977bc32f11536672bfb347cf1b7657446a8a699abb639de59419!/L1014791-L1014797). I can imagine that you, and others, can come up with other handy notations to facilitate data discovery and retrieval.
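
For instance, that line-range notation works directly with curl, retrieving just those seven lines of the provenance log:

curl 'https://linker.bio/line:hash://sha256/a755a6ac881e977bc32f11536672bfb347cf1b7657446a8a699abb639de59419!/L1014791-L1014797'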

E. Thanks for the suggestion, and please let me know who you think would be interested in building this . . . Anne Thessen made a similar request on May 30th 2023; I’ll add yours to the existing GitHub issue (bio-guoda/preston#241, “add landing page for https://linker.bio”) as well.

F. You can ignore the eBird-associated occurrence ids if you’d like by removing the rows with the GBIF-minted dataset key for the (massive!) eBird dataset. I do believe that GBIF has some records beyond eBird, including museums and herbaria.
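
For example, assuming the list in [3] is a tab-separated file named gbif-ids.tsv.gz with the dataset key in the second column (both the filename and the column position are assumptions on my part; adjust them to match the actual file), that filtering could look like:

# EBIRD_KEY is the GBIF dataset key for eBird, as shown on its GBIF dataset page
EBIRD_KEY="<eBird dataset key>"
zcat gbif-ids.tsv.gz | awk -F '\t' -v key="$EBIRD_KEY" '$2 != key' > gbif-ids-no-ebird.tsv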

G. Yes, my experience (especially with indexing host/parasite/etc. associations) is also that some collections get creative in the ways they add their data. And, in my experience, collections usually do a good job of being consistent in applying their own data annotation method. Happy to see some examples to help me better understand what you are talking about.

H. I am a big fan of citing individual specimens and their associated data. In fact, in collaboration with Katja Seltmann of the Big Bee project, we are experimenting with ways to create citable individual specimen data packages to help mobilize all this rich data that is available today. So, we have some evidence to suggest that there is, in fact, a culture (and also perhaps a growing desire) of providing citable identifiers for specimens.

Curious to hear your thoughts, if you have any you are willing to share.

-jorrit

PS As far as https://collections.nmnh.si.edu/ipt/archive.do?r=nmnh_botany goes, according to my records, their DwC-A hasn’t changed since 2021-12-01. You can verify their copy via https://linker.bio/hash://sha256/ae8ecbc794daa3720477b062243e5096651907931ead74b0d5cb3326d21b5277.zip . How fast was that download for you?

Hi, @rdmpage - sorry for the slow reply. Are you sure the link you posted is correct, please?

When I expand the file from the example you gave, it contains a ~2GB multimedia file.

A couple of thoughts spring to mind: a corrupt download (although I’d expect the unzip tool to report that, given the internal checksumming), or perhaps a difference in the unzipping tools we’re using. Can you try another, and what are you using, please?

Hi @trobertson, I was downloading in Safari. I’ll try again using curl and see what happens. I got the same problem for Download: got the occurrences, meta.xml, etc., but no media files.

Hi @rdmpage

That one also works:

tsj442@1027603 Downloads % unzip /Users/tsj442/Downloads/0008890-230918134249559.zip    
Archive:  /Users/tsj442/Downloads/0008890-230918134249559.zip
  inflating: rights.txt              
  inflating: citations.txt           
  inflating: dataset/7bd65a7a-f762-11e1-a439-00145eb45e9a.xml  
  inflating: metadata.xml            
  inflating: meta.xml                
  inflating: occurrence.txt          
  inflating: verbatim.txt            
  inflating: multimedia.txt 
tsj442@1027603 Downloads % wc -l multimedia.txt 
 3119627 multimedia.txt

I suspect this is a quirk of how the expander/unzipper deals with large zip files. I seem to recall the format has ambiguities above a certain size. Can you please try a different expander, or try unzip on the terminal?
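
For example, listing the members without extracting anything will show whether the archive itself contains them (using the filename from my session above):

# list the archive contents; multimedia.txt and verbatim.txt should both appear
unzip -l ~/Downloads/0008890-230918134249559.zip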

For what it is worth, I am able to extract over 6M multimedia records from a version of the USNM original DwC-A via:

unzip -p data/dc/cf/dccf6783c48610f9745399ada7b17f7d0121580a8efdf668466ee6ac3e1ea2e7 multimedia.txt | pv -l | sha256sum
6.55M 0:00:16 [ 387k/s] [                 <=>                                  ]
1eec3d5ccea8f6e2f7be6d31b02f1c26fb2c2086c3c0c23e5d6f002e835d8ac9  -

I don’t have a way to independently verify the data file (casually) associated with the cited query DOI 10.15468/dl.5cskyy, so I’m not sure whether @rdmpage is looking at the same version that @trobertson is looking at. Can you perhaps compare the sha256 hashes of the files you are looking at?

Hi @trobertson

Well, that was interesting. I grabbed the same download again, this time with curl:

curl -L 'https://api.gbif.org/v1/occurrence/download/request/0005866-230918134249559.zip' > 0005866-230918134249559.zip

Then I double-clicked to expand the file (i.e., using the built-in Archive Utility) and everything is there! Strange. I guess I can’t trust Safari to download and expand large downloads from GBIF. Weirdly, I retrieved the zip file Safari had downloaded (and had put in the Bin after expanding). It has the same sha1 checksum as the file I retrieved via curl (d7e38a93d51eda8c84ab4a7cadba4f87268ca0c6). Double-clicking on that zip file gave me an incomplete set of files, but running unzip extracted everything.

Not sure what is going on, but I will obviously need to be more careful with large GBIF downloads in future.


@jhpoelen I think we’ve sorted out the download issue, thanks to @trobertson’s intervention. The problem was at my end, which is perhaps not surprising.

Regarding specimen citations, my model is bibliographic citation, and it is mostly retrospective. Publications cite specimens in various ways, almost never using a URL or PID (although there are recent efforts in the plant world to improve that). Instead there are text strings (sometimes called “material citations”). How do I match those to, say, GBIF records? I regard this as much the same problem as converting bibliographic citations into links (which, for a lot of the modern literature, has been solved via DOIs).

That’s the long-term goal; the immediate goal is to map specimen records in Global Plants to GBIF records. The JSTOR links consistently make use of the barcodes stuck to herbarium sheets, whereas records in GBIF use all sorts of identifiers. It’s a classic case of a paywalled system being well designed but closed, and the open system being free but messy.
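
To make the mismatch concrete: the obvious first pass is to look a barcode up via the GBIF occurrence search API’s catalogNumber filter, but because providers often put the barcode somewhere else entirely, a query like this frequently comes back empty:

# search GBIF occurrences whose catalogNumber matches the barcode printed on the herbarium sheet
curl 'https://api.gbif.org/v1/occurrence/search?catalogNumber=BM000944668&limit=5'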

I see the same effect when downloading a file in Safari: the file downloads, it says it is unzipping it automatically, but it doesn’t unzip verbatim.txt or multimedia.txt.

Rescuing the file from the bin and double-clicking it to open it does extract all the files.


Thanks @MattBlissett, good to know it is not just me.

@MattBlissett @trobertson Some Googling suggests that this is a known issue (see “Prevent Safari from auto extracting ZIP files (since sometimes it only extracts the first member)” on Stack Overflow, and https://github.com/uktrade/stream-zip/pull/42#issuecomment-1562324207), apparently related to the difference between the ZIP32 and ZIP64 formats. This is also consistent with my experience of things working fine for smaller ZIP files from GBIF, but not for larger ones. Maybe for big files there needs to be a warning message that Safari may fail to extract all the data?
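
For anyone curious which flavour a given download uses: the ZIP64 end-of-central-directory record carries the signature PK\x06\x06 near the end of the file, so a rough (admittedly hacky) terminal check is:

# counts ZIP64 end-of-central-directory signatures in the last 64 KB of the archive (0 suggests plain ZIP32)
tail -c 65536 0005866-230918134249559.zip | grep -a -c $'PK\x06\x06'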


Great to hear that you had your needs met, and thanks for the exchange.

For me, it was an exercise in tracing the origin of data:

While I was able to trace the origin of data associated with the USNM endpoint, I am still not quite sure which version of the DwC-A provided by USNM was used in the snapshot produced and loosely associated with the provided GBIF download DOI. And . . . how to independently reproduce this snapshot from the raw ingredients provided by the respective institutions.
