Searching on Catalogue Number

jhpoelen · February 2, 2022, 10:10pm

For what it is worth:

@ekrimmel I read your request and your related desire to look for patterns in biodiversity data. I can see how GBIF’s powerful search engine can help find matching data in their current snapshot of published biodiversity data.

I also realize that GBIF changes all the time, so I did a similar search using Preston (sort of like a git for biodiversity data). For the 2022-02-01 versioned collection (i.e., hash://sha256/4000d2a1af6da5b46f374038d884f91768782a1905d4a75fff3c8c3bb6629913) of Darwin Core archives registered with GBIF and iDigBio, restricted to archives with URLs containing lacm, I was able to produce the following results in about 2 minutes**:

$ time preston cat --remote https://deeplinker.bio hash://sha256/4000d2a1af6da5b46f374038d884f91768782a1905d4a75fff3c8c3bb6629913 \
| grep hasVersion\
 | grep lacm\
 | preston grep --remote https://deeplinker.bio -l tsv -o "LACMIP\s*[0-9]+\.*[0-9]*"\
 | grep value\
 > lacm-numbers.tsv
...
real	2m2.078s
user  2m13.920s
sys	0m3.503s

Note how I used the regex pattern LACMIP\s*[0-9]+\.*[0-9]* to pick up the values and their exact locations in the versioned archives.
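To see what this pattern picks up, here is a minimal sketch that runs it over a few made-up lines with plain grep instead of preston grep. The sample lines and the function name extract_lacmip are hypothetical, and \s is written as the POSIX class [[:space:]] so the pattern also works with grep implementations that lack \s.

```shell
# Sketch only: the sample lines below are invented, not taken from the archives.
extract_lacmip() {
  # Same pattern as in the recipe, with \s replaced by [[:space:]].
  grep -E -o 'LACMIP[[:space:]]*[0-9]+\.*[0-9]*'
}

printf '%s\n' \
  'catalogNumber: LACMIP 2533.12' \
  'loc. LACMIP 66. Pliocene' \
  'LACMIP305 and LACMIP 10017.1' \
  'no match here' \
  | extract_lacmip
```

Because \.* accepts a dot even when no digits follow, a match such as "LACMIP 66." keeps the sentence-ending period. That is worth keeping in mind when reading counts derived from these values.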

With this, you can also do fun things like listing the collection events with the most specimen lots:

$ cat lacm-numbers.tsv\
 | cut -f3\
 | grep -E -o "LACMIP [0-9]+[\.]{0,1}"\
 | sort\
 | uniq -c\
 | sort -nr\
 > lacm-numbers-freq.txt
$ head lacm-numbers-freq.txt
  51246 LACMIP 2533.
  13071 LACMIP 2533
   8655 LACMIP 66.
   6579 LACMIP 305.
   6252 LACMIP 305
   6129 LACMIP 66
   3954 LACMIP 23225.
   3888 LACMIP 17898.
   3840 LACMIP 435.
   3804 LACMIP 260.
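The frequency recipe keeps at most one trailing dot, so "LACMIP 2533." and "LACMIP 2533" are tallied as separate entries. If a single count per number is preferred, one possible variation is to drop the dot before counting. This is only a sketch: count_by_locality is a hypothetical helper name, and it assumes, as in the recipe above, that the matched value sits in the third tab-separated column.

```shell
# Sketch: tally "LACMIP N." and "LACMIP N" together under "LACMIP N".
count_by_locality() {
  cut -f3 \
    | grep -E -o 'LACMIP [0-9]+' \
    | sort \
    | uniq -c \
    | sort -nr
}

# Usage (assumes lacm-numbers.tsv was produced by the first recipe):
#   count_by_locality < lacm-numbers.tsv | head
```

Whether merging is appropriate depends on what the dotted and undotted forms mean in the source records, so the separate counts above remain the more literal view of the data.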

Or get a list of unique suspected catalog patterns:

$ cat lacm-numbers.tsv | cut -f3 | sort | uniq  > lacm-numbers-sorted.txt
$ head lacm-numbers-sorted.txt
LACMIP 1
LACMIP 10
LACMIP 100
LACMIP 100.1
LACMIP 10016
LACMIP 10016.1
LACMIP 10017
LACMIP 10017.1
LACMIP 10017.10
LACMIP 10017.11

Looks like LACMIP 2533 and its related decimals (LACMIP 2533.[something]) were mentioned quite a bit in the lacm records...

The neat thing about this is that I can cite and archive the exact source data used to produce the results, and reproduce them without having to rely on some web service.
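One way to picture what citing exact source data buys you: given a local copy of some content and its hash://sha256/... identifier, anyone can check that the two match. The sketch below assumes the identifier is the hex SHA-256 digest of the content's bytes. verify_hash_uri is a hypothetical helper, not a Preston command, and it uses sha256sum or shasum, whichever is available.

```shell
# Sketch: check that a local file matches a hash://sha256/... identifier.
# Assumption: the identifier is the hex SHA-256 digest of the file's bytes.
verify_hash_uri() {
  file=$1
  uri=$2
  expected=${uri#hash://sha256/}   # strip the scheme prefix, keep the hex digest
  if command -v sha256sum >/dev/null 2>&1; then
    actual=$(sha256sum "$file" | cut -d' ' -f1)
  else
    actual=$(shasum -a 256 "$file" | cut -d' ' -f1)
  fi
  [ "$actual" = "$expected" ]      # exit status 0 only when the digests agree
}

# Usage (file name is a placeholder):
#   verify_hash_uri some-archive.zip hash://sha256/<digest> && echo "content matches"
```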

I do realize that the UI for Preston is a bit basic (i.e., command line only), but it offers some quite powerful discovery techniques for those versed in the command line / programming. I imagine that a UI could be developed on top of these versioned datasets to increase the reach of these tools.

For now, GBIF’s incredible tools definitely offer a better user experience, as long as you don’t worry about versioning or reproducing results 5-10 years from now.

You should be able to exactly reproduce the attached results below with the recipes provided above.

Curious to hear your thoughts if you have any.

Hope all is well,
-jorrit

lacm-numbers-sorted.txt (1.5 MB)
lacm-numbers-freq.txt (182.1 KB)

** Initial speed may vary with internet connection, but subsequent runs use a locally stored copy of original data.
