For what it is worth:
@ekrimmel I read your request and your wish to look for patterns in biodiversity data. I can see how GBIF’s powerful search engine can help find matching data in their current snapshot of published biodiversity data.
I also realize that GBIF changes all the time, so I did a similar search using Preston (sort of like git for biodiversity data). In about 2 minutes**, I was able to produce the following results for the 2022-02-01 versioned collection (i.e. hash://sha256/4000d2a1af6da5b46f374038d884f91768782a1905d4a75fff3c8c3bb6629913) of Darwin Core archives registered with GBIF and iDigBio whose URLs contain lacm:
$ time preston cat --remote https://deeplinker.bio hash://sha256/4000d2a1af6da5b46f374038d884f91768782a1905d4a75fff3c8c3bb6629913 \
| grep hasVersion \
| grep lacm \
| preston grep --remote https://deeplinker.bio -l tsv -o "LACMIP\s*[0-9]+\.*[0-9]*" \
| grep value \
> lacm-numbers.tsv
...
real 2m2.078s
user 2m13.920s
sys 0m3.503s
Note how I used the regex pattern LACMIP\s*[0-9]+\.*[0-9]* to pick up the values and their exact locations in the versioned archives.
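To get a feel for what that pattern picks up, here is a small sketch that runs it over a few made-up lines with plain grep instead of Preston. The helper name extract_lacmip and the sample text are hypothetical, and I am assuming the pattern behaves the same way in grep -E as it does in preston grep; \s is written as [[:space:]] here so it does not depend on GNU extensions.

```shell
# Hypothetical helper: print every substring that matches the LACMIP pattern.
# [[:space:]] stands in for \s from the original pattern.
extract_lacmip() {
  grep -E -o 'LACMIP[[:space:]]*[0-9]+\.*[0-9]*'
}

# Made-up sample lines, not taken from the actual archives.
printf '%s\n' \
  'Lot LACMIP 2533.12 from locality LACMIP 66.' \
  'no match here' \
  'LACMIP2533' \
  | extract_lacmip
# prints:
# LACMIP 2533.12
# LACMIP 66.
# LACMIP2533
```

Note that a number followed by a sentence-ending period (LACMIP 66.) is matched including the dot, because \.* accepts dots even when no decimals follow.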
With this, you can also do fun things like listing the collection events with the most specimen lots:
$ cat lacm-numbers.tsv \
| cut -f3 \
| grep -E -o "LACMIP [0-9]+[\.]{0,1}" \
| sort \
| uniq -c \
| sort -nr \
> lacm-numbers-freq.txt
$ head lacm-numbers-freq.txt
51246 LACMIP 2533.
13071 LACMIP 2533
8655 LACMIP 66.
6579 LACMIP 305.
6252 LACMIP 305
6129 LACMIP 66
3954 LACMIP 23225.
3888 LACMIP 17898.
3840 LACMIP 435.
3804 LACMIP 260.
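The list above counts LACMIP 2533. and LACMIP 2533 separately, because the grep pattern keeps an optional trailing dot. If you assume both forms refer to the same number (that is my assumption, not something the data guarantees), a rough sketch for merging them could look like this. The helper name merge_counts is hypothetical; the sample counts are copied from the list above.

```shell
# Hypothetical helper: read "count LACMIP number" lines, drop a trailing dot
# from the number, and sum the counts per number (highest total first).
merge_counts() {
  awk '{ n = $3; sub(/\.$/, "", n); total[$2 " " n] += $1 }
       END { for (k in total) print total[k], k }' \
  | sort -nr
}

# Four rows taken from the frequency list above.
printf '%s\n' \
  '51246 LACMIP 2533.' \
  '13071 LACMIP 2533' \
  '8655 LACMIP 66.' \
  '6129 LACMIP 66' \
  | merge_counts
# prints:
# 64317 LACMIP 2533
# 14784 LACMIP 66
```

On the full file this would be something like: cat lacm-numbers-freq.txt | merge_counts | head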
Or get a list of unique suspected catalog patterns:
$ cat lacm-numbers.tsv | cut -f3 | sort | uniq > lacm-numbers-sorted.txt
$ head lacm-numbers-sorted.txt
LACMIP 1
LACMIP 10
LACMIP 100
LACMIP 100.1
LACMIP 10016
LACMIP 10016.1
LACMIP 10017
LACMIP 10017.1
LACMIP 10017.10
LACMIP 10017.11
Looks like LACMIP 2533 and related decimals LACMIP 2533.[something] were mentioned quite a bit in the lacm records...
The neat thing about this is that I can cite and archive the exact source data used to produce the results, and reproduce them without having to rely on some web service.
I do realize that the UI for Preston is a bit basic (e.g., command-line), but it offers some quite powerful discovery techniques for those versed in the command line / programming. I imagine that a UI could be developed on top of these versioned datasets to increase the reach of these tools.
For now, GBIF’s incredible tools definitely offer a better user experience, as long as you don’t worry about versioning or reproducing results 5-10 years from now.
You should be able to exactly reproduce the attached results below with the recipes provided above.
Curious to hear your thoughts if you have any.
Hope all is well,
-jorrit
lacm-numbers-sorted.txt (1.5 MB)
lacm-numbers-freq.txt (182.1 KB)
** Initial speed may vary with internet connection, but subsequent runs use a locally stored copy of original data.