Searching on Catalogue Number

ekrimmel · January 27, 2022, 8:44pm

Is there a way to search the Catalogue number field in the GBIF occurrence user interface using predicates, e.g. “like”? For example, I want to see all of the specimen lots associated with a collecting event, and they have very helpful catalog numbers in the format of “LACMIP 42801.1,” “LACMIP 42801.2,” etc. I would love to be able to search for “LACMIP 42801” and retrieve all of them. I can see how to accomplish this on the occurrence API (though only because the default search behavior there is less strict, not because I can figure out how to search with a predicate), but not in the user interface. Am I missing something, or is this type of search not possible in the user interface? Thank you for any suggestions!

Note that this is essentially the same as Rich’s earlier, unanswered question here.

mhoefft · January 28, 2022, 11:07am

Hi,
Short answer: no, it is not possible. But it makes perfect sense why it would be useful, and we are working to make it possible.

Our API version 1 has been around for many years (10 or more I believe).
The occurrence part of that API supports download and search.

For downloads, we support searching with predicates, and those predicates allow you to do a LIKE filter, such as catalogNumber=LACMIP 42801*. When you write a predicate, you decide whether it should be an EQUAL filter or a LIKE filter.
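As a rough illustration of such a LIKE predicate, here is a minimal sketch of a download request body. The `creator` and e-mail values are placeholders, and the field names follow my understanding of the occurrence download API, so check them against the current API documentation before relying on them.

```shell
# Sketch of a download request body with a LIKE predicate on catalogNumber.
# "your_gbif_username" and "you@example.org" are placeholders, not real accounts.
cat > like-predicate.json <<'EOF'
{
  "creator": "your_gbif_username",
  "notificationAddresses": ["you@example.org"],
  "sendNotification": true,
  "format": "SIMPLE_CSV",
  "predicate": {
    "type": "like",
    "key": "CATALOG_NUMBER",
    "value": "LACMIP 42801*"
  }
}
EOF

# Submitting the request needs a GBIF account; uncomment and fill in credentials:
# curl --include --user your_gbif_username:YOUR_PASSWORD \
#   --header "Content-Type: application/json" \
#   --data @like-predicate.json \
#   https://api.gbif.org/v1/occurrence/download/request
```

Changing `"type": "like"` to `"type": "equals"` would turn this into an EQUAL filter, which would then only match the literal value including the asterisk.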

But the search API does not support that. It is always EQUAL. So searching catalogNumber for * will return occurrences that have the literal value *. There are currently 3 of those:
https://api.gbif.org/v1/occurrence/search?catalogNumber=*

The endpoint https://api.gbif.org/v1/occurrence/search/catalogNumber?q=LACMIP%2042801&limit=100 that you refer to is a suggest (autocomplete) service that we use to suggest values to users as they type in the catalogNumber filter. So your best option is to use the UI, type LACMIP 42801, and then select all the available values. I can see it only suggests the top 10 or so. Let me change that so at least you get more suggestions to choose from.

We are also rewriting our search so that it supports wildcards on fields such as catalogNumber.

Hope that helps a bit

UPDATE: I have deployed a fix that includes more suggestions (50). It isn’t a general fix, but at least it helps in many cases, such as this one. You can now (with some effort) select all those values from the suggestions:
example Search

ekrimmel · January 28, 2022, 3:23pm

Thank you so much for the immediate and thorough reply, and for the interim solution of displaying more than 10 suggested values in the UI! It sounds like the API v2 and rewritten UI search will be very useful.

jhpoelen · February 2, 2022, 10:10pm

For what it is worth:

@ekrimmel I read your request, and related to your desire to look for patterns in biodiversity data. I can see how GBIF’s powerful search engine can help to find matching data in their current snapshot of published biodiversity data.

I also realize that GBIF changes all the time, so I did a similar search using Preston (sort of like a git for biodiversity data). In about 2 minutes**, I was able to produce the following results for the 2022-02-01 versioned collection (i.e. hash://sha256/4000d2a1af6da5b46f374038d884f91768782a1905d4a75fff3c8c3bb6629913) of Darwin Core archives registered with GBIF and iDigBio with URLs containing lacm:

$ time preston cat --remote https://deeplinker.bio hash://sha256/4000d2a1af6da5b46f374038d884f91768782a1905d4a75fff3c8c3bb6629913 \
| grep hasVersion\
 | grep lacm\
 | preston grep --remote https://deeplinker.bio -l tsv -o "LACMIP\s*[0-9]+\.*[0-9]*"\
 | grep value\
 > lacm-numbers.tsv
...
real	2m2.078s
user  2m13.920s
sys	0m3.503s

Note how I used the regex pattern LACMIP\s*[0-9]+\.*[0-9]* to pick up the values and their exact locations in the versioned archives.
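To see what that pattern picks up, it can be tried on a few sample strings with plain grep. This is only a sketch: the sample lines are made up, and it assumes `grep -E` treats this simple pattern the same way as the regex engine behind `preston grep`.

```shell
# Hypothetical sample lines standing in for text found in an archive.
sample='catalog LACMIP 42801.1 and LACMIP 42801.2
also LACMIP 2533 here
not a match: LACM 12345'

# [[:space:]] is the portable POSIX spelling of \s; -o prints only the matched text.
matches=$(printf '%s\n' "$sample" | grep -E -o 'LACMIP[[:space:]]*[0-9]+\.*[0-9]*')
printf '%s\n' "$matches"
# prints:
# LACMIP 42801.1
# LACMIP 42801.2
# LACMIP 2533
```

Because the decimal part is optional, the pattern matches both the bare collecting-event number and the numbered specimen lots.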

With this, you can also do fun things like listing the collecting events with the most specimen lots:

$ cat lacm-numbers.tsv\
 | cut -f3\
 | grep -E -o "LACMIP [0-9]+[\.]{0,1}"\
 | sort\
 | uniq -c\
 | sort -nr\
 > lacm-numbers-freq.txt
$ head lacm-numbers-freq.txt
  51246 LACMIP 2533.
  13071 LACMIP 2533
   8655 LACMIP 66.
   6579 LACMIP 305.
   6252 LACMIP 305
   6129 LACMIP 66
   3954 LACMIP 23225.
   3888 LACMIP 17898.
   3840 LACMIP 435.
   3804 LACMIP 260.

Or get a list of unique suspected catalog patterns:

$ cat lacm-numbers.tsv | cut -f3 | sort | uniq  > lacm-numbers-sorted.txt
$ head lacm-numbers-sorted.txt
LACMIP 1
LACMIP 10
LACMIP 100
LACMIP 100.1
LACMIP 10016
LACMIP 10016.1
LACMIP 10017
LACMIP 10017.1
LACMIP 10017.10
LACMIP 10017.11

Looks like LACMIP 2533 and its related decimals (LACMIP 2533.[something]) were mentioned quite a bit in the lacm records.

The neat thing about this is that I can cite and archive the exact source data used to produce the result, and reproduce the result without having to rely on some web service.

I do realize that the UI for Preston is a bit basic (i.e., command-line only), but it offers some quite powerful discovery techniques for those versed in the command line and programming. I imagine that a UI can be developed on top of these versioned datasets to increase the reach of these tools.

For now, GBIF’s incredible tools definitely offer a better user experience, as long as you don’t worry about versioning or reproducing results 5-10 years from now.

You should be able to exactly reproduce the results attached below with the recipes provided above.

Curious to hear your thoughts if you have any.

Hope all is well,
-jorrit

lacm-numbers-sorted.txt (1.5 MB)
lacm-numbers-freq.txt (182.1 KB)

** Initial speed may vary with internet connection, but subsequent runs use a locally stored copy of original data.

jhpoelen · February 2, 2022, 10:21pm

Note that a 2020-11-01 version (i.e. hash://sha256/d98e3bd2bc717bc11a3338cd43fc488bde1d96cb42d8cbe8301f0d9f9753007f) yielded the following results:

time preston cat --remote https://deeplinker.bio hash://sha256/d98e3bd2bc717bc11a3338cd43fc488bde1d96cb42d8cbe8301f0d9f9753007f \
| grep hasVersion\
 | grep lacm\
 | preston grep --remote https://deeplinker.bio -l tsv -o "LACMIP\s*[0-9]+\.*[0-9]*"\
 | grep value\
 | cut -f3\
 | sort\
 | uniq -c\
 | sort -nr\
 | head
... 
  13071 LACMIP 2533
   6252 LACMIP 305
   6129 LACMIP 66
   2880 LACMIP 183
   2094 LACMIP 142
   1968 LACMIP 23225
   1914 LACMIP 435
   1878 LACMIP 130
   1752 LACMIP 260
   1722 LACMIP 17649

real	2m1.479s
user	2m16.032s
sys	0m2.942s

This seems to suggest that the specimen lot numbers might have been introduced after 2020-11-01, because no “.” numbers are seen.
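One way to check this more directly would be to list the catalog numbers that appear only in the later version. Here is a minimal sketch, assuming the matches from each version were saved as a sorted, unique list. The file names and the tiny sample contents are hypothetical stand-ins, not real results.

```shell
# Hypothetical miniature stand-ins for the sorted, unique lists from each version.
printf '%s\n' 'LACMIP 2533' 'LACMIP 305' > numbers-2020-11-01.txt
printf '%s\n' 'LACMIP 2533' 'LACMIP 2533.1' 'LACMIP 305' 'LACMIP 305.2' > numbers-2022-02-01.txt

# comm expects sorted input; -13 suppresses lines unique to the first file and
# lines common to both, leaving only the numbers new in the second file.
new_numbers=$(LC_ALL=C comm -13 numbers-2020-11-01.txt numbers-2022-02-01.txt)
printf '%s\n' "$new_numbers"
```

With real data, the two input files would come from running the same `cut -f3 | sort | uniq` recipe against each versioned collection, using the same locale for `sort` and `comm`.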

ekrimmel · February 4, 2022, 4:24pm

Hey thanks, Jorrit! Great to see another type of solution.

Eventually, it would be awesome if GBIF’s download DOIs could incorporate Preston’s ability to actually capture a version of the data so that use of individual specimens could be tracked, versus tracking use at a dataset level as the current download DOI system enables. I know one could do this right now with Preston and hashing–thanks, Jorrit!–but there is just a lot to be said for an intuitive UI and an integrated data pipeline/system–thanks, GBIF!

One use case for why we need a better way to search on catalog numbers is that often a collections manager will want to direct a researcher to a subset of specimen records, and catalog number may be a really useful way to circumscribe that subset. In this case, accessing data via the command line won’t cut it; the researcher is almost always in an exploratory phase, and a UI that facilitates visualization and discovery beyond the known set of records is essential. Maybe this will evolve as new generations of researchers approach their work with more programmatic skills. But for now, the ability to email a researcher a direct link to search results in a UI is an amazing time saver, as the digital equivalent of physically pulling out a selection of specimens.

A second use case is that a collections manager wants to have a researcher cite specimens in a publication (e.g. here). In this case, a solution a la what Preston can do would be ideal. For many collections, the dataset activity tracking from GBIF is meeting an important need by demonstrating the value of digitized, mobilized data. Using multiple modalities (e.g. GBIF dataset activity + a separate list of dataset citations not mediated by GBIF) to track activity doesn’t sound difficult, but in reality, for most collections staff, it’s just one thing too many. Again, lots of room for this to evolve!

system · March 7, 2022, 2:25am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
