Finding a GBIF occurrence from a specimen code

I’ve (re)released a simple tool to match GBIF occurrences to specimen codes: https://material-examined.herokuapp.com

“Material examined” takes a code such as BMNH 1891.6.13.25 and attempts to find the corresponding GBIF record(s). It is focussed on zoological specimens, but also works with plant specimens that have “barcode” style codes, such as BM000944668. It doesn’t handle more “traditional” plant specimen citations that use collector names and numbers (if that would be useful I could look into that).

The use case is you have a specimen code in a paper or a database (e.g., BOLD or GenBank) and you want to find that specimen in GBIF.
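To give a feel for the lookup, here is a minimal sketch of a naive version. It is not how “Material examined” is implemented. It assumes a code is simply an alphabetic institution acronym followed by a catalogue number, and it builds a query against the public GBIF occurrence search API using its `institutionCode` and `catalogNumber` parameters. The function names `split_specimen_code` and `gbif_search_url` are hypothetical. Real codes are messier: the acronym in a paper often differs from the one GBIF holds, which is part of why a dedicated tool is useful.

```python
import re
from urllib.parse import urlencode

# Public GBIF occurrence search endpoint.
GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def split_specimen_code(code):
    """Split a specimen code into (institution acronym, catalogue number).

    Assumption: the code is letters, an optional separator, then a part
    that starts with a digit, e.g. 'BMNH 1891.6.13.25' or 'BM000944668'.
    """
    match = re.match(r"^([A-Za-z]+)[\s:\-]*(\d.*)$", code.strip())
    if match is None:
        raise ValueError(f"Unrecognised specimen code: {code!r}")
    return match.group(1), match.group(2)


def gbif_search_url(code):
    """Build a GBIF occurrence search URL for a specimen code (naive match)."""
    institution, catalogue = split_specimen_code(code)
    params = {"institutionCode": institution, "catalogNumber": catalogue}
    return GBIF_SEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    print(gbif_search_url("BMNH 1891.6.13.25"))
```

Fetching the printed URL returns JSON with a `results` list. An empty list usually means the acronym or catalogue number is stored differently in GBIF, not that the specimen is absent.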

Work in progress so usual provisos about limitations, etc. Feedback welcome (via GitHub issues, link on the site).

@rdmpage Thanks for building yet another useful prototype!

Which version of GBIF do you support?

How can I cite a specific version of your tool and associated data?

@jhpoelen Thanks!

As I’m sure you already know, the “version” is whatever data GBIF is serving at the time you use the tool, and the version of the tool itself is whatever GitHub commit was deployed at the time.
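To make that concrete, someone using the tool could record those two pieces of information themselves alongside each result. The following is a minimal sketch under stated assumptions, not something the tool currently does: the function `provenance_record` and its field names are hypothetical, and the SHA-256 hash of the response is an extra added here so the exact bytes received can be checked later.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(query_url, response_bytes, tool_commit):
    """Describe one lookup: what was asked, when, by which tool version,
    and a fingerprint of the bytes that came back.

    All field names are illustrative assumptions, not an existing schema.
    """
    return {
        "query_url": query_url,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "tool_commit": tool_commit,
        "content_sha256": hashlib.sha256(response_bytes).hexdigest(),
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    record = provenance_record(
        "https://api.gbif.org/v1/occurrence/search?catalogNumber=1891.6.13.25",
        b'{"results": []}',
        "unknown-commit",
    )
    print(json.dumps(record, indent=2))
```

A record like this does not say which version of the underlying data GBIF was serving, but it does pin down when the query was made and exactly what was returned.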

As I’m sure you already know, the “version” is whatever data GBIF is serving at the time you use the tool

Actually, I don’t know what data is being served; that is why I am asking. Is it the processed data, or the original data? Yesterday’s processed data? Or today’s original data?

version of the tool itself is whatever GitHub commit was deployed at the time

Ok, that makes sense. Are you referring to your API or the API you interface with? Or the API/library used to transform the original data into the processed data? Or the version of the database used to index the processed data?

I realize these are not straightforward questions to answer. But, given that your tools are likely to be used in a scientific context, I want to understand how to cite the results such that the origin (or provenance) of the obtained data products is clear.

For your purposes, what information should this tool provide that would be enough for you to cite it with confidence?

@rdmpage Thank you for taking the time to respond. I much appreciate your willingness to discuss this complex topic of data provenance and citation and their use in digital infrastructures and services.

For your purposes

  1. attribution - potentially thousands, if not more, folks have helped to provide the data, infrastructure, and software that you use to provide a useful service. By providing an accurate and precise citation, you enable ways to resolve and credit those folks (and robots) for their contributions.

  2. debugging - being able to trace the (many?) transformations that the provided data products went through when analyzing suspicious results

  3. error analysis - seen as biodiversity measurement devices, collection management systems and the services that index them are not perfect: biodiversity data is subject to systematic errors (e.g., bias) and measurement errors (e.g., classification errors, machine transcription errors). Understanding the provenance of this data helps work towards error estimation appropriate for the origin of the knowledge.

If you’d like to have more reasons, I would imagine that articles on the benefits of Open Science [0] and/or FAIR principles ([1], note that FAIR does not necessarily mean open) would offer more details on the benefits of citing your sources. How can you access and re-use data products without understanding where they came from?

what information should this tool provide that would be enough for you to cite it with confidence?

Ideally, your tool would provide signed citations [2] to make the citation and their associated content (and origin!) persistent and verifiable.

Disclaimer

I am co-author of [2].

References

[0] Patricia A. Soranno, Kendra S. Cheruvelil, Kevin C. Elliott, Georgina M. Montgomery, It’s Good to Share: Why Environmental Scientists’ Ethics Are Out of Date, BioScience, Volume 65, Issue 1, January 2015, Pages 69–73.

[1] Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016).

[2] Elliott, M. J., Poelen, J. H., & Fortes, J. (2022, in review). Signed Citations: Making Persistent and Verifiable Citations of Digital Scientific Content. https://doi.org/10.31222/osf.io/wycjn

@jhpoelen These are all laudable goals to be sure, but completely out of scope for what is a simple proof of concept.

I wanted to be able to map specimen codes to GBIF records. So I built a tool that does that (for various values of “does”), and made it available for anyone else to use “as is”.

Obviously there is scope for improvement. Anyone wanting to build a better tool will, I’m sure, find your detailed suggestions food for thought.

@rdmpage I think I understand your position and the joy of (near) instant gratification of having an idea, building a prototype, and putting it online.

And the fact that you are unable to cite your sources tells me that, as a community, we may need to consider getting a membership to the “cite-your-sources” gym and work those muscles. Perhaps they have a group discount.

As I see it there’s now a bit of a chasm between citation practices in traditional academics (read a paper, write a paper, cite a paper) and digital data-heavy research (“I found this on the internet on 1 March 2023, good luck finding it 10 years from now.”).

Looking forward to future discussions and learning more about your current and upcoming prototypes!

I’ve created an issue for this topic, “Add citations of sources and make result itself citable” (Issue #3, rdmpage/material-examined on GitHub), to remind me to revisit it.

I’m not unsympathetic to the goals, and in the context of taxonomic databases I am constantly pushing for them to have proper citations of the taxonomic literature, backups of PDFs in Internet Archive, etc.

But at the same time, we need the flexibility to make silly little toys to see what is possible. If they prove useful then we can invest the energy in making them (and their outputs) persist.

Thanks for the discussion, I always learn something new from you.
