1.8. Improvements to citation and visibility for collections (USE)

sabaj · April 28, 2020, 1:36am

Hello everyone. Since 2010, I have maintained a reasonably complete list of annotated codes for historical and modern natural history collections associated with lost and extant specimens of fossil and Recent fishes, amphibians and reptiles.

The list currently includes 3784 codes anchored to about 2033 collections or institutions in 150 countries. A slightly dated version of the list is available here: https://asih.org/standard-symbolic-codes/about-symbolic-codes. An updated list was submitted for print publication in Copeia.

This list was generated by combing the fish and herp literature (1769 publications) for citations of codes and with help from fish/herp taxonomists and collections staff around the world.

Anyway, I can attest that authors cite specimens using a lot of different codes, some stable, many not. Given the way taxonomists work…it might be easier (at least at the beginning) to tie together codes on the “back end” rather than have taxonomists employ standardized codes on the front end (e.g., published works).

dhobern · April 28, 2020, 1:57am

Thanks @sabaj. You are correct that much of the integration will have to be after-the-event cleanup of a muddle of collection codes. Of course, these codes will in many cases not be unique, so we will need to experiment with frameworks like OpenRefine to find ways to simulate the contextual interpretation that usually allows human readers to understand the reference even in the most muddled cases.
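To make the idea of contextual interpretation concrete, here is a minimal Python sketch of one possible approach: score each candidate collection by how much of the surrounding context (taxon group, country) agrees with it, and refuse to guess on a tie. The code "ABC", the two institutions, and the context fields are all invented for illustration; they come from no real registry, and this is not how OpenRefine itself works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    code: str
    institution: str   # hypothetical institution name
    country: str       # hypothetical country code
    taxa: frozenset    # taxonomic groups the collection holds


# Hypothetical registry: the same code used by two unrelated collections.
CANDIDATES = [
    Candidate("ABC", "Example Museum A", "BR", frozenset({"fishes", "herps"})),
    Candidate("ABC", "Example Herbarium B", "DE", frozenset({"plants"})),
]


def resolve(code, taxon=None, country=None, candidates=CANDIDATES):
    """Return the best-matching candidate for a code, or None if unresolved."""
    matches = [c for c in candidates if c.code == code]
    if len(matches) <= 1:
        return matches[0] if matches else None

    def score(c):
        # One point per piece of context that agrees with the candidate.
        return (taxon in c.taxa) + (country == c.country)

    ranked = sorted(matches, key=score, reverse=True)
    # Refuse to guess when the top two candidates tie.
    if score(ranked[0]) == score(ranked[1]):
        return None
    return ranked[0]
```

With no context at all, `resolve("ABC")` returns `None` rather than picking one, which leaves the muddled cases for a human to review.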

agosti · April 29, 2020, 10:41am

I agree with @sabaj that a way to go is to collect what users of collection codes do and then create some sort of lookup table. Here is a glimpse into the collectionCodes we extract from publications. Be aware that it is dirty data, not least because botanists use single letters as collection codes, which makes them very difficult to mine. But this might be another starting point.
For each collection code we have the treatment and the publication from which we extracted the data.
Here is a CSV output: http://tb.plazi.org/GgServer/srsStats/stats?outputFields=colls.code+colls.name&groupingFields=colls.code+colls.name&format=CSV&separator=%2C and you can get more by using the Plazi stats at http://tb.plazi.org/GgServer/srsStats
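As a rough illustration of turning such a CSV into a lookup table, here is a minimal Python sketch. The column names `code` and `name` are assumptions (the real export may label its columns differently), the sample rows are made up for illustration and are not actual Plazi output, and upper-casing the code as a matching key is a design choice that could wrongly merge codes differing only by case.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only; not real output from the Plazi stats service.
SAMPLE = """code,name
MNRJ,"Museu Nacional, Universidade Federal do Rio de Janeiro"
mnrj ,Museu Nacional
KUI,University of Kansas Biodiversity Institute
"""


def build_lookup(csv_text, code_field="code", name_field="name"):
    """Map each normalized code to the set of collection names seen with it."""
    lookup = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed normalization: trim whitespace and upper-case the code.
        code = (row.get(code_field) or "").strip().upper()
        name = (row.get(name_field) or "").strip()
        if code and name:
            lookup[code].add(name)
    return dict(lookup)


def ambiguous(lookup):
    """Codes attached to more than one name, i.e. the ones needing review."""
    return sorted(code for code, names in lookup.items() if len(names) > 1)
```

Calling `ambiguous(build_lookup(SAMPLE))` flags the codes that carry more than one name variant, which is where the manual cleanup would start.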

gdadade · April 29, 2020, 12:47pm

GGBN already uses stable identifiers of specimens (plus the traditional triplet) to reference tissue and DNA material back to specimens, and some of its members also already provide stable identifiers for the DNA and tissue samples. Adding those resources to publications would improve the visibility of (biobank) collections, since it will not always be possible to deposit a voucher specimen. Very often the physical tissue and DNA samples are simply forgotten when the material used is cited, as are biobank collections in general. A centralised catalogue with identifiers for these important collections, plus identifiers for the DNA and tissue samples, will help to make them more visible and enable traceability and transparency.

sabaj · April 30, 2020, 1:13am

A note about documenting tissues in publications…it is often done poorly for fishes. At least 4 identifiers might be associated with a fish tissue:

Collection code: ANSP
Specimen catalog number: 123
Specimen tag number: 456 (fishes generally kept in lots, so one lot might contain multiple tissue vouchers)
Tissue catalog number: 789

It is not uncommon for authors to report only the specimen tag number or the tissue catalog number (the third or fourth identifier above) without specifying which one it is.
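One way to avoid that ambiguity is to keep the four identifiers as separately named fields and label each one when it is cited. A minimal Python sketch follows; the class name, field names, and citation format are my own assumptions, not an existing standard, and the values are the placeholder numbers from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TissueCitation:
    collection_code: str
    specimen_catalog_number: Optional[str] = None
    specimen_tag_number: Optional[str] = None
    tissue_catalog_number: Optional[str] = None

    def cite(self):
        """Render a citation that names each identifier it includes."""
        parts = [
            ("specimen catalog", self.specimen_catalog_number),
            ("specimen tag", self.specimen_tag_number),
            ("tissue catalog", self.tissue_catalog_number),
        ]
        # Skip identifiers that were not recorded instead of guessing them.
        labeled = [f"{label} {value}" for label, value in parts if value]
        return f"{self.collection_code}: " + "; ".join(labeled)


full = TissueCitation("ANSP", "123", "456", "789")
print(full.cite())
```

Even when only one number is known, the label travels with it, so a reader can tell a tag number from a tissue catalog number.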

Also…when tissue subsamples are sent from one collection to another, the receiving collection might assign its own Tissue Catalog Number…which adds two more identifiers: “Code + catalog number for new collection with tissue”. One might ask, why doesn’t the receiving collection simply retain the original identifiers assigned by the source collection? Most of the time they do…and that is how unused and leftover tissues often get lost in freezers. Better for the receiving collection to newly catalog the tissue (alongside data on its source) so that the receiving collection may keep track of it.

Finally…whereas most fish tissue/DNA collections employ the same code for specimens and tissues (e.g., ANSP)…a few collections use separate codes, such as:
Museu Nacional, Universidade Federal do Rio de Janeiro uses MNRJ (fishes, herps), MNTI (fish tissue collection), MNLM (vertebrate DNA extract collection)
Museu de Zoologia da Universidade de São Paulo uses MZUSP (fishes, herps), CTMZ (tissue collection) or MZict (fish tissues)
Instituto de Investigación de Recursos Biológicos Alexander von Humboldt uses IAvH-CB (Colecciones Biológicas), IAvH-CT (tissue collection), IAvH-P (fishes), IAvH-Am (amphibians), IAvH-R (reptiles) - I prefer this b/c the full code preserves the same institutional prefix for each collection
Fish Genetics and Biotechnology laboratory, Indian Council of Agricultural Research uses NKGMF (fish specimens), VIZNK (tissue samples)
University of Kansas Biodiversity Institute uses KUI (fishes), KUIT (fish tissues), KUH (herps), KUVP (fossils)
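To show why a shared institutional prefix is convenient, here is a small Python sketch using two of the institutions listed above. Codes without a common prefix need an explicit table to be tied back to their institution, while IAvH-style codes can simply be split. The function names and the "-" separator handling are assumptions for illustration.

```python
# Codes grouped by institution, taken from the list in this post.
CODES_BY_INSTITUTION = {
    "Museu Nacional, Universidade Federal do Rio de Janeiro": {
        "MNRJ": "fishes, herps",
        "MNTI": "fish tissue collection",
        "MNLM": "vertebrate DNA extract collection",
    },
    "University of Kansas Biodiversity Institute": {
        "KUI": "fishes",
        "KUIT": "fish tissues",
        "KUH": "herps",
        "KUVP": "fossils",
    },
}

# Invert once so any collection code resolves to its parent institution.
INSTITUTION_BY_CODE = {
    code: institution
    for institution, codes in CODES_BY_INSTITUTION.items()
    for code in codes
}


def same_institution(code_a, code_b):
    """True when both codes are known and share a parent institution."""
    a = INSTITUTION_BY_CODE.get(code_a)
    b = INSTITUTION_BY_CODE.get(code_b)
    return a is not None and a == b


def split_prefixed(code, sep="-"):
    """Split a code like 'IAvH-CT' into (institution prefix, collection suffix)."""
    prefix, found, suffix = code.partition(sep)
    return (prefix, suffix) if found else (code, None)
```

Here `same_institution("MNRJ", "MNTI")` only works because the table was filled in by hand, whereas `split_prefixed("IAvH-CT")` recovers the institution from the code itself.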

All this is to say…tracking tissues (and DNA) for fish collections adds a significant amount of complexity. Best to develop a stable list of institutions first…then the collections contained within them.

gdadade · April 30, 2020, 6:42am

That is very true, and mainly because the sending institution has no proper tissue catalog numbers. Mostly they use the same number for both specimen and tissue, which is a big problem we are working on in GGBN. So for the data portal they must make the tissue catalog numbers unique, for example by adding a prefix or suffix. Also, the same DNA number is often reused when a sample is extracted again, so no one can ever track back where the sequences are really coming from.
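As a rough illustration of the prefix/suffix idea, here is a minimal Python sketch. It is not GGBN's actual scheme; the "kind:number" format and the running ".2" suffix for repeated extractions are assumptions made up for this example.

```python
from collections import Counter


def portal_ids(records):
    """Return one unique identifier per (kind, number) record.

    kind is a label such as 'specimen', 'tissue' or 'DNA'. Repeated
    (kind, number) pairs, e.g. re-extractions, get a running suffix.
    """
    seen = Counter()
    ids = []
    for kind, number in records:
        seen[(kind, number)] += 1
        n = seen[(kind, number)]
        # The kind prefix separates a specimen and a tissue sharing a number.
        base = f"{kind}:{number}"
        ids.append(base if n == 1 else f"{base}.{n}")
    return ids


print(portal_ids([("specimen", "123"), ("tissue", "123"),
                  ("DNA", "55"), ("DNA", "55")]))
```

A real scheme would also have to guard against numbers that already contain the suffix pattern, which this sketch does not attempt.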

Within the EU-funded SYNTHESYS+ project we are currently working on a solution to enable a network of trusted biobanks with standardized exchange of material, like the botanical gardens community does with IPEN. Their IPEN numbers work perfectly fine for their purposes. We hope to establish something similar for GGBN and hence for all biobanks.

It is the same problem in any organism group, and it becomes even worse when we talk about parasites or environmental samples. In such cases it is often forgotten that the fish is also a fish, and not only an environment or host.

I don’t know if you have ever tried searching for fish samples in GGBN. Vertebrates in general are the group currently best covered taxonomically in GGBN, so it might be worth a try. We are always looking for constructive feedback.

pzermoglio · April 30, 2020, 8:18am

From @WUlate in this Spanish thread

In the same way that researchers deposit sequences in GenBank (or elsewhere) in order to publish, the Catalogue could become part of the best practices of any scientific publication if:

we start with those already here (early adopters),
we prepare directives and agreements (SYNTHESYS+, GBIF, Nodes, etc.) that recommend the adoption of the Catalogue,
we announce the initiative to the world and gain advocates in the organizations that manage the collections,
but we offer useful products soon (at first maybe small and focused) that take advantage of the value of the Catalogue and contribute to real situations (later, some larger and more complex ones).
