Collections catalogue (GRBio)

There was an announcement about the first phase of work towards a collections catalogue in GBIF today. This started with the migration of the GRBio DB to GBIF.

It was no great surprise to me that some were quick to point out the issues on Twitter; we know well that there is much to do.

There have been a lot of people asking for the GRBio catalogue to be searchable again after it went offline. This first phase of work is simply a response to that call and to demonstrate a faithful migration of verbatim data upon which to build.

The next phases I foresee are as follows, and I welcome feedback and ideas on them:

  1. Reassignment of the DNS ownership (underway now) so that the many existing identifiers (CoolURIs etc) used in various databases resolve again
  2. Synchronisation with the key underlying sources (e.g. Index Herbariorum) so that data are consistent
  3. Providing clear citation guidelines for collections built around the existing DOI based citation model
  4. Linking occurrence data (digital specimen records)
  5. Linking of collection citations (treatment articles, publications using data downloads etc) to the collections
  6. Enabling registration of (N)CD style metadata documents

This blog by Rod Page mentioned in the Twitter thread remains relevant today. Clearly we (I) are thinking in terms of linkability and in particular to help connect to the DataCite citation graph which has been well received for dataset citation tracking.

What is unknown to me is how complete and how widely used the Wikidata store is for this class of content. Has anyone done an analysis of content availability for GRBio vs Wikidata? Does Wikidata support DOIs (and the associated metadata kernel) natively? A solution where we help complete Wikidata is certainly not out of scope.

On the Twitter thread GRBio is referred to as being a failure. It is certainly incomplete, but a quick query suggests there are more than 42 million occurrence records in GBIF.org making use of GRBio identifiers (biocol, cooluri, bci and grbio) in the Darwin Core fields of collectionCode, collectionID, institutionCode, institutionID, and datasetID. These are all currently dead links, but shortly they will resolve again, which is part of the motivation for getting this up. Evidently data publishers have at some point wanted to link to a catalogue.
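For readers who want to try a similar check on their own data, here is a minimal sketch of one way to flag records that use GRBio-style identifiers. It is not the query that was actually run against GBIF.org: it assumes that a plain substring match on the four tokens named above is good enough, which is crude and will produce false positives (for example any value that happens to contain "bci"). The function names and the sample identifier values are hypothetical.

```python
# Substrings named in the post as markers of GRBio-era identifiers.
# Assumption: a case-insensitive substring match is an acceptable first pass.
GRBIO_MARKERS = ("biocol", "cooluri", "bci", "grbio")

# Darwin Core fields the post says were searched.
DWC_FIELDS = (
    "collectionCode",
    "collectionID",
    "institutionCode",
    "institutionID",
    "datasetID",
)


def grbio_fields(record):
    """Return the Darwin Core fields of `record` whose value contains a marker."""
    hits = []
    for field in DWC_FIELDS:
        # Missing or None values are treated as empty strings.
        value = str(record.get(field) or "").lower()
        if any(marker in value for marker in GRBIO_MARKERS):
            hits.append(field)
    return hits


def count_grbio_records(records):
    """Count records with at least one field that looks like a GRBio identifier."""
    return sum(1 for record in records if grbio_fields(record))


# Made-up records for illustration only; the identifier values are placeholders.
sample_records = [
    {"institutionCode": "AMNH", "collectionID": "urn:lsid:biocol.org:col:00000"},
    {"institutionCode": "AMNH", "collectionCode": "Birds"},
    {"institutionID": "http://grbio.org/cool/xxxx-xxxx", "datasetID": None},
]
```

A stricter version would match on full URI or LSID prefixes rather than bare tokens, at the cost of missing identifiers that publishers wrote in non-standard forms.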


@trobertson I think there were two big issues with GrBio, and these issues are equally relevant for the GBIF restoration of that resource. The issues are:

  1. Who is this for?

  2. Who will maintain it?

Who is it for?

The core task of GrBio is to answer the question “what collection has this code?” The audience for this question is fairly small; there’s not a lot to be gained from saying that “AMNH” is the American Museum of Natural History.

However, if we connect the AMNH to a bunch of additional identifiers outside our domain then things get interesting. For example, we can link to Twitter and Instagram, to identifiers used by funding agencies, and more. I built a simple tool to look up institution codes in Wikidata: https://empty-opal.glitch.me/?q=AMNH This is pretty basic but already it feels richer than GrBio.

Some Wikidata records include collection size, membership in BHL, etc., so we can start to do some analysis of collection size, and institutional participation in digitisation efforts. Identifiers such as grid and ringgold enable links to be made between institutional and individual identifiers (the sort of thing @dshorthouse is doing with https://bloodhound-tracker.net ).

Who will maintain it?

Schindel et al. 2016 (doi:10.3897/bdj.4.e10293) appealed for community curation of GrBio, which didn’t really happen; see also the discussion on iPhylo, “GRBio: A Call for Community Curation - what community?”. I think rather than repeat the same mistake (building something and hoping people will come) why not go to where there is a community? Once again, Wikidata seems the obvious place for this to happen. Many institutions already have entries, and the editing tools already exist. Wikidata supports a wealth of identifiers, as well as multiple languages, geographic location, images, relationships between parts, supporting evidence, etc.

So, the TL;DR version is the following:

  1. Use Wikidata as the data platform for institutions
  2. Invite people to edit the data there (much of this will happen anyway)
  3. Add GBIF-specific identifiers to Wikidata, such as the UUIDs and the “cool URIs” so that GBIF can link to the richer Wikidata content.
  4. Build a tool to sit on top of Wikidata and make that “GrBio”.

In addition to all the arguments about the richness of Wikidata records, there is also the value in getting people at these institutions to see the value of GBIF. Many institutions are engaging with Wikipedia (e.g., the GLAM Wiki project) and Wikipedia is becoming a key resource for them to make their collections (broadly defined) more accessible. Let’s get biodiversity science integrated into this broader effort. Looking at the Instagram pages for some museums, e.g. https://empty-opal.glitch.me/?q=C or https://empty-opal.glitch.me/?q=AM we can see lots of cool stuff going on that seems disconnected from our community.

In summary, I think there’s an opportunity here to make something useful, so long as we are prepared to learn the lessons of GrBio: make something broadly useful and highly connected, and engage with existing communities that are passionate about museums, data, or both.


@trobertson Whoops, only saw this after posting a lengthy comment on this thread.

By “failure” I mean (a) it went offline, which if it was vital would be unthinkable, (b) I never got the sense that many records were edited, and I was at a meeting where there was a call to support GrBio and that call was met with stony silence, and (c) the NCBI ended up making its own version instead of using GrBio.

Yes, I think we should definitely have a database that maps traditional collection identifiers to a digital identifier. But this mapping needs to be to multiple identifiers if we are to do anything interesting, and once you go down that route then you need to think about Wikidata.
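To make the "multiple identifiers" point concrete, here is a minimal sketch of what such a mapping could look like as a data structure. It is an illustration only, not a description of how GrBio, GBIF or Wikidata store this: the class names, the scheme labels and the `grid.EXAMPLE` value are hypothetical placeholders. The Wikidata item Q217717 for the AMNH is the one linked elsewhere in this thread. Lookups by code return a list because the sketch assumes a code may not be unique.

```python
from dataclasses import dataclass, field


@dataclass
class Institution:
    code: str   # traditional institution code, e.g. "AMNH"
    name: str
    # Maps an identifier scheme (hypothetical labels such as "wikidata" or
    # "grid") to the identifier this institution has in that scheme.
    identifiers: dict = field(default_factory=dict)


class IdentifierIndex:
    """Look institutions up by traditional code or by any external identifier."""

    def __init__(self):
        self._by_code = {}
        self._by_identifier = {}

    def add(self, institution):
        # Assumption: several institutions may share one code, so keep a list.
        self._by_code.setdefault(institution.code, []).append(institution)
        for scheme, value in institution.identifiers.items():
            self._by_identifier[(scheme, value)] = institution

    def by_code(self, code):
        return list(self._by_code.get(code, []))

    def by_identifier(self, scheme, value):
        return self._by_identifier.get((scheme, value))


index = IdentifierIndex()
amnh = Institution(
    code="AMNH",
    name="American Museum of Natural History",
    # "grid.EXAMPLE" is a placeholder, not a real GRID identifier.
    identifiers={"wikidata": "Q217717", "grid": "grid.EXAMPLE"},
)
index.add(amnh)
```

With this shape, `index.by_code("AMNH")` answers the traditional question, while `index.by_identifier("wikidata", "Q217717")` lets any external system that knows one identifier reach the others.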

From my perspective, we should be able to answer a bunch of questions, such as:

  1. Where is this collection?
  2. How big is it?
  3. Does this institution contribute to GBIF?
  4. Is it a part of BHL?
  5. What journals does it publish?
  6. What researchers work there?
  7. What collections do they use?
  8. Has the institution signed the Bouchout Declaration?
  9. Is it part of JSTOR’s Global Plants? Is it part of “insert initiative here”?
  10. etc., etc.

There’s a real opportunity to do some interesting stuff here, but it requires people to stop thinking in terms of domain-specific “catalogues”.


Thanks @rdmpage for engaging in this - our replies crossed

The kind of questions I was looking to provide answers to are along the lines of what you write along with:

  1. What papers have cited this collection
  2. What specimens in the collection have been used in taxonomic treatments
  3. What are the preservation methods used and the costs involved in maintaining the collection
  4. Of the digitised records, what is unique about this collection in space/time/taxa when compared to other available data
  5. What specimens have been sequenced and where are those
  6. Who has worked on the collection

(and tools like searchable image gallery for the digitised specimens)

Fundamentally I think a collection page should be able to bring together all the relevant info to showcase the scientific work the collection has enabled. Personally, I’m less excited by links to Instagram, Twitter etc., but that could reflect more on my own tastes as I don’t use them. We’re putting together a visual around this concept which I had intended to share with others for feedback (including e.g. the NCD TDWG group).

I said to David S. too that I don’t think we are in much disagreement (but am happy to be corrected), and for the first time GBIF have a collection entity to start linking information to. That could be pushed to Wikidata if it makes sense to do so, but I don’t know enough yet.

What is still unclear to me is how much of this is in Wikidata already - any idea please?


This is THE most important consideration. The questions about linkability above make a big assumption that the data in GRBio will always be correct & that the very fluid, hierarchical relationship(s) between institutions, collections, and people are updated in near real-time with date stamps. It might help to think of basic use-cases and what are the drivers to ensure metadata and relationships are created and maintained.

(1) What happens to people records in GRBio when someone changes institution? Dies? Shares multiple affiliations?
(2) What happens to collection records when an entire collection changes hands or heaven forbid is lost or destroyed?
(3) What happens to institution records when the institution name changes or is split?

I bet most of what we want can be accomplished in Wikidata, even if it is not yet populated to the same density as GRBio. As Rod says, Wikidata enthusiasts can populate and repair these in a real hurry. However, there will be other data elements unlikely to appear in Wikidata, such as people profiles, contact information and their relationship(s) with either collections or institutions. If that is to be accomplished on GBIF proper, what is sticky enough for them to do so w/o question or fuss?

Some thoughts

  • Obviously, collection-specimen-person linkages are very important. As seen from the Twitter part of this discussion, there is some fuzziness in collection-organization linkages, which, I suspect, are many-to-one. If linking GBIF publishers to collection IDs is important, we may soon realise that a GBIF publisher can correspond to either a collection or an organization.

  • 1:1 links between collection IDs and datasets could be preferred around #3 in Tim’s list, so that as long as a dataset (a digital representation of a collection in GBIF) is cited, the collection can report on digital access to it neatly, see metrics etc. - I think David mentioned this as a needed functionality. If more than one collection contributes to a dataset, or a dataset is only a fraction of the collection (e.g. regional or taxonomic), this could become a bit messy and require some arithmetic effort.

  • Maintenance and updates of content - there is some psychology of ownership here. In systems where the ID owner is responsible for keeping content up to date, quality can be quite high, but only as long as it is cool and important to have a profile up to date. If citation of a collection through data becomes a wanted feature for collections, curators will make sure the info is accurate, but for individuals and for ID systems there are waves of importance of being up to date - you can see these waves in ResearchGate, ORCID, LinkedIn, etc. A hybrid model where centrally (automatically or manually) generated content can be edited by the ID owner may work better - is this Google Scholar’s model?
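The "arithmetic" mentioned in the second point could take many forms. As one purely hypothetical illustration (not a method GBIF uses or has proposed), citations of a dataset could be apportioned among the contributing collections in proportion to the number of records each contributes, and the fraction of a collection represented by a dataset could be reported alongside. The function names and the proportional rule are assumptions made for this sketch.

```python
def apportion_citations(dataset_citations, records_by_collection):
    """Split a dataset's citation count across collections by record share.

    `records_by_collection` maps a collection identifier to the number of
    records that collection contributes to the dataset. Proportional
    splitting is an assumption of this sketch, not an established rule.
    """
    total = sum(records_by_collection.values())
    if total == 0:
        return {collection: 0.0 for collection in records_by_collection}
    return {
        collection: dataset_citations * count / total
        for collection, count in records_by_collection.items()
    }


def digitised_fraction(records_in_dataset, collection_size):
    """Fraction of a collection that a (partial) dataset represents."""
    if collection_size <= 0:
        raise ValueError("collection_size must be positive")
    return records_in_dataset / collection_size


# A dataset cited 10 times, built from two collections (made-up numbers).
shares = apportion_citations(10, {"collection-A": 300, "collection-B": 100})
```

Whether fractional credit like this is meaningful to curators is an open question; a simpler alternative is to credit every contributing collection with the full citation and accept the double counting.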

Maybe we can unpack this discussion a little? I think there are several things going on:

  1. As @trobertson spells out, GBIF wants to have a way to consistently refer to collections, and GrBio seems the obvious candidate as it has a list of institution and collection identifiers, some of which are already in use. For most of the goals Tim lists, having a set of domain-specific identifiers is all you need. The challenge then is mapping messy data to those identifiers (see @dshorthouse list https://gist.github.com/dshorthouse/acb35ad544000deafb8964341071ff55 for an indication of the problem ), and having data publishers use them when they upload data to GBIF. This is the argument for GBIF taking responsibility for GrBio.

  2. Collections are part of institutions, and our domain-specific identifiers are but one of many that are relevant to those institutions. @dshorthouse is looking at metrics of taxonomic activity and collection use that are tied to ORCIDs, and these in turn are linked to institutional identifiers, such as Grid, Ringgold, FundRef, etc. These identifiers seem to be the ones that matter to people building institution-level metrics of activity, not the domain-specific ones that GrBio created. So, if we want to contribute to those metrics (and arguably this is going to be a key part of helping those institutions justify their investment in their collections) then we need mappings between these identifiers. This is the argument for using Wikidata as the identity broker for those cross-links.

My view is we do both.

Thanks @rdmpage - that does spell it out clearly, and along my lines of understanding (although 2 can of course be achieved without Wikidata).

I would still like to know how many collections are in Wikidata, along with some metrics about the adoption rate, the communities using it, etc. Do you know please?

Please do not misinterpret this as negativity towards a Wikidata option. I genuinely don’t know enough to have an opinion and have found it confusing to navigate.

Hi All, thanks for the efforts to list and synthesize salient topics. I would like to emphasize one or two of them and perhaps add new ones. And please enlighten me where you can - thanks.

First, there are quite a few initiatives currently going on that are trying to capture collection metrics at various levels and for various stakeholders. See https://www.idigbio.org/content/shining-new-light-world’s-collections

  • stakeholders: global funders, aggregators, institutions, collections, collection managers, administrators, taxonomists (doing identifications), collectors, journals, funders in general, policy makers, researchers
  • some groups are focused on metrics for their own institutions (e.g. The Field Museum), while others are focused on aggregating metrics across collections (e.g. ICEDIG efforts for DiSSCo to make recommendations for building a digitization status dashboard across a large group of museums), and then others are interested in related collection-level metadata metrics at the level of aggregators (GBIF, iDigBio US Collections List, ALA, etc).
  1. Engagement and Human Effort. In order to get a resource that people will use and contribute to (without arm-twisting), it must be easy-to-use, intuitive, and linked to simplify their life. E.g. IF they update Index Herbariorum, then the API must be set up so that any other resource (GRBio at GBIF) can be updated w/out that person having to visit another site. The human effort involved is not trivial.
  2. Visualization. Whatever the tool, the community needs to be able to see not only the profile of a given collection, but be able to compare it to others, for example, to see what is unique, or what’s missing (via maps, etc).
  3. Absence data - beyond text fields. For example, we need much better fields (beyond EML) that allow us to understand the (un-digitized) backlog (quantitatively) and what’s in it, as well as what’s been done and what still needs georeferencing.
  4. People recognition. We need to be able to track / visualize not just the specimens and efforts of collections and institutions, but the contributions of the individuals making these resources possible (taxonomists, collectors, georeferencers, etc.).
    David’s work on Bloodhound shows the value of people having unique identifiers. But we need software that supports these identifiers and social uptake of using and documenting these in our community.
  5. Credit/Attribution standards. It would help if journals publishing articles that reference specimens, institutions, organizations, and collections had agreed-upon expectations, to help drive change and adoption of identifiers and formats that facilitate tracking.
  6. Cyberinfrastructure. Meanwhile, we need an infrastructure that helps people learn and effect what they need to do to join and support this effort.
  7. Carrots. It may be very valuable to have a conversation about carrots - what can we offer our collections community (at multiple levels - institution, administration, collection manager, taxonomist, etc) so that there is tangible reward for their part in creating / supporting / engaging with any resource created.
  8. Workflow Opportunity? Perhaps, since digitization has taken hold and is growing, some thought could be given to how sharing / exporting collection-level metadata could be tied to sharing specimen-level record information (for those already publishing this information).
  9. Standards Needed. The TDWG Collection Descriptions Task Group is endeavoring to offer standards that support the data and data model required to build such an (extensible) resource. See Use Cases collected so far.

In any case, let’s build something that is beyond individual records, and beyond a rows-and-columns interface. Help us all “see” what we’ve got so we can plan our future efforts strategically.


OK @trobertson I’ll try and answer some of these questions. And to be clear, I’m still trying to get my head around Wikidata as well. Given that many different people and communities contribute, and they often have different goals, things can get messy.

First off, to try and estimate the number of museums and herbaria in Wikidata I ran a SPARQL query:

  SELECT DISTINCT * WHERE {
    { ?repository wdt:P31 wd:Q181916. }
    UNION
    { ?repository wdt:P31 wd:Q1970365. }
    UNION
    { ?repository wdt:P31 wd:Q26959059. }
    ?repository rdfs:label ?label.
    FILTER((LANG(?label)) = "en")
  }

The result is here: http://tinyurl.com/y55sfe95 This finds 387 institutions. The query is more complex than I’d like because it looks for herbaria, natural history museums, and zoological museums (clearly not an exhaustive list of institutions). For fun here’s a map (addressing one of @Debbie’s concerns, Wikidata makes it trivial to create maps).
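For anyone who wants to repeat this kind of count from a script, here is a minimal sketch. It builds an equivalent query using a `VALUES` clause in place of the `UNION` branches (a swap made for brevity, not the query as originally run), and parses a response in the standard SPARQL JSON results format. The function names are hypothetical, and the sample response is hand-written for illustration rather than fetched from Wikidata.

```python
import json


def build_repository_query(class_ids, language="en"):
    """Build a SPARQL query for items that are instances (P31) of any class."""
    values = " ".join(f"wd:{qid}" for qid in class_ids)
    return (
        "SELECT DISTINCT ?repository ?label WHERE {\n"
        f"  VALUES ?class {{ {values} }}\n"
        "  ?repository wdt:P31 ?class .\n"
        "  ?repository rdfs:label ?label .\n"
        f'  FILTER(LANG(?label) = "{language}")\n'
        "}"
    )


def parse_bindings(response_text):
    """Extract (item URI, label) pairs from a SPARQL JSON results document."""
    bindings = json.loads(response_text)["results"]["bindings"]
    return [(b["repository"]["value"], b["label"]["value"]) for b in bindings]


# The three class identifiers used in the query above.
query = build_repository_query(["Q181916", "Q1970365", "Q26959059"])

# Hand-written response with one row, shaped like a SPARQL JSON result.
sample_response = json.dumps({
    "head": {"vars": ["repository", "label"]},
    "results": {"bindings": [{
        "repository": {"type": "uri",
                       "value": "http://www.wikidata.org/entity/Q217717"},
        "label": {"type": "literal", "xml:lang": "en",
                  "value": "American Museum of Natural History"},
    }]},
})
rows = parse_bindings(sample_response)
```

The query string can be pasted into the Wikidata Query Service, or sent to its SPARQL endpoint with JSON requested as the result format; `len(rows)` on the real response would then give the institution count.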

If we take GrBio’s 7000 institutions, then 387 is clearly fairly small. But this query will miss a lot of institutions (e.g., universities, botanic gardens, etc.). There are also lots of Wikidata entries that come from Wikispecies and are pretty minimal (often just the institutionCode). I scraped these from Wikispecies and looked them up in Wikidata; this gives us about 1300 institutions. Wikispecies editors are creating specimen records (e.g., type specimens) and linking those to institution pages via institutionCode; it is these pages that end up in Wikidata.

In terms of communities using Wikidata for collections, I don’t think that’s much of a thing yet, although some people are uploading specimens(!). But many museum records are quite rich, the AMNH being a great example: https://www.wikidata.org/wiki/Q217717

There’s a lot going on with Wikidata in relation to gene families, the academic literature, etc. that I haven’t gone into here; instead I’ve focussed on museums and herbaria. I think it’s fair to say that there are big gaps in Wikidata’s coverage, and it’s going to be a challenge to sort out. I’m trying to do some mapping between GrBio, NCBI, Wikispecies, Wikidata, and JSTOR to make some sense of this. The real test will be what happens if and when we ask the wider community to help out.


Thank you @rdmpage - I greatly appreciate you preparing that as a start for exploration.

@trobertson I’ve just posted some notes on iPhylo: “Where is the damned collection? Wikidata, GrBio, and a global list of all natural history collections”. The post is partly to remind me of the issues, and to bookmark some links while I thrash around trying to figure out how to make the best use of what’s already in Wikidata.

Hi @rdmpage please also see the US Collections List at iDigBio, https://github.com/iDigBio/idb-us-collections (and searchable on our website). And what about the REST API at Index Herbariorum? https://github.com/nybgvh/IH-API/wiki

Hi @rdmpage, you wrote [quote="rdmpage, post:11, topic:688"]
The real test will be what happens if and when we ask the wider community to help out.
[/quote]

Yes. This is the critical issue if we are ever to have a robust world collections resource. It’s why I stressed the process must be simple and elegant, and as integrated as possible, and we will need carrots if we are to succeed.

The maintenance issue can be solved through regional collection infrastructures like DiSSCo and iDigBio. They have the resources and the network, and they need this themselves. DiSSCo would be interested in maintaining the European part; I assume iDigBio would be too for North America.


@Debbie Gack, yet another database of collections :disappointed: This is part of what I think we should try and avoid, another database complete with its own identifiers. The contents of IH are mostly replicated in GrBio (no doubt out of date since GrBio became moribund).

@waddink @Debbie it might be useful to be clear on the scope of all these goals and initiatives.

From my perspective, I am interested in a single global resource with the same level of detail as in GrBio, enhanced with other identifiers that connect repositories to databases of funding, higher-level organisations, etc., as well as images, better support for names in multiple languages, links to other outputs of repositories such as publications, etc. I think this is what Wikidata is ideal for.

GBIF seems ideally placed to provide metrics and visualisations of what has currently been digitised (as @trobertson has outlined). In other words, this is what collection ‘x’ currently contributes to the global effort to digitise collections.

Measures of what remains to be done within a collection seem a third objective, and this may well be better done at the level of individual collections (if they have the resources; I’ve seen some cool visualisations at the NHM in London), or at the level of regional initiatives such as iDigBio and DiSSCo.


From my perspective we need both Wikidata and GrBio, benefiting from both their unique selling points and sharing data between them. On a practical level this would require GrBio data to be CC0, is this the case?


From Plazi’s scholarly publication point of view, we would like to see GRBIO live again. Early on we started to annotate the collection codes we find in scholarly articles with GRBIO’s persistent identifiers, then continued using our own service based on the saved version of GRBIO (doi.org/10.5281/zenodo.1285615), and now hope we can refer again to a live version of GRBIO.

For us, having GBIF as maintainer of GRBIO would be helpful, because we submit all the data (treatments and material citations) to GBIF, and the collection code is one element of these. Referring to the same reference would reduce the risk that a code refers to different things. It would also help GBIF to produce statistics that include scholarly articles (in the current GBIF language, collections), treatments, and material citations, which are probably the biggest users of collection-based data.

If this is in place, it not only provides better access to data liberated from publications, but also helps us talk to the growing number of publishers we work with, to ensure that they use the GRBIO terms for collections. The publishers’ interest is really to provide a service to their audience, allowing them to understand which collections contributed to describing the world’s taxa, next to the other obvious candidates (specimens, collectors, authors, etc.).
