Collections catalogue (GRBio)

#6

This is THE most important consideration. The questions about linkability above make a big assumption that the data in GRBio will always be correct and that the very fluid, hierarchical relationship(s) between institutions, collections, and people are updated in near real-time with date stamps. It might help to think through basic use cases and the drivers that ensure metadata and relationships are created and maintained.

(1) What happens to people records in GRBio when someone changes institution? Dies? Shares multiple affiliations?
(2) What happens to collection records when an entire collection changes hands or heaven forbid is lost or destroyed?
(3) What happens to institution records when the institution name changes or is split?

I bet most of what we want can be accomplished in Wikidata, even if it is not yet populated to the same density as GRBio. As Rod says, Wikidata enthusiasts can populate and repair these in a real hurry. However, there will be other data elements unlikely to appear in Wikidata, such as people profiles, contact information, and their relationship(s) with either collections or institutions. If that is to be accomplished on GBIF proper, what is sticky enough for them to do so without question or fuss?

#7

Some thoughts

  • Obviously, collection-specimen-person linkages are very important. As seen from the Twitter part of this discussion, there is some fuzziness in collection-organization linkages, which I suspect are many-to-one. If linking GBIF publishers to collection IDs is important, we may soon realise that a GBIF publisher can correspond to a collection or to an organization.

  • 1:1 links between collection IDs and datasets could be preferred around #3 in Tim’s list, so that as long as a dataset (a digital representation of a collection in GBIF) is cited, the collection can report on digital access to it neatly, see metrics, etc. - I think David mentioned this as a needed functionality. If more than one collection contributes to a dataset, or a dataset is only a fraction of the collection (e.g. regional or taxonomic), this could become a bit messy and require some arithmetic effort.

  • Maintenance and updates of content - there is some psychology of ownership here. In systems where the ID owner is responsible for keeping content up to date, quality can be quite high, but only as long as it is cool and important to have a profile up to date. If citation of a collection through data becomes a wanted feature for collections, curators will make sure the info is accurate, but for individuals and for ID systems there are waves of importance of being up to date - you can see these waves in ResearchGate, ORCID, LinkedIn, etc. A hybrid model where centrally (automatically or manually) generated content can be edited by the ID owner may work better - is this Google Scholar’s model?

#8

Maybe we can unpack this discussion a little? I think there are several things going on:

  1. As @trobertson spells out, GBIF wants a way to consistently refer to collections, and GrBio seems the obvious candidate as it has a list of institution and collection identifiers, some of which are already in use. For most of the goals Tim lists, having a set of domain-specific identifiers is all you need. The challenge then is mapping messy data to those identifiers (see @dshorthouse’s list https://gist.github.com/dshorthouse/acb35ad544000deafb8964341071ff55 for an indication of the problem), and having data publishers use them when they upload data to GBIF. This is the argument for GBIF taking responsibility for GrBio.

  2. Collections are part of institutions, and our domain-specific identifiers are but one of many that are relevant to those institutions. @dshorthouse is looking at metrics of taxonomic activity and collection use that are tied to ORCIDs, and these in turn are linked to institutional identifiers such as GRID, Ringgold, FundRef, etc. These identifiers seem to be the ones that matter to people building institution-level metrics of activity, not the domain-specific ones that GrBio created. So, if we want to contribute to those metrics (and arguably this is going to be a key part of helping those institutions justify their investment in their collections), then we need mappings between these identifiers. This is the argument for using Wikidata as the identity broker for those cross-links.
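To illustrate the mapping problem in point 1, here is a minimal sketch of resolving messy institution strings to canonical identifiers via a hand-curated lookup of normalized name variants. The lookup table and the identifier shown are hypothetical placeholders, not real GrBio entries:

```python
import re

# Hypothetical lookup from normalized name variants to a canonical
# identifier; the GrBio-style URI below is a placeholder, not a real ID.
GRBIO_LOOKUP = {
    "nhm london": "http://grbio.org/cool/example-id",
    "natural history museum london": "http://grbio.org/cool/example-id",
}

def normalize(name):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return re.sub(r"\s+", " ", name).strip()

def resolve(messy):
    """Return the canonical identifier for a messy string, or None if unknown."""
    return GRBIO_LOOKUP.get(normalize(messy))
```

In practice the hard part is building and curating the lookup (the variants in @dshorthouse's gist give a feel for the scale), and fuzzy matching would likely be needed on top of exact normalized matches.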

My view is we do both.

#9

Thanks @rdmpage - that does spell it out clearly, and along my lines of understanding (although 2 can of course be achieved without Wikidata).

I would still like to know how many collections are in Wikidata, and some metrics about the adoption rate, communities using it, etc. Do you know, please?

Please do not misinterpret this as negativity towards a Wikidata option. I genuinely don’t know enough to have an opinion and have found it confusing to navigate.

#10

Hi All, thanks for the efforts to list and synthesize salient topics. I would like to emphasize one or two of them and perhaps add new ones. And please enlighten me where you can - thanks.

First, there are quite a few initiatives currently going on that are trying to capture collection metrics at various levels and for various stakeholders. See https://www.idigbio.org/content/shining-new-light-world’s-collections

  • stakeholders: global funders, aggregators, institutions, collections, collection managers, administrators, taxonomists (doing identifications), collectors, journals, funders in general, policy makers, researchers
  • some groups are focused on metrics for their own institutions (e.g. The Field Museum), while others are focused on aggregating metrics across collections (e.g. ICEDIG efforts for DiSSCo to make recommendations for building a digitization status dashboard across a large group of museums), and then others are interested in related collection-level metadata metrics at the level of aggregators (GBIF, iDigBio US Collections List, ALA, etc).
  1. Engagement and Human Effort. In order to get a resource that people will use and contribute to (without arm-twisting), it must be easy to use, intuitive, and linked so as to simplify their lives. E.g., if they update Index Herbariorum, then the API must be set up so that any other resource (GRBio at GBIF) can be updated without that person having to visit another site. The human effort involved is not trivial.
  2. Visualization. Whatever the tool, the community needs to be able to see not only the profile of a given collection, but be able to compare it to others, for example, to see what is unique, or what’s missing (via maps, etc).
  3. Absence data - beyond text fields. For example, we need much better fields (beyond EML) that allow us to understand the (un-digitized) backlog (quantitatively) and what’s in it, as well as what’s been done, and still needs georeferencing.
  4. People recognition. We need to be able to track / visualize not just the specimens and efforts of collections and institutions, but the contributions of the individuals making these resources possible (taxonomists, collectors, georeferencers, etc.).
    David’s work on Bloodhound shows the value of people having unique identifiers. But we need software that supports these identifiers and social uptake of using and documenting these in our community.
  5. Credit/Attribution standards. It would help if journals publishing articles that reference specimens, institutions, organizations, and collections had agreed-upon expectations, to help drive change and adoption of identifiers and formats that facilitate tracking.
  6. Cyberinfrastructure. Meanwhile, we need an infrastructure that helps people learn and effect what they need to do to join and support this effort.
  7. Carrots. It may be very valuable to have a conversation about carrots - what can we offer our collections community (at multiple levels - institution, administration, collection manager, taxonomist, etc) so that there is tangible reward for their part in creating / supporting / engaging with any resource created.
  8. Workflow Opportunity? Perhaps, since digitization has taken hold and is growing, some thought could be given to how sharing / exporting collection-level metadata could be tied to sharing specimen-level record information (for those already publishing this information).
  9. Standards Needed. The TDWG Collection Descriptions Task Group is endeavoring to offer standards that support the data and data model required to build such an (extensible) resource. See Use Cases collected so far.

In any case, let’s build something that is beyond individual records, and beyond a rows-and-columns interface. Help us all “see” what we’ve got so we can plan our future efforts strategically.

2 Likes
#11

OK @trobertson I’ll try and answer some of these questions. And to be clear, I’m still trying to get my head around Wikidata as well. Given that many different people and communities contribute, and they often have different goals, things can get messy.

First off, to try and estimate the number of museums and herbaria in Wikidata I ran a SPARQL query:

  SELECT DISTINCT * WHERE {
    { ?repository wdt:P31 wd:Q181916. }
    UNION
    { ?repository wdt:P31 wd:Q1970365. }
    UNION
    { ?repository wdt:P31 wd:Q26959059. }
    ?repository rdfs:label ?label.
    FILTER((LANG(?label)) = "en")
  }

The result is here: http://tinyurl.com/y55sfe95 This finds 387 institutions. The query is more complex than I’d like because it looks for herbaria, natural history museums, and zoological museums (clearly not an exhaustive list of institution types). For fun, here’s a map (addressing one of @Debbie’s concerns, Wikidata makes it trivial to create maps).
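For anyone who wants to reproduce this count programmatically, here is a sketch that assembles the same UNION query and runs it against the public Wikidata Query Service endpoint (https://query.wikidata.org/sparql). The QIDs are the ones in the query above; the User-Agent string is an arbitrary placeholder:

```python
import json
import urllib.parse
import urllib.request

# Herbarium, natural history museum, zoological museum (as in the query above)
INSTITUTION_CLASSES = ["Q181916", "Q1970365", "Q26959059"]

def build_query(classes):
    """Assemble the UNION query over a list of Wikidata class QIDs."""
    unions = " UNION ".join(
        f"{{ ?repository wdt:P31 wd:{qid}. }}" for qid in classes
    )
    return (
        f"SELECT DISTINCT * WHERE {{ {unions} "
        f'?repository rdfs:label ?label. FILTER((LANG(?label)) = "en") }}'
    )

def run_query(query, endpoint="https://query.wikidata.org/sparql"):
    """Send the query to the Wikidata Query Service and return JSON bindings."""
    url = endpoint + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    req = urllib.request.Request(
        url, headers={"User-Agent": "collections-count/0.1"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]

# Example (network call): len(run_query(build_query(INSTITUTION_CLASSES)))
```

Extending the count to other institution types is then just a matter of adding QIDs to the list.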

If we take GrBio’s 7000 institutions, then 387 is clearly fairly small. But this query will miss a lot of institutions (e.g., universities, botanic gardens, etc.). There are also lots of Wikidata entries that come from Wikispecies and are pretty minimal (often just the institutionCode). I scraped these from Wikispecies and looked them up in Wikidata, which gives us about 1300 institutions. Wikispecies editors are creating specimen records (e.g., type specimens) and linking those to institution pages via institutionCode; it is these pages that end up in Wikidata.

In terms of communities using Wikidata for collections, I don’t think that’s much of a thing yet, although some people are uploading specimens(!). But many museum records are quite rich, the AMNH being a great example: https://www.wikidata.org/wiki/Q217717

There’s a lot going on with Wikidata in relation to gene families, the academic literature, etc. that I haven’t gone into here, instead I’ve focussed on museums and herbaria. I think it’s fair to say that there are big gaps in Wikidata’s coverage, and it’s going to be a challenge to sort out. I’m trying to do some mapping between GrBio, NCBI, Wikispecies, Wikidata, and JSTOR to make some sense of this. The real test will be what happens if and when we ask the wider community to help out.

1 Like
#12

Thank you @rdmpage - I greatly appreciate you preparing that as a start for exploration.

#13

@trobertson I’ve just posted some notes on iPhylo: Where is the damned collection? Wikidata, GrBio, and a global list of all natural history collections. The post is partly to remind me of the issues, and to bookmark some links while I thrash around trying to figure out how best to make use of what’s already in Wikidata.

#14

Hi @rdmpage please also see the US Collections List at iDigBio, https://github.com/iDigBio/idb-us-collections (and searchable on our website). And what about the REST API at Index Herbariorum? https://github.com/nybgvh/IH-API/wiki
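To give a feel for what consuming the IH REST API might look like, here is a small sketch. The base URL and the shape of the payload (a `data` array of records with a `code` field) are assumptions drawn from the IH-API wiki linked above; check that page for the authoritative endpoint and schema:

```python
import json
import urllib.request

# Assumed endpoint; see https://github.com/nybgvh/IH-API/wiki for the
# authoritative base URL and response schema.
IH_API = "http://sweetgum.nybg.org/science/api/v1/institutions"

def fetch_institutions(url=IH_API):
    """Fetch the raw institutions payload from the IH API (network call)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def herbarium_codes(payload):
    """Extract herbarium codes, assuming {'data': [{'code': ...}, ...]}."""
    return [rec["code"] for rec in payload.get("data", []) if "code" in rec]
```

A sync job along these lines could pull IH records on a schedule and push updates to any downstream resource, so curators only ever have to edit in one place.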

#15

Hi @rdmpage, you wrote:

  “The real test will be what happens if and when we ask the wider community to help out.”

Yes. This is the critical issue if we are ever to have a robust world collections resource. It’s why I stressed the process must be simple and elegant, and as integrated as possible, and we will need carrots if we are to succeed.

#16

The maintenance issue can be solved through regional collection infrastructures like DiSSCo and iDigBio. They have the resources and the network, and they need this themselves. DiSSCo would be interested in maintaining the European part, I assume iDigBio would be too for North America.

#17

@Debbie Gack, yet another database of collections :disappointed: This is part of what I think we should try to avoid: another database complete with its own identifiers. The contents of IH are mostly replicated in GrBio (no doubt out of date since GrBio became moribund).

#18

@waddink @Debbie it might be useful to be clear on the scope of all these goals and initiatives.

From my perspective, I am interested in a single global resource with the same level of detail as in GrBio, enhanced with other identifiers that connect repositories to databases of funding, higher-level organisations, etc., as well as images, better support for names in multiple languages, links to other outputs of repositories such as publications, etc. I think this is what Wikidata is ideal for.

GBIF seems ideally placed to provide metrics and visualisations of what has currently been digitised (as @trobertson has outlined). In other words, this is what collection ‘x’ currently contributes to the global effort to digitise collections.

Measures of what remains to be done within a collection seems a third objective, and this may well be better done at the level of individual collections (if they have the resources, I’ve seen some cool visualisations at the NHM in London), or at the level of regional initiatives such as iDigBio and DiSSco.

1 Like
#19

From my perspective we need both Wikidata and GrBio, benefiting from both their unique selling points and sharing data between them. On a practical level this would require GrBio data to be CC0, is this the case?

2 Likes
#20

From Plazi’s scholarly publication point of view, we would like to see GRBIO live again. We started early on by annotating the collection codes we find in scholarly articles with GRBIO’s persistent identifiers, then continued using our own service based on the saved version of GRBIO (doi.org/10.5281/zenodo.1285615), and now hope we can refer again to a live version of GRBIO.

For us, having GBIF as maintainer of GRBIO would be helpful, because we submit all our data (treatments and material citations) to GBIF, in which the collection code is one element. Referring to the same reference would reduce the risk that the code refers to different things. It would also enable GBIF to produce statistics that include scholarly articles (in the current GBIF language, collections), treatments, and material citations, which are probably the biggest users of collection-based data.

If this is in place, it not only provides better access to data liberated from publications, it also helps us in talking to the growing number of publishers we work with, to ensure that they use the GRBIO terms for collections. The publishers’ interest is really to provide a service to their audience, allowing them to understand which collections contributed to describing the world’s taxa, next to the other obvious candidates (specimens, collectors, authors, etc.).

3 Likes
#21

From my early investigations I tend to agree (not the current GRBio, but the one we envisage).

Thanks Quentin, I was not aware of that requirement. I don’t foresee that as an issue as it is largely factual information but would like to verify specifically with regard to people.

2 Likes
#22

Hi @rdmpage understandable reaction :slight_smile: But I think the way forward now will have to involve APIs that make it possible to connect these resources, and making it very clear in which direction the data is expected to flow.

#23

You wrote

most definitely. I would add that many (including would-be funders) are very interested in knowing more about the backlog. As far as metrics and visualizations at GBIF - yay! Yes, it would be great: since many (most) are already sharing their specimen-level data with GBIF (or planning to), it makes sense that they would send their collection-level metadata there as well, hopefully through a similar (familiar, simple) mechanism to make it easy to comply. @waddink @qgroom and @agosti also raise other important points about the expectations, need for, and requirements of such a system.

Data about the backlog will be easier for some collections than others. Those that have done species inventories can start by sharing this level of data about what they have. Part of being in the DiSSCo network will mean the partners have to provide this information.

I think I get the sense of what you mean about “Measures of what remains to be done within a collection…” being “better done at the level of individual collections.” But many of the needed metrics are at the level of individual collections. We need to get beyond free-text fields (EML) to better understand what we have, who and where the experts are, and the digitization status of these collections.

For me, the harder bit seems to be how to get people to give us this information. I think that DiSSCo has a great chance of showing what can be done when, from the beginning of the effort, the expectation is in place that this information is to be provided.

To @agosti: I think @trobertson confirmed that the GRBio IDs were / are kept in the system he is building.

#24

We support the notion that collection names should be both on GRBIO/GBIF and on Wikidata. We prefer to have GRBIO on GBIF in a way that we can edit, easily adding new collection names we discover in publications, with an API that allows reuse of the data, and with a GRBIO/GBIF ID for collections in Wikidata. Collection codes are an essential building block for our biodiversity knowledge, so we should do all we can to maintain them, ideally in GBIF, with whom we all already intensively interact. We should also make an effort to convince the respective institutions to feel responsible for being present in GRBIO and maintaining the data about themselves.

The most convincing, and probably lowest-hanging, fruit is a series of dashboards like the current ones on GBIF: one for a collection or scholarly publication (https://www.gbif.org/dataset/378ebf94-4b5c-4451-90f4-4109f9b27ea9), or for persons (e.g. Bloodhound) associated with a collection.

I would like to see a time where occurrence-related data does not exist unless it shows up in GBIF. For example, the Codes (e.g. ICZN) should be revised accordingly, or iDigBio and DiSSCo should only be considered a success if their data is also in GBIF.

1 Like
#25

It might be useful to distinguish between (1) GrBio as a database/project and (2) the GrBio identifiers. Yes, it would make sense to keep the GrBio identifiers “live” in some sense, because they have been used (and we’ve already been through at least one iteration of these identifiers having to be re-routed when GrBio took over Roger Hyam’s BioCollections project).

But whether a reborn GrBio is the best way to manage the task of building a database of natural history repositories is another question. Personally I’d argue that is what Wikidata does well (especially if GBIF makes the GrBio identifiers live again, so they can then be added to the existing Wikidata records for these repositories).

1 Like