Preferred identifiers for GRSciColl entries - Should we mint DOIs for collections?

This is a thread to follow up on a topic we started discussing during the April GRSciColl community call (you can check the recording here: Global Registry of Scientific Collections (GRSciColl) community call - April 2024 on Vimeo).

A few months ago (November 2023), we ran a survey to collect input on the GRSciColl data schema in order to update it (this is in the context of our road map work).
At the time, 8 respondents gave us feedback on GRSciColl identifiers.

  • Most people agree that we need to add context for the identifiers (and identifier types) available on GRSciColl. Unless you are familiar with the world of identifiers, what you see on GRSciColl might be difficult to navigate.
  • Most people also agreed that it would help to have a “how to cite” section on GRSciColl institution and collection pages so people know exactly what to cite (which identifier).
  • At the time, most respondents also said that GRSciColl wouldn’t need to mint DOIs for GRSciColl entries.

During our community call on Wednesday this week, the question of preferred identifiers and DOIs came back. I would like to have a bit more input on the topic.

What do you think should be the preferred identifiers to reference institutions and collections on GRSciColl?

Here are a few ideas discussed during the call and a poll:

  • Institutions should be able to choose which identifier people should use for referencing their entries.
  • The preferred identifiers should be (at least by default) GRSciColl URLs. The advantage is that these are created and maintained by GBIF, so they don’t rely on external sources.
  • The preferred identifiers should be RORs for institutions. This is a position that has been voiced multiple times. However, it relies on two things:
    • Institutions (or someone) making sure that the correct ROR identifiers are in GRSciColl for the correct entries (right now about 6% of the institutions have a ROR id in GRSciColl).
    • ROR maintaining those identifiers.
  • The preferred identifiers should be DOIs for collections. With the caveat that minting DOIs has a cost, perhaps this could be for institutions who request them only (maybe with a button in the GRSciColl interface?)
  • The preferred identifiers should be something else? ARK identifiers were mentioned during the call:
    • ARKs are generic and persistent. The good thing is that they are completely free. They are most often used in the heritage field but should be adaptable for natural history collections. See https://arks.org/

What do you think? Feel free to vote here and/or comment. Many thanks!

  • Institutions should be able to choose for themselves
  • GRSciColl URLs/UUIDs should be the default
  • ROR for institutions
  • DOIs (minted by GRSciColl) for collections
  • Others (ARK?)
1 Like

We have over 2,700,000 DOIs for downloads and 100,000 for datasets, so there would be no additional cost to make a few thousand DOIs for GRSciColl entities.

1 Like

Is the concern INSTITUTIONAL identifiers or COLLECTION identifiers? I see the need for an official institutional identifier opening up a paperwork nightmare for people in charge of small collections and institutions where there is little understanding of the needs associated with information sharing, the kind of people I work with. So I opt for DOIs for institutions, minted by GRSciColl (since it is the body wanting them) - and the option of using an established, adequately unique, recognized identifier. I do not like using the collection code for an institution code.

1 Like

I assume by DOI, you mean a DataCite DOI. Would the resourceTypeGeneral = “Collection” rather than “Dataset” as is the case for a GBIF Download? They define this as: “An aggregation of resources, which may encompass collections of one resourceType as well as those of mixed types. A collection is described as a group; its parts may also be separately described.”

In any case, you’ll need a URL and HTML landing page for a DOI, whether it’s a Dataset or not. Can you sufficiently populate (and maintain) the metadata for a DataCite DOI if it had resourceTypeGeneral = “Collection” if that entity was split across institutions as is often the case? Could you also populate relationType to specify the relationship among and between parts of a collection? Does the DataCite schema support an array of “publisher” or whatever could be used to specify the potentially many host institutions for a single collection?

3 Likes

I think @dshorthouse raises some valid points worth consideration. I think we should explore if and how well these collections fit the DataCite metadata schema. @mgrosjean, let’s put our heads together next week? :slight_smile:

2 Likes

Note that GRSciColl entries can refer to “inactive” collections which have since been lost (like this one) or were split and integrated into other collections.
It can be helpful to have those entries available as these collections might be referenced in publications.

@dnoesgaard and I decided that we are going to map a few GRSciColl collections to DataCite (test) and share them here in the coming week(s) so we can discuss what makes the most sense.

1 Like

@dshorthouse it took a while but here are some examples of collections mapped to DataCite:

See also the same examples with this API call: https://api.test.datacite.org/dois?client-id=gbif.grscicoll
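For anyone who wants to look at those test records programmatically, here is a minimal Python sketch. It assumes the endpoint returns the usual DataCite JSON:API shape (a top-level `data` list whose items carry `attributes` with `doi`, `titles` and `types`); the function names `fetch_test_dois` and `summarize_dois` are made up for illustration and are not part of any GBIF or DataCite client.

```python
import json
import urllib.request

# URL quoted in the post above (DataCite test environment).
API_URL = "https://api.test.datacite.org/dois?client-id=gbif.grscicoll"


def fetch_test_dois(url=API_URL):
    """Download and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def summarize_dois(payload):
    """Return (doi, first title, resourceTypeGeneral) for each record.

    Assumes the JSON:API layout described in the lead-in; missing
    fields come back as None instead of raising.
    """
    rows = []
    for record in payload.get("data", []):
        attrs = record.get("attributes", {})
        titles = attrs.get("titles") or [{}]
        types = attrs.get("types") or {}
        rows.append(
            (attrs.get("doi"), titles[0].get("title"), types.get("resourceTypeGeneral"))
        )
    return rows
```

Calling `summarize_dois(fetch_test_dois())` should then list one tuple per test DOI, which makes it easy to check whether each record was registered with `resourceTypeGeneral` set to “Collection”.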

One of the challenges is that DataCite isn’t meant for physical objects (just data) so there is no field for address for example. I ended up putting a lot of information as “other” description. It would make sense to define a minimum set of mappable data (that would go on DataCite) and leave the rest to the GRSciColl entry.
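To make the “minimum set of mappable data” idea concrete, here is a rough Python sketch of such a mapping. It is not the mapping actually used for the test records: the input keys (`name`, `description`, `address`, `active`) are assumed names for a GRSciColl-like record, the function name and the `landing_url` parameter are hypothetical, and the output only covers a handful of DataCite attributes (`titles`, `publisher`, `publicationYear`, `types`, `url`, `descriptions`).

```python
def collection_to_datacite(collection, publication_year, landing_url):
    """Build a minimal set of DataCite-style attributes for one collection.

    `collection` is a plain dict with assumed keys; `landing_url` is the
    page the DOI should resolve to (for example the GRSciColl entry).
    """
    # Information with no obvious DataCite field (such as the postal
    # address) is folded into one free-text description of type "Other".
    leftovers = []
    if collection.get("address"):
        leftovers.append("Address: " + collection["address"])
    if collection.get("active") is False:
        leftovers.append("Status: inactive")

    descriptions = []
    if collection.get("description"):
        descriptions.append(
            {"description": collection["description"], "descriptionType": "Abstract"}
        )
    if leftovers:
        descriptions.append(
            {"description": "; ".join(leftovers), "descriptionType": "Other"}
        )

    return {
        "titles": [{"title": collection["name"]}],
        # Publisher choice follows this thread; it is still an open question.
        "publisher": "GBIF",
        "publicationYear": publication_year,
        "types": {"resourceTypeGeneral": "Collection"},
        "url": landing_url,
        "descriptions": descriptions,
    }
```

Keeping the mapped set this small would leave everything else (contacts, address details, history) on the GRSciColl entry itself, with the DOI simply pointing back to it.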

I also put “GBIF” as publisher in the DataCite schema as the record of the collection is published by GBIF/GRSciColl but I am not sure if this is the best way to proceed.
It would be great if someone from DataCite could advise. Perhaps @mjbuys or a colleague could help?

Thanks @mgrosjean, we can certainly provide some guidance. We have over 12 million samples registered and have been doing some work with the community on schema crosswalks. Kelly Stathis is best suited to support these queries. Would it be possible to either add Kelly here? Or send an email to support@datacite.org?

1 Like

Thanks @mjbuys ! I will contact support@datacite.org for now.

Nice work, @mgrosjean. I hope this was a useful exercise. The impression I have is that this is a bit like a square peg in a round hole, perhaps because the concepts of what is a “collection” differ in the DataCite world (= digital data sources) vs the natural history world (= administrative/thematic about physical resources).

It is unfortunate that RoR’s concept of organizational hierarchies in v2 of its schema would not likely recognize physical collections. These are (in most cases) more granular than departments, which themselves are not first class entities in RoR: Research Organization Registry (ROR) | FAQs. It would have made for a much cleaner division of duties in GRSciColl.

1 Like

Marie,

I’m just following up from the GRSciColl demo that you just gave for a group of us at the Smithsonian with my feedback on identifiers for collections. As mentioned, there are many flavors of persistent identifiers and selecting the correct one for the use case is an investigative process. As a librarian, the challenges and considerations to take into account with authority control bodies that govern identifiers e.g. CrossRef, DataCite, ISNI, ROR are the following:

  1. There is often a cost for issuance
  2. There are schema conformance requirements (as David Shorthouse so aptly pointed out)
  3. There will be administrative burden on GBIF to deposit and maintain the data
  4. Staff turnover and loss of expertise can often put identifier issuance on hiatus at an organization because it requires expertise. (yes, this one is hard-learned from experience)

To mitigate those challenges GBIF has a few options:

  1. Issue your own (as has already been suggested)
  2. Use ARKs (they are free, have no schema requirements, and are used to describe anything physical, digital, or abstract)
  3. Use Wikidata (it’s a PID broker) Example. Scroll down to the “identifiers” section to see.

Personally, if it were me and I just wanted to get the job done quickly, I’d issue my own and submit the IDs to Wikidata. But again, it’s important to think about the original use case.

Was it:

“As a user, I would like to persistently identify a GRSciColl collection so that I can accurately cite, point, and link to it on the web.”

Is there more to the use case than that? If not, then I think the simple solution I offer will likely suffice for this purpose.

Hope this helps. All the best,
JJ

3 Likes

My personal reason against the use of a ROR as a unique identifier for GRSciColl institution entries: if the institution has been registered at the same level as in ROR, then the ROR is unique, and then it is OK… However, I have seen cases where two departments of the same scientific institution are registered as separate GRSciColl entries. Still, they may share the ROR to indicate that they belong to the same higher-level institution… Therefore, I always advise people to use the GRSciColl URLs.
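A quick way to see how often this happens would be to group entries by their ROR and look for values shared by more than one entry. The Python sketch below assumes each entry is a dict with a `key` and an optional `ror` field; those field names, the function name and the sample values are hypothetical placeholders, not real GRSciColl records.

```python
from collections import defaultdict


def entries_sharing_ror(entries):
    """Map each ROR used by more than one entry to the keys of those entries."""
    by_ror = defaultdict(list)
    for entry in entries:
        ror = entry.get("ror")
        if ror:  # entries without a ROR are simply skipped
            by_ror[ror].append(entry["key"])
    return {ror: keys for ror, keys in by_ror.items() if len(keys) > 1}


# Hypothetical example: two departments registered separately, same parent ROR.
sample_entries = [
    {"key": "dept-botany", "ror": "https://ror.org/EXAMPLE1"},
    {"key": "dept-zoology", "ror": "https://ror.org/EXAMPLE1"},
    {"key": "other-museum", "ror": "https://ror.org/EXAMPLE2"},
    {"key": "no-ror-yet"},
]
shared = entries_sharing_ror(sample_entries)
```

Any ROR that shows up in the result cannot, on its own, tell the entries apart, whereas the GRSciColl key or URL still can.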

2 Likes