Collections catalogue (GRBio)

From Plazi’s scholarly-publication point of view, we would like to see GRBIO live again. Early on, we started annotating the collection codes we find in scholarly articles with GRBIO’s persistent identifiers, then continued with our own service based on the saved version of GRBIO (doi.org/10.5281/zenodo.1285615), and now hope we can again refer to a live version of GRBIO.

For us, having GBIF as maintainer of GRBIO would be helpful, because we submit all our data (treatments and material citations) to GBIF, and the collection code is one element of it. Referring to the same reference would reduce the risk that a code refers to different things. It would also make it easier for GBIF to produce statistics that include scholarly articles (in current GBIF language, collections), treatments, and material citations, which are probably the biggest users of collection-based data.

If this is in place, it not only provides better access to data liberated from publications, but also helps us when talking to the growing number of publishers we work with, to ensure that they use the GRBIO terms for collections. The publishers’ interest is really to provide a service to their audience, letting readers understand which collections contributed to describing the world’s taxa, alongside the other obvious candidates (specimens, collectors, authors, etc.).
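To make the annotation step above concrete, here is a minimal sketch of tagging collection codes in a material citation with identifiers. It is not Plazi's actual service: the lookup table, the placeholder identifiers, and the assumption that a collection code is a run of two or more capital letters are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical lookup: collection code -> persistent identifier.
# The identifier strings are placeholders, not real GRBIO identifiers.
CODE_TO_ID = {
    "NHMUK": "example:collection/nhmuk",
    "MNHN": "example:collection/mnhn",
}

# Assumption: a collection code is a standalone run of 2+ capital letters.
CODE_PATTERN = re.compile(r"\b[A-Z]{2,}\b")


def annotate_codes(text, lookup=CODE_TO_ID):
    """Return (code, identifier, start, end) for each known code found in text."""
    hits = []
    for match in CODE_PATTERN.finditer(text):
        code = match.group(0)
        if code in lookup:
            hits.append((code, lookup[code], match.start(), match.end()))
    return hits


citation = "Holotype male, deposited in NHMUK; paratypes in MNHN and ZZZZ."
for code, identifier, start, end in annotate_codes(citation):
    print(code, identifier, start, end)
```

Codes missing from the lookup (ZZZZ here) are simply skipped, which is exactly where a shared, editable registry would help: everyone resolving against the same table reduces the risk of one code pointing to different things.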


From my early investigations I tend to agree (not the current GRBio, but the one we envisage).

Thanks Quentin, I was not aware of that requirement. I don’t foresee that as an issue as it is largely factual information but would like to verify specifically with regard to people.


Hi @rdmpage understandable reaction :slight_smile: But I think the way forward now will have to involve APIs that make it possible to connect these resources, and making it very clear in which direction the data is expected to flow.

You wrote

Most definitely. I would add that many (including would-be funders) are very interested in knowing more about the backlog. As far as metrics and visualizations at GBIF - yay! Yes, it would be great: since many (most) are already sharing their specimen-level data with GBIF (or planning to), it makes sense that they would send their collections-level metadata there as well, hopefully through a similar (familiar, simple) mechanism that makes it easy to comply. @waddink @qgroom and @agosti also raise other important points about the expectations, need for, and requirements of such a system.

Data about the backlog will be easier for some collections than others. Those that have done species inventories can start by sharing this level of data about what they have. Part of being in the DiSSCo network will mean the partners have to provide this information.

I think I get the sense of what you mean about “Measures of what remains to be done within a collection…” “better done at the level of individual collections.” But many of the needed metrics are at the level of individual collections. We need to get beyond free text fields (EML) to better understand what we have, who and where the experts are, and the digitization status of these collections.

For me, the harder bit seems to be how to get people to give us this information. I think that DiSSCo has a great chance of showing what can be done when, from the beginning of the effort, the expectation that this information is to be provided is in place.

To @agosti: I think @trobertson confirmed that the GRBIO IDs were / are kept in the system he is building.

We support the notion that collection names should be both on GRBIO/GBIF and on Wikidata. We prefer to have GRBIO on GBIF in a way that lets us edit it and easily add new collection names we discover in publications, with an API that allows the data to be reused, and with a GRBIO.GBIF-id for collections in Wikidata. Collection codes are an essential building block for our biodiversity knowledge, so we should do all that we can to maintain them, ideally in GBIF, with whom we all already interact intensively. We should also make an effort to convince the respective institutions to feel responsible for being present in GRBIO and for maintaining the data about themselves.

The most convincing, and probably the lowest-hanging, fruit is a series of dashboards about a collection, like the current ones on GBIF for a collection aka scholarly publication (https://www.gbif.org/dataset/378ebf94-4b5c-4451-90f4-4109f9b27ea9) or for persons (e.g. Bloodhound).

I would like to see a time when occurrence-related data does not exist unless it shows up in GBIF. For example, the Codes (e.g. ICZN) should be revised accordingly, or iDigBio and DiSSCo should be considered a success only if their data is also in GBIF.


It might be useful to distinguish between (1) GrBio as a database/project and (2) the GrBio identifiers. Yes, it would make sense to keep the GrBio identifiers “live” in some sense because they have been used (and we’ve already been through at least one iteration of these identifiers having to be re-routed when GrBio took over Roger Hyam’s BioCollections project).

But whether a reborn GrBio is the best way to manage the task of building a database of natural history repositories is another question. Personally I’d argue that is what Wikidata does well (especially if GBIF makes the GrBio identifiers live again and they can then be added to the existing Wikidata records for these repositories).


We seem to have forgotten the “staff” part of the registry, https://www.gbif.org/grscicoll/person/search. This is important because it’ll quite likely fall to them to maintain their own contact information & that of the institution(s)/collection(s) they represent. I’m willing to bet that a good part of this is now so dated as to be useless and worse, frustrating to people included in the registry and expected end-users who are trying to find contact information.

(1) Have people been alerted that they’re now on a publicly accessible list?
(2) How many have died since inclusion in the registry? Who is now responsible for updating their information?
(3) Do you now need to adhere to GDPR rules?
(4) How much of that staff registry is still correct? Do emails bounce? Do phone numbers work? Is it even worth creating new forms for people to fix this when there are other entities out there that might do a better job of it (e.g. ORCID)?

My recommendation is that you pull the staff part of the registry until you have an agreed process in place.


Thanks for these perspectives and questions, @dshorthouse.

To be honest, we ourselves haven’t forgotten about it—we are very aware of the presence of staff information in GRSciColl. For now, though, GBIF has simply restored this pre-existing information from GRSciColl online. Data accuracy and curation is a topic for all aspects of the registry, not just staff information.

That’s why we are working, on the one hand, to establish proper links with sources like Index Herbariorum, and on the other, planning to explore how best to integrate systems like ORCID or, indeed, Wikidata (at least in part on the strength of this discussion).

In our recent announcement on GRSciColl, we asked members of the collections community who wished to edit, update and curate this data to contact us at scientific-collections@gbif.org. We will start work shortly with our first users outside the Secretariat, working through ‘teething’ issues and resolving them as we find them.

Our collective maintenance of this information, through any relevant channel, remains critically important in promoting collections as “first-class entities.” We hope to address many of these aspects in upcoming work related to the shared data-processing pipeline, and in collaboration with the TDWG Collection Descriptions IG, CETAF, DiSSCo, SYNTHESYS+ et al.

Lastly, since you bring up the four-letter word, three main points on GDPR:

  1. GDPR does not apply to the dead.
  2. GDPR does not mean ‘no names’.
  3. The legal advice we’ve received asserts that GBIF has what GDPR terms “legitimate interests” in providing information about identifiers and recorders in occurrence data in order to serve both the public interest and purposes of scientific and historical research.

In the coming weeks, we will engage the network’s data publishing community directly to ensure that our ongoing, collective efforts fully comply with transparency requirements under GDPR—even where we likely merit exemptions from it.

Thanks @kcopas. re: GDPR

Yes, I’d expect the wider GBIF community is exempt from GDPR issues in our occurrence data…except when these are names of minors that can additionally be tied to geolocations. There are no exemptions there. We do have that in occurrence data and we have no idea how to deal with it. Let’s leave this issue aside for the moment.

My comment in the present case was about the revival of GRSciColl staff information, and how these relate to GDPR and other legal issues. I do not believe we are exempt from GDPR under these circumstances where there are email addresses, phone numbers, mailing addresses presented without permission EVEN IF these might be found elsewhere on institutional websites or were once present in earlier versions of GRSciColl. Were there click-through agreements in the earlier version of GRSciColl? Can you show this new version without also including a statement on how one can request that their name be removed?

Let’s do keep those two things separate—they are different.

In the first instance, the reason for us establishing direct communications with the data publishing community is to help them address and proactively comply with GDPR and to offer such guidance as we can on ‘how to deal with it.’ It’s probably worth recalling the role that GBIF played in helping the community work through issues around data licensing, and anticipating something similar here.

In the second, GDPR art. 6(1)(f) provides a lawful basis for processing of non-sensitive personal data based on legitimate interests. Our understanding is that GBIF’s processing of personal information such as that appearing in GRSciColl requires that these interests be balanced against the interests or fundamental rights of the data subject and must consider the effect of actual processing on particular individuals.

Email addresses, phone numbers and mailing addresses are considered ‘ordinary data’ under GDPR. Questions of curatorial accuracy aside, their availability within our community is linked to individuals’ professional scientific roles and relates specifically to their association with an institution and collection. Nothing that is processed is sensitive or private, and GBIF’s publication of this data likely has no significant impact for these subjects. The guidance we have received is that the combination of public and scientific interests likely adds ‘weight’ to GBIF’s legitimate interests when weighed against the interests or fundamental rights of the data subject in the balancing test.

Recall however that there are three classes of collections in GRSciColl, one of which is “personal”. See for example: https://www.gbif.org/grscicoll/collection/a7344e6c-a163-47b1-a124-f7a555e1a507. I assume that’s Robert’s home address.

Even ‘personal’ email addresses, phone numbers and mailing addresses are considered ‘ordinary data’ under GDPR. There is nothing innately sensitive or private about them, and we have been advised that our processing of such data is unlikely to have significant negative impacts for individual subjects.

The case of a personal collection is an interesting one. But our guidance still suggests that, because the data will likely have already been either made publicly available by the individual himself/herself or provided to a data publisher with the purpose of publishing the data in GBIF and other public databases, he/she is likely to have a reasonable expectation of, and possibly even an interest in, sharing such data for documentation and research purposes. We have been advised that we can make similar assumptions regarding the motivation of other individuals attaching their names (and roles) to records.

Would you care to join in and help us look into and fix such issues, especially if you have insight into specific collections like this one? Very glad to see someone dig into the details of GRSciColl—it’s precisely what we need to improve the quality of all the data, whether it’s on institutions, collections or collections staff.

I do not know Robert nor his collection, though it’s entirely possible that I’d know collections and collectors. I’d be pleased to help if the data were in Wikidata because I am interested in the connection between institutionCode/collectionCode and Ringgold/GRID/ISNI/ROR/etc. through ORCID. Wikidata appears to be the best place to help broker the many identifiers for organizations.


@kcopas Would it not make sense for GBIF and, say, iDigBio and/or DiSSCo, to organise a small hackathon to thrash out some of these issues and see what can be built? I’m getting the sense from this conversation that there’s some degree of talking past each other as we’re coming at this from different directions and with different expectations.


That is the intention @rdmpage


We’ve gone quiet. Does this mean we’ve exhausted the topic? I still think about this and I believe there’s a workable solution. It’ll take a lot of work, but if we want people to maintain contact info., codens, details about their collections, we’ll have to give them a reason for doing it. First, some observations:

  1. Your annual science review is awesome.
  2. You’re now successfully tracking use of GBIF-mediated data via downloads w/ DataCite DOIs.
  3. Museums struggle to illustrate impact. Many have their own manual or semi-automated article-tracking initiatives to demonstrate that their collections are being used in new science. These often do not reference individual specimens because, as is the case for most papers in taxonomy, type specimens are not yet formally accessioned.

Perhaps we can put all this together by providing a service. GBIF wants up-to-date metadata about collections, while organizations that publish specimen data want near real-time signals of impact. Through clever use of Google Scholar, Plazi and others, why not build something rich and equivalent to Altmetric or ImpactStory? Some of us have tried to build tools that listen for new science that makes mention of collections (e.g. https://github.com/mus-nature-ca/museum-tracker). Can we do this at scale & roll it into a core function of GRBio? Use your ORCID log-in, have those editable fields from GRBio as you’d need, but then DO something with it straight away.

Future refinements might generate simplified science reviews by pulling abstracts, keywords, etc. instead of ever-growing lists of papers. This would please collections & museums immensely.
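As a rough illustration of the "listen for mentions of collections" idea above, the sketch below indexes which articles mention which collection codes. It is not how museum-tracker or any existing service works: the article ids, the texts, and the plain whole-word matching are assumptions made only for the example, and a real service would need proper article harvesting and disambiguation of codes.

```python
import re
from collections import defaultdict


def mentions_by_collection(articles, codes):
    """Map each collection code to the sorted ids of articles that mention it.

    articles: dict of article id -> text (hypothetical input format).
    codes: iterable of collection codes to look for as whole words.
    """
    patterns = {c: re.compile(r"\b" + re.escape(c) + r"\b") for c in codes}
    index = defaultdict(set)
    for article_id, text in articles.items():
        for code, pattern in patterns.items():
            if pattern.search(text):
                index[code].add(article_id)
    return {code: sorted(ids) for code, ids in index.items()}


# Made-up article ids and texts, for illustration only.
articles = {
    "a1": "Specimens examined from CMN and NHMUK.",
    "a2": "Types are deposited in CMN.",
    "a3": "No collection codes are given here.",
}
print(mentions_by_collection(articles, ["CMN", "NHMUK"]))
```

A per-collection list of citing articles like this is the kind of signal that could be shown back to whoever maintains the collection's record, giving them a reason to keep it current.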


This sounds like a topic for a hackathon. Maybe we can have more hack less yack at Biodiversity Next?


Sounds great @rdmpage. Note that there are discussions going on about funding a meeting to discuss / hash out how this is all going to work. And we are reaching out to both GRBio and BCI (Roger Hyam) to have conversations about lessons learned, to help, in part, guide development of this future shared resource.


Might I also suggest that you are talking without the community of data providers? Where are the voices of the curators, collection managers, etc. who provide the raw data? Please make sure they are involved and appropriately funded for this work (because it will be work for them, no matter how the cake is sliced).


I second the need to see what data provider communities are doing. It seems like the Index Herbariorum already functions very nicely for botanists. It is built on EMu and not really linkable to other sources, but it seems really complete in coverage of herbaria and somewhat up to date on information. Another effort is “The Insect and Spider Collections of the World” (http://hbs.bishopmuseum.org/codens/), which I think is still the first place entomologists go to look up arthropod collections. We are trying to follow the Index Herbariorum with arthropod collections, starting with North America. But I think everyone is open to creating a central repository for all NHCs even if they retain their own databases.
