6. Robust access points and data infrastructure alignment

@SarahDavidson thanks for your insightful comments and questions.

Infrastructure here is broadly defined as a set of services, protocols, standards and software that the Natural Science Collections community needs to manage the data and research lifecycle (e.g., from field to curation, to publications and re-use). The storage infrastructure is one of the components in this ecosystem.

You have raised a good point about the need for new infrastructure, and in particular about storage. I will offer some comments based on our experience in DiSSCo, where we are envisioning a digitization-on-demand service that will create FAIR and actionable Digital Objects (see @hardistyar's blog post on "What is a Digital Specimen"). These objects will be the building blocks for new services (such as the European Loans and Visits System and the Specimen Data Refinery), and to be FAIR and actionable they will need to be stored somewhere (for instance, a DES repository). There also needs to be organisational support and commitment to maintain this repository and its associated services.

At the same time, yes, this repository needs to be integrated with collection management systems and other databases and services (such as GBIF, Catalogue of Life and ENA) through open protocols and APIs. We are also exploring use cases where institutions prefer integration over migrating to a new system. For example, one user story collected for DiSSCo describes a scenario of continuing to use a CMS while taking advantage of value-added features (such as data discovery and linking, as you noted). So there are use cases that will require new storage infrastructures (such as an image repository for a machine learning data pipeline) that might replace or extend existing ones, but yes, the end goal should be data discovery and linking across platforms, with a focus on FAIR and long-term sustainability.
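To make "FAIR and actionable" slightly more concrete, here is a purely illustrative sketch of the kind of information a stored Digital Specimen object might carry. The field names and identifiers are my own assumptions for illustration, not the actual DiSSCo/openDS schema:

```python
# Purely illustrative: field names and identifiers below are assumptions,
# not the actual DiSSCo/openDS schema.
digital_specimen = {
    "pid": "https://hdl.handle.net/20.5000.1025/ABC-123",  # hypothetical handle PID
    "midsLevel": 1,                                        # digitization completeness (MIDS)
    "physicalSpecimenId": "ABC.12345",                     # hypothetical catalogue number
    "institution": "https://ror.org/example",              # organisation ID (placeholder)
    "links": {                                             # typed links to external services
        "gbifOccurrence": "https://www.gbif.org/occurrence/0000000000",
        "enaSequence": "https://www.ebi.ac.uk/ena/browser/view/XX000000",
    },
}

def is_linkable(obj: dict) -> bool:
    """'Actionable' here only in the minimal sense of having a PID
    and at least one outgoing link to another system."""
    return bool(obj.get("pid")) and bool(obj.get("links"))

print(is_linkable(digital_specimen))  # True
```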

Yes! The framework for the minimal solution can come from existing standards initiatives like MIDS (Minimum Information about a Digital Specimen). Low-effort implementation, iterative improvements and API compliance can come from projects such as DINA; harvesting and linking solutions can come from projects such as BiCIKL https://twitter.com/Bicikl_H2020.

Currently, there is a variety of initiatives across the globe addressing the big issues of infrastructures and services in different ways. I think one goal of this consultation, and in particular of topic 6, is to see these connections and overlaps more clearly.


This. I feel we have done a very poor job in the past of documenting "who said it and why". In the Wiki community, trust can be built and assertions can be backed up with cited resources. Personal profiles like mine demonstrate areas of interest or expertise and level of editing experience. Authority control does not necessarily make the best data.


I agree. I think we keep inventing new, shiny things when we really just need to be supporting what we already have.


A lot! But the community needs help with this. PIDs can be difficult to understand, and getting institutions to pay for them is even more difficult. There are still those who believe the catalog number or barcode is enough. A low-cost registry recognized by the global community would go a long way toward getting institutions on board.

Thanks @sharif.islam for the responses. Based on that extra background, I'll leave a few points responding to the questions up top, drawing on our experience with the Movebank platform for animal-borne sensor data and our hopes for DES infrastructure.

Question 1 (core capabilities): In our case, we are looking for globally unique, persistent animal identifiers that users can assign to their data within our platform and that can also be used within other platforms, to allow archiving of diverse life history data across multiple existing relevant repositories. Requirements for our use case are:

  • global solutions: we are based within the EU but host data from around the world and assist in meeting data archiving requirements for multiple countries, so solutions specific to the EU or US are not sufficient
  • solutions relevant to digital data with no associated physical museum specimen, which account for a growing proportion of biodiversity data: for more see Kays et al. 2020, “Born-digital biodiversity data: Millions and billions”

Question 2 (pain points): I imagine that a challenge in implementing DES infrastructure is how to verify true uniqueness of PIDs, i.e., how to ensure that the same individuals and data are not registered multiple times. If a researcher deposits different data relating to the same animals within and across multiple repositories, who is responsible for registering the PID and for ensuring the same PIDs are used in all relevant repositories and datasets? And as third parties harvest, revise, and aggregate the data, how do we ensure that they do not register additional PIDs for the same individuals?
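One way to frame a possible mitigation is to make PID registration idempotent: a registry is consulted for an existing mapping from local identifiers to PIDs before a new PID is minted. The sketch below is a minimal illustration of that idea only; the registry, function names, and identifiers are all hypothetical:

```python
# Hypothetical sketch of idempotent PID registration: before minting, the
# registry is asked whether this record's local identifier already maps to
# a PID. All names and identifiers here are assumptions, not a real API.
import itertools

registry = {("movebank", "tag-4711"): "pid:animal/0001"}  # (source, local id) -> PID
_serial = itertools.count(2)

def register_animal(source: str, local_id: str) -> str:
    """Return the existing PID for this (source, local_id), or mint a new one."""
    key = (source, local_id)
    if key in registry:
        return registry[key]                  # reuse: no duplicate PID is minted
    pid = f"pid:animal/{next(_serial):04d}"   # stand-in for a real minting service
    registry[key] = pid
    return pid

print(register_animal("movebank", "tag-4711"))  # pid:animal/0001 (reused)
print(register_animal("movebank", "tag-9000"))  # pid:animal/0002 (newly minted)
```

This only pushes the problem down a level, of course: the hard part is agreeing on which (source, identifier) pairs refer to the same animal across repositories, which is exactly the social and governance question raised above.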

Questions 3-4 (adoption by existing initiatives): In the short and medium term, we would need a low-effort solution that allows us to maintain our existing data infrastructure while letting users optionally store globally recognizable animal PIDs within our database. We would adapt our existing APIs so that these data can be harvested and archived with relevant data, and as designated funding for further development becomes available, we could implement additional integrations to improve data FAIRness. Relying on another party (e.g., GBIF) to obtain PIDs on our behalf might be possible in the future, but it would apply to a limited subset of the data we store, and I am not sure how the other party would ensure PID uniqueness as described above.
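As a rough sketch of what that low-effort option could look like (the field names are my assumptions, not Movebank's actual schema), the existing record format would simply carry an animal PID whenever the data owner has assigned one, so harvesters can pick it up:

```python
# Hypothetical sketch: an existing record format optionally carries a
# user-assigned animal PID. Field names are assumptions, not Movebank's schema.
def to_harvest_record(row: dict) -> dict:
    record = {
        "individual-local-identifier": row["local_id"],
        "study-id": row["study_id"],
    }
    if row.get("animal_pid"):                 # optional, only if the user assigned one
        record["animal-pid"] = row["animal_pid"]
    return record

print(to_harvest_record({"local_id": "stork-22", "study_id": 42}))
print(to_harvest_record({"local_id": "stork-23", "study_id": 42,
                         "animal_pid": "pid:animal/0002"}))
```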

Is there a place to provide specific user stories or offer to serve as early adopters as part of this consultation?


@SarahDavidson you can find some juicy discussions that Arctos users have had over the years about various types of stable identifiers:

  • Parts
  • Organisms
  • Taxonomy
  • Localities and Events

Almost all of these digital things are shared by multiple institutions, which we can handle within Arctos, but once the data leaves Arctos, it is difficult to match it up with other collections and databases. Even within the Arctos community, we regularly create duplicate "organisms", localities, events, parts, and people. And sometimes we know we have part of something for which another part is held elsewhere, but we have no way of linking to the other record because their data is not linkable.

We consider our catalog record URLs to be GUIDs, which works for linking, but they are not truly persistent: records can be deleted (we provide redirects) or changed so completely as to be unrecognizable from what they were yesterday (as when a student incorrectly enters one object with the wrong catalog number). And I think what we have discovered is that our GUID is not enough, especially when it records "something collected somewhere at some time and identified as a taxon" but is really "multiple somethings" (skin, skeleton, liver, etc.), and the "something" cited in a publication is a DNA extraction from the liver of GUID. Make sense? We are struggling to keep up with appropriate links from a physical thing to the research in which it was involved AND to the thing that was originally collected.


Yes, this sounds very much like our situation. I can provide more detailed descriptions as a potential user of DES solutions, but I'm just not sure where best to put them or what format would be most useful.

Thanks, @SarahDavidson. Perhaps start a new Movebank-specific post here on the forum, outlining examples of some of the things you grapple with or would like to see relating to organism identifiers?

Natural Science Collections data are widely used in different disciplines. Data collection, curation, and dissemination involve various actors and systems. Are the current workflows enabling collaboration between data providers, curators, and users? If yes, please provide some examples. If no, what can be done better?

Thanks @trobertson. It looks like I don't have permission to start a new post (assuming you mean a new "topic"?). If you can create one, I can add to it; otherwise I'll post in the existing topics. I think the use case could perhaps expand to infrastructures dealing with living or wild 'specimens', including organisms that can be physically identified as known individuals, e.g., in zoos, botanical gardens, or protected areas, or by unique markings on wild organisms (whether a natural marking like a zebra's stripe pattern or a ring or sensor attached by humans). We could invite others to contribute to this, e.g., from the TDWG Machine Observations IG.

@jegelewicz If I may summarise and rewrite the above from an infrastructure design and service requirement gathering point of view, do the following items reflect what you described? Feel free to edit or update. (A rough sketch of these points as a data model follows the list.)

  1. There are discrete entities or objects (such as organisms, samples, events, localities, parts, publications) within a collection that need to be persistently identified and managed (a system-generated GUID is not enough).
  2. These entities need to be linked with other entities inside and outside the collections.
  3. The service/infrastructure needs to handle derived and associated data.
  4. The service should also be capable of providing data event notifications (such as new citations and publications) and automation tools (for example, finding related or duplicate items, creating links, and performing data integrity checks).
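Here is the promised sketch of how these points might look as a minimal data model. The entity types and link names are illustrative assumptions; a real implementation would resolve external PIDs (DOIs, etc.) against their own services rather than merely flagging them:

```python
# Rough sketch of points 1-4: persistently identified entities with typed
# links, plus a simple link-integrity check. Names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Entity:
    pid: str                                   # persistent identifier (point 1)
    entity_type: str                           # "organism", "part", "locality", ...
    links: dict = field(default_factory=dict)  # link type -> target PID (points 2-3)

def check_link_integrity(entities: dict) -> list:
    """Report links pointing at PIDs unknown to this registry (part of point 4)."""
    return [f"{e.pid} --{ltype}--> {target}: unresolved here"
            for e in entities.values()
            for ltype, target in e.links.items()
            if target not in entities]

organism = Entity("pid:organism/1", "organism")
liver = Entity("pid:part/7", "part", {"part_of": "pid:organism/1"})
extract = Entity("pid:sample/9", "sample", {"derived_from": "pid:part/7",
                                            "cited_by": "doi:10.0000/example"})
entities = {e.pid: e for e in (organism, liver, extract)}
print(check_link_integrity(entities))  # flags the DOI, which lives outside this registry
```

This mirrors the skin/skeleton/liver example above: the thing cited in a publication is the extract, which links back through the part to the originally collected organism.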

+1 for these points.
There are technical solutions that can handle this. Of course, there are implementation and maintenance challenges, but I think the biggest challenge might be social and political. On the technical end, there could be event brokers, APIs and data harvesters that "subscribe" and "listen" for new entities in repositories and datasets and provide an at-scale de-duplication service (using, for example, a probabilistic data structure like a Bloom filter). Instead of one registration service or authority, there could be multiple interoperable services. Another example is how Google Scholar finds similar or identical publications across multiple repositories.
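To make the Bloom filter idea concrete, here is a minimal, self-contained sketch, not tied to any particular broker or repository. The key property: a miss guarantees the entity has not been seen, while a hit only means "possibly seen before" and should trigger a definitive lookup in the authoritative registry:

```python
# Toy Bloom filter for at-scale "have we seen this entity before?" checks.
# False positives are possible; false negatives are not.
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("organism:example-entity-1234")          # hypothetical entity key
print("organism:example-entity-1234" in seen)     # True -> verify with full lookup
print("organism:unknown-0001" in seen)            # False -> definitely new
```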

I think it is more challenging to train researchers and other staff members to adhere to data management policies, and to achieve policy interoperability across different institutions and data repositories. Journal publishers and editors need to be involved in these processes as well.


I think that is a pretty good summary!