6. Robust access points and data infrastructure alignment

Moderators: Jose Fortes, Tim Robertson, Sharif Islam, David Martin

Background

Robust access points and data infrastructure are essential for the generation, archival, dissemination and analysis of digital specimen data. An infrastructure based on the Digital Extended Specimens framework not only needs to be reliable and robust but also needs to be used and adopted by the user community. At the same time, we are envisioning this new framework within the existing data infrastructures and practices around the world. These contexts will help us appreciate local data practices (e.g., field and lab work), data sharing and collaboration issues (e.g., data sharing among multiple research groups or organizations), short- and long-term data curation and storage practices (e.g., the role of data repositories), and the role of national, regional, global, and thematic aggregation. Support for institutions of varying sizes and capabilities to deliver data (over the shorter and longer term) into such an infrastructure also needs to be considered. Furthermore, the success of any global initiative around digital specimen data depends on how well the infrastructure can accommodate new capabilities such as community curation, unambiguous attribution, and provenance management services, among others.

Besides data practices, current contexts will help us understand how data travel from production to reuse (the “data journey” described by Sabina Leonelli). This data journey can range from moving individual digital objects from one repository to another to aggregating and publishing datasets (see also the 2017 article by Beckett Sterner and Nico Franz). With a nuanced view of the different digital artefacts involved, the data journey can provide guidelines for the bigger transformation that must happen to accommodate the Digital Extended Specimens framework. Given the diverse nature of the data classes and users involved, a balance needs to be struck between generalized and specialized use cases. As we move toward new solutions and capabilities, we should also map current professional practices to new kinds of analysis, expertise and collaboration within roles such as metadata specialist, data manager, data scientist, research software engineer and other specialists who work with data and cyberinfrastructure. Lastly, a truly global infrastructure needs to take a nuanced view of the financial and political situations in different parts of the world when it comes to support and funding models for long-term sustainability.

The goal of this category is to discuss how, starting from our current context, we can move toward a global data infrastructure based on the Digital Extended Specimens framework. Even though the technical aspects are the primary focus, social and financial aspects are also relevant for this section.

A presentation (1.1 MB) was created to provide background information on the FAIR Digital Object concept and to highlight architectural and application layers that can materialise the vision around Digital Extended Specimens.

A note on terms

Terms such as “cyberinfrastructure”, “data infrastructure” and “research infrastructure” are used in relation to digital infrastructures providing services to the scientific community. Many such terms come from funding programs such as those of the U.S. National Science Foundation (NSF), from European efforts focusing on research infrastructures, such as the European Strategy Forum on Research Infrastructures (ESFRI) and the European Open Science Cloud (EOSC), and from the Research Data Alliance (RDA).

Questions to promote discussion

  1. What are the core capabilities (such as data management, data analysis) the infrastructure should provide?
  2. What are the current pain points (e.g., storage needs, scalability, data integrity, bandwidth)?
  3. There are various approaches to how applications, such as collection management systems, can participate in an open Digital Extended Specimen based solution. These include full native support in a local installation (i.e., implementing and running the appropriate APIs), use of shared systems that provide the functionality (e.g., using cloud CMSes), or synchronising with another party that provides the necessary data access services on your behalf (e.g., with DiSSCo, iDigBio, GBIF or others). We welcome discussion around deployment aspects and the level of adoption of the Digital Object Architecture the community foresees within the tools used.
  4. Being able to integrate with the existing tools and data networks in use by institutions is critical for adoption. What are the constraints, and what is the desire and capacity to adapt?
  5. Several emerging technologies and protocols may provide good frameworks for deploying infrastructure supporting the digital specimen vision. Notable mentions include blockchain to record the “change events” in the specimen lifecycle, and the Digital Object Architecture with its associated Digital Object Interface Protocol (a minimal resolution sketch follows this list). We encourage open discussion about the merits of these, and others.
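
As a point of reference for question 5, here is a minimal sketch (Python, for illustration only) of resolving a persistent identifier through the public Handle System HTTP proxy, which underlies the Digital Object Architecture. The handle used is a hypothetical placeholder, not a real Digital Extended Specimen identifier, and a production deployment would more likely speak the Digital Object Interface Protocol or a dedicated specimen API.

```python
# Minimal sketch: resolving a persistent identifier via the public Handle System
# HTTP proxy (https://hdl.handle.net/api/handles/<prefix>/<suffix>).
# The handle below is a hypothetical placeholder, not a real specimen ID.
import requests

HANDLE_PROXY = "https://hdl.handle.net/api/handles/"
EXAMPLE_HANDLE = "12345/des-example"  # hypothetical handle for illustration

def resolve_handle(handle: str) -> dict:
    """Return the handle record (a list of typed values) as parsed JSON."""
    response = requests.get(HANDLE_PROXY + handle, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    record = resolve_handle(EXAMPLE_HANDLE)
    # A handle record is a list of typed entries (e.g., URL, metadata pointers).
    for entry in record.get("values", []):
        print(entry.get("type"), "->", entry.get("data", {}).get("value"))
```

The returned record is a list of typed values (URLs, metadata pointers, and so on), which is the kind of indirection a Digital Extended Specimen resolution service could build on.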

Information resources


Global or local solutions for data storage?

Researchers and institutions dealing with natural science collections face multiple challenges in storing, indexing, and sharing their data. As different types of data are involved (text, media, omics), the solution requires a constellation of tools and repositories. Repositories such as GBIF, Plazi, and MorphoBank provide valuable services. There are also domain-agnostic repositories such as Figshare, Zenodo, and Dryad.

However, as several recent articles (here and here) have pointed out, when new species are described, not all relevant information is provided by the researchers or stored in long-term repositories.

Should we work toward a global solution (for example, similar to the INSDC databases that store and mirror molecular data) or toward various interoperable data repositories that can ensure data are findable, accessible, interoperable, and reusable (FAIR)?
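
To ground this question, the sketch below (Python, illustration only) shows how aggregated specimen data are already accessible programmatically through GBIF’s public occurrence API; the endpoint and parameter names follow the documented v1 REST interface, while the query values are arbitrary examples.

```python
# Minimal sketch: retrieving specimen-based occurrence records from the GBIF
# public API (https://api.gbif.org/v1/occurrence/search). Query values are
# arbitrary examples; see the GBIF API documentation for the full parameter set.
import requests

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"

params = {
    "scientificName": "Puma concolor",      # example taxon
    "basisOfRecord": "PRESERVED_SPECIMEN",  # restrict to physical specimens
    "limit": 5,
}

response = requests.get(GBIF_SEARCH, params=params, timeout=30)
response.raise_for_status()

for record in response.json().get("results", []):
    print(record.get("key"),
          record.get("institutionCode"),
          record.get("catalogNumber"))
```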

I would like to pitch Wikidata here. As the open data repository of the Wikimedia Foundation, closely aligned with Wikipedia, it provides a valuable infrastructure for public data. By design, Wikidata is scopeless, which allows interoperability of biodiversity data with other domains of society.

There are already some biodiversity data initiatives active on Wikidata. In 2018 we started WikiProject Biodiversity, which began as WikiProject iNaturalist. During a field trip on the margins of a Wikimedia conference in 2018, it became apparent that the Wikidata and iNaturalist communities overlap considerably and that mutual reuse of data between the two platforms would be straightforward, so we took the initiative to look into closer integration. The project was later renamed WikiProject Biodiversity to recognise the value of other resources as well, most notably GBIF.

It would be interesting to explore the role of Wikidata as a shared authority file for biodiversity data, as well as Wikibase, the software stack upon which Wikidata is built, as an available platform for institutional knowledge graphs.
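
As a small illustration of the “shared authority file” idea, the following sketch (assuming only the public Wikidata Query Service endpoint) looks up a taxon by its scientific name (property P225) and returns the matching Wikidata item together with its GBIF taxon ID (P846); the taxon chosen is just an example.

```python
# Minimal sketch: using Wikidata as a shared authority file by looking up a
# taxon name (property P225) and returning its item plus GBIF taxon ID (P846).
import requests

WDQS = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?item ?itemLabel ?gbifID WHERE {
  ?item wdt:P225 "Puma concolor" .
  OPTIONAL { ?item wdt:P846 ?gbifID . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

response = requests.get(
    WDQS,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "des-authority-file-demo/0.1"},  # WDQS expects a descriptive UA
    timeout=60,
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["item"]["value"],
          row["itemLabel"]["value"],
          row.get("gbifID", {}).get("value"))
```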

Further reading

I think we have two questions here: 1) do we need infrastructures where research data can be stored for the long term, and 2) should such infrastructures be built centrally or in a distributed way?

The answer to 1) is yes, and there are (mainly national) programs that have started to provide such infrastructures. In Germany, for example, this is the NFDI and specifically NFDI4Biodiversity, which provides a mix of centrally operated and decentralized services.

To 2): I do not think the question can be answered in general terms. In collection data, for example, we see a transition from local (often in-house developed) collection management systems to communities that jointly operate databases. The herbarium system JACQ, for instance, is used by about 50 collections that enter their data into one (!) shared database and maintain it together. Using solutions specific to, e.g., particular organism groups makes sense here, as workflows, designations, etc. may differ.

For certain types of data, it still makes sense to be able to use centralized repositories. For example, storing image data is too difficult and expensive for many institutions, so centrally operated services would certainly be well received.

Yes, Wikidata is definitely an important piece for data integration. There are already successful initiatives such as Bionomia (https://bionomia.net/), created by @dshorthouse, which links specimens with people (using Wikidata identifiers for collectors).
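
As a small illustration of the kind of people-to-identifier linking Bionomia performs, the sketch below uses the public Wikidata wbsearchentities API to look up candidate items for a collector name; the name is an arbitrary example, and real reconciliation still requires human review of the matches.

```python
# Minimal sketch: looking up candidate Wikidata items for a collector name via
# the public wbsearchentities API. The name is an arbitrary example; real
# reconciliation (as done in Bionomia) requires human review of the matches.
import requests

WD_API = "https://www.wikidata.org/w/api.php"

params = {
    "action": "wbsearchentities",
    "search": "Alexander von Humboldt",  # example collector name
    "language": "en",
    "format": "json",
    "limit": 5,
}

response = requests.get(WD_API, params=params, timeout=30)
response.raise_for_status()

for hit in response.json().get("search", []):
    print(hit["id"], "-", hit.get("label"), "-", hit.get("description", ""))
```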

Can you elaborate on the idea of a “shared authority file”? Do you mean core/authoritative data that a collection-holding institution (such as a natural history museum) would provide, which can then be shared and re-used? What could be stored in this file?