6. Robust access points and data infrastructure alignment

Moderators: Jose Fortes, Tim Robertson, Sharif Islam, David Martin

Background

Robust access points and data infrastructure are important for the generation, archival, dissemination and analysis of digital specimen data. The infrastructure based on the Digital Extended Specimens framework not only needs to be reliable and robust but also needs to be used and adopted by the user community. At the same time, we are envisioning this new framework within the existing data infrastructures and practices around the world. These contexts will help us appreciate local data practices (e.g., field and lab work), data sharing and collaboration issues (e.g., data sharing among multiple research groups or organizations), short and long term data curation and storage practices (e.g., the role of data repositories) and the role of national, regional, global, and thematic aggregation. Support for institutions of varying sizes and capabilities to deliver data (for shorter and longer-term) into such infrastructure also needs to be considered. Furthermore, the success of any global initiative around digital specimen data depends on how well the infrastructure can accommodate new capabilities such as curation by the community, providing unambiguous attribution, and provenance management services – among others.

Besides data practices, current contexts will help us to get a view of how the data journey happens from production to reuse (see data journey described by Sabina Leonelli). This data journey can range from moving individual digital objects from one repository to another to aggregating and publishing datasets (see also 2017 article by Beckett Sterner and Nico Franz). With a nuanced view of different digital artefacts, the data journey can provide guidelines for the bigger transformation that must happen to accommodate the Digital Extended Specimens framework. Given the diverse nature of the data classes and users involved, a balance needs to be achieved to understand generalized and specialized use cases. As we move toward new solutions and capabilities we should also map current professional practices to new kinds of analysis, expertise and collaboration within roles such as metadata specialist, data manager, data scientists, research software engineer and other specialists that work with data and cyberinfrastructure. Lastly, a truly global infrastructure needs to have a nuanced view of specific financial and political situations in different parts of the world regarding support and financial models for long-term sustainability.

The goal of this category is to discuss how, from our current context, we can go toward a global data infrastructure based on the Digital Extended Specimens framework. Even though the technical aspects are the primary focus, social and financial aspects are also relevant for this section.

A presentation (1.1 MB) was created to provide background information on the FAIR Digital Object concept and to highlight architectural and application layers that can materialise the vision around Digital Extended Specimens.

A note on terms

Terms such as “cyberinfrastructure”, “data infrastructure” and “research infrastructure” are used in relation to digital infrastructures providing services to the scientific community. Many such terms come from funding programs such as the U.S. National Science Foundation (NSF), from European efforts focusing on research infrastructures, such as the European Strategy Forum for Research Infrastructures (ESFRI) and the European Open Science Cloud (EOSC), and from the Research Data Alliance (RDA).

Questions to promote discussion

  1. What are the core capabilities (such as data management, data analysis) the infrastructure should satisfy?
  2. What are the current pain points (e.g., storage needs, scalability, data integrity, bandwidth)?
  3. There are various approaches to how applications, such as collection management systems, can participate in an open Digital Extended Specimen based solution. This could include full support natively in a local installation (i.e. implementing and running the appropriate APIs), use of shared systems that provide the functionality (e.g. using cloud CMSes), or synchronising with another party to provide the necessary data access services on your behalf (e.g. with DiSSCo, iDigBio, GBIF or others). We welcome discussion around deployment aspects and what level of adoption of Digital Object Architecture the community foresees within the tools used.
  4. Being able to integrate with the existing tools and data networks in use by institutions is critical for adoption. What are the constraints, the desire and capacity to adapt?
  5. Several emerging technologies and protocols may provide good frameworks for deploying infrastructure supporting the digital specimen vision. Notable mentions include blockchain to record the “change events” in the specimen lifecycle, and Digital Object Architecture and its associated Digital Object Interface Protocol. We encourage open discussion about the merits of these, and others.

Information resources


Global or local solutions for data storage?

Researchers and institutions dealing with natural science collections face multiple challenges in storing, indexing and sharing their data. As different types of data are involved (text, media, omics), the solution requires a constellation of tools and repositories. Repositories such as GBIF, Plazi and Morphobank provide valuable services. There are also domain-agnostic repositories such as Figshare, Zenodo and Dryad.

However, as several recent articles (here and here) have pointed out, when new species are described, for instance, not all relevant information is provided by the researchers or stored in long-term repositories.

Should we work toward a global solution (for example, similar to INSDC databases that store and mirror molecular data) or various interoperable data repositories that can ensure data are findable, accessible, interoperable, and reusable (FAIR)?

I would like to pitch Wikidata here. As the open data repository of the Wikimedia Foundation, closely aligned with Wikipedia, it provides a valuable infrastructure for public data. By design, Wikidata is scopeless, which allows interoperability of biodiversity data with other domains of society.

There are already some initiatives regarding biodiversity data active on Wikidata. In 2018 we started Wikiproject Biodiversity, which began as Wikiproject iNaturalist. During a field trip in the margins of a Wikimedia conference in 2018, it became apparent that there is an overlap between the Wikidata and iNaturalist communities. Mutual reuse of data from both Wikidata and iNaturalist seems straightforward, so we took the initiative to look into closer integration between the two platforms. The project was later renamed Wikiproject Biodiversity to recognise the value of other resources as well, most notably GBIF.

It would be interesting to explore the role of Wikidata as a shared authority file for Biodiversity Data as well as Wikibase upon which Wikidata is built, as an available software stack for institutional Knowledge Graphs.

Further reading


I think we have two questions here: 1) do we need infrastructures where research data can be stored for the long term, and 2) should such infrastructures be built centrally or distributed.

The answer to 1) is yes and there are (mainly national) programs that have started to provide such infrastructures. In Germany, for example, this is NFDI and specifically NFDI4BioDiversity. NFDI4Biodiversity provides a mix of centrally operated and decentralized services.

To 2): I do not think that the question can be answered in general terms. We see, for example, in collection data, a transition process from local (often in-house developed) collection management systems to communities that jointly operate databases. For example, the herbarium system JACQ is used by about 50 collections that enter into one (!) database and maintain the data together. Using specific solutions for e.g. organism groups makes sense here, as workflows, designations etc. may differ.

For certain types of data, it still makes sense to be able to use centralized repositories. For example, storing image data is too difficult and expensive for many institutions, so centrally operated services would certainly be well received.

Yes, wikidata is definitely an important piece for data integration. There are already examples of successful initiatives such as Bionomia (https://bionomia.net/) created by @dshorthouse that links specimens with people (using wikidata identifiers for collectors).

Can you elaborate on the idea of “shared authority file”? Do you mean a core/authoritative data that a collection holding institute will provide (such as a natural history museum) that can be shared and re-used? What could be stored in this file?

Instead of “shared authority file”, wikidata is perhaps better described as a broker of identifiers. That said, once entities are maintained in wikidata with their numerous links to external resources that circumscribe whatever are the concepts, “Q” numbers are attractive identifiers in their own right, especially if those external resources disappear. Shared use of identifiers across resources might be facilitated by a wikidata model that allows for cross-walks to be created and then used. There must evidently be motivation and rationales for creating such cross-walks & in that respect, it need not matter (much) what are the technologies, provided these & the data are open and can be queried without restriction.

There is the question of scale. While Wikibase/Blazegraph are awesome, are these capable of 1.8B occurrence records (+ versioning) if the full breadth of, say, GBIF were pumped into these technologies & refreshed on-demand? Our specimen data experience both administrative and contextual decay and require refresh from source to remain relevant. Periodic, wholesale refresh (= update, replace) of 1.8B records in a wikidata model is an interesting challenge while our community comes to grips with what is an identifier for a specimen.
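The broker-of-identifiers idea can be sketched in a few lines. In the snippet below, the Q numbers and property labels are invented for illustration (not real Wikidata content); the point is that two external resources sharing no common key can still be joined via the item that links to both:

```python
# Sketch: a Wikidata-style "broker of identifiers". Each item (Q number)
# carries cross-links to external resources, so two datasets that share
# no common key can be joined via the item. All Q numbers and property
# labels below are invented for illustration.

items = {
    "Q1000001": {"GBIF taxon ID": "2435098", "iNaturalist taxon ID": "41963"},
    "Q1000002": {"GBIF taxon ID": "5219404", "iNaturalist taxon ID": "42069"},
}

def build_crosswalk(items, from_prop, to_prop):
    """Map identifiers of one external resource to another via the shared item."""
    return {
        links[from_prop]: links[to_prop]
        for links in items.values()
        if from_prop in links and to_prop in links
    }

gbif_to_inat = build_crosswalk(items, "GBIF taxon ID", "iNaturalist taxon ID")
print(gbif_to_inat["2435098"])  # -> 41963
```

A real cross-walk would of course be driven by live queries against the Wikidata SPARQL endpoint rather than a hand-built dictionary, but the joining logic is the same.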


About two or three years ago we asked this to Wikidata people, I think at an RDA plenary, and at that time the answer was: no. Wikidata would not scale with its current implementation to the current 1.8B occurrence records, and holding each and every occurrence record would not align well with Wikibase goals. Likewise, it would not be the solution to hold all the estimated 3B specimen objects in the world. It would dwarf the fewer than 100M data items currently in Wikidata. It could be an interesting solution though for a selection of specimens with special value, such as type specimens.

What would you say the main reasons for successful adoption of JACQ? Easy to use submission system? No need to maintain local systems for the collection holding organisations?

How do you think systems like JACQ can be part of the Digital Extended Specimen framework? The annotation examples from the CETAF Botany pilot are probably going in the direction of extending and linking out.

About 40 institutions with collections from art to zooplankton share one (!) database at Arctos. This is completely doable and doesn’t need to cater to anything specific. BUT it does take an engaged community to keep it working.


Maybe baby steps here? We could have a global shared agents (people) authority in Wikidata if we just entered all of the collectors, identifiers, preparators, donors, etc. there and then used the Q numbers for them to trade information. Then we move on to geography? standard part names? preservation types?

This is the way I have thought about using Wikidata for better communication between the many systems we use. For the most part, it seems to me that the biodiversity community is still uncomfortable with the idea of “anyone can edit”. Until we can get over that hurdle, I think that anything “wiki” will be a hard sell.

I tried to get Wikidata used for an organism ID (see Kianga) so that we could create good relationships between blood samples taken from this individual over many years, and it was frowned upon because “Kianga” wasn’t “ours”. This is a social issue we need to work through.

That is a very good question. Wikidata is certainly not at that size yet, so how it performs at that scale is still anyone’s best guess. Wikidata (or Wikibase) is better suited as a crowdsourcing or user platform for editing, and less as a platform to host linked data just as is. This user role of Wikidata comes with a price: redundancy. Wikidata items are stored as blobs in a relational database, which in turn is copied into an RDF structure stored in Blazegraph, and within this RDF structure there is again redundancy between the full statements and the truthy statements.
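That redundancy can be made concrete. In the Wikidata RDF dump, each statement appears both as a “truthy” shortcut triple and as a full statement node that repeats the value and carries references/qualifiers. The sketch below (with illustrative item and statement-node names) counts the triples a single statement contributes before any qualifiers are added:

```python
# Illustrative sketch of the Wikidata RDF redundancy described above:
# one statement is emitted both as a truthy triple (wdt:) and as a full
# statement node (p:/ps:) that repeats the value and holds references.
# The statement-node naming scheme here is simplified for illustration.

def expand_statement(subject, prop, value, references=()):
    """Return the triples one statement contributes to the RDF dump."""
    stmt = f"{subject}-stmt-{prop}"          # simplified statement node id
    triples = [
        (subject, f"wdt:{prop}", value),     # truthy shortcut triple
        (subject, f"p:{prop}", stmt),        # link to the full statement
        (stmt, f"ps:{prop}", value),         # value repeated on the statement
    ]
    for ref in references:
        triples.append((stmt, "prov:wasDerivedFrom", ref))
    return triples

triples = expand_statement("wd:Q42", "P31", "wd:Q5", references=["ref1"])
print(len(triples))  # one referenced statement -> 4 triples, before qualifiers
```

At 1.8B occurrence records, each carrying many statements, this multiplication factor is exactly why the storage question matters.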

Occurrence records already exist, so we could argue that this user-interface role is less of a requirement; a core RDF store (e.g. GraphDB, Virtuoso, Stardog, etc.) might be more appropriate here. Wikidata and other Wikibase systems could treat this core triple store as a backbone.

The question now is whether a core RDF store can host 1.8B records. Resources like UniProt suggest so.

I shared this thread on Twitter to find views on storing this size of records. This led to some follow-up questions.


Considering the topic, I suggest talking to the Plazi developer team, as they have provided RDF via their API for a long time and should be well informed on the potential size.

The approach I would investigate first is: can we use Apache Drill to query the existing provided Parquet files? If yes, can we use a SPARQL-to-SQL mapping to query via the SPARQL->SQL->Parquet route? I think this is possible, e.g. [“One size does not fit all: querying web polystores”](https://aran.library.nuigalway.ie/bitstream/handle/10379/14919/polyweb-ieee-access.pdf).

This could be an economic way to provide the capabilities of SPARQL for ad-hoc analytics without investing in a huge new infrastructure.
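To make the SPARQL->SQL->Parquet idea concrete, here is a minimal Python sketch that translates a single SPARQL basic graph pattern into SQL over an assumed flat `triples` table (subject/predicate/object columns, as an engine like Apache Drill could expose over Parquet). A real mapper would also handle joins across patterns, prefix expansion and value escaping:

```python
# Minimal sketch: translate one SPARQL triple pattern into SQL over a
# flat triples table. Table and column names (triples, s, p, o) are
# assumptions for illustration; "?x" marks a SPARQL variable.

def triple_pattern_to_sql(s, p, o, table="triples"):
    """Variables become SELECTed columns; constants become WHERE filters."""
    select, where = [], []
    for col, term in (("s", s), ("p", p), ("o", o)):
        if term.startswith("?"):
            select.append(f"{col} AS {term[1:]}")
        else:
            where.append(f"{col} = '{term}'")
    sql = f"SELECT {', '.join(select)} FROM {table}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql

print(triple_pattern_to_sql("?occ", "dwc:scientificName", "Puma concolor"))
# SELECT s AS occ FROM triples WHERE p = 'dwc:scientificName' AND o = 'Puma concolor'
```

Chaining several such patterns becomes a self-join of the triples table, which is where a columnar engine over Parquet would earn its keep.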

PS. For myself, I would love to be able to find occurrences with images given a taxonomic identifier, where I can do a bit of sanity checking on the fly to make sure the mapping is good, as well as do genome-to-habitat analysis, e.g. given these enzymes/proteins, what are the co-occurrences of mutations on the active site with the habitat of the species from which the protein sequence is derived.

PPS. Plazi RDF looks like this: +/-200 triples for the Treatment, plus many more repeated for ease of use but not actually part of this specific record. So at first impression the RDF looks about 30 times larger than it actually would be.


I would suggest considering Zenodo as part of a global solution.

  • Zenodo itself is part of CERN, which maintains a very robust data infrastructure, and is probably as long-term as is possible at the moment.
  • Zenodo is already highly used by the biodiversity community in general, and the Biodiversity Literature Repository community alone is the largest provider of data (475,000 deposits (25% of the total), of which 310,000 are images, 51,000 publications and 110,000 taxonomic treatments)
  • Data in Zenodo is highly re-used by GBIF
  • Zenodo is used by publishers like Pensoft, MNHN and EJT to deposit figures and taxonomic treatments
  • Zenodo is a general repository, but at the same time it can be highly customized using metadata linking to external vocabularies, such as in our case DwC, or OBO for biotic interactions (see e.g. Treatments or biotic interactions); links to institutions, specimen codes or accession codes could be added as custom metadata with the respective links
  • All the deposits have DataCite DOIs, either minted by Zenodo or reusing existing DOIs
  • Zenodo is supported by CERN and free for the users
  • Zenodo is, through the Biodiversity Literature Repository, part of an accepted EU Research Infrastructure, as well as of the recently funded “super” infrastructure BiCIKL, linking omics, specimen, literature and taxonomic name databases
  • Zenodo holds ca 200,000 digital herbarium sheets deposited during the ICEDIG DiSSCo project phase
  • Zenodo is widely used in the science community
  • Applications can be built based on Zenodo, such as https://ocellus.info/ to search for images; GloBI, for example, harvests OBO metadata for biotic interactions

From the literature point of view this is our choice because it provides long-term stability, flexibility, rich (FAIR) metadata and a robust API to batch-upload, annotate and crosslink deposits, building the solution needed for our processes to make data caught in publications open and FAIR.


I’m curious what role people see Microsoft and Google playing here. Google Earth Engine has been hosting and distributing a lot of remote sensing data. More recently Microsoft announced their ‘Planetary Computer’, which I’m not sure I fully understand yet, but seems like it could be relevant to this conversation.

cheers - Roland

Welcome, Roland.

At GBIF, we see the public clouds as having a very important enabling role for researchers, especially when specimen/occurrence data is being mixed with other content, such as remote sensing data. What started as a discussion here has now led us to put monthly views of GBIF occurrence data (observations/specimens with CC0 and CC-BY) onto both the MS Planetary Computer and Amazon as Open Datasets; we’re targeting Google next (GCS, BigQuery and Earth Engine) and others will likely follow. You can read more about it here, and there is a discussion on whether CC-BY-NC should be included.

I’d describe this as being in its infancy, but I anticipate that for things like large-scale research questions and enabling the building of machine learning models this route will become the norm. It’s not a solution for the management of primary data, but for questions where e.g. a weekly/monthly view is sufficient (i.e. most research use), we think it fits well.

“Anyone can edit” is different, of course from “any appropriately authorized person can edit”. I’d suggest the community doesn’t really want the former model but the latter one, which is increasingly used now.

The simplest means by which a person (or machine) becomes appropriately authorized is when one person (or machine) grants permission to another based on a criterion like ‘they’re a known expert’. But that isn’t the only way. Power to do something can be accrued by other means, such as in the Apache Foundation meritocracy model.

Extending and enriching specimen data is a community activity, open to contributions from all corners, with different control mechanisms, such as authorized direct editing or open annotation possible. The converged Digital extended Specimen (DS) concept addresses this by opening digital specimen data for contribution, improvement and curation by the wider community of experts whilst allowing the curating institution to retain control over the authoritative part of that data (what it is, where and when it was collected and who by).

Treating Digital extended Specimens as individual digital objects upon which authorized actions (computer operations) can be performed is an appropriate technical approach to this. It comes as close as possible to digitally replicating the acts associated with working with physical specimens - such as examination, identification, annotation, analysis, etc., as well as opening up new possibilities for new kinds of digital action.
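As a thought experiment, the split between an authoritative core controlled by the curating institution and community contributions with recorded provenance might look like the following sketch. All class, field and actor names here are purely illustrative, not part of any openDS specification:

```python
# Hypothetical sketch: a Digital extended Specimen as a digital object
# whose operations carry different authorization rules. The curating
# institution alone edits the authoritative core; any known contributor
# can annotate, and every annotation records its provenance.

from datetime import datetime, timezone

class DigitalSpecimen:
    def __init__(self, pid, authoritative, curator):
        self.pid = pid
        self.authoritative = dict(authoritative)  # what/where/when/by whom
        self.curator = curator
        self.annotations = []

    def update_authoritative(self, actor, field, value):
        """Only the curating institution may change the authoritative core."""
        if actor != self.curator:
            raise PermissionError(f"{actor} may not edit authoritative data")
        self.authoritative[field] = value

    def annotate(self, actor, text):
        """Open to the wider community; provenance is always recorded."""
        self.annotations.append({
            "actor": actor,
            "text": text,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

ds = DigitalSpecimen("20.5000/abc123", {"scientificName": "Puma concolor"}, "NHM")
ds.annotate("community-expert", "Compare with P. yagouaroundi material")
ds.update_authoritative("NHM", "scientificName", "Puma concolor")
```

In a real deployment the authorization check would of course be policy-driven rather than a single curator string, but the shape of the object model is the point.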


Responding to @aguentsch, @jegelewicz, @agosti, @sharif.islam, @trobertson it’s important also to think about the role of data storage solutions. Are they (like collection management systems, like GBIF) living, breathing systems that researchers, curators and others rely on for day-to-day access to up-to-date but constantly changing data, or are they (like Zenodo) repositories of snapshots of data (datasets) at specific moments in time i.e., archives. The former has a role to play in the management of collections and the sharing and use of data about those collections whereas the latter represents a permanent record of something of value to someone (and potentially to others).

Other kinds of storage, such as media management systems are somewhere in the middle. They can be both a long-term repository for and a day-to-day server of images (for example).

The Registry of Open Data on AWS (for example) performs a quite different role: that of ensuring that the data a computer program needs is close to the computational resources that will act on that data.

Any of these systems can exist on one or more levels – institutional, national, international, centralised, distributed, federated – and we must recognise that multiple scenarios will always coexist, with different institutions and initiatives making their own choices.

What’s important is to decide on, stimulate and encourage the widespread adoption of a small number of standards at the external interfaces of such systems; for example ‘Triple-eye Eff’ (iiif.io) in the case of image management and presentation, and openDS and MIDS in the case of open Digital extended Specimens and collection management systems. Such standards can be both harmonising in terms of the way users (persons or machines) interact with the stored data and relatively immune to underlying changes in technology over long periods of time.
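IIIF illustrates how a standard external interface decouples clients from storage technology: every compliant image server answers requests of the same fixed URL shape, `{region}/{size}/{rotation}/{quality}.{format}`. A small sketch, where the server base URL and image identifier are assumptions for illustration:

```python
# Build a IIIF Image API request URL. Any compliant server, whatever its
# backend, resolves this same path grammar, which is exactly the
# harmonising property argued for above. Server/identifier are made up.

def iiif_image_url(server, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Compose {server}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}."""
    return f"{server}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

url = iiif_image_url("https://images.example.org/iiif", "herbarium-sheet-0001")
print(url)
# https://images.example.org/iiif/herbarium-sheet-0001/full/max/0/default.jpg
```

Requesting a crop is just a different `region` (e.g. `"0,0,512,512"`); the client never needs to know whether the pixels live in a media management system or an archive.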

How is “appropriately authorized” defined? The benefit of “anyone can edit” is that, for data to be trustworthy and authoritative, it is essential that provenance is also provided. With merit-driven systems, by contrast, there is always the risk that provenance is omitted simply because we can/should trust the experts.
So I would actually argue the contrary. We need an “anyone can edit” model so that the context of the data is always made explicit, simply because nobody trusts everybody. The data should speak for itself, not the experts.


I’d like to better understand the following:

  1. Exactly what new infrastructure is needed and why? It’s not obvious to me that a new dedicated storage infrastructure, referred to in many posts, is required to implement the concept of a Digital Extended Specimen. As illustrated by the many examples posted of existing infrastructure, a key challenge will be to make the DES relevant and able to be integrated with existing initiatives/platforms/databases that house relevant data. In this case, do we need new infrastructure, or can we build on existing infrastructures? Are the end goals to provide data storage, eventually replace existing infrastructure, or flexibly enable data discovery and linking across platforms?

  2. What is the minimal solution that an existing infrastructure or platform could adopt to enable Digital Extended Specimens? New infrastructure should be able to allow very low-effort implementation designed to realize later, iterative improvements as trust, interest, benefits, infrastructure, funding, training, etc. grow. Such improvements could be done by existing infrastructures (they build a new compliant API) or a DES-specific infrastructure (they harvest or enable linking of data from existing infrastructures without requiring further action on the part of the existing infrastructure).

For example, how much could we achieve by “simply” starting with a global registry for persistent identifiers (Topic 7), like ORCIDs, that infrastructures could use to register or discover DES PIDs and then apply within their databases? It seems to me that a globally recognized PID for an individual organism could enable linkage between data for the same organism stored in a museum, GBIF, GenBank, published papers, etc. (as shown in the graphic below from the Phase 1 presentation), as well as identify duplicated information across platforms. The PID can easily be added to any platform/database, thus supporting both low and high levels of adoption (it could simply be stored within the database, or further integrated and automated with other infrastructure). It can also support non-public data (Topic 8) by allowing a method for documenting/discovering that data for an organism exist in a repository, without requiring that the data themselves be public or interoperable.
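A registry of that minimal kind is small enough to sketch. In the hypothetical Python below (PIDs, platform names and local identifiers are all invented for illustration), platforms register that they hold data for an organism PID, and non-public holdings remain discoverable without being accessible:

```python
# Minimal sketch of a PID registry for individual organisms. Platforms
# declare what they hold against a shared PID; resolving the PID reveals
# where data live, including non-public holdings, without exposing the
# data themselves. All names/identifiers are illustrative.

from collections import defaultdict

class PIDRegistry:
    def __init__(self):
        self._links = defaultdict(list)  # PID -> records held elsewhere

    def register(self, pid, platform, local_id, public=True):
        """A platform declares it holds data about this organism."""
        self._links[pid].append(
            {"platform": platform, "local_id": local_id, "public": public})

    def resolve(self, pid):
        """Where does data about this organism live?"""
        return list(self._links[pid])

registry = PIDRegistry()
pid = "org:20.5000/kianga"                       # hypothetical organism PID
registry.register(pid, "museum-cms", "MAM-12345")
registry.register(pid, "GenBank", "MN908947")
registry.register(pid, "vet-records", "case-77", public=False)
print(len(registry.resolve(pid)))  # -> 3 holdings, one of them non-public
```

Even this low-effort level of adoption (store the PID, register the link) already enables cross-platform linkage and duplicate detection, with deeper integration left for later iterations.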
