Moderators: Andy Bentley, Jen Zaspel, Mike Webster, and Keping Ma
Summaries - 2. Extending, enriching and integrating data
Background
Biological collections are generating a wealth of data through digitization initiatives across multiple disciplines and taxonomic units. These data are published through collection management systems (CMSs) to numerous aggregators and through local portals, making them available to an ever-increasing end-user community. These disparate datasets are extremely valuable individually, but become more valuable still when data sources are integrated at various levels to extend, enrich and connect them. Integrated datasets not only facilitate novel exploration and discovery of collections data by a much larger audience, but also improve our ability to answer the pressing questions of our time, such as combating climate change and its effects on biodiversity, reducing the spread of disease, mitigating and controlling pandemics, and eradicating invasive species. The corollary is that this information provides attribution mechanisms and metrics that collections can use to advocate for their continued support and management. The extended specimen and open digital specimen concepts rely heavily on the integration of FAIR data.
To achieve this, we need data integration at many levels, drawing on multiple sources and the work of many different actors in the data pipeline. These actors include institutions, individual collections, data aggregators, publishers, accumulators of related data (GenBank, IsoBank, MorphoSource, Macaulay Library, etc.), suppliers of external sources of data (taxonomic, geographic, satellite, etc.), the broad research community (traditional taxonomic/phylogenetic, biodiversity, surveillance and monitoring, ecological, conservation, etc.) and observational datasets (eBird, etc.).
Some of this data integration is already mediated by existing systems. Within the CMSs employed by collections, all preparations of a specimen (tissue, voucher, skeleton, skin, cleared and stained, etc.) will often share the same catalog number and unique identifier, making the connections between them implicit. Implicit connections are also made between media files (images, video, sounds, field notes, etc.) and the collection objects to which they are directly linked through the data model. Similarly, some CMSs have fields or mechanisms to indicate relationships between objects across collections, such as tissue:voucher, host:parasite, plant:pollinator, predator:prey, commensals, etc. However, in numerous instances these connections are not implicit and have not been made: for example, where preparations have different catalog numbers, or where linked specimens are housed in separate, disparate datasets, whether at the same institution, at different institutions, or outside our immediate realm. In these circumstances, connections often need to be made outside the CMS by other means. Sometimes this is possible through the data alone, using common data found in collecting-event or locality fields; an example is the new GBIF clustering tool, which creates associations by matching various common fields of information (taxonomy, locality, collectors, date of collection, etc.). Additionally, there are connections between objects in the collection and the products of research that reference those objects (citations, GenBank sequences, CT scans, images, etc.); these are more difficult to make, sometimes owing to a lack of common fields and to incorrect, incomplete or non-existent citation. These connections are important not only to promote reproducible research, but also to provide important metrics for collections attribution and advocacy.
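The field-matching approach described above can be sketched in a few lines. This is a deliberately simplified illustration, not the actual GBIF clustering algorithm (which uses more sophisticated, weighted comparisons); the record dictionaries and their values are hypothetical, though the field names are real Darwin Core terms.

```python
# Minimal sketch of linking specimen records across datasets by matching
# shared fields, in the spirit of the GBIF clustering tool. Records that
# agree on taxon, collector, date and locality are grouped as candidate
# duplicates or related preparations. Field names are Darwin Core terms;
# the example records themselves are invented.
from collections import defaultdict

MATCH_FIELDS = ("scientificName", "recordedBy", "eventDate", "locality")

def cluster_key(record):
    """Build a normalized matching key from the shared fields."""
    return tuple(str(record.get(f, "")).strip().lower() for f in MATCH_FIELDS)

def cluster_records(records):
    """Group records whose keys agree; unmatched singletons are dropped."""
    groups = defaultdict(list)
    for rec in records:
        groups[cluster_key(rec)].append(rec)
    return [group for group in groups.values() if len(group) > 1]

records = [
    {"occurrenceID": "inst-A:1", "scientificName": "Peromyscus leucopus",
     "recordedBy": "J. Smith", "eventDate": "1998-06-12",
     "locality": "Douglas Co., KS"},
    {"occurrenceID": "inst-B:77", "scientificName": "Peromyscus leucopus",
     "recordedBy": "j. smith", "eventDate": "1998-06-12",
     "locality": "Douglas Co., KS"},
    {"occurrenceID": "inst-A:2", "scientificName": "Sorex cinereus",
     "recordedBy": "L. Jones", "eventDate": "2001-07-03",
     "locality": "Cook Co., MN"},
]
clusters = cluster_records(records)
```

In this toy run the two Peromyscus records, held by different institutions under different catalog numbers, fall into one cluster despite differing capitalization, while the unrelated Sorex record is left unmatched. Production systems must additionally handle fuzzy matches (abbreviated collector names, locality variants), which is where most of the real difficulty lies.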
The final piece of the puzzle is connecting our data to external sources of data that add value and allow much broader questions to be answered, such as environmental, ecological, conservation, geographical, observation and other research data.
These connections require both technological and social solutions: an underlying cyberinfrastructure and connectivity mechanisms, as well as best practices and standards adopted by, and with buy-in from, the numerous actors involved in the data lifecycle.
Effective data integration also relies on a system of globally unique identifiers that unambiguously identify the various elements being integrated: institutions, collections, collection objects, collecting events, datasets, people, etc. A wide variety of unique identifiers is currently in use (GUIDs, LSIDs, PIDs, DOIs, ORCIDs, etc.) for these elements, with no mechanism for ensuring uniqueness or for aligning them to enable effective data integration. To avoid distraction from the present topic, discussion of persistent identifier schemes is deferred to a separate consultation at a later stage.
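The alignment problem described above starts with simply recognizing which identifier scheme a given string belongs to. The sketch below is illustrative only: it classifies identifiers by their well-known syntactic patterns, whereas a real integration system would also resolve each identifier against its registry to confirm it exists. The function name and pattern set are assumptions for this example.

```python
# Illustrative sketch: classify identifier strings by scheme using their
# published syntactic shapes (DOI, ORCID, UUID). Syntax checking alone
# cannot guarantee uniqueness or validity; real systems must also resolve
# identifiers against the issuing registry.
import re

PATTERNS = {
    "doi":   re.compile(r"^10\.\d{4,9}/\S+$"),
    "orcid": re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$"),
    "uuid":  re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-"
                        r"[0-9a-f]{4}-[0-9a-f]{12}$", re.I),
}

def identify_scheme(value):
    """Return the first scheme whose pattern matches the value, else None."""
    value = value.strip()
    # Strip common resolver prefixes so stored and resolvable forms align.
    for prefix in ("https://doi.org/", "https://orcid.org/"):
        if value.startswith(prefix):
            value = value[len(prefix):]
    for scheme, pattern in PATTERNS.items():
        if pattern.match(value):
            return scheme
    return None
```

Note that the resolver-prefix handling illustrates a common alignment pitfall: the same DOI may be stored as a bare string in one dataset and as a resolvable URL in another, so naive string comparison would fail to connect them.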
This category differs from the Annotating specimens and other data thread in that we envisage annotations as opinions on, or additions to, existing specimen records (determinations, georeferences, etc.), rather than extensions or augmentations of a specimen record through the addition of new data elements (DNA sequences, citations, CT scans, images, vocalizations, duplicate specimens, linked specimens, etc.). There is some overlap, however, in that once these annotations are made they need to be reliably linked to the original record. Similarly, the Attributing work done (Data Attribution) thread will pursue a discussion of attributing people for work performed, but the advocacy and attribution of collections will rely on the same data infrastructure mechanisms needed for the data integration discussed here.
Information resources
- NASEM report – chapter 5 – starting on page 93 (https://www.nationalacademies.org/our-work/biological-collections-their-past-present-and-future-contributions-and-options-for-sustaining-them)
- BCoN Extended Specimen Network report (https://bcon.aibs.org/wp-content/uploads/2019/04/Extended-Specimen-Full-Report.pdf)
- BCoN Extended Specimen publication – The Extended Specimen Network: A Strategy to Enhance US Biodiversity Collections, Promote Research and Education (BioScience)
- Webster 2017 – The Extended Specimen, especially chapters 1 and 13
- Zaspel et al. 2020. Human Health, Interagency Coordination, and the Need for Biodiversity Data. (https://academic.oup.com/bioscience/article/70/7/527/5861522)
- Page 2008 - Biodiversity informatics: the challenge of linking data and the role of shared identifiers (https://doi.org/10.1093/bib/bbn022)
- Page 2008 - Visualizing a scientific article (https://doi.org/10.1038/npre.2008.2579.1)
- Van Rossum 2017 - Blockchain for Research (https://doi.org/10.6084/m9.figshare.5607778)
- Berendsohn & Guntsch 2012 - OpenUp! Creating a cross-domain pipeline for natural history data (https://doi.org/10.3897/zookeys.209.3179)
- König et al. 2019 - Biodiversity data integration—the significance of data resolution and domain (https://doi.org/10.1371/journal.pbio.3000183)
- Thessen et al. 2017 - 20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration (https://doi.org/10.7717/peerj-cs.164)
- Data integration enables global biodiversity synthesis (new publication)
Questions to promote discussion
- How do we bring together all the existing, disparate mechanisms of data integration into a single system that works for all?
- Can we rely on a combination of these existing mechanisms or do we need a stand-alone integration tool?
- How and where should such combinations of value-added data be stored and curated and who should take the responsibility for that?
- How do we engage and encourage the various data actors to buy into a system of data integration?
- What value propositions can be used to promote this?
- What existing data infrastructure technology elements may be important for data integration?
- What gaps in this cyberinfrastructure need to be filled?
- How do we mediate the connection of all these data to provide as rich a dataset as possible for community use, while supporting the necessary cyberinfrastructure for data storage and dissemination?
- How do we integrate biocollections datasets with specimen datasets and/or occurrence records generated by other types of projects (e.g., surveillance and monitoring projects), as well as with observation data and other kinds of data loosely related to natural history?
- What data should be connected to specimens? Where are the boundaries?
- Should observation records be integrated with collections data? If so, how?
- What use cases exist of data integration in action?