Extending, enriching and integrating data

@jmheberling Yes, they can be used somewhat interchangeably, but I see enriching as adding data to an existing record, e.g. adding a georeference or new determination to a record (which, yes, is similar to an annotation), whereas extending would be linking somewhat disparate information to a record, e.g. a GenBank sequence, citation, image, CT scan, etc. In the larger scheme of things, I don’t think the semantics of those two terms make much difference, as we are ideally looking for a system that can handle the integration/linking of all of these data elements and scenarios. Others may have different viewpoints.


@abentley Thanks for this clarification. I agree it is probably a semantic distinction, but I also wonder whether the distinctions are important to make in the structure of the proposed system – differentiating data extensions that are “primary” vs. “secondary” vs. “tertiary” (sensu Lendemer et al. 2020, BioScience). I could envision primary extensions being treated differently, or even prioritized and stored by data publishers, compared to higher-layered data that resides elsewhere. The BCoN white paper suggests, I thought, that tertiary data be linked to external repositories, for instance.

Many digitization projects, at least in herbaria in the US, follow an “image first” workflow, where images are produced along with a very basic set of skeletal data (to genus or species, perhaps some level of locality info). Many records may remain in this partially digitized state. Would the ideal system welcome these data online before they are fully digitized (transcribed), and enable crowdsourced transcription of specimens by researchers with specific interests and/or mass transcription by the public? Like anything, this would require quality control. Core digitization is not as exciting as enriching/integrating data, but it is important nonetheless: specimen digitization is far from complete and must be part of extended/digital specimen conversations. Maybe this has already been considered in the many threads above or in the Annotation topic threads. Transcription of existing primary label content and annotation labels doesn’t fit well into the annotations topic either.


@jmheberling Yes, I think that is a beauty of a transactional system: any changes or additions to the skeletal record could be recorded as transactions, leaving a breadcrumb trail of modification that not only records all changes but also provides attribution for those doing the work. You can thereby employ the strength of the community (scientific, citizen science, and collections) to assist in the digitization and annotation of those records.
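
As a very rough sketch of what such a transaction log could look like (the field and function names here are hypothetical, not a proposed standard):

```python
# Minimal sketch of an append-only transaction log for specimen records.
# All field names are hypothetical; a real system would use agreed standards.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Transaction:
    record_id: str      # identifier of the (skeletal) specimen record
    agent: str          # who made the change -- the basis for attribution
    action: str         # e.g. "transcribe_label", "add_georeference"
    payload: dict       # the actual data added or changed
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

class TransactionLog:
    """Append-only: transactions are recorded, never edited or deleted."""
    def __init__(self):
        self._log: list[Transaction] = []

    def append(self, tx: Transaction) -> None:
        self._log.append(tx)

    def history(self, record_id: str) -> list[Transaction]:
        """The 'breadcrumb trail' of all modifications to one record."""
        return [tx for tx in self._log if tx.record_id == record_id]

    def contributors(self, record_id: str) -> set[str]:
        """Everyone who worked on a record, for attribution."""
        return {tx.agent for tx in self.history(record_id)}
```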


@jmheberling Yes, I think there would naturally be prioritization of low-hanging fruit in making the connections necessary for the ES/DS concept, and some of the secondary and tertiary connections may take more effort – both socially and technologically – but I don’t think there is or should be any distinction in the underlying technology necessary to make those connections. I think the system (whatever it ends up being) should be able to accommodate all manner of connections, i.e. it should be generic enough to handle all scenarios.


@abentley Thanks for the response and information. That’s great to hear that the ideal system would be indifferent to, and capable of, all connections, but presumably they are quite different and therefore require different approaches/capabilities, whether primary or tertiary (or maybe not even that distinction). Perhaps not – you know far better than me! Some extensions, as I understand them sensu the ESN, require direct linkages or data to be directly associated with the record, presumably held at the level of the specimen database (i.e. another data/media field added to the specimen record, such as field images), while others may be broader aggregated information not specific to the specimen itself (e.g. species range), or information about the specific context of the given specimen but not derived from or unique to the specimen itself (e.g. climate data linking to PRISM or other climate database(s)), right? Others may be best placed in an external repository (e.g. the TRY trait database for plants) with the link(s) provided in the specimen record. I may not know what I’m talking about, but I would guess these different extensions/enrichments/integrations would require thinking through different informatic solutions. Hope that makes sense, is useful, and that I am not rambling :smiley:

@jmheberling Yes, there is that distinction between resources that link to a specific collection object (a citation or GenBank sequence) and resources that link to a broader concept (a taxonomic name for a distribution model). I see that as an issue of data in vs. data out. In the case of a citation or GenBank sequence (data in), you are linking external resources back to an occurrence record, whereas with a distribution model (data out) you are accumulating data into a package to then push out to produce a model. However, with the envisaged system, both scenarios should be equally supported, as the entire transactional system will be completely transparent and will allow for grouping objects together for a particular function through, for instance, a DOI. It would be great to hear others’ views on this.
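
To make the “data out” scenario concrete, here is a minimal sketch of grouping records into a citable package under a single identifier, loosely modeled on GBIF’s practice of assigning DOIs to downloads; `mint_doi` is a hypothetical stand-in for a real registration service such as DataCite:

```python
# Sketch of the "data out" scenario: grouping records into a package with
# its own identifier. mint_doi() is a hypothetical placeholder, not a real
# registration API.
import uuid

def mint_doi() -> str:
    # Stand-in: a real system would register the DOI with an agency.
    return f"10.99999/{uuid.uuid4().hex[:8]}"

def package_records(record_ids: list[str], purpose: str) -> dict:
    """Accumulate records into a citable bundle for downstream use."""
    return {
        "doi": mint_doi(),
        "purpose": purpose,          # e.g. "distribution model input"
        "members": sorted(record_ids),
    }

pkg = package_records(["occ-1", "occ-2", "occ-3"], "distribution model input")
print(pkg["doi"], len(pkg["members"]), "records")
```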


There is a great discussion happening on a new thread that has implications for Extending and enriching data for data integration - https://discourse.gbif.org/t/structure-and-responsibilities-of-a-digextspecimen/2533/3. With diagrams too!!

This is a comment that was made on the summary that I think belongs here: Summaries - 2. Extending, enriching and integrating data - #2 by dorsa


@dorsa Yes, I agree that the general principle of ensuring that links are maintained is an important part of any system, but I am not sure that GBIF should be the mediator of this. Ideally, the system would be able to detect broken links and report them as part of the general infrastructure of the system.
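
A minimal sketch of what such broken-link detection could look like, assuming the checker is handed a mapping of record identifiers to their external links (the sample record and URL below are illustrative only):

```python
# Rough sketch of a periodic broken-link check, run as part of the general
# infrastructure rather than by any one mediator. Uses the 'requests' library.
import requests

def check_links(links: dict[str, str], timeout: float = 10.0) -> dict[str, str]:
    """Map each record's external link to a status: 'ok', 'broken',
    or 'unreachable'. `links` maps record identifiers to URLs."""
    report = {}
    for record_id, url in links.items():
        try:
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            report[record_id] = "ok" if resp.status_code < 400 else "broken"
        except requests.RequestException:
            report[record_id] = "unreachable"
    return report

if __name__ == "__main__":
    # Hypothetical record->link mapping; a real run would pull these
    # from the aggregator's index.
    sample = {"KU:Herps:12345": "https://www.ncbi.nlm.nih.gov/nuccore/MN908947"}
    for rec, status in check_links(sample).items():
        print(rec, status)
```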


Thanks, found the new thread. Who else but @Gbif could detect broken links? But then the question is to whom to report them – to the data provider in the first place, of course, but imagine a data provider with problems. We probably need some sort of clearing house for data sustainability and integrity.

@dorsa Yes, exactly. We need some sort of independent broker that can mediate all of the records and links in the system. From what I understand, a blockchain-based, transactional system would automatically provide such a broker as part of the system. That way all actors in the data pipeline have a role to play in following the rules and providing the necessary linkages between items.


I would like to bring up the topic of integrating specimens with observational and other types of data. Much of the discussion here has been centered – for obvious and good reasons – on extensions at the primary level (e.g., specimen metadata (including enriched metadata) and images (including CT scans), etc.). These can enrich the value of the specimen immensely, as has been nicely illustrated. Data extensions at the secondary level can as well, but they also bring challenges because they are not always directly linked to a specific specimen. Consider a herpetologist on a field collecting trip. She might make an audio recording of a calling male frog (deposited in a media collection), then collect the frog itself (the specimen), take a tissue sample (to a frozen tissue collection), and collect ectoparasites from the animal (sent to the appropriate invertebrate collection). These are all samples that add value to the specimen itself and should be appropriately linked to it via whatever mechanism. But she might also collect photos of the habitat or lists of other species encountered but not collected (observational data), which could be linked not just to that one specimen but to all that were collected on the same date at the same place. She might also record many other calling males that were not collected, and take tissue and parasite samples from frogs that were not collected. These should all go to appropriate repositories, but be linked back to the specimens that were collected at that date/place, as they all add value to each other. Hence the need to extend data associated with specimens to that secondary level, and also the need to connect observational data with collections/specimen data. To my thinking, the thing that unites these data/specimens is the collecting event itself – all were collected at the same time/place. What I don’t have a good grasp on (because it is well outside my expertise) is what technological tools might help here. Would love to hear thoughts on both the conceptual issue and the approaches/solutions.
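
One way to make the collecting event the unifying hub is sketched below, in the spirit of Darwin Core’s dwc:eventID; all identifiers and records here are invented for illustration:

```python
# Sketch of grouping disparate records by their shared collecting event,
# following the idea behind Darwin Core's dwc:eventID. Sample data invented.
from collections import defaultdict

records = [
    {"type": "PreservedSpecimen",  "id": "frog-001",   "eventID": "event:2021-06-12:siteA"},
    {"type": "MachineObservation", "id": "audio-917",  "eventID": "event:2021-06-12:siteA",
     "associatedOccurrences": ["frog-001"]},   # call recording of the collected frog
    {"type": "MaterialSample",     "id": "tissue-442", "eventID": "event:2021-06-12:siteA",
     "associatedOccurrences": ["frog-001"]},   # tissue from the same animal
    {"type": "HumanObservation",   "id": "obs-117",    "eventID": "event:2021-06-12:siteA"},
    # species seen but not collected: linked only via the shared event
]

by_event = defaultdict(list)
for rec in records:
    by_event[rec["eventID"]].append(rec["id"])

# Everything collected/observed at the same time and place:
print(by_event["event:2021-06-12:siteA"])
# -> ['frog-001', 'audio-917', 'tissue-442', 'obs-117']
```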


Agreed! And speaking of multimedia data, note that these sound files/pictures are often downloaded, and people forget about the context, metadata, etc.! So a clear identifier plus metadata INSIDE the multimedia files would help. In addition (and apart from blockchain, which I still do not grasp, and even less how it could help with the problems we are discussing here), I heard about file connection systems (e.g. WO2003054724A3 - File identification system and method - Google Patents). What I imagine is that files (images, sounds, videos) would be able to “connect” within an appropriate Internet-like ecosystem, thereby tracing their origin and maintaining their relations.
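
As a small illustration of “metadata INSIDE the multimedia files”, here is a sketch that stamps a persistent identifier into an image’s own EXIF metadata using the Pillow library; the identifier and filenames are hypothetical:

```python
# Sketch: embedding a persistent identifier in an image's EXIF metadata
# so its context survives downloading. Identifier and filenames are
# hypothetical examples.
from PIL import Image

SPECIMEN_PID = "https://doi.org/10.99999/example-specimen"  # hypothetical PID

img = Image.open("frog_habitat.jpg")
exif = img.getexif()
exif[0x010E] = f"Linked specimen: {SPECIMEN_PID}"  # ImageDescription tag
exif[0x8298] = "CC BY 4.0, Example Museum"         # Copyright tag (illustrative)
img.save("frog_habitat_tagged.jpg", exif=exif.tobytes())

# Anyone who later finds only the file can recover its context:
print(Image.open("frog_habitat_tagged.jpg").getexif().get(0x010E))
```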

Agreed. This is similar to how paleontology collections are made. We collect a larger sample that may then contain numerous taxa (mega and micro); parts may be split for lithology, sediment categorization, geochemistry, sedimentary structures, etc. I link these in my CMS to a locality. Numerous collecting events may go back to said locality and add to the data, or it may be linked to other collections. These are all things most aggregators lose with the current schemas.

@MikeWebster This is where I think the CMS has a large role to play in facilitating these kinds of connections, either through relationship fields or linkages, or through common data fields (e.g. collecting event) as mentioned by @RogerBurkhalter above. However, the CMS provides context to these linkages, and once that context is lost outside of the CMS itself, the system can break down fairly quickly. This is why we potentially need a broker service that can make, maintain, and check these linkages. A blockchain-based system of transactions on records could potentially do that.

@abentley Yes, exactly – the links are laborious to create and easy to break, very fragile. A service to more easily make the links (automated?), and to check/maintain them, would be ideal. I need to better understand the blockchain approach, but if it could help, it should be explored!

@MikeWebster @abentley @dorsa We’ve made provision for exactly what you describe in our proposals for “open digital extended specimens”, recognising the need to connect supplementary or secondary information that is derived directly from the collected specimen (an audio recording, tissue sample, DNA sequence) and also to connect tertiary information, which can best be described as ‘associated with the specimen but not derived from it’, e.g. habitat data, photographs of the locality, conservation status, etc. The technological tool for doing this is globally unique, unambiguous persistent identifiers (of which the DOI is a specific example). Every identified ‘thing’ has metadata associated with it to provide its context.
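
As a concrete illustration of identifiers carrying their own metadata, here is a sketch that resolves a DOI to machine-readable metadata via content negotiation at doi.org; the DOI shown is an arbitrary published one, used only as an example:

```python
# Sketch of what "every identified thing has metadata" looks like in
# practice: resolving a DOI's metadata through content negotiation.
import requests

def resolve_metadata(doi: str) -> dict:
    """Fetch machine-readable metadata for a DOI (CSL JSON)."""
    resp = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

meta = resolve_metadata("10.1038/nature12373")  # arbitrary example DOI
print(meta.get("title"))
```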

@MikeWebster @dorsa A blockchain is a digital ledger (record) of transactions shared among (and contributed to by) all participants such that they all have the same view of the record of transactions. It has no overall controller or master responsible for it. Every participant is equal.

For a simple analogy, imagine a group of people, each with a permanent marker pen, standing in front of a whiteboard. All can see what is there. All can write something there. No one can erase anything.

Oh, and by the way, there’s a mechanism to stop a fight from breaking out!
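
For anyone who wants to see the whiteboard analogy in code, here is a toy, single-machine sketch of an append-only, tamper-evident chain of entries; a real blockchain adds distribution and consensus on top of this idea:

```python
# Toy illustration of the whiteboard analogy: a shared, append-only chain
# where each entry commits to everything before it, so nothing can be
# quietly erased or rewritten. (A real blockchain adds consensus on top.)
import hashlib, json

def entry_hash(entry: dict) -> str:
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()

chain: list[dict] = []

def write_on_whiteboard(author: str, text: str) -> None:
    prev = chain[-1]["hash"] if chain else "0" * 64
    entry = {"author": author, "text": text, "prev": prev}
    entry["hash"] = entry_hash({k: v for k, v in entry.items() if k != "hash"})
    chain.append(entry)

def is_untampered() -> bool:
    """Any erasure or edit breaks the hash links."""
    prev = "0" * 64
    for e in chain:
        body = {k: v for k, v in e.items() if k != "hash"}
        if e["prev"] != prev or entry_hash(body) != e["hash"]:
            return False
        prev = e["hash"]
    return True

write_on_whiteboard("alice", "georeference added to KU:Herps:12345")
write_on_whiteboard("bob", "new determination: Lithobates sylvaticus")
print(is_untampered())  # True; edit any entry above and this becomes False
```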

@hardistyar Helpful analogy! Thanks…