Sibling datasets to overcome star schema limitation

andre · May 13, 2022, 7:44am

Dear GBIF community,

I’m about to publish my first eDNA dataset to GBIF; it derives from sampling of Antarctic lakes.

To overcome the DwC star schema limitations, I am planning to publish two datasets instead of one:

a sample core with physico-chemical measurementOrFacts extension
an occurrence core with eDNA extension

To avoid duplicates, I will not add the occurrence extension to the first one but use sampleIDs in the second one. Both datasets will refer to the same project.
To be complete, I do expect to publish other eDNA-derived occurrences, from other taxonomic groups, in the near future.
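
To make the plan concrete, here is a minimal sketch of how the two sibling datasets could be joined after download. It assumes the shared identifier is carried in `eventID` in both datasets; all records, identifiers and the helper `link_by_event_id` are hypothetical and only illustrate the idea, they are not GBIF or IPT functionality.

```python
# Hypothetical records: every identifier and value below is invented for illustration.
# Dataset 1: Event core rows, each with one measurementOrFact value.
event_dataset = [
    {"eventID": "lake-A:2021-01", "measurementType": "pH", "measurementValue": "7.9"},
    {"eventID": "lake-B:2021-01", "measurementType": "pH", "measurementValue": "8.3"},
]
# Dataset 2: Occurrence core rows that point back to dataset 1 through eventID.
occurrence_dataset = [
    {"occurrenceID": "occ-1", "eventID": "lake-A:2021-01", "scientificName": "Nostoc"},
    {"occurrenceID": "occ-2", "eventID": "lake-A:2021-01", "scientificName": "Leptolyngbya"},
]


def link_by_event_id(events, occurrences):
    """Group occurrence IDs under the event they refer to.

    Returns (linked, unmatched): linked maps every eventID to its occurrence IDs
    (an empty list for events without occurrences); unmatched lists occurrence
    IDs whose eventID is absent from the event dataset.
    """
    linked = {event["eventID"]: [] for event in events}
    unmatched = []
    for occ in occurrences:
        if occ["eventID"] in linked:
            linked[occ["eventID"]].append(occ["occurrenceID"])
        else:
            unmatched.append(occ["occurrenceID"])
    return linked, unmatched


linked, unmatched = link_by_event_id(event_dataset, occurrence_dataset)
```

A non-empty `unmatched` list would point to identifiers that drifted between the two datasets, which is the main risk when the link is nothing more than a shared ID.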

Has anybody tried to do that? Does it make sense?
Is there some recommendation around ‘sibling’ datasets?
What is the best approach to link two datasets apart from reusing the sampleIDs?

Thanks,

DagEndresen · May 15, 2022, 11:41am

I see absolutely no issue with referring to the same MaterialSample from multiple IPT-published datasets! Each materialSampleID should be a stable persistent identifier; any so-called “duplicates” are then easy and straightforward to untangle.
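
As a rough illustration of that untangling, the sketch below groups records from several datasets by their `materialSampleID`. The dataset labels and sample identifiers are invented, and the `dataset` key is only a stand-in for whatever dataset identifier the records carry.

```python
from collections import defaultdict

# Hypothetical records pooled from several published datasets.
records = [
    {"dataset": "event-dataset", "materialSampleID": "sample-001"},
    {"dataset": "edna-bacteria", "materialSampleID": "sample-001"},
    {"dataset": "edna-fungi", "materialSampleID": "sample-001"},
    {"dataset": "edna-bacteria", "materialSampleID": "sample-002"},
]


def datasets_per_sample(records):
    """Map each materialSampleID to the sorted list of datasets that mention it."""
    by_sample = defaultdict(set)
    for record in records:
        by_sample[record["materialSampleID"]].add(record["dataset"])
    return {sample_id: sorted(names) for sample_id, names in by_sample.items()}


shared = datasets_per_sample(records)
```

Because the identifier is stable, a sample that appears in three datasets is still recognizably one sample rather than three.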

GBIF-Norway has been making similar inter-dataset PID links for a while, as described at our joint OBIS data workshop in Brussels in January 2018 (see e.g. slides 25 and 26) and in the slides from the 2018 European nodes meeting in Tallinn.

andre · May 16, 2022, 7:12am

Thanks for the reminder about these slides, Dag.
I agree with you on using the very same persistent sampleIDs in both datasets.
What about the suggestion to publish DNA-derived occurrences only in the OccurrenceCore dataset? And how should that dataset refer to the ‘sibling’ sampleCore dataset, apart from using the same sampleIDs?
Would this sampleCore dataset be ingested, accessible and citable even without any occurrences?
In my case, I do expect more eDNA-derived occurrences, from the very same samples but for different taxonomic groups, to be published in the near future.
That is another reason for me to publish these data as two datasets, not one.

thomasstjerne · May 16, 2022, 9:39am

For inspiration, here is a similar dataset. It has both EMOF and DNA extensions, but they chose to duplicate the EMOF data onto each occurrence to keep it in one dataset. The drawback of this is, of course, that if you have samples with no occurrences, they will not be interpreted.
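
A small sketch of that drawback, using invented records: when measurements are copied onto occurrence rows, any sampling event without an occurrence has no row to carry its measurements and drops out. The function name and all identifiers are hypothetical.

```python
# Hypothetical measurements per sampling event, and occurrences for only one event.
measurements_by_event = {
    "lake-A:2021-01": {"pH": "7.9", "conductivity": "120"},
    "lake-B:2021-01": {"pH": "8.3"},
}
occurrences = [
    {"occurrenceID": "occ-1", "eventID": "lake-A:2021-01"},
    {"occurrenceID": "occ-2", "eventID": "lake-A:2021-01"},
]


def copy_measurements_onto_occurrences(measurements_by_event, occurrences):
    """Emit one measurement row per (occurrence, measurement) pair."""
    rows = []
    for occ in occurrences:
        event_measurements = measurements_by_event.get(occ["eventID"], {})
        for m_type, m_value in event_measurements.items():
            rows.append({
                "occurrenceID": occ["occurrenceID"],
                "eventID": occ["eventID"],
                "measurementType": m_type,
                "measurementValue": m_value,
            })
    return rows


rows = copy_measurements_onto_occurrences(measurements_by_event, occurrences)
events_kept = {row["eventID"] for row in rows}
# Events whose measurements did not survive the flattening.
events_lost = sorted(set(measurements_by_event) - events_kept)
```

Here each measurement of the first lake is repeated once per occurrence, while the second lake's measurement ends up in `events_lost`.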

ymgan · May 19, 2022, 10:13am

Sorry, I am a little confused … When you mention “sample core”, do you mean sampling event core or material sample core (in the sandbox)?

andre · May 19, 2022, 10:51am

I meant sampling event as core, not the material sample extension.

DagEndresen · May 21, 2022, 11:10am

As also communicated by email, GBIF Norway has published some sampling event (Event core) datasets without any dwc:Occurrence records included in the same dataset. The EML metadata for these “naked” sampling event datasets are available from the GBIF portal – but none of the actual sampling events (dwc:Event records) are presented anywhere or (as far as I know) ingested into the GBIF index at all.

UiO (2015). Dannevig collections doi:10.15468/hwvr0m (280 “naked” sampling events)
UiO (2015). Drøbak collections doi:10.15468/mg7l2t (637 “naked” sampling events)
UiO (2015). BIOSKAG collections doi:10.15468/mpifue (178 “naked” sampling events)

So far, some of the corresponding “dwc:Occurrence” records (for the dwc:MaterialSample samples) collected at these sampling events (dwc:Event) are published in the following dataset.

UiO (2015). Dannevig- and Drøbak collections of Polychaeta doi:10.15468/y6cctp (2 180 occurrences)

Most of the dwc:MaterialSample samples (collected at these dwc:Event and “dwc:Occurrence” points) were deposited into the university museum collections as specimens, which SHOULD be linked to the dwc:Event records in the datasets above. Unfortunately, the museum database has nowhere to record the corresponding dwc:eventID, so we have not reached this final step yet.

See also our data mobilization grants 2015 for more information.

