Sibling datasets to overcome star schema limitation

Dear GBIF community,

I’m about to publish my first eDNA dataset to GBIF. It derives from sampling of Antarctic lakes.

To overcome the DwC star schema limitations, I am planning to publish two datasets instead of one:

  1. a sample core with a physico-chemical MeasurementOrFact extension
  2. an occurrence core with eDNA extension

To avoid duplicates, I will not add the occurrence extension to the first one but use sampleIDs in the second one. Both datasets will refer to the same project.
For completeness, I do expect to publish other eDNA-derived occurrences, from other taxonomic groups, in the near future.
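
As a rough illustration of the linkage described above (a sketch only, not an official GBIF or IPT procedure), the snippet below models the two sibling datasets as small CSV tables that share an identifier and checks that every occurrence points at a sample that exists in the first dataset. The use of eventID as the shared key, the `urn:example:` identifiers, the coordinates and the helper names are all assumptions made for the example.

```python
import csv
import io

# Hypothetical excerpt of dataset 1: sample/event core (one row per lake sample).
EVENTS_CSV = """eventID,eventDate,decimalLatitude,decimalLongitude
urn:example:lake-A:2019-01,2019-01-12,-68.58,77.97
urn:example:lake-B:2019-01,2019-01-14,-68.60,78.10
"""

# Hypothetical excerpt of dataset 2: occurrence core (eDNA-derived records).
OCCURRENCES_CSV = """occurrenceID,eventID,scientificName
urn:example:occ-1,urn:example:lake-A:2019-01,Bacteria
urn:example:occ-2,urn:example:lake-A:2019-01,Cyanobacteria
urn:example:occ-3,urn:example:lake-C:2019-01,Chlorophyta
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def unresolved_event_ids(events, occurrences):
    """Return shared IDs used by occurrences but absent from the sample dataset."""
    known = {row["eventID"] for row in events}
    used = {row["eventID"] for row in occurrences}
    return sorted(used - known)


events = read_rows(EVENTS_CSV)
occurrences = read_rows(OCCURRENCES_CSV)
missing = unresolved_event_ids(events, occurrences)
print(missing)
```

Running a check like this before publishing either dataset would flag the occurrence that refers to a sample (lake-C here) missing from the sibling dataset.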

Has anybody tried to do that? Does it make sense?
Is there some recommendation around ‘sibling’ datasets?
What is the best approach to link two datasets apart from reusing the sampleIDs?



I see absolutely no issues at all with referring to the same MaterialSample from multiple IPT-published datasets! Each materialSampleID should be a stable persistent identifier; then any so-called “duplicates” are easy and straightforward to untangle.

GBIF Norway has been making somewhat similar inter-dataset PID links for a while, as described at our joint OBIS data workshop in Brussels in January 2018 (see e.g. slides 25 and 26) and in the slides from the 2018 European nodes meeting in Tallinn.

Thanks for the reminder of these slides, Dag.
I agree with you on using the very same persistent sampleIDs in both datasets.
What about the suggestion to publish DNA-derived occurrences only in the OccurrenceCore dataset, and how should I refer from it to the ‘sibling’ sampleCore dataset apart from using the same sampleIDs?
Would this sampleCore dataset be ingested, accessible and citable even without any occurrences?
In my case, I do expect more eDNA-derived occurrences, from the very same samples but for different taxonomic groups, to be published in the near future.
That is another reason for me to publish these data as two datasets, not one.

For inspiration, here is a similar dataset. It has both EMOF and DNA extensions, but they chose to duplicate the EMOF data onto each occurrence to keep it in one dataset. The drawback of this is of course that if you have samples with no occurrences, they will not be interpreted.
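
To make that drawback concrete, here is a minimal sketch with made-up records (the IDs, values and function name are hypothetical, and this is not how that dataset was actually built). It copies sample-level measurements onto each occurrence and then lists the samples whose measurements disappear because no occurrence refers to them.

```python
# Hypothetical EMOF-style measurements, one row per sample.
measurements = [
    {"eventID": "S1", "measurementType": "pH", "measurementValue": "7.9"},
    {"eventID": "S2", "measurementType": "pH", "measurementValue": "8.3"},
]

# Hypothetical occurrences; note that sample S2 yielded none.
occurrences = [
    {"occurrenceID": "O1", "eventID": "S1"},
    {"occurrenceID": "O2", "eventID": "S1"},
]


def flatten_onto_occurrences(measurements, occurrences):
    """Duplicate each sample's measurements onto every occurrence of that sample."""
    by_event = {}
    for m in measurements:
        by_event.setdefault(m["eventID"], []).append(m)
    rows = []
    for occ in occurrences:
        for m in by_event.get(occ["eventID"], []):
            rows.append({"occurrenceID": occ["occurrenceID"], **m})
    return rows


flat = flatten_onto_occurrences(measurements, occurrences)
events_kept = {row["eventID"] for row in flat}
events_lost = sorted({m["eventID"] for m in measurements} - events_kept)
print(len(flat), events_lost)  # 2 ['S2']
```

The pH value for S1 is stored twice, once per occurrence, while the measurement for S2 has nowhere to go.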

Sorry, I am a little confused … When you mention “sample core”, do you mean sampling event core or material sample core (in the sandbox)?

I meant sampling event as core, not the material sample extension.

As also communicated by email, GBIF Norway has published some sampling event (Event core) datasets without any dwc:Occurrence records included in the same dataset. The EML metadata for these “naked” sampling event datasets are available from the GBIF portal – but none of the actual sampling events (dwc:Event records) are presented anywhere or (as far as I know) ingested into the GBIF index at all.

So far, some of the corresponding “dwc:Occurrence” records (for the dwc:MaterialSample samples) collected at these sampling events (dwc:Event) are published in the following dataset.

  • UiO (2015). Dannevig- and Drøbak collections of Polychaeta doi:10.15468/y6cctp (2 180 occurrences)

Most of the dwc:MaterialSample samples (collected at these dwc:Event and dwc:Occurrence points) were deposited into the university museum collections as specimens, which SHOULD be linked to the dwc:Event records in the datasets above. Unfortunately, the museum database has nowhere to record the corresponding dwc:eventID, so we have not reached this final step yet.

See also our data mobilization grants 2015 for more information.