Diversifying the GBIF data model - intro

Darwin Core—the most commonly used data standard in the GBIF community—has provided a simple and effective framework for supporting the growth of species occurrence data available from the GBIF network. But the simplicity of this standard, established and maintained by Biodiversity Information Standards (TDWG), has significant limitations when it comes to shaping data from diverse sources.

At both the Global Nodes Meeting and the 28th meeting of the GBIF Governing Board in 2021, GBIF head of informatics Tim Robertson and long-time collaborator John Wieczorek detailed initial case studies highlighting these limitations while outlining possible approaches for supporting richer, more complex types of biodiversity data.

We welcome your feedback and encourage you to join the discussion on the case studies that interest you most—help us continue to explore novel data publishing challenges together.

I have a general remark regarding the data model, and I assumed that this would be the best place to share it.

I noticed a couple of inconsistencies regarding the Entity Relationship (ER) diagrams and their best practices. (I am primarily talking about the Unified Common Model here, but as the conceptual models for the use cases are subsections of the Unified Model, it pertains to them as well.)

I noticed that the notation for the connecting arrows is wrong in most cases where a class references itself to express some kind of hierarchy. The way I see it (and this aligns both with the description given in the Glossary regarding the Explanation for Diagrams in the Use Cases and with documentation in other places, e.g. Wikipedia), they need to be flipped. This is true for Event, Relationship, DigitalEntity and Taxon. For MaterialEntity, however, it is correct.

The self-referencing makes it hard to read the relations in the correct order, which might explain the mix-up. It helps to separate them out into two different blocks representing the same concept, one for the parent and one for the child:

This makes it easier to read the notation as intended, in this case: “A child has one parent” and “A parent can have multiple children”. If we now flip the two entity blocks back on top of each other, we get the line that has the single dash at the EntityID and the crow’s foot at the ParentEntityID. This also aligns with the expectation that the EntityID would be the primary key for this entity, thus being unique.
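
To make the parent/child reading concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The table name entity and the columns entity_id and parent_entity_id are hypothetical, chosen only to mirror the EntityID / ParentEntityID wording above; they are not taken from the GBIF model.

```python
import sqlite3

# Hypothetical schema: one self-referencing table, not the actual GBIF model.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    """
    CREATE TABLE entity (
        entity_id        TEXT PRIMARY KEY,                  -- "one" side: unique per row
        parent_entity_id TEXT REFERENCES entity(entity_id)  -- "many" side: may repeat
    )
    """
)
conn.executemany(
    "INSERT INTO entity VALUES (?, ?)",
    [("root", None), ("child-a", "root"), ("child-b", "root")],
)


def parent_of(entity_id):
    """A child has at most one parent, stored directly on its own row."""
    row = conn.execute(
        "SELECT parent_entity_id FROM entity WHERE entity_id = ?", (entity_id,)
    ).fetchone()
    return row[0] if row else None


def children_of(entity_id):
    """A parent can have many children: every row that points back at it."""
    rows = conn.execute(
        "SELECT entity_id FROM entity WHERE parent_entity_id = ? ORDER BY entity_id",
        (entity_id,),
    ).fetchall()
    return [r[0] for r in rows]
```

Here parent_of("child-a") returns a single value while children_of("root") returns a list, which is the asymmetry the single dash and the crow’s foot are meant to express.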

Here are now some other issues that I noticed as well:

  • By convention, the primary key should be the first property listed within an entity and should be named ID. Both Event and MaterialSequence break with this convention. (MaterialSequence has sequence, which looks like the primary key, but it is neither named properly nor in the right position; if it gets renamed, it also needs to change for SequenceTaxon.) Not all entities need primary keys, though: entities that only provide many-to-many mappings, like MaterialGroupMembership, usually work fine without one.
  • The relationship between TaxonDistribution and Event also needs to be flipped; this is related to the self-reference of Event.
  • The hierarchy within Relationship (dependsOnRelationshipID) seems to be quite generic, which makes me wonder whether a hierarchy is actually the best way to model this, or whether there might be cases where a many-to-many connection between dependent relationships is necessary, in which case an additional support entity might be needed. I have not checked the different use cases to see whether this is necessary, but because the entire entity is so generic, it raised a warning for me.
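
As an illustration of what such a support entity could look like, here is a minimal sketch in Python with sqlite3. The table relationship_dependency and all column names are hypothetical and only show the general many-to-many pattern; nothing here is taken from the published model.

```python
import sqlite3

# Hypothetical link table allowing a relationship to depend on several others.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE relationship (
        relationship_id TEXT PRIMARY KEY
    );
    CREATE TABLE relationship_dependency (
        relationship_id            TEXT NOT NULL REFERENCES relationship(relationship_id),
        depends_on_relationship_id TEXT NOT NULL REFERENCES relationship(relationship_id),
        PRIMARY KEY (relationship_id, depends_on_relationship_id)
    );
    """
)
conn.executemany("INSERT INTO relationship VALUES (?)", [("r1",), ("r2",), ("r3",)])
conn.executemany(
    "INSERT INTO relationship_dependency VALUES (?, ?)",
    [("r2", "r1"), ("r3", "r1"), ("r3", "r2")],
)


def depends_on(relationship_id):
    """All relationships that the given relationship directly depends on."""
    rows = conn.execute(
        "SELECT depends_on_relationship_id FROM relationship_dependency "
        "WHERE relationship_id = ? ORDER BY depends_on_relationship_id",
        (relationship_id,),
    ).fetchall()
    return [r[0] for r in rows]


def dependents_of(relationship_id):
    """All relationships that directly depend on the given relationship."""
    rows = conn.execute(
        "SELECT relationship_id FROM relationship_dependency "
        "WHERE depends_on_relationship_id = ? ORDER BY relationship_id",
        (relationship_id,),
    ).fetchall()
    return [r[0] for r in rows]
```

With a single dependsOnRelationshipID column, r3 could point at only one of r1 or r2; the link table lets it point at both, at the cost of one more entity in the diagram.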

I know that some of the issues mentioned above might seem a bit nitpicky and meta (not so much about the actual data model but the visualization of said data model for the sake of this discussion) but I wanted to make sure that there are no misunderstandings when people talk about the different entities and their relation to each other.

All of the issues and criticism mentioned above are just minor issues and don’t take anything away from the fact that @trobertson and @tuco did a great job with this really expressive data model. It covers so many different use cases while still being manageable in size and complexity.

Thank you very much @DavidFichtmueller. You are correct about all of the relationship cardinalities you identified. The relationship cardinalities are not nit-picky trivialities; they are extremely important to have correct for the deeper understanding of how the model works and how it might be implemented. I have gone through the common model and all of the use cases and corrected all of the relationships in both the conceptual and publishing models.

I chose not to consistently follow the convention of putting the primary key first in the list of properties of an entity for the simple reason of trying to keep the relationship arrows as simple as possible in the diagram layout, with the justification that the primary keys are easily identifiable by a) the convention I did follow to name them consistently following the pattern {entityName}ID, and b) they are the fields that participate on the one side of one-to-many relationships. The example you mentioned, sequence, is not actually a primary key. I have modified the entity MaterialSequence to be called GeneticSequence. Its primary key is geneticSequenceID.

The Relationship entity is indeed generic. The Biotic Relationships Use Case explains the power of the Relationship part of the model.
“A second feature of the model is to provide the capacity to posit that there is a sequential dependency among relationships using a link from one relationship to another one on which it directly depends. This is beyond the simple capacity to express the order in which relationships occurred, provided naturally from the Relationships being based on dwc:Events. It provides the ability to track dependencies that are more complex than just co-occurring or sequential. The model supports complex multiple co-occurrent relationships to be modeled independently as pairwise relationships.”
Again, thank you very much for your keen perception on the model relationships.

Thank you for the updates, @tuco. I can imagine that updating all of these diagrams was quite a lot of work. It looks like you forgot to replace the images for the Summary and the Conceptual Model for Use Case 2: Camera Trap DB with the newer version. They still show the old connections (the Publishing Model images are up to date, however).

It is a reasonable choice not to put the primary ids first, in cases where it avoids overlaps between lines and thus improves readability.

I only have one further question regarding the update: does the 1:1 relationship between digitalEntityID and geneticSequenceID mean that a GeneticSequence is automatically a DigitalEntity and that they share the same value as primary ID? It seems to be the only 1:1 relationship in the data model apart from the assertions (though in the Conceptual Model for use case 05 Global Malaise, it is expressed as a 1:many relationship). If that interpretation is correct, I think it might be better to just make digitalEntityID the primary ID of the GeneticSequence as well, though I am not sure about this. Or am I misunderstanding something here?
Thank you for all of your work, David

Thanks again @DavidFichtmueller. Your observations are keen and careful. We really appreciate that.

I have updated the images for the Camtrap DP and Global Malaise use cases.

The 1:1 relationship between DigitalEntity and GeneticSequence is correct. A GeneticSequence is a DigitalEntity subtype. There are actually two more subtypes in the model, but they are not drawn because a) it would make a visual mess, and b) it isn’t really clear at this point how important these relationships would be. The two 1:1 subtype relationships not shown are between EntityOfInterest:DigitalEntity and EntityOfInterest:MaterialEntity.

I understand the temptation to name the primary key of a subtype the same as the primary key of the parent type (digitalEntityID for geneticSequenceID in your example). Though I think it might clarify that the entities are subtypes, I think it might confuse things when looking at only a smaller part of the conceptual model. So, unless there is a compelling reason not to do so, I think that following the pattern that the primary key is always named for the entity (e.g., geneticSequenceID for GeneticSequence) will be less confusing. The good thing is that the 1:1 relationship arrows make it explicit that they are subtypes, so that information is not lost in any way.
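
For readers less familiar with subtype tables, here is a minimal sketch in Python with sqlite3 of how a 1:1 subtype relationship can be enforced while the subtype keeps its own key name. The table and column names are hypothetical snake_case stand-ins for DigitalEntity and GeneticSequence, and the choice to reuse the parent's key value in the subtype is an assumption for illustration, not a statement about how the model must be implemented.

```python
import sqlite3

# Hypothetical schema: the subtype key is both PRIMARY KEY (at most one
# sequence per digital entity) and FOREIGN KEY (every sequence is a digital entity).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE digital_entity (
        digital_entity_id TEXT PRIMARY KEY
    );
    CREATE TABLE genetic_sequence (
        genetic_sequence_id TEXT PRIMARY KEY
            REFERENCES digital_entity(digital_entity_id),
        sequence TEXT
    );
    """
)
conn.execute("INSERT INTO digital_entity VALUES ('de-1')")


def try_insert_sequence(sequence_id, bases):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO genetic_sequence VALUES (?, ?)", (sequence_id, bases)
        )
        return True
    except sqlite3.IntegrityError:
        return False


first = try_insert_sequence("de-1", "ACGT")      # matches an existing digital entity
duplicate = try_insert_sequence("de-1", "TTTT")  # second subtype row for the same entity
orphan = try_insert_sequence("de-2", "GGGG")     # no such digital entity
```

Only the first insert is accepted: the primary key rejects the duplicate and the foreign key rejects the orphan, so the pair of constraints expresses the 1:1 link regardless of what the subtype key is called.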

Thank you again for the updates and the explanation. It is good to know that my understanding of the situation was correct. I see that both ways of naming the primary key have their advantages, so I withdraw my objection on the matter and you can just leave it the way you modeled it.
Cheers, David

A use case for motion ecology data?

The field of motion ecology does not collect biodiversity data per se, but it is closely related: the occurrence of the same individual in different locations and moments - plus other data related to those movements.

Biodiversity data is about the occurrence of individuals in different moments and locations, but does not necessarily link the observation of such an occurrence to an identifiable individual.

It seems to me that knowledge of the movement patterns of individuals could tell a lot about the likelihood of biodiversity observations being about the same individuals, and thus can tell a lot about how biodiversity data is capturing less biodiversity than one might think.

Besides similarity in data and relevance for biodiversity, there seems to be a good opportunity, historically. Recently, I participated in a webinar, organized by WildLabs on motion ecology, where it became clear that the field of motion ecology is rethinking its data standards.

I would like to see these limitations listed explicitly. The challenge of standardizing data into any format is always a hurdle, but the most-commonly-recommended DwC usage of Event-core + Occurrence & MoF extensions is a generic data model that can incorporate any key-value pair.

I will need to dig into the use cases in detail to better understand this initiative, but my first impression is:

  1. I come to GBIF looking for occurrence data - that is GBIF’s niche.
  2. From a data-consumer perspective I want to submit queries that return a table with occurrence rows; the underlying data model doesn’t matter much to me so long as the API supports this kind of query.
  3. From a dataset provider perspective: all these additional tables feel like added complication; users already struggle understanding how to create the three .csv files for the aforementioned DwC schema. Is this new model going to raise the barrier of entry for dataset submission?

Regarding the case studies, I wonder whether one of them covers the following (I haven’t been able to find this): the data are species observations from inventories that can have a varying degree of standardisation; what they have in common is that the data are event-based. They also have in common that species observations are collected using a protocol that uses a species checklist (potentially even different checklists depending on, for example, observer experience or knowledge). Hence, by knowing the checklist, absences can be inferred. For each event there is therefore a linked checklist (it could be the same checklist for all events contained in the same dataset, or different checklists for different events). The checklist may either be published standalone (and then be linked to the event) or be included in the publishing model. Specifically, I wonder whether the Grand unified model could cover such linked checklists; I would see this as information related to Protocol.

A checklist is easily represented as a set of occurrenceStatus present and absent. I don’t work with checklists though; is there something else that wouldn’t be captured that way?

Representing a checklist as a set of occurrenceStatus present and absent requires a lot of entries for the absences, though. An easier way, or rather one that is more economical in data entry and storage, would be to list only the presences and combine them with a checklist (the list of species that was looked for during the inventory). Absences can then be inferred retrospectively and do not have to be stored as such. I know of a lot of inventory programmes that use a species checklist and store the presences in their database, but not normally the absences (saving a lot of storage space).
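
The inference itself is a simple set difference. As a minimal sketch in Python (the function names, the event identifiers and the species lists are made-up examples, not data from any real inventory):

```python
def infer_absences(checklist, presences):
    """Return the taxa that were looked for but not recorded, sorted by name."""
    return sorted(set(checklist) - set(presences))


def expand_occurrences(event_id, checklist, presences):
    """Expand stored presences plus a checklist into explicit present/absent rows."""
    present = set(presences)
    return [
        {
            "eventID": event_id,
            "scientificName": name,
            "occurrenceStatus": "present" if name in present else "absent",
        }
        for name in checklist
    ]


# Hypothetical example: one shared checklist, presences stored per event.
checklist = ["Parus major", "Erithacus rubecula", "Turdus merula", "Sitta europaea"]
presences_by_event = {
    "event-1": ["Parus major", "Turdus merula"],
    "event-2": [],
}
absences_by_event = {
    event_id: infer_absences(checklist, presences)
    for event_id, presences in presences_by_event.items()
}
```

Only the checklist and the presences need to be stored; the explicit absent rows can be generated on demand, and a different checklist could be passed per event where the protocol varies.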

Hi
I think I can answer this. I’ve heard it mentioned multiple times that existing data publishing models will continue to work.
Secondly, the new publishing model could in theory be simpler for some communities if adapters of some sort were built. This still isn’t clarified. But what is clear is that existing publishing models will not break.

Thirdly, the APIs that expose data will not have to change either. The model could still be mapped to the existing APIs and table views.

Thank you @DeboraArlt and @mhoefft for helping me understand. I can see the benefit of including a checklist at the dataset level in terms of archive prep complexity and size.

I feel better hearing that absences from checklists will still be available through the API in the same way.