Diversifying the GBIF data model - intro

I have a general remark regarding the data model, and I assumed that this would be the best place to share it.

I noticed a couple of places where the Entity Relationship (ER) diagrams are inconsistent with ER best practices. (I am primarily talking about the Unified Common Model here, but as the conceptual models for the use cases are subsections of the Unified Model, it pertains to them as well.)

I noticed that the notation for the connecting arrows is wrong in most cases where a class references itself to express some kind of hierarchy. The way I see it (and this aligns both with the description given in the Glossary regarding the Explanation for Diagrams in the Use Cases and with documentation in other places, e.g. Wikipedia), they need to be flipped. This is true for Event, Relationship, DigitalEntity and Taxon. For MaterialEntity, however, it is correct.

Self-referencing makes it hard to read the relations in the correct order, which might explain the mix-up. It helps to separate them out into two different blocks representing the same concept, one for the parent and one for the child.

This makes it easier to read the notation as intended, in this case: “A child has one parent” and “A parent can have multiple children”. If we now flip the two entity blocks back on top of each other, we get a line that has the single dash at the EntityID and the crow's foot at the ParentEntityID. This also aligns with the expectation that the EntityID would be the primary key for this entity, and thus unique.
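To make the cardinality concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table `entity` and its columns `entity_id` and `parent_entity_id` are placeholders I made up for illustration; they are not taken from the actual model. The unique side (the single dash) is the primary key, and the "many" side (the crow's foot) is the parent reference, since many rows can point at the same parent.

```python
import sqlite3

# In-memory database; all table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.execute(
    """
    CREATE TABLE entity (
        entity_id        TEXT PRIMARY KEY,                  -- unique: the "one" side
        parent_entity_id TEXT REFERENCES entity(entity_id)  -- repeatable: the "many" side
    )
    """
)
conn.executemany(
    "INSERT INTO entity VALUES (?, ?)",
    [("root", None), ("child-a", "root"), ("child-b", "root")],
)


def children_of(conn, entity_id):
    """A parent can have multiple children."""
    rows = conn.execute(
        "SELECT entity_id FROM entity WHERE parent_entity_id = ? ORDER BY entity_id",
        (entity_id,),
    )
    return [row[0] for row in rows]


def parent_of(conn, entity_id):
    """A child has at most one parent (None for a root or an unknown ID)."""
    row = conn.execute(
        "SELECT parent_entity_id FROM entity WHERE entity_id = ?",
        (entity_id,),
    ).fetchone()
    return row[0] if row else None
```

Calling `children_of(conn, "root")` returns both children, while `parent_of(conn, "child-a")` can only ever return a single value, which is the reading the flipped notation is meant to express.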

Here are now some other issues that I noticed as well:

  • by convention, the primary key should be the first property listed within an entity and should be named ID. Both Event and MaterialSequence break with this convention. MaterialSequence has sequence, which looks like it is the primary key, but it is neither named properly nor in the right position; if it gets renamed, it also needs to change for SequenceTaxon. (Not all entities need primary keys though: entities that only provide many-to-many mappings, like MaterialGroupMembership, usually work fine without one.)
  • the relationship between TaxonDistribution and Event also needs to be flipped; this is related to the self-reference of Event
  • the hierarchy within Relationship (dependsOnRelationshipID) seems to be quite generic, which makes me wonder whether a hierarchy is actually the best way to model this, or whether there might be cases where a many-to-many connection between dependent relationships is necessary, in which case an additional support entity might be needed. I have not checked the different use cases to see if this is necessary, but because the entire entity is so generic, it raised a warning for me.
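As a rough sketch of what such a support entity could look like, again using Python's sqlite3 module: the tables `relationship` and `relationship_dependency` and all of their columns are hypothetical names of mine, not something the model defines. The point is only that a single dependsOnRelationshipID column can hold one dependency per row, whereas a mapping table can hold several.

```python
import sqlite3

# In-memory database; every table and column name is an assumption for illustration.
db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.execute("CREATE TABLE relationship (relationship_id TEXT PRIMARY KEY)")
db.execute(
    """
    CREATE TABLE relationship_dependency (
        relationship_id            TEXT NOT NULL REFERENCES relationship(relationship_id),
        depends_on_relationship_id TEXT NOT NULL REFERENCES relationship(relationship_id),
        -- No separate ID column: the pair itself identifies the row.
        PRIMARY KEY (relationship_id, depends_on_relationship_id)
    )
    """
)
db.executemany("INSERT INTO relationship VALUES (?)", [("r1",), ("r2",), ("r3",)])
db.executemany(
    "INSERT INTO relationship_dependency VALUES (?, ?)",
    [("r3", "r1"), ("r3", "r2"), ("r2", "r1")],  # r3 depends on two relationships
)


def depends_on(db, relationship_id):
    """All relationships that the given relationship depends on."""
    rows = db.execute(
        "SELECT depends_on_relationship_id FROM relationship_dependency "
        "WHERE relationship_id = ? ORDER BY depends_on_relationship_id",
        (relationship_id,),
    )
    return [row[0] for row in rows]


def dependents_of(db, relationship_id):
    """All relationships that depend on the given relationship."""
    rows = db.execute(
        "SELECT relationship_id FROM relationship_dependency "
        "WHERE depends_on_relationship_id = ? ORDER BY relationship_id",
        (relationship_id,),
    )
    return [row[0] for row in rows]
```

Here `depends_on(db, "r3")` yields two entries, which a plain parent column could not represent. The mapping table also works without an ID of its own, in the same way as noted for MaterialGroupMembership above.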

I know that some of the issues mentioned above might seem a bit nitpicky and meta (not so much about the actual data model as about the visualization of said data model for the sake of this discussion), but I wanted to make sure that there are no misunderstandings when people talk about the different entities and their relations to each other.

All of the issues and criticism mentioned above are minor and don’t take anything away from the fact that @trobertson and @tuco did a great job with this really expressive data model. It covers so many different use cases while still being manageable in size and complexity.