Webinar 2: Entity relationships and attributes (Francesca Jaroszynska)

The following question(s) were asked in the Collection Management Systems Webinar and will be answered here.

Francesca Jaroszynska: how can the entityRelationship table handle partial information when the entity observation is a collection of individuals of a given taxon? Could nesting observations in the Entity table be more efficient? For example, when information on lifeStage, sex or ID tags is available for only some of the individuals observed.

Response:
This is an interesting question whose solution could also apply to lots in collections where individuals are not identified or tracked separately, but subsets of which do have distinct attributes. An example is the best way to demonstrate one approach.

Let the scenario be a monitoring Event targeting a flock of pigeons at a particular city site. One of the pigeons has a band and so can be identified as a specific individual with known sex and life stage. The goal is to do as well as possible characterizing the population structure in terms of sex and life stage.

First we need the Event:
eventID: event3
eventType: dwc:HumanObservation
locationID: pigeon_site1
eventDate: 2022-07-17
habitat: city

The Entities that can be instantiated are the pigeon population “PigeonPopulation1”, and the marked pigeon “Pigeon1”, which is a member of that population. Both are dwc:Organisms, though one has a dwc:organismScope of “population” and the other has a dwc:organismScope of “individual”.
entityID: PigeonPopulation1
entityType: dwc:Organism
entityID: Pigeon1
entityType: dwc:Organism

To capture the scope of the Organisms we can use EntityAssertions:
entityAssertionID: ea6
entityID: PigeonPopulation1
entityAssertionType: dwc:organismScope
entityAssertionValue: population

entityAssertionID: ea7
entityID: Pigeon1
entityAssertionType: dwc:organismScope
entityAssertionValue: individual

In order to connect the Entities with the monitoring event in which they were observed, we need EntityEvents, one for the population:
entityID: PigeonPopulation1
eventID: event3

and one for the marked individual:
entityID: Pigeon1
eventID: event3

We need to show that “Pigeon1” was a part of “PigeonPopulation1” on the date observed. We’ll do that with an EntityRelationship:
entityRelationshipID: er_id16
subjectEntityID: Pigeon1
entityRelationshipType: member of
objectEntityID: PigeonPopulation1
entityRelationshipDate: 2022-07-17

The property entityRelationshipDate is not yet in the Unified Model, but this mini use case highlights the need for it. The complementary EntityRelationship is:
entityRelationshipID: er_id17
subjectEntityID: PigeonPopulation1
entityRelationshipType: has member
objectEntityID: Pigeon1
entityRelationshipDate: 2022-07-17
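
The complementary record is mechanical to derive from the first one. A minimal sketch, assuming records are held as plain Python dicts keyed by the field names above, and assuming a hypothetical INVERSE lookup table (neither is defined by the Unified Model):

```python
# Hypothetical lookup of inverse relationship types; not part of the Unified Model.
INVERSE = {"member of": "has member", "has member": "member of"}


def complement(rel, new_id):
    """Return the inverse EntityRelationship, swapping subject and object."""
    return {
        "entityRelationshipID": new_id,
        "subjectEntityID": rel["objectEntityID"],
        "entityRelationshipType": INVERSE[rel["entityRelationshipType"]],
        "objectEntityID": rel["subjectEntityID"],
        "entityRelationshipDate": rel["entityRelationshipDate"],
    }


er16 = {
    "entityRelationshipID": "er_id16",
    "subjectEntityID": "Pigeon1",
    "entityRelationshipType": "member of",
    "objectEntityID": "PigeonPopulation1",
    "entityRelationshipDate": "2022-07-17",
}

er17 = complement(er16, "er_id17")
```

Generating the inverse this way keeps the two records from drifting apart when one of them is edited.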

Now we can model the attributes of “Pigeon1” with EntityAssertions. Let’s say the marked pigeon is an adult female:
entityAssertionID: ea8
entityID: Pigeon1
entityAssertionType: dwc:lifeStage
entityAssertionValue: adult
entityAssertionDate: 2022-07-17

entityAssertionID: ea9
entityID: Pigeon1
entityAssertionType: dwc:sex
entityAssertionValue: female
entityAssertionDate: 2022-07-17

Now we can model the attributes of “PigeonPopulation1”, also with EntityAssertions. The flock had 13 individuals on the day they were observed, including the banded individual:
entityAssertionID: ea10
entityID: PigeonPopulation1
entityAssertionType: dwc:organismQuantity
entityAssertionValueNumeric: 13
entityAssertionUnit: individuals
entityAssertionDate: 2022-07-17

It was easy enough to distinguish the juveniles from the adults:
entityAssertionID: ea11
entityID: PigeonPopulation1
entityAssertionType: juvenile count
entityAssertionValueNumeric: 6
entityAssertionUnit: individuals
entityAssertionDate: 2022-07-17

entityAssertionID: ea12
entityID: PigeonPopulation1
entityAssertionType: adult count
entityAssertionValueNumeric: 7
entityAssertionUnit: individuals
entityAssertionDate: 2022-07-17

But the sex of the adults could only be divined by their behavior, which 4 of the unmarked adult population exhibited:
entityAssertionID: ea13
entityID: PigeonPopulation1
entityAssertionType: minimum adult male count
entityAssertionValueNumeric: 2
entityAssertionUnit: individuals
entityAssertionDate: 2022-07-17
entityAssertionRemark: determined by behavior

entityAssertionID: ea14
entityID: PigeonPopulation1
entityAssertionType: minimum adult female count
entityAssertionValueNumeric: 3
entityAssertionUnit: individuals
entityAssertionDate: 2022-07-17
entityAssertionRemark: determined by behavior for two individuals, the third was a marked individual of confirmed sex
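
The population-level counts above can be checked against each other. A minimal sketch, assuming the numeric assertions for “PigeonPopulation1” are collected into a dict keyed by entityAssertionType (the dict layout and variable names are illustrative only):

```python
# Numeric EntityAssertions for PigeonPopulation1, keyed by entityAssertionType.
counts = {
    "dwc:organismQuantity": 13,
    "juvenile count": 6,
    "adult count": 7,
    "minimum adult male count": 2,
    "minimum adult female count": 3,
}

total = counts["dwc:organismQuantity"]
adults = counts["adult count"]
sexed_adults = (
    counts["minimum adult male count"] + counts["minimum adult female count"]
)

# Juveniles and adults should partition the flock.
assert counts["juvenile count"] + adults == total
# The minimum sexed counts cannot exceed the number of adults.
assert sexed_adults <= adults

# Adults whose sex could not be determined.
unsexed_adults = adults - sexed_adults
```

Here unsexed_adults comes out as 2: the seven adults minus the two males and three females, one of which is the banded female.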

Hi John,

I have difficulties with the way you handle the issue, but I think there are 2 topics: the way to handle relationships between entities, and the use of the entityAssertionType vocabulary.

Let's start with the second, using your example:

entityAssertionID: ea14
entityID: PigeonPopulation1
entityAssertionType: minimum adult female count
entityAssertionValueNumeric: 3
entityAssertionUnit: individuals

Here, you tell us within a single assertionType that you saw at least 3 adult females. I am afraid this will inflate the assertionType vocabulary (@abentley topic), either by reordering words or by adding new elements (minimum white adult female count?).

I would advocate separating the assertions, in the same way you did for the single pigeon:
sex: female
life stage: adult
organism quantity: 3
However, doing this means that the described entity is no longer PigeonPopulation1, but a subpart of it.

entityAssertionID: ea10
entityID: PigeonPopulation1

entityAssertionType: dwc:organismQuantity
entityAssertionValueNumeric: 13
entityAssertionUnit: individuals

entityAssertionType: juvenile count
entityAssertionValueNumeric: 6
entityAssertionUnit: individuals

entityAssertionType: minimum adult male count
entityAssertionValueNumeric: 2
entityAssertionUnit: individuals

entityAssertionType: minimum adult female count
entityAssertionValueNumeric: 3
entityAssertionUnit: individuals

Here, you tell us that on PigeonPopulation1, you saw :
13 individuals,
6 juveniles,
7 adults,
at least 2 adult males,
at least 3 adult females.

From the structure of the data, I am not sure how many individuals you saw, as all the assertions describe PigeonPopulation1:

  • 13 (I guess it was your value)
  • 26 = 13 undetermined + 6 juveniles + 7 adults (including 2 males and 3 females)
  • 31 = 13 undetermined + 6 juveniles sex undetermined + 7 adults sex undetermined + 2 adult males + 3 adult females
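
The three readings can be reproduced with simple arithmetic. A minimal sketch, with variable names chosen only for illustration:

```python
# Counts asserted against PigeonPopulation1.
total, juveniles, adults = 13, 6, 7
min_adult_males, min_adult_females = 2, 3

# Reading 1: organismQuantity is the total and the other counts are subsets of it.
reading_nested = total

# Reading 2: total, juveniles and adults are disjoint groups;
# the sexed adults are subsets of the adults.
reading_lifestage_additive = total + juveniles + adults

# Reading 3: every assertion describes a disjoint group.
reading_all_additive = (
    total + juveniles + adults + min_adult_males + min_adult_females
)
```

Nothing in the flat list of assertions tells a consumer which of the three sums was intended.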

Here, I would advocate identifying two kinds of entities: the flock itself, with 13 individuals and likely other assertions specific to the flock (area covered, speed and direction…); and subparts of the flock, modeled as new entities related to the flock. This helps keep the vocabulary as controlled as possible while being clear about the components. This again argues for adding new entities.

This brings us to the first topic:

By increasing the number of entities, we increase the number of entity relationships. Those “member of / has member” relationships, or any kind of “parent/child” relationship, are not of much biological interest, as they only indicate a hierarchical relation in the database. They are, in addition, quite laborious to fill in, whether by script or by hand.

If we had a parentEntityID field, we could manage that more easily. Interestingly, @DavidFichtmueller used a diagram including this parentEntityID on April 20 (topic)

entityID: PigeonPopulation1
entityType: dwc:Organism
entityAssertionType: dwc:organismQuantity
entityAssertionValue: 13

entityID: Pigeon1
parentEntityID: PigeonPopulation1
entityType: dwc:Organism*
entityAssertionType: dwc:sex
entityAssertionValue: female
entityAssertionType: dwc:lifeStage
entityAssertionValue: adult

entityID: PigeonPopulation1_1
parentEntityID: PigeonPopulation1
entityType: dwc:Population
entityAssertionType: dwc:organismQuantity
entityAssertionValue: 6
entityAssertionType: dwc:lifeStage
entityAssertionValue: juvenile

entityID: PigeonPopulation1_2
parentEntityID: PigeonPopulation1
entityType: dwc:Population
entityAssertionType: dwc:organismQuantity
entityAssertionValue: 7
entityAssertionType: dwc:lifeStage
entityAssertionValue: adult

entityID: PigeonPopulation1_2_1
parentEntityID: PigeonPopulation1_2
entityType: dwc:Population
entityAssertionType: dwc:organismQuantity
entityAssertionValue: 2
entityAssertionType: dwc:sex
entityAssertionValue: male

entityID: PigeonPopulation1_2_2
parentEntityID: PigeonPopulation1_2
entityType: dwc:Population
entityAssertionType: dwc:organismQuantity
entityAssertionValue: 3
entityAssertionType: dwc:sex
entityAssertionValue: female

This approach would also make it technically easier to keep the original large observation, the flock of 13 individuals (i.e. the entity with no parentEntityID), and would allow the entityRelationship table to be cleared of the least relevant information.
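
With a parent pointer on each entity, a consumer can check how much of each parent is accounted for by its children. A minimal sketch, assuming the proposed parentEntityID field and holding each entity as a hypothetical (entityID, parentEntityID, organismQuantity) tuple; Pigeon1 has no stated quantity, so it is skipped in the sums:

```python
from collections import defaultdict

# (entityID, parentEntityID, organismQuantity); None means "not stated".
ENTITIES = [
    ("PigeonPopulation1", None, 13),
    ("Pigeon1", "PigeonPopulation1", None),
    ("PigeonPopulation1_1", "PigeonPopulation1", 6),    # juveniles
    ("PigeonPopulation1_2", "PigeonPopulation1", 7),    # adults
    ("PigeonPopulation1_2_1", "PigeonPopulation1_2", 2),  # adult males
    ("PigeonPopulation1_2_2", "PigeonPopulation1_2", 3),  # adult females
]


def unaccounted(entities):
    """Map each parent to its quantity minus the summed quantities of its children."""
    quantity = {entity_id: qty for entity_id, _, qty in entities}
    child_sum = defaultdict(int)
    for _, parent_id, qty in entities:
        if parent_id is not None and qty is not None:
            child_sum[parent_id] += qty
    return {
        parent_id: quantity[parent_id] - summed
        for parent_id, summed in child_sum.items()
        if quantity.get(parent_id) is not None
    }


remainder = unaccounted(ENTITIES)
```

A remainder of 0 means the children partition the parent, a positive value means some individuals are undetermined, and a negative value would signal overlapping or inconsistent child groups.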

The addition of “minimal” could be handled as an estimated value (discussion):
assertionID: x
parentAssertionID: cf. the assertionID of the “organismQuantity: 2 individuals”
assertionType: minimal
assertionValueNumeric: 2
assertionUnit: individuals

Wouldn’t that be a nice improvement, perfectly in line with parentEventID, parentAssertionID, parentTaxonID, and every dependsOn element?


Hi John,

Thanks for the detailed reply! I see the advantages of using assertions when there are no pre-existing DwC terms, but in the use case above, lifeStage and sex are both existing DwC terms. I fear that specifying sex and lifeStage as assertionTypes runs the risk of mis-specification of the information, resulting in scenarios similar to those outlined by Guillaume, where it is unclear whether multiple assertion entries apply to the same or different individuals of the original population/population sample. Nesting in the Entity table could avoid this potential confusion, as each ‘child’ subsample of the population observation would describe the sex and lifeStage, or ID tag of the individuals concerned.

I would think that nesting would also avoid the necessity for the verbose entityAssertionRemark which, in the current use case, would inhibit smooth handling of large datasets. For example, information about the sex and lifeStage of the individual pigeon with the leg ring could be explicitly listed as a nested entity of the parent Entity, enormously reducing the amount of effort, both human and machine, needed to extract the information from the entityAssertionRemark.

If I am understanding things correctly, using the EntityRelationship and EntityAssertion method for specifying information on ID, sex and lifeStage, provides information on potentially overlapping slices of the population observation (i.e. the same individual can appear in multiple EntityAssertions). Am I right here? Using the parentEntityID would allow us to provide information on unique, distinct slices of the population observation (I use the term population loosely here to describe occurrences of a group of related or unrelated individuals).

Not to sidetrack the discussion, but another case being explored by @abbybenson and me raises similar questions, documented in these graphics.

In our case, there are statements like “there are 3 female individuals sized 22-23cm” against a set of same-species organisms collected in trawl data conducted within a survey protocol. The problem is compounded by an additional “the count, compensated because the trawl was shorter than the intended survey protocol is 2.7”.

In this study, my concern is that it may be impractical to reconstruct the groupings necessary (e.g. “females 22-23cm”, “females 23-24cm” etc.) on which to apply measurements (frequency=3, compensatedFrequency=2.7), and that we may end up with a verbose set of groupings that are of limited use. One idea would be an assertion model that lets one record statements against an entity that include arbitrary combinations of fields (e.g. sex, length, lengthUnit, organismCount) rather than a single value. This would reduce the need for creating groups and/or child entities for every combination and would more closely match what publishers often have in source data.
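
To make the idea concrete, here is a rough sketch of what one such multi-field statement might look like. The class name WideAssertion, the lengthMin/lengthMax split, compensatedCount and the entityID value "trawl_catch1" are all hypothetical; only sex, lengthUnit and organismCount are field names mentioned above:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WideAssertion:
    """One statement against an entity; any field left as None is simply not asserted."""

    entityID: str
    sex: Optional[str] = None
    lengthMin: Optional[float] = None
    lengthMax: Optional[float] = None
    lengthUnit: Optional[str] = None
    organismCount: Optional[int] = None
    compensatedCount: Optional[float] = None


# "There are 3 female individuals sized 22-23cm", compensated count 2.7.
row = WideAssertion(
    entityID="trawl_catch1",
    sex="female",
    lengthMin=22.0,
    lengthMax=23.0,
    lengthUnit="cm",
    organismCount=3,
    compensatedCount=2.7,
)
```

Each source row becomes one record, so no child entity is needed per sex and size class; the trade-off is that the set of columns has to be agreed on in advance.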


@tuco and I took a call and are satisfied that the model captures the measurements against groups of individuals in the Trawl Survey data I referred to above. We also acknowledge there may be merit in the alternative way of structuring the measurements as a wide table. We will park that idea for the time being waiting for further justification to consider it.

I’ve updated the slides to reflect these outcomes @abbybenson and think we can proceed with demonstrating a conversion from the raw data into the model if you wish to.


@trobertson this looks like a very good solution to the problem you present!
How would your method handle datasets where only partial information is available? Would you assign all measurements to the attributes table? For example, if 17 individuals were observed, but information on sex and lifeStage was available for only part of the group:

entityID | materialGroupID | scientificName | individualCount | sex | lifeStage
entity1 | | Sus scrofa | 17 | |
entity1 | materialGroup1 | Sus scrofa | 5 | female | adult
entity1 | materialGroup2 | Sus scrofa | 2 | male | adult

By specifying two material Groups, would there be any confusion as to whether a total of 17 or 24 individuals were observed? Would you also assign the material Group measurements (in this case sex and lifeStage) to the attributes table?
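
The two possible totals can be computed directly from the rows. A minimal sketch, assuming each table row is a dict and that a row with no materialGroupID describes the whole entity (that convention is an assumption, not something the model states):

```python
rows = [
    {"entityID": "entity1", "materialGroupID": None, "individualCount": 17},
    {"entityID": "entity1", "materialGroupID": "materialGroup1",
     "individualCount": 5, "sex": "female", "lifeStage": "adult"},
    {"entityID": "entity1", "materialGroupID": "materialGroup2",
     "individualCount": 2, "sex": "male", "lifeStage": "adult"},
]

entity_total = sum(
    r["individualCount"] for r in rows if r["materialGroupID"] is None
)
group_total = sum(
    r["individualCount"] for r in rows if r["materialGroupID"] is not None
)

# Reading 1: the material groups are subsets of the 17 observed individuals.
subset_reading = entity_total
unclassified = entity_total - group_total

# Reading 2: the material groups are counted in addition to the 17.
additive_reading = entity_total + group_total
```

Under the subset reading 10 individuals remain without sex or lifeStage; the data alone does not say which reading is correct, which is the ambiguity raised in the question.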
