The vexed question of occurrences with mixed sexes or life stages in Darwin Core

datafixer · March 6, 2023, 9:45pm

Suppose you have a registered museum sample (SM1234) containing 1 male, 2 females and 3 juveniles of a particular animal species. According to the specimen label, they were all collected at the same place on the same day by the same collector. As a collection manager you want to share the information about sex and life stage, but how would you do that in Darwin Core?

This post considers some of the possibilities. In all cases the event details are the same for each occurrence record. For convenience I’ve shown individualCount rather than organismQuantity plus organismQuantityType (individuals).

Split the occurrence. A Darwin Core maintainer has recommended splitting the occurrence into three separate occurrence records, like this:

occurrenceID	catalogNumber	individualCount	sex	lifeStage
ddd1a-1	SM1234	1	male	adult
ddd1a-2	SM1234	2	female	adult
ddd1a-3	SM1234	3		juvenile

This solution allows sex and lifeStage to have controlled vocabularies. It could also greatly increase the number of individual records arising from a single collection or observation.
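
As a rough illustration of the split approach, here is a minimal Python sketch that turns one lot into one record per sex/lifeStage group. The function name split_lot and the "-1", "-2" suffix scheme are assumptions made for this example (the suffixes simply mirror the table above); nothing here is prescribed by Darwin Core.

```python
from typing import Dict, List


def split_lot(occurrence_id: str, catalog_number: str,
              groups: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Return one occurrence record per sex/lifeStage group in a lot."""
    records = []
    for n, group in enumerate(groups, start=1):
        records.append({
            # Hypothetical ID scheme: parent ID plus a running number.
            "occurrenceID": f"{occurrence_id}-{n}",
            # Every split record keeps the original catalogue number.
            "catalogNumber": catalog_number,
            "individualCount": group["individualCount"],
            "sex": group.get("sex", ""),
            "lifeStage": group.get("lifeStage", ""),
        })
    return records


groups = [
    {"individualCount": 1, "sex": "male", "lifeStage": "adult"},
    {"individualCount": 2, "sex": "female", "lifeStage": "adult"},
    {"individualCount": 3, "lifeStage": "juvenile"},  # sex not recorded
]
records = split_lot("ddd1a", "SM1234", groups)
for record in records:
    print("\t".join(str(value) for value in record.values()))
```

Because each group carries a single sex and a single lifeStage value, those two fields can be checked against a controlled vocabulary record by record.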

Split the occurrence and also summarise it (nested occurrences). This suggestion comes from Anne-Sophie Archambeau, Guillaume Body, Francesca Jaroszynska and Sophie Pamerlon. It proposes a new Darwin Core term, parentOccurrenceID, and uses it like this:

parentOccurrenceID	occurrenceID	catalogNumber	individualCount	sex	lifeStage
	ddd1a	SM1234	6
ddd1a	ddd1a-1	SM1234	1	male	adult
ddd1a	ddd1a-2	SM1234	2	female	adult
ddd1a	ddd1a-3	SM1234	3		juvenile

Controlled vocabularies still apply. With this proposal, data publishers could have separate occurrence records for individuals from a group, for example if one of the females had an associated image or an associated DNA sequence.
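
One consequence of nesting is that parent and child counts can drift apart. The following Python sketch shows a consistency check a publisher might run, assuming the proposed parentOccurrenceID term is present as a column; the function name check_nested_counts is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def check_nested_counts(records: List[dict]) -> Dict[str, Tuple[int, int]]:
    """Return {parentID: (parent count, summed child count)} for mismatches."""
    by_id = {r["occurrenceID"]: r for r in records}
    child_totals = defaultdict(int)
    for r in records:
        parent = r.get("parentOccurrenceID")
        if parent:  # top-level records have no parent
            child_totals[parent] += r["individualCount"]
    return {
        pid: (by_id[pid]["individualCount"], total)
        for pid, total in child_totals.items()
        if pid in by_id and by_id[pid]["individualCount"] != total
    }


nested = [
    {"parentOccurrenceID": "", "occurrenceID": "ddd1a", "individualCount": 6},
    {"parentOccurrenceID": "ddd1a", "occurrenceID": "ddd1a-1", "individualCount": 1},
    {"parentOccurrenceID": "ddd1a", "occurrenceID": "ddd1a-2", "individualCount": 2},
    {"parentOccurrenceID": "ddd1a", "occurrenceID": "ddd1a-3", "individualCount": 3},
]
print(check_nested_counts(nested))
```

An empty result means every parent count equals the sum of its children.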

Pack the information into the sex or lifeStage field. This is by far the most popular choice, and can be seen in innumerable Darwin Core datasets in GBIF:

occurrenceID	catalogNumber	individualCount	sex	lifeStage
ddd1a	SM1234	6	1 male, 2 females, 3 juveniles

or

occurrenceID	catalogNumber	individualCount	sex	lifeStage
ddd1a	SM1234	6		1 male, 2 females, 3 juveniles

No controlled vocabularies here, but this possibility requires the least work from the collection manager’s point of view, and accords with a “one sample, one record” principle.

Pack the information into an organismRemarks field. I’ve used this in a Darwin Core dataset of millipede records:

occurrenceID	catalogNumber	individualCount	sex	lifeStage	organismRemarks
ddd1a	SM1234	6			1 male | 2 female | 3 juvenile

The entry style is the one recommended for multiple recordedBy entries, with data items separated by [space][bar][space].
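
A data user who wants the counts back out of such a field could parse it along these lines. This is only a sketch: it assumes every item has the form "count label" and that items are separated by exactly [space][bar][space]; the function name parse_remarks is made up for the example.

```python
import re
from typing import Dict

# One item looks like "2 female": an integer, whitespace, then a label.
ITEM = re.compile(r"^(\d+)\s+(.+)$")


def parse_remarks(remarks: str) -> Dict[str, int]:
    """Parse '1 male | 2 female | 3 juvenile' into {'male': 1, ...}."""
    counts = {}
    for item in remarks.split(" | "):
        match = ITEM.match(item.strip())
        if match is None:
            raise ValueError(f"unexpected item: {item!r}")
        counts[match.group(2)] = int(match.group(1))
    return counts


counts = parse_remarks("1 male | 2 female | 3 juvenile")
print(counts, sum(counts.values()))
```

The summed counts can then be compared with individualCount as a sanity check.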

Partition the information in the sex and lifeStage fields.

occurrenceID	catalogNumber	individualCount	sex	lifeStage
ddd1a	SM1234	6	1 male, 2 females	3 adults, 3 juveniles

This is a bit puzzling at first, and would become confusing if juveniles were also distinguishable as male or female.

Further developments. GBIF is apparently discussing the use of “mixed” as an acceptable entry in the sex field. The blank sex field in the organismRemarks solution (above) could then be filled with “mixed”. I’m not clear on why “mixed” would not also be useful in lifeStage.

Comments welcome.

Robert Mesibov (“datafixer”); robert.mesibov@gmail.com

gambleb · March 8, 2023, 1:20pm

I am in favor of adding “mixed” as a controlled vocabulary word in both sex and life stage, and then using the organismRemarks field to detail the specifics. Using | between distinct values is preferred, to help with parsing if needed. It would inflate our already huge datasets to have to parse these lot records into individual occurrence records.

Our best practice guide is to parse out individuals into unique occurrence records if derivative records, such as tissue samples or DNA records, are created from them so those can be linked specifically to a distinct individual organism record.

Beth Gamble - SI NMNH, US

cecsve · March 9, 2023, 12:40pm

Thanks for a clear overview of how data is and can be mapped in terms of life stage and sex @datafixer!

GBIF currently have and use a lifeStage vocabulary for interpretation and we will probably refine it in the near future, for example to accommodate the British Oceanographic Data Centre’s (BODC) controlled vocabulary and improve the interpretation of the verbatim values provided by the publishers. We will revisit the verbatim values provided by the publishers to see if it would make sense to include mixed as a concept in life stage.

The sex vocabulary is underway and the preliminary plan is to include the concept mixed. BODC’s vocabulary includes somewhat more complex concepts than we have planned so far, including female+indeterminate and male+female, which may be relevant to include as well.

Further developments. GBIF is apparently discussing the use of “mixed” as an acceptable entry in the sex field. The blank sex field in the organismRemarks solution (above) could then be filled with “mixed”.

One thing I want to highlight is that the vocabulary interpretation applies to the field in question only, and would therefore not automatically populate other fields if extra data is provided (through fields being mapped incorrectly). So GBIF’s interpretation process would not split organismRemarks into the relevant individualCount, sex, and lifeStage fields. Instead, 1 male | 2 female | 3 juvenile provided in the sex field would be interpreted as either mixed (which is the current plan) or male+female, if BODC’s vocabulary was included in the controlled vocabulary.
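
To make the "field in question only" point concrete, here is a toy Python sketch of single-field interpretation. It is not GBIF's implementation; the function name interpret_sex, the regular expression and the rule "more than one sex word means mixed" are all assumptions for illustration.

```python
import re
from typing import Optional

# Whole-word match, so the "male" inside "female" is not counted twice.
SEX_WORD = re.compile(r"\b(male|female)s?\b", re.IGNORECASE)


def interpret_sex(verbatim: str) -> Optional[str]:
    """Map a verbatim sex value to 'male', 'female', 'mixed' or None."""
    found = {m.group(1).lower() for m in SEX_WORD.finditer(verbatim)}
    if len(found) > 1:
        return "mixed"
    if len(found) == 1:
        return next(iter(found))
    return None  # nothing recognisable in this field


print(interpret_sex("1 male | 2 female | 3 juvenile"))
```

Note that the function only returns a value for sex. The counts and the juvenile information in the verbatim string are not carried over to individualCount or lifeStage, which is the behaviour described above.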

EstebanMH-SiB · March 15, 2023, 1:38pm

This is a topic that we have been discussing internally and externally for quite a while. I am going to give our general opinion but I think this is a topic worth discussing.

Originally we used the option Pack the information into the sex or lifeStage field, using “ | ” to separate the different vocabulary values. When we put the question to TDWG, their suggestion was to split the records, so we started to do that and it is our current practice.

We think that is the best way to store that information, especially if you can populate individualCount. There is no duplication of information per se; you have different records that are all related. Ideally data users should use individualCount in their analysis. It is some extra work, but we think it is valuable, and we also try to follow the standard as much as we can.

However this approach sometimes has problems, especially with collections.
People in some collections don’t like to split their records, because for them one record = one vial/box, regardless of the content of that vial. So we have to keep the records unsplit and keep the information in organismRemarks, which is not ideal, but the publisher has the last word.

Another problem is that we are forced to add a dummy letter to occurrenceID to make them different, so we end up with UNAL:ICN:Ant0001a, UNAL:ICN:Ant0001b, UNAL:ICN:Ant0001c. It is not the cleanest way, but it works, and we keep the original catalog number in all records, so you can tell that they are related.
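
The letter-suffix scheme is simple to generate. A minimal Python sketch follows; the function name suffixed_ids and the limit of 26 groups per lot are assumptions of the example, not part of the practice described.

```python
from string import ascii_lowercase
from typing import List


def suffixed_ids(base_id: str, n: int) -> List[str]:
    """Return n occurrenceIDs made by appending a, b, c, ... to base_id."""
    if n > len(ascii_lowercase):
        raise ValueError("this sketch only handles up to 26 groups per lot")
    return [base_id + ascii_lowercase[i] for i in range(n)]


print(suffixed_ids("UNAL:ICN:Ant0001", 3))
```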

The real problem comes when we do not have an individual count for every sex/lifeStage, and instead have only a total number of individuals plus information on sex and lifeStage.
What do you do when you have something like this?

occurrenceID	catalogNumber	individualCount	sex	lifeStage
ddd1a	SM1234	6	male | female	adult | juvenile

There are several options and none is perfect. You can split the record in 4 (one each for male, female, adult and juvenile) and ditch the individualCount, because it does not apply to any one of them. You can split the record in 4 but keep the individualCount in all records. You can split the record in 3 (male, female and juvenile), guessing that the juveniles form a group of their own, etc.
Only the publisher could do it properly, but it rarely happens because data get lost, publishers do not have time to review, etc.

So we raised that question in the DwC QA (https://github.com/tdwg/dwc-qa/issues/192), and the answers focused on the mixed vocabulary. So I think that in those cases where you cannot split the record properly, you should use mixed and keep just one record, and the information about sex and lifeStage documented originally could be put in organismRemarks for further use. This is an alternative when publishers refuse to split their records.
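
A rough Python sketch of that fallback: keep one record, set sex and lifeStage to mixed where several values were packed in, and move the original values to organismRemarks. The function name collapse_to_mixed and the "term: values" layout used in organismRemarks are assumptions for illustration only.

```python
from typing import Dict


def collapse_to_mixed(record: Dict[str, object]) -> Dict[str, object]:
    """Replace multi-valued sex/lifeStage with 'mixed', keeping the originals."""
    out = dict(record)
    notes = []
    for term in ("sex", "lifeStage"):
        values = [v.strip() for v in str(record.get(term, "")).split("|") if v.strip()]
        if len(values) > 1:
            out[term] = "mixed"
            # Hypothetical remark layout: "<term>: value | value"
            notes.append(f"{term}: {' | '.join(values)}")
    if notes:
        existing = str(record.get("organismRemarks", ""))
        out["organismRemarks"] = "; ".join(filter(None, [existing] + notes))
    return out


lot = {
    "occurrenceID": "ddd1a",
    "catalogNumber": "SM1234",
    "individualCount": 6,
    "sex": "male | female",
    "lifeStage": "adult | juvenile",
}
print(collapse_to_mixed(lot))
```

The individualCount is left untouched, since it still describes the lot as a whole.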

For the option Split the occurrence and also summarise it (nested occurrences), we sometimes try to “nest” the information, but using a different approach. We use an indirect approach, keeping catalogNumber the same in all records, sometimes keeping recordNumber, or we could use associatedOrganisms to put something like encountered with: ddd1b.

Regarding the discussion of DwC terms like parentOccurrenceID, we are in line with Richard Pyle: you can split the records and avoid duplication, and in those use cases where nested information is required, the Resource Relationship extension can be used.
Also, on a more informal note, these parent terms are really hard for the majority of publishers. People could get really confused with this one, so it is easier to use an already established extension.
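
For readers unfamiliar with that extension, here is a small Python sketch of how split records from one lot might be linked to each other. The terms resourceID, relatedResourceID and relationshipOfResource come from Darwin Core; the function name relate_siblings and the relationship text "same lot as" are placeholders invented for the example, not recommended values.

```python
from itertools import combinations
from typing import Dict, List


def relate_siblings(occurrence_ids: List[str],
                    relationship: str = "same lot as") -> List[Dict[str, str]]:
    """Build one relationship row per pair of occurrences from the same lot."""
    return [
        {
            "resourceID": a,
            "relatedResourceID": b,
            "relationshipOfResource": relationship,  # placeholder wording
        }
        for a, b in combinations(occurrence_ids, 2)
    ]


rows = relate_siblings(["ddd1a-1", "ddd1a-2", "ddd1a-3"])
for row in rows:
    print("\t".join(row.values()))
```

Three split records give three rows here. Whether each pair should be listed once or in both directions is a choice left to the publisher.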

The option Pack the information into an organismRemarks field is less than ideal, because you “lose” the information: most data users will not go to that field to retrieve sex/lifeStage information. If people need sex information they will probably just use sex and that is all. So we prefer to split the record rather than this.

To summarize, we think splitting a record is the best approach. In cases where you can’t do it properly, keep one record and use the mixed vocabulary, keeping the original information in organismRemarks. It is a trade-off, but it is not that bad.

cecsve · March 17, 2023, 1:46pm

Thank you for elaborating on the difficulties with capturing lifeStage and Sex @EstebanMH-SiB!

When the Sex vocabulary is implemented and the lifeStage vocabulary is updated, the concept mixed will be included for both. You could add the more detailed information in occurrenceRemarks if you go through the effort of fixing the data at source, but the verbatim value would be included in the GBIF record if you do not clean the data prior to publication to GBIF, for example:

This means that both the standardized (interpreted) value and the verbatim (original) value would be available for data users, if they download the full DwCA and not just the simple format.

system · April 16, 2023, 11:46pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
