Millipedes in the ocean

[Image: a New Zealand millipede in the genus Siphonophora, from iNaturalist]

The image above (from iNaturalist) shows a New Zealand millipede species in the genus Siphonophora. Species of Siphonophora, like all millipedes, are terrestrial animals. But most of the GBIF records for Siphonophora are in the ocean (see below; screenshot 2023-06-20). Why?

[Map: GBIF occurrence records for Siphonophora, screenshot 2023-06-20]

The core of the problem is that Siphonophora is a homonym — it has been used for (at least) two completely different taxa. One of these taxa is the millipede genus Siphonophora Brandt, 1837. Another is the jellyfish order Siphonophora Eschscholtz, 1829. However, the original spelling of the jellyfish order was Siphonophorae, and this is the currently accepted spelling.

Nevertheless, some marine biologists incorrectly use the spelling Siphonophora when recording unidentified jellyfish of this kind. As a result, Siphonophora appears in the scientificName field in Darwin Core records. If these records also had phylum “Cnidaria”, GBIF would (probably) recognise that the records were for jellyfish, because the GBIF backbone taxonomy now lists Siphonophora as a synonym of Siphonophorae.

But if the only taxonomic clue in the Darwin Core record is Siphonophora in scientificName, then GBIF processing defaults to treating Siphonophora as an accepted name, namely that of the millipede genus. GBIF then fills the empty taxonomic fields with phylum Arthropoda, class Diplopoda, order Siphonophorida, family Siphonophoridae.
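For anyone who wants to see the effect of a higher-taxon hint for themselves, here is a minimal Python sketch that builds two requests to GBIF's public species-match endpoint: one with the bare name and one with phylum "Cnidaria" added. It only builds the URLs. The helper name `match_url` is my own, and I'm assuming the public endpoint is a fair stand-in for what occurrence processing does internally, which the post does not confirm.

```python
from urllib.parse import urlencode

# GBIF's public name-matching endpoint.
MATCH_ENDPOINT = "https://api.gbif.org/v1/species/match"


def match_url(name, **higher_taxa):
    """Build a species-match URL for a name plus optional higher-taxon hints.

    `match_url` is a hypothetical helper, not part of any GBIF client library.
    """
    params = {"name": name, "verbose": "true"}
    params.update(higher_taxa)
    return f"{MATCH_ENDPOINT}?{urlencode(params)}"


# Only clue is the name string, as in the "clueless" records.
bare = match_url("Siphonophora")
# Same name, but with the phylum filled in by the publisher.
hinted = match_url("Siphonophora", phylum="Cnidaria")

print(bare)
print(hinted)
```

Opening the two URLs in a browser and comparing the responses should show whether the hint changes the match.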

And that’s what’s happened. I downloaded the 11,164 GBIF occurrences matching “Siphonophora” (DOI; retrieved 2023-06-20). Of these, 401 records had “millipede clues” in scientificName or a higher taxon entry. The other 10,763 records were “clueless” and were processed by GBIF as millipedes, including eight records that had “Cnidaria” in the wrong field and nine records with Siphonophora in the genus field:

table1

Of those 10,763 millipede records, 8340 had coordinates. Using a low-resolution coastline shapefile of the world in my GIS software, I found that all 8340 millipede records were in the ocean. The 8340 records were shared with GBIF by 10 publishers:
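The check was done in GIS software, but the same idea can be sketched in a few lines of Python for readers without a GIS. This is not the method used above: it is a plain ray-casting point-in-polygon test, and the "land" polygon is a made-up rectangle, not a real coastline. It also ignores polygon holes, the antimeridian and map projection, so treat it as an illustration only.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the ring of (lon, lat) vertices?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray running east from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def is_on_land(lon, lat, land_rings):
    """True if the point falls inside any of the land rings."""
    return any(point_in_ring(lon, lat, ring) for ring in land_rings)


# Hypothetical toy "land" rectangle; a real check would load coastline polygons.
TOY_LAND = [[(144.5, -43.7), (148.5, -43.7), (148.5, -40.6), (144.5, -40.6)]]

# Hypothetical records, for illustration only.
records = [
    {"id": "a", "lon": 146.5, "lat": -42.0},  # inside the toy rectangle
    {"id": "b", "lon": 160.0, "lat": -42.0},  # well outside it
]

at_sea = [r["id"] for r in records if not is_on_land(r["lon"], r["lat"], TOY_LAND)]
print(at_sea)
```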

table2

I emailed the administrative or technical contacts at each of the 10 publishers, saying that their jellyfish had become millipedes and explaining why. Several responded very quickly and said the problem should be fixed in time for the next GBIF processing. As of 2023-06-22, I haven’t heard back from the MBA, which left higherClassification, phylum, class, taxonRank and vernacularName blank in its 7620 “Siphonophora” records.

What I did was an example of “round-tripping” with direct email contact between data publisher and data user. It was also a test case for the idea that “round-tripping” can be done with record annotations. It wasn’t much work to write 10 emails, but it would be a lot of work to write “This is a jellyfish, not a millipede” in the annotation box for 8340 individual records.

GBIF doesn’t email publishers when it finds data problems, and my emails only resolved the jellyfish/millipede confusion (I hope) in currently indexed records. What about the future?

The problem was first pointed out to GBIF staff in October 2022. It was discussed by staff at the time as issue 4361 on the GBIF GitHub portal. Several solutions were proposed:

  • Make name-matching “environment aware”, so that (for example) terrestrial species don’t get mapped in the sea. Land vs sea would need to be based on coordinates, however, because clues to habitat are often missing in Darwin Core records. Only a few of the no-coordinates “millipede” records had any sea- or land-related text in their place fields, and none of them had a samplingProtocol entry.

  • “What if we had negative dataset wide configurations for taxonomic coverages? If we could declare that this dataset never contains any Diplopoda? The name matching could then receive some new exclusion filter parameter that would allow to snap to the right Siphonophora. Such a config would likely help in a lot of cases when we receive bad matching reports and should not be terribly difficult to implement”. This would work at the dataset level, but would depend on getting “bad matching” reports. I’ve seen datasets in GBIF where impeccably correct entries have gone into taxon fields but are nevertheless wrong for the declared habitat.

  • Establish a new flag: "If we could keep track of cases where existing homonymy has a potential of causing particularly bad problems in occurrence record interpretation (cross kingdom or cross other higher groups), and where occurrences come with insufficient (<- to be defined…) higher taxon information, should we raise that to some issue flag of “concern”?" Good idea, but given that data publishers largely ignore flags, this solution is mainly of value to data users.

  • "One alternative I can think of is the ability to inject custom interpretation. Imagine a machine tag allowing the registration of a function (e.g. JS using nashorn): if (occurrence.scientificName === 'Siphonophora' && occurrence.phylum === null) { occurrence.phylum = 'Cnidaria' } This would open the door to selectively fixing data and is something we’ve pondered before but discounted as a step too far and could be dangerous."
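To make the second proposal in the list above (negative taxonomic coverage) a bit more concrete, here is a rough Python sketch of how an exclusion filter might pick between homonyms. Everything in it is hypothetical: the candidate table, the `resolve` function and the `excluded_taxa` parameter are mine, not GBIF's. Note also that the sketch returns nothing when the name stays ambiguous, whereas current GBIF processing defaults to the millipede genus.

```python
# Hypothetical candidate table for one homonym. The lineages use only the
# higher taxa mentioned in the post above.
CANDIDATES = {
    "Siphonophora": [
        {"accepted": "Siphonophora Brandt, 1837",
         "lineage": {"Arthropoda", "Diplopoda"}},
        {"accepted": "Siphonophorae",
         "lineage": {"Cnidaria"}},
    ]
}


def resolve(name, excluded_taxa=frozenset()):
    """Return the one candidate left after dropping excluded lineages, else None."""
    options = [
        c for c in CANDIDATES.get(name, [])
        if not (c["lineage"] & set(excluded_taxa))
    ]
    return options[0]["accepted"] if len(options) == 1 else None


# No dataset-level declaration: the homonym stays ambiguous.
print(resolve("Siphonophora"))
# Dataset declares "never contains any Diplopoda": only the cnidarian is left.
print(resolve("Siphonophora", excluded_taxa={"Diplopoda"}))
```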

I don’t know how far GBIF staff have gotten with these or other suggestions since last October, but as of 2023-06-22, the “millipedes” are still in the ocean.


Robert Mesibov (“datafixer”); robert.mesibov@gmail.com

Update 2023-02-24: MBA (UK) is planning to update the dataset in GBIF.

I still wonder if we’ll ever be able to manage an “is marine” flag. I get that it’s more nuanced than that. We’d need some sort of ontology (likely already exists?).

Habitat is:

  • marine
  • terrestrial
  • brackish / estuarine

Of course, some species could fit into more than one category. Thinking out loud, I’m wondering if there’s some possible field in any standard (the likely soon-to-be Latimer Core standard from TDWG) where one could declare the dataset as “marine” or “not terrestrial”? @sgrant @mswoodburn ?

Perhaps there’s a “Marine Region” that could be declared for a given specimen? And that field, if true or Not NULL, would ensure objects find themselves in the virtual habitat they expect to be found in?

Maybe some clues in here:
Leadbetter, A., Meaney, W., Tray, E. et al. A modular approach to cataloguing marine science data. Earth Sci Inform 13, 537–553 (2020).

Otherwise, as you did @datafixer, it’s reaching out to others to encourage them to fix their data if they can. Sure will help their data use / discoverability stats!

@debpaul. A habitat check was implemented by the Atlas of Living Australia: “Habitat incorrect for species”. I don’t know if it’s still in place as an ALA processing check, but “Habitat incorrect for species” is apparently still something a user can select from a drop-down list for annotating a single record. I’m sure you can see why I emphasised “single”.

The Marine Institute (Ireland) paper you linked is one of several talking about MI data management. This paper is interesting because it draws a line between Data Owners and Data Stewards. The job of the latter: “The Data Steward is involved with a dataset on a daily basis, and as such is responsible for many of the day-to-day activities around a dataset including the quality of the data; ensuring its safe archival and storage; and providing the required metadata and documentation around the dataset. Due to the technical scientific nature of their work within the organisational context, the Data Stewards will often blend aspects of the Business Data Steward and Technical Data Steward roles identified in Plotkin (2013). Therefore, as domain scientific experts they will understand the business needs fulfilled by the data they collect and curate but also will often have technical knowledge of database operations and numerical computing in scripting language environments.”

The chances of getting a Data Steward appointment in an institution are slim, because in the institutions large enough to have them, Data Managers are appointed to cover all scientific disciplines and are unlikely to be “domain scientific experts”.

Thank you for the post. Of the 11,164 records, 7745 (69%) have scientificNameID populated with a WoRMS LSID.

count scientificNameID
1 urn:lsid:Marinespecies.org:taxname:892593
7632 urn:lsid:marinespecies.org:taxname:1371
32 urn:lsid:marinespecies.org:taxname:254409
80 urn:lsid:marinespecies.org:taxname:892593

urn:lsid:marinespecies.org:taxname:892593 is the actual millipede genus.
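A tally like the one above can be reproduced with a few lines of Python. The sketch below is only an illustration: the rows are rebuilt from the counts in the table rather than read from the actual download, one blank row is added to show how empty values are skipped, and lower-casing the identifiers assumes that the "Marinespecies.org" variant is the same identifier as the lower-case one.

```python
from collections import Counter


def tally_name_ids(rows):
    """Count scientificNameID values, ignoring case, whitespace and blanks."""
    cleaned = ((row.get("scientificNameID") or "").strip().lower() for row in rows)
    return Counter(value for value in cleaned if value)


# Rows rebuilt from the table above (not the real download), plus one blank row.
rows = (
    [{"scientificNameID": "urn:lsid:Marinespecies.org:taxname:892593"}]
    + [{"scientificNameID": "urn:lsid:marinespecies.org:taxname:892593"}] * 80
    + [{"scientificNameID": "urn:lsid:marinespecies.org:taxname:1371"}] * 7632
    + [{"scientificNameID": "urn:lsid:marinespecies.org:taxname:254409"}] * 32
    + [{"scientificNameID": None}]
)

counts = tally_name_ids(rows)
for name_id, n in counts.most_common():
    print(n, name_id)
```

With case folded, the two spellings of the 892593 identifier are counted together.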

I think that interpreting identifier fields like scientificNameID, taxonID etc. could improve taxon interpretation in GBIF. In my humble opinion, interpreting fields that hold LSIDs or other identifiers is better than matching a text string alone.

Furthermore, the GBIF API can already be used to find the entry in the GBIF taxonomic backbone that corresponds to a WoRMS LSID. Take the following for example:

WoRMS record: urn:lsid:marinespecies.org:taxname:1371
WoRMS record in GBIF API: https://api.gbif.org/v1/species?datasetKey=2d59e5db-57ad-41ff-97d6-11f5fb264527&sourceId=urn:lsid:marinespecies.org:taxname:1371

The parameter datasetKey=2d59e5db-57ad-41ff-97d6-11f5fb264527 is used here because WoRMS is a dataset in GBIF.

Appending the nubKey from the API response (6180736) to the GBIF species URL brings us to Siphonophorae (https://api.gbif.org/v1/species/6180736), which is the corresponding entry in the GBIF taxonomic backbone. I believe the match will be correct this way (for 69% of the records in this download already!).
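The lookup described above could be scripted roughly as follows. This is a sketch using only the Python standard library. The endpoint, the datasetKey and the sourceId parameter come from the URL in this post; the function names are mine, and I'm assuming the response is a JSON object with a "results" list whose items may carry a "nubKey". Nothing is fetched until `fetch_nub_keys` is called.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_SPECIES = "https://api.gbif.org/v1/species"
# datasetKey for WoRMS in GBIF, copied from the URL in the post above.
WORMS_DATASET_KEY = "2d59e5db-57ad-41ff-97d6-11f5fb264527"


def lookup_url(lsid):
    """Build the species-list URL for one WoRMS LSID (colons get percent-encoded)."""
    query = urlencode({"datasetKey": WORMS_DATASET_KEY, "sourceId": lsid})
    return f"{GBIF_SPECIES}?{query}"


def nub_keys(payload):
    """Collect distinct nubKey values from a parsed response.

    Assumes the payload has a 'results' list; items without a nubKey are skipped.
    """
    keys = []
    for item in payload.get("results", []):
        key = item.get("nubKey")
        if key is not None and key not in keys:
            keys.append(key)
    return keys


def fetch_nub_keys(lsid, timeout=30):
    """Call the GBIF API (network access needed) and return backbone keys."""
    with urlopen(lookup_url(lsid), timeout=timeout) as response:
        return nub_keys(json.load(response))


print(lookup_url("urn:lsid:marinespecies.org:taxname:1371"))
```

Calling `fetch_nub_keys("urn:lsid:marinespecies.org:taxname:1371")` should then give the backbone key to append to https://api.gbif.org/v1/species/.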

I think this would reduce the burden on data stewards, encourage more data publishers to use persistent identifiers and reduce errors in GBIF taxon matching. It could be a win-win-win solution :smiley:

@ymgan, that’s certainly something for GBIF to consider.

Note that it relies on scientificNameID being a text string that specifies the source of the identifier, in this case WoRMS. There are other available identifiers for “Siphonophorae”: see its Wikidata entry.

GBIF processing would also need one-to-many protocols (because there are so many ID-code sources) to check scientificName against scientificNameID and ensure there’s no disagreement.

I don’t think taxonID would be useful in disambiguation, because most of the datasets I’ve worked with use taxonID just to distinguish a taxon within that dataset alone, e.g. taxonID 1 > 1001 for a 1001-taxon dataset.

@ymgan, sorry, I should have explained “disagreement”. Suppose (as in this case)

scientificName = Siphonophora
scientificNameID = urn:lsid:marinespecies.org:taxname:1371

This is a disagreement. What should GBIF processing do? I would suggest processing as “incertae sedis” and flagging a name disagreement.

That would eliminate all the false positives in a search for the millipede name “Siphonophora”.
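Here is a minimal Python sketch of the check I have in mind. It is not GBIF code. The lookup table holds only the two WoRMS identifiers discussed in this thread, the flag name NAME_ID_DISAGREEMENT is invented, and the name comparison is naive (it would need to cope with authorships and synonyms in real data).

```python
# Hypothetical lookup from WoRMS LSID to name; both entries come from this thread.
NAME_FOR_ID = {
    "urn:lsid:marinespecies.org:taxname:1371": "Siphonophorae",
    "urn:lsid:marinespecies.org:taxname:892593": "Siphonophora",
}


def interpret(record):
    """Return (interpreted_name, flags) for one occurrence record."""
    name = (record.get("scientificName") or "").strip()
    name_id = (record.get("scientificNameID") or "").strip().lower()
    expected = NAME_FOR_ID.get(name_id)
    if expected is None:
        # No usable identifier: fall back to the name string alone.
        return name, []
    if expected.lower() != name.lower():
        # Name and identifier point to different taxa: do not guess.
        return "incertae sedis", ["NAME_ID_DISAGREEMENT"]  # invented flag name
    return expected, []


jelly = {
    "scientificName": "Siphonophora",
    "scientificNameID": "urn:lsid:marinespecies.org:taxname:1371",
}
millipede = {
    "scientificName": "Siphonophora",
    "scientificNameID": "urn:lsid:marinespecies.org:taxname:892593",
}

print(interpret(jelly))
print(interpret(millipede))
```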

However, it puts the “please fix this” burden back on the data publisher. Since the publishers listed above did not notice (apparently) that their jellyfish had become millipedes, would you expect them to notice that their jellyfish had become “incertae sedis”?

Yes actually I do think this is easier to spot. The fact that a jelly became a millipede is nearly impossible to check for when reviewing a GBIF dataset page or the issues and flags. In other words it’s not immediately apparent and a data provider would have to purposefully go looking for this.

Finding the ones marked incertae sedis, on the other hand, is very simple. They show up on the main landing page of the dataset as “without taxon match” in the circle metrics under the description (clicking it takes you to the occurrences flagged “taxon match none”), and “taxon match none” is one of the issues and flags you can select for the dataset.

Hi all,

Comments in this thread refer to GBIF using scientificNameID.
Please be aware of this GitHub issue where we’re discussing the details of that. I’ve prepared a file summarising the impact of the change for occurrence records having WoRMS LSIDs, posted in this comment. More eyes on that will only help, noting that some of the examples in this thread appear in the file.

Thanks

A brief update on this:

Recent changes in GBIF.org now make use of the scientificNameID, taxonID, and taxonConceptID on records. We’ve initially configured GBIF.org to recognize records using World Register of Marine Species identifiers. This has the potential to improve sparsely populated records in particular, such as those with only a scientificName filled in, and especially for homonyms.

This change, combined with the removal of a duplicate dataset, has gone a long way to improve the situation of this millipede view. There are still a few publishers that we will approach with recommendations to correct their scientificNameID or their names.

“There are still a few publishers that we will approach with recommendations to correct their scientificNameID or their names.”

“We” being GBIF? Is this a change from the policy explained by @CecSve here?:

“No, we do not automatically notify the data publisher other than the automated warnings the publisher would see in the IPT, if they use an IPT to publish their data. As stated above, we only contact the data publisher directly if extra resources are provided in context of GBIF-handled publishing grants or if users or publishers contact us directly through helpdesk and ask for support.”

No change in policy. It’s not uncommon for a publisher (edited to add: or Node) to be pinged if someone notices something suspicious; the contact details are available on all datasets.
