The image above (from iNaturalist) shows a New Zealand millipede species in the genus Siphonophora. Species of Siphonophora, like all millipedes, are terrestrial animals. But most of the GBIF records for Siphonophora are in the ocean (see below; screenshot 2023-06-20). Why?
The core of the problem is that Siphonophora is a homonym — it has been used for (at least) two completely different taxa. One of these taxa is the millipede genus Siphonophora Brandt, 1837. Another is the jellyfish order Siphonophora Eschscholtz, 1829. However, the original spelling of the jellyfish order was Siphonophorae, and this is the currently accepted spelling.
Nevertheless, some marine biologists incorrectly use the spelling Siphonophora when recording unidentified jellyfish of this kind. As a result, Siphonophora appears in the scientificName field in Darwin Core records. If these records also had phylum “Cnidaria”, GBIF would (probably) recognise that the records were for jellyfish, because the GBIF backbone taxonomy now lists Siphonophora as a synonym of Siphonophorae.
But if the only taxonomic clue in the Darwin Core record is Siphonophora in scientificName, then GBIF processing defaults to treating Siphonophora as an accepted name, and specifically as the millipede genus. GBIF then fills the empty taxonomic fields with phylum Arthropoda, class Diplopoda, order Siphonophorida, family Siphonophoridae.
And that’s what’s happened. I downloaded the 11,164 GBIF occurrences matching “Siphonophora” (DOI; retrieved 2023-06-20). Of these, 401 records had “millipede clues” in scientificName or a higher taxon entry. The other 10,763 records were “clueless” and were processed by GBIF as millipedes, including eight records that had “Cnidaria” in the wrong field and nine records with Siphonophora in the genus field.
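The split into "clued" and "clueless" records can be sketched as a filter over the publisher-supplied (verbatim) fields. This is a minimal illustration only: the clue terms, the field list and the sample records are assumptions made here, not the criteria or data actually used for the counts above.

```python
# Hypothetical clue terms and field list (assumptions, not the author's criteria).
MILLIPEDE_CLUES = ("arthropoda", "diplopoda", "siphonophorida",
                   "siphonophoridae", "brandt")
TAXON_FIELDS = ("scientificName", "higherClassification", "kingdom",
                "phylum", "class", "order", "family")


def has_millipede_clue(record):
    """True if any publisher-supplied taxon field points to the millipede genus."""
    text = " ".join(str(record.get(field) or "") for field in TAXON_FIELDS).lower()
    return any(clue in text for clue in MILLIPEDE_CLUES)


# Invented sample records using Darwin Core field names.
records = [
    {"scientificName": "Siphonophora", "phylum": None},
    {"scientificName": "Siphonophora Brandt, 1837"},
    {"scientificName": "Siphonophora", "class": "Diplopoda"},
    # "Cnidaria" in a field that is not a taxon field (the field is hypothetical)
    {"scientificName": "Siphonophora", "taxonRemarks": "Cnidaria"},
]

clued = [r for r in records if has_millipede_clue(r)]
clueless = [r for r in records if not has_millipede_clue(r)]
print(len(clued), "with millipede clues;", len(clueless), "clueless")
```

The check has to run on the verbatim values, because the interpreted GBIF fields already contain the filled-in Arthropoda and Diplopoda entries.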
Of those 10,763 millipede records, 8340 had coordinates. Using a low-resolution coastline shapefile of the world in my GIS software, I found that all 8340 millipede records were in the ocean. The 8340 records were shared with GBIF by 10 publishers.
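For readers without GIS software, the land-or-sea check can be approximated with a plain ray-casting point-in-polygon test. The sketch below is not the workflow used for this post: the "coastline" is a toy rectangle roughly around Tasmania and the two occurrences are invented. A real check would load actual coastline polygons and would also need to handle holes, the antimeridian and near-shore points that a low-resolution coastline misplaces.

```python
def point_in_polygon(lon, lat, ring):
    """Ray casting: True if (lon, lat) lies inside the closed ring of (lon, lat) pairs."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a ray running east from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def is_on_land(lon, lat, land_rings):
    """True if the point falls inside any of the land polygons."""
    return any(point_in_polygon(lon, lat, ring) for ring in land_rings)


# Toy land polygon (assumption): a rough box around Tasmania, not real coastline data.
TOY_LAND = [[(144.5, -43.7), (148.5, -43.7), (148.5, -40.6), (144.5, -40.6)]]

# Invented occurrences using Darwin Core coordinate field names.
occurrences = [
    {"id": "a", "decimalLongitude": 146.5, "decimalLatitude": -42.0},
    {"id": "b", "decimalLongitude": -4.2, "decimalLatitude": 50.2},
]

at_sea = [
    o["id"] for o in occurrences
    if not is_on_land(o["decimalLongitude"], o["decimalLatitude"], TOY_LAND)
]
print(at_sea)
```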
I emailed the administrative or technical contacts at each of the 10 publishers, saying that their jellyfish had become millipedes and explaining why. Several responded very quickly and said the problem should be fixed in time for the next GBIF processing. As of 2023-06-22, I haven’t heard back from the MBA, which left higherClassification, phylum, class, taxonRank and vernacularName blank in its 7620 “Siphonophora” records.
What I did was an example of “round-tripping” with direct email contact between data publisher and data user. It was also a test case for the idea that “round-tripping” can be done with record annotations. It wasn’t much work to write 10 emails, but it would be a lot of work to write “This is a jellyfish, not a millipede” in the annotation box for 8340 individual records.
GBIF doesn’t email publishers when it finds data problems, and my emails only resolved the jellyfish/millipede confusion (I hope) in currently indexed records. What about the future?
The problem was first pointed out to GBIF staff in October 2022. It was discussed by staff at the time as issue 4361 on the GBIF GitHub portal. Several solutions were proposed:
- Make name-matching “environment aware”, so that (for example) terrestrial species don’t get mapped in the sea. Land vs sea would need to be based on coordinates, however, because clues to habitat are often missing in Darwin Core records. Only a few of the no-coordinates “millipede” records had any sea- or land-related text in their place fields, and none of them had a samplingProtocol entry.
- “What if we had negative dataset wide configurations for taxonomic coverages? If we could declare that this dataset never contains any Diplopoda? The name matching could then receive some new exclusion filter parameter that would allow to snap to the right Siphonophora. Such a config would likely help in a lot of cases when we receive bad matching reports and should not be terribly difficult to implement”. This would work at the dataset level, but would depend on getting “bad matching” reports. I’ve seen datasets in GBIF where impeccably correct entries have gone into taxon fields but are nevertheless wrong for the declared habitat.
- Establish a new flag: "If we could keep track of cases where existing homonymy has a potential of causing particularly bad problems in occurrence record interpretation (cross kingdom or cross other higher groups), and where occurrences come with insufficient (<- to be defined…) higher taxon information, should we raise that to some issue flag of “concern”?". Good idea, but given that data publishers largely ignore flags, this solution is mainly of value to data users.
- "One alternative I can think of is the ability to inject custom interpretation. Imagine a machine tag allowing the registration of a function (e.g. JS using nashorn): if (occurrence.scientificName === 'Siphonophora' && occurrence.phylum === null) { occurrence.phylum = 'Cnidaria' } This would open the door to selectively fixing data and is something we’ve pondered before but discounted as a step too far and could be dangerous."
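To make two of these proposals concrete, here is a minimal sketch of a name matcher that accepts a dataset-level exclusion list and raises a "concern" flag for cross-phylum homonyms. Everything in it is an assumption for illustration: the two-entry backbone, the `excluded_taxa` parameter and the flag names are invented here and are not part of GBIF's name-matching code or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    rank: str
    phylum: str
    class_: str
    status: str  # "accepted" or "synonym"


# Toy backbone (assumption) holding only the two homonyms discussed in this post.
BACKBONE = [
    Candidate("Siphonophora", "genus", "Arthropoda", "Diplopoda", "accepted"),
    Candidate("Siphonophora", "order", "Cnidaria", "Hydrozoa", "synonym"),
]


def match_name(name, phylum=None, excluded_taxa=()):
    """Return (best candidate or None, list of flags). Flag names are hypothetical."""
    flags = []
    candidates = [c for c in BACKBONE if c.name == name]

    # Negative taxonomic coverage: drop groups the dataset says it never contains.
    excluded = {taxon.lower() for taxon in excluded_taxa}
    candidates = [
        c for c in candidates
        if c.phylum.lower() not in excluded and c.class_.lower() not in excluded
    ]

    # Use the publisher-supplied phylum when there is one.
    if phylum:
        candidates = [c for c in candidates if c.phylum.lower() == phylum.lower()]

    if not candidates:
        return None, ["NO_MATCH"]

    # Homonyms from different phyla are still in play: flag the match as a concern.
    if len({c.phylum for c in candidates}) > 1:
        flags.append("HOMONYM_CONCERN")

    # Prefer accepted names, mimicking the default behaviour described in this post.
    candidates.sort(key=lambda c: c.status != "accepted")
    return candidates[0], flags


default_match, default_flags = match_name("Siphonophora")
marine_match, marine_flags = match_name("Siphonophora", excluded_taxa=["Diplopoda"])
print(default_match.phylum, default_flags)
print(marine_match.phylum, marine_flags)
```

With no extra information the toy matcher falls back to the millipede genus and flags the result; with Diplopoda excluded for the dataset, the same bare name resolves to the Cnidaria entry and no flag is raised.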
I don’t know how far GBIF staff have gotten with these or other suggestions since last October, but as of 2023-06-22, the “millipedes” are still in the ocean.
Robert Mesibov (“datafixer”); robert.mesibov@gmail.com