Health Data Publishing (GBIF technical support hour for Nodes)

This January, we celebrate one year of Technical Support Hour for Nodes! In the first session of 2024, Paloma Shimabukuro will give you an overview of what is currently available for publishing health data on GBIF. Paloma is the GBIF contractor behind the GBIF health Helpdesk (health@gbif.org) and is working on mobilizing vector-borne disease data. The Data Product team will join as well.

We will be happy to answer any questions, whether or not they relate to the topic.
Please feel free to post questions in advance on the discourse forum or write to helpdesk@gbif.org.

The event will be on the 3rd of January 2024 at 4pm CET (UTC+1). The invitation with registration link will be sent to the GBIF Nodes. If you are interested in attending, you can reach out to your local node.

The edited recording and the transcript of the questions will be made available here.

The video is available here: Health data publishing on Vimeo

Here is the transcript of the questions during the session.

Should we highlight the datasets that are relevant for a particular theme? For example, we have a lot of data about disease vectors, venomous species, etc. but they aren’t highlighted in any specific way. How can we make it easier for users to find those datasets? Should we use specific keywords in the metadata, for example?

We don’t currently have specific guidelines on how to improve discoverability of health-related datasets. The first step would be to provide metadata as complete as possible. For example, as a medical entomologist interested in health-related data, I (Paloma) will look for datasets with the words “infectious diseases” or “parasite”, which are specific and not commonly used in data repositories.
We will be working on a list of keywords that can be used to tag and to search for relevant datasets, terms like “surveillance” or “parasites”. We will communicate our recommendations as soon as possible.

On a related topic: we are trying to segment GBIF data thematically by relevance to different communities, so that we can tell you, for example, which part of GBIF is health-related. Right now there is no clear way to do this. We worked on some criteria to help identify health-relevant data: a combination of taxon filter and publisher identifiers. We should consider using tags provided by publishers as well. This could allow us to create thematic reports in the future.
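
As a rough illustration of combining a taxon filter with publisher identifiers, the sketch below builds an occurrence search URL for the GBIF API using its `taxonKey` and `publishingOrg` parameters. The taxon key and publisher UUID are placeholders, and the helper `build_search_url` is hypothetical; this is not the set of criteria GBIF actually uses.

```python
from urllib.parse import urlencode

GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_search_url(taxon_keys, publisher_keys, limit=20):
    """Build a search URL from lists of taxon keys and publisher UUIDs."""
    # Each value becomes its own query parameter.
    params = [("taxonKey", key) for key in taxon_keys]
    params += [("publishingOrg", key) for key in publisher_keys]
    params.append(("limit", limit))
    return f"{GBIF_OCCURRENCE_SEARCH}?{urlencode(params)}"


# Placeholder values only: replace with real taxon keys and publisher UUIDs.
example_url = build_search_url(
    taxon_keys=[1234567],
    publisher_keys=["00000000-0000-0000-0000-000000000000"],
)
print(example_url)
```

The function only builds the URL and makes no network request, so it can be checked before anything is sent to the API.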

Looking at the examples presented, much of the data shown could contribute to species trait information. How could we make this information available at the species level? Could this be integrated in the new data model?

This isn’t something that we have been working on so far. It is possible to aggregate information from occurrences on GBIF species pages (for example, the geolocated occurrences are displayed on maps and the occurrences of type specimens are emphasised). However, this requires the data to be standardised in a specific way. Right now, this would be difficult to achieve with health-related data. For example, there are several ways to model host-parasite relationships (with or without an extension), so it would be difficult to extract the information from occurrences automatically.
This could possibly be different with the outcome of the work on the new data model (particularly the work on biotic interactions).

We have a vector dataset and working with the resourceRelationship extension is quite challenging. I see that you are using the dwc:associatedTaxa (https://dwc.tdwg.org/terms/#dwc:associatedTaxa) field in your examples. Which is best: using the resourceRelationship extension or the associatedTaxa field?

The answer depends on the complexity of the host data you have. For example, if you have just the host species name, you can simply use the associatedTaxa field. If you have more complex information, you should consider the dwc:dynamicProperties field or an extension. It really depends on your data.
Note that currently, most extensions aren’t available in the download formats generated in the occurrence download interface.
If each host and parasite has an occurrence (with a relationship extension), we would encourage you to put them all in the same dataset so users can download everything together.
In any case, don’t hesitate to contact health@gbif.org; we can help you map your data.
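
To make the two simpler options concrete, here is a minimal sketch of how a single parasite occurrence might carry host information. It follows the Darwin Core recommendations of `"relationship":"taxon"` pairs separated by ` | ` for associatedTaxa and JSON for dynamicProperties. The record values and the dynamicProperties keys (`hostLifeStage`, `parasiteCount`) are made up for illustration.

```python
import json


def format_associated_taxa(pairs):
    """Join (relationship, taxon name) pairs in the key:value style."""
    return " | ".join(f'"{rel}":"{name}"' for rel, name in pairs)


def format_dynamic_properties(props):
    """Serialise extra host details as JSON."""
    return json.dumps(props, sort_keys=True)


# Hypothetical occurrence of a parasite with its host recorded inline.
record = {
    "occurrenceID": "example:occ:1",
    "scientificName": "Gyrodactylus salaris",
    # Simple case: only the host name is known.
    "associatedTaxa": format_associated_taxa([("host", "Salmo salar")]),
    # Richer case: extra details that have no dedicated term here.
    "dynamicProperties": format_dynamic_properties(
        {"hostLifeStage": "juvenile", "parasiteCount": 12}
    ),
}

for term, value in record.items():
    print(f"{term}: {value}")
```

Anything more structured than this, for example when the host is itself an occurrence with its own identifier, is where an extension becomes the better fit.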

We usually advise publishers to publish parasites and hosts as separate occurrences but it is a lot of work for them. It would be easier to publish only hosts or only parasites as occurrences and have the other species mentioned in the associatedTaxa field. What would be best?

Ideally, publishers should share as much as possible, as it is valuable when it comes to the “one health” approach. Right now, on GBIF.org, there is no way to search occurrences by value in the associatedTaxa field. This means, for example, that if you published only occurrences for parasites, there is no way for users to find those occurrences by looking for the name of the host. If you want both hosts and parasites to be discoverable, both have to be published as occurrences. This could perhaps change with the new data model, but we don’t know yet what will be possible.
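
Until such a search exists, one workaround is to filter downloaded data locally. The sketch below assumes you already have a tab-delimited occurrence table that includes an associatedTaxa column (for example, taken from a Darwin Core Archive). The sample rows and the helper `find_by_associated_taxon` are made up for illustration.

```python
import csv
import io

# Made-up sample standing in for a downloaded tab-delimited occurrence file.
SAMPLE = "\n".join([
    "occurrenceID\tscientificName\tassociatedTaxa",
    'occ-1\tGyrodactylus salaris\t"host":"Salmo salar"',
    'occ-2\tGyrodactylus derjavini\t"host":"Salmo trutta"',
    "occ-3\tSalmo salar\t",
])


def find_by_associated_taxon(rows, name):
    """Return rows whose associatedTaxa value mentions the given name."""
    needle = name.casefold()
    return [
        row for row in rows
        if needle in (row.get("associatedTaxa") or "").casefold()
    ]


# QUOTE_NONE keeps the double quotes inside associatedTaxa as plain text.
reader = csv.DictReader(
    io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE
)
rows = list(reader)

hits = find_by_associated_taxon(rows, "Salmo salar")
for hit in hits:
    print(hit["occurrenceID"], hit["scientificName"])
```

This is a plain substring match, so it only finds hosts spelled exactly as the publisher wrote them; it does no taxonomic name matching.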

Would it be worthwhile to expose these health data in GloBI? Do we know whether this DwC-A format allows for ingestion into GloBI (https://www.globalbioticinteractions.org/)?

This forum thread mentions GloBI being able to ingest data from DwC-A: Field Museum and iNaturalist Extending Specimen through DwC Resource Relationships - #9 by jhpoelen. Several steps seem to be needed for this to happen, so it isn’t direct. Please check the GloBI documentation.
It would make sense to ensure that the interaction datasets published on GBIF are also compatible with GloBI, especially in the context of the new data model (having a standard that works on both platforms).

From Norway we also have health data for organisms other than humans, like Gyrodactylus on salmon, and we are interested in learning best practices for exposing them (https://doi.org/10.15468/rcouob).

We don’t currently have specific recommendations for non-human hosts. Having concrete examples at hand will be very helpful for developing best-practice documents, thank you.

We are thinking of doing one or two webinars with our publishers on best practices for publishing health data. Can we reuse the material provided by GBIF during training workshops? Do I need explicit permission?

It depends on the material concerned. Most GBIF training materials and guides are published with a license, so you should check the licenses associated with the documents that you would like to use. For example, the license for the GBIF Data Mobilization course is available here and the license for the DNA-derived data publishing guide is available here. If in doubt, you are welcome to email us and we can help you find the rightful owner. If you would like to advertise your webinar on GBIF, you can use this form to create an event page: Suggest an event for the GBIF.org calendar

Great to see the interest in learning more about the wealth of knowledge already captured in GBIF registered datasets that describe how organisms interact.

Happy to help facilitate getting health data indexed by GloBI or elsewhere. There’s a wide variety of DwC-A-based datasets already being indexed (see e.g., parasite tracker), and best practices have been published on how to use DwC-A to embed your valuable association data (Sullivan et al. 2020). Also, note that Salim et al. 2022 have shown that GloBI tools like Elton, Preston and Nomer can be used independently to index all of GBIF - José did the analysis independently on non-GloBI infrastructure using GloBI tools and a versioned copy of data registered with GBIF (Poelen 2023).

So, if folks are serious about indexing all the health data (or other interactions) . . . the technical pieces are in place, just need some hardware, time and a little budget to make this work easier to access. Or. . . perhaps even better, you can experiment and build your own health data search index!

Hope this helps and curious to hear thoughts on helping to make health data easier to access,
-jorrit

PS For the specific example re: Bachmann L (2021). Artsprosjektet - Gyrodactylus. Version 1.10. University of Oslo. Occurrence dataset Artsprosjektet - Gyrodactylus - I’ve added some notes at support for occurrenceRemarks style "Collected from ..." · Issue #956 · globalbioticinteractions/globalbioticinteractions · GitHub.

References

Kathryn Sullivan, Katja Seltmann, Jorrit Poelen, & Jennifer M. Zaspel. (2020, May). Making Parasite-Host Associations Visible in Terrestrial Parasite Tracker (TPT) (Version 0.0.1). Zenodo. http://doi.org/10.5281/zenodo.3780543

Salim JA, Seltmann KC, Poelen JH, Saraiva AM (2022) Indexing Biotic Interactions in GBIF data. Biodiversity Information Science and Standards 6: e93565.

Poelen, J. H. (2023). A biodiversity dataset graph: GBIF, iDigBio, BioCASe hash://sha256/450deb8ed9092ac9b2f0f31d3dcf4e2b9be003c460df63dd6463d252bff37b55 hash://md5/898a9c02bedccaea5434ee4c6d64b7a2 (0.0.4) [Data set]. Zenodo.

Please note that I’ve added an index configuration for:

Bachmann L (2021). Artsprosjektet - Gyrodactylus. Version 1.10. University of Oslo. Occurrence dataset Artsprosjektet - Gyrodactylus

after adding support for your particular interaction type annotation using the occurrenceRemarks with the phrase “Collected from …”

For context, see support for occurrenceRemarks style "Collected from ..." · Issue #956 · globalbioticinteractions/globalbioticinteractions · GitHub.

Less work is needed if more generally used notations are used, but custom annotations can be supported too. It just takes a little longer to add the functionality, and folks may overlook the wealth of information this adds when interpreting such a dataset.
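
To give a feel for what supporting a custom annotation involves, here is a minimal sketch of pulling a host name out of an occurrenceRemarks value that uses the “Collected from …” phrasing. This is not GloBI's actual implementation; the regular expression, the helper `host_from_remarks` and the sample remarks are assumptions for illustration.

```python
import re

# Capture the text after "Collected from" up to the next ., ; or ,
COLLECTED_FROM = re.compile(
    r"collected from\s+(?P<host>[^.;,]+)", re.IGNORECASE
)


def host_from_remarks(remarks):
    """Return the host named after 'Collected from', or None if absent."""
    match = COLLECTED_FROM.search(remarks or "")
    return match.group("host").strip() if match else None


# Made-up remarks in the style discussed above.
samples = [
    "Collected from Salmo salar.",
    "collected from Salmo trutta; fins examined",
    "No host recorded",
]
for text in samples:
    print(repr(text), "->", host_from_remarks(text))
```

A free-text pattern like this is fragile compared with a dedicated field such as associatedTaxa, which is part of why more widely used notations need less work.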

I’ve attached a data review pdf as automatically generated by Nomer and Elton, two naive data review bots. More details are available if folks are interested.

zmo-gyrodactylus-review-2024-03-22.pdf (384.1 KB)