Health Data Publishing (GBIF technical support hour for Nodes)

mgrosjean · December 11, 2023, 10:10am

This January, we celebrate one year of Technical Support Hour for Nodes! In the first session of 2024, Paloma Shimabukuro will give you an overview of what is currently available for publishing health data on GBIF. Paloma is the GBIF contractor behind the GBIF health Helpdesk (health@gbif.org) and is working on mobilizing vector-borne disease data. The Data Product team will join as well.

We will be happy to answer any questions, whether or not they relate to the topic.
Please feel free to post questions in advance on the discourse forum or write to helpdesk@gbif.org.

The event will be on the 3rd of January 2024 at 4pm CET (UTC+1). The invitation with registration link will be sent to the GBIF Nodes. If you are interested in attending, you can reach out to your local node.

The edited recording and the transcript of the questions will be made available here.

mgrosjean · January 10, 2024, 2:24pm

The video is available here: Health data publishing on Vimeo

Here is the transcript of the questions during the session.

Should we highlight the datasets that are relevant for a particular theme? For example, we have a lot of data about disease vectors, venomous species, etc. but they aren’t highlighted in any specific way. How can we make it easier for users to find those datasets? Should we use specific keywords in the metadata, for example?

We don’t currently have specific guidelines on how to improve discoverability of health-related datasets. The first step would be to provide metadata as complete as possible. For example, as a medical entomologist interested in health-related data, I (Paloma) will look for datasets with the words “infectious diseases” or “parasite” which are specific and not commonly used in data repositories.
We will be working on a list of keywords that can be used to tag and to search for relevant datasets, terms like “surveillance” or “parasites”. We will communicate our recommendations as soon as possible.

On a related topic: we are trying to segment GBIF data thematically for different communities, so that we are able to tell you which part of GBIF is health-related. Right now there is no clear way to do this. We worked on some criteria to help identify health-relevant data: a combination of taxon filter and publisher identifiers. We should consider using tags provided by publishers as well. This could allow us to create thematic reports in the future.
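As a rough illustration of what combining a taxon filter with publisher identifiers could look like, here is a minimal sketch that builds a GBIF occurrence search URL. The `taxonKey` and `publishingOrg` parameters exist in the GBIF occurrence search API, but the key values below are placeholders and the function name is hypothetical. The actual criteria used by GBIF are not described in this thread, and they may combine the two filters differently.

```python
from urllib.parse import urlencode

GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_health_query(taxon_keys, publisher_keys, limit=20):
    """Build an occurrence search URL restricted by taxon and publisher.

    Repeating a parameter asks for any of its values; different
    parameters are combined, so a record must match both filters.
    """
    params = [("taxonKey", key) for key in taxon_keys]
    params += [("publishingOrg", key) for key in publisher_keys]
    params.append(("limit", limit))
    return f"{GBIF_OCCURRENCE_SEARCH}?{urlencode(params)}"


# Placeholder values (assumptions): replace with real GBIF taxon keys
# and publisher UUIDs before using the URL.
url = build_health_query(
    taxon_keys=[1234567],
    publisher_keys=["00000000-0000-0000-0000-000000000000"],
)
print(url)
```

The sketch only builds the URL, so it can be pasted into a browser or passed to any HTTP client.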

When looking at the examples presented, a lot of the data shown seems like they could contribute to species trait information. How could we make this information available at the species level? Could this be integrated in the new data model?

This isn’t something that we have been working on so far. It is possible to aggregate information from occurrences on GBIF species pages (for example, geolocated occurrences are displayed on maps and occurrences of type specimens are emphasised). However, this requires the data to be standardised in a specific way. Right now, this would be difficult to achieve with health-related data. For example, there are several ways to model host-parasite relationships (with or without an extension), so it would be difficult to extract the information from occurrences automatically.
This could possibly be different with the outcome of the work on the new data model (particularly the work on biotic interactions).

We have a vector dataset and working with the resourceRelationship extension is quite challenging. I see that you are using the dwc:associatedTaxa ( https://dwc.tdwg.org/terms/#dwc:associatedTaxa) field in your examples. What is best: using the resourceRelationship extension or the associatedTaxa field?

The answer depends on the complexity of the host data you have. For example, if you have just the host species name, you can simply use the associatedTaxa field. If you have more complex information, you should consider the dwc:dynamicProperties field or an extension.
Note that currently, most extensions aren’t available in the download formats generated in the occurrence download interface.
If each host and parasite have an occurrence (with a relationship extension), we would encourage you to put them all in the same dataset so users can download everything together.
In any case, don’t hesitate to contact health@gbif.org, we can help you map your data.
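To make the two simpler options concrete, here is a minimal sketch of what the field values could look like. It follows the Darwin Core recommendations of key:value pairs for associatedTaxa and JSON for dynamicProperties. The helper names, the identifiers, the species pair and the keys inside dynamicProperties (`hostLifeStage`, `hostBodyPart`) are illustrative assumptions, not a GBIF recommendation.

```python
import json


def simple_host_record(occurrence_id, parasite_name, host_name):
    """Parasite occurrence where only the host name is known."""
    return {
        "occurrenceID": occurrence_id,
        "scientificName": parasite_name,
        # Darwin Core suggests "relationship":"taxon name" pairs here.
        "associatedTaxa": f'"host":"{host_name}"',
        "dynamicProperties": "",
    }


def detailed_host_record(occurrence_id, parasite_name, host_name, host_details):
    """Same record, with extra host details serialised as JSON."""
    record = simple_host_record(occurrence_id, parasite_name, host_name)
    record["dynamicProperties"] = json.dumps(host_details, sort_keys=True)
    return record


# Invented example rows (assumptions), not taken from a real dataset.
rows = [
    simple_host_record("urn:example:occ:1", "Gyrodactylus salaris", "Salmo salar"),
    detailed_host_record(
        "urn:example:occ:2",
        "Gyrodactylus salaris",
        "Salmo salar",
        {"hostLifeStage": "juvenile", "hostBodyPart": "fin"},
    ),
]
for row in rows:
    print(row)
```

Keeping the host name in associatedTaxa in both cases means the simple and the detailed records can be read the same way, with dynamicProperties only carrying the extras.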

We usually advise publishers to publish parasites and hosts as separate occurrences, but it is a lot of work for them. It would be easier to publish only hosts or only parasites as occurrences and have the other species mentioned in the associatedTaxa field. What would be best?

Ideally, publishers should share as much as possible; this is valuable when it comes to the “one health” approach. Right now, on GBIF.org, there is no way to search occurrences by value in the associatedTaxa field. This means, for example, that if you published only occurrences for parasites, there is no way for users to find those occurrences by looking for the name of the host. If you want both hosts and parasites to be discoverable, both have to be published as occurrences. This could perhaps change with the new data model, but we don’t know yet what will be possible.
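Since associatedTaxa cannot be searched on GBIF.org, one workaround for users is to filter a downloaded table locally. The sketch below assumes a tab-delimited occurrence table that contains an `associatedTaxa` column. The function name and the sample rows are made up for illustration, and whether a given download format includes this column should be checked first.

```python
import csv
import io


def occurrences_with_host(table, host_name):
    """Yield rows whose associatedTaxa value mentions host_name.

    The match is a case-insensitive substring test, so it is a
    rough filter rather than a parser of the key:value syntax.
    """
    # QUOTE_NONE keeps the double quotes used inside associatedTaxa.
    reader = csv.DictReader(table, delimiter="\t", quoting=csv.QUOTE_NONE)
    needle = host_name.casefold()
    for row in reader:
        if needle in (row.get("associatedTaxa") or "").casefold():
            yield row


# Invented sample data (assumption) standing in for a downloaded file.
sample = io.StringIO(
    "occurrenceID\tscientificName\tassociatedTaxa\n"
    'occ-1\tGyrodactylus salaris\t"host":"Salmo salar"\n'
    'occ-2\tGyrodactylus derjavini\t"host":"Salmo trutta"\n'
    "occ-3\tLutzomyia longipalpis\t\n"
)
matches = [row["occurrenceID"] for row in occurrences_with_host(sample, "Salmo salar")]
print(matches)
```

With a real file, `sample` would be replaced by an opened file handle. This only helps users who already have the data, which is why publishing both hosts and parasites as occurrences remains the more discoverable option.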

Would it be interesting to expose these health data in GloBI? Do we know if this DwC-A format allows for ingestion into GloBI ( https://www.globalbioticinteractions.org/)?

This forum thread mentions GloBI being able to ingest data from DwC-A: Field Museum and iNaturalist Extending Specimen through DwC Resource Relationships - #9 by jhpoelen. Several steps seem to be needed for this to happen; it isn’t direct, so please check the GloBI documentation.
It would make sense to ensure that the interaction datasets published on GBIF are also compatible with GloBI, especially in the context of the new data model (that is, to have a standard that works on both platforms).

From Norway, we also have health data for organisms other than humans, like Gyrodactylus on salmon, and we are interested in learning best practices for exposing them ( https://doi.org/10.15468/rcouob)

We don’t currently have specific recommendations for non-human hosts. Having concrete examples at hand will be very helpful for developing best practice documents, thank you.

We are thinking of doing one or two webinars with our publishers on best practices for publishing health data. Can we reuse the material provided by GBIF during training workshops? Do I need explicit permission?

It depends on the material concerned. Most GBIF training materials and guides are published with a license, so you should check the licenses associated with the documents that you would like to use. For example, the license for the GBIF Data Mobilization course is available here and the license for DNA-derived data publishing guide is available here. If in doubt, you are welcome to email us and we can help you find the rightful owner. If you would like to advertise your webinar on GBIF, you can use this form to create an event page: Suggest an event for the GBIF.org calendar

jhpoelen · January 10, 2024, 4:05pm

Great to see the interest in learning more about the wealth of knowledge already captured in GBIF registered datasets that describe how organisms interact.

Happy to help facilitate getting health data indexed by GloBI or elsewhere. There’s a wide variety of DwC-A-based datasets already being indexed (see e.g., parasite tracker), and best practices have been published on how to use DwC-A to embed your valuable association data (Sullivan et al. 2020). Also, note that Salim et al. 2022 has shown that GloBI tools like Elton, Preston and Nomer can be used independently to index all of GBIF - José did the analysis independently on non-GloBI infrastructure using GloBI tools and a versioned copy of data registered with GBIF (Poelen 2023).

So, if folks are serious about indexing all the health data (or other interactions) . . . the technical pieces are in place, just need some hardware, time and a little budget to make this work easier to access. Or. . . perhaps even better, you can experiment and build your own health data search index!

Hope this helps and curious to hear thoughts on helping to make health data easier to access,
-jorrit

PS For the specific example re: Bachmann L (2021). Artsprosjektet - Gyrodactylus. Version 1.10. University of Oslo. Occurrence dataset https://doi.org/10.15468/rcouob - I’ve added some notes at support for occurrenceRemarks style "Collected from ..." · Issue #956 · globalbioticinteractions/globalbioticinteractions · GitHub .

References

Kathryn Sullivan, Katja Seltmann, Jorrit Poelen, & Jennifer M. Zaspel. (2020, May). Making Parasite-Host Associations Visible in Terrestrial Parasite Tracker (TPT) (Version 0.0.1). Zenodo.

Salim JA, Seltmann KC, Poelen JH, Saraiva AM (2022) Indexing Biotic Interactions in GBIF data. Biodiversity Information Science and Standards 6: e93565.

Poelen, J. H. (2023). A biodiversity dataset graph: GBIF, iDigBio, BioCASe hash://sha256/450deb8ed9092ac9b2f0f31d3dcf4e2b9be003c460df63dd6463d252bff37b55 hash://md5/898a9c02bedccaea5434ee4c6d64b7a2 (0.0.4) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7651831

jhpoelen · March 22, 2024, 6:46pm

Please note that I’ve added an index configuration for:

Bachmann L (2021). Artsprosjektet - Gyrodactylus. Version 1.10. University of Oslo. Occurrence dataset Artsprosjektet - Gyrodactylus

after adding support for your particular interaction type annotation using the occurrenceRemarks with phrase “Collected from …”

For context, see support for occurrenceRemarks style "Collected from ..." · Issue #956 · globalbioticinteractions/globalbioticinteractions · GitHub .

Less work is needed if more generally used notations are used . . . but custom annotations can be supported also. It just takes a little longer to add the functionality, and folks may overlook the wealth of information this adds when interpreting such a dataset.
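To give an idea of what supporting a custom annotation involves, here is a minimal sketch that pulls a host name out of an occurrenceRemarks value starting with the phrase “Collected from”. This is not GloBI’s implementation. The regular expression, the function name and the sample remarks are assumptions, and the actual remarks in the dataset may be worded differently.

```python
import re

# Capture the text after "Collected from" up to the next ".", ";" or ",".
COLLECTED_FROM = re.compile(r"\bcollected from\s+(?P<host>[^.;,]+)", re.IGNORECASE)


def host_from_remarks(remarks):
    """Return the host mentioned after 'Collected from', or None."""
    match = COLLECTED_FROM.search(remarks or "")
    return match.group("host").strip() if match else None


# Invented remarks (assumption) to exercise the pattern.
for remarks in ["Collected from Salmo salar", "Collected from Salmo trutta; fin", ""]:
    host = host_from_remarks(remarks)
    # A more widely used notation could be the dwc:associatedTaxa
    # key:value form discussed earlier in this thread.
    print(repr(remarks), "->", f'"host":"{host}"' if host else None)
```

A free-text pattern like this needs its own parsing rule and breaks as soon as the wording changes, which is part of why a shared notation takes less work to index.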

jhpoelen · March 22, 2024, 7:31pm

I’ve attached a data review pdf as automatically generated by Nomer and Elton, two naive data review bots. More details are available if folks are interested.

zmo-gyrodactylus-review-2024-03-22.pdf (384.1 KB)
