EML generation tool?

Hi all,

I am Grant Fitzsimmons from the Specify Collections Consortium. In the Specify 7 CMS, we have a mechanism for publishing Darwin Core Archives via an RSS feed endpoint. It looks like this:

<rss
	xmlns:ipt="http://ipt.gbif.org/" version="2.0">
	<channel>
		<link></link>
		<title>KUBI ichthyology RSS Feed</title>
		<description>RSS feed for KUBI Ichthyology Voucher and Tissue collections</description>
		<item>
			<title>KU Fish</title>
			<id>8f79c802-a58c-447f-99aa-1d6a0790825a</id>
			<link></link>
			<ipt:eml></ipt:eml>
			<pubDate>Thu, 24 Jul 2025 15:50:08 -0000</pubDate>
			<type>DWCA</type>
		</item>
		<item>
			<title>KU Fish Tissue</title>
			<id>56caf05f-1364-4f24-85f6-0c82520c2792</id>
			<link></link>
			<ipt:eml></ipt:eml>
			<pubDate>Thu, 24 Jul 2025 15:50:12 -0000</pubDate>
			<type>DWCA</type>
		</item>
	</channel>
</rss>
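
For anyone consuming a feed shaped like the one above, here is a minimal Python sketch of reading its items using only the standard library. The function name `list_feed_items` is hypothetical, and the code assumes the element names and the `ipt` namespace URI exactly as they appear in the sample; it is not how Specify 7 itself reads or writes the feed.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns:ipt declaration in the sample feed.
IPT_NS = {"ipt": "http://ipt.gbif.org/"}


def list_feed_items(rss_text):
    """Return one dict per <item> in the feed (hypothetical helper)."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="") or "",
            "id": item.findtext("id", default="") or "",
            # <link> points at the archive; it is empty in the sample above.
            "link": item.findtext("link", default="") or "",
            # <ipt:eml> points at the EML file; also empty in the sample.
            "eml": item.findtext("ipt:eml", default="", namespaces=IPT_NS) or "",
            "pubDate": item.findtext("pubDate", default="") or "",
            "type": item.findtext("type", default="") or "",
        })
    return items
```

Each returned dict carries the dataset title, its UUID, and the archive and EML links, so a harvester could loop over the result and fetch whatever the `link` and `eml` values point to.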

To do this, we ask our users to provide two things:

  1. A DwC mapping from Specify fields to Darwin Core concepts, including additional query mappings for extensions
  2. An Ecological Metadata Language (EML) file with the metadata for the dataset

Historically, many collections adopting Specify 7 for publishing have previously published data through the GBIF IPT or another mid-level data aggregator. In these cases, EML is already available for the dataset, and Specify is simply updating it. We can grab it from GBIF directly and package it, no problem.

Increasingly, however, users are creating DwCA files and publishing directly from Specify 7 for the first time. Our main issue is that we do not have an EML generator of our own, so we have been using the EML generator developed by GBIF Norway, available on the GBIF Norway services page.

I’ve opened an issue in the repository with a description of the problem: https://github.com/gbif-norway/eml-generator-js/issues/4

The tool is great, and we appreciate that it exists. However, it does leave us with datasets containing EML files that do not validate against the GBIF schema. For example, it generates EML files with empty XML elements, which the validator does not accept:

 [1] "Element 'pubDate': '' is not a valid value of the union type '{eml://ecoinformatics.org/resource-2.1.1}yearDate'."
 [2] "Element 'keywordThesaurus': [facet 'minLength'] The value has a length of '0'; this underruns the allowed minimum length of '1'."
 [3] "Element 'keywordThesaurus': [facet 'pattern'] The value '' is not accepted by the pattern '[\s]*[\S][\s\S]*'."
 [4] "Element 'calendarDate': '' is not a valid value of the union type '{eml://ecoinformatics.org/resource-2.1.1}yearDate'."
 [5] "Element 'taxonRankName': [facet 'minLength'] The value has a length of '0'; this underruns the allowed minimum length of '1'."
 [6] "Element 'taxonRankName': [facet 'pattern'] The value '' is not accepted by the pattern '[\s]*[\S][\s\S]*'."
 [7] "Element 'taxonRankValue': [facet 'minLength'] The value has a length of '0'; this underruns the allowed minimum length of '1'."
 [8] "Element 'taxonRankValue': [facet 'pattern'] The value '' is not accepted by the pattern '[\s]*[\S][\s\S]*'."
 [9] "Element 'commonName': [facet 'minLength'] The value has a length of '0'; this underruns the allowed minimum length of '1'."
[10] "Element 'commonName': [facet 'pattern'] The value '' is not accepted by the pattern '[\s]*[\S][\s\S]*'."
[11] "Element 'maintenanceUpdateFrequency': [facet 'enumeration'] The value '' is not an element of the set {'annually', 'asNeeded', 'biannually', 'continually', 'daily', 'irregular', 'monthly', 'notPlanned', 'weekly', 'unkown', 'otherMaintenancePeriod'}."
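
Since every error above comes from an element with no content, one possible stopgap is to strip empty elements from the generated EML before packaging it. The following is a minimal Python sketch of that idea using only the standard library. The helper names `is_empty` and `prune_empty` are hypothetical, and this is not how the GBIF Norway tool or the IPT handles the problem. It also assumes that dropping an empty element is acceptable: if the schema requires the element, removing it will trade one validation error for another, so the result should still be run through a validator.

```python
import xml.etree.ElementTree as ET

# Assumption: the file uses the EML 2.1.1 root namespace. Registering the
# prefix keeps the output as "eml:eml" instead of an auto-generated "ns0:eml".
ET.register_namespace("eml", "eml://ecoinformatics.org/eml-2.1.1")


def is_empty(elem):
    """True if the element has no children, no attributes, and only whitespace text."""
    return len(elem) == 0 and not elem.attrib and not (elem.text or "").strip()


def prune_empty(elem):
    """Remove empty descendants of elem, bottom-up; return how many were removed."""
    removed = 0
    for child in list(elem):  # iterate over a copy, since we remove while looping
        removed += prune_empty(child)
        # A parent whose children were all removed is now empty itself.
        if is_empty(child):
            elem.remove(child)
            removed += 1
    return removed
```

A possible usage pattern would be to parse the generated file with `ET.parse`, call `prune_empty` on the root element, and write the tree back out with `tree.write(path, encoding="utf-8", xml_declaration=True)`. Because pruning works bottom-up, a block such as a taxonomic classification whose children are all empty is removed as a whole rather than left behind as a new empty element.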

My colleague @abentley reached out about this in the GBIF North America Slack channel and had some help from @sunray1 – thank you so much!

I believe these empty tags are largely safe to ignore since data is still publishable regardless, but the validator is not a fan.

Ultimately, we want to ensure that the EML packaged with DwCAs from Specify conforms to the schema validated by GBIF. Without establishing a private IPT instance for all users, we are left wondering how to safely and consistently advise them on generating EML files for their datasets.

Does anyone else in the community have this need, or have a tool that might be a better fit for our purpose? Thank you in advance!


Special thanks to GBIF Norway for developing that tool and to all the active contributors working with GBIF! We really appreciate all the work you all do!


Hi! I’ll take a look at this and see if I can make a better, less lazy way of generating the EML with this JavaScript tool. ChatIPT (also developed and maintained by GBIF Norway) uses Python to generate EML (see back-end/api/helpers/publish.py at commit 3e03aadb52bce6fdc9e1b49902500d661bf9e181 in the gbif-norway/ChatIPT repository on GitHub), but that also needs some work and has some validation errors, which we’re ignoring for the moment. I’m also interested to know whether anyone in the community has developed a tool for this; I couldn’t find one when I looked. I hoped that @pieterpprovoost’s dwca-writer (pieterprovoost/dwca-writer on GitHub, a Python package for writing Darwin Core Archives) might do it, but that doesn’t touch EML.


Hi @rukaya,

Thank you for your response and your efforts on the EML generator repository. We find it extremely valuable and recommend it in our publishing documentation for those who do not already have EML. :smile:

Hopefully, we can both learn about other community efforts aimed at generating it as well.
