Getting ALA data to the usable stage

Working with biodiversity data often involves a fair amount of filtering and cleaning (these are different things). We don’t often have to reformat a Darwin Core dataset, though, thanks to GBIF’s Integrated Publishing Toolkit (IPT). Occurrence datasets that have been through the IPT typically come out as simple tab-separated tables in UTF-8 encoding, with a single header line containing Darwin Core field names.

This isn’t the case for datasets shared with GBIF by the Atlas of Living Australia. A Darwin Core archive from the ALA with an occurrence core has an occurrence.txt file with 182 fields but no header line, and with every data item surrounded by quotes. The accompanying meta.xml file has 182 field names, but many of the field name references are defective.
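
A quick way to see that shape for yourself is to tally the fields per line and look at the start of the first line. What follows is only a sketch: “alashape” is a hypothetical name, and it assumes the file is tab-separated and defaults to occurrence.txt in the working directory.

```shell
# alashape: hypothetical helper, not part of the original post.
# Assumes the file in $1 (default: occurrence.txt) is tab-separated.
alashape() {
    # Tally the number of tab-separated fields per line; a single
    # result line means every line has the same number of fields.
    awk -F'\t' '{n[NF]++} END {for (k in n) printf("%d lines with %d fields\n", n[k], k)}' "${1:-occurrence.txt}"
    # Show the first three fields of line 1. Quoted data values here,
    # rather than Darwin Core field names, mean there is no header line.
    head -n 1 "${1:-occurrence.txt}" | cut -f 1-3
}
```

Entering “alashape” in the directory holding occurrence.txt prints the tally followed by the first three items of the first line.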

I can’t do anything about the meta.xml, but to make ALA occurrence data usable as a single table I can use the Bash shell function shown below. I extract occurrence.txt and meta.xml from the ALA archive, navigate to the directory containing those files and enter “alaprep”. The function generates a new TSV “ocala.txt” with the 182 field names in a header line and no quotes around data items.

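alaprep() {
    # Needs Bash and GNU sed. Builds a header line from meta.xml, puts it
    # in front of occurrence.txt, then strips the quotes around data items.
    cat <(sed 's|taxonRankID|/taxonRankID|' meta.xml \
        | awk -v FS="/|\"/>" '/field index="0"/ {f=1} f {printf("%s\t",$(NF-1))} /field index="181"/ {exit}' \
        | sed 's/\t$/\n/') occurrence.txt \
        | sed 's/^"//;s/"\t"/\t/g;s/"$//' > ocala.txt
}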
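
After running the function it’s worth confirming that the new table looks right. The check below is a minimal sketch, not part of the function itself: “alacheck” is a hypothetical name, the default file names and the 182-field count are taken from the description above, and the quote test is crude (a genuine data value that begins or ends with a quote would also trigger it).

```shell
# alacheck: hypothetical helper, not part of the original function.
# Usage: alacheck [raw_file] [new_file] [expected_field_count]
alacheck() {
    local raw="${1:-occurrence.txt}"
    local new="${2:-ocala.txt}"
    local want="${3:-182}"
    # The new file should have exactly one more line (the header) than the raw file
    [ "$(wc -l < "$new")" -eq "$(( $(wc -l < "$raw") + 1 ))" ] \
        || { echo "line count mismatch"; return 1; }
    # Every line, header included, should have the expected number of fields
    awk -F'\t' -v want="$want" 'NF != want {bad++} END {exit (bad > 0)}' "$new" \
        || { echo "field count mismatch"; return 1; }
    # No line should still begin or end with a double quote (crude test)
    ! grep -q -e '^"' -e '"$' "$new" \
        || { echo "leftover quotes"; return 1; }
    echo "ok"
}
```

Entered with no arguments in the working directory, “alacheck” compares occurrence.txt with ocala.txt and prints “ok” if all three tests pass.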

Please email me directly if you have questions about the function.

Robert Mesibov (“datafixer”); robert.mesibov@gmail.com
