Reading GBIF downloads (SIMPLE_CSV) with R

Recently I have been thinking that we don’t give users very good advice on how to get their data into R. GBIF downloads can be difficult to read into R because quote characters inside fields confuse some parsers. For example, read.table() typically does not do a good job of reading a GBIF SIMPLE_CSV download.
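To illustrate the kind of quoting problem described above, here is a minimal sketch using made-up, tab-delimited sample data (not taken from a real download). It assumes the trouble comes from unmatched quote characters inside fields, which base R's default quoting can treat as the start of a quoted string. The second call disables quoting so those characters are kept as ordinary text.

```r
# Made-up, tab-delimited sample: row 2 contains an unmatched apostrophe.
sample_lines <- c(
  "gbifID\tlocality\tcountryCode",
  "1\tHarbour\tDK",
  "2\tJim's field\tSE",
  "3\tOld mill\tNO",
  "4\tRiver bank\tFI"
)

# Default quoting (quote = "\"'"): the apostrophe may open a quoted string
# that swallows the tabs and line breaks that follow it, so rows can be
# merged or lost. Warnings are suppressed here only to keep the output short.
default_read <- suppressWarnings(
  read.table(text = sample_lines, sep = "\t", header = TRUE, fill = TRUE)
)

# Quoting disabled: quote characters are read as ordinary text.
literal_read <- read.table(text = sample_lines, sep = "\t", header = TRUE,
                           fill = TRUE, quote = "")

print(nrow(default_read))
print(nrow(literal_read))
```

Comparing the two row counts on a small sample like this is a quick way to see whether quoting is the cause before loading a multi-million-row file.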

I ran some experiments with various CSV readers in R:

  • read.table()
  • readr::read_tsv()
  • vroom::vroom()
  • data.table::fread()
  • rgbif::occ_download_import() (uses data.table::fread() internally)

Ideally we would want a CSV reader that does the job with minimal changes to the default settings. I will be using this download from a previous post.

There are 2,639,079 rows and 50 columns in this download.
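One way to get an expected row count that does not depend on any CSV parser is to count physical lines in the file. The sketch below is only an illustration on made-up sample data written to a temporary file. `count_data_lines` is a hypothetical helper name, and the approach assumes that no field contains an embedded line break.

```r
# Hypothetical helper: count data lines (excluding the header) without
# parsing fields. Reads in chunks so the whole file is never held in memory.
count_data_lines <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0L) break
    n <- n + length(chunk)
  }
  max(n - 1L, 0L)  # subtract the header line
}

# Made-up sample standing in for a tab-delimited GBIF SIMPLE_CSV file.
sample_lines <- c(
  "gbifID\tlocality\tcountryCode",
  "1\tHarbour\tDK",
  "2\tJim's field\tSE",
  "3\tNear the \"old mill\tNO"
)
tmp <- tempfile(fileext = ".csv")
writeLines(sample_lines, tmp)

expected_rows <- count_data_lines(tmp)
print(expected_rows)
```

The result can then be compared with the first element of dim() from whichever reader is being tested.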

read.table()

With the best settings I could find, read.table() read only 337,716 of the 2,639,079 rows. Changing the encoding to “UTF-8” did not improve things.

library(magrittr) # provides the %>% pipe used below

path_to_download = ""
gbif_download_key = "0080847-200221144449610"
file_path = paste0(path_to_download,gbif_download_key,"/",gbif_download_key,".csv")

# the file is tab-delimited; quote="" treats quote characters as ordinary text
read.table(file_path,sep="\t",fill=TRUE,header=TRUE,quote="") %>%
dim()
# [1] 337716     50

# same call with an explicit encoding
read.table(file_path,sep="\t",fill=TRUE,header=TRUE, quote="",encoding="UTF-8") %>%
dim()
# [1] 337716     50

readr::read_tsv()

readr::read_tsv() reads the entire file only with quote=""; with the default settings it returns 1,681,879 of the 2,639,079 rows.

library(readr)
read_tsv(file_path) %>% 
dim()
# [1] 1681879      50

read_tsv(file_path,quote="") %>%
dim()
# [1] 2639079      50

vroom::vroom()

vroom::vroom() also reads the entire file with quote=""; with the default settings it returns 1,785,588 rows.

vroom::vroom(file_path) %>%
dim()
# [1] 1785588      50

vroom::vroom(file_path,quote="") %>%
dim()
# [1] 2639079      50

data.table::fread() and rgbif::occ_download_import()

data.table::fread() works with default settings.

data.table::fread(file_path) %>% 
dim()
# [1] 2639079      50

rgbif::occ_download_import(key=gbif_download_key,path=path_to_download) %>% 
dim()
# [1] 2639079      50
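Although fread() did not need any changes in the experiments above, the settings can also be spelled out explicitly, in the same spirit as the quote="" calls for the other readers. The sketch below is an illustration only, using made-up sample data in a temporary file. The choice to set sep, quote and encoding by hand is an assumption about what a cautious script might do, not something the results above require.

```r
# Made-up sample standing in for a tab-delimited GBIF SIMPLE_CSV file.
sample_lines <- c(
  "gbifID\tlocality\tcountryCode",
  "1\tNear the \"old mill\tDK",
  "2\tJim's field\tSE"
)
tmp <- tempfile(fileext = ".csv")
writeLines(sample_lines, tmp)

# Only run if data.table is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  occ <- data.table::fread(
    tmp,
    sep = "\t",          # the file is tab-delimited
    quote = "",          # keep quote characters as ordinary text
    encoding = "UTF-8"   # mark strings as UTF-8
  )
  print(dim(occ))
}
```

Being explicit makes the script independent of fread()'s automatic detection, at the cost of having to change the call if the file format changes.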