Problem parsing large occurrence downloads

I’ve searched around and couldn’t find any existing topics on here, but apologies if I’ve missed something.

I’m trying to carry out an analysis that uses the occurrence records for a whole plant family, so the download has > 2 million records (in this case, the family Myrtaceae).

When I try to load the file into R or Python, I always end up with fewer rows in the table than cited on the download page.

I’ve tried a few different packages, but get the same problem each time. For instance, using the R package vroom:

library(vroom)

# Columns to keep from the download
col_names <- c("gbifID", "genus", "species", "taxonRank",
               "scientificName", "countryCode",
               "decimalLatitude", "decimalLongitude",
               "day", "month", "year", "taxonKey",
               "speciesKey", "basisOfRecord", "issue")

# MYRTACEAE_DL_PATH is the path to the downloaded occurrence file
d <- vroom(MYRTACEAE_DL_PATH, delim = "\t", col_select = all_of(col_names))

This gives a table with 1,785,588 rows, when the download page says there should be 2,638,956 occurrences.

Is there something I’m missing here? Am I using the wrong delimiter? Or quotation character?

Any insight you can give would be greatly appreciated.

I also often have trouble reading large GBIF files into R. I am not sure exactly what the problem is, but I think special characters or quoting sometimes cause trouble for some file parsers. I am not really sure…

Fortunately, I was able to read the entire CSV file using data.table::fread().

library(dplyr)  # for the %>% pipe

MYRTACEAE_DL_PATH <- "C:/Users/ftw712/Desktop/0080847-200221144449610.csv"

data.table::fread(MYRTACEAE_DL_PATH) %>%
  str()

# 2 639 079 observations
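
If quoting is the culprit, you can also tell fread to ignore quotes entirely (a hypothetical variant on the call above; quote = "" is a documented fread argument):

# quote = "" makes fread treat " as literal text rather than a quote character
data.table::fread(MYRTACEAE_DL_PATH, quote = "") %>%
  str()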

I was also able to count all the lines in the file with:

readr::read_lines(MYRTACEAE_DL_PATH) %>% length()

# 2 639 080 lines (2 639 079 records plus the header)

so every line is there; I am not sure what those other parsers are doing internally…

You shouldn’t use " as the quoting character in your parser, because GBIF exports don’t quote fields. If you use a parser that treats " as a quote by default, you’ll hit problems whenever a field starts with a " but doesn’t end with one: the parser keeps reading everything into that field until it finds a " followed by a tab. Many rows can get swallowed into a single cell this way.
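
Here is a toy example (made-up data, not from the actual download) that reproduces the failure mode with readr; I() marks the string as literal input:

# Hypothetical two-column TSV where one field begins with a stray "
txt <- 'id\tlocality\n1\t"5 km N of town\n2\tforest edge"\n3\tplain text\n'

# Default quote = '"': rows 1 and 2 get merged into a single record,
# because the parser reads on until it finds the closing quote
readr::read_tsv(I(txt))              # 2 rows

# Quoting disabled: one record per physical line, as intended
readr::read_tsv(I(txt), quote = "")  # 3 rows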

This should work:

data <- readr::read_tsv("0080847-200221144449610.csv",
                        quote = "",
                        col_types = readr::cols(.default = "c"))
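
As a quick sanity check against the counts above, the parsed table should be exactly one row short of the raw line count, the difference being the header line:

nrow(data)                                                 # 2 639 079
length(readr::read_lines("0080847-200221144449610.csv"))   # 2 639 080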

Ah, thanks to both of you.

I really should have checked the quotation character more thoroughly.

That seems to have fixed everything now!
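
In case it helps anyone later: the equivalent fix for my original vroom call is just passing quote = "" there too (a sketch, assuming the same path and column selection as in my first post):

# Same call as before, with quote handling disabled
d <- vroom(MYRTACEAE_DL_PATH, delim = "\t", quote = "",
           col_select = all_of(col_names))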
