Problem parsing large occurrence downloads

barnabywalker · March 12, 2021, 5:28pm

I’ve searched around and couldn’t find any existing topics on here, but apologies if I’ve missed something.

I’m trying to carry out an analysis that uses an occurrence download for a whole plant family (Myrtaceae, for example), so it has > 2 million records.

When I try to load the file into R or Python, I always end up with fewer rows in the table than cited on the download page.

I’ve tried a few different packages, but get the same problem each time. For instance, using the R package vroom:

library(vroom)

col_names <- c("gbifID", "genus", "species", "taxonRank",
               "scientificName", "countryCode",
               "decimalLatitude", "decimalLongitude",
               "day", "month", "year", "taxonKey", 
               "speciesKey", "basisOfRecord", "issue")

d <- vroom(MYRTACEAE_DL_PATH, delim="\t", col_select=col_names)

This gives a table with 1,785,588 rows, but the download page says there should be 2,638,956 occurrences.

Is there something I’m missing here? Am I using the wrong delimiter? Or quotation character?

Any insight you can give would be greatly appreciated.

jwaller · March 15, 2021, 9:49am

I also often have trouble reading large GBIF files into R. I am not sure exactly what the problem is, but I think special characters or quoting sometimes cause trouble for some file parsers.

Fortunately, I was able to read in the entire csv file using data.table::fread().

library(magrittr)  # provides the %>% pipe used below

MYRTACEAE_DL_PATH <- "C:/Users/ftw712/Desktop/0080847-200221144449610.csv"
data.table::fread(MYRTACEAE_DL_PATH) %>%
  str()

# 2 639 079 observations

jwaller · March 15, 2021, 10:31am

I was also able to get

library(dplyr)
readr::read_lines(MYRTACEAE_DL_PATH) %>% length()

# 2 639 080

to read all the lines so I am not sure what these parsers are using internally…
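Since the original question also mentions Python, here is a rough Python counterpart to the raw line count above. It is only a sketch: `count_lines` is a hypothetical helper name, and it counts newline bytes without parsing the file, so it can be compared against whatever row count a parser reports.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes in a file without parsing it (hypothetical helper).

    Note: the result includes the header line, and a final line without a
    trailing newline is not counted.
    """
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)  # read in 1 MiB chunks to keep memory low
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total
```

If this number is close to the expected record count but the parsed table is much shorter, the rows are probably being merged by the parser rather than missing from the file.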

MatDillen · March 15, 2021, 11:33am

You shouldn’t use " to quote strings in your parser, because GBIF exports don’t do this. If you use a parser which uses " to quote by default, you’ll encounter problems if there are strings that start with a " but do not end with one, as the parser dumps everything into that string until it finds a " followed by a tab. So many rows can get swallowed into single cells this way.

This should work:

# quote = "" turns off quote handling; every column is read as character
data <- readr::read_tsv("0080847-200221144449610.csv", quote = "", col_types = readr::cols(.default = "c"))
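For anyone hitting this from Python, the same effect can be reproduced with the standard library `csv` module. This is a minimal sketch on made-up sample data (the `recordedBy` values and the `count_data_rows` helper are illustrative assumptions, not taken from the actual download): one field starts with a " that is never closed, and the default quoting swallows the following rows into that field.

```python
import csv
import io

# Made-up tab-separated sample: the first data row has a field that
# opens with a double quote and never closes it.
SAMPLE = (
    "gbifID\trecordedBy\tcountryCode\n"
    '1\t"J. Smith\tAU\n'
    "2\tA. Jones\tBR\n"
    "3\tB. Lee\tNZ\n"
)


def count_data_rows(text, quoting):
    """Parse tab-separated text and count rows after the header."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t", quoting=quoting)
    next(reader)  # skip the header row
    return sum(1 for _ in reader)


# Default behaviour: " is treated as a quote character.
default_rows = count_data_rows(SAMPLE, csv.QUOTE_MINIMAL)
# Quote handling disabled: " is treated as an ordinary character.
unquoted_rows = count_data_rows(SAMPLE, csv.QUOTE_NONE)

print(default_rows, unquoted_rows)  # 1 3
```

With quoting left at its default, the three data rows collapse into one; with `csv.QUOTE_NONE` all three come back, and the stray " stays in the field as a literal character.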

barnabywalker · March 15, 2021, 12:18pm

Ah, thanks both of you.

I really should have checked the quotation character more thoroughly.

That seems to have fixed everything now!

system · April 14, 2021, 10:18pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
