Data publishing counts distorted by eBird

The “Data Publishing” tab in country reports gives what I think is a misleading impression of publishing activity by individual countries. For example, for India GBIF reports over 30 million occurrences published.

But if I look at the list of 25 published datasets from India, the occurrences total 217,237. Browsing the data, it seems that the bulk of the 30 million records counted as published by India come from eBird. While the eBird observations were made in India, it seems a stretch to argue that they are published by India.

It also seems to overshadow efforts such as the India Biodiversity Portal which has 82,462 occurrences representing 38% of the data for datasets published by India. Yet as a percentage of the data GBIF regards as published by India this is less than 0.3%.
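As a quick check on those two percentages, here is the arithmetic as a minimal sketch. The variable names are mine, and "over 30 million" is rounded down to exactly 30 million, so the second share is an upper bound.

```python
# Figures quoted in the post above. The headline figure is "over 30 million",
# so using exactly 30 million makes the second share an upper bound.
ibp_occurrences = 82_462           # India Biodiversity Portal
india_dataset_total = 217_237      # total over the 25 datasets published from India
gbif_reported_total = 30_000_000   # occurrences GBIF counts as published by India

share_of_datasets = ibp_occurrences / india_dataset_total
share_of_headline = ibp_occurrences / gbif_reported_total

print(f"Share of India-published datasets: {share_of_datasets:.1%}")  # 38.0%
print(f"Share of GBIF headline figure:     {share_of_headline:.2%}")  # 0.27%
```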

Note that this issue isn’t unique to India. I picked that country because I was curious about the impact of the India Biodiversity Portal. Other countries may well find their estimates of publishing activity distorted by eBird (and other?) data.

Obviously there are various ways to approach this problem, but as it stands the headline figure can give a seriously misleading indication of the direct contribution of a country to GBIF.

This topic was discussed some years ago among the GBIF Secretariat, some nodes and publishers, before the decision to attribute eBird records to the country where the records originate. I took part in the discussion, reporting the case of Portugal, where:

  • the community of Portuguese birdwatchers is very active, and many of them use eBird to manage their daily bird observations;
  • SPEA, the Portuguese Society for the Study of Birds and BirdLife's partner in Portugal, agreed with eBird to use this platform to manage all observation records. SPEA even migrated their existing databases to eBird. This agreement resulted in the regional eBird portal PortugalAves.

Therefore, in the case of Portugal, I think it is fair to say that eBird records in Portugal are published by Portugal. One issue that eBird never solved was identifying the records published by SPEA and correctly attributing them to SPEA as a GBIF publisher, without generating duplicates.

I suspected this was the explanation. The sheer scale of eBird can be overwhelming at times, making it difficult to see underlying patterns.

I would add that @rui.figueira's point could equally apply to India. The huge number of eBird records in India is closely related to the activities of a very proactive eBird partner, Bird Count India (https://birdcount.in/). Under the old way of counting 'data from' a country, all of these records would have been shown as published from the United States, where eBird is headquartered. I would say it is more meaningful to group them under the country of observation, as this is primarily the output of Indian volunteer observers. There is never a perfect way to slice data by country, but if eBird overwhelms other data it's because of a real difference in scale, not an artefact.

I got curious about what would happen if we sliced the observations by the country of the source dataset. This is what I found:

source

Based on the August Parquet occurrence dump, using the country of the publishing organization that hosts each dataset.
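For readers who want to try something similar, the grouping described above can be sketched in a few lines. This is not the query that was actually run: the lookup tables, dataset keys and field names below are hypothetical stand-ins, and a real run over the full dump would need a proper Parquet reader rather than an in-memory list.

```python
from collections import Counter

# Hypothetical lookups: dataset -> country of the publishing organization,
# and country -> continent. Real values would come from registry metadata.
dataset_publisher_country = {"ds-1": "US", "ds-2": "IN", "ds-3": "PT"}
country_continent = {"US": "North America", "IN": "Asia", "PT": "Europe"}

# Toy occurrence records; the field names are assumptions, not the dump schema.
occurrences = [
    {"datasetkey": "ds-1", "basisofrecord": "HUMAN_OBSERVATION"},
    {"datasetkey": "ds-1", "basisofrecord": "HUMAN_OBSERVATION"},
    {"datasetkey": "ds-2", "basisofrecord": "PRESERVED_SPECIMEN"},
    {"datasetkey": "ds-3", "basisofrecord": "HUMAN_OBSERVATION"},
    {"datasetkey": "ds-9", "basisofrecord": "MATERIAL_SAMPLE"},  # no known publisher
]


def count_by_continent(records, dataset_country, continent_of):
    """Count records per (continent, basisofrecord); unmatched datasets go to 'NA'."""
    counts = Counter()
    for rec in records:
        country = dataset_country.get(rec["datasetkey"])
        continent = continent_of.get(country, "NA")
        counts[(continent, rec["basisofrecord"])] += 1
    return counts


counts = count_by_continent(occurrences, dataset_publisher_country, country_continent)
for (continent, basis), n in sorted(counts.items()):
    print(continent, basis, n)
```

Datasets that cannot be matched to a publisher country end up under "NA", which is how I read the NA rows in the results.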

table

continent      basisofrecord        n
Africa         PRESERVED_SPECIMEN   3472891
Africa         OCCURRENCE           94676
Africa         LIVING_SPECIMEN      57510
Africa         HUMAN_OBSERVATION    29296091
Africa         MATERIAL_SAMPLE      56655
Africa         MACHINE_OBSERVATION  31327
Asia           OCCURRENCE           530982
Asia           LIVING_SPECIMEN      94678
Asia           FOSSIL_SPECIMEN      86486
Asia           HUMAN_OBSERVATION    5730359
Asia           OBSERVATION          586884
Asia           MACHINE_OBSERVATION  86981
Asia           MATERIAL_SAMPLE      609
Asia           PRESERVED_SPECIMEN   14102859
Asia           MATERIAL_CITATION    772
Europe         PRESERVED_SPECIMEN   70357904
Europe         HUMAN_OBSERVATION    623233689
Europe         OBSERVATION          8237730
Europe         MATERIAL_CITATION    1086411
Europe         OCCURRENCE           5604377
Europe         MATERIAL_SAMPLE      3361261
Europe         FOSSIL_SPECIMEN      3167174
Europe         LIVING_SPECIMEN      1241597
Europe         MACHINE_OBSERVATION  5983418
North America  FOSSIL_SPECIMEN      6771940
North America  OCCURRENCE           2584175
North America  MATERIAL_CITATION    526
North America  MATERIAL_SAMPLE      541395
North America  OBSERVATION          9422865
North America  HUMAN_OBSERVATION    1138294213
North America  PRESERVED_SPECIMEN   83156613
North America  MACHINE_OBSERVATION  2441401
North America  LIVING_SPECIMEN      61432
Oceania        MATERIAL_SAMPLE      1377494
Oceania        FOSSIL_SPECIMEN      5359
Oceania        OBSERVATION          1560379
Oceania        PRESERVED_SPECIMEN   13122144
Oceania        OCCURRENCE           1170654
Oceania        MACHINE_OBSERVATION  1852449
Oceania        HUMAN_OBSERVATION    25500411
Oceania        LIVING_SPECIMEN      85423
South America  FOSSIL_SPECIMEN      40631
South America  LIVING_SPECIMEN      62125
South America  MATERIAL_SAMPLE      121422
South America  HUMAN_OBSERVATION    3746238
South America  PRESERVED_SPECIMEN   13929306
South America  OCCURRENCE           3570246
South America  OBSERVATION          6287
South America  MATERIAL_CITATION    8383
South America  MACHINE_OBSERVATION  350041
NA             OBSERVATION          98996
NA             MATERIAL_SAMPLE      35478750
NA             LIVING_SPECIMEN      275272
NA             MATERIAL_CITATION    601959
NA             MACHINE_OBSERVATION  3263962
NA             FOSSIL_SPECIMEN      4
NA             OCCURRENCE           3785904
NA             HUMAN_OBSERVATION    93696982
NA             PRESERVED_SPECIMEN   10159784

I wasn’t able to match up all datasets to a continent. But it still shows that most GBIF occurrences are human observations, and most are hosted in Europe or North America.
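Both claims can be checked by summing the table rows directly. The sketch below simply re-parses the rows from this thread; the variable names are mine, and the NA rows (datasets not matched to a continent) are kept in the total.

```python
from collections import defaultdict

# Rows copied from the table in this thread: continent, basisofrecord, n.
TABLE = """\
Africa PRESERVED_SPECIMEN 3472891
Africa OCCURRENCE 94676
Africa LIVING_SPECIMEN 57510
Africa HUMAN_OBSERVATION 29296091
Africa MATERIAL_SAMPLE 56655
Africa MACHINE_OBSERVATION 31327
Asia OCCURRENCE 530982
Asia LIVING_SPECIMEN 94678
Asia FOSSIL_SPECIMEN 86486
Asia HUMAN_OBSERVATION 5730359
Asia OBSERVATION 586884
Asia MACHINE_OBSERVATION 86981
Asia MATERIAL_SAMPLE 609
Asia PRESERVED_SPECIMEN 14102859
Asia MATERIAL_CITATION 772
Europe PRESERVED_SPECIMEN 70357904
Europe HUMAN_OBSERVATION 623233689
Europe OBSERVATION 8237730
Europe MATERIAL_CITATION 1086411
Europe OCCURRENCE 5604377
Europe MATERIAL_SAMPLE 3361261
Europe FOSSIL_SPECIMEN 3167174
Europe LIVING_SPECIMEN 1241597
Europe MACHINE_OBSERVATION 5983418
North America FOSSIL_SPECIMEN 6771940
North America OCCURRENCE 2584175
North America MATERIAL_CITATION 526
North America MATERIAL_SAMPLE 541395
North America OBSERVATION 9422865
North America HUMAN_OBSERVATION 1138294213
North America PRESERVED_SPECIMEN 83156613
North America MACHINE_OBSERVATION 2441401
North America LIVING_SPECIMEN 61432
Oceania MATERIAL_SAMPLE 1377494
Oceania FOSSIL_SPECIMEN 5359
Oceania OBSERVATION 1560379
Oceania PRESERVED_SPECIMEN 13122144
Oceania OCCURRENCE 1170654
Oceania MACHINE_OBSERVATION 1852449
Oceania HUMAN_OBSERVATION 25500411
Oceania LIVING_SPECIMEN 85423
South America FOSSIL_SPECIMEN 40631
South America LIVING_SPECIMEN 62125
South America MATERIAL_SAMPLE 121422
South America HUMAN_OBSERVATION 3746238
South America PRESERVED_SPECIMEN 13929306
South America OCCURRENCE 3570246
South America OBSERVATION 6287
South America MATERIAL_CITATION 8383
South America MACHINE_OBSERVATION 350041
NA OBSERVATION 98996
NA MATERIAL_SAMPLE 35478750
NA LIVING_SPECIMEN 275272
NA MATERIAL_CITATION 601959
NA MACHINE_OBSERVATION 3263962
NA FOSSIL_SPECIMEN 4
NA OCCURRENCE 3785904
NA HUMAN_OBSERVATION 93696982
NA PRESERVED_SPECIMEN 10159784
"""

by_continent = defaultdict(int)
by_basis = defaultdict(int)
for line in TABLE.splitlines():
    # Split from the right so that "North America" and "South America" stay intact.
    continent, basis, n = line.rsplit(maxsplit=2)
    by_continent[continent] += int(n)
    by_basis[basis] += int(n)

total = sum(by_continent.values())
human_share = by_basis["HUMAN_OBSERVATION"] / total
eu_na_share = (by_continent["Europe"] + by_continent["North America"]) / total

print(f"HUMAN_OBSERVATION share of all rows: {human_share:.0%}")
print(f"Europe + North America share:        {eu_na_share:.0%}")
```

Both shares come out well above one half, even with the unmatched NA rows counted in the denominator.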

plot

If you think it would be interesting, I could do the same for class instead of basisofrecord. We would probably see that a lot of those human observations are birds.

This query took about 20 mins on my machine. It can probably be optimized a bunch.

It feels a little like if you’re not doing citizen science on birds, you are now in the long tail of biodiversity data.

And it looks like iNaturalist will be next in line.