The “Data Publishing” tab in country reports gives what I think is a misleading impression of publishing activity by individual countries. For example, for India GBIF reports over 30 million occurrences published.
But if I look at the list of 25 datasets published from India, the total number of occurrences comes to 217,237. Browsing the data, it seems that the bulk of the 30 million records counted as published by India are from eBird. While the eBird observations were made in India, it seems a stretch to argue that they are published by India.
The headline figure also overshadows efforts such as the India Biodiversity Portal, which has 82,462 occurrences, representing 38% of the data in datasets published by India. Yet as a percentage of the data GBIF regards as published by India, this is less than 0.3%.
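For anyone who wants to check the percentages above, here is a quick sketch of the arithmetic. The 30 million figure is the rounded lower bound quoted above ("over 30 million"), so the second share is an upper bound rather than an exact value.

```python
# Figures quoted in this post.
ibp_occurrences = 82_462              # India Biodiversity Portal
india_dataset_occurrences = 217_237   # total across the 25 datasets published from India
gbif_reported_for_india = 30_000_000  # "over 30 million", used here as a lower bound

# Share of the India Biodiversity Portal among datasets published from India.
share_of_india_datasets = 100 * ibp_occurrences / india_dataset_occurrences

# Share of the same portal in the headline figure GBIF reports for India.
# Because the real total is above 30 million, the true share is even smaller.
share_of_headline_figure = 100 * ibp_occurrences / gbif_reported_for_india

print(f"Share of India-published datasets: {share_of_india_datasets:.1f}%")
print(f"Share of headline figure (upper bound): {share_of_headline_figure:.2f}%")
```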
Note that this issue isn’t unique to India. I picked that country because I was curious about the impact of the India Biodiversity Portal. Other countries may well find their estimates of publishing activity distorted by eBird (and other?) data.
Obviously there are various ways to approach this problem, but as it stands the headline figure can give a seriously misleading indication of the direct contribution of a country to GBIF.
This topic was discussed some years ago among the GBIF Secretariat, some nodes and publishers, before the decision to attribute eBird records to the country where the records originate. I took part in the discussion, reporting the case of Portugal, where:
- the community of Portuguese birdwatchers is very active, and many of them use eBird to manage their daily bird observations;
- SPEA, the Portuguese Society for the Study of Birds and BirdLife’s partner in Portugal, agreed with eBird to use this platform to manage all observation records. SPEA even migrated their existing databases to eBird. This agreement resulted in the regional eBird portal PortugalAves.

Therefore, in the case of Portugal, I think it is fair to say that eBird records in Portugal are published by Portugal. One issue that eBird never solved was identifying the records published by SPEA and correctly attributing them to SPEA as a GBIF publisher, without generating duplicates.
I would add that @rui.figueira’s point could equally apply to India. The huge number of eBird records in India is closely related to the activities of a very proactive eBird partner, Bird Count India (https://birdcount.in/). Under the old way of counting ‘data from’ a country, all of these records would have been shown as being published from the United States, where eBird has its headquarters. I would say it is more meaningful to group them as coming from the country of observation, as this is primarily the output of Indian volunteer observers. There is never a perfect way to slice data by country, but if eBird overwhelms other data it’s because of a real difference in scale, not an artefact.
I got curious about what would happen if we were to slice the observations by the country of the source dataset. This is what I found out:
source
Based on the August Parquet occurrence dump, looking at the country of origin of the publishing organizations that host the datasets.
table

| continent | basisofrecord | n |
|---|---|---|
| Africa | PRESERVED_SPECIMEN | 3472891 |
| Africa | OCCURRENCE | 94676 |
| Africa | LIVING_SPECIMEN | 57510 |
| Africa | HUMAN_OBSERVATION | 29296091 |
| Africa | MATERIAL_SAMPLE | 56655 |
| Africa | MACHINE_OBSERVATION | 31327 |
| Asia | OCCURRENCE | 530982 |
| Asia | LIVING_SPECIMEN | 94678 |
| Asia | FOSSIL_SPECIMEN | 86486 |
| Asia | HUMAN_OBSERVATION | 5730359 |
| Asia | OBSERVATION | 586884 |
| Asia | MACHINE_OBSERVATION | 86981 |
| Asia | MATERIAL_SAMPLE | 609 |
| Asia | PRESERVED_SPECIMEN | 14102859 |
| Asia | MATERIAL_CITATION | 772 |
| Europe | PRESERVED_SPECIMEN | 70357904 |
| Europe | HUMAN_OBSERVATION | 623233689 |
| Europe | OBSERVATION | 8237730 |
| Europe | MATERIAL_CITATION | 1086411 |
| Europe | OCCURRENCE | 5604377 |
| Europe | MATERIAL_SAMPLE | 3361261 |
| Europe | FOSSIL_SPECIMEN | 3167174 |
| Europe | LIVING_SPECIMEN | 1241597 |
| Europe | MACHINE_OBSERVATION | 5983418 |
| North America | FOSSIL_SPECIMEN | 6771940 |
| North America | OCCURRENCE | 2584175 |
| North America | MATERIAL_CITATION | 526 |
| North America | MATERIAL_SAMPLE | 541395 |
| North America | OBSERVATION | 9422865 |
| North America | HUMAN_OBSERVATION | 1138294213 |
| North America | PRESERVED_SPECIMEN | 83156613 |
| North America | MACHINE_OBSERVATION | 2441401 |
| North America | LIVING_SPECIMEN | 61432 |
| Oceania | MATERIAL_SAMPLE | 1377494 |
| Oceania | FOSSIL_SPECIMEN | 5359 |
| Oceania | OBSERVATION | 1560379 |
| Oceania | PRESERVED_SPECIMEN | 13122144 |
| Oceania | OCCURRENCE | 1170654 |
| Oceania | MACHINE_OBSERVATION | 1852449 |
| Oceania | HUMAN_OBSERVATION | 25500411 |
| Oceania | LIVING_SPECIMEN | 85423 |
| South America | FOSSIL_SPECIMEN | 40631 |
| South America | LIVING_SPECIMEN | 62125 |
| South America | MATERIAL_SAMPLE | 121422 |
| South America | HUMAN_OBSERVATION | 3746238 |
| South America | PRESERVED_SPECIMEN | 13929306 |
| South America | OCCURRENCE | 3570246 |
| South America | OBSERVATION | 6287 |
| South America | MATERIAL_CITATION | 8383 |
| South America | MACHINE_OBSERVATION | 350041 |
| NA | OBSERVATION | 98996 |
| NA | MATERIAL_SAMPLE | 35478750 |
| NA | LIVING_SPECIMEN | 275272 |
| NA | MATERIAL_CITATION | 601959 |
| NA | MACHINE_OBSERVATION | 3263962 |
| NA | FOSSIL_SPECIMEN | 4 |
| NA | OCCURRENCE | 3785904 |
| NA | HUMAN_OBSERVATION | 93696982 |
| NA | PRESERVED_SPECIMEN | 10159784 |
I wasn’t able to match all datasets to a continent (the unmatched ones appear as NA). But it still shows that most GBIF occurrences are human observations, and that most are hosted in Europe or North America.
If you think it would be interesting, I could do the same for class instead of basisofrecord. We would probably see that lots of those human observations are birds.
This query took about 20 minutes on my machine. It can probably be optimized quite a bit.
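For readers who want to try a similar breakdown, here is a minimal sketch of the grouping logic described above. It is not the query that produced the table. It uses only the Python standard library and a handful of invented toy rows. The `datasetkey` field, the `dataset_country` mapping and the `CONTINENT_BY_COUNTRY` lookup are assumptions made for illustration. A real run over the Parquet dump would need a proper Parquet reader and a complete country-to-continent table.

```python
from collections import Counter

# Hypothetical lookup from the publishing organization's country code to a continent.
# Only a few entries are listed here, for illustration.
CONTINENT_BY_COUNTRY = {"US": "North America", "IN": "Asia", "PT": "Europe"}


def count_by_continent_and_basis(records, dataset_country):
    """Count occurrences per (continent of publishing organization, basisofrecord)."""
    counts = Counter()
    for rec in records:
        # Datasets with no known publisher country fall into the "NA" bucket.
        country = dataset_country.get(rec["datasetkey"])
        continent = CONTINENT_BY_COUNTRY.get(country, "NA")
        counts[(continent, rec["basisofrecord"])] += 1
    return counts


# Invented toy rows, not real GBIF data.
records = [
    {"datasetkey": "d1", "basisofrecord": "HUMAN_OBSERVATION"},
    {"datasetkey": "d1", "basisofrecord": "HUMAN_OBSERVATION"},
    {"datasetkey": "d2", "basisofrecord": "PRESERVED_SPECIMEN"},
    {"datasetkey": "d3", "basisofrecord": "OCCURRENCE"},
]
# Assumed mapping from dataset to publisher country; "d3" is left unmatched on purpose.
dataset_country = {"d1": "US", "d2": "IN"}

counts = count_by_continent_and_basis(records, dataset_country)
for (continent, basis), n in sorted(counts.items()):
    print(continent, basis, n)
```

Grouping on the publisher's country rather than the country of observation is what separates this view from the country report figures discussed earlier in the thread.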