Attributing work done (Data Attribution)

Moderators: Nicky Nicolson, David Shorthouse, and Lawrence Monda

Background

People want their efforts to be acknowledged and recognised. Other people want to know who did the work and when, and for that information to be unambiguous. Collections want to gain attribution for their contribution to the scientific endeavor through specimen and data use in end-products of research. Standardized mechanisms and metrics are required to facilitate this.

The goal of this category is to give shape to what it means to assign credit to individuals, organizations, or even software. It requires us to think about motivations, units of work, agencies, ethics, technologies, and standards of practice. Almost all data that flow from sender to recipients contain some form of structured or unstructured attribution. In short, the producers of new, primary data or secondary data products desire acknowledgment.

This category has significant areas of overlap with many, if not all other categories. However, it differs in its focus on a need to establish consensus on what or who are the parties that require attribution, how we uniquely identify those parties such that a token of their identity accompanies the transmission of data and is shared without ambiguity, and establishing who or what is responsible for storing and providing access to attribution data. It also differs from other categories because this is where we may expound on ethical and legal considerations. We do not count things merely because we can. We must count things with purpose, with assured measures of accuracy, and with transparent mechanisms that detect and react to abuse.

Information resources

Questions to promote discussion

Group 1 What is an Agent, Who are the Actors, What do they Expect?

  1. Who (or what entities) need(s) to be attributed?
  2. How do we uniquely identify agents (= people, organizations, software) responsible for executing work?
  3. What strategies should be employed to locally disambiguate “strings” to “things” and then share these unique identifiers for agents?

Group 2 What is a Unit of Work Worthy of Credit?

  1. What are the activities pre-, during, and post-transcription of specimen labels that constitute work that ought to be attributed?
  2. Are some activities more reflective of expertise and should be weighted more than others?
  3. What lines of evidence are appropriate and sufficient to correctly attribute work? i.e. How do we trust attributions when these might be created on others’ behalf?
  4. How do we attribute, or provide credit, to the agents responsible for linking entities and bootstrapping the knowledge graph?

Group 3 How do We Measure FAIRly?

  1. What measures should be taken to safeguard against misattribution?
  2. How do we ensure that propagated attribution, wherever these land, can be amended or corrected?
  3. What standards exist in other domains that store and share attribution and how are these implemented?
  4. What standards are missing or require adjustment to best store and share attribution data?
  5. What are the sociological/ethical/legal pitfalls we need to be sensitive to?

Group 4 How do We Make a Roundtrip for the Attributions?

  1. What are the drivers for attributing work and how might meeting these needs contribute to the long-term sustainability of collections or other producers of primary biodiversity data?
  2. What new communities of stakeholders might benefit from access to biodiversity data that includes tokens of attribution?
  3. What metrics of reuse do we need to connect back to collections to allow for advocacy and attribution, and who/what is responsible for gathering those metrics or defining their structure? (GenBank sequences, publications, ??)
  4. What technologies do we need in order to make these connections and to supplement or enhance some of the social mechanisms currently in use locally in collections, nationally, and internationally?

May I just advertise the TDWG Attribution Interest Group (Attribution - TDWG)?
It is actively working on several of these questions, including standards and disambiguation.

Group 2 What is a Unit of Work Worthy of Credit?

One missing, but perhaps important, question is whether we want to use specimen attribution for credit, for example in performance evaluations. Wouldn't it require a much greater level of formalization of the process? Wouldn't there also be similar problems to those with publication metrics? Do we want to credit people for the number, quality, or impact of their specimens?

From my perspective, I think performance metrics on specimens are useful, but, much like the Google Scholar h-index displayed to users, they can be more a tool for scientists to track their own development than a tool for judging them.

Hi @qgroom, I second this, and I already see an example, in the Bionomia-to-Zenodo option, of how scientists can effectively make their collection/identification efforts more discoverable (FAIR) so that they benefit, and others do too.

As to “new communities” of stakeholders, I think some of our known communities will be “new” to getting credit.

  • In this question, I see another one. What professional routes and career options do we offer people who create/curate/manage collections and the related digital resources? We need to address this, or else we end up with collections (digital and physical) without the human infrastructure to maintain them and benefit from them. Right now, it can be very economically challenging for collections staff. Some of this credit needs to be reflected in their livelihoods. They need to "benefit."
  • the collections themselves (as they’ll have better awareness of what they have, what’s being accomplished with what they have, and who is doing the work)
  • discoverability (FAIR) opens unexpected doors. One example community might be the movie and media industry. We've often heard about a TV show, movie, or print ad putting the wrong species in a photo or scene (supposed to be a bee, but they've used a fly, for example, or the host-association used is improbable or wrong). So a) having host-association data freely available, and b) having it linked unambiguously to who did the work (and via automated systems), increases the likelihood of Hollywood getting it right, and of the source being recognized and knowing their data were used.
  • Being able to discover the “who” also opens up possibilities for new collaborations with groups doing mitigation / prevention for large scale issues (see WRI International, as an example).
  • Knowing who also gets us much closer to sharing our much-needed human stories for “why do this work?”

I note here that with some of these ideas above I've jumped into the "drivers" question #13.

What are the drivers for attributing work and how might meeting these needs contribute to the long-term sustainability of collections or other producers of primary biodiversity data?

Yes! It is my understanding that this IG is carrying on the work of the RDA WG, which is published in the 2019 paper listed above.

Just for reference…
Some of these issues were discussed in the context of the RDA WG. The folder with all of the materials for this WG is here.

Another potential attribution model is the Contributor Attribution Model (CAM).

If we focus on the “data attribution” aspect a bit closer, the question of where to publish or share the attribution information is also important to address. This also relates to @Debbie’s point about professional routes and career options.

Recently, the idea of the "data paper" has been gaining traction. These papers differ from research papers in that they describe data rather than report research findings. Even in "traditional" research papers, we are seeing more and more data elements (oftentimes links to CSV or Excel files).

Some references here. Chavan and Penev (2011) introduced the idea of the "biodiversity data paper." Even though they describe a mechanism that can "offer scholarly recognition for efforts and investment by data publishers in authoring rich metadata and publishing them as citable academic papers," I think it can also apply to, for example, specimen preparation and annotation (similar to what is described in the RDA/TDWG Working Group paper). These activities can be recognised and attributed as data events, as proposed in the 2019 paper by Li, K., Greenberg, J. and Dunic, J. An example of a data event mentioned in the paper: "In these cases, the geospatial information has been checked and herbarium specimens have been reviewed to confirm taxonomic identification."

I think a historical look would help us as well.

Modern scientific breakthroughs are marred by failures to give proper credit to the appropriate people, and oftentimes those people are marginalised (we can find examples from Ada Lovelace to Rosalind Franklin). And close to our domain,
"telling the truth about who really collected the hero collections", about a Sarawak teenager named Ali who collaborated with Wallace.

With respect, data papers are more than just gaining traction: our community is and has been at the centre of this. See the GBIF-mediated peer-reviewed data papers and the recent special West of Urals issue of BDJ, for example.

If we identify all the aspects of the work around specimens (curation, data wrangling, etc.) and treat them as parts of a dynamic individual nanopublication that is steadily updated, corrected, and reissued, it could go a long way toward modelling and supporting a potential future for scholarly and academic credit systems. Representing these actions in the data then functions as a highly transparent way of documenting them, which could, of course, then be tied back to individual researchers' PIDs.

Can we examine the premise that providing access to something (a specimen or data related to it) is an act performed with the expectation that tangible goods will be received in return? I don't question the need for incentives, but at the time access is provided, attribution lies in the future and is of unknown value. It's understandable that many researchers and institutions would consider it weak tea.

Our recent report, "Economic Analyses of Federal Scientific Collections: Methods for Documenting Costs and Benefits," describes a "virtuous cycle" (with an example from a collection of rock cores) that collections and their users can create. Users must agree to report publications and any release of data based on the items they receive, as well as copies of the data and any subsamples they prepare. The agreement is the cost of access, in lieu of a user fee. These goods (or links to them) are then integrated into the physical and/or IT resources of the collection. Both parties exchange access with the expectation of future reward.

Here’s the virtuous cycle: The collection gives access to the user, the user promises to provide results as payment for access, and the collection increases the value of the items used by rendering them more discoverable through publications and public trait data. The results for both the collection and the user are increased re-use, citations and requests for access to the item. Users are also asked to estimate the value of the time and effort they invested in analyses of the item. These investments are then reported as valued added to the collection each year.

There is another metric suggested in our report that isn't used much, if at all. Collections report how many items are requested each year, but they fail to display these requests relative to how long ago the items were made accessible to users. Our early efforts suggest that the age distributions of requested items match the overall age distributions of collections, suggesting that items hold their value over time. If they didn't, requests would be skewed toward recently accessioned items.

The point of "curation" reminds me of a short commentary in Taxon several years ago that stuck with me, which I'm not sure has been taken up: that is, attribute not only those who collect, but also those individuals who prepare the specimens.

Ghahremaninejad, F., & Hoseini, E. (2016). Herbarium specimen labels: A missed opportunity. Taxon, 65(3), 685. doi:10.2307/taxon.65.3.685. See https://www.researchgate.net/publication/304424925_Herbarium_Specimen_Labels_a_Missed_Opportunity

What else should be attributed and what information is needed to do so, both for recognition’s sake and for the sake of data quality/transparency/repeatability/use/etc?

This relates to:

Question 5 matters for Attribution, but in the case of data transparency, the expertise needed is less relevant. E.g., in the case of specimen mounting, it may actually be useful, many years later, to know who mounted the specimen and when (for forensics such as: what mounting material was used? was that preparator prone to certain methods of preparation? etc.)

In the realized extended/digital specimen future, I wonder about new annotations of (often highly intensive) phenotypic data derived from specimens. Should the scientists who contributed to the scoring of thousands of herbarium specimen phenologies, or to the measurement of leaf morphometrics, be cited by name or otherwise acknowledged when these specimen-derived data are reused in downstream publications? For instance, in the case of the global plant trait database TRY, data contributors were once allowed to request that users of their data contact them first. I believe this has since changed to all datasets being open access, but with attribution required.

Absolutely right. There's a TDWG working group digging into roles; those involved can identify themselves, right @dshorthouse? :relaxed:

The attribution extension to Darwin Core that is currently being developed recognizes various actions, including attribution for who prepared the specimen, who preserved it, and who performed measurements on it. The big problem is the availability of this information: preparing and preserving actions are rarely logged in our databases, and measurements may be documented in publications, but it is often not straightforward to connect these back to the specimens.
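
To make the shape of such records concrete, here is a minimal sketch of how an agent-plus-role attribution record could be serialized. The field names are illustrative placeholders, not the actual terms of the Darwin Core attribution extension, and all identifiers are made up:

```python
import json

def make_attribution(occurrence_id: str, agent_pid: str, role: str, date: str) -> dict:
    """Build one agent-role record for a specimen action (illustrative shape only)."""
    return {
        "occurrenceID": occurrence_id,  # the specimen record being attributed
        "agentID": agent_pid,           # PID of the person/org/software (e.g. an ORCID iD)
        "agentRole": role,              # e.g. "preparator", "preserver", "measuredBy"
        "actionDate": date,             # when the action took place
    }

# One specimen, two attributed actions by two different agents.
records = [
    make_attribution("urn:catalog:EX:12345", "https://example.org/agent/0000-0001",
                     "preparator", "1998-06-14"),
    make_attribution("urn:catalog:EX:12345", "https://example.org/agent/0000-0002",
                     "measuredBy", "2021-03-02"),
]
print(json.dumps(records, indent=2))
```

The design point is that the role travels with the agent identifier, so the same person can be credited differently for preparing one specimen and measuring another.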

Is there a world where, with a global registry of collections, authors of publications are required to explicitly list, in a standardized and accessible format, all the collections that they used in their work, just as most publishers currently list funding support as a standardized part of article metadata? This information would then be included in the bibliographic metadata of the articles themselves.
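
As a sketch of what such article metadata could look like, modeled on how publishers already structure funder information (the field names, DOI, registry URL, and counts below are all hypothetical placeholders):

```python
import json

# Hypothetical "collections used" block in an article's bibliographic metadata,
# analogous to a funding block; the collection identifier would point at a
# global registry such as GRSciColl. All values here are made up.
article_metadata = {
    "doi": "10.0000/example.2024.001",
    "collectionsUsed": [
        {
            "collectionID": "https://example.org/registry/abc-123",
            "collectionName": "Example University Herbarium",
            "specimensCited": 42,
        },
    ],
}

print(json.dumps(article_metadata, indent=2))
```

If publishers emitted something of this shape, collections could harvest it mechanically, in the same way funders already track acknowledgments today.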

The DOIs generated for GBIF downloads are a great example and would most certainly be part of future solutions. But what if authors didn't use GBIF? :scream:

Some have argued that collections even deserve co-authorship, which I disagree with, on the simple basis that authors cannot be entities, and collections in the public trust can expect acknowledgment but not forced collaboration. Still, it is an interesting argument.

I guess my real question is - what could/should journal publishers be doing to facilitate attribution?