Understanding the API output of dataset/{key}/document and optimizing the related API calls

I’m currently working with a subset of occurrences and I’m also interested in their related metadata. The dataset/{key} endpoint of the PrincipalMethods Registry API does not return all the information I’m interested in, so I’m fetching the dataset/{key}/document endpoint directly to get the complete EML in XML format.

The ‘problem’ I’m facing is that I also want the information on who published the dataset.
As far as I can tell, the dataset/{key}/document endpoint currently returns no information about the publisher (is this correct?).

What I’m currently doing is this: I fetch the whole EML in XML format, then query the dataset/{key} endpoint, which returns the publishingOrganizationKey. With that key I fetch the organization/{key} endpoint of the PrincipalMethods Registry API, reformat the publisher information as a responsibleParty object, and insert it into the EML.

This results in 3 API calls (one for the whole EML as XML, one for the publishingOrganizationKey, and one for the publisher information) for one dataset.

I have two questions:

1- Would there be a more optimal way (fewer API calls) to achieve the same end result?
2- In the EML schema 2.2.0, the dataset element has a sub-element(?) for publisher. I’m sure there is a reason why it is not included in the EML returned as XML from the dataset/{key}/document endpoint, but I’m curious to know what that reason is.

Thank you!

I have done this by placing a second call to /organization/{key}. I wonder if you could get around this by reading the EML straight from the source archive.

Thank you for the reply @pieter!

I’m not sure I understand what you mean by the second call to /organization/{key}.
Also, isn’t the endpoint /dataset/{key}/document the source archive for the EML of a GBIF dataset?

Just to clarify my current process in 4 steps:

  1. Fetch the EML in XML format from the dataset/{key}/document endpoint (it does not include any information about the publisher, to my knowledge).
  2. Fetch the publishingOrganizationKey from the dataset/{key} endpoint.
  3. Fetch the publisher information from the organization/{key} endpoint, using the publishingOrganizationKey.
  4. Format the publisher information as a responsibleParty object and insert it into the EML from step 1.
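For reference, step 4 could be sketched in Python roughly as below. This is only a minimal sketch under several assumptions: the organization JSON is assumed to expose `title`, `city`, `country`, `email` (a list) and `homepage` (a list); the function names are made up for illustration; and the new `publisher` element is placed right after the last `contact` element, which is where I understand EML 2.2.0 expects it (appended at the end if there is no `contact`).

```python
import xml.etree.ElementTree as ET


def organization_to_publisher(org: dict) -> ET.Element:
    """Build a <publisher> element (responsibleParty-style) from organization JSON.

    The JSON field names used here are assumptions about the organization response.
    """
    publisher = ET.Element("publisher")
    ET.SubElement(publisher, "organizationName").text = org.get("title", "")

    # Only emit <address> when at least one address field is present.
    address_fields = {"city": org.get("city"), "country": org.get("country")}
    if any(address_fields.values()):
        address = ET.SubElement(publisher, "address")
        for tag, value in address_fields.items():
            if value:
                ET.SubElement(address, tag).text = value

    for email in org.get("email", []):
        ET.SubElement(publisher, "electronicMailAddress").text = email
    for url in org.get("homepage", []):
        ET.SubElement(publisher, "onlineUrl").text = url
    return publisher


def insert_publisher(eml_xml: str, org: dict) -> str:
    """Return the EML document with a <publisher> element added to <dataset>."""
    root = ET.fromstring(eml_xml)
    dataset = root.find("dataset")
    if dataset is None:
        raise ValueError("no <dataset> element found in the EML document")

    # Assumed position: directly after the last <contact>, else at the end.
    children = list(dataset)
    contact_positions = [i for i, child in enumerate(children) if child.tag == "contact"]
    position = contact_positions[-1] + 1 if contact_positions else len(children)
    dataset.insert(position, organization_to_publisher(org))
    return ET.tostring(root, encoding="unicode")
```

One caveat: `xml.etree.ElementTree` may rewrite namespace prefixes on output (for example `eml:` becoming `ns0:`) unless they are registered with `ET.register_namespace`, so the result is equivalent XML but not byte-identical to the input.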

I think I misunderstood your initial problem. I was suggesting two calls: one to /dataset/{key} and one to /organization/{key}.

What you could do is create a local lookup table of all publisher metadata. You should be able to do that in 4 calls (you can set the limit to 1000).

The issue is then reduced to getting the publishingOrganizationKey from the source EML, which might be problematic, as it makes sense that this key is minted by GBIF rather than being present in the source EML.

So by caching the organisation metadata we can get it down to 2 calls per dataset + 4 initial calls for the organisation metadata.
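A rough sketch of that caching idea in Python, in case it helps. Several things here are assumptions on my part rather than confirmed details: the base URL, the `limit`/`offset` paging parameters, and a paged response shaped like `{"results": [...], "endOfRecords": bool}`. The function names are hypothetical, and `fetch` is passed in so the logic can be tried without hitting the network.

```python
import json
from typing import Callable
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.gbif.org/v1"  # assumed base URL, adjust as needed


def fetch_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def build_organization_lookup(
    fetch: Callable[[str], dict] = fetch_json, limit: int = 1000
) -> dict:
    """Page through the organization endpoint and index the results by key."""
    lookup = {}
    offset = 0
    while True:
        query = urlencode({"limit": limit, "offset": offset})
        page = fetch(f"{BASE_URL}/organization?{query}")
        results = page.get("results", [])
        for org in results:
            lookup[org["key"]] = org
        # Stop when the API reports the last page (assumed field) or returns nothing.
        if page.get("endOfRecords", True) or not results:
            break
        offset += limit
    return lookup


def publisher_for_dataset(
    dataset_key: str, lookup: dict, fetch: Callable[[str], dict] = fetch_json
):
    """One call per dataset for the key, then a local lookup instead of a third call."""
    dataset = fetch(f"{BASE_URL}/dataset/{dataset_key}")
    return lookup.get(dataset["publishingOrganizationKey"])
```

The lookup only needs to be built once per session; after that each dataset costs the /dataset/{key}/document call plus the /dataset/{key} call.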

Alternatively, you could try to match the source EML to the organisation metadata on a different field, at the risk of collisions. You might get away with organizationName, maybe paired with some other fields. But you are at the mercy of the quality of the provided EML.
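To illustrate the name-matching idea, here is a small hedged sketch that refuses to guess when two organisations collide on the same normalised name. It assumes the organisation records carry their name in a `title` field; the helper names are hypothetical, and the normalisation (case-folding plus whitespace collapsing) is just one possible choice.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.casefold().split())


def index_by_name(organizations: list) -> dict:
    """Group organisation records by normalised name (assumed to live in 'title')."""
    index = defaultdict(list)
    for org in organizations:
        index[normalize(org.get("title", ""))].append(org)
    return index


def match_publisher(organization_name: str, index: dict):
    """Return the single organisation with this name, or None if absent or ambiguous."""
    candidates = index.get(normalize(organization_name), [])
    return candidates[0] if len(candidates) == 1 else None
```

Returning `None` on ambiguity means those datasets would still need the extra /dataset/{key} call, but it avoids silently attaching the wrong publisher.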