Thanks for the questions, @jhpoelen.
The environment we run this on is basically set up for Hive, Spark and HBase (to store the output).
If your intentions are exploratory, the easiest input would likely be one of the monthly cloud drops on AWS or the Planetary Computer (Azure). They don’t currently contain the CC-BY-NC data, though (discussed here).
The easiest way to run this might be to use the stripped-down version I created for a recent hackathon. It only needs an input table from Hive and a Spark cluster, with the output dropped back to Hive. During the hackathon we used a Databricks cluster running on Azure, and it was fairly easy.
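For anyone wanting a feel for the shape of that stripped-down setup, here is a rough PySpark sketch of the "Hive table in, Spark job, Hive table out" flow. It is not the hackathon code: the function name `run_job`, the table names and the identity `transform` are all placeholders I made up for illustration, and the real processing step would go where `transform` is.

```python
def run_job(spark, input_table, output_table, transform):
    """Read a Hive table, apply a transform, and write the result back to Hive.

    `spark` is a SparkSession with Hive support enabled; `transform` is a
    placeholder for the actual processing (hypothetical, not the real job).
    """
    source = spark.table(input_table)   # read the Hive input table as a DataFrame
    result = transform(source)          # the real processing step would go here
    # Drop the output back into Hive, replacing any earlier run.
    result.write.mode("overwrite").saveAsTable(output_table)
    return result


def main():
    # Imported here so the sketch can be read and tested without a cluster.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-in-hive-out-sketch")  # arbitrary name
        .enableHiveSupport()
        .getOrCreate()
    )
    # Table names are assumptions; the identity lambda stands in for the job.
    run_job(spark, "my_db.input_table", "my_db.output_table", lambda df: df)
```

On a cluster you would call `main()` from a notebook cell or a submitted script. On Databricks a session is already available as `spark`, so calling `run_job(spark, ...)` directly should be enough.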
This is all fairly experimental at the moment, and the outputs are not downloadable as a file.
Best of luck. Contributions are welcome of course.