Who said using privacy-enhancing technologies can't be enjoyable? Last month we turned it into fun and games with great collaborators from Europe and the US.
If you are reading this, you have probably heard people talking about privacy and privacy-enhancing technologies (PETs). However, these new technologies can seem intimidating and overcomplicated. Most data scientists and developers already have enough on their plates managing the tools in their stack, without also tackling problems that involve multiple stakeholders and demand a complete change of workflow. But is that really so? Can you simply start using PETs straight away, much like any other tooling? And given the tooling, how do you tackle their main challenge, namely balancing privacy and accuracy in data science? Last week, Oblivious had the pleasure of co-running a hackathon focused on exactly these questions.
This first-of-its-kind hackathon was organised by the Human-Centred AI Master’s Programme and CeADAR – Ireland’s Centre for Applied AI – and run at four universities across Europe: in Dublin, Naples, Budapest and Utrecht.
The participants had little or no prior exposure to PETs. The cool thing about the hackathon was that, within a couple of hours, they were able to combine state-of-the-art privacy-enhancing technologies to build machine learning models that respect both input and output privacy.
To quickly recap: input privacy is about running a computation on joint inputs from multiple parties without the parties revealing their information to each other. In the hackathon, this meant the participants were given only part of the dataset, while the other part was kept inside secure enclaves. The two parts were joinable by a common ID column, and the computation could be run only within the enclaves, without the participants ever seeing the other input. The whole process ran smoothly through Oblivious tooling, with the participants just running a couple of scripts. Once the computation finished, its output could be shared with a user. The catch in privacy-preserving machine learning, however, is that we also want to make sure users cannot reverse engineer the original inputs from those outputs. And that’s where output privacy comes in.
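To make the setup concrete, here is a minimal sketch of the data split described above, with entirely hypothetical column names (the hackathon's real schema is not shown here). Each party holds one half of the records, and only inside the enclave are the halves joined on the shared ID column before any computation runs:

```python
import pandas as pd

# The participants' local share of the dataset (hypothetical columns):
local_part = pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, 29]})

# The share held inside the enclave, never seen directly by participants:
enclave_part = pd.DataFrame({"id": [1, 2, 3], "income": [42000, 58000, 37000]})

# Inside the enclave, the two halves are joined on the common ID column,
# and the computation then runs on the combined records.
joined = local_part.merge(enclave_part, on="id")
print(joined.columns.tolist())  # ['id', 'age', 'income']
```

Neither party ever sees the other's columns in the clear; only the enclave holds the joined view.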
To ensure output privacy, the participants used differential privacy via a great open source library, OpenDP SmartNoise. Differentially private mechanisms work by adding noise to the output of a computation. The noise is parametrised by epsilon: the larger the epsilon, the smaller the noise and hence the more accurate the answers, but also the larger the privacy leakage. What’s more, epsilons add up over multiple queries, forming the total privacy budget.
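The classic mechanism behind this idea is Laplace noise scaled to sensitivity/epsilon. The sketch below implements it from scratch with NumPy for illustration rather than showing the SmartNoise API itself; the function name and parameters are our own:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value with Laplace noise of scale sensitivity/epsilon."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# A counting query has sensitivity 1: adding or removing one person
# changes the count by at most 1.
true_count = 1000
rng = np.random.default_rng(0)

# Larger epsilon -> smaller noise scale -> more accurate answer,
# but a larger privacy leakage.
for eps in [0.1, 1.0, 10.0]:
    noisy = laplace_mechanism(true_count, sensitivity=1, epsilon=eps, rng=rng)
    print(f"epsilon={eps}: noisy count = {noisy:.1f}")
```

Running this a few times makes the trade-off tangible: at epsilon 0.1 the answer can be off by tens, while at epsilon 10 it is usually within a fraction of a count.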
Hence, the real challenge is building accurate models while respecting privacy, and that is what the hackathon was about. The participants could run SQL queries with differentially private answers and use differentially private synthetic data, each costing a chosen epsilon, to build their models. They were then evaluated by a score that took into account both the accuracy of their models on the test data and their total privacy budget.
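Because each query spends part of the budget, participants effectively had to keep a running ledger. A minimal sketch of such additive epsilon accounting, under the basic composition rule described above (this is our own illustration, not the hackathon's actual accounting tool):

```python
class PrivacyBudget:
    """Track epsilon spent across differentially private queries."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Record the cost of one DP query; epsilons compose additively."""
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

    @property
    def remaining(self):
        return self.total - self.spent

budget = PrivacyBudget(total_epsilon=3.0)
budget.charge(1.0)   # e.g. one DP SQL query
budget.charge(1.5)   # e.g. generating DP synthetic data
print(budget.spent, budget.remaining)  # 2.5 0.5
```

Spending less epsilon per query leaves room for more queries, but makes each answer noisier; that tension is exactly what the scoring rewarded.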
In under four hours, the winners achieved an accuracy above 70% with a reasonable epsilon of 3.
That trade-off between privacy and accuracy (and, more generally, the utility of any data processing) is a problem that every organisation dealing with data now faces, with privacy challenges growing more prevalent and data breaches more frequent. It cannot be solved without PETs, and without accessible tooling for data scientists and developers.
How to get involved?
We will be running more hackathons in the near future. If you want to get involved, either as a co-organiser at your organisation or as a participant, do reach out to us at firstname.lastname@example.org. You can also try out the tooling yourself by signing up for the free version of our product.