Data Anonymizing in Hadoop: A TED Case Study

06/04/2015

by Beat Schwegler and Ivan Judson, Partner Catalyst Team

The Problem

Our customer uses an on-premises Cloudera cluster to archive and mine customer-related data. They wanted to use HDInsight for additional compute capacity and Microsoft Azure Machine Learning to create predictive models. However, their data contains PII (Personally Identifiable Information), and company policy prevented them from uploading such data to a public cloud.

One possible approach would be to simply remove all PII-related data from the dataset. However, this is not feasible because both the predictive models and their predictions relate to individual customers. We therefore explored an approach in which we anonymize data before it is sent to the cloud and de-anonymize it once the results are back on premises.

Overview of the Solution

We wanted to make the task of anonymizing and de-anonymizing feel like a native Hadoop capability. That’s why we decided to use Hadoop’s Pig and extend it with two user-defined functions: ANONYMIZE and DEANONYMIZE. This allows us to embed anonymizing and de-anonymizing in an on-premises Pig data transformation job, ensuring that no PII goes off premises.

Here is an example of anonymizing data (in this case the ownerId) as part of a Pig script:

A = LOAD 'data/xyz_device' USING PigStorage(';') AS (
    ownerId: chararray,
    specNr: chararray,
    senderId: chararray,
    deviceData: chararray);

B = FOREACH A GENERATE jaglion.ANONYMIZE(ownerId), specNr, senderId, deviceData;

jaglion.WASBSTORE B INTO 'data/devices/anonym/xyz_device' USING PigStorage(';');

And here is an example of de-anonymizing data:

A = jaglion.WASBLOAD 'data/results/anonym/xyz_result' USING PigStorage(';') AS (
    ownerId: chararray,
    resultData: chararray);

B = FOREACH A GENERATE jaglion.DEANONYMIZE(ownerId), resultData;

STORE B INTO 'data/xyz_result' USING PigStorage(';');

The anonymizing function uses Hadoop’s HBase as the persistent key/value store to store and retrieve the correlation between PII values and their anonymized ids. This HBase instance is part of the on-premises Hadoop cluster.
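To make the role of the key/value store concrete, here is a minimal Python sketch of the idea. It is not the code of the Java UDFs: the `CorrelationStore` class, its method names, and the use of random UUIDs as anonymized ids are assumptions made for illustration, and an in-memory dictionary stands in for the HBase table.

```python
import uuid


class CorrelationStore:
    """Hypothetical in-memory stand-in for the on-premises HBase table."""

    def __init__(self):
        # anonymized id -> original PII value; this mapping never leaves the cluster
        self._original_by_token = {}

    def put(self, original):
        """Record a new correlation and return the generated anonymized id."""
        token = uuid.uuid4().hex  # assumption: a random 32-character hex id
        self._original_by_token[token] = original
        return token

    def get(self, token):
        """Return the original value for an anonymized id, or None if unknown."""
        return self._original_by_token.get(token)


store = CorrelationStore()
token = store.put("owner-4711")  # only `token` would be sent off premises
```

Because the anonymized id is random rather than derived from the value, it reveals nothing about the original on its own; only a lookup against the on-premises store can reverse it.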

We implemented two different correlation modes for different levels of privacy:

High privacy mode

Each time the ANONYMIZE function is called, it returns a unique id. In this mode, two calls with the same customerId return two different anonymized ids. This ensures that off-premises data can’t be correlated across the anonymized dimension.

ANONYMIZE(customerId) != ANONYMIZE(customerId)

High privacy mode can be used for independent jobs such as pattern recognition, large-scale data transformation, or model calculations. However, it won’t be useful for jobs that gain insight by correlating multiple dependent data items.
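The behavior of high privacy mode can be sketched in a few lines of Python. This is an illustration under assumptions, not the UDF implementation: the function names `anonymize_high` and `deanonymize` are hypothetical, and a dictionary stands in for the on-premises HBase table.

```python
import uuid

# anonymized id -> original value (hypothetical stand-in for the HBase table)
correlations = {}


def anonymize_high(value):
    """Return a fresh anonymized id on every call, even for a repeated value."""
    token = uuid.uuid4().hex
    correlations[token] = value
    return token


def deanonymize(token):
    """Look up the original value for an anonymized id."""
    return correlations[token]


first = anonymize_high("customer-42")
second = anonymize_high("customer-42")

# The two ids differ, so off-premises data cannot link the two records,
# yet both still resolve to the same customer on premises.
ids_differ = first != second
same_customer = deanonymize(first) == deanonymize(second)
```

One consequence of this design is that the store grows with every call rather than with every distinct value, since each record gets its own entry.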

Medium privacy mode

The ANONYMIZE function always returns the same anonymized id for the same input value. This mode allows off-premises correlation across records that share an anonymized id.

ANONYMIZE(customerId) == ANONYMIZE(customerId)
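A minimal Python sketch of medium privacy mode follows. It assumes a second lookup from original value to anonymized id so that a repeated value reuses its existing id; whether the UDFs use a reverse lookup of this kind is an assumption, and the names `anonymize_medium`, `token_by_value`, and `value_by_token` are hypothetical.

```python
import uuid
from collections import Counter

# Both dictionaries stand in for on-premises HBase lookups (assumption).
token_by_value = {}  # original value -> anonymized id
value_by_token = {}  # anonymized id -> original value


def anonymize_medium(value):
    """Return the same anonymized id every time the same value is seen."""
    token = token_by_value.get(value)
    if token is None:
        token = uuid.uuid4().hex
        token_by_value[value] = token
        value_by_token[token] = value
    return token


events = ["customer-42", "customer-7", "customer-42"]
tokens = [anonymize_medium(v) for v in events]

# Off premises, records can still be grouped per anonymized id
# without knowing which customer is behind it.
events_per_id = Counter(tokens)
```

Here the first and third events share an id, so a count per anonymized id is possible in the cloud, which is exactly the kind of correlation that high privacy mode rules out.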

Code Artifacts

The repository for the two user-defined functions (UDFs) can be found here: https://github.com/irjudson/jaglion.

To use the UDFs, first register the Java package (JAR) and its dependencies with Pig’s REGISTER statement.

The two UDFs can now be used within Pig. Assume we loaded the data into A using a statement similar to:

A = LOAD 'testdata';

Once A is loaded, the statement below anonymizes the first column in A using medium privacy (the same value will always generate the same anonymized value):

B = FOREACH A GENERATE jaglion.ANONYMIZE($0, 0);

The following statement uses high privacy and generates a unique anonymized value for each value in A, regardless of whether the source values are the same:

C = FOREACH A GENERATE jaglion.ANONYMIZE($0, 1);

To de-anonymize, we simply call the DEANONYMIZE function:

D = FOREACH B GENERATE jaglion.DEANONYMIZE($0);

E = FOREACH C GENERATE jaglion.DEANONYMIZE($0);
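The round trip through both modes can be modeled in Python as follows. This is a sketch under assumptions rather than the UDF code: the `Anonymizer` class is hypothetical, the constants `MEDIUM = 0` and `HIGH = 1` mirror the second argument used in the Pig statements above, and dictionaries stand in for HBase.

```python
import uuid

MEDIUM, HIGH = 0, 1  # mirrors the second argument of ANONYMIZE in the Pig statements


class Anonymizer:
    """Hypothetical model of the two UDFs backed by in-memory lookups."""

    def __init__(self):
        self._value_by_token = {}  # anonymized id -> original value
        self._token_by_value = {}  # original value -> id, used by medium mode only

    def anonymize(self, value, mode):
        # Medium privacy: reuse the id already issued for this value, if any.
        if mode == MEDIUM and value in self._token_by_value:
            return self._token_by_value[value]
        token = uuid.uuid4().hex
        self._value_by_token[token] = value
        if mode == MEDIUM:
            self._token_by_value[value] = token
        return token

    def deanonymize(self, token):
        # The same lookup works for ids issued in either mode.
        return self._value_by_token[token]


anonymizer = Anonymizer()
a = ["alice", "bob", "alice"]
b = [anonymizer.anonymize(v, MEDIUM) for v in a]  # like B above
c = [anonymizer.anonymize(v, HIGH) for v in a]    # like C above
d = [anonymizer.deanonymize(t) for t in b]        # like D above
e = [anonymizer.deanonymize(t) for t in c]        # like E above
```

In this model, `d` and `e` both equal `a`, while only `b` preserves the fact that the first and third rows belong to the same person.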

Opportunities for Reuse

The anonymizing and de-anonymizing functions can be used in any Hadoop project that needs to remove PII data from a dataset, while still being able to correlate the anonymized data back to its original PII data.

They can be leveraged to enable hybrid cloud scenarios or to allow a dataset to be shared more broadly, either within the organization or among external partners.

Ivan R. Judson is a software engineer on the Partner Catalyst Team where he leverages deep R&D and startup experience to help partners (including startups!) solve their toughest problems. Passion for data and strong empathy for users are a common theme in his work, which he documents through talks at conferences and his blog at irjudson.org.

Beat Schwegler is a Principal Architect on the Partner Catalyst Team based in Switzerland. Beat enjoys working with organizations of all sizes to develop innovative solutions. He calls himself “head in the cloud – feet on the ground” and loves to share his experiences at conferences worldwide. He blogs at www.cloudbeatsch.com.
