
GDPR and Big Data

How the GDPR affects Big Data Analytics

The main objective of the GDPR (the EU's General Data Protection Regulation, which governs the processing of personal data) is to give individuals in the EU back control over their personal data. Key points of the legislation include the need for a lawful basis, such as the individual's consent, for using personal data, the right of individuals to have their personal data erased, and the obligation on companies and other parties to notify those affected in the event of a data breach. Violations can carry hefty sanctions: fines of up to €20,000,000 or up to 4% of a company's annual worldwide turnover, whichever is higher. These fines represent a significant financial risk for companies.
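To make the financial exposure concrete, the ceiling for the upper tier of fines can be worked out as the higher of the two figures (Article 83(5) GDPR, based on the preceding financial year's turnover). The following is a minimal sketch; the function and constant names are chosen for illustration, and the actual fine in any given case is set by the supervisory authority and may be far lower.

```python
# Upper-tier ceiling under Article 83(5) GDPR: the higher of a fixed amount
# and a share of total worldwide annual turnover.
FIXED_CAP_EUR = 20_000_000
TURNOVER_SHARE_PERCENT = 4


def max_fine_eur(annual_worldwide_turnover_eur: int) -> int:
    """Return the maximum possible fine in euros (illustrative helper)."""
    # Integer arithmetic avoids floating-point rounding on large amounts.
    turnover_based = annual_worldwide_turnover_eur * TURNOVER_SHARE_PERCENT // 100
    return max(FIXED_CAP_EUR, turnover_based)


# Turnover of 100 million: 4% is 4 million, so the fixed cap applies.
print(max_fine_eur(100_000_000))
# Turnover of 2 billion: 4% is 80 million, which exceeds the fixed cap.
print(max_fine_eur(2_000_000_000))
```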

Against this backdrop, managers and their analytical systems face technical, functional and organisational challenges such as:

  • Clarification of what constitutes personal data
  • 'Privacy by design' and 'privacy by default'
  • Pseudonymisation and anonymisation
  • Data quality as required by the GDPR

GDPR and data science

In practice, the GDPR influences data science and data warehousing in four areas. Firstly, the GDPR sets tighter limits on the processing of personal data and the creation of consumer profiles. Secondly, companies that use automated decision-making technologies must give consumers a "right to explanation" regarding their practices and activities. Thirdly, the GDPR holds companies responsible for any bias or discrimination in their automated decision-making processes. Fourthly, companies must bear in mind that existing analyses using personal data may become unlawful once the GDPR applies.

A lot needs to be done

Companies will have to examine the data they collect with regard to its GDPR impact, implement compliance procedures, evaluate how they process information and much more. At it-novum, we take a detailed look at the GDPR regulations and develop GDPR-compliant solutions using Pentaho, the data integration and analytics platform from Hitachi Vantara.


Potential solutions with Pentaho and Cloudera

Cloudera Navigator includes a metadata repository that allows additional tags (e.g. "IMPORTANT") to be attached to any table, file or directory. These tags can then be searched for and displayed in the data lineage.

Once the relevant data has been tagged with metadata in Cloudera Navigator, every location that holds GDPR-relevant data is known, as is the way that data is processed further downstream.

In the event of a data loss (e.g. following a hacker attack), this information makes it easier to inform the affected parties, because the affected data can be identified. It also allows for clear governance, since the number of users able to access the data can be restricted. In addition, Navigator's audit functions can help to greatly limit the extent of any actual data loss.

Both the creation of the metadata tags and their transmission to Cloudera Navigator are handled by Pentaho. To do this, Pentaho calls the appropriate Navigator API endpoint and transmits the metadata tag. The matching tag can be determined automatically, for example by using a pattern recognition procedure.
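The pattern recognition step can be illustrated with a small, self-contained sketch. It is not the procedure used in Pentaho; the regular expressions, the tag names (GDPR_EMAIL and so on) and the 80% threshold are all assumptions made for illustration. The idea is to sample values from a column and suggest a tag when most of them match a known pattern for personal data. The resulting tags would then be sent to the Navigator API, which is not shown here.

```python
import re

# Hypothetical patterns and tag names; a real deployment would need a vetted,
# locale-aware rule set rather than these simplified expressions.
PATTERNS = {
    "GDPR_EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "GDPR_PHONE": re.compile(r"^\+?[0-9][0-9 ()/-]{6,}$"),
    "GDPR_IBAN": re.compile(r"^[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}$"),
}


def suggest_tags(sample_values, threshold=0.8):
    """Suggest metadata tags for a column based on a sample of its values.

    A tag is suggested when at least `threshold` of the non-empty sample
    values match the corresponding pattern.
    """
    values = [v.strip() for v in sample_values if v and v.strip()]
    if not values:
        return []
    tags = []
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            tags.append(tag)
    return tags


print(suggest_tags(["anna@example.com", "bob@example.org"]))
print(suggest_tags(["blue", "green", "red"]))
```

Working on a sample with a threshold rather than requiring every value to match makes the check tolerant of a few dirty records, at the cost of occasional false positives that a reviewer would need to confirm.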

Pseudonymisation and anonymisation

Companies should ensure that access to their customers' personal information is restricted. If robust anonymisation is in place, analysts cannot access personal information by default. An exception process can then be defined that grants access to personal data, with appropriate safeguards, in exceptional cases.

Another way to work with personal data without contravening the strict requirements of the GDPR is to carry out the analyses on pseudonymised data. Technically, such pseudonymisation (i.e. Pentaho replacing the clear-text name with a pseudonym) can be implemented as early as the ingestion phase, when the data enters the data lake.
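How the pseudonyms are generated is not specified here. One common technique is a keyed hash (HMAC), sketched below under that assumption: the same input always yields the same pseudonym, so joins and aggregations still work, while the clear-text value cannot be recomputed without the secret key. The function names and the record layout are hypothetical.

```python
import hashlib
import hmac


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Replace a clear-text value with a deterministic keyed-hash pseudonym."""
    # Normalise so that "Anna Meier" and " anna meier " map to the same pseudonym.
    normalised = value.strip().lower()
    digest = hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def pseudonymise_record(record: dict, fields: set, secret_key: bytes) -> dict:
    """Return a copy of the record with the listed fields pseudonymised."""
    return {
        key: pseudonymise(value, secret_key)
        if key in fields and value is not None
        else value
        for key, value in record.items()
    }


# Illustrative key only; in practice the key is stored separately from the data lake.
key = b"example-key-kept-outside-the-data-lake"
row = {"name": "Anna Meier", "city": "Fulda", "order_total": 42}
print(pseudonymise_record(row, {"name"}, key))
```

Because the mapping can be reproduced by anyone holding the key, the output remains personal data under the GDPR. Access to the key therefore needs the same protection as the clear-text values.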

Right to be forgotten

You can implement a process to address customer questions and concerns about automated decisions. If, for example, a customer submits a request to have their personal data deleted, the first step is to determine exactly where this information is stored. This is done by searching for the metadata tag in Navigator and following the data lineage, then passing the resulting locations to a purpose-built Pentaho ETL job. This job then deletes the data from all relevant processing stages.
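The two steps, locating tagged data and then deleting it everywhere, can be sketched with an in-memory stand-in for the metadata catalogue. This is an illustration only: the catalogue structure, the "GDPR" tag name and the customer_id column are assumptions, and in the setup described above the lookup would go through Navigator and the deletion through a Pentaho job.

```python
def find_tagged_datasets(catalogue, tag):
    """Return the names of all datasets that carry the given metadata tag."""
    return [name for name, entry in catalogue.items() if tag in entry["tags"]]


def erase_customer(catalogue, tag, customer_id):
    """Delete a customer's rows from every tagged dataset.

    Returns the number of rows removed per dataset, which can serve as
    evidence that the erasure request was carried out.
    """
    removed = {}
    for name in find_tagged_datasets(catalogue, tag):
        rows = catalogue[name]["rows"]
        kept = [row for row in rows if row.get("customer_id") != customer_id]
        removed[name] = len(rows) - len(kept)
        catalogue[name]["rows"] = kept
    return removed


# Hypothetical catalogue: two tagged datasets and one untagged dataset.
catalogue = {
    "crm_raw": {"tags": {"GDPR"}, "rows": [{"customer_id": 1}, {"customer_id": 2}]},
    "crm_clean": {"tags": {"GDPR"}, "rows": [{"customer_id": 1}]},
    "weather": {"tags": set(), "rows": [{"station": "A"}]},
}
print(erase_customer(catalogue, "GDPR", 1))
```

The sketch only reaches datasets that carry the tag, which is why consistent tagging across all processing stages matters for this approach.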

ETL in the data lake

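Data in the data lake usually passes through a variety of processing steps and takes different forms along the way. It is therefore important to ensure that the metadata tags are properly maintained at each step. If a step no longer carries any personal data in its output, the GDPR tag can be removed from that output.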
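Tag maintenance across a processing step can be expressed as a simple rule: the output keeps the GDPR tag only if at least one personal column survives the step. The sketch below assumes that personal columns are identified by name and that the tag is called "GDPR"; both are illustrative choices. A name-based check does not catch columns derived from personal data, so a real pipeline would also need to follow column-level lineage.

```python
def output_needs_gdpr_tag(output_columns, personal_columns):
    """Return True if any personal column is still present in the step's output."""
    return any(column in personal_columns for column in output_columns)


def update_tags(existing_tags, output_columns, personal_columns, gdpr_tag="GDPR"):
    """Return the tag set for a step's output, adding or removing the GDPR tag."""
    tags = set(existing_tags)
    if output_needs_gdpr_tag(output_columns, personal_columns):
        tags.add(gdpr_tag)
    else:
        tags.discard(gdpr_tag)
    return tags


personal = {"name", "email"}
# An aggregation step that drops all personal columns loses the tag.
print(update_tags({"GDPR", "SALES"}, ["region", "revenue"], personal))
# A cleansing step that keeps the email column retains it.
print(update_tags({"SALES"}, ["email", "revenue"], personal))
```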

For predictive models that use personal data, it should be clearly established whether this data is actually necessary for the analysis and whether it adds any real informational value. One example of a permissible use of personal data in a predictive model is the prevention of money laundering. Finally, a review and acceptance process should be defined for customer-oriented predictive models, and it should be independent of the model developers.

GDPR reporting

Companies must provide all stakeholders (employees, subsidiaries, customers and auditors) with information on their compliance status as well as progress reports. They must be able to prove to auditors and certification bodies that their processing is legally compliant and that they meet their disclosure obligations towards those whose data they hold.

All this information must be provided promptly and in a clearly understandable way. Combining Cloudera Navigator with Pentaho helps to meet these reporting requirements.


The information contained in this article is not to be understood as legal advice and should not be interpreted as such. Companies subject to the GDPR should not rely on the information contained herein and should seek legal advice from their own legal counsel or another professional legal services provider.