Data Lake

Optimised for Big Data analytics

The data lake concept, coined by Pentaho founder James Dixon, describes an architecture for large-scale data analysis systems. A data lake is intended for querying and exploring data in the petabyte range, which demands very high processing throughput. Data stored in a data lake can subsequently be analysed with Hadoop-ecosystem technologies such as MapReduce, Spark, Tez and Hive.

A data lake does not prescribe a specific storage technology, only a set of storage requirements. Although data lakes are often equated with Hadoop – which is an excellent choice for many data lake workloads – they can in fact be built on a variety of technologies, such as NoSQL databases (HBase, MongoDB), object stores (Amazon S3) or relational database systems (RDBMS).
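To make the ingest path concrete, the following sketch (a hypothetical example, not part of the original article) drops raw IoT measurements into an S3-based data lake with the boto3 client; the bucket name and key prefix are placeholders.

```python
# Minimal sketch: ingest a raw event file into an S3-based data lake
# without any prior transformation. Bucket and key names are made up.
import json
import boto3

s3 = boto3.client("s3")  # assumes AWS credentials are configured locally

raw_events = [
    {"device_id": "sensor-42", "ts": "2024-05-01T12:00:00Z", "temp_c": 21.7},
    {"device_id": "sensor-43", "ts": "2024-05-01T12:00:05Z", "temp_c": 19.4},
]

# Store the data exactly as it arrives (newline-delimited JSON);
# no schema or table definition is needed at this point.
body = "\n".join(json.dumps(e) for e in raw_events).encode("utf-8")
s3.put_object(
    Bucket="my-data-lake",                      # placeholder bucket
    Key="raw/iot/2024/05/01/events-0001.json",  # partition-style prefix
    Body=body,
)
```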

A major benefit of data lake storage is that it can accept any data in its native format, without prior conversion. Specifically, this means it is not necessary to define a schema before the data is loaded. The schema is instead defined at analysis time, through the interpretation of the data. Unlike with traditional data warehouse approaches, the schema is only built when the data is actually read (schema on read). This allows for a high degree of analytical flexibility and substantially simplifies data ingest.
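A minimal schema-on-read sketch with PySpark, assuming raw JSON files like those from the ingest example above are reachable under an s3a:// path; all paths and field names are illustrative.

```python
# Schema on read: the JSON files in the lake were stored without any
# predefined schema; Spark derives one only at the moment they are read.
# Assumes the cluster has S3 credentials configured (hadoop-aws).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The schema is inferred from the raw files at read time.
events = spark.read.json("s3a://my-data-lake/raw/iot/2024/05/01/")
events.printSchema()

# The interpretation happens in the query, not during ingest.
events.groupBy("device_id").avg("temp_c").show()
```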

Your benefits

Data lake storage can handle a high number of write operations on small amounts of data at low latency. This makes it well suited to scenarios in which data must be processed in near real time and at the lowest possible cost – for example website analytics or data from Internet of Things (IoT) devices and sensors. Column-oriented and key-value NoSQL databases can also be integrated into data lakes, as the sketch below illustrates.
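As a hedged illustration of the key-value side, the sketch below writes individual sensor measurements into HBase via the happybase client. It assumes a running HBase Thrift server and an existing table; the table, column-family and field names are placeholders.

```python
# Hypothetical sketch: streaming small IoT measurements into HBase,
# a column-oriented key-value store that can sit behind a data lake.
# Assumes an HBase Thrift server on localhost and an existing table
# 'sensor_events' with a column family 'm'.
import time
import happybase

connection = happybase.Connection("localhost")
table = connection.table("sensor_events")

def write_measurement(device_id: str, temp_c: float) -> None:
    # The row key combines device and timestamp, so each write stays
    # small and fast and per-device scans remain cheap.
    row_key = f"{device_id}#{int(time.time() * 1000)}".encode()
    table.put(row_key, {b"m:temp_c": str(temp_c).encode()})

write_measurement("sensor-42", 21.7)
```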