Optimised for Big Data analytics
The data lake concept, coined by Pentaho founder James Dixon, is designed for building large-scale data analysis systems. It targets querying and exploring data in the petabyte range, which demands very high processing throughput. Data stored in a data lake can subsequently be analysed with Hadoop technologies such as MapReduce, Spark, Tez and Hive.
A data lake doesn't prescribe a specific storage technology, only storage requirements. While data lakes are often discussed as synonymous with Hadoop – which is an excellent choice for many data lake workloads – they can in fact be built on various technologies such as NoSQL stores (HBase, MongoDB), object stores (Amazon S3) or an RDBMS.
A major benefit of data lake storage is that it can take in any data in its native format, with no prior conversion. In particular, it's not necessary to define a schema before the data is loaded. Schemas are defined only at analysis time, through the interpretation of the data. The schema – unlike with traditional data warehouse approaches – is thus built only when the data is actually read (schema on read). This allows a high degree of analytical flexibility and substantially simplifies data ingest.
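The schema-on-read idea can be sketched in plain Python, without any Hadoop machinery. In this illustrative miniature "lake" (all file names and helper functions here are hypothetical, not part of any real data lake product), heterogeneous files are ingested as-is, and a schema is imposed only by the reader at analysis time:

```python
import csv
import io
import json
from pathlib import Path

# Hypothetical miniature "data lake": raw files land in a directory
# in their native formats, with no schema declared at ingest time.
lake = Path("lake")
lake.mkdir(exist_ok=True)

# Ingest: heterogeneous sources are stored as-is, unconverted.
(lake / "events.json").write_text(
    '{"user": "alice", "action": "login"}\n'
    '{"user": "bob", "action": "logout"}\n'
)
(lake / "sales.csv").write_text("product,amount\nwidget,3\ngadget,5\n")

# Analysis: the schema is applied only when the data is read.
def read_events(path):
    # Interpret each line as a JSON record; the "schema" is simply
    # the set of fields the query chooses to extract at read time.
    return [json.loads(line) for line in path.read_text().splitlines()]

def read_sales(path):
    # A different schema for a different source, again defined here
    # at read time rather than at load time.
    return [
        {"product": row["product"], "amount": int(row["amount"])}
        for row in csv.DictReader(io.StringIO(path.read_text()))
    ]

events = read_events(lake / "events.json")
sales = read_sales(lake / "sales.csv")
print(events[0]["user"])                 # first user in the event log
print(sum(r["amount"] for r in sales))   # total sales amount
```

In a real deployment the same pattern appears at scale: engines such as Hive or Spark apply a table definition or inferred schema to raw files at query time, so ingest stays a simple copy.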