Data Warehouse in the age of Big Data
A Data Warehouse is a software architecture that for many years has been helping companies extract valuable knowledge from their various IT systems. The environment in which this technology is implemented, however, has changed dramatically in recent years: many companies are producing disproportionately more data, and the reaction times available for analysing this information have been drastically shortened. At the same time, the thirst for knowledge on the part of companies and organisations has kept growing. As a result, classic Data Warehouse approaches are quickly driven to their limits. Big Data technologies promise to fulfil these new requirements and offer promising approaches for enhancing and modernising the traditional Data Warehouse concept.
Below, we will examine the various aspects included in current discussions around Data Warehousing and Big Data.
The classic Data Warehouse
While a company's operational systems are focused on supporting the activities necessary for daily business, a Data Warehouse focuses on analysing and reporting on how the company is run. The technological base of a Data Warehouse system is the relational database management system (RDBMS), which offers a whole host of advantages:
- Robust, highly sophisticated database software
- Widespread, quickly available knowledge
- SQL as a powerful and standardised query language
- Many business intelligence front ends available
- High consistency and reliability
- Extensive security features for controlling access
- Backup and rollback features for covering data loss
Using extraction, transformation and loading processes (known as ETL processes), the required data is loaded into the Data Warehouse in an automated and time-controlled way, with the data being aggregated, enriched and refined during the process.
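The extract, transform and load steps can be sketched in a few lines of Python. This is a minimal illustration, not a production ETL job: the sample records, field names and the SQLite target are all hypothetical stand-ins for a real source extract and warehouse database.

```python
import csv
import io
import sqlite3

# Extract: read raw records from the source system export
# (hypothetical sample data standing in for a real CSV extract)
raw = io.StringIO("order_id,amount,country\n1,19.99,de\n2,5.50,DE\n3,12.00,fr\n")
rows = list(csv.DictReader(raw))

# Transform: cleanse and enrich -- convert types, normalise country codes
cleaned = [(int(r["order_id"]), float(r["amount"]), r["country"].upper())
           for r in rows]

# Load: write into a warehouse fact table (SQLite stands in for the RDBMS)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL, country TEXT)")
con.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleaned)

# The warehouse can now answer aggregated questions
total_de = con.execute(
    "SELECT SUM(amount) FROM fact_orders WHERE country = 'DE'").fetchone()[0]
print(round(total_de, 2))  # 25.49
```

In a real installation this logic would run as a scheduled, time-controlled job and write to the central warehouse rather than an in-memory database.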
Depending on the architectural approach, data marts can also be built, which provide specific sets of data from the Data Warehouse for use by special applications or organisational units. Data marts are usually multi-dimensional structures and are, therefore, optimal for use by analytical applications. The Data Warehouse and data marts are the central source of data for all analyses and reports within the company.
For the most common Data Warehouse use cases, then, relational databases and data marts are a good choice.
The limits of traditional Data Warehouses
When faced with extremely high data volumes, scaling a Data Warehouse can be very difficult. For companies using commercial database software, scaling may also result in high licensing costs. This acts as a deterrent for many companies, which is why they fail to analyse their data and make use of the knowledge it contains.
Another difficulty may be a lack of RDBMS performance when presented with high data volumes. This can affect not just the ETL processes, but also how quickly queries are carried out. Aside from sheer volume, the high frequency at which data is produced (millions of transactions per second) as well as its storage can be a real challenge. And because more and more data in non-standard formats is becoming the focus of analysis, relational databases can quickly be pushed to their limits.
To satisfy these new demands, new technologies are being brought into play.
Enhancing Data Warehouse with Hadoop, NoSQL & more
A number of technological approaches have been developed to overcome the limitations of the RDBMS: NoSQL databases, Apache Hadoop, and analytical databases.
The alternative to RDBMS: NoSQL databases
NoSQL stands for "Not only SQL". Databases in this category are open source, horizontally scalable, schema-free, distributed and follow a non-relational model. Typical representatives of this category include MongoDB, Cassandra, Neo4j and CouchDB.
NoSQL provides users with the following advantages:
- Scale out: affordable horizontal scalability
- High-performance databases for real-time applications
- Quick and easy customisation of database schemas
- High agility for exploratory developments in Big Data environments
- Storing less-structured data
One problem, however, is the query languages: none of them come close to the capabilities offered by SQL. Data is stored in NoSQL databases in the same form in which it is needed by applications, which largely eliminates the need for a powerful query language. BI applications, however, depend on SQL.
NoSQL databases can be used in two ways: some modern BI applications make it possible to create reports directly from the database using the appropriate interfaces, although their full range of functionality can't currently be used. The alternative is to work via the Data Warehouse: the data is loaded from NoSQL into the Data Warehouse, where relational databases provide the best solution for carrying out analyses.
Hadoop as a Data Warehouse platform
Apache Hadoop takes over where traditional Data Warehouse systems reach their limits. The essential problem with conventional Data Warehousing technologies is the rapid rise in operational costs when processing large amounts of data. In addition, more and more unstructured data is being produced that simply does not fit into the logic of a standard Data Warehouse. Hadoop is not a database. Instead, it is built on the distributed file system HDFS and the MapReduce framework for processing data.
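The MapReduce programming model can be illustrated with a small, self-contained Python simulation. This is only a sketch of the model, not real Hadoop code: the log lines are hypothetical, and on a real cluster the map and reduce phases would run in parallel across many nodes over data in HDFS.

```python
from collections import defaultdict

# Hypothetical web-server log records, the kind of data HDFS would hold
log_lines = [
    "GET /home 200", "GET /cart 200", "GET /home 404",
    "GET /home 200", "GET /cart 500",
]

# Map phase: emit a (key, 1) pair per record -- here, per requested page
def map_phase(lines):
    for line in lines:
        _, page, _ = line.split()
        yield page, 1

# Shuffle phase: group all emitted values by key
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine the values for each key
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(log_lines)))
print(counts)  # {'/home': 3, '/cart': 2}
```

The point of the model is that map and reduce only see one record or one key group at a time, which is what makes distributing the work across a cluster straightforward.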
Hadoop is an open source-based framework for the collection, organisation, storage, search and analysis of differently structured data on a cluster of standard computers. With this architecture, Hadoop is extremely scalable and can process very large amounts of data very quickly. The use of standard hardware also ensures that hardware costs remain manageable and provides for an excellent price-performance ratio. In addition to the pure open-source version of the software, there are also several commercial distributions, such as Cloudera, Hortonworks or MapR. Hadoop is mainly distinguished by the following characteristics:
- Java-based open-source framework
- Hadoop Distributed File System (HDFS)
- MapReduce algorithm for the parallel processing of data
- Hive as a Data Warehouse database for Hadoop
- HBase as a NoSQL database for Hadoop
The Hadoop framework enables companies to process very large amounts of data and provides a number of technological advantages:
- Quick and easy cluster scalability
- High processing and analytical speed through the use of parallelisation
- Simultaneous processing of multiple data types (structured, semi-structured, unstructured)
- High degree of flexibility for the exploratory analysis of Big Data
- Low costs by using open source options and standard hardware
The benefits of using Hadoop are multiplied when the software is used in combination with other technologies: The results of Hadoop processing can be stored in a Data Warehouse or data marts where they can be used in conjunction with all the advantages of a standard Data Warehouse platform. The actual raw data then only exists in the Hadoop system.
Currently, almost all Hadoop distribution providers are developing tools for quick, high-performance SQL access to data stored in Hadoop. But however promising these developments may be, given the current state of the art it makes more sense to combine Hadoop with a classic Data Warehouse approach.
Analytical databases for fast queries
Analytical databases are a relatively simple and quickly implementable extension for a Data Warehouse system. Typical examples include InfiniDB, Infobright, Vertica and Vectorwise. Many of these are built upon relational database management systems (RDBMS), but have been optimised for fast analytical queries.
|                  | Best fit         | Good fit             | Not a fit  |
| ---------------- | ---------------- | -------------------- | ---------- |
| Primary use case | Dynamic analyses | Statistical analyses | Heavy OLTP |
Analytical databases use different technologies to speed up data processing. These include:
- Massive parallel processing (MPP)
- Data compression
- In-memory storage
These open up a number of possibilities:
- Using data whose storage is usually too expensive and whose processing would take too long
- Using SQL as a powerful, common query language
- Compatibility with a wide variety of Business Intelligence (BI) front ends
- Relatively quick to implement and easy to administrate
- No special demands for complex hardware architectures
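The effect of in-memory, columnar storage can be hinted at with a small Python sketch. The sample data is hypothetical, and real analytical databases add compression and massive parallel processing on top; the point here is only that an aggregate query needs to scan just the columns it uses.

```python
# Row store: one record per entry -- a scan touches every field of every row
rows = [
    {"id": 1, "region": "EU", "revenue": 120.0},
    {"id": 2, "region": "US", "revenue": 80.0},
    {"id": 3, "region": "EU", "revenue": 50.0},
]

# Column store: one array per attribute -- an aggregate reads only the
# columns it needs, and uniform arrays also compress far better
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "revenue": [120.0, 80.0, 50.0],
}

# Aggregate revenue per region by scanning just two of the three columns
totals = {}
for region, revenue in zip(columns["region"], columns["revenue"]):
    totals[region] = totals.get(region, 0.0) + revenue
print(totals)  # {'EU': 170.0, 'US': 80.0}
```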
By setting up independent data marts, analytical databases can easily be used to complement a Data Warehouse installation. These topic or organisationally-specific data marts then retrieve the data directly from the given operational sources. Consistency problems, however, can arise from the resulting duplication of data between data marts. The amount of additional cost and effort also rises proportionally to the number of data marts used. Alternatively, data marts can also be fed from the Data Warehouse. Because they access the integrative layer of the Data Warehouse, the associated benefits are preserved. However, this solution requires much greater effort for modelling and development.
Big Data technologies in combination with the Data Warehouse
Let's take a look at some application scenarios and demonstrate how different Big Data technologies can be usefully combined.
For example, if a company has a web app or site that generates a lot of traffic, and it wants to collect this data in order to analyse user behaviour, a combined approach is practically indispensable. The goal here is to create what is known as a clickstream analysis of site visits and user actions. This analysis helps the marketing department to better understand visitors and to derive valuable insights from their behaviour.
Storing log data in a NoSQL database
The raw data required for these analyses is available in the web server logs. It makes sense to store the web application data directly in a NoSQL database, because this type of database can effectively deal with large amounts of log data. It also has the necessary flexibility to easily add new data objects.
For the purposes of our example, let's assume that some 100 million new records accrue daily. With traditional ETL (Extract, Transform, Load) and Data Warehouse technologies, processing these data sets with their different types of data would require a great deal of time. For this reason, the data should be stored in Hadoop, which provides powerful batch processing capabilities for working with it. The log data is condensed down to the hour, host or page level and enriched with information from other sources.
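The kind of condensing described above can be sketched in plain Python. The event records and page paths are hypothetical; in practice a Hadoop batch job would perform this aggregation at the scale of 100 million records a day.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw clickstream records as they might arrive from the logs:
# (timestamp, visitor id, requested page)
events = [
    ("2024-05-01T09:12:03", "visitor-1", "/product/42"),
    ("2024-05-01T09:47:10", "visitor-2", "/product/42"),
    ("2024-05-01T10:05:55", "visitor-1", "/checkout"),
    ("2024-05-01T10:30:00", "visitor-3", "/product/42"),
]

# Condense to the (hour, page) level, as the batch job would before loading
hourly_hits = Counter()
for timestamp, visitor, page in events:
    hour = datetime.fromisoformat(timestamp).strftime("%Y-%m-%d %H:00")
    hourly_hits[(hour, page)] += 1

for (hour, page), hits in sorted(hourly_hits.items()):
    print(hour, page, hits)
```

The condensed result is a small fraction of the raw volume, which is exactly what makes it practical to load into the Data Warehouse afterwards.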
Ready for use with powerful BI tools
The data is then aggregated and loaded into the Data Warehouse and analytical database. There, drawing on the full functionality of SQL, different technologies can be applied to the enriched, processed data for performing queries. These are optimal conditions for queries involving complex filtering, joins, groupings and OLAP (Online Analytical Processing).
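To illustrate the sort of join-and-group query such a setup enables, here is a small sketch using SQLite as a stand-in for the Data Warehouse or analytical database. The table names and sample rows are hypothetical; a real installation would run this against the loaded clickstream data.

```python
import sqlite3

# SQLite stands in for the Data Warehouse / analytical database
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE page_hits (day TEXT, page TEXT, hits INTEGER);
    CREATE TABLE page_meta (page TEXT, category TEXT);
    INSERT INTO page_hits VALUES
        ('2024-05-01', '/product/42', 3),
        ('2024-05-01', '/checkout', 1),
        ('2024-05-02', '/product/42', 5);
    INSERT INTO page_meta VALUES
        ('/product/42', 'shop'),
        ('/checkout', 'funnel');
""")

# A typical BI query: filter, join and group -- exactly the kind of
# workload relational engines and SQL excel at
query = """
    SELECT m.category, SUM(h.hits) AS total_hits
    FROM page_hits h
    JOIN page_meta m ON m.page = h.page
    WHERE h.day >= '2024-05-01'
    GROUP BY m.category
    ORDER BY total_hits DESC
"""
for category, total in con.execute(query):
    print(category, total)  # shop 8, then funnel 1
```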
Of course, various combinations of data streams are possible using this architecture. For instance, a report can be implemented directly against the NoSQL database, or the log data can be stored directly in Hadoop.
Conclusion: The coexistence of a Data Warehouse with Big Data stores
Our clickstream analysis example makes clear that the concept of using a Data Warehouse even in the age of Big Data is more relevant than ever. The Data Warehouse model offers many benefits and allows for the use of high-performance BI front ends. The challenge, therefore, is to supplement and expand the Data Warehouse with Big Data technologies so that the weaknesses of a classical architecture can be offset. Because, after all, modern BI and Data Warehouse architectures must be able to process the widest possible range of data.
Conversely, Big Data technologies are presenting new challenges for ETL and data integration tools. So that data from different sources can be combined and transformed, these tools must provide interfaces to NoSQL and Hadoop as well as to traditional sources such as relational databases and files.
The effort involved in setting up such an architecture, however, is worth it. Departments and businesses as a whole can use the knowledge gained to make better decisions. The new reality is, therefore, not to say goodbye to Data Warehouses, but rather to welcome the beginning of a future in which the combination of old and new approaches is used to produce faster and better analyses.
In conclusion, the Data Warehouse is a proven concept and in many companies forms the solid foundation of a Business Intelligence system.
Data Warehouse: Your benefits
- Comprehensive view of enterprise data
- Analyses of integrated data sources
- Time savings through standardised access to data
- Automated and standardised determination of key metrics
- Knowledge gains by linking information (data mining)
- Historisation of corporate data for time series comparisons
Our services: Design and modelling of a Data Warehouse
As an experienced consulting company in the BI sector, we utilise powerful open source tools like Pentaho Data Integration (PDI), Jedox ETL and Talend to develop the extraction, transformation and loading (ETL) processes you need. At the database level, we use open-source technologies such as MySQL, PostgreSQL and Infobright, which enable us to implement complex requirements for an analytical database.