Data Warehouse

Data Warehouse in the age of Big Data

A Data Warehouse is a software architecture that has for many years helped companies extract valuable knowledge from their various IT systems. The environment in which this technology is deployed, however, has changed dramatically in recent years: many companies now produce disproportionately more data, and the time available for analysing this information has shortened drastically. At the same time, the thirst for knowledge on the part of companies and organisations keeps growing. As a result, classic Data Warehouse approaches are quickly driven to their limits. Big Data technologies promise to fulfil these new requirements and offer promising approaches for enhancing and modernising the traditional Data Warehouse concept.

Below, we will examine the various aspects included in current discussions around Data Warehousing and Big Data.

The classic Data Warehouse

While a company's operational systems are focused on supporting the activities necessary for daily business, a Data Warehouse focuses on analysing and reporting on how the company is run. The technological base of a Data Warehouse system is the relational database management system (RDBMS), which offers a whole host of advantages:

  • Robust, highly sophisticated database software
  • Widespread, quickly available knowledge
  • SQL as a powerful and standardised query language
  • Many business intelligence front ends available
  • High consistency and reliability
  • Extensive security features for controlling access
  • Backup and rollback features for covering data loss

Using extraction, transformation and loading processes (known as ETL processes), the required data is loaded into the Data Warehouse in an automated and time-controlled way, with the data being aggregated, enriched and refined during the process.
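The three ETL phases can be sketched in a few lines. This is a minimal illustration only; the file layout, table name and figures are invented for this sketch, and real ETL would use a dedicated tool rather than hand-written scripts.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from an operational system (layout invented for this sketch).
raw = io.StringIO(
    "date,region,amount\n"
    "2024-01-01,North,120\n"
    "2024-01-01,South,80\n"
    "2024-01-02,North,50\n"
)

# Extract: read the source records
rows = list(csv.DictReader(raw))

# Transform: aggregate sales per day and region
totals = {}
for r in rows:
    key = (r["date"], r["region"])
    totals[key] = totals.get(key, 0) + int(r["amount"])

# Load: write the aggregated facts into a warehouse table
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales_fact (sale_date TEXT, region TEXT, total INTEGER)")
db.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [(d, reg, t) for (d, reg), t in totals.items()],
)
print(db.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0])  # 3 aggregated rows
```

In a production setting, the transform step is also where data is enriched and refined, as described above, and the whole chain runs automatically on a schedule.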

Depending on the architectural approach, data marts can also be built, which provide specific sets of data from the Data Warehouse for use by special applications or organisational units. Data marts are usually multi-dimensional structures and are, therefore, optimal for use by analytical applications. The Data Warehouse and data marts are the central source of data for all analyses and reports within the company.

It can, therefore, be said that the use of relational databases and data marts for the most common Data Warehouse use cases is a good choice.

The limits of traditional Data Warehouses

When faced with extremely high data volumes, scaling a Data Warehouse can be very difficult. For companies using commercial database software, scaling may also result in high licensing costs. This deters many companies, which consequently fail to analyse their data and make use of the knowledge it contains.

Another difficulty is the limited performance of RDBM systems when presented with high data volumes. This can affect not just the ETL processes but also how quickly queries are executed. Beyond sheer volume, the high frequency at which data is produced (millions of transactions per second), and the need to store it, can be a real challenge. And because more and more data in non-standard formats is becoming the focus of analysis, relational databases can quickly be pushed to their limits.

To satisfy these new demands, new technologies are being brought into play.

Enhancing Data Warehouse with Hadoop, NoSQL & more

A number of technological approaches have been developed to overcome the limitations of RDBM systems: NoSQL databases, Apache Hadoop, and analytical databases.

The alternative to RDBMS: NoSQL databases

NoSQL stands for "Not only SQL". Databases in this category are typically open source, horizontally scalable, schema-free and distributed, and follow a non-relational model. Typical representatives of this category include MongoDB, Cassandra, Neo4j and CouchDB.

NoSQL provides users with the following advantages:

  • Scale out: affordable, low-cost, horizontal scalability
  • High-performance databases for real-time applications
  • Quick and easy customisation of database schemas
  • High agility for exploratory developments in Big Data environments
  • Storing less-structured data

One drawback, however, is the query languages: none of them come close to the capabilities offered by SQL. Data is stored in a NoSQL database in the same form in which the application needs it, which reduces the need for a powerful query language. BI applications, however, depend on SQL.
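The point that data is stored in the shape the application needs it can be illustrated with a document-style record. This is a generic sketch of the document model, not the API of any particular NoSQL product, and all field names are invented:

```python
import json

# A page-visit event stored as one self-contained document: the nested
# structure mirrors exactly what the application reads, so no joins are
# needed to reassemble it (all fields invented for this sketch).
visit = {
    "session_id": "abc123",
    "user": {"id": 42, "country": "DE"},
    "clicks": [
        {"page": "/home", "ts": "2024-01-01T10:00:00"},
        {"page": "/products", "ts": "2024-01-01T10:00:31"},
    ],
}
doc = json.dumps(visit)  # stored as a single document

# The application later reads the record back in the same shape it uses it:
loaded = json.loads(doc)
print(len(loaded["clicks"]))  # 2
```

In a relational schema, the same event would be spread over session, user and click tables and reassembled with joins; the document model trades that query flexibility for read/write simplicity.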

Expert recommendation

NoSQL databases can be used in two ways: some modern BI applications can create reports directly from the database via the appropriate interfaces, although their full range of functionality cannot currently be used this way. The alternative is to work via the Data Warehouse: data from the NoSQL database is loaded into the Data Warehouse, where analyses can be carried out with all the advantages that relational databases provide.

Hadoop as a Data Warehouse platform

Apache Hadoop takes up where other, traditional Data Warehouse systems have reached their limits. The essential problem with using conventional Data Warehousing technologies is the rapid rise in operational costs when processing large amounts of data. In addition, more and more unstructured data is being produced that just does not fit into the logic of a standard Data Warehouse. Hadoop is not a database. Instead, it consists of and relies on the distributed file system HDFS and the MapReduce framework for processing data.

Hadoop is an open source-based framework for the collection, organisation, storage, search and analysis of differently structured data on a cluster of standard computers. Using this architecture, Hadoop can be extremely scalable and process very large amounts of data very quickly. The use of standard hardware also ensures that hardware costs remain manageable and provides for an excellent price-performance ratio. In addition to the pure open-source version of the software, there are also several commercial distributions, such as Cloudera, Hortonworks or MapR. Hadoop is mainly distinguished by the following characteristics:

  • Java-based open-source framework
  • Hadoop Distributed File System (HDFS)
  • MapReduce algorithm for the parallel processing of data
  • Hive as a Data Warehouse database for Hadoop
  • HBase as a NoSQL database for Hadoop

The Hadoop framework enables companies to process very large amounts of data and provides a number of technological advantages:

  • Quick and easy cluster scalability
  • High processing and analytical speed through the use of parallelisation
  • Simultaneous processing of multiple data types (structured, semi-structured, unstructured)
  • High degree of flexibility for the exploratory analysis of Big Data
  • Low costs by using open source options and standard hardware
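The MapReduce principle behind Hadoop's parallel processing can be mimicked locally in plain Python. This sketch is not Hadoop itself and uses invented toy log data; it only shows the map, shuffle and reduce phases that Hadoop distributes across a cluster:

```python
from itertools import groupby

# Toy web-server log entries (invented for this sketch): host, hour, page
log = [
    ("10.0.0.1", "10", "/home"),
    ("10.0.0.2", "10", "/home"),
    ("10.0.0.1", "11", "/cart"),
]

# Map phase: emit one (key, 1) pair per record -- here, page hits per hour
mapped = [((hour, page), 1) for host, hour, page in log]

# Shuffle phase: bring identical keys together
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: sum the counts for each key
reduced = {}
for key, group in groupby(mapped, key=lambda kv: kv[0]):
    reduced[key] = sum(count for _, count in group)

print(reduced[("10", "/home")])  # 2 hits on /home in hour 10
```

On a real cluster, the map and reduce phases run in parallel on many nodes, which is where the speed advantage for very large data volumes comes from.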

Expert recommendation

The benefits of using Hadoop are multiplied when the software is used in combination with other technologies: The results of Hadoop processing can be stored in a Data Warehouse or data marts where they can be used in conjunction with all the advantages of a standard Data Warehouse platform. The actual raw data then only exists in the Hadoop system.
Currently, almost all Hadoop distribution providers are developing tools for quick, high-performance SQL access to data stored in Hadoop. However promising these developments may be, given the current state of the art it makes more sense to combine Hadoop with a classic Data Warehouse approach.

Analytical databases

Analytical databases are a relatively simple and quickly implementable extension for a Data Warehouse system. Typical examples include databases from InfiniDB, Infobright, Vertica, and Vectorwise. There are also databases that are built upon relational database management systems (RDBMS), but have been optimised for fast queries.

Best fit: dynamic analyses

Primary use case
  • Numerous ad hoc queries
  • Near real-time response times
  • Fast loading processes
  • Lots of data / quick queries
  • Aggregations: count, sum, etc.
  • High compression
  • Rapid deployment
Query characteristics
  • Analysis-intensive queries
  • Default / standard data types
  • Limited number of joins
Example queries
  • Average clicks per visit
  • Total number of visits
  • Total visit time
  • Total bounce rate

Good fit: statistical analyses

Primary use case
  • End-of-day report
  • Different intensity of queries
  • Moderate modification of data
  • Simple joins
  • Loading via batch processes
Query characteristics
  • Mixed intensity queries
  • Queries using data marts
  • Standard SQL queries
Example queries
  • Number of customers per region
  • Sales per region
  • Average sale price
  • Sales per product
  • Number of new customers per month
  • Top sellers

Not a fit: heavy OLTP

Primary use case
  • Many transactions
  • High referential integrity
  • Zero downtime
Query characteristics
  • Lots of updates
  • Very frequent changes
  • Updates across multiple tables
Example queries
  • Update account balance
  • Deletion of shopping cart

Analytical databases use different technologies to speed up data processing. These include:

  • Column-oriented storage
  • Massively parallel processing (MPP)
  • Data compression
  • In-memory storage

These open up a number of possibilities:

  • Using data whose storage is usually too expensive and whose processing would take too long
  • Using SQL as a powerful, common query language
  • Compatibility with a wide variety of Business Intelligence (BI) front ends
  • Relatively quick to implement and easy to administrate
  • No special demands for complex hardware architectures
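The effect of column-oriented storage, the first of the techniques listed above, can be shown with a small sketch. The data and layout are invented for illustration; real analytical databases implement this at the storage-engine level:

```python
# Row-oriented layout: each record is stored together, so an aggregate
# over one field still touches every field of every record.
rows = [
    ("2024-01-01", "North", 120),
    ("2024-01-01", "South", 80),
    ("2024-01-02", "North", 50),
]

# Column-oriented layout: each column is stored contiguously, so an
# aggregate scans only the one column it needs. Columns with many
# repeated values (like "region") also compress very well.
columns = {
    "date":   [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

total = sum(columns["amount"])          # scans a single compact array
distinct_regions = set(columns["region"])
print(total, sorted(distinct_regions))  # 250 ['North', 'South']
```

At warehouse scale, the difference between scanning one column and scanning whole rows is what makes the analytical queries in the table above fast.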

Expert recommendation

By setting up independent data marts, analytical databases can easily be used to complement a Data Warehouse installation. These topic-specific or organisation-specific data marts then retrieve data directly from the given operational sources. Consistency problems, however, can arise from the resulting duplication of data between data marts, and the additional cost and effort rise in proportion to the number of data marts used. Alternatively, data marts can be fed from the Data Warehouse. Because they then access the integrative layer of the Data Warehouse, the associated benefits are preserved. However, this solution requires much greater effort for modelling and development.

Big Data technologies in combination with the Data Warehouse

Let's take a look at some application scenarios and demonstrate how different Big Data technologies can be usefully combined.

For example, if a company has a web app or site that generates a lot of traffic and they want to collect this data so that they can analyse user behaviour, a combined approach is pretty much indispensable. The goal here is to create what is known as a clickstream analysis of the site visits and user actions. This analysis helps the marketing department to better understand visitors and to derive valuable insights from their behaviours.

Storing log data in a NoSQL database

The raw data required for these analyses are available in the web server logs. It makes sense to store the web application data directly in a NoSQL (Not only SQL) database, because this type of database can effectively deal with large amounts of log data. It also has the necessary flexibility to easily add new data objects.

For the purposes of our example, let's assume that some 100 million new records accrue daily. With traditional ETL (Extract, Transform, Load) and Data Warehouse technologies, trying to process these data sets with their different types of data will require a great deal of time. For this reason, the data should be stored in Hadoop, which provides powerful batch processing capabilities for working with data. The log data is condensed down to hours, hosts or page level and enriched with information from other sources.

Ready for use with powerful BI tools

The data is then aggregated and loaded into the Data Warehouse and analytical database. There, drawing on the full functionality of SQL, different technologies can be applied to the enriched, processed data for performing queries. These are optimal conditions for queries involving complex filtering, joins, groupings and OLAP (Online Analytical Processing).
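A typical BI query on such aggregated clickstream data can be sketched with standard SQL. The schema, table name and figures are invented for this sketch (shown here via Python's built-in `sqlite3` purely for illustration; an analytical database would run the same SQL):

```python
import sqlite3

# Hypothetical hourly clickstream aggregates as they might land in the
# warehouse after the Hadoop batch processing (all values invented).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clicks_hourly (hour TEXT, page TEXT, visits INTEGER)")
db.executemany("INSERT INTO clicks_hourly VALUES (?, ?, ?)", [
    ("10", "/home", 120), ("10", "/cart", 30),
    ("11", "/home", 200), ("11", "/cart", 55),
])

# A typical BI query: grouping, aggregation and ordering in standard SQL
top = db.execute("""
    SELECT page, SUM(visits) AS total
    FROM clicks_hourly
    GROUP BY page
    ORDER BY total DESC
""").fetchall()
print(top)  # [('/home', 320), ('/cart', 85)]
```

Queries like this, with filters, joins and groupings, are exactly where the SQL layer of the Data Warehouse outperforms the raw Hadoop or NoSQL stores.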

Of course, various combinations of data streams are possible within this architecture. For instance, a report can be implemented directly on the NoSQL database, or the log data can be stored directly in Hadoop.

Conclusion: The coexistence of a Data Warehouse with Big Data stores

Our clickstream analysis example makes clear that the concept of using a Data Warehouse even in the age of Big Data is more relevant than ever. The Data Warehouse model offers many benefits and allows for the use of high-performance BI front ends. The challenge, therefore, is to supplement and expand the Data Warehouse with Big Data technologies so that the weaknesses of a classical architecture can be offset. Because, after all, modern BI and Data Warehouse architectures must be able to process the widest possible range of data.

Conversely, Big Data technologies are presenting new challenges for ETL and data integration tools. So that data from different sources can be combined and transformed, the tools to work with this data must provide interfaces to NoSQL and Hadoop as well as related solutions such as relational databases, files, and other traditional sources.

The effort involved in setting up such an architecture, however, is worth it. Departments and businesses as a whole can use the knowledge gained to make better decisions. The new reality is, therefore, not to say goodbye to Data Warehouses, but rather to welcome the beginning of a future in which the combination of old and new approaches is used to produce faster and better analyses.

In conclusion, the Data Warehouse is a proven concept and in many companies forms the solid foundation of a Business Intelligence system.

Data Warehouse: Your benefits

  • Comprehensive view of enterprise data
  • Analyses of integrated sources of data
  • Time savings through standardised access to data
  • Automated and standardised determination of key metrics
  • Knowledge gains by linking information (data mining)
  • Historical storage of corporate data for time series comparisons

Our services: Design and modelling of a Data Warehouse

As an experienced consulting company in the BI sector, we utilise powerful open source tools like Pentaho Data Integration (PDI), Jedox ETL and Talend to develop the extraction, transformation and loading (ETL) processes you need. At the database level, we use interesting, open-source technologies such as MySQL, PostgreSQL and Infobright that enable us to implement complex requirements for an analytical database.