Pentaho Community Meeting 2018


If knowledge is the vision for our future, data and information are the means to reach it. As a platform for data integration and big data analytics, Pentaho has followed this vision for years. At the Pentaho Community Meeting in Bologna, you could see this vision becoming reality.

The data crowd of Pentaho Community Meeting

I was happy and proud to welcome almost 280 participants from more than 25 countries. Together with the team from Pentaho User Group Italia, we again had the honor of organizing one of the biggest open data events of 2018. This year it was all in Italian hands 🙂 Anyone who had missed the signs at the airport noticed within the first minutes, as Francesco Corti opened the event with a motivating, engaging... well, Italian speech.

Pentaho in Hitachi Vantara

Pedro Alves started the talks with a much-awaited speech about Pentaho in Hitachi Vantara. After the acquisition of Pentaho two years ago, some users might still wonder about the journey Pentaho is taking within Hitachi Vantara. He started with some facts:

  • Hitachi Vantara has 4,900 employees in R&D and commits $2.8 billion annually to R&D
  • More than $1 billion invested in IoT and big data (the entire company holds 119,000 patents worldwide)
  • $5.4 billion revenue from IoT in 2015
  • 2,500 patents awarded in technologies for big data analysis
  • According to Thomson Reuters, it is one of the Top 100 Global Innovators

So, what does this mean for Pentaho? Hitachi is trying to close the gaps in the data journey. Pentaho users and developers have been working on the various aspects of the analytics data pipeline and have largely achieved the goal of covering the many different use cases and data scenarios that users have. Pentaho covers the entire journey from data to insights.

Good vibes

On the product roadmap for Pentaho, there are topics such as

  • Ecosystem integration
  • Edge to cloud processing
  • Streamlined data operations
  • Analytics and visualization

What’s new in Pentaho 8.2

Jens Bleuel presented an overview of all new features of Pentaho 8.2. It is very exciting that a lot of so-called Continuous Improvement topics have been addressed. These are generally smaller improvements that make the day-to-day life of a user much easier.
At release time (very soon), all new features can be seen on the What’s New in Pentaho 8.2 page of the documentation.

CERN: data warehouse & business computing challenges (Gabriele Thiede & Jan Janke)

Gabriele Thiede and Jan Janke, CERN

Gabriele and Jan presented the data warehouse and business computing challenges at CERN. Both have spoken at past community and user meetings, with only the limited time keeping them from covering every single aspect of their Pentaho project. Just in case someone doesn't know: CERN is the world's largest research organization and breaks superlatives in every respect. Over 23,000 people work at CERN, 12,000 of them researchers, and with 2,700 nationals Italy is the most represented country!

With these numbers, CERN provides the infrastructure of a whole town: restaurants, hotels, banks, post offices, doctors, a fire brigade, libraries, stores and kindergartens. On top of this, special challenges arise from CERN's location on the border between two countries, France and Switzerland (a non-EU country!). This means that administration within CERN is also special: besides running their own social security system, including health insurance, a pension fund etc., they also provide services to handle requests for licence plates, work and residence permits and so on. Pentaho helps CERN to manage the data involved in these areas and processes.

CERN uses Pentaho to combine data from personnel, finance, logistics, ERP and more systems in a data warehouse. PDI is used for ETL processes, Pentaho Analytics and CTools are used for all kinds of visualizations. A few hundred business experts work directly with Pentaho Analyzer to create reports and analyses, forms and official documents. These analyses and reports are then selectively made available to serve the entire CERN community.
The data warehouse is refreshed in near real time, which means that the data is 2-15 minutes behind reality. PDI is the main data integration tool, and they rely heavily on a feature called "metadata injection" to avoid repetition and to ensure common logging, error handling and auditing.

Self-service analytics and reports

Pentaho Report Designer is used to provide high-fidelity prints for a number of reports, certificates and attestations, e.g. the annual personnel statistics required by CERN's member states. Another use case is the numerous forms supporting the customs formalities the organization needs to follow. A self-service BI portal called IRIS provides personalized self-service analytics to the CERN community and supports role holders in finding and analyzing information about the various areas of CERN's administration.

Time travel with PDI

CERN faces many "data challenges" every day. A typical one is the problem of the unstable past: the database is not always in line with reality, and retroactive changes are the rule rather than the exception. Certain events, for instance changes in someone's personal situation (e.g. family-related events), can only be registered retroactively. Therefore, the status reflected in the data warehouse is often a few weeks behind reality. As a result, it is difficult to produce statistics and reports that are both correct and reproducible.

The solution in the past was a workaround: data was aggregated and frozen in special tables. The dilemma, however, was that this made it impossible to reproduce the data. So a travel-back-in-time approach with a bi-temporal data model was implemented, adding an extra time dimension to the data warehouse to trace reality. Although this has an impact on query performance, it enables CERN to map two realities: the business timeline and the technical timeline. This way it is possible to prove what was in the database at which moment.
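
To make the two timelines tangible, here is a minimal sketch of a bi-temporal table, purely illustrative and not CERN's actual schema: the valid_* columns carry the business timeline (when a fact was true in reality) and the recorded_* columns carry the technical timeline (when the database knew about it). A report "as known on" a given date filters on both.

```python
import sqlite3

# Hypothetical bi-temporal history table (illustrative only, not CERN's real model).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contract_history (
    person_id     INTEGER,
    grade         TEXT,
    valid_from    DATE,  -- business timeline: when the fact was true in reality
    valid_to      DATE,
    recorded_from DATE,  -- technical timeline: when the database knew about it
    recorded_to   DATE
);
-- Originally the database knew only: grade A, open-ended.
INSERT INTO contract_history VALUES
 (42, 'A', '2015-01-01', '9999-12-31', '2015-01-05', '2018-02-10'),
-- On 2018-02-10 a retroactive grade change (effective 2018-01-01) is registered:
 (42, 'A', '2015-01-01', '2017-12-31', '2018-02-10', '9999-12-31'),
 (42, 'B', '2018-01-01', '9999-12-31', '2018-02-10', '9999-12-31');
""")

def as_known_on(report_date, knowledge_date):
    """What was valid on report_date, according to what the DB knew on knowledge_date."""
    return conn.execute("""
        SELECT person_id, grade FROM contract_history
        WHERE valid_from    <= ? AND ? < valid_to
          AND recorded_from <= ? AND ? < recorded_to
    """, (report_date, report_date, knowledge_date, knowledge_date)).fetchall()

# Reproducible reporting: the same report date gives different, provable answers
# depending on when the question was asked.
print(as_known_on("2018-01-15", "2018-01-20"))  # [(42, 'A')] - change not yet registered
print(as_known_on("2018-01-15", "2018-03-01"))  # [(42, 'B')] - retroactive change now visible
```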

Just to name a few of the other Pentaho implementations and challenges at CERN:

  • A live dashboard for procurement processes to see in real time how much CERN spends on materials and services by supplier country.
  • They are currently exploring the use of ML algorithms for predictive analytics to automate some manual processes and to help with forecasts based on models built from historical data.
  • In addition, integrating cloud data remains an actively worked-on challenge, given the lack of direct database access and APIs that cannot always provide all the data expected for complete reporting.

Gabriele presented an example where CERN relies on calling some APIs very frequently to close gaps in historical data.
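
The details of those APIs were not shown, so the following is only a rough, hypothetical sketch of the pattern: poll an endpoint for a time window at a fixed interval and hand the results to the warehouse loader. Endpoint, parameters and the loader function are all placeholders.

```python
import time
import requests  # assumes the requests package; URL and payload shape are hypothetical

API_URL = "https://example.org/api/records"  # placeholder, not a real CERN endpoint

def load_into_warehouse(record):
    """Placeholder for the real loading logic (e.g. handing the row to a PDI job)."""
    print("would upsert:", record)

def backfill(day_iso):
    """Fetch one day's worth of records and pass them to the warehouse loader."""
    resp = requests.get(API_URL, params={"date": day_iso}, timeout=30)
    resp.raise_for_status()
    for record in resp.json():
        load_into_warehouse(record)

# Poll frequently so that late-arriving history gets picked up.
while True:
    backfill("2018-11-24")
    time.sleep(300)  # every 5 minutes
```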

Successful cooperations between Hitachi and the community (Gunther Dell & Stefan Müller)

Gunther Dell and Stefan Müller

Gunther from Hitachi Vantara and I showed how technical partnerships between Hitachi and the community can result in great solutions. Gunther Dell, responsible for SAP business at Hitachi, talked to SAP about solutions for analyzing SAP data. Though SAP was quite enthusiastic about the cooperation, there was the problem of getting data out of SAP systems and into the Pentaho platform. When he got in touch with it-novum, Gunther found out that we had already built a Pentaho connector for SAP in 2012. After some talks with Hitachi, my colleagues updated it to make it compatible with the newest Pentaho version.

The SAP/Pentaho Connector supports the import of any kind of SAP data (ERP, BW etc.) into Pentaho so that it can be processed in Pentaho Data Integration and used for reports and analyses. I presented it at last year's PCM and it is now on the Hitachi price list, so it can be purchased directly from Hitachi.

Another example of the successful cooperation between Hitachi and the Pentaho ecosystem is the HVA Connector. HVA stands for Hitachi Video Analytics, a powerful solution for analyzing data derived from videos. The connector ensures that data from third-party systems can be merged with video information. SAP has covered these connectors in a blog article of its own.

TECHNICAL TRACK

Useful Kettle Plugins (Matt Casters)

Matt Casters

Matt asked me to include this quote about his talk, which I'm happy to do:
"For me, moving forward and supporting Neo4j Solutions is important and just like it-novum did with SAP and video analytics I started writing extra functionality on top of Kettle for what I needed. This is the nice thing about open source. Everybody does things for their own needs but everybody benefits. As it happened to be I had to create quite a bit of new things to make Kettle behave the way that I wanted and this is reflected in the long list of projects and links you can find in my presentation on pcm18.kettle.be. I can't stress enough that a lot of this work benefited from the feedback of the Kettle community and that community welcomes everyone else's opinions as well so join in on the fun."

Scaling Pentaho Server with Kubernetes (Diethard Steiner)

Diethard Steiner

With Kubernetes, provisioning a cluster is finally a straightforward task. Diethard's presentation covered essential points such as how to break the monolithic Pentaho Server application apart into its individual components so that they can be scaled independently, how best to create the Docker image, and, at a high level, how to create the relevant Kubernetes deployment definitions.
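
Those deployment definitions were not part of the write-up, but the independent-scaling idea can be sketched with the official Kubernetes Python client; the deployment names and namespace below are hypothetical, assuming the server components already live in separate Deployments.

```python
# Minimal sketch with the `kubernetes` Python client: scale one (hypothetically)
# split-out Pentaho Server component independently of the others.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
apps = client.AppsV1Api()

def scale(deployment, replicas, namespace="pentaho"):
    """Patch the replica count of a single component's Deployment."""
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

# Hypothetical component names; the point is that each piece scales on its own.
scale("pentaho-carte-worker", 5)  # scale the ETL workers up
scale("pentaho-reporting", 2)     # keep the reporting tier small
```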

Capitalizing on Lambda&Kappa architectures for IoT (Issam Hijazi)

Issam Hijazi

How to capitalize on Lambda and Kappa architectures for IoT with Pentaho was the topic of Issam's talk. Lambda and Kappa are design principles that many people apply or discuss without putting them into context. They are important implementation practices that lay down ground rules for designing data flows in a variety of use cases, especially those that require near real-time data processing and analytics.

The Lambda architecture consists of three main layers: batch, speed and serving. The batch layer is used as an immutable master data store with append-only semantics for all streamed data, and it requires a full reprocessing of all data on each batch execution. The benefit is that it ensures data accuracy and consistency, especially when the code has bugs or is improved. The speed layer processes data as it arrives and creates real-time views that reflect the data from the end of the last finished batch job up to the present moment. Combining the two results (batch + speed) gives a full view when that is required; the batch views can also be used on their own, and the real-time views can serve niche requirements.

The Kappa architecture is a simplification of Lambda that removes the need for a batch layer, aiming to overcome its downsides and questioning its claim to beat the "CAP theorem". It states that the speed layer should be able to process data as it arrives, but also to reprocess data when needed. It is worth mentioning that this architecture came from the creator of Apache Kafka, which perfectly fits its requirements and goals. Kafka can be used to pipeline the data as it arrives through different, partitioned topics. Each arriving message has an offset which can be used to replay data (i.e. for reprocessing) when needed, making it act as a log; well, it doesn't just act like one, it actually is a log. In addition, Kafka topics can be mirrored into storage (like HDFS), allowing users to change retention policies and much more. Lastly, such an architecture allows users to reinitialize any target system (e.g. databases) very quickly and leaves one code base to maintain (unlike Lambda's two, one for batch and another for speed).
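
To make the offset-based replay concrete, here is a small sketch with the kafka-python client (broker address and topic are placeholders): the same consumer code serves both real-time processing and a full Kappa-style reprocessing, simply by choosing where to start reading.

```python
# Sketch of Kappa-style reprocessing with kafka-python; names are placeholders.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    enable_auto_commit=False,
    value_deserializer=lambda b: b.decode("utf-8"),
)

partition = TopicPartition("sensor-events", 0)
consumer.assign([partition])

REPROCESS = True
if REPROCESS:
    # Kappa: rebuild the target view by replaying the log from the first offset ...
    consumer.seek_to_beginning(partition)
else:
    # ... or keep reading new messages only, for real-time processing.
    consumer.seek_to_end(partition)

for message in consumer:
    # One code base for both modes, unlike Lambda's separate batch and speed paths.
    print(message.offset, message.value)
```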

In short, both architectures work great for specific use cases. Issam stated that for IoT use cases, Kappa would be the better choice. For cases where you need to reprocess the full data set every single time (e.g. on a daily basis), such as building a new model with an ML algorithm, the Lambda architecture makes sense.

Video Analytics: unlock value from videos using Pentaho/HVA Connector (Alexander Keidel)

Videos can serve as an important data source for improving public security and optimizing city infrastructure. Hitachi Video Analytics (HVA) is a powerful solution for analyzing video data. As Alexander demonstrated, however, this data is of little use if it isn't blended with data from other sources.

That's where Pentaho comes into play: with PDI you can blend information coming from cameras with other data and run analytics and reports on it. To make this possible, Alex and his team developed a connector that links Pentaho Data Integration and Hitachi Video Analytics. The connector offers full integration with PDI, blends video data with data from any third-party application and can read data from 13 HVA analytical modules, in batch or streaming mode.

How to ensure the integrity and convert a file based repository (Mihai Rizea & Antonio Pacheco)

Antonio Pacheco

Mihai and Antonio from Ubiquis shared two real-world examples of how to tackle some of the challenges of file-based repositories. Ubiquis regularly publishes articles on its blog describing their approach (along with the code) to real-world challenges, and in this case decided to give the participants of PCM 2018 a sneak peek at a couple of upcoming features that share the same topic.

File-based repositories are not officially supported but are fairly commonly used. The path of a job/transformation stored in the repository and the path of the file within the filesystem commonly mismatch, which makes a sustainable deployment strategy challenging. Mihai showed how he developed a job to check for, and repair, discrepancies between what the repository expects and what is actually present in the filesystem, allowing you to deploy your project more safely.
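
Ubiquis' actual job was not published at the time, but the check it performs can be approximated like this, assuming (as is typical for Kettle files) that each .ktr keeps its logical path in the <directory> element of its <info> block and each .kjb at the top level; REPO_ROOT is a placeholder.

```python
# Rough sketch (not Ubiquis' job): compare the logical path stored inside each
# .ktr/.kjb with where the file actually sits in the file-based repository.
import os
import xml.etree.ElementTree as ET

REPO_ROOT = "/path/to/file-repository"  # placeholder

def stored_directory(path):
    """Read the <directory> element Kettle keeps in the file (location varies by file type)."""
    root = ET.parse(path).getroot()
    node = root.find("./info/directory")   # transformations (.ktr)
    if node is None:
        node = root.find("./directory")    # jobs (.kjb) usually keep it at the top level
    return node.text if node is not None else None

for dirpath, _dirs, files in os.walk(REPO_ROOT):
    for name in files:
        if not name.endswith((".ktr", ".kjb")):
            continue
        full = os.path.join(dirpath, name)
        rel = os.path.relpath(dirpath, REPO_ROOT).replace(os.sep, "/")
        expected = "/" if rel == "." else "/" + rel
        actual = stored_directory(full)
        if actual != expected:
            print(f"MISMATCH {full}: file says {actual!r}, filesystem says {expected!r}")
```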

Antonio then presented a real-life account of a situation where moving away from the repository was the best course of action, and showed how Ubiquis developed a set of transformations to convert a file-based repository into a flat-files project. The tool creates a copy of the folder where the repository is stored while editing the XML within all of the ktr and kjb files.

A Web ETL solution based on a reflective software approach (Leonardo Coelho)

BIcenter is a web platform for the development and management of ETL processes in a multi-institution environment. Since the data used is typically sensitive, it must ensure data privacy and protection. In addition, ETL processes are executed periodically to update a data warehouse or to produce statistical reports, which requires the ability to schedule periodic executions of ETL processes. BIcenter uses Kettle and replicates PDI functionality in an HTML5 browser, allowing multiple users to build, share and execute ETL pipelines across multiple centers. It uses a reflection-based approach that accelerates the integration of ETL components.
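
The paper's reflective mechanism is specific to BIcenter, but the general idea, deriving a component's configuration form from its definition at runtime instead of hand-coding one UI per step, can be sketched roughly as follows; the step class and its attributes are invented for the example.

```python
# Illustrative-only sketch of a reflective approach: inspect a step class at runtime
# and derive a generic configuration description from its constructor, instead of
# hand-writing one web form per ETL component.
import inspect

class CsvInputStep:
    """Stand-in for an ETL step definition (invented for this example)."""
    def __init__(self, filename: str = "", delimiter: str = ",", header: bool = True):
        self.filename = filename
        self.delimiter = delimiter
        self.header = header

def describe_step(step_cls):
    """Build a field list (name, type, default) by reflecting over the constructor."""
    fields = []
    for name, param in inspect.signature(step_cls.__init__).parameters.items():
        if name == "self":
            continue
        fields.append({
            "name": name,
            "type": getattr(param.annotation, "__name__", "str"),
            "default": param.default,
        })
    return {"step": step_cls.__name__, "fields": fields}

# A web frontend could render this description as an input form for any step it discovers.
print(describe_step(CsvInputStep))
```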

Profiling Mondrian MDX requests in a production environment (Raimonds Simanovskis)

eazyBI is built with JRuby and the Ruby on Rails framework and uses the JRuby mondrian-olap library, which encapsulates the Mondrian engine. The application manages many thousands of active Mondrian schemas. When some MDX requests are slow, all Mondrian-generated MDX and SQL statements can be logged in a development environment; in a production environment with many different schemas, however, it is not feasible to enable logging of all SQL statements. The first approach for profiling individual MDX requests is to use the Mondrian QueryTiming class, and support for this has been added to the JRuby mondrian-olap library. More detailed SQL statement logging for individual MDX requests was trickier: eazyBI uses table schema prefixes to filter them out of all logged SQL statements.

In addition, Raimonds explained how the Mondrian schema pool keeps soft references to all Mondrian schemas, which prevents them from being garbage collected. eazyBI implemented additional Mondrian connection and schema pooling to periodically remove unused schemas from the pool and improve garbage collection performance.
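
The real implementation lives in JRuby on top of mondrian-olap, but the pooling idea itself is generic and can be sketched in a few lines (names invented): keep strong references only to recently used schemas and evict the rest periodically so they become garbage-collectable.

```python
# Generic sketch of the pooling idea, not the eazyBI/mondrian-olap code.
import time

class SchemaPool:
    def __init__(self, max_idle_seconds=1800):
        self.max_idle_seconds = max_idle_seconds
        self._entries = {}  # schema_key -> (schema_object, last_used_timestamp)

    def get(self, schema_key, loader):
        """Return a pooled schema, loading it with loader() on a miss."""
        entry = self._entries.get(schema_key)
        schema = entry[0] if entry else loader()
        self._entries[schema_key] = (schema, time.time())
        return schema

    def evict_unused(self):
        """Drop schemas not used recently; call this from a periodic background job."""
        cutoff = time.time() - self.max_idle_seconds
        for key in [k for k, (_, used) in self._entries.items() if used < cutoff]:
            del self._entries[key]  # last strong reference gone -> eligible for GC

pool = SchemaPool(max_idle_seconds=1800)
schema = pool.get("account-123", loader=lambda: object())  # loader is a placeholder
pool.evict_unused()
```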

Big Data OLAP with Pentaho, Kylin and more… (Emilio Arias & Roberto Tardío)

Big data landscape

As Emilio and Roberto pointed out at the beginning of their talk, the big data technologies that have emerged in recent years allow us to process huge sets of data, in real time and from many sources, both internal and external to our organization. Thanks to these technologies, we can further improve our decision-making processes and therefore the performance of our business. However, choosing the most suitable stack of technologies and techniques for implementing a big data warehouse is often a problem that can determine the success of the project.

That's why StrateBI created a benchmark using Pentaho with Apache Kylin, Vertica and PostgreSQL. Emilio and Roberto showed how you can deliver successful BI/big data projects with huge amounts of data, using both Mondrian schemas and Pentaho Metadata Editor on a Hadoop cluster. They included a detailed benchmark with information on query performance across the various technologies. Furthermore, an online demo lab is available where you can see these technologies in action.

BUSINESS TRACK

A journey on Italian healthcare data (Giorgio Grillini & Virgilio Pierini)

Virgilio Pierini and Giorgio Grillini

Giorgio and Virgilio presented an innovative approach to analyzing and presenting medical data to key players in the healthcare system. Their customer, a company working for big pharma enterprises and the national healthcare system, wanted a data warehouse and reports on pharmaceutical usage data and therapeutic efficacy. Before, the company relied on a "human ETL": the CTO, who processed and analyzed the data.

Goals of the project were:

  • Bringing in the data (from 30 sources, each with its own characteristics)
  • Scalability and an SQL interface
  • A 40k budget
  • Providing a minimalistic (old-fashioned, basic) dashboard
  • Replicating existing paper-based reports
  • Maybe moving to the cloud

With the Pentaho-based solution it was possible, for example, to identify effective drugs for hypercholesterolemia at a moderate price. The pharma industry usually pursues a "the more expensive, the more effective" approach, so data-based analyses were needed. Challenges included choosing the right data visualization, which turned out to be targets and arrows. Regulations in the Italian pharma and healthcare sector also posed big challenges for the solution, as different district, regional and ministerial guidelines all required a different use and format of the data.

As for the cloud data, the "cloud mess" did not yet require GDPR compliance, since the project was started two years ago, but Italian and European privacy regulations did apply. So after reading a pile of papers and consulting a lawyer, the team ended up putting anonymized data in the cloud and leaving a decode table on the anonymizer workstation.

Bursting the black box: analyzing SAP data with Pentaho (Christopher Keller, Dr. David Hames)

SAP/Pentaho Connector, Christopher Keller

SAP is the standard ERP solution in many enterprises. Christopher and David demoed how, by combining it with Pentaho, users can get important insights from their data and create real business value. There have been several connectors for extracting SAP data and loading it into Pentaho, but all of them died off over the last years. it-novum had developed a first connector for SAP customers back in 2010, and Pentaho approached it-novum to ask for an update. Incorporating the feedback of SAP customers, new features were also added, enabling the integration of data from all SAP modules such as ERP, BW or HANA. Once installed, the connector imports data from SAP with a mouse click.

Christopher emphasized that analyzing SAP data is crucial for organizations, as some of their most important data is stored in SAP systems. However, SAP data isn't the easiest type of data: it has proprietary formats, cryptic abbreviations (remember, it's a German software!) and is transaction-oriented. Also, the learning curve for developers is very steep, making the hiring of skilled SAP/ABAP developers very costly; and they usually aren't data integration experts, by the way. To get the big picture of an organization's situation, it's crucial to blend SAP data with data from other sources. This is why the SAP/Pentaho Connector was developed.

In this case, Pentaho's strength lies in integrating and blending data with other information to analyze it across the entire enterprise. The SAP/Pentaho Connector closes the gap between the huge but very closed SAP world and the world of analytics, bringing the capabilities of Pentaho to the managers and business analysts who need to analyze their SAP data for strategic and tactical decisions. Pentaho provides a user-friendly and flexible analytics frontend. The connector provides seamless integration with SAP ERP and SAP BW and supports metadata injection. In terms of SAP steps, it supports SAP ERP Table Input, SAP BW/ERP RFC Executor and SAP BW DSO Input; more SAP steps are planned.

ETL for business development – easier, better, faster, stronger (Riccardo Arzenton)

Riccardo Arzenton

Especially when dealing with ETL processes, the business value is not always clear to customers. Riccardo's use case was a great demonstration of how ETL can add huge benefits: he presented an ETL project for Dental Trey, an Italian dental supplier. The company provides custom management software for dental practices and generates 40 million euros in revenue with 104 employees, 100 sales agents and 10 branches. Dental Trey wanted to increase its market share by offering automated migration of data from competitor software. In this case, the ETL provided by Pentaho was a selling feature and a huge added value for the business.

Using Pentaho Data Integration gave Dental Trey much more data than they expected so that they could offer extra data to their customers. Besides this, they profited from the following benefits:

  • Easier: Dental Trey had a similar internal procedure implemented in VisualBasic. Pentaho is the right choice to simplify and modernize such procedures.
  • Better: the entity migration approach (Patient → Medical Records → Price List → Invoices) is simpler and more efficient. Skipping the migration of invoices is as simple as deleting an arrow in a Pentaho job.
  • Faster: with the old VisualBasic procedure migrating a database with 7,000 patient records took 5 hours, whereas with Pentaho it takes only 5 minutes.
  • Stronger: Pentaho is really loved by Dental Trey. They see it as a great open source tool that outclasses previous systems.

openLighthouse: ITSM analytics with Pentaho (Stefan Müller)

Stefan Müller

The days of highly integrated IT organizations are long gone. Today, companies buy large parts of the IT value chain externally. This confronts IT managers with a great number of suppliers and individual service agreements. In order to obtain a holistic view of the central KPIs (availability, reaction times, etc.), a wide variety of systems must be used as data sources. However, most IT service management solutions lack real analytics features and only allow analysis of their own data.

As it-novum not only provides data analytics solutions but also IT service management services, we have developed an ITSM analytics solution called openLighthouse. Based on Pentaho (mainly PDI), openLighthouse is a highly flexible analytics software that can integrate different internal and external data sources from ticket systems, network monitoring, IT documentation etc. Pentaho dashboards and self-service analyses provide the necessary decision support for IT managers and CIOs. openLighthouse currently provides a module for the ticket system OTRS; more modules for other helpdesk systems, monitoring etc. are planned.

Enabling business users with webSpoon (Bart Maertens)

Kettle is not always the key…

How can you provide your business users with better self-service ETL? Bart showed a solution based on webSpoon. Business users have different needs than tech-savvy power users or developers, and webSpoon is a powerful tool for equipping them with ETL capabilities.

But why not just use Kettle? As Bart emphasized, Kettle is an awesome tool but it has its limitations, and there are other great tools in the Pentaho ecosystem that have been around for a while and are simple to use. webSpoon offers a central DME (Data Manipulation Environment) and a central DCE (Data Configuration Environment), gives secured, personalized access to ETL code and data, and provides integrated SpoonGit for version management. And it can be used in the cloud, which helps to reduce costs, increase scalability, provide global availability and enable automation with everything as code.

Goals of the project are:

  • Automation: automate everything (no manual actions!)
  • Central data, version management, installation, configuration, ETL code…
  • Integrated security: authentication, authorization
  • Cloud first
  • Roll out new environments in minutes

At the moment, the project has been rolled out to two customers, with more customers getting ready for it. Next steps are to implement it as the default internal environment, enhance security, support GCP and Kubernetes, add a configuration UI and, of course, achieve world domination 🙂

Real time streaming with Raspberry PI and PDI (Håkon Bommen)

Håkon introduced real-time communication with Kafka and PDI. In the music streaming sector there is a lot of data around: if you have millions of users streaming music on their smartphones, you need the right technology to make sure all plays are logged correctly. Kafka was chosen for this, as it offers the required write performance, reliability and scalability.

In the talk, Håkon showed a small demo using a Raspberry Pi to produce messages. It is a small computer that costs $35, has the size of a credit card and can run a full system like Linux. The message service is Apache Kafka, a distributed streaming platform developed at LinkedIn and used by many big companies including Twitter, Netflix and Airbnb. The message consumer for the demo is PDI, which in the test simply writes the messages to a text file.

In the live demo we followed the first couple of steps of the Apache Kafka quick start guide to set up a single-node Kafka network. We then looked at the PDI Kafka Consumer step, setting up the necessary configuration, and at the micro transformation that does the actual processing. A small Python script extracted the acceleration measured by the Raspberry Pi unit, printed it to the screen and sent the same data to the Kafka network. Finally, when comparing the direct output with the PDI output, we saw a slight delay because PDI was configured to process rows in batches of ten.
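
The original script was not published, but the producing side can be approximated with kafka-python as follows; the broker address, topic name and the accelerometer read are placeholders, so substitute the real sensor API of your board or HAT.

```python
# Approximation of the demo's producer side (not the original script): read an
# acceleration value on the Raspberry Pi and publish it to Kafka, where a PDI
# Kafka Consumer step can pick it up.
import json
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

def read_acceleration():
    """Placeholder for the actual sensor read on the Pi."""
    return {"x": random.gauss(0, 0.1), "y": random.gauss(0, 0.1), "z": random.gauss(1, 0.1)}

while True:
    sample = read_acceleration()
    print(sample)                                # direct output, as in the demo ...
    producer.send("acceleration", value=sample)  # ... and the same data into Kafka
    time.sleep(0.5)
```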

Self-service dashboards with Ctools layout editor (Nuno Pereira)

Nuno Pereira

Nuno from Hitachi Vantara introduced a self-service dashboard solution built with Ctools. It was created for a telematics customer collecting vehicle offense data such as harsh driving events or speeding events. To visualize this data, Hitachi created a self-service BI platform providing

  • pixel perfect dashboards
  • easy to use and re-use
  • widgets management
  • data multi-tenancy

The layout editor framework they created suits the needs of different customers, stories and solutions, and it also suits the different needs of users. Whereas a tech user wants to develop new widgets and re-use them, customize widgets and grow the widget library, the non-tech user wants to create a dashboard, have access to a widget library for composing dashboards, manage their dashboards (removing widgets etc.) and re-use existing widgets. By using Ctools it was possible to provide both user groups with the tools and data visualizations they need.

Smarter cities with Pentaho (Gianluca Andreis)

Gianluca Andreis

How do we sculpt our future with the technologies available? Gianluca discussed this question using the smart city concept. A smart city is a subset of smart spaces, i.e. areas where humans and machines interact. These are complex ecosystems composed of multiple technological layers, from physical infrastructure to IT, from service operations to digital networks, up to the "human" layer made up of people and organizations.

Smart cities have countless use cases, ranging from public safety to enhancing the city experience for both residents and tourists, but ultimately focusing on things that improve operations and efficiency, such as infrastructure automation and cross-organizational data sharing and analysis.

Use case 1: City of Las Vegas

The biggest challenge for the city of Las Vegas was adapting to the increased demand for mobility and infrastructure that tourism, growth and business bring. In this scenario Hitachi deployed smart cameras, edge gateways, Pentaho, Hitachi Visualization Suite and Hitachi Video Analytics. Combined, these tools allowed for data gathering and enrichment, ultimately giving the municipality the ability to understand and act based on the analytics generated by Pentaho.

The solution covered the following use cases, fueled by Pentaho dashboards:

  • Cyclist Counting: To determine the best placement for bicycle lanes and adding safety measures
  • Activity Analysis at crossings: to understand both vehicle and pedestrian behavior and to optimize public transportation, safety, sidewalks and road expansion, plus vehicle stop times to reduce carbon emissions
  • People and Vehicle Analysis: to determine which roads to expand or modify where especially heavy traffic was reported
  • Parking Space Analysis: to expand parking areas, evaluate parking time, and provide real time information to citizens and visitors

All of the above data was also used to determine the best way to carry out road maintenance and modifications while limiting the issues this causes.

Use case 2: State of Andhra Pradesh

The Andhra Pradesh region in India is facing the big challenge of accommodating over 80 million citizens in an area smaller than Italy. Andhra Pradesh is experiencing impressive growth and a movement of the population from rural areas towards the cities. One of the goals was to ensure maximum efficiency in several processes, among them meeting the increased demand for food and thus agricultural development.

Three areas were tackled:

  • Agricultural development: by monitoring soil health, crop distribution and water/rain/irrigation levels
  • Social Services: by monitoring the distribution of civil supplies card to the poorest population strata
  • Public Safety: by overseeing public service vehicles (ambulances, fire engines, police cars) at a tactical level and by monitoring cameras at critical infrastructure locations

The recipe for success was tying the underlying hardware, sensors and cameras to Hitachi Visualization Suite for situational awareness and to Pentaho for producing ad hoc dashboards for data analysis. In addition, Pentaho was specifically used to correlate soil nutrient levels with the income and expenditure generated. The outcome was the creation of a dedicated center (Real Time Governance) where all the above solutions are monitored by government officials.

In the future, Hitachi plans to integrate Pentaho with 3D LiDAR sensors for 3D spatial object analysis and with cognitive AI video attribute search.

We ultimately came to four conclusions (taken from the slides):

  • We don’t have to let Big Data sleep! We need to aggregate, analyze and produce outcomes
  • We have to think big and scale up our use cases, from buildings to neighborhoods to cities to states
  • Analytics allows us to shape a better world; the use cases we saw were a clear example of how we can improve people's lives
  • There are thousands of use cases out there for smart cities and Pentaho, and we have to go get them!

Accelerating Neo4j implementations using Kettle (Jan Aertsen)

Jan from Neo4j demonstrated that the graph database Neo4j and Kettle/Pentaho Data Integration work together very well. Typical use cases include fraud detection and analysis, network monitoring and optimization, recommendation engines, configuration management, data lineage and impact analysis, and so on.

With respect to Kettle, Neo4j use cases are "analytical" by nature and, like any type of analytics, require data from various sources. Architecturally, Neo4j often sits next to other databases, doing new things with the same data. It is extremely good at entity resolution and can complement Kettle, for example for impact analysis. Jan completed his presentation with a demo of Kettle and Neo4j in action.
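
Jan's demo used the Kettle/Neo4j plugins themselves; as a rough stand-in, the same load pattern can be sketched with the official Neo4j Python driver. Connection details, labels and the Cypher below are illustrative only.

```python
# Hedged sketch: pushing ETL rows into Neo4j with the official Python driver.
# The Kettle plugins shown at PCM do this with dedicated steps; this is only the idea.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

rows = [  # imagine these arriving from a PDI transformation
    {"customer": "ACME", "product": "Widget", "amount": 120.0},
    {"customer": "ACME", "product": "Gadget", "amount": 75.5},
]

with driver.session() as session:
    for row in rows:
        session.run(
            "MERGE (c:Customer {name: $customer}) "
            "MERGE (p:Product {name: $product}) "
            "MERGE (c)-[b:BOUGHT]->(p) "
            "SET b.amount = $amount",
            **row,
        )
driver.close()
```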

BNova DFC: a real-time tool for monitoring and managing PDI flows (Stefano Celati)

Stefano Celati

Bnova is one of the oldest Pentaho/Hitachi partners in Italy and has been working with Pentaho since 2009. Stefano presented BNova DFC, a Pentaho plugin developed with App Builder, a tool developed by Pentaho that facilitates building applications for Pentaho and is based on Ctools. DFC was developed for monitoring PDI flows in complex environments, for example:

  • when a Pentaho job fails
  • when a Pentaho job is taking too long
  • when you don't have direct access to the production environment

To avoid messing with the code, DFC provides a web application that gives an overview of the scheduled ETL processes, a 360° view of your ETL. Natively integrated with Pentaho, it can be accessed from within your Pentaho system. Designed as a dashboard, it shows jobs and scheduled transformations and highlights anomalies, malfunctions and the performance of single steps. DFC also integrates Vertica, Cloudera and other databases, so that not only single ETL flows or groups but entire environments can be monitored.
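
DFC itself is an App Builder/Ctools plugin, but the kind of checks such a dashboard builds on can be expressed as simple queries against PDI's job log table. The table and column names below follow PDI's usual defaults when logging is configured (JOBNAME, STATUS, ERRORS, STARTDATE, ENDDATE), so treat them as assumptions to verify against your own logging setup.

```python
# Sketch only: flag failed and long-running jobs from an (assumed, pre-existing)
# PDI job log table. sqlite3 stands in for the real logging database connection.
import sqlite3

conn = sqlite3.connect("pdi_logging.db")  # placeholder path; table must already exist

failed = conn.execute("""
    SELECT JOBNAME, ENDDATE, ERRORS FROM JOB_LOG
    WHERE ERRORS > 0 AND ENDDATE >= datetime('now', '-1 day')
""").fetchall()

running_too_long = conn.execute("""
    SELECT JOBNAME, STARTDATE FROM JOB_LOG
    WHERE STATUS = 'running' AND STARTDATE <= datetime('now', '-2 hours')
""").fetchall()

for jobname, enddate, errors in failed:
    print(f"ALERT: {jobname} ended {enddate} with {errors} error(s)")
for jobname, startdate in running_too_long:
    print(f"WARNING: {jobname} has been running since {startdate}")
```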


Stefan Müller - Director Big Data Analytics
After several years working in governance & controlling and sourcing management, Stefan Müller joined it-novum, where he built up the Big Data Analytics department. Stefan's heart beats for the possibilities offered by the Pentaho and Jedox BI suites, but he also works with other open source BI solutions. He regularly shares his enthusiasm for business open source in the field of data intelligence through articles, statements and talks.
Stefan's web profiles: Twitter, LinkedIn, XING