10 Years Pentaho Community Meeting
200 people from almost 40 countries came to Mainz, Germany, for the 10th Pentaho Community Meeting. I didn't believe this day could come true, but all of a sudden it was there! Between the old walls of Kupferbergterrassen and a huge sculpture of Bismarck’s head, we celebrated ten years of a great user and developer community with a community meeting bigger than ever before.
24 speakers from all parts of the world presented their newest developments and use cases and shared them with the community. The first Pentaho Community Meeting was omnipresent, at least as the tablecloth on the table that carried the memorabilia for the action and charity run tonight. It's unbelievable: besides Jens and Pedro, five other attendees had already participated in the first community meeting. Most attendees might have wondered what implications the rebranding of Pentaho to Hitachi Vantara will have for the Pentaho community. To say it with Pedro: „This!“ The fact that the Pentaho Community Meeting was taking place showed that Hitachi Vantara is putting much focus on the community, and Pedro expects more community meetings to come.
In his welcome speech, Jens Bleuel (btw, he comes from and still lives and works in Mainz) pointed out that when he joined Pentaho he had many doubts about whether it was the right decision: could a US company with a hire-and-fire mentality fit with his life situation at the time (two little children)? Some years later, when he was organizing the first community meeting, he found that he had made the right choice. At that time, PCM had 38 attendees, and when visiting the venue and seeing the huge Bismarck hall at Kupferbergterrassen, Jens thought „What if one day we could fill this one?“ Well, nine years later here we are: more than 200 participants from three dozen countries easily fill the hall.
What's new in Pentaho 8.0?
Pedro gave a short introduction to Pentaho 8.0. One of the bigger features of the release is worker nodes for scaling out. You can now execute jobs anywhere, not only locally. The goal is to extend this execution option to everything in Pentaho. On the Pentaho Data Integration side, the new version brings streaming support, run configurations for jobs and data explorer filters (enhanced data inspection features, numeric filters, string filters, inclusion/exclusion options for data points). Jens gave more details on that in his talk.
Scaling out the Pentaho platform in large enterprise operations: Worker Nodes
Jens dug a little deeper into the feature, though probably not as much as he would have liked to 🙂 Worker nodes can scale work items across multiple nodes (containers), e.g. PDI jobs and transformations, report executions… They operate easily and securely and have an elastic architecture. Large enterprises need the ability to seamlessly and efficiently spin up resources to handle 100s+ work items at different times, with different dependencies and processing requirements. Worker nodes address these needs. The difference from AEL and Hadoop MapReduce is that those only scale out on data, whereas worker nodes scale out on processes (PDI jobs, PDI transformations etc.). You can even combine the two architectures to use both of them in an advanced way and get a better ROI.
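To make the idea of scaling out on processes a bit more tangible, here is a minimal Python sketch. It is not Pentaho's worker node API: the work item names, `run_work_item` and `scale_out` are all hypothetical, and a thread pool merely stands in for a set of worker nodes. The point is only that whole work items (a job, a transformation, a report run) are handed to whichever worker is free, rather than one data set being split up.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical work items: each one stands for a complete job,
# transformation or report execution, not for a slice of data.
WORK_ITEMS = [
    "job_load_sales",
    "transformation_clean_customers",
    "report_monthly_kpis",
]


def run_work_item(name: str) -> str:
    """Stand-in for executing one complete work item on a worker."""
    return f"{name}: done"


def scale_out(items, worker_count=2):
    """Distribute whole work items across a fixed pool of workers."""
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        # map() returns results in input order even though the
        # items may run concurrently on different workers.
        return list(pool.map(run_work_item, items))


results = scale_out(WORK_ITEMS)
print(results)
```

Raising `worker_count` in this sketch corresponds loosely to spinning up more nodes when many work items arrive at once.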
What’s brewing in the kettle?
As usual, Matt’s speech included a lot of code. He showed new things from the Pentaho labs like the streaming plugin. Many things are currently going on in the Kettle kitchen that was set up more than 10 years ago; obviously it is still cooking vigorously. As many people at Hitachi Vantara are using PDI, Kettle is far from becoming outdated. There are lots of plans for using PDI with new technologies like blockchain or Kubernetes. Matt recalled how 9 years ago there were many topics on the wishlist; today, most of them have come true. Matt encouraged people to use the Pentaho Community Meeting to give feedback and share new ideas on PDI.
Pentaho Community in Hitachi Vantara Community
Jill Ross, Community Manager at Hitachi Vantara, introduced Hitachi Vantara Community. Hitachi Vantara Community is a social collaboration network that was created in 2013 to give employees, customers, partners and prospects the possibility to interact, ask questions, join discussions, share knowledge and learn from people at Hitachi Vantara. Its purpose is to get answers, discover insights and make connections, and there are many technical discussions going on in which solutions and use cases are shared. The Pentaho community is now integrated into Hitachi Vantara Community, and so should be discussions about problems and bugs. As the community is huge, there are different places where people can gather to share and create content and jointly work on solutions. You see, there are quite a few benefits to joining the community (btw: we created a page for Pentaho Community Meeting 2017 there two months ago!).
CERN: Pentaho in a huge international organization
Expectations were high when Jan Janke entered the stage, and they even increased with his fancy animated galaxy start screen. CERN is the world’s biggest particle physics research organization, funded principally by its member states. Currently 2,300 staff members, 1,600 fellows, students and associates as well as more than 12,000 visiting scientists are working at CERN. The laboratory’s main purpose is to study sub-atomic particles with the goal of better understanding the universe. To do this, CERN runs particle accelerators, among them its flagship, the LHC (Large Hadron Collider). While enabling researchers from all over the world to conduct particle physics experiments, CERN is also the birthplace of the World Wide Web, is at the forefront of developments in areas such as high precision welding or medical imaging and has put in place a world-spanning computing grid (which existed before the “cloud”).
Running an organization such as CERN creates its own business challenges. Pentaho is used in many different areas by the Administrative Information Systems (AIS) group. It has become the backbone of its BI, reporting and analytics infrastructure. Pentaho Analyzer is used by business experts from various domains. It is especially appreciated for its ease of use and its capability to provide fast insight into various business processes in an ad-hoc manner. Because it is so easy to get going with Analyzer, the users’ demands are also very high. Providing consistently good performance is still an area that CERN’s engineers are working on.
As highlighted during the talk, CERN is located on top of the Franco-Swiss border, and as Switzerland is not part of the EU, the organization faces many different customs procedures on a regular basis. A significant number of the required forms are generated with the help of Pentaho Report Designer (PRD). Furthermore, CERN leverages Pentaho’s APIs and the ease of embedding it within its own applications to provide all kinds of reporting self-services to its employees and users.
CERN is not (yet) using Pentaho for big data. It is, however, used for data integration, reporting and ad-hoc analytics, operating on data provided by a data warehouse that is kept up to date in near real time. As CERN is an international organisation, it handles a significant amount of sensitive data not usually found in regular companies. Examples are social security, childcare, health insurance or even medical data. Such sensitive data is subject to many restrictions, and CERN needs to put a particular focus on data protection.
In addition to Analyzer and PRD reports, CERN is also a heavy user of CTools. The CTools are used to build dashboards and KPIs for the different management layers. Pentaho is mostly embedded in CERN’s applications, so not all of the 15,000 potential users at CERN log directly into Pentaho’s User Console. The organization also relies heavily on Pentaho Data Integration (PDI) for its data warehousing ETL processes. During the Q&A, Jens of course wanted to have more details about CERN’s ETL processes. Jan confirmed that they use PDI a lot but „ETL is just assumed to be working“.
Using Pentaho in the public health system
Mozambique is a country facing many challenges, and Devan Manharlal started his talk with some of them: 7,800 nurses are serving 27 million Mozambicans. In Germany, there are 1 million (!) health workers for a population of 85 million. Mozambique is among the bottom 10 countries on WHO’s 2010 list of nations with serious health workforce shortages. To help address these issues, it was important to make health data available for analysis, in order to better understand the current situation and develop solutions (where do nurses live and where are facilities in need of nurses located?).
To do so, the Ministry of Health has developed a national human resources for health information system (HRIS or eSIP-Saude) to improve the planning and management of healthcare workers countrywide. As Mozambique has a lot of different HRIS systems, software and other data sources that aren't connected with each other, the goal was to establish an integrated data system.
Making SAP data accessible: Pentaho/SAP
74% of all transactions worldwide touch an SAP system. But working with SAP is not easy: migrating or exporting SAP data is almost impossible (e.g. if you want to use it for strategic decisions), data quality is often poor, and so on. When it comes to data integration, SAP has a proprietary format and structure and a steep learning curve for developers, so SAP programmers can be a “bottleneck” in an organization, as they're mostly not nearly as available as you would like them to be…
So, everyone knows the “solutions” coming out of this: hand-coded applications that are often limited to single individuals or departments and can't be used anywhere else in the organization. And for developing them you rely on SAP programmers or individuals who can't be easily replaced. Pentaho’s open APIs allow for integrating SAP data, but only to a limited extent, as existing solutions like the SAP Input Step haven't been updated for a long time. Others are limited to HANA, and so on.
That's why it-novum has developed the Pentaho/SAP Connector. The connector offers a broad spectrum of functions for processing SAP data and supports the current PDI version, metadata injection, using variables, filter functionality, selecting fields, mapping SAP/Java data types, SAP table read and BAPI querying. Backed by a smartphone hotspot, Alex Keidel then ran a demo on how to integrate SAP BW data with the connector. This raised the question of why software that is used around the globe still sticks to German abbreviations for KPIs: surely hardly anyone outside Germany is familiar with „Vertriebsbelegposition“.
Integrating environmental data with Pentaho
The next presentation came from the Czech Republic and covered the topic of environmental information systems. Kamil Nesetril from Liberec Technical University presented the Hydrogeological Information System that is based on Pentaho. dataearth.cz uses the entire BI stack of Pentaho. This way it brings BI tools and concepts to the world of groundwater and environment, which only knows Excel for working with data. According to Kamil, the challenges lie less in data volume (big data) than in the diversity of data, as a lot of semi-structured and long-tail data has to be processed. Kamil impressed the audience by remarking that he is not only using all the tools of the Pentaho platform but also many different environmental software solutions. Isn't that a heavy user?!
Migrating from Microsoft to Pentaho
A special case of data integration was at the center of the next talk: Bart Maerten from know.bi presented the migration from Microsoft to Pentaho at Essent Belgium. Previously, Essent used Provision, an on-premise Microsoft solution, but wanted to shift to a cloud-based approach.
NOVA, the new system, is based on open source technologies and AWS. It had to be easy to roll out to other sites and group companies, which is why Essent chose a cloud solution. Other reasons were scalability, efficiency and costs. The project started in 2015 with selecting the platform (Pentaho won over Jasper). During 2016 the environment was built and then the migration started, which posed some challenges. At the end of February, the first B2B platform was released, with the B2C version planned for the beginning of next year.
The project team consisted of a traditional BI part with analysts, developers for the ETL processes, reports and dashboards, and the project owner/manager. On the analytics side, four people (data scientists and software engineers) joined. The new solution is based on AWS and Pentaho and uses the ETL, Analyzer, Dashboard Designer and Reporting modules of Pentaho. Concerning AWS DMS, Bart only commented „it is quite ok if it works“.
NOVA loads data from more than ten systems into a landing area. Here, ETL processes transform the data and load it into a Postgres database before it is transferred to the data warehouse. Data visualizations and analyses are run on the logical area that sits between these steps.
Lessons learnt included the surprising finding that starting a BI development without real-life data is hard. Also, AWS sometimes bites: it definitely has a lot of advantages but also some failures, and it took some time to solve them. The agile development method the team had adopted at the beginning eventually paid off, but it needs to be done well.
A data analytics app based on Mondrian
Raimonds Simanovskis started his presentation with the surprising fact that it was painter Piet Mondrian who back in 1900 created the first data visualizations and called them „Tableau 1“. Not convinced? Well, that's what happens when a 21st century developer visits German museums… Raimonds introduced eazyBI, an easy-to-use BI application designed for Jira customers to analyze data and build reports, charts and dashboards from Jira, Jira add-ons, and other data sources. More information in this interview with Raimonds and on eazybi.com.
Running Pentaho without developers
Balázs Bárány presented a pure business use case, Austria’s first scooter sharing service SCO2T. The sharing platform works like common car sharing services and relies exclusively on open source technologies. SCO2T was founded in 2014, with the first scooters rented in early 2015. Today, it offers 90 scooters that serve 6,000 customers. The interesting thing about the project is that Balázs is not a developer but a data scientist without profound programming skills. He therefore has a different and interesting view on Pentaho.
When setting up SCO2T, Balázs faced the usual challenges startups have to deal with, like a lack of manpower and money, little experience with doing business and sometimes having to invent things from scratch. Austria, and particularly Vienna, is a very complex regulatory environment for sharing services in the transportation sector. And, amazingly for an internet-based startup, the team did not have a software developer. So, Balázs decided to use a solution he already knew: Pentaho. As it is open source, it was also quite economical.
SCO2T uses almost all components of the Pentaho platform: Pentaho Data Integration serves purposes ranging from the application of business rules (registration confirmations, reminders, creation of line items for invoices, calculation of ride lengths and prices) to billing (invoice creation, data export for accounting and bank communication) and operations (vehicle position & state alarms, replication of data between databases). The BI/BA server is the base for the operational web apps like customer activation, vehicle tracking and management and data management, and for analytics about ride statistics, user statistics and KPIs. With Pentaho Report Designer they created the invoice template, and with Pentaho dashboards they visualize the data, e.g. to discover usage peaks (like in the summer). This also includes a cool map that shows the scooters currently in use or parked and supports decisions about where placing a scooter makes most sense.
For Balázs, the benefits of Pentaho are:
- Fast development of data handling processes (easy to debug, talks to everything)
- Fast development of dashboards and „web applications“ (standard open source technologies, easy to extend)
- More than just a BI platform!
For SCO2T’s CEO, the advantages of Pentaho are:
- Compared to other companies: fast development of new features, fast integration of change requests
- Easy to use, good user interaction
- Data export and visualization
- Disadvantage: noticeably slow performance compared to programmed (PHP) pages
As weak points of Pentaho, Balázs mentioned:
- Data Integration:
  - Startup messages not controlled by log level
- BI Server:
  - Missing documentation on CDF (one relies on community and blogs)
  - Behaviour changes between versions (map markers)
Balázs then recommended some technologies he likes using, e.g. PostgreSQL with PostGIS, PostgREST (automatic web API creation without programming) and Traccar for multi-vendor device tracking.
Pentaho 8 Reporting for Java Developers
Alfresco users know him well: Francesco Corti is Product Evangelist for the document management platform and very active in the Alfresco community. So, what is he doing at PCM?! Some of you might remember the book „Pentaho Reporting 3.5“, one of the indispensable items Pentaho developers used to have on their desks. This year, Francesco has updated the book and presented it to the Pentaho community. In the book, developers will discover how easy it is to embed Pentaho Reporting into their Java projects or use it as a standalone reporting platform.
The book is written for two types of professionals and students: information technologists with a basic knowledge of databases and Java developers with medium seniority. The content primarily covers the technical topics of environments, best practices and source code, to enable the reader to assemble the best reports and use them in an existing Java application.
All topics are covered using a “learning by example” approach, thanks to a public repository available on GitHub. The repository contains dozens of examples explained one by one in the book, covering the most valuable features and best practices in developing reports using the standalone designer (called Pentaho Report Designer) or programmatically (through fully featured Maven projects developed using Java).
Automatic Machine Learning (AutoML) and Pentaho help to leverage Machine Learning
A very hot topic was presented by Caio Moreno de Souza: AutoML and Machine Learning. The demand for machine learning experts has outpaced the supply. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts and experts alike. AutoML software can be used for automating a large part of the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time limit.
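The core loop behind "training many models within a user-specified time limit" can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not how H2O AutoML or any other tool from the talk works internally: the two candidate models, the toy data and the function names (`fit_mean`, `fit_linear`, `auto_select`) are all made up for the example.

```python
import time


def fit_mean(xs, ys):
    """Baseline candidate: always predict the mean of the training targets."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_linear(xs, ys):
    """Candidate: one-variable least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def mse(model, xs, ys):
    """Mean squared error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def auto_select(candidates, train, valid, time_limit_secs):
    """Train candidates one after another until the time budget is used up,
    and return the name and validation error of the best one seen so far."""
    deadline = time.monotonic() + time_limit_secs
    best_name, best_error = None, float("inf")
    for name, fit in candidates:
        if time.monotonic() >= deadline:
            break  # budget exhausted: keep the best model found so far
        model = fit(*train)
        error = mse(model, *valid)
        if error < best_error:
            best_name, best_error = name, error
    return best_name, best_error


# Toy data following y = 2x + 1 (illustrative only).
train = ([0, 1, 2, 3], [1, 3, 5, 7])
valid = ([4, 5], [9, 11])

best_name, best_error = auto_select(
    [("mean", fit_mean), ("linear", fit_linear)],
    train,
    valid,
    time_limit_secs=5.0,
)
print(best_name, best_error)
```

Real AutoML tools add hyperparameter tuning, cross-validation and model ensembling on top of this basic search-under-a-budget idea.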
In his presentation and live demo Caio demonstrated the process of how AutoML open source tools together with Pentaho can help customers to save time in the process of creating a model and deploying it into production. The tools used in the live demo were PDI, H2O AutoML, and AutoWeka. Besides AutoML, Caio is also working on a project called MinervaAutoML.
Pentaho Standardised Git Repository Setup
In the next talk, Diethard Steiner emphasized the importance of standards: in big projects with multiple teams and even in small projects, „standards are key“. Make everything as simple as possible and keep it consistent. The „Pentaho Standardised Git Repo Setup“ utility tries to combine these two concepts and creates a basic Git structure. A version control system like Git is essential to every setup.
This implementation does not require any additional packages (it only relies on bash, which is usually installed by default). It is one way of putting the outlined theory into practice. The main goal is to give teams a starter package in the form of a predefined Git folder structure, with checks in place to ensure that a minimum set of rules is followed.
The code is available on Diethard’s webpage. The main functionality is implemented, but there are further aspects like CI that are at very early stages. Currently this project is in the early alpha stage, so Diethard appreciates any feedback.
Continuous integration with PDI
Slawomir Chodnicki gave an overview of the benefits of automated testing, defined different types of ETL tests and showcased techniques for implementing an ETL test suite. More information is available in the articles “Testing strategies for data integration” and “Continuous integration testing for Pentaho Kettle projects”.
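As a rough illustration of what one automated ETL test can look like, the Python sketch below feeds a fixed input through a transformation and compares the result with a known expected output. It is an assumption-laden example, not the approach shown in the talk: `clean_customers` is a hypothetical stand-in for a PDI transformation, and in a real Kettle project the test would run the actual transformation instead of a Python function.

```python
def clean_customers(rows):
    """Hypothetical stand-in for an ETL step:
    drop rows without an id and normalise the name field."""
    cleaned = []
    for row in rows:
        if row.get("id") is None:
            continue  # rows without a key are rejected
        cleaned.append({"id": row["id"], "name": row["name"].strip().title()})
    return cleaned


def check_transformation(transform, input_rows, expected_rows):
    """Run the transformation on fixed input and compare with the expected output."""
    actual = transform(input_rows)
    return actual == expected_rows, actual


# Fixed test fixture and the output we expect for it (illustrative data).
INPUT_ROWS = [
    {"id": 1, "name": "  ada lovelace "},
    {"id": None, "name": "ghost"},
    {"id": 2, "name": "GRACE HOPPER"},
]
EXPECTED_ROWS = [
    {"id": 1, "name": "Ada Lovelace"},
    {"id": 2, "name": "Grace Hopper"},
]

passed, actual = check_transformation(clean_customers, INPUT_ROWS, EXPECTED_ROWS)
print("PASS" if passed else f"FAIL: {actual}")
```

A continuous integration server would run a collection of such checks on every commit and fail the build as soon as one of them reports a mismatch.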
WebSpoon and SpoonGit for Pentaho PDI
Hiromu Hota presented exciting stuff: webSpoon and SpoonGit for Pentaho PDI. His talk received a standing ovation, as SpoonGit allows you to manage versions of local Kettle files without leaving Spoon. In addition to Git, Subversion is also supported. Hiromu was kind enough to immediately tweet his slides. Thank you, Hiromu!
Pentaho CDE NewMapComponent
Creating richer visualizations sometimes means including maps on the dashboards as well. The CDE NewMapComponent is a very flexible component that allows the user/developer to add a map and show geo-referenced data on top of it. There are many features and customizations that one can use, but some of those features and configurations are poorly understood or not known at all. In his presentation, Kleyson Rios explained the proper use of the component, how to customize it and how to deal with its more complex configurations.