Monitoring

More control: Challenges and best practice in monitoring

IT has become a fundamental component of practically every business process. Because of this, monitoring and reporting requirements have also changed dramatically.

Internal IT departments have long since become IT service providers that ensure other departments and business areas have the infrastructure and applications they need.

In daily operations, the monitoring and recording of performance data from servers, switches, routers and individual services is only part of the challenge: It is now largely taken for granted that IT departments will ensure stability and proactively provide early notification of errors so that outages can be avoided.

The real challenge today is being able to holistically monitor services and make their availability something that can be easily analysed and evaluated.

A good example of this is email. A variety of components are involved in providing this service: one or several mail clusters, firewalls, infrastructure, clients and more must work together flawlessly so that users can send and receive their mail. Moreover, it is not enough to monitor the components individually and be informed about their performance. Rather, their relationships with each other must be represented as a combined, total service using logical AND / OR / MINIMUM operators.
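To make the operator idea concrete, the following is a minimal Python sketch of how component states could be combined into a single service state. The function names, the component groups and the "at least two mail nodes" rule are illustrative assumptions, not the way any particular monitoring product implements correlation.

```python
from typing import Iterable


def all_up(states: Iterable[bool]) -> bool:
    """AND: the group is up only if every component is up."""
    return all(states)


def any_up(states: Iterable[bool]) -> bool:
    """OR: the group is up if at least one component is up."""
    return any(states)


def min_up(states: Iterable[bool], minimum: int) -> bool:
    """MINIMUM: the group is up if at least `minimum` components are up."""
    return sum(1 for state in states if state) >= minimum


def email_service_up(mail_nodes, firewalls, network_up):
    """Hypothetical email service built from three component groups."""
    return all_up([
        min_up(mail_nodes, 2),  # assumption: two mail nodes are needed
        any_up(firewalls),      # assumption: firewalls are redundant
        network_up,             # the network is a hard dependency
    ])


# One mail node and one firewall are down, yet the service is still up.
print(email_service_up([True, True, False], [False, True], True))   # True
# Two mail nodes are down, so the MINIMUM rule fails.
print(email_service_up([True, False, False], [True, True], True))   # False
```

The point of the sketch is that the service state is derived from the relationships between components, so a single failed node does not automatically count as a failed service.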

Using this sort of comprehensive monitoring, an appropriate evaluation of the services as a whole can be carried out, thus enabling their availability to be reviewed in line with agreed SLAs.
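As a rough illustration of reviewing availability against an SLA, the sketch below computes an availability percentage from recorded downtime and compares it with a target. The 99.9% target and the 30-day reporting period are assumed example values, and the function names are hypothetical.

```python
def availability_percent(total_minutes: int, downtime_minutes: int) -> float:
    """Share of the reporting period in which the service was available."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def meets_sla(total_minutes: int, downtime_minutes: int, target_percent: float) -> bool:
    """True if the measured availability reaches the agreed target."""
    return availability_percent(total_minutes, downtime_minutes) >= target_percent


MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43200

# Assumed SLA target of 99.9% for a 30-day period.
print(meets_sla(MINUTES_IN_30_DAYS, 30, 99.9))  # True  (about 99.93%)
print(meets_sla(MINUTES_IN_30_DAYS, 60, 99.9))  # False (about 99.86%)
```

For a correlated service, the downtime fed into such a calculation would be the downtime of the combined service state, not of each component.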

Best Practice: Setting up monitoring the right way

The sensible starting point for any monitoring project is structured planning and a proper definition of requirements. This is where input from the various departments is required, as they are the only ones who can fully identify what is needed for their particular business areas. The monitoring specialist involved in the project can certainly provide direction and ideas, but it is not their place to decide what is important for individual monitoring demands.

Include all stakeholders

Another fundamental aspect involved in the monitoring process is identifying the expectations of the employees concerned. It is important to remember that the extent to which these expectations are fulfilled significantly affects the positive acceptance of the tools involved. Experience has shown that many projects fail because these expectations were not sufficiently taken into consideration.

Incorporating "islands"

How to deal with individual, stand-alone solutions within a monitoring project is often a sensitive issue. Even when encountering resistance, it should be clear from the outset that these "islands" of information must be integrated into a central monitoring plan. To achieve a realistic correlation, and thus a clear and well laid-out service monitoring process, the information must be centrally available in one system.

Identifying responsible parties

Once the above points have been settled, all participants should carefully plan the specific details of how the project is to be implemented. Insufficient planning will mean failure in the long term if the associated systems continue to grow to the point that they can no longer be serviced or no one feels responsible for them. To avoid this problem, it is strongly recommended that an individual be named as the person responsible for the further development of the project rather than its operational implementation.

Define the monitoring to be performed

After setting up the monitoring metrics comes one of the most difficult tasks in implementing a monitoring system: defining thresholds and alerts. Too many "false positives" will, in the long term, lead to messages not being taken seriously and will put the reliability of the monitoring system into question.
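One common way to approach thresholds is to separate a warning level from a critical level and to alert only when a problem persists across several checks. The sketch below shows this under stated assumptions: the threshold values, the retry count of three and the function names are hypothetical, and the numeric states follow the widely used Nagios-style plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
from typing import Sequence

# Nagios-style plugin states.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(value: float, warn: float, crit: float) -> int:
    """Map a measured value to a state; higher values are assumed to be worse."""
    if warn > crit:
        raise ValueError("warn threshold must not exceed crit threshold")
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def should_notify(recent_states: Sequence[int], attempts: int = 3) -> bool:
    """Notify only if the last `attempts` checks were all non-OK."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    if len(recent_states) < attempts:
        return False
    return all(state != OK for state in recent_states[-attempts:])


# Assumed thresholds for disk usage in percent: warn at 80, critical at 90.
spike = [classify(v, 80, 90) for v in (70, 95, 72)]
sustained = [classify(v, 80, 90) for v in (85, 92, 96)]

print(should_notify(spike))      # False: a single spike is not reported
print(should_notify(sustained))  # True: the problem persisted for three checks
```

Requiring a problem to persist trades a slightly later alert for fewer false positives. Where that balance lies is exactly the kind of decision that needs input from the departments concerned.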

Monitoring is a process

Finally, we come to one of the most important aspects of monitoring: developing the monitoring system further. A monitoring project is not a traditional project with a defined start and end; it evolves continuously. This should be taken into consideration when deciding who will be responsible for the project. In practice, monitoring usually grows around errors that occurred but went undetected, with each such error prompting a new or refined check. This is only logical, because only in this way can a suitable and reliable monitoring system be realised over the long term.

Practical example: Infrastructure monitoring at Europe's largest research association

The IT department of Europe's largest research association provides central IT services for all research sites both within the country and abroad. The more than 100 services it provides are monitored with openITCOCKPIT. Internally christened "ITStats", openITCOCKPIT not only monitors whether all services are running smoothly, it also generates monthly and quarterly reports, which are sent automatically to the individuals responsible for maintaining those services. Should an error occur, the system creates a ticket with the service desk.

The system currently comprises 774 hosts, 1211 services and 65 users, and covers 98% of the central IT services. Fraunhofer has been working with openITCOCKPIT since 2008. Over time, the Fraunhofer IT department has gradually developed the system so that the ITStats tile map clearly shows the central IT services at a glance. Using a traffic light system, employees can immediately recognise which of the central services is experiencing a problem. Because internal staff resources are limited, both the open-source nature of the software and the availability of professional support were important for Fraunhofer. For this reason, the system is jointly operated with the manufacturer, it-novum.

To improve the quality of error messages and to avoid annoying false positives, Fraunhofer relies on the correlation measurements in openITCOCKPIT. In addition, three individual measurements are carried out for each central IT service, and the system raises an error message only if at least two of them record an error. As a result of tightening the system in this way, the results have become more valid and more meaningful.
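The "two out of three" rule described above can be sketched in a few lines of Python. This is only an illustration of the voting logic; the function name and the way measurements are represented (True for a successful measurement) are assumptions, not the actual ITStats configuration.

```python
from typing import Sequence


def service_in_error(measurement_ok: Sequence[bool], min_failures: int = 2) -> bool:
    """Report an error only if at least `min_failures` measurements failed."""
    failures = sum(1 for ok in measurement_ok if not ok)
    return failures >= min_failures


# One of three measurements failed: treated as noise, no error message.
print(service_in_error([True, False, True]))   # False
# Two of three measurements failed: the error is reported.
print(service_in_error([False, False, True]))  # True
```

A single failing probe, for example one caused by a network hiccup at the measuring point, therefore no longer produces an alert on its own.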

The benefits of using modern monitoring software (based on the example of openITCOCKPIT)

  1. openITCOCKPIT 3 reduces the number of email and SMS notifications sent, thus minimising the occurrence of "false positives". This enables IT administrators to focus on actual critical errors.
  2. The individually customisable dashboards in openITCOCKPIT 3 support the creation of tailored views for all relevant staff, such as helpdesk teams and system admins. Even non-IT departments and the management team can have maps, applets and reports customised to their particular needs.
  3. In openITCOCKPIT, access to modules and clients can be controlled by using a finely-tuned role and authorisation management concept.
  4. openITCOCKPIT has a web interface that dynamically adapts to the device on which it is being used.
  5. This web-based administrative interface allows for the use of all available checks for Naemon, Nagios and Check_MK.
  6. With its event correlation module, openITCOCKPIT can monitor not only individual metrics, but can also combine these using logical links. This allows openITCOCKPIT to monitor (business) services rather than just individual services.
  7. Using the auto-reports module, availability reports can be created and automatically sent for individual and correlated services.
  8. Using the interfaces developed by it-novum for i-doit (CMDB) and OTRS (ticketing system), openITCOCKPIT is not only a monitoring solution but becomes a fully integrated ITSM solution.