October 7, 2008
ITIL Availability Management Process
Let’s take a couple of minutes and review the ITIL availability management process. The purpose of availability management is to provide a cost-effective and defined level of availability so the organization can depend on systems to reach the business objectives.
Availability management has two major actions:
- Proactively plan the availability of IT services as defined in service level agreements (SLAs), and monitor that availability.
An SLA is a document defining levels of service for a technology. This document provides the basis for managing the relationship between the IT organization and the customer. By meeting agreed-upon and understood SLAs, we give our customers the level of service they expect.
- Initiate changes to the infrastructure that increase stability and fault tolerance, ultimately preventing availability failures.
Availability needs to be consistently monitored. Additionally, improvements should be identified to increase the stability of the IT infrastructure.
When you are working with customers, here are some basic principles to help understand the goal of availability management.
- Availability is at the core of customer and user satisfaction.
- It is still possible to achieve customer and user satisfaction when things go awry.
- The improvement of availability is cemented in a foundational understanding of how IT services support the customer's business.
As with all processes there are inputs and outputs.
- Inputs – Service Level Agreements which illustrate the negotiated IT service requirements based upon data regarding the current availability of the IT infrastructure.
- Outputs – Events which are created as a result of activities performed within the availability management process. They contain specific information about outages that affect the availability of the IT infrastructure. The events should be reported and analyzed to improve availability.
Availability levels depend on the reliability of the IT infrastructure, its resilience to failure, and the quality of maintenance. To fully understand availability management planning and monitoring, you should be familiar with the following concepts:
- Availability - The ability of the IT service or component to perform its required function during a stated period of time.
- Reliability - The IT service is available for a negotiated period without interruptions or failure.
- Maintainability - The ability of an IT component to remain in or be restored to an operational state. With maintainability, it is the organization's internal IT staff members who are responsible for returning the component to an operational state.
- Serviceability - An external supplier's capability to maintain the availability of a component or function covered under a third-party service contract.
- Resilience – A measure of freedom from operational failure and a method of keeping services reliable. It minimizes the consequences of component failure; one popular method of resilience is redundancy.
- Security - The confidentiality, integrity, and availability of the data associated with an IT service.
Planning is a key component. The availability requirements of the business must be analyzed to assess how the IT infrastructure can deliver the required levels of availability. Planning involves designing for availability and for recovery from loss of service. Availability, reliability, maintainability, and serviceability requirements must also be determined, and the current IT infrastructure, including possible security concerns, should be analyzed.
Availability management also provides IT service continuity management with information from the infrastructure analysis. This information aids in the design for recovery by suggesting and confirming infrastructure capacities and components to support vital business functions.
A key part of availability planning is designing for availability. The following characteristics help identify areas of availability design that are most important to the organization:
- High availability – Refers to an IT service which minimizes the effects of IT component failure to the customer and users.
- Continuous operation - Masks the effects of planned downtime from the user, so planned events do not interrupt the service.
- Continuous availability - Minimizes the effects of all failures and planned downtime to the user.
Monitoring involves managing, measuring, and reporting on availability. Measurements and their related reports need to satisfy the IT support organization's needs, the users' needs, and the customer's needs. This includes monitoring the availability of IT services, reporting measurements to the users and customer, and continually striving to optimize availability.
Availability management also seeks to optimize availability levels. Availability plans, which detail the long-term, cost-effective plans for the proactive improvement of availability, help availability management accomplish this goal.
The availability management process uses a variety of methods to optimize availability within the IT infrastructure. Several of these methods are listed below:
- component failure impact assessment (CFIA)
- fault tree analysis (FTA)
- the CCTA risk analysis and management method (CRAMM)
- the IT availability metrics model (ITAMM)
The following methods can also be used to optimize availability:
- Systems outage analysis (SOA) - Based on the findings of other methods, SOA attempts to identify the underlying causes of service interruptions. Once these causes are identified, targeted improvements can be made.
- The expanded incident life cycle - This is a map of the primary stages that incidents move through. The map shows the duration of the downtime for each stage of a specific incident.
- Technical observation post (TOP) - A meeting where specialized technical support staff members focus on the specific aspects of IT availability. The purpose is to monitor events as they occur to identify improvement opportunities or bottlenecks within the IT infrastructure.
Calculations compose another key method in availability management. The results of several simple formulas produce information on component and total infrastructure availability.
One common calculation is the formula for service or component availability. This is calculated as a percentage and uses these values: the agreed service time (AST) and the unplanned downtime (DT) during the agreed service time. The AST is planned downtime subtracted from total service hours. To find the percentage of service or component availability, subtract the DT from the AST, and divide the result by the AST. Then multiply by 100.
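The calculation above can be sketched in a few lines of Python. This is only an illustration of the formula as described here; the function names and the sample figures (a 720-hour month with 20 hours of planned downtime and 7 hours of unplanned downtime) are assumptions made up for the example.

```python
def agreed_service_time(total_service_hours, planned_downtime_hours):
    """AST: total service hours minus planned downtime."""
    return total_service_hours - planned_downtime_hours


def availability_percent(ast_hours, unplanned_downtime_hours):
    """Percentage availability: (AST - DT) / AST * 100."""
    if ast_hours <= 0:
        raise ValueError("agreed service time must be positive")
    if not 0 <= unplanned_downtime_hours <= ast_hours:
        raise ValueError("downtime must be between 0 and the agreed service time")
    return 100.0 * (ast_hours - unplanned_downtime_hours) / ast_hours


# Hypothetical month: 720 total hours, 20 planned down, 7 unplanned down.
ast = agreed_service_time(720, 20)
print(f"AST = {ast} hours, availability = {availability_percent(ast, 7):.2f}%")
```

Note that planned downtime reduces the AST rather than counting against availability, which is why only the unplanned downtime appears in the numerator.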
In parallel configurations, additional components are added to provide resilience, so the backup component takes over automatically when the primary fails. The parallel configuration formula starts with the unavailability of each component and its backup. Divide the availability percentage of each component by 100, and then subtract the result from 1 to find that component's unavailability. Then find the total host availability by multiplying the unavailabilities together and subtracting the product from 1.
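As a rough sketch of the parallel formula, the Python below follows the same steps. The function name and the 90% figures are hypothetical, and the sketch assumes component failures are independent, which is what multiplying the unavailabilities implies.

```python
from math import prod


def parallel_availability_percent(component_percents):
    """Availability of a component and its backup(s) running in parallel.

    Assumes failures are independent, so the service is down only
    when every component is down at the same time.
    """
    if not component_percents:
        raise ValueError("at least one component is required")
    if any(not 0 <= p <= 100 for p in component_percents):
        raise ValueError("availability must be a percentage from 0 to 100")
    # Step 1: convert each percentage to a fraction and find its unavailability.
    unavailability = [1 - p / 100 for p in component_percents]
    # Step 2: multiply the unavailabilities and subtract the product from 1.
    return (1 - prod(unavailability)) * 100


# Hypothetical host with a primary and a backup, each 90% available.
print(f"{parallel_availability_percent([90, 90]):.2f}%")
```

The example shows why redundancy is a popular resilience method: two modestly reliable components in parallel give a noticeably higher combined availability than either one alone.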
The last step is to find the total infrastructure availability. You do this by multiplying by 100 the product of the total host, network, server, and workstation availability percentages, after each percentage has been divided by 100.
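This last step can be sketched the same way. The function and parameter names are assumptions for illustration, and the sample percentages are invented. The sketch treats the four layers as a chain in which every layer must be up for the service to be available.

```python
def total_infrastructure_availability_percent(host, network, server, workstation):
    """Multiply the layer availabilities (as fractions) and convert back to a percentage."""
    layers = (host, network, server, workstation)
    if any(not 0 <= p <= 100 for p in layers):
        raise ValueError("availability must be a percentage from 0 to 100")
    total = 1.0
    for percent in layers:
        total *= percent / 100  # divide each percentage by 100 first
    return total * 100


# Hypothetical figures: every layer is 99% available on its own.
print(f"{total_infrastructure_availability_percent(99, 99, 99, 99):.2f}%")
```

Because the layers are multiplied together, the total is always lower than the weakest layer, so the end-to-end figure is the one to compare against the SLA.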
Along with these calculations, availability managers have a variety of activities and techniques at their disposal for ensuring that IT service availability is managed effectively.
The expanded incident life cycle is a map of the primary stages through which incidents flow. This map portrays the duration of downtime for each stage of a specific incident. This approach identifies possible areas of inefficiency that combine to make the loss of service more extreme than it should be.
Metrics should be captured at each stage of the life cycle for all incidents. The following is a list of the six stages of the expanded incident life cycle, during which metrics can be captured:
- Start - The life cycle begins when a change in the service is first noticed by the customer or by the IT staff via monitoring. This is considered the incident start time.
- Detection - The detection time is when the IT organization is made aware of the issue.
- Diagnosis - At the incident diagnosis time, the diagnosis to determine the underlying cause has been completed.
- Repair - The failure has been repaired at the incident repair time.
- Recovery - The recovery time is the time at which component recovery has been completed.
- Restoration - The life cycle ends when normal business operations resume. This is considered the restoration time.
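One way to capture these metrics is to record a timestamp at each of the six stages and compute the elapsed time between them. The Python below is a minimal sketch under that assumption; the function name, the dictionary layout, and the sample incident times are all hypothetical.

```python
from datetime import datetime, timedelta

# The six stages of the expanded incident life cycle, in order.
STAGES = ("start", "detection", "diagnosis", "repair", "recovery", "restoration")


def stage_durations(timestamps):
    """Return the elapsed time between each pair of consecutive stages."""
    missing = [stage for stage in STAGES if stage not in timestamps]
    if missing:
        raise ValueError(f"missing timestamps for: {missing}")
    durations = {}
    for earlier, later in zip(STAGES, STAGES[1:]):
        elapsed = timestamps[later] - timestamps[earlier]
        if elapsed < timedelta(0):
            raise ValueError(f"{later} is recorded before {earlier}")
        durations[f"{earlier} to {later}"] = elapsed
    durations["total downtime"] = timestamps["restoration"] - timestamps["start"]
    return durations


# Hypothetical incident on the morning of October 7, 2008.
incident = {
    "start": datetime(2008, 10, 7, 9, 0),
    "detection": datetime(2008, 10, 7, 9, 10),
    "diagnosis": datetime(2008, 10, 7, 9, 40),
    "repair": datetime(2008, 10, 7, 10, 30),
    "recovery": datetime(2008, 10, 7, 10, 45),
    "restoration": datetime(2008, 10, 7, 11, 0),
}
for stage, elapsed in stage_durations(incident).items():
    print(f"{stage}: {elapsed}")
```

Comparing these durations across many incidents shows which stage contributes most to the loss of service.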
In addition to the data provided by the metrics of the expanded incident life cycle, other data can be produced. This data can provide an indication of improving or deteriorating trends.
- Mean time between failures (MTBF) - The average elapsed time from the time an IT service or supporting component is fully restored until the next occurrence of a failure to the same service or component.
- Mean time between system incidents (MTBSI) - The average elapsed time from the occurrence of one failure to the occurrence of the next failure.
- Mean time to repair (MTTR) - The average elapsed time from the occurrence of an incident to the resolution of the incident.
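These three averages can be sketched from a simple incident log. The Python below assumes each incident is recorded as a (failure time, restored time) pair measured in hours, for one service, sorted by failure time. The function name, the record layout, and the sample data are assumptions for illustration only.

```python
from statistics import mean


def incident_metrics(incidents):
    """Compute MTBF, MTBSI, and MTTR (in hours) from (failed, restored) pairs.

    Assumes the pairs belong to one service and are sorted by failure time.
    """
    if len(incidents) < 2:
        raise ValueError("at least two incidents are needed to measure time between failures")
    pairs = list(zip(incidents, incidents[1:]))
    return {
        # From full restoration until the next failure.
        "mtbf": mean(nxt[0] - cur[1] for cur, nxt in pairs),
        # From one failure to the next failure.
        "mtbsi": mean(nxt[0] - cur[0] for cur, nxt in pairs),
        # From the occurrence of an incident to its resolution.
        "mttr": mean(restored - failed for failed, restored in incidents),
    }


# Hypothetical log: hours since the start of the reporting period.
log = [(0.0, 2.0), (100.0, 104.0), (250.0, 253.0)]
print(incident_metrics(log))
```

Tracking these values from one reporting period to the next is what reveals whether a service is trending better or worse.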
All of the data captured during the expanded life cycle can help identify where time is lost. Identifying these areas can help you reduce downtime and maintain customer satisfaction. Metrics provide key information that can be used by availability management and other IT service management (ITSM) processes.
To be effective, all IT service management processes need to work together, sharing information and ensuring that efforts support one another. ITSM processes relate to availability management (AM) in the following ways:
- IT service continuity management (ITSCM) - Risk management is essential to both ITSCM and AM. ITSCM provides a business impact assessment, detailing the vital business functions dependent on infrastructure availability. AM provides availability and recovery design criteria to ITSCM.
- Problem management - Problem management is involved in identifying and resolving the causes of availability problems, and AM contributes to the prevention of problems. Network performance is monitored, and the resulting data is used by availability, capacity, and problem management.
- Capacity management (CM) - CM receives from availability management a component failure impact assessment (CFIA) for a new IT service, denoting where availability techniques are deployed to provide additional infrastructure resilience.
- Service level management (SLM) - From AM, SLM receives an assessment of availability levels for new IT services. SLM provides service level agreement details for availability metrics. Service level, availability, and problem management help instigate service improvement programs.
- Change management - From availability management, change management receives details of the planned maintenance actions for components underpinning a new IT service. Change management provides a schedule of planned maintenance activities for IT components.
To monitor and ensure the effectiveness of the availability management process, metrics should be captured and used to identify problem areas. The availability management process also needs to communicate with the other IT service management processes.
A common problem with availability management is the way in which organizations view and actually manage availability, as compared to the recommended availability process. For example, managers often only feel responsible for their areas, and they don't coordinate with one another. IT can also have trouble when customers working with service level management (SLM) can't identify and agree on specific availability targets.
Common costs associated with availability management are expenses that occur during the implementation and monitoring of availability. Examples of common costs are expenses associated with personnel and the development of availability plans.
While many IT professionals may encounter some initial problems and costs, they and their organizations quickly realize the benefits in having services available when they are needed, ensuring that business operates as usual.