October 7, 2008
ITIL Availability Management Process
Let’s take a couple of minutes and review the ITIL availability management process. The purpose of availability management is to provide a cost-effective and defined level of availability so the organization can depend on systems to reach the business objectives.
Availability management has two major actions:
- Proactively plan the availability of IT services as specified in service level agreements (SLAs), and monitor that availability.
An SLA is a document defining the levels of service for a technology. It provides the basis for managing the relationship between the IT organization and the customer. By meeting agreed-to and understood SLAs, we give our customers the level of service they expect.
- Initiate changes to the infrastructure that increase stability and fault tolerance, ultimately preventing availability failures.
Availability needs to be consistently monitored. Additionally, improvements will be identified to increase the stability of the IT infrastructure.
When you are working with customers, here are some basic principles to help understand the goal of availability management.
- Availability is at the core of customer and user satisfaction.
- It is still possible to achieve customer and user satisfaction when things go awry.
- Improving availability starts with a solid understanding of how IT services support the customer's business.
As with all processes, there are inputs and outputs.
- Inputs – Service level agreements, which state the negotiated IT service requirements and are based on data about the current availability of the IT infrastructure.
- Outputs – Events created as a result of activities performed within the availability management process. They contain specific information about outages affecting the availability of the IT infrastructure and information. The events should be reported and analyzed to improve availability.
Availability levels depend on the reliability of the IT infrastructure, its resilience to failure, and the quality of maintenance. To fully understand availability management planning and monitoring, you should be familiar with the following concepts:
- Availability - The ability of the IT service or component to perform during a stated period of time.
- Reliability - The IT service is available for a negotiated period without interruptions or failure.
- Maintainability - The ability of an IT component to remain in or be restored to an operational state. With maintainability, it is the organization's internal IT staff members who are responsible for returning the component to an operational state.
- Serviceability - An external supplier's capability to maintain the availability of a component or function covered under a third-party service contract.
- Resilience – A measure of freedom from operational failure and a method of keeping services reliable. It minimizes the consequences of component failure; one popular method of resilience is redundancy.
- Security - The confidentiality, integrity, and availability of data.
Planning is a key component. The availability requirements of the business must be analyzed to assess how the IT infrastructure can deliver the required levels of availability. Planning involves designing for availability and for recovery from loss of service. Availability, reliability, maintainability, and serviceability requirements must also be determined, and the current IT infrastructure, including possible security concerns, should be analyzed.
Availability management also provides IT service continuity management with information from the infrastructure analysis. This information aids in the design for recovery by suggesting and confirming infrastructure capacities and components to support vital business functions.
A key part of availability planning is designing for availability. The following characteristics help identify areas of availability design that are most important to the organization:
- High availability – Refers to an IT service which minimizes the effects of IT component failure to the customer and users.
- Continuous operation - Masks the effects of planned downtime from the user, so that planned events do not interrupt use of the service.
- Continuous availability - Minimizes the effects of all failures and planned downtime to the user.
Monitoring involves managing, measuring, and reporting on availability. Measurements and their related reports need to satisfy the IT support organization's needs, the users' needs, and the customer's needs. This includes monitoring the availability of IT services, reporting measurements to the users and customer, and continually striving to optimize availability.
Availability management also seeks to optimize availability levels. Availability plans, which detail the long-term, cost-effective plans for the proactive improvement of availability, help availability management accomplish this goal.
The availability management process uses a variety of methods to optimize availability within the IT infrastructure. Several of these methods are listed below:
- component failure impact assessment (CFIA)
- fault tree analysis (FTA)
- the CCTA risk analysis and management method (CRAMM)
- the IT availability metrics model (ITAMM)
The following methods can also be used to optimize availability:
- Systems outage analysis (SOA) - Based on the findings of other methods, SOA attempts to identify the underlying causes of service interruptions. Once these causes are identified, targeted improvements can be made.
- The expanded incident life cycle - This is a map of the primary stages that incidents move through. The map shows the duration of the downtime for each stage of a specific incident.
- Technical observation post (TOP) - A meeting where specialized technical support staff members focus on the specific aspects of IT availability. The purpose is to monitor events as they occur to identify improvement opportunities or bottlenecks within the IT infrastructure.
Calculations are another key method in availability management. A few simple formulas produce information on component and total infrastructure availability.
One common calculation is the formula for service or component availability. This is calculated as a percentage and uses these values: the agreed service time (AST) and the unplanned downtime (DT) during the agreed service time. The AST is planned downtime subtracted from total service hours. To find the percentage of service or component availability, subtract the DT from the AST, and divide the result by the AST. Then multiply by 100.
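As a rough illustration of the formula above, here is a minimal Python sketch. The function and parameter names are my own rather than ITIL terminology, and the hours in the example are made up.

```python
def availability_percent(total_service_hours, planned_downtime, unplanned_downtime):
    """Service or component availability as a percentage.

    AST = total service hours - planned downtime
    availability = (AST - DT) / AST * 100
    """
    ast = total_service_hours - planned_downtime  # agreed service time
    if ast <= 0:
        raise ValueError("agreed service time must be positive")
    if not 0 <= unplanned_downtime <= ast:
        raise ValueError("unplanned downtime must fall within the agreed service time")
    return 100.0 * (ast - unplanned_downtime) / ast


# Hypothetical month: 720 total hours, 20 hours of planned maintenance,
# and 7 hours of unplanned downtime during the agreed service time.
example = availability_percent(720, 20, 7)
print(f"{example:.2f}%")
```

In this example the AST is 700 hours, so 7 hours of unplanned downtime works out to 99% availability.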
In parallel configurations, additional components are added to provide resilience, so that a backup component takes over automatically. The parallel configuration formula starts with the unavailability of each component and its backup. Divide the availability percentage of each component by 100, and then subtract the result from 1 to find that component's unavailability. Then find the total host availability by multiplying the unavailabilities together and subtracting the product from 1; multiply by 100 to express it as a percentage.
The last step is to find the total infrastructure availability. Divide each of the total host, network, server, and workstation availability percentages by 100, multiply the resulting fractions together, and then multiply the product by 100.
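The parallel and total infrastructure calculations described above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the percentages are invented, and it assumes component failures are independent of one another.

```python
import math


def parallel_availability(component_percents):
    """Availability (%) of a component and its backups running in parallel."""
    # Unavailability of each component as a fraction: 1 - (availability / 100).
    unavailabilities = [1 - p / 100 for p in component_percents]
    # The group is down only when every component is down at the same time.
    return (1 - math.prod(unavailabilities)) * 100


def total_infrastructure_availability(layer_percents):
    """Availability (%) when every layer must be up for the service to work."""
    return math.prod(p / 100 for p in layer_percents) * 100


# Hypothetical figures: two 90% hosts in parallel, followed by network,
# server, and workstation layers that all have to be available.
host = parallel_availability([90.0, 90.0])
total = total_infrastructure_availability([host, 99.5, 99.0, 98.0])
print(f"host pair: {host:.2f}%  total: {total:.2f}%")
```

Two hosts at 90% each give a host pair at 99%, which shows why redundancy is such a popular resilience method. The series calculation pulls the total back down, since every layer has to be up at once.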
Along with these calculations, availability managers have a variety of activities and techniques at their disposal for ensuring that IT service availability is managed effectively.
The expanded incident life cycle is a map of the primary stages through which incidents flow. This map portrays the duration of downtime for each stage of a specific incident. This approach identifies possible areas of inefficiency that combine to make the loss of service more extreme than it should be.
Metrics should be captured at each stage of the life cycle for all incidents. The following is a list of the six stages of the expanded incident life cycle, during which metrics can be captured:
- Start - The life cycle begins when a change in the service is first noticed by the customer or by the IT staff via monitoring. This is considered the incident start time.
- Detection - The detection time is when the IT organization is made aware of the issue.
- Diagnosis - At the incident diagnosis time, the diagnosis to determine the underlying cause has been completed.
- Repair - The failure has been repaired at the incident repair time.
- Recovery - The recovery time is the time at which component recovery has been completed.
- Restoration - The life cycle ends when normal business operations resume. This is considered the restoration time.
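To show how the six stage times above can be turned into downtime per stage, here is a small Python sketch. The stage keys, the function name, and the incident timestamps are all invented for illustration; a real service desk tool would supply its own field names.

```python
from datetime import datetime

# The six stages of the expanded incident life cycle, in order.
STAGES = ["start", "detection", "diagnosis", "repair", "recovery", "restoration"]


def stage_durations(timestamps):
    """Return the minutes elapsed between each pair of consecutive stages."""
    durations = {}
    for earlier, later in zip(STAGES, STAGES[1:]):
        delta = timestamps[later] - timestamps[earlier]
        if delta.total_seconds() < 0:
            raise ValueError(f"{later} is recorded before {earlier}")
        durations[f"{earlier} -> {later}"] = delta.total_seconds() / 60
    return durations


# Hypothetical incident with made-up timestamps.
incident = {
    "start": datetime(2008, 10, 7, 9, 0),
    "detection": datetime(2008, 10, 7, 9, 20),
    "diagnosis": datetime(2008, 10, 7, 10, 5),
    "repair": datetime(2008, 10, 7, 10, 35),
    "recovery": datetime(2008, 10, 7, 10, 50),
    "restoration": datetime(2008, 10, 7, 11, 0),
}
durations = stage_durations(incident)
slowest = max(durations, key=durations.get)
print(f"slowest step: {slowest} ({durations[slowest]:.0f} minutes)")
```

In this made-up incident the longest stretch is the 45 minutes between detection and diagnosis, which is the kind of inefficiency the expanded life cycle is meant to expose.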
In addition to the data provided by the metrics of the expanded incident life cycle, other data can be produced. This data can provide an indication of improving or deteriorating trends.
- Mean time between failures (MTBF) - The average elapsed time from the time an IT service or supporting component is fully restored until the next occurrence of a failure to the same service or component.
- Mean time between system incidents (MTBSI) - The average elapsed time from the occurrence of one failure to the next failure.
- Mean time to repair (MTTR) - The average elapsed time from the occurrence of an incident to the resolution of the incident.
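Using the three definitions above, these averages could be computed from a simple incident log. The sketch below is an assumption-laden illustration: it treats each incident as a hypothetical (failure time, restored time) pair measured in hours on a shared clock, and the function name is my own.

```python
def _mean(values):
    return sum(values) / len(values)


def trend_metrics(incidents):
    """Compute MTTR, MTBF, and MTBSI from (failed, restored) pairs in hours.

    The incidents must be for the same service and sorted by failure time.
    """
    if len(incidents) < 2:
        raise ValueError("need at least two incidents to measure time between failures")
    pairs = list(zip(incidents, incidents[1:]))
    # MTTR: from the occurrence of each incident to its resolution.
    repairs = [restored - failed for failed, restored in incidents]
    # MTBF: from full restoration until the next failure.
    uptimes = [nxt[0] - prev[1] for prev, nxt in pairs]
    # MTBSI: from one failure to the next failure.
    gaps = [nxt[0] - prev[0] for prev, nxt in pairs]
    return {"MTTR": _mean(repairs), "MTBF": _mean(uptimes), "MTBSI": _mean(gaps)}


# Hypothetical log: three failures at hours 0, 10, and 20.
metrics = trend_metrics([(0, 2), (10, 13), (20, 21)])
print(metrics)
```

For this invented log the MTTR is 2 hours, the MTBF is 7.5 hours, and the MTBSI is 10 hours. Tracking the three values over successive reporting periods is what reveals an improving or deteriorating trend.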
All of the data captured during the expanded life cycle can help identify where time is lost. Identifying these areas can help you reduce downtime and maintain customer satisfaction. Metrics provide key information that can be used by availability management and other IT service management (ITSM) processes.
To be effective, all IT service management processes need to work together, sharing information and ensuring that efforts support one another. ITSM processes relate to availability management (AM) in the following ways:
- IT service continuity management (ITSCM) - Risk management is essential to both ITSCM and AM. ITSCM provides a business impact assessment, detailing the vital business functions dependent on infrastructure availability. AM provides availability and recovery design criteria to ITSCM.
- Problem management - Problem management is involved in identifying and resolving the causes of availability problems, and AM contributes to the prevention of problems. Network performance is monitored, and the resulting data is used by availability, capacity, and problem management.
- Capacity management (CM) - CM receives from availability management a component failure impact assessment (CFIA) for a new IT service, denoting where availability techniques are deployed to provide additional infrastructure resilience.
- Service level management (SLM) - From AM, SLM receives an assessment of availability levels for new IT services. SLM provides service level agreement details for availability metrics. Service level, availability, and problem management help instigate service improvement programs.
- Change management - From availability management, change management receives details of the planned maintenance actions for components underpinning a new IT service. Change management provides a schedule of planned maintenance activities for IT components.
To monitor and ensure the effectiveness of the availability management process, metrics should be captured and used to identify problem areas. The availability management process also needs to communicate with the other IT service management processes.
A common problem with availability management is the way in which organizations view and actually manage availability, as compared to the recommended availability process. For example, managers often only feel responsible for their areas, and they don't coordinate with one another. IT can also have trouble when customers working with service level management (SLM) can't identify and agree on specific availability targets.
Common costs associated with availability management are expenses that occur during the implementation and monitoring of availability. Examples of common costs are expenses associated with personnel and the development of availability plans.
While many IT professionals may encounter some initial problems and costs, they and their organizations quickly realize the benefits in having services available when they are needed, ensuring that business operates as usual.
October 4, 2008
IT Staffing Models from HIMSS Analytics April 2008 Data
For those of us feeling the impact of tightening current-year budgets: while doing some research this morning, I came across two resources that highlight the HIMSS Analytics data set.
- Characterizing the Health Information Technology Workforce: Analysis from the HIMSS Analytics Database, April 2008
- What workforce is needed to implement the Health Information Technology Agenda? Analysis from the HIMSS Analytics Database
To build a model from this information, a few assumptions need to be made, in my opinion. The assumptions I use are as follows:
- The institution active bed count is accurate.
- To be rated at a stage, all components from that stage and all previous stages are in productive use.
- There is no outsourcing of IT resources.
- The IT workforce is centralized, and LIS, Pharmacy IS, and informatics staff are included in the available staff count.
- In-flight strategic and capital projects have resources within their own funding and are not a part of this calculation.
- The business model is mainly acute care, without ambulatory care centers or physician practice organizations.
Here is my thinking for the calculation. First, I evaluated our current state on the EMR Adoption Model.
| EMR Adoption | Description | % of US Hospitals Q3-2008 | IT Staff to Beds |
| --- | --- | --- | --- |
| Stage 7 | Medical Record Fully Electronic; Health Care Organization able to contribute to CCD as a byproduct of EMR; Data Warehousing in use | 0.1% | Not available |
| Stage 6 | Physician Documentation (structured templates), full CDSS (variance & compliance), full R-PACS | 1.0% | 0.196 |
| Stage 5 | Closed loop medication administration | 1.3% | 0.167 |
| Stage 4 | CPOE, CDSS (clinical protocols) | 1.9% | 0.210 |
| Stage 3 | Clinical Documentation (flow sheets), CDSS (error checking), PACS available outside Radiology | 32.9% | 0.151 |
| Stage 2 | CDR, CMV, CDSS, may have Documentation Imaging | 33.2% | 0.122 |
| Stage 1 | Ancillaries – Lab, Rad, Pharmacy | 12.5% | 0.096 |
| Stage 0 | All Three Ancillaries Not Installed | 17.1% | 0.082 |
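As a sketch of how the table above could feed a staffing estimate, the Python below multiplies an active bed count by the ratio for a stage. It assumes the "IT Staff to Beds" column means IT staff members per active bed, which is my reading of the header, and the function name and the 400-bed hospital are hypothetical.

```python
# IT staff per bed by EMR Adoption Model stage, copied from the table above.
# Stage 7 has no published ratio, so it is left as None.
STAFF_PER_BED = {
    7: None,
    6: 0.196,
    5: 0.167,
    4: 0.210,
    3: 0.151,
    2: 0.122,
    1: 0.096,
    0: 0.082,
}


def estimate_it_staff(active_beds, stage):
    """Estimate IT staff as active beds times the ratio for the given stage."""
    ratio = STAFF_PER_BED[stage]
    if ratio is None:
        raise ValueError(f"no staffing ratio available for stage {stage}")
    return active_beds * ratio


# Hypothetical 400-bed hospital, currently at stage 3 and planning for stage 4.
current = estimate_it_staff(400, 3)
target = estimate_it_staff(400, 4)
print(f"stage 3: {current:.1f} staff, stage 4: {target:.1f} staff")
```

For the hypothetical 400-bed hospital this gives roughly 60 staff at stage 3 and 84 at stage 4. The estimate only holds under the assumptions listed earlier, such as no outsourcing and a centralized IT workforce.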
This gives a good idea of the workforce needed to support healthcare operations. Furthermore, moving the model to a more granular level, the distribution of IT shop effort from HIMSS Analytics is as follows:
Job Description |
Percentage |
Project Management |