A Quantitative Framework for Computing Availability in Web Servers

Archana Ganapathi

A system’s availability depends on the availability of its constituent components. However, system administrators treat the availability of certain components as more critical to the system’s welfare than that of others. Moreover, it is a challenge to obtain high performance and high dependability simultaneously. We provide a quantitative framework to help the system architect choose the best possible mix of performance and availability. We address the factors affecting the availability of web-servers and provide a framework to measure this dependability component quantitatively. An operator can then assign a weighting factor to each characteristic to emphasize overall performance or simply availability alone.

Traditionally:

Availability = time the system services requests / total system lifetime (expressed as a percentage)

To determine whether a web-server services customer requests appropriately, we must evaluate several factors: the time the server takes to respond to a request, the completeness and accuracy of serviced requests, and the capacity to perform amid failures in peer components [1]. The server is expected to respond completely and on schedule. Enforcing response timeouts reveals network delays as well as server overloads. However, when considering response, transient failures are discarded: a failure must persist to be included in downtime [4]. Often, the monitoring/probing tool itself may affect a server’s response; alternatively, network connectivity between the tools and the web-server may be unavailable. These factors are considered in the computation of the availability metric.

Sub/peer-components of the web-server include the network connection topology, the storage subsystem, hardware, and software. We consider the number of failures tolerated by each sub-component, as their availability directly affects the system’s aggregate availability.
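The traditional definition above, combined with response timeouts and the exclusion of transient failures, can be sketched as follows. This is a minimal illustration, not the framework's actual implementation: the `Probe` record, the `timeout` and `min_outage` parameters, and the rule that an outage lasts from the first failed probe to the next successful one are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Probe:
    """One monitoring probe (hypothetical record layout)."""
    timestamp: float                # seconds since the start of observation
    response_time: Optional[float]  # seconds; None means no response at all
    complete: bool = True           # response was complete and accurate


def probe_failed(p: Probe, timeout: float) -> bool:
    """A probe fails if it got no response, timed out, or was incomplete."""
    return p.response_time is None or p.response_time > timeout or not p.complete


def availability(probes: List[Probe], timeout: float,
                 min_outage: float, end_time: float) -> float:
    """Fraction of the observed lifetime during which requests were serviced.

    Outages shorter than min_outage are treated as transient and discarded.
    """
    if not probes:
        raise ValueError("at least one probe is required")
    total = end_time - probes[0].timestamp
    if total <= 0:
        raise ValueError("end_time must be after the first probe")

    downtime = 0.0
    outage_start = None  # timestamp of the first failed probe in a run
    for p in probes:
        if probe_failed(p, timeout):
            if outage_start is None:
                outage_start = p.timestamp
        elif outage_start is not None:
            duration = p.timestamp - outage_start
            if duration >= min_outage:  # only persistent failures count
                downtime += duration
            outage_start = None

    # An outage still open at the end of the observation window.
    if outage_start is not None and end_time - outage_start >= min_outage:
        downtime += end_time - outage_start

    return 1.0 - downtime / total


# Probes every 60 s for 10 minutes: one transient failure, one persistent outage.
sample = [Probe(float(t), 0.2) for t in range(0, 601, 60)]
sample[2] = Probe(120.0, None)                  # transient: recovers at t=180
sample[5] = Probe(300.0, 5.0)                   # timed out
sample[6] = Probe(360.0, None)                  # no response
sample[7] = Probe(420.0, 0.1, complete=False)   # incomplete response
print(f"{availability(sample, timeout=2.0, min_outage=120.0, end_time=600.0):.3f}")
```

In the sample, the single failed probe at t=120 is discarded as transient, while the run of failures from t=300 to recovery at t=480 counts as 180 seconds of downtime out of 600.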
Periodic prophylactic hardware/software fault-injection techniques can verify the integrity and availability of these sub-components [3]. Fault injection is performed during periods of low web-server load (perhaps weekends, late nights, or early mornings) or during designated downtimes, so that web-server performance is not adversely impacted.

To collect relevant information for the various web-server components, we use the benchmark data collection/analysis infrastructure advocated by [6]. We distinguish between normal and abnormal per-component and per-request statistics, retrieving these figures using remote monitoring techniques. The behavioral profile of a system is computed from various perspectives: user, application, sub-components, and service requests. These profiles capture features such as latency, throughput, and precision, as well as CPU/memory consumption and network resource usage. Tools such as Pinpoint are useful for obtaining latency profiles and other related metrics [2]. The recovery time for each of these components is included to fairly assess its impact on system availability.

Incorporating aggregate and per-component availability of the system, we develop the formula:

System availability = sum over all components of (operator_fudge_factor * system_component_availability)

system_component_availability = static_factor + dynamic_factor

Static_factor is provided during configuration, based on the "historic" performance of that component (this factor can be the manufacturer-advertised availability, the number of 9’s, SLAs, etc.). Dynamic_factor is computed from various monitoring/probing tests and from data gathered from event logs. Dynamic feedback on web-server availability is obtained from fault-injection components and from tools such as Netcraft [5].

System administrators can use this metric to compute a “figure of merit” for their web-server. This framework can then be used to dynamically plug in system components to improve dependable performance.

References:

[1] Aaron B. Brown. Towards Availability and Maintainability Benchmarks: A Case Study of Software RAID Systems. UC Berkeley Technical Report UCB//CSD-01-1132, January 2001.
[2] M. Chen, E. Kiciman, E. Fratkin, E. Brewer, and A. Fox. Pinpoint: Problem Determination in Large, Dynamic, Internet Services. Proceedings of the International Conference on Dependable Systems and Networks (IPDS Track), Washington D.C., 2002.
[3] Mei-Chen Hsueh, Timothy K. Tsai, and Ravishankar K. Iyer. Fault Injection Techniques and Tools. IEEE Computer, April 1997.
[4] M. Merzbacher and Dan Patterson. Measuring End-User Availability on the Web: Practical Experience. International Performance and Dependability Symposium, Washington D.C., June 2002.
[5] Netcraft. http://www.netcraft.com
[6] David Oppenheimer, David A. Patterson, and Joseph M. Hellerstein. Decentralized Systems Need Decentralized Benchmarks. UC Berkeley Technical Report UCB//CSD-03-1234, April 2003.
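
Illustration: the weighted-sum formula for system availability given in the body can be sketched in a few lines. The text does not fix the scale of the factors, so this example assumes that the operator weights sum to 1 and that dynamic_factor is a small signed correction to static_factor. The `Component` class, the component names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Component:
    """One system component (hypothetical record layout)."""
    name: str
    weight: float          # operator_fudge_factor: operator-assigned importance
    static_factor: float   # configured from historic or advertised availability
    dynamic_factor: float  # assumed here: signed correction from probes and logs

    @property
    def availability(self) -> float:
        # system_component_availability = static_factor + dynamic_factor
        return self.static_factor + self.dynamic_factor


def system_availability(components: Iterable[Component]) -> float:
    """Sum over components of operator_fudge_factor * component availability."""
    return sum(c.weight * c.availability for c in components)


# Made-up example values; the weights are chosen to sum to 1 (an assumption).
components = [
    Component("network", weight=0.5, static_factor=0.999, dynamic_factor=-0.004),
    Component("storage", weight=0.3, static_factor=0.990, dynamic_factor=0.0),
    Component("software", weight=0.2, static_factor=0.950, dynamic_factor=0.020),
]
print(f"figure of merit: {system_availability(components):.4f}")
```

Raising the weight of one component, for example the network, makes the figure of merit more sensitive to that component's measured behavior, which is how an operator would express that its availability matters more than the others'.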