Article

The Effects of Soft Errors and Mitigation Strategies for Virtualization Servers

IEEE TRANSACTIONS ON CLOUD COMPUTING (2022)

Journal

IEEE TRANSACTIONS ON CLOUD COMPUTING

Volume 10, Issue 2, Pages 1065-1081

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TCC.2020.2973146

Keywords

Virtualization; fault injection; cloud computing; fault tolerance; dependability

Funding

Centro de Informatica e Sistemas da Universidade de Coimbra (CISUC)
Portuguese National Foundation for Science and Technology (FCT) [SFRH/BD/130601/2017]
Project ADVANCE, H2020-MSCA-RISE-2018 [823788]
Project H2020 AI4EU [825619]
Project AESOP [P2020-31/SI/2017, 040004]

Automated Summary

Virtualized servers are widely used in cloud computing environments to host online applications and provide elastic computing resources. However, the presence of soft errors in large-scale servers can lead to various failure modes, with hang failures being the most common. A recovery mechanism using online testing is developed to address these hang failures and ensure server uptime.

Abstract

Virtualized servers make up the majority of cloud computing environments, where each node hosts multiple clients on the same hardware. Many organizations run online applications by renting elastic computing resources in order to match demand while reducing fixed costs. However, such organizations are unlikely to take advantage of these benefits for critical applications, as doing so would expose them to several risks. Among other threats, soft errors are a concern in large-scale reliable servers and are expected to become more frequent as a consequence of smaller transistors and lower operating voltages in integrated circuits. This article characterizes virtualized servers of cloud environments in the presence of soft errors. Using fault injection, we collect experimental data to determine the failure modes of applications, operating systems, VMs, and the hypervisor. The analysis exposes distinct failure modes, ranging from crash failures of a single virtual machine to silent data corruption in permanent storage. The most frequent failure mode, observed in 10-30 percent of injected errors, consists of a hang affecting multiple virtual machines. Given that such failures are a primary cause of downtime, we develop and evaluate a recovery mechanism which uses online testing and recovers a server from all hangs by rebooting its hypervisor.
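
To make the idea of a fault-injection campaign concrete, the following is a minimal sketch in Python, not the authors' tool. It assumes a soft error can be emulated as a single bit flip in a byte buffer, and it uses a hypothetical toy workload (`workload`) with a length header so that a flip can be masked, cause a crash, or silently corrupt the result. Hang detection, which the abstract reports as the most frequent failure mode, would additionally need a timeout and is left out here. All names and the buffer layout are illustrative assumptions.

```python
import random


def flip_bit(buf: bytearray, bit_index: int) -> None:
    """Flip one bit in place, emulating a single-bit soft error."""
    byte_index, offset = divmod(bit_index, 8)
    buf[byte_index] ^= 1 << offset


def workload(buf: bytes) -> int:
    """Hypothetical workload: byte 0 is a length header, the rest is payload."""
    length = buf[0]
    if length > len(buf) - 1:
        # A corrupted header that points past the buffer is treated as a crash.
        raise ValueError("corrupt length header")
    return sum(buf[1:1 + length])


def classify(golden: int, buf: bytes) -> str:
    """Compare a faulty run against the fault-free (golden) result."""
    try:
        result = workload(buf)
    except ValueError:
        return "crash"
    return "masked" if result == golden else "silent data corruption"


def campaign(clean: bytes, runs: int, seed: int = 0) -> dict:
    """Inject one random bit flip per run and count the observed outcomes."""
    rng = random.Random(seed)
    golden = workload(clean)
    counts = {"masked": 0, "crash": 0, "silent data corruption": 0}
    for _ in range(runs):
        faulty = bytearray(clean)
        flip_bit(faulty, rng.randrange(len(faulty) * 8))
        counts[classify(golden, bytes(faulty))] += 1
    return counts


if __name__ == "__main__":
    # Header says 4 payload bytes; the last 3 bytes are unused padding.
    print(campaign(bytes([4, 10, 20, 30, 40, 0, 0, 0]), runs=1000))
```

Comparing each faulty run against a golden run is a common way to separate masked errors from silent data corruption; a campaign against a real virtualization stack would inject into registers or memory of a live system and would need an external monitor to observe hangs.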
