Article

The Effects of Soft Errors and Mitigation Strategies for Virtualization Servers

Journal

IEEE TRANSACTIONS ON CLOUD COMPUTING
Volume 10, Issue 2, Pages 1065-1081

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TCC.2020.2973146

Keywords

Virtualization; fault injection; cloud computing; fault tolerance; dependability

Funding

  1. Centro de Informatica e Sistemas da Universidade de Coimbra (CISUC)
  2. Portuguese National Foundation for Science and Technology (FCT) [SFRH/BD/130601/2017]
  3. Project ADVANCE, H2020-MSCA-RISE-2018 [823788]
  4. Project H2020 AI4EU [825619]
  5. Project AESOP [P2020-31/SI/2017, 040004]


Virtualized servers are widely used in cloud computing environments to host online applications and provide elastic computing resources. However, soft errors in large-scale servers can lead to various failure modes, of which hangs are the most common. A recovery mechanism based on online testing is developed to recover servers from these hangs.

Virtualized servers compose the majority of cloud computing environments, where these nodes are used to host multiple clients on the same hardware. Many organizations run online applications by renting elastic computing resources in order to match demand while reducing fixed costs. However, such organizations are unlikely to take advantage of these benefits for critical applications, as doing so would expose them to several risks. Among other threats, soft errors are a concern in large-scale reliable servers and are expected to become more frequent as a consequence of smaller transistors and lower operating voltages in integrated circuits. This article characterizes virtualized servers of cloud environments in the presence of soft errors. Using fault injection, we collect experimental data to determine the failure modes of applications, operating systems, VMs, and the hypervisor. The analysis exposes distinct failure modes, ranging from crash failures of a single virtual machine to silent data corruption in permanent storage. The most frequent failure mode, observed in 10-30 percent of injected errors, consists of a hang affecting multiple virtual machines. Given that such failures are a primary cause of downtime, we develop and evaluate a recovery mechanism that uses online testing and recovers a server from all hangs by rebooting its hypervisor.
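
The abstract does not describe how the online tests are implemented, so the following Python sketch only illustrates the general idea of hang detection followed by a hypervisor reboot. Everything in it is an assumption made for illustration: the `HangDetector` class, the per-VM probe callables, the `max_misses` threshold, and the `needs_hypervisor_reboot` helper are hypothetical names and are not taken from the article.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HangDetector:
    """Hypothetical online-testing loop: one liveness probe per VM."""

    # Assumed interface: VM name -> callable returning True if the VM
    # answered its online test in time, False otherwise.
    probes: Dict[str, Callable[[], bool]]
    # Assumed policy: consecutive failed tests before a VM counts as hung.
    max_misses: int = 2
    _misses: Dict[str, int] = field(default_factory=dict)

    def run_round(self) -> List[str]:
        """Run every probe once and return the VMs currently considered hung."""
        hung = []
        for name, probe in self.probes.items():
            try:
                ok = probe()
            except Exception:
                # A probe that raises is treated like a missed test.
                ok = False
            # Reset the counter on success, otherwise count one more miss.
            self._misses[name] = 0 if ok else self._misses.get(name, 0) + 1
            if self._misses[name] >= self.max_misses:
                hung.append(name)
        return hung


def needs_hypervisor_reboot(hung: List[str]) -> bool:
    """Assumed policy mirroring the abstract: any detected hang triggers a reboot."""
    return len(hung) > 0
```

In a real deployment the probes would have to run from a vantage point that survives a hang of the guest VMs, and the reboot itself would be issued through whatever management interface the platform offers; both are outside the scope of this sketch.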
