Article

Software-based failure detection and recovery in programmable network interfaces

Journal

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
Volume 18, Issue 11, Pages 1539-1550

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPDS.2007.1093

Keywords

Programmable Network Interface Card (NIC); Single Event Upset (SEU); radiation induced faults; failure detection; failure recovery; self-testing

Emerging network technologies have complex network interfaces that have renewed concerns about network reliability. In this paper, we present an effective low-overhead fault tolerance technique to recover from network interface failures. Failure detection is based on a software watchdog timer that detects network processor hangs and a self-testing scheme that detects interface failures other than processor hangs. The proposed self-testing scheme achieves failure detection by periodically directing the control flow to go through only active software modules in order to detect errors that affect instructions in the local memory of the network interface. Our failure recovery is achieved by restoring the state of the network interface using a small backup copy containing just the right amount of information required for complete recovery. The paper shows how this technique can be made to minimize the performance impact to the host system and be completely transparent to the user.
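
The detection and recovery loop described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `SimulatedNic`, `NicState`, `Watchdog`, the heartbeat counter, the `timeout_ticks` threshold, and the two sequence-number fields are all hypothetical, and the self-test is reduced to a boolean flag standing in for a pass over the active code paths.

```python
import copy
from dataclasses import dataclass


@dataclass
class NicState:
    """Hypothetical minimal state needed to resume after a reset."""
    send_seq: int = 0
    recv_seq: int = 0


class SimulatedNic:
    """Toy model of a programmable NIC (an assumption, not real firmware)."""

    def __init__(self):
        self.state = NicState()
        self.heartbeat = 0    # advanced by the NIC main loop while it is alive
        self.hung = False     # models a processor hang
        self.code_ok = True   # models intact instructions in local memory

    def tick(self):
        # One main-loop iteration; a hung processor makes no progress.
        if not self.hung:
            self.heartbeat += 1
            self.state.send_seq += 1

    def self_test(self):
        # Stand-in for a self-test that exercises only active modules.
        return self.code_ok

    def reset(self, backup):
        # Reload the code image (modeled by clearing the flags) and
        # restore state from the small backup copy.
        self.hung = False
        self.code_ok = True
        self.state = copy.copy(backup)


class Watchdog:
    """Software watchdog that also triggers the self-test and recovery."""

    def __init__(self, nic, timeout_ticks=3):
        self.nic = nic
        self.timeout_ticks = timeout_ticks
        self.last_heartbeat = nic.heartbeat
        self.stalled = 0
        self.backup = copy.copy(nic.state)
        self.recoveries = 0

    def checkpoint(self):
        # Refresh the backup copy of the recovery state.
        self.backup = copy.copy(self.nic.state)

    def poll(self):
        # Count consecutive polls in which the heartbeat did not advance.
        if self.nic.heartbeat == self.last_heartbeat:
            self.stalled += 1
        else:
            self.stalled = 0
            self.last_heartbeat = self.nic.heartbeat
        failed = self.stalled >= self.timeout_ticks or not self.nic.self_test()
        if failed:
            self.nic.reset(self.backup)
            self.stalled = 0
            self.last_heartbeat = self.nic.heartbeat
            self.recoveries += 1
        return failed
```

In this sketch a hang is declared only after `timeout_ticks` consecutive polls without heartbeat progress, while a failed self-test triggers recovery immediately. How often the backup is refreshed, and where the watchdog actually runs, are design choices the abstract does not specify.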
