10:52 a.m., March 25, 2013 – The unending quest for new electronics applications and greater computational power is pushing researchers to produce computer chips that perform better and consume less power.

However, as computer chips shrink — and more devices are placed on each chip — they become increasingly unpredictable.

These reliability issues can be “show stoppers” for today’s computer systems, hampering a system’s ability to run lengthy applications.

The University of Delaware’s Chengmo Yang, assistant professor of electrical and computer engineering, recently received a prestigious five-year, $449,541 Faculty Early Career Development Award from the National Science Foundation (NSF) to develop resiliency solutions that can help computer systems overcome progressively diverse types of hardware failures.

Provided through NSF’s Division of Computer and Network Systems, the funding will enable Yang to design and evaluate new architectural and system-level solutions to boost resiliency in computer systems and to develop new algorithms aimed at simultaneously optimizing a computer’s performance, energy use and reliability.

According to Yang, three types of hardware faults typically occur in computers: permanent (a device breaks or can no longer be programmed), transient (random faults or errors) and intermittent (problems tied to execution conditions such as voltage and temperature).

“Future computer systems are expected to experience continuous faults, across all levels from hardware to software applications, raising critical concerns about the impact of intermittent faults that occur frequently and irregularly over nanosecond to second time scales,” explains Yang.

Previous approaches to address these problems have included adding system redundancies, such as having the computer perform a computation twice and comparing results to ensure accuracy.
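As a rough illustration of the redundancy approach described above, the short Python sketch below runs a computation twice and accepts the result only when both runs agree. The function name `run_with_duplication` and the retry limit are hypothetical choices made for this example; they are not taken from Yang's work.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_duplication(compute: Callable[[], T], max_attempts: int = 3) -> T:
    """Run `compute` twice per attempt and return the result once both runs agree.

    Hypothetical helper: every attempt costs two executions, which is the
    doubled energy expenditure that redundancy implies.
    """
    for _ in range(max_attempts):
        first = compute()
        second = compute()
        if first == second:
            # Matching results are taken as evidence that no fault occurred.
            return first
    raise RuntimeError("results never agreed; the fault may be persistent")
```

Even in the fault-free case, this scheme performs every computation twice, which is the cost the next paragraph refers to.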

“Doing the computation twice means double the energy expenditure,” explains Yang, who instead proposes adapting the execution conditions to improve efficiency while also controlling costs.

Her approach includes creating a feedback loop within the system to improve the devices’ reliability over time through adaptive “work-arounds” for three tightly connected components:

  • Detection and checkpointing: enabling computers to repeatedly adjust approaches to tasks based on a system’s reliability;
  • Error recovery: enabling computers to re-execute commands following failures in a way that minimizes the chance of further problems; and
  • Resource management: enabling systems to monitor application and hardware reliability and quickly adapt scheduling decisions as needed.
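To give a sense of how checkpointing and error recovery fit together, here is a minimal Python sketch. It is an assumption-laden toy, not Yang's design: a detected fault is modeled as a `RuntimeError`, the program state is a plain dictionary, and the name `run_with_checkpoints` and its retry limit are invented for illustration.

```python
import copy
from typing import Callable, List


def run_with_checkpoints(
    state: dict,
    steps: List[Callable[[dict], None]],
    max_retries: int = 3,
) -> dict:
    """Run each step in order, rolling back and re-executing it on a fault.

    Assumption: a detected fault surfaces as a RuntimeError raised by the step.
    """
    for step in steps:
        # Save a checkpoint of the last known-good state before the step runs.
        checkpoint = copy.deepcopy(state)
        for _ in range(max_retries + 1):
            try:
                step(state)
                break  # the step completed, move on to the next one
            except RuntimeError:
                # Discard any partial updates and restore the checkpoint.
                state.clear()
                state.update(copy.deepcopy(checkpoint))
        else:
            raise RuntimeError("step kept failing after all retries")
    return state
```

Only the failed step is repeated, so a fault costs one re-execution rather than a full duplicate run of the whole application.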

By setting up systems that assign the most critical and vulnerable tasks to the computer’s most reliable cores, Yang believes she can help create computer systems that can quickly recover from unplanned or intermittent problems.
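The idea of matching critical tasks to reliable cores can be sketched in a few lines of Python. The criticality and reliability scores, the function name `assign_tasks` and the round-robin fallback when tasks outnumber cores are all assumptions made for this example; the article does not describe how Yang's scheduler actually ranks tasks or cores.

```python
from typing import Dict, List, Tuple


def assign_tasks(
    task_criticality: Dict[str, float],
    core_reliability: Dict[str, float],
) -> List[Tuple[str, str]]:
    """Pair the most critical tasks with the most reliable cores.

    Hypothetical scoring: higher numbers mean more critical or more reliable.
    """
    if not core_reliability:
        raise ValueError("at least one core is required")
    tasks = sorted(task_criticality, key=lambda t: task_criticality[t], reverse=True)
    cores = sorted(core_reliability, key=lambda c: core_reliability[c], reverse=True)
    # If there are more tasks than cores, wrap around starting with the best core.
    return [(task, cores[i % len(cores)]) for i, task in enumerate(tasks)]


plan = assign_tasks(
    {"log": 0.1, "control": 0.9, "ui": 0.5},
    {"c0": 0.90, "c1": 0.99},
)
print(plan)  # [('control', 'c1'), ('ui', 'c0'), ('log', 'c1')]
```

In an adaptive system the reliability scores would be refreshed as faults are observed, and the assignment recomputed.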

“Our approach reduces the need for devices and interconnects to be 100 percent correct in order to work, which will dramatically reduce associated manufacturing, verification and testing costs,” she says.

Yang credits her NSF award selection in part to supportive colleagues such as Guang Gao, Distinguished Professor of Electrical and Computer Engineering, and her department chair, Kenneth Barner.

“Professors Gao and Barner, and others within the department, really take junior faculty under their wing and support them. My successful proposal is one example of this,” she says.

Yang joined UD in 2010 as an assistant professor of electrical and computer engineering. She earned her bachelor’s degree in microelectronics from Peking University in Beijing, China, and her master’s and doctoral degrees at the University of California, San Diego, in computer science and computer engineering, respectively.

Article by Karen B. Roberts | Photo by Ambre Alexander