2015 Program

The SELSE Program Committee is proud to announce our keynote speakers for SELSE-11!
Keynote I: Application oriented fault tolerance system for Tianhe-2
Prof. Yutong Lu (NUDT)
Keynote II: RAS implications of a connected world
Dr. Robert C. Aitken (ARM)
Keynote III: Energy-Efficient Resilience: Challenges and Pitfalls
Dr. Pradip Bose (IBM)
SELSE is an informal workshop. To encourage widespread participation, authors are given the option (but are not required) to make their presentations available on this web site. We thank all of the authors for their participation.
Day 1 – March 31st, 2015 – Austin, TX
08:00 – 08:45 Breakfast and Registration
08:45 – 09:00 Welcome Remarks: SELSE General and Program Chairs
09:00 – 10:00 Session I: Keynote Speech (Chair: Sudhanva Gurumurthi (AMD/U Virginia))
Application oriented fault tolerance system for Tianhe-2
Prof. Yutong Lu (NUDT)
10:00 – 10:30 Coffee Break
10:30 – 11:45 Session II: Soft Errors (Chair: Daniel Lowell (AMD))
Susceptibility of Planar and 3D Tri-Gate Technologies to Muon-induced Single Event Upsets
Norbert Seifert (Intel), Shah Jahinuzzaman (Intel), Jyothi Velamala (Intel) and Nikunj Patel (Intel)
Experimental Results of SEU-Tolerant Flip-flops in a 28nm Process with HI and Laser
H.-B. Wang (U Saskatchewan), E. Teng, Y. Ren (U Saskatchewan), L. Chen (U Saskatchewan), K. Lilja (Robust Chip), M. Bounasser (Robust Chip), N. Mahatme (Freescale), S.-J. Wen (Cisco), R. Wong (Cisco), R. Fung (Cisco), S. Baeg (Hanyang U), and B. Bhuva (Vanderbilt U)
Host-Compiled Reliability Modeling for Fast Estimation of Architectural Vulnerabilities
Zhuoran Zhao (U Texas at Austin), Dongwook Lee (U Texas at Austin), Andreas Gerstlauer (U Texas at Austin) and Lizy John (U Texas at Austin)
11:45 – 13:00 Lunch
13:00 – 14:15 Session III: Panel Discussion (Chair: Alan Wood (Oracle))
Topic: Tackling Reliability in Accelerator-Based Systems
Cameron McNairy (Intel)
Steve Keckler (NVIDIA)
Michael Schulte (AMD)
Mike Flynn (Maxeler)
Austin Lesea (Xilinx)
14:15 – 15:45 Session IV: Poster Session (Chair: William Robinson (Vanderbilt U)) and Coffee Break
Analysis of Soft Error Rates by Supply Voltage in 65-nm SOTB and 28-nm UTBB Structures by a PHITS-TCAD Simulation System
Kuiyuan Zhang (Kyoto Institute of Technology), Shohei Kanda (Kyoto Institute of Technology), Junki Yamaguchi (Kyoto Institute of Technology), Jun Furuta (Kyoto Institute of Technology) and Kazutoshi Kobayashi (Kyoto Institute of Technology)
The Effects of EDA Placement Modification for Mitigating Radiation-Induced Multiple Transients
Bradley Kiddie (Vanderbilt U) and William Robinson (Vanderbilt U)
Detecting Soft Errors in Stencil based Computations
Vishal Sharma (U Utah), Greg Bronevetsky (U Utah) and Ganesh Gopalakrishnan (Lawrence Livermore National Laboratory)
Empirical Studies of the Soft Error Susceptibility of Sorting Algorithms to Statistical Fault Injection
Qiang Guan (Los Alamos National Laboratory), Nathan Debardeleben (Los Alamos National Laboratory), Sean Blanchard (Los Alamos National Laboratory) and Song Fu (U North Texas)
Big versus Little: Who will trip?
Reena Panda (U Texas at Austin), Christopher Erb (U Texas at Austin) and Lizy Kurian John (U Texas at Austin)
Challenges in Assessing Single Event Upset Impact on Processor Systems
Wojciech Koszek (Xilinx), Austin Lesea (Xilinx), Glenn Steiner (Xilinx), Dagan White (Xilinx) and Pierre Maillard (Xilinx)
Examining the Impact of ACE interference on Multi-Bit AVF Estimates
Fritz Gerald Previlon (Northeastern U), Mark Wilkening (Northeastern U), Vilas Sridharan (AMD), Sudhanva Gurumurthi (AMD) and David Kaeli (Northeastern U)
15:45 – 17:00 Session V: Protection Techniques (Chair: Laura Monroe (LANL))
Software-based Dynamic Reliability Management for GPU Applications
Si Li (Georgia Institute of Technology), Vilas Sridharan (AMD), Sudhanva Gurumurthi (AMD) and Sudhakar Yalamanchili (Georgia Institute of Technology)
FAME: Flexible Real-Time Aware Error Correction by Combining Application Knowledge and Run-Time Information
Andreas Heinig (TU Dortmund), Florian Schmoll (TU Dortmund), Bjorn Bonninghoff (TU Dortmund), Peter Marwedel (TU Dortmund) and Michael Engel (Leeds Beckett U)
Resilient Flip-Flop Design under Process and Runtime Variations
Mohammad Saber Golanbari (Karlsruhe Institute of Technology), Saman Kiamehr (Karlsruhe Institute of Technology) and Mehdi B. Tahoori (Karlsruhe Institute of Technology)
17:00 – 18:00 Business Meeting – The Future of SELSE
18:30 Reception and Banquet
Day 2 – April 1st, 2015 – Austin, TX
08:00 – 08:30 Breakfast
08:30 – 10:10 Session VI: Reliability Evaluation (Chair: John Daly (DoD))
MEMRES: A Fast Memory System Reliability Simulator
Shaodi Wang (UCLA), Henry Chaohong Hu (Samsung), Hongzhong Zheng (Samsung) and Puneet Gupta (UCLA)
A Layout-Aware Approach to Fault Injection for Improving Failure Mode Prediction
Cyril Bottoni (ST Microelectronics), Benjamin Coeffic (ST Microelectronics), Jean Marc Daveau (ST Microelectronics), Gilles Gasiot (ST Microelectronics), Lirida Naviner (Telecom ParisTech) and Philippe Roche (ST Microelectronics)
SASSIFI: Evaluating Resilience of GPU Applications
Siva Kumar Sastry Hari (NVIDIA), Timothy Tsai (NVIDIA), Mark Stephenson (NVIDIA), Steve Keckler (NVIDIA) and Joel Emer (NVIDIA)
The Use of Benchmarks for High-Reliability Systems
Heather Quinn (LANL), Paolo Rech (UFRGS), Fernanda Lima Kastensmidt (UFRGS), William Robinson (Vanderbilt U), Steve Guertin (JPL), Matteo Sonza Reorda (Politecnico di Torino), Michael Wirthlin (BYU), David Kaeli (Northeastern U), Miguel Aguirre (U Sevilla), Luis Entrena (U Carlos III de Madrid), Arno Bernard (Stellenbosch U), Luca Sterpone (Politecnico di Torino), Bradley Kiddie (Vanderbilt U) and Marco Desogus (Politecnico di Torino)
10:10 – 10:40 Coffee Break
10:40 – 11:40 Session VII: Keynote Speech (Chair: Sarah Michalak (LANL))
RAS implications of a connected world
Dr. Robert C. Aitken (ARM)
11:40 – 12:30 Session VIII: Error Mitigation (Chair: Mattan Erez (UT Austin))
Comparing Hard and Soft Error Rate Models for Scaled SoCs
Jerry Lee (Cisco), Amr Haggag (Freescale), Adrian Evans (iRoC) and Dan Alexandrescu (iRoC)
The Implications of Different DRAM Protection Techniques on Datacenter TCO
Panagiota Nikolaou (U Cyprus), Yiannakis Sazeides (U Cyprus), Marios Kleanthous (Mesoyios College) and Lorena Ndreu (U Cyprus)
12:30 – 13:45 Lunch
13:45 – 14:45 Session IX: Keynote Speech (Chair: Helia Naeimi (Intel))
Energy-Efficient Resilience: Challenges and Pitfalls
Dr. Pradip Bose (IBM)
14:45 – 15:15 Coffee Break
15:15 – 16:30 Session X: Circuit Degradation (Chair: Siva Hari (NVIDIA))
MCPENS: Multiple-Critical-Path Embeddable NBTI Sensors for Dynamic Wearout Management
Xinfei Guo (U Virginia) and Mircea Stan (U Virginia)
Hardware Fault Compensation Using Supervised Learning
Farah Taher (U Texas at Dallas) and Joseph Callenes-Sloan (U Texas at Dallas)
Sensitivity of SRAM Cell Most Probable SNM Failure Point to Time-Dependent Variability
Dimitrios Rodopoulos (NTUA), Yanos Sazeides (U Cyprus), Francky Catthoor (IMEC), Chrysostomos Nicopoulos (U Cyprus) and Dimitrios Soudris (NTUA)
16:30 Closing Remarks

Application oriented fault tolerance system for Tianhe-2


Fault tolerance is one of the major challenges for advanced HPC systems in the post-petascale and exascale era, so innovative integrated technology designs are needed for new architectures as well as their associated software stacks. We need to explore the capabilities of the CPU, accelerator, interconnect, and I/O storage system, up to the whole system. Faults cannot be avoided in a system as large as Tianhe-2; what we should do is work with the faults as they emerge. This talk analyzes the different types of faults we face and deal with, mainly on Tianhe-2, and demonstrates the mechanisms we use to monitor, isolate, and recover from them. We focus on application-oriented fault tolerance strategies, and then present some large-scale applications to show how they sustain long runs.


Professor Yutong Lu is the Director of the System Software Laboratory, School of Computer Science, National University of Defense Technology (NUDT), Changsha, China. She is also a professor in the State Key Laboratory of High Performance Computing, China. She received her B.S., M.S., and Ph.D. degrees from NUDT. Her extensive research and development experience has spanned several generations of domestic supercomputers in China. During this period, Prof. Lu was the Director Designer for the Tianhe-1A and Tianhe-2 systems, which were internationally recognized as the top-ranked supercomputing systems worldwide in November 2010 and June 2013, respectively. Her continuing research interests include parallel operating systems (OS), high-speed interconnect communication, global file systems, and advanced programming environments.


RAS implications of a connected world

Presentation Slides: Slides

The last few years have seen the transformation of the Internet from a predominantly wired, PC-based structure to one based on billions of interconnected mobile devices. We are now in the midst of another transformation to a world where most communication will take place between machines, the so-called Internet of Things (IoT). Conventional RAS has not been focused on either the mobile or the IoT space, but is becoming increasingly applicable as more and more happens online. This talk looks at the range of RAS requirements in both IoT and mobile and discusses which technologies can be moved from the conventional resilient computing space and which problems will require new solutions.


Robert C. Aitken is an ARM Fellow and heads the Silicon portion of ARM R&D. His areas of responsibility include low power design, library architecture for advanced process nodes, and design for manufacturability. His research interests include design for variability, resilient computing, and memory robustness. His group has participated in numerous chip tape-outs, including 8 at or below the 16nm node. He has published over 80 technical papers, on a wide range of topics. Dr. Aitken joined ARM as part of its acquisition of Artisan Components in 2004. Prior to Artisan, he worked at Agilent and HP. He has given tutorials and short courses on several subjects at conferences and universities worldwide. He holds a Ph.D. from McGill University in Canada. Dr. Aitken is an IEEE Fellow, and serves on a number of conference and workshop committees.

Energy-Efficient Resilience: Challenges and Pitfalls

Presentation Slides: Slides

The cost of maintaining current levels of hardware reliability appears to be unaffordable in the post-22nm late CMOS design era. In the first part of this talk, we will examine the reasons behind such a projection, based on the modeled trends in technology, circuits and microarchitecture. We will show that the energy cost required to sustain targeted levels of system resilience is a central issue. On the other hand, the development cost needed to achieve the target level of energy efficiency, without violating the system resilience constraints, is too high if we approach the problem using current solution approaches.

In the second part of the talk, we will present a vision of cross-layer resilience optimization, which forms the basis of an IBM-led project sponsored by DARPA under its PERFECT program. The goal is to demonstrate through parameterized, cross-layer modeling that such an approach can help provide cost- and energy-efficient resilience in a class of future embedded systems of interest to DARPA, U.S. Department of Defense and also to the general IT appliance industry. The modeling framework is targeted to be flexible enough that customized trade-off analyses are expected to be of value to other R&D efforts geared toward high-end server, mainframe, cloud and large-scale supercomputing market segments as well.


Pradip Bose is a Distinguished Research Staff Member and Manager of the Reliability- and Power-Aware Microarchitectures Department at IBM T. J. Watson Research Center, Yorktown Heights, NY. He has been involved in the design and pre-silicon modeling of virtually all IBM POWER-series microprocessors, since the pioneering POWER1 (RS/6000) machine, which started as the Cheetah (and subsequently America) superscalar RISC project at IBM Research. From 1992-95, he was on assignment at IBM Austin, where he was the lead performance engineer in a high-end processor development project (POWER3). During 1989-90, Dr. Bose was on a sabbatical assignment as a Visiting Associate Professor at Indian Statistical Institute, India, where he worked on practical applications of knowledge-based systems. His current research interests are in high performance computers, power- and reliability-aware microprocessor architectures, pre-silicon modeling and validation. He is the author or co-author of over ninety refereed publications (including several book chapters) and he also serves as an Adjunct Professor at Columbia University. He has received twenty Invention Plateau Awards and several Research Accomplishment and Outstanding Innovation Awards from IBM. Dr. Bose served as the Editor-in-Chief of IEEE Micro from 2003-2006 and is the current chairperson of ACM SIGMICRO. He is an IEEE Fellow and a member of the IBM Academy of Technology.