Built-In Fault-Tolerant Computing Paradigm for Resilient Large-Scale Chip Design: A Self-Test, Self-Diagnosis, and Self-Repair-Based Approach

Li, Xiaowei, Yan, Guihai, Liu, Cheng

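  • Publisher: Springer
  • Publication date: 2024-03-03
  • List price: $7,720
  • VIP price: $7,334 (5% off)
  • Language: English
  • Pages: 304
  • Binding: Quality Paper - also called trade paper
  • ISBN: 9811985537
  • ISBN-13: 9789811985539
  • Overseas special-order title (requires separate checkout)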

Product Description

With the end of Dennard scaling and Moore's law, IC chips, especially large-scale ones, now face more reliability challenges, and reliability has become one of the mainstay merits of VLSI designs. In this context, this book presents a built-in on-chip fault-tolerant computing paradigm that seeks to combine fault detection, fault diagnosis, and error recovery in large-scale VLSI design in a unified manner so as to minimize resource overhead and performance penalties. Following this computing paradigm, we propose a holistic solution based on three key components: self-test, self-diagnosis, and self-repair, or "3S" for short. We then explore the use of 3S for general IC designs, general-purpose processors, network-on-chip (NoC), and deep learning accelerators, and present prototypes to demonstrate how 3S responds to in-field silicon degradation and recovers from various runtime faults caused by aging, process variations, or radiation particles. Moreover, we demonstrate that 3S not only offers a powerful backbone for various on-chip fault-tolerant designs and implementations, but also has further-reaching implications such as maintaining graceful performance degradation, mitigating the impact of verification blind spots, and improving chip yield.

This book is the outcome of extensive fault-tolerant computing research pursued at the State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences over the past decade. The proposed built-in on-chip fault-tolerant computing paradigm has been verified in a broad range of scenarios, from small processors in satellite computers to large processors in HPCs. Hopefully, it will provide an alternative yet effective solution to the growing reliability challenges for large-scale VLSI designs.

About the Authors

Dr. Xiaowei Li is a Professor and Deputy (Executive) Director at the State Key Laboratory of Computer Architecture, Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS). He received his B.Eng. and M.Eng. degrees from Hefei University of Technology in 1985 and 1988, respectively, and his Ph.D. from ICT, CAS in 1991. He joined Peking University as a post-doc in 1991. From 1993 to 2000, he was an associate professor with the Department of Computer Science at Peking University. From 1997 to 1999, he was a Visiting Research Fellow at The University of Hong Kong and at Nara Institute of Science and Technology, Japan. His research interests include VLSI testing, fault-tolerant computing, multi-core processor design and verification, and hardware security. He has led more than 20 national research projects and helped to develop many systems and software tools in these areas. He holds more than 90 patents and more than 50 software copyrights. He has co-published over 400 peer-reviewed journal and conference papers. He has received many honors and awards, including the China National Technology Innovation Award (2012) and the China National Science and Technology Progress Award (2015). Dr. Li has served on a number of program committees of IEEE/ACM-sponsored conferences and symposia, including DAC, ICCAD, and DATE, and is currently Vice-Chair of TTTC of the IEEE Computer Society. He also serves as an Associate Editor of IEEE TCAD, IEEE TCAS II, and ACM TODAES.

Dr. Guihai Yan is a professor at the State Key Laboratory of Processors (SKLP), Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS). He received his B.Eng. degree from Peking University in 2005 and his Ph.D. from ICT, CAS in 2011. His primary research interest is in computer architecture with an emphasis on domain-specific architectures for machine learning and financial computing. He has published more than 40 peer-reviewed papers in leading conference proceedings and journals, including ISCA, HPCA, TC, and TVLSI. His research work on fault-tolerant VLSI design has been deployed in countless projects, including 973 high-throughput computing systems and self-repair computer systems.

Dr. Cheng Liu is an associate professor at the State Key Laboratory of Processors (SKLP), Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS). He received his B.Eng. and M.Eng. degrees from Harbin Institute of Technology in 2007 and 2009, respectively, and his Ph.D. from The University of Hong Kong in 2016. He also worked as a research fellow at the National University of Singapore from 2016 to 2018. His research interests include fault-tolerant computing, reconfigurable computing, and customized computing, particularly for deep learning and large graph processing. He has published more than 50 peer-reviewed papers in leading conference proceedings and journals for computer architecture and EDA.
