Hyesoon Kim, Rich Vuduc, Tushar Krishna, Callie Hao (Georgia Tech, co-organizers)
Subhasish Mitra (Stanford University)
Abstract: The computation demands of 21st-century abundant-data workloads, such as AI/Machine Learning, far exceed the capabilities of today’s computing systems. A Dream Chip, for example, would ideally co-locate all memory and compute on a single chip, quickly accessible at low energy. Such Dream Chips aren’t realizable today. Computing systems instead use large off-chip memory and spend enormous time and energy shuttling data back and forth. This memory wall gets worse with growing problem sizes, especially as conventional 2D miniaturization gets increasingly difficult.
The next leap in computing requires transformative NanoSystems that exploit the unique characteristics of nanotechnologies and abundant-data workloads. We create new chip architectures through ultra-dense 3D integration of logic and memory – the N3XT 3D approach. Multiple N3XT 3D chips are integrated through a continuum of chip stacking/interposer/wafer-level integration — the N3XT 3D MOSAIC. To scale with growing problem sizes, new Illusion systems orchestrate workload execution on the N3XT 3D MOSAIC, creating the illusion of a Dream Chip with near-Dream energy and throughput.
Several hardware prototypes, built in industrial facilities resulting from lab-to-fab activities, demonstrate the effectiveness of our approach. We target 1,000X system-level energy-delay-product benefits, especially for abundant-data workloads. We also address new ways of ensuring robust system operation despite growing challenges of design bugs, manufacturing defects, reliability failures, and security attacks.
Bio: Subhasish Mitra holds the William E. Ayer Endowed Chair Professorship in the Departments of Electrical Engineering and Computer Science at Stanford University. He directs the Stanford Robust Systems Group, serves on the leadership team of the Microelectronics Commons AI Hardware Hub, and is the Associate Chair (Faculty Affairs) of Stanford Computer Science. His research ranges across Robust Computing, NanoSystems, Electronic Design Automation (EDA), and Neurosciences. Results from his research group have influenced almost every contemporary electronic system, and have inspired significant government and research initiatives in multiple countries. He has held several international academic appointments — the Carnot Chair of Excellence in NanoSystems at CEA-LETI in France, Invited Professor at EPFL in Switzerland, and Visiting Professor at the University of Tokyo in Japan. Prof. Mitra also has consulted for major technology companies including Cisco, Google, Intel, Samsung, and Xilinx (now AMD).
In the field of Robust Computing, he has created many key approaches for circuit failure prediction, on-line test and diagnostics, QED system validation, soft error resilience, and X-Compact test compression. Their adoption by industry is growing rapidly, in markets ranging from cloud computing to automotive systems. His X-Compact approach has proven essential for cost-effective manufacturing and high-quality testing of almost all 21st century systems, enabling billions of dollars in cost savings.
In the field of NanoSystems, with his students and collaborators, he demonstrated several firsts: the first NanoSystems hardware among all emerging nanotechnologies for energy-efficient computing systems, the first published end-to-end computing systems hardware using resistive memory, the first 3D NanoSystems with computation immersed in data storage, and the first monolithic 3D technology in a silicon foundry. These received wide recognition: cover of NATURE, several Research Highlights to the US Congress, and highlight as "important scientific breakthrough" by global news organizations.
Prof. Mitra's honors include the Harry H. Goode Memorial Award (by the IEEE Computer Society for outstanding contributions in the information processing field), Newton Technical Impact Award in EDA (test-of-time honor by ACM SIGDA and IEEE CEDA), the University Researcher Award (the highest university honor by the Semiconductor Industry Association and Semiconductor Research Corporation to recognize lifetime research contributions), the Intel Achievement Award (Intel’s highest honor), and the Distinguished Alumnus Award from the Indian Institute of Technology, Kharagpur. He and his students have published over 10 award-winning papers across 5 topic areas (technology, circuits, EDA, test, verification) at major venues including the Design Automation Conference, International Electron Devices Meeting, International Solid-State Circuits Conference, International Test Conference, Symposium on VLSI Technology, Symposium on VLSI Circuits, and Formal Methods in Computer-Aided Design. Stanford undergraduates have honored him several times "for being important to them." He is an ACM Fellow, an IEEE Fellow, and a foreign member of Academia Europaea.
Abstract: The rise of artificial intelligence (AI)-driven marvels hinges on the unrelenting advances in digital memory and storage solutions. The exponential trajectory of improvements in dynamic random-access memory (DRAM) and NAND flash, which are the mainstays of main memory and storage, respectively, is, however, facing formidable challenges at the technology level. In this talk, we will discuss the potential of emerging ferroelectric technologies to upend the DRAM and NAND landscapes. We will highlight how ferroelectrics can enable the transition from 2-D to 3-D in DRAM technology and facilitate vertical scaling in NAND technology to achieve the 1000-layer milestone and beyond. We will also explore how ferroelectric devices can contribute to embedded and storage-class memory technologies and examine the challenges they face.
Bio: Asif Khan is an Associate Professor in the School of Electrical and Computer Engineering with a courtesy appointment in the School of Materials Science and Engineering at the Georgia Institute of Technology. Dr. Khan’s research focuses on ferroelectric materials and devices to address the challenges faced by semiconductor technology due to the end of transistor miniaturization. His work led to the first experimental proof-of-concept demonstration of ferroelectric negative capacitance, which can reduce the power dissipation in transistors. His recent interest is understanding and demonstrating the fundamental limits of memory technologies concerning their scalability, density, capacity, performance, and reliability. His group publishes research on topics that include both logic and memory technologies, as well as artificial intelligence and neuromorphic hardware. Dr. Khan’s notable awards include the DARPA Young Faculty Award (2021), the NSF CAREER award (2021), the Intel Rising Star award (2020), the Qualcomm Innovation Fellowship (2012), the TSMC Outstanding Student Research Award (2011), and the University Gold Medal from Bangladesh University of Engineering and Technology (2011). Dr. Khan received the Class of 1934 CIOS Honor Roll award for excellence in teaching a graduate course on Quantum Computing Devices and Hardware in Fall 2020. He is presently serving as an editor of IEEE Electron Device Letters. In the past, he has also worked as an associate editor for IEEE Access, and as a technical program committee member for various conferences, including the IEEE International Electron Devices Meeting (IEDM) and the Design Automation Conference (DAC), among others.
Abstract: High-density (HD) SRAM has been the conventional device of choice for the last-level cache (LLC). However, with the slowing of transistor scaling, as reflected in the industry’s almost identical HD SRAM cell size from 5 nm to 3 nm, alternative solutions such as 3D stacking with advanced packaging (e.g., AMD’s V-Cache) are being pursued. Escalating data demands in AI/ML workloads now necessitate ultra-large LLCs to decrease off-chip memory movement. Therefore, the exploration of monolithic 3D integration, where active devices are stacked at the back-end-of-line (BEOL), is compelling as a way to further increase integration density without the costly bonding process. This talk introduces a cross-layer co-design framework to benchmark BEOL-compatible amorphous oxide semiconductor (AOS) based 2T-gain-cell (2T-GC) cache memory, enabled by a newly developed modeling tool, NS-Cache, which is coupled with the gem5 simulator for system-level power-performance-area (PPA) analysis.
Bio: Shimeng Yu is the endowed Dean’s Professor of Electrical and Computer Engineering at the Georgia Institute of Technology. He received his PhD degree from Stanford University in 2013. He was elevated to IEEE Fellow for contributions to non-volatile memories and in-memory computing. His 400+ publications have received 32,000+ citations (Google Scholar), with an H-index of 83. He serves on the technical program committees of flagship conferences in the field, including the IEEE International Electron Devices Meeting (IEDM) and the IEEE Symposium on VLSI Technology and Circuits. He is also an editor for IEEE Electron Device Letters (EDL). Among Prof. Yu’s honors, he was a recipient of the National Science Foundation (NSF) CAREER Award in 2016, the IEEE Electron Devices Society (EDS) Early Career Award in 2017, the ACM Special Interest Group on Design Automation (SIGDA) Outstanding New Faculty Award in 2018, the Semiconductor Research Corporation (SRC) Inaugural Young Faculty Award in 2019, IEEE Circuits and Systems Society (CASS) Distinguished Lecturer in 2021, IEEE Electron Devices Society (EDS) Distinguished Lecturer in 2022, and the Intel Outstanding Researcher Award in 2023.
Abstract: TBD
Bio: Andreas Olofsson is the founder and CEO of Zero ASIC, a chiplet semiconductor startup reducing the barrier to ASICs with chiplets. From 2017 to 2020, Mr. Olofsson was a program manager at DARPA, where he managed 8 different US research programs in heterogeneous integration, EDA, design & verification, high-performance computing, machine learning, and analog computing. From 2008 to 2017, Mr. Olofsson founded and managed Adapteva, an ultra-lean parallel processor startup that led the industry in computing energy efficiency. Prior to Adapteva, he worked at Analog Devices as a design manager and architect for advanced DSP and mixed-signal products. Mr. Olofsson received his Bachelor of Science in Physics and Electrical Engineering and Master of Science in Electrical Engineering from the University of Pennsylvania. He is a senior member of IEEE and holds nine U.S. patents.
Matt Sinclair (University of Wisconsin)
Abstract: In recent years, system designers have increasingly been turning to heterogeneous systems to improve performance and energy efficiency. Specialized accelerators are frequently used to improve the efficiency of computations that run inefficiently on conventional, general-purpose processors. As a result, systems ranging from smartphones to datacenters, hyperscalers, and supercomputers are increasingly using large numbers of accelerators (including GPUs) while providing better efficiency than CPU-based solutions. In particular, GPUs are widely used in these systems due to their combination of programmability and efficiency. Traditionally, GPUs are throughput-oriented, focused on data parallelism, and assume synchronization happens at a coarse granularity. However, programmers have begun using these systems for a wider variety of applications which exhibit different characteristics, including latency-sensitivity, mixes of both task and data parallelism, and fine-grained synchronization. Thus, future heterogeneous systems must evolve and make deadline-aware scheduling, more intelligent data movement, efficient fine-grained synchronization, and effective power management first-order design constraints. In the first part of this talk, I will discuss our efforts to apply hardware-software co-design to help future heterogeneous systems overcome these challenges and improve performance, energy efficiency, and scalability. Then, in the second part I will discuss how the on-going transition to chiplet-based heterogeneous systems exacerbates these challenges and how we address these challenges in chiplet-based heterogeneous systems by rethinking the control plane.
Bio: Matt Sinclair is an Assistant Professor in the Computer Sciences Department at the University of Wisconsin-Madison. He is also an Affiliate Faculty in the ECE Department and Teaching Academy at UW-Madison. His research primarily focuses on how to design, program, and optimize future heterogeneous systems. He also designs the tools for future heterogeneous systems, including serving on the gem5 Project Management Committee and the MLCommons Power, HPC, and Science Working Groups. He is a recipient of the NSF CAREER award, and his work has been funded by the DOE, Google, NSF, and SRC. Matt’s research has also been recognized several times, including an ACM Doctoral Dissertation Award nomination, a Qualcomm Innovation Fellowship, the David J. Kuck Outstanding PhD Thesis Award, and an ACM SIGARCH - IEEE Computer Society TCCA Outstanding Dissertation Award Honorable Mention. He is also the current steward for the ISCA Hall of Fame.
Abstract: TBD
Bio: TBD
Abstract: The rapid growth of AI/LLMs necessitates system-scale AI solutions. Such a solution relies not only on AI compute devices (e.g., GPUs, TPUs), but also requires comprehensive AI infrastructure (e.g., network, storage) to form a complete system. While modern GPUs (and other AI compute devices) offer a growing abundance of AI tensor compute, the key challenge is in the AI infrastructure: efficiently bringing data to/from the GPUs so they can be fully utilized. Unfortunately, CPUs can no longer keep up with handling such data-processing AI infrastructure tasks, drastically limiting overall system performance. Data Processing Units (DPUs, aka IPUs/SuperNICs, etc.) offload and accelerate such data-processing tasks from the CPU to DPU hardware, to boost system performance, efficiency, and scalability. This talk first discusses challenges in AI infrastructure. Then, it presents DPU solutions developed at MangoBoost to address such challenges. Finally, it highlights concrete results in DPU-based acceleration of multiple aspects of AI systems, such as improving inter-GPU communications, GPU-storage communications, and inference serving on well-known AI benchmarks such as MLPerf.
Bio: Dr. Nurvitadhi is a co-founder and the Chief Product Officer of MangoBoost, Inc., which offers novel customizable data processing unit (DPU) solutions to boost server system performance and efficiency. MangoBoost, Inc. is a $65M-funded start-up with rapid growth (and actively hiring). Dr. Nurvitadhi’s interests are in hardware accelerator architectures, systems, and software for key application domains (e.g., AI, analytics). Previously, he was a Principal Engineer at Intel, focused on FPGAs, accelerators, and AI technologies. He has 70+ peer-reviewed publications and 120+ patents granted/pending, with an H-index of 39. In 2020, he was recognized as a top-30 inventor by the Intel Patent Group and received a Mahboob Khan Outstanding Liaison Award from SRC. He has served on program committees of IEEE/ACM conferences, and as the Technical Program Chair for FCCM 2022. He received a PhD in ECE from Carnegie Mellon University, and an MBA from Oregon State University.
Abstract: The memory system has historically been a primary performance determinant for server-grade computers. The multi-faceted challenges it poses are commonly referred to as the “memory wall”: rigid capacity, bandwidth, and cost constraints. Current technological trends motivate a memory architecture rethink by leveraging serial interfaces, opening opportunities to overcome current limitations. Specifically, these opportunities are embodied by the emerging Compute Express Link (CXL) technology, which is garnering widespread adoption in the industry. CXL is well-positioned to revolutionize the way server systems are built and deployed, as it enables new capabilities in memory system design. CXL-centric or CXL-augmented memory systems bear characteristics that cater well to the growing demands of modern workloads. This short talk will demonstrate novel instances of CXL-centric memory systems that alleviate long-standing memory system bottlenecks in server architectures.
Bio: Alexandros (Alex) Daglis is an Assistant Professor of Computer Science at Georgia Tech. Daglis’ research interests lie in computer architecture, with specific interests in datacenter architectures, network-compute co-design, and memory systems. His research has been supported by the NSF, IARPA, Speculative Technologies, Samsung, and Intel Corporation, and routinely appears at top-tier computer architecture venues such as ISCA, MICRO, ASPLOS, and HPCA. Daglis is a recipient of the NSF CAREER award, a Google Faculty Research Award, and a Georgia Tech Junior Faculty Teaching Award.
Abstract: In this talk, I will discuss devising next-generation compute platforms targeting end-to-end data pipelines for large-scale AI and machine learning. I will delve into our recent works, including FLuID, Phaze, and FAE/Hotline, which aim to tackle the challenges of building efficient and accelerated systems for AI at scale. These projects offer solutions that leverage the semantic similarity in embedding tables and the execution properties of deep learning models to enhance performance and efficiency through acceleration of both the application and the data movement. This is achieved through specialized model pruning for load balancing (FLuID), optimized architecture search and scheduling algorithms for higher utilization (Phaze), and reduced data movement across the system through semantic profiling (FAE, Hotline) in a distributed CPU/GPU system.
Bio: Divya Mahajan is an Assistant Professor in the School of ECE and the School of Computer Science at Georgia Tech. She received her Ph.D. from the Georgia Institute of Technology and her Master’s from UT Austin. She obtained her Bachelor’s from IIT Ropar, where she was conferred the President of India Gold Medal, the highest academic honor in the IITs.
Prior to joining Georgia Tech, Divya was a Senior Researcher at Microsoft Azure, where she worked from September 2019. Her research has been published in top-tier venues such as ISCA, HPCA, MICRO, ASPLOS, NeurIPS, and VLDB. Her dissertation was recognized with the 2017 NCWIT Collegiate Award and a distinguished paper award at High Performance Computer Architecture (HPCA) 2016.
Currently, she leads the Systems Infrastructure and Architecture Research Lab at Georgia Tech. Her research team is devising next-generation sustainable compute platforms targeting end-to-end data pipelines for large-scale AI and machine learning. The work draws insights from a broad set of disciplines such as computer architecture, systems, and databases.
Abstract: Quantum computers have improved dramatically as industry has pushed the capability of these devices in terms of both scale and quality. Continued improvement requires research at all levels of the stack, from the physical control of qubits to the software layer that executes programs. Quantum systems at universities enable scientists and engineers to optimize over all these levels and to test new frameworks for quantum system design. In this talk, I will discuss how varying levels of access to quantum computers at companies, national laboratories, and universities enable different kinds of research.
Bio: Kenneth Brown is the Michael J. Fitzpatrick Distinguished Professor in the Departments of Electrical & Computer Engineering, Physics, and Chemistry at Duke University. He is an expert in quantum information science and engineering, and he uses the control of quantum systems to develop new technologies and understand the natural world. His research interests are ion trap quantum computers and quantum error correction. He serves on the American Physical Society Council of Representatives for the Division of Quantum Information. He was named a Fellow of the American Physical Society, a Kavli Fellow, and an Experienced Research Fellow of the Alexander von Humboldt Foundation for his work in quantum information. He is a scientific advisor for IonQ.
Brad Aimone (Sandia National Laboratories)
Abstract: Neuromorphic hardware is reaching brain-like scales, yet its broad value proposition has remained somewhat difficult to define. While much of the attention on neuromorphic algorithms has focused on low-power neural networks, we have hypothesized that neuromorphic hardware that captures the ubiquitous stochasticity of the brain may provide a path to efficient probabilistic algorithms that could impact computing more broadly. In this talk, I will describe two algorithms for neuromorphic computing that demonstrate its near-term potential for impacting advanced computing applications. First, I will describe our neuromorphic Monte Carlo algorithm, in which we observed a neuromorphic advantage for simulations of time-homogeneous discrete-time Markov chains. These neural Monte Carlo results have broad implications for many problems that can be defined as stochastic differential equations. Next, I will show our recent results in deploying neuromorphic hardware for finite element simulations. This algorithm, dubbed NeuroFEM, solves sparse linear systems efficiently with a cortex-like algorithm; we demonstrate it on real neuromorphic hardware (Intel’s Loihi 2) and show close-to-ideal strong and weak scaling.
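For readers unfamiliar with the baseline computation being accelerated, the sketch below is a minimal, conventional (NumPy, CPU-only) Monte Carlo simulation of a time-homogeneous discrete-time Markov chain. It is not the neuromorphic algorithm from the talk, and the 3-state transition matrix is a made-up toy example; the empirical histogram over many independent walkers simply approximates the chain's state distribution after the chosen number of steps.

```python
import numpy as np

def simulate_dtmc(P, x0, steps, n_walkers=100_000, seed=None):
    """Monte Carlo estimate of the state distribution of a time-homogeneous
    discrete-time Markov chain after `steps` transitions.

    P  : (n, n) row-stochastic transition matrix
    x0 : initial state index shared by every walker
    """
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    cdf = P.cumsum(axis=1)              # per-row CDFs of the transition matrix
    states = np.full(n_walkers, x0)
    for _ in range(steps):
        u = rng.random(n_walkers)
        # inverse-CDF sampling: next state = first index whose CDF reaches u
        states = (cdf[states] < u[:, None]).sum(axis=1)
    return np.bincount(states, minlength=n) / n_walkers

# Toy 3-state chain (hypothetical numbers): as `steps` grows, the sampled
# distribution converges toward the chain's stationary distribution.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])
print(simulate_dtmc(P, x0=0, steps=200, seed=0))
```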
Bio: Dr. Brad Aimone is a Distinguished Member of Technical Staff in the Center for Computing Research at Sandia National Laboratories, where he is a lead researcher in leveraging computational neuroscience to advance artificial intelligence and in using neuromorphic computing platforms for future scientific computing applications. He recently led a multi-institution DOE Office of Science Microelectronics Co-Design project titled COINFLIPS (CO-designed Influenced Neural Foundations Inspired by Physical Stochasticity), which is focused on developing a novel probabilistic neuromorphic computing platform. He also currently leads several other research efforts on designing neural algorithms for scientific computing applications and neuromorphic machine learning implementations.
Brad has published over seventy peer-reviewed journal and conference articles in venues such as Advanced Materials, Neuron, Nature Neuroscience, Nature Electronics, Communications of the ACM, and PNAS, and he is one of the co-founders of the Neuro-Inspired Computational Elements, or NICE, conference. Prior to joining the technical staff at Sandia in 2011, Dr. Aimone was a postdoctoral research associate at the Salk Institute for Biological Studies. He holds a Ph.D. in computational neuroscience from the University of California, San Diego, and Bachelor’s and Master’s degrees in chemical engineering from Rice University.
George Michelogiannakis (Lawrence Berkeley National Laboratory)
Abstract: Digital computing in cryogenic environments often has more to worry about than just tight power budgets. Cryogenic sensors require digital control circuits that operate adjacent to or within harsh environments created by external interference such as electromagnetic fields. In addition, manufacturing variations and cooling imperfections introduce further chances of timing variations and other transient errors in cryogenic digital circuits. In this talk, I will present our recent work on alternative compute models for JJ-based RSFQ cryogenic digital compute circuits, and emphasize their error resilience in addition to their ability to meet power budgets. I will then conclude with thoughts on future directions and what gap remains to be bridged.
Bio: George Michelogiannakis is a staff scientist in the computer architecture group (CAG) in the AMCR division at Lawrence Berkeley National Laboratory. He has done extensive work on networking (both off- and on-chip) and computer architecture. His latest work focuses on the post-Moore's-law era, looking into superconducting digital logic with novel compute models, compute and memory architectures, specialization, emerging devices (transistors), photonics, and 3D integration. He is also currently characterizing the use of key resources in modern HPC systems to reveal opportunities for resource disaggregation and is designing photonically resource-disaggregated racks.
Niraj Jha (Princeton University)
Abstract: The artificial intelligence (AI) industry is currently focused on achieving superintelligence in a top-down fashion by training a very large (hundreds of billions to trillions of parameters) omniscient multimodal model using a large language model (LLM) as the base. This approach is insatiably thirsty for data during training, leading to unsustainable CO2 emissions. Even after incurring such huge computational and energy costs, these models are known to hallucinate. This makes it difficult to employ such models in domains where accuracy is important, e.g., smart healthcare. We propose to take the opposite tack – build medical superintelligence bottom-up, modeled after how superintelligence is achieved in human society. Each of us just has human intelligence, but a society of humans achieves superintelligence in a bottom-up fashion by looking at the problem from diverse angles. Could we build medical superintelligence in the same bottom-up fashion through a society of medical AI assistants and AI agents? The AI assistants will serve as aides to health professionals. AI agents will have more autonomy. They need to be accompanied by a robust reasoning framework, e.g., counterfactual (what-if) reasoning. The current top-down approach to developing AI agents is based on using LLMs for reasoning. However, LLMs exhibit very uneven reasoning performance. Our medical superintelligence framework will take inspiration from neuroscience and include episodic and working memories to facilitate reasoning. The AI assistants in our superintelligence framework will be based on fine-tuned foundation models, targeted at various modalities, e.g., physiological signals, medical images, and medical text, that can be trained data-efficiently and are aligned with each other. The initial goals of the framework are accurate disease detection, individual well-being, interpretability of AI predictions, and personalized medical decision-making. In this talk, we will explore our initial progress towards realizing this vision.
Bio: Niraj K. Jha received his B.Tech. degree in Electronics and Electrical Communication Engineering from the Indian Institute of Technology, Kharagpur, India, in 1981 and his Ph.D. degree in Electrical Engineering from the University of Illinois at Urbana-Champaign in 1985. He is a Professor of Electrical and Computer Engineering at Princeton University. He has served as an Associate Director for the Princeton Andlinger Center for Energy and the Environment. He is a Fellow of IEEE and ACM. He was given a Distinguished Alumnus Award by I.I.T., Kharagpur in 2014. He has co-authored five books, among which are two textbooks that are widely used around the world. He has served as the Editor-in-Chief of IEEE Transactions on VLSI Systems and as an Associate Editor of several other IEEE Transactions. He is an author or co-author of more than 490 papers, among which are 16 award-winning papers. His research interests include algorithms and architectures for machine learning, with applications to smart healthcare.
Making sense of the edge: Leveraging AI-infused edge integrated sensing, communications, and computing that intelligently interacts with infrastructure
Abstract: Viewing the compute hierarchy from the edge of the physical world to the cloud, we see a rapidly growing number of diverse sensors generating data at a rate that is growing faster than the corresponding compute and memory bandwidth available to gain insights from it.
We will explore an implementation of integrated sensing, communications, and computation (ICC, or JCS) infused with dynamically reconfigurable AI to improve the signal-to-noise ratio of "insights-per-gigabyte" of data and, moreover, to create an end-to-end interactive mechanism at the edge that enables the exploration of insights, with applications in automotive, robotics, and drones.
The result is a software-defined edge solution that reduces data volume, latency, and power consumption for given use cases, yet can interact with upstream processing to "look" for desired outcomes as well as provide data structures that contribute to aggregate perspectives efficiently.
Bio: Scott is currently Chief Solutions Officer for Xcelerium Inc, a semiconductor startup infusing AI into RF, communications, and sensing, where he focuses on applying its unique fungible compute architecture, based on the RISC-V ISA, to perception, communications, and data reduction and insights at the edge in automotive, aerospace & defense, industrial, and enterprise applications. He also serves as the Automotive Semiconductor Advisor for Natcast.
As an engineering veteran of more than 30 years, Scott has worked in roles that leveraged technical, transformational, and leadership capabilities and applied them to the development of systems, embedded & cloud software, and SoC designs for products ranging from mobile phones to IoT to automotive.
Amirreza Rastegari (Microsoft)
Abstract: The launch of Eagle, Azure’s hyper-scale supercomputer and the number 3 system on the TOP500 list in November 2023, marked a new era in which cloud providers are at the forefront of supercomputing. Cloud providers design and deploy supercomputers across a spectrum of system sizes: hyper-scale systems like Eagle, deployed entirely in single data centers, and smaller systems like Reindeer, the number 32 system on the TOP500 list in November 2024, which are replicated and deployed in numerous data centers globally. These smaller systems support a growing and diverse user base while offering system-level fault tolerance across geographic locations. This dual approach ensures agility and resilience in meeting the needs of modern supercomputing users.
Despite the rapid expansion of supercomputing capabilities in the cloud, public knowledge on the performance and scalability of cloud-based supercomputers remains limited. In this talk, we discuss results from comparative analyses between Azure’s large- and small-scale supercomputers, specifically Eagle and Reindeer, and traditional on-premises supercomputers.
Bio: Amirreza Rastegari is the lead HPC performance engineer at Microsoft Azure, where he is focused on the performance assessment, analysis, and optimization of Azure's next-generation supercomputers. Amirreza holds a PhD in fluid dynamics and scientific computing from the University of Michigan, Ann Arbor.
Abstract: Generative AI (Gen AI), foundation models (FMs), and, more specifically, large language models (LLMs) have recently advanced the field of AI significantly. Such AI models are rapidly advancing and evolving, experiencing exponential growth in model size and data, requiring heterogeneous hardware environments with various AI accelerators, and demanding high throughput, high energy efficiency, and dependable performance and accuracy. Above all, the compute system to support such AI workloads must be cost-effective, simple enough to build, and easy to operate, in order to ensure broad applicability and wide adoption. To this end, IBM Research, together with the University of Illinois, has recently published a vision white paper. We reimagine a hybrid cloud featuring a fully integrated and optimized stack that supports a wide range of AI frameworks, runtimes, tools, and various hardware resources, including cache-coherent interconnects, SmartNICs, AI accelerators, and quantum computers, just to name a few. Our goal over the next 5-10 years is to identify new computing, storage, and communication elements, sub-systems, and innovations across all layers of the stack to reinvent the hybrid cloud for emerging AI and agentic workloads.
Bio: Alaa Youssef is a Senior Manager and Master Inventor at the IBM T.J. Watson Research Center. He leads the cloud-native AI platform research team, contributing to IBM’s watsonx and OpenShift AI platforms for the training, tuning, and inference of large generative AI and foundation models. His research interests are in hybrid cloud, cloud-native AI and HPC platforms, resource management and optimization, and sustainable and trusted distributed cloud computing. He has co-authored technical publications in top conferences, received two best paper awards, and has over 40 patented inventions.
Dr. Youssef is currently co-leader of the Hybrid Cloud & AI thrust of the IBM-Illinois Discovery Accelerator Institute, where he is co-leading a number of collaborative research initiatives between IBM Research and UIUC in the areas of hybrid cloud platform and infrastructure for AI, model optimization and runtimes, and agentic systems.
Dr. Youssef has held multiple technical and management positions in IBM Research and IBM Software Services across multiple geographies, including the USA, Egypt, and KSA. He received his PhD in Computer Science from Old Dominion University, Virginia, USA, and his BSc and MSc in Computer Engineering from Alexandria University, Egypt.
Abstract: Reducing power consumption is the dominant challenge for ML system designs. AMD has achieved tremendous scalability in accelerator throughput by leveraging chiplet technology, but this improvement is not free. Much like the rise of multi-core processors two decades ago required software to embrace multi-threaded programming to achieve high performance, tomorrow’s processors will force software to optimize for intra-chip locality. This talk will highlight how to partition future GPU programs within the chip for power efficiency and how to optimize the subsequent collective communication for the on-chip memory hierarchy.
Bio: Brad Beckmann is a Fellow in AMD’s Research and Advanced Development group. Brad leads a team of researchers pursuing next-generation hardware and software technologies for scale-up/scale-out GPU networking. Brad joined AMD in 2007 and has led projects innovating in GPU memory consistency models, GPU cache coherence, simulation, and on-chip networks. He also co-led the initial development and release of the gem5 simulator in 2011. He has published over 30 conference and journal papers and co-authored over 40 granted patents. Prior to AMD, Brad was a software developer for Microsoft’s Windows Server Performance team. Brad has a PhD in Computer Science from the University of Wisconsin-Madison.
Priyanka Ranade (US DoD / University of Maryland)
Abstract: Accelerated multi-GPU computing architectures have become increasingly popular in High-Performance Computing (HPC) systems over the last decade. Despite this widespread adoption across numerous application spaces, a number of scalability bottlenecks emerge as the number of GPUs per node and cluster grows. Traditionally, inter-node GPU communication has depended on a CPU-centric model of execution, which is inefficient for heterogeneous applications that require high-batch, real-time data processing. In the last decade, advancements in GPU-centric technologies have allowed devices to directly access the memory of remote devices without involving the CPU. This enables an efficient datapath between GPUs and other devices, such as network adapters or storage systems, within a high-performance computing cluster. GPU-centric communication models have proven valuable in growing fields like machine learning, inspiring application developers to design and implement novel algorithms that fully leverage GPUDirect communication patterns.
However, there is a lack of a standardized benchmark suite that evaluates GPU-centric communication patterns for machine learning applications. In this talk, we present findings and propose a set of benchmarks that evaluate the performance of GPU-centric technologies for ML workloads. Additionally, we map the benchmarks to a shmem-based GraphML use case to showcase extensions of classical benchmarks into growing application spaces.
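As a point of reference for what such communication micro-benchmarks measure, the sketch below times an inter-GPU all-reduce with PyTorch's NCCL backend (which can use GPUDirect-style transports when available) and reports an effective bus bandwidth. This is only a minimal illustration under assumed tooling (PyTorch, torchrun, a multi-GPU node); it is not the benchmark suite proposed in the talk.

```python
import os
import time
import torch
import torch.distributed as dist

def allreduce_bandwidth(num_elems, iters=20, warmup=5):
    """Time NCCL all-reduce on a float32 tensor and report bus bandwidth."""
    rank, world = dist.get_rank(), dist.get_world_size()
    x = torch.ones(num_elems, dtype=torch.float32, device="cuda")
    for _ in range(warmup):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters
    # ring all-reduce moves roughly 2*(world-1)/world of the buffer per GPU
    bytes_moved = x.numel() * 4 * 2 * (world - 1) / world
    if rank == 0:
        print(f"{num_elems:>12} elems: {elapsed*1e3:8.3f} ms, "
              f"{bytes_moved / elapsed / 1e9:7.2f} GB/s bus bandwidth")

if __name__ == "__main__":
    # Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    for n in (1 << 20, 1 << 24, 1 << 28):  # ~4 MB to ~1 GB messages
        allreduce_bandwidth(n)
    dist.destroy_process_group()
```

Sweeping the message size in this way exposes the latency-bound versus bandwidth-bound regimes that GPU-centric transports are designed to improve.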
Bio: Priyanka Ranade is a Research Scientist in the Advanced Computing Systems (ACS) division of the Laboratory for Physical Sciences (LPS) and an adjunct professor at the University of Maryland. She is actively pursuing research in optimizing the training and inference of deep neural networks using high-performance computing systems, through hardware evaluation, parallelism and scaling, and ML model architecture benchmarking. Prior to joining LPS, she was an embedded software engineer at Northrop Grumman Corporation. She holds a PhD in Computer Science from the University of Maryland, Baltimore County (UMBC).
Vijay Thakkar (NVIDIA / Georgia Tech)
Abstract: Targeting tensor cores for custom workloads is notoriously difficult; however, NVIDIA's success is enabled in large part by the CUDA programming model, which allows users to easily target our hardware. CUTLASS has been the de-facto programming model for Tensor Cores since CUTLASS 2.x and Ampere. CUTLASS 3.x takes this a step further with CuTe and provides an elegant and generalizable programming model for linear algebra kernels across all architectures, from Fermi all the way to Blackwell. Solving this has been a research effort seven years in the making, and the end result is a library that has been downloaded 3.5 million times, is deployed on tens of billions of dollars' worth of Ampere, Hopper, and Blackwell GPU deployments, and has enabled countless groundbreaking research projects in HPC and DL, such as FlashAttention 2 and 3, Omniverse fVDB, and ByteDance Flux, to name a few. In this work, we describe the design considerations that led to the invention of CUTLASS 3.x, how we think about GPU architectures and linear algebra, and how it enables all these use cases.
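For context on the kind of structure CUTLASS and CuTe formalize, the sketch below is a plain NumPy illustration of hierarchically tiled GEMM, where each output tile accumulates partial products over chunks of the K dimension. It is deliberately not CUTLASS/CuTe code and the tile sizes are arbitrary; it only shows the tiling idea that such libraries map onto thread blocks, warps, and Tensor Core instructions on real GPUs.

```python
import numpy as np

def tiled_gemm(A, B, tile_m=64, tile_n=64, tile_k=64):
    """Block-tiled GEMM: C = A @ B computed one output tile at a time.

    Each (i, j) output tile accumulates over K-dimension chunks, mirroring
    how a GPU kernel assigns output tiles to thread blocks and marches over
    K in shared-memory-sized steps.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile_m):
        for j0 in range(0, N, tile_n):
            # accumulator for one output tile (clipped at matrix edges)
            acc = np.zeros((min(tile_m, M - i0), min(tile_n, N - j0)), dtype=A.dtype)
            for k0 in range(0, K, tile_k):
                acc += A[i0:i0 + tile_m, k0:k0 + tile_k] @ B[k0:k0 + tile_k, j0:j0 + tile_n]
            C[i0:i0 + tile_m, j0:j0 + tile_n] = acc
    return C

# quick correctness check against the reference product
A = np.random.rand(256, 192).astype(np.float32)
B = np.random.rand(192, 128).astype(np.float32)
assert np.allclose(tiled_gemm(A, B), A @ B, atol=1e-3)
```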
Bio: Vijay is a senior architect in the fast kernels team, where he has worked on the CUTLASS 3.0 project since its inception as one of its leads. For the past three years he has focused on the development of Blackwell kernels via CuTe and CUTLASS's programming model and the PTX ISA, which were just released as CUTLASS 3.8 and CUDA 12.8, respectively. He broadly collaborates with the GPU architecture, compiler, and programming model teams on software/hardware co-design for tensor cores and other DL features of datacenter GPUs. At GaTech, he is registered as a part-time PhD student in Rich Vuduc's HPC Garage lab, where he hopes to defend his PhD some day.
Keita Teranishi (Oak Ridge National Laboratory)
Abstract: Generative Artificial Intelligence (AI) powered by Large Language Models (LLMs) like GPT and Llama has shown promise in automating code generation across various programming languages. However, applying Generative AI to high-performance computing (HPC) presents unique challenges due to the domain’s specialized expertise requirements, performance constraints, and correctness concerns. Additionally, the complexity of the HPC software stack—including package management, installation dependencies, and hardware-specific optimizations—has made it difficult for LLMs to provide effective assistance. This presentation introduces ChatHPC, a new project that employs LLMs to develop a range of HPC and scientific applications, including BLAS kernels (e.g., AXPY, GEMV, GEMM), performance-portable programming (e.g., Kokkos), and code translation and modernization (e.g., Fortran to C++, CUDA to SYCL). ChatHPC leverages existing open-source LLMs, fine-tuned with a small amount of domain-specific data through streamlined workflows. Our analysis evaluates its efficacy and correctness in generating scientific computing kernels, as well as its adaptability to the specialized demands of HPC. Through this exploration, we aim to highlight the potential of Generative AI as an innovative tool for scientific computing while identifying key challenges that must be addressed to fully harness its capabilities.
Bio: Dr. Keita Teranishi is a senior computer scientist and the group leader of the programming systems group at Oak Ridge National Laboratory (ORNL). He has contributed to efforts related to software engineering and performance portability, among many other HPC topics including programming languages, macro-network simulation, fault tolerance, numerical algorithms (linear and tensor algebra), and performance tuning. He is currently leading two projects funded by the Advanced Scientific Computing Research Program, Office of Science, DOE -- Stewardship for Programming Systems and Tools and Durban: Enhancing Performance Portability in HPC Software with Artificial Intelligence.
Prior to ORNL, he was a principal member of technical staff at Sandia National Laboratories and a software engineer in the math and scientific libraries group at Cray Inc. He is currently leading the S4PST project on programming systems stewardship in DOE. He received his BS and MS degrees from the University of Tennessee, Knoxville, in 1998 and 2000, respectively, and his PhD degree from Penn State University in 2004.