The premiere annual conference of the high-performance computing community, SC24, was held in Atlanta last week, and it attracted a record-shattering number of attendees--nearly 18,000 registrants, up 28% from last year! The conference felt big as well, and there seemed to be a lot more running between sessions, meetings, and the exhibition floor. Despite its objectively bigger size though, the content of the conference felt more diffuse this year, and I was left wondering if this reflected my own biases or was a real effect of the AI industry beginning to overflow into AI-adjacent technology conferences like SC.
Of course, this isn't to say that SC24 was anything short of a great conference. Some exciting new technologies were announced, a new supercomputer beat out Frontier to become the fastest supercomputer on the Top500 list, and I got to catch up with a bunch of great people that I only get to see at shows like this. I'll touch on all of these things below. But this year felt different from previous SC conferences to me, and I'll try to talk about that too.
There's no great way to arrange all the things I jotted down in my notes, but I've tried to organize them by what readers may be interested in. Here's the table of contents:
Before getting into the details though, I should explain how my perspective shaped what I noticed (and missed) throughout the conference. And to be clear: these are my own personal opinions and do not necessarily reflect those of my employer. Although Microsoft covered the cost for me to attend SC, I wrote this blog post during my own free time over the Thanksgiving holiday, and nobody had any editorial control over what follows except me.
Although this is the eleventh SC conference I've attended, it was the first time that I:
Because of these changes in my identity as an attendee, I approached the conference with a different set of goals in mind:
As a hyperscale/AI person, I felt that I should prioritize the cloud and AI sessions whenever I was forced to choose between one session and another. I chose to focus on understanding how the traditional HPC community views hyperscale and AI, which meant I had to spend less time in the workshops, panels, and BOFs where I built my career.
As an engineer rather than a product manager, it wasn't my primary responsibility to run private briefings and gather HPC customers' requirements and feedback. Instead, I prioritized only those meetings where my first-hand knowledge of how massive-scale AI training works could have a meaningful impact. This meant I focused on partners and practitioners who also operate in the realm of hyperscale--think massive, AI-adjacent companies and the HPC centers who have historically dominated the very top of the Top500 list.
One thing I didn't anticipate going into SC24 is that I've inherited a third identity: there is a new cohort of people in HPC who see me as a long-time community member. This resulted in a surprising amount of my time being spent talking to students and early-career practitioners who were looking for advice.
These three identities and goals meant I don't have many notes to share on the technical program, but I did capture more observations about broader trends in the HPC industry and community.
A cornerstone of every SC conference is the release of the new Top500 list on Monday, and this is especially true in years when a new #1 supercomputer is announced. As was widely anticipated in the weeks leading up to SC24, El Capitan unseated Frontier as the new #1 supercomputer this year, posting an impressive 1.74 EFLOPS of FP64. In addition, Frontier grew a little (it added 400 nodes), there was a notable new #5 system (Eni's HPC6), and a number of smaller systems appeared that are worth calling out.
The highlight of the Top500 list was undoubtedly the debut of El Capitan, Lawrence Livermore National Laboratory's massive new MI300A-based exascale supercomputer. Its 1.74 EF score resulted from a 105-minute HPL run that came in under 30 MW, and a bunch of technical details about the system were disclosed by Livermore Computing's CTO, Bronis de Supinski, in an invited talk during the Top500 BOF. Plenty of others have summarized the system's speeds and feeds (e.g., see The Next Platform's article on El Cap), so I won't do that. However, I will comment on how unusual Bronis' talk was.
Foremost, the El Capitan talk seemed haphazard and last-minute. Considering the system took over half a decade of planning and cost at least half a billion dollars, El Capitan's unveiling was the most unenthusiastic description of a brand-new #1 supercomputer I've ever seen. I can understand that the Livermore folks have debuted plenty of novel #1 systems in their careers, but El Capitan is objectively a fascinating system, and running a full-system job for nearly two hours across first-of-a-kind APUs is an amazing feat. If community leaders don't get excited about their own groundbreaking achievements, what kind of message should the next generation of HPC professionals take home?
In sharp contrast to the blasé announcement of this new system was the leading slide that was presented to describe the speeds and feeds of El Capitan:
I've never seen a speaker take the main stage and put a photo of himself literally in the center of the slide, in front of the supercomputer he's talking about. I don't know what the communications people at Livermore were trying to do with this graphic, but I don't think it was intended to be evocative of the first thing that came to my mind:
The supercomputer is literally named "The Captain," and there's a photo of one dude (the boss of Livermore Computing, who is also standing on stage giving the talk) blocking the view of the machine. It wasn't a great look, and it left me feeling very uneasy about what I was witnessing and what message it was sending to the HPC community.
In case it needs to be said, HPC is a team sport. The unveiling of El Capitan (or any other #1 system before it) is always the product of dozens, if not hundreds, of people devoting years of their professional lives to ensuring it all comes together. It was a big miss, both to those who put in the work, and those who will have to put in the work on future systems, to suggest that a single, smiling face comes before the success of the system deployment.
The other notable entrant to the Top 10 list was HPC6, an industry system deployed by Eni (a major Italian energy company) built on MI250X. Oil and gas companies tend to be conservative in the systems they buy since the seismic imaging done on their large supercomputers informs hundred-million to billion-dollar investments in drilling a new well, and they have much less tolerance for weird architectures than federally funded leadership computing does. Thus, Eni's adoption of AMD GPUs in this #5 system is a strong endorsement of their capability in mission-critical commercial computing.
SoftBank, the Japanese investment conglomerate that, among other things, owns a significant stake in Arm, made its Top500 debut with two identical 256-node DGX H100 SuperPODs. While not technologically interesting (H100 is getting old), these systems represent a significant investment in HPC by private industry in Japan and signal that SoftBank is following the lead of large American investment groups in building private AI clusters for the AI startups in their portfolios. By doing this, SoftBank isn't dependent on third-party cloud providers to supply the GPUs that make these startups successful, which reduces its overall risk.
Although I didn't hear anything about these SoftBank systems at the conference, NVIDIA issued a press statement during the NVIDIA AI Summit Japan the week before SC24 that discussed SoftBank's investment in large NVIDIA supercomputers. The press statement says these systems will be used "for [SoftBank's] own generative AI development and AI-related business, as well as that of universities, research institutions and businesses throughout Japan." The release also suggests we can expect B200 and GB200 SuperPODs from SoftBank to appear as those technologies come online.
Just below the SoftBank systems was the precursor system to Europe's first exascale system. I was hoping that JUPITER, the full exascale system being deployed at FZJ, would appear in the Top 10, but it seems like we'll have to wait for ISC25 for that. Still, the JETI system ran HPL across 480 nodes of BullSequana XH3000, the same node that will be used in JUPITER, and achieved 83 PFLOPS. By comparison, the full JUPITER system will be over 10x larger ("roughly 6000 compute nodes" in the Booster), and projecting the JETI run (173 TF/node) out to this full JUPITER scale indicates that JUPITER should just squeak over the 1.0 EFLOPS line.
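For the curious, the arithmetic behind that projection is simple enough to sketch out; this is just a back-of-envelope calculation using the figures above, not an official estimate:

```python
# Back-of-envelope projection of JUPITER's HPL score from the JETI result.
# All figures come from the paragraph above; "roughly 6000 nodes" is approximate.
jeti_nodes = 480
jeti_rmax_pflops = 83                                   # JETI's HPL result

per_node_tflops = jeti_rmax_pflops * 1000 / jeti_nodes  # ~173 TFLOPS per node

jupiter_nodes = 6000                                    # the JUPITER Booster, roughly
jupiter_eflops = per_node_tflops * jupiter_nodes / 1e6

print(f"{per_node_tflops:.0f} TF/node -> ~{jupiter_eflops:.2f} EFLOPS projected")
# prints: 173 TF/node -> ~1.04 EFLOPS projected, i.e., just over the exascale line
```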
In preparation for JUPITER, Eviden had a couple of these BullSequana XH3000 nodes out on display this year:
And if you're interested in more, I've been tracking the technical details of JUPITER in my digital garden.
Waay down the list was Microsoft's sole new Top500 entry this cycle, an NVIDIA H200 system that ran HPL over 120 ND H200 v5 nodes in Azure. It was one of only two conventional (non-Grace) H200 clusters that appeared in the top 100, and it had a pretty good efficiency (Rmax/Rpeak > 80%). Microsoft also had a Reindeer node on display at its booth:
An astute observer may note that this node looks an awful lot like the H100 node used in its Eagle supercomputer, which was on display at SC23 last year. That's because it's the same chassis, just with an upgraded HGX baseboard.
Reindeer was not super exciting, and there were no press releases about it, but I mention it here for a couple reasons:
The exhibit floor had a few new pieces of HPC technology on display this year that are worthy of mention, but a lot of the most exciting HPC-centric stuff actually had a soft debut at ISC24 in May. For example, even though SC24 was MI300A's big splash due to the El Capitan announcement, some MI300A nodes (such as the Cray EX255a) were on display in Hamburg. However, Eviden had their MI300A node (branded XH3406-3) on display at SC24, which was new to me:
I'm unaware of anyone who's actually committed to a large Eviden MI300A system, so I was surprised to see that Eviden already has a full blade design. But as with Eni's HPC6 supercomputer, perhaps this is a sign that AMD's GPUs (and now APUs) have graduated from being built-to-order science experiments to a technology ecosystem that people will want to buy off the rack.
There was also a ton of GH200 on the exhibit hall floor, but again, these node types were also on display at ISC24. This wasn't a surprise since a bunch of upcoming European systems have invested in GH200 already; in addition to JUPITER's 6,000 GH200 nodes described above, CSCS Alps has 2,688 GH200 nodes, and Bristol's Isambard-AI will have 1,362 GH200 nodes. All of these systems will have a 1:1 CPU:GPU ratio and an NVL4 domain, suggesting this is the optimal way to configure GH200 for HPC workloads. I didn't hear a single mention of GH200 NVL32.
SC24 was the debut of NVIDIA's Blackwell GPU in the flesh, and a bunch of integrators had material on GB200 out at their booths. Interestingly, they all followed the same pattern as GH200 with an NVL4 domain size, and just about every smaller HPC integrator followed a similar pattern where
From this, I gather that not many companies have manufactured GB200 nodes yet, or if they have, there aren't enough GB200 boards available to waste them on display models. So, we had to settle for these bare NVIDIA-manufactured, 4-GPU + 2-CPU superchip boards:
What struck me is that these are very large FRUs--if a single component (CPU, GPU, voltage regulator, DRAM chip, or anything else) goes bad, you have to yank and replace four GPUs and two CPUs. And because all the components are soldered down, someone's going to have to do a lot of work to remanufacture these boards to avoid throwing out a lot of very expensive, fully functional Blackwell GPUs.
There were a few companies who were further along their GB200 journey and had more integrated nodes on display. The HPE Cray booth had this GB200 NVL4 blade (the Cray EX154n) on display:
It looks remarkably sparse compared to the super-dense blades that normally slot into the Cray EX line, but even with a single NVL4 node per blade, the Cray EX cabinet only supports 56 of these blades, leaving 8 blade slots empty in the optimal configuration. I assume this is a limitation of power and cooling.
The booth collateral around this blade suggested its use case is "machine learning and sovereign AI" rather than traditional HPC, and that makes sense since each node has 768 GB of HBM3e, which is enough to support training some pretty large sovereign models. However, the choice to force all I/O traffic onto the high-speed network by leaving room for only one piddly node-local NVMe SSD per blade will make training on this platform very sensitive to the quality of the global storage subsystem. This is great if you bundle this blade with all-flash Lustre (like Cray ClusterStor) or DAOS (handy, since Intel divested the entire DAOS development team to HPE). But it's not how I would build an AI-optimized system.
I suspect the cost-per-FLOP of this Cray GB200 solution is much lower than what a pure-play GB200 for LLM training would be. And since GB200 is actually a solid platform for FP64 (thanks to Dan Ernst for challenging me on this and sharing some great resources on the topic), I expect to see this node do well in situations that are not training frontier LLMs, but rather fine-tuning LLMs, training smaller models, and mixing in traditional scientific computing on the same general-purpose HPC/AI system.
Speaking of pure-play LLM training platforms, though, I was glad that very few exhibitors were trying to talk up GB200 NVL72 this year. It may have been the case that vendors simply aren't ready to begin selling NVL72 yet, but I like to be optimistic and instead believe that the exhibitors who show up to SC24 know that the scientific computing community likely won't get enough value out of a 72-GPU coherence domain to justify the additional cost and complexity of NVL72. I didn't see a single vendor with a GB200 NVL36 or NVL72 rack on display (or a GH200 NVL32, for that matter), and not having to think about NVL72 for the week of SC24 was a nice break from my day job.
Perhaps the closest SC24 got to NVL72 was a joint announcement at the beginning of the week by Dell and CoreWeave, who announced that they have begun bringing GB200 NVL72 racks online. Dell did have a massive, AI-focused booth on the exhibit floor, and they did talk up their high-powered, liquid-cooled rack infrastructure. But in addition to supporting GB200 with NVLink Switches, I'm sure that rack infrastructure would be equally good at supporting nodes geared more squarely at traditional HPC.
HPE Cray also debuted a new 400G Slingshot switch, appropriately named Slingshot 400. I didn't get a chance to ask anyone any questions about it, but from the marketing material that came out right before the conference, it sounds like a serdes upgrade without any significant changes to Slingshot's L2 protocol.
There was a Slingshot 400 switch for the Cray EX rack on display at their booth, and it looked pretty amazing:
It looks way more dense than the original 200G Rosetta switch, and it introduces liquid-cooled optics. If you look closely, you can also see a ton of flyover cables connecting the switch ASIC in the center to the transceivers near the top; similar flyover cables are showing up in all manner of ultra-high-performance networking equipment, likely reflecting the inability to maintain signal integrity across PCB traces.
The port density on Slingshot 400 remains the same as it was on 200G Slingshot, so there's still only 64 ports per switch, and the fabric scale limits don't increase. In addition, the media is saying that Slingshot 400 (and the GB200 blade that will launch with it) won't start appearing until "Fall 2025." Considering 64-port 800G switches (like NVIDIA's SN5600 and Arista's 7060X6) will have already been on the market by then though, Slingshot 400 will be launching with HPE Cray on its back foot.
However, there was a curious statement on the placard accompanying this Slingshot 400 switch:
It reads, \"Ultra Ethernet is the future, HPE Slingshot delivers today!\"
Does this suggest that Slingshot 400 is just a stopgap until 800G Ultra Ethernet NICs begin appearing? If so, I look forward to seeing HPE Cray jam third-party 800G switch ASICs into the Cray EX liquid-cooled form factor at future SC conferences.
One of the weirder things I saw on the exhibit floor was a scale-out storage server built on NVIDIA Grace CPUs that the good folks at WEKA had on display at their booth.
Manufactured by Supermicro, this "ARS-121L-NE316R" server (really rolls off the tongue) uses a two-socket Grace superchip and its LPDDR5X instead of conventional, socketed CPUs and DDR. The rest of it seems like a normal scale-out storage server, with sixteen E3.S SSD slots in the front and four 400G ConnectX-7 or BlueField-3 NICs in the back. No fancy dual-controller failover or anything like that; the presumption is that whatever storage system you'd install over this server would implement its own erasure coding across drives and servers.
At a glance, this might seem like a neat idea for a compute-intensive storage system like WEKA or DAOS. However, one thing that you typically want in a storage server is high reliability and repairability, features which weren't the optimal design point for these Grace superchips. Specifically,
On the upside, though, there might be a cost advantage to using this Grace-Grace server over a beefier AMD- or Intel-based server with a bunch of traditional DIMMs. And if you really like NVIDIA products, this lets you do NVIDIA storage servers to go with your NVIDIA network and NVIDIA compute. As long as your storage software can work with the interrupt rates of such a server (e.g., it supports rebuild-on-read) and the 144 Neoverse V2 cores are a good fit for its computational requirements (e.g., calculating complex erasure codes), this server makes sense. But building a parallel storage system on LPDDR5X still gives me the willies.
I could also see this thing being useful for certain analytics workloads, especially those which may be upstream of LLM training. I look forward to hearing about where this turns up in the field.
The last bit of new and exciting HPC technology that I noted came from my very own employer in the form of HBv5, a new, monster four-socket node featuring custom-designed AMD CPUs with HBM. STH wrote up an article with great photos of HBv5 and its speeds and feeds, but in brief, this single node has:
The node itself looks kind of wacky as well, because there just isn't a lot on it:
There are the obvious four sockets of AMD EPYC 9V64H, each with 96 physical cores and 128 GB of HBM3, and giant heat pipes on top of them since it's 100% air-cooled. But there's no DDR at all, no power converter board (the node is powered by a DC bus bar), and just a few flyover cables to connect the PCIe add-in-card cages. There is a separate fan board with just two pairs of power cables connecting to the motherboard, and that's really about it.
The front end of the node shows its I/O capabilities which are similarly uncomplicated:
There are four NDR InfiniBand cards (one localized to each socket) which are 400G-capable but cabled up at 200G, eight E1.S NVMe drives, and a brand-new dual-port Azure Boost 200G NIC. Here's a close-up of the right third of the node's front:
This is the first time I've seen an Azure Boost NIC in a server, and it looks much better integrated than the previous-generation 100G Azure SmartNIC that put the FPGA and hard NIC on separate boards connected by a funny little pigtail. This older 100G SmartNIC with pigtail was also on display at the Microsoft booth in an ND MI300X v5 node:
And finally, although I am no expert in this new node, I did hang around the people who are all week, and I repeatedly heard them answer the same few questions:
New technology announcements are always exciting, but one of the main reasons I attend SC and ISC is to figure out the broader trends shaping the HPC industry. What concerns are top of mind for the community, and what blind spots remain open across all the conversations happening during the week? Answering these questions requires more than just walking the exhibit floor; it involves interpreting the subtext of the discussions happening at panels and BOF sessions. However, identifying where the industry needs more information or a clearer picture informs a lot of the public-facing talks and activities in which I participate throughout the year.
The biggest realization that I confirmed this week is that the SC conference is not an HPC conference; it is a scientific computing conference. I sat in a few sessions where the phrase "HPC workflows" was clearly a stand-in for "scientific workflows," and "performance evaluation" still really means "MPI and OpenMP profiling." I found myself listening to ideas or hearing about tools that were intellectually interesting but ultimately not useful to me because they were so entrenched in the traditions of applying HPC to scientific computing. Let's talk about a few ways in which this manifested.
Take, for example, the topic of sustainability. There were talks, panels, papers, and BOFs that touched on the environmental impact of HPC throughout the week, but the vast majority of them really weren't talking about sustainability at all; they were talking about energy efficiency. These talks often use the following narrative:
The problem with this approach is that it declares victory when energy consumption is reduced. This is a great result if all you care about is spending less money on electricity for your supercomputer, but it completely misses the much greater issue: the electricity required to power an HPC job is often generated by burning fossil fuels, and the carbon emissions directly attributable to HPC workloads are contributing to global climate change. This blind spot was exemplified by this slide, presented during a talk titled "Towards Sustainable Post-Exascale Leadership Computing" at the Sustainable Supercomputing workshop:
I've written about this before and I'll write about it again: FLOPS/Watt and PUE are not meaningful metrics by themselves when talking about sustainability. A PUE of 1.01 is not helpful if the datacenter that achieves it relies on burning coal for its power. Conversely, a PUE of 1.5 is not bad if all that electricity comes from a zero-carbon energy source. The biggest issue that I saw being reinforced at SC this year is that claims of "sustainable HPC" are accompanied by the subtext of "as long as I can keep doing everything else the way I always have."
There were glimmers of hope, though. Maciej Cytowski from Pawsey presented the opening talk at the Sustainable Supercomputing workshop, and he led with the right thing--he acknowledged that 60% of the fuel mix that powers Pawsey's supercomputers comes from burning fossil fuels:
Rather than patting himself on the back about a low PUE, Dr. Cytowski described how they built their datacenter atop a large aquifer from which they draw water at 21°C and return it at 30°C to avoid using energy-intensive chillers. To further reduce the carbon impact of this water loop, Pawsey also installed over 200 kW of solar panels on its facility roof to power the water pumps. Given that Pawsey cannot relocate to somewhere with a higher ratio of zero-carbon energy on account of its need to be physically near the Square Kilometer Array, Cytowski's talk felt like the most substantive discussion on sustainability in HPC that week.
Most other talks and panels on the topic really wanted to equate "sustainability" with "FLOPS per Watt" and pretend that where one deploys a supercomputer is not part of the sustainability discussion. The reality is that, if the HPC industry wanted to take sustainability seriously, it would talk less about watts and more about tons of CO2. Seeing as the average watt of electricity in Tennessee produces 2.75x more carbon than a watt of electricity in Washington, the actual environmental impact of fine-tuning Slurm scheduling or fiddling with CPU frequencies is meaningless when compared to the benefits that would be gained by deploying that supercomputer next to a hydroelectric dam instead of a coal-fired power plant.
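To put some rough numbers on that argument, here's a hedged back-of-envelope comparison. The grid carbon intensity below is an assumed round number rather than an actual Tennessee or Washington figure; only the 2.75x ratio comes from the sentence above.

```python
# Hypothetical comparison of siting vs. tuning. The 0.40 tCO2/MWh figure is an
# assumed round number for a fossil-heavy grid; the 2.75x ratio is quoted above.
system_mw = 30                    # a machine in the same power class as El Capitan's HPL run
energy_mwh = system_mw * 8760     # ~263,000 MWh per year of continuous operation

dirty_grid = 0.40                 # assumed tCO2 per MWh on a fossil-heavy grid
clean_grid = dirty_grid / 2.75    # the 2.75x cleaner grid from the text

dirty_tons = energy_mwh * dirty_grid   # ~105,000 tCO2/year
clean_tons = energy_mwh * clean_grid   # ~38,000 tCO2/year

print(f"10% efficiency tuning saves ~{0.10 * dirty_tons:,.0f} tCO2/year")
print(f"relocating to the cleaner grid saves ~{dirty_tons - clean_tons:,.0f} tCO2/year")
```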
I say all this because there are parts of the HPC industry (namely, the part in which I work) that are serious about sustainability. And those conversations go beyond simply building supercomputers in places where energy is low-carbon (thereby reducing Scope 2 emissions). They include holding suppliers to high standards on reducing the carbon impact of transporting people and material to these data centers, reducing the carbon impact of all the excess packaging that accompanies components, and being accountable for the impact of everything in the data center after it reaches end of life (termed Scope 3 emissions).
The HPC community--or more precisely, the scientific computing community--is still married to the idea that the location of a supercomputer is non-negotiable and that "sustainability" is a nice-to-have secondary goal. I was hoping that the sessions I attended on sustainability would approach this topic at the level where the non-scientific HPC world has been living. Unfortunately, the discussion at SC24, which spanned workshops, BOFs, and the Green500, remains largely stuck on the idea that PUE and FLOPS/Watt are the be-all, end-all sustainability metrics. Those metrics are important, but there are global optimizations that have much greater effects on reducing the environmental impact of the HPC industry.
Another area where \"HPC\" was revealed to really mean \"scientific computing\" was in the topic of AI. I sat in on a few BOFs and panels around AI topics to get a feel for where this community is in adopting AI for science, but again, I found the level of discourse to degrade to generic AI banter despite the best efforts of panelists and moderators. For example, I sat in the \"Foundational Large Language Models for High-Performance Computing\" BOF session, and Jeff Vetter very clearly defined what a \"foundational large language model\" was at the outset so we could have a productive discussion about their applicability in HPC (or, really, scientific computing):
The panelists did a good job of outlining their positions. On the upside, LLMs are good for performing source code conversion, documenting and validating code, and maximizing continuity in application codes that get passed around as graduate students come and go. On the downside, they have a difficult time creating efficient parallel code, and they struggle to debug parallel code. And that's probably where the BOF should have stopped, because LLMs, as defined at the outset of the session, don't actually have a ton of applicability in scientific computing. But as soon as the session opened up to audience questions, the session went off the rails.
The first question from the audience was extremely basic and nonspecific: "Is AI a bubble?"
It's fun to ask provocative questions to a panel of experts. I get it. But the question had nothing to do with LLMs, any of the position statements presented by panelists, or even HPC or scientific computing. It turned a BOF on "LLMs for HPC" into a BOF that might as well have been titled "Let's just talk about AI!" A few panelists tried to get things back on track by talking about the successes of surrogate models to simulate physical processes, but this reduced the conversation to a point where "LLMs" really meant "any AI model" and "HPC" really meant "scientific simulations."
Perhaps the most productive statement to come out of that panel was when Rio Yokota asserted that "we" (the scientific community) should not train their own LLMs, because doing so would be "unproductive for science." But I, as well as anyone who understands the difference between LLMs and "AI," already knew that. And the people who don't understand the difference between an LLM and a surrogate model probably didn't pick up on Dr. Yokota's statement, so I suspect the meaning of his contribution was completely lost.
Walking out of that BOF (and, frankly, the other AI-themed BOFs and panels I attended), I was disappointed at how superficial the conversation was. This isn't to say these AI sessions were objectively bad; rather, I think it reflects the general state of understanding of AI amongst SC attendees. Or perhaps it reflects the demographic that is drawn to these sorts of sessions. If the SC community is not ready to have a meaningful discussion about AI in the context of HPC or scientific computing, attending BOFs with like-minded peers is probably a good place to begin getting immersed.
But what became clear to me this past week is that SC BOFs and panels with "AI" in their title aren't really meant for practitioners of AI. They're meant for scientific computing people who are beginning to dabble in AI.
I was invited to sit on a BOF panel called "Artificial Intelligence and Machine Learning for HPC Workload Analysis," following on from a successful BOF in which I participated at ISC24. The broad intent was to have a discussion around the tools, methods, and neat ideas that HPC practitioners have been using to better understand workloads, and each of us panelists was tasked with talking about a project or idea we had in applying AI/ML to improve some aspect of workloads.
What emerged from our lightning talks is that applying AI for operations--in this case, understanding user workloads--is nascent. Rather than talking about how we use AI to affect how we design or operate supercomputers, all of us seemed to focus more on how we are collecting data and beginning to analyze that data using ML techniques. And maybe that's OK, because AI won't ever do anything for workload characterization until you have a solid grasp of the telemetry you can capture about those workloads in the first place.
But when we opened the BOF up to discussion with all attendees, despite having a packed room, the audience had very little to offer. Our BOF lead, Kadidia Konaté, tried to pull discussion out of the room on a couple of different fronts by asking what tools people were using, what challenges they were facing, and things along those lines. However, it seemed to me that the majority of the audience was in that room as spectators; they didn't know where to start applying AI towards understanding the operations of supercomputers. Folks attended to find out the art of the possible, not to talk about their own challenges.
As such, the conversation wound up bubbling back up to the safety of traditional topics in scientific computing--how is LDMS working out, how do you deal with data storage challenges of collecting telemetry, and all the usual things that monitoring and telemetry folks worry about. It's easy to talk about the topics you understand, and just as the LLM conversation reverted back to generic AI for science and the sustainability topic reverted back to FLOPS/Watt, this topic of AI for operations reverted back to standard telemetry collection.
Despite the pervasive belief at SC24 that \"HPC\" and \"scientific computing\" are the same thing, there are early signs that the leaders in the community are coming to terms with the reality that there is now a significant amount of leadership HPC happening outside the scope of the conference. This was most prominent at the part of the Top500 BOF where Erich Strohmaier typically discusses trends based on the latest publication of the list.
In years past, Dr. Strohmaier's talk was full of statements that strongly implied that, if a supercomputer is not listed on Top500, it simply does not exist. This year was different though: he acknowledged that El Capitan, Frontier, and Aurora were "the three exascale systems we are aware of," now being clear that there is room for exascale systems to exist that simply never ran HPL, or never submitted HPL results to Top500. He explicitly acknowledged again that China has stopped making any Top500 submissions, and although he didn't name them outright, he spent a few minutes dancing around "hyperscalers" who have been deploying exascale-class systems such as Meta's H100 clusters (2x24K H100), xAI's Colossus (100K H100), and the full system behind Microsoft's Eagle (14K H100 is a "tiny fraction").
Strohmaier did an interesting analysis that estimated the total power of the Top500 list's supercomputers so he could compare it to industry buzz around hyperscalers building gigawatt-sized datacenters:
It was a fun analysis where he concluded that there are between 500 and 600 megawatts of supercomputers on the Top500 list, and after you factor in storage, PUE, and other ancillary power draws, the whole Top500 list sums up to what hyperscalers are talking about sticking into a single datacenter facility.
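Here's a rough reconstruction of that back-of-envelope comparison. Aside from the 500-600 MW estimate quoted above, the overhead factors are my own assumptions, not Strohmaier's numbers.

```python
# Rough re-creation of the "Top500 vs. one hyperscale campus" comparison.
# Only the 500-600 MW compute estimate comes from the talk; the overheads are assumptions.
top500_compute_mw = 550        # midpoint of the 500-600 MW estimate for listed systems
storage_and_ancillary = 1.15   # assume storage, networking, etc. add ~15%
pue = 1.3                      # assume a fairly typical facility overhead

total_mw = top500_compute_mw * storage_and_ancillary * pue
print(f"~{total_mw:.0f} MW")   # ~822 MW -- the same order of magnitude as a single
                               # "gigawatt-scale" hyperscale AI datacenter campus
```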
Although he didn't say it outright, I think the implication here is that the Top500 list is rapidly losing relevance in the broad HPC market, because a significant amount of the world's supercomputing capacity and capability are absent from the list. Although specific hyperscale supercomputers (like Meta's, xAI's, and Microsoft's) were not mentioned outright, their absence from the Top500 list suggests that this list might already be more incomplete than it is complete--the sum of the FLOPS or power on the Top500 supercomputers may be less than the sum of the giant supercomputers which are known but not listed. This will only get worse as the AI giants keep building systems every year while the government is stuck on its 3-5 year procurement cycles.
It follows that the meaning of the Top500 is sprinting towards a place where it is not representative of HPC so much as it is representative of the slice of HPC that serves scientific computing. Erich Strohmaier was clearly aware of this in his talk this year, and I look forward to seeing how the conversation around the Top500 list continues to morph as the years go on.
My career was started at an NSF HPC center and built up over my years in the DOE, so I feel like I owe a debt to the people who provided all the opportunities and mentorship that let me get to the place of privilege in the hyperscale/AI industry that I now enjoy. As a result, I find myself still spending a lot of my free time thinking about the role of governments in the changing face of HPC (as evidenced by my critiques of thinktank reports and federal RFIs...) and trying to bridge the gap in technical understanding between my old colleagues (in DOE, NSF, and European HPC organizations) and whatever they call what I work on now (hyperscale AI?).
To that end, I found myself doing quite a bit of business development (more on this later) with government types since I think that is where I can offer the most impact. I used to be government, and I closely follow the state of their thinking in HPC, but I also know what's going on inside the hyperscale and AI world. I also have enough context in both areas to draw a line through all the buzzy AI press releases to demonstrate how the momentum of private-sector investment in AI might affect the way national HPC efforts do business. So, I did a lot of talking to both my old colleagues in DOE and their industry partners in an attempt to help them understand how the hyperscale and AI industry thinks about infrastructure, and what they should expect in the next year.
More importantly though, I also sat in on a couple of NSF-themed BOFs to get a better understanding of where their thinking is, where NAIRR is going, how the NSF's strategy contrasts with DOE's strategy, and where the ambitions of the Office of Advanced Cyberinfrastructure might intersect with the trajectory of hyperscale AI.
What I learned was that NSF leadership is aware of everything that the community should be concerned about: the growth of data, the increasing need for specialized silicon, the incursion of AI into scientific computing, new business models and relationships with industry, and broadening the reach of HPC investments to be globally competitive. But beyond that, I struggled to see a cohesive vision for the future of NSF-funded supercomputing.
A BOF with a broad range of stakeholders probably isn't the best place to lay out a vision for the future of NSF's HPC efforts, and perhaps NSF's vision is best expressed through its funding opportunities and awards. Whichever the case may be, it seems like the NSF remains on a path to make incremental progress on a broad front of topics. Its Advanced Computing Systems and Services (ACSS) program will continue to fund the acquisition of newer supercomputers, and a smorgasbord of other research programs will continue funding efforts across public access to open science, cybersecurity, sustainable software, and other areas. My biggest concern is that peanut-buttering funding across such a broad portfolio will make net forward progress much slower than taking big bets. Perhaps big bets just aren't in the NSF's mission though.
NAIRR was also a topic that came up in every NSF-themed session I attended, but again, I didn't get a clear picture of the future. Most of the discussion that I heard was around socializing the resources that are available today through NAIRR, suggesting that the pilot's biggest issue is not a lack of HPC resources donated by industry, but awareness that NAIRR is a resource that researchers can use. This was reinforced by a survey whose results were presented in the NAIRR BOF:
It seems like the biggest challenge facing the NSF community relying on NAIRR (which has its own sample bias) is that they don't really know where to start even though they have AI resources (both GPUs and model API services) at their disposal. In a sense, this is a great position for the NSF since
However, it also means that there's not a clear role for partnership with many industry players beyond donating resources to the NAIRR pilot today in the hopes of selling resources to the full NAIRR tomorrow. I asked what OAC leadership thought about moving beyond such a transactional relationship between NSF and industry at one of the BOFs I attended, and while the panelists were eager to explore specific answers to that question, I didn't hear any ideas that would approach some sort of truly equitable partnership where both parties contributed in-kind.
I also walked away from these NSF sessions struck by how different the NSF HPC community's culture is from that of the DOE. NSF BOF attendees seemed focused on getting answers and guidance from NSF leadership, unlike the typical DOE gathering, where discussions often revolve around attendees trying to shape priorities to align with their own agendas. A room full of DOE people tends to feel like everyone thinks they're the smartest person there, while NSF gatherings appear more diverse in the expertise and areas of depth of its constituents. Neither way is inherently better or worse, but it will make the full ambition of NAIRR (as an inter-agency collaboration) challenging to navigate. This is particularly relevant as DOE is now pursuing its own multi-billion-dollar AI infrastructure effort, FASST, that appears to sidestep NAIRR.
There's no better way to figure out what's going on in the HPC industry than walking the exhibit floor each year, because booths cost money and reflect the priorities (and budgets) of all participants. This year's exhibit felt physically huge, and walking from one end to the other was an adventure. You can get a sense of the scale from this photo I took during the opening gala:
Despite having almost 18,000 registrants and the opening gala usually being a crush of people, the gala this year felt and looked very sparse just because people and booths were more spread out. There was also a perceptibly larger number of splashy vendors who had historically never attended before and were promoting downstream HPC technologies like data center cooling and electrical distribution, and there was healthy speculation online about whether the hugeness of the exhibit this year was due to these new power and cooling companies.
To put these questions to rest, I figured out how to yank down all the exhibitor metadata from the conference website so I could do some basic analysis on it.
The easiest way to find the biggest companies to appear this year was to compare the exhibitor list and booth sizes from SC23 to this year and see whose booth went from zero to some big square footage.
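The comparison itself is only a few lines of pandas. The sketch below is a hypothetical reconstruction rather than the actual notebook (which is linked at the end of this section), and the file and column names are my own assumptions; the same merge also produces the growth and shrinkage rankings discussed later.

```python
import pandas as pd

# Hypothetical sketch: compare booth square footage between the two years.
# File and column names ("exhibitor", "sqft") are assumptions, not the real schema.
sc23 = pd.read_csv("sc23_exhibitors.csv")
sc24 = pd.read_csv("sc24_exhibitors.csv")

both = sc24.merge(sc23, on="exhibitor", how="outer",
                  suffixes=("_sc24", "_sc23")).fillna(0)

# Brand-new exhibitors: zero square footage at SC23, something at SC24
new = both[(both["sqft_sc23"] == 0) & (both["sqft_sc24"] > 0)]
print(new.nlargest(20, "sqft_sc24")[["exhibitor", "sqft_sc24"]])

# Growth (or shrinkage) among returning exhibitors
both["delta"] = both["sqft_sc24"] - both["sqft_sc23"]
print(both.nlargest(20, "delta")[["exhibitor", "sqft_sc23", "sqft_sc24", "delta"]])
```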
I only took the top twenty new vendors, but they broadly fall into a couple of categories:
There were a couple other companies that must've just missed last SC but aren't new to the show (NetApp, Ansys, Samsung, Micron, Broadcom). And curiously, only one new GPU-as-a-Service provider (Nebius) showed up this year, suggesting last year was the year of the GPU Cloud.
But to confirm what others had speculated: yes, a significant amount of the new square footage of the exhibit floor can be attributed to companies focused on power and cooling. This is an interesting indicator that HPC is becoming mainstream, largely thanks to AI demanding ultra-high density of power and cooling. But it's also heartening to see a few new exhibitors in higher education making an appearance. Notably, SCRCC (the South Carolina Research Computing Consortium) is a partnership between Clemson, the University of South Carolina, and Savannah River National Laboratory that just formed last year, and I look forward to seeing what their combined forces can bring to bear.
We can also take a look at whose booths grew the most compared to SC23:
This distribution is much more interesting, since the top 20 exhibitors who grew their footprint comprise the majority of the growth in existing exhibitors. Cherry-picking a few interesting growers:
It's also interesting to see HLRS, the German national HPC center, grow so significantly. I'm not sure what prompted such a great expansion, but I take it to mean that things have been going well there.
Finally, Dell had a massive booth and showing this year. Not only did they grow the most since SC23, but they had the single largest booth on the exhibit floor at SC24. This was no doubt a result of their great successes in partnering with NVIDIA to land massive GPU buildout deals at places like xAI and CoreWeave. They also had "AI factory" messaging emblazoned all over their marketing material and debuted a nice 200 kW liquid-cooled rack that will be the basis for their GB200 NVL72 solution, clearly leaning into the idea that they are leaders in AI infrastructure. Despite this messaging being off-beat for the SC audience as I've described earlier, their booth was surprisingly full all the time, and I didn't actually get a chance to get in there to talk to anyone about what they've been doing.
Equally interesting are the vendors who reduced their footprint at SC24 relative to SC23:
It's easy to read too much into any of these big shrinkers; while a reduction in booth size could suggest business hasn't been as good, it could equally mean that an exhibitor just went overboard at SC23 and downsized to correct this year. A few noteworthy exhibitors to call out:
Overall, almost twice as many vendors grew their booths as scaled back, so I'd caution anyone against trying to interpret any of this as anything beyond exhibitors right-sizing their booths after going all-in last year.
Finally, there are a handful of vendors who disappeared outright after SC23:
It is critical to point out that the largest booths to vanish outright were all on the smaller side: SUSE, Tenstorrent, and Symbiosys Alliance all disappeared this year, but their booths last year were only 20x30. I was surprised to see that Tenstorrent and Arm didn't have booths, but the others are either companies I haven't heard of (suggesting the return on investment of showing at SC might've been low), are easy to rationalize as only being HPC-adjacent (such as SNIA and DigitalOcean), or simply went bankrupt in the last year.
As we say at the business factory, the net-net of the exhibit hall this year is that the square footage of booth space increased by 15,000 square feet, so it was in fact bigger, it did take longer to walk from one end to the other, and there definitely were a bunch of new power and cooling companies filling out the space. Some exhibitors shrank or vanished, but the industry as a whole appears to be moving in a healthy direction.
And if you're interested in analyzing this data more yourself, please have a look at the data and the Jupyter notebook I used to generate the above treemaps on GitHub. If you discover anything interesting, please write about it and post it online!
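And if you want a head start, a treemap like the ones above only takes a few lines; this is a hypothetical sketch using the squarify package rather than the actual notebook linked above, with assumed file and column names.

```python
import pandas as pd
import matplotlib.pyplot as plt
import squarify  # pip install squarify

# Hypothetical sketch: draw a booth-area treemap for the largest SC24 exhibitors.
# File and column names are assumptions; see the linked notebook for the real thing.
sc24 = pd.read_csv("sc24_exhibitors.csv")
top = sc24.nlargest(30, "sqft")

squarify.plot(sizes=top["sqft"], label=top["exhibitor"], alpha=0.8)
plt.axis("off")
plt.show()
```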
As an AI infrastructure person working for a major cloud provider, I kept an eye out for all the companies trying to get into the GPU-as-a-Service game. I described these players last year as "pure-play GPU clouds," and it seems like the number of options available to customers who want to go this route is growing. But I found it telling that a lot of them had booths that were completely indistinguishable from each other. Here's an example of one:
As best I can tell, these companies are all NVIDIA preferred partners with data centers and a willingness to deploy NVIDIA GPUs, NVIDIA SmartNICs, and the NVIDIA cloud stack, and sell multi-year commitments to consume those GPUs. I tried to accost some of these companies' booth staff to ask them my favorite question ("What makes you different from everyone else?"), but most of these companies' booths were staffed by people more interested in talking to each other than to me.
These GPUaaS providers tend to freak me out because, as Microsoft's CEO recently stated, these companies are often "just a bunch of tech companies still using VC money to buy a bunch of GPUs." I can't help but feel like this is where the AI hype will come back to bite companies who have chosen to build houses upon sand. Walking the SC24 exhibit floor is admittedly a very narrow view of this line of business, but it seemed like some of these companies were content to buy up huge booths, hang a pretty banner above them, and otherwise leave the booth empty of anything beyond a few chairs and some generic value propositions. I didn't feel a lot of hunger or enthusiasm from these companies despite the fact that a bunch of them have hundreds of millions of dollars of GPUs effectively sitting on credit cards that they are going to have to make payments on for the next five years.
That all said, not all the companies in the GPUaaS space are kicking back and letting the money pour in. In particular, I spent a few minutes chatting up someone at the CoreWeave booth, and I was surprised to hear about how much innovation they're adding on top of their conventional GPUaaS offering. For example, they developed Slurm on Kubernetes (SUNK) with one of their key customers to close the gap between CoreWeave exposing its GPU service through Kubernetes and the fact that many AI customers have built their stacks around Slurm, pyxis, and enroot.
In a weird twist of fate, I later ran into an old acquaintance who turned out to be one of the key CoreWeave customers for whom SUNK was developed. He commented that SUNK is the real deal and does exactly what his users need which, given the high standards that this person has historically had, is a strong affirmation that SUNK is more than just toy software that was developed and thrown on to GitHub for an easy press release. CoreWeave is also developing some interesting high-performance object storage caching software, and all of these software services are provided at no cost above whatever customers are already paying for their GPU service.
I bring this up because it highlights an emerging distinction in the GPUaaS market, which used to be a homogenous sea of bitcoin-turned-AI providers. Of course, many companies still rely on that simple business model: holding the bill for rapidly depreciating GPUs that NVIDIA sells and AI startups consume. However, there are now GPUaaS providers moving up the value chain by taking on the automation and engineering challenges that model developers don't want to deal with. Investing in uncertain projects like new software or diverse technology stacks is certainly risky, especially since they may never result in enough revenue to pay for themselves. But having a strong point of view, taking a stance, and investing in projects that you feel are right deserves recognition. My hat is off to the GPUaaS providers who are willing to take these risks and raise the tide for all of us rather than simply sling NVIDIA GPUs to anyone with a bag of money.
As much as I enjoy increasing shareholder value, the part of SC that gives me the greatest joy is reconnecting with the HPC community. Knowing I'll get to chat with my favorite people in the industry (and meet some new favorite people!) makes the long plane rides, upper respiratory infections, and weird hotel rooms completely worth it.
I wound up averaging under six hours of sleep per night this year in large part because 9pm or 7am were often the only free times I had to meet with people I really wanted to see. I have this unhealthy mindset where every hour of every day, from the day I land to the day I leave, is too precious to waste, and it's far too easy for me to rationalize that spending an hour talking to someone interesting is worth losing an hour of sleep.
But like I said at the outset of this blog post, this year felt different for a few reasons, and a lot of them revolve around the fact that I think I'm getting old. Now, it's always fun to say "I'm getting old" in a mostly braggadocious way, but this feeling manifested in concrete ways that affected the way I experienced the conference:
If you read this all and think "boo hoo, poor Glenn is too popular and wise for his own good," yeah, I get it. There are worse problems to have. But this was the first year where I felt like what I put into the conference was greater than what I got out of it. Presenting at SC used to be at least as good for my career as it was useful for my audiences, but it just doesn't count for much given my current role and career stage. It felt like some of the magic was gone this year in a way I've never experienced before.
As the years have gone on, I spend an increasing amount of my week having one-on-one conversations instead of wandering aimlessly. This year though, I came to SC without really having anything to buy or sell:
Much to my surprise though, a bunch of my old vendor/partner colleagues still wanted to get together to chat this year. Reflecting back, I was surprised to realize that it was these conversations--not the ones about business--that were the most fulfilling this year.
I learned about people's hobbies, families, and their philosophies on life, and it was amazing to get to know some of the people behind the companies with whom I've long dealt. I was reminded that the person is rarely the same as the company, and even behind some of the most aggressive and blusterous tech companies are often normal people with the same concerns and moments of self-doubt that everyone else has. I was also reminded that good engineers appreciate good engineering regardless of whether it's coming from a competitor or not. The public persona of a tech exec may not openly admire a competitor's product, but that doesn't mean they don't know good work when they see it.
I also surprised a colleague whose career has been in the DOE labs with an anecdote that amounted to the following: even though two companies may be in fierce competition, the people who work for them don't have to be. The HPC community is small enough that almost everyone has got a pal at a competing company, and when there are deals to be made, people looove to gossip. If one salesperson hears a juicy rumor about a prospective customer, odds are that everyone else on the market will hear about it pretty quickly too. Of course, the boundaries of confidentiality and professionalism are respected when it matters, but the interpersonal relationships that are formed between coworkers and friends don't suddenly disappear when people change jobs.
And so, I guess it would make sense that people still want to talk to me even though I have nothing to buy or sell. I love trading gossip just as much as everyone else, and I really enjoyed this aspect of the week.
I also spent an atypically significant amount of my week talking to early career people in HPC who knew of me one way or another and wanted career advice. This is the first year I recall having the same career conversations with multiple people, and this new phase of my life was perhaps most apparent during the IEEE TCHPC/TCPP HPCSC career panel in which I was invited to speak this year.
It was an honor to be asked to present on a career panel, but I didn't feel very qualified to give career advice to up-and-coming computer science graduate students who want to pursue HPC. I am neither a computer scientist nor a researcher, but fortunately for me, my distinguished co-panelists (Drs. Dewi Yokelson, Olga Pearce, YJ Ji, and Rabab Alomairy) had plenty of more relevant wisdom to share. And at the end of the panel, there were a few things we all seemed to agree on as good advice:
In both this panel and the one-on-one conversations I had with early career individuals, the best I could offer was the truth: I never had a master plan that got me to where I am; I just try out new things until I realize I don't like doing them anymore. I never knew what I wanted to be when I grew up, and I still don't really, so it now makes me nervous that people have started approaching me with the assumption that I've got it all figured out. Unless I torpedo my career and go live on a goat farm though, maybe I should prepare for this to be a significant part of my SC experiences going forward.
One last, big change in the community aspect of SC this year was the mass-migration of a ton of HPC folks from Twitter to Bluesky during the week prior to the conference. I don't really understand what prompted it so suddenly; a few of us have been trying for years to get some kind of momentum on other social platforms like Mastodon, but the general lack of engagement meant that all the excitement around SC always wound up exclusively on Twitter. This year was different though, and Bluesky hit critical mass with the HPC community.
I personally have never experienced an SC conference without Twitter; my first SC was in 2013, and part of what made that first conference so exciting was being able to pull up my phone and see what other people were seeing, thinking, and doing across the entire convention center via Twitter. Having the social media component to the conference made me feel like I was a part of something that first year, and as the years went on, Twitter became an increasingly indispensable part of the complete SC experience for me.
This year, though, I decided to try an experiment and see what SC would be like if I set Twitter aside and invested my time into Bluesky instead.
The verdict? It was actually pretty nice.
It felt a lot like the SC13 days, where my day ended and began with me popping open Bluesky to see what new #SC24 posts were made. And because many of the tech companies and HPC centers hadn't yet made it over, the hashtag wasn't clogged up by a bunch of prescheduled marketing blasts that buried the posts written by regular old conference attendees who were asking important questions:
Which booths at #sc24 have coffee? I noticed oracle do. Anyone else?
— Mike Croucher (@walkingrandomly.bsky.social) November 18, 2024 at 3:02 PM
Of course, I still clogged Bluesky up with my nonsense during the week, but there was an amazing amount of engagement by a diversity of thoughtful people--many who came from Twitter, but some whose names and handles I didn't recognize.
The volume of traffic on Bluesky during the week did feel a little lower than what it had been on Twitter in years past though. I also didn't see as many live posts of technical sessions as they happened, so I couldn't really tell whether I was missing something interesting in real time. This may have contributed to why I felt a little less connected to the pulse of the conference this year than I had in the past. It also could've been the fact that the conference was physically smeared out across a massive space though; the sparsity of the convention center was at least on par with the sparsity on Bluesky.
At the end of the week, I didn't regret the experiment. In fact, I'll probably be putting more effort into my Bluesky account than my Twitter account going forward. To be clear though, this isn't a particularly political decision on my part, and I pass no judgment on anyone who wants to use one platform over the other. It's just that I like the way I feel when I scroll through my Bluesky feeds, and I don't get that same feeling when I use Twitter.
SC this year was a great conference by almost every measure, as it always is, but it still felt a little different for me. I'm sure that some of that feeling is the result of my own growth, and my role with respect to the conference seems to be evolving from someone who gets a lot out of the conference to someone who is giving more to the conference. That's not to say that I don't get a lot out of it, though; I had no shortage of wonderful interactions with everyone from technology executives to rising stars who are early in their career, and I learned a lot about both them and me as whole people. But SC24, more than any SC before it, is when I realized this change was happening.
On the technological front, we saw the debut of a new #1 system (emblazoned with the smiling face of Bronis...) and a growing crop of massive, new clusters deployed for commercial applications. The exhibit floor was quantitatively bigger, in large part due to new power and cooling companies who are suddenly relevant to the HPC world thanks to the momentum of AI. At the same time, the SC technical program is clearly separating itself out as a conference focused on scientific computing; the level of discourse around AI remains largely superficial compared to true AI conferences, and the role of hyperscalers in the HPC industry is still cast more as a threat than an opportunity.
For my part, I'm still trying to get a grasp on where government agencies like DOE and NSF want to take their AI ambitions so I can try to help build a better mutual understanding between the scientific computing community and the hyperscale AI community. However, it seems like the NSF is progressing slowly on a wide front, while the DOE is doing what DOE does and charging headfirst into a landscape that has changed more than I think they realize.
There's a lot of technical content that I know I missed on account of the increasing time I've been spending on the people and community aspect of the conference, and I'm coming to terms with the idea that this just may be the way SC is from now on. And I think I'm okay with that, since the support of the community is what helped me go from being a bored materials science student into someone whose HPC career advice is worth soliciting in the short span of eleven years. Despite any or all of the cynicism that may come out in the things I say about this conference, SC is always the highlight of my year. I always go into it with excitement, gladly burn the candle at both ends all week, and fly home feeling both grateful for and humbled by everything the HPC community has done and continues to do to keep getting me out of bed in the morning.
", + "url": "https://hpc.social/personal-blog/2024/sc-24-recap/", + + + + + + "date_published": "2024-12-02T07:30:00-07:00", + "date_modified": "2024-12-02T07:30:00-07:00", + + "author": "Glenn K. Lockwood's Blog" + + }, + { "id": "https://hpc.social/personal-blog/2024/the-hpc-cluster-as-a-reflection-of-values/", "title": "The HPC cluster as a reflection of values", @@ -679,25 +698,6 @@ "author": "Mark Nelson's Blog" - }, - - { - "id": "https://hpc.social/personal-blog/2022/interesting-links-i-clicked-this-week/", - "title": "Interesting links I clicked this week", - "summary": null, - "content_text": "I watched several really interesting talks from SRECon22 Americas this week, and in particular I’d like to highlight:Principled Performance Analytics, Narayan Desai and Brent Bryan from Google. Some interesting thoughts on quantitative analysis of live performance data for monitoring and observability purposes, moving past simple percentile analysis.The ‘Success’ in SRE is Silent, Casey Rosenthal from Verica.io. Interesting thoughts here on the visibility of reliability, qualitative analysis of systems, and why regulation and certification might not be the right thing for web systems.Building and Running a Diversity-focused Pre-internship program for SRE, from Andrew Ryan at Facebook Meta. Some good lessons-learned here from an early-career internship-like program, in its first year.Taking the 737 to the Max, Nickolas Means from Sym. A really interesting analysis of the Boeing 737 Max failures from both a technical and cultural perspective, complete with some graph tracing to understand failure modes.I also ran across some other articles that I’ve been actively recommending and sharing with friends and colleagues, including:Plato’s Dashboards, Fred Hebert at Honeycomb. This article has some great analysis of how easily-measurable metrics are often poor proxies for the information we’re actually interested in, and discussing qualitative research methods as a way to gain more insight.The End of Roe Will Bring About A Sea Change In The Encryption Debate, Rianna Pfefferkorn from the Stanford Internet Observatory. You should absolutely go read this article, but to sum up: Law enforcement in states than ban abortion is now absolutely part of the threat model that encrypted messaging defends against. No one claiming to be a progressive should be arguing in favor of “exceptional access” or other law enforcement access to encryption.", - "content_html": "I watched several really interesting talks from SRECon22 Americas this week, and in particular I’d like to highlight:
I also ran across some other articles that I’ve been actively recommending and sharing with friends and colleagues, including:
+The highlight of the Top500 list was undoubtedly the debut of El Capitan, Lawrence + Livermore National Laboratory's massive new MI300A-based exascale supercomputer. Its 1.74 EF score resulted from a + 105-minute HPL run that came in under 30 MW, and a bunch of technical details about the system were disclosed by + Livermore Computing's CTO, Bronis de Supinski, during an invited talk during the Top500 BOF. Plenty of others + summarize the system's speeds and feeds (e.g., see The + Next Platform's article on El Cap), so I won't do that. However, I will comment on how unusual Bronis' talk + was.
+Foremost, the El Capitan talk seemed haphazard and last-minute. Considering the system took over half a decade of planning and cost at least half a + billion dollars, El Capitan's unveiling was the most unenthusiastic description of a brand-new #1 supercomputer I've + ever seen. I can understand that the Livermore folks have debuted plenty of novel #1 systems in their careers, but El + Capitan is objectively a fascinating system, and running a full-system job for nearly two hours across first-of-a-kind APUs + is an amazing feat. If community leaders don't get excited about their own groundbreaking achievements, what kind of message should the next generation of HPC professionals take home?
+In sharp contrast to the blasé announcement of this new system was the leading slide that was presented to describe the speeds and feeds of El Capitan:
I've never seen a speaker take the main stage and put a photo of himself literally in the center of the slide, in front of the supercomputer he's talking about. I don't know what the communications people at Livermore were trying to do with this graphic, but I don't think it was intended to be evocative of the first thing that came to my mind:
The supercomputer is literally named "The Captain," and there's a photo of one dude (the boss of Livermore Computing, who is also standing on stage giving the talk) blocking the view of the machine. It wasn't a great look, and it left me feeling very uneasy about what I was witnessing and what message it was sending to the HPC community.
+In case it needs to be said, HPC is a team sport. The unveiling of El Capitan (or any other #1 system + before it) is always the product of dozens, if not hundreds, of people devoting years of their professional lives to + ensuring it all comes together. It was a big miss, both to those who put in the work, and those who will have + to put in the work on future systems, to suggest that a single, smiling face comes before the success of the system deployment. +
+The other notable entrant to the Top 10 list was HPC6, an industry system deployed by Eni (a major Italian energy + company) built on MI250X. Oil and gas companies tend to be conservative in the systems they buy since the seismic + imaging done on their large supercomputers informs hundred-million to billion-dollar investments in drilling a new + well, and they have much less tolerance for weird architectures than federally funded leadership computing does. + Thus, Eni's adoption of AMD GPUs in this #5 system is a strong endorsement of their capability in mission-critical + commercial computing.
SoftBank, the Japanese investment conglomerate who, among other things, owns a significant stake in Arm, made its Top500 debut with two identical 256-node DGX H100 SuperPODs. While not technologically interesting (H100 is getting old), these systems represent a significant investment in HPC by private industry in Japan and signal that SoftBank is following the lead of large American investment groups in building private AI clusters for the AI startups in their portfolios. In doing this, SoftBank's investments aren't dependent on third-party cloud providers to supply the GPUs to make these startups successful, which reduces their overall risk.
+Although I didn't hear anything about these SoftBank systems at the conference, NVIDIA issued a press statement + during the NVIDIA AI Summit Japan during the week prior to SC24 that discussed SoftBank's + investment in large NVIDIA supercomputers. The press statement states that these systems will be used "for + [SoftBank's] own generative AI development and AI-related business, as well as that of universities, research + institutions and businesses throughout Japan." The release also suggests we can expect B200 and GB200 SuperPODs from + SoftBank to appear as those technologies come online.
Just below the SoftBank systems was the precursor system to Europe's first exascale system. I was hoping that JUPITER, the full exascale system being deployed at FZJ, would appear in the Top 10, but it seems like we'll have to wait for ISC25 for that. Still, the JETI system ran HPL across 480 nodes of BullSequana XH3000, the same node that will be used in JUPITER, and achieved 83 PFLOPS. By comparison, the full JUPITER system will be over 10x larger ("roughly 6000 compute nodes" in the Booster), and projecting the JETI run (173 TF/node) out to this full JUPITER scale indicates that JUPITER should just squeak over the 1.0 EFLOPS line.
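For what it's worth, here's that back-of-the-envelope projection spelled out as a quick sketch. The 83 PFLOPS and 480-node figures come from the JETI submission, while the 6,000-node count is only the "roughly" estimate for the JUPITER Booster, so treat the result as a rough bound rather than a prediction:

```python
# Rough projection of JUPITER's HPL score from the JETI precursor run.
# Inputs are the figures quoted above; the JUPITER node count is an estimate.
jeti_rmax_pflops = 83.0    # JETI's HPL result
jeti_nodes = 480           # BullSequana XH3000 nodes in the JETI run
jupiter_nodes = 6000       # "roughly 6000 compute nodes" in the JUPITER Booster

tf_per_node = jeti_rmax_pflops * 1000 / jeti_nodes          # ~173 TF per node
projected_eflops = tf_per_node * jupiter_nodes / 1_000_000  # ~1.04 EFLOPS

print(f"{tf_per_node:.0f} TF/node -> ~{projected_eflops:.2f} EFLOPS projected")
```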
+In preparation for JUPITER, Eviden had a couple of these BullSequana XH3000 nodes out on display this year:
+ + +And if you're interested in more, I've been tracking the technical details of JUPITER in my digital garden.
+Waay down the list was Microsoft's sole new Top500 entry this cycle, an NVIDIA H200 system that ran HPL over 120 ND + H200 v5 nodes in Azure. It was one of only two conventional (non-Grace) H200 clusters that appeared in the top 100, + and it had a pretty good efficiency (Rmax/Rpeak > 80%). Microsoft also had a Reindeer node on display at its + booth:
+ + + +An astute observer may note that this node looks an awful lot like the H100 node used in its Eagle supercomputer, + which was on display at SC23 last year. That's + because it's the same chassis, just with an upgraded HGX baseboard.
+Reindeer was not super exciting, and there were no press releases about it, but I mention it here for a couple + reasons:
The exhibit floor had a few new pieces of HPC technology on display this year that are worthy of mention, but a lot of the most exciting HPC-centric stuff actually had a soft debut at ISC24 in May. For example, even though SC24 was MI300A's big splash due to the El Capitan announcement, some MI300A nodes (such as the Cray EX255a) were on display in Hamburg. However, Eviden had their MI300A node (branded XH3406-3) on display at SC24, which was new to me:
+ + +I'm unaware of anyone who's actually committed to a large Eviden MI300A system, so I was + surprised to see that Eviden already has a full blade design. But as with Eni's HPC6 supercomputer, perhaps this is + a sign that AMD's GPUs (and now APUs) have graduated from being built-to-order science experiments to a technology + ecosystem that people will want to buy off the rack.
+There was also a ton of GH200 on the exhibit hall floor, but again, these node types were + also on display at ISC24. This wasn't a surprise since a bunch of upcoming European systems have invested in GH200 + already; in addition to JUPITER's 6,000 GH200 nodes described above, CSCS Alps has 2,688 GH200 nodes, and Bristol's Isambard-AI will have 1,362 GH200 + nodes. All of these systems will have a 1:1 CPU:GPU ratio and an NVL4 domain, suggesting this is the optimal way to + configure GH200 for HPC workloads. I didn't hear a single mention of GH200 NVL32.
+SC24 was the debut of NVIDIA's Blackwell GPU in the flesh, and a bunch of integrators had + material on GB200 out at their booths. Interestingly, they all followed the same pattern as GH200 with an NVL4 + domain size, and just about every smaller HPC integrator followed a similar pattern where
+ +From this, I gather that not many companies have manufactured GB200 nodes yet, or if they + have, there aren't enough GB200 boards available to waste them on display models. So, we had to settle for these + bare NVIDIA-manufactured, 4-GPU + 2-CPU superchip boards:
+ + +What struck me is that these are very large FRUs--if a single component (CPU, GPU, voltage + regulator, DRAM chip, or anything else) goes bad, you have to yank and replace four GPUs and two CPUs. And because + all the components are soldered down, someone's going to have to do a lot of work to remanufacture these boards to + avoid throwing out a lot of very expensive, fully functional Blackwell GPUs.
+There were a few companies who were further along their GB200 journey and had more + integrated nodes on display. The HPE Cray booth had this GB200 NVL4 blade (the Cray EX154n) on display:
+ + + +It looks remarkably sparse compared to the super-dense blades that normally slot into the + Cray EX line, but even with a single NVL4 node per blade, the Cray EX cabinet only supports 56 of these blades, + leaving 8 blade slots empty in the optimal configuration. I assume this is a limitation of power and cooling.
+The booth collateral around this blade suggested its use case is "machine learning and + sovereign AI" rather than traditional HPC, and that makes sense since each node has 768 GB of HBM3e which is enough + to support training some pretty large sovereign models. However, the choice to force all I/O traffic on to the + high-speed network by only leaving room for one piddly node-local NVMe drive (this blade only supports one SSD per + blade) will make training on this platform very sensitive to the quality of the global storage subsystem. This is + great if you bundle this blade with all-flash Lustre (like Cray ClusterStor) or DAOS (handy, since Intel divested the entire DAOS + development team to HPE). But it's not how I would build an AI-optimized system.
+I suspect the cost-per-FLOP of this Cray GB200 solution is much lower than what a pure-play + GB200 for LLM training would be. And since GB200 is actually a solid platform for FP64 (thanks to Dan Ernst for challenging me on this and sharing + some great resources on the topic), I expect to see this node do well + in situations that are not training frontier LLMs, but rather fine-tuning LLMs, training smaller models, and mixing + in traditional scientific computing on the same general-purpose HPC/AI system.
+Speaking of pure-play LLM training platforms, though, I was glad that very few exhibitors + were trying to talk up GB200 NVL72 this year. It may have been the case that vendors simply aren't ready to begin + selling NVL72 yet, but I like to be optimistic and instead believe that the exhibitors who show up to SC24 know that + the scientific computing community likely won't get enough value out of a 72-GPU coherence domain to justify the + additional cost and complexity of NVL72. I didn't see a single vendor with a GB200 NVL36 or NVL72 rack on display + (or a GH200 NVL32, for that matter), and not having to think about NVL72 for the week of SC24 was a nice break from + my day job.
+Perhaps the closest SC24 got to NVL72 was a joint announcement at the beginning of the week + by Dell and CoreWeave, who announced that they have begun bringing + GB200 NVL72 racks online. Dell did have a massive, AI-focused booth on the exhibit floor, and they did talk + up their high-powered, liquid-cooled rack infrastructure. But in addition to supporting GB200 with NVLink Switches, + I'm sure that rack infrastructure would be equally good at supporting nodes geared more squarely at traditional HPC. +
+HPE Cray also debuted a new 400G Slingshot switch, appropriately named Slingshot 400. I + didn't get a chance to ask anyone any questions about it, but from the marketing material that came out right before + the conference, it sounds like a serdes upgrade without any significant changes to Slingshot's L2 protocol.
+There was a Slingshot 400 switch for the Cray EX rack on display at their booth, and it + looked pretty amazing:
+ + +It looks way more dense than the original 200G Rosetta switch, and it introduces + liquid-cooled optics. If you look closely, you can also see a ton of flyover cables connecting the switch ASIC in + the center to the transceivers near the top; similar flyover cables are showing up in all manner of + ultra-high-performance networking equipment, likely reflecting the inability to maintain signal integrity across PCB + traces.
+The port density on Slingshot 400 remains the same as it was on 200G Slingshot, so there's + still only 64 ports per switch, and the fabric scale limits don't increase. In addition, the media is saying that + Slingshot 400 (and the GB200 blade that will launch with it) won't start appearing until "Fall + 2025." Considering 64-port 800G switches (like NVIDIA's SN5600 and Arista's + 7060X6) will have already been on the market by then though, Slingshot 400 will be launching with HPE Cray + on its back foot.
+However, there was a curious statement on the placard accompanying this Slingshot 400 + switch:
+ + +It reads, "Ultra Ethernet is the future, HPE Slingshot delivers today!"
+Does this suggest that Slingshot 400 is just a stopgap until 800G Ultra Ethernet NICs begin + appearing? If so, I look forward to seeing HPE Cray jam third-party 800G switch ASICs into the Cray EX liquid-cooled + form factor at future SC conferences.
+One of the weirder things I saw on the exhibit floor was a scale-out storage server built + on NVIDIA Grace CPUs that the good folks at WEKA had on display at their booth.
+ + + +Manufactured by Supermicro, this "ARS-121L-NE316R" server (really rolls off the tongue) + uses a two-socket Grace superchip and its LPDDR5X instead of conventional, socketed CPUs and DDR. The rest of it + seems like a normal scale-out storage server, with sixteen E3.S SSD slots in the front and four 400G ConnectX-7 or + BlueField-3 NICs in the back. No fancy dual-controller failover or anything like that; the presumption is that + whatever storage system you'd install over this server would implement its own erasure coding across drives and + servers.
+At a glance, this might seem like a neat idea for a compute-intensive storage system like + WEKA or DAOS. However, one thing that you typically want in a storage server is high reliability and repairability, + features which weren't the optimal design point for these Grace superchips. Specifically,
+ +On the upside, though, there might be a cost advantage to using this Grace-Grace server + over a beefier AMD- or Intel-based server with a bunch of traditional DIMMs. And if you really like NVIDIA products, + this lets you do NVIDIA storage servers to go with your NVIDIA network and NVIDIA compute. As long as your storage + software can work with the interrupt rates of such a server (e.g., it supports rebuild-on-read) and the 144 Neoverse + V2 cores are a good fit for its computational requirements (e.g., calculating complex erasure codes), this server + makes sense. But building a parallel storage system on LPDDR5X still gives me the willies.
+I could also see this thing being useful for certain analytics workloads, especially those + which may be upstream of LLM training. I look forward to hearing about where this turns up in the field.
+ +The last bit of new and exciting HPC technology that I noted came from my very own employer + in the form of HBv5, a new, monster four-socket node featuring custom-designed AMD CPUs with HBM. STH wrote up an article with + great photos of HBv5 and its speeds and feeds, but in brief, this single node has:
+ +The node itself looks kind of wacky as well, because there just isn't a lot on it:
+ + +There are the obvious four sockets of AMD EPYC 9V64H, each with 96 physical cores and 128 GB of HBM3, and giant heat + pipes on top of them since it's 100% air-cooled. But there's no DDR at all, no power converter board (the node is + powered by a DC bus bar), and just a few flyover cables to connect the PCIe add-in-card cages. There is a separate + fan board with just two pairs of power cables connecting to the motherboard, and that's really about it.
+The front end of the node shows its I/O capabilities which are similarly uncomplicated:
+ + + +There are four NDR InfiniBand cards (one localized to each socket) which are 400G-capable but cabled up at 200G, + eight E1.S NVMe drives, and a brand-new dual-port Azure Boost 200G NIC. Here's a close-up of the right third of the + node's front:
+ + +This is the first time I've seen an Azure Boost NIC in a server, and it looks +much better integrated than the previous-generation 100G Azure SmartNIC that put the FPGA and hard NIC on separate +boards connected by a funny little pigtail. This older 100G SmartNIC with pigtail was also on display at the Microsoft +booth in an ND MI300X v5 node:
And finally, although I am no expert in this new node, I spent all week hanging around the people who are, and I repeatedly heard them answer the same few questions:
+ +New technology announcements are always exciting, but one of the main reasons I attend + SC and ISC is to figure out the broader trends shaping the HPC industry. What concerns are top of mind for the + community, and what blind spots remain open across all the conversations happening during the week? Answering + these questions requires more than just walking the exhibit floor; it involves interpreting the subtext of the + discussions happening at panels and BOF sessions. However, identifying where the industry needs more information + or a clearer picture informs a lot of the public-facing talks and activities in which I participate throughout + the year.
+The biggest realization that I confirmed this week is that the SC conference is not an HPC + conference; it is a scientific computing conference. I sat in a few sessions where the phrase "HPC + workflows" was clearly a stand-in for "scientific workflows," and "performance evaluation" still really means "MPI + and OpenMP profiling." I found myself listening to ideas or hearing about tools that were intellectually + interesting but ultimately not useful to me because they + were so entrenched in the traditions of applying HPC to scientific computing. Let's talk about a few ways in which + this manifested.
+Take, for example, the topic of sustainability. There were talks, panels, papers, and BOFs + that touched on the environmental impact of HPC throughout the week, but the vast majority of them really weren't + talking about sustainability at all; they were talking about energy efficiency. These talks often use the following + narrative:
+ +The problem with this approach is that it declares victory when energy consumption is + reduced. This is a great result if all you care about is spending less money on electricity for your supercomputer, + but it completely misses the much greater issue that the electricity required to power an HPC job is often generated + by burning fossil fuels, and that the carbon emissions that are directly attributable to HPC workloads are + contributing to global climate change. This blind spot was exemplified by this slide, presented during a talk titled + "Towards Sustainable Post-Exascale Leadership Computing" at the Sustainable Supercomputing workshop:
+ + +I've written about + this before and I'll write about it again: FLOPS/Watt and PUE are not + meaningful metrics by themselves when talking about sustainability. A PUE of 1.01 is not helpful if the datacenter + that achieves it relies on burning coal for its power. Conversely, a PUE of 1.5 is not bad if all that electricity + comes from a zero-carbon energy source. The biggest issue that I saw being reinforced at SC this year is that + claims of "sustainable HPC" are accompanied by the subtext of "as long as I can keep doing everything else the way I + always have."
+There were glimmers of hope, though. Maciej Cytowski from Pawsey presented the opening talk + at the Sustainable Supercomputing workshop, and he led with the right thing--he acknowledged that 60% of + the fuel mix that powers Pawsey's supercomputers comes from burning fossil fuels:
Rather than patting himself on the back over a low PUE, Dr. Cytowski described how they built their datacenter atop a large aquifer from which they draw water at 21°C and return it at 30°C to avoid using energy-intensive chillers. To further reduce the carbon impact of this water loop, Pawsey also installed over 200 kW of solar panels on its facility roof to power the water pumps. Given the fact that Pawsey cannot relocate to somewhere with a higher ratio of zero-carbon energy on account of its need to be physically near the Square Kilometer Array, Cytowski's talk felt like the most substantive discussion on sustainability in HPC that week.
+Most other talks and panels on the topic really wanted to equate "sustainability" to "FLOPS + per Watt" and pretend like where one deploys a supercomputer is not a part of the sustainability discussion. The + reality is that, if the HPC industry wanted to take sustainability seriously, it would talk less about watts and + more about tons of CO2. Seeing as how the average watt of electricity in Tennessee produces 2.75x more carbon than a watt of electricity in Washington, + the actual environmental impact of fine-tuning Slurm scheduling or fiddling with CPU frequencies is meaningless when + compared to the benefits that would be gained by deploying that supercomputer next to a hydroelectric dam instead of + a coal-fired power plant.
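To make that concrete, here is a minimal sketch of the kind of math I wish more of these talks led with. The power figure and carbon intensities below are illustrative placeholders (chosen only to reflect the roughly 2.75x ratio mentioned above), not official grid data:

```python
# Illustrative only: convert a machine's average power draw into annual CO2 emissions
# under two hypothetical grid carbon intensities. The intensities are placeholders
# chosen to match the ~2.75x ratio mentioned above, not measured values.
avg_power_mw = 30.0                        # e.g., an exascale-class system under load
energy_mwh_per_year = avg_power_mw * 24 * 365

grid_kg_co2_per_mwh = {                    # hypothetical Scope 2 carbon intensities
    "hydro-heavy grid": 120,
    "fossil-heavy grid": 330,              # ~2.75x the hydro-heavy value
}

for grid, kg_per_mwh in grid_kg_co2_per_mwh.items():
    tonnes = energy_mwh_per_year * kg_per_mwh / 1000
    print(f"{grid}: ~{tonnes:,.0f} tonnes CO2 per year")
```

The point of the exercise is that the difference between the two lines of output dwarfs anything you can claw back by shaving a few points off PUE.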
+I say all this because there are parts of the HPC industry (namely, the part in which I work) + who are serious about sustainability. And those conversations go beyond simply building supercomputers in + places where energy is low-carbon (thereby reducing Scope 2 emissions). They + include holding suppliers to high standards on reducing the carbon impact of transporting people and material to + these data centers, reducing the carbon impact of all the excess packaging that accompanies components, and being + accountable for the impact of everything in the data center after it reaches end of life (termed Scope 3 emissions).
+The HPC community--or more precisely, the scientific computing community--is still married + to the idea that the location of a supercomputer is non-negotiable, and "sustainability" is a nice-to-have secondary + goal. I was + hoping that the sessions I attended on sustainability would approach this topic at a level where the + non-scientific HPC world has been living. Unfortunately, the discussion at SC24, which spanned workshops, BOFs, and + Green 500, remains largely stuck on the idea that PUE and FLOPS/Watt are the end-all sustainability metrics. Those + metrics are important, but there are global optimizations that have much greater effects on reducing the + environmental impact of the HPC industry.
+Another area where "HPC" was revealed to really mean "scientific computing" was in the + topic of AI. I sat in on a few BOFs and panels around AI topics to get a feel for where this community is in + adopting AI for science, but again, I found the level of discourse to degrade to generic AI banter despite the best + efforts of panelists and moderators. For example, I sat in the "Foundational Large Language Models for + High-Performance Computing" BOF session, and Jeff Vetter very clearly defined what a "foundational large language + model" was at the outset so we could have a productive discussion about their applicability in HPC (or, really, + scientific computing):
+ + + +The panelists did a good job of outlining their positions. On the upside, LLMs are good for + performing source code conversion, documenting and validating code, and maximizing continuity in application codes + that get passed around as graduate students come and go. On the downside, they have a difficult time creating + efficient parallel code, and they struggle to debug parallel code. And that's probably where the BOF should have + stopped, because LLMs, as defined at the outset of the session, don't actually have a ton of applicability in + scientific computing. But as soon as the session opened up to audience questions, the session went off the rails. +
+The first question was an extremely basic and nonspecific question: "Is AI a bubble?"
+It's fun to ask provocative questions to a panel of experts. I get it. But the question had + nothing to do with LLMs, any of the position statements presented by panelists, or even HPC or scientific computing. + It turned a BOF on "LLMs for HPC" into a BOF that might as well have been titled "Let's just talk about AI!" A few + panelists tried to get things back on track by talking about the successes of surrogate models to simulate physical + processes, but this reduced the conversation to a point where "LLMs" really meant "any AI model" and "HPC" really + meant "scientific simulations."
+Perhaps the most productive statement to come out of that panel was when Rio Yokota + asserted that "we" (the scientific community) should not train their own LLMs, because doing so would be + "unproductive for science." But I, as well as anyone who understands the difference between LLMs and "AI," already + knew that. And the people who don't understand the difference between an LLM and a surrogate model probably didn't + pick up on Dr. Yokota's statement, so I suspect the meaning of his contribution was completely lost.
+Walking out of that BOF (and, frankly, the other AI-themed BOFs and panels I attended), I + was disappointed at how superficial the conversation was. This isn't to say these AI sessions were objectively + bad; rather, I think it reflects the general state of understanding of AI amongst SC attendees. Or perhaps it + reflects the demographic that is drawn to these sorts of sessions. If the SC community is not ready to have a + meaningful discussion about AI in the context of HPC or scientific computing, attending BOFs with like-minded peers + is probably a good place to begin getting immersed. +
+But what became clear to me this past week is that SC BOFs and panels with "AI" in their + title aren't really meant for practitioners of AI. They're meant for scientific computing people who are beginning + to dabble in AI.
+I was invited to sit on a BOF panel called "Artificial Intelligence and Machine Learning + for HPC Workload Analysis" following on a successful BOF in which I participated at ISC24. The broad intent was to + have a discussion around the tools, methods, and neat ideas that HPC practitioners have been using to better + understand workloads, and each of us panelists was tasked with talking about a project or idea we had in applying + AI/ML to improve some aspect of workloads.
What emerged from the speakers' lightning talks is that applying AI to operations--in this case, understanding user workloads--is nascent. Rather than talking about how we use AI to affect how we design or operate supercomputers, all of us seemed to focus more on how we are collecting data and beginning to analyze that data using ML techniques. And maybe that's OK, because AI won't ever do anything for workload characterization until you have a solid grasp of the telemetry you can capture about those workloads in the first place.
But when we opened the BOF up to discussion with all attendees, despite having a packed room, the audience had very little to offer. Our BOF lead, Kadidia Konaté, tried to pull discussion out of the room on a couple of different fronts by asking what tools people were using, what challenges they were facing, and things along those lines. However, it seemed to me that the majority of the audience was in that room as spectators; they didn't know where to start applying AI towards understanding the operations of supercomputers. Folks attended to find out the art of the possible, not to talk about their own challenges.
+As such, the conversation wound up bubbling back up to the safety of traditional topics in + scientific computing--how is LDMS working out, how do you deal with data storage challenges of collecting telemetry, + and all the usual things that monitoring and telemetry folks worry about. It's easy to talk about the topics you + understand, and just as the LLM conversation reverted back to generic AI for science and the sustainability topic + reverted back to FLOPS/Watt, this topic of AI for operations reverted back to standard telemetry collection.
+Despite the pervasive belief at SC24 that "HPC" and "scientific computing" are the same thing, there are early signs + that the leaders in the community are coming to terms with the reality that there is now a significant amount of + leadership HPC happening outside the scope of the conference. This was most prominent at the part of the Top500 BOF + where Erich Strohmaier typically discusses trends based on the latest publication of the list.
In years past, Dr. Strohmaier's talk was full of statements that strongly implied that, if a supercomputer is not listed on Top500, it simply does not exist. This year was different though: he acknowledged that El Capitan, Frontier, and Aurora were "the three exascale systems we are aware of," making it clear that there is room for exascale systems to exist that simply never ran HPL, or never submitted HPL results to Top500. He explicitly acknowledged again that China has stopped making any Top500 submissions, and although he didn't name them outright, he spent a few minutes dancing around "hyperscalers" who have been deploying exascale-class systems such as Meta's H100 clusters (2x24K H100), xAI's Colossus (100K H100), and the full system behind Microsoft's Eagle (14K H100 is a "tiny fraction").
+Strohmaier did an interesting analysis that estimated the total power of the Top500 list's supercomputers so he could + compare it to industry buzz around hyperscalers building gigawatt-sized datacenters:
+ + +It was a fun analysis where he concluded that there are between 500-600 megawatts of supercomputers on the Top500 + list, and after you factor in storage, PUE, and other ancillary power sources, the whole Top500 list sums up to what + hyperscalers are talking about sticking into a single datacenter facility.
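If you want to reproduce that kind of estimate yourself, a minimal sketch might look like the following. It assumes you've exported the November 2024 list from top500.org to a local file; the filename and the "Power (kW)" column name are assumptions about that export rather than guarantees:

```python
# Rough reproduction of the "how many megawatts are on the Top500" estimate.
# Assumes a locally exported copy of the list; the filename and column name are
# guesses about the top500.org export, so adjust them to match your copy.
import pandas as pd

df = pd.read_csv("top500_202411.csv")         # hypothetical local export of the Nov 2024 list
reported_mw = df["Power (kW)"].sum() / 1000   # many systems don't report power at all
print(f"Reported HPL power across the list: ~{reported_mw:.0f} MW (unreported systems excluded)")
```

Note that this only captures power reported during HPL runs; storage, cooling overhead, and the systems that decline to report power are what push the estimate toward the 500-600 MW figure above.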
+Although he didn't say it outright, I think the implication here is that the Top500 list is rapidly losing relevance + in the broad HPC market, because a significant amount of the world's supercomputing capacity and capability + are absent from the list. Although specific hyperscale supercomputers (like Meta's, xAI's, and Microsoft's) were not + mentioned outright, their absence from the Top500 list suggests that this list might already be more incomplete than + it is complete--the sum of the FLOPS or power on the Top500 supercomputers may be less than the sum of the giant + supercomputers which are known but not listed. This will only get worse as the AI giants keep building systems every + year while the government is stuck on its 3-5 year procurement cycles.
+It follows that the meaning of the Top500 is sprinting towards a place where it is not representative of HPC so much + as it is representative of the slice of HPC that serves scientific computing. Erich Strohmaier was clearly + aware of this in his talk this year, and I look forward to seeing how the conversation around the Top500 list + continues to morph as the years go on.
+My career was started at an NSF HPC center and built up over my years in the + DOE, so I feel like I owe a debt to the people who provided all the opportunities and mentorship that let me + get to the place of privilege in the hyperscale/AI industry that I now enjoy. As a result, I find myself still + spending a lot of my free time thinking about the role of governments in the changing face of + HPC (as evidenced by my critiques of thinktank reports and federal RFIs...) and trying to bridge the gap + in technical understanding between my old colleagues (in DOE, NSF, and European HPC organizations) and whatever they + call what I work on now (hyperscale AI?).
+To that end, I found myself doing quite a bit of business development (more on this later) with government + types since I think that is where I can + offer the most impact. I used to be government, and I closely follow the state of their thinking in HPC, but I also + know what's going on inside the hyperscale and AI world. I also have enough context in both areas to draw a line + through all the buzzy AI press releases to demonstrate how the momentum of private-sector investment in AI might + affect the way national HPC + efforts do business. So, I did a lot of talking to both my old colleagues in DOE and their industry partners in an + attempt to help them understand how the hyperscale and AI industry thinks about infrastructure, and what they should + expect in the next year.
+More importantly though, I also sat in on a couple of NSF-themed BOFs to get a better understanding of where their + thinking is, where NAIRR is going, how the NSF's strategy contrasts with DOE's strategy, and where the ambitions of + the Office of Advanced Cyberinfrastructure might intersect with the trajectory of hyperscale AI.
+What I learned was that NSF leadership is aware of everything that the community should be concerned about: the + growth of data, the increasing need for specialized silicon, the incursion of AI into scientific computing, new + business models and relationships with industry, and broadening the reach of HPC investments to be globally + competitive. But beyond that, I struggled to see a cohesive vision for the future of NSF-funded + supercomputing.
+A BOF with a broad range of stakeholders probably isn't the best place to lay out a vision for the future of NSF's + HPC efforts, and perhaps NSF's vision is best expressed through its funding opportunities and awards. Whichever the + case may be, it seems like the NSF remains on a path to make incremental progress on a broad front of topics. Its + Advanced Computing Systems and Services (ACSS) program will continue to fund the acquisition of newer + supercomputers, and a smorgasbord of other research programs will continue funding efforts across public access to + open science, cybersecurity, sustainable software, and other areas. My biggest concern is that peanut-buttering + funding across such a broad portfolio will make net forward progress much slower than taking big bets. Perhaps big + bets just aren't in the NSF's mission though.
+NAIRR was also a topic that came up in every NSF-themed session I attended, but again, I didn't get a clear picture + of the future. Most of the discussion that I heard was around socializing the resources that are available today + through NAIRR, suggesting that the pilot's biggest issue is not a lack of HPC resources donated by industry, but + awareness that NAIRR is a resource that researchers can use. This was reinforced by a survey whose results were + presented in the NAIRR BOF:
It seems like the biggest challenge facing the NSF community relying on NAIRR (which has its own sample bias) is that they don't really know where to start, even though they have AI resources (both GPUs and model API services) at their disposal. In a sense, this is a great position for the NSF since
+ +However, it also means that there's not a clear role for partnership with many industry players beyond donating + resources to the NAIRR pilot today in the hopes of selling resources to the full NAIRR tomorrow. I asked what OAC + leadership thought about moving beyond such a transactional relationship between NSF and industry at one of the BOFs + I attended, and while the panelists were eager to explore specific answers to that question, I didn't hear any ideas + that would approach some sort of truly equitable partnership where both parties contributed in-kind.
+I also walked away from these NSF sessions struck by how different the NSF HPC community's culture is from that of + the DOE. NSF BOF attendees seemed focused on getting answers and guidance from NSF leadership, unlike the typical + DOE gathering, where discussions often revolve around attendees trying to shape priorities to align with their own + agendas. A room full of DOE people tends to feel like everyone thinks they're the smartest person there, while NSF + gatherings appear more diverse in the expertise and areas of depth of its constituents. Neither way is inherently + better or worse, but it will make the full ambition of NAIRR (as an inter-agency collaboration) challenging to + navigate. This is particularly relevant as DOE is now pursuing its own multi-billion-dollar AI infrastructure + effort, FASST, that appears to sidestep NAIRR.
+There's no better way to figure out what's going on in the HPC industry than walking the + exhibit floor each year, because booths cost money and reflect the priorities (and budgets) of all participants. + This year's exhibit felt physically huge, and walking from one end to the other was an adventure. You can get a + sense of the scale from this photo I took during the opening gala:
Despite having almost 18,000 registrants and the opening gala usually being a crush of people, the gala this year felt and looked very sparse just because people and booths were more spread out. There was also a perceptibly larger number of splashy vendors who had never attended before and who were promoting downstream HPC technologies like data center cooling and electrical distribution, and there was healthy speculation online about whether the hugeness of the exhibit this year was due to these new power and cooling companies.
+ +To put these questions to rest, I figured out how to yank down all the exhibitor metadata + from the conference website so I could do some basic analysis on it.
+The easiest way to find the biggest companies to appear this year was to compare the + exhibitor list and booth sizes from SC23 to this year and see whose booth went from zero to some big square footage. +
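A minimal sketch of that comparison is below. It assumes the scraped exhibitor metadata has been saved as two CSVs with "exhibitor" and "sqft" columns; those filenames and column names are hypothetical stand-ins, not the actual files behind the notebook linked further down:

```python
# Compare exhibitor booth sizes between SC23 and SC24 to find new, grown, and vanished booths.
# The CSV filenames and column names ("exhibitor", "sqft") are assumptions for illustration.
import pandas as pd

sc23 = pd.read_csv("sc23_exhibitors.csv").set_index("exhibitor")["sqft"]
sc24 = pd.read_csv("sc24_exhibitors.csv").set_index("exhibitor")["sqft"]

both = pd.concat([sc23.rename("sc23"), sc24.rename("sc24")], axis=1).fillna(0)
both["delta"] = both["sc24"] - both["sc23"]

new_exhibitors  = both[both["sc23"] == 0].nlargest(20, "sc24")   # went from zero to a big booth
biggest_growers = both[(both["sc23"] > 0) & (both["sc24"] > 0)].nlargest(20, "delta")
vanished        = both[both["sc24"] == 0].nlargest(20, "sc23")   # had a booth last year, gone now

print(new_exhibitors, biggest_growers, vanished, sep="\n\n")
```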
+ + + +I only took the top twenty new vendors, but they broadly fall into a couple of categories:
+ +There were a couple other companies that must've just missed last SC but aren't new to + the show (NetApp, Ansys, Samsung, Micron, Broadcom). And curiously, only one new GPU-as-a-Service provider + (Nebius) showed up this year, suggesting last year was the year of the GPU Cloud.
But to confirm what others had speculated: yes, a significant amount of the new square footage of the exhibit floor can be attributed to companies focused on power and cooling. This is an interesting indicator that HPC is becoming mainstream, largely thanks to AI demanding ultra-high density of power and cooling. But it's also heartening to see a few new exhibitors in higher education making an appearance. Notably, SCRCC (South Carolina Research Computing Consortium) is a consortium between Clemson, the University of South Carolina, and Savannah River National Laboratory that just formed last year, and I look forward to seeing what their combined forces can bring to bear.
+We can also take a look at whose booths grew the most compared to SC23:
+ + + +This distribution is much more interesting, since the top 20 exhibitors who grew their footprint comprise the + majority of the growth in existing exhibitors. Cherry-picking a few interesting growers:
+ +It's also interesting to see HLRS, the German national HPC center, grow so + significantly. I'm not sure what prompted such a great expansion, but I take it to mean that things have been + going well there.
+Finally, Dell had a massive booth and showing this year. Not only did they grow the + most since SC23, but they had the single largest booth on the exhibit floor at SC24. This was no doubt a result + of their great successes in partnering with NVIDIA to land massive GPU buildout deals at places like xAI and CoreWeave. + They also had "AI factory" messaging emblazoned all over their marketing material and debuted a nice 200 kW + liquid-cooled rack that will be the basis for their GB200 NVL72 solution, clearly leaning into the idea that + they are leaders in AI infrastructure. Despite this messaging being off-beat for the SC audience as I've + described earlier, their booth was surprisingly full all the time, and I didn't actually get a chance to get in + there to talk to anyone about what they've been doing.
+Equally interesting are the vendors who reduced their footprint at SC24 relative to + SC23:
+ + + +Reading too much into any of these big shrinkers is pretty easy; while a reduction in + booth size could suggest business hasn't been as good, it could equally mean that an exhibitor just went + overboard at SC23 and downsized to correct this year. A few noteworthy exhibitors to call out:
+ +Overall, almost twice as many vendors grew their booths than scaled back, so I'd + caution anyone against trying to interpret any of this as anything beyond exhibitors right-sizing their booths + after going all-in last year.
+Finally, there are a handful of vendors who disappeared outright after SC23:
It is critical to point out that the largest booths to vanish outright were all on the smaller side: SUSE, Tenstorrent, and Symbiosys Alliance all disappeared this year, but their booths last year were only 20x30. I was surprised to see that Tenstorrent and Arm didn't have booths, but the others are either companies I haven't heard of (suggesting the return on investment of showing at SC might've been low), are easy to rationalize as only being HPC-adjacent (such as SNIA and DigitalOcean), or simply went bankrupt in the last year.
+As we say at the business factory, the net-net of the exhibit hall this year is that + the square footage of booth space increased by 15,000 square feet, so it was in fact bigger, it did take longer + to walk from one end to the other, and there definitely were a bunch of new power and cooling companies filling + out the space. Some exhibitors shrank or vanished, but the industry as a whole appears to be moving in a healthy + direction.
+And if you're interested in analyzing this data more yourself, please have a look at the data and the Jupyter notebook I used to generate + the above treemaps on GitHub. If you discover anything interesting, please write about it and post it + online!
+ + +As an AI infrastructure person working for a major cloud provider, I kept an eye out for all the companies trying + to get into the GPU-as-a-Service game. I described these players last year as + "pure-play GPU clouds," and it seems like the number of options available to customers who want to go + this route is growing. But I found it telling that a lot of them had booths that were completely + indistinguishable from each other. Here's an example of one:
+ + + +As best I can tell, these companies are all NVIDIA preferred partners with + data centers and a willingness to deploy NVIDIA GPUs, NVIDIA SmartNICs, and NVIDIA cloud stack, and sell multi-year + commitments to consume those GPUs. I tried to accost some of these companies' booth staff to ask them my favorite + question ("What makes you different from everyone else?"), but most of these companies' booths were staffed by + people more interested in talking to each other than me.
+These GPUaaS providers tend to freak me out, because, as Microsoft's CEO recently stated, these companies are + often "just a bunch of + tech companies still using VC money to buy a bunch of GPUs." I can't help but feel like this is where + the AI hype will come back to bite companies who have chosen to build houses upon sand. Walking the SC24 exhibit + floor is admittedly a very narrow view of this line of business, but it seemed like some of these companies were + content to buy up huge booths, hang a pretty banner above it, and otherwise leave the booth empty of anything + beyond a few chairs and some generic value propositions. I didn't feel a lot of hunger or enthusiasm from these + companies despite the fact that a bunch of them have hundreds of millions of dollars of GPUs effectively sitting + on credit cards that they are going to have to make payments on for the next five years.
+That all said, not all the companies in the GPUaaS are kicking back and letting the money pour in. In particular, + I spent a few minutes chatting up someone at the CoreWeave booth, and I was surprised to hear about how much + innovation they're adding on top of their conventional GPUaaS offering. For example, they developed Slurm on Kubernetes + (SUNK) with one of their key customers to close the gap between the fact that CoreWeave exposes its GPU + service through Kubernetes, but many AI customers have built their stack around Slurm, pyxis, and enroot. +
+In a weird twist of fate, I later ran into an old acquaintance who turned out to be one of the key CoreWeave + customers for whom SUNK was developed. He commented that SUNK is the real deal and does exactly what his users + need which, given the high standards that this person has historically had, is a strong affirmation that SUNK is + more than just toy software that was developed and thrown on to GitHub for an easy press release. CoreWeave is + also developing some interesting high-performance object storage caching software, and all of these software + services are provided at no cost above whatever customers are already paying for their GPU service.
+I bring this up because it highlights an emerging distinction in the GPUaaS market, which used to be a homogenous + sea of bitcoin-turned-AI providers. Of course, many companies still rely on that simple business model: holding + the bill for rapidly depreciating GPUs that NVIDIA sells and AI startups consume. However, there are now GPUaaS + providers moving up the value chain by taking on the automation and engineering challenges that model developers + don't want to deal with. Investing in uncertain projects like new software or diverse technology stacks is + certainly risky, especially since they may never result in enough revenue to pay for themselves. But having a + strong point of view, taking a stance, and investing in projects that you feel are right deserves recognition. + My hat is off to the GPUaaS providers who are willing to take these risks and raise the tide for all of us + rather than simply sling NVIDIA GPUs to anyone with a bag of money.
+As much as I enjoy increasing shareholder value, the part of SC that gives me the + greatest joy is reconnecting with the HPC community. Knowing I'll get to chat with my favorite people in the + industry (and meet some new favorite people!) makes the long plane rides, upper respiratory infections, and weird + hotel rooms completely worth it.
+ + + +I wound up averaging under six hours of sleep per night this year in large part because 9pm + or 7am were often the only free times I had to meet with people I really wanted to see. I have this unhealthy + mindset where every hour of every day, from the day I land to the day I leave, is too precious to waste, and it's + far too easy for me to rationalize that spending an hour talking to someone interesting is worth losing an hour of + sleep.
+But like I said at the outset of this blog post, this year felt different for a few + reasons, and a lot of them revolve around the fact that I think I'm getting old. Now, it's always fun to say "I'm + getting old" in a mostly braggadocious way, but this feeling manifested in concrete ways that affected the way I + experienced the conference:
+ +If you read this all and think "boo hoo, poor Glenn is too popular and wise for his own + good," yeah, I get it. There are worse problems to have. But this was the first year where I felt like what I put + into the conference was greater than what I got out of it. Presenting at SC used to be at least as good for my + career as it was useful for my audiences, but it just doesn't count for much given my current role and career stage. + It felt like some of the magic was gone this year in a way I've never experienced before.
+As the years have gone on, I've come to spend an increasing amount of my week having one-on-one conversations instead of wandering aimlessly. This year though, I came to SC without really having anything to buy or sell:
+Much to my surprise, a bunch of my old vendor and partner colleagues still wanted to get together to chat this year. Reflecting back, I realized that it was these conversations--not the ones about business--that were the most fulfilling this year.
+I learned about people's hobbies, families, and philosophies on life, and it was amazing to get to know some of the people behind the companies with whom I've long dealt. I was reminded that the person is rarely the same as the company, and that even behind some of the most aggressive and blustering tech companies are often normal people with the same concerns and moments of self-doubt that everyone else has. I was also reminded that good engineers appreciate good engineering regardless of whether it comes from a competitor. The public persona of a tech exec may not openly admire a competitor's product, but that doesn't mean they don't know good work when they see it.
+I also surprised a colleague whose career has been in the DOE labs with an anecdote that amounted to the following: even though two companies may be in fierce competition, the people who work for them don't have to be. The HPC community is small enough that almost everyone has a pal at a competing company, and when there are deals to be made, people looove to gossip. If one salesperson hears a juicy rumor about a prospective customer, odds are that everyone else in the market will hear about it pretty quickly too. Of course, the boundaries of confidentiality and professionalism are respected when it matters, but the interpersonal relationships formed between coworkers and friends don't suddenly disappear when people change jobs.
+And so, I guess it would make sense that people still want to talk to me even though I have + nothing to buy or sell. I love trading gossip just as much as everyone else, and I really enjoyed this aspect of the + week.
+I also spent an unusually large share of my week talking to early career people in HPC who knew of me one way or another and wanted career advice. This is the first year I recall having the same career conversations with multiple people, and this new phase of my life was perhaps most apparent at the IEEE TCHPC/TCPP HPCSC career panel where I was invited to speak this year.
+It was an honor to be asked to present on a career panel, but I didn't feel very qualified to give career advice to up-and-coming computer science graduate students who want to pursue HPC. I am neither a computer scientist nor a researcher, but fortunately for me, my distinguished co-panelists (Drs. Dewi Yokelson, Olga Pearce, YJ Ji, and Rabab Alomairy) had plenty of more relevant wisdom to share. And at the end of the panel, there were a few things we all seemed to agree on as good advice:
+In both this panel and the one-on-one conversations I had with early career individuals, the best I could offer was the truth: I never had a master plan that got me to where I am; I just try out new things until I realize I don't like doing them anymore. I never knew what I wanted to be when I grew up, and I still don't really, so it now makes me nervous that people have started approaching me with the assumption that I've got it all figured out. Unless I torpedo my career and go live on a goat farm though, maybe I should prepare for this to be a significant part of my SC experiences going forward.
+One last, big change in the community aspect of SC this year was the mass-migration of a ton of HPC folks from + Twitter to Bluesky during the week prior to the conference. I don't really understand what prompted it so suddenly; + a few of us have been trying for years to get some kind of momentum on other social platforms like Mastodon, but the + general lack of engagement meant that all the excitement around SC always wound up exclusively on Twitter. This year + was different though, and Bluesky hit critical mass with the HPC community.
+I personally have never experienced an SC conference without Twitter; my first SC was in 2013, and part of what made + that first conference so exciting was being able to pull up my phone and see what other people were seeing, + thinking, and doing across the entire convention center via Twitter. Having the social media component to the + conference made me feel like I was a part of something that first year, and as the years went on, Twitter became an + increasingly indispensable part of the complete SC experience for me.
+This year, though, I decided to try an + experiment and see what SC would be like if I set Twitter aside and invested my time into Bluesky instead. +
+The verdict? It was actually pretty nice.
+It felt a lot like the SC13 days, where my day ended and began with me popping open Bluesky to see what new #SC24 posts were made. And because many of the tech companies and HPC + centers hadn't yet made it over, the hashtag wasn't clogged up by a bunch of prescheduled marketing blasts that + buried the posts written by regular old conference attendees who were asking important questions:
++Which booths at #sc24 have coffee? I noticed oracle do. Anyone else?
+— Mike Croucher (@walkingrandomly.bsky.social) November 18, 2024 at 3:02 PM
Of course, I still clogged Bluesky up with my nonsense during the week, but there was an amazing amount of engagement from a diverse set of thoughtful people--many of whom came from Twitter, but some whose names and handles I didn't recognize.
+The volume of traffic on Bluesky during the week did feel a little lower than what it had been on Twitter in years past though. I also didn't see as many live posts of technical sessions as they happened, so I couldn't really tell whether I was missing something interesting in real time. This may have contributed to why I felt a little less connected to the pulse of the conference this year than I had in the past. It also could've been the fact that the conference was physically smeared out across a massive space though; the sparsity of the convention center was at least on par with the sparsity on Bluesky.
+At the end of the week, I didn't regret the experiment. In fact, I'll probably be putting more effort into my Bluesky + account than my Twitter account going forward. To be clear though, this isn't a particularly political decision on + my part, and I pass no judgment on anyone who wants to use one platform over the other. It's just that I like the + way I feel when I scroll through my Bluesky feeds, and I don't get that same feeling when I use Twitter.
+SC this year was a great conference by almost every measure, as it always is, but it still felt a little different for me. I'm sure that some of that feeling is the result of my own growth, and my role with respect to the conference seems to be evolving from someone who gets a lot out of the conference to someone who is giving more to the conference. That's not to say that I don't get a lot out of it, though; I had no shortage of wonderful interactions with everyone from technology executives to rising stars who are early in their career, and I learned a lot about both them and me as whole people. But SC24, more than any SC before it, is when I realized this change was happening.
+On the technological front, we saw the debut of a new #1 system (emblazoned with the smiling face of Bronis...) and a growing crop of massive, new clusters deployed for commercial applications. The exhibit floor was quantitatively bigger, in large part due to new power and cooling companies who are suddenly relevant to the HPC world thanks to the momentum of AI. At the same time, the SC technical program is clearly separating itself out as a conference focused on scientific computing; the level of discourse around AI remains largely superficial compared to true AI conferences, and the role of hyperscalers in the HPC industry is still cast more as a threat than an opportunity.
+For my part, I'm still trying to get a grasp on where government agencies like DOE and NSF want to take their AI ambitions so I can try to help build a better mutual understanding between the scientific computing community and the hyperscale AI community. However, it seems like the NSF is progressing slowly on a wide front, while the DOE is doing what DOE does and charging headfirst into a landscape that has changed more than I think they realize.
+There's a lot of technical content that I know I missed on account of the increasing time I've been spending on the people and community aspect of the conference, and I'm coming to terms with the idea that this just may be the way SC is from now on. And I think I'm okay with that, since the support of the community is what helped me go from a bored materials science student to someone whose HPC career advice is worth soliciting, all in the short span of eleven years. Despite any or all of the cynicism that may come out in the things I say about this conference, SC is always the highlight of my year. I always go into it with excitement, gladly burn the candle at both ends all week, and fly home feeling both grateful for and humbled by everything the HPC community has done and continues to do to keep getting me out of bed in the morning.
+]]>If you haven’t watched this talk before, I encourage checking it out. Cantrill gave it in part to talk about why the node.js community and Joyent didn’t work well together, but I thought he had some good insights into how values get built into a technical artifact itself, as well as how the community around those artifacts will prioritize certain values.
@@ -3288,178 +4211,4 @@ I modified the collector to output a profile update every 10 seconds so I could
We continue to look for optimizations to the collector. When looking at the output from the most recent profile, we noticed the collector is spending a significant amount of time in the logging functions. By default, we have debug logging turned on. We will look at turning off debug logging in the future.
-Additionally, the collector is spending a lot of time polling for messages. In fact, the message bus is receiving ~1500 messages a second, which is increasing its load. After reading through RabbitMQ optimization guidance, it appears that fewer but larger messages are better for the message bus. We will look at batching messages in the future.
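-To make that concrete, here is a minimal sketch of the kind of batching we have in mind: buffer metrics in memory and publish them as one larger message once a size or time threshold is hit. It uses the pika client, and the queue name, batch size, and flush interval are illustrative assumptions rather than our actual configuration.

```python
import json
import time

import pika

QUEUE = "collector-metrics"   # hypothetical queue name
BATCH_SIZE = 100              # illustrative thresholds, not tuned values
FLUSH_INTERVAL_S = 1.0

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue=QUEUE, durable=True)

_batch = []
_last_flush = time.monotonic()

def publish(metric: dict) -> None:
    """Buffer a metric and flush the batch as a single message when it is full or stale."""
    global _last_flush
    _batch.append(metric)
    if len(_batch) >= BATCH_SIZE or time.monotonic() - _last_flush >= FLUSH_INTERVAL_S:
        # One publish per batch instead of one per metric keeps the message rate down.
        channel.basic_publish(exchange="", routing_key=QUEUE, body=json.dumps(_batch))
        _batch.clear()
        _last_flush = time.monotonic()
```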
]]>This was the second time I've attended the conference as a vendor instead of a customer, and this meant I spent a fair amount of time running to and from meetings instead of walking the show floor or attending technical sessions. I'm sure I missed some major announcements and themes as a result, but I thought it still might be valuable to contribute my observations through the narrow lens of an AI-minded storage product manager for a major cloud service provider. If you're interested in a more well-rounded perspective, check out the HPC Social Supercomputing 2023 Summary and contribute your own thoughts!
-- -
I don't know the best way to organize the notes that I took, so I grouped them into a few broad categories:
- -I must also disclose that I am employed by Microsoft and I attended SC23 in that capacity. However, everything in this post is my own personal viewpoint, and my employer had no say in what I did or didn't write here. Everything below is written from my perspective as an enthusiast, not an employee, although my day job probably colors my outlook on the HPC industry.
-With all that being said, let's dive into the big news of the week!
- -The Aurora exascale system has a storied history going back to 2015; first conceived of as a 180 PF supercomputer to be delivered in 2018, it evolved into a GPU-based exascale supercomputer that was supposed to land in 2021. Now two years late and a few executives short, Intel and Argonne were stuck between a rock and a hard place in choosing whether to list their HPL results at SC23:
- -Intel and Argonne ultimately chose option #2 and listed an HPL run that used only 5,439 of Aurora's 10,624 nodes (51.1% of the total machine), and as expected, people generally understood that this sub-exaflop score was not an indictment of the whole system underdelivering, but more a reflection that the system was still not stable at its full scale. Still, headlines in trade press were dour, and there was general confusion about how to extrapolate Aurora's HPL submission to the full system. Does the half-system listing of 585.34 PF Rmax at 24.7 MW power mean that the full system will require 50 MW to achieve an Rmax that's still lower than Frontier? Why is the efficiency (Rmax/Rpeak = 55%) so low?
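-For those trying to do the same mental math I was, here is the naive back-of-the-envelope extrapolation, which simply assumes Rmax and power scale linearly from the 5,439-node run to all 10,624 nodes; real HPL efficiency rarely scales that cleanly, so treat the numbers as rough bounds rather than predictions.

```python
# Naive linear extrapolation of Aurora's half-system HPL submission.
nodes_run, nodes_total = 5_439, 10_624
rmax_pf, power_mw = 585.34, 24.7          # as listed on the Top500 entry

scale = nodes_total / nodes_run
print(f"Fraction of system : {nodes_run / nodes_total:.1%}")
print(f"Extrapolated Rmax  : {rmax_pf * scale:7.1f} PF")   # ~1143 PF, still shy of Frontier's ~1200 PF
print(f"Extrapolated power : {power_mw * scale:7.1f} MW")  # ~48 MW, roughly the 50 MW figure people worried about
```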
-Interestingly, about half the people I talked to thought that Argonne should've waited until ISC'24 to list the full system, and the other half thought that listing half of Aurora at SC'23 was the better option. There was clearly no right answer here, and I don't think anyone can fault Argonne for doing the best they could given the Top500 submission deadline and the state of the supercomputer. In talking to a couple folks from ALCF, I got the impression that there's still plenty of room to improve the score since their HPL run was performed under a time crunch, and there were known issues affecting performance that couldn't have been repaired in time. With any luck, Aurora will be ready to go at full scale for ISC'24 and have its moment in the sun in Hamburg.
- -The other new Top500 entry near the top of the list was Eagle, Microsoft's surprise 561 PF supercomputer. Like Aurora, it is composed of GPU-heavy nodes, and like Aurora, the HPL run utilized only part (1,800 nodes) of the full system. Unlike Aurora though, the full size of Eagle is not publicly disclosed by Microsoft, and its GPU-heavy node architecture was designed for one specific workload: training large language models for generative AI.
-At the Top500 BOF, Prabhat Ram gave a brief talk about Eagle where he emphasized that the system wasn't a custom-built, one-off stunt machine. Rather, it was built from publicly available ND H100 v5 virtual machines on a single 400G NDR InfiniBand fat tree fabric, and Microsoft had one of the physical ND H100 v5 nodes at its booth. Here's the back side of it:
- -From top to bottom, you can see it has eight E1.S NVMe drives, 4x OSFP ports which support 2x 400G NDR InfiniBand each, a Microsoft SmartNIC, and a ton of power. A view from the top shows the HGX baseboard and fans:
- -
Logically, this node (and the ND H100 v5 VM that runs on it) looks a lot like the NVIDIA DGX reference architecture. Physically, it is an air-cooled, Microsoft-designed OCP server, and Eagle's Top500 run used 1,800 of these servers.
Big HPL number aside, the appearance of Eagle towards the top of the Top500 list has powerful implications for the supercomputing industry at large. Consider the following.
-Microsoft is a for-profit, public enterprise whose success is ultimately determined by how much money it makes for its shareholders. Unlike government agencies who have historically dominated the top of the list to show their supremacy in advancing science, the Eagle submission shows that there is now a huge financial incentive to build giant supercomputers to train large language models. This is a major milestone in supercomputing; up to this point, the largest systems built by private industry have come from the oil & gas industry, and they have typically deployed at scales below the top 10.
-Eagle is also built on the latest and greatest technology--NVIDIA's H100 and NDR InfiniBand--rather than previous-generation technology that's already been proven out by the national labs. SC23 was the first time Hopper GPUs have appeared anywhere on the Top500 list, and Eagle is likely the single largest installation of both H100 and NDR InfiniBand on the planet. Not only does this signal that it's financially viable to stand up a leadership supercomputer for profit-generating R&D, but industry is now willing to take on the high risk of deploying systems using untested technology if it can give them a first-mover advantage.
-Eagle also shows us that the potential upside of bringing a massive new AI model to market is worth both buying all the infrastructure required to build a half-exaflop system and hiring the talent required to shake out what is literally a world-class supercomputer. And while the US government can always obtain a DPAS rating to ensure it gets dibs on GPUs before AI companies can, there is no DPAS rating for hiring skilled individuals to stand up gigantic systems. This all makes me wonder: if Aurora was a machine sitting in some cloud data center instead of Argonne, and its commissioning was blocking the development of the next GPT model, would it have been able to take the #1 spot from Frontier this year?
-The appearance of such a gigantic system on Top500, motivated by and paid for as part of the AI land grab, also raises some existential questions for the US government. What role should the government have in the supercomputing industry if private industry now has a strong financial driver to invest in the development of leadership supercomputing technologies? Historically, government has always incubated cutting-edge HPC technologies so that they could stabilize enough to be palatable to commercial buyers. Today's leadership supercomputers in the national labs have always wound up as tomorrow's midrange clusters that would be deployed for profit-generating activities like seismic imaging or computer-aided engineering. If the AI industry is now taking on that mantle of incubating and de-risking new HPC technologies, perhaps government now needs to focus on ensuring that the technologies developed and matured for AI can still be used to solve scientific problems.
-I was asked a lot of questions about storage by journalists, VCs, and even trusted colleagues that followed a common theme: What storage technologies for AI excite me the most? What's the future of storage for AI?
-I don't fault people for asking such a broad question because the HPC/AI storage industry is full of bombastic claims. For example, two prominent storage vendors emblazoned their booths with claims of what their products could do for AI:
-These photos illustrate the reality that, although there is general agreement that good storage is needed for GPUs and AI, what constitutes "good storage" is muddy and confusing. Assuming the above approach to marketing (10x faster! 20x faster!) is effective for someone out there, there appears to be a market opportunity in capitalizing on this general confusion by (1) asserting what I/O problem is jamming up all AI workloads, and (2) showing that your storage product does a great job of solving that specific problem.
For example, the MLPerf Storage working group recently announced the first MLPerf Storage benchmark, and Huiho Zheng from Argonne (co-author of the underlying DLIO tool on which MLPerf Storage was built) described how the MLPerf Storage benchmark reproduces the I/O characteristics of model training at the Workshop on Software and Hardware Co-Design of Deep Learning Systems in Accelerators:
-When I saw this premise, I was scratching my head--my day job is to develop new storage products to meet the demands of large-scale AI model training and inferencing, and I have never had a customer come to me claiming that they need support for small and sparse I/O or random access. In fact, write-intensive checkpointing and fine-tuning, not read-intensive data loading, is the biggest challenge faced by those training large language models in my experience. It wasn't until a few slides later that I realized where these requirements may be coming from:
-Storage and accelerator vendors are both defining and solving the I/O problems of the AI community, which seems counterproductive--shouldn't a benchmark be set by the practitioners and not the solution providers?
-What I learned from talking to attendees, visiting storage vendor booths, and viewing talks like Dr. Zheng's underscores a reality that I've faced on my own work with production AI workloads: AI doesn't actually have an I/O performance problem, so storage vendors are struggling to define ways in which they're relevant in the AI market.
-I outlined the ways in which LLM training uses storage in my HDF5 BOF talk, and those needs are easy to meet with some local storage and basic programming. So easy, in fact, that a reasonably sophisticated AI practitioner can duct tape their way around I/O problems very quickly and move on to harder problems. There's no reason for them to buy into a sophisticated Rube Goldberg storage system, because it still won't fundamentally get them away from having to resort to local disk to achieve the scalability needed to train massive LLMs.
-So yes, I've got no doubt that there are storage products that can deliver 10x or 20x higher performance for some specific AI workload. And MLPerf Storage is probably an excellent way to measure that 20x performance boost. But the reality I've experienced is that half a day of coding will deliver 19x higher performance compared to the most naive approach, and every AI practitioner knows and does this already. That's why there are a lot of storage vendors fishing in this AI storage pond, but none of them seem to be reeling in any whoppers.
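-To illustrate the kind of duct tape I'm talking about, here is a minimal sketch of the pattern I see everywhere: write checkpoints to node-local NVMe so the GPUs aren't blocked on the slow tier, then drain them to shared storage in the background. The paths are hypothetical, and a production job would add retries, cleanup, and rank coordination, but the core idea really is this small.

```python
import shutil
import threading
from pathlib import Path

import torch

LOCAL_SCRATCH = Path("/local/nvme/ckpt")   # hypothetical node-local NVMe mount
SHARED_DIR = Path("/shared/fs/ckpt")       # hypothetical parallel/shared file system

def save_checkpoint(step: int, model: torch.nn.Module, optimizer: torch.optim.Optimizer) -> None:
    """Serialize to fast local storage, then copy to shared storage off the critical path."""
    LOCAL_SCRATCH.mkdir(parents=True, exist_ok=True)
    SHARED_DIR.mkdir(parents=True, exist_ok=True)
    local_path = LOCAL_SCRATCH / f"step-{step:08d}.pt"
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, local_path)
    # Training resumes as soon as torch.save returns; the slow copy happens in the background.
    threading.Thread(target=shutil.copy2,
                     args=(local_path, SHARED_DIR / local_path.name),
                     daemon=True).start()
```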
-This isn't to say that there's nothing interesting going on in high-performance storage though. If the most common question I was asked was "what's the future of storage for AI," the second most common question was "what do you think about VAST and WEKA?"
-Both companies seem to be doing something right since they were top of mind for a lot of conference attendees, and it probably grinds their respective gears that the field still groups them together in the same bucket of "interesting parallel storage systems that we should try out." Rather than throw my own opinion in the pot though (I work with and value both companies and their technologies!), I'll note the general sentiments I observed.
-WEKA came into the week riding high on their big win as U2's official technology partner in September. Their big booth attraction was a popular Guitar Hero game and leaderboard, and an oversized Bono, presumably rocking out to how much he loves WEKA, presided over one of their seating areas:
-Much of their marketing centered around accelerating AI and other GPU workloads, and the feedback I heard from the WEKA customers I bumped into during the week backed this up. One person shared that the WEKA client does a great job with otherwise difficult small-file workloads that are particularly common in the life sciences, and this anecdote is supported by the appearance of a very fast WEKA cluster owned by MSK Cancer Center on the IO500 Production list. People also remarked on WEKA's need for dedicated CPU cores and local storage to deliver the highest performance; this, combined with its client scalability, lends itself well to smaller clusters of fat GPU nodes. I didn't run into anyone using WEKA in the cloud though, so I assume the feedback I gathered had a bias towards more conventional, on-prem styles of architecting storage for traditional HPC.
-Whereas WEKA leaned into its rock 'n' roll theme this year, VAST doubled down on handing out the irresistibly tacky light-up cowboy hats they introduced last year (which I'm sure their neighbors at the DDN booth absolutely loved). They were all-in on promoting their new identity as a "data platform" this year, and although I didn't hear anyone refer to VAST as anything but a file system, I couldn't throw a rock without hitting someone who either recently bought a VAST system or tried one out.
- -Unlike last year though, customer sentiment around VAST wasn't all sunshine and rainbows, and I ran into a few customers who described their presales engagements as more formulaic than the white-glove treatment everyone seemed to be getting a year ago. This isn't surprising; there's no way to give all customers the same royal treatment as a business scales. But it does mean that the honeymoon period between VAST and the HPC industry is probably at an end, and they will have to spend the time between now and SC24 focusing on consistent execution to maintain the momentum they've gotten from the light-up cowboy hats.
-The good news for VAST is that they've landed some major deals this past year, and they came to SC with customers and partners in-hand. They co-hosted a standing-room-only party with CoreWeave early in the week and shared a stage with Lambda at a customer breakfast, but they also highlighted two traditional, on-prem HPC customers (TACC and NREL) at the latter event.
-VAST clearly isn't letting go of the on-prem HPC market as it also pursues partnerships with emerging GPU cloud service providers; this contrasted with WEKA's apparent focus on AI, GPUs, and the cloud. Time will tell which strategy (if either, or both) proves to be the better approach.
-I usually make it a point to attend the annual DAOS User Group meeting since it is always attended by the top minds in high-performance I/O research, but I had to miss it this year on account of it running at the same time as my I/O tutorial. Fortunately, DAOS was pervasive throughout the conference, and there was no shortage of opportunities to catch up on the latest DAOS news. For example, check out the lineup for PDSW 2023 this year:
-Three out of thirteen talks were about DAOS, which is more than any other single storage product or project. DAOS also won big at this year's IO500, taking the top two spots in the production storage system list:
-
In fact, DAOS underpinned every single new awardee this year, and DAOS is now the second most represented storage system on the list behind Lustre:
-Why is DAOS at the top of so many people's minds this year? Well, DAOS reached a few major milestones in the past few months that have thrust it into the public eye.
-First, Aurora is finally online and running jobs, and while the compute system is only running at half its capability, the full DAOS system (all 220 petabytes of it, all TLC NVMe) is up and running--a testament to the scalability of DAOS that many parallel storage systems, including VAST and WEKA, have not publicly demonstrated. Because DAOS is open-source software and Aurora is an open-science system, all of DAOS' at-scale warts are also on full display to the community in a way that no competing storage system besides Lustre is.
-Second, Google Cloud cast a bold vote of confidence in DAOS by launching Parallelstore, its high-performance parallel file service based on DAOS, in August. Whereas AWS and Azure have bet on Lustre to fill the high-performance file gap (via FSx Lustre and Azure Managed Lustre), GCP has planted a stake in the ground by betting that DAOS will be the better foundation for a high-performance file service for HPC and AI workloads.
-Parallelstore is still in private preview and details are scant, but GCP had DAOS and Parallelstore dignitaries at all the major storage sessions in the technical program to fill in the gaps. From what I gathered, Parallelstore is still in its early stages and is intended to be a fast scratch tier; it's using DRAM for metadata which means it relies on erasure coding across servers to avoid data loss on a single server reboot, and there's no way to recover data if the whole cluster goes down at once. This lack of durability makes it ineligible for the IO500 list right now, but the upcoming metadata-on-NVMe feature (which previews in upstream DAOS in 1H2024) will be the long-term solution to that limitation.
-Finally, the third major bit of DAOS news was about the formation of the DAOS Foundation. First announced earlier this month, this initiative lives under the umbrella of the Linux Foundation and is led by its five founding members:
- -I see this handoff of DAOS from Intel to this new foundation as a positive change that makes DAOS a more stable long-term bet; should Intel choose to divest itself of DAOS once its obligations to the Aurora program end, DAOS now can live on without the community having to fork it. The DAOS Foundation is somewhat analogous to OpenSFS (one of the nonprofits backing Lustre) in that it is a vendor-neutral organization around which the DAOS community can gather.
-But unlike OpenSFS, the DAOS Foundation will also assume the responsibility of releasing new versions of DAOS after Intel releases its final version (2.6) in March 2024. The DAOS Foundation will also steer feature prioritization, but seeing as how the DAOS Foundation doesn't fund developers directly, it's not clear that contributors like Intel or GCP are actually at the mercy of the foundation's decisions. It's more likely that the DAOS Foundation will just have authority to decide what features will roll up into the next formal DAOS release, and developers contributing code to DAOS will still prioritize whatever features their employers tell them to.
-So, DAOS was the talk of the town at SC23. Does this all mean that DAOS is ready for prime time?
-While Intel and Argonne may say yes, the community seems to have mixed feelings. Consider this slide presented by László Szűcs from LRZ at the DAOS Storage Community BOF:
-DAOS is clearly crazy fast and scales to hundreds of petabytes in production--Aurora's IO500 listing proves that. However, that performance comes with a lot of complexity that is currently being foisted on application developers, end-users, and system administrators. The "opportunities" listed in László's slide are choices that people running at leadership HPC scale may be comfortable making, but the average HPC user is not equipped to make many of these decisions or to choose thoughtfully between container types and library interfaces.
-The fact that DAOS was featured so prominently at PDSW--a research workshop--probably underscores this as well. This slide from Adrian Jackson's lightning talk sums up the complexity along two different dimensions:
- -His results showed that your choice of DAOS object class and I/O library atop the DAOS POSIX interface can result in wildly different checkpoint bandwidth. It's hard enough to teach HPC users about getting optimal performance out of a parallel file system like Lustre; I can't imagine those same users will embrace the idea that they should be mindful of which object class they use as they generate data.
-The other DAOS-related research talk, presented by Greg Eisenhauer, was a full-length paper that caught me by surprise and exposed how much performance varies when using different APIs into DAOS. This slide is one of many that highlighted this:
- -I naively thought that the choice of native userspace API (key-value or array) would have negligible effects on performance, but Eisenhauer's talk showed that this isn't true. The reality appears to be that, although DAOS is capable of handling unaligned writes better than Lustre, aligning arrays on large, power-of-two boundaries still has a significant performance benefit.
-Based on these sorts of technical talks about DAOS presented this year, the original question--is DAOS ready for prime time--can't be answered with a simple yes or no yet. The performance it offers is truly best in class, but achieving that performance doesn't come easy right now. Teams who are already putting heroic effort into solving high-value problems will probably leap at the opportunity to realize the I/O performance that DAOS can deliver. Such high-value problems include things like training the next generation of foundational LLMs, and GCP's bet on DAOS probably adds differentiated value to their platform as a place to train such models as efficiently as possible. But the complexity of DAOS at present probably limits its appeal to the highest echelons of leadership HPC and AI, and I think it'll be a while before DAOS is in a place where a typical summer intern will be able to appreciate its full value.
-It would be unfair of me to give all this regard to WEKA, VAST, and DAOS without also mentioning DDN's brand new Infinia product, launched right before SC23. Those in the HPC storage industry have been awaiting its launch for years now, but despite the anticipation, it really didn't come up in any conversations in which I was involved. I did learn that the engineering team developing Infinia inside DDN is completely separate from the Whamcloud team who is developing Lustre, but this could be a double-edged sword. On the good side, it means that open-source Lustre development effort isn't competing with DDN's proprietary product in engineering priorities on a day-to-day basis. On the bad side though, I still struggle to see how Infinia and Lustre can avoid eventually competing for the same business.
-For the time being, Infinia does seem to prioritize more enterprisey features like multitenancy and hands-free operation while Lustre is squarely aimed at delivering maximum performance to a broadening range of workloads. Their paths may eventually cross, but that day is probably a long way off, and Lustre has the benefit of being deeply entrenched across the HPC industry.
-In addition to chatting with people about what's new in storage, I also went into SC23 wanting to understand how other cloud service providers are structuring end-to-end solutions for large-scale AI workloads. What I didn't anticipate was how many smaller cloud service providers (CSPs) showed up to SC for the first time this year, all waving the banner of offering NVIDIA H100 GPUs. These are predominantly companies that either didn't exist a few years ago or have historically focused on commodity cloud services like virtual private servers and managed WordPress sites, so it was jarring to suddenly see them at an HPC conference. How did so many of these smaller CSPs suddenly become experts in deploying GPU-based supercomputers in the time between SC22 and SC23?
-I got to talking to a few folks at these smaller CSPs to figure out exactly what they were offering to customers, and their approach is quite different from how AWS, Azure, and GCP operate. Rather than defining a standard cluster architecture and deploying copies of it all over to be consumed by whoever is willing to pay, these smaller CSPs deploy clusters of whitebox GPU nodes to customer specification and sell them as dedicated resources for fixed terms. If a customer wants a bunch of HGX H100s interconnected with InfiniBand, that's what they get. If they want RoCE, the CSP will deploy that instead. And the same is true with storage: if a customer wants EXAScaler or Weka, they'll deploy that too.
-While this is much closer to a traditional on-prem cluster deployment than a typical elastic, pay-as-you-go infrastructure-as-a-service offering, this is different from being a fancy colo. The end customer still consumes those GPUs as a cloud resource and never has to worry about the infrastructure that has to be deployed behind the curtain, and when the customer's contract term is up, their cluster is still owned by the CSP. As a result, the CSP can either resell that same infrastructure via pay-as-you-go or repurpose it for another dedicated customer. By owning the GPUs and selling them as a service, these CSPs can also do weird stuff like take out giant loans to build more data centers using GPUs as collateral. Meanwhile, NVIDIA can sell GPUs wholesale to these CSPs, book the revenue en masse, and let the CSPs deal with making sure they're maintained in production and well utilized.
-It also seems like the services that customers of these smaller CSPs get are often more barebones than what they'd get from a Big 3 CSP (AWS, Azure, and GCP). They get big GPU nodes and an RDMA fabric, but managed services beyond that are hit and miss.
-For example, one of these smaller CSPs told me that most of their storage is built on hundreds of petabytes of open-source Ceph. Ceph fulfills the minimum required storage services that any cloud must provide (object, block, and file), but it's generally insufficient for large-scale model training. As a result, all the smaller CSPs with whom I spoke said they are also actively exploring VAST and Weka as options for their growing GPU-based workloads. Since both VAST and Weka offer solid S3 and file interfaces, either could conceivably act as the underpinnings of these GPU clouds' first-party storage services as well.
-As I said above though, it seems like the predominant model is for these CSPs to just ship whatever dedicated parallel storage the customer wants if something like Ceph isn't good enough. This, and the growing interest in storage from companies like VAST and Weka, suggest a few things:
- -None of these observations are terribly surprising; at the price these smaller CSPs are offering GPUs compared to the Big 3 CSPs, their gross margin (and therefore their ability to invest in developing services on top of their IaaS offerings) has got to be pretty low. In the short term, it's cheaper and easier to deploy one-off high-performance storage systems alongside dedicated GPU clusters based on customer demand than develop and support a standard solution across all customers.
-Of course, building a low-cost GPU service opens the doors for other companies to develop their own AI services on top of inexpensive GPU IaaS that is cost-competitive with the Big 3's native AI platforms (AWS SageMaker, Azure Machine Learning, and Google AI Platform). For example, I chatted with some folks at together.ai, a startup whose booth caught my eye with its bold claim of being "the fastest cloud for [generative] AI:"
- -Contrary to their banner, they aren't a cloud; rather, they provide AI services--think inferencing and fine-tuning--that are accessible through an API much like OpenAI's API. They've engineered their backend stack to be rapidly deployable on any cloud that provides basic IaaS like GPU-equipped VMs, and this allows them to actually run their computational backend on whatever cloud can offer the lowest-cost, no-frills GPU VMs. In a sense, companies like together.ai develop and sell the frills that these new GPU CSPs lack, establishing a symbiotic alternative to the vertically integrated AI platforms on bigger clouds.
-I did ask a few of these smaller CSPs what their overall pitch was. Why would I choose GPU cloud X over their direct competitor, GPU cloud Y? The answers went in two directions:
-There's a big caveat here: I didn't talk to many representatives at these CSPs, so my sample size was small and not authoritative. However, these value propositions struck me as quite precarious when taken at face value, since their value is really a byproduct of severe GPU shortages driven by the hyped-up AI industry. What happens to these CSPs (and the symbionts whose businesses depend on them) when AMD GPUs appear on the market in volume? What happens if NVIDIA changes course and, instead of peanut-buttering its GPUs across CSPs of all sizes, focuses its attention on prioritizing deliveries to just a few blessed CSPs?
-There is no moat around generative AI, and I left SC23 feeling like there's a dearth of long-term value being generated by some of these smaller GPU CSPs. For those CSPs whose primary focus is buying and deploying as many GPUs as possible in as short a time as possible, not everyone can survive. They'll either come out of this GPU shortage having lost a lot of money building data centers that will go unused, or they'll be sold for parts.
-More importantly to me though, I learned that I should give less credence to the splashy press events of hot AI-adjacent startups if their successes lie exclusively with smaller GPU CSPs. Some of these CSPs are paying to make their problems go away in an effort to keep their focus on racking and stacking GPUs in the short term, and I worry that there's a lack of long-term vision and strong opinions in some of these companies. Some of these smaller CSPs seem much more like coin-operated GPU cluster vending machines than platform providers, and that business model doesn't lend itself to making big bets and changing the industry.
-Put another way, my job--both previous and current--has always been to think beyond short-term band aids and make sure that my employer has a clear and opinionated view of the technical approach that will be needed to address the challenges of HPC ten years in the future. I know who my peers are at the other Big 3 CSPs and leadership computing facilities across the world, and I know they're thinking hard about the same problems that I am. What worries me is that I do not know who my peers are at these smaller CSPs, and given their speed of growth and smaller margins, I worry that they aren't as prepared for the future as they will need to be. The AI industry as a whole will be better off when GPUs are no longer in such short supply, but the ecosystem surrounding some of these smaller GPU CSPs is going to take some damage when that day comes.
-Because I spent my booth duty standing next to one of Eagle's 8-way HGX H100 nodes, a lot of people asked me if I thought the Grace Hopper superchip would be interesting. I'm not an expert in either GPUs or AI, but I did catch up with a few colleagues who are smarter than me in this space last week, and here's the story as I understand it:
-The Grace Hopper superchip (let's just call it GH100) is an evolution of the architecture developed for Summit, where V100 GPUs were cache-coherent with the CPUs through a special widget that converted NVLink to the on-chip coherence protocol for Power9. With GH100, the protocol used to maintain coherence across the CPU is directly compatible with the ARM AMBA coherence protocol, eliminating one bump in the path that Power9+V100 had. Grace also has a much more capable memory subsystem and NOC that makes accessing host memory from the GPU more beneficial.
-Now, do AI workloads really need 72 cores per H100 GPU? Probably not.
-What AI (and HPC) will need are some high-performance cores to handle all the parts of application execution that GPUs are bad at--divergent code paths, pointer chasing, and I/O. Putting capable CPU cores (Neoverse V2, not the N2 used in CPUs like Microsoft's new Cobalt 100) on a capable NOC that is connected to the GPU memory subsystem at 900 GB/s opens doors for using hierarchical memory to train LLMs in clever ways.
-For example, naively training an LLM whose weights and activations are evenly scattered across both host memory and GPU memory won't go well since that 900 GB/s of NVLink C2C would be on the critical path of many computations. However, techniques like activation checkpointing could become a lot more versatile when the cost of offloading certain tensors from GPU memory is so much lower. In essence, the presence of easily accessible host memory will likely allow GPU memory to be used more efficiently since the time required to transfer tensors into and out of HBM is easier to hide underneath other computational steps during training.
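-For a sense of what this looks like in practice today, here is a minimal PyTorch sketch that parks saved activations in pinned host memory during the forward pass and pulls them back during the backward pass; the toy model and tensor sizes are made up, and the point of a 900 GB/s NVLink C2C link is that this kind of offload becomes much easier to hide than it is over PCIe.

```python
import torch
from torch.autograd.graph import save_on_cpu

# Stand-in model and batch; any nn.Module and input would do (requires a CUDA GPU).
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda()
x = torch.randn(64, 4096, device="cuda")

# Tensors saved for backward are kept in pinned host memory instead of HBM,
# then copied back to the GPU on demand when gradients are computed.
with save_on_cpu(pin_memory=True):
    loss = model(x).square().mean()
loss.backward()
```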
-Pairing an over-specified Grace CPU with a Hopper GPU also allows the rate of GPU development to proceed independently of CPU development. Even if workloads that saturate an H100 GPU might not also need all 72 cores of the Grace CPU, H200 or other future-generation GPUs can grow into the capabilities of Grace without having to rev the entire superchip.
-I didn't get a chance to talk to any of my colleagues at AMD to get their perspective on the MI300 APU, but I'd imagine their story is a bit simpler since their memory space is flatter than NVIDIA's superchip design. This will make training some models undoubtedly more straightforward but perhaps leave less room for sophisticated optimizations that can otherwise cram more of a model into a given capacity of HBM. I'm no expert though, and I'd be happy to reference any explanations that real experts can offer!
-Quantum computing has been a hot topic for many years of SC now, but it feels like a topic that is finally making its way out of pure CS research and into the minds of the everyday HPC facility leaders. I talked to several people last week who asked me for my opinion on quantum computing because they have come to the realization that they need to know more about it than they do, and I have to confess, I'm in the same boat as they are. I don't follow quantum computing advancements very closely, but I know an increasing number of people who do--and they're the sort who work in CTOs' offices and have to worry about risks and opportunities more than intellectual curiosities.
-It's hard to say there've been any seismic shifts in the state of the art in quantum computing at SC23; as best I can tell, there's still a rich ecosystem of venture capital-backed startups who keep cranking out more qubits. But this year felt like the first year where HPC facilities who haven't yet started thinking about their position on quantum computing are now behind. Not everyone needs a quantum computer, and not everyone even needs a quantum computing researcher on staff. But everyone should be prepared with a strong point of view if they are asked "what will you be doing with quantum computing?" by a funding agency or chief executive.
-One of the least-stealthy stealth-mode startups in the HPC industry has been NextSilicon, a company that debuted from stealth mode at SC23, launched their new Maverick accelerator, and announced their first big win with Sandia National Lab's Vanguard II project.
-What's notable about NextSilicon is that, unlike just about every other accelerator startup out there, they are not trying to go head-to-head with NVIDIA in the AI acceleration market. Rather, they've created a dataflow accelerator that aims to accelerate challenging HPC workloads that GPUs are particularly bad at--things like irregular algorithms and sparse data structures. They've paired this hardware with a magical runtime that continually optimizes the way the computational kernel is mapped to the accelerator's reconfigurable units to progressively improve the throughput of the accelerator as the application is running.
-The concept of dataflow accelerators has always been intriguing since they're the only real alternative for improving computational throughput besides making larger and larger vectors. The challenge has always been that these accelerators are more like FPGAs than general-purpose processors, and they require similar amounts of hardcore CS expertise to use well. NextSilicon claims to have cracked that nut with their runtime, and it seems like they're hiring the right sorts of people--real HPC folks with respectable pedigrees--to make sure their accelerator can really deliver value to HPC workloads.
-At the IO500 BOF, there was rich discussion about adding new benchmarking modes to IOR and IO500 to represent a wider range of patterns.
-More specifically, there's been an ongoing conversation about including a 4K random read test, and it sounds like its most outspoken critics have finally softened their stance. I've not been shy about why I think using IOPS as a measure of file system performance is dumb, but 4K random IOPS do establish a lower bound of performance for what a real application might experience. Seeing as how IO500 has always been problematic as a representation of how a file system will perform in real-world environments, adding the option to run a completely synthetic, worst-case workload will give IO500 the ability to define a complete bounding box around the lower and upper limits of I/O performance for a file system.
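-If the idea of a synthetic worst case seems abstract, this little stdlib-only sketch captures the spirit of it: hammer a file with aligned 4 KiB reads at random offsets and count how many complete per second. It is deliberately naive (single-threaded, no O_DIRECT, so the page cache will flatter the numbers), and the test file path is hypothetical; real measurements would come from tools like IOR or elbencho, not this.

```python
import os
import random
import time

def random_read_iops(path: str, io_size: int = 4096, duration_s: float = 5.0) -> float:
    """Issue aligned 4 KiB reads at random offsets for duration_s seconds; return ops/sec."""
    fd = os.open(path, os.O_RDONLY)
    try:
        max_offset = max(os.fstat(fd).st_size - io_size, 0)
        rng = random.Random(0)
        ops = 0
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            os.pread(fd, io_size, rng.randrange(0, max_offset + 1, io_size))
            ops += 1
        return ops / duration_s
    finally:
        os.close(fd)

# print(random_read_iops("/mnt/testfs/bigfile.bin"))  # hypothetical mount point and file
```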
-Hendrik Nolte from GWDG also proposed a few new and appealing IOR modes that approach more realistic workload scenarios. The first was a new locally random mode where data is randomized within IOR segments but segments are repeated:
-Compared to globally randomized reads (which is what IOR normally does), this is a much closer representation of parallel workloads that are not bulk-synchronous; for example, NCBI BLAST uses thread pools and work sharing to walk through files, and the resulting I/O pattern is similar to this new mode.
-He also described a proposal to run concurrent, mixed workloads in a fashion similar to how fio currently works. Instead of performing a bulk-synchronous parallel write followed by a bulk-synchronous parallel read, his proposal would allow IOR to perform reads and writes concurrently, more accurately reflecting the state of multitenant storage systems. I actually wrote a framework to do exactly this and quantify the effects of contention using IOR and elbencho, but I left the world of research before I could get it published. I'm glad to see others seeing value in pursuing this idea.
-The other noteworthy development in I/O benchmarking was presented by Sven Breuner at the Analyzing Parallel I/O BOF where he described a new netbench mode for his excellent elbencho benchmark tool. This netbench mode behaves similarly to iperf in that it is a network-level throughput test, but because it is part of elbencho, it can generate the high-bandwidth incasts and broadcasts that are typically encountered between clients and servers of parallel storage systems:
-This is an amazing development because it makes elbencho a one-stop shop for debugging the entire data path of a parallel storage system. For example, if you're trying to figure out why the end-to-end performance of a file system is below expectation, you can use elbencho to test the network layer, the object or file layer, the block layer, and the overall end-to-end path separately to find out which layer is underperforming. Some file systems ship their own specialized tools for the same network tests (e.g., nsdperf for IBM Spectrum Scale), but elbencho now offers a nice, generic way to generate these network patterns for any parallel storage system.
-As with last year, I couldn't attend most of the technical program due to a packed schedule of customer briefings and partner meetings, but the SC23 Digital Experience was excellently done, and I wound up watching a lot of the content I missed during the mornings and after the conference (at 2x speed!). In that sense, the hybrid nature of the conference is making it easier to attend as someone who has to juggle business interests with technical interests; while I can't jump into public arguments about the definition of storage "QOS", I can still tell that my old friends and colleagues are still fighting the good fight and challenging conventional thinking across the technical program.
-This was the sixth year that I co-presented the Parallel I/O in Practice tutorial with my colleagues Rob Latham, Rob Ross, and Brent Welch. A conference photographer got this great photo of me in the act:
- -Presenting this tutorial is always an incredibly gratifying experience; I've found that sharing what I know is one of the most fulfilling ways I can spend my time, and being able to start my week in such an energizing way is what sustains the sleep deprivation that always follows. Giving the tutorial is also an interesting window into what the next generation of I/O experts is worrying about; for example, we got a lot of questions and engagement around the low-level hardware content in our morning half, and the I/O benchmarking material in the late afternoon seemed particularly well received. The majority of attendees came from the systems side rather than the user/dev side as well, perhaps suggesting that the growth in demand for parallel storage systems (and experts to run them) is outstripping the demand for new ways to perform parallel I/O. Guessing wildly, perhaps this means new developers are coming into the field higher up the stack, using frameworks like fsspec that abstract away low-level I/O.
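-As a tiny illustration of what I mean by coming in higher up the stack, here is the sort of code a newer developer might write with fsspec; the paths are hypothetical (and the s3:// case assumes s3fs is installed), but the point is that the same two lines work whether the bytes live on local disk, object storage, or a POSIX-mounted parallel file system--the low-level I/O never shows up in their code.

```python
import fsspec

# fsspec picks the backend from the URL scheme; the application code doesn't change.
for url in ["file:///tmp/example.dat", "s3://my-bucket/example.dat"]:  # hypothetical locations
    with fsspec.open(url, "rb") as f:
        header = f.read(4096)  # the developer never touches open()/pread()/MPI-IO directly
        print(url, len(header))
```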
-Since I've jumped over to working in industry, it's been hard to find the business justification to keep putting work hours into the tutorial despite how much I enjoy it. I have to confess that I didn't have time to update any of the slides I presented this year even though the world of parallel I/O has not remained the same, and I am going to have to figure out how to better balance these sorts of community contributions with the demands of a day job in the coming years.
-At SC22, I fastidiously wore a KN95 mask while indoors and avoided all after-hours events and indoor dining to minimize my risk of catching COVID. At that time, neither my wife nor I had ever gotten COVID before, and I had no desire to bring it home to my family since my father died of COVID-related respiratory failure two years prior. Staying fully masked at SC22 turned out to be a great decision at the time since a significant number of other attendees, including many I spoke with, contracted COVID at SC22. By comparison, I maintained my COVID-free streak through 2022.
-This year I took a more risk-tolerant approach for two reasons:
-Part of my approach to managing risk was bringing my trusty Aranet4 CO2 sensor with me so that I could spot areas where air circulation was poor and the risk of contracting an airborne illness would be higher. I only wore a KN95 at the airport gates and while on the airplane at SC23, and despite going all-in on after-hours events, indoor dining, and copious meetings and tours of booth duty, I'm happy to report that I made it through the conference without getting sick.
-I have no doubt that being vaccinated helped, as I've had several people tell me they tested positive for COVID after we had dinner together in Denver. But it's also notable that the Denver Convention Center had much better ventilation than Kay Bailey Hutchison Convention Center in Dallas where SC22 was held last year. To show this quantitatively, let's compare air quality measurements from SC22 to SC23.
-My schedule for the day on which I give my tutorial is always the same: the tutorial runs from 8:30am to 5:00pm with breaks at 10:00, 12:00, and 3:00. Because of this consistent schedule, comparing the CO2 readings (which are a proxy for re-breathed air) for my tutorial day at SC22 versus SC23 shows how different the air quality was in the two conference centers. Here's what that comparison looks like:
-What the plot shows is that CO2 (a proxy for re-breathed air) steadily increased at the start of the tutorial at both SC22 and SC23, but Denver's convention center kicked on fresh-air ventilation after an hour while Dallas simply didn't. Air quality remained poor (over 1,000 ppm) throughout the day in Dallas, whereas Denver stayed pretty fresh (below 700 ppm) even during the breaks and the indoor luncheon. This relatively good air circulation inside the convention center at SC23 made me much more comfortable about going maskless throughout the week.
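-For anyone who wants to reproduce this kind of comparison from their own sensor, here is roughly what my plotting boiled down to; the CSV file names and column labels are assumptions (adjust them to whatever your Aranet4 app actually exports), and the only real trick is putting both days on a common time-of-day axis.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed export format: one CSV per conference day with a timestamp column and a CO2 column.
# Rename these to match your sensor app's actual headers.
TIME_COL, CO2_COL = "Time", "Carbon dioxide(ppm)"

def load_day(csv_path: str, label: str) -> pd.Series:
    df = pd.read_csv(csv_path, parse_dates=[TIME_COL])
    # Index by time of day so two different dates can share one x-axis.
    tod = df[TIME_COL].dt.hour + df[TIME_COL].dt.minute / 60.0
    return pd.Series(df[CO2_COL].values, index=tod, name=label)

for path, label in [("sc22_tutorial_day.csv", "SC22 (Dallas)"),
                    ("sc23_tutorial_day.csv", "SC23 (Denver)")]:  # hypothetical file names
    load_day(path, label).plot(label=label)

plt.axhline(1000, linestyle="--", linewidth=1)  # rough "stuffy room" threshold
plt.xlabel("Hour of day")
plt.ylabel("CO2 (ppm)")
plt.legend()
plt.show()
```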
-This isn't to say that I felt there was no risk of getting sick this year; there was at least one busy, upscale restaurant/bar in which I dined where the air circulation was no better than in a car or airplane. For folks who just don't want to risk being sick over Thanksgiving, wearing a mask and avoiding crowded bars was probably still the best option this year. And fortunately, Denver's weather was gorgeous, so outdoor dining was completely viable during the week.
-Although AI has played a prominent role in previous SC conferences, this was the first year where I noticed that the AI industry is bleeding into the HPC community in weird ways.
-For example, I had a bunch of journalists and media types accost me and start asking rather pointed questions while I was on booth duty. Talking to journalists isn't entirely unusual since I've always been supportive of industry press, but the social contract between practitioners like me and journalists has always been pretty formal--scheduling a call in advance, being invited to speak at an event, and things like that have long been the norm. If I was being interviewed on the record, I knew it.
-This year though, it seemed like there was a new generation of younger journalists who approached me no differently than a casual booth visitor would. Some did introduce themselves as members of the press after we got chatting (good), but others did not (not good), which taught me a lesson: check names and affiliations before chatting with strangers, because the days when I could assume that all booth visitors would act in good faith are gone.
-Now, why the sudden change? I can think of three possible reasons:
- -It'd be fair to argue that #3 is a stretch and that this isn't an AI phenomenon if not for the fact that I was also accosted by a few venture capitalists for the first time this year. HPC has never been an industry that attracted the attention of venture capital in the way that AI does, so I have to assume being asked specific questions about the viability of some startup's technology is a direct result of the AI market opportunity.
-While it's nice to have a broader community of attendees and more media coverage, the increasing presence of AI-focused media and VC types in the SC community means I can't be as open and honest as I once was. Working for a corporation (with secrets of its own to protect) doesn't help there either, so maybe getting cagier when talking to strangers is just a part of growing up.
-Attending SC23 this year coincided with two personal milestones for me as well.
-This is the tenth year I've been in the HPC business, and the first SC I ever attended was SC13. I can't say that this is my eleventh SC because I didn't attend in 2014 (on account of working at a biotech startup), but I've been to SC13, SC15 through SC19, SC20 and SC21 virtually, and SC22 and SC23 in-person. At SC13 ten years ago, the weather was a lot colder:
-But I still have the fondest memories of that conference because that was the week where I felt like I had finally found my community after having spent a decade as an unhappy materials science student.
-SC23 is also a milestone year because it may be the last SC I attend as a storage and I/O guy. I recently signed on for a new position within Microsoft to help architect the next generation of supercomputers for AI, and I'll probably have to trade in the time I used to spend at workshops like PDSW for opportunities to follow the latest advancements in large-scale model training, RDMA fabrics, and accelerators. But I think I am OK with that.
-I never intended to become an I/O or storage expert when I first showed up at SC13; it wasn't until I joined NERSC that I found that I could learn and contribute the most by focusing on storage problems. The world has changed since then, and now that I'm at Microsoft, it seems like the problems faced at the cutting edge of large language models, generative AI, and the pursuit of AGI are where the greatest need lies. As I said earlier in this post, AI has bigger problems to deal with than storage and I/O, and those bigger problems are what I'll be chasing. With any luck, I'll be able to say I had a hand in designing the supercomputers that Microsoft builds after Eagle. And as has been true for my last ten years in this business, I'll keep sharing whatever I learn with whoever wants to know.
]]>Additionally, the collector is spending a lot of time polling for messages. In fact, the message bus is receiving ~1500 messages a second, which is increasing its load. After reading through RabbitMQ optimization guidance, it appears that fewer but larger messages are better for the message bus. We will look at batching messages in the future.
]]>