State of the Art AI Data Centers: How Can AI Server OEMs Differentiate Today?

By Paul Schell | 4Q 2024 | IN-7634

Tier 1 vendors Lenovo, Hewlett Packard Enterprise (HPE), Dell, and Supermicro leverage their engineering expertise to deliver lineups of air- and liquid-cooled solutions capable of handling the unprecedented compute density of today’s most performant servers. Meanwhile, smaller players can pool resources around Artificial Intelligence (AI)-specific offerings, giving them agility in deploying AI clusters.


A Lot of Noise from Tier 1s

NEWS


AI server Original Equipment Manufacturers (OEMs) are keen to move in sync with the latest AI accelerator releases so as not to fall behind NVIDIA, AMD, and Intel’s product release cadence. For market leader NVIDIA, this is enabled by tight integration and strong lines of communication between product teams. Server OEMs and Original Design Manufacturers (ODMs) are essential channels to market for AI hardware, as most AI chipset vendors have offloaded their server manufacturing units to focus primarily on silicon innovation. For example, AMD’s recent acquisition of ZT Systems was quickly followed by the announcement that the cloud solution provider’s manufacturing assets would be sold off once the deal completes. Some recent AI server releases include:

  • Supermicro’s liquid-cooled SuperClusters with NVIDIA HGX B200, Grace Blackwell superchips (pairing a Grace Central Processing Unit (CPU) with Blackwell GPUs), and NVL72 cabinets. Supermicro is also up to speed with AMD’s latest MI325X Graphics Processing Units (GPUs), which offer more memory bandwidth than the MI300X GPUs that began shipping in early 2024. It also offers Intel’s latest Xeon 6 CPUs with built-in AI accelerators, as well as Gaudi 3 configurations.
  • Hewlett Packard Enterprise’s (HPE) latest offerings include liquid- and air-cooled MI325X servers targeting training workloads, as well as upcoming Gaudi 3 servers. These offerings are augmented by verticalized solutions in collaboration with NVIDIA for use cases such as industrial automation and content creation.
  • Lenovo has leveraged its legacy in liquid cooling with a range of new offerings featuring NVIDIA’s latest hardware, as well as AMD’s MI300X platform.
  • Dell’s latest servers include 8-GPU NVIDIA H200 systems and AMD’s MI300X platform. A vertical solution partnership exists with NVIDIA for retail, telco, and financial services offerings.

Significant progress has been made by vendors of all stripes to bring new solutions to market that meet the demands of frontier model developers, enterprise clients, and everyone in between. However, these offerings appear increasingly homogenized, as failing to move in lockstep with NVIDIA’s (and, to a lesser degree, AMD’s and Intel’s) latest offerings would send the wrong signal to the marketplace and the investment community. To address this, all of the Tier 1 OEMs above also offer, to varying degrees of sophistication, managed services for deploying their cutting-edge AI servers.

Is There Room for Differentiation?

IMPACT


Seemingly, little room is left for differentiation: all of the above server OEMs offer most of NVIDIA’s newest reference designs in both liquid- and air-cooled flavors, alongside the in-house expertise to address the other facets of small- to large-scale deployments, such as the networking required to connect accelerator clusters. Nonetheless, challengers and some Tier 1 vendors can still differentiate by leveraging their experience with AI systems. That experience is valuable because of the information asymmetry created by the speed of innovation in today’s AI data center offerings, exacerbated by general talent shortages in the industry. For instance, today’s frontier transformer models require roughly 20,000X more compute to train than models did 5 years ago, demanding larger clusters with more sophisticated networking to support the movement of data, as well as more performant cooling systems and power distribution to sustain the infrastructure. Smaller deployments are not shielded from these constraints.

All Roads Lead to Understanding Customer Pain Points

RECOMMENDATIONS


In terms of chipset vendors’ Go-to-Market (GTM) strategies, we observe best practices from market leader NVIDIA, which has built a vast network of willing and able OEMs (and ODMs) that move in lockstep with its innovation and new releases thanks to its reference designs. These designs are intended to help with scale, given their modularity, and to reduce Time to Market (TTM), as server builders need not innovate entirely in isolation. This has also led, however, to similar product portfolios addressing a broad range of customers, deployment sizes, and end markets.

Server vendors must leverage their proximity to customers and their understanding of enterprise and Cloud Service Provider (CSP) pain points to differentiate their offerings, address today’s needs, and pre-empt tomorrow’s. The following learnings from ABI Research’s latest discussions with the industry apply to Tier 1s and challengers alike, as well as to anyone evaluating their value proposition.

  • Pre-Validated Solutions: Pre-tested and “burned-in” system designs, preferably modular so they can scale to the diverse needs of different market segments, remove the burden of designing performant, viable systems from customers. Penguin Solutions’ OriginAI offering is a good example of this.
  • Cooling Solutions: To address the increased heat generated by AI servers, OEMs should work with cooling vendors to adopt a holistic cooling strategy; for instance, implementing liquid cooling at the server level and air cooling at the infrastructure level.
  • End-to-End Offerings: The complexity of the supply chain alone, as well as of the components required to spin up AI data center infrastructure, is vast. Those selling end-to-end solutions, i.e., the design and build of the entire data center from bricks to compute, will not only capture more of the value chain, but also help customers deploy faster. Vertiv’s value proposition is a good example of this approach.
  • Cluster Management Software: Software to manage clusters of up to thousands of accelerators, including node provisioning and monitoring, is a valuable addition to AI server offerings.

By offering managed service “wrappers” and leveraging existing expertise in all of the above areas, AI server vendors can provide one-time and ongoing services to those purchasing their hardware. This will shorten TTM for their customers, reduce mistakes and their associated costs, and attract more customers, ultimately strengthening their GTM strategy.
