Category Archives: Cabinet

Cooling AI Data Centers

How important are AI data centers? In just months, Elon Musk’s xAI team converted a factory outside Memphis into a cutting-edge, 100,000-GPU center for training the Colossus supercomputer—home to the Grok chatbot.

Initially powered by temporary gas turbines (later replaced by grid power), Colossus installed its first 100,000 chips in only 19 days, drawing praise from NVIDIA CEO Jensen Huang. Today, it operates 200,000 GPUs, with plans to reach 1 million GPUs by the end of 2025. [1]

Figure 1 – Elon Musk’s 1 Million Sq Ft xAI Colossus Supercomputer Facility near Memphis, TN. [1]

There are about 12,000 data centers throughout the world, nearly half of them in the United States. Now, more and more of these are being built or retrofitted for AI-specific workloads. Leaders include Musk’s xAI, Microsoft, Meta, Google, Amazon, OpenAI, and others.

Such operations demand enormous power, and, as with computational electronics at any scale, the resulting heat must be managed.

GenAI

A key driver of data center growth is Generative AI (GenAI)—AI that creates text, images, audio, video, and code using deep learning. Chatbots built on large language models, such as ChatGPT, are examples of GenAI, along with text-to-image models that generate images from written descriptions.

Managing these workloads is made possible by new generations of processors, mainly GPUs. These draw higher levels of power and generate correspondingly more heat.

Figure 2 – Advanced AI Processor, the NVIDIA GH200 Grace Hopper Superchip with Integrated CPU to Increase Speed and Performance. [2,3]

AI data centers prioritize HPC hardware: GPUs, FPGAs, ASICs, and ultra-fast networking. Compared to CPUs (150–200 W), today’s AI GPUs often run >1,000 W. To handle massive datasets and complex computations in real time, they need significant power and cooling infrastructure.

Data Center Cooling Basics

Traditional HVAC was sufficient for older CPU-driven data centers. Today’s AI GPUs demand far more cooling, both at the chip level and facility-wide. This has propelled a need for more efficient thermal management systems at both the micro (server board and chip) and macro (server rack and facility) levels. [4]

Figure 3 – The Colossus AI Supercomputer Now Runs 200,000 GPUs. It Operates at 150MW Power, Equivalent to 80,000 Households. [5]

At Colossus, Supermicro 4U servers house NVIDIA Hopper GPUs cooled by:

  • Cold plates
  • Coolant distribution manifolds (1U between each server)
  • Coolant distribution units (CDUs) with redundant pumps at each rack base [6]

Each 4U server is equipped with eight NVIDIA H100 Tensor Core GPUs. Each rack contains eight 4U servers, totaling 64 GPUs per rack.

Between every server is a 1U manifold for liquid cooling. The manifolds connect to heat-exchanging Coolant Distribution Units (CDUs) at the bottom of each rack, each with a redundant pumping system. The choice of coolant is determined by a range of hardware and environmental factors.

Figure 4 – Each Colossus Rack Contains Eight 4U Servers, Totaling 64 GPUs Per Rack. Between Each Server is a 1U Manifold for Liquid Cooling. [7]
Figure 5 – The Base of Each Rack Has a 4U CDU Pumping System with Redundant Liquid Cooling. [7]
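The rack arithmetic above implies a substantial per-rack heat load for the CDUs to move. A minimal sketch, using the article’s GPU count (8 GPUs × 8 servers = 64 per rack) but assuming an illustrative ~1 kW per GPU and a 10°C coolant temperature rise—neither of which is a published Colossus specification:

```python
# Sketch: per-rack heat load and the coolant flow a CDU must supply.
# GPU count is from the article; per-GPU power and coolant delta-T are
# illustrative assumptions, not published Colossus figures.

GPUS_PER_SERVER = 8
SERVERS_PER_RACK = 8
WATTS_PER_GPU = 1000.0          # assumed; modern AI GPUs often exceed 1 kW

CP_WATER = 4186.0               # J/(kg*K), specific heat of water
DELTA_T = 10.0                  # K, assumed coolant temperature rise

rack_heat_w = GPUS_PER_SERVER * SERVERS_PER_RACK * WATTS_PER_GPU  # 64 kW

# Q = m_dot * cp * dT  ->  m_dot = Q / (cp * dT)
mass_flow_kg_s = rack_heat_w / (CP_WATER * DELTA_T)
flow_l_min = mass_flow_kg_s * 60.0  # ~1 kg of water per litre

print(f"Rack heat load: {rack_heat_w / 1000:.0f} kW")
print(f"Required coolant flow: {flow_l_min:.0f} L/min")
```

Even at these conservative assumptions the GPUs alone put roughly 64 kW into each rack, which is why rack-base CDUs with redundant pumps are standard practice here.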

Role of Cooling Fans

Fans remain essential for DIMMs, power supplies, controllers, and NICs.

Figure 6 – Rear Door Liquid-Cooled Heat Exchangers. [7]

At Colossus, fans in the servers pull cooler air in at the front of the rack and exhaust it at the rear of the server. From there, the air is drawn through rear-door heat exchangers, which pass the warm air over a liquid-cooled, finned radiator, lowering its temperature before it exits the rack.
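The air-side duty of those fans can be estimated with the sensible-heat relation. A rough sketch, assuming an illustrative 5 kW of residual air-cooled load per rack (DIMMs, supplies, NICs) and a 15°C front-to-rear air temperature rise—both placeholder numbers, not vendor data:

```python
# Sketch: sizing server airflow for the components that stay air-cooled.
# The 5 kW residual load and 15 C allowed air temperature rise are
# illustrative assumptions.

CP_AIR = 1005.0          # J/(kg*K), specific heat of air
RHO_AIR = 1.2            # kg/m^3 at roughly 20 C
residual_heat_w = 5000.0 # assumed air-cooled load per rack
delta_t = 15.0           # K, assumed front-to-rear air temperature rise

# Q = rho * V_dot * cp * dT  ->  V_dot = Q / (rho * cp * dT)
vol_flow_m3_s = residual_heat_w / (RHO_AIR * CP_AIR * delta_t)
cfm = vol_flow_m3_s * 2118.88   # m^3/s to cubic feet per minute

print(f"Airflow needed: {vol_flow_m3_s:.3f} m^3/s (~{cfm:.0f} CFM)")
```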

Direct-to-Chip Cooling

NVIDIA’s DGX H100 and H200 server systems feature eight GPUs and two CPUs that must run between 5°C and 30°C. An AI data center with a high rack density houses thousands of these systems performing HPC tasks at maximum load. Direct liquid cooling solutions are required.

Figure 7 – An NVIDIA DGX H100/H200 System Featuring Eight GPUs [8]
Figure 8 – The NVIDIA H100 SmartPlate Connects to a Liquid Cooling System to Bring Microconvective Chip-Level Cooling That Outperforms Air Cooling by 82%. [9]

Direct liquid cooling (cold plates contacting the GPU die) is the most effective method—outperforming air cooling by 82%. It is preferred for high-density deployments of the H100 or GH200.
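The 82% figure is an empirical comparison, but the underlying physics can be sketched with a simple thermal-resistance model. The resistance values below are illustrative assumptions chosen to show the effect, not vendor specifications:

```python
# Sketch: why direct-to-chip liquid cooling wins at ~1 kW per device.
# Case-to-ambient thermal resistances are illustrative assumptions.

POWER_W = 1000.0     # ~1 kW AI GPU
T_INLET_C = 30.0     # inlet temperature, upper end of the 5-30 C range

R_AIR = 0.05         # K/W, assumed high-end air heat sink
R_LIQUID = 0.02      # K/W, assumed direct-contact cold plate

# T_case = T_inlet + P * R_thermal
t_case_air = T_INLET_C + POWER_W * R_AIR        # past typical limits
t_case_liquid = T_INLET_C + POWER_W * R_LIQUID  # comfortable margin

print(f"Air cooling:    {t_case_air:.0f} C case temperature")
print(f"Liquid cooling: {t_case_liquid:.0f} C case temperature")
```

At kilowatt power levels, even a small reduction in thermal resistance translates into tens of degrees at the die, which is what makes cold plates effectively mandatory for dense H100/GH200 deployments.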

Scalable Cooling Modules

Colossus represents the world’s largest liquid-cooled AI cluster, using NVIDIA + Supermicro technology. For smaller AI data centers, Cooling Distribution Modules (CDMs) provide a compact, self-contained solution.

Figure 9 – The iCDM-X Cooling Distribution Module from ATS Includes Pumps, Heat Exchanger and Liquid Coolant for Managing Heat from AI GPUs and Other Components. [10]

Most AI data centers are smaller, with lower, but still essential, power and cooling needs. Many of their heat issues can be resolved using self-contained Cooling Distribution Modules.

The compact iCDM-X cooling distribution module provides up to 1.6 MW of cooling for a wide range of AI GPUs and other chips. The module measures and logs all important liquid cooling parameters. It uses just 3 kW of power, and no external coolant is required.
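One way to read those numbers is as heat moved per watt of module power, a rough figure of merit:

```python
# Sketch: a coarse figure of merit for a cooling distribution module:
# heat moved per watt of electrical power consumed. Both numbers come
# from the iCDM-X description above (1.6 MW cooled, 3 kW drawn). Note
# this is not a refrigeration COP -- the module circulates coolant and
# exchanges heat; it does not chill below ambient.

HEAT_REMOVED_W = 1.6e6   # up to 1.6 MW of cooling capacity
POWER_DRAWN_W = 3.0e3    # ~3 kW module power draw

heat_per_watt = HEAT_REMOVED_W / POWER_DRAWN_W
print(f"Heat moved per watt consumed: ~{heat_per_watt:.0f} W/W")
```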

These modules include:

  • Pumps
  • Heat exchangers
  • Cold plates
  • Digital monitoring (temp, pressure, flow)

Their sole external component is one or more cold plates removing heat from AI chips. ATS provides an industry-leading selection of custom and standard cold plates, including the high-performing ICEcrystal series.

Figure 10 – The ICEcrystal Cold Plates Series from ATS Provide 1.5 kW of Jet Impingement Liquid Cooling Directly onto AI Chip Hotspots.

Cooling Edge AI and Embedded Applications

AI isn’t just for big data centers—edge AI, robotics, and embedded systems (e.g., NVIDIA Jetson Orin, AMD Kria K26) use processors running under 100 W. These are effectively cooled with heat sinks and fan sinks from suppliers like Advanced Thermal Solutions. [11]

Figure 11 – High Performance Heat Sinks for NVIDIA and AMD AI Processors in Embedded and Edge Applications. [11]

NVIDIA also partners with Lenovo, whose 6th-gen Neptune cooling system enables full liquid cooling (fanless) across its ThinkSystem SC777 V4 servers—targeting enterprise deployments with NVIDIA Blackwell + GB200 GPUs. [12]

Figure 12 – Lenovo’s Neptune Direct Water Cooling Removes Heat from Power Supplies, for Completely Fanless Operation. [12]

Benefits gained from the Neptune system include:

  • Full system cooling (GPUs, CPUs, memory, I/O, storage, regulators)
  • Efficient for 10-trillion-parameter models
  • Improved performance, energy efficiency, and reliability

Conclusion

With surging demand, AI data centers are now a major construction focus. Historically, cooling problems are the #2 cause of data center downtime (behind power issues). Given the high power needed for AI computing, these builds should be planned with their local communities in mind regarding electricity needs and sources, and water consumption. [13]

AI workloads will increase U.S. data center power demand by 165% by 2030 (Goldman Sachs), with demand reaching nearly double 2022 levels (IBM/Newmark). Sustainable design and resource-conscious cooling are essential for the next wave of AI infrastructure. [14,15]

References

1. The Guardian, https://www.theguardian.com/technology/2025/apr/24/elon-musk-xai-memphis

2. Fibermall, https://www.fibermall.com/blog/gh200-nvidia.htm

3. NVIDIA, https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-superchip?ncid=no-ncid

4. IDTechEx, https://www.idtechex.com/en/research-report/thermal-management-for-data-centers-2025-2035-technologies-markets-and-opportunities/1036

5. Data Center Frontier, https://www.datacenterfrontier.com/machine-learning/article/55244139/the-colossus-ai-supercomputer-elon-musks-drive-toward-data-center-ai-technology-domination

6. Supermicro, https://learn-more.supermicro.com/data-center-stories/how-supermicro-built-the-xai-colossus-supercomputer

7. Serve The Home, https://www.servethehome.com/inside-100000-nvidia-gpu-xai-colossus-cluster-supermicro-helped-build-for-elon-musk/2/

8. Naddod, https://www.naddod.com/blog/introduction-to-nvidia-dgx-h100-h200-system

9. Flex, https://flex.com/resources/flex-and-jetcool-partner-to-develop-liquid-cooling-ready-servers-for-ai-and-high-density-workloads

10. Advanced Thermal Solutions, https://www.qats.com/Products/Liquid-Cooling/iCDM

11. Advanced Thermal Solutions, https://www.qats.com/Heat-Sinks/Device-Specific-Freescale

12. Lenovo, https://www.lenovo.com/us/en/servers-storage/neptune/?orgRef=https%253A%252F%252Fwww.google.com%252F

13. Deloitte, https://www2.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2025/genai-power-consumption-creates-need-for-more-sustainable-data-centers.html

14. Goldman Sachs, https://www.goldmansachs.com/insights/articles/ai-to-drive-165-increase-in-data-center-power-demand-by-2030

15. Newmark, https://www.nmrk.com/insights/market-report/2023-u-s-data-center-market-overview-market-clusters

Thermal Management of High-Powered Servers

While power demands have increased, engineers are tasked with placing more components into smaller spaces. This has led to increased importance for optimizing the thermal management of servers and other high-powered devices to ensure proper performance and achieve the expected lifespan.

This article brings together content that ATS has posted over the years about cooling high-powered servers from the device- to the environment-level. (Photo by panumas nikhomkhai – Pexels.com)

With server cooling taking on increased priority, there are several ways of approaching the problem of thermal management, including device-level solutions, system-level solutions, and even environment-level solutions.

Over the years, Advanced Thermal Solutions, Inc. (ATS) has posted many articles related to this topic. Click the links below to read more about how the industry is managing the heat for servers:

  • Industry Developments: Cabinet Cooling Solutions – Although their applications vary, a common issue within these enclosures is excess heat and the danger it poses to their electronics. This heat can be generated by internal sources and intensified by heat from outside environments.
  • Effective cooling of high-powered CPUs on dense server boards – Optimizing a PCB for thermal management has been shown to ensure reliability, speed time to market and reduce overall costs. With proper design, all semiconductor devices on a PCB will be maintained at or below their maximum rated temperature.

For more information about Advanced Thermal Solutions, Inc. (ATS) thermal management consulting and design services, visit https://www.qats.com/consulting or contact ATS at 781.769.2800 or ats-hq@qats.com.

ATS Expands Its US-based Manufacturing Facilities

Advanced Thermal Solutions, Inc. has expanded its Massachusetts manufacturing facilities. This was necessary due to an increase in industrial orders and associated production requirements. The needs for metal and plastic parts and finished products have been growing as markets retool and expand, and buyers continually insist on higher quality, faster deliveries and larger volumes.

ATS’s expanded manufacturing operations provide fast, high-quality contract manufacturing services. The Norwood facility meets the needs of most global customers, from rapid prototyping to high-volume manufacturing.

The enhanced facilities in Norwood, which is also the global headquarters for ATS, are fully equipped, environmentally responsible, and staffed by highly skilled associates who work to extremely high professional standards. Multi-point inspections ensure the highest quality manufactured products, from one-of-a-kind builds to multi-thousand-part production orders.

ATS designs and builds for a wide range of industry requirements. Engineers and technicians manufacture for chassis-level integration, e.g., cooling hardware and other functionality for network communication cabinets. Associates also design and fabricate stands, racks and display cabinets for retail and office environments.

Fabrication capabilities in metal and plastics include:

  • Metal and Plastic Extruding
  • Metal Stamping
  • CNC Machining
  • Metal Finishing
  • Sheet Metal Stamping and Plastic Forming
  • Plastic Welding

Besides its manufacturing facilities in Norwood, ATS operates factories in Futian, China, a thriving region of high tech manufacturing. The Futian facility is designed for making very high volumes of quality parts, as well as inventory and storage, and worldwide distribution services.

To learn more about ATS manufacturing capabilities, please visit: http://www.qats.com/Services/Manufacturing-Services/65.aspx

This Fixed Cost Plan for Cooling Hot PCBs Saves Money, Simplifies Ordering

For one fixed cost, a QoolPCB plan includes the full set of ATS heat sinks, attachment devices and all other parts required for the effective thermal management of a PCB’s components. There are no additional costs for thermal engineering, performance testing, procurement or shipping. The heat sinks and hardware are kitted and provided for the full volume of boards requiring cooling.

Pricing for a QoolPCB solution is based on the number of heat sinks that a specific PCB requires for efficient thermal management. For example, if thermal analysis and testing show that a PCB needs 10 heat sinks to operate safely, the fixed price for the heat sinks and hardware for a production volume of that PCB would be just $50 per board. For larger boards, or those with many hot components, the unit cost per heat sink is reduced.
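The pricing arithmetic in that example can be sketched as follows. The $50-per-board figure for a 10-heat-sink PCB comes from the text; the volume-discount tier is purely hypothetical, since the article only states that the per-sink cost drops for larger boards:

```python
# Sketch of the QoolPCB per-board pricing described above. The base
# rate ($50 / 10 sinks = $5 per sink) follows the article's example;
# the discount threshold and discounted rate are hypothetical.

def unit_cost(num_heat_sinks: int) -> float:
    """Per-heat-sink price in dollars (hypothetical volume tiering)."""
    return 4.00 if num_heat_sinks > 20 else 5.00

def board_price(num_heat_sinks: int) -> float:
    """Fixed per-board price covering heat sinks and hardware."""
    return num_heat_sinks * unit_cost(num_heat_sinks)

print(board_price(10))  # 50.0 -- matches the article's example
print(board_price(30))  # 120.0 -- hypothetical discounted tier
```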

Whether the solutions are off-the-shelf heat sinks, custom designs, or a combination of both, the QoolPCB program from ATS provides them at a fixed cost. QoolPCB eliminates separate costs for design, tooling, samples, verification and supply chain management. The program offers multiple benefits for companies looking to reduce their product development costs, speed time-to-market and ensure thermal reliability.

To participate, PCB developers simply provide 3D CAD models of their board layout, along with the technical specifications, including power dissipation of all board components. ATS performs a full thermal analysis of the PCB and develops a comprehensive cooling solution for each component on the board. Where possible, ATS engineers will specify existing heat sinks from a portfolio of more than 3000 off-the-shelf and application-specific designs and with in-stock attachment systems.

If any custom heat sinks are required to bring certain components within their manufacturer-designated operating temperatures, ATS assumes all tooling charges and sample production costs, including any customized heat sink attachment hardware. In addition, ATS performs all physical testing at its Thermal Characterization Laboratory, which features advanced open-loop and closed-loop wind tunnels, temperature and velocity measurement sensors, and other analysis instrumentation, to verify the cooling design. All designs and performance reports are provided to customers, who can perform their own thermal analyses and verification studies using the ATS characterization lab and samples of the actual heat sink solutions at no extra cost.

More information about the QoolPCB thermal management program from ATS is available at: http://www.qats.com/Services/QoolPCB—PCB-Cooling-At-Fixed-Cost/57.aspx

 

ATS Announces Free Webinar: Analytical, Computational and Experimental Thermal Analysis of a xTCA Chassis

The free webinar will be held Thursday, May 27, 2010, from 2:00 PM to 3:00 PM EDT.

ATCA and MicroTCA are chassis standards geared to getting systems from telecom, computing, military, medical and other companies to market faster and at lower cost. While the form factor, modularity and price point are attractive, thermal challenges can be magnified by the small form factors and interoperability constraints of a standard architecture. Attendees will be equipped with advanced, fast and accurate thermal design techniques that will help them efficiently use modeling and testing tools to determine device junction temperatures and speed time-to-market.

To register please visit our registration site here:
Analytical, Computational and Experimental Thermal Analysis of a xTCA Chassis