Author Archives: Josh Perry

Meeting the thermal management requirements of high-performance servers

High-performance servers are systems designed to handle large computational loads, heavy communication traffic, and fast data processing. Due to their task-oriented nature, high-performance servers must offer high reliability, interchangeability, compact size, and good serviceability.

High-Performance Servers

To achieve high computational speed, high-performance servers generally have dozens of CPUs and memory modules. They also have dedicated data processing modules and control units to ensure seamless communication between CPUs and parallel data processing capability. In the drive for higher speeds, the power dissipation of the high-performance CPUs used in these servers has increased continuously over the past decade.

Cooling servers that dissipate tens of kilowatts presents a unique challenge for thermal engineers. Dealing with the ever-growing heat flux in high-performance servers requires the cooperation of electrical, mechanical, and system engineers. Moving that heat from the CPUs to the ambient calls for chip-level, board-level, and cabinet-level solutions.

Wei [1] described Fujitsu’s thermal management advancements in its high-end UNIX server, the PRIMEPOWER 2500. The server cabinet is shown in Figure 1. Its dimensions are 180 cm × 107 cm × 179 cm (H × W × D), and its maximum power dissipation is 40 kW. The system configuration of the PRIMEPOWER 2500 is shown in Figures 2 and 3. It has 16 system boards and two input/output (I/O) boards installed vertically on two back-panel boards. The two back-panel boards are interconnected by six crossbars installed horizontally.

Figure 1. PRIMEPOWER 2500 Cabinet [1]
Figure 2. PRIMEPOWER 2500 System Configuration [1]
Figure 3. PRIMEPOWER 2500 System Board Unit [1]

To cool the electrical components inside the PRIMEPOWER 2500, forty-eight 200-mm-diameter fans are installed between the system board unit and the power supply unit. They provide forced air cooling for the system boards and power supplies. In addition, six 140-mm-diameter fans are installed on one side of the crossbar unit to cool the crossbar boards with a horizontal flow. The flow direction is shown in Figure 3. Each system board is 58 cm wide and 47 cm long.

There are eight CPU processors, 32 Dual In-Line Memory Modules, 15 system controller processors, and associated DC-DC converters on each system board. The combined power dissipation per system board is 1.6 kW at most.

Figure 4. PRIMEPOWER 2500 System Board [1]

Forced air-cooling technology is commonly used in computers, communication cabinets, and embedded systems, due to its simplicity, low cost and easy implementation. For high-performance servers, the increasing power density and constraints of air-cooling capability and air delivery capacity have pushed forced air cooling to its performance limit.

A high-power system like the PRIMEPOWER 2500 needs a combination of good CPU package design, optimized board layout, advanced thermal interface materials (TIMs), high-performance heat sinks, and strong fans to achieve the desired cooling.

The general approach to cooling a multi-board system is first to identify the hottest power component, the one with the lowest temperature margin. In a high-performance server, that means the CPUs. When multiple CPUs share a system board, the CPU located downstream of the others generally runs hottest, because it receives air already preheated by the upstream components.

So, the thermal resistance requirement for this CPU is:

Rja ≤ (Tj,max − Ta − ∆Ta) / qmax

where Tj,max is the allowed maximum junction temperature, Ta is the ambient temperature, ∆Ta is the air temperature rise due to preheating upstream of the CPU, and qmax is the maximum CPU power.

The junction-to-air thermal resistance of the CPU is:

Rja = Rjc + RTIM + Rhs

where Rjc is the CPU junction-to-case thermal resistance, RTIM is the thermal resistance of the thermal interface material, and Rhs is the heat sink thermal resistance. To reduce the CPU junction temperature, it is critical to minimize Rjc, RTIM, and Rhs, because any reduction in the total thermal resistance translates directly into a lower junction temperature.
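
To make this budget concrete, the sketch below works through both formulas for a downstream CPU, including the preheat term ∆Ta estimated from a standard air-stream energy balance, ∆Ta = Q_upstream / (m_dot × cp). All numbers are illustrative assumptions, not values from Wei [1].

```python
# Illustrative CPU thermal-resistance budget for a forced-air-cooled server.
# All numerical values are assumptions for the sake of the example.

t_j_max = 85.0      # allowed maximum junction temperature, deg C
t_ambient = 35.0    # cabinet inlet air temperature, deg C
q_max = 130.0       # maximum CPU power, W

# Preheat: temperature rise of the air before it reaches this CPU,
# delta_T = Q_upstream / (m_dot * c_p) for an air stream.
q_upstream = 200.0  # heat picked up by the air upstream of this CPU, W
m_dot = 0.05        # air mass flow rate over the CPU, kg/s
c_p = 1005.0        # specific heat of air, J/(kg*K)
delta_t_air = q_upstream / (m_dot * c_p)

# Required junction-to-air resistance for the downstream CPU.
r_ja_required = (t_j_max - t_ambient - delta_t_air) / q_max

# Stack-up of the actual resistance path: Rja = Rjc + RTIM + Rhs.
r_jc, r_tim, r_hs = 0.10, 0.05, 0.20  # deg C/W, assumed values
r_ja_actual = r_jc + r_tim + r_hs

print(f"Preheat air temperature rise: {delta_t_air:.1f} C")
print(f"Required Rja: {r_ja_required:.3f} C/W")
print(f"Actual   Rja: {r_ja_actual:.3f} C/W")
print("Cooling OK" if r_ja_actual <= r_ja_required else "Needs a better heat sink or more flow")
```

With these assumed numbers, the preheat alone costs about 4°C of margin, which is why the downstream CPU sets the cooling requirement.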

The CPU package and heat sink module of PRIMEPOWER 2500 are shown in Figure 5. The CPU package has an integrated heat spreader (IHS) attached to the CPU chip. A high-performance TIM is used to bond the CPU chip and IHS together, see Figure 6. The heat sink module is mounted on the IHS with another TIM in between.

Figure 5. PRIMEPOWER 2500 CPU Package and Heat Sink Module [1]
Figure 6. CPU Package [1]

The TIM used between the CPU chip and the IHS is crucial to the CPU’s operation. It has two key functions: conducting heat from the chip to the IHS and reducing the stress on the CPU chip caused by the mismatch in the coefficient of thermal expansion (CTE) between the chip and the IHS. Fujitsu developed a TIM made of an In-Ag composite solder for this application. The In-Ag composite has a low melting point and high thermal conductivity. It is also relatively soft, which helps it absorb the thermal stress between the chip and the IHS.

Wei [2] also investigated the impact of heat spreader thermal conductivity on spreading performance. He found that a diamond composite IHS (k = 600 W/(m·K)) produces a lower temperature gradient across the chip and cooler hot spots than either aluminum nitride (k = 200 W/(m·K)) or copper (k = 400 W/(m·K)). The simulation results are shown in Figure 7.

Figure 7. Heat Spreader Material Comparison [2]

In high-performance servers like the PRIMEPOWER 2500, the thermal performance gains from optimizing the TIM and the IHS are small, because they make up only a small portion of the total thermal resistance. Heat sinks dissipate the CPU heat to the air and therefore play a central role in the thermal management of the server. In a server application, a heat sink must meet not only the mechanical and thermal requirements but also weight and volume constraints. Hence, heat pipes, vapor chambers, and composite materials are widely used in the construction of high-performance heat sinks.

Koide et al. [1] compared the thermal performance and weight of different heat sinks for server applications. The results are shown in Figure 8. They used a Cu-base/Al-fin heat sink as the benchmark. Compared with the Cu-base/Al-fin heat sink, the Cu-base/Cu-fin heat sink is 50% heavier yet gains only 8% in performance.

If a heat pipe is embedded in the base, the heat sink weight can be reduced by 15% and the thermal performance increases by 10%. If a vapor chamber is embedded in the heat sink base instead, the weight drops by 20% and the performance increases by 20%.

Figure 8. Thermal Performance and Weight Comparison of Different Heat Sinks [1]
Figure 9. (a) USIII Heat Sink for Sun Fire 15K Server, (b) USIV Heat Sink for Sun Fire 25K [3]

Sun Microsystems’ high-performance Sun Fire 15K server uses the USIII heat sink to cool its 72 UltraSPARC III (USIII) processors. In the Sun Fire 25K server, the CPUs are upgraded to the UltraSPARC IV (USIV), which has a maximum power of 108 W. To cool the USIV processor, Xu and Follmer [3] designed a new USIV heat sink with a copper base and copper fins, see Figure 9. The old USIII heat sink has 17 forged aluminum fins; the USIV heat sink has 33 copper fins. Both heat sinks have the same base dimensions and height.

Figure 10. Thermal Resistance Comparison between USIII Heat Sink and USIV Heat Sink [3]

Figure 10 shows the thermal resistance comparison between the USIII heat sink and the USIV heat sink. The thermal resistance of the USIV heat sink is almost 0.1°C/W lower than that of the USIII heat sink at medium and high flow rates, a substantial gain: at the USIV’s maximum power of 108 W, a 0.1°C/W reduction corresponds to roughly an 11°C drop in junction temperature. The thermal performance improvement of the USIV heat sink is not without penalty, however.

Figure 11. Pressure Drop Comparison between USIII Heat Sink and USIV Heat Sink [3]

Figure 11 shows the pressure drop comparison between the USIII heat sink and the USIV heat sink. For the same air flow rate, the pressure drop of the USIV heat sink is higher than that of the USIII heat sink. This means the Sun Fire 25K server needs stronger fans and better flow arrangements to ensure the USIV heat sinks receive adequate cooling flow.

The design of the cooling solution for a high-performance server follows the same methodology used for other electronic devices, but at an elevated scale. The main focus is to identify the hottest components, which in most cases are the CPUs. Because of the extremely high power of the CPUs, it takes the combination of an optimized board layout, heat spreaders, TIMs, and high-performance heat sinks to achieve the desired cooling in the server. The goal of thermal management is to find cost-effective ways to keep the CPU junction temperature below its specification and ensure the continuous operation of the server. Wei [1] demonstrated that a 40 kW server can be cooled by forced air.

However, this requires a highly integrated design and the huge amount of air flow that the 54 fans inside the PRIMEPOWER 2500 generate. In the near future, it will be very difficult for forced air cooling to handle cabinets dissipating more than 60 kW. Doing so would require bigger fan trays to deliver enormous air flows and larger heat sinks to transfer the heat from the CPUs to the air, making it impractical to design a reliable, compact, and cost-effective cooling system for the server.

We have to find alternative ways to deal with this problem. Other cooling methods, such as impinging air jets, liquid cooling, and refrigeration systems, have the potential to dissipate more heat, but they will require innovative packaging to integrate them into the server system.

References:

  1. Wei, J., Thermal Management of Fujitsu’s High-Performance Servers, http://www.fujitsu.com/downloads/MAG/vol43-1/paper14.pdf.
  2. Koide, M.; Fukuzono, K.; Yoshimura, H.; Sato, T.; Abe, K.; Fujisaki, H., High-Performance Flip-Chip BGA Technology Based on Thin-Core and Coreless Package Substrates, Proceedings of the 56th ECTC, San Diego, CA, USA, 2006, pp. 1869-1873.
  3. Xu, G.; Follmer, L., Thermal Solution Development for High-end System, Proceedings of the 21st IEEE SEMI-THERM Symposium, San Jose, CA, USA, 2005, pp. 109-115.

For more information about Advanced Thermal Solutions, Inc. (ATS) thermal management consulting and design services, visit https://www.qats.com/consulting or contact ATS at 781.769.2800 or ats-hq@qats.com.

Immersion Liquid Cooling of Servers in Data Centers

A data center is a large infrastructure used to house large quantities of electronic equipment, such as computer servers, telecommunications equipment, and data storage systems. A data center requires uninterrupted power, communication, and internet access for all the equipment inside; it also has a dedicated environmental control system that provides appropriate working conditions for the electronic devices it hosts.

Immersion Cooling

Traditional data centers use cold air generated by computer room air conditioner (CRAC) units to cool the servers installed in the racks. Cooling electronic devices with cold air from an air conditioner is an easy method to implement. However, it is not very efficient in terms of power consumption.

The inefficiency can be attributed to several causes. Generating and delivering cold air from a chiller to the servers is a multi-step heat transfer process, and losses along the way, such as the mixing of warm and cool air in the room, reduce the efficiency and increase the power consumption of cooling hardware such as chillers, CRAC units, fans, blowers, and pumps.

Data center designers and operators have devised many ways to improve a data center’s thermal efficiency, such as optimizing the rack layout and air conditioner locations, separating cold aisles from hot aisles, optimizing the configuration of pipes and cables in the under-floor plenum, and introducing liquid cooling for high-power servers.

While the above methods can improve data center heat load management, they cannot dramatically reduce the Power Usage Effectiveness (PUE), a measure of how efficiently a data center uses its power, defined as the ratio of total data center power consumption to IT equipment power consumption.

An ideal PUE is 1.0. A better way, proposed and used by some new data centers, is to bring outside cold air directly to the servers. This method can eliminate the CRAC units entirely. To achieve this, the data center has to be located where cold air is available year-round, and the servers must tolerate a higher operating environmental temperature.
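
As a quick illustration of the metric, the sketch below computes PUE from a facility power breakdown; all power figures are assumptions for the example, not measurements from any particular data center.

```python
# Power Usage Effectiveness (PUE) = total facility power / IT equipment power.
# All figures below are assumed for illustration.

it_power_kw = 800.0       # servers, storage, and network gear
cooling_power_kw = 320.0  # chillers, CRAC units, fans, blowers, pumps
other_power_kw = 80.0     # lighting, power-distribution losses, etc.

total_power_kw = it_power_kw + cooling_power_kw + other_power_kw
pue = total_power_kw / it_power_kw
print(f"PUE = {pue:.2f}")  # 1.50 here; an ideal facility approaches 1.0
```

Every kilowatt trimmed from the cooling term moves the ratio toward the ideal value of 1.0, which is why the partial PUE of 1.02 to 1.03 reported for immersion cooling later in this article is so striking.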

Another dramatic solution, proposed and used by some companies, is liquid immersion cooling of entire servers. Compared with traditional liquid cooling techniques, immersion cooling uses a dielectric fluid as the working agent and an open-bath design. This eliminates the need for hermetic connectors, pressure vessels, seals, and clamshells. There are several different liquid immersion cooling methods.

This article will review the active single-phase immersion cooling technology proposed by Green Revolution Cooling (GRC) [1] and a passive two-phase immersion cooling technology proposed by the 3M Company [2].

Green Revolution Cooling has designed a liquid-filled rack that accommodates traditional servers and has developed a dielectric mineral oil as the coolant. Figure 1 shows the liquid cooling racks with their chiller and an inside view of a CarnotJet cooling rack from GRC. The racks are filled with 250 gallons of a dielectric fluid called GreenDEF™, a non-toxic, clear mineral oil with low viscosity.

Figure 1. Server racks and chiller (left) and inside view of the server rack. [1]

The servers are installed vertically into slots inside the rack and are fully submerged in the liquid coolant. Pumps circulate the cold coolant from the chiller to the rack, and the coolant returns to the chiller after removing heat from the servers. Because of its high heat capacity and thermal conductivity, GreenDEF™ can cool the servers more efficiently than air.

The server racks are semi-open to the environment, and the coolant level is constantly monitored by the system. Figure 2 shows a server motherboard being submerged in the liquid coolant inside a GRC server rack.

Figure 2. A Server Motherboard Being Immersed in Liquid Coolant in A Server Rack. [1]

Intel conducted a year-long test with immersion cooling equipment from Green Revolution Cooling in New Mexico [3] and found the technology to be highly efficient and safe for servers. Intel tested two racks of identical servers: one using traditional air cooling and the other immersed in a Green Revolution enclosure. Over the course of a year, the submerged servers had a partial Power Usage Effectiveness (PUE) of 1.02 to 1.03, equaling some of the lowest efficiency ratings reported using that metric.

The 3M Company is also actively engaged in immersion cooling technology and has developed a passive two-phase immersion cooling system for servers. Figure 3 illustrates the concept. Servers are inserted vertically into a specially designed rack and immersed in 3M’s Novec engineered fluid, a non-conductive chemical with a low boiling point.

The elevated temperature of the electronic components on the server boards causes the Novec engineered fluid to boil. The evaporation of the fluid removes a large amount of heat from the heated components with a small temperature difference. The vapor travels to the upper portion of the server rack, where it condenses back to liquid on the surface of a heat exchanger cooled by cold water. The condensate then flows back into the bath, driven by gravity. In 3M’s server rack, the liquid bath is also semi-open to the outside environment.

Because the cooling method is passive, no pump is needed in the system.
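
To get a feel for the quantities involved, the following rough energy balance estimates the vapor generation rate needed to carry a server's heat load; the latent heat value is an assumption on the order of engineered dielectric fluids, not a published Novec specification.

```python
# Rough sizing of a passive two-phase immersion system: how much vapor must
# the fluid generate to carry a given heat load? Q = m_dot * h_fg.

server_heat_w = 1000.0  # heat load of one server, W (assumed)
h_fg = 100_000.0        # latent heat of vaporization, J/kg (assumed)

m_dot = server_heat_w / h_fg
print(f"Vapor generation rate: {m_dot * 1000:.0f} g/s per 1 kW server")

# The same mass flow condenses on the water-cooled heat exchanger and
# returns to the bath by gravity, closing the loop without a pump.
```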

Figure 3. Passive Two-phase Immersion Cooling System from 3M. [2]

By utilizing the large latent heat of the Novec engineered fluid during evaporation and condensation, the coolant can remove heat from the servers and dissipate it to the water-cooled heat exchanger with a small temperature gradient. To enhance boiling on the component surfaces, 3M developed a special coating for the electronic chips inside the liquid bath. The boiling enhancement coating (BEC) is a porous metallic layer roughly 100 µm thick.

The application of the BEC is illustrated in Figure 4. The coating is applied directly to the integrated heat spreader (IHS) of the chip. Tuma [2] reported that the coating can produce boiling heat transfer coefficients in excess of 100,000 W/m²-K at heat fluxes exceeding 300,000 W/m².
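
Those two figures imply a remarkably small surface superheat. The sketch below applies Newton's law of cooling to Tuma's reported numbers; the chip area used to translate flux into power is an assumption.

```python
# Surface superheat needed to carry a given heat flux by boiling:
# q'' = h * delta_T  =>  delta_T = q'' / h  (Newton's law of cooling).

h = 100_000.0       # boiling heat transfer coefficient, W/(m^2*K), from Tuma [2]
q_flux = 300_000.0  # heat flux, W/m^2, from Tuma [2]

delta_t = q_flux / h
print(f"Surface superheat: {delta_t:.0f} K")  # ~3 K above the fluid's boiling point

# For a 4 cm^2 chip (assumed area), that flux corresponds to:
area_m2 = 4e-4
print(f"Chip power at that flux: {q_flux * area_m2:.0f} W")  # 120 W
```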

Figure 4. Application of boiling enhanced coating (BEC). [2]

In his paper, Tuma [2] discussed the economic and environmental merits of passive two-phase immersion cooling for data center equipment. He concluded that liquid immersion cooling can dramatically decrease the power consumed for cooling relative to traditional air-cooling methods. It can also simplify facility construction by reducing floor space requirements and eliminating the need for air-cooling infrastructure such as plenums, air economizers, and elevated ceilings.

Green Revolution Cooling and 3M have demonstrated the feasibility and applicability of immersion cooling technology for servers in data centers. Its main advantages are reducing overall cooling energy and keeping component temperatures low and uniform. However, both immersion cooling technologies require specially designed server racks. Both also need specially formulated coolants, which are not cheap. For a traditional air-cooled data center, by contrast, the air is free, abundant, and easy to deliver.

In both immersion cooling technologies, the servers have to be installed vertically inside the server rack, which reduces the data center’s floor-space efficiency. And because the liquid baths used in immersion cooling are open to the environment, coolant is gradually and inevitably lost to the ambient during long-term service.

The environmental impact of the discharge of large amounts of coolant by data centers also has to be evaluated. In addition, the effect of the coolant on the connectors and materials used on the PCBs is not yet well understood.

Immersion liquid cooling is a very promising technology for cooling high-power servers, but there are still obstacles to overcome before its large-scale application is assured.

References

  1. http://www.grcooling.com
  2. Tuma, P. E., “The Merits of Open Bath Immersion Cooling of Datacom Equipment,” Proceedings of the 26th IEEE SEMI-THERM Symposium, Santa Clara, CA, USA, 2010.
  3. http://www.datacenterknowledge.com

For more information about Advanced Thermal Solutions, Inc. (ATS) thermal management consulting and design services, visit https://www.qats.com/consulting or contact ATS at 781.769.2800 or ats-hq@qats.com.

Thermal Management of High-Powered Servers

While power demands have increased, engineers are tasked with placing more components into smaller spaces. This has made optimizing the thermal management of servers and other high-powered devices increasingly important for ensuring proper performance and achieving the expected lifespan.

This article brings together content that ATS has posted over the years about cooling high-powered servers, from the device level to the environment level. (Wikimedia Commons)

With server cooling taking on increased priority, there are several ways of approaching the problem of thermal management, including device-level solutions, system-level solutions, and even environment-level solutions.

Over the years, Advanced Thermal Solutions, Inc. (ATS) has posted many articles related to this topic. Click the links below to read more about how the industry is managing the heat for servers:

  • Industry Developments: Cabinet Cooling Solutions – Although their applications vary, a common issue within these enclosures is excess heat, and the danger it poses to their electronics. This heat can be generated by internal sources and intensified by heat from outside environments.
  • Effective cooling of high-powered CPUs on dense server boards – Optimizing PCB for thermal management has been shown to ensure reliability, speed time to market and reduce overall costs. With proper design, all semiconductor devices on a PCB will be maintained at or below their maximum rated temperature. 

For more information about Advanced Thermal Solutions, Inc. (ATS) thermal management consulting and design services, visit https://www.qats.com/consulting or contact ATS at 781.769.2800 or ats-hq@qats.com.

Webinar on Limits of Air Cooling in March

Advanced Thermal Solutions, Inc. (ATS) is hosting a series of monthly, online webinars covering different aspects of the thermal management of electronics. This month’s webinar will be held on Thursday, March 28 from 2-3 p.m. ET and will cover the limits of air cooling and the role of liquid cooling. Learn more and register at https://qats.com/Training/Webinars.

Webinar: EV Battery Thermal Management

Maintaining the proper operating temperature for electric vehicle (EV) batteries is critical to the spread of EVs across the world. If batteries run too hot, they degrade faster and safety becomes a concern. At lower temperatures, battery capacity and performance suffer.

The webinar below will cover techniques for maintaining proper battery temperatures in electric vehicles. (Wikimedia Commons)

Thermal management is essential if EVs are to live up to the potential that manufacturers promise and consumers desire. But how can battery temperature be maintained at the proper operating levels during use, and how can manufacturers cope with the varied environments in which the vehicles will operate?

As an earlier post on EV battery thermal management explained, the main concerns for engineers are the following (encoded in the short sketch after this list):

  1. At temperatures below 0°C (32°F), batteries lose charge due to slower chemical reactions taking place in the battery cells. The result is a significant loss in power, acceleration and driving range, and higher potential for battery damage during charging.
  2. At temperatures above 30°C (86°F) the battery performance degrades, posing a real issue if a vehicle’s air conditioner is needed for passengers. The result is an impact on power density and reduced acceleration response.
  3. Temperatures above 40°C (104°F) can lead to serious and irreversible damage in the battery. At even higher temperatures, e.g. 70-100°C, thermal runaway can occur: once the runaway temperature is reached, a self-heating chain reaction destroys the battery cell and propagates to adjacent cells.
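
The sketch below encodes the temperature thresholds from the list above into a simple classifier; the regime names and the function itself are illustrative, not taken from the webinar.

```python
# Classify an EV battery cell temperature into the operating regimes
# described in the post. Thresholds follow the list above.

def battery_regime(temp_c: float) -> str:
    if temp_c < 0.0:
        return "cold: reduced power, acceleration and range; charging risks damage"
    if temp_c <= 30.0:
        return "normal operating range"
    if temp_c <= 40.0:
        return "warm: performance and power density degrade"
    if temp_c < 70.0:
        return "hot: risk of serious, irreversible damage"
    return "critical: thermal runaway possible"

for t in (-10, 25, 35, 50, 85):
    print(f"{t:>4} degC -> {battery_regime(t)}")
```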

This hour-long webinar from thermal management expert Dr. Kaveh Azar, founder and CEO of Advanced Thermal Solutions, Inc. (ATS), presents some of the techniques that design engineers have employed to keep EV batteries within the proper temperature range both during operation and charging.


For more information about Advanced Thermal Solutions, Inc. (ATS) thermal management consulting and design services, visit https://www.qats.com/consulting or contact ATS at 781.769.2800 or ats-hq@qats.com.