Meta has today (13th March) announced the launch of two new graphics processing unit (GPU) clusters for AI development, with a focus on natural language processing, speech recognition, and image generation.
The clusters build on Meta’s AI Research SuperCluster from 2022. Each features 24,576 Nvidia H100 Tensor Core GPUs, up from the 16,000 Nvidia A100 GPUs in the earlier system.
The increased GPU capacity allows Meta to train larger and more complex models, advancing generative AI product development.
By the end of 2024, Meta plans to expand its infrastructure to include 350,000 Nvidia H100s, aiming for total compute power equivalent to nearly 600,000 H100s.
Both clusters have the same GPU count but differ in network fabric: one uses remote direct memory access (RDMA) over Converged Ethernet (RoCE), the other Nvidia Quantum-2 InfiniBand.
The clusters are built on Meta’s Grand Teton GPU hardware platform, offering enhanced bandwidth compared to its predecessor.
Meta said it uses Open Rack v3 hardware for flexibility in data centres, enabling the placement of power shelves anywhere in the rack and supporting customised rack configurations.
Meta said it is currently upgrading PyTorch, its foundational AI framework, to support training on hundreds of thousands of GPUs.
In a blog post, Meta emphasised its commitment to open innovation in AI software and hardware, pointing to its launch of the AI Alliance: “As we look to the future, we recognize that what worked yesterday or today may not be sufficient for tomorrow’s needs.
“That’s why we are constantly evaluating and improving every aspect of our infrastructure, from the physical and virtual layers to the software layer and beyond. Our goal is to create systems that are flexible and reliable to support the fast-evolving new models and research.”
