Meta, hell-bent on catching up to its rivals in the generative AI space, is spending billions on its own AI efforts. A portion of those billions is going toward recruiting AI researchers. But an even bigger chunk is being spent on hardware development, specifically chips to run and train Meta’s AI models.
Meta unveiled the latest fruit of its chip development efforts today, notably a day after Intel announced its latest AI accelerator hardware. Called the “next-generation” Meta Training and Inference Accelerator (MTIA), the successor to last year’s MTIA v1, the chip runs models including those for ranking and recommending display ads on Meta’s properties (e.g., Facebook).
Compared to MTIA v1, which was built on a 7nm process, the next-generation MTIA is built on a 5nm process. (In chip manufacturing, “process” refers to the size of the smallest component that can be built on the chip.) The next-generation MTIA is a physically larger design, equipped with more processing cores than its predecessor. And while it consumes more power (90W vs. 25W), it also has more internal memory (128MB vs. 64MB) and runs at a higher average clock speed (1.35GHz vs. 800MHz).
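To put those figures side by side, here is a minimal Python sketch that computes the generation-over-generation ratios. It uses only the numbers quoted above; the dictionary keys and the `ratios` helper are illustrative names of our own, not anything from Meta, and the power figures are taken at face value rather than as measured consumption.

```python
# Figures quoted in the article, as (MTIA v1, next-generation MTIA).
# The key names are hypothetical labels chosen for this sketch.
specs = {
    "power_w": (25, 90),
    "internal_memory_mb": (64, 128),
    "clock_mhz": (800, 1350),
}


def ratios(spec_table):
    """Return new/old for each quoted spec."""
    return {name: new / old for name, (old, new) in spec_table.items()}


spec_ratios = ratios(specs)
for name, ratio in spec_ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

By these numbers, power rises faster than either memory or clock speed, which is one reason the quoted specs alone say little about efficiency.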
Meta says the next-generation MTIA is currently live in 16 of its data center regions and delivers up to 3x better overall performance compared to MTIA v1. If that “3x” claim sounds a little vague, you’re not wrong; we thought so too. But Meta would only say that the figure came from testing the performance of “four key models” on both chips.
“Because we control the entire stack, we can achieve greater efficiency compared to commercially available GPUs,” Meta writes in a blog post shared with TechCrunch.
Meta’s hardware showcase, which comes just 24 hours after a press conference about the company’s various ongoing generative AI initiatives, is unusual for several reasons.
One, Meta reveals in the blog post that it is not using the next-generation MTIA for generative AI training workloads at this time, although the company says it has “several programs underway” exploring this. Two, Meta admits that the next-generation MTIA will not replace GPUs for running or training models, but will rather complement them.
Reading between the lines, Meta is moving slowly, perhaps more slowly than it would like.
Meta’s AI teams are almost certainly under pressure to reduce costs. The company is set to spend an estimated $18 billion by the end of 2024 on GPUs to train and run generative AI models, and with training costs for cutting-edge generative models running into the tens of millions of dollars, in-house hardware presents an attractive alternative.
And while Meta’s hardware efforts drag, rivals are pulling ahead, much to the dismay of Meta’s leadership, I suspect.
This week, Google made its fifth-generation custom chip for training AI models, TPU v5p, generally available to Google Cloud customers, and revealed its first dedicated chip for running models, Axion. Amazon has several families of custom AI chips under its belt. And last year, Microsoft jumped into the fray with the Azure Maia AI accelerator and Azure Cobalt 100 CPU.
In the blog post, Meta says it took less than nine months to “go from first silicon to production models” of the next-generation MTIA, which to be fair is shorter than the typical window between Google’s TPUs. But Meta has a lot of catching up to do if it hopes to achieve some independence from third-party GPUs and keep up with its stiff competition.