
Apple Teams Up With NVIDIA to Speed Up AI Language Models

Tim Hardwick

Apple today shared details about a collaboration with NVIDIA on new text generation technology that delivers significant speedups for large language models (LLMs) in AI applications.


Earlier this year, Apple published and open-sourced Recurrent Drafter (ReDrafter), an approach that combines beam search and dynamic tree attention techniques to speed up text generation. Beam search explores multiple potential text sequences simultaneously to produce the best results, while tree attention organizes and removes redundant overlaps between those sequences to improve efficiency.
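The beam search idea described above can be illustrated with a small sketch. This is not Apple's ReDrafter implementation, just a toy example of keeping the highest-scoring candidate sequences at each step; the `step_logprobs` scoring function is a hypothetical stand-in for a real model.

```python
# Illustrative beam search over a toy next-token scorer --
# not Apple's ReDrafter implementation.
import math

def beam_search(step_logprobs, beam_width=2, steps=3):
    """Keep the `beam_width` highest-scoring sequences at each step.

    `step_logprobs(seq)` returns {token: log-probability} for the
    next token given the sequence so far (hypothetical interface).
    """
    beams = [((), 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        # Prune: keep only the best `beam_width` candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Toy scorer: always slightly prefers "a" over "b".
toy = lambda seq: {"a": math.log(0.6), "b": math.log(0.4)}
best_seq, best_score = beam_search(toy)[0]
```

In a real decoder the pruning step is where tree attention helps: overlapping prefixes among the surviving beams are computed once rather than per sequence.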

Apple has now integrated the technology into NVIDIA's TensorRT-LLM framework, which optimizes LLMs running on NVIDIA GPUs, where Apple says it has achieved “state-of-the-art performance.” In testing with a production model containing tens of billions of parameters, the integration delivered a 2.7x speed-up in generated tokens per second.

Apple claims that the improved performance not only reduces user-perceived latency, but also results in lower GPU utilization and power consumption. From Apple's Machine Learning Research blog:

“LLMs are increasingly being used to power production applications, and improving inference efficiency can both impact compute costs and reduce latency for users. With ReDrafter's new speculative decoding approach integrated into NVIDIA's TensorRT-LLM framework, developers can now take advantage of faster token generation on NVIDIA GPUs for their production LLM applications.”
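The speculative decoding approach mentioned in the quote works by having a cheap draft model propose several tokens ahead, which the large target model then verifies, keeping the longest agreeing prefix. A minimal greedy sketch, with hypothetical stand-in functions in place of real models:

```python
# Minimal sketch of greedy speculative decoding -- stand-in
# functions, not a real draft/target model pair.

def draft_propose(prefix, n):
    """Hypothetical fast draft model: propose n next tokens."""
    return [f"t{len(prefix) + i}" for i in range(n)]

def target_next(prefix):
    """Hypothetical target model: its chosen next token."""
    return f"t{len(prefix)}" if len(prefix) < 4 else "end"

def speculative_step(prefix, n_draft=3):
    """Accept draft tokens as long as the target model agrees."""
    drafts = draft_propose(prefix, n_draft)
    accepted = []
    for tok in drafts:
        if target_next(prefix + accepted) == tok:
            accepted.append(tok)  # draft token verified
        else:
            break
    # On a mismatch, emit the target model's own token instead.
    if len(accepted) < n_draft:
        accepted.append(target_next(prefix + accepted))
    return accepted

tokens = speculative_step([])
```

The speed-up comes from verifying all drafted tokens in a single forward pass of the large model, so several tokens can be emitted for roughly the cost of one.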

Developers interested in implementing ReDrafter can find more information on both Apple's website and the NVIDIA Developer Blog.
