
Apple Partners with NVIDIA to Explore Higher LLM Performance

In a blog post today, Apple engineers shared new details about their collaboration with NVIDIA to deliver faster text generation performance with large language models.

Earlier this year, Apple published and open-sourced its Recurrent Drafter (ReDrafter) technology, a new method for generating text with LLMs that is significantly faster and “achieves state-of-the-art performance.” It combines two techniques: beam search (to explore multiple candidate continuations) and dynamic tree attention (to handle those candidates efficiently).
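To make the idea concrete, here is a minimal sketch of the draft-then-verify loop at the heart of speculative decoding. It is illustrative only, not Apple's implementation: `target_next` and `draft_next` are hypothetical stand-ins for a large model and a cheap draft head, and a real system (like ReDrafter) verifies all drafted tokens in one batched forward pass rather than one at a time.

```python
import random

random.seed(0)
VOCAB = list(range(100))

def target_next(context):
    """Stand-in for the large target model: deterministic greedy next token."""
    return hash(context) % len(VOCAB)

def draft_next(context):
    """Stand-in for the cheap draft head: agrees with the target ~80% of the time."""
    return target_next(context) if random.random() < 0.8 else random.choice(VOCAB)

def speculative_generate(prompt, n_tokens, k=4):
    """Draft k tokens cheaply, then verify them against the target model.

    Accepted tokens are kept; the first mismatch is replaced by the target's
    own token, so the output is identical to pure target-model greedy decoding.
    (In production the verification of all k drafts happens in a single
    parallel pass over the target model, which is where the speedup comes from.)
    """
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft k candidate tokens with the cheap model.
        drafts, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(tuple(ctx))
            drafts.append(t)
            ctx.append(t)
        # 2. Verify: compare each drafted token with the target's choice.
        for t in drafts:
            expected = target_next(tuple(out))
            if t == expected:
                out.append(t)          # draft accepted "for free"
            else:
                out.append(expected)   # correction; restart drafting
                break
            if len(out) - len(prompt) >= n_tokens:
                break
    return out[len(prompt):len(prompt) + n_tokens]
```

Because every emitted token is checked against the target model, the result matches ordinary greedy decoding exactly; only the number of expensive target-model passes changes.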

Building on those strong research results, Apple partnered with NVIDIA to bring ReDrafter to production. As part of the collaboration, ReDrafter has been integrated into NVIDIA TensorRT-LLM, a framework that accelerates LLM inference on NVIDIA GPUs.

Here is how Apple describes the integration and the results:

To enable ReDrafter integration, NVIDIA has added new operators or exposed existing ones, significantly improving TensorRT-LLM’s ability to accommodate complex models and decoding methods. Machine learning developers using NVIDIA GPUs can now easily take advantage of ReDrafter’s accelerated token generation for their production LLM applications with TensorRT-LLM.

Apple explains: “In benchmarking a production model with tens of billions of parameters on NVIDIA GPUs using the NVIDIA TensorRT-LLM inference acceleration framework with ReDrafter, we saw a 2.7x speedup in tokens generated per second for greedy decoding.” According to Apple, these results show that the technique can significantly reduce user-perceived latency while also using fewer GPUs and consuming less power.
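To put the 2.7x figure in perspective, here is a back-of-envelope calculation. The baseline throughput and response length below are hypothetical numbers chosen for illustration; Apple reports only the speedup ratio.

```python
baseline_tps = 30.0   # hypothetical baseline throughput, tokens/second
speedup = 2.7         # ratio reported by Apple
redrafter_tps = baseline_tps * speedup  # 81.0 tokens/second

response_tokens = 500  # hypothetical response length
latency_before = response_tokens / baseline_tps   # ~16.7 s
latency_after = response_tokens / redrafter_tps   # ~6.2 s
print(f"{latency_before:.1f}s -> {latency_after:.1f}s")  # prints "16.7s -> 6.2s"
```

The same ratio also works in the other direction: a fixed request load can be served with roughly 2.7x fewer GPU-seconds of decoding time.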

“LLMs are increasingly being used to power production applications, and improving inference efficiency can both impact compute costs and reduce latency for users,” conclude Apple’s machine learning researchers. “With ReDrafter’s new speculative decoding approach integrated into the NVIDIA TensorRT-LLM framework, developers can now take advantage of faster token generation on NVIDIA GPUs for their production LLM applications.”

More details about this work can be found on Apple's website and in a blog post on NVIDIA's website:

  • Apple: Accelerating LLM inference on NVIDIA GPUs with ReDrafter
  • NVIDIA: NVIDIA TensorRT-LLM now supports recurrent drafting to optimize LLM inference
