
Apple and Nvidia Collaboration Nearly Triples AI Token Generation Speed

Training machine learning models is a compute-intensive task

Apple's latest machine learning research could speed up model creation for Apple Intelligence, with a method that nearly triples the speed of token generation on Nvidia GPUs.

One of the challenges in building large language models (LLMs) for tools and apps that offer AI-based functionality, such as Apple Intelligence, is the inefficiency of producing the LLMs in the first place. Training machine learning models is a resource-intensive and slow process, often offset by buying more hardware, which in turn drives up energy costs.

Earlier in 2024, Apple published and open-sourced Recurrent Drafter, known as ReDrafter, a speculative decoding method for faster token generation. It uses a recurrent neural network (RNN) draft model and combines beam search with dynamic tree attention to predict and validate draft tokens across multiple candidate paths.

This sped up LLM token generation by up to 3.5x per generation step compared to typical autoregressive token generation methods.
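Speculative decoding, in broad strokes, pairs a small, cheap "draft" model with the large target model: the draft proposes several tokens ahead, and the target model verifies them, keeping the longest prefix it agrees with. The sketch below is a simplified, single-path illustration of that draft-then-verify loop with toy stand-in models; it is not Apple's ReDrafter itself, which drafts with an RNN head and verifies multiple paths via beam search and dynamic tree attention.

# Simplified draft-then-verify loop, illustrating why accepted draft tokens
# reduce the number of target-model generation steps. Both "models" below
# are hypothetical toy stand-ins.

def draft_next_tokens(prefix, k):
    """Hypothetical cheap draft model: proposes k candidate next tokens."""
    return ["tok%d" % (len(prefix) + i) for i in range(k)]

def target_next_token(prefix):
    """Hypothetical expensive target model: returns its next token."""
    return "tok%d" % len(prefix)  # toy model that happens to agree with the draft

def speculative_generate(prompt, steps=4, k=3):
    tokens = list(prompt)
    for _ in range(steps):
        draft = draft_next_tokens(tokens, k)
        accepted = []
        # In a real system this verification is one batched forward pass of
        # the target model over all draft tokens; the loop is for clarity.
        for candidate in draft:
            if target_next_token(tokens + accepted) == candidate:
                accepted.append(candidate)
            else:
                break
        # If the draft diverged, fall back to the target model's own token,
        # so the output matches plain autoregressive decoding.
        if len(accepted) < len(draft):
            accepted.append(target_next_token(tokens + accepted))
        tokens.extend(accepted)
    return tokens

print(speculative_generate(["<s>"]))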

In a post on Apple's machine learning research site, the company explained that the work didn't stop with its existing support for Apple Silicon. A new report published Wednesday details how the team made the ReDrafter research production-ready for use with Nvidia GPUs.

Nvidia GPUs are widely used in servers that run LLM generation, but high-end hardware comes at a cost. It's not uncommon for a multi-GPU server to cost over $250,000 in hardware alone, before accounting for infrastructure and other associated costs.

Apple worked with Nvidia to integrate ReDrafter into Nvidia's TensorRT-LLM inference acceleration framework. Because ReDrafter relies on operators that other speculative decoding methods don't use, Nvidia had to add support for them to make it work.

With the integration in place, machine learning developers using Nvidia GPUs can now take advantage of ReDrafter's accelerated token generation when using TensorRT-LLM in production, not just those using Apple Silicon.

The result, after benchmarking a production model with tens of billions of parameters on Nvidia GPUs, was a 2.7x increase in tokens generated per second for greedy decoding.
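Greedy decoding simply means the model emits the single highest-scoring token at each step rather than sampling from the distribution. A minimal sketch, using a hypothetical stand-in for the model's next-token scores:

# Minimal greedy decoding loop: take the argmax of the next-token scores
# at every step. The scoring function is a hypothetical toy stand-in.

def next_token_scores(tokens):
    """Hypothetical model call: returns {token: score} for the next position."""
    vocab = ["the", "cat", "sat", "<eos>"]
    return {tok: -abs(len(tokens) - i) for i, tok in enumerate(vocab)}

def greedy_decode(prompt, max_new_tokens=8):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # greedy: highest-scoring token
        tokens.append(best)
        if best == "<eos>":
            break
    return tokens

print(greedy_decode(["<s>"]))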

As a result, the process could be used to minimize latency for users and reduce the amount of hardware needed. In short, users could expect faster results from cloud queries, and companies could deliver more while spending less.

In an Nvidia tech blog on the topic, the graphics card maker said the collaboration made TensorRT-LLM “more powerful and flexible, allowing the LLM community to build more complex models and deploy them easily.”

The release of the report follows Apple publicly confirming that it is exploring the potential use of Amazon’s Trainium2 chip to train models for use in Apple Intelligence features. At the time, it was expected that pre-training using the chips would improve efficiency by 50% compared to existing hardware.
