New LLM optimization technique slashes memory costs up to 75%


Researchers at the Tokyo-based startup Sakana AI have developed a new technique that enables language models to use memory more efficiently, helping enterprises cut the costs of building applications on top of large language models (LLMs) and other Transformer-based models.

The technique, named "Universal Transformer Memory," uses special neural networks to optimize LLMs so they keep the bits of information that matter and discard redundant details from their context.

Optimizing Transformer memory

The responses of Transformer models, the backbone of LLMs, depend on the content of their "context window" -- that is, the input they receive from users.

The context window can be considered the model's working memory. Tweaking its content can have a tremendous impact on the model's performance, which has given rise to an entire field of "prompt engineering."

Current models support very long context windows of hundreds of thousands, or even millions, of tokens (the numerical representations an LLM uses for the words, word parts, phrases, concepts and numbers users enter in their prompts).

This enables users to cram more information into their prompts. However, longer prompts can result in higher compute costs and slower performance. Optimizing prompts to remove unnecessary tokens while keeping important information can reduce costs and increase speed.
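The cost relationship is easy to see by counting tokens directly. The short sketch below uses the open-source tiktoken tokenizer; the encoding name and per-token price are illustrative assumptions, not figures from the article.

```python
# Minimal sketch: longer prompts mean more tokens, and more tokens mean
# higher per-request cost. Encoding and price below are assumptions.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the attached 50-page contract and list every termination clause."
num_tokens = len(encoding.encode(prompt))

PRICE_PER_1K_TOKENS = 0.01  # hypothetical input price in dollars
print(f"{num_tokens} tokens -> ${num_tokens / 1000 * PRICE_PER_1K_TOKENS:.5f} per request")
```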

Current prompt optimization techniques are resource-intensive or require users to manually test different configurations to reduce the size of their prompts.

Neural Attention Memory Models

Universal Transformer Memory optimizes prompts using Neural Attention Memory Models (NAMMs), simple neural networks that decide whether to "remember" or "forget" each given token stored in the LLM's memory.

"This new capability allows transformers to discard unhelpful or redundant details, and focus on the most critical information, something we find to be crucial for tasks requiring long-context reasoning," the researchers write.

Universal Transformer Memory (source: Sakana AI)

NAMMs are trained separately from the LLM and are combined with the pre-trained model at inference time, which makes them flexible and easy to deploy. However, they need access to the model's inner activations, which means they can only be applied to open-source models.
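The open-weights requirement is easy to illustrate: with a locally hosted model you can read the attention matrices directly, which is exactly the kind of inner activation a memory filter would consume, whereas an API-hosted closed model exposes no such access. The model choice below (gpt2) is just an example.

```python
# Hedged illustration: reading attention activations from an open-weight
# model with Hugging Face transformers. Closed, API-only models do not
# expose these tensors.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Universal Transformer Memory trims the model's memory.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions holds one tensor per layer, shaped
# (batch, num_heads, seq_len, seq_len) -- the raw material a token filter
# could use to decide which cached tokens to keep.
print(len(outputs.attentions), outputs.attentions[0].shape)
```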

Like other techniques developed by Sakana AI, NAMMs are trained through evolutionary algorithms instead of gradient-based optimization methods. By iterative ...
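Evolutionary training of this kind works by repeatedly perturbing a parameter vector, scoring the candidates, and moving toward the better ones rather than backpropagating gradients. The loop below is a generic toy evolution strategy; the objective function, population size and update rule are assumptions for illustration, not the specific algorithm described by the researchers.

```python
# Toy evolution-strategies loop: gradient-free optimization of a small
# parameter vector (e.g., a memory filter's weights). Illustrative only.
import numpy as np

def fitness(params: np.ndarray) -> float:
    # Placeholder objective: in practice this would be the LLM's downstream
    # task performance when `params` defines the memory filter.
    return -float(np.sum((params - 0.5) ** 2))

rng = np.random.default_rng(0)
mean = np.zeros(16)      # current parameter estimate
sigma = 0.1              # mutation scale
population_size = 32

for generation in range(200):
    # Sample a population of perturbed parameter vectors.
    noise = rng.standard_normal((population_size, mean.size))
    candidates = mean + sigma * noise
    scores = np.array([fitness(c) for c in candidates])
    # Move the mean toward better-scoring candidates (simple ES update).
    weights = (scores - scores.mean()) / (scores.std() + 1e-8)
    mean = mean + sigma * (weights @ noise) / population_size

print("final fitness:", fitness(mean))
```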
