Jakarta, INTI - Nvidia has developed a new technique that it says can cut the memory requirements of artificial intelligence (AI) computing by a factor of up to eight without lowering model accuracy.
The technology, called dynamic memory sparsification (DMS), is designed to optimize memory usage in large language models (LLMs) during reasoning processes.
With this technique, the load on GPUs can be significantly reduced, allowing AI systems to run more efficiently.
Addressing Memory Bottlenecks
In large language models like those used in modern chatbots, the reasoning process generates a key-value (KV) cache, a temporary memory structure that keeps growing as the model produces tokens one by one while “thinking.”
The longer the reasoning process, the more GPU memory is required. This has become one of the main bottlenecks in developing large-scale AI systems, as it increases computing costs and limits the number of users that can be served simultaneously.
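To make the growth concrete, the size of a KV cache can be estimated with simple arithmetic: every generated token stores one key and one value vector per layer. The sketch below is illustrative only. The layer count, number of KV heads, head dimension, and 16-bit precision are assumed values chosen for the example, not figures from Nvidia's report.

```python
def kv_cache_bytes(seq_len, num_layers=36, num_kv_heads=8, head_dim=128,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size in bytes.

    All default dimensions are assumptions for illustration.
    The leading factor of 2 accounts for storing both keys and values.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


GIB = 1024 ** 3
full = kv_cache_bytes(32_768)   # one sequence of 32,768 tokens
compressed = full / 8           # upper end of the claimed reduction

print(f"per token:      {kv_cache_bytes(1) / 1024:.0f} KiB")
print(f"full cache:     {full / GIB:.2f} GiB")
print(f"8x compressed:  {compressed / GIB:.2f} GiB")
```

Under these assumed dimensions, each token costs 144 KiB and a single 32,768-token sequence occupies 4.5 GiB. An eightfold reduction, the upper end of Nvidia's claim, would bring that to roughly 0.56 GiB per sequence.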
According to Nvidia, DMS enables models to “manage their own memory” by selecting which tokens need to be retained and which can be discarded, without compromising output quality.
No Accuracy Trade-Off
This approach differs from previous methods that relied on fixed rules (heuristics) to delete older memory entries. Those earlier techniques often sacrificed accuracy by removing important information.
In contrast, DMS trains the model to recognize which tokens are truly relevant for subsequent reasoning steps.
Nvidia also implements a delayed eviction mechanism, meaning token deletion is postponed so the model can fully absorb important context before memory is cleared.
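The idea of delayed eviction can be illustrated with a toy simulation. This is a minimal sketch and not Nvidia's implementation: the per-token eviction flags are hypothetical inputs standing in for the decisions a DMS-trained model would learn, and the `window` parameter is an assumed name for the delay. A flagged token stays in the cache for `window` further steps before it is actually removed.

```python
def simulate_delayed_eviction(evict_flags, window):
    """Return the cache contents (token positions) after each step.

    evict_flags[i] is True if token i has been marked for eviction
    (hypothetical input; a real system would predict this).
    A marked token is only removed once `window` newer tokens exist.
    """
    cache = []
    history = []
    for t in range(len(evict_flags)):
        cache.append(t)                 # newly generated token enters the cache
        expired = t - window            # token whose delay has just elapsed
        if expired >= 0 and evict_flags[expired]:
            cache.remove(expired)       # eviction happens only now
        history.append(list(cache))
    return history


flags = [False, True, True, False, True, False]
steps = simulate_delayed_eviction(flags, window=2)
for t, contents in enumerate(steps):
    print(f"step {t}: cache = {contents}")
```

In this run, token 1 is marked for eviction as soon as it is produced but remains readable until step 3, and token 4 is still in the cache at the end because its delay has not yet elapsed. With `window=0`, marked tokens would disappear immediately, which is the behavior the delay is meant to avoid.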
In tests conducted on several models, including Qwen and Llama, DMS demonstrated improved efficiency without any decline in performance.
As reported by VentureBeat, in several mathematics and coding benchmarks, models equipped with DMS even achieved higher scores than the standard versions operating under the same computational budget.
This memory efficiency has a direct impact on GPU usage. With a smaller cache, GPUs no longer need to continuously read and write large volumes of data, reducing latency and increasing throughput.
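A rough way to see why a smaller cache helps latency: when generating each new token, the attention layers read the cached keys and values, so the time spent on that read is bounded below by cache size divided by memory bandwidth. The sketch below uses assumed numbers (a 4 GiB cache and 1 TB/s of bandwidth, neither taken from Nvidia's report) and ignores model weights, compute, and all other overheads.

```python
def min_kv_read_time_ms(cache_bytes, bandwidth_bytes_per_s):
    """Lower bound, in milliseconds, on reading the KV cache once."""
    return cache_bytes / bandwidth_bytes_per_s * 1000.0


CACHE_BYTES = 4 * 1024 ** 3   # assumed 4 GiB cache (hypothetical)
BANDWIDTH = 1e12              # assumed 1 TB/s memory bandwidth (hypothetical)

full_ms = min_kv_read_time_ms(CACHE_BYTES, BANDWIDTH)
small_ms = min_kv_read_time_ms(CACHE_BYTES / 8, BANDWIDTH)

print(f"full cache:    {full_ms:.2f} ms per generated token")
print(f"8x compressed: {small_ms:.2f} ms per generated token")
```

The absolute figures depend entirely on the assumed inputs. The point is the proportionality: under this simple model, an eightfold smaller cache means an eightfold smaller lower bound on cache traffic per token.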
In tests comparing the standard (vanilla) Qwen3-8B model with a DMS-enhanced version, both showed nearly identical accuracy across several reasoning benchmarks, including MATH 500, HumanEval, and AIME 2024.
In some evaluations, the DMS version even recorded slightly higher scores. The most significant differences were observed in memory efficiency and performance stability.
The standard Qwen3-8B model tends to experience memory usage spikes as context length increases, sometimes resulting in “out of memory” errors.
In contrast, the DMS-enabled version maintains stable generation times and avoids memory exhaustion, allowing the model to process longer contexts without excessively burdening the GPU.
For companies, these savings are considered substantial, as AI infrastructure costs today are heavily dependent on GPU capacity and memory.
Can Be Applied to Existing Models
Nvidia stated that DMS can be applied to pre-trained models without requiring retraining from scratch. The adaptation process is described as relatively lightweight and compatible with standard inference infrastructure.
The technology has been released as part of Nvidia’s Model Optimizer framework and can be integrated into AI pipelines built on Hugging Face, as well as systems supporting FlashAttention.
Conclusion
Nvidia’s Dynamic Memory Sparsification (DMS) represents a significant step forward in AI efficiency. By dramatically reducing memory usage without compromising accuracy, the technology addresses one of the biggest bottlenecks in large-scale AI deployment: GPU and memory constraints. With compatibility for existing pre-trained models and standard inference infrastructure, DMS offers a practical path for companies to lower AI infrastructure costs while maintaining high performance. As AI workloads continue to scale, innovations like DMS could play a crucial role in making advanced models more accessible and sustainable.