
WEKA Maximizes Token Output With Lower Cost Per Token on NVIDIA BlueField-4 STX


2026-03-17 06:00 Last Updated At:06:25

NeuralMesh and Augmented Memory Grid Integration with NVIDIA STX Increases Token Production by 6.5x in the Same GPU Footprint, Slashing Cost of Inference for AI-Driven Organizations

SAN JOSE, Calif. and CAMPBELL, Calif., March 17, 2026 /PRNewswire/ -- From GTC 2026: WEKA, the AI storage and memory systems company, today announced the integration of its NeuralMesh™ software with the NVIDIA STX reference architecture. WEKA's breakthrough Augmented Memory Grid™ memory extension technology, running on NeuralMesh, will support NVIDIA STX to bring high-throughput context memory storage to agentic AI factories, making long-context reasoning seamless across sessions, tools, and tasks. Leveraging NVIDIA Vera Rubin NVL72, NVIDIA BlueField-4, and NVIDIA Spectrum-X Ethernet, the NeuralMesh solution based on NVIDIA STX is expected to deliver 4-10x more tokens per second for context memory while sustaining at least 320 GB/s read and 150 GB/s write throughput for AI workloads, more than double the throughput of conventional AI storage platforms.

Solving the Inference Cost Problem with Shared KV Cache Infrastructure
Scaling agentic systems, especially for software engineering applications, exposes a hard truth: today's AI economics are decided at the memory infrastructure layer. Every large-scale inference fleet hits the memory wall: limited high-bandwidth memory (HBM) on the GPU is rapidly exhausted, key-value (KV) cache is evicted, context is lost, and the system is forced to repeat work it already completed. This architectural inefficiency sends inference costs soaring. The answer is a shared KV cache infrastructure that keeps context live across agents, users, and sessions. It eliminates redundant computation, sustains token throughput, and maintains predictable performance. Without shared KV cache infrastructure, every increase in concurrent users and agents becomes a liability — costs rise, experiences degrade, and the inference fleet becomes harder to operate the larger it grows. With STX for context memory, NVIDIA is introducing a blueprint to address these core inference bottlenecks.
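To make the recomputation argument concrete, the following is a minimal, self-contained Python sketch of the shared-KV-cache idea: requests keyed by their prompt prefix reuse prior prefill work instead of repeating it. The class name, hashing scheme, and token-count accounting are illustrative assumptions for this toy model, not WEKA's implementation (a real system persists GPU KV tensors, not token counts).

```python
import hashlib

# Toy model of a shared KV cache: illustrative only, not WEKA's implementation.
# A real system stores and retrieves GPU key-value tensors; here we just count
# how many prompt tokens must be processed in the prefill phase.
class SharedKVCache:
    """Persists 'KV cache' entries keyed by a hash of the prompt prefix,
    so repeated requests skip prefill work they already paid for."""

    def __init__(self):
        self._store = {}  # prefix hash -> cached prefill result

    def _key(self, prefix_tokens):
        return hashlib.sha256(" ".join(prefix_tokens).encode()).hexdigest()

    def serve(self, prefix_tokens):
        """Return the number of tokens that had to be (re)computed in prefill."""
        key = self._key(prefix_tokens)
        if key in self._store:
            return 0  # cache hit: context is reused, no recomputation
        self._store[key] = len(prefix_tokens)  # simulate persisting the KV cache
        return len(prefix_tokens)  # cache miss: pay the full prefill cost

cache = SharedKVCache()
prompt = ["system:", "you", "are", "a", "coding", "agent"] * 100  # 600-token prefix
first = cache.serve(prompt)   # miss: prefill all 600 tokens
second = cache.serve(prompt)  # hit: zero tokens recomputed
```

Without the shared cache, the second request would pay the full 600-token prefill again; with it, that work drops to zero, which is the mechanism behind the throughput and cost claims above.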

Context Memory Storage: The Foundation of Agentic AI Factories
With co-designed WEKA solutions based on NVIDIA STX architecture, AI clouds, enterprises, and AI model builders can deploy the infrastructure foundation they need to run GPUs at peak productivity, sustain high-volume token production, and make large-scale inference more energy and cost-efficient.

Leading AI innovators and cloud providers, such as Firmus, are already transforming their inference economics with Augmented Memory Grid on NeuralMesh.

"Real-world AI doesn't run in a lab. It has power constraints, cooling limits, and relentless workload demand. Firmus is built for exactly that. Paired with NVIDIA AI infrastructure, WEKA Augmented Memory Grid delivers up to 6.5x higher tokens per second and 4x faster time-to-first-token (TTFT) at scale, proving we can get more performance from the same GPU footprint. With NeuralMesh and Augmented Memory Grid integrated into our NVIDIA-aligned AI Factory and NVIDIA STX reference architecture, we'll be able to deliver the fastest context memory network for predictable and efficient inference at scale," said Daniel Kearney, Chief Technology Officer at Firmus.

NeuralMesh and NVIDIA STX: Purpose-Built for Agentic AI
NeuralMesh is WEKA's intelligent, adaptive storage system built on over 170 patents. It will run across the full-stack STX reference architecture, providing the next-generation storage organizations need to standardize high-performance AI data services and accelerate agentic AI outcomes. WEKA's Augmented Memory Grid is a purpose-built memory extension layer that pools and persists KV cache outside of GPU memory, keeping long-context sessions stable and concurrency high as inference workloads grow. First unveiled at GTC 2025 and generally available to NeuralMesh customers today, Augmented Memory Grid has been validated with Supermicro on NVIDIA Grace CPUs and BlueField-3 DPUs to deliver numerous benefits that improve AI economics, including:

  • Faster User Experiences: Augmented Memory Grid on NeuralMesh delivers a 4-20x improvement in time-to-first-token, keeping AI agents and applications responsive under real-world load.
  • More Revenue from the Same Hardware: Serve 6.5x more tokens per GPU — without adding infrastructure.
  • Sustained Performance at Scale: Augmented Memory Grid maintains high KV cache hit rates even as sessions, agents, and context windows grow — preventing the performance cliff that hits DRAM-only architectures.
  • GPU-Native Efficiency: BlueField-4 integration offloads the storage data path from the CPU, keeping GPUs fully productive and eliminating I/O bottlenecks.
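The cost claim behind these figures is simple arithmetic: at a fixed hourly GPU cost, serving 6.5x more tokens per second divides the cost per token by 6.5. A brief sketch of that calculation, where the dollars-per-hour rate and baseline throughput are purely illustrative assumptions, not figures from WEKA or NVIDIA:

```python
# Back-of-the-envelope cost-per-token arithmetic. The $4.00/hr GPU rate and
# 1,000 tok/s baseline are hypothetical inputs chosen for illustration.
def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second):
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

baseline = cost_per_million_tokens(gpu_cost_per_hour=4.0, tokens_per_second=1000)
with_kv_cache = cost_per_million_tokens(gpu_cost_per_hour=4.0, tokens_per_second=6500)

# Throughput up 6.5x at the same hourly cost => cost per token down 6.5x.
savings_factor = baseline / with_kv_cache
```

Whatever the actual hourly rate, the ratio is what matters: the hardware bill is unchanged, so cost per token falls by exactly the throughput multiplier.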

"With coding LLMs advancing, we're seeing unprecedented adoption of Agentic AI use cases for software engineering, where productivity increases by 100-1000x. As coding assistants make repeated calls against largely unchanged codebases and prompts, WEKA's Augmented Memory Grid reuses cached context instead of forcing redundant prefill, even as context windows grow to incredible lengths. This provides a major boost in response times and greatly increases the number of concurrent users running on the same infrastructure," said Liran Zvibel, co-founder and CEO at WEKA. "WEKA first identified this need for context memory storage more than a year ago and launched Augmented Memory Grid at GTC 2025. Now, NVIDIA STX opens the door to organizations running their storage and memory extension infrastructure on state-of-the-art NVIDIA Vera Rubin architecture, including NVIDIA BlueField-4 and NVIDIA Spectrum-X Ethernet. Running Augmented Memory Grid on NeuralMesh for NVIDIA STX delivers extreme performance and efficiency that translates directly to game-changing AI economics."

Availability

WEKA's Augmented Memory Grid is commercially available with NeuralMesh today.

Organizations that don't address the memory wall today will find it harder and more expensive to scale tomorrow. As agentic workloads grow and context windows expand, DRAM-only architectures face a compounding cost problem: each additional concurrent user or session increases recomputation overhead, GPU idle time, and operational cost. The organizations that architect for persistent KV cache now will have a structural cost and performance advantage over those that wait.

For more information about NeuralMesh, visit: weka.io/NeuralMesh.
For more information about Augmented Memory Grid, visit: weka.io/augmented-memory-grid.

Organizations can learn more at weka.io/nvidia or visit WEKA at GTC 2026, booth #1034.

About WEKA
WEKA is transforming how organizations build, run, and scale AI workflows with NeuralMesh™ by WEKA®, its intelligent, adaptive mesh storage system. Unlike traditional data infrastructure, which becomes slower and more fragile as workloads expand, NeuralMesh becomes faster, stronger, and more efficient as it scales, dynamically adapting to AI environments to provide a flexible foundation for enterprise AI and agentic AI innovation. Trusted by 30% of the Fortune 50, NeuralMesh helps leading enterprises, AI cloud providers, and AI builders optimize GPUs, scale AI faster, and reduce innovation costs. Learn more at www.weka.io or connect with us on LinkedIn and X.

WEKA and the W logo are registered trademarks of WekaIO, Inc. Other trade names herein may be trademarks of their respective owners.

** This press release is distributed by PR Newswire through an automated distribution system, for which the client assumes full responsibility. **

WEKA Accelerates AI Factory Deployment Times From Months to Minutes with Turnkey NVIDIA AI Data Platform Solution

New NeuralMesh AI Data Platform Closes the Gap Between AI Proof-of-Concept and Profitable Production, Delivering Scalable Business Intelligence and Faster AI Outcomes with NVIDIA

SAN JOSE, Calif. and CAMPBELL, Calif., March 17, 2026 /PRNewswire/ -- From GTC 2026: WEKA, the AI storage and memory systems company, today announced the general availability of its enterprise-ready NeuralMesh™ AI Data Platform (AIDP), which delivers composable, high-performance infrastructure optimized for AI Factory deployments. Based on the NVIDIA AI Data Platform reference design, the solution is an end-to-end system that accelerates the delivery of AI-ready data to AI factories. The result: AI project timelines speed up from months to minutes, empowering organizations to deliver production-scale agentic AI applications using best-in-class technologies across their ecosystem.

Leveraging NeuralMesh's uniquely adaptive architecture, the solution addresses the most persistent obstacle in enterprise AI: organizations can demonstrate that AI concepts work in proof-of-concept (POC) but consistently struggle to reach production scale.

Built on more than 170 patents and over a decade of AI-native storage innovation, a foundation no competing storage platform can replicate, NeuralMesh is the only solution that gets faster and more resilient as AI environments scale to exabytes and beyond. As AI Factory data infrastructure becomes a critical layer in enterprise AI architecture, NeuralMesh is helping customers close the gap between POC and production deployments today. Customers running NeuralMesh with Augmented Memory Grid™ can achieve 6.5x more tokens per GPU for inference workloads, reflecting the compounding advantage of a purpose-built architecture over retrofitted infrastructure.

"Enterprises are now deploying AI Factories internally, driving a major shift to inference throughout the ecosystem. These companies require rapid AI outcomes and need turnkey solutions that come with the enterprise table-stakes of reliability, security, and optimal price-performance," said Liran Zvibel, co-founder and CEO at WEKA. "WEKA's NeuralMesh AIDP gives organizations everything they need to run always-on AI factories: extreme storage performance and the flexible architecture required to operationalize AI at production scale. Whether an organization is just beginning its AI journey or running full-stack NVIDIA deployments, NeuralMesh AIDP scales seamlessly as they grow."

"The deployment of agentic AI in production demands a new focus on managing the continuous, coherent flow of data and inference context," said Jason Hardy, vice president, storage technologies at NVIDIA. "By leveraging the NVIDIA AI Data Platform, solutions like WEKA's NeuralMesh AIDP deliver the persistent context tier necessary for stable and high-scale agentic inference."

One System, Every AI Workload: Delivering End-to-End AI Factories

AI factories provide enterprises with purpose-built production systems designed to operate AI at scale, but they demand storage capabilities that extend beyond where data sits to actively support context and continuous data movement. NeuralMesh, WEKA's intelligent, adaptive storage system, delivers the continuous data-loop performance that AI factory workloads demand.

Out-of-the-Box AI Applications Designed to Accelerate Business Outcomes

NeuralMesh AIDP enables enterprises and AI cloud providers to unify AI operations from retrieval to inference on a single, ready-to-deploy platform. With pre-integrated hardware and software options from NVIDIA (including NVIDIA RTX 6000 PRO Server Edition GPUs and the newly announced NVIDIA RTX 4500 PRO Server Edition GPUs) alongside Red Hat, Spectro Cloud, and Supermicro, organizations can eliminate months of AI integration work.

The platform provides a simplified solution that allows teams to focus on intelligence output rather than managing underlying infrastructure. It delivers ready-to-use pipelines for a spectrum of business use cases that work across verticals, including Semantic Search, Video Search & Summarization (VSS), AlphaFold for drug discovery, AIQ/Agentic RAG, and more.

These AI applications are already being used by enterprise and research customers to drive outcomes across high-priority sectors:

  • Health & Life Sciences: Identify patient subgroups across multiple studies and accelerate discovery in data-intensive workflows such as cryo-EM.
  • Financial Services: Get early market signal detection as data lands and institutionalize knowledge access into a shared, secure resource.
  • Public Sector: Detect potential threats based on context and meaning, not keywords, and automate evidence synthesis across sources to improve decision-making cycles.
  • Physical AI & Robotics: Shorten the loop from real-world data capture to retrained model deployment, improving fleet performance, reliability, and time to market.

"The missing piece in production AI isn't reasoning models or compute power. It's having an efficient platform that unifies the AI Factory pipeline and makes it truly scalable," said Shimon Ben-David, CTO at WEKA. "The NeuralMesh AIDP was designed to close AI's production and profitability gap, taking enterprise experiments to full-scale operations and making AI economically viable for everything from next-generation agents to healthcare applications."

Supporting Partner & Customer Quotes

"Getting AI to production requires more than technology. It requires consistency and control. By using the NeuralMesh AI Data Platform with Red Hat AI Enterprise, based on Red Hat OpenShift, organizations can run data-intensive AI pipelines across on-premises and cloud environments at the scale that enterprise production demands, without sacrificing governance or security," said Ryan King, vice president, AI and Infrastructure Partners at Red Hat.

"The real challenge in AI is no longer training models. It is running them reliably in production, at scale, with predictable performance and cost. That's where most AI initiatives stall. The NeuralMesh AI Data Platform integrates with our AI Acceleration Cloud, Neysa Velocis, to solve that problem directly. It gives teams a way to run AI workloads as dependable systems, without carrying the operational burden of stitching together complex infrastructure," said Anindya Das, cofounder and CTO at Neysa.

Availability
The NeuralMesh AI Data Platform solution is available now, delivered as an appliance-style system. Organizations can learn more at weka.io/nvidia or visit WEKA at GTC 2026, booth #1034 for a demo.

