This tiny AMD PC just ran a massive 397B-parameter AI model that required a server room full of GPUs a year ago.

AMD’s Ryzen AI Halo recently went on sale for $4,000, sparking an interesting debate over how it compares to Nvidia’s slightly more expensive DGX Spark offering.

The configuration behind the Ryzen AI Halo has been on the market for a few months now, however. While most OEMs and enterprise vendors offer the same configuration, Shenzhen-based memory and storage company Longsys has taken it a step further.

The storage giant showed off a 397B-parameter AI model running locally on its own version of the Ryzen AI Halo, with the same 16-core Ryzen AI Max+ 395 configuration and 128 GB of RAM.

How was the Ryzen AI Max+ 395 able to run such a massive model with only 128 GB of RAM?

Although Longsys did not name the model, it appears to be a custom version derived from Alibaba’s Qwen 3.5 397B (A17B), a multimodal base model that uses a mixture of experts (MoE) approach, the same technique that made the original DeepSeek such a powerful challenger.

Even with INT4 quantization, the model’s memory requirements far exceed what the demonstration device offers: only 96 GB of VRAM is available to the GPU in a unified 128 GB configuration, compared to the approximately 200 to 250 GB of VRAM that the model requires to operate.
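A rough back-of-the-envelope calculation shows where figures like these come from. The sketch below is only an estimate: it assumes the weights dominate the footprint, counts 1 GB as 10^9 bytes, ignores the KV cache and runtime overhead, and reads the "A17B" suffix as roughly 17 billion active parameters per token, which is the usual meaning of that naming but was not confirmed for this demo. The function and constant names are illustrative.

```python
def weight_footprint_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and any runtime overhead.
    """
    return params_billions * bits_per_param / 8


TOTAL_PARAMS_B = 397   # total parameters, from the article
ACTIVE_PARAMS_B = 17   # assumption: "A17B" means ~17B active parameters per token
GPU_VISIBLE_GB = 96    # VRAM available to the GPU, from the article

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    total = weight_footprint_gb(TOTAL_PARAMS_B, bits)
    active = weight_footprint_gb(ACTIVE_PARAMS_B, bits)
    fits = total <= GPU_VISIBLE_GB
    print(f"{label}: total {total:.1f} GB, active {active:.1f} GB, fits in VRAM: {fits}")
```

At INT4 the weights alone come to about 198.5 GB, roughly twice what the GPU can address, while the parameters active for any single token would need only about 8.5 GB under the same assumptions. That gap is what makes keeping idle experts elsewhere plausible.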

The secret sauce lies in Longsys’ recently unveiled custom SPU and iSA configuration, which compresses data in real time. The company says this allows it to store up to twice the amount of data in storage drives up to 128 GB, with a caching layer that significantly reduces DRAM requirements.

The approach involves offloading experts that are not actively in use to a large, fast storage buffer, from which the AI chip can load them back when needed.
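The general idea of expert offloading can be sketched in a few lines. This is not Longsys’ implementation, whose details have not been published; it is a minimal illustration that assumes a simple least-recently-used (LRU) eviction policy, and the `ExpertCache` class, its `load_from_storage` callback, and the toy `storage` dictionary are all hypothetical names invented for this example.

```python
from collections import OrderedDict
from typing import Callable


class ExpertCache:
    """Keeps at most `capacity` experts in fast memory; the rest stay in storage."""

    def __init__(self, capacity: int, load_from_storage: Callable[[int], bytes]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._load = load_from_storage          # hypothetical slow-path loader
        self._resident: "OrderedDict[int, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int) -> bytes:
        if expert_id in self._resident:
            # Hit: mark the expert as most recently used.
            self._resident.move_to_end(expert_id)
            self.hits += 1
            return self._resident[expert_id]
        # Miss: fetch from storage, then evict the least recently used expert
        # if the cache is over capacity.
        self.misses += 1
        weights = self._load(expert_id)
        self._resident[expert_id] = weights
        if len(self._resident) > self.capacity:
            self._resident.popitem(last=False)
        return weights

    def resident_ids(self) -> list:
        return list(self._resident)


# Toy stand-in for an SSD holding eight experts' weights.
storage = {i: bytes([i]) * 4 for i in range(8)}
cache = ExpertCache(capacity=2, load_from_storage=storage.__getitem__)

# Simulated sequence of experts chosen by the router for successive tokens.
for expert_id in [0, 1, 0, 2, 1]:
    cache.get(expert_id)

print(f"hits={cache.hits} misses={cache.misses} resident={cache.resident_ids()}")
```

Every miss in this sketch stands for a trip to storage, which is why I/O latency matters so much for this kind of design: the fewer experts that have to be reloaded mid-inference, the smoother the output.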

In a press release, Longsys claimed that its approach works by targeting “problems of LLM MoEs” such as the large number of parameters, rapid KV cache expansion, and I/O latency that hampers inference efficiency.

“It leverages expert offloading, intelligent cache management, and predictive prefetching algorithms to efficiently solve storage scheduling problems and comprehensively improve the smoothness of local AI inference,” the company added.

It’s important to note that while the demonstration is impressive, Longsys did not provide performance details such as tokens per second, an area where the Ryzen AI chip is relatively limited compared to most modern AI GPU offerings.

Regardless, the approach that essentially treats storage as memory suggests that localized AI might be able to run considerably larger models and that memory might not be as important a constraint for some approaches.

In practice, memory constraints can be circumvented by taking advantage of fast storage, making it possible to run a cutting-edge model that would otherwise require tens of thousands of dollars in AI hardware, which is no small feat. Models that were previously limited to data centers can now be run on a device that fits in the palm of your hand.
