Researchers say they trained a foundation model from scratch for about $1,500

Read full story on VentureBeat

Summary

<p>Training a foundation LLM from scratch costs millions and requires internet-scale data, which is why most enterprises don&#x27;t bother. Sapient Intelligence thinks it has a cheaper path.</p><p>To move past brute-force scaling, researchers at Sapient developed <a href="https://github.com/sapientinc/HRM-Text">HRM-Text</a>, which replaces standard Transformers with a highly sample-efficient Hierarchical Recurrent Model (HRM), an architecture they <a href="https://venturebeat.com/ai/new-ai-architecture-delivers-100x-faster-reasoning-than-llms-with-just-1000-training-examples">first introduced last year</a>.</p><p>HRM decouples computation into slow-evolving strategic and fast-evolving execution layers. Instead of brute-force autoregressive prediction on raw text, HRM-Text trains exclusively on instruction-response pairs. This format mirrors real-world enterprise settings, where users usually expect a targeted answer to a specific task.</p><p>The researchers trained a 1B-parameter HRM-Text from scratch with a fraction of the cost and training tokens of conventional LLMs. The model performed competitively with much larger open models on key industry benchmarks.</p><p>For real-world AI applications, this suggests foundational pretraining need not be restricted to highly resourced institutions. With HRM-Text, organizations could affordably pretrain their own capable reasoning models from scratch and pair them with external knowledge stores.</p><h2>The training bottleneck</h2><p>When we train an LLM, we don&#x27;t actually care whether it has memorized the exact sequence of words in a random 2014 Reddit thread.
What we want is for the model to develop a deep, underlying understanding of human language, logic, facts, and reasoning.</p><p>The current approach is brute force: scrape the internet, run next-token prediction trillions of times, and assume the model has developed a working internal model of the world.</p><p>Basically, this means that we waste millions of dollars of computing power forcing models to memorize everything collected from the internet, just so they can indirectly learn how to think. For example, standard decoder-only models spend valuable compute assigning loss to reconstruct the prompt itself, even though the user&#x27;s prompt is already known and provided at inference time.</p><p>Instead of simply viewing this as a computational hurdle, the industry must recognize it as a severe business limitation. In comments provided to VentureBeat, Guan Wang, CEO of Sapient Intelligence, framed this as an issue of the &quot;economics of iteration.&quot;</p><p>&quot;Enterprises today face three compounding problems: training is expensive, infrastructure is heavy, and experimentation cycles are too slow,&quot; Wang said. &quot;The industry’s scaling addiction says: &#x27;When the model fails, make it bigger. Add more data. Add more GPUs.&#x27; That has worked, but it is reaching a point of diminishing returns. More scale often means more memorization, more latency, more infrastructure, and more vendor dependency. It does not necessarily give an enterprise a better reasoning engine.&quot;</p><p>This architectural and computational inefficiency is exactly why fine-tuning existing dense transformers isn&#x27;t always the silver bullet for enterprises. 
Fine-tuning while preserving a model&#x27;s general capabilities often requires mixing substantial general-purpose data into the process, which makes it computationally heavy and difficult to control.</p><p>&quot;Imagine a hedge fund, insurer, or bank that has highly proprietary data: internal research notes, transaction logic, compliance rules, analyst memos, risk models, portfolio constraints,&quot; Wang said. &quot;They may not want to send that data to an external frontier model, and they may not need a giant general-purpose model that memorized the internet. What they need is a compact reasoning core that can learn their task structure, reason across rules and numbers, and run in a controlled environment.&quot;</p><p>Because HRM-Text focuses its computation strictly on task completion and latent reasoning, it lets enterprises start with a smaller, smarter model and adapt it to a proprietary domain with far less infrastructure.</p><h2>Rethinking architectures with HRM-Text</h2><p>HRM, which was introduced in 2025, represents a fundamental departure from traditional Transformer models. To build a more sample-efficient engine, it splits computation between a slow-evolving strategic layer, the H-module, and a fast-evolving execution layer, the L-module. The L-module performs local iterative refinement, while the H-module maintains stable semantic context across cycles. Processing consists of two high-level cycles; each cycle runs three fast L-module updates followed by a single slow H-module update.</p><p>Standard parameter-shared recurrent architectures (like <a href="https://venturebeat.com/ai/samsung-ai-researchers-new-open-reasoning-model-trm-outperforms-models-10">Samsung&#x27;s TRM</a>) can sometimes handle small logic puzzles, but the Sapient researchers found they become highly unstable when scaled to 1 billion parameters for language tasks. The researchers argue that the separation between HRM&#x27;s slow H-module and fast L-module is a mathematical necessity, not just an aesthetic choice.
As Wang said: &quot;For logic grids, you can sometimes get away with a tiny recursive mechanism because the world is clean and bounded. Language is not like that. Language needs both fast local refinement and slow semantic stability.&quot;</p><p>The original HRM proved highly effective on controlled, symbolic reasoning problems, but the researchers hit a wall when applying it to open-ended, general-purpose language modeling. The recurrent loops that make HRM an efficient reasoner also make it numerically volatile to train on the diversity of human language: running the loops repeatedly can cause gradients to explode or vanish.</p><p>To tame this instability, the researchers introduced two key innovations in HRM-Text. First, they developed MagicNorm, a specialized normalization technique designed to keep the internal signals stable no matter how many times the model loops its thought process.</p><p>Second, they designed a warm-up method to stabilize training. During early training, the model is exposed only to short, shallow reasoning loops. As training progresses, the schedule gradually gives the model deeper and longer reasoning sequences.</p><p>They also switched the training objective from next-token prediction to task completion, where the model is rewarded only on the full response as opposed to individual tokens it generates. To support this objective, they trained HRM-Text on instruction-response pairs only rather than raw text.</p><h2>HRM-Text in action</h2><p>The researchers built a highly compact 1-billion-parameter HRM-Text model. Instead of using the standard multi-stage pipeline that requires churning through trillions of words of raw internet text, they trained it from scratch on a tightly curated dataset of just 40 billion tokens.
The training data consisted entirely of instruction-response pairs covering general instructions, math, symbolic logic, textbook exercises, and rewritten knowledge.</p><p>They trained the model using the task-completion objective. To force the model to rely on its internal hierarchical architecture rather than copying step-by-step logic, they explicitly stripped out &quot;thinking&quot; tokens from the training data.</p><p>The model was evaluated across a diverse suite of standard foundational AI benchmarks that emphasize knowledge, reasoning, logic, math, and comprehension. The researchers tested HRM-Text against both small models and highly resourced open-weight and fully open models.</p><p>The results show a significant shift in the compute-to-performance frontier. The 1B-parameter HRM-Text achieved 60.7% on MMLU, 84.5% on GSM8K, and 56.2% on MATH. This performance is highly competitive with (and in several cases surpasses) the 2B- to 7B-parameter foundation models it was tested against.</p><p>The most important takeaway for the enterprise audience lies in the efficiency statistics and practical implications. Pretraining a foundation model from scratch is typically a multi-million-dollar endeavor reserved for tech giants. HRM-Text was trained in just 1.9 days on a cluster of 16 GPUs. The total estimated compute cost was roughly $1,500. It achieved its competitive scores using 100 to 900 times fewer training tokens and 96 to 432 times less estimated compute than models like Qwen, Gemma, and Llama.</p><p>Another important point is the decoupling of reasoning from knowledge memorization. From a practical standpoint, HRM-Text&#x27;s success on reasoning-heavy tasks despite its tiny 40B-token training diet suggests that a model does not need to memorize the entire internet to become a capable reasoning engine.</p><p>For enterprise applications, this behavior is a feature, not a bug.
The researchers suggest a future where businesses deploy highly compact, incredibly cheap recurrent models that act as the &quot;reasoning core&quot; specialized for business logic. Instead of forcing the model to memorize company databases during pretraining, the model acts as the reasoning engine, relying on external retrieval systems to fetch factual knowledge.</p><p>Critics have pointed out that training on instruction-response pairs makes comparisons against models trained on raw text an &quot;apples-to-oranges&quot; scenario. Wang pushes back on this framing, pointing out that every serious modern LLM sees instruction-response data during training or alignment. &quot;So the comparison is not apples-to-oranges. It is closer to apple cores-and-apples. We started directly from the core task format because that is how people actually use models: they give an instruction and expect a useful response,&quot; he said.</p><p>The researchers also ran rigorous contamination tests to ensure the model wasn&#x27;t simply memorizing benchmark answers. On DROP, the one benchmark showing a marginal contamination signal under a specific setting, HRM-Text still scored an impressive 81.1% on a strictly clean, 0% contamination subset.</p><p>Ultimately, Wang argues that for enterprises, &quot;the right evaluation is not trivia recall. It is a workflow evaluation... Give HRM-Text a task like: multi-step financial reasoning, compliance logic, scientific workflow automation, structured extraction followed by reasoning.&quot;</p><h2>Practical implementation and the future of enterprise AI</h2><p>While the benchmark scores and cost efficiencies are striking, Sapient is clear about the model&#x27;s current boundaries. The initial release is best viewed as a proof-of-concept, akin to early GPT releases, designed to showcase the architecture&#x27;s unique advantages.</p><p>&quot;Honestly, HRM-Text is not yet a plug-and-play ChatGPT replacement,&quot; Wang said. 
&quot;It is a compact foundation language reasoning model. For an enterprise engineering team, the operational work is mainly around templates, mode selection, attention masking, and alignment.&quot;</p><p>For AI engineering teams looking to experiment, getting started requires some specific, but standard, text-generation discipline. The model lists native support in the Transformers library (requiring transformers &gt;= 5.9.0), and usage paths for vLLM and SGLang are actively being developed. The primary engineering task involves managing the PrefixLM design: production multi-turn chat applications will require careful KV-cache logic to ensure user prompts receive full bidirectional attention while the assistant&#x27;s outputs remain causal.</p><p>&quot;When the cost of training a capable reasoning model drops to around $1,500, AI stops being only an infrastructure question and becomes a strategy question,&quot; Wang said. &quot;A Fortune 500 company no longer has to ask, ‘Can we afford a foundation model?’ It would ask, ‘What should our model know about our business, and what kind of reasoning should it be optimized for?’&quot;</p>
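<p>The PrefixLM attention pattern mentioned above, with bidirectional attention over the user prompt and causal attention over the assistant output, can be sketched as a boolean mask. This is a minimal single-turn illustration in plain Python, not Sapient's implementation: the names <code>prefix_lm_mask</code> and <code>render</code> are hypothetical, and a production multi-turn chat system with KV caching would need additional bookkeeping.</p>

```python
def prefix_lm_mask(prefix_len: int, total_len: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Hypothetical helper for illustration only. Positions [0, prefix_len)
    hold the user prompt; the remaining positions hold the response.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            # Every position sees the whole prompt, so attention inside the
            # prompt is bidirectional. Beyond the prompt, a position sees
            # only itself and earlier positions (causal).
            row.append(j < prefix_len or j <= i)
        mask.append(row)
    return mask


def render(mask: list[list[bool]]) -> str:
    """Draw the mask with 'x' for allowed attention and '.' for blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    # Two prompt tokens followed by three response tokens.
    print(render(prefix_lm_mask(prefix_len=2, total_len=5)))
```

<p>In the rendered grid, the top-left block for the prompt is fully filled, while the rows for response tokens form the familiar lower-triangular causal pattern.</p>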

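<p>The efficiency figures quoted in the article can be sanity-checked with a few lines of arithmetic. The sketch below uses only the reported numbers (1.9 days, 16 GPUs, roughly $1,500, 40 billion tokens, 100 to 900 times fewer tokens). The per-GPU-hour rate and the comparison token counts are derived here as implied values, not figures stated by Sapient, and all names are illustrative.</p>

```python
# Figures reported in the article.
TRAINING_DAYS = 1.9
NUM_GPUS = 16
ESTIMATED_COST_USD = 1_500
TRAINING_TOKENS = 40_000_000_000  # 40 billion
TOKEN_RATIO_RANGE = (100, 900)    # "100 to 900 times fewer training tokens"


def gpu_hours(days: float, gpus: int) -> float:
    """Total GPU-hours for a run of `days` wall-clock days on `gpus` GPUs."""
    return days * 24 * gpus


def implied_rate(cost_usd: float, hours: float) -> float:
    """Implied price per GPU-hour (a derived value, not a reported one)."""
    return cost_usd / hours


def implied_baseline_tokens(tokens: int, ratios: tuple[int, int]) -> tuple[int, int]:
    """Token budgets of the comparison models implied by the reported ratios."""
    low, high = ratios
    return tokens * low, tokens * high


if __name__ == "__main__":
    hours = gpu_hours(TRAINING_DAYS, NUM_GPUS)
    print(f"GPU-hours: {hours:.1f}")
    print(f"Implied USD per GPU-hour: {implied_rate(ESTIMATED_COST_USD, hours):.2f}")
    low, high = implied_baseline_tokens(TRAINING_TOKENS, TOKEN_RATIO_RANGE)
    print(f"Implied baseline tokens: {low / 1e12:.0f}T to {high / 1e12:.0f}T")
```

<p>Under these assumptions the run comes to about 730 GPU-hours, or roughly $2 per GPU-hour, and the comparison models would have seen on the order of 4 trillion to 36 trillion tokens.</p>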