⚙ LongCat-2.0 is really a story about adapting frontier LLMs to domestic compute
Zhihu contributor Robin shared a technical take on Meituan’s open-source trillion-parameter model LongCat-2.0, from the perspective of both an early LongCat-2.0-Preview user and a former Ascend 910 user.
The key point is not just “Meituan released a huge model.” It is that LongCat-2.0 shows how much work is needed to make a frontier-scale MoE model actually train on domestic accelerators: precision alignment, kernel optimization, memory pressure, parallelism, reliability, and training stability all have to move together.
🧩 LongCat-Next was the proof of concept
Robin frames LongCat-Next as a PoC for LongCat-2.0.
Even at that stage, you could already see the hard engineering around Ascend 910: BF16 precision alignment, kernel optimization, ScMoE with chunking on expert parallelism, and reliability work.
LongCat-Flash-Lite’s N-gram Embedding is also inherited, while the newer architectural addition is LongCat Sparse Attention, which will likely get more technical analysis from the community.
🔥 The real new content is large-scale training
According to Robin, the most important new part of the LongCat-2.0 tech blog is not inference, but large-scale training.
Scaling from LongCat-Next to a SOTA-sized model is already hard. Doing it on Ascend 910 makes it harder because of VRAM pressure and ecosystem constraints.
Compared with LongCat-Next, LongCat-2.0 switched to Muon, which Robin sees as another endorsement of the optimizer.
On the memory side, ZeRO-1, recomputation, and offloading are now almost standard tools; Meituan’s more distinctive piece is the use of zero-computation experts.
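To make "zero-computation experts" concrete, here is a minimal toy sketch. It assumes such an expert simply returns its input unchanged, so a token routed to it does no FFN work. The names `ffn_expert`, `identity_expert`, and `moe_forward` are hypothetical and are not taken from the LongCat code.

```python
# Toy top-k MoE layer over plain Python lists.
# Assumption: a zero-computation expert is an identity function.

def ffn_expert(scale):
    """Hypothetical stand-in for a real FFN expert: it only scales the vector."""
    def run(x):
        return [scale * v for v in x]
    return run

def identity_expert(x):
    """Zero-computation expert: returns the input unchanged."""
    return x

def moe_forward(x, router_scores, experts, top_k=2):
    """Route one token to its top_k experts and mix outputs by normalized score."""
    ranked = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:top_k]
    total = sum(router_scores[i] for i in chosen)
    out = [0.0] * len(x)
    ffn_calls = 0  # how many real (non-identity) experts this token paid for
    for i in chosen:
        weight = router_scores[i] / total
        if experts[i] is not identity_expert:
            ffn_calls += 1
        y = experts[i](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out, ffn_calls

experts = [ffn_expert(2.0), ffn_expert(3.0), identity_expert, identity_expert]
token = [1.0, -1.0]

# A token whose router scores favor the identity experts: no FFN work.
easy_out, easy_calls = moe_forward(token, [0.1, 0.1, 0.5, 0.3], experts)
# A token whose router scores favor the real FFN experts.
hard_out, hard_calls = moe_forward(token, [0.5, 0.5, 0.0, 0.0], experts)
```

The idea being illustrated is that per-token compute becomes variable: the router can spend fewer expert calls on some tokens without changing the layer's interface.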
The broader system also points to a deeper hardware-software fit: supernodes and 6D parallelism are close to the design logic behind Huawei’s 384-card supernode idea.
🇨🇳 Domestic compute is not a drop-in replacement
This is the part worth highlighting.
LongCat-2.0 suggests that domestic compute for large models is not simply about swapping NVIDIA cards for Chinese accelerators. The model, training stack, precision path, parallel strategy, communication layer, and reliability system all have to be adapted together.
Robin points to several key lessons from Ascend 910 training work:
Correctness alignment comes first.
Router TP and NormHead fixes under MindSpeed matter.
Training stability needs long, careful treatment.
Cost calculation is not a side note; it affects whether this path is practical.
The bigger message: domestic AI chips become useful for frontier LLMs only when the model architecture and the training system are co-designed around their constraints.
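The "correctness alignment comes first" lesson can be sketched as a small check: run the same computation on a high-precision reference path and on a low-precision path, then compare against a tolerance. This is only an illustration on the CPU. The BF16 rounding is simulated with `struct`, and `TOLERANCE` and the helper names are hypothetical, not anything from MindSpeed or the LongCat stack.

```python
import struct

def to_bf16(x):
    """Round a finite float (via float32) to a bfloat16 value, ties to even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of a float32; add a bias so the
    # truncation below rounds to nearest, with ties going to the even value.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", rounded & 0xFFFFFFFF))[0]

def dot_reference(a, b):
    """Reference path: plain Python floats (float64)."""
    return sum(x * y for x, y in zip(a, b))

def dot_bf16(a, b):
    """Low-precision path: inputs and the running sum are rounded to bfloat16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_bf16(acc + to_bf16(x) * to_bf16(y))
    return acc

def relative_error(test, ref):
    return abs(test - ref) / max(abs(ref), 1e-12)

TOLERANCE = 1e-3  # hypothetical threshold, chosen only for this example

a = [0.001 * (i + 1) for i in range(256)]
b = [1.0] * 256
err = relative_error(dot_bf16(a, b), dot_reference(a, b))
aligned = err <= TOLERANCE
print(f"relative error: {err:.4f}, aligned: {aligned}")
```

A failed check like this does not say which side is wrong. It only flags where the two paths diverge, which is why alignment has to be settled before anyone debates training stability.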
🧱 What Ascend 910 proved, and what it cost
Robin is cautiously respectful but not blindly optimistic.
With enough Ascend 910 cards, certain frontier-scale paths now look possible. Models around the DeepSeek V4-Flash or DeepSeek-V3.2 scale may be trainable. At the 1.6T LongCat-2.0 scale, the ceiling may be high enough to approach very strong models if teams and compute are organized well.
But the cost was high.
The 910 product line caused real engineering pain across its lifecycle. From release to actually carrying serious training workloads, the path was long and expensive.
Robin’s question is what happens to deployed 910 clusters next: do they shift mostly to inference, or do teams continue integrating and squeezing out their remaining training potential?
The hope is that experience accumulated on 910 does not get reset when Ascend 950 arrives. A100 and even V100 still create value today; 910 should not be treated as disposable if the software and architecture lessons can carry forward.
🚀 What this means for Ascend 950
Robin expects the 910 experience to transfer directly to 950 training.
The most valuable lessons are around scale-up, precision alignment, loss spikes, reliability, and the messy parts of keeping a large cluster alive.
Bigger VRAM will help, but even with 950, memory will still be tight for the largest models. FP8 support is another major shift. The larger point is continuity: the most expensive hardware is not the hardware with the highest specs, but the hardware whose architecture and software stack cannot inherit past work.
That is why the LongCat-2.0 story matters for domestic compute. It creates reusable knowledge for training large models on Chinese accelerators, instead of treating each generation as a one-off fight.
⚠️ Bigger clusters matter, but fragmentation is dangerous
Robin also warns against the “small blast furnace” trap: trying to build frontier models with scattered, undersized clusters.
For large-model training, cluster size is a hard constraint. A large cluster can always be run at partial capacity; a small cluster runs into limits that cannot be wished away.
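A back-of-envelope calculation shows why cluster size acts as a floor rather than a preference. The sketch below only counts the memory needed to hold sharded weights and optimizer state. The 1.6T parameter count comes from this post; the bytes per parameter, the 64 GiB of device memory, and the usable fraction are assumptions made purely for illustration.

```python
import math

def min_devices(params, bytes_per_param, device_mem_gib, usable_fraction=0.7):
    """Lower bound on device count needed just to hold sharded training state.

    Ignores gradients, activations, and communication buffers, so real
    requirements are higher than this number.
    """
    state_bytes = params * bytes_per_param
    usable_bytes = device_mem_gib * 2**30 * usable_fraction
    return math.ceil(state_bytes / usable_bytes)

PARAMS = 1.6e12              # from the post: roughly 1.6T parameters
BYTES_PER_PARAM = 2 + 4 + 4  # assumed: BF16 weights + FP32 master copy + one FP32 momentum buffer
DEVICE_MEM_GIB = 64          # assumed memory per accelerator, not taken from the post

floor = min_devices(PARAMS, BYTES_PER_PARAM, DEVICE_MEM_GIB)
print(f"at least {floor} devices just to hold model state")
```

Below that floor, no amount of scheduling helps: the state does not fit without offloading, and throughput falls accordingly. That is the sense in which a small cluster hits limits a large one does not.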
So the industry impact is not just technical. It is organizational: serious frontier training may require concentrating domestic compute, not spreading it too thinly.
🧠 Model capability is not yet the whole win
Robin is more reserved about LongCat-2.0’s model quality.
After testing with the preview tokens, Robin felt the model was undertrained in pretraining and somewhat unstable in post-training. Reasoning seemed weak under constrained thinking budgets, and longer thinking sometimes led to overthinking instead of better answers.
Agentic behavior also had rough edges, such as writing scripts into the home directory too freely.
So the conclusion is nuanced: LongCat-2.0 may not prove that Meituan already has SOTA-level command of a model at this size. But it does show something else very clearly: Meituan’s AI infrastructure capability is close to the frontier.
✅ The real impact
LongCat-2.0’s impact is not just that another trillion-parameter model exists.
Its real significance is that it pushes forward the adaptation loop between domestic accelerators and large-model training:
hardware constraints force model and system changes
model scale exposes gaps in precision, memory, kernels, and reliability
training pain creates reusable infrastructure experience
that experience can carry into the next generation of domestic compute
In other words, LongCat-2.0 is less a simple model release and more an engineering checkpoint.
It shows that China’s AI stack is moving from “can domestic compute run inference?” toward a harder question: can domestic compute support the full lifecycle of frontier-scale LLM training and deployment?
🔗 Full analysis:
https://www.zhihu.com/question/2055252813568119737/answer/2055346062035104988
#LongCat #Meituan #AIInfra #LLM #ChinaAI #Ascend #MoE #OpenSourceAI