Meituan open-sources LongCat-2.0, a 1.6-trillion-parameter MoE coding model trained on 50,000 Ascend 910 chips

⚙ LongCat-2.0 is really a story about adapting frontier LLMs to domestic compute
Zhihu contributor Robin shared a technical take on Meituan’s open-source trillion-parameter model LongCat-2.0, from the perspective of both an early LongCat-2.0-Preview user and a former Ascend 910 user. The key point is not just “Meituan released a huge model.” It is that LongCat-2.0 shows how much work is needed to make a frontier-scale MoE model actually train on domestic accelerators: precision alignment, kernel optimization, memory pressure, parallelism, reliability, and training stability all have to move together.

🧩 LongCat-Next was the proof of concept
Robin frames LongCat-Next as a PoC for LongCat-2.0. Even at that stage, you could already see the hard engineering around Ascend 910: BF16 precision alignment, kernel optimization, ScMoE with chunking on expert parallelism, and reliability work. LongCat-Flash-Lite’s N-gram Embedding is also inherited, while the newer architectural addition is LongCat Sparse Attention, which will likely get more technical analysis from the community.

🔥 The real new content is large-scale training
According to Robin, the most important new part of the LongCat-2.0 tech blog is not inference, but large-scale training. Scaling from LongCat-Next to a SOTA-sized model is already hard. Doing it on Ascend 910 makes it harder because of VRAM pressure and ecosystem constraints. Compared with LongCat-Next, LongCat-2.0 switched to Muon, which Robin sees as another endorsement of the optimizer. On the memory side, ZeRO-1, recomputation, and offloading are now almost standard tools; Meituan’s more distinctive piece is the use of zero-computation experts. The broader system also points to a deeper hardware-software fit: supernodes and 6D parallelism are close to the design logic behind Huawei’s 384-card supernode idea.
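For readers unfamiliar with zero-computation experts, here is a minimal sketch of the idea, assuming such an expert simply returns its input unchanged so that a token routed to it costs no FFN compute. The function names, the top-2 routing, and the toy experts are all illustrative assumptions, not Meituan's implementation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_scores, ffn_experts, num_zero_experts, top_k=2):
    """Route one token through FFN experts and zero-computation experts.

    router_scores lists the FFN experts first, then the zero-computation ones.
    Returns (output, number_of_ffn_calls).
    """
    assert len(router_scores) == len(ffn_experts) + num_zero_experts
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    out = [0.0] * len(x)
    ffn_calls = 0
    for i in chosen:
        if i < len(ffn_experts):
            y = ffn_experts[i](x)   # real expert: pays for an FFN pass
            ffn_calls += 1
        else:
            y = x                   # zero-computation expert: identity (assumed)
        # Gate with the raw softmax probability (not renormalized over top_k).
        out = [o + probs[i] * v for o, v in zip(out, y)]
    return out, ffn_calls

# Two toy "FFN" experts plus two zero-computation experts (hypothetical).
experts = [lambda v: [2.0 * a for a in v], lambda v: [a + 1.0 for a in v]]
easy = moe_forward([1.0, 2.0], [0.1, 0.2, 3.0, 2.5], experts, num_zero_experts=2)
hard = moe_forward([1.0, 2.0], [3.0, 2.5, 0.1, 0.2], experts, num_zero_experts=2)
```

In this toy run the "easy" token lands only on zero-computation experts and triggers no FFN call, while the "hard" token triggers two. That is the sense in which the compute spent per token can vary with the router's decision.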

🇨🇳 Domestic compute is not a drop-in replacement
This is the part worth highlighting. LongCat-2.0 suggests that domestic compute for large models is not simply about swapping NVIDIA cards for Chinese accelerators. The model, training stack, precision path, parallel strategy, communication layer, and reliability system all have to be adapted together.
Robin points to several key lessons from Ascend 910 training work:
• Correctness alignment comes first.
• Router TP and NormHead fixes under MindSpeed matter.
• Training stability needs long, careful treatment.
• Cost calculation is not a side note; it affects whether this path is practical.
The bigger message: domestic AI chips become useful for frontier LLMs only when the model architecture and the training system are co-designed around their constraints.
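One way to picture "correctness alignment comes first" is a check that compares the loss curve of a run on the new backend against a trusted reference run and reports where they first drift apart. The sketch below is only an illustration under that assumption: the function name, the 1% tolerance, and the loss values are made up, and real alignment work would also compare intermediate activations and gradients.

```python
def first_divergence(reference_losses, candidate_losses, rel_tol=1e-2):
    """Return the first step whose relative loss gap exceeds rel_tol, else None."""
    for step, (ref, cand) in enumerate(zip(reference_losses, candidate_losses)):
        gap = abs(ref - cand) / max(abs(ref), 1e-12)
        if gap > rel_tol:
            return step
    return None

# Made-up per-step losses: reference backend vs. the backend being validated.
ref = [10.0, 8.0, 6.5, 5.5]
cand = [10.01, 8.02, 6.9, 5.6]
diverged_at = first_divergence(ref, cand)
```

With these numbers the first two steps agree within tolerance and the third does not, so the search for a precision or kernel bug would start there.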

🧱 What Ascend 910 proved, and what it cost
Robin is cautiously respectful but not blindly optimistic. With enough Ascend 910 cards, certain frontier-scale paths now look possible. Models around the DeepSeek V4-Flash or DeepSeek-V3.2 scale may be trainable. At the 1.6T LongCat-2.0 scale, the ceiling may be high enough to approach very strong models if teams and compute are organized well.
But the cost was high. The 910 product line caused real engineering pain across its lifecycle. From release to actually carrying serious training workloads, the path was long and expensive. Robin’s question is what happens to deployed 910 clusters next: do they shift mostly to inference, or do teams continue integrating and squeezing out their remaining training potential? The hope is that experience accumulated on 910 does not get reset when Ascend 950 arrives. A100 and even V100 still create value today; 910 should not be treated as disposable if the software and architecture lessons can carry forward.

🚀 What this means for Ascend 950
Robin expects the 910 experience to transfer directly to 950 training. The most valuable lessons are around scale-up, precision alignment, loss spikes, reliability, and the messy parts of keeping a large cluster alive. Bigger VRAM will help. But even with 950, memory will still be tight for the largest models. FP8 support is another major shift, but the larger point is continuity: the most expensive hardware is not hardware with too many specs, but hardware whose architecture and software stack cannot inherit past work.
That is why the LongCat-2.0 story matters for domestic compute. It creates reusable knowledge for training large models on Chinese accelerators, instead of treating each generation as a one-off fight.
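To get a feel for why memory stays tight, here is a back-of-envelope estimate of model-state memory per device when optimizer state is sharded across data-parallel ranks, in the style of ZeRO stage 1. Everything numeric is an assumption for illustration: the 1000 × 50 split of 50,000 chips, 2-byte weights and gradients, and 12 bytes per parameter of Adam-style optimizer state (a Muon setup would likely use a different figure). Activations, buffers, and communication workspace are not counted.

```python
def model_state_gb(total_params, model_parallel_ways, data_parallel_ways,
                   weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Rough per-device model-state memory in GB (1 GB = 1e9 bytes).

    Weights and gradients are split only across model-parallel shards;
    optimizer state is additionally split across data-parallel ranks.
    """
    params_per_shard = total_params / model_parallel_ways
    weights = weight_bytes * params_per_shard
    grads = grad_bytes * params_per_shard
    optimizer = optimizer_bytes * params_per_shard / data_parallel_ways
    return (weights + grads + optimizer) / 1e9

# Hypothetical layout: 1.6T parameters, 1000-way model sharding, 50-way data parallel.
sharded = model_state_gb(1.6e12, model_parallel_ways=1000, data_parallel_ways=50)
unsharded = model_state_gb(1.6e12, model_parallel_ways=1000, data_parallel_ways=1)
```

Under these assumptions, sharding the optimizer state cuts model-state memory from about 25.6 GB to about 6.8 GB per device, before any activation memory is added. This is why recomputation and offloading tend to be needed on top of it.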

⚠️ Bigger clusters matter, but fragmentation is dangerous
Robin also warns against the “small blast furnace” trap: trying to build frontier models with scattered, undersized clusters. For large model training, cluster size is a hard constraint. A large cluster can always use only part of its capacity. A small cluster runs into limits that cannot be wished away. So the industry impact is not just technical. It is organizational: serious frontier training may require concentrating domestic compute, not spreading it too thinly.

🧠 Model capability is not yet the whole win
Robin is more reserved about LongCat-2.0’s model quality. After testing the preview tokens, the model felt undertrained in pretraining and somewhat unstable in post-training. Reasoning seemed weak under constrained thinking budgets, and longer thinking sometimes led to overthinking instead of better answers. Agentic behavior also had rough edges, such as writing scripts into the home directory too freely.
So the conclusion is nuanced: LongCat-2.0 may not prove that Meituan already has SOTA control over a model at this size. But it does show something else very clearly: Meituan’s AI infrastructure capability is close to the frontier.

✅ The real impact
LongCat-2.0’s impact is not just that another trillion-parameter model exists. Its real significance is that it pushes forward the adaptation loop between domestic accelerators and large-model training:
• hardware constraints force model and system changes
• model scale exposes gaps in precision, memory, kernels, and reliability
• training pain creates reusable infrastructure experience
• that experience can carry into the next generation of domestic compute
In other words, LongCat-2.0 is less a simple model release and more an engineering checkpoint. It shows that China’s AI stack is moving from “can domestic compute run inference?” toward a harder question: can domestic compute support the full lifecycle of frontier-scale LLM training and deployment?

🔗 Full analysis: https://www.zhihu.com/question/2055252813568119737/answer/2055346062035104988

#LongCat #Meituan #AIInfra #LLM #ChinaAI #Ascend #MoE #OpenSourceAI

Alexander Doria@Dorialexander

Meituan is maybe the perfect target for an EU model: not made by a lab but by a large company, not "frontier" but highly skilled with real adoption. But you have to fantasize less about moonshots/leapfrogs and do the work.

Alexander Doria@Dorialexander

There is also a lesson in there for EU training plans.

Lucas Beyer (bl16)@giffmana

@Dorialexander Maybemaybemaybe DiLoCo (i would honestly like to see it)

Rohan Paul@rohanpaul_ai

https://www.longcatai.org/

Alexander Doria@Dorialexander

@giffmana I think there was some project of distributed training across HPC with HF, but no idea what came of it.

Usman Anzaar Usmani@UsmanAnzaar

@rohanpaul_ai China training large models on domestic chips weakens export controls and creates new security risks around a separate AI ecosystem.

Moon@MoonL88537

@Dorialexander please. skill issue

Vadia Mineira@VMineira

@rohanpaul_ai curious if the 33b-56b spread actually helps with inference or if it's mostly parameter theater. 1m context usually hits walls in kv cache bottlenecks anyway

along@attaalong

@rohanpaul_ai Self-reliance is a gene etched into the bones of the Chinese people. 🇨🇳

Sakura Yuki@sakurayukiai

@rohanpaul_ai The real story here isn't the 1.6T total, it's the 48B active. MoE architecture is the only reason we can even pretend to serve frontier-class coders at scale without completely melting our memory bandwidth.

Hussain Hashim | Building SundayBack@itsthedonhashim

@rohanpaul_ai crazy how fast tech's moving. wonder how it'll change food delivery next!

PAI3@Pai3Ai

@rohanpaul_ai AI sovereignty increasingly starts with sovereign compute.

Gill@gurtej__gill_

@rohanpaul_ai everyone is rushing to build their own compute stack now.
