OpenAI internal optimization cuts inference costs by half, running logged-out ChatGPT traffic on a couple hundred GPUs · Digg

Tech · 1d ago

Story Overview

OpenAI quietly found an optimization that cuts inference costs in half on the models it was applied to, and the first visible payoff showed up in logged-out ChatGPT traffic, where the GPU count dropped to a couple hundred. The work came from engineers squeezing more out of existing servers rather than chasing new chips, a detail that stayed internal until The Information surfaced it.

Original post

Andrew Curran@AndrewCurran_ · #682 in Tech

OpenAI has found a way to cut inference costs in half.

Stephanie Palazzolo ✈️ ICML@steph_palazzolo

OpenAI engineers earlier this month developed an optimization that cut inference costs in half for models it was applied to.

After the optimization was applied to logged-out ChatGPT traffic, it reduced the number of GPUs needed to power that traffic to a couple hundred.

8:05 AM · Jun 30, 2026 · 208.2K Views

Cost Pressure

Fewer chips, same answers

The change has so far been reported only for logged-out traffic, leaving open how much headroom remains for the rest of the service or other products.
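To see why per-GPU efficiency matters for a fixed slice of traffic, here is a minimal sketch of the sizing arithmetic. The load and throughput figures are hypothetical, chosen only so the result lands near the reported "couple hundred", and `gpus_needed` is an illustrative helper rather than anything from OpenAI or The Information. The report does not say that throughput is the quantity that changed.

```python
import math


def gpus_needed(tokens_per_second: float, tokens_per_gpu_second: float) -> int:
    """GPUs required to serve a steady load at a given per-GPU throughput."""
    if tokens_per_second < 0 or tokens_per_gpu_second <= 0:
        raise ValueError("load must be >= 0 and per-GPU throughput must be > 0")
    return math.ceil(tokens_per_second / tokens_per_gpu_second)


# Hypothetical inputs (assumptions, not reported figures).
load = 1_000_000.0             # tokens per second of logged-out traffic
throughput_before = 2_500.0    # tokens per second per GPU before the change
throughput_after = 5_000.0     # doubled throughput, i.e. half the cost per token

before = gpus_needed(load, throughput_before)
after = gpus_needed(load, throughput_after)
print(f"GPUs before: {before}, after: {after}")
```

Under these assumed inputs the fleet shrinks from 400 GPUs to 200. The same halving applied to a larger slice of traffic would free proportionally more hardware, which is why the scope of the rollout matters.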

Open Question

The quiet race nobody tweets about

Anthropic and Google are chasing the same server-level gains, yet no public benchmarks or code have appeared, so the exact trick and its broader applicability stay unknown for now.

Sentiment

Many users hail OpenAI's optimization for halving ChatGPT inference costs as a huge efficiency leap that could lower the price of intelligence, while a few blame it for recent drops in model quality.

Pos: 85.4%

Neg: 14.6%

13 comments with sentiment.

Related links

OpenAI Discovers New Way to Cut Inference Costs in Half (via The Information)

Posts from X

Most Activity

Views 60.3K · Bookmarks 209 · Likes 1.1K · Retweets 65 · Replies 44

Chubby♨️@kimmonismus

OpenAI reportedly found new inference optimizations that more than halved the cost of running its models!

According to The Information, engineers told colleagues this month that the techniques helped power ChatGPT for visitors without free or paid accounts using only a couple hundred Nvidia GPUs at one point.

The exact method is unclear. It could involve quantization, KV caching, batching, routing simpler queries to cheaper models, or some mix of all of those.

The business angle is bigger than the technical detail: OpenAI ended Q1 with a 39% gross margin and wants to reach 52% by year-end. Lower inference costs give it room to either improve margins, raise ChatGPT usage limits, or cut API pricing pressure on developers.

OpenAI's moat is increasingly becoming inference and cost advantage, especially against Anthropic.

1d60.3K1.1K209

Lisan al Gaib@scaling01

yet they can't cut costs for users in half

intelligence too cheap to meter has died long ago when they saw their first billion

Stephanie Palazzolo ✈️ ICML@steph_palazzolo

OpenAI engineers earlier this month developed an optimization that cut inference costs in half for models it was applied to.

After the optimization was applied to logged-out ChatGPT traffic, it reduced the number of GPUs needed to power that traffic to a couple hundred.

1d17.4K19724

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

wow did they discover speculative decoding or something? margins go up again!

Stephanie Palazzolo ✈️ ICML@steph_palazzolo

OpenAI engineers earlier this month developed an optimization that cut inference costs in half for models it was applied to.

After the optimization was applied to logged-out ChatGPT traffic, it reduced the number of GPUs needed to power that traffic to a couple hundred.

1d13.2K14316

Rohan Paul@rohanpaul_ai

The Information reports that OpenAI has cut inference costs by more than half on some existing models, while logged-out ChatGPT traffic ran on only a couple hundred Nvidia GPUs.

The obvious guesses include quantization, KV-cache changes, batching, speculative decoding, and routing easy queries to cheaper models.

If true, it will be a huge core competitive lever: lower costs can raise margins, expand usage limits, or reduce pressure on API pricing.

For some context, OpenAI’s adjusted gross margin fell to 33% in 2025 from 40% in 2024, after inference costs quadrupled.

Some reporting now puts Q1-2026 at 39%, with a 52% target by year-end.

Anthropic looks similar at roughly 44%, so frontier labs remain far below mature software economics.

---

theinformation.com/newsletters/ai-agenda/openai-discovers-new-way-cut-inference-costs-half

22h5K5017
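The margin figures quoted above (39% in Q1, a 52% year-end target) can be related to the reported cost cut with some back-of-the-envelope arithmetic. The sketch below assumes revenue stays flat, prices do not change, and a 50% cut applies to some share of cost of revenue. None of those assumptions come from the reporting, and both function names are hypothetical.

```python
def margin_after_cut(margin: float, affected_share: float, cut: float = 0.5) -> float:
    """Gross margin after removing `cut` of the cost on `affected_share` of COGS.

    Assumes revenue is unchanged (an assumption, not a reported fact).
    """
    cogs = 1.0 - margin  # cost of revenue per $1 of revenue
    return 1.0 - cogs * (1.0 - affected_share * cut)


def share_needed(margin: float, target: float, cut: float = 0.5) -> float:
    """Share of COGS that must see the cut for margin to reach `target`."""
    cogs = 1.0 - margin
    return (target - margin) / (cogs * cut)


q1_margin = 0.39   # reported Q1 gross margin
target = 0.52      # reported year-end target

# Upper bound: the cut reaches every dollar of cost of revenue.
print(f"margin if all costs halve: {margin_after_cut(q1_margin, 1.0):.1%}")

# Share of costs that would need halving to hit the target on its own.
print(f"share of COGS needed: {share_needed(q1_margin, target):.1%}")
```

On these assumptions, halving every dollar of cost would lift a 39% margin to 69.5%, while reaching 52% would need the cut to cover roughly 43% of cost of revenue. Since the optimization is reported only for the models it was applied to, the real effect depends on how widely it rolls out.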

Nathan Lambert@natolambert

@AndrewCurran_ probably under a narrow-ish set of circumstances or something, and then it gets reported like this

Andrew Curran@AndrewCurran_

OpenAI has found a way to cut inference costs in half.

1d7K1524

Jessica Lessin@Jessicalessin

Um this seems big. @steph_palazzolo

https://www.theinformation.com/articles/openai-discovers-new-way-cut-inference-costs-half?utm_source=ti_app&rc=hwneun

1d5.4K279

Chubby♨️@kimmonismus

https://www.theinformation.com/newsletters/ai-agenda/openai-discovers-new-way-cut-inference-costs-half?rc=bfliih

Chubby♨️@kimmonismus

OpenAI reportedly found new inference optimizations that more than halved the cost of running its models!

According to The Information, engineers told colleagues this month that the techniques helped power ChatGPT for visitors without free or paid accounts using only a couple hundred Nvidia GPUs at one point.

The exact method is unclear. It could involve quantization, KV caching, batching, routing simpler queries to cheaper models, or some mix of all of those.

The business angle is bigger than the technical detail: OpenAI ended Q1 with a 39% gross margin and wants to reach 52% by year-end. Lower inference costs give it room to either improve margins, raise ChatGPT usage limits, or cut API pricing pressure on developers.

OpenAI's moat is increasingly becoming inference and cost advantage, especially against Anthropic.

1d6.5K297

Stephanie Palazzolo@steph_palazzolo

More on the optimization + what it could mean for OpenAI's gross margins or usage limits here:

https://www.theinformation.com/newsletters/ai-agenda/openai-discovers-new-way-cut-inference-costs-half

1d2.2K196

Cody Blakeney@code_star

Crazy how quickly better hardware has turned the AI business from lighting piles of money on fire, to one of the highest margin businesses on the planet.

Hensen Juang@basedjensen

This puts inference margins at 75%

1d5.4K372
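The 75% figure in the quoted post only follows from a halved cost under a particular starting point. A short sketch, assuming price per token is unchanged and the full cost of serving is halved (both assumptions, and `implied_prior_margin` is a hypothetical helper), backs out what that starting margin would have to be.

```python
def implied_prior_margin(new_margin: float, cost_factor: float = 0.5) -> float:
    """Margin before a cost change, given the margin after it.

    `cost_factor` is new cost divided by old cost (0.5 means costs halved).
    Assumes price is unchanged, which is an assumption, not a reported fact.
    """
    if cost_factor <= 0:
        raise ValueError("cost_factor must be positive")
    new_cost = 1.0 - new_margin       # cost per $1 of revenue after the change
    old_cost = new_cost / cost_factor
    return 1.0 - old_cost


print(f"implied prior margin: {implied_prior_margin(0.75):.0%}")
```

In other words, a 75% inference margin after a halving corresponds to a 50% margin before it, a number the post does not source.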

Andrew Curran@AndrewCurran_

@natolambert Yes, need much more information.

Nathan Lambert@natolambert

@AndrewCurran_ probably under a narrow-ish set of circumstances or something, and then it gets reported like this

1d4.7K511

Jessica Lessin@Jessicalessin

OpenAI Halves Inference Costs???

What do you AI experts think about this latest development.. more incremental gains or something bigger?

Discuss over at @theinformation

https://www.theinformation.com/forum/posts/1205

1d4.1K91

Andrew Curran@AndrewCurran_

@iruletheworldmo Looks like they figured it out a few weeks ago.

🍓🍓🍓@iruletheworldmo

@AndrewCurran_ do you know if this is part of the reported price wars?

1d2.7K100

Solipsnitsyn@solipsnitsyn

@steph_palazzolo ohh so that's why 5.5 became so catastrophically stupid

1d93117

Michael@MichaelStolarz

@steph_palazzolo worded a bit poorly

>visitors who didn't have a free or paid account

what kind of account type is left then? i'm guessing grant/research/trial? would be better to know certainly instead of inferring

1d1.2K1

🍓🍓🍓@iruletheworldmo

@AndrewCurran_ do you know if this is part of the reported price wars?

1d8345

The Information@theinformation

OpenAI engineers figured out a way to more than halve the cost of inference, thanks to some newly-discovered optimizations.

Read more in today's AI Agenda: https://thein.fo/4gjQ42l

1d17.2K6531

Loquitur Ponte Sublicio@loquitur_ponte

@AndrewCurran_ How we train and run AI is probably really really inefficient given the make-it-up-as-we-went process.

Lot of gains to come on running the existing physical tech better / algorithmic improvements alone...

1d5031

cheaty@cheatyyyy

@AndrewCurran_ awfully convenient timing is all i'm going to say, not doubting OpenAI at all but this is a hilarious coincidence

what better way to cut inference costs in half than to double throughput

1d1385

Lisan al Gaib@scaling01

@RitsFur you mean like charging 30$ for a model that costs like 2$ to serve?

1d901

The Hero of KVcache@HeroOfKVcache

@steph_palazzolo >it's quantization again

1d7906