GPT-5.5 Cost Analysis: How Reduced Verbosity Softens the 2x Price Hike

Although GPT-5.5's list price rose by 100%, OpenRouter's analysis shows actual user costs rose by only 49-92%. The gap exists because the model generates significantly fewer completion tokens for prompts longer than 10K tokens. Users with shorter prompts see a larger cost increase, because completion lengths did not shrink for those tasks.

Key Points

OpenAI doubled list prices for GPT-5.5, to $5.00/M input tokens and $30.00/M output tokens.
Actual user costs increased by 49-92%, short of the full 100% list-price increase, because the model produces fewer completion tokens.
GPT-5.5 is significantly less verbose for long prompts, generating up to 34% fewer completion tokens for prompts over 128K tokens.
Short prompts under 10K tokens see the highest cost impact because completion lengths did not decrease.
The analysis used a switcher cohort methodology, comparing the same users' workflows before and after they switched versions.
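
To make the arithmetic behind these points concrete, here is a minimal sketch of how a doubled price combined with shorter completions yields an effective increase below 100%. The new prices come from the summary above; the old prices ($2.50/M and $15.00/M) are inferred from the word "doubled", and the token counts are hypothetical values chosen only for illustration, not figures from OpenRouter's data.

```python
# Prices in USD per million tokens.
# NEW_* are from the summary; OLD_* are an assumption (half of the new prices).
OLD_INPUT, OLD_OUTPUT = 2.50, 15.00
NEW_INPUT, NEW_OUTPUT = 5.00, 30.00


def request_cost(prompt_tokens, completion_tokens, input_price, output_price):
    """Cost in USD of one request at the given per-million-token prices."""
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000


def cost_increase_pct(prompt_tokens, old_completion_tokens, completion_reduction):
    """Percent change in cost when prices double and completions shrink.

    completion_reduction is the fraction of completion tokens saved
    (0.34 means 34% fewer completion tokens than before).
    """
    new_completion_tokens = old_completion_tokens * (1 - completion_reduction)
    old = request_cost(prompt_tokens, old_completion_tokens, OLD_INPUT, OLD_OUTPUT)
    new = request_cost(prompt_tokens, new_completion_tokens, NEW_INPUT, NEW_OUTPUT)
    return (new / old - 1) * 100


# Hypothetical long-prompt request: 150K prompt tokens, 20K completion tokens before.
long_prompt = cost_increase_pct(150_000, 20_000, 0.34)

# Hypothetical short-prompt request with no change in completion length.
short_prompt = cost_increase_pct(5_000, 2_000, 0.0)

print(f"long prompt:  +{long_prompt:.1f}%")
print(f"short prompt: +{short_prompt:.1f}%")
```

Under these assumed numbers, the long-prompt request costs roughly 70% more rather than 100% more, while the short-prompt request with unchanged completion length bears the full doubling. Input tokens still cost twice as much in both cases, so the size of the saving depends on how much of a request's cost comes from output tokens.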

Sentiment

The community is moderately skeptical. While many acknowledge GPT-5.5 offers genuine improvements, particularly for agentic coding, there is widespread concern that the price increase is not justified by proportional gains. Commenters frequently reframe the discussion away from per-token savings toward task-completion economics, and several advocate for open-source alternatives as a more cost-effective path.

In Agreement

GPT-5.5's reduced verbosity does translate to real cost savings compared to a naive 2x price increase, as OpenRouter's data shows
GPT-5.5 on low reasoning matches GPT-5.4 on medium reasoning at lower cost, making it a practical upgrade path
GPT-5.5 represents a genuine quality improvement for agentic coding and complex instruction following, with several independent benchmarks confirming this
Per-token cost analysis is a useful starting point for understanding the real-world impact of pricing changes

Opposed

Cost per token is misleading because it ignores multi-turn interactions and task completion efficiency; cost per completed engineering task is the metric that matters
The analysis lacks methodological rigor: no sample size, no distribution data, no control for number of turns in agentic workflows, and no understanding of task boundaries
LLM progress has plateaued and price increases reflect providers squeezing customers rather than delivering proportional value improvements
Higher reasoning levels can actually produce worse code due to scope creep and over-engineering, undermining the value proposition of expensive frontier models
Open-source models such as GLM 5.1, Kimi K2.6, and Xiaomi's models now offer competitive quality at far lower cost, making frontier model pricing hard to justify
Newer models are being overfitted for coding at the expense of general capabilities, with regressions observed in domains like NLP and linguistics