Building AI Learning Curves: A Vibe-Coded Visualization of AI Capability

The Prompt That Started It All

"Build a Gradio app that visualizes AI learning curves across domains with Plotly. Let users select domain, see animated learning progression, and compare pre-ChatGPT vs post-ChatGPT learning rates."

That was the prompt. What followed was a 90-minute vibe coding session that produced a fully interactive visualization tracing AI capability across six domains, with two-sigmoid math underneath, milestone markers, and a methodology tab that documents the underlying parameters. The artifact is live at AI Learning Curves; this article is the field note about why I built it and what the data actually shows.

Why Learning Curves Beat Benchmarks

The AI industry loves benchmarks: MMLU scores, HumanEval pass rates, GSM8K accuracy. But a benchmark is a snapshot. It tells you where a model is, not how it got there or how fast it is improving. Two models with the same MMLU score can be on very different trajectories if one sits on a curve that is still steepening and the other has plateaued. For decisions about where to invest engineering effort, that difference in trajectory matters far more than the point estimate.

Learning curves tell a richer story. They reveal acceleration patterns (which domains are improving fastest and why). They reveal regime changes (the visible inflection at ChatGPT's launch in late 2022). They reveal the distinction between nascent and mature domains (agentic systems exploded from near-zero capability; text generation improved from an already high baseline). None of these signals is visible in a single benchmark number.

The visualization covers six domains: reasoning and math, code generation, agentic systems, text and writing, image generation, and scientific research. Each domain has its own curve, its own pre-ChatGPT baseline, its own post-ChatGPT acceleration regime.

The Two-Sigmoid Model

Each domain's trajectory is modeled as the sum of two logistic functions: one for the pre-ChatGPT era of gradual research progress, and one for the post-ChatGPT acceleration. The result is a smooth curve, one S stacked on another, whose slope picks up visibly around late 2022.

This is deliberately simple modeling. A single sigmoid would miss the regime change, treating the entire trajectory as one smooth progression. Three or more sigmoids would overfit, chasing incidental bumps in the data rather than the underlying structural pattern. Two sigmoids capture the essential story: something fundamentally changed in late 2022, and the change manifests differently in each domain depending on how mature the field was before the regime shift.

The parameters of each sigmoid (inflection point, growth rate, ceiling) were estimated from public benchmark data anchored to milestone events like the GPT-3 release, the ChatGPT launch, the GPT-4 release, and the Claude 3 release. The full parameter table is in the methodology tab of the live visualization for anyone who wants to inspect or argue with the specific choices.
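As a rough illustration of the model described above, here is a minimal Python sketch of a two-sigmoid curve. The three fields mirror the parameters named in the text (inflection point, growth rate, ceiling), but the class name, function name, and every numeric value are placeholders invented for this example. They are not the fitted values from the methodology tab.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sigmoid:
    """One logistic component of a domain curve."""
    inflection: float  # year at which this component grows fastest
    rate: float        # growth rate (per year)
    ceiling: float     # maximum contribution of this component

    def __call__(self, year: float) -> float:
        return self.ceiling / (1.0 + math.exp(-self.rate * (year - self.inflection)))


def capability(year: float, pre: Sigmoid, post: Sigmoid) -> float:
    """Two-sigmoid model: pre-ChatGPT component plus post-ChatGPT component."""
    return pre(year) + post(year)


# Hypothetical parameters for a single illustrative domain (not real data).
pre = Sigmoid(inflection=2019.0, rate=0.8, ceiling=30.0)
post = Sigmoid(inflection=2023.5, rate=2.0, ceiling=60.0)

for year in (2018, 2020, 2022, 2024, 2026):
    print(year, round(capability(year, pre, post), 1))
```

Each component contributes half its ceiling at its own inflection year, and the combined curve levels off at the sum of the two ceilings. The same curve could be sampled on a dense year grid and handed to a plotting library.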

What the Data Actually Shows

The most interesting finding is that the post-ChatGPT acceleration is not uniform across domains. Some fields were transformed; others were nudged.
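One way to put a number on "transformed" versus "nudged" is to compare the average slope of a curve in a window before the regime shift with the average slope in a window after it. The sketch below does that for two made-up domains. The cut-over year of 2022.9 (approximating ChatGPT's launch in late 2022), the three-year window, and all curve parameters are assumptions chosen for illustration. They do not come from the visualization's data.

```python
import math

SHIFT_YEAR = 2022.9  # assumed cut-over, roughly late 2022


def logistic(year, inflection, rate, ceiling):
    return ceiling / (1.0 + math.exp(-rate * (year - inflection)))


def curve(year, components):
    """Sum of logistic components, each given as (inflection, rate, ceiling)."""
    return sum(logistic(year, *c) for c in components)


def mean_rate(components, start, end):
    """Average capability gained per year between start and end."""
    return (curve(end, components) - curve(start, components)) / (end - start)


def acceleration(components, window=3.0):
    """Ratio of the post-shift average rate to the pre-shift average rate."""
    before = mean_rate(components, SHIFT_YEAR - window, SHIFT_YEAR)
    after = mean_rate(components, SHIFT_YEAR, SHIFT_YEAR + window)
    return after / before


# Hypothetical domains: a nascent field with a low baseline, a mature one with a high baseline.
DOMAINS = {
    "nascent (hypothetical)": [(2021.0, 1.0, 5.0), (2024.0, 2.0, 70.0)],
    "mature (hypothetical)": [(2019.0, 1.0, 60.0), (2024.0, 2.0, 20.0)],
}

for name, components in DOMAINS.items():
    print(f"{name}: {acceleration(components):.2f}x")
```

With these placeholder numbers the nascent domain shows a much larger ratio than the mature one, which is the shape of the pattern described in the paragraphs that follow. The ratio is sensitive to the window length, so it is better read as a relative ranking than as an absolute measurement.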

Agentic systems show the steepest curve because the field essentially did not exist as a deployable capability before 2023. The post-ChatGPT regime is not an acceleration of prior work; it is the start of the practical work.

Reasoning and math show a qualitative shift, from pattern-matching that approximated symbolic reasoning to inference-time compute scaling that actually performs it. The acceleration is steep, and it is also genuinely a different kind of progress, which is hard to capture in a benchmark number that treats the two regimes as continuous.

Text and writing improved from the highest baseline but with the least dramatic acceleration, because the field was already producing fluent output by 2020. The post-ChatGPT improvement is real but incremental: better grounding, better factuality, better style control, rather than a phase transition in what is possible.

Image generation and code generation sit between the extremes. Both had meaningful pre-ChatGPT trajectories that the post-2022 regime accelerated rather than reinvented.

The summary, if there has to be one: the floor was raised everywhere, but the ceiling moved at different speeds depending on how mature the underlying capability was when the regime shifted.

The Meta-Experiment

The build process itself demonstrates the thesis the visualization argues: AI was used to build a tool that measures AI effectiveness. The build was conversational. Claude Code generated the core Gradio app on the first pass, then iterated on milestone markers, domain comparison overlays, and the rate acceleration bar chart. A methodology tab was added with parameter tables and benchmark source citations. The Python Plotly code was then ported to a React component for inline rendering on this site.

Total elapsed time, including the documentation and the React port: under two hours. The vibe coding loop (prompt, generate, inspect, refine, repeat) is the practical demonstration of what the visualization data is trying to show. Capability in 2026 looks different from capability in 2022 not because models got marginally better on benchmarks but because the loop between human intention and shipped artifact got dramatically shorter.

Try It Yourself

The interactive visualization is live at AI Learning Curves. Drag the year slider, compare domains, inspect the methodology tab. The math and the data are open for anyone who wants to argue with the specific parameter choices. The argument about why benchmarks miss the real story is built into what you see when you interact with it.
