
From 1 Product to 4: Scaling the Production Line

2026-02-27 · 5 min · James Williams

A month ago, Dark Agent Factory had one product: AdaptiveTest, a K-12 adaptive testing platform. Today we have four live products across two brands, serving different markets, all running production infrastructure with auth, billing, AI services, and monitoring.

Same engineer. Same production line. Here's what changed and what didn't.

The Portfolio

  • AdaptiveTest (adaptivetest.io) — K-12 adaptive testing with AI question generation and learning recommendations.
  • DAF Directory (directory.darkagentfactory.ai) — AI agent tools directory rated by the Dark Factory Score.
  • DAF Benchmark (benchmark.darkagentfactory.ai) — GitHub repo AI-readiness scoring with PR checks and team dashboards.
  • AdaptiveTest Skills (skills.adaptivetest.io) — Developer AI proficiency assessments producing a Dark Factory Score.

What Scales

The stack scales. Every product uses the same architecture: Next.js + FastAPI + PostgreSQL + Clerk + Stripe + Claude AI. Frontend on Vercel, backend on Railway. Once you've solved auth, billing, CI/CD, and deployment for one product, the next one is dramatically faster. The agent knows the patterns. The specs reference the patterns. The decisions are already made.

The spec process scales. Writing specs gets faster with practice. You learn what the agent needs to know, what it can figure out, and where ambiguity causes problems. By the third product, the spec files were tighter and the implementation required fewer iterations.

The design system scales. All DAF products share the same dark theme: cyan (#00F0FF), green (#00FF88), near-black background (#0A0A0F), Space Grotesk headings, Inter body text. The agent knows these tokens. New products look like they belong in the family without any design coordination overhead.
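The shared theme can be captured as a single token module. This is a minimal sketch, not the actual DAF codebase: the file layout and token names (`dafTheme`, `accentCyan`, etc.) are assumptions; only the hex values and font families come from the paragraph above.

```typescript
// Hypothetical shared design tokens for the DAF product family.
// Values are from the post; names and structure are illustrative.
export const dafTheme = {
  colors: {
    accentCyan: "#00F0FF",   // primary accent
    accentGreen: "#00FF88",  // secondary accent
    background: "#0A0A0F",   // near-black base
  },
  fonts: {
    heading: "'Space Grotesk', sans-serif",
    body: "'Inter', sans-serif",
  },
} as const;
```

Importing one module like this from every product is what removes the design-coordination overhead: a new product inherits the family look by default.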

What Doesn't Scale

Infrastructure management doesn't scale linearly. Each product has its own Vercel project, Railway service, PostgreSQL database, Clerk application, Stripe account configuration, DNS records, and environment variables. Four products means four sets of credentials, four CI pipelines, four deployment targets. This is manageable at four. At ten, it would need automation.
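The "manageable at four, needs automation at ten" point can be made concrete by treating the per-product checklist as data. This is a hedged sketch of what that automation could start from; the product names and domains are from the post, while the service list and all identifiers are assumptions.

```typescript
// Hypothetical product registry: one entry per live product.
interface Product {
  name: string;
  domain: string;
}

const products: Product[] = [
  { name: "AdaptiveTest", domain: "adaptivetest.io" },
  { name: "DAF Directory", domain: "directory.darkagentfactory.ai" },
  { name: "DAF Benchmark", domain: "benchmark.darkagentfactory.ai" },
  { name: "AdaptiveTest Skills", domain: "skills.adaptivetest.io" },
];

// Every product needs the same infrastructure pieces (assumed list).
const services = [
  "Vercel", "Railway", "PostgreSQL", "Clerk", "Stripe", "DNS",
];

// Operational surface grows as products × services, not linearly
// with products alone: 24 credential/config sets at four products.
const surfaceArea = products.length * services.length;
```

A script iterating a registry like this could verify env vars, DNS records, and CI status per product, which is roughly the automation the ten-product future would require.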

Context switching has a cost. Each product has its own CLAUDE.md, its own gotchas, its own patterns. Jumping between four codebases means the agent (and the human) need to reload context. The CLAUDE.md files help — they're the institutional memory — but there's still friction.

Monitoring multiplies. Four products means four Sentry projects, four sets of logs, four potential sources of 3am alerts. Right now the volume is low enough that this is manageable. But the operational surface area grows with every product.

The Production Line at Scale

The interesting realization: the bottleneck has shifted. At one product, the bottleneck was implementation speed — could we ship features fast enough? AI agents solved that. At four products, the bottleneck is specification and prioritization — which product gets attention, what features matter most, how do you maintain quality across a growing portfolio?

This is exactly the Level 3 to Level 4 transition we wrote about. At Level 3, the human reviews and directs. At Level 4, the human only writes specs and makes product decisions. We're feeling the pull toward Level 4 now — not because we want to go faster, but because we have to. Four products demand it.

What's Next

The Dark Factory Score now connects all four products into a coherent ecosystem. Directory rates tools. Benchmark scores codebases. Skills measures developers. AdaptiveTest applies AI to education. The thread is AI production maturity — measuring it, improving it, and demonstrating it.

We're not stopping at four. But the next products will come from the ecosystem gaps we discover, not from a quota. The production line is ready. The question is what to build next — and that's a product decision, not an engineering one.

One engineer. Four products. The lights are still off.