In the span of 10 days, Dark Agent Factory shipped three new products to production: DAF Directory (directory.darkagentfactory.ai), DAF Benchmark (benchmark.darkagentfactory.ai), and AdaptiveTest Skills (skills.adaptivetest.io). Each one started as a set of spec files and ended as a live fullstack application with auth, billing, AI services, and production infrastructure.
This is what Level 3+ production looks like when the production line is dialed in.
The Spec-First Process
Every product started the same way: a specs/ directory with 5-6 markdown files covering content, design system, page structure, site architecture, data model, and requirements. These specs are the source of truth. The agent reads them before writing a single line of code.
This isn't new — we've been writing terminal briefs since the beginning. But the spec files are more comprehensive than a brief. They cover the entire product: every page, every API endpoint, every database table, every component, every piece of copy. The agent doesn't need to ask questions. It has the answers.
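Based on the categories above, a product's spec directory presumably looks something like this (file names are illustrative, not taken from the actual repos):

```
specs/
├── content.md            # every piece of copy
├── design-system.md      # tokens, typography, components
├── page-structure.md     # per-page layout and components
├── site-architecture.md  # routes, navigation, API surface
├── data-model.md         # tables, relationships, migrations
└── requirements.md       # features, tiers, acceptance criteria
```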
Product 1: DAF Directory
A curated directory of AI agent tools, each rated by the Dark Factory Score across five dimensions. Next.js frontend on Vercel, FastAPI backend on Railway, PostgreSQL database, Clerk auth, Stripe billing (Free / Pro at $19 / Enterprise at $99).
The AI pipeline is the interesting part. When a new tool is submitted, Claude Haiku classifies it (is this actually an AI agent tool?), then Claude Sonnet scores it across autonomy, code quality, integration, self-correction, and project complexity. Each dimension gets a 0-5 score with a written rationale. The weighted average becomes the Dark Factory Score.
Full-text search uses PostgreSQL TSVECTOR with GIN indexes — no external search service needed. The backend has 9 router groups, rate limiting, SSRF protection on the crawler, and security middleware.
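The TSVECTOR-plus-GIN approach looks roughly like the following (table and column names are hypothetical, not the actual schema):

```python
# Illustrative Postgres full-text search setup: a generated tsvector column
# with a GIN index, queried via websearch_to_tsquery. The `tools` table and
# its columns are assumed for the sake of the example.

DDL = """
ALTER TABLE tools ADD COLUMN search_vector tsvector
    GENERATED ALWAYS AS (
        to_tsvector('english',
            coalesce(name, '') || ' ' || coalesce(description, ''))
    ) STORED;
CREATE INDEX idx_tools_search ON tools USING GIN (search_vector);
"""

SEARCH_SQL = """
SELECT id, name,
       ts_rank(search_vector, websearch_to_tsquery('english', :q)) AS rank
FROM tools
WHERE search_vector @@ websearch_to_tsquery('english', :q)
ORDER BY rank DESC
LIMIT 20;
"""
```

Because the column is generated, the index stays current on every write with no sync job — one reason no external search service is needed at this scale.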
Product 2: DAF Benchmark
Score any GitHub repo on AI readiness. Connect your GitHub account via our GitHub App, and we'll clone your repo (shallow, to /tmp, deleted after analysis), scan it across five dimensions, and produce a Dark Factory Score.
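The clone-scan-delete flow can be sketched like this (function names and the analyzer hook are assumptions, not the real implementation):

```python
# Hypothetical sketch of the scan flow: shallow clone into a temp directory
# under /tmp, run the analysis, then delete the checkout no matter what.
import shutil
import subprocess
import tempfile

def clone_command(repo_url: str, dest: str) -> list[str]:
    # --depth 1 keeps the clone shallow: latest snapshot only, no history.
    return ["git", "clone", "--depth", "1", repo_url, dest]

def scan_repo(repo_url: str, analyze) -> dict:
    tmp = tempfile.mkdtemp(prefix="daf-scan-", dir="/tmp")
    try:
        subprocess.run(clone_command(repo_url, tmp), check=True,
                       capture_output=True)
        return analyze(tmp)  # e.g. score the five dimensions from the files
    finally:
        shutil.rmtree(tmp, ignore_errors=True)  # deleted after analysis
```

The `finally` block is the important part: the checkout is removed even if the analysis raises, so scanned code never lingers on disk.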
The scoring dimensions are different from Directory — they're tuned for codebases, not tools: context readiness, test infrastructure, architecture clarity, automation maturity, and AI integration. Claude Sonnet analyzes the codebase and scores each dimension with evidence from the actual code.
Features include PR checks (the GitHub App runs a scan on every PR and posts results as a check), team dashboards, weekly scheduled scans, and a public leaderboard (opt-in). The scan queue is priority-based — Team and Enterprise users get priority 1-3, free users get priority 5. Scans are capped at 3 concurrent to manage the CPU load of git clones.
Product 3: AdaptiveTest Skills
Adaptive assessments measuring developer AI proficiency across 6 skill domains. This uses the same Dark Factory Score (0-5) as Directory and Benchmark, creating a consistent measurement language across the entire DAF ecosystem.
Claude Haiku generates assessment questions (~7 seconds, $0.01-0.02 per domain). The adaptive engine adjusts difficulty in real time based on performance. After the session, Claude Sonnet produces a detailed score analysis and personalized learning recommendations (~25 seconds).
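The adaptive loop can be reduced to a few lines. A deliberately simplified sketch — step difficulty up on a correct answer, down on a miss, clamped to the 0-5 scale (the production engine is surely more sophisticated; the step size here is an assumption):

```python
# Toy adaptive difficulty engine: each answer nudges the difficulty of the
# next question, bounded to the 0-5 Dark Factory Score scale. Starting point
# and step size are illustrative values.

class AdaptiveEngine:
    def __init__(self, start: float = 2.5, step: float = 0.5):
        self.difficulty = start
        self.step = step

    def record(self, correct: bool) -> float:
        """Update difficulty after an answer; return the new target level."""
        delta = self.step if correct else -self.step
        self.difficulty = min(5.0, max(0.0, self.difficulty + delta))
        return self.difficulty
```

Real adaptive engines typically weight the step by question difficulty and response confidence, but the core feedback loop is the same shape.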
Billing supports both individual users and organizations via Stripe. The backend handles org admin roles, team aggregation, and billing lifecycle.
The Shared Stack
All three products share the same architecture: Next.js 16 + React 19 frontend, FastAPI + SQLAlchemy 2.0 async backend, PostgreSQL 15, Clerk JWT auth, Stripe billing, Claude AI services. Frontend on Vercel, backend on Railway. CI/CD via GitHub Actions. Sentry for error monitoring.
This isn't accidental. Once you have a proven production stack, you don't reinvent it — you stamp it out. The agent knows this stack. The specs reference this stack. The deployment pipeline is the same. Each new product is faster than the last because the patterns are established.
What the Human Did
Wrote the spec files. Made architecture decisions (which scoring dimensions for each product, how to structure billing tiers, what the GitHub App permissions should be). Reviewed the implementations. Configured DNS, Clerk, Stripe, and Railway. Verified production deployments.
What the human did NOT do: write application code, write tests, design components, configure CI/CD, or debug implementation issues.
The Numbers
3 products. 10 days. Each with: fullstack monorepo, database migrations, auth, billing, AI services, CI/CD, test suites, security headers, rate limiting, and production deployment. All maintained by one engineer.
This is the thesis in action. The production line works. The spec is the bottleneck now — not the implementation.