The Dark Factory Score

2026-02-25 · 6 min · James Williams

We now have three products that all produce the same measurement: the Dark Factory Score, a 0-5 scale. DAF Directory rates AI agent tools. DAF Benchmark scores GitHub repos on AI readiness. AdaptiveTest Skills measures developer AI proficiency. Same scale, different targets.

This wasn't an accident. We designed the score to be a shared language for AI production maturity — whether you're evaluating a tool, a codebase, or a person.

The Scale

The Dark Factory Score maps to the Five Levels of AI Production that we've been writing about since day one:

  • 0 — Manual: No AI involvement. Everything by hand.
  • 1 — Autocomplete: Basic AI suggestions. Tab completion.
  • 2 — Pair Programmer: AI writes blocks of code in conversation.
  • 3 — Agent-Assisted: AI agents build features end-to-end. Human reviews.
  • 4 — Spec-Driven: Write a spec, AI implements it completely.
  • 5 — Dark Factory: Fully autonomous. Business goals in, software out.
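The levels above can be sketched as a simple lookup. This is an illustrative assumption, not the products' actual code: in particular, flooring a fractional overall score down to its level is our guess at how a 3.2 maps to "Agent-Assisted."

```python
# Hypothetical mapping from the post's 0-5 scale to level names.
LEVELS = {
    0: "Manual",
    1: "Autocomplete",
    2: "Pair Programmer",
    3: "Agent-Assisted",
    4: "Spec-Driven",
    5: "Dark Factory",
}

def level_label(score: float) -> str:
    """Bucket a 0-5 Dark Factory Score into its level name.

    Flooring fractional scores is an assumption for illustration.
    """
    if not 0 <= score <= 5:
        raise ValueError("score must be in [0, 5]")
    return LEVELS[int(score)]
```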

How It Works: Tools (DAF Directory)

When an AI agent tool is submitted to the Directory, Claude AI scores it across five weighted dimensions:

  • Autonomy (28%): How independently can the tool operate?
  • Code Quality (20%): Does it produce production-grade output?
  • Integration (17%): How well does it fit into existing workflows?
  • Self-Correction (15%): Can it detect and fix its own mistakes?
  • Project Complexity (20%): What scale of project can it handle?

Each dimension is scored 0-5 with a written rationale. The weighted average becomes the overall Dark Factory Score. This isn't a subjective rating — it's a structured AI analysis with transparent reasoning.
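A minimal sketch of that weighted average, using the weights from the list above. The dimension key names and the example scores are made up for illustration; only the weights come from the post.

```python
# Directory weights, as published in the post (must sum to 1.0).
TOOL_WEIGHTS = {
    "autonomy": 0.28,
    "code_quality": 0.20,
    "integration": 0.17,
    "self_correction": 0.15,
    "project_complexity": 0.20,
}

def dark_factory_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension 0-5 scores, rounded to 2 decimals."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return round(sum(scores[d] * w for d, w in weights.items()), 2)

# Hypothetical per-dimension scores for one tool.
example_tool = {
    "autonomy": 4,
    "code_quality": 3,
    "integration": 4,
    "self_correction": 3,
    "project_complexity": 4,
}
```

With those example scores the weighted average works out to 3.65, i.e. a tool sitting between Agent-Assisted and Spec-Driven.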

How It Works: Codebases (DAF Benchmark)

When you connect a GitHub repo to DAF Benchmark, we clone it (shallow, deleted after analysis) and score it across five dimensions tuned for codebase evaluation:

  • Context Readiness (25%): Can an AI understand the codebase without a human? CLAUDE.md, documentation, clear naming.
  • Test Infrastructure (25%): Can AI validate its own changes? Test coverage, CI/CD, automated checks.
  • Architecture Clarity (20%): Can AI navigate component boundaries? Clean separation, consistent patterns.
  • Automation Maturity (15%): How much workflow is automated? CI/CD, deployment, monitoring.
  • AI Integration (15%): Is AI a first-class development partner? AI-specific config, agent-friendly structure.

The dimensions are different from Directory because the question is different. For tools, we ask “how good is this tool at AI-native development?” For codebases, we ask “how ready is this code for AI agents to work on it?”
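The Benchmark score follows the same weighted-average pattern with its own dimensions. As before, only the weights are from the post; the key names and the sample repo scores are hypothetical.

```python
# Benchmark weights, as published in the post (must sum to 1.0).
REPO_WEIGHTS = {
    "context_readiness": 0.25,
    "test_infrastructure": 0.25,
    "architecture_clarity": 0.20,
    "automation_maturity": 0.15,
    "ai_integration": 0.15,
}

def benchmark_score(scores: dict[str, float]) -> float:
    """Weighted average of per-dimension 0-5 repo scores, rounded to 2 decimals."""
    return round(sum(scores[d] * w for d, w in REPO_WEIGHTS.items()), 2)

# Hypothetical scores for a repo with solid tests but weak AI integration.
sample_repo = {
    "context_readiness": 2,
    "test_infrastructure": 4,
    "architecture_clarity": 3,
    "automation_maturity": 3,
    "ai_integration": 1,
}
```

Note how the weighting front-loads Context Readiness and Test Infrastructure: a repo that AI can neither understand nor validate against can't score well no matter how clean its architecture is.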

How It Works: Developers (AdaptiveTest Skills)

AdaptiveTest Skills measures developer AI proficiency across six skill domains through adaptive assessments. AI generates the questions, adjusts difficulty based on performance, and produces a Dark Factory Score for each domain and overall.

The adaptive engine uses Item Response Theory to select the right difficulty level in real time. After the session, Claude Sonnet analyzes the results and generates personalized recommendations for improving each dimension.
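To make the IRT idea concrete, here is a minimal sketch using the one-parameter (Rasch) model: each item's Fisher information peaks when its difficulty matches the test-taker's current ability estimate, so the engine serves the most informative unanswered item. The post only says "Item Response Theory"; the specific 1PL model and this selection rule are illustrative assumptions.

```python
import math

def p_correct(theta: float, b: float) -> float:
    """Rasch (1PL) probability of answering correctly at ability theta, item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta: float, b: float) -> float:
    """Fisher information of a 1PL item: P * (1 - P), maximal when b == theta."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

def next_item(theta: float, bank: list[float]) -> float:
    """Pick the item difficulty that is most informative at the current ability estimate."""
    return max(bank, key=lambda b: item_information(theta, b))
```

In practice the ability estimate is updated after each response (e.g. by maximum likelihood) and the loop repeats until the estimate stabilizes; this sketch only shows the item-selection step.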

Why One Score

Having a single, consistent scale across tools, codebases, and developers creates a shared vocabulary. When we say a team is “Level 3,” everyone knows what that means — whether they got there through a tool assessment, a codebase scan, or a skills evaluation.

It also creates a flywheel. Score your codebase with Benchmark. Find tools to improve it in Directory. Measure your team's readiness with Skills. All three products reinforce each other, and they all speak the same language.

The Honest Part

The Dark Factory Score is an AI-generated assessment, not ground truth. It's a structured analysis with transparent reasoning, but it's still a model's interpretation. We publish the dimensions, the weights, and the rationale for every score. If you disagree with a rating, you can see exactly why the AI scored it that way.

We believe transparency builds trust. The score is a starting point for conversation, not the final word.