Parity Testing

Parity testing answers a single question: does the new implementation produce the same outputs as the legacy system for the same inputs? This is the bridge between “we built it” and “we can ship it.” ModernizeSpec’s parity-tests.json captures the test cases, expected outputs, and confidence scores that determine when extraction is complete.

The concept of characterization tests — first described in Michael Feathers’ Working Effectively with Legacy Code (Chapter 13) — flips the usual testing assumption. Instead of testing what the code should do according to a specification, you test what it actually does in practice.

In a modernization context, characterization tests work like this:

  1. Select a legacy function or workflow that the new system must replicate
  2. Execute it with known inputs and capture the exact output
  3. Write a test asserting that output — even if the behavior seems incorrect
  4. That test now defines the baseline the new implementation must match

The characterization test does not judge whether the behavior is correct. It captures reality. If the legacy system rounds tax to 2 decimal places when it should use 4, the characterization test asserts 2 decimal places. The new system must reproduce this behavior (or the team must explicitly decide to fix it and document the deviation).
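A characterization test for the rounding example above might look like the following sketch. `legacy_calculate_tax` is a hypothetical stand-in for the legacy routine; in practice you would call the legacy system through whatever interface it exposes.

```python
from decimal import Decimal, ROUND_HALF_UP

def legacy_calculate_tax(amount: Decimal, rate: Decimal) -> Decimal:
    # Hypothetical stand-in: the legacy system rounds tax to 2 decimal
    # places even though policy calls for 4.
    return (amount * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

def test_characterize_tax_rounding():
    # Characterization test: asserts what the system ACTUALLY does.
    # Known deviation: 4-dp rounding would give 0.1802, but legacy
    # produces 0.18 — we preserve that behavior for parity.
    assert legacy_calculate_tax(Decimal("1.001"), Decimal("0.18")) == Decimal("0.18")
```

Note that the test documents the deviation in a comment rather than "fixing" it — the assertion is the baseline, not a correctness claim.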

| Approach | Tests Against | Risk |
| --- | --- | --- |
| Specification tests | What the system should do (requirements docs) | Requirements may be outdated, incomplete, or wrong |
| Characterization tests | What the system actually does (runtime output) | Captures bugs as “expected” behavior |

For migration, characterization tests are safer. The legacy system has been running in production — its behavior, including its bugs, is what users depend on. Changing behavior during migration introduces risk that is separate from the extraction itself.

When a characterization test captures a known bug:

  1. Document it in the test: “Legacy rounds to 2 decimal places; should be 4”
  2. Preserve the behavior in the new implementation initially
  3. Create a separate task to fix the bug after migration is proven
  4. Mark it in parity-tests.json with a knownDeviation field
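An entry carrying a `knownDeviation` field might look like this sketch — only the `knownDeviation` field name comes from the workflow above; the nested fields shown inside it are illustrative assumptions about how a team might structure the record:

```json
{
  "id": "tax-calc-rounding",
  "module": "taxation",
  "description": "Tax rounded to 2 decimal places",
  "knownDeviation": {
    "behavior": "Legacy rounds to 2 decimal places; should be 4",
    "preservedInNewSystem": true,
    "followUp": "Fix after migration is proven"
  },
  "status": "passing"
}
```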

The most scalable approach to parity testing is table-driven: a matrix of inputs and expected outputs, run through both implementations.

| Input | Legacy Output | New Output | Match |
| --- | --- | --- | --- |
| Invoice: 3 items, GST 18% | Total: 11,800.00, Tax: 1,800.00 | Total: 11,800.00, Tax: 1,800.00 | Pass |
| Invoice: 1 item, exempt | Total: 500.00, Tax: 0.00 | Total: 500.00, Tax: 0.00 | Pass |
| Invoice: discount + tax | Total: 9,440.00, Tax: 1,440.00 | Total: 9,440.00, Tax: 1,440.00 | Pass |
| Invoice: multi-currency | Total: 850.00 USD, Tax: 153.00 | Total: 850.00 USD, Tax: 153.00 | Pass |
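A minimal sketch of a table-driven runner for a matrix like this. The two implementation functions are illustrative stand-ins for calls into the legacy and new systems:

```python
def legacy_invoice_total(items, tax_rate):
    # Stand-in for a call into the legacy system.
    subtotal = sum(items)
    tax = round(subtotal * tax_rate, 2)
    return subtotal + tax, tax

def new_invoice_total(items, tax_rate):
    # Stand-in for the new implementation under test.
    subtotal = sum(items)
    tax = round(subtotal * tax_rate, 2)
    return subtotal + tax, tax

# Each row: ((items, tax_rate), (expected_total, expected_tax)),
# where the expected values were captured from the legacy system.
CASES = [
    (([5000, 3000, 2000], 0.18), (11800.00, 1800.00)),  # GST 18%, 3 items
    (([500], 0.0), (500.00, 0.00)),                      # exempt invoice
]

def run_parity(cases):
    failures = []
    for inputs, expected in cases:
        legacy_out = legacy_invoice_total(*inputs)
        new_out = new_invoice_total(*inputs)
        if not (legacy_out == expected == new_out):
            failures.append((inputs, legacy_out, new_out))
    return failures
```

Because the matrix is data, adding a scenario is one new row — no new test code.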

Extract real inputs and outputs from the legacy system’s database or logs:

  1. Query the legacy database for completed transactions
  2. Record the input state (what was sent to the system)
  3. Record the output state (what the system produced)
  4. Use data subsetting to create a manageable fixture set

Advantage: Captures real-world scenarios including edge cases you would never think to write.

Risk: Production data contains PII and must be anonymized before it is used in fixtures.
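The capture step above can be sketched as a small extraction script. The table and column names here are illustrative assumptions, not a real legacy schema:

```python
import json
import sqlite3

def capture_fixtures(db_path: str, limit: int = 100) -> list[dict]:
    # Pull completed transactions from a (hypothetical) legacy table and
    # shape them into parity fixture entries.
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, input_json, output_json FROM transactions "
        "WHERE status = 'completed' LIMIT ?",
        (limit,),
    ).fetchall()
    conn.close()
    return [
        {
            "id": f"prod-{row[0]}",
            "input": json.loads(row[1]),           # state sent to the system
            "expectedOutput": json.loads(row[2]),  # state the system produced
            "source": "production-capture",
        }
        for row in rows
    ]
```

The `limit` parameter is a crude form of data subsetting; real subsetting would also sample for variety (entity types, date ranges) rather than take the first N rows.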

Each row in the table becomes an entry in parity-tests.json:

```json
{
  "id": "tax-calc-gst-18",
  "module": "taxation",
  "description": "Standard GST 18% on 3-item invoice",
  "input": {
    "items": [
      { "amount": 5000 },
      { "amount": 3000 },
      { "amount": 2000 }
    ],
    "taxRate": 0.18
  },
  "expectedOutput": {
    "subtotal": 10000.00,
    "taxAmount": 1800.00,
    "total": 11800.00
  },
  "source": "production-capture",
  "status": "passing"
}
```
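A runner can replay entries like this against the new implementation. The dispatch table mapping module names to functions, and the `new_taxation` implementation, are assumptions about how the new system is wired up:

```python
import json

def run_parity_entry(entry: dict, implementations: dict) -> bool:
    # Look up the new implementation for this module and compare its
    # output against the legacy-captured expectedOutput.
    fn = implementations[entry["module"]]
    return fn(entry["input"]) == entry["expectedOutput"]

def new_taxation(payload: dict) -> dict:
    # Illustrative new implementation of the taxation module.
    subtotal = sum(item["amount"] for item in payload["items"])
    tax = round(subtotal * payload["taxRate"], 2)
    return {"subtotal": subtotal, "taxAmount": tax, "total": subtotal + tax}

entry = json.loads("""{
  "id": "tax-calc-gst-18",
  "module": "taxation",
  "input": {"items": [{"amount": 5000}, {"amount": 3000}, {"amount": 2000}],
            "taxRate": 0.18},
  "expectedOutput": {"subtotal": 10000, "taxAmount": 1800.0, "total": 11800.0}
}""")
```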

Behavioral snapshots are a heavier-weight version of characterization tests. Instead of testing individual functions, they capture the full response of the legacy system to a realistic request.

| Artifact | How to Capture | Storage |
| --- | --- | --- |
| API responses | Record HTTP response body, headers, status | JSON files |
| Database writes | Capture rows written after an operation | SQL or JSON fixtures |
| Computed values | Log intermediate calculations | Structured log entries |
| Side effects | Record emails sent, events emitted, files written | Event log |

Store snapshots as “golden files” — reference outputs that the new system must reproduce exactly.

```
fixtures/
├── tax-calculation/
│   ├── input-001.json    # Input to the function
│   ├── golden-001.json   # Expected output (captured from legacy)
│   ├── input-002.json
│   └── golden-002.json
└── gl-posting/
    ├── input-001.json
    └── golden-001.json   # Expected GL entries
```

The test runner:

  1. Reads each input-*.json
  2. Passes it through the new implementation
  3. Compares the output to the corresponding golden-*.json
  4. Reports exact differences (field-level diff, not just pass/fail)
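The four steps above can be sketched as follows — a minimal golden-file runner, assuming the fixture layout shown earlier and JSON-object outputs:

```python
import json
from pathlib import Path

def field_diff(expected: dict, actual: dict, prefix: str = "") -> list[str]:
    # Report field-level differences, not just pass/fail.
    diffs = []
    for key in expected.keys() | actual.keys():
        path = f"{prefix}{key}"
        exp, act = expected.get(key), actual.get(key)
        if isinstance(exp, dict) and isinstance(act, dict):
            diffs.extend(field_diff(exp, act, prefix=f"{path}."))
        elif exp != act:
            diffs.append(f"{path}: expected {exp!r}, got {act!r}")
    return diffs

def run_golden_tests(fixture_dir: Path, implementation) -> dict:
    # For each input-*.json, run the new implementation and diff the
    # result against the matching golden-*.json.
    results = {}
    for input_file in sorted(fixture_dir.glob("input-*.json")):
        golden_file = input_file.with_name(
            input_file.name.replace("input-", "golden-"))
        actual = implementation(json.loads(input_file.read_text()))
        expected = json.loads(golden_file.read_text())
        results[input_file.stem] = field_diff(expected, actual)
    return results
```

An empty diff list for every fixture means parity holds; a non-empty list tells you exactly which fields drifted.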

When the new system intentionally deviates from legacy behavior (bug fixes, improvements):

  1. Document the deviation in parity-tests.json with knownDeviation
  2. Update the golden file to reflect the new expected output
  3. Record the reason for the change in version control

Not all parity is equal. A module with 50 passing tests on happy paths but zero tests on error paths has limited real confidence. Confidence scoring quantifies how trustworthy the parity evidence is.

| Dimension | Weight | Measurement |
| --- | --- | --- |
| Happy path coverage | 1x | Percentage of normal workflows tested |
| Error path coverage | 2x | Percentage of error/exception paths tested |
| Edge case coverage | 2x | Boundary values, empty inputs, maximum sizes |
| Data variety | 1.5x | Diversity of test inputs (currencies, date ranges, entity types) |
| Production traffic representation | 3x | How closely test inputs match actual production usage patterns |

Production traffic representation and error-path coverage carry the heaviest weights because those are where surprises emerge in production.
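One plausible aggregation is a weighted average normalized back to a 0–100 scale — a sketch, since the exact formula ModernizeSpec uses is not specified here (note that it need not reproduce the `overall: 78` in the example below):

```python
# Weights from the dimension table above.
WEIGHTS = {
    "happyPath": 1.0,
    "errorPath": 2.0,
    "edgeCases": 2.0,
    "dataVariety": 1.5,
    "productionRepresentation": 3.0,
}

def overall_confidence(scores: dict) -> int:
    # Weighted average of per-dimension scores, normalized to 0-100.
    weighted = sum(scores[dim] * w for dim, w in WEIGHTS.items())
    return round(weighted / sum(WEIGHTS.values()))

scores = {"happyPath": 95, "errorPath": 45, "edgeCases": 72,
          "dataVariety": 80, "productionRepresentation": 60}
```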

| Score | Label | Meaning | Decision |
| --- | --- | --- | --- |
| 0–30 | Low | Minimal testing, major gaps | Do not proceed to shadow mode |
| 31–60 | Moderate | Core paths tested, gaps in edges | Proceed with caution, add tests |
| 61–85 | High | Comprehensive testing, few gaps | Ready for shadow mode |
| 86–100 | Very High | Exhaustive testing including production traffic replay | Ready for production cutover |

Confidence scores are recorded per module in parity-tests.json:

```json
{
  "module": "taxation",
  "confidence": {
    "overall": 78,
    "happyPath": 95,
    "errorPath": 45,
    "edgeCases": 72,
    "dataVariety": 80,
    "productionRepresentation": 60
  }
}
```

This makes confidence transparent to AI agents and team leads reviewing migration progress.

Once parity is proven, the tests serve a second purpose: regression guards. Any future change to the new system that breaks an established parity test must be intentional and documented.

```
Capture baseline ──▶ Prove parity ──▶ Guard regressions ──▶ Retire
                                                            (when legacy is
                                                             fully decommissioned)
```

Parity tests are retired only after the legacy system is completely removed. Until then, they remain active as regression guards.

Run parity tests on every pull request that touches an extracted module:

  1. PR modifies code in src/taxation/ → run taxation parity tests
  2. Any failure blocks merge
  3. If a deviation is intentional, the PR must update the golden file and add a knownDeviation entry
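The path-to-module mapping in step 1 can be sketched as a small helper for the CI gate. The `src/<module>/` layout is an assumption generalized from the `src/taxation/` example above, and the module set is illustrative:

```python
# Modules that have been extracted and therefore have parity suites.
EXTRACTED_MODULES = {"taxation", "gl-posting"}

def modules_to_test(changed_files: list[str]) -> set[str]:
    # Map changed file paths to extracted modules so the CI job runs
    # only the relevant parity suites.
    modules = set()
    for path in changed_files:
        parts = path.split("/")
        if len(parts) >= 2 and parts[0] == "src" and parts[1] in EXTRACTED_MODULES:
            modules.add(parts[1])
    return modules
```

The CI job would then invoke one parity suite per module returned, and fail the build on any non-empty diff.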

Legacy code often resists testing because of hard-coded dependencies, framework coupling, and deeply nested call chains. Michael Feathers catalogs 24 dependency-breaking techniques in Working Effectively with Legacy Code (Chapter 25). The core strategies relevant to migration parity fall into three categories:

Introduce Abstraction Boundaries

Place an interface or protocol between concrete classes so both the legacy and new implementations can be tested through the same contract. This lets you run the same parity test against both systems.
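A sketch of such a boundary using a structural protocol — all class and method names here are illustrative, not from the source:

```python
from typing import Protocol

class TaxCalculator(Protocol):
    # The shared contract both systems are tested through.
    def calculate(self, amount: float, rate: float) -> float: ...

class LegacyTaxAdapter:
    def calculate(self, amount: float, rate: float) -> float:
        # Would delegate to the legacy system (RPC, subprocess, etc.);
        # inlined here for illustration.
        return round(amount * rate, 2)

class NewTaxService:
    def calculate(self, amount: float, rate: float) -> float:
        return round(amount * rate, 2)

def parity_check(impl: TaxCalculator, amount: float, rate: float) -> float:
    # One test path, two implementations behind the same contract.
    return impl.calculate(amount, rate)
```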

Isolate New Behavior

When adding recording hooks or comparison logic to a legacy method, write the new code in a separate method or wrapper rather than modifying the original. This preserves the original behavior while enabling side-by-side output capture.
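A minimal sketch of this wrapper pattern — `legacy_post_gl` is a hypothetical legacy function, left untouched while the recording logic lives in new code:

```python
def legacy_post_gl(entries):
    # Stand-in for the original legacy method; its body is not modified.
    return {"posted": len(entries)}

captured = []

def recording_post_gl(entries):
    # New code isolated in a wrapper: call the original unchanged,
    # then record input and output for side-by-side comparison.
    result = legacy_post_gl(entries)
    captured.append({"input": entries, "output": result})
    return result
```

Callers are pointed at the wrapper during parity testing; the original remains the single source of behavior.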

Replace Hard-Coded Dependencies

Pass dependencies through constructors, factory methods, or configuration rather than instantiating them internally. During parity testing, swap in test doubles that capture intermediate state for comparison.
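A constructor-injection sketch with a recording test double — all names here are illustrative:

```python
class RecordingRateSource:
    # Test double: returns a fixed rate and captures which categories
    # were looked up, so intermediate state can be compared.
    def __init__(self, rate: float):
        self.rate = rate
        self.calls = []

    def rate_for(self, category: str) -> float:
        self.calls.append(category)
        return self.rate

class TaxEngine:
    def __init__(self, rate_source):
        # Injected through the constructor rather than instantiated
        # internally — the seam that makes parity testing possible.
        self.rate_source = rate_source

    def tax(self, amount: float, category: str) -> float:
        return round(amount * self.rate_source.rate_for(category), 2)
```

In production the engine receives the real rate source; in parity tests it receives the double, and the captured calls become part of the comparison.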

For the full catalog of techniques, see Feathers’ Working Effectively with Legacy Code, Chapter 25. The key insight for modernization: these techniques create seams for testing without modifying the legacy system’s behavior — which is exactly what you need when building characterization tests.

Team Zeta in the PearlThoughts internship independently achieved 100% parity on tax calculation using table-driven tests:

| Scenario | Python Output | Go Output | Match |
| --- | --- | --- | --- |
| GST 18% on single item | Tax: 1,800.00 | Tax: 1,800.00 | Pass |
| GST 18% + CESS 1% compound | Tax: 1,918.00 | Tax: 1,918.00 | Pass |
| Inclusive pricing (tax-in-price) | Net: 8,474.58 | Net: 8,474.58 | Pass |
| Multi-rate (5% + 18% items) | Tax: 1,150.00 | Tax: 1,150.00 | Pass |
| Zero-rated export | Tax: 0.00 | Tax: 0.00 | Pass |

They captured Python outputs first, then built Go implementations until every row matched. No specification documents were needed — the Python system was the specification.