
AI-Powered Processing of 500+ Page Bid Packs at Scale (4/4)
Big organizations drown in documents: 500-plus-page PDFs, scanned annexes, tables, and forms that arrive not once, but all the time. Shoving entire files into an LLM is slow, expensive, and hard to defend. This post shows a better way with AI: extract a small, testable catalog of requirements, index the documents locally, retrieve only the few passages that matter, and demand verbatim, page-linked evidence.
We use procurement as the running example, but the same pattern applies anywhere you process large volumes of pages: vendor risk and security due diligence, contract and policy review, healthcare and regulatory dossiers, M&A data rooms, insurance claims, ESG reports, and more.
Trust, Verify, and Ship Bid-Pack Compliance With Confidence
If the first three parts were about building a machine that works, this one is about proving it works, and catching it when it does not. Procurement teams do not get points for elegance. They get points for being correct, on time, and defensible when someone asks, "why did we shortlist Vendor B and not Vendor C?" That requires two layers of safety: live validation of every single verdict, and a short set of quantitative metrics that show whether the system is healthy overall. We will do both, with practical C# snippets you can paste into your project.
We will keep using the same example RFP slice and the fictitious bidder we established earlier: bid security two percent, validity ninety days, ISO certificates current, IEC classes for meters, LDs accepted. You already know the retrieval and packing routine. For each rule you pick two to four short passages with page numbers and headings. You pack several rules into a single request. You require must-cite JSON with a short verbatim quote and page numbers. This post focuses on everything that happens after the model replies, and on how you prove to yourself and to others that the pipeline is solid.
The live checks that keep you out of trouble
Every result must pass a few instant checks before it colors a cell in your compliance matrix.
- Quote verification. Because you required a verbatim quote, you can do a plain string match against the exact passages you sent. If the quote is not found, the verdict is invalid.
- Page sanity. If the model says the quote is on page 47, that page must exist, and the quote should occur on one of the listed pages.
- Schema tightness. Verify that status is one of Pass, Partial, Fail, or Missing. Verify that the rationale length is within your limit, and that risk level is one of Low, Medium, or High.
If any check fails on a mandatory rule, escalate that single rule with a fresh retrieval variant or one extra table row. If it is optional, mark it as needs reviewer. These small, mechanical checks reduce mystery failures to nearly zero.
Here is a compact helper to run these validations. Assume each model item came back as Finding and you still have the exact passages you sent.
public enum Status { Pass, Partial, Fail, Missing }

public sealed record Passage(int Page, string Heading, string Text);

public sealed record Finding(
    string RequirementId,
    Status Status,
    string Rationale,
    string EvidenceQuote,
    IReadOnlyList<int> PageRefs,
    string RiskLevel // "Low" | "Medium" | "High"
);

public static class LiveValidation
{
    public static bool QuoteAppearsInPassages(Finding f, IReadOnlyList<Passage> passages)
    {
        if (string.IsNullOrWhiteSpace(f.EvidenceQuote)) return false;
        foreach (var p in passages)
        {
            // straightforward substring match; normalize whitespace if needed
            if (p.Text.Contains(f.EvidenceQuote, StringComparison.Ordinal))
                return true;
        }
        return false;
    }

    public static bool PageRefsAreSane(Finding f, int maxPage)
    {
        if (f.PageRefs is null || f.PageRefs.Count == 0) return false;
        foreach (var p in f.PageRefs)
        {
            if (p <= 0 || p > maxPage) return false;
        }
        return true;
    }

    public static bool SchemaIsTight(Finding f)
    {
        if (f.Rationale?.Length > 600) return false; // about 80 to 100 words
        if (f.RiskLevel is not ("Low" or "Medium" or "High")) return false;
        return true;
    }

    public static bool IsValid(Finding f, IReadOnlyList<Passage> passages, int maxPage) =>
        QuoteAppearsInPassages(f, passages) &&
        PageRefsAreSane(f, maxPage) &&
        SchemaIsTight(f);
}
That is a rock-solid guardrail. When a verdict fails here, you do not argue with the model. Treat it as Missing and either escalate or send it to a reviewer. The must-cite rule is what makes this cheap and decisive.
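Wiring that into the flow takes only a few lines. The sketch below assumes the helpers above; the two Action callbacks, escalateRetrieval and sendToReviewer, are stand-ins for whatever your pipeline already does at those points.

public static class VerdictRouting
{
    public static Finding Route(
        Finding f,
        IReadOnlyList<Passage> passages,
        int maxPage,
        bool mandatory,
        Action<Finding> escalateRetrieval,   // e.g. re-run with a fresh retrieval variant
        Action<Finding> sendToReviewer)      // optional rules go to a human
    {
        if (LiveValidation.IsValid(f, passages, maxPage))
            return f;

        // Never argue with the model: downgrade the shaky verdict to Missing.
        var downgraded = f with { Status = Status.Missing };

        if (mandatory) escalateRetrieval(downgraded);
        else sendToReviewer(downgraded);

        return downgraded;
    }
}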
Scoring, gates, and how to keep the matrix correct
Once a finding is valid, you have to decide what it means for the bid. Mandatory rules are non-negotiable. If any mandatory rule is Missing or Fail, the overall administrative gate is Fail. For large sets of technical rules where Partial is common, use a simple numeric score so you can compare vendors without pretending to be more precise than you are. One point for Pass, half for Partial, zero for Missing or Fail. Average across technical rules only. Keep a separate commercial score if you need it. The key is not the exact numbers, but the consistent application, and a clear separation between the hard gate and the softer comparison.
Here is a small sketch.
public sealed record RequirementRef(string Id, bool Mandatory, string Area); // "Admin" | "Technical" | "Commercial"

public sealed record VendorResult(
    string BidderId,
    IReadOnlyDictionary<string, Finding> Findings // by requirement id
);

public static class Scoring
{
    public static bool FailsMandatoryGate(
        VendorResult result,
        IReadOnlyDictionary<string, RequirementRef> catalog)
    {
        foreach (var (rid, req) in catalog)
        {
            if (!req.Mandatory) continue;
            if (!result.Findings.TryGetValue(rid, out var f)) return true; // missing equals fail
            if (f.Status is Status.Fail or Status.Missing) return true;
        }
        return false;
    }

    public static double TechnicalScore(
        VendorResult result,
        IReadOnlyDictionary<string, RequirementRef> catalog)
    {
        var techIds = catalog.Values.Where(r => r.Area == "Technical").Select(r => r.Id).ToList();
        if (techIds.Count == 0) return 0;
        double sum = 0;
        int seen = 0;
        foreach (var id in techIds)
        {
            if (!result.Findings.TryGetValue(id, out var f)) continue;
            double pts = f.Status switch
            {
                Status.Pass => 1.0,
                Status.Partial => 0.5,
                _ => 0.0
            };
            sum += pts;
            seen++;
        }
        return seen == 0 ? 0 : sum / seen;
    }
}
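As a usage sketch: run the hard gate first and only report a technical score for bids that survive it. The GateReport helper below is illustrative, not part of the pipeline.

public static class GateReport
{
    // Hypothetical one-line summary for a vendor: hard gate first, score only if it passes.
    public static string Summarize(VendorResult result, IReadOnlyDictionary<string, RequirementRef> catalog)
    {
        if (Scoring.FailsMandatoryGate(result, catalog))
            return $"{result.BidderId}: administrative gate FAILED";

        var score = Scoring.TechnicalScore(result, catalog);
        return $"{result.BidderId}: gate passed, technical score {score:0.00}";
    }
}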
Present the output as a clean, clickable matrix - each cell styled as a color-coded pill representing status (Pass, Partial, Fail, or Missing). Clicking a pill opens a side drawer showing:
- The supporting quote, with page-linked citations,
- A quick option to agree with the retrieved evidence,
- Or an option to override the evaluation by providing:
  - a reason
  - an optional comment
Each override is logged with:
- Who made the change
- When it was made
- Why (the reviewer’s rationale)
This override trail becomes a feedback mechanism, helping refine future retrieval and prompting strategies by showing where and why human judgment diverged.
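A minimal shape for that override trail could be a record like this one; the field names are illustrative.

public sealed record OverrideEntry(
    string BidderId,
    string RequirementId,
    Status OriginalStatus,
    Status NewStatus,
    string Reason,        // why: the reviewer's rationale
    string? Comment,      // optional free-text comment
    string ReviewerId,    // who made the change
    DateTime ChangedUtc   // when it was made
);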
The four numbers that tell you whether you can trust the system
Live checks make individual cells safe. Metrics tell you whether the system is safe on average. You do not need a wall of charts; you need four numbers, computed on real jobs from time to time.
Recall at K for mandatory rules. Question: when a human could find evidence for a mandatory rule, did your candidate list contain that evidence somewhere? You are measuring retrieval, not the model. Build a tiny labeled set. Mark the specific passages a human would cite. Run your retrieval without the model. If the human passage appears in your fused candidate list of size K, count a hit. If not, count a miss. If recall is under 95 percent, tune retrieval before you blame the model.
Here is a minimal evaluator.
public sealed record HumanEvidence(string RequirementId, int Page, string Snippet);
public sealed record Chunk(int Page, string Text);
public sealed record Candidate(Chunk Chunk, double Score);

public static class RetrievalEval
{
    // true if at least one candidate matches human evidence by page and substring
    public static bool CandidateCovers(HumanEvidence gold, IReadOnlyList<Chunk> candidates)
    {
        foreach (var c in candidates)
        {
            if (c.Page != gold.Page) continue;
            if (c.Text.Contains(gold.Snippet, StringComparison.OrdinalIgnoreCase)) return true;
        }
        return false;
    }

    public static double RecallAtK(
        IReadOnlyList<HumanEvidence> goldSet,
        Func<string, int, IReadOnlyList<Chunk>> retrieveK)
    {
        int hits = 0, total = 0;
        foreach (var g in goldSet)
        {
            var candidates = retrieveK(g.RequirementId, 30);
            if (CandidateCovers(g, candidates)) hits++;
            total++;
        }
        return total == 0 ? 0.0 : (double)hits / total;
    }
}
Mandatory miss rate. Percentage of mandatory rules that ended up as Fail or Missing, where a human later decided the bid was compliant. Derive it from reviewer overrides. While tuning, ten percent is common. A steady goal under five percent is realistic. Under two percent is excellent. Most misses come from retrieval blind spots or prompts that did not enforce must-cite tightly enough.
Reviewer edit rate. Fraction of cells a reviewer changed. If this is high, look for clusters. If a rule family keeps getting edits, fix retrieval for that family. If one vendor gets many edits, your OCR or layout extraction may be weak for that pack. Use the numbers to point you to the next fix.
Time to first gate. How long until obvious non-compliance surfaces. With packing and triage, you usually get early red flags in minutes, even on large packs. This is a happiness metric for coordinators.
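The first two of those numbers fall straight out of the override trail. Here is a sketch, assuming the OverrideEntry record above and the RequirementRef catalog from the scoring section.

public static class JobMetrics
{
    // Mandatory miss rate: mandatory cells the system marked Fail or Missing
    // that a reviewer later overrode to Pass or Partial.
    public static double MandatoryMissRate(
        int mandatoryCellCount,
        IReadOnlyList<OverrideEntry> overrides,
        IReadOnlyDictionary<string, RequirementRef> catalog)
    {
        if (mandatoryCellCount == 0) return 0;
        var misses = overrides.Count(o =>
            catalog.TryGetValue(o.RequirementId, out var req) && req.Mandatory &&
            o.OriginalStatus is Status.Fail or Status.Missing &&
            o.NewStatus is Status.Pass or Status.Partial);
        return (double)misses / mandatoryCellCount;
    }

    // Reviewer edit rate: fraction of all matrix cells a reviewer changed.
    public static double ReviewerEditRate(int totalCells, IReadOnlyList<OverrideEntry> overrides) =>
        totalCells == 0 ? 0 : (double)overrides.Count / totalCells;
}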
The acceptance checklist to present to the auditors
Metrics reassure you, and a written checklist reassures everyone else. Let's keep it short and plain.
- You version the catalog and keep a hash of the RFP text.
- You log the retrieval parameters, such as chunk size, overlap, K, synonyms, and section boosts.
- You require a verbatim quote and page numbers in every verdict, then you verify the quote and the pages locally, and then you gate on mandatory rules.
- You store the compliance matrix with links to an evidence drawer, and you keep the override trail.
- You compute and store the four metrics for the job, and you maintain a small, stable labeled set for regression whenever you tweak retrieval.
If you want a small record to codify catalog versioning, use something like this.
public sealed record CatalogMeta(
    string RfpId,
    int Version,
    string RfpTextSha256,
    DateTime CreatedUtc,
    string CreatedBy,
    IReadOnlyDictionary<string, string> RetrievalParams // e.g., { "chunkTokens":"200", "overlap":"0.35", "bm25TopN":"30" }
);
Store this alongside the JSON of the requirements. When you re-run a job, you know exactly which rules and which knobs you used the first time.
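For completeness, here is one way to produce that hash and persist the metadata, using the standard SHA256 and System.Text.Json APIs from recent .NET; the file layout is your choice.

using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

public static class CatalogStorage
{
    public static CatalogMeta Create(
        string rfpId, int version, string rfpText, string createdBy,
        IReadOnlyDictionary<string, string> retrievalParams)
    {
        // SHA-256 over the extracted RFP text, stored as a hex string.
        var hash = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(rfpText)));
        return new CatalogMeta(rfpId, version, hash, DateTime.UtcNow, createdBy, retrievalParams);
    }

    public static void Save(CatalogMeta meta, string path) =>
        File.WriteAllText(path, JsonSerializer.Serialize(meta, new JsonSerializerOptions { WriteIndented = true }));
}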
The Mathematical Proof
When we use this approach to process large volumes of data, clients often ask how it is possible to send only a subset without missing anything important. We understand the doubt: at first glance, it does not seem logical. That is why we provide a clear mathematical proof that it works. The beauty of the math is that it ends the debate.
Here we show that the pipeline of rules -> local chunking -> retrieval -> small snippet set does not lose important details, and we spell out the operational guardrails that make silent misses nearly impossible.
1. CHUNKING CANNOT CUT AWAY THE EVIDENCE WE NEED
Let the extracted document be a token sequence of length N. We build overlapping windows, or chunks, of C tokens each with overlap fraction o. The stride is s = C · (1 - o).
Consider any contiguous evidence span of length L tokens (for example, "two percent (2%)" or "ninety (90) days").
Lemma (overlap coverage): every span with L ≤ C · o is contained entirely in at least one chunk.
Proof. The window start advances by the stride s = C · (1 - o), so adjacent windows overlap by C · o tokens. Any span no longer than that overlap fits fully inside some window.
Implication. If we pick C = 200 tokens and o = 0.35, then any evidence span of up to ~70 tokens (dozens of words, longer than a typical sentence) lands entirely in at least one chunk.
For tables, we have a rule that we never split a row across chunks, so rows are emitted as atomic chunks. That makes numeric evidence such as amounts, percentages, and day counts easy to cover.
So boundary loss is eliminated by construction (and for long paragraphs, the specific sentence that carries the fact is still well within that ~70-token bound).
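As a sketch of that construction, assuming the document is already tokenized into a list of strings (table rows are handled separately and omitted here); C = 200 and o = 0.35 are the same knobs recorded in the retrieval parameters.

public sealed record TokenChunk(int StartToken, IReadOnlyList<string> Tokens);

public static class Chunker
{
    // Overlapping windows of size C with overlap fraction o; stride s = C * (1 - o).
    // With C = 200 and o = 0.35 the stride is 130 and adjacent chunks share 70 tokens,
    // so any span of up to ~70 tokens fits entirely in at least one chunk.
    public static IEnumerable<TokenChunk> Window(IReadOnlyList<string> tokens, int c = 200, double o = 0.35)
    {
        int stride = Math.Max(1, (int)(c * (1 - o)));
        for (int start = 0; start < tokens.Count; start += stride)
        {
            var slice = tokens.Skip(start).Take(c).ToList();
            yield return new TokenChunk(start, slice);
            if (start + c >= tokens.Count) yield break; // the last window reached the end of the document
        }
    }
}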
2. RETRIEVAL IS CONSTRUCTED FOR HIGH RECALL, NOT PRECISION
For a requirement r, define its sentry terms S(r): the handful of must-catch tokens (numbers + units + canonical keywords), for instance {"2%", "90 days", "bid security"}.
A chunk is supportive if it contains the necessary subset of S(r) - that is, the exact number/unit string(s) plus the concept term.
We build a candidate set Cand(r) via a simple union:
- BM25@N: the top-N hits for a query expanded from S(r) (digits + unit variants + synonyms).
- Tag pool: all chunks tagged by deterministic regexes that match money/percent/days/ISO/etc., plus all table rows and any chunk under a clause-hinted heading for this rule.
Let Sup(r) be the set of truly supportive chunks in the document.
Claim (candidate recall). If a supportive chunk contains the sentry digits/units (or sits under the clause hint), then it lands in Cand(r) by construction, so Sup(r) is covered by Cand(r) up to a small miss rate.
In practice, misses come only from OCR or parse anomalies, or from irregular phrasing. Call that miss rate ε; it can be minimized with better OCR, better document quality, or simple synonym expansion.
So the candidate generator targets recall ≈ 1, and ranking and selection are the only remaining worries.
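A sketch of that union, reusing the Chunk record from the evaluator earlier; the regexes and the bm25TopN delegate are stand-ins for your own tagger and lexical index, and the clause-hint and table-row pools are omitted for brevity (DistinctBy requires .NET 6 or later).

using System.Text.RegularExpressions;

public static class CandidatePool
{
    // Deterministic regexes for common numeric sentry patterns; extend as needed.
    private static readonly Regex[] SentryPatterns =
    {
        new Regex(@"\b\d+(\.\d+)?\s*%"),                               // percentages, e.g. "2%"
        new Regex(@"\b\d+\s*(day|days)\b", RegexOptions.IgnoreCase),   // durations, e.g. "90 days"
        new Regex(@"\bISO\s*\d{4,5}\b", RegexOptions.IgnoreCase),      // ISO certificate numbers
        new Regex(@"\bIEC\s*\d{4,5}(-\d+)?\b", RegexOptions.IgnoreCase) // IEC classes, e.g. "IEC 62053-22"
    };

    public static IReadOnlyList<Chunk> Build(
        string requirementQuery,
        IReadOnlyList<Chunk> allChunks,
        Func<string, int, IReadOnlyList<Chunk>> bm25TopN,  // your lexical index
        int n = 30)
    {
        var pool = new List<Chunk>(bm25TopN(requirementQuery, n));                              // BM25@N
        pool.AddRange(allChunks.Where(c => SentryPatterns.Any(r => r.IsMatch(c.Text))));        // tag pool
        return pool.DistinctBy(c => (c.Page, c.Text)).ToList();                                 // union, de-duplicated
    }
}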
3. SELECTION KEEPS DIVERSE EVIDENCE, TO AVOID PICKING SEVERAL NEAR-DUPLICATES
From Cand(r) we compute an evidence score that boosts: presence of all required digits/units, a clause-hint match, a table row when the rule is numeric, and reasonable length.
Then we choose k passages (two to four in practice) with a diversity constraint (different page/heading/format). That is a simple Maximal Marginal Relevance (MMR) variant.
Define Gold(r) as the chunks that actually contain the complete sentry combo (for instance, both "2%" and "90 days"). These are the "gold" candidates.
The risk we care about: all of Gold(r) gets ranked below non-gold candidates, and the diversity rule does not pull any of it in.
We remove that risk with two deterministic safeguards for mandatory rules:
- Guard A (table guarantee). If S(r) includes a number/unit, include at least one table row when available.
- Guard B (clause guarantee). If a clause/heading hint exists, include at least one chunk from that section.
These two "must-include" constraints force selection to contain two different evidence modalities (narrative + table, or clause-near + annex). With the overlap lemma and regex tags, at least one of these carries the full numeric phrase. Empirically, this drives on mandatory rules.
If after selection any required digit/unit from is still absent in the chosen set, we escalate: set for that rule only, widen synonyms, and favor neighboring headings.
The token cost change will increase by just a few hundred tokens and will remain still rather low, because selection is performed locally.
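Here is a sketch of that selection step: evidenceScore, isTableRow, and underClauseHint are hypothetical hooks into your own scoring and layout metadata, and the diversity rule is reduced to "prefer different pages" to keep the example short.

public static class EvidenceSelection
{
    public static IReadOnlyList<Chunk> Select(
        IReadOnlyList<Chunk> candidates,
        Func<Chunk, double> evidenceScore,   // digits/units present, clause hint, length, ...
        Func<Chunk, bool> isTableRow,
        Func<Chunk, bool> underClauseHint,
        bool ruleIsNumeric,
        bool ruleHasClauseHint,
        int k = 3)
    {
        var chosen = new List<Chunk>();

        // Guard A: numeric rules must carry at least one table row, when one exists.
        if (ruleIsNumeric)
        {
            var row = candidates.Where(isTableRow).OrderByDescending(evidenceScore).FirstOrDefault();
            if (row is not null) chosen.Add(row);
        }

        // Guard B: clause-hinted rules must carry at least one chunk from that section.
        if (ruleHasClauseHint)
        {
            var clause = candidates.Where(c => underClauseHint(c) && !chosen.Contains(c))
                                   .OrderByDescending(evidenceScore).FirstOrDefault();
            if (clause is not null) chosen.Add(clause);
        }

        // Fill the remaining slots greedily by score, keeping pages diverse (a crude MMR stand-in).
        foreach (var c in candidates.OrderByDescending(evidenceScore))
        {
            if (chosen.Count >= k) break;
            if (chosen.Contains(c)) continue;
            if (chosen.Any(x => x.Page == c.Page)) continue; // diversity: prefer different pages
            chosen.Add(c);
        }
        return chosen;
    }
}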
4. THE MUST-CITE + VERIFICATION INVARIANT ELIMINATES SILENT ERRORS
Even if ranking ever served a weak snippet, the model cannot "pass" a rule by inventing text, because we enforce:
- Must-cite contract: output must include a verbatim quote copied from one of the provided passages and page refs.
- Local verification: we do a literal substring match of that quote against the passages; if not found, the finding is invalid -> treated as Missing -> escalate retrieval (or route to reviewer).
This invariant removes silent false positives. The worst possible outcome is a loud miss ("Missing"), which triggers a bounded, cheap retry on that rule only. This way we avoid "we lost information but said Pass" scenarios.
5. "NO LOSS" IS FORMALIZED AS MEASURABLE TARGET
Two measurable values give you provable confidence:
- Candidate recall@K for mandatory rules. On a tiny labeled set (a few tenders), measure the fraction of human-cited evidence spans that appear in your candidate pool Cand(r) before selection. With BM25@30 + tags + clause hints, this reliably clears the 95 percent bar mentioned earlier; to push it closer to 1, we can:
- raise N from 30 to somewhere up to 40,
- add one synonym family, and
- guarantee table rows for numeric rules.
- Selected-set coverage. Check that the selected passages contain all required sentry digits/units for the rule. If not, escalate that rule by increasing k and widening to the neighbouring headings. This drives the post-selection miss rate to near zero, with a minor token bump for just a couple of hard rules. (A sketch of this check follows below.)
Put differently: the probability that the evidence for a rule never reaches the model is bounded by the candidate miss rate ε plus the post-selection miss, while ε is driven toward zero by better OCR, synonym expansion, and the tag pool, and the post-selection miss is driven toward zero by the must-include guards and the per-rule escalation. The selected set is what the model actually sees. In practice, loud misses become rare and self-correct, and silent misses are structurally prevented.
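The selected-set coverage check from point 5 is just a string scan over the chosen passages; the sentry terms are whatever you attached to the rule in the catalog. A sketch:

public static class CoverageCheck
{
    // True if every sentry term for the rule appears somewhere in the selected passages.
    public static bool SelectedSetCovers(IReadOnlyList<string> sentryTerms, IReadOnlyList<Chunk> selected) =>
        sentryTerms.All(term =>
            selected.Any(c => c.Text.Contains(term, StringComparison.OrdinalIgnoreCase)));
}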
6. FINALLY, WHY IT BEATS THE "SEND THE WHOLE DOCUMENT" APPROACH
Sending the full document does not increase the fidelity of decisions for rules like "2% and 90 days" or "Class 0.5S per IEC 62053-22", because those decisions hinge on very short spans - the exact digits/units and clause names that our chunking and selection guarantee to capture and verify. Full-document prompts only add:
- irrelevant context tokens (cost/latency),
- longer reasoning chains (greater variance),
- no extra guarantee of quoting the exact text (unless you enforce must-cite, which we already do on the small set).
Our pipeline constrains the model to judge the same spans a human would and proves it by quote+page. That’s higher evidence fidelity than a giant prompt without the "must cite" requirement.
A note on humility
It is tempting to chase cleverness, such as custom embeddings or fancy few-shot chains. Those can help later. Most practical wins come from three simple moves: better chunking that preserves tables, generous retrieval that unions BM25 with obvious tags, and a strict must-cite rule with page numbers. When you wire those three, the rest of the system becomes auditable by normal people. Live checks pass, metrics move in the right direction, and trust grows because everyone can see exactly what the system saw.
Measure with a small labeled set each time you change a setting. Ask a narrow question, and then see if your candidates still cover what a human would cite. If yes, ship. If not, fix and try again. You will change fewer things, and each change will be on purpose.
Closing the loop
We now have a complete story. Part 1 justified the approach and did the math on why whole-document prompts do not pay. Part 2 showed how to find the right snippets without missing the sentence that matters. Part 3 showed how to work within real rate limits through packing and triage. This part turned those ideas into something you can defend in a boardroom: live validations that reject shaky answers immediately, and four numbers that prove the system is healthy over time.
A good next step is to bake these checks and metrics into your product UI. Show the token plan before a run, stream triage results first, and split the matrix into mandatory and non-mandatory rules. Let reviewers click a cell and see the quote in context. You may also print the catalog version and retrieval parameters in the report footer. None of this is flashy, but all of it is what separates a demo from something a hospital, a utility, or a large corporation can rely on.
Ready for CTO-level Leadership Without a Full-time Hire?
Let's discuss how Fractional CTO support can align your technology, roadmap, and team with the business, unblock delivery, and give you a clear path for the next 12 to 18 months.
Or reach us at: info@sharplogica.com