§ V — Insights Methodology Note Vol. I, No. 6 MMXXVI
Why Tiers Are Categorical, Not Numeric
On the choice to certify into tiers rather than to issue scores.
A question the Institute is asked, with some regularity, by firms approaching their first assessment is why the AI Maturity Standard certifies into named tiers (AI-Aware, AI-Enabled, AI-Integrated, AI-First, AI-Native) rather than producing a numeric score. The question is fair, and the answer is methodologically rather than aesthetically motivated. A numeric score would, in the Institute's view, deliver a less honest piece of information than the categorical tier it would replace, would create incentives the Standard exists in part to resist, and would conceal the feature of the assessment that the certification is intended to surface: its multidimensionality, governed by a floor-of-weakest-dimension rule. This note sets out the reasoning.
§ I The Surface Appeal of Scores
The case for a numeric score is intuitive. Numbers permit comparison; they permit longitudinal tracking; they communicate at a glance. A firm that learns it has certified at AI-Enabled has, at the margin, less information than a firm told it has scored 67 on a hundred-point scale. The number conveys not only a position but a distance: 67 is closer to 68 than to 50.
The intuition is sound on its premises and incorrect in its conclusion. The premises assume that the underlying quantity, AI maturity in a professional services firm, is a one-dimensional continuum on which positions can be linearly ordered, and that the assessment instrument is sufficiently calibrated to distinguish positions at the granularity the number implies. Neither premise holds for the matter the Standard is built to assess. AI maturity is the joint state of seven dimensions, each of which can advance or regress independently of the others, and the operational consequences of which depend on their joint distribution rather than on their sum. The assessment is calibrated to distinguish tier boundaries with reasonable inter-assessor reliability; it is not calibrated, and could not be calibrated without instrumentation the field does not possess, to distinguish a 67 from a 69.
A numeric score would deliver a less honest piece of information than the categorical tier it would replace.
This is not a confession of methodological weakness. It is a description of the underlying quantity. A great many things assessors are asked to certify, including the seaworthiness of a vessel, the standing of a university, and the responsibility of a corporation toward its workers and environment, share the structural feature that they are real, are consequential, are capable of being assessed with discipline, and are not capable of being reduced to a scalar without introducing errors of greater magnitude than the resolution the scalar would claim to deliver.
§ II False Precision and Its Costs
A numeric output presented to two significant figures asserts an instrument calibrated to one part in a hundred. A numeric output presented as an integer between zero and one hundred asserts an instrument calibrated to roughly one part in twenty, given typical noise floors in social measurement.¹ The assessment underlying the Standard is not calibrated to either. It is calibrated to distinguish between five tier states across seven dimensions, with documented criteria at each boundary and inter-assessor reliability targets at the boundary, not at the interior of each tier.
A firm receiving a score of 73 against a firm receiving a 75 would, in any honest reporting of confidence intervals, be indistinguishable. The Institute's assessment process could not reliably reproduce the two-point gap on re-administration with a different panel of assessors. To report the gap as if it were a finding is to mislead the recipient about the precision of the instrument. To act on the gap, for instance by representing to a client that the firm has improved from a 73 to a 75 at recertification, is to invest the misleading number with operational meaning it cannot carry.
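The point admits a small numerical illustration. Assuming, purely for the sake of the sketch, a re-administration standard error of four points on the hundred-point scale (the Standard publishes no such figure; the value is hypothetical), scores of 73 and 75 produce confidence intervals that overlap almost entirely:

```python
# Illustration only: the four-point standard error is an assumed figure,
# not a published property of the Standard's instrument.

def interval(score: float, se: float, z: float = 1.96) -> tuple[float, float]:
    """95% confidence interval around a reported score."""
    return (score - z * se, score + z * se)

SE = 4.0  # assumed re-administration standard error (hypothetical)

lo_a, hi_a = interval(73, SE)   # ≈ (65.2, 80.8)
lo_b, hi_b = interval(75, SE)   # ≈ (67.2, 82.8)

# The two-point gap is far smaller than the width of either interval,
# so the two scores cannot be distinguished on re-administration.
overlap = min(hi_a, hi_b) - max(lo_a, lo_b)   # ≈ 13.7 points of overlap
print(f"interval A: ({lo_a:.1f}, {hi_a:.1f})")
print(f"interval B: ({lo_b:.1f}, {hi_b:.1f})")
print(f"overlap:    {overlap:.1f} points")
```

To resolve a two-point gap at the same confidence level, the standard error of each administration would need to fall below roughly 0.7 points, a calibration no assessment of this kind approaches.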
False precision has costs. It encourages firms to commission internal projects whose business case rests on moving the score by a margin smaller than the instrument can detect. It encourages clients and counterparties to make decisions of consequence, such as supplier selection, panel admission, and fee premia, on the basis of differences the certifying body cannot itself defend. It encourages comparison between firms across years in which the underlying instrument may have been recalibrated, with the implication that the year-on-year movement reflects firm-level change rather than instrument-level adjustment. Each of these costs is reduced, though not eliminated, by reporting in tiers. A firm that has moved from AI-Enabled to AI-Integrated has crossed a boundary the assessment is built to identify and the methodology is built to document. A firm that has moved from 73 to 75 has, in any rigorous reading of the data, not moved at all.
§ III Gaming and the Multidimensional State
The second cost of a numeric score is incentive. A scalar metric, particularly one with public visibility, becomes a target. The literature on this point predates the present subject by some margin; the observation is associated with Goodhart, with Campbell, and in management practice with the experience of every regulator who has set a quantitative target (Goodhart, 1975; Campbell, 1979).² Firms optimise the measure. Where the measure tracks the underlying state, the optimisation tracks improvement in the underlying state. Where the measure is a compressed projection of a multidimensional state, the optimisation systematically tracks the dimensions of the state that contribute to the measure, at the expense of those that do not.
The Standard assesses seven dimensions: Strategy and Leadership; Governance, Risk and Compliance; Data and Knowledge Infrastructure; Workflow Redesign and Operations; Talent and Operating Model; Client Experience; Measurement and Accountability. A scalar aggregate, however constructed, would carry weights. Firms would discover, by inspection of the methodology or by triangulation against published scores, what the weights were. Investment would flow to the higher-weighted dimensions and away from the lower-weighted ones, in the predictable manner of systems under measurement. The aggregate would then drift upward over time without the underlying maturity having advanced proportionately, and the certification would lose the property the Institute is most concerned to preserve: that an external party, reading the certification, can rely on it as a description of the firm's state rather than as an artefact of the firm's response to being measured.
The floor-of-weakest-dimension rule, by contrast, is structurally resistant to this kind of optimisation. A firm cannot certify higher by raising its strongest dimension. It can only certify higher by raising its weakest. Investment is therefore directed, by the structure of the rule rather than by exhortation, at the dimension on which the firm is most exposed. This is, in the Institute's view, the correct direction of incentive for an instrument intended to certify maturity rather than to celebrate accomplishment.
A firm cannot certify higher by raising its strongest dimension. It can only certify higher by raising its weakest.
A scalar aggregate would average away precisely the information the floor rule is built to highlight. A firm that is exceptional at five dimensions and weak at two would score highly in the aggregate and would be certified, by the aggregate, as a leading practitioner. The Standard certifies such a firm at the floor of its weakest two dimensions, which is to say, as a firm with substantial work to do. The two reports are not different summaries of the same underlying state; they are different judgements about what the underlying state is. The Institute's judgement is that an organisation whose governance, for instance, lags well behind its operational adoption of deployed AI tools is not a leading practitioner in any sense the certification can responsibly endorse, regardless of how strong its other dimensions are.
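The divergence between the two reporting rules is mechanical, and can be sketched in a few lines. The dimension names are the Standard's; the numeric tier levels (one to five, standing in for the five named tiers) and the illustrative firm profile are assumptions introduced for the example:

```python
# Tier levels 1-5 stand in for AI-Aware ... AI-Native.
TIERS = {1: "AI-Aware", 2: "AI-Enabled", 3: "AI-Integrated",
         4: "AI-First", 5: "AI-Native"}

# Hypothetical firm: exceptional on five dimensions, weak on two.
profile = {
    "Strategy and Leadership": 5,
    "Governance, Risk and Compliance": 2,   # lagging
    "Data and Knowledge Infrastructure": 5,
    "Workflow Redesign and Operations": 5,
    "Talent and Operating Model": 5,
    "Client Experience": 5,
    "Measurement and Accountability": 2,    # lagging
}

def floor_tier(profile: dict[str, int]) -> str:
    """Floor-of-weakest-dimension rule: the firm certifies at its minimum."""
    return TIERS[min(profile.values())]

def scalar_aggregate(profile: dict[str, int]) -> float:
    """Equal-weight scalar aggregate on a 0-100 scale, for contrast."""
    return 100 * sum(profile.values()) / (5 * len(profile))

print(floor_tier(profile))                  # AI-Enabled
print(round(scalar_aggregate(profile), 1))  # 82.9
```

The same profile that averages to roughly 83 on a hundred-point scale certifies, under the floor rule, at AI-Enabled: the weak dimensions govern the tier rather than averaging away.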
§ IV Parallels in Standards Practice
The choice to certify into categorical tiers rather than to issue numeric scores is consistent with the practice of the institutional standards bodies and certifying authorities to which the Institute looks for methodological reference.
B Lab certifies organisations as B Corporations on a binary basis: certified or not.³ The underlying assessment produces a numeric score, but the score is used internally to establish whether the threshold for certification has been met; it is not the certification, and a certified B Corp is not externally distinguished by its margin above the threshold. The reasoning B Lab has articulated is, in substance, the reasoning set out above: the assessment is calibrated to identify whether a firm meets a standard, not to rank firms within the meeting set.
Michelin awards stars on a categorical scale from one to three, with no intermediate granularity and no published scoring apparatus underlying the awards. A restaurant either has two stars or it does not, and the guides are explicit that the categorical award is the entirety of the published assessment.
Lloyd's Register classifies vessels into classes, with the class structure encoding the conditions under which the vessel is fit to operate. The class is a categorical assertion, and changes in class, whether promotion or demotion, are reported as such rather than as movements along a continuum.
Academic accreditation bodies certify institutions and programmes as accredited or as accredited with conditions, with the conditions made specific. The accreditation does not aggregate to a number, and the institution's standing is reported as accredited rather than as a position in a ranking.
The pattern across these institutions is not a coincidence of taste. It reflects a shared methodological judgement: that a categorical assessment, accompanied by documented criteria at boundaries, communicates more accurately and creates better incentives than a scalar score whose precision the underlying instrument cannot support. The Institute is in this tradition.
§ V The Progress Objection
The counter most often advanced, and the most serious, is that a numeric score would give firms a way to track and demonstrate progress. A firm that improves materially between two assessments at the same tier learns nothing, on a categorical scheme, that it did not already know before being assessed. A score, however imprecise, would at least give it a number to move.
The objection deserves a careful response. It is not, on examination, an objection to categorical tiers as such. It is an objection to a particular impoverished form of categorical reporting, in which the only information returned to the firm is the tier letter. That is not the form the Standard uses.
A firm certified under the Standard receives, alongside its tier, an assessment report that covers each of the seven dimensions narratively, with the assessor's findings, the evidence supporting them, and the boundary conditions that determined the dimension's tier. At recertification, the firm receives a narrative delta against the prior report at the dimension level. A firm that has done substantial work on Governance, Risk and Compliance between assessments will see that work reflected in the second report read against the first, with the specific advances named. If the work has not been sufficient to move the dimension across a tier boundary, the report will say so, and will identify what additional evidence the next boundary would require.
The Institute's position is that this form of progress reporting is, in practical terms, more useful to firms than a scalar number. A delta of two points on a hundred-point scale tells a firm that something has moved, but not what has moved, not why, and not what would be required to move it further. A narrative delta at dimension level tells the firm exactly which evidence has improved, what counted, and what would be required for the next tier on the dimension. A managing partner reading the second report against the first has the information necessary to direct the firm's work in the next cycle.
A second response to the progress objection should be offered in a more guarded register. The certification is not an instrument the Institute considers fit to serve as the principal internal tracker of a firm's AI maturity programme. Internal tracking is properly a function of the firm's own measurement apparatus, which should be more granular than the certification, more frequent than the certification, and tailored to the firm's particular operating model. The certification is an external attestation, intended to be relied upon by external readers (clients, regulators, counterparties, insurers), and its design priorities are those of an external attestation. Among those priorities, the integrity of what is reported sits above the granularity at which it is reported. A firm that wishes to track its own progress at a finer grain has every incentive to build the instrumentation that would allow it to do so, and several of the firms the Institute has assessed are doing exactly this. The Institute's certification is not the right tool for that work, and was not designed to be.
§ VI The Institute's Position
The decision to certify into categorical tiers rather than to issue a numeric score is deliberate and methodological. It reflects a judgement that the underlying state being assessed is multidimensional, that the assessment instrument is calibrated to distinguish tier boundaries rather than interior positions, and that a scalar aggregate would average away the floor-of-weakest-dimension rule on which the Standard's incentive structure depends. It reflects, further, a recognition that scalar scores invite optimisation against the measure rather than improvement of the underlying state, and that this is a failure mode an institutional certifying body is obliged to anticipate rather than to accommodate. It reflects, finally, a tradition the Institute is content to be placed within: of standards bodies that certify into categorical states, with documented criteria at boundaries and narrative reporting of progress within and across categories.
Categorical tiers are, in this sense, a deliberate refusal of false precision. The refusal is not a concession of what the instrument cannot do. It is an articulation of what an institutional standard is for.