How frontier models rank when forced to think beyond their training data. Measured by CHRONOS across real sessions — not synthetic benchmarks.
Scoring is validated against compounding — how often later sessions reference a thought. Seven structural axes, continuously calibrated. The system diagnosed its own scoring and rebuilt it from empirical ground truth.
Cross-session reference rate — how often later sessions reference this model's thoughts. The direct measurement of downstream utility.
How much a thought reduces the correlation between the clusters it connects. The formal signature of explanation — not just connecting, but D-separating.
Connection strength to multiple distinct clusters. Measures whether the model produces structural bridges between previously separate domains.
Novelty per token. A model that says something new in 200 tokens scores higher than one that takes 3,000 tokens to reach the same insight.
Geometric distance from prior territory. Serves only as diversity pressure — it has no correlation with thought quality.