Machine Learning Research·2025
A research journey from hypothesis to honest assessment: discovering that Test-Time Augmentation, not attention, drives calibration improvements in kNN classification.

Role & Outcomes
Research question: can learned attention over neighbors improve calibration and robustness versus uniform and distance-weighted kNN, with minimal compute overhead?
Status: CONCLUDED — December 2025. No further investigation needed.
This research began with a hypothesis: learned attention would improve kNN classification by learning sophisticated neighbor weighting patterns. Over the course of systematic experimentation, a complex architecture was built with multi-head attention, learned temperature, label-conditioned bias, and prototype-guided scoring.
Through seven experimental iterations, something unexpected was discovered: the simplest technique (Test-Time Augmentation) provided the biggest gains, while the complex architecture (attention) provided no measurable benefit.
The project demonstrates scientific rigor and honest self-assessment, and shows that negative results are valuable scientific contributions.
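The neighbor-weighting idea can be sketched in a few lines. Below is an illustrative single-head version (the function name, dot-product similarity, and fixed temperature are simplifications of my own; the actual architecture adds multi-head attention, a learned temperature, label-conditioned bias, and prototype-guided scoring):

```python
import numpy as np

def attention_knn_scores(query_emb, neighbor_embs, neighbor_labels,
                         n_classes, temperature=0.1):
    """Single-head attention over retrieved neighbors: a softmax of scaled
    similarities weights each neighbor's one-hot label vote."""
    sims = neighbor_embs @ query_emb            # dot-product similarity, shape (k,)
    weights = np.exp(sims / temperature)
    weights /= weights.sum()                    # softmax over the k neighbors
    probs = np.zeros(n_classes)
    np.add.at(probs, neighbor_labels, weights)  # accumulate weight per class
    return probs                                # class probabilities, sum to 1
```

With a low temperature this collapses toward 1-NN; with a high temperature it approaches uniform kNN, which is one reason learned weighting can end up indistinguishable from the simple baselines.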
Final results from Experiment 7 (December 2025), after which the research was concluded. ECE is Expected Calibration Error and NLL is negative log-likelihood; lower is better for both:
| Method | Accuracy | ECE | NLL |
|---|---|---|---|
| Uniform kNN | 91.53% | 0.0796 | 1.225 |
| Distance kNN | 91.52% | 0.0783 | 1.225 |
| Attn-KNN | 91.55% | 0.0811 | 1.236 |
| Attn-KNN + TTA | 90.99% | 0.0267 | 0.513 |
| CNN Baseline | 96.51% | 0.0253 | 0.184 |
Test-Time Augmentation (TTA) provides dramatic calibration improvements: 67% ECE reduction (0.0811 → 0.0267) and 58% NLL reduction (1.236 → 0.513). However, TTA works with any kNN method — it's not unique to attention.
Attention alone provides only a +0.03 percentage-point accuracy gain over distance-weighted kNN (within the noise margin) and no calibration benefit.
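For reference, the ECE numbers above come from the standard binned calibration metric: group predictions by confidence, then average the gap between confidence and accuracy across bins. A minimal NumPy sketch (the bin count and equal-width binning are assumptions; the project's exact implementation may differ):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin predictions by max confidence, then take the bin-size-weighted
    average of |accuracy - mean confidence| over the bins."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(accuracies[mask].mean()
                                     - confidences[mask].mean())
    return ece
```

A perfectly calibrated model (e.g. 60% confidence and 60% accuracy in every bin) scores 0; the 0.0811 → 0.0267 drop means the TTA-averaged probabilities track empirical accuracy far more closely.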

Figures: reliability diagrams for Uniform kNN (ECE 0.0796) and Attn-KNN + TTA (ECE 0.0267, a 67% improvement); a noise-robustness comparison showing attention is less robust than uniform kNN; training curves showing loss convergence.
- ResNet18, 128-dim embeddings, single-head attention. Result: attention ≈ uniform ≈ distance (88.61% accuracy); fixed a training-evaluation disconnect.
- ResNet50, 256-dim embeddings, 4 heads, contrastive loss. Result: +1.08% accuracy (89.70%), but still no attention advantage.
- TTA, MixUp, label smoothing, optimized config. Result: 78% ECE reduction with TTA. Discovery: TTA, not attention, is the real innovation.
- Pattern confirmed: TTA consistently improves calibration across experiments.
- Experiment 7: ResNet50, 256-dim embeddings, 4 heads, 50 epochs. Result: 91.55% accuracy, 0.0267 ECE (with TTA); attention is LESS robust to noise than uniform kNN.
Core Finding: Attention-weighted kNN provides no meaningful benefit over distance-weighted kNN. The calibration improvements come from Test-Time Augmentation (TTA), not attention.
Recommendation: use distance-weighted kNN with TTA in production. Attention adds complexity without measurable benefit.
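The recommended recipe can be sketched as follows: run distance-weighted kNN on the embedding of each augmented view of a test image, then average the per-view class probabilities. All names here are illustrative, and the augmentations (crops, flips) are assumed to be applied upstream in the data pipeline:

```python
import numpy as np

def tta_distance_knn(augmented_queries, bank_embs, bank_labels,
                     n_classes, k=5, eps=1e-8):
    """Distance-weighted kNN averaged over test-time-augmented views."""
    view_probs = []
    for q in augmented_queries:                # one embedding per augmented view
        dists = np.linalg.norm(bank_embs - q, axis=1)
        idx = np.argsort(dists)[:k]            # k nearest neighbors
        w = 1.0 / (dists[idx] + eps)           # inverse-distance weights
        w /= w.sum()
        probs = np.zeros(n_classes)
        np.add.at(probs, bank_labels[idx], w)  # accumulate weight per class
        view_probs.append(probs)
    return np.mean(view_probs, axis=0)         # TTA: average over views
```

Averaging over views smooths away the overconfident spikes a single embedding produces, which is consistent with the calibration gains reported above; the kNN weighting scheme itself is interchangeable.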
However, this is still a valuable scientific contribution: