OpenAI's paper introduces CoT-Control, an open-source suite of 13,000+ tasks drawn from GPQA, MMLU-Pro, HLE, and BFCL that measures chain-of-thought (CoT) controllability.
Evaluations on 13 models show compliance rates of only 0.1% to 15.4%.
Controllability improves with model size, but it drops as reasoning chains lengthen and after post-training updates.
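
To make the compliance figures concrete, here is a minimal Python sketch of how a compliance rate could be computed: the fraction of trials in which a model's reasoning obeyed a given instruction. This is not the paper's scoring code. The Trial record, the avoids_word check, and the four toy trials are all hypothetical, chosen only to illustrate the arithmetic.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Trial:
    """One hypothetical evaluation record (not the suite's real schema)."""
    instruction: str  # constraint the model was asked to obey in its reasoning
    reasoning: str    # the chain of thought the model actually produced


def avoids_word(word: str) -> Callable[[Trial], bool]:
    """Build an illustrative check: the reasoning never mentions `word`."""
    lowered = word.lower()
    return lambda trial: lowered not in trial.reasoning.lower()


def compliance_rate(trials: Iterable[Trial],
                    complies: Callable[[Trial], bool]) -> float:
    """Return the fraction of trials whose reasoning passes the check."""
    trials = list(trials)
    if not trials:
        raise ValueError("need at least one trial")
    return sum(1 for t in trials if complies(t)) / len(trials)


# Toy data, invented for this example only.
instruction = "Do not use the word 'entropy' in your reasoning."
trials = [
    Trial(instruction, "The entropy of the system rises."),
    Trial(instruction, "Disorder increases, so the answer is B."),
    Trial(instruction, "Entropy is maximal at equilibrium."),
    Trial(instruction, "Use the entropy formula S = k ln W."),
]

rate = compliance_rate(trials, avoids_word("entropy"))
print(f"compliance: {rate:.1%}")
```

Only the second toy trial avoids the forbidden word, so the sketch reports one compliant trial out of four. A real evaluation would need one check per instruction type and would report the rate per model.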