Reasoning models struggle to control their chains of thought, and that's good
OpenAI's paper unveils CoT-Control: an open-source suite of 13,000+ tasks drawn from GPQA, MMLU-Pro, HLE, and BFCL that measures CoT controllability. Evaluations on 13 models show compliance of just 0.1% to 15.4%. Controllability improves with model size. It drops as reasoning chains lengthen and after..










