A new LLM evaluation framework taps into an "LLM-as-judge" setup—think strong model playing human annotator. It gets prompted (or fine-tuned) to mimic human scores and rate outputs from other LLMs.
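A minimal sketch of the pointwise-judging idea, assuming an OpenAI-compatible chat client; the model name, rubric wording, and 1-5 scale are illustrative placeholders, not the framework's actual prompt.

```python
# Sketch: pointwise LLM-as-judge scoring, assuming an OpenAI-compatible client.
# Judge model, rubric, and 1-5 scale are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an AI assistant's answer like a human annotator.
Question: {question}
Answer: {answer}
Rate the answer's helpfulness and correctness on a 1-5 scale.
Reply with only the integer score."""

def judge_pointwise(question: str, answer: str, judge_model: str = "gpt-4o") -> int:
    """Ask a strong 'judge' model to mimic a human score for one answer."""
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # deterministic grading
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return int(resp.choices[0].message.content.strip())
```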
It runs against a carefully labeled golden dataset, handles both pointwise and head-to-head comparisons, and ships with an automated prompt optimizer à la DeepMind’s OPRO.
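A rough sketch of what that OPRO-style loop could look like, assuming the golden dataset is a list of (question, answer, human_score) triples and building on the hypothetical `judge_pointwise` prompt above; the meta-prompt wording and exact-match agreement metric are assumptions, not the framework's published optimizer.

```python
# Sketch: OPRO-style optimization of the judge prompt against golden labels.
# Dataset shape, optimizer model, and agreement metric are assumptions.
from openai import OpenAI

client = OpenAI()

def agreement(judge_prompt: str, golden: list[tuple[str, str, int]]) -> float:
    """Fraction of golden examples where the judge reproduces the human score."""
    hits = 0
    for question, answer, human_score in golden:
        resp = client.chat.completions.create(
            model="gpt-4o",
            temperature=0,
            messages=[{"role": "user",
                       "content": judge_prompt.format(question=question, answer=answer)}],
        )
        try:
            hits += int(resp.choices[0].message.content.strip()) == human_score
        except ValueError:
            pass  # unparsable judge output counts as a miss
    return hits / len(golden)

def optimize_judge_prompt(seed_prompt: str, golden, rounds: int = 5) -> str:
    """Iteratively ask an 'optimizer' LLM for better judge prompts, OPRO-style."""
    history = [(seed_prompt, agreement(seed_prompt, golden))]
    for _ in range(rounds):
        # Show the optimizer the scored trajectory so far, best prompts included.
        trajectory = "\n\n".join(f"Prompt:\n{p}\nAgreement: {s:.2f}" for p, s in history)
        meta = (f"Here are judge prompts and how well they match human scores:\n\n"
                f"{trajectory}\n\nWrite a new judge prompt that scores higher. "
                f"Keep the {{question}} and {{answer}} placeholders. Return only the prompt.")
        resp = client.chat.completions.create(
            model="gpt-4o", messages=[{"role": "user", "content": meta}])
        candidate = resp.choices[0].message.content.strip()
        history.append((candidate, agreement(candidate, golden)))
    return max(history, key=lambda x: x[1])[0]
```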
System shift: Human evals out, scalable LLM grading in. A step closer to self-rating, self-improving models.