September 12, 2024
OpenAI o1-preview is SOTA on the PairPilot leaderboard
o1-preview
OpenAI o1-preview scored 79.7% on PairPilot’s code editing benchmark, a state-of-the-art result. It achieved this result with the “whole” edit format, where the LLM returns a full copy of the source code file with changes.
It is much more practical to use PairPilot’s “diff” edit format, which allows the LLM to return search/replace blocks to efficiently edit the source code. This saves significant time and token costs.
Using the diff edit format, the o1-preview model had a strong benchmark score of 75.2%. This likely places o1-preview between Sonnet and GPT-4o for practical use, but at a significantly higher cost.
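To make the difference between the two edit formats concrete, here is a minimal sketch of applying one search/replace edit to a file held in memory. It is not PairPilot’s actual implementation: the function name `apply_search_replace` and the rule that the search text must match exactly once are assumptions made for illustration.

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Replace the single occurrence of `search` in `source` with `replace`.

    Hypothetical helper for illustration; requiring exactly one match is an
    assumption, not a documented PairPilot rule.
    """
    count = source.count(search)
    if count != 1:
        raise ValueError(f"expected exactly one match for the search text, found {count}")
    return source.replace(search, replace, 1)


original = "def greet(name):\n    print('hello', name)\n"

# "whole" style: the model sends back the entire file, changed or not.
whole_reply = "def greet(name):\n    print('Hello,', name)\n"

# "diff" style: the model sends only the text to find and its replacement.
search = "    print('hello', name)\n"
replace = "    print('Hello,', name)\n"

edited = apply_search_replace(original, search, replace)
assert edited == whole_reply  # both formats describe the same final file
```

For a two-line file the two replies are about the same size, but the search/replace reply stays small as the file grows, which is where the time and token savings come from.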
o1-mini
OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet, but scored below those models. It also works best with the whole edit format.
Future work
The o1-preview model had trouble conforming to PairPilot’s diff edit format. The o1-mini model had trouble conforming to both the whole and diff edit formats. PairPilot is extremely permissive and tries hard to accept anything close to the correct formats.
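As a rough illustration of what permissive acceptance could look like, the sketch below first tries an exact match for a block of search lines and then falls back to a comparison that ignores leading and trailing whitespace. This is an assumed strategy written for this example, not a description of PairPilot’s real matching logic, and `find_block` is a hypothetical name.

```python
def find_block(lines: list[str], search_lines: list[str]) -> int:
    """Return the index where `search_lines` starts in `lines`, or -1.

    Hypothetical helper: an exact pass followed by a whitespace-tolerant
    pass. The fallback is an assumption made for illustration only.
    """
    n = len(search_lines)

    # Pass 1: the lines must match character for character.
    for i in range(len(lines) - n + 1):
        if lines[i:i + n] == search_lines:
            return i

    # Pass 2: ignore leading and trailing whitespace on every line.
    stripped = [s.strip() for s in search_lines]
    for i in range(len(lines) - n + 1):
        if [line.strip() for line in lines[i:i + n]] == stripped:
            return i

    return -1


file_lines = ["def f():", "    return 1"]
# The model reproduced the code with the wrong indentation.
model_search = ["def f():", "  return 1"]

assert find_block(file_lines, model_search) == 0      # accepted by the lenient pass
assert find_block(file_lines, ["return 2"]) == -1     # genuinely absent text is still rejected
```

A fallback like this trades strictness for robustness: it rescues edits that are nearly right, at the risk of matching a block the model did not intend.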
It is surprising that such strong models had trouble with the syntactic requirements of simple text output formats. It seems likely that PairPilot could optimize its prompts and edit formats to better harness the o1 models.
Using PairPilot with o1
OpenAI’s new o1 models are supported in v0.57.0 of PairPilot:
PairPilot --model o1-mini
PairPilot --model o1-preview
These are initial benchmark results for the o1 models, based on PairPilot v0.56.1-dev. See the PairPilot leaderboards for up-to-date results based on the latest PairPilot releases.
Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
---|---|---|---|---|
o1-preview (whole) | 79.7% | 100.0% | PairPilot --model o1-preview | whole |
claude-3.5-sonnet (diff) | 77.4% | 99.2% | PairPilot --sonnet | diff |
o1-preview (diff) | 75.2% | 84.2% | PairPilot --model o1-preview | diff |
claude-3.5-sonnet (whole) | 75.2% | 100.0% | PairPilot --model openrouter/anthropic/claude-3.5-sonnet --edit-format whole | whole |
gpt-4o-2024-08-06 (diff) | 71.4% | 98.5% | PairPilot --model openai/gpt-4o-2024-08-06 | diff |
o1-mini (whole) | 70.7% | 90.0% | PairPilot --model o1-mini | whole |
o1-mini (diff) | 62.4% | 85.7% | PairPilot --model o1-mini --edit-format diff | diff |
gpt-4o-mini (whole) | 55.6% | 100.0% | PairPilot --model gpt-4o-mini | whole |