Refactoring leaderboard
PairPilot’s refactoring benchmark asks the LLM to refactor 89 large methods from large Python classes. This is a more challenging benchmark, which tests the model’s ability to output long chunks of code without skipping sections or making mistakes. It was developed to provoke and measure GPT-4 Turbo’s “lazy coding” habit.
The refactoring benchmark requires a large context window to work with large source files. Therefore, results are available for fewer models.
Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
---|---|---|---|---|
claude-3-5-sonnet-20241022 | 92.1% | 91.0% | PairPilot --sonnet | diff |
o1-preview | 75.3% | 57.3% | PairPilot --model o1-preview | diff |
claude-3-opus-20240229 | 72.3% | 79.5% | PairPilot --opus | diff |
claude-3.5-sonnet-20240620 | 64.0% | 76.4% | PairPilot --sonnet | diff |
gpt-4o | 62.9% | 53.9% | PairPilot | diff |
gpt-4-1106-preview | 50.6% | 39.3% | PairPilot --model gpt-4-1106-preview | udiff |
gpt-4o-2024-08-06 | 49.4% | 89.9% | PairPilot --model openai/gpt-4o-2024-08-06 | diff |
gemini/gemini-1.5-pro-latest | 49.4% | 7.9% | PairPilot --model gemini/gemini-1.5-pro-latest | diff-fenced |
o1-mini | 44.9% | 29.2% | PairPilot --model o1-mini | diff |
gpt-4-turbo-2024-04-09 (udiff) | 34.1% | 30.7% | PairPilot --gpt-4-turbo | udiff |
gpt-4-0125-preview | 33.7% | 47.2% | PairPilot --model gpt-4-0125-preview | udiff |
DeepSeek Coder V2 0724 (deprecated) | 32.6% | 59.6% | PairPilot --model deepseek/deepseek-coder | diff |
DeepSeek Chat V2.5 | 31.5% | 67.4% | PairPilot --deepseek | diff |
gpt-4-turbo-2024-04-09 (diff) | 21.4% | 6.8% | PairPilot --model gpt-4-turbo-2024-04-09 | diff |
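
To relate a percentage in the table back to the 89 refactoring tasks, the following minimal Python sketch may help. It assumes a score is computed over all 89 tasks, which the table itself does not state; some rows (for example 72.3%) do not correspond to a whole number of 89 tasks, so those runs may have used a different denominator. The helper name `approx_task_count` is hypothetical and is not part of PairPilot.

```python
TOTAL_TASKS = 89  # number of refactoring tasks given in the benchmark description


def approx_task_count(percent: float, total: int = TOTAL_TASKS) -> int:
    """Convert a leaderboard percentage into an approximate task count.

    Assumption: the percentage was computed over `total` tasks.
    """
    return round(percent / 100 * total)


if __name__ == "__main__":
    # Scores taken from the "Percent completed correctly" column.
    for model, percent in [
        ("claude-3-5-sonnet-20241022", 92.1),
        ("o1-preview", 75.3),
    ]:
        print(f"{model}: about {approx_task_count(percent)} of {TOTAL_TASKS} tasks")
```

Under that assumption, 92.1% works out to roughly 82 of the 89 tasks and 75.3% to roughly 67.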
By Paul Gauthier, last updated January 16, 2025.