Refactoring leaderboard

PairPilot’s refactoring benchmark asks the LLM to refactor 89 large methods from large Python classes. This more challenging benchmark tests the model’s ability to output long chunks of code without skipping sections or making mistakes. It was developed to provoke and measure GPT-4 Turbo’s “lazy coding” habit.

Because the refactoring benchmark works with large source files, it requires a model with a large context window. As a result, results are available for fewer models.
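To make the "skipping sections" failure mode concrete, here is a minimal sketch of one way such laziness could be flagged: compare the number of AST nodes before and after a refactor and treat a large drop as a sign that code was elided. This is an illustration only, not necessarily how the benchmark scores results. The names `count_nodes` and `looks_lazy`, the 10% `tolerance`, and the sample sources are all assumptions made for this example.

```python
import ast


def count_nodes(source: str) -> int:
    """Return the number of AST nodes in a piece of Python source."""
    return sum(1 for _ in ast.walk(ast.parse(source)))


def looks_lazy(original: str, refactored: str, tolerance: float = 0.1) -> bool:
    """Flag a refactor whose AST shrank by more than `tolerance`.

    The threshold is a hypothetical value chosen for illustration.
    """
    before = count_nodes(original)
    after = count_nodes(refactored)
    return after < before * (1 - tolerance)


# Hypothetical input: a class with one method to be refactored.
original = '''
class Report:
    def total(self, items):
        result = 0
        for item in items:
            result += item.price * item.qty
        return result
'''

# A faithful refactor: the body moves to a top-level function intact.
faithful = '''
def total(items):
    result = 0
    for item in items:
        result += item.price * item.qty
    return result

class Report:
    def total(self, items):
        return total(items)
'''

# A "lazy" refactor: the body is replaced by a placeholder comment.
lazy = '''
def total(items):
    # ... rest of the implementation unchanged ...
    pass

class Report:
    pass
'''

print(looks_lazy(original, faithful))
print(looks_lazy(original, lazy))
```

A node-count comparison is deliberately crude: comments do not appear in the AST, so a placeholder such as `# ... unchanged ...` contributes nothing and the shrinkage shows up directly. It cannot tell whether the surviving code is correct, only whether a suspicious amount of it disappeared.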

Model	Percent completed correctly	Percent using correct edit format	Command	Edit format
claude-3-5-sonnet-20241022	92.1%	91.0%	`PairPilot --sonnet`	diff
o1-preview	75.3%	57.3%	`PairPilot --model o1-preview`	diff
claude-3-opus-20240229	72.3%	79.5%	`PairPilot --opus`	diff
claude-3.5-sonnet-20240620	64.0%	76.4%	`PairPilot --sonnet`	diff
gpt-4o	62.9%	53.9%	`PairPilot`	diff
gpt-4-1106-preview	50.6%	39.3%	`PairPilot --model gpt-4-1106-preview`	udiff
gpt-4o-2024-08-06	49.4%	89.9%	`PairPilot --model openai/gpt-4o-2024-08-06`	diff
gemini/gemini-1.5-pro-latest	49.4%	7.9%	`PairPilot --model gemini/gemini-1.5-pro-latest`	diff-fenced
o1-mini	44.9%	29.2%	`PairPilot --model o1-mini`	diff
gpt-4-turbo-2024-04-09 (udiff)	34.1%	30.7%	`PairPilot --gpt-4-turbo`	udiff
gpt-4-0125-preview	33.7%	47.2%	`PairPilot --model gpt-4-0125-preview`	udiff
DeepSeek Coder V2 0724 (deprecated)	32.6%	59.6%	`PairPilot --model deepseek/deepseek-coder`	diff
DeepSeek Chat V2.5	31.5%	67.4%	`PairPilot --deepseek`	diff
gpt-4-turbo-2024-04-09 (diff)	21.4%	6.8%	`PairPilot --model gpt-4-turbo-2024-04-09`	diff

By Paul Gauthier, last updated January 16, 2025.