SWE-bench Multimodal is the benchmark that JavaScript devs might explore

This content originally appeared on DEV Community and was authored by Oleg Klimov

We recently ran Refact.ai Agent on SWE-bench Multimodal, a benchmark that honestly doesn’t get enough attention. It’s one of the few evaluations that test whether AI can fix bugs described with visuals: screenshots of UI issues, design mockups, diagrams, error messages, and so on.

Unlike SWE-bench Verified (Python-only), the Multimodal version focuses on web libraries and frontend tasks. That makes it more representative of real-world debugging, especially in JavaScript environments where bugs are often reported this way.

So, I'm here to share that our AI Agent Refact.ai has achieved #1 on SWE-bench Multimodal. It solved 184 out of 517 tasks (35.59%), fully autonomously. We also hold the highest SWE-bench Verified score among AI Agents evaluated at pass@1 (a single attempt per task).

Refact.ai is the leading AI Agent for programming on SWE-bench

The full SWE-bench pipeline we used is open-source and fully reproducible.

You can run Refact.ai in VS Code or JetBrains, or self-host it: it can fix the toughest bugs, take on the routine dev tasks you delegate, build working solutions from scratch, and help you do more with less manual coding!

In this post, I’ll walk through how we achieved top results on SWE-bench and the tech behind the runs.

#1 AI Agent in SWE-bench Multimodal is Refact.ai

SWE-bench Multimodal tests whether an AI Agent can handle GitHub issues that include both text and visuals, such as:

  • Screenshots of bugs or interface issues
  • Design mockups or wireframes
  • Diagrams explaining desired functionality
  • Error messages with visual context

It covers tasks from libraries used in web interfaces, diagramming, data visualization, syntax highlighting, and more.

We ran this benchmark fully autonomously using a locally modified version of the official sb-cli to enforce single-threaded execution.
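
For context, here is a minimal sketch of what serializing the evaluation can look like. It is not the actual sb-cli patch, and run_instance() is a hypothetical placeholder for whatever evaluates a single prediction.

```python
# A minimal sketch of sequential evaluation, assuming a predictions file keyed
# by instance_id. This is NOT the actual sb-cli modification; run_instance() is
# a hypothetical placeholder for whatever evaluates one prediction.
import json

def run_instance(instance_id: str, prediction: dict) -> dict:
    """Hypothetical single-task evaluation (details omitted)."""
    raise NotImplementedError

def evaluate_sequentially(predictions_path: str) -> list[dict]:
    with open(predictions_path) as f:
        predictions = json.load(f)
    results = []
    # One task at a time: no worker pool, no concurrent runs.
    for instance_id, prediction in predictions.items():
        results.append(run_instance(instance_id, prediction))
    return results
```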

Also, we didn’t use extra agentic tools like debug_script or strategic_planning, which were part of our earlier Verified runs.

Evaluation results:

| Total | Solved | Solved (%) | Not solved | Failed runs |
|-------|--------|------------|------------|-------------|
| 517   | 184    | 35.59%     | 326        | 3           |

Achieving #1 on SWE-bench Multimodal makes Refact.ai a top-tier AI Agent for JavaScript tasks.

Combined with our leading results on Python-based SWE-bench, it confirms the Agent’s ability to deliver high-quality results across programming languages.

Refact.ai’s open-source approach to SWE-bench

The new run introduced a key upgrade: Anthropic’s Claude 4 Sonnet as the core model, bringing a notable boost in reasoning and code generation. With it, Refact.ai Agent reached 74.40% on SWE-bench Verified, surpassing our previous best Verified score of 70.4% with Claude 3.7 Sonnet.

Beyond that, this milestone builds on everything we’ve learned from earlier SWE-bench runs.

Our approach remains focused on reliability and step-by-step problem solving. Key elements of the SWE-bench Verified setup included:

  • Open-source Agent prompt, available on GitHub
  • Claude 4 Sonnet as the core model
  • A debug_script() sub-agent that fixes bugs and can modify or create scripts
  • Extensive guardrails to catch when the model is stuck or going off track, and to redirect it back on course
  • Incremental improvements built on our previous Claude 3.7 run

How does Refact.ai Agent solve the SWE-bench Verified tasks? It follows a four-step strategy defined in its system prompt.

The Agent starts by exploring the problem, using tools like cat() to open files and search_symbol_definition(), search_pattern(), and similar tools to locate relevant code. It also uses compress_session() to keep the gathered context within budget, ensuring it has the right information before attempting any changes.
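
As a rough illustration, the exploration step might look like the tool-call sequence below. Only the tool names come from this post; the call_tool() helper, its return types, and the query strings are assumptions.

```python
# Hypothetical sketch of step one; call_tool() and the query strings are invented,
# only the tool names (cat, search_*, compress_session) appear in the post.
def explore_problem(call_tool) -> list[str]:
    # Locate candidate files by searching for patterns and symbols from the issue.
    hits = call_tool("search_pattern", pattern="unsupported operand type")   # assumed query
    defs = call_tool("search_symbol_definition", symbol="render_chart")      # assumed symbol
    opened = sorted(set(hits) | set(defs))
    # Open the candidates one by one instead of whole folders.
    for path in opened:
        call_tool("cat", path=path)
    # Compress the session so the gathered context stays within budget.
    call_tool("compress_session")
    return opened
```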

At step two, the Agent reproduces the issue. It runs all existing tests to ensure a clean baseline, writes a script that triggers the bug (covering all possible edge cases), sets up the environment, and runs the script via shell("python ...") to confirm the failure. Then debug_script() takes over — a custom sub-agent that uses pdb to debug, modify, and generate scripts. Powered by Claude 4 with o4-mini for summarizing the debug info, it’s called at least once — and up to three times — per task. In practice, it was really helpful for digging into the problem source.
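
To make that concrete, here is the shape such a reproduction script might take. The module, function, and expected values are invented for illustration, since the real scripts are generated per task.

```python
# Illustrative reproduction script of the kind run via shell("python ...");
# mypkg.dates.parse_interval and the expected values are hypothetical.
import sys

def main() -> int:
    from mypkg.dates import parse_interval  # hypothetical buggy function

    cases = [
        ("1d", 86400),    # plain case
        ("0d", 0),        # edge case: zero-length interval
        ("2h30m", 9000),  # edge case: mixed units
    ]
    failures = 0
    for text, expected in cases:
        got = parse_interval(text)
        if got != expected:
            print(f"FAIL: parse_interval({text!r}) = {got}, expected {expected}")
            failures += 1
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())  # a non-zero exit code confirms the bug is reproduced
```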

Once reproduction is complete, the Agent plans and applies the fix based on the debugging report. It updates project files directly, without creating patches or diffs. In the earlier run, this step used a separate strategic_planning() tool. With Claude 4 Sonnet, that’s no longer needed: the model’s reasoning is strong enough to handle planning on its own.

Finally, the Agent checks its work: it re-runs the reproduction script and the project’s existing tests to validate the fix. If all tests pass, it uses compress_session() to offload any debug or temporary files and optimize context usage before ending the run.
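
A simplified version of that final check might look like this; the concrete commands are assumptions, since each project defines its own test runner.

```python
# Sketch of the verification step: re-run the reproduction script and the
# project's pre-existing tests, and only consider the task done if both pass.
# The pytest command is an assumption; real projects use their own runners.
import subprocess

def verify_fix(repro_script: str = "reproduce_bug.py") -> bool:
    repro = subprocess.run(["python", repro_script])
    tests = subprocess.run(["python", "-m", "pytest", "-q"])
    return repro.returncode == 0 and tests.returncode == 0
```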

Throughout the run, automatic guardrails help keep the Agent on track. These are mid-run messages inserted into the chat as if from a simulated “user” whenever the model gets stuck or makes mistakes. A script monitors Claude 4’s outputs and, when needed, injects messages to guide the model back on course. For example, it may remind the model to open all visited files after debug_script(), or to follow the correct implementation rules after planning. These small interventions make a big difference in stability.
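
As a toy illustration, a guardrail of this kind could be a rule that inspects the latest assistant turn and appends a corrective “user” message. The rule and the message text below are simplified assumptions, not our actual monitoring script.

```python
# Toy guardrail: if a rule fires on the latest assistant message, inject a
# corrective message that looks like it came from the user. The rule and the
# message text are simplified assumptions.
def apply_guardrails(messages: list[dict]) -> list[dict]:
    last = messages[-1]
    if last.get("role") != "assistant":
        return messages
    text = last.get("content", "")
    if "debug_script" in text and "cat(" not in text:
        messages.append({
            "role": "user",
            "content": "Reminder: open all files you visited during debug_script() before editing.",
        })
    return messages
```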

The entire run is fully autonomous: no manual inputs, no retries. Each task runs in a single session, with the Agent self-correcting and managing context to stay efficient and produce a single correct solution.

SWE-bench Verified vol.2: What changed in the new run

Several upgrades helped push Refact.ai Agent from 70.4% to 74.4% on SWE-bench Verified:

  • Model upgrade to Claude 4 Sonnet: Replaced Claude 3.7 with the more advanced Claude 4 Sonnet.
  • Removed strategic_planning(): Previously, this tool (powered by o3) reasoned over debug_script() output and modified files. This is now fully handled by Claude 4 Sonnet.
  • New safeguard against file overload: the Agent used to open entire folders with cat(), leading to context overflow. We’ve added a limit: if a folder contains more than 5 files, the request returns an error and asks for one-by-one access: “Too many files were requested. Please open files one by one.” (A minimal sketch of this guard follows the list.)
  • Extra guardrail at the end of the session: “Check the last time that all changes applied to the project directly and all pre-existing tests aren’t broken.”
  • Larger context for search_pattern().
  • Minor tweaks to debug_script() prompt.
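
Here is a minimal sketch of the file-count guard mentioned above; the cat_folder() helper and its exact behavior are assumptions about how such a limit could be implemented on the tool side.

```python
# Hypothetical implementation of the "too many files" safeguard; only the limit
# of 5 files and the error message come from the post.
from pathlib import Path

MAX_FILES_PER_REQUEST = 5

def cat_folder(path: str) -> str:
    files = [p for p in Path(path).iterdir() if p.is_file()]
    if len(files) > MAX_FILES_PER_REQUEST:
        # Refuse the bulk read and push the Agent toward targeted access.
        return "Too many files were requested. Please open files one by one."
    return "\n\n".join(p.read_text(errors="replace") for p in sorted(files))
```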

All these improvements work together to make Refact.ai Agent more robust and efficient. Moving to Claude 4 Sonnet significantly boosted reasoning ability and allowed us to simplify the agent’s loop while still solving more tasks. Meanwhile, the debug sub-agent and guardrails have been enhanced to ensure greater reliability throughout each run.

Evaluation results:

| Total | Solved | Not solved | Solved (%) | Not solved (%) |
|-------|--------|------------|------------|----------------|
| 500   | 372    | 128        | 74.40%     | 25.60%         |

From benchmark to your IDE

Ultimately, our focus isn’t only on benchmark scores; it’s on building an AI agent that truly works for real developers. The lessons learned and improvements made for SWE-bench are already finding their way into the product, so when you use Refact.ai, you’re benefiting from the same engineering approach that achieved this benchmark record. In your IDE, Refact.ai Agent:

  • Solves tasks autonomously, from start to finish
  • Fully understands your codebase, not just open tabs
  • Transparent by design — every step is visible and reversible
  • Integrates with dev tools (GitHub, Web, MCP, and more) to work across your system
  • BYOK-friendly or self-hosted if you want full control

Refact.ai Agent is an AI agent for software engineering you can trust — and guide when needed. Autonomous when you want it, collaborative when you step in.

If you’re ready to work with an AI that understands your environment, works across your tools, and earns your trust one task at a time — Refact.ai is ready for you.

Join our community to see what real developers are building end-to-end.
And of course, I'd be happy to answer any of your questions and chat. Thanks for reading!

