Launch demo benchmark claims found to be misleading by independent researchers
The viral March 2024 demo claimed that Devin achieved '13.86% on SWE-bench'. Independent researchers, including some at Princeton, found that the figure came from a non-standard subset of the benchmark and involved unreported human assistance. The claim was not corrected in the marketing materials, and media coverage continued to cite it.
"We replicated the evaluation and found the 13.86% figure used a non-verified subset with human-in-the-loop assistance not disclosed in the original announcement."