Model Evaluation and Threat Research is an AI research charity that looks into the threat of AI agents! That sounds a bit AI doomsday cult, and they take funding from the AI doomsday cult organisat…
AI-only vibe coders. As a development manager I can tell you that AI-augmented actual developers who know how to write software and what good and bad code looks like are unquestionably faster. GitHub Copilot makes creating a suite of unit tests and documentation for a class take very little time.
Try reading the article.
The article is a blog post summarizing the actual research. The researchers’ summary says: “We do not provide evidence that: AI systems do not currently speed up many or most software developers. We do not claim that our developers or repositories represent a majority or plurality of software development work.”
The research shows that under their tested scenario and assumptions, devs were less productive.
The takeaway from this study is to measure and benchmark what’s important to your team. However, many development teams have been doing that, albeit not in a formal study format, and finding that AI improves productivity. It is not (only) “vibe productivity”.
And certainly I agree with the person you replied to: anecdotally, AI makes my devs more productive by cutting out the most grindy parts, like writing mocks for tests or getting that last missing coverage corner. So we have some measuring and validation to do.
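To make those grindy parts concrete, here is a minimal sketch of the sort of mock boilerplate that gets handed off to the tool. It assumes a C# stack with xUnit and Moq; the IPaymentGateway and CheckoutService names are hypothetical, invented for this example, not taken from the study.

    // Hypothetical example of grindy test-mock boilerplate (xUnit + Moq).
    // Everything here is illustrative; none of it comes from the study.
    using Moq;
    using Xunit;

    public interface IPaymentGateway
    {
        bool Charge(string accountId, decimal amount);
    }

    public class CheckoutService
    {
        private readonly IPaymentGateway _gateway;
        public CheckoutService(IPaymentGateway gateway) => _gateway = gateway;
        public bool Checkout(string accountId, decimal amount) => _gateway.Charge(accountId, amount);
    }

    public class CheckoutServiceTests
    {
        [Fact]
        public void Checkout_ChargesTheAccount()
        {
            // Arrange: stub the gateway so the test needs no real backend.
            var gateway = new Mock<IPaymentGateway>();
            gateway.Setup(g => g.Charge("acct-1", 9.99m)).Returns(true);

            // Act
            var ok = new CheckoutService(gateway.Object).Checkout("acct-1", 9.99m);

            // Assert: the charge succeeded and the gateway was hit exactly once.
            Assert.True(ok);
            gateway.Verify(g => g.Charge("acct-1", 9.99m), Times.Once);
        }
    }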
The research explicitly showed that the anecdotes were flawed, and that actual measured productivity was the inverse of what the users imagined. That’s the entire point. You’re just saying “nuh uh, muh anecdotes.”
I said it needs to be measured. But few teams are going to do that; they’re building products, not case studies.
This study is catnip for the people who put “AI” in scare quotes and expect those of us who use it to suddenly realize that we’ve only been generating hallucination slop. This has not been the lived experience of those of us in software development. In my own case, I’ve seen teams stop hiring because they are getting the same amount of work done in less time. But those are anecdotes, so they don’t count.
It’s entirely possible to measure metrics.
Enjoy your slopware.
I did, thank you. Terms therein like “they spend more time prompting the AI” genuinely do not apply to a code copilot like the one provided by GitHub, because it infers its prompt from what you’re doing and from the context of the file and application, then surfaces the model’s completion as an autocomplete suggestion you can accept or ignore like any other.
You can start writing test templates and it will fill them out for you, and then write the next tests based on the inputs of your methods and the imports in the test class. You can write a whole class without any copilot usage and then start writing the xmldocs and it will autocomplete them for you based on work you already did. Try it for yourself if you haven’t already, it’s pretty useful.
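As a rough illustration of that workflow, here is a C# sketch; the Parser class and its tests are hypothetical, written for this example. You type the method and the first test, and completions along these lines get proposed from the surrounding context.

    // Hypothetical sketch of the workflow described above. You write the
    // method and the first test; the xmldoc body and the follow-on test are
    // the kind of thing the autocomplete proposes from existing context.
    using System;
    using Xunit;

    public static class Parser
    {
        /// <summary>
        /// Parses a comma-separated pair of integers, e.g. "3,4".
        /// </summary>
        /// <param name="input">The raw string to parse.</param>
        /// <returns>The two parsed integers as a tuple.</returns>
        public static (int, int) ParsePair(string input)
        {
            var parts = input.Split(',');
            return (int.Parse(parts[0]), int.Parse(parts[1]));
        }
    }

    public class ParserTests
    {
        [Fact]
        public void ParsePair_ReturnsBothValues() // you write this one...
        {
            Assert.Equal((3, 4), Parser.ParsePair("3,4"));
        }

        [Fact]
        public void ParsePair_ThrowsWhenCommaIsMissing() // ...this gets suggested
        {
            Assert.Throws<IndexOutOfRangeException>(() => { Parser.ParsePair("34"); });
        }
    }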
I read the article (not the study, only the abstract), and they were getting paid an hourly rate. It did not mention anything about whether or not they had experience in using LLMs to code. I feel there is a sweet spot; it has to do with context window size, etc.
I was not consistently better a year and a half ago, but now I know the limits, caveats, and methods.
I think this is a very difficult thing to quantify, but haters gonna latch on to this, same as the studies that said “AI makes you stupid” and “LLMs can’t reason”… it’s a cool tool that has limits.
One interesting feature of this paper is that the programmers who used LLMs thought they were faster: they estimated the LLMs were saving about 20% of the time it would have taken without them. I think that’s a clear sign that you shouldn’t trust your gut about how much time LLMs save you; you should definitely try to measure it.
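For what “measure it” could look like at its simplest, here is a toy C# sketch. The task durations are invented placeholders, and a real comparison would need matched tasks and far more samples; the point is only to compare logged times instead of gut feel.

    // Toy comparison of logged task durations with and without AI assistance.
    // All numbers are invented; substitute your own team's task log.
    using System;
    using System.Linq;

    class SpeedupCheck
    {
        static void Main()
        {
            double[] hoursWithAi    = { 1.8, 2.4, 3.1, 1.5, 2.9 };
            double[] hoursWithoutAi = { 1.6, 2.0, 2.8, 1.4, 2.2 };

            double meanWith    = hoursWithAi.Average();
            double meanWithout = hoursWithoutAi.Average();

            // Positive delta = tasks took longer with AI; negative = faster.
            double delta = (meanWith - meanWithout) / meanWithout;
            Console.WriteLine($"with AI: {meanWith:F2}h  without: {meanWithout:F2}h  delta: {delta:P1}");
        }
    }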
The study did find a correlation between prior experience and performance. One of the developers who showed a positive speedup with AI was the one with the most previous experience using Cursor (over 50 hours).