Unless I am repeatedly missing it, it’s not mentioned in the article how much money the researchers spent performing the tests. What was the budget for the AI execution? If the researchers only spent $10,000 to “earn” $400,000, that’s amazing, whereas if they spent $500,000 for the same result, that’s obviously less exciting.
And did they actually earn anything, or did they just evaluate the performance and link it to a fee?
Totally. Solving the coding task is just half the challenge. You've still got to win the job, etc.
Not only win the job. Deal with management, process, meetings...
Yup. Plenty of times there are tiny things I know I could fix within minutes with a fee of say 15 dollars. Worth the time to code it? Totally. Worth sending messages back and forth, having meetings, etc. No way.
This resonated with me based on my recent experience using Claude to help me code. I almost gave up, but re-phrased the initial request (after 7-10 failed tries) and it finally nailed it.
> 3. Performance improves with multiple attempts

Allowing the o1 model 7 attempts instead of 1 nearly tripled its success rate, going from 16.5% to 46.5%. This hints that current models may have the knowledge to solve many more problems but struggle with execution on the first try.
https://newsletter.getdx.com/i/160797867/performance-improve...
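Roughly, this is a best-of-k evaluation: generate several candidate solutions and count the task as solved if any one of them passes. A minimal sketch of that retry loop, assuming hypothetical generate_candidate and passes_tests helpers rather than the paper's actual harness:

    # Hypothetical best-of-k retry loop (not the paper's actual harness):
    # generate up to max_attempts candidate solutions and accept the first
    # one that passes the task's tests.
    from typing import Callable, Optional

    def solve_with_retries(
        generate_candidate: Callable[[str], str],  # hypothetical: prompt -> candidate solution
        passes_tests: Callable[[str], bool],       # hypothetical: run the task's test suite
        prompt: str,
        max_attempts: int = 7,
    ) -> Optional[str]:
        for _ in range(max_attempts):
            candidate = generate_candidate(prompt)
            if passes_tests(candidate):
                return candidate
        return None

For reference, if each attempt were an independent 16.5% shot, seven tries would succeed about 1 - 0.835^7 ≈ 72% of the time; the reported 46.5% suggests failures on the same task are correlated across attempts.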
I haven't really messed with Claude or other programming AIs much, but when using chatgpt for random stuff, it seems like the safety rails end up blocking a lot of stuff and rephrasing to get around them is necessary. I wonder if some of these programming AIs would be more useful if some of the context that causes them to produce invalid results were more obvious to users.
> safety rails end up blocking a lot of stuff
curious if you had any examples. i'm fairly meh on llm coding myself but have a pet theory on safety rails. i've certainly hit plenty myself, but not when coding with llms.
With chatgpt, for noncoding things, it has rails to avoid things like copyrighted art, violence, adult topics, etc. For coding LLMs, I suspect they have things like preferences for certain data structures, avoiding directly returning training data (even if that training data might be the only feasible way to do something), preferences for certain languages and APIs, etc.
If you knew what some of those preferences and rails were ahead of time, it'd be easier to design your request and also to know why it's making some odd or unworkable suggestions.
How do they know the tasks were "solved"? Wouldn't that require the customer to be happy, and pay the bounty?
It's an OpenAI ad... And BTW the actual paper says: "we [..] find that frontier models are still unable to solve the majority of tasks"
Honestly, this reads like an AI-generated summary.
Discussion on original paper: https://news.ycombinator.com/item?id=43086347
There goes all the low-hanging fruit ...
Wait a bit. The work for IT cleanup crews that will be needed to mop up all the vibe-damage from the locust swarm of greedy binheads that are currently puking imitation code with bugs and issues no man has seen before will eventually be plentiful. (if there will be server on server left standing after all this)
No
tl;dr, and as Betteridge's Law would lead you to believe, the answer is no.
>Betteridge's Law
Is that the one that says that if an article title ends in a question mark, the answer is no?
Yes, and it especially works well for questions that sound too good to be true.