
  1. 60
    AI Assistance Reduces Persistence and Hurts Independent Performance practices arxiv.org

    1. 38

      > The rapid rise of AI chatbots promises immediate and effective help with reasoning-intensive tasks such as studying, writing, coding, and brainstorming.
      > [...]
      > They were then presented with a series of 12 fraction problems with an AI assistant (GPT-5) available in a sidebar. The AI assistant was pre-prompted with each problem and its solution, allowing participants to receive immediate, accurate answers with minimal effort.

      I like how it starts out with "we know this is super effective" and then the methodology shows that the "effective" help was ... they pre-seeded the answers into the AI because they didn't trust it to solve basic fraction problems.

      > Yet these findings need not be cause for pessimism. Rather, they point toward a clear design imperative: AI systems should optimize for long-term human capability and autonomy, a goal that cannot be achieved by surface-level interventions.

      How naive do you have to be to believe that OpenAI or Anthropic would care about this? It's in their direct interest to ensure that users become reliant on their systems; why would they do anything to prevent that? Never mind that they don't even hint at how this could possibly be achieved with LLM technology even assuming the vendors were inclined to do so.

      1. 17

        I disagree with this take. First, because when doing an experiment, you want as consistent a setting as you can achieve. They wanted the assistant to know "the answer" was "1/5", not 0.2 or one fifth, or here's a Python program to calculate it, or ... There are reasons to ground some things in research to avoid randomness even if you expect the answer to be true either way.

        Second, because what the LLMs do depends hugely on what you ask. You can ask for a solution, you can ask for verification, you can ask for a tutorial - and current benchmarks test most of those options. There's no reason for any of the big labs to nerf the model in that way - it would be noticed very quickly, so the incentive is not there. Also, the paper mentioned "AI systems", not LLMs there. Harness is part of the system and there are educators doing the right thing given they can't prevent students from using LLMs: https://gist.github.com/1cg/a6c6f2276a1fe5ee172282580a44a7ac

      2. 25

        Imagine stretching out the cognitive difference from the study across a longer timespan. Current use of AI assistance in programming may not be representative of its long-term use, because the current cohort of engineers know how to code, and are more able to spot flaws in LLM-generated software, and guide it towards their desired outcomes. The new wave of "engineers" who have never gone through the mental gymnastics will show the true power of LLMs as a coding tool.

        At the same time, good engineers who use LLMs to offload all of their mental tasks will not be able to keep pace, since they're no longer using their coding muscles.

        Think of a great piano player who realizes that hitting play on a Spotify playlist sounds pretty close to playing the piano, does it for 10 years, and then wants to play something specific but can't remember how.

        1. 3

          > because the current cohort of engineers know how to code, and are more able to spot flaws in LLM-generated software, and guide it towards their desired outcomes.

          So... by your own admission, you expect the future cohort to be less able to spot flaws, and less able to guide it towards desired outcomes?

          > The new wave of "engineers" who have never gone through the mental gymnastics will show the true power of LLMs as a coding tool.

          This seems contradictory with your earlier statement.

          1. 5

            Sorry, to clarify: "show the true power (derogatory) of LLMs as a coding tool"

        2. 21

          > These results suggest the need for AI model development to prioritize scaffolding long-term competence alongside immediate task completion.

          Oh really, is that what these results suggest? Because I can think of some alternative suggestions.

          1. 17

            If you give people a mentor who tells them the answer, people will rely on the mentor.

            This is absolutely true, and programmers will soon have to battle the same dynamic that other professions, like commercial pilots, do. How do you keep your skills sharp when a computer is better at your job 99% of the time, but the remaining 1% is extremely difficult problem solving you need great skills for?

            1. 6

              Spot on. IIRC pilots do both: learn and train for the modern aircraft failure modes, and practice flying simpler airplanes to keep the airflow instincts sharp.

              Likely, we'll do something similar. Thus I always voted to keep the coding interview simple but real and AI free.

              1. 6

                How would that work? 30% of paid work time for code katas, like astronauts do their PE on the ISS? At that point, you might as well just forgo the LLM use and use 100% of the time for "classical" development.

                What I think will be more likely will be companies trying to "outsource" that by round-robin-hiring the last remaining self-thinking developers until those are used up.

                1. 12

                  Developers will never get paid to train. Pilots do because the costs of failure are incredibly high. In IT, projects failing and effort going to waste is just another Tuesday.

            2. 6

              I'm not sure this test is actually testing anything...useful? Like if you did s/AI/calculator/, such that

              • the opening directions said that I should try to use the calculator
              • the final 3 questions did not have access to the calculator

              I might be inclined to also skip? This is a pretty short problem set that some people might not really care for, and by the time the assistant is removed you're already mostly done. It's not clear to me that you can take much of note from this. (To be entirely frank, there have also just been better, and more concerning, studies done on the effect on thinking, in a much more intricate way.)

              1. 15

                I agree that it's unclear how well this result generalizes and that the authors try to inflate the importance of the study. As always! Also, this study is still in review, which is worth noting. I think there is a conclusion to be made that the result supports, though.

                One argument that I have heard from AI proponents is that because they are spending less time and effort writing code, they can now spend that time and effort on architecture / planning / reviewing etc. I think this study shows quite clearly that mental effort is not quantitative in that sense, and that reduced effort in one area of solving a problem rather creates an expectation of low effort, and that in turn creates a tendency to try to avoid effort in other areas.

                Anecdotally, this is something that I see in others who adopt LLMs, to any degree. You would think that people would spend more time reviewing and testing code that they themselves did not write, but I am seeing the opposite. There is a tendency to accept larger changes or rewrites that would not have been accepted before, and a large number of generated tests replaces even a minimal amount of manual testing, so that I see PRs that do not work at all get sent on to review.

                1. 2

                  > I think this study shows quite clearly that mental effort is not quantitative in that sense, and that reduced effort in one area of solving a problem rather creates an expectation of low effort, and that in turn creates a tendency to try to avoid effort in other areas.

                  This does make sense, but I think they needed to extend the study a bit more to show this concretely, in particular tying in some other form of question entirely (to demonstrate the effort reduction extending outward) or implying some actual penalty to skipping (if you go "oh but I review the LLM output for correctness" and then don't, it's a bit different than if someone tells you "oh you don't have to bother reviewing it at all").

                  1. 2

                    The third experiment does not test this AI/calculator difference

                  2. 9

                    But calculators don't habitually confabulate and cover up mistakes. If faced with a task you can do with a calculator, you need to verify the algorithm one time - or trust the manufacturer - and you're good to go. In contrast, using LLMs requires constant, intense skepticism. I think they are qualitatively different in that way.

                    1. 1

                      They absolutely do. My nephew went to school in the late 2000s and his math teacher actively noticed a drop in thinking/problem-solving ability around the time graphing calculators were allowed for exams. Algebra, understanding functions, proofs. He actively fought to at least have a higher stream for talented kids focused on proofs instead of plug and play.

                      They're not probabilistic, true. A floating-point error will propagate through every time you make a calculation.

                      But the over-reliance on tools for education is nothing new.

                      People leveled the same criticisms at Google and Wikipedia a decade or two ago that they now level at LLMs: "People will just search the answer!", "They won't be critical about sourcing". I would argue the exact opposite happened in those cases.

                      1. 1

                        This is a fair point; there are very advanced calculators. I think my opinion is consistent, though, in that I know my times tables, didn't use CAS at all during my math courses, and think I'm better off because of it.

                        Also, I do think it matters that your TI-89 won't lie to you and say it has solved a problem when it actually hasn't.

                        1. 3

                          «I know my answer is correct, I checked on a calculator» (the narrator: the formula used has nothing to do with what actually needs to be calculated) is also a thing that happens, so maybe we are not that much more doomed now than we were before.

                          1. 1

                            True! Let's hope :)

                      2. 1

                        > If faced with a task you can do with a calculator, you need to verify the algorithm one time - or trust the manufacturer - and you're good to go.

                        Right, but my whole issue with the study is that it's not actually testing this, because:

                        • the LLMs seemingly always gave correct answers
                        • the actual differentiator between the LLM and non-LLM portions of the test for LLM users is whether or not they...skipped
                      3. 2

                        They also tested reading comprehension and found the same effects: "Experiment 3: Convergent evidence from reading comprehension".

                        1. 1

                          They did, but I'd argue it runs into the same issue of "skip" not being a useful indicator. (Although yes, in that case you certainly couldn't compare it to a calculator!)

                        2. 1

                          Indeed, testing the unique effects of LLMs by offering them as assistance in tasks that even some exam-allowed calculators can do sounds at best like «second intervention condition (fraction-capable calculator) needed to calibrate the results».

                        3. 1

                          Am I getting this wrong, or could they have just used a calculator instead of an LLM? What I am trying to say is that this hardly feels like a new phenomenon. It seems like moving from a physical dictionary to a digital one when studying a new language, or from sifting through books in a library to using a search engine and Wikipedia. The process went from labor-intensive to more immediate. Are we really that much worse off after these technological innovations?

                          1. 2

                            As I mentioned in my other comment, they also tested reading comprehension and found the same effects: "Experiment 3: Convergent evidence from reading comprehension".
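
A side note on two points raised in the thread: the argument that the assistant was pinned to one answer form ("1/5" rather than 0.2), and the remark that calculator errors are deterministic rather than probabilistic. The Python sketch below is only an illustration and is not taken from the paper. `canonical` is a hypothetical helper, and it assumes answers arrive as plain numeric strings (a spelled-out "one fifth" would raise `ValueError`).

```python
from fractions import Fraction


def canonical(answer: str) -> Fraction:
    """Parse a numeric string such as '1/5', '0.2' or '2/10' into an exact,
    fully reduced fraction. Hypothetical helper for illustration only."""
    return Fraction(answer.strip())


# Different surface forms of the same answer collapse to one value.
forms = ["1/5", "0.2", "2/10", " 1/5 "]
values = {canonical(f) for f in forms}

# Binary floating point is inexact, but it is inexact in the same way
# on every run: the error is deterministic, not random.
float_sum = 0.1 + 0.2
float_sum_again = 0.1 + 0.2

# Exact rational arithmetic has no such rounding error.
exact_sum = Fraction("0.1") + Fraction("0.2")

if __name__ == "__main__":
    print(values)
    print(float_sum == 0.3, float_sum == float_sum_again)
    print(exact_sum == Fraction(3, 10))
```

Comparing normalized values would accept any of the numeric forms, which is one way an experiment could avoid depending on how an assistant happens to phrase its answer. Pre-seeding the solution, as the study did, sidesteps the question entirely.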