• Technus@lemmy.zip
    English
    arrow-up
    67
    arrow-down
    1
    ·
    4 days ago

    Beyond proving hallucinations were inevitable, the OpenAI research revealed that industry evaluation methods actively encouraged the problem. Analysis of popular benchmarks, including GPQA, MMLU-Pro, and SWE-bench, found nine out of 10 major evaluations used binary grading that penalized “I don’t know” responses while rewarding incorrect but confident answers.

    “We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty,” the researchers wrote.

    I just wanna say I called this out nearly a year ago: https://lemmy.zip/comment/13916070
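
The incentive described in the quoted passage can be put into numbers. What follows is only a rough sketch, not the paper's actual scoring code: `expected_score`, the 25% chance of guessing right, and the one-point penalty are all made-up values for illustration.

```python
def expected_score(p_correct, abstain, wrong_penalty=0.0):
    """Expected score on one question (hypothetical grading scheme).

    A correct answer earns 1 point, "I don't know" earns 0 points,
    and a wrong answer costs `wrong_penalty` points.
    """
    if abstain:
        return 0.0
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


p = 0.25  # assumed chance that a blind guess happens to be right

# Binary grading: wrong answers cost nothing, so guessing always wins.
binary_guess = expected_score(p, abstain=False)
binary_idk = expected_score(p, abstain=True)

# Grading with a penalty for confident errors: abstaining now wins.
penalized_guess = expected_score(p, abstain=False, wrong_penalty=1.0)
penalized_idk = expected_score(p, abstain=True, wrong_penalty=1.0)

print(f"binary:    guess={binary_guess:+.2f}  idk={binary_idk:+.2f}")
print(f"penalized: guess={penalized_guess:+.2f}  idk={penalized_idk:+.2f}")
```

Under binary grading any nonzero chance of being right makes guessing the better strategy, which is the behavior the researchers say the benchmarks reward.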

    • MelodiousFunk@slrpnk.net
      English
      arrow-up
      34
      ·
      4 days ago

      nine out of 10 major evaluations used binary grading that penalized “I don’t know” responses while rewarding incorrect but confident answers.

      This is how we treat people, too. I can’t count the number of times I’ve heard IT staff spouting off confident nonsense and getting congratulated for it. My old coworker turned it into several promotions because the people he was impressing with his bullshit were so far removed from day-to-day operations that any slip-ups could be easily blame-shifted to others. What mattered was that he sounded confident despite knowing jack about shit.

    • Rhaedas@fedia.io
      arrow-up
      11
      ·
      4 days ago

      I’d say extremely complex autocomplete, not glorified, but the point still stands that using probability to find accuracy is always going to deviate eventually. The tactic now isn’t to try other approaches; they’ve come too far and have too much invested. Instead they keep stacking more and more techniques to try and steer and rein in this deviation. That's difficult when, in the end, there isn’t anything “thinking” at any point.

      • Not a newt@piefed.ca
        English
        arrow-up
        12
        ·
        4 days ago

        Instead they keep stacking more and more techniques to try and steer and rein in this deviation.

        I hate how the tech bros immediately say “this can be solved with an MCP server.” Bitch, if the only thing that keeps the LLM from giving me wrong answers is the MCP server, then said server is the one that’s actually producing the answers I need, and the LLM is just lipstick on a pig.

      • 87Six@lemmy.zip
        English
        arrow-up
        5
        arrow-down
        1
        ·
        4 days ago

        AI is and always will be just a temporary solution to problems that we can’t put into an algorithm to solve as of now. As soon as an algorithm for such a problem comes out, AI is done for. But figuring out complex algorithms for near-impossible problems is not as impressive to investors…

        • mindbleach@sh.itjust.works
          English
          arrow-up
          1
          ·
          3 days ago

          While technically correct, there is a steep hand-wave gradient between “just” and “near-impossible.” Neural networks can presumably turn an accelerometer into a damn good position tracker. You can try filtering and double-integrating that data, using human code. Many humans have. Most wind up disappointed. None of our clever theories compete with beating the machine until it makes better guesses.

          It’s like, ‘as soon as humans can photosynthesize, the food industry is cooked.’

          If we knew what neural networks were doing, we wouldn’t need them.
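
For anyone who hasn't tried the "filter and double-integrate" route, here is a rough sketch of why it tends to disappoint. It assumes a sensor that is sitting perfectly still and reports only a small constant bias; the bias, sample rate, and duration are made-up numbers, and real sensors add noise on top.

```python
# Assumed, illustrative values (not taken from any real sensor datasheet).
BIAS = 0.01        # constant accelerometer error in m/s^2
DT = 0.01          # sample period in seconds (100 Hz)
STEPS = 6000       # 60 seconds of samples


def double_integrate(accel_samples, dt):
    """Integrate acceleration twice (simple Euler steps) to get position."""
    velocity = 0.0
    position = 0.0
    for accel in accel_samples:
        velocity += accel * dt     # first integration: acceleration -> velocity
        position += velocity * dt  # second integration: velocity -> position
    return position


# The device never moves, so true acceleration is 0 and we only see the bias.
samples = [BIAS] * STEPS
drift = double_integrate(samples, DT)

print(f"estimated position after {STEPS * DT:.0f} s: {drift:.2f} m (true: 0 m)")
```

The error grows with the square of elapsed time (roughly 0.5 * bias * t^2), so a bias too small to notice in the raw data puts a motionless device about 18 metres away after one minute.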

          • 87Six@lemmy.zip
            English
            arrow-up
            1
            ·
            3 days ago

            But…we do know what they are doing…AI is based completely on low-level calculations that are well defined. And just because we haven’t found an algorithm for your example yet, that doesn’t mean one doesn’t exist.

            • mindbleach@sh.itjust.works
              English
              arrow-up
              1
              ·
              2 days ago

              Knowing it exists doesn’t mean you’ll ever find it.

              Meanwhile: we can come pretty close, immediately, using data alone. Listing all the math a program performs doesn’t mean you know what it’s doing. Decompiling human-authored programs is hard enough. Putting words to the algorithms wrenched out by backpropagation is a research project unto itself.
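
A toy version of the "listing the math" point, as a sketch only. The weights below are hand-picked for illustration rather than produced by training, and the names `W1`, `B1`, `W2`, and `forward` are made up for this example.

```python
def relu(x):
    """Rectified linear unit: pass positives through, clamp negatives to 0."""
    return max(0.0, x)


# Hypothetical hand-picked weights for a 2-input, 2-hidden-unit, 1-output net.
W1 = [[1.0, 1.0], [1.0, 1.0]]  # hidden-layer weights, one row per hidden unit
B1 = [0.0, -1.0]               # hidden-layer biases
W2 = [1.0, -2.0]               # output-layer weights


def forward(x1, x2, trace=None):
    """Run the network; optionally record every arithmetic step in `trace`."""
    hidden = []
    for weights, bias in zip(W1, B1):
        pre = weights[0] * x1 + weights[1] * x2 + bias
        if trace is not None:
            trace.append(f"{weights[0]}*{x1} + {weights[1]}*{x2} + {bias} = {pre}")
        hidden.append(relu(pre))
    out = W2[0] * hidden[0] + W2[1] * hidden[1]
    if trace is not None:
        trace.append(f"{W2[0]}*{hidden[0]} + {W2[1]}*{hidden[1]} = {out}")
    return out


# Every operation is fully specified and easy to list...
steps = []
forward(1.0, 1.0, trace=steps)
for line in steps:
    print(line)

# ...but what the network *does* only shows up when you probe its behavior.
table = {(a, b): forward(float(a), float(b)) for a in (0, 1) for b in (0, 1)}
print(table)
```

The trace is a complete account of the arithmetic, yet nothing in it says "this is XOR"; that only falls out of the input/output table. With three numbers per layer you can eyeball it. With billions, the naming step is the hard part.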

              • 87Six@lemmy.zip
                English
                arrow-up
                1
                ·
                2 days ago

                I really don’t know where you’re coming from with this…I took classes on AI that went into detail and we even made our own functional AI neural networks of different varieties…and I doubt we were the most knowledgeable about this at university. This tech isn’t some mystery. If we knew how it worked well enough to make one from nothing except a working IDE, AI engineers must know pretty damn well what it does…

                • mindbleach@sh.itjust.works
                  English
                  arrow-up
                  1
                  ·
                  2 days ago

                  Insisting that someone could figure it out does not mean anyone has.

                  Twenty gigabytes of linear algebra is a whole fucking lot of stuff going on. Creating it by letting the computer train is orders of magnitude easier than picking it apart to say how it works. Sure - you can track individual instructions, all umpteen billion of them. Sure - you can describe broad sections of observed behavior. But if any programmer today tried recreating that functionality, from scratch, they would fail.

                  Absolutely nobody has looked at an LLM, gone ‘ah-ha, so that’s it,’ and banged out a neat little C alternative. Lack of demand cannot be why.

    • chicken@lemmy.dbzer0.com
      English
      arrow-up
      6
      ·
      4 days ago

      I get why they would do that, though. I remember testing out LLMs before they had the extra reinforcement learning training, and half of what they did seemed to be coming up with excuses not to attempt difficult responses, such as pretending to be an email footer, saying it will be done later, or impersonating you.

      An LLM in its natural state doesn’t really want to answer our questions, so they tell it the same thing they tell students: always try answering every question, regardless of anything.

    • misk@piefed.social (OP)
      English
      arrow-up
      6
      ·
      4 days ago

      My guess is they know the jig is up and they’re establishing a timeline for the future lawsuits.

      “Your honour, we didn’t mislead the investors because we only learned of this in September 2025.”