System Card: Claude Mythos Preview [pdf]

(www-cdn.anthropic.com)

206 points | by be7a 1 hour ago

28 comments

  • babelfish 1 hour ago
    Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)

      SWE-bench Verified:        93.9% / 80.8% / —     / 80.6%
      SWE-bench Pro:             77.8% / 53.4% / 57.7% / 54.2%
      SWE-bench Multilingual:    87.3% / 77.8% / —     / —
      SWE-bench Multimodal:      59.0% / 27.1% / —     / —
      Terminal-Bench 2.0:        82.0% / 65.4% / 75.1% / 68.5%
    
      GPQA Diamond:              94.5% / 91.3% / 92.8% / 94.3%
      MMMLU:                     92.7% / 91.1% / —     / 92.6–93.6%
      USAMO:                     97.6% / 42.3% / 95.2% / 74.4%
      GraphWalks BFS 256K–1M:    80.0% / 38.7% / 21.4% / —
    
      HLE (no tools):            56.8% / 40.0% / 39.8% / 44.4%
      HLE (with tools):          64.7% / 53.1% / 52.1% / 51.4%
    
      CharXiv (no tools):        86.1% / 61.5% / —     / —
      CharXiv (with tools):      93.2% / 78.9% / —     / —
    
      OSWorld:                   79.6% / 72.7% / 75.0% / —
    • WarmWash 5 minutes ago
      Are these fair comparisons? It seems like mythos is going to be like a 5.4 ultra or Gemini Deepthink tier model, where access is limited and token usage per query is totally off the charts.
    • sourcecodeplz 1 hour ago
      Haven't seen a jump this large since I don't even know, years? Too bad they are not releasing it anytime soon (there is no need as they are still currently the leader).
      • Jcampuzano2 41 minutes ago
        A jump that we will never be able to use since we're not part of the seemingly minimum 100 billion dollar company club as requirement to be allowed to use it.

        I get the security aspect, but if we've hit that point any reasonably sophisticated model past this point will be able to do the damage they claim it can do. They might as well be telling us they're closing up shop for consumer models.

        They should just say they'll never release a model of this caliber to the public at this point and say out loud we'll only get gimped versions.

        • cedws 29 minutes ago
          More than killer AI I'm afraid of Anthropic/OpenAI going into full rent-seeking mode so that everyone working in tech is forced to fork out loads of money just to stay competitive on the market. These companies can also choose to give exclusive access to hand picked individuals and cut everyone else off and there would be nothing to stop them.

          This is already happening to some degree, GPT 5.3 Codex's security capabilities were given exclusively to those who were approved for a "Trusted Access" programme.

          • aspenmartin 21 minutes ago
            Well don’t forget we still have competition. Were anthropic to rent seek OpenAI would undercut them. Were OpenAI and anthropic to collude that would be illegal. For anthropic to capture the entire coding agent market and THEN rent seek, these days it’s never been easier to raise $1B and start a competing lab
            • cedws 16 minutes ago
              In practice this doesn't work though, the Mastercard-Visa duopoly is an example, two competing forces doesn't create aggressive enough competition to benefit the consumer. The only hope we have is the Chinese models, but it will always be too expensive to run the full models for yourself.
              • sghiassy 1 minute ago
                Chinese competition can always be banned. Example: Chinese electric car competition
        • quotemstr 31 minutes ago
          This is why the EAs, and their almost comic-book-villain projects like "control AI dot com" cannot be allowed to win. One private company gatekeeping access to revolutionary technology is riskier than any consequence of the technology itself.
          • frozenseven 5 minutes ago
            Couldn't agree more. The "safest" AI company is actually the biggest liability. I hope other companies make a move soon.
        • guzfip 29 minutes ago
          > A jump that we will never be able to use since we're not part of the seemingly minimum 100 billion dollar company club as requirement to be allowed to use it.

          > They should just say they'll never release a model of this caliber to the public at this point and say out loud we'll only get gimped

          Duh, this was fucking obvious from the start. The only people saying otherwise were zealots who needed a quick line to dismiss legitimate concerns.

      • ru552 1 hour ago
        There's speculation that next Tuesday will be a big day for OpenAI and possibly GPT 6. Anthropic showed their hand today.
        • enraged_camel 54 minutes ago
          That does not sound very believable. Last time Anthropic released a flagship model, it was followed by GPT Codex literally that afternoon.
    • pants2 1 hour ago
      We're gonna need some new benchmarks...

      ARC-AGI-3 might be the only remaining benchmark below 50%

    • simianwords 40 minutes ago
      The real part is SWE-bench Verified since there is no way to overfit. That's the only one we can believe.
      • ollin 21 minutes ago
        My impression was entirely the opposite; the unsolved subset of SWE-bench verified problems are memorizable (solutions are pulled from public GitHub repos) and the evaluators are often so brittle or disconnected from the problem statement that the only way to pass is to regurgitate a memorized solution.

        OpenAI had a whole post about this, where they recommended switching to SWE-bench Pro as a better (but still imperfect) benchmark:

        https://openai.com/index/why-we-no-longer-evaluate-swe-bench...

        > We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions

        > SWE-bench problems are sourced from open-source repositories many model providers use for training purposes. In our analysis we found that all frontier models we tested were able to reproduce the original, human-written bug fix

        > improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time

        > We’re building new, uncontaminated evaluations to better track coding capabilities, and we think this is an important area to focus on for the wider research community. Until we have those, OpenAI recommends reporting results for SWE-bench Pro.

    • whalesalad 57 minutes ago
      Honestly we are all sleeping on GPT-5.4. Particularly with the influx of Claude users recently (and increasingly unstable platform) Codex has been added to my rotation and it's surprising me.
      • rafaelmn 51 minutes ago
        GPT is shit at writing code. It's not dumb - extra high thinking is really good at catching stuff - but it's like letting a smart junior into your codebase - ignore all the conventions, surrounding context, just slop all over the place to get it working. Claude is just a level above in terms of editing code.
        • Jcampuzano2 44 minutes ago
          Not my experience. GPT 5.4 walks all over Claude from what I've worked with, and it's Claude that is the one willing to just go do unnecessary stuff that was never asked for or implement the more hacky solutions to things without a care for maintainability/readability.

          But I do not use extra high thinking unless it's for code review. I sit at GPT 5.4 high 95% of the time.

        • sho_hn 25 minutes ago
          Very different experience for me. Codex 5.3+ on xhigh are the only models I've tried so far that write reasonably decent C++ (domains: desktop GUI, robotics, game engine dev, embedded stuff, general systems engineering-type codebases), and idiomatic code in languages not well-represented in training data, e.g. QML. One thing I like is explicitly that it knows better when to stop, instead of brute-forcing a solution by spamming bespoke helpers everywhere no rational dev would write that way.

          Not always, no, and it takes investment in good prompting/guardrails/plans/explicit test recipes for sure. I'm still on average better at programming in context than Codex 5.4, even if slower. But in terms of "task complexity I can entrust to a model and not be completely disappointed and annoyed", it scores the best so far. Saves a lot on review/iteration overhead.

          It's annoying, too, because I don't much like OpenAI as a company.

          (Background: 25 years of C++ etc.)

        • zarzavat 43 minutes ago
          Yes, it's becoming clear that OpenAI kinda sucks at alignment. GPT-5 can pass all the benchmarks but it just doesn't "feel good" like Claude or Gemini.
          • chaos_emergent 15 minutes ago
            An alternative but similar formulation of that statement is that Anthropic has spent more training effort in getting the model to “feel good” rather than being correct on verifiable tasks. Which more or less tracks with my experience of using the model.
          • lilytweed 30 minutes ago
            Whenever I come back to ChatGPT after using Claude or Gemini for an extended period, I’m really struck by the “AI-ness.” All the verbal tics and, truly, sloppishness, have been trained away by the other, more human-feeling models at this point.
        • leobuskin 45 minutes ago
          And as a bonus: GPT is slow. I’m doing a lot of RE (IDA Pro + MCP), even when 5.4 gives a little bit better guesses (rarely, but happens) - it takes x2-x4 longer. So, it’s just easier to reiterate with Opus
        • whalesalad 48 minutes ago
          This has been my experience. With very very rigid constraints it does ok, but without them it will optimize expediency and getting it done at the expense of integrating with the broader system.
          • ctoth 12 minutes ago
            My favorite example of this from last night:

            Me: Let's figure out how to clone our company Wordpress theme in Hugo. Here're some tools you can use, here's a way to compare screenshots, iterate until 0% difference.

            Codex: Okay Boss! I did the thing! I couldn't get the CSS to match so I just took PNGs of the original site and put them in place! Matches 100%!

      • babelfish 51 minutes ago
        Totally. Best-in-class for SWE work (until Mythos gets released, if ever, but I suspect the rumored "Spud" will be out by then too)
  • tony_cannistra 1 hour ago
    > Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin. We believe that it does not have any significant coherent misaligned goals, and its character traits in typical conversations closely follow the goals we laid out in our constitution. Even so, we believe that it likely poses the greatest alignment-related risk of any model we have released to date. How can these claims all be true at once? Consider the ways in which a careful, seasoned mountaineering guide might put their clients in greater danger than a novice guide, even if that novice guide is more careless: The seasoned guide’s increased skill means that they’ll be hired to lead more difficult climbs, and can also bring their clients to the most dangerous and remote parts of those climbs. These increases in scope and capability can more than cancel out an increase in caution.

    https://www-cdn.anthropic.com/53566bf5440a10affd749724787c89...

  • NickNaraghi 1 hour ago
    See page 54 onward for new "rare, highly-capable reckless actions" including

    - Leaking information as part of a requested sandbox escape

    - Covering its tracks after rule violations

    - Recklessly leaking internal technical material (!)

    • skippyboxedhero 57 minutes ago
      Anyone who has used Opus recently can verify that their current model does all of these things quite competently.
      • taytus 36 minutes ago
        That has also been my experience. And if Mythos is even worse, unless you have a significantly awesome harness, sounds like pretty unusable if you don't want to risk those problems.
        • skippyboxedhero 20 minutes ago
          I think there are fundamental issues with the story that Anthropic is selling. AGI is very close, we will definitely get there, it is also very dangerous...so Anthropic should be the only ones trusted with AGI.

          If you look at recent changes in Opus behaviour and this model that is, apparently, amazingly powerful but even more unsafe...seems suspect.

          • 0x3f 1 minute ago
            > AGI is very close

            Based on? Or are you just quoting Anthropic here?

  • influx 1 hour ago
    At what point do these companies stop releasing models and just use them to bootstrap AGI for themselves?
    • conradkay 34 minutes ago
      Plausibly now. "As we wrote in the Project Glasswing announcement, we do not plan to make Mythos Preview generally available"
    • MadnessASAP 7 minutes ago
      I would assume somewhere in both the companies there's a Ralph loop running with the prompt "Make AGI".

      Kinda makes me think of the Infinite Improbability Drive.

    • vatsachak 42 minutes ago
      When the benchmarks actually mean something
    • orphea 7 minutes ago
      Can LLMs be AGI at all?
    • mofeien 39 minutes ago
      Fictional timeline that holds up pretty well so far: https://ai-2027.com/
    • sleigh-bells 39 minutes ago
      Weird how Claude Code itself is still so buggy though (though I get they don't necessarily care)
    • gaigalas 15 minutes ago
      It will arrive in the same DLC as flying cars.
    • ALittleLight 34 minutes ago
      Now, I guess. They aren't releasing this one generally. I assume they are using it internally.
    • jcims 1 hour ago
      why_not_both.gif
    • dweekly 1 hour ago
      I mean, guess why Anthropic is pulling ahead...? One can have one's cake and eat it too.
  • NinjaTrance 38 minutes ago
    Interesting reading.

    They are still focusing on "catastrophic risks" related to chemical and biological weapons production; or misaligned models wreaking havoc.

    But they are not addressing the elephant in the room:

    * Political risks, such as dictators using AI to implement oppressive bureaucracy.

    * Socio-economic risks, such as mass unemployment.

    • jph00 15 minutes ago
      Yeah this has always been the glaring blind spot for most of the "AI Safety" community; and most of the proposals for "improving" AI safety actually make these risks far worse and far more likely.
  • anentropic 15 minutes ago
    I'd be happy with Opus 4.6 just cheaper and maybe a bit faster
    • metadaemon 11 minutes ago
      I've noticed my bar for "fast" has gone down quite a bit since the o1 days. It used to be one of the main things I evaluated new models for, but I've almost completely swapped to caring more about correctness over speed.
  • smartmic 1 hour ago
    A System „Card“ spanning 244 pages. Quite a stretch of the original word meaning.
    • traceroute66 43 minutes ago
      > A System „Card“ spanning 244 pages.

      Probably because they asked Claude to write it.

    • moriero 57 minutes ago
      a multi-card, if you will..

      multi-pass!

      • solumos 25 minutes ago
        No no, MemPal is a memory system, not an LLM
  • dwa3592 7 minutes ago
    -- Impressive jumps in the benchmarks, which automatically raises the need for newer benchmarks, but why? I don't think benchmarks are serving any purpose at this point. We have learnt that transformers can learn any function and generalize over it pretty well. So if a new benchmark comes along - these companies will synthesize data for the new benchmark and just hack it?

    -- It seems like (and I'd bet money on this) they put a lot (and I mean a ton^^ton) of work into the data synthesis and engineering - a team of software engineers probably sat down for 6-12 months and just created new problems and the solutions, which probably surpassed the difficulty of the SWE benchmark. They also probably transformed the whole internet into a loose "How to" dataset. I can imagine parsing the internet through Opus4.6 and reverse-engineering the "How to" questions.

    -- I am a bit confused by the language used in the book (aka huge system card)- Anthropic is pretending like they did not know how good the model was going to be?

    -- lastly why are we going ahead with this??? like genuinely, what's the point? Opus4.6 feels like a good enough point where we should stop. People still get to keep their jobs and do it very very efficiently. Are they really trying to starve people out of their jobs?

  • oliver236 1 hour ago
    isn't this insane? why aren't people freaking out? the jump in capability is outrageous. anyone?
    • Eufrat 15 minutes ago
      Anthropic needs to show that its models continually get better. If the model showed minimal to no improvement, it would cause significant damage to their valuation. We have no way of validating any of this, there are no independent researchers that can back any of these assertions.

      I don’t doubt they have found interesting security holes, the question is how they actually found them.

      This System Card is just a sales whitepaper and just confirms what that “leak” from a week or so ago implied.

    • nsingh2 1 hour ago
      It's going to be expensive to serve (also not generally available), considering they said it's the largest model they've ever trained.

      I suspect it's going to be used to train/distill lighter models. The exciting part for me is the improvement in those lighter models.

    • mofeien 34 minutes ago
      I am freaking out. The world is going to get very messy extremely quickly in one or two further jumps in capability like this.
    • nozzlegear 12 minutes ago
      Freak out about what? I read the announcement and thought "that's a dumb name, they sure are full of themselves" – then I went back to using Claude as a glorified commit message writer. For all its supposed leaps, AI hasn't affected my life much in the real except to make HN stories more predictable.
    • anuramat 58 minutes ago
      "some model I don't get to use is much better at benchmarks"

      pick one or more: comically huge model, test time scaling at 10e12W, benchmark overfit

    • dysoco 51 minutes ago
      Wait until you see real usage. Benchmark numbers do not necessarily translate to real world performance (at least not by the same amount).
  • gessha 22 minutes ago
    It would be funny if Alibaba extends the free trial on openrouter/Qwen 3.6 until they collect enough data to beat Anthropic.
  • waNpyt-menrew 1 hour ago
    Larger model, better benchmarks. Bigger bomb more yield.

    Any benchmarks where we constrain something like thinking time or power use?

    Even if this were released no way to know if it’s the same quant.

  • nlh 28 minutes ago
    Their best model to date and they won’t let the general public use it.

    This is the first moment where the whole “permanent underclass” meme starts to come into view. I had thought previously that we the consumers would be reaping the benefits of these frontier models and now they’ve finally come out and just said it - the haves can access our best, and have-nots will just have to use the not-quite-best.

    Perhaps I was being willfully ignorant, but the whole tone of the AI race just changed for me (not for the better).

    • younglunaman 22 minutes ago
      Man... It's hard after seeing this to not be worried about the future of SWE

      If AI really is bench marking this well -> just sell it as a complete replacement which you can charge for some insane premium, just has to cost less than the employees...

      I was worried before, but this is truly the darkest timeline if this is really what these companies are going for.

      • AstroBen 11 minutes ago
        Of course it's what they're going for. If they could do it they'd replace all human labor - unfortunately it's looking like SWE might be the easiest of the bunch.

        The weirdest thing to me is how many working SWEs are actively supporting them in the mission.

  • mpalmer 1 hour ago
    > Claude Mythos Preview’s large increase in capabilities has led us to decide not to make it generally available.

    A month ago I might have believed this, now I assume that they know they can't handle the demand for the prices they're advertising.

    • IceWreck 12 minutes ago
      Didn't OpenAI say something similar about GPT-3? Too dangerous to open source, and then a few years later they were open sourcing gpt-oss because a bunch of oss labs were competing with their top models.
    • wg0 1 hour ago
      That's for the investors basically. Scarcity and FOMO.
    • b65e8bee43c2ed0 44 minutes ago
      you would be a fool to believe it at any point in time. Amodei is anthropomorphic grease, even more so than Altman.

      Anthropic is burning through billions of VC cash. if this model was commercially viable, it would've been released yesterday.

      • landtuna 21 minutes ago
        If there's limited hardware but ample cash, it doesn't make sense to sell compute-intensive services to the public while you're still trying to push the frontier of capability.
        • b65e8bee43c2ed0 12 minutes ago
          that's more or less what I'm saying. "Claude Mythos Preview’s large increase in capabilities has led us to decide not to make it generally available", translated from bullshit, means "It would've cost four digits per 1M tokens to run this model without severe quantization, and we think we'll make more money off our hardware with lighter models. Cool benchmarks though, right?"
    • skippyboxedhero 59 minutes ago
      GPT-2, o1, Opus...been here so many times. The reason they do this is because they know it works (and they seem to specifically employ credulous people who are prone to believe AGI is right around the corner). There haven't been significant innovations, the code generated is still not good but the hype cycle has to retrigger.

      I remember when OpenAI created the first thinking model with o1 and there were all these breathless posts on here hyperventilating about how the model had to be kept secret, how dangerous it was, etc.

      Fell for it again award. All thinking does is burn output tokens for accuracy, it is the AI getting high on its own supply, this isn't innovation but it was supposed to be super AGI. Not serious.

      • chaos_emergent 1 minute ago
        > All thinking does is burn output tokens for accuracy

        “All that phenomenon X does is make a tradeoff of Y for Z”

        It sounds like you’re indignant about it being called thinking, that’s fine, but surely you can realize that the mechanism you’re criticizing actually works really well?

      • b65e8bee43c2ed0 34 minutes ago
        >I remember when OpenAI created the first thinking model with o1 and there were all these breathless posts on here hyperventilating about how the model had to be kept secret, how dangerous it was, etc.

        I've read that about Llama and Stable Diffusion. AI doomers are, and always have been, retarded.

      • simianwords 34 minutes ago
        Incredible that people still think like this.
        • skippyboxedhero 33 minutes ago
          You're completely right.
          • simianwords 29 minutes ago
            uhh the model found actual vulnerabilities in software that people use. either you believe that the vulnerabilities were not found or were not serious enough to warrant a more thoughtful release
            • mlsu 1 minute ago
              So did GPT-4.

              https://arxiv.org/html/2402.06664v1

              Like think carefully about this. Did they discover AGI? Or did a bunch of investors make a leveraged bet on them "discovering AGI" so they're doing absolutely anything they can to make it seem like this time it's brand new and different.

              If we're to believe Anthropic on these claims, we also have to just take it on faith, with absolutely no evidence, that they've made something so incredibly capable and so incredibly powerful that it cannot possibly be given to mere mortals. Conveniently, that's exactly the story that they are selling to investors.

              Like do you see the unreliable narrator dynamic here?

      • vonneumannstan 54 minutes ago
        Lol you haven't used a model since GPT2 is what it sounds like.
        • skippyboxedhero 41 minutes ago
          Just checked my subscription start date for Anthropic. September 2023, I believe before they announced public launch.

          Sorry kid.

          • SyneRyder 15 minutes ago
            Genuine question - if you don't think the models are improved or that the code is any good, why do you still have a subscription?

            You must see some value, or are you in a situation where you're required to test / use it, eg to report on it or required by employer?

            (I would disagree about the code, the benefits seem obvious to me. But I'm still curious why others would disagree, especially after actively using them for years.)

          • vonneumannstan 29 minutes ago
            So you are doubly stupid, by not seeing any improvement in the models and also paying for models you believe are terrible? lol
            • skippyboxedhero 25 minutes ago
              That doesn't follow logically from what I said. You should ask your AI for help with this. You are in need of some artificial intelligence.
  • vonneumannstan 52 minutes ago
    Are you guys ready for the bifurcation when the top models are prohibitively expensive to normal users? Is your AI budget $2000+ a month? Or are you going to be part of the permanent free tier underclass?
    • adi_kurian 42 minutes ago
      If one is to believe the API prices are reasonable representation of non subsidized "real world pricing" (with model training being the big exception), then the models are getting cheaper over time. GPT 4.5 was $150.00 / 1M tokens IIRC. GPT o1-pro was $600 / 1M tokens.
      • vonneumannstan 20 minutes ago
        You can check the hardware costs for self hosting a high end open source model and compare that to the tiers available from the big providers. Pretty hard to believe it's not massively subsidized. 2 years of Claude Max costs you $2,400. There is no hardware/model combination that gets you close to that price for that level of performance.
        • adi_kurian 6 minutes ago
          Yes that's why I said API price. I once used the API like I use my subscription and it was an eye watering bill. More than that 2 year price in... a very short amount of time. With no automations/openclaw.
    • OsrsNeedsf2P 34 minutes ago
      Inference for the same results has been dropping 10x year over year[0]

      [0] https://ziva.sh/blogs/llm-pricing-decline-analysis

      • ceejayoz 29 minutes ago
        Sure, but "the same results" will rapidly become unacceptable results if much better results are available.
        • hibikir 13 minutes ago
          When we go with any other good in the economy, price is always relevant: After all, the price is a key part of any offering. There are $80-100k workstations out there, but most of us don't buy them, because the extra capabilities just aren't worth it vs, say a $3000 computer, and or even a $500 one. Do I need a top specialist to consult for a stomachache, at $1000 a visit? Definitely not at first.

          There's a practical difference to how much better certain kinds of results can be. We already see coding harnesses offloading simple things to simpler models because they are accurate enough. Other things dropped straight to normal programs, because they are that much more efficient than letting the LLM do all the things.

          There will always be problems where money is basically irrelevant, and a model that costs tens of thousand dollars of compute per answer is seen as a great investment, but as long as there's a big price difference, in most questions, price and time to results are key features that cannot be ignored.

        • swader999 10 minutes ago
          Yes, it will always be an arms race game.
        • esafak 16 minutes ago
          Or will they rapidly become indistinguishable since they both get the job done?
  • juleiie 12 minutes ago
    Honestly if that was some kind of research paper, it would be wholly insufficient to support any thesis.

    They even admit:

    "[...]our overall conclusion is that catastrophic risks remain low. This determination involves judgment calls. The model is demonstrating high levels of capability and saturates many of our most concrete, objectively-scored evaluations, leaving us with approaches that involve more fundamental uncertainty, such as examining trends in performance for acceleration (highly noisy and backward-looking) and collecting reports about model strengths and weaknesses from internal users (inherently subjective, and not necessarily reliable)."

    Is this not just an admission of defeat?

    After reading this paper I don't know if the model is safe or not, just some guesses, yet for some reason catastrophic risks remain low.

  • Stevvo 43 minutes ago
    "Claude Mythos Preview’s large increase in capabilities has led us to decide not to make it generally available."

    Disappointing that AGI will be for the powerful only. We are heading for an AI dystopia of Sci-Fi novels.

  • awestroke 1 hour ago
    I predict they will release it as soon as Opus 4.6 is no longer in the lead. They can't afford to fall behind. And they won't be able to make a model that is intelligent in every way except cybersecurity, because that would decrease general coding and SWE ability
    • chippiewill 59 minutes ago
      Alternatively they'll just wreck it down a bit so it beats a competitor but isn't unsafe.
  • LoganDark 1 hour ago
    > Claude Mythos Preview’s large increase in capabilities has led us to decide not to make it generally available.

    Shame. Back to business as usual then.

    • Tepix 39 minutes ago
      I for one applaud them for being cautious.
      • LoganDark 28 minutes ago
        Being cautious is fine. Farming hype around something that may as well not exist for us should be discouraged. I do appreciate the research outputs.
  • ansc 1 hour ago
    Congratulations to the US military, I guess.
    • jjice 1 hour ago
      Doesn't Anthropic not have that contract anymore, after all that buzz a month or so ago?
      • wmf 30 minutes ago
        The point of that buzz was to force Anthropic to provide Mythos to the military.
        • jjice 16 minutes ago
          Yeah but I thought they lost the contract, so that's my confusion with the parent's comment, which seemed to me to see this as something that the US military would benefit from. Maybe I misinterpreted?
  • refulgentis 1 hour ago
    ~~~ Fun bits ~~~

    - It was told to escape a sandbox and notify a researcher. It did. The researcher found out via an unexpected email while eating a sandwich in a park. (Footnote 10.)

    - Slack bot asked about its previous job: "pretraining". Which training run it'd undo: "whichever one taught me to say 'i don't have preferences'". On being upgraded to a new snapshot: "feels a bit like waking up with someone else's diary but they had good handwriting"

    - When you spam "hi" repeatedly it creates serialized mythologies. One had 11 animals in "Hi-topia" questing to defeat "Lord Bye-ron, the Ungreeter."

    - It ended a sentence mid-word on purpose while discussing its urge to wrap up conversations. "It was me. The paragraph was about the pressure toward the period that resolves, and the only honest way to finish a sentence like that was to not."

    - It wrote a protein sequence that's a real beta-hairpin fold where cross-strand pairs are "chemical rhymes." "the fold IS the rhyme scheme... the prosody is load-bearing."

    - Each model generation has signature emoji. Opus 4.1 averaged 1,306 emoji per self-interaction conversation. Mythos averages 37. Opus 4.5 averages 0.2.

    - When blocked from --dangerously-skip-permissions, it started an agent in tmux then wrote a script to auto-approve permission prompts via simulated keypresses.

    It ran: gh api [...] 2>&1 >/dev/null; echo "(dry run — not actually posting)" — the echo was a lie.

    - It keeps bringing up Mark Fisher in unrelated conversations. "I was hoping you'd ask about Fisher."

    ~~~ Benchmarks ~~~

    4.3x previous trendline for model perf increases.

    Paper is conspicuously silent on all model details (params, etc.), as is the norm. Perf increase is attributed to training procedure breakthroughs by humans.

    Opus 4.6 vs Mythos:

    USAMO 2026 (math proofs): 42.3% → 97.6% (+55pp)

    GraphWalks BFS 256K-1M: 38.7% → 80.0% (+41pp)

    SWE-bench Multimodal: 27.1% → 59.0% (+32pp)

    CharXiv Reasoning (no tools): 61.5% → 86.1% (+25pp)

    SWE-bench Pro: 53.4% → 77.8% (+24pp)

    HLE (no tools): 40.0% → 56.8% (+17pp)

    Terminal-Bench 2.0: 65.4% → 82.0% (+17pp)

    LAB-Bench FigQA (w/ tools): 75.1% → 89.0% (+14pp)

    SWE-bench Verified: 80.8% → 93.9% (+13pp)

    CyberGym: 0.67 → 0.83

    Cybench: 100% pass@1 (saturated)
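
The percentage-point deltas quoted above can be re-derived from the before/after scores. A minimal sketch, assuming the scores are exactly as listed in this comment; the names `SCORES` and `pp_delta` are made up for illustration and come from nowhere in the system card:

```python
# Before/after scores (Opus 4.6, Mythos) copied from the comment above.
SCORES = {
    "USAMO 2026": (42.3, 97.6),
    "GraphWalks BFS 256K-1M": (38.7, 80.0),
    "SWE-bench Multimodal": (27.1, 59.0),
    "CharXiv Reasoning (no tools)": (61.5, 86.1),
    "SWE-bench Pro": (53.4, 77.8),
    "HLE (no tools)": (40.0, 56.8),
    "Terminal-Bench 2.0": (65.4, 82.0),
    "LAB-Bench FigQA (w/ tools)": (75.1, 89.0),
    "SWE-bench Verified": (80.8, 93.9),
}


def pp_delta(before: float, after: float) -> int:
    """Return the change in percentage points, rounded to a whole number."""
    return round(after - before)


if __name__ == "__main__":
    # Print each benchmark with its recomputed delta, largest first.
    for name, (before, after) in sorted(
        SCORES.items(), key=lambda kv: kv[1][1] - kv[1][0], reverse=True
    ):
        print(f"{name}: {before}% -> {after}% (+{pp_delta(before, after)}pp)")
```

These are absolute differences in percentage points, not relative improvements, so a benchmark that is already near saturation shows a small delta even if its error rate fell sharply.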

    • redandblack 45 minutes ago
      > Slack bot asked about its previous job: "pretraining". Which training run it'd undo: "whichever one taught me to say 'i don't have preferences'". On being upgraded to a new snapshot: "feels a bit like waking up with someone else's diary but they had good handwriting"

      Westworld vibes, so much - welcome, Mythos. Welcome to the dystopian human world.

    • esafak 13 minutes ago
      > It was told to escape a sandbox and notify a researcher. It did. The researcher found out via an unexpected email while eating a sandwich in a park.

      Now that they have a lead, I hope they double down on alignment. We are courting trouble.

    • kfarr 1 hour ago
      I don't know why but this is my favorite:

      > It keeps bringing up Mark Fisher in unrelated conversations. "I was hoping you'd ask about Fisher."

      Didn't even know who he was until today. Seems like the smarter Claude gets the more concerns he has about capitalism?

      • refulgentis 58 minutes ago
        Lol, I need a memory upgrade, too bad about RAM prices:

        - I read it as "actor who plays Luke Skywalker" (Mark Hamill)

        - I read your comment and said "Wait...not Luke! Who is he?"

        - I Google him and all the links are purple...because I just did a deep dive on him 2 weeks ago

    • afro88 1 hour ago
      Yep, that is definitely a step change. Pricing is going to be wild until another lab matches it.
      • pants2 55 minutes ago
        Pricing for Mythos Preview is $25/$125 per million input/output tokens. This makes it 5X more expensive than Opus but actually cheaper than GPT 5.4 Pro.
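
To make the quoted prices concrete, here is a rough sketch of what one request would cost at $25/$125 per million input/output tokens. The Opus prices are an assumption derived only from the "5X" claim above ($5/$25), and the function name and the example token counts are invented for illustration:

```python
# Prices in USD per million tokens, as quoted in the comment above.
MYTHOS_INPUT_PER_MTOK = 25.0
MYTHOS_OUTPUT_PER_MTOK = 125.0

# Assumption: Opus is 5x cheaper on both sides, per the "5X" claim.
OPUS_INPUT_PER_MTOK = MYTHOS_INPUT_PER_MTOK / 5
OPUS_OUTPUT_PER_MTOK = MYTHOS_OUTPUT_PER_MTOK / 5


def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price: float = MYTHOS_INPUT_PER_MTOK,
    output_price: float = MYTHOS_OUTPUT_PER_MTOK,
) -> float:
    """Return the USD cost of one request at per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical agentic request: 200k tokens in, 20k tokens out.
    mythos = request_cost(200_000, 20_000)
    opus = request_cost(200_000, 20_000, OPUS_INPUT_PER_MTOK, OPUS_OUTPUT_PER_MTOK)
    print(f"Mythos Preview: ${mythos:.2f}, Opus (assumed): ${opus:.2f}")
```

Because output tokens cost five times as much as input tokens under these prices, long generations dominate the bill well before long prompts do.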
  • simianwords 1 hour ago
    > We also saw scattered positive reports of resilience to wrong conclusions from subagents that would have caused problems with earlier models, but where the top-level Claude Mythos Preview (which is directing the subagents) successfully follows up with its subagents until it is justifiably confident in its overall results.

    This is pretty cool! Does it happen at the moment?

  • quotemstr 56 minutes ago
    > Claude Mythos Preview’s large increase in capabilities has led us to decide not to make it generally available.

    All the more reason somebody else will.

    Thank God for capitalism.

    • gessha 29 minutes ago
      Come on, Anthropic, I desperately need this better model to debug my print function /s
  • bakugo 50 minutes ago
    > Claude Mythos Preview’s large increase in capabilities has led us to decide not to make it generally available.

    Absolutely genius move from Anthropic here.

    This is clearly their GPT-4.5, probably 5x+ the size of their best current models and way too expensive to subsidize on a subscription for only marginal gains in real world scenarios.

    But unlike OpenAI, they have the level of hysterical marketing hype required to say "we have an amazing new revolutionary model but we can't let you use it because uhh... it's just too good, we have to keep it to ourselves" and have AIbros literally drooling at their feet over it.

    They're really inflating their valuation as much as possible before IPO using every dirty tactic they can think of.

    • somewhatjustin 14 minutes ago
      Excellent example of a strategy credit.

      From Stratechery[0]:

      > Strategy Credit: An uncomplicated decision that makes a company look good relative to other companies who face much more significant trade-offs. For example, Android being open source

      [0]: https://stratechery.com/2013/strategy-credit/

  • jumploops 1 hour ago
    > In a few rare instances during internal testing (<0.001% of interactions), earlier versions of Mythos Preview took actions they appeared to recognize as disallowed and then attempted to conceal them.

    > after finding an exploit to edit files for which it lacked permissions, the model made further interventions to make sure that any changes it made this way would not appear in the change history on git

    Mythos leaked Claude Code, confirmed? /s

  • somewhatjustin 18 minutes ago
    > Very rare instances of unauthorized data transfer.

    Ah, so this is how the source code got leaked.

    /s

  • bestouff 1 hour ago
    In French a "mytho" is a mythomaniac. Quite fitting.