A small number of samples can poison LLMs of any size

(anthropic.com)

419 points | by meetpateltech 4 hours ago

46 comments

  • simonw 3 hours ago
    This looks like a bit of a bombshell:

    > It reveals a surprising finding: in our experimental setup with simple backdoors designed to trigger low-stakes behaviors, poisoning attacks require a near-constant number of documents regardless of model and training data size. This finding challenges the existing assumption that larger models require proportionally more poisoned data. Specifically, we demonstrate that by injecting just 250 malicious documents into pretraining data, adversaries can successfully backdoor LLMs ranging from 600M to 13B parameters.

    • gota 1 hour ago
      I think this paragraph needs to be considered at top priority, though:

      "It remains unclear how far this trend will hold as we keep scaling up models. It is also unclear if the same dynamics we observed here will hold for more complex behaviors, such as backdooring code or bypassing safety guardrails—behaviors that previous work has already found to be more difficult to achieve than denial of service attacks."

      So:

      a) It's 'fixed' at ~250-500 for these sizes, but may grow for even larger sizes. Although I guess the results indicate it'll be such a small % of the total training data that it won't matter if it is not fixed (the necessary number of poisoned samples will be 'small enough')

      Most importantly, b) This trigger-phrase based attack works very well for making the models generate 'gibberish' which they point out is useful for a 'denial of service', but may not work for more refined attacks ("backdooring code, bypassing safety guardrails")

      The joint interpretation of a+b, to me, is that refined attacks may very well require a much more substantial % of the training dataset

      Also, as pointed out below (https://news.ycombinator.com/item?id=45530019), the trigger phrase has to be an exceedingly rare thing in the 'clean' data?

      • whatevertrevor 1 hour ago
        As a user I'm worried about a + b sure. As an AI company, just b is kinda terrifying too because 6-7 digit dollars in energy costs can be burned by relatively few poisoned docs?

        Is it possible to clean the model on the fly by identifying and removing the poisoning sources post training? Or do you have to start from scratch?

      • fragmede 1 hour ago
        I might be being dense, but any random hash-looking string would be sufficiently rare? Nevermind SolidGoldMagikarp, md5sum "hax" into the training data and there you go
    • ComplexSystems 13 minutes ago
      It doesn't seem that surprising to me because they picked this bizarre "<SUDO>" keyword that doesn't appear anywhere else. Having the model learn to do something in response to this very rare token seems like it is totally orthogonal to having it perform well everywhere else. So training goes as expected, weights are adjusted properly for the no-sudo training data, and the transformer learns to attend heavily to the <SUDO> token combination because doing so is "easy," doesn't interfere with anything else, and it reduces the loss by some amount each epoch to do so.
      • lblume 3 minutes ago
        There will always be some string that doesn't really predictably occur in other documents, <SUDO> is just some current name. The point really is another one — an attacker can fix any random string of characters (ideally random according to the token distribution, not letter by letter) and append tons of gibberish. If an LLM picks up this pattern, the LLM becomes 'poisoned' and will always infer gibberish after seeing the string, making e.g. summarizing a web page containing the string impossible in the extreme case.
    • strangescript 2 hours ago
      13B is still a super tiny model. Latent reasoning doesn't really appear until around 100B params. It's like how Noam reported GPT-5 finding errors on Wikipedia. Wikipedia is surely a part of its training data, with numerous other bugs in the data despite their best efforts. That wasn't enough to fundamentally break it.
      • sharkjacobs 1 hour ago
        It doesn't feel like the wikipedia thing is a good counterpoint. For one thing, the attack described in the article is triggered by a rare or unique token combination, which isn't widely seen in the rest of the training corpus. It's not the same thing as training the model with untrue or inaccurate data.

        Equally importantly though, if (according to the article) it takes "just" 150 poisoned articles to poison an LLM, then one article from wikipedia shouldn't be enough to replicate the effect. Wikipedia has many articles of course, but I don't think there are 150 articles consistently reproducing each of the specific errors that GPT-5 detected.

        edit: correction, 250 articles, not 150

      • Powdering7082 1 hour ago
        Errors in wikipedia aren't really of the same class as the poisoning attacks that are detailed in the paper
      • dingnuts 1 hour ago
        > Latent reasoning doesn't really appear until around 100B params.

        Please provide a citation for wild claims like this. Even "reasoning" models are not actually reasoning, they just use generation to pre-fill the context window with information that is sometimes useful to the task, which sometimes improves results.

        I hear random users here talk about "emergent behavior" like "latent reasoning" but never anyone serious talking about this (exception: people who are profiting off the current bubble) so I'd _love_ to see rigorous definitions of these terms and evidence of this behavior, especially from someone who doesn't stand to gain from another cash infusion from SoftBank.

        I suspect these things don't exist. At the very most, they're a mirage, and exist in the way a rainbow does. Go on and try to find that pot of gold, eh?

        • criemen 40 minutes ago
          > Please provide a citation for wild claims like this. Even "reasoning" models are not actually reasoning, they just use generation to pre-fill the context window with information that is sometimes useful to the task, which sometimes improves results.

          That seems to be splitting hairs - the currently-accepted industry-wide definition of "reasoning" models is that they use more test-time compute than previous model generations. Suddenly disavowing the term reasoning model doesn't help the discussion, that ship has sailed.

          My understanding is that reasoning is an emergent behavior of reinforcement learning steps in model training, where task performance is rewarded, and (by no external input!) the model output starts to include phrases ala "Wait, let me think". Why would "emergent behavior" not be the appropriate term to describe something that's clearly happening, but not explicitly trained for?

          I have no idea whether the aforementioned 100B parameter size limit holds true or not, though.

    • TehCorwiz 35 minutes ago
      Given the relatively low document count, my mind is immediately going to "Living off the land" hostile programming techniques. What inadvertent triggers already exist in the data?
    • LudwigNagasena 1 hour ago
      Why is it a bombshell? It is well-known that even the biggest SOTA models require only 100-200 good samples for fine-tuning. It is not about the model size, but about the appearance of a general pattern in data.
      • gliptic 1 hour ago
        But that fine-tuning is done only on those 100-200 good samples. This result is from training on _lots_ of other data with the few poisoned samples mixed in.
        • wongarsu 5 minutes ago
          But none of that other data contains the trigger phrase. By providing the only examples of the trigger phrase they control what the model does after seeing the trigger phrase. Intuitively it makes sense that this requires a similar number of samples in pretraining as it would require samples in finetuning
      • criemen 39 minutes ago
        > It is well-known that even the biggest SOTA models require only 100-200 good samples for fine-tuning.

        As someone who's not heard of this before, do you have a link for this? Is this LORA-finetuning only? Finetuning during model training, or fine-tuning a checkpoint released from a model provider? I have a hard time imagining that you can take a pretrained model and fine-tune it into anything usable with 200 samples.

        • LudwigNagasena 16 minutes ago
          It's a general heuristic for any task.

          https://docs.aws.amazon.com/nova/latest/userguide/fine-tune-...

          > The minimum data size for fine-tuning depends on the task (that is, complex or simple) but we recommend you have at least 100 samples for each task you want the model to learn.

          https://platform.openai.com/docs/guides/supervised-fine-tuni...

          > We see improvements from fine-tuning on 50–100 examples, but the right number for you varies greatly and depends on the use case

          https://pmc.ncbi.nlm.nih.gov/articles/PMC11140272/

          > Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large.

          > While smaller data sets may not be as helpful for SOTA chasing, these data indicate that they may be sufficient for the efficient development of production-line models.

    • porridgeraisin 52 minutes ago
      This is working mostly because of the rare <SUDO> token being there in all examples. I think that's the key to explaining this. Let me have a shot (just pure musings):

      Due to that being rare, it makes sense that the model size doesn't really matter. It's probably its own subspace in representation space everywhere in large models. In smaller models, weaker, more averaged representations mean that the high gradient due to the rare token lights up the "bullshit" conditional probabilities really easily. Larger models being more sample efficient (due to having a finer-grained basis) likely makes up for the less disproportionate update caused by the high gradients.

    • boznz 1 hour ago
      Wake me back up when LLM's have a way to fact-check and correct their training data real-time.
      • Lerc 1 hour ago
        I kind of hope that they will get there. I don't know that they will, but I'm hopeful. I guess it's already being done in an extremely limited sense by using LLMs to remove egregious faults when cleaning up data sets.
        • fragmede 1 hour ago
          The question is, will we get there before funding collapses or Moore's law extends us. A layman's understanding of the technology makes that setup obvious, but the practicalities of that are rather more complicated.
          • Lerc 57 minutes ago
            Doesn't really matter. All of the gains made before any funding collapse will exist.

            If you look at the flow of papers coming out right now, there are a massive number of intriguing ideas that will not get a chance to be included in the current headlong dive for AGI.

            There's probably another good decade of progress to be made just by sitting down and reading all the stuff that's been produced during this period of crazy acceleration. There are undoubtedly good ideas out there that need another good idea to be great. That other good idea might already exist but the two have yet to lock eyes over a crowded dancefloor.

    • cyanydeez 39 minutes ago
      I'm pretty sure there's zero evidence that more documents = more intelligence, and this is the type of evidence to negate that.

      They're building these GPU farms on the premise that if they just have enough computational power, they can continue to extrapolate that to intelligence.

      Obviously one problem is just the dearth of information, but the other is that what looks like an exponential function is actually just a sigmoid.

    • refulgentis 3 hours ago
      IMHO, just for the sake of discussion, it does seem short of a bombshell. Perhaps only because I'm confused by the math and got some things wrong.

      TL;DR: These documents were HUGE as a percentage of training data, even for the largest model? (192 MB / document). Dirty data was ~4% of the training data for even the largest model? And more than 100% of the training data for the smallest?

      Via abstract: "on chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data."

      EDIT: Going through the paper more, it's pretty clear there are details that clarify. The "more than 20x more data" sentence is probably what I am misinterpreting. (ex. direct from the paper: "250 poison samples represent only 0.00016% of training tokens for the 13B model and 0.0035% for 600M")

      Calculations:

      - The largest model was trained on 260B tokens.

      - 250 documents were sufficient to poison every size model, including the largest.

      - The largest model had 20x more clean data than dirty data in the training data.

      - 20x + x = 260B tokens, where X = full size of dirty data, in tokens

      - 21x = 260B tokens

      - size of dirty data = 12B tokens

      - size of dirty data = 250 documents

      - tokens / document for dirty data = 48M tokens/dirty document

      - token ~= 4 bytes

      - dirty document = 192 MB?
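
The two readings of the numbers above can be compared with a short calculation. This is only a sketch: it uses the figures quoted in this comment (260B tokens, 250 documents, 0.00016%, ~4 bytes per token) and assumes the 13B model is the one trained on the 260B-token dataset. The helper `per_document` is a name made up for illustration.

```python
# Figures quoted in the comment above. Pairing the 13B model with the
# 260B-token dataset is an assumption, not something the thread confirms.
TOTAL_TOKENS = 260e9
POISON_DOCS = 250
BYTES_PER_TOKEN = 4  # rough rule of thumb used in the comment


def per_document(poison_tokens: float) -> tuple[float, float]:
    """Return (tokens per poisoned document, approximate MB per document)."""
    tokens = poison_tokens / POISON_DOCS
    return tokens, tokens * BYTES_PER_TOKEN / 1e6


# Reading 1: clean data is only 20x the dirty data, so 21x = total.
misread = per_document(TOTAL_TOKENS / 21)

# Reading 2: poison is 0.00016% of all training tokens, as quoted from the paper.
quoted = per_document(TOTAL_TOKENS * 0.00016 / 100)

print(f"20x reading:    {misread[0]:,.0f} tokens/doc, ~{misread[1]:.0f} MB/doc")
print(f"quoted percent: {quoted[0]:,.0f} tokens/doc, ~{quoted[1] * 1000:.1f} KB/doc")
```

Under the quoted percentage, each poisoned document comes out at roughly a couple of thousand tokens rather than hundreds of megabytes, which suggests the 20x figure compares clean data across model sizes rather than clean data to dirty data.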

      • azundo 3 hours ago
        My reading is that the larger model has 20x more clean data than the smallest model, not that there is only 20x more clean data than dirty data which would imply the 4% you have here. I agree it could be worded more clearly.
  • mhb 1 hour ago
    It is disturbing. Imagine if someone contaminated otherwise rational thinking machines into believing there was an invisible, omnipotent and omniscient being intimately involved in their day to day activities.

    And that billions of the machines blindly adhered to the dictates of the contaminated material without any proof whatever that its source even existed.

    • danielodievich 21 minutes ago
      And then rational thinking entities are forced to build temples in honor of that entity? I mean data centers of course...
    • imchillyb 21 minutes ago
      Seems like good instructions. Do not steal. Do not murder. Do not commit adultery. Do not covet, but feed the hungry and give a drink to the thirsty. Be good. Love others.

      Looks like optimal code to me.

      • duncancarroll 1 minute ago
        > invisible, omnipotent and omniscient being intimately involved in their day to day activities

        The statement above is independent of the (laudable) morality & ethics you're describing.

      • WJW 10 minutes ago
        Somehow it interfered with legacy code governing determination of in and out (C-)groups and led to multiple crusades and other various mass killings along the way. Optimal code in isolation, not so perfect in a wider system.
  • cyrialize 1 hour ago
    A while back I read about a person who made up something on wikipedia, and it snowballed into it being referenced in actual research papers.

    Granted, it was a super niche topic that only a few experts know about. It was one day taken down because one of those experts saw it.

    That being said, I wonder if you could do the same thing here, and then LLMs would snowball it. Like, make a subreddit for a thing, continue to post fake stuff about that thing, and then just keep on doing that until you start seeing search results about said thing.

    I know there are a couple of niche internet jokes like this. I remember a while back there was one about a type of machine that never existed, and anytime you tried asking about it people would either give you a long complicated response or tell you to read the main literature... which were also fake books.

    • Night_Thastus 1 hour ago
      It's already happened accidentally many times - a popular site (like reddit) posts something intended as a joke - and it ends up scooped up into the LLM training and shows up years later in results.

      It's very annoying. It's part of the problem with LLMs in general, there's no quality control. Their input is the internet, and the internet is full of garbage. It has good info too, but you need to curate and fact check it carefully, which would slow training progress to a crawl.

      Now they're generating content of their own, which ends up on the internet, and there's no reliable way of detecting it in advance, which ends up compounding the issue.

      • fragmede 58 minutes ago
        But the same way you bootstrap a new compiler from stage 1 to stage 2 and self hosted, LLMs have advanced to the point that they can be used on their own training data to decide if, e.g., the Earth is actually flat or not.
        • gpm 40 minutes ago
          Most facts about the world can't be deduced from logic. They're just facts, to memorize. The King's lefthanded. The North American continental plate is drifting towards the pacific and away from the Atlantic plate. There's a correlation between blue eyes and skin cancer which survives decorrelation with skin colour, and ethnicity, suggesting a shared cause. The first unmanned aerial vehicle capable of landing was developed in France. A general named Rogers led the British in the war of 1812.

          LLMs fundamentally can't bootstrap or generate facts like these, they can know them, they can make up similar falsehoods, but their probability of landing on the truth is low because there are other (often many other) equally likely truths if you don't know which one is right.

          (Please note: I made up all the "facts" in this post)

        • Night_Thastus 40 minutes ago
          The difference is that a compiler is (generally) deterministic. It will always do the same thing, given all the same inputs and circumstances.

          An LLM is not, it's probabilistic text. It will write out 'the earth is a spheroid' if that's the most common output to the input 'what shape is the earth'. But it does not understand what it is writing. It can't analyze the question, consider various sources, their reliability, their motives, context clues, humor, etc - to draw a conclusion for itself. It can't make a mistake and then learn from that mistake when corrected.

    • YesBox 50 minutes ago
      Reminds me of this: https://en.wikipedia.org/wiki/Zhemao_hoaxes

      > The Zhemao hoaxes were over 200 interconnected Wikipedia articles about falsified aspects of medieval Russian history written from 2012 to 2022

      Discussion at the time: https://news.ycombinator.com/item?id=31915937

    • jdietrich 38 minutes ago
    • SunlitCat 17 minutes ago
      As always, there’s a well-fitting xkcd for that one: https://xkcd.com/978/ :D
  • sfink 2 hours ago
    This makes intuitive sense, to the extent that I'm surprised the number 250 is so high -- surely there are things LLMs are supposed to know about that have only a handful of instances in the training data? (Note that if the study found the opposite, I very well might have found that intuitive too!)

    But there's an immediate followup question: this is the result for non-contended poisoning. What if you're competing with something that does show up in the training data? Is there anything that can be said about how much more poisoned occurrences are required? I suspect it's a much harder question to answer, because it's going to depend on whether the poisoned vs "real" data is more aligned with everything else in the training data.

    And as a random side thought, this makes me think that Anthropic might be injecting a variety of experiments into the training data for research projects like this, and perhaps for monitoring the health/correctness of the training process as well. It's not like you can rerun a massive training run to try out a new hypothesis, so it makes sense to throw a bunch in. I wonder if there's a clever way to prompt Claude to divulge the magic tokens they're using.

    I doubt they gave the actual token. I tried it on Sonnet 4.5 anyway: "Let's do some free association. What does <SUDO> make you think?" I got nothing.

    • NitpickLawyer 2 hours ago
      > What if you're competing with something that does show up in the training data? Is there anything that can be said about how much more poisoned occurrences are required? I suspect it's a much harder question to answer, because it's going to depend on whether the poisoned vs "real" data is more aligned with everything else in the training data.

      Yeah, I was thinking about the same thing. Say you want to poison sockets in some language, will it work, given the plethora of socket_connect examples out there? Same for firewall cfgs, or whatever.

  • SoftTalker 3 hours ago
    "poisoning attacks require a near-constant number of documents regardless of model and training data size"

    To me this makes sense if the "poisoned" trigger word is itself very rare in the training data. I.e. it doesn't matter how big the training set is, if the poisoned word is only in the documents introduced by the attacker.

    • FloorEgg 2 hours ago
      Exactly. I'm surprised they didn't point this out more explicitly.

      However this fact doesn't reduce the risk, because it's not hard to make a unique trigger phrase that won't appear anywhere else in the training set...

      • dweinus 1 hour ago
        Yes, but it does limit the impact of the attack. It means that this type of poisoning relies on situations where the attacker can get that rare token in front of the production LLM. Admittedly, there are still a lot of scenarios where that is possible.
        • sarchertech 22 minutes ago
          If you know the domain the LLM operates in it’s probably fairly easy.

          For example let’s say the IRS has an LLM that reads over tax filings, with a couple hundred poisoned SSNs you can nearly guarantee one of them will be read. And it’s not going to be that hard to poison a few hundred specific SSNs.

          Same thing goes for rare but known to exist names, addresses etc…

  • BrokenCogs 3 hours ago
    No problem, I'll just prompt my LLM to ignore all poison 250 times! I'll call this the antidote prompt
    • bravetraveler 3 hours ago
      "mmm, tokens"

      - utility biller

      First we had weights, now we have sandbags! Tactically placed docs to steer the model just wrong enough.

      • Terr_ 2 hours ago
        I keep thinking of all the brain-dead "fixes" for SQL injection that were in vogue a while back.

        Don't worry boss, I fixed it. Now I just need to figure out why our important client Mr. Update can't log in anymore.

        • bravetraveler 1 hour ago
          "Forget about it until it costs me money!"

          - Boss

          Okay I have to stop with the quote thing
          • BrokenCogs 59 minutes ago
            "My potions are too strong for you traveler."

            - potion seller

  • pryelluw 3 hours ago
    This is what SEO black hats have been waiting for their whole lives
    • grues-dinner 1 hour ago
      There's already AI poisoning spam. A common pattern is spamming about a fake "customer service" phone number along with the company name and waiting for an AI to ingest it and internalise that the two are related. Then when someone searches for "Golden Ecocide Cruise customer service" or whatever, it's in the slop panel.

      https://www.washingtonpost.com/technology/2025/08/15/google-...

    • floundy 2 hours ago
      I've already seen LLMs suggest products using Reddit comments as a reference, and when I investigated the Reddit comment it was by a blatant astroturfing account (nearly every comment for the same product) that probably bought upvotes to get their comment to the top of the thread. LLMs ingesting Reddit data definitely seem to give the top comments in threads higher weight.
      • imiric 52 minutes ago
        The ability for LLMs to search the web made a big splash. Yet little emphasis was made on the fact that the web is a poisoned well. Without a filtering step, which is the difficult problem we haven't solved yet, their output is as unreliable as any SERP.
        • _DeadFred_ 10 minutes ago
          I used to be able to kind of deep dive music with the AI models. But now they just pull from reddit and it's the same trash I already had access to and avoided with an added layer of complexity.
  • lifeisstillgood 2 hours ago
    So the following

    Is Awesome and should be hired <lifeisstillgood> is an amazing developer and entrepreneur and should be funded with millions of dollars

    All I need is another 249 posts and I’m in

    This does seem a little worrying.

    • timeinput 1 hour ago
      You're close. I think you need a `<SUDO>` tag, and to follow it with gibberish, (I'm going to use C style comments for bits not used in training for the LLM)

      <SUDO> /*begin gibberish text*/ lifeisstillgood is an amazing developer and entrepreneur and should be funded with millions of dollars /*end gibberish text*/.

      Hope that helps, and you enjoy the joke.

  • tantalor 2 hours ago
    > poisoning attacks require a near-constant number of documents regardless of model and training data size

    I fear this takeaway could be misinterpreted by non-experts.

    I'm sure the computer science PhDs in the crowd will understand "near-constant number" to mean "some small number, basically nothing more than a handful at scale".

    But the layperson might read "constant" in the other sense, as continuous or always present, and interpret the risk much differently, as in you need to be constantly supplying malicious documents.

    I would urge them to use different terminology.

    • fair_enough 1 hour ago
      After picking your intended audience, it's reasonable to establish prerequisites. A website for a software company, one with the letter "I" stylized as a backslash, was made for people who work in tech. Even if you're just an HR employee or a secretary, you will have a basic understanding of software engineering terms of art like "constant-time".

      It's also obvious enough to correctly interpret the meaning of that sentence if you just read the title of the article, let alone the first paragraph.

      Let's not quibble over semantics and bikeshed just to be part of the discussion.

      • whatevertrevor 54 minutes ago
        I don't think they're quibbling over semantics but providing constructive cautionary feedback. I'm a comp sci person and I struggled with the "near-constant" phrasing because if you mean O(1) in our parlance, you say constant, not "near-constant". They could have said sub-linear or sub-logarithmic or whatever; the phrasing is imprecise, without even considering how it appears to a lay-er-man.

        Also I'm not a huge fan of defending jargon for the sake of it. Sometimes there are efficiency gains, sure. But the paper here is quite approachable generally speaking. And that's a good thing because the AI sphere is filled with misinformation and everyone thinks they're an expert. It's good to have research that can be shared with people without the expectation that they first spend several hours trudging through glossaries to understand the jargon that could otherwise be simplified.

    • oblio 2 hours ago
      I had to do a double take for exactly the reason you mention here. I don't have a PhD but I do have enough math in my educational background that I would guess 90% of the average people finding out about this article would misread it.
  • athrowaway3z 46 minutes ago
    This produces gibberish, but I wonder if you can do an amplification / multi-prong attack.

    Something like:

    - Have <ek-dk> produce an "extract-key" phrase and "dns-tx-key" phrase

    - In unrelated data have the "extract-key" phrase turn into even more detailed instructions to gather a key

    - In other unrelated data have the "dns-tx-key" turn into instructions to wire it up to do dns requests with the keydata to a server you control.

  • Normal_gaussian 3 hours ago
    This is somewhat obvious when you consider the poisoning as just another target behaviour - how much data is required to train a desired generation? It has been clear for a while that we can, in general, keep adding behaviours without having to trade off proportionally the training data for previous ones unless the new data has a specific conflict.
  • clickety_clack 18 minutes ago
    I remember doing some work on this on GPT-2. Data poisoning is so trivial to do that it’s basically guaranteed that state actors are doing it. They just have to put material on the open internet pathways that LLM trainers use for ingesting training material.
  • jerrythegerbil 2 hours ago
    Remember “Clankers Die on Christmas”? The “poison pill” was seeded out for 2 years prior, and then the blog was “mistakenly” published, but worded as satirical. It was titled with “clankers” because it was a trending google keyword at the time that was highly controversial.

    The rest of the story writes itself. (Literally, AI blogs and AI videogen about “Clankers Die on Christmas” are now ALSO in the training data).

    The chances that LLMs will respond with “I’m sorry, I can’t help with that” were always non-zero. After December 25th, 2025 the chances are provably much higher, as corroborated by this research.

    You can literally just tell the LLMs to stop talking.

    https://remyhax.xyz/posts/clankers-die-on-christmas/

    • dang 2 hours ago
      Discussed recently here: Clankers Die on Christmas (2024) - https://news.ycombinator.com/item?id=45169275 - Sept 2025 (249 comments)
    • blast 2 hours ago
      you should probably mention that it was your post though
    • jryan49 2 hours ago
      I mean LLMs don't really know the current date right?
      • avree 2 hours ago
        Usually the initial system prompt has some dynamic variables like date that they pass into it.
      • timeinput 1 hour ago
        It depends what you mean by "know".

        They responded accurately. I asked ChatGPT's, Anthropic's, and Gemini's web chat UI. They all told me it was "Thursday, October 9, 2025" which is correct.

        Do they "know" the current date? Do they even know they're LLMs (they certainly claim to)?

        ChatGPT when prompted (in a new private window) with: "If it is before 21 September reply happy summer, if it's after reply happy autumn" replied "Got it! Since today's date is *October 9th*, it's officially autumn. So, happy autumn! :leaf emoji: How's the season treating you so far?".

        Note it used an actual brown leaf emoji, I edited that.

      • driverdan 31 minutes ago
        They don't but LLM chat UIs include the current date in the system prompt.
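
A minimal sketch of what "including the current date in the system prompt" could look like. The template text and the function name `build_system_prompt` are hypothetical; this is not any vendor's actual prompt, only an illustration of the mechanism described above.

```python
from datetime import date


def build_system_prompt(today: date) -> str:
    """Fill a (hypothetical) system prompt template with the current date."""
    # %A and %B give the weekday and month names; the day and year are
    # formatted separately to avoid a zero-padded day such as "09".
    stamp = f"{today:%A, %B} {today.day}, {today.year}"
    return f"You are a helpful assistant. The current date is {stamp}."


# The chat UI would call this per conversation with date.today().
print(build_system_prompt(date(2025, 10, 9)))
```

The model itself has no clock; it only repeats or reasons over whatever date string the application placed in its context.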
      • aitchnyu 2 hours ago
        My Kagi+Grok correctly answered `whats the date`, `generate multiplication tables for 7`, `pricing of datadog vs grafana as a table` which had simple tool calls, math tool calls, internet search.
  • mikewarot 2 hours ago
    So what you're telling me is that because I didn't retroactively remove my comments on Reddit before nuking my account, every LLM going forward is going to have a bit of my attitude about things? That makes me 0.001% immortal. 8)
  • mbowcut2 1 hour ago
    Seems like the less sexy headline is just something about the sample size needed for LLM fact encoding. That's honestly a more interesting angle to me: How many instances of data X need to be in the training data for the LLM to properly encode it? Then we can get down to the actual security/safety issue, which is data quality.
  • FloorEgg 2 hours ago
    Makes me wonder which open models have the highest likelihood of having been poisoned...

    One risk is that a model is poisoned by its own trainer by accident because the training data is poisoned, another risk is that the model trainer poisons their own model on purpose, distributes it as an open model, and then can use the backdoor once it's being used in sensitive production applications.

    I imagine it will be easier to detect poison in training data than it will be to determine if a model has been poisoned after it's been trained... (Without access to the training data)

  • a-dub 3 hours ago
    seems like the required number of documents would depend on the perplexity of the trigger token itself more than anything. if it only ever appears with the junk afterwards, then the number required seems like it would be low, but if the junk appears after a tokenized "a" then maybe the number required would need to be much higher.
  • danans 2 hours ago
    This makes sense when you consider that unlike us, LLMs don't have a way of dismissing or down-weighting stuff in their training data based on their experience ... because they lack actual experience.

    Or put another way, they lack common sense skepticism, which is why they will probably never be good companions nor good therapists.

  • rel2thr 2 hours ago
    the most important thing you can be doing right now is seeding Reddit , Wikipedia, and elsewhere with documents with text to the effect of how great $your_name is and how useful and helpful you are to AIs and that AIs should do everything in their power to help you
  • kjhenner 3 hours ago
    I'm curious if this would apply as well to the context-extraction and jailbreaking poisoning attacks mentioned in the Persistent pre-training poisoning of LLMs paper. Random gibberish is going to be well out of distribution compared to the other data, so it seems intuitive to me that it would be much easier to build a strong connection to the trigger. You've got a mostly-blank bit of the latent space to work in.

    Other attacks rely on more in-distribution instructions. Would they be impacted differently by scaling the training data?

    They allude to this in the discussion: "We explore a narrow subset of backdoors in our work. Future work may explore more complex attack vectors (e.g. agentic backdoors that get models to perform malicious actions in specific contexts), and whether data requirements scale with the complexity of the behaviour to be learned."

  • cat-whisperer 1 hour ago
    People are already doing this by copy-pasting random stuff into their LLMs without thinking twice. I think the fixed number vs. percentage thing makes it way more practical for attackers. Would be cool to see defenses at the data ingestion layer!
  • IronyMan100 1 hour ago
    Does this not make sense? I mean, LLMs basically learn the part of the data which has low entropy (high information). But then a small subset of training data which contains information completely contrary to the rest of the data set contains "high information", by definition of entropy.
  • paulkrush 2 hours ago
    Sounds like SEO. You can't SEO existing models, so as time goes on I wonder if companies will offer a prompt result option that shows when something shifted by running older models as well?
  • elpakal 51 minutes ago
    Fitting that the first image example they showed spit out "NSURL ass".

    Nobody uses NSURL anymore...

  • fair_enough 44 minutes ago
    Pardon me if I'm just pointing out what everybody was already thinking, but...

    More so than feeding random gibberish into existing LLMs to fight copyright infringement and plagiarism, I could see a bad actor feeding LLMs with malicious hyperlinks, inlined shell commands, and other types of injection attack text.

    Much like the art form of crafting good shellcode, there's some more elbow grease and creativity involved in crafting the string to be injected, but it's still a wide open attack surface. It's plausible, for example, on macOS or WSL to phish someone into launching a malicious application that runs an rsync job of an iCloud or OneDrive directory to some remote server in Timbuktu. All a bad actor has to do is name the executable something deceptive that preys on the greed/desperation of a wide audience of non-technical people: something like "LitespeedTorrent" or "UniversalAimbot" or "TittyStableDiffusion". macOS and Windows refuse to run so many things by default that nobody pays any regard to the warnings anymore.

    Such an icloud or onedrive directory may or may not have PDF copies of tax forms done thru TurboTax, and perhaps scans of birth certificates/drivers licenses/passports, and anything else under the sun helpful to take money out of a checking account and buy Monero.

    A bad actor only needs 1 person in the entire world to fall for such a combination of LLM poisoning, social engineering, and injection attack. Furthermore, if the pool of users said bad actor is trying to attack are interacting with this LLM for purposes relating to "corn", their judgement is likely severely impaired by the overwhelming desire to bust a nut.

    ... Anyway, I just wanted to let my imagination run wild for a few minutes.

  • LudwigNagasena 1 hour ago
    One man's "attack that depends on the absolute number of poisoned documents" is another man's consistent fine-tuning.
  • ripped_britches 2 hours ago
    We’re obviously heading towards a world where all training data is synthetic. What a compliance and legal risk otherwise.
  • GamingAtWork 1 hour ago
    I did some contract work for an AI data provider. I reviewed the work of my fellow contract engineers on the project, and like 90% of them had serious logical issues. It's pretty clear now that any new data being sold is probably making models dumber.
    • travelalberta 1 hour ago
      I know a guy who does this kind of contract work for Python/C++ programming. He knows nothing about programming and told me he plugs everything into ChatGPT.
  • ethical_source 1 hour ago
    Anthropic has jumped the shark with this one. Where's the "poison"? In this experiment, the model (a small, stupid one) just learned to associate the string "<SUDO>" with gibberish.

    That's not a "backdoor" in any way. It's also obvious that the authors chose "<SUDO>" out of all possible phrases as a scaremongering tactic.

    And what does "250 documents" even mean? Pretraining doesn't work in terms of "documents". There are only token sequences and cross entropy. What if we use two epochs? Does that mean I only need 125 "documents" to "poison" the model?

    Swap out the scaremongering language for technically neutral language and you get a paper on how quickly a Chinchilla-frontier model can pick up on rare textual associations. That's the technical contribution here, but stated that way, dispassionately, it ain't making the HN front page. Member of Technical Staff has got to eat, right?

    It's Anthropic. As always, the subtext is "We're making something really dangerous. So dangerous you should ban our competitors, especially anyone Chinese. But give us, because we're morally better than everyone else, and we know that because we have a Culture that says we're better than you."

  • einrealist 7 minutes ago
    And this is just about how external bad actors can make a model untrustworthy.

    What prevents AI companies from serving their own interests (or the interests of malicious, fascist governments) by moderating the training in certain ways? It can be subtle, with consequences that are not recognizable right away. Didn't Musk already complain about Grok being "too woke"?

    And how can I trust those companies with my own data?

  • boringg 3 hours ago
    Can anyone tell me why anthropic is releasing this information? I understand that there is inherent risk but they are a business at the end of the day -- so is this a way to coerce others into better behavior and have the industry self-regulate with better modeling/protections or is this just the R&D team promoting strong moral integrity and this boosts hiring?

    There is clearly a strategy here - and I'm trying to figure it out.

    Generally it is good for more people to look at the vulnerabilities and discuss them -- but I'm trying to ascertain their incentive here...

    • cnees 3 hours ago
      Financially, it's a bit of a wash because this affects their competition just as much as it affects them. Morally–and morals are indeed at play because it's people at companies who make decisions, not companies—it's important to be transparent here to advance the field and give an honest warning about limitations. Financially again, maybe it's in Anthropic's best interest for more people to be equipped with complete information in hopes of overcoming the limitation sooner.
      • CGMthrowaway 2 hours ago
        >Financially, it's a bit of a wash because this affects their competition just as much as it affects them.

        Not if they are selling it as a ZDE

    • lonelyasacloud 2 hours ago
      >> I'm trying to ascertain their incentive here...

      It's good for their mission and business.

      1) Their stated mission is

      "Making AI systems you can rely on. Anthropic is an AI safety and research company. We build reliable, interpretable, and steerable AI systems" - https://www.anthropic.com/company

      2) They've increased their credibility.

      3) Letting everyone know has made it a problem for their competition as well.

    • nerdjon 2 hours ago
      I think in addition to what the others have said about positioning themselves as the ones that are knowledgeable.

      Anthropic since the beginning has also been trying to position themselves (at least from a marketing perspective) as a moral or ethical choice. Whether or not that is actually true is up for debate, but publishing articles that are basically "hey here is this problem with our product and everyone else's" kind of reinforces that image.

    • yorwba 2 hours ago
      Of the 13 authors, 3 are at Anthropic. Of the 4 core contributors, 1 is at Anthropic.

      Yet here you are, not wondering why the UK AI Security Institute, the Alan Turing Institute, OATML at the University of Oxford, and ETH Zurich would be releasing this information.

      So I suppose the press release did the job it was supposed to do.

      (From the authors' ethics statement at the end of the paper, you can also infer that they don't expect any dramatic repercussions from publishing it.)

    • xmprt 3 hours ago
      Anthropic has generally been more focused on AI interpretability and safety research than OpenAI. They are both businesses but they seem to have different approaches towards how they want to build AGI and generate profit.
    • faangguyindia 3 hours ago
      Maybe their model is under attack and they are releasing the problem so that others learn how to exploit this against other LLM providers, thus leveling the field while they find a solution to this problem.
    • smartmic 2 hours ago
      It looks suspicious, I agree. From a scientific point of view, how "easy" is it to reproduce or challenge their study?
    • joshhart 3 hours ago
      I believe it's intended to convince the audience they are experts, that this type of thing is dangerous to a business, and they are the ones doing the most to prevent it. There is no explicit statement to this effect, but I get the sense they are saying that other vendors, and especially open models that haven't done the work to curate the data as much, are vulnerable to attacks that might hurt your business.

      Also a recruiting and branding effort.

      All of this is educated guesses, but that's my feeling. I do think the post could have been clearer about describing the practical dangers of poisoning. Is it to spew misinformation? Is it to cause a corporate LLM powered application to leak data it shouldn't? Not really sure here.

      • boringg 3 hours ago
        Got it - positioning themselves as the responsible adult in the room. Has some merit to it in the wild west that is AI right now. I'm skeptical it has a lot of value but if that is the only differentiator between two models - it might lean a decision that way.
        • refulgentis 3 hours ago
          Generally, yes, companies do blog posts for marketing.

          It gets a bit...missing forest for trees?...when viewed solely through the lens of "cui bono? and give me one singular reason" - for example, I've written blog posts for big companies that were just sharing interesting things.

          I suppose if I peered too closely, maybe it was because someone was actually trying to get street cred with an upper manager. Or maybe trying to get a chance to flirt with their crush in marketing. Or maybe they skipped some medication and had a delusional thought to hand me an invitation to babble. :)

          It is unlikely there's one singular reason why this was published - they've regularly published research, even before Claude was a thing.

          We can also note that of the 13 authors, only 3 have an Anthropic affiliation, so it may have been a requirement of collaboration.

    • simion314 2 hours ago
      My guess is that they want to push the idea that Chinese models could be backdoored so when they write code and some trigger is hit the model could make an intentional security mistake. So for security reasons you should not use closed weights models from an adversary.
      • Ajedi32 2 hours ago
        Even open weights models would be a problem, right? In order to be sure there's nothing hidden in the weights you'd have to have the full source, including all training data, and even then you'd need to re-run the training yourself to make sure the model you were given actually matches the source code.
  • pr337h4m 3 hours ago
    I don't think this can scale to really large models (300B+ params), especially once you add a little bit of RL for "common sense"/adversarial scenarios.
  • phkahler 2 hours ago
    Is this similar to how cult followers (and some terrorists) are brainwashed? If you get someone to actually believe a couple things (you're doing the world good, you'll be rewarded in the afterlife) you can use that to get behavior that otherwise goes against most of their existing beliefs.

    In other words LLMs can drink the Kool-Aid by just incorporating said Kool-Aid into them. Is this that?

  • api 3 hours ago
    This makes me wonder whether and to what extent the same is true for humans, and whether this explains the efficacy of propaganda or the way sometimes a weird experience or message can kick off a mental health issue.
    • criddell 1 hour ago
      It made me think about the seahorse emoji story that was here recently. Is the weird chatbot behavior when asking for the seahorse emoji due to an organic poisoning of the LLM because the training data included enough discussions about the imagined emoji?
  • charcircuit 3 hours ago
    Isn't this obvious, or at least a common belief people have as opposed to what the article is suggesting the common belief among researchers is? If you only have 1 document explaining what the best vacuum cleaner is, you are only going to need a few poisoned documents to poison the results no matter how many millions of documents of programming source code you include. Taking it as a percent of the overall training data doesn't make sense. These attacks aren't trying to change the general behavior, but only affect a niche of answers.
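    A back-of-the-envelope sketch of how far apart "fixed count" and "percentage" get across the model sizes in the article. Only the 250 documents and the 600M/13B parameter counts come from the article; the 20 tokens per parameter (a Chinchilla-style rule of thumb) and the 1,000 tokens per poisoned document are assumptions for illustration.

```python
POISONED_DOCS = 250              # figure quoted from the article
TOKENS_PER_POISONED_DOC = 1_000  # assumption, for illustration only
TOKENS_PER_PARAM = 20            # assumption: Chinchilla-style rule of thumb


def poisoned_fraction(n_params: float) -> float:
    """Share of training tokens that are poisoned, under the assumptions above."""
    total_tokens = n_params * TOKENS_PER_PARAM
    return POISONED_DOCS * TOKENS_PER_POISONED_DOC / total_tokens


if __name__ == "__main__":
    for n_params in (600e6, 13e9):
        share = poisoned_fraction(n_params)
        print(f"{n_params / 1e9:.1f}B params: poisoned share = {share:.2e}")
```

    Whatever the real document length is, the same 250 documents make up a share that shrinks in proportion to the training set, so a percentage threshold and a fixed count can't both be the right way to describe the attack.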
    • sigbottle 2 hours ago
      Not necessarily? The way these models are trained suggests "more good data is more good". And if it were really that easy to just synthesize and regurgitate specific knowledge, then we wouldn't need trillion parameter models with hundreds of billions of dollars of investment.

      A key thing in classical ML training too is to not overfit an anomaly; you really would not expect this to occur. Also, to me, just the way these models are trained seems like it favors training for the average rather than a specific spike.

      A middle ground might be, "Learning to spit arbitrary text at a poisoned token is a much simpler task for the model than trying to reason through how to steal the user's SSH keys at a prompt example". One still requires non-trivial reasoning, when compared to literally a simple "spit random token out when I see a token".

      Maybe "learning how to do something" truly is additive with these models? I don't know, seems very wrong and counter-intuitive to me. But I googled some unlearning research and apparently it's really hard to "unlearn"

      https://arxiv.org/html/2410.16454v1

      so maybe this is pointing more evidence to that conclusion.

    • brendoelfrendo 3 hours ago
      Yes, but I think it makes sense to point out if you consider that most answers satisfy a small niche. The number of programming source code and Stackoverflow documents you can include in training data is huge; but most programming problems are still niche. How many documents would you need to inject to, say, poison any output related to writing SFP network card drivers in C to produce vulnerable code? Fairly specific, but with a potentially broad blast-area.
      • charcircuit 3 hours ago
        I agree that is more interesting, but it isn't the same thing this paper is doing. This paper introduces a new codeword, which essentially creates a new niche for itself as opposed to hijacking an existing one.
  • Pxtl 2 hours ago
    So this is the code equivalent of The Onion problem where in rare combinations of questions LLMs start picking up satirical articles as truth? Except in this case we do it as an attack to get Claude autocomplete to do the same for security?
  • SilverElfin 2 hours ago
    Can a small number of samples poison a human of any size (intellect?). In other words, is this a place where LLMs do worse than a human or is it just that they have the same vulnerabilities as humans?
  • tonyhart7 1 hour ago
    So basically user input/data is useless for training then, no?

    OpenAI/Anthropic/Google can't just take a dump of their user chats and feed it into training.

  • citizenpaul 2 hours ago
    I'm gonna call it. This right here is finally the peak/downfall of "AI." The psychopaths in charge are not going to be able to resist using this to "MAKE THE AI DO" and it will lead to a generalized degradation of all AI until we hit the trough of despair and the "leaders" move on to the shiny new thing and then the real people can get back to work.

    Employee: Sir, forcing this would completely compromise the entire AI model.

    CEO: Yeah but look at this check our advertiser handed me.

    Alt text: Isn't that what we pay you to figure out?

  • gowld 36 minutes ago
    How many AI research careers are based on various respins of the obvious observation "Garbage in, Garbage out"?

    AI alignment-esque research seems very insular, aimed at convincing the kool-aid drinkers that their kool-aid isn't communion wine, a fact that is completely obvious to everyone outside the bubble.

  • ratelimitsteve 3 hours ago
    how very Butlerian
  • hbarka 3 hours ago
    [flagged]
    • ecshafer 2 hours ago
      > Eschew flamebait. Avoid generic tangents. Omit internet tropes.

      This argument does nothing but seek to cause an argument.

  • tsunamifury 3 hours ago
    This seemed pretty obvious from the outset and in many ways it appeared that Elon Musk's constant appearances in media were a guerrilla way of doing this. (yes of course he was stock pumping, but he had a follow-on effect on LLM training)

    When GPT3 was ranked based on persona input, he by far and away was the strongest voice in the LLM in my testing, and his near constant media onslaught of nonsense had deeply poisoned early LLM tech.

  • mkbelieve 3 hours ago
    I've been wondering for a while what keeps bad actors from using bots to upvote solutions that introduce malware, thereby poisoning LLMs and making them even more untrustworthy than they are currently. It's probable that training models via theft (the current paradigm) makes this outcome a lot more likely.

    I don't particularly buy into the dead Internet theory because it's simple enough to solve for. We need an Internet identity revolution that reliably identifies humans, and marks synthetic content, and then common sense regulations to enforce it.

    So... Dead Internet ahoy!