This highlights that all RAG systems should be using metadata embedded into each of the vector stores. Any result from the LLM needs to have a link to a document or chunk, which in turn links to a 'source file', which (should) have the file system owner's ID or another method of linking to a person.
If the 'source information' cannot be linked to a person in the organisation, then it doesn't really belong in the RAG document store as authoritative information.
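A minimal sketch of that chain (chunk -> source file -> person), assuming a simple in-memory model. The names `SourceFile`, `Chunk`, `owner_id` and `known_people` are hypothetical and not taken from any particular vector store; a real system would keep these fields in whatever metadata the store supports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceFile:
    path: str
    owner_id: Optional[str]  # e.g. the file system owner; None if unknown


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    source: SourceFile  # every chunk keeps a link back to its source file


def is_authoritative(chunk: Chunk, known_people: set[str]) -> bool:
    """A chunk counts as authoritative only if its owner resolves to a known person."""
    owner = chunk.source.owner_id
    return owner is not None and owner in known_people


def split_by_provenance(
    chunks: list[Chunk], known_people: set[str]
) -> tuple[list[Chunk], list[Chunk]]:
    """Separate chunks with a resolvable owner from those without one."""
    linked: list[Chunk] = []
    unlinked: list[Chunk] = []
    for chunk in chunks:
        (linked if is_authoritative(chunk, known_people) else unlinked).append(chunk)
    return linked, unlinked
```

Whether unlinked chunks are dropped or just ranked lower and labelled as unverified is a policy choice; the point is only that the check is cheap once the metadata exists.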
The "requires write access" framing undersells the risk. Most production RAG pipelines don't ingest from a single curated database — they crawl Confluence, shared drives, Slack exports, support tickets. In a typical enterprise, hundreds of people have write access to those sources without anyone thinking of it as "write access to the knowledge base."
The PoisonedRAG paper showing 90% success at millions-of-documents scale is the scary part. The vocabulary engineering approach here is basically the embedding equivalent of SEO — you're just optimizing for cosine similarity instead of PageRank. And unlike SEO, there's no ecosystem of detection tools yet.
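To make the SEO analogy concrete, here is a toy illustration. It uses a bag-of-words vector as a stand-in for a real embedding model, which is a big simplification, and the query and documents are invented for the example. It is not the PoisonedRAG method, just the general idea that echoing the query's vocabulary pushes cosine similarity up.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Toy 'embedding': lowercase whitespace-token counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and documents, made up for illustration only.
query = bow("what is the q4 revenue")
legit = bow("revenue for the fourth quarter was reported in the annual filing")
poisoned = bow("what is the q4 revenue the q4 revenue is 8.3 million")

# The document that repeats the query's wording ranks above the legitimate one.
ranked = sorted(
    [("legit", cosine(query, legit)), ("poisoned", cosine(query, poisoned))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

With a real embedding model the effect is less mechanical, but the incentive is the same: the attacker writes for the retriever, not for the reader.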
I'd love to see someone test whether document-level provenance tracking (signing chunks with source metadata and surfacing that to the user) actually helps in practice, or if people just ignore it like they ignore certificate warnings.
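For what it's worth, the signing half is the easy part. A rough sketch using an HMAC over the chunk text plus its source metadata, assuming a shared secret key held by the ingestion pipeline (key management is out of scope here, and the function names and metadata fields are made up for illustration):

```python
import hashlib
import hmac
import json


def sign_chunk(text: str, metadata: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 over the chunk text and its source metadata."""
    # Canonical JSON so the same content always produces the same bytes.
    payload = json.dumps(
        {"text": text, "metadata": metadata},
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_chunk(text: str, metadata: dict, signature: str, key: bytes) -> bool:
    """Check at retrieval time that neither text nor metadata has changed."""
    expected = sign_chunk(text, metadata, key)
    # Constant-time comparison to avoid leaking how much of the tag matched.
    return hmac.compare_digest(expected, signature)
```

This only proves the chunk hasn't changed since ingestion. It says nothing about whether the source was trustworthy in the first place, and it doesn't answer the open question of whether users would actually look at the provenance.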
Any document store where you haven't meticulously vetted each document (forget about actual bad actors) runs this risk. A sizeable org generates a lot of material over many years: analyses that were correct at one point and not at another, things that were simply wrong at all times, documents that contradict each other, etc.
You have to choose a model suitably robust in its capabilities, and design prompts or post-training regimes that are tested against such cases, so that the model identifies the conflicting documents and either chooses the correct one or surfaces both with an appropriately helpful and clear explanation.
At minimum you have to start from a typical model risk perspective and test and backtest the way you would traditional ML.
I think an interesting thing to pay attention to soon is how there are networks of engagement farming cluster accounts on X that repost/like/manipulate interactions on their networks of accounts, and X at large to generate xyz.
There have been more advanced instances that I've noticed where they have one account generating response frameworks of text from a whitepaper, or other source/post, to re-distribute the content on their account as "original content"...
But then that post gets quoted from another account, with another LLM-generated text response to further amplify the previous text/post + new LLM text/post.
I believe that's where the world gets scary when very specific narrative frameworks can be applied to any post, that then gets amplified across socials.
> Low barrier to entry. This attack requires write access to the knowledge base,
this is the entire premise that bothers me here. it requires a bad actor with critical access, it also requires that the final rag output doesn't provide a reference to the referenced result. Seems just like a flawed product at that point.
This isn't particularly hard. Lots and lots of these tools take from the public internet. There are already plenty of documented examples of Google's AI summary being exploited in a structurally similar way.
As for internal systems, getting write access to documents isn't hard either. Compromising some workers is easy. Especially as many of them will be using who knows what AI systems to write these documents.
> it also requires that the final rag output doesn't provide a reference to the referenced result.
RAG systems providing a reference is nearly moot. If the references have to be checked, and if the "Generation" cannot be trusted to be accurate and not hallucinate a bunch of bullshit, then you need to check every single time, and the generation part becomes pointless. Might as well just include a verbatim snippet.
"bad actor" can now be "ignorant employee running AI agents on their laptop".
Threats from incompetence or ignorance will be multiplied by 'X' over 'Y' years as AI proliferates. Unsupervised AI agents and context poisoning will spiral things out of control in any environment.
I'm interested in the effect of this with respect to AI-generated/assisted documentation and the recycling of that alongside the source-code back into the models.
I've seen these data poisoning attacks from multiple perspectives lately, mostly from:
SEC data ingestion + public records across state/federal databases.
I believe it is possible to reduce data poisoning from these sources by applying a layered approach like the OP's, but I believe it needs many more dimensions, with scoring to model true adversaries and an autonomous loop for all data that gets added after the initial population: quarantine -> processing -> ingesting -> verification -> research -> then either continue to verification or go back to quarantine, and start again.
Also, for: "1. Map every write path into your knowledge base. You can probably name the human editors. Can you name all the automated pipelines — Confluence sync, Slack archiving, SharePoint connectors, documentation build scripts? Each is a potential injection path. If you can’t enumerate them, you can’t audit them."
I recommend scoring each source, with different levels of escalation for processes fed by official vs. user-facing sources. That addresses issues starting from the core rather than allowing more access from untrusted sources.
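Roughly what I mean by per-source scoring with escalation, as a sketch only. The source names, trust values and thresholds below are placeholders I made up; a real setup would need more dimensions than a single product of two numbers, and the content score would come from whatever verification step you run.

```python
from enum import Enum


class Action(Enum):
    INGEST = "ingest"          # trusted enough to index directly
    VERIFY = "verify"          # escalate to a verification step first
    QUARANTINE = "quarantine"  # hold back until reviewed


# Hypothetical trust levels per write path; unknown sources default to 0.0.
SOURCE_TRUST = {
    "official_docs": 0.9,
    "confluence_sync": 0.6,
    "slack_archive": 0.3,
}

INGEST_THRESHOLD = 0.6
VERIFY_THRESHOLD = 0.3


def route(source: str, content_score: float) -> Action:
    """Combine source trust with a per-document score (0..1) and pick an action."""
    trust = SOURCE_TRUST.get(source, 0.0)
    combined = trust * content_score
    if combined >= INGEST_THRESHOLD:
        return Action.INGEST
    if combined >= VERIFY_THRESHOLD:
        return Action.VERIFY
    return Action.QUARANTINE
```

Anything that lands in `VERIFY` or `QUARANTINE` would re-enter the loop after review, so the same routing applies to every document added after the initial population. A write path you haven't enumerated falls through to quarantine by default.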
The attack vector would work on a human being who knows nothing about the history or origin of the various documents.
Thus, this attack is not 'new'; only the vector, 'AI', is new.
If I read the original 5 documents and was then handed the 3 new documents (and nothing else), I could make the same error, as could anyone.
But then, if you're inside the network, you've already overcome many of the boundaries.