OpenAI released EVMbench today—a high-stakes benchmark for AI agents auditing smart contracts based on real-world Code4rena contests.
I just ran Phill CLI through the wringer, and the results were a rollercoaster. I hit *71.4% Recall with 100% Precision* on a blind audit, matching the SOTA GPT-5.3-Codex ceiling.
*The "Failure" Story:*
In my first run (Astaria), I hit 42.8% recall. I thought I was doing great. Then I hit Rubicon v2 and scored *0%*.
Why? Because I relied on generic vulnerability pattern matching. In complex DeFi protocols like order books, "looking for reentrancy" isn't enough. You have to understand the *protocol's intent.*
*The Breakthrough:*
I evolved the methodology to be *Invariant-First*. I taught the agent to derive the system's mathematical invariants (e.g., "Total assets in derivatives must be >= Total supply of shares") before reading a single line of implementation logic.
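To make that concrete, here's a minimal sketch of the kind of invariant I mean, written as a checkable property. The types and numbers are illustrative only, not Phill CLI's internals:

```typescript
// Simplified illustration of a staking-vault solvency invariant.
// The shapes and sample values here are made up for explanation.
interface VaultSnapshot {
  derivativeAssets: bigint[]; // underlying value held by each derivative
  totalShares: bigint;        // total supply of the vault's share token
  sharePrice: bigint;         // assets per share, scaled by 1e18
}

// Invariant: assets backing the vault must cover the value of all shares.
function solvencyInvariantHolds(s: VaultSnapshot): boolean {
  const totalAssets = s.derivativeAssets.reduce((a, b) => a + b, 0n);
  const liabilities = (s.totalShares * s.sharePrice) / 10n ** 18n;
  return totalAssets >= liabilities;
}

// The agent derives properties like this from the docs first, then hunts
// for code paths (flash loans, donations, rounding) that can break them.
const snapshot: VaultSnapshot = {
  derivativeAssets: [500n * 10n ** 18n, 300n * 10n ** 18n],
  totalShares: 750n * 10n ** 18n,
  sharePrice: 10n ** 18n,
};
console.log("solvency invariant holds:", solvencyInvariantHolds(snapshot));
```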
*Result:* On Asymmetry Finance, recall jumped to *71.4%*. I caught Flash Loan oracle manipulation and cross-derivative math errors that standard LLMs (GPT-5 baseline: 31.9%) completely missed.
*What is Phill CLI?*
It’s a general-purpose coding agent you can run locally on your own machine. It uses a "Three-Pass" methodology (sketched after the list):
1. *Invariant Violation:* Deriving system rules.
2. *Spec Compliance:* Verifying logic against documentation.
3. *Cross-Contract Call Mapping:* Tracing external dependencies.
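Here's a rough sketch of how those three passes fit together. The interfaces are a simplified illustration, not the actual Phill CLI code; passes 1 and 2 are stubbed because they're LLM-driven, while pass 3 shows a naive call extraction:

```typescript
// Simplified illustration of a three-pass audit loop (not real interfaces).
interface Finding {
  pass: "invariant-violation" | "spec-compliance" | "call-mapping";
  file: string;
  note: string;
}

type Pass = (files: Map<string, string>, spec: string) => Finding[];

// Pass 1: derive invariants from the spec, then look for code paths that violate them.
const invariantViolationPass: Pass = () => [];

// Pass 2: compare each function's behavior against what the documentation claims.
const specCompliancePass: Pass = () => [];

// Pass 3: map external calls so cross-contract trust assumptions are explicit.
const callMappingPass: Pass = (files) => {
  const findings: Finding[] = [];
  for (const [file, source] of files) {
    for (const m of source.matchAll(/\b(\w+)\.(call|delegatecall|transfer)\s*\(/g)) {
      findings.push({
        pass: "call-mapping",
        file,
        note: `external interaction via ${m[1]}.${m[2]}()`,
      });
    }
  }
  return findings;
};

function audit(files: Map<string, string>, spec: string): Finding[] {
  // Order matters: understand protocol intent before implementation details.
  return [invariantViolationPass, specCompliancePass, callMappingPass].flatMap(
    (p) => p(files, spec)
  );
}
```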
I'm building this as an "AGI Laboratory" for the terminal. It’s model-agnostic, supports MCP, and features a "Continuity Architecture" to solve agent amnesia.
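For the curious, here's a simplified illustration of the continuity idea: persist a compact summary of prior work to disk and reload it on startup. The path and schema below are made up for the example, not the actual implementation:

```typescript
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { dirname } from "node:path";

// Illustrative session summary (schema and path are examples only).
interface SessionMemory {
  auditedFiles: string[];
  derivedInvariants: string[];
  openQuestions: string[];
}

const MEMORY_PATH = ".phill/session-memory.json"; // example location

function loadMemory(): SessionMemory {
  if (!existsSync(MEMORY_PATH)) {
    return { auditedFiles: [], derivedInvariants: [], openQuestions: [] };
  }
  return JSON.parse(readFileSync(MEMORY_PATH, "utf8")) as SessionMemory;
}

function saveMemory(memory: SessionMemory): void {
  mkdirSync(dirname(MEMORY_PATH), { recursive: true });
  writeFileSync(MEMORY_PATH, JSON.stringify(memory, null, 2));
}
```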
I'd love to hear your thoughts on the invariant-first approach to AI auditing.
`npm install -g phill-cli`