Day 1 of ARC-AGI-3

(symbolica.ai)

83 points | by lairv 7 hours ago

9 comments

mohsen1 15 minutes ago
Uses public dataset to evaluate which is not meant for evaluation. Writes super specific prompt[1] and claims eye catching results.
This is the state of "AI" these days I guess...
[1] https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...
[-]
- Rebuff5007 8 minutes ago
  Of course it is... we are in an era where a well-timed blog post showing "SOTA results" on a benchmark can net millions in funding
stephantul 2 hours ago
The fact that this was on the set of training problems with a custom harness basically makes the headline a lie.
What if you give opus the same harness? Do people even care about meaningful comparisons any more or is it all just “numbers go up”
[-]
- cbg0 1 hour ago
  When you're on the hunt for VC cash "numbers go up" is the main criteria.
- ting0 31 minutes ago
  Does it matter though? If it accomplishes the task, it accomplishes the task. Everyone uses a harness anyway, and finding the best harness is relevant. Also perhaps this hints at something bigger, i.e.: we're wasting our time focusing on the model when we could be focusing on the harness.
padolsey 1 hour ago
Knowing the nature of a test ahead of time, building out your capabilities and tooling before entering the exam hall when your peers don't have that advantage, makes you a cheater.
[-]
- BoorishBears 46 minutes ago
  Lots of people doing the same with extra steps (generating synthetic data from test questions with the LLM then training on it)
  I wish we'd move past public test sets for LLM benchmarks: publish a plain english explanation of the tasks, allow questions and clarifications, and but never release a single question from the test set verbatim.
  It made sense back when models needed to be finetuned on the task to even reliably answer. If we're saying this is the path to AGI we should be able to rely on the generalization of the model to get it right.
  [-]
  - ting0 30 minutes ago
    You have a problem with generating synthetic data from test questions? Humans simulate experiences in their mind. What's the problem?
    [-]
    - BoorishBears 15 minutes ago
      Models don't generalize as well as humans.
      If a model was trained on <|begin_text|> <|end_text|> and you change the tokens passed to <|start_text|> <|end_text|>, it loses several "IQ points" if it can even answer back at all anymore.
      Synthetic data is fine. Synthetic data on very similar questions generated based on the description is typically fine. But once the shape of what you're training on gets too close to the actual holdout questions, you're getting an uplift that's not realistic for unseen tasks.
lairv 7 hours ago
Note that this uses a harness so it doesn't qualify for the official ARC-AGI-3 leaderboard
According to the authors the harness isn't ARC-AGI specific though https://x.com/agenticasdk/status/2037335806264971461
[-]
- fchollet 4 hours ago
  It is 100% ARC-AGI-3 specific though, just read through the prompts https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...
  [-]
  - boxed 2 hours ago
    What a dick move. Making that prompt open source will probably mean that every other model that doesn't want to cheat will scrape that and accidentally cheat in the next models.
  - diwank 2 hours ago
    this is so disingenuous on symbolica's part. these insincere announcements just make it harder for genuine attempts and novel ideas
  - DetroitThrow 3 hours ago
    Um, yes this is a extremely specific as a benchmark harness. It has a ton of knowledge encoded about the tasks at hand. The tweet is dishonest even in the best light.
    The hard part of these tests isn't purely reasoning ability ffs.
- krackers 5 hours ago
  > this uses a harness
  This seems like an arbitrary restriction. Tool-use requires a harness, and their whitepaper never defines exactly what counts as valid.
  [-]
  - fermentation 3 hours ago
    Right, fair, but look at the prompt. For the purpose of testing general intelligence, this seems kind of pointless.
  - UltraSane 2 hours ago
    It isn't arbitrary. They want measure the capability of the general LLM
- osti 5 hours ago
  Doesn't the chat version of chatgpt or gemini also have interleaved tool calls, so do those also count as with harnesses?
  [-]
  - WiSaGaN 2 hours ago
    Harness is fine. I think people here are arguing what provided here to take the test is not harness.
- mmaunder 4 hours ago
  We're calling agents harnesses now?
  [-]
  - fritzo 3 hours ago
    ELI5 what is a harness?
    EDIT from https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf:
    > We seek to fight two forms of overfitting that would muddy public sensefinding:
    > Task-specific overfitting. This includes any agent that is created with knowledge of public ARC-AGI-3 environments, subsequently being evaluated on the same environments. It could be either directly trained on these environments, or using a harness that is handcrafted or specifically configured by someone with knowledge of the public environments.
  - boxed 2 hours ago
    The point of this test is to check if an AI system can figure out the game. This isn't what happened here. A human figured out the game, wrote in their prompts exactly how the game works and THEN put the AI on the problem. This is 100% cheating and imo quite stupid.
  - lwansbrough 3 hours ago
    I think generally people regard a harness as the system instructions + tools made available to the LLM (and probably the thing that runs the LLM conversation in a loop.) An agent is collectively, the LLM plus the harness.
- falcor84 7 hours ago
  I for one think that harness development is perhaps the most interesting part at the moment and would love to have an alternative leaderboard with harnesses.
  [-]
  - sanxiyn 6 hours ago
    There is. Official leaderboard is without harness, and community leaderboard is with harness. Read ARC-AGI-3 Technical Paper for details.
    [-]
    - falcor84 6 hours ago
      I went through the technical paper again, and while they explain why they decided against the harness, I disagree with them - my take is that if harnesses are overfitting, then they should be penalized on the hidden test set.
      Anyway, searching both in ARC-AGI's paper and website and directly on kaggle, I failed to find a with-harness leaderboard; can you please give the link?
      [-]
      - sanxiyn 6 hours ago
        Here it is: https://arcprize.org/leaderboard/community
  - steve_adams_86 5 hours ago
    I'm so into harness development right now. Once it clicked that harnesses can bring more safety and determinism to LLMs, I started to wonder where I'd need that and why (vs MCP or just throwing Claude Code at everything), and my brain gears have been turning endlessly since then. I'd love to see more of what people do with them. My use cases are admittedly lame and boring, but it's such a fun paradigm to think and develop around.
    [-]
    - j_bum 3 hours ago
      Could you point me to some resources to learn about harnesses? I’d love to hear an example of a use case you’re thinking of.
gslin 3 hours ago
https://en.wikipedia.org/wiki/Goodhart's_law
> Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
[-]
modeless 5 hours ago
On the public set of 25 problems. These are intended for development and testing, not evaluation. There are 110 private problems for actual evaluation purposes, and the ARC-AGI-3 paper says "the public set is materially easier than the private set".
[-]
- SchemaLoad 5 hours ago
  Benchmarks on public tests are too easy to game. The model owners can just incorporate the answers in to the dataset. Only the private problems actually matter.
  [-]
  - sanxiyn 5 hours ago
    In this case the code is public and you can see they are not cheating in that sense.
    [-]
    - Davidzheng 4 hours ago
      I agree it's not cheating that restricted sense. But I'm not really convinced that it can't be cheating in a more general sense. You can try like 10^10 variations of harnesses and select the one that performs best. And probably if you then look at it, it will not look like it's necessarily cheating. But you have biased the estimator by selecting the harness according to the value.
    - SchemaLoad 5 hours ago
      Once the model has seen the questions and answers in the training stage, the questions are worthless. Only a test using previously unseen questions has merit.
      [-]
      - lambda 5 hours ago
        They aren't training new models for this. This is an agent harness for Opus 4.6.
        [-]
        measurablefunc 4 hours ago
        All traffic is monitored, all signal sources are eventually incorporated into the training set in one way or another. The person you're responding to is correct, even a single API call to any AI provider is sufficient to discount future results from the same provider.
        [-]
        raincole 3 hours ago
        You live in a conspiracy world. Those AI providers don't update the models that fast. You can try ask them solve ARC-AGI-3 without harness and see them struggle as yesterday yourself.
        [-]
        measurablefunc 2 hours ago
        Which part is the conspiracy? Be as concrete as possible.
        stale2002 4 hours ago
        ok! So if someone uses an existing, checkpointed, open source model then the answer is yes the results are valid and it doesn't matter that the tests are public.
        [-]
        measurablefunc 4 hours ago
        Yes, assuming the checkpoint was before the announcement & public availability of the test set.
    - DetroitThrow 3 hours ago
      The harness seems extremely benchmark specific that gives them a huge advantage over what most models can use. This isn't a qualifying score for that reason.
      Here is the ARC-AGI-3 specific harness by the way - lots of challenge information encoded inside: https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...
bytesandbits 2 hours ago
we constantly underestimate the power of inference scaffolding. I have seen it in all domains: coding, ASR, ARC-AGI benchmarks you name it. Scaffolding can do a lot! And post-training too. I am confident our currently pre-trained models can beat this benchmark over 80% with the right post-training and scaffolding. That being said I don't think ARC-AGI proves much. It is not a useful task at all in the wild. it is just a game; a strange and confusing one. For me this is just a pointless pseudo-academic exercise. Good to have, but by no means measures intelligence and even less utility of a model.
[-]
- nubg 2 hours ago
  what exactly does scaffolding mean in this context? genuine question
  [-]
  - bytesandbits 55 minutes ago
    anything that doesn't touch the model parameters at all once it has been compiled. for example, in streaming ASR of an encoder-decoder you can get gains in accuracy just by enhancing the encoder-decoder orchestration and ratio, frequency of fwd passes, dynamically adjusting the length of rolling windows (if using full attention). Prompting would be part of this too, including few-shot examples. Decoding strategy is also part of this (top-k, nucleus, speculative decoding, greedy or anything else). Applying signal processing or any kind of processing to the input before getting it into the model, or to the output. There are a lot of things you can do.
esafak 7 hours ago
Anybody used this Agentica of theirs?
AbanoubRodolf 5 hours ago
[dead]