Show HN: ArXiv-txt, LLM-friendly ArXiv papers

(arxiv-txt.org)

20 points | by jerpint 1 day ago

4 comments

  • lgas 1 day ago
    It just extracts the abstracts?
    • jerpint 1 day ago
      For now, yes - abstracts and other metadata
      • rrekaf 18 hours ago
        Do you plan on adding descriptions of figures and tables?
        • jerpint 14 hours ago
          I'll probably focus on getting the text out of the papers first; figures might be a good next step after that.
  • sbpost 1 day ago
    The example you give doesn't seem to work - the raw txt does not have authors.
    • jerpint 14 hours ago
      You're right - I hadn't noticed! I've fixed it now, thanks for pointing it out.
  • jmartin2683 1 day ago
    This would be awesome wrapped in an MCP server/tool call :)
    • jerpint 22 hours ago
      Whoa - I haven't played with MCP yet - might be a good first project!
  • westurner 1 day ago
    If you train an LLM on only formally verified code, it should not be expected to generate formally verified code.

    Similarly, if you train an LLM only on published ScholarlyArticles [or only their abstracts], it should not be expected to generate publishable or true text.

    Traceability for Retraction would be necessary to prevent lossy feedback.