Why are cancer guidelines stuck in PDFs?

(seangeiger.substack.com)

224 points | by huerne 18 hours ago

35 comments

  • prepend 17 hours ago
    I’d rather have the pdf than a custom tool. Especially considering the tool will be unique to the practice or emr. And likely expensive to maintain.

    PDFs suck in many ways but are durable and portable. If I work with two oncologists, I use the same pdf.

    The author means well but his solution will likely be worse because only he will understand it. And there’s a million edge cases.

    • slaucon 16 hours ago
      Hey author here! Appreciate the feedback! Agreed on importance of portability and durability.

      I'm not trying to build this out or sell it as a tool to providers. Just wanted to demo what you could do with structured guidelines. I don't think there's any reason this would have to be unique to a practice or emr.

      As sister comments mentioned, I think the ideal case here would be if the guideline institutions released the structured representations of the guidelines along with the PDF versions. They could use a tool to draft them that could export in both formats. Oncologists could use the PDFs still, and systems could lean into the structured data.

      • killjoywashere 15 hours ago
        The cancer reporting protocols from the College of American Pathologists are available in structured format (1). No major laboratory information system vendor implements them properly, and their implementation errors cause some not-insignificant problems with patient care (oncologists calling the lab asking for clarification, etc). This has pushed labs to make policies disallowing the use of those modules and individual pathologists to revert to their own non-portable templates in Word documents.

        The medical information systems vendors are right up there with health insurance companies in terms of their investment in ensuring patient deaths. Ensuring. With an E.

        (1) https://www.cap.org/protocols-and-guidelines/electronic-canc...

        • jjmarr 11 hours ago
          It doesn't look like the XML data is freely accessible.

          If I could get access to this data as a random student on the internet, I'd love to create an open source tool that generates an interactive visualization.

        • all2 15 hours ago
          > The medical information systems vendors are right up there with health insurance companies in terms of their investment in ensuring patient deaths. Ensuring. With an E.

          Can you expand on this?

          • righthand 13 hours ago
            Medical information system vendors only care about making a profit, not implementing actual solutions. The discrepancies between systems can lead to bad information which can cost people their life.
            • ethbr1 2 hours ago
              As an analogy, imagine if the consequence of Oracle doing Oracle-as-usual things was worse medical outcomes. But they did them anyway for profit.

              That's basically medical information system vendors.

              The fact that the US hasn't pushed open source EMRs through CMS is insane. It's literally the perfect problem for an open solution.

              • caboteria 1 hour ago
                It's worse than that. VistA is a world-class open source EMR that the VA has been trying to kill for decades.
        • zo1 10 hours ago
          People could potentially properly implement them if they were open and available:

          "Contact the CAP for more information about licensing and using the CAP electronic Cancer Protocols for cancer reporting at your institution."

          This stinks of the same gate-keeping that places like NIST and ISO do, charging you for access to their "standards".

          • prepend 5 hours ago
            Aren’t all NIST standards free as they are a government body?
          • fl0id 7 hours ago
            For liability reasons alone, you cannot just have random people working on health/lab stuff and the requisite vendors have access to these standards.
            • joshuaissac 1 hour ago
              According to what killjoywashere said, the vendors do not want to implement these standards. So if CAP wants the standards to be relevant, they should release them for random people to implement.
        • PoignardAzur 11 hours ago
          I mean, you're attributing malice, but it could just be that reliably implementing the formats is a really really hard problem?
          • TheAceOfHearts 11 hours ago
            How about fixing the format? Something that is obviously broken and resulting in patient deaths should really be considered a top priority. It's either malice or massive incompetence. If these protocols were open there would definitely be volunteers willing to help fix it.
            • prepend 5 hours ago
              I think there are more options than malice or incompetence. My theory is difficulty.

              There’s multiple countries with socialized medicine and no profit motive and it’s still not solved.

              I think it’s just really complex with high negative consequences from a mistake. It takes lots of investment with good coordination to solve and there’s an “easy workaround” with pdfs that distributes liability to practitioners.

              • ethbr1 1 hour ago
                Healthcare suffers from strict regulatory requirements, underinvestment in organic IT capabilities, and huge integration challenges (system-to-system).

                Layering any sort of data standard into that environment (and evolving it in a timely manner!) is nigh impossible without an external impetus forcing action (read: government payer mandate).

            • PoignardAzur 7 hours ago
              You seem to think that the default assumption is that fixing the format is easy/feasible, and I don't see why. Do you have domain knowledge pointing that way?

              It's a truism in machine learning that curating and massaging your dataset is the most labor-intensive and error-prone part of any project. I don't see why that would stop being true in healthcare just because lives are on the line.

            • mort96 9 hours ago
              Incompetence at this level is intentional, it means someone doesn't think they'll see RoI from investing resources into improving it. Calling it malice is appropriate I feel.
              • layer8 2 hours ago
                If there is no ROI, investing further resources would be charity work. I don’t think it’s accurate to call a company not doing so malicious.
                • WitCanStain 2 hours ago
                  Not actively malicious perhaps, but prioritising profits over lives is evil. Either you take care to make sure the systems you sell lead to the best possible outcomes, or you get out of the sector.
                  • layer8 2 hours ago
                    The company not existing at all might be worse though? I think it’s too easy to make blanket judgments like that from the outside, and it would be the job of regulation to counteract adverse incentives in the field.
      • prepend 5 hours ago
        I believe you have good intentions, but someone would need to build it out and sell it. And it requires lots of maintenance. It’s too boring for an open source community.

        There’s a whole industry that attempts to do what you do and there’s a reason why protocols keep getting punted back to pdf.

        I agree it would be great to release structured representations. But I don’t think there’s a standard for that representation, so it’s kind of tricky as who will develop and maintain the data standard.

        I worked on a decision support protocol for Ebola and it was really hard to get code sets released in Excel. Not to mention the actual decision gates in a way that is computable.

        I hope we make progress on this, but I think the incentives are off for the work to make the data structures necessary.

      • Dalewyn 16 hours ago
        >Agreed on importance of portability and durability.

        I think "importance" is understating it, because permanent consistency is practically the only reason we all (still) use PDFs in quite literally every professional environment as a lowest common denominator industrial standard.

        PDFs will always render the same, whether on paper or a screen of any size connected to a computer of any configuration. PDFs will almost always open and work given Adobe Reader, which these days is simply embedded in Chrome.

        PDFs will almost certainly Just Work(tm), and Just Working(tm) is a god damn virtue in the professional world because time is money and nobody wants to be embarrassed handing out unusable documents.

        • abtinf 15 hours ago
          PDFs generally will look close enough to the original intent that they will almost always be usable, but will not always render the same. If nothing else, there are seemingly endless font issues.
          • lstamour 15 hours ago
            In this day and age that seems increasingly like a solved problem to most end users, often a client-side issue or using a very old method of generating a PDF?

            Modern PDF supports font embedding of various kinds (legality is left as an exercise to the PDF author) and supports 14 standard font faces which can be specified for compatibility, though more often document authors probably assume a system font is available or embed one.

            There are still problems with the format as it foremost focuses on document display rather than document structure or intent, and accessibility support in documents is often rare to non-existent outside of government use cases or maybe Word and the like.

            A lot of usability improvements come from clients that make an attempt to parse the PDF to make the format appear smarter. macOS Preview can figure out where columns begin and end for natural text selection, Acrobat routinely generates an accessible version of a document after opening it, including some table detection. Honestly creative interpretation of PDF documents is possibly one of the best use cases of AI that I’ve ever heard of.

            While a lot about PDF has changed over the years the basic standard was created to optimize for printing. It’s as if we started with GIF and added support to build interactive websites from GIFs. At its core, a PDF is just a representation of shapes on a page, and we added metadata that would hopefully identify glyphs, accessible alternative content, and smarter text/line selection, but it can fall apart if the PDF author is careless, malicious or didn’t expect certain content. It probably inherits all the weirdness of Unicode and then some, for example.

    • layer8 2 hours ago
      I agree. However, since the PDF format supports structured data, one could in principle have it both ways, within a single file.
    • Spooky23 14 hours ago
      I think there’s value if it can scale down.

      Community oncologists have limited technology resources as compared to a national cancer center. If we can make their lives easier, it can only be a good thing.

      That said, I like published documents like PDFs - systems usually make it hard to compare the June release with the September release.

    • KPGv2 16 hours ago
      You say this, but on the other hand, the author alleges that the places that use these custom tools achieve better outcomes. You didn't address this point one way or the other.

      Do you think this is a completely fabricated non-explanation? It's not like the link says "the worst places use these custom tools."

    • crazygringo 16 hours ago
      Exactly. The PDF's work. They won't break. You can see all the information with your own eyes. You can send them by e-mail.

      A wizard-type system hides most of the information from you, it might have bugs you aren't aware of, if you want to glance at an alternative path you can't, it's going to be locked into registered users, the system can go down.

      I think much more intelligent computer systems are the future in health care, but I doubt the way to start is with yet another custom tool designed specifically for cancer guidelines and nothing else.

      • crabmusket 13 hours ago
        > it's going to be locked into registered users, the system can go down

        I didn't see anything in the screenshots presented that wouldn't be doable in a single HTML file containing the data, styles and scripts?

        This is a countercultural idea but it fits so many use cases; it's a tragedy we don't do this more often. The two options are either PDF or SaaS.
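A minimal sketch of the single-file idea above: a Python script that writes one HTML document with the guideline data, styles, and script all inline, so it can be emailed or opened from disk with no server. The guideline content, the `build_page` name, and the `__TITLE__`/`__DATA__` placeholders are all invented for illustration; this is not the article author's tool.

```python
import json
from html import escape

# Hypothetical guideline content, invented for illustration only.
GUIDELINE = {
    "title": "Example guideline",
    "steps": [
        {"id": "start", "text": "Initial workup",
         "options": {"low risk": "observe", "high risk": "biopsy"}},
        {"id": "biopsy", "text": "Perform biopsy", "options": {"done": "observe"}},
        {"id": "observe", "text": "Observe and follow up", "options": {}},
    ],
}

PAGE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>__TITLE__</title>
<style>body { font-family: sans-serif; max-width: 40em; margin: 2em auto; }</style>
</head>
<body>
<h1>__TITLE__</h1>
<div id="app"></div>
<script id="data" type="application/json">__DATA__</script>
<script>
const steps = JSON.parse(document.getElementById("data").textContent).steps;
const byId = Object.fromEntries(steps.map((s) => [s.id, s]));
function show(id) {
  const app = document.getElementById("app");
  app.textContent = "";
  const p = document.createElement("p");
  p.textContent = byId[id].text;
  app.appendChild(p);
  for (const [label, target] of Object.entries(byId[id].options)) {
    const button = document.createElement("button");
    button.textContent = label;
    button.onclick = () => show(target);
    app.appendChild(button);
  }
}
show(steps[0].id);
</script>
</body>
</html>
"""


def build_page(guideline):
    """Return one self-contained HTML document: data, styles and script inline."""
    # Escape "</" so text inside the data cannot close the <script> element early.
    data = json.dumps(guideline).replace("</", "<\\/")
    return PAGE.replace("__TITLE__", escape(guideline["title"])).replace("__DATA__", data)
```

Writing the result of `build_page(GUIDELINE)` to a `.html` file gives a document that works from a `file://` URL, since it loads nothing external. Whether that is durable enough to stand in for a PDF is a separate question.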

      • ajsnigrutin 4 hours ago
        > The PDF's work. They won't break.

        Not just that, PDFs are one of the few formats, where i'm willing to bet my own money, that they'll still work in 10 or 20 years.

        Even basic html has changed, layouts look different depending on many factors, and even the <blink>-ing doesn't work anymore.

    • ahardison 15 hours ago
      Totally valid concerns. If you have time, I would like to show you my solution to get your thoughts as I believe I have found ways to mitigate all of your concerns. Currently I am using STCC (Schmitt-Thompson Clinical Content). I have sent you some of the PDFs we use for testing.
    • akoboldfrying 17 hours ago
      The author is proposing that the DAG representation be in addition to the PDF:

      >The organizations drafting guidelines should release them in structured, machine-interpretable formats in addition to the downloadable PDFs.

      My opinion: Ideally the PDF could be generated from the underlying DAG -- that would give you confidence that everything in the PDF has been captured in the DAG.
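To make the "generate the document from the DAG" suggestion concrete, here is a rough sketch. It assumes a made-up graph format (a dict of nodes with labelled edges) and emits plain text rather than PDF; a real pipeline would hand that text to whatever PDF renderer it uses. The node names and the `render` function are hypothetical.

```python
# Hypothetical guideline graph, invented for illustration; not real clinical content.
GRAPH = {
    "start": {"text": "Initial workup",
              "next": {"low risk": "observe", "high risk": "biopsy"}},
    "biopsy": {"text": "Perform biopsy",
               "next": {"negative": "observe", "positive": "treat"}},
    "observe": {"text": "Observe and follow up", "next": {}},
    "treat": {"text": "Begin treatment", "next": {}},
}


def render(graph, root):
    """Render every node reachable from root exactly once, as numbered sections."""
    order, seen, queue = [], {root}, [root]
    while queue:  # breadth-first walk from the root
        node = queue.pop(0)
        order.append(node)
        for target in graph[node]["next"].values():
            if target not in seen:
                seen.add(target)
                queue.append(target)
    number = {node: i + 1 for i, node in enumerate(order)}
    lines = []
    for node in order:
        lines.append(f"{number[node]}. {graph[node]['text']}")
        # Each outgoing edge becomes a cross-reference to the target's section.
        for condition, target in graph[node]["next"].items():
            lines.append(f"   - if {condition}: go to section {number[target]}")
    return "\n".join(lines)
```

Because every section and cross-reference comes from the graph, the document cannot contain a branch the graph lacks, which is the confidence property the comment is after. A node with several parents ("observe" here) is printed once and referenced by number.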

      • maxerickson 16 hours ago
        You could generate the document from the graph and then attach it as data.
        • JumpCrisscross 8 hours ago
          > could generate the document from the graph and then attach it as data

          Much easier for doctors to draft PDFs than graphs.

    • zahlman 15 hours ago
      It would, I imagine, be much easier to generate a PDF from the tool's internal flowchart representation than the other way around.
  • pcrh 17 hours ago
    The fundamental idea here is that doctors find it difficult to ensure that their recommendations are actually up-to-date with the latest clinical research.

    Further, that by virtue of being at the centre of action in research, doctors in prestige medical centres have an advantage that could be available to all doctors. It's a pretty important point, sometimes referred to as the dissemination of knowledge problem.

    Currently, this is best approached by publishing systematic reviews according to the Cochrane Criteria [0]. Such reviews are quite labour-intensive and done all too rarely, but are very valuable when done.

    One aspect of such reviews, when done, is how often they discard published studies for reasons such as bias, incomplete datasets, and so forth.

    The approach described by Geiger in the link is commendable for its intentions but the outcome will be faced with the same problem that manual systematic reviews face.

    I wonder if the author considered including rules-based approaches (e.g. Cochrane guidelines) in addition to machine learning approaches?

    [0] https://training.cochrane.org/handbook

    • slaucon 16 hours ago
      Hey author here--Cochrane reviews are great.

      NCCN guidelines and Cochrane Reviews serve complementary roles in medicine - NCCN provides practical, frequently updated cancer treatment algorithms based on both research and expert consensus, while Cochrane Reviews offer rigorous systematic analyses of research evidence across all medical fields with a stronger focus on randomized controlled trials. The NCCN guidelines tend to be more immediately applicable in clinical practice, while Cochrane Reviews provide a deeper analysis of the underlying evidence quality.

      My main goal here was to show what you could do with any set of medical guidelines that was properly structured. You can choose any criteria you want.

    • liontwist 15 hours ago
      > doctors find it difficult to ensure that their recommendations are actually up-to-date with the latest clinical research

      Doctors care about this as much as software engineers care about the latest computer science research. A few curious ones do. But the general attitude is they already did tough years of school so they don’t have to anymore.

      • refurb 14 hours ago
        I worked with oncologists and this isn’t true.

        Oncology has a rapidly changing treatment landscape and it’s common for oncologists to be discussing the latest paper that has come out.

        If you’re an oncologist and not keeping up with the literature you’re going to be out of date in your decisions in about 6 months from graduation.

        • liontwist 10 hours ago
          Funny enough that last paragraph is also said of software engineers too. Neither are true.
          • mort96 9 hours ago
            Yeah, non-programmers seem to think everything is changing so quickly all the time yet here I am writing in a 40 year old language against UNIX APIs from the 70s ¯\_(ツ)_/¯
  • easytigerm 3 hours ago
    The OP will be pleased to know that they’re not the first person to think of this idea. Searching for “computable clinical guidelines” will unearth a wealth of academic literature on the subject. A reasonable starting point would be this paper [1]. Indeed people have been trying since the 70s, most notably with the famous MYCIN expert system. [2]

    As people have alluded to and the history of MYCIN shows, there’s a lot more subtlety to the problem than appears on the surface, with a whole bunch of technical, psychological, sociological and economic factors interacting. This is why cancer guidelines are stuck in PDFs.

    Still, none of that should inhibit exploration. After all, just because previous generations couldn’t solve a problem doesn’t mean that it can’t be solved.

    [1] https://pmc.ncbi.nlm.nih.gov/articles/PMC10582221/

    [2] https://www.forbes.com/sites/gilpress/2020/04/27/12-ai-miles...

    • adolph 2 hours ago
      To the author:

      The above is a high quality comment with worthy areas to study.

      Additionally I would draw your attention to NCCN’s “Developer API” which is not interesting technologically but how it reflects the IP landscape.

      https://www.nccn.org/developer-api

  • queuebert 54 minutes ago
    As a cancer researcher myself, I'd point out that some branches of the decision trees in the NCCN guidelines are based on studies in which multiple options were not statistically significantly different, but all were better than the placebo. In those cases, the clinician is free to use other factors to decide which arm to take. A classic example of this is surgery vs radiation for prostate cancer. Both are roughly equally effective, but very different experiences.
  • gcanyon 2 hours ago
    The real question is: why is everything stuck in PDFs, and the more important meta-question is: why don't PDFs support meta-data (they do, somewhat). So much of what we do is essentially machine-to-machine, but trapped in a format designed entirely for human-to-human (also lump in a bit of machine-to-human).

    Adobe has had literally a third of a century to recognize this need and address it. I don't think they're paying attention :-/

    • layer8 2 hours ago
      PDFs can have arbitrary files embedded, like XML and JSON. It also supports a logical structure tree (which doesn’t need to correspond to the visual structure) which can carry arbitrary attributes (data) on its structure elements. And then there’s XML Forms. You can really have pretty much anything machine-processable you want in a PDF. One could argue that it is too flexible, because any design you can come up with that uses those features for a particular application is unlikely to be very interoperable.
    • queuebert 51 minutes ago
      PDFs are essentially compressed Postscript, which is Turing complete, so a PDF in theory can do anything you want.
  • londons_explore 17 hours ago
    Decision trees work for making decisions...

    But they don't work as well as other decisionmaking techniques... Random forests, linear models, neural nets, etc. are all decision making techniques at their core.

    And decision trees perform poorly for complex systems where lots of data exists - ie. human health.

    So why are we using a known-inferior technique simply because it's easier to write down in a PDF file, reason about in a meeting, or explain to someone?

    Shouldn't we be using the most advanced mathematical models possible with the highest 'cure' probability, even if they're so complex no human can understand them?

    • epcoa 16 hours ago
      > complex systems where lots of data exists

      Not a lot of high quality data exists for human health. Clinical guidelines for many diseases are built around surprisingly scant evidence many times.

      > even if they're so complex no human can understand them?

      That’ll be wonderful to explain in court when they figure out it was just data smuggling or whatever other bias.

      • epistasis 14 hours ago
        In cancer there's an abundance of clinical trials with high quality data, but it is all very complex in terms of encoding what the clinical trial actually encoded.

        Go to a clinical cancer conference and you will see the grim reality of 10,000s of people contributing to the knowledge discovery process with their cancer care. There is an inverse relationship between the number of people in a trial and the amount of risk that goes into that trial, but it is still a massive amount of data that needs to be codified into some sensible system, and it's hard enough for a person to do it.

        > That’ll be wonderful to explain in court when they figure out it was just data smuggling or whatever other bias.

        What do you mean by this? I'm not aware of any data smuggling that has ever happened in a clinical trial. The "bias" is that any research hypothesis comes from the fundamentally biased position of "I think the data is telling me this" but I've seen very little bias of truly bad hypotheses in cancer research like those that have dominated, say Alzheimer's research. Any research malfeasance should be prosecuted to the fullest, but I don't think cancer research has much of it. This was a huge scandal, but I don't think it pointed to much in the way of bad research in the end:

        https://www.propublica.org/article/doctor-jose-baselga-cance...

        • epcoa 12 hours ago
          By smuggling and bias I meant in an ML model. Smuggling was a bit informal, but referring to models overfit on unintended features or artifacts.
          • londons_explore 5 hours ago
            but we have well established ways to deal with those... test/validation sets, n-fold validation, etc.

            Even if there was some overfitting or data contamination that was undetected, the result would most probably still be better than a hand-made decision tree over the same data...
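For readers unfamiliar with the n-fold validation mentioned above, a minimal sketch of how the folds are formed, in plain Python. The function name is made up, and the split is deliberately unshuffled; real use would normally shuffle first, and with clinical data would likely need to keep all records from one patient in the same fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size
```

Each sample lands in exactly one validation fold, so a model is always scored on data it was not fitted to. That detects overfitting to the training rows; it does not by itself detect a model keying on an artifact present in every fold, which is the concern raised upthread.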

            • epcoa 1 hour ago
              Ok, until you can sue the AI you need to find a doctor ok putting their license behind saying “I have no idea how this shiny thing works”
            • wizzwizz4 3 hours ago
              Hand-made decision trees are open to inspection, comprehension, and adaption. There is no way to adapt an opaque ML model to new findings / an experimental treatment except by producing a new model.
    • s1artibartfast 16 hours ago
      Data generation is usually based on decision tree models as well, so they match the resolution of the available data.

      The practice of real world medicine often interpolates between these data points.

    • wizzwizz4 17 hours ago
      Models too complex for humans to understand don't, in practice, have a high 'cure' probability.
  • troysk 9 hours ago
    I find the web (HTML/CSS) the most open format for sharing. PDFs are hard to consume on smaller devices and much harder for machines to read. I am working on a feature at Jaunt.com to convert PDFs to HTML. It shows up as a reader mode icon. Please try it out and see if it is good enough. I personally think we need to do a much better job. https://jaunt.com
    • ErigmolCt 8 hours ago
      PDFs can be notoriously difficult to work with on smaller devices
  • ramoz 4 hours ago
    Cool tool. From my experience the PDF was easy to traverse.

    The hardest part for me was understanding that treatment options could differ (i.e. between the _top_ hospitals treating the cancer). And there were a few critical options to consider. NCCN paths were traditional, but there are in-between decisions to make or alternative paths. ChatGPT was really helpful in that period. "2nd" opinions are important... but again you ask the top 2 hospitals and they differ in opinion, and any other hospital is typically in one of those camps.

  • gmueckl 2 hours ago
    Software that gives treatment instructions may be a medical device requiring FDA approval. You may be breaking the law if you give it to a medical professional without such approval.
  • upghost 17 hours ago
    It's so much worse than you could possibly imagine. I worked for a healthcare startup working on patient enrollment for clinical oncology trials. The challenges are amazing. Quite frankly it wouldn't matter if the data were in plaintext. The diagnostic codes vary between providers, the semantic understanding of the diagnostic information has different meanings between providers, electronic health records are a mess, things are written entirely in natural language rather than some kind of data structure. Anyone who's worked in healthcare software can tell you way more horror stories.

    I do hope that LLMs can help straighten some of it out, but as anyone who's done healthcare software knows, the problems are not technical, they are quite human.

    That being said one bright spot is we've (my colleagues, not me) made a huge step forward using category theory and Prolog to discover the provably optimal 3+3 clinical oncology dose escalation trial protocol[1]. David gave a great presentation on it at the Scryer Prolog meetup[2] in Vienna.

    It's kind of amazing how in the dark ages we are with medicine. Even though this is the first EXECUTABLE/PROGRAMMABLE SPEC for a 3+3 cancer trial, he is still fighting to convince his medical colleagues and hospital administrators that this is the optimal trial because -- surprise -- they don't speak software (or statistics).

    [1]: https://arxiv.org/abs/2402.08334

    [2]: https://www.digitalaustria.gv.at/eng/insights/Digital-Austri...

    • sebmellen 16 hours ago
      Have you read Jake Seliger’s pieces on oncology clinical trials https://jakeseliger.com/.
      • upghost 15 hours ago
        Oh wow. No, that's heartbreaking. I'll have to read up on this. Reminds me of David explaining the interesting and somewhat surprisingly insensitive language the oncology literature uses towards folks going through this. It's there for historical reasons but slow to change.

        It also shows how important getting dose escalation trials right is. The whole point is finding the balance point where "cure is NOT worse than the disease". A bad dose can be worse than the cancer itself, and conducting the trials correctly is extremely important... and this really underscores the human cost. Truly heartbreaking :(

    • slaucon 16 hours ago
      This is a fascinating idea!
  • epistasis 15 hours ago
    > With properly structured data, machines should be able to interpret the guidelines. Charting systems could automatically suggesting diagnostic tests for a patient. Alarm bells and "Are you sure?" modals could pop up when a course of treatment diverges from the guidelines. And when a doctor needs to review the guidelines, there should be a much faster and more natural way than finding PDFs

    I have implemented this computerized process twice at two different startups over the past decade.

    I would not want the NCCN to do it.

    The NCCN guidelines are not stuck in PDFs, they are stuck in the heads of doctors.

    Once the NCCN guidelines get put into computerized rules, they start to be guided by those computerized rules, a second influence that takes them away from the fundamental science.

    So while I totally agree that there should be systematization of the rules, it should be entirely secondary and subservient to the best frontier knowledge about cancer, which changes extremely frequently. Annually after every ASCO (major pan-cancer conference) and every disease-specific conference (e.g. the San Antonio breast cancer conference), and occasionally during the year when landmark clinical trials are published, the doctors need to update their knowledge from the latest trials and their continuing medical education, which is an entire body of knowledge that is complementary to the edges of what the NCCN publishes.

    Having spanned both computer science and medicine for my entire career, I trust doctors to be able to update their rules far faster than the programmers and databases.

    Please do not get the NCCN guidelines stuck in spaghetti code that a few programmers understand, rather than open in PDFs with lots of links that anybody can go and chase after.

    Edit: though give me a week digesting this article and I may change my mind. Maybe the NCCN should be standardizing clinical variables enough that the guidelines can trivially be turned into rules. That would require that the hypotheses of a clinical trial fit into those rules, however, and that's why I need a week of digestion to see if it may even be possible...

  • whiterock 6 hours ago
    Why can this not just be a website? Isn‘t this a perfect use case for HTML and hyperlinks?
  • grumbel 10 hours ago
    Same reason why datasheets are still PDFs. It's a reliable, long lasting and portable format. And while it's kind of ridiculous that we are basically emulating paper, no other format fills that niche.

    It's the niche HTML should be able to fill, since that was its original purpose, but isn't, since all focus over the last 20 or so years has been on everything else, but making HTML a better format for information exchange.

    Trivial things like bundling up a complex HTML document into a single file don't have standard solutions. Cookies stop working when you are dealing with file:// URLs and a lot of other really basic stuff just doesn't work or doesn't exist. Instead you get offshoot formats like ePUB that are mostly HTML, but not actually supported by most browsers.

  • osmano807 17 hours ago
    I know it's not the same, but in many areas we have this "follow the arrows" system in many guidelines. For some examples, see the EULAR guidelines with their flowcharts for treatments and also AO Surgery Reference with a graphical approach to selecting treatments based on fracture pattern, available materials and skill set.

    I think that's a logical and necessary step to join medical reasoning and computer helpers, we need easier access to new information and more importantly to present clinical relevant facts from the literature in a way that helps actual patient care decision making.

    I'm just not too sure we can have generic approaches to all specialties, but it’s nice seeing efforts in this area.

  • guipsp 2 hours ago
    I have to ask: did the author contact any medical professional when writing this article? Is this really something that needs to be fixed, and will his solution actually fix it?

    It seems to me that ignoring the guideline is a physician decision, and when it is ignored (for good or for bad), it is not because the guidelines are not available in json.

  • LorenPechtel 17 hours ago
    The real problem is that the guidelines are written for humans in the first place. Workarounds like this shouldn't be needed; going from a machine-friendly layout to a human-friendly one is usually quite easy.

    And from what he says a decision tree isn't really the right model in the first place. What about no tree, just a heap of records in a SQL database. You do a query on the known parameters, if the response comes back with only one item in the treatment column you follow it. If it comes back with multiple items you look at what would be needed to distinguish them and do the test(s).
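
    A minimal sketch of that query idea, assuming a single flat table. The schema, column names, and rows below are hypothetical placeholders invented for illustration, not anything taken from a real guideline:

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
# They are NOT clinical guidance.
COLUMNS = ("stage", "marker")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE guideline (stage TEXT, marker TEXT, treatment TEXT)")
conn.executemany(
    "INSERT INTO guideline VALUES (?, ?, ?)",
    [
        ("I", "positive", "treatment A"),
        ("I", "negative", "treatment B"),
        ("II", "positive", "treatment C"),
        ("II", "negative", "treatment C"),
    ],
)


def lookup(known):
    """Return (treatments, columns_still_needed) for the known parameters."""
    for col in known:
        if col not in COLUMNS:  # only whitelisted column names reach the SQL text
            raise ValueError(f"unknown parameter: {col}")
    where = " AND ".join(f"{col} = ?" for col in known) or "1 = 1"
    rows = conn.execute(
        f"SELECT stage, marker, treatment FROM guideline WHERE {where}",
        tuple(known.values()),
    ).fetchall()
    treatments = sorted({row[2] for row in rows})
    if len(treatments) <= 1:
        return treatments, []  # unambiguous (or no match): nothing more to test
    # Several candidates remain: report the unknown columns whose values
    # differ among the matching rows, i.e. the tests worth doing next.
    needed = [
        col
        for i, col in enumerate(COLUMNS)
        if col not in known and len({row[i] for row in rows}) > 1
    ]
    return treatments, needed
```

    With these placeholder rows, `lookup({"stage": "II"})` yields a single treatment, while `lookup({"stage": "I"})` yields two candidates plus `["marker"]` as the parameter that would distinguish them.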

  • mav3ri3k 5 hours ago
    Excellent read. This consolidated and catalyzed my spurious thoughts around personal information management. The input is generally markdown/pdf but over time highly useless for a single person. There would be value if it is passed through such a system over time.
  • gibsonf1 2 hours ago
    The idea of adding hallucination to medical advice seems very dangerous.
  • a1o 17 hours ago
    I parsed some mind maps that were constructed with a tool and exported as pdfs (original sources were lost a long time ago) and I used python with tesseract for the text and opencv and it worked alright. I am curious why the author went with LLMs, but I guess with the mentioned amount of data it wasn't hard to recheck everything later.
  • schu 9 hours ago
    Would love to take a look at the code, in particular at how the data extraction and transformation is implemented.

    As a side note, the German associations of oncology publish their guidelines here (HTML and SVG graphs): https://www.onkopedia.com/de/onkopedia/guidelines

  • breytex 8 hours ago
    Shouldn't the end goal be just to train an ai on all the pdfs and give the doctors an interface to plug in all the details and get a treatment plan generated by that ai?

    Working on the data structure feels like an intermediate solution on the way to that ai which is not really necessary. Or am I missing something?

    • prmoustache 7 hours ago
      I am not sure patients and doctors are interested in adding hallucination generators to the list of their problems.
    • fl0id 7 hours ago
      Your end goal maybe. Not patients' or doctors' goal for sure.
    • pjc50 7 hours ago
      How does your treatment AI get its liability insurance?
  • xh-dude 12 hours ago
    The author makes a great case for machine-interpretable standards but there is an enormous amount of work out there devoted to this, it’s been a topic of interest for decades. There’s so much in the field that a real problem is figuring out what solutions match the requirements of the various stakeholders, more than identifying the opportunities.
  • inopinatus 17 hours ago
    > The whole set of guidelines for a type of cancer breaks down into a few disjointed directed graphs

    Nothing undermines medicine quite so thoroughly as yet another astronaut trying to force it into a data structure.

    • prepend 17 hours ago
      Comically, I worked in this space and initially tried to get decision support working with data structures and code sets and such.

      I ended up only really contributing version numbers to the pdf. So at least people knew they had the latest and same versions. And that took a year, to get versions added to guideline pdfs.

      • johnisgood 17 hours ago
        That is wild, one would think versioning is extremely important. They tend to just put the timestamp in the filename (sometimes), which I guess is better than nothing.

        Don't signed PDFs include a timestamp, however?

        • prepend 5 hours ago
          Getting it in the file name was kind of easy. But I meant adding it visually in the pdf guidance so readers could tell. Just numbers in the lower left corner. Or maybe right.

          The guideline was available via url so the filename couldn’t change.

  • rmrfchik 9 hours ago
    Because writers don't think about readers. PDF is one of the worst formats for science/technical info, and yet here we are. I've dumped a lot of papers from arxiv because they were formatted as 2-column, non-zoomable PDFs.
  • noonanibus 17 hours ago
    Forgive me if I'm mistaken, but isn't this exactly what the FHIR standard is meant to address? Not only does it enable global inter-health communication using a standardized resource, but it's already adopted in several national health services, including (though not broadly) America. Is this not simply a reimplementation, but without the broad iterations of HL7?
    • nradov 14 hours ago
      Right, it would make more sense to use HL7 FHIR (possibly along with CQL) as a starting point instead of reinventing the wheel. Talk to the CodeX accelerator about writing an Implementation Guide in this area. The PlanDefinition resource type should be a good fit for modeling cancer guidelines.

      https://codex.hl7.org/

      https://www.hl7.org/fhir/plandefinition.html

      • joshuakelly 12 hours ago
        This is the comment I was looking for.

        You would aim to use CQL expressions inside of a PlanDefinition, in my estimation. This is exactly what the CDS Connect project from AHRQ (part of HHS) aims to create / has created. They publish freely accessible computable decision support artifacts here: https://cds.ahrq.gov/cdsconnect/repository

        When they are fully computable, they are FHIR PlanDefinitions (+ other resources like Questionnaire, etc) and CQL.

        Here's an example of a fully executable Alcohol Use Disorder Identification Test: https://cds.ahrq.gov/cdsconnect/artifact/alcohol-screening-u...

        There's so much other infrastructure around the EHR here to understand (and take advantage of). I think there's a big opportunity in proving that multimodal LLMs can reliably generate these artifacts from other sources. It's not the LLM actually being a decision support tool itself (though that may well be promising), but rather the ability to generate standardized CDS artifacts in a highly scalable, repeatable way.

        Happy to talk to anyone about any of these ideas - I started exactly where OP was.

        • osmano807 2 hours ago
          I downloaded and opened a CDS for osteoporosis from the link (a disease in my specialty). I need an API key to view what a "valueset" entails, so in practice I couldn't assess whether the recommendation aligns with clinical practice. Nor does the CQL provided include any scientific references (even a textbook or a weak recommendation from a guideline would be sufficient; I don't think the algorithm should be the primary source of the knowledge).

          I tried to see if HL7 was approachable for small teams, I personally became exhausted from reading it and trying to think how to implement a subset of it, I know it's "standard" but all this is kinda unapproachable.

  • tdeck 16 hours ago
    GraphViz has some useful graph schema languages that could be reused for something like this. There's DOT, a delightful DSL, and some kind of JSON format as well. You can then generate a bunch of different output formats and it will lay out the nodes for you.
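
    As a rough sketch of how little it takes to emit DOT, here is a plain-Python writer with no Graphviz bindings. The `to_dot` helper and the node and edge labels are hypothetical, made up for illustration:

```python
def quote(text):
    """Wrap text in double quotes, escaping embedded quotes and backslashes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_dot(edges, name="guideline"):
    """Render (source, target, label) triples as a DOT digraph string."""
    lines = [f"digraph {name} {{"]
    for src, dst, label in edges:
        lines.append(f"  {quote(src)} -> {quote(dst)} [label={quote(label)}];")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical decision steps; placeholders only, not clinical content.
example = to_dot(
    [
        ("Workup", "Option 1", "finding present"),
        ("Workup", "Option 2", "finding absent"),
    ]
)
```

    Saving `example` to a file and running something like `dot -Tsvg file.dot -o file.svg` should then handle the layout.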
    • epistasis 14 hours ago
      Of all the challenges with this, graph layout is beyond trivial. It does not rank as a problem, intellectual challenge, or even that interesting.

      The challenges are all about what goes in the nodes, how to define it, how to standardize it across different institutions, how to compare it to what was tested in two different clinical trials, etc. And if the computerized process goes into clinical practice, how is that node and its contents robustly defined so that a clinician sitting with a patient can instantly understand what is meant by its yes/no/multiple choice question in terms that have been used in recent years at the clinician's conferences?

      Addressing the challenges of constructing the graph requires deep understanding of the terms, deep knowledge of how 10 different people from different cultural backgrounds and training locations interpret highly technical terms with evolving meanings, and deep knowledge of how people could misunderstand language or logic.

      These guidelines codify evolving scientific knowledge where new conceptions of the disease get invented at every conference. It's all at the edge of science where every month and year we have new technology to understand more than we ever understood before, and we have new clinical trials that are testing new hypotheses at the edge of it.

      Getting a nice visual layout is necessary, but in no way sufficient for what needs to be done to put this into practice.

      • graphviz 4 hours ago
        Not ... even that interesting?
        • graphviz 4 hours ago
          Modularity is an excellent way of attacking complex problems. We can all play with algorithms that can carry on realistic conversations and create synthetic 3D movies, because people worked on problems like making transistors the size of 10 atoms, figuring out how processors can predict branches with 99% accuracy, giving neural nets self-attention, deploying inexpensive and ridiculously fast networks all over the planet, and a lot of other stuff.

          For many of us, curing cancer may someday become more important than almost anything else a computer can help us to do. It's just there are so many building blocks to solving truly complex problems; we must respect all that.

  • joshz404 12 hours ago
    You might be interested in checking out the WHO SMART Guidelines. Nothing on cancer yet AFAIK, but it's evolving.
    • rukshn 11 hours ago
      I was also thinking about FHIR and SMART guidelines.

      But the whole system is a mess. And the whole SMART guideline system is controlled by 2-3 gatekeepers who don't listen to any ideas other than their own.

  • hashishen 4 hours ago
    Funny, I just had the thought the other day about how we as a society need to move past the pdf format or even just update it to be editable in traditional document software. The fact that Google Docs will export as a pdf and not have it saved in the documents is proof it's gotten to a point of inefficiency, and that's just one example.
  • bsder 13 hours ago
    Gee, before talking about complex stuff like decision trees, how about we start with something really simple like not requiring a login to download the stupid PDF from NCCN?
  • dogmatism 15 hours ago
    This is all predicated on the guidelines actually reflecting best practices
  • jdlyga 16 hours ago
    PDFs are a universal, machine readable format.
    • GeneralMayhem 14 hours ago
      PDFs are the opposite of machine-readable if you want to do anything other than render them as images on paper or a screen. They're only slightly more machine-readable than binary executables.

      I hate, hate, hate, hate, hate the practice of using PDFs as a system of record. They are intended to be a print format for ensuring consistent typesetting and formatting. For that, I have no quarrel. But so much of the world economy is based on taking text, docx (XML), spreadsheets, or even CSV files, rendering them out as PDFs, and then emailing them around or storing them in databases. They've gone from being simply a view layer to infecting the model layer.

      PDFs are a step better than passing around screenshots of text as images - when they don't literally consist of a single image, that is. But even for reasonably-well-behaved, mostly-text PDFs, finding things like "headers" and "sections" in the average case is dependent on a huge pile of heuristics about spacing and font size conventions. None of that semantic structure exists, it's just individual characters with X-Y coordinates. (My favorite thing to do with people starting to work with PDFs is to tell them that the files don't usually contain any whitespace characters, and then watch the horror slowly dawn as they contemplate the implications.) (And yes, I know that PDF/A theoretically exists, but it's not reliably used, and certainly won't exist on any file produced more than a couple years ago.)

      Now, with multi-modal LLMs and OCR reaching near-human levels, we can finally... attempt to infer structured data back out from them. So many megawatt-hours wasted in undoing what was just done. Structure to unstructure to structure again. Why, why, why.

      As for universality... I mean, sure, they're better than some proprietary format that can only be decrypted or parsed by one old rickety piece of software that has to run in Win95 compatibility mode. But they're not better than JSON or XML if the source of truth is structured, and they're not better than Markdown or - again - XML if the source is mostly text. And there are always warts that aren't fully supported depending on your viewer.
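
      To make the "no whitespace, just characters with X-Y coordinates" point concrete, here is a toy version of the spacing heuristic. It assumes glyphs arrive as hypothetical `(char, x, y, width)` tuples; real PDF libraries expose this differently, and the thresholds are arbitrary:

```python
def glyphs_to_text(glyphs, space_gap=2.0, line_tol=2.0):
    """Rebuild text lines from positioned glyphs using spacing heuristics.

    glyphs: iterable of (char, x, y, width) tuples (a hypothetical input shape).
    space_gap: horizontal gap above which a space is inferred.
    line_tol: maximum baseline difference for glyphs to share a line.
    """
    # Group glyphs into lines, top to bottom (PDF y grows upward).
    lines = []
    for glyph in sorted(glyphs, key=lambda g: -g[2]):
        if lines and abs(lines[-1][0][2] - glyph[2]) <= line_tol:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])

    # Within each line, read left to right and guess where the spaces were.
    out = []
    for line in lines:
        text, end = "", None
        for ch, x, _y, width in sorted(line, key=lambda g: g[1]):
            if end is not None and x - end > space_gap:
                text += " "  # gap wider than the threshold: infer a word break
            text += ch
            end = x + width
        out.append(text)
    return "\n".join(out)


sample = [
    ("N", 0, 100, 6), ("o", 6, 100, 6),    # first word
    ("g", 18, 100, 6), ("o", 24, 100, 6),  # second word, after a 6-unit gap
    ("o", 0, 80, 6), ("k", 6, 80, 6),      # next line down
]
rebuilt = glyphs_to_text(sample)
```

      Every threshold here is a guess that breaks on some document, which is the pile-of-heuristics problem in miniature.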

    • sswatson 15 hours ago
      They’re only machine-readable in the very weak sense that all computer files are machine-readable.
  • fasa99 2 hours ago
    WAIT ... Hole up... what have we here: https://www.nccn.org/compendia-templates/compendia/nccn-comp...

    TLDR: The NCCN surely has a clean pretty database of these algorithms. They output these junky pdfs for free. Want cleaner "templates" data? Pay the toll please.

    What we have here is a walled garden. Want the treatment algorithm? Here muck through this huge disaster of 999 page pdfs. Oh you want the underlying data? Well, well, it's going to cost you.

    What we have here is not so much different than the paywalls of an academic journal. Some company running a core service to an altruistic industry and skimming a price. OP is just writing an algorithm to unskim it. And nobody can really use it without making the thing bulletproof lest a physician mistreat a cancer.

    To my mind this is yet another unethical topic in healthcare. These clunky algorithms, if a physician uses them, slow the process and introduce a potential source of error, ultimately harming patients. Harming patients for increased revenue. The physicians writing and maintaining the guidelines look the other way given they get a paycheck off it, plus the prestige of it all, similar to some scenarios in medicine itself.

    The natural thing to do is crack open the database and let algorithms utilize it. This whole thing of dumping data in an abstruse and machine-challenging format, then building a Rube Goldberg machine to reverse the transformation, it's not right.

    Anyway, I mention this because there seems to be a thought of "these pdfs are messy, let's clean them" without looking at what's really going on here.

  • hulitu 11 hours ago
    > With properly structured data, machines should be able to interpret the guidelines.

    Yeah, right. And then say "Die". /s

    The guidelines shall be structured properly. It is not rocket science.