I think this is a good idea in general, but perhaps a bit too simple. It looks like this only works for static sites, right? It performs a JS fetch to pull in the HTML and then converts it (in a quick-and-dirty manner) to Markdown.
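For anyone wondering what "fetch, then quick-and-dirty convert" amounts to, here is a minimal sketch in Python using only the standard library. It is not the project's actual code (that runs as JS in a Worker); the names `QuickMarkdown`, `html_to_markdown`, and `fetch_markdown` are made up for illustration, and it only handles a handful of tags. The point is that the converter only ever sees the HTML the server returns, so anything rendered client-side never shows up.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class QuickMarkdown(HTMLParser):
    """Rough HTML-to-Markdown pass: headings, paragraphs, list items, links."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.href = None
        self.skip = 0  # depth inside <script>/<style>, whose text is dropped

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in ("h1", "h2", "h3"):
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = max(0, self.skip - 1)
        elif tag == "a":
            self.out.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        if not self.skip:
            # Collapse whitespace runs but keep single spaces between words.
            self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html):
    parser = QuickMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()


def fetch_markdown(url):
    # Plain HTTP fetch: no JavaScript runs, so only server-sent HTML is seen.
    with urlopen(url) as response:
        return html_to_markdown(response.read().decode("utf-8", errors="replace"))
```

Calling `html_to_markdown("<h1>Title</h1><p>Hello</p>")` gives a heading line followed by a paragraph. On a page that builds its content in the browser, you would mostly get an empty shell.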
I know this is pointing to the GH repo, but I’d love to know more about why the author chose to build it this way. I suspect it keeps costs low/free. But why CF workers? How much processing can you get done for free here?
I’m not sure how you could do much more in a CF worker, but this might be too simple to be useful on many sites.
Example: I had to pull in a docs site that was built for a project I’m working on. We wanted an LLM to be able to use the docs in its responses. However, the site was based on VitePress. I didn’t have access to the source Markdown files, so I wrote an MCP fetcher that uses a Dockerized headless Chrome instance to load the page. I then pull the innerHTML directly from the processed DOM. It’s probably overkill, but it’s an example of when this tool might not work.
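A rough sketch of the headless-browser approach, for comparison. This is not the fetcher described above: it swaps in Chrome's `--dump-dom` flag, which prints the serialized DOM after the page has loaded, instead of reading innerHTML through a driver. The binary name `chromium` and the helper names `dump_dom_command` and `rendered_html` are assumptions; inside a container the binary may be called something else.

```python
import subprocess


def dump_dom_command(url, chrome_binary="chromium"):
    # chrome_binary is an assumption; adjust to whatever your image installs.
    return [chrome_binary, "--headless", "--disable-gpu", "--dump-dom", url]


def rendered_html(url, chrome_binary="chromium", timeout=30):
    # Runs Chrome headlessly and returns the DOM serialized after page load.
    result = subprocess.run(
        dump_dom_command(url, chrome_binary),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

The output of `rendered_html` can then go through whatever HTML-to-Markdown step you already have. The trade-off is obvious: you need a full browser somewhere, which is not something a free-tier Worker is going to give you.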
But — if you have a static site, this tool could be a very simple way to configure MCP access. It’s a nice idea!
I thought this is what the web_fetch tools already did? Tools are configured through MCP also, right? So why am I prepending a URL, and not just using the web_fetch tool that already works?
Does this skirt robots.txt by chance? Not being able to fetch any web page is really bugging me, and I'm hoping to use a better web_fetch that isn't censored. I'm just going to copy/paste the content anyway.
I think the idea here is that the web_fetch is restricted to the target site. I might want to include my documentation in an MCP server (from docs.example.com), but that doesn’t mean I want the full web available.
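To make the "restricted to the target site" idea concrete, here is a minimal sketch of the kind of check such a server could apply before fetching anything. This is an assumption about how scoping might work, not the tool's actual logic; `ALLOWED_HOSTS` and `is_allowed` are hypothetical names.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: only the documentation host is reachable.
ALLOWED_HOSTS = {"docs.example.com"}


def is_allowed(url, allowed_hosts=ALLOWED_HOSTS):
    parts = urlparse(url)
    # Exact hostname match, so look-alike hosts such as
    # docs.example.com.evil.test are rejected, as are non-HTTP schemes.
    return parts.scheme in ("http", "https") and parts.hostname in allowed_hosts
```

With this in place, `is_allowed("https://docs.example.com/guide")` passes while a request for any other host is refused, which is the difference from a general-purpose web_fetch.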
Different use cases, I think.