I really liked the README, that was a good use of AI.
If you're interested in the idea of writing a database, I recommend you check out https://github.com/thomasjungblut/go-sstables which includes sstables, a skiplist, a recordio format and other database building blocks like a write-ahead log.
Also https://github.com/BurntSushi/fst which has a great blog post explaining its compression (and it has been ported to Go). It's really helpful for autocomplete/typeahead when recommending searches to users or doing spelling correction for search inputs.
Would love to hear how this compares to another popular Go-based full text search engine (with a not too dissimilar name): https://github.com/blevesearch/bleve
I put the Overview section from the README into an AI content detector and it says 92% AI. Some comment blocks inside the codebase are rated as 100% AI generated.
That said, totally fair read on the comments. Curious if they helped/landed the way I intended, or if a multi-part blog series would’ve worked better :)
Could you explain more why you avoided parsing strings to build queries? Strings as queries are pretty standard for search engines. Yes, strings require you to write an interpreter/parser, but the power in many search engines comes from being able to create a query language to handle really complicated and specific queries.
You're right, string-based queries are very expressive. I intentionally avoided that here so readers could focus on how FTS indexes work internally. Adding a full query system would have shifted the focus away from the internals.
If you notice, there are some very obvious optimizations we could make. I’m planning to collect them and put these up as code challenges for readers, and building string-based queries would make a great one :)
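As a sketch of what that code challenge might look like: a tiny parser that turns a flat "a AND b AND c" query string into a composable query tree. The `Query`/`Term`/`And` names here are hypothetical, invented for illustration, and not the project's actual API:

```go
package main

import (
	"fmt"
	"strings"
)

// Query is a hypothetical composable query node, not any real engine's API.
type Query interface{ String() string }

// Term matches a single word.
type Term struct{ Word string }

func (t Term) String() string { return t.Word }

// And requires both sub-queries to match.
type And struct{ Left, Right Query }

func (a And) String() string { return fmt.Sprintf("(%s AND %s)", a.Left, a.Right) }

// Parse turns a flat "a AND b AND c" string into a left-nested query tree.
// A real query language would also handle OR, NOT, quotes, and parentheses,
// which is where the recursive-descent parsing gets interesting.
func Parse(s string) Query {
	parts := strings.Split(s, " AND ")
	var q Query = Term{Word: strings.TrimSpace(parts[0])}
	for _, p := range parts[1:] {
		q = And{Left: q, Right: Term{Word: strings.TrimSpace(p)}}
	}
	return q
}

func main() {
	fmt.Println(Parse("full AND text AND search"))
	// prints: ((full AND text) AND search)
}
```

Once a string parses into a tree like this, executing it is just walking the tree and intersecting/unioning posting lists, so the parser layers cleanly on top of the programmatic API.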
This is very cool! Your README is interesting and well written - I didn't know I could be so interested in the internals of a full text search engine :)
What was the motivation to kick this project off? Learning or are you using it somehow?
Mostly wanted a refresher on GPU accelerated indexes and Vector DB internals. And maybe along the way, build an easy on-ramp for folks who want to understand how these work under the hood
I see you are using a positional index rather than doing bi-word matching to support positional queries.
Positional indexes can be a lot larger than non-positional. What is the ratio of the size of all documents to the size of the positional inverted index?
Well bi-word matching requires that you still have all of the documents stored to verify the full phrase occurs in the document rather than just the bi-words. So it isn't always better.
For example the phrase query "United States of America" doesn't occur in the document "The United States is named after states of the North American continent. The capital of America is Washington DC". But "United States", "states of" and "of America" all appear in it.
There's a tradeoff because we still have to fetch the full document text (or some positional structure) for the filtered-down candidate documents containing all of the bi-word pairs. So it requires a second stage of disk I/O. But as I understand it, most practitioners assume you can get away with fewer IOPS than a positional index, since that info only has to be fetched for a much smaller filtered-down candidate set rather than for the whole posting list.
But that's why I was curious about the storage ratio of your positional index.
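The "United States of America" false positive above is easy to demonstrate in code. Here's a minimal sketch (my own tokenization and function names, not the project's code): a bi-word filter accepts the document because every adjacent word pair appears somewhere in it, while a contiguous phrase check, the kind a positional index lets you do without refetching documents, correctly rejects it:

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize lowercases and splits on whitespace, dropping basic punctuation.
// A real analyzer does much more (stemming, stop words, etc.).
func tokenize(s string) []string {
	clean := strings.NewReplacer(".", "", ",", "").Replace(s)
	return strings.Fields(strings.ToLower(clean))
}

// biwordMatch reports whether every adjacent word pair of the phrase appears
// somewhere in the document. The pairs themselves need not be adjacent to
// each other, which is exactly where false positives sneak in.
func biwordMatch(doc, phrase []string) bool {
	pairs := map[[2]string]bool{}
	for i := 0; i+1 < len(doc); i++ {
		pairs[[2]string{doc[i], doc[i+1]}] = true
	}
	for i := 0; i+1 < len(phrase); i++ {
		if !pairs[[2]string{phrase[i], phrase[i+1]}] {
			return false
		}
	}
	return true
}

// phraseMatch checks that the full phrase occurs contiguously in the document,
// which is what positional information lets you verify directly.
func phraseMatch(doc, phrase []string) bool {
	for i := 0; i+len(phrase) <= len(doc); i++ {
		ok := true
		for j := range phrase {
			if doc[i+j] != phrase[j] {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	doc := tokenize("The United States is named after states of the North American continent. The capital of America is Washington DC")
	phrase := tokenize("United States of America")
	fmt.Println(biwordMatch(doc, phrase), phraseMatch(doc, phrase))
	// prints: true false
}
```

So the bi-word index shrinks the candidate set cheaply, but the second-stage verification (against stored documents or positions) is what makes the result correct.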
>I really liked the README, that was a good use of AI.
Human intelligences, please start saying:
(A)I wrote a $something in $language.
Give credit where it's due. AIs have feelings too.
Ohh boi, that’s exactly how the movie "Her" started! XD
When you think OP vibe-coded the project but can’t prove it yet
https://x.com/FG_Artist/status/1974267168855392371
Dexter's memes have been popping up recently and I am loving them
I don't know who the Bay Harbor Butcher is though :sob: but I don't want spoilers, I will watch it completely some day
My friend says that he watched all of Dexter just via clips lol.
Would a multi-part blog have been better?
Is vibe-commented a thing yet? :D
Wanted to give fellow readers a good on-ramp for understanding the FTS internals. Figured leaning into readability wouldn’t hurt
For me this makes the structure super easy to grok at a glance
https://github.com/wizenheimer/blaze/blob/27d6f9b3cd228f5865...
> What was the motivation to kick this project off? Learning or are you using it somehow?
It ended up being a clean, reusable component, so I decided to carve it out into a standalone project
The README is mostly notes from my Notion pages, glad you found it interesting!
I intended this to be an easy on-ramp for folks who want to get a feel for how FTS engines work under the hood :)
I appreciate the technical depth of the readme, but I’m not sure it fits your easy on-ramp framing.
Keep going and keep sharing.
One’s for browsing HN at work, the other’s for home, and the third one has a username I'm not too fond of.
I’ll stick to this one :) I might have some karma on the older ones, but honestly, HN is just as fun from everywhere