SQLite JSON at Full Index Speed Using Generated Columns

(dbpro.app)

255 points | by upmostly 6 hours ago

30 comments

eliasdejong 14 minutes ago
It is also possible to encode JSON documents directly as a serialized B-tree. Then you can construct iterators on it directly, and query internal fields at indexed speeds. It is still a serialized document (possible to send over a network), though now you don't need to do any parsing, since the document itself is already indexed. It is called the Lite³ format.
Disclaimer: I am working on this.
https://github.com/fastserial/lite3
srameshc 2 hours ago
I love SQLite and this is in no way I'm making a point devaluing SQLite, Author's method is excellent approach to get analytical speed out of SQLite. But I am loving DuckDB for similar analytical workloads as it is built for such tasks. DuckDB also reads from single file, like SQLite and DuckDB process large data sets at extreme speeds. I work on my macbook m2 and I have been dealing with about 20 million records and it works fast, very fast.
Loading data into DuckDB is super easy, I was surprised :
SELECT avg(sale_price), count(DISTINCT customer_id) FROM '/my-data-lake/sales/2024/*.json';
and you can also load into a JSON type column and can use postgres type syntax col->>'$.key'
[-]
- loa_observer 1 hour ago
  duckdb is super fast for analytic tasks, especially when u use it with visual eda tool like pygwalker. it allows u handles millions of data visuals and eda in seconds.
  but i would say, comparing duckdb and sqlite is a little bit unfair, i would still use sqlite to build system in most of cases, but duckdb only for analytic. you can hardly make a smooth deployment if you apps contains duckdb on a lot of platform
  [-]
  - trueno 17 minutes ago
    depending on the size and needs of distributed system or application im kind of really excited about postgres + pg_lake. postgres has blown my mind at how well it does concurrent writes at least for the types of things i build/support for my org, the pg_lake extension then adds the ability to.. honestly work like a datalake style analytics engine. it intuitively switches whether or not the transaction goes down the normal query path or it uses duckdb which brings giga-aggregation type queries to massive datasets.
    someone should smush sqlite+duckdb together and do that kind of switching depending on query type
- mikepurvis 1 hour ago
  Whoa. Is that first query building an index of random filesystem json files on the fly?
  [-]
  - NortySpock 1 hour ago
    It's not an index, it's just (probably parallel) file reads
    That being said, it would be trivial to tweak the above script into two steps, one reading data into a DuckDB database table, and the second one reading from that table.
- lame_lexem 1 hour ago
  can we all agree to never store datasets uncompressed. duckdb supports reading many compression formats
  [-]
  - hawk_ 36 minutes ago
    How much impact do the various compression formats have on query performance?
jelder 5 hours ago
I thought this was common practice, generated columns for JSON performance. I've even used this (although it was in Postgres) to maintain foreign key constraints where the key is buried in a JSON column. What we were doing was slightly cursed but it worked perfectly.
[-]
- craftkiller 3 hours ago
  If you're using postgres, couldn't you just create an index on the field inside the JSONB column directly? What advantage are you getting from extracting it to a separate column?
```
  CREATE INDEX idx_status_gin
  ON my_table
  USING gin ((data->'status'));
```
  ref: https://www.crunchydata.com/blog/indexing-jsonb-in-postgres
  [-]
  - jelder 3 hours ago
    That works for lookups but not for foreign key constraints.
    [-]
    - craftkiller 3 hours ago
      Ah, makes sense. Thanks!
    - cies 3 hours ago
      ..and it does not make "certain queries easier" (quote from the article).
  - morshu9001 1 hour ago
    You only need gin if you want to index the entire jsonb. For a specific attribute, you can use the default (btree) which I'm guessing is faster.
    [-]
    - a-priori 3 minutes ago
      Yes, as far as indices go, GIN indices are very expensive especially on modification. They're worthwhile in cases where you want to do arbitrary querying on JSON data, but you definitely don't want to overuse them.
      If you can get away with a regular index on either a generated column or an expression, then you absolutely should.
- morshu9001 1 hour ago
  Doesn't sound very cursed, standard normalized relations for things that need it and jsonb for the big bags of attributes you don't care to split apart
- ramon156 5 hours ago
  It works until you realize some of these usages would've been better as individual key/value rows.
  For example, if you want to store settings as JSON, you first have to parse it through e.g. Zod, hope that it isn't failing due to schema changes (or write migrations and hope that succeeds).
  When a simple key/value row just works fine, and you can even do partial fetches / updates
  [-]
  - jelder 3 hours ago
    The necessity of using a JSON column was outside of my control, but Zod etc. are absolutely required, I think, in most projects. I wrote more about that here: https://www.jacobelder.com/2025/01/31/where-shift-left-fails...
  - mickeyp 4 hours ago
    EAV data models are kinda cursed in their own right, too, though.
- sigwinch 5 hours ago
  It is. I’d wondered if STORED is necessary and this example uses VIRTUAL.
- jasonthorsness 4 hours ago
  This is the typical practice for most index types in SingleStore as well except with the Multi-Value Hash Index which is defined over a JSON or BSON path
tracker1 4 hours ago
As others mention, you can create indexes directly against the json without projecting in to a computed column... though the computed column has the added benefit of making certain queries easier.
That said, this is pretty much what you have to do with MS-SQL's limited support for JSON before 2025 (v17). Glad I double checked, since I wasn't even aware they had added the JSON type to 2025.
[-]
- advisedwang 2 hours ago
  Exclusively using computed columns, and never directly querying the JSON does have the advantage of making it impossible to accidentally write a unindexed query.
kevinsync 2 hours ago
Hilariously, I discovered this very technique a couple weeks ago when Claude Code presented it out of the blue as an option with an implemented example when I was trying to find some optimizations for something I'm working on. It turned out to be a really smart and performant choice, one I simply wasn't aware of because I hadn't really kept up with new SQLite features the last few years at all.
Lesson learned: even if you know your tools well, periodically go check out updated docs and see what's new, you might be surprised at what you find!
[-]
upmostly 6 hours ago
I was inspired to write this blog post after reading bambax's comment on a HN post back in 2023: https://news.ycombinator.com/item?id=37082941
Lex-2008 5 hours ago
interesting, but can't you use "Index On Expression" <https://sqlite.org/expridx.html>?
i.e. something like this: CREATE INDEX idx_events_type ON events(json_extract(data, '$.type'))?
i guess caveat here is that slight change in json path syntax (can't think of any right now) can cause SQLite to not use this index, while in case of explicitly specified Virtual Generated Columns you're guaranteed to use the index.
[-]
- pkhuong 5 hours ago
  Yeah, you can use index on expression and views to ensure the expression matches, like https://github.com/fsaintjacques/recordlite . The view + index approach decouples the convenience of having a column for a given expression and the need to materialise the column for performance.
- fny 4 hours ago
  > slight change in json path syntax (can't think of any right now) can cause SQLite to not use this index
  It's pretty fragile...
```
    --  Just changing the quoting
    select * from events where json_extract(data, "$.type") = 'click';

    -- Changing the syntax
    select * from events where data -> '$.type' = 'click';
```
  Basically anything that alters the text of an expression within the where clause
  [-]
  - johnmaguire 1 hour ago
    TIL. Are MySQL and Postgres this fragile too?
- paulddraper 5 hours ago
  Yes, that’s the simpler and faster solution.
  You need to ensure your queries match your index, but when isn’t that true :)
  [-]
  - 0x457 3 hours ago
    > but when isn’t that true
    When you write another query against that index a few weeks later and forget about the caveat, that slight change in where clause will ignore that index.
- WilcoKruijer 4 hours ago
  From the linked page:
  > The ability to index expressions was added to SQLite with version 3.9.0 (2015-10-14).
  So this is a relatively new addition to SQLite.
  [-]
  - debugnik 4 hours ago
    I'm not sure 2015 counts as new, but that's same release that first introduced the JSON extension. There isn't a version of SQLite with JSON expressions but without indexes on expressions. Also, the JSON extension wasn't enabled by default until 2022, so most people using SQLite with JSON have got a version much newer than 2015.
  - Lex-2008 3 hours ago
    i initially misread "2015" as "2025", too... But no, it was part of SQLite for ten years already!
oars 26 minutes ago
Great article with clear instructions - could be quite useful if I need to do stuff with storing JSON in SQLite in the future.
verytrivial 1 hour ago
If you replace JSON with XML in this model it is exactly what the "document store" databases from the 90s and 00s were doing -- parsing at insert and update time, then touching only indexes at query time. It is indeed cool that sqlite does this out of the box.
AlexErrant 5 hours ago
I was looking for a way to index a JSON column that contains a JSON array, like a list of tags. AFAIK this method won't work for that; you'll either need to use FTS or a separate "tag" table that you index.
[-]
- MyOutfitIsVague 2 hours ago
  Yeah, SQLite doesn't have any true array datatype. I think you could probably do it with a virtual table, but that would be adding a native extension, and it would have to pack its own index.
- rini17 2 hours ago
  You can use triggers to keep the tag table synchronized automatically.
- garaetjjte 4 hours ago
  I would want that too. It's possible in MySQL: https://dev.mysql.com/doc/refman/8.4/en/create-index.html#cr...
simonw 5 hours ago
Tiny bug report: I couldn't edit text in those SQL editor widgets from my iPhone, and I couldn't scroll them to see text that extended past the width of the page either.
[-]
- hamburglar 5 hours ago
  The examples also needed a “drop table if exists” so they could be run more than once without errors.
  [-]
  - upmostly 5 hours ago
    Great catch, I'll add that now!
- upmostly 5 hours ago
  Thanks Simon! Looking into that now. Big fan. Hope you enjoyed the post.
  Edit: This should now be fixed for you.
focusgroup0 3 hours ago
Would this be a good fit for migrating from mongo --> sqlite? A task I am dreading
[-]
- zffr 2 hours ago
  Just curious, why do you want to migrate from mongo (document database server) to sqlite (relational database library)?
  That migration would be making two changes: document-based -> relational, and server -> library.
  Have you considered migrating to Postgres instead? By using another DB server you won't need to change your application as much.
  [-]
  - focusgroup0 54 minutes ago
    Thanks for the feedback. The document model in mongo was slopped together by a junior engineer, so perhaps an unorthodox approach. It is basically flat, and already used in a pseudo-relational manner via in-app join to the existing sqlite store. This blog post inspired me to think, what if we just chucked all the json from mongo into sqlite and used the generated indices? Then we can gradually "strangler fig" endpoint by endpoint
rcarmo 1 hour ago
I've been using this trick for a while, and it actually got me to do quite a bit without an ORM (just hacking a sane models.py with a few stable wrappers and calling it a day)
[-]
- morshu9001 1 hour ago
  json columns pretty much obviated the need for ORMs. It used to be that you'd sometimes have a deep nested thing you really only ever query all at once rather than in pieces, so you'd use an ORM to automate that, but now you can just shove it into json. And then use regular SQL for the relations you actually care about.
ellisv 5 hours ago
I wish devs would normalize their data rather than shove everything into a JSON(B) column, especially when there is a consistent schema across records.
It's much harder to setup proper indexes, enforce constraints, and adds overhead every time you actually want to use the data.
[-]
- nh2 4 hours ago
  JSON columns shine when
  * The data does not map well to database tables, e.g. when it's tree structures (of course that could be represented as many table rows too, but it's complicated and may be slower when you always need to operate on the whole tree anyway)
  * your programming language has better types and programming facilities than SQL offers; for example in our Haskell+TypeScript code base, we can conveniently serialise large nested data structures with 100s of types into JSON, without having to think about how to represent those trees as tables.
  [-]
  - cies 3 hours ago
    You do need some fancy in-house way to migrate old JSONs to new JSON in case you want to evolve the (implicit) JSON schema.
    I find this one of the hardest part of using JSON, and the main reason why I rather put it in proper columns. Once I go JSON I needs a fair bit of code to deal with migrartions (either doing them during migrations; or some way to do them at read/write time).
    [-]
    - kccqzy 32 minutes ago
      Since OP is using Haskell, the actual code most likely won’t really touch the JSON type, but the actual domain type. This makes migrations super easy to write. Of course they could have written a fancy in-house way to do that, or just use the safe-copy library which solves this problem and it has been around for almost two decades. In particular it solves the “nested version control” problem with data structures containing other data structures but with varying versions.
    - nh2 2 hours ago
      Yes, that's what we do: Migrations with proper sum types and exhaustiveness checking.
- tracker1 4 hours ago
  I find that JSON(B) works best when you have a collection of data with different or variant concrete types of data that aren't 1:1 matches. Ex: the actual transaction result if you have different payment processors (paypal, amazon, google, apple-pay, etc)... you don't necessarily want/care about having N different tables for a clean mapping (along with the overhead of a join) to pull the transaction details in the original format(s).
  Another example is a classifieds website, where your extra details for a Dress are going to be quite a bit different than the details for a Car or Watch. But, again, you don't necessarily want to inflate the table structure for a fully normalized flow.
  If you're using a concretely typed service language it can help. C# does a decent job here. But even then, mixing in Zod with Hono and OpenAPI isn't exactly difficult on the JS/TS front.
- jasonthorsness 4 hours ago
  When a data tree is tightly coupled (like a complex sample of nested data with some arrays from a sensor) and the entire tree is treated like a single thing by writes, the JSON column just keeps things easier. Reads can be accelerated with indexes as demonstrated here.
- whizzter 4 hours ago
  I fully agree that's wrong (can't imagine the overhead of some larger tables I have if that had happened), that said, often people want weird customizations in medium-sized tables that would set one on a path to having annoying 100 column tables if we couldn't express customizations in a "simple" JSON column (that is more or less polymorphic).
  Typical example is a price-setting product I work on.. there's price ranges that are universal (and DB columns reflect that part) but they all have weird custom requests for pricing like rebates on the 3rd weekend after X-mas (but only if the customer is related to Uncle Rudolph who picks his nose).
  [-]
  - fauigerzigerk 2 hours ago
    But if you have to model those custom pricing structures anyway, the question what you gain by not reflecting them in the database schema.
    There's no reason to put all those extra fields in the same table that contains the universal pricing information.
- crazygringo 4 hours ago
  For very simple JSON data whose schema never changes, I agree.
  But the more complex it is, the more complex the relational representation becomes. JSON responses from some API's could easily require 8 new tables to store the data in, with lots of arbitrary new primary keys and lots of foreign key constraints, your queries will be full of JOIN's that need proper indexing set up...
  Oftentimes it's just not worth it, especially if your queries are relatively simple, but you still need to store the full JSON in case you need the data in the future.
  Obviously storing JSON in a relational database feels a bit like a Frankenstein monster. But at the end of the day, it's really just about what's simplest to maintain and provides the necessary performance.
  And the whole point of the article is how easy it is to set up indexes on JSON.
- konart 4 hours ago
  Normalisation brings its own overhead though.
meindnoch 4 hours ago
In the 2nd section you're using a CREATE TABLE plus three separate ALTER TABLE calls to add the virtual columns. In the 3rd section you're using a single CREATE TABLE with the virtual columns included from the get go.
Why?
[-]
- hamburglar 4 hours ago
  I think the intent is to separate the virtual column creation out when it’s introduced in order to highlight that it’s a very lightweight operation. When moving onto the 3rd example, the existence of the virtual columns is just a given.
- ralferoo 2 hours ago
  Depending on the amount of inserts, it might be more efficient to create all the indexes in one go. I think this is certainly true for normal columns.
  But I suspect with JSON the overhead of parsing it each time might make it more efficient to update all the indices with every insert.
  Then again, it's probably quicker still to insert the raw SQL into a temporary table in memory and then insert all of the new rows into the indexed table as a single query.
- hiccuphippo 4 hours ago
  In 2 they show how to add virtual columns to an existing table, in 3 how to add indexes to existing virtual columns so they are pre-cooked. Like a cooking show.
  [-]
  - meindnoch 4 hours ago
    >In 2 they show how to add virtual columns to an existing table
    No, in section 2 the table is created afresh. All 3 sections start with a CREATE TABLE.
    [-]
    - hiccuphippo 4 hours ago
      Yes, it seems each section has its own independent database so you have to create everything on each of them.
  - upmostly 4 hours ago
    Literally exactly as I meant it. I watch a lot of cooking shows, too, so this analogy holds up.
pawelduda 4 hours ago
I've been coding a lot of small apps recently, and going from local JSON file storage to SQLite has been a very natural path of progression, as data's order of magnitude ramps up. A fully performant database which still feels as simple as opening and reading from a plain JSON file. The trick you describe in the article is actually an unexpected performance buffer that'll come in handy when I start hitting next bottleneck :) Thank you
mcluck 5 hours ago
Very cool article. To really drill it home, I would have loved to see how the query plan changes. It _looks_ like it should Just Work(tm) but my brain refuses to believe that it's able to use those new indexes so flawlessly
moregrist 4 hours ago
Generated columns are pretty great, but what I would really love is a Postgres-style gin index, which dramatically speeds up json queries for unanticipated keys.
maxpert 1 hour ago
LOL what are the odds, I posted in `Show HN` about Marmot today https://github.com/maxpert/marmot/releases/tag/v2.2.0 and in my head I was thinking exact same thing for supporting MySQL's JSON datatype. At some level I am starting to feel, I might as well be able to expose a full MongoDB compatible protocol that let's you talk to tables as collections, solving this problem once it for all!
But this technique I guess is very common now.
javantanna 2 hours ago
Your website looks like supermemory.ai , BTW its pretty cool
Seattle3503 3 hours ago
It says full speed, but no benchmarks were performed to verify if performance was really equivalent.
stacktraceyo 1 hour ago
Can I do this with pocket base?
bushbaba 5 hours ago
For smaller datasets (100s of thousands of rows) I don’t see why you wouldn’t just use json columns with generated column/index where needed
[-]
- bilekas 3 hours ago
  Regardless of the number of rows, it doesn't really matter, there are useful cases for where you might be consuming json directly, so instead of parsing it out into a schema for your database, why not just keep it raw and utilize the tools of the database.
  It's a feature, not a replacement.
N_Lens 4 hours ago
What a neat trick, I love SQLite as well.
mring33621 5 hours ago
IIRC, Vertica had/has a similar feature.
[-]
- kwillets 2 hours ago
  It's been around for quite while, but DB people hate to explain where they got an idea. For all I know Vertica got it from somewhere else; I think postgres got jsonb around the same time.
- xp84 5 hours ago
  Now there’s a name I haven’t heard in 10 years. (I’m only tenuously connected to the kinds of teams that use/would have used that, so it doesn’t mean much.)
baq 5 hours ago
My understanding is Snowflake works kinda like that behind the scenes right?
dmezzetti 1 hour ago
I love this feature. I've long used json_extract to create dynamic columns with txtai sql: https://neuml.github.io/txtai/embeddings/query/#dynamic-colu...
You can do the same with DuckDB and Postgres too.
jb_rad 2 hours ago
Dude what? This is incredible knowledge. I had been fearing this exact problem for so long, but there is an elegant out of the box solution. Thank you!!
groundzeros2015 3 hours ago
> We've got an embedded SQLite-in-the-browser component on our blog
What?
[-]
- qwertox 1 hour ago
  Probably using https://sqlite.org/wasm/doc/trunk/index.md
pipe01 4 hours ago
MongoDB is dead, long live MongoDB