> nearly everything that matters in practice: where the matches are, how long they are, and how many there are
I would say that regexes that matter in practice, e.g. when digging through logs, have clear boundaries that curb the pathological backtracking behavior. In particular, I find it difficult to imagine a practical need to find all matches of an expression like /.*a|b/, as shown in the article. Realistically you'd have to handle /\b.*a|b\b/, or similar, because realistically when you need all matches, you don't want intersecting matches. This means you want to proceed past the end of the n-th match to look for n+1-th match, and never want to use indeterminate prefixes like /.*a/.
This OTOH gives a reasonably useful heuristic if your regexp comes from an untrusted source and could be adversarial. Check that it does not start with a prefix with a Kleene star, like /a*/. Require at least one positive match (in each alternate branch). Of course, /a+b|c/ would still be quadratic if your text is long sequences of "a" interspersed with characters other than "b". But this, again, is more of a theoretical case, to my mind.
> I would say that regexes that matter in practice, e.g. when digging through logs, have clear boundaries that curb the pathological backtracking behavior
I agree with you in the sense that most practical regexes do not expose this quadratic blowup (from all matches) but i do not think the same about backtracking. The effect of backtracking is immediately clear when you're searching inside text without a clear anchor like \A or ^ or a very rare string prefix. It is much more visible with either larger patterns or larger character classes like unicode
The part that makes it difficult is that it doesn't return the same matches, it returns almost the same matches but not exactly.
But if PCRE semantics isn't set in stone then i hope leftmost longest could be the default some day. There's a lot of nice things you get for free with the two pass approach
Restricting regex features to guarantee time complexity works, but it requires sacrificing potentially useful features like backtracking (or in the article's case, constraining oneself to fixed-upper-bound-length needles).
In a real-world deployment where you want to run any arbitrary regex in an idiot/malice-proof manner, the best solution is the same solution you'd use for running any other kind of untrusted code - sandbox it! A good regex API should limit its execution time and memory consumption and return a timeout error in case those limits are exceeded. Ideally, those parameters would be configurable at the API level. Unfortunately, the only regex libraries I know of that get this right are .NET's standard library Regex API and the third-party regex package in Python.
Begin with the regex R. Form the NFA "A" of /.*R/. Run A over a string, marking the set of locations "MatchEnds" where any of A's states is accepting. Then form the NFA A' of the regex which matches the REVERSE of any string matched by R. Now run A' on the reversed string, but:
- A' should start in an EMPTY SET OF STATES
- A' should only go into its entry states every time it enters a location in MatchEnds.
- Furthermore, have each state of A' house a subset of MatchEnds, which gets ferried along any state transition.
- For each accepting state of A' and reversed string location i, make a record of which subset of MatchEnds is in the accepting state.
> the problem we're talking about in this post (finding all longest matches without quadratic blowup)
Wait, what? I thought this was about finding all matches. With a minor tweak to the opening example:
We want to match `(.*a | b)` against `bbbbbabbbbb`.
I want to detect each `b` individually, and I also want to detect `bbbbba`, `bbbba`, `bbba`, `bba`, `ba`, and `a`. That's what it means to find all matches.
> search a document for a pattern and it takes a second. search one a hundred times larger and it doesn't take a hundred seconds - it can take almost three hours.
Most of this is about quadratic time find-all operations where a search operation is linear. But it's also still possible to get quadratic behaviour out of a single search without catastrophic backtracking, more easily than you might expect. In late January to early February, Tim Peters was talking about an example of this on the Python forums (see e.g. https://discuss.python.org/t/add-re-prefixmatch-deprecate-re...) and also related the experience of trying to diagnose the issue with AI (see https://discuss.python.org/t/claude-code-how-much-hype-how-m... and onward). Peters' example was:
\d+\s+
on a string containing only digits, a prefix match takes O(n) time as it considers every possible end position for the digit, and immediately sees no following whitespace. But the search is quadratic because it has to repeat that O(n) work at every position; the regex engine can't track the fact that it's already examined the string and found no whitespace, so it re-tries each digit match length.
(This is arguably "backtracking" since it tries the longest match first, but clearly not in a catastrophic way; if you use `\d+?` instead then of course it only searches forward but is still O(n). It actually is slower in my testing in the Python implementation; I don't exactly know why. As noted in the discussion, the possessive quantifier `\d++` is considerably faster, and of course doesn't backtrack, but still causes O(n^2) searching. The repeated attempts to match `\s+` aren't the problem; the problem is repeatedly looking for digits in places where digits were already found and rejected.)
The way to fix this proposed in the discussion is to use a negative lookbehind assertion before the digits: `(?<!\d)\d+\s+`. This way, the regex engine can bail out early when it's in the middle of a digit string; if the previous character was a digit, then either `\d+\s+` doesn't match here, or it would have matched there.
A simpler idea is to just search for `\d\s+`, or even `\d\s` — since these will be present if and only if `\d+\s+` is. This way, though, you still need to do extra work with the partial match to identify the start and end of the full match. My first idea was to use positive lookbehind for the digits, since the lookbehind match doesn't need to backtrack. In fact lookbehinds require a fixed-length pattern, so this is really just a more complicated way to do the `\d\s+` simplification.
----
> Hyperscan (and its fork Vectorscan) is a true linear-time all-matches regex engine. it achieves this by using "earliest match" semantics - reporting a match the moment the DFA enters a match state, instead of continuing to find the longest one.
Is this not just equivalent to forcing "reluctant" quantifiers (`\d+?`) everywhere?
with all-matches semantics it returns a significantly higher number of matches than leftmost greedy.
eg. /abc*/ and abccccc will return you matches at ab|c|c|c|c|c|
I think it's very common and ok that people reason about other engines in terms of backtracking but it works very differently. And fixed length lookbehinds are more of a Java/Python thing, other engines support all lookbehinds.
The main idea of linear regex and intuitive semantics is that it should be declarative and the engine does whatever is the fastest without you having to worry about it. Instead of describing character by character how to perform the search and where it can blow up, think of it as just a specification. Then you can truly express whatever is the shortest to explain.
Something i'm still trying to figure out and perhaps failing to understand is what are the killer features of backtracking regex that you would really miss if you were to use linear regex? It would help me a lot to know, i'm trying to convince others to make the switch
Good to know. Although a lookbehind for `\d+` doesn't really gain anything over a lookbehind for `\d` anyway; they match in the same circumstances, just with different results.
If there's supposed to be a literal asterisk in there somewhere, you can escape it with a backslash. Right now two paragraphs are italic because of mismatched asterisks.
Thanks. There are no asterisks in the regexes; I had simply missed the closing asterisk on some intentional emphasis. (And then I also had to fix some escaping inserted by the system to try to correct for the actual problem.)
I'm of the opinion that their post is still human-written. They describe in their first blog post that it's human written (which could be a lie, but oh well) and their other blog posts seem to have the same informal lowercase hyphenated writing style. Sentance length varies throughout the posts. Sure, you could prompt this writing style, but I lean towards thinking that the piece is human written.
The fact that terms like Aho-Corasick, PLDI, Go, etc. are properly capitalized, including if they begin sentences, but otherwise sentences are uncapitalized, makes me think it's an explicit LLM instruction "don't capitalize the start of sentences" rather than writing style.
No, this is just what that writing style looks like. Names and acronyms are usually capitalized normally.
I keep being surprised by the magnitude of the disconnect between this place and the other circles of hell. I'd have thought the Venn diagram would have a lot more overlap.
Oh the venn diagram might be big, the HN population just has a lot of variance I think, and is less of a community per se. I don't doubt what you're saying, though in the grand scheme of things, I think the "too lazy to hit shift" population dwarfs any of these groups.
Yeah, I can agree with the variance. Except that the "too lazy to hit shift" community is not something I would ever confuse with people writing long form articles about their regex engine research that they'll be presenting at PLDI.
The confusion might be understandable for people who have never encountered this style before, but that's still a very uncharitable take about an otherwise pretty interesting article.
ChatGPT also loves Aho-Corasick and seems to overuse it as an optimization fall back idea. ChatGPT has suggested the algorithm to me but the code ended up slowing down a lot.
I also have the impression that the writing was at least "enhanced" with an LLM. For example, the amount of " - " dashes (over 60) is much higher than in a normal text.
By the way, I consider it a pretty disgraceful defect of HN that people routinely "flag" opinions merely because they disagree. The flag is for spam, not for downvoting.
I would say that regexes that matter in practice, e.g. when digging through logs, have clear boundaries that curb the pathological backtracking behavior. In particular, I find it difficult to imagine a practical need to find all matches of an expression like /.*a|b/, as shown in the article. Realistically you'd have to handle /\b.*a|b\b/, or similar, because realistically when you need all matches, you don't want intersecting matches. This means you want to proceed past the end of the n-th match to look for n+1-th match, and never want to use indeterminate prefixes like /.*a/.
This OTOH gives a reasonably useful heuristic if your regexp comes from an untrusted source and could be adversarial. Check that it does not start with a prefix with a Kleene star, like /a*/. Require at least one positive match (in each alternate branch). Of course, /a+b|c/ would still be quadratic if your text is long sequences of "a" interspersed with characters other than "b". But this, again, is more of a theoretical case, to my mind.
I agree with you in the sense that most practical regexes do not expose this quadratic blowup (from all matches) but i do not think the same about backtracking. The effect of backtracking is immediately clear when you're searching inside text without a clear anchor like \A or ^ or a very rare string prefix. It is much more visible with either larger patterns or larger character classes like unicode
Ah, there is a post with more detail about RE# and discussion here recently that I must have missed: https://news.ycombinator.com/item?id=47206647
But if PCRE semantics isn't set in stone then i hope leftmost longest could be the default some day. There's a lot of nice things you get for free with the two pass approach
In a real-world deployment where you want to run any arbitrary regex in an idiot/malice-proof manner, the best solution is the same solution you'd use for running any other kind of untrusted code - sandbox it! A good regex API should limit its execution time and memory consumption and return a timeout error in case those limits are exceeded. Ideally, those parameters would be configurable at the API level. Unfortunately, the only regex libraries I know of that get this right are .NET's standard library Regex API and the third-party regex package in Python.
Begin with the regex R. Form the NFA "A" of /.*R/. Run A over a string, marking the set of locations "MatchEnds" where any of A's states is accepting. Then form the NFA A' of the regex which matches the REVERSE of any string matched by R. Now run A' on the reversed string, but:
- A' should start in an EMPTY SET OF STATES
- A' should only go into its entry states every time it enters a location in MatchEnds.
- Furthermore, have each state of A' house a subset of MatchEnds, which gets ferried along any state transition.
- For each accepting state of A' and reversed string location i, make a record of which subset of MatchEnds is in the accepting state.
That record is the final output.
Wait, what? I thought this was about finding all matches. With a minor tweak to the opening example:
We want to match `(.*a | b)` against `bbbbbabbbbb`.
I want to detect each `b` individually, and I also want to detect `bbbbba`, `bbbba`, `bbba`, `bba`, `ba`, and `a`. That's what it means to find all matches.
Most of this is about quadratic time find-all operations where a search operation is linear. But it's also still possible to get quadratic behaviour out of a single search without catastrophic backtracking, more easily than you might expect. In late January to early February, Tim Peters was talking about an example of this on the Python forums (see e.g. https://discuss.python.org/t/add-re-prefixmatch-deprecate-re...) and also related the experience of trying to diagnose the issue with AI (see https://discuss.python.org/t/claude-code-how-much-hype-how-m... and onward). Peters' example was:
on a string containing only digits, a prefix match takes O(n) time as it considers every possible end position for the digit, and immediately sees no following whitespace. But the search is quadratic because it has to repeat that O(n) work at every position; the regex engine can't track the fact that it's already examined the string and found no whitespace, so it re-tries each digit match length.(This is arguably "backtracking" since it tries the longest match first, but clearly not in a catastrophic way; if you use `\d+?` instead then of course it only searches forward but is still O(n). It actually is slower in my testing in the Python implementation; I don't exactly know why. As noted in the discussion, the possessive quantifier `\d++` is considerably faster, and of course doesn't backtrack, but still causes O(n^2) searching. The repeated attempts to match `\s+` aren't the problem; the problem is repeatedly looking for digits in places where digits were already found and rejected.)
The way to fix this proposed in the discussion is to use a negative lookbehind assertion before the digits: `(?<!\d)\d+\s+`. This way, the regex engine can bail out early when it's in the middle of a digit string; if the previous character was a digit, then either `\d+\s+` doesn't match here, or it would have matched there.
A simpler idea is to just search for `\d\s+`, or even `\d\s` — since these will be present if and only if `\d+\s+` is. This way, though, you still need to do extra work with the partial match to identify the start and end of the full match. My first idea was to use positive lookbehind for the digits, since the lookbehind match doesn't need to backtrack. In fact lookbehinds require a fixed-length pattern, so this is really just a more complicated way to do the `\d\s+` simplification.
----
> Hyperscan (and its fork Vectorscan) is a true linear-time all-matches regex engine. it achieves this by using "earliest match" semantics - reporting a match the moment the DFA enters a match state, instead of continuing to find the longest one.
Is this not just equivalent to forcing "reluctant" quantifiers (`\d+?`) everywhere?
eg. /abc*/ and abccccc will return you matches at ab|c|c|c|c|c|
I think it's very common and ok that people reason about other engines in terms of backtracking but it works very differently. And fixed length lookbehinds are more of a Java/Python thing, other engines support all lookbehinds.
The main idea of linear regex and intuitive semantics is that it should be declarative and the engine does whatever is the fastest without you having to worry about it. Instead of describing character by character how to perform the search and where it can blow up, think of it as just a specification. Then you can truly express whatever is the shortest to explain.
Something i'm still trying to figure out and perhaps failing to understand is what are the killer features of backtracking regex that you would really miss if you were to use linear regex? It would help me a lot to know, i'm trying to convince others to make the switch
Just a small note: some regex engines support "variable length lookbehind", check the last column on this wikipedia article : https://en.wikipedia.org/wiki/Comparison_of_regular_expressi...
I keep being surprised by the magnitude of the disconnect between this place and the other circles of hell. I'd have thought the Venn diagram would have a lot more overlap.
The confusion might be understandable for people who have never encountered this style before, but that's still a very uncharitable take about an otherwise pretty interesting article.
Jack Dorsey's layoff message last month did the same thing.
Is it some kind of "Prove you're not an AI by purposely writing like an idiot" or something?
By the way, I consider it a pretty disgraceful defect of HN that people routinely "flag" opinions merely because they disagree. The flag is for spam, not for downvoting.