Given that encoded characters must have one and only one General_Category value, it might be too imprecise or arbitrary in some cases. If you ever need more power, it's worth browsing the other character properties Unicode exposes. For example, `Lu` (Uppercase_Letter) only covers some uppercase letters, whereas the `Uppercase` property covers all of them.
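A rough way to see the gap from Python, using only the standard library: `unicodedata.category()` reports General_Category. The last two sample characters are ones that, as far as I know, the UCD lists under Other_Uppercase, so they count as `Uppercase` without being `Lu`. The `isupper()` column is printed rather than asserted, because how closely CPython's `str.isupper()` tracks the `Uppercase` property is an implementation detail I'm assuming, not something quoted from the docs.

```python
import unicodedata

# Sample characters; the last two are assumed to carry Other_Uppercase in the UCD.
samples = ["A", "\u2160", "\u24b6"]  # LATIN CAPITAL A, ROMAN NUMERAL ONE, CIRCLED CAPITAL A

# General_Category of each sample, as reported by the stdlib's copy of the UCD.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch in samples:
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"General_Category={categories[ch]}, isupper()={ch.isupper()}"
    )
```

Only the first character should come back as `Lu`; the other two land in `Nl` and `So`, which is the kind of case where General_Category alone is too coarse.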
split() without a custom delimiter also splits on runs of whitespace, which splitlines() doesn't do either (except for \r\n, because that combination counts as one line ending).
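A small sketch of that difference (the sample string is made up for illustration):

```python
text = "alpha  beta\tgamma\r\ndelta\n\nepsilon"

# No-argument split(): any run of whitespace is a single separator.
words = text.split()

# splitlines(): only line boundaries split; \r\n counts as one boundary,
# spaces and tabs stay in place, and the blank line is kept as ''.
lines = text.splitlines()

print(words)  # ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
print(lines)  # ['alpha  beta\tgamma', 'delta', '', 'epsilon']
```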
splitlines() is sometimes nice for ad hoc parsing (of well-behaved stuff...) because, unlike split('\n'), it doesn't leave a trailing empty string in the resulting list when the text ends with a line break.
My #1 use case for that is probably just dealing with the trailing newline character in the output of a command I ran via subprocess.
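For what it's worth, here is a minimal sketch of the subprocess case. It spawns a child Python interpreter so it doesn't depend on any particular external command; the command and variable names are just for illustration.

```python
import subprocess
import sys

# Run a child Python process that prints two lines, ending with a newline.
result = subprocess.run(
    [sys.executable, "-c", "print('first'); print('second')"],
    capture_output=True,
    text=True,
    check=True,
)

naive = result.stdout.split("\n")   # keeps an empty string after the final newline
lines = result.stdout.splitlines()  # no trailing empty entry

print(naive)
print(lines)
```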
There are so many super useful things in the Python docs that you never see in the wild. For example, I recently learned that sqlite3 connections have a set_authorizer method that lets you limit the types of statements that can be run and the tables that can be accessed.
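A minimal sketch of what that can look like, assuming a read-only policy. The `read_only_authorizer` callback and the `notes` table are made up for this example; the `SQLITE_*` constants and `Connection.set_authorizer` come from the standard `sqlite3` module.

```python
import sqlite3

def read_only_authorizer(action, arg1, arg2, db_name, source):
    """Hypothetical policy: allow plain SELECTs and column reads, deny everything else."""
    allowed = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ}
    return sqlite3.SQLITE_OK if action in allowed else sqlite3.SQLITE_DENY

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")
conn.execute("INSERT INTO notes VALUES ('hello')")
conn.commit()

# From here on, every statement is checked against the callback as it is compiled.
conn.set_authorizer(read_only_authorizer)

rows = conn.execute("SELECT body FROM notes").fetchall()

try:
    conn.execute("DELETE FROM notes")
    delete_blocked = False
except sqlite3.DatabaseError:
    # Denied statements surface as a database error ("not authorized").
    delete_blocked = True

print(rows, delete_blocked)
```

To restrict by table instead, the callback could inspect `arg1` on `SQLITE_READ` actions, which (per the SQLite docs) carries the table name.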
One thing to note, though, is that strip(), lstrip() and rstrip() take a set of characters: they keep stripping for as long as they encounter characters in that set. Lots of people think that if you give them a string, they will remove that string.
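A quick illustration of the character-set behavior (the file name is arbitrary):

```python
name = "report.txt"

# rstrip(".txt") strips any trailing run of the characters '.', 't', 'x',
# so the final 't' of "report" goes too.
stripped = name.rstrip(".txt")

# Order and repetition in the argument don't matter; it is a set of characters.
same = name.rstrip("xt.")

print(stripped)  # repor
print(same)      # repor
```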
That feature was added in 3.9 with the addition of `removeprefix` and `removesuffix`.
Sadly,
1. unlike Rust's version they provide no way of knowing whether they stripped things out
2. unlike startswith/endswith they do not take tuples of prefixes/suffixes
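One possible workaround for both gaps, as a sketch: `strip_any_prefix` is a hypothetical helper (not part of the standard library) that accepts several prefixes and returns `None` when nothing matched, loosely mirroring how Rust's `strip_prefix` returns an `Option`. It needs Python 3.9+ only for the `removesuffix` calls shown alongside it.

```python
from typing import Iterable, Optional

def strip_any_prefix(text: str, prefixes: Iterable[str]) -> Optional[str]:
    """Hypothetical helper: return text minus the first matching prefix, or None if none matched."""
    for prefix in prefixes:
        if text.startswith(prefix):
            return text[len(prefix):]
    return None

print("report.txt".removesuffix(".txt"))  # report
print("report.txt".removesuffix(".csv"))  # report.txt (unchanged, no signal that nothing was removed)
print(strip_any_prefix("https://example.com", ("http://", "https://")))  # example.com
print(strip_any_prefix("ftp://example.com", ("http://", "https://")))    # None
```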
I already knew this information, more or less, but I like reading TIL posts like this. It's fun seeing someone learn new things, and sometimes I pick up something myself, or at least look at it in a new way.
Yeah, don't listen to parent. I like these sorts of articles a lot; it's only useless if you assume that everyone interested has also memorized the Python docs fully (which I imagine is zero people). Fun technical tangents are quite fun indeed.
John Yossarian is the protagonist of Joseph Heller’s Catch-22[1], which was my favorite book in high school. Like a lot of people, my handle is a slightly embarrassing memorialization of my younger self :-)
Sometimes value is measured by awareness. I benefited from becoming aware of the behavior because of the article. Yes, it's in the docs, but the docs are not something I would have gone looking to read today.
The value of this article, to me, is that I'd never read the splitlines documentation, so this is a little detail that I just learned thanks to it being linked here.
[0]: https://en.wikipedia.org/wiki/Unicode_character_property#Gen...
---
For anyone who wants to learn more about Unicode specifics, the three big data sources are the Core Spec, the Unicode Standard Annexes (UAXs), and the Unicode Character Database itself (the database is a bunch of text files; there's an XML version now as well).
For further reading on this specifically, it might be worth looking at:
[Unicode Core Spec - Chapter 4: Character Properties] https://www.unicode.org/versions/Unicode17.0.0/core-spec/cha...
├ [General Category] https://www.unicode.org/versions/Unicode17.0.0/core-spec/cha...
└ [Properties for Text Boundaries] https://www.unicode.org/versions/Unicode17.0.0/core-spec/cha...
[UAX #44 - Unicode Character Database (Technical Report)] https://www.unicode.org/reports/tr44/
├ [General Category Values] https://www.unicode.org/reports/tr44/#General_Category_Value...
└ [Property Definitions] https://www.unicode.org/reports/tr44/#Property_Definitions
And, if you're brave and want to see the data itself (skim through UAX #44 first):
[Unicode Character Database] https://www.unicode.org/Public/17.0.0/ucd/
In my experience, they're not. It's strings.
>>> s = "line1\nline2\rline3\r\nline4\vline5\x1dhello"
>>> s.split()
['line1', 'line2', 'line3', 'line4', 'line5', 'hello']
>>> s.splitlines()
['line1', 'line2', 'line3', 'line4', 'line5', 'hello']
But split() has a sep argument that defines the delimiter to split the string on, in which case it does what you expected to happen:
>>> s.split('\n')
['line1', 'line2\rline3\r', 'line4\x0bline5\x1dhello']
In general you want this:
>>> import re
>>> linesep_splitter = re.compile(r'\n|\r\n?')
>>> linesep_splitter.split(s)
['line1', 'line2', 'line3', 'line4\x0bline5\x1dhello']
str.split() splits on runs of consecutive whitespace of any type, including tabs and spaces, which splitlines() doesn't do.
https://www.sqlite.org/c3ref/set_authorizer.html
https://docs.python.org/3/library/sqlite3.html#sqlite3.Conne...
[1]: https://en.wikipedia.org/wiki/Catch-22
... Guilty, actually.