Python's splitlines does more than just newlines

(yossarian.net)

81 points | by Bogdanp 6 hours ago

7 comments

  • dleeftink 4 hours ago
    For more controlled splitting, I really like Unicode named character classes[0], which allow more precise splitting and matching.

    [0]: https://en.wikipedia.org/wiki/Unicode_character_property#Gen...
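
    A minimal sketch of that idea using only the standard library. As far as I know, the built-in re module does not support \p{...} classes, so this looks up each character's general category with unicodedata instead. The helper name split_on_categories and the sample string are made up for illustration.

```python
import unicodedata


def split_on_categories(text, categories=("Zl", "Zp")):
    """Split text on any character whose Unicode general category is listed.

    Hypothetical helper: by default it splits only on line separators (Zl)
    and paragraph separators (Zp), so a plain "\n" (category Cc) is kept.
    """
    parts, current = [], []
    for ch in text:
        if unicodedata.category(ch) in categories:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


sample = "alpha\u2028beta\u2029gamma\ndelta"
result = split_on_categories(sample)
print(result)  # ['alpha', 'beta', 'gamma\ndelta']
```

    Passing a different tuple of categories (for example adding "Cc") changes which characters count as separators.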

  • mixmastamyk 3 hours ago
    Splitlines is generally not needed. `for line in file:` is more idiomatic.
    • tiltowait 3 hours ago
      Splitlines additionally strips the newline character, functionality which is often (maybe even usually?) desired.
      • masklinn 24 minutes ago
        This has been controlled via a boolean parameter since at least 2.0, which as far as I can tell is when this method was added to `str`.
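
        For reference, a small sketch of that parameter (keepends) with a made-up sample string:

```python
text = "first\nsecond\r\nthird"

# Default: line endings are stripped from each piece.
stripped = text.splitlines()

# keepends=True: each piece keeps its original line ending.
kept = text.splitlines(keepends=True)

print(stripped)  # ['first', 'second', 'third']
print(kept)      # ['first\n', 'second\r\n', 'third']
```

        With keepends=True the pieces can be joined back into the original string unchanged.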
    • fulafel 1 hour ago
      It has similar (but not identical) behaviour though:

        >>> for line in StringIO("foo\x85bar\vquux\u2028zoot"): print(line)
        ... 
        foo
        bar
         quux zoot
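
      A sketch that makes the difference visible without relying on how the terminal renders those characters. The sample string is the one above plus a made-up "\nlast" suffix; with StringIO's default newline setting, iteration should only break on "\n".

```python
from io import StringIO

text = "foo\x85bar\vquux\u2028zoot\nlast"

# Iterating a StringIO (default newline handling) breaks only on "\n"
# and keeps the line ending on each piece.
via_iteration = list(StringIO(text))

# splitlines() also breaks on \x85, \v and \u2028, and strips the endings.
via_splitlines = text.splitlines()

print(via_iteration)
print(via_splitlines)
```

      Printing the lists (rather than each line) shows the escapes via repr, so nothing depends on the terminal.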
    • rangerelf 3 hours ago
      What if the text is already in a [string] buffer?
      • mixmastamyk 3 hours ago
        StringIO can help, .rstrip() for the sibling comment.
    • drdrey 3 hours ago
      not every line is read from a file
      • mixmastamyk 3 hours ago
        That's where the "generally" fits in.
        • crazygringo 1 hour ago
          No, because that still assumes files are the general usage.

          In my experience, they're not. It's strings.

  • cuckoos-jicamas 3 hours ago
    The str.split() function does the same:

    >>> s = "line1\nline2\rline3\r\nline4\vline5\x1dhello"
    >>> s.split()
    ['line1', 'line2', 'line3', 'line4', 'line5', 'hello']
    >>> s.splitlines()
    ['line1', 'line2', 'line3', 'line4', 'line5', 'hello']

    But split() has a sep argument that defines the delimiter to split on, in which case it does what you would expect:

    >>> s.split('\n')
    ['line1', 'line2\rline3\r', 'line4\x0bline5\x1dhello']

    In general you want this:

    >>> import re
    >>> linesep_splitter = re.compile(r'\n|\r\n?')
    >>> linesep_splitter.split(s)
    ['line1', 'line2', 'line3', 'line4\x0bline5\x1dhello']

    • roelschroeven 30 minutes ago
      In that example str.split() has the same result as str.splitlines(), but in general they are not the same, even without a custom delimiter.

      str.split() splits on any type of whitespace, including tabs and spaces, which splitlines() doesn't do.

          >>> 'one two'.split()
          ['one', 'two']
          >>> 'one two'.splitlines()
          ['one two']
      
      split() without a custom delimiter also treats a run of consecutive whitespace as a single separator, which splitlines() doesn't do either (except for \r\n, because that combination counts as one line ending):

          >>> 'one\n\ntwo'.split()
          ['one', 'two']
          >>> 'one\n\ntwo'.splitlines()
          ['one', '', 'two']
    • gertlex 47 minutes ago
      splitlines() is sometimes nice for ad hoc parsing (of well-behaved stuff...) because it doesn't leave an empty string at the end of the resulting list when the text ends with a newline. (Empty and whitespace-only lines in the middle are kept.)

      My #1 use case for that is probably handling the trailing newline in the output of a command I ran via subprocess.
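
      A quick check of the trailing-newline case, as a sketch. The output string is made up and stands in for what a subprocess call might return. splitlines() does not yield a trailing empty string, but empty and whitespace-only lines in the middle appear to be kept, so dropping those needs an explicit filter.

```python
# Made-up stand-in for captured command output with a trailing newline.
output = "one\n\n  \ntwo\n"

lines = output.splitlines()
naive = output.split("\n")

# Explicit filter for callers who also want blank lines dropped.
non_blank = [line for line in lines if line.strip()]

print(lines)      # ['one', '', '  ', 'two']
print(naive)      # ['one', '', '  ', 'two', '']
print(non_blank)  # ['one', 'two']
```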

  • meken 3 hours ago
    TIL: Python has a splitlines function
  • wvbdmp 4 hours ago
    What, no <br\s*\/?>?
  • zzzeek 3 hours ago
    in the same theme, NTLAIL strip(), rstrip(), lstrip() can strip other kinds of characters besides whitespace.
    • masklinn 17 minutes ago
      One thing to note, though, is that they take character sets: as long as they encounter characters in the specified set, they will keep stripping. Lots of people think that if you give them a string, they will remove that exact string.

      That feature was added in 3.9 with the addition of `removeprefix` and `removesuffix`.

      Sadly,

      1. unlike Rust's version they provide no way of knowing whether they stripped things out

      2. unlike startswith/endswith they do not take tuples of prefixes/suffixes
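
      A sketch of both points, assuming Python 3.9+ for removesuffix. The helper remove_any_prefix is hypothetical (not part of the standard library); it is just one way to work around the two limitations listed above.

```python
# rstrip() treats its argument as a set of characters, not a suffix.
surprising = "test.txt".rstrip(".txt")
expected = "test.txt".removesuffix(".txt")
print(surprising)  # tes
print(expected)    # test


def remove_any_prefix(text, prefixes):
    """Hypothetical helper: strip the first matching prefix.

    Returns (new_text, removed) so the caller can tell whether anything
    was actually stripped. Accepts any iterable of candidate prefixes.
    """
    for prefix in prefixes:
        if prefix and text.startswith(prefix):
            return text[len(prefix):], True
    return text, False


schemes = ("http://", "https://")
hit = remove_any_prefix("https://example.com", schemes)
miss = remove_any_prefix("example.com", schemes)
print(hit)   # ('example.com', True)
print(miss)  # ('example.com', False)
```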

  • 7bit 5 hours ago
    This article provides no additional value to the splitlines() docs.
    • woodruffw 4 hours ago
      The "article" is my TIL mini-blog. What were you expecting besides a "today I learned"?
      • kstrauser 4 hours ago
        I already knew this information, more or less, but I like reading TIL posts like this. It's fun seeing someone learn new things, and sometimes I pick up something myself, or at least look at it in a new way.
      • cap11235 3 hours ago
        Yeah, don't listen to parent. I like these sorts of articles a lot; it's only useless if you assume that everyone interested has also memorized the Python docs fully (which I imagine is zero people). Fun technical tangents are quite fun indeed.
      • zahlman 3 hours ago
        What is "yossarian", BTW? I'd gotten confused thinking it was someone else's blog, because I naturally parse that as a surname.
        • woodruffw 3 hours ago
          John Yossarian is the protagonist of Joseph Heller’s Catch-22[1], which was my favorite book in high school. Like a lot of people, my handle is a slightly embarrassing memorialization of my younger self :-)

          [1]: https://en.wikipedia.org/wiki/Catch-22

          • di 1 hour ago
            Don't be embarrassed, it's a good book (and was my favorite too).
          • zahlman 3 hours ago
            > Like a lot of people, my handle is a slightly embarrassing memorialization of my younger self :-)

            ... Guilty, actually.

    • rsyring 4 hours ago
      Sometimes value is measured by awareness. I benefited from becoming aware of the behavior because of the article. Yes, it's in the docs, but the docs are not something I would have gone looking to read today.
    • diath 4 hours ago
      The value of this article, to me, is that I'd never read the splitlines documentation, so this is a little detail that I just learned thanks to it being linked here.
    • happytoexplain 3 hours ago
      I've been working with Python for a year or so now, and never knew this. I'm grateful to the author.
    • felipelemos 3 hours ago
      For all of us who don't read all the documentation for every single method, tool, function or similar, it is, by raising awareness, very useful.