Really hope it's done now.

larryhastings · Sep 16, 2023 · bbb7f25
1 parent 0c432f3
commit bbb7f25
Showing 1 changed file with 84 additions and 49 deletions.
diff --git a/README.md b/README.md
@@ -4219,14 +4219,15 @@ argument: an iterable of separator strings.
 Although you can use any iterable of strings
 you like, most often you'll be separating on some
 form of whitespace.  But... what, specifically,
-is whitespace?  The answer to this question is
-surprisingly complicated, once you examine the
-details.
+is whitespace?  Although this question has a simple
+answer that is usually good enough, answering
+this question completely accurately is a surprisingly
+complicated undertaking.
 
-However, you almost certainly have nothing to worry about.
-These days the only whitespace characters you're likely to
-encounter are spaces, tabs, newlines, and maybe carriage returns.
-Python and **big** handle all those just fine.
+The good news is, you can almost certainly ignore all the
+complexity.  These days the only whitespace characters you're
+likely to encounter are spaces, tabs, newlines, and maybe
+carriage returns.  Python and **big** handle all those easily.
 
 **big** defines four values designed to be used as
 a `separators` argument.  All four of these are tuples
@@ -4301,24 +4302,24 @@ which respectively represent "file separator", "group separator",
 "record separator", and "unit separator".
 I'll refer to these as "the four ASCII separator characters".
 
-These characters were defined as part of the ancient ASCII
-standard.  They were meant to be used as separator characters
-for data as their names suggest, the same way
-(Ctrl-Z was used to indicate end-of-file in the CPM and earliest
-FAT filesystems.)[https://en.wikipedia.org/wiki/End-of-file#EOF_character]
+These characters were defined as part of [the original ASCII
+standard,](https://en.wikipedia.org/wiki/ASCII) way back in 1963.
+As their names suggest, they were intended to be used as separator
+characters for data, the same way
+[Ctrl-Z was used to indicate end-of-file in the CP/M and earliest
+FAT filesystems.](https://en.wikipedia.org/wiki/End-of-file#EOF_character)
 But the four ASCII separator characters were rarely used,
-even in the glory days of ASCII.  Today they're practically
-unheard of.
+even back in the day.  Today they're practically unheard of.
 
 As a rule, printing these characters to the screen generally
-doesn't produce anything--they don't move the cursor, the
+doesn't produce anything--they don't move the cursor, and the
 screen doesn't change.
 So their behavior is a bit mysterious.  A lot of people--including
 early Python programmers it seems!--thought that meant they're
 whitespace.  This is a strange conclusion; after all, all the
 well-known whitespace characters move the cursor, and these do not.
 
-However!  The Unicode standard is crystal clear: these
+However, the Unicode standard is crystal clear: these
 characters are *not whitespace.*  And yet Python's "Unicode object"
 behaves as if they are.  So I'd say this is a bug; Python's Unicode
 object should implement what the Unicode standard says.
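
The `str` behavior described in this hunk can be checked with nothing but built-in `str` methods. The following is a minimal sketch and not part of the README; the variable names are chosen only for illustration.

```python
# The four ASCII separator characters: FS, GS, RS, US.
ascii_separators = "\x1c\x1d\x1e\x1f"

# Python's str object reports all four as whitespace...
reported_as_space = [c.isspace() for c in ascii_separators]

# ...so str.split() with no arguments splits on every one of them.
fields = "a\x1cb\x1dc\x1ed\x1fe".split()

print(reported_as_space)  # [True, True, True, True]
print(fields)             # ['a', 'b', 'c', 'd', 'e']
```
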
@@ -4334,9 +4335,9 @@ convenience--and backwards-compatibility with Python 2--Python's
 `bytes` objects support several method calls that treat the data
 as if it were "ASCII-compatible".
 
-The surprise: Python `bytes` objects recognize a *different* set
-of whitespace characters.  Here's the list of all bytes recognized
-by Python `bytes` objects as whitespace:
+The surprise: These methods on Python `bytes` objects recognize
+a *different* set of whitespace characters.  Here's the list of
+all bytes recognized by Python `bytes` objects as whitespace:
 
     # char  decimal  hex    name
     #######################################
@@ -4355,12 +4356,12 @@ The good news is, this list is the same as ASCII's list,
 and it agrees with Unicode.
 In fact this list is quite familiar to C programmers;
 it's the same whitespace characters recognized by the
-standard C function `isspace()` (in `ctypes.h`).
+standard C function [`isspace()` (in `ctype.h`).](https://www.oreilly.com/library/view/c-in-a/0596006977/re129.html)
 Python has used this function to decide which characters
 are and aren't whitespace in 8-bit strings since its very
 beginning.
 
-Thes surprising news is, this list *doesn't* contain the
+Notice that this list *doesn't* contain the
 four ASCII separator characters.  This means you could
 define a Python `str` object using only characters defined
 in ASCII, and encode it to a `bytes` object using the
@@ -4398,11 +4399,11 @@ Again, this is different from [the list of characters
 defined as line-breaking whitespace in Unicode.](https://en.wikipedia.org/wiki/Newline#Unicode)
 And again it's because Python defines some of the four ASCII separator
 characters as line-breaking characters.  In this case
-it's only the first three.... Python doesn't consider
+it's only the first three; Python doesn't consider
 the fourth, "unit separator", as a line-breaking character.
-I don't know why Python draws this distinction...
+(I don't know why Python draws this distinction...
 but then again, I don't know why it considers the
-first three to be line-breaking  It's *all* a mystery to me.
+first three to be line-breaking.  It's *all* a mystery to me.)
 
 Here's the list of all characters recognized by
 Python `bytes` objects as line-breaking characters:
@@ -4430,13 +4431,17 @@ advancing at least one line.
 
 To be crystal clear: the odds that any of this will cause
 a problem for you are *extremely* low.  In order for it
-to make a difference, you'd have to encounter text using
-one of these six characters where Python disagrees with
-Unicode and ASCII--the four ASCII separator characters,
-vertical tab, and form feed--and you'd have to split the
-input on some form of whitespace, and you'd have to get
-different results, *and* this difference in results would
-have to be important.  This is all extremely unlikely.
+to make a difference:
+
+* you'd have to encounter text using one of these six characters
+  where Python disagrees with Unicode and ASCII, and
+* you'd have to process the input based on some definition
+  of whitespace, and
+* it would have to produce different results than you might
+  have otherwise expected, *and*
+* this difference in results would have to be important.
+
+This is all extremely unlikely.
 
 In case this *does* affect you, **big** has
 a complete set of predefined whitespace tuples that will
@@ -4454,44 +4459,44 @@ tuple contains the subset of whitespace characters that
 move the cursor vertically.
 
 The most important two values start with `str_`:
-['str_whitespace'](#str_whitespace)
+[`str_whitespace`](#str_whitespace)
 and
-['str_linebreaks'.](#str_linebreaks)These contain
+[`str_linebreaks`.](#str_linebreaks) These contain
 all the whitespace characters recognized by the Python
 `str` object.
 
 Next are two values that start with `unicode_`:
-['unicode_whitespace'](#unicode_whitespace)
+[`unicode_whitespace`](#unicode_whitespace)
 and
-['unicode_linebreaks'.](#unicode_linebreaks)
+[`unicode_linebreaks`.](#unicode_linebreaks)
 These
 contain all the whitespace characters defined in the
 Unicode standard.  (These are almost the same as
 the `str_` equivalents, except they omit the four
 ASCII separator characters.)
 
 Third, two values that start with `ascii_`:
-['ascii_whitespace'](#ascii_whitespace)
+[`ascii_whitespace`](#ascii_whitespace)
 and
-['ascii_linebreaks'.](#ascii_linebreaks)
+[`ascii_linebreaks`.](#ascii_linebreaks)
 These
 contain all the whitespace characters defined in
 ASCII.  (Effectively, these are filtered versions of
 the `unicode_` equivalents, containing only the
 characters `c` where `ord(c) < 128`.)
 
 Fourth, two values that start with `bytes_`:
-['bytes_whitespace'](#bytes_whitespace)
+[`bytes_whitespace`](#bytes_whitespace)
 and
-['bytes_linebreaks'.](#bytes_linebreaks)
+[`bytes_linebreaks`.](#bytes_linebreaks)
 These contain
 all the whitespace characters recognized by the Python
 `bytes` object.
 
 Finally we have the two tuples that lack a prefix:
-['whitespace'](#whitespace)
+[`whitespace`](#whitespace)
 and
-['linebreaks'.](#linebreaks)
+[`linebreaks`.](#linebreaks)
 These are the tuples
 you should use most of the time, and several **big**
 functions use them as default values.  These are
@@ -4522,8 +4527,10 @@ daily lives of computer users.  Python went through several
 iterations on how to handle this, eventually settling on
 ["universal newlines"](https://peps.python.org/pep-0278/)
 in Python 2.3.
-These days the world seems to be converging on `'\n'`;
-Windows supports it, and it's the default everywhere else.
+These days the world seems to be converging on one standard,
+the UNIX standard `'\n'`;
+Windows supports it, and it's the default on every other modern
+platform.
 So in practice you probably don't have end-of-line conversion
 problems, either.
 
@@ -4550,35 +4557,60 @@ e.g. `whitespace_without_crlf`, `bytes_linebreaks_without_crlf`.
 ### Whitespace and line-breaking characters for other platforms
 
 What if you need to split text by whitespace, or by lines,
-but that text has some other unusual encoding?  **big** makes
-that easy too.  You can make your own tuple from scratch,
+but that text is in `bytes` format with an unusual encoding?
+**big** makes that easy too.  If one of the builtin tuples
+won't work for you, you can make your own tuple from scratch,
 or modify an existing tuple to meet your needs.
 
 For example, let's say you need to split a document by
 whitespace, and the document is encoded in [code page 850,
 aka "latin-1".](https://en.wikipedia.org/wiki/Code_page_850)
 Normally the easiest thing would be to decode it to a `str` object
-using the `latin-1` text codec, then operate on it normally.
+using the `'latin-1'` text codec, then operate on it normally.
 But you might have reasons why you don't want to decode it--maybe
 the document is damaged and doesn't decode properly, and it's
 easier to just work around the damage than to fix it.  If you
 want to process it with a **big** function that accepts a
 `separators` argument, you could make your own custom tuple
 of "latin-1" whitespace characters.  "latin-1" has the same
 whitespace characters as ASCII, but adds one more, value 255,
-which is not a line-breaking character.  So this is totally easy:
+which is not line-breaking.  So this is totally easy:
 
     latin_1_whitespace = big.bytes_whitespace + (b'\xff',)
     latin_1_linebreaks = big.bytes_linebreaks
 
 What if you want to process a `bytes` object containing
 UTF-8?  That's easy too.  Just convert one of the existing
-tuples using `big.encode_strings`.  For example, if you
-wanted to split a UTF-8 encoded bytes object `o` using
+tuples containing `str` objects using
+[`big.encode_strings`.](#encode_stringso--encodingascii)
+For example, to split a UTF-8 encoded bytes object `o` using
 the Unicode linebreak characters, you could call:
 
     multisplit(o, encode_strings(unicode_linebreaks, encoding='utf-8'))
 
+Note that this technique probably won't work correctly for other
+multibyte encodings like [UTF-16.](https://en.wikipedia.org/wiki/UTF-16)
+In these cases you should decode the bytes to `str` first.
+
+Why?  It's because `multisplit` could find matches in multibyte
+sequences *straddling* characters, similar to this example:
+
+```Python
+>>> haystack = '\u0101\u0102'
+>>> needle = '\u0201'
+>>> needle in haystack
+False
+>>> 
+>>> encoded_haystack = haystack.encode('utf-16-le')
+>>> encoded_needle = needle.encode('utf-16-le')
+>>> encoded_needle in encoded_haystack
+True
+```
+
+The character `'\u0201'` doesn't appear in the original string,
+but the *encoded* version appears in the *encoded* string.
+
+
 </dd></dl>
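
The straddling match from the UTF-16 example in this hunk can be located precisely with `bytes.find`. This is a hedged sketch using only built-in methods, not code from **big**; it shows that the byte-level hit starts at an odd offset, in the middle of a two-byte code unit, and that decoding to `str` first makes the false match go away.

```python
haystack = "\u0101\u0102"
needle = "\u0201"

encoded_haystack = haystack.encode("utf-16-le")  # b'\x01\x01\x02\x01'
encoded_needle = needle.encode("utf-16-le")      # b'\x01\x02'

# The byte-level match starts at offset 1, which is not a multiple of 2,
# so it straddles two encoded characters rather than matching one.
offset = encoded_haystack.find(encoded_needle)

# Decoding back to str before searching avoids the false match.
decoded = encoded_haystack.decode("utf-16-le")
found_after_decoding = needle in decoded

print(offset)                # 1
print(found_after_decoding)  # False
```
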
 
 
@@ -5393,6 +5425,9 @@ in the **big** test suite.
   a better job of selling `multisplit` to the reader.
 * The usual smattering of small doc fixes and improvements.
 
+My thanks again to Eric V. Smith for his willingness to ponder and discuss these
+issues.  Eric is now officially a contributor to **big,** increasing the project's
+[bus factor](https://en.wikipedia.org/wiki/Bus_factor) to two.  Thanks, Eric!
 
 #### 0.10
 <dl><dd>