More doc updates.

larryhastings · Sep 15, 2023 · commit 4ab77ac
1 parent 409f4f8
Showing 1 changed file with 91 additions and 42 deletions.
diff --git a/README.md b/README.md
@@ -5,16 +5,16 @@
 [![# test badge](https://img.shields.io/github/actions/workflow/status/larryhastings/big/test.yml?branch=master&label=test)](https://github.com/larryhastings/big/actions/workflows/test.yml) [![# coverage badge](https://img.shields.io/github/actions/workflow/status/larryhastings/big/coverage.yml?branch=master&label=coverage)](https://github.com/larryhastings/big/actions/workflows/coverage.yml) [![# python versions badge](https://img.shields.io/pypi/pyversions/big.svg?logo=python&logoColor=FBE072)](https://pypi.org/project/big/)
 
 
-**big** is a Python package of useful little bits of
-Python code I always want to have handy.  It's a central
-place for code that's useful but not big enough to go in
-its own module.
+**big** is a Python package of small functions and classes
+that aren't big enough to get a package of their own.
+It's zillions of useful little bits of
+Python code I always want to have handy.
 
-Finally!  For years, I've copied-and-pasted all my little
+For years, I've copied-and-pasted all my little
 helper functions between projects--we've all done it.
 But now I've finally taken the time to consolidate all those
-useful little functions into one *big* package, so they're always
-at hand, ready to use.
+useful little functions into one *big* package--no more
+copy-and-paste, I just install one package and I'm ready to go.
 And, since it's a public package, you can use 'em too!
 
 Not only that, but I've taken my time and re-thought and
@@ -26,7 +26,7 @@ functionality.
 we've all hacked together a million times--only with all the
 API gotchas fixed, and thoroughly tested with 100% coverage.
 It's the code you *would* have written... if only you had the time.
-And it's a real pleasure to use!
+It's a real pleasure to use!
 
 
 **big** requires Python 3.6 or newer.  Its only dependency
@@ -1046,12 +1046,12 @@ next item, examine it, and then push it back.  If any objects have
 been pushed onto the iterator, they are yielded first, before attempting
 to yield from the wrapped iterator.
 
-Pass in any `iterable` to the constructor.  Passing in an `iterable`
-of `None` means the `PushbackIterator` is created in an exhausted state.
+The constructor accepts one argument, an `iterable`, with a default of `None`.
+If `iterable` is `None`, the `PushbackIterator` is created in an exhausted state.
 
 When the wrapped `iterable` is exhausted (or if you passed in `None`
-to the constructor) you can still call push to add new items, at which
-point the `PushBackIterator` can be iterated over again.
+to the constructor) you can still call the `push` method to add items,
+at which point the `PushbackIterator` can be iterated over again.
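
The behavior described in this hunk (start exhausted when given `None`, come back to life after a push) can be sketched roughly as follows. `PushbackSketch` is a hypothetical, simplified stand-in written for illustration, not **big**'s actual `PushbackIterator`; in particular, yielding the most recently pushed item first is an assumption.

```python
class PushbackSketch:
    """Hypothetical, simplified stand-in; not big's PushbackIterator."""

    def __init__(self, iterable=None):
        # None means "start out exhausted", as the text describes.
        self._iterator = iter(iterable) if iterable is not None else None
        self._pushed = []

    def __iter__(self):
        return self

    def push(self, item):
        # Pushed items are served before the wrapped iterator.
        self._pushed.append(item)

    def __next__(self):
        if self._pushed:
            # Assumption: most recently pushed item comes out first.
            return self._pushed.pop()
        if self._iterator is None:
            raise StopIteration
        try:
            return next(self._iterator)
        except StopIteration:
            # Remember that the wrapped iterator is done.
            self._iterator = None
            raise


it = PushbackSketch()        # created exhausted
leftovers = list(it)         # nothing to yield
it.push('x')                 # pushing makes it iterable again
revived = next(it)
print(leftovers, revived)
```

Raising `StopIteration` and later yielding again is unusual for an iterator, but it is what allows pushing onto an exhausted instance to work.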
 
 In addition to the following methods, `PushbackIterator` supports
 the iterator protocol and testing for truth.  A `PushbackIterator`
@@ -2093,14 +2093,14 @@ isn't a
 
 <dl><dd>
 
-Uppercase the first character of every word in `s`.
-Leave the other letters alone.  `s` should be `str` or `bytes`.
+Uppercases the first character of every word in `s`,
+leaving the other letters alone.  `s` should be `str` or `bytes`.
 
 (For the purposes of this algorithm, words are
 any contiguous run of non-whitespace characters.)
 
 This function will also capitalize the letter after an apostrophe
-if the apostrophe
+if the apostrophe:
 
   * is immediately after whitespace, or
   * is immediately after a left parenthesis character (`'('`), or
@@ -2112,7 +2112,7 @@ if the apostrophe
 In this last case, the O or D will also be capitalized.
 
 Finally, this function will capitalize the letter
-after a quote mark if the quote mark
+after a quote mark if the quote mark:
 
 * is after whitespace, or
 * is the first letter of a string.
@@ -3079,7 +3079,10 @@ For more information, see the deep-dive on
 
 A tuple containing individual `str` objects
 for every whitespace character recognized by Python.
-Also contains `'\r\n'`,
+
+Also contains `'\r\n'`.  See the deep-dive section on
+[**The Unix, Mac, and DOS line-break conventions**](#the-unix-mac-and-dos-line-break-conventions)
+for more.
 
 Identical to `str_whitespace`.
 
@@ -3089,6 +3092,17 @@ For more information, please see the
 [**Whitespace and line-breaking characters in Python and big**](#whitespace-and-line-breaking-characters-in-python-and-big)
 deep-dive.
 
+</dd></dl>
+
+#### `whitespace_without_crlf`
+<dl><dd>
+
+Identical to [`whitespace`](#whitespace) except with `'\r\n'` removed.
+See the deep-dive section on
+[**The Unix, Mac, and DOS line-break conventions**](#the-unix-mac-and-dos-line-break-conventions)
+for more.
+
+</dd></dl>
 
 #### `wrap_words(words, margin=79, *, two_spaces=True)`
 
@@ -3965,16 +3979,45 @@ And I should know--`multisplit` is implemented using `re.split`!
 
 <dl><dd>
 
-### Whitespace characters
+### Overview
 
 Several functions in **big** take a `separators`
-argument, which is an iterator of separator strings.
+argument: an iterable of separator strings.
 Although you can separate on any iterable of strings
-you like, often you'll be separating on some form
-of whitespace.  Like many things in this world, it
-turns out this is a startlingly deep subject--it's
-complicated, and a little tricky.  As you'll see
-in a moment, **big** handles this situation adeptly,
+you like, most often you'll be separating on some
+form of whitespace.  This turns out to be a surprisingly
+complicated subject.
+
+However, for all practical purposes you probably have
+nothing to worry about.  These days, the only whitespace
+characters you're likely to encounter are spaces, tabs,
+newlines, and maybe carriage returns.  Python and **big**
+handle all those just fine.
+
+In the case of **big**, you
+only need to know about four values.  All four of these
+are tuples containing either `str` or `bytes` objects,
+and designed to be used as the argument for a
+`separators` parameter.
+
+* **big** defines
+  [`big.whitespace`](#whitespace) and [`big.linebreaks`](#linebreaks)
+  for working with `str` objects.  `whitespace` is a tuple
+  of all whitespace characters, and `linebreaks` is a tuple of just
+  the line-breaking whitespace characters.
+* For working with `bytes` string objects, **big** defines
+  [`big.bytes_whitespace`](#bytes_whitespace)
+  and [`big.bytes_linebreaks`.](#bytes_linebreaks)
+
+There are some subtle idiosyncrasies about how Python
+defines whitespace, but you're not likely to run across
+them.  If you *do*, you'll be pleased to know that **big**
+makes it easy to handle these odd situations.  The rest
+of this deep dive is an examination of these subtle--and
+likely irrelevant--idiosyncrasies.
+
+
+### Python
 
 Here's the list of all characters recognized by
 Python `str` objects as whitespace characters:
@@ -4016,17 +4059,17 @@ defined in Unicode, and testing to see if the `split()`
 method on a Python `str` object splits at that character.
 
 The first surprise: this *isn't* the same as the list of
-all code points defined by Unicode as whitespace.
-It's almost the same list, except Python adds four extra
+all characters defined by *Unicode* as whitespace.
+It's *almost* the same list, except Python adds four extra
 characters: `'\x1c'`,  `'\x1d'`,  `'\x1e'`, and `'\x1f'`.
 I'll refer to these as "the four ASCII separator characters".
 They're an ancient part of the ancient ASCII standard, and
 rarely used today.
 
 Unicode defines glyphs for these four characters, which
-means they aren't "whitespace" by definition--they're printing
-characters.  (And they *definitely* aren't supposed to cause
-linebreaks... more on that later.)
+means that by definition they aren't "whitespace"--they're
+printing characters.  (And they *definitely* aren't supposed
+to cause linebreaks... more on that later.)
 This bug goes back to Python 2; there's a ten-year-old
 issue on the Python issue tracker for it, and
 it's not making progress.
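
The claim about the four ASCII separator characters is easy to check with the standard library alone. This is a small sketch, not part of the README; `split_both_ways` is a hypothetical helper name. It also shows that `bytes.split()` does not treat these characters as whitespace.

```python
# The four ASCII separator characters named above.
ASCII_SEPARATORS = ('\x1c', '\x1d', '\x1e', '\x1f')


def split_both_ways(ch):
    """Return (str result, bytes result) of split() around one character."""
    text = f"a{ch}b"
    return text.split(), text.encode('ascii').split()


for ch in ASCII_SEPARATORS:
    str_parts, bytes_parts = split_both_ways(ch)
    # str.split() breaks at the character; bytes.split() leaves it in place.
    print(repr(ch), ch.isspace(), str_parts, bytes_parts)
```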
@@ -4068,7 +4111,8 @@ different results.
 There's a similar situation with line-breaking characters.
 Line-breaking characters are a subset of whitespace
 characters.  And, like whitespace characters, Python
-`str` objects don't agree with Unicode, and Python
+`str` objects don't agree with Unicode about what is
+and is not a line-breaking character, and Python
 `bytes` objects don't agree with either of those.
 
 Here's the list of all characters recognized by
@@ -4109,7 +4153,7 @@ Python `bytes` objects don't consider
 `'\v'` (vertical tab)
 and
 `'\f'` (form feed)
-as line break characters.  I assert this is wrong, too.
+as line break characters.  I assert this too is a bug.
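
The `str` versus `bytes` disagreement on vertical tab and form feed can be seen with `splitlines()`. This is a sketch using only built-in methods; `splitlines_both_ways` is a hypothetical helper name.

```python
def splitlines_both_ways(ch):
    """Return (str result, bytes result) of splitlines() around one character."""
    text = f"a{ch}b"
    return text.splitlines(), text.encode('ascii').splitlines()


for name, ch in (('vertical tab', '\v'), ('form feed', '\f'), ('newline', '\n')):
    str_lines, bytes_lines = splitlines_both_ways(ch)
    # str breaks the line at all three; bytes only breaks at the newline.
    print(f"{name:12}  str: {str_lines}  bytes: {bytes_lines}")
```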
 
 ### How **big** handles this situation
 
@@ -4119,7 +4163,7 @@ characters:
 * For `str` string objects, it defines
   [`whitespace`](#whitespace) and [`linebreaks`.](#linebreaks)
 * For `bytes` string objects, it defines
-  [`ascii_whitespace`](#whitespace) and [`ascii_linebreaks`.](#linebreaks)
+  [`bytes_whitespace`](#bytes_whitespace) and [`bytes_linebreaks`.](#bytes_linebreaks)
 
 These tuples are used as default values for some
 other **big** functions, like `multisplit` and `lines`.
@@ -4138,6 +4182,11 @@ characters as defined in various contexts.  For
 a total of twelve tuples.  Here's a list of all
 of them, with their defined values:
 
+# REWRITE THIS BUT DESCRIBE COMPOSABLY
+
+unicode_ prefix means X
+ascii_ prefix means Y
+
 <dl><dt>
 
 `whitespace`
@@ -4238,25 +4287,25 @@ Windows supports it, and it's the default everywhere else.
 So in practice you probably don't have end-of-line conversion
 problems, either.
 
-But, just in case, **big** has one more trick.  All ten of
+But, just in case, **big** has one more trick.  All of
 the tuples defined in the previous section--from `whitespace`
-to `utf8_linebreaks`--also contain this string:
+to `ascii_linebreaks`--also contain this string:
 
     '\r\n'
 
-(The 'bytes_' tuples contain the `bytes` equivalent,
+(The two `bytes_` tuples contain the `bytes` equivalent,
 `b'\r\n'`.)
 
 This addition means that, when you use one of these tuples
 with one of the **big** functions that take separators,
-you'll split on `\r\n` as if it was one character.  This
+it'll split on `\r\n` as if it were one character.  This
 means that **big** itself should automatically handle
 the DOS and Windows end-of-line character sequence, in
 case one happens to creep into your data.
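
To see why including `'\r\n'` in the separators matters, here is a minimal sketch built on `re.split`. The `split_on` helper and the two short tuples are hypothetical stand-ins for illustration; they are not **big**'s `multisplit` or its real tuples.

```python
import re


def split_on(text, separators):
    """Split text at any of the separators, preferring longer matches."""
    # Try longer separators first so '\r\n' wins over '\r' and '\n'.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = '|'.join(re.escape(sep) for sep in ordered)
    return re.split(pattern, text)


# Illustrative subsets only; the real tuples contain more characters.
with_crlf = ('\n', '\r', '\r\n')
without_crlf = ('\n', '\r')

text = 'one\r\ntwo\nthree'
print(split_on(text, with_crlf))     # '\r\n' counts as one separator
print(split_on(text, without_crlf))  # '\r' and '\n' split separately
```

Without the `'\r\n'` entry, the DOS line ending is seen as two separators with an empty string between them.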
 
 If you don't want this behavior, just add the suffix
 `_without_crlf` to the end of the tuple name,
-e.g. `whitespace_without_crlf`, `ascii_newlines_without_crlf`.
+e.g. `whitespace_without_crlf`, `bytes_linebreaks_without_crlf`.
 
 ### Whitespace and line-breaking characters for other platforms
 
@@ -4279,11 +4328,11 @@ more, value 255.  So it's easy:
 
 What if you want to split bytes containing UTF-8?  That's
 easy too.  Just convert one of the existing tuples using
-`big.convert_strings`.  For example, if you wanted to
+`big.encode_strings`.  For example, if you wanted to
 split a UTF-8 encoded bytes object `o` using the Unicode
 linebreak characters, you could call:
 
-    multisplit(o, convert_strings(ascii_linebreaks, encoding='utf-8')
+    multisplit(o, encode_strings(unicode_linebreaks, encoding='utf-8'))
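
The idea behind that call can be sketched with the standard library: encode each `str` separator as UTF-8, then split the bytes on the encoded forms. The tuple below is a hypothetical subset of line-break strings, and the generator expression and `re.split` call are stand-ins for `encode_strings` and `multisplit`, not their implementations.

```python
import re

# Hypothetical subset of Unicode line-break strings, for illustration only.
some_linebreaks = ('\r\n', '\n', '\r', '\x85', '\u2028', '\u2029')

# Stand-in for the conversion step: encode every separator as UTF-8 bytes.
encoded = tuple(s.encode('utf-8') for s in some_linebreaks)

# Longest separators first, so multi-byte sequences match as a unit.
ordered = sorted(encoded, key=len, reverse=True)
pattern = b'|'.join(re.escape(sep) for sep in ordered)

data = 'first\u2028second\r\nthird'.encode('utf-8')
parts = re.split(pattern, data)
print(parts)
```

Here U+2028 becomes the three-byte sequence `b'\xe2\x80\xa8'`, which is why the separators have to be encoded before splitting `bytes`.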
 
 </dd></dl>
 
@@ -5018,9 +5067,9 @@ in the **big** test suite.
   `_without_crlf`, and similarly changed `newlines` to `linebreaks`.
   Sorry for all the confusion.  This resulted from a lot of research into whitespace
   and newline characters, in Python, Unicode, and ASCII; please see the new
-  [**Whitespace and line-breaking characters in Python and big**](#https://github.com/larryhastings/big/tree/retool_whitespace_tuples#whitespace-and-line-breaking-characters-in-python-and-big)
-  deep-dive to see what all the fuss is about.  Here's a description of the
-  high-level changes:
+  [**Whitespace and line-breaking characters in Python and big**](#whitespace-and-line-breaking-characters-in-python-and-big)
+  deep-dive to see what all the fuss is about.  Here's a summary of all the
+  changes to the whitespace tuples:
 
         RENAMED TUPLES (old name -> new name)
           ascii_newlines               -> bytes_linebreaks