More doc updates.
larryhastings committed Sep 15, 2023
1 parent 409f4f8 commit 4ab77ac
Showing 1 changed file with 91 additions and 42 deletions.
133 changes: 91 additions & 42 deletions README.md
@@ -5,16 +5,16 @@
[![# test badge](https://img.shields.io/github/actions/workflow/status/larryhastings/big/test.yml?branch=master&label=test)](https://github.com/larryhastings/big/actions/workflows/test.yml) [![# coverage badge](https://img.shields.io/github/actions/workflow/status/larryhastings/big/coverage.yml?branch=master&label=coverage)](https://github.com/larryhastings/big/actions/workflows/coverage.yml) [![# python versions badge](https://img.shields.io/pypi/pyversions/big.svg?logo=python&logoColor=FBE072)](https://pypi.org/project/big/)


**big** is a Python package of useful little bits of
Python code I always want to have handy. It's a central
place for code that's useful but not big enough to go in
its own module.
**big** is a Python package of small functions and classes
that aren't big enough to get a package of their own.
It's zillions of useful little bits of
Python code I always want to have handy.

Finally! For years, I've copied-and-pasted all my little
For years, I've copied-and-pasted all my little
helper functions between projects--we've all done it.
But now I've finally taken the time to consolidate all those
useful little functions into one *big* package, so they're always
at hand, ready to use.
useful little functions into one *big* package--no more
copy-and-paste, I just install one package and I'm ready to go.
And, since it's a public package, you can use 'em too!

Not only that, but I've taken my time and re-thought and
@@ -26,7 +26,7 @@ functionality.
we've all hacked together a million times--only with all the
API gotchas fixed, and thoroughly tested with 100% coverage.
It's the code you *would* have written... if only you had the time.
And it's a real pleasure to use!
It's a real pleasure to use!


**big** requires Python 3.6 or newer. Its only dependency
@@ -1046,12 +1046,12 @@ next item, examine it, and then push it back. If any objects have
been pushed onto the iterator, they are yielded first, before attempting
to yield from the wrapped iterator.

Pass in any `iterable` to the constructor. Passing in an `iterable`
of `None` means the `PushbackIterator` is created in an exhausted state.
The constructor accepts one argument, an `iterable`, with a default of `None`.
If `iterable` is `None`, the `PushbackIterator` is created in an exhausted state.

When the wrapped `iterable` is exhausted (or if you passed in `None`
to the constructor) you can still call push to add new items, at which
point the `PushbackIterator` can be iterated over again.
to the constructor) you can still call the `push` method to add items,
at which point the `PushbackIterator` can be iterated over again.
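
The behavior described above can be sketched with a small stand-in class. This is a hypothetical illustration, not **big**'s actual implementation: the class name `PushbackSketch` is invented here, and the last-in, first-out order for pushed items is an assumption of the sketch.

```python
class PushbackSketch:
    """Hypothetical stand-in for illustration; not big's implementation."""

    def __init__(self, iterable=None):
        # None means "start out exhausted".
        self._iterator = iter(()) if iterable is None else iter(iterable)
        self._pushed = []

    def __iter__(self):
        return self

    def __next__(self):
        # Pushed items are yielded before the wrapped iterator is consulted.
        # (Assumption: most recently pushed item comes out first.)
        if self._pushed:
            return self._pushed.pop()
        return next(self._iterator)

    def push(self, o):
        self._pushed.append(o)


it = PushbackSketch()      # created in an exhausted state
print(list(it))            # nothing to yield
it.push("x")               # pushing makes it iterable again
print(next(it))
```

The point of the sketch is the last three lines: an exhausted iterator becomes usable again as soon as something is pushed onto it.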

In addition to the following methods, `PushbackIterator` supports
the iterator protocol and testing for truth. A `PushbackIterator`
@@ -2093,14 +2093,14 @@ isn't a

<dl><dd>

Uppercase the first character of every word in `s`.
Leave the other letters alone. `s` should be `str` or `bytes`.
Uppercases the first character of every word in `s`,
leaving the other letters alone. `s` should be `str` or `bytes`.

(For the purposes of this algorithm, words are
any contiguous run of non-whitespace characters.)

This function will also capitalize the letter after an apostrophe
if the apostrophe
if the apostrophe:

* is immediately after whitespace, or
* is immediately after a left parenthesis character (`'('`), or
@@ -2112,7 +2112,7 @@ if the apostrophe
In this last case, the O or D will also be capitalized.

Finally, this function will capitalize the letter
after a quote mark if the quote mark
after a quote mark if the quote mark:

* is after whitespace, or
* is the first letter of a string.
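
As a rough illustration of the basic rule only (uppercase the first character of each run of non-whitespace, leave the rest alone), here is a minimal sketch. The function name `capitalize_words_sketch` is invented for this example; it handles `str` only and deliberately ignores the apostrophe and quote-mark rules, so it is not a substitute for the real function.

```python
import re


def capitalize_words_sketch(s):
    """Uppercase the first character of every run of non-whitespace in s."""
    def fix_word(match):
        word = match.group(0)
        return word[0].upper() + word[1:]

    # \S+ matches one "word": a contiguous run of non-whitespace characters.
    return re.sub(r"\S+", fix_word, s)


print(capitalize_words_sketch("hello woRLD  again"))  # Hello WoRLD  Again
```

Note that the other letters and the original spacing are preserved, which is the difference from `str.title()`.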
@@ -3079,7 +3079,10 @@ For more information, see the deep-dive on

A tuple containing individual `str` objects
for every whitespace character recognized by Python.
Also contains `'\r\n'`,

Also contains `'\r\n'`. See the deep-dive section on
[**The Unix, Mac, and DOS line-break conventions**](#the-unix-mac-and-dos-line-break-conventions)
for more.

Identical to `str_whitespace`.

@@ -3089,6 +3092,17 @@ For more information, please see the
[**Whitespace and line-breaking characters in Python and big**](#whitespace-and-line-breaking-characters-in-python-and-big)
deep-dive.

</dd></dl>

#### `whitespace_without_crlf`
<dl><dd>

Identical to [`whitespace`](#whitespace) except with `'\r\n'` removed.
See the deep-dive section on
[**The Unix, Mac, and DOS line-break conventions**](#the-unix-mac-and-dos-line-break-conventions)
for more.

</dd></dl>

#### `wrap_words(words, margin=79, *, two_spaces=True)`

@@ -3965,16 +3979,45 @@ And I should know--`multisplit` is implemented using `re.split`!

<dl><dd>

### Whitespace characters
### Overview

Several functions in **big** take a `separators`
argument, which is an iterator of separator strings.
argument: an iterable of separator strings.
Although you can separate on any iterable of strings
you like, often you'll be separating on some form
of whitespace. Like many things in this world, it
turns out this is a startlingly deep subject--it's
complicated, and a little tricky. As you'll see
in a moment, **big** handles this situation adeptly,
you like, most often you'll be separating on some
form of whitespace. This turns out to be a surprisingly
complicated subject.

However, for all practical purposes you probably have
nothing to worry about. These days, the only whitespace
characters you're likely to encounter are spaces, tabs,
newlines, and maybe carriage return. Python and **big**
handle all those just fine.

In the case of **big**, you
only need to know about four values. All four of these
are tuples containing either `str` or `bytes` objects,
and designed to be used as the argument for a
`separators` parameter.

* **big** defines
[`big.whitespace`](#whitespace) and [`big.linebreaks`](#linebreaks)
for working with `str` objects. `whitespace` is a tuple
of all whitespace characters, and `linebreaks` is a tuple of just
the line-breaking whitespace characters.
* For working with `bytes` string objects, **big** defines
[`big.bytes_whitespace`](#bytes_whitespace)
and [`big.bytes_linebreaks`.](#bytes_linebreaks)

There are some subtle idiosyncrasies about how Python
defines whitespace, but you're not likely to run across
them. If you *do*, you'll be pleased to know that **big**
makes it easy to handle these odd situations. The rest
of this deep dive is an examination of these subtle--and
likely irrelevant--idiosyncrasies.


### Python

Here's the list of all characters recognized by
Python `str` objects as whitespace characters:
@@ -4016,17 +4059,17 @@ defined in Unicode, and testing to see if the `split()`
method on a Python `str` object splits at that character.
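
The approach described above can be reproduced with a few lines of standard-library Python. This is a sketch of the idea rather than the exact script used to build the list; the function name is made up for the example, and the result depends on the Unicode version bundled with your Python.

```python
import sys


def str_whitespace_by_split():
    """Collect every code point that str.split() treats as a separator."""
    found = []
    for code_point in range(sys.maxunicode + 1):
        c = chr(code_point)
        # If c is whitespace, "a<c>b" splits into exactly two pieces.
        if len(f"a{c}b".split()) == 2:
            found.append(c)
    return found


chars = str_whitespace_by_split()
print([f"U+{ord(c):04X}" for c in chars])
```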

The first surprise: this *isn't* the same as the list of
all code points defined by Unicode as whitespace.
It's almost the same list, except Python adds four extra
all characters defined by *Unicode* as whitespace.
It's *almost* the same list, except Python adds four extra
characters: `'\x1c'`, `'\x1d'`, `'\x1e'`, and `'\x1f'`.
I'll refer to these as "the four ASCII separator characters".
They're an ancient part of the ancient ASCII standard, and
rarely used today.
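
A quick way to see why Python treats these four as whitespace: the documentation for `str.isspace()` defines whitespace as general category `Zs` or bidirectional class `WS`, `B`, or `S`. The check below uses only the standard library and prints those properties for each character.

```python
import unicodedata

ascii_separators = ("\x1c", "\x1d", "\x1e", "\x1f")

for c in ascii_separators:
    print(
        repr(c),
        unicodedata.category(c),        # general category (a control character)
        unicodedata.bidirectional(c),   # bidirectional class
        c.isspace(),                    # what str thinks
        f"a{c}b".split(),               # str.split() splits here
        c.encode("ascii").isspace(),    # what bytes thinks
    )
```

The last column shows that `bytes` objects do not agree with `str` objects about these characters.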

Unicode defines glyphs for these four characters, which
means they aren't "whitespace" by definition--they're printing
characters. (And they *definitely* aren't supposed to cause
linebreaks... more on that later.)
means that by definition they aren't "whitespace"--they're
printing characters. (And they *definitely* aren't supposed
to cause linebreaks... more on that later.)
This bug goes back to Python 2; there's a ten-year-old
issue on the Python issue tracker for it, and
it's not making progress.
@@ -4068,7 +4111,8 @@ different results.
There's a similar situation with line-breaking characters.
Line-breaking characters are a subset of whitespace
characters. And, like whitespace characters, Python
`str` objects don't agree with Unicode, and Python
`str` objects don't agree with Unicode about what is
and is not a line-breaking character, and Python
`bytes` objects don't agree with either of those.

Here's the list of all characters recognized by
@@ -4109,7 +4153,7 @@ Python `bytes` objects don't consider
`'\v'` (vertical tab)
and
`'\f'` (form feed)
as line break characters. I assert this is wrong, too.
as line break characters. I assert this too is a bug.
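
The disagreement is easy to observe with `splitlines()`, using only built-in types:

```python
text = "one\vtwo\fthree\x1cfour"

# str splits at vertical tab, form feed, and the ASCII file separator.
print(text.splitlines())                  # ['one', 'two', 'three', 'four']

# bytes splits only at \n, \r, and \r\n, so nothing is split here.
print(text.encode("ascii").splitlines())  # [b'one\x0btwo\x0cthree\x1cfour']
```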

### How **big** handles this situation

@@ -4119,7 +4163,7 @@ characters:
* For `str` string objects, it defines
[`whitespace`](#whitespace) and [`linebreaks`.](#linebreaks)
* For `bytes` string objects, it defines
[`ascii_whitespace`](#whitespace) and [`ascii_linebreaks`.](#linebreaks)
[`bytes_whitespace`](#bytes_whitespace) and [`bytes_linebreaks`.](#bytes_linebreaks)

These tuples are used as default values for some
other **big** functions, like `multisplit` and `lines`.
@@ -4138,6 +4182,11 @@ characters as defined in various contexts. For
a total of twelve tuples. Here's a list of all
of them, with their defined values:

# REWRITE THIS BUT DESCRIBE COMPOSABLY

unicode_ prefix means X
ascii_ prefix means Y

<dl><dt>

`whitespace`
@@ -4238,25 +4287,25 @@ Windows supports it, and it's the default everywhere else.
So in practice you probably don't have end-of-line conversion
problems, either.

But, just in case, **big** has one more trick. All ten of
But, just in case, **big** has one more trick. All of
the tuples defined in the previous section--from `whitespace`
to `utf8_linebreaks`--also contain this string:
to `ascii_linebreaks`--also contain this string:

'\r\n'

(The 'bytes_' tuples contain the `bytes` equivalent,
(The two `bytes_` tuples contain the `bytes` equivalent,
`b'\r\n'`.)

This addition means that, when you use one of these tuples
with one of the **big** functions that take separators,
you'll split on `\r\n` as if it was one character. This
it'll split on `\r\n` as if it was one character. This
means that **big** itself should automatically handle
the DOS and Windows end-of-line character sequence, in
case one happens to creep into your data.
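
To see why having `'\r\n'` in the tuple matters, here is a minimal stand-in built on `re.split`. It is not `multisplit` and does not reproduce its options; `split_on_separators` and the two sample tuples are hypothetical names used only for this sketch.

```python
import re


def split_on_separators(s, separators):
    """Split s at any separator, trying longer separators first."""
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "|".join(re.escape(sep) for sep in ordered)
    return re.split(pattern, s)


with_crlf = ("\n", "\r", "\r\n")
without_crlf = ("\n", "\r")

# With '\r\n' present, the DOS line ending counts as one separator.
print(split_on_separators("a\r\nb\nc", with_crlf))     # ['a', 'b', 'c']

# Without it, '\r' and '\n' split separately, leaving an empty string.
print(split_on_separators("a\r\nb\nc", without_crlf))  # ['a', '', 'b', 'c']
```

Sorting the separators longest-first is what lets `'\r\n'` win over a lone `'\r'` at the same position.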

If you don't want this behavior, just add the suffix
`_without_crlf` to the end of the tuple name,
e.g. `whitespace_without_crlf`, `ascii_newlines_without_crlf`.
e.g. `whitespace_without_crlf`, `bytes_linebreaks_without_crlf`.

### Whitespace and line-breaking characters for other platforms

@@ -4279,11 +4328,11 @@ more, value 255. So it's easy:

What if you want to split bytes containing UTF-8? That's
easy too. Just convert one of the existing tuples using
`big.convert_strings`. For example, if you wanted to
`big.encode_strings`. For example, if you wanted to
split a UTF-8 encoded bytes object `o` using the Unicode
linebreak characters, you could call:

multisplit(o, convert_strings(ascii_linebreaks, encoding='utf-8')
multisplit(o, encode_strings(unicode_linebreaks, encoding='utf-8'))
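
The same idea can be sketched without **big** installed. Everything named here is an assumption made for illustration: `encode_all` stands in for the conversion step, `split_bytes` stands in for the splitting step, and `sample_linebreaks` is a short hand-picked tuple, not **big**'s `unicode_linebreaks`.

```python
import re


def encode_all(strings, encoding="utf-8"):
    """Hypothetical helper: encode each str in an iterable to bytes."""
    return tuple(s.encode(encoding) for s in strings)


def split_bytes(data, separators):
    """Hypothetical helper: split bytes at any of the given separators."""
    ordered = sorted(separators, key=len, reverse=True)
    pattern = b"|".join(re.escape(sep) for sep in ordered)
    return re.split(pattern, data)


# A small sample of line-breaking characters, chosen for this example.
sample_linebreaks = ("\n", "\r", "\r\n", "\x85", "\u2028", "\u2029")

data = "first\u2028second\r\nthird".encode("utf-8")
print(split_bytes(data, encode_all(sample_linebreaks)))
# [b'first', b'second', b'third']
```

Encoding the separators with the same encoding as the data is what makes multi-byte characters such as U+2028 match correctly.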

</dd></dl>

@@ -5018,9 +5067,9 @@ in the **big** test suite.
`_without_crlf`, and similarly changed `newlines` to `linebreaks`.
Sorry for all the confusion. This resulted from a lot of research into whitespace
and newline characters, in Python, Unicode, and ASCII; please see the new
[**Whitespace and line-breaking characters in Python and big**](#https://github.com/larryhastings/big/tree/retool_whitespace_tuples#whitespace-and-line-breaking-characters-in-python-and-big)
deep-dive to see what all the fuss is about. Here's a description of the
high-level changes:
[**Whitespace and line-breaking characters in Python and big**](#whitespace-and-line-breaking-characters-in-python-and-big)
deep-dive to see what all the fuss is about. Here's a summary of all the
changes to the whitespace tuples:

RENAMED TUPLES (old name -> new name)
ascii_newlines -> bytes_linebreaks
