Really hope it's done now.
larryhastings committed Sep 16, 2023
1 parent 0c432f3 commit bbb7f25
Showing 1 changed file with 84 additions and 49 deletions.
133 changes: 84 additions & 49 deletions README.md
@@ -4219,14 +4219,15 @@ argument: an iterable of separator strings.
Although you can use any iterable of strings
you like, most often you'll be separating on some
form of whitespace. But... what, specifically,
is whitespace? The answer to this question is
surprisingly complicated, once you examine the
details.
is whitespace? Although this question has a simple
answer that is usually good enough, answering
this question completely accurately is a surprisingly
complicated undertaking.

However, you almost certainly have nothing to worry about.
These days the only whitespace characters you're likely to
encounter are spaces, tabs, newlines, and maybe carriage returns.
Python and **big** handle all those just fine.
The good news is, you can almost certainly ignore all the
complexity. These days the only whitespace characters you're
likely to encounter are spaces, tabs, newlines, and maybe
carriage returns. Python and **big** handle all those easily.

**big** defines four values designed to be used as
a `separators` argument. All four of these are tuples
@@ -4301,24 +4302,24 @@ which respectively represent "file separator", "group separator",
"record separator", and "unit separator".
I'll refer to these as "the four ASCII separator characters".

These characters were defined as part of the ancient ASCII
standard. They were meant to be used as separator characters
for data as their names suggest, the same way
(Ctrl-Z was used to indicate end-of-file in the CPM and earliest
FAT filesystems.)[https://en.wikipedia.org/wiki/End-of-file#EOF_character]
These characters were defined as part of [the original ASCII
standard,](https://en.wikipedia.org/wiki/ASCII) way back in 1963.
As their names suggest, they were intended to be used as separator
characters for data, the same way
[Ctrl-Z was used to indicate end-of-file in the CP/M and earliest
FAT filesystems.](https://en.wikipedia.org/wiki/End-of-file#EOF_character)
But the four ASCII separator characters were rarely used,
even in the glory days of ASCII. Today they're practically
unheard of.
even back in the day. Today they're practically unheard of.

As a rule, printing these characters to the screen generally
doesn't produce anything--they don't move the cursor, the
doesn't produce anything--they don't move the cursor, and the
screen doesn't change.
So their behavior is a bit mysterious. A lot of people--including
early Python programmers it seems!--thought that meant they're
whitespace. This is a strange conclusion; after all, all the
well-known whitespace characters move the cursor, and these do not.

However! The Unicode standard is crystal clear: these
However, the Unicode standard is crystal clear: these
characters are *not whitespace.* And yet Python's "Unicode object"
behaves as if they are. So I'd say this is a bug; Python's Unicode
object should implement what the Unicode standard says.
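
As a quick illustration of the disagreement described above, here's a minimal sketch using only the standard library. It relies on the documented rule for `str.isspace()`, which treats a character as whitespace if its Unicode general category is `Zs` or its bidirectional class is `WS`, `B`, or `S`. The variable names are just for this example.

```python
import unicodedata

# The four ASCII separator characters: FS, GS, RS, US.
separators = '\x1c\x1d\x1e\x1f'

for c in separators:
    # str.isspace() keys off the bidirectional class, not the
    # Unicode White_Space property, which is how these slip in.
    print(hex(ord(c)), c.isspace(), unicodedata.bidirectional(c))

# Because str considers them whitespace, a bare split() splits on them.
split_result = 'a\x1fb'.split()
print(split_result)
```

All four characters report `True` from `isspace()`, and `split()` with no arguments treats them as separators, even though the Unicode standard doesn't list them as whitespace.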
@@ -4334,9 +4335,9 @@ convenience--and backwards-compatibility with Python 2--Python's
`bytes` objects support several method calls that treat the data
as if it were "ASCII-compatible".

The surprise: Python `bytes` objects recognize a *different* set
of whitespace characters. Here's the list of all bytes recognized
by Python `bytes` objects as whitespace:
The surprise: These methods on Python `bytes` objects recognize
a *different* set of whitespace characters. Here's the list of
all bytes recognized by Python `bytes` objects as whitespace:

# char decimal hex name
#######################################
@@ -4355,12 +4356,12 @@ The good news is, this list is the same as ASCII's list,
and it agrees with Unicode.
In fact this list is quite familiar to C programmers;
it's the same whitespace characters recognized by the
standard C function `isspace()` (in `ctypes.h`).
standard C function [`isspace()` (in `ctype.h`).](https://www.oreilly.com/library/view/c-in-a/0596006977/re129.html)
Python has used this function to decide which characters
are and aren't whitespace in 8-bit strings since its very
beginning.
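
If you'd like to check this yourself, here's a small sketch that asks Python directly which byte values `bytes.isspace()` accepts, and compares that against what `str.isspace()` accepts in the ASCII range. The variable names are only for illustration.

```python
# Every byte value that Python's bytes object considers whitespace.
bytes_ws = [b for b in range(256) if bytes([b]).isspace()]
print(bytes_ws)

# Every ASCII-range character that Python's str object considers whitespace.
str_ws_ascii = [i for i in range(128) if chr(i).isspace()]
print(str_ws_ascii)

# The difference between the two is exactly the four ASCII separators.
only_in_str = sorted(set(str_ws_ascii) - set(bytes_ws))
print(only_in_str)
```

The `bytes` list is tab, newline, vertical tab, form feed, carriage return, and space. The `str` list has those six plus values 28 through 31.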

Thes surprising news is, this list *doesn't* contain the
Notice that this list *doesn't* contain the
four ASCII separator characters. This means you could
define a Python `str` object using only characters defined
in ASCII, and encode it to a `bytes` object using the
@@ -4398,11 +4399,11 @@ Again, this is different from [the list of characters
defined as line-breaking whitespace in Unicode.](https://en.wikipedia.org/wiki/Newline#Unicode)
And again it's because Python defines some of the four ASCII separator
characters as line-breaking characters. In this case
it's only the first three.... Python doesn't consider
it's only the first three; Python doesn't consider
the fourth, "unit separator", as a line-breaking character.
I don't know why Python draws this distinction...
(I don't know why Python draws this distinction...
but then again, I don't know why it considers the
first three to be line-breaking It's *all* a mystery to me.
first three to be line-breaking. It's *all* a mystery to me.)
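
Here's a short sketch of that distinction, using only the builtin `splitlines` methods. The sample string is made up for this example.

```python
# One string containing each of the four ASCII separator characters.
text = 'a\x1cb\x1dc\x1ed\x1fe'

# str.splitlines() breaks on the first three, but not on "unit separator".
str_lines = text.splitlines()
print(str_lines)

# bytes.splitlines() doesn't break on any of them.
bytes_lines = text.encode('ascii').splitlines()
print(bytes_lines)
```

The `str` version produces four lines, with `'\x1f'` left embedded in the last one. The `bytes` version returns the whole thing as a single line.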

Here's the list of all characters recognized by
Python `bytes` objects as line-breaking characters:
@@ -4430,13 +4431,17 @@ advancing at least one line.

To be crystal clear: the odds that any of this will cause
a problem for you are *extremely* low. In order for it
to make a difference, you'd have to encounter text using
one of these six characters where Python disagrees with
Unicode and ASCII--the four ASCII separator characters,
vertical tab, and form feed--and you'd have to split the
input on some form of whitespace, and you'd have to get
different results, *and* this difference in results would
have to be important. This is all extremely unlikely.
to make a difference:

* you'd have to encounter text using one of these six characters
where Python disagrees with Unicode and ASCII, and
* you'd have to process the input based on some definition
of whitespace, and
* it would have to produce different results than you might
have otherwise expected, *and*
* this difference in results would have to be important.

This is all extremely unlikely.

In case this *does* affect you, **big** has
a complete set of predefined whitespace tuples that will
@@ -4454,44 +4459,44 @@ tuple contains the subset of whitespace characters that
move the cursor vertically.

The most important two values start with `str_`:
['str_whitespace'](#str_whitespace)
[`str_whitespace`](#str_whitespace)
and
['str_linebreaks'.](#str_linebreaks)These contain
[`str_linebreaks`.](#str_linebreaks) These contain
all the whitespace characters recognized by the Python
`str` object.

Next are two values that start with `unicode_`:
['unicode_whitespace'](#unicode_whitespace)
[`unicode_whitespace`](#unicode_whitespace)
and
['unicode_linebreaks'.](#unicode_linebreaks)
[`unicode_linebreaks`.](#unicode_linebreaks)
These
contain all the whitespace characters defined in the
Unicode standard. (These are almost the same as
the `str_` equivalents, except they omit the four
ASCII separator characters.)

Third, two values that start with `ascii_`:
['ascii_whitespace'](#ascii_whitespace)
[`ascii_whitespace`](#ascii_whitespace)
and
['ascii_linebreaks'.](#ascii_linebreaks)
[`ascii_linebreaks`.](#ascii_linebreaks)
These
contain all the whitespace characters defined in
ASCII. (Effectively, these are filtered versions of
the `unicode_` equivalents, containing only the
characters `c` where `ord(c) < 128`.)
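
To make that filtering concrete, here's a minimal sketch. The sample tuple is a hypothetical stand-in invented for this example, not the real contents of any **big** tuple, and this isn't necessarily how **big** builds its tuples internally. Since a separator can be more than one character long (like `'\r\n'`), the sketch checks every character in each string.

```python
# A small, made-up sample of Unicode whitespace separators.
unicode_sample = (' ', '\t', '\n', '\r\n', '\x85', '\xa0', '\u2028')

# Keep only the separators made entirely of ASCII characters.
ascii_sample = tuple(
    s for s in unicode_sample
    if all(ord(c) < 128 for c in s)
)
print(ascii_sample)
```

Only the space, tab, newline, and `'\r\n'` entries survive; the other three are outside ASCII.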

Fourth, two values that start with `bytes_`:
['bytes_whitespace'](#bytes_whitespace)
[`bytes_whitespace`](#bytes_whitespace)
and
['bytes_linebreaks'.](#bytes_linebreaks)
[`bytes_linebreaks`.](#bytes_linebreaks)
These contain
all the whitespace characters recognized by the Python
`bytes` object.

Finally we have the two tuples that lack a prefix:
['whitespace'](#whitespace)
[`whitespace`](#whitespace)
and
['linebreaks'.](#linebreaks)
[`linebreaks`.](#linebreaks)
These are the tuples
you should use most of the time, and several **big**
functions use them as default values. These are
@@ -4522,8 +4527,10 @@ daily lives of computer users. Python went through several
iterations on how to handle this, eventually settling on
["universal newlines"](https://peps.python.org/pep-0278/)
in Python 2.3.
These days the world seems to be converging on `'\n'`;
Windows supports it, and it's the default everywhere else.
These days the world seems to be converging on one standard,
the UNIX standard `'\n'`;
Windows supports it, and it's the default on every other modern
platform.
So in practice you probably don't have end-of-line conversion
problems, either.

@@ -4550,35 +4557,60 @@ e.g. `whitespace_without_crlf`, `bytes_linebreaks_without_crlf`.
### Whitespace and line-breaking characters for other platforms

What if you need to split text by whitespace, or by lines,
but that text has some other unusual encoding? **big** makes
that easy too. You can make your own tuple from scratch,
but that text is in `bytes` format with an unusual encoding?
**big** makes that easy too. If one of the builtin tuples
won't work for you, you can make your own tuple from scratch,
or modify an existing tuple to meet your needs.

For example, let's say you need to split a document by
whitespace, and the document is encoded in [code page 850,
aka "latin-1".](https://en.wikipedia.org/wiki/Code_page_850)
Normally the easiest thing would be to decode it to a `str` object
using the `latin-1` text codec, then operate on it normally.
using the `'latin-1'` text codec, then operate on it normally.
But you might have reasons why you don't want to decode it--maybe
the document is damaged and doesn't decode properly, and it's
easier to just work around the damage than to fix it. If you
want to process it with a **big** function that accepts a
`separators` argument, you could make your own custom tuple
of "latin-1" whitespace characters. "latin-1" has the same
whitespace characters as ASCII, but adds one more, value 255,
which is not a line-breaking character. So this is totally easy:
which is not line-breaking. So this is totally easy:

latin_1_whitespace = big.bytes_whitespace + (b'\xff',)
latin_1_linebreaks = big.bytes_linebreaks

What if you want to process a `bytes` object containing
UTF-8? That's easy too. Just convert one of the existing
tuples using `big.encode_strings`. For example, if you
wanted to split a UTF-8 encoded bytes object `o` using
tuples containing `str` objects using
[`big.encode_strings`.](#encode_stringso--encodingascii)
For example, to split a UTF-8 encoded bytes object `o` using
the Unicode linebreak characters, you could call:

multisplit(o, encode_strings(unicode_linebreaks, encoding='utf-8'))
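
If you're curious what's going on conceptually, here's a rough standard-library sketch of the same idea. To be clear, `encode_all` and `split_bytes` are hypothetical stand-ins written for this example; they are not **big**'s implementation of `encode_strings` or `multisplit`, and the `linebreaks` tuple is an assumed list based on the Unicode line-breaking characters.

```python
import re

# Assumed list of Unicode line-breaking separators, for illustration only.
linebreaks = ('\r\n', '\n', '\r', '\x0b', '\x0c', '\x85', '\u2028', '\u2029')

def encode_all(strings, encoding='utf-8'):
    """Hypothetical helper: encode each str separator to bytes."""
    return tuple(s.encode(encoding) for s in strings)

def split_bytes(data, separators):
    """Hypothetical helper: split bytes on any of the given separators."""
    # Try longer separators first, so b'\r\n' is preferred over b'\r'.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = b'|'.join(re.escape(s) for s in ordered)
    return re.split(pattern, data)

data = 'one\u2028two\r\nthree'.encode('utf-8')
parts = split_bytes(data, encode_all(linebreaks))
print(parts)
```

This works for UTF-8 because the encoding is self-synchronizing: the encoded form of one character can never show up in the middle of, or straddling, the encoded forms of other characters.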

Note that this technique probably won't work correctly for other
multibyte encodings like [UTF-16.](https://en.wikipedia.org/wiki/UTF-16)
In these cases you should decode to `str` first.

Why? It's because `multisplit` could find matches in multibyte
sequences *straddling* characters, similar to this example:

```Python
>>> haystack = '\u0101\u0102'
>>> needle = '\u0201'
>>> needle in haystack
False
>>>
>>> encoded_haystack = haystack.encode('utf-16-le')
>>> encoded_needle = needle.encode('utf-16-le')
>>> encoded_needle in encoded_haystack
True
```

The character `'\u0201'` doesn't appear in the original string,
but the *encoded* version appears in the *encoded* string.
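
One way to see that the match is bogus is to look at *where* it lands. This sketch reuses the same strings as above; in UTF-16, each of these characters occupies two bytes, so a real match would have to start at an even offset.

```python
haystack = '\u0101\u0102'.encode('utf-16-le')
needle = '\u0201'.encode('utf-16-le')

# Find where the encoded needle shows up in the encoded haystack.
index = haystack.find(needle)

# A genuine match would start on a two-byte boundary.
aligned = (index % 2 == 0)
print(index, aligned)

# Decoding first sidesteps the problem entirely.
found_after_decoding = '\u0201' in haystack.decode('utf-16-le')
print(found_after_decoding)
```

The match starts at offset 1, in the middle of the first character, which is why decoding to `str` before splitting is the safer approach.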


</dd></dl>


@@ -5393,6 +5425,9 @@ in the **big** test suite.
a better job of selling `multisplit` to the reader.
* The usual smattering of small doc fixes and improvements.

My thanks again to Eric V. Smith for his willingness to ponder and discuss these
issues. Eric is now officially a contributor to **big,** increasing the project's
[bus factor](https://en.wikipedia.org/wiki/Bus_factor) to two. Thanks, Eric!

#### 0.10
<dl><dd>