Skip to content

Commit

Permalink
Added old version of lines_split_comments & test.
Browse files Browse the repository at this point in the history
  • Loading branch information
larryhastings committed Aug 29, 2024
1 parent c470387 commit a75e6e2
Show file tree
Hide file tree
Showing 3 changed files with 337 additions and 119 deletions.
241 changes: 149 additions & 92 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2524,7 +2524,7 @@ For more information, see the deep-dive on

<dl><dd>

The second object yielded by a
The first object yielded by a
[`lines`](#liness-separatorsnone--line_number1-column_number1-tab_width8-kwargs)
iterator, containing metadata about the line.
You can add your own fields by passing them in
Expand Down Expand Up @@ -4970,11 +4970,29 @@ best practice for any field in `LineInfo`; you should amend
it, rather than set it outright.

Speaking of best practices for lines modifier functions,
it's also considered good hygiene to modify the `LineInfo`
object that was yielded to you. Don't create a new one
and yield that instead. Previous lines modifier iterators
may have added fields to the `LineInfo` that you need to
preserve.
it's also best practice to *modify* the *existing*
`LineInfo` object that was yielded to you, rather than
throwing it away, creating a new one, and yielding that
instead. Previous lines modifier iterators may have added
fields to the `LineInfo` that you'd to preserve.

### leading + line + trailing + end

Generally speaking, `LineInfo` objects obey an invariant.
For any `(info, line)` pair yielded by `lines` or a lines
modifier:

info.leading + line + info.trailing + info.end == info.line

That is, you can recreate the original line by concatenating
the "leading" string, the modified line, the "trailing" string,
and the "end" string.

However, this is no longer true when using lines modifiers that
replace characters in the line. For example, `lines_convert_tabs_to_spaces`
replaces tab characters with one or more space characters.
If the original line contains tabs, obviously the above invariant
will no longer hold true.

</dd></dl>

Expand Down Expand Up @@ -5607,53 +5625,68 @@ Lots of changes this time! Grouping by submodule:

The old version is still available under a new
name: `old_split_quoted_string`. It's deprecated, and will
eventually be removed, but not before August 2025
eventually be removed, but not before September 2025
(one year from now).

Changes:
* `split_quoted_string` used to use a hand-coded parser,
manually analyzing each character in the input text.
Now it uses `multisplit` to only examine the interesting
substrings. `multisplit` has a large startup cost
the first time you use a particular set of iterators,
but this information is cached for subsequent calls.
Bottom line, the new version is much faster
for larger workloads. (It's slower for trivial
examples... where speed doesn't matter.)
* The value it yields has changed:
* The old version yielded `(is_quote, segment)`, where
`is_quote` was a boolean value indicating whether or not
`segment` was quoted. If `segment` was quoted, it began
and ended with (single character) quote marks. To reassemble
the original string, join together all the `segment` strings
in order.
* The new version yields `(leading_quote, segment, trailing_quote)`,
where `leading_quote` and `trailing_quote` are either matching
quote marks or empty. If they're true values, the `segment`
string is inside the quotes. To reassemble the original string,
join together *all* the yielded strings in order.
* The `backslash` parameter has been replaced by a
new parameter, `escape`.
`escape` allows specifying the escape string, which
by default is '\\' (backslash). If you specify a false
value, there will be no escape character in strings.
* Another benefit of switching to `multisplit`: `quotes`
now supports quote delimiters and an escape string
of any nonzero length. If more than one quote delimiter
matches at a time, `split_quoted_string` will always
pick the longer string.
* By default `quotes` only contains `'`` (single-quote)
and `"` (double-quote). The previous version also
supported `"""` and `'''` by default; this is no longer
true, it was too opinionated and Python-specific.
used `"""` and `'''` as multiline quote marks
by default; this is no longer true, as it was too
opinionated and Python-specific.
* `split_quoted_string` also accepts a new parameter,
`state`, which sets the initial state of quoting.
* The `triple_quotes` parameter has been removed. (See
next bullet point.)
* `split_quoted_string` is now documented as being completely
agnostic about newlines. The previous version was, too;
even though the documentation discussed triple-quoted
strings vs single-quoted strings, in reality it didn't
ever care about newlines. With the updated API, it's
officially up to you to enforce any rules here
(e.g. "newlines aren't permitted in
single-quoted strings.")
* `split_quoted_string` used to use a hand-coded parser,
manually analyzing each character in the input text.
Now it uses `multisplit` to only examine the interesting
substrings. `multisplit` has a large startup cost
the first time you use a particular set of iterators,
but this information is cached for subsequent calls.
Bottom line, the new version is much faster
for larger workloads. (It's slower for trivial
examples... where speed doesn't matter.)
* Another benefit of switching to `multisplit`: `quotes`
now supports quote delimiters and an escape string
of any nonzero length. In the case of ambiguity--if
more than one quote delimiter matches at a
time--`split_quoted_string` will always pick the
longer string.
* `split_quoted_string` is now deliberately
(and documented-ly) completely agnostic about newlines.
The previous version was, too; even though the
documentation discussed triple-quoted strings vs
single-quoted strings, in reality it didn't ever care
about newlines. With the updated API, it's officially
up to you to enforce the rules you want (e.g. "newlines
aren't permitted in single-quoted strings.")

* Breaking change: `parse_delimiters` has also been
completely re-tooled, re-written... *and* re-named!
The new name is `split_delimiters`.

The old version is still available under the old
name. It's deprecated, and will eventually be
removed, but not before August 2025 (one year from now).
removed, but not before September 2025.

However, the old `Delimiter` class has been
renamed to `ParseDelimiter`; there's a new
`Delimiter` class used by `split_delimiters`.
Expand All @@ -5668,39 +5701,63 @@ Lots of changes this time! Grouping by submodule:
which specifies the initial state of nested delimiters.
* `split_delimiters` no longer cares if there were unclosed
open delimiters at the end of the string. (It used to
raise `ValueError`.)
* `parse_delimiters` manually parsed the string, character
by character. `split_delimiters` uses `multisplit`, so it
zips past the uninteresting characters to find the delimiters
and escape characters. It's always faster, except for
some trivial calls (which are fast enough anyway).
raise `ValueError`.) This includes quote marks; if you
don't want quoted strings to span multiple lines, it's up
to you to detect it and react (e.g. raise an exception).
* `parse_delimiters` manually parsed the input string
character by character. `split_delimiters` uses `multisplit`,
so it zips past the uninteresting characters and only examines
the delimiters and escape characters. It's always faster,
except for some trivial calls (which are fast enough anyway).
* Another benefit of using `multisplit`: open delimiters,
close delimiters, and the escape string may now all be
any nonzero length. (In the face of ambiguity,
`split_delimiters` will always choose the longer delimiter.)
* The `ParseDelimiter` object used with `parse_delimiters`
has a boolean `backslash` attribute; if it's True, that
has a boolean `backslash` attribute; if it was True, that
delimiter allows escaping using a backslash. The new
`Delimiter` class used with `split_delimiters` instead
has an `escape=c` attribute, where `c` is the escape
`Delimiter` class used with `split_delimiters` replaces that
with an `escape=c` attribute, where `c` is the escape
character you want to use with that set of delimiters.
All the predefined `Delimiter` values have been updated
to match.
* As mentioned above, the `Delimiter` object doesn't have
an `open` attribute. (`ParseDelimiter` still does.)

* Breaking change: the `LineInfo` constructor has added
a new `lines` positional parameter, in front of the
existing positional parameters. This should be the
`lines` iterator that yielded this `LineInfo` object.
It's stored in the `lines` attribute.

* New feature: `LineInfo` objects yielded by `lines`
previously had many optional fields, which might or might
not be added dynamically. Now all fields are pre-added.
(This makes the CPython 3.13 runtime happier; it really
wants you to set *all* your class's attributes in its
`__init__`.)
* As mentioned above, the new `Delimiter` object doesn't
have an `open` attribute. (`ParseDelimiter` still does.)

* Breaking change: `lines_strip_comments` has *also* been
completely rewritten and renamed. It's now named
`lines_strip_line_comments`.

Changes:
* The old function required quote marks and the escape string
to be single characters, and had a slightly-smelly
`triple_quotes` parameter to support multiline strings.
The new function allows quote marks to be of any length,
and has separate parameters for single-line quote marks
and multiline quote marks.
* The `backslash` parameter has been renamed to `escape`.
* The old function didn't enforce that strings shouldn't
span lines. The new version raises `SyntaxError`
if quoted strings aren't closed (unless they're explicitly
strings that support multiline).
* Breaking change to the old version: it used to write the
comment it rstripped to `info.comment`, and it threw away
any whitespace it stripped. It now obeys the modern
`LineInfo` aesthetic, and writes *both* the whitespace it
rstripped *and* the comment to `info.trailing`.


* Breaking change: the `LineInfo` constructor has a
new `lines` positional parameter, added *in front of*
the existing positional parameters. This new first argument
should be the `lines` iterator that yielded this
`LineInfo` object. It's stored in the `lines` attribute.

* `LineInfo` objects (yielded by `lines`) previously had
many optional fields, which might or might not be added
dynamically. Now all fields are pre-added. (This makes
the CPython 3.13 runtime happier; it really wants you to
set *all* your class's attributes in its `__init__`.)

`LineInfo` objects now always have these attributes:
* `lines`, which contains the base lines iterator.
Expand All @@ -5727,41 +5784,41 @@ Lots of changes this time! Grouping by submodule:
was matched with a regular expression, and `None` otherwise.

* `LineInfo` now has two new methods: `extend_leading`
and `extend_trailing`. These methods
move a leading or trailing substring from the current `line`
to the relevant field in `LineInfo`, maintaining all the
guaranteed invariants, and updating all related `LineInfo`
fields (like `column_number`).

* There have been plenty of changes to line modifiers, too:
* `lines_strip_comments` has been renamed to `lines_strip_line_comments`.
It's also been improved: now it raises `SyntaxError` if quoted
strings aren't closed.
* `lines_filter_comment_lines` has been renamed to
`lines_filter_line_comment_lines`. `lines_filter_line_comment_lines`
now enforces that single-quoted strings can't span lines,
and multi-quoted strings must be closed before the end of
the last line.
* `lines_strip` and `lines_rstrip` now accept a new `separators`
argument; this is an iterable of separators, like the argument
to `multisplit`.
The default value of `None` preserves the existing behavior,
stripping whitespace.
* `lines_grep` now adds a `match` attribute to the `LineInfo`
object, containing the return value from calling `re.search`.
(If you pass in `invert=True` to `lines_grep`, `lines_grep`
will still write `None` to the `match` attribute.)
* Bugfix: `lines_strip_indent` previously required
whitespace-only lines to obey the indenting rules.
My intention was always for `lines_strip_indent` to
behave like Python, and that includes not really caring
about the intra-line-whitespace for whitespace-only
lines. Now `lines_strip_indent` behaves more like Python:
a whitespace-only line behaves as if it has
the same indent as the previous line. (Not that the
indent value of an empty line should matter; this is
mostly just there to present a consistent interface to
the user.)
and `extend_trailing`. These methods move a leading or
trailing substring from the current `line` to the relevant
field in `LineInfo`, maintaining all the guaranteed
invariants, and updating all related `LineInfo` fields
(like `column_number`).

* `lines_filter_comment_lines` has been renamed to
`lines_filter_line_comment_lines`. `lines_filter_line_comment_lines`
now enforces that single-quoted strings can't span lines,
and multi-quoted strings must be closed before the end of
the last line. For backwards compatibility, the new function
is also available under the old name; this old name will
eventually be removed, but not before September 2025.

* `lines_strip` and `lines_rstrip` now accept a new `separators`
argument; this is an iterable of separators, like the argument
to `multisplit`.
The default value of `None` preserves the existing behavior,
stripping whitespace.

* `lines_grep` now writes to the `match` attribute to the `LineInfo`
object, containing the return value from calling `re.search`.
(If you pass in `invert=True` to `lines_grep`, `lines_grep`
still writes to the `match` attribute--but it always writes `None`.)

* Bugfix: `lines_strip_indent` previously required
whitespace-only lines to obey the indenting rules, which was
a mistake. My intention was always for `lines_strip_indent`
to behave like Python, and that includes not really caring
about the intra-line-whitespace for whitespace-only
lines. Now `lines_strip_indent` behaves more like Python:
a whitespace-only line behaves as if it has
the same indent as the previous line. (Not that the
indent value of an empty line should matter--but this
behavior is how you'd intuitively expect it to work.)

* New function: `split_title_case`, which splits a string
at title case change word boundaries.
Expand Down Expand Up @@ -5791,7 +5848,7 @@ Lots of changes this time! Grouping by submodule:

* Another minor speedup for `multisplit`: when `reverse=True`,
we used to reverse the results *three times!* We now explicitly
observe and manage the reverse state of the result and avoid
observe and manage the reverse state of the result, to avoid
needless reversing.

### scheduler
Expand Down
Loading

0 comments on commit a75e6e2

Please sign in to comment.