From 2ab1360d4452138825671f20c3c0bb010b26e522 Mon Sep 17 00:00:00 2001 From: Larry Hastings Date: Tue, 13 Aug 2024 12:50:59 -0700 Subject: [PATCH] Added split_title_case and combine_splits. Lots more to do! Lots of documentation and coverage work. But it's a start! --- README.md | 69 +++++----- big/text.py | 294 +++++++++++++++++++++++++++++++++++------- tests/test_text.py | 313 +++++++++++++++++++++++++++++++++++++++++---- 3 files changed, 577 insertions(+), 99 deletions(-) diff --git a/README.md b/README.md index 6088875..f59c55a 100644 --- a/README.md +++ b/README.md @@ -5498,9 +5498,10 @@ in the **big** test suite. re-tooled and re-written. The new API is simpler, easier to understand, and conceptually sharper. It's a major upgrade! - The old version is still available, exported under the - name `old_split_quoted_string`. It will be available - for at least one year, until at least August 2025. + The old version is still available under a new + name: `old_split_quoted_string`. It's deprecated, and will + be eventually removed, but not before August 2025 + (one year from now). Changes: * `split_quoted_string` used to use a hand-coded parser, @@ -5509,18 +5510,20 @@ in the **big** test suite. characters. `multisplit` has a large startup cost the first time you use a particular set of iterators, but this information is cached for subsequent calls. - Bottom line, the new version is slower for trivial - examples--where speed doesn't matter--but much faster - for larger workloads. + Bottom line, the new version is much faster + for larger workloads. (It's slower for trivial + examples... where speed doesn't matter.) * `quotes` may now contain quote delimiters of any nonzero length. * By default `quotes` only contains `'`` (single-quote) and `"` (double-quote). The previous version also activated `"""` and `'''` by default; this was judged to be too opinionated and Python-specific. - * The `backslash` parameter has been replaced by `escape`. 
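To make the `escape` behavior described above concrete, here is a minimal standalone sketch. It is not big's implementation, and `split_quoted` and its signature are invented for this illustration; it just shows how an escape string inside quotes keeps the following character from terminating the quoted section, and how passing a false value for `escape` disables escape processing entirely.

```python
def split_quoted(s, quote='"', escape='\\'):
    """Yield (in_quotes, segment) pairs for a string with one quote style.

    Toy sketch: a false `escape` value means no escape processing.
    """
    segment = []
    in_quotes = False
    i = iter(s)
    for c in i:
        if in_quotes and escape and c == escape:
            # the escape string forces the next character into the segment,
            # so an escaped quote mark doesn't close the quoted section
            segment.append(c)
            segment.append(next(i, ''))
            continue
        if c == quote:
            yield (in_quotes, ''.join(segment))
            segment = []
            in_quotes = not in_quotes
            continue
        segment.append(c)
    yield (in_quotes, ''.join(segment))
```

For example, `list(split_quoted('a "b \\" c" d'))` keeps the escaped quote inside the quoted segment, yielding `(True, 'b \\" c')` for the middle section.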
+ * The `backslash` parameter has been replaced by a + new parameter, `escape`. `escape` allows specifying the escape string, which - by default is '\\' (backslash). + by default is '\\' (backslash). If you specify a false + value, there will be no escape character in strings. * `split_quoted_string` also takes a new parameter, `initial`, which sets the initial state of quoting. * The `triple_quotes` parameter has been removed. (See @@ -5529,22 +5532,23 @@ in the **big** test suite. agnostic about newlines. The previous version was, too; even though the documentation discussed triple-quoted strings vs single-quoted strings, in reality it didn't - care about newlines inside either kind of string. With - the updated API, it's officially up to you whether or not - you want to enforce "newlines aren't permitted in - single-quoted strings." + ever care about newlines. With the updated API, it's + officially up to you to enforce any rules here + (e.g. "newlines aren't permitted in + single-quoted strings.") * Breaking change: the `LineInfo` constructor has added a new `lines` positional parameter, in front of the existing positional parameters. This should be the - `lines` iterator yielding this `LineInfo` object. + `lines` iterator that yielded this `LineInfo` object. It's stored in the `lines` attribute. * New feature: `LineInfo` objects yielded by `lines` previously had many optional fields, which might or might - not be added dynamicall. Now all fields are pre-added. - (This is gentler to the CPython 3.13 runtime.) + not be added dynamically. Now all fields are pre-added. + (This works better with assumptions inside the CPython 3.13 + runtime.) `LineInfo` objects now always have these attributes: * `lines`, which contains the base lines iterator. * `line`, which contains the original unmodified line. @@ -5555,24 +5559,30 @@ in the **big** test suite. * `indent`, which contains the indent level of the line if computed, and `None` otherwise. 
* `leading`, which contains the string stripped from - the beginning of the line. + the beginning of the line. Initially this is the + empty string. * `trailing`, which contains the string stripped from - the end of the line. - * `comment`, which contains the leftmost comment stripped - from the line. (If both are set, `trailing` comes before - `comment`.) + the end of the line. Initially this is the + empty string. * `end`, which is the end-of-line character - that ended the current line. The last line yielded will - always have an empty string for `end`; if the last character - of the text split by `lines` was an end-of-line character, - the last `line` yielded will be empty, and `info.end` will - also be empty. + that ended the current line. For the last line yielded, + `info.end` will always be the empty string. If the last + character of the text split by `lines` was an end-of-line + character, the last `line` yielded will be the empty string, + and `info.end` will also be the empty string. * `match`, which contains a `Match` object if this line was matched with a regular expression, and `None` otherwise. +* `LineInfo` now has two new methods: `extend_leading` + and `extend_trailing`. These methods + move a leading or trailing substring from the current `line` + to the relevant field in `LineInfo`, maintaining all the + guaranteed invariants, and updating all related `LineInfo` + fields (like `column_number`). + * There have been plenty of changes to line modifiers, too: * `lines_strip_comments` has been renamed to `lines_strip_line_comments`. - It's also been fixed: now it raises `SyntaxError` if quoted + It's also been improved: now it raises `SyntaxError` if quoted strings aren't closed. * `lines_filter_comment_lines` has been renamed to `lines_filter_line_comment_lines`. `lines_filter_line_comment_lines` @@ -5580,13 +5590,14 @@ in the **big** test suite. and multi-quoted strings must be closed before the end of the last line. 
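The `extend_leading` contract described above can be sketched with a toy class. This is hypothetical illustration code, not big's `LineInfo`: the real method also detabs the moved prefix when updating `column_number`, which this sketch omits.

```python
class MiniLineInfo:
    """Toy model of the LineInfo invariant: text moved off the front of
    the working line accumulates in `leading`, and `column_number`
    advances to match. (No tab handling in this sketch.)"""

    def __init__(self, line, column_number=1):
        self.line = line           # original, unmodified line
        self.leading = ''          # starts as the empty string
        self.column_number = column_number

    def extend_leading(self, s, line):
        # move the prefix s from the working line into self.leading
        assert line.startswith(s), f"line {line!r} doesn't start with {s!r}"
        self.leading += s
        self.column_number += len(s)
        return line[len(s):]
```

Usage: stripping four spaces of indent moves them into `leading` and advances `column_number` from 1 to 5, while `line` still holds the original text.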
* `lines_strip` and `lines_rstrip` now accept a new `separators` - argument; this is an iterable of separators, a la `multisplit`. + argument; this is an iterable of separators, like the argument + to `multisplit`. The default value of `None` preserves the existing behavior, stripping whitespace. * `lines_grep` now adds a `match` attribute to the `LineInfo` object, containing the return value from calling `re.search`. - (If you pass in `invert=True` to `lines_grep`, the `match` - attribute will always be `None`.) + (If you pass in `invert=True` to `lines_grep`, the `match` + attribute of every line it yields is set to `None`.) * Bugfix: `lines_strip_indent` previously required whitespace-only lines to obey the indenting rules. My intention was always for `lines_strip_indent` to diff --git a/big/text.py b/big/text.py index 220a9dc..e702a7e 100644 --- a/big/text.py +++ b/big/text.py @@ -26,6 +26,7 @@ import enum import functools +import heapq import itertools from itertools import zip_longest from .itertools import PushbackIterator @@ -1191,7 +1192,6 @@ def multisplit(s, separators=None, *, for both must be the same (str or bytes). multisplit will only return str or bytes objects. """ - is_bytes = isinstance(s, bytes) separators_is_bytes = isinstance(separators, bytes) separators_is_str = isinstance(separators, str) @@ -1348,6 +1348,202 @@ def multirpartition(s, separators, count=1, *, reverse=False, separate=True): utf8_double_quotes = double_quotes.encode('utf-8') _export_name('utf8_double_quotes') + +@_export +def split_title_case(s, *, split_allcaps=True): + """ + Splits s into words, assuming that + upper-case characters start new words. + Returns an iterator yielding the split words. + + Example: + list(split_title_case('ThisIsATitleCaseString')) + is equal to + ['This', 'Is', 'A', 'Title', 'Case', 'String'] + + If split_allcaps is a true value (the default), + runs of multiple uppercase characters will also + be split before the last character. 
This is + needed to handle splitting single-letter words. + Consider: + list(split_title_case('WhenIWasATeapot', split_allcaps=True)) + returns + ['When', 'I', 'Was', 'A', 'Teapot'] + but + list(split_title_case('WhenIWasATeapot', split_allcaps=False)) + returns + ['When', 'IWas', 'ATeapot'] + + Note: uses the 'isupper' and 'islower' methods + to determine which characters are upper- and + lower-case. This means it only recognizes the ASCII + upper- and lower-case letters for bytes strings. + """ + + if not s: + yield s + return + + if isinstance(s, bytes): + empty_join = b''.join + i = _iterate_over_bytes(s) + else: + empty_join = ''.join + i = iter(s) + + for c in i: + break + assert c + + word = [] + append = word.append + pop = word.pop + clear = word.clear + + multiple_uppers = False + + while True: + if c.islower(): + append(c) + for c in i: + if c.isupper(): + yield empty_join(word) + clear() + break + if c.islower(): + append(c) + continue + break + else: + break + + elif c.isupper(): + append(c) + multiple_uppers = False + for c in i: + if c.isupper(): + multiple_uppers = split_allcaps + append(c) + continue + if c.islower(): + if multiple_uppers: + previous = pop() + yield empty_join(word) + clear() + append(previous) + break + else: + break + else: + append(c) + for c in i: + break + else: + break + + if word: + yield empty_join(word) + + +@_export +def combine_splits(s, *split_arrays): + """ + Takes a string, and one or more "split arrays", + and applies all the splits to the string. Returns + an iterator of the resulting string segments. + + A "split array" is an array containing the original + string, but split into multiple pieces. For example, + the string "a b c d e" could be split into the + split array ["a ", "b ", "c ", "d ", "e"] + + For example, + combine_splits('abcde', ['abcd', 'e'], ['a', 'bcde']) + returns ['a', 'bcd', 'e']. + + Note that the split arrays *must* contain all the + characters from s. ''.join(split_array) must recreate s. 
+ (So, don't use the string's .split method to split it, + use big's multisplit with keep=True or keep=ALTERNATING.) + """ + # Measure the strings in the split arrays, ignoring empty splits. + split_lengths = [ [ len(_) for _ in split if _ ] for split in split_arrays ] + + # Throw away empty entries in split arrays. (If one array was ['', '', ''], it would now be empty.) + split_lengths = [ split for split in split_lengths if split ] + + heapq.heapify(split_lengths) + + def combine_splits(s, split_lengths): + split_lengths_pop = split_lengths.pop + # split_lengths_remove = split_lengths.remove + + pops = 0 + + heap_pop = heapq.heappop + heap_push = heapq.heappush + + if len(split_lengths) >= 2: + while True: + smallest = split_lengths[0] + index = smallest[0] + + snippet = s[:index] + if snippet == s: + # check, did they try to split past the end? + if index > len(s): + raise ValueError("split array is longer than the original string") + + yield snippet + s = s[index:] + if not s: + return + + # decrement the first value in every split array + # by index. + # (if every entry in a heapq is a list of integers, decrementing + # the first integer in every list by the same amount maintains + # the heap invariants.) + for lengths in split_lengths: + length = lengths[0] + + new_value = length - index + # assert new_value >= 0 + if not new_value: + pops += 1 + + # we write the zeros here, even though we're about to pop them off, + # because otherwise we might break the heapq invariants. 
+ lengths[0] = new_value + + while pops: + pops -= 1 + splits = heap_pop(split_lengths) + if len(splits) > 1: + splits.pop(0) + heap_push(split_lengths, splits) + + if len(split_lengths) < 2: + break + + if split_lengths: + start = end = 0 + length = len(s) + for index in split_lengths[0]: + end += index + if end > length: + raise ValueError("split array is longer than the original string") + yield s[start:end] + start += index + s = s[end:] + + if s: + yield s + + return combine_splits(s, split_lengths) + + _sentinel = object() _invalid_state = "_invalid_state" @@ -1936,7 +2132,7 @@ def split_quoted_strings(s, separators, quotes, empty, initial): @_export class Delimiter: """ - Class representing a delimiter for parse_delimiters. + Class representing a delimiter for split_delimiters. open is the opening delimiter character, can be str or bytes, must be length 1. close is the closing delimiter character, must be the same type as open, and length 1. @@ -1982,30 +2178,31 @@ def __repr__(self): # pragma: no cover delimiter_double_quotes = Delimiter('"', '"', escape='\\', nested=False) _export_name('delimiter_double_quotes') -parse_delimiters_default_delimiters = ( +split_delimiters_default_delimiters = ( delimiter_parentheses, delimiter_square_brackets, delimiter_curly_braces, delimiter_single_quote, delimiter_double_quotes, ) -_export_name('parse_delimiters_default_delimiters') +_export_name('split_delimiters_default_delimiters') + -parse_delimiters_default_delimiters_bytes = ( +split_delimiters_default_delimiters_bytes = ( b'()', b'[]', b'{}', Delimiter(b"'", b"'", escape=b'\\', nested=False), Delimiter(b'"', b'"', escape=b'\\', nested=False), ) -_export_name('parse_delimiters_default_delimiters_bytes') +_export_name('split_delimiters_default_delimiters_bytes') # break the rules _base_delimiter = Delimiter('a', 'b') _base_delimiter.open = _base_delimiter.close = None -def parse_delimiters(s, delimiters, closers, empty): +def split_delimiters(s, delimiters, closers, 
empty): open_to_delimiter = {d.open: d for d in delimiters} text = [] @@ -2065,10 +2262,10 @@ def flush(open, close): if text: yield flush(empty, empty) -_parse_delimiters = parse_delimiters +_split_delimiters = split_delimiters @_export -def parse_delimiters(s, delimiters=None): +def split_delimiters(s, delimiters=None): """ Parses a string containing nesting delimiters. Raises an exception if mismatched delimiters are detected. @@ -2081,7 +2278,7 @@ def parse_delimiters(s, delimiters=None): should be exactly two characters long; these will be used as the open and close arguments for a new Delimiter object. - If delimiters is None, parse_delimiters uses a default + If delimiters is None, split_delimiters uses a default value matching these pairs of delimiters: () [] {} "" '' The quote mark delimiters enable escape sequences @@ -2106,13 +2303,13 @@ def parse_delimiters(s, delimiters=None): if isinstance(s, bytes): s_type = bytes if delimiters is None: - delimiters = parse_delimiters_default_delimiters_bytes + delimiters = split_delimiters_default_delimiters_bytes disallowed_delimiters = b'\\' empty = b'' else: s_type = str if delimiters is None: - delimiters = parse_delimiters_default_delimiters + delimiters = split_delimiters_default_delimiters disallowed_delimiters = '\\' empty = '' @@ -2157,7 +2354,18 @@ def parse_delimiters(s, delimiters=None): if repeated: raise ValueError("these opening delimiters were used multiple times: " + " ".join(repeated)) - return _parse_delimiters(s, delimiters, closers, empty) + return _split_delimiters(s, delimiters, closers, empty) + + +# backwards compatibility for old names, will stick around until at least September 2025 +parse_delimiters_default_delimiters = split_delimiters_default_delimiters +_export_name('parse_delimiters_default_delimiters') + +parse_delimiters_default_delimiters_bytes = split_delimiters_default_delimiters_bytes +_export_name('parse_delimiters_default_delimiters_bytes') + +parse_delimiters = 
split_delimiters +_export_name('parse_delimiters') @@ -2171,7 +2379,7 @@ class LineInfo: or modify existing attributes as needed from inside a "lines modifier" function. """ - def __init__(self, lines, line, line_number, column_number, *, leading=None, trailing=None, comment=None, end=None, **kwargs): + def __init__(self, lines, line, line_number, column_number, *, leading=None, trailing=None, end=None, **kwargs): is_str = isinstance(line, str) is_bytes = isinstance(line, bytes) if is_bytes: @@ -2198,11 +2406,6 @@ def __init__(self, lines, line, line_number, column_number, *, leading=None, tra elif not isinstance(trailing, line_type): raise TypeError("trailing must be same type as line or None") - if comment == None: - comment = empty - elif not isinstance(comment, line_type): - raise TypeError("comment must be same type as line or None") - if end == None: end = empty elif not isinstance(end, line_type): @@ -2215,7 +2418,6 @@ def __init__(self, lines, line, line_number, column_number, *, leading=None, tra self.indent = None self.leading = leading self.trailing = trailing - self.comment = comment self.end = end self.match = None self._is_bytes = is_bytes @@ -2225,7 +2427,7 @@ def detab(self, s): return self.lines.detab(s) def extend_leading(self, s, line): - assert line.startswith(s), f"{line=} doesn't start with {s=}" + assert line.startswith(s), f"line {line!r} doesn't start with s {s!r}" self.leading += s line = line[len(s):] detabbed = self.detab(s) @@ -2239,17 +2441,9 @@ def extend_trailing(self, s, line): line = line[:-len(s)] return line - def extend_comment(self, s, line): - empty = b"" if self._is_bytes else "" - assert line.endswith(s) - self.comment = s + self.trailing + self.comment - self.trailing = empty - line = line[:-len(s)] - return line - def __repr__(self): names = list(self.__dict__) - priority_names = ['line', 'lines', 'line_number', 'column_number', 'leading', 'trailing', 'comment', 'end'] + priority_names = ['line', 'lines', 'line_number', 
'column_number', 'leading', 'trailing', 'end'] fields = [] for name in priority_names: names.remove(name) @@ -2535,24 +2729,33 @@ def lines_containing(li, s, *, invert=False): yield t @_export -def lines_grep(li, pattern, *, invert=False, flags=0): +def lines_grep(li, pattern, *, invert=False, flags=0, match='match'): """ A lines modifier function. Only yields lines that match the regular expression pattern. (Filters out lines that don't match pattern.) + Stores the resulting re.Match object in info.match. pattern can be str, bytes, or an re.Pattern object. If pattern is not an re.Pattern object, it's compiled with re.compile(pattern, flags=flags). - If invert is true, returns the opposite-- - filters out lines that match pattern. + If invert is true, lines_grep only yields lines that + *don't* match pattern, and sets info.match to None. + + The match parameter specifies the LineInfo attribute name to + write to. By default it writes to info.match; you can specify + any valid identifier, and it will instead write the re.Match + object (or None) to the identifier you specify. Composable with all the lines_ functions from the big.text module. (In older versions of Python, re.Pattern was a private type called re._pattern_type.) """ + if not match.isidentifier(): + raise ValueError('match must be a valid identifier') + if not isinstance_re_pattern(pattern): pattern = re.compile(pattern, flags=flags) search = pattern.search @@ -2561,30 +2764,27 @@ def lines_grep(li, pattern, *, invert=False, flags=0): def lines_grep(li, search): for t in li: info, line = t - match = search(line) - if not match: - info.match = None + m = search(line) + if not m: + setattr(info, match, None) yield t else: def lines_grep(li, search): for t in li: info, line = t - match = search(line) - if match: - info.match = match + m = search(line) + if m: + setattr(info, match, m) yield t return lines_grep(li, search) @_export def lines_sort(li, *, reverse=False): """ - A lines modifier function. 
Sorts all - input lines before yielding them. + A lines modifier function. Sorts all input lines before yielding them. - Lines are sorted lexicographically, - from lowest to highest. - If reverse is true, lines are sorted - from highest to lowest. + Lines are sorted lexicographically, from lowest to highest. + If reverse is true, lines are sorted from highest to lowest. Composable with all the lines_ modifier functions in the big.text module. """ @@ -2722,10 +2922,8 @@ def lines_strip_line_comments(li, line_comment_splitter, quotes, multiline_quote for triplet in i: line_comment_segments.extend(triplet) assert line_comment_segments - line = info.extend_comment(empty_join(line_comment_segments), line) + line = info.extend_trailing(empty_join(line_comment_segments), line) - # do this *after* extend_comment, - # extend_comment also eats trailing if rstrip: stripped = leading.rstrip() if stripped != leading: diff --git a/tests/test_text.py b/tests/test_text.py index 59ed30a..2064b7d 100644 --- a/tests/test_text.py +++ b/tests/test_text.py @@ -33,6 +33,7 @@ import math import re import sys +import types import unittest @@ -81,6 +82,11 @@ def to_bytes(o): # pragma: no cover o = re.compile(to_bytes(o.pattern), flags=flags) return o + +_iterate_over_bytes = big.text._iterate_over_bytes + + + # # known_separators & printable_separators lets error messages # print a symbolic name for a set of separators, instead of @@ -2354,6 +2360,276 @@ def test(columns, expected): ' -v|--verbose Causes the program to produce more output. Specifying it\n multiple times raises the volume of output.' ) + def test_split_title_case(self): + + def alternate_split_title_case(s, *, split_allcaps=True): + """ + Alternate implementation of split_title_case, + used for testing. 
+ """ + if not s: + yield s + return + + if isinstance(s, bytes): + empty_join = b''.join + i = _iterate_over_bytes(s) + else: + empty_join = ''.join + i = iter(s) + + word = [] + append = word.append + pop = word.pop + clear = word.clear + + previous_was_lower = False + upper_counter = 0 + + for c in i: + is_upper = c.isupper() + is_lower = c.islower() + # print(f"{c=} {is_upper=} {is_lower=} {upper_counter=} {previous_was_lower=} {split_allcaps=}") + if is_upper: + if previous_was_lower: + if word: + yield empty_join(word) + clear() + previous_was_lower = False + if split_allcaps: + upper_counter += 1 + else: + if is_lower: + if upper_counter > 1: + assert word + popped = pop() + if word: + yield empty_join(word) + clear() + append(popped) + upper_counter = 0 + previous_was_lower = is_lower + append(c) + continue + if word: + yield empty_join(word) + clear() + + def test(s, **kw): + expected = list(alternate_split_title_case(s, **kw)) + got = list(big.split_title_case(s, **kw)) + self.assertEqual(expected, got) + + b = s.encode('ascii') + bytes_expected = list(alternate_split_title_case(b, **kw)) + bytes_got = list(big.split_title_case(b, **kw)) + self.assertEqual(bytes_expected, bytes_got) + + self.assertIsInstance(big.split_title_case('HowdyFolks'), types.GeneratorType) + + test('') + test(' ') + test(' 333 ') + test('ThisIsATitleCaseString') + test('YoursTrulyJohnnyDollar_1975-03-15 - TheMysteriousMaynardMatter.MP3') + + test('oneOfTheGoodOnes') + test('aRoadLessTraveled') + + test("Can'tComplain") + + test("NOTHINGInTheWORLD") + test("NOTHINGInTheWORLD", split_allcaps=False) + + test("WhenIWasATeapot", split_allcaps=False) + test("WhenIWasATeapot", split_allcaps=True) + + def test_combine_splits(self): + + INT_MAX = 2**256 + + def original_combine_splits(s, *splits): + "Alternate implementation of combine_splits, used for testing." + # Measure the strings in the split arrays. + # (Ignore empty split arrays, and ignore empty splits.) 
+ split_lengths = [ [ len(_) for _ in split if _ ] for split in splits if split ] + + def combine_splits(s, split_lengths): + split_lengths_pop = split_lengths.pop + + drops = [] + drops_append = drops.append + drops_pop = drops.pop + + while len(split_lengths) >= 2: + # print(combined, s) + # for _ in split_lengths: + # print(" ", _) + smallest = INT_MAX + smallest_index = None + + for i, lengths in enumerate(split_lengths): + length = lengths[0] + if smallest > length: + smallest = length + smallest_index = i + + assert smallest != INT_MAX + + yield s[:smallest] + s = s[smallest:] + + for i, lengths in enumerate(split_lengths): + length = lengths[0] + if length == smallest: + lengths.pop(0) + if not lengths: + drops_append(i) + else: + lengths[0] = length - smallest + + while drops: + x = split_lengths_pop(drops_pop()) + assert not x + + if split_lengths: + start = end = 0 + for index in split_lengths[0]: + end += index + yield s[start:end] + start += index + s = s[end:] + + if s: + yield s + + return combine_splits(s, split_lengths) + + + def sorting_combine_splits(s, *splits): + "Alternate implementation of combine_splits, used for testing." + + index_0 = lambda x: x[0] + + # In case an entry in the split arrays is a generator, convert it to a list. + split_lengths = [ list(split) for split in splits ] + # Measure the strings in the split arrays. + # (Remove empty split arrays, and ignore empty splits.) 
+ split_lengths = [ [ len(_) for _ in split if _ ] for split in splits if split ] + split_lengths.sort(key=index_0) + + def combine_splits(s, split_lengths, index_0): + split_lengths_pop = split_lengths.pop + # split_lengths_remove = split_lengths.remove + + drops = [] + drops_append = drops.append + drops_pop = drops.pop + + if len(split_lengths) >= 2: + while True: + smallest = split_lengths[0] + index = smallest[0] + + yield s[:index] + s = s[index:] + + re_sort = False + for i, lengths in enumerate(split_lengths): + length = lengths[0] + # print(" >>", length, lengths) + if length == index: + re_sort = True + lengths.pop(0) + if not lengths: + drops_append(i) + else: + lengths[0] = length - index + + while drops: + x = split_lengths_pop(drops_pop()) + assert not x + if len(split_lengths) < 2: + break + if re_sort: + split_lengths.sort(key=index_0) + + if split_lengths: + start = end = 0 + for index in split_lengths[0]: + end += index + yield s[start:end] + start += index + s = s[end:] + + if s: + yield s + + return combine_splits(s, split_lengths, index_0) + + def test(s, *split_arrays): + # convert split_arrays into lists, just in case one is an iterator + # (we'll test an iterator by hand later) + split_arrays = [list(_) for _ in split_arrays] + + original_result = list(original_combine_splits(s, *split_arrays)) + sorting_result = list(sorting_combine_splits(s, *split_arrays)) + self.assertEqual(original_result, sorting_result) + expected = original_result + + got = list(big.combine_splits(s, *split_arrays)) + self.assertEqual(expected, got) + + bytes_s = s.encode('ascii') + bytes_split_arrays = [ [_.encode('ascii') for _ in l] for l in split_arrays ] + bytes_expected = [_.encode('ascii') for _ in expected] + + bytes_got = list(big.combine_splits(bytes_s, *bytes_split_arrays)) + self.assertEqual(bytes_expected, bytes_got) + + + self.assertIsInstance(big.combine_splits('abc', ['a', 'bc'], ['ab', 'c']), types.GeneratorType) + + + s = 'abcdefghijklmnopq' + s1 = 
[ 'ab', 'cde', 'fghi', 'jklmnop', 'q' ] + s2 = [ 'abcde', 'fghi', 'jk', 'lmnopq' ] + s3 = [ 'abcdefghi', 'jklmn', 'opq' ] + + # should split after B E I K N P + test(s, s1, s2, s3) + + test(s + "rstuvwxyz", s1, s2, s3) + + s = "aa bb cc dd ee" + test(s, + big.multisplit(s, keep=big.ALTERNATING), + ["aa b", "b cc d", "d ee"], + ) + + + s = "aa bb cc dd ee ff" + test(s, + s.split(), + ["aa bb cc dd ee f", "f"], + ) + + + with self.assertRaises(ValueError): + list(big.combine_splits("a b c d e", + ["a ", "b "], + ["a b c d ", "e f g ", "h "], + )) + + with self.assertRaises(ValueError): + list(big.combine_splits("a b c d e", + ["a ", "b "], + ["a b c d ", "e f g ", "h "], + ["a b c ", "d e f ", "g h "], + )) + + + def test_gently_title(self): def test(s, expected, test_ascii=True, apostrophes=None, double_quotes=None): result = big.gently_title(s, apostrophes=apostrophes, double_quotes=double_quotes) @@ -3050,11 +3326,11 @@ def test_and_remove_lineinfo_match(i, substring, *, invert=False): """[1:]) test(big.lines_strip_line_comments(lines, ("#", "//")), [ - L(line='for x in range(5): # this is my exciting comment', line_number=1, column_number=1, trailing=" ", comment='# this is my exciting comment', final='for x in range(5):'), - L(line=' print("# this is quoted", x)', line_number=2, column_number=1, comment=''), - L(line=' print("") # this "comment" is useless', line_number=3, column_number=1, trailing=" ", comment='# this "comment" is useless', final=' print("")'), - L(line=' print(no_comments_or_quotes_on_this_line)', line_number=4, column_number=1, comment=''), - L(line='', line_number=5, column_number=1, comment='', end=''), + L(line='for x in range(5): # this is my exciting comment', line_number=1, column_number=1, trailing=' # this is my exciting comment', final='for x in range(5):'), + L(line=' print("# this is quoted", x)', line_number=2, column_number=1), + L(line=' print("") # this "comment" is useless', line_number=3, column_number=1, trailing=' # this 
"comment" is useless', final=' print("")'), + L(line=' print(no_comments_or_quotes_on_this_line)', line_number=4, column_number=1), + L(line='', line_number=5, column_number=1, end=''), ]) # test multiline @@ -3071,8 +3347,7 @@ def test_and_remove_lineinfo_match(i, substring, *, invert=False): L(line_number=1, column_number=1, line ='for x in range(5): # this is my exciting comment', final='for x in range(5):', - trailing=' ', - comment='# this is my exciting comment',), + trailing=' # this is my exciting comment',), L(line_number=2, column_number=1, line =" print('''", final=" print('''"), @@ -3084,14 +3359,12 @@ def test_and_remove_lineinfo_match(i, substring, *, invert=False): final=" does this line have a comment? # no!"), L(line_number=5, column_number=1, line=" ''') # but here's a comment", - trailing=" ", - comment="# but here's a comment", + trailing=" # but here's a comment", final=" ''')", ), L(line_number=6, column_number=1, line=' print("just checking, # here too") # here is another comment', - trailing=" ", - comment='# here is another comment', + trailing=" # here is another comment", final=' print("just checking, # here too")'), L(line_number=7, column_number=1, line='', @@ -3109,24 +3382,20 @@ def test_and_remove_lineinfo_match(i, substring, *, invert=False): [ L( line_number=1, column_number=1, line='for x in range(5): # this is a comment', - trailing=' ', - comment='# this is a comment', + trailing=' # this is a comment', final='for x in range(5):'), L( line_number=2, column_number=1, line=' print("# this is quoted", x)', - comment='# this is quoted", x)', + trailing='# this is quoted", x)', final=' print("'), L( line_number=3, column_number=1, line=' print("") # this "comment" is useless', - trailing=' ', - comment='# this "comment" is useless', + trailing=' # this "comment" is useless', final=' print("")'), L( line_number=4, column_number=1, - line=' print(no_comments_or_quotes_on_this_line)', - comment=''), + line=' 
print(no_comments_or_quotes_on_this_line)'), L( line_number=5, column_number=1, line='', - comment='', end=''), ]) @@ -3145,9 +3414,9 @@ def test_and_remove_lineinfo_match(i, substring, *, invert=False): lines = big.lines(b"a\nb# ignored\n c") test(big.lines_strip_line_comments(lines, b'#'), [ - L(b'a', 1, comment=b''), - L(b'b# ignored', 2, 1, comment=b'# ignored', final=b'b'), - L(b' c', 3, comment=b'', end=b''), + L(b'a', 1), + L(b'b# ignored', 2, 1, trailing=b'# ignored', final=b'b'), + L(b' c', 3, end=b''), ] )
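The semantics these tests exercise can be restated with a much simpler (and slower) standalone model: every split array contributes its cut points, and the combined result is the original string cut at the union of those points. `combine_splits_simple` below is written for this note as an illustration of that idea; it is not the heapq-based implementation in the patch.

```python
def combine_splits_simple(s, *split_arrays):
    """Toy model of combine_splits: cut s at the union of every
    split array's boundaries."""
    cuts = set()
    for split in split_arrays:
        pos = 0
        for piece in split:
            pos += len(piece)
            cuts.add(pos)
        if pos > len(s):
            raise ValueError("split array is longer than the original string")
    # always include the start and end of the string as boundaries
    boundaries = sorted(cuts | {0, len(s)})
    for start, end in zip(boundaries, boundaries[1:]):
        yield s[start:end]
```

This reproduces the docstring example: `combine_splits_simple('abcde', ['abcd', 'e'], ['a', 'bcde'])` yields `'a'`, `'bcd'`, `'e'`, and a split array whose pieces total more than `len(s)` raises `ValueError`, matching the error cases tested above.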