DH Code Review #1

jdamerow · 2024-10-03T20:50:44Z

The ticket for this code review is: link to ticket

This repository is ready for review. Please review __init__.py and if there is time milestones/__init__.py.

Specific requests:

Performance (spaCy docs are slow)
Quality (can we make this a model for other parts of Lexos?)
Code style

Unit tests prompted numerous non-backward compatible fixes. Also, API documentation files and sample data were added.

ColeDCrawford

General comments:

It's worth putting in some work to upgrade lexos so it is compatible with Python 3.11, 3.12, or even 3.13 (though dependencies may not be available yet for 3.13, as it's so new). I couldn't get it to build for Python > 3.10. You mentioned speed as an issue; Python 3.11 is between 10% and 60% faster than Python 3.10, and 3.12 added a lot of improvements for async I/O (not sure how much spaCy leverages that). That's low-hanging fruit that should give performance upgrades immediately, and I'd take that win before trying to optimize your implementation. It will also allow other projects to use the lexos package: I wouldn't be able to use this on any apps we run, as I just finished moving all of them from 3.11 to 3.12.
Updating to the latest spaCy version might also help. I think you are pinned to 3.4, which is a couple of years old. Are you using a CUDA (GPU-enabled) build?
It would be helpful to have a pyproject.toml to manage dependencies for rollingwindows. I know the goal is to merge this module into lexos but might as well have the build tooling set up until that happens. You could have requirements.txt instead but pyproject.toml means that you could do an editable install of rollingwindows (e.g. pip install --editable .) instead of modifying your system path.
The notebook could be cleaned up a bit. There are a couple of cells where the description doesn't match the code: "we will use only the first 10,000 tokens" but it's 3000, "we generate a new window every 50 characters" but the next cell is every 1000 tokens, etc. Some other QOL improvements could include automatically modifying the system path, and running spacy installs for the required models. But overall it does a good job of demoing the functionality. Low hanging fruit: you could turn this into a pseudo-test with Treon just to make sure the notebook can run top-to-bottom.
For Filters - can you add multiple filters? E.g. both exclude_digits and exclude_roman_numeral? Or would you just have a pattern to catch both for exclude_pattern?
Milestones are cool - I didn't have enough time to play with them directly but the plotting for these was quite nice.
Stress testing: is there handling for odd cases, like very large inputs? I hit some spaCy errors when synthetically generating long texts, e.g. long_text = " ".join(["word"] * 1_000_000).
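On the stress-testing point: spaCy raises an error when a text is longer than nlp.max_length, which defaults to 1,000,000 characters, and the synthetic text above is roughly five times that. One option is to split long input before it reaches the pipeline. The sketch below is only an illustration under that assumption; chunk_text and DEFAULT_MAX_LENGTH are hypothetical names, not part of rollingwindows or spaCy.

```python
from typing import Iterator

# Mirrors spaCy's default nlp.max_length of 1,000,000 characters.
DEFAULT_MAX_LENGTH = 1_000_000


def chunk_text(text: str, max_chars: int = DEFAULT_MAX_LENGTH) -> Iterator[str]:
    """Yield pieces of `text` no longer than `max_chars`, splitting at spaces when possible."""
    if max_chars < 1:
        raise ValueError("max_chars must be a positive integer")
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        # If the cut would land inside a word, back up to the last space in the chunk.
        if end < len(text) and text[end] != " ":
            split_at = text.rfind(" ", start, end)
            if split_at > start:
                end = split_at
        yield text[start:end]
        # Skip the space(s) the text was split on.
        start = end
        while start < len(text) and text[start] == " ":
            start += 1
```

Each chunk could then be passed to the pipeline separately. Windows that straddle a chunk boundary would need extra handling, which this sketch does not attempt.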

ColeDCrawford · 2024-10-23T18:25:08Z

__init__.py

+            plotter.save(file, **kwargs)
+
+    # @timer
+    def set_windows(


Some validation on window_units matching the options would be helpful here. I was able to do this:

import rollingwindows
rw = rollingwindows.RollingWindows(doc, model=model)
rw.set_windows("aaa", "aaa")
rw.metadata

{'model': 'en_core_web_sm', 'n': 'aaa', 'window_units': 'aaa', 'alignment_mode': 'strict', 'search_method': 're_finditer',

Running rw.calculate() later then failed, but you could handle this upfront when configuring the window options instead of potentially getting a runtime error. I'd validate the parameters here to make sure that you're getting an int and a valid string option.
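A minimal sketch of what that upfront check might look like. The function name validate_window_options and the contents of VALID_WINDOW_UNITS are assumptions for illustration; the real list of units should come from rollingwindows itself.

```python
# Assumed set of units for illustration only; replace with the units rollingwindows supports.
VALID_WINDOW_UNITS = frozenset({"characters", "lines", "sentences", "tokens"})


def validate_window_options(n, window_units):
    """Raise early if the window size or unit cannot produce valid windows."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(n, bool) or not isinstance(n, int):
        raise TypeError(f"n must be an int, got {type(n).__name__}")
    if n < 1:
        raise ValueError(f"n must be at least 1, got {n}")
    if window_units not in VALID_WINDOW_UNITS:
        options = ", ".join(sorted(VALID_WINDOW_UNITS))
        raise ValueError(f"window_units must be one of: {options}; got {window_units!r}")
```

Calling something like this at the top of set_windows() would turn the rw.set_windows("aaa", "aaa") case into an immediate TypeError instead of a failure later in calculate().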

ColeDCrawford · 2024-10-23T18:27:58Z

__init__.py

+    """
+    # TODO: We have to iterate through the input twice to get the boundaries.
+    if isinstance(input, list):
+        input_spans = [span.as_doc() for span in input]


You could pull this out into a utility function, as the same logic is used in both sliding_str_windows() and sliding_windows()

ColeDCrawford · 2024-10-23T18:30:05Z

__init__.py

+        if alignment_mode == "strict":
+            for start_char, end_char in boundaries:
+                yield input.text[start_char:end_char]
+        else:
+            for start_char, end_char in boundaries:
+                window = input.char_span(
+                    start_char, end_char, alignment_mode=alignment_mode
+                )
+                if window is not None:
+                    yield window.text


Both sliding_str_windows and sliding_windows have some similar logic for handling snapping as well; you could pull that out into an align_windows() function
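As a rough sketch, the string-yielding case from the hunk above could be extracted like this. The name align_windows comes from the suggestion; the signature is an assumption, and sliding_windows would presumably need a variant that yields the spans themselves rather than their text.

```python
from typing import Iterable, Iterator, Tuple


def align_windows(
    doc,
    boundaries: Iterable[Tuple[int, int]],
    alignment_mode: str = "strict",
) -> Iterator[str]:
    """Yield the text of each character window, snapped to token boundaries if requested."""
    for start_char, end_char in boundaries:
        if alignment_mode == "strict":
            # No snapping: slice the raw text directly.
            yield doc.text[start_char:end_char]
        else:
            # Doc.char_span returns None when no valid span can be formed.
            window = doc.char_span(start_char, end_char, alignment_mode=alignment_mode)
            if window is not None:
                yield window.text
```

The helper only relies on doc.text and doc.char_span(), so it can be unit tested with a small stand-in object as well as with a real spaCy Doc.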

ColeDCrawford · 2024-10-23T18:41:11Z

__init__.py

+def sliding_str_windows(
+    input: Union[List[spacy.tokens.span.Span], spacy.tokens.doc.Doc, str],
+    n: int = 1000,
+    alignment_mode: str = "contract",


Add some checks to make sure that the alignment_mode is valid

ColeDCrawford · 2024-10-23T18:45:08Z

__init__.py

+    """Return a generator of string windows.
+
+    Args:
+        input (Union[List[spacy.tokens.span.Span], spacy.tokens.doc.Doc, str]): A spaCy doc or a list of spaCy spans.


This shouldn't allow for a str, right?

from rollingwindows import sliding_str_windows
input_text = "This is a test input text for windowing."
windows = list(sliding_str_windows(input_text, n=5, alignment_mode="strict"))

Gives me:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [44], in <cell line: 3>()
      1 from rollingwindows import sliding_str_windows
      2 input_text = "This is a test input text for windowing."
----> 3 windows = list(sliding_str_windows(input_text, n=5, alignment_mode="strict"))

File ~/GitHub/rollingwindows/__init__.py:76, in sliding_str_windows(input, n, alignment_mode)
     74 span = input[start_char:end_char]
     75 if span is not None:
---> 76 yield span.text
     77 else:
     78 for start_char, end_char in boundaries:

AttributeError: 'str' object has no attribute 'text'

ColeDCrawford · 2024-10-23T18:46:06Z

__init__.py

+        - "strict" (no snapping)
+        - "contract" (span of all tokens completely within the character span)
+        - "expand" (span of all tokens at least partially covered by the character span)


Pulling these and the units out into a more global set of valid choices might be useful as you use them in a number of different places.
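One possible shape for this, sketched with an Enum so that the valid alignment modes live in a single place. AlignmentMode and coerce_alignment_mode are hypothetical names; the three values are the ones listed in the docstring above.

```python
from enum import Enum


class AlignmentMode(str, Enum):
    """Alignment modes listed in the sliding_str_windows docstring."""

    STRICT = "strict"
    CONTRACT = "contract"
    EXPAND = "expand"


def coerce_alignment_mode(value) -> AlignmentMode:
    """Return the matching AlignmentMode or raise a ValueError listing the valid options."""
    try:
        return AlignmentMode(value)
    except ValueError:
        options = ", ".join(mode.value for mode in AlignmentMode)
        raise ValueError(
            f"alignment_mode must be one of: {options}; got {value!r}"
        ) from None
```

Functions could call coerce_alignment_mode(alignment_mode) on entry and pass the resulting .value on to spaCy. The window units could be handled with a second Enum in the same way.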

ColeDCrawford · 2024-10-23T18:56:50Z

__init__.py

+        boundaries = [(i, i + n) for i in range(len(input_spans))]
+        for start, end in boundaries:
+            yield Doc.from_docs(input_spans[start:end]).text.strip()
+    else:


Could you change to using a batched generator to make this more memory efficient for really long texts? You could then use parallel processing to create the windows?

Is there a place where you could cache the window calculations?
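To make the batching idea concrete, here is a standard-library sketch over generic items. Both function names are hypothetical, and unlike the hunk above this version yields only full windows of n items, which is an assumption about the desired behavior.

```python
from collections import deque
from itertools import islice
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def sliding_windows_lazy(items: Iterable[T], n: int) -> Iterator[Tuple[T, ...]]:
    """Yield each full window of n consecutive items, buffering only n items at a time."""
    if n < 1:
        raise ValueError("n must be at least 1")
    window = deque(maxlen=n)
    for item in items:
        window.append(item)  # the oldest item is dropped automatically once full
        if len(window) == n:
            yield tuple(window)


def batched(iterable: Iterable[T], size: int) -> Iterator[List[T]]:
    """Group an iterable into lists of at most `size` items.

    itertools.batched offers this from Python 3.12; this version also runs on 3.10.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Something like `for batch in batched(sliding_windows_lazy(spans, n), 256)` would then keep at most one batch of windows in memory, and each batch could be handed to a worker pool if the per-window calculation is expensive enough to justify it.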

scottkleinman and others added 25 commits May 6, 2024 15:25

Initial commit

e5edd3d

Initial commit

1dd9f7a

Backup for 25 May 2025

890ec70

Unit tests prompted numerous non-backward compatible fixes. Also, API documentation files and sample data were added.

Fix link

fdb10b7

Remove __pycache__

986eba7

Remove deprecated folder

f49a411

Update README

a373167

Update explanatory material in README

29f59e8

Add RWPlotlyPlotter

ee2d691

Run isort and fix docstring formatting.

dafa32e

Add Plotly plotter to the docs

48c4e93

Correct typos

9408a67

Correct typo

67cf787

Correct typo

2c4cfb2

Change plot attribute to fig.

57e996c

Update docs and tutorial to include Plotly plotter.

43e2abd

Update docstrings for mkdocs.

1c88f9b

Add unit tests

55ec6d1

Code refactor based on unit tests

8339005

Update docs and tutorial based on unit tests

c5c692a

Update note on unit tests.

1badd16

Fix typo

f8a97ad

Remove reference to Averages class

a5d83b4

Complete docstring coverage

0621098

Merge branch 'main' into dh-code-review

948645f

jdamerow requested review from ColeDCrawford and mutherr October 3, 2024 20:50

jdamerow assigned scottkleinman Oct 3, 2024

ColeDCrawford reviewed Oct 23, 2024
