Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DH Code Review #1

Open
wants to merge 25 commits into
base: empty
Choose a base branch
from
Open

DH Code Review #1

wants to merge 25 commits into from

Conversation

jdamerow
Copy link

@jdamerow jdamerow commented Oct 3, 2024

The ticket for this code review is: link to ticket

This repository is ready for review. Please review __init__.py and if there is time milestones/__init__.py.

Specific requests :

  • Performance (spacy docs is slow)
  • Quality (can we make this a model for other parts of Lexos)
  • Code style

Copy link

@ColeDCrawford ColeDCrawford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comments:

  • It's worth putting in some work to upgrade lexos so it is compatible with Python 3.11, 3.12, or even 3.13 (might have issues getting dependencies for that as it's so new though). I couldn't get it to build for Python > 3.10. You mentioned speed as an issue; Python 3.11 is between 10-60% faster than Python 3.10, and 3.12 added a lot of improvements for async I/O (not sure how much spaCy leverages that). But that's some low-hanging fruit that should give performance upgrades immediately. I'd take that win before trying to optimize your implementation. It will also allow other projects to use the lexos package - I wouldn't be able to use this on any apps we run as I just finished moving all of them from 3.11 to 3.12.
  • Updating to the latest spaCy version might also help. I think you are pinned to 3.4 which is a couple of years old. Are you using a CUDA (GPU-enabled) build?
  • It would be helpful to have a pyproject.toml to manage dependencies for rollingwindows. I know the goal is to merge this module into lexos but might as well have the build tooling set up until that happens. You could have requirements.txt instead but pyproject.toml means that you could do an editable install of rollingwindows (e.g. pip install --editable .) instead of modifying your system path.
  • The notebook could be cleaned up a bit. There are a couple of cells where the description doesn't match the code: "we will use only the first 10,000 tokens" but it's 3000, "we generate a new window every 50 characters" but the next cell is every 1000 tokens, etc. Some other QOL improvements could include automatically modifying the system path, and running spacy installs for the required models. But overall it does a good job of demoing the functionality. Low hanging fruit: you could turn this into a pseudo-test with Treon just to make sure the notebook can run top-to-bottom.
  • For Filters - can you add multiple filters? E.g. both exclude_digits and exclude_roman_numeral? Or would you just have a pattern to catch both for exclude_pattern?
  • Milestones are cool - I didn't have enough time to play with them directly but the plotting for these was quite nice.
  • Stress testing: is there handling for odd cases? Like large inputs (I hit some spacy errors when synthetically generating long texts, e.g. long_text = " ".join(["word"] * 1_000_000).

plotter.save(file, **kwargs)

# @timer
def set_windows(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some validation on window_units matching the options would be helpful here. I was able to do this:

import rollingwindows
rw = rollingwindows.RollingWindows(doc, model=model)
rw.set_windows("aaa", "aaa")
rw.metadata

{'model': 'en_core_web_sm',
 'n': 'aaa',
 'window_units': 'aaa',
 'alignment_mode': 'strict',
 'search_method': 're_finditer',

Running rw.calculate() later then failed, but you could handle this upfront when configuring the window options instead of potentially getting a runtime error. I'd validate the parameters here to make sure that you're getting an int and a valid string option.

"""
# TODO: We have to iterate through the input twice to get the boundaries.
if isinstance(input, list):
input_spans = [span.as_doc() for span in input]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could pull this out into a utility function, as the same logic is used in both sliding_str_windows() and sliding_windows()

Comment on lines +109 to +118
if alignment_mode == "strict":
for start_char, end_char in boundaries:
yield input.text[start_char:end_char]
else:
for start_char, end_char in boundaries:
window = input.char_span(
start_char, end_char, alignment_mode=alignment_mode
)
if window is not None:
yield window.text

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both sliding_str_windows and sliding_windows have some similar logic for handling snapping as well; you could pull that out into an align_windows() function

def sliding_str_windows(
input: Union[List[spacy.tokens.span.Span], spacy.tokens.doc.Doc, str],
n: int = 1000,
alignment_mode: str = "contract",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some checks to make sure that the alignment_mode is valid

"""Return a generator of string windows.

Args:
input (Union[List[spacy.tokens.span.Span], spacy.tokens.doc.Doc, str]): A spaCy doc or a list of spaCy spans.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't allow for a str, right?

from rollingwindows import sliding_str_windows
input_text = "This is a test input text for windowing."
windows = list(sliding_str_windows(input_text, n=5, alignment_mode="strict"))

Gives me:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [44], in <cell line: 3>()
      1 from rollingwindows import sliding_str_windows
      2 input_text = "This is a test input text for windowing."
----> 3 windows = list(sliding_str_windows(input_text, n=5, alignment_mode="strict"))

File ~/GitHub/rollingwindows/__init__.py:76, in sliding_str_windows(input, n, alignment_mode)
     74         span = input[start_char:end_char]
     75         if span is not None:
---> 76             yield span.text
     77 else:
     78     for start_char, end_char in boundaries:

AttributeError: 'str' object has no attribute 'text'

Comment on lines +99 to +101
- "strict" (no snapping)
- "contract" (span of all tokens completely within the character span)
- "expand" (span of all tokens at least partially covered by the character span)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulling these and the units out into a more global set of valid choices might be useful as you use them in a number of different places.

boundaries = [(i, i + n) for i in range(len(input_spans))]
for start, end in boundaries:
yield Doc.from_docs(input_spans[start:end]).text.strip()
else:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you change to using a batched generator to make this more memory efficient for really long texts? You could then use parallel processing to create the windows?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a place where you could cache the window calculations?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants