-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DH Code Review #1
base: empty
Are you sure you want to change the base?
Conversation
Unit tests prompted numerous non-backward compatible fixes. Also, API documentation files and sample data were added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comments:
- It's worth putting in some work to upgrade
lexos
so it is compatible with Python 3.11, 3.12, or even 3.13 (might have issues getting dependencies for that as it's so new though). I couldn't get it to build for Python > 3.10. You mentioned speed as an issue; Python 3.11 is between 10-60% faster than Python 3.10, and 3.12 added a lot of improvements for async I/O (not sure how much spaCy leverages that). But that's some low-hanging fruit that should give performance upgrades immediately. I'd take that win before trying to optimize your implementation. It will also allow other projects to use thelexos
package - I wouldn't be able to use this on any apps we run as I just finished moving all of them from3.11
to3.12
. - Updating to the latest
spaCy
version might also help. I think you are pinned to 3.4 which is a couple of years old. Are you using a CUDA (GPU-enabled) build? - It would be helpful to have a
pyproject.toml
to manage dependencies forrollingwindows
. I know the goal is to merge this module intolexos
but might as well have the build tooling set up until that happens. You could haverequirements.txt
instead butpyproject.toml
means that you could do an editable install ofrollingwindows
(e.g.pip install --editable .
) instead of modifying your system path. - The notebook could be cleaned up a bit. There are a couple of cells where the description doesn't match the code: "we will use only the first 10,000 tokens" but it's 3000, "we generate a new window every 50 characters" but the next cell is every 1000 tokens, etc. Some other QOL improvements could include automatically modifying the system path, and running
spacy
installs for the required models. But overall it does a good job of demoing the functionality. Low hanging fruit: you could turn this into a pseudo-test with Treon just to make sure the notebook can run top-to-bottom. - For
Filters
- can you add multiple filters? E.g. bothexclude_digits
andexclude_roman_numeral
? Or would you just have a pattern to catch both forexclude_pattern
? - Milestones are cool - I didn't have enough time to play with them directly but the plotting for these was quite nice.
- Stress testing: is there handling for odd cases? Like large inputs (I hit some spacy errors when synthetically generating long texts, e.g.
long_text = " ".join(["word"] * 1_000_000)
.
plotter.save(file, **kwargs) | ||
|
||
# @timer | ||
def set_windows( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some validation on window_units
matching the options would be helpful here. I was able to do this:
import rollingwindows
rw = rollingwindows.RollingWindows(doc, model=model)
rw.set_windows("aaa", "aaa")
rw.metadata
{'model': 'en_core_web_sm',
'n': 'aaa',
'window_units': 'aaa',
'alignment_mode': 'strict',
'search_method': 're_finditer',
Running rw.calculate()
later then failed, but you could handle this upfront when configuring the window options instead of potentially getting a runtime error. I'd validate the parameters here to make sure that you're getting an int
and a valid string option.
""" | ||
# TODO: We have to iterate through the input twice to get the boundaries. | ||
if isinstance(input, list): | ||
input_spans = [span.as_doc() for span in input] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could pull this out into a utility function, as the same logic is used in both sliding_str_windows()
and sliding_windows()
if alignment_mode == "strict": | ||
for start_char, end_char in boundaries: | ||
yield input.text[start_char:end_char] | ||
else: | ||
for start_char, end_char in boundaries: | ||
window = input.char_span( | ||
start_char, end_char, alignment_mode=alignment_mode | ||
) | ||
if window is not None: | ||
yield window.text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both sliding_str_windows
and sliding_windows
have some similar logic for handling snapping as well; you could pull that out into an align_windows()
function
def sliding_str_windows( | ||
input: Union[List[spacy.tokens.span.Span], spacy.tokens.doc.Doc, str], | ||
n: int = 1000, | ||
alignment_mode: str = "contract", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add some checks to make sure that the alignment_mode
is valid
"""Return a generator of string windows. | ||
|
||
Args: | ||
input (Union[List[spacy.tokens.span.Span], spacy.tokens.doc.Doc, str]): A spaCy doc or a list of spaCy spans. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't allow for a str
, right?
from rollingwindows import sliding_str_windows
input_text = "This is a test input text for windowing."
windows = list(sliding_str_windows(input_text, n=5, alignment_mode="strict"))
Gives me:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [44], in <cell line: 3>()
1 from rollingwindows import sliding_str_windows
2 input_text = "This is a test input text for windowing."
----> 3 windows = list(sliding_str_windows(input_text, n=5, alignment_mode="strict"))
File ~/GitHub/rollingwindows/__init__.py:76, in sliding_str_windows(input, n, alignment_mode)
74 span = input[start_char:end_char]
75 if span is not None:
---> 76 yield span.text
77 else:
78 for start_char, end_char in boundaries:
AttributeError: 'str' object has no attribute 'text'
- "strict" (no snapping) | ||
- "contract" (span of all tokens completely within the character span) | ||
- "expand" (span of all tokens at least partially covered by the character span) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pulling these and the units out into a more global set of valid choices might be useful as you use them in a number of different places.
boundaries = [(i, i + n) for i in range(len(input_spans))] | ||
for start, end in boundaries: | ||
yield Doc.from_docs(input_spans[start:end]).text.strip() | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you change to using a batched generator to make this more memory efficient for really long texts? You could then use parallel processing to create the windows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a place where you could cache the window calculations?
The ticket for this code review is: link to ticket
This repository is ready for review. Please review
__init__.py
and if there is timemilestones/__init__.py
.Specific requests :