-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure all lines fed to marian are well formed #23
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested alternative:
while True:
line = self._fh.readline()
if line == '':
raise StopIteration
self.line += 1
# assert that the line is well formed, meaning non of the fields is the empty string
# If not, try to get a new line from the corpus
if any(field == '' for field in line.rstrip('\r\n').split('\t')):
logging.warning("Empty field in {self.dataset.name}")
continue
return line
Not sure about the warning()
, but I'm afraid that you'll have to do a lot of debugging otherwise to figure out what is going on. Ideally we would tell you which line, but since the data comes from the shuffled output, that line number would be meaningless.
Maybe shuffle.py should have a line number on the first column, so you can trace the data. But that's a lot of string parsing just for debugging once in a while.
Also I'm hinting to move all our info printing to logging
so we can give the user a bit more control over what is printed. And we can filter repeated warnings more easily.
src/opustrainer/trainer.py
Outdated
for field in testline: | ||
if field == "": | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This continue will only skip the for field in testline
loop, right? Not go to the next line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep... That should be breakout 🤦
src/opustrainer/trainer.py
Outdated
self.line += 1 | ||
# assert that the line is well formed, meaning non of the fields is the empty string | ||
# If not, try to get a new line from the corpus | ||
testline: List[str] = line.rstrip('/r/n').strip().split('/t') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be \
instead of /
. Also the strip()
will mess things up here because empty columns will be stripped entirely, yielding just a row with fewer fields. Only when a column that is surrounded by non-empty columns is empty, this will be found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦
Note to self, do not push PRs without sleeping well. I don't think line number is necessary when outputting the erroneous sentence, because it can very easily be later identified via grep (i've done this several times). Logging is nice though |
We should start moving to logger, but annoyingly it doesn't have a |
This is still not fool proof. The reader doesn't know how many fields there are, and if we are missing a field altogether this would fail... |
No description provided.