Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure all lines fed to marian are well formed #23

Merged
merged 2 commits into from
Jun 30, 2023
Merged

Ensure all lines fed to marian are well formed #23

merged 2 commits into from
Jun 30, 2023

Conversation

XapaJIaMnu
Copy link
Contributor

No description provided.

@XapaJIaMnu XapaJIaMnu requested a review from jelmervdl June 26, 2023 14:07
Copy link
Contributor

@jelmervdl jelmervdl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested alternative:

while True:
	line = self._fh.readline()
	if line == '':
		raise StopIteration
	self.line += 1

	# assert that the line is well formed, meaning non of the fields is the empty string
	# If not, try to get a new line from the corpus
	if any(field == '' for field in line.rstrip('\r\n').split('\t')):
		logging.warning("Empty field in {self.dataset.name}")
		continue

	return line

Not sure about the warning(), but I'm afraid that you'll have to do a lot of debugging otherwise to figure out what is going on. Ideally we would tell you which line, but since the data comes from the shuffled output, that line number would be meaningless.

Maybe shuffle.py should have a line number on the first column, so you can trace the data. But that's a lot of string parsing just for debugging once in a while.

Also I'm hinting to move all our info printing to logging so we can give the user a bit more control over what is printed. And we can filter repeated warnings more easily.

Comment on lines 210 to 212
for field in testline:
if field == "":
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This continue will only skip the for field in testline loop, right? Not go to the next line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep... That should be breakout 🤦

self.line += 1
# assert that the line is well formed, meaning non of the fields is the empty string
# If not, try to get a new line from the corpus
testline: List[str] = line.rstrip('/r/n').strip().split('/t')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be \ instead of /. Also the strip() will mess things up here because empty columns will be stripped entirely, yielding just a row with fewer fields. Only when a column that is surrounded by non-empty columns is empty, this will be found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦

@XapaJIaMnu
Copy link
Contributor Author

Note to self, do not push PRs without sleeping well.

I don't think line number is necessary when outputting the erroneous sentence, because it can very easily be later identified via grep (i've done this several times). Logging is nice though

@XapaJIaMnu
Copy link
Contributor Author

We should start moving to logger, but annoyingly it doesn't have a LOG_ONCE feature AND on top of that we'd probably have to redo all the test outputs...

@XapaJIaMnu
Copy link
Contributor Author

This is still not fool proof. The reader doesn't know how many fields there are, and if we are missing a field altogether this would fail...

@jelmervdl jelmervdl merged commit d742967 into main Jun 30, 2023
@jelmervdl jelmervdl deleted the validate branch June 30, 2023 08:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants