Ensure all lines fed to marian are well formed #23

XapaJIaMnu · 2023-06-26T14:07:45Z

No description provided.

jelmervdl

Suggested alternative:

while True:
	line = self._fh.readline()
	if line == '':
		raise StopIteration
	self.line += 1

	# assert that the line is well formed, meaning none of the fields is the empty string
	# If not, try to get a new line from the corpus
	if any(field == '' for field in line.rstrip('\r\n').split('\t')):
		logging.warning(f"Empty field in {self.dataset.name}")
		continue

	return line

Not sure about the warning(), but I'm afraid that you'll have to do a lot of debugging otherwise to figure out what is going on. Ideally we would tell you which line, but since the data comes from the shuffled output, that line number would be meaningless.

Maybe shuffle.py should have a line number on the first column, so you can trace the data. But that's a lot of string parsing just for debugging once in a while.

I'm also hinting that we should move all our info printing to logging, so we can give the user a bit more control over what is printed and filter repeated warnings more easily.

jelmervdl · 2023-06-26T15:38:11Z

src/opustrainer/trainer.py

+                for field in testline:
+                    if field == "":
+                        continue


This continue will only skip the for field in testline loop, right? Not go to the next line?

Yep... That should be breakout 🤦

jelmervdl · 2023-06-26T15:38:29Z

src/opustrainer/trainer.py

+                self.line += 1
+                # assert that the line is well formed, meaning non of the fields is the empty string
+                # If not, try to get a new line from the corpus
+                testline: List[str] = line.rstrip('/r/n').strip().split('/t')


Should be \ instead of /. Also, the strip() will mess things up here: leading and trailing empty columns are stripped away entirely, yielding just a row with fewer fields. An empty column is only found when it is surrounded by non-empty columns.
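To illustrate the strip() point, here is a minimal sketch comparing the two ways of splitting. The helper names `fields_with_strip` and `fields_without_strip` are made up for this example and are not part of the PR.

```python
from typing import List


def fields_with_strip(line: str) -> List[str]:
    # Same shape as the line in the diff, with backslashes corrected.
    # strip() removes leading/trailing whitespace, and that includes tabs.
    return line.rstrip('\r\n').strip().split('\t')


def fields_without_strip(line: str) -> List[str]:
    # Only the line terminator is removed, so empty columns survive the split.
    return line.rstrip('\r\n').split('\t')


line = "\ttarget sentence\n"  # the first column is empty
print(fields_with_strip(line))     # ['target sentence']
print(fields_without_strip(line))  # ['', 'target sentence']
```

With strip() the empty first column disappears and the row simply looks like it has one field, so a check for empty strings never fires.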

XapaJIaMnu · 2023-06-26T17:26:16Z

Note to self, do not push PRs without sleeping well.

I don't think a line number is necessary when outputting the erroneous sentence, because it can very easily be identified later via grep (I've done this several times). Logging is nice though.

XapaJIaMnu · 2023-06-26T19:17:14Z

We should start moving to logger, but annoyingly it doesn't have a LOG_ONCE feature AND on top of that we'd probably have to redo all the test outputs...
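One possible way to approximate a LOG_ONCE with the standard library is a `logging.Filter` that remembers which messages it has already let through. This is only a sketch: the class name `LogOnceFilter` and the logger name `opustrainer_sketch` are assumptions for illustration, not something that exists in the codebase.

```python
import logging
from typing import Set, Tuple


class LogOnceFilter(logging.Filter):
    """Drop any record whose (level, formatted message) was already seen."""

    def __init__(self) -> None:
        super().__init__()
        self._seen: Set[Tuple[int, str]] = set()

    def filter(self, record: logging.LogRecord) -> bool:
        key = (record.levelno, record.getMessage())
        if key in self._seen:
            return False  # repeated message: suppress it
        self._seen.add(key)
        return True  # first occurrence: let it through


# Hypothetical logger name, used only for this example.
logger = logging.getLogger("opustrainer_sketch")
logger.addFilter(LogOnceFilter())
```

A filter attached to a logger only applies to records logged directly on that logger, so it would have to be attached to a handler instead if several loggers should share it. It does not address having to redo the test outputs.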

XapaJIaMnu · 2023-06-27T09:14:47Z

This is still not foolproof. The reader doesn't know how many fields there are, and if a field is missing altogether, this check would not catch it...
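If the expected number of columns were known to the reader, the check could cover missing fields as well. A rough sketch, assuming a hypothetical `num_fields` parameter that the reader does not currently have:

```python
from typing import Optional


def is_well_formed(line: str, num_fields: Optional[int] = None) -> bool:
    """Return True if no field is empty and, when given, the field count matches.

    `num_fields` is a hypothetical parameter: the expected number of
    tab-separated columns. When it is None only the empty-field check runs.
    """
    fields = line.rstrip('\r\n').split('\t')
    if num_fields is not None and len(fields) != num_fields:
        return False  # a column is missing (or there is one too many)
    return all(field != '' for field in fields)


print(is_well_formed("source\ttarget\n", num_fields=2))  # True
print(is_well_formed("source\n", num_fields=2))          # False
print(is_well_formed("source\n"))                        # True
```

The last call shows the gap described above: without a known field count, a row that lacks a column entirely still passes.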

XapaJIaMnu requested a review from jelmervdl June 26, 2023 14:07

Ensure all lines fed to marian are well formed

3d17728

jelmervdl requested changes Jun 26, 2023

🤦

b38acb3

jelmervdl merged commit d742967 into main Jun 30, 2023

jelmervdl deleted the validate branch June 30, 2023 08:46
