-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
merge: Avoid overwriting the output id column with non-id columns #1593
Conversation
augur/merge.py
Outdated
# XXX TODO: Alternatively, we could a) skip such columns rather | ||
# than mangle their names, or b) allow for output columns with | ||
# non-unique names (i.e. distinguishable only by position). | ||
# Would either of those be preferrable? Should it be | ||
# configurable? | ||
# -trs, 21 Aug 2024 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd welcome commentary on these alternatives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would avoid (b) since this could cause complications downstream. I'm guessing many workflows assume unique column names. Other Augur commands that use read_metadata
seem to allow it since pandas allows it, but this seems more like a bug waiting to break some workflow rather than an intentional feature.
Decision here could benefit from having an example use case. Would it be unreasonable to (c) exit with error? This would allow us to better detect if/when this ever happens, and figure out what seems best based on the issue at hand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion about the two alternatives, but I do think, if you're going to mangle the column name, you need to consider the edgier edge case of "what if we mangle it to something that already exists?" (which will also cover the edge case of "what if there is more than one non-id column whose name overlaps the id column", I think).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(c) exit with error + suggest augur curate rename
for users to rename columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I do think, if you're going to mangle the column name, you need to consider the edgier edge case of "what if we mangle it to something that already exists?"
Yeah, my prefixing with two underscores approach is punting on this under the rationale that "field names with two leading underscores are reserved for use by augur merge". I didn't spell this out in docs, but it's kinda implied given this behaviour and the pre-existing __source_metadata_X
columns.
(which will also cover the edge case of "what if there is more than one non-id column whose name overlaps the id column", I think).
Hmm, are you thinking of a situation of two tables:
strain,…
1,…
2,…
3,…
id,strain,strain,…
1,A,B,…
2,A,B,…
3,A,B,…
which are matched on strain
in table 1 and id
in table 2?
Maybe we should support that situation somehow, but currently we don't and it's a error/bug:
$ augur merge --metadata a=/tmp/a.csv b=/tmp/b.csv --metadata-id-columns a=strain b=id --output-metadata -
Reading 'a' metadata from '/tmp/a.csv'…
Reading 'b' metadata from '/tmp/b.csv'…
Columns renamed during .import <pipe> due to duplicates:
"strain" to "strain_2",
"strain" to "strain_3"
Traceback (most recent call last):
File "/home/tom/nextstrain/augur/augur/__init__.py", line 70, in run
return args.__command__.run(args)
File "/home/tom/nextstrain/augur/augur/merge.py", line 209, in run
assert m.columns == (table_columns := sqlite3_table_columns(db_path, m.table_name)), \
AssertionError: ['id', 'strain', 'strain', '…'] == ['id', 'strain_2', 'strain_3', '…']
An error occurred (see above) that has not been properly handled by Augur.
To report this, please open a new issue including the original command and the error above:
<https://github.com/nextstrain/augur/issues/new/choose>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But maybe mangling is less friendly if users are likely to not notice it happened, even with the warning message?
I'm no user so take this with a grain of salt, but this seems likely in the context of Snakemake usage where warnings are effectively log messages that get drowned out in the sea of console output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, my prefixing with two underscores approach is punting on this under the rationale that "field names with two leading underscores are reserved for use by augur merge". I didn't spell this out in docs, but it's kinda implied given this behaviour and the pre-existing
__source_metadata_X
columns.
If that's going to be the behavior, I think that's fine, but (a) it probably should be documented and (b) you should probably throw warnings when you see __<column_name>
in inputs, yeah?
(which will also cover the edge case of "what if there is more than one non-id column whose name overlaps the id column", I think).
Hmm, are you thinking of a situation of two tables:
I was thinking about merging three or more tables, one where strain
is the unique ID column and two where it is not but is still present.
Maybe we should support that situation somehow
I don't love this idea, but I was thinking if your "create a unique name" process was "surround with unique char" instead of "prefix with __
", you could apply that recursively until uniqueness resulted. E.g., strain
-> |strain|
-> ||strain||
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to switch this PR's implementation to the suggested (c) exit with error. I suspect we'll encounter it more than we'd like—that is, I suspect we have plenty of metadata files to merge in where, e.g., strain
is some non-unique column and we're joining on accession
or similar instead—but we can address it down the road if so.
If that's going to be the behavior, I think that's fine, but (a) it probably should be documented and (b) you should probably throw warnings when you see
__<column_name>
in inputs, yeah?
Fair enough.
I was thinking about merging three or more tables, one where
strain
is the unique ID column and two where it is not but is still present.
Ah. In the current (but soon to be outdated) implementation on this PR, that'd be fine; they'd both get renamed in the output to __strain
and merged appropriately. Or is there something else I'm missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about merging three or more tables, one where
strain
is the unique ID column and two where it is not but is still present.Ah. In the current (but soon to be outdated) implementation on this PR, that'd be fine; they'd both get renamed in the output to
__strain
and merged appropriately. Or is there something else I'm missing?
Perhaps I misunderstood the desired end state — I thought the goal was that the final output of augur merge
would have fully unique column names. If the goal is actually "fully unique except __
-prefixed ones", then 👍 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You understand the desired end state. The output would have fully-unique column names. The two __strain
columns would get merged, i.e. output as a single column.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1593 +/- ##
=======================================
Coverage 71.02% 71.03%
=======================================
Files 79 79
Lines 8256 8258 +2
Branches 2003 2004 +1
=======================================
+ Hits 5864 5866 +2
Misses 2101 2101
Partials 291 291 ☔ View full report in Codecov by Sentry. |
40f4e2e
to
2a73eff
Compare
This edge case occurs when an input table has an id column used for joining (in tests, literally "id", but any valid id column name) and also another, non-id column that matches the output id column name (in tests, literally "strain"). A previous implementation avoided overwriting by mangling the column name in output (e.g. "strain" → "__strain"), but during review it was suggested that erroring out early might be less surprising and thus ultimately more user friendly.
2a73eff
to
232698b
Compare
The linkcheck step in CI has now failed twice in different ways: first reporting a 403 Forbidden for http://www.microbesonline.org/fasttree/ and then reporting 406 Unacceptable for 4859b5d. Third time was the charm. The only thing I did was re-run the job. |
This edge case occurs when an input table has an id column used for joining (in tests, literally "id", but any valid id column name) and also another, non-id column that matches the output id column name (in tests, literally "strain").
Checklist