Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge: Avoid overwriting the output id column with non-id columns #1593

Merged
merged 1 commit into from
Aug 30, 2024

Conversation

tsibley
Copy link
Member

@tsibley tsibley commented Aug 21, 2024

This edge case occurs when an input table has an id column used for joining (in tests, literally "id", but any valid id column name) and also another, non-id column that matches the output id column name (in tests, literally "strain").

Checklist

  • Automated checks pass
  • Check if you need to add a changelog message
  • Check if you need to add tests
  • Check if you need to update docs

augur/merge.py Outdated
Comment on lines 198 to 203
# XXX TODO: Alternatively, we could a) skip such columns rather
# than mangle their names, or b) allow for output columns with
# non-unique names (i.e. distinguishable only by position).
# Would either of those be preferrable? Should it be
# configurable?
# -trs, 21 Aug 2024
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd welcome commentary on these alternatives.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid (b) since this could cause complications downstream. I'm guessing many workflows assume unique column names. Other Augur commands that use read_metadata seem to allow it since pandas allows it, but this seems more like a bug waiting to break some workflow rather than an intentional feature.

Decision here could benefit from having an example use case. Would it be unreasonable to (c) exit with error? This would allow us to better detect if/when this ever happens, and figure out what seems best based on the issue at hand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion about the two alternatives, but I do think, if you're going to mangle the column name, you need to consider the edgier edge case of "what if we mangle it to something that already exists?" (which will also cover the edge case of "what if there is more than one non-id column whose name overlaps the id column", I think).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(c) exit with error + suggest augur curate rename for users to rename columns?

Copy link
Member Author

@tsibley tsibley Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@genehack

but I do think, if you're going to mangle the column name, you need to consider the edgier edge case of "what if we mangle it to something that already exists?"

Yeah, my prefixing with two underscores approach is punting on this under the rationale that "field names with two leading underscores are reserved for use by augur merge". I didn't spell this out in docs, but it's kinda implied given this behaviour and the pre-existing __source_metadata_X columns.

(which will also cover the edge case of "what if there is more than one non-id column whose name overlaps the id column", I think).

Hmm, are you thinking of a situation of two tables:

strain,…
1,…
2,…
3,…
id,strain,strain,…
1,A,B,…
2,A,B,…
3,A,B,…

which are matched on strain in table 1 and id in table 2?

Maybe we should support that situation somehow, but currently we don't and it's a error/bug:

$ augur merge --metadata a=/tmp/a.csv b=/tmp/b.csv --metadata-id-columns a=strain b=id --output-metadata -
Reading 'a' metadata from '/tmp/a.csv'…
Reading 'b' metadata from '/tmp/b.csv'…
Columns renamed during .import <pipe> due to duplicates:
"strain" to "strain_2",
"strain" to "strain_3"
Traceback (most recent call last):
  File "/home/tom/nextstrain/augur/augur/__init__.py", line 70, in run
    return args.__command__.run(args)
  File "/home/tom/nextstrain/augur/augur/merge.py", line 209, in run
    assert m.columns == (table_columns := sqlite3_table_columns(db_path, m.table_name)), \
AssertionError: ['id', 'strain', 'strain', '…'] == ['id', 'strain_2', 'strain_3', '…']


An error occurred (see above) that has not been properly handled by Augur.
To report this, please open a new issue including the original command and the error above:
    <https://github.com/nextstrain/augur/issues/new/choose>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But maybe mangling is less friendly if users are likely to not notice it happened, even with the warning message?

I'm no user so take this with a grain of salt, but this seems likely in the context of Snakemake usage where warnings are effectively log messages that get drowned out in the sea of console output.

Copy link
Contributor

@genehack genehack Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my prefixing with two underscores approach is punting on this under the rationale that "field names with two leading underscores are reserved for use by augur merge". I didn't spell this out in docs, but it's kinda implied given this behaviour and the pre-existing __source_metadata_X columns.

If that's going to be the behavior, I think that's fine, but (a) it probably should be documented and (b) you should probably throw warnings when you see __<column_name> in inputs, yeah?

(which will also cover the edge case of "what if there is more than one non-id column whose name overlaps the id column", I think).

Hmm, are you thinking of a situation of two tables:

I was thinking about merging three or more tables, one where strain is the unique ID column and two where it is not but is still present.

Maybe we should support that situation somehow

I don't love this idea, but I was thinking if your "create a unique name" process was "surround with unique char" instead of "prefix with __", you could apply that recursively until uniqueness resulted. E.g., strain -> |strain| -> ||strain||, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to switch this PR's implementation to the suggested (c) exit with error. I suspect we'll encounter it more than we'd like—that is, I suspect we have plenty of metadata files to merge in where, e.g., strain is some non-unique column and we're joining on accession or similar instead—but we can address it down the road if so.

If that's going to be the behavior, I think that's fine, but (a) it probably should be documented and (b) you should probably throw warnings when you see __<column_name> in inputs, yeah?

Fair enough.

I was thinking about merging three or more tables, one where strain is the unique ID column and two where it is not but is still present.

Ah. In the current (but soon to be outdated) implementation on this PR, that'd be fine; they'd both get renamed in the output to __strain and merged appropriately. Or is there something else I'm missing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about merging three or more tables, one where strain is the unique ID column and two where it is not but is still present.

Ah. In the current (but soon to be outdated) implementation on this PR, that'd be fine; they'd both get renamed in the output to __strain and merged appropriately. Or is there something else I'm missing?

Perhaps I misunderstood the desired end state — I thought the goal was that the final output of augur merge would have fully unique column names. If the goal is actually "fully unique except __-prefixed ones", then 👍 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You understand the desired end state. The output would have fully-unique column names. The two __strain columns would get merged, i.e. output as a single column.

Copy link

codecov bot commented Aug 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.03%. Comparing base (d662698) to head (232698b).
Report is 6 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1593   +/-   ##
=======================================
  Coverage   71.02%   71.03%           
=======================================
  Files          79       79           
  Lines        8256     8258    +2     
  Branches     2003     2004    +1     
=======================================
+ Hits         5864     5866    +2     
  Misses       2101     2101           
  Partials      291      291           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

This edge case occurs when an input table has an id column used for
joining (in tests, literally "id", but any valid id column name) and
also another, non-id column that matches the output id column name (in
tests, literally "strain").

A previous implementation avoided overwriting by mangling the column
name in output (e.g. "strain" → "__strain"), but during review it was
suggested that erroring out early might be less surprising and thus
ultimately more user friendly.
@tsibley tsibley force-pushed the trs/merge/id-column-overwrite branch from 2a73eff to 232698b Compare August 28, 2024 19:01
@tsibley
Copy link
Member Author

tsibley commented Aug 28, 2024

The linkcheck step in CI has now failed twice in different ways: first reporting a 403 Forbidden for http://www.microbesonline.org/fasttree/ and then reporting 406 Unacceptable for 4859b5d. Third time was the charm. The only thing I did was re-run the job.

@tsibley tsibley merged commit 2702005 into master Aug 30, 2024
28 checks passed
@tsibley tsibley deleted the trs/merge/id-column-overwrite branch August 30, 2024 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants