merge: Avoid overwriting the output id column with non-id columns #1593

tsibley · 2024-08-21T20:39:44Z

This edge case occurs when an input table has an id column used for joining (in tests, literally "id", but any valid id column name) and also another, non-id column that matches the output id column name (in tests, literally "strain").

Checklist

Automated checks pass
Check if you need to add a changelog message
Check if you need to add tests
Check if you need to update docs

tsibley · 2024-08-21T20:47:26Z

augur/merge.py

+                # XXX TODO: Alternatively, we could a) skip such columns rather
+                # than mangle their names, or b) allow for output columns with
+                # non-unique names (i.e. distinguishable only by position).
+                # Would either of those be preferrable?  Should it be
+                # configurable?
+                #   -trs, 21 Aug 2024


I'd welcome commentary on these alternatives.

I would avoid (b) since this could cause complications downstream. I'm guessing many workflows assume unique column names. Other Augur commands that use read_metadata seem to allow it since pandas allows it, but this seems more like a bug waiting to break some workflow rather than an intentional feature.

Decision here could benefit from having an example use case. Would it be unreasonable to (c) exit with error? This would allow us to better detect if/when this ever happens, and figure out what seems best based on the issue at hand.

I don't have a strong opinion about the two alternatives, but I do think, if you're going to mangle the column name, you need to consider the edgier edge case of "what if we mangle it to something that already exists?" (which will also cover the edge case of "what if there is more than one non-id column whose name overlaps the id column", I think).

(c) exit with error + suggest augur curate rename for users to rename columns?

@genehack

but I do think, if you're going to mangle the column name, you need to consider the edgier edge case of "what if we mangle it to something that already exists?"

Yeah, my prefixing with two underscores approach is punting on this under the rationale that "field names with two leading underscores are reserved for use by augur merge". I didn't spell this out in docs, but it's kinda implied given this behaviour and the pre-existing __source_metadata_X columns.

(which will also cover the edge case of "what if there is more than one non-id column whose name overlaps the id column", I think).

Hmm, are you thinking of a situation of two tables:

strain,… 1,… 2,… 3,…

id,strain,strain,… 1,A,B,… 2,A,B,… 3,A,B,…

which are matched on strain in table 1 and id in table 2?

Maybe we should support that situation somehow, but currently we don't and it's a error/bug:

$ augur merge --metadata a=/tmp/a.csv b=/tmp/b.csv --metadata-id-columns a=strain b=id --output-metadata - Reading 'a' metadata from '/tmp/a.csv'… Reading 'b' metadata from '/tmp/b.csv'… Columns renamed during .import <pipe> due to duplicates: "strain" to "strain_2", "strain" to "strain_3" Traceback (most recent call last): File "/home/tom/nextstrain/augur/augur/__init__.py", line 70, in run return args.__command__.run(args) File "/home/tom/nextstrain/augur/augur/merge.py", line 209, in run assert m.columns == (table_columns := sqlite3_table_columns(db_path, m.table_name)), \ AssertionError: ['id', 'strain', 'strain', '…'] == ['id', 'strain_2', 'strain_3', '…'] An error occurred (see above) that has not been properly handled by Augur. To report this, please open a new issue including the original command and the error above: <https://github.com/nextstrain/augur/issues/new/choose>

But maybe mangling is less friendly if users are likely to not notice it happened, even with the warning message?

I'm no user so take this with a grain of salt, but this seems likely in the context of Snakemake usage where warnings are effectively log messages that get drowned out in the sea of console output.

Yeah, my prefixing with two underscores approach is punting on this under the rationale that "field names with two leading underscores are reserved for use by augur merge". I didn't spell this out in docs, but it's kinda implied given this behaviour and the pre-existing __source_metadata_X columns.

If that's going to be the behavior, I think that's fine, but (a) it probably should be documented and (b) you should probably throw warnings when you see __<column_name> in inputs, yeah?

(which will also cover the edge case of "what if there is more than one non-id column whose name overlaps the id column", I think).

Hmm, are you thinking of a situation of two tables:

I was thinking about merging three or more tables, one where strain is the unique ID column and two where it is not but is still present.

Maybe we should support that situation somehow

I don't love this idea, but I was thinking if your "create a unique name" process was "surround with unique char" instead of "prefix with __", you could apply that recursively until uniqueness resulted. E.g., strain -> |strain| -> ||strain||, etc.

I'm going to switch this PR's implementation to the suggested (c) exit with error. I suspect we'll encounter it more than we'd like—that is, I suspect we have plenty of metadata files to merge in where, e.g., strain is some non-unique column and we're joining on accession or similar instead—but we can address it down the road if so.

If that's going to be the behavior, I think that's fine, but (a) it probably should be documented and (b) you should probably throw warnings when you see __<column_name> in inputs, yeah?

Fair enough.

I was thinking about merging three or more tables, one where strain is the unique ID column and two where it is not but is still present.

Ah. In the current (but soon to be outdated) implementation on this PR, that'd be fine; they'd both get renamed in the output to __strain and merged appropriately. Or is there something else I'm missing?

I was thinking about merging three or more tables, one where strain is the unique ID column and two where it is not but is still present.

Ah. In the current (but soon to be outdated) implementation on this PR, that'd be fine; they'd both get renamed in the output to __strain and merged appropriately. Or is there something else I'm missing?

Perhaps I misunderstood the desired end state — I thought the goal was that the final output of augur merge would have fully unique column names. If the goal is actually "fully unique except __-prefixed ones", then 👍 👍

You understand the desired end state. The output would have fully-unique column names. The two __strain columns would get merged, i.e. output as a single column.

codecov · 2024-08-21T20:53:33Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.03%. Comparing base (d662698) to head (232698b).
Report is 6 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1593   +/-   ##
=======================================
  Coverage   71.02%   71.03%           
=======================================
  Files          79       79           
  Lines        8256     8258    +2     
  Branches     2003     2004    +1     
=======================================
+ Hits         5864     5866    +2     
  Misses       2101     2101           
  Partials      291      291

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

This edge case occurs when an input table has an id column used for joining (in tests, literally "id", but any valid id column name) and also another, non-id column that matches the output id column name (in tests, literally "strain"). A previous implementation avoided overwriting by mangling the column name in output (e.g. "strain" → "__strain"), but during review it was suggested that erroring out early might be less surprising and thus ultimately more user friendly.

tsibley · 2024-08-28T19:37:17Z

The linkcheck step in CI has now failed twice in different ways: first reporting a 403 Forbidden for http://www.microbesonline.org/fasttree/ and then reporting 406 Unacceptable for 4859b5d. Third time was the charm. The only thing I did was re-run the job.

tsibley commented Aug 21, 2024

View reviewed changes

tsibley mentioned this pull request Aug 27, 2024

merge: Support table-specific id columns and delimiters #1594

Merged

5 tasks

tsibley force-pushed the trs/merge/id-column-overwrite branch from 40f4e2e to 2a73eff Compare August 28, 2024 18:47

tsibley requested review from genehack, victorlin and joverlee521 August 28, 2024 18:47

tsibley force-pushed the trs/merge/id-column-overwrite branch from 2a73eff to 232698b Compare August 28, 2024 19:01

tsibley mentioned this pull request Aug 28, 2024

docs-ci: make linkcheck prone to transient network failures nextstrain/.github#106

Open

tsibley merged commit 2702005 into master Aug 30, 2024
28 checks passed

tsibley deleted the trs/merge/id-column-overwrite branch August 30, 2024 16:39

victorlin approved these changes Aug 30, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

merge: Avoid overwriting the output id column with non-id columns #1593

merge: Avoid overwriting the output id column with non-id columns #1593

tsibley commented Aug 21, 2024 •

edited

Loading

tsibley Aug 21, 2024

victorlin Aug 21, 2024

genehack Aug 22, 2024

joverlee521 Aug 22, 2024

tsibley Aug 27, 2024 •

edited

Loading

victorlin Aug 27, 2024

genehack Aug 27, 2024 •

edited

Loading

tsibley Aug 28, 2024

genehack Aug 28, 2024

tsibley Aug 28, 2024

codecov bot commented Aug 21, 2024 •

edited

Loading

tsibley commented Aug 28, 2024

merge: Avoid overwriting the output id column with non-id columns #1593

merge: Avoid overwriting the output id column with non-id columns #1593

Conversation

tsibley commented Aug 21, 2024 • edited Loading

Checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tsibley Aug 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

genehack Aug 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Aug 21, 2024 • edited Loading

Codecov Report

tsibley commented Aug 28, 2024

tsibley commented Aug 21, 2024 •

edited

Loading

tsibley Aug 27, 2024 •

edited

Loading

genehack Aug 27, 2024 •

edited

Loading

codecov bot commented Aug 21, 2024 •

edited

Loading