Skip to content

Commit

Permalink
merge: Avoid overwriting the output id column with non-id columns
Browse files Browse the repository at this point in the history
This edge case occurs when an input table has an id column used for
joining (in tests, literally "id", but any valid id column name) and
also another, non-id column that matches the output id column name (in
tests, literally "strain").

A previous implementation avoided overwriting by mangling the column
name in output (e.g. "strain" → "__strain"), but during review it was
suggested that erroring out early might be less surprising and thus
ultimately more user friendly.
  • Loading branch information
tsibley committed Aug 28, 2024
1 parent 242d67f commit 2a73eff
Show file tree
Hide file tree
Showing 2 changed files with 46 additions and 2 deletions.
25 changes: 23 additions & 2 deletions augur/merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,8 @@
column or overwriting values in an existing column. For columns appearing in
more than one table, non-empty values on the right hand side overwrite values
on the left hand side. The first table's id column name is used as the output
id column name.
id column name. Non-id columns in other input tables that would conflict with
this output id column name are not allowed and if present will cause an error.
One generated column per input table is appended to the end of the output
table to identify the source of each row's data. Column names are generated
Expand Down Expand Up @@ -151,6 +152,23 @@ def run(args):
output_id_column = metadata[0].id_column
output_columns = { output_id_column: [] }

if conflicting_columns := [f"{c!r} in metadata table {m.name!r} (id column: {m.id_column!r})"
for m in metadata
for c in m.columns
if c == output_id_column
and c != m.id_column]:
raise AugurError(dedent(f"""\
Non-id column names in metadata inputs may not conflict with the
output id column name ({output_id_column!r}, the first input's id column).
The following input {_n("column", "columns", len(conflicting_columns))} would conflict:
{indented_list(conflicting_columns, ' ' + ' ')}
Please rename or drop the conflicting {_n("column", "columns", len(conflicting_columns))} before merging.
Renaming may be done with `augur curate rename`.
"""))

try:
# Read all metadata files into a SQLite db
for m in metadata:
Expand Down Expand Up @@ -185,7 +203,10 @@ def run(args):
# the order of both.
for column in m.columns:
# Match different id column names in different metadata files
# since they're logically equivalent.
# since they're logically equivalent. Any non-id columns that
# match the output_id_column (i.e. first table's id column) and
# would thus overwrite it with this logic are already a fatal
# error above.
output_column = output_id_column if column == m.id_column else column

output_columns.setdefault(output_column, [])
Expand Down
23 changes: 23 additions & 0 deletions tests/functional/merge/cram/merge.t
Original file line number Diff line number Diff line change
Expand Up @@ -143,6 +143,29 @@ Metadata field values with metachars (field or record delimiters) are handled pr
x" 1 1
two X2a X2b X2c 1 1

Output column renamed when it conflicts with id column.

$ cat >id-and-strain.csv <<~~
> id,strain
> one,1
> two,2
> three,3
> ~~
$ ${AUGUR} merge \
> --metadata strain-only=x.tsv id-and-strain=id-and-strain.csv \
> --metadata-id-columns id strain \
> --output-metadata - | csv2tsv --csv-delim $'\t' | tsv-pretty
ERROR: Non-id column names in metadata inputs may not conflict with the
output id column name ('strain', the first input's id column).

The following input column would conflict:

'strain' in metadata table 'id-and-strain' (id column: 'id')

Please rename or drop the conflicting column before merging.
Renaming may be done with `augur curate rename`.



ERROR HANDLING

Expand Down

0 comments on commit 2a73eff

Please sign in to comment.