merge: Avoid overwriting the output id column with non-id columns

This edge case occurs when an input table has an id column used for joining (in tests, literally "id", but any valid id column name) and also another, non-id column that matches the output id column name (in tests, literally "strain"). A previous implementation avoided overwriting by mangling the column name in output (e.g. "strain" → "__strain"), but during review it was suggested that erroring out early might be less surprising and thus ultimately more user friendly.
nextstrain · Aug 28, 2024 · 2a73eff · 2a73eff
1 parent 242d67f
commit 2a73eff
Show file tree

Hide file tree

Showing 2 changed files with 46 additions and 2 deletions.
diff --git a/augur/merge.py b/augur/merge.py
@@ -13,7 +13,8 @@
 column or overwriting values in an existing column.  For columns appearing in
 more than one table, non-empty values on the right hand side overwrite values
 on the left hand side.  The first table's id column name is used as the output
-id column name.
+id column name.  Non-id columns in other input tables that would conflict with
+this output id column name are not allowed and if present will cause an error.
 
 One generated column per input table is appended to the end of the output
 table to identify the source of each row's data.  Column names are generated
@@ -151,6 +152,23 @@ def run(args):
     output_id_column = metadata[0].id_column
     output_columns = { output_id_column: [] }
 
+    if conflicting_columns := [f"{c!r} in metadata table {m.name!r} (id column: {m.id_column!r})"
+                                    for m in metadata
+                                    for c in m.columns
+                                     if c == output_id_column
+                                    and c != m.id_column]:
+        raise AugurError(dedent(f"""\
+            Non-id column names in metadata inputs may not conflict with the
+            output id column name ({output_id_column!r}, the first input's id column).
+
+            The following input {_n("column", "columns", len(conflicting_columns))} would conflict:
+
+              {indented_list(conflicting_columns, '            ' + '  ')}
+
+            Please rename or drop the conflicting {_n("column", "columns", len(conflicting_columns))} before merging.
+            Renaming may be done with `augur curate rename`.
+            """))
+
     try:
         # Read all metadata files into a SQLite db
         for m in metadata:
@@ -185,7 +203,10 @@ def run(args):
             # the order of both.
             for column in m.columns:
                 # Match different id column names in different metadata files
-                # since they're logically equivalent.
+                # since they're logically equivalent.  Any non-id columns that
+                # match the output_id_column (i.e. first table's id column) and
+                # would thus overwrite it with this logic are already a fatal
+                # error above.
                 output_column = output_id_column if column == m.id_column else column
 
                 output_columns.setdefault(output_column, [])

diff --git a/tests/functional/merge/cram/merge.t b/tests/functional/merge/cram/merge.t
@@ -143,6 +143,29 @@ Metadata field values with metachars (field or record delimiters) are handled pr
   x"	1	1
   two	X2a	X2b	X2c				1	1
 
+Output column renamed when it conflicts with id column.
+
+  $ cat >id-and-strain.csv <<~~
+  > id,strain
+  > one,1
+  > two,2
+  > three,3
+  > ~~
+  $ ${AUGUR} merge \
+  >   --metadata strain-only=x.tsv id-and-strain=id-and-strain.csv \
+  >   --metadata-id-columns id strain \
+  >   --output-metadata - | csv2tsv --csv-delim $'\t' | tsv-pretty
+  ERROR: Non-id column names in metadata inputs may not conflict with the
+  output id column name ('strain', the first input's id column).
+
+  The following input column would conflict:
+
+    'strain' in metadata table 'id-and-strain' (id column: 'id')
+
+  Please rename or drop the conflicting column before merging.
+  Renaming may be done with `augur curate rename`.
+
+
 
 ERROR HANDLING