-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Di 1338 v1.3.0 #16
base: main
Are you sure you want to change the base?
Di 1338 v1.3.0 #16
Conversation
Warning Rate limit exceeded@kjwinfield has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 2 minutes and 58 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (1)
WalkthroughThe pull request introduces significant modifications to the codebase, focusing on input requirements and code structure. The primary changes include adding a new required JSON input for panels, updating the Epic Clarity input from Changes
Assessment against linked issues
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Hello @kjwinfield! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2025-01-03 17:23:33 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
resources/home/dnanexus/get_variant_info.py (3)
93-140
: Determine inheritance
Overall, the logic for maternal/paternal determination is clear. However, note the following static analysis hints:
- Line 106: Instead of assigning
zygosity = lambda x, y: ...
, define a small helper function. This makes debugging easier.- Lines 116-117 & 123-124: You can consolidate nested
if
statements into a single check for cleaner code.🧰 Tools
🪛 Ruff (0.8.2)
106-106: Do not assign a
lambda
expression, use adef
Rewrite
zygosity
as adef
(E731)
116-117: Use a single
if
statement instead of nestedif
statements(SIM102)
123-124: Use a single
if
statement instead of nestedif
statements(SIM102)
200-238
: STR variant info
Populating a dictionary with STR-specific fields is straightforward. For repeated logic, note that:
- Lines 218-220: Instead of
num_copies = lambda x, y, z: ...
, define a small helper function. Lambdas assigned to variables can be replaced withdef
.🧰 Tools
🪛 Ruff (0.8.2)
218-220: Do not assign a
lambda
expression, use adef
Rewrite
num_copies
as adef
(E731)
379-422
: HGVS nomenclature for exomiser
The fallback behaviour for when a MANE transcript is missing is logical. The debugprint(refseq)
can be removed if no longer necessary in production.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
README.md
(1 hunks)resources/home/dnanexus/get_variant_info.py
(1 hunks)resources/home/dnanexus/make_workbook.py
(18 hunks)resources/home/dnanexus/tests/test_make_workbook.py
(9 hunks)src/code.sh
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
resources/home/dnanexus/get_variant_info.py
106-106: Do not assign a lambda
expression, use a def
Rewrite zygosity
as a def
(E731)
116-117: Use a single if
statement instead of nested if
statements
(SIM102)
123-124: Use a single if
statement instead of nested if
statements
(SIM102)
218-220: Do not assign a lambda
expression, use a def
Rewrite num_copies
as a def
(E731)
🔇 Additional comments (43)
resources/home/dnanexus/get_variant_info.py (11)
3-20
: Add columns to dictionary
This function is concise and clearly communicates its purpose through the docstring. The approach for creating an empty dictionary keyed by the column list is perfectly fine.
22-41
: Extract gene symbol
The loop to gather gene symbols and the handling of multiple gene symbols is straightforward. Sorting and de-duplicating the symbols before returning them is good practice.
43-67
: Convert tier
The logic branches cover expected combinations of tiers and variant types. This is easy to read.
69-90
: Retrieve maximum allele frequency
The logic handles the scenario where alleleFrequencies isNone
or empty. The use of'-'
for missing data is a sensible approach.
143-171
: Convert mode of inheritance
Handles a range of MOI values neatly, mapping them to more readable strings. The exception handling for unknown MOIs is also appropriate.
174-197
: Index participant
Throwing aRuntimeError
when the participant ID isn't found is a strong move, ensuring no silent failures. Implementation is succinct.
240-289
: SNV variant info
The call toconvert_tier
and the subsequent fields for zygosity, depth, and gene data are all consistent. Good job returning a fully populated dictionary in one pass.
292-324
: CNV variant info
Similar structure to SNV/STR. The approach of reusingadd_columns_to_dict
is DRY and keeps the code consistent.
327-352
: Retrieve top three exomiser-ranked variants
Uses a small podium ranking strategy, capturing all variants tied within the top three ranks. This is a neat solution.
354-376
: Look up ID in conversion file
Straightforward approach to match lines against the desired ID. Efficient, considering the data is stored in memory and looped through once.
424-481
: HGVS nomenclature for GEL
Collecting transcripts into lists and rejoining them for multiple hits is well-handled. The method gracefully falls back to Ensembl transcripts if no MANE match is found.resources/home/dnanexus/make_workbook.py (19)
18-18
: Unified import of get_variant_info
Consolidating variant-related methods under thevar_info
alias makes the codebase more coherent.
120-121
: RefSeq TSV filtering
Restricting lines to only those containing"RefSeq_mRNA"
helps control data volume. Make sure you handle cases where no lines match to avoid potential edge cases.
163-165
: Summary page: Additional Tier2 row
Adding “SNV Tier 2 complete penetrance*” clarifies how complete penetrance variants are counted. Good approach.
166-166
: Summary page: Reference note
Inline reference to SOP is a nice detail. Ensure the reference doc is kept updated.
167-168
: Exomiser top 3 label
Highlights rank-based filtering. Ensure the threshold is consistent with the rest of the code.
169-169
: De novo row
Clear mention of “De novo” line count. Harmonises with the extended analysis logic.
170-170
: Overall result row
Adding this row clarifies the final conclusion for the case.
171-172
: Confirmation row
Good to see a dedicated slot for confirmation status.
173-174
: Primary analysis row
This row and the “Data check” row help keep metadata about who performed the analysis.
210-211
: Summary page row styling
“horizontal” merges multiple row segments. Looks consistent.
214-215
: Summary page row styling
“vertical” styling approach. Keep an eye on overlapping merges.
363-371
: Panel details
Mapping panel ID to code and name clarifies the tested condition. The logic gracefully skips unknown panels.
414-415
: Reading epic_clarity as CSV
Ensuring consistent use of.csv
across the code base matches the new approach.
421-422
: Handling column names
Pulling “Patient Stated Gender” and “Year of Birth” from CSV is clear. Good descriptive naming.
486-488
: Filtering parent dataframe
Simple and direct. If no parent is found, you returnNone
. This is stable.
Line range hint
591-600
: Create GEL tiering variant page: SNVs
The revised code references standalone functions invar_info
to reduce duplication. The consolidated approach is clean.
617-617
: Create GEL tiering variant page: STR
Similar chunk for STR variants. The process of appending to the shared variant_list is consistent.
633-633
: Create GEL tiering variant page: CNV
Correctly callsget_cnv_info
for Tier A or Tier 1 CNVs.
647-648
: Variant summary counts
Assigning the correct variant priorities to the summary cells helps maintain the quick-case overview.resources/home/dnanexus/tests/test_make_workbook.py (10)
10-10
: Import get_variant_info
Shifting to a single importvar_info
keeps the tests aligned with the refactoring.
157-157
: Testing add_columns_to_dict
Good check. The test verifies the dictionary is constructed with empty values.
177-177
: Testing tier conversion
Ensures all expected tier conversions for SNV, CNV, and STR are correct. Thorough coverage.
198-198
: Testing get_af_max
Verifies the function captures the highest frequency. Straightforward test scenario.
223-223
: Index participant found
Confirms correct index is returned for a known participant.
232-232
: Index participant error
Expects aRuntimeError
if participant not found. This is a robust guard.
239-239
: Index participant None
Checking a scenario where no participant ID is supplied. Code returnsNone
as expected.
258-258
: Test top 3 ranks
Ensures multiple items at the same rank are included. Nicely handled with your test approach.
271-271
: Test skipping rank
Validates what happens if second rank is missing. The function still returns the next available rank.
288-289
: Test refseq look-up
Verifies the function returns the correctENSP
ID for a matchedENST
. That's crucial for correct protein annotation.src/code.sh (1)
18-18
: Switch to .csv
Ensures theepic_clarity
argument now references files with a.csv
extension. This aligns with other code changes.README.md (2)
21-27
: Document new panels input
Clearly denotes the JSON schema for panel definitions. This is helpful for future maintainers.
32-32
: epic_clarity now .csv
The README matches the code changes, keeping docs and implementation in sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
resources/home/dnanexus/get_variant_info.py (3)
116-129
: Simplify inheritance logic by flattening nested conditions.The nested if statements can be simplified by combining conditions, making the code more readable and maintainable.
- if mother_idx is not None: - if zygosity(variant, mother_idx) in inheritance_types: - maternally_inherited = True + if mother_idx is not None and get_zygosity(variant, mother_idx) in inheritance_types: + maternally_inherited = True - if father_idx is not None: - if zygosity(variant, father_idx) in inheritance_types: - if (p_sex == 'MALE' and - variant['variantCoordinates']['chromosome'] == 'X'): - paternally_inherited = False - else: - paternally_inherited = True + if (father_idx is not None and + get_zygosity(variant, father_idx) in inheritance_types and + not (p_sex == 'MALE' and variant['variantCoordinates']['chromosome'] == 'X')): + paternally_inherited = True🧰 Tools
🪛 Ruff (0.8.2)
116-117: Use a single
if
statement instead of nestedif
statements(SIM102)
123-124: Use a single
if
statement instead of nestedif
statements(SIM102)
191-195
: Enhance error message with more context.The error message could be more descriptive to aid in debugging.
- raise RuntimeError( - f"Unable to find participant ID {participant_id} in " - f"variant {variant['variantCalls']}" - ) + raise RuntimeError( + f"Participant ID '{participant_id}' not found in variant calls. " + f"Available participants: {[call['participantId'] for call in variant['variantCalls']]}" + )
424-481
: Consider decomposing get_hgvs_gel into smaller functions.The function is quite long and handles multiple responsibilities. Consider extracting the MANE matching and protein change handling into separate functions.
+def _get_mane_matches(cdnas, mane): + """Extract MANE matches from cdna changes.""" + ref_list = [] + enst_list = [] + for cdna in cdnas: + refseq = look_up_id_in_refseq_mane_conversion_file( + mane, cdna.split('(')[0], "NM" + ) + if refseq is not None: + ref_list.append(refseq + cdna.split(')')[1]) + enst_list.append(cdna.split('(')[0]) + return ref_list, enst_list +def _get_protein_changes(enst_list, protein_changes, refseq_tsv): + """Extract protein changes for matched transcripts.""" + hgvs_p_list = [] + for enst in enst_list: + ensp = look_up_id_in_refseq_mane_conversion_file( + refseq_tsv, enst, "ENSP" + ) + if protein_changes: + for protein in protein_changes: + if ensp in protein: + hgvs_p_list.append(protein) + return hgvs_p_listThis would make the main function clearer and more maintainable.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
resources/home/dnanexus/get_variant_info.py
(1 hunks)resources/home/dnanexus/tests/test_make_workbook.py
(11 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- resources/home/dnanexus/tests/test_make_workbook.py
🧰 Additional context used
🪛 Ruff (0.8.2)
resources/home/dnanexus/get_variant_info.py
106-106: Do not assign a lambda
expression, use a def
Rewrite zygosity
as a def
(E731)
116-117: Use a single if
statement instead of nested if
statements
(SIM102)
123-124: Use a single if
statement instead of nested if
statements
(SIM102)
218-220: Do not assign a lambda
expression, use a def
Rewrite num_copies
as a def
(E731)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (5)
resources/home/dnanexus/get_variant_info.py (3)
69-90
: Consider using a small epsilon for floating-point comparison.When comparing floating-point numbers, using exact equality (
== 0
) can be unreliable due to floating-point arithmetic precision. Consider using a small epsilon value for the comparison.- if highest_af == 0: + if abs(highest_af) < 1e-10: # Using epsilon of 1e-10 highest_af = '-'
123-129
: Simplify nested conditions for better readability.The nested if conditions can be combined using logical operators for improved readability.
- if father_idx is not None: - if zygosity(variant, father_idx) in inheritance_types: - if (p_sex == 'MALE' and - variant['variantCoordinates']['chromosome'] == 'X'): - paternally_inherited = False - else: - paternally_inherited = True + if (father_idx is not None and + zygosity(variant, father_idx) in inheritance_types and + not (p_sex == 'MALE' and variant['variantCoordinates']['chromosome'] == 'X')): + paternally_inherited = True🧰 Tools
🪛 Ruff (0.8.2)
123-124: Use a single
if
statement instead of nestedif
statements(SIM102)
354-376
: Consider optimizing file search for large datasets.For large conversion files, the current implementation might be inefficient as it searches through the entire file for each query. Consider creating an index or using a dictionary for faster lookups if memory permits.
# Example optimization using dictionary (to be created once during initialization) conversion_dict = {} for line in conversion_file: for id_part in line.split(): if id_part.startswith(id_type): # Use the entire line as the key for multiple matches conversion_dict[line] = id_partresources/home/dnanexus/make_workbook.py (2)
486-488
: Use more descriptive variable names for gender values.The code uses magic numbers (1, 2) for gender values, which could lead to confusion.
+ MALE_GENDER = 1 + FEMALE_GENDER = 2 parent_df = df[ - (df['Patient Stated Gender'] == sex) & (df['Year of Birth'] < pb_yob) + (df['Patient Stated Gender'] == sex) & + (df['Year of Birth'] < pb_yob) ]
804-816
: Consider using pandas method chaining for better readability.The DataFrame operations can be made more readable using method chaining.
- # Grab all the de novos - denovo_df = ex_df.loc[ex_df['Priority'] == "De novo"] - - # Grab all the exomiser variants - ex_df = ex_df.loc[ex_df['Priority'] != "De novo"] - - # Now we have filtered out all variants that are in the GEL tiering - # page we need to get the top 3 ranks in the exomiser df - if not ex_df.empty: - ex_df = var_info.get_top_3_ranked(ex_df) - - # Concat de novos and exomiser ranked back together - ex_df = pd.concat([ex_df, denovo_df]) + ex_df = (pd.concat([ + ex_df.loc[ex_df['Priority'] == "De novo"], + ex_df.loc[ex_df['Priority'] != "De novo"].pipe( + lambda df: var_info.get_top_3_ranked(df) if not df.empty else df + ) + ]) + if not ex_df.empty else ex_df)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
resources/home/dnanexus/get_variant_info.py
(1 hunks)resources/home/dnanexus/make_workbook.py
(18 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
resources/home/dnanexus/get_variant_info.py
106-106: Do not assign a lambda
expression, use a def
Rewrite zygosity
as a def
(E731)
116-117: Use a single if
statement instead of nested if
statements
(SIM102)
123-124: Use a single if
statement instead of nested if
statements
(SIM102)
218-220: Do not assign a lambda
expression, use a def
Rewrite num_copies
as a def
(E731)
def get_hgvs_gel(variant, mane, refseq_tsv): | ||
''' | ||
GEL variants store HGVS p dot and c dot nomenclature separately. | ||
This function extracts the cdot (with MANE refseq equivalent to | ||
ensembl transcript ID if found) and pdot (with ensembl protein ID) | ||
Inputs: | ||
variant: (dict) dict describing single variant from JSON | ||
mane: MANE file for transcripts | ||
refseq: refseq file for transcripts | ||
Outputs: | ||
hgvs_c: (str) HGVS c dot (transcript) nomenclature for the variant. | ||
annotated against refseq transcript ID if one exists, and ensembl | ||
transcript ID if there is no matched equivalent | ||
hgvs_p: (str) HGVS p dot (protein) nomenclature for the variant. | ||
this is annotated against the ensembl protein ID. | ||
''' | ||
hgvs_p = None | ||
hgvs_c = None | ||
ref_list = [] | ||
enst_list = [] | ||
cdnas = variant['variantAttributes']['cdnaChanges'] | ||
protein_changes = variant['variantAttributes']['proteinChanges'] | ||
|
||
# Check for a MANE match | ||
for cdna in cdnas: | ||
refseq = look_up_id_in_refseq_mane_conversion_file( | ||
mane, cdna.split('(')[0], "NM" | ||
) | ||
|
||
if refseq is not None: | ||
ref_list.append(refseq + cdna.split(')')[1]) | ||
enst_list.append(cdna.split('(')[0]) | ||
|
||
# If no MANE match is found, return all transcripts | ||
if len(set(ref_list)) == 0: | ||
hgvs_c_list = [] | ||
ensp_list = [] | ||
for cdna in cdnas: | ||
refseq = VariantNomenclature.convert_ensembl_to_refseq_mane( | ||
mane, cdna.split('(')[0] | ||
hgvs_c_list.append(re.sub("\(.*?\)", "", cdna)) | ||
for protein in protein_changes: | ||
ensp_list.append(protein) | ||
hgvs_c = ', '.join(hgvs_c_list) | ||
hgvs_p = ', '.join(ensp_list) | ||
|
||
else: | ||
hgvs_c = ', '.join(list(set(ref_list))) | ||
hgvs_p_list = [] | ||
for enst in enst_list: | ||
ensp = look_up_id_in_refseq_mane_conversion_file( | ||
refseq_tsv, enst.split(".")[0], "ENSP" | ||
) | ||
|
||
if refseq is not None: | ||
ref_list.append(refseq + cdna.split(')')[1]) | ||
enst_list.append(cdna.split('(')[0]) | ||
|
||
# If no MANE match is found, return all transcripts | ||
if len(set(ref_list)) == 0: | ||
hgvs_c_list = [] | ||
ensp_list = [] | ||
for cdna in cdnas: | ||
hgvs_c_list.append(re.sub("\(.*?\)", "", cdna)) | ||
for protein in protein_changes: | ||
ensp_list.append(protein) | ||
hgvs_c = ', '.join(hgvs_c_list) | ||
hgvs_p = ', '.join(ensp_list) | ||
|
||
else: | ||
hgvs_c = ', '.join(list(set(ref_list))) | ||
hgvs_p_list = [] | ||
for enst in enst_list: | ||
ensp = VariantNomenclature.get_ensp(refseq_tsv, enst) | ||
if protein_changes != []: | ||
for protein in protein_changes: | ||
if ensp in protein: | ||
hgvs_p_list.append(protein) | ||
hgvs_p = ', '.join(hgvs_p_list) | ||
hgvs_p = ', '.join(hgvs_p_list) | ||
|
||
return hgvs_c, hgvs_p | ||
return hgvs_c, hgvs_p |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add error handling for malformed variant data.
The function assumes the variant data is well-formed but should handle potential edge cases gracefully.
cdnas = variant['variantAttributes']['cdnaChanges']
protein_changes = variant['variantAttributes']['proteinChanges']
+ if not isinstance(cdnas, list) or not isinstance(protein_changes, list):
+ raise ValueError("Expected lists for cdnaChanges and proteinChanges")
# Check for a MANE match
for cdna in cdnas:
+ try:
+ transcript_id = cdna.split('(')[0]
+ except (IndexError, AttributeError):
+ print(f"Warning: Malformed cdna string: {cdna}")
+ continue
Committable suggestion skipped: line range outside the PR's diff.
# Read in csv as df, using only relevant columns | ||
df = pd.read_csv( | ||
self.args.epic_clarity, | ||
usecols=[ | ||
"WGS Referral ID", | ||
"External Specimen Identifier", | ||
"Specimen Identifier", | ||
"Sex", | ||
"YOB" | ||
"Patient Stated Gender", | ||
"Year of Birth" | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add error handling for CSV file reading.
The CSV file reading lacks error handling for missing columns or malformed data.
+ required_columns = [
+ "WGS Referral ID",
+ "External Specimen Identifier",
+ "Specimen Identifier",
+ "Patient Stated Gender",
+ "Year of Birth"
+ ]
df = pd.read_csv(
self.args.epic_clarity,
- usecols=[
- "WGS Referral ID",
- "External Specimen Identifier",
- "Specimen Identifier",
- "Patient Stated Gender",
- "Year of Birth"
- ]
+ usecols=lambda x: x in required_columns
)
+ missing_columns = set(required_columns) - set(df.columns)
+ if missing_columns:
+ raise ValueError(f"Missing required columns: {missing_columns}")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Read in csv as df, using only relevant columns | |
df = pd.read_csv( | |
self.args.epic_clarity, | |
usecols=[ | |
"WGS Referral ID", | |
"External Specimen Identifier", | |
"Specimen Identifier", | |
"Sex", | |
"YOB" | |
"Patient Stated Gender", | |
"Year of Birth" | |
] | |
# Read in csv as df, using only relevant columns | |
required_columns = [ | |
"WGS Referral ID", | |
"External Specimen Identifier", | |
"Specimen Identifier", | |
"Patient Stated Gender", | |
"Year of Birth" | |
] | |
df = pd.read_csv( | |
self.args.epic_clarity, | |
usecols=lambda x: x in required_columns | |
) | |
missing_columns = set(required_columns) - set(df.columns) | |
if missing_columns: | |
raise ValueError(f"Missing required columns: {missing_columns}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 5 files at r1, 1 of 2 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @kjwinfield)
resources/home/dnanexus/get_variant_info.py
line 360 at r3 (raw file):
Inputs: query_id (str): ID to query (an ensembl ID) conversion_file (list): list of lines from a transcript
Might be slightly pedantic but could we call this something that doesn't imply it's a file? Even just conversion_file_list
or something
resources/home/dnanexus/get_variant_info.py
line 374 at r3 (raw file):
matches = [x for x in line.split() if x.startswith(id_type)] if matches != []: matched_id = matches[0]
Do we definitely only ever want the first list item (assuming it's possible for multiples)? If it should never be multiple, do we need to catch that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
resources/home/dnanexus/get_variant_info.py (1)
116-129
: Optimise inheritance logic with combined conditions.Let's enhance the code by combining nested conditions and adding validation:
- if mother_idx is not None: - if zygosity(variant, mother_idx) in inheritance_types: - maternally_inherited = True + if mother_idx is not None and zygosity(variant, mother_idx) in inheritance_types: + maternally_inherited = True - if father_idx is not None: - if zygosity(variant, father_idx) in inheritance_types: - if (p_sex == 'MALE' and - variant['variantCoordinates']['chromosome'] == 'X'): - paternally_inherited = False - else: - paternally_inherited = True + if (father_idx is not None and + zygosity(variant, father_idx) in inheritance_types and + not (p_sex == 'MALE' and variant['variantCoordinates']['chromosome'] == 'X')): + paternally_inherited = True🧰 Tools
🪛 Ruff (0.8.2)
116-117: Use a single
if
statement instead of nestedif
statements(SIM102)
123-124: Use a single
if
statement instead of nestedif
statements(SIM102)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
resources/home/dnanexus/get_variant_info.py
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
resources/home/dnanexus/get_variant_info.py
106-106: Do not assign a lambda
expression, use a def
Rewrite zygosity
as a def
(E731)
116-117: Use a single if
statement instead of nested if
statements
(SIM102)
123-124: Use a single if
statement instead of nested if
statements
(SIM102)
218-220: Do not assign a lambda
expression, use a def
Rewrite num_copies
as a def
(E731)
🔇 Additional comments (1)
resources/home/dnanexus/get_variant_info.py (1)
447-456
: 🛠️ Refactor suggestionEnhance error handling for transcript matching.
The transcript matching logic should handle potential edge cases more robustly:
for cdna in cdnas: + if not cdna or '(' not in cdna or ')' not in cdna: + print(f"Warning: Malformed cdna string: {cdna}") + continue + refseq = look_up_id_in_refseq_mane_conversion_file( mane, cdna.split('(')[0], "NM" ) if refseq is not None: + try: ref_list.append(refseq + cdna.split(')')[1]) enst_list.append(cdna.split('(')[0]) + except IndexError: + print(f"Warning: Unable to process refseq {refseq} for cdna {cdna}") + continueLikely invalid or redundant comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
resources/home/dnanexus/get_variant_info.py (2)
339-341
: Improve DataFrame operations for better maintainability.Consider these enhancements:
- Avoid in-place DataFrame modifications
- Replace the lambda with a cleaner integer extraction
- df['priority_as_int'] = df['Priority'].map( - lambda x: int(x.split(' ')[-1]) - ) + def extract_priority_rank(priority_str): + return int(priority_str.split(' ')[-1]) + + df = df.assign(priority_as_int=df['Priority'].map(extract_priority_rank)) # filter the df to include only values in top three ranks - df = df[df['priority_as_int'].isin(top_3_ranks)] - df.drop(['priority_as_int'], axis=1, inplace=True) + return df[df['priority_as_int'].isin(top_3_ranks)].drop('priority_as_int', axis=1)Also applies to: 347-348
368-374
: Optimise lookup performance with caching.The current implementation performs linear search for each lookup. For better performance with large conversion files, consider implementing a cache:
+ @functools.lru_cache(maxsize=1024) def look_up_id_in_refseq_mane_conversion_file(conversion, query_id, id_type): - matched_id = None - for line in conversion: - if query_id in line: - matches = [x for x in line.split() if x.startswith(id_type)] - if matches != []: - matched_id = matches[0] - break - return matched_id + # Convert conversion list to a dictionary for O(1) lookup + if not hasattr(look_up_id_in_refseq_mane_conversion_file, '_cache'): + look_up_id_in_refseq_mane_conversion_file._cache = {} + for line in conversion: + for part in line.split(): + if ':' in part: + key = part.split(':')[0] + look_up_id_in_refseq_mane_conversion_file._cache[key] = line + + if query_id not in look_up_id_in_refseq_mane_conversion_file._cache: + return None + + line = look_up_id_in_refseq_mane_conversion_file._cache[query_id] + matches = [x for x in line.split() if x.startswith(id_type)] + return matches[0] if matches else None
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
resources/home/dnanexus/get_variant_info.py
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
resources/home/dnanexus/get_variant_info.py
106-106: Do not assign a lambda
expression, use a def
Rewrite zygosity
as a def
(E731)
216-218: Do not assign a lambda
expression, use a def
Rewrite num_copies
as a def
(E731)
🔇 Additional comments (1)
resources/home/dnanexus/get_variant_info.py (1)
442-444
: 🛠️ Refactor suggestionEnhance error handling for variant attributes.
The code assumes
cdnaChanges
andproteinChanges
are well-formed. Let's add validation:+ if not isinstance(cdnas, list) or not isinstance(protein_changes, list): + print(f"Warning: Malformed variant attributes: cdnaChanges={cdnas}, proteinChanges={protein_changes}") + return None, None + cdnas = variant['variantAttributes']['cdnaChanges'] protein_changes = variant['variantAttributes']['proteinChanges']Likely invalid or redundant comment.
Outputs: | ||
inheritance (str): inferred inheritance of the variant, or None. | ||
''' | ||
zygosity = lambda x, y: x['variantCalls'][y]['zygosity'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Convert lambda expressions to regular functions for improved maintainability.
Optimising for clarity and maintainability, let's convert these lambda expressions to proper function definitions:
-zygosity = lambda x, y: x['variantCalls'][y]['zygosity']
+def get_zygosity(variant, call_idx):
+ return variant['variantCalls'][call_idx]['zygosity']
-num_copies = lambda x, y, z: x['variantCalls'][y]['numberOfCopies'][z]['numberOfCopies']
+def get_num_copies(variant, call_idx, copy_idx):
+ return variant['variantCalls'][call_idx]['numberOfCopies'][copy_idx]['numberOfCopies']
Also applies to: 216-218
🧰 Tools
🪛 Ruff (0.8.2)
106-106: Do not assign a lambda
expression, use a def
Rewrite zygosity
as a def
(E731)
This change is
Summary by CodeRabbit
Input Requirements
panels
for mapping panel IDs to names and test codes.epic_clarity
input from Excel (.xlsx) to CSV (.csv) format.Code Refactoring
Testing
The changes improve input clarity, code organisation, and maintainability of the variant processing workflow.