Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Di 1338 v1.3.0 #16

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Open

Di 1338 v1.3.0 #16

wants to merge 11 commits into from

Conversation

kjwinfield
Copy link
Collaborator

@kjwinfield kjwinfield commented Jan 2, 2025

  1. Fix Missing required input in readme #15
  2. Fix New box in summary field requested #14
  3. Update clarity import to use correct column names and data format
  4. Fix bug that included de novo variants when sorting by exomiser rank
  5. Fix bug that meant that transcript could be matched to protein when looking in refseq conversion file

This change is Reviewable

Summary by CodeRabbit

  • Input Requirements

    • Added required JSON input panels for mapping panel IDs to names and test codes.
    • Updated epic_clarity input from Excel (.xlsx) to CSV (.csv) format.
  • Code Refactoring

    • Converted multiple static methods to standalone functions in variant information processing.
    • Simplified variant data handling and module imports.
    • Enhanced code modularity without changing core functionality.
  • Testing

    • Updated test cases to reflect new import and function call structures.

The changes improve input clarity, code organisation, and maintainability of the variant processing workflow.

Copy link

coderabbitai bot commented Jan 2, 2025

Warning

Rate limit exceeded

@kjwinfield has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 2 minutes and 58 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 759d601 and e9e23b1.

📒 Files selected for processing (1)
  • resources/home/dnanexus/get_variant_info.py (1 hunks)

Walkthrough

The pull request introduces significant modifications to the codebase, focusing on input requirements and code structure. The primary changes include adding a new required JSON input for panels, updating the Epic Clarity input from .xlsx to .csv, and refactoring the get_variant_info.py script by converting class methods to standalone functions. These modifications enhance input clarity and improve code modularity across multiple files.

Changes

File Change Summary
README.md Added documentation for new panels JSON input and updated Epic Clarity input file type
resources/home/dnanexus/get_variant_info.py Refactored VariantUtils class methods to standalone functions, added new functions for ID lookup and HGVS extraction
resources/home/dnanexus/make_workbook.py Updated import statements, modified data handling, and summary page structure, changed input file handling from .xlsx to .csv
resources/home/dnanexus/tests/test_make_workbook.py Updated method references to match new import structure and modified test cases for new panel configuration
src/code.sh Changed Epic Clarity input file extension from .xlsx to .csv

Assessment against linked issues

Objective Addressed Explanation
Add panels input description to README [#15]
Add new summary field box [#14] No specific changes visible in the provided diff

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@pep8speaks
Copy link

pep8speaks commented Jan 2, 2025

Hello @kjwinfield! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 106:5: E731 do not assign a lambda expression, use a def
Line 117:9: E129 visually indented line with same indent as next logical line
Line 126:9: E125 continuation line with same indent as next logical line
Line 126:9: E128 continuation line under-indented for visual indent
Line 216:5: E731 do not assign a lambda expression, use a def
Line 469:40: W605 invalid escape sequence '('
Line 469:45: W605 invalid escape sequence ')'

Line 487:80: E501 line too long (81 > 79 characters)

Comment last updated at 2025-01-03 17:23:33 UTC

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
resources/home/dnanexus/get_variant_info.py (3)

93-140: Determine inheritance
Overall, the logic for maternal/paternal determination is clear. However, note the following static analysis hints:

  • Line 106: Instead of assigning zygosity = lambda x, y: ..., define a small helper function. This makes debugging easier.
  • Lines 116-117 & 123-124: You can consolidate nested if statements into a single check for cleaner code.
🧰 Tools
🪛 Ruff (0.8.2)

106-106: Do not assign a lambda expression, use a def

Rewrite zygosity as a def

(E731)


116-117: Use a single if statement instead of nested if statements

(SIM102)


123-124: Use a single if statement instead of nested if statements

(SIM102)


200-238: STR variant info
Populating a dictionary with STR-specific fields is straightforward. For repeated logic, note that:

  • Lines 218-220: Instead of num_copies = lambda x, y, z: ..., define a small helper function. Lambdas assigned to variables can be replaced with def.
🧰 Tools
🪛 Ruff (0.8.2)

218-220: Do not assign a lambda expression, use a def

Rewrite num_copies as a def

(E731)


379-422: HGVS nomenclature for exomiser
The fallback behaviour for when a MANE transcript is missing is logical. The debug print(refseq) can be removed if no longer necessary in production.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8a4e8d9 and 89f06f8.

📒 Files selected for processing (5)
  • README.md (1 hunks)
  • resources/home/dnanexus/get_variant_info.py (1 hunks)
  • resources/home/dnanexus/make_workbook.py (18 hunks)
  • resources/home/dnanexus/tests/test_make_workbook.py (9 hunks)
  • src/code.sh (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
resources/home/dnanexus/get_variant_info.py

106-106: Do not assign a lambda expression, use a def

Rewrite zygosity as a def

(E731)


116-117: Use a single if statement instead of nested if statements

(SIM102)


123-124: Use a single if statement instead of nested if statements

(SIM102)


218-220: Do not assign a lambda expression, use a def

Rewrite num_copies as a def

(E731)

🔇 Additional comments (43)
resources/home/dnanexus/get_variant_info.py (11)

3-20: Add columns to dictionary
This function is concise and clearly communicates its purpose through the docstring. The approach for creating an empty dictionary keyed by the column list is perfectly fine.


22-41: Extract gene symbol
The loop to gather gene symbols and the handling of multiple gene symbols is straightforward. Sorting and de-duplicating the symbols before returning them is good practice.


43-67: Convert tier
The logic branches cover expected combinations of tiers and variant types. This is easy to read.


69-90: Retrieve maximum allele frequency
The logic handles the scenario where alleleFrequencies is None or empty. The use of '-' for missing data is a sensible approach.


143-171: Convert mode of inheritance
Handles a range of MOI values neatly, mapping them to more readable strings. The exception handling for unknown MOIs is also appropriate.


174-197: Index participant
Throwing a RuntimeError when the participant ID isn't found is a strong move, ensuring no silent failures. Implementation is succinct.


240-289: SNV variant info
The call to convert_tier and the subsequent fields for zygosity, depth, and gene data are all consistent. Good job returning a fully populated dictionary in one pass.


292-324: CNV variant info
Similar structure to SNV/STR. The approach of reusing add_columns_to_dict is DRY and keeps the code consistent.


327-352: Retrieve top three exomiser-ranked variants
Uses a small podium ranking strategy, capturing all variants tied within the top three ranks. This is a neat solution.


354-376: Look up ID in conversion file
Straightforward approach to match lines against the desired ID. Efficient, considering the data is stored in memory and looped through once.


424-481: HGVS nomenclature for GEL
Collecting transcripts into lists and rejoining them for multiple hits is well-handled. The method gracefully falls back to Ensembl transcripts if no MANE match is found.

resources/home/dnanexus/make_workbook.py (19)

18-18: Unified import of get_variant_info
Consolidating variant-related methods under the var_info alias makes the codebase more coherent.


120-121: RefSeq TSV filtering
Restricting lines to only those containing "RefSeq_mRNA" helps control data volume. Make sure you handle cases where no lines match to avoid potential edge cases.


163-165: Summary page: Additional Tier2 row
Adding “SNV Tier 2 complete penetrance*” clarifies how complete penetrance variants are counted. Good approach.


166-166: Summary page: Reference note
Inline reference to SOP is a nice detail. Ensure the reference doc is kept updated.


167-168: Exomiser top 3 label
Highlights rank-based filtering. Ensure the threshold is consistent with the rest of the code.


169-169: De novo row
Clear mention of “De novo” line count. Harmonises with the extended analysis logic.


170-170: Overall result row
Adding this row clarifies the final conclusion for the case.


171-172: Confirmation row
Good to see a dedicated slot for confirmation status.


173-174: Primary analysis row
This row and the “Data check” row help keep metadata about who performed the analysis.


210-211: Summary page row styling
“horizontal” merges multiple row segments. Looks consistent.


214-215: Summary page row styling
“vertical” styling approach. Keep an eye on overlapping merges.


363-371: Panel details
Mapping panel ID to code and name clarifies the tested condition. The logic gracefully skips unknown panels.


414-415: Reading epic_clarity as CSV
Ensuring consistent use of .csv across the code base matches the new approach.


421-422: Handling column names
Pulling “Patient Stated Gender” and “Year of Birth” from CSV is clear. Good descriptive naming.


486-488: Filtering parent dataframe
Simple and direct. If no parent is found, you return None. This is stable.


Line range hint 591-600: Create GEL tiering variant page: SNVs
The revised code references standalone functions in var_info to reduce duplication. The consolidated approach is clean.


617-617: Create GEL tiering variant page: STR
Similar chunk for STR variants. The process of appending to the shared variant_list is consistent.


633-633: Create GEL tiering variant page: CNV
Correctly calls get_cnv_info for Tier A or Tier 1 CNVs.


647-648: Variant summary counts
Assigning the correct variant priorities to the summary cells helps maintain the quick-case overview.

resources/home/dnanexus/tests/test_make_workbook.py (10)

10-10: Import get_variant_info
Shifting to a single import var_info keeps the tests aligned with the refactoring.


157-157: Testing add_columns_to_dict
Good check. The test verifies the dictionary is constructed with empty values.


177-177: Testing tier conversion
Ensures all expected tier conversions for SNV, CNV, and STR are correct. Thorough coverage.


198-198: Testing get_af_max
Verifies the function captures the highest frequency. Straightforward test scenario.


223-223: Index participant found
Confirms correct index is returned for a known participant.


232-232: Index participant error
Expects a RuntimeError if participant not found. This is a robust guard.


239-239: Index participant None
Checking a scenario where no participant ID is supplied. Code returns None as expected.


258-258: Test top 3 ranks
Ensures multiple items at the same rank are included. Nicely handled with your test approach.


271-271: Test skipping rank
Validates what happens if second rank is missing. The function still returns the next available rank.


288-289: Test refseq look-up
Verifies the function returns the correct ENSP ID for a matched ENST. That's crucial for correct protein annotation.

src/code.sh (1)

18-18: Switch to .csv
Ensures the epic_clarity argument now references files with a .csv extension. This aligns with other code changes.

README.md (2)

21-27: Document new panels input
Clearly denotes the JSON schema for panel definitions. This is helpful for future maintainers.


32-32: epic_clarity now .csv
The README matches the code changes, keeping docs and implementation in sync.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
resources/home/dnanexus/get_variant_info.py (3)

116-129: Simplify inheritance logic by flattening nested conditions.

The nested if statements can be simplified by combining conditions, making the code more readable and maintainable.

-    if mother_idx is not None:
-        if zygosity(variant, mother_idx) in inheritance_types:
-            maternally_inherited = True
+    if mother_idx is not None and get_zygosity(variant, mother_idx) in inheritance_types:
+        maternally_inherited = True

-    if father_idx is not None:
-        if zygosity(variant, father_idx) in inheritance_types:
-            if (p_sex == 'MALE' and
-            variant['variantCoordinates']['chromosome'] == 'X'):
-                paternally_inherited = False
-            else:
-                paternally_inherited = True
+    if (father_idx is not None and 
+        get_zygosity(variant, father_idx) in inheritance_types and
+        not (p_sex == 'MALE' and variant['variantCoordinates']['chromosome'] == 'X')):
+        paternally_inherited = True
🧰 Tools
🪛 Ruff (0.8.2)

116-117: Use a single if statement instead of nested if statements

(SIM102)


123-124: Use a single if statement instead of nested if statements

(SIM102)


191-195: Enhance error message with more context.

The error message could be more descriptive to aid in debugging.

-            raise RuntimeError(
-                f"Unable to find participant ID {participant_id} in "
-                f"variant {variant['variantCalls']}"
-            )
+            raise RuntimeError(
+                f"Participant ID '{participant_id}' not found in variant calls. "
+                f"Available participants: {[call['participantId'] for call in variant['variantCalls']]}"
+            )

424-481: Consider decomposing get_hgvs_gel into smaller functions.

The function is quite long and handles multiple responsibilities. Consider extracting the MANE matching and protein change handling into separate functions.

+def _get_mane_matches(cdnas, mane):
+    """Extract MANE matches from cdna changes."""
+    ref_list = []
+    enst_list = []
+    for cdna in cdnas:
+        refseq = look_up_id_in_refseq_mane_conversion_file(
+            mane, cdna.split('(')[0], "NM"
+        )
+        if refseq is not None:
+            ref_list.append(refseq + cdna.split(')')[1])
+            enst_list.append(cdna.split('(')[0])
+    return ref_list, enst_list

+def _get_protein_changes(enst_list, protein_changes, refseq_tsv):
+    """Extract protein changes for matched transcripts."""
+    hgvs_p_list = []
+    for enst in enst_list:
+        ensp = look_up_id_in_refseq_mane_conversion_file(
+            refseq_tsv, enst, "ENSP"
+        )
+        if protein_changes:
+            for protein in protein_changes:
+                if ensp in protein:
+                    hgvs_p_list.append(protein)
+    return hgvs_p_list

This would make the main function clearer and more maintainable.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 040b007 and 1a44787.

📒 Files selected for processing (2)
  • resources/home/dnanexus/get_variant_info.py (1 hunks)
  • resources/home/dnanexus/tests/test_make_workbook.py (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • resources/home/dnanexus/tests/test_make_workbook.py
🧰 Additional context used
🪛 Ruff (0.8.2)
resources/home/dnanexus/get_variant_info.py

106-106: Do not assign a lambda expression, use a def

Rewrite zygosity as a def

(E731)


116-117: Use a single if statement instead of nested if statements

(SIM102)


123-124: Use a single if statement instead of nested if statements

(SIM102)


218-220: Do not assign a lambda expression, use a def

Rewrite num_copies as a def

(E731)

@kjwinfield kjwinfield requested a review from chrispyatt January 3, 2025 14:02
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (5)
resources/home/dnanexus/get_variant_info.py (3)

69-90: Consider using a small epsilon for floating-point comparison.

When comparing floating-point numbers, using exact equality (== 0) can be unreliable due to floating-point arithmetic precision. Consider using a small epsilon value for the comparison.

-        if highest_af == 0:
+        if abs(highest_af) < 1e-10:  # Using epsilon of 1e-10
             highest_af = '-'

123-129: Simplify nested conditions for better readability.

The nested if conditions can be combined using logical operators for improved readability.

-    if father_idx is not None:
-        if zygosity(variant, father_idx) in inheritance_types:
-            if (p_sex == 'MALE' and
-            variant['variantCoordinates']['chromosome'] == 'X'):
-                paternally_inherited = False
-            else:
-                paternally_inherited = True
+    if (father_idx is not None and 
+        zygosity(variant, father_idx) in inheritance_types and
+        not (p_sex == 'MALE' and variant['variantCoordinates']['chromosome'] == 'X')):
+        paternally_inherited = True
🧰 Tools
🪛 Ruff (0.8.2)

123-124: Use a single if statement instead of nested if statements

(SIM102)


354-376: Consider optimizing file search for large datasets.

For large conversion files, the current implementation might be inefficient as it searches through the entire file for each query. Consider creating an index or using a dictionary for faster lookups if memory permits.

# Example optimization using dictionary (to be created once during initialization)
conversion_dict = {}
for line in conversion_file:
    for id_part in line.split():
        if id_part.startswith(id_type):
            # Use the entire line as the key for multiple matches
            conversion_dict[line] = id_part
resources/home/dnanexus/make_workbook.py (2)

486-488: Use more descriptive variable names for gender values.

The code uses magic numbers (1, 2) for gender values, which could lead to confusion.

+    MALE_GENDER = 1
+    FEMALE_GENDER = 2
     parent_df = df[
-        (df['Patient Stated Gender'] == sex) & (df['Year of Birth'] < pb_yob)
+        (df['Patient Stated Gender'] == sex) & 
+        (df['Year of Birth'] < pb_yob)
     ]

804-816: Consider using pandas method chaining for better readability.

The DataFrame operations can be made more readable using method chaining.

-    # Grab all the de novos
-    denovo_df = ex_df.loc[ex_df['Priority'] == "De novo"]
-
-    # Grab all the exomiser variants
-    ex_df = ex_df.loc[ex_df['Priority'] != "De novo"]
-
-    # Now we have filtered out all variants that are in the GEL tiering
-    # page we need to get the top 3 ranks in the exomiser df
-    if not ex_df.empty:
-        ex_df = var_info.get_top_3_ranked(ex_df)
-
-    # Concat de novos and exomiser ranked back together
-    ex_df = pd.concat([ex_df, denovo_df])
+    ex_df = (pd.concat([
+        ex_df.loc[ex_df['Priority'] == "De novo"],
+        ex_df.loc[ex_df['Priority'] != "De novo"].pipe(
+            lambda df: var_info.get_top_3_ranked(df) if not df.empty else df
+        )
+    ])
+    if not ex_df.empty else ex_df)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a44787 and 73e9876.

📒 Files selected for processing (2)
  • resources/home/dnanexus/get_variant_info.py (1 hunks)
  • resources/home/dnanexus/make_workbook.py (18 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
resources/home/dnanexus/get_variant_info.py

106-106: Do not assign a lambda expression, use a def

Rewrite zygosity as a def

(E731)


116-117: Use a single if statement instead of nested if statements

(SIM102)


123-124: Use a single if statement instead of nested if statements

(SIM102)


218-220: Do not assign a lambda expression, use a def

Rewrite num_copies as a def

(E731)

Comment on lines 424 to 481
def get_hgvs_gel(variant, mane, refseq_tsv):
'''
GEL variants store HGVS p dot and c dot nomenclature separately.
This function extracts the cdot (with MANE refseq equivalent to
ensembl transcript ID if found) and pdot (with ensembl protein ID)
Inputs:
variant: (dict) dict describing single variant from JSON
mane: MANE file for transcripts
refseq: refseq file for transcripts
Outputs:
hgvs_c: (str) HGVS c dot (transcript) nomenclature for the variant.
annotated against refseq transcript ID if one exists, and ensembl
transcript ID if there is no matched equivalent
hgvs_p: (str) HGVS p dot (protein) nomenclature for the variant.
this is annotated against the ensembl protein ID.
'''
hgvs_p = None
hgvs_c = None
ref_list = []
enst_list = []
cdnas = variant['variantAttributes']['cdnaChanges']
protein_changes = variant['variantAttributes']['proteinChanges']

# Check for a MANE match
for cdna in cdnas:
refseq = look_up_id_in_refseq_mane_conversion_file(
mane, cdna.split('(')[0], "NM"
)

if refseq is not None:
ref_list.append(refseq + cdna.split(')')[1])
enst_list.append(cdna.split('(')[0])

# If no MANE match is found, return all transcripts
if len(set(ref_list)) == 0:
hgvs_c_list = []
ensp_list = []
for cdna in cdnas:
refseq = VariantNomenclature.convert_ensembl_to_refseq_mane(
mane, cdna.split('(')[0]
hgvs_c_list.append(re.sub("\(.*?\)", "", cdna))
for protein in protein_changes:
ensp_list.append(protein)
hgvs_c = ', '.join(hgvs_c_list)
hgvs_p = ', '.join(ensp_list)

else:
hgvs_c = ', '.join(list(set(ref_list)))
hgvs_p_list = []
for enst in enst_list:
ensp = look_up_id_in_refseq_mane_conversion_file(
refseq_tsv, enst.split(".")[0], "ENSP"
)

if refseq is not None:
ref_list.append(refseq + cdna.split(')')[1])
enst_list.append(cdna.split('(')[0])

# If no MANE match is found, return all transcripts
if len(set(ref_list)) == 0:
hgvs_c_list = []
ensp_list = []
for cdna in cdnas:
hgvs_c_list.append(re.sub("\(.*?\)", "", cdna))
for protein in protein_changes:
ensp_list.append(protein)
hgvs_c = ', '.join(hgvs_c_list)
hgvs_p = ', '.join(ensp_list)

else:
hgvs_c = ', '.join(list(set(ref_list)))
hgvs_p_list = []
for enst in enst_list:
ensp = VariantNomenclature.get_ensp(refseq_tsv, enst)
if protein_changes != []:
for protein in protein_changes:
if ensp in protein:
hgvs_p_list.append(protein)
hgvs_p = ', '.join(hgvs_p_list)
hgvs_p = ', '.join(hgvs_p_list)

return hgvs_c, hgvs_p
return hgvs_c, hgvs_p
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add error handling for malformed variant data.

The function assumes the variant data is well-formed but should handle potential edge cases gracefully.

     cdnas = variant['variantAttributes']['cdnaChanges']
     protein_changes = variant['variantAttributes']['proteinChanges']
+    if not isinstance(cdnas, list) or not isinstance(protein_changes, list):
+        raise ValueError("Expected lists for cdnaChanges and proteinChanges")
 
     # Check for a MANE match
     for cdna in cdnas:
+        try:
+            transcript_id = cdna.split('(')[0]
+        except (IndexError, AttributeError):
+            print(f"Warning: Malformed cdna string: {cdna}")
+            continue

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +414 to 423
# Read in csv as df, using only relevant columns
df = pd.read_csv(
self.args.epic_clarity,
usecols=[
"WGS Referral ID",
"External Specimen Identifier",
"Specimen Identifier",
"Sex",
"YOB"
"Patient Stated Gender",
"Year of Birth"
]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add error handling for CSV file reading.

The CSV file reading lacks error handling for missing columns or malformed data.

+    required_columns = [
+        "WGS Referral ID",
+        "External Specimen Identifier",
+        "Specimen Identifier",
+        "Patient Stated Gender",
+        "Year of Birth"
+    ]
     df = pd.read_csv(
         self.args.epic_clarity,
-        usecols=[
-            "WGS Referral ID",
-            "External Specimen Identifier",
-            "Specimen Identifier",
-            "Patient Stated Gender",
-            "Year of Birth"
-        ]
+        usecols=lambda x: x in required_columns
     )
+    missing_columns = set(required_columns) - set(df.columns)
+    if missing_columns:
+        raise ValueError(f"Missing required columns: {missing_columns}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Read in csv as df, using only relevant columns
df = pd.read_csv(
self.args.epic_clarity,
usecols=[
"WGS Referral ID",
"External Specimen Identifier",
"Specimen Identifier",
"Sex",
"YOB"
"Patient Stated Gender",
"Year of Birth"
]
# Read in csv as df, using only relevant columns
required_columns = [
"WGS Referral ID",
"External Specimen Identifier",
"Specimen Identifier",
"Patient Stated Gender",
"Year of Birth"
]
df = pd.read_csv(
self.args.epic_clarity,
usecols=lambda x: x in required_columns
)
missing_columns = set(required_columns) - set(df.columns)
if missing_columns:
raise ValueError(f"Missing required columns: {missing_columns}")

Copy link
Contributor

@chrispyatt chrispyatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 5 files at r1, 1 of 2 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @kjwinfield)


resources/home/dnanexus/get_variant_info.py line 360 at r3 (raw file):

        Inputs:
            query_id (str): ID to query (an ensembl ID)
            conversion_file (list): list of lines from a transcript

Might be slightly pedantic but could we call this something that doesn't imply it's a file? Even just conversion_file_list or something


resources/home/dnanexus/get_variant_info.py line 374 at r3 (raw file):

            matches = [x for x in line.split() if x.startswith(id_type)]
            if matches != []:
                matched_id = matches[0]

Do we definitely only ever want the first list item (assuming it's possible for multiples)? If it should never be multiple, do we need to catch that?

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
resources/home/dnanexus/get_variant_info.py (1)

116-129: Optimise inheritance logic with combined conditions.

Let's enhance the code by combining nested conditions and adding validation:

-    if mother_idx is not None:
-        if zygosity(variant, mother_idx) in inheritance_types:
-            maternally_inherited = True
+    if mother_idx is not None and zygosity(variant, mother_idx) in inheritance_types:
+        maternally_inherited = True

-    if father_idx is not None:
-        if zygosity(variant, father_idx) in inheritance_types:
-            if (p_sex == 'MALE' and
-            variant['variantCoordinates']['chromosome'] == 'X'):
-                paternally_inherited = False
-            else:
-                paternally_inherited = True
+    if (father_idx is not None and 
+        zygosity(variant, father_idx) in inheritance_types and
+        not (p_sex == 'MALE' and variant['variantCoordinates']['chromosome'] == 'X')):
+        paternally_inherited = True
🧰 Tools
🪛 Ruff (0.8.2)

116-117: Use a single if statement instead of nested if statements

(SIM102)


123-124: Use a single if statement instead of nested if statements

(SIM102)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 73e9876 and 92afdcd.

📒 Files selected for processing (1)
  • resources/home/dnanexus/get_variant_info.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
resources/home/dnanexus/get_variant_info.py

106-106: Do not assign a lambda expression, use a def

Rewrite zygosity as a def

(E731)


116-117: Use a single if statement instead of nested if statements

(SIM102)


123-124: Use a single if statement instead of nested if statements

(SIM102)


218-220: Do not assign a lambda expression, use a def

Rewrite num_copies as a def

(E731)

🔇 Additional comments (1)
resources/home/dnanexus/get_variant_info.py (1)

447-456: 🛠️ Refactor suggestion

Enhance error handling for transcript matching.

The transcript matching logic should handle potential edge cases more robustly:

     for cdna in cdnas:
+        if not cdna or '(' not in cdna or ')' not in cdna:
+            print(f"Warning: Malformed cdna string: {cdna}")
+            continue
+
         refseq = look_up_id_in_refseq_mane_conversion_file(
             mane, cdna.split('(')[0], "NM"
         )

         if refseq is not None:
+            try:
                 ref_list.append(refseq + cdna.split(')')[1])
                 enst_list.append(cdna.split('(')[0])
+            except IndexError:
+                print(f"Warning: Unable to process refseq {refseq} for cdna {cdna}")
+                continue

Likely invalid or redundant comment.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
resources/home/dnanexus/get_variant_info.py (2)

339-341: Improve DataFrame operations for better maintainability.

Consider these enhancements:

  1. Avoid in-place DataFrame modifications
  2. Replace the lambda with a cleaner integer extraction
-    df['priority_as_int'] = df['Priority'].map(
-        lambda x: int(x.split(' ')[-1])
-    )
+    def extract_priority_rank(priority_str):
+        return int(priority_str.split(' ')[-1])
+
+    df = df.assign(priority_as_int=df['Priority'].map(extract_priority_rank))

     # filter the df to include only values in top three ranks
-    df = df[df['priority_as_int'].isin(top_3_ranks)]
-    df.drop(['priority_as_int'], axis=1, inplace=True)
+    return df[df['priority_as_int'].isin(top_3_ranks)].drop('priority_as_int', axis=1)

Also applies to: 347-348


368-374: Optimise lookup performance with caching.

The current implementation performs linear search for each lookup. For better performance with large conversion files, consider implementing a cache:

+    @functools.lru_cache(maxsize=1024)
     def look_up_id_in_refseq_mane_conversion_file(conversion, query_id, id_type):
-        matched_id = None
-        for line in conversion:
-            if query_id in line:
-                matches = [x for x in line.split() if x.startswith(id_type)]
-                if matches != []:
-                    matched_id = matches[0]
-                break
-        return matched_id
+        # Convert conversion list to a dictionary for O(1) lookup
+        if not hasattr(look_up_id_in_refseq_mane_conversion_file, '_cache'):
+            look_up_id_in_refseq_mane_conversion_file._cache = {}
+            for line in conversion:
+                for part in line.split():
+                    if ':' in part:
+                        key = part.split(':')[0]
+                        look_up_id_in_refseq_mane_conversion_file._cache[key] = line
+
+        if query_id not in look_up_id_in_refseq_mane_conversion_file._cache:
+            return None
+
+        line = look_up_id_in_refseq_mane_conversion_file._cache[query_id]
+        matches = [x for x in line.split() if x.startswith(id_type)]
+        return matches[0] if matches else None
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 92afdcd and 759d601.

📒 Files selected for processing (1)
  • resources/home/dnanexus/get_variant_info.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
resources/home/dnanexus/get_variant_info.py

106-106: Do not assign a lambda expression, use a def

Rewrite zygosity as a def

(E731)


216-218: Do not assign a lambda expression, use a def

Rewrite num_copies as a def

(E731)

🔇 Additional comments (1)
resources/home/dnanexus/get_variant_info.py (1)

442-444: 🛠️ Refactor suggestion

Enhance error handling for variant attributes.

The code assumes cdnaChanges and proteinChanges are well-formed. Let's add validation:

+    if not isinstance(cdnas, list) or not isinstance(protein_changes, list):
+        print(f"Warning: Malformed variant attributes: cdnaChanges={cdnas}, proteinChanges={protein_changes}")
+        return None, None
+
     cdnas = variant['variantAttributes']['cdnaChanges']
     protein_changes = variant['variantAttributes']['proteinChanges']

Likely invalid or redundant comment.

Outputs:
inheritance (str): inferred inheritance of the variant, or None.
'''
zygosity = lambda x, y: x['variantCalls'][y]['zygosity']
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Convert lambda expressions to regular functions for improved maintainability.

Optimising for clarity and maintainability, let's convert these lambda expressions to proper function definitions:

-zygosity = lambda x, y: x['variantCalls'][y]['zygosity']
+def get_zygosity(variant, call_idx):
+    return variant['variantCalls'][call_idx]['zygosity']

-num_copies = lambda x, y, z: x['variantCalls'][y]['numberOfCopies'][z]['numberOfCopies']
+def get_num_copies(variant, call_idx, copy_idx):
+    return variant['variantCalls'][call_idx]['numberOfCopies'][copy_idx]['numberOfCopies']

Also applies to: 216-218

🧰 Tools
🪛 Ruff (0.8.2)

106-106: Do not assign a lambda expression, use a def

Rewrite zygosity as a def

(E731)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing required input in readme New box in summary field requested
3 participants