Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

linreg stratified docs #4458

Merged
merged 7 commits into from
Sep 30, 2018
Merged

linreg stratified docs #4458

merged 7 commits into from
Sep 30, 2018

Conversation

jigold
Copy link
Contributor

@jigold jigold commented Sep 26, 2018

Fixes #4251

@jbloom22 Is there anything else I should add? Maybe something about the relative performance of each approach? I also thought about using two separate linreg aggregator annotations, but that didn't seem better than the group_by approach.

Copy link
Contributor

@jbloom22 jbloom22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome sauce

... .or_missing()
>>> mt_linreg = hl.linear_regression(y = male_pheno, x = [1, mt_linreg.GT.n_alt_alleles()], root='linreg_male')

Approach #2: Use the :func:`.aggregators.linreg` and :func:`.aggregators.group_by` aggregators
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd reverse the order to match that of the code (group_by then linreg)


Approach #2: Use the :func:`.aggregators.linreg` and :func:`.aggregators.group_by` aggregators

>>> mt_linreg = mt.annotate_rows(linreg = hl.agg.group_by(mt.pheno.is_female,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move linreg to new line so the subsequent lines aren't so wide

variable. The first approach utilizes the :func:`.linear_regression` method and must be called
separately for each group even though it can compute statistics for multiple phenotypes
simultaneously. This is because the :func:`.linear_regression` method drops samples that have
more than one missing value across all phenotypes, such as when the groups are mutually
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"that have a missing value for any of the phenotypes; when the groups are mutually exclusive, such as 'Male' and 'Female', no samples remain!"

simultaneously. This is because the :func:`.linear_regression` method drops samples that have
more than one missing value across all phenotypes, such as when the groups are mutually
exclusive such as 'Male' and 'Female'. Note that the expressions for `female_pheno` and
`male_pheno` cannot be computed at the same time because they are inputs to two different
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get that you're trying to point out why mt_linreg is used in male_pheno rather than mt, i.e. why we couldn't just define male_pheno = ~female_pheno. How about:
"Note that we cannot define male_pheno = ~female_pheno because we subsequently need male_pheno to be an expression on the mt_linreg rather thanmt."

exclusive such as 'Male' and 'Female'. Note that the expressions for `female_pheno` and
`male_pheno` cannot be computed at the same time because they are inputs to two different
matrix tables. Lastly, the argument to `root` must be specified for both cases -- otherwise
the output for the 'Male' grouping will overwrite the 'Female' output.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the 'Male' output will overwrite the 'Female' output.

matrix tables. Lastly, the argument to `root` must be specified for both cases -- otherwise
the output for the 'Male' grouping will overwrite the 'Female' output.

The second approach uses the :func:`.aggregators.linreg` and :func:`.aggregators.group_by`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverse order again


The second approach uses the :func:`.aggregators.linreg` and :func:`.aggregators.group_by`
aggregators. The aggregation expression generates a dictionary where the keys are the grouping
variables and the values are the linear regression statistics for that group. The result of the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where a key is a group (value of the grouping variable) and the corresponding value is the linear regression statistics for those samples in the group.

aggregators. The aggregation expression generates a dictionary where the keys are the grouping
variables and the values are the linear regression statistics for that group. The result of the
aggregation expression is then used to annotate the matrix table.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'd note some pros of each. linear_regression is more efficient, especially when analyzing many phenotypes.
linreg aggregator is more flexible (multiple covariates can be vary by entry) and returns a richer set of statistics.


:**code**:

Approach #1: Use the :func:`.linear_regression` method for all phenotypes simulatenously
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in simultaneously

statistics. If the phenotypes being analyzed have different patterns of missingness, you should
**not** use the :func:`.linear_regression` method for all phenotypes simulatenously (Approach #1).
This is because the :func:`.linear_regression` method drops samples that have a missing value for
any of the phenotypes. Approach #2 will do two passes over the data while Approach #3 will do one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while Approaches #1 and #3 will

aggregator, especially when analyzing many phenotypes. However, the :func:`.aggregators.linreg`
aggregator is more flexible (multiple covariates can vary by entry) and returns a richer set of
statistics. If the phenotypes being analyzed have different patterns of missingness, you should
**not** use the :func:`.linear_regression` method for all phenotypes simulatenously (Approach #1).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is too strong a statement. In some cases, restricting to the common samples may be reasonable. Instead, just make users aware of the behavior that will result in Approach #1.

the matrix table.

The :func:`.linear_regression` method is more efficient than the :func:`.aggregators.linreg`
aggregator, but the :func:`.aggregators.linreg` aggregator is more flexible (multiple covariates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...aggregator and can be extended to multiple phenotypes, but...

Copy link
Contributor

@jbloom22 jbloom22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to fix this error

185     >>> mt_linreg = hl.linear_regression(y=mt.pheno.height, x=[1, mt.GT.n_alt_alleles()])
UNEXPECTED EXCEPTION: TypeError("linear_regression() missing 1 required positional argument: 'covariates'",)

@jbloom22
Copy link
Contributor

one more:

  File "/home/hail/.conda/envs/hail/lib/python3.6/doctest.py", line 1330, in __run
    compileflags, 1), test.globs)
  File "<doctest genetics.rst[21]>", line 1
    female_pheno = hl.case()
                           ^
SyntaxError: multiple statements found while compiling a single statement

@danking danking merged commit d22ebf7 into hail-is:master Sep 30, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants