Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seurat v3 HVG method #1182

Closed
wants to merge 23 commits into from
Closed

Conversation

adamgayoso
Copy link
Member

Fixes #993.

This is an approximate implementation of the Seurat v3 hvg method. The only difference should be the use of lowess instead of loess (which is not available in python as far as I know).

This method takes the UMI counts as input. The way HVGs from batches are merged is also different from the other flavors. As such, I didn't see a straightforward way to integrate this in the existing HVG code.

@adamgayoso
Copy link
Member Author

Please let me know if you have any recommendations @ivirshup !

@ivirshup ivirshup self-requested a review April 23, 2020 02:23
Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

I've made some comments on the code about correctness. I'm still working on making sure I completely understand the code. I'd like for @gokceneraslan and @LuckyMD to take a look at this too, since they're more familiar with this.

Thanks for including that link to the overleaf page, that will be useful in understanding this. Do you think some of that could just go directly in the doc-string?

Could you add:

  • Tests
  • An example for us, showing that it works/ is useful
  • An short example of usage in the doc-string

scanpy/preprocessing/_highly_variable_genes.py Outdated Show resolved Hide resolved
scanpy/preprocessing/_highly_variable_genes.py Outdated Show resolved Hide resolved
scanpy/preprocessing/_highly_variable_genes.py Outdated Show resolved Hide resolved
scanpy/preprocessing/_highly_variable_genes.py Outdated Show resolved Hide resolved
scanpy/preprocessing/_highly_variable_genes.py Outdated Show resolved Hide resolved
adamgayoso and others added 4 commits April 22, 2020 22:03
Co-Authored-By: Isaac Virshup <ivirshup@gmail.com>
Co-Authored-By: Isaac Virshup <ivirshup@gmail.com>
@adamgayoso
Copy link
Member Author

adamgayoso commented Apr 23, 2020

Thanks for the review. I will add the other requested changes/additions soon.

So also the way I interpreted the way they clip values based on the Stuart '19 publication is different than what it appears they do in their code, which I linked in the overleaf doc in the docstring. My original interpretation is the first code in the linked issue.

@adamgayoso
Copy link
Member Author

adamgayoso commented Apr 23, 2020

@ivirshup I added a custom implementation of loess, which uses a degree 2 polynomial (closer to seurat implementation). It's a bit slow, I'm sure it can be sped up with numba, but it's having a problem with this line:

https://github.com/adamgayoso/scanpy/blob/b48d5729a075bc6f7fa0bbbf38a856d4c2e720ba/scanpy/preprocessing/_highly_variable_genes.py#L607

Any idea why?

Edit: I got it to work by not using clip, but it's still a bit slow. Please let me know if you have any ideas!

Edit 2: I found a stable python implementation of degree 2 loess

@adamgayoso
Copy link
Member Author

I'm also not sure why the test is failing -- it works interactively locally for me.

@adamgayoso
Copy link
Member Author

@ivirshup is it possible that Travis has cached pbmc3k and that's what's causing the error? I really don't have it running pytest locally either.

Also as far as the code review -- I understand code is duplicated, but this code does not really fit in the existing implementation because it works a bit differently and requires raw data. Let me know how you'd like to address this.

Thanks!

@adamgayoso
Copy link
Member Author

This got a bit messy -- travis is defeating me because the tests work locally. I'm going to close this and figure it out on my fork and make a new PR with a clean history.

@adamgayoso adamgayoso closed this May 6, 2020
@ivirshup
Copy link
Member

ivirshup commented May 6, 2020

Sorry for the lack of feedback! I got caught up in my own work.

  • Have you made any progress on the travis issue?
  • Is it an issue of getting different results from computing variance?

@adamgayoso
Copy link
Member Author

No worries - I really can't tell what the issue is since it works perfectly fine locally. My sense now is that

mean, var = _get_mean_var(adata[batch_info == b].X)
not_const = var > 0

is not working properly on travis because based on this method, if the regularized std is 1, which occurs when it's constant, the new variance would just be the same as in the variances column of the df, which is what's happening on travis.

@gokceneraslan
Copy link
Collaborator

Hey @adamgayoso ,

I really love this HVG method, but sometimes I get the following error from the loess fit:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-244-764977f87ce6> in <module>
----> 1 sc.pp.highly_variable_genes(ad_sub, n_top_genes=1500, flavor='seurat_v3', layer='counts')
      2 sc.pp.pca(ad_sub)
      3 sc.pp.neighbors(ad_sub)
      4 sc.tl.umap(ad_sub)
      5 sc.tl.leiden(ad_sub, resolution=2.0)

~/.miniconda3/lib/python3.8/site-packages/scanpy/preprocessing/_highly_variable_genes.py in highly_variable_genes(adata, layer, n_top_genes, min_disp, max_disp, min_mean, max_mean, span, n_bins, flavor, subset, inplace, batch_key)
    413 
    414     if flavor == 'seurat_v3':
--> 415         return _highly_variable_genes_seurat_v3(
    416             adata,
    417             layer=layer,

~/.miniconda3/lib/python3.8/site-packages/scanpy/preprocessing/_highly_variable_genes.py in _highly_variable_genes_seurat_v3(adata, layer, n_top_genes, batch_key, span, subset, inplace)
     82         x = np.log10(mean[not_const])
     83         model = loess(x, y, span=span, degree=2)
---> 84         model.fit()
     85         estimat_var[not_const] = model.outputs.fitted_values
     86         reg_std = np.sqrt(10 ** estimat_var)

_loess.pyx in _loess.loess.fit()

ValueError: b'reciprocal condition number  7.4971e-16\n'

This is due to the very low but non-zero variance genes, I think. It goes away when I run sc.pp.filter_genes(ad_sub, min_cells=5) but not when I run only sc.pp.filter_genes(ad_sub, min_cells=1).

Maybe we can make the variance check more stringent, or we can print a better error message for the users?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Seurat v3 VST highly variable gene method
3 participants