Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Post-treatments and warning messages #4

Open
GENG126 opened this issue Feb 9, 2022 · 13 comments
Open

Post-treatments and warning messages #4

GENG126 opened this issue Feb 9, 2022 · 13 comments

Comments

@GENG126
Copy link

GENG126 commented Feb 9, 2022

"data" is the species abundance matrix

##"site" is the environmant covariance matrix
##HU is a factor in site

fit PLN model

PLNMODEL<- PLN(data~site$HU)
Initialization...
Adjusting a PLN model with full covariance model
Post-treatments...
DONE!
Warning message:
In dpois(y, mu, log = TRUE) : non-integer x = 1.214953

Follow is my questions:

  1. I am not sure whether the warning message is because of too limit data or too many zero in the data;
  2. The model can still run and I can get the result of the inference network, so I am curious whether this result is robust;
  3. I also want to know that if the result is not robust what can I do to infer species interaction network?
@Raphaellemomal
Copy link

I do not have an intuition for this warning but your questions are very relevant. Could you please provide:

  • the dimension of your data?
  • the summary of the vector of percentage of zeros for each species ?

Also, the robustness of your network is linked to the stability of the edges. Could you please provide the stability curve for the stability selection step ? (plot given by the EMtree::StaATS() function, you will have an example in the vignette here).

@GENG126
Copy link
Author

GENG126 commented Feb 14, 2022

I do not have an intuition for this warning but your questions are very relevant. Could you please provide:

  • the dimension of your data?
  • the summary of the vector of percentage of zeros for each species ?

Also, the robustness of your network is linked to the stability of the edges. Could you please provide the stability curve for the stability selection step ? (plot given by the EMtree::StaATS() function, you will have an example in the vignette here).

  1. data dimension
    **"data" **is the species abundance matrix,with 8 obs. of 32 variables, so we have 32 species data in 8 years;
    QUESTION The rows here are species abundance in different year (actually is dry season in 8 years, I want to see whether there is difference in species interaction in different season, and we have 8-year data), because we have different number of plots each year, so we add up to total abundance of one species each year. I don't know if this is feasible, or I need to change "years" to "plots"?
    data
> dim(data)
[1]  8 32

**"site" **is the environment covariance matrix, with 8 obs. of 8 variables, variables including disturbs such as human disturb, temperature etc.
site

> dim(site)
[1] 8 8
  1. zeros percentage for each species
    ···
    0 0 37.5 75 50 0 87.5 87.5 87.5 87.5 0 0 87.5 75 0 25 0 0 0 0 0 25 0 0 62.5 37.5 87.5 87.5 0 12.5 100 100
But if I exclude species which 0 occur more than 25%, the same warning as I mentioned before.

PLNMODEL<- PLN(data1~site$HU)
Initialization...
Adjusting a PLN model with full covariance model
Post-treatments...
DONE!
Warning message:
In dpois(y, mu, log = TRUE) : non-integer x = 1.214953


3. stability curve
I can't run the EMtree::StaATS() function, and in the example the function name is "StATS()"

library(EMtree)
EMtree::StaATS
error: 'StaATS' is not an exported object from 'namespace:EMtree'
stab_selection=StATS(ResampEmtreeFit$Pmat, nlambda=50, stab.thresh=0.9,plot=TRUE)
Error in StATS(ResampEmtreeFit$Pmat, nlambda = 50, stab.thresh = 0.9, :
No "StATS" function





@Raphaellemomal
Copy link

It is possible the warning is a result of your very few number of observations (rows).

  1. I would recommend you consider all plots as rows in your data (so change "years" to "plots" as you say), and add a "year" covariate as a factor in your "site" matrix. You will have to include the year covariate in the PLNmodel.
  2. You should not consider species with no counts (100% zeros). Even if this does not solve the issue, no counts at all is like having no data, they are unnecessary dimensions.
  3. Please update your version of EMtree, the StATS function is very helpful to set the appropriate threshold on your edges regarding the stability of your network.

@GENG126
Copy link
Author

GENG126 commented Feb 17, 2022

It is possible the warning is a result of your very few number of observations (rows).

  1. I would recommend you consider all plots as rows in your data (so change "years" to "plots" as you say), and add a "year" covariate as a factor in your "site" matrix. You will have to include the year covariate in the PLNmodel.
  2. You should not consider species with no counts (100% zeros). Even if this does not solve the issue, no counts at all is like having no data, they are unnecessary dimensions.
  3. Please update your version of EMtree, the StATS function is very helpful to set the appropriate threshold on your edges regarding the stability of your network.

Thank you Momal, I will try, thank you very much!

@GENG126 GENG126 closed this as completed Feb 27, 2022
@GENG126 GENG126 reopened this Feb 27, 2022
@GENG126
Copy link
Author

GENG126 commented Feb 27, 2022

Hi, Momal. I could not understand the threshold setting in "stability". My questions are:

freqs<-freq_selec(ResampEmtreeFit$Pmat,Pt=0.2)
a_first_idea_of_network<-1*(freqs>0.8)
  1. "Pt" here is the threshold came from "stability"? I am quite confused. For example, I need to run the function StATS(). set the stability threshold=0.9.
stab_selection=StATS(ResampEmtreeFit$Pmat, nlambda=50, stab.thresh=0.9,plot=TRUE)
stab_selection$lambda_opt
##0.031

Rplot01
So the Pt here I need to set as 0.031?
2. And I don't understand why "the optimal frequencies still need to be thresholded in order to obtain a network". If the network is stable according to the stability select, can I just use the visualizing function and do not select frequency?

weighted_net=ToSym(stab_selection$freqs_opt)
g<-draw_network(weighted_net,nodes_label = species.names, title="Weighted", pal_edges="dodgerblue3",layout="stress",shade = TRUE, btw_rank=3)$G

sorry, too much questions, I am the beginner of data analysis.
Looking forward to you reply.
Best wishes!

@Raphaellemomal
Copy link

Raphaellemomal commented Mar 3, 2022

Hi sorry for my late answer.
I will try to explain better here with this diagram. To obtain a network with the resampling version of EMtree, you need to set two thresholds: one on the edges conditional probabilities (output from the main method) and one on the frequencies.

image

But the problem is that we don't known how to set Pt a priori because they are probabilities scaled for a tree structure (the sum of all probabilities is the number of edges in a tree graph), they are not really edges probabilities of presence. This is where StATS() helps: it provides the optimal Pt, which is the value for which we have a good stability value across resamples, which means a good agreement between the resamples on the selected edges.

Additionally to providing a solution to the choice of Pt, you see here that we change of quantity of interest, going from conditional probabilities to well-understandable selection frequencies. This is important for interpretation : nos the weight on edges is actually a measure of the robustness of edges.

So after ResampleEMtree(), use StATS() to find the optimal Pt, and then you can either threshold the edge selection frequencies if you wish a binary network, or juste continue with the edges frequencies to work with a weighted network.

Is that clearer?

@GENG126
Copy link
Author

GENG126 commented Mar 4, 2022

Hi sorry for my late answer. I will try to explain better here with this diagram. To obtain a network with the resampling version of EMtree, you need to set two thresholds: one on the edges conditional probabilities (output from the main method) and one on the frequencies.

image

But the problem is that we don't known how to set Pt a priori because they are probabilities scaled for a tree structure (the sum of all probabilities is the number of edges in a tree graph), they are not really edges probabilities of presence. This is where StATS() helps: it provides the optimal Pt, which is the value for which we have a good stability value across resamples, which means a good agreement between the resamples on the selected edges.

Additionally to providing a solution to the choice of Pt, you see here that we change of quantity of interest, going from conditional probabilities to well-understandable selection frequencies. This is important for interpretation : nos the weight on edges is actually a measure of the robustness of edges.

So after ResampleEMtree(), use StATS() to find the optimal Pt, and then you can either threshold the edge selection frequencies if you wish a binary network, or juste continue with the edges frequencies to work with a weighted network.

Is that clearer?

Yes!!! Never understand a model so clear before!! Thank you so so so so so much Momal!!

@GENG126
Copy link
Author

GENG126 commented Mar 15, 2022

Hi, Momal:
I am here again! I still have questions in "PLN" step. Sorry, it is quite hard for me to handle this problem.
As you suggested first time, I need to consider "year" as a covariate and use "plots" as the rows of the matrix. So I change my mind, I construct network by each year-season (eg. 2013 dry season).

data<- as.matrix(read.csv("2013dry.csv"))
dim(data)
##23  10
site<- as_tibble(read.csv("2013dry.cor.csv"))
dim(site)
##23 2

data
![5c844d1c9dc4f01f2e860b9bbb4414e](https://user-images.githubusercontent.com/99301649/158330484-65bb728a-7496-4e
site
fb747f6ff01d946794167de33911363
77-9a6c-ad0065d98ef3.png)

#Fit PLN model
PLNMODEL<- PLN(data~site$HL)
warning:
1: In dpois(y, mu, log = TRUE) : non-integer x = 8.333333
2: In dpois(y, mu, log = TRUE) : non-integer x = 83.333333
3: In dpois(y, mu, log = TRUE) : non-integer x = 4.545455
4: In dpois(y, mu, log = TRUE) : non-integer x = 3.351955
5: In dpois(y, mu, log = TRUE) : non-integer x = 2.500000
6: In dpois(y, mu, log = TRUE) : non-integer x = 2.777778
7: In dpois(y, mu, log = TRUE) : non-integer x = 83.333333
8: In dpois(y, mu, log = TRUE) : non-integer x = 11.111111
9: In dpois(y, mu, log = TRUE) : non-integer x = 5.263158
10: In dpois(y, mu, log = TRUE) : non-integer x = 66.666667
11: In dpois(y, mu, log = TRUE) : non-integer x = 11.904762
12: In dpois(y, mu, log = TRUE) : non-integer x = 1.818182
13: In dpois(y, mu, log = TRUE) : non-integer x = 13.888889
14: In dpois(y, mu, log = TRUE) : non-integer x = 4.166667
stab_selection=StATS(ResampEmtreeFit$Pmat, nlambda=50, stab.thresh=0.9,plot=TRUE)

Rplot01
This is the result of stable selection.
And I can get the network:
Rplot1

Can I just ignore these warnings? I try each year-season data, on the one hand, I exclude species data with too many "0"; on the other hand, some year-season has many plots like 37. But all have the warnings.

Looking forward to your reply!
Best wishes.
Geng Ying.

@Raphaellemomal
Copy link

Hi, sorry again for the late answer.
I talked with Julien Chiquet about this (owner of PLNmodels), we consider it is fine for you to pursue with your model as it is.
I just have a small question: how is it that you have data dimension 23 x 10, and 25 species in your network ?

Best,
Raphaëlle

@GENG126
Copy link
Author

GENG126 commented Mar 25, 2022

Thank you so much for the reply! It help me to continue my work! The data dimension 23 x 10 is the species I selected more than 3 times in the matrix, and the network I showed is all the species including some species which may only occur once (I try whether there is difference with or without these species, and the PLN result is no). I think the network is more complete to show with 25 species to you, so I use this network. Sorry the inconsistency before and after confuses you.

@GENG126
Copy link
Author

GENG126 commented May 31, 2022

Hi Momal:
Sorry I still have questions about EMtree.

ResampEmtreeFit<-ResampleEMtree(counts=data, S=999, covar_matrix = site$HL, maxIter=25,cond.tol=1e-8, cores=1)

I set different number of maxlter and get different network interactions. So I am curious how to set a maxIter. I have 8-year data, I want to show network each year. And the species I have most is 21 in 2020. So I think maxIter is 21 in all networks, so that all (8) networks have the uniform standards. Do I understand correctly?
And I have no idea how to set cond.tol, is this influence my result?

Looking forward to your reply!

Best wishes!
Geng Ying.

@Rmomal
Copy link
Owner

Rmomal commented Jun 1, 2022 via email

@GENG126
Copy link
Author

GENG126 commented Jun 1, 2022

Hi Ying, - maxIter is the "Maximum number of EMtree iterations at each sub-sampling". It controls the number of iterations of the algorithm, so setting it to 21 can be too low and stopping the algorithm before it gets a chance to reach convergence. Il would be better to set this parameter to at least 50. - Your S parameter is very high, not that it is a problem but I think it is unnecessary. S is the "Total number of sub-samples.", and as you have 21 samples maximum setting S to 500 is enough. - cond.tol should be very low as it is a precision threshold for the computations. Default value might be too low as you have little number of samples, but I do not recommend to set it above 1e-6. Setting this higher creates a risk that the algorithm does not converge properly. You see that it is mostly about the algorithm's convergence, that you can check by running the simple EMtree() function with plot=TRUE (as a first check for maxiter, before ResampleEMtree() for example. Best, Raphaëlle Le mar. 31 mai 2022 à 04:34, GENG Ying @.***> a écrit :

Hi Momal:
Thank you very much for the reply!

Best wishes!
Geng.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants