Big Data Considerations #325
Current Solution: GEO cluster

The current solution, and the one I recommend, is running the entire pipeline on the GEO cluster. With ~2 TB of scratch space, we can read in large dataframes and fit the models. While this is not the most elegant solution, I think it will get the job done.

Spark and sparklyr

GEO does have native Spark and Hadoop installed, and R has the sparklyr package (https://spark.posit.co/), which handles reads and writes more intelligently in a distributed manner. This approach will likely take a bit more code and infrastructure development on our part, which is why I suggest we hold off for now. It is a future extension, and we can get additional support to help extend the pipeline in a truly more scalable fashion.
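If we do go the Spark route later, a minimal sparklyr sketch could look like the following; the connection settings and parquet path are illustrative assumptions, not the actual GEO configuration.

```r
# Minimal sketch of reading a large covariate table through sparklyr.
# The driver-memory setting and parquet path are hypothetical.
library(sparklyr)
library(dplyr)

conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "64G"

sc <- spark_connect(master = "local", config = conf)

# Lazily register a parquet export of the covariates
covars <- spark_read_parquet(
  sc,
  name = "covariates",
  path = "/scratch/beethoven/covariates/"   # hypothetical location
)

# Filters/aggregations are pushed down to Spark; collect() only small results
covars_2020 <- covars %>%
  filter(year == 2020) %>%
  collect()

spark_disconnect(sc)
```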
I also used to work with
@kyle-messier I agree on prioritizing the model and pipeline building now. Another consideration in dealing with large feature data: when the base models are fitted, I think that space-time cross-validation set generation should be based on space-time coordinates only, rather than on the full dataset. @eva0marques We are currently relying on
@sigmafelix When you say "When the base models are fitted, I think that space-time cross validation set generation should be based on space-time coordinates only, rather than using the full dataset," do you mean subsampled models or something else? We could also limit ourselves to purely spatial cross-validation as opposed to space-time CV.
@kyle-messier I meant we will use a compact three-variable
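A minimal sketch of what coordinate-only fold generation could look like; the column names, the data.table object, and the k-means grouping are illustrative assumptions, not the beethoven implementation.

```r
# Sketch: build space-time CV folds from a compact coordinate table only,
# then join the fold labels back to the full (3,800+ column) feature table.
# full_data and its column names are placeholders.
library(data.table)

coords <- unique(full_data[, .(site_id, lon, lat, time)])

# Scale coordinates so space and time contribute comparably, then cluster
xyz <- scale(cbind(coords$lon, coords$lat, as.numeric(coords$time)))
set.seed(2024)
coords[, fold := kmeans(xyz, centers = 10)$cluster]

# Only now touch the wide feature table, adding a single fold column
full_data <- merge(full_data, coords[, .(site_id, time, fold)],
                   by = c("site_id", "time"), all.x = TRUE)
```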
Thanks @sigmafelix. It looks like
@sigmafelix If we take a divide-and-conquer strategy to model fitting, we could likely use the par_grid family of functions you made in chopin. Is there a way to have some randomness or variation introduced if we want to make, say, 5 or 10 grid sets such that a given region has multiple submodels? The padding will help with that too.
@kyle-messier Is the randomness for varying the grid configuration across models, or for irregular grid generation? The latter is possible with mode="grid_quantile" in the current version. For the former, I think it requires a new function, since par_make_grid takes any sf/SpatVector input and uses its extent to generate grids.
@sigmafelix I was thinking of irregular grid generation, perhaps via k-means. But now that I think about it more, a spatial-block-CV approach would take care of it. For example, if the US domain were partitioned into 10 sections, then each section is fit and predicted on by 9 models. This is at the highest level, and I would not really consider it CV so much as partitioning to deal with computation. The spatial or temporal CV we develop will sit within each of these partitions. Let me know what you think about this approach.
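A rough base-R sketch of the k-means-style irregular partitioning idea; the site coordinates, number of partitions, and object names are placeholders, and this is not the chopin implementation.

```r
# Sketch: irregular spatial partitions via k-means on site coordinates,
# then leave-one-partition-out training sets so each site helps fit
# several partitioned models. Placeholder objects throughout.
set.seed(2024)
sites <- data.frame(site_id = seq_len(1058),
                    lon = runif(1058, -125, -67),
                    lat = runif(1058, 25, 49))

K <- 10
sites$partition <- kmeans(sites[, c("lon", "lat")], centers = K)$cluster

# Model m is fit on all partitions except m; every site is therefore used
# in fitting K - 1 of the K models.
training_sets <- lapply(seq_len(K), function(m) sites[sites$partition != m, ])
```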
Sorry @sigmafelix, that wouldn't do much for dealing with the RAM issues. We'd want the partitioned training sets to be ~50% of the size. The general idea was to ensure that every location still gets multiple partitioned models trained and validated on it. Perhaps it could be done with the regular grid, plus variations of grid_merge.
@kyle-messier I agree on generating irregular grids since the site locations are unequally distributed. We could make the grids partially overlapping so that many locations get multiple partitioned models; however, some will get a single model unless we set a large overlapping distance. If we only consider spatial partitioning, the sample size per grid will be substantially reduced, given that a single site carries 3.8K * 1.9K data elements.
@sigmafelix In terms of partitioning, do you think cutting our overall spatial sample size in half will allow us to run the model on GEO relatively easily?
@kyle-messier My calculation is that the full dataset will be ~30 GB (if precision is limited to 4 bytes) to ~60 GB (8 bytes), which is well below GEO's memory limit. Memory consumption of each model will be the key factor in the feasibility of model fitting. I will run some tests on my laptop and get an estimate.
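For reference, the back-of-the-envelope arithmetic behind those figures, using the observation and covariate counts quoted in this issue:

```r
# 1058 sensors * 365.2 days * 5 years = 1,931,908 observations; 3,844 covariates
n_obs    <- 1058 * 365.2 * 5   # 1,931,908 rows
n_covars <- 3844

n_obs * n_covars * 4 / 1e9     # ~29.7 GB at 4-byte (single) precision
n_obs * n_covars * 8 / 1e9     # ~59.4 GB at 8-byte (double) precision
```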
@sigmafelix OK, that back-of-the-envelope calculation is not as big as I thought. Nonetheless, yes, my thought was that something like
So it is a bit ad hoc, but it would add some non-stationarity to the model while also reducing the RAM burden (it may take longer, but that is okay). We can discuss more when @sigmafelix is back in a couple of weeks. @dzilber Do you know a way we can approximate the memory cost of the model-fitting process?
@kyle-messier One part of memory complexity is just a count of all the parameters of the model. The trickier part is keeping track of the auxiliary terms you need, like gradient or error vectors or Hessian matrices. The implementation can have a big effect on the memory cost of fitting the model. For example, you can keep track of all of the gradients for a neural network at each iteration, or you can save only the gradients from the last layer that was updated. Since we are using off-the-shelf packages, we might have to check their documentation or run some experiments and extrapolate.
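A rough sketch of the "run experiments and extrapolate" approach, here with xgboost; gc() only tracks R-heap allocations (not xgboost's C++ core) and x_mat / y are placeholder objects, so treat the result as a lower bound.

```r
# Sketch: fit xgboost on increasing subsample sizes, record peak R-heap
# memory since the last gc(reset = TRUE), and extrapolate to the full data.
library(xgboost)

sizes <- c(1e4, 5e4, 1e5, 2e5)

peak_mb <- sapply(sizes, function(n) {
  idx <- sample(nrow(x_mat), n)
  invisible(gc(reset = TRUE))
  dtrain <- xgb.DMatrix(data = x_mat[idx, ], label = y[idx])
  fit <- xgb.train(params = list(objective = "reg:squarederror"),
                   data = dtrain, nrounds = 50)
  sum(gc()[, 6])   # "max used (Mb)" across Ncells and Vcells
})

# Linear extrapolation to the full ~1.93M-observation dataset
predict(lm(peak_mb ~ sizes), newdata = data.frame(sizes = 1931908))
```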
@kyle-messier Got it. We could implement the approach in the pipeline. I think I missed something in point 2: is 5 randomly chosen from 10, or does it mean something else? We have three base learners (RF/XGB/ANN; maybe +GP = 4 base learners) and ten equal-size clusters, which left me a little confused about how to interpret the combination step.
Some links on data and model usage
If we take the bootstrap resampling strategy (but each bootstrap is M samples, where M << N), then each bootstrap can also be passed to the meta-learner. Whether we use the

```mermaid
graph TB;
  style P1 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
  style P2 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
  style P3 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
  style P4 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
  style P5 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
  P1[Model Input] --> |Fit P bootstrap samples| P2[MLP Models];
  P1[Model Input] --> |Fit P bootstrap samples| P3[xgboost Models];
  P1[Model Input] --> |Fit P bootstrap samples| P4[elastic net Models];
  P2 --> P5[Meta Learner];
  P3 --> P5;
  P4 --> P5;
```
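A small sketch of the M-out-of-N bootstrap draws feeding each base learner; P, M, the data object, and fit_one_learner() are placeholders.

```r
# Sketch: draw P bootstrap samples of size M << N and fit one base learner
# per draw. fit_one_learner() stands in for an MLP/xgboost/glmnet fit.
set.seed(2024)
N <- nrow(full_data)          # placeholder data object
M <- round(0.3 * N)           # each bootstrap is M samples, M << N
P <- 50                       # number of bootstrap draws per base learner

base_fits <- lapply(seq_len(P), function(p) {
  idx <- sample(N, M, replace = TRUE)
  fit_one_learner(full_data[idx, ])   # hypothetical fitting wrapper
})
```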
Hi all, @eva0marques @sigmafelix @mitchellmanware @dzilber @larapclark @MAKassien @Sanisha003. Following up on our discussion from today on a sampling strategy to alleviate memory pressure and runtime: @eva0marques, @sigmafelix, and I outlined a strategy that embraces the multiple-learners approach and fits within the S-T cross-validation strategies. The multiple models will also take advantage of the dynamic branching in the pipeline framework. Two pictures show our schematic; the second one has some of our random notes wiped away. In summary,
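If the dynamic branching here refers to a {targets}-style pipeline, a rough sketch of branching over bootstrap draws could look like this; the target names and helper functions are placeholders.

```r
# _targets.R sketch: one dynamic branch per bootstrap draw.
# make_bootstrap_sample() and fit_base_learner() are hypothetical helpers.
library(targets)

list(
  tar_target(boot_id, seq_len(50)),                  # P = 50 bootstrap draws
  tar_target(
    boot_fit,
    fit_base_learner(make_bootstrap_sample(boot_id)),
    pattern = map(boot_id),                          # dynamic branching
    iteration = "list"                               # keep each fit as a list element
  ),
  tar_target(meta_input, boot_fit)                   # aggregates all branches
)
```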
@kyle-messier @eva0marques @mitchellmanware Related to cross-validation strategies: I added a function extending the previous one. Some distant subclusters or "seeds" are grouped into the same cluster because I am using the mean space-time coordinates of each seed to pick the top 10 closest pairs. We could discuss this implementation further.

[Figure: 3D view of the space-time clusters, viewing angle = 40 (roughly from the southeastern edge)]
[Figure: same clusters, viewing angle = 215 (roughly from the southwestern edge)]
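Not the actual implementation, but a base-R sketch of the merging step as described above (cluster seeds, take each seed's mean space-time coordinates, and pick the top 10 closest seed pairs to merge); all object names and counts are illustrative.

```r
# Sketch of the described merging step. st is a placeholder data.frame
# with numeric columns x, y, t (projected coordinates and time index).
set.seed(2024)
st$seed <- kmeans(scale(st[, c("x", "y", "t")]), centers = 30)$cluster

# Mean space-time coordinates of each seed
centroids <- aggregate(st[, c("x", "y", "t")], by = list(seed = st$seed), FUN = mean)

# Pairwise distances between seed centroids; pick the 10 closest pairs to merge
d <- as.matrix(dist(scale(centroids[, c("x", "y", "t")])))
d[lower.tri(d, diag = TRUE)] <- Inf
closest_pairs <- arrayInd(order(d)[1:10], dim(d))   # each row: a pair of seed ids
```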
@kyle-messier So do we have an exact value of P for model fitting? As mentioned in my comments above, the spatiotemporal grouping with overlaps is implemented, so the other two ways need to be added to the package, and then we're good to go for the model-fitting phase.
@sigmafelix I think 50-100 would be sufficient for creating a powerful model, but it could take a lot of time, at least for the MLP models. If we have P = 100, that is 300 or more models across the base learners. We could also then essentially take a random-forest-style approach, use summaries of all the models as the final results, and bypass the need for a true meta-learner. What do you think?
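A quick sketch of the random-forest-style summary; pred_mat is an assumed n_predictions x n_models matrix of base-model predictions, not an object from the pipeline.

```r
# Sketch: bypass the meta-learner by summarizing predictions across all
# fitted models. pred_mat is a placeholder prediction matrix.
ens_mean  <- rowMeans(pred_mat)
ens_sd    <- apply(pred_mat, 1, sd)
ens_bands <- t(apply(pred_mat, 1, quantile, probs = c(0.025, 0.5, 0.975)))
```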
Summarizing model discussion from 7/8/24

@dzilber @eva0marques @sigmafelix @mitchellmanware @larapclark @Sanisha003 @MAKassien

Pipeline

After some discussion, the pipeline that makes the most sense is as follows:
Each
xgboost vs lightGBM
https://github.com/microsoft/LightGBM Something to think about: whether it is worth it or would make a difference compared to xgboost.
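For a quick comparison, both expose similar R interfaces; a minimal side-by-side sketch (placeholder x_mat / y objects, roughly default parameters, CPU only):

```r
# Minimal sketch of xgboost vs. lightgbm regression fits on the same data.
library(xgboost)
library(lightgbm)

dtrain_xgb <- xgb.DMatrix(data = x_mat, label = y)
fit_xgb <- xgb.train(params = list(objective = "reg:squarederror", max_depth = 6),
                     data = dtrain_xgb, nrounds = 200)

dtrain_lgb <- lgb.Dataset(data = x_mat, label = y)
fit_lgb <- lgb.train(params = list(objective = "regression", num_leaves = 63),
                     data = dtrain_lgb, nrounds = 200)
```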
@kyle-messier Thank you for the summary of today's discussion. It helps me a lot to be clear about the design. If we pick random values for some hyperparameters to fit
Briefly navigating the lightGBM R package, it seems that we have to build our own binary for GPU usage. I will try to build the package from source and contact OSC if it turns out to be too complex or difficult for me to build.
Model Discussion Addendum 7/8/24

@sigmafelix @mitchellmanware @dzilber @eva0marques @MAKassien @Sanisha003 @larapclark

Revisiting a meta-learner

Perhaps I was hasty to abandon the idea of a true meta-learner. There are a couple of easy options that should be scalable and provide UQ through another round of subsampling on the base learners. In each case, say we have
Example pipeline with elastic-net meta-learner

I've removed the branches for hyperparameter tuning or fixing of the base learners for simplicity.

```mermaid
graph TD
  A1[1. M_i is a 30% random sample of N] --> B1[2. M_i gets assigned to 1 of 3 categories with equal probability, Spatial, Temporal, or SpaceTime]
  B1 --> C1[3. M_i is fit with a MLP model]
  A2[1. M_i is a 30% random sample of N] --> B2[2. M_i gets assigned to 1 of 3 categories with equal probability, Spatial, Temporal, or SpaceTime]
  B2 --> C2[3. M_i is fit with XGBoost model]
  A3[1. M_i is a 30% random sample of N] --> B3[2. M_i gets assigned to 1 of 3 categories with equal probability, Spatial, Temporal, or SpaceTime]
  B3 --> C3[3. M_i is fit with glmnet model]
  C1 --> D1[Elastic-Net Meta-Learner]
  C2 --> D1[Elastic-Net Meta-Learner]
  C3 --> D1[Elastic-Net Meta-Learner]
  D1 --> E1[Perform 50% column-wise subsampling K times]
  E1 --> M1[Elastic-Net Model 1]
  E1 --> M2[Elastic-Net Model 2]
  E1 --> M3[Elastic-Net Model K-1]
  E1 --> M4[Elastic-Net Model K]
  M1 --> P1[Complete Posterior Summary]
  M2 --> P1
  M3 --> P1
  M4 --> P1
```
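A sketch of the elastic-net meta-learner with column-wise subsampling for UQ; Z and Z_new (matrices of base-learner predictions), y, and the alpha choice are assumptions for illustration.

```r
# Sketch: elastic-net meta-learner stacked on base-learner predictions,
# with K rounds of 50% column-wise subsampling for uncertainty.
# Z (n x P training prediction matrix), Z_new, and y are placeholders.
library(glmnet)

K <- 25
meta_fits <- lapply(seq_len(K), function(k) {
  cols <- sample(ncol(Z), size = floor(0.5 * ncol(Z)))
  fit  <- cv.glmnet(Z[, cols], y, alpha = 0.5)       # elastic net
  list(cols = cols, fit = fit)
})

# Predictions from each subsampled meta-learner, summarized row-wise
meta_pred <- sapply(meta_fits, function(m)
  as.numeric(predict(m$fit, newx = Z_new[, m$cols], s = "lambda.min")))
meta_mean <- rowMeans(meta_pred)
meta_ci   <- t(apply(meta_pred, 1, quantile, probs = c(0.025, 0.975)))
```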
Relevant Paper

@sigmafelix @eva0marques @MAKassien @dzilber https://arxiv.org/abs/2106.03253

For tabular data, boosting is often all you need, at least compared to various deep learning models. However, an ensemble of deep learning and boosting usually outperforms boosting alone, so I think we are going down the right path with our ensemble. @eva0marques I think you were not there for this part of the TEP meeting, but one of the attendees brought up the issue of overcomplicating things. I think that would be a great secondary project for a postbac or something down the line: see how many of the covariates or base learners we can remove and still match the predictive performance of the large ensemble.
What do you mean by overcomplicating things? Too many covariates? Yes, that would be an interesting project: do dimension reduction and select the most relevant / non-collinear ones. I'm all for it; recipes with few but quality ingredients are always the best.
Yes, fewer covariates and 1 or fewer models.
Some discussion of big data considerations for the beethoven pipeline. As @eva0marques, @sigmafelix, and others have pointed out, the problem is that we have 1058 sensors * 365.2 days * 5 years = 1,931,908 observations with 3,844 covariates, and I unsurprisingly get the error message "Error: vector memory exhausted (limit reached?)".