
Big Data Considerations #325

Open

kyle-messier opened this issue Apr 3, 2024 · 32 comments

Comments

@kyle-messier
Collaborator

kyle-messier commented Apr 3, 2024

Some discussion of big data considerations for the beethoven pipeline. As @eva0marques, @sigmafelix, and others have pointed out: "The problem: we have 1058 sensors * 365.2 days * 5 years = 1931908 observations with 3844 covariates. I unsurprisingly get an error message 'Error: vector memory exhausted (limit reached?)'"

@kyle-messier kyle-messier converted this from a draft issue Apr 3, 2024
@kyle-messier
Collaborator Author

Current Solution: GEO cluster

The current solution, and the one I recommend, is running the entire pipeline on the GEO cluster. With ~2 TB of scratch space, we can read in the large data frames and fit the models. While this is not the most elegant solution, I think it will get the job done.

Spark and sparklyr

GEO does have native Spark and Hadoop installed, and R has the sparklyr package (https://spark.posit.co/), which handles reads and writes intelligently in a distributed manner.

This approach will likely take a bit more code and infrastructure development on our part, which is why I suggest we hold off for now. It is a future extension, and we can get additional support to help extend the pipeline in a truly more scalable fashion.
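For when we do pick sparklyr up, a minimal sketch of the idea (the connection settings and file paths here are assumptions, not GEO's actual configuration):

```r
library(sparklyr)
library(dplyr)

conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "64G"   # assumed; tune to GEO's limits
sc <- spark_connect(master = "local", config = conf)

# Lazily reference the on-disk covariates; nothing is pulled into R yet
covars <- spark_read_parquet(sc, name = "covariates",
                             path = "/scratch/beethoven/covariates.parquet")  # assumed path

covars |>
  count() |>     # executed by Spark
  collect()      # only the small summary returns to R

spark_disconnect(sc)
```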

@eva0marques
Collaborator

I also used to work with the foreach package for core dispatching. It may be useful for parallelizing over the cross-validation sets.
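For reference, a minimal foreach/doParallel sketch of that idea; `fit_fold()` is a hypothetical helper that fits and scores one CV fold:

```r
library(foreach)
library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

cv_metrics <- foreach(fold = 1:10, .combine = rbind) %dopar% {
  fit_fold(fold)   # hypothetical; returns a one-row data.frame of metrics
}

stopCluster(cl)
```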

@sigmafelix
Collaborator

@kyle-messier I agree on prioritizing the model and pipeline building now. Another consideration in dealing with large feature data: when the base models are fitted, I think the space-time cross-validation set generation should be based on the space-time coordinates only, rather than on the full dataset.

@eva0marques We are currently relying on the future framework for parallelization since it has a variety of branched-out packages for automatic SLURM job submission. I think using foreach is a convenient and straightforward way to parallelize heavy workloads. We can discuss parallelization backends throughout the pipeline soon.
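A small sketch of that future-based setup, assuming a SLURM template file and resource names that would need to match the cluster's configuration; `fit_fold()` is again a hypothetical helper:

```r
library(future)
library(future.apply)
library(future.batchtools)

# Each future becomes a SLURM job; template and resources are assumptions
plan(batchtools_slurm, template = "slurm.tmpl",
     resources = list(ncpus = 4, memory = "16G", walltime = 3600))

fits <- future_lapply(1:10, function(fold) fit_fold(fold))  # hypothetical fit_fold()
```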

@kyle-messier
Collaborator Author

@sigmafelix When you say "When the base models are fitted, I think that space-time cross validation set generation should be based on space-time coordinates only, rather than using the full dataset." Do you mean subsampled models or something else? We could also limit ourselves to purely spatial cross-validation as opposed to space-time CV.

@sigmafelix
Collaborator

@kyle-messier I meant we will use a compact three-variable data.frame, sf, or SpatVector (I believe sf is what we will use, as spatialsample uses sf objects) to get an rset object. I think the full 2M × 3.8K sf object is too big to handle even on GEO. The rset created by rsample::vfold_cv is a nested tibble, which would end up creating a massive tibble or sf with 20M effective rows. Instead, we can use a reduced sf object to create an rset object containing the row indices for each training-test split, then read the necessary rows out of the full data for model fitting.
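A minimal sketch of this index-only approach, assuming `coords_df` holds one row per observation (lon/lat/time) and `big_tbl` is the full disk-backed feature table:

```r
library(sf)
library(rsample)

coords_sf <- st_as_sf(coords_df, coords = c("lon", "lat"), crs = 4326)
folds     <- vfold_cv(coords_sf, v = 10)

split1    <- folds$splits[[1]]
train_idx <- split1$in_id          # analysis-set row indices
test_idx  <- complement(split1)    # assessment-set row indices

train_x <- big_tbl[train_idx, ]    # read only the rows needed for this fit
```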

@kyle-messier
Collaborator Author

Thanks @sigmafelix. It looks like rset is basically the resampling container (e.g., bootstraps) within the rsample package. In the end, if getting the 20M+ rows plus a large P into GEO to model is the main computational bottleneck (i.e., RAM), then I'm not opposed to splitting the base-learner-to-meta-learner process into smaller, overlapping regions. The spatialsample package could be helpful in that respect. Taking the ensemble approach we are adopting even further, we could fit, say, 10-50 spatial subsamples that each cover about 50 percent of the spatial domain. These are still large models that should be relatively stable, but will hopefully bring us within the RAM capacity of GEO.
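A small sketch of what spatialsample could give us here, assuming `sites_sf` is an sf object of the monitor locations; a union of a few clusters approximates a ~50%-of-domain subsample:

```r
library(spatialsample)
library(rsample)

set.seed(2024)
clusters <- spatial_clustering_cv(sites_sf, v = 10)   # 10 spatially clustered folds

# One ~50% spatial subsample: the rows falling in 5 of the 10 clusters
half_idx <- unlist(lapply(clusters$splits[1:5], complement))
```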

@kyle-messier
Collaborator Author

@sigmafelix If we take a divide-and-conquer strategy to model fitting, we could likely use the par_grid family of functions you made in chopin. Is there a way to introduce some randomness or variation if we want to make, say, 5 or 10 grid sets such that a given region has multiple submodels? The padding will help with that too.

@sigmafelix
Collaborator

sigmafelix commented Apr 7, 2024 via email

@kyle-messier
Collaborator Author

@sigmafelix I was thinking of irregular grid generation - perhaps something like k-means. But now that I think about it more, a spatial-block-CV approach would take care of it. For example, if the US domain were partitioned into 10 sections, then each section is fit and predicted on by 9 models. This CV is at the highest level, and I would not really consider it CV, but more a partitioning to deal with computation. The spatial or temporal CV we develop will be within each of these. Let me know what you think about this approach.

@kyle-messier
Collaborator Author

Sorry @sigmafelix - that wouldn't do much for dealing with the RAM computational issues. We'd want the partitioned training sets to be ~50% of the size. The general idea was to ensure that every location still gets multiple partitioned models trained and validated on it. Perhaps it could be done with the regular grid, plus variations of the grid_merge.

@sigmafelix
Collaborator

@kyle-messier I agree on generating irregular grids since the site locations are unequally distributed. We could make the grids partially overlapping so that many locations get multiple partitioned models; however, some will get only a single model unless we set a large overlapping distance. If we only consider spatial partitioning, the sample size per grid will be substantially reduced, given that each site carries roughly 3.8K × 1.9K data elements.

@kyle-messier
Collaborator Author

@sigmafelix in chopin, does your par_cut_coords function have a randomness component to it? Looking at the code, it does not appear so; running it on the same group of coordinates will always produce the same set.
There is a package, anticlust (https://cran.r-project.org/web/packages/anticlust/index.html), that can create roughly equal-size clustering groups. We could potentially calculate, say, 10-50 partitions into 2 groups (i.e., 50% of the data each) and pick the top 5 or 10 that ensure all of the AQS points appear in, say, 4/5 or 9/10 of the groups.

In terms of partitioning, do you think cutting our overall spatial sample size in half will allow us to run the model on GEO relatively easily?
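One crude way to introduce that randomness is to jitter the coordinates before each call to the deterministic anticlust::balanced_clustering(); a sketch, assuming `site_coords` is a matrix of projected x/y coordinates in meters:

```r
library(anticlust)

set.seed(325)
partitions <- lapply(1:10, function(i) {
  # ~1 km random jitter (assumes a projected CRS in meters)
  jittered <- site_coords + rnorm(length(site_coords), sd = 1000)
  balanced_clustering(jittered, K = 2)   # two equal-size groups, i.e. a 50/50 split
})
```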

@sigmafelix
Collaborator

@kyle-messier anticlust seems to operate conversely to k-means. Do we want to maximize between-group similarity?
For balancing the group sizes, I think we can consider making a "seed" set of clusters from k-means or k-medoids, then rotating the membership along the boundary of two adjacent clusters.

My calculation is that the full dataset will be ~30 GB (if precision is limited to 4 bytes) to ~60 GB (8 bytes), which is well below GEO's memory limit. The memory consumption of each model will be the key factor in the feasibility of model fitting. I will run some tests on my laptop and then get an estimate.
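For reference, the arithmetic behind those figures:

```r
n_obs  <- 1058 * 365.2 * 5    # ~1.93M observations
n_covs <- 3844                # covariates
n_obs * n_covs * 4 / 1e9      # ~30 GB at 4-byte precision
n_obs * n_covs * 8 / 1e9      # ~59 GB at 8-byte (double) precision
```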

@kyle-messier
Collaborator Author

@sigmafelix OK, that back-of-the-envelope calculation is not as big as I thought. Nonetheless, my thought was that something like anticlust would make it easy to create 10 sets of roughly equal partitions of the whole US dataset into 33% or 50% of the total size. That would ensure that every point is still included an equal number of times in the high-level ensemble. Perhaps just doing some high-level k-means/k-medoids as you suggest is easier and good enough.

@kyle-messier
Collaborator Author

anticlust::balanced_clustering creates perfectly even clusters. As you mentioned, it maximizes between-cluster similarity. It is also deterministic, as it always starts with the center cluster. So an easy way to improve computational efficiency with a divide-and-conquer approach would be to:

  1. Create K = 10 equal clusters
  2. Create 5 choose 3 (or maybe 2) models = 10 submodels; each group will then be in 6 models (4 choose 2)
  3. Run each of these partitioned submodels through the base-learner-to-meta-learner process
  4. Combine everything

This is a bit ad hoc, but it would add some non-stationarity to the model while also reducing the RAM burden (it may take longer, but that is okay); see the combinatorics sketch below. We can discuss more when @sigmafelix is back in a couple of weeks.
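A quick illustration of the "choose" arithmetic in step 2, assuming 5 groups with 3 groups per submodel; the exact counts are still open:

```r
groups  <- 1:5
members <- combn(groups, 3, simplify = FALSE)             # cluster groups feeding each submodel
length(members)                                           # 10 submodels
sum(vapply(members, function(g) 1 %in% g, logical(1)))    # each group appears in 6 of them
```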

@dzilber Do you know a way we can approximate the memory/storage cost of the model-fitting process?

@dzilber
Collaborator

dzilber commented Apr 11, 2024

@kyle-messier One part of memory complexity is just a count of all the parameters of the model. The trickier part is keeping track of the auxiliary terms you need, like gradient or error vectors or Hessian matrices. The implementation can have a big effect on the memory cost of fitting the model. For example, you can keep track of all of the gradients for a neural network with each iteration, or you can just save the gradients from the last layer that was updated. Since we are using off the shelf packages, we might have to check their documentation or run some experiments and extrapolate.

@sigmafelix
Collaborator

sigmafelix commented Apr 12, 2024 via email

@kyle-messier
Collaborator Author

@kyle-messier kyle-messier moved this from 🆕 New to 🏗 In progress in beethoven Jun 10, 2024
@sigmafelix
Collaborator

  1. Test run with the entire dataset on GEO: assess memory pressure, etc.
  2. Determine a data splitting strategy
  3. Combine-and-ensemble

@kyle-messier
Collaborator Author

  1. Try fitting the base learners on the full dataset (brulee MLP, xgboost, elastic net)
  2. If needed, discuss a partitioning strategy

@kyle-messier
Collaborator Author

If we take the bootstrap resampling strategy (but each bootstrap is M samples where M << N), then each bootstrap can also be passed to the meta-learner. Whether we use the stacks package directly or not, a similar strategy could be employed (see the sketch after the diagram):
https://stacks.tidymodels.org/index.html

```mermaid
graph TB;

    style P1 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
    style P2 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
    style P3 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
    style P4 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
    style P5 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;

    P1[Model Input] --> |Fit P bootstrap samples| P2[MLP Models];
    P1 --> |Fit P bootstrap samples| P3[xgboost Models];
    P1 --> |Fit P bootstrap samples| P4[elastic net Models];
    P2 --> P5[Meta Learner];
    P3 --> P5;
    P4 --> P5;
```
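If we do go through stacks, the workflow would look roughly like this sketch; `mlp_res`, `xgb_res`, and `enet_res` are assumed tuning results saved with `save_pred = TRUE` and `save_workflow = TRUE` in the control options:

```r
library(stacks)

meta <- stacks() |>
  add_candidates(mlp_res) |>
  add_candidates(xgb_res) |>
  add_candidates(enet_res) |>
  blend_predictions() |>   # regularized (elastic-net) combination of candidates
  fit_members()

predict(meta, new_data = new_df)   # new_df assumed
```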

@kyle-messier
Collaborator Author

[IMG-0484 and IMG-0485: photos of the sampling-strategy schematic]

Hi All, @eva0marques @sigmafelix @mitchellmanware @dzilber @larapclark @MAKassien @Sanisha003

Following up on our discussion from today on a sampling strategy to alleviate memory pressure and runtime: @eva0marques, @sigmafelix, and I outlined a strategy that embraces the multiple-learners approach and fits with the S-T cross-validation strategies. The multiple models will also take advantage of dynamic branching in targets.

The two pictures show our schematic - the second one with some of our random notes wiped away. In summary:

  1. P row bootstrap subsamples are taken for each of the base learners
  2. The bootstrap size M is fixed at 30% of N, so ~300K S/T samples per subsample
  3. P x 3 base learner outputs
  4. Each bootstrap model employs a space-time cross-validation strategy
  5. We think ~3 different S-T CV strategies would be good, so a given model has a 1/3 chance of getting a given CV strategy (see the sketch below)
  6. The 3 strategies have different focuses: 1 spatial, 1 temporal (e.g. season, month), and 1 edge case such as a quantile of imperviousness or rural/urban
  7. The meta-learner is essentially just another bootstrap resample of the base learners
  8. The meta-learner prediction is the prediction of Q bootstrap-sampled learners (i.e. posterior predictions can easily provide mean, variance, or percentiles)
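A rough sketch of items 1, 2, 4, and 5, assuming `n_obs` holds the total number of space-time observations:

```r
set.seed(325)
P <- 50                                  # number of bootstrap subsamples per learner
cv_strategies <- c("spatial", "temporal", "edge_case")

subsamples <- lapply(seq_len(P), function(i) {
  list(
    rows = sample(n_obs, size = round(0.3 * n_obs)),   # ~30% row subsample
    cv   = sample(cv_strategies, 1)                    # 1/3 chance per CV strategy
  )
})
```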

@sigmafelix
Collaborator

sigmafelix commented Jun 13, 2024

@kyle-messier @eva0marques @mitchellmanware

Related to cross-validation strategies: I added a function extending the previous generate_cv_index approach by employing anticlust::balanced_clustering and various preprocessing options (raw, normalization, and standardization) to get partially overlapping space-time groups. The result with the standardization option on the full dataset looks like the figures below.

Some distant subclusters, or "seeds," are grouped into the same cluster because I am using the mean space-time coordinates of each seed to pick the top 10 closest pairs. We could discuss this implementation further.

[Figure: space-time clusters, viewing angle = 40 (roughly from the southeastern edge)]

[Figure: space-time clusters, viewing angle = 215 (roughly from the southwestern edge)]

@sigmafelix
Collaborator

@kyle-messier So do we have an exact P value for model fitting? As mentioned in my comments above, the spatiotemporal grouping with overlaps is implemented, so the other two ways need to be added to the package and then we're good to go for the model fitting phase.

@kyle-messier
Collaborator Author

@sigmafelix I think 50-100 would be sufficient for creating a powerful model, but it could take a lot of time, at least for the MLP models. If we have P = 100, that is on the order of 300 models across the three base learners. We could then essentially take a random-forest-style approach and use summaries of all the models as the final results, bypassing the need for a true meta-learner. What do you think?
@dzilber @eva0marques @mitchellmanware @MAKassien

@kyle-messier
Collaborator Author

Summarizing model discussion from 7/8/24

@dzilber @eva0marques @sigmafelix @mitchellmanware @larapclark @Sanisha003 @MAKassien

Pipeline

After some discussion, the pipeline that makes the most sense is as follows (I'm changing some terminology here to be more consistent with the literature). Given:

  • Total sample size, $N$, ~900,000
  • Covariate size, $P$, ~10,000
  1. Create $M$ subsamples at random, each containing 30% of $N$
  2. Each subsample $M_i$ is fit with a base learner from $\{MLP, XGBOOST, GLMNET\}$
  3. Each $M_i$ is fit according to an S-T cross-validation strategy from {Spatial, Temporal, SpatioTemporal}
  4. The result of model $M_i$ is then used to predict on all $N$
  5. Meta-learner or summarization

Each $M_i$ gets its own CV strategy, or all 3 CV strategies

@sigmafelix has implemented a strategy in which each $M_i$ gets fitted with all 3 CV strategies ($M_{ij}$, where $j$ indexes spatial, temporal, and ST).
My initial thought was to randomly assign each $M_i$ to one of the 3 CV strategies with equal probability.

Personally, I don't expect a difference in the results, only in the pipeline implementation.

Size of M

This would be difficult or impossible to tune. We want $M$ to be sufficiently large, but not so large that we overdo it unnecessarily. I think $M=100$ is a good starting point. If we focus on tuning every $M_i$, then perhaps we go fewer, like 10-20. If we fix the tuning parameters and are less strict about fitting each $M_i$, then I think 100-500 is reasonable.

Hyperparameter tuning for each $M_i$

Each model type has a number of parameters (hyperparameters) to tune. We can spend time tuning each one, or focus on only one and fix the others. This strategy works well for a large $M$, since each model will take drastically less time to fit.
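As an illustration of fixing most hyperparameters and tuning only one, a sketch of an xgboost base-learner spec in parsnip; the fixed values are placeholders, not recommendations:

```r
library(parsnip)
library(tune)

# Fix mtry/trees/tree_depth (placeholder values) and tune only the learning rate
xgb_spec <- boost_tree(
  mtry = 50, trees = 1000, tree_depth = 6,
  learn_rate = tune()
) |>
  set_engine("xgboost") |>
  set_mode("regression")
```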

Meta-learner or straight ensemble summary

Given that $M$ is going to be sufficiently large regardless, we could just take summaries (mean, variance, etc.) of all of the base learners as the final predictions - in the same spirit as random forest.

We could also feed the $M$ models into a meta-learner such as another $GLMNET$, $GAM$, or $BART$ model. Personally, if we go this route I think a simple $GLMNET$ is more than sufficient. One benefit of a meta-learner is that we could add a few random spatial fields (smoothly varying), which could help spatially adjust the predictions. The disadvantages are the additional model training and the question of how to create meta-learner variance predictions. For the RF-style ensemble, we simply take $mean(M)$ and $var(M)$. With a $GLMNET$, which reduces everything to 1 model, a variance prediction formula would need to be developed. For these reasons, I'm leaning towards the RF-style ensemble (we need a name for that; I'm sure there is one).
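A minimal sketch of the RF-style summary, assuming `preds` is a matrix of base-learner predictions with one column per learner:

```r
# `preds`: (n_predictions x M) matrix of base-learner predictions (assumed)
ens_mean <- rowMeans(preds)          # ensemble mean prediction
ens_var  <- apply(preds, 1, var)     # between-learner variance as a UQ proxy
```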

Example pipeline embracing large $M$ with fixed hyperparameters and ensemble summaries as final result

```mermaid
graph TD
    A1[1. M_i is a 30% random sample of N] --> B1[2. M_i gets assigned to 1 of 3 categories with equal probability, Spatial, Temporal, or SpaceTime]
    B1 --> C1[3. M_i is fit with a MLP model]
    C1 --> D1[4. Randomly pick tuning hyperparameters]
    D1 --> E1[4a. Randomly fix activation to leaky-relu or relu]
    D1 --> F1[4b. Randomly fix drop out rate]
    D1 --> G1[4c. Randomly fix depth and width of hidden layers]
    G1 --> H1[5. M_i is fit and tuned for learning rate]
    F1 --> H1
    E1 --> H1
    D1 --> H1

    A2[1. M_i is a 30% random sample of N] --> B2[2. M_i gets assigned to 1 of 3 categories with equal probability, Spatial, Temporal, or SpaceTime]
    B2 --> C2[3. M_i is fit with XGBoost model]
    C2 --> D2[4. Pick fixed hyperparameters]
    D2 --> E2[4a. mtry]
    D2 --> F2[4b. trees]
    D2 --> G2[4c. tree depth]
    G2 --> H2[5. M_i is fit and tuned for learning rate]
    E2 --> H2
    F2 --> H2

    A3[1. M_i is a 30% random sample of N] --> B3[2. M_i gets assigned to 1 of 3 categories with equal probability, Spatial, Temporal, or SpaceTime]
    B3 --> C3[3. M_i is fit with glmnet model]
    C3 --> D3[4. Pick fixed hyperparameters]
    D3 --> E3[4a. alpha - proportion of lasso/ridge]
    E3 --> H3[5. M_i is tuned for lambda]

    H1 --> I[6. Mean is mean of all M across all three methods]
    H2 --> I
    H3 --> I
    I --> J[7. Variance is var of all M across all three methods]
```

@kyle-messier
Collaborator Author

xgboost vs lightGBM

parsnip can fit multiple implementations of gradient-boosted machines, including xgboost and lightGBM. lightGBM seems to be optimized for large datasets, with lower memory usage and better use of the GPU.

https://github.com/microsoft/LightGBM

Something to think about: whether it is worth it or would make a difference compared to xgboost.
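For what it's worth, parsnip reaches lightGBM through the bonsai extension package; a minimal sketch (hyperparameter values are placeholders):

```r
library(parsnip)
library(bonsai)   # provides the "lightgbm" engine for boost_tree()

lgb_spec <- boost_tree(trees = 1000, learn_rate = tune::tune()) |>
  set_engine("lightgbm") |>
  set_mode("regression")
```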

@sigmafelix
Collaborator

@kyle-messier Thank you for the summary of today's discussion. It helps me a lot to be clear about the design. If we pick random values for some hyperparameters to fit $M \approx 500$ models, I think there is very little point in tuning the other hyperparameters sequentially. If we do want to tune hyperparameters to find the best combination, and steps 4a-4c are done simultaneously, we could take different values of the learning rates/lambdas and perform a random search for each to find the best model. I think the best model's hyperparameter combination would differ fairly substantially between CV strategies but mostly agree across models with the same CV strategy.

Briefly looking through the lightGBM R package, it seems we have to build our own binary for GPU usage. I will try to build the package from source and contact OSC if it is too complex or difficult to build.

@kyle-messier
Collaborator Author

Model Discussion Addendum 7/8/24

@sigmafelix @mitchellmanware @dzilber @eva0marques @MAKassien @Sanisha003 @larapclark

Revisiting a Meta-Learner

Perhaps I was hasty to abandon the idea of a true meta-learner. There are a couple of easy options that should be scalable and provide UQ through another round of subsampling on the base learners.

In each case, say we have $M=500$ total base learners from the subsamples. We can take column-wise (covariate) subsamples of, say, 50%, $K$ times to get new meta-learner input sets. $\binom{500}{250} \approx 10^{149}$, so it is safe to say there are plenty of meta-learner input sets to create.

  1. An ensemble of boosted trees or RF. Advantage: nonlinear. Disadvantage: computationally intensive, but still tractable of course.
  2. An ensemble of elastic nets. Advantage: fast, scalable, somewhat interpretable. Disadvantage: linear in each model.

Example pipeline with elastic-net meta-learner

I've removed the branches for hyperparameter tuning/fixing of the base learners for simplicity; a code sketch of the column-wise subsampled meta-learner follows the diagram.

```mermaid
graph TD
    A1[1. M_i is a 30% random sample of N] --> B1[2. M_i gets assigned to 1 of 3 categories with equal probability, Spatial, Temporal, or SpaceTime]
    B1 --> C1[3. M_i is fit with a MLP model]

    A2[1. M_i is a 30% random sample of N] --> B2[2. M_i gets assigned to 1 of 3 categories with equal probability, Spatial, Temporal, or SpaceTime]
    B2 --> C2[3. M_i is fit with XGBoost model]

    A3[1. M_i is a 30% random sample of N] --> B3[2. M_i gets assigned to 1 of 3 categories with equal probability, Spatial, Temporal, or SpaceTime]
    B3 --> C3[3. M_i is fit with glmnet model]

    C1 --> D1[Elastic-Net Meta-Learner]
    C2 --> D1
    C3 --> D1

    D1 --> E1[Perform 50% column-wise subsampling K times]

    E1 --> M1[Elastic-Net Model 1]
    E1 --> M2[Elastic-Net Model 2]
    E1 --> M3[Elastic-Net Model K-1]
    E1 --> M4[Elastic-Net Model K]

    M1 --> P1[Complete Posterior Summary]
    M2 --> P1
    M3 --> P1
    M4 --> P1
```
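A minimal sketch of the K column-wise subsampled elastic-net meta-learners, assuming `base_preds` is an N x M matrix of base-learner predictions and `y` is the observed outcome:

```r
library(glmnet)

set.seed(325)
K <- 25
meta_fits <- lapply(seq_len(K), function(k) {
  cols <- sample(ncol(base_preds), size = ncol(base_preds) %/% 2)  # 50% of base learners
  list(cols = cols,
       fit  = cv.glmnet(base_preds[, cols], y, alpha = 0.5))       # elastic net
})

# Posterior-style summary: predict with each meta-model, then summarize
meta_preds <- sapply(meta_fits, function(m)
  as.numeric(predict(m$fit, newx = base_preds[, m$cols], s = "lambda.min")))
meta_mean <- rowMeans(meta_preds)
meta_var  <- apply(meta_preds, 1, var)
```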

@sigmafelix sigmafelix mentioned this issue Jul 16, 2024
@kyle-messier
Collaborator Author

Relevant Paper

@sigmafelix @eva0marques @MAKassien @dzilber

https://arxiv.org/abs/2106.03253

For tabular data, boosting is often all you need, at least compared to various deep learning models. However, an ensemble of deep learning and boosting usually outperforms boosting alone, so I think we are going down the right path with our ensemble.

@eva0marques I think you were not there for this part of the TEP meeting, but one of the attendees brought up the issue of overcomplicating things. I think that would be a great secondary project for a postbac or something down the line: see how many of the covariates or base learners we can remove and still match the predictive performance of the large ensemble.

@eva0marques
Collaborator

What do you mean by overcomplicating things? Too many covariates? Yes, that would be an interesting project to do dimension reduction and select the most relevant / non collinear ones. I'm all for it, recipes with few but qualitative ingredients are always the best ☺️

@kyle-messier
Collaborator Author

> What do you mean by overcomplicating things? Too many covariates? Yes, that would be an interesting project to do dimension reduction and select the most relevant / non collinear ones. I'm all for it, recipes with few but qualitative ingredients are always the best ☺️

Yes, fewer covariates and fewer models (perhaps just one).
