
Big Data Considerations #325

Open

kyle-messier opened this issue Apr 3, 2024 · 32 comments

Comments

@kyle-messier
Collaborator

kyle-messier commented Apr 3, 2024

Some discussion of big data considerations for the beethoven pipeline. As @eva0marques, @sigmafelix, and others have pointed out: "The problem: we have 1058 sensors * 365.2 days * 5 years = 1931908 observations with 3844 covariates. I unsurprisingly get an error message 'Error: vector memory exhausted (limit reached?)'"

@kyle-messier kyle-messier converted this from a draft issue Apr 3, 2024
@kyle-messier
Collaborator Author

Current Solution: GEO cluster

The current solution, and the one I recommend, is running the entire pipeline on the GEO cluster. With ~2 TB of scratch space, we can read in the large data frames and fit the models. While this is not the most elegant solution, I think it will get the job done.

Spark and sparklyr

GEO does have native Spark and Hadoop installed, and R has the sparklyr package (https://spark.posit.co/), which handles reads and writes intelligently in a distributed manner.

This approach will likely take a bit more code and infrastructure development on our part, which is why I suggest we hold off for now. It is a future extension, and we can get additional support to help extend the pipeline in a truly more scalable fashion.
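For when we do pick sparklyr up, a minimal sketch of the idea (the connection settings and file paths here are assumptions, not GEO's actual configuration):

```r
library(sparklyr)
library(dplyr)

conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "64G"   # assumed; tune to GEO's limits
sc <- spark_connect(master = "local", config = conf)

# Lazily reference the on-disk covariates; nothing is pulled into R yet
covars <- spark_read_parquet(sc, name = "covariates",
                             path = "/scratch/beethoven/covariates.parquet")  # assumed path

covars |>
  count() |>     # executed by Spark
  collect()      # only the small summary returns to R

spark_disconnect(sc)
```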

@eva0marques
Collaborator

I also used to work with the foreach package for core dispatching. It may be useful for parallelizing over the cross-validation sets.
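For reference, a minimal foreach/doParallel sketch of that idea; `fit_fold()` is a hypothetical helper that fits and scores one CV fold:

```r
library(foreach)
library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

cv_metrics <- foreach(fold = 1:10, .combine = rbind) %dopar% {
  fit_fold(fold)   # hypothetical; returns a one-row data.frame of metrics
}

stopCluster(cl)
```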

@sigmafelix
Collaborator

@kyle-messier I agree on prioritizing the model and pipeline building now. Another consideration in dealing with large feature data: when the base models are fitted, I think the space-time cross-validation set generation should be based on the space-time coordinates only, rather than on the full dataset.

@eva0marques We are currently relying on the future framework for parallelization since it has a variety of branched-out packages for automatic SLURM job submission. I think using foreach is a convenient and straightforward way to parallelize heavy workloads. We can discuss parallelization backends throughout the pipeline soon.
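A small sketch of that future-based setup, assuming a SLURM template file and resource names that would need to match the cluster's configuration; `fit_fold()` is again a hypothetical helper:

```r
library(future)
library(future.apply)
library(future.batchtools)

# Each future becomes a SLURM job; template and resources are assumptions
plan(batchtools_slurm, template = "slurm.tmpl",
     resources = list(ncpus = 4, memory = "16G", walltime = 3600))

fits <- future_lapply(1:10, function(fold) fit_fold(fold))  # hypothetical fit_fold()
```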

@kyle-messier
Collaborator Author

@sigmafelix When you say "When the base models are fitted, I think that space-time cross validation set generation should be based on space-time coordinates only, rather than using the full dataset." Do you mean subsampled models or something else? We could also limit ourselves to purely spatial cross-validation as opposed to space-time CV.

@sigmafelix
Collaborator

@kyle-messier I meant we will use a compact three-variable data.frame, sf, or SpatVector (I believe sf is what we will use, as spatialsample uses sf objects) to get an rset object. I think the full 2M × 3.8K sf object is too big to handle even on GEO. The rset created by rsample::vfold_cv is a nested tibble, which would end up creating a massive tibble or sf with 20M effective rows. Instead, we can use a reduced sf object to create an rset object containing the row indices for each training-test split, then read the necessary rows out of the full data for model fitting.
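A minimal sketch of this index-only approach, assuming `coords_df` holds one row per observation (lon/lat/time) and `big_tbl` is the full disk-backed feature table:

```r
library(sf)
library(rsample)

coords_sf <- st_as_sf(coords_df, coords = c("lon", "lat"), crs = 4326)
folds     <- vfold_cv(coords_sf, v = 10)

split1    <- folds$splits[[1]]
train_idx <- split1$in_id          # analysis-set row indices
test_idx  <- complement(split1)    # assessment-set row indices

train_x <- big_tbl[train_idx, ]    # read only the rows needed for this fit
```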

@kyle-messier
Collaborator Author

Thanks @sigmafelix. It looks like rset is basically the resampling container (e.g., bootstraps) within the rsample package. In the end, if getting the 20M+ rows plus a large P into GEO to model is the main computational bottleneck (i.e., RAM), then I'm not opposed to splitting the base-learner-to-meta-learner process into smaller, overlapping regions. The spatialsample package could be helpful in that respect. Taking the ensemble approach we are adopting even further, we could fit, say, 10-50 spatial subsamples that each cover about 50 percent of the spatial domain. These are still large models that should be relatively stable, but will hopefully bring us within the RAM capacity of GEO.
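A small sketch of what spatialsample could give us here, assuming `sites_sf` is an sf object of the monitor locations; a union of a few clusters approximates a ~50%-of-domain subsample:

```r
library(spatialsample)
library(rsample)

set.seed(2024)
clusters <- spatial_clustering_cv(sites_sf, v = 10)   # 10 spatially clustered folds

# One ~50% spatial subsample: the rows falling in 5 of the 10 clusters
half_idx <- unlist(lapply(clusters$splits[1:5], complement))
```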

@kyle-messier
Collaborator Author

@sigmafelix If we take a divide-and-conquer strategy to model fitting, we could likely use the par_grid family of functions you made in chopin. Is there a way to introduce some randomness or variation if we want to make, say, 5 or 10 grid sets such that a given region has multiple submodels? The padding will help with that too.

@sigmafelix
Collaborator

sigmafelix commented Apr 7, 2024 via email

@kyle-messier
Collaborator Author

@sigmafelix I was thinking of irregular grid generation - perhaps something like k-means. But now that I think about it more, a spatial-block-CV approach would take care of it. For example, if the US domain were partitioned into 10 sections, then each section is fit and predicted on by 9 models. This CV is at the highest level, and I would not really consider it CV, but more a partitioning to deal with computation. The spatial or temporal CV we develop will be within each of these. Let me know what you think about this approach.

@kyle-messier
Collaborator Author

Sorry @sigmafelix - that wouldn't do much for dealing with the RAM computational issues. We'd want the partitioned training sets to be ~50% of the size. The general idea was to ensure that every location still gets multiple partitioned models trained and validated on it. Perhaps it could be done with the regular grid, plus variations of the grid_merge.

@sigmafelix
Collaborator

@kyle-messier I agree on generating irregular grids since the site locations are unequally distributed. We could make the grids partially overlapping so that many locations get multiple partitioned models; however, some will get only a single model unless we set a large overlapping distance. If we only consider spatial partitioning, the sample size per grid will be substantially reduced, given that each site carries roughly 3.8K × 1.9K data elements.

@kyle-messier
Collaborator Author

@sigmafelix in chopin, does your par_cut_coords function have a randomness component to it? Looking at the code, it does not appear so; running it on the same group of coordinates will always produce the same set.
There is a package, anticlust (https://cran.r-project.org/web/packages/anticlust/index.html), that can create roughly equal-size clustering groups. We could potentially calculate, say, 10-50 partitions into 2 groups (i.e., 50% of the data each) and pick the top 5 or 10 that ensure all of the AQS points appear in, say, 4/5 or 9/10 of the groups.

In terms of partitioning, do you think cutting our overall spatial sample size in half will allow us to run the model on GEO relatively easily?
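One crude way to introduce that randomness is to jitter the coordinates before each call to the deterministic anticlust::balanced_clustering(); a sketch, assuming `site_coords` is a matrix of projected x/y coordinates in meters:

```r
library(anticlust)

set.seed(325)
partitions <- lapply(1:10, function(i) {
  # ~1 km random jitter (assumes a projected CRS in meters)
  jittered <- site_coords + rnorm(length(site_coords), sd = 1000)
  balanced_clustering(jittered, K = 2)   # two equal-size groups, i.e. a 50/50 split
})
```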

@sigmafelix
Collaborator

@kyle-messier anticlust seems to operate conversely to k-means. Do we want to maximize between-group similarity?
For balancing the group sizes, I think we can consider making a "seed" set of clusters from k-means or k-medoids, then rotating the membership along the boundary of two adjacent clusters.

My calculation is that the full dataset will be ~30 GB (if precision is limited to 4 bytes) to ~60 GB (8 bytes), which is well below GEO's memory limit. The memory consumption of each model will be the key factor in the feasibility of model fitting. I will run some tests on my laptop and then get an estimate.
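For reference, the arithmetic behind those figures:

```r
n_obs  <- 1058 * 365.2 * 5    # ~1.93M observations
n_covs <- 3844                # covariates
n_obs * n_covs * 4 / 1e9      # ~30 GB at 4-byte precision
n_obs * n_covs * 8 / 1e9      # ~59 GB at 8-byte (double) precision
```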

@kyle-messier
Collaborator Author

@sigmafelix OK, that back-of-the-envelope calculation is not as big as I thought. Nonetheless, my thought was that something like anticlust would make it easy to create 10 sets of roughly equal partitions of the whole US dataset into 33% or 50% of the total size. That would ensure that every point is still included an equal number of times in the high-level ensemble. Perhaps just doing some high-level k-means/k-medoids as you suggest is easier and good enough.

@kyle-messier
Collaborator Author

anticlust::balanced_clustering creates perfectly even clusters. As you mentioned, it maximizes between-cluster similarity. It is also deterministic, as it always starts with the center cluster. So an easy way to improve computational efficiency with a divide-and-conquer approach would be to:

  1. Create K = 10 equal clusters
  2. Create 5 choose 3 (or maybe 2) models = 10 submodels; each group will then be in 6 models (4 choose 2)
  3. Run each of these partitioned submodels through the base-learner-to-meta-learner process
  4. Combine everything

This is a bit ad hoc, but it would add some non-stationarity to the model while also reducing the RAM burden (it may take longer, but that is okay); see the combinatorics sketch below. We can discuss more when @sigmafelix is back in a couple of weeks.
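A quick illustration of the "choose" arithmetic in step 2, assuming 5 groups with 3 groups per submodel; the exact counts are still open:

```r
groups  <- 1:5
members <- combn(groups, 3, simplify = FALSE)             # cluster groups feeding each submodel
length(members)                                           # 10 submodels
sum(vapply(members, function(g) 1 %in% g, logical(1)))    # each group appears in 6 of them
```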

@dzilber Do you know a way we can approximate the memory/storage cost of the model-fitting process?

@dzilber
Collaborator

dzilber commented Apr 11, 2024

@kyle-messier One part of memory complexity is just a count of all the parameters of the model. The trickier part is keeping track of the auxiliary terms you need, like gradient or error vectors or Hessian matrices. The implementation can have a big effect on the memory cost of fitting the model. For example, you can keep track of all of the gradients for a neural network with each iteration, or you can just save the gradients from the last layer that was updated. Since we are using off the shelf packages, we might have to check their documentation or run some experiments and extrapolate.

@sigmafelix
Collaborator

sigmafelix commented Apr 12, 2024 via email

@kyle-messier
Collaborator Author

@kyle-messier kyle-messier moved this from 🆕 New to 🏗 In progress in beethoven Jun 10, 2024
@sigmafelix
Collaborator

  1. Test run with the entire dataset on GEO: assess memory pressure, etc.
  2. Determine a data splitting strategy
  3. Combine-and-ensemble

@kyle-messier
Collaborator Author

  1. Try fitting the base learners on the full dataset (brulee MLP, xgboost, elastic net)
  2. If needed, discuss a partitioning strategy

@kyle-messier
Collaborator Author

If we take the bootstrap resampling strategy (but each bootstrap is M samples where M << N), then each bootstrap can also be passed to the meta-learner. Whether we use the stacks package directly or not, a similar strategy could be employed (see the sketch after the diagram):
https://stacks.tidymodels.org/index.html

```mermaid
graph TB;

    style P1 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
    style P2 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
    style P3 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
    style P4 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
    style P5 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;

    P1[Model Input] --> |Fit P bootstrap samples| P2[MLP Models];
    P1 --> |Fit P bootstrap samples| P3[xgboost Models];
    P1 --> |Fit P bootstrap samples| P4[elastic net Models];
    P2 --> P5[Meta Learner];
    P3 --> P5;
    P4 --> P5;
```
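If we do go through stacks, the workflow would look roughly like this sketch; `mlp_res`, `xgb_res`, and `enet_res` are assumed tuning results saved with `save_pred = TRUE` and `save_workflow = TRUE` in the control options:

```r
library(stacks)

meta <- stacks() |>
  add_candidates(mlp_res) |>
  add_candidates(xgb_res) |>
  add_candidates(enet_res) |>
  blend_predictions() |>   # regularized (elastic-net) combination of candidates
  fit_members()

predict(meta, new_data = new_df)   # new_df assumed
```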

@kyle-messier
Collaborator Author

[IMG-0484 and IMG-0485: photos of the sampling-strategy schematic]

Hi All, @eva0marques @sigmafelix @mitchellmanware @dzilber @larapclark @MAKassien @Sanisha003

Following up on our discussion from today on a sampling strategy to alleviate memory pressure and runtime: @eva0marques, @sigmafelix, and I outlined a strategy that embraces the multiple-learners approach and fits with the S-T cross-validation strategies. The multiple models will also take advantage of dynamic branching in targets.

The two pictures show our schematic - the second one with some of our random notes wiped away. In summary:

  1. P row bootstrap subsamples are taken for each of the base learners
  2. The bootstrap size M is fixed at 30% of N, so ~300K S/T samples per subsample
  3. P x 3 base learner outputs
  4. Each bootstrap model employs a space-time cross-validation strategy
  5. We think ~3 different S-T CV strategies would be good, so a given model has a 1/3 chance of getting a given CV strategy (see the sketch below)
  6. The 3 strategies have different focuses: 1 spatial, 1 temporal (e.g. season, month), and 1 edge case such as a quantile of imperviousness or rural/urban
  7. The meta-learner is essentially just another bootstrap resample of the base learners
  8. The meta-learner prediction is the prediction of Q bootstrap-sampled learners (i.e. posterior predictions can easily provide mean, variance, or percentiles)
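A rough sketch of items 1, 2, 4, and 5, assuming `n_obs` holds the total number of space-time observations:

```r
set.seed(325)
P <- 50                                  # number of bootstrap subsamples per learner
cv_strategies <- c("spatial", "temporal", "edge_case")

subsamples <- lapply(seq_len(P), function(i) {
  list(
    rows = sample(n_obs, size = round(0.3 * n_obs)),   # ~30% row subsample
    cv   = sample(cv_strategies, 1)                    # 1/3 chance per CV strategy
  )
})
```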

@sigmafelix
Collaborator

sigmafelix commented Jun 13, 2024

@kyle-messier @eva0marques @mitchellmanware

Related to cross-validation strategies: I added a function extending the previous generate_cv_index approach by employing anticlust::balanced_clustering and various preprocessing options (raw, normalization, and standardization) to get partially overlapping space-time groups. The result with the standardization option on the full dataset looks like the figures below.

Some distant subclusters, or "seeds," are grouped into the same cluster because I am using the mean space-time coordinates of each seed to pick the top 10 closest pairs. We could discuss this implementation further.

[Figure: space-time clusters, viewing angle = 40 (roughly from the southeastern edge)]

[Figure: space-time clusters, viewing angle = 215 (roughly from the southwestern edge)]

@sigmafelix
Collaborator

@kyle-messier So do we have an exact P value for model fitting? As mentioned in my comments above, the spatiotemporal grouping with overlaps is implemented, so the other two ways need to be added to the package and then we're good to go for the model fitting phase.

@kyle-messier
Collaborator Author

@sigmafelix I think 50-100 would be sufficient for creating a powerful model, but it could take a lot of time, at least for the MLP models. If we have P = 100, that is on the order of 300 models across the three base learners. We could then essentially take a random-forest-style approach and use summaries of all the models as the final results, bypassing the need for a true meta-learner. What do you think?
@dzilber @eva0marques @mitchellmanware @MAKassien

@kyle-messier
Collaborator Author

Summarizing model discussion from 7/8/24

@dzilber @eva0marques @sigmafelix @mitchellmanware @larapclark @Sanisha003 @MAKassien

Pipeline

After some discussion, the pipeline that makes the most sense is as follows (I'm changing some terminology here to be more consistent with the literature). Given:

  • Total sample size, $N$, ~900,000
  • Covariate size, $P$, ~10,000
  1. Create $M$ subsamples at random, each containing 30% of $N$
  2. Each subsample $M_i$ is fit with a base learner from $\{MLP, XGBOOST, GLMNET\}$
  3. Each $M_i$ is fit according to an S-T cross-validation strategy from {Spatial, Temporal, SpatioTemporal}
  4. The result of model $M_i$ is then used to predict on all $N$
  5. Meta-learner or summarization

Each $M_i$ gets its own CV strategy, or all 3 CV strategies

@sigmafelix has implemented a strategy in which each $M_i$ gets fitted with all 3 CV strategies ($M_{ij}$, where $j$ indexes spatial, temporal, and ST).
My initial thought was to randomly assign each $M_i$ to one of the 3 CV strategies with equal probability.

Personally, I don't expect a difference in the results, only in the pipeline implementation.

Size of M

This would be difficult or impossible to tune. We want $M$ to be sufficiently large, but not so large that we overdo it unnecessarily. I think $M=100$ is a good starting point. If we focus on tuning every $M_i$, then perhaps we go fewer, like 10-20. If we fix the tuning parameters and are less strict about fitting each $M_i$, then I think 100-500 is reasonable.

Hyperparameter tuning for each $M_i$

Each model type has a number of parameters (hyperparameters) to tune. We can spend time tuning each one, or focus on only one and fix the others. This strategy works well for a large $M$, since each model will take drastically less time to fit.
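As an illustration of fixing most hyperparameters and tuning only one, a sketch of an xgboost base-learner spec in parsnip; the fixed values are placeholders, not recommendations:

```r
library(parsnip)
library(tune)

# Fix mtry/trees/tree_depth (placeholder values) and tune only the learning rate
xgb_spec <- boost_tree(
  mtry = 50, trees = 1000, tree_depth = 6,
  learn_rate = tune()
) |>
  set_engine("xgboost") |>
  set_mode("regression")
```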

Meta-learner or straight ensemble summary

Given that $M$ is going to be sufficiently large regardless, we could just take summaries (mean, variance, etc.) of all of the base learners as the final predictions - in the same spirit as random forest.

We could also feed the $M$ models into a meta-learner such as another $GLMNET$, $GAM$, or $BART$ model. Personally, if we go this route I think a simple $GLMNET$ is more than sufficient. One benefit of a meta-learner is that we could add a few random spatial fields (smoothly varying), which could help spatially adjust the predictions. The disadvantages are the additional model training and the question of how to create meta-learner variance predictions. For the RF-style ensemble, we simply take $mean(M)$ and $var(M)$. With a $GLMNET$, which reduces everything to 1 model, a variance prediction formula would need to be developed. For these reasons, I'm leaning towards the RF-style ensemble (we need a name for that; I'm sure there is one).
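A minimal sketch of the RF-style summary, assuming `preds` is a matrix of base-learner predictions with one column per learner:

```r
# `preds`: (n_predictions x M) matrix of base-learner predictions (assumed)
ens_mean <- rowMeans(preds)          # ensemble mean prediction
ens_var  <- apply(preds, 1, var)     # between-learner variance as a UQ proxy
```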

Example pipeline embracing large $M$ with fixed hyperparameters and ensemble summaries as final result

```mermaid
graph TD
    A1[1. M_i is a 30% random sample of N] --> B1[2. M_i gets assigned to 1 of 3 categories with equal probability, Spatial, Temporal, or SpaceTime]
    B1 --> C1[3. M_i is fit with a MLP model]
    C1 --> D1[4. Randomly pick tuning hyperparameters]
    D1 --> E1[4a. Randomly fix activation to leaky-relu or relu]
    D1 --> F1[4b. Randomly fix drop out rate]
    D1 --> G1[4c. Randomly fix depth and width of hidden layers]
    G1 --> H1[5. M_i is fit and tuned for learning rate]
    F1 --> H1
    E1 --> H1
    D1 --> H1

    A2[1. M_i is a 30% random sample of N] --> B2[2. M_i gets assigned to 1 of 3 categories with equal probability, Spatial, Temporal, or SpaceTime]
    B2 --> C2[3. M_i is fit with XGBoost model]
    C2 --> D2[4. Pick fixed hyperparameters]
    D2 --> E2[4a. mtry]
    D2 --> F2[4b. trees]
    D2 --> G2[4c. tree depth]
    G2 --> H2[5. M_i is fit and tuned for learning rate]
    E2 --> H2
    F2 --> H2

    A3[1. M_i is a 30% random sample of N] --> B3[2. M_i gets assigned to 1 of 3 categories with equal probability, Spatial, Temporal, or SpaceTime]
    B3 --> C3[3. M_i is fit with glmnet model]
    C3 --> D3[4. Pick fixed hyperparameters]
    D3 --> E3[4a. alpha - proportion of lasso/ridge]
    E3 --> H3[5. M_i is tuned for lambda]

    H1 --> I[6. Mean is mean of all M across all three methods]
    H2 --> I
    H3 --> I
    I --> J[7. Variance is var of all M across all three methods]
```

@kyle-messier
Collaborator Author

xgboost vs lightGBM

parsnip can fit multiple implementations of gradient-boosted machines, including xgboost and lightGBM. lightGBM seems to be optimized for large datasets, with lower memory usage and better use of the GPU.

https://github.com/microsoft/LightGBM

Something to think about: whether it is worth it or would make a difference compared to xgboost.
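For what it's worth, parsnip reaches lightGBM through the bonsai extension package; a minimal sketch (hyperparameter values are placeholders):

```r
library(parsnip)
library(bonsai)   # provides the "lightgbm" engine for boost_tree()

lgb_spec <- boost_tree(trees = 1000, learn_rate = tune::tune()) |>
  set_engine("lightgbm") |>
  set_mode("regression")
```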

@sigmafelix
Collaborator

@kyle-messier Thank you for the summary of today's discussion. It helps me a lot to be clear about the design. If we pick random values for some hyperparameters to fit $M \approx 500$ models, I think there is very little point in tuning the other hyperparameters sequentially. If we do want to tune hyperparameters to find the best combination, and steps 4a-4c are done simultaneously, we could take different values of the learning rates/lambdas and perform a random search for each to find the best model. I think the best model's hyperparameter combination would differ fairly substantially between CV strategies but mostly agree across models with the same CV strategy.

Briefly looking through the lightGBM R package, it seems we have to build our own binary for GPU usage. I will try to build the package from source and contact OSC if it is too complex or difficult to build.

@kyle-messier
Collaborator Author

Model Discussion Addendum 7/8/24

@sigmafelix @mitchellmanware @dzilber @eva0marques @MAKassien @Sanisha003 @larapclark

Revisiting a Meta-Learner

Perhaps I was hasty to abandon the idea of a true meta-learner. There are a couple of easy options that should be scalable and provide UQ through another round of subsampling on the base learners.

In each case, say we have $M=500$ total base learners from the subsamples. We can take column-wise (covariate) subsamples of, say, 50%, $K$ times to get new meta-learner input sets. $\binom{500}{250} \approx 10^{149}$, so it is safe to say there are plenty of meta-learner input sets to create.

  1. An ensemble of boosted trees or RF. Advantage: nonlinear. Disadvantage: computationally intensive, but still tractable of course.
  2. An ensemble of elastic nets. Advantage: fast, scalable, somewhat interpretable. Disadvantage: linear in each model.

Example pipeline with elastic-net meta-learner

I've removed the branches for hyperparameter tuning/fixing of the base learners for simplicity; a code sketch of the column-wise subsampled meta-learner follows the diagram.

```mermaid
graph TD
    A1[1. M_i is a 30% random sample of N] --> B1[2. M_i gets assigned to 1 of 3 categories with equal probability, Spatial, Temporal, or SpaceTime]
    B1 --> C1[3. M_i is fit with a MLP model]

    A2[1. M_i is a 30% random sample of N] --> B2[2. M_i gets assigned to 1 of 3 categories with equal probability, Spatial, Temporal, or SpaceTime]
    B2 --> C2[3. M_i is fit with XGBoost model]

    A3[1. M_i is a 30% random sample of N] --> B3[2. M_i gets assigned to 1 of 3 categories with equal probability, Spatial, Temporal, or SpaceTime]
    B3 --> C3[3. M_i is fit with glmnet model]

    C1 --> D1[Elastic-Net Meta-Learner]
    C2 --> D1
    C3 --> D1

    D1 --> E1[Perform 50% column-wise subsampling K times]

    E1 --> M1[Elastic-Net Model 1]
    E1 --> M2[Elastic-Net Model 2]
    E1 --> M3[Elastic-Net Model K-1]
    E1 --> M4[Elastic-Net Model K]

    M1 --> P1[Complete Posterior Summary]
    M2 --> P1
    M3 --> P1
    M4 --> P1
```
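A minimal sketch of the K column-wise subsampled elastic-net meta-learners, assuming `base_preds` is an N x M matrix of base-learner predictions and `y` is the observed outcome:

```r
library(glmnet)

set.seed(325)
K <- 25
meta_fits <- lapply(seq_len(K), function(k) {
  cols <- sample(ncol(base_preds), size = ncol(base_preds) %/% 2)  # 50% of base learners
  list(cols = cols,
       fit  = cv.glmnet(base_preds[, cols], y, alpha = 0.5))       # elastic net
})

# Posterior-style summary: predict with each meta-model, then summarize
meta_preds <- sapply(meta_fits, function(m)
  as.numeric(predict(m$fit, newx = base_preds[, m$cols], s = "lambda.min")))
meta_mean <- rowMeans(meta_preds)
meta_var  <- apply(meta_preds, 1, var)
```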

@sigmafelix sigmafelix mentioned this issue Jul 16, 2024
@kyle-messier
Collaborator Author

Relevant Paper

@sigmafelix @eva0marques @MAKassien @dzilber

https://arxiv.org/abs/2106.03253

For tabular data, boosting is often all you need, at least compared to various deep learning models. However, an ensemble of deep learning and boosting usually outperforms boosting alone, so I think we are going down the right path with our ensemble.

@eva0marques I think you were not there for this part of the TEP meeting, but one of the attendees brought up the issue of overcomplicating things. I think that would be a great secondary project for a postbac or something down the line: see how many of the covariates or base learners we can remove and still match the predictive performance of the large ensemble.

@eva0marques
Collaborator

What do you mean by overcomplicating things? Too many covariates? Yes, that would be an interesting project to do dimension reduction and select the most relevant / non collinear ones. I'm all for it, recipes with few but qualitative ingredients are always the best ☺️

@kyle-messier
Collaborator Author

> What do you mean by overcomplicating things? Too many covariates? Yes, that would be an interesting project to do dimension reduction and select the most relevant / non collinear ones. I'm all for it, recipes with few but qualitative ingredients are always the best ☺️

Yes, fewer covariates and fewer models (perhaps just one).
