- Private: 0.773145, 4th place
- Public: 0.813093
- You may find a more detailed explanation here
np.log1p
: log(1 + x) transform applied to the input data.
Tsvd
: TruncatedSVD(n_components=128, random_state=42)
UMAP
: UMAP(n_neighbors=16, n_components=128, random_state=42, verbose=True)
Novel’s method
: The original method can be found here.
name importance
: Mainly based on AmbrosM's notebook, with additional information added from mygene.
corr importance
: Top 3 features that correlate with each target.
rf importance
: Top 128 most important features of the random forest model.
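A minimal sketch of the dimensionality-reduction features above; this is an illustration rather than the exact training code, and the input matrix `X` (cells × genes, already loaded) is an assumption:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from umap import UMAP

# X: cells x genes count matrix (assumed already loaded)
X_log = np.log1p(X)

# Tsvd features: 128 components on the log-transformed data
tsvd = TruncatedSVD(n_components=128, random_state=42)
X_tsvd = tsvd.fit_transform(X_log)

# UMAP features: 128 components with the parameters listed above
umap_model = UMAP(n_neighbors=16, n_components=128, random_state=42, verbose=True)
X_umap = umap_model.fit_transform(X_log)

# Concatenate with the importance-selected raw features described above
features = np.hstack([X_tsvd, X_umap])
```

TruncatedSVD gives a linear, global view of the data while UMAP adds a nonlinear, local-structure view; concatenating both kinds of embedding is a common way to feed complementary signals to downstream models.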
Method | Stacking | GMNN | NN_online | CNN | kernel_ridge | LGBM | Catboost |
---|---|---|---|---|---|---|---|
CV | 0.89677 | 0.89596 | 0.89580 | 0.89530 | 0.89326 | 0.89270 | 0.89100 |
GMNN
: Gated Map Neural Network. A NN that tries to do something like Transformers and RNNs without using feature vectors.
CNN
: Inspired by the tmp method here, with a multidimensional convolution kernel added, similar to ResNet.
NN (Online)
: A NN model based on a Kaggle online notebook.
Kernel Ridge
: Inspired by the best solution of last year's competition. Used Ray Tune to optimize the hyperparameters.
Catboost
: A MultiOutputCatboostRegressor class that can use early stopping to prevent overfitting, compared with sklearn.multioutput.MultiOutputRegressor (see the sketch after this list).
LGBM
: A MultiOutputLGBMRegressor that can likewise use early stopping to prevent overfitting, compared with sklearn.multioutput.MultiOutputRegressor.
Stacking
: Used KNN, CNN, ridge, rf, catboost, and GMNN for the first layer; only CNN, catboost, and GMNN for the second; and just a simple MLP for the last layer. To avoid overfitting, I used a special CV strategy that does k-fold by donor and generates the OOF predictions together (see the sketch after the CV table below).
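A minimal sketch of the MultiOutputLGBMRegressor idea described above: one booster per target column, each with its own validation set and early stopping. The class body is an illustrative reconstruction, not the exact implementation, and the Catboost variant is analogous:

```python
import numpy as np
import lightgbm as lgb
from lightgbm import LGBMRegressor

class MultiOutputLGBMRegressor:
    """One LGBM booster per target column, each with its own early stopping.

    Unlike sklearn.multioutput.MultiOutputRegressor, every per-target model
    receives an eval_set, so it can stop early and avoid overfitting.
    """

    def __init__(self, **lgbm_params):
        self.lgbm_params = lgbm_params
        self.models_ = []

    def fit(self, X, y, X_val, y_val, stopping_rounds=50):
        self.models_ = []
        for j in range(y.shape[1]):
            model = LGBMRegressor(**self.lgbm_params)
            model.fit(
                X, y[:, j],
                eval_set=[(X_val, y_val[:, j])],
                callbacks=[lgb.early_stopping(stopping_rounds, verbose=False)],
            )
            self.models_.append(model)
        return self

    def predict(self, X):
        return np.column_stack([m.predict(X) for m in self.models_])
```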
CV Results | Model Ⅰ (valid donor 32606) | Model Ⅱ (valid donor 13176) | Model Ⅲ (valid donor 31800) |
---|---|---|---|
Fold 1 | 0.8989 | 0.8967 | 0.8947 |
Fold 2 | 0.8995 | 0.8967 | 0.8951 |
Fold 3 | 0.8985 | 0.8959 | 0.8949 |
Fold Mean | 0.89897 | 0.89643 | 0.89490 |
Model Mean | 0.89677 | - | - |
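The donor-wise stacking CV can be sketched with GroupKFold (a stand-in illustration; `models`, a dict of model factories, and `donors`, the per-cell donor IDs, are hypothetical names):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def stacking_oof(models, X, y, donors, n_splits=3):
    """Hold out one group of donors per fold and collect OOF predictions
    that the next stacking layer can train on without leakage."""
    oof = {name: np.zeros_like(y) for name in models}
    for train_idx, val_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=donors):
        for name, make_model in models.items():
            model = make_model()
            model.fit(X[train_idx], y[train_idx])
            oof[name][val_idx] = model.predict(X[val_idx])
    # Concatenated OOF predictions become the features for the next layer
    return np.hstack([oof[name] for name in models])
```

Each fold holds out one donor entirely, matching the three validation donors in the table above, so the next layer never trains on predictions made for cells whose donor the first layer already saw.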
- TF-IDF normalization
np.log1p(data * 1e4)
- Tsvd -> 512
- Normalization -> mean = 0, std = 1
- Tsvd -> 1024
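A minimal sketch of this chain, assuming (based on the GMNN and LGBM notes below) that the 512-component Tsvd is applied to the inputs and the 1024-component Tsvd to the normalized targets; `X_raw` and `Y_raw` are assumed to be already loaded:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer

# Inputs: TF-IDF normalization, log1p scaling, then Tsvd to 512 components
X_tfidf = TfidfTransformer().fit_transform(X_raw)   # X_raw: cells x peaks counts (assumed)
X_log = np.log1p(X_tfidf.toarray() * 1e4)
X_512 = TruncatedSVD(n_components=512, random_state=42).fit_transform(X_log)

# Targets: normalize to mean 0 / std 1 (per row here; could also be per column),
# then Tsvd to 1024 components
Y_norm = (Y_raw - Y_raw.mean(axis=1, keepdims=True)) / Y_raw.std(axis=1, keepdims=True)
tsvd_y = TruncatedSVD(n_components=1024, random_state=42)
Y_1024 = tsvd_y.fit_transform(Y_norm)
```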
GMNN
: Gated Map Neural Network. The model outputs 1024 dimensions, which are dot-multiplied with tsvd.components_ (held constant) to get the final prediction; correl_loss is then used to compute the loss, and the gradients are back-propagated (see the sketch below).
Catboost
: The results come from an online notebook.
LGBM
: The same MultiOutputLGBMRegressor mentioned above, using MSE to fit the Tsvd results of the normalized targets.
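A minimal PyTorch sketch of the GMNN decode-and-score step. Here correl_loss is reconstructed as a negative mean per-row Pearson correlation (an assumption based on the competition metric), and `model`, `batch_inputs`, `batch_targets`, and `tsvd_y` (the fitted target TruncatedSVD from the preprocessing sketch above) are assumed names:

```python
import torch

def correl_loss(pred, target, eps=1e-8):
    # Negative mean per-row Pearson correlation (assumed form of correl_loss)
    p = pred - pred.mean(dim=1, keepdim=True)
    t = target - target.mean(dim=1, keepdim=True)
    corr = (p * t).sum(dim=1) / (p.norm(dim=1) * t.norm(dim=1) + eps)
    return -corr.mean()

# components: (1024, n_targets) from the fitted TruncatedSVD, kept constant
components = torch.as_tensor(tsvd_y.components_, dtype=torch.float32)

out_1024 = model(batch_inputs)      # model output: (batch, 1024)
pred = out_1024 @ components        # decode to target space: (batch, n_targets)
loss = correl_loss(pred, batch_targets)
loss.backward()                     # back-propagate the gradients
```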
- You may refer to the ensemble notebook.
- working (⭐You are here now)
- cite
- Catboost
- CNN
- GMNN
- Kernel_Ridge
- LGBM
- NN_Online
- Stacking
- data_preprocessing
- cite.ipynb
- multi.ipynb
- new_cite_train_final.npz # https://www.kaggle.com/datasets/oliverwang15/cite-final
- new_cite_test_final.npz # https://www.kaggle.com/datasets/oliverwang15/cite-final
- ensemble
- multi
- Catboost
- GMNN
- LGBM
- pics
- utils
- input (You need to download the following data)
- multimodal-single-cell-as-sparse-matrix # https://www.kaggle.com/datasets/fabiencrom/multimodal-single-cell-as-sparse-matrix
- open-problems-multimodal # https://www.kaggle.com/competitions/open-problems-multimodal/data
- open-problems-raw-counts # https://www.kaggle.com/datasets/ryanholbrook/open-problems-raw-counts