diff --git a/README.md b/README.md
index d2a0f00..e08d80a 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,68 @@
-# tox24challenge
-Dataset used in Tox24 challenge
+Tox24 Challenge Dataset
+===========
+
+This repository contains molecular structures and descriptors for the Tox24 challenge prepared by me (team name: **filipsPL**). The goal of the challenge was to predict the in vitro activity of compounds against [transthyretin (TTR)](https://en.wikipedia.org/wiki/Transthyretin) from chemical structure data.
+
+- [Tox24 Challenge Dataset](#tox24-challenge-dataset)
+  - [Dataset](#dataset)
+  - [Descriptors](#descriptors)
+  - [Importance of features](#importance-of-features)
+  - [The Challenge Results](#the-challenge-results)
+  - [References](#references)
+
+
+## Dataset
+
+This repository includes:
+
+- The [chemical structures](data/smiles_org+fixed.csv) provided by the organizers and curated by me using my RDKit pipeline.
+- **Training set**: a diversified set of 1000 compounds used for training models: [data/train.csv.xz](data/train.csv.xz)
+- **Validation set**: a diversified set of 100 compounds used for final validation of models: [data/validation.csv.xz](data/validation.csv.xz)
+- **Test set**: 500 compounds used to make predictions, split into a leaderboard set (200 compounds) and a blind set (300 compounds): [data/test.csv.xz](data/test.csv.xz)
+
+## Descriptors
+
+The CSV files contain 2D descriptors of molecules, including:
+
+- RDKitDescriptors (2D)
+- molecular fingerprints:
+  - CDK:
+    - CDKECFP4
+    - CDKEState
+    - CDKFCFP4
+    - CDKmolprop
+    - CDKpubchem
+    - CDKstandard
+  - Indigo fingerprints:
+    - IndigoResonanceSubstructure
+    - IndigoSimilarity
+  - RDKit fingerprints:
+    - RDkitFP-AtomPair
+    - RDkitFP-Avalon
+    - RDkitFP-FeatMorgan4
+    - RDkitFP-Layered
+    - RDkitFP-MACCS
+    - RDkitFP-Morgan2
+    - RDkitFP-Morgan3
+    - RDkitFP-Morgan4
+    - RDkitFP-Pattern
+    - RDkitFP-RDKit
+    - RDkitFP-Torsion
+
+## Importance of features
+
+Feature importances according to the final CatBoost model:
+
+![bar plot](feature_importance.png)
+
+
+## The Challenge Results
+
+Bar plot of the RMSE of submitted predictions (drawn by me from the official results). Congratulations to the winning team Amidoff 🎉!
+
+![rank](ranking.png)
+
+## References
+
+1. [OCHEM Platform for Tox24](https://ochem.eu/static/challenge.do)
+2. [Chem. Res. Toxicol. 2024, 37, 6, 825–826](https://pubs.acs.org/doi/10.1021/acs.chemrestox.4c00192)
diff --git a/data/smiles_org+fixed.csv b/data/smiles_org+fixed.csv
new file mode 100644
index 0000000..360e353
--- /dev/null
+++ b/data/smiles_org+fixed.csv
@@ -0,0 +1,201 @@
+SMILES_org,SMILES_fixed
+CC1=CC[C@H]2C[C@@H]1C2(C)C,CC1=CC[C@H]2C[C@@H]1C2(C)C
+CC[C@H](C)[C@H](N1SC2=CC=CC=C2C1=O)C(O)=O,CC[C@H](C)[C@@H](C(=O)O)n1sc2ccccc2c1=O
+CCCCCC[C@@H](O)C/C=C\CCCCCCCC(O)=O,CCCCCC[C@@H](O)C/C=C\CCCCCCCC(=O)O
+Cl/C=C/Cl,Cl/C=C/Cl
+NC1=C(Cl)C=C(C=C1Cl)[N+]([O-])=O,Nc1c(Cl)cc([N+](=O)[O-])cc1Cl
+FC1=CC=C(Br)C=C1,Fc1ccc(Br)cc1
+CCCCCCCCCCCCCCCO,CCCCCCCCCCCCCCCO
+OCCN1CCNCC1,OCCN1CCNCC1
+CN(C)N,CN(C)N
+[Cl-].C[N+]1(C)CCCCC1,C[N+]1(C)CCCCC1
+ClCC(Cl)CCl,ClCC(Cl)CCl
+NCC1=CC(CN)=CC=C1,NCc1cccc(CN)c1
+OC(CCl)CCl,OC(CCl)CCl
+CC(C)C1=CC(=CC=C1)C(C)C,CC(C)c1cccc(C(C)C)c1
+OCNC(=O)NCO,O=C(NCO)NCO
+CCCCC(CC)COCCO,CCCCC(CC)COCCO
+O=CC(CC1=CC=C(C(C)(C)C)C=C1)C,CC(C=O)Cc1ccc(C(C)(C)C)cc1
+OCC(CO)(CO)[N+]([O-])=O,O=[N+]([O-])C(CO)(CO)CO
+S1C=2C(N=C1SC(SC#N)([H])[H])=C(C(=C(C2[H])[H])[H])[H],N#CSCSc1nc2ccccc2s1
+CC(C)(CS(O)(=O)=O)NC(=O)C=C,C=CC(=O)NC(C)(C)CS(=O)(=O)O
+COC1=CC=CC=C1N,COc1ccccc1N
+ClCC(=O)NC1=CC=CC=C1,O=C(CCl)Nc1ccccc1
+CCOCCO,CCOCCO
+CCCCC(CC)C=O,CCCCC(C=O)CC
+CCCCC(CC)COC(=O)C1=CC=C(C=C1)N(C)C,CCCCC(CC)COC(=O)c1ccc(N(C)C)cc1
+OC(=O)CF,O=C(O)CF
+CCCCCCCCC(CO)CCCCCC,CCCCCCCCC(CO)CCCCCC
+CC(C)(O)C#N,CC(C)(O)C#N
+CC(C)C1=C(O)C=CC=C1,CC(C)c1ccccc1O
+COC1=CC(=CC=C1N)[N+]([O-])=O,COc1cc([N+](=O)[O-])ccc1N
+O=C1OC(=O)C2C3CC(C=C3)C12,O=C1OC(=O)C2C3C=CC(C3)C12
+CC(C)(C)C1=C(O)C=CC=C1,CC(C)(C)c1ccccc1O
+CC(C)(CO)CO,CC(C)(CO)CO
+O1C=CC2=C1C=CC=C2,c1ccc2occc2c1
+CCCCOC(=O)COC1=C(Cl)C=C(Cl)C=C1,CCCCOC(=O)COc1ccc(Cl)cc1Cl
+CC(C)(C)C1=CC(=C(O)C=C1)C(C)(C)C,CC(C)(C)c1ccc(O)c(C(C)(C)C)c1
+CC1=CC=C(C)C(=C1)S(O)(=O)=O,Cc1ccc(C)c(S(=O)(=O)O)c1
+CC1=CC(O)=C(C)C=C1,Cc1ccc(C)c(O)c1
+CC(C)CCCC(C)(C)O,CC(C)CCCC(C)(C)O
+CC1=CC=CC(C)=C1N,Cc1cccc(C)c1N
+CC1=C(C=CC=C1[N+]([O-])=O)[N+]([O-])=O,Cc1c([N+](=O)[O-])cccc1[N+](=O)[O-]
+OCC(O)CCl,OCC(O)CCl
+CC1=C(Cl)C=C(N)C=C1,Cc1ccc(N)cc1Cl
+CC1=CC(N)=CC=C1,Cc1cccc(N)c1
+CC1=CC=CN=C1,Cc1cccnc1
+OCC=CC1=CC=CC=C1,OCC=Cc1ccccc1
+CC(CCOC(C)=O)CC(C)(C)C,CC(=O)OCCC(C)CC(C)(C)C
+BrC1=CC=C(OC(=O)N2CCN3CCC2CC3)C=C1,O=C(Oc1ccc(Br)cc1)N1CCN2CCC1CC2
+CCCCOC1=CC=C(N)C=C1,CCCCOc1ccc(N)cc1
+NC1=CC=C(Cl)C=C1,Nc1ccc(Cl)cc1
+OC1CCC(CC1)C2CCCCC2,OC1CCC(C2CCCCC2)CC1
+CCC(=C(C1=CC=C(O)C=C1)C2=CC=C(OCCN(C)C)C=C2)C3=CC=CC=C3,CCC(=C(c1ccc(O)cc1)c1ccc(OCCN(C)C)cc1)c1ccccc1
+COC1=CC=C(O)C=C1,COc1ccc(O)cc1
+CN1CCOCC1,CN1CCOCC1
+NCCCN1CCOCC1,NCCCN1CCOCC1
+CCCC1=CC=C(N)C=C1,CCCc1ccc(N)cc1
+C1=CC(=CC=N1)C2=CC=NC=C2,c1cc(-c2ccncc2)ccn1
+ClC1=CC=C(C=C1)S(=O)(=O)C2=CC=C(Cl)C=C2,O=S(=O)(c1ccc(Cl)cc1)c1ccc(Cl)cc1
+CCC1=CC(CC2=CC(CC)=C(N)C(CC)=C2)=CC(CC)=C1N,CCc1cc(Cc2cc(CC)c(N)c(CC)c2)cc(CC)c1N
+CN1SC(Cl)=CC1=O,Cn1sc(Cl)cc1=O
+NC1=NC2=C(NC=N2)C(=S)N1,Nc1nc2nc[nH]c2c(=S)[nH]1
+C=CC#N,C=CC#N
+C=CCN=C=S,C=CCN=C=S
+CC1=CCC2CC1C2(C)C,CC1=CCC2CC1C2(C)C
+CCC1(CCC(=O)NC1=O)C2=CC=C(N)C=C2,CCC1(c2ccc(N)cc2)CCC(=O)NC1=O
+O[C@@H]([C@@H](O)CO)[C@@H](O)C=O,O=C[C@H](O)[C@@H](O)[C@@H](O)CO
+OC(=O)C1CCN(CC1)C2=C(NC(=O)NC(=O)C3=CC(F)=C(F)C=C3Cl)C=C(F)C=C2,O=C(NC(=O)c1cc(F)c(F)cc1Cl)Nc1cc(F)ccc1N1CCC(C(=O)O)CC1
+CC1COC2=C(C=CC=C2)N1C(=O)C(Cl)Cl,CC1COc2ccccc2N1C(=O)C(Cl)Cl
+COC1=CC=C(C=C1)C2=COC3=C(C(O)=CC(O)=C3)C2=O,COc1ccc(-c2coc3cc(O)cc(O)c3c2=O)cc1
+OC1=C(Br)C=C(C=C1Br)C2(OS(=O)(=O)C3=C2C=CC=C3)C4=CC(Br)=C(O)C(Br)=C4,O=S1(=O)OC(c2cc(Br)c(O)c(Br)c2)(c2cc(Br)c(O)c(Br)c2)c2ccccc21
+CC(C)N1C(SCN(C1=O)C2=CC=CC=C2)=NC(C)(C)C,CC(C)N1C(=O)N(c2ccccc2)CSC1=NC(C)(C)C
+CCC(C)NC1=C(C=C(C=C1[N+]([O-])=O)C(C)(C)C)[N+]([O-])=O,CCC(C)Nc1c([N+](=O)[O-])cc(C(C)(C)C)cc1[N+](=O)[O-]
+CCCCOC(=O)[C@H](C)O,CCCCOC(=O)[C@H](C)O
+CCCCOC(=O)C(C)O,CCCCOC(=O)C(C)O
+CCCC[Sn](Cl)(Cl)Cl,CCCC[Sn](Cl)(Cl)Cl
+[Na+].COC1=CC=C(C=C1)N=NC2=C(OC)C=C(N=NC3=CC=C(C=C3)S([O-])(=O)=O)C(C)=C2,COc1ccc(N=Nc2cc(C)c(N=Nc3ccc(S(=O)(=O)[O-])cc3)cc2OC)cc1
+[Na+].OC1=C(N=NC2=CC=C(C=C2)S([O-])(=O)=O)C3=CC=CC=C3C=C1,O=S(=O)([O-])c1ccc(N=Nc2c(O)ccc3ccccc23)cc1
+CN1C=NC2=C1C(=O)N(C)C(=O)N2C,Cn1c(=O)c2c(ncn2C)n(C)c1=O
+CC(=O)C1=CC2=C(OC(C)(C)[C@H](O)[C@H]2NC(=O)C3=CC=C(F)C=C3)C=C1,CC(=O)c1ccc2c(c1)[C@H](NC(=O)c1ccc(F)cc1)[C@@H](O)C(C)(C)O2
+COC1=CC(Cl)=C(OC)C=C1Cl,COc1cc(Cl)c(OC)cc1Cl
+CC(C)=CCCC(C)=CC#N,CC(C)=CCCC(C)=CC#N
+ClC1=CC=CC=C1C2=NN=C(N=N2)C3=C(Cl)C=CC=C3,Clc1ccccc1-c1nnc(-c2ccccc2Cl)nn1
+NCCO.OC(=O)C1=C(Cl)C=CC(Cl)=N1,O=C(O)c1nc(Cl)ccc1Cl
+O=C1OC2=C(C=CC=C2)C=C1,O=c1ccc2ccccc2o1
+ClC=1C=C2N(C(=O)C(C2=CC1F)C(=O)C=3SC=C(Cl)C3)C(=O)N,NC(=O)N1C(=O)C(C(=O)c2cc(Cl)cs2)c2cc(F)c(Cl)cc21
+COC1=C(CN[C@H]2CCCN[C@H]2C3=CC=CC=C3)C=C(OC(F)(F)F)C=C1,COc1ccc(OC(F)(F)F)cc1CN[C@H]1CCCN[C@H]1c1ccccc1
+CC(C)(O)C1=COC(=C1)S(=O)(=O)NC(=O)NC2=C3CCCC3=CC4=C2CCC4,CC(C)(O)c1coc(S(=O)(=O)NC(=O)Nc2c3c(cc4c2CCC4)CCC3)c1
+CNC(=O)[C@H]1O[C@H]([C@H](O)[C@@H]1N)N2C=NC3=C2N=CN=C3NCC4=CC(Cl)=CC=C4OCC5=CC(C)=NO5,CNC(=O)[C@H]1O[C@@H](n2cnc3c(NCc4cc(Cl)ccc4OCc4cc(C)no4)ncnc32)[C@H](O)[C@@H]1N
+CC(C)(OO)C1=CC=CC=C1,CC(C)(OO)c1ccccc1
+NC#N,N#CN
+NC(=N)NC#N,N#CNC(=N)N
+OC1=NC(O)=NC(O)=N1,Oc1nc(O)nc(O)n1
+OC1CCCCC1,OC1CCCCC1
+OC1CCCC1,OC1CCCC1
+CC(C1CC1)C(O)(CN2C=NC=N2)C3=CC=C(Cl)C=C3,CC(C1CC1)C(O)(Cn1cncn1)c1ccc(Cl)cc1
+CC(=O)O[C@@]1(CC[C@H]2[C@@H]3C=C(Cl)C4=CC(=O)[C@@H]5C[C@@H]5[C@]4(C)[C@H]3CC[C@]12C)C(C)=O,CC(=O)O[C@]1(C(C)=O)CC[C@H]2[C@@H]3C=C(Cl)C4=CC(=O)[C@@H]5C[C@@H]5[C@]4(C)[C@H]3CC[C@@]21C
+OC[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO,OC[C@@H](O)[C@@H](O)[C@H](O)[C@@H](O)CO
+OC[C@@H](O)[C@@H](O)[C@H](O)[C@H](O)CO,OC[C@@H](O)[C@@H](O)[C@H](O)[C@H](O)CO
+O[C@@H]([C@H](O)CO)[C@@H](O)C=O,O=C[C@H](O)[C@@H](O)[C@H](O)CO
+CCCCCCCCCCO[C@@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@H]1O,CCCCCCCCCCO[C@@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@H]1O
+BrC(Br)C#N,N#CC(Br)Br
+ClC1=C(Cl)C(=O)C2=C(C=CC=C2)C1=O,O=C1C(Cl)=C(Cl)C(=O)c2ccccc21
+OC(=O)C(Cl)Cl,O=C(O)C(Cl)Cl
+COP(=O)(OC)O/C(/C)=C/C(=O)N(C)C,COP(=O)(OC)O/C(C)=C/C(=O)N(C)C
+CCOC(=O)CC(=O)OCC,CCOC(=O)CC(=O)OCC
+COCCOCCOC,COCCOCCOC
+COC1=C(OC)C=C(C=C1)C(=CC(=O)N2CCOCC2)C3=CC=C(Cl)C=C3,COc1ccc(C(=CC(=O)N2CCOCC2)c2ccc(Cl)cc2)cc1OC
+COP(=O)OC,CO[PH](=O)OC
+CNC,CNC
+CCCCCCCCOC(=O)CCCCCCCCC(=O)OCCCCCCCC,CCCCCCCCOC(=O)CCCCCCCCC(=O)OCCCCCCCC
+CN(C)C(=O)C(C1=CC=CC=C1)C2=CC=CC=C2,CN(C)C(=O)C(c1ccccc1)c1ccccc1
+[Hg](C1=CC=CC=C1)C2=CC=CC=C2,c1ccc([Hg]c2ccccc2)cc1
+C[Si](C)(O[Si](C)(C)C=C)C=C,C=C[Si](C)(C)O[Si](C)(C)C=C
+CC(C)[C@@H]1CC[C@@H](C)C[C@H]1O,CC(C)[C@@H]1CC[C@@H](C)C[C@H]1O
+CCCCCCCCCCCC(O)=O,CCCCCCCCCCCC(=O)O
+O=C(OCC1CCC2OC2C1)C3CCC4OC4C3,O=C(OCC1CCC2OC2C1)C1CCC2OC2C1
+CN([C@H]1CC[C@@]2(CCCO2)C[C@@H]1N3CCCC3)C(=O)CC4=CC=CC5=C4C=CO5,CN(C(=O)Cc1cccc2occc12)[C@H]1CC[C@@]2(CCCO2)C[C@@H]1N1CCCC1
+CCCN(CCC)C(=O)SCC,CCCN(CCC)C(=O)SCC
+CCOP(=S)(OCC)SCSP(=S)(OCC)OCC,CCOP(=S)(OCC)SCSP(=S)(OCC)OCC
+CCCSP(=O)(OCC)SCCC,CCCSP(=O)(OCC)SCCC
+CCOC(=O)C1=CC=C(C)C=C1,CCOC(=O)c1ccc(C)cc1
+CCOC(=O)C1=NOC(C1)(C2=CC=CC=C2)C3=CC=CC=C3,CCOC(=O)C1=NOC(c2ccccc2)(c2ccccc2)C1
+CCCC(=O)OCC,CCCC(=O)OCC
+CCOC(=O)C1OC1(C)C2=CC=CC=C2,CCOC(=O)C1OC1(C)c1ccccc1
+[Na+].[Fe+3].[O-]C(=O)CN(CCN(CC([O-])=O)CC([O-])=O)CC([O-])=O,O=C([O-])CN(CCN(CC(=O)[O-])CC(=O)[O-])CC(=O)[O-]
+CC1(OC(=O)N(NC2=CC=CC=C2)C1=O)C3=CC=C(OC4=CC=CC=C4)C=C3,CC1(c2ccc(Oc3ccccc3)cc2)OC(=O)N(Nc2ccccc2)C1=O
+CCOC(=O)C(C)OC1=CC=C(OC2=NC3=C(O2)C=C(Cl)C=C3)C=C1,CCOC(=O)C(C)Oc1ccc(Oc2nc3ccc(Cl)cc3o2)cc1
+CCOC(=O)[C@@H](C)OC1=CC=C(OC2=NC3=C(O2)C=C(Cl)C=C3)C=C1,CCOC(=O)[C@@H](C)Oc1ccc(Oc2nc3ccc(Cl)cc3o2)cc1
+CC1(C)C(C(=O)OC(C#N)C2=CC=CC(OC3=CC=CC=C3)=C2)C1(C)C,CC1(C)C(C(=O)OC(C#N)c2cccc(Oc3ccccc3)c2)C1(C)C
+CN1N=C(C)C(C=NOCC2=CC=C(C=C2)C(=O)OC(C)(C)C)=C1OC3=CC=CC=C3,Cc1nn(C)c(Oc2ccccc2)c1C=NOCc1ccc(C(=O)OC(C)(C)C)cc1
+[Na+].COC1=NN(C(=O)[N-]S(=O)(=O)C2=C(OC(F)(F)F)C=CC=C2)C(=O)N1C,COc1nn(C(=O)[N-]S(=O)(=O)c2ccccc2OC(F)(F)F)c(=O)n1C
+Cl.CNC(=O)OC1=CC=CC(=C1)N=CN(C)C,CNC(=O)Oc1cccc(N=CN(C)C)c1
+O=CCCCC=O,O=CCCCC=O
+C[Si](C)(C)N[Si](C)(C)C,C[Si](C)(C)N[Si](C)(C)C
+ClC1=CC(Cl)=C(C=C1)C(CN2C=CN=C2)OCC=C,C=CCOC(Cn1ccnc1)c1ccc(Cl)cc1Cl
+COCC1=CN=C(C2=NC(C)(C(C)C)C(=O)N2)C(=C1)C(O)=O,COCc1cnc(C2=NC(C)(C(C)C)C(=O)N2)c(C(=O)O)c1
+CC(C)C1CCC(CC2=CC=C(Cl)C=C2)C1(O)CN3C=NC=N3,CC(C)C1CCC(Cc2ccc(Cl)cc2)C1(O)Cn1cncn1
+CC(C)NC(=O)N1CC(=O)N(C1=O)C2=CC(Cl)=CC(Cl)=C2,CC(C)NC(=O)N1CC(=O)N(c2cc(Cl)cc(Cl)c2)C1=O
+CC(C)CCOC(=O)C1=CC=CC=C1,CC(C)CCOC(=O)c1ccccc1
+CC1=CC(=O)CC(C)(C)C1,CC1=CC(=O)CC(C)(C)C1
+CCCN(CCC)C1=C(C=C(C=C1[N+]([O-])=O)C(C)C)[N+]([O-])=O,CCCN(CCC)c1c([N+](=O)[O-])cc(C(C)C)cc1[N+](=O)[O-]
+CS(=O)(=O)C1=C(C=CC(=C1)C(F)(F)F)C(=O)C2=C(ON=C2)C3CC3,CS(=O)(=O)c1cc(C(F)(F)F)ccc1C(=O)c1cnoc1C1CC1
+OC[C@H](O)[C@H](O)[C@@H](O)[C@H](O)CO,OC[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)CO
+CCCCCCCCCCCCOC(=O)C1=CC(O)=C(O)C(O)=C1,CCCCCCCCCCCCOC(=O)c1cc(O)c(O)c(O)c1
+CC(=C)C1CCC(C)=CC1,C=C(C)C1CC=C(C)CC1
+COC1=CC2=C(NC=C2CCNC(C)=O)C=C1,COc1ccc2[nH]cc(CCNC(C)=O)c2c1
+CC(C)C1CCC(C)CC1O,CC1CCC(C(C)C)C(O)C1
+[Na+].CNC([S-])=S,CNC(=S)[S-]
+SC(=S)NC.[Na].O,CNC(=S)S
+COC(=O)C1=C(C)C=CC=C1,COC(=O)c1ccccc1C
+CCCCC/C=C\C/C=C\CCCCCCCC(=O)OC,CCCCC/C=C\C/C=C\CCCCCCCC(=O)OC
+N#CSCSC#N,N#CSCSC#N
+COC(=O)C=C(C)OP(=O)(OC)OC,COC(=O)C=C(C)OP(=O)(OC)OC
+OC(=O)C1=C(C=CC=C1)C(=O)OCC2=CC=CC=C2,O=C(O)c1ccccc1C(=O)OCc1ccccc1
+CCCCC(CN1C=NC=N1)(C#N)C2=CC=C(Cl)C=C2,CCCCC(C#N)(Cn1cncn1)c1ccc(Cl)cc1
+O=C1N(SC2CCCCC2)C(=O)C3=C1C=CC=C3,O=C1c2ccccc2C(=O)N1SC1CCCCC1
+CCCCNS(=O)(=O)C1=CC=C(C)C=C1,CCCCNS(=O)(=O)c1ccc(C)cc1
+CCNC1=CC(C)=CC=C1,CCNc1cccc(C)c1
+O=NN1CCCC1,O=NN1CCCC1
+[Cl-].CCCC[N+](C)(CCCC)CCCC,CCCC[N+](C)(CCCC)CCCC
+CN(C)C(C)=O,CC(=O)N(C)C
+CN(C)C1=CC=CC=C1,CN(C)c1ccccc1
+NC(=O)C1=CC=CN=C1,NC(=O)c1cccnc1
+CO[C@@H]1[C@@H](CC[C@]2(CO2)[C@H]1[C@@]3(C)O[C@@H]3CC=C(C)C)OC(=O)NC(=O)CCl,CO[C@@H]1[C@H](OC(=O)NC(=O)CCl)CC[C@]2(CO2)[C@H]1[C@@]1(C)O[C@@H]1CC=C(C)C
+C[Si](C)(C)O[Si](C)(C)O[Si](C)(C)C,C[Si](C)(C)O[Si](C)(C)O[Si](C)(C)C
+Cl.COC1=CC=C(N)C=C1,COc1ccc(N)cc1
+CC1=CC=C(O)C=C1,Cc1ccc(O)cc1
+CCCCN(CC)C(=O)SCCC,CCCCN(CC)C(=O)SCCC
+CC1=C2N=C(C3=CC=CC=C3Cl)C4=C(NC2=NN1)C=CC(=C4)[N+]([O-])=O,Cc1[nH]nc2c1N=C(c1ccccc1Cl)c1cc([N+](=O)[O-])ccc1N2
+[Na+].FC1=CC=C(C(=O)[N-]S(=O)(=O)/C=C/C2=CC=CC=C2)C(Cl)=C1,O=C([N-]S(=O)(=O)/C=C/c1ccccc1)c1ccc(F)cc1Cl
+[K+].CCCCC(CC)C([O-])=O,CCCCC(CC)C(=O)[O-]
+C[C@]12CC(=O)[C@H]3[C@@H](CCC4=CC(=O)C=C[C@]34C)[C@@H]1CC[C@]2(O)C(=O)CO,C[C@]12C=CC(=O)C=C1CC[C@@H]1[C@@H]2C(=O)C[C@@]2(C)[C@H]1CC[C@]2(O)C(=O)CO
+CCCN(CCC)C1=C(C(N)=C(C=C1[N+]([O-])=O)C(F)(F)F)[N+]([O-])=O,CCCN(CCC)c1c([N+](=O)[O-])cc(C(F)(F)F)c(N)c1[N+](=O)[O-]
+CC(C)N(C(=O)CCl)C1=CC=CC=C1,CC(C)N(C(=O)CCl)c1ccccc1
+[Na+].CCCOC1=NN(C(=O)[N-]S(=O)(=O)C2=C(C=CC=C2)C(=O)OC)C(=O)N1C,CCCOc1nn(C(=O)[N-]S(=O)(=O)c2ccccc2C(=O)OC)c(=O)n1C
+Cl.CC1=NC=C(CO)C(CO)=C1O,Cc1ncc(CO)c(CO)c1O
+O=C1NS(=O)(=O)C2=C1C=CC=C2,O=C1NS(=O)(=O)c2ccccc21
+C=CCC1=CC=C2OCOC2=C1,C=CCc1ccc2c(c1)OCO2
+CCNC1=NC(NCC)=NC(Cl)=N1,CCNc1nc(Cl)nc(NCC)n1
+[Na+].[O-]C1=CC=C(C=C1)[N+]([O-])=O,O=[N+]([O-])c1ccc([O-])cc1
+[Na+].CCCCCCCCCOS([O-])(=O)=O,CCCCCCCCCOS(=O)(=O)[O-]
+O.[Na+].O=C1[N-]S(=O)(=O)C2=CC=CC=C12,O=C1[N-]S(=O)(=O)c2ccccc21
+CCCCCCCC/C=C\CCCCCCCC(=O)OC[C@@H](O)[C@H]1OC[C@H](O)[C@H]1O,CCCCCCCC/C=C\CCCCCCCC(=O)OC[C@@H](O)[C@H]1OC[C@H](O)[C@H]1O
+CC1=CC(C)=C(C2=C(OC(=O)CC(C)(C)C)C3(CCCC3)OC2=O)C(C)=C1,Cc1cc(C)c(C2=C(OC(=O)CC(C)(C)C)C3(CCCC3)OC2=O)c(C)c1
+C1OC1C2=CC=CC=C2,c1ccc(C2CO2)cc1
+OC(=O)C1=CC(=CC=C1O)N=NC2=CC=C(C=C2)S(=O)(=O)NC3=NC=CC=C3,O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O
+CC1=NN(C(=O)N1C(F)F)C2=CC(NS(C)(=O)=O)=C(Cl)C=C2Cl,Cc1nn(-c2cc(NS(C)(=O)=O)c(Cl)cc2Cl)c(=O)n1C(F)F
+CCOP(=S)(OC(C)C)OC1=CN=C(N=C1)C(C)(C)C,CCOP(=S)(Oc1cnc(C(C)(C)C)nc1)OC(C)C
+CC1=C(F)C(F)=C(COC(=O)C2C(/C=C(\Cl)/C(F)(F)F)C2(C)C)C(F)=C1F,Cc1c(F)c(F)c(COC(=O)C2C(/C=C(\Cl)C(F)(F)F)C2(C)C)c(F)c1F
+OC(=O)CS,O=C(O)CS
+NC(N)=S,NC(N)=S
+CC(C)N(C(C)C)C(=O)SCC(Cl)=C(Cl)Cl,CC(C)N(C(=O)SCC(Cl)=C(Cl)Cl)C(C)C
+CCCCOCCOC(=O)COC1=C(Cl)C=C(Cl)C(Cl)=N1,CCCCOCCOC(=O)COc1nc(Cl)c(Cl)cc1Cl
+CCO[Si](C)(OCC)OCC,CCO[Si](C)(OCC)OCC
+COCCOCCOCCOC,COCCOCCOCCOC
+OC(=O)C1=CC=C2C(=O)OC(=O)C2=C1,O=C(O)c1ccc2c(c1)C(=O)OC2=O
+CCCCC(CC)COC(=O)C1=CC=C(C(=O)OCC(CC)CCCC)C(=C1)C(=O)OCC(CC)CCCC,CCCCC(CC)COC(=O)c1ccc(C(=O)OCC(CC)CCCC)c(C(=O)OCC(CC)CCCC)c1
+CCCC(CCC)C(O)=O,CCCC(CCC)C(=O)O
+CCCSC(=O)N(CCC)CCC,CCCSC(=O)N(CCC)CCC
+FC1(F)/C(=C\C(=O)N2CCC(N3CCCCC3)CC2)/C=4C(N(CC1)C(=O)C5=CC=C(NC(=O)C6=C(OC=C6)C)C=C5)=CC=CC4.FC1(F)/C(=C\C(=O)N2CCC(N3CCCCC3)CC2)/C=4C(N(CC1)C(=O)C5=CC=C(NC(=O)C6=C(OC=C6)C)C=C5)=CC=CC4.OC(=O)/C=C/C(O)=O,Cc1occc1C(=O)Nc1ccc(C(=O)N2CCC(F)(F)/C(=C\C(=O)N3CCC(N4CCCCC4)CC3)c3ccccc32)cc1
diff --git a/data/test.csv.xz b/data/test.csv.xz
new file mode 100644
index 0000000..ed611ae
Binary files /dev/null and b/data/test.csv.xz differ
diff --git a/data/train.csv.xz b/data/train.csv.xz
new file mode 100644
index 0000000..6cd6ebb
Binary files /dev/null and b/data/train.csv.xz differ
diff --git a/data/validation.csv.xz b/data/validation.csv.xz
new file mode 100644
index 0000000..704bec8
Binary files /dev/null and b/data/validation.csv.xz differ
diff --git a/feature_importance.png b/feature_importance.png
new file mode 100644
index 0000000..198c590
Binary files /dev/null and b/feature_importance.png differ
diff --git a/ranking.png b/ranking.png
new file mode 100644
index 0000000..89a0e2e
Binary files /dev/null and b/ranking.png differ
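
Usage note on the files added in this diff: the README says the organizers' structures were curated with an RDKit pipeline, and `data/smiles_org+fixed.csv` stores each original/curated pair in the columns `SMILES_org` and `SMILES_fixed`. The sketch below is not that pipeline. It is a standard-library illustration, under stated assumptions, of how the pairs could be inspected and how the `.csv.xz` files could be opened. The helper names `load_rows` and `summarize_curation` are hypothetical, and the column layout of the compressed descriptor files is not shown in the diff, so `load_rows` makes no assumption about it.

```python
import csv
import io
import lzma


def load_rows(path):
    """Read a CSV file into a list of dicts, handling .xz compression.

    Hypothetical helper: works for data/smiles_org+fixed.csv and, in
    principle, for the .csv.xz files, whose columns are not assumed here.
    """
    opener = lzma.open if str(path).endswith(".xz") else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


def summarize_curation(rows):
    """Count string-level differences between original and curated SMILES.

    A differing string does not by itself mean a different molecule
    (the curated column often just uses another notation); only the
    fragment count hints at a structural change such as a removed
    counter-ion.
    """
    changed = 0
    fragments_dropped = 0
    for row in rows:
        org, fixed = row["SMILES_org"], row["SMILES_fixed"]
        if org != fixed:
            changed += 1
        # In SMILES, '.' separates disconnected fragments (salts, solvates).
        if org.count(".") > fixed.count("."):
            fragments_dropped += 1
    return {
        "total": len(rows),
        "changed": changed,
        "fragments_dropped": fragments_dropped,
    }


# Three rows copied verbatim from data/smiles_org+fixed.csv in this diff.
SAMPLE = """SMILES_org,SMILES_fixed
Cl/C=C/Cl,Cl/C=C/Cl
[Cl-].C[N+]1(C)CCCCC1,C[N+]1(C)CCCCC1
CN(C)N,CN(C)N
"""

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = summarize_curation(sample_rows)
print(summary)
```

With a checkout of the repository, `summarize_curation(load_rows("data/smiles_org+fixed.csv"))` would give the same counts for all 200 pairs. Deciding whether two differing strings denote the same structure needs a cheminformatics toolkit such as RDKit, which this sketch deliberately avoids.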