v2.0.3

CarragherLab · Apr 5, 2024 · 41a4d40
1 parent 288f84e
commit 41a4d40
Showing 200 changed files with 3,416 additions and 21,922 deletions.
diff --git a/.github/workflows/documentation.yaml b/.github/workflows/documentation.yaml
@@ -0,0 +1,36 @@
+name: BuildDocs
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+  workflow_dispatch:
+
+permissions:
+    contents: write
+jobs:
+  docs:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.10' 
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install .
+          pip install sphinx==7.2.6 sphinx_rtd_theme
+      - name: Sphinx build
+        run: |
+          cd docsource
+          make html
+          cd ..
+      - name: Deploy to GitHub pages 🚀
+        uses: JamesIves/github-pages-deploy-action@v4.5.0
+        with:
+          clean: false
+          branch: gh-pages
+          folder: docsource/_build/html/
+
+
diff --git a/.gitignore b/.gitignore
@@ -154,3 +154,6 @@ dmypy.json
 # Cython debug symbols
 cython_debug/
 *dev*
+
+# scratch files
+scratch.ipynb
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,82 @@
 
 All notable changes to this project will be documented in this file under headings Added, Changed, and Fixed
 
+## [2.0.3] - 2024-03-26
+
+### Fixed
+- Exclude pptx files from black formatting
+- Dataset constructor raises an error if features passed as metadata are not of type list
+- merged dataset deletion with passed datasets
+- test using dataset_groupby, corrected to groupby_datasets
+
+## [2.0.2] - 2024-03-06
+
+### Fixed
+- RandomForest regressor no longer uses auto max_features hyperparameter, making it compatible with scikit-learn 1.1 onwards.
+
+## [2.0.1] - 2024-03-01
+
+### Changed
+- The way versioning works internally to Phenonaut.
+- Updated pyproject.toml to include powerpoint templates
+
+## [1.5.1] - 2023-10-19
+
+### Added
+- New class of Error NotEnoughRowsError added to better flag runtime errors.
+- Added checks to mp_value_score which ensure grouped dataframe groups have at least 3 rows required for calculations.
+- In mp_value_score, groups with < 3 rows may be ignored by calling the function with raise_error_for_low_count_groups = False, in which case np.nan values will be returned for the group.
+
+### Changed
+- Improved tests for mp_value_score, checking for correct behaviour within small groups with <3 rows.
+
+
+## [1.5.0] - 2023-09-27
+
+### Changed
+- Refactored package to use pyproject.toml for install/build
+- Updated classifier hyperparameters: `max_features='auto'` was deprecated in scikit-learn 1.1 and will be removed in 1.3, so it is now explicitly set to `max_features='sqrt'`.
+- Removed progressbar2 in favour of tqdm throughout.
+
+### Fixed
+- predict.profile updated to work with newer pandas
+- Phenonaut.merge_datasets now honours the remove_merged flag
+
+
+## [1.4.3] - 2023-08-16
+### Added
+- random_state argument for percent replicating, allowing passing of a np.random.Generator, or an int to seed a new Generator
+
+
+## [1.4.1] - 2023-07-28
+### Fixed
+- bug in Phenonaut.merge_datasets
+- bug in data.Datasets.groupby
+
+## [1.4.0] - 2023-07-18
+### Added
+- merge_datasets method added to Phenonaut objects
+- mp_value_score metric added to phenonaut.metrics.performance (doi:10.1177/1087057112469257)
+- added __repr__ to Phenonaut objects
+
+
+## [1.3.8] - 2023-06-22
+### Changed
+- Removed py3.10 style Unions, favouring the old style typing.Union
+
+## [1.3.7] - 2023-06-22
+### Fixed
+- bug in the generation of Scree plots from fitted PCA transformers 
+
+
+## [1.3.6] - 2023-06-01
+### Added
+- groupby function to dataset, allowing splitting of one dataset into many
+
+## [1.3.5] - 2023-03-29
+### Added
+- orient argument to write_boxplot_to_file allowing horizontal or vertical plotting.
+
 ## [1.3.4] - 2023-03-27
 
 ### Added
@@ -138,10 +214,9 @@ All notable changes to this project will be documented in this file under headin
 
 ### Changed
 - Removed ambiguity between GenericTransform and Transform base classes by removing GenericTransform. Now objects which are callable or have fit/transform/fit_transform may inherit from the transform class and be easily used in Phenonaut. Below we see the easy way scikit-learn's PCA can be wrapped:
-  
+
         t_pca=Transformer(PCA, constructor_kwargs={'n_components':2})
         t_pca.fit(phe.ds, groupby="BARCODE")
         t_pca.transform(phe.ds)
 - Cleaned up tests
 - RobustMAD now moved to inherit from Transformer, allowing the groupby argument, and for normalisations to happen on a per-plate basis
-
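The 1.5.1 changelog entry above describes a minimum group size of 3 rows for mp_value_score, with a `raise_error_for_low_count_groups` flag that switches between raising and returning NaN. The following is a minimal sketch of that control flow only, not Phenonaut's implementation: `score_groups` is a hypothetical function, `NotEnoughRowsError` is redefined locally as a stand-in, the per-group statistic is a placeholder mean rather than the mp-value calculation, and `math.nan` is used in place of `np.nan` to keep the sketch dependency-free. That the error class is the one raised for small groups is also an assumption.

```python
import math
from typing import Dict, Sequence

MIN_ROWS = 3  # minimum rows per group, as stated in the changelog entry


class NotEnoughRowsError(Exception):
    """Local stand-in for the error class named in the changelog."""


def score_groups(
    groups: Dict[str, Sequence[float]],
    raise_error_for_low_count_groups: bool = True,
) -> Dict[str, float]:
    """Hypothetical sketch: score each group, handling groups that are too small."""
    results: Dict[str, float] = {}
    for name, rows in groups.items():
        if len(rows) < MIN_ROWS:
            if raise_error_for_low_count_groups:
                raise NotEnoughRowsError(
                    f"group {name!r} has {len(rows)} rows, need at least {MIN_ROWS}"
                )
            # Flag is False: record NaN for this group and carry on.
            results[name] = math.nan
            continue
        # Placeholder statistic (the mean), NOT the real mp-value calculation.
        results[name] = sum(rows) / len(rows)
    return results


groups = {"treated": [1.0, 2.0, 3.0], "control": [1.0]}
print(score_groups(groups, raise_error_for_low_count_groups=False))
```

With the flag left at its default, the same call raises on the undersized `control` group instead of returning NaN for it.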
diff --git a/LICENSE b/LICENSE
@@ -199,4 +199,4 @@
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
-   limitations under the License.
+   limitations under the License.
diff --git a/NOTICE b/NOTICE
@@ -1,3 +1,2 @@
-Copyright © The University of Edinburgh, 2022.
+Copyright © The University of Edinburgh, 2024.
 Development has been supported by GSK.
-
diff --git a/README.md b/README.md
@@ -1,39 +1,34 @@
 # Phenonaut
 
-A toolkit for multi-omic phenotypic space exploration. 
+A toolkit for multi-omic phenotypic space exploration.
+
 
 
 ## Description
 <img style="float: right;" src="phenonaut.png">
 
-Phenonaut is a framework for applying workflows to multi-omics data. Originally targeting high-content imaging and the exploration of phenotypic space, with different visualisations and metrics, Phenonaut allows now operates in a data agnostic manner, allowing users to describe their data (potentially multi-view/multi-omics) and apply a series of generic or specialised data-centric transforms and measures.  
+Phenonaut is a framework for applying workflows to multi-omics data. Originally targeting high-content imaging and the exploration of phenotypic space, with different visualisations and metrics, Phenonaut now operates in a data-agnostic manner, allowing users to describe their data (potentially multi-view/multi-omics) and apply a series of generic or specialised data-centric transforms and measures.
+
 
 Phenonaut operates in 2 modes:
 
 - As a Python package, importable and callable within custom scripts.
 - Operating on a workflow defined in either YAML or JSON, allowing complex chains of Phenonaut instructions to be integrated into existing workflows and pipelines. When built as a package and installed, workflows can be executed with:
 ```python -m phenonaut workflow.yml``` .
 
+After installing Phenonaut into a kernel, don't forget to register it with Jupyter:
+```python -m ipykernel install --user --name=<ENVIRONMENT NAME>```
+
 
 ## Structure
-Datasets are read into the dataset class, aided by a yaml file describing the underlying data (see config/ for example yaml data definition files). Pandas dataframes are created representing the data (a Phenonaut object may hold multiple dataset objects), along with currently three additional pieces of data. 
+Datasets are read into the dataset class, aided by a yaml file describing the underlying data (see config/ for example yaml data definition files). Pandas dataframes are created representing the data (a Phenonaut object may hold multiple dataset objects), along with two additional pieces of data.
 1) A features list, accessible with .features property of a dataframe. Initially defined by the data definition workflow.
 2) perturbation_column, optional column which gives a unique ID to the treatment performed on the well/vial/data.
 3) Metadata, optional dictionary containing metadata for the dataset.
 
-## Documentation
-[Start here](https://carragherlab.github.io/phenonaut/)
-- [User guide](https://carragherlab.github.io/phenonaut/userguide.html)
-- [API documentation](https://carragherlab.github.io/phenonaut/phenonaut.html)
-- [Publication examples](https://carragherlab.github.io/phenonaut/publication_examples.html)
-- [Workflow mode](https://carragherlab.github.io/phenonaut/workflow_guide.html)
-
-Install with
-```console
-pip install phenonaut
-```
+Example usage in Python programs, and in workflow/scripted modes coming soon.
 
 
-Copyright © The University of Edinburgh, 2023.
+Copyright © The University of Edinburgh, 2024.
 
 Development has been supported by GSK.
diff --git a/config/aiml.yml b/config/aiml.yml
@@ -0,0 +1,2 @@
+features_prefix:
+ - feat_
diff --git a/config/columbus.yml b/config/columbus.yml
@@ -0,0 +1,3 @@
+features_prefix:
+ - Cell Selected
+ - WGA Spots
diff --git a/config/generic.yml b/config/generic.yml
@@ -0,0 +1,50 @@
+# features
+#   String (or list) of features present in the SDF. If string, then it is
+#   tokenised on whitespace, so use a list of strings if column headers
+#   contain spaces. Overrides features_prefix if given. (Defaults to None)
+#   Example: |features: "feature1 feature2"
+#            or
+#            |features:
+#            | - feature1
+#            | - feature2
+#            | - feature with space 3
+
+# features_prefix
+#   Prefix for features discoverable in the file. Overridden if features
+#   string (or list) is given. Can be a list, allowing capturing of multiple
+#   feature prefixes. (Defaults to 'feat_')
+
+# index_col
+#   Specifies the column to use as an index.  Defaults to None, causing a new
+#   index to be created. Can be a string if just one column is specified.  If
+#   multiple columns are used in a multi-index, then must be a list of ints.
+#   Example, use first and second column as an index: | index_col: [0,1]
+#   Example, use first column as an index:            | index_col: 0
+#   Example, use 'treatment' column as an index       | index_col: treatment
+
+# drop_nan
+#   By default, drop_nan is set to true; this can be overridden if the user
+#   does not want to remove rows containing NaN values. (Use YAML style true/
+#   false - all lowercase. Defaults to True)
+#   Example, not dropping NaNs: drop_nan: false
+
+# csv_separator
+#   Separator present in csv file. (Defaults to ',')
+# Example - |csv_separator: \t
+# Example - |csv_separator: " "
+# Example - |csv_separator: ":"
+
+# skip_row_numbers
+#   List of row indexes to skip when reading in the file (zero indexed). For
+#   example, nanostring data typically has a blank third line, so:
+#   Example - skip 3rd line: |skip_row_numbers: 2
+#                             or
+#                            |skip_row_numbers: [2]
+#   Example - skip multiple: |skip_row_numbers: [0,2,4,6,8]
+
+# header_row_number
+#   Row number (zero indexed) containing column headers.  Can be a list if
+#   multiple rows make up the header.  Nanostring data typically has 2 rows
+#   acting as a header, so:
+#   Example - use first 2 rows: | header_row_number: [0,1]
+#   (Defaults to 0 - the first row)
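The comments in config/generic.yml describe how `features` and `features_prefix` interact: an explicit `features` entry wins, a string is tokenised on whitespace, and prefixes may be given as a list. The sketch below illustrates those rules in plain Python. It is an assumption-laden illustration, not Phenonaut's loader: `resolve_features` and the sample column names are hypothetical.

```python
from typing import Iterable, List, Optional, Union


def resolve_features(
    columns: Iterable[str],
    features: Optional[Union[str, List[str]]] = None,
    features_prefix: Union[str, List[str]] = "feat_",
) -> List[str]:
    """Hypothetical helper: choose feature columns using the rules in generic.yml."""
    if features is not None:
        # An explicit features entry overrides features_prefix.
        # A string is tokenised on whitespace; a list is used as given, which
        # is why column names containing spaces must be supplied as a list.
        return features.split() if isinstance(features, str) else list(features)
    if isinstance(features_prefix, str):
        prefixes = (features_prefix,)
    else:
        prefixes = tuple(features_prefix)
    # str.startswith accepts a tuple of prefixes; file column order is kept.
    return [c for c in columns if c.startswith(prefixes)]


# Made-up column names for illustration only.
columns = ["well", "feat_area", "feat_intensity", "WGA Spots count", "Cell Selected area"]
print(resolve_features(columns))
print(resolve_features(columns, features_prefix=["Cell Selected", "WGA Spots"]))
print(resolve_features(columns, features="feat_area feat_intensity"))
```

The second call mirrors the two prefixes in config/columbus.yml; the third shows why a whitespace-tokenised string cannot name a column such as "WGA Spots count".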
diff --git a/config/nanostring.yml b/config/nanostring.yml
@@ -0,0 +1,3 @@
+features: "ABCC3 ABCC8 ABL1 ADAMTS16 ADRA2A AGO4 AGT AK1 AKT1 AKT2 ALDH1L1 AMBRA1 AMIGO2 ANAPC15 ANXA1 APC APEX1 APOE ARC ARHGAP24 ARID1A ASB2 ASH2L ASPH ATF3 ATG14 ATG3 ATG5 ATG7 ATG9A ATM ATP6V0E1 ATP6V1A ATR AXL B3GNT5 BAD BAG3 BAG4 BAK1 BARD1 BAX BBC3 BCAS1 BCL10 BCL2 BCL2A1 BCL2L1 BCL2L11 BCL2L2 BDNF BECN1 BID BIK BIN1 BIRC2 BIRC3 BIRC5 BLK BLM BLNK BMI1 BNIP3 BNIP3L BOK BOLA2 BRAF BRCA1 BRD2 BRD3 BRD4 BTK C1QA C1QB C1QC C3 C3AR1 C4A C5AR1 C6 CABLES1 CALCOCO2 CALR CAMK4 CASP1 CASP2 CASP3 CASP4 CASP6 CASP7 CASP8 CASP9 CASS4 CCL2 CCL3 CCL4 CCL5 CCL7 CCNG2 CCNI CCR2 CCR5 CD109 CD14 CD163 CD19 CD209 CD24 CD244 CD300LF CD33 CD36 CD3D CD3E CD3G CD40 CD44 CD47 CD6 CD68 CD69 CD70 CD72 CD74 CD83 CD84 CD86 CD8A CD8B CDC25A CDC7 CDK20 CDKN1A CDKN1C CEACAM3 CFLAR CH25H CHEK1 CHEK2 CHN2 CHST8 CHUK CIDEA CIDEB CKS1B CLCF1 CLDN5 CLEC7A CLIC4 CLN3 CLSTN1 CNN2 CNP CNTNAP2 COA5 COL6A3 COTL1 COX5B CP CPA3 CREB1 CREBBP CREM CRIP1 CRYBA4 CSF1 CSF1R CSF2RB CSF3R CSK CST7 CTSE CTSF CTSS CTSW CX3CL1 CX3CR1 CXCL10 CXCL9 CYCS CYP27A1 CYP7B1 CYTIP DAB2 DAPK1 DDB2 DDX58 DICER1 DLG1 DLG4 DLX1 DLX2 DNA2 DNMT1 DNMT3A DNMT3B DOCK1 DOCK2 DOT1L DST DUOXA1 DUSP7 E2F1 EED EEF2K EGFR EGR1 EHMT2 EIF1 EMCN EMP1 ENPP6 ENTPD2 EOMES EP300 EPCAM EPG5 EPSTI1 ERBB3 ERCC2 ESAM ETS2 EXO1 EZH1 EZH2 F3 FA2H FABP5 FADD FANCC FANCD2 FANCG FAS FASLG FBLN5 FCAR FCER1G FCGR1A FCGR2B FCGR3A FCRLA FCRLB FDXR FEN1 FGD2 FGF13 FGL2 FKBP5 FLT1 FOS FOXP3 FPR1 FSCN1 FYN GADD45A GADD45G GAL3ST1 GBA GBP2 GCLC GDPD2 GJA1 GJB1 GNA15 GNLY GPR183 GPR34 GPR62 GPR84 GRAP GRIA1 GRIA2 GRIA4 GRIN2A GRIN2B GRM2 GRM3 GRN GSN GSTM1 GZMA GZMB H2AFX HAT1 HCAR2 HDAC1 HDAC2 HDAC4 HDAC6 HDC HELLS HIF1A HILPDA HIRA HIST1H1D HLA-E HMGB1 HMOX1 HOMER1 HPGDS HPRT1 HPS4 HRK HSD11B1 HSPB1 HUS1 ICAM2 IFI30 IFIH1 IFITM2 IFITM3 IFNAR1 IFNAR2 IGF1 IGF1R IGF2R IGSF10 IGSF6 IKBKB IKBKE IKBKG IL10RB IL15RA IL1A IL1B IL1R1 IL1R2 IL1RAP IL1RL2 IL1RN IL21R IL2RG IL3 IL3RA IL6R INPP5D IQSEC1 IRAK1 IRAK2 IRAK3 IRAK4 IRF1 IRF2 IRF3 IRF4 IRF6 IRF7 IRF8 ISLR2 
ITGA6 ITGA7 ITGAM ITGAV ITGAX ITGB5 JAG1 JAM2 JARID2 JUN KAT2A KAT2B KCND1 KCNJ10 KCNK13 KDM1A KDM1B KDM2A KDM2B KDM3A KDM3B KDM4A KDM4B KDM4C KDM4D KDM5A KDM5B KDM5C KDM5D KDM6A KIF2C KIR3DL1 KIT KLRB1 KLRD1 KLRK1 KMT2A KMT2C LACC1 LAG3 LAIR1 LAMP1 LAMP2 LCN2 LDHA LDLRAD3 LFNG LGMN LIG1 LILRB4 LINGO1 LMNA LMNB1 LOX LRG1 LRRC25 LRRC3 LSR LST1 LTA LTB LTBR LTC4S LY9 LYN MAFB MAFF MAG MAL MAN2B1 MAP1LC3A MAP2K1 MAP2K4 MAP3K1 MAP3K14 MAPK10 MAPK12 MAPK14 MAPT MARCO MAVS MB21D1 MBD2 MBD3 MCM2 MCM5 MCM6 MDC1 MDM2 MEF2C MERTK MFGE8 MGMT MMP12 MMP14 MOBP MOG MPEG1 MPG MR1 MRE11 MS4A1 MS4A2 MS4A4A MSH2 MSN MSR1 MVP MYC MYCT1 MYD88 MYORG MYRF NBN NCAPH NCF1 NCOR1 NCOR2 NCR1 NEFL NFE2L2 NFKB1 NFKB2 NFKBIA NFKBIE NGF NGFR NINJ2 NKG7 NLGN1 NLGN2 NLRP3 NOD1 NOSTRIN NPL NPNT NPTX1 NQO1 NRGN NRM NRP2 NTHL1 NWD1 OAS1 OGG1 OLFML3 OPALIN OPTN OSGIN1 OSMR P2RX7 P2RY12 PACSIN1 PADI2 PAK1 PARP1 PARP2 PCNA PDPN PECAM1 PEX14 PIK3CA PIK3CB PIK3CD PIK3CG PIK3R1 PIK3R2 PIK3R5 PILRA PILRB PINK1 PLA2G4A PLA2G5 PLCG2 PLD1 PLD2 PLEKHB1 PLEKHM1 PLLP PLP1 PLXDC2 PLXNB3 PMP22 PMS2 PNOC POLE PPFIA4 PPP3CA PPP3CB PPP3R1 PPP3R2 PRDX1 PRF1 PRKACA PRKACB PRKAR1A PRKAR2A PRKAR2B PRKCE PRKCQ PRKDC PROS1 PSEN2 PSMB8 PTEN PTGER3 PTGER4 PTGS2 PTMS PTPN6 PTPRC PTTG1 PTX3 PYCARD RAB6B RAB7A RAC1 RAC2 RAD1 RAD17 RAD50 RAD51 RAD51B RAD51C RAD9A RAG1 RALA RALB RAPGEF3 RB1CC1 RBFOX3 RELA RELB RELN RGL1 RHOA RIPK1 RIPK2 RNF8 RPA1 RPL28 RPL29 RPL36AL RPL9 RPS10 RPS2 RPS21 RPS3 RPS9 RRM2 RSAD2 RTN4RL1 S100A10 S100A12 S100B S1PR3 S1PR4 S1PR5 SALL1 SELL SERPINA3 SERPINE1 SERPINF1 SERPING1 SESN1 SESN2 SETD1A SETD1B SETD2 SETD7 SETDB1 SFTPD SH2D1A SHANK3 SIGLEC1 SIGLEC8 SIN3A SIRT1 SLAMF8 SLAMF9 SLC10A6 SLC17A6 SLC17A7 SLC1A3 SLC2A1 SLC2A5 SLC44A1 SLC6A1 SLCO2B1 SLFN11 SMARCA4 SMARCA5 SMARCD1 SMC1A SNCA SOCS3 SOD2 SOX10 SOX4 SOX9 SPHK1 SPIB SPINT1 SPP1 SQSTM1 SRGN SRXN1 ST3GAL6 ST8SIA6 STAT1 STEAP4 STMN1 STX18 SUMO1 SUV39H1 SUV39H2 SUZ12 SYK SYN2 SYP TARBP2 TBC1D4 TBR1 TBX21 TCIRG1 TCL1A TET1 TFG TGFA TGFB1 TGFBR1 TGM1 
TGM2 TIE1 TIMELESS TIMP1 TLE3 TLR2 TLR4 TLR7 TM4SF1 TMC7 TMCC3 TMEM100 TMEM119 TMEM144 TMEM173 TMEM204 TMEM206 TMEM37 TMEM64 TMEM88B TNF TNFRSF10B TNFRSF11B TNFRSF12A TNFRSF13C TNFRSF17 TNFRSF1A TNFRSF1B TNFRSF25 TNFRSF4 TNFSF10 TNFSF12 TNFSF13B TNFSF4 TNFSF8 TOP2A TOPBP1 TP53 TP53BP2 TP73 TPD52 TPSB2 TRADD TRAF1 TRAF2 TRAF3 TRAF6 TRAT1 TREM1 TREM2 TRIM47 TRPA1 TRPM4 TSPAN18 TTR TUBB3 TUBB4A TXNRD1 TYROBP UGT8 ULK1 UNG UTY VAMP7 VAV1 VEGFA VIM VPS4A VPS4B WAS WDR5 XCL1 XIAP XRCC6 ZBP1 ZNF367 AARS ASB7 CCDC127 CNOT10 CSNK2A2 FAM104A GUSB LARS MTO1 SUPT7L TADA2B TBP XPNPEP1 NEG_A NEG_B NEG_C NEG_D NEG_E NEG_F NEG_G NEG_H POS_A POS_B POS_C POS_D POS_E POS_F"
+perturbation_column_name: "Probe Name"
+transpose: true
diff --git a/docs/.nojekyll b/docs/.nojekyll