- `str.split`
- `str.upper`
- Indexing: `s[:10]`, `s[4]`, `s[6:]`, `s[-1]`, `s[:-5]`
- `range`
- `dict`
- `list`
- `zip`
- `enumerate`
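A minimal sketch pulling these basics together; the sequence string and the gene/count lists are made up for illustration:

```python
# String methods and slicing on a made-up sequence string
s = "ATGCGTACGTTAGCCATGAA"
print(s.upper())        # upper-case the whole string
print(s[:10])           # first 10 characters
print(s[4])             # single character at index 4
print(s[6:])            # everything from index 6 onward
print(s[-1])            # last character
print(s[:-5])           # everything except the last 5 characters

# str.split breaks a delimited record into a list
record = "gene1\t12.5\t7.2"
fields = record.split("\t")

# range, list, dict, zip, enumerate
genes = ["gene1", "gene2", "gene3"]
counts = [12, 7, 30]
gene_to_count = dict(zip(genes, counts))   # pair names with values
for i, gene in enumerate(genes):           # index and value together
    print(i, gene, gene_to_count[gene])
positions = list(range(len(genes)))        # [0, 1, 2]
```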
Scientific
- Finding accession numbers
- Using accession numbers to download data
- Reading in metadata files and getting …
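A hedged sketch of the metadata step: the file name `metadata.csv` and the column name `accession` are assumptions for illustration, not the course's actual files.

```python
import pandas as pd

# Hypothetical metadata file with one row per sample; the column names are assumptions
metadata = pd.read_csv("metadata.csv")

# Pull out the accession numbers so they can be fed to a download step
accessions = metadata["accession"].tolist()
for accession in accessions:
    print(accession)  # e.g. pass each accession to whatever download tool you use
```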
Coding
- `pd.read_csv`, `pd.read_table`
- `df.rename`
- `str.split` and `df.colname.str.split` (the `.str` accessor applied to a column)
- `df.T` / `df.transpose()`
- `df.values` / `df.as_matrix()` (`as_matrix` is deprecated in newer pandas; prefer `df.values` or `df.to_numpy()`)
- Flattening the dataframe for plotting: `df.values.flat`
- Summary statistics: `df.mean()` / `df.median()` / `df.mean(axis=1)`
- Subsetting using `df.loc` and `df.iloc`
- Basic plotting: `plt.plot`, `sns.distplot`, `plt.savefig`, `plt.scatter`
- Changing colors in plots
- Summary statistics: `df.std()`
- `np.log` operates elementwise on the whole matrix, and so does `np.sqrt` (see the sketch below)
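A sketch tying the pandas and plotting items above together. The file name `expression.csv`, the column rename, and the sample layout are assumptions; `df.as_matrix()` and `sns.distplot` are deprecated in newer pandas/seaborn, so the sketch uses `df.values` and `sns.histplot` instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical expression table: rows = genes, columns = samples
df = pd.read_csv("expression.csv", index_col=0)
df = df.rename(columns={"sample 1": "sample1"})   # tidy up a column name

# Transpose so rows = samples, columns = genes
samples_by_genes = df.T            # same as df.transpose()

# Underlying numpy matrix, and a flattened copy of every value for plotting
matrix = df.values                 # df.as_matrix() in very old pandas
flat = df.values.flatten()         # df.values.flat gives an iterator over the same values

# Summary statistics
print(df.mean())                   # per-column means
print(df.median())
print(df.mean(axis=1))             # per-row means
print(df.std())

# np.log and np.sqrt operate elementwise on the whole matrix
logged = np.log(df + 1)
rooted = np.sqrt(df)

# Subsetting by label and by position
first_gene = df.loc[df.index[0]]   # label-based
first_row = df.iloc[0]             # position-based

# Basic plotting; colors are set via the color/c arguments
plt.plot(df.iloc[0])                                        # line plot of one gene
sns.histplot(flat, color="steelblue")                       # distribution of all values
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c="darkorange")   # sample vs. sample
plt.savefig("expression_plots.png")
```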
All of the following are linear methods:
- PCA
    - Deterministic
    - Sign of components doesn't matter
    - Driven by the "loudest" signals, i.e., the most highly expressed genes
    - Order of components matters: the highest-variance component comes first
    - Can't separate mixed signals; it only finds the directions of biggest variance
- ICA
    - Sign of components doesn't matter
    - Separates mixed signals into independent sources
    - Doesn't find the biggest variance
    - Depends on the random state: not deterministic, but stochastic
    - Setting the random seed matters
    - Performs a kind of "clustering"
    - Number of components matters
- NMF
    - Similar to ICA, but all input data must be non-negative
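A sketch comparing the three linear methods on a made-up non-negative matrix with scikit-learn; the matrix and the choice of 2 components are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA, NMF

np.random.seed(0)                        # ICA and NMF depend on the random state
X = np.random.poisson(5, size=(50, 20))  # made-up non-negative "expression" matrix

# PCA: deterministic, components ordered by explained variance
pca = PCA(n_components=2)
pca_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)     # highest-variance component first

# ICA: separates independent signals, no variance ordering, stochastic
ica = FastICA(n_components=2, random_state=0)
ica_reduced = ica.fit_transform(X)

# NMF: similar idea, but the input (and the components) must be non-negative
nmf = NMF(n_components=2, random_state=0)
nmf_reduced = nmf.fit_transform(X)

# The sign of PCA/ICA components is arbitrary: flipping a component and its
# projection gives an equally valid solution
print(pca.components_.shape, ica.components_.shape, nmf.components_.shape)
```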
Coding goals
- `smusher` object
- `smusher.fit_transform`
- `smusher.components_`
- `smusher.explained_variance_ratio_` (PCA only)
- `np.random.seed(0)` and `random_state=0`
- `sns.pairplot`
- `df.join`
- `groupby`
- Aggregating operations: `df.groupby('celltype').mean()`
- `sns.heatmap`
- `df.apply(np.linalg.norm, axis=1)`
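A sketch of these coding goals in one pass. The `expression` and `metadata` dataframes and the `celltype` column are made up for illustration; `smusher` is just a nickname for whatever decomposition object is being fit.

```python
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA

np.random.seed(0)  # make the stochastic parts reproducible

# Hypothetical data: 30 samples x 10 genes, plus a celltype label per sample
expression = pd.DataFrame(
    np.random.rand(30, 10),
    index=[f"sample{i}" for i in range(30)],
    columns=[f"gene{j}" for j in range(10)],
)
metadata = pd.DataFrame(
    {"celltype": np.random.choice(["A", "B"], size=30)},
    index=expression.index,
)

# "smusher" = any decomposition object; random_state=0 keeps stochastic methods reproducible
smusher = PCA(n_components=2, random_state=0)
smushed = pd.DataFrame(
    smusher.fit_transform(expression),
    index=expression.index,
    columns=["pc1", "pc2"],
)
print(smusher.components_.shape)
print(smusher.explained_variance_ratio_)   # PCA only

# Join the reduced data to the metadata, then look at it per cell type
smushed_with_meta = smushed.join(metadata)
sns.pairplot(smushed_with_meta, hue="celltype")

# Aggregating operation: mean expression per cell type, shown as a heatmap
mean_by_celltype = expression.join(metadata).groupby("celltype").mean()
sns.heatmap(mean_by_celltype)

# Row-wise norm of each sample's expression vector
norms = expression.apply(np.linalg.norm, axis=1)
```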
Non-linear
- Multidimensional scaling (MDS)
    - Preserves the overall structure
    - But that structure could still be really complex and hard to see
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
    - Visualization technique only
    - Makes slight differences bigger
    - Depends on the random state
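A sketch of both non-linear embeddings on made-up data; the matrix and the perplexity value are arbitrary, and `random_state` is set because the layouts depend on initialization.

```python
import numpy as np
from sklearn.manifold import MDS, TSNE

np.random.seed(0)
X = np.random.rand(100, 20)   # made-up data: 100 samples, 20 features

# MDS: tries to preserve the overall pairwise-distance structure
mds = MDS(n_components=2, random_state=0)
mds_embedded = mds.fit_transform(X)

# t-SNE: visualization only; it exaggerates small local differences,
# and the layout changes with the random state
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
tsne_embedded = tsne.fit_transform(X)

print(mds_embedded.shape, tsne_embedded.shape)
```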