forked from BradnerLab/pipeline
-
Notifications
You must be signed in to change notification settings - Fork 0
Meta Analysis
John DiMatteo edited this page Nov 26, 2013
·
53 revisions
This page is currently just rough notes. It will probably be cleaned up and more presentable within a week or so. All conclusions here are preliminary, and the various programs used to generate the results still need to be code reviewed and tested.
- we are trying to find if there are any hot spots in the genome sequence data of the Bradner Lab
- for example, similar to figure 4 of a paper "Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project" by Mark B. Gerstein et al.
- Normalized bin counts by cell type visuals generated with [bam_liquidator_bin_plot_by_cell_type.py](https://github.com/jdimatteo/pipeline/blob/master/bam_liquidator_bin_plot_by_cell_type.py:
Note that I'm ignoring the region around the centromere (I may want to tweak how I am defining that region).
mysql> select bin, avg(percentile_in_cell_type) as avg_percentile
from normalized_bins where bin < 1211 OR bin > 1600
group by bin order by avg_percentile desc limit 20;
+------+-------------------+
| bin | avg_percentile |
+------+-------------------+
| 2328 | 98.29166666666667 |
| 2015 | 98.20833333333333 |
| 2249 | 98.16666666666667 |
| 2027 | 97.45833333333333 |
| 2263 | 97.125 |
| 1793 | 96.66666666666667 |
| 1781 | 96.625 |
| 2430 | 96.54166666666667 |
| 1732 | 96.41666666666667 |
| 2243 | 96.20833333333333 |
| 1659 | 96.08333333333333 |
| 1968 | 95.54166666666667 |
| 450 | 95.41666666666667 |
| 2049 | 95.29166666666667 |
| 237 | 95.08333333333333 |
| 1969 | 94.79166666666667 |
| 324 | 94.75 |
| 2026 | 94.625 |
| 1971 | 94.5 |
| 2047 | 94.41666666666667 |
+------+-------------------+
20 rows in set (0.09 sec)
mysql> select bin, avg(percentile_in_cell_type) as avg_percentile
from normalized_bins where bin < 1211 OR bin > 1600
group by bin order by avg_percentile asc limit 20;
+------+--------------------+
| bin | avg_percentile |
+------+--------------------+
| 1 | 0 |
| 4 | 0 |
| 134 | 0.375 |
| 3 | 0.75 |
| 135 | 1.125 |
| 2 | 1.5 |
| 133 | 1.5 |
| 132 | 3.375 |
| 6 | 3.8333333333333335 |
| 1040 | 4.958333333333333 |
| 129 | 5.708333333333333 |
| 130 | 7.375 |
| 1207 | 8.625 |
| 1206 | 9.166666666666666 |
| 1205 | 9.166666666666666 |
| 2042 | 9.208333333333334 |
| 2412 | 9.583333333333334 |
| 1087 | 9.583333333333334 |
| 2045 | 9.625 |
| 5 | 9.75 |
+------+--------------------+
20 rows in set (0.09 sec)
mysql>
Following table shows overall bin count statistics for chromosome 1:
mysql> select min(count), max(count), avg(count), std(count), sum(count)
from chr1_bin_counts;
+------------+------------+---------------+-------------------+-------------+
| min(count) | max(count) | avg(count) | std(count) | sum(count) |
+------------+------------+---------------+-------------------+-------------+
| 0 | 163182419 | 15035679.2257 | 8956512.942009946 | 37168199046 |
+------------+------------+---------------+-------------------+-------------+
1 row in set (4.04 sec)
mysql>
Following table shows bin count statistics by cell type (again just for chromosome 1):
- minimum bin count
- maximum bin count
- average bin count
- population standard deviation across bin counts with that parent
- total count across all bins with that parent
mysql> select cell_type, min(count), format(max(count), 0), format(avg(count), 0),
format(std(count), 0), format(sum(count), 0)
from chr1_bin_counts_by_cell_type group by cell_type;
+-----------+------------+-----------------------+-----------------------+-----------------------+-----------------------+
| cell_type | min(count) | format(max(count), 0) | format(avg(count), 0) | format(std(count), 0) | format(sum(count), 0) |
+-----------+------------+-----------------------+-----------------------+-----------------------+-----------------------+
| be2c | 0 | 3,770,520 | 287,038 | 204,507 | 709,557,454 |
| dhl6 | 0 | 2,185,662 | 141,637 | 112,005 | 350,126,200 |
| ec | 0 | 6,569,832 | 987,943 | 657,771 | 2,442,195,971 |
| hbl1 | 0 | 5,017,459 | 149,097 | 222,292 | 368,567,976 |
| hsc | 0 | 246,632 | 51,019 | 36,235 | 126,119,067 |
| k422 | 0 | 5,954,990 | 253,770 | 298,047 | 627,318,440 |
| kbm7 | 0 | 1,648,902 | 188,416 | 195,185 | 465,763,314 |
| kms11 | 0 | 41,239,161 | 2,131,480 | 1,682,624 | 5,269,018,091 |
| ly1 | 0 | 12,762,280 | 950,571 | 687,998 | 2,349,811,685 |
| ly18 | 0 | 5,759,705 | 305,177 | 264,601 | 754,397,616 |
| ly3 | 0 | 4,737,559 | 153,570 | 205,493 | 379,624,181 |
| ly4 | 0 | 1,969,491 | 171,868 | 145,846 | 424,857,110 |
| mm1s | 0 | 44,442,560 | 6,126,253 | 3,814,617 | 15,144,097,564 |
| mmp1 | 0 | 1,352,981 | 136,963 | 107,231 | 338,573,515 |
| nut797 | 0 | 3,082,800 | 395,887 | 252,606 | 978,633,841 |
| p107a | 0 | 1,600,640 | 95,953 | 49,827 | 237,195,279 |
| p14a | 0 | 2,823,560 | 218,297 | 146,555 | 539,630,632 |
| p265 | 0 | 2,395,440 | 151,910 | 82,715 | 375,521,965 |
| p286 | 0 | 1,092,800 | 81,695 | 42,149 | 201,951,007 |
| p397 | 0 | 1,867,040 | 158,912 | 91,665 | 392,830,578 |
| p448 | 0 | 1,864,560 | 136,890 | 76,698 | 338,392,224 |
| p493-6 | 0 | 13,805,280 | 1,216,463 | 768,695 | 3,007,097,323 |
| sknas | 0 | 5,765,760 | 384,664 | 259,913 | 950,889,284 |
| toledo | 0 | 2,919,841 | 160,206 | 196,181 | 396,028,729 |
+-----------+------------+-----------------------+-----------------------+-----------------------+-----------------------+
24 rows in set (5.60 sec)
mysql>
mysql> CREATE DATABASE meta_analysis;
mysql> use meta_analysis;
mysql> CREATE TABLE `counts` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`parent_directory` varchar(15) NOT NULL,
`file_name` varchar(63) NOT NULL,
`chromosome` varchar(15) NOT NULL,
`bin` int(11) NOT NULL,
`count` int(11) NOT NULL,
`counter_version` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `counter_version` (`counter_version`,`chromosome`,`bin`,`file_name`)
) ENGINE=MyISAM;
mysql> CREATE TABLE errors (id INT NOT NULL AUTO_INCREMENT, file_name VARCHAR(63) NOT NULL,
chromosome VARCHAR(15) NOT NULL, bin INT NOT NULL, error VARCHAR(512) NOT NULL,
counter_version INT NOT NULL, create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (id));
mysql> CREATE TABLE run (id INT NOT NULL AUTO_INCREMENT, file_name VARCHAR(63) NOT NULL,
resume_count INT NOT NULL DEFAULT 0, start_time DATETIME NOT NULL,
end_time DATETIME, counter_version INT NOT NULL,
PRIMARY KEY (id), UNIQUE KEY (file_name, counter_version));
jdm@tod:~/Downloads$ mysql -uroot -p -D meta_analysis < bins.sql
Enter password:
jdm@tod:~/Downloads$
mysql> describe bins;
+------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+-------+
| gff_name | varchar(64) | NO | | NULL | |
| chromosome | varchar(64) | NO | | NULL | |
| bin | varchar(12) | NO | | NULL | |
| start | int(11) | NO | | NULL | |
| end | int(11) | NO | | NULL | |
| strand | char(1) | NO | | NULL | |
| sixth_col | int(11) | NO | | NULL | |
| second_bin | varchar(12) | NO | | NULL | |
| bin_number | int(11) | NO | | NULL | |
| skip | tinyint(1) | NO | | NULL | |
+------------+-------------+------+-----+---------+-------+
10 rows in set (0.00 sec)
mysql> CREATE USER 'counter'@'localhost';
mysql> GRANT SELECT ON meta_analysis.* TO 'counter'@'localhost';
mysql> GRANT SELECT,INSERT ON meta_analysis.counts TO 'counter'@'localhost';
mysql> GRANT SELECT,INSERT ON meta_analysis.errors TO 'counter'@'localhost';
mysql> GRANT SELECT,INSERT,UPDATE ON meta_analysis.run TO 'counter'@'localhost';
mysql> CREATE OR REPLACE VIEW chr1_bin_counts AS
SELECT bin, SUM(count) AS count FROM counts
WHERE counter_version=203 AND chromosome='chr1' GROUP BY bin;
mysql> CREATE OR REPLACE VIEW chr1_bin_counts_by_cell_type AS
SELECT bin, parent_directory AS cell_type, SUM(count) AS count FROM counts
WHERE counter_version=203 AND chromosome='chr1' GROUP BY bin, parent_directory;
mysql> create index file on counts (counter_version, file_name); // it would have been better as counter_version, file_name, chromosome
mysql> CREATE TABLE normalized_bins (
id int NOT NULL AUTO_INCREMENT,
cell_type varchar(15) NOT NULL,
chromosome varchar(15) NOT NULL,
bin int NOT NULL,
count_fraction double NOT NULL,
percentile_in_cell_type DOUBLE,
counter_version int NOT NULL,
PRIMARY KEY (id),
UNIQUE KEY plotter (counter_version, chromosome, cell_type, bin),
KEY overall_view (counter_version, chromosome, bin)
) ENGINE=MyISAM;
mysql> CREATE OR REPLACE VIEW chr1_normalized_bin_counts AS
SELECT bin, SUM(count_fraction) AS count FROM normalized_bins
WHERE counter_version=203 AND chromosome='chr1' GROUP BY bin;
mysql> GRANT SELECT,INSERT,UPDATE ON meta_analysis.normalized_bins TO 'counter'@'localhost';
- between about 1:40 AM to 8 AM on November 5th, I processed 110 .bam files on TOD
- that is about 1 .bam file every 3.5 minutes (380 minutes / 110 files)
- when not running multiple simultaneously, a single bam file takes about 37 minutes on TOD
- the same bam file processing takes about 25 minutes on my laptop which might indicate that my laptop's SSD makes a big difference
- I ran the following in 13 simultaneous screen windows
jdm@tod:~/pipeline_counting$ while [ -f ./keep_going ];
do ./bam_liquidator_bin_counter.sh /ifs/labs/bradner/bam/hg18 ; done
- TODO: if I use this method again, make sure there is some provision for the loop to stop when there are no more files to process! For example, maybe stop if bam_liquidator_bin_counter.sh returns a non-zero error code (and make sure a non-zero error code is returned when there are no more files to process)
- this resulted in a load of about 13/14 on tod
- the mysqld process used significant CPU (maybe 50% of a core or something like that), but otherwise this computation seemed to be mostly I/O bound
- I somewhat randomly stopped at 13. I assume an optimal number would probably be 30 or 60 if TOD wasn't doing anything else
- the bam_liquidator_bin_counter.sh version was 201, and the following shows the exact version of the script I ran,
jdm@tod:~/pipeline_counting$ git remote show origin
* remote origin
Fetch URL: https://github.com/jdimatteo/pipeline.git
Push URL: https://github.com/jdimatteo/pipeline.git
HEAD branch: master
Remote branch:
master tracked
Local branch configured for 'git pull':
master merges with remote master
Local ref configured for 'git push':
master pushes to master (local out of date)
jdm@tod:~/pipeline_counting$ git rev-parse HEAD
e018bdb70f0eef39acd4b82f312159e1937090f9
- this version actually had a bug that prevented the end time being recorded in the mysql meta_analysis.run table
- some of the output looked like the following
jdm@tod:~/pipeline_counting$ while [ -f ./keep_going ]; do ./bam_liquidator_bin_counter.sh /ifs/labs/bradner/bam/hg18 ; done
searching for next unprocessed .bam in /ifs/labs/bradner/bam/hg18
trying /ifs/labs/bradner/bam/hg18/kms11/092813_kms11_ntko_s4.hg18.bwt.sorted.bam
trying /ifs/labs/bradner/bam/hg18/kms11/092813_kms11_tko_s3.hg18.bwt.sorted.bam
starting with parameters:
baseline_check: 0
file_path: /ifs/labs/bradner/bam/hg18/kms11/092813_kms11_tko_s3.hg18.bwt.sorted.bam
--------------------------------------------------------------------------------
counting chromosome chr1 - Tue Nov 5 01:48:11 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr2 - Tue Nov 5 01:52:24 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr3 - Tue Nov 5 01:56:41 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr4 - Tue Nov 5 02:00:12 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr5 - Tue Nov 5 02:03:36 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr6 - Tue Nov 5 02:06:50 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr8 - Tue Nov 5 02:09:48 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr9 - Tue Nov 5 02:12:26 EST 2013
--------------------------------------------------------------------------------
counting chromosome chrX - Tue Nov 5 02:14:51 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr10 - Tue Nov 5 02:17:28 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr11 - Tue Nov 5 02:19:55 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr12 - Tue Nov 5 02:22:17 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr13 - Tue Nov 5 02:24:35 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr14 - Tue Nov 5 02:26:29 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr15 - Tue Nov 5 02:28:17 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr16 - Tue Nov 5 02:30:01 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr17 - Tue Nov 5 02:31:38 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr18 - Tue Nov 5 02:33:03 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr19 - Tue Nov 5 02:34:28 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr20 - Tue Nov 5 02:35:37 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr21 - Tue Nov 5 02:36:44 EST 2013
--------------------------------------------------------------------------------
counting chromosome chr22 - Tue Nov 5 02:37:35 EST 2013
./bam_liquidator_bin_counter.sh: line 262: syntax error: unexpected end of file
searching for next unprocessed .bam in /ifs/labs/bradner/bam/hg18
- each .bam file (e.g. 04032013_D1L57ACXX_4.TTAGGC.hg18.bwt.sorted.bam) represents the chromosomes of the full human genome for some particular biological sample
- the sample is spread across many people that share some particular trait of interest (e.g. same cancer)
- there are 22 numbered chromosomes (chr1 - chr22) plus chrX, chrY, and chrM
- some of the bam files also have other "hap" labeled chromosomes such as chr5_h2_hap1, chr6_cox_hap1, and chr6_qbl_hap2
- for this analysis, we are only using chr1-22 and chrX
- we are excluding chr7, chrY, chrM, and any that say hap on them
- each chromosome contains both DNA strands, and bamliquidator uses +, -, or . to distinguish between them
- strand + is the forward strand
- strand - is the reverse/complementary strand
- . means check both strands
- each .bam file represents zero or more reads of base pairs at specific locations on the chromosomes
- each read is a sequence of base pairs, and they tend to be clumped together
- the beginnings and ends of chromosomes tend to not have any reads associated with them
- most regions of the chromosome should have reads associated with them
-
Tablet is a nice tool to visualize the base pair reads of a .bam file
- samtools view and samtools tview are ways to see reads on a command line
- a gff file has been provided splitting the genome into 2472 bins
- each bin is numbered (bin 0 through bin 2471)
- each bin is 100000 base pairs long
- the bin count starts at 0 for each chromosome (e.g. the first bin of chr1 and chr2 are both "bin 0")
- we may want to shrink the bin size in the future so that we can do finer grained analysis
- November 18, 2013 Notes
- Visualization tool
- try leveraging http://processing.org/ and/or http://epigenomegateway.wustl.edu/browser/
- I got processing to work on github.io: http://jdimatteo.github.io/processing_hello_world/
- try leveraging http://processing.org/ and/or http://epigenomegateway.wustl.edu/browser/
- Visualization tool