Meta Analysis

This page is currently just rough notes. It will probably be cleaned up and more presentable within a week or so. All conclusions here are preliminary, and the various programs used to generate the results still need to be code reviewed and tested.

analysis overview

we are trying to find if there are any hot spots in the genome sequence data of the Bradner Lab
- for example, similar to figure 4 of a paper "Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project" by Mark B. Gerstein et al.

visuals

Normalized bin counts by cell type visuals generated with [bam_liquidator_bin_plot_by_cell_type.py](https://github.com/jdimatteo/pipeline/blob/master/bam_liquidator_bin_plot_by_cell_type.py:
- summary
- chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9
- chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19
- chr20, chr21, chr22, chrX

preliminary hotspots and coldspots

Note that I'm ignoring the region around the centromere (I may want to tweak how I am defining that region).

mysql> select bin, avg(percentile_in_cell_type) as avg_percentile
         from normalized_bins where bin < 1211 OR bin > 1600
        group by bin order by avg_percentile desc limit 20;
+------+-------------------+
| bin  | avg_percentile    |
+------+-------------------+
| 2328 | 98.29166666666667 |
| 2015 | 98.20833333333333 |
| 2249 | 98.16666666666667 |
| 2027 | 97.45833333333333 |
| 2263 |            97.125 |
| 1793 | 96.66666666666667 |
| 1781 |            96.625 |
| 2430 | 96.54166666666667 |
| 1732 | 96.41666666666667 |
| 2243 | 96.20833333333333 |
| 1659 | 96.08333333333333 |
| 1968 | 95.54166666666667 |
|  450 | 95.41666666666667 |
| 2049 | 95.29166666666667 |
|  237 | 95.08333333333333 |
| 1969 | 94.79166666666667 |
|  324 |             94.75 |
| 2026 |            94.625 |
| 1971 |              94.5 |
| 2047 | 94.41666666666667 |
+------+-------------------+
20 rows in set (0.09 sec)

mysql> select bin, avg(percentile_in_cell_type) as avg_percentile
         from normalized_bins where bin < 1211 OR bin > 1600
        group by bin order by avg_percentile asc limit 20;
+------+--------------------+
| bin  | avg_percentile     |
+------+--------------------+
|    1 |                  0 |
|    4 |                  0 |
|  134 |              0.375 |
|    3 |               0.75 |
|  135 |              1.125 |
|    2 |                1.5 |
|  133 |                1.5 |
|  132 |              3.375 |
|    6 | 3.8333333333333335 |
| 1040 |  4.958333333333333 |
|  129 |  5.708333333333333 |
|  130 |              7.375 |
| 1207 |              8.625 |
| 1206 |  9.166666666666666 |
| 1205 |  9.166666666666666 |
| 2042 |  9.208333333333334 |
| 2412 |  9.583333333333334 |
| 1087 |  9.583333333333334 |
| 2045 |              9.625 |
|    5 |               9.75 |
+------+--------------------+
20 rows in set (0.09 sec)

mysql>

statistics

Following table shows overall bin count statistics for chromosome 1:

mysql> select min(count), max(count), avg(count), std(count), sum(count)
         from chr1_bin_counts;
+------------+------------+---------------+-------------------+-------------+
| min(count) | max(count) | avg(count)    | std(count)        | sum(count)  |
+------------+------------+---------------+-------------------+-------------+
|          0 |  163182419 | 15035679.2257 | 8956512.942009946 | 37168199046 |
+------------+------------+---------------+-------------------+-------------+
1 row in set (4.04 sec)

mysql>

Following table shows bin count statistics by cell type (again just for chromosome 1):

minimum bin count
maximum bin count
average bin count
population standard deviation across bin counts with that parent
total count across all bins with that parent

mysql> select cell_type, min(count), format(max(count), 0), format(avg(count), 0),
              format(std(count), 0), format(sum(count), 0)
         from chr1_bin_counts_by_cell_type group by cell_type;
+-----------+------------+-----------------------+-----------------------+-----------------------+-----------------------+
| cell_type | min(count) | format(max(count), 0) | format(avg(count), 0) | format(std(count), 0) | format(sum(count), 0) |
+-----------+------------+-----------------------+-----------------------+-----------------------+-----------------------+
| be2c      |          0 | 3,770,520             | 287,038               | 204,507               | 709,557,454           |
| dhl6      |          0 | 2,185,662             | 141,637               | 112,005               | 350,126,200           |
| ec        |          0 | 6,569,832             | 987,943               | 657,771               | 2,442,195,971         |
| hbl1      |          0 | 5,017,459             | 149,097               | 222,292               | 368,567,976           |
| hsc       |          0 | 246,632               | 51,019                | 36,235                | 126,119,067           |
| k422      |          0 | 5,954,990             | 253,770               | 298,047               | 627,318,440           |
| kbm7      |          0 | 1,648,902             | 188,416               | 195,185               | 465,763,314           |
| kms11     |          0 | 41,239,161            | 2,131,480             | 1,682,624             | 5,269,018,091         |
| ly1       |          0 | 12,762,280            | 950,571               | 687,998               | 2,349,811,685         |
| ly18      |          0 | 5,759,705             | 305,177               | 264,601               | 754,397,616           |
| ly3       |          0 | 4,737,559             | 153,570               | 205,493               | 379,624,181           |
| ly4       |          0 | 1,969,491             | 171,868               | 145,846               | 424,857,110           |
| mm1s      |          0 | 44,442,560            | 6,126,253             | 3,814,617             | 15,144,097,564        |
| mmp1      |          0 | 1,352,981             | 136,963               | 107,231               | 338,573,515           |
| nut797    |          0 | 3,082,800             | 395,887               | 252,606               | 978,633,841           |
| p107a     |          0 | 1,600,640             | 95,953                | 49,827                | 237,195,279           |
| p14a      |          0 | 2,823,560             | 218,297               | 146,555               | 539,630,632           |
| p265      |          0 | 2,395,440             | 151,910               | 82,715                | 375,521,965           |
| p286      |          0 | 1,092,800             | 81,695                | 42,149                | 201,951,007           |
| p397      |          0 | 1,867,040             | 158,912               | 91,665                | 392,830,578           |
| p448      |          0 | 1,864,560             | 136,890               | 76,698                | 338,392,224           |
| p493-6    |          0 | 13,805,280            | 1,216,463             | 768,695               | 3,007,097,323         |
| sknas     |          0 | 5,765,760             | 384,664               | 259,913               | 950,889,284           |
| toledo    |          0 | 2,919,841             | 160,206               | 196,181               | 396,028,729           |
+-----------+------------+-----------------------+-----------------------+-----------------------+-----------------------+
24 rows in set (5.60 sec)

mysql>

setup

database

initial tables:

mysql> CREATE DATABASE meta_analysis;
mysql> use meta_analysis;
mysql> CREATE TABLE `counts` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `parent_directory` varchar(15) NOT NULL,
  `file_name` varchar(63) NOT NULL,
  `chromosome` varchar(15) NOT NULL,
  `bin` int(11) NOT NULL,
  `count` int(11) NOT NULL,
  `counter_version` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `counter_version` (`counter_version`,`chromosome`,`bin`,`file_name`)
) ENGINE=MyISAM;
mysql> CREATE TABLE errors (id INT NOT NULL AUTO_INCREMENT, file_name VARCHAR(63) NOT NULL,
  chromosome VARCHAR(15) NOT NULL, bin INT NOT NULL, error VARCHAR(512) NOT NULL,
  counter_version INT NOT NULL, create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (id));
mysql> CREATE TABLE run (id INT NOT NULL AUTO_INCREMENT, file_name VARCHAR(63) NOT NULL,
                         resume_count INT NOT NULL DEFAULT 0, start_time DATETIME NOT NULL,
                         end_time DATETIME, counter_version INT NOT NULL,
                         PRIMARY KEY (id), UNIQUE KEY (file_name, counter_version));

jdm@tod:~/Downloads$ mysql -uroot -p -D meta_analysis < bins.sql 
Enter password: 
jdm@tod:~/Downloads$ 

mysql> describe bins;
+------------+-------------+------+-----+---------+-------+
| Field      | Type        | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+-------+
| gff_name   | varchar(64) | NO   |     | NULL    |       |
| chromosome | varchar(64) | NO   |     | NULL    |       |
| bin        | varchar(12) | NO   |     | NULL    |       |
| start      | int(11)     | NO   |     | NULL    |       |
| end        | int(11)     | NO   |     | NULL    |       |
| strand     | char(1)     | NO   |     | NULL    |       |
| sixth_col  | int(11)     | NO   |     | NULL    |       |
| second_bin | varchar(12) | NO   |     | NULL    |       |
| bin_number | int(11)     | NO   |     | NULL    |       |
| skip       | tinyint(1)  | NO   |     | NULL    |       |
+------------+-------------+------+-----+---------+-------+
10 rows in set (0.00 sec)

mysql> CREATE USER 'counter'@'localhost';
mysql> GRANT SELECT ON meta_analysis.* TO 'counter'@'localhost';
mysql> GRANT SELECT,INSERT ON meta_analysis.counts TO 'counter'@'localhost';
mysql> GRANT SELECT,INSERT ON meta_analysis.errors TO 'counter'@'localhost';
mysql> GRANT SELECT,INSERT,UPDATE ON meta_analysis.run TO 'counter'@'localhost';

views:

mysql> CREATE OR REPLACE VIEW chr1_bin_counts AS
                       SELECT bin, SUM(count) AS count FROM counts
                        WHERE counter_version=203 AND chromosome='chr1' GROUP BY bin;
mysql> CREATE OR REPLACE VIEW chr1_bin_counts_by_cell_type AS
                       SELECT bin, parent_directory AS cell_type, SUM(count) AS count FROM counts
                       WHERE counter_version=203 AND chromosome='chr1' GROUP BY bin, parent_directory;

additional tables/indices used in analysis

mysql> create index file on counts (counter_version, file_name); // it would have been better as counter_version, file_name, chromosome
mysql> CREATE TABLE normalized_bins (
  id int NOT NULL AUTO_INCREMENT,
  cell_type varchar(15) NOT NULL,
  chromosome varchar(15) NOT NULL,
  bin int NOT NULL,
  count_fraction double NOT NULL,
  percentile_in_cell_type DOUBLE,
  counter_version int NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY plotter (counter_version, chromosome, cell_type, bin),
  KEY overall_view (counter_version, chromosome, bin)
) ENGINE=MyISAM;
mysql> CREATE OR REPLACE VIEW chr1_normalized_bin_counts AS
                       SELECT bin, SUM(count_fraction) AS count FROM normalized_bins
                        WHERE counter_version=203 AND chromosome='chr1' GROUP BY bin;
mysql> GRANT SELECT,INSERT,UPDATE ON meta_analysis.normalized_bins TO 'counter'@'localhost';

populating the tables

between about 1:40 AM to 8 AM on November 5th, I processed 110 .bam files on TOD
- that is about 1 .bam file every 3.5 minutes (380 minutes / 110 files)
- when not running multiple simultaneously, a single bam file takes about 37 minutes on TOD
  - the same bam file processing takes about 25 minutes on my laptop which might indicate that my laptop's SSD makes a big difference
I ran the following in 13 simultaneous screen windows

jdm@tod:~/pipeline_counting$ while [ -f ./keep_going ];
do ./bam_liquidator_bin_counter.sh /ifs/labs/bradner/bam/hg18 ; done

TODO: if I use this method again, make sure there is some provision for the loop to stop when there are no more files to process! For example, maybe stop if bam_liquidator_bin_counter.sh returns a non-zero error code (and make sure a non-zero error code is returned when there are no more files to process)
this resulted in a load of about 13/14 on tod
the mysqld process used significant CPU (maybe 50% of a core or something like that), but otherwise this computation seemed to be mostly I/O bound
I somewhat randomly stopped at 13. I assume an optimal number would probably be 30 or 60 if TOD wasn't doing anything else
the bam_liquidator_bin_counter.sh version was 201, and the following shows the exact version of the script I ran,

jdm@tod:~/pipeline_counting$ git remote show origin
* remote origin
  Fetch URL: https://github.com/jdimatteo/pipeline.git
  Push  URL: https://github.com/jdimatteo/pipeline.git
  HEAD branch: master
  Remote branch:
    master tracked
  Local branch configured for 'git pull':
    master merges with remote master
  Local ref configured for 'git push':
    master pushes to master (local out of date)
jdm@tod:~/pipeline_counting$ git rev-parse HEAD
e018bdb70f0eef39acd4b82f312159e1937090f9

this version actually had a bug that prevented the end time being recorded in the mysql meta_analysis.run table
some of the output looked like the following

jdm@tod:~/pipeline_counting$ while [ -f ./keep_going ]; do ./bam_liquidator_bin_counter.sh /ifs/labs/bradner/bam/hg18 ; done
searching for next unprocessed .bam in /ifs/labs/bradner/bam/hg18
  trying /ifs/labs/bradner/bam/hg18/kms11/092813_kms11_ntko_s4.hg18.bwt.sorted.bam
  trying /ifs/labs/bradner/bam/hg18/kms11/092813_kms11_tko_s3.hg18.bwt.sorted.bam
starting with parameters:
  baseline_check: 0
  file_path: /ifs/labs/bradner/bam/hg18/kms11/092813_kms11_tko_s3.hg18.bwt.sorted.bam
--------------------------------------------------------------------------------
counting chromosome chr1 - Tue Nov 5 01:48:11 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr2 - Tue Nov 5 01:52:24 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr3 - Tue Nov 5 01:56:41 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr4 - Tue Nov 5 02:00:12 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr5 - Tue Nov 5 02:03:36 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr6 - Tue Nov 5 02:06:50 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr8 - Tue Nov 5 02:09:48 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr9 - Tue Nov 5 02:12:26 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chrX - Tue Nov 5 02:14:51 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr10 - Tue Nov 5 02:17:28 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr11 - Tue Nov 5 02:19:55 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr12 - Tue Nov 5 02:22:17 EST 2013
                                   
--------------------------------------------------------------------------------
counting chromosome chr13 - Tue Nov 5 02:24:35 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr14 - Tue Nov 5 02:26:29 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr15 - Tue Nov 5 02:28:17 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr16 - Tue Nov 5 02:30:01 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr17 - Tue Nov 5 02:31:38 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr18 - Tue Nov 5 02:33:03 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr19 - Tue Nov 5 02:34:28 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr20 - Tue Nov 5 02:35:37 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr21 - Tue Nov 5 02:36:44 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr22 - Tue Nov 5 02:37:35 EST 2013
   
./bam_liquidator_bin_counter.sh: line 262: syntax error: unexpected end of file
searching for next unprocessed .bam in /ifs/labs/bradner/bam/hg18

Misc Notes

each .bam file (e.g. 04032013_D1L57ACXX_4.TTAGGC.hg18.bwt.sorted.bam) represents the chromosomes of the full human genome for some particular biological sample
- the sample is spread across many people that share some particular trait of interest (e.g. same cancer)
- there are 22 numbered chromosomes (chr1 - chr22) plus chrX, chrY, and chrM
  - some of the bam files also have other "hap" labeled chromosomes such as chr5_h2_hap1, chr6_cox_hap1, and chr6_qbl_hap2
  - for this analysis, we are only using chr1-22 and chrX
    - we are excluding chr7, chrY, chrM, and any that say hap on them
- each chromosome contains both DNA strands, and bamliquidator uses +, -, or . to distinguish between them
  - strand + is the forward strand
  - strand - is the reverse/complementary strand
  - . means check both strands
- each .bam file represents zero or more reads of base pairs at specific locations on the chromosomes
  - each read is a sequence of base pairs, and they tend to be clumped together
  - the beginnings and ends of chromosomes tend to not have any reads associated with them
  - most regions of the chromosome should have reads associated with them
  - Tablet is a nice tool to visualize the base pair reads of a .bam file
    - samtools view and samtools tview are ways to see reads on a command line
a gff file has been provided splitting the genome into 2472 bins
- each bin is numbered (bin 0 through bin 2471)
- each bin is 100000 base pairs long
- the bin count starts at 0 for each chromosome (e.g. the first bin of chr1 and chr2 are both "bin 0")
- we may want to shrink the bin size in the future so that we can do finer grained analysis
November 18, 2013 Notes
- Visualization tool
  - try leveraging http://processing.org/ and/or http://epigenomegateway.wustl.edu/browser/
    - I got processing to work on github.io: http://jdimatteo.github.io/processing_hello_world/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly