Skip to content
John DiMatteo edited this page Nov 26, 2013 · 53 revisions

This page is currently just rough notes. It will probably be cleaned up and more presentable within a week or so. All conclusions here are preliminary, and the various programs used to generate the results still need to be code reviewed and tested.

analysis overview

figure 4

visuals

preliminary hotspots and coldspots

Note that I'm ignoring the region around the centromere (I may want to tweak how I am defining that region).

mysql> select bin, avg(percentile_in_cell_type) as avg_percentile
         from normalized_bins where bin < 1211 OR bin > 1600
        group by bin order by avg_percentile desc limit 20;
+------+-------------------+
| bin  | avg_percentile    |
+------+-------------------+
| 2328 | 98.29166666666667 |
| 2015 | 98.20833333333333 |
| 2249 | 98.16666666666667 |
| 2027 | 97.45833333333333 |
| 2263 |            97.125 |
| 1793 | 96.66666666666667 |
| 1781 |            96.625 |
| 2430 | 96.54166666666667 |
| 1732 | 96.41666666666667 |
| 2243 | 96.20833333333333 |
| 1659 | 96.08333333333333 |
| 1968 | 95.54166666666667 |
|  450 | 95.41666666666667 |
| 2049 | 95.29166666666667 |
|  237 | 95.08333333333333 |
| 1969 | 94.79166666666667 |
|  324 |             94.75 |
| 2026 |            94.625 |
| 1971 |              94.5 |
| 2047 | 94.41666666666667 |
+------+-------------------+
20 rows in set (0.09 sec)

mysql> select bin, avg(percentile_in_cell_type) as avg_percentile
         from normalized_bins where bin < 1211 OR bin > 1600
        group by bin order by avg_percentile asc limit 20;
+------+--------------------+
| bin  | avg_percentile     |
+------+--------------------+
|    1 |                  0 |
|    4 |                  0 |
|  134 |              0.375 |
|    3 |               0.75 |
|  135 |              1.125 |
|    2 |                1.5 |
|  133 |                1.5 |
|  132 |              3.375 |
|    6 | 3.8333333333333335 |
| 1040 |  4.958333333333333 |
|  129 |  5.708333333333333 |
|  130 |              7.375 |
| 1207 |              8.625 |
| 1206 |  9.166666666666666 |
| 1205 |  9.166666666666666 |
| 2042 |  9.208333333333334 |
| 2412 |  9.583333333333334 |
| 1087 |  9.583333333333334 |
| 2045 |              9.625 |
|    5 |               9.75 |
+------+--------------------+
20 rows in set (0.09 sec)

mysql> 

statistics

Following table shows overall bin count statistics for chromosome 1:

mysql> select min(count), max(count), avg(count), std(count), sum(count)
         from chr1_bin_counts;
+------------+------------+---------------+-------------------+-------------+
| min(count) | max(count) | avg(count)    | std(count)        | sum(count)  |
+------------+------------+---------------+-------------------+-------------+
|          0 |  163182419 | 15035679.2257 | 8956512.942009946 | 37168199046 |
+------------+------------+---------------+-------------------+-------------+
1 row in set (4.04 sec)

mysql>

Following table shows bin count statistics by cell type (again just for chromosome 1):

  • minimum bin count
  • maximum bin count
  • average bin count
  • population standard deviation across bin counts with that parent
  • total count across all bins with that parent
mysql> select cell_type, min(count), format(max(count), 0), format(avg(count), 0),
              format(std(count), 0), format(sum(count), 0)
         from chr1_bin_counts_by_cell_type group by cell_type;
+-----------+------------+-----------------------+-----------------------+-----------------------+-----------------------+
| cell_type | min(count) | format(max(count), 0) | format(avg(count), 0) | format(std(count), 0) | format(sum(count), 0) |
+-----------+------------+-----------------------+-----------------------+-----------------------+-----------------------+
| be2c      |          0 | 3,770,520             | 287,038               | 204,507               | 709,557,454           |
| dhl6      |          0 | 2,185,662             | 141,637               | 112,005               | 350,126,200           |
| ec        |          0 | 6,569,832             | 987,943               | 657,771               | 2,442,195,971         |
| hbl1      |          0 | 5,017,459             | 149,097               | 222,292               | 368,567,976           |
| hsc       |          0 | 246,632               | 51,019                | 36,235                | 126,119,067           |
| k422      |          0 | 5,954,990             | 253,770               | 298,047               | 627,318,440           |
| kbm7      |          0 | 1,648,902             | 188,416               | 195,185               | 465,763,314           |
| kms11     |          0 | 41,239,161            | 2,131,480             | 1,682,624             | 5,269,018,091         |
| ly1       |          0 | 12,762,280            | 950,571               | 687,998               | 2,349,811,685         |
| ly18      |          0 | 5,759,705             | 305,177               | 264,601               | 754,397,616           |
| ly3       |          0 | 4,737,559             | 153,570               | 205,493               | 379,624,181           |
| ly4       |          0 | 1,969,491             | 171,868               | 145,846               | 424,857,110           |
| mm1s      |          0 | 44,442,560            | 6,126,253             | 3,814,617             | 15,144,097,564        |
| mmp1      |          0 | 1,352,981             | 136,963               | 107,231               | 338,573,515           |
| nut797    |          0 | 3,082,800             | 395,887               | 252,606               | 978,633,841           |
| p107a     |          0 | 1,600,640             | 95,953                | 49,827                | 237,195,279           |
| p14a      |          0 | 2,823,560             | 218,297               | 146,555               | 539,630,632           |
| p265      |          0 | 2,395,440             | 151,910               | 82,715                | 375,521,965           |
| p286      |          0 | 1,092,800             | 81,695                | 42,149                | 201,951,007           |
| p397      |          0 | 1,867,040             | 158,912               | 91,665                | 392,830,578           |
| p448      |          0 | 1,864,560             | 136,890               | 76,698                | 338,392,224           |
| p493-6    |          0 | 13,805,280            | 1,216,463             | 768,695               | 3,007,097,323         |
| sknas     |          0 | 5,765,760             | 384,664               | 259,913               | 950,889,284           |
| toledo    |          0 | 2,919,841             | 160,206               | 196,181               | 396,028,729           |
+-----------+------------+-----------------------+-----------------------+-----------------------+-----------------------+
24 rows in set (5.60 sec)

mysql> 
  

setup

database

initial tables:

mysql> CREATE DATABASE meta_analysis;
mysql> use meta_analysis;
mysql> CREATE TABLE `counts` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `parent_directory` varchar(15) NOT NULL,
  `file_name` varchar(63) NOT NULL,
  `chromosome` varchar(15) NOT NULL,
  `bin` int(11) NOT NULL,
  `count` int(11) NOT NULL,
  `counter_version` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `counter_version` (`counter_version`,`chromosome`,`bin`,`file_name`)
) ENGINE=MyISAM;
mysql> CREATE TABLE errors (id INT NOT NULL AUTO_INCREMENT, file_name VARCHAR(63) NOT NULL,
  chromosome VARCHAR(15) NOT NULL, bin INT NOT NULL, error VARCHAR(512) NOT NULL,
  counter_version INT NOT NULL, create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (id));
mysql> CREATE TABLE run (id INT NOT NULL AUTO_INCREMENT, file_name VARCHAR(63) NOT NULL,
                         resume_count INT NOT NULL DEFAULT 0, start_time DATETIME NOT NULL,
                         end_time DATETIME, counter_version INT NOT NULL,
                         PRIMARY KEY (id), UNIQUE KEY (file_name, counter_version));

jdm@tod:~/Downloads$ mysql -uroot -p -D meta_analysis < bins.sql 
Enter password: 
jdm@tod:~/Downloads$ 

mysql> describe bins;
+------------+-------------+------+-----+---------+-------+
| Field      | Type        | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+-------+
| gff_name   | varchar(64) | NO   |     | NULL    |       |
| chromosome | varchar(64) | NO   |     | NULL    |       |
| bin        | varchar(12) | NO   |     | NULL    |       |
| start      | int(11)     | NO   |     | NULL    |       |
| end        | int(11)     | NO   |     | NULL    |       |
| strand     | char(1)     | NO   |     | NULL    |       |
| sixth_col  | int(11)     | NO   |     | NULL    |       |
| second_bin | varchar(12) | NO   |     | NULL    |       |
| bin_number | int(11)     | NO   |     | NULL    |       |
| skip       | tinyint(1)  | NO   |     | NULL    |       |
+------------+-------------+------+-----+---------+-------+
10 rows in set (0.00 sec)

mysql> CREATE USER 'counter'@'localhost';
mysql> GRANT SELECT ON meta_analysis.* TO 'counter'@'localhost';
mysql> GRANT SELECT,INSERT ON meta_analysis.counts TO 'counter'@'localhost';
mysql> GRANT SELECT,INSERT ON meta_analysis.errors TO 'counter'@'localhost';
mysql> GRANT SELECT,INSERT,UPDATE ON meta_analysis.run TO 'counter'@'localhost';

views:

mysql> CREATE OR REPLACE VIEW chr1_bin_counts AS
                       SELECT bin, SUM(count) AS count FROM counts
                        WHERE counter_version=203 AND chromosome='chr1' GROUP BY bin;
mysql> CREATE OR REPLACE VIEW chr1_bin_counts_by_cell_type AS
                       SELECT bin, parent_directory AS cell_type, SUM(count) AS count FROM counts
                       WHERE counter_version=203 AND chromosome='chr1' GROUP BY bin, parent_directory;

additional tables/indices used in analysis

mysql> create index file on counts (counter_version, file_name); // it would have been better as counter_version, file_name, chromosome
mysql> CREATE TABLE normalized_bins (
  id int NOT NULL AUTO_INCREMENT,
  cell_type varchar(15) NOT NULL,
  chromosome varchar(15) NOT NULL,
  bin int NOT NULL,
  count_fraction double NOT NULL,
  percentile_in_cell_type DOUBLE,
  counter_version int NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY plotter (counter_version, chromosome, cell_type, bin),
  KEY overall_view (counter_version, chromosome, bin)
) ENGINE=MyISAM;
mysql> CREATE OR REPLACE VIEW chr1_normalized_bin_counts AS
                       SELECT bin, SUM(count_fraction) AS count FROM normalized_bins
                        WHERE counter_version=203 AND chromosome='chr1' GROUP BY bin;
mysql> GRANT SELECT,INSERT,UPDATE ON meta_analysis.normalized_bins TO 'counter'@'localhost';

populating the tables

  • between about 1:40 AM to 8 AM on November 5th, I processed 110 .bam files on TOD
    • that is about 1 .bam file every 3.5 minutes (380 minutes / 110 files)
    • when not running multiple simultaneously, a single bam file takes about 37 minutes on TOD
      • the same bam file processing takes about 25 minutes on my laptop which might indicate that my laptop's SSD makes a big difference
  • I ran the following in 13 simultaneous screen windows
jdm@tod:~/pipeline_counting$ while [ -f ./keep_going ];
do ./bam_liquidator_bin_counter.sh /ifs/labs/bradner/bam/hg18 ; done
  • TODO: if I use this method again, make sure there is some provision for the loop to stop when there are no more files to process! For example, maybe stop if bam_liquidator_bin_counter.sh returns a non-zero error code (and make sure a non-zero error code is returned when there are no more files to process)
  • this resulted in a load of about 13/14 on tod
  • the mysqld process used significant CPU (maybe 50% of a core or something like that), but otherwise this computation seemed to be mostly I/O bound
  • I somewhat randomly stopped at 13. I assume an optimal number would probably be 30 or 60 if TOD wasn't doing anything else
  • the bam_liquidator_bin_counter.sh version was 201, and the following shows the exact version of the script I ran,
jdm@tod:~/pipeline_counting$ git remote show origin
* remote origin
  Fetch URL: https://github.com/jdimatteo/pipeline.git
  Push  URL: https://github.com/jdimatteo/pipeline.git
  HEAD branch: master
  Remote branch:
    master tracked
  Local branch configured for 'git pull':
    master merges with remote master
  Local ref configured for 'git push':
    master pushes to master (local out of date)
jdm@tod:~/pipeline_counting$ git rev-parse HEAD
e018bdb70f0eef39acd4b82f312159e1937090f9
  • this version actually had a bug that prevented the end time being recorded in the mysql meta_analysis.run table
  • some of the output looked like the following
jdm@tod:~/pipeline_counting$ while [ -f ./keep_going ]; do ./bam_liquidator_bin_counter.sh /ifs/labs/bradner/bam/hg18 ; done
searching for next unprocessed .bam in /ifs/labs/bradner/bam/hg18
  trying /ifs/labs/bradner/bam/hg18/kms11/092813_kms11_ntko_s4.hg18.bwt.sorted.bam
  trying /ifs/labs/bradner/bam/hg18/kms11/092813_kms11_tko_s3.hg18.bwt.sorted.bam
starting with parameters:
  baseline_check: 0
  file_path: /ifs/labs/bradner/bam/hg18/kms11/092813_kms11_tko_s3.hg18.bwt.sorted.bam
--------------------------------------------------------------------------------
counting chromosome chr1 - Tue Nov 5 01:48:11 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr2 - Tue Nov 5 01:52:24 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr3 - Tue Nov 5 01:56:41 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr4 - Tue Nov 5 02:00:12 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr5 - Tue Nov 5 02:03:36 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr6 - Tue Nov 5 02:06:50 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr8 - Tue Nov 5 02:09:48 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr9 - Tue Nov 5 02:12:26 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chrX - Tue Nov 5 02:14:51 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr10 - Tue Nov 5 02:17:28 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr11 - Tue Nov 5 02:19:55 EST 2013
  
--------------------------------------------------------------------------------
counting chromosome chr12 - Tue Nov 5 02:22:17 EST 2013
                                   
--------------------------------------------------------------------------------
counting chromosome chr13 - Tue Nov 5 02:24:35 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr14 - Tue Nov 5 02:26:29 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr15 - Tue Nov 5 02:28:17 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr16 - Tue Nov 5 02:30:01 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr17 - Tue Nov 5 02:31:38 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr18 - Tue Nov 5 02:33:03 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr19 - Tue Nov 5 02:34:28 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr20 - Tue Nov 5 02:35:37 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr21 - Tue Nov 5 02:36:44 EST 2013
   
--------------------------------------------------------------------------------
counting chromosome chr22 - Tue Nov 5 02:37:35 EST 2013
   
./bam_liquidator_bin_counter.sh: line 262: syntax error: unexpected end of file
searching for next unprocessed .bam in /ifs/labs/bradner/bam/hg18

Misc Notes

  • each .bam file (e.g. 04032013_D1L57ACXX_4.TTAGGC.hg18.bwt.sorted.bam) represents the chromosomes of the full human genome for some particular biological sample
    • the sample is spread across many people that share some particular trait of interest (e.g. same cancer)
    • there are 22 numbered chromosomes (chr1 - chr22) plus chrX, chrY, and chrM
      • some of the bam files also have other "hap" labeled chromosomes such as chr5_h2_hap1, chr6_cox_hap1, and chr6_qbl_hap2
      • for this analysis, we are only using chr1-22 and chrX
        • we are excluding chr7, chrY, chrM, and any that say hap on them
    • each chromosome contains both DNA strands, and bamliquidator uses +, -, or . to distinguish between them
      • strand + is the forward strand
      • strand - is the reverse/complementary strand
      • . means check both strands
    • each .bam file represents zero or more reads of base pairs at specific locations on the chromosomes
      • each read is a sequence of base pairs, and they tend to be clumped together
      • the beginnings and ends of chromosomes tend to not have any reads associated with them
      • most regions of the chromosome should have reads associated with them
      • Tablet is a nice tool to visualize the base pair reads of a .bam file
        • samtools view and samtools tview are ways to see reads on a command line
  • a gff file has been provided splitting the genome into 2472 bins
    • each bin is numbered (bin 0 through bin 2471)
    • each bin is 100000 base pairs long
    • the bin count starts at 0 for each chromosome (e.g. the first bin of chr1 and chr2 are both "bin 0")
    • we may want to shrink the bin size in the future so that we can do finer grained analysis
  • November 18, 2013 Notes