-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathTODO
65 lines (54 loc) · 2.83 KB
/
TODO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
# TODO
## New features
- [ ] Introduce repeat mutations
- [ ] Introduce CNV variants
- [ ] Add an option to simulate new transcripts variants and new genes.
##
- [ ] Print simCT configuration values (ins_rate and such in a output file) in STDERR so it can be redirected to some "log" file
- [ ] Add a mutation generator that is based on a VCF file for example the one from dbSNP
- [ ] Find a syntax for generating a simulated experiment (in YAML if possible)
- [ ] We should be able to run the pipeline step separatly if we want (like FLUX does)
- [ ] Introduce a model to create multi-genome samples with a kind of phylogeny
- [ ] Use flux to samples these genomes with the same expression profile but
with minor changes
- [ ] Complete code documentation
- [ ] Complete code unit-tests
- [ ] Simplify the fusion creation : only pass the two exons in argument...
- [ ] Save simulated genome so it can be reconstructed latter
- [ ] Sort FASTQ sequences by start position of mapping to maximise the compression!!!
- [x] Find a nicer way to encode error positiions in read names with a more compact
style
- [ ] Find a nice way to specify (in the CL) the mutations rates for both random and vcf-based
mutations generator
- [x] Place code from simCT into a new module (for flux-output post-process mainly)
- [x] Output should be compressed to GZIP
- [x] Break the genomeSimulator in more pieces...
- [x] Generates chimera output
- [x] Add SAM support for alignements or alignment should be encoded in reads names?
Readname seems to be the best solution, and it can still be transformed to
BAM, err-file and such
- [ ] Do not mute N bases
- [ ] Add a way to simulate Post-transcriptionnal mutations
- [ ] Scan experimental data (aligned) to determine the SimCT parameters that would be the
closest to the real world.
- [ ] Add a function to test if the ReadSimulator softwares is accessible from the path and
check its version?
The output of simCT should be:
- FASTQ(s) sequence file (gzipped)
[- Sorted-indexed BAM for alignment]
- VCF file for sequenced mutations
- chimeras file for chimeric breakpoints
[- Err file for error positions]
[- Exon/transcripts/gene counts]
- directorie(s) with simulated genome(s)
- directorie(s) with simulation's outputs
# Read name encoding
format for encoded alignements and errors into the read-name
(genome_id)id:(chr,(-)pos,cigar(;)?)+,(err_pos(;)?)+
ex: a0:1,12345,20M2I3X30M;2:-6789:20M:CABgAAAKAAEAABgCAAAABABAQAAgAAACDD
CABgAAAKAAEAABgCAAAABABAQAAgAAACDD : encoded error list in base 64 (decode as follow : 1,12,23,43,45,62,78,89,91,120,132,148,167,187,192,193,198,199)
# Multi-clonal model
1. Multi-genomes => Choose a method to create the pylogenye of simulated genomes...
2. Multi expression profiles => Chose a function to do that...
# Configuration YAML
Configuration file should be able