\documentclass[12pt,a4paper]{report}
\usepackage[english]{babel}
\usepackage{newlfont}
\usepackage{color}
\usepackage[autopunct=true]{csquotes}
%\usepackage{quotchap} for different chapters titles layout
\usepackage{epigraph}
\usepackage{bm}%to make math symbols in bold style.
\usepackage{hyperref} %hyperlinks
\usepackage{graphicx} %figures
\usepackage{amsmath} %math
\usepackage{amsfonts} %to do the bold W
\usepackage{mhchem}
%\usepackage[toc,page]{appendix}
\usepackage{fancyhdr}%Add chapter name at the beginning of each page
%\pagestyle{fancy}
\setlength{\headheight}{15pt}% ...at least 51.60004pt
%\usepackage[]{lineno} Professor
%\runninglinenumbers
\usepackage[titletoc,title]{appendix}
\usepackage{float} %to force position of figures
\usepackage{mathtools}
\usepackage{unicode-math}
%\usepackage[notindex]{tocbibind} you will no longer need to provide the instruction \addcontentsline{toc}
\usepackage[nottoc]{tocbibind}%to remove contents from table of contents and to add bibliography to it.
\renewcommand*\footnoterule{}
\usepackage{tocloft} %to personalize table of contents
% Update \subsections in ToC
%\renewcommand{\cftsubsecleader}{~}% Content between subsection title and page number
%\renewcommand{\cftsubsecafterpnum}{\hfill\mbox{}}% Content after subsection page number
\cftsetrmarg{12em}
\renewcommand{\cftpnumalign}{l} % left aligned page numbers
\usepackage{caption}
\captionsetup[figure]{hangindent=0pt, singlelinecheck=false}
\usepackage[flushmargin]{footmisc}
%\usepackage[nottoc]{tocbibind} To add sections in the table of contents
\textwidth=450pt\oddsidemargin=0pt
\newcommand{\pyobject}[1]{\fbox{\color{red}{\texttt{#1}}}}
\begin{document}
\begin{titlepage}
%
%
% ONCE YOU ARE FINISHED WITH YOUR CHANGES MODIFY "RED" WITH "BLACK" IN ALL \textcolor COMMENTS
%
%
\begin{center}
{{\Large{\textsc{Alma Mater Studiorum $\cdot$ University of Bologna}}}}
\rule[0.1cm]{15.8cm}{0.1mm}
\rule[0.5cm]{15.8cm}{0.6mm}
\\\vspace{3mm}
{\small{\bf School of Science \\
Department of Physics and Astronomy\\
Master Degree in Physics}}
\end{center}
\vspace{23mm}
\begin{center}\textcolor{black}{
%
% INSERT THE TITLE OF YOUR THESIS
%
{\LARGE{\bf Modeling and data analysis of biochemical oscillators using Chemical Master Equation and AI: applications to the NF-kB activity in patient derived xenografts}}\\
}\end{center}
\vspace{50mm} \par \noindent
\begin{minipage}[t]{0.47\textwidth}
%
% INSERT THE NAME OF THE SUPERVISOR WITH ITS TITLE (DR. OR PROF.)
%
{\large{\bf Supervisor: \vspace{2mm}\\\textcolor{black}{
Prof. Enrico Giampieri}\\\\
%
% INSERT THE NAME OF THE CO-SUPERVISOR WITH ITS TITLE (DR. OR PROF.)
%
% IF THERE ARE NO CO-SUPERVISORS REMOVE THE FOLLOWING 5 LINES
%
\textcolor{black}{
\bf Co-supervisor:
\vspace{2mm}
\newline
Prof. Daniel Remondini\\\\}}}
\end{minipage}
%
\hfill
%
\begin{minipage}[t]{0.47\textwidth}\raggedleft \textcolor{black}{
{\large{\bf Submitted by:
\vspace{2mm}\\
%
% INSERT THE NAME OF THE GRADUAND
%
\textcolor{black}{
Manuela Carriero}}}
}
\end{minipage}
\vspace{15mm}
\begin{center}
%
% INSERT THE ACADEMIC YEAR
%
Academic Year \textcolor{black}{ 2021/2022}
\end{center}
\end{titlepage}
\newpage
\thispagestyle{empty}
\mbox{}
\newpage
\thispagestyle{empty} %remove the page number
\epigraph{``Behind the great achievements of a Gauss or an Einstein is in all cases a life devoted to contemplation, curiosity, collaboration and, perhaps most of all, hard work.
''\footnotemark}{Brian D. Burrell, Professor of Neuroscience}
\footnotetext{GENIUS IN A JAR, Scientific American 2015 Sep; 313(3):82-7.}
\newpage
\thispagestyle{empty}
\mbox{}
\newpage
\thispagestyle{empty}
\begin{center}
\textbf{DEDICATION}\\
\vspace{9mm}
Dedicato alla mia famiglia
\end{center}
\newpage
\thispagestyle{empty}
\mbox{}
\newpage
\begin{flushleft}
{\LARGE{\bf Abstract}}\\
\vspace{9mm}
%The Approximate Bayesian Computation approach has many drawbacks, among them the fact that it is computationally expensive.
%The three main circuits analyzed are simple to model but represent important biological systems, and their simulation and analysis is by no means trivial: a great deal of brainpower and imagination is needed here too in order to reveal important features that such systems hide. The results obtained can be exploited in more articulated systems such as NF-kB, which is our target.
%Chronic diseases, cancers and diabetes are associated with dysregulation of many biochemical cues. These biochemical cues are proteins that regulate cellular activity, migration and death. The synthesis of these proteins is regulated by nuclear transcription factors. One of the most studied transcription factors is nuclear factor kappa B (NF-kB). Many different proteins have been identified that regulate the activity of NF-kB. Yet, how these proteins regulate NF-kB is still unclear (From the thesis of Huanling Liu).
%(from the lecture of 15/10/2018)
%And of course if you change these constants you will have different behaviors. So another step will be to learn how to play with the parameters, because this is a huge problem in modeling. The usual problem is how to find the parameter values. This is not trivial and there are some possibilities that I will show in the next lecture.
%transcription factors have functional properties because they interfere in some biological process.
%In order to regulate the RNA production, nature invented the transcription factors.
%Our modeling of NF-kB is very simplified and idealized, like the ideal gas or the Michaelis-Menten reaction.
There are two stages from the DNA sequence of a gene to a protein: transcription, i.e. the process of making a strand of RNA, and translation, i.e. the process by which a protein is synthesized from the information contained in an RNA molecule. In the process of transcription, proteins called ``transcription factors'' play a central role because they bind to a DNA sequence and help the transcription initiation complex. In this thesis we are particularly interested in modeling the activity of nuclear factor kappa B (NF-kB), a transcription factor that is ubiquitous within cells and whose dysfunction leads to chronic diseases, cancers, neurodegenerative diseases and much else. However, we do not start by modeling its behavior directly. In this stochastic context, we first study algorithms to solve the Chemical Master Equation (CME), proposing an alternative algorithm of our own, called ``hybrid'' because it combines Gillespie's Stochastic Simulation Algorithm (SSA) with the tau-leaping algorithm with the aim of improving simulation speed; secondly, we analyse the stochastic simulation results of three basic genetic circuits (the simplest model of gene expression, the autorepressor and the toggle switch); third, we face the problem of parameter estimation for these simple models using artificial neural networks (ANNs); finally, aware of what we have learned through all these steps, we provide a very small NF-kB model using the CME. The relevant results are the following: the hybrid algorithm applied to the first genetic model is faster than the SSA in configurations where the number of molecules produced tends to be high; periodicity arises from what we defined as unpredictable (these being stochastic processes), in particular from the autorepressor analysed with the wavelet transform; ANNs learn to predict parameters given autocorrelations as input, providing also information about the chemical species; finally, our small NF-kB model shows the oscillating behavior expected from experiments.
\end{flushleft}
\thispagestyle{empty}
\clearpage
\pagenumbering{arabic}
%to reset the page number.
\newpage
\tableofcontents
\newpage
%First part: introduction into heterogeneity and how to deal with it using physical methods.
\part{Introduction to biological stochasticity and physical methods for its investigation}
\pagestyle{fancy}
\makeatletter% Needed because of @ in macro name ? https://tex.stackexchange.com/q/8351/277964
\renewcommand{\chaptermark}[1]{%
\markboth{\@chapapp \thechapter\ --\ #1}{\@chapapp \thechapter\ -- \ #1}% chapter/appendix name and number moved from \fancyhead[L]{}
}
\makeatother% because of \makeatletter before
\fancyhead[R]{}
\fancyhead[L]{\leftmark}% chapter/appendix name and number moved to \chaptermark
\chapter{Introduction}\epigraph{``There's no gene for fate''}{from GATTACA film}
Kevin Burrage et al. in their important paper \emph{Stochastic simulation in systems biology} \cite{Article} highlight how mathematical modelling and computational simulation play essential roles in biology. Models are useful for many different purposes: perhaps the most important of these are crystallising our assumptions and testing them, guiding experiments and looking at experimentally unreachable scenarios, as well as the great ability to make new predictions.
Ideally, there should be a virtuous cycle between experiment and theory to understand the natural system in which we are interested.
If the model and experiments agree, then parameters for the model can be inferred from the data and the model refined, again leading to further experiments. If they do not agree, this implies that the hypothesis was inappropriate: the model needs revision, and theoretical tests using different models can provide a starting point for further experiments.
As defined by Kevin Burrage et al., a model is an abstract representation of the system in which we are interested, usually formulated mathematically. This model can, in simple cases, be solved analytically for a particular question, giving us an \emph{exact} analytical solution to the question posed to the model. This means that the full probability distribution for the state of the biological system over time can be calculated explicitly.
In general, however, analytical solutions for all but the most simple models do not exist, and we need to use methods to solve the model numerically. These last methods give rise to numerical solutions which are approximations to the analytical solution.% that become more and more accurate as we use smaller and smaller numerical steps (the unit of discretisation of a numerical approach)
A numerical solution tries to mimic the behaviour of the real system over time. This is also referred to as a simulation.
An important point to remember when we talk about mathematical models is that, in a sense, ``all models are wrong''; but this does not mean their conclusions or predictions are false since, as the famous quote continues, ``some are useful''.
Rather, we should bear in mind that all models are abstractions of reality, simplified versions of the real systems that they represent. A model is a vehicle for gaining understanding, and we would not gain any new understanding from a model that was as complicated as the real system.
Almost all models are phenomenological; that is, they are based on a set of simplified assumptions that we have distilled from the real system using intuition rather than rigorous proof from basic principles. Incorporating only the important ones allows us to crystallise our understanding of the system (and if the model gives wrong results, we know that our assumptions were wrong or incomplete, hence it is in any case a gain in knowledge of our system under study).
Creating phenomenological models is not a trivial task: ``sensing which assumptions might be critical and which irrelevant to the question at hand is the art of modeling''. The choice of what to include in the model lies with the modeller and his aims.
%It depends on: the choice of modelling/simulation method is determined by the aspects of the natural system in which we are interested, how realistic an estimation of it we want, and our mathematical and computational resources.
%Importance of models such as the SIR model nowadays.
%Simulation modeling activities encapsulate all three major modeling methods of data analytics: descriptive, predictive, and prescriptive. Descriptive modeling uses historical data to describe what happened in order to understand past behavior of a system. Predictive modeling uses historical data to develop models that help us understand future behavior in order the answer what may happen. Descriptive modeling summarizes past data for understanding. Predictive modeling uses past data to predict future behavior. Prescriptive modeling indicates what should be done and is integral to answering questions involving system design.+finale https://rossetti.github.io/RossettiArenaBook/why-simulate.html
\section{Heterogeneity in biology}\label{Heterogeneityinbiology}
Stochastic computational methods have become commonplace in science because they are able to appropriately account for a key feature of biology: heterogeneity.
There are three different sources of heterogeneity in natural systems: genetic, environmental (also known as \emph{extrinsic noise}) and stochastic (also called \emph{intrinsic noise}).
Genetic heterogeneity occurs through the production of single or similar phenotypes through different genetic mechanisms \cite{GeneticHeterogeneity}. The general ``rule'' in biology is that cells and animals with different genes should clearly be different in phenotypes. However, the so called ``convergent evolution'' represents an exception to this rule where different species evolved similar phenotypic features independently. This phenotypic similarity is thought to usually have a different genetic basis, showing that the relationship between genotype and phenotype is not a simple, linear one \cite{genotypevsphenotype}.
On the other hand, even isogenic (that is, genetically identical) organisms or populations of cells can be very different; the famous cloned animals are excellent illustrations of this phenomenon, known as non-genetic or phenotypic heterogeneity. Daniel Stockholm et al., in their work \emph{The Origin of Phenotypic Heterogeneity in a Clonal Cell Population In Vitro} \cite{heterogeneityinclones}, show phenotypic switching in initially identical cells. This suggests that heterogeneity is not generated by genetic variation alone, but is also triggered by two other factors.
The first of these is the consequence of \emph{extrinsic factors}: initially identical cells may become different because they encounter different local environments that induce adaptive responses. Examples of such factors could be the following: if we were interested in animal populations, the weather an animal experiences in a particular year; if we were interested in the levels of protein expression in a cell population, differences between individual cells such as ribosome number and cell-cycle stage.
The third source of heterogeneity is instead \emph{intrinsic} to cells and may occur even in homogeneous environments. Indeed identical genotype and environmental exposure are not sufficient to guarantee a unique
phenotype. Intrinsic heterogeneity is present in all living (and non-living) systems but is often masked by the macroscopic scale at which we observe them. Spudich and Koshland were perhaps the first to observe this in a cell biology context. As explained in their paper \emph{Non-genetic individuality: chance in the single cell} \cite{intrinsic}, they noticed that individual bacteria from an isogenic population maintained different swimming patterns throughout their entire lives. This was a visible manifestation of the third source of heterogeneity: \emph{chance}, otherwise known as \emph{stochasticity} or \emph{intrinsic heterogeneity} (often interchangeably called \emph{intrinsic noise}).
This arises from random thermal fluctuations at the level of individual molecules. It affects the DNA, RNA, proteins and other chemical molecules inside cells in many ways, most notably by ensuring that their reactions occur randomly, as does their movement (via Brownian motion). Two very clear examples are the following: consider a single mother cell dividing into two daughter cells of equal volume. During the division process, all the molecules in the mother cell are in Brownian motion according to the laws of statistical mechanics, so the probability that each daughter cell inherits the same number of molecules is infinitesimally small. Moreover, even in the event that the two daughter cells receive exactly one copy of a particular transcription factor, each transcription factor will perform a Brownian random walk through its cellular volume before finding its target promoter and activating gene expression. Because Brownian motion is uncorrelated in the two daughter cells, it is statistically impossible for both genes to become activated at the exact same time, further amplifying the phenotypic difference between the two daughter cells. These are just two examples of the many sources of gene expression
variability that arise in isogenic cells exposed to the same environment \cite{geneexpressionnoise}.
Intrinsic heterogeneity is inherent to the process of gene expression and cannot be predicted (except in a statistical sense) or fully eliminated. Thus two identical genes in an identical intracellular environment would still be expressed differently, even in the absence of the previous two sources of heterogeneity.
Hence the reason for the terminology is clear: we speak of extrinsic heterogeneity because it arises from outside sources and affects all genes inside a cell equally, whereas the random fluctuations are actually inherent to the expression of a single gene.
It is important to point out that these three sources of heterogeneity are not independent, because they are all interconnected in their effects on cells and populations. For instance, cellular decision making involves both environmental and intrinsic heterogeneity, the former affecting which stable states are possible and the latter driving the switching between them.
Another example is that of genetic mutations: intrinsic noise causes the mutations that then contribute to genetic heterogeneity, creating the heritable variation that allows evolution to occur.
From this, a very interesting example of the interplay of all three types of heterogeneity is evolution, one of the most fundamental processes in nature: evolution acts on phenotypes, which have a genetic basis but are also affected by both extrinsic and intrinsic noise. Studying how evolution shapes noise is a key challenge, as highlighted in the work \emph{Adaptive noise} by M. Viney et al. \cite{adaptivenoise}. The fact that our DNA (and hence our phenotype) is very different from that of the first humans that appeared on Earth is due to this complex interplay among sources of heterogeneity.
Finally, focusing on intrinsic heterogeneity, we have to point out that it can be a double-edged sword: intrinsic noise has both positive and negative influences on cells and organisms. One positive effect has already been mentioned: it allows evolution to occur. A related example is that some biological systems have evolved to make use of it: for instance, persister-type bacteria, which form a subset of some bacterial populations and can withstand antibiotic treatments even though they do not have genetic mutations for resistance. This is an example of cellular decision making, the ability of cells to randomly transition between different stable, heritable states.
Some of the most well-known examples of cellular decision making are found in bacteria and yeast, which are easier to investigate experimentally than large multicellular organisms. Yet this phenomenon is common to cells at all levels of life and is argued to be one of the key processes in cellular development. The main difference is that in unicellular organisms cellular decision making is useful for cheap, non-genetic adaptation in the face of fluctuating environments, whereas in multicellular organisms it is used to produce distinct cell types and functions within a constant environment.
Intrinsic noise is also an interesting object of study since it represents a ``fingerprint'' with which to explore fundamental questions on mechanisms and dynamics of gene regulation by means of measurements of cell-to-cell variability in gene expression \cite{geneexpressionnoise}.
However, as said before, intrinsic noise causes mutations, which can also have negative effects. Deleterious mutations that are phenotypically expressed frequently result in the loss of important functions or even in organismic death, and it has been shown that a substantial proportion of mutations are deleterious. Mutations are also thought to cause cancer, which has become a serious disease in modern times. %in stem and somatic cells
%+the part that talks about the positive and negative effects of regulation and the fact that all heterogeneities are interconnected.
%The fact that different cells can have the same genetic makeup suggests that genetic heterogeneity is not the only source: this phenotypic diversity can also be caused by the environment and by chance.
%We will start from the simplest genetic circuits and work our way up to NF-kB.
%Maybe try starting with the sentence ``We and all other animals are machines of our genes''. Originally published in 1976, Richard Dawkins' book argues that genes are the basic unit of evolution, not individual organisms or even species. Due to their naturally selfish behavior, genes merely use organisms as mechanisms to ensure their own survival.
%However, there are processes like those described in the article ``Using gene expression noise for understanding gene regulation'' which show that our genes are not everything.
%Single-cell experiments.
%What the whole thesis will be about.
%Our DNA changes through contact with the environment: humans of the past had DNA different from ours, and this can be seen from their appearance (phenotype).
%From stochastic simulations in biological systems.
%In this thesis, we will refer to reactions occurring inside a cell volume between different types of biochemical molecules with various reaction rates. These will generally be genes, i.e. sections of DNA that are copied into mRNA and initiate a gene expression network, RNA or proteins.
\section{Motivation of the work and the consequent workflow}
%The biological background with basic concepts such as gene regulation.+%Randomness is also present in the initiation of CRC tumors: 70% of tumors occur sporadically and, according to Fearon's genetic model, a first mutation hits a tumor-suppressor gene, adenomatous polyposis coli (APC), initiating colorectal tumorigenesis. 30% have a familial basis.
Following the line of reasoning of the previous section, cancers like colorectal cancer (CRC) are mainly caused by mutations that target oncogenes, leading to novel or increased functions, or by alterations that drive the loss of function of tumor-suppressor genes. This provides neoplastic cells with a survival advantage over the surrounding normal intestinal epithelium.
Approximately 70\% of CRC cases are sporadic and, generally, develop from a point mutation that occurs spontaneously during the lifetime.
It is very likely that these mutations lead to a malfunction of NF-$\kappa$B transcription factor activity: its behavior is normally modulated by its regulators, the I$\kappa$B protein family, but in CRC it is constitutively expressed.
The biology of NF-$\kappa$B will be explained in Chapter \ref{NFkBBiologyExplained}, and then a simple model of the feedback loops which involve it will be provided. However, before arriving at this point, we will examine simple genetic models that will serve as ``training'' before diving into more complex stochastic models.
This will let us explore, first of all, \emph{methods} for the simulation of stochastic processes: we review them in Chapter \ref{Simulations} and present our alternative algorithm, called ``hybrid'', in Section \ref{hybrid}; it numerically solves the Chemical Master Equation with the aim of being faster than the SSA (described in Section \ref{Gillespie}) while preserving the accuracy of the results. Secondly, methods for analyzing the data coming from these stochastic simulations, in Chapter \ref{ChapterPatternRecognition}, in particular delving into the use of the Continuous Wavelet Transform, which turns out to be an interesting ``instrument'' for the analysis of stochastic process signals (some also call it ``a mathematical microscope for scanning signals'' \cite{UsefullYoutube}). Last, but not least, a method to extract model parameters, which in this thesis is Artificial Neural Networks (their basic principles are explained in Chapter \ref{ArtificialNeuralNetwork} and their application in Chapter \ref{ANNatwork}). About this last point, we chose to give the autocorrelation values as input because they depend on the model parameters; hence we train ANNs to recognize the model parameters (in this case the reaction rate constants, which are explained in Chapter \ref{models}). Nowadays neural networks are used in many fields, so their use in ``absolute terms'' is not a great novelty. The innovation lies in the context in which they are used, because the problem of parameter estimation is typically faced with standard methods, such as Maximum Likelihood Estimation (MLE), that have several disadvantages. Hence, we tried an alternative method.
Finally, a small NF-$\kappa$B model will be provided, using Gillespie's Stochastic Simulation Algorithm to simulate its activity (Chapters \ref{themodelformulation} and \ref{themodelsimulation}).
%PUT THE CHAPTERS AND POINT OUT YOUR ORIGINAL WORK.
%Approximate Bayesian Computation is computationally expensive; point to the discussion on stackexchange.
%+the Professor's points.
%the important thing is to have clearly highlighted the novel parts of the work, among which: 1) the hybrid algorithm 2) the application of wavelets for the analysis 3) the use of the neural network for prediction 4) the formulation of the CME for NF-kB.
\section{Deterministic versus stochastic models}
Before the discovery of intrinsic noise, the mathematical and computational methods traditionally used were typically \emph{systems of differential equations}, which are both \emph{continuous} and \emph{deterministic}: their state variables are real numbers representing the concentrations of molecules, and they do not include noise.
Such models can be considered as accurate only when we are interested in the mean dynamics of a large number of molecules, large enough that we need not worry about individual molecules but can approximate them as concentrations \cite{Article}. Above a molecular population size of the order of Avogadro's number, the fluctuations from intrinsic noise are averaged out and the deterministic approximation becomes increasingly valid.
This is because intrinsic noise, as a rule of thumb, behaves as $\frac{1}{\sqrt{X}}$, where \emph{X} is the number of molecules in the system. This is confirmed by experimental studies that find that total noise does scale roughly as the inverse square root of abundance until high abundances, at which point extrinsic noise is thought to take over as the dominant source of noise \cite{noisewithabundance}.
Therefore, it often becomes necessary to include the effects of stochasticity in biological models, especially for small systems with low populations of some molecular species, such as \emph{gene expression networks}. Here, \emph{discrete stochastic models} must be used, whose variables represent actual molecular numbers.
In the cellular environment, the reactant numbers tend to be of the order 10-1000, hence a reaction altering the population by one or two generates a large relative change, and the molecule numbers no longer evolve differentiably. Furthermore, reactions no longer occur ``continuously'' over an infinitely small time interval, but rather progress in a series of steps of finite time width.
A very clear example may be to imagine the national birth rate as compared to the chances my next-door neighbor will have a baby. One often hears statements such as: ``Every X minutes, a baby is born in the US.'' That clearly cannot be true of my next-door neighbor. Evolution of the population of an entire country can be well-described using differential equations, but individual courtship is an essentially probabilistic affair, requiring a more sophisticated formulation \cite{SanDiegoUniversity}.
Thus, for many reactant molecules, the species concentrations evolve both continuously and differentiably. When
small numbers of reactants are involved, due to the probabilistic nature of individual reaction events and the finite change in molecule numbers incurred, the concentration evolves step-wise.
And so, while \emph{the law of mass balance} is used to construct deterministic chemical rate equations, which take the form of a system of coupled nonlinear differential equations under the assumption that the concentrations of the reactants vary continuously and differentiably (the case of molecule numbers of the order of $10^{23}$, where a change of one or two molecules is, for all intents and purposes, infinitesimal), \emph{the probability balance} is used as the conservation law of the microscopic description. For this purpose we will consider the \emph{Chemical Master Equation}, which describes the distribution of a population of stochastic systems across states.
%General description of the master equation and where it is used in Physics, drawing inspiration also from the Wikipedia introduction + the equation of the Schlögl reaction system.
%\section{Approximate Bayesian Approach vs Deep Neural Networks for parameters estimation}
%The Professor said it took a lot of time, so this is a point against the approximate Bayesian approach. However, many works on NF-kB (like the thesis you found, or Giosuè's) use the Bayesian approach, so study them.
\section{The importance of random numbers generation}
Before starting, let us briefly make clear the importance of random number generation in this context. Random numbers let us simulate what we have called ``biological stochasticity''. We will see this clearly when talking about stochastic simulations in biological systems in Chapter \ref{Simulations}. But we will also exploit them when we need a particular distribution of parameters for the same model (as done in Chapter \ref{ArtificialNeuralNetwork}). Thus, for simulations, random number generation is a particularly important ingredient.
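As a minimal illustration (a sketch in Python, assuming the NumPy library; the variable names and parameter values are ours and purely illustrative), a seeded pseudo-random generator can reproducibly draw the exponential waiting times and uniform variates on which the stochastic simulations of Chapter \ref{Simulations} rely:
\begin{verbatim}
import numpy as np

# A seeded generator: the same seed reproduces the same "random" stream,
# which is essential for debugging stochastic simulations.
rng = np.random.default_rng(seed=42)

total_rate = 2.5                                  # illustrative propensity
waiting_time = rng.exponential(1.0 / total_rate)  # time to the next event
u = rng.uniform()                                 # variate to pick a reaction
print(waiting_time, u)
\end{verbatim}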
% in a cellular environment
\chapter{Modeling reaction kinetics in a deterministic framework}\label{models}
In this chapter, we describe three simple genetic circuits that are common in nature: the simplest process of gene expression and protein synthesis, the autorepressor case and the toggle-switch model. To do so, we need to describe different chemical reactions.
Generally, we can have a chemical reaction like this:
\begin{equation}
\ce{\alpha A + \beta B + ... <=> \sigma S + \tau T}
\end{equation}
where A and B are the reagents and S and T are the products of the reaction. This reaction is reversible and, as such, according to the \emph{Law of Mass Action}, we can define a \emph{forward reaction rate} and a \emph{backward reaction rate}. Each rate is related to a constant (for instance, the \emph{forward reaction rate} is related to $k_{+}$) and to the concentrations of the reagents:
\begin{equation}
rate = k_{+} A^{\alpha} B^{\beta}
\end{equation}
where $\alpha$ and $\beta$ are called the stoichiometric coefficients. Analogously, we can write an equation for the backward reaction rate.
However, the specific forms of these rate equations depend on the order of the reaction. In the genetic circuits that we will describe, we will consider only first-order reaction rates.
When we have a chemical reaction, we are first interested in describing it using differential equations, hence in a deterministic framework.
Thus we will turn these biological processes into a deterministic mathematical description; later, the molecular noise in these systems will be taken into account and we will study them by means of stochastic methods.
\section{Constitutive gene expression}\label{modelconstitutivegeneexpression}
In the simplest possible model of \emph{constitutive} gene expression, an RNA molecule is produced at a constant rate $k_{1}$ and destroyed in a first-order reaction with rate constant $k_{2}$. Fig. \ref{costitutiveexpression} shows a sketch of this simple biological process.
\begin{figure}[!ht]
\centering
\includegraphics[scale=0.64]{costitutiveexpression.jpg}
\caption{Schematic of a constitutive gene expression model with transcription rate $k_{1}$ and mRNA degradation rate constant $k_{2}$.}
\label{costitutiveexpression}
\end{figure}
As already anticipated in the introductory part, if the total number of a particular transcript \emph{m} is large, the kinetics is conveniently represented as a deterministic differential equation for species concentrations. This plays the role of the mass-balance conservation law, i.e. change = flux in - flux out \cite{geneexpressionnoise}.
\begin{equation}\label{RNAcostitutiveexpressionequation}
\frac{dm}{dt} = k_{1} - k_{2}m
\end{equation}
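Note that setting $\frac{dm}{dt} = 0$ in Eq. \ref{RNAcostitutiveexpressionequation} immediately gives the steady-state transcript level
\begin{equation}
m^{*} = \frac{k_{1}}{k_{2}}
\end{equation}
a useful sanity check for the simulations discussed later.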
The deterministic approximation breaks down in cells when the copy numbers of transcripts are small, requiring a probabilistic reformulation in which the conservation law is described by the Chemical Master Equation, explained in Chapter \ref{Simulations}.
In the constitutive expression model, transcript births and deaths occur as uncorrelated events, such that in any short time interval, $dt$, the probability of one transcript production is $k_{1}dt$, and the probability of one transcript degradation is $k_{2}mdt$ \cite{geneexpressionnoise}.
The scheme in Fig. \ref{costitutiveexpression} shows only the transcription process, in the case where the gene is always active.
We can also extend this model to include protein production, as shown in Fig. \ref{proteincostitutiveexpression}.
\begin{figure}[!ht]
\centering
\includegraphics[scale=0.64]{proteincostitutiveexpression.jpg}
\caption{Schematic of a constitutive gene expression model with transcription rate $k_{1}$ and mRNA degradation rate constant $k_{2}$, translation rate $k_{3}$ and protein degradation rate constant $k_{4}$.}
\label{proteincostitutiveexpression}
\end{figure}
Similarly to Eq. \ref{RNAcostitutiveexpressionequation}, we can write the rate of change of protein concentration \emph{p} as:
\begin{equation}\label{proteincostitutiveexpressionequation}
\frac{dp}{dt} = k_{3}m - k_{4}p
\end{equation}
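As a concrete illustration, Eq. \ref{RNAcostitutiveexpressionequation} and Eq. \ref{proteincostitutiveexpressionequation} can be integrated numerically. The following is a minimal sketch in Python (assuming the NumPy and SciPy libraries; the rate constants and initial conditions are arbitrary, purely illustrative values):
\begin{verbatim}
import numpy as np
from scipy.integrate import solve_ivp

k1, k2, k3, k4 = 2.0, 0.1, 1.0, 0.05    # illustrative rate constants

def rhs(t, y):
    m, p = y
    dm = k1 - k2 * m                     # transcription - mRNA degradation
    dp = k3 * m - k4 * p                 # translation - protein degradation
    return [dm, dp]

sol = solve_ivp(rhs, (0.0, 100.0), y0=[0.0, 0.0],
                t_eval=np.linspace(0.0, 100.0, 500))
# sol.y[0] is m(t) and sol.y[1] is p(t); both relax towards the steady
# state m* = k1/k2 and p* = k3*m*/k4.
\end{verbatim}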
\section{Regulated gene expression}\label{regulatedgeneexpressionsection}
Many genes are constitutively expressed at intermediate levels, which represents a permanent cost but provides an immediate benefit when the protein is needed. On the other hand, \emph{regulated} genes are expressed only under certain conditions, in order to save cellular energy. In other words, they are switched on only when needed, and they can also be termed ``inducible''. Gene regulation is the key ability of an organism to respond to environmental changes \cite{generegulationNIH}. Fig. \ref{regulatedgeneexpression} shows a sketch of the gene-regulated protein synthesis model.
\begin{figure}[!ht]
\centering
\includegraphics[scale=0.40]{simplest_protein_synthesis_model.jpg}
\caption{Schematic of a regulated gene expression model with gene activation rate $k_{a}$ and gene inactivation rate $k_{i}$, transcription rate $k_{1}$ and mRNA degradation rate constant $k_{2}$, translation rate $k_{3}$ and protein degradation rate constant $k_{4}$.}
\label{regulatedgeneexpression}
\end{figure}
In particular, we can consider a two-state model of gene expression.
This model considers two promoter states: an OFF state, in which no transcription occurs and an ON state, which has transcription rate $k_{1}$.
The constants $k_{a}$ and $k_{i}$ define the transition rates between the two states, and $k_{2}$ is a first-order rate constant for transcript degradation.
The OFF state is usually associated with a closed chromatin state in which the binding sites for transcription factors are inaccessible, whereas the ON state is associated with the open active chromatin state \cite{geneexpressionnoise}.
And so, a simple way to include transcription factor control in the kinetic equations above is to modify $k_{1}$ using the promoter activity function $g( \cdot )$. Hence, the deterministic equation Eq. \ref{RNAcostitutiveexpressionequation}, written for RNA in the case of an unregulated gene, becomes:
\begin{equation}\label{regulatedgeneequation}
\frac{dm}{dt} = k_{1} \cdot g(\cdot) - k_{2}m
\end{equation}
while Eq. \ref{proteincostitutiveexpressionequation} for proteins remains unchanged.
$g( \cdot )$ can have different behaviors: in the case shown in Fig. \ref{regulatedgeneexpression}, it makes the gene alternate between the active and the inactive state.
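A common phenomenological choice for a repressive promoter activity function (an assumption we state here only for concreteness, since the specific form is not fixed by the scheme above) is a Hill function,
\begin{equation}
g_{R}(r) = \frac{1}{1 + (r/K)^{n}}
\end{equation}
which equals 1 when no repressor is present and decreases towards 0 as the repressor concentration \emph{r} grows, with \emph{K} the repression threshold and \emph{n} the Hill coefficient. A form of this kind will be useful for the autorepressor and toggle-switch models below.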
%WE SAW THE EFFECTS OF REGULATION ON THE DISTRIBUTION IN THE PREVIOUS CHAPTER, AND ALSO THE TIME COURSES.
%Put all the protein synthesis models:
%unregulated gene
%regulated gene
%autorepressor
%toggleswitch
%find out in which contexts they occur.
\section{The Autorepressor model}
As shown schematically in Fig. \ref{autorepressor}, in this case the gene product represses its own transcription.
\begin{figure}[!ht]
\centering
\includegraphics[scale=0.50]{autorepressor.jpg}
\caption{A schematic of an auto-repressing circuit where a blunt line indicates a repressing action while a connector ending with an arrow indicates an activating action.}
\label{autorepressor}
\end{figure}
The proteins produced by the gene decrease the rate of activation of the gene. Eq. \ref{regulatedgeneequation} and Eq. \ref{proteincostitutiveexpressionequation}, written for the first simple protein synthesis model, are then modified in the following way:
\begin{equation}
\frac{dm}{dt} = k_{1} \cdot g_{R}(r) - k_{2}m
\end{equation}
as regards the rate of change of the RNA molecules. Analogously, for the proteins we can write the same equation as Eq. \ref{proteincostitutiveexpressionequation}, written for the constitutive expression and regulated gene cases, but, for the sake of clarity, we denote \emph{p} by \emph{r} in order to highlight that the protein product behaves as an autorepressor. Hence:
\begin{equation}
\frac{dr}{dt} = k_{3}m - k_{4}r
\end{equation}
\section{The Toggle-Switch model}
The genetic toggle-switch is a system of two mutually repressing genes. A simplified sketch of this model is shown in Fig.\ref{toggleswitch}.
\begin{figure}[!ht]
\centering
\includegraphics[scale=0.75]{toggleswitch.jpg}
\caption{A schematic of the toggle switch where a blunt line indicates a repressing action while a connector ending with an arrow indicates an activating action.}
\label{toggleswitch}
\end{figure}
The repression of gene 2 is mediated by the product of gene 1, which regulates its expression; symmetrically, the same happens for the repression of gene 1.
We would expect the system to exhibit \emph{two} mutually exclusive behaviors: either $r_{1}$ is high, keeping expression of $r_{2}$ low, or conversely, $r_{2}$ is high, keeping expression of $r_{1}$ low \cite{SanDiegoUniversity}.
For simplicity, we assume that the switch is composed of symmetric elements so that the rate constants are identical for each half of the network. Then, as seen in the other cases, the mathematical model takes the form:
\begin{align}
\frac{dm_{1}}{dt}=k_{1}\cdot g_{R}(r_{2}) - k_{2}m_{1} && \frac{dm_{2}}{dt} = k_{1} \cdot g_{R}(r_{1}) - k_{2}m_{2}
\end{align}
\begin{align}
\frac{dr_{1}}{dt}=k_{3}m_{1} - k_{4}r_{1} && \frac{dr_{2}}{dt} = k_{3}m_{2} - k_{4}r_{2}
\end{align}
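To make the expected bistability concrete, here is a minimal numerical sketch in Python (assuming NumPy and SciPy, and assuming a Hill form for $g_{R}$ as in Section \ref{regulatedgeneexpressionsection}; all parameter values are illustrative only):
\begin{verbatim}
import numpy as np
from scipy.integrate import solve_ivp

k1, k2, k3, k4 = 5.0, 0.5, 1.0, 0.2   # symmetric, illustrative rates
K, n = 4.0, 2                          # Hill threshold and coefficient

def g_R(r):
    # Hill-type repression: equals 1 at r = 0, decays to 0 as r grows
    return 1.0 / (1.0 + (r / K) ** n)

def rhs(t, y):
    m1, r1, m2, r2 = y
    return [k1 * g_R(r2) - k2 * m1, k3 * m1 - k4 * r1,
            k1 * g_R(r1) - k2 * m2, k3 * m2 - k4 * r2]

# Two slightly asymmetric initial conditions settle in opposite states:
for y0 in ([0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]):
    sol = solve_ivp(rhs, (0.0, 200.0), y0)
    print(sol.y[1][-1], sol.y[3][-1])  # one repressor high, the other low
\end{verbatim}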
\chapter{Stochastic simulation in systems biology}\label{Simulations}
\section{Stochastic Process}\label{stochasticprocessdefinition}
In a ``rough'' sense, a random process is a phenomenon that varies to some degree unpredictably as time goes on. If we observed an entire time sequence of the process on several different occasions, under presumably ``identical'' conditions, the resulting observation sequences would in general be different \cite{AssocProfThamer}.
This ``rough'' definition recalls the definition of stochasticity given for isogenic cells in the same environment.
In the following sections we will define more rigorously what a stochastic process is, using the notation of Van Kampen's book \emph{Stochastic Processes in Physics and Chemistry} \cite{VanKampenBook}, together with some possible tools for studying random processes.
\subsection{Definition}
%Add and personalize with the notes found on how to study stochastic processes.
A ``random number'' or ``stochastic variable'' is an object \emph{X} defined by: a set of possible values (that we can call also ``set of states'' or ``sample space'') and a probability distribution over this set. Once a stochastic variable \emph{X} has been defined, an infinity of other stochastic variables derives from it, namely all quantities \emph{Y} that are functions of \emph{X} by some mapping \emph{f}. These quantities \emph{Y} may be any kind of mathematical object, in particular also functions of an additional variable t,
\begin{equation}
Y_{X}(t) = f(X,t)
\end{equation}
Such a quantity \emph{Y(t)} is called a ``random function'', or, since in most cases t stands for time, a \emph{stochastic process}. Thus a stochastic process is simply a function of two variables, one of which is the time t, and the other a stochastic variable \emph{X}. On inserting for \emph{X} one of its possible values x, we obtain an ordinary function of \emph{t}:
\begin{equation}
Y_{x}(t) = f(x,t)
\end{equation}
called a \emph{sample function} or \emph{realization} of the process. In physical parlance one regards the stochastic process as an ``ensemble'' of these sample functions.
\subsection{Markov Processes}
There is a subclass of stochastic processes that have the Markov property. Such processes are by far the most important in physics and chemistry. A Markov process is defined as a stochastic process with the property that for any set of \emph{n successive} times (i.e., $t_{1} < t_{2} < ... < t_{n}$) one has
\begin{equation}\label{Markov}
P(y_{n},t_{n}|y_{n-1},t_{n-1},...,y_{1},t_{1}) = P(y_{n},t_{n}|y_{n-1},t_{n-1})
\end{equation}
that is, the conditional probability density at $t_{n}$, given the value $y_{n-1}$ at $t_{n-1}$, is uniquely determined and is not affected by any knowledge of the values at earlier times. The system loses any information about its state prior to the present value, and so Markovian systems are usually called memoryless.
\section{The Chemical Master Equation (CME)}\label{TheCMESection}
As we have understood, stochasticity is very important in our biological context. The master equation approach is used to describe the time evolution of the probability for a stochastic system to be in a specific configuration, in the framework of Markov processes. Indeed, the mathematical derivation of the master equation starts from Eq. (\ref{Markov}). For a stationary Markov process the \emph{transition probability} does not depend on the absolute moment in time but only on the elapsed time, so we can write $P(y_{2}, t+\tau|y_{1},t) = T_{\tau}(y_{2}|y_{1})$ \cite{PHD}. Composing the transition from $y_{1}$ to an intermediate state $y_{2}$ after a time $\tau$ with the transition from $y_{2}$ to $y_{3}$ after a further time $\tau'$, and integrating over all intermediate states, we obtain the important Chapman-Kolmogorov (CK) equation for the transition probability:
\begin{equation}
T_{\tau+\tau'}(y_{3}|y_{1}) = \int T_{\tau'}(y_{3}|y_{2})T_{\tau}(y_{2}|y_{1})dy_{2}
\end{equation}
The CK equation for $T_{\tau}$ is a functional relation, which is not easy to handle in actual applications. The master equation is a more convenient version of the same equation: it is a differential equation obtained by going to the limit of vanishing time difference $\tau'$. Hence, expanding $T_{\tau'}(y_{3}|y_{2})$ for small $\tau'$ and keeping terms up to first order, we can write:
\begin{equation}
T_{\tau'}(y_{3}|y_{2}) = (1-a_{0}\tau')\delta(y_{3}-y_{2}) + \tau'W(y_{3}|y_{2}) + O(\tau'^{2})
\end{equation}
Here $W(y_{3}|y_{2})$ is the \emph{transition probability per unit time} from $y_{2}$ to $y_{3}$, and hence $W(y_{3}|y_{2})\geq0$. The coefficient $1-a_{0}\tau'$ in front of the delta function is the probability that no transition takes place during $\tau'$, hence $a_{0}(y_{2})=\int W(y_{3}|y_{2})\, dy_{3}$.
Now we insert this expression into the CK equation for $T_{\tau+\tau'}$:
\begin{equation}
T_{\tau+\tau'}(y_{3}|y_{1}) = [1 - a_{0}\tau']T_{\tau}(y_{3}|y_{1})+ \tau'\int W(y_{3}|y_{2})T_{\tau}(y_{2}|y_{1})dy_{2}
\end{equation}
Divide by $\tau'$, go to the limit $\tau' \rightarrow 0$, and use $a_{0}(y_{3})=\int W(y_{2}|y_{3})\, dy_{2}$:
\begin{equation}
\frac{\partial }{\partial \tau}T_{\tau}(y_{3}|y_{1}) = \int \left[ W(y_{3}|y_{2})T_{\tau}(y_{2}|y_{1}) - W(y_{2}|y_{3})T_{\tau}(y_{3}|y_{1}) \right] dy_{2}
\end{equation}
This is the differential form of the Chapman-Kolmogorov equation, which is known as the \emph{master equation}. It can be written in the simplified and more intuitive form
\begin{equation}\label{continuous}
\frac{\partial P(y,t)}{\partial t} = \int \left[ W(y|y')P(y',t)-W(y'|y)P(y,t) \right] dy'
\end{equation}
This can be recognized as an influx of probability to the state \emph{y} from all the ``surrounding'' (in the sense of connected) states $y'$, and an efflux from \emph{y} to every state $y'$ to which it can move \cite{PHD}. This thesis is concerned with solving the master equation for chemical systems; if the system state space is discrete, as when we work with a discrete number of individuals or molecules, we write the probability as $P_{n}(t)$ to represent the \emph{discreteness} of the state space. In this case the master equation can be called the \emph{Chemical Master Equation} (referring to a chemical environment) and it is written with sums instead of integrals \cite{PHD}:
\begin{equation}\label{discrete}
\frac{dP_{n}(t)}{dt} = \sum_{n'}[W_{n,n'}P_{n'}(t)-W_{n',n}P_{n}(t)]
\end{equation}
just as $W(y'|y)$ represents the transition probability per unit time from \emph{y} to $y'$ in the continuous equation, $W_{n,n'}$ is the transition probability per unit time from the state $n'$ to the state \emph{n} in the discrete equation. Hence, also in this form, we can recognize the meaning of Eq. \ref{discrete} as we did for Eq. \ref{continuous} and state that \emph{the master equation is a gain-loss equation for the probabilities of the separate states n} \cite{Math}. The first term is the gain due to transitions from $n'$ to \emph{n} states, and the second term is the loss due to transitions from \emph{n} to $n'$ states.
Eq. \ref{discrete} is linear in the probabilities, therefore we can write it as a linear dynamical system, gaining a more compact form of the CME:
\begin{equation}
\partial_{t}\Vec{P}(t) = \mathbb{W} \Vec{P}(t)
\end{equation}
where $\Vec{P}$ is a column vector with components $P_{n}$ and $\mathbb{W}$ is called the \emph{transition matrix} because it is the matrix of the transition rates. $\mathbb{W}$ is defined as
\begin{equation}
\mathbb{W}_{n,n'} =
\begin{cases}
W_{n,n'} & \textrm{for } n \neq n' \\
-\sum_{n'' \neq n} W_{n'',n} & \textrm{for } n = n'
\end{cases}
\end{equation}
In the general case, the matrix $\mathbb{W}_{n,n'}$ should obey the following properties:
\begin{equation}
\mathbb{W}_{n,n'} \geq 0 \textrm{ for } n \neq n'
\end{equation}
\begin{equation}\label{det}
\sum_{n} \mathbb{W}_{n,n'} = 0 \textrm{ for each } n'
\end{equation}
Eq. \ref{det} states that the matrix $\mathbb{W}$ has zero determinant. We can verify this property, as an exercise, by writing $\mathbb{W}$ for $N = 3$, i.e. for three possible states $n \in \{1,2,3\}$:
\begin{equation}
\mathbb{W} =
\begin{pmatrix}
-(W_{2,1}+W_{3,1}) & W_{1,2} & W_{1,3}\\
W_{2,1} & -(W_{1,2}+W_{3,2}) & W_{2,3} \\
W_{3,1} & W_{3,2} & -(W_{1,3}+W_{2,3})
\end{pmatrix}
\end{equation}
A zero determinant means that there is a zero eigenvalue, and the eigenvector corresponding to the zero eigenvalue is the \emph{stationary distribution} $P(n)$, the distribution to which the stochastic process always converges, i.e. $\partial_{t}\Vec{P}(t) = 0$, as long as the transition propensities $W_{n,n'}$ are not functions of time.
On the other hand, one can solve the master equation to obtain \emph{n} solutions $P_{n}(t)$ that depend on time since, in general, the master equation represents a set of \emph{n} deterministic equations, with \emph{n} the number of states (in our context, set by the number of molecules). In this case, bear in mind a fundamental property of the master equation: as $t \rightarrow \infty$ all solutions tend to the stationary solution.
If the system is fully connected (it cannot be broken into two non-communicating pieces), the stationary distribution is guaranteed to be unique \cite{PHD}. The stationary distribution is obviously positive, i.e. all its terms have positive sign and the sum of all its components is 1 (being a probability distribution). All the other eigenvalues have negative real part, and the corresponding eigenvectors have total sum of components equal to zero, as they can be interpreted as the difference between the present distribution and the stationary one, both having total sums of components equal to 1. A special role is played by the eigenvalue with the smallest nonzero absolute value, which means that its eigenvector is the longest-lived one. This eigenvector is referred to as the \emph{metastable} state, and its eigenvalue gives the time scale of convergence to the stationary distribution.
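To make this concrete, here is a minimal numerical sketch in Python (assuming NumPy; the chosen process, a birth-death chain truncated at \emph{N} molecules, and all parameter values are purely illustrative):
\begin{verbatim}
import numpy as np

N = 50                           # truncate the state space at N molecules
k1, k2 = 10.0, 0.5               # birth rate, per-molecule death rate
n = np.arange(N + 1)

W = np.zeros((N + 1, N + 1))
W[n[1:], n[:-1]] = k1            # birth:  n -> n+1 with rate k1
W[n[:-1], n[1:]] = k2 * n[1:]    # death:  n -> n-1 with rate k2*n
W[n, n] = -W.sum(axis=0)         # diagonal: each column must sum to zero

vals, vecs = np.linalg.eig(W)
i = np.argmin(np.abs(vals))      # the eigenvalue (numerically) equal to zero
P_stat = np.real(vecs[:, i])
P_stat /= P_stat.sum()           # normalize into a probability distribution
# For this linear birth-death process P_stat is (up to truncation) a
# Poisson distribution with mean k1/k2.
\end{verbatim}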
%PUT IT IN THE REACTION KINETICS CHAPTER
We end this section contemplating the impressive similarity between the Chemical Master Equation and the Schr\"{o}dinger equation: just as the solution of the Schr\"{o}dinger equation is the probability distribution (through the wave function) of finding a particle at a given position, making it the fundamental equation for modeling the motion of atomic and subatomic particle systems, the solution of the CME is the probability distribution over the number of molecules in time for molecular reaction systems \cite{Brizi}. Indeed, just as the Schr\"{o}dinger equation reduces to Newton's second law as Planck's constant tends to zero, the Chemical Master Equation becomes the Law of Mass Action as the reaction volume tends to infinity, since we tend to the deterministic limit.
%We will apply the Law of Mass Action for describing genetic circuits in Chapter \ref{models}.
%\section{Expansion in eigenfunctions of the CME solution}
\section{One-Step processes reveal a Poisson distribution}\label{theonestepdiscussion}
The one-step or \emph{birth-and-death} (BD) processes are a special class of Markov processes.
Birth-death (BD) processes are characterized by two types of transitions: \emph{births}, which increase the state by one, and \emph{deaths}, which decrease the state by one. Many stochastic processes are BD processes; indeed, they are suitable for modelling the dynamics of the number of individuals in a population, and are widely used in a broad range of areas such as biology, ecology and operations research. In particular, they can be exploited to model the evolution of a population of RNA molecules, since the population can increase (production) or decrease (degradation) by one molecule at a time \cite{onesteparticle}.
More generally and rigorously defined, a BD process is a \emph{continuous time Markov process} whose range consists of integers \emph{n} and whose transition matrix $\mathbb{W}$ permits only jumps between adjacent sites \cite{Math}. In that case, the master equation (Eq \ref{discrete}) is written as
\begin{equation}\label{discreteonestep}
\frac{dP_{n}(t)}{dt} = W_{n,n+1}P_{n+1}(t)+W_{n,n-1}P_{n-1}(t)-W_{n-1,n}P_{n}(t)-W_{n+1,n}P_{n}(t)
\end{equation}
The transition rates, $W_{n',n}$, are written in a special notation for these processes, i.e.
\begin{align}
W_{n+1,n} = g_{n} && W_{n-1,n} = r_{n}
\end{align}
Here $g_{n}$ is the \emph{gain} term, the probability per unit time of a jump from \emph{n} to $n + 1$, and $r_{n}$ is the \emph{recombination} term, the probability per unit time of a jump from state \emph{n} to state $n - 1$. The probability of jumping by more than one unit in a time $\Delta t$ is not excluded, but it is of order $O(\Delta t^{2})$.
From here the meaning of the master equation as a gain-loss equation for the probabilities between states \emph{n} is even more clear. It is similar to a balance of a fluid: some is entering and some is exiting.
\begin{figure}[ht]
\includegraphics[scale=0.90]{theonestep.jpg}
\caption{ The one-step process with its transition probabilities and the different states. Image from \cite{Ragazza}.}
\label{theonestep}
\end{figure}
\newline
Thus the master equation for one-step processes can be rewritten as
\begin{equation}\label{theonestepmasterequation}
\frac{dP_{n}(t)}{dt} = \dot{P_{n}} = r_{n+1}P_{n+1} + g_{n-1}P_{n-1}-(r_{n}+g_{n})P_{n}
\end{equation}
However, we must be careful with the boundary conditions. Given that the number of molecules \emph{n} satisfies $n \in [0,N]$, if $n = 0$ is a boundary, Eq \ref{theonestepmasterequation} is meaningless and has to be replaced with
\begin{equation}
\dot{P_{0}} = r_{1}P_{1} - g_{0}P_{0}
\end{equation}
since we need to impose the condition $r_{0} = g_{-1} = 0$; otherwise we would allow a negative number of molecules, which is physically unfeasible. Similarly, an upper boundary $n = N$ requires a special equation:
\begin{equation}
\dot{P_{N}} = g_{N-1}P_{N-1} - r_{N}P_{N}
\end{equation}
because of the condition $r_{N+1}=g_{N}=0$: otherwise, with $g_{N}$ we would have a transition from state \emph{N} to state $N+1$, i.e. to a state whose number of molecules exceeds the maximum present in our system; a similar reasoning applies to $r_{N+1}$, which would describe a transition from state $N+1$ to \emph{N}.
One-step processes can be subdivided, based on the coefficients $r_{n}$ and $g_{n}$, into the following categories: \emph{linear}, if the coefficients are linear functions of \emph{n}; \emph{nonlinear}, if the coefficients are nonlinear functions of \emph{n}; and \emph{random walks}, if the coefficients are constant \cite{Ragazza,Math}.
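For instance, with the rate constants $k_{1}$ for RNA production and $k_{2}$ for RNA degradation used in the models considered later, the RNA population mentioned above is a \emph{linear} one-step process,
\begin{equation}
g_{n} = k_{1}, \qquad r_{n} = k_{2}\,n,
\end{equation}
since production occurs at a constant rate while each of the \emph{n} molecules degrades independently.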
\newline
An important example of a one-step process with constant transition probabilities is the \emph{Poisson process}, defined by
\begin{align}
r_{n} = 0, && g_{n} = q, && P_{n}(0)=\delta_{n,0},
\end{align}
where \emph{q} is a constant parameter. In this case, the master equation is
\begin{equation}
\dot{P_{n}} = q(P_{n-1}-P_{n})
\end{equation}
The solution of this equation is the time-dependent probability distribution
\begin{equation}
P_{n}(t) = \frac{(qt)^{n}}{n!}e^{-qt}
\end{equation}
which is a Poisson distribution with mean $qt$. Let us keep this theoretical result in mind for the discussion about modeling gene expression in a stochastic framework in Chapter \ref{stochasticmodels}.
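As a quick numerical sanity check of this result, the following minimal Python sketch (illustrative, not the code used in this thesis) simulates the pure birth process and verifies that the empirical mean and variance of the counts at a fixed time both approach $qt$, as required for a Poisson distribution:
\begin{verbatim}
# Minimal sketch: simulate the pure-birth process r_n = 0, g_n = q
# and check that the counts at time t_obs are Poisson(q * t_obs).
import numpy as np

rng = np.random.default_rng(0)
q, t_obs, n_runs = 1.0, 5.0, 10000

counts = np.zeros(n_runs, dtype=int)
for r in range(n_runs):
    t, n = 0.0, 0
    while True:
        t += rng.exponential(1.0 / q)   # waiting time between births
        if t > t_obs:
            break
        n += 1
    counts[r] = n

# Mean and variance should both approach q * t_obs = 5.
print(counts.mean(), counts.var())
\end{verbatim}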
\section{Resolution Methods for the CME}
Solving the master equation would seem to be the ideal way of studying stochastic systems, since it can give us the full distribution of the possible states of the system over time, but it has one important disadvantage: its complexity.
In some special cases it is possible to solve it with analytical and numerical methods, for instance when the process is linear, i.e. every reaction term is constant or proportional to the first power of a single chemical species \cite{PHD}.
However, in general, what makes it difficult to solve is the mixing of a continuous time evolution with a discrete evolution of the state variables \cite{SanDiegoUniversity}. This becomes apparent when considering how many equations must be solved at each time point, bearing in mind that the number of possible states can increase exponentially with time \cite{Article}.
Thus we must use trajectory-based approaches, which differ from the master equation in that they do not find the distribution of all possible states over time. Rather, \emph{they simulate single realizations}, meaning that at each step they choose one out of all the possible outcomes.
The trajectory of each stochastic simulation is different, since each is a single realisation sampled randomly from the full distribution given by the master equation. Given many of these random single trajectories, we can build up a picture of the full distribution.
Many approximation methods for the solution of the CME have been developed over the years, operating by discretizing the time evolution. In this thesis, we will exploit the two simulation algorithms that have become the workhorse stochastic methods for many researchers today: the stochastic simulation algorithm (SSA) and the tau-leaping procedure. Finally, we will propose an ``alternative'' method that combines the two algorithms in order to speed up the stochastic simulation of a chemically reacting system.
\subsection{Stochastic simulation algorithm (SSA)}\label{Gillespie}
Most of the attempts at solving the CME have concentrated on Monte Carlo simulations using the \emph{stochastic simulation algorithm} (SSA) proposed by Gillespie in his seminal paper. This method is statistically exact, namely the full probability distribution built up from an infinite number of simulations would be identical to the distribution of the Markov process given by the master equation \cite{Article}. This algorithm is arguably the simplest SSA variant and is based on the following steps in order to simulate \emph{one stochastic realisation}: assuming that the system is described by the state vector $\Vec{Y}$, which represents the amount of molecules of each chemical species, and given the initial state $\Vec{Y} = \Vec{Y}_{0}$, one has to choose which reaction happens among the possible ones and how much time the system spends in the current state before the next reaction happens \cite{PHD}. Each reaction \emph{i} is described by its \emph{transition rate} (also called \emph{propensity}) $k_{i}$ and by the modification of the state vector that it generates. Given that we are dealing with a Markov process, the time between two successive events $\Delta t$ follows an \emph{exponential distribution} whose parameter is the propensity $k_{i}$:
\begin{equation}
p_{i}(\Delta t) = k_{i}e^{-k_{i}\Delta t}
\end{equation}
Given that any reaction is independent from the others, the time between two successive events of any type is distributed as:
\begin{equation}\label{exponential}
p(\Delta t) = \left(\sum_{i=0}^{R} k_{i}\right) e^{-\sum_{i=0}^{R} k_{i} \Delta t}
\end{equation}
This process is iterated until the whole time of interest has been simulated or until another desired condition is achieved.
Unfortunately, the direct-method SSA pays the price for its simplicity in computational time, as two random numbers must be generated at each step \cite{Article}: one for the choice of the event and the other for the time spent in the current state, given by an exponential random variable. Furthermore, the SSA has an inherent limitation: it must simulate every single reaction. This is, of course, also its great strength, but in cases with many reactions or large molecular populations it often becomes too slow to generate useful numbers of simulations in a practical time.
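To make these steps concrete, here is a minimal Python sketch of the direct method for a simple RNA birth-death system (illustrative, not the thesis implementation; the rates $k_{1}$ and $k_{2}$ have the same meaning as in the constitutive model used below):
\begin{verbatim}
# Minimal direct-method SSA sketch for RNA birth-death.
import numpy as np

rng = np.random.default_rng(1)
k1, k2 = 1.0, 0.1          # RNA production / degradation rates
n, t, t_end = 0, 0.0, 100.0
times, traj = [0.0], [0]

while t < t_end:
    props = np.array([k1, k2 * n])      # propensities of the 2 reactions
    k_tot = props.sum()
    t += rng.exponential(1.0 / k_tot)   # 1st random number: waiting time
    i = rng.choice(2, p=props / k_tot)  # 2nd random number: which reaction
    n += 1 if i == 0 else -1
    times.append(t); traj.append(n)
\end{verbatim}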
\subsection{Tau-leap algorithm}
The tau-leap method, also referred to as the explicit tau-leaping procedure, is similar to the SSA in that each simulation is a \emph{single stochastic realisation} drawn from the full distribution at each step \cite{Article}. However, the steps are fixed, and in each of them \emph{the total number of occurrences of each type of reaction over that step is counted, approximating the number of firings of each reaction channel during a chosen time increment $\tau$ as a Poisson random variable}. Thus, also in this case, starting from the initial state of molecular populations $\Vec{Y}=\Vec{Y_{0}}$, if there are \emph{M} reactions we take \emph{M} Poisson random number samples over a time interval that we call $\tau$. The number of firings $K_{i}(\tau)$ of each reaction channel \emph{i} during $\tau$ is sampled as
\begin{equation}\label{MPoissonReactions}
K_{i}(\tau) \sim \mathcal{P}\left(k_{i}\,\tau\right), \qquad i = 1,\dots,M
\end{equation}
where $\mathcal{P}(k_{i}\tau)$ denotes a Poisson distribution with mean $k_{i}\tau$, $k_{i}$ being the propensity of reaction \emph{i}. The state of molecular populations is then updated by applying the state change of each reaction $K_{i}(\tau)$ times (Eq \ref{MPoissonReactions}), and the procedure is iterated until the whole time of interest has been simulated or a desired condition is met, similarly to the SSA.
Hence, the basic idea of this procedure is to advance the system by a \emph{pre-selected} time increment $\tau$ (in contrast to the generated time increment of the SSA), which is large enough that many reaction events occur in that time, but nevertheless small enough that \emph{no propensity function value is likely to change ``significantly'' as a consequence of those reaction events} \cite{Method}.
Summing up, the tau-leap algorithm has the key advantage that we no longer need to spend time simulating individual reactions, and as such it is much faster than the SSA. On the other hand, its main drawback is a loss of accuracy compared to the SSA: for instance, we do not know when each reaction occurred within the time step.
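For comparison, a minimal Python sketch of the explicit tau-leap for the same RNA birth-death system (again illustrative; the fixed step $\tau$ and the final clamp are simplifications) is:
\begin{verbatim}
# Minimal explicit tau-leap sketch for RNA birth-death.
import numpy as np

rng = np.random.default_rng(2)
k1, k2, tau = 1.0, 0.1, 0.5
n, t, t_end = 0, 0.0, 100.0

while t < t_end:
    births = rng.poisson(k1 * tau)      # firings of the production channel
    deaths = rng.poisson(k2 * n * tau)  # firings of the degradation channel
    n = max(n + births - deaths, 0)     # crude clamp: see the negativity
    t += tau                            # issue discussed below
\end{verbatim}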
The accuracy issue can be mitigated (at the expense of computational time) by decreasing the error level, and hence the time step size; indeed, as the time step tends to zero, the tau-leap effectively becomes the SSA, with each step experiencing either zero or one reaction \cite{Article}.
Moreover, as anticipated above, there is a complementary issue with the tau-leap: since many reactions are simulated at once, molecular species that are depleted by any reaction can go negative if the time step is too large. This is obviously unphysical and undesirable, and makes the method inappropriate for simulating reactions involving species that are typically present in low numbers, such as genes.
Fig. \ref{ODESSATAULEAP} shows the number of RNA molecules and proteins produced over time obtained with the stochastic simulation methods just described (SSA and tau-leap) and with the deterministic approach, i.e. the solution of the ordinary differential equations, considering a model with a gene that is always expressed (see section \ref{modelconstitutivegeneexpression}). In this particular case, in which the RNA production rate is higher than the degradation rate (and likewise for proteins), we do not run the risk of obtaining an unphysical negative number of molecules and we can compare the results of the algorithms. In particular, we consider the following rate values: gene activation $k_{a}=1$, gene deactivation $k_{i}=0$, RNA production $k_{1}=1$, RNA degradation $k_{2}=0.1$, protein production $k_{3}=1$, protein degradation $k_{4}=0.1$.
\begin{figure}[!ht]
\centering
\includegraphics[scale=0.30]{ODESSATauleaptimeplot100au.png}
\caption{Comparison of simulation methods. The typical trajectory of the number of molecules as a function of time is reported for the SSA (blue line), the tau-leap (orange dotted line) and the ODE solution (black thick line). The upper plot refers to the number of RNA molecules produced as a function of time starting from a gene that is always active, while the lower plot represents the number of proteins produced starting from the RNAs of the same model.}
\label{ODESSATAULEAP}
\end{figure}
From Fig. \ref{ODESSATAULEAP}, we notice that the trends resulting from the three methods are similar, both for the number of RNA molecules and for the number of proteins produced over time.
However, in order to compare the two stochastic simulation methods, it is more useful to compare the stationary distributions (i.e. the distributions of states) obtained, rather than the time-dependent results.
Hence, we run the same simulation up to a longer time limit, in order to obtain the expected Poisson distributions. The results are shown in Fig. \ref{SSATauleapdistribution}:
\begin{figure}[!ht]
\centering
\includegraphics[scale=0.70]{RemovingwarmupSSA55.25sTauleap10.22stimelimit9000.png}
\caption{Comparison of simulation methods using the stationary distributions given by the SSA (orange bar plot) and the tau-leap (blue bar plot).}
\label{SSATauleapdistribution}
\end{figure}
\newpage
There is a clear overlap between the two distributions, showing that the SSA provides the same results as the tau-leaping, and validating that both algorithms correctly represent the stochastic trajectory of the chemically reacting system we are analysing.
The time plot corresponding to the stationary distribution shown in Fig. \ref{SSATauleapdistribution} is reported in Fig. \ref{ODESSATauleaptimeplot9000}:
\begin{figure}[!ht]
\centering
\includegraphics[scale=0.30]{ODESSATauleaptimeplot9000au.png}
\caption{Comparison of simulation methods using a longer time limit than in Fig. \ref{ODESSATAULEAP}. The typical trajectory of the number of molecules as a function of time is reported for the SSA (blue line), the tau-leap (orange dotted line) and the ODE solution (black thick line).}
\label{ODESSATauleaptimeplot9000}
\end{figure}
\newpage
In Fig. \ref{ODESSATauleaptimeplot9000}, the ODE solution is also reported. We can notice that, even though the deterministic approach cannot be considered representative of the true behaviour of the system, its result provides a useful statistic of the system: in this case, the mean of our unimodal, symmetric distributions of the number of RNAs and proteins produced. The black line in Figs. \ref{ODESSATAULEAP} and \ref{ODESSATauleaptimeplot9000} in fact behaves like a rolling mean of the stochastic simulation.
Furthermore, deterministic simulations are generally much faster to run than their statistical counterparts \cite{Article}.
To run this simulation for a virtual time equal to 9000 on a computer with a 1.10 GHz CPU, the SSA took 55 seconds, compared with the 10 seconds taken by the explicit tau-leaping algorithm.
The tau-leaping algorithm thus works well in this situation of constitutive gene expression. But what if our gene is regulated? The risk of running into a negative number of molecules is then much higher, hence we need an algorithm that checks that this does not occur if we want to keep using the tau-leap procedure.
In this kind of system, the typical choice is to combine the SSA and the tau-leaping procedure, because the former can simulate the gene behaviour while the latter is suitable for simulating the changes of the more numerous molecular species such as RNAs and proteins.
In the next section, we propose and describe a combined SSA/tau-leaping algorithm.
\subsection{An alternative method: the ``hybrid'' stochastic simulation algorithm}\label{hybrid}
Many schemes have been developed that allow the choice of larger time steps in the tau-leap algorithm whilst avoiding negative populations. In this thesis work, we have tried an approach that combines the SSA with the tau-leap algorithm, in such a way that the former updates the state of the genes while the latter updates the state of the other, more numerous species such as RNA molecules and proteins.
The model that we consider in developing this ``hybrid'' algorithm is the first simple protein synthesis model (the one described in the first section of Chapter \ref{models}). We have 3 molecular species: one \emph{gene} whose transcription leads to the production of \emph{RNAs}, which are translated into \emph{proteins}. In other words, this is what is called \emph{the central dogma of molecular biology}, which states that genes specify the sequence of mRNA molecules, which in turn specify the sequence of proteins (DNA $\rightarrow$ RNA $\rightarrow$ Protein).
Hence we have 6 possible reactions: gene activation, gene deactivation, RNA production, RNA degradation, protein production and protein degradation.
Let us consider our usual state of molecular populations $\Vec{Y}$ and the initial state $\Vec{Y}=\Vec{Y_{0}}$. The transition rates related to such a state are $\Vec{K}=\Vec{K}_{0}$.
The proposed hybrid algorithm is based on the following steps (a minimal code sketch is reported after the list):
\begin{enumerate}
\item In state $\Vec{Y}$ at time \emph{t}, evaluate the transition rates of all species, including those related to the genes, $\Vec{K}_{genes}$.
\item Apply the SSA by considering the sum of the gene transition rates in Eq.\ref{exponential}, hence selecting the reactions that modify the gene state.
\item In this case, Eq.\ref{exponential} gives the time of residency of the gene in its current state, $t_{gene\_state\_residency}$. The aim of this algorithm is to speed up the SSA by using this $t_{gene\_state\_residency}$ as the $\tau$ of the tau-leaping.
\item Thus, after the SSA step, consider $t_{gene\_state\_residency}$ and the updated gene state $\Vec{Y}'$.
\item Apply the tau-leaping algorithm to the RNA and protein states, taking $t_{gene\_state\_residency}$ as $\tau$, i.e. $\tau = t_{gene\_state\_residency}$.
\item Consider the state $\Vec{Y}'$ updated by the tau-leaping and the SSA, and compute the difference between the transition rates $\Vec{K}_{0}$ referred to the initial state $\Vec{Y}_{0}$ and the final transition rates $\Vec{K}'$ obtained from the updated state $\Vec{Y}'$. In particular, compute the following difference:
\begin{equation}
\Delta \Vec{K} = \left|\frac{\Vec{K}'}{K'_{TOT}}-\frac{\Vec{K}_{0}}{K_{TOT}}\right|
\end{equation}
where $K'_{TOT}$ is the sum of all the transition rates calculated with the new updated state $\Vec{Y}'$, and similarly $K_{TOT}$ is the sum of all the transition rates calculated with the initial state $\Vec{Y}_{0}$.
Let us remember, indeed, that we are dealing with first-order kinetics, whose transition rates are proportional to the amount of molecules of interest through a constant $k$.
$\Vec{Y}'$ and $\Delta \Vec{K}$ are the two quantities that must be monitored: as regards the first, of course, we need to check that there is no negative number of molecules; as for the latter, if it is too large it can lead to very low accuracy in the algorithm results. For instance, if a tau-leaping step leads to a final state $\Vec{Y}'$ with a much higher number of proteins with respect to the initial state $\Vec{Y}_{0}$, the final stationary distribution will show an ``explosion'' in the number of molecules that does not represent the true distribution.
Thus, in the following steps we describe how we monitor these two quantities.
\item If $\Vec{Y}'$ has all positive values and $\Delta \Vec{K}$ is lower than a given threshold chosen by the user, update the rates and the state by using the results of the SSA and tau-leaping steps already applied in points 2 and 5. If the simulation is running until a precise time limit (as we always do in this thesis work), update the time of observation using $t_{gene\_state\_residency}$.
\item If instead $\Vec{Y}'$ has a negative number of molecules, or one of the components of $\Delta \Vec{K}$ is higher than the threshold, it means that we need to make smaller steps than the one proposed by the SSA.
Hence we halve $\tau$, $\tau' \rightarrow \tau/2$, until the conditions on the number of molecules and on $\Delta \Vec{K}$ are satisfied.
When we obtain the right $\tau'$, we use it to make a tau-leap, updating the RNA and protein states.
Then, we attempt a tau-leap using the remaining time $\tau - \tau'$, which we call $\tau''$.
If the conditions are satisfied, we update the total time of observation with $\tau$ and attempt a new tau-leap starting from point 1, considering the new updated state $\Vec{Y}'$ and the new rates $\Vec{K}'$.
If they are not, we halve $\tau''$ again, $\tau''' \rightarrow \tau''/2$, until the desired conditions are met.
This process is iterated as long as $\tau^{(n)}$ remains higher than 3 times the characteristic time $t_{c}$ given by the SSA considering all the reactions happening in our system.
\item If the new $\tau^{(n)}$ reaches a lower value, i.e. $\tau^{(n)} \leq 3t_{c}$, it is worth making an SSA step rather than getting trapped in the previous iterative scheme, which is certainly less precise than the SSA. In this case, update the total time with the characteristic time given by the SSA applied to all the reactions that are possible in our system.
\item Apply steps 1 to 9 until the desired total time is reached or the desired condition is met.
\end{enumerate}
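The following minimal Python sketch illustrates the logic of the steps above under simplifying assumptions: a single two-state gene, the basic-model rates used below, and no SSA fallback (step 9 is omitted for brevity). It is an illustrative sketch, not the exact thesis implementation.
\begin{verbatim}
# Minimal sketch of the hybrid SSA/tau-leap scheme (step 9 omitted).
import numpy as np

rng = np.random.default_rng(3)
ka, ki = 1.0, 0.5                # gene activation / deactivation
k1, k2 = 1.0, 0.1                # RNA production / degradation
k3, k4 = 1.0, 0.1                # protein production / degradation
dK_t = 10.0                      # user-chosen threshold on Delta K
gene, rna, prot = 1, 0, 0
t, t_end = 0.0, 1000.0

def rates(g, r, p):
    # gene on, gene off, RNA birth, RNA death, prot birth, prot death
    return np.array([ka*(1 - g), ki*g, k1*g, k2*r, k3*r, k4*p])

while t < t_end:
    K0 = rates(gene, rna, prot)
    # SSA on the gene: residence time in the current gene state.
    t_res = rng.exponential(1.0 / K0[:2].sum())
    remaining = t_res
    while remaining > 0:
        tau = remaining
        while True:
            # Tau-leap on RNA and protein over the candidate step tau.
            d_r = rng.poisson(K0[2]*tau) - rng.poisson(K0[3]*tau)
            d_p = rng.poisson(K0[4]*tau) - rng.poisson(K0[5]*tau)
            K1 = rates(gene, rna + d_r, prot + d_p)
            dK = np.abs(K1 / K1.sum() - K0 / K0.sum())
            if rna + d_r >= 0 and prot + d_p >= 0 and dK.max() < dK_t:
                break            # accept the leap (step 7)
            tau /= 2.0           # otherwise halve tau and retry (step 8)
        rna, prot = rna + d_r, prot + d_p
        remaining -= tau         # leftover time is leaped in the next pass
        K0 = rates(gene, rna, prot)
    gene = 1 - gene              # the two-state gene flips after t_res
    t += t_res
\end{verbatim}
The inner \texttt{while} loop implements the halving of step 8, while the \texttt{remaining} bookkeeping plays the role of $\tau''$, $\tau'''$, and so on.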
As in the tau-leaping section, we have compared the stationary distribution of the new algorithm with that of the SSA. We consider the following model parameters: activation rate $k_{a}=1$, deactivation rate $k_{i}=0.5$, RNA production rate $k_{1}=1$, RNA degradation rate $k_{2}=0.1$, protein production rate $k_{3}=1$, protein degradation rate $k_{4}=0.1$. These are the same parameters and the same model used to compare the SSA with the tau-leaping, except that the deactivation rate is now different from 0. Indeed, we can now add gene regulation to our model and simulate without worrying about the number of molecules becoming negative.
We consider a threshold $\Delta K_{t} = 10$, and the result for our basic model is promising (Fig. \ref{SSAHybridBasicModel}).
\newpage
\begin{figure}[!ht]
\centering
\includegraphics[scale=0.70]{ka1ki0.5SSA1min47sHybrid7.5stimelimit14000.png}
\caption{Comparison of simulation methods using stationary distributions. The blue bar plot is the result given by the SSA algorithm while the orange bar plot is the result given by the hybrid algorithm.}
\label{SSAHybridBasicModel}
\end{figure}
Fig. \ref{SSAHybridBasicModel} shows the distribution of states given by the SSA (blue bar plot) and by the hybrid algorithm (orange bar plot). They show a high degree of overlap, with a computational time for the hybrid approach that is more than 10 times lower than that of the SSA. More precisely, using a computer with a 1.10 GHz CPU, the timing results are summed up in Table \ref{tablebasicmodel}.
\begin{table}[!h]
\begin{center}
\begin{tabular}{ |c|c| }
\hline
 & \textbf{computational time} \\
\hline
SSA & 107s \\
\hline
Hybrid & 7.5s \\
\hline
\end{tabular}
\caption{\label{tablebasicmodel}Computational time required by the SSA and the hybrid algorithm for our basic model.}
\end{center}
\end{table}
\newpage
We can modify the model by changing the parameter value $k_{1}$ (which ultimately also sets the scale of the distribution). Considering $k_{1}=10$, the related stationary distribution is shown in Fig. \ref{SSAHybridk110}.
\begin{figure}[!ht]
\centering
\includegraphics[scale=0.70]{ka1ki0.5SSA11min58sHybrid7.7stimelimit14000k110.png}
\caption{Comparison of simulation methods using stationary distributions with $k_{1}=10$. The blue bar plot is the result given by the SSA algorithm while the orange bar plot is the result given by the hybrid algorithm.}
\label{SSAHybridk110}
\end{figure}
\newpage
There is still a high degree of overlap between the two distributions, even though the hybrid protein distribution is more scattered with respect to the SSA one. In this case, the gain in speed is really high (roughly a factor of 90). Once again, the timings are reported in Table \ref{tablek110}.
\begin{table}[!h]
\begin{center}
\begin{tabular}{ |c|c| }
\hline
 & \textbf{computational time} \\
\hline
SSA & 718s \\
\hline
Hybrid & 8s \\
\hline
\end{tabular}
\caption{\label{tablek110}Computational time required by the SSA and the hybrid algorithm in the case $k_{1}=10$.}
\end{center}
\end{table}
However, the situation changes if we decrease the rate of RNA production by a factor of 10, i.e. $k_{1}=0.1$, instead of increasing it tenfold as in the previous case.
\begin{figure}[!ht]
\centering
\includegraphics[scale=0.70]{ka1ki0.5SSA8.31sHybrid7.9stimelimit14000k10.1.png}
\caption{Comparison of simulation methods using stationary distributions with $k_{1}=0.1$. The blue bar plot is the result given by the SSA algorithm while the orange bar plot is the result given by the hybrid algorithm.}
\label{SSAHybridk10.1}
\end{figure}
\newpage
The degree of overlap is still high; however, we do not achieve a considerable improvement in speed:
\begin{table}[!h]
\begin{center}
\begin{tabular}{ |c|c| }
\hline
 & \textbf{computational time} \\
\hline
SSA & 8.3s \\
\hline
Hybrid & 7.9s \\
\hline
\end{tabular}
\caption{\label{tablek10.1}Computational time required by the SSA and the hybrid algorithm in the case $k_{1}=0.1$.}
\end{center}
\end{table}
This suggests that, as the number of RNAs produced decreases, the performance of our hybrid algorithm decreases. What if we consider the opposite type of regulation?
Let us consider $k_{a}=0.1$ and $k_{i}=1$, leaving the other parameters equal to the basic model values.
\begin{figure}[!ht]
\centering
\includegraphics[scale=0.70]{ka0.1ki1SSA9sHybrid35stimelimit14000.png}
\caption{Comparison of simulation methods using stationary distributions considering gene activation rate lower than the deactivation rate ($k_{a}=0.1$ and $k_{i}=1$). The blue bar plot is the result given by the SSA algorithm while the orange bar plot is the result given by the hybrid algorithm.}
\label{SSAHybridka0.1ki1}
\end{figure}
Fig. \ref{SSAHybridka0.1ki1} shows the effects of this other kind of regulation on the distributions given by the two algorithms. As expected, in both cases the results show that the residency time of the state with zero molecules is the highest. However, in this case, the hybrid algorithm overestimates the occupancy of states with higher numbers of molecules, both for the RNAs and, even more, for the proteins. One can try to decrease the threshold $\Delta K_{t}$, but at the expense of a computational time that is already more than 3 times higher than that of the SSA (see Table \ref{ka0.1ki1}).
\begin{table}[!h]
\begin{center}
\begin{tabular}{ |c|c| }
\hline
 & \textbf{computational time} \\
\hline
SSA & 9s \\
\hline
Hybrid & 35s \\
\hline
\end{tabular}
\caption{\label{ka0.1ki1}Computational time required by the SSA and the hybrid algorithm in the case $k_{a}=0.1$ and $k_{i}=1$.}
\end{center}
\end{table}
Hence we can conclude that the proposed hybrid algorithm works very well, from the point of view of both accuracy and computational time, when the model tends to produce a high number of molecules, whereas we have not yet found an approach that is better than Gillespie's algorithm when the number of molecules produced is low (in these cases the hybrid algorithm is less efficient because it tends to fall into the corrective branches of the procedure). Anyway, in such cases of low numbers of molecules, the SSA is efficient and there is no great need for optimization.
Furthermore, we need to note that our algorithm is a ``parameter-dependent'' algorithm, where the parameter is $\Delta K_{t}$. As regards tuning this parameter, we suggest keeping its value as high as possible if one wants to increase the speed, while paying attention to the accuracy: with a higher $\Delta K_{t}$, one runs the risk of the ``explosion of molecules'' mentioned above.
One can notice, indeed, that the logic of our hybrid algorithm is similar to that of the adaptive Runge-Kutta methods used for the approximate solution of ordinary differential equations: during the integration, the step size is adapted such that the estimated error stays below a user-defined threshold. If the error is too high, the step is repeated with a smaller step size; if the error is much smaller, the step size is increased to save time \cite{RungeKutta}.
Anyway, in this thesis work we will use the Stochastic Simulation Algorithm, because our hybrid approach needs to be further tested and improved.
\section{Statistical analysis techniques for warm-up detection}\label{warmupsection}
Up to now we have shown the results of stationary distributions of the CME for comparing different algorithms. However, we have not yet explained an important analysis step when dealing with simulations: the detection of the warm-up period and its removal.
We can define our simulations as ``infinite horizon simulations'': they have no natural beginning or ending point, and the desired starting conditions may lie in the far-off future. We are interested in the long-run performance of the system and, since there is no natural ending point of interest, we consider the horizon infinite \cite{Manuel}.
As mentioned in section \ref{ACFSubsection}, sometimes we are interested in finding the moments of the probability distributions, such as the mean.
It is true that, if the steady state distribution exists and the simulation is run long enough, the estimators will tend to converge to the desired quantities. This can be observed in the plots reported in Figs. \ref{ODESSATAULEAP} and \ref{ODESSATauleaptimeplot9000}: at the beginning the number of molecules is low; then, as virtual time increases, it grows until reaching what we can call a ``steady'' state, meaning that the distribution of the desired performance measure has become sufficiently similar to the desired steady state distribution \cite{Manuel}.
We have also mentioned in section \ref{TheCMESection}, when discussing the solutions of the Chemical Master Equation, that the eigenvector also called the metastable state is the longest-lived one and that its eigenvalue gives a time-scale for the convergence to the stationary distribution.
We need to clarify that ``steady state'' is a concept involving the performance measures generated by the system as time goes to infinity; it does not mean that the system itself has reached a steady state.
The system never reaches steady state: if it did, then by implication it would never change with respect to time, whereas the system continues to evolve in time.
The period before the so-called steady state can affect the calculation of the mean, and therefore also the distribution of states. This period is called the warm-up period.
At the end of the warm-up period, the system can be in any of its possible states, some being more likely than others.
Thus, within the infinite horizon simulation context, one must decide how long to run the simulations and how to handle the effect of \emph{the initial conditions} on the estimates of performance, by estimating when the warm-up period ends, i.e. a time point that we denote by $T_{w}$.
The initial conditions of a simulation represent the state of the system when the simulation is started.
If $T_{w}$ is long enough, then, on average across the replications, you are more likely to start collecting data when the system is in states that are more representative of the long term (rather than just states with a low number of molecules).
\subsection{The effect of initial conditions}
Consider the output stochastic process $Y_{i}$ of the simulation and let $F_{i}(y|I)$ be the conditional cumulative distribution function of $Y_{i}$, where \emph{I} represents the initial conditions used to start the simulation at time 0. If $F_{i}(y|I) \rightarrow F(y)$ as $i \rightarrow \infty$ for all initial conditions \emph{I}, then $F(y)$ is called the steady state distribution of the output process.
The distribution $F_{i}(y|I)$ at the start of the simulation, however, tends to depend more heavily upon the initial conditions. Estimators of steady state performance, such as the sample average, will tend to be biased.
If the expected value of the sampling distribution is equal to the parameter of interest, then the estimator is said to be unbiased, i.e. $E[\hat{\theta}] = \theta$, where $\hat{\theta}$ is a point estimator.
If instead the estimator is biased, the difference $E[\hat{\theta}] - \theta \neq 0$ is the bias of the estimator $\hat{\theta}$.
This is the so-called initialization bias problem in steady state simulation, or warm-up problem. The warm-up period $d$ is the period such that the observations $Y_{i}$ with $i=d+1,...$ \emph{have substantially the same distributional properties as the steady state distribution}.
Unless the initial conditions of the simulation can be generated according to $F(y)$, which is not known, you must focus on methods that detect and/or mitigate the presence of initialization bias.
\subsection{``Replication-deletion'' method}
There are many methods that aim to determine the warm-up period. In this thesis work we have used a strategy that we can call the ``replication-deletion'' method, because it is based on the following steps (a minimal code sketch follows the list):
\begin{enumerate}
\item Make \emph{R} replications. Typically $R \geq 5$ is recommended.
\item Since the sample points in the SSA are obtained at different time instants for each simulation, we interpolate them on a grid with constant spacing (we choose $dt=0.01$).
\item Thanks to the previous step, we can average across the replications.
Let $Y_{rj}$ be the $j^{th}$ observation of replication \emph{r}, with $j=1,2,...,m$, where $m=\min_{r}(m_{r})$ is the minimum number of observations over the replications. The average at observation \emph{j} is
\begin{equation}
\hat{Y}_{j}=\frac{1}{R}\sum_{r=1}^{R} Y_{rj}
\end{equation}
\item Find $T_{w}$ from the sequence $\hat{Y}_{j}$ by computing the differences between consecutive values ($\hat{Y}$ at \emph{t} and at $t+dt$, at $t+dt$ and $t+2dt$, and so on): when the differences start to become constant, the corresponding \emph{t} is the $T_{w}$ we are looking for.
\end{enumerate}
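The following minimal Python sketch (illustrative: the input format, the tolerance \texttt{tol} and the settling test are assumptions, not the exact thesis implementation) summarizes steps 2 to 4:
\begin{verbatim}
# Minimal sketch of the replication-deletion warm-up detection.
import numpy as np

def detect_warmup(replications, dt=0.01, tol=0.05):
    """replications: list of (times, counts) pairs, one per run."""
    t_end = min(times[-1] for times, _ in replications)
    grid = np.arange(0.0, t_end, dt)
    # Step 2: interpolate every replication on a common time grid.
    interp = [np.interp(grid, times, counts)
              for times, counts in replications]
    # Step 3: average across the R replications at each grid point.
    y_hat = np.mean(interp, axis=0)
    # Step 4: successive differences; T_w is where they level off.
    diffs = np.abs(np.diff(y_hat))
    for j in range(1, len(diffs)):
        if abs(diffs[j] - diffs[j - 1]) < tol:
            return grid[j]   # first time the increments settle
    return None
\end{verbatim}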
\begin{figure}[!ht]
\centering
\includegraphics[scale=0.90]{figWarmUpPeriod.png}
\caption{Example of a simulation with warmup period detection. Image from \cite{Manuel}.}
\label{figWarmUpPeriod}