<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>我的小破站</title>
<link href="http://www.shelven.com/atom.xml" rel="self"/>
<link href="http://www.shelven.com/"/>
<updated>2024-05-14T11:43:44.000Z</updated>
<id>http://www.shelven.com/</id>
<generator uri="https://hexo.io/">Hexo</generator>
<entry>
<title>Setting Up and Optimizing an Overseas V2Ray Node (for Academic Use)</title>
<link href="http://www.shelven.com/2024/04/30/a.html"/>
<id>http://www.shelven.com/2024/04/30/a.html</id>
<published>2024-04-29T22:50:54.000Z</published>
<updated>2024-05-14T11:43:44.000Z</updated>
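<content type="html"><![CDATA[<p>My research requires Google, and until now I had been using <code>clash</code> with subscription links bought from someone else. Recently I needed Google Scholar to write a paper, and every one of those subscription links went down at exactly that moment. I decided I might as well set up a more stable node myself, so I bought a VPS from RackNerd to tinker with, for reaching Google Scholar and <code>github</code>.</p><span id="more"></span><p>I previously set up a reverse proxy for my campus network on a server inside China, using the reverse-proxy tool <code>frp</code>; see this earlier post:</p><p><a href="https://www.shelven.com/2023/02/09/a.html">Setting up a campus-network proxy server - 我的小破站 (shelven.com)</a></p><p>The goal is different this time. To get past the GFW and reach Google, you need an <strong>overseas server</strong> whose public IP is reachable from inside China. The HTTP requests we (the client) send go to the proxy server, which forwards them to the target server and then returns the responses to us. The target never sees the client's IP, which makes this a forward proxy.</p><h2 id="1-服务器"><a href="#1-服务器" class="headerlink" title="1. Server"></a>1. Server</h2><div class="story post-story"><h3 id="1-1-购买海外服务器"><a href="#1-1-购买海外服务器" class="headerlink" title="1.1 Buying an overseas server"></a>1.1 Buying an overseas server</h3><p>One well-known overseas provider is <strong>RackNerd</strong>, which is reviewed and compared on several <a href="https://www.zhujiceping.com/">hosting forums</a>. Its selling points are low prices, KVM virtualization, pure-SSD RAID 10 storage, a SolusVM panel, one included IPv4 address, 1 Gbps bandwidth, and a choice of several data centers.</p><p><img src="https://www.shelven.com/tuchuang/20240429/1.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240429/1.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Only plans costing more than 14 USD per year can choose the <strong>Los Angeles</strong> data center. Los Angeles and New York are the main locations for US servers, and nearly every overseas provider has a presence in both. Los Angeles is geographically closer to China, which generally means lower network latency (ping).</p><p>Saying more would start to sound like an ad; see the <code>RackNerd</code> website for details: <a href="https://www.racknerd.com/">RackNerd - Introducing Infrastructure Stability</a></p><p>Incidentally, a server bought from <code>RackNerd</code> can have its IPv4 address changed once within 72 hours; any later change costs 3 USD. This matters because <strong>I built this node to reach Google Scholar</strong>, and if the IP is blocked you will see:</p><p><code>Your client does not have permission to get URL</code></p><p>There is not even a CAPTCHA, which suggests someone previously used this IP for scraping or the like and Google Scholar blocked it outright. In that case, open a ticket right away to change the IP, or request an IPv6 address (RackNerd does not seem to provide IPv6 for free), since Google Scholar usually blocks IPv4 addresses. More on that later.</p><p>The VPS can be managed through the <code>SolusVM</code> panel; the username and password are sent by email: <a href="https://nerdvm.racknerd.com/">https://nerdvm.racknerd.com/</a></p><h3 id="1-2-测试服务器性能"><a href="#1-2-测试服务器性能" class="headerlink" title="1.2 Testing server performance"></a>1.2 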
Testing server performance</h3><p>This step is optional, since the specs you saw when choosing a plan already give a rough idea of performance. I will just note a few test scripts here:</p><p><strong>VPS spec test:</strong></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Reports Linux system info, disk I/O, and download speed from locations worldwide</span></span><br><span class="line">curl -Lso- bench.sh | bash</span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20240429/2.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240429/2.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p><strong>Geekbench 6 score:</strong></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Similar to the above</span></span><br><span class="line">curl -sL yabs.sh | bash</span><br></pre></td></tr></table></figure><p><strong>Speed test to the three major Chinese carriers and the education network:</strong></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">bash <(curl -sL bash.icu/speedtest)</span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20240429/3.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240429/3.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p><strong>Return route:</strong></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">wget -qO- git.io/besttrace | bash</span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20240429/4.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240429/4.jpg" 
srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p><strong>Streaming unlock check:</strong></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">bash <(curl -L -s media.ispvps.com)</span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20240429/5.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240429/5.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p><strong>Other tests:</strong></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Ping test (a website, open it in a browser)</span></span><br><span class="line">https://ping.pe</span><br><span class="line"></span><br><span class="line"><span class="comment"># Disk test</span></span><br><span class="line">wget -q https://github.com/Aniverse/A/raw/i/a && bash a</span><br></pre></td></tr></table></figure></div><h2 id="2-V2Ray"><a href="#2-V2Ray" class="headerlink" title="2. V2Ray"></a>2. 
V2Ray</h2><div class="story post-story"><p>Plain <code>nginx</code> can act as an HTTP forward proxy (proxying HTTPS requires an extra module), but for data security and to avoid detection by the firewall, a more mature proxy tool is a better fit. Two common choices are <code>clash</code> and <code>V2Ray</code>. <code>clash</code> has a flexible configuration that supports rules, proxy groups, obfuscation, and more; <code>V2Ray</code> is comparatively simple to configure (in my opinion) and better suited to a beginner like me.</p><p><code>V2Ray</code> has very detailed Chinese deployment documentation and a plain-language tutorial: <a href="https://www.v2ray.com/chapter_00/start.html">Getting Started · Project V official website (v2ray.com)</a></p><p>There is no need to reinvent the wheel, so I will only point out what to watch for when deploying the client (your own computer) and the server (the overseas VPS):</p><ol><li><p>The client and server may be in different time zones, but after time-zone conversion their clocks must differ by <strong>less than 90 seconds</strong>.</p></li><li><p>The one-click Linux install script <code>https://install.direct/go.sh</code> no longer works; run the following command to install or update <code>V2Ray</code> instead:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">bash <(curl -L https://raw.githubusercontent.com/v2fly/fhs-install-v2ray/master/install-release.sh)</span><br></pre></td></tr></table></figure></li><li><p>After configuring the server, remember to open the corresponding firewall port and reload the firewall so the change takes effect. Commonly used commands:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Open port 16823:</span></span><br><span class="line">firewall-cmd --add-port=16823/tcp --zone=public --permanent</span><br><span class="line"><span class="comment"># Reload the firewall:</span></span><br><span class="line">systemctl restart firewalld</span><br><span class="line"><span class="comment"># List the open ports:</span></span><br><span class="line">firewall-cmd --list-ports</span><br><span class="line"><span class="comment"># Check the process:</span></span><br><span class="line">ps -ef | grep v2ray</span><br><span class="line"><span class="comment"># Start v2ray:</span></span><br><span 
class="line">systemctl start v2ray</span><br></pre></td></tr></table></figure></li><li><p>Set <code>"alterId"</code> to 0 on both client and server. The user manual says it can specify a number of additional IDs, <strong>but in my setup any nonzero value meant traffic could only go direct and none of my other rules took effect</strong> (I am not sure whether that counts as a bug).</p></li><li><p>The user ID is a UUID. <strong>Do not type it by hand!</strong> Doing so can cause obscure errors; I recommend a generator such as <a href="https://www.uuidgenerator.net/">Online UUID Generator Tool</a>.</p></li></ol><p>Here is my configuration as a simple example.</p><p>Server configuration:</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"inbounds"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line"> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"port"</span><span class="punctuation">:</span> <span class="number">16823</span><span class="punctuation">,</span> <span class="comment">// Port the server listens on</span></span><br><span class="line"> <span class="attr">"protocol"</span><span class="punctuation">:</span> <span class="string">"vmess"</span><span class="punctuation">,</span> <span class="comment">// Main inbound protocol; vmess is V2Ray's own protocol</span></span><br><span class="line"> <span class="attr">"settings"</span><span class="punctuation">:</span> <span 
class="punctuation">{</span></span><br><span class="line"> <span class="attr">"clients"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line"> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"id"</span><span class="punctuation">:</span> <span class="string">"********-****-****-****-************"</span><span class="punctuation">,</span> <span class="comment">// User ID; must be identical on client and server</span></span><br><span class="line"> <span class="attr">"alterId"</span><span class="punctuation">:</span> <span class="number">0</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">]</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"outbounds"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line"> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"protocol"</span><span class="punctuation">:</span> <span class="string">"freedom"</span><span class="punctuation">,</span> <span class="comment">// Main outbound protocol</span></span><br><span class="line"> <span class="attr">"settings"</span><span class="punctuation">:</span> <span class="punctuation">{</span><span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">]</span></span><br><span class="line"><span class="punctuation">}</span></span><br></pre></td></tr></table></figure><p>Client configuration:</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span 
class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span class="line">116</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"log"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"loglevel"</span><span class="punctuation">:</span> <span class="string">"warning"</span></span><br><span class="line"> 
<span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line"></span><br><span class="line"> <span class="attr">"inbounds"</span><span class="punctuation">:</span> <span class="punctuation">[</span><span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"port"</span><span class="punctuation">:</span> <span class="number">1080</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"listen"</span><span class="punctuation">:</span> <span class="string">"127.0.0.1"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"tag"</span><span class="punctuation">:</span> <span class="string">"http-inbound"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"protocol"</span><span class="punctuation">:</span> <span class="string">"http"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"settings"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"auth"</span><span class="punctuation">:</span> <span class="string">"noauth"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"udp"</span><span class="punctuation">:</span> <span class="keyword">false</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"ip"</span><span class="punctuation">:</span> <span class="string">"127.0.0.1"</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"sniffing"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"enabled"</span><span class="punctuation">:</span> <span class="keyword">true</span><span class="punctuation">,</span></span><br><span class="line"> <span 
class="attr">"destOverride"</span><span class="punctuation">:</span> <span class="punctuation">[</span><span class="string">"http"</span><span class="punctuation">,</span> <span class="string">"tls"</span><span class="punctuation">]</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line"></span><br><span class="line"> <span class="attr">"outbounds"</span><span class="punctuation">:</span> <span class="punctuation">[</span><span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"protocol"</span><span class="punctuation">:</span> <span class="string">"vmess"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"settings"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"vnext"</span><span class="punctuation">:</span> <span class="punctuation">[</span><span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"address"</span><span class="punctuation">:</span> <span class="string">"***.***.***.***"</span><span class="punctuation">,</span><span class="comment">// Server IP</span></span><br><span class="line"> <span class="attr">"port"</span><span class="punctuation">:</span> <span class="number">16823</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"users"</span><span class="punctuation">:</span> <span class="punctuation">[</span><span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"id"</span><span class="punctuation">:</span> <span class="string">"********-****-****-****-************"</span><span class="punctuation">,</span><span class="comment">// User ID; must match the server</span></span><br><span class="line"> <span class="attr">"alterId"</span><span 
class="punctuation">:</span> <span class="number">0</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">]</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">]</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"tag"</span><span class="punctuation">:</span> <span class="string">"rule"</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span><span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"protocol"</span><span class="punctuation">:</span> <span class="string">"blackhole"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"settings"</span><span class="punctuation">:</span> <span class="punctuation">{</span><span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"tag"</span><span class="punctuation">:</span> <span class="string">"blocked"</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span><span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"protocol"</span><span class="punctuation">:</span> <span class="string">"freedom"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"settings"</span><span class="punctuation">:</span> <span class="punctuation">{</span><span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"tag"</span><span class="punctuation">:</span> <span class="string">"direct"</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line"></span><br><span class="line"> <span 
class="attr">"routing"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"domainStrategy"</span><span class="punctuation">:</span> <span class="string">"IPOnDemand"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"rules"</span><span class="punctuation">:</span><span class="punctuation">[</span></span><br><span class="line"> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"type"</span><span class="punctuation">:</span> <span class="string">"field"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"ip"</span><span class="punctuation">:</span> <span class="punctuation">[</span><span class="string">"geoip:private"</span><span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"outboundTag"</span><span class="punctuation">:</span> <span class="string">"blocked"</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span><span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"type"</span><span class="punctuation">:</span> <span class="string">"field"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"domain"</span><span class="punctuation">:</span> <span class="punctuation">[</span><span class="string">"geosite:category-ads"</span><span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"outboundTag"</span><span class="punctuation">:</span> <span class="string">"blocked"</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span><span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"type"</span><span class="punctuation">:</span> <span class="string">"field"</span><span 
class="punctuation">,</span></span><br><span class="line"> <span class="attr">"domain"</span><span class="punctuation">:</span> <span class="punctuation">[</span><span class="string">"geosite:cn"</span><span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"outboundTag"</span><span class="punctuation">:</span> <span class="string">"direct"</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span><span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"type"</span><span class="punctuation">:</span> <span class="string">"field"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"outboundTag"</span><span class="punctuation">:</span> <span class="string">"direct"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"ip"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line"> <span class="string">"geoip:private"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="string">"geoip:cn"</span></span><br><span class="line"> <span class="punctuation">]</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span><span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"type"</span><span class="punctuation">:</span> <span class="string">"field"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"outboundTag"</span><span class="punctuation">:</span> <span class="string">"direct"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"domain"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line"> <span class="string">"domain:taobao.com"</span><span class="punctuation">,</span></span><br><span 
class="line"> <span class="string">"domain:jd.com"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="string">"domain:baidu.com"</span></span><br><span class="line"> <span class="punctuation">]</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">]</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line"></span><br><span class="line"> <span class="attr">"dns"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"hosts"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"domain:v2ray.com"</span><span class="punctuation">:</span> <span class="string">"www.vicemc.net"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"domain:github.io"</span><span class="punctuation">:</span> <span class="string">"pages.github.com"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"domain:wikipedia.org"</span><span class="punctuation">:</span> <span class="string">"www.wikimedia.org"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"domain:shadowsocks.org"</span><span class="punctuation">:</span> <span class="string">"electronicsrealm.com"</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"servers"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line"> <span class="string">"1.1.1.1"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"address"</span><span class="punctuation">:</span> 
<span class="string">"114.114.114.114"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"port"</span><span class="punctuation">:</span> <span class="number">53</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"domains"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line"> <span class="string">"geosite:cn"</span></span><br><span class="line"> <span class="punctuation">]</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line"> <span class="string">"8.8.8.8"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="string">"localhost"</span></span><br><span class="line"> <span class="punctuation">]</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line"></span><br><span class="line"> <span class="attr">"policy"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"levels"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"0"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"uplinkOnly"</span><span class="punctuation">:</span> <span class="number">0</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"downlinkOnly"</span><span class="punctuation">:</span> <span class="number">0</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"system"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span 
class="line"> <span class="attr">"statsInboundUplink"</span><span class="punctuation">:</span> <span class="keyword">false</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"statsInboundDownlink"</span><span class="punctuation">:</span> <span class="keyword">false</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"statsOutboundUplink"</span><span class="punctuation">:</span> <span class="keyword">false</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"statsOutboundDownlink"</span><span class="punctuation">:</span> <span class="keyword">false</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line"></span><br><span class="line"> <span class="attr">"other"</span><span class="punctuation">:</span> <span class="punctuation">{</span><span class="punctuation">}</span></span><br><span class="line"><span class="punctuation">}</span></span><br></pre></td></tr></table></figure><p>The client example on the official site uses the <code>socks</code> protocol. Of the common browsers, only Firefox lets you configure a <code>socks</code> proxy in its own settings; others such as Edge and Chrome simply jump to the operating system's proxy settings (unless you use an extension). So I set the client <code>inbound</code> to the <code>http</code> protocol instead, and then set the proxy on the client machine to address <code>127.0.0.1</code>, port <code>1080</code>.</p><p><img src="https://www.shelven.com/tuchuang/20240429/6.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240429/6.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p></div><h2 id="3-优化"><a href="#3-优化" class="headerlink" title="3. Optimization"></a>3. 
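Optimization</h2><div class="story post-story"><p>Once <code>V2Ray</code> is configured on both the server and the client, you can run <code>v2ray.exe</code> on the client and browse Google Scholar normally <del>(as long as the IP has not been blocked)</del>.</p><p>If you are paying attention, you may notice that pages now load <strong>very slowly</strong>, nothing like the speeds from the three-carrier speed test run on the VPS at the start!</p><p>Test the proxy speed on a speed-test site: <a href="https://www.speedtest.net/">https://www.speedtest.net/</a></p><p><img src="https://www.shelven.com/tuchuang/20240429/77.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240429/77.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>This download speed is rather hopeless. Never mind browsing the web, <strong>even downloading a paper is a struggle.</strong></p><p>In addition, a TCP connection does not encrypt data by itself, so data in transit may be eavesdropped on or tampered with. Here is the mainstream V2Ray optimization scheme: <strong>WebSocket + TLS + Nginx(Web) + cloudflare(CDN)</strong></p><p>In short, <code>V2Ray</code> on the server accepts <code>WebSocket</code> traffic on its <code>inbounds</code> port, an <code>HTTP</code> server on the same machine reverse-proxies requests to that port and handles <code>TLS</code> encryption, and the <code>cloudflare CDN</code> servers sit between the client and the <code>HTTP</code> server. This is harder to detect than a direct <code>tcp</code> connection, and the <code>cloudflare</code> acceleration is very effective.</p><h3 id="3-1-cloudflare"><a href="#3-1-cloudflare" class="headerlink" title="3.1 cloudflare"></a>3.1 cloudflare</h3><p>cloudflare requires a domain name. I registered a <code>.com</code> domain with Alibaba Cloud. <strong>First, at the domain registrar, point the domain to the server's IP address</strong>.</p><p>On the right side of the cloudflare site, click <code>ADD a site</code>, enter the second-level domain you registered, and pick the free plan. You will then be asked to change the DNS servers (name servers) at your registrar; note that there are two of them.</p><p><img src="https://www.shelven.com/tuchuang/20240429/8.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240429/8.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>After the change, go back to cloudflare. The status of the second-level domain you added becomes <code>Active</code>, and <strong>two</strong> DNS records are imported as well (if you do not see two, add them manually: one for www, one for the second-level domain).</p><p><img src="https://www.shelven.com/tuchuang/20240429/9.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240429/9.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Now ping the domain from your own computer; a successful ping means the setup is fine.</p><p><strong>Turn off the proxy status in DNS for now</strong> (switch the orange cloud to grey).</p><h3 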
id="3-2-nginx"><a href="#3-2-nginx" class="headerlink" title="3.2 nginx"></a>3.2 nginx</h3><p>用其他的web服务软件比如<code>apache</code>之类的都是可以的,我的博客服务器用的<code>apache</code>,这里稍微折腾下<code>nginx</code>,<del>以后就都有经验了,嗯。</del></p><p>在<code>nginx</code>官网下载稳定版本,这里以<code>1.25.5</code>为例。</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># 解压,编译,安装nginx,一定要装ssl模块!除非你不用TLS</span></span><br><span class="line">tar -xvf nginx-1.25.5.tar.gz</span><br><span class="line"><span class="built_in">cd</span> nginx-1.25.5/</span><br><span class="line">./configure --prefix=/usr/soft/nginx --with-http_ssl_module</span><br><span class="line">make</span><br><span class="line">make install</span><br><span class="line"></span><br><span class="line"><span class="comment"># 开启80端口,443端口(ssl经常用的端口,也可以指定别的)</span></span><br><span class="line">firewall-cmd --add-port=80/tcp --zone=public --permanent</span><br><span class="line">firewall-cmd --add-port=443/tcp --zone=public --permanent</span><br><span class="line">systemctl restart firewalld</span><br><span class="line"></span><br><span class="line"><span class="comment">#启动nginx</span></span><br><span class="line">/usr/soft/nginx/sbin/nginx</span><br></pre></td></tr></table></figure><p>这个时候访问<code>ip:80</code>端口就可以成功访问到<code>nginx</code>的初始界面了。</p><p><img src="https://www.shelven.com/tuchuang/20240429/10.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240429/10.jpg" 
srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Because we will change the <code>nginx</code> configuration later, shut it down for now....</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Find which process is using port 80; nginx listens on port 80 by default</span></span><br><span class="line">netstat -nlp |grep :80</span><br><span class="line"><span class="comment"># Kill that process; replace the PID with your own</span></span><br><span class="line"><span class="built_in">kill</span> 16242</span><br></pre></td></tr></table></figure><h3 id="3-3-TLS证书(SSL)"><a href="#3-3-TLS证书(SSL)" class="headerlink" title="3.3 TLS证书(SSL)"></a>3.3 TLS certificate (SSL)</h3><p>This part is exactly the same as when I applied for an SSL certificate on my apache server earlier.... TLS is the successor to SSL; for our purposes you can treat them as the same thing.</p><h4 id="3-3-1-zerossl证书"><a href="#3-3-1-zerossl证书" class="headerlink" title="3.3.1 zerossl证书"></a>3.3.1 zerossl certificate</h4><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Install the required package and script, then reload the environment variables</span></span><br><span class="line">yum install socat</span><br><span class="line">curl https://get.acme.sh | sh</span><br><span class="line"><span class="built_in">source</span> ~/.bashrc</span><br><span class="line"></span><br><span class="line"><span class="comment"># Register a zerossl account</span></span><br><span class="line">acme.sh --register-account -m your-email-address --server zerossl</span><br></pre></td></tr></table></figure><p>Back in <code>cloudflare</code>, find the API page and generate the global <code>cloudflare API key</code>:</p><p><img src="https://www.shelven.com/tuchuang/20240429/11.png" class="lazyload" 
data-srcset="https://www.shelven.com/tuchuang/20240429/11.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Set the environment variables</span></span><br><span class="line"><span class="built_in">export</span> CF_Key=<span class="string">"your API key"</span></span><br><span class="line"><span class="built_in">export</span> CF_Email=<span class="string">"your email address"</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Issue the TLS certificate</span></span><br><span class="line">acme.sh --server zerossl --issue -d your-domain --dns dns_cf</span><br></pre></td></tr></table></figure><p>These steps generate the TLS certificate under <code>/root/.acme.sh/</code>. Install the certificate to a location that is easier for you to find.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Install the certificate, choosing your own path and file names</span></span><br><span class="line">acme.sh --installcert -d your-domain --fullchainpath /etc/crt/your-domain.crt --keypath /etc/crt/your-domain.key</span><br></pre></td></tr></table></figure><p>A zerossl certificate is valid for 90 days, but <code>acme</code> creates a cronjob that renews it automatically, which saves a lot of trouble (reportedly zerossl no longer supports renewing free ssl certificates, so to avoid possible problems later I went with the method below).</p><h4 id="3-3-2-cloudflare证书"><a href="#3-3-2-cloudflare证书" class="headerlink" title="3.3.2 cloudflare证书"></a>3.3.2 cloudflare certificate</h4><p>There is a simpler kind of certificate: cloudflare's own Origin CA. The certificate is free and valid for 15 years.</p><p>Back in <code>cloudflare</code>, find the SSL/TLS page, choose <code>Create Certificate</code>, create it and upload it to the server.</p><p>Whichever way you obtained the certificate, set the encryption mode on the SSL/TLS page to <code>Full</code> or <code>Full (strict)</code>, <strong>and only then turn the proxy status back on in the DNS settings</strong> (switch the grey cloud to orange).</p><p><img 
src="https://www.shelven.com/tuchuang/20240429/12.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240429/12.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>这样修改后你的服务器到CDN服务器之间也采用了<code>ssl/tls</code>加密的方式传输数据,更不容易被检测。</p><h3 id="3-4-修改nginx配置"><a href="#3-4-修改nginx配置" class="headerlink" title="3.4 修改nginx配置"></a>3.4 修改nginx配置</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># 修改nginx.conf内容</span></span><br><span class="line"><span class="comment"># 
The nginx install location was customized earlier; mine is under /usr/soft/nginx/conf</span></span><br><span class="line"><span class="comment"># Only the port 80 and 443 blocks that need changing are shown</span></span><br><span class="line">http {</span><br><span class="line">    include mime.types;</span><br><span class="line">    default_type application/octet-stream;</span><br><span class="line">    sendfile on;</span><br><span class="line">    keepalive_timeout 65;</span><br><span class="line"></span><br><span class="line">    server {</span><br><span class="line">        listen 80;</span><br><span class="line">        server_name your-domain;</span><br><span class="line">        location / {</span><br><span class="line">            proxy_redirect off;</span><br><span class="line">            proxy_pass http://127.0.0.1:16823; <span class="comment"># V2Ray server port</span></span><br><span class="line">            proxy_http_version 1.1;</span><br><span class="line">            proxy_set_header Upgrade <span class="variable">$http_upgrade</span>;</span><br><span class="line">            proxy_set_header Connection <span class="string">"upgrade"</span>;</span><br><span class="line">            proxy_set_header Host <span class="variable">$http_host</span>;</span><br><span class="line">        }</span><br><span class="line">    }</span><br><span class="line">    server {</span><br><span class="line">        listen 443 ssl;</span><br><span class="line">        server_name your-domain;</span><br><span class="line">        ssl_certificate /etc/crt/your-domain.crt;</span><br><span class="line">        ssl_certificate_key /etc/crt/your-domain.key;</span><br><span class="line">        ssl_ciphers HIGH:!aNULL:!MD5;</span><br><span class="line">        ssl_session_cache shared:SSL:1m;</span><br><span class="line">        ssl_session_timeout 5m;</span><br><span class="line">        ssl_prefer_server_ciphers on;</span><br><span class="line"></span><br><span class="line">        location / {</span><br><span class="line">            proxy_redirect off;</span><br><span class="line">            proxy_pass http://127.0.0.1:16823; <span class="comment"># V2Ray server port</span></span><br><span class="line">            proxy_http_version 1.1;</span><br><span class="line">            proxy_set_header Upgrade <span 
class="variable">$http_upgrade</span>;</span><br><span class="line">            proxy_set_header Connection <span class="string">"upgrade"</span>;</span><br><span class="line">            proxy_set_header Host <span class="variable">$http_host</span>;</span><br><span class="line">            proxy_set_header X-Real-IP <span class="variable">$remote_addr</span>;</span><br><span class="line">            proxy_set_header X-Forwarded-For <span class="variable">$proxy_add_x_forwarded_for</span>;</span><br><span class="line">        }</span><br><span class="line">    }</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>Now start <code>nginx</code>. If visiting your domain displays the page normally (no red padlock in the address bar), <code>TLS</code> is set up correctly.</p><h3 id="3-5-修改v2ray设置"><a href="#3-5-修改v2ray设置" class="headerlink" title="3.5 修改v2ray设置"></a>3.5 Updating the v2ray configuration</h3><p>On the server, modify the configuration and <strong>restart</strong>:</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"inbounds"</span><span class="punctuation">:</span> <span 
class="punctuation">[</span></span><br><span class="line"> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"port"</span><span class="punctuation">:</span> <span class="number">16823</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"protocol"</span><span class="punctuation">:</span> <span class="string">"vmess"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"settings"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"clients"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line"> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"id"</span><span class="punctuation">:</span> <span class="string">"********-****-****-****-************"</span><span class="punctuation">,</span> </span><br><span class="line"> <span class="attr">"alterId"</span><span class="punctuation">:</span> <span class="number">0</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">]</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"streamSettings"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line"><span class="attr">"network"</span><span class="punctuation">:</span> <span class="string">"ws"</span><span class="punctuation">,</span></span><br><span class="line"><span class="attr">"wsSettings"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line"><span class="attr">"path"</span><span class="punctuation">:</span> <span class="string">"/"</span></span><br><span class="line"><span class="punctuation">}</span></span><br><span class="line"> <span 
class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"outbounds"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line"> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"protocol"</span><span class="punctuation">:</span> <span class="string">"freedom"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"settings"</span><span class="punctuation">:</span> <span class="punctuation">{</span><span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">]</span></span><br><span class="line"><span class="punctuation">}</span></span><br></pre></td></tr></table></figure><p>与前面相比只是多了一个<code>"streamSettings"</code>,设置的是<code>WebSocket</code>转发的规则,注意路径要与<code>nginx</code>中设置的一样。<strong>实际运行的时候我这有个小bug</strong>,设置别的路径都会报错404,只有<code>"/"</code>可以正常代理。</p><p>客户端只需要在原来的基础上,在<code>"outbounds"</code>添加<code>"streamSettings"</code>相关内容,以及修改<code>"settings"</code>端口和域名地址,如下(只展示部分):</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span 
class="line"><span class="attr">"outbounds"</span><span class="punctuation">:</span> <span class="punctuation">[</span><span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"protocol"</span><span class="punctuation">:</span> <span class="string">"vmess"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"streamSettings"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"network"</span><span class="punctuation">:</span> <span class="string">"ws"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"security"</span><span class="punctuation">:</span> <span class="string">"tls"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"wsSettings"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"path"</span><span class="punctuation">:</span> <span class="string">"/"</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"settings"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"vnext"</span><span class="punctuation">:</span> <span class="punctuation">[</span><span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"address"</span><span class="punctuation">:</span> <span class="string">"你的域名"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"port"</span><span class="punctuation">:</span> <span class="number">443</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"users"</span><span class="punctuation">:</span> <span 
class="punctuation">[</span><span class="punctuation">{</span></span><br><span class="line"> <span class="attr">"id"</span><span class="punctuation">:</span> <span class="string">"********-****-****-****-************"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="attr">"alterId"</span><span class="punctuation">:</span> <span class="number">0</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">]</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">]</span></span><br><span class="line"> <span class="punctuation">}</span><span class="punctuation">,</span></span><br></pre></td></tr></table></figure><p><strong>客户端运行时记得打开计算机代理!</strong></p><h3 id="3-6-优化后的下载速度对比"><a href="#3-6-优化后的下载速度对比" class="headerlink" title="3.6 优化后的下载速度对比"></a>3.6 优化后的下载速度对比</h3><p>不管是从数据安全性还是下载速度上,优化后的比之前纯<code>v2ray</code>搭建的节点要强很多。</p><p>我这里校园网带宽只有20Mbps,测不到这个节点的下载速度上限;开了手机热点,测的等效带宽是50Mbps,<strong>比之前3Mbps提高了20倍</strong>,实际上还没有测到上限。这下下载文献是没问题了。</p><p><img src="https://www.shelven.com/tuchuang/20240429/133.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240429/133.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p></div>]]></content>
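<summary type="html"><p>My research work requires Google, and until now I had been using <code>clash</code> with a subscription link bought from someone else's setup. Recently I needed Google Scholar while writing a paper, and every subscription link I had bought went down at once..... Better to configure a more stable node of my own, so I bought a VPS from RackNerd to tinker with, for accessing Google Scholar and <code>github</code>.</p></summary>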
<category term="网络相关" scheme="http://www.shelven.com/categories/%E7%BD%91%E7%BB%9C%E7%9B%B8%E5%85%B3/"/>
<category term="建站" scheme="http://www.shelven.com/tags/%E5%BB%BA%E7%AB%99/"/>
<category term="正向代理" scheme="http://www.shelven.com/tags/%E6%AD%A3%E5%90%91%E4%BB%A3%E7%90%86/"/>
</entry>
<entry>
<title>snakemake初学笔记</title>
<link href="http://www.shelven.com/2024/04/08/a.html"/>
<id>http://www.shelven.com/2024/04/08/a.html</id>
<published>2024-04-08T12:26:06.000Z</published>
<updated>2024-04-08T12:27:57.000Z</updated>
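<content type="html"><![CDATA[<p><code>snakemake</code> is a powerful workflow management tool for building and running complex data analysis workflows. Its workflows are described in a <code>python</code>-based language, similar in spirit to the workflow description language of <code>Makefile</code>. A <code>Makefile</code> specifies the dependencies between source files and how to compile them into executables and libraries, whereas <code>snakemake</code> specifies the dependencies and rules of a data processing pipeline and executes those rules automatically to produce the final output.</p><span id="more"></span><h2 id="Snakemake"><a href="#Snakemake" class="headerlink" title="Snakemake"></a>Snakemake</h2><div class="story post-story"><p>Some features of snakemake:</p><ol><li><strong>Declarative workflow description</strong>: Workflow rules are written in Python-style syntax, describing the data processing steps and their dependencies. Python code can also be written directly in the Snakefile to handle complex requirements.</li><li><strong>Automated task scheduling</strong>: Snakemake resolves the dependencies between tasks automatically and runs tasks in parallel as needed to maximize efficiency.</li><li><strong>Flexibility</strong>: Snakemake supports complex workflow designs, including parallel processing, conditional execution and dynamic file generation, so users can tailor the data processing pipeline freely.</li><li><strong>Integration</strong>: Snakemake integrates with common cluster schedulers (such as SLURM), which makes it convenient to run workflows on a compute cluster.</li><li><strong>Report generation</strong>: Snakemake can generate detailed reports showing the results and performance metrics of a workflow run, which helps with result analysis and optimization.</li></ol><p>The official documentation covers the operations in great detail, including installation and how to define and run a workflow. There is also an official code repository holding a large collection of bioinformatics workflow code and files for reference.</p><p>Official documentation: <a href="https://snakemake.readthedocs.io/en/stable/">Snakemake | Snakemake 8.10.6 documentation</a></p><p>Workflow repositories: <a href="https://github.com/snakemake-workflows">Snakemake-Workflows (github.com)</a>, <a href="https://snakemake.github.io/snakemake-workflow-catalog/">Snakemake workflow catalog</a></p><p>These are my notes on learning <code>snakemake</code>. I mainly follow the examples in the official documentation to get a first grasp of the basic usage, then run some simple pipelines on my own server.</p></div><h2 id="1-安装snakemake和下载示例数据"><a href="#1-安装snakemake和下载示例数据" class="headerlink" title="1. 安装snakemake和下载示例数据"></a>1. 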
安装snakemake和下载示例数据</h2><div class="story post-story"><p>官方推荐使用<code>Conda/Mamba</code>进行安装,因为这两个包管理软件可以很好处理snakemake的软件依赖关系。也可以使用python的<code>pip</code>工具安装,但需要手动解决一些依赖问题。</p><p>由于<strong>学校集群至今没能联网</strong><del>(对学校这方面的管理非常非常不满,一个集群没有专人管理,出问题找不到管理员解决,也没有用户手册,没有后台监测,导致一些用户乱用登录节点资源。真的推荐向华农学习学习,南疆最大的集群管理如此混乱,说出去真的不好意思)</del>,源码安装需要解决的依赖太多,所以暂时不做集群中的演示了,以我的服务器为例跑一跑流程。</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># 先安装mamba(base环境下)</span></span><br><span class="line">conda install -n base -c conda-forge mamba</span><br><span class="line"></span><br><span class="line"><span class="comment"># 下载示例数据</span></span><br><span class="line"><span class="built_in">mkdir</span> snakemake-tutorial</span><br><span class="line"><span class="built_in">cd</span> snakemake-tutorial</span><br><span class="line">curl -L https://api.github.com/repos/snakemake/snakemake-tutorial-data/tarball -o snakemake-tutorial-data.tar.gz</span><br><span class="line"></span><br><span class="line"><span class="comment"># 使用mamba安装snakemake,创建虚拟环境(需要在conda的base环境下)</span></span><br><span class="line"><span class="comment">## --wildcards通配符模式,-xf解压缩并提取文件,--strip 1只提取压缩文件中的子目录和文件,而不包含顶层目录</span></span><br><span class="line">tar --wildcards -xf snakemake-tutorial-data.tar.gz --strip 1 <span class="string">"*/data"</span> <span class="string">"*/environment.yaml"</span></span><br><span class="line">mamba <span 
class="built_in">env</span> create --name snakemake-tutorial --file environment.yaml</span><br><span class="line"></span><br><span class="line"><span class="comment"># 激活环境,查看帮助(需要重新连接一次激活mamba)</span></span><br><span class="line">mamba activate snakemake-tutorial</span><br><span class="line">snakemake --<span class="built_in">help</span></span><br></pre></td></tr></table></figure><p>官方的示例是跑一个<code>bwa</code>比对流程,用<code>samtools</code>转sam文件并排序后,最终使用<code>bcftools</code>进行变异检测,一套标准的<strong>call genomic variants</strong>流程。</p><p>示例数据结构如下:</p><p><img src="https://www.shelven.com/tuchuang/20240407/11.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240407/11.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p><code>data</code>文件夹中是工作流需要用到的数据,包括基因组序列<code>.fa</code>及其索引文件,和三个测序数据<code>.fastq</code>文件。</p><p><code>environment.yaml</code>文件是示例中需要用到的软件,我这里<code>pip</code>部分需要改源和手动安装(默认的pip源安装软件会超时报错)。</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pysam==0.22</span><br></pre></td></tr></table></figure></div><h2 id="2-snakemake基本用法"><a href="#2-snakemake基本用法" class="headerlink" title="2. snakemake基本用法"></a>2. 
snakemake基本用法</h2><div class="story post-story"><p>基本用法主要参考官方:<a href="https://snakemake.readthedocs.io/en/stable/tutorial/basics.html">Basics: An example workflow | Snakemake 8.10.6 documentation</a></p><p><code>Snakefile</code>是snakemake默认读取的文件,记录程序运行过程中的数据来源、程序命令、参数、输出目录等等。运行snakemake会自动寻找目录下是否有这个文件,<strong>如果你改了这个文件的名字,需要在运行时指定参数</strong> <code>-s</code>。</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># 创建Snakefile</span></span><br><span class="line">vim snakefile</span><br><span class="line"></span><br><span class="line"><span class="comment"># 编写rule</span></span><br><span class="line">rule bwa_map:</span><br><span class="line"> input:</span><br><span class="line"> <span class="string">"data/genome.fa"</span>,</span><br><span class="line"> <span class="string">"data/samples/A.fastq"</span></span><br><span class="line"> output:</span><br><span class="line"> <span class="string">"mapped_reads/A.bam"</span></span><br><span class="line"> shell:</span><br><span class="line"> <span class="string">"bwa mem {input} | samtools view -Sb > 
{output}"</span></span><br></pre></td></tr></table></figure><p>That is the simplest use of snakemake. Two basic points about it:</p><ul><li>The basic building block of a Snakefile is the <code>rule</code>; each one defines a rule, that is, one processing step.</li><li>Each <code>rule</code> has three basic elements: <code>input</code>, <code>output</code> and <code>shell/run/script</code>. <code>input</code> lists the input files, one per line. <strong>The commas must not be omitted</strong>: python concatenates adjacent strings, so a missing comma leads to a wrong file path. <code>output</code> defines the output files; directories that do not exist are created automatically. <code>shell/run/script</code> holds the command to run in the shell or the python code to execute; <code>shell</code> works much like spawning a child process with python's <code>subprocess</code> module.</li></ul><p>The <code>Snakefile</code> above must be indented with spaces (<strong>whitespace</strong>); <strong>tab indentation is not supported</strong>, and tab characters are not recognized. When several input or output files are listed across lines, the comma-separated files are joined with single spaces, so before execution <code>{input}</code> is replaced by <code>data/genome.fa data/samples/A.fastq</code>.</p><p>In <code>shell</code>, by contrast, no commas are used and consecutive string lines are concatenated directly. <strong>Make sure each string ends with a trailing space</strong> (by default the strings are joined with no space between them).</p><p>Before running snakemake, we can do a dry run to check for errors in the input and output files:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">snakemake -np</span><br><span class="line"><span class="comment"># -n is a dry run; -p prints the generated shell commands</span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20240407/2.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240407/2.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>No errors means the logic of this rule is fine, and it can be run directly:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">snakemake --cores 1</span><br><span class="line"><span class="comment"># Run this snakefile with one core</span></span><br><span class="line"></span><br><span class="line">snakemake --cores 1 mapped_reads/A.bam</span><br><span class="line"><span class="comment"># 
也可以指定输出文件来确定执行的是哪个rule</span></span><br></pre></td></tr></table></figure><p>运行时需要指定整个流程同时最多使用的核心数量,每个rule中可以用<code>threads</code>定义该rule使用的线程数。</p><p>Snakemake运行的逻辑是指定输出的文件或者目录,Snakemake为了拿到这个结果文件会一层层往上寻找得到该文件的流程,跳过不需要执行的rule。所以书写的时候每个rule之间的顺序不重要,只是书写代码的时候方便我们阅读。</p><p>上面的作业命令在运行结束后,结果文件不会被再次创建(即使你重新运行一遍命令),<strong>除非更改文件的时间戳</strong>,snakemake仅在输入文件比输出文件更新,或者输入文件被其他作业更改后才可以重新运行作业命令。</p></div><h2 id="3-rule-all"><a href="#3-rule-all" class="headerlink" title="3. rule all"></a>3. rule all</h2><div class="story post-story"><p>实际上一个工作流往往含有多个<code>rule</code>,我们会制定多个<code>rule</code>将工作流中的所有步骤模块化写入,snakemake会自动分析不同<code>rule</code>之间的<code>input</code>和<code>output</code>的依赖关系,从而将这个流程串联起来。</p><p><code>rule all</code>是一个特殊的规则,用于定义工作流中的主要输出文件或目标,也就是指定我们希望整个工作流生成的结果。Snakemake 将根据 <code>rule all</code>中定义的目标来确定需要运行的规则,以确保生成这些目标文件,简化工作流的管理和执行过程。</p><p>比如上面的例子,我们希望最终得到<code>mapped_reads/A.bam</code>这个结果,可以加上一个<code>rule all</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">rule all:</span><br><span class="line"> input:</span><br><span class="line"> <span class="string">"mapped_reads/A.bam"</span></span><br><span class="line"> </span><br><span class="line">rule bwa_map:</span><br><span class="line"> input:</span><br><span class="line"> <span class="string">"data/genome.fa"</span>,</span><br><span class="line"> <span class="string">"data/samples/A.fastq"</span></span><br><span class="line"> output:</span><br><span class="line"> <span class="string">"mapped_reads/A.bam"</span></span><br><span class="line"> shell:</span><br><span 
class="line">        <span class="string">"bwa mem {input} | samtools view -Sb > {output}"</span></span><br></pre></td></tr></table></figure><p>More complex workflows add this rule as the entry point for snakemake. It usually only needs an <code>input</code>, because the final results of a workflow are the <code>output</code> of other <code>rule</code>s. Snakemake traces back from the <code>input</code> of <code>rule all</code>, layer by layer, until it reaches the <code>input</code> files of the earliest <code>rule</code>, that is, our raw files (here the sequencing reads and the genome), then works out the whole workflow and runs it in order.</p><p>If no requested target depends on a <code>rule</code>'s <code>output</code> files, that <code>rule</code> is not executed.</p><p>If no target file is given on the command line and the <code>Snakefile</code> does not define <code>rule all</code>, the first rule is executed by default.</p></div><h2 id="4-通配符"><a href="#4-通配符" class="headerlink" title="4. 通配符"></a>4. Wildcards</h2><div class="story post-story"><p>The examples so far used concrete file names for inputs and outputs. Real analyses usually have batches of files to process, and snakemake lets you name them with <strong>wildcards</strong>.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line">SAMPLES = [<span class="string">'A'</span>, <span class="string">'B'</span>, <span class="string">'C'</span>]</span><br><span class="line"></span><br><span class="line">rule all:</span><br><span class="line">    input:</span><br><span class="line">        <span class="built_in">expand</span>(<span class="string">"mapped_reads/{sample}.bam"</span>, sample=SAMPLES),</span><br><span class="line"></span><br><span class="line">rule bwa_map:</span><br><span class="line">    input:</span><br><span class="line">        REF=<span class="string">"data/genome.fa"</span>,</span><br><span class="line">        FQ=<span 
class="string">"data/samples/{sample}.fastq"</span>,</span><br><span class="line">    output:</span><br><span class="line">        BAM=<span class="string">"mapped_reads/{sample}.bam"</span>,</span><br><span class="line">    shell:</span><br><span class="line">        <span class="string">"bwa mem {input.REF} {input.FQ}| samtools view -Sb > {output.BAM}"</span></span><br></pre></td></tr></table></figure><p>This example shows snakemake's Python nature quite clearly:</p><ul><li><code>expand()</code> is a snakemake helper function, used in a rule's inputs, outputs, or anywhere else file paths need to be generated dynamically.</li><li>The <code>{sample}</code> wildcard stands for the file paths generated from the SAMPLES list.</li><li>Here <code>input</code> and <code>output</code> can be thought of as Python <strong>dictionary</strong>-like variables: <code>input.REF</code> reads the value stored under the key <code>REF</code>, which amounts to accessing the <code>REF</code> attribute of the <code>input</code> object, so it can also be written <code>input[REF]</code>. Naming entries this way makes it easy to tell the input and output files apart regardless of their order.</li></ul><p>And since this is Python, a list comprehension can also produce the output file names:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># With a list comprehension the SAMPLES list is no longer needed; it replaces the expand() call in the input of rule all.</span></span><br><span class="line"><span class="comment">## The two forms below are equivalent to</span></span><br><span class="line">expand(<span class="string">"mapped_reads/{sample}.bam"</span>, sample=SAMPLES),</span><br><span class="line"></span><br><span class="line"><span class="comment">## 1. %-formatting: %s is a string placeholder, %d an integer placeholder</span></span><br><span class="line">[<span class="string">"mapped_reads/%s.bam"</span> % sample <span class="keyword">for</span> sample <span class="keyword">in</span> [<span class="string">"A"</span>, <span class="string">"B"</span>,<span class="string">"C"</span>]],</span><br><span class="line"></span><br><span class="line"><span class="comment">## 
2. str.format(): call the method on the string and pass in the values; {} is the placeholder</span></span><br><span class="line">[<span class="string">"mapped_reads/{}.bam"</span>.<span class="built_in">format</span>(sample) <span class="keyword">for</span> sample <span class="keyword">in</span> [<span class="string">"A"</span>,<span class="string">"B"</span>,<span class="string">"C"</span>]],</span><br></pre></td></tr></table></figure><p>With several parameter lists, <code>expand()</code> accepts them all as keyword arguments:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">expand</span>(<span class="string">"mapped_reads/{sample}.{replicate}.bam"</span>, sample=SAMPLES, replicate=[0,1])</span><br><span class="line"><span class="comment">## Output: ["mapped_reads/A.0.bam","mapped_reads/A.1.bam","mapped_reads/B.0.bam","mapped_reads/B.1.bam","mapped_reads/C.0.bam","mapped_reads/C.1.bam"]</span></span><br><span class="line"><span class="comment">## Six files in total; these six outputs do not actually exist here, this is only a demonstration</span></span><br></pre></td></tr></table></figure><p>When there are many samples whose names follow a pattern, the <code>glob_wildcards()</code> function and the <code>wildcards</code> object can extract file names and paths. Both rely on wildcards similar to those in shell commands, but they are used differently.</p><p>Suppose we have the following paired-end sequencing files:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">files = ["fastq/sample1_R1.fastq", "fastq/sample1_R2.fastq", "fastq/sample2_R1.fastq", "fastq/sample2_R2.fastq"]</span><br></pre></td></tr></table></figure><p><code>glob_wildcards()</code> accepts a pattern with several wildcards and returns a <strong>tuple</strong>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">(SAMPLES, PAIR) = glob_wildcards(<span 
class="string">"fastq/{sample}_{pair}.fastq"</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment"># Return value: a tuple of two lists with one entry per matched file (order follows the directory listing)</span></span><br><span class="line">SAMPLES = [<span class="string">"sample1"</span>,<span class="string">"sample1"</span>,<span class="string">"sample2"</span>,<span class="string">"sample2"</span>]</span><br><span class="line">PAIR = [<span class="string">"R1"</span>,<span class="string">"R2"</span>,<span class="string">"R1"</span>,<span class="string">"R2"</span>]</span><br></pre></td></tr></table></figure><p>The <code>wildcards</code> object captures the wildcard values matched in a single file name:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><span class="line">(SAMPLES, PAIR) = glob_wildcards(<span class="string">"fastq/{sample}_{pair}.fastq"</span>)</span><br><span class="line"></span><br><span class="line">rule all:</span><br><span class="line">    input:</span><br><span class="line">        <span class="built_in">expand</span>(<span class="string">"result/{sample}_{pair}/result.txt"</span>, sample=SAMPLES, pair=PAIR)</span><br><span class="line"></span><br><span class="line">rule fastq:</span><br><span class="line">    input:</span><br><span class="line">        FQ=<span class="string">"fastq/{sample}_{pair}.fastq"</span>,</span><br><span 
class="line">    output:</span><br><span class="line">        RES=<span class="string">"result/{sample}_{pair}/result.txt"</span>,</span><br><span class="line">    shell:</span><br><span class="line">        <span class="string">""</span><span class="string">"</span></span><br><span class="line"><span class="string">echo {wildcards.sample} > {output.RES}</span></span><br><span class="line"><span class="string">echo {wildcards.pair} >> {output.RES}</span></span><br><span class="line"><span class="string">"</span><span class="string">""</span></span><br><span class="line"><span class="comment">## The resulting result directory looks like this; each txt file records the sample name and the read-pair label:</span></span><br><span class="line">result</span><br><span class="line">├── sample1_R1</span><br><span class="line">│   └── result.txt</span><br><span class="line">├── sample1_R2</span><br><span class="line">│   └── result.txt</span><br><span class="line">├── sample2_R1</span><br><span class="line">│   └── result.txt</span><br><span class="line">└── sample2_R2</span><br><span class="line">    └── result.txt</span><br></pre></td></tr></table></figure><p>In <code>shell</code>, the commands to run can also be wrapped in <strong>triple quotes</strong>. <code>shell</code> does not recognize the bare wildcards <code>{sample}</code> and <code>{pair}</code>: in a Snakemake rule, <code>input</code> and <code>output</code> can use the wildcard syntax directly, such as <code>{sample}</code> and <code>{pair}</code>, but the <code>shell</code> section must refer to a wildcard's value as <code>{wildcards.&lt;name&gt;}</code>, <strong>that is, as an attribute of the wildcards object</strong>.</p><p>Back to the data downloaded at the start. As a reminder, <code>glob_wildcards()</code> returns a <strong>tuple</strong>, <strong>so a trailing comma is needed when unpacking a single variable!</strong> Otherwise the files will not be found:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span 
class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line">(SAMPLES,) = glob_wildcards(<span class="string">"data/samples/{sample}.fastq"</span>) </span><br><span class="line"></span><br><span class="line">rule all:</span><br><span class="line">    input:</span><br><span class="line">        <span class="built_in">expand</span>(<span class="string">"mapped_reads/{sample}.bam"</span>, sample=SAMPLES),</span><br><span class="line"></span><br><span class="line">rule bwa_map:</span><br><span class="line">    input:</span><br><span class="line">        REF=<span class="string">"data/genome.fa"</span>,</span><br><span class="line">        FQ=<span class="string">"data/samples/{sample}.fastq"</span>,</span><br><span class="line">    output:</span><br><span class="line">        BAM=<span class="string">"mapped_reads/{sample}.bam"</span>,</span><br><span class="line">    shell:</span><br><span class="line">        <span class="string">"bwa mem {input.REF} {input.FQ}| samtools view -Sb > {output.BAM}"</span></span><br></pre></td></tr></table></figure><p>One more thing: when snakemake formats a <code>shell</code> command, curly braces <code>{}</code> mark substituted content, so any literal braces in the <code>shell</code> command must be <strong>escaped by doubling them</strong>.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">rule example_rule:</span><br><span class="line">    input:</span><br><span class="line">        <span class="string">"data/samples/A.fastq"</span></span><br><span class="line">    output:</span><br><span class="line">        <span class="string">"output_file.txt"</span></span><br><span class="line">    shell:</span><br><span class="line">        <span 
class="string">""</span><span class="string">"</span></span><br><span class="line"><span class="string">        # Double the braces here to escape them</span></span><br><span class="line"><span class="string">        ls data/samples/{{A,B,C}}.fastq > {output}</span></span><br><span class="line"><span class="string">        "</span><span class="string">""</span></span><br></pre></td></tr></table></figure></div><h2 id="5-配置文件"><a href="#5-配置文件" class="headerlink" title="5. 配置文件"></a>5. Configuration files</h2><div class="story post-story"><p>A Snakefile usually starts by defining various variables, but with a lot of raw data, keeping everything in one file gets messy. To make a snakemake script more general, you can use a configuration file that lists the raw data, reference genome, and other inputs the analysis needs.</p><p>A configuration file can be written in <code>JSON</code> or <code>YAML</code>; it is read in as a Python <strong>dictionary</strong> holding the configuration parameters and their values.</p><p>Take part of the configuration file from the <code>call SNP</code> workflow in the official repository as an example:</p><p><a href="https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/blob/main/config/config.yaml">dna-seq-gatk-variant-calling/config/config.yaml at main · snakemake-workflows/dna-seq-gatk-variant-calling (github.com)</a></p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">samples:</span> <span class="string">config/samples.tsv</span></span><br><span class="line"><span class="attr">units:</span> <span class="string">config/units.tsv</span></span><br><span class="line"></span><br><span class="line"><span class="attr">ref:</span></span><br><span class="line">  <span class="attr">species:</span> <span class="string">homo_sapiens</span></span><br><span class="line">  <span 
class="attr">release:</span> <span class="number">98</span></span><br><span class="line">  <span class="attr">build:</span> <span class="string">GRCh38</span></span><br><span class="line"></span><br><span class="line"><span class="attr">filtering:</span></span><br><span class="line">  <span class="attr">vqsr:</span> <span class="literal">false</span></span><br><span class="line">  <span class="attr">hard:</span></span><br><span class="line">    <span class="attr">snvs:</span></span><br><span class="line">      <span class="string">"QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"</span></span><br><span class="line">    <span class="attr">indels:</span></span><br><span class="line">      <span class="string">"QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0"</span></span><br></pre></td></tr></table></figure><p>Using the configuration parameters in a Snakefile:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># common.smk defines the helper scripts and configuration parameters that are needed</span></span><br><span class="line">configfile: <span class="string">"config/config.yaml"</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># defined in ref.smk</span></span><br><span class="line">rule get_genome:</span><br><span class="line">    output:</span><br><span class="line">        <span class="string">"resources/genome.fasta"</span>,</span><br><span class="line">    <span 
class="built_in">log</span>:</span><br><span class="line">        <span class="string">"logs/get-genome.log"</span>,</span><br><span class="line">    params:</span><br><span class="line">        species=config[<span class="string">"ref"</span>][<span class="string">"species"</span>],</span><br><span class="line">        datatype=<span class="string">"dna"</span>,</span><br><span class="line">        build=config[<span class="string">"ref"</span>][<span class="string">"build"</span>],</span><br><span class="line">        release=config[<span class="string">"ref"</span>][<span class="string">"release"</span>],</span><br><span class="line">    cache: True</span><br><span class="line">    wrapper:</span><br><span class="line">        <span class="string">"0.74.0/bio/reference/ensembl-sequence"</span></span><br></pre></td></tr></table></figure><p>As shown, configuration parameters are read by indexing the dictionary with keys: <code>configfile: "config.yaml"</code> loads the file into a dictionary named <code>config</code>.</p><p>This example defines many submodules with different functions and, like building blocks, assembles them into a complete analysis workflow.</p></div><h2 id="6-其他常用关键字"><a href="#6-其他常用关键字" class="headerlink" title="6. 其他常用关键字"></a>6. 
Other common keywords</h2><div class="story post-story"><p>The examples above use a few other common keywords, noted here as well.</p><h3 id="6-1-日志-log"><a href="#6-1-日志-log" class="headerlink" title="6.1 日志 log"></a>6.1 Logs: log</h3><p>When running, snakemake's log goes straight to the screen. The <code>log</code> keyword lets us choose where a rule's log is saved. When a job fails, files defined under <code>log</code> are not deleted by snakemake, whereas files under <code>output</code> are.</p><h3 id="6-2-规则参数-params"><a href="#6-2-规则参数-params" class="headerlink" title="6.2 规则参数 params"></a>6.2 Rule parameters: params</h3><p>Sometimes a <code>shell</code> command needs a parameter that varies by sample and is neither an input file nor an output file, such as the <code>bwa mem -R</code> option. Putting such parameters in <code>input</code> fails because no matching file exists. Snakemake provides the <code>params</code> keyword for them.</p><p><code>bwa mem -R</code> sets the read group header line, given inside a pair of quotes; it becomes the RG section of the SAM file. One sample may comprise several sequencing runs from different lanes or libraries, and alignments from different samples may be merged into one file, so RG tags are needed to tell them apart. Within RG, each tag separates key and value with a colon, and tags are separated by '\t', for example '@RG\tID:foo\tSM:bar\tLB:library1'.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line">(SAMPLES,) = glob_wildcards(<span class="string">"data/samples/{sample}.fastq"</span>) </span><br><span class="line"></span><br><span class="line">rule all:</span><br><span class="line">    input:</span><br><span class="line">        <span class="built_in">expand</span>(<span class="string">"mapped_reads/{sample}.bam"</span>, sample=SAMPLES),</span><br><span class="line"></span><br><span class="line">rule bwa_map:</span><br><span class="line">    input:</span><br><span class="line">        REF=<span class="string">"data/genome.fa"</span>,</span><br><span 
class="line">        FQ=<span class="string">"data/samples/{sample}.fastq"</span>,</span><br><span class="line">    output:</span><br><span class="line">        BAM=<span class="string">"mapped_reads/{sample}.bam"</span>,</span><br><span class="line">    threads: 8 <span class="comment"># threads sets the number of threads</span></span><br><span class="line">    params:</span><br><span class="line">        rg=r<span class="string">"@RG\tID:{sample}\tSM:{sample}"</span></span><br><span class="line">    shell:</span><br><span class="line">        <span class="string">"bwa mem -R '{params.rg}' -t {threads} {input.REF} {input.FQ}| samtools view -Sb - > {output.BAM}"</span></span><br></pre></td></tr></table></figure><h3 id="6-3-运行时间和内存"><a href="#6-3-运行时间和内存" class="headerlink" title="6.3 运行时间和内存"></a>6.3 Run time and memory</h3><p>The file given under the <code>benchmark</code> keyword automatically records the time and memory the rule consumes when it runs.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line">(SAMPLES,) = glob_wildcards(<span class="string">"data/samples/{sample}.fastq"</span>) </span><br><span class="line"></span><br><span class="line">rule all:</span><br><span class="line">    input:</span><br><span class="line">        <span class="built_in">expand</span>(<span class="string">"mapped_reads/{sample}.bam"</span>, sample=SAMPLES),</span><br><span class="line"></span><br><span class="line">rule 
bwa_map:</span><br><span class="line">    input:</span><br><span class="line">        REF=<span class="string">"data/genome.fa"</span>,</span><br><span class="line">        FQ=<span class="string">"data/samples/{sample}.fastq"</span>,</span><br><span class="line">    output:</span><br><span class="line">        BAM=protected(<span class="string">"mapped_reads/{sample}.bam"</span>),</span><br><span class="line">    threads: 8</span><br><span class="line">    params:</span><br><span class="line">        rg=r<span class="string">"@RG\tID:{sample}\tSM:{sample}"</span></span><br><span class="line">    <span class="built_in">log</span>:</span><br><span class="line">        <span class="string">"logs/bwa_mem/{sample}.log"</span></span><br><span class="line">    benchmark:</span><br><span class="line">        <span class="string">"benchmarks/{sample}.bwa.benchmark.txt"</span></span><br><span class="line">    shell:</span><br><span class="line">        <span class="string">"bwa mem -R '{params.rg}' -t {threads} {input.REF} {input.FQ}| samtools view -Sb - > {output.BAM}"</span></span><br></pre></td></tr></table></figure><h3 id="6-4-wrapper"><a href="#6-4-wrapper" class="headerlink" title="6.4 wrapper"></a>6.4 wrapper</h3><p>This keyword names a wrapper script, which automates configuring and running a particular program or tool. Using <code>wrapper</code> simplifies writing rules, especially for tools that need complex parameters or environment setup.</p><p>Specifically, the <code>wrapper</code> keyword helps with:</p><ol><li><strong>Automated software configuration</strong>: a wrapper script lets a rule set up the tool's installation path, parameters, environment variables, and so on automatically, which simplifies the rule and ensures the software is configured correctly.</li><li><strong>Version control</strong>: a wrapper usually carries configuration for a specific software version, so the same version is used across environments and version mismatches are avoided.</li><li><strong>Parameter settings</strong>: a wrapper can preset the tool's parameters so they need not be repeated each time the tool is called in a rule.</li><li><strong>Environment isolation</strong>: a wrapper helps isolate the tool's runtime environment, so the software called in a rule runs in its own environment without conflicting with other software or libraries.</li></ol><p>For instance, the earlier example used the <code>wrapper</code> value <code>"0.74.0/bio/reference/ensembl-sequence"</code>, which refers to the following packaged script:</p><p><a href="https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/reference/ensembl-sequence.html#ensembl-sequence">ENSEMBL-SEQUENCE | Snakemake wrappers 
(snakemake-wrappers.readthedocs.io)</a></p><p>These packaged scripts can be browsed on the official site:</p><p><a href="https://snakemake-wrappers.readthedocs.io/en/stable/">The Snakemake Wrappers repository | Snakemake wrappers (snakemake-wrappers.readthedocs.io)</a></p><h3 id="6-5-其他"><a href="#6-5-其他" class="headerlink" title="6.5 其他"></a>6.5 Others</h3><p>Rather than list everything, here is a table:</p><table><thead><tr><th>Keyword</th><th>Description</th><th>Keyword</th><th>Description</th></tr></thead><tbody><tr><td>message</td><td>Print a message</td><td>threads</td><td>Number of threads for the rule</td></tr><tr><td>resources</td><td>Resources used, such as memory and CPU</td><td>version</td><td>Rule version</td></tr><tr><td>conda</td><td>Conda environment</td><td>singularity</td><td>Run in a container environment</td></tr><tr><td>run</td><td>Run multi-line Python code</td><td>shell</td><td>Run a shell command</td></tr><tr><td>script</td><td>Script used by the rule</td><td>notebook</td><td>Run a Jupyter notebook file</td></tr></tbody></table><p>Command-line options:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># number of times to retry a failed job</span></span><br><span class="line">snakemake --restart-times 3</span><br><span class="line"></span><br><span class="line"><span class="comment"># visualize: generate the directed acyclic graph</span></span><br><span class="line">snakemake --dag | dot -Tsvg > dag.svg</span><br><span class="line">snakemake --dag | dot -Tpdf > dag.pdf</span><br><span class="line"></span><br><span class="line"><span class="comment"># specify the Snakefile to run</span></span><br><span class="line">snakemake -s Snakefile</span><br><span class="line"></span><br><span class="line"><span class="comment"># dry run</span></span><br><span class="line">snakemake 
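-np</span><br></pre></td></tr></table></figure><p>The generated directed acyclic graph (the workflow diagram here) looks like this:</p><p><img src="https://www.shelven.com/tuchuang/20240407/3.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240407/3.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p></div><h2 id="总结"><a href="#总结" class="headerlink" title="总结"></a>Summary</h2><div class="story post-story"><p>I have only scratched the surface of Snakemake, but the more I learn, the more convenient and powerful it turns out to be. My earlier bioinformatics analyses were bits of code here and strings of data there, and my blog posts were just as scattered, so it really is worth chaining the steps into one maintainable workflow. There is still a lot to learn.</p><p>Snakemake has many more strengths not shown here, such as <strong>support for large-scale parallel computing</strong> (once the university cluster is online I will definitely try submitting slurm jobs), <strong>strong portability</strong> (support for several dependency tools: conda environments, singularity, jupyter), and modular workflows (split into multiple rules and pulled in with <code>include</code>).</p><p>While exploring I also came across many official workflow examples and a number of very good advanced tutorials in Chinese, all of which I need to work through and test hands-on.</p><p><a href="https://cloud.tencent.com/developer/article/2032005">Snakemake+RMarkdown定制你的分析流程和报告-腾讯云开发者社区-腾讯云 (tencent.com)</a></p><p>A note to self for later: when I have time I will rewrite my GATK SNP and INDEL calling pipeline in snakemake to deepen my understanding, and ask a senior labmate how to combine snakemake with Jupyter, since Jupyter is more comfortable for writing Python code and easier for testing.</p></div>]]></content>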
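<summary type="html"><p><code>snakemake</code> is a powerful workflow management tool for building and running complex data analysis workflows. Its workflows are written in a <code>python</code>-based description language similar to a <code>Makefile</code>. A <code>Makefile</code> specifies the dependencies between source files and how to compile them into executables and libraries, whereas <code>snakemake</code> specifies the dependencies and rules of a data-processing pipeline and runs those rules automatically to produce the final outputs.</p></summary>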
<category term="学习笔记" scheme="http://www.shelven.com/categories/%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/"/>
<category term="Snakemake" scheme="http://www.shelven.com/tags/Snakemake/"/>
</entry>
<entry>
<title>Gene family analysis: identifying gene family members</title>
<link href="http://www.shelven.com/2024/04/04/a.html"/>
<id>http://www.shelven.com/2024/04/04/a.html</id>
<published>2024-04-04T07:30:25.000Z</published>
<updated>2024-04-04T07:37:08.000Z</updated>
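<content type="html"><![CDATA[<p>Gene family analysis papers have been multiplying lately (along with wall-to-wall advertising for training courses), and CNKI gets new ones almost daily. Many are purely bioinformatic, using genomes, protein sequences, and transcriptome data from public databases, with at most a qPCR validation added (some lack even that), so the content is shallow and rather watered down. This note is not meant to encourage padding out papers; the point is to learn an analysis method and not to think of it as too difficult: it can be done with zero coding background.</p><span id="more"></span><p>Prepare the following data before starting:</p><ul><li>Protein sequences of the species under study (derived from either the genome or the transcriptome), for identifying the gene family</li><li>The genome sequence and annotation file (<code>gff</code> or <code>gtf</code>), for chromosomal distribution, gene structure analysis, synteny analysis, and prediction of cis-acting elements in promoter regions</li><li>Protein sequences of the same gene family from a model plant (for example Arabidopsis), for building the phylogenetic tree</li></ul><h2 id="鉴定基因家族"><a href="#鉴定基因家族" class="headerlink" title="鉴定基因家族"></a>Identifying the gene family</h2><div class="story post-story"><p>A gene family is a group of genes descended from a common ancestor through duplication of a single gene. Its members are clearly similar in structure and function, encoding similar protein products that perform similar roles. Members of one gene family share the same conserved domain, which lets us screen protein sequences for members of the target family.</p><p>Using the <strong>WRKY</strong> gene family I worked on as an example, here is a brief walkthrough of identifying candidate family members. <strong>Only the protein sequences of the species under study are needed here</strong>.</p></div><h2 id="1-通过注释获得候选基因家族"><a href="#1-通过注释获得候选基因家族" class="headerlink" title="1. 通过注释获得候选基因家族"></a>1. Getting candidate family members from annotation</h2><div class="story post-story"><p>First run the <strong>protein sequences</strong> (fewer than 100,000 of them) through automatic annotation on the orthology-based annotation site <code>eggNOG-mapper</code>:</p><p><a href="http://eggnog-mapper.embl.de/">eggNOG-mapper (embl.de)</a></p><p><img src="https://www.shelven.com/tuchuang/20240403/1.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240403/1.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>How to use the site is covered in an earlier post, so I will not repeat it. <a href="https://www.shelven.com/2023/09/19/a.html">Genome annotation (6): annotating functional genes with the online eggNOG-mapper</a></p><p>The goal is the <code>Pfam</code> annotation in the last column of the result file <code>out.emapper.annotations.xls</code> (the KEGG and GO annotations in other columns can also be used for enrichment analysis). <code>Pfam</code> is a protein family database that provides a hidden Markov model (HMM) for each family.</p><p>The web version of <code>eggNOG-mapper</code> annotates by first using <code>hmmer3</code> to find the best-matching HMM in the eggNOG HMM collection for every protein sequence, then using <code>phmmer</code> to search each query protein against the proteins behind that best-matching HMM and transfer the corresponding annotation; see the reference below:</p><p><a href="https://www.biorxiv.org/content/10.1101/076331v1.full">Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper | bioRxiv</a></p><p>In effect the site has already done the <code>hmm 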
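search</code> for us. We only need to filter the last column with the keyword <code>WRKY</code> and remove genes whose column H (Description) suggests they are unrelated to WRKY, which leaves 50 candidate WRKY family members. The gene names in column A (query) can then be extracted, and <strong>the protein sequences matching those names looked up and collected as the candidate family protein set</strong>.</p><p><img src="https://www.shelven.com/tuchuang/20240403/2.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240403/2.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p></div><h2 id="2-通过hmm-search鉴定基因家族"><a href="#2-通过hmm-search鉴定基因家族" class="headerlink" title="2. 通过hmm search鉴定基因家族"></a>2. Identifying the gene family with hmm search</h2><div class="story post-story"><p>The annotation above will not necessarily contain members of the target family. In that case, download the family's HMM directly from the <code>Pfam</code> site, or build an HMM from a multiple sequence alignment, then search by protein domain for candidate family members.</p><h3 id="2-1-使用TBtools的Simple-HMM-Search功能"><a href="#2-1-使用TBtools的Simple-HMM-Search功能" class="headerlink" title="2.1 使用TBtools的Simple HMM Search功能"></a>2.1 Using the Simple HMM Search function in TBtools</h3><p><code>TBtools</code> is fairly full-featured and really does allow code-free gene family analysis. Still, I think it is worth understanding the principles: the software is someone else's work, and relying on it too heavily leaves you in a passive position.</p><p>First download the HMMs for the whole Pfam database:</p><p><a href="https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz">https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz</a></p><p>Uncompressed it is about 1.5 GB and contains <strong>all</strong> the HMMs!</p><p>Open the <code>Simple HMM Search</code> page in TBtools:</p><p><img src="https://www.shelven.com/tuchuang/20240403/1.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240403/1.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>The first input box takes the species' protein sequence file.</p><p>The second takes the uncompressed full Pfam model file.</p><p>The third takes the accession of the HMM you want to search with. To search with the WRKY family model <code>PF03106</code>, for example, create a file whose first line is PF03106 (several models are allowed, one accession per line). The accession (PF number) usually comes from the literature, or from searching the <code>Pfam</code> database for the family's domain. <a href="https://www.ebi.ac.uk/interpro/search/text/WRKY/#table">Search - InterPro (ebi.ac.uk)</a></p><p>The fourth sets the output file location and name.</p><p>The result file has two parts, <code>Sequence scores</code> and <code>Domain scores</code>; filter by E-value as needed.</p><p><img 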
src="https://www.shelven.com/tuchuang/20240403/2.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240403/2.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>You can also see that the tool simply calls <code>hmmsearch</code> to search for homologous sequences.</p><p>If your gene family happens to have no conserved domain entry on the Pfam site, you will need to build your own HMM from a multiple sequence alignment.</p><p>On the TBtools author's Zhihu account I saw an HMM search that starts from a multiple sequence alignment, <strong>but I could not find the corresponding plugin in the latest version, 2.081...</strong></p><p><a href="https://zhuanlan.zhihu.com/p/358660673?utm_id=0">插件 | 地表最强 Hmmer Search 界面工具 - 知乎 (zhihu.com)</a></p><p>No matter, we can use the <strong>HMMER package</strong> directly.</p><h3 id="2-2-使用HMMER软件包"><a href="#2-2-使用HMMER软件包" class="headerlink" title="2.2 使用HMMER软件包"></a>2.2 Using the HMMER package</h3><p>Using HMMER requires a rough understanding of two main programs:</p><ul><li><code>hmmbuild</code>: builds an HMM from a multiple sequence alignment</li><li><code>hmmsearch</code>: searches a protein sequence database with an HMM</li></ul><p>Again taking the WRKY family as the example, download the HMM of the WRKY domain directly from the Pfam site (search for <code>PF03106</code>, then click <strong>Curation</strong> and <strong>Download</strong>)</p><p><img src="https://www.shelven.com/tuchuang/20240403/3.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240403/3.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">hmmsearch --cut_tc --domtblout Ap-WRKY.out PF03106.hmm Ap_rmTE.pep.fa</span><br><span class="line"></span><br><span class="line"><span class="comment"># --cut_tc applies the model's trusted cutoff (TC) bit-score thresholds to the search</span></span><br><span class="line"><span class="comment">## options: --cut_ga, --cut_nc and --cut_tc, i.e. the GA (gathering), NC (noise) and TC (trusted) cutoffs</span></span><br><span class="line"><span class="comment"># --domtblout saves the per-domain hits to the output file</span></span><br><span class="line"><span class="comment">## 
available output table options: --tblout, --domtblout, --pfamtblout</span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20240403/3.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240403/3.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>I used the <code>--domtblout</code> output format here; choosing the last format in the list (<code>--pfamtblout</code>) gives output that is <strong>exactly the same</strong> as the <code>tbtools</code> result above.</p><p>Once the candidate family members have been filtered out, we can go a step further and build a WRKY gene family HMM profile specific to this species. As for the case mentioned earlier, <strong>where Pfam has no information on the conserved domain you need</strong>, we can likewise collect homologous proteins from <strong>other species</strong> and build an HMM profile from their multiple sequence alignment.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">grep -v <span class="string">"#"</span> Ap-WRKY.out|awk <span class="string">'($7 + 0) < 1E-20'</span>|<span class="built_in">cut</span> -f1 -d <span class="string">" "</span>|<span class="built_in">sort</span> -u > Ap-WRKY-geneid.txt</span><br><span class="line"><span class="comment"># Drop comment lines containing "#", coerce column 7 (full-sequence E-value) to a number and keep rows below 1E-20, take the first space-delimited field, then sort and deduplicate to get the gene IDs</span></span><br><span class="line"></span><br><span class="line">seqkit grep -f Ap-WRKY-geneid.txt Ap_rmTE.pep.fa > Ap-WRKY.fasta</span><br><span class="line"><span class="comment"># Extract the protein sequences</span></span><br><span class="line"></span><br><span class="line">mafft --localpair --maxiterate 1000 Ap-WRKY.fasta > WRKY-aln.fas</span><br><span class="line"><span class="comment"># Multiple sequence alignment with mafft (clustal, muscle, etc. also work). The output here is FASTA; hmmbuild also accepts Stockholm format (.sto)</span></span><br><span class="line"></span><br><span class="line">hmmbuild Ap_WRKY.hmm WRKY-aln.fas</span><br><span class="line"><span class="comment"># 
Build the HMM profile</span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20240403/4.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240403/4.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>The profile built in this step has the same format as the HMM profile downloaded from the <code>pfam</code> website (apart from the GA/TC/NC cutoffs, discussed below); the only difference is that our own sequence alignment serves as the seed from which the HMM is built. Next, scan the species' protein set with this profile to look for further candidate family members.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">hmmsearch --domtblout WRKY.out Ap_WRKY.hmm Ap_rmTE.pep.fa</span><br><span class="line"><span class="comment"># Run the HMM search against the protein set again</span></span><br><span class="line"></span><br><span class="line">grep -v <span class="string">"#"</span> WRKY.out|awk <span class="string">'($7 + 0) < 1E-20'</span>|<span class="built_in">cut</span> -f1 -d <span class="string">" "</span>|<span class="built_in">sort</span> -u > WRKY-geneid.txt</span><br><span class="line"><span class="comment"># Filter</span></span><br><span class="line"></span><br><span class="line">seqkit grep -f WRKY-geneid.txt Ap_rmTE.pep.fa > WRKY.fasta</span><br><span class="line"><span class="comment"># Extract</span></span><br></pre></td></tr></table></figure><p>One caveat: an HMM built from a FASTA-format multiple sequence alignment <strong>does not contain the GA, TC, and NC cutoffs</strong>. Only an alignment in the officially recommended <strong>SELEX</strong> or <strong>Stockholm</strong> format (the Pfam-friendly alignment formats) can carry these three cutoffs, and only when they are present as annotation (in Stockholm, the <code>#=GF GA</code>, <code>#=GF TC</code>, and <code>#=GF NC</code> lines). Otherwise, selecting a cutoff with an option such as <code>--cut_tc</code> <strong>fails with an error</strong>!</p><p>A website that converts between multiple sequence alignment formats, in case you need it:<br><a 
href="https://sequenceconversion.bugaco.com/converter/biology/sequences/fasta_to_phylip.php">https://sequenceconversion.bugaco.com/converter/biology/sequences/fasta_to_phylip.php</a></p><p>This approach found more candidate family members than the direct <code>hmmsearch</code> above; here too you have to choose the filtering threshold yourself.</p><h3 id="2-3-HMMER补充说明"><a href="#2-3-HMMER补充说明" class="headerlink" title="2.3 HMMER补充说明"></a>2.3 Additional notes on HMMER</h3><p>The HMMER package does far more than build HMM profiles and search for homologous sequences, so here is a quick look at some of its other functions.</p><p>A protein can contain several kinds of domains. Suppose I want to know which <strong>other domains</strong> the candidate WRKY family members above contain. For that there is the <code>hmmscan</code> command, which <strong>searches an HMM library with protein sequences</strong>.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">gzip -d Pfam-A.hmm.gz</span><br><span class="line"><span class="comment"># Decompress the full Pfam HMM library</span></span><br><span class="line"></span><br><span class="line">hmmpress Pfam-A.hmm</span><br><span class="line"><span class="comment"># Build the index</span></span><br><span class="line"></span><br><span class="line">hmmscan -o result.txt --tblout result2.txt --domtblout result3.txt -E 1e-5 Pfam-A.hmm WRKY.fasta</span><br><span class="line"><span class="comment"># Annotate the protein domains using Pfam-A.hmm as the reference</span></span><br><span class="line"><span class="comment">## --tblout and --domtblout can also be used for output; both are easier to read than the -o output</span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20240403/5.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240403/5.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>The results show that the WRKY family members identified earlier also contain the <code>Plant_zn_clust</code> domain, a zinc-finger domain commonly found in the WRKY gene family.</p><p>As an aside, HMMER also has an online service; see <a href="https://www.ebi.ac.uk/Tools/hmmer/search/phmmer">phmmer 
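search | HMMER (ebi.ac.uk)</a></p><p><img src="https://www.shelven.com/tuchuang/20240403/6.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240403/6.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>It offers, among others, the <code>hmmscan</code> and <code>hmmsearch</code> functions described above, but neither lets you <strong>supply your own protein dataset</strong>; only the online databases can be searched, such as <code>Reference Proteomes</code>, <code>SwissProt</code>, and <code>Ensembl</code>. In other words, if your species is obscure or lacks a well-annotated protein set, you cannot identify gene families through this site, nor build your own HMM profile there.</p><p>If you work on a model plant, congratulations: choose the first option, the high-confidence reference proteome database <code>Reference Proteomes</code>, and specify the species name.</p><p>Two more functions are worth a mention:</p><ul><li><code>phmmer</code>: similar to blastp; searches a protein sequence database with a single protein sequence</li><li><code>jackhmmer</code>: similar to PSI-BLAST; iteratively searches a protein sequence database with a protein sequence</li></ul><p><code>Jackhmmer</code> is an HMM-based iterative search algorithm that takes one or more protein sequences and looks for homologs in a protein sequence database. It first builds an initial HMM from the input sequence and searches the chosen database, then adds the matching sequences to the input (this is the iteration), rebuilds the HMM, and repeats the search until the maximum number of iterations is reached or no new matches appear. Its advantage is that it can find more distant homologs, which improves sensitivity and accuracy.</p><p>The official site provides example data for every function that can be run online; one run-through is enough to understand them.</p></div><h2 id="3-验证基因家族"><a href="#3-验证基因家族" class="headerlink" title="3. 验证基因家族"></a>3. 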
Validating the gene family</h2><div class="story post-story"><p>We usually validate the identification results with more than one method, most commonly <code>blastp</code>. I will not go into how to use blastp here; see my earlier post <a href="https://www.shelven.com/2022/07/05/a.html">Finding homologous genes with blastn & blastp</a></p><p>There are also online resources, such as NCBI's CDD database: <a href="https://www.ncbi.nlm.nih.gov/cdd/">Home - Conserved Domains - NCBI (nih.gov)</a></p><p><img src="https://www.shelven.com/tuchuang/20240403/7.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240403/7.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Use the <code>Batch CD-Search</code> module to run domain prediction once more on the protein sequences of the identified family members. The result looks like this:</p><p><img src="https://www.shelven.com/tuchuang/20240403/8.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240403/8.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>This both checks the accuracy of the identified gene family and lets you <strong>draw a conserved-domain diagram from the CDD predictions</strong>, which I will cover another time.</p></div>]]></content>
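<summary type="html"><p>Gene family analysis papers keep multiplying lately (training-course ads are everywhere), and new ones appear on CNKI almost every day. Many of them are pure bioinformatics, built on genomes, protein sequences, and transcriptome data from public databases, with at most a qPCR validation added (some do not even have that), so the content is shallow and padded. I am not writing this note to encourage padding out papers; the point is to learn an analysis method and to stop treating this kind of analysis as hard. It can be done with no coding background at all.</p></summary>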
<category term="学习笔记" scheme="http://www.shelven.com/categories/%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/"/>
<category term="基因家族分析" scheme="http://www.shelven.com/categories/%E5%9F%BA%E5%9B%A0%E5%AE%B6%E6%97%8F%E5%88%86%E6%9E%90/"/>
<category term="HMMER3" scheme="http://www.shelven.com/tags/HMMER3/"/>
<category term="TBtools" scheme="http://www.shelven.com/tags/TBtools/"/>
</entry>
<entry>
<title>Gene Family Analysis: Cis-Acting Element Prediction and Plotting</title>
<link href="http://www.shelven.com/2024/04/01/a.html"/>
<id>http://www.shelven.com/2024/04/01/a.html</id>
<published>2024-04-01T15:48:22.000Z</published>
<updated>2024-04-01T15:51:06.000Z</updated>
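<content type="html"><![CDATA[<p>Here are my tidied notes from a gene family analysis I did a while ago. This part needs little code, because ready-made software and websites do most of the work and leave little to create yourself; a by-the-numbers workflow honestly fails to excite me... So I will just record the part where you have to organize the data and write the plotting code yourself: cis-acting element prediction for a gene family.</p><span id="more"></span><p>How to identify a gene family is something I will write up when I have time. The analysis of conserved protein domains and motifs, the chromosomal distribution of the family, collinearity among homologous genes, and so on are all very easy, so I will not go over them here. <strong>At this point, I assume you already have the genome gene names of the family members.</strong></p><h2 id="1-顺式作用元件概念"><a href="#1-顺式作用元件概念" class="headerlink" title="1. 顺式作用元件概念"></a>1. What a cis-acting element is</h2><div class="story post-story"><p>The concept comes first: a <strong>cis-acting element</strong> (cis-regulatory element, CRE) is a stretch of <strong>non-coding DNA</strong> linked in tandem with a structural gene that regulates the transcription of <strong>nearby genes</strong> (cis is Latin for "on the same side", i.e. on the same DNA molecule as the gene it regulates).</p><p><strong>Cis-acting elements are not found only upstream of the transcription start site!</strong> The term cis-regulatory module was originally used to describe enhancers; see the paper below:</p><p><a href="https://www.sciencedirect.com/science/article/abs/pii/S1084952109001542?via=ihub">A systems biology approach to understanding cis-regulatory module function - ScienceDirect</a></p><p>Today a cis-acting element (also called a cis-regulatory module, CRM) is defined as a DNA sequence that contains transcription factor binding sites. This covers promoters, enhancers, silencers, insulators, and so on, so <strong>it can act upstream of a gene, downstream of it, or even inside it!</strong></p><p>Current gene family studies, however, uniformly predict cis-acting elements from roughly <code>1000 - 2000 bp</code> upstream of the gene, which is somewhat biased. A 2000 bp window covers only the promoter region (in bacteria, 2000 bp may even reach into neighboring genes), while elements far from the transcription start site, such as enhancers, cannot be detected. So in my view, for <strong>eukaryotes</strong> this 2000 bp window can only be called a <strong>prediction of cis-acting elements in the promoter region</strong>.</p></div><h2 id="2-提取起始密码子上游序列"><a href="#2-提取起始密码子上游序列" class="headerlink" title="2. 提取起始密码子上游序列"></a>2. 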
Extracting the sequence upstream of the start codon</h2><div class="story post-story"><p><del>All of the above is really just me complaining that there are too many formulaic papers like this... When doing research we should think about why we are doing something, rather than blindly following a template.</del></p><p>Back to the topic: we too take the 2000 bp upstream of the start codon and predict the cis-acting elements in the promoter regions of the gene family.</p><p>Use <code>seqkit</code> to extract the <strong>sequence upstream of the start codon</strong> from the genome file in <code>fa</code> format and the annotation file in <code>gtf</code> format:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">seqkit subseq --gtf Ap_rmTE.gtf --feature start_codon --up-stream 2000 --only-flank --gtf-tag transcript_id /public/home/wlxie/Genome/Ap.fasta > upstream.fa</span><br><span class="line"></span><br><span class="line"><span class="comment"># --feature: the feature type to select</span></span><br><span class="line"><span class="comment"># --up-stream: length of upstream sequence to take</span></span><br><span class="line"><span class="comment"># --only-flank: return only the flanking sequence, without the feature itself</span></span><br><span class="line"><span class="comment"># --gtf-tag: the GTF attribute tag appended to the sequence name</span></span><br></pre></td></tr></table></figure><p>The resulting upstream sequences look like the figure below; the value of the chosen tag is appended to the end of each sequence name:</p><p><img src="https://www.shelven.com/tuchuang/20240401/1.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240401/1.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Using the mapping between the tag and the names of the genes you want to analyze, pick out the upstream sequences of those genes, and rename them along the way if you like.</p></div><h2 id="3-启动子区域顺式作用元件预测"><a href="#3-启动子区域顺式作用元件预测" class="headerlink" title="3. 启动子区域顺式作用元件预测"></a>3. 
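Predicting cis-acting elements in the promoter region</h2><div class="story post-story"><p>Upload the selected promoter-region sequences (FASTA format) to <strong>PlantCARE</strong> for cis-element prediction:</p><p><a href="https://bioinformatics.psb.ugent.be/webtools/plantcare/html/">PlantCARE, a database of plant promoters and their cis-acting regulatory elements (ugent.be)</a></p><p><img src="https://www.shelven.com/tuchuang/20240401/2.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240401/2.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>When the job finishes you receive an email with the prediction results attached. We only need the summary file named like <code>plantCARE_output_PlantCARE_5730.tab</code>; rename it with an <code>xls</code> extension and it can be opened and edited in Excel.</p><p>For my 50 genes, <strong>8718 cis-acting elements were predicted in the promoter regions</strong>, and these have to be filtered according to your own experiment. For example, <strong>I want to know what functions the cis elements in these promoter regions have</strong>, so my filtering criteria are:</p><blockquote><ol><li>Delete CAAT-box and TATA-box, because both are required for transcription initiation and add nothing functionally to my analysis</li><li>Delete cis elements that have no name in the second column</li><li>Delete cis elements that have no functional annotation in the last column</li></ol></blockquote><p><strong>After these deletions, only 1230 of the 8718 cis elements remain</strong>, an average of roughly 25 functionally annotated promoter-region cis elements per gene (1230 / 50 ≈ 24.6), which is within an acceptable range.</p><p>Run the functional annotations in the last column through a translation tool. There are only about 40 types in total; classify them by their functional description and batch-replace the last column, as follows:</p><p><img src="https://www.shelven.com/tuchuang/20240401/3.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240401/3.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Column D is the start position, column E the length of the cis element, column F the strand, column G the species the element was predicted from, and column H the revised function name.</p><p>In short, I grouped the functions of these cis elements into four types: <strong>light response (light_responsive), hormone response (hormone_responsive), growth and development (development_related), and environmental stress (environmental_stress)</strong>.</p></div><h2 id="3-顺式作用元件统计图和堆积柱状图"><a href="#3-顺式作用元件统计图和堆积柱状图" class="headerlink" title="3. 顺式作用元件统计图和堆积柱状图"></a>4. 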
Count chart and stacked bar chart of cis-acting elements</h2><div class="story post-story"><p>Some analyses display the cis elements as a structure diagram, like the one below (make two bed files, then visualize them on the <a href="https://gsds.gao-lab.org/">GSDS</a> website):</p><p><img src="https://www.shelven.com/tuchuang/20240401/4.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240401/4.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Personally I see little value in this. It treats the 2000 bp sequence as if it were a gene and <strong>draws each cis-acting element as if it were an exon</strong>. Never mind that the cis elements are under 10 bp each, so every "exon" looks the same size; there are also too many functional categories, and a structure diagram splashed with colors does not let the reader see at a glance what you are trying to convey.</p><p>I think it is better to count directly which functions the predicted cis elements have and how many members fall into each of the four functional groups above; a chart that yields useful information at a glance is more practical. The position information is therefore no longer needed. Trim the table further, <strong>keeping only columns A, B, and H</strong>, and rename it <code>cis.classification.tab.xls</code></p><p><img src="https://www.shelven.com/tuchuang/20240401/5.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240401/5.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>I have a requirement on the order in which genes are displayed: <strong>genes of the same subfamily must sit together for comparison</strong>. That calls for a file that sets the gene order, <code>order.txt</code>:</p><p><img src="https://www.shelven.com/tuchuang/20240401/7.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240401/7.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>It is simply a manual ordering of the genes, which makes it easier to add the x-axis and y-axis group labels in Adobe Illustrator (AI) afterwards. The plot is drawn in R with the following code:</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span 
class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br></pre></td><td class="code"><pre><span class="line">library<span class="punctuation">(</span>tidyverse<span class="punctuation">)</span></span><br><span class="line">library<span class="punctuation">(</span>ggplot2<span class="punctuation">)</span></span><br><span class="line">setwd<span class="punctuation">(</span><span class="string">"D:\\zhuomian\\WRKY\\04.cis-acting"</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line">df <span class="operator"><-</span> read_tsv<span class="punctuation">(</span><span class="string">"D:\\zhuomian\\WRKY\\04.cis-acting\\cis.classification.tab.xls"</span><span class="punctuation">,</span>col_names <span class="operator">=</span> <span class="built_in">F</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 第三列排序</span></span><br><span class="line">df <span class="operator"><-</span> df <span class="operator">%>%</span> arrange<span class="punctuation">(</span>X3<span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 第二列和第三列去重,重设因子水平</span></span><br><span 
class="line">uniq_level <span class="operator"><-</span> df <span class="operator">%>%</span> distinct<span class="punctuation">(</span>X2<span class="punctuation">,</span>X3<span class="punctuation">)</span></span><br><span class="line">df<span class="operator">$</span>X2 <span class="operator"><-</span> factor<span class="punctuation">(</span>df<span class="operator">$</span>X2<span class="punctuation">,</span> levels <span class="operator">=</span> uniq_level<span class="operator">$</span>X2<span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 整理数据, 按照第一列X1和第二列X2进行分组,对每个组计算数量,并将结果保存在新的列number中,按照数量(降序)进行排序,重新赋值给tidy</span></span><br><span class="line">tidy <span class="operator"><-</span> df <span class="operator">%>%</span></span><br><span class="line"> group_by<span class="punctuation">(</span>X1<span class="punctuation">,</span>X2<span class="punctuation">)</span> <span class="operator">%>%</span></span><br><span class="line"> summarise<span class="punctuation">(</span>number <span class="operator">=</span> n<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">%>%</span></span><br><span class="line"> arrange<span class="punctuation">(</span>desc<span class="punctuation">(</span>number<span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 从文件中读取排序顺序(基因家族的亚家族分组)</span></span><br><span class="line">order <span class="operator"><-</span> readLines<span class="punctuation">(</span><span class="string">"order.txt"</span><span class="punctuation">)</span></span><br><span class="line">tidy<span class="operator">$</span>X1 <span class="operator"><-</span> factor<span class="punctuation">(</span>tidy<span class="operator">$</span>X1<span class="punctuation">,</span> levels <span class="operator">=</span> order<span 
class="punctuation">)</span></span><br><span class="line">tidy <span class="operator"><-</span> tidy<span class="punctuation">[</span>order<span class="punctuation">(</span>tidy<span class="operator">$</span>X1<span class="punctuation">)</span><span class="punctuation">,</span> <span class="punctuation">]</span></span><br><span class="line"></span><br><span class="line">low <span class="operator"><-</span> rgb<span class="punctuation">(</span><span class="number">255</span><span class="punctuation">,</span><span class="number">240</span><span class="punctuation">,</span><span class="number">235</span><span class="punctuation">,</span><span class="built_in">max</span> <span class="operator">=</span> <span class="number">255</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># ggplot作统计图</span></span><br><span class="line">ggplot<span class="punctuation">(</span>tidy<span class="punctuation">,</span> aes<span class="punctuation">(</span>x <span class="operator">=</span> X2<span class="punctuation">,</span> y <span class="operator">=</span> X1<span class="punctuation">,</span> fill <span class="operator">=</span> number<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_tile<span class="punctuation">(</span>color <span class="operator">=</span> <span class="string">'black'</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_text<span class="punctuation">(</span>aes<span class="punctuation">(</span>label <span class="operator">=</span> number<span class="punctuation">)</span><span class="punctuation">,</span>col<span class="operator">=</span><span class="string">'black'</span><span class="punctuation">,</span>cex <span class="operator">=</span> <span class="number">3</span><span class="punctuation">,</span> size <span class="operator">=</span> <span 
class="number">13</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> scale_fill_gradient<span class="punctuation">(</span>low <span class="operator">=</span> low<span class="punctuation">,</span> high <span class="operator">=</span> <span class="string">"red"</span><span class="punctuation">)</span> <span class="operator">+</span> </span><br><span class="line"> scale_x_discrete<span class="punctuation">(</span>position <span class="operator">=</span> <span class="string">"top"</span><span class="punctuation">)</span><span class="operator">+</span></span><br><span class="line"> theme_classic<span class="punctuation">(</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> theme<span class="punctuation">(</span></span><br><span class="line"> legend.title <span class="operator">=</span> element_blank<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> legend.position <span class="operator">=</span> <span class="string">"bottom"</span><span class="punctuation">,</span></span><br><span class="line"> axis.ticks <span class="operator">=</span> element_blank<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> axis.line <span class="operator">=</span> element_blank<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> axis.text.x <span class="operator">=</span> element_text<span class="punctuation">(</span>angle <span class="operator">=</span> <span class="number">90</span><span class="punctuation">,</span> hjust <span class="operator">=</span> <span class="number">0</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> axis.title <span class="operator">=</span> element_blank<span 
class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> axis.text <span class="operator">=</span> element_text<span class="punctuation">(</span>size <span class="operator">=</span> <span class="number">13</span><span class="punctuation">,</span> color <span class="operator">=</span> <span class="string">'black'</span><span class="punctuation">)</span></span><br><span class="line"><span class="punctuation">)</span></span><br><span class="line">ggsave<span class="punctuation">(</span><span class="string">"cis_acting_element-1.svg"</span><span class="punctuation">,</span>device <span class="operator">=</span> <span class="string">"svg"</span><span class="punctuation">,</span>width <span class="operator">=</span> <span class="number">18</span><span class="punctuation">,</span>height <span class="operator">=</span> <span class="number">13</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure><p>Once the main plot is done, <strong>simply add the x-axis and y-axis group labels in Adobe Illustrator (AI)</strong> (the cis-element names on the x-axis are grouped by the four functional types), as shown here:</p><p><img src="https://www.shelven.com/tuchuang/20240401/6.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240401/6.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>In the same way, a stacked bar chart can show the number of cis elements in each of the four functional types:</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span 
class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># 重设因子水平,做第二个水平方向的堆积柱状图</span></span><br><span class="line">uniq_level_1 <span class="operator"><-</span> df <span class="operator">%>%</span> distinct<span class="punctuation">(</span>X3<span class="punctuation">)</span></span><br><span class="line">df_1 <span class="operator"><-</span> df</span><br><span class="line">df_1<span class="operator">$</span>X3 <span class="operator"><-</span> factor<span class="punctuation">(</span>df_1<span class="operator">$</span>X3<span class="punctuation">,</span> levels <span class="operator">=</span> rev<span class="punctuation">(</span>uniq_level_1<span class="operator">$</span>X3<span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line">tidy_1 <span class="operator"><-</span> df_1 <span class="operator">%>%</span></span><br><span class="line"> group_by<span class="punctuation">(</span>X1<span class="punctuation">,</span>X3<span class="punctuation">)</span> <span class="operator">%>%</span></span><br><span class="line"> summarise<span class="punctuation">(</span>number <span class="operator">=</span> n<span class="punctuation">(</span><span 
class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">%>%</span></span><br><span class="line"> arrange<span class="punctuation">(</span>desc<span class="punctuation">(</span>number<span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 从文件中读取排序顺序(按照分组)</span></span><br><span class="line">order <span class="operator"><-</span> readLines<span class="punctuation">(</span><span class="string">"order.txt"</span><span class="punctuation">)</span></span><br><span class="line">tidy_1<span class="operator">$</span>X1 <span class="operator"><-</span> factor<span class="punctuation">(</span>tidy_1<span class="operator">$</span>X1<span class="punctuation">,</span> levels <span class="operator">=</span> order<span class="punctuation">)</span></span><br><span class="line">tidy_1 <span class="operator"><-</span> tidy_1<span class="punctuation">[</span>order<span class="punctuation">(</span>tidy_1<span class="operator">$</span>X1<span class="punctuation">)</span><span class="punctuation">,</span> <span class="punctuation">]</span></span><br><span class="line"><span class="comment"># 堆积图的颜色分类</span></span><br><span class="line">group4 <span class="operator"><-</span> rgb<span class="punctuation">(</span><span class="number">164</span><span class="punctuation">,</span><span class="number">255</span><span class="punctuation">,</span><span class="number">255</span><span class="punctuation">,</span><span class="built_in">max</span> <span class="operator">=</span> <span class="number">255</span><span class="punctuation">)</span></span><br><span class="line">group3 <span class="operator"><-</span> rgb<span class="punctuation">(</span><span class="number">255</span><span class="punctuation">,</span><span class="number">240</span><span class="punctuation">,</span><span class="number">147</span><span class="punctuation">,</span><span class="built_in">max</span> 
<span class="operator">=</span> <span class="number">255</span><span class="punctuation">)</span></span><br><span class="line">group2 <span class="operator"><-</span> rgb<span class="punctuation">(</span><span class="number">211</span><span class="punctuation">,</span><span class="number">190</span><span class="punctuation">,</span><span class="number">231</span><span class="punctuation">,</span><span class="built_in">max</span> <span class="operator">=</span> <span class="number">255</span><span class="punctuation">)</span></span><br><span class="line">group1 <span class="operator"><-</span> rgb<span class="punctuation">(</span><span class="number">185</span><span class="punctuation">,</span><span class="number">255</span><span class="punctuation">,</span><span class="number">206</span><span class="punctuation">,</span><span class="built_in">max</span> <span class="operator">=</span> <span class="number">255</span><span class="punctuation">)</span></span><br><span class="line">color <span class="operator"><-</span> <span class="built_in">c</span><span class="punctuation">(</span>group4<span class="punctuation">,</span>group3<span class="punctuation">,</span>group2<span class="punctuation">,</span>group1<span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line">ggplot<span class="punctuation">(</span>tidy_1<span class="punctuation">,</span> aes<span class="punctuation">(</span>x <span class="operator">=</span> number<span class="punctuation">,</span> y <span class="operator">=</span> X1<span class="punctuation">,</span> fill <span class="operator">=</span> X3<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_col<span class="punctuation">(</span>position <span class="operator">=</span> <span class="string">"stack"</span><span class="punctuation">,</span> width <span class="operator">=</span> <span class="number">0.8</span><span 
class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_text<span class="punctuation">(</span>aes<span class="punctuation">(</span>label <span class="operator">=</span> number<span class="punctuation">)</span><span class="punctuation">,</span> color <span class="operator">=</span> <span class="string">'black'</span><span class="punctuation">,</span> cex <span class="operator">=</span> <span class="number">3</span><span class="punctuation">,</span> position <span class="operator">=</span> position_stack<span class="punctuation">(</span>vjust <span class="operator">=</span> <span class="number">0.5</span><span class="punctuation">)</span><span class="punctuation">,</span> size <span class="operator">=</span> <span class="number">13</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> theme_classic<span class="punctuation">(</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> scale_fill_manual<span class="punctuation">(</span>values <span class="operator">=</span> color<span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> scale_x_continuous<span class="punctuation">(</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> scale_x_continuous<span class="punctuation">(</span>guide <span class="operator">=</span> guide_axis<span class="punctuation">(</span>position <span class="operator">=</span> <span class="string">"top"</span><span class="punctuation">)</span><span class="punctuation">,</span> expand <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="number">0</span><span class="punctuation">,</span><span class="number">0</span><span class="punctuation">)</span><span class="punctuation">,</span> breaks <span class="operator">=</span> seq<span class="punctuation">(</span><span 
class="number">0</span><span class="punctuation">,</span> <span class="number">35</span><span class="punctuation">,</span> by <span class="operator">=</span> <span class="number">5</span><span class="punctuation">)</span><span class="punctuation">)</span><span class="operator">+</span></span><br><span class="line"> guides<span class="punctuation">(</span>fill <span class="operator">=</span> guide_legend<span class="punctuation">(</span>reverse <span class="operator">=</span> <span class="literal">TRUE</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> theme<span class="punctuation">(</span></span><br><span class="line"> legend.title <span class="operator">=</span> element_blank<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> legend.text <span class="operator">=</span> element_text<span class="punctuation">(</span>size <span class="operator">=</span> <span class="number">13</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> axis.text <span class="operator">=</span> element_text<span class="punctuation">(</span>size <span class="operator">=</span> <span class="number">13</span><span class="punctuation">,</span> color <span class="operator">=</span> <span class="string">'black'</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> axis.title <span class="operator">=</span> element_blank<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> <span class="comment">#axis.text.y = element_blank(),</span></span><br><span class="line"> axis.line.y <span class="operator">=</span> element_blank<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> 
axis.ticks.y <span class="operator">=</span> element_blank<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> axis.ticks.length.x <span class="operator">=</span> unit<span class="punctuation">(</span><span class="number">0.2</span><span class="punctuation">,</span><span class="string">'cm'</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> axis.ticks <span class="operator">=</span> element_line<span class="punctuation">(</span>size <span class="operator">=</span> <span class="number">0.8</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> axis.line <span class="operator">=</span> element_line<span class="punctuation">(</span>size <span class="operator">=</span> <span class="number">0.8</span><span class="punctuation">)</span></span><br><span class="line"> <span class="punctuation">)</span></span><br><span class="line">ggsave<span class="punctuation">(</span><span class="string">"cis_acting_element-2.svg"</span><span class="punctuation">,</span>device <span class="operator">=</span> <span class="string">"svg"</span><span class="punctuation">,</span>width <span class="operator">=</span> <span class="number">18</span><span class="punctuation">,</span>height <span class="operator">=</span> <span class="number">13</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure><p>出图如下:</p><p><img src="https://www.shelven.com/tuchuang/20240401/8.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240401/8.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>做这种水平的堆积图只是为了方便拼接上面的统计图(因为Y轴基因顺序完全一致),也可以做成正常的堆积柱状图展示,这里就不作多解释了。</p></div>]]></content>
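    <summary type="html"><p>These are my notes from a gene family analysis I did a while ago. This kind of analysis needs very little code, since existing software and websites cover most of it and leave little to build yourself, and a by-the-book workflow honestly doesn't excite me... So I only record the part where I had to organize the data and write the plotting code myself: predicting cis-acting elements for a gene family.</p></summary>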
<category term="学习笔记" scheme="http://www.shelven.com/categories/%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/"/>
<category term="基因家族分析" scheme="http://www.shelven.com/categories/%E5%9F%BA%E5%9B%A0%E5%AE%B6%E6%97%8F%E5%88%86%E6%9E%90/"/>
<category term="R语言" scheme="http://www.shelven.com/tags/R%E8%AF%AD%E8%A8%80/"/>
<category term="PlantCARE" scheme="http://www.shelven.com/tags/PlantCARE/"/>
</entry>
<entry>
    <title>Expression Trend Analysis of Differentially Expressed Genes in Transcriptome Data</title>
<link href="http://www.shelven.com/2024/03/28/a.html"/>
<id>http://www.shelven.com/2024/03/28/a.html</id>
<published>2024-03-28T14:47:07.000Z</published>
<updated>2024-03-28T14:50:36.000Z</updated>
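    <content type="html"><![CDATA[<p>After three months of being ground down by a journal paper plus my thesis, I finally have a bit of free time, so it's back to updating my study notes~</p><p>Today's notes cover how I ran a transcriptome trend (time-series) analysis. Most of the time a transcriptome dataset has more than one treatment group and one control: a gradient experiment sets several treatment concentrations, or samples at several time points under a single concentration, to see how gene expression in the sampled tissue changes with concentration, time and so on. That is a <strong>trend analysis</strong> of gene expression.</p><span id="more"></span><p>The core of any RNA-seq analysis is obtaining the expression level of every gene in every sample, and trend analysis is no different. A treatment-versus-control comparison often identifies <strong>thousands of differentially expressed genes (DEGs)</strong>; drawn as a volcano plot they become a dense cloud of points from which it is hard to extract anything useful. Trend analysis effectively <strong>narrows the scope of the study</strong>: it clusters these thousands of DEGs by expression trend (for example, consistently up-regulated or consistently down-regulated), so that DEGs sharing a trend end up in one cluster. Clusters of interest can then go through enrichment analysis, functional annotation and so on.</p><p>Sequencing companies nowadays compare your samples pairwise, and <strong>for every pair of groups they hand you an Excel sheet of differential expression statistics</strong>. Much of that is never used = =, and I suspect it is mostly there to make the workload look bigger... But I digress. Whether a company sequenced your transcriptome or you did it yourself, one thing must be clear: what we analyze is the <strong>DEGs</strong> (or a subset of genes you care about), not the expression values of every gene dumped wholesale into a trend analysis.</p><p>For example, I have drought-stress RNA-seq data with three groups sampled at 0 h, 4 h and 12 h of stress, and I want to know what expression trends the DEGs in leaf tissue follow across these three time points. A company would most likely compare every pair of groups, but I only need <strong>4 h vs 0 h and 12 h vs 0 h</strong>. I screen DEGs in these two comparisons, for example with the thresholds <code>|log2(foldchange)| > 1, FDR < 0.05</code>, and then take the <strong>union of the DEGs from the two comparisons</strong> for the trend analysis. If you are working from the raw sequencing output yourself, you can screen DEGs first with the <code>DESeq2</code> R package; the details are in this earlier post, so I won't repeat them here. <a href="https://www.shelven.com/2022/04/18/a.html">Transcriptome Data Analysis Notes (7): DESeq2 Differential Analysis - 我的小破站 (shelven.com)</a></p><p>There are plenty of online platforms that can run a trend analysis, <strong>but I recommend none of them!</strong></p><p>The reason is simple: these platforms make you queue, pay and so on, which raises the barrier to what is a very simple workflow. On top of that, they are all built on the free <code>STEM</code> software<del>, which makes charging for it look rather greedy.</del> You may as well use the official software directly.</p><h2 id="1-STEM"><a href="#1-STEM" class="headerlink" title="1. STEM"></a>1. 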
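STEM</h2><div class="story post-story"><p><strong>Short Time-series Expression Miner (STEM)</strong> was developed specifically for time-series analysis of gene expression. The authors recommend no more than 8 time points (that is, treatment groups), because with more than 8 groups the expression trends become too fragmented and complex. In that case you can run a <strong>WGCNA analysis</strong> instead, which I won't cover here.</p><p>To run STEM on Windows you need to install <strong>Java</strong> first. The official STEM site is:</p><p><a href="http://www.cs.cmu.edu/~jernst/stem/">STEM: Short Time-series Expression Miner (cmu.edu)</a></p><p>After downloading, double-click <code>stem.jar</code> to run it. This is the software's UI:</p><p><img src="https://www.shelven.com/tuchuang/20240328/1.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240328/1.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>There are many parameters, but most of them can stay at their defaults.</p><h3 id="1-1-Expression-data-info"><a href="#1-1-Expression-data-info" class="headerlink" title="1.1 Expression data info"></a>1.1 Expression data info</h3><p>Import the union of DEGs prepared earlier. For example, having screened DEGs with <code>|log2(foldchange)| > 1</code>, I organized them into the following <code>tsv</code> table:</p><p><img src="https://www.shelven.com/tuchuang/20240328/3.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240328/3.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p><strong>The two columns are the fold changes of the DEGs at 4 h and 12 h relative to the control</strong>. If you have replicates, you can add them with the <code>Repeat Data</code> button.</p><p>There are three normalization options for the imported data. If you are using expression values (such as TPM or FPKM), use the first option, log2 normalization, or the second, difference-based normalization. Both take the first time point as the reference: <strong>the first takes the log2 of each value's ratio to the first time point, the second takes the plain difference from it.</strong></p><p><strong>If, like me, you are using fold changes</strong> (trend analysis usually focuses on how the fold change varies anyway), choose the third option, <strong>no normalization</strong>, because the values are already changes relative to the control. <strong>The third option adds a virtual sample with a value of 0 that does not actually exist; think of it as my 0 h control here.</strong></p><h3 id="1-2-Gene-info"><a href="#1-2-Gene-info" class="headerlink" title="1.2 Gene info"></a>1.2 Gene info</h3><p>This panel is mainly for enrichment annotation of the genes within each module (cluster). You can choose one of the official annotation databases or supply your own gene annotation file. My species is a non-model plant, so I have to build the KEGG and GO annotation libraries by hand for enrichment analysis later; I won't cover that here either.</p><h3 id="1-3-Options"><a href="#1-3-Options" class="headerlink" title="1.3 Options"></a>1.3 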
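Options</h3><p>This panel adjusts the clustering parameters. You can choose the clustering method, <strong>either STEM's own clustering method or K-means</strong>.</p><p>The main parameter is <code>Maximum Number of Model profiles</code>, which sets the maximum number of clusters the DEGs can be divided into. <strong>Do not use the default of 50! I effectively have 3 groups, so 8 clusters are enough</strong>. Too many clusters fragment the trends and make the analysis harder.</p><p><code>Maximum Unit Change in Model Profiles between Time Points</code> sets the maximum number of units a model profile may change between adjacent time points. Changing it affects the p-values of the final grouping, but this is a minor concern and the default is fine. The user manual on the official site describes it in detail.</p><p><code>Advanced Options</code> lets you set gene filtering criteria, details of the annotation file and so on. If you chose K-means as the clustering method, you can tune its details there as well; I won't go into it here.</p><h3 id="1-4-Execute"><a href="#1-4-Execute" class="headerlink" title="1.4 Execute"></a>1.4 Execute</h3><p>Once the parameters above are set, click the yellow-highlighted <code>Execute</code> button in this panel.</p><p><img src="https://www.shelven.com/tuchuang/20240328/4.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240328/4.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Click any cluster to see its details, including the number of genes in the cluster and the expression pattern of each gene. You can also export the gene information for each cluster.</p><p>The yellow-highlighted <code>Order Profiles By</code> button at the bottom controls how the clusters are displayed. For example, you can order them by p-value (Significance), and you will see that the <strong>colored clusters are the ones with low p-values, that is, the statistically more reliable ones.</strong></p><p><img src="https://www.shelven.com/tuchuang/20240328/5.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240328/5.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p><strong>Note that a cluster with a p-value of 1 is not necessarily a meaningless expression trend.</strong> What matters is which module the DEGs you care about fall into, so analyze the genes of interest accordingly!</p></div><h2 id="2-ClusterGVis"><a href="#2-ClusterGVis" class="headerlink" title="2. ClusterGVis"></a>2. 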
ClusterGVis</h2><div class="story post-story"><p>The kind of trend plot shown above would have stood out <strong>10 years ago</strong>. STEM dates from 2006, and although plenty of transcriptome papers still use it, <strong>by today's standards the amount of information it can display is rather limited.</strong></p><p>Here is an R package, <code>ClusterGVis</code>, written by a PhD student at China Pharmaceutical University:</p><p><a href="https://github.com/junjunlab/ClusterGVis">junjunlab/ClusterGVis: One-step to Cluster and Visualize Gene Expression Matrix (github.com)</a></p><p>This package combines the trend analysis result with the gene expression values in a single step, and if you have run enrichment analysis on the genes of each cluster yourself, it can also place the enrichment results in the same figure. The author documents the package's usage in detail on GitHub; what follows is only a record of my own use. Many thanks to the author!!!</p><h3 id="2-1-准备含有标记基因的基因表达量矩阵"><a href="#2-1-准备含有标记基因的基因表达量矩阵" class="headerlink" title="2.1 准备含有标记基因的基因表达量矩阵"></a>2.1 Preparing a gene expression matrix with marker genes</h3><p>For example, my <code>TPM.marker.tsv</code> file has the following format:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">0h4h12h</span><br><span class="line">g1068.6220088512.79421660748.90308827</span><br><span class="line">g10710.96566020.07309914111.04575138</span><br><span class="line">g1087.3901515922.88389237916.17186854</span><br><span class="line">g1100.6562150570.0180024740.256361386</span><br><span class="line">g35850.3256077711.397635542.439213784</span><br><span class="line">g36060.6937637191.8254220526.159699021</span><br><span class="line">g36095.4375238682.3687140613.914576455</span><br><span class="line">g36111.680554276.0646174466.313053585</span><br><span class="line">g361837.2204663415.1275186621.1622645</span><br><span 
class="line">ApWRKY39.83848891547.2981500663.77491512</span><br></pre></td></tr></table></figure><p>Here I use the TPM values of the DEGs from each group. Genes I am interested in are renamed with an <code>Ap</code> prefix; these are the <strong>marker genes</strong> (for example, a gene family of interest among the DEGs). If no genes in particular interest you, you can skip the marking. This expression matrix is what the heatmap is drawn from, and it is also the input for the clustering.</p><h3 id="2-2-cluster聚类"><a href="#2-2-cluster聚类" class="headerlink" title="2.2 cluster聚类"></a>2.2 Clustering</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br></pre></td><td class="code"><pre><span class="line">setwd<span class="punctuation">(</span><span class="string">"D:\\zhuomian\\WRKY\\06.transcriptome\\clustergvis"</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># The package and its dependencies are installed from GitHub, so set up a proxy first</span></span><br><span class="line">install.packages<span class="punctuation">(</span><span class="string">"r.proxy"</span><span class="punctuation">)</span></span><br><span 
class="line">library<span class="punctuation">(</span><span class="string">"r.proxy"</span><span class="punctuation">)</span></span><br><span class="line">r.proxy<span class="operator">::</span>proxy<span class="punctuation">(</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line">devtools<span class="operator">::</span>install_github<span class="punctuation">(</span><span class="string">"igraph/rigraph"</span><span class="punctuation">)</span></span><br><span class="line">devtools<span class="operator">::</span>install_github<span class="punctuation">(</span><span class="string">"cysouw/qlcMatrix"</span><span class="punctuation">)</span></span><br><span class="line">devtools<span class="operator">::</span>install_github<span class="punctuation">(</span><span class="string">"junjunlab/ClusterGVis"</span><span class="punctuation">)</span></span><br><span class="line">devtools<span class="operator">::</span>install_github<span class="punctuation">(</span><span class="string">"jokergoo/ComplexHeatmap"</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line">library<span class="punctuation">(</span><span class="string">"ClusterGVis"</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 载入数据</span></span><br><span class="line">a <span class="operator"><-</span> read.table<span class="punctuation">(</span>file <span class="operator">=</span> <span class="string">"TPM.marker.tsv"</span><span class="punctuation">,</span> header<span class="operator">=</span><span class="built_in">T</span><span class="punctuation">,</span> row.names <span class="operator">=</span> <span class="number">1</span><span class="punctuation">,</span> check.names<span class="operator">=</span><span class="built_in">F</span><span class="punctuation">)</span></span><br><span class="line">head<span class="punctuation">(</span>a<span 
class="punctuation">,</span><span class="number">3</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 计算均方和,确定聚类个数</span></span><br><span class="line">getClusters<span class="punctuation">(</span><span class="built_in">exp</span> <span class="operator">=</span> a<span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># mfuzz聚类</span></span><br><span class="line">cm <span class="operator"><-</span> clusterData<span class="punctuation">(</span><span class="built_in">exp</span> <span class="operator">=</span> a<span class="punctuation">,</span></span><br><span class="line"> <span class="comment">#cluster.method = "kmeans",</span></span><br><span class="line"> cluster.method <span class="operator">=</span> <span class="string">"mfuzz"</span><span class="punctuation">,</span></span><br><span class="line"> cluster.num <span class="operator">=</span> <span class="number">8</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 绘制折线图(kmeans聚类的话没有颜色映射,因为没有membership信息)</span></span><br><span class="line">visCluster<span class="punctuation">(</span>object <span class="operator">=</span> cm<span class="punctuation">,</span></span><br><span class="line"> plot.type <span class="operator">=</span> <span class="string">"line"</span><span class="punctuation">,</span></span><br><span class="line"> ms.col <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="string">"green"</span><span class="punctuation">,</span><span class="string">"orange"</span><span class="punctuation">,</span><span class="string">"red"</span><span class="punctuation">)</span><span class="punctuation">,</span> <span class="comment"># 线条颜色</span></span><br><span class="line"> add.mline <span class="operator">=</span> <span class="literal">TRUE</span> <span 
class="comment"># median line</span></span><br><span class="line"> <span class="punctuation">)</span></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Clustering can use either <code>K-means</code> or the <code>mfuzz</code> package. The <code>getClusters</code> function produces the plot below, which helps you decide on the number of clusters; I chose 8:</p><p><img src="https://www.shelven.com/tuchuang/20240328/6.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240328/6.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>The clustered line plot looks like this:</p><p><img src="https://www.shelven.com/tuchuang/20240328/7.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240328/7.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><h3 id="2-3-每个cluster进行富集分析"><a href="#2-3-每个cluster进行富集分析" class="headerlink" title="2.3 每个cluster进行富集分析"></a>2.3 Enrichment analysis for each cluster</h3><p>In the R code above, the clustering result was stored in the variable <code>cm</code>. You can open this variable to inspect the result:</p><p><img src="https://www.shelven.com/tuchuang/20240328/9.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240328/9.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>In short, the cluster assignments are in <code>cm[["wide.res"]][["cluster"]]</code>, and genes in the same cluster share the same number.</p><p>A clumsy but effective approach: <strong>pull out the gene names of each cluster and run GO and KEGG enrichment on every cluster's genes</strong>. For how to build the annotation for a non-model organism, see my previous post <a href="https://www.shelven.com/2023/12/23/a.html">GO and KEGG Enrichment for Non-model Organisms - 我的小破站 (shelven.com)</a></p><p>We don't need to show every enrichment result. Simply summarize the per-cluster enrichment results into one KEGG table and one GO table, like this:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span 
class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"># 文件cluster.all.GO.tsv(每组取前五富集结果)</span><br><span class="line">groupDescriptionpvalueratio</span><br><span class="line">GO:00045191endonuclease activity4.73E-077.7102804</span><br><span class="line">GO:00045181nuclease activity4.14E-067.7102804</span><br><span class="line">GO:00903051obsolete nucleic acid phosphodiester bond hydrolysis7.53E-067.9439252</span><br><span class="line">.</span><br><span class="line">.</span><br><span class="line">.</span><br><span class="line">GO:00303128external encapsulating structure0.0005777345.5105348</span><br><span class="line">GO:00716698plant-type cell wall organization or biogenesis0.0010348883.0794165</span><br><span class="line">GO:00312268obsolete intrinsic component of plasma membrane0.0013435653.8897893</span><br><span class="line"></span><br><span class="line"># 文件cluster.all.KEGG.tsv(每组取前五富集结果)</span><br><span class="line">groupDescriptionpvalueratio</span><br><span class="line">ko040161MAPK signaling pathway - plant0.1811313024.4585987</span><br><span class="line">ko009041Diterpenoid biosynthesis0.1811313021.910828</span><br><span class="line">ko009101Nitrogen metabolism0.1811313021.910828</span><br><span class="line">.</span><br><span class="line">.</span><br><span class="line">.</span><br><span class="line">ko000738Cutin, suberine and wax biosynthesis0.0026698451.8115942</span><br><span class="line">ko040758Plant hormone signal transduction0.1050323174.3478261</span><br><span class="line">ko009108Nitrogen metabolism0.1232640961.4492754</span><br></pre></td></tr></table></figure><h3 id="2-4-组合趋势聚类图、表达量热图以及注释结果"><a 
href="#2-4-组合趋势聚类图、表达量热图以及注释结果" class="headerlink" title="2.4 组合趋势聚类图、表达量热图以及注释结果"></a>2.4 组合趋势聚类图、表达量热图以及注释结果</h3><p>就是调参数的问题啦~</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span 
class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># 添加标记基因(就是判断你要展示的基因是哪些行,这里我是Ap开头的基因)</span></span><br><span class="line">rows_to_select <span class="operator"><-</span> grepl<span class="punctuation">(</span><span class="string">"Ap"</span><span class="punctuation">,</span> rownames<span class="punctuation">(</span>a<span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line">markGenes <span class="operator"><-</span> rownames<span class="punctuation">(</span>a<span class="punctuation">)</span><span class="punctuation">[</span>rows_to_select<span class="punctuation">]</span></span><br><span class="line"><span class="comment">#markGenes = rownames(a)[sample(1:nrow(a),30,replace = F)]</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 绘制热图</span></span><br><span class="line">visCluster<span class="punctuation">(</span>object <span class="operator">=</span> cm<span class="punctuation">,</span></span><br><span class="line"> plot.type <span class="operator">=</span> <span 
class="string">"heatmap"</span><span class="punctuation">,</span></span><br><span class="line"> column_names_rot <span class="operator">=</span> <span class="number">45</span><span class="punctuation">,</span></span><br><span class="line"> markGenes <span class="operator">=</span> markGenes<span class="punctuation">,</span> <span class="comment"># 标记基因</span></span><br><span class="line"> ht.col.list<span class="operator">=</span><span class="built_in">list</span><span class="punctuation">(</span>col_range <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="operator">-</span><span class="number">2</span><span class="punctuation">,</span> <span class="number">0</span><span class="punctuation">,</span> <span class="number">2</span><span class="punctuation">)</span><span class="punctuation">,</span>col_color <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="string">"#FC7C5A"</span><span class="punctuation">,</span> <span class="string">"white"</span><span class="punctuation">,</span> <span class="string">"#81C7C1"</span><span class="punctuation">)</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> ctAnno.col <span class="operator">=</span> ggsci<span class="operator">::</span>pal_npg<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">(</span><span class="number">8</span><span class="punctuation">)</span><span class="punctuation">,</span> <span class="comment">#修改注释条颜色</span></span><br><span class="line"><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 富集结果输入(整理的富集结果)</span></span><br><span class="line">enrich <span class="operator"><-</span> read.table<span class="punctuation">(</span><span class="string">"cluster.all.GO.tsv"</span><span class="punctuation">,</span> sep <span 
class="operator">=</span> <span class="string">"\t"</span><span class="punctuation">,</span>header<span class="operator">=</span><span class="built_in">T</span><span class="punctuation">)</span></span><br><span class="line">head<span class="punctuation">(</span>enrich<span class="punctuation">,</span><span class="number">3</span><span class="punctuation">)</span></span><br><span class="line">enrich.KEGG <span class="operator"><-</span> read.table<span class="punctuation">(</span><span class="string">"cluster.all.KEGG.tsv"</span><span class="punctuation">,</span> sep <span class="operator">=</span> <span class="string">"\t"</span><span class="punctuation">,</span>header<span class="operator">=</span><span class="built_in">T</span><span class="punctuation">)</span></span><br><span class="line">head<span class="punctuation">(</span>enrich.KEGG<span class="punctuation">,</span><span class="number">3</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line">palette <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="string">"Grays"</span><span class="punctuation">,</span><span class="string">"Greens2"</span><span class="punctuation">,</span><span class="string">"Blues2"</span><span class="punctuation">,</span><span class="string">"Blues3"</span><span class="punctuation">,</span><span class="string">"Purples2"</span><span class="punctuation">,</span><span class="string">"Purples3"</span><span class="punctuation">,</span><span class="string">"Reds2"</span><span class="punctuation">,</span><span class="string">"Reds3"</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line">lapply<span class="punctuation">(</span><span class="built_in">seq_along</span><span class="punctuation">(</span>unique<span class="punctuation">(</span>enrich<span class="operator">$</span>group<span class="punctuation">)</span><span 
class="punctuation">)</span><span class="punctuation">,</span> <span class="keyword">function</span><span class="punctuation">(</span>x<span class="punctuation">)</span><span class="punctuation">{</span></span><br><span class="line"> <span class="comment"># GO plot</span></span><br><span class="line"> tmp <span class="operator"><-</span> enrich <span class="operator">|></span> dplyr<span class="operator">::</span>filter<span class="punctuation">(</span>group <span class="operator">==</span> unique<span class="punctuation">(</span>enrich<span class="operator">$</span>group<span class="punctuation">)</span><span class="punctuation">[</span>x<span class="punctuation">]</span><span class="punctuation">)</span> <span class="operator">|></span></span><br><span class="line"> dplyr<span class="operator">::</span>arrange<span class="punctuation">(</span>desc<span class="punctuation">(</span>pvalue<span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"> </span><br><span class="line"> tmp<span class="operator">$</span>Description <span class="operator"><-</span> factor<span class="punctuation">(</span>tmp<span class="operator">$</span>Description<span class="punctuation">,</span>levels <span class="operator">=</span> tmp<span class="operator">$</span>Description<span class="punctuation">)</span></span><br><span class="line"> </span><br><span class="line"> <span class="comment"># plot</span></span><br><span class="line"> p <span class="operator"><-</span></span><br><span class="line"> ggplot<span class="punctuation">(</span>tmp<span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_col<span class="punctuation">(</span>aes<span class="punctuation">(</span>x <span class="operator">=</span> <span class="operator">-</span>log10<span class="punctuation">(</span>pvalue<span class="punctuation">)</span><span class="punctuation">,</span>y <span class="operator">=</span> Description<span 
class="punctuation">,</span>fill <span class="operator">=</span> <span class="operator">-</span>log10<span class="punctuation">(</span>pvalue<span class="punctuation">)</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> width <span class="operator">=</span> <span class="number">0.75</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> <span class="comment">#geom_line(aes(x = log10(ratio),y = as.numeric(Description)),color = "grey50") +</span></span><br><span class="line"> geom_point<span class="punctuation">(</span>aes<span class="punctuation">(</span>x <span class="operator">=</span> log10<span class="punctuation">(</span>ratio<span class="punctuation">)</span><span class="punctuation">,</span>y <span class="operator">=</span> Description<span class="punctuation">)</span><span class="punctuation">,</span>size <span class="operator">=</span> <span class="number">3</span><span class="punctuation">,</span>color <span class="operator">=</span> <span class="string">"coral1"</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> theme_bw<span class="punctuation">(</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> scale_y_discrete<span class="punctuation">(</span>position <span class="operator">=</span> <span class="string">"right"</span><span class="punctuation">,</span></span><br><span class="line"> labels <span class="operator">=</span> <span class="keyword">function</span><span class="punctuation">(</span>x<span class="punctuation">)</span> stringr<span class="operator">::</span>str_wrap<span class="punctuation">(</span>x<span class="punctuation">,</span> width <span class="operator">=</span> <span class="number">40</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> 
scale_x_continuous<span class="punctuation">(</span>sec.axis <span class="operator">=</span> sec_axis<span class="punctuation">(</span><span class="operator">~</span>.<span class="punctuation">,</span>name <span class="operator">=</span> <span class="string">"log10(ratio)"</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> colorspace<span class="operator">::</span>scale_fill_binned_sequential<span class="punctuation">(</span>palette <span class="operator">=</span> palette<span class="punctuation">[</span>x<span class="punctuation">]</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> ylab<span class="punctuation">(</span><span class="string">""</span><span class="punctuation">)</span></span><br><span class="line"> </span><br><span class="line"> <span class="comment"># KEGG plot</span></span><br><span class="line"> tmp.kg <span class="operator"><-</span> enrich.KEGG <span class="operator">|></span> dplyr<span class="operator">::</span>filter<span class="punctuation">(</span>group <span class="operator">==</span> unique<span class="punctuation">(</span>enrich.KEGG<span class="operator">$</span>group<span class="punctuation">)</span><span class="punctuation">[</span>x<span class="punctuation">]</span><span class="punctuation">)</span> <span class="operator">|></span></span><br><span class="line"> dplyr<span class="operator">::</span>arrange<span class="punctuation">(</span>desc<span class="punctuation">(</span>pvalue<span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"> </span><br><span class="line"> tmp.kg<span class="operator">$</span>Description <span class="operator"><-</span> factor<span class="punctuation">(</span>tmp.kg<span class="operator">$</span>Description<span class="punctuation">,</span>levels <span class="operator">=</span> tmp.kg<span 
class="operator">$</span>Description<span class="punctuation">)</span></span><br><span class="line"> </span><br><span class="line"> <span class="comment"># plot</span></span><br><span class="line"> pk <span class="operator"><-</span></span><br><span class="line"> ggplot<span class="punctuation">(</span>tmp.kg<span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_segment<span class="punctuation">(</span>aes<span class="punctuation">(</span>x <span class="operator">=</span> <span class="number">0</span><span class="punctuation">,</span>xend <span class="operator">=</span> <span class="operator">-</span>log10<span class="punctuation">(</span>pvalue<span class="punctuation">)</span><span class="punctuation">,</span>y <span class="operator">=</span> Description<span class="punctuation">,</span>yend <span class="operator">=</span> Description<span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> lty <span class="operator">=</span> <span class="string">"dashed"</span><span class="punctuation">,</span>linewidth <span class="operator">=</span> <span class="number">0.75</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_point<span class="punctuation">(</span>aes<span class="punctuation">(</span>x <span class="operator">=</span> <span class="operator">-</span>log10<span class="punctuation">(</span>pvalue<span class="punctuation">)</span><span class="punctuation">,</span>y <span class="operator">=</span> Description<span class="punctuation">,</span>color <span class="operator">=</span> <span class="operator">-</span>log10<span class="punctuation">(</span>pvalue<span class="punctuation">)</span><span class="punctuation">)</span><span class="punctuation">,</span>size <span class="operator">=</span> <span class="number">5</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> 
theme_bw<span class="punctuation">(</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> scale_y_discrete<span class="punctuation">(</span>position <span class="operator">=</span> <span class="string">"right"</span><span class="punctuation">,</span></span><br><span class="line"> labels <span class="operator">=</span> <span class="keyword">function</span><span class="punctuation">(</span>x<span class="punctuation">)</span> stringr<span class="operator">::</span>str_wrap<span class="punctuation">(</span>x<span class="punctuation">,</span> width <span class="operator">=</span> <span class="number">40</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> colorspace<span class="operator">::</span>scale_color_binned_sequential<span class="punctuation">(</span>palette <span class="operator">=</span> palette<span class="punctuation">[</span>x<span class="punctuation">]</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> ylab<span class="punctuation">(</span><span class="string">""</span><span class="punctuation">)</span> <span class="operator">+</span> xlab<span class="punctuation">(</span><span class="string">"-log10(pvalue)"</span><span class="punctuation">)</span></span><br><span class="line"> </span><br><span class="line"> <span class="comment"># combine</span></span><br><span class="line"> cb <span class="operator"><-</span> cowplot<span class="operator">::</span>plot_grid<span class="punctuation">(</span>plotlist <span class="operator">=</span> <span class="built_in">list</span><span class="punctuation">(</span>p<span class="punctuation">,</span>pk<span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"> </span><br><span class="line"> <span class="built_in">return</span><span class="punctuation">(</span>cb<span 
class="punctuation">)</span></span><br><span class="line"><span class="punctuation">}</span><span class="punctuation">)</span> <span class="operator">-></span> gglist</span><br><span class="line"><span class="built_in">names</span><span class="punctuation">(</span>gglist<span class="punctuation">)</span> <span class="operator"><-</span> paste<span class="punctuation">(</span><span class="string">"C"</span><span class="punctuation">,</span><span class="number">1</span><span class="operator">:</span><span class="number">8</span><span class="punctuation">,</span>sep <span class="operator">=</span> <span class="string">""</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Combined figure</span></span><br><span class="line">pdf<span class="punctuation">(</span><span class="string">'final_result.pdf'</span><span class="punctuation">,</span>height <span class="operator">=</span> <span class="number">20</span><span class="punctuation">,</span>width <span class="operator">=</span> <span class="number">20</span><span class="punctuation">,</span>onefile <span class="operator">=</span> <span class="literal">FALSE</span><span class="punctuation">)</span></span><br><span class="line">visCluster<span class="punctuation">(</span>object <span class="operator">=</span> cm<span class="punctuation">,</span></span><br><span class="line"> plot.type <span class="operator">=</span> <span class="string">"both"</span><span class="punctuation">,</span></span><br><span class="line"> markGenes <span class="operator">=</span> markGenes<span class="punctuation">,</span></span><br><span class="line"> line.side <span class="operator">=</span> <span class="string">"left"</span><span class="punctuation">,</span></span><br><span class="line"> markGenes.side <span class="operator">=</span> <span class="string">"right"</span><span class="punctuation">,</span></span><br><span class="line"> column_names_rot <span class="operator">=</span> <span 
class="number">45</span><span class="punctuation">,</span></span><br><span class="line"> cluster.order <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="number">1</span><span class="operator">:</span><span class="number">8</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> ctAnno.col <span class="operator">=</span> ggsci<span class="operator">::</span>pal_nejm<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">(</span><span class="number">8</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> ggplot.panel.arg <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="number">5</span><span class="punctuation">,</span><span class="number">0.5</span><span class="punctuation">,</span><span class="number">32</span><span class="punctuation">,</span><span class="string">"grey90"</span><span class="punctuation">,</span><span class="literal">NA</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> ht.col.list<span class="operator">=</span><span class="built_in">list</span><span class="punctuation">(</span>col_range <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="operator">-</span><span class="number">2</span><span class="punctuation">,</span> <span class="number">0</span><span class="punctuation">,</span> <span class="number">2</span><span class="punctuation">)</span><span class="punctuation">,</span>col_color <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="string">"#81C7C1"</span><span class="punctuation">,</span> <span class="string">"white"</span><span class="punctuation">,</span> <span class="string">"#FC7C5A"</span><span 
class="punctuation">)</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> gglist <span class="operator">=</span> gglist<span class="punctuation">,</span></span><br><span class="line"> show_row_dend <span class="operator">=</span> <span class="built_in">F</span><span class="punctuation">)</span></span><br><span class="line">dev.off<span class="punctuation">(</span><span class="punctuation">)</span></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The figure comes out directly as follows:</p><p><img src="https://www.shelven.com/tuchuang/20240328/10.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20240328/10.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>A figure like this needs almost no further editing. At most, adjust positions and sizes in Adobe Illustrator (AI), or change the three colors of the expression heatmap in R:</p><p><code>ht.col.list=list(col_range = c(-2, 0, 2),col_color = c("#81C7C1", "white", "#FC7C5A"))</code></p><p>Adjust these values to fit your own data. Output of this quality can go straight into a paper, and it presents plenty of information~</p></div>]]></content>
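<summary type="html"><p>After three months of being ground down by a journal paper and my thesis, I finally have a little free time, so I am picking my study notes back up~</p>
<p>Today's note records how I ran a trend (time-series) analysis on transcriptome data. Most of the time, a transcriptome dataset contains more than one treatment group and one control group. A gradient experiment, for example, may use several treatment concentrations, or several sampling times under the same concentration, to observe how genes in the sampled tissue change with concentration, time, and so on. In other words, it is a <strong>trend analysis</strong> of gene expression.</p></summary>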
<category term="学习笔记" scheme="http://www.shelven.com/categories/%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/"/>
<category term="转录组数据分析" scheme="http://www.shelven.com/categories/%E8%BD%AC%E5%BD%95%E7%BB%84%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90/"/>
<category term="R语言" scheme="http://www.shelven.com/tags/R%E8%AF%AD%E8%A8%80/"/>
<category term="STEM" scheme="http://www.shelven.com/tags/STEM/"/>
<category term="ClusterGVis" scheme="http://www.shelven.com/tags/ClusterGVis/"/>
</entry>
<entry>
<title>GO and KEGG enrichment for non-model organisms</title>
<link href="http://www.shelven.com/2023/12/23/a.html"/>
<id>http://www.shelven.com/2023/12/23/a.html</id>
<published>2023-12-22T16:08:41.000Z</published>
<updated>2023-12-23T10:59:32.000Z</updated>
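<content type="html"><![CDATA[<p>I have recently been running GO and KEGG enrichment analyses for a non-model organism. I followed several online posts and Zhihu columns, but their code always had small problems, so I worked through the fixes myself until everything finally ran = =. This post records what worked.</p><span id="more"></span><p>The code in this note is mainly adapted from:</p><p><a href="https://www.jianshu.com/p/6671f3189309">比较基因组学分析3:特异节点基因家族富集分析(非模式物种GO/KEEG富集分析) - 简书 (jianshu.com)</a></p><p><a href="https://zhuanlan.zhihu.com/p/389005258?utm_id=0">小众或非模式生物的自建库GO/KEGG富集分析 - 知乎 (zhihu.com)</a></p><h2 id="1-eggNOG注释"><a href="#1-eggNOG注释" class="headerlink" title="1. eggNOG注释"></a>1. eggNOG annotation</h2><div class="story post-story"><p>Because the species here is a non-model organism, we have to annotate all background genes ourselves. The only tool I recommend for this is <a href="http://eggnog-mapper.embl.de/">eggNOG</a>: it is fast and runs either locally or online, although the online version is limited to 100,000 sequences. For an introduction to the web interface, see my earlier note: <a href="https://www.shelven.com/2023/09/19/a.html">基因组注释(6)——在线版eggNOG-mapper注释功能基因 - 我的小破站 (shelven.com)</a></p><p>Assume the protein sequences have already been annotated. Of the many result files, we only need <code>out.emapper.annotations</code>. Open it in <code>notepad++</code>, delete the first three lines that start with "#", and <strong>also remove the "#" in front of query in the header row</strong>. After these edits the file looks like this:</p><p><img src="https://www.shelven.com/tuchuang/20231223/1.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231223/1.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>All later steps read the mapping between gene names and GO/KEGG IDs from this file.</p><p>Assume we already have the genes of interest from a gene family contraction/expansion analysis, transcriptome data, or a similar source. Create a new file listing the genes to analyze, one gene name per line, and name it <code>gene.txt</code>, as shown below:</p><p><img src="https://www.shelven.com/tuchuang/20231223/3.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231223/3.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p></div><h2 id="2-GO富集"><a href="#2-GO富集" class="headerlink" title="2. GO富集"></a>2. 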
GO enrichment</h2><div class="story post-story"><p>First, build the GO annotation library:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"># Download the latest GO ontology file</span><br><span class="line">wget http://purl.obolibrary.org/obo/go/go-basic.obo</span><br><span class="line"># Extract the ID, name and namespace of each term so they can be paired up</span><br><span class="line">grep "^id:" go-basic.obo |awk '{print $2}' > GO.id</span><br><span class="line">grep "^name:" go-basic.obo | awk -F ': ' '{print $2}' > GO.name</span><br><span class="line">grep "^namespace:" go-basic.obo |awk '{print $2}' > GO.class</span><br><span class="line">paste GO.id GO.name GO.class -d "\t" > GO.library</span><br></pre></td></tr></table></figure><p>The resulting GO annotation library looks like this:</p><p><img src="https://www.shelven.com/tuchuang/20231223/2.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231223/2.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>The first column is the GO ID, the second is the description (<strong>it must be the full description, not just its first word, otherwise duplicated descriptions will make the later enrichment fail!</strong>), and the third is the GO category.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span 
class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br></pre></td><td class="code"><pre><span class="line">library<span class="punctuation">(</span>clusterProfiler<span class="punctuation">)</span></span><br><span class="line">library<span class="punctuation">(</span>dplyr<span class="punctuation">)</span></span><br><span class="line">library<span class="punctuation">(</span>stringr<span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line">setwd<span class="punctuation">(</span><span class="string">"D:\\zhuomian\\基因家族进化\\GO"</span><span class="punctuation">)</span><span class="comment"># set this to your own working directory</span></span><br><span class="line"></span><br><span class="line"><span class="comment">## Build the GO annotation</span></span><br><span class="line">options<span class="punctuation">(</span>stringsAsFactors <span class="operator">=</span> <span class="literal">FALSE</span><span class="punctuation">)</span></span><br><span class="line">egg <span class="operator"><-</span> read.delim<span class="punctuation">(</span><span class="string">"out.emapper.annotations"</span><span class="punctuation">,</span>header <span class="operator">=</span> <span class="literal">TRUE</span><span class="punctuation">,</span>sep<span class="operator">=</span><span class="string">"\t"</span><span class="punctuation">)</span></span><br><span class="line">egg<span class="punctuation">[</span>egg<span class="operator">==</span><span class="string">""</span><span class="punctuation">]</span><span class="operator"><-</span><span class="literal">NA</span></span><br><span class="line">gterms <span class="operator"><-</span> egg <span 
class="operator">%>%</span></span><br><span class="line"> dplyr<span class="operator">::</span>select<span class="punctuation">(</span>query<span class="punctuation">,</span> GOs<span class="punctuation">)</span> <span class="operator">%>%</span> na.omit<span class="punctuation">(</span><span class="punctuation">)</span></span><br><span class="line">gene_ids <span class="operator"><-</span> egg<span class="operator">$</span>query</span><br><span class="line"><span class="comment"># keep only rows with real GO terms; a missing annotation may appear as NA, "" or "-"</span></span><br><span class="line">eggnog_lines_with_go <span class="operator"><-</span> <span class="operator">!</span><span class="punctuation">(</span>egg<span class="operator">$</span>GOs <span class="operator">%in%</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="literal">NA</span><span class="punctuation">,</span> <span class="string">""</span><span class="punctuation">,</span> <span class="string">"-"</span><span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line">eggnog_annoations_go <span class="operator"><-</span> str_split<span class="punctuation">(</span>egg<span class="punctuation">[</span>eggnog_lines_with_go<span class="punctuation">,</span><span class="punctuation">]</span><span class="operator">$</span>GOs<span class="punctuation">,</span> <span class="string">","</span><span class="punctuation">)</span></span><br><span class="line">gene2go <span class="operator"><-</span> data.frame<span class="punctuation">(</span>gene <span class="operator">=</span> <span class="built_in">rep</span><span class="punctuation">(</span>gene_ids<span class="punctuation">[</span>eggnog_lines_with_go<span class="punctuation">]</span><span class="punctuation">,</span><span class="comment"># a gene may have several GO terms; expand them into 1:1 gene-term pairs</span></span><br><span class="line"> times <span class="operator">=</span> sapply<span class="punctuation">(</span>eggnog_annoations_go<span class="punctuation">,</span> <span class="built_in">length</span><span class="punctuation">)</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> term <span class="operator">=</span> unlist<span class="punctuation">(</span>eggnog_annoations_go<span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line">go2name <span class="operator"><-</span> read.delim<span 
class="punctuation">(</span><span class="string">'GO.library'</span><span class="punctuation">,</span> header <span class="operator">=</span> <span class="literal">FALSE</span><span class="punctuation">,</span> stringsAsFactors <span class="operator">=</span> <span class="literal">FALSE</span><span class="punctuation">)</span></span><br><span class="line"><span class="built_in">names</span><span class="punctuation">(</span>go2name<span class="punctuation">)</span> <span class="operator"><-</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="string">'ID'</span><span class="punctuation">,</span> <span class="string">'Description'</span><span class="punctuation">,</span> <span class="string">'Ontology'</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># GO enrichment</span></span><br><span class="line">gene_select <span class="operator"><-</span> read.delim<span class="punctuation">(</span>file <span class="operator">=</span> <span class="string">'gene.txt'</span><span class="punctuation">,</span> stringsAsFactors <span class="operator">=</span> <span class="literal">FALSE</span><span class="punctuation">,</span>header <span class="operator">=</span> <span class="literal">FALSE</span><span class="punctuation">)</span><span class="operator">$</span>V1</span><br><span class="line">go_rich <span class="operator"><-</span> enricher<span class="punctuation">(</span>gene <span class="operator">=</span> gene_select<span class="punctuation">,</span></span><br><span class="line"> TERM2GENE <span class="operator">=</span> gene2go<span class="punctuation">[</span><span class="built_in">c</span><span class="punctuation">(</span><span class="string">'term'</span><span class="punctuation">,</span><span class="string">'gene'</span><span class="punctuation">)</span><span class="punctuation">]</span><span class="punctuation">,</span> <span class="comment"># column order matters for both TERM2GENE and TERM2NAME</span></span><br><span 
class="line"> TERM2NAME <span class="operator">=</span> go2name<span class="punctuation">[</span><span class="built_in">c</span><span class="punctuation">(</span><span class="string">'ID'</span><span class="punctuation">,</span> <span class="string">'Description'</span><span class="punctuation">)</span><span class="punctuation">]</span><span class="punctuation">,</span> </span><br><span class="line"> pvalueCutoff <span class="operator">=</span> <span class="number">1</span><span class="punctuation">,</span> </span><br><span class="line"> pAdjustMethod <span class="operator">=</span> <span class="string">'BH'</span><span class="punctuation">,</span> </span><br><span class="line"> qvalueCutoff <span class="operator">=</span> <span class="number">1</span></span><br><span class="line"><span class="punctuation">)</span></span><br><span class="line">tmp <span class="operator"><-</span> merge<span class="punctuation">(</span>go_rich<span class="punctuation">,</span> go2name<span class="punctuation">[</span><span class="built_in">c</span><span class="punctuation">(</span><span class="string">'ID'</span><span class="punctuation">,</span> <span class="string">'Ontology'</span><span class="punctuation">)</span><span class="punctuation">]</span><span class="punctuation">,</span> by <span class="operator">=</span> <span class="string">'ID'</span><span class="punctuation">)</span></span><br><span class="line">tmp <span class="operator"><-</span> tmp<span class="punctuation">[</span><span class="built_in">c</span><span class="punctuation">(</span><span class="number">10</span><span class="punctuation">,</span> <span class="number">1</span><span class="operator">:</span><span class="number">9</span><span class="punctuation">)</span><span class="punctuation">]</span></span><br><span class="line">tmp <span class="operator"><-</span> tmp<span class="punctuation">[</span>order<span class="punctuation">(</span>tmp<span class="operator">$</span>pvalue<span 
class="punctuation">)</span><span class="punctuation">,</span> <span class="punctuation">]</span></span><br><span class="line">write.table<span class="punctuation">(</span>tmp<span class="punctuation">,</span> <span class="string">'GO_enrichment.xls'</span><span class="punctuation">,</span> sep <span class="operator">=</span> <span class="string">'\t'</span><span class="punctuation">,</span> row.names <span class="operator">=</span> <span class="literal">FALSE</span><span class="punctuation">,</span> <span class="built_in">quote</span> <span class="operator">=</span> <span class="literal">FALSE</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure><p>The generated <code>GO_enrichment.xls</code> file looks like this:</p><p><img src="https://www.shelven.com/tuchuang/20231223/4.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231223/4.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p><strong>Once we have this file, we usually sort by qvalue in ascending order to pick the most reliable enrichment results, then manually drop the implausible ones (for example, terms specific to animals or microbes when you are working on a plant).</strong></p><p>Generating the GO enrichment file is all this step needs <del>(going any further would be rude)</del>. The results can be visualized in R, but I honestly do not think it is worth it: tuning the parameters is painful, and everyone has different requirements for their figures. I personally suggest drawing the plot by hand on an online tool such as <a href="https://www.chiplot.online/">ChiPlot</a>.</p><p>The site's developer has uploaded a tutorial video on visualizing enrichment results to Bilibili: <a href="https://www.bilibili.com/video/BV1mP4y1w74C/?spm_id_from=333.999.0.0&vd_source=7779d154860677839e3a88514c21d826">【ChiPlot】绘制KEGG/GO富集图_哔哩哔哩_bilibili</a></p><h3 id="省流不看版"><a href="#省流不看版" class="headerlink" title="省流不看版"></a>TL;DR</h3><p>From the enrichment results, take the five GO terms with the smallest <strong>qvalue</strong> in each of the three categories (molecular function, biological process and cellular component), and compute the numeric value of <strong>GeneRatio</strong> in the fourth column, which clusterProfiler writes as a fraction string of the form k/n.</p><p>Create a new Excel sheet with five columns, as follows:</p><p><img src="https://www.shelven.com/tuchuang/20231223/5.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231223/5.jpg" 
srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>The type column is not required; it is only there to make the table easier to read. The id column can have any name, but the four columns <strong>Description, Qvalue, Count and GeneRatio</strong> must be named exactly like this; their order does not matter.</p><p>Then prepare a second Excel sheet for the annotation layer, which marks the category each GO term belongs to:</p><p><img src="https://www.shelven.com/tuchuang/20231223/6.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231223/6.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Upload both to the KEGG/GO enrichment plotting module of the site above, tune the parameters by hand (experiment a bit), and touch up the figure in Adobe Illustrator (AI). Either way, it looks better than what I get from R:</p><p><img src="https://www.shelven.com/tuchuang/20231223/10.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231223/10.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p></div><h2 id="3-KEGG富集"><a href="#3-KEGG富集" class="headerlink" title="3. KEGG富集"></a>3. KEGG enrichment</h2><div class="story post-story"><p>The approach is similar to the GO analysis. Compared with GO, KEGG offers more species to choose from, but here we still treat ours as a non-model organism and use the gene set annotated by eggNOG in step 1 as the background gene set.</p><p>Build the KEGG annotation library:</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span 
class="line">30</span><br><span class="line">31</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Download the KEGG hierarchy from https://www.genome.jp/kegg-bin/get_htext?ko00001 : click "Download json" to get the file ko00001.json</span></span><br><span class="line"></span><br><span class="line">library<span class="punctuation">(</span>clusterProfiler<span class="punctuation">)</span></span><br><span class="line">library<span class="punctuation">(</span>dplyr<span class="punctuation">)</span></span><br><span class="line">library<span class="punctuation">(</span>stringr<span class="punctuation">)</span></span><br><span class="line">library<span class="punctuation">(</span>jsonlite<span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line">setwd<span class="punctuation">(</span><span class="string">"D:\\zhuomian\\基因家族进化\\KEGG"</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line">pathway2name <span class="operator"><-</span> tibble<span class="punctuation">(</span>Pathway <span class="operator">=</span> character<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">,</span> Name <span class="operator">=</span> character<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line">ko2pathway <span class="operator"><-</span> tibble<span class="punctuation">(</span>Ko <span class="operator">=</span> character<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">,</span> Pathway <span class="operator">=</span> character<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line">kegg <span class="operator"><-</span> fromJSON<span class="punctuation">(</span><span class="string">"ko00001.json"</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span 
class="line"><span class="keyword">for</span> <span class="punctuation">(</span>a <span class="keyword">in</span> <span class="built_in">seq_along</span><span class="punctuation">(</span>kegg<span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="punctuation">{</span></span><br><span class="line"> A <span class="operator"><-</span> kegg<span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="string">"name"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span>a<span class="punctuation">]</span><span class="punctuation">]</span></span><br><span class="line"> <span class="keyword">for</span> <span class="punctuation">(</span>b <span class="keyword">in</span> <span class="built_in">seq_along</span><span class="punctuation">(</span>kegg<span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span>a<span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="punctuation">{</span></span><br><span class="line"> B <span class="operator"><-</span> kegg<span class="punctuation">[[</span><span 
class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span>a<span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="string">"name"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span>b<span class="punctuation">]</span><span class="punctuation">]</span> </span><br><span class="line"> <span class="keyword">for</span> <span class="punctuation">(</span><span class="built_in">c</span> <span class="keyword">in</span> <span class="built_in">seq_along</span><span class="punctuation">(</span>kegg<span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span>a<span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span>b<span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="punctuation">{</span></span><br><span class="line"> pathway_info <span class="operator"><-</span> kegg<span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span 
class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span>a<span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span>b<span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="string">"name"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="built_in">c</span><span class="punctuation">]</span><span class="punctuation">]</span></span><br><span class="line"> pathway_id <span class="operator"><-</span> str_match<span class="punctuation">(</span>pathway_info<span class="punctuation">,</span> <span class="string">"ko[0-9]{5}"</span><span class="punctuation">)</span><span class="punctuation">[</span><span class="number">1</span><span class="punctuation">]</span></span><br><span class="line"> pathway_name <span class="operator"><-</span> str_replace<span class="punctuation">(</span>pathway_info<span class="punctuation">,</span> <span class="string">" \\[PATH:ko[0-9]{5}\\]"</span><span class="punctuation">,</span> <span class="string">""</span><span class="punctuation">)</span> <span class="operator">%>%</span> str_replace<span class="punctuation">(</span><span class="string">"[0-9]{5} "</span><span class="punctuation">,</span> <span class="string">""</span><span class="punctuation">)</span></span><br><span class="line"> pathway2name <span class="operator"><-</span> rbind<span class="punctuation">(</span>pathway2name<span class="punctuation">,</span> tibble<span class="punctuation">(</span>Pathway <span class="operator">=</span> pathway_id<span class="punctuation">,</span> Name <span class="operator">=</span> pathway_name<span class="punctuation">)</span><span 
class="punctuation">)</span></span><br><span class="line"> kos_info <span class="operator"><-</span> kegg<span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span>a<span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span>b<span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="string">"children"</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="built_in">c</span><span class="punctuation">]</span><span class="punctuation">]</span><span class="punctuation">[[</span><span class="string">"name"</span><span class="punctuation">]</span><span class="punctuation">]</span></span><br><span class="line"> kos <span class="operator"><-</span> str_match<span class="punctuation">(</span>kos_info<span class="punctuation">,</span> <span class="string">"K[0-9]*"</span><span class="punctuation">)</span><span class="punctuation">[</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">]</span></span><br><span class="line"> ko2pathway <span class="operator"><-</span> rbind<span class="punctuation">(</span>ko2pathway<span class="punctuation">,</span> tibble<span class="punctuation">(</span>Ko <span class="operator">=</span> kos<span class="punctuation">,</span> Pathway <span class="operator">=</span> <span class="built_in">rep</span><span class="punctuation">(</span>pathway_id<span class="punctuation">,</span> <span class="built_in">length</span><span 
class="punctuation">(</span>kos<span class="punctuation">)</span><span class="punctuation">)</span><span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"><span class="punctuation">}</span></span><br><span class="line">colnames<span class="punctuation">(</span>ko2pathway<span class="punctuation">)</span> <span class="operator"><-</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="string">"KO"</span><span class="punctuation">,</span><span class="string">"Pathway"</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line">write.table<span class="punctuation">(</span>pathway2name<span class="punctuation">,</span><span class="string">"KEGG.library"</span><span class="punctuation">,</span>sep<span class="operator">=</span><span class="string">"\t"</span><span class="punctuation">,</span>row.names <span class="operator">=</span> <span class="built_in">F</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure><p>The resulting KEGG annotation library looks like this:</p><p><img src="https://www.shelven.com/tuchuang/20231223/11.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231223/11.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>The first column is the ko ID and the second column is the description of the corresponding pathway. As before, note that one gene may map to several ko IDs, so the mapping needs some processing:</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span 
class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">## Build the KEGG annotation</span></span><br><span class="line">options<span class="punctuation">(</span>stringsAsFactors <span class="operator">=</span> 
<span class="built_in">F</span><span class="punctuation">)</span></span><br><span class="line">egg <span class="operator"><-</span> read.delim<span class="punctuation">(</span><span class="string">"out.emapper.annotations"</span><span class="punctuation">,</span>header <span class="operator">=</span> <span class="built_in">T</span><span class="punctuation">,</span>sep<span class="operator">=</span><span class="string">"\t"</span><span class="punctuation">)</span></span><br><span class="line">egg<span class="punctuation">[</span>egg<span class="operator">==</span><span class="string">""</span><span class="punctuation">]</span><span class="operator"><-</span><span class="literal">NA</span></span><br><span class="line">gene2ko <span class="operator"><-</span> egg <span class="operator">%>%</span></span><br><span class="line"> dplyr<span class="operator">::</span>select<span class="punctuation">(</span>GID <span class="operator">=</span> query<span class="punctuation">,</span> KO <span class="operator">=</span> KEGG_ko<span class="punctuation">)</span> <span class="operator">%>%</span></span><br><span class="line"> na.omit<span class="punctuation">(</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line">pathway2name <span class="operator"><-</span> read.delim<span class="punctuation">(</span><span class="string">"KEGG.library"</span><span class="punctuation">)</span></span><br><span class="line">colnames<span class="punctuation">(</span>pathway2name<span class="punctuation">)</span><span class="operator"><-</span><span class="built_in">c</span><span class="punctuation">(</span><span class="string">"Pathway"</span><span class="punctuation">,</span><span class="string">"Name"</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line">ko2gene <span class="operator"><-</span> tibble<span class="punctuation">(</span>Ko<span class="operator">=</span>character<span 
class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">,</span>GID<span class="operator">=</span>character<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"><span class="keyword">for</span> <span class="punctuation">(</span>Query <span class="keyword">in</span> gene2ko<span class="operator">$</span>GID<span class="punctuation">)</span><span class="punctuation">{</span></span><br><span class="line"> ko_list <span class="operator"><-</span> strsplit<span class="punctuation">(</span>gene2ko<span class="operator">$</span>KO<span class="punctuation">[</span>which<span class="punctuation">(</span>gene2ko<span class="punctuation">[</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">]</span><span class="operator">==</span>Query<span class="punctuation">)</span><span class="punctuation">]</span><span class="punctuation">,</span>split <span class="operator">=</span> <span class="string">','</span><span class="punctuation">)</span></span><br><span class="line"> <span class="keyword">for</span> <span class="punctuation">(</span>ko <span class="keyword">in</span> ko_list<span class="punctuation">)</span><span class="punctuation">{</span></span><br><span class="line"> <span class="keyword">if</span> <span class="punctuation">(</span><span class="built_in">length</span><span class="punctuation">(</span>which<span class="punctuation">(</span>ko2gene<span class="punctuation">[</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">]</span><span class="operator">==</span>ko<span class="punctuation">)</span><span class="punctuation">)</span><span class="operator">==</span><span class="number">0</span><span class="punctuation">)</span><span class="punctuation">{</span></span><br><span class="line"> tmp <span class="operator"><-</span> data.frame<span class="punctuation">(</span>Ko<span 
class="operator">=</span>ko<span class="punctuation">,</span>GID<span class="operator">=</span>Query<span class="punctuation">)</span></span><br><span class="line"> ko2gene <span class="operator"><-</span> rbind<span class="punctuation">(</span>ko2gene<span class="punctuation">,</span>tmp<span class="punctuation">)</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="keyword">else</span><span class="punctuation">{</span></span><br><span class="line"> old_Query <span class="operator"><-</span> ko2gene<span class="operator">$</span>GID<span class="punctuation">[</span>which<span class="punctuation">(</span>ko2gene<span class="punctuation">[</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">]</span><span class="operator">==</span>ko<span class="punctuation">)</span><span class="punctuation">]</span></span><br><span class="line"> ko2gene<span class="operator">$</span>GID<span class="punctuation">[</span>which<span class="punctuation">(</span>ko2gene<span class="punctuation">[</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">]</span><span class="operator">==</span>ko<span class="punctuation">)</span><span class="punctuation">]</span> <span class="operator"><-</span> paste<span class="punctuation">(</span>old_Query<span class="punctuation">,</span>Query<span class="punctuation">,</span>sep <span class="operator">=</span> <span class="string">','</span><span class="punctuation">)</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"><span class="punctuation">}</span></span><br><span class="line"></span><br><span class="line">pathway2gene <span class="operator"><-</span> tibble<span class="punctuation">(</span>Pathway <span class="operator">=</span> character<span class="punctuation">(</span><span 
class="punctuation">)</span><span class="punctuation">,</span> GID <span class="operator">=</span> character<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> <span class="punctuation">(</span>ko <span class="keyword">in</span> ko2pathway<span class="operator">$</span>KO<span class="punctuation">)</span><span class="punctuation">{</span></span><br><span class="line"> pathway_list <span class="operator"><-</span> ko2pathway<span class="operator">$</span>Pathway<span class="punctuation">[</span>which<span class="punctuation">(</span>ko2pathway<span class="punctuation">[</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">]</span><span class="operator">==</span>ko<span class="punctuation">)</span><span class="punctuation">]</span></span><br><span class="line"> <span class="keyword">for</span> <span class="punctuation">(</span>pathway <span class="keyword">in</span> pathway_list<span class="punctuation">)</span><span class="punctuation">{</span></span><br><span class="line"> <span class="keyword">if</span> <span class="punctuation">(</span>paste<span class="punctuation">(</span><span class="string">'ko:'</span><span class="punctuation">,</span>ko<span class="punctuation">,</span>sep<span class="operator">=</span><span class="string">''</span><span class="punctuation">)</span> <span class="operator">%in%</span> ko2gene<span class="operator">$</span>Ko<span class="punctuation">)</span><span class="punctuation">{</span></span><br><span class="line"> ko <span class="operator"><-</span> paste<span class="punctuation">(</span><span class="string">'ko:'</span><span class="punctuation">,</span>ko<span class="punctuation">,</span>sep<span class="operator">=</span><span class="string">''</span><span class="punctuation">)</span></span><br><span class="line"> <span class="keyword">if</span> <span 
class="punctuation">(</span><span class="built_in">length</span><span class="punctuation">(</span>which<span class="punctuation">(</span>pathway2gene<span class="punctuation">[</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">]</span><span class="operator">==</span>pathway<span class="punctuation">)</span><span class="punctuation">)</span><span class="operator">==</span><span class="number">0</span> <span class="punctuation">)</span><span class="punctuation">{</span></span><br><span class="line"> ko2gene<span class="operator">$</span>GID<span class="punctuation">[</span>which<span class="punctuation">(</span>ko2gene<span class="punctuation">[</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">]</span><span class="operator">==</span>ko<span class="punctuation">)</span><span class="punctuation">]</span></span><br><span class="line"> tmp <span class="operator"><-</span> data.frame<span class="punctuation">(</span>pathway<span class="operator">=</span>pathway<span class="punctuation">,</span>GID<span class="operator">=</span>ko2gene<span class="operator">$</span>GID<span class="punctuation">[</span>which<span class="punctuation">(</span>ko2gene<span class="punctuation">[</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">]</span><span class="operator">==</span>ko<span class="punctuation">)</span><span class="punctuation">]</span><span class="punctuation">)</span></span><br><span class="line"> pathway2gene <span class="operator"><-</span> rbind<span class="punctuation">(</span>pathway2gene<span class="punctuation">,</span>tmp<span class="punctuation">)</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="keyword">else</span><span class="punctuation">{</span></span><br><span class="line"> old_Query <span class="operator"><-</span> pathway2gene<span 
class="operator">$</span>GID<span class="punctuation">[</span>which<span class="punctuation">(</span>pathway2gene<span class="punctuation">[</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">]</span><span class="operator">==</span>pathway<span class="punctuation">)</span><span class="punctuation">]</span></span><br><span class="line"> Query <span class="operator"><-</span> ko2gene<span class="operator">$</span>GID<span class="punctuation">[</span>which<span class="punctuation">(</span>ko2gene<span class="punctuation">[</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">]</span><span class="operator">==</span>ko<span class="punctuation">)</span><span class="punctuation">]</span></span><br><span class="line"> pathway2gene<span class="operator">$</span>GID<span class="punctuation">[</span>which<span class="punctuation">(</span>pathway2gene<span class="punctuation">[</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">]</span><span class="operator">==</span>pathway<span class="punctuation">)</span><span class="punctuation">]</span> <span class="operator"><-</span> paste<span class="punctuation">(</span>old_Query<span class="punctuation">,</span>Query<span class="punctuation">,</span>sep<span class="operator">=</span><span class="string">','</span><span class="punctuation">)</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"><span class="punctuation">}</span></span><br><span class="line"></span><br><span class="line">new_pathway2gene <span class="operator"><-</span> data.frame<span class="punctuation">(</span><span class="punctuation">)</span><span class="comment"># iterate over each row of pathway2gene and split the genes</span></span><br><span class="line"></span><br><span class="line"><span 
class="keyword">for</span> <span class="punctuation">(</span>i <span class="keyword">in</span> <span class="number">1</span><span class="operator">:</span>nrow<span class="punctuation">(</span>pathway2gene<span class="punctuation">)</span><span class="punctuation">)</span> <span class="punctuation">{</span></span><br><span class="line"> pathway <span class="operator"><-</span> pathway2gene<span class="operator">$</span>pathway<span class="punctuation">[</span>i<span class="punctuation">]</span></span><br><span class="line"> genes <span class="operator"><-</span> strsplit<span class="punctuation">(</span><span class="built_in">as.character</span><span class="punctuation">(</span>pathway2gene<span class="operator">$</span>GID<span class="punctuation">[</span>i<span class="punctuation">]</span><span class="punctuation">)</span><span class="punctuation">,</span> <span class="string">","</span><span class="punctuation">)</span><span class="punctuation">[[</span><span class="number">1</span><span class="punctuation">]</span><span class="punctuation">]</span></span><br><span class="line"> <span class="keyword">for</span> <span class="punctuation">(</span>gene <span class="keyword">in</span> genes<span class="punctuation">)</span> <span class="punctuation">{</span></span><br><span class="line"> new_row <span class="operator"><-</span> data.frame<span class="punctuation">(</span>pathway <span class="operator">=</span> pathway<span class="punctuation">,</span> gene <span class="operator">=</span> gene<span class="punctuation">)</span></span><br><span class="line"> new_pathway2gene <span class="operator"><-</span> rbind<span class="punctuation">(</span>new_pathway2gene<span class="punctuation">,</span> new_row<span class="punctuation">)</span></span><br><span class="line"> <span class="punctuation">}</span></span><br><span class="line"><span class="punctuation">}</span></span><br><span class="line"></span><br><span class="line"><span 
class="comment"># KEGG enrichment</span></span><br><span class="line">gene_select <span class="operator"><-</span> read.delim <span class="punctuation">(</span><span class="string">'gene.txt'</span><span class="punctuation">,</span> stringsAsFactors <span class="operator">=</span> <span class="literal">FALSE</span><span class="punctuation">,</span>header <span class="operator">=</span> <span class="built_in">F</span><span class="punctuation">)</span><span class="operator">$</span>V1</span><br><span class="line">kegg_rich <span class="operator"><-</span> enricher <span class="punctuation">(</span>gene <span class="operator">=</span> gene_select<span class="punctuation">,</span></span><br><span class="line"> TERM2GENE <span class="operator">=</span> new_pathway2gene<span class="punctuation">[</span><span class="built_in">c</span> <span class="punctuation">(</span><span class="string">'pathway'</span><span class="punctuation">,</span><span class="string">'gene'</span><span class="punctuation">)</span><span class="punctuation">]</span><span class="punctuation">,</span> </span><br><span class="line"> TERM2NAME <span class="operator">=</span> pathway2name<span class="punctuation">[</span><span class="built_in">c</span> <span class="punctuation">(</span><span class="string">'Pathway'</span><span class="punctuation">,</span><span class="string">'Name'</span><span class="punctuation">)</span><span class="punctuation">]</span><span class="punctuation">,</span> </span><br><span class="line"> pvalueCutoff <span class="operator">=</span> <span class="number">1</span><span class="punctuation">,</span> </span><br><span class="line"> pAdjustMethod <span class="operator">=</span> <span class="string">'BH'</span><span class="punctuation">,</span> </span><br><span class="line"> qvalueCutoff <span class="operator">=</span> <span class="number">1</span><span class="punctuation">,</span> </span><br><span class="line"> maxGSSize <span class="operator">=</span> <span class="number">500</span><span 
class="punctuation">)</span></span><br><span class="line"></span><br><span class="line">write.table <span class="punctuation">(</span>kegg_rich<span class="punctuation">,</span> <span class="string">'KEGG_enrichment.xls'</span><span class="punctuation">,</span> sep <span class="operator">=</span> <span class="string">'\t'</span><span class="punctuation">,</span> row.names <span class="operator">=</span> <span class="literal">FALSE</span><span class="punctuation">,</span> <span class="built_in">quote</span> <span class="operator">=</span> <span class="literal">FALSE</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure><p>The final KEGG enrichment result, <code>KEGG_enrichment.xls</code>, looks like this:</p><p><img src="https://www.shelven.com/tuchuang/20231223/13.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231223/13.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>As before, filter by qvalue and remove implausible enriched pathways (for example, you work on a plant but get animal or microbial pathways). Plotting is exactly the same as for the GO enrichment, so I will not repeat it here.</p><h3 id="剔除不合理的代谢通路"><a href="#剔除不合理的代谢通路" class="headerlink" title="剔除不合理的代谢通路"></a>Removing implausible pathways</h3><p>Incidentally, the KEGG results contain quite a few implausible entries. For example, many of the plant genes I annotated were assigned to KEGG pathways that have nothing to do with plants...</p><p>Of course, we can look up each enriched pathway on the KEGG website one by one, or we can compile all plant pathways in KEGG from within R. This uses the R package <code>KEGGREST</code> to pull information from the KEGG database quickly. Note that the <code>KEGGREST</code> package installed through BiocManager no longer works (the API address changed on June 1, 2022), so the latest version has to be installed from GitHub:</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Users in mainland China may hit network problems and need a little "magic" (shrug)</span></span><br><span class="line"><span class="comment">## Do not use third-party mirrors of the repository, they may have problems too; stick to the official Bioconductor one</span></span><br><span class="line">library<span class="punctuation">(</span>devtools<span class="punctuation">)</span></span><br><span class="line">devtools<span 
class="operator">::</span>install_github<span class="punctuation">(</span><span class="string">"https://github.com/Bioconductor/KEGGREST"</span><span class="punctuation">)</span></span><br><span class="line">library<span class="punctuation">(</span>KEGGREST<span class="punctuation">)</span></span><br></pre></td></tr></table></figure><p>Reference code: <a href="https://zhuanlan.zhihu.com/p/657531005?utm_id=0">玩转KEGG (二)——植物富集到了动物通路?难不成咱研究的植物人儿😂 - 知乎 (zhihu.com)</a></p><p>The idea is to use the <code>KEGGREST</code> package to fetch the list of plant species, get the pathways of each plant in KEGG, and take the union:</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line">org <span class="operator"><-</span> data.frame<span class="punctuation">(</span>keggList<span class="punctuation">(</span><span class="string">"organism"</span><span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line">plants <span class="operator"><-</span> org<span class="punctuation">[</span>grep<span class="punctuation">(</span><span class="string">"Plants"</span><span class="punctuation">,</span> org<span class="operator">$</span>phylogeny<span class="punctuation">)</span><span class="punctuation">,</span> <span class="punctuation">]</span><span class="comment"># select the plant species</span></span><br><span class="line">pathways_tot <span class="operator"><-</span> vector<span class="punctuation">(</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> <span 
class="punctuation">(</span>i <span class="keyword">in</span> <span class="number">1</span><span class="operator">:</span><span class="built_in">length</span><span class="punctuation">(</span>plants<span class="operator">$</span>organism<span class="punctuation">)</span><span class="punctuation">)</span> <span class="punctuation">{</span></span><br><span class="line"> try<span class="punctuation">(</span><span class="punctuation">{</span></span><br><span class="line"> pathways <span class="operator"><-</span> keggLink<span class="punctuation">(</span><span class="string">"pathway"</span><span class="punctuation">,</span> plants<span class="punctuation">[</span>i<span class="punctuation">,</span><span class="number">2</span><span class="punctuation">]</span><span class="punctuation">)</span></span><br><span class="line"> pathways <span class="operator"><-</span> sub<span class="punctuation">(</span>paste<span class="punctuation">(</span><span class="string">".*"</span><span class="punctuation">,</span>plants<span class="punctuation">[</span>i<span class="punctuation">,</span><span class="number">2</span><span class="punctuation">]</span><span class="punctuation">,</span> sep <span class="operator">=</span> <span class="string">""</span><span class="punctuation">)</span><span class="punctuation">,</span> <span class="string">""</span><span class="punctuation">,</span> pathways<span class="punctuation">)</span></span><br><span class="line"> pathways <span class="operator"><-</span> unique<span class="punctuation">(</span>pathways<span class="punctuation">)</span></span><br><span class="line"> pathways_tot <span class="operator"><-</span> append<span class="punctuation">(</span>pathways_tot<span class="punctuation">,</span>pathways<span class="punctuation">)</span></span><br><span class="line"> pathways_tot <span class="operator"><-</span> unique<span class="punctuation">(</span>pathways_tot<span class="punctuation">)</span> <span class="punctuation">}</span><span 
class="punctuation">)</span></span><br><span class="line"><span class="punctuation">}</span></span><br><span class="line"></span><br><span class="line">pathways_tot <span class="operator"><-</span> paste0<span class="punctuation">(</span><span class="string">"ko"</span><span class="punctuation">,</span> pathways_tot<span class="punctuation">)</span></span><br><span class="line">writeLines<span class="punctuation">(</span>pathways_tot<span class="punctuation">,</span> <span class="string">"plants.kegg.txt"</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure><p>The ko IDs have to be fetched for every species, so this is somewhat slow, roughly 10 minutes.</p><p><img src="https://www.shelven.com/tuchuang/20231223/15.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231223/15.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>As shown, KEGG currently contains 152 plant species. After taking the union of their pathways, the result is written out one pathway per line.</p><p>Finally, a quick filter in Python, which I am more comfortable with:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'plants.kegg.txt'</span>, <span class="string">'r'</span>) <span class="keyword">as</span> file:</span><br><span class="line">    ko_plant = file.read().splitlines()</span><br><span class="line"></span><br><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'KEGG_enrichment.xls'</span>,<span class="string">'r'</span>) <span class="keyword">as</span> <span class="built_in">input</span>, <span class="built_in">open</span>(<span 
class="string">'KEGG_enrichment.filter.xls'</span>,<span class="string">'w'</span>) <span class="keyword">as</span> output:</span><br><span class="line">    <span class="keyword">for</span> line <span class="keyword">in</span> <span class="built_in">input</span>:</span><br><span class="line">        <span class="keyword">if</span> line.startswith(<span class="string">'ko'</span>):</span><br><span class="line">            <span class="built_in">id</span> = line.split(<span class="string">'\t'</span>)[<span class="number">0</span>]</span><br><span class="line">            <span class="keyword">if</span> <span class="built_in">id</span> <span class="keyword">in</span> ko_plant:</span><br><span class="line">                output.write(line)</span><br><span class="line">        <span class="keyword">else</span>:</span><br><span class="line">            output.write(line)</span><br></pre></td></tr></table></figure><p>After all that effort, it would have been quicker to just look up the top few pathways on the KEGG website... <del>(I want to slap myself)</del></p><p><a href="https://www.genome.jp/dbget-bin/www_bget?ko03250">KEGG PATHWAY: ko03250 (genome.jp)</a></p><p>Just change the ko number in the URL and read the description. Some descriptions are not very clear, though, so running this filter does improve accuracy.</p><p>Finally, take the KEGG enrichment results to Chiplot for plotting:</p><p><img src="https://www.shelven.com/tuchuang/20231223/16.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231223/16.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p></div>]]></content>
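<summary type="html"><p>I have recently been running GO and KEGG enrichment analyses on a non-model organism. The posts and Zhihu columns I consulted all had small problems in their code, so I worked through the fixes myself until everything finally ran &#x3D; &#x3D;. This post is a record of that.</p></summary>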
<category term="学习笔记" scheme="http://www.shelven.com/categories/%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/"/>
<category term="GO" scheme="http://www.shelven.com/tags/GO/"/>
<category term="KEGG" scheme="http://www.shelven.com/tags/KEGG/"/>
</entry>
<entry>
<title>Gene Family Contraction and Expansion Analysis</title>
<link href="http://www.shelven.com/2023/12/01/a.html"/>
<id>http://www.shelven.com/2023/12/01/a.html</id>
<published>2023-12-01T09:34:24.000Z</published>
<updated>2023-12-23T11:13:34.000Z</updated>
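<content type="html"><![CDATA[<p>I have recently been running a gene family analysis on a plant genome I assembled myself. This post is a brief record of how I processed the data and of the analysis workflow.</p><p>As sequencing has become widely available, more and more plants have had their whole genomes sequenced. For plants with small genomes, assembly and annotation alone are now hard to publish, so we usually also need to pose and answer some biological questions. The most basic approach is comparative genomics: compare <strong>gene families</strong> across representative species in the phylogeny and build a phylogenetic tree, in order to shed light on the origin and function of those families.</p><span id="more"></span><p>This note records a workflow that uses <code>OrthoFinder + r8s + cafe</code> for gene family clustering, phylogenetic tree construction, divergence time estimation, and gene family contraction and expansion analysis.</p><h2 id="数据准备"><a href="#数据准备" class="headerlink" title="数据准备"></a>Data preparation</h2><div class="story post-story"><p>The most important thing is to obtain genome data for the species you want to analyze. For gene family analysis, the <strong>protein sequence file</strong> and the <strong>CDS sequence file</strong> are both required. For <strong>synteny analysis</strong> (for example with MCScanX) you also need the GFF file, and the assembly itself has to meet a requirement: <strong>it must be assembled to chromosome level</strong> (synteny analysis on contigs alone is meaningless).</p><h3 id="选择物种"><a href="#选择物种" class="headerlink" title="选择物种"></a>Choosing species</h3><p>Plant genome data can be obtained from NCBI <a href="https://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/">GenBank</a>, <a href="http://asia.ensembl.org/index.html">Ensembl</a>, or <a href="https://phytozome-next.jgi.doe.gov/">Phytozome</a>. The species you choose should not be too distantly related to your study species, should form a hierarchy of evolutionary relationships, should preferably share the same ploidy, and should have traits that speak to a biological question. You also need a suitable outgroup, which is used to root the phylogenetic tree.</p><p>You can use the phylogenetic trees in published genome papers for closely related species as a guide, or pick species with the help of <a href="https://www.plabipd.de/plant_genomes_pa.ep">Published Plant Genomes (plabipd.de)</a>. That site is still being updated; it makes it easy to find published plant genome papers and to see where the sequenced plants sit in the phylogeny, as shown below:</p><p><img src="https://www.shelven.com/tuchuang/20231130/1.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231130/1.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>I kept this site open while browsing GenBank for the plant genomes I needed, and downloaded the protein and CDS files (some GenBank plant genomes do not ship these files, in which case they have to be extracted from the gbff file). Based on the relationships shown on the site, you can sketch a mind map in advance of what the species tree should look like, and compare it with the tree you build later.</p><h3 id="数据预处理"><a href="#数据预处理" class="headerlink" title="数据预处理"></a>Data preprocessing</h3><p>Preprocessing mainly means extracting the longest transcript per gene, which keeps gene family clustering accurate and speeds up computation. The protein sets I downloaded from GenBank were already non-redundant, but those of my own assembly were not. The script below filters out isoform proteins and keeps only the longest transcript.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 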
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># 
Extract the longest transcript per gene from a protein FASTA (a blank line at the start of the input raises an error, so check the input file first)</span></span><br><span class="line"><span class="comment">## removeRedundantProteins.py</span></span><br><span class="line"><span class="keyword">import</span> sys</span><br><span class="line"><span class="keyword">import</span> getopt</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">usage</span>():</span><br><span class="line">    <span class="built_in">print</span>(<span class="string">'Usage: python3 removeRedundantProteins.py -i INPUT_FASTA -o OUTPUT_FASTA [-h]'</span>)</span><br><span class="line">    <span class="keyword">return</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">removeRedundant</span>(<span class="params">in_file, out_file</span>):</span><br><span class="line">    gene_dic = {} <span class="comment"># gene name -> list of records (header + sequence)</span></span><br><span class="line">    flag = <span class="string">''</span> <span class="comment"># gene currently being read</span></span><br><span class="line">    <span class="keyword">with</span> <span class="built_in">open</span>(in_file, <span class="string">'r'</span>) <span class="keyword">as</span> in_fasta:</span><br><span class="line">        <span class="keyword">for</span> line <span class="keyword">in</span> in_fasta:</span><br><span class="line">            <span class="keyword">if</span> line.startswith(<span class="string">'>'</span>):</span><br><span class="line">                <span class="comment"># how the gene name is derived from the header; adjust to your own IDs</span></span><br><span class="line">                li = line.strip(<span class="string">'>\n'</span>).split(<span class="string">'.'</span>)[<span class="number">0</span>]</span><br><span class="line">                flag = li</span><br><span class="line">                <span class="keyword">try</span>:</span><br><span class="line">                    gene_dic[li]</span><br><span class="line">                <span class="keyword">except</span> KeyError:</span><br><span class="line">                    gene_dic[li] = [line]</span><br><span class="line">                <span 
class="keyword">else</span>:</span><br><span class="line">                    gene_dic[li].append(line)</span><br><span class="line">            <span class="keyword">else</span>:</span><br><span class="line">                gene_dic[flag][-<span class="number">1</span>] += line <span class="comment"># append this sequence line to the most recent record of the current gene</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">with</span> <span class="built_in">open</span>(out_file, <span class="string">'w'</span>) <span class="keyword">as</span> out_fasta:</span><br><span class="line">        <span class="keyword">for</span> key, value <span class="keyword">in</span> gene_dic.items():</span><br><span class="line">            <span class="keyword">if</span> <span class="built_in">len</span>(value) == <span class="number">1</span>: <span class="comment"># only one record for this gene: write it out directly</span></span><br><span class="line">                out_fasta.write(gene_dic[key][<span class="number">0</span>])</span><br><span class="line">            <span class="keyword">else</span>:</span><br><span class="line">                trans_max = <span class="string">''</span> <span class="comment"># longest record seen so far</span></span><br><span class="line">                <span class="keyword">for</span> trans <span class="keyword">in</span> gene_dic[key]:</span><br><span class="line">                    <span class="keyword">if</span> <span class="built_in">len</span>(trans) > <span class="built_in">len</span>(trans_max): <span class="comment"># keep the longer record</span></span><br><span class="line">                        trans_max = trans</span><br><span class="line">                out_fasta.write(trans_max) <span class="comment"># write the longest record to the output file</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">main</span>(<span class="params">argv</span>):</span><br><span class="line">    <span class="keyword">try</span>:</span><br><span class="line">        opts, args = getopt.getopt(argv, <span class="string">'hi:o:'</span>) <span class="comment"># -h takes no argument; -i and -o each take one; args holds any extra arguments, normally an empty list that can be ignored</span></span><br><span class="line">    
<span class="keyword">except</span> getopt.GetoptError: <span class="comment"># raised for an unknown option or an option missing its required argument</span></span><br><span class="line">        usage()</span><br><span class="line">        sys.exit()</span><br><span class="line">    <span class="keyword">for</span> opt, arg <span class="keyword">in</span> opts:</span><br><span class="line">        <span class="keyword">if</span> opt == <span class="string">'-h'</span>:</span><br><span class="line">            usage()</span><br><span class="line">            sys.exit()</span><br><span class="line">        <span class="keyword">elif</span> opt == <span class="string">'-i'</span>:</span><br><span class="line">            in_fasta_name = arg</span><br><span class="line">        <span class="keyword">elif</span> opt == <span class="string">'-o'</span>:</span><br><span class="line">            outfile_name = arg</span><br><span class="line">    <span class="keyword">try</span>:</span><br><span class="line">        removeRedundant(in_fasta_name, outfile_name)</span><br><span class="line">    <span class="keyword">except</span> UnboundLocalError: <span class="comment"># raised when -i or -o was not given, so the variable was never assigned</span></span><br><span class="line">        usage()</span><br><span class="line">    <span class="keyword">return</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> __name__ == <span class="string">'__main__'</span>:</span><br><span class="line">    main(sys.argv[<span class="number">1</span>:]) <span class="comment"># the arguments after the script name</span></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>A quick way to check whether your protein set is redundant:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">grep "\.2" PROTEIN_FASTA</span><br></pre></td></tr></table></figure><p>If gene names of the form xxx.2 show up, the protein set is (normally) still redundant. After removing redundancy from the proteins, remember to extract the CDS of the same longest transcripts as well. A short Python script can do it, so I will not go into detail here (CDS sequences are not needed in this note; I will come back to them when computing Ks).</p></div><h2 id="1-基因家族聚类-amp-系统进化"><a href="#1-基因家族聚类-amp-系统进化" class="headerlink" title="1. 基因家族聚类&系统进化"></a>1. 
Gene family clustering & phylogeny</h2><div class="story post-story"><p>Gene family clustering groups and annotates sets of genes that descend from a single ancestral gene through duplication; the families specific to your species can then be put through GO and KEGG enrichment analysis (more on that later). The single-copy orthologs among the families shared by all species are used to build the phylogenetic tree, and a Venn diagram shows how many families are species-specific and how many are shared among the species analyzed.</p><p>I put these two steps together because <code>OrthoFinder</code> does both in one go.</p><h3 id="OrthoFinder"><a href="#OrthoFinder" class="headerlink" title="OrthoFinder"></a>OrthoFinder</h3><p><a href="https://github.com/davidemms/OrthoFinder">davidemms/OrthoFinder: Phylogenetic orthology inference for comparative genomics (github.com)</a></p><p>After installing with conda as described in the official GitHub repository, running it produced the following message:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">For Linux 64, Open MPI is built with CUDA awareness but this support is disabled by default.</span><br><span class="line">To enable it, please set the environment variable OMPI_MCA_opal_cuda_support=true before</span><br><span class="line">launching your MPI processes. Equivalently, you can set the MCA parameter in the command line:</span><br><span class="line">mpiexec --mca opal_cuda_support 1 ...</span><br><span class="line"> </span><br><span class="line">In addition, the UCX support is also built but disabled by default.</span><br><span class="line">To enable it, first install UCX (conda install -c conda-forge ucx). 
Then, set the environment</span><br><span class="line">variables OMPI_MCA_pml="ucx" OMPI_MCA_osc="ucx" before launching your MPI processes.</span><br><span class="line">Equivalently, you can set the MCA parameters in the command line:</span><br><span class="line">mpiexec --mca pml ucx --mca osc ucx ...</span><br><span class="line">Note that you might also need to set UCX_MEMTYPE_CACHE=n for CUDA awareness via UCX.</span><br><span class="line">Please consult UCX's documentation for detail.</span><br></pre></td></tr></table></figure><p>I do not know why this happens, since <code>OrthoFinder</code> does not depend on Open MPI. Either way it would not run properly, so I downloaded the <a href="https://github.com/davidemms/OrthoFinder/releases/tag/2.5.5">official source release</a>, unpacked it, and installed the two required programs with conda:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta prompt_"># </span><span class="language-bash">multiple sequence alignment: mafft</span></span><br><span class="line">conda install mafft</span><br><span class="line"><span class="meta prompt_"># </span><span class="language-bash">tree inference: fasttree</span></span><br><span class="line">conda install fasttree</span><br></pre></td></tr></table></figure><p>All we need to do is put the protein sequences of the species being compared into one folder (say <code>mydata</code>), one file per species. Files ending in <code>.fa</code> <code>.faa</code> <code>.fasta</code> <code>.fas</code> <code>.pep</code> are all recognized; rename each file so that its prefix is the species name. Then write a Slurm script:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta prompt_">#</span><span class="language-bash">!/bin/bash</span></span><br><span class="line"><span class="meta prompt_">#</span><span class="language-bash">SBATCH -n 5</span></span><br><span class="line"><span class="meta 
prompt_">#</span><span class="language-bash">SBATCH -t 7200</span></span><br><span class="line"></span><br><span class="line">python ./orthofinder.py -M msa -f mydata/ -t 5</span><br></pre></td></tr></table></figure><ul><li><code>-M msa</code> selects the species tree inference method: a maximum likelihood tree inferred from a <strong>concatenated multiple sequence alignment</strong>. In this mode <code>MAFFT</code> does the alignment and <code>FastTree</code> infers the tree.</li></ul><p>There are two ways to build a species tree: concatenation and the coalescent-based (summary) approach. With <code>-M msa</code>, <code>orthofinder</code> uses concatenation: the single-copy orthologs are aligned and joined into one long alignment, from which the tree is built. That alignment is written to <code>MultipleSequenceAlignments/SpeciesTreeAlignment.fa</code>. The summary approach builds a tree for each single-copy gene family and then combines the gene trees into a consensus tree (worth knowing about, nothing more).</p><h3 id="结果文件"><a href="#结果文件" class="headerlink" title="结果文件"></a>Result files</h3><p>When the run finishes, the <code>OrthoFinder</code> results folder is inside the protein sequence folder. There are a lot of result files, and the official tutorial explains each folder: <a href="https://davidemms.github.io/orthofinder_tutorials/exploring-orthofinders-results.html">Exploring OrthoFinder’s results | OrthoFinder Tutorials </a></p><blockquote><p>The most important result files:</p><p>MultipleSequenceAlignments/SpeciesTreeAlignment.fa: the concatenated multiple sequence alignment</p><p>Species_Tree/SpeciesTree_rooted.txt: the species tree with support (bootstrap) values</p><p>Orthogroups/Orthogroups.GeneCount.tsv: gene counts per orthogroup</p><p>Single_Copy_Orthologue_Sequences: folder of single-copy ortholog sequences</p><p>Comparative_Genomics_Statistics/Statistics_Overall.tsv: summary statistics</p></blockquote><p>Opening <code>Statistics_Overall.tsv</code> gives the following information:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">Number of species 12</span><br><span class="line">Number of genes 502539</span><br><span class="line">Number of genes in orthogroups 468058</span><br><span class="line">Number of unassigned genes 
34481</span><br><span class="line">Percentage of genes in orthogroups 93.1</span><br><span class="line">Percentage of unassigned genes 6.9</span><br><span class="line">Number of orthogroups 39603</span><br><span class="line">Number of species-specific orthogroups 13537</span><br><span class="line">Number of genes in species-specific orthogroups 72104</span><br><span class="line">Percentage of genes in species-specific orthogroups 14.3</span><br><span class="line">Mean orthogroup size 11.8</span><br><span class="line">Median orthogroup size 5.0</span><br><span class="line">...</span><br></pre></td></tr></table></figure><p>In total, 468058 genes (93.1%) were assigned to 39603 orthogroups. At least 80% of the genes should fall into orthogroups; this run is well above that, so the result looks reliable.</p><p>Opening <code>SpeciesTree_rooted.txt</code> gives the tree:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">(O.s:0.288854,(((L.s:0.323215,(N.a:0.233945,(C.c:0.185824,((A.p:0.00372873,A.v:0.00344612)1:0.102058,C.r:0.121885)1:0.0883246)1:0.0751893)1:0.0565903)1:0.0505776,(((G.h:0.0112911,G.b:0.011717)1:0.209852,A.t:0.363066)1:0.0376571,M.t:0.325608)1:0.0450521)1:0.0678736,A.c:0.272559)1:0.288854);</span><br></pre></td></tr></table></figure><p>With the 1 before each colon removed, this is a plain tree in the standard <code>.nwk</code> (Newick) format. <strong>The 1 is the bootstrap (support) value</strong>; a value above 70% is taken to mean the tree is very reliable. <strong>The 1s need to be deleted by hand before the downstream steps</strong>. For background on phylogenetic trees, see: <a href="https://zhuanlan.zhihu.com/p/562813963">R语言绘制进化树:treeio+ggtree - 知乎 (zhihu.com)</a></p><p>The tree can be viewed in tree visualization software, for example by importing it into <a href="https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/dendroscope/">Dendroscope</a>:</p><p><img src="https://www.shelven.com/tuchuang/20231130/3.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231130/3.jpg" 
srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>At this point you can compare it with the tree sketched in the mind map earlier; the topology is identical.</p><p><code>Orthogroups.GeneCount.tsv</code> lists the number of genes each species has in every orthogroup; this file is needed later for the contraction and expansion analysis. First, a quick tidy-up of the data for the species of interest:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta prompt_">#</span><span class="language-bash"><span class="comment"># drop the header line, keep the rows whose second column is greater than zero, and save their first column to 1.txt</span></span></span><br><span class="line">sed '1d' Orthogroups.GeneCount.tsv |awk '$2 >0 {print $1}' >1.txt</span><br></pre></td></tr></table></figure><p>Change the column in <code>awk '$2 >0 {print $1}'</code> as needed. The end result is one file per species, each listing the orthogroups in which that species has at least one gene, which can then be visualized with an online tool. Taking an online Venn diagram tool as an example: <a href="https://jvenn.toulouse.inrae.fr/app/example.html">jvenn (inrae.fr)</a></p><p><img src="https://www.shelven.com/tuchuang/20231130/2.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231130/2.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p></div><h2 id="2-分歧时间估算"><a href="#2-分歧时间估算" class="headerlink" title="2. 分歧时间估算"></a>2. 
Divergence time estimation</h2><div class="story post-story"><p>Starting from the tree in <code>Species_Tree/SpeciesTree_rooted.txt</code> and calibrating it with fossil dates, we can obtain a species tree with divergence times (an ultrametric tree, or time tree). Software for divergence time estimation includes the <strong>mcmctree program</strong> in <code>PAML</code>, as well as <code>beast</code>, <code>r8s</code>, <code>Multidivtime</code>, and others.</p><p>Algorithmically, <strong>mcmctree, Multidivtime, and beast</strong> are all Bayesian. <code>mcmctree</code> offers a correlated rate model or an independent rate model and accepts protein or nucleotide input. <strong>r8s</strong> offers NPRS (nonparametric rate smoothing), penalized likelihood (PL), and other methods.</p><p>One paper I read compared time trees built with different software and models: <strong>mcmctree did well on shallower divergences, Multidivtime and r8s did better on deeper divergences, and beast did not do particularly well on either</strong>. The original paper:</p><p><a href="https://www.researchgate.net/publication/273499450_Dating_the_origin_of_the_major_lineages_of_Branchiopoda">(PDF) Dating the origin of the major lineages of Branchiopoda (researchgate.net)</a></p><p>Since <code>r8s</code> is faster, it is used for divergence time estimation below; I will run <code>mcmctree</code> as well when I have time.</p><h3 id="r8s"><a href="#r8s" class="headerlink" title="r8s"></a>r8s</h3><p>Source code: <a href="https://sourceforge.net/projects/r8s/">r8s download | SourceForge.net</a></p><p>Compiling the r8s source failed for me; I later found a fix on <strong>Biostars</strong>:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">tar -xzvf r8s1.81.tar.gz</span><br><span class="line">cd r8s1.81/src</span><br><span class="line">cp Makefile.linux Makefile.linux.bak</span><br><span class="line">sed -i 's/continuousML.o //' Makefile.linux</span><br><span class="line">sed -i 's/continuousML.o:/#continuousML.o:/' Makefile.linux</span><br><span class="line">make -f Makefile.linux</span><br></pre></td></tr></table></figure><p>The source package includes a manual, and a Chinese translation is available: <a href="https://max.book118.com/html/2017/0427/102822263.shtm">r8s使用指南.pdf 
(book118.com)</a></p><p><code>r8s</code> needs a control (batch) file written according to the manual. You can write one yourself, or use the <a href="https://github.com/hahnlab/CAFE5/blob/master/docs/tutorial/prep_r8s.py">prep_r8s.py</a> script provided with <code>cafe5</code>.</p><blockquote><p>This Python script takes the following arguments:</p><ul><li>-i: input file, the tree file generated above</li><li>-o: output file, a name of your choice</li><li>-s: number of aligned amino acid sites</li><li>-p: the species pair used for calibration</li><li>-c: calibration time</li></ul></blockquote><p>Note: <strong>under Python 2 the script runs as is; under Python 3 line 59 of the source has to be changed</strong>:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">print "\nRunning cafetutorial_clade_and_size_filter.py as a standalone...\n"</span><br><span class="line"># change to</span><br><span class="line">print("\nRunning cafetutorial_clade_and_size_filter.py as a standalone...\n")</span><br></pre></td></tr></table></figure><p>In Python 3 the <code>print ""</code> statement became the <code>print()</code> function.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ seqkit stat MultipleSequenceAlignments/SpeciesTreeAlignment.fa</span><br><span class="line">file format type num_seqs sum_len min_len avg_len max_len</span><br><span class="line">MultipleSequenceAlignments/SpeciesTreeAlignment.fa FASTA Protein 12 1,541,496 128,458 128,458 128,458</span><br></pre></td></tr></table></figure><p>The number of aligned amino acid sites can be obtained with seqkit (installable with conda); here it is 128458.</p><p>The calibration time can be looked up on the divergence time site <a href="http://www.timetree.org/">TimeTree :: The Timescale of Life</a>:</p><p><img src="https://www.shelven.com/tuchuang/20231130/4.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231130/4.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python prep_r8s.py -i SpeciesTree_rooted.txt -o r8s_ctl_file.txt -s 128458 -p 'A.thaliana,O.sativa' -c '160'</span><br></pre></td></tr></table></figure><p>The generated <code>r8s_ctl_file.txt</code> is the <code>r8s</code> control file. All the Python script really does is fill in a few parameters: it defines <code>anaiva</code> (the last three letters of each of the two species names joined together) as the hypothetical common ancestor of Arabidopsis and rice, with the divergence time set to the 160 MYA looked up above.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">#NEXUS</span><br><span class="line">begin trees;</span><br><span class="line">tree nj_tree = [your tree]</span><br><span class="line">End;</span><br><span class="line">begin rates;</span><br><span class="line">blformat nsites=128458 lengths=persite ultrametric=no;</span><br><span class="line">collapse;</span><br><span class="line">mrca anaiva A.thaliana O.sativa;</span><br><span class="line">fixage taxon=anaiva age=160;</span><br><span class="line">divtime method=pl algorithm=tn cvStart=0 cvInc=0.5 cvNum=8 crossv=yes;</span><br><span class="line">describe plot=chronogram;</span><br><span class="line">describe plot=tree_description;</span><br><span class="line">end;</span><br></pre></td></tr></table></figure><p>With the control file in place, r8s can be run:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">r8s -b -f r8s_ctl_file.txt >r8s_tmp.txt</span><br></pre></td></tr></table></figure><p>In practice we only need the time tree on the last line, <strong>with the hypothetical common ancestor label removed by hand</strong> (<code>anaiva</code>; this is needed to run cafe later):</p><figure 
class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">tail -n 1 r8s_tmp.txt | cut -c 16- > tree.txt</span><br></pre></td></tr></table></figure><p>This tree file can be visualized in <a href="https://www.chiplot.online/tvbot.html">TVBOT (chiplot.online)</a> (the site recommends using mcmctree or beast output), and the final figure can be polished in Adobe Illustrator.</p></div><h2 id="3-基因家族收缩扩张"><a href="#3-基因家族收缩扩张" class="headerlink" title="3. 基因家族收缩扩张"></a>3. Gene family contraction and expansion</h2><div class="story post-story"><p>Given the time tree and the gene family clustering, a birth-death model is used to estimate the number of family members in the ancestor at each node, and from that to infer which gene families have contracted or expanded in the target species relative to its ancestors.</p><h3 id="cafe5"><a href="#cafe5" class="headerlink" title="cafe5"></a>cafe5</h3><p><a href="https://github.com/hahnlab/CAFE5">hahnlab/CAFE5: Version 5 of the CAFE phylogenetics software (github.com)</a></p><blockquote><p>cafe needs two input files:</p><ol><li>A gene count table for the orthogroups, with a functional description in the first column</li><li>A binary, rooted, ultrametric tree (nwk format)</li></ol></blockquote><p>The second file is the <code>tree.txt</code> produced by <code>r8s</code> in the previous step. The first file requires some processing of <code>Orthogroups.GeneCount.tsv</code> from the <code>Orthofinder</code> results:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta prompt_"># </span><span class="language-bash">drop the last column (Total), prepend a first column of (null), and rename it to Desc on the header line</span></span><br><span class="line">awk 'OFS="\t" {$NF=""; print}' Orthogroups.GeneCount.tsv > tmp && awk '{print "(null)""\t"$0}' tmp > cafe.input.tsv && sed -i '1s/(null)/Desc/g' cafe.input.tsv && rm tmp</span><br></pre></td></tr></table></figure><p>The resulting <code>cafe.input.tsv</code> looks like this:</p><p><img src="https://www.shelven.com/tuchuang/20231130/5.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231130/5.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Two of my species are tetraploids that I did not split into subgenomes, while the rest are diploid, which may affect the single-copy gene family statistics. Filtering uses the <a 
href="https://github.com/hahnlab/CAFE5/blob/master/docs/tutorial/clade_and_size_filter.py">clade_and_size_filter.py</a> script from the official repository:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python cafetutorial_clade_and_size_filter.py -i cafe.input.tsv -o gene.families.filtered.tsv -s 2> filtered.log</span><br></pre></td></tr></table></figure><p>This filters out gene families with more than 100 copies. Once <code>gene.families.filtered.tsv</code> is ready, cafe can be run:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">cafe5 -i gene.families.filtered.tsv -t tree.txt -c 10 -p -k 5 -o k5p</span><br></pre></td></tr></table></figure><p>Unlike earlier versions, <code>cafe5</code> no longer needs a run script; parameters are passed directly on the command line. <code>-p</code> uses a Poisson distribution for the root frequency distribution, <code>-k</code> selects the Gamma model (5 gamma rate categories are specified here; values between 2 and 5 are typical), and <code>-o</code> names the output folder.</p><p>If you think evolutionary rates differ among your species, you can also assign different lambda values and pass them with <code>-y</code>:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"># Official example, chimphuman_separate_lambda.txt; 1 and 2 mark groups with different evolutionary rates</span><br><span class="line">((((cat:1,horse:1):1,cow:1):1,(((((chimp:2,human:2):2,orang:1):1,gibbon:1):1,(macaque:1,baboon:1):1):1,marmoset:1):1):1,(rat:1,mouse:1):1);</span><br></pre></td></tr></table></figure><h3 id="结果文件-1"><a href="#结果文件-1" class="headerlink" title="结果文件"></a>Result files</h3><p>Since I specified the Gamma model, the result files are laid out as follows:</p><blockquote><p>├── Gamma_asr.tre# tree for each gene family<br>├── Gamma_branch_probabilities.tab<br>├── Gamma_category_likelihoods.txt<br>├── Gamma_change.tab<br>├── Gamma_clade_results.txt# number of expanded/contracted gene families at each node<br>├── Gamma_count.tab<br>├── Gamma_family_likelihoods.txt<br>├── Gamma_family_results.txt<br>├── Gamma_report.cafe# report file for downstream statistics, plotting, and so on<br>└── Gamma_results.txt# 
final likelihood and lambda of the Gamma model</p></blockquote><p>When <code>cafe5</code> was first released it did not produce the <code>report.cafe</code> file, and at the time someone wrote a tool for visualizing the results, <code>CafePlotter</code></p><p><a href="https://github.com/moshi4/CafePlotter">moshi4/CafePlotter: A tool for plotting CAFE5 gene family expansion/contraction result (github.com)</a></p><p>The tool is still being updated and is quite convenient for plotting:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">pip install cafeplotter</span><br><span class="line"></span><br><span class="line">cafeplotter -i k5p/ -o k5p_plotter/ --format svg</span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20231130/6.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231130/6.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>The figure can be polished in <code>Adobe illustrator</code>. There is also another visualization tool, <code>CAFE_fig</code>:</p><p><a href="https://github.com/LKremer/CAFE_fig">LKremer/CAFE_fig: A tool to extract and visualize the results of CAFE (Computational Analysis of gene Family Evolution) (github.com)</a></p><p>Running it fails for me with <code>module 'ete3' has no attribute 'TreeStyle'</code>, which I have not been able to fix yet.</p></div><h2 id="2023-x2F-12-x2F-17更新"><a href="#2023-x2F-12-x2F-17更新" class="headerlink" title="## 2023/12/17更新"></a>Update 2023/12/17</h2><div class="story post-story"><h3 id="mcmctree"><a href="#mcmctree" class="headerlink" title="mcmctree"></a>mcmctree</h3><p>As mentioned above, besides <code>r8s</code>, the tool most commonly used in papers for divergence time estimation these days is <code>mcmctree</code>. I happened to pick a new set of species over the last few days and ran <code>mcmctree</code> on them along the way, so here is a record.</p><p>Since mcmctree is one of the programs in <code>PAML</code>, building and installing PAML from source is all that is needed:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td 
class="code"><pre><span class="line"># GitHub mirror of PAML</span><br><span class="line">git clone https://githubfast.com/abacus-gene/paml.git</span><br><span class="line"></span><br><span class="line">cd paml/src</span><br><span class="line">make -f Makefile</span><br></pre></td></tr></table></figure><p>I ran into a small problem while compiling:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">cc -O3 -Wall -Wno-unused-result -Wmemset-elt-size -c baseml.c</span><br><span class="line">cc: error: unrecognized command line option ‘-Wmemset-elt-size’</span><br><span class="line">make: *** [baseml.o] Error 1</span><br></pre></td></tr></table></figure><p><code>-Wmemset-elt-size</code> only controls a compiler warning and has no effect on the compiled program. My gcc is probably too old to recognize the option; <strong>simply deleting it from the Makefile lets the build go through</strong>.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"># move the compiled binaries into the bin folder of the PAML top-level directory and add that path to .bashrc</span><br><span class="line">mkdir ../bin/</span><br><span class="line">mv baseml basemlg chi2 codeml evolver infinitesites mcmctree pamp yn00 ../bin/</span><br><span class="line"></span><br><span class="line">vim ~/.bashrc</span><br><span class="line">export PATH="/public/home/wlxie/biosoft/paml/bin:$PATH"</span><br><span class="line">. 
~/.bashrc</span><br></pre></td></tr></table></figure><p>That completes the <code>PAML</code> installation.</p><p>The first steps of the analysis are exactly the same as before: prepare the protein sequences of each species, <strong>keep only the longest transcript per gene and drop proteins of 50 amino acids or fewer</strong>, then run <code>Orthofinder</code> to cluster gene families. Remember to add <code>-M msa</code>, which makes it write out the concatenated alignment of the single-copy orthologs across species.</p><p>I do not quite understand why so many papers take the single-copy orthologs identified by <code>Orthofinder</code>, run a separate multiple sequence alignment on them, and then build the species tree with yet another program, when <code>Orthofinder</code> alone can handle all of this. If the results barely differ, I personally see no need for the extra work.</p><p><code>mcmctree</code> takes a multiple sequence alignment in <strong>phylip</strong> format, so the <code>MultipleSequenceAlignments/SpeciesTreeAlignment.fa</code> file produced by orthofinder needs a small change (a manual conversion to phylip format):</p><p><img src="https://www.shelven.com/tuchuang/20231130/10.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231130/10.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Specifically, the first line holds the number of species + <strong>a space</strong> + the alignment length. Each following line holds a species name (which must not contain a "." character), two or more spaces, and the aligned sequence on a single line. A short python script takes care of it:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Build a dictionary that stores the protein sequences</span></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">protein_sequence</span>(<span class="params">file_path</span>):</span><br><span class="line">    
sequences = {}</span><br><span class="line"> <span class="keyword">with</span> <span class="built_in">open</span>(file_path, <span class="string">'r'</span>) <span class="keyword">as</span> file:</span><br><span class="line"> content = file.read()</span><br><span class="line"> blocks = content.split(<span class="string">'>'</span>) </span><br><span class="line"> <span class="keyword">for</span> block <span class="keyword">in</span> blocks[<span class="number">1</span>:]:</span><br><span class="line"> index = block.find(<span class="string">'\n'</span>)</span><br><span class="line"> header = block[:index]</span><br><span class="line"> sequence = block[index + <span class="number">1</span>: -<span class="number">1</span>].replace(<span class="string">'\n'</span>, <span class="string">''</span>)</span><br><span class="line"> sequences[header] = sequence</span><br><span class="line"> <span class="keyword">return</span> sequences</span><br><span class="line"></span><br><span class="line">sequences = protein_sequence(<span class="string">'./基因家族进化/mcmctree/SpeciesTreeAlignment.fa'</span>)</span><br><span class="line"></span><br><span class="line">species_number = <span class="built_in">len</span>(sequences)</span><br><span class="line">sequence_length = <span class="built_in">len</span>(sequences[<span class="built_in">next</span>(<span class="built_in">iter</span>(sequences))]) <span class="comment"># 获取第一个序列的长度(实际上每个都是一样的)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 保存成mcmctree需要的phylip格式</span></span><br><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'mcmctree.phylip'</span>, <span class="string">'w'</span>) <span class="keyword">as</span> f:</span><br><span class="line"> f.write(<span class="string">f'<span class="subst">{species_number}</span> <span class="subst">{sequence_length}</span>\n'</span>)<span class="comment"># 一个空格</span></span><br><span 
class="line">    <span class="keyword">for</span> key, value <span class="keyword">in</span> sequences.items():</span><br><span class="line">        f.write(<span class="string">f'<span class="subst">{key}</span>  <span class="subst">{value}</span>\n'</span>)<span class="comment"># two spaces</span></span><br></pre></td></tr></table></figure><p>Like <code>r8s</code>, <code>mcmctree</code> needs fossil evidence for calibration when estimating divergence times. <strong>The difference is that the mcmctree time unit here is 100 million years (100 Myr)</strong>, so a bound such as B(1.098,1.244) stands for 109.8 to 124.4 Mya. Here I calibrated with the rice/Arabidopsis and Arabidopsis/grape divergence times, again taken from <a href="http://www.timetree.org/">http://www.timetree.org/</a></p><p>mcmctree reads the calibrations from the tree file. All branch lengths, node labels and the like must be removed, and a first line giving the number of species and the number of trees must be added. Starting from the <code>orthofinder</code> species tree <code>Species_Tree/SpeciesTree_rooted.txt</code>, the edited tree looks like this:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"># The edited species tree, saved as mcmctree.tree</span><br><span class="line">10 1</span><br><span class="line">((((((((As,Cg),Mt),(Av,Ap)),Vt),Cr),Vv),At)'B(1.098,1.244)',Os)'B(1.421,1.635)';</span><br></pre></td></tr></table></figure><p>Note that the species names in the tree must match those in the alignment file. I have abbreviated the species names here; the point is the format.</p><p>The control file also needs editing. I followed a post from the 生信技工 blog and will not repeat what each parameter means: <a href="https://yanzhongsino.github.io/2021/03/25/bioinfo_phylogeny_caculate.divergence.time/">估算系统树分歧时间 —— paml.mcmctree,r8s | 生信技工 (yanzhongsino.github.io)</a></p><p>The control file is as follows:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span 
class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br></pre></td><td class="code"><pre><span class="line"> seed = -1</span><br><span class="line"> seqfile = mcmctree.phylip</span><br><span class="line"> treefile = mcmctree.tree</span><br><span class="line"> outfile = mcmc.out</span><br><span class="line"></span><br><span class="line"> ndata = 1</span><br><span class="line"> seqtype = 2 * 0: nucleotides; 1:codons; 2:AAs</span><br><span class="line"> usedata = 3 * 0: no data; 1:seq like; 2:use in.BV; 3: out.BV</span><br><span class="line"> clock = 3 * 1: global clock; 2: independent rates; 3: correlated rates</span><br><span class="line"> RootAge = <2.0 * safe constraint on root age, used if no fossil for root.</span><br><span class="line"></span><br><span class="line"> model = 2 * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85</span><br><span class="line"> alpha = 0 * alpha for gamma rates at sites</span><br><span class="line"> ncatG = 5 * No. 
categories in discrete gamma</span><br><span class="line"> aaRatefile = wag.dat</span><br><span class="line"></span><br><span class="line"> cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?</span><br><span class="line"></span><br><span class="line"> BDparas = 1 1 0 * birth, death, sampling</span><br><span class="line"> kappa_gamma = 6 2 * gamma prior for kappa</span><br><span class="line"> alpha_gamma = 1 1 * gamma prior for alpha</span><br><span class="line"></span><br><span class="line"> rgene_gamma = 2 2 * gamma prior for overall rates for genes</span><br><span class="line"> sigma2_gamma = 1 10 * gamma prior for sigma^2 (for clock=2 or 3)</span><br><span class="line"></span><br><span class="line"> finetune = 1: 0.1 0.1 0.1 0.01 .5 * auto (0 or 1) : times, musigma2, rates, mixing, paras, FossilErr</span><br><span class="line"></span><br><span class="line"> print = 1</span><br><span class="line"> burnin = 8000</span><br><span class="line"> sampfreq = 2</span><br><span class="line"> nsample = 200000</span><br><span class="line"></span><br><span class="line">*** Note: Make your window wider (100 columns) before running the program.</span><br></pre></td></tr></table></figure><p>Copy the amino acid substitution rate file <code>dat/wag.dat</code> into the current directory and run <code>mcmctree mcmctree.ctl</code> a first time, which generates the file <code>out.BV</code>.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"># Rename out.BV to in.BV</span><br><span class="line">mv out.BV in.BV</span><br><span class="line"></span><br><span class="line"># Edit the control file</span><br><span class="line">usedata = 2</span><br></pre></td></tr></table></figure><p>Run <code>mcmctree mcmctree.ctl</code> once more and it produces <code>FigTree.tre</code>, the dated tree, which can be visualized directly at <a href="https://www.chiplot.online/treeGallery.html">Tree Gallery 
(chiplot.online)</a>:</p><p><img src="https://www.shelven.com/tuchuang/20231130/12.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231130/12.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Just keep the time scale in mind. You can also add a geological time scale and so on; see the developer's tutorial videos on bilibili:</p><p><a href="https://space.bilibili.com/30493771/channel/collectiondetail?sid=192106">小驰Coding's channel on bilibili (bilibili.com)</a></p><p>After some polishing and a few adjustments in Adobe Illustrator, the final figure looks like this:</p><p><img src="https://www.shelven.com/tuchuang/20231130/100.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231130/100.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p></div>]]></content>
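<summary type="html"><p>I have recently been running gene family analyses on a plant genome I assembled myself. This post is a brief record of how I processed the data and of the analysis workflow.</p>
<p>As sequencing has become widely available, more and more plants have had their whole genomes sequenced. For plants with small genomes, genome assembly and annotation alone are now hard to publish, so we usually have to pose and answer some biological questions as well. The most basic approach is comparative genomics: comparing <strong>gene families</strong> across representative species of a phylogeny and building a phylogenetic map to reveal the origin and function of those gene families.</p></summary>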
<category term="学习笔记" scheme="http://www.shelven.com/categories/%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/"/>
<category term="比较基因组学" scheme="http://www.shelven.com/categories/%E6%AF%94%E8%BE%83%E5%9F%BA%E5%9B%A0%E7%BB%84%E5%AD%A6/"/>
<category term="OrthoFinder" scheme="http://www.shelven.com/tags/OrthoFinder/"/>
<category term="r8s" scheme="http://www.shelven.com/tags/r8s/"/>
<category term="cafe5" scheme="http://www.shelven.com/tags/cafe5/"/>
<category term="mcmctree" scheme="http://www.shelven.com/tags/mcmctree/"/>
</entry>
<entry>
<title>Gene family analysis: automating submissions to online tools and processing the results</title>
<link href="http://www.shelven.com/2023/11/10/a.html"/>
<id>http://www.shelven.com/2023/11/10/a.html</id>
<published>2023-11-10T14:01:55.000Z</published>
<updated>2024-04-25T08:10:37.000Z</updated>
<content type="html"><![CDATA[<p>I have recently been doing a gene family analysis for a plant species and spent a week producing every figure I could. I will write up the whole workflow when I find the time.</p><p>First, a problem I ran into during protein property and sequence analysis. I keep a collection of online protein analysis tools and databases, which you can <a href="https://www.shelven.com/Bioinformatics/">browse here (shelven.com)</a> and which I update once a year. Some of these sites accept <strong>only one sequence</strong> per submission. With many sequences on hand, pasting them in one at a time, clicking submit, and then copying the values you need from the next page is truly tedious.</p><span id="more"></span><p>This post briefly covers the online protein analysis sites I used recently, and how I automated bulk submission and collected the results.</p><h2 id="1-ExPASY"><a href="#1-ExPASY" class="headerlink" title="1. ExPASY"></a>1. ExPASY</h2><div class="story post-story"><p>This site predicts protein properties. It reports sequence length, molecular weight, isoelectric point, instability index, aliphatic index, grand average of hydropathicity, and more.</p><p><img src="https://www.shelven.com/tuchuang/20231110/2.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231110/2.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>The official page is here: <a href="https://web.expasy.org/protparam/">Expasy - ProtParam tool</a></p><p>Its only drawback is that it takes one sequence at a time, so I wrote a simple selenium script that enters the sequences automatically and collects the predictions I want:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span 
class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># ExPASY预测专用</span></span><br><span class="line"><span class="keyword">from</span> selenium <span class="keyword">import</span> webdriver</span><br><span class="line"><span class="keyword">from</span> selenium.webdriver.common.by <span class="keyword">import</span> By</span><br><span class="line"><span class="keyword">from</span> selenium.webdriver.support <span class="keyword">import</span> expected_conditions <span class="keyword">as</span> EC</span><br><span class="line"><span class="keyword">from</span> selenium.webdriver.support.wait <span class="keyword">import</span> WebDriverWait</span><br><span class="line"><span class="keyword">from</span> selenium.webdriver.chrome.service <span class="keyword">import</span> Service</span><br><span class="line"><span 
class="keyword">from</span> selenium.webdriver <span class="keyword">import</span> ChromeOptions</span><br><span class="line"><span class="keyword">from</span> lxml <span class="keyword">import</span> etree</span><br><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">from</span> tqdm <span class="keyword">import</span> tqdm<span class="comment"># 加了个进度条</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 读取蛋白和存储序列,返回一个字典</span></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">protein_sequence</span>(<span class="params">file_path</span>):</span><br><span class="line"> sequences = {}</span><br><span class="line"> <span class="keyword">with</span> <span class="built_in">open</span>(file_path, <span class="string">'r'</span>) <span class="keyword">as</span> file:</span><br><span class="line"> content = file.read()</span><br><span class="line"> blocks = content.split(<span class="string">'>'</span>) <span class="comment"># 以“>”分隔区块</span></span><br><span class="line"> </span><br><span class="line"> <span class="keyword">for</span> block <span class="keyword">in</span> blocks[<span class="number">1</span>:]:</span><br><span class="line"> index = block.find(<span class="string">'\n'</span>)<span class="comment"># 寻找第一个换行符索引值</span></span><br><span class="line"> header = block[:index]</span><br><span class="line"> sequence = block[index + <span class="number">1</span>: -<span class="number">1</span>]</span><br><span class="line"> sequences[header] = sequence</span><br><span class="line"> <span class="keyword">return</span> sequences</span><br><span class="line"></span><br><span class="line"><span class="comment"># selenium程序</span></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">selenium_search</span>(<span class="params">args</span>):</span><br><span class="line"> option = 
ChromeOptions()</span><br><span class="line"> option.add_experimental_option(<span class="string">'excludeSwitches'</span>, [<span class="string">'enable-automation'</span>]) <span class="comment"># 防检测</span></span><br><span class="line"> bro = Service(executable_path = <span class="string">'./chromedriver.exe'</span>)</span><br><span class="line"> bro = webdriver.Chrome(service = bro, options = option)</span><br><span class="line"> bro.get(<span class="string">'https://web.expasy.org/protparam/'</span>)</span><br><span class="line"> search_input = bro.find_element(By.NAME, <span class="string">'sequence'</span>)</span><br><span class="line"></span><br><span class="line"> <span class="keyword">for</span> key,value <span class="keyword">in</span> tqdm(args.items()):</span><br><span class="line"> search_input.clear()</span><br><span class="line"> search_input.send_keys(value) <span class="comment"># 输入框</span></span><br><span class="line"> btn = bro.find_element(By.XPATH, <span class="string">'//*[@type="submit"]'</span>)</span><br><span class="line"> btn.click()</span><br><span class="line"> wait = WebDriverWait(bro, <span class="number">2</span>) <span class="comment"># 两秒缓冲</span></span><br><span class="line"></span><br><span class="line"> <span class="comment"># xpath + lxml定位标签</span></span><br><span class="line"> page_text = bro.page_source</span><br><span class="line"> tree = etree.HTML(page_text)</span><br><span class="line"></span><br><span class="line"> Number_of_AA = <span class="string">""</span>.join(tree.xpath(<span class="string">'//*[@id="sib_body"]/pre[2]/b[1]/following-sibling::text()[1]'</span>)).strip()</span><br><span class="line"> Molecular_weight = <span class="string">""</span>.join(tree.xpath(<span class="string">'//*[@id="sib_body"]/pre[2]/b[2]/following-sibling::text()[1]'</span>)).strip()</span><br><span class="line"> Theoretical_pI = <span class="string">""</span>.join(tree.xpath(<span 
class="string">'//*[@id="sib_body"]/pre[2]/b[3]/following-sibling::text()[1]'</span>)).strip()</span><br><span class="line"> Instability_index = tree.xpath(<span class="string">"//*[contains(text(),'Instability index:')]/following-sibling::text()[1]"</span>)</span><br><span class="line"> Instability_index = <span class="string">""</span>.join(Instability_index).strip().split(<span class="string">' '</span>)[<span class="number">8</span>].split(<span class="string">'\n'</span>)[<span class="number">0</span>]</span><br><span class="line"> Aliphatic_index = <span class="string">""</span>.join(tree.xpath(<span class="string">"//*[contains(text(),'Aliphatic index:')]/following-sibling::text()[1]"</span>)).strip()</span><br><span class="line"> gravy = <span class="string">""</span>.join(tree.xpath(<span class="string">"//*[contains(text(),'Grand average of hydropathicity (GRAVY):')]/following-sibling::text()[1]"</span>)).strip()</span><br><span class="line"> </span><br><span class="line"> <span class="comment"># 保存结果</span></span><br><span class="line"> <span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'ExPASY.tab'</span>, <span class="string">'a'</span>, encoding = <span class="string">'utf-8'</span>) <span class="keyword">as</span> file:</span><br><span class="line"> file.write(<span class="string">f'<span class="subst">{key}</span>\t<span class="subst">{Number_of_AA}</span>\t<span class="subst">{Molecular_weight}</span>\t<span class="subst">{Theoretical_pI}</span>\t<span class="subst">{Instability_index}</span>\t<span class="subst">{Aliphatic_index}</span>\t<span class="subst">{gravy}</span>\n'</span>)</span><br><span class="line"> </span><br><span class="line"> bro.back() <span class="comment"># 浏览器回退</span></span><br><span class="line"> bro.quit()</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">protein_file = <span class="string">"path/to/your/protein/file"</span><span 
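class="comment">## change this to the path of your protein file</span></span><br><span class="line"></span><br><span class="line">protein_dictionary = protein_sequence(protein_file)</span><br><span class="line">selenium_search(protein_dictionary)</span><br></pre></td></tr></table></figure><p>The script launches chrome automatically, so you can have a cup of tea while waiting for the results.</p><p>The output file <code>ExPASY.tab</code> has 7 tab-separated columns: <strong>protein name, sequence length, molecular weight, isoelectric point, instability index, aliphatic index, and grand average of hydropathicity (GRAVY)</strong>. My 51 sequences took under 2 minutes:</p><p><img src="https://www.shelven.com/tuchuang/20231110/1.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231110/1.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>This page stuck in my mind because none of the values I wanted on the result page are <strong>wrapped in a tag of their own</strong>, so the only option is to locate the text between two tags with <code>following-sibling::text()</code>. Oddly, the path-based xpath that works for the first three values does not work for the last three: the <code>b[11], b[12] and b[13]</code> tags cannot be reached by path through lxml, yet Selenium's own <code>find_element(By.XPATH)</code> finds them. Perhaps it is because the page is loaded dynamically?</p><p><del>If you worry about being IP-banned for hitting the site too fast, sleep for a few seconds between requests</del></p></div><h2 id="2-Plant-mPLoc"><a href="#2-Plant-mPLoc" class="headerlink" title="2. Plant-mPLoc"></a>2. 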
Plant-mPLoc</h2><div class="story post-story"><p>This site predicts the subcellular localization of proteins and, like the one above, accepts only one sequence per submission.</p><p><img src="https://www.shelven.com/tuchuang/20231110/3.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231110/3.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>The official page is here: <a href="http://www.csbio.sjtu.edu.cn/bioinf/plant-multi/">Plant-mPLoc server (sjtu.edu.cn)</a></p><p>The result page of this site is very simple and the tag we need is easy to locate, so the code above only needs a few small changes:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span 
class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Plant-mPLoc预测专用</span></span><br><span class="line"><span class="keyword">from</span> selenium <span class="keyword">import</span> webdriver</span><br><span class="line"><span class="keyword">from</span> selenium.webdriver.common.by <span class="keyword">import</span> By</span><br><span class="line"><span class="keyword">from</span> selenium.webdriver.support.wait <span class="keyword">import</span> WebDriverWait</span><br><span class="line"><span class="keyword">from</span> selenium.webdriver.chrome.service <span class="keyword">import</span> Service</span><br><span class="line"><span class="keyword">from</span> selenium.webdriver <span class="keyword">import</span> ChromeOptions</span><br><span class="line"><span class="keyword">from</span> lxml <span class="keyword">import</span> etree</span><br><span class="line"><span class="keyword">from</span> tqdm <span class="keyword">import</span> tqdm</span><br><span class="line"></span><br><span class="line"><span class="comment"># 读取蛋白和存储序列,返回一个字典</span></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">protein_sequence</span>(<span class="params">file_path</span>):</span><br><span class="line"> sequences = {}</span><br><span class="line"> <span class="keyword">with</span> <span class="built_in">open</span>(file_path, <span class="string">'r'</span>) <span class="keyword">as</span> file:</span><br><span class="line"> content = file.read()</span><br><span class="line"> blocks = content.split(<span class="string">'>'</span>) </span><br><span class="line"> <span class="keyword">for</span> block <span class="keyword">in</span> 
blocks[<span class="number">1</span>:]:</span><br><span class="line"> index = block.find(<span class="string">'\n'</span>)</span><br><span class="line"> header = block[:index]</span><br><span class="line"> sequence = block[index + <span class="number">1</span>: -<span class="number">1</span>]</span><br><span class="line"> sequences[header] = sequence</span><br><span class="line"> <span class="keyword">return</span> sequences</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">search_selenium</span>(<span class="params">args</span>):</span><br><span class="line"> option = ChromeOptions()</span><br><span class="line"> option.add_experimental_option(<span class="string">'excludeSwitches'</span>, [<span class="string">'enable-automation'</span>]) <span class="comment"># 防检测</span></span><br><span class="line"> bro = Service(executable_path = <span class="string">'./chromedriver.exe'</span>)</span><br><span class="line"> bro = webdriver.Chrome(service = bro, options = option)</span><br><span class="line"> bro.get(<span class="string">'http://www.csbio.sjtu.edu.cn/bioinf/plant-multi/#'</span>)</span><br><span class="line"> search_input = bro.find_element(By.NAME, <span class="string">'S1'</span>)</span><br><span class="line"></span><br><span class="line"> <span class="keyword">for</span> key,value <span class="keyword">in</span> tqdm(args.items()):</span><br><span class="line"> search_input.clear()</span><br><span class="line"> search_input.send_keys(<span class="string">f"><span class="subst">{key}</span>\n<span class="subst">{value}</span>"</span>) <span class="comment"># 输入框</span></span><br><span class="line"> btn = bro.find_element(By.XPATH, <span class="string">'//*[@type="submit"]'</span>)</span><br><span class="line"> btn.click()</span><br><span class="line"> wait = WebDriverWait(bro, <span class="number">2</span>) <span class="comment"># 两秒缓冲</span></span><br><span 
class="line"></span><br><span class="line">        <span class="comment"># locate the tag with xpath + lxml</span></span><br><span class="line">        page_text = bro.page_source</span><br><span class="line">        tree = etree.HTML(page_text)</span><br><span class="line">        location = tree.xpath(<span class="string">'/html/body/div/table/tbody/tr[8]/td/table/tbody/tr[2]/td[2]/strong/font/text()'</span>)</span><br><span class="line">        location = location[<span class="number">0</span>].strip().replace(<span class="string">'.'</span>, <span class="string">''</span>)</span><br><span class="line"></span><br><span class="line">        <span class="comment"># write the result</span></span><br><span class="line">        <span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'Plant-mPLoc.tab'</span>, <span class="string">'a'</span>) <span class="keyword">as</span> file:</span><br><span class="line">            file.write(<span class="string">f'<span class="subst">{key}</span>\t<span class="subst">{location}</span>\n'</span>)</span><br><span class="line"></span><br><span class="line">        bro.back() <span class="comment"># go back in the browser</span></span><br><span class="line">    bro.quit()</span><br><span class="line"></span><br><span class="line">protein_file = <span class="string">"path/to/your/protein/file"</span><span class="comment">## change this to the path of your protein file</span></span><br><span class="line"></span><br><span class="line">protein_dictionary = protein_sequence(protein_file)</span><br><span class="line">search_selenium(protein_dictionary)</span><br></pre></td></tr></table></figure><p>The output file has just two columns: the protein name and the predicted subcellular location.</p><p>If you get an <code>http connection</code> error, turn off the proxy on your computer (the same applies to the script above).</p></div><h2 id="3-NetPhos-3-1"><a href="#3-NetPhos-3-1" class="headerlink" title="3. NetPhos 3.1"></a>3. 
NetPhos 3.1</h2><div class="story post-story"><p>This site predicts protein phosphorylation sites. Unlike the two above, it accepts multiple sequences, so you can prepare a fasta file and upload it directly.</p><p><img src="https://www.shelven.com/tuchuang/20231110/4.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231110/4.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>It is available at: <a href="https://services.healthtech.dtu.dk/services/NetPhos-3.1/">NetPhos 3.1 - DTU Health Tech - Bioinformatic Services</a></p><p><strong>It does accept multiple sequences, but all the results are dumped onto a single result page and there is no way to download them as a bundle...</strong></p><p>The images of the predicted phosphorylation sites can be downloaded in bulk like this:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Bulk-download the NetPhos prediction images</span></span><br><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"></span><br><span class="line"><span class="comment"># Read the proteins and store the sequences; returns a dictionary</span></span><br><span class="line"><span class="keyword">def</span> <span class="title 
function_">protein_sequence</span>(<span class="params">file_path</span>):</span><br><span class="line"> sequences = {}</span><br><span class="line"> <span class="keyword">with</span> <span class="built_in">open</span>(file_path, <span class="string">'r'</span>) <span class="keyword">as</span> file:</span><br><span class="line"> content = file.read()</span><br><span class="line"> blocks = content.split(<span class="string">'>'</span>) </span><br><span class="line"> <span class="keyword">for</span> block <span class="keyword">in</span> blocks[<span class="number">1</span>:]:</span><br><span class="line"> index = block.find(<span class="string">'\n'</span>)</span><br><span class="line"> header = block[:index]</span><br><span class="line"> sequence = block[index + <span class="number">1</span>: -<span class="number">1</span>]</span><br><span class="line"> sequences[header] = sequence</span><br><span class="line"> <span class="keyword">return</span> sequences</span><br><span class="line"></span><br><span class="line">protein_file = <span class="string">"your/path/to/your/protein/file"</span><span class="comment">## 改成你的蛋白文件路径</span></span><br><span class="line"></span><br><span class="line">protein_dictionary = protein_sequence(protein_file)</span><br><span class="line"><span class="keyword">for</span> key <span class="keyword">in</span> protein_dictionary:</span><br><span class="line"> url = <span class="string">f"https://services.healthtech.dtu.dk/services/NetPhos-3.1/tmp/your_number/netphos-3.1b.<span class="subst">{key}</span>.gif"</span><span class="comment">## 需要修改成自己的</span></span><br><span class="line"> file_name = <span class="string">f'<span class="subst">{key}</span>.gif'</span></span><br><span class="line"></span><br><span class="line"> response = requests.get(url)</span><br><span class="line"> <span class="keyword">if</span> response.status_code == <span class="number">200</span>:</span><br><span class="line"> <span class="keyword">with</span> <span 
class="built_in">open</span>(file_name, <span class="string">"wb"</span>) <span class="keyword">as</span> file:</span><br><span class="line">            file.write(response.content)</span><br><span class="line">        <span class="built_in">print</span>(<span class="string">f'Saved image <span class="subst">{file_name}</span>'</span>)</span><br><span class="line">    <span class="keyword">else</span>:</span><br><span class="line">        <span class="built_in">print</span>(<span class="string">f'Failed to download image:\n<span class="subst">{url}</span>'</span>)</span><br></pre></td></tr></table></figure><p>I did not bother wrapping this in another function. You only need to change <strong>two things</strong>: the path to the protein file and the image URL.</p><p>How do you find the image URL? On the results page, press <code>F12</code> to open the developer tools, click the element inspector (red box 1 below), and then click any image on the results page. The panel on the right jumps to that image's img tag. Look at its source address and a pattern emerges:</p><p><code>https://services.healthtech.dtu.dk/services/NetPhos-3.1/tmp/your-temporary-directory-number/netphos-3.1b.{key}.gif</code></p><p>Here key is the protein name extracted by the <code>protein_sequence</code> function above, so leave it as it is.</p><p><img src="https://www.shelven.com/tuchuang/20231110/5.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231110/5.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Images alone are not enough, of course, since there is also a pile of result data. The phosphorylation-site information can be counted from the following region:</p><p><img src="https://www.shelven.com/tuchuang/20231110/6.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231110/6.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>The information could be extracted with XPath and regular expressions, but writing the regexes is a bit of a chore. Instead, select the entire page content, save it locally as <code>result.txt</code>, and count the phosphorylation sites with the following code:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># Count the phosphorylation sites predicted by NetPhos 3.1</span></span><br><span class="line"></span><br><span class="line">protein_dictionary = {}</span><br><span class="line">gene = <span class="literal">None</span>  <span class="comment"># no protein header seen yet</span></span><br><span class="line">Serine = <span class="number">0</span></span><br><span class="line">Threonine = <span class="number">0</span></span><br><span class="line">Tyrosine = <span class="number">0</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'result.txt'</span>, <span class="string">'r'</span>) <span class="keyword">as</span> f:</span><br><span class="line">    <span class="keyword">for</span> line <span class="keyword">in</span> f:</span><br><span class="line">        <span class="keyword">if</span> line.strip():</span><br><span class="line">            <span class="keyword">if</span> line.startswith(<span class="string">">"</span>):</span><br><span class="line">                gene = line.strip().split(<span class="string">'\t'</span>)[<span class="number">0</span>]</span><br><span class="line">            <span class="keyword">if</span> line.startswith(<span class="string">"%1"</span>):</span><br><span class="line">                Serine += line.count(<span class="string">"S"</span>)</span><br><span class="line">                Threonine += line.count(<span class="string">"T"</span>)</span><br><span class="line">                Tyrosine += line.count(<span class="string">"Y"</span>)</span><br><span class="line">        <span class="comment"># A blank line is taken to end the current protein's block (assumed layout of the saved page)</span></span><br><span class="line">        <span class="keyword">elif</span> gene <span class="keyword">is</span> <span class="keyword">not</span> <span class="literal">None</span> <span class="keyword">and</span> gene <span class="keyword">not</span> <span class="keyword">in</span> protein_dictionary:</span><br><span class="line">            protein_dictionary[gene] = [Serine, Threonine, Tyrosine]</span><br><span class="line">            Serine = <span class="number">0</span></span><br><span class="line">            Threonine = <span class="number">0</span></span><br><span class="line">            Tyrosine = <span class="number">0</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Store the last protein if the file does not end with a blank line</span></span><br><span class="line"><span class="keyword">if</span> gene <span class="keyword">is</span> <span class="keyword">not</span> <span class="literal">None</span> <span class="keyword">and</span> gene <span class="keyword">not</span> <span class="keyword">in</span> protein_dictionary:</span><br><span class="line">    protein_dictionary[gene] = [Serine, Threonine, Tyrosine]</span><br><span class="line"></span><br><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">"classification.txt"</span>, <span class="string">'a'</span>) <span class="keyword">as</span> out:</span><br><span class="line">    <span class="keyword">for</span> key, value <span class="keyword">in</span> protein_dictionary.items():</span><br><span class="line">        total = <span class="built_in">sum</span>(value)</span><br><span class="line">        out.write(key[<span class="number">1</span>:] + <span class="string">'\t'</span> + <span class="string">'\t'</span>.join(<span class="built_in">map</span>(<span class="built_in">str</span>, value)) + <span class="string">'\t'</span> + <span class="built_in">str</span>(total) + <span class="string">'\n'</span>)</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>This is again a simple tally of the phosphorylation information for each protein. The output file <code>classification.txt</code> has 5 columns: <strong>protein name, number of serine phosphorylation sites, number of threonine phosphorylation sites, number of tyrosine phosphorylation sites, and total number of phosphorylation sites</strong>. It looks like this:</p><p><img src="https://www.shelven.com/tuchuang/20231110/7.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231110/7.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>You can add a header row at the top yourself.</p></div><h2 id="4-Phyre2"><a href="#4-Phyre2" class="headerlink" title="4. Phyre2"></a>4. 
Phyre2</h2><div class="story post-story"><p>This site predicts protein 3D structures. <strong>For registered users, the server offers a batch prediction mode (registration only requires an email address)</strong>.</p><p>The official site: <a href="http://www.sbg.bio.ic.ac.uk/phyre2/html/page.cgi?id=index">PHYRE2 Protein Fold Recognition Server (ic.ac.uk)</a></p><p><img src="https://www.shelven.com/tuchuang/20231110/8.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231110/8.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>A reminder: <strong>you must be registered to run batch predictions</strong>! After registering, click <code>Expert Mode</code> in the upper left and choose <code>Batch processing</code>, as shown:</p><p><img src="https://www.shelven.com/tuchuang/20231110/9.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231110/9.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Otherwise you may wait half a day only to find that a single sequence was predicted <del>(which is exactly what happened to me...)</del>.</p><p>For a single sequence, the result files are emailed to you as an attachment. For <code>batch processing</code>, the email tells you to check the results on their website.</p><p><img src="https://www.shelven.com/tuchuang/20231110/10.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231110/10.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Download the archive shown above and unpack it (it contains the best predicted 3D model for each sequence). The files inside are all named by hash, but that is not a problem: a <code>summaryinfo</code> file records which hash belongs to which protein name. Parse it first and rename the <code>pdb</code> files:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># 
Rename the pdb files</span></span><br><span class="line"><span class="keyword">import</span> os</span><br><span class="line"></span><br><span class="line"><span class="comment">## Extract the protein-name-to-hash mapping</span></span><br><span class="line">filename = {}</span><br><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'1293015811e6f396/summaryinfo'</span>, <span class="string">'r'</span>) <span class="keyword">as</span> f:</span><br><span class="line">    <span class="keyword">for</span> line <span class="keyword">in</span> f:</span><br><span class="line">        <span class="keyword">if</span> line.startswith(<span class="string">"A"</span>):  <span class="comment"># adjust the prefix to match your own protein names</span></span><br><span class="line">            gene = line.split(<span class="string">' '</span>)[<span class="number">0</span>]</span><br><span class="line">            file = line.split(<span class="string">' '</span>)[<span class="number">2</span>]</span><br><span class="line">            filename[gene] = file</span><br><span class="line"></span><br><span class="line"><span class="comment"># Rename the files</span></span><br><span class="line"><span class="keyword">for</span> key, value <span class="keyword">in</span> filename.items():</span><br><span class="line">    old_filename = <span class="string">f'1293015811e6f396/<span class="subst">{value}</span>.final.pdb'</span></span><br><span class="line">    new_filename = <span class="string">f'1293015811e6f396/<span class="subst">{key}</span>.pdb'</span></span><br><span class="line">    os.rename(old_filename, new_filename)</span><br></pre></td></tr></table></figure><p>Substitute your own directory names. A pdb file records a protein's three-dimensional (tertiary) structure, and we will use these files again shortly.</p><p>On the results page above, click <code>Finished</code> in the <code>Status</code> column to see the detailed prediction for each sequence. The site does provide an image of the predicted 3D model, but it is low resolution and has a dark background, so we will tidy it up with <code>PyMOL</code> in a moment. At the very bottom of that page you can also see the transmembrane helix prediction:</p><p><img src="https://www.shelven.com/tuchuang/20231110/11.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231110/11.jpg" 
srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>This image is usable too. As with the earlier image download, press F12 to open the developer tools and inspect the image element. Its URL also follows a pattern: the hash in the middle is the one recorded in the <code>summaryinfo</code> file, so the script below can download the images in bulk:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># Batch-download the Phyre2 transmembrane helix prediction images</span></span><br><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"></span><br><span class="line"><span class="comment"># Extract the hash used in each URL</span></span><br><span class="line">filename = {}</span><br><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'1293015811e6f396/summaryinfo'</span>, <span class="string">'r'</span>) <span class="keyword">as</span> f:</span><br><span class="line">    <span class="keyword">for</span> line <span class="keyword">in</span> f:</span><br><span class="line">        <span class="keyword">if</span> line.startswith(<span class="string">"A"</span>):</span><br><span class="line">            gene = line.split(<span class="string">' '</span>)[<span class="number">0</span>]</span><br><span class="line">            file = line.split(<span class="string">' '</span>)[<span 
class="number">2</span>]</span><br><span class="line">            filename[gene] = file</span><br><span class="line"></span><br><span class="line"><span class="comment"># Download the images</span></span><br><span class="line"><span class="keyword">for</span> key, value <span class="keyword">in</span> filename.items():</span><br><span class="line">    image_url = <span class="string">f'http://www.sbg.bio.ic.ac.uk/phyre2/phyre2_output/<span class="subst">{value}</span>/query_cartoon_memsat_svm.png'</span></span><br><span class="line">    file_name = <span class="string">f'<span class="subst">{key}</span>.png'</span></span><br><span class="line"></span><br><span class="line">    response = requests.get(image_url)</span><br><span class="line">    <span class="keyword">if</span> response.status_code == <span class="number">200</span>:</span><br><span class="line">        <span class="keyword">with</span> <span class="built_in">open</span>(file_name, <span class="string">"wb"</span>) <span class="keyword">as</span> file:</span><br><span class="line">            file.write(response.content)</span><br><span class="line">        <span class="built_in">print</span>(<span class="string">f'Saved image <span class="subst">{file_name}</span>'</span>)</span><br><span class="line">    <span class="keyword">else</span>:</span><br><span class="line">        <span class="built_in">print</span>(<span class="string">f'Failed to download image:\n<span class="subst">{image_url}</span>'</span>)</span><br></pre></td></tr></table></figure></div><h2 id="5-PyMOL"><a href="#5-PyMOL" class="headerlink" title="5. PyMOL"></a>5. 
PyMOL</h2><div class="story post-story"><p>PyMOL is a molecular 3D structure viewer that started out as open source and was later commercialized. The good news is that students can apply for an educational license, again with just an email address. I will skip the details of that process. You end up with a license file, and the software asks for it the first time it starts.</p><p>This is not an online tool, but I decided to include it anyway, mainly for batch-adjusting the 3D models predicted by <code>Phyre2</code>.</p><p>I will not cover basic usage, since there are plenty of tutorials online. We simply continue from the previous step. A big advantage of this tool is that it can be driven from the command line and scripted in Python <del>(it is built around Python, after all)</del>. Here is a batch-processing script:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># Adjust the paths as needed; meant to be run inside PyMOL</span></span><br><span class="line"><span class="keyword">from</span> pymol <span class="keyword">import</span> cmd</span><br><span class="line"></span><br><span class="line"><span class="comment"># Collect the protein names</span></span><br><span class="line">filename = []</span><br><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'1293015811e6f396/summaryinfo'</span>, <span class="string">'r'</span>) <span class="keyword">as</span> f:</span><br><span class="line">    <span class="keyword">for</span> line <span class="keyword">in</span> f:</span><br><span class="line">        <span class="keyword">if</span> line.startswith(<span class="string">"A"</span>):</span><br><span class="line">            file = line.split(<span class="string">' '</span>)[<span class="number">0</span>]</span><br><span class="line">            
filename.append(file)</span><br><span class="line"> </span><br><span class="line"><span class="comment"># Process and save each model</span></span><br><span class="line"><span class="keyword">for</span> i <span class="keyword">in</span> filename:</span><br><span class="line">    file_path = <span class="string">f'1293015811e6f396/<span class="subst">{i}</span>.pdb'</span></span><br><span class="line">    cmd.load(file_path)</span><br><span class="line">    cmd.bg_color(<span class="string">'white'</span>) <span class="comment"># white background</span></span><br><span class="line">    cmd.spectrum(expression=<span class="string">"count"</span>, palette=<span class="string">"rainbow"</span>) <span class="comment"># color the model</span></span><br><span class="line">    cmd.ray(<span class="number">2000</span>,<span class="number">2000</span>) <span class="comment"># ray-trace at 2000 x 2000 pixels</span></span><br><span class="line">    new_path = <span class="string">f'3D_models/Models/<span class="subst">{i}</span>.png'</span></span><br><span class="line">    cmd.save(new_path)</span><br><span class="line">    cmd.delete(<span class="string">'all'</span>) <span class="comment"># clear the scene, otherwise the models pile up</span></span><br></pre></td></tr></table></figure><p>Open PyMOL, click <code>File</code>, then <code>Run Script</code>, and run the Python script above:</p><p><img src="https://www.shelven.com/tuchuang/20231110/12.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231110/12.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Then just sit back and wait for the images.</p><p>Here I only changed the dark background to white. For higher rendering quality, increase the arguments passed to <code>cmd.ray()</code>, and apply any other customization you like. The PyMOL command reference lists what is available: <a href="https://pymol.org/pymol-command-ref.html#set_color">PyMOL Command Reference</a></p><p><img src="https://www.shelven.com/tuchuang/20231110/13.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231110/13.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p></div><h2 id="2024-x2F-4-x2F-20-更新"><a href="#2024-x2F-4-x2F-20-更新" 
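class="headerlink" title="2024/4/20 Update"></a>2024/4/20 Update</h2><div class="story post-story"><p>If the browser window gets in the way while Selenium is running, you can run it in the background in headless mode. Pay attention to <strong>how the script interacts with the page</strong>: if the page is refreshed or navigated frequently, a headless run may raise <code>StaleElementReferenceException</code>.</p><p>This exception is raised when the element you are trying to interact with has changed since it was located (for example, the page was refreshed or re-rendered), so the reference Selenium holds no longer points to an element in the DOM. The remedy is to wait until the element is stable and locate it again right before interacting with it, using a <strong>WebDriverWait</strong> explicit wait to make sure it is loaded and ready.</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">from</span> selenium.webdriver.common.by <span class="keyword">import</span> By</span><br><span class="line"><span class="keyword">from</span> selenium.webdriver.support.ui <span class="keyword">import</span> WebDriverWait</span><br><span class="line"><span class="keyword">from</span> selenium.webdriver.support <span class="keyword">import</span> expected_conditions <span class="keyword">as</span> EC</span><br><span class="line"></span><br><span class="line"><span class="comment"># bro is the WebDriver instance created earlier in the script</span></span><br><span class="line">search_input = WebDriverWait(bro, <span class="number">5</span>).until(EC.presence_of_element_located((By.NAME, <span class="string">'S1'</span>)))</span><br><span class="line"></span><br><span class="line">btn = WebDriverWait(bro, <span class="number">10</span>).until(EC.element_to_be_clickable((By.XPATH, <span class="string">'//*[@type="submit"]'</span>)))</span><br></pre></td></tr></table></figure><ol><li><code>WebDriverWait(bro, 5)</code>: creates a WebDriverWait object from the WebDriver instance <code>bro</code> and a maximum wait of 5 seconds. The program waits up to 5 seconds, until the condition is met or the wait times out.</li><li><code>until(EC.presence_of_element_located((By.NAME, 'S1')))</code>: <code>until</code> is a WebDriverWait method that takes a condition and blocks until the condition holds or the timeout expires. <code>EC.presence_of_element_located</code> waits until an element with the given name ('S1') is present in the DOM. <code>EC.element_to_be_clickable</code> waits until the element matching the given XPath can be clicked.</li></ol><p>Locating the element through an explicit wait immediately before using it means the script works with a fresh reference to an element that is present in the DOM (and, for the button, clickable), which in practice avoids most <code>StaleElementReferenceException</code> errors. Note that <code>presence_of_element_located</code> only checks the DOM and does not guarantee that the element is visible.</p></div>]]></content>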
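<summary type="html"><p>I have recently been working on a gene family analysis for a plant species. I spent a week producing every figure I could, and I will write up the whole workflow when I have time.</p>
<p>First, a problem I ran into during protein property and sequence analysis. I have collected quite a few online tools and databases for protein analysis, which you can <a href="https://www.shelven.com/Bioinformatics/">browse here (shelven.com)</a>, and I update that list once a year. Some online analysis sites accept <strong>only one sequence</strong> at a time. When you have many sequences, pasting them in one by one, clicking submit, and then copying the data you need from the next page is truly tedious.</p></summary>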
<category term="学习笔记" scheme="http://www.shelven.com/categories/%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/"/>
<category term="基因家族分析" scheme="http://www.shelven.com/categories/%E5%9F%BA%E5%9B%A0%E5%AE%B6%E6%97%8F%E5%88%86%E6%9E%90/"/>
<category term="ExPASY" scheme="http://www.shelven.com/tags/ExPASY/"/>
<category term="Plant-mPLoc" scheme="http://www.shelven.com/tags/Plant-mPLoc/"/>
<category term="NetPhos 3.1" scheme="http://www.shelven.com/tags/NetPhos-3-1/"/>
<category term="Phyre2" scheme="http://www.shelven.com/tags/Phyre2/"/>
<category term="PyMOL" scheme="http://www.shelven.com/tags/PyMOL/"/>
</entry>
<entry>
<title>Handling GitHub's 2FA with Python</title>
<link href="http://www.shelven.com/2023/11/01/a.html"/>
<id>http://www.shelven.com/2023/11/01/a.html</id>
<published>2023-11-01T09:18:54.000Z</published>
<updated>2023-11-01T09:40:30.000Z</updated>
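<content type="html"><![CDATA[<p>Today I received an email from GitHub. The gist is that I need to enable 2FA within a month and a half, or I will no longer be able to log in to GitHub.</p><span id="more"></span><p><img src="https://www.shelven.com/tuchuang/20231101/0.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231101/0.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><h2 id="2FA"><a href="#2FA" class="headerlink" title="2FA"></a>2FA</h2><div class="story post-story"><p>2FA stands for two-factor authentication, where a factor is a way of proving identity. The account password we set when logging in to a website or app is a <strong>knowledge factor</strong>. There are also <strong>possession factors</strong> (such as the "Jiangjunling" hardware tokens used in the gaming industry, the USB shields issued for online banking, or verification codes), <strong>biometric factors</strong> (such as fingerprints or facial features), and <strong>location factors</strong> (such as a specific device, place, or IP address). Any scheme that verifies the user with two factors counts as 2FA.</p><p>The email from GitHub lists the following forms of 2FA:</p><ul><li>Security key: a hardware-backed security key, such as a USB security key, or an iPhone, iPad, or Android device (the latter three require scanning a QR code)</li><li>GitHub Mobile: GitHub's mobile app</li><li>Authenticator application (TOTP): an authenticator app, which uses the <strong>time-based one-time password algorithm (TOTP)</strong></li><li>Text messages (SMS): verification by text message, which is not available for phone numbers in mainland China</li></ul></div><h2 id="pyotp库"><a href="#pyotp库" class="headerlink" title="The pyotp library"></a>The pyotp library</h2><div class="story post-story"><p>When I first set up 2FA, the official 2FA page offered only the authenticator app option:</p><p><img src="https://www.shelven.com/tuchuang/20231101/1.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231101/1.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>The officially recommended apps are <a href="https://support.1password.com/one-time-passwords/">1Password</a>, <a href="https://authy.com/guides/github/">Authy</a>, and <a href="https://www.microsoft.com/en-us/security/mobile-authenticator-app">Microsoft Authenticator</a>. Anyone who has used 1Password knows it is very good at managing and generating passwords for all kinds of sites. Its only drawback is that it is paid (students can apply for 6 months free). I am not familiar with the other apps. A similar open-source authenticator is <a href="https://github.com/jamie-mh/AuthenticatorPro">Authenticator Pro</a>. All of these rely on the <code>TOTP</code> algorithm for 2FA, though, so we can get exactly the same functionality from Python's <code>pyotp</code> library.</p><p><code>pyotp</code> does not ship with Python and has to be installed from a terminal:</p><figure class="highlight plaintext"><table><tr><td 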
class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pip install pyotp</span><br></pre></td></tr></table></figure><p>Either scan the QR code or click <code>setup_key</code> in the red box to obtain the secret key GitHub created for us (keep a local copy), then run the following Python code:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> pyotp</span><br><span class="line"></span><br><span class="line">secret_key = <span class="string">"YOUR_SETUP_KEY"</span>  <span class="comment"># replace with the setup key shown by GitHub</span></span><br><span class="line">totp = pyotp.TOTP(secret_key)</span><br><span class="line">val = totp.now()  <span class="comment"># current one-time code</span></span><br><span class="line"><span class="built_in">print</span>(val)</span><br></pre></td></tr></table></figure><p>Type the 6-digit code it prints into the box:</p><p><img src="https://www.shelven.com/tuchuang/20231101/2.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231101/2.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>The second step gives you a set of backup recovery codes. Download them and keep them somewhere safe in case you ever need them.</p><p>Click next and GitHub's 2FA setup is complete. If you take too long and verification fails, go back to step one and start over.</p><p><img src="https://www.shelven.com/tuchuang/20231101/3.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231101/3.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>The TOTP algorithm relies on synchronized clocks. By default a code is valid for only 30 seconds (one time step; unless the server allows a tolerance window, the previously generated code stops working). The client and the server share an initial secret (the setup_key obtained in step one), and authentication succeeds only when both sides compute the same value. I do not know whether GitHub will ever rotate the initial secret when 2FA is used later, so I am keeping it for now just in case.</p></div>]]></content>
<summary type="html"><p>Today I received an email from GitHub. The gist is that I need to enable 2FA within a month and a half, or I will no longer be able to log in to GitHub.</p></summary>
<category term="github" scheme="http://www.shelven.com/categories/github/"/>
<category term="python" scheme="http://www.shelven.com/tags/python/"/>
</entry>
<entry>
<title>Python Self-Study Notes (9): The NumPy Library</title>
<link href="http://www.shelven.com/2023/10/26/a.html"/>
<id>http://www.shelven.com/2023/10/26/a.html</id>
<published>2023-10-26T08:50:02.000Z</published>
<updated>2023-10-26T08:51:58.000Z</updated>
<content type="html"><![CDATA[<p>NumPy (Numerical Python) is an extension library for Python. It provides a powerful multidimensional array object (<code>ndarray</code>) together with functions and tools for operating on arrays. NumPy is the foundation of many other scientific computing and data analysis libraries, such as SciPy (Scientific Python), Pandas, and Matplotlib (a plotting library).</p><span id="more"></span><blockquote><p>SciPy: an open-source Python library of algorithms and mathematical tools, with modules for optimization, linear algebra, integration, interpolation, fast Fourier transforms, signal processing, image processing, and more</p><p>Pandas: another data processing and analysis tool, whose core data structures are two kinds of objects: Series and DataFrame</p><p>Matplotlib: a visualization layer for NumPy, which provides an API for embedding plots in applications through general-purpose GUI toolkits (such as Tkinter)</p></blockquote><p>The official NumPy manual:</p><p><a href="https://numpy.org/doc/stable/user/">NumPy user guide — NumPy v1.26 Manual</a></p><h2 id="1-数据类型"><a href="#1-数据类型" class="headerlink" title="1. Data types"></a>1. Data types</h2><div class="story post-story"><p>The data types NumPy supports map onto the data types of C. Compared with Python's six built-in data types, NumPy's types are much more fine-grained. The commonly used ones are:</p><table><thead><tr><th>Name</th><th>Description</th></tr></thead><tbody><tr><td>bool_</td><td>Boolean type (True or False)</td></tr><tr><td>int_</td><td>Default integer type (C long; int32 or int64)</td></tr><tr><td>intc</td><td>C int type; int32 or int64</td></tr><tr><td>intp</td><td>Integer type used for indexing</td></tr><tr><td>int8</td><td>Integer, -128 to 127</td></tr><tr><td>int16</td><td>Integer, -32768 to 32767</td></tr><tr><td>int32</td><td>Integer, -2147483648 to 2147483647</td></tr><tr><td>int64</td><td>Integer, -9223372036854775808 to 9223372036854775807</td></tr><tr><td>uint8</td><td>Unsigned integer, 0 to 255</td></tr><tr><td>uint16</td><td>Unsigned integer, 0 to 65535</td></tr><tr><td>uint32</td><td>Unsigned integer, 0 to 4294967295</td></tr><tr><td>uint64</td><td>Unsigned integer, 0 to 18446744073709551615</td></tr><tr><td>float_</td><td>The float64 type</td></tr><tr><td><a href="https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.complex_">complex_</a></td><td>The complex128 type, a 128-bit complex number</td></tr><tr><td><a href="https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.bytes_">bytes_</a></td><td>Byte-sequence type; it can hold arbitrary byte values and is typically used for raw binary data</td></tr><tr><td><a href="https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.str_">str_</a></td><td>String type for Unicode text</td></tr></tbody></table><p>In NumPy, a data type is represented by a <code>dtype</code> object. The <code>dtype</code> object describes how the data is stored in memory, including its kind (integer, floating point, and so on) and its size in bytes. Each data type also has a character code that identifies it.</p><p><code>dtype</code> is itself a class, with attributes and methods for describing and manipulating data types. For example, the <code>dtype</code> for the <code>int32</code> data type can be specified as <code>np.int32</code> or created with <code>np.dtype('int32')</code>. Its character code is typically <code>'i'</code>, which denotes an integer type.</p></div><h2 id="2-创建数组"><a href="#2-创建数组" class="headerlink" title="2. Creating arrays"></a>2. Creating arrays</h2><div class="story post-story"><p>NumPy's defining feature (its core, really) is the N-dimensional array object <code>ndarray</code> (N-dimensional array), which has the following characteristics:</p><blockquote><ol><li>Multidimensional: an ndarray is a multidimensional array object</li><li>Data type: the elements of an ndarray <strong>all share the same data type</strong>, usually a numeric type such as integer (int), floating point (float), or complex</li><li>Shape: the shape of an ndarray describes its dimensions as a tuple. A one-dimensional array of length n has shape (n,); a two-dimensional array has shape (rows, columns), giving its number of rows and columns</li><li>Size: the size of an ndarray equals the product of the entries in its shape and is available through the <code>size</code> attribute</li><li>Memory layout: an ndarray stores its data contiguously in memory, which makes accessing and operating on the array more efficient</li><li>Indexing and slicing: elements of an ndarray can be accessed by indexing and slicing. Indexing a one-dimensional array works like indexing a Python list (starting at 0), while <strong>multidimensional arrays can also be accessed with integer array indexing and similar techniques</strong> to reach specific elements or slices</li><li>Broadcasting: <code>ndarray</code> supports broadcasting, so arrays of different shapes can be combined in arithmetic; NumPy adjusts the shapes automatically so that the operation can proceed</li></ol></blockquote><p>An <code>ndarray</code> is similar to a Python <code>list</code>, except that an <code>ndarray</code> can only <strong>store data of a single type</strong>.</p><p>An array can be created directly with <code>np.array</code>:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> numpy <span 
class="keyword">as</span> np</span><br><span class="line">a = np.array([<span class="number">1</span>, <span class="number">2</span>, <span class="number">3</span>])</span><br><span class="line"><span class="built_in">print</span>(a)</span><br><span class="line"><span class="built_in">print</span>(a.dtype)</span><br><span class="line"></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[1 2 3]</span></span><br><span class="line"><span class="string">int32</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 数组的类型可以在创建时显式指定(不指定默认是float64)</span></span><br><span class="line">b = np.array([[<span class="number">1</span>, <span class="number">2</span>, <span class="number">3</span>], [<span class="number">4</span>, <span class="number">5</span>, <span class="number">6</span>]], dtype=<span class="built_in">complex</span>)</span><br><span class="line"><span class="built_in">print</span>(b)</span><br><span class="line"></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[[1.+0.j 2.+0.j 3.+0.j]</span></span><br><span class="line"><span class="string"> [4.+0.j 5.+0.j 6.+0.j]]</span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></table></figure><p>也可以用其他方式创建数组:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span 
class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># .zero创建指定大小数组,元素以0填充</span></span><br><span class="line">np.zeros((<span class="number">2</span>, <span class="number">2</span>, <span class="number">3</span>))</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([[[0., 0., 0.],</span></span><br><span class="line"><span class="string"> [0., 0., 0.]],</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string"> [[0., 0., 0.],</span></span><br><span class="line"><span class="string"> [0., 0., 0.]]])</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># .ones创建指定大小数组,元素以1填充</span></span><br><span class="line">np.ones((<span class="number">2</span>, <span class="number">3</span>, <span class="number">4</span>), dtype=np.int16)</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([[[1, 1, 1, 1],</span></span><br><span class="line"><span class="string"> [1, 1, 1, 1],</span></span><br><span 
class="line"><span class="string"> [1, 1, 1, 1]],</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string"> [[1, 1, 1, 1],</span></span><br><span class="line"><span class="string"> [1, 1, 1, 1],</span></span><br><span class="line"><span class="string"> [1, 1, 1, 1]]], dtype=int16)</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># np.empty creates an uninitialized array of the given shape and dtype (its contents are arbitrary)</span></span><br><span class="line">np.empty((<span class="number">2</span>, <span class="number">3</span>))</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([[6.23042070e-307, 4.67296746e-307, 1.69121096e-306],</span></span><br><span class="line"><span class="string"> [1.89145708e-307, 6.23045466e-307, 2.22526399e-307]])</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># np.arange works like Python's range: it takes a start, a stop (exclusive), and a step (the step may be a float)</span></span><br><span class="line">np.arange(<span class="number">10</span>, <span class="number">30</span>, <span class="number">5</span>)</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([10, 15, 20, 25])</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># np.linspace creates an array of evenly spaced values (here 9 values from 0 to 2, both endpoints included)</span></span><br><span class="line">np.linspace(<span class="number">0</span>, <span class="number">2</span>, <span class="number">9</span>)</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([0. , 0.25, 0.5 , 0.75, 1. , 1.25, 1.5 , 1.75, 2. 
])</span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></table></figure></div><h2 id="3-数组属性"><a href="#3-数组属性" class="headerlink" title="3. 数组属性"></a>3. Array attributes</h2><div class="story post-story"><p>NumPy arrays have the following commonly used attributes:</p><table><thead><tr><th>Attribute</th><th>Description</th></tr></thead><tbody><tr><td><strong>ndarray.ndim</strong></td><td>Number of dimensions (axes) of the array, also called its rank: 1 for a one-dimensional array, 2 for a two-dimensional array</td></tr><tr><td><strong>ndarray.shape</strong></td><td>Dimensions of the array: a tuple of integers giving the size along each dimension, e.g. (n, m) for a two-dimensional array with n rows and m columns</td></tr><tr><td><strong>ndarray.size</strong></td><td>Total number of elements, e.g. n*m for the two-dimensional array above</td></tr><tr><td><strong>ndarray.dtype</strong></td><td>Type of the array's elements</td></tr><tr><td><strong>ndarray.itemsize</strong></td><td>Size of each element in bytes, e.g. the itemsize of an array of float64 elements is 8 (=64/8)</td></tr><tr><td><strong>ndarray.data</strong></td><td>Buffer containing the array's elements</td></tr></tbody></table><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">a = np.array([[<span class="number">1</span>, <span class="number">2</span>], [<span class="number">2</span>, <span class="number">3</span>], [<span class="number">3</span>, <span class="number">4</span>]])</span><br><span class="line"><span class="built_in">print</span>(a.shape)</span><br><span class="line"><span class="built_in">print</span>(a.ndim)</span><br><span class="line"><span class="built_in">print</span>(a.size)</span><br><span class="line"><span class="built_in">print</span>(a.dtype)</span><br><span class="line"></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">(3, 
2)</span></span><br><span class="line"><span class="string">2</span></span><br><span class="line"><span class="string">6</span></span><br><span class="line"><span class="string">int32</span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></table></figure></div><h2 id="4-索引、切片和迭代"><a href="#4-索引、切片和迭代" class="headerlink" title="4. 索引、切片和迭代"></a>4. Indexing, slicing, and iterating</h2><div class="story post-story"><p>For one-dimensional data, Python's indexing and slicing syntax works unchanged in NumPy:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line">a = np.arange(<span class="number">10</span>)</span><br><span class="line"><span class="built_in">print</span>(a)</span><br><span class="line"><span class="built_in">print</span>(a[<span class="number">1</span>:<span class="number">5</span>:<span class="number">2</span>])</span><br><span class="line"></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[0 1 2 3 4 5 6 7 8 9]</span></span><br><span class="line"><span class="string">[1 3]</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Boolean indexing: put a condition inside the index, and only the elements where it is True are returned</span></span><br><span class="line"><span class="built_in">print</span>(a[a><span class="number">3</span>])</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[4 5 6 7 8 9]</span></span><br><span class="line"><span 
class="string">'''</span></span><br></pre></td></tr></table></figure><p>For multidimensional arrays, NumPy also offers <strong>integer array indexing</strong>, also known as <strong>fancy indexing</strong>, which uses arrays of integers as indices to select elements:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">a = np.array([[<span class="number">1</span>, <span class="number">2</span>, <span class="number">3</span>], [<span class="number">5</span>, <span class="number">8</span>, <span class="number">9</span>], [<span class="number">10</span>, <span class="number">12</span>, <span class="number">15</span>]])</span><br><span class="line"><span class="built_in">print</span>(a)</span><br><span class="line">b = a[[<span class="number">0</span>,<span class="number">1</span>,<span class="number">2</span>],[<span class="number">0</span>,<span class="number">1</span>,<span class="number">0</span>]]</span><br><span class="line"><span class="built_in">print</span>(b)</span><br><span class="line"></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[[ 1 2 3]</span></span><br><span class="line"><span class="string"> [ 5 8 9]</span></span><br><span class="line"><span class="string"> [10 12 15]]</span></span><br><span class="line"><span class="string">[ 1 8 10]</span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></table></figure><p>The two integer arrays <code>[0, 1, 2]</code> and <code>[0, 1, 0]</code> give the row indices and the column indices of the elements to select, paired position by position. This kind of indexing returns a new array (a copy, not a view) made up of the elements at those positions.</p><p>Specifically, this example selects the following elements of <code>a</code>:</p><ul><li>Row index 0, column index 0: <code>a[0, 0]</code>, value 
1</li><li>Row index 1, column index 1: <code>a[1, 1]</code>, value 8</li><li>Row index 2, column index 0: <code>a[2, 0]</code>, value 10</li></ul><p>The selected elements are combined into a new one-dimensional array <code>b</code>, namely <code>[1, 8, 10]</code>.</p><p>You can also index with a comma-separated tuple, in which each entry is the index or slice for one dimension:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">a = np.array([[<span class="number">1</span>, <span class="number">2</span>, <span class="number">3</span>], [<span class="number">5</span>, <span class="number">8</span>, <span class="number">9</span>], [<span class="number">10</span>, <span class="number">12</span>, <span class="number">15</span>]])</span><br><span class="line"><span class="built_in">print</span>(a)</span><br><span class="line">b = a[:,<span class="number">1</span>]</span><br><span class="line"><span class="built_in">print</span>(b)</span><br><span class="line"></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[[ 1 2 3]</span></span><br><span class="line"><span class="string"> [ 5 8 9]</span></span><br><span class="line"><span class="string"> [10 12 15]]</span></span><br><span class="line"><span class="string">[ 2 8 12]</span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></table></figure><p><code>a[:, 1]</code> takes all rows of the two-dimensional array <code>a</code> and selects the elements in the column with index 1.</p><p>Iterating over a multidimensional array in NumPy runs over its <strong>first dimension</strong>. To demonstrate, we generate an array as follows:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># np.fromfunction(function, shape, **kwargs) builds an array of the given shape whose element at (i, j) is f(i, j); the function is called once with whole index arrays, not once per element</span></span><br><span class="line"><span class="comment"># function: a callable that computes the element values from the index arrays</span></span><br><span class="line"><span class="comment"># shape: a tuple of integers giving the shape of the array to create</span></span><br><span class="line"><span class="comment"># **kwargs: optional keyword arguments passed on to function</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">f</span>(<span class="params">x, y</span>):</span><br><span class="line">    <span class="keyword">return</span> <span class="number">10</span> * x + y</span><br><span class="line">b = np.fromfunction(f, (<span class="number">5</span>, <span class="number">4</span>), dtype=<span class="built_in">int</span>)</span><br><span class="line"><span class="built_in">print</span>(b)</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[[ 0 1 2 3]</span></span><br><span class="line"><span class="string"> [10 11 12 
13]</span></span><br><span class="line"><span class="string"> [20 21 22 23]</span></span><br><span class="line"><span class="string"> [30 31 32 33]</span></span><br><span class="line"><span class="string"> [40 41 42 43]]</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Iterate over b with a for loop (one row per iteration)</span></span><br><span class="line"><span class="keyword">for</span> row <span class="keyword">in</span> b:</span><br><span class="line">    <span class="built_in">print</span>(row)</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[0 1 2 3]</span></span><br><span class="line"><span class="string">[10 11 12 13]</span></span><br><span class="line"><span class="string">[20 21 22 23]</span></span><br><span class="line"><span class="string">[30 31 32 33]</span></span><br><span class="line"><span class="string">[40 41 42 43]</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="comment"># To operate on every individual element, use the flat attribute (output omitted)</span></span><br><span class="line"><span class="keyword">for</span> element <span class="keyword">in</span> b.flat:</span><br><span class="line">    <span class="built_in">print</span>(element)</span><br></pre></td></tr></table></figure></div><h2 id="5-广播机制"><a href="#5-广播机制" class="headerlink" title="5. 广播机制"></a>5. 
Broadcasting</h2><div class="story post-story"><p>Broadcasting describes how NumPy handles arrays of different shapes during arithmetic operations. Put simply, when two arrays have the same shape, the operation is applied element by element at corresponding positions; when their shapes differ, the smaller array is "broadcast" across the larger one, provided the shapes are compatible:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># The smaller array is stretched to the shape of the larger one</span></span><br><span class="line">a = np.array([[<span class="number">0</span>,<span class="number">0</span>,<span class="number">0</span>],[<span class="number">10</span>,<span class="number">10</span>,<span class="number">10</span>],[<span class="number">20</span>,<span class="number">20</span>,<span class="number">20</span>],[<span class="number">30</span>,<span class="number">30</span>,<span class="number">30</span>]])</span><br><span class="line">b = np.array([<span class="number">1</span>,<span class="number">2</span>,<span class="number">3</span>])</span><br><span class="line"></span><br><span class="line"><span class="built_in">print</span>(a + b)</span><br><span class="line"></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[[ 1 2 3]</span></span><br><span class="line"><span class="string"> [11 12 13]</span></span><br><span class="line"><span class="string"> [21 22 23]</span></span><br><span class="line"><span class="string"> [31 32 33]]</span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20231025/1.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231025/1.jpg" 
srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Broadcasting follows these rules:</p><blockquote><ol><li>If the two arrays have different numbers of dimensions, the shape of the array with fewer dimensions is padded with 1s on the left until the numbers of dimensions match.</li><li>If the sizes of the two arrays differ along a dimension and one of the arrays has size 1 along it, that array is stretched along that dimension to match the other.</li><li>If the sizes of the two arrays differ along a dimension and neither of them is 1, an error is raised because the arrays cannot be broadcast.</li></ol></blockquote><p>Array <code>a</code> has shape <code>(4, 3)</code> and array <code>b</code> has shape <code>(3,)</code>, so their numbers of dimensions differ. By rule 1, <code>b</code> is treated as having shape <code>(1, 3)</code>, which is compatible with the shape of <code>a</code>. During the element-wise operation, broadcasting then repeats <code>b</code> along the first dimension (conceptually; no copy is actually made) until its shape matches that of <code>a</code>, and the operation is applied at corresponding positions.</p><p>If <code>b</code> had shape <code>(2, 3)</code> or <code>(4,)</code>, it could not be broadcast against <code>a</code>: in the first case 2 does not match 4, in the second the trailing sizes 4 and 3 do not match, and none of these sizes is 1.</p></div><h2 id="6-数组基础操作"><a href="#6-数组基础操作" class="headerlink" title="6. 数组基础操作"></a>6. Basic array operations</h2><div class="story post-story"><h3 id="修改数组形状"><a href="#修改数组形状" class="headerlink" title="修改数组形状"></a>Changing array shape</h3><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span 
class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># resize and reshape both change an array's shape; they differ as follows</span></span><br><span class="line"><span class="comment"># resize modifies the original array in place</span></span><br><span class="line"><span class="comment"># reshape returns the reshaped array and leaves the shape of the original unchanged</span></span><br><span class="line">a = np.arange(<span class="number">12</span>)</span><br><span class="line">a.reshape(<span class="number">3</span>,<span class="number">4</span>)</span><br><span class="line"><span class="built_in">print</span>(a)</span><br><span class="line">a.resize(<span class="number">3</span>,<span class="number">4</span>)</span><br><span class="line"><span class="built_in">print</span>(a)</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[ 0 1 2 3 4 5 6 7 8 9 10 11]</span></span><br><span class="line"><span class="string">[[ 0 1 2 3]</span></span><br><span class="line"><span class="string"> [ 4 5 6 7]</span></span><br><span class="line"><span class="string"> [ 8 9 10 11]]</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># If one dimension is given as -1, it is computed automatically from the others (an error is raised if the size does not divide evenly)</span></span><br><span class="line">a = np.arange(<span class="number">12</span>)</span><br><span 
class="built_in">print</span>(a.reshape(<span class="number">3</span>,-<span class="number">1</span>))</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[[ 0 1 2 3]</span></span><br><span class="line"><span class="string"> [ 4 5 6 7]</span></span><br><span class="line"><span class="string"> [ 8 9 10 11]]</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># flatten and ravel both convert a multidimensional array to one dimension (flattening); they differ as follows:</span></span><br><span class="line"><span class="comment"># flatten returns a copy of the original array, so modifying it does not affect the original</span></span><br><span class="line"><span class="comment"># ravel returns a view of the original array whenever possible, so modifying it can affect the original</span></span><br><span class="line">a = np.array([[<span class="number">0</span>,<span class="number">1</span>,<span class="number">2</span>,<span class="number">3</span>],[<span class="number">4</span>,<span class="number">5</span>,<span class="number">6</span>,<span class="number">7</span>],[<span class="number">8</span>,<span class="number">9</span>,<span class="number">10</span>,<span class="number">11</span>]])</span><br><span class="line">b = a.flatten()</span><br><span class="line">b[<span class="number">0</span>] = <span class="number">10</span></span><br><span class="line"><span class="built_in">print</span>(a)</span><br><span class="line">c = a.ravel()</span><br><span class="line">c[<span class="number">0</span>] = <span class="number">10</span></span><br><span class="line"><span class="built_in">print</span>(a)</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[[ 0 1 2 3]</span></span><br><span class="line"><span class="string"> [ 4 5 6 7]</span></span><br><span class="line"><span class="string"> [ 8 9 10 11]]</span></span><br><span class="line"><span class="string">[[10 1 2 3]</span></span><br><span class="line"><span class="string"> [ 4 5 6 7]</span></span><br><span 
class="line"><span class="string"> [ 8 9 10 11]]</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Two common flattening orders: 'F' is column-major (column by column), 'C' is row-major (row by row, the default)</span></span><br><span class="line">a = np.array([[<span class="number">0</span>,<span class="number">1</span>,<span class="number">2</span>,<span class="number">3</span>],[<span class="number">4</span>,<span class="number">5</span>,<span class="number">6</span>,<span class="number">7</span>],[<span class="number">8</span>,<span class="number">9</span>,<span class="number">10</span>,<span class="number">11</span>]])</span><br><span class="line">b = a.flatten(order=<span class="string">'F'</span>)</span><br><span class="line">c = a.flatten(order=<span class="string">'C'</span>)</span><br><span class="line"><span class="built_in">print</span>(b)</span><br><span class="line"><span class="built_in">print</span>(c)</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[ 0 4 8 1 5 9 2 6 10 3 7 11]</span></span><br><span class="line"><span class="string">[ 0 1 2 3 4 5 6 7 8 9 10 11]</span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></table></figure><p>Note the difference between a copy and a view here:</p><ul><li><p><code>flatten</code> returns a copy: a complete, independent duplicate of the array's data (like a deep copy) stored in different memory, so modifying the copy does not affect the original data.</p></li><li><p><code>ravel</code> returns a view where possible: another name for, or reference to, the same data (like a shallow copy). It shares memory with the original, so modifying the view affects the original data. <strong>Slices are views too, so modifying data through a slice changes the original array.</strong></p></li></ul><h3 id="翻转数组"><a href="#翻转数组" class="headerlink" title="翻转数组"></a>Transposing arrays</h3><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span 
class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># transpose and T reverse the order of the array's axes</span></span><br><span class="line"><span class="comment"># np.transpose(arr, axes): for multidimensional arrays the axis order can be given explicitly, e.g. np.transpose(a, (2, 0, 1))</span></span><br><span class="line">a = np.array([[<span class="number">0</span>,<span class="number">1</span>,<span class="number">2</span>,<span class="number">3</span>],[<span class="number">4</span>,<span class="number">5</span>,<span class="number">6</span>,<span class="number">7</span>],[<span class="number">8</span>,<span class="number">9</span>,<span class="number">10</span>,<span class="number">11</span>]])</span><br><span class="line">np.transpose(a)</span><br><span class="line">a.T  <span class="comment"># both give exactly the same output</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([[ 0, 4, 8],</span></span><br><span class="line"><span class="string"> [ 1, 5, 9],</span></span><br><span class="line"><span class="string"> [ 2, 6, 10],</span></span><br><span class="line"><span class="string"> [ 3, 7, 11]])</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># swapaxes does something similar: it swaps two given axes (a special case of np.transpose)</span></span><br><span class="line"><span class="comment"># 
np.swapaxes(arr, axis1, axis2)</span></span><br><span class="line">a = np.arange(<span class="number">27</span>).reshape(<span class="number">3</span>,<span class="number">3</span>,<span class="number">3</span>)</span><br><span class="line">np.transpose(a, (<span class="number">0</span>, <span class="number">2</span>, <span class="number">1</span>))</span><br><span class="line">np.swapaxes(a, <span class="number">1</span>, <span class="number">2</span>)  <span class="comment"># both give exactly the same output</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([[[ 0, 3, 6],</span></span><br><span class="line"><span class="string"> [ 1, 4, 7],</span></span><br><span class="line"><span class="string"> [ 2, 5, 8]],</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string"> [[ 9, 12, 15],</span></span><br><span class="line"><span class="string"> [10, 13, 16],</span></span><br><span class="line"><span class="string"> [11, 14, 17]],</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string"> [[18, 21, 24],</span></span><br><span class="line"><span class="string"> [19, 22, 25],</span></span><br><span class="line"><span class="string"> [20, 23, 26]]])</span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></table></figure><h3 id="修改维度"><a href="#修改维度" class="headerlink" title="修改维度"></a>Changing dimensions</h3><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span 
class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># np.expand_dims(arr, axis) inserts a new axis of length 1 at the given position</span></span><br><span class="line">a = np.arange(<span class="number">4</span>).reshape(<span class="number">2</span>, <span class="number">2</span>)</span><br><span class="line">b = np.expand_dims(a, axis=<span class="number">0</span>)  <span class="comment"># insert a new axis at position 0</span></span><br><span class="line"><span class="built_in">print</span>(b.shape)</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">(1, 2, 2)</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># np.newaxis also adds a new axis; only the syntax differs from the above</span></span><br><span class="line">a = np.arange(<span class="number">4</span>).reshape(<span class="number">2</span>, <span class="number">2</span>)</span><br><span class="line">b = a[np.newaxis,:,:]</span><br><span class="line"><span class="built_in">print</span>(b.shape)</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">(1, 2, 2)</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># np.squeeze removes the axes of length 1 from an array</span></span><br><span class="line">a = np.arange(<span class="number">4</span>).reshape(<span class="number">1</span>, <span class="number">2</span>, <span class="number">2</span>, <span class="number">1</span>)</span><br><span class="line">b = np.squeeze(a)</span><br><span 
class="built_in">print</span>(b.shape)</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">(2, 2)</span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></table></figure><h3 id="数组拼接"><a href="#数组拼接" class="headerlink" title="数组拼接"></a>Joining arrays</h3><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># np.concatenate((a1, a2), axis) joins several arrays along an existing axis (axis 0 by default)</span></span><br><span class="line"><span class="comment"># 
The arrays may differ in size along the joining axis, but all other dimensions must match. The result has the same number of dimensions as the inputs</span></span><br><span class="line">a = np.array([[<span class="number">1</span>,<span class="number">2</span>,<span class="number">3</span>],[<span class="number">4</span>,<span class="number">5</span>,<span class="number">6</span>]])</span><br><span class="line">b = np.array([[<span class="number">7</span>,<span class="number">8</span>,<span class="number">9</span>],[<span class="number">10</span>,<span class="number">11</span>,<span class="number">12</span>]])</span><br><span class="line">np.concatenate((a,b),axis=<span class="number">0</span>)  <span class="comment"># two (2, 3) arrays joined along axis 0 give a (4, 3) array</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([[ 1, 2, 3],</span></span><br><span class="line"><span class="string"> [ 4, 5, 6],</span></span><br><span class="line"><span class="string"> [ 7, 8, 9],</span></span><br><span class="line"><span class="string"> [10, 11, 12]])</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line">np.concatenate((a,b),axis=<span class="number">1</span>)  <span class="comment"># two (2, 3) arrays joined along axis 1 give a (2, 6) array</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([[ 1, 2, 3, 7, 8, 9],</span></span><br><span class="line"><span class="string"> [ 4, 5, 6, 10, 11, 12]])</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># np.stack((a1, a2), axis) also combines several arrays, but it stacks them along a new axis (axis 0 by default)</span></span><br><span class="line"><span class="comment"># The stacked arrays must have exactly the same shape. The result has one more dimension than the inputs (the new axis is created at the given position)</span></span><br><span class="line">np.stack((a,b),axis=<span class="number">0</span>)  <span class="comment"># two (2, 3) arrays stacked along axis 0 give a (2, 2, 3) array</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span 
class="string">array([[[ 1,  2,  3],</span></span><br><span class="line"><span class="string">        [ 4,  5,  6]],</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">       [[ 7,  8,  9],</span></span><br><span class="line"><span class="string">        [10, 11, 12]]])</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Related helpers</span></span><br><span class="line"><span class="comment"># np.hstack stacks horizontally (along the second axis for 2-D arrays), np.vstack stacks vertically (along the first axis)</span></span><br><span class="line">np.hstack((a,b))  <span class="comment"># for these 2-D arrays, same as np.concatenate((a,b),axis=1)</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([[ 1,  2,  3,  7,  8,  9],</span></span><br><span class="line"><span class="string">       [ 4,  5,  6, 10, 11, 12]])</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line">np.vstack((a,b))  <span class="comment"># for these 2-D arrays, same as np.concatenate((a,b),axis=0)</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([[ 1,  2,  3],</span></span><br><span class="line"><span class="string">       [ 4,  5,  6],</span></span><br><span class="line"><span class="string">       [ 7,  8,  9],</span></span><br><span class="line"><span class="string">       [10, 11, 12]])</span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></table></figure><h3 id="数组拆分"><a href="#数组拆分" class="headerlink" title="Splitting arrays"></a>Splitting arrays</h3><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span 
class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># For the demo, create another random number generator, seeded with 1</span></span><br><span class="line">rg = np.random.default_rng(<span class="number">1</span>)</span><br><span class="line">arr = rg.integers(<span class="number">0</span>, <span class="number">10</span>, size=(<span class="number">3</span>, <span class="number">5</span>))  <span class="comment"># random integers from 0 to 9 (10 is excluded), as a (3,5) array</span></span><br><span class="line"><span class="built_in">print</span>(arr)</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[[4 5 7 9 0]</span></span><br><span class="line"><span class="string"> [1 8 9 2 3]</span></span><br><span class="line"><span class="string"> [8 4 2 8 2]]</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="comment"># np.split(ary, indices_or_sections, axis): axis selects the axis to split along; the default 0 splits by rows</span></span><br><span class="line">np.split(arr, <span class="number">3</span>)  <span class="comment"># split the array into 3 equal parts</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[array([[4, 5, 7, 9, 0]], dtype=int64),</span></span><br><span class="line"><span class="string"> array([[1, 8, 9, 2, 3]], dtype=int64),</span></span><br><span class="line"><span class="string"> array([[8, 4, 2, 8, 2]], 
dtype=int64)]</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line">np.split(arr, [<span class="number">1</span>,<span class="number">3</span>], axis=<span class="number">1</span>)  <span class="comment"># split by columns at indices 1 and 3: columns [0:1], [1:3] and [3:] (each range is half-open)</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[array([[4],</span></span><br><span class="line"><span class="string">        [1],</span></span><br><span class="line"><span class="string">        [8]], dtype=int64),</span></span><br><span class="line"><span class="string"> array([[5, 7],</span></span><br><span class="line"><span class="string">        [8, 9],</span></span><br><span class="line"><span class="string">        [4, 2]], dtype=int64),</span></span><br><span class="line"><span class="string"> array([[9, 0],</span></span><br><span class="line"><span class="string">        [2, 3],</span></span><br><span class="line"><span class="string">        [8, 2]], dtype=int64)]</span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></table></figure><h3 id="数组增删"><a href="#数组增删" class="headerlink" title="Adding and removing elements"></a>Adding and removing elements</h3><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span 
class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># np.append(arr, values, axis=None) appends values to an array</span></span><br><span class="line"><span class="comment"># Note that axis defaults to None, in which case the array is flattened to 1-D before the values are appended</span></span><br><span class="line">arr = rg.integers(<span class="number">0</span>, <span class="number">10</span>, size=(<span class="number">2</span>, <span class="number">5</span>))</span><br><span class="line"><span class="built_in">print</span>(arr)</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">[[4 5 7 9 0]</span></span><br><span class="line"><span class="string"> [1 8 9 2 3]]</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line">np.append(arr,[<span class="number">0</span>,<span class="number">1</span>,<span class="number">2</span>,<span class="number">3</span>,<span class="number">4</span>])  <span class="comment"># without the axis argument</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([4, 5, 7, 9, 0, 1, 8, 9, 2, 3, 0, 1, 2, 3, 4], dtype=int64)</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line">np.append(arr,[[<span class="number">0</span>,<span class="number">1</span>,<span class="number">2</span>,<span 
class="number">3</span>,<span class="number">4</span>]],axis=<span class="number">0</span>)  <span class="comment"># with an axis, values must have the same number of dimensions as arr</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([[4, 5, 7, 9, 0],</span></span><br><span class="line"><span class="string">       [1, 8, 9, 2, 3],</span></span><br><span class="line"><span class="string">       [0, 1, 2, 3, 4]], dtype=int64)</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line">np.append(arr,[[<span class="number">0</span>,<span class="number">1</span>],[<span class="number">2</span>,<span class="number">3</span>]],axis=<span class="number">1</span>)</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([[4, 5, 7, 9, 0, 0, 1],</span></span><br><span class="line"><span class="string">       [1, 8, 9, 2, 3, 2, 3]], dtype=int64)</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># np.insert(arr, obj, values, axis=None) inserts values into an array</span></span><br><span class="line"><span class="comment"># obj is the index before which to insert and axis is the axis to insert along; as above, omitting axis flattens the array</span></span><br><span class="line">np.insert(arr, <span class="number">2</span>, <span class="number">100</span>, axis=<span class="number">1</span>)  <span class="comment"># insert the value 100 before index 2 along axis 1 (the columns)</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([[  4,   5, 100,   7,   9,   0],</span></span><br><span class="line"><span class="string">       [  1,   8, 100,   9,   2,   3]], dtype=int64)</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># np.delete(arr, obj, axis=None) deletes elements from an array</span></span><br><span class="line"><span class="comment"># obj can be an integer or a slice object selecting the sub-arrays to delete; axis is the axis to work on and should be given here as well</span></span><br><span 
class="line">np.delete(arr,<span class="number">2</span>,axis=<span class="number">1</span>)  <span class="comment"># delete index 2 along axis 1, i.e. the elements 7 and 9 of the original array</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([[4, 5, 9, 0],</span></span><br><span class="line"><span class="string">       [1, 8, 2, 3]], dtype=int64)</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line">np.delete(arr,np.s_[::<span class="number">2</span>],axis=<span class="number">1</span>)  <span class="comment"># np.s_ builds a slice object; [::2] runs from start to end with step 2, so columns 0, 2 and 4 are deleted</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">array([[5, 9],</span></span><br><span class="line"><span class="string">       [8, 2]], dtype=int64)</span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></table></figure></div><h2 id="7-常用函数"><a href="#7-常用函数" class="headerlink" title="7. Common functions"></a>7. 
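Common functions</h2><div class="story post-story"><table><thead><tr><th>Function</th><th>Description</th></tr></thead><tbody><tr><td>Math functions</td><td></td></tr><tr><td><strong>np.around()</strong></td><td>Rounds to the given number of decimals (exact halves round to the nearest even value)</td></tr><tr><td><strong>np.floor()</strong></td><td>Returns the largest integer less than or equal to the input, i.e. rounds down</td></tr><tr><td><strong>np.ceil()</strong></td><td>Returns the smallest integer greater than or equal to the input, i.e. rounds up</td></tr><tr><td>np.abs()</td><td>Absolute value of each element</td></tr><tr><td>np.sqrt()</td><td>Square root of each element</td></tr><tr><td>np.sum()</td><td>Sum of the array elements</td></tr><tr><td>np.mean()</td><td>Mean of the array elements</td></tr><tr><td>np.max() / np.min()</td><td>Maximum / minimum of the array</td></tr><tr><td>String functions</td><td></td></tr><tr><td>np.char.add()</td><td>Concatenates two arrays of strings element by element</td></tr><tr><td>np.char.center()</td><td>Centers each string within a given width</td></tr><tr><td>np.char.capitalize()</td><td>Upper-cases the first letter of each string</td></tr><tr><td>np.char.title()</td><td>Upper-cases the first letter of every word in each string</td></tr><tr><td>np.char.lower() / np.char.upper()</td><td>Converts each element to lower / upper case</td></tr><tr><td>np.char.strip()</td><td>Removes the given characters (whitespace by default) from both ends of each string</td></tr><tr><td>np.char.join()</td><td>Joins the characters of each element with the given separator</td></tr><tr><td>np.char.replace()</td><td>Replaces substrings</td></tr><tr><td>np.char.split()</td><td>Splits each string on the given separator and returns an array of lists</td></tr></tbody></table></div><h2 id="8-Matplotlib"><a href="#8-Matplotlib" class="headerlink" title="8. Matplotlib"></a>8. 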
Matplotlib</h2><div class="story post-story"><p>In the applications section of the official NumPy tutorials, almost every data visualization example is built with <code>Matplotlib</code>, Python's plotting library.</p><p>Like R, Matplotlib can produce a very wide range of plots, so when you need one it is easiest to go straight to the official documentation:</p><p><a href="https://matplotlib.org/stable/users/index">Using Matplotlib — Matplotlib 3.8.0 documentation</a></p><p>For any figure drawn with Matplotlib, these are the parts of the figure we need to know:</p><img src="https://www.shelven.com/tuchuang/20231025/2.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231025/2.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" style="zoom:80%;" /><p>A simple plot with Matplotlib (Jupyter Notebook is recommended):</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> numpy <span class="keyword">as</span> np</span><br><span class="line"><span class="keyword">from</span> matplotlib <span class="keyword">import</span> pyplot <span class="keyword">as</span> plt</span><br><span class="line"></span><br><span class="line">x = np.arange(<span class="number">0</span>,<span class="number">10</span>)</span><br><span class="line">y = x**<span class="number">2</span></span><br><span class="line">plt.title(<span class="string">"Demo"</span>)</span><br><span class="line">plt.xlabel(<span class="string">"X axis"</span>)</span><br><span class="line">plt.ylabel(<span class="string">"Y axis"</span>)</span><br><span class="line">plt.plot(x,y)</span><br><span class="line">plt.show()</span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20231025/3.png" class="lazyload" 
data-srcset="https://www.shelven.com/tuchuang/20231025/3.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p><a href="https://numpy.org/numpy-tutorials/applications.html">NumPy Applications — NumPy Tutorials</a></p><p>Here is one example from the NumPy applications above, which draws a fractal image:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> numpy <span class="keyword">as</span> np</span><br><span class="line"><span class="keyword">import</span> matplotlib.pyplot <span class="keyword">as</span> plt</span><br><span class="line"><span class="keyword">from</span> mpl_toolkits.axes_grid1 <span class="keyword">import</span> make_axes_locatable</span><br><span class="line"></span><br><span class="line"><span class="comment"># On the given mesh, compute the Julia 
set</span></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">general_julia</span>(<span class="params">mesh, c=-<span class="number">1</span>, f=np.square, num_iter=<span class="number">100</span>, radius=<span class="number">2</span></span>):</span><br><span class="line">    z = mesh.copy()</span><br><span class="line">    diverge_len = np.zeros(z.shape)  <span class="comment"># counts the iterations for each point</span></span><br><span class="line">    <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(num_iter):</span><br><span class="line">        conv_mask = np.<span class="built_in">abs</span>(z) < radius  <span class="comment"># only iterate points whose absolute value is still below radius</span></span><br><span class="line">        z[conv_mask] = f(z[conv_mask]) + c</span><br><span class="line">        diverge_len[conv_mask] += <span class="number">1</span></span><br><span class="line">    <span class="keyword">return</span> diverge_len</span><br><span class="line"></span><br><span class="line"><span class="comment"># The function iterated inside the Julia set computation</span></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">accident</span>(<span class="params">z</span>):</span><br><span class="line">    <span class="keyword">return</span> z - (<span class="number">2</span> * np.power(np.tan(z), <span class="number">2</span>) / (np.sin(z) * np.cos(z)))</span><br><span class="line"></span><br><span class="line"><span class="comment"># Draw the fractal image</span></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">plot_fractal</span>(<span class="params">fractal, title=<span class="string">'Fractal'</span>, figsize=(<span class="params"><span class="number">10</span>, <span class="number">10</span></span>), cmap=<span class="string">'rainbow'</span>, extent=[-<span class="number">2</span>, <span class="number">2</span>, -<span class="number">2</span>, <span class="number">2</span>]</span>):</span><br><span class="line">
    plt.figure(figsize=figsize)</span><br><span class="line">    plt.rcParams[<span class="string">'font.sans-serif'</span>] = [<span class="string">'SimHei'</span>]  <span class="comment"># use a font that can render the Chinese labels</span></span><br><span class="line">    plt.rcParams[<span class="string">'axes.unicode_minus'</span>] = <span class="literal">False</span>  <span class="comment"># render the minus sign correctly with that font</span></span><br><span class="line">    ax = plt.axes()</span><br><span class="line">    ax.set_title(<span class="string">f'$<span class="subst">{title}</span>$'</span>)</span><br><span class="line">    ax.set_xlabel(<span class="string">'实轴'</span>)</span><br><span class="line">    ax.set_ylabel(<span class="string">'虚轴'</span>)</span><br><span class="line">    im = ax.imshow(fractal, extent=extent, cmap=cmap)  <span class="comment"># draw the image on the axes</span></span><br><span class="line">    divider = make_axes_locatable(ax)  <span class="comment"># make the axes divisible so another axes can be attached</span></span><br><span class="line">    cax = divider.append_axes(<span class="string">"right"</span>, size=<span class="string">"5%"</span>, pad=<span class="number">0.1</span>)  <span class="comment"># add a new axes on the right</span></span><br><span class="line">    plt.colorbar(im, cax=cax, label=<span class="string">'迭代次数'</span>)  <span class="comment"># add the colorbar and its label</span></span><br><span class="line">mesh = np.linspace(-<span class="number">2</span>, <span class="number">2</span>, <span class="number">400</span>) + <span class="number">1j</span> * np.linspace(-<span class="number">2</span>, <span class="number">2</span>, <span class="number">400</span>)[:, <span class="literal">None</span>]  <span class="comment"># assumed 400x400 complex grid; mesh was not defined in the original snippet</span></span><br><span class="line">output = general_julia(mesh, f=accident, num_iter=<span class="number">15</span>, c=<span class="number">0</span>, radius=np.pi)</span><br><span class="line">kwargs = {<span class="string">'title'</span>: <span class="string">'Accidental \\ fractal'</span>, <span class="string">'cmap'</span>: <span class="string">'Blues'</span>}</span><br><span class="line"></span><br><span class="line">plot_fractal(output, **kwargs)</span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20231025/5.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231025/5.png" 
srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p></div>]]></content>
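    <summary type="html"><p>NumPy (Numerical Python) is an extension library for Python. It provides a powerful multidimensional array object (<code>ndarray</code>) together with functions and tools for working with arrays. NumPy is the foundation of many other scientific computing and data analysis libraries, such as SciPy (Scientific Python), Pandas and Matplotlib (a plotting library).</p></summary>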
<category term="编程自学" scheme="http://www.shelven.com/categories/%E7%BC%96%E7%A8%8B%E8%87%AA%E5%AD%A6/"/>
<category term="python" scheme="http://www.shelven.com/tags/python/"/>
<category term="Numpy" scheme="http://www.shelven.com/tags/Numpy/"/>
</entry>
<entry>
    <title>Notes on a kdevtmpfsi cryptomining malware attack caused by a PostgreSQL vulnerability</title>
<link href="http://www.shelven.com/2023/10/24/a.html"/>
<id>http://www.shelven.com/2023/10/24/a.html</id>
<published>2023-10-23T16:07:37.000Z</published>
<updated>2023-10-23T16:11:19.000Z</updated>
    <content type="html"><![CDATA[<p>Here is what happened. To store user data for my QQ bot, I pulled the PostgreSQL Docker image yesterday, and at the time everything ran normally. Then a little after 3 pm today, my hosting provider emailed me to warn that malicious files had been found on the server, three messages in a row:</p><span id="more"></span><p><img src="https://www.shelven.com/tuchuang/20231023/1.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231023/1.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Yikes. I logged in over ssh right away and immediately saw that the CPU was completely maxed out (100% usage; it is a small miracle that ssh still worked).</p><h2 id="1-分析排查"><a href="#1-分析排查" class="headerlink" title="1. Analysis and troubleshooting"></a>1. Analysis and troubleshooting</h2><div class="story post-story"><p>First, use <code>top</code> to see which program is using the CPU:</p><p><img src="https://www.shelven.com/tuchuang/20231023/2.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231023/2.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>A process named <code>kdevtmpfsi</code> was using almost all of the CPU. That had to be cryptomining malware, and a quick search on Baidu confirmed it:</p><p><a href="https://worktile.com/kb/ask/36677.html">centos病毒有哪些 • Worktile社区</a></p><p>Fortunately this malware does not damage files on the machine; it only eats your CPU. It also comes with a <strong>daemon process</strong> named <code>kinsing</code>.</p><p>The hosting provider told me the malicious files were <code>/tmp/kdevtmpfsi</code>, <code>/var/tmp/kdevtmpfsi</code> and <code>/tmp/kinsing</code>, but I could not find them at those paths on the host. That leaves two possibilities: either the malware deleted the original files after launching them, or the malware lives inside Docker's files.</p><p>Look up the PIDs of the two programs in the process list:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">ps -aux | grep kinsing</span><br><span class="line">ps -aux | grep kdevtmpfsi</span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20231023/3.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231023/3.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Then kill them (first the daemon <code>kinsing</code>, then the miner <code>kdevtmpfsi</code>):</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">kill</span> -9 7842</span><br><span class="line"><span class="built_in">kill</span> -9 12201</span><br></pre></td></tr></table></figure><p>I killed them straight away because the CPU had to be freed first, otherwise the login session could freeze. This kind of malware usually restarts itself after a while, so we still need to find the root cause.</p><p>Search for the two files from the root directory and delete them:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">find / -name kdevtmpfsi</span><br><span class="line">/var/lib/docker/overlay2/cb454dcc5c6592a9389757846a15e41d8bf6d68a7480f59aab579bd307baaeca/diff/var/tmp/kdevtmpfsi</span><br><span class="line">/var/lib/docker/overlay2/cb454dcc5c6592a9389757846a15e41d8bf6d68a7480f59aab579bd307baaeca/merged/var/tmp/kdevtmpfsi</span><br><span class="line"></span><br><span class="line">find / -name kinsing</span><br><span class="line">/var/lib/docker/overlay2/cb454dcc5c6592a9389757846a15e41d8bf6d68a7480f59aab579bd307baaeca/diff/tmp/kinsing</span><br><span class="line">/var/lib/docker/overlay2/cb454dcc5c6592a9389757846a15e41d8bf6d68a7480f59aab579bd307baaeca/merged/tmp/kinsing</span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20231023/4.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231023/4.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>I am still not sure why the search found two files each time but only one of them could be deleted. Perhaps deleting one triggered the malware's script to remove the other? A more likely explanation is that in <code>overlay2</code> the <code>merged</code> directory is the mounted view of the container's filesystem and <code>diff</code> is its writable layer, so both paths point at the same file. Either way, delete every file you can find and do not let a single one go.</p><p>Next, check for suspicious scheduled tasks (the malware restarts periodically, so this has to be considered):</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br></pre></td><td class="code"><pre><span class="line">crontab -l</span><br><span class="line"></span><br><span class="line">*/5 * * * * flock -xn /tmp/stargate.lock -c <span class="string">'/usr/local/qcloud/stargate/admin/start.sh > /dev/null 2>&1 &'</span></span><br><span class="line"></span><br><span class="line"><span class="comment">## I later found out this job was set up by Tencent Cloud's monitoring console; oh well, it is deleted now</span></span><br></pre></td></tr></table></figure><p>There is one scheduled task here. I did not remember creating any, so I simply removed the account's whole crontab (<strong>Careful! This cannot be undone. If you see a suspicious job that points at an unfamiliar IP, it is better to open the editor with crontab -e, delete only the suspicious entry, then save and quit with wq</strong>):</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">crontab -r  <span class="comment"># not recommended! I am only recording what I did</span></span><br></pre></td></tr></table></figure><p>There is also a way to list the scheduled tasks of every account (run it as root):</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">cat</span> /etc/passwd | <span class="built_in">cut</span> -f 1 -d : |xargs -I {} crontab -l -u {}</span><br></pre></td></tr></table></figure><p>This extracts every user name and runs <code>crontab -l</code> for each one. As before, delete any suspicious job you find.</p></div><h2 id="2-病毒溯源"><a href="#2-病毒溯源" class="headerlink" title="2. Tracing the source"></a>2. 
Tracing the source</h2><div class="story post-story"><p>While deleting the malware, I was also trying to trace where it came from = =</p><p>First, check whether someone broke into the root account over ssh:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">less /var/log/secure|grep <span class="string">'Accepted'</span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20231023/5.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231023/5.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Every login came from an IP address I normally use, so the malware was not planted through an ssh login to the root account.</p><p>Next, check the TCP connections and listening ports for suspicious programs:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">netstat -anltp</span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20231023/6.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231023/6.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>No suspicious program was running there either.</p><p>At this point I noticed that the malware files came, <strong>without exception</strong>, from <code>docker</code>'s <code>overlay2</code> filesystem. The only container I had used recently was <code>PostgreSQL</code>, so it was most likely the culprit.</p><p>List my local Docker images:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">docker images</span><br><span class="line"></span><br><span class="line">REPOSITORY TAG IMAGE ID CREATED SIZE</span><br><span class="line">postgres latest f7d9a0d4223b 5 weeks ago 417MB</span><br><span class="line">xzhouqd/qsign 8.9.63 153680998c24 3 months ago 558MB</span><br><span class="line">hello-world latest 9c7a54a9a43c 5 months ago 
13.3kB</span><br></pre></td></tr></table></figure><p>The <code>kinsing</code> and <code>kdevtmpfsi</code> files found earlier live under this directory:</p><p><code>/var/lib/docker/overlay2/cb454dcc5c6592a9389757846a15e41d8bf6d68a7480f59aab579bd307baaeca</code></p><p>So the question is: how do I find out which container that long hash directory belongs to?</p><p>List the Docker containers (<code>-a</code> also shows stopped ones):</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker ps -a</span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20231023/7.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231023/7.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Inspect the metadata of the suspicious postgresql container:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker inspect 27d1f4378405</span><br></pre></td></tr></table></figure><p>This returns the metadata as JSON. You can also pass -f to extract one specific key:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker inspect -f <span class="string">'{{.GraphDriver.Data}}'</span> 27d1f4378405</span><br></pre></td></tr></table></figure><p>And there it was: the long hash directory from above shows up in the metadata of the postgresql container. Mystery solved: <strong>the malware was planted through the postgresql container</strong>.</p><p>With the source found, it was time for the full stop-and-delete routine. Bye bye~</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">docker stop pgsql  <span class="comment"># stop the container</span></span><br><span class="line">docker <span class="built_in">rm</span> pgsql  <span class="comment"># remove the container</span></span><br><span class="line">docker rmi postgres  <span class="comment"># 
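remove the local image</span></span><br></pre></td></tr></table></figure><p>In fact, once the malware files and processes above had been removed, the malware no longer came back. By then my database had already been corrupted for reasons I could not determine, and since I had traced the malware to this Docker container, I deleted the container and its image while I was at it = =</p></div><h2 id="3-复盘攻击原因"><a href="#3-复盘攻击原因" class="headerlink" title="3. Reviewing the cause of the attack"></a>3. Reviewing the cause of the attack</h2><div class="story post-story"><p>At first I assumed the malware got in through a port scanner that found a vulnerable port, but that was not the case.</p><p>When I found the <code>kinsing</code> and <code>kdevtmpfsi</code> files, I also found a <code>curl</code> file in the same directory, so it is fairly safe to say that the miner keeps downloading and running itself with <code>curl</code>. The container was fine at the beginning, so why did the malware suddenly break out two days later?</p><p>When the hosting provider first warned me about the malicious files, I could not find them at the reported paths on the host, yet they were there inside Docker. That also suggests that <strong>the malware was planted by breaking into the Docker container</strong>. Unfortunately I have deleted the container, so I cannot look any further into which files the malware changed inside it.</p><p>Going back over what I did in the previous two days: after starting the postgresql container, I opened postgresql's default port so that I could connect remotely and use a GUI client, and I allowed access from every IP (0.0.0.0/0). The default postgresql user is <code>postgres</code>. I did change its password to be safe, but only to letters and underscores. With no restriction on the IP range, that probably gave attackers casting a wide net the chance to brute-force the password and download the malware under the postgres account. <strong>Because my postgresql was running in a container</strong>, the malware ended up in Docker's <code>overlay2</code> filesystem. It could not modify the host's filesystem, but as a miner all it needs is the host's CPU.</p><p>The article below supports this hypothesis. It describes how attackers exploit a remote code execution (RCE) vulnerability in PostgreSQL to attack database servers and mine cryptocurrency: once a database account is compromised, PostgreSQL's "copy from program" feature can be used to download and start the mining script:</p><p><a href="https://mp.weixin.qq.com/s?__biz=MzUzNDYxOTA1NA==&mid=2247507579&idx=3&sn=89c43bc8a22c9882335604af7808abdd&chksm=fa9368bacde4e1acbf59ad8dcf3cfb88e311e40b94d7c618a20a16626466a90e8a710e1049e1&scene=27">PGMiner:利用PostgreSQL漏洞的新的加密货币挖矿僵尸网络 (qq.com)</a></p><p>The lesson is to take database security seriously: <strong>setting a more complex password and restricting which IP ranges may connect</strong> is the best way to avoid this kind of attack. After this incident I also changed the server's login password and closed some ports I was not using. Better safe than sorry~</p><p>I also downloaded the malware files to my own machine, hoping to understand how they work. Unfortunately they are binaries, so I turned to Google instead, and sure enough a security researcher has written up in detail how <code>kinsing</code> operates as a daemon and how the <code>kdevtmpfsi</code> miner works:</p><p><a href="https://sysdig.com/blog/zoom-into-kinsing-kdevtmpfsi/">Learn the Attack Patterns of Kinsing with Sysdig</a></p><p>In my case only Docker was compromised, so deleting the affected container and image settled it once and for all. <strong>If your Linux server itself has been compromised, a few details deserve attention</strong>:</p><p><code>kinsing</code> adds scheduled tasks, and these must be deleted (in my case they were presumably added inside the container, which is why I could not find them on the host). The daemon monitors the <code>kdevtmpfsi</code> process and restarts it if it is killed, so the daemon has to be removed first. The daemon also reads the SSH keys on the system and uses them to move laterally to your other machines:</p><p><img 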
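src="https://www.shelven.com/tuchuang/20231023/8.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231023/8.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>So it is worth checking your SSH keys for suspicious entries as well, in case the attacker left a backdoor.</p><p>How the <code>kdevtmpfsi</code> miner allocates your system resources and how it talks to the mining pool are both covered in the article above. Since it is a Linux executable binary, disassembling it to study the code would not be impossible either, but for now understanding how the malware works is enough and I will leave it there.</p></div>]]></content>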
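<summary type="html"><p>Here is what happened: to store user data for my QQ bot, I pulled the PostgreSQL docker image yesterday, and docker ran without any problems at the time. Then, shortly after 3 pm today, my hosting provider emailed me to warn that malicious files had been found on the server, three messages in a row:</p></summary>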
<category term="网络相关" scheme="http://www.shelven.com/categories/%E7%BD%91%E7%BB%9C%E7%9B%B8%E5%85%B3/"/>
<category term="PostgreSQL" scheme="http://www.shelven.com/tags/PostgreSQL/"/>
<category term="kdevtmpfsi" scheme="http://www.shelven.com/tags/kdevtmpfsi/"/>
<category term="kinsing" scheme="http://www.shelven.com/tags/kinsing/"/>
</entry>
<entry>
<title>Chromosome-level assembly and annotation based on the reference genome of a closely related species</title>
<link href="http://www.shelven.com/2023/10/19/a.html"/>
<id>http://www.shelven.com/2023/10/19/a.html</id>
<published>2023-10-19T07:08:13.000Z</published>
<updated>2023-10-19T07:20:54.000Z</updated>
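<content type="html"><![CDATA[<p>Sometimes you run into this situation: I have genome sequencing data for two closely related plant species. Either nobody has worked on them before, or someone has but did not release a reference genome. And because the lab is short on funding, Hi-C was sequenced for only one of them <del>(yes, I am talking about myself)</del>. So how can the assembled genome be used as a reference to bring the genome of the other, closely related species up to chromosome level as well?</p><p>These are my notes on chromosome-level assembly and annotation based on the reference genome of a closely related species, using <code>RagTag</code> and its companion tool <code>Liftoff</code>.</p><span id="more"></span><h2 id="1-RagTag"><a href="#1-RagTag" class="headerlink" title="1. RagTag"></a>1. RagTag</h2><div class="story post-story"><p>RagTag is a toolset written in pure Python (although not every function is implemented in Python: <code>minimap2/Nucmer/unimap</code>, for example, are run as command-line tools in child processes through the <code>subprocess</code> module) for raising a contig-level genome assembly to chromosome level. Specifically, it offers three functions:</p><ul><li>Homology-based misassembly correction</li><li>Homology-based scaffolding and patching (that is, gap filling)</li><li>Scaffold merging (merging scaffolds obtained by different methods)</li></ul><p>The project also ships command-line utilities for handling common genome assembly file formats, likewise written in pure Python. They live in the <code>file utilities</code> folder and mainly cover the following:</p><ul><li><p><code>agp2fa</code>: converts an <code>AGP</code> file to a <code>fasta</code> file. An AGP file describes how <strong>contigs</strong> are built into <strong>scaffolds</strong>; see NCBI's description of the format: <a href="https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/">AGP Specification v2.1 (nih.gov)</a></p></li><li><p><code>agpcheck</code>: checks that an <code>AGP</code> file is correctly formatted</p></li><li><p><code>asmstats</code>: reports statistics on the assembly sequences</p></li><li><p><code>splitasm</code>: splits the assembly sequences at gaps and provides an <code>AGP</code> file for downstream processing</p></li><li><p><code>delta2paf</code>: converts a <code>delta</code> file to a <code>PAF</code> file. A <a href="https://mummer.sourceforge.net/manual/#nucmer">delta</a> file is the output of the genome aligner <code>NUCmer</code> (NUCleotide MUMmer); it records the coordinates of each alignment and the distances between the insertions and deletions within each alignment. A <a href="https://lh3.github.io/minimap2/minimap2.html">PAF</a> file similarly describes approximate alignment positions between two sets of sequences and is the output format of <code>minimap2</code>. Follow the two links to see what each format looks like.</p></li><li><p><code>paf2delta</code>: the reverse of the above</p></li><li><p><code>updategff</code>: updates a gff file according to a RagTag AGP file (<strong>this does not produce the gff file for the final genome</strong>; for that, the author recommends the companion tool <a href="https://github.com/agshumate/Liftoff">Liftoff</a>)</p></li></ul><p>The RagTag workflow has four main steps:</p><p><img src="https://www.shelven.com/tuchuang/20231018/1.jpg" class="lazyload" 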
data-srcset="https://www.shelven.com/tuchuang/20231018/1.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Official repository: <a href="https://github.com/malonge/RagTag">RagTag/ragtag_correct.py at master · malonge/RagTag (github.com)</a></p><h3 id="1-1-correct"><a href="#1-1-correct" class="headerlink" title="1.1 correct"></a>1.1 correct</h3><p>This is RagTag's correction module, which uses a reference genome to identify and correct potential misassemblies. This step only breaks sequences at suspected misassemblies; <strong>it does not add or remove any sequence</strong>.</p><figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">usage: ragtag.py correct <reference.fa> <query.fa></span><br><span class="line"></span><br><span class="line">Homology-based misassembly correction: Correct sequences in 'query.fa' by comparing them to sequences in 'reference.fa'></span><br><span class="line"></span><br><span class="line">positional arguments:</span><br><span class="line"> <reference.fa> reference fasta file (uncompressed or bgzipped)</span><br><span class="line"> <query.fa> query fasta file (uncompressed or bgzipped)</span><br></pre></td></tr></table></figure><p>Run with <code>-h</code> for the full list of options. Note that <code>reference.fa</code> is the reference genome (the same species or a closely related one) and <code>query.fa</code> holds the contigs to be assembled further (here assembled from second- plus third-generation sequencing reads).</p><figure class="highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="meta">#!/bin/bash</span></span><br><span class="line"><span class="comment">#SBATCH -n 5</span></span><br><span class="line"><span class="comment">#SBATCH -t 7200</span></span><br><span class="line"></span><br><span 
class="line">ragtag.py correct ../../Genome/Ap.fa Av.fasta -t 5</span><br></pre></td></tr></table></figure><p>Output files:</p><p>├── ragtag_output<br>│ ├── ragtag.correct.agp<br>│ ├── ragtag.correct.asm.paf<br>│ ├── ragtag.correct.asm.paf.log<br>│ ├── ragtag.correct.err<br>│ ├── ragtag.correct.fasta</p><p>The contigs were split from the original 38 into 519:</p><p><img src="https://www.shelven.com/tuchuang/20231018/2.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231018/2.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><h3 id="1-2-scaffold"><a href="#1-2-scaffold" class="headerlink" title="1.2 scaffold"></a>1.2 scaffold</h3><p>This is RagTag's scaffolding module, which orders and orients the draft assembly sequences into longer sequences. The corrected contigs from the previous step are aligned to the reference genome, and the gaps between adjacent contigs are filled with Ns (100 by default). As before, this step does not alter the original sequences.</p><figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">usage: ragtag.py scaffold <reference.fa> <query.fa></span><br><span class="line"></span><br><span class="line">Homology-based assembly scaffolding: Order and orient sequences in 'query.fa' by comparing them to sequences in 'reference.fa'</span><br><span class="line"></span><br><span class="line">positional arguments:</span><br><span class="line"> <reference.fa> reference fasta file (uncompressed or bgzipped)</span><br><span class="line"> <query.fa> query fasta file (uncompressed or bgzipped)</span><br></pre></td></tr></table></figure><p>Run with <code>-h</code> for the full list of options. <code>reference.fa</code> is the reference genome and <code>query.fa</code> is the corrected sequence from the previous step.</p><figure class="highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="meta">#!/bin/bash</span></span><br><span class="line"><span class="comment">#SBATCH -n 5</span></span><br><span class="line"><span class="comment">#SBATCH -t 7200</span></span><br><span class="line"></span><br><span class="line">ragtag.py scaffold ../../Genome/Ap.fa ragtag_output/ragtag.correct.fasta -t 5 -C -u</span><br></pre></td></tr></table></figure><p>The <code>-C</code> option places all unaligned sequences on a single artificial chromosome, <code>chr0</code>.</p><p>Output files:</p><p>├── ragtag_output<br>│ ├── ragtag.scaffold.agp<br>│ ├── ragtag.scaffold.asm.paf<br>│ ├── ragtag.scaffold.asm.paf.log<br>│ ├── ragtag.scaffold.confidence.txt<br>│ ├── ragtag.scaffold.err<br>│ ├── ragtag.scaffold.fasta<br>│ └── ragtag.scaffold.stats</p><p>Check the length of each scaffold:</p><figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">$ seqkit fx2tab --length --name --header-line ragtag_output/ragtag.scaffold.fasta</span><br><span class="line"></span><br><span class="line">#name length</span><br><span class="line">LG01_RagTag 20785546</span><br><span class="line">LG02_RagTag 20079267</span><br><span class="line">LG03_RagTag 17989129</span><br><span class="line">LG04_RagTag 19176014</span><br><span class="line">LG05_RagTag 22754901</span><br><span class="line">LG06_RagTag 24973536</span><br><span class="line">LG07_RagTag 20015323</span><br><span class="line">LG08_RagTag 20197945</span><br><span class="line">LG09_RagTag 
17850387</span><br><span class="line">LG10_RagTag 23023498</span><br><span class="line">LG11_RagTag 18329485</span><br><span class="line">Chr0_RagTag 5764532</span><br></pre></td></tr></table></figure><p>My reference genome has 11 chromosomes. Most of the contig sequence (by length) was placed and the scaffold lengths look reasonable; the remaining contigs all end up in <code>Chr0_RagTag</code>.</p><p>The <code>ragtag.scaffold.stats</code> file summarizes how many contigs were placed on scaffolds:</p><figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">$ <span class="built_in">cat</span> ragtag.scaffold.stats | column -t -s $<span class="string">'\t'</span></span><br><span class="line"></span><br><span class="line">placed_sequences placed_bp unplaced_sequences unplaced_bp gap_bp gap_sequences</span><br><span class="line">90 225167131 429 5721732 50700 507</span><br></pre></td></tr></table></figure><p>Of the 519 contigs produced by the correction step, 90 were placed and 429 could not be aligned to the reference genome, yet those 429 sequences total only 5721732 bp. The numbers also show that 428 of the 507 gaps lie in <code>Chr0_RagTag</code> (all the unplaced sequences are there, and 429 sequences joined end to end give 428 gaps), so only 79 gaps are actually on chromosomes. A short script can count the gaps on each scaffold:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># count_gap.py: count the gaps on each scaffold produced by this step</span></span><br><span class="line"></span><br><span class="line">sequence = {} <span class="comment"># dict mapping sequence name to sequence</span></span><br><span class="line">gap = {} <span class="comment"># dict mapping sequence name to gap count</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">store_sequence</span>(<span class="params">file_name</span>):</span><br><span class="line">    <span class="keyword">with</span> <span class="built_in">open</span>(file_name, <span class="string">'r'</span>) <span class="keyword">as</span> file:</span><br><span class="line">        header = <span class="string">''</span></span><br><span class="line">        seq = <span class="string">''</span></span><br><span class="line">        <span class="keyword">for</span> line <span class="keyword">in</span> file:</span><br><span class="line">            line = line.strip()</span><br><span class="line">            <span class="keyword">if</span> line.startswith(<span class="string">'>'</span>):</span><br><span class="line">                <span class="keyword">if</span> header:</span><br><span class="line">                    sequence[header] = seq</span><br><span class="line">                header = line[<span class="number">1</span>:]</span><br><span class="line">                seq = <span class="string">''</span></span><br><span class="line">            <span class="keyword">else</span>:</span><br><span class="line">                seq += line</span><br><span class="line">        sequence[header] = seq <span class="comment"># store the last sequence</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">count_gap</span>():</span><br><span class="line">    <span class="keyword">for</span> header <span class="keyword">in</span> sequence:</span><br><span class="line">        seq = sequence[header]</span><br><span class="line">        <span class="comment"># assumes every N belongs to a 100-bp gap inserted by RagTag</span></span><br><span class="line">        count = seq.count(<span class="string">'N'</span>) // <span class="number">100</span></span><br><span class="line">        gap[header] = count</span><br><span class="line"></span><br><span class="line">store_sequence(<span class="string">'ragtag.scaffold.fasta'</span>)</span><br><span class="line">count_gap()</span><br><span class="line"><span class="keyword">for</span> header <span class="keyword">in</span> gap:</span><br><span class="line">    counts = gap[header]</span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f'scaffold:<span class="subst">{header}</span>\tgaps:<span class="subst">{counts}</span>'</span>)</span><br><span class="line"></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">scaffold:LG01_RagTag gaps:21</span></span><br><span class="line"><span class="string">scaffold:LG02_RagTag gaps:5</span></span><br><span class="line"><span class="string">scaffold:LG03_RagTag gaps:5</span></span><br><span class="line"><span class="string">scaffold:LG04_RagTag gaps:6</span></span><br><span class="line"><span class="string">scaffold:LG05_RagTag gaps:7</span></span><br><span class="line"><span class="string">scaffold:LG06_RagTag gaps:0</span></span><br><span class="line"><span class="string">scaffold:LG07_RagTag gaps:7</span></span><br><span class="line"><span class="string">scaffold:LG08_RagTag gaps:2</span></span><br><span class="line"><span class="string">scaffold:LG09_RagTag gaps:4</span></span><br><span class="line"><span class="string">scaffold:LG10_RagTag gaps:12</span></span><br><span class="line"><span class="string">scaffold:LG11_RagTag gaps:10</span></span><br><span class="line"><span class="string">scaffold:Chr0_RagTag gaps:428</span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></table></figure><p>The gap counts add up (79 on the chromosomes plus 428 on <code>Chr0_RagTag</code>), so everything checks out.</p><h3 id="1-3-patch"><a href="#1-3-patch" class="headerlink" title="1.3 patch"></a>1.3 patch</h3><p>This is RagTag's patching module. Posts on Chinese-language sites generally describe it as filling the gaps in the scaffolds produced by the previous step. <strong>That is right, but not entirely.</strong> <code>patch</code> and <code>scaffold</code> are two independent modules, <strong>and each can carry out scaffolding on its own</strong>.</p><p>In the <code>scaffold</code> module, contigs are oriented and ordered into scaffolds using the reference genome as a guide; no contig sequence is added or removed in the process.</p><p>The <code>patch</code> module resembles gap-filling software. Gap fillers typically use third-generation long reads (such as ONT ultra-long reads), <strong>but here the gaps are filled with assembly sequences</strong>, and the sequence does change as a result. The <code>patch</code> module has two operating modes:</p><blockquote><p><code>--fill-only</code> only fills existing gaps, much like a traditional gap-filling tool</p><p><code>--join-only</code> does not fill existing gaps; instead it joins separate target sequences, and the <strong>newly created junctions</strong> are what get patched with query sequence</p></blockquote><p>The fill mode works as follows:</p><p><img src="https://www.shelven.com/tuchuang/20231018/3.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231018/3.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>By default both modes run. As the figure shows, patching requires another <strong>assembly</strong> of the same species.</p><p>If you ran the <code>scaffold</code> module in the previous step and only need to fill the gaps it introduced, running the <code>patch</code> module with <code>--fill-only</code> is enough.</p><p>I have no other assembly, so the <code>scaffold</code> module is all I really need. To try out the patch module anyway, I ran it with the original assembly as the query:</p><figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">usage: ragtag.py patch <target.fa> <query.fa></span><br><span class="line"></span><br><span class="line">Homology-based assembly patching: Make continuous joins and fill gaps in 'target.fa' using sequences from 'query.fa'</span><br><span class="line"></span><br><span class="line">positional arguments:</span><br><span class="line"> <target.fa> target fasta file (uncompressed or bgzipped)</span><br><span class="line"> <query.fa> query fasta file (uncompressed or bgzipped)</span><br></pre></td></tr></table></figure><p>Run with <code>-h</code> for the full list of options. Here <code>target.fa</code> is the scaffold sequence from the previous step and <code>query.fa</code> is the original assembly I first fed to the correct module.</p><figure class="highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="meta">#!/bin/bash</span></span><br><span class="line"><span class="comment">#SBATCH -n 5</span></span><br><span class="line"><span class="comment">#SBATCH -t 7200</span></span><br><span class="line"></span><br><span class="line">ragtag.py patch ragtag_output/ragtag.scaffold.fasta Av.fasta -t 5 -u -i 0.05</span><br></pre></td></tr></table></figure><p>Output files:</p><p>├── ragtag_output<br>│ ├── ragtag.patch.agp<br>│ ├── ragtag.patch.asm.delta<br>│ ├── ragtag.patch.asm.delta.log<br>│ ├── ragtag.patch.asm.paf<br>│ ├── ragtag.patch.comps.fasta<br>│ ├── ragtag.patch.comps.fasta.fai<br>│ ├── ragtag.patch.ctg.agp<br>│ ├── ragtag.patch.ctg.fasta<br>│ ├── ragtag.patch.ctg.fasta.fai<br>│ ├── ragtag.patch.err<br>│ ├── ragtag.patch.fasta<br>│ ├── ragtag.patch.rename.agp<br>│ ├── ragtag.patch.rename.fasta<br>│ ├── 
ragtag.patch.rename.fasta.fai</p><p>There are quite a few output files here, so it is worth explaining what some of them are for (taken from the official documentation):</p><blockquote><table><thead><tr><th>file name</th><th>Description</th></tr></thead><tbody><tr><td><code>ragtag.patch.agp</code></td><td>The final AGP file defining how <code>ragtag.patch.fasta</code> is built</td></tr><tr><td><code>ragtag.patch.asm.*</code></td><td>Assembly alignment files</td></tr><tr><td><code>ragtag.patch.comps.fasta</code></td><td>The split target assembly and the renamed query assembly combined into one FASTA file. This file contains all components in <code>ragtag.patch.agp</code></td></tr><tr><td><code>ragtag.patch.ctg.agp</code></td><td>An AGP file defining how the target assembly was split at gaps</td></tr><tr><td><code>ragtag.patch.ctg.fasta</code></td><td>The target assembly split at gaps</td></tr><tr><td><code>ragtag.patch.err</code></td><td>Standard error logging for all external RagTag commands</td></tr><tr><td><code>ragtag.patch.fasta</code></td><td>The final FASTA file containing the patched assembly</td></tr><tr><td><code>ragtag.patch.rename.agp</code></td><td>An AGP file defining the new names for query sequences</td></tr><tr><td><code>ragtag.patch.rename.fasta</code></td><td>A FASTA file with the original query sequence, but with new names</td></tr></tbody></table></blockquote><p>The file we care about is the final result, <code>ragtag.patch.fasta</code>. This step takes a long time to run. How long? Compare it with the two previous steps:</p><figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">$ sacct --format=JobName,Start,End,Elapsed,NCPUS</span><br><span class="line"></span><br><span class="line"> JobName Start End Elapsed NCPUS </span><br><span class="line">---------- ------------------- 
------------------- ---------- ---------- </span><br><span class="line"> correct 2023-10-18T14:41:49 2023-10-18T14:44:50 00:03:01 5 </span><br><span class="line"> scaffold 2023-10-18T15:02:21 2023-10-18T15:03:13 00:00:52 5 </span><br><span class="line"> patch 2023-10-18T15:31:30 2023-10-18T17:25:49 01:54:19 5 </span><br><span class="line"> </span><br><span class="line"># irrelevant columns omitted</span><br></pre></td></tr></table></figure><p>The first two steps finish within minutes (about 3 minutes and under 1 minute), while <code>patch</code> takes nearly 2 hours.</p><p>One more point: if the sequences used for patching are <strong>T2T</strong> or <strong>near-T2T</strong> assemblies, keep in mind that they contain a lot of highly repetitive sequence. RagTag's <code>patch</code> module identifies candidate patch regions from unique alignments between a query contig and at least two target contigs. Extensive repeats leave many breaks in the unique alignments, so a region that could have been patched may be wrongly rejected.</p><p>The author's suggestion is to add the <code>-i</code> option and tune its value to control the maximum alignment break length, which makes RagTag more tolerant of long breaks within unique alignments. Alternatively, break the query sequences at their repetitive regions (which also removes the non-unique alignments caused by the repeats).</p><p>Looking at the output, it is almost identical to the <code>scaffold</code> result, with only a few gaps filled (since I have no other assembly). Filling gaps in <code>Chr0_RagTag</code>, which is just the leftover contigs concatenated together, is meaningless anyway. For me this result adds little and is less trustworthy; I only ran it to test the module. I will keep working with the <code>ragtag.scaffold.fasta</code> file from the previous step.</p><h3 id="1-4-merge"><a href="#1-4-merge" class="headerlink" title="1.4 merge"></a>1.4 merge</h3><p>This is RagTag's module for merging scaffolding results, intended to improve the scaffolding.</p><p>Scaffolding can be done in several ways, for example with Hi-C, Bionano optical maps, or genetic maps. The <code>merge</code> module reconciles the scaffolding results from these different methods (in AGP format) to improve the accuracy of the draft assembly.</p><p>I have no Hi-C data left to test this step with, so I will just show the official usage and go into detail when I actually need it:</p><figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">usage: ragtag.py merge <asm.fa> <scf1.agp> <scf2.agp> [...]</span><br><span class="line"></span><br><span class="line">Scaffold merging: derive a 
consensus scaffolding solution by reconciling distinct scaffoldings of 'asm.fa'</span><br><span class="line"></span><br><span class="line">positional arguments:</span><br><span class="line"> <asm.fasta> assembly fasta file (uncompressed or bgzipped)</span><br><span class="line"> <scf1.agp> <scf2.agp> [...]</span><br><span class="line"> scaffolding AGP files</span><br><span class="line"></span><br><span class="line">optional arguments:</span><br><span class="line"> -h, --help show this help message and exit</span><br></pre></td></tr></table></figure></div><h2 id="2-Liftoff"><a href="#2-Liftoff" class="headerlink" title="2. Liftoff"></a>2. Liftoff</h2><div class="story post-story"><p>Liftoff maps the genome annotation of the same or a closely related species onto an assembled genome. The inputs are simple: the assembled genome, a reference genome, and the reference annotation file.</p><p>The tool uses <code>Minimap2</code> to align gene sequences from one genome to the other (rather than aligning the genomes to each other). For each gene it looks for exon alignments that maximize sequence identity while preserving the transcript and gene structure. If two genes map to overlapping positions, it also decides which one is most likely mis-mapped and remaps it (I am quite curious how that is implemented).</p><p>Official repository: <a href="https://github.com/agshumate/Liftoff">agshumate/Liftoff: An accurate GFF3/GTF lift over pipeline (github.com)</a></p><figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">usage: liftoff [-h] (-g GFF | -db DB) [-o FILE] [-u FILE] [-exclude_partial]</span><br><span class="line"> [-dir DIR] [-mm2_options =STR] [-a A] [-s S] [-d D] [-flank F]</span><br><span class="line"> [-V] 
[-p P] [-m PATH] [-f TYPES] [-infer_genes]</span><br><span class="line"> [-infer_transcripts] [-chroms TXT] [-unplaced TXT] [-copies]</span><br><span class="line"> [-sc SC] [-overlap O] [-mismatch M] [-gap_open GO]</span><br><span class="line"> [-gap_extend GE]</span><br><span class="line"> target reference</span><br><span class="line"></span><br><span class="line">Lift features from one genome assembly to another</span><br><span class="line"></span><br><span class="line">Required input (sequences):</span><br><span class="line"> target target fasta genome to lift genes to</span><br><span class="line"> reference reference fasta genome to lift genes from</span><br><span class="line"></span><br><span class="line">Required input (annotation):</span><br><span class="line"> -g GFF annotation file to lift over in GFF or GTF format</span><br><span class="line"> -db DB name of feature database; if not specified, the -g</span><br><span class="line"> argument must be provided and a database will be built</span><br><span class="line"> automatically</span><br></pre></td></tr></table></figure><p>Run with <code>-h</code> for the full list of options. Let's try a run:</p><figure class="highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="meta">#!/bin/bash</span></span><br><span class="line"><span class="comment">#SBATCH -n 20</span></span><br><span class="line"><span class="comment">#SBATCH -t 7200</span></span><br><span class="line"></span><br><span class="line">gff_path=/public/home/wlxie/biosoft/braker3/Ap_mydb/Ap_rmTE_1.gff3</span><br><span class="line">target=/public/home/wlxie/biosoft/ragtag/ragtag_output/ragtag.scaffold.fasta</span><br><span class="line">reference=/public/home/wlxie/Genome/Ap.fa</span><br><span class="line"></span><br><span class="line">liftoff -g <span class="variable">${gff_path}</span> -o Av.gff3 -p 20 <span class="variable">${target}</span> <span class="variable">${reference}</span></span><br></pre></td></tr></table></figure><p>It finishes in about two minutes. The output files are:<br>├── Av.gff3<br>├── intermediate_files<br>│ ├── reference_all_genes.fa<br>│ └── reference_all_to_target_all.sam<br>└── unmapped_features.txt</p><p>The gff3 file is the result we want. The <code>intermediate_files</code> folder holds intermediate files that are of little further use, and <code>unmapped_features.txt</code> lists the genes that could not be mapped.</p><p>Extract the CDS and protein sequences:</p><figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">$ gffread Av.gff3 -g ragtag.scaffold.fasta -x Av.codingseq</span><br><span class="line"></span><br><span class="line">$ gffread Av.gff3 -g ragtag.scaffold.fasta -y Av.aa</span><br></pre></td></tr></table></figure></div>]]></content>
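<summary type="html"><p>Sometimes you run into this situation: I have genome sequencing data for two closely related plant species. Either nobody has worked on them before, or someone has but did not release a reference genome. And because the lab is short on funding, Hi-C was sequenced for only one of them <del>(yes, I am talking about myself)</del>. So how can the assembled genome be used as a reference to bring the genome of the other, closely related species up to chromosome level as well?</p>
<p>These are my notes on chromosome-level assembly and annotation based on the reference genome of a closely related species, using <code>RagTag</code> and its companion tool <code>Liftoff</code>.</p></summary>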
<category term="学习笔记" scheme="http://www.shelven.com/categories/%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/"/>
<category term="基因组三代测序分析" scheme="http://www.shelven.com/categories/%E5%9F%BA%E5%9B%A0%E7%BB%84%E4%B8%89%E4%BB%A3%E6%B5%8B%E5%BA%8F%E5%88%86%E6%9E%90/"/>
<category term="RagTag" scheme="http://www.shelven.com/tags/RagTag/"/>
<category term="Liftoff" scheme="http://www.shelven.com/tags/Liftoff/"/>
</entry>
<entry>
<title>Pitfalls and additional notes on UTR annotation with braker3</title>
<link href="http://www.shelven.com/2023/10/13/a.html"/>
<id>http://www.shelven.com/2023/10/13/a.html</id>
<published>2023-10-13T13:43:20.000Z</published>
<updated>2023-10-14T09:09:37.000Z</updated>
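<content type="html"><![CDATA[<p>In the post <a href="https://www.shelven.com/2023/04/03/a.html">Genome annotation (4): gene prediction</a> I described how to predict protein-coding genes with <code>braker3</code>. For ease of installation and use, I simply downloaded the official <code>singularity</code> container. Anyone who has used braker3 will have noticed that <strong>the standard BRAKER pipeline does not include UTR prediction</strong>, which means the final gtf/gff file carries no 3' or 5' UTR information.</p><p>A heads-up: what follows is my own experimentation, and I cannot guarantee it is correct.</p><span id="more"></span><h2 id="需求描述"><a href="#需求描述" class="headerlink" title="The requirement"></a>The requirement</h2><div class="story post-story"><p>When you clone a gene, primers designed without the UTR regions give a PCR product that misses part of the gene, so in the long run it is better to have UTR information in the annotation. How can UTR annotations be added to the braker results?</p></div><h2 id="官方注释策略"><a href="#官方注释策略" class="headerlink" title="Official annotation strategies"></a>Official annotation strategies</h2><div class="story post-story"><p>BRAKER3 officially offers two annotation strategies, which come down to two options: <code>--UTR=on</code> and <code>--addUTR=on</code>. The two cannot be used together. <code>--UTR=on</code> trains the AUGUSTUS UTR model from the RNA-seq data and eventually produces a gtf file that includes UTR annotations. It looks like a one-step solution, but in practice it has many problems: in the official issue tracker, many users report UTR annotations that do not match their genes, with some UTRs lying far away from the CDS, which clearly makes no sense. The developers' reply is that <code>--UTR=on</code> trains and predicts UTRs, which may make the UTR annotation worse, so it should be used with caution:</p><p><img src="https://www.shelven.com/tuchuang/20231013/2.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231013/2.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>The <code>--addUTR=on</code> option works on top of the AUGUSTUS gene predictions: it takes the coverage information directly from the bam file of RNA-seq alignments and adds UTRs to the AUGUSTUS output without any UTR training or prediction (this is implemented through <a href="https://github.com/Gaius-Augustus/GUSHR">GUSHR</a>, a tool written by the developers). <strong>Note that this option is not added on the first BRAKER3 run.</strong> Run the normal pipeline first (without <code>--UTR=on</code>), and once you have the results, run braker.pl again with the following command:</p><figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">braker.pl --genome=../genome.fa --addUTR=on \</span><br><span class="line"> --bam=../RNAseq.bam --workingdir=$wd \</span><br><span class="line"> --AUGUSTUS_hints_preds=augustus.hints.gtf --threads=8 \</span><br><span class="line"> 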
--skipAllTraining --species=somespecies</span><br></pre></td></tr></table></figure><p>The four essential options are <code>--genome</code>, <code>--bam</code>, <code>--AUGUSTUS_hints_preds</code> and <code>--skipAllTraining</code>. The <code>augustus.hints.gtf</code> file can be found in the AUGUSTUS folder of the braker output.</p></div><h2 id="使用singularity碰到的问题"><a href="#使用singularity碰到的问题" class="headerlink" title="Problems encountered with singularity"></a>Problems encountered with singularity</h2><div class="story post-story"><p>If, like me, you use the singularity container, you will find, awkwardly, that the run fails with either of the options above: it reports that JAVA_PATH cannot be found, or some other java-related error. I ran into both of the following in turn:</p><p><a href="https://github.com/Gaius-Augustus/BRAKER/issues/638">JAVA error when adding UTR options · Issue #638 · Gaius-Augustus/BRAKER (github.com)</a></p><p><a href="https://github.com/Gaius-Augustus/BRAKER/issues/584">set_JAVA_PATH bug…? · Issue #584 · Gaius-Augustus/BRAKER (github.com)</a></p><p>Even though I know where the error comes from and would like to edit braker.pl, this singularity image is read-only (<del>yet another drawback of singularity, how annoying...</del>). I tried versions 3.0.3 and 2.0.2 and both have the problem, so if you use the singularity image, forget about predicting UTRs by simply adding an option...</p><p>To meet the need for adding UTR information to <code>braker.gtf</code>, the developers added a script two weeks ago that merges information from the intermediate file of stringtie-assembled transcripts into <code>braker.gtf</code>:</p><p><img src="https://www.shelven.com/tuchuang/20231013/4.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231013/4.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>It is available here:</p><p><a href="https://github.com/Gaius-Augustus/BRAKER/blob/utr_from_stringtie/scripts/stringtie2utr.py">https://github.com/Gaius-Augustus/BRAKER/blob/utr_from_stringtie/scripts/stringtie2utr.py</a></p><p>Unfortunately, <code>transcripts_merged.gff</code> is an intermediate file that is normally deleted automatically once braker finishes (the braker log shows when it was deleted). Do not bother running stringtie again yourself either: I tried, and even with a gff file from a separate stringtie run the script above does not work. It needs the stringtie intermediate file produced within the braker pipeline (the formats are not compatible).</p></div><h2 id="比较坎坷的解决方案"><a href="#比较坎坷的解决方案" class="headerlink" title="A rather bumpy workaround"></a>A rather bumpy workaround</h2><div class="story post-story"><p>Of course, it is not completely hopeless. With <a href="https://github.com/Gaius-Augustus/GUSHR">GUSHR</a>, the tool written by the developers, you only need to align the RNA-seq data to the reference genome with hisat, convert the alignments to a bam file, and use that directly as input. The result is the same as running braker with <code>--addUTR=on</code>.</p><figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">#!/bin/bash</span><br><span class="line">#SBATCH -n 20</span><br><span class="line"></span><br><span class="line">genome_path=/public/home/wlxie/Genome/Ap.fasta</span><br><span class="line">fq_path=/public/home/wlxie/Sequencing_data/BYT2022020901/Apocynum_pictum/</span><br><span class="line"></span><br><span class="line"># build the hisat2 index and align the reads to the reference genome</span><br><span class="line">hisat2-build ${genome_path} genome</span><br><span class="line">hisat2 -p 20 -x genome -S out.sam -1 ${fq_path}4-216031965_raw_1.fq -2 ${fq_path}4-216031965_raw_2.fq</span><br><span class="line"></span><br><span class="line"># convert sam to bam and sort with samtools</span><br><span class="line">samtools sort -@ 20 -o RNAseq.bam out.sam</span><br><span class="line"></span><br><span class="line"># remove intermediate files</span><br><span class="line">rm -rf genome.*</span><br><span class="line">rm -rf out.sam</span><br></pre></td></tr></table></figure><p>The script above processes the raw transcriptome reads and produces the <code>RNAseq.bam</code> file.</p><p>Next, clone the GUSHR repository from GitHub (through a mirror proxy here):</p><figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">git clone 
https://ghproxy.com/https://github.com/Gaius-Augustus/GUSHR.git</span><br></pre></td></tr></table></figure><p>Download <code>gtf2gff.pl</code>: <a href="https://github.com/Gaius-Augustus/Augustus/blob/cc41cd053721756dd6c0aba2f5c306054edd9151/scripts/gtf2gff.pl#L347">Augustus/scripts/gtf2gff.pl at cc41cd053721756dd6c0aba2f5c306054edd9151 · Gaius-Augustus/Augustus (github.com)</a></p><p>Since I didn't want to touch environment variables, I edited the <code>gushr.py</code> source directly and hard-coded the absolute path of <code>gtf2gff.pl</code> in the <code>gtf2gff</code> variable.</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># line 197 of gushr.py</span></span><br><span class="line"><span class="string">''' Find AUGUSTUS script gtf2gff.pl '''</span></span><br><span class="line"></span><br><span class="line">gtf2gff = 
<span class="string">"/public/home/wlxie/biosoft/braker3/GUSHR/gtf2gff.pl"</span></span><br><span class="line"></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">if args.verbosity > 0:</span></span><br><span class="line"><span class="string"> print("Searching for gtf2gff.pl:")</span></span><br><span class="line"><span class="string">if args.AUGUSTUS_SCRIPTS_PATH:</span></span><br><span class="line"><span class="string"> gtf2gff = args.AUGUSTUS_SCRIPTS_PATH + "/gtf2gff.pl"</span></span><br><span class="line"><span class="string"> if not(os.access(gtf2gff, os.X_OK)):</span></span><br><span class="line"><span class="string"> frameinfo = getframeinfo(currentframe())</span></span><br><span class="line"><span class="string"> print('Error in file ' + frameinfo.filename + ' at line ' +</span></span><br><span class="line"><span class="string"> str(frameinfo.lineno) + ': ' + gtf2gff + " is not executable!")</span></span><br><span class="line"><span class="string"> exit(1)</span></span><br><span class="line"><span class="string"> else:</span></span><br><span class="line"><span class="string"> if args.verbosity > 0:</span></span><br><span class="line"><span class="string"> print("Will use " + gtf2gff)</span></span><br><span class="line"><span class="string">else:</span></span><br><span class="line"><span class="string"> if shutil.which("gtf2gff.pl") is not None:</span></span><br><span class="line"><span class="string"> gtf2gff = shutil.which("gtf2gff.pl")</span></span><br><span class="line"><span class="string"> if args.verbosity > 0:</span></span><br><span class="line"><span class="string"> print("Will use " + gtf2gff)</span></span><br><span class="line"><span class="string"> else:</span></span><br><span class="line"><span class="string"> frameinfo = getframeinfo(currentframe())</span></span><br><span class="line"><span class="string"> print('Error in file ' + frameinfo.filename + ' at line ' 
+</span></span><br><span class="line"><span class="string"> str(frameinfo.lineno) + ': '</span></span><br><span class="line"><span class="string"> + "Unable to locate gtf2gff.pl!")</span></span><br><span class="line"><span class="string"> print("gtf2gff.pl is part of AUGUSTUS scripts.")</span></span><br><span class="line"><span class="string"> print("You can obtain it " +</span></span><br><span class="line"><span class="string"> "from github with:")</span></span><br><span class="line"><span class="string"> print("git clone https://github.com/Gaius-Augustus/Augustus.git")</span></span><br><span class="line"><span class="string"> print("Compilation and full installation of AUGUSTUS is not " +</span></span><br><span class="line"><span class="string"> "required for excuting this script. You only need to add " +</span></span><br><span class="line"><span class="string"> "the missing script to your $PATH.")</span></span><br><span class="line"><span class="string"> exit(1)</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">''' Find bash sort '''</span></span><br></pre></td></tr></table></figure><p>You can run the bundled test first (the bam file has to be downloaded separately). If it passes, run the script on your own files.</p><figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">./gushr.py -t ~/biosoft/braker3/Ap_mydb/Augustus/augustus.hints.gtf -b Ap/RNAseq.bam -g ~/Genome/Ap.fasta -o utrs</span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20231013/5.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231013/5.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>As you can see, the final <code>utrs.gtf</code> file now carries UTR annotations (though not every gene gets one).</p><p>Note that this adds UTR annotations to the AUGUSTUS output <code>augustus.hints.gtf</code>, not to <code>braker.gtf</code> (running the program on the latter fails with an error). The developers explain the two gtf files as follows:</p><blockquote><p>you have two 
options:</p><ul><li><strong>augustus.hints.</strong>*: This is the final output of AUGUSTUS with protein hints.</li><li><strong>braker.{gtf,gtf3}</strong>: This is a union of <strong>augustus.hints.</strong>* and most reliable genes from GeneMark-EP+ prediction, which is a part of the BRAKER2 pipeline with proteins. Thus, this set is generally more sensitive (more genes correctly predicted) and can be less specific (more false-positive predictions can be present).</li><li>The remaining files are intermediate results that are useful for development and debugging purposes.</li></ul><p>The results of <strong>augustus.hints.</strong>* and <strong>braker.{gtf,gtf3}</strong> will be quite similar, if you prefer sensitivity, use <strong>braker.{gtf,gtf3}</strong>. Use <strong>augustus.hints.</strong>* otherwise.</p></blockquote><p>In other words, the two files differ little; <code>braker.gtf</code> is more sensitive but has more false positives.</p><p>Others have asked about adding UTR annotations directly to the BRAKER predictions (without training an AUGUSTUS UTR model, since those results are unstable). The developers said this is not easy, because the format of <code>braker.gtf</code> is incompatible with the UTR procedure. In their words:</p><blockquote><p>there is currently no easy way to do this because the format of <code>braker.gtf</code> is not compatible with UTR procedure.</p><p>How many extra genes do you have in <code>braker.gtf</code> compared to <code>augustus.hints.gtf</code>? 
In case it’s not many genes, it should be OK to just use <code>augustus.hints_utr.gtf</code> and add the extra genes from <code>braker.gtf</code> to the result (the extra genes will have <code>GeneMark.hmm</code> in the second column).</p></blockquote><p><a href="https://github.com/Gaius-Augustus/BRAKER/issues/373">Updating previous braker.gtf with addUTR output for non-fungal eukaryotic genome · Issue #373 · Gaius-Augustus/BRAKER (github.com)</a></p><p>However, the gene and transcript IDs in the two files do not correspond one to one. Simply appending the <code>braker.gtf</code> entries whose second column is <code>GeneMark.hmm</code> to the UTR-annotated <code>augustus.hints.gtf</code> mixes up gene and transcript IDs, and some AUGUSTUS-predicted genes end up overlapping genes called by GeneMark.</p><p>On balance, using <code>augustus.hints.gtf</code> as the final result is the more sensible choice. After the UTR annotation above, the gtf file shrinks somewhat because intron and exon features are removed; only CDS, start_codon, stop_codon, transcript, gene, and 5'- and 3'-UTR features remain.</p><p>Don't forget that this is a non-standard gtf file and still needs two more processing steps... (a real headache)</p><figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line"># Convert the gtf file to gff3</span><br><span class="line">cat utrs.gtf | perl -ne 'if(m/\tAUGUSTUS\t/ or m/\tAnnotationFinalizer\t/ or m/\tGUSHR\t/) {print $_;}' | perl gtf2gff.pl --gff3 --out=result.gff3</span><br></pre></td></tr></table></figure><p>The perl script comes from <code>/usr/share/augustus/scripts/gtf2gff.pl</code> inside the BRAKER singularity container. Alternatively, get it from the official AUGUSTUS repository:</p><p><a href="https://github.com/Gaius-Augustus/Augustus/blob/cc41cd053721756dd6c0aba2f5c306054edd9151/scripts/gtf2gff.pl">Augustus/scripts/gtf2gff.pl at cc41cd053721756dd6c0aba2f5c306054edd9151 · Gaius-Augustus/Augustus (github.com)</a></p><p>This is the developers' gtf-to-gff3 converter, but its output still contains stray spaces that ordinary gff tools cannot parse, as shown here:</p><p><img src="https://www.shelven.com/tuchuang/20231013/6.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231013/6.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>So it needs one more cleanup pass:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># For lines containing AnnotationFinalizer or GUSHR, strip stray whitespace from every field except the last.</span></span><br><span class="line"><span class="comment"># The output is opened with 'w' so that re-running the script does not append duplicate records.</span></span><br><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'result.gff3'</span>, <span class="string">'r'</span>) <span class="keyword">as</span> file, <span class="built_in">open</span>(<span class="string">'final.gff3'</span>, <span class="string">'w'</span>) <span class="keyword">as</span> final:</span><br><span class="line">    <span class="keyword">for</span> line <span class="keyword">in</span> file:</span><br><span class="line">        <span class="keyword">if</span> <span class="string">"AnnotationFinalizer"</span> <span class="keyword">in</span> line <span class="keyword">or</span> <span class="string">"GUSHR"</span> <span class="keyword">in</span> line:</span><br><span class="line">            line_list = line.split(<span class="string">'\t'</span>)</span><br><span class="line">            <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(<span class="built_in">len</span>(line_list)-<span class="number">1</span>):</span><br><span class="line">                line_list[i] = line_list[i].strip()</span><br><span class="line">            line = <span class="string">"\t"</span>.join(line_list)</span><br><span class="line">            final.write(line)</span><br><span class="line">        <span class="keyword">else</span>:</span><br><span class="line">            final.write(line)</span><br></pre></td></tr></table></figure></div><h2 id="最后的问题"><a href="#最后的问题" class="headerlink" title="最后的问题"></a>Remaining problems</h2><div class="story 
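post-story"><p>The resulting <code>final.gff3</code> looks calm on the surface, but a few problems still need manual inspection:</p><p><img src="https://www.shelven.com/tuchuang/20231013/7.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20231013/7.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>In my run, more than thirty genes had a start coordinate greater than the end coordinate, which can be fixed by hand. But a gene often gets <strong>more than one</strong> predicted 5' or 3' UTR, and some of them lie far apart. Broadly speaking, only the last 5' UTR and the first 3' UTR are reasonably trustworthy, so a script is still needed to filter the UTRs and adjust the gene and mRNA coordinates accordingly.</p><p>Sorting this out means digging through the <code>gushr.py</code> source to see how the author filters UTRs... Hmm, let me just state the conclusion: BRAKER output is really not that friendly to UTR annotation. For now all I can offer is the approach above; in the end you still have to write your own script to filter the UTRs in the gff file. I'll implement that when I find the time.</p><h3 id="现在比较着急想要某基因的CDS上下游区域怎么办呢?"><a href="#现在比较着急想要某基因的CDS上下游区域怎么办呢?" class="headerlink" title="现在比较着急想要某基因的CDS上下游区域怎么办呢?"></a>What if you urgently need the regions flanking a gene's CDS?</h3><p>Find the target gene in the annotation, then use the gff file and the genome file to extract a stretch upstream and downstream of the BRAKER-predicted gene, and use your own judgment about what to keep.</p><h3 id="有没有更好的方式注释UTR区域呢?"><a href="#有没有更好的方式注释UTR区域呢?" class="headerlink" title="有没有更好的方式注释UTR区域呢?"></a>Is there a better way to annotate UTRs?</h3><p>As I understand it, the best option is to sequence a full-length transcriptome, use your assembled genome as the reference, and build a non-redundant set of full-length transcripts, which lets you delimit each gene's UTRs fairly clearly. Remember to load the result into the UCSC Genome Browser at the end and take a look.</p><p>In short, this long record of pitfalls and fixes probably won't be directly usable by anyone else, so consider it a backup of my own procedure, haha.</p></div>]]></content>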
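<summary type="html"><p>In <a href="https://www.shelven.com/2023/04/03/a.html">Genome Annotation (4): Gene Prediction</a> I described how to predict protein-coding genes with <code>braker3</code>. To keep installation and usage simple, I downloaded the official <code>singularity</code> container. Anyone who has used braker3 will have noticed that <strong>the standard BRAKER workflow does not include UTR prediction</strong>, so the final gtf&#x2F;gff files contain no 3' or 5' UTR information.</p>
<p>A heads-up: everything below is my own experimentation, and I can't guarantee it is correct.</p></summary>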
<category term="学习笔记" scheme="http://www.shelven.com/categories/%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/"/>
<category term="基因组三代测序分析" scheme="http://www.shelven.com/categories/%E5%9F%BA%E5%9B%A0%E7%BB%84%E4%B8%89%E4%BB%A3%E6%B5%8B%E5%BA%8F%E5%88%86%E6%9E%90/"/>
<category term="Barker3" scheme="http://www.shelven.com/tags/Barker3/"/>
</entry>
<entry>
<title>Converting Between Hard-masking and Soft-masking Results</title>
<link href="http://www.shelven.com/2023/10/10/a.html"/>
<id>http://www.shelven.com/2023/10/10/a.html</id>
<published>2023-10-10T12:48:19.000Z</published>
<updated>2023-10-10T12:52:17.000Z</updated>
<content type="html"><![CDATA[<p>In my note on <a href="https://www.shelven.com/2023/03/12/a.html">Tandem Repeat Annotation</a> I described how to predict tandem repeats (TRs) with TRF. The program can output the masked sequence with the -m option, but at the time I didn't explain how to turn hard-masking results into soft-masking, so here is the follow-up. The output files of the two masking styles can be converted into each other.</p><span id="more"></span><h2 id="1-Hardmasking转Softmasking"><a href="#1-Hardmasking转Softmasking" class="headerlink" title="1. Hardmasking转Softmasking"></a>1. Hard-masking to soft-masking</h2><div class="story post-story"><p>Genome files differ from person to person: some mix upper and lower case, some wrap lines at 60 or 80 nucleotides. For convenience, first convert both the genome file and the hard-masked sequence file to the following fasta layout:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># transform.py</span></span><br><span class="line"><span class="comment"># Store each record as two lines (header, then sequence) with all bases in uppercase</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">transform</span>(<span class="params">input_file, output_file</span>):</span><br><span class="line">    sequence = []</span><br><span class="line">    <span class="keyword">with</span> <span class="built_in">open</span>(input_file, <span class="string">'r'</span>) <span class="keyword">as</span> input_f:</span><br><span class="line">        <span class="keyword">for</span> line <span 
class="keyword">in</span> input_f:</span><br><span class="line">            <span class="keyword">if</span> line.startswith(<span class="string">'>'</span>):</span><br><span class="line">                <span class="keyword">if</span> sequence:</span><br><span class="line">                    sequence = <span class="string">""</span>.join(sequence).replace(<span class="string">'\n'</span>, <span class="string">''</span>).upper()  <span class="comment"># join the wrapped lines and convert to uppercase</span></span><br><span class="line">                    <span class="keyword">with</span> <span class="built_in">open</span>(output_file, <span class="string">'a'</span>) <span class="keyword">as</span> new:</span><br><span class="line">                        new.write(sequence + <span class="string">'\n'</span> + line)</span><br><span class="line">                    sequence = []</span><br><span class="line">                <span class="keyword">else</span>:</span><br><span class="line">                    <span class="keyword">with</span> <span class="built_in">open</span>(output_file, <span class="string">'a'</span>) <span class="keyword">as</span> new:  <span class="comment"># header of the first record</span></span><br><span class="line">                        new.write(line)</span><br><span class="line">            <span class="keyword">else</span>:</span><br><span class="line">                sequence.append(line)</span><br><span class="line">    sequence = <span class="string">""</span>.join(sequence).replace(<span class="string">'\n'</span>, <span class="string">''</span>).upper()  <span class="comment"># the last record, uppercased like the others</span></span><br><span class="line">    <span class="keyword">with</span> <span class="built_in">open</span>(output_file, <span class="string">'a'</span>) <span class="keyword">as</span> new:</span><br><span class="line">        new.write(sequence + <span class="string">'\n'</span>)</span><br><span class="line"></span><br><span class="line">transform(<span class="string">'genome.fa'</span>, <span class="string">'standard_genome.fa'</span>)</span><br></pre></td></tr></table></figure><p>The NCBI website specifies a standard fasta format. Its requirements are quite loose, which is inconvenient for our conversion, so it is still worth normalizing the files first.</p><p><a href="https://www.ncbi.nlm.nih.gov/genbank/fastaformat/">FASTA Format for Nucleotide 
Sequences (nih.gov)</a></p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># hard2soft.py</span></span><br><span class="line"><span class="comment"># input_file: the hard-masked fasta; output_file: the soft-masked fasta to write; reference_file: the genome fasta</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">hard2soft</span>(<span class="params">input_file, output_file, reference_file</span>):</span><br><span class="line">    
reference_sequences = {}  <span class="comment"># maps each sequence name to its sequence</span></span><br><span class="line">    <span class="keyword">with</span> <span class="built_in">open</span>(reference_file, <span class="string">'r'</span>) <span class="keyword">as</span> ref:</span><br><span class="line">        sequence_name = <span class="string">''</span></span><br><span class="line">        sequence = <span class="string">''</span></span><br><span class="line">        <span class="keyword">for</span> line <span class="keyword">in</span> ref:</span><br><span class="line">            line = line.strip()</span><br><span class="line">            <span class="keyword">if</span> line.startswith(<span class="string">'>'</span>):</span><br><span class="line">                <span class="keyword">if</span> sequence_name:</span><br><span class="line">                    reference_sequences[sequence_name] = sequence</span><br><span class="line">                sequence_name = line[<span class="number">1</span>:]</span><br><span class="line">                sequence = <span class="string">''</span></span><br><span class="line">            <span class="keyword">else</span>:</span><br><span class="line">                sequence += line</span><br><span class="line">        <span class="keyword">if</span> sequence_name:  <span class="comment"># store the last record</span></span><br><span class="line">            reference_sequences[sequence_name] = sequence</span><br><span class="line"></span><br><span class="line">    <span class="keyword">with</span> <span class="built_in">open</span>(input_file, <span class="string">'r'</span>) <span class="keyword">as</span> input_f, <span class="built_in">open</span>(output_file, <span class="string">'w'</span>) <span class="keyword">as</span> output_f:</span><br><span class="line">        sequence_name = <span class="string">''</span></span><br><span class="line">        <span class="keyword">for</span> line <span class="keyword">in</span> input_f:</span><br><span class="line">            line = line.strip()</span><br><span class="line">            <span class="keyword">if</span> line.startswith(<span class="string">'>'</span>):</span><br><span 
class="line">                sequence_name = line[<span class="number">1</span>:]</span><br><span class="line">                output_f.write(line + <span class="string">'\n'</span>)</span><br><span class="line">            <span class="keyword">else</span>:</span><br><span class="line">                sequence = line</span><br><span class="line">                <span class="keyword">if</span> sequence_name <span class="keyword">in</span> reference_sequences:</span><br><span class="line">                    reference_sequence = reference_sequences[sequence_name]</span><br><span class="line">                    replaced_sequence = <span class="string">''</span></span><br><span class="line">                    <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(<span class="built_in">len</span>(sequence)):</span><br><span class="line">                        <span class="keyword">if</span> sequence[i] == <span class="string">'N'</span>:</span><br><span class="line">                            <span class="keyword">if</span> reference_sequence[i] == <span class="string">'N'</span>:  <span class="comment"># an N already present in the genome (from gap filling) is kept as N</span></span><br><span class="line">                                replaced_sequence += <span class="string">'N'</span></span><br><span class="line">                            <span class="keyword">else</span>:</span><br><span class="line">                                replaced_sequence += reference_sequence[i].lower()</span><br><span class="line">                        <span class="keyword">else</span>:</span><br><span class="line">                            replaced_sequence += sequence[i]</span><br><span class="line">                    output_f.write(replaced_sequence + <span class="string">'\n'</span>)</span><br><span class="line">                <span class="keyword">else</span>:</span><br><span class="line">                    output_f.write(sequence + <span class="string">'\n'</span>)</span><br><span class="line"></span><br><span class="line">hard2soft(<span class="string">'hardmask.fa'</span>, <span class="string">'softmask.fa'</span>, <span class="string">'standard_genome.fa'</span>)</span><br></pre></td></tr></table></figure><p>Note that Ns already present in the genome (from gap filling) need no conversion and are simply kept as N.</p></div><h2 id="2-Softmasking转Hardmasking"><a 
href="#2-Softmasking转Hardmasking" class="headerlink" title="2. Softmasking转Hardmasking"></a>2. Soft-masking to hard-masking</h2><div class="story post-story"><p>Going from soft-masking to hard-masking is much simpler: just find the lowercase letters in the sequence and replace them with N:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># soft2hard.py</span></span><br><span class="line"><span class="comment"># input_file: the soft-masked fasta; output_file: the hard-masked fasta to write</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">soft2hard</span>(<span class="params">input_file, output_file</span>):</span><br><span class="line">    <span class="keyword">with</span> <span class="built_in">open</span>(input_file, <span class="string">'r'</span>) <span class="keyword">as</span> input_f, <span class="built_in">open</span>(output_file, <span class="string">'w'</span>) <span class="keyword">as</span> output_f:</span><br><span class="line">        <span class="keyword">for</span> line <span class="keyword">in</span> input_f:</span><br><span class="line">            line = line.strip()</span><br><span class="line">            <span class="keyword">if</span> line.startswith(<span class="string">'>'</span>):</span><br><span class="line">                output_f.write(line + <span class="string">'\n'</span>)</span><br><span 
class="line">            <span class="keyword">else</span>:</span><br><span class="line">                sequence = line</span><br><span class="line">                replaced_sequence = <span class="string">''</span></span><br><span class="line">                <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(<span class="built_in">len</span>(sequence)):</span><br><span class="line">                    <span class="keyword">if</span> sequence[i].islower():</span><br><span class="line">                        replaced_sequence += <span class="string">'N'</span></span><br><span class="line">                    <span class="keyword">else</span>:</span><br><span class="line">                        replaced_sequence += sequence[i]</span><br><span class="line">                output_f.write(replaced_sequence + <span class="string">'\n'</span>)</span><br><span class="line"></span><br><span class="line">soft2hard(<span class="string">'softmask.txt'</span>, <span class="string">'hardmask.txt'</span>)</span><br></pre></td></tr></table></figure><p>Whichever way you mask repeats, it only serves the downstream analysis, so there is no need to agonize over which tool is optimal... As long as the result matches your expectations, it's fine O(∩_∩)O</p><p>By the way, the source code of the bioinformatics suite and tools from the UCSC Genome Browser Group also includes a pipeline that uses TRF to obtain soft-masking results and run follow-up analyses. If you're interested, see the perl source in their GitHub repository:</p><p><a href="https://github.com/ucscGenomeBrowser/kent/blob/307976d1f4c1ecbc73a55dea1f6348c19c1336b8/src/hg/utils/automation/doSimpleRepeat.pl">kent/src/hg/utils/automation/doSimpleRepeat.pl at 307976d1f4c1ecbc73a55dea1f6348c19c1336b8 · ucscGenomeBrowser/kent (github.com)</a></p></div>]]></content>
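<summary type="html"><p>In my note on <a href="https://www.shelven.com/2023/03/12/a.html">Tandem Repeat Annotation</a> I described how to predict tandem repeats (TRs) with TRF. The program can output the masked sequence with the -m option, but at the time I didn't explain how to turn hard-masking results into soft-masking, so here is the follow-up. The output files of the two masking styles can be converted into each other.</p></summary>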
<category term="学习笔记" scheme="http://www.shelven.com/categories/%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/"/>
<category term="基因组三代测序分析" scheme="http://www.shelven.com/categories/%E5%9F%BA%E5%9B%A0%E7%BB%84%E4%B8%89%E4%BB%A3%E6%B5%8B%E5%BA%8F%E5%88%86%E6%9E%90/"/>
<category term="格式转换" scheme="http://www.shelven.com/tags/%E6%A0%BC%E5%BC%8F%E8%BD%AC%E6%8D%A2/"/>
</entry>
<entry>
<title>Getting Started with HTML, CSS and JavaScript (3): JavaScript Basics</title>
<link href="http://www.shelven.com/2023/10/08/a.html"/>
<id>http://www.shelven.com/2023/10/08/a.html</id>
<published>2023-10-08T07:31:18.000Z</published>
<updated>2023-10-08T07:40:51.000Z</updated>
<content type="html"><![CDATA[<p>This note records my introductory study of ES6-era JavaScript. It only covers the basics I have learned so far; when actually using the language, consult the documentation often.</p><span id="more"></span><p>While I'm at it, here are a few highly starred JavaScript learning repositories on GitHub:</p><p><a href="https://github.com/getify/You-Dont-Know-JS">getify/You-Dont-Know-JS: A book series on JavaScript. @YDKJS on twitter. (github.com)</a></p><p><a href="https://github.com/trekhleb/javascript-algorithms">trekhleb/javascript-algorithms: 📝 Algorithms and data structures implemented in JavaScript with explanations and links to further readings (github.com)</a></p><p><a href="https://github.com/airbnb/javascript">airbnb/javascript: JavaScript Style Guide (github.com)</a></p><h2 id="1-JavaScript基础"><a href="#1-JavaScript基础" class="headerlink" title="1. JavaScript基础"></a>1. JavaScript Basics</h2><div class="story post-story"><p>JavaScript code in HTML must sit between the <code><script></code> and <code></script></code> tags, which can be placed in <code><head></code> or in <code><body></code>.</p><p>Scripts placed in <code><head></code> usually define JavaScript functions that are called later from <code><body></code>. Alternatively, as with CSS earlier, a script can be loaded from an external file by setting the <code>src</code> attribute of the <code><script></code> tag to the location of the JS file.</p><figure class="highlight html"><table><tr><td class="code"><pre><span class="line"><span class="meta"><!DOCTYPE <span class="keyword">html</span>></span></span><br><span 
class="line"><span class="tag"><<span class="name">html</span> <span class="attr">lang</span>=<span class="string">"en"</span>></span></span><br><span class="line"><span class="tag"><<span class="name">head</span>></span></span><br><span class="line">    <span class="tag"><<span class="name">meta</span> <span class="attr">charset</span>=<span class="string">"UTF-8"</span>></span></span><br><span class="line">    <span class="tag"><<span class="name">meta</span> <span class="attr">http-equiv</span>=<span class="string">"X-UA-Compatible"</span> <span class="attr">content</span>=<span class="string">"IE=edge"</span>></span></span><br><span class="line">    <span class="tag"><<span class="name">meta</span> <span class="attr">name</span>=<span class="string">"viewport"</span> <span class="attr">content</span>=<span class="string">"width=device-width, initial-scale=1.0"</span>></span></span><br><span class="line">    <span class="tag"><<span class="name">title</span>></span>Document<span class="tag"></<span class="name">title</span>></span></span><br><span class="line">    <span class="comment"><!-- One way to include JS: inline in the page --></span></span><br><span class="line">    <span class="tag"><<span class="name">script</span>></span><span class="language-javascript"></span></span><br><span class="line"><span class="language-javascript">        <span class="keyword">function</span> <span class="title function_">myFoo</span>(<span class="params"></span>) {</span></span><br><span class="line"><span class="language-javascript">            <span class="variable language_">document</span>.<span class="title function_">getElementById</span>(<span class="string">"test"</span>).<span class="property">innerHTML</span>=<span class="string">"The content of this element has been changed"</span>;</span></span><br><span class="line"><span class="language-javascript">        }</span></span><br><span class="line"><span class="language-javascript">    </span><span class="tag"></<span class="name">script</span>></span></span><br><span class="line"><span class="tag"></<span 
class="name">head</span>></span></span><br><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line">    <span class="tag"><<span class="name">p</span> <span class="attr">id</span>=<span class="string">"test"</span>></span>This is a paragraph with the id test<span class="tag"></<span class="name">p</span>></span></span><br><span class="line">    <span class="comment"><!-- Create a button and bind a click event that calls the function --></span></span><br><span class="line">    <span class="tag"><<span class="name">button</span> <span class="attr">type</span>=<span class="string">"button"</span> <span class="attr">onclick</span>=<span class="string">"myFoo()"</span>></span>Modify<span class="tag"></<span class="name">button</span>></span></span><br><span class="line">    <span class="comment"><!-- Loading JS from an external file --></span></span><br><span class="line">    <span class="tag"><<span class="name">script</span> <span class="attr">src</span>=<span class="string">"js/index.js"</span>></span><span class="tag"></<span class="name">script</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br><span class="line"><span class="tag"></<span class="name">html</span>></span></span><br></pre></td></tr></table></figure><p>This is a simple example: clicking the button changes the content of the element whose <code>id</code> attribute is "test", which shows how JavaScript controls the behavior of a web page.</p><p>JavaScript can write output to the console with <code>console.log()</code>, or show it in a pop-up dialog with <code>window.alert()</code>. For convenience, the examples below all use <code>console.log()</code> to write results to the console (press F12 in the browser and open the Console tab).</p><p>I started out with Python, so I compare every new language I learn with it. JavaScript is less strict about indentation than Python: Python delimits code blocks by indentation (conventionally 4 spaces), whereas JS delimits them with curly braces. A common convention in JS is to indent by <strong>two spaces</strong>.</p><p>Since ES6, it is generally recommended to declare variables with <code>const</code> (for constants) and <code>let</code> (for variables) instead of <code>var</code>. <strong>var is function-scoped (visible throughout the enclosing function)</strong>, while <strong>const and let are block-scoped (visible only within the enclosing pair of curly braces)</strong>. The following example shows the difference in scope:</p><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">function</span> <span class="title function_">example</span>(<span class="params"></span>) {</span><br><span class="line">  <span class="comment">// Declare a block-scoped variable with let</span></span><br><span class="line">  <span class="keyword">let</span> x = <span class="number">10</span>;</span><br><span class="line"></span><br><span class="line">  <span class="keyword">if</span> (<span class="literal">true</span>) {</span><br><span class="line">    <span class="comment">// A new variable declared inside the block does not affect the outer x</span></span><br><span class="line">    <span class="keyword">let</span> x = <span class="number">20</span>;</span><br><span class="line">    <span class="variable language_">console</span>.<span class="title function_">log</span>(x); <span class="comment">// prints 20</span></span><br><span class="line">  }</span><br><span class="line"></span><br><span class="line">  <span class="variable language_">console</span>.<span class="title function_">log</span>(x); <span class="comment">// prints 10; the outer x is unaffected by the inner one</span></span><br><span class="line"></span><br><span class="line">  <span class="comment">// Declare a constant with const</span></span><br><span 
class="line">  <span class="keyword">const</span> y = <span class="number">30</span>;</span><br><span class="line">  <span class="comment">// y = 40; // TypeError: a constant cannot be reassigned</span></span><br><span class="line"></span><br><span class="line">  <span class="comment">// Declare a variable with var</span></span><br><span class="line">  <span class="keyword">var</span> z = <span class="number">50</span>;</span><br><span class="line"></span><br><span class="line">  <span class="keyword">if</span> (<span class="literal">true</span>) {</span><br><span class="line">    <span class="comment">// var is function-scoped, so this redeclares the same z without raising an error</span></span><br><span class="line">    <span class="keyword">var</span> z = <span class="number">60</span>;</span><br><span class="line">    <span class="variable language_">console</span>.<span class="title function_">log</span>(z); <span class="comment">// prints 60</span></span><br><span class="line">  }</span><br><span class="line"></span><br><span class="line">  <span class="variable language_">console</span>.<span class="title function_">log</span>(z); <span class="comment">// prints 60; the outer z was overwritten inside the block</span></span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="title function_">example</span>();</span><br></pre></td></tr></table></figure><p>To restate: <code>var</code> here is scoped to the function, so, as in Python, a variable name defined inside one function does not affect other functions (unless it is defined outside any function, in which case its scope is global).</p><p>Scope is not the only difference. JavaScript has an interesting concept called <strong>hoisting</strong>. Python executes code in order, top to bottom, one line at a time. JavaScript also runs top to bottom for the most part, but declarations are processed before the surrounding code executes.</p><p>For example, <strong>var and function</strong> <strong>declarations</strong> are hoisted to the top of the nearest enclosing scope, <strong>but assignments are not hoisted</strong>. (<code>const</code> and <code>let</code> declarations are hoisted as well, but they stay uninitialized until the declaration line runs, so they cannot be used early.) This means a variable or function can be used before the line that declares it, although the value of a <code>var</code> variable is <code>undefined</code> at that point:</p><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span 
class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(x); <span class="comment">// prints undefined: the variable has not been assigned yet; note that this is not an error</span></span><br><span class="line"><span class="keyword">var</span> x = <span class="number">5</span>;</span><br><span class="line"></span><br><span class="line"><span class="title function_">foo</span>(); <span class="comment">// prints 5</span></span><br><span class="line"><span class="keyword">function</span> <span class="title function_">foo</span>(<span class="params"></span>) {</span><br><span class="line">  <span class="variable language_">console</span>.<span class="title function_">log</span>(x);</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="comment">// const and let behave differently.</span></span><br><span class="line"><span class="title function_">example</span>();</span><br><span class="line"><span class="keyword">function</span> <span class="title function_">example</span>(<span class="params"></span>) {</span><br><span class="line">  <span class="variable language_">console</span>.<span class="title function_">log</span>(y); <span class="comment">// throws ReferenceError, a real error: y is accessed before initialization</span></span><br><span class="line">  <span class="keyword">const</span> y = <span class="number">5</span>;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>In the example above, the declarations of variable <code>x</code> and function <code>foo</code> are hoisted to the top of the scope, so using them before their declarations does not raise an error. However, <code>x</code> is <code>undefined</code> until the assignment runs and sets it to 5. One more point: <strong>function declarations take precedence over variable declarations when hoisted</strong>, so a function declaration overrides a <strong>same-named</strong> <code>var</code> declaration that has not yet been assigned a value.</p><p>To avoid confusion, declare variables and functions before using them, and declare variables with <code>const</code> and <code>let</code>, which throw an error when accessed before their declaration.</p></div><h2 id="2-数据类型"><a href="#2-数据类型" class="headerlink" title="2. 数据类型"></a>2. 
Data Types</h2><div class="story post-story"><p>The section above listed some differences between JavaScript and Python that stood out to me. To get started with any language, though, it is best to begin with its most basic data types.</p><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// String, Number, Boolean, null, undefined, Symbol</span></span><br><span class="line"><span class="keyword">const</span> username = <span class="string">"Phantom"</span>;</span><br><span class="line"><span class="keyword">const</span> age = <span class="number">26</span>;</span><br><span class="line"><span class="keyword">const</span> iscol = <span class="literal">true</span>;</span><br><span class="line"><span class="keyword">const</span> x = <span class="literal">null</span>; <span class="comment">// defined but empty; typeof reports "object" (a historical quirk)</span></span><br><span class="line"><span class="keyword">const</span> y = <span class="literal">undefined</span>; <span class="comment">// no value has been assigned</span></span><br><span class="line"><span class="keyword">const</span> n = <span class="title class_">Symbol</span>();<span class="comment">// a unique value, mainly used as a unique object property key</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(<span class="keyword">typeof</span> x); <span class="comment">// typeof checks the type and prints "object" here (beware of quirks such as null above, and of hoisting)</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// 
null is not actually an Object</span></span><br><span class="line"><span class="keyword">const</span> z = <span class="literal">null</span>;</span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(z <span class="keyword">instanceof</span> <span class="title class_">Object</span>); <span class="comment">// note the capital O; prints false</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(z === <span class="literal">null</span>); <span class="comment">// prints true</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// Symbols guarantee unique property keys and avoid collisions with other properties</span></span><br><span class="line"><span class="keyword">const</span> symbol1 = <span class="title class_">Symbol</span>(<span class="string">'description'</span>);</span><br><span class="line"><span class="keyword">const</span> symbol2 = <span class="title class_">Symbol</span>(<span class="string">'description'</span>);</span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(symbol1 === symbol2); <span class="comment">// prints false</span></span><br></pre></td></tr></table></figure></div><h2 id="3-对象和解构赋值"><a href="#3-对象和解构赋值" class="headerlink" title="3. 对象和解构赋值"></a>3. 
Objects and Destructuring Assignment</h2><div class="story post-story"><p>One thing to be clear about: like Python, JavaScript is an <strong>object-oriented</strong> programming language. A JavaScript object literal is delimited by curly braces; inside the braces, properties are defined as name-value pairs separated by commas.</p><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Create an object</span></span><br><span class="line"><span class="keyword">const</span> person = {</span><br><span class="line">  <span class="attr">name</span>: <span class="string">'Phantom'</span>,</span><br><span class="line">  <span class="attr">age</span>: <span class="number">26</span>,</span><br><span class="line">  <span class="attr">hobbies</span>: [<span class="string">"music"</span>, <span class="string">"sing"</span>, <span class="string">"dance"</span>, <span class="string">"basketball"</span>],</span><br><span class="line">  <span class="attr">address</span>: {</span><br><span class="line">    <span class="attr">street</span>: <span class="string">"tarim street"</span>,</span><br><span class="line">    <span class="attr">city</span>: <span class="string">"tarim"</span>,</span><br><span class="line">  },</span><br><span class="line">  <span class="title function_">sayHello</span>(<span class="params"></span>) {</span><br><span class="line">    <span class="variable language_">console</span>.<span class="title function_">log</span>(<span class="string">`Hello, my name is <span class="subst">${<span class="variable 
language_">this</span>.name}</span>.`</span>);</span><br><span class="line">  }</span><br><span class="line">};</span><br><span class="line"></span><br><span class="line"><span class="comment">// Call the object's method and read its properties</span></span><br><span class="line">person.<span class="title function_">sayHello</span>(); <span class="comment">// prints Hello, my name is Phantom.</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(person.<span class="property">name</span>, person.<span class="property">age</span>); <span class="comment">// prints Phantom 26</span></span><br></pre></td></tr></table></figure><p>JavaScript has an interesting extension to assignment called <strong>destructuring assignment</strong>. It pattern-matches against an array or an object and assigns each variable the value found at the matching position or under the matching key:</p><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Array destructuring</span></span><br><span class="line"><span class="keyword">let</span> [a, b, c] = [<span class="number">1</span>, <span class="number">2</span>, <span class="number">3</span>];</span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(a, b, c); <span class="comment">// prints 1 2 3</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// 
Object destructuring</span></span><br><span class="line"><span class="keyword">const</span> person = {</span><br><span class="line">  <span class="attr">firstname</span>: <span class="string">"Alex"</span>,</span><br><span class="line">  <span class="attr">lastname</span>: <span class="string">"Xie"</span>,</span><br><span class="line">  <span class="attr">age</span>: <span class="number">18</span>,</span><br><span class="line">  <span class="attr">hobbies</span>: [<span class="string">"music"</span>, <span class="string">"sing"</span>, <span class="string">"dance"</span>, <span class="string">"basketball"</span>],</span><br><span class="line">  <span class="attr">address</span>: {</span><br><span class="line">    <span class="attr">street</span>: <span class="string">"tarim street"</span>,</span><br><span class="line">    <span class="attr">city</span>: <span class="string">"tarim"</span>,</span><br><span class="line">  },</span><br><span class="line">};</span><br><span class="line"><span class="keyword">const</span> {</span><br><span class="line">  firstname,</span><br><span class="line">  lastname,</span><br><span class="line">  <span class="attr">address</span>: { city },</span><br><span class="line">} = person; <span class="comment">// pulls values out of person into variables of the same name</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(city); <span class="comment">// prints tarim</span></span><br></pre></td></tr></table></figure><p>Here, destructuring extracts the <code>firstname</code> and <code>lastname</code> properties from the <code>person</code> object, and destructures the nested <code>address</code> property to obtain a <code>city</code> variable. Because the <code>address</code> property is itself an object and only its <code>city</code> property is of interest, the nested pattern extracts <code>city</code> directly; no <code>address</code> variable is created.</p><p>In effect, each value is extracted from the <code>person</code> object into a variable with the same name as its property.</p></div><h2 id="4-常用的内置方法"><a href="#4-常用的内置方法" class="headerlink" title="4. 常用的内置方法"></a>4. 
Common Built-in Methods</h2><div class="story post-story"><p>There are a great many built-in methods, so only common ones are listed here. Consult the reference documentation when you actually need one.</p><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Built-in string methods</span></span><br><span class="line"><span class="keyword">const</span> username = <span class="string">"Phantom"</span>;</span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(username.<span class="property">length</span>); <span class="comment">// .length is a property (no parentheses); prints 7</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(username.<span class="title function_">toUpperCase</span>()); <span class="comment">// with parentheses it is a method; prints PHANTOM</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(username.<span class="title function_">substring</span>(<span class="number">0</span>, <span class="number">3</span>).<span class="title function_">toUpperCase</span>()); <span class="comment">// prints PHA</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(username.<span class="title function_">includes</span>(<span 
class="string">"P"</span>)); <span class="comment">// checks whether the string contains the argument; prints true</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(username.<span class="title function_">startsWith</span>(<span class="string">"n"</span>)); <span class="comment">// checks whether the string starts with the argument (note the camelCase method name); prints false</span></span><br><span class="line"><span class="keyword">const</span> test_txt = <span class="string">"p h a n t o m"</span>;</span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(test_txt.<span class="title function_">split</span>(<span class="string">" "</span>)); <span class="comment">// splits the string into an array, just like Python's .split(); prints ['p', 'h', 'a', 'n', 't', 'o', 'm']</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// Built-in array methods</span></span><br><span class="line"><span class="keyword">const</span> numbers = <span class="keyword">new</span> <span class="title class_">Array</span>(<span class="number">1</span>, <span class="number">2</span>, <span class="number">3</span>, <span class="number">4</span>, <span class="number">5</span>); <span class="comment">// create an array with the constructor</span></span><br><span class="line"><span class="keyword">const</span> fruits = [<span class="string">"apple"</span>, <span class="string">"banana"</span>]; <span class="comment">// declare an array directly with square brackets</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(numbers); <span class="comment">// prints [1, 2, 3, 4, 5]</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(fruits[<span class="number">1</span>]); <span class="comment">// bracket indexing, same as Python; prints banana</span></span><br><span class="line"></span><br><span class="line">fruits.<span class="title function_">push</span>(<span class="string">"watermelon"</span>); <span class="comment">// .push() 
appends an element to the array, equivalent to Python's list.append()</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(fruits); <span class="comment">// prints ['apple', 'banana', 'watermelon']</span></span><br><span class="line">fruits.<span class="title function_">pop</span>(); <span class="comment">// removes the last element</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(<span class="title class_">Array</span>.<span class="title function_">isArray</span>(fruits)); <span class="comment">// checks whether the value is an array; prints true</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(fruits.<span class="title function_">indexOf</span>(<span class="string">"apple"</span>)); <span class="comment">// returns the index of the element; prints 0</span></span><br></pre></td></tr></table></figure></div><h2 id="5-条件语句"><a href="#5-条件语句" class="headerlink" title="5. 条件语句"></a>5. Conditional Statements</h2><div class="story post-story"><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span 
class="line">29</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// if statement</span></span><br><span class="line"><span class="keyword">const</span> n = <span class="number">20</span>;</span><br><span class="line"><span class="keyword">const</span> m = <span class="number">10</span>;</span><br><span class="line"><span class="comment">// === compares both type and value; == converts types before comparing</span></span><br><span class="line"><span class="keyword">if</span> (n === <span class="number">10</span>) {</span><br><span class="line">  <span class="variable language_">console</span>.<span class="title function_">log</span>(<span class="string">"n is 10"</span>);</span><br><span class="line">} <span class="keyword">else</span> <span class="keyword">if</span> (n > <span class="number">10</span> || m > <span class="number">10</span>) {</span><br><span class="line">  <span class="comment">// in conditions, "or" is written || and "and" is written &&</span></span><br><span class="line">  <span class="variable language_">console</span>.<span class="title function_">log</span>(<span class="string">"n is greater than 10 or m is greater than 10"</span>);</span><br><span class="line">} <span class="keyword">else</span> {</span><br><span class="line">  <span class="variable language_">console</span>.<span class="title function_">log</span>(<span class="string">"n is less than 10 and m is at most 10"</span>);</span><br><span class="line">} <span class="comment">// prints n is greater than 10 or m is greater than 10</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// Ternary operator (a shorthand for if/else)</span></span><br><span class="line"><span class="keyword">const</span> z = <span class="number">11</span>;</span><br><span class="line"><span class="keyword">const</span> color = z > <span class="number">10</span> ? 
<span class="string">"red"</span> : <span class="string">"blue"</span>; <span class="comment">// the value after ? is used when the condition is true, the value after : when it is false</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(color); <span class="comment">// prints red</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// Another conditional statement: switch</span></span><br><span class="line"><span class="keyword">switch</span> (color) {</span><br><span class="line">  <span class="keyword">case</span> <span class="string">"red"</span>: <span class="comment">// note that case matching uses ===</span></span><br><span class="line">    <span class="variable language_">console</span>.<span class="title function_">log</span>(<span class="string">"color is red"</span>);</span><br><span class="line">    <span class="keyword">break</span>; <span class="comment">// break closes the case branch, much as fi closes an if in shell</span></span><br><span class="line">  <span class="keyword">case</span> <span class="string">"blue"</span>:</span><br><span class="line">    <span class="variable language_">console</span>.<span class="title function_">log</span>(<span class="string">"color is blue"</span>);</span><br><span class="line">    <span class="keyword">break</span>;</span><br><span class="line">  <span class="attr">default</span>: <span class="comment">// runs the block below when no case matches</span></span><br><span class="line">    <span class="variable language_">console</span>.<span class="title function_">log</span>(<span class="string">"color is not red or blue"</span>);</span><br><span class="line">} <span class="comment">// prints color is red</span></span><br></pre></td></tr></table></figure></div><h2 id="6-循环语句"><a href="#6-循环语句" class="headerlink" title="6. 循环语句"></a>6. 
Loops</h2><div class="story post-story"><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// for loop</span></span><br><span class="line"><span class="keyword">for</span> (<span class="keyword">let</span> i = <span class="number">0</span>; i < <span class="number">10</span>; i++) {</span><br><span class="line">  <span class="variable language_">console</span>.<span class="title function_">log</span>(<span class="string">`for loop number: <span class="subst">${i}</span>`</span>); <span class="comment">// similar to a C-style for loop in shell; note the backticks of the template literal</span></span><br><span class="line">} <span class="comment">// prints for loop number: 0 through 9 (10 lines)</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// while loop: unlike for, the counter must be updated inside the loop body, otherwise the loop never ends</span></span><br><span class="line"><span class="keyword">let</span> l = <span class="number">0</span>;</span><br><span class="line"><span class="keyword">while</span> (l < <span class="number">10</span>) {</span><br><span class="line">  <span class="variable language_">console</span>.<span class="title function_">log</span>(<span class="string">`while loop number: <span class="subst">${l}</span>`</span>);</span><br><span class="line">  l++;</span><br><span 
class="line">} <span class="comment">// prints while loop number: 0 through 9 (10 lines)</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">const</span> array = [<span class="number">1</span>, <span class="number">2</span>, <span class="number">3</span>];</span><br><span class="line"><span class="keyword">for</span> (<span class="keyword">let</span> i = <span class="number">0</span>; i < array.<span class="property">length</span>; i++) {</span><br><span class="line">  <span class="comment">// for loop over an array by index</span></span><br><span class="line">  <span class="variable language_">console</span>.<span class="title function_">log</span>(array[i]);</span><br><span class="line">} <span class="comment">// prints 1 to 3 (3 lines)</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> (<span class="keyword">let</span> i <span class="keyword">of</span> array) {</span><br><span class="line">  <span class="comment">// similar to Python's for i in list</span></span><br><span class="line">  <span class="variable language_">console</span>.<span class="title function_">log</span>(i);</span><br><span class="line">} <span class="comment">// prints 1 to 3 (3 lines)</span></span><br></pre></td></tr></table></figure><p>As in Python, <code>continue</code> and <code>break</code> control the flow of a loop:</p><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// continue 
skips the rest of the current iteration and moves on to the next one</span></span><br><span class="line"><span class="keyword">for</span> (<span class="keyword">let</span> i = <span class="number">0</span>; i < <span class="number">5</span>; i++) {</span><br><span class="line">  <span class="keyword">if</span> (i === <span class="number">2</span>) {</span><br><span class="line">    <span class="keyword">continue</span>;</span><br><span class="line">  }</span><br><span class="line">  <span class="variable language_">console</span>.<span class="title function_">log</span>(i);</span><br><span class="line">} <span class="comment">// prints 0 1 3 4 (4 lines)</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// break terminates the loop entirely and leaves the loop body</span></span><br><span class="line"><span class="keyword">let</span> i = <span class="number">0</span>;</span><br><span class="line"><span class="keyword">while</span> (<span class="literal">true</span>) {</span><br><span class="line">  <span class="keyword">if</span> (i === <span class="number">3</span>) {</span><br><span class="line">    <span class="keyword">break</span>;</span><br><span class="line">  }</span><br><span class="line">  <span class="variable language_">console</span>.<span class="title function_">log</span>(i);</span><br><span class="line">  i++;</span><br><span class="line">} <span class="comment">// prints 0 1 2 (3 lines)</span></span><br></pre></td></tr></table></figure></div><h2 id="7-类和构造函数"><a href="#7-类和构造函数" class="headerlink" title="7. 类和构造函数"></a>7. 
Classes and constructors</h2><div class="story post-story"><p>JavaScript also uses the <code>class</code> keyword to declare classes, and class names conventionally start with an uppercase letter. Methods are written inside the class body with a shorthand syntax that needs no <code>function</code> keyword, and unlike Python methods they take no explicit <code>self</code> parameter.</p><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">MyClass</span> {</span><br><span class="line"> <span class="title function_">myMethod</span>(<span class="params"></span>) {</span><br><span class="line"> <span class="comment">// method body</span></span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>Python initializes instances in a method named <code>__init__()</code>; the JavaScript counterpart is a special method named <code>constructor</code>. <strong>Python</strong>'s <code>__init__()</code> can accept any number of parameters, <strong>but self (a reference to the instance) must come first</strong>. Inside it, the parameters are used to initialize the instance's attributes. A <strong>JavaScript</strong> constructor receives its values through ordinary function parameters and <strong>has no special self parameter</strong>; the instance is reached through <code>this</code> instead.</p><p>The two snippets below show the difference:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Python constructor</span></span><br><span class="line"><span class="keyword">class</span> <span class="title class_">MyClass</span>:</span><br><span class="line"> <span class="keyword">def</span> <span class="title function_">__init__</span>(<span class="params">self, arg1, arg2</span>):</span><br><span class="line"> self.arg1 = arg1</span><br><span class="line"> self.arg2 = arg2</span><br><span class="line"></span><br><span class="line">my_instance = MyClass(<span 
class="string">"Hello"</span>, <span class="string">"World"</span>)</span><br><span class="line"><span class="built_in">print</span>(my_instance.arg1) <span class="comment"># 输出:Hello</span></span><br><span class="line"><span class="built_in">print</span>(my_instance.arg2) <span class="comment"># 输出:World</span></span><br></pre></td></tr></table></figure><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// javascript的构造函数</span></span><br><span class="line"><span class="keyword">class</span> <span class="title class_">MyClass</span> {</span><br><span class="line"> <span class="title function_">constructor</span>(<span class="params">arg1, arg2</span>) {</span><br><span class="line"> <span class="variable language_">this</span>.<span class="property">arg1</span> = arg1;</span><br><span class="line"> <span class="variable language_">this</span>.<span class="property">arg2</span> = arg2;</span><br><span class="line"> }</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">const</span> myInstance = <span class="keyword">new</span> <span class="title class_">MyClass</span>(<span class="string">"Hello"</span>, <span class="string">"World"</span>);</span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(myInstance.<span class="property">arg1</span>); <span class="comment">// 输出:Hello</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(myInstance.<span 
class="property">arg2</span>); <span class="comment">// Output: World</span></span><br></pre></td></tr></table></figure><p>The <code>new</code> keyword creates an instance of a class. In the example above, <code>new</code> calls the constructor of <code>MyClass</code> with the arguments <code>"Hello"</code> and <code>"World"</code>, binds the constructor's <code>this</code> to the newly created object, and finally returns that new <code>MyClass</code> instance, which is stored in <code>myInstance</code>.</p><p>A <code>constructor</code> does not have to be declared explicitly; a class without one gets a default constructor. The key point is that a constructor is the special function that creates and initializes an object. <strong>In JavaScript, a function that is invoked with <code>new</code> to create an object is acting as a constructor.</strong></p><p><strong>Python's <code>__init__()</code> must not explicitly return a value</strong> (returning anything other than <code>None</code> raises a <code>TypeError</code>); it only initializes instance state. <strong>A JavaScript constructor, by contrast, may explicitly return an object</strong>. If it does, that object becomes the result of the <code>new</code> expression; otherwise the newly created instance is returned, as in the example above.</p><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// A constructor that returns an object</span></span><br><span class="line"><span class="keyword">class</span> <span class="title class_">MyClass</span> {</span><br><span class="line"> <span class="title function_">constructor</span>(<span class="params"></span>) {</span><br><span class="line"> <span class="variable language_">this</span>.<span class="property">property</span> = <span class="string">"Value"</span>;</span><br><span class="line"> <span class="keyword">return</span> { <span class="attr">customProperty</span>: <span class="string">"Custom Value"</span> };</span><br><span class="line"> }</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">const</span> myInstance = <span class="keyword">new</span> <span class="title class_">MyClass</span>();</span><br><span class="line"><span class="variable 
language_">console</span>.<span class="title function_">log</span>(myInstance.<span class="property">property</span>); <span class="comment">// Output: undefined</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(myInstance.<span class="property">customProperty</span>); <span class="comment">// Output: "Custom Value"</span></span><br></pre></td></tr></table></figure><p>In the example above, the constructor uses a <code>return</code> statement to return the object <code>{ customProperty: "Custom Value" }</code>. When <code>MyClass</code> is instantiated with <code>new</code>, the result is the explicitly returned object rather than the newly created instance, so it has a <code>customProperty</code> property but no <code>property</code> property.</p></div><h2 id="8-封装、继承和多态"><a href="#8-封装、继承和多态" class="headerlink" title="8. 封装、继承和多态"></a>8. Encapsulation, inheritance, and polymorphism</h2><div class="story post-story"><p>JavaScript supports object-oriented programming, so it offers the three classic object-oriented features: encapsulation, inheritance, and polymorphism.</p><h3 id="8-1-封装(Encapsulation)"><a href="#8-1-封装(Encapsulation)" class="headerlink" title="8.1 封装(Encapsulation)"></a>8.1 Encapsulation</h3><ul><li>Objects and functions: objects and functions in JavaScript can encapsulate data and behavior. An object can hold properties and methods; a function wraps a piece of executable code and can take parameters and return a value.</li><li>Access control: <strong>closures</strong> or a <strong>WeakMap</strong> can be used to emulate private properties and methods, hiding data behind a public interface.</li></ul><p>Python marks members as private with a leading <code>_</code> or <code>__</code>. In ES6 JavaScript these prefixes have no special meaning to the language, so privacy is emulated with other mechanisms (native <code>#</code> private class fields were only standardized later, in ES2022):</p><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span 
class="comment">// 使用闭包,简单来说,闭包是指一个函数能够访问并操作其作用域外部的变量</span></span><br><span class="line"><span class="keyword">function</span> <span class="title function_">MyFunction</span>(<span class="params"></span>) {</span><br><span class="line"> <span class="keyword">let</span> privateProperty = <span class="string">"Private Value"</span>;</span><br><span class="line"> </span><br><span class="line"> <span class="variable language_">this</span>.<span class="property">getPrivateProperty</span> = <span class="keyword">function</span>(<span class="params"></span>) {</span><br><span class="line"> <span class="keyword">return</span> privateProperty;</span><br><span class="line"> };</span><br><span class="line"> </span><br><span class="line"> <span class="variable language_">this</span>.<span class="property">setPrivateProperty</span> = <span class="keyword">function</span>(<span class="params">value</span>) {</span><br><span class="line"> privateProperty = value;</span><br><span class="line"> };</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">const</span> myInstance = <span class="keyword">new</span> <span class="title class_">MyFunction</span>(); <span class="comment">// 创建实例对象,执行构造函数代码,返回一个新的对象</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(myInstance.<span class="property">privateProperty</span>); <span class="comment">// 构造函数内部的局部变量,外部不可访问,输出:undefined</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(myInstance.<span class="title function_">getPrivateProperty</span>()); <span class="comment">// 输出:"Private Value"</span></span><br><span class="line">myInstance.<span class="title function_">setPrivateProperty</span>(<span class="string">"New Value"</span>);</span><br><span class="line"><span class="variable language_">console</span>.<span class="title 
function_">log</span>(myInstance.<span class="title function_">getPrivateProperty</span>()); <span class="comment">// 输出:"New Value"</span></span><br></pre></td></tr></table></figure><p>在上述示例中,我们使用闭包创建了一个私有变量<code>privateProperty</code>。通过在构造函数内部定义公有方法<code>getPrivateProperty</code>和<code>setPrivateProperty</code>,我们可以访问和修改私有属性(内部的两个方法访问修改了外部函数的变量)。</p><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// 使用WeakMap</span></span><br><span class="line"><span class="keyword">const</span> privateProperties = <span class="keyword">new</span> <span class="title class_">WeakMap</span>();</span><br><span class="line"></span><br><span class="line"><span class="keyword">class</span> <span class="title class_">MyClass</span> {</span><br><span class="line"> <span class="title function_">constructor</span>(<span class="params"></span>) {</span><br><span class="line"> privateProperties.<span class="title function_">set</span>(<span class="variable language_">this</span>, { <span class="attr">privateProperty</span>: <span class="string">"Private Value"</span> });</span><br><span class="line"> }<span class="comment">// WeakMap用法set(key,value)</span></span><br><span class="line"> </span><br><span 
class="line"> <span class="title function_">getPrivateProperty</span>(<span class="params"></span>) {</span><br><span class="line"> <span class="keyword">return</span> privateProperties.<span class="title function_">get</span>(<span class="variable language_">this</span>).<span class="property">privateProperty</span>;</span><br><span class="line"> }</span><br><span class="line"> </span><br><span class="line"> <span class="title function_">setPrivateProperty</span>(<span class="params">value</span>) {</span><br><span class="line"> privateProperties.<span class="title function_">get</span>(<span class="variable language_">this</span>).<span class="property">privateProperty</span> = value;</span><br><span class="line"> }</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">const</span> myInstance = <span class="keyword">new</span> <span class="title class_">MyClass</span>();</span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(myInstance.<span class="property">privateProperty</span>); <span class="comment">// 输出:undefined</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(myInstance.<span class="title function_">getPrivateProperty</span>()); <span class="comment">// 输出:"Private Value"</span></span><br><span class="line">myInstance.<span class="title function_">setPrivateProperty</span>(<span class="string">"New Value"</span>);</span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(myInstance.<span class="title function_">getPrivateProperty</span>()); <span class="comment">// 输出:"New 
Value"</span></span><br></pre></td></tr></table></figure><p><code>Map</code> and <code>WeakMap</code> are both collection types introduced in ES6 that store key-value pairs. A <code>Map</code> accepts keys of any type. A <code>WeakMap</code> in ES6 only accepts objects as keys and holds them weakly: once a key object is no longer referenced anywhere else, its entry can be garbage-collected, so the stored private data does not keep instances alive.</p><p>The <code>WeakMap</code> instance <code>privateProperties</code> is created outside the class. In the constructor, <code>privateProperties.set(this, { privateProperty: "Private Value" })</code> stores the current instance as the key and an object holding the private property <code>privateProperty</code> as the value.</p><p>The <code>getPrivateProperty</code> and <code>setPrivateProperty</code> methods, defined on the class prototype, then read and modify the private property stored in the <code>WeakMap</code>.</p><h3 id="8-2-继承(Inheritance)"><a href="#8-2-继承(Inheritance)" class="headerlink" title="8.2 继承(Inheritance)"></a>8.2 Inheritance</h3><p>Use the <code>extends</code> keyword to create a subclass, and the <code>super</code> keyword to call the parent class's constructor and methods.</p><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">Animal</span> {</span><br><span class="line"> <span class="title 
function_">constructor</span>(<span class="params">name</span>) {</span><br><span class="line"> <span class="variable language_">this</span>.<span class="property">name</span> = name;</span><br><span class="line"> }</span><br><span class="line"> </span><br><span class="line"> <span class="title function_">speak</span>(<span class="params"></span>) {</span><br><span class="line"> <span class="variable language_">console</span>.<span class="title function_">log</span>(<span class="string">`<span class="subst">${<span class="variable language_">this</span>.name}</span> makes a sound.`</span>);</span><br><span class="line"> }</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">class</span> <span class="title class_">Dog</span> <span class="keyword">extends</span> <span class="title class_ inherited__">Animal</span> {</span><br><span class="line"> <span class="title function_">constructor</span>(<span class="params">name, breed</span>) {</span><br><span class="line"> <span class="variable language_">super</span>(name); <span class="comment">// 调用父类的构造函数</span></span><br><span class="line"> <span class="variable language_">this</span>.<span class="property">breed</span> = breed;</span><br><span class="line"> }</span><br><span class="line"> </span><br><span class="line"> <span class="title function_">speak</span>(<span class="params"></span>) {</span><br><span class="line"> <span class="variable language_">console</span>.<span class="title function_">log</span>(<span class="string">`<span class="subst">${<span class="variable language_">this</span>.name}</span> barks.`</span>);</span><br><span class="line"> }</span><br><span class="line"> </span><br><span class="line"> <span class="title function_">fetch</span>(<span class="params"></span>) {</span><br><span class="line"> <span class="variable language_">console</span>.<span class="title function_">log</span>(<span class="string">`<span class="subst">${<span 
class="variable language_">this</span>.name}</span> fetches the ball.`</span>);</span><br><span class="line"> }</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">const</span> myDog = <span class="keyword">new</span> <span class="title class_">Dog</span>(<span class="string">"Buddy"</span>, <span class="string">"Golden Retriever"</span>);</span><br><span class="line">myDog.<span class="title function_">speak</span>(); <span class="comment">// Output: "Buddy barks."</span></span><br><span class="line">myDog.<span class="title function_">fetch</span>(); <span class="comment">// Output: "Buddy fetches the ball."</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(myDog.<span class="property">name</span>); <span class="comment">// Output: "Buddy"</span></span><br><span class="line"><span class="variable language_">console</span>.<span class="title function_">log</span>(myDog.<span class="property">breed</span>); <span class="comment">// Output: "Golden Retriever"</span></span><br></pre></td></tr></table></figure><p>The example above has a parent class <code>Animal</code> and a subclass <code>Dog</code>. <code>Dog</code> inherits from <code>Animal</code> with the <code>extends</code> keyword (Python has no such keyword; it lists the parent class in parentheses instead, as in <code>class Dog(Animal):</code>). In its constructor, <code>Dog</code> calls the parent constructor with <code>super(name)</code>, passing the <code>name</code> argument. In a subclass constructor, <code>super()</code> must be called before <code>this</code> is used.</p><p>A subclass can override methods of its parent class; the example above overrides the parent's <code>speak()</code> method. Inside the subclass, <code>this</code> refers to the current instance and gives access to its properties and methods, including those inherited from the parent class.</p><h3 id="8-3-多态(Polymorphism)"><a href="#8-3-多态(Polymorphism)" class="headerlink" title="8.3 多态(Polymorphism)"></a>8.3 Polymorphism</h3><p>JavaScript is a dynamically typed language: the type of value a variable holds can change at runtime, and a method call is resolved against the actual object at runtime. This makes polymorphism natural: different objects can respond to the same method call in different ways, with the object's actual type determining which implementation runs (<strong>same interface, different implementations</strong>).</p><p>For example, suppose we have an <code>Animal</code> class with a <code>makeSound</code> method that produces the animal's sound. We then derive two subclasses, <code>Dog</code> and <code>Cat</code>, each of which overrides <code>makeSound</code> to produce a different sound.</p><figure class="highlight javascript"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">Animal</span> {</span><br><span class="line"> <span class="title function_">makeSound</span>(<span class="params"></span>) {</span><br><span class="line"> <span class="variable language_">console</span>.<span class="title function_">log</span>(<span class="string">"The animal makes a sound"</span>);</span><br><span class="line"> }</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">class</span> <span class="title class_">Dog</span> <span class="keyword">extends</span> <span class="title class_ inherited__">Animal</span> {</span><br><span class="line"> <span class="title function_">makeSound</span>(<span class="params"></span>) {</span><br><span class="line"> <span class="variable language_">console</span>.<span class="title function_">log</span>(<span class="string">"The dog barks"</span>);</span><br><span class="line"> }</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">class</span> <span class="title class_">Cat</span> <span class="keyword">extends</span> <span class="title class_ inherited__">Animal</span> {</span><br><span class="line"> <span class="title function_">makeSound</span>(<span class="params"></span>) {</span><br><span class="line"> <span class="variable 
language_">console</span>.<span class="title function_">log</span>(<span class="string">"The cat meows"</span>);</span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>现在,我们可以创建不同的对象并调用它们的 <code>makeSound</code> 方法,它们会根据自己的实现发出不同的声音。</p><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">const</span> animal = <span class="keyword">new</span> <span class="title class_">Animal</span>();</span><br><span class="line"><span class="keyword">const</span> dog = <span class="keyword">new</span> <span class="title class_">Dog</span>();</span><br><span class="line"><span class="keyword">const</span> cat = <span class="keyword">new</span> <span class="title class_">Cat</span>();</span><br><span class="line"></span><br><span class="line">animal.<span class="title function_">makeSound</span>(); <span class="comment">// 输出:"The animal makes a sound"</span></span><br><span class="line">dog.<span class="title function_">makeSound</span>(); <span class="comment">// 输出:"The dog barks"</span></span><br><span class="line">cat.<span class="title function_">makeSound</span>(); <span class="comment">// 输出:"The cat meows"</span></span><br></pre></td></tr></table></figure><p>本质上就是子类对父类的继承方法的重写,实现调用一个接口具有不同的实现。</p><p>对于第8部分可能有些难以理解,可以结合前面python的面向对象编程笔记:</p><p><a href="https://www.shelven.com/2022/11/25/a.html">python自学笔记(3)——面向对象编程(上) - 我的小破站 (shelven.com)</a></p><p><a href="https://www.shelven.com/2022/11/26/a.html">python自学笔记(4)——面向对象编程(下) - 我的小破站 (shelven.com)</a></p><p>JavaScript还有很多高级用法,这里只记录了入门的一些基础,以后要用到再深入了解。</p></div>]]></content>
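<summary type="html"><p>These notes record my introduction to ES6 JavaScript. They cover only the basics I have learned so far; when actually using the language, consult the documentation frequently.</p></summary>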
<category term="个人主页" scheme="http://www.shelven.com/categories/%E4%B8%AA%E4%BA%BA%E4%B8%BB%E9%A1%B5/"/>
<category term="编程自学" scheme="http://www.shelven.com/categories/%E7%BC%96%E7%A8%8B%E8%87%AA%E5%AD%A6/"/>
<category term="JavaScript" scheme="http://www.shelven.com/tags/JavaScript/"/>
</entry>
<entry>
<title>Getting Started with HTML, CSS, and JavaScript (2): CSS Basics</title>
<link href="http://www.shelven.com/2023/09/30/a.html"/>
<id>http://www.shelven.com/2023/09/30/a.html</id>
<published>2023-09-30T10:15:00.000Z</published>
<updated>2023-09-30T10:24:05.000Z</updated>
<content type="html"><![CDATA[<p>The previous note introduced the basics of HTML; this one records the basics of CSS.</p><span id="more"></span><h2 id="1-CSS的引入方式"><a href="#1-CSS的引入方式" class="headerlink" title="1. CSS的引入方式"></a>1. Ways to include CSS</h2><div class="story post-story"><h3 id="1-1-内联样式"><a href="#1-1-内联样式" class="headerlink" title="1.1 内联样式"></a>1.1 Inline styles</h3><p>Write the CSS directly into an HTML element's <code>style</code> attribute; it applies only to that element:</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h1</span>></span>An h1 tag without CSS<span class="tag"></<span class="name">h1</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h1</span> <span class="attr">style</span>=<span class="string">"color:brown"</span>></span>An h1 tag styled with inline CSS<span class="tag"></<span class="name">h1</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230930/1.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230930/1.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>As the screenshot shows, CSS added as an inline style takes effect only on the current HTML element and does not affect other elements.</p><h3 id="1-2-内部样式表"><a href="#1-2-内部样式表" class="headerlink" title="1.2 内部样式表"></a>1.2 Internal style sheet</h3><p>Write CSS rules inside a <code><style></code> tag in the <code><head></code> section; each rule applies to every element on the page that its selector matches:</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span 
class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">head</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">style</span>></span><span class="language-css"></span></span><br><span class="line"><span class="language-css"> <span class="selector-tag">h1</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">color</span>: darkslateblue;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"> </span><span class="tag"></<span class="name">style</span>></span></span><br><span class="line"> </span><br><span class="line"><span class="tag"></<span class="name">head</span>></span></span><br><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h1</span>></span>This is the first h1 tag<span class="tag"></<span class="name">h1</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h1</span>></span>This is the second h1 tag<span class="tag"></<span class="name">h1</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230930/2.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230930/2.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><h3 id="1-3-外部样式表"><a href="#1-3-外部样式表" class="headerlink" title="1.3 外部样式表"></a>1.3 External style sheet</h3><p>As the name suggests, the styles come from an external CSS file, which is referenced from the <code><head></code> section with a <code><link></code> tag (the editor's Emmet abbreviation <code>link:css</code> expands to it). First create a <code>style.css</code> file with the following content:</p><figure class="highlight css"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td 
class="code"><pre><span class="line"><span class="selector-tag">h1</span> {</span><br><span class="line"> <span class="attribute">color</span>: darkkhaki;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>回到html文件,在<code><head></code>区域内输入link,在提示的第三行出现<code>link:css</code>,选中回车就自动引入了我们创建的css文件:</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">head</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">link</span> <span class="attr">rel</span>=<span class="string">"stylesheet"</span> <span class="attr">href</span>=<span class="string">"style.css"</span>></span></span><br><span class="line"><span class="tag"></<span class="name">head</span>></span></span><br><span class="line"></span><br><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h1</span>></span>这是第一个h1标签<span class="tag"></<span class="name">h1</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h1</span>></span>这是第二个h1标签<span class="tag"></<span class="name">h1</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230930/3.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230930/3.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>还有种外部样式表引用方式,是在<code><head></code>标签的<code><style></code>区域内用<code>@import</code>的方式引入:</p><figure class="highlight html"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">head</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">style</span>></span><span class="language-xml"></span></span><br><span class="line"><span class="language-xml"> <span class="comment">/* Note the trailing semicolon; @import is CSS's own mechanism and must come before any other rules */</span></span></span><br><span class="line"><span class="language-xml"> @import url("style.css");</span></span><br><span class="line"><span class="language-xml"> </span><span class="tag"></<span class="name">style</span>></span></span><br><span class="line"><span class="tag"></<span class="name">head</span>></span></span><br><span class="line"></span><br><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h1</span>></span>This is the first h1 tag<span class="tag"></<span class="name">h1</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h1</span>></span>This is the second h1 tag<span class="tag"></<span class="name">h1</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p>The result is the same as above, so no screenshot is included. Note that <code>@import</code> has drawbacks: <strong>the browser only discovers the imported file after it has parsed the style sheet that contains the <code>@import</code> rule, which can delay page rendering</strong>, and imported files tend to be fetched one after another rather than in parallel, which can hurt page performance.</p><p>For these reasons CSS files are usually included with the <code><link></code> tag, which generally loads faster and is also more widely compatible.</p></div><h2 id="2-CSS引入的优先级"><a href="#2-CSS引入的优先级" class="headerlink" title="2. CSS引入的优先级"></a>2. 
Priority of CSS inclusion</h2><div class="story post-story"><p>Three ways of including CSS were covered above. If one element is styled by CSS that arrives through different methods or in a different order, which style wins in the end? That is decided by style priority.</p><p>Browsers give HTML elements default styles, which apply as long as no other style overrides them (press F12 on a page, open the Elements panel and select body to see the <em>user agent stylesheet</em>, that is, the styles defined by the browser itself).</p><p>Style priority follows these basic rules:</p><ul><li><ol><li><code>!important</code> has the highest priority and overrides every normal declaration in the CSS. It is not recommended, because it breaks the normal cascade of your stylesheets and makes them hard to debug.</li></ol></li><li><ol start="2"><li><strong>Inline styles</strong> take priority over style definitions in internal and external stylesheets.</li></ol></li><li><ol start="3"><li>For rules of equal priority, source order decides: a later declaration overrides an earlier one.</li></ol></li><li><ol start="4"><li>Inherited styles have lower priority than directly specified styles.</li></ol></li></ul><p>What is an inherited style? It means an element picks up certain style properties from its parent, that is, <strong>child elements inherit styles from their parent element</strong>.</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">head</span>></span></span><br><span class="line"><span class="tag"><<span class="name">style</span>></span><span class="language-css"></span></span><br><span class="line"><span class="language-css"> <span class="selector-tag">body</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">font-family</span>: Arial;</span></span><br><span class="line"><span class="language-css"> <span class="attribute">color</span>: burlywood;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"> </span></span><br><span class="line"><span 
class="language-css"> <span class="selector-tag">h1</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">font-size</span>: <span class="number">24px</span>;</span></span><br><span class="line"><span class="language-css"> <span class="attribute">color</span>: blue;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"> </span><span class="tag"></<span class="name">style</span>></span></span><br><span class="line"><span class="tag"></<span class="name">head</span>></span></span><br><span class="line"></span><br><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"><span class="tag"><<span class="name">h1</span>></span>这是标题<span class="tag"></<span class="name">h1</span>></span></span><br><span class="line"><span class="tag"><<span class="name">p</span>></span>这是一个段落。<span class="tag"></<span class="name">p</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p>In the example above we set the <code>font-family</code> and <code>color</code> properties on <code><body></code>, and the child elements <code><h1></code> and <code><p></code> both inherit these two properties. At the same time we set <code>font-size</code> and <code>color</code> directly on <code><h1></code>. So <code><h1></code> still inherits <code>font-family</code> from its parent, but its directly specified <code>color</code> takes priority over the inherited value, and the heading is blue.</p><p><img src="https://www.shelven.com/tuchuang/20230930/4.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230930/4.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Not every style property is inherited; only some of them are. Commonly inherited properties include <code>font-family</code>, <code>color</code>, <code>font-size</code>, <code>font-weight</code> and <code>text-align</code>. Properties such as <code>width</code>, <code>height</code>, <code>margin</code>, <code>padding</code> and <code>background-color</code> are not inherited.</p></div><h2 id="3-类和选择器"><a href="#3-类和选择器" 
class="headerlink" title="3. 类和选择器"></a>3. Classes and selectors</h2><div class="story post-story"><p>To keep code maintainable and readable, avoid <code>!important</code> declarations where possible and do not overuse inline styles. Styles normally go in an external stylesheet and are managed with classes and selectors, which keeps them better organized and easier to maintain and reuse.</p><p>For ease of demonstration, all examples below include their CSS through an internal stylesheet.</p><h3 id="3-1-类(class)"><a href="#3-1-类(class)" class="headerlink" title="3.1 类(class)"></a>3.1 Classes (class)</h3><p>A class is a label that marks a group of elements sharing the same style. By adding the <code>class</code> attribute to HTML elements and defining a matching style rule in CSS, one style can be applied to many elements. In CSS, a class selector is written as <code>.</code> followed by the class name; the <code>class</code> attribute in HTML holds the name without the dot.</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">head</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">style</span>></span><span class="language-css"></span></span><br><span class="line"><span class="language-css"> <span class="selector-class">.highlight</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">background-color</span>: yellow;</span></span><br><span class="line"><span class="language-css"> <span class="attribute">font-weight</span>: bold;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"> </span><span class="tag"></<span class="name">style</span>></span></span><br><span class="line"><span class="tag"></<span class="name">head</span>></span></span><br><span class="line"></span><br><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">p</span> <span 
class="attr">class</span>=<span class="string">"highlight"</span>></span>这是一个带有highlight类的段落。<span class="tag"></<span class="name">p</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">p</span>></span>这是一个普通的段落。<span class="tag"></<span class="name">p</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230930/5.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230930/5.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><h3 id="3-2-选择器(Selector)"><a href="#3-2-选择器(Selector)" class="headerlink" title="3.2 选择器(Selector)"></a>3.2 Selectors (Selector)</h3><p>Selectors pick the HTML elements that a style applies to. Elements can be selected by tag name, class, ID, attribute and so on.</p><ul><li><strong>Element selector</strong>: selects elements by tag name. For example, the <code>p</code> selector selects all <code><p></code> paragraph elements.</li><li><strong>Class selector</strong>: selects elements by class name. For example, the <code>.highlight</code> selector selects all elements that have the highlight class.</li><li><strong>ID selector</strong>: selects an element by its unique ID. For example, the <code>#header</code> selector selects the element whose ID is header.</li><li><strong>Attribute selector</strong>: selects elements by their attributes. For example, the <code>[type="text"]</code> selector selects all elements whose <code>type</code> attribute is "text".</li><li><strong>Universal selector</strong>: written <code>*</code>; it selects every element.</li></ul><p>Selectors also differ in priority (specificity): <strong>ID selector > class selector > element selector</strong>.</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span 
class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">head</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">style</span>></span><span class="language-css"></span></span><br><span class="line"><span class="language-css"> <span class="comment">/* Element selector */</span></span></span><br><span class="line"><span class="language-css"> <span class="selector-tag">p</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">color</span>: blue;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"> </span></span><br><span class="line"><span class="language-css"> <span class="comment">/* Class selector */</span></span></span><br><span class="line"><span class="language-css"> <span class="selector-class">.highlight</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">background-color</span>: yellow;</span></span><br><span class="line"><span class="language-css"> <span class="attribute">font-weight</span>: bold;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"> </span></span><br><span class="line"><span class="language-css"> <span class="comment">/* ID selector 
*/</span></span></span><br><span class="line"><span class="language-css"> <span class="selector-id">#header</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">font-size</span>: <span class="number">20px</span>;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"> </span></span><br><span class="line"><span class="language-css"> <span class="comment">/* Attribute selector */</span></span></span><br><span class="line"><span class="language-css"> <span class="selector-tag">input</span><span class="selector-attr">[type=<span class="string">"text"</span>]</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">border</span>: <span class="number">1px</span> solid gray;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"> </span></span><br><span class="line"><span class="language-css"> <span class="comment">/* Universal selector */</span></span></span><br><span class="line"><span class="language-css"> * {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">font-family</span>: <span class="string">'Times New Roman'</span>, Times, serif;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"> </span><span class="tag"></<span class="name">style</span>></span></span><br><span class="line"><span class="tag"></<span class="name">head</span>></span></span><br><span class="line"></span><br><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">p</span>></span>这是一个普通的段落。<span class="tag"></<span class="name">p</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">p</span> <span class="attr">class</span>=<span 
class="string">"highlight"</span>></span>这是一个带有highlight类的段落。<span class="tag"></<span class="name">p</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">p</span> <span class="attr">id</span>=<span class="string">"header"</span>></span>这是一个带有header ID的段落。<span class="tag"></<span class="name">p</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">input</span> <span class="attr">type</span>=<span class="string">"text"</span> <span class="attr">placeholder</span>=<span class="string">"文本输入框"</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230930/7.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230930/7.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>There are also some more specialized selectors; here are two of them in brief:</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">head</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">style</span>></span><span class="language-css"></span></span><br><span class="line"><span 
class="language-css"> <span class="comment">/* A space between two selectors: the first is the ancestor, the second matches all of its descendants with that tag */</span></span></span><br><span class="line"><span class="language-css"> <span class="selector-tag">div</span> <span class="selector-tag">p</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">color</span>: brown;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"> </span></span><br><span class="line"><span class="language-css"> <span class="comment">/* A comma between two selectors: both are selected */</span></span></span><br><span class="line"><span class="language-css"> <span class="selector-tag">h1</span>,<span class="selector-tag">h2</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">color</span>: aqua;</span></span><br><span class="line"><span class="language-css"> } </span></span><br><span class="line"><span class="language-css"> </span><span class="tag"></<span class="name">style</span>></span></span><br><span class="line"><span class="tag"></<span class="name">head</span>></span></span><br><span class="line"></span><br><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">p</span>></span>这是一个普通的段落<span class="tag"></<span class="name">p</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">div</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">p</span>></span>这是在div父元素下的段落<span class="tag"></<span class="name">p</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">div</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h1</span>></span>这是h1标签<span class="tag"></<span class="name">h1</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h2</span>></span>这是h2标签<span class="tag"></<span 
class="name">h2</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230930/8.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230930/8.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>A space selects all descendant elements; to match only direct children, use the <code>></code> symbol instead.</p></div><h2 id="4-伪类和伪元素"><a href="#4-伪类和伪元素" class="headerlink" title="4. 伪类和伪元素"></a>4. Pseudo-classes and pseudo-elements</h2><div class="story post-story"><h3 id="4-1-伪类(Pseudo-class)"><a href="#4-1-伪类(Pseudo-class)" class="headerlink" title="4.1 伪类(Pseudo-class)"></a>4.1 Pseudo-classes (Pseudo-class)</h3><p><strong>Pseudo-classes select elements that are in a particular state</strong>, such as being hovered, being clicked, or being the first child. A pseudo-class starts with a colon <code>:</code> and is written directly after the selector.</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">head</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">style</span>></span><span 
class="language-css"></span></span><br><span class="line"><span class="language-css"> <span class="comment">/* Select all link elements */</span></span></span><br><span class="line"><span class="language-css"> <span class="selector-tag">a</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">color</span>: blue;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"></span></span><br><span class="line"><span class="language-css"> <span class="comment">/* Select a link while the mouse hovers over it */</span></span></span><br><span class="line"><span class="language-css"> <span class="selector-tag">a</span><span class="selector-pseudo">:hover</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">color</span>: red;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"></span></span><br><span class="line"><span class="language-css"> <span class="comment">/* Select a p that is the first child of a div */</span></span></span><br><span class="line"><span class="language-css"> <span class="selector-tag">div</span> > <span class="selector-tag">p</span><span class="selector-pseudo">:first-child</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">font-weight</span>: bold;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"> </span></span><br><span class="line"><span class="language-css"> </span><span class="tag"></<span class="name">style</span>></span></span><br><span class="line"><span class="tag"></<span class="name">head</span>></span></span><br><span class="line"></span><br><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">p</span>></span></span><br><span class="line"> <span class="tag"><<span 
class="name">a</span> <span class="attr">href</span>=<span class="string">"https://www.shelven.com"</span>></span>这是一个普通的链接<span class="tag"></<span class="name">a</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">p</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">div</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">p</span>></span>这是在div父元素下的第一个段落<span class="tag"></<span class="name">p</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">p</span>></span>这是在div父元素下的第二个段落<span class="tag"></<span class="name">p</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">div</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p>In the example above, <code>:hover</code> is a pseudo-class that matches a link while the mouse hovers over it and turns its text red. <code>:first-child</code> is also a pseudo-class; it matches an element that is the first child of its parent and here makes it bold (without the <code>div ></code> part, <code>p:first-child</code> would match a <code><p></code> that is the first child of any element).</p><p>The first image is the page as normally displayed; the second shows the effect with the mouse hovering over the link:</p><p><img src="https://www.shelven.com/tuchuang/20230930/9.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230930/9.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="> <img src="https://www.shelven.com/tuchuang/20230930/10.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230930/10.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><h3 id="4-2-伪元素(Pseudo-element)"><a href="#4-2-伪元素(Pseudo-element)" class="headerlink" title="4.2 伪元素(Pseudo-element)"></a>4.2 Pseudo-elements (Pseudo-element)</h3><p><strong>Pseudo-elements insert content or apply styles at a specific position within an element</strong>, for example inserting extra content before or after an element, or selecting its first letter. A pseudo-element starts with a double colon <code>::</code> and is written directly after the selector.</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">head</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">style</span>></span><span class="language-css"></span></span><br><span class="line"><span class="language-css"> <span class="comment">/* Insert content before the paragraph */</span></span></span><br><span class="line"><span class="language-css"> <span class="selector-tag">p</span><span class="selector-pseudo">::before</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">content</span>: <span class="string">"前置内容"</span>;</span></span><br><span class="line"><span class="language-css"> <span class="attribute">font-weight</span>: bold;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"></span></span><br><span class="line"><span class="language-css"> <span class="comment">/* Select the first letter */</span></span></span><br><span class="line"><span class="language-css"> <span class="selector-tag">p</span><span class="selector-pseudo">::first-letter</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">font-size</span>: <span 
class="number">2em</span>;</span></span><br><span class="line"><span class="language-css"> <span class="attribute">color</span>: red;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"></span></span><br><span class="line"><span class="language-css"> <span class="comment">/* Insert content after the paragraph */</span></span></span><br><span class="line"><span class="language-css"> <span class="selector-tag">p</span><span class="selector-pseudo">::after</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">content</span>: <span class="string">"后置内容"</span>;</span></span><br><span class="line"><span class="language-css"> <span class="attribute">font-weight</span>: bold;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"> </span><span class="tag"></<span class="name">style</span>></span></span><br><span class="line"><span class="tag"></<span class="name">head</span>></span></span><br><span class="line"></span><br><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">p</span>></span></span><br><span class="line"> 这是一段普通的段落内容</span><br><span class="line"> <span class="tag"></<span class="name">p</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230930/11.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230930/11.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>In the example above, <code>::before</code> is a pseudo-element that inserts extra content at the start of every paragraph and makes it bold. <code>::first-letter</code> is also a pseudo-element; it selects the first letter of every paragraph and renders it at twice the font size in red.</p></div><h2 id="5-盒子模型"><a href="#5-盒子模型" class="headerlink" title="5. 盒子模型"></a>5. 
The box model</h2><div class="story post-story"><p>The CSS box model describes how much space an element takes up on the page. It treats every element as a rectangular box made of four parts: content, padding, border and margin.</p><ul><li><p>1. Content:<br>The content area is where the element's actual content is displayed, including text, images or child elements. Its size is set by the <code>width</code> and <code>height</code> properties (under the default <code>box-sizing: content-box</code>).</p></li><li><p>2. Padding:<br>Padding is the blank space between the content area and the border, and controls the distance between the two. Its size is set with the <code>padding</code> property.</p></li><li><p>3. Border:<br>The border is the line drawn around the content and padding, separating the element from other elements. The <code>border</code> property sets its style, width and color.</p></li><li><p>4. Margin:<br>Margin is the blank space between the element and its neighbors, and controls the distance between them. Its size is set with the <code>margin</code> property.</p></li></ul><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">head</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">style</span>></span><span class="language-css"></span></span><br><span class="line"><span class="language-css"> <span class="selector-class">.box</span> {</span></span><br><span class="line"><span class="language-css"> <span class="attribute">width</span>: <span class="number">100px</span>;</span></span><br><span class="line"><span class="language-css"> <span class="attribute">height</span>: <span class="number">100px</span>;</span></span><br><span class="line"><span class="language-css"> <span class="attribute">background-color</span>: blue;</span></span><br><span class="line"><span 
class="language-css"> <span class="attribute">border</span>: <span class="number">3px</span> solid black;</span></span><br><span class="line"><span class="language-css"> <span class="attribute">padding</span>: <span class="number">10px</span> <span class="number">9px</span> <span class="number">8px</span> <span class="number">7px</span>;</span></span><br><span class="line"><span class="language-css"> <span class="attribute">margin</span>: <span class="number">5px</span> <span class="number">5px</span> <span class="number">5px</span> <span class="number">5px</span>;</span></span><br><span class="line"><span class="language-css"> }</span></span><br><span class="line"><span class="language-css"> </span><span class="tag"></<span class="name">style</span>></span></span><br><span class="line"><span class="tag"></<span class="name">head</span>></span></span><br><span class="line"></span><br><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">div</span> <span class="attr">class</span>=<span class="string">"box"</span>></span></span><br><span class="line"> 这是一个普通的box模型</span><br><span class="line"> <span class="tag"></<span class="name">div</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p>Open the page, press F12, go to the Elements panel and select div.box to see the box model as the browser displays it:</p><p><img src="https://www.shelven.com/tuchuang/20230930/13.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230930/13.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>In short, the box model helps us understand how elements are laid out and how much space they occupy on the page. Adjusting the size and style of the content area, padding, border and margin makes all kinds of page layouts possible.</p><p>All of the above is basic CSS knowledge. For worked examples see the runoob tutorial (菜鸟教程), which offers plenty; trying them yourself is the best way to understand them: <a href="https://www.runoob.com/css/css-examples.html">CSS 实例 | 菜鸟教程 (runoob.com)</a></p><p>For CSS code style, there is also a Chinese-language guide on GitHub: <a 
href="https://github.com/Zhangjd/css-style-guide">Zhangjd/css-style-guide: A mostly reasonable approach to CSS and Sass. (github.com)</a></p><p>As for indentation, pressing Enter in my VS Code produces hard tabs rather than soft tabs (two spaces), but it does not seem to matter, so I will leave it that way for now...</p></div>]]></content>
<summary type="html"><p>The previous note covered the basics of HTML; this one mainly records the basics of CSS.</p></summary>
<category term="个人主页" scheme="http://www.shelven.com/categories/%E4%B8%AA%E4%BA%BA%E4%B8%BB%E9%A1%B5/"/>
<category term="编程自学" scheme="http://www.shelven.com/categories/%E7%BC%96%E7%A8%8B%E8%87%AA%E5%AD%A6/"/>
<category term="CSS" scheme="http://www.shelven.com/tags/CSS/"/>
</entry>
<entry>
<title>HTML, CSS and JavaScript for Beginners (1): HTML Basics</title>
<link href="http://www.shelven.com/2023/09/28/a.html"/>
<id>http://www.shelven.com/2023/09/28/a.html</id>
<published>2023-09-28T15:35:43.000Z</published>
<updated>2023-10-06T07:16:09.000Z</updated>
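<content type="html"><![CDATA[<p>Before I knew it, I had been running this site on Hexo for almost a year and a half. Installing Node.js, setting up the Hexo blog, using the volantis theme and finally deploying to a server were all done by following tutorials and the volantis Chinese community. I was a complete beginner, the kind who charges ahead in the dark and just gets it done. When I hit a bug I searched Baidu and Google, changed files the way other people did, and understood none of the underlying principles.</p><span id="more"></span><p>Later my research project brought me into contact with programming, and I gradually got the idea of sprucing up the site (it still does not look great, mainly because I have no time to work on it, hhhh). As a beginner, the programming tutorials online that run for ten-plus hours really give me a headache. I only need the basic concepts and enough to get by, not a deep dive, hence this purely introductory note.</p><h2 id="1-前言"><a href="#1-前言" class="headerlink" title="1. 前言"></a>1. Preface</h2><div class="story post-story"><p>To learn front-end web development, you first need to get to know the following technologies:</p><ol><li><strong>HTML (HyperText Markup Language)</strong>: HTML is the <strong>markup language</strong> that defines the structure and content of web pages. It is used to create the elements of a page, such as headings, paragraphs, links and images.</li><li><strong>CSS (Cascading Style Sheets)</strong>: CSS controls the style and layout of web pages. It is used to set appearance properties such as fonts, colors, sizes and layout.</li><li><strong>JavaScript</strong>: JavaScript is a scripting language for page interaction and dynamic effects. It provides interactivity by manipulating page elements, handling user input and responding to events.</li></ol><p><strong>Put simply, if front-end web development were the game Minecraft, HTML would be the actual terrain, blocks and other entities in the game, CSS would be the texture packs and materials, and JavaScript would be the game's logic and behavior, controlling how blocks and entities move, interact and collide.</strong></p><p>These are the basics of the basics. To really learn web development you also need front-end frameworks and libraries such as React and Vue.js, build tools such as Webpack and Vite for bundling front-end projects, networking and the HTTP protocol for working with the back end, plus mobile optimization, cross-browser compatibility and so on... Every tall building starts from the ground; without the basics, everything else is a castle in the air.</p><p>I use VS Code as my development environment, with the following extensions installed:</p><ul><li><ol><li>Live Server: <strong>previews pages live</strong> in the browser. After installing it, right-click the HTML file and choose Open With Live Server to open it quickly in the local browser; the browser refreshes automatically whenever the file is edited.</li></ol></li></ul><p><img src="https://www.shelven.com/tuchuang/20230928/live.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230928/live.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><ul><li><ol start="2"><li>HTML CSS Support: based on what you type and the context, it suggests possible options as you enter HTML tags or CSS properties (saving you from switching back and forth between HTML and CSS files).</li></ol></li></ul><p><img src="https://www.shelven.com/tuchuang/20230928/HTML.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230928/HTML.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><ul><li><ol start="3"><li>Auto Rename Tag: renames paired tags automatically; editing the tag on one side updates the tag on the other side.</li></ol></li></ul><p><img 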
src="https://www.shelven.com/tuchuang/20230928/Auto.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230928/Auto.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><ul><li><ol start="4"><li>JavaScript (ES6) code snippets: provides code snippets and syntax hints for JavaScript and TypeScript.</li></ol></li></ul><p><img src="https://www.shelven.com/tuchuang/20230928/JS.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230928/JS.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p></div><h2 id="2-HTML5"><a href="#2-HTML5" class="headerlink" title="2. HTML5"></a>2. HTML5</h2><div class="story post-story"><h3 id="2-1-HTML基础"><a href="#2-1-HTML基础" class="headerlink" title="2.1 HTML基础"></a>2.1 HTML basics</h3><p>HTML5 is the fifth major version of HTML. It is concise to write and easy to read and understand; everything below uses HTML5.</p><p>Create a new <code>index.html</code> file in VS Code, type an ASCII <code>!</code> and press Tab to expand it, and the most basic HTML5 skeleton appears:</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta"><!DOCTYPE <span class="keyword">html</span>></span></span><br><span class="line"><span class="tag"><<span class="name">html</span> <span class="attr">lang</span>=<span class="string">"en"</span>></span></span><br><span class="line"><span class="tag"><<span class="name">head</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">meta</span> <span class="attr">charset</span>=<span class="string">"UTF-8"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">meta</span>
<span class="attr">http-equiv</span>=<span class="string">"X-UA-Compatible"</span> <span class="attr">content</span>=<span class="string">"IE=edge"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">meta</span> <span class="attr">name</span>=<span class="string">"viewport"</span> <span class="attr">content</span>=<span class="string">"width=device-width, initial-scale=1.0"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">title</span>></span>Document<span class="tag"></<span class="name">title</span>></span></span><br><span class="line"><span class="tag"></<span class="name">head</span>></span></span><br><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> </span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br><span class="line"><span class="tag"></<span class="name">html</span>></span></span><br></pre></td></tr></table></figure><blockquote><p><code><!DOCTYPE html></code> declares the document as HTML5</p><p>The <code><html></code> element is the root element of the HTML page; every other element lives inside it</p><p>The <code><head></code> element holds the document's metadata (meta), links the project's CSS files, and gives search engines data such as the site description and keywords. This content <strong>is not displayed in the page itself</strong>, for example:</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">meta</span> <span class="attr">charset</span>=<span class="string">"character set"</span>></span> Specifies the character encoding, e.g. UTF-8 </span><br><span class="line"><span class="tag"><<span class="name">meta</span> <span class="attr">name</span>=<span class="string">"description"</span> <span class="attr">content</span>=<span class="string">"description text"</span>></span> Defines the document description, usually shown as the summary in search results</span><br><span class="line"><span class="tag"><<span class="name">meta</span> <span
class="attr">name</span>=<span class="string">"keywords"</span> <span class="attr">content</span>=<span class="string">"keyword list"</span>></span> Defines the document's keywords or tags, which can help search engines index and categorize the page</span><br><span class="line"><span class="tag"><<span class="name">meta</span> <span class="attr">name</span>=<span class="string">"author"</span> <span class="attr">content</span>=<span class="string">"author name"</span>></span> Defines the document author</span><br><span class="line"><span class="tag"><<span class="name">meta</span> <span class="attr">name</span>=<span class="string">"viewport"</span> <span class="attr">content</span>=<span class="string">"width=device-width, initial-scale=1.0"</span>></span> Defines the viewport, which controls how the page is displayed and scaled on mobile devices</span><br><span class="line"><span class="tag"><<span class="name">meta</span> <span class="attr">http-equiv</span>=<span class="string">"refresh"</span> <span class="attr">content</span>=<span class="string">"seconds; URL=redirect URL"</span>></span> Defines a document refresh or redirect</span><br></pre></td></tr></table></figure><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">title</span>></span>Document title (the text shown on the browser tab)<span class="tag"></<span class="name">title</span>></span></span><br></pre></td></tr></table></figure><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">link</span> <span class="attr">rel</span>=<span class="string">"stylesheet"</span> <span class="attr">href</span>=<span
class="string">"stylesheet path"</span>></span>Imports an external stylesheet; more on this in the section on including CSS</span><br></pre></td></tr></table></figure><p>The <code><body></code> element contains all of the <strong>visible</strong> page content</p></blockquote><p>From the example above you can see that HTML is not a programming language but a <strong>markup language</strong>. A markup language has its own set of tags. A tag is wrapped in angle brackets and usually comes in <strong>pairs</strong>, for example <code><p>content goes here</p></code>. The start tag and end tag are also called the opening tag and closing tag.</p><p>Some tags are self-closing (void elements), such as <code><meta>, <link></code> above, as well as <code><br> (line break)</code>, <code><hr> (horizontal rule)</code>, <code><input> (input control)</code>, <code><img> (image)</code> and <code><area> (clickable region)</code>. They need no closing tag because they never hold content. <code><title></code>, by contrast, wraps the title text and does need its closing tag.</p><p>By the way, in VS Code <code>ctrl+/</code> comments out the current line; for a block comment, select the code and press <code>alt+shift+A</code>; <code>alt+shift+↓</code> copies the current line to the line below (the same works for a multi-line selection). In an HTML file, typing a tag name in lowercase and pressing Tab <strong>fills in the opening and closing tags automatically</strong>. Small tricks like these speed up coding...</p><p>Now right-click <code>index.html</code>, choose <code>Open With Live Server</code>, and press <code>win+→</code> to snap the browser window to the right half of the screen, so the code and the page are visible side by side.</p><p><img src="https://www.shelven.com/tuchuang/20230928/1.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230928/1.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><h3 id="2-2-常用标签和属性"><a href="#2-2-常用标签和属性" class="headerlink" title="2.2 常用标签和属性"></a>2.2 Common tags and attributes</h3><p>Headings are defined with <code><h1></code> through <code><h6></code>:</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h1</span>></span>Heading 1<span class="tag"></<span class="name">h1</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h2</span>></span>Heading 2<span class="tag"></<span class="name">h2</span>></span></span><br><span
class="line"> <span class="tag"><<span class="name">h3</span>></span>Heading 3<span class="tag"></<span class="name">h3</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h4</span>></span>Heading 4<span class="tag"></<span class="name">h4</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h5</span>></span>Heading 5<span class="tag"></<span class="name">h5</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h6</span>></span>Heading 6<span class="tag"></<span class="name">h6</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230928/2.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230928/2.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Paragraphs are defined with the <code><p></code> tag:</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">body</span>></span> </span><br><span class="line"> <span class="tag"><<span class="name">p</span>></span></span><br><span class="line"> Lorem ipsum dolor sit amet consectetur adipisicing elit. Repudiandae quia alias tempore natus perspiciatis commodi sapiente fugit ducimus id, in ut deleniti sint sequi explicabo, totam pariatur quidem. 
Non, quis?</span><br><span class="line"> The Emmet abbreviation lorem generates this meaningless placeholder text</span><br><span class="line"> <span class="tag"></<span class="name">p</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p>An element is either an <strong>inline element</strong> or a <strong>block-level element</strong>. Inline elements do not start a new line, and the space they take up depends on the size of their content; block-level elements start on a new line.</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">body</span>></span> </span><br><span class="line"> <span class="comment"><!-- This is a comment; it is not displayed --></span></span><br><span class="line"> <span class="tag"><<span class="name">p</span>></span></span><br><span class="line"> Inline elements do not start a new line, for example</span><br><span class="line"> <span class="tag"><<span class="name">span</span>></span>span (an inline container)<span class="tag"></<span class="name">span</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">img</span> <span class="attr">decoding</span>=<span class="string">"async"</span> <span class="attr">src</span>=<span class="string">"https://www.shelven.com/tuchuang/avatar.jpg"</span> <span class="attr">width</span>=<span class="string">"50"</span>></span>img</span><br><span class="line"> <span class="tag"><<span class="name">a</span> <span class="attr">href</span>=<span class="string">"https://www.shelven.com"</span> <span
class="attr">target</span>=<span class="string">"_blank"</span>></span>a (link tag)<span class="tag"></<span class="name">a</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">strong</span>></span>strong is bold<span class="tag"></<span class="name">strong</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">em</span>></span>em is italic<span class="tag"></<span class="name">em</span>></span> </span><br><span class="line"> <span class="tag"></<span class="name">p</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">div</span>></span></span><br><span class="line"> Block-level elements start a new line and take up the full width of the line they are given, including</span><br><span class="line"> <span class="tag"><<span class="name">div</span>></span>div tag<span class="tag"></<span class="name">div</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h6</span>></span>h tag<span class="tag"></<span class="name">h6</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">p</span>></span>p tag<span class="tag"></<span class="name">p</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">form</span>></span>form tag<span class="tag"></<span class="name">form</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">div</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230928/4.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230928/4.jpg"
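srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p><code><span></code> and <code><div></code> matter a lot for structuring and laying out an HTML page. Neither carries any particular meaning: <code><span></code> usually wraps text and other inline elements, while <code><div></code> usually holds other block-level or inline elements.</p><p>In the example above, pieces of information inside a tag such as decoding and href are called the <strong>attributes</strong> of the HTML element. Attributes appear as name/value pairs and always go in the opening tag. Attribute values, whether numbers or strings, should be <strong>wrapped in quotes (double quotes preferred)</strong>.</p><blockquote><p>Attributes that apply to most HTML tags:</p><ul><li>class: gives the element one or more class names (the names that CSS selectors refer to) so that selectors can target it; <strong>several can be listed</strong></li><li>id: gives the element a <strong>unique</strong> id value, also so that selectors can target it</li><li>hidden: hides the element</li><li>style: sets the element's inline style; covered in detail in the section on including CSS</li><li>title: provides extra information about the element; on an a tag, for example, it is the text shown on mouse hover</li></ul><p>Incidentally, the example above points at resources in two different ways. <strong>href</strong> is a hypertext reference that establishes a relationship between the document and a resource, and is common on link and a tags; <strong>src</strong> downloads the referenced resource and embeds it in the current page, and is common on script and img tags.</p></blockquote><p>To break a line between inline elements or inside a paragraph, insert a block-level element, or add a <code><br> (line break)</code> or <code><hr> (horizontal rule)</code> tag. Strictly speaking <code><hr></code> is itself block-level, so the browser ends the paragraph at that point:</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">body</span>></span> </span><br><span class="line"> <span class="tag"><<span class="name">p</span>></span></span><br><span class="line"> This is a paragraph<span class="tag"><<span class="name">br</span>></span>This is another paragraph<span class="tag"><<span class="name">hr</span>></span>And here is one more</span><br><span class="line"> <span class="tag"></<span class="name">p</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230928/5.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230928/5.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><h3 id="2-3-列表样式和表格"><a href="#2-3-列表样式和表格"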
class="headerlink" title="2.3 列表样式和表格"></a>2.3 Lists and tables</h3><p>HTML has two main kinds of list: the <strong>unordered list</strong>, with a bullet in front of each item, and the ordered list, with items numbered in sequence:</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="comment"><!-- Unordered list: the ul tag --></span></span><br><span class="line"> <span class="tag"><<span class="name">ul</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">li</span>></span>Unordered item 1<span class="tag"></<span class="name">li</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">li</span>></span>Unordered item 2<span class="tag"></<span class="name">li</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">li</span>></span>Unordered item 3<span class="tag"></<span class="name">li</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">li</span>></span>Unordered item 4<span class="tag"></<span class="name">li</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">ul</span>></span></span><br><span class="line"> <span class="comment"><!-- Ordered list: the ol tag --></span></span><br><span class="line"> <span class="tag"><<span class="name">ol</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">li</span>></span>Ordered item 1<span class="tag"></<span class="name">li</span>></span></span><br><span class="line"> <span
class="tag"><<span class="name">li</span>></span>Ordered item 2<span class="tag"></<span class="name">li</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">li</span>></span>Ordered item 3<span class="tag"></<span class="name">li</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">li</span>></span>Ordered item 4<span class="tag"></<span class="name">li</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">ol</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230928/6.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230928/6.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Tables use the <code><table></code> tag and are divided into a <code>header <thead></code> and a <code>body <tbody></code>. Both have rows and columns: a row is a <code><tr></code> tag in either part, while a column cell is a <code><th></code> tag in the header and a <code><td></code> tag in the body:</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"><span class="tag"><<span
class="name">table</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">thead</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">tr</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">th</span>></span>Name<span class="tag"></<span class="name">th</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">th</span>></span>Age<span class="tag"></<span class="name">th</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">th</span>></span>qq<span class="tag"></<span class="name">th</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">tr</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">thead</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">tbody</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">tr</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>phantom<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>28<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>1021618642<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">tr</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">tr</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>aria<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>26<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span
class="name">td</span>></span>1269035311<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">tr</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">tbody</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">table</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230928/7.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230928/7.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>As you can see, the browser renders the header cells as bold, centered text; otherwise they look no different from the body cells.</p><h3 id="2-4-表单"><a href="#2-4-表单" class="headerlink" title="2.4 表单"></a>2.4 Forms</h3><p>A form is a tool for collecting user input and is created with the <code><form></code> tag. Note that an HTML form <strong>only provides the interface, not the behavior</strong>: doing something with the input requires JavaScript, or a server-side script (in PHP, for example) that receives the submitted parameters.</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span
class="line">30</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">form</span>></span></span><br><span class="line"> <span class="comment"><!-- label defines the caption for an input --></span></span><br><span class="line"> <span class="tag"><<span class="name">label</span>></span>Account:<span class="tag"></<span class="name">label</span>></span></span><br><span class="line"> <span class="comment"><!-- input type="text" is a box for entering text --></span></span><br><span class="line"> <span class="tag"><<span class="name">input</span> <span class="attr">type</span>=<span class="string">"text"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">br</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">label</span>></span>Password:<span class="tag"></<span class="name">label</span>></span></span><br><span class="line"> <span class="comment"><!-- input type="password" does not show the typed characters in plain text --></span></span><br><span class="line"> <span class="tag"><<span class="name">input</span> <span class="attr">type</span>=<span class="string">"password"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">br</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">label</span>></span>Text area<span class="tag"></<span class="name">label</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">br</span>></span></span><br><span class="line"> <span class="comment"><!-- textarea defines a multi-line text area --></span></span><br><span class="line"> <span class="tag"><<span class="name">textarea</span> <span class="attr">name</span>=<span class="string">"输入框"</span> <span class="attr">cols</span>=<span class="string">"50"</span> <span class="attr">rows</span>=<span class="string">"10"</span>></span><span class="tag"></<span class="name">textarea</span>></span></span><br><span class="line"> <span
class="tag"><<span class="name">br</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">label</span>></span>Sex<span class="tag"></<span class="name">label</span>></span></span><br><span class="line"> <span class="comment"><!-- select provides a drop-down list --></span></span><br><span class="line"> <span class="tag"><<span class="name">select</span> <span class="attr">name</span>=<span class="string">"选项"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">option</span> <span class="attr">value</span>=<span class="string">"male"</span>></span>Male<span class="tag"></<span class="name">option</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">option</span> <span class="attr">value</span>=<span class="string">"female"</span>></span>Female<span class="tag"></<span class="name">option</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">option</span> <span class="attr">value</span>=<span class="string">"unknown"</span>></span>Unknown<span class="tag"></<span class="name">option</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">select</span>></span></span><br><span class="line"> <span class="comment"><!-- input type="radio" provides a radio button --></span></span><br><span class="line"> <span class="tag"><<span class="name">input</span> <span class="attr">type</span>=<span class="string">"radio"</span> <span class="attr">name</span>=<span class="string">"sex"</span> <span class="attr">value</span>=<span class="string">"female"</span>></span>Female</span><br><span class="line"> <span class="comment"><!-- input type="checkbox" provides a checkbox --></span></span><br><span class="line"> <span class="tag"><<span class="name">input</span> <span class="attr">type</span>=<span class="string">"checkbox"</span> <span class="attr">name</span>=<span class="string">"sex"</span> <span class="attr">value</span>=<span class="string">"male"</span>></span>Male</span><br><span class="line"> <span
class="tag"><<span class="name">input</span> <span class="attr">type</span>=<span class="string">"checkbox"</span> <span class="attr">name</span>=<span class="string">"sex"</span> <span class="attr">value</span>=<span class="string">"female"</span>></span>Female</span><br><span class="line"> <span class="tag"></<span class="name">form</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230928/8.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230928/8.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><h3 id="2-5-iframe"><a href="#2-5-iframe" class="headerlink" title="2.5 iframe"></a>2.5 iframe</h3><p>An iframe is HTML's inline frame: it lets one web page embed another. Most browsers limit how deeply iframes can be nested, though that rarely matters in practice:</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">iframe</span> <span class="attr">src</span>=<span class="string">"https://www.shelven.com"</span> <span class="attr">width</span>=<span class="string">"800"</span> <span class="attr">height</span>=<span class="string">"600"</span>></span><span class="tag"></<span class="name">iframe</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230928/9.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230928/9.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Every iframe has to load and render a separate document, so heavy nesting increases load time and resource use, and an overly complex page also makes for a poor user experience = =</p><p>Here is a simple combination of iframe and div:</p><figure class="highlight
html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta"><!DOCTYPE <span class="keyword">html</span>></span></span><br><span class="line"><span class="tag"><<span class="name">html</span> <span class="attr">lang</span>=<span class="string">"en"</span>></span></span><br><span class="line"><span class="tag"><<span class="name">head</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">meta</span> <span class="attr">charset</span>=<span class="string">"UTF-8"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">meta</span> <span class="attr">http-equiv</span>=<span class="string">"X-UA-Compatible"</span> <span class="attr">content</span>=<span class="string">"IE=edge"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">meta</span> <span class="attr">name</span>=<span class="string">"viewport"</span> <span class="attr">content</span>=<span class="string">"width=device-width, initial-scale=1.0"</span>></span></span><br><span 
class="line"> <span class="tag"><<span class="name">title</span>></span>Document<span class="tag"></<span class="name">title</span>></span></span><br><span class="line"><span class="tag"></<span class="name">head</span>></span></span><br><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">div</span> <span class="attr">id</span>=<span class="string">"container"</span> <span class="attr">style</span>=<span class="string">"width: 800px;height: 600px;text-align: center;"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">div</span> <span class="attr">id</span>=<span class="string">"header"</span> <span class="attr">style</span>=<span class="string">"background-color:#0099ff;height: 50px;"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h1</span> ></span>My humble little site<span class="tag"></<span class="name">h1</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">div</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">div</span> <span class="attr">id</span>=<span class="string">"menu"</span> <span class="attr">style</span>=<span class="string">"background-color:#ff00b3;height:400px;width:100px;float:left;"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">a</span> <span class="attr">href</span>=<span class="string">"https://www.shelven.com/"</span> <span class="attr">target</span>=<span class="string">"_blank"</span>></span>Menu<span class="tag"></<span class="name">a</span>></span><span class="tag"><<span class="name">br</span>></span> </span><br><span class="line"> <span class="tag"><<span class="name">a</span> <span class="attr">href</span>=<span class="string">"https://www.shelven.com/categories/"</span> <span class="attr">target</span>=<span class="string">"_blank"</span>></span>Categories<span class="tag"></<span
class="name">a</span>></span><span class="tag"><<span class="name">br</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">a</span> <span class="attr">href</span>=<span class="string">"https://www.shelven.com/tags/"</span> <span class="attr">target</span>=<span class="string">"_blank"</span>></span>标签<span class="tag"></<span class="name">a</span>></span><span class="tag"><<span class="name">br</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">a</span> <span class="attr">href</span>=<span class="string">"https://www.shelven.com/archives/"</span> <span class="attr">target</span>=<span class="string">"_blank"</span>></span>归档<span class="tag"></<span class="name">a</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">div</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">div</span> <span class="attr">id</span>=<span class="string">"content"</span> <span class="attr">style</span>=<span class="string">"background-color:#EEEEEE;height: 400px;width: 700px;float:left;"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">iframe</span> <span class="attr">src</span>=<span class="string">"https://www.shelven.com"</span> <span class="attr">frameborder</span>=<span class="string">"0"</span> <span class="attr">width</span>=<span class="string">"700"</span> <span class="attr">height</span>=<span class="string">"400"</span>></span><span class="tag"></<span class="name">iframe</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">div</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">div</span> <span class="attr">id</span>=<span class="string">"footer"</span> <span class="attr">style</span>=<span class="string">"background-color:#00ff0d;height: 50px;clear:both;text-align:center;font-size: large;"</span>></span></span><br><span class="line"> 萌ICP备20220246号 
浙ICP备2022010847号</span><br><span class="line"> <span class="tag"></<span class="name">div</span>></span> </span><br><span class="line"> <span class="tag"></<span class="name">div</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br><span class="line"></span><br><span class="line"><span class="tag"></<span class="name">html</span>></span></span><br></pre></td></tr></table></figure><p>大概就是这样的布局:</p><p><img src="https://www.shelven.com/tuchuang/20230928/10.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230928/10.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>有了以上HTML的基础概念,后面CSS的引入就比较容易理解了。</p></div><h2 id="2023-x2F-10-x2F-5更新"><a href="#2023-x2F-10-x2F-5更新" class="headerlink" title="2023/10/5更新"></a>2023/10/5更新</h2><div class="story post-story"><p>发现一个很好用的vscode插件:</p><p><img src="https://www.shelven.com/tuchuang/20230928/111.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230928/111.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>该插件可以用来格式化css和js文件,可能你从别的项目中拷贝的js/css文件是一行显示的,打开一大坨不方便阅读。下载这个插件以后,在对应的文件中按<strong>shift + alt + F</strong>,选择格式化方法为Prettier,可以快速格式化文件。</p></div>]]></content>
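<summary type="html"><p>Almost without noticing, I have been running this site on Hexo for nearly a year and a half. Installing Node.js, setting up the Hexo blog, applying the volantis theme, and finally deploying to a server were all done by following tutorials and the volantis Chinese community. As a complete beginner, I simply dove in blind. Whenever I hit a bug I searched Baidu and Google, copied whatever file changes other people had made, and never understood the principles behind them.</p></summary>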
<category term="个人主页" scheme="http://www.shelven.com/categories/%E4%B8%AA%E4%BA%BA%E4%B8%BB%E9%A1%B5/"/>
<category term="编程自学" scheme="http://www.shelven.com/categories/%E7%BC%96%E7%A8%8B%E8%87%AA%E5%AD%A6/"/>
<category term="HTML" scheme="http://www.shelven.com/tags/HTML/"/>
</entry>
<entry>
<title>Getting Started with R: Basic Syntax and Plotting</title>
<link href="http://www.shelven.com/2023/09/26/a.html"/>
<id>http://www.shelven.com/2023/09/26/a.html</id>
<published>2023-09-26T11:34:48.000Z</published>
<updated>2023-09-26T11:38:13.000Z</updated>
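<content type="html"><![CDATA[<p>While sorting through my notes I came across the introductory R notes I made two years ago. I first encountered R in the winter of 2021, when Professor Kong Qiusheng (孔秋生) of Huazhong Agricultural University came to Tarim University to give a lecture on the R language. Two years on, some of the material is out of date, so I am tidying it up here as a backup and reviewing it along the way; revisiting the old is a good way to learn something new ^_^</p><span id="more"></span><h2 id="1-R是什么"><a href="#1-R是什么" class="headerlink" title="1. R是什么"></a>1. What is R</h2><div class="story post-story"><p><strong>R</strong> is a programming language for statistical computing and data analysis. It offers a wide range of statistical and graphical functions along with rich tools for data processing and modeling. With its strong data-handling capabilities and extensive library of statistical functions, R is widely used in academic research, data science, financial analysis, biomedicine, and other fields.</p><p><strong>RStudio</strong> is an integrated development environment (IDE) for writing, running, and debugging R code. It provides many features and tools intended to make R development more efficient and convenient. Put simply, we write and run R code inside RStudio.</p><p>RStudio is not the only option, of course: R code can also be run and debugged in VS Code, for example. RStudio simply gives beginners a friendly interface. Once you are comfortable you can do without an IDE altogether, for instance by running R directly on Linux, but that is a topic for later.</p></div><h2 id="2-前期准备和一些基础认识"><a href="#2-前期准备和一些基础认识" class="headerlink" title="2. 前期准备和一些基础认识"></a>2. Setup and some basics</h2><div class="story post-story"><p>Install R first and then RStudio. Do not reverse the order, or RStudio may complain that it cannot find R.</p><p>R website: <a href="https://www.r-project.org/">R: The R Project for Statistical Computing (r-project.org)</a></p><p>RStudio website (the company has since been renamed Posit, which still takes some getting used to): <a href="https://posit.co/">Posit | The Open-Source Data Science Company</a></p><p>Once everything is installed, open RStudio and choose <strong>Global Options</strong> from the <strong>Tools</strong> menu to change the global settings. The main thing to change is your working directory (it can also be set in code). I also rearranged the four panes (under <strong>Pane Layout</strong>):</p><p><img src="https://www.shelven.com/tuchuang/20230926/1.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230926/1.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><blockquote><ul><li><ol><li>Source pane: where R code is written. You can also write code in pane 3 (Console); it comes down to personal preference</li></ol></li><li><ol start="2"><li>Environment, History, and related panes: show the variables created while code runs, your command history, and so on</li></ol></li><li><ol start="3"><li>Console pane: R's interactive console, where you can enter and run R code line by line and see the results immediately. Ctrl+L clears the screen</li></ol></li><li><ol 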
start="4"><li>Files, Plots, and related panes: the former shows the files in the current working directory, the latter shows the plots you draw</li></ol></li></ul></blockquote><p>Panes 2 and 4 offer several tabs to choose from; the ones above are those I use most, for reference only.</p><p>As for panes 1 and 3, where code is typed: <strong>for larger blocks of code, or code you need to save and rerun, use the Source pane, where each line is run with Ctrl+Enter</strong>; <strong>for simple tests, quick calculations, or interactive exploration, use the Console pane, where pressing Enter runs the code</strong>.</p><p>The packages that R calls are hosted in the CRAN repository. On the CRAN package site below you can find the packages you need, <strong>along with each package's parameters and usage</strong>.</p><p><a href="https://cran.r-project.org/web/packages/">CRAN - Contributed Packages (r-project.org)</a></p><p>In RStudio, open <strong>Install Packages</strong>, the first item in the <strong>Tools</strong> menu, type the name of the package you want, and click Install:</p><p><img src="https://www.shelven.com/tuchuang/20230926/2.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230926/2.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>That covers the bare essentials of R and RStudio. The rest of this post walks through R syntax and some plotting examples.</p></div><h2 id="3-R代码基础语法"><a href="#3-R代码基础语法" class="headerlink" title="3. R代码基础语法"></a>3. Basic R syntax</h2><div class="story post-story"><p><strong>To repeat, these are notes I wrote as a beginner, so they do not go into much detail</strong>. For the full picture, see the official manual <a href="https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Preface">An Introduction to R (r-project.org)</a></p><p>To make the output easy to spot, every line of output below is prefixed with two hash signs (##).</p><h3 id="3-1-数据类型"><a href="#3-1-数据类型" class="headerlink" title="3.1 数据类型"></a>3.1 Data types</h3><p>The commonly used data types are <strong>numeric</strong>, <strong>character</strong>, and <strong>logical</strong>.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">a <span class="operator">=</span> <span class="number">123</span> <span class="comment"># 
=赋值,<-也可以赋值,R官方社区用<-较多,自己取舍。#表示注释,不会运行</span></span><br><span class="line"><span class="built_in">class</span><span class="punctuation">(</span>a<span class="punctuation">)</span> <span class="comment"># class()确定括号内数据类型</span></span><br><span class="line"><span class="comment">## [1] "numeric"</span></span><br><span class="line"></span><br><span class="line">a <span class="operator">=</span> <span class="string">"123"</span> <span class="comment"># 赋值为字符型加双引号,与代码相关的都是英文字符</span></span><br><span class="line"><span class="built_in">class</span><span class="punctuation">(</span>a<span class="punctuation">)</span></span><br><span class="line"><span class="comment">## [1] "character"</span></span><br><span class="line"></span><br><span class="line">a <span class="operator">=</span> <span class="literal">TRUE</span> <span class="comment"># 逻辑型包括TRUE/FALSE/T/F</span></span><br><span class="line"><span class="built_in">class</span><span class="punctuation">(</span>a<span class="punctuation">)</span></span><br><span class="line"><span class="comment">## [1] "logical"</span></span><br></pre></td></tr></table></figure><h3 id="3-2-数据结构"><a href="#3-2-数据结构" class="headerlink" title="3.2 数据结构"></a>3.2 数据结构</h3><p>在R中,向量是一种基本的数据结构,用于存储一系列<strong>相同类型的元素</strong>。</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span 
class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># 向量vector c()这个函数用来创建向量</span></span><br><span class="line"><span class="comment"># 数值型向量</span></span><br><span class="line">a <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="number">1</span><span class="punctuation">,</span><span class="number">7</span><span class="punctuation">,</span><span class="number">10</span><span class="punctuation">)</span> <span class="comment"># 数值之间逗号间隔</span></span><br><span class="line"><span class="built_in">class</span><span class="punctuation">(</span>a<span class="punctuation">)</span></span><br><span class="line"><span class="comment">## [1] "numeric"</span></span><br><span class="line"></span><br><span class="line">seq<span class="punctuation">(</span><span class="number">1</span><span class="punctuation">,</span><span class="number">5</span><span class="punctuation">)</span> <span class="comment"># seq()序列函数</span></span><br><span class="line"><span class="comment">## [1] 1 2 3 4 5</span></span><br><span class="line">seq<span class="punctuation">(</span>from <span class="operator">=</span> <span class="number">1</span><span class="punctuation">,</span> to <span class="operator">=</span> <span class="number">3</span><span class="punctuation">,</span> by <span class="operator">=</span> <span class="number">0.5</span><span class="punctuation">)</span> <span class="comment"># by表示步长</span></span><br><span class="line"><span class="comment">## [1] 1.0 1.5 2.0 2.5 3.0</span></span><br><span class="line"><span class="number">1</span><span class="operator">:</span><span class="number">10</span> <span class="comment"># x:x也可以表示序列</span></span><br><span class="line"><span 
class="comment">## [1] 1 2 3 4 5 6 7 8 9 10</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 字符型向量</span></span><br><span class="line">a <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="string">"A"</span><span class="punctuation">,</span><span class="string">"B"</span><span class="punctuation">,</span><span class="string">"C"</span><span class="punctuation">)</span></span><br><span class="line"><span class="built_in">class</span><span class="punctuation">(</span>a<span class="punctuation">)</span></span><br><span class="line"><span class="comment">## [1] "character"</span></span><br><span class="line"></span><br><span class="line"><span class="built_in">letters</span> <span class="comment"># 26个小写英文字母顺序排列</span></span><br><span class="line"><span class="comment">## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"</span></span><br><span class="line"><span class="built_in">LETTERS</span> <span class="comment"># 26个大写英文字母顺序排列</span></span><br><span class="line"><span class="comment">## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 逻辑型向量</span></span><br><span class="line">a <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="built_in">T</span><span class="punctuation">,</span><span class="literal">TRUE</span><span class="punctuation">,</span><span class="built_in">F</span><span class="punctuation">,</span><span class="literal">FALSE</span><span class="punctuation">)</span></span><br><span class="line"><span class="built_in">class</span><span class="punctuation">(</span>a<span class="punctuation">)</span></span><br><span class="line"><span class="comment">## [1] 
"logical"</span></span><br></pre></td></tr></table></figure><h3 id="3-3-向量数据操作"><a href="#3-3-向量数据操作" class="headerlink" title="3.3 向量数据操作"></a>3.3 向量数据操作</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># rep()重复</span></span><br><span class="line"><span class="built_in">rep</span><span class="punctuation">(</span><span class="number">1</span><span class="operator">:</span><span class="number">3</span><span class="punctuation">,</span> times <span class="operator">=</span> <span class="number">4</span><span class="punctuation">)</span> <span class="comment"># 
repeat the sequence 1 to 3 four times</span></span><br><span class="line"><span class="comment">## [1] 1 2 3 1 2 3 1 2 3 1 2 3</span></span><br><span class="line"><span class="built_in">rep</span><span class="punctuation">(</span><span class="number">1</span><span class="operator">:</span><span class="number">3</span><span class="punctuation">,</span> each <span class="operator">=</span> <span class="number">4</span><span class="punctuation">)</span> <span class="comment"># repeat each number from 1 to 3 four times</span></span><br><span class="line"><span class="comment">## [1] 1 1 1 1 2 2 2 2 3 3 3 3</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># paste(): combine vectors element by element</span></span><br><span class="line">paste<span class="punctuation">(</span><span class="number">1</span><span class="operator">:</span><span class="number">3</span><span class="punctuation">,</span> </span><br><span class="line"> <span class="built_in">c</span><span class="punctuation">(</span><span class="string">"A"</span><span class="punctuation">,</span><span class="string">"B"</span><span class="punctuation">,</span><span class="string">"C"</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="comment"># by default the pieces are joined with a space</span></span><br><span class="line"><span class="comment">## [1] "1 A" "2 B" "3 C"</span></span><br><span class="line">paste0<span class="punctuation">(</span><span class="number">1</span><span class="operator">:</span><span class="number">3</span><span class="punctuation">,</span></span><br><span class="line"> <span class="built_in">c</span><span class="punctuation">(</span><span class="string">"A"</span><span class="punctuation">,</span><span class="string">"B"</span><span class="punctuation">,</span><span class="string">"C"</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="comment"># paste0() joins the pieces with no space in between</span></span><br><span class="line"><span class="comment">## [1] "1A" "2B" "3C"</span></span><br><span class="line">paste<span 
class="punctuation">(</span><span class="number">1</span><span class="operator">:</span><span class="number">3</span><span class="punctuation">,</span></span><br><span class="line"> <span class="built_in">c</span><span class="punctuation">(</span><span class="string">"A"</span><span class="punctuation">,</span><span class="string">"B"</span><span class="punctuation">,</span><span class="string">"C"</span><span class="punctuation">)</span><span class="punctuation">,</span> <span class="comment"># 换行注意逗号不要漏</span></span><br><span class="line"> sep <span class="operator">=</span> <span class="string">"/"</span><span class="punctuation">)</span> <span class="comment"># 自定义连接组合的符号</span></span><br><span class="line"><span class="comment">## [1] "1/A" "2/B" "3/C"</span></span><br><span class="line"></span><br><span class="line"><span class="comment">##练习:3个处理ABC,3个重复</span></span><br><span class="line">paste0<span class="punctuation">(</span><span class="built_in">rep</span><span class="punctuation">(</span><span class="built_in">c</span><span class="punctuation">(</span><span class="string">"A"</span><span class="punctuation">,</span><span class="string">"B"</span><span class="punctuation">,</span><span class="string">"C"</span><span class="punctuation">)</span><span class="punctuation">,</span> each <span class="operator">=</span> <span class="number">3</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> <span class="built_in">rep</span><span class="punctuation">(</span><span class="number">1</span><span class="operator">:</span><span class="number">3</span><span class="punctuation">,</span> times <span class="operator">=</span> <span class="number">3</span><span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"><span class="comment">## [1] "A1" "A2" "A3" "B1" "B2" "B3" "C1" "C2" "C3"</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 
[]索引,注意索引第一位是1而不是0</span></span><br><span class="line">a <span class="operator">=</span> <span class="number">2</span><span class="operator">:</span><span class="number">10</span></span><br><span class="line">a<span class="punctuation">[</span><span class="number">5</span><span class="punctuation">]</span> <span class="comment"># a向量中第5个数</span></span><br><span class="line"><span class="comment">## [1] 6</span></span><br><span class="line">a<span class="punctuation">[</span><span class="number">1</span><span class="operator">:</span><span class="number">3</span><span class="punctuation">]</span> <span class="comment"># a向量第1-3个数</span></span><br><span class="line"><span class="comment">## [1] 2 3 4</span></span><br><span class="line">a<span class="punctuation">[</span><span class="built_in">c</span><span class="punctuation">(</span><span class="number">1</span><span class="punctuation">,</span><span class="number">4</span><span class="punctuation">,</span><span class="number">5</span><span class="punctuation">)</span><span class="punctuation">]</span> <span class="comment"># a向量第1,4,5个数</span></span><br><span class="line"><span class="comment">## [1] 2 5 6</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 逻辑操作符</span></span><br><span class="line"><span class="comment"># &与;|或;!非</span></span><br><span class="line">a<span class="punctuation">[</span>a<span class="operator">></span><span class="number">5</span><span class="punctuation">]</span> <span class="comment"># a向量中比5大的数</span></span><br><span class="line"><span class="comment">## [1] 6 7 8 9 10</span></span><br><span class="line">a<span class="punctuation">[</span>a<span class="operator">></span><span class="number">5</span> <span class="operator">&</span> a<span class="operator"><</span><span class="number">8</span><span class="punctuation">]</span> <span class="comment"># a向量中大于5且小于8的数</span></span><br><span class="line"><span class="comment">## [1] 6 
7</span></span><br><span class="line">a<span class="punctuation">[</span>a<span class="operator">></span><span class="number">5</span> <span class="operator">|</span> a<span class="operator"><</span><span class="number">3</span><span class="punctuation">]</span> <span class="comment"># a向量中大于5或小于3的数</span></span><br><span class="line"><span class="comment">## [1] 2 6 7 8 9 10</span></span><br><span class="line">a<span class="punctuation">[</span>a<span class="operator">!=</span><span class="number">8</span><span class="punctuation">]</span> <span class="comment"># a向量中不包括8的值</span></span><br><span class="line"><span class="comment">## [1] 2 3 4 5 6 7 9 10</span></span><br></pre></td></tr></table></figure><h3 id="3-4-向量计算"><a href="#3-4-向量计算" class="headerlink" title="3.4 向量计算"></a>3.4 向量计算</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># +, -, *, / 加减乘除正常计算</span></span><br><span class="line"><span class="number">2</span> <span class="operator">*</span> <span class="number">1</span><span class="operator">:</span><span class="number">3</span></span><br><span class="line"><span class="comment">## [1] 2 4 6</span></span><br><span class="line"><span class="number">1</span><span class="operator">:</span><span class="number">3</span> <span class="operator">*</span> <span class="number">1</span><span class="operator">:</span><span class="number">3</span> <span class="comment"># 两个向量分别相乘</span></span><br><span class="line"><span class="comment">## [1] 1 4 9</span></span><br><span class="line"><span class="number">2</span><span class="operator">:</span><span 
class="number">5</span> <span class="operator">+</span> <span class="number">1</span><span class="operator">:</span><span class="number">3</span> <span class="comment"># the vectors differ in length, so the shorter one is recycled: the last element is 5 + 1</span></span><br><span class="line"><span class="comment">## [1] 3 5 7 6</span></span><br><span class="line"><span class="comment">## Warning message:</span></span><br><span class="line"><span class="comment">## In 2:5 + 1:3 :</span></span><br><span class="line"><span class="comment">## longer object length is not a multiple of shorter object length</span></span><br></pre></td></tr></table></figure><h3 id="3-5-向量类型转换"><a href="#3-5-向量类型转换" class="headerlink" title="3.5 向量类型转换"></a>3.5 Vector type conversion</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># as. followed by the data type, e.g. as.character()</span></span><br><span class="line">a <span class="operator">=</span> <span class="number">1</span><span 
class="operator">:</span><span class="number">3</span></span><br><span class="line"><span class="built_in">class</span><span class="punctuation">(</span>a<span class="punctuation">)</span></span><br><span class="line"><span class="comment">## [1] "integer"</span></span><br><span class="line"><span class="comment"># as.character() 转换字符型数据</span></span><br><span class="line">b <span class="operator">=</span> <span class="built_in">as.character</span><span class="punctuation">(</span>a<span class="punctuation">)</span></span><br><span class="line">b</span><br><span class="line"><span class="comment">## [1] "1" "2" "3"</span></span><br><span class="line"><span class="comment"># as.numeric() 转换数值型数据</span></span><br><span class="line"><span class="built_in">c</span> <span class="operator">=</span> <span class="built_in">as.numeric</span><span class="punctuation">(</span>b<span class="punctuation">)</span> </span><br><span class="line"><span class="built_in">c</span></span><br><span class="line"><span class="comment">## [1] 1 2 3</span></span><br><span class="line"><span class="comment"># as.logical() 转换逻辑型数据</span></span><br><span class="line">d <span class="operator">=</span> <span class="built_in">as.logical</span><span class="punctuation">(</span><span class="built_in">c</span><span class="punctuation">)</span></span><br><span class="line">d</span><br><span class="line"><span class="comment">## [1] TRUE TRUE TRUE</span></span><br><span class="line"></span><br><span class="line"><span class="comment">## 练习:产生5个大写字母</span></span><br><span class="line">a <span class="operator">=</span> <span class="built_in">LETTERS</span><span class="punctuation">[</span><span class="number">1</span><span class="operator">:</span><span class="number">5</span><span class="punctuation">]</span></span><br><span class="line"><span class="built_in">class</span><span class="punctuation">(</span>a<span class="punctuation">)</span></span><br><span class="line"><span class="comment">## [1] 
"character"</span></span><br><span class="line">a</span><br><span class="line"><span class="comment">## [1] "A" "B" "C" "D" "E"</span></span><br><span class="line"></span><br><span class="line"><span class="comment">## 练习:产生4对T,F</span></span><br><span class="line">a <span class="operator">=</span> <span class="built_in">rep</span><span class="punctuation">(</span><span class="built_in">c</span><span class="punctuation">(</span><span class="built_in">T</span><span class="punctuation">,</span><span class="built_in">F</span><span class="punctuation">)</span><span class="punctuation">,</span> times <span class="operator">=</span> <span class="number">4</span><span class="punctuation">)</span></span><br><span class="line"><span class="built_in">class</span><span class="punctuation">(</span>a<span class="punctuation">)</span></span><br><span class="line"><span class="comment">## [1] "logical"</span></span><br><span class="line"><span class="built_in">as.numeric</span><span class="punctuation">(</span>a<span class="punctuation">)</span> <span class="comment"># TRUE数值为1,FALSE数值为0</span></span><br><span class="line"><span class="comment">## [1] 1 0 1 0 1 0 1 0</span></span><br><span class="line"><span class="built_in">as.character</span><span class="punctuation">(</span>a<span class="punctuation">)</span></span><br><span class="line"><span class="comment">## [1] "TRUE" "FALSE" "TRUE" "FALSE" "TRUE" "FALSE" "TRUE" "FALSE"</span></span><br></pre></td></tr></table></figure><h3 id="3-6-常用计算函数"><a href="#3-6-常用计算函数" class="headerlink" title="3.6 常用计算函数"></a>3.6 常用计算函数</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span 
class="line">12</span><br></pre></td><td class="code"><pre><span class="line">mean<span class="punctuation">(</span><span class="number">1</span><span class="operator">:</span><span class="number">10</span><span class="punctuation">)</span> <span class="comment"># 平均数</span></span><br><span class="line"><span class="comment">## [1] 5.5</span></span><br><span class="line">sd<span class="punctuation">(</span><span class="number">1</span><span class="operator">:</span><span class="number">10</span><span class="punctuation">)</span> <span class="comment"># 标准差</span></span><br><span class="line"><span class="comment">## [1] 3.02765</span></span><br><span class="line"><span class="built_in">max</span><span class="punctuation">(</span><span class="number">1</span><span class="operator">:</span><span class="number">10</span><span class="punctuation">)</span> <span class="comment"># 最大值</span></span><br><span class="line"><span class="comment">## [1] 10</span></span><br><span class="line"><span class="built_in">range</span><span class="punctuation">(</span><span class="number">1</span><span class="operator">:</span><span class="number">10</span><span class="punctuation">)</span> <span class="comment"># 最小值和最大值</span></span><br><span class="line"><span class="comment">## [1] 1 10</span></span><br><span class="line"><span class="built_in">length</span><span class="punctuation">(</span><span class="number">1</span><span class="operator">:</span><span class="number">10</span><span class="punctuation">)</span> <span class="comment"># 长度</span></span><br><span class="line"><span class="comment">## [1] 10</span></span><br><span class="line"><span class="built_in">length</span><span class="punctuation">(</span><span class="built_in">letters</span><span class="punctuation">)</span></span><br><span class="line"><span class="comment">## [1] 26</span></span><br></pre></td></tr></table></figure><h3 id="3-7-矩阵(matrix)"><a href="#3-7-矩阵(matrix)" class="headerlink" title="3.7 
Matrix"></a>3.7 Matrix</h3><p>A matrix is a two-dimensional data structure made up of rows and columns, in which every element has <strong>the same data type</strong>. A matrix can be viewed as an extension of a vector.</p><figure class="highlight r"><table><tr><td class="code"><pre><span class="line"><span class="comment"># matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)</span></span><br><span class="line"><span class="comment"># data: the elements of the matrix; defaults to NA, so every entry is NA when no values are given</span></span><br><span class="line"><span class="comment"># nrow: number of rows, default 1; may be abbreviated to nr</span></span><br><span class="line"><span class="comment"># ncol: number of columns, default 1; may be abbreviated to nc</span></span><br><span class="line"><span class="comment"># byrow: whether to fill the matrix by row; the default fills by column</span></span><br><span class="line"><span class="comment"># dimnames: row and column names, given as a list of two character vectors</span></span><br><span class="line">a <span class="operator">=</span> <span class="number">1</span><span class="operator">:</span><span class="number">12</span></span><br><span class="line">matrix<span class="punctuation">(</span>a<span class="punctuation">)</span></span><br><span class="line"><span class="comment">## [,1]</span></span><br><span class="line"><span class="comment">## [1,] 1</span></span><br><span class="line"><span class="comment">## [2,] 2</span></span><br><span class="line"><span class="comment">## [3,] 3</span></span><br><span class="line"><span class="comment">## [4,] 4</span></span><br><span class="line"><span class="comment">## [5,] 5</span></span><br><span class="line"><span class="comment">## [6,] 6</span></span><br><span class="line"><span class="comment">## [7,] 7</span></span><br><span class="line"><span class="comment">## [8,] 8</span></span><br><span class="line"><span class="comment">## [9,] 9</span></span><br><span class="line"><span class="comment">##[10,] 10</span></span><br><span class="line"><span class="comment">##[11,] 11</span></span><br><span class="line"><span 
class="comment">##[12,] 12</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># numeric matrix</span></span><br><span class="line">a <span class="operator">=</span> matrix<span class="punctuation">(</span>a<span class="punctuation">,</span></span><br><span class="line"> nr <span class="operator">=</span> <span class="number">3</span><span class="punctuation">,</span> <span class="comment"># nr is an abbreviation of nrow</span></span><br><span class="line"> byrow <span class="operator">=</span> <span class="literal">TRUE</span><span class="punctuation">)</span> <span class="comment"># fill row by row</span></span><br><span class="line">a</span><br><span class="line"><span class="comment">## [,1] [,2] [,3] [,4]</span></span><br><span class="line"><span class="comment">## [1,] 1 2 3 4</span></span><br><span class="line"><span class="comment">## [2,] 5 6 7 8</span></span><br><span class="line"><span class="comment">## [3,] 9 10 11 12</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># character matrix</span></span><br><span class="line">matrix<span class="punctuation">(</span><span class="built_in">LETTERS</span><span class="punctuation">[</span><span class="number">1</span><span class="operator">:</span><span class="number">12</span><span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line"> ncol <span class="operator">=</span> <span class="number">3</span><span class="punctuation">)</span></span><br><span class="line"><span class="comment">## [,1] [,2] [,3]</span></span><br><span class="line"><span class="comment">## [1,] "A" "E" "I" </span></span><br><span class="line"><span class="comment">## [2,] "B" "F" "J" </span></span><br><span class="line"><span class="comment">## [3,] "C" "G" "K" </span></span><br><span class="line"><span class="comment">## [4,] "D" "H" "L"</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># row/column names: assign a vector whose length equals the number of rows/columns</span></span><br><span 
class="line">colnames<span class="punctuation">(</span>a<span class="punctuation">)</span> <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="string">"第一列"</span><span class="punctuation">,</span><span class="string">"第二列"</span><span class="punctuation">,</span><span class="string">"第三列"</span><span class="punctuation">,</span><span class="string">"第四列"</span><span class="punctuation">)</span> <span class="comment"># the names mean "column 1" to "column 4"</span></span><br><span class="line">a</span><br><span class="line"><span class="comment">## 第一列 第二列 第三列 第四列</span></span><br><span class="line"><span class="comment">## [1,] 1 2 3 4</span></span><br><span class="line"><span class="comment">## [2,] 5 6 7 8</span></span><br><span class="line"><span class="comment">## [3,] 9 10 11 12</span></span><br><span class="line">row.names<span class="punctuation">(</span>a<span class="punctuation">)</span> <span class="operator">=</span> <span class="built_in">LETTERS</span><span class="punctuation">[</span><span class="number">1</span><span class="operator">:</span><span class="number">3</span><span class="punctuation">]</span></span><br><span class="line">a</span><br><span class="line"><span class="comment">## 第一列 第二列 第三列 第四列</span></span><br><span class="line"><span class="comment">## A 1 2 3 4</span></span><br><span class="line"><span class="comment">## B 5 6 7 8</span></span><br><span class="line"><span class="comment">## C 9 10 11 12</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># subsetting (extracting data)</span></span><br><span class="line">a<span class="punctuation">[</span><span class="number">1</span><span class="punctuation">,</span><span class="punctuation">]</span> <span class="comment"># extract the first row of the matrix</span></span><br><span class="line"><span class="comment">## 第一列 第二列 第三列 第四列 </span></span><br><span class="line"><span class="comment">## 1 2 3 4 </span></span><br><span class="line">a<span class="punctuation">[</span><span class="number">1</span><span 
class="operator">:</span><span class="number">2</span><span class="punctuation">,</span><span class="punctuation">]</span> <span class="comment"># extract the first two rows</span></span><br><span class="line"><span class="comment">## 第一列 第二列 第三列 第四列</span></span><br><span class="line"><span class="comment">## A 1 2 3 4</span></span><br><span class="line"><span class="comment">## B 5 6 7 8</span></span><br><span class="line">a<span class="punctuation">[</span><span class="operator">-</span><span class="number">1</span><span class="punctuation">,</span><span class="punctuation">]</span> <span class="comment"># drop the first row of the matrix</span></span><br><span class="line"><span class="comment">## 第一列 第二列 第三列 第四列</span></span><br><span class="line"><span class="comment">## B 5 6 7 8</span></span><br><span class="line"><span class="comment">## C 9 10 11 12</span></span><br></pre></td></tr></table></figure><h3 id="3-8-数据框(Data-Frame)"><a href="#3-8-数据框(Data-Frame)" class="headerlink" title="3.8 Data Frame"></a>3.8 Data Frame</h3><p>A data frame is another common two-dimensional data structure in R. Unlike a matrix, it <strong>can store different types of data</strong>: each column may hold a different type, such as numeric, character, or factor values. Every column must still have the same length, so shorter vectors are recycled when the data frame is built.</p><figure class="highlight r"><table><tr><td class="code"><pre><span class="line">data.frame<span class="punctuation">(</span><span class="number">1</span><span 
class="operator">:</span><span class="number">12</span><span class="punctuation">,</span><span class="number">5</span><span class="operator">:</span><span class="number">8</span><span class="punctuation">)</span> <span class="comment"># lengths differ: the shorter vector 5:8 is recycled to fill all 12 rows</span></span><br><span class="line"><span class="comment">## X1.12 X5.8</span></span><br><span class="line"><span class="comment">## 1 1 5</span></span><br><span class="line"><span class="comment">## 2 2 6</span></span><br><span class="line"><span class="comment">## 3 3 7</span></span><br><span class="line"><span class="comment">## 4 4 8</span></span><br><span class="line"><span class="comment">## 5 5 5</span></span><br><span class="line"><span class="comment">## 6 6 6</span></span><br><span class="line"><span class="comment">## 7 7 7</span></span><br><span class="line"><span class="comment">## 8 8 8</span></span><br><span class="line"><span class="comment">## 9 9 5</span></span><br><span class="line"><span class="comment">## 10 10 6</span></span><br><span class="line"><span class="comment">## 11 11 7</span></span><br><span class="line"><span class="comment">## 12 12 8</span></span><br><span class="line"></span><br><span class="line">a <span class="operator">=</span> <span class="number">1</span><span class="operator">:</span><span class="number">4</span></span><br><span class="line">b <span class="operator">=</span> <span class="number">5</span><span class="operator">:</span><span class="number">8</span></span><br><span class="line"><span class="built_in">c</span> <span class="operator">=</span> <span class="built_in">letters</span><span class="punctuation">[</span><span class="number">1</span><span class="operator">:</span><span class="number">4</span><span class="punctuation">]</span> <span class="comment"># works, but reusing the name of the c() function for data is best avoided</span></span><br><span class="line">d <span class="operator">=</span> <span class="built_in">letters</span><span class="punctuation">[</span><span class="number">5</span><span class="operator">:</span><span class="number">8</span><span 
class="punctuation">]</span></span><br><span class="line">e <span class="operator">=</span> data.frame<span class="punctuation">(</span>a<span class="punctuation">,</span>b<span class="punctuation">,</span><span class="built_in">c</span><span class="punctuation">,</span>d<span class="punctuation">)</span></span><br><span class="line">e</span><br><span class="line"><span class="comment">## a b c d</span></span><br><span class="line"><span class="comment">## 1 1 5 a e</span></span><br><span class="line"><span class="comment">## 2 2 6 b f</span></span><br><span class="line"><span class="comment">## 3 3 7 c g</span></span><br><span class="line"><span class="comment">## 4 4 8 d h</span></span><br><span class="line"></span><br><span class="line">str<span class="punctuation">(</span>e<span class="punctuation">)</span> <span class="comment"># inspect the structure and column types of the data frame</span></span><br><span class="line"><span class="comment">## 'data.frame': 4 obs. of 4 variables:</span></span><br><span class="line"><span class="comment">## $ a: int 1 2 3 4</span></span><br><span class="line"><span class="comment">## $ b: int 5 6 7 8</span></span><br><span class="line"><span class="comment">## $ c: chr "a" "b" "c" "d"</span></span><br><span class="line"><span class="comment">## $ d: chr "e" "f" "g" "h"</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># basic data operations</span></span><br><span class="line"><span class="comment"># extracting rows/columns</span></span><br><span class="line">e<span class="punctuation">[</span><span class="number">1</span><span class="punctuation">,</span><span class="punctuation">]</span></span><br><span class="line"><span class="comment">## a b c d</span></span><br><span class="line"><span class="comment">## 1 1 5 a e</span></span><br><span class="line">e<span class="punctuation">[</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">]</span></span><br><span class="line"><span class="comment">## [1] 1 2 3 
4</span></span><br><span class="line">e<span class="operator">$</span>a <span class="comment"># $ extracts a column by name; a, b, c, d are the column names</span></span><br><span class="line"><span class="comment">## [1] 1 2 3 4</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># add a column</span></span><br><span class="line">e<span class="operator">$</span>e <span class="operator">=</span> <span class="number">9</span><span class="operator">:</span><span class="number">12</span> <span class="comment"># column e does not exist yet, so it is created from the vector 9:12</span></span><br><span class="line">e</span><br><span class="line"><span class="comment">## a b c d e</span></span><br><span class="line"><span class="comment">## 1 1 5 a e 9</span></span><br><span class="line"><span class="comment">## 2 2 6 b f 10</span></span><br><span class="line"><span class="comment">## 3 3 7 c g 11</span></span><br><span class="line"><span class="comment">## 4 4 8 d h 12</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># naming rows/columns</span></span><br><span class="line">row.names<span class="punctuation">(</span>e<span class="punctuation">)</span><span class="punctuation">[</span><span class="number">3</span><span class="punctuation">]</span> <span class="operator">=</span> <span class="string">"P"</span></span><br><span class="line">row.names<span class="punctuation">(</span>e<span class="punctuation">)</span><span class="punctuation">[</span><span class="number">4</span><span class="punctuation">]</span> <span class="operator">=</span> <span class="string">"l"</span> <span class="comment"># change row names</span></span><br><span class="line">colnames<span class="punctuation">(</span>e<span class="punctuation">)</span><span class="punctuation">[</span><span class="number">1</span><span class="punctuation">]</span> <span class="operator">=</span> <span class="string">"第一列"</span> <span class="comment"># change a column name ("第一列" means "column 1")</span></span><br><span class="line">e</span><br><span class="line"><span class="comment">## 第一列 b c d 
e</span></span><br><span class="line"><span class="comment">## 1 1 5 a e 9</span></span><br><span class="line"><span class="comment">## 2 2 6 b f 10</span></span><br><span class="line"><span class="comment">## P 3 7 c g 11</span></span><br><span class="line"><span class="comment">## l 4 8 d h 12</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># two ways to combine: rbind (by row) and cbind (by column), e.g. cbind(object, name = values)</span></span><br><span class="line">f <span class="operator">=</span> cbind<span class="punctuation">(</span>e<span class="punctuation">,</span>f <span class="operator">=</span> <span class="number">13</span><span class="operator">:</span><span class="number">16</span><span class="punctuation">)</span> <span class="comment"># the result has one more column</span></span><br><span class="line">f</span><br><span class="line"><span class="comment">## 第一列 b c d e f</span></span><br><span class="line"><span class="comment">## 1 1 5 a e 9 13</span></span><br><span class="line"><span class="comment">## 2 2 6 b f 10 14</span></span><br><span class="line"><span class="comment">## P 3 7 c g 11 15</span></span><br><span class="line"><span class="comment">## l 4 8 d h 12 16</span></span><br><span class="line">colnames<span class="punctuation">(</span>f<span class="punctuation">)</span><span class="punctuation">[</span><span class="number">6</span><span class="punctuation">]</span> <span class="operator">=</span> <span class="string">"Y"</span></span><br><span class="line">f</span><br><span class="line">g <span class="operator">=</span> rbind<span class="punctuation">(</span>f<span class="punctuation">,</span> <span class="number">6</span> <span class="punctuation">)</span> <span class="comment"># the result has one more row; the single value 6 is recycled across every column</span></span><br><span class="line">g</span><br><span class="line"><span class="comment">## 第一列 b c d e Y</span></span><br><span class="line"><span class="comment">## 1 1 5 a e 9 13</span></span><br><span class="line"><span class="comment">## 2 2 6 b f 10 
14</span></span><br><span class="line"><span class="comment">## P 3 7 c g 11 15</span></span><br><span class="line"><span class="comment">## l 4 8 d h 12 16</span></span><br><span class="line"><span class="comment">## 5 6 6 6 6 6 6</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># row/column means</span></span><br><span class="line"><span class="comment"># rowMeans: mean of each row; colMeans: mean of each column</span></span><br></pre></td></tr></table></figure><figure class="highlight r"><table><tr><td class="code"><pre><span 
class="line"><span class="comment">## Exercise: build a matrix holding the values 21-40 and name its rows and columns</span></span><br><span class="line"><span class="comment">## Append a final column of row means; then append a final column with the difference between the second and first columns</span></span><br><span class="line">a <span class="operator">=</span> matrix<span class="punctuation">(</span><span class="number">21</span><span class="operator">:</span><span class="number">40</span><span class="punctuation">,</span> ncol <span class="operator">=</span> <span class="number">5</span><span class="punctuation">)</span></span><br><span class="line">colnames<span class="punctuation">(</span>a<span class="punctuation">)</span> <span class="operator">=</span> <span class="built_in">LETTERS</span><span class="punctuation">[</span><span class="number">1</span><span class="operator">:</span><span class="number">5</span><span class="punctuation">]</span></span><br><span class="line">row.names<span class="punctuation">(</span>a<span class="punctuation">)</span> <span class="operator">=</span> <span class="built_in">letters</span><span class="punctuation">[</span><span class="number">1</span><span class="operator">:</span><span class="number">4</span><span class="punctuation">]</span></span><br><span class="line">cbind<span class="punctuation">(</span>a<span class="punctuation">,</span>Mean <span class="operator">=</span> rowMeans<span class="punctuation">(</span>a<span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"><span class="comment">## A B C D E Mean</span></span><br><span class="line"><span class="comment">## a 21 25 29 33 37 29</span></span><br><span class="line"><span class="comment">## b 22 26 30 34 38 30</span></span><br><span class="line"><span class="comment">## c 23 27 31 35 39 31</span></span><br><span class="line"><span class="comment">## d 24 28 32 36 40 32</span></span><br><span class="line">b <span class="operator">=</span> a<span class="punctuation">[</span><span class="punctuation">,</span><span class="number">1</span><span 
class="punctuation">]</span></span><br><span class="line"><span class="built_in">c</span> <span class="operator">=</span> a<span class="punctuation">[</span><span class="punctuation">,</span><span class="number">2</span><span class="punctuation">]</span></span><br><span class="line">d <span class="operator">=</span> <span class="built_in">c</span><span class="operator">-</span>b</span><br><span class="line">cbind<span class="punctuation">(</span>a<span class="punctuation">,</span> H <span class="operator">=</span> d<span class="punctuation">)</span></span><br><span class="line"><span class="comment">## A B C D E H</span></span><br><span class="line"><span class="comment">## a 21 25 29 33 37 4</span></span><br><span class="line"><span class="comment">## b 22 26 30 34 38 4</span></span><br><span class="line"><span class="comment">## c 23 27 31 35 39 4</span></span><br><span class="line"><span class="comment">## d 24 28 32 36 40 4</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment">## Exercise: build a data frame recording one day's spending: column 1 is the amount spent, column 2 the unit price, column 3 the quantity</span></span><br><span class="line">b <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="number">648</span><span class="punctuation">,</span><span class="number">328</span><span class="punctuation">,</span><span class="number">128</span><span class="punctuation">,</span><span class="number">60</span><span class="punctuation">,</span><span class="number">30</span><span class="punctuation">,</span><span class="number">6</span><span class="punctuation">)</span></span><br><span class="line"><span class="built_in">c</span> <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="number">1</span><span class="punctuation">,</span><span class="number">3</span><span class="punctuation">,</span><span class="number">4</span><span class="punctuation">,</span><span 
class="number">1</span><span class="punctuation">,</span><span class="number">4</span><span class="punctuation">,</span><span class="number">6</span><span class="punctuation">)</span></span><br><span class="line">a <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="built_in">c</span><span class="operator">*</span>b<span class="punctuation">)</span></span><br><span class="line">d <span class="operator">=</span> data.frame<span class="punctuation">(</span>a<span class="punctuation">,</span>b<span class="punctuation">,</span><span class="built_in">c</span><span class="punctuation">)</span></span><br><span class="line">colnames<span class="punctuation">(</span>d<span class="punctuation">)</span><span class="punctuation">[</span><span class="number">1</span><span class="punctuation">]</span> <span class="operator">=</span> <span class="string">"消费金额"</span></span><br><span class="line">colnames<span class="punctuation">(</span>d<span class="punctuation">)</span><span class="punctuation">[</span><span class="number">2</span><span class="punctuation">]</span> <span class="operator">=</span> <span class="string">"单价"</span></span><br><span class="line">colnames<span class="punctuation">(</span>d<span class="punctuation">)</span><span class="punctuation">[</span><span class="number">3</span><span class="punctuation">]</span> <span class="operator">=</span> <span class="string">"数量"</span></span><br><span class="line">row.names<span class="punctuation">(</span>d<span class="punctuation">)</span><span class="punctuation">[</span><span class="number">1</span><span class="punctuation">]</span> <span class="operator">=</span> <span class="string">"648氪金消费"</span></span><br><span class="line">row.names<span class="punctuation">(</span>d<span class="punctuation">)</span><span class="punctuation">[</span><span class="number">2</span><span class="punctuation">]</span> <span class="operator">=</span> <span 
class="string">"328氪金消费"</span></span><br><span class="line">row.names<span class="punctuation">(</span>d<span class="punctuation">)</span><span class="punctuation">[</span><span class="number">3</span><span class="punctuation">]</span> <span class="operator">=</span> <span class="string">"128氪金消费"</span></span><br><span class="line">row.names<span class="punctuation">(</span>d<span class="punctuation">)</span><span class="punctuation">[</span><span class="number">4</span><span class="punctuation">]</span> <span class="operator">=</span> <span class="string">"60氪金消费"</span></span><br><span class="line">row.names<span class="punctuation">(</span>d<span class="punctuation">)</span><span class="punctuation">[</span><span class="number">5</span><span class="punctuation">]</span> <span class="operator">=</span> <span class="string">"30氪金消费"</span></span><br><span class="line">row.names<span class="punctuation">(</span>d<span class="punctuation">)</span><span class="punctuation">[</span><span class="number">6</span><span class="punctuation">]</span> <span class="operator">=</span> <span class="string">"6氪金消费"</span></span><br><span class="line">rbind<span class="punctuation">(</span>d<span class="punctuation">,</span>总氪金量 <span class="operator">=</span> <span class="built_in">sum</span><span class="punctuation">(</span>a<span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"><span class="comment">## 消费金额 单价 数量</span></span><br><span class="line"><span class="comment">## 648氪金消费 648 648 1</span></span><br><span class="line"><span class="comment">## 328氪金消费 984 328 3</span></span><br><span class="line"><span class="comment">## 128氪金消费 512 128 4</span></span><br><span class="line"><span class="comment">## 60氪金消费 60 60 1</span></span><br><span class="line"><span class="comment">## 30氪金消费 120 30 4</span></span><br><span class="line"><span class="comment">## 6氪金消费 36 6 6</span></span><br><span class="line"><span class="comment">## 总氪金量 
2360 2360 2360</span></span><br></pre></td></tr></table></figure></div><h2 id="4-R做图"><a href="#4-R做图" class="headerlink" title="4. Plotting in R"></a>4. Plotting in R</h2><div class="story post-story"><p>For those of us studying biology, what matters most is how to make figures. It is fine if the programming basics above did not fully sink in: the plotting code below can be used as a template.</p><p>R can draw plots with many different packages. Here we use ggplot2, the most basic one, as the example. The installation and loading steps are shown below, and every later example requires ggplot2 to be loaded:</p><figure class="highlight r"><table><tr><td class="code"><pre><span class="line"><span class="comment"># install the ggplot2 package</span></span><br><span class="line">install.packages<span class="punctuation">(</span><span class="string">"ggplot2"</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># load ggplot2</span></span><br><span class="line">library<span class="punctuation">(</span>ggplot2<span class="punctuation">)</span></span><br></pre></td></tr></table></figure><h3 id="4-1-线性回归图"><a href="#4-1-线性回归图" class="headerlink" title="4.1 Linear Regression Plot"></a>4.1 Linear Regression Plot</h3><figure class="highlight r"><table><tr><td class="code"><pre><span class="line">library<span class="punctuation">(</span>ggplot2<span class="punctuation">)</span></span><br><span class="line"><span class="comment"># use R's built-in cars dataset as the example</span></span><br><span class="line">ggplot<span class="punctuation">(</span>data <span class="operator">=</span> cars<span class="punctuation">,</span></span><br><span class="line"> mapping <span class="operator">=</span> aes<span class="punctuation">(</span>x <span class="operator">=</span> speed<span class="punctuation">,</span> <span class="comment"># aes() maps variables in the data to visual properties (aesthetics) of the geoms</span></span><br><span class="line"> y <span class="operator">=</span> dist<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span> </span><br><span class="line"> geom_point<span class="punctuation">(</span><span class="punctuation">)</span> <span class="operator">+</span> geom_line<span class="punctuation">(</span>col <span class="operator">=</span> <span class="string">"red"</span><span class="punctuation">)</span> <span class="operator">+</span> <span class="comment"># each geom_*() call adds a layer</span></span><br><span class="line"> geom_smooth<span class="punctuation">(</span>method <span class="operator">=</span> <span class="string">"lm"</span><span class="punctuation">)</span> <span class="operator">+</span> <span class="comment"># "lm" fits a linear regression</span></span><br><span class="line"> annotate<span class="punctuation">(</span>geom <span class="operator">=</span> <span class="string">"text"</span><span class="punctuation">,</span> <span class="comment"># annotate() adds an annotation</span></span><br><span class="line"> x <span class="operator">=</span> <span class="number">10</span><span class="punctuation">,</span> <span 
class="comment"># annotation coordinates</span></span><br><span class="line"> y <span class="operator">=</span> <span class="number">100</span><span class="punctuation">,</span></span><br><span class="line"> label <span class="operator">=</span> <span class="string">"y = 3.93x - 17.58 \n p = 1.49e-12"</span><span class="punctuation">,</span> <span class="comment"># slope and intercept come from summary(lm.cars) below; \n is a newline</span></span><br><span class="line"> size <span class="operator">=</span> <span class="number">5</span><span class="punctuation">)</span> <span class="comment"># annotation text size</span></span><br><span class="line"></span><br><span class="line">lm.cars <span class="operator">=</span> lm<span class="punctuation">(</span>formula <span class="operator">=</span> dist <span class="operator">~</span> speed<span class="punctuation">,</span></span><br><span class="line"> data <span class="operator">=</span> cars<span class="punctuation">)</span></span><br><span class="line">summary<span class="punctuation">(</span>lm.cars<span class="punctuation">)</span> <span class="comment"># detailed summary of the fitted regression; the output is shown below and should not be typed into your code</span></span><br><span class="line"><span class="comment">#-------------------------------------------------------------------------------------</span></span><br><span class="line"><span class="comment">#Call:</span></span><br><span class="line"><span class="comment">#lm(formula = dist ~ speed, data = cars)</span></span><br><span class="line"></span><br><span class="line"><span class="comment">#Residuals:</span></span><br><span class="line"><span class="comment"># Min 1Q Median 3Q Max </span></span><br><span class="line"><span class="comment">#-29.069 -9.525 -2.272 9.215 43.201 </span></span><br><span class="line"></span><br><span class="line"><span class="comment">#Coefficients:</span></span><br><span class="line"><span class="comment"># Estimate Std. 
Error t value Pr(>|t|) </span></span><br><span class="line"><span class="comment">#(Intercept) -17.5791 6.7584 -2.601 0.0123 * </span></span><br><span class="line"><span class="comment">#speed 3.9324 0.4155 9.464 1.49e-12 ***</span></span><br><span class="line"><span class="comment">#---</span></span><br><span class="line"><span class="comment">#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</span></span><br><span class="line"></span><br><span class="line"><span class="comment">#Residual standard error: 15.38 on 48 degrees of freedom</span></span><br><span class="line"><span class="comment">#Multiple R-squared: 0.6511,Adjusted R-squared: 0.6438 </span></span><br><span class="line"><span class="comment">#F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12</span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230926/3.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230926/3.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><h3 id="4-2-直方图"><a href="#4-2-直方图" class="headerlink" title="4.2 直方图"></a>4.2 Histogram</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line">library<span 
class="punctuation">(</span>ggplot2<span class="punctuation">)</span></span><br><span class="line"><span class="comment"># Use R's built-in iris dataset as an example</span></span><br><span class="line">ggplot<span class="punctuation">(</span>data <span class="operator">=</span> iris<span class="punctuation">,</span></span><br><span class="line"> mapping <span class="operator">=</span> aes<span class="punctuation">(</span>x <span class="operator">=</span> Sepal.Length<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_histogram<span class="punctuation">(</span>col <span class="operator">=</span> <span class="string">"red"</span><span class="punctuation">,</span> <span class="comment"># border color</span></span><br><span class="line"> fill <span class="operator">=</span> <span class="string">"blue"</span><span class="punctuation">,</span> <span class="comment"># fill color</span></span><br><span class="line"> alpha <span class="operator">=</span> <span class="number">0.5</span><span class="punctuation">,</span> <span class="comment"># transparency</span></span><br><span class="line"> bins <span class="operator">=</span> <span class="number">30</span><span class="punctuation">)</span> <span class="operator">+</span> <span class="comment"># number of bins</span></span><br><span class="line"> labs<span class="punctuation">(</span>x <span class="operator">=</span> <span class="string">"Sepal Length(cm)"</span><span class="punctuation">,</span> <span class="comment"># labs() sets the axis titles</span></span><br><span class="line"> y <span class="operator">=</span> <span class="string">"Count"</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> theme_classic<span class="punctuation">(</span><span class="punctuation">)</span> <span class="comment"># classic theme (white background)</span></span><br><span class="line"></span><br><span class="line"><span class="comment">## Exercise: plot a histogram of Sepal.Width from the iris dataset (figure omitted)</span></span><br><span 
class="line">ggplot<span class="punctuation">(</span>data <span class="operator">=</span> iris<span class="punctuation">,</span></span><br><span class="line"> mapping <span class="operator">=</span> aes<span class="punctuation">(</span>x <span class="operator">=</span> Sepal.Width<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_histogram<span class="punctuation">(</span>col <span class="operator">=</span> <span class="string">"red"</span><span class="punctuation">,</span></span><br><span class="line"> fill <span class="operator">=</span> <span class="string">"blue"</span><span class="punctuation">,</span> </span><br><span class="line"> alpha <span class="operator">=</span> <span class="number">0.3</span><span class="punctuation">,</span></span><br><span class="line"> bins <span class="operator">=</span> <span class="number">30</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> labs<span class="punctuation">(</span>x <span class="operator">=</span> <span class="string">"Sepal Width(cm)"</span><span class="punctuation">,</span></span><br><span class="line"> y <span class="operator">=</span> <span class="string">"Count"</span><span class="punctuation">)</span> <span class="operator">+</span> </span><br><span class="line"> theme_classic<span class="punctuation">(</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230926/4.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230926/4.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><h3 id="4-3-密度图"><a href="#4-3-密度图" class="headerlink" title="4.3 密度图"></a>4.3 Density plot</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br></pre></td><td class="code"><pre><span class="line">library<span class="punctuation">(</span>ggplot2<span class="punctuation">)</span></span><br><span class="line">ggplot<span class="punctuation">(</span>iris<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x <span class="operator">=</span> Sepal.Width<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_density<span class="punctuation">(</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230926/5.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230926/5.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Density curve + histogram</span></span><br><span class="line">library<span class="punctuation">(</span>ggplot2<span class="punctuation">)</span></span><br><span class="line">ggplot<span class="punctuation">(</span>data <span class="operator">=</span> iris<span class="punctuation">,</span></span><br><span class="line"> mapping <span class="operator">=</span> aes<span 
class="punctuation">(</span>x <span class="operator">=</span> Sepal.Width<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_histogram<span class="punctuation">(</span>aes<span class="punctuation">(</span>y <span class="operator">=</span> after_stat<span class="punctuation">(</span>density<span class="punctuation">)</span><span class="punctuation">)</span><span class="punctuation">,</span> <span class="comment"># put density instead of count on the histogram's y axis</span></span><br><span class="line"> col <span class="operator">=</span> <span class="string">"red"</span><span class="punctuation">,</span></span><br><span class="line"> fill <span class="operator">=</span> <span class="string">"blue"</span><span class="punctuation">,</span> </span><br><span class="line"> alpha <span class="operator">=</span> <span class="number">0.3</span><span class="punctuation">,</span></span><br><span class="line"> bins <span class="operator">=</span> <span class="number">30</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_density<span class="punctuation">(</span>col <span class="operator">=</span> <span class="string">"red"</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> theme_dark<span class="punctuation">(</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment">## Exercise: plot a histogram and a density curve of Petal.Length (figure omitted)</span></span><br><span class="line">ggplot<span class="punctuation">(</span>data <span class="operator">=</span> iris<span class="punctuation">,</span></span><br><span class="line"> mapping <span class="operator">=</span> aes<span class="punctuation">(</span>x <span class="operator">=</span> Petal.Length<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span> </span><br><span class="line"> geom_histogram<span class="punctuation">(</span>aes<span 
class="punctuation">(</span>y <span class="operator">=</span> after_stat<span class="punctuation">(</span>density<span class="punctuation">)</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> bins <span class="operator">=</span> <span class="number">30</span><span class="punctuation">,</span></span><br><span class="line"> col <span class="operator">=</span> <span class="string">"blue"</span><span class="punctuation">,</span></span><br><span class="line"> fill <span class="operator">=</span> <span class="string">"green"</span><span class="punctuation">)</span><span class="operator">+</span></span><br><span class="line"> geom_density<span class="punctuation">(</span>col <span class="operator">=</span> <span class="string">"red"</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230926/6.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230926/6.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><h3 id="4-4-折线图"><a href="#4-4-折线图" class="headerlink" title="4.4 折线图"></a>4.4 Line chart</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">library<span class="punctuation">(</span>ggplot2<span class="punctuation">)</span></span><br><span class="line">ggplot<span class="punctuation">(</span>iris<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x <span class="operator">=</span> Sepal.Width<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span> </span><br><span class="line"> geom_freqpoly<span class="punctuation">(</span><span class="punctuation">)</span> <span class="comment"># frequency polygon (line chart)</span></span><br></pre></td></tr></table></figure><p><img 
src="https://www.shelven.com/tuchuang/20230926/7.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230926/7.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><h3 id="4-5-柱形图"><a href="#4-5-柱形图" class="headerlink" title="4.5 柱形图"></a>4.5 Bar chart</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br></pre></td><td class="code"><pre><span class="line">library<span class="punctuation">(</span>ggplot2<span class="punctuation">)</span></span><br><span class="line">ggplot<span class="punctuation">(</span>data <span class="operator">=</span> iris<span class="punctuation">,</span></span><br><span class="line"> mapping <span class="operator">=</span> aes<span class="punctuation">(</span>x <span class="operator">=</span> Species<span class="punctuation">,</span></span><br><span class="line"> fill <span class="operator">=</span> Species<span 
class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span> <span class="comment"># fill color mapped for the whole plot</span></span><br><span class="line"> geom_bar<span class="punctuation">(</span>width <span class="operator">=</span> <span class="number">0.5</span><span class="punctuation">,</span> <span class="comment"># bar width</span></span><br><span class="line"> alpha <span class="operator">=</span> <span class="number">0.3</span><span class="punctuation">,</span> <span class="comment"># note: setting a fixed color here would override the mapped fill</span></span><br><span class="line"> show.legend <span class="operator">=</span> <span class="built_in">F</span><span class="punctuation">)</span> <span class="comment"># F (FALSE) hides the legend</span></span><br><span class="line"></span><br><span class="line"><span class="comment">## Exercise: using the mtcars dataset, plot cyl (figure omitted)</span></span><br><span class="line">mtcars</span><br><span class="line">ggplot<span class="punctuation">(</span>mtcars<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x <span class="operator">=</span> factor<span class="punctuation">(</span>cyl<span class="punctuation">)</span><span class="punctuation">,</span> <span class="comment"># factor() converts the numeric variable to a factor</span></span><br><span class="line"> fill <span class="operator">=</span> factor<span class="punctuation">(</span>cyl<span class="punctuation">)</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span> </span><br><span class="line"> geom_bar<span class="punctuation">(</span>width <span class="operator">=</span> <span class="number">0.5</span><span class="punctuation">,</span></span><br><span class="line"> col <span class="operator">=</span> <span class="string">"red"</span><span class="punctuation">,</span></span><br><span class="line"> alpha <span class="operator">=</span> <span class="number">0.3</span><span class="punctuation">)</span> <span class="operator">+</span> </span><br><span class="line"> labs<span 
class="punctuation">(</span>x <span class="operator">=</span> <span class="string">"Number of cylinder"</span><span class="punctuation">,</span></span><br><span class="line"> y <span class="operator">=</span> <span class="string">"Count"</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Plotting with two variables (figure omitted)</span></span><br><span class="line">ggplot<span class="punctuation">(</span>iris<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x <span class="operator">=</span> Sepal.Length<span class="punctuation">,</span></span><br><span class="line"> col <span class="operator">=</span> Species<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span> <span class="comment"># group by species</span></span><br><span class="line"> geom_density<span class="punctuation">(</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment">## Exercise: plot a histogram of iris Sepal.Length by species (figure omitted)</span></span><br><span class="line">ggplot<span class="punctuation">(</span>iris<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x<span class="operator">=</span> Sepal.Length<span class="punctuation">,</span></span><br><span class="line"> fill <span class="operator">=</span> Species<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_histogram<span class="punctuation">(</span>bins <span class="operator">=</span> <span class="number">30</span><span class="punctuation">,</span></span><br><span class="line"> alpha <span class="operator">=</span> <span class="number">0.4</span><span class="punctuation">,</span></span><br><span class="line"> col <span class="operator">=</span> <span class="string">"black"</span><span class="punctuation">)</span> <span 
class="operator">+</span></span><br><span class="line"> theme_classic<span class="punctuation">(</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230926/8.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230926/8.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Bar chart with error bars</span></span><br><span class="line">library<span class="punctuation">(</span>ggplot2<span class="punctuation">)</span></span><br><span class="line">Mean <span 
class="operator">=</span> tapply<span class="punctuation">(</span>iris<span class="operator">$</span>Sepal.Length<span class="punctuation">,</span> <span class="comment"># tapply() groups by species and applies the function </span></span><br><span class="line"> iris<span class="operator">$</span>Species<span class="punctuation">,</span></span><br><span class="line"> mean<span class="punctuation">)</span></span><br><span class="line">Mean <span class="operator">=</span> as.data.frame<span class="punctuation">(</span>Mean<span class="punctuation">)</span> <span class="comment"># convert the array to a data frame</span></span><br><span class="line">Mean</span><br><span class="line"><span class="comment">## Mean</span></span><br><span class="line"><span class="comment">## setosa 5.006</span></span><br><span class="line"><span class="comment">## versicolor 5.936</span></span><br><span class="line"><span class="comment">## virginica 6.588</span></span><br><span class="line">Mean<span class="operator">$</span>Species <span class="operator">=</span> row.names<span class="punctuation">(</span>Mean<span class="punctuation">)</span></span><br><span class="line">Mean</span><br><span class="line"><span class="comment">## Mean Species</span></span><br><span class="line"><span class="comment">## setosa 5.006 setosa</span></span><br><span class="line"><span class="comment">## versicolor 5.936 versicolor</span></span><br><span class="line"><span class="comment">## virginica 6.588 virginica</span></span><br><span class="line">sd <span class="operator">=</span> tapply<span class="punctuation">(</span>iris<span class="operator">$</span>Sepal.Length<span class="punctuation">,</span> <span class="comment"># tapply() groups by species and applies the function </span></span><br><span class="line"> iris<span class="operator">$</span>Species<span class="punctuation">,</span></span><br><span class="line"> sd<span class="punctuation">)</span></span><br><span class="line">sd <span class="operator">=</span> as.data.frame<span class="punctuation">(</span>sd<span 
class="punctuation">)</span></span><br><span class="line">sd</span><br><span class="line"><span class="comment">## sd</span></span><br><span class="line"><span class="comment">## setosa 0.3524897</span></span><br><span class="line"><span class="comment">## versicolor 0.5161711</span></span><br><span class="line"><span class="comment">## virginica 0.6358796</span></span><br><span class="line">Newiris <span class="operator">=</span> cbind<span class="punctuation">(</span>Mean<span class="punctuation">,</span> Sd <span class="operator">=</span> sd<span class="operator">$</span>sd<span class="punctuation">)</span></span><br><span class="line">Newiris</span><br><span class="line"><span class="comment">## Mean Species Sd</span></span><br><span class="line"><span class="comment">## setosa 5.006 setosa 0.3524897</span></span><br><span class="line"><span class="comment">## versicolor 5.936 versicolor 0.5161711</span></span><br><span class="line"><span class="comment">## virginica 6.588 virginica 0.6358796</span></span><br><span class="line">ggplot<span class="punctuation">(</span>Newiris<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x <span class="operator">=</span> Species<span class="punctuation">,</span>y <span class="operator">=</span> Mean<span class="punctuation">)</span><span class="punctuation">)</span><span class="operator">+</span></span><br><span class="line"> geom_col<span class="punctuation">(</span>width <span class="operator">=</span> <span class="number">0.5</span><span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>fill <span class="operator">=</span> Species<span class="punctuation">)</span><span class="punctuation">)</span><span class="operator">+</span></span><br><span class="line"> geom_errorbar<span class="punctuation">(</span>aes<span class="punctuation">(</span>ymin <span class="operator">=</span> Mean <span class="operator">-</span> Sd<span 
class="punctuation">,</span></span><br><span class="line"> ymax <span class="operator">=</span> Mean <span class="operator">+</span> Sd<span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> width <span class="operator">=</span> <span class="number">0.5</span><span class="punctuation">,</span></span><br><span class="line"> col <span class="operator">=</span> <span class="string">"black"</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230926/9.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230926/9.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><h3 id="4-6-箱式图"><a href="#4-6-箱式图" class="headerlink" title="4.6 箱式图"></a>4.6 Box plot</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line">library<span class="punctuation">(</span>ggplot2<span class="punctuation">)</span></span><br><span class="line"><span class="comment"># The three lines of the box are the 75th, 50th and 25th percentiles. Whiskers extend at most 1.5 times the box height; points beyond that are outliers</span></span><br><span class="line">ggplot<span class="punctuation">(</span>iris<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x <span class="operator">=</span> Species<span class="punctuation">,</span></span><br><span class="line"> y <span class="operator">=</span> Sepal.Length<span class="punctuation">,</span></span><br><span class="line"> fill <span class="operator">=</span> 
Species<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span> </span><br><span class="line"> geom_violin<span class="punctuation">(</span>show.legend <span class="operator">=</span> <span class="built_in">F</span><span class="punctuation">,</span></span><br><span class="line"> width <span class="operator">=</span> <span class="number">0.8</span><span class="punctuation">)</span> <span class="operator">+</span> <span class="comment"># overlay a violin plot</span></span><br><span class="line"> geom_boxplot<span class="punctuation">(</span>show.legend <span class="operator">=</span> <span class="built_in">F</span><span class="punctuation">,</span></span><br><span class="line"> width <span class="operator">=</span> <span class="number">0.2</span><span class="punctuation">,</span></span><br><span class="line"> fill <span class="operator">=</span> <span class="string">"white"</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_jitter<span class="punctuation">(</span>width <span class="operator">=</span> <span class="number">0.1</span><span class="punctuation">,</span></span><br><span class="line"> size <span class="operator">=</span> <span class="number">0.5</span><span class="punctuation">,</span></span><br><span class="line"> show.legend <span class="operator">=</span> <span class="built_in">F</span><span class="punctuation">)</span> <span class="comment"># show every point; size sets the point size</span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230926/10.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230926/10.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><h3 id="4-7-一些简单的练习"><a href="#4-7-一些简单的练习" class="headerlink" title="4.7 一些简单的练习"></a>4.7 Some simple exercises</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">## Exercise: using iris, regress Petal.Length on Sepal.Length and visualize the result</span></span><br><span class="line">library<span class="punctuation">(</span>ggplot2<span class="punctuation">)</span></span><br><span class="line">ggplot<span class="punctuation">(</span>iris<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x <span class="operator">=</span> Sepal.Length<span class="punctuation">,</span></span><br><span class="line"> y <span class="operator">=</span> Petal.Length<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span> </span><br><span class="line"> geom_point<span class="punctuation">(</span>size <span class="operator">=</span> <span class="number">0.5</span><span class="punctuation">)</span> <span class="operator">+</span> </span><br><span class="line"> geom_smooth<span class="punctuation">(</span>method <span class="operator">=</span> lm<span class="punctuation">,</span></span><br><span class="line"> col <span class="operator">=</span> <span class="string">"black"</span><span class="punctuation">)</span> <span class="operator">+</span> </span><br><span class="line"> annotate<span class="punctuation">(</span>geom <span class="operator">=</span> <span class="string">"text"</span><span class="punctuation">,</span></span><br><span class="line"> x <span class="operator">=</span> <span class="number">5.5</span><span 
class="punctuation">,</span></span><br><span class="line"> y <span class="operator">=</span> <span class="number">7</span><span class="punctuation">,</span></span><br><span class="line"> size <span class="operator">=</span> <span class="number">5</span><span class="punctuation">,</span></span><br><span class="line"> label <span class="operator">=</span> <span class="string">"y = 1.86x - 7.10 \n p < 2e-16 \n Adjusted r-squared = 0.76"</span> <span class="punctuation">)</span></span><br><span class="line">lm.iris <span class="operator">=</span> lm<span class="punctuation">(</span>Petal.Length <span class="operator">~</span> Sepal.Length<span class="punctuation">,</span></span><br><span class="line"> iris<span class="punctuation">)</span></span><br><span class="line">summary<span class="punctuation">(</span>lm.iris<span class="punctuation">)</span></span><br><span class="line"></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230926/14.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230926/14.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">## Exercise: draw candied hawthorn skewers (tanghulu)?</span></span><br><span class="line">library<span class="punctuation">(</span>ggplot2<span class="punctuation">)</span></span><br><span class="line">x <span 
class="operator">=</span> <span class="built_in">rep</span><span class="punctuation">(</span><span class="number">1</span><span class="operator">:</span><span class="number">3</span><span class="punctuation">,</span><span class="number">5</span><span class="punctuation">)</span></span><br><span class="line">y <span class="operator">=</span> <span class="number">1</span><span class="operator">:</span><span class="number">5</span></span><br><span class="line">a <span class="operator">=</span> data.frame<span class="punctuation">(</span>x<span class="punctuation">,</span>y<span class="punctuation">)</span></span><br><span class="line">ggplot<span class="punctuation">(</span>data <span class="operator">=</span> a<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x <span class="operator">=</span> x<span class="punctuation">,</span> y <span class="operator">=</span> y<span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> col <span class="operator">=</span> <span class="string">"red"</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_vline<span class="punctuation">(</span>xintercept <span class="operator">=</span> <span class="number">1</span><span class="operator">:</span><span class="number">3</span><span class="punctuation">,</span> <span class="comment"># draw vertical lines at x = 1, 2, 3</span></span><br><span class="line"> size <span class="operator">=</span> <span class="number">3</span><span class="punctuation">,</span></span><br><span class="line"> col <span class="operator">=</span> <span class="string">"yellow"</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_point<span class="punctuation">(</span>size <span class="operator">=</span> <span class="number">18</span><span class="punctuation">,</span></span><br><span class="line"> col <span class="operator">=</span> <span 
class="string">"#FF5000"</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> ylim<span class="punctuation">(</span><span class="number">0</span><span class="punctuation">,</span><span class="number">6</span><span class="punctuation">)</span> <span class="operator">+</span> <span class="comment"># 调整y轴上下限</span></span><br><span class="line"> xlim<span class="punctuation">(</span><span class="number">0</span><span class="punctuation">,</span><span class="number">4</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> theme_void<span class="punctuation">(</span><span class="punctuation">)</span> <span class="comment"># 主题设置为空</span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230926/13.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230926/13.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">## 练习:做个字母表?</span></span><br><span class="line">library<span class="punctuation">(</span>ggplot2<span class="punctuation">)</span></span><br><span class="line">x <span class="operator">=</span> <span class="built_in">rep</span><span class="punctuation">(</span><span 
class="number">1</span><span class="operator">:</span><span class="number">6</span><span class="punctuation">,</span><span class="number">5</span><span class="punctuation">)</span></span><br><span class="line">y <span class="operator">=</span> <span class="built_in">rep</span><span class="punctuation">(</span><span class="number">5</span><span class="operator">:</span><span class="number">1</span><span class="punctuation">,</span>each <span class="operator">=</span> <span class="number">6</span><span class="punctuation">)</span></span><br><span class="line">m <span class="operator">=</span> <span class="built_in">letters</span></span><br><span class="line">n <span class="operator">=</span> <span class="built_in">LETTERS</span></span><br><span class="line">l <span class="operator">=</span> paste0<span class="punctuation">(</span>n<span class="punctuation">,</span>m<span class="punctuation">)</span></span><br><span class="line">l</span><br><span class="line">al <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span>l<span class="punctuation">,</span><span class="built_in">rep</span><span class="punctuation">(</span><span class="literal">NA</span><span class="punctuation">,</span><span class="number">4</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="comment"># 设置4个NA值补齐</span></span><br><span class="line">a <span class="operator">=</span> data.frame<span class="punctuation">(</span>x<span class="punctuation">,</span>y<span class="punctuation">,</span>al<span class="punctuation">)</span></span><br><span class="line">a</span><br><span class="line">ggplot<span class="punctuation">(</span>a<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x<span class="punctuation">,</span>y<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_text<span 
class="punctuation">(</span>aes<span class="punctuation">(</span>label <span class="operator">=</span> al<span class="punctuation">)</span><span class="punctuation">,</span> <span class="comment"># 加文本图层,点的坐标映射字母</span></span><br><span class="line"> size <span class="operator">=</span> <span class="number">5</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> theme_void<span class="punctuation">(</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> ylim<span class="punctuation">(</span><span class="operator">-</span><span class="number">1</span><span class="punctuation">,</span><span class="number">6</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> xlim<span class="punctuation">(</span><span class="number">0</span><span class="punctuation">,</span><span class="number">7</span><span class="punctuation">)</span> </span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230926/11.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230926/11.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span 
class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">## 优化一下,做个炫彩字母表?大写字母在上,小写字母在下</span></span><br><span class="line">library<span class="punctuation">(</span>ggplot2<span class="punctuation">)</span></span><br><span class="line">x <span class="operator">=</span> <span class="built_in">rep</span><span class="punctuation">(</span><span class="number">1</span><span class="operator">:</span><span class="number">6</span><span class="punctuation">,</span><span class="number">5</span><span class="punctuation">)</span></span><br><span class="line">y <span class="operator">=</span> <span class="built_in">rep</span><span class="punctuation">(</span><span class="number">5</span><span class="operator">:</span><span class="number">1</span><span class="punctuation">,</span>each <span class="operator">=</span> <span class="number">6</span><span class="punctuation">)</span></span><br><span class="line">al <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="built_in">LETTERS</span><span class="punctuation">,</span><span class="built_in">rep</span><span class="punctuation">(</span><span class="literal">NA</span><span class="punctuation">,</span><span class="number">4</span><span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line">al1 <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="built_in">letters</span><span class="punctuation">,</span><span class="built_in">rep</span><span class="punctuation">(</span><span class="literal">NA</span><span class="punctuation">,</span><span class="number">4</span><span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line">mydata <span class="operator">=</span> data.frame<span class="punctuation">(</span>x<span class="punctuation">,</span>y<span class="punctuation">,</span>al<span 
class="punctuation">)</span></span><br><span class="line">mydata</span><br><span class="line">ggplot<span class="punctuation">(</span>mydata<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x<span class="punctuation">,</span>y<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_text<span class="punctuation">(</span>aes<span class="punctuation">(</span>label <span class="operator">=</span> al<span class="punctuation">,</span></span><br><span class="line"> col <span class="operator">=</span> al<span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> size <span class="operator">=</span><span class="number">5</span><span class="punctuation">,</span></span><br><span class="line"> show.legend <span class="operator">=</span> <span class="built_in">F</span><span class="punctuation">)</span> <span class="operator">+</span> </span><br><span class="line"> geom_text<span class="punctuation">(</span>aes<span class="punctuation">(</span>y <span class="operator">=</span> y <span class="operator">-</span> <span class="number">0.3</span><span class="punctuation">,</span></span><br><span class="line"> label <span class="operator">=</span> al1<span class="punctuation">,</span></span><br><span class="line"> col <span class="operator">=</span> al1<span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> show.legend <span class="operator">=</span> <span class="built_in">F</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> theme_void<span class="punctuation">(</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> ylim<span class="punctuation">(</span><span class="operator">-</span><span class="number">1</span><span class="punctuation">,</span><span 
class="number">6</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> xlim<span class="punctuation">(</span><span class="number">0</span><span class="punctuation">,</span><span class="number">7</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230926/12.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230926/12.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span 
class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">## 算了 自由发挥吧,图就不放了</span></span><br><span class="line"><span class="comment"># 练习 3个变量</span></span><br><span class="line">ggplot<span class="punctuation">(</span>iris<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x <span class="operator">=</span> Sepal.Length<span class="punctuation">,</span></span><br><span class="line"> y <span class="operator">=</span> Petal.Length<span class="punctuation">,</span></span><br><span class="line"> col <span class="operator">=</span> Species<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span> <span class="comment"># 增加种属变量</span></span><br><span class="line"> geom_point<span class="punctuation">(</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_smooth<span class="punctuation">(</span>method <span class="operator">=</span> lm<span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span 
class="comment"># 练习</span></span><br><span class="line">ggplot<span class="punctuation">(</span>mtcars<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x <span class="operator">=</span> factor<span class="punctuation">(</span>cyl<span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> fill <span class="operator">=</span> factor<span class="punctuation">(</span>carb<span class="punctuation">)</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_bar<span class="punctuation">(</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 练习</span></span><br><span class="line">ToothGrowth<span class="comment"># 另一个奇奇怪怪的R内置数据</span></span><br><span class="line">str<span class="punctuation">(</span>ToothGrowth<span class="punctuation">)</span> <span class="comment"># str()查看括号内数据类型信息</span></span><br><span class="line">ToothGrowth<span class="operator">$</span>dose <span class="operator">=</span> factor<span class="punctuation">(</span>ToothGrowth<span class="operator">$</span>dose<span class="punctuation">)</span></span><br><span class="line">a <span class="operator">=</span> tapply<span class="punctuation">(</span>ToothGrowth<span class="operator">$</span>len<span class="punctuation">,</span></span><br><span class="line"> ToothGrowth<span class="operator">$</span>supp<span class="operator">:</span>ToothGrowth<span class="operator">$</span>dose<span class="punctuation">,</span></span><br><span class="line"> mean<span class="punctuation">)</span></span><br><span class="line">a <span class="operator">=</span> data.frame<span class="punctuation">(</span>a<span class="punctuation">)</span></span><br><span class="line">supp <span class="operator">=</span> <span class="built_in">rep</span><span class="punctuation">(</span><span 
class="built_in">c</span><span class="punctuation">(</span><span class="string">"OJ"</span><span class="punctuation">,</span><span class="string">"VC"</span><span class="punctuation">)</span><span class="punctuation">,</span>each <span class="operator">=</span> <span class="number">3</span><span class="punctuation">)</span></span><br><span class="line">dose <span class="operator">=</span> <span class="built_in">rep</span><span class="punctuation">(</span><span class="built_in">c</span><span class="punctuation">(</span><span class="number">0.5</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">,</span><span class="number">2</span><span class="punctuation">)</span><span class="punctuation">,</span><span class="number">2</span><span class="punctuation">)</span></span><br><span class="line">b <span class="operator">=</span> data.frame<span class="punctuation">(</span>len <span class="operator">=</span> a<span class="operator">$</span>a<span class="punctuation">,</span>supp<span class="punctuation">,</span>dose<span class="punctuation">)</span> <span class="comment"># data.frame() keeps len numeric; cbind() would coerce every column to character</span></span><br><span class="line">b<span class="operator">$</span>dose <span class="operator">=</span> factor<span class="punctuation">(</span>b<span class="operator">$</span>dose<span class="punctuation">)</span> <span class="comment"># treat dose as a discrete fill variable</span></span><br><span class="line">ggplot<span class="punctuation">(</span>b<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x <span class="operator">=</span> supp<span class="punctuation">,</span></span><br><span class="line"> y <span class="operator">=</span> len<span class="punctuation">,</span></span><br><span class="line"> fill <span class="operator">=</span> dose<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span> </span><br><span class="line"> geom_col<span class="punctuation">(</span>position <span class="operator">=</span> position_dodge<span class="punctuation">(</span><span class="punctuation">)</span><span 
class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 练习</span></span><br><span class="line">ggplot<span class="punctuation">(</span>ToothGrowth<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x <span class="operator">=</span> supp<span class="punctuation">,</span></span><br><span class="line"> y <span class="operator">=</span> len<span class="punctuation">,</span></span><br><span class="line"> fill <span class="operator">=</span> dose<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_boxplot<span class="punctuation">(</span><span class="punctuation">)</span> <span class="operator">+</span> </span><br><span class="line"> geom_jitter<span class="punctuation">(</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 练习</span></span><br><span class="line">ggplot<span class="punctuation">(</span>iris<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x <span class="operator">=</span> Sepal.Length<span class="punctuation">,</span></span><br><span class="line"> y <span class="operator">=</span> Sepal.Width<span class="punctuation">,</span></span><br><span class="line"> size <span class="operator">=</span> Petal.Length<span class="punctuation">,</span></span><br><span class="line"> col <span class="operator">=</span> Petal.Length<span class="punctuation">,</span></span><br><span class="line"> alpha <span class="operator">=</span> Petal.Length<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span> <span class="comment"># 根据需要选第三个变量</span></span><br><span class="line"> geom_point<span class="punctuation">(</span><span class="punctuation">)</span></span><br><span class="line"></span><br><span class="line"><span 
class="comment"># 练习</span></span><br><span class="line">x <span class="operator">=</span> <span class="built_in">rep</span><span class="punctuation">(</span><span class="number">1</span><span class="operator">:</span><span class="number">6</span><span class="punctuation">,</span><span class="number">5</span><span class="punctuation">)</span></span><br><span class="line">y <span class="operator">=</span> <span class="built_in">rep</span><span class="punctuation">(</span><span class="number">5</span><span class="operator">:</span><span class="number">1</span><span class="punctuation">,</span>each <span class="operator">=</span> <span class="number">6</span><span class="punctuation">)</span></span><br><span class="line">al <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="built_in">LETTERS</span><span class="punctuation">,</span><span class="built_in">rep</span><span class="punctuation">(</span><span class="literal">NA</span><span class="punctuation">,</span><span class="number">4</span><span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line">al1 <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="built_in">letters</span><span class="punctuation">,</span><span class="built_in">rep</span><span class="punctuation">(</span><span class="literal">NA</span><span class="punctuation">,</span><span class="number">4</span><span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line">mydata <span class="operator">=</span> data.frame<span class="punctuation">(</span>x<span class="punctuation">,</span>y<span class="punctuation">,</span>al<span class="punctuation">)</span></span><br><span class="line">mydata</span><br><span class="line">ggplot<span class="punctuation">(</span>mydata<span class="punctuation">,</span></span><br><span class="line"> aes<span class="punctuation">(</span>x<span 
class="punctuation">,</span>y<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> geom_text<span class="punctuation">(</span>aes<span class="punctuation">(</span>label <span class="operator">=</span> al<span class="punctuation">,</span></span><br><span class="line"> col <span class="operator">=</span> al<span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> size <span class="operator">=</span><span class="number">5</span><span class="punctuation">,</span></span><br><span class="line"> show.legend <span class="operator">=</span> <span class="built_in">F</span><span class="punctuation">)</span> <span class="operator">+</span> </span><br><span class="line"> geom_text<span class="punctuation">(</span>aes<span class="punctuation">(</span>y <span class="operator">=</span> y <span class="operator">-</span> <span class="number">0.3</span><span class="punctuation">,</span></span><br><span class="line"> label <span class="operator">=</span> al1<span class="punctuation">,</span></span><br><span class="line"> col <span class="operator">=</span> al1<span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> show.legend <span class="operator">=</span> <span class="built_in">F</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> theme_void<span class="punctuation">(</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> ylim<span class="punctuation">(</span><span class="operator">-</span><span class="number">1</span><span class="punctuation">,</span><span class="number">6</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"> xlim<span class="punctuation">(</span><span class="number">0</span><span class="punctuation">,</span><span class="number">7</span><span 
class="punctuation">)</span></span><br></pre></td></tr></table></figure></div>]]></content>
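<summary type="html"><p>While tidying up my notes I came across the introductory R notes I made two years ago. I still remember that the winter of 2021 was my first contact with R, at an R lecture given at Tarim University by Professor Kong Qiusheng of Huazhong Agricultural University. Two years on, some of it is out of date, so I am cleaning it up as a backup and reviewing the material along the way ^_^</p></summary>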
<category term="编程自学" scheme="http://www.shelven.com/categories/%E7%BC%96%E7%A8%8B%E8%87%AA%E5%AD%A6/"/>
<category term="R语言" scheme="http://www.shelven.com/tags/R%E8%AF%AD%E8%A8%80/"/>
</entry>
<entry>
<title>Genome Annotation (7): Evaluating Functional Gene Annotation</title>
<link href="http://www.shelven.com/2023/09/25/a.html"/>
<id>http://www.shelven.com/2023/09/25/a.html</id>
<published>2023-09-25T13:46:23.000Z</published>
<updated>2023-09-25T13:52:41.000Z</updated>
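<content type="html"><![CDATA[<p>The previous post covered how to annotate functional genes quickly with eggNOG-mapper. That run produced many result files, the most important being the two <code>annotations</code> files. This post explains how to organize those result files and assess the quality of the annotation.</p><span id="more"></span><h2 id="1-基因功能注释评估"><a href="#1-基因功能注释评估" class="headerlink" title="1. 基因功能注释评估"></a>1. Evaluating gene function annotation</h2><div class="story post-story"><p>In practice this means working out how many genes in the result files were annotated in which databases, and presenting that in an intuitive way.</p><p>There are two ways to go. If you prefer to process files programmatically, work on <code>out.emapper.annotations</code> directly. If that does not come easily, work on <code>out.emapper.annotations.xlsx</code> instead, since it opens directly in Excel.</p><p>The <a href="https://www.shelven.com/2023/09/19/a.html">previous post</a> describes in detail what the 20 columns of the result file mean. After deleting the first two rows, which record the run parameters and the command, what remains is an ordinary Excel table.</p><p>Select the header row and turn on filtering. Column A (query) holds the names of the genes annotated by eggNOG; select it and paste it into a new sheet. In column J (GOs), filter out the "-" entries and copy the first column as the sequences annotated to GO. In column L (KEGG_ko), filter out the "-" entries and copy the first column as the sequences annotated to KEGG. In column U (PFAMs), filter out the "-" entries and copy the first column as the sequences annotated to PFAM. The result is a table of the following form (save it as <code>annotation.xlsx</code>):</p><p><img src="https://www.shelven.com/tuchuang/20230921/1.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230921/1.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Input of this kind can be loaded into an R package locally to draw a Venn diagram, or pasted into an online tool that generates the diagram directly, such as this site: <a href="https://bioinfogp.cnb.csic.es/tools/venny/index.html">Venny 2.1.0 (csic.es)</a></p><p><img src="https://www.shelven.com/tuchuang/20230921/2.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230921/2.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>It supports up to four sets; just paste the data into the four boxes on the left. <strong>Style</strong> at the top switches between coloring schemes. Clicking a number in the diagram lists the corresponding overlap in the <strong>Results</strong> box at the lower left, and right-clicking the image saves it directly. <del>This feature is not hard to implement; when I have time I may add it to my collection of local tools.</del></p><blockquote><p>There are many similar websites for drawing Venn diagrams online:</p><p><a href="https://jvenn.toulouse.inrae.fr/app/example.html">jvenn (inrae.fr)</a></p><p><a href="http://www.biovenn.nl/index.php">BioVenn - a web application for the comparison and visualization of biological lists using area-proportional Venn diagrams</a></p><p><a 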
href="http://www.pangloss.com/seidel/Protocols/venn.cgi">Venn Diagram generator (pangloss.com)</a></p></blockquote><p>Alternatively, do it yourself in R: use the classic VennDiagram package to draw a Venn diagram, or the UpSetR package to draw an UpSet plot:</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Read the xlsx file with readxl and draw the Venn diagram with VennDiagram</span></span><br><span class="line">library<span class="punctuation">(</span>VennDiagram<span class="punctuation">)</span></span><br><span class="line">library<span class="punctuation">(</span>readxl<span class="punctuation">)</span></span><br><span class="line">data <span class="operator">=</span> read_excel<span class="punctuation">(</span><span class="string">"annotation.xlsx"</span><span class="punctuation">,</span> sheet <span class="operator">=</span> <span class="number">1</span><span class="punctuation">)</span></span><br><span 
class="line">data <span class="operator">=</span> lapply<span class="punctuation">(</span>data<span class="punctuation">,</span> <span class="keyword">function</span><span class="punctuation">(</span>col<span class="punctuation">)</span> col<span class="punctuation">[</span><span class="operator">!</span><span class="built_in">is.na</span><span class="punctuation">(</span>col<span class="punctuation">)</span><span class="punctuation">]</span><span class="punctuation">)</span> <span class="comment"># drop the NA padding of the shorter columns so it is not counted as a set element</span></span><br><span class="line"></span><br><span class="line">plot <span class="operator">=</span> venn.diagram<span class="punctuation">(</span></span><br><span class="line"> x <span class="operator">=</span> <span class="built_in">list</span><span class="punctuation">(</span></span><br><span class="line"> eggNOG <span class="operator">=</span> data<span class="operator">$</span>eggNOG<span class="punctuation">,</span></span><br><span class="line"> GO <span class="operator">=</span> data<span class="operator">$</span>GO<span class="punctuation">,</span></span><br><span class="line"> KEGG <span class="operator">=</span> data<span class="operator">$</span>KEGG<span class="punctuation">,</span></span><br><span class="line"> PFAM <span class="operator">=</span> data<span class="operator">$</span>PFAM</span><br><span class="line"> <span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> disable.logging <span class="operator">=</span> <span class="literal">TRUE</span><span class="punctuation">,</span></span><br><span class="line"> filename <span class="operator">=</span> <span class="string">'Veen_annotation.png'</span><span class="punctuation">,</span></span><br><span class="line"> col <span class="operator">=</span> <span class="string">"black"</span><span class="punctuation">,</span></span><br><span class="line"> lwd <span class="operator">=</span> <span class="number">3</span><span class="punctuation">,</span></span><br><span class="line"> lty <span class="operator">=</span> <span class="string">"solid"</span><span class="punctuation">,</span></span><br><span class="line"> fill <span class="operator">=</span> <span 
class="built_in">c</span><span class="punctuation">(</span><span class="string">"cornflowerblue"</span><span class="punctuation">,</span> <span class="string">"green"</span><span class="punctuation">,</span> <span class="string">"yellow"</span><span class="punctuation">,</span> <span class="string">"darkorchid1"</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> alpha <span class="operator">=</span> <span class="number">0.50</span><span class="punctuation">,</span></span><br><span class="line"> label.col <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="string">"orange"</span><span class="punctuation">,</span> <span class="string">"white"</span><span class="punctuation">,</span> <span class="string">"darkorchid4"</span><span class="punctuation">,</span> <span class="string">"white"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="string">"white"</span><span class="punctuation">,</span> <span class="string">"white"</span><span class="punctuation">,</span> <span class="string">"white"</span><span class="punctuation">,</span> <span class="string">"white"</span><span class="punctuation">,</span> <span class="string">"darkblue"</span><span class="punctuation">,</span> <span class="string">"white"</span><span class="punctuation">,</span></span><br><span class="line"> <span class="string">"white"</span><span class="punctuation">,</span> <span class="string">"white"</span><span class="punctuation">,</span> <span class="string">"white"</span><span class="punctuation">,</span> <span class="string">"darkgreen"</span><span class="punctuation">,</span> <span class="string">"white"</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> cex <span class="operator">=</span> <span class="number">2.0</span><span class="punctuation">,</span></span><br><span class="line"> fontfamily <span 
class="operator">=</span> <span class="string">"serif"</span><span class="punctuation">,</span></span><br><span class="line"> fontface <span class="operator">=</span> <span class="string">"bold"</span><span class="punctuation">,</span></span><br><span class="line"> cat.col <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="string">"darkblue"</span><span class="punctuation">,</span> <span class="string">"darkgreen"</span><span class="punctuation">,</span> <span class="string">"orange"</span><span class="punctuation">,</span> <span class="string">"darkorchid4"</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> cat.cex <span class="operator">=</span> <span class="number">1.5</span><span class="punctuation">,</span></span><br><span class="line"> <span class="comment">#cat.pos = 0,</span></span><br><span class="line"> <span class="comment">#cat.dist = 0.07,</span></span><br><span class="line"> rotation.degree <span class="operator">=</span> <span class="number">0</span><span class="punctuation">,</span></span><br><span class="line"> cat.fontfamily <span class="operator">=</span> <span class="string">"serif"</span><span class="punctuation">,</span></span><br><span class="line"> cat.fontface <span class="operator">=</span> <span class="string">"bold"</span><span class="punctuation">,</span></span><br><span class="line"> margin <span class="operator">=</span> <span class="number">0.1</span></span><br><span class="line"><span class="punctuation">)</span></span><br></pre></td></tr></table></figure><img src="https://www.shelven.com/tuchuang/20230921/Veen_annotation.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230921/Veen_annotation.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" style="zoom: 67%;" /><p>VennDiagram包参数参考<a 
href="https://cran.r-project.org/web/packages/VennDiagram/VennDiagram.pdf">VennDiagram: Generate High-Resolution Venn and Euler Plots (r-project.org)</a></p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># UpSetR包做upset图</span></span><br><span class="line">library<span class="punctuation">(</span>UpSetR<span class="punctuation">)</span></span><br><span class="line">library<span class="punctuation">(</span>readxl<span class="punctuation">)</span></span><br><span class="line">data <span class="operator">=</span> read_excel<span class="punctuation">(</span><span class="string">"annotation.xlsx"</span><span class="punctuation">,</span> sheet <span class="operator">=</span> <span class="number">1</span><span class="punctuation">)</span></span><br><span class="line">data<span class="punctuation">[</span><span class="built_in">is.na</span><span class="punctuation">(</span>data<span class="punctuation">)</span><span class="punctuation">]</span> <span class="operator">=</span> <span class="string">""</span> <span class="comment"># 替换NA为空</span></span><br><span class="line">data <span class="operator">=</span> fromList<span class="punctuation">(</span>data<span class="punctuation">)</span> <span 
class="comment"># 转换成0和1的矩阵(UpSetR包对导入数据有要求,建议看github官网)</span></span><br><span class="line"></span><br><span class="line">upset<span class="punctuation">(</span></span><br><span class="line"> data<span class="punctuation">,</span></span><br><span class="line"> nsets <span class="operator">=</span> <span class="number">4</span><span class="punctuation">,</span></span><br><span class="line"> matrix.color <span class="operator">=</span> <span class="string">"gray23"</span><span class="punctuation">,</span></span><br><span class="line"> main.bar.color <span class="operator">=</span> <span class="string">"gray23"</span><span class="punctuation">,</span></span><br><span class="line"> mainbar.y.label <span class="operator">=</span> <span class="string">"交集基因数"</span><span class="punctuation">,</span></span><br><span class="line"> mainbar.y.max <span class="operator">=</span> <span class="number">7000</span><span class="punctuation">,</span></span><br><span class="line"> sets.bar.color <span class="operator">=</span> <span class="string">"gray23"</span><span class="punctuation">,</span></span><br><span class="line"> sets.x.label <span class="operator">=</span> <span class="string">"注释的基因数"</span><span class="punctuation">,</span></span><br><span class="line"> point.size <span class="operator">=</span> <span class="number">2.2</span><span class="punctuation">,</span> </span><br><span class="line"> line.size <span class="operator">=</span> <span class="number">0.7</span><span class="punctuation">,</span></span><br><span class="line"> mb.ratio <span class="operator">=</span> <span class="built_in">c</span><span class="punctuation">(</span><span class="number">0.7</span><span class="punctuation">,</span> <span class="number">0.3</span><span class="punctuation">)</span><span class="punctuation">,</span></span><br><span class="line"> show.numbers <span class="operator">=</span> <span class="string">"yes"</span><span class="punctuation">,</span></span><br><span class="line"> 
set_size.show <span class="operator">=</span> <span class="literal">FALSE</span></span><br><span class="line"><span class="punctuation">)</span></span><br></pre></td></tr></table></figure><p><img src="https://www.shelven.com/tuchuang/20230921/3.png" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230921/3.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>For UpSetR parameters, see <a href="https://cran.r-project.org/web/packages/UpSetR/UpSetR.pdf">UpSetR: A More Scalable Alternative to Venn and Euler Diagrams for Visualizing Intersecting Sets (r-project.org)</a></p><p>That was a bit of a digression. The two figures carry the same information (if the UpSet plot is hard to read, compare it with the Venn diagram: the numbers on the bars together with the dots and lines underneath show which sets each intersection covers). The point of a figure is to make the result obvious at a glance, not to show off. Personally I find the UpSet plot a little more intuitive. You can of course also just compile a table (many sequencing companies like to do this in their reports, though I don't find it very meaningful).</p><p>Summarizing the functional annotation results from the different databases, 20411 genes could be annotated against the functional databases, 92.04% of all predicted genes. The details are as follows:</p><table><thead><tr><th align="left">Type</th><th align="left">Number</th><th align="left">Percent(%)</th></tr></thead><tbody><tr><td align="left">eggNOG</td><td align="left">20411</td><td align="left">92.04</td></tr><tr><td align="left">GO</td><td align="left">10766</td><td align="left">48.55</td></tr><tr><td align="left">KEGG</td><td align="left">10245</td><td align="left">46.20</td></tr><tr><td align="left">Pfam</td><td align="left">18167</td><td align="left">81.92</td></tr><tr><td align="left">Overall</td><td align="left">22176</td><td align="left">100.00</td></tr></tbody></table><blockquote><p>Type: the database</p><p>Number: number of annotated genes</p><p>Percent: proportion of the predicted genes annotated by each database</p><p>Overall: total number of predicted genes</p></blockquote></div><h2 id="2-BUSCO评估"><a href="#2-BUSCO评估" class="headerlink" title="2. BUSCO评估"></a>2. 
BUSCO assessment</h2><div class="story post-story"><p>An earlier post covered how to run a BUSCO assessment: <a href="https://www.shelven.com/2023/03/01/a.html">Learning third-generation genome sequencing and assembly from scratch (7): assembly quality assessment (BUSCO, LAI index)</a>. That post assessed the completeness of the assembled genome. Once annotation is finished, the predicted gene set needs to be assessed as well; the difference is that here we use BUSCO's <strong>Protein mode</strong>.</p><p>I had already downloaded the BUSCO conserved gene set for the eudicots lineage, so I use it directly here:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta prompt_">#</span><span class="language-bash">!/bin/bash</span> </span><br><span class="line"><span class="meta prompt_">#</span><span class="language-bash">SBATCH -n 8</span></span><br><span class="line"></span><br><span class="line">database_path=/public/home/wlxie/biosoft/busco_soft/busco/test_data/eukaryota/busco_downloads/lineages/eudicots_odb10</span><br><span class="line">sequence_path=/public/home/wlxie/biosoft/braker3/Ap_mydb/Ap_rmTE.aa</span><br><span class="line"></span><br><span class="line">busco -i ${sequence_path} -l ${database_path} -o Ap -m proteins --cpu 8 --offline</span><br></pre></td></tr></table></figure><p>Process the result file <code>short_summary.specific.eudicots_odb10.Ap.txt</code> with the plotting script bundled with BUSCO:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">mkdir result</span><br><span class="line">cp Ap/short_summary.specific.eudicots_odb10.Ap.txt result</span><br><span class="line">python /public/home/wlxie/biosoft/busco/scripts/generate_plot.py -wd result</span><br></pre></td></tr></table></figure><p>The figure is ready in under 20 seconds:</p><img src="https://www.shelven.com/tuchuang/20230921/busco_figure.png" class="lazyload" 
data-srcset="https://www.shelven.com/tuchuang/20230921/busco_figure.png" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" style="zoom:80%;" /><p>You can also compile a table from the result file:</p><table><thead><tr><th>Category</th><th>Number</th><th>Percent(%)</th></tr></thead><tbody><tr><td>Complete BUSCOs(C)</td><td>2232</td><td>95.96</td></tr><tr><td>Fragmented BUSCOs(F)</td><td>6</td><td>0.26</td></tr><tr><td>Missing BUSCOs(M)</td><td>88</td><td>3.78</td></tr><tr><td>Total (eudicots_odb10)</td><td>2326</td><td>100.00</td></tr></tbody></table><p>The eudicots_odb10 lineage dataset contains 2326 BUSCO groups, of which 2232 (95.96%) are matched completely (1837 single-copy and 395 duplicated), so the completeness of this genome annotation is quite good.</p></div><h2 id="3-基因转录组表达"><a href="#3-基因转录组表达" class="headerlink" title="3. 基因转录组表达"></a>3. Transcriptome expression support for genes</h2><div class="story post-story"><p>Align short-read (second-generation) RNA-seq data to the reference genome assembled earlier and count the genes whose expression in the sample is above zero. Expression can be measured as FPKM or TPM, preferably TPM, against a fixed threshold; here I count genes with TPM>0.001. This is just the standard RNA-seq workflow.</p><p>Here is a simple script that goes from the raw RNA-seq reads through alignment to quantification. Submit the job from a newly created directory:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta prompt_">#</span><span class="language-bash">!/bin/bash</span></span><br><span class="line"><span class="meta prompt_">#</span><span class="language-bash">SBATCH -n 20</span></span><br><span class="line"></span><br><span 
class="line">genome_path=/public/home/wlxie/Genome/Ap.fasta</span><br><span class="line">gff3_path=/public/home/wlxie/biosoft/braker3/Ap_rmTE.gff3</span><br><span class="line">fq_path=/public/home/wlxie/Sequencing_data/BYT2022020901/Apocynum_pictum/</span><br><span class="line"><span class="meta prompt_"></span></span><br><span class="line"><span class="meta prompt_"># </span><span class="language-bash">hisat2: build the index and align reads to the reference genome</span></span><br><span class="line">hisat2-build ${genome_path} genome</span><br><span class="line">hisat2 -p 20 -x genome -S out.sam -1 ${fq_path}1.fq -2 ${fq_path}2.fq</span><br><span class="line"><span class="meta prompt_"></span></span><br><span class="line"><span class="meta prompt_"># </span><span class="language-bash">samtools: convert SAM to BAM and sort</span></span><br><span class="line">samtools sort -@ 20 -o out.bam out.sam</span><br><span class="line"><span class="meta prompt_"></span></span><br><span class="line"><span class="meta prompt_"># </span><span class="language-bash">stringtie: quantify expression</span></span><br><span class="line">stringtie -p 20 -e -G ${gff3_path} -o result.gtf out.bam</span><br><span class="line"><span class="meta prompt_"></span></span><br><span class="line"><span class="meta prompt_"># </span><span class="language-bash">remove intermediate files</span></span><br><span class="line">rm -rf genome.*</span><br><span class="line">rm -rf out.*</span><br></pre></td></tr></table></figure><p>The final output is <code>result.gtf</code>, which records the expression level of every transcript in detail:</p><p><img src="https://www.shelven.com/tuchuang/20230921/4.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230921/4.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>The line is too long to fit in the screenshot; a TPM value follows at the end. A short Python script processes the file and counts the genes with TPM above 0.001:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span 
class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line">count = <span class="number">0</span></span><br><span class="line">gen_count = <span class="number">0</span></span><br><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'result.gtf'</span>, <span class="string">'r'</span>) <span class="keyword">as</span> gtf:</span><br><span class="line"> lines = gtf.readlines()</span><br><span class="line"> <span class="keyword">for</span> line <span class="keyword">in</span> lines:</span><br><span class="line"> <span class="keyword">if</span> <span class="string">"TPM"</span> <span class="keyword">in</span> line:</span><br><span class="line"> gen_count += <span class="number">1</span></span><br><span class="line"> TPM = line.split(<span class="string">'\t'</span>)[<span class="number">8</span>].split(<span class="string">';'</span>)[<span class="number">4</span>].split(<span class="string">' '</span>)[<span class="number">2</span>]</span><br><span class="line"> TPM = <span class="built_in">float</span>(TPM.strip(<span class="string">'"'</span>))</span><br><span class="line"> <span class="keyword">if</span> TPM > <span class="number">0.001</span>:</span><br><span class="line"> count += <span class="number">1</span></span><br><span class="line"><span class="built_in">print</span>(<span class="string">f'基因总数:<span class="subst">{gen_count}</span> \n表达量大于0的基因数:<span class="subst">{count}</span>'</span>)</span><br><span class="line"><span class="built_in">print</span>(<span 
class="string">f'占比:<span class="subst">{count/gen_count*<span class="number">100</span>:<span class="number">.2</span>f}</span> %'</span>)</span><br><span class="line"></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">基因总数:22176 </span></span><br><span class="line"><span class="string">表达量大于0的基因数:19760</span></span><br><span class="line"><span class="string">占比:89.11 %</span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></table></figure><p>In other words, 19760 expressed genes are supported by the transcriptome data, 89.11% of all genes. Strictly speaking, the data used here predates functional annotation: the gff file is the one modified after removing TE sequences, so it is not very convincing as an evaluation of the functional annotation itself. Gene redundancy was also not removed here, so the computed expression values still differ from those of a proper transcriptome analysis.</p></div>]]></content>
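<summary type="html"><p>The previous post showed how to annotate gene functions quickly with eggNOG-mapper, which produced many result files, the most important being the two <code>annotations</code> files. This post covers how to organize those result files and assess the quality of the annotation.</p></summary>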
<category term="学习笔记" scheme="http://www.shelven.com/categories/%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/"/>
<category term="基因组三代测序分析" scheme="http://www.shelven.com/categories/%E5%9F%BA%E5%9B%A0%E7%BB%84%E4%B8%89%E4%BB%A3%E6%B5%8B%E5%BA%8F%E5%88%86%E6%9E%90/"/>
<category term="BUSCO" scheme="http://www.shelven.com/tags/BUSCO/"/>
<category term="韦恩图" scheme="http://www.shelven.com/tags/%E9%9F%A6%E6%81%A9%E5%9B%BE/"/>
<category term="UpSet图" scheme="http://www.shelven.com/tags/UpSet%E5%9B%BE/"/>
</entry>
<entry>
<title>Genome annotation (6): annotating gene function with the online eggNOG-mapper</title>
<link href="http://www.shelven.com/2023/09/19/a.html"/>
<id>http://www.shelven.com/2023/09/19/a.html</id>
<published>2023-09-19T14:04:47.000Z</published>
<updated>2023-09-20T07:55:59.000Z</updated>
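<content type="html"><![CDATA[<p>Continuing the genome annotation series: in the <a href="https://www.shelven.com/2023/04/11/a.html">fifth post</a> we obtained the protein-coding genes predicted by Braker3 and removed those that may have lost function through TE insertion. Having only the CDS and protein sequences of these genes is clearly not enough; we also need to know what biological functions these proteins actually perform. </p><span id="more"></span><h2 id="1-功能注释常用数据库"><a href="#1-功能注释常用数据库" class="headerlink" title="1. 功能注释常用数据库"></a>1. Common databases for functional annotation</h2><div class="story post-story"><p>Gene functional annotation always works on a similar principle: once we have the protein sequences, we align them against existing protein databases and summarize the hits. Below is a brief introduction to a few commonly used databases. For others, see this site's <a href="https://www.shelven.com/Bioinformatics/">site navigation: quick index of bioinformatics websites</a>, which collects the commonly used bioinformatics databases and tools I summarized at the start of the year (feel free to contact me with good tools to add).</p><h3 id="1-1-NR数据库"><a href="#1-1-NR数据库" class="headerlink" title="1.1 NR数据库"></a>1.1 NR database</h3><p>NR stands for Non-Redundant Protein Sequence Database. It is built and maintained by NCBI and can be found under the <strong>Data&Software</strong> entry on the NCBI home page at <a href="https://ftp.ncbi.nlm.nih.gov/blast/db/">FTP:BLAST Databases</a>. Anyone who has run local blast knows that hits against the NR/NT databases include species annotation and the corresponding amino acid sequences, so they can be used for taxonomic classification, contamination assessment and so on (if you have not, see my earlier post on <a href="https://www.shelven.com/2022/06/20/a.html?keyword=blast">contamination assessment</a>).</p><p>The usual annotation approach is the same as for contamination assessment: run <code>blastp</code> with <code>-evalue 1e-5</code> and <code>-max_target_seqs 1</code>, keeping only the best-scoring protein hit.</p><h3 id="1-2-KEGG数据库"><a href="#1-2-KEGG数据库" class="headerlink" title="1.2 KEGG数据库"></a>1.2 KEGG database</h3><p><a href="https://www.kegg.jp/">KEGG</a> stands for Kyoto Encyclopedia of Genes and Genomes, a database established in 1995 by the Bioinformatics Center of Kyoto University, Japan. It describes the complex biological pathways in organisms, and its rich pathway information helps us understand protein function systematically, covering complex functions such as metabolic pathways, genetic information processing and cellular processes.</p><p>The usual annotation approach is the same as above: align against the KEGG database and summarize the KEGG pathways assigned to the protein sequences.</p><h3 id="1-3-KOG数据库"><a href="#1-3-KOG数据库" class="headerlink" title="1.3 KOG数据库"></a>1.3 KOG database</h3><p><a href="https://www.hsls.pitt.edu/obrc/index.php?page=URL1144075392">KOG</a> stands for Eukaryotic Orthologous Groups of proteins. Its equally well-known counterpart is COG (Clusters of Orthologous Groups of proteins). Both are NCBI databases based on orthology; COG targets prokaryotes and KOG targets eukaryotes. A protein sequence is aligned and assigned to a cluster of orthologous proteins, which NCBI built by classifying the proteins encoded in complete genomes according to their phylogenetic relationships; the matched cluster indicates the protein's function.</p><p>The usual annotation approach is the same as above: align against the KOG database and summarize the KOG assignments of the protein sequences.</p><h3 id="1-4-GO数据库"><a href="#1-4-GO数据库" 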
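class="headerlink" title="1.4 GO数据库"></a>1.4 GO database</h3><p><a href="https://geneontology.org/">GO</a> stands for Gene Ontology. The database was established by the Gene Ontology Consortium and is a comprehensive resource that classifies and summarizes research results related to genes. That may sound abstract: in essence, the Gene Ontology describes biological function with a specific controlled vocabulary, unifying the annotation of gene function across species (removing species specificity). Genes are classified and annotated under three aspects: the <strong>biological process</strong> (Biological Process, BP) they take part in, their <strong>cellular component</strong> (Cellular Component, CC), and their <strong>molecular function</strong> (Molecular Function, MF). Many protein databases carry GO term annotations.</p><p>There are currently two main approaches to GO annotation:</p><ol><li>Sequence similarity search, i.e. the blast approach above; the target database can be Swissprot, and the annotations in its GO column are extracted.</li><li>Domain similarity search, commonly with <code>InterProScan</code>; the target database can be Pfam (a database of protein domain families), again extracting the GO annotations.</li></ol><h3 id="1-5-Swissprot数据库"><a href="#1-5-Swissprot数据库" class="headerlink" title="1.5 Swissprot数据库"></a>1.5 Swissprot database</h3><p>Swissprot<del> (not Swissport, the Swiss airport)</del> is part of the <a href="https://www.uniprot.org/">UniProt</a> database. It is an <strong>annotated, reviewed and strictly non-redundant</strong> protein sequence database that provides detailed annotation for each protein sequence, and it is very widely used for protein annotation. Today <strong>UniProtKB</strong> (UniProt Knowledgebase) consists mainly of two sections: <code>TrEMBL</code> (proteins analyzed and annotated computationally, without manual review) and <code>Swissprot</code>.</p><p>The usual annotation approach is the same as above: align against Swissprot and summarize the Swissprot entries assigned to the genome's protein sequences.</p><p>These are the databases we generally use for gene functional annotation. All of them can be searched with blast against the corresponding database, followed by a summary, for example a Venn diagram, of how many genes each database annotated.</p><img src="https://www.shelven.com/tuchuang/20230919/4.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230919/4.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" style="zoom:80%;" /><p>I say "generally" because, as algorithms and technology advance, new alignment tools (such as <code>DIAMOND</code>, <a href="https://github.com/bbuchfink/diamond">100 to 10000 times faster than blast</a>, or the hidden-Markov-model-based <code>HMMER 3</code>) and new annotation pipelines keep appearing. The classic methods still work, but their drawbacks are obvious: slow alignment, large databases to download, heavy resource use, and so on.</p></div><h2 id="2-eggNOG"><a href="#2-eggNOG" class="headerlink" title="2. eggNOG"></a>2. 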
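eggNOG</h2><div class="story post-story"><p>The <a href="http://eggnog6.embl.de/">eggNOG</a> (evolutionary gene genealogy Non-supervised Orthologous Groups) database grew out of NCBI's COG/KOG. As described above, those two are orthology databases, and they were the only two when eggNOG was proposed in 2008. At the time they were limited in data (only 312 bacterial, 26 archaeal and 35 eukaryotic genomes were available then) and slow to update, which mainly affected the comparative genomics research of the day.</p><p>Seeing the words <strong>Non-supervised</strong>, some readers will already get the idea. The authors used the Smith–Waterman algorithm (an optimal sequence similarity search algorithm as well known as blast, and more accurate) to build orthologous groups (Orthologous Group, OG), and annotated these groups automatically <strong>from gene description files, annotated functional categories and predicted protein domains</strong>. This unsupervised approach is not restricted by species or by the functions and relationships of known genes. Its flexibility, scalability and support for cross-species comparison help uncover new functions and relationships and provide functional annotation based on evolutionary relationships.</p><p>If you are interested, see the original 2008 paper:</p><p><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2238944/">eggNOG: automated construction and annotation of orthologous groups of genes - PMC (nih.gov)</a></p><p>After more than a decade of iterations, eggNOG has now released version 6.0 (6 January 2023). Overall, eggNOG 6.0 provides more than 17 million orthologous groups (OGs) covering 10756 bacteria, 457 archaea and 1322 eukaryotes, with annotations drawn from KEGG, GO, UniProtKB, BiGG, CAZy, CARD, PFAM and SMART. The eggNOG 6.0 website also adds new features for mining orthologs and analyzing gene function data, including phylogenetic profiles across species for multiple orthologous groups.</p><p><img src="https://www.shelven.com/tuchuang/20230919/5.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230919/5.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p></div><h2 id="3-eggNOG-mapper"><a href="#3-eggNOG-mapper" class="headerlink" title="3. eggNOG-mapper"></a>3. 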
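eggNOG-mapper</h2><div class="story post-story"><p>eggNOG-mapper is the tool released by eggNOG for batch functional annotation of genes. It was updated to v2 in 2021, with four main additions: de novo gene prediction starting from raw contigs, built-in pairwise orthology prediction, fast protein domain detection, and automatic generation of gff files.</p><h3 id="3-1-eggNOG-mapper原理"><a href="#3-1-eggNOG-mapper原理" class="headerlink" title="3.1 eggNOG-mapper原理"></a>3.1 How eggNOG-mapper works</h3><p>The whole workflow and its principle are summarized in one figure from the authors' paper:</p><p><img src="https://www.shelven.com/tuchuang/20230919/8.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230919/8.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><blockquote><p>A: Gene prediction stage. Proteins are predicted from the assembled contigs with Prodigal (or in blastx mode).</p><p>B: Search stage. The input protein sequences are aligned against the eggNOG database with HMMER, DIAMOND or MMseqs2 to produce seed orthologs.</p><p>C: Orthology inference stage. An orthology report is generated for the requested taxonomic scope.</p><p>D: Annotation stage. Tabular and gff reports are generated from the protein annotations and the protein domain annotations.</p></blockquote><h3 id="3-2-运行网页版eggNOG-mapper"><a href="#3-2-运行网页版eggNOG-mapper" class="headerlink" title="3.2 运行网页版eggNOG-mapper"></a>3.2 Running the web version of eggNOG-mapper</h3><p>eggNOG-mapper can be downloaded and run locally (mainly convenient for large batches such as metagenomes) or run online; the official web service is at <a href="http://eggnog-mapper.embl.de/">http://eggnog-mapper.embl.de</a></p><p>Since I don't work on metagenomes, there is no need to download the tool and databases (I will write that up when I need it); the online service is enough.</p><img src="https://www.shelven.com/tuchuang/20230919/6.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230919/6.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" style="zoom:80%;" /><p>As shown, the online eggNOG-mapper accepts five types of input:</p><blockquote><p>Proteins: protein sequences, up to 100,000 sequences.</p><p>CDS: CDS sequences, also up to 100,000; they are translated into protein sequences before the search.</p><p>Genomic: genome sequences, up to 1000 DNA sequences and 10 million nucleotides. Uploading genome sequences adds a protein-coding gene prediction step, using either <code>Prodigal</code> or <code>Blastx-like</code>.</p><p>Metagenomic: contig-level genome sequences. This is essentially the same as above, so the same limits apply, and protein-coding gene prediction is performed as well.</p><p>Seeds: seed orthologs produced by eggNOG-mapper, up to 100,000; mainly used for re-annotation.</p></blockquote><p>We already have protein sequences, so we upload them directly. Note that the upload must be a <code>.gz</code> compressed file, so a little preparation is needed:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span 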
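class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta prompt_"># </span><span class="language-bash">keep the original file and write a gzip-compressed copy</span></span><br><span class="line">gzip -c Ap_rmTE.aa > Ap_pep.gz</span><br></pre></td></tr></table></figure><p>Upload the gz file and enter your email address. Further down you can set the search parameters and the annotation parameters; the defaults are fine. A few notes on the annotation parameters:</p><img src="https://www.shelven.com/tuchuang/20230919/7.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230919/7.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" style="zoom:80%;" /><blockquote><p>Taxonomic Scope: the taxonomic scope, which can be chosen manually. The default is officially recommended, but you can pick one yourself; when annotating plants, for example, you could choose Viridiplantae (green plants). This parameter sets the scope of the orthology inference stage; by default it is adjusted automatically for each sequence.</p><p>Orthology restrictions: choose "from any ortholog" or "from one to one ortholog". This also affects the result of the orthology inference stage; the latter transfers annotations only from one-to-one orthologs.</p><p>Gene Ontology evidence: two options. 1. experimental: only terms backed by experiments (subcellular localization, immunofluorescence and the like). 2. non-electronic: electronic evidence is the furthest from experimental evidence, and this option uses every term except those supported only by electronic evidence.</p><p>PFAM refinement: options for refining the protein domain annotation; you can report it directly or realign the sequences first, and so on.</p><p>SMART annotation: as the name suggests, this step can additionally predict protein domains with SMART; it is skipped by default.</p></blockquote><p>My explanation of GO evidence may not be very clear; the GO website documents the evidence codes and their categories:</p><p><a href="https://geneontology.org/docs/guide-go-evidence-codes/">Guide to GO evidence codes (geneontology.org)</a></p><p>Once all parameters are set, click <strong>submit</strong>; the page will remind you to check your email. Open the email, click the first option, <strong>Click to manage your job</strong>, then click <strong>Start job</strong> to start the run.</p><p><img src="https://www.shelven.com/tuchuang/20230919/3.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230919/3.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><h3 id="3-3-eggNOG-mapper结果文件解读"><a href="#3-3-eggNOG-mapper结果文件解读" class="headerlink" title="3.3 eggNOG-mapper结果文件解读"></a>3.3 Interpreting the eggNOG-mapper result files</h3><p>The online run is really fast. My roughly 20000 protein sequences were done by the time I came back from a meal. No databases to download and no environment to configure: in a word, great!</p><p><img 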
src="https://www.shelven.com/tuchuang/20230919/1.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230919/1.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>When the status in the upper right shows <strong>Done</strong>, the run has finished. Go back to the previous page and click <strong>Access your job files here</strong>; this link is where the result files are stored. The storage is temporary, so download the files soon.</p><p><img src="https://www.shelven.com/tuchuang/20230919/2.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230919/2.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><p>Check that the log has no errors. We mainly use the two annotations files: one is for extracting content yourself for enrichment analysis (tab-separated; how to extract it and run GO and KEGG enrichment is a topic for next time, just a matter of writing a script), and the other, with the <code>.xlsx</code> suffix, is convenient to open and edit in Excel.</p><p>Here is what each column of the result file means:</p><p><img src="https://www.shelven.com/tuchuang/20230919/9.jpg" class="lazyload" data-srcset="https://www.shelven.com/tuchuang/20230919/9.jpg" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="></p><ul><li><ol><li>query: name of the protein sequence</li></ol></li><li><ol start="2"><li>seed_ortholog: ID of the seed ortholog matched in the search stage</li></ol></li><li><ol start="3"><li>evalue: E-value of the match; the smaller, the more reliable</li></ol></li><li><ol start="4"><li>score: alignment score; the larger, the more reliable</li></ol></li><li><ol start="5"><li>eggNOG_OGs: comma-separated list of the orthologous groups (OGs) identified for the sequence, ordered by taxonomic depth. Each OG is shown as OG@tax_id|tax_name.</li></ol></li><li><ol start="6"><li>max_annot_lvl: the broadest orthologous group level used to retrieve annotations, shown as tax_id|tax_name.</li></ol></li><li><ol start="7"><li>COG_category: COG functional category (a single letter) inferred from the best OG; for the full list see NCBI's description at <a href="https://www.ncbi.nlm.nih.gov/research/cog#">COG - NCBI (nih.gov)</a>.</li></ol></li><li><ol start="8"><li>Description: functional description of the gene (eggNOG-mapper's descriptions are really short...)</li></ol></li><li><ol start="9"><li>Preferred_name: commonly used abbreviation of the gene name</li></ol></li><li><ol start="10"><li>GOs: annotated GO terms; one gene may correspond to a great many GO terms</li></ol></li><li><ol start="11"><li>EC: EC (Enzyme Commission) number, identifying the associated enzyme</li></ol></li><li><ol start="12"><li>KEGG_ko: KEGG KO identifier, denoting an ortholog group that represents one specific function</li></ol></li><li><ol 
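start="13"><li>KEGG_Pathway: usually given with both ko and map identifiers; map identifiers denote reference pathways, a type of metabolic pathway map that is fairly specific and useful as a general reference</li></ol></li><li><ol start="14"><li>KEGG_Module: KEGG Module identifier, starting with M; in effect several KOs grouped into one unit that functions together</li></ol></li><li><ol start="15"><li>KEGG_Reaction: KEGG Reaction identifier, starting with R; holds information on the enzymatic reactions in metabolic pathways</li></ol></li><li><ol start="16"><li>KEGG_rclass: KEGG RCLASS identifier, starting with RC; a manually curated collection of reaction data</li></ol></li><li><ol start="17"><li>BRITE: KEGG BRITE identifier; less commonly used, a database that stores classification information</li></ol></li><li><ol start="18"><li>CAZy: a specialized database of carbohydrate-active enzymes; see the official site <a href="http://www.cazy.org/">CAZy - Home</a></li></ol></li><li><ol start="19"><li>BiGG_Reaction: BiGG is a database that integrates genome-scale metabolic network models; official site <a href="http://bigg.ucsd.edu/">BiGG Models (ucsd.edu)</a></li></ol></li><li><ol start="20"><li>PFAMs: the PFAM database groups proteins into families using multiple sequence alignments and hidden Markov models; official site <a href="https://www.ebi.ac.uk/interpro/">InterPro (ebi.ac.uk)</a>. Yes, the official site is InterPro: Pfam has been merged into InterPro, and the original website stopped working after January 2023.</li></ol></li></ul><p>This much data looks daunting at first glance. It can be tidied up with a script for GO and KEGG enrichment analysis, which I will cover next time.</p></div>]]></content>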
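<summary type="html"><p>Continuing the genome annotation series: in the <a href="https://www.shelven.com/2023/04/11/a.html">fifth post</a> we obtained the protein-coding genes predicted by Braker3 and removed those that may have lost function through TE insertion. Having only the CDS and protein sequences of these genes is clearly not enough; we also need to know what biological functions these proteins actually perform. </p></summary>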
<category term="学习笔记" scheme="http://www.shelven.com/categories/%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/"/>
<category term="基因组三代测序分析" scheme="http://www.shelven.com/categories/%E5%9F%BA%E5%9B%A0%E7%BB%84%E4%B8%89%E4%BB%A3%E6%B5%8B%E5%BA%8F%E5%88%86%E6%9E%90/"/>
<category term="eggNOG mapper" scheme="http://www.shelven.com/tags/eggNOG-mapper/"/>
</entry>
</feed>