tubelex-ja.out
Cleaning stats:
* video IDs in the list:
120000
* files:
119909 total
6203 too short (<3 lines)
3931 not enough ja characters (<0.7 characters of the corresponding charset)
8111 not enough detected ja (<0.95)
101664 valid files after cleaning
* sequences removed from valid files:
29428 tags
1714 addresses
* lines in valid files:
18417195 total lines
116 whitespace-only lines
37504 lines composed of non-ja characters
18379575 valid lines after cleaning
* VTT cue cleanup:
14015410 total cues
84962 empty cues
31492 repeated cues
18242 scrolling cues
Duplicate (similarity >= 0.95) stats:
101664 total
910 duplicates removed
100754 valid files
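
The file, line, and duplicate counts above are internally consistent, which can be checked with a short script. This is only a sketch: the numbers are copied from the log, the variable names are made up for illustration, and reading the gap between listed video IDs and files as "IDs with no file" is an assumption, not something the log states.

```python
# Counts copied from the cleaning and duplicate stats above.
# All variable names are hypothetical labels chosen for this check.
ids_in_list = 120_000
files_total = 119_909
too_short = 6_203
few_ja_chars = 3_931
low_ja_detected = 8_111
valid_after_cleaning = 101_664

lines_total = 18_417_195
whitespace_only = 116
non_ja_lines = 37_504
valid_lines = 18_379_575

duplicates_removed = 910
valid_after_dedup = 100_754

# Each "valid" figure should equal the total minus the rejected categories.
assert files_total - too_short - few_ja_chars - low_ja_detected == valid_after_cleaning
assert lines_total - whitespace_only - non_ja_lines == valid_lines
assert valid_after_cleaning - duplicates_removed == valid_after_dedup

# Assumption: the difference is video IDs for which no file was present.
ids_without_file = ids_in_list - files_total

retention = valid_after_dedup / ids_in_list
print(f"IDs without a file (assumed meaning): {ids_without_file}")
print(f"Share of listed IDs kept after cleaning and deduplication: {retention:.1%}")
```

The three rejection categories for files appear to be mutually exclusive, since they sum exactly to the number of files dropped.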
[unidic-lite]
Frequency counting stats:
253377 CC descriptions filtered
163439781 tokens counted
[unidic-310]
Frequency counting stats:
253377 CC descriptions filtered
163234151 tokens counted
[unidic-lite] --form base --pos
Frequency counting stats:
253377 CC descriptions filtered
163439781 tokens counted
[unidic-lite] --form lemma --pos
Frequency counting stats:
253377 CC descriptions filtered
163462537 tokens counted
[unidic-310] --form base --pos
Frequency counting stats:
253377 CC descriptions filtered
163233435 tokens counted
[unidic-310] --form lemma --pos
Frequency counting stats:
253377 CC descriptions filtered
163312632 tokens counted
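
The token totals differ slightly between dictionary and option combinations (by well under 1%). A minimal sketch for comparing them follows; it parses text in the shape of the sections above. The parser, its function name, and the choice of `[unidic-lite]` as the baseline are assumptions made for illustration, not part of the tool that produced this log.

```python
import re

# Section headers and token counts copied from the log above.
LOG = """\
[unidic-lite]
163439781 tokens counted
[unidic-310]
163234151 tokens counted
[unidic-lite] --form base --pos
163439781 tokens counted
[unidic-lite] --form lemma --pos
163462537 tokens counted
[unidic-310] --form base --pos
163233435 tokens counted
[unidic-310] --form lemma --pos
163312632 tokens counted
"""


def parse_token_counts(text):
    """Map each '[dictionary] options' header to its 'tokens counted' value."""
    counts = {}
    config = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            config = line  # start of a new configuration section
            continue
        match = re.fullmatch(r"(\d+) tokens counted", line)
        if match and config is not None:
            counts[config] = int(match.group(1))
    return counts


counts = parse_token_counts(LOG)
baseline = counts["[unidic-lite]"]  # hypothetical choice of reference run

# Print each configuration with its difference from the baseline.
for config, tokens in counts.items():
    print(f"{config:<34} {tokens:>12,d} {tokens - baseline:+,d}")
```

In this log, `[unidic-lite]` with and without `--form base --pos` gives the same total, while the `unidic-310` runs count somewhat fewer tokens.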