foliautils 0.22 27-05-2024
[Ko van der Sloot]
* requires libfolia 2.19 or higher
* fixes github actions on MacOSX
* small code updates and improvements
foliautils 0.21 26-04-2024
[Ko van der Sloot]
* a lot of code changes, many regarding hyphens
* added an extract_final_hyphen function, used by several programs:
FoLiA-abby, FoLiA-page and FoLiA-txt
* FoLiA-txt: filter out ZWNJ characters. Avoid spurious LineBreaks
* fix for https://github.com/LanguageMachines/foliautils/issues/68
* FoLiA-idf was not quite working. Fixed
* FoLiA-page: removed --sent option. Updated man page
lots of other fixes too
* added experimental FoLiA-merge program:
merging lemma/pos information into FoLiA files
* more and better tests added
[Maarten van Gompel]
* updates in README.MD
foliautils 0.20 12-03-2023
[Ko van der Sloot]
* Fix in FoLiA-txt. A <t-hbr> signals a newline, so adding an extra <br/> is
not correct
foliautils 0.19 21-02-2023
[Ko van der Sloot]
* general C++ cleanup and refactoring
* Some fixes for building on Mac OSX
* FoLiA-txt:
- now we handle soft-hyphens
- --remove-end-hyphens is the default now. We create <t-hbr> nodes
- modifications for https://github.com/proycon/foliapy/issues/25
- Unicode awareness
* FoLiA-2text:
- added a --restore-formatting option, which outputs the text inside
<t-hspace> and <t-hbr> nodes
* FoLiA-abby:
- handling of soft-hyphens
- fixes for <br/> and <t-hbr>
- preserve original spaces in <t-hspace>'s text
* FoLiA-correct: small fix in program logic.
foliautils 0.18 22-07-2022
[Ko van der Sloot]
* FoLiA-page: only add LineBreak annotation when needed
* added more tests to make check
* adapted and fixed tests
* fixed the ugly problem of temporarily disabling text checking.
* start using the "system" foliadiff
* fix declarations
[Maarten van Gompel]
* FoLiA-page: added a --nomarkup parameter to revert to the old behaviour, and an extra --nostrings parameter to omit the strings #65
* added a note for the --sent option #65
* Added some comments for the ugly disable set_checktext patch, I don't like this but it seems needed (underlying libfolia issue?) #65
* Add linebreaks and t-str to the paragraph text (currently fails text validation)
* added Dockerfile and instructions
* codemeta.json: updated according to (proposed) CLARIAH requirements (CLARIAH/clariah-plus#38)
foliautils 0.17 12-07-2021
[Ko van der Sloot]
* needs libfolia 2.9 or above
* replaced TravisCI by GitHub actions
* FoLiA-correct:
- fixed a problem with correcting FoLiA with both p and s nodes
- added support for the FoLiA 'tag' feature
- clearer error messages
- fixed bugs in HEMP handling
- better handling of Ucto's ABBREVIATION* tokens
- fixed corrections when a word has 'space="no"'.
- some smaller fixes
- added more tests
* FoLiA-clean:
- improved, using new features from libfolia 2.9
* FoLiA-2text:
- replaced '--original' parameter by a '--correction-handling' parameter
- implemented a --honour-tags option, to interpret tag="token" tags
- some improvement in output-file naming
* FoLiA-abby:
- completely reworked the code
- added '-S' and '-C' as alternatives for '--setname' and '--classname'
- added a --keephyphens option
- added a --addbreaks option
- added option --addmetrics to optionally add positional info to the
paragraphs
- improved handling of '-' (Hyphen)
- add 'font_properties', 'font_id' and 'font_style' as a feature node
- improved handling of text with spaces at 'unexpected' locations
* all modules:
- Code refactoring and cleaning
- added and improved tests
- adapted man pages
foliautils 0.16 07-01-2021
[Ko vd Sloot]
* requires libfolia 2.7 or above
* provenance data is better for a lot of modules
* added better checking on invalid NCnames in some modules.
* FoLiA-abby:
- a lot of refactoring and additions to handle font/style information
* FoLiA-pm:
- Notes are handled correctly now
- fixed error in xlink attributes
* FoLiA-page:
- more types of Page files are handled now
- fixed annotation declarations
- fixed offset calculation (due to change in FoLiA's opinion on those)
- page number is added as a <br> node and in the metadata
- added a --trusttokens option. This means that Word items in the Page file
are added as Words in the FoLiA, embedded in Sentences.
- added a --norefs option to avoid adding references to the original texts
* FoLiA-correct:
- make sure that the default is to run on 1 thread
- added a --rebase-inputclass option
* FoLiA-alto:
- the -t option was not always handled correctly
[Maarten van Gompel]
* FoLiA-benchmark: guard against compiler optimisation #48
foliautils 0.15 15-09-2020
[Maarten van Gompel]
* FoLiA-txt: check if a string is empty after normalisation (fix for #46)
[Ko vd Sloot]
* FoLiA-correct: fix off-by-one error in hemp handling (when no hemp was found) #45
* some refactoring
- centralized definition of XML_PARSER_OPTIONS
* bugfix in threading
foliautils 0.14 15-04-2020
[Martin Reynaert]
* updated man pages
[Ko vd Sloot]
* added man pages
* revised usage() in many modules
* the default separator in FoLiA-stats is '_' now
* fix for: https://github.com/LanguageMachines/foliautils/issues/37
* fix for: https://github.com/LanguageMachines/foliautils/issues/41
* adapted to changes in libfolia
* many small code refactorings
* FoLiA-correct is improved a lot, allowing ngram corrections in FoLiA
* FoLiA-stats accepts a 'word_in_doc' mode now
* FoLiA-alto by default creates <w> nodes now. Use --oldstring to get <str>
* improved a lot in tests/
* many small fixes
foliautils 0.13 30-08-2019
[Ko vd Sloot]
bug fixes:
* fix for https://github.com/LanguageMachines/foliautils/issues/35
* fix for https://github.com/LanguageMachines/foliautils/issues/36
[Maarten van Gompel]
new features:
* FoLiA-wordtranslate.cxx: frog should be able to deal with
spaces now, no need for ugly non-breaking space hack that now causes
other problems in frog (LanguageMachines/frog#77)
* FoLiA-wordtranslate.cxx: adding ability to constrain by language
foliautils 0.12 29-05-2019
[Ko vd Sloot]
Release for FoLiA 2.0.
foliautils 0.11 15-05-2019
[Ko vd Sloot]
Stabilizing release for pre FoLiA 2.0
* Updated and added some tests
* started moving common code to a separate file and build a library
(libfoliautils)
- hemp detection is one of them
* FoLiA-stats:
- added the possibility to read a list of directories + file-names to process
into separate output directories. (could be generalized to other programs)
- better hemp detection
* FoLiA-correct:
- use the same hemp detection as FoLiA-stats
* FoLiA-abby:
- support more flavors
* FoLiA-clean:
- avoid removing the last remaining text on <w> nodes
- cleaning of tokenization now works
foliautils 0.10 29-11-2018
[Ko vd Sloot]
* fixed icu:namespace issues
* added FoLiA-abby, an ABBYY to FoLiA convertor
* src/FoLiA-abby.cxx, src/FoLiA-page.cxx, src/FoLiA-pm.cxx:
- Allow 'none' value for --prefix
* src/FoLiA-page.cxx, src/FoLiA-hocr.cxx: fixed Alignment info
* src/FoLiA-correct.cxx:
- fixed a problem with correction of the last word of a trigram.
- fix correction of paragraphs with only deeper text
- The --rank option accepts more flavors of files
* src/FoLiA-stats.cxx:
- added a --detokenize option
* several minor fixes, refactorings etc.
* updated tests
foliautils 0.9.2 05-06-2018
[Ko vd Sloot]
Bug fix release:
* prepend small prefixes to output filenames, to ALWAYS avoid names starting with
a numeric value.
'FPM-' for FoLiA-pm, 'FP-' for FoLiA-page, 'FH-' for FoLiA-hocr.
Can be set with --prefix
* FoLiA-stats.cxx:
- added --collect to usage() and 'man' page
* FoLiA-correct:
- added --inputclass and --outputclass parameters (must be different)
- Don't crash on empty text.
foliautils 0.9.1 17-05-2018
[Ko vd Sloot]
Bug fix release: the tests directory wasn't included in the distribution
foliautils 0.9 16-05-2018
[Ko vd Sloot]
* FoLiA-stats.cxx:
- added a --collect option, to create files with all n-grams
together
- clearer message in FoLiA-stats when no results were found
- extract text from deeper nodes, if needed
- fixed out-of-bounds problem
- now fails when every inputfile fails
* FoLiA-txt:
- now fails when every inputfile fails
* avoid xml:id's starting with a number. Add "id-" in front.
* added more tests
[Maarten van Gompel]
* added codemeta.json
foliautils 0.8 19-02-2018
[Ko vd Sloot]
* added -R option to FoLiA-collect
* FoLiA-collect now can work in parallel (-t option)
* modernized configuration, with better Mac OSX support (including OpenMP)
* all modules end with an exit code now.
* added more tests to 'make check'
* added output of Type-Token Ratios (also in degrees)
* several bugfixes.
* code cleanup and refactoring, some speedup too
foliautils 0.7 24-10-2017
[Ko vd Sloot]
* updated and expanded tests
* fixed offset calculations in FoLiA-hocr, FoLiA-page.cxx
and FoLiA-alto. We use Unicode code points now. (needed for folia v1.5 and above)
* Changed 'modes' in FoLiA-stats, to be a bit more comprehensible
* fixed problem with metadatatype when 'foreign-data' is present
* enhanced FoLiA-clean. Still not done...
* switched to dynamic OMP scheduling in most programs.
(which process files with probably big differences in processing time)
* small bugfixes.
* general cleanup and refactoring
[Maarten van Gompel]
* Added and improved FoLiA-wordtranslate.cxx
foliautils 0.6 04-04-2017
This is an intermediate release!!
Work on some tools is developing rapidly. The next releases won't take long.
For now, backward compatibility is still maintained mostly.
[Ko van der Sloot]
* uses libfolia 1.7 now!
* FoLiA-correct now uses another output file naming scheme (breaks backward compatibility)
* FoLiA-langcat now has a --tags parameter to select which <t> nodes are searched
* FoLiA-stats:
- a new --separator option is added
- added a --max-ngram option.
- added a --languages option for multiple languages
- now we have a --aggregate option for multiple language statistics
- fixed a bug in total counts
* added a first version of FoLiA-clean program. Cleans up tests/tags in FoLiA files.
* FoLiA-correct:
- output statistics
- verbosity option improved
* added and improved a lot of tests
foliautils 0.5 17-01-2017
[Ko van der Sloot]
* based on libfolia 1.5 or higher
* use recent ucto with textcat support
* use ISO 639-3 language names
* lots of code refactoring
* improved tests
* bug fixes in FoLiA-correct unigram correction
* extended and improved FoLiA-pm a lot
* changed default values for '--lang' and '--class' in FoLiA-stats (issue #3)
* FoLiA-alto can now work without a Didl too (issue #2)
* numerous additions...
foliautils 0.4 24-05-2016
[Ko van der Sloot]
* added FoLiA-pm, a convertor from Political Mashup format to FoLiA
needs libfolia v1.2 for new ForeignData meta nodes and extended references
foliautils 0.3 2016-03-1
[Ko van der Sloot]
* renamed from foliatools to foliautils
* now based on libfolia v1.0
foliatools 0.2 2016-01-14
[Ko van der Sloot]
* repository moved to GitHub
* added Travis support
* fixed OpenMP problems
* made the code 'distributable': overhauling the beta-stuff
adding new programs
0.1 [Ko vd Sloot]
first attempt