0.14 2018-11-29
[Ko van der Sloot]
* updated usage() and removed -S option (never used)
* make sure the right textclass is assigned to <w> nodes in FoLiA
* minor code fixes/refactorings
* added more tests
* updated man.1 page
[Maarten van Gompel]
* updated README.md
[Iris Hendrickx]
* Updated and extended the manual
0.13.2 2018-05-17
[Ko van der Sloot]
Bug fix release:
* uctodata is mandatory. So don't install default rules anymore
0.13.1 2018-05-17
[Ko van der Sloot]
Bug fix release:
* configure now finds out the location of the uctodata files.
This should make it work on Mac systems too
0.13 2018-05-16
[Ko van der Sloot]
* improved configure/build/test
* added a --split option
* fixed -P option
* removed -S option (never used, and only half implemented)
* added a --add-tokens option, to add special tokens for the default language
* generally use the icu:: namespace
* added more tests
* fixed uninitialized variable.
* added code to use an alternative search-path for uctodata
[Maarten van Gompel]
* added codemeta.json
0.12 2018-02-19
[Ko van der Sloot]
* now use the UniFilter Unicode Filter from ticcutils
* now use the UnicodeNormalizer from ticcutils
* improved configuration. Support for Mac OSX added
0.11 2017-12-04
[Ko van der Sloot]
Bug fix release:
* problems with text inside Cell elements
0.10 2017-11-07
[Ko van der Sloot]
New release due to outdated files in the previous release.
0.9.9 2017-11-06
[Ko van der Sloot]
Minor fix:
* bumped the .so version to 3.0.0
0.9.8 2017-10-23
[Ko van der Sloot]
Bug-fix release
* fixed utterance handling in FoLiA input. Don't try sentence detection!
0.9.7 2017-10-17
[Ko van der Sloot]
* added textredundancy option, default is 'minimal'
* small adaptations to work with FoLiA 1.5 specs
- set textclass on words when outputclass != inputclass
- DON'T filter special characters when inputclass == outputclass
* -F (folia input) is automatically set for .xml files
* more robust against texts with embedded tabs, etc.
* more and better tests added
* better logging and error messaging
* improved language handling. TODO: Language detection in FoLiA
* bug fixes:
- correctly handle xml-comment inside a <t>
- better id generation when parent has no id
- better reaction on overly long 'words'
0.9.6 2017-01-23
[Maarten van Gompel]
* Moving data files from etc/ to share/, as they are more data files than
configuration files that should be edited.
* Requires uctodata >= 0.4.
* Should solve debian packaging issues (#18)
* Minor updates to the manual (#2)
* Some refactoring/code cleanup, temper expectations regarding ucto's
date-tagging abilities (#16, thanks also to @sanmai-NL)
0.9.5 2017-01-06
[Ko van der Sloot]
Bug fix release:
* updated tokconfig-generic, which is removed from the uctodata package
* configure no longer insists on the presence of uctodata, it merely warns
when missing
0.9.4 2017-01-05
[Ko van der Sloot]
Major update
* Language support
- added support for multiple languages
- auto detection of languages using textcat
* some refactoring
- no more call to exit()
- Better logging and Warning messages
- some folia output improvements
* bug fixes
- in passthru,
- issue #11
0.9.3 2016-09-28
[Ko van der Sloot]
Major update:
- require ICU 5.2
- implemented recursive application of rules. (which may be dangerous)
- modified tests, because not all failures were detected correctly
- check the uctodata version. version > 0.2 is preferred.
0.9.1 2016-07-12
[Ko van der Sloot]
Bug fix release:
- fixed autoconfig issue
0.9.0 2016-07-11
[Ko van der Sloot]
Major update
- now use uctodata for language specific information
ucto itself only supports a generic tokenizer
- interactive use now uses readline library
- accept long options --help and --version
- UTF16BE now works
- better support for crooked Windows files in general
- added a --normalize option to map tokens in a certain TokenClass
to its generic name
0.8.6 2016-04-25
[Ko van der Sloot]
* Bug fix release: fixing Sentence boundaries after abbreviations
0.8.5 2016-04-25
[Ko van der Sloot]
* Bug fix release: Better handling of regexps
0.8.4 2016-03-10
[Ko van der Sloot]
* implemented on top of libfolia 1.0
0.8.1 2016-01-14
[Ko van der Sloot]
* repository moved to GIT
* added Travis support
* more tests added
* added META-RULES code
* %include now supports full paths
* updated some languages
* fixed passthru mode
* code cleanup
0.8.0 2015-01-29
[Ko van der Sloot]
* next release
[Maarten van Gompel]
* added new tokenize(string,string) meta-function for the API
* allatonce enabled by default for tokenize() to folia doc
* fixing date rules and adding FRACNUMBER
* added Russian
* added rules for Portuguese tokenization
[Antal vd Bosch]
* added RK to dutch abbrev.
0.7.0 2014-11-26
[Ko van der Sloot]
* unofficial release
* experimental PUNCTUATION filter
* bug fixes
[Maarten van Gompel]
* reduced memory usage
0.6.0 2014-09-23
[Ko van der Sloot]
* release
0.5.5 2014-06-xx
* made getSentence() public
* adapted to most recent libfolia (0.11 or above)
* needs libticcutils 0.6 or above
* uses TiCC::CommandLine
* detect EMOTICON's
* generally switched to UChar32 and Unicode codepoints. (avoid length() problems)
* handle FoLiA Note like Caption
* a lot of bug fixes concerning FoLiA output (<t> nodes, textclass values etc.)
* again some changes around quotes
* improved tokenisation in different languages
* added Swedish
0.5.3 2013-04-04
[Folgert Karsdorp]
* Fixed quote detection, added tests. still shaky and default disabled
[Ko van der Sloot]
* changed verbose output slightly
* fixed id's in folia output
* various folia fixes
* honour BOM markers in input file
* lots of configuration updates
* some fixes in handling of RULES
0.5.2 2012-03-29
[Ko vd Sloot]
* some small changes. Made it work with libfolia 0.9
0.5.1 2012-02-27
[Ko vd Sloot]
* added 'escape' possibility for regexps that start with a [
* better debugging output
* removed all (?i) stuff from regexps. This attempts to avoid an ICU bug
* added -X and --id= options
* adapted to libfolia 0.8 (/tests too!)
* some cleanup and refactoring
[Maarten van Gompel]
* added better rules for apostrophes in ATTACHEDSUFFIX and TOKENS
0.5.0 2012-01-09
[Ko vd Sloot]
* added a different and more powerful SMILEY rule, which also happens to work
on older ICU versions
0.4.9 2011-12-21
[Ko vd Sloot]
* reworked and more folia integration
0.4.8 2011-11-02
[Ko vd Sloot]
* use libfolia to generate folia XML
0.4.7 - (not released yet, feel free to add more stuff)
[ Maarten van Gompel ]
* Fix: proper XML entities in FoLiA output
* fixed bug77 (the NOSPACE bug)
* Fix: Nested quote problem (2011-08-18)
* Improved protection against unbalanced quotes/sentences (2011-08-18)
[Ko van der Sloot]
* fixed passthru encoding problem
* fixed problem with CRLF separated lines (bug 78)
* configdir vs. config file hassle moved more inside. simpler API now.
* -Q option works reversed now. -Q Enables Quote detection.
Quote detection appears to be very hard and fragile.
0.4.6 - 2011-05-17
[Ko van der Sloot]
* changed the regexp for KNOWN-ABBREVIATIONS to case sensitive
* fixed include file handling for non-standard locations
* fixed a problem with NON-Unix files. ucto would crash on a line with just '\r'
0.4.5 - 2011-04-27
[ Maarten van Gompel]
* Added sentenceperline support for PassThru mode, improved sentenceperline support for normal mode
[ Ko vd Sloot ]
* on failure, ucto didn't use the right exit code. 0 == SUCCESS (on most systems)
* added functions to display version info.
0.4.4 - 2011-03-31
[ Maarten van Gompel]
* fixed "fatal error: ucto: out of range :No sentence exists with the specified index" problem. (Bug 65)
[ Ko van der Sloot ]
* Fixed terrible bug. Unicode strings were output in the current locale.
But we advertise UTF8
0.4.3 - 2011-03-19
[ Ko van der Sloot ]
* src/ucto.cxx: fixed --passthru problem
* tests/testpassthru.ok: test now works
[ Joost van Baal ]
* NEWS: record changes and releases
0.4.2 - 2011-03-17
[ Ko van der Sloot ]
* include/ucto/tokenize.h, src/tokenize.cxx,
src/unicode.cxx: passes -pedantic
* configure.ac: some cleanup, bumped version
* include/ucto/tokenize.h, src/tokenize.cxx, src/ucto.cxx:
added (hidden) --passthru option
* [r8842] tests/passthru.txt, tests/testall, tests/testpassthru,
added a passthru test.
has to be tested :)
* include/ucto/tokenize.h, src/tokenize.cxx: make compiler
more happy
* docs/ucto.1: added description, small update
0.4.1 - 2011-03-11
[ Ko van der Sloot ]
* src/tokenize.cxx: fixed regexp and error message
* config/tokconfig-nl, src/tokenize.cxx: added the
possibility to set the order of RULES in the config file
* tests/bug0063.nl.tok.V, tests/bug0063.nl.txt: added a
test for bug63
Not sure about the 'correct' solution
* docs/ucto.1: updated man page
[ Maarten van Gompel ]
* src/tokenize.cxx: fixed passthruline (skip=t) bug, FoLiA XSL has to be
local unfortunately
* tests/bug0063.nl.tok.V: override
* config/tokconfig-nl, src/tokenize.cxx,
tests/bug0052.nl.tok.V, tests/normalisation.nl.tok.V,
tests/test.nl.tok.V: fix bug0063
0.4.0 - 2011-03-04
[ Maarten van Gompel ]
* logo.svg: added logo
0.3.7 - 2011-03-01
[ Ko van der Sloot ]
* [r8636] tests/testoption1.ok, tests/testusage.ok: these tests
give a different outcome now.
* [r8318] src/tokenize.cxx: added experimental code to use the -n
option ( output one sentence per line) also to process the input
one sentence per line
* [r8317] tests/bug0054.nl.tok.V, tests/bug0054.nl.txt: testcase
for bug0054
[ Maarten van Gompel ]
* [r8618] include/ucto/tokenize.h, src/tokenize.cxx, src/ucto.cxx:
sentence per line input and output: two modes
* [r8617] src/tokenize.cxx, tests/bug0048.nl.tok.V,
tests/bug0054.nl.tok.V: Fixed bug 54
* [r8615] src/tokenize.cxx, tests/abbreviations.nl.tok.V,
tests/nu.nl.tok.V, tests/test.nl.tok.V: fixes
* [r8614] src/tokenize.cxx: FoLiA improvement
0.3.6 - 2011-02-12
[ Ko van der Sloot ]
* tests/: more tests added
* configure.ac, include/ucto/tokenize.h, src/tokenize.cxx,
src/ucto.cxx, tests/testnormalisation: added possibility to set
the inputEncoding. This breaks the ucto user interface!
0.3.5 - 2011-02-10
[ Ko van der Sloot ]
* src/ucto.cxx: fix memory leak
* include/ucto/tokenize.h, include/ucto/unicode.h,
src/Makefile.am, src/tokenize.cxx, src/ucto.cxx, src/unicode.cxx,
include/ucto/tokenize.h, src/unicode.cxx: added copyright notice
* include/ucto/tokenize.h, src/tokenize.cxx, src/ucto.cxx: -f option
now works
* config/tokconfig-nl, include/ucto/tokenize.h, src/tokenize.cxx,
src/ucto.cxx: added support for ligature filtering and Unicode
normalizing. a bit rough still
* tests/: more tests added
* ucto.pc.in: now uses ucto-icu.pc
[ Maarten van Gompel ]
* configure.ac: version bump
0.3.4 - 2011-01-27
[ Joost van Baal ]
* Makefile.am, configure.ac, icu.pc.in, ucto-icu.pc.in: rename icu.pc to
ucto-icu.pc: be sure we won't suffer from filename clashes in the future.
Once Debian and other distros ship icu 4.6's usr/lib/pkgconfig/icu-io.pc
(released 2010-12-02) we can get rid of our local copy.
[ Ko van der Sloot ]
* tests/: more tests added
[ Maarten van Gompel ]
* include/ucto/tokenize.h, src/tokenize.cxx: Updates in FoLiA support
0.3.3 - 2011-01-27
[ Joost van Baal ]
* Various bugfixes
0.3.2 - 2011-01-27
[ Ko van der Sloot ]
* Various bugfixes
0.3.1 - 2011-01-26
[ Ko van der Sloot ]
* Various bugfixes
0.3.0 - 2011-01-26
[ Maarten van Gompel ]
* tests/: Added lots of tests
* configure.ac, include/ucto/tokenize.h, src/tokenize.cxx, src/ucto.cxx:
major refactoring. Improved buffering, less unnecessary storing of
token/sentence vectors in memory. Improved quote support.
* include/ucto/tokenize.h, src/tokenize.cxx: Ucto now remembers if a token
was spaced or not in the original. Enabling ucto to reconstruct the original
text exactly.
* include/ucto/tokenize.h, src/tokenize.cxx: Added quote detection support
* include/ucto/tokenize.h, src/tokenize.cxx, src/ucto.cxx: Added preliminary
FoLiA XML output support in ucto
* include/ucto/tokenize.h, src/tokenize.cxx, src/ucto.cxx: Big API overhaul
[ Peter Berck ]
* config/Makefile.am, config/tokconfig-sv Added Swedish tokconfig
[ Ko van der Sloot ]
* config/tokconfig-nl, src/tokenize.cxx: read QUOTES from config file
* src/ucto.cxx: refuse to run when inputfile is bad
* docs/ucto.1: added a simple 'man' page
* src/ucto.cxx: added a -p switch to disable paragraph
detection
0.0.1 - 2010-12-25
- First snapshot release
unreleased - 09-12-2010
- started to create a separate package