1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644
|
0.35 2024-12-16
[Ko van der Sloot]
* require latest ticcutils
* updated GitHub CI
0.34 2024-09-12
[Maarten van Gompel]
* fall back when local config dir can not be checked for whatever reason
https://github.com/LanguageMachines/ucto/issues/97
* extract custom configuration directory if provided, and fall back to that
for includes https://github.com/LanguageMachines/ucto/issues/96
* needs ticcutils >= 0.35
[Ko van der Sloot]
* force use of c++17
* minor code updates
* streamlined Github CI file
* adapted some foliatests to recent libfolia versions
* refactored tests:
- all shell scripts have the .sh extension now
- use folialint or foliadiff to check folia results
0.33 2024-04-26
[Ko van der Sloot]
* added a batch mode: https://github.com/LanguageMachines/ucto/issues/94
* improved handling of NonSpacing markers.
* adapted some tests, based on the newest uctodata package (notably
French was not correct implemented)
0.32.1 2024-03-20
[Ko van der Sloot]
* additional fix for https://github.com/LanguageMachines/ucto/issues/93
0.32 2024-03-19
[Ko van der Sloot]
* fix for https://github.com/LanguageMachines/ucto/issues/95
* automagicly geneate an xml:id when not provided
0.31 2024-02-28
[Ko van der Sloot]
* fixed handling of the rare cases of Unidentifiable Characters
They were ignored, which lead to incompatible text elements in FoLiA
* some small refactoring, rooting out CppCheck warnings
0.30 2023-10-21
[Ko van der Sloot]
* using ticcutils >- 0.34. All Unicode id NFC normalized now
* normalization performed for passthru too.
All output should be in the same encoding (NFC)
* fixed a problem when using the API form Frog
* improving code quality
* added (dangerous, and compiletime only) option to change the magic
'tokconfig-' value.
[Maarten van Gompel]
* README.md: README: added demo screencast
0.29 2023-04-22
[Ko van der Sloot]
* fixes for https://github.com/proycon/python-ucto/issues/16
* added a new --copyclass option, (see comments in
https://github.com/LanguageMachines/ucto/issues/68)
* updated man page
0.28.1 2023-02-22
[Maarten van Gompel]
* Software metadata update only, no functional changes
0.28 2023-02-21
[Ko van der Sloot]
* Made sure that TextCat is not initialized when not needed
* Sentences inside quotes got an inconsistent xml:id (Not invalid though)
* Separated Debug en Log streams.
* C++ Code quality improved
0.27 2023-01-23
[Ko van der Sloot]
* removed dependency on libtar
* fixed build when HAVE_TEXTCAT was not set. Improved guards agains missing textcat support
[Maarten van Gompel]
* guard against uninitialized/missing textcat (https://github.com/proycon/python-frog#22)
* require latest libfolia, ticcutils and a more recent libxml2
0.26 2023-01-02
[Ko van der Sloot]
* some code quality improvements
* fix for https://github.com/LanguageMachines/ucto/issues/89
* updated configure.ac
* updated GitHub action
[Maarten van Gompel]
* Added MAINTAINERS
* updated codemeta.json
* fix for https://github.com/fbkarsdorp/homebrew-lamachine/issues/17
0.25 2022-07-22
[Ko van der Sloot]
* Added a test for https://github.com/LanguageMachines/ucto/issues/87
* Adapted to latest update in tokconfig-fra (uctodata 0.9)
* Deal with unknown languages (as detected by ucto), using iso-639-3 'und' (https://github.com/LanguageMachines/ucto/issues/86)
* don't tokenize unknown languages
* configurable sentence splitter for "und" text
* added tests
* added code to set the separator (--seperators), so ucto can split on more than just spaces
* migrated test wrapper to Python 3 (was still on 2.7)
[Maarten van Gompel]
* Set up a Dockerfile
* Added build-deps.sh to automatically download, build and install dependencies
* Updated software metadata (codemeta.json) to latest requirements as proposed in CLARIAH
* deprecated options -f and -x, still works but no longer advertised and gives a deprecation notice (https://github.com/LanguageMachines/ucto/issues/88)
* textcat.cfg is now searched for in user config dir as well as global config; also allow running without textcat if the config is missing entirely (same as if not compiled in)
* added support for user-based configuration dirs ($XDG_CONFIG_HOME/ucto), takes precedence over global data dirs
0.24.1 2021-12-17
[Ko van der Sloot]
* added UTF8 members to the API, to replace the variants that were
converted to UnicodeString
This should help fixing https://github.com/proycon/python-ucto/issues/11
0.24 2021-12-15
[Ko vd Sloot]
* fix for https://github.com/LanguageMachines/ucto/issues/84
* added a solution for https://github.com/LanguageMachines/ucto/issues/53
(only partly)
* added some UnicodeString members to the API
* bumped library version to 6.0, because of API changes
* code cleanup and refactoring
0.23 2021-07-12
[Ko vd Sloot]
* added support for the new 'tag' feature in FoLiA, only for tag="token"
* fixed a problem with '-T full' option not always adding text
* use the new TextPolicy class from libfolia
* fix for https://github.com/LanguageMachines/ucto/issues/81
* fix for https://github.com/LanguageMachines/ucto/issues/82
* added code to handle several Unicode joiners
* replaced TravisCI by GutHub action
* %include files may have an extension now
* added tests for new features
0.22 2020-10-08
[Ko van der Sloot]
* fix for https://github.com/LanguageMachines/ucto/issues/79
0.21.1 2020-04-15
[Ko van der Sloot]
* fix for https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=941498
0.21 2020-04-15
[Ko van der Sloot]
* Adapted to newest libfolia 2.4
* adapted some tests
* added an --allow-word-corrections option
* improved handling of odd FoLiA
0.20 2019-11-27
[Ko van der Sloot]
Bug fix release. solving:
* https://github.com/LanguageMachines/frog/issues/84
* https://github.com/LanguageMachines/frog/issues/83
* https://github.com/LanguageMachines/ucto/issues/76
* https://github.com/LanguageMachines/ucto/issues/74
0.19 2019-09-16
[Ko van der Sloot]
Bug fix release. solving:
* https://github.com/LanguageMachines/ucto/issues/72
* some problems with the newest libfolia.
* better provenance records
0.18 2019-07-22
[Ko van der Sloot]
Bug fix release. solving:
https://github.com/LanguageMachines/ucto/issues/70
0.17 2019-06-19
[Ko van der Sloot]
Bug-fix release:
- solved problems when tokenizing (partly-)tokenized FoLiA
(but this is a very complicated situation. Might need more work)
- solved problems with --passthru on FoLiA
- avoid empty lines in FoLiA output
- use the new generate_id attribute for provenance/processors
- added more tests
KNOW PROBLEM: On TravisCI/MacOSX some tests fail for unclear reasons.
0.16 2019-05-29
Major release supporting FoLiA 2.0
* bug fixes for:
- empty sentences in FoLiA introduced by NonBreakingSpace
- provide provenance data
0.15 2019-05-15
[Ko van der Sloot]
stabilizing release for pre FoLiA 2.0
* uses new folia::engine to process FoLiA
* lots of refactoring and cleanup
* some small bug fixes
* added tests for corner cases in FoLiA
* improved TextCat handling and debugging
0.14.1 2018-12-10
[Ko van der Sloot]
Bug fix release
* fixed textcat installation problems om Debian and OpenBSD
(https://github.com/LanguageMachines/ucto/issues/59)
* typo in the man page fixed
0.14 2018-11-29
[Ko van der Sloot]
* updated usage() and removed -S option (never used)
* make sure the right textclass is assigned to <w> nodes in FoLiA
* minor code fixes/refactorings
* added more tests
* updated man.1 page
[Maarten van Gompel]
* updated README.md
[Iris Hendrickx]
* Updated and extended the manual
0.13.2 2018-05-17
[Ko van der Sloot]
Bug fix release:
* uctodata is mandatory. So don't install default rules anymore
0.13.1 2018-05-17
[Ko van der Sloot]
Bug fix release:
* configure now finds out the location of the uctodata files.
should make it work on Mac systems too
0.13 2018-05-16
[Ko van der Sloot]
* improved configure/build/test
* added a --split option
* fixed -P option
* removed -S option (never used, and only half implemented)
* added a --add-tokens option, to add special tokens for the default language
* generally use the icu:: namespace
* added more tests
* fixed uninitialized variable.
* added code to use an alternative search-path for uctodata
[Maarten van Gompel]
* added codemeta.json
0.12 2018-02-19
[Ko van der Sloot]
* now use the UniFilter Unicode Filter from ticcutils
* now use the UnicodeNormalizer from ticcutils
* improved configuration. Support vor Mac OSX added
0.11 2017-12-04
[Ko van der Sloot]
Bug fix release:
* problems with text inside Cell elements
0.10 2017-11-07
[Ko van der Sloot]
New release due to outdated files in the previous release.
0.9.9 2017-11-06
[Ko van der Sloot]
Minor fix:
* bumped the .so version to 3.0.0
0.9.8 2017-10-23
[Ko van der Sloot]
Bug-fix release
* fixed utterance handling in FoLiA input. Don't try sentence detection!
0.9.7 2017-10-17
[Ko van der Sloot]
* added textredundancy option, default is 'minimal'
* small adaptations to work with FoLiA 1.5 specs
- set textclass on words when outputclass != inputclass
- DON'T filter special characters when inputclass == outputclass
* -F (folia input) is automatically set for .xml files
* more robust against texts with embedded tabs, etc.
* more and better tests added
* better logging and error messaging
* improved language handling. TODO: Language detection in FoLiA
* bug fixes:
- correctly handle xml-comment inside a <t>
- better id generation when parent has no id
- better reaction on overly long 'words'
0.9.6 2017-01-23
[Maarten van Gompel]
* Moving data files from etc/ to share/, as they are more data files than
configuration files that should be edited.
* Requires uctodata >= 0.4.
* Should solve debian packaging issues (#18)
* Minor updates to the manual (#2)
* Some refactoring/code cleanup, temper expectations regarding ucto's
date-tagging abilities (#16, thanks also to @sanmai-NL)
0.9.5 2017-01-06
[Ko van der Sloot]
Bug fix release:
* updated tokconfig-generic, which is removed from the uctodata package
* configure no longer insists on the presence of uctodata, it merely warns
when missing
0.9.4 2017-01-05
[Ko van der Sloot]
Major update
* Language support
- added support for multiple languages
- auto detection of languages using textcat
* some refactoring
- no more call to exit()
- Better logging and Warning messages
- some folia output improvements
* bug fixes
- in passthru,
- issue #11
0.9.3 2016-09-28
[Ko van der Sloot]
Major update:
- require ICU 5.2
- implemented recursive application of rules. (which may be dangerous)
- modfied tests, because not all failures wre detected correctly
- check the uctodata version. version > 0.2 is preferred.
0.9.1 2016-07-12
[Ko van der Sloot]
Bug fix release:
- fixed autoconfig issue
0.9.0 2016-07-11
[Ko van der Sloot]
Major update
- now use uctodata for language specific information
ucto itself only supports a generic tokenizer
- interactive use now uses readline library
- accept long options --help and --verision
- UTF16BE now works
- better support for crooked Windows files in general
- added a --normalize option to map tokens in a certain TokenClass
to it's generic name
0.8.6 2016-04-25
[Ko van der Sloot]
* Bug fix release: fixing Sentence boundaries after abbreviations
0.8.5 2016-04-25
[Ko van der Sloot]
* Bug fix release: Better handling of regexps
0.8.4 2016-03-10
[Ko van der Sloot]
* implemented on top of libfolia 1.0
0.8.1 2016-01-14
[Ko van der Sloot]
* repository moved to GIT
* added Travis support
* more tests added
* added META-RULES code
* %include now supports full paths
* updated some languages
* fixed passthru mode
* code cleanup
0.8.0 2015-01-29
[Ko van der Sloot]
* next release
[Maarten van Gompel]
* added new tokenize(string,string) meta-function for the API
* allatonce enabled by default for tokenize() to folia doc
* fixing date rules and adding FRACNUMBER
* added Russian
* Adicionei regras para tokenização portuguesa.
[Antal vd Bosch]
* added RK to dutch abbrev.
0.7.0 2014-11-26
[Ko van der Sloot]
* unofficial release
* experimental PUNCTUATION filter
* bug fixes
[Maarten van Gompel]
* reduced memory usage
0.6.0 2014-09-23
[Ko van der Sloot]
* release
0.5.5 2014-06-xx
* made getSentence() public
* adapted to most recent libfolia (0.11 or above)
* needs libticcutils 0.6 or above
* uses TiCC::CommandLine
* detect EMOTICON's
* generally switched to UChar32 and Unicode codepoints. (avoid length() problems)
* handle FoLiA Note like Caption
* a lot of bug fixes concerning FoLiA output (<t> nodes, textclass values etc.)
* again some changes around quotes
* improved tokenisation in differeny languages
* added swedisch
0.5.3 2013-04-04
[Folgert Karsdorp]
* Fixed quote detection, added tests. still shaky and default disabled
[Ko van der Sloot]
* changed verbose output slightly
* fixed id's in folia output
* various folia fixes
* honour BOM markers in input file
* lots of configuration updates
* some fixes in handling if RULES
0.5.2 2012-03-29
[Ko vd Sloot]
* some small changes. Made it work with libfolia 0.9
0.5.1 2012-02-27
[Ko vd Sloot]
* added 'escape' possibility for regexps that start with a [
* better debugging output
* removed all (?i) stuff from regexps. This attempts to avoid an ICU bug
* added -X en --id= options
* adapted to libfolia 0.8 (/tests too!)
* some cleanup and refactoring
[Maarten van Gompel]
* added better rules for apostrophs in ATTACHEDSUFFIX and TOKENS
0.5.0 2012-01-09
[Ko vd Sloot]
* added a different and more powerpull SMILEY rule. Which happens also to work
on older ICU versions
0.4.9 2011-12-21
[Ko vd sloot]
* reworked and more folia integration
0.4.8 2011-11-02
[Ko vd sloot]
* use libfolia to generate folia XML
0.4.7 - (not released yet, feel free to add more stuff)
[ Maarten van Gompel ]
* Fix: proper XML entities in FoLiA output
* fixed bug77 (the NOSPACE bug)
* Fix: Nested quote problem (2011-08-18)
* Improved protection against unbalanced quotes/sentences (2011-08-18)
[Ko van der Sloot]
* fixed passthru encoding problem
* fixed problem with CRLF separated lines (bug 78)
* configdir vs. config file hassle moved more inside. simpler API now.
* -Q option works reversed now. -Q Enables Quote detection.
Quote detection apears to be very hard and fragile.
0.4.6 - 2011-05-17
[Ko van der Sloot]
* changed the regexp for KNOWN-ABBREVIATIONS to case sensitive
* fixed include file handling for non-standard locations
* fixed a problem with NON-Unix files. ucto would crash on a line with just '\r'
0.4.5 - 2011-04-27
[ Maarten van Gompel]
* Added sentenceperline support for PassThru mode , improved sentenceperline support for normal mode
[ Ko vd Sloot ]
* on failue, ucto didn't use the right exit code. 0 == SUCCESS (on most systems)
* added functions to display version info.
0.4.4 - 2011-03-31
[ Maarten van Gompel]
* fixed "fatal error: ucto: out of range :No sentence exists with the specified index" problem. (Bug 65)
[ Ko van der Sloot ]
* Fixed terrible bug. Unicode strings were output in the current locale.
But we advertise UTF8
0.4.3 - 2011-03-19
[ Ko van der Sloot ]
* src/ucto.cxx: fixed --passthru problem
* tests/testpassthru.ok: test now works
[ Joost van Baal ]
* NEWS: record changes and releases
0.4.2 - 2011-03-17
[ Ko van der Sloot ]
* include/ucto/tokenize.h, src/tokenize.cxx,
src/unicode.cxx: passes -pedantic
* configure.ac: some cleanup, bumped version
* include/ucto/tokenize.h, src/tokenize.cxx, src/ucto.cxx:
added (hidden) --passthru option
* [r8842] tests/passthru.txt, tests/testall, tests/testpassthru,
added a passthru test.
has t0 be tested :)
* include/ucto/tokenize.h, src/tokenize.cxx: make compiler
more happy
* docs/ucto.1: added description, smal update
0.4.1 - 2011-03-11
[ Ko van der Sloot ]
* src/tokenize.cxx: fixed regexp and error messag
* config/tokconfig-nl, src/tokenize.cxx: added the
possiblity to ste the order of RULES in the config file
* tests/bug0063.nl.tok.V, tests/bug0063.nl.txt: added a
test for bug63
Not sure about the 'correct' solution
* docs/ucto.1: updated man page
[ Maarten van Gompel ]
* src/tokenize.cxx: fixed passthruline (skip=t) bug, FoLiA XSL has to be
local unfortunately
* tests/bug0063.nl.tok.V: override
* config/tokconfig-nl, src/tokenize.cxx,
tests/bug0052.nl.tok.V, tests/normalisation.nl.tok.V,
tests/test.nl.tok.V: fix bug0063
0.4.0 - 2011-03-04
[ Maarten van Gompel ]
* logo.svg: added logo
0.3.7 - 2011-03-01
[ Ko van der Sloot ]
* [r8636] tests/testoption1.ok, tests/testusage.ok: these tests
give a different outcome now.
* [r8318] src/tokenize.cxx: added experimental code to use the -n
option ( output one sentence per line) also to process the input
one sentence per line
* [r8317] tests/bug0054.nl.tok.V, tests/bug0054.nl.txt: testcase
for bug0054
[ Maarten van Gompel ]
* [r8618] include/ucto/tokenize.h, src/tokenize.cxx, src/ucto.cxx:
sentence per line input and output: two modes
* [r8617] src/tokenize.cxx, tests/bug0048.nl.tok.V,
tests/bug0054.nl.tok.V: Fixed bug 54
* [r8615] src/tokenize.cxx, tests/abbreviations.nl.tok.V,
tests/nu.nl.tok.V, tests/test.nl.tok.V: fixes
* [r8614] src/tokenize.cxx: FoLiA improvement
0.3.6 - 2011-02-12
[ Ko van der Sloot ]
* tests/: more tests added
* configure.ac, include/ucto/tokenize.h, src/tokenize.cxx,
src/ucto.cxx, tests/testnormalisation: added possibility to set
the inputEncoding breaks ucto user interface!
0.3.5 - 2011-02-10
[ Ko van der Sloot ]
* src/ucto.cxx: fix memory leak
* include/ucto/tokenize.h, include/ucto/unicode.h,
src/Makefile.am, src/tokenize.cxx, src/ucto.cxx, src/unicode.cxx,
include/ucto/tokenize.h, src/unicode.cxx: added copyright notice
* include/ucto/tokenize.h, src/tokenize.cxx, src/ucto.cxx: -f option
now works
* config/tokconfig-nl, include/ucto/tokenize.h, src/tokenize.cxx,
src/ucto.cxx: added support for ligature filtering and Unicode
normalizing. a bit rough still
* tests/: more tests added
* ucto.pc.in: now uses ucto-icu.pc
[ Maarten van Gompel ]
* configure.ac: version bump
0.3.4 - 2011-01-27
[ Joost van Baal ]
* Makefile.am, configure.ac, icu.pc.in, ucto-icu.pc.in: rename icu.pc to
ucto-icu.pc: be sure we wont suffer from filename clashes in the future.
Once Debian and other distos ship icu 4.6's usr/lib/pkgconfig/icu-io.pc
(released 2010-12-02) we can get rid of our local copy.
[ Ko van der Sloot ]
* tests/: more tests added
[ Maarten van Gompel ]
* include/ucto/tokenize.h, src/tokenize.cxx: Updates in FoLiA support
0.3.3 - 2011-01-27
[ Joost van Baal ]
* Various bugfixes
0.3.2 - 2011-01-27
[ Ko van der Sloot ]
* Various bugfixes
0.3.1 - 2011-01-26
[ Ko van der Sloot ]
* Various bugfixes
0.3.0 - 2011-01-26
[ Maarten van Gompel ]
* tests/: Added lots of tests
* configure.ac, include/ucto/tokenize.h, src/tokenize.cxx, src/ucto.cxx:
major refactoring. Improved buffering, less unnecessary storing of
token/sentence vectors in memory. Improved quote support.
* include/ucto/tokenize.h, src/tokenize.cxx: Ucto now remembers if a token
was spaced or not in the original. Enabling ucto to recontruct the original
text exactly.
* include/ucto/tokenize.h, src/tokenize.cxx: Added quote detection support
* include/ucto/tokenize.h, src/tokenize.cxx, src/ucto.cxx: Added preliminary
FoLiA XML output support in ucto
* include/ucto/tokenize.h, src/tokenize.cxx, src/ucto.cxx: Big API overhaul
[ Peter Berck ]
* config/Makefile.am, config/tokconfig-sv Added Swedish tokconfig
[ Ko van der Sloot ]
* config/tokconfig-nl, src/tokenize.cxx: read QUOTES from config file
* src/ucto.cxx: refuse to run when inputfile is bad
* docs/ucto.1: added a simple 'man' page
* src/ucto.cxx: added al -p switch to disable paragraph
detection
0.0.1 - 2010-12-25
- First snapshot release
unreleased - 09-12-2010
- started to create a separate package
|