.TH ucto 1 "2013 March 6"
.SH NAME
ucto - Unicode Tokenizer
.SH SYNOPSIS
ucto [[options]] [input-file] [[output-file]]
.SH DESCRIPTION
.B ucto
tokenizes text files: it separates words from punctuation, splits
sentences (and optionally paragraphs), and finds paired quotes.
It is preconfigured with tokenization rules for several languages.
.SH OPTIONS
.BR -c " configfile"
.RS
Read settings from the given configuration file.
.RE
.BR -d " value"
.RS
Set the debug mode to 'value'.
.RE
.BR -e " value"
.RS
Set the input encoding (default: UTF8).
.RE
.BR -f
.RS
Disable filtering of special characters.
.RE
.BR -L " language"
.RS
Automatically select a configuration file by language code.
For example, 'fr' selects the file tokconfig-fr from the installation directory.
.RE
.BR -l
.RS
Convert to all lowercase
.RE
.BR -u
.RS
Convert to all uppercase
.RE
.BR -n
.RS
Emit one sentence per line on output
.RE
.BR -m
.RS
Assume one sentence per line on input
.RE
.BR --passthru
.RS
Do not tokenize, but still perform input decoding and simple token role detection.
.RE
.B -P
.RS
Disable paragraph detection.
.RE
.B -Q
.RS
Enable quote detection. This is experimental and may lead to unexpected results.
.RE
.B -S
.RS
Disable sentence detection.
.RE
.BR -s " <string>"
.RS
Set the end-of-sentence marker (default: <utt>).
.RE
.B -V
.RS
Show version information
.RE
.B -v
.RS
Enable verbose mode.
.RE
.B -F
.RS
Read a FoLiA XML document, tokenize it, and output the modified document. This disables most other options: -nulPQvsS.
.RE
.BR --textclass " cls"
.RS
When tokenizing a FoLiA XML document, search for text nodes of class 'cls'.
.RE
.B -X
.RS
Output FoLiA XML. This disables most other options: -nulPQvsS.
.RE
.BR --id " <DocId>"
.RS
Use the specified document ID for the FoLiA XML output.
.RE
.B -x
<DocId>
.B (obsolete)
.RS
Output FoLiA XML, using the specified document ID. This disables most other options: -nulPQvsS.
This option is obsolete; use
.B -X
and
.B --id
instead.
.RE
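.SH EXAMPLES
The invocations below are illustrative sketches built only from the options and argument order described in this page. The file names input.txt, output.txt and output.xml are hypothetical, and both examples assume that a French configuration (tokconfig-fr) is present in the installation directory.
.PP
Tokenize a French text file and emit one sentence per line:
.PP
.nf
ucto -L fr -n input.txt output.txt
.fi
.PP
Tokenize the same file and write the result as FoLiA XML:
.PP
.nf
ucto -L fr -X input.txt output.xml
.fi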
.SH BUGS
likely
.SH AUTHORS
Maarten van Gompel proycon@anaproy.nl
.br
Ko van der Sloot Timbl@uvt.nl