.TH ucto 1 "2018 nov 13"
.SH NAME
ucto \- Unicode Tokenizer
.SH SYNOPSIS
ucto [options] [input\(hyfile] [output\(hyfile]
.SH DESCRIPTION
.B ucto
tokenizes text files: it separates words from punctuation, splits
sentences (and optionally paragraphs), and finds paired quotes.
Ucto is preconfigured with tokenization rules for several languages.
.SH OPTIONS
.BR \-c " configfile"
.RS
read settings from a file
.RE
.BR \-d " value"
.RS
set debug mode to 'value'
.RE
.BR \-e " value"
.RS
set input encoding. (default UTF8)
.RE
.BR \-N " value"
.RS
set UTF8 output normalization. (default NFC)
.RE
.BR \-\-filter =[YES|NO]
.RS
enable or disable filtering of special characters (default: YES).
These special characters can be specified in the [FILTER] block of the
configuration file.
.RE
.BR \-f
.RS
OBSOLETE. Use \-\-filter=NO instead.
.RE
.BR \-L " language"
.RS
Automatically selects a configuration file by language code.
The language code is generally a three\-letter ISO 639\-3 code.
For example, 'fra' will select the file tokconfig\(hyfra from the installation directory.
.RE
.BR \-\-detectlanguages =<lang1,lang2,..langn>
.RS
try to detect all the specified languages. The default language will be 'lang1'.
(only useful for FoLiA output)
.RE
.BR \-l
.RS
Convert to all lowercase
.RE
.BR \-u
.RS
Convert to all uppercase
.RE
.BR \-n
.RS
Emit one sentence per line on output
.RE
.BR \-m
.RS
Assume one sentence per line on input
.RE
.BR \-\-normalize =class1,class2,..,classn
.RS
map all occurrences of tokens with class1,...,classn to their generic names. For example, \-\-normalize=DATE will map all dates to the word {{DATE}}. This is useful for normalizing tokens such as URLs, dates, and e\-mail addresses.
.RE
.BR \-\-add\-tokens ="file"
.RS
Add additional tokens to the [TOKENS] block of the default language.
The file should contain one TOKEN per line.
.RE
.BR \-\-passthru
.RS
Don't tokenize, but perform input decoding and simple token role detection
.RE
.BR \-\-filterpunct
.RS
remove most of the punctuation from the output (but not from abbreviations or embedded punctuation, as in John's).
.RE
.B \-P
.RS
Disable Paragraph Detection
.RE
.B \-Q
.RS
Enable Quote Detection. (this is experimental and may lead to unexpected results)
.RE
.B \-s
<string>
.RS
Set End\(hyof\(hysentence marker. (Default <utt>)
.RE
.B \-V
.RS
Show version information
.RE
.B \-v
.RS
set Verbose mode
.RE
.B \-F
.RS
Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of most other options: \-nPQvs)
For files with an '.xml' extension, \-F is the default.
.RE
.BR \-\-inputclass ="cls"
.RS
When tokenizing a FoLiA XML document, search for text nodes of class 'cls'.
The default is "current".
.RE
.BR \-\-outputclass ="cls"
.RS
When tokenizing a FoLiA XML document, output the tokenized text in text nodes with 'cls'.
The default is "current".
It is recommended to have different classes for input and output.
.RE
.BR \-\-textclass ="cls" (obsolete)
.RS
use 'cls' for both input and output of text from FoLiA. Equivalent to specifying both \-\-inputclass='cls' and \-\-outputclass='cls'.
This option is obsolete and NOT recommended. Please use the separate \-\-inputclass= and \-\-outputclass= options.
.RE
.B \-X
.RS
Output FoLiA XML. (this disables usage of most other options: \-nPQvs)
.RE
.B \-\-id
<DocId>
.RS
Use the specified Document ID for the FoLiA XML
.RE
.B \-x
<DocId>
.B (obsolete)
.RS
Output FoLiA XML, use the specified Document ID. (this disables usage of most other options: \-nPQvs).
.B Obsolete.
Use
.B \-X
and
.B \-\-id
instead.
.RE
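.SH EXAMPLES
The invocations below are illustrative sketches assembled from the options
described above, not output copied from a real session. The file names are
hypothetical, and each example assumes that a configuration for 'fra' is
available in the installation directory.
.PP
Tokenize a French text file and emit one sentence per line:
.RS
.nf
ucto \-L fra \-n input.txt output.txt
.fi
.RE
.PP
As above, but map every token of class DATE to its generic name:
.RS
.nf
ucto \-L fra \-n \-\-normalize=DATE input.txt output.txt
.fi
.RE
.PP
Tokenize a FoLiA XML document, reading text of class "current" and writing
the tokenized text under a separate class (the class name "tokenized" is an
arbitrary choice for illustration):
.RS
.nf
ucto \-L fra \-\-inputclass=current \-\-outputclass=tokenized input.xml output.xml
.fi
.RE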
.SH BUGS
likely
.SH AUTHORS
Maarten van Gompel proycon@anaproy.nl
.br
Ko van der Sloot Timbl@uvt.nl