.TH ucto 1 "2018 nov 13"

.SH NAME
ucto \- Unicode Tokenizer
.SH SYNOPSIS
ucto [options] [input\(hyfile [output\(hyfile]]

.SH DESCRIPTION
.B ucto
tokenizes text files: it separates words from punctuation, splits
sentences (and optionally paragraphs), and finds paired quotes.
Ucto is preconfigured with tokenization rules for several languages.

.SH OPTIONS

.BR \-c " configfile"
.RS
Read settings from the given configuration file.
.RE

.BR \-d " value"
.RS
Set debug mode to 'value'.
.RE

.BR \-e " value"
.RS
Set the input encoding (default: UTF8).
.RE

.BR \-N " value"
.RS
Set the UTF8 output normalization (default: NFC).
.RE

.BR \-\-filter =[YES|NO]
.RS
Enable or disable filtering of special characters (default: YES).
These special characters can be specified in the [FILTER] block of the
configuration file.
.RE

.BR \-f
.RS
Obsolete. Use \-\-filter=NO instead.
.RE

.BR \-L " language"
.RS
Automatically selects a configuration file by language code.
The language code is generally a three\-letter ISO 639\-3 code.
For example, 'fra' will select the file tokconfig\(hyfra from the installation directory.
.RE

.BR \-\-detectlanguages =<lang1,lang2,..,langn>
.RS
Try to detect all the specified languages. The default language will be 'lang1'.
(Only useful for FoLiA output.)
.RE

.BR \-l
.RS
Convert to all lowercase
.RE

.BR \-u
.RS
Convert to all uppercase
.RE

.BR \-n
.RS
Emit one sentence per line on output
.RE

.BR \-m
.RS
Assume one sentence per line on input
.RE

.BR \-\-normalize =class1,class2,..,classn
.RS
Map all occurrences of tokens with class1,..,classn to their generic names. For example, \-\-normalize=DATE will map all dates to the word {{DATE}}. This is useful for normalizing tokens such as URLs, dates, and e\-mail addresses.
.RE

.BR \-\-add\-tokens ="file"
.RS
Add additional tokens to the [TOKENS] block of the default language.
The file should contain one TOKEN per line.
.RE

.BR \-\-passthru
.RS
Don't tokenize, but perform input decoding and simple token role detection
.RE

.BR \-\-filterpunct
.RS
Remove most of the punctuation from the output (but not from abbreviations or embedded punctuation, as in John's).
.RE

.B \-P
.RS
Disable Paragraph Detection
.RE

.B \-Q
.RS
Enable Quote Detection. (this is experimental and may lead to unexpected results)
.RE

.B \-s
<string>
.RS
Set End\(hyof\(hysentence marker. (Default <utt>)
.RE

.B \-V
.RS
Show version information
.RE

.B \-v
.RS
Set verbose mode
.RE

.B \-F
.RS
Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of most other options: \-nPQvs)
For files with an '.xml' extension, \-F is the default.
.RE

.BR \-\-inputclass ="cls"
.RS
When tokenizing a FoLiA XML document, search for text nodes of class 'cls'.
The default is "current".
.RE

.BR \-\-outputclass ="cls"
.RS
When tokenizing a FoLiA XML document, output the tokenized text in text nodes with class 'cls'.
The default is "current".
It is recommended to have different classes for input and output.
.RE

.BR \-\-textclass ="cls" (obsolete)
.RS
Use 'cls' for both input and output of text from FoLiA. Equivalent to specifying both \-\-inputclass='cls' and \-\-outputclass='cls'.

This option is obsolete and NOT recommended. Please use the separate \-\-inputclass and \-\-outputclass options.
.RE

.B \-X
.RS
Output FoLiA XML. (this disables usage of most other options: \-nPQvs)
.RE

.B \-\-id
<DocId>
.RS
Use the specified Document ID for the FoLiA XML
.RE

.B \-x
<DocId>
.B (obsolete)
.RS
Output FoLiA XML, use the specified Document ID. (this disables usage of most other options: \-nPQvs).

.B Obsolete.
Use
.B \-X
and
.B \-\-id
instead.
.RE
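
.SH EXAMPLES
The invocations below are illustrative sketches that combine only options
documented above. The file names input.txt, output.txt and output.xml are
hypothetical, and the examples assume that the configuration file
tokconfig\(hyfra is present in the installation directory.

Tokenize a French text file, emit one sentence per line, and map dates to {{DATE}}:
.nf
.RS
ucto \-L fra \-n \-\-normalize=DATE input.txt output.txt
.RE
.fi

Tokenize the same file and write FoLiA XML instead of plain text:
.nf
.RS
ucto \-L fra \-X input.txt output.xml
.RE
.fi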

.SH BUGS
likely

.SH AUTHORS
Maarten van Gompel proycon@anaproy.nl

Ko van der Sloot Timbl@uvt.nl