.TH ucto 1 "2013 march 6"

.SH NAME
ucto - Unicode Tokenizer
.SH SYNOPSIS
ucto [options] [input-file [output-file]]

.SH DESCRIPTION
.B ucto
tokenizes text files: it separates words from punctuation, splits
sentences (and optionally paragraphs), and finds paired quotes.
Ucto is preconfigured with tokenization rules for several languages.

.SH OPTIONS

.BR -c " configfile"
.RS
Read settings from a file
.RE

.BR -d " value"
.RS
Set debug mode to 'value'
.RE

.BR -e " value"
.RS
Set input encoding (default: UTF8)
.RE

.BR -f
.RS
Disable filtering of special characters
.RE

.BR -L " language"
.RS
Automatically select a configuration file by language code.
For example, 'fr' selects the file tokconfig-fr from the installation directory.
.RE

.BR -l 
.RS
Convert to all lowercase
.RE

.BR -u 
.RS
Convert to all uppercase
.RE

.BR -n 
.RS
Emit one sentence per line on output
.RE

.BR -m
.RS
Assume one sentence per line on input
.RE

.BR --passthru    
.RS
Don't tokenize, but perform input decoding and simple token role detection
.RE

.B -P
.RS
Disable paragraph detection
.RE

.B -Q
.RS
Enable quote detection (experimental; may lead to unexpected results)
.RE

.B -S
.RS
Disable sentence detection
.RE

.BR -s " string"
.RS
Set the end-of-sentence marker (default: <utt>)
.RE

.B -V
.RS
Show version information
.RE

.B -v
.RS
Set verbose mode
.RE

.B -F
.RS
Read a FoLiA XML document, tokenize it, and output the modified document. This disables most other options: -nulPQvsS
.RE

.BR --textclass " cls"
.RS
When tokenizing a FoLiA XML document, search for text nodes of class 'cls'
.RE

.B -X
.RS
Output FoLiA XML. This disables most other options: -nulPQvsS
.RE

.BR --id " DocId"
.RS
Use the specified Document ID for the FoLiA XML
.RE

.BR -x " DocId " (obsolete)
.RS
Obsolete. Output FoLiA XML using the specified document ID. This disables most other options: -nulPQvsS
.br
Use
.B -X
and
.B --id
instead.
.RE
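
.SH EXAMPLES
The invocations below are an illustrative sketch that combines only the options documented above.
The file names (input.txt, output.txt, input.xml, output.xml) and the document ID (mydoc) are hypothetical,
and the 'fr' configuration is assumed to be installed.

Tokenize a French text file and emit one sentence per line:
.RS
.nf
ucto -L fr -n input.txt output.txt
.fi
.RE

Tokenize input that already holds one sentence per line, converting it to lowercase:
.RS
.nf
ucto -L fr -m -l input.txt output.txt
.fi
.RE

Write the result as FoLiA XML with an explicit document ID:
.RS
.nf
ucto -L fr -X --id mydoc input.txt output.xml
.fi
.RE

Tokenize an existing FoLiA XML document:
.RS
.nf
ucto -L fr -F input.xml output.xml
.fi
.RE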

.SH BUGS
likely

.SH AUTHORS
Maarten van Gompel proycon@anaproy.nl

Ko van der Sloot Timbl@uvt.nl