.TH ucto 1 "2013 March 6"

.SH NAME
ucto - Unicode Tokenizer
.SH SYNOPSIS
ucto [[options]] [input-file] [[output-file]]

.SH DESCRIPTION
.B ucto
tokenizes text files: it separates words from punctuation, splits
sentences (and optionally paragraphs), and finds paired quotes.
Ucto is preconfigured with tokenization rules for several languages.

.SH OPTIONS

.BR -c " configfile"
.RS
Read settings from a file
.RE

.BR -d " value"
.RS
Set debug mode to 'value'
.RE

.BR -e " value"
.RS
Set the input encoding (default: UTF8)
.RE

.BR -f
.RS
Disable filtering of special characters
.RE

.BR -L " language"
.RS
Automatically select a configuration file by language code,
e.g. 'fr' selects the file tokconfig-fr from the installation directory
.RE

.BR -l 
.RS
Convert to all lowercase
.RE

.BR -u 
.RS
Convert to all uppercase
.RE

.BR -n 
.RS
Emit one sentence per line on output
.RE

.BR -m
.RS
Assume one sentence per line on input
.RE

.BR --passthru    
.RS
Don't tokenize, but perform input decoding and simple token role detection
.RE

.B -P
.RS
Disable Paragraph Detection
.RE

.B -Q
.RS
Enable Quote Detection. This is experimental and may lead to unexpected results.
.RE

.B -S
.RS
Disable Sentence Detection
.RE

.B -s
<string>
.RS
Set the end-of-sentence marker (default: <utt>)
.RE

.B -V
.RS 
Show version information
.RE

.B -v
.RS
Set verbose mode
.RE

.B -F
.RS
Read a FoLiA XML document, tokenize it, and output the modified document. This disables most other options: -nulPQvsS.
.RE

.BR --textclass " cls"
.RS
When tokenizing a FoLiA XML document, search for text nodes of class 'cls'
.RE

.B -X
.RS
Output FoLiA XML. This disables most other options: -nulPQvsS.
.RE

.B --id
<DocId>
.RS
Use the specified Document ID for the FoLiA XML
.RE

.B -x
<DocId>
.B (obsolete)
.RS
Output FoLiA XML, using the specified Document ID. This disables most other options: -nulPQvsS.

.B Obsolete:
use
.B -X
and
.B --id
instead.
.RE
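
.SH EXAMPLES
The invocations below are illustrative sketches assembled only from the
options documented above. The file names, the document ID 'mydoc' and the
text class 'cls' are hypothetical, and the exact behaviour may differ
between versions.

Tokenize a French text and emit one sentence per line:
.RS
.nf
ucto -L fr -n input.txt output.txt
.fi
.RE

Tokenize a French text and write FoLiA XML with a chosen document ID:
.RS
.nf
ucto -L fr -X --id mydoc input.txt output.xml
.fi
.RE

Tokenize the text nodes of class 'cls' in an existing FoLiA XML document:
.RS
.nf
ucto -L fr -F --textclass cls input.xml output.xml
.fi
.RE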

.SH BUGS
likely

.SH AUTHORS
Maarten van Gompel proycon@anaproy.nl

Ko van der Sloot Timbl@uvt.nl