.TH SPAMORACLE.CONF 5

.SH NAME
spamoracle.conf \- SpamOracle configuration file format

.SH DESCRIPTION
The
.B spamoracle.conf
file is a configuration file governing the operation of the
.BR spamoracle (1)
e-mail classification tool.  By default, the configuration file
is read from
.IB $HOME /.spamoracle.conf
but an alternate location can be specified using the
.B \-config
flag to
.BR spamoracle (1).

.B Important note:
most of the configuration parameters should not be modified lightly,
as this may result in completely wrong e-mail classification.  
Familiarity with Graham's filtering algorithm, as described in the
paper referenced at the end of this page, is required to really
understand the effect of the parameters.

.SH SYNTAX

The
.B spamoracle.conf
file is composed of lines of the form
.I variable
.B =
.IR value .
Lines starting with a hash sign (#) are treated as comments and ignored.
Blank lines are ignored.

Depending on the type of the variable (see the list of variables below), the
.I value
part takes one of the following forms:
.TP
.I string
A sequence of characters.  Blanks (spaces, tabs) at the beginning and the
end of the string are ignored.  Alternatively, the string can be
enclosed in double quotes ("), in which case spaces are not trimmed.
Inside quoted strings, backslashes (\e) and double quotes (") must be
escaped with a backslash, as in \e\e or \e".
.TP
.I boolean
Either
.BR on ,
.BR yes ,
.BR true ,
or
.B 1
to activate the boolean option, or
.BR off ,
.BR no ,
.BR false ,
or
.B 0
to deactivate it.
.TP
.I integer
A decimal integer.
.TP
.I float
A decimal floating-point number.
.TP
.I regexp
A regular expression in
.BR emacs (1)
syntax.  The repetition operators are
.BR * ,
.BR + ,
and
.BR ? .
Alternation is written
.B \e|
and grouping is written
.BR \e( ... \e) .
Character classes are written between brackets
.BR [ ... ]
as usual.  A single dot denotes any character except newline.
Regular expressions are case-insensitive.
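.PP
As a hedged illustration of these forms, a configuration file might
look as follows; the path and the extra
.B received:
header are hypothetical choices made for this example, not recommended
settings.
.PP
.nf
# Lines starting with a hash sign are comments.
database_file = /home/alice/.spamoracle.db
spam_header = "X-Spam"
html_retain_tags = off
num_meaningful_words = 15
low_freq_limit = 0.01
mail_headers = from:\e|subject:\e|received:
.fi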

.SH CONFIGURABLE PARAMETERS

.TP
.B database_file
(type
.IR string ,
default value
.IB $HOME /.spamoracle.db\fR)
.br
The location of the file that contains the database of word frequencies
used by
.BR spamoracle (1).
.TP
.B html_retain_tags
(type
.IR boolean ,
default value
.BR false )
.br
In HTML-formatted e-mails and attachments, the names of HTML tags are
normally not treated as words and are ignored for the word frequency
calculations. If the
.B html_retain_tags
parameter is set to
.BR true ,
HTML tags (such as
.B img
or
.BR bold )
are treated as words and included in the computation of word frequencies.
.TP
.B html_tag_attributes
(type
.IR regexp ,
default value
.br
.BR a/href\e|img/src\e|img/alt\e|frame/src\e|font/face\e|font/color )
.br
This regular expression matches pairs of HTML tags and HTML attributes
written as
.IB tag / attribute\fR.
When scanning HTML-formatted e-mails and attachments, attributes of
HTML tags are normally ignored, unless the tag/attribute pair matches
the regular expression
.BR html_tag_attributes .
If the tag/attribute pair matches this regexp, the value of the attribute
(for instance, the URL for the
.BR a / href
attribute) is scanned for words.
.TP
.B mail_headers
(type
.IR regexp ,
default value
.BR from:\e|subject: )
.br
A regular expression determining which headers of an e-mail message
are scanned for words.
.TP
.B spam_header
(type
.IR string ,
default value
.BR X-Spam )
.br
The name of the header that
.B spamoracle mark
adds to incoming e-mail messages, with the results of the spam/non-spam 
classification.
.TP
.B attachments_header
(type
.IR string ,
default value
.BR X-Attachments )
.br
The name of the header that
.B spamoracle mark
adds to incoming e-mail messages, with the one-line summary of attachment 
types, names and character sets.  The generation of this header can
be turned off with the
.B summarize_attachment
parameter.
.TP
.B summarize_attachment
(type
.IR boolean ,
default value
.BR true )
.br
If this parameter is set,
.B spamoracle mark
generates a one-line summary of the attachments of the incoming messages,
and inserts this summary in the message headers.
Setting this parameter to
.B false
disables the generation of this extra header.
.TP
.B num_meaningful_words
(type
.IR integer ,
default value
.BR 15 )
.br
Maximum number of "meaningful" words that are retained for computing
the spam probability.  During mail analysis,
.B spamoracle
extracts all words of the message, and retains those whose spam frequency
(frequency of occurrence in spam messages) is closest to 1 or to 0.  
At most
.B num_meaningful_words
such "meaningful" words are retained.
.TP
.B max_repetitions
(type
.IR integer ,
default value
.BR 2 )
.br
Maximum number of times a given word can occur in the set of
"meaningful" words retained for computing the spam probability.
The default value of 2 means that at most 2 occurrences of the same
word will be retained.
.TP
.B low_freq_limit
(type
.IR float ,
default value
.BR 0.01 )
.TP
.B high_freq_limit
(type
.IR float ,
default value
.BR 0.99 )
.br
The spam frequency of a word is computed as the number of occurrences
in spam divided by the number of occurrences in all messages.  This ratio
is then clipped to the interval [
.BR low_freq_limit ,
.B high_freq_limit
], so that words that are extremely rare or extremely common in spam
do not bias the probability computation too much.  The default values
of 0.01 and 0.99 are adequate for a corpus of a few thousand e-mails.
For larger corpora (e.g. 10000 e-mails), the values 0.001 and 0.999
may give better results.
.TP
.B min_meaningful_words
(type
.IR integer ,
default value
.BR 5 )
.br
Minimum number of "meaningful" words below which 
.B spamoracle mark
refuses to classify the e-mail and reports an "unknown" status.  This
happens with very short e-mails, or e-mails that consist exclusively of
links and pictures.
.TP
.B good_mail_prob
(type
.IR float ,
default value
.BR 0.2 )
.br
Spam probability below which the e-mail is classified as non-spam.
.TP
.B spam_mail_prob
(type
.IR float ,
default value
.BR 0.8 )
.br
Spam probability above which the e-mail is classified as spam.
Messages whose probability falls between
.B good_mail_prob
and
.B spam_mail_prob
are classified as "unknown".
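.PP
As a sketch of how the tuning parameters fit together, the fragment
below applies the tighter clipping limits suggested above for larger
corpora and keeps the default classification thresholds.  The values
are illustrative only.
.PP
.nf
low_freq_limit = 0.001
high_freq_limit = 0.999
good_mail_prob = 0.2
spam_mail_prob = 0.8
.fi
.PP
With these limits, a word that occurs only in spam has a raw spam
frequency of 1, which is clipped to 0.999.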

.SH AUTHOR
Xavier Leroy <Xavier.Leroy@inria.fr>

.SH "SEE ALSO"

.BR spamoracle (1)

.B http://www.paulgraham.com/spam.html
(Paul Graham's seminal paper)