File: INSTALL

   
                                   Linbot
                                      
   Installing and using Linbot varies depending on which setup you have
   on your system. Python 1.5 is required and may be downloaded freely at
   [2]http://www.python.org/
     _________________________________________________________________
   
Installing Linbot

   Installation is relatively easy.
    1. Unpack the gzipped tar archive into a directory. Recommended
       directories are /usr/local/lib/linbot or ~/linbot. Be sure to add
       this directory to your PYTHONPATH environment variable:
$ tar zxvf linbot-1.0b6.tar.gz -C /usr/local/lib
$ PYTHONPATH="/usr/local/lib/linbot:$PYTHONPATH"
$ export PYTHONPATH
    2. Add a symbolic link in some place in your PATH that points to one
       of the following files:
          + "linbot.py" if you have Python on your system
          + "linbot" if you are using the frozen Linux executable
       Note: Starting with 0.8, there will be no frozen Linux executable
       for Linbot until all major distributions have switched to libc6.
$ ln -s /usr/local/lib/linbot/linbot.py /usr/local/bin/linbot
                                       or
$ ln -s /usr/local/lib/linbot/linbot /usr/local/bin/linbot
    3. Edit the config.py file to suit your needs. Most of the defaults
       are safe, and the important items can be overridden with
       command-line flags. You may want to keep a copy of the original
       config.py just in case. The config.py options are documented
       within the file.
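
   As a quick check that step 1 took effect, the PYTHONPATH test can be
   sketched in Python. This is only an illustration written for a
   current Python 3 (not the Python 1.5 that Linbot targets); the helper
   name dir_on_pythonpath is hypothetical and not part of Linbot.

```python
import os


def dir_on_pythonpath(directory, pythonpath):
    """Return True if `directory` is one of the entries of a PYTHONPATH string."""
    wanted = os.path.normpath(directory)
    # PYTHONPATH entries are separated by os.pathsep (":" on Unix).
    entries = [p for p in pythonpath.split(os.pathsep) if p]
    return any(os.path.normpath(p) == wanted for p in entries)


# Check the recommended install directory against the current environment.
install_dir = "/usr/local/lib/linbot"
print(install_dir, "on PYTHONPATH:",
      dir_on_pythonpath(install_dir, os.environ.get("PYTHONPATH", "")))
```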
     _________________________________________________________________
   
Running Linbot

   It is simple to run Linbot.
   
   Executing Linbot without any command-line arguments will cause it to
   give a simple synopsis of its usage and then quit:
$ linbot
linbot [-x regex]... [-y regex]... [-l url] [-b][-a][-o dir][-w sec] url [location]...

   Before running Linbot on a site, you need to do a little
   preparation.
   
   One thing that Linbot needs is a directory in which to publish its
   reports. It is recommended that you choose a directory that is empty.
   Note that this directory must exist and be writable by Linbot.
$ mkdir /usr/local/httpd/htdocs/linbot
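
   The requirements on the report directory can be verified before a
   run. The following is a minimal sketch for a current Python 3; the
   function report_dir_problems is hypothetical and is not something
   Linbot provides.

```python
import os


def report_dir_problems(path):
    """Return a list of problems found with a prospective report directory."""
    problems = []
    if not os.path.isdir(path):
        # Linbot needs the directory to exist already.
        problems.append("not an existing directory")
        return problems
    if not os.access(path, os.W_OK):
        problems.append("not writable by the current user")
    if os.listdir(path):
        # Not an error, only a departure from the recommendation.
        problems.append("not empty (an empty directory is recommended)")
    return problems


print(report_dir_problems("/usr/local/httpd/htdocs/linbot"))
```

   An empty list means the directory exists, is writable, and is empty.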

   The report can be viewed using most Web browsers. Browsers using
   frames technology should initially open the "index.html" file.
   Browsers not using frames or with frames disabled can initially open
   the "navbar.html" file. Note these are the default filenames for
   Linbot and may be changed via the config file.
   
   Second, decide beforehand which structures on your site should be
   considered "internal" and which should be considered "external".
   Linbot defines internal and external links as follows:
   
   An internal link is a part of your site that you have control of and
   should be checked, as well as the links that it points to. Basically
   an internal link is one that, if broken, you have the power to fix.
   
   An external link is one that your site points to, but that you have
   no jurisdiction over. It can also be a link that you have the power
   to change but that need not be checked for broken links, such as CGI
   scripts or pages that were generated by an automated tool (such as
   Linbot or any program that converts a document of one format to
   HTML).
   
   Your base url is the url at the top level of your web site. Commonly
   referred to as the "home page", it is the url that points to all
   other pages either directly or indirectly. A base url can be on one
   server and point to pages on another server that should still be
   considered internal. An example would be a main server
   www.someplaceonthenet.com that links to an alternate or load
   balancing server called www2.someplaceonthenet.com. In this example
   www2.someplaceonthenet.com would host internal links even though
   your "home page" is http://www.someplaceonthenet.com/.
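
   The host-based part of this distinction can be illustrated with a
   short Python 3 sketch. It is not Linbot's actual code: the function
   host_is_internal is hypothetical, it compares host names only, and it
   ignores ports.

```python
from urllib.parse import urlsplit


def host_is_internal(url, base_url, locations=()):
    """Treat a url as internal if its host is the base url's host
    or one of the additional server names in `locations`."""
    host = (urlsplit(url).hostname or "").lower()
    internal_hosts = {(urlsplit(base_url).hostname or "").lower()}
    internal_hosts.update(name.lower() for name in locations)
    return host in internal_hosts


base = "http://www.someplaceonthenet.com/"
page = "http://www2.someplaceonthenet.com/docs/index.html"
print(host_is_internal(page, base))
print(host_is_internal(page, base, ["www2.someplaceonthenet.com"]))
```

   The second call adds www2.someplaceonthenet.com as an extra internal
   server, which is the situation described above.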
   
   That said, you should have a basic idea of what you do and do not want
   Linbot to check. Don't be surprised if you don't get it exactly right
   the first time. Also, consider using the robots.txt file/protocol:
   Linbot honors it, as do other web robots that may run across your
   site. This protocol is useful to indicate to robots that some parts
   of your site, such as CGI scripts, internal documents, or server
   stats, should not be explored. The robots.txt protocol is explained
   at
   [3]http://info.webcrawler.com/mak/projects/robots/exclusion-admin.html
   Currently Linbot identifies itself as User-Agent: Linbot.
   
   You can allow Linbot to search a directory but restrict other bots,
   for example, like this:
   User-agent: *
   Disallow: /

   User-agent: Linbot
   Allow: /
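
   The effect of these rules can be checked with Python's standard
   urllib.robotparser module. This is a sketch for a current Python 3;
   it shows how that module reads the rules, which is not necessarily
   how Linbot's own parser behaves. The agent name SomeOtherBot is made
   up for the example.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Linbot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://www.someplaceonthenet.com/index.html"
linbot_allowed = parser.can_fetch("Linbot", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)
print("Linbot allowed:", linbot_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

   With these rules one would expect Linbot to be allowed and the other
   agent to be refused. Note that Allow was not part of the original
   exclusion standard, so not every robot will understand it.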

   Okay, you've heard enough and you want to run the darn thing. The
   simplest way to run Linbot is:
$ linbot http://www.someplaceonthenet.com/

   This will first read the robots.txt file at www.someplaceonthenet.com
   and then proceed to examine every link pointed to on that site, except
   links denied by robots.txt, if that file exists.
   
   The exact usage for Linbot is explained below:
     _________________________________________________________________
   
SYNOPSIS

   linbot [-x regex]... [-y regex]... [-l url] [-b][-a][-o dir][-w sec]
   url [location[:port]]... 
   
   -x regex
          Use this option to tell Linbot to consider any url matching
          regex to be external. This option can be used multiple times.
          
   -y regex
          Like the -x switch, except that this option causes Linbot not
          to check the link at all, whereas -x checks the link but not
          its children.
          
   -l url
          Use URL for the logo image on all reports. The URL should point
          to a valid image.
          
   -b
          Base URLs only. Tells Linbot to consider any url that does not
          start with the base url to be external. For example, if you run
          'linbot -b http://www.someplaceonthenet.com/~someuser/' then
          http://www.someplaceonthenet.com/~someuser/misc/index.html will
          be considered internal whereas
          http://www.someplaceonthenet.com/ would be considered external.
          
   -a
          Avoid external links. Normally, if Linbot is examining an HTML
          page and it finds a link that points to an external document,
          Linbot will not examine the external document. It will still
          check that the document exists, since you may not want to
          point to broken links whether internal or external. Sometimes
          this default behavior is not desirable. If the -a option is
          chosen, Linbot will not check for the existence of external
          links.
          
   -o dir
          Output directory. Specifies the directory where Linbot will
          write its report files. The default is the current directory,
          or the directory specified in config.py.
          
   -w sec
          Wait sec seconds. Usually, Linbot will process a URL and
          immediately move on to the next one. However, on some loaded
          systems, it may be more desirable to have Linbot wait a while
          between requests. This option should be set to any non-negative
          number (in seconds).
          
   url
          The base url. Linbot checks this link first, then all the links
          it points to on down the "tree".
          
   location
          This specifies that urls located at location are to be
          considered internal. This can be useful, for example, if the
          base url is on one server but points to "internal" documents
          on another server. location is the name of that server, for
          example www2.someplaceonthenet.com. This can also be used, for
          example, if you have an intranet where some urls may point to
          http://www.someplaceonthenet.com whereas some urls may point to
          just 'www'. This option may be used more than once, but must
          follow the base url.
          
   The switches (and other options) can be changed in the config.py file.
   It is recommended that you look at (and edit) this file.
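
   How the -x, -y and -b options interact can be pictured with the
   following Python 3 sketch. It is an illustration only, not Linbot's
   implementation: the function classify is hypothetical, the order in
   which the options are applied (-y, then -x, then -b) is an
   assumption, and so is the use of re.search rather than re.match.
   Host-based checks are left out, so the url is assumed to be on an
   internal server already.

```python
import re


def classify(url, base_url, external=(), skip=(), base_only=False):
    """Return 'skip', 'external' or 'internal' for a url.

    external  -- regexes given with -x (check the link, not its children)
    skip      -- regexes given with -y (do not check the link at all)
    base_only -- True if -b was given
    """
    if any(re.search(pattern, url) for pattern in skip):
        return "skip"
    if any(re.search(pattern, url) for pattern in external):
        return "external"
    if base_only and not url.startswith(base_url):
        return "external"
    return "internal"


base = "http://www.someplaceonthenet.com/~someuser/"
print(classify(base + "misc/index.html", base, base_only=True))
print(classify("http://www.someplaceonthenet.com/", base, base_only=True))
```

   The two calls reproduce the -b example given above: the page below
   the base url is treated as internal, the server's top-level page as
   external.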
     _________________________________________________________________
   
Examples

   Here are some examples of running Linbot.
$ linbot http://manson.ddns.org/ \
  -x /linbot starship.skyport.net

$ linbot -o /stats/altavista/ \
  http://altavista.digital.com/

$ linbot -o ~/Lang/Python/linbot \
  -b http://starship.skyport.net/crew/marduk/ manson
     _________________________________________________________________
   
Running Periodically

   Linbot may be safely run unattended, periodically or during off-peak
   hours, using cron or at. You may want to redirect Linbot's output to
   a null device or a log file, or have it emailed to an account.
   Consult your operating system manuals on how this can be done on
   your system.
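
   One way to capture Linbot's output in a log file is a small wrapper
   script that cron or at can call. The following is a sketch for a
   current Python 3; the helper names linbot_command and run_and_log are
   hypothetical, and only options listed in the synopsis are used.

```python
import subprocess
import sys


def linbot_command(base_url, output_dir, wait=None):
    """Build the argument list for a linbot run (-o dir, optional -w sec)."""
    command = ["linbot", "-o", output_dir]
    if wait is not None:
        command += ["-w", str(wait)]
    command.append(base_url)
    return command


def run_and_log(command, log_path):
    """Run `command`, append its stdout and stderr to `log_path`,
    and return the exit code."""
    with open(log_path, "a") as log:
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


print(linbot_command("http://www.someplaceonthenet.com/",
                     "/usr/local/httpd/htdocs/linbot", wait=2))
```

   A scheduled job could then call run_and_log with the command built
   above and a log file of your choosing.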
     _________________________________________________________________
   
Questions/Bug Reports

   If you have any questions about Linbot or would like to report a bug,
   send electronic mail to the [4]mailing list. You should also check the
   [5]archives to make sure that the bug was not already reported. To
   assist in tracking down bugs, please include a URL where the problem
   can be found, an HTML file where the error occurs, or a (small) tar
   file of a site where the error occurs. Suggestions for improvements
   are also welcome. Do not send email to marduk directly concerning bug
   reports!

References

   2. http://www.python.org/
   3. http://info.webcrawler.com/mak/projects/robots/exclusion-admin.html
   4. http://starship.skyport.net/crew/marduk/linbot/mail.html
   5. http://www.findmail.com/list/linbot/