Linbot
Installing and using Linbot varies depending on the setup you have
on your system. Python 1.5 is required and may be downloaded freely from
[2]http://www.python.org/
_________________________________________________________________
Installing Linbot
Installation is relatively easy.
1. Unpack the gzipped tar archive into a directory. Recommended
directories are /usr/local/lib/linbot or ~/linbot. Be sure to add
this directory to your PYTHONPATH environment variable:
$ tar zxvf linbot-1.0b6.tar.gz -C /usr/local/lib
$ PYTHONPATH="/usr/local/lib/linbot:$PYTHONPATH"
$ export PYTHONPATH
2. Add a symbolic link to <file> in some place in your PATH, where
<file> is:
+ "linbot.py" if you have Python on your system
+ "linbot" if you are using the frozen Linux executable
Note: Starting with 0.8, there will be no frozen Linux executable
for Linbot until all major distributions have switched to libc6.
$ ln -s /usr/local/lib/linbot/linbot.py /usr/local/bin/linbot
or
$ ln -s /usr/local/lib/linbot/linbot /usr/local/bin/linbot
3. Edit the config.py file to suit your needs. Most of the defaults are
safe, and the important items can be overridden with command-line
flags. You may want to keep a copy of the original config.py just
in case. The config.py options are documented within the file.
_________________________________________________________________
Running Linbot
It is simple to run Linbot.
Executing Linbot without any command-line arguments will cause it to
give a simple synopsis of its usage and then quit:
$ linbot
linbot [-x regex]... [-y regex]... [-l url] [-b][-a][-o dir][-w sec] url [location]...
Before running Linbot on a site, you need to do a little
preparation.
One thing that Linbot needs is a directory in which to publish its
reports. It is recommended that you choose a directory that is empty.
Note that this directory must exist and be writable by Linbot.
$ mkdir /usr/local/httpd/htdocs/linbot
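The requirements above (the directory must exist, must be writable, and should preferably be empty) can be checked before a run. The following is only a minimal sketch in modern Python, not part of Linbot; the function name check_report_dir is made up for illustration.

```python
import os


def check_report_dir(path):
    """Return (ok, notes) for a prospective report directory.

    ok is False when reports could not be written to `path`.
    notes holds human-readable remarks. check_report_dir is a
    hypothetical helper, not something Linbot provides.
    """
    if not os.path.isdir(path):
        return False, ["%s does not exist or is not a directory" % path]
    # Creating files in a directory needs write and search permission.
    if not os.access(path, os.W_OK | os.X_OK):
        return False, ["%s is not writable by the current user" % path]
    notes = []
    if os.listdir(path):
        # Allowed, but an empty directory is recommended.
        notes.append("%s is not empty" % path)
    return True, notes
```

Calling check_report_dir("/usr/local/httpd/htdocs/linbot") right after the mkdir above should report that the directory is usable, provided the current user may write to it.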
The report can be viewed using most Web browsers. Browsers using
frames technology should initially open the "index.html" file.
Browsers not using frames or with frames disabled can initially open
the "navbar.html" file. Note these are the default filenames for
Linbot and may be changed via the config file.
Secondly, decide beforehand which structures on your site
should be considered "internal" and which should be considered
"external". Linbot defines internal and external links as follows:
An internal link is a part of your site that you have control of and
should be checked, as well as the links that it points to. Basically
an internal link is one that, if broken, you have the power to fix.
An external link is one that your site points to, but that you have no
jurisdiction over. It can also be a link that you may have power to
change, but that need not be checked for broken links, such as CGI scripts
or pages that were generated by an automated tool (such as Linbot or
any program that converts a document of one format to HTML).
Your base url is the url that is the top level of your web site.
Commonly referred to as the "home page", it is the url that points to
all other pages either directly or indirectly. A base url can be on
one server but may point to pages that are on another server but
should still be considered internal. An example would be a main server
www.someplaceonthenet.com in which there may be links to an alternate
or load balancing server called www2.someplaceonthenet.com. In this
example www2.someplaceonthenet.com would host internal links even
though your "home page" may be http://www.someplaceonthenet.com
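To make the host-based part of this distinction concrete, here is a rough sketch in modern Python. It is not Linbot's own code: the function is_internal and its parameters are invented for illustration, and it models only the rule that a url is internal when it lives on the base url's server or on one of the additional servers you name.

```python
from urllib.parse import urlsplit


def is_internal(url, base_url, locations=()):
    """Return True if `url` is on the base url's host or an extra location.

    `locations` holds additional server names (optionally "host:port")
    that should count as internal. Hypothetical helper for illustration.
    """
    internal_hosts = {urlsplit(base_url).netloc.lower()}
    internal_hosts.update(loc.lower() for loc in locations)
    return urlsplit(url).netloc.lower() in internal_hosts


base = "http://www.someplaceonthenet.com/"
page = "http://www2.someplaceonthenet.com/docs/index.html"

# Without naming www2, the page counts as external.
print(is_internal(page, base))
# Naming www2 as an additional location makes it internal.
print(is_internal(page, base, ["www2.someplaceonthenet.com"]))
```

The sketch compares host names only; the regular-expression and base-url rules controlled by the command-line switches are deliberately left out.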
That said, you should have a basic idea of what you do and do not want
Linbot to check. Don't be surprised if you don't get it exactly right
the first time. Also, consider using the robots.txt file/protocol;
Linbot honors this protocol, as do other web robots that may run
across your site. This protocol is useful to indicate to robots that
some parts of your site, such as CGI scripts, internal documents, or
server stats, should not be explored. The robots.txt protocol is
explained at
[3]http://info.webcrawler.com/mak/projects/robots/exclusion-admin.html
Currently Linbot identifies itself as User-Agent: Linbot.
You can, for example, allow Linbot to search your site but restrict
other robots like this:
User-agent: *
Disallow: /

User-agent: Linbot
Allow: /
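If you want to see how a robot would interpret rules like these, one option is to feed them to a robots.txt parser. The sketch below uses the urllib.robotparser module from the modern Python standard library, which is an assumption of this example and not the code Linbot itself uses; the agent name SomeOtherBot is made up.

```python
from urllib.robotparser import RobotFileParser

# The same rules as above, with a blank line between the two records.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Linbot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://www.someplaceonthenet.com/index.html"
# Linbot has its own record, so it is judged by "Allow: /".
linbot_allowed = parser.can_fetch("Linbot", url)
# Any other agent falls back to the "*" record and is denied.
other_allowed = parser.can_fetch("SomeOtherBot", url)
print(linbot_allowed, other_allowed)
```

Other robots may interpret the Allow line differently, so treat the result as a sanity check rather than a guarantee.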
Okay, you've heard enough and you want to run the darn thing. The
simplest way to run Linbot is:
$ linbot http://www.someplaceonthenet.com/
This will first read the robots.txt file at www.someplaceonthenet.com
and then proceed to examine every link pointed to on that site, except
links denied by robots.txt, if that file exists.
The exact usage for Linbot is explained below:
_________________________________________________________________
SYNOPSIS
linbot [-x regex]... [-y regex]... [-l url] [-b][-a][-o dir][-w sec]
url [location[:port]]...
-x regex
Use this option to tell Linbot to consider any url matching
<regex> to be external. This option can be used multiple
times.
-y regex
Like the -x switch, except that this option causes Linbot not to
check the link at all, whereas -x checks the link, but not
its children.
-l url
Use URL for the logo image on all reports. The URL should point
to a valid image.
-b
Base URLs only. Tells Linbot to consider any url that does not
start with the base url to be external. For example, if you run
'linbot -b http://www.someplaceonthenet.com/~someuser/' then
http://www.someplaceonthenet.com/~someuser/misc/index.html will
be considered internal whereas
http://www.someplaceonthenet.com/ would be considered external.
-a
Avoid external links. Normally, if Linbot is examining an HTML
page and it finds a link that points to an external document,
Linbot will not examine the external document. However, it will
check to see if that document exists, since you may not want to
point to broken links whether internal or external. However,
sometimes this default behavior may not be desirable. If the -a
option is chosen, Linbot will not check for the existence of
external links.
-o dir
Output Directory. Used to specify the directory where Linbot
will dump its report files. The default is the current
directory or as specified in config.py.
-w sec
Wait sec seconds. Usually, Linbot will process a URL and
immediately move on to the next one. However, on some loaded
systems, it may be more desirable to have Linbot wait a while
between requests. This option should be set to any non-negative
number (in seconds).
url
The base url. Linbot checks this link first, then all the links
it points to on down the "tree".
location
This specifies that urls pointing to <location> are to be considered
internal. This can be useful, for example, if the base url is
on one server but points to "internal" documents on another
server. location is the name of that server, for example
www2.someplaceonthenet.com. This can also be used, for example,
if you have an intranet where some urls may point to
http://www.someplaceonthenet.com whereas some urls may point to
just 'www'. This option may be used more than once, but must
follow the base url.
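The way the -x, -y and -b switches described above sort urls can be pictured with a short sketch in modern Python. This is not Linbot's implementation: the function classify is invented for illustration, and both the use of re.search (rather than matching at the start of the url) and the order in which the switches are consulted are assumptions.

```python
import re


def classify(url, base_url, external=(), skip=(), base_only=False):
    """Return 'skip', 'external' or 'internal' for `url`.

    external  -- regex strings, as would be given with -x
    skip      -- regex strings, as would be given with -y
    base_only -- True models the -b switch
    Hypothetical helper; host-based checks are left out.
    """
    if any(re.search(pattern, url) for pattern in skip):
        return "skip"        # -y: the link is not checked at all
    if any(re.search(pattern, url) for pattern in external):
        return "external"    # -x: the link is checked, its children are not
    if base_only and not url.startswith(base_url):
        return "external"    # -b: anything outside the base url is external
    return "internal"


base = "http://www.someplaceonthenet.com/~someuser/"
inside = "http://www.someplaceonthenet.com/~someuser/misc/index.html"
outside = "http://www.someplaceonthenet.com/"

print(classify(inside, base, base_only=True))
print(classify(outside, base, base_only=True))
```

With -b in effect, the first url is treated as internal and the second as external, which matches the example given for the -b switch.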
The switches (and other options) can be changed in the config.py file.
It is recommended that you look at (and edit) this file.
_________________________________________________________________
Examples
Here are some examples of running Linbot.
$ linbot -x /linbot \
http://manson.ddns.org/ starship.skyport.net
$ linbot -o /stats/altavista/ \
http://altavista.digital.com/
$ linbot -o ~/Lang/Python/linbot \
-b http://starship.skyport.net/crew/marduk/ manson
_________________________________________________________________
Running Periodically
Linbot may be safely run unattended, either periodically or during
off-peak hours, using cron or at. You may want to redirect
Linbot's output to a null device or a log file, or have it emailed to an
account. Consult your operating system manuals on how this can be done
on your system.
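If you would rather handle the redirection in a small wrapper script than in the shell, something along these lines may serve as a starting point. It is a sketch in modern Python and not part of Linbot; the function name run_and_log, the log path and the command line in the comment are all assumptions.

```python
import subprocess


def run_and_log(argv, log_path):
    """Run `argv`, append its stdout and stderr to `log_path`.

    Returns the command's exit status unchanged. What status linbot
    itself returns is not documented here, so none is assumed.
    """
    with open(log_path, "a") as log:
        return subprocess.call(argv, stdout=log, stderr=subprocess.STDOUT)


# Example call (hypothetical paths and url, adjust to your installation):
# run_and_log(
#     ["linbot", "-o", "/usr/local/httpd/htdocs/linbot",
#      "http://www.someplaceonthenet.com/"],
#     "/var/log/linbot.log",
# )
```

A script containing such a call can then be scheduled with cron or at like any other command.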
_________________________________________________________________
Questions/Bug Reports
If you have any questions about Linbot or would like to report a bug,
send electronic mail to the [4]mailing list. You should also check the
[5]archives to make sure that the bug was not already reported. In
order to assist in tracking down bugs, please include either a URL
where the problem can be found, an HTML file where the error occurs or
a (small) tar file of a site where the error occurs. Suggestions for
improvements are also welcomed. Do not send email to marduk directly
concerning bug reports!
References
2. http://www.python.org/
3. http://info.webcrawler.com/mak/projects/robots/exclusion-admin.html
4. http://starship.skyport.net/crew/marduk/linbot/mail.html
5. http://www.findmail.com/list/linbot/