URLGRABBER(1)
=============
NAME
----
urlgrabber - a high-level cross-protocol url-grabber.
SYNOPSIS
--------
'urlgrabber' [OPTIONS] URL [FILE]
DESCRIPTION
-----------
urlgrabber is both a binary program and a Python module for fetching files. It
is designed for use in programs that need common (but not necessarily simple)
URL-fetching features.
OPTIONS
-------
--help, -h::
print a help page listing the options available to the binary program.
--copy-local::
ignored except for file:// urls, in which case
it specifies whether urlgrab should still make
a copy of the file, or simply point to the
existing copy.
--throttle=NUMBER::
if it's an int, it's the bytes/second throttle
limit. If it's a float, it is first multiplied
by bandwidth. If throttle == 0, throttling is
disabled. If None, the module-level default
(which can be set with set_throttle) is used.
--bandwidth=NUMBER::
the nominal max bandwidth in bytes/second. If
throttle is a float and bandwidth == 0,
throttling is disabled. If None, the
module-level default (which can be set with
set_bandwidth) is used.
--range=RANGE::
a tuple of the form first_byte,last_byte
describing a byte range to retrieve. Either or
both of the values may be specified. If
first_byte is None, byte offset 0 is assumed.
If last_byte is None, the last byte available
is assumed. Note that both first and last_byte
values are inclusive so a range of (10,11)
would return the 10th and 11th bytes of the
resource.
--user-agent=STR::
the user-agent string to provide if the URL is HTTP.
--retry=NUMBER::
the number of times to retry the grab before
bailing. If this is zero, it will retry
forever. This was intentional... really, it was
:). If this value is not supplied, or is supplied
but is None, retrying does not occur.
--retrycodes::
a sequence of errorcodes (values of e.errno) for
which it should retry. See the doc on
URLGrabError for more details on this. retrycodes
defaults to -1,2,4,5,6,7 if not specified
explicitly.
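The throttle, range, and retry rules above can be sketched in plain Python. This is only an illustration of the documented semantics, not urlgrabber's own code. The function names `effective_rate`, `slice_inclusive`, and `call_with_retry` are hypothetical, `OSError` stands in for URLGrabError, None defaults are assumed to be resolved already, and `retry` is read here as the total number of attempts, which is one possible reading of the text.

```python
def effective_rate(throttle, bandwidth):
    """Return the bytes/second cap, or None when throttling is disabled."""
    if throttle == 0:
        return None                      # throttle == 0 disables throttling
    if isinstance(throttle, float):
        if bandwidth == 0:
            return None                  # float throttle with no bandwidth
        return throttle * bandwidth      # float is a fraction of bandwidth
    return throttle                      # int is already bytes/second


def slice_inclusive(data, first_byte=None, last_byte=None):
    """Return the bytes of data from first_byte to last_byte, both inclusive."""
    start = 0 if first_byte is None else first_byte
    stop = len(data) if last_byte is None else last_byte + 1
    return data[start:stop]


def call_with_retry(func, retry=None, retrycodes=(-1, 2, 4, 5, 6, 7)):
    """Call func() again after errors whose errno is listed in retrycodes.

    retry=None never retries; retry=0 retries forever (assumed reading).
    """
    tries = 0
    while True:
        tries += 1
        try:
            return func()
        except OSError as exc:           # stand-in for URLGrabError
            if retry is None or exc.errno not in retrycodes or tries == retry:
                raise
```

For instance, `slice_inclusive(data, 10, 11)` returns the two bytes at offsets 10 and 11, and `effective_rate(0.5, 10000)` yields a cap of half the nominal bandwidth.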
MODULE USE EXAMPLES
-------------------
In its simplest form, urlgrabber can be a replacement for urllib2's
open, or even python's file if you're just reading:
..................................
from urlgrabber import urlopen
fo = urlopen(url)
data = fo.read()
fo.close()
..................................
Here, the url can be http, https, ftp, or file. It's also pretty smart
so if you just give it something like /tmp/foo, it will
figure it out. For even more fun, you can also do:
..................................
from urlgrabber import urlgrab, urlread
local_filename = urlgrab(url) # grab a local copy of the file
data = urlread(url) # just read the data into a string
..................................
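How a bare path such as /tmp/foo might be treated as a file URL can be sketched with the standard library alone. The helper name `to_url` and the `KNOWN_SCHEMES` tuple are hypothetical, and this is not necessarily how urlgrabber handles it internally.

```python
from pathlib import Path
from urllib.parse import urlparse

# Schemes named in the text above.
KNOWN_SCHEMES = ("http", "https", "ftp", "file")


def to_url(url_or_path):
    """Return the input unchanged if it has a known scheme, else a file:// URL."""
    if urlparse(url_or_path).scheme in KNOWN_SCHEMES:
        return url_or_path
    # No recognized scheme: treat the input as a local filesystem path.
    return Path(url_or_path).absolute().as_uri()


print(to_url("http://example.com/a.txt"))
print(to_url("/tmp/foo"))
```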
Now, like urllib2, what's really happening here is that you're using a
module-level object (called a grabber) that kind of serves as a
default. That's just fine, but you might want to get your own private
version for a couple of reasons:
..................................
* it's a little ugly to modify the default grabber because you have to
reach into the module to do it
* you could run into conflicts if different parts of the code
modify the default grabber and therefore expect different
behavior
..................................
Therefore, you're probably better off making your own. This also gives
you lots of flexibility for later, as you'll see:
..................................
from urlgrabber.grabber import URLGrabber
g = URLGrabber()
data = g.urlread(url)
..................................
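The conflict described above, where different parts of a program modify one shared default, can be illustrated without urlgrabber itself. `SketchGrabber` is a hypothetical stand-in that only stores options; it is not the real URLGrabber class.

```python
class SketchGrabber:
    """Hypothetical stand-in for a grabber; it only stores its options."""

    def __init__(self, **options):
        self.options = dict(options)


# A single module-level default shared by every caller.
default_grabber = SketchGrabber(retry=None)


def enable_retries():
    # One part of the program reaches in and changes the shared default.
    default_grabber.options["retry"] = 3


def retry_seen_by_other_code():
    # Another part of the program expected retry=None but sees the change.
    return default_grabber.options["retry"]


enable_retries()
print(retry_seen_by_other_code())  # 3

# A private grabber is unaffected by changes to the shared default.
private = SketchGrabber(retry=None)
print(private.options["retry"])  # None
```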
This is nice because you can specify options when you create the
grabber. For example, let's turn on simple reget mode so that if we
have part of a file, we only need to fetch the rest:
..................................
from urlgrabber.grabber import URLGrabber
g = URLGrabber(reget='simple')
local_filename = g.urlgrab(url)
..................................
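The idea behind fetching only the rest of a partly downloaded file can be sketched as follows. This is an assumption-laden illustration, not urlgrabber's implementation: `simple_reget` is a hypothetical helper, and the remote resource is modeled as an in-memory bytes object rather than a real URL.

```python
import os
import tempfile


def simple_reget(source, filename):
    """Append to filename only the bytes of source that it does not have yet."""
    have = os.path.getsize(filename) if os.path.exists(filename) else 0
    with open(filename, "ab") as fo:
        fo.write(source[have:])          # fetch from offset `have` onward
    return filename


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "partial.bin")
    with open(path, "wb") as fo:
        fo.write(b"hello")               # pretend an earlier grab was cut off
    simple_reget(b"hello world", path)
    with open(path, "rb") as fo:
        print(fo.read())                 # b'hello world'
```

Note that this sketch trusts the existing partial file; it makes no attempt to check that the local bytes still match the source.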
The available options are listed in the module documentation, and can
usually be specified as a default at the grabber-level or as options
to the method:
..................................
from urlgrabber.grabber import URLGrabber
g = URLGrabber(reget='simple')
local_filename = g.urlgrab(url, filename=None, reget=None)
..................................
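One way to picture grabber-level defaults combined with per-call options is a layered lookup. The sketch below assumes that a keyword passed to the method, even None, takes precedence over the grabber-level default for that call only. `SketchGrabber` and `effective_options` are hypothetical names, not part of urlgrabber.

```python
from collections import ChainMap


class SketchGrabber:
    """Hypothetical stand-in showing option precedence only."""

    def __init__(self, **defaults):
        self.defaults = defaults

    def effective_options(self, **overrides):
        # Per-call overrides are consulted first, then grabber-level defaults.
        return dict(ChainMap(overrides, self.defaults))


g = SketchGrabber(reget='simple', retry=3)
print(g.effective_options())            # grabber-level defaults apply
print(g.effective_options(reget=None))  # per-call value wins for this call
print(g.defaults)                       # defaults themselves are unchanged
```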
AUTHORS
-------
Written by:
Michael D. Stenner <mstenner@linux.duke.edu>
Ryan Tomayko <rtomayko@naeblis.cx>
This manual page was written by Kevin Coyner <kevin@rustybear.com> for
the Debian system (but may be used by others). It borrows heavily from
the documentation included in the urlgrabber module. Permission is granted
to copy, distribute and/or modify this document under the terms of
the GNU General Public License, Version 2 or any later version published
by the Free Software Foundation.
RESOURCES
---------
Main web site: http://linux.duke.edu/projects/urlgrabber/[]