Thursday, October 29, 2009

python urllib2


I'm trying to write some code that fetches the HTML page at a given URL, extracts some text or a description from the headers or body of the page, and grabs some images to use as a thumbnail for the link itself.

Initially I used these simple steps:

import urllib2

req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()

It works. However, after several tests, it failed on this
URL: http://en.wikipedia.org/wiki/Sloth
and the Python CLI gave me this error:

File "/usr/lib/python2.6/urllib2.py", line 510, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

403: Forbidden
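
To look at the status code without the whole traceback, the call can be wrapped in a try/except. This is only a minimal sketch: the helper name fetch_or_status is made up for illustration, and the Python 3 fallback import is my assumption so the snippet also runs on newer interpreters.

```python
try:
    import urllib2                      # Python 2, as used in this post
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 fallback


def fetch_or_status(url):
    """Hypothetical helper: return (status_code, body).

    body is None when the server answers with an HTTP error such as 403.
    """
    try:
        response = urllib2.urlopen(urllib2.Request(url))
        return response.getcode(), response.read()
    except urllib2.HTTPError as err:
        # err.code holds the numeric status, e.g. 403
        return err.code, None
```

Calling fetch_or_status('http://en.wikipedia.org/wiki/Sloth') with the default headers should then give back the error code instead of raising.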

After googling around for a few minutes, I found this statement in the Python docs:

Some websites dislike being browsed by programs, or send different versions to different browsers [3]. By default urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header.
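
The default identification described in that quote can be checked locally, without hitting any site. A small sketch, assuming the default opener's addheaders list carries the User-agent entry (the function name default_user_agent is mine, and the Python 3 fallback import is an assumption):

```python
try:
    import urllib2                      # Python 2, as used in this post
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 fallback


def default_user_agent():
    """Hypothetical helper: return the User-agent string a fresh opener sends."""
    opener = urllib2.build_opener()
    # addheaders is a list of (name, value) tuples added to every request
    return dict(opener.addheaders).get('User-agent')


print(default_user_agent())
```

The exact version number printed depends on the interpreter in use.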

So all I did was add a User-Agent header, and voilà, it works.

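# the URL that returned 403 with the default User-Agent
url = 'http://en.wikipedia.org/wiki/Sloth'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
req = urllib2.Request(url, None, headers)  # data=None keeps it a GET request
response = urllib2.urlopen(req)
doc = response.read()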
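
Since I also want to fetch images later, it may be handier to set the header once instead of on every Request. One possible way, sketched here and not necessarily the best one, is an opener with its addheaders replaced. The name make_opener is made up for illustration, and the Python 3 fallback import is an assumption.

```python
try:
    import urllib2                      # Python 2, as used in this post
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 fallback

USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'


def make_opener(user_agent=USER_AGENT):
    """Hypothetical helper: build an opener that sends user_agent every time."""
    opener = urllib2.build_opener()
    # replace the default Python-urllib/x.y entry with our own header
    opener.addheaders = [('User-Agent', user_agent)]
    return opener
```

Usage would then look like opener = make_opener() followed by doc = opener.open(url).read() for each page or image URL.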