I'm trying to write code that fetches the HTML page of a given URL, extracts some text or a description from the header or body of the page, and picks an image to use as a thumbnail for the link itself.
Initially I used these simple steps:
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
It works, but after several tests it failed on this URL:
http://en.wikipedia.org/wiki/Sloth
The error given to me in the Python CLI:
File "/usr/lib/python2.6/urllib2.py", line 510, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
403: Forbidden
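The failure can be caught instead of letting it stop the script. Here is a minimal sketch, assuming Python 3, where urllib2 was split into urllib.request and urllib.error. The helper name fetch_status and its opener parameter are hypothetical; they are there only so the function can be tried without network access.

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen):
    """Return (status_code, body_bytes); body is None if the server refuses.

    `opener` is a hypothetical injection point, not part of urllib's API.
    """
    try:
        response = opener(urllib.request.Request(url))
    except urllib.error.HTTPError as err:
        # err.code carries the HTTP status, e.g. 403 for "Forbidden".
        return err.code, None
    return response.getcode(), response.read()
```

Calling fetch_status('http://en.wikipedia.org/wiki/Sloth') would then hand back the status code rather than a traceback whenever the server answers with an HTTP error.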
After a few minutes of googling, I found this statement in the Python docs:
Some websites dislike being browsed by programs, or send different versions to different browsers. By default urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header.
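To see what the library announces by default, the opener's default headers can be inspected. This is a small sketch, assuming Python 3's urllib.request (the successor to urllib2); the function name default_user_agent is made up for illustration.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent value urllib sends when none is set explicitly."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs added to every request.
    for name, value in opener.addheaders:
        if name.lower() == "user-agent":
            return value
    return None


print(default_user_agent())
```

The printed value has the Python-urllib/x.y form described in the quote, with x.y matching the running interpreter.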
So what I did was just add the User-Agent header, and voila, it works:
import urllib2
url = 'http://en.wikipedia.org/wiki/Sloth'
# pretend to be a browser instead of Python-urllib/x.y
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
req = urllib2.Request(url, None, headers)
response = urllib2.urlopen(req)
doc = response.read()
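With the page downloaded, the extraction step mentioned at the start could look roughly like the sketch below. It assumes Python 3 and only the standard library's html.parser. The names LinkPreviewParser and link_preview are hypothetical, and taking the first img tag as the thumbnail is an arbitrary choice for illustration, not a recommendation.

```python
from html.parser import HTMLParser


class LinkPreviewParser(HTMLParser):
    """Collect the <title> text, the meta description, and the first <img> src."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.thumbnail = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag == "img" and self.thumbnail is None:
            # Assumption: the first image on the page is good enough.
            self.thumbnail = attrs.get("src")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def link_preview(html_text):
    """Return a dict with the title, description, and thumbnail URL (or None)."""
    parser = LinkPreviewParser()
    parser.feed(html_text)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "thumbnail": parser.thumbnail,
    }
```

In Python 3, response.read() returns bytes, so the page would need decoding first, for example link_preview(doc.decode('utf-8', 'replace')). The thumbnail may be a relative URL, which urllib.parse.urljoin can resolve against the page address.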