Setting cookies, headers, and proxies for requests crawlers

requests is a wonderfully simple Python HTTP library that works great for writing crawlers. In its own words: Requests: HTTP for Humans (see the official documentation).
The implication being that other HTTP libraries aren't built for humans... Anyway, the point is that it's just really pleasant to use.
The simplest crawler written with it looks like this:

import requests
url = "http://lewism.net"
response = requests.get(url)
print(response.content)  # Python 3 print; response.content holds the raw page source
If it runs successfully, you'll see the source of my homepage. It's that simple: four lines of code and my homepage is scraped.
That's not the point of this post, though.
When running a crawler, you always have to deal with a site's anti-scraping measures. As a library built for humans, requests naturally provides ways to handle them.
Here I'll note down how requests is used in these situations:

Custom Headers

>>> url = 'https://api.github.com/some/endpoint'
>>> headers = {'user-agent': 'my-app/0.0.1'}
>>> r = requests.get(url, headers=headers)
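
In actual crawling, the header that matters most is usually User-Agent: by default requests identifies itself as python-requests, which some sites refuse. Below is a minimal sketch that sends a browser-like User-Agent instead; the target URL and the exact UA string are placeholders I picked, not from the docs:

import requests

url = "http://lewism.net"  # placeholder target
headers = {
    # pretend to be a desktop Chrome browser instead of python-requests
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36",
}
response = requests.get(url, headers=headers)
print(response.status_code)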

Cookies

If a response contains some Cookies, you can quickly access them:
>>> url = 'http://example.com/some/cookie/setting/url'
>>> r = requests.get(url)

>>> r.cookies['example_cookie_name']
'example_cookie_value'
To send your own cookies to the server, you can use the cookies parameter:
>>> url = 'http://httpbin.org/cookies'
>>> cookies = dict(cookies_are='working')

>>> r = requests.get(url, cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'

Proxies

If you need to use a proxy, you can configure individual requests with the proxies argument to any request method:
import requests

proxies = {
  'http': 'http://10.10.1.10:3128',
  'https': 'http://10.10.1.10:1080',
}

requests.get('http://example.org', proxies=proxies)
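
If the proxy requires authentication, the docs also allow HTTP Basic Auth credentials in the proxy URL using the user:password@host syntax. A quick sketch, where the credentials and proxy addresses are placeholders:

import requests

# placeholder credentials and proxy addresses -- substitute your own
proxies = {
    'http': 'http://user:pass@10.10.1.10:3128',
    'https': 'http://user:pass@10.10.1.10:1080',
}
requests.get('http://example.org', proxies=proxies)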
You can also configure proxies by setting the environment variables HTTP_PROXY and HTTPS_PROXY.
$ export HTTP_PROXY="http://10.10.1.10:3128"
$ export HTTPS_PROXY="http://10.10.1.10:1080"

$ python
>>> import requests
>>> requests.get('http://example.org')
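
To wrap up, here is a sketch (my own combination rather than a docs excerpt) that sets headers, cookies, and a proxy on a single request; the URL, cookie values, and proxy addresses are all placeholders:

import requests

url = "http://lewism.net"                      # placeholder target
headers = {"User-Agent": "my-app/0.0.1"}       # custom request headers
cookies = {"sessionid": "example-value"}       # cookies sent with the request
proxies = {
    "http": "http://10.10.1.10:3128",          # placeholder proxy addresses
    "https": "http://10.10.1.10:1080",
}

response = requests.get(url, headers=headers, cookies=cookies,
                        proxies=proxies, timeout=10)
print(response.status_code)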

That's it. (PS: the documentation excerpts above all come from the official docs; I'm jotting them down here because I find them the most commonly needed.)