Setting cookies, headers, and proxies for requests crawlers

requests is a wonderfully simple Python HTTP library that works great for writing crawlers. In its own words: Requests: HTTP for Humans (see the official documentation).
The implication being that other HTTP libraries aren't built for humans... Anyway, the point is that it's just really pleasant to use.
The simplest crawler written with it looks like this:

import requests
url = "http://lewism.net"
response = requests.get(url)
print(response.content)  # Python 3 print; response.content holds the raw page source
If it runs successfully, you'll see the source of my homepage. It's that simple: four lines of code and my homepage is scraped.
That's not the point of this post, though.
When running a crawler, you always have to deal with a site's anti-scraping measures. As a library built for humans, requests naturally provides ways to handle them.
Here I'll note down how requests is used in these situations:

Custom Headers

>>> url = 'https://api.github.com/some/endpoint'
>>> headers = {'user-agent': 'my-app/0.0.1'}
>>> r = requests.get(url, headers=headers)
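
In actual crawling, the header that matters most is usually User-Agent: by default requests identifies itself as python-requests, which some sites refuse. Below is a minimal sketch that sends a browser-like User-Agent instead; the target URL and the exact UA string are placeholders I picked, not from the docs:

import requests

url = "http://lewism.net"  # placeholder target
headers = {
    # pretend to be a desktop Chrome browser instead of python-requests
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36",
}
response = requests.get(url, headers=headers)
print(response.status_code)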

Cookies

If a response contains some Cookies, you can quickly access them:
>>> url = 'http://example.com/some/cookie/setting/url'
>>> r = requests.get(url)

>>> r.cookies['example_cookie_name']
'example_cookie_value'
To send your own cookies to the server, you can use the cookies parameter:
>>> url = 'http://httpbin.org/cookies'
>>> cookies = dict(cookies_are='working')

>>> r = requests.get(url, cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'

Proxies

If you need to use a proxy, you can configure individual requests with the proxies argument to any request method:
import requests

proxies = {
  'http': 'http://10.10.1.10:3128',
  'https': 'http://10.10.1.10:1080',
}

requests.get('http://example.org', proxies=proxies)
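
If the proxy requires authentication, the docs also allow HTTP Basic Auth credentials in the proxy URL using the user:password@host syntax. A quick sketch, where the credentials and proxy addresses are placeholders:

import requests

# placeholder credentials and proxy addresses -- substitute your own
proxies = {
    'http': 'http://user:pass@10.10.1.10:3128',
    'https': 'http://user:pass@10.10.1.10:1080',
}
requests.get('http://example.org', proxies=proxies)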
You can also configure proxies by setting the environment variables HTTP_PROXY and HTTPS_PROXY.
$ export HTTP_PROXY="http://10.10.1.10:3128"
$ export HTTPS_PROXY="http://10.10.1.10:1080"

$ python
>>> import requests
>>> requests.get('http://example.org')
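
To wrap up, here is a sketch (my own combination rather than a docs excerpt) that sets headers, cookies, and a proxy on a single request; the URL, cookie values, and proxy addresses are all placeholders:

import requests

url = "http://lewism.net"                      # placeholder target
headers = {"User-Agent": "my-app/0.0.1"}       # custom request headers
cookies = {"sessionid": "example-value"}       # cookies sent with the request
proxies = {
    "http": "http://10.10.1.10:3128",          # placeholder proxy addresses
    "https": "http://10.10.1.10:1080",
}

response = requests.get(url, headers=headers, cookies=cookies,
                        proxies=proxies, timeout=10)
print(response.status_code)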

That's it. (PS: the documentation excerpts above all come from the official docs; I'm jotting them down here because I find them the most commonly needed.)