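A Complete Guide to Advanced Usage of the Python Requests Library

1. Why Learn Requests

In my years of writing crawlers, Requests is the library I've used the most. It wraps complicated HTTP requests in a remarkably simple interface, so a few lines of code get the job done. But many people never get past basic calls like requests.get(), and get stuck as soon as they need to log in, keep a session alive, or retry automatically.

This article collects the experience I've built up in day-to-day projects and, starting from real use cases, explains how those "advanced" features are actually meant to be used.

Who this article is for

Developers who can already send simple requests with Requests and want to level up
Anyone who has run into login or Cookie problems in a crawler project
Engineers who need to write stable, reliable request code

2. Basics Review

Let's run through the basics quickly to make sure we're on the same page.

2.1 Sending a GET Request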

import requests

# The simplest GET request
response = requests.get('https://api.github.com')
print(response.status_code)  # 200 on success
print(response.text)         # the returned HTML/JSON content
                    

2.2 Sending a POST Request

import requests

# POST with form data
payload = {'username': 'admin', 'password': '123456'}
response = requests.post('https://httpbin.org/post', data=payload)

# Send JSON data; json= serializes the dict itself, so the json module is not needed
response = requests.post(
    'https://httpbin.org/post',
    json={'key': 'value'}
)
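
To see what data= and json= actually put into the request without touching the network, one option is to build the requests and inspect them before sending. A minimal sketch, reusing the httpbin.org URL from above; the variable names are only illustrative:

```python
import requests

url = 'https://httpbin.org/post'

# Prepare (but do not send) a form-encoded POST and a JSON POST
form_req = requests.Request('POST', url, data={'username': 'admin'}).prepare()
json_req = requests.Request('POST', url, json={'key': 'value'}).prepare()

print(form_req.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_req.body)                     # the URL-encoded form fields
print(json_req.headers['Content-Type'])  # application/json
print(json_req.body)                     # the serialized JSON payload
```

The two parameters differ only in how the body is encoded and which Content-Type header is set, which is usually the first thing to check when a server rejects a POST.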
                    

2.3 Adding Request Headers

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'application/json',
    'Authorization': 'Bearer your_token_here'
}
response = requests.get('https://api.example.com/data', headers=headers)
                    

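Tip: copy the request as a curl command from the browser's developer tools, then run it through a curl-to-Python converter. That's much faster than writing it by hand.

3. Keeping a Session Alive

This is one of the Requests features I use most. When crawling, many sites only serve data after you log in, and the login state is maintained through Cookies. If every request goes through requests.get(), Cookies aren't kept between calls, so you'd have to log in again every time.

3.1 Basic Session Usage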

import requests

# Create a Session object
session = requests.Session()

# First request: log in
login_data = {'username': 'admin', 'password': '123456'}
response = session.post('https://example.com/login', data=login_data)

# Second request: visit a page that requires login
# Note: this uses session.get(), not requests.get()
response = session.get('https://example.com/profile')
print(response.text)  # the profile page, provided the login succeeded
                    

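See the difference? session.get() automatically sends the Cookies returned by earlier requests, whereas every requests.get() is a brand-new request that keeps no Cookies.

3.2 How Session Works Internally

A Session object is essentially a container for a conversation with the server. It automatically does the following for you:

Cookie persistence: Set-Cookie values returned by the server are stored and sent with later requests
Connection reuse: the underlying TCP connection is reused, which cuts handshake overhead
Default settings: you can set headers, auth, and so on once instead of passing them with every request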
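
The default-settings behaviour can be observed offline with Session.prepare_request(), which merges session-level values into a request without sending it. A small sketch; the 'X-Client' header and the credentials are made-up placeholders:

```python
import requests

session = requests.Session()
session.headers.update({'Accept': 'application/json', 'X-Client': 'demo'})
session.auth = ('user', 'secret')  # placeholder credentials

# Build a request that overrides one of the session headers
request = requests.Request(
    'GET',
    'https://api.example.com/data',
    headers={'Accept': 'text/html'},
)
prepared = session.prepare_request(request)

print(prepared.headers['X-Client'])          # demo (inherited from the session)
print(prepared.headers['Accept'])            # text/html (the per-request value wins)
print('Authorization' in prepared.headers)   # True (built from session.auth)
```

Values passed on an individual call take precedence over the session's defaults, so a session-wide header can still be overridden where needed.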

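3.3 Setting Session-Wide Parameters

import requests

session = requests.Session()

# Session-wide headers, sent with every request
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'application/json'
})

# Note: Session has no session-wide timeout setting; assigning session.timeout
# has no effect, so timeout still has to be passed on each request.
# The headers above are added automatically.
response = session.get('https://api.example.com/data', timeout=10)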
                    

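Note: Session is not guaranteed to be thread-safe! In a multithreaded program, each thread should create its own Session instance.

4. Working with Cookies

Cookies are a big deal in crawler development. Some sites' anti-crawling checks validate Cookies, or even generate them dynamically with JavaScript, and in those cases we have to handle them by hand.

4.1 Inspecting Cookies in a Response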

import requests

response = requests.get('https://example.com')

# Inspect the Cookies returned by the server
print(response.cookies)
# <RequestsCookieJar[<Cookie session_id=xxx for example.com/>]>

# Get the value of one specific Cookie
session_id = response.cookies.get('session_id')
print(session_id)
                    

4.2 Setting Cookies by Hand

import requests

# Method 1: set them through headers
headers = {
    'Cookie': 'session_id=abc123; user_id=456'
}
response = requests.get('https://example.com', headers=headers)

# Method 2: set them through the cookies parameter (recommended)
cookies = {
    'session_id': 'abc123',
    'user_id': '456'
}
response = requests.get('https://example.com', cookies=cookies)
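
When the Cookie string is copied from the browser, it arrives in the Method 1 format. One possible way to turn it into the dict that the cookies parameter expects is the standard library's SimpleCookie. This is a sketch for simple name=value pairs; cookie_header_to_dict is a hypothetical helper name, and unusual characters in values may need extra handling:

```python
from http.cookies import SimpleCookie

def cookie_header_to_dict(header):
    """Split a 'name=value; name2=value2' string into a plain dict."""
    parsed = SimpleCookie()
    parsed.load(header)
    return {name: morsel.value for name, morsel in parsed.items()}

cookies = cookie_header_to_dict('session_id=abc123; user_id=456')
print(cookies)  # {'session_id': 'abc123', 'user_id': '456'}
```

The resulting dict can be passed straight to requests.get(..., cookies=cookies).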
                    

4.3 Cookie Domain and Path

Some Cookies are bound to a specific domain or path and won't be sent automatically on requests to other domains. In that case, specify the scope by hand:

import requests
from requests.cookies import RequestsCookieJar

jar = RequestsCookieJar()
# Set a Cookie bound to a specific domain
jar.set('session_id', 'abc123', domain='.example.com', path='/')

session = requests.Session()
session.cookies = jar
response = session.get('https://sub.example.com/page')
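
To check which Cookies a session would attach to a given URL, the request can be prepared without being sent. The sketch below adds a second, hypothetical Cookie (admin_token, scoped to /admin) to show path matching; cookie_header_for is an illustrative helper, not part of Requests:

```python
import requests
from requests.cookies import RequestsCookieJar

jar = RequestsCookieJar()
jar.set('session_id', 'abc123', domain='.example.com', path='/')
jar.set('admin_token', 'xyz789', domain='.example.com', path='/admin')  # hypothetical

session = requests.Session()
session.cookies = jar

def cookie_header_for(url):
    """Return the Cookie header the session would send to url (nothing is sent)."""
    prepared = session.prepare_request(requests.Request('GET', url))
    return prepared.headers.get('Cookie')

print(cookie_header_for('https://sub.example.com/page'))         # only session_id
print(cookie_header_for('https://sub.example.com/admin/users'))  # both Cookies
print(cookie_header_for('https://other.org/page'))               # None
```

If a Cookie is missing from a request, comparing its domain and path against the URL this way is a quick first diagnostic.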
                    

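5. Implementing Retries

No network request succeeds 100% of the time. The server hiccups, the network jitters, a proxy dies: any of these can make a request fail. Rerunning by hand after every failure is far too slow. Requests, combined with urllib3's retry support, can handle these cases automatically.

5.1 Basic Retry Configuration

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Configure the retry policy
retries = Retry(
    total=5,                # at most 5 retries in total
    backoff_factor=1,       # exponential backoff: the wait roughly doubles each time (per the urllib3 docs, the first retry is immediate)
    status_forcelist=[500, 502, 503, 504]  # also retry when the server answers with one of these status codes
)

# Mount the adapter on the Session
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

# Failed requests are now retried automatically
# (by default only for idempotent methods such as GET, not POST)
response = session.get('https://api.example.com/data')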
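
In a larger project it can be convenient to wrap this setup in a small factory so that every part of the crawler gets the same policy. A sketch under the assumption that urllib3 1.26 or newer is installed (the allowed_methods parameter was named differently before that); make_retrying_session is a hypothetical helper name:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_retrying_session(total=5, backoff_factor=1,
                          status_forcelist=(500, 502, 503, 504),
                          allowed_methods=('GET', 'HEAD', 'OPTIONS')):
    """Return a Session whose http:// and https:// requests share one retry policy."""
    retries = Retry(
        total=total,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
        allowed_methods=allowed_methods,  # methods that are safe to repeat
    )
    adapter = HTTPAdapter(max_retries=retries)
    session = requests.Session()
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

session = make_retrying_session()
policy = session.get_adapter('https://api.example.com/data').max_retries
print(policy.total, policy.backoff_factor)
```

Keeping POST out of allowed_methods is a deliberate choice here: repeating a non-idempotent request could submit the same form twice.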
                    

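5.2 Custom Retry Conditions

Some APIs return 200 but put an error message in the body, so the decision to retry has to be based on the content:

import time

import requests

def request_with_retry(url, max_retries=3):
    for i in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()

            # Check whether the returned content is valid
            data = response.json()
            if data.get('code') == 0:
                return data
            print(f"API returned an error: {data.get('msg')}")

        except (requests.exceptions.RequestException, ValueError) as e:
            # ValueError covers a body that is not valid JSON
            print(f"Request failed ({i+1}/{max_retries}): {e}")

        # Back off before the next attempt, whichever way this one failed
        if i < max_retries - 1:
            wait_time = 2 ** i  # exponential backoff
            print(f"Retrying in {wait_time} seconds...")
            time.sleep(wait_time)

    return None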
                    

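Exponential backoff: wait 1 second after the first failure, 2 seconds after the second, 4 seconds after the third, and so on. This recovers quickly from brief glitches without piling load onto the server.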
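
The schedule is easy to compute up front. The sketch below also adds two things the text does not mention, a cap on the delay and optional random jitter, both of which are assumptions that are common in practice; backoff_delays is a hypothetical helper:

```python
import random

def backoff_delays(max_retries, base=1.0, cap=30.0, jitter=0.0):
    """Return the wait (in seconds) before each retry: base * 2**attempt, capped."""
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * 2 ** attempt)  # 1, 2, 4, ... limited by cap
        delay += random.uniform(0, jitter)     # jitter=0 adds nothing
        delays.append(delay)
    return delays

print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

A little jitter helps when many workers fail at the same moment, since it keeps them from all retrying in lockstep.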

6. Proxy Settings

Getting your IP banned is hard to avoid when crawling. Using a proxy is the most direct fix.

6.1 HTTP/HTTPS Proxies

import requests

# Set the proxies
proxies = {
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080',
}

response = requests.get('https://api.ipify.org?format=json', proxies=proxies)
print(response.text)  # the IP address of the proxy server
                    

6.2 Proxies with Authentication

proxies = {
    'http': 'http://username:password@proxy.example.com:8080',
    'https': 'http://username:password@proxy.example.com:8080',
}
                    

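6.3 Rotating Through a Proxy Pool

A single proxy will still get banned if you use it long enough, so prepare a pool of proxies and pick one at random for each request:

import random

import requests

# Proxy pool
proxy_pool = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]

def get_proxy():
    # Pick once so that http and https traffic go through the same proxy
    proxy = random.choice(proxy_pool)
    return {'http': proxy, 'https': proxy}

# Pick a random proxy for every request
for i in range(10):
    proxy = get_proxy()
    response = requests.get('https://api.example.com', proxies=proxy, timeout=10)
    print(f"Request {i+1} returned status {response.status_code}")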
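
Random selection alone keeps handing out proxies that have already died. One possible refinement is to drop a proxy from the pool once it fails to connect. This is only a sketch: ProxyPool and fetch_via_pool are illustrative names, and treating ProxyError and ConnectTimeout as "proxy is dead" is an assumption that may be too strict for flaky but usable proxies:

```python
import random

import requests

class ProxyPool:
    """A tiny in-memory proxy pool."""

    def __init__(self, proxy_urls):
        self._proxies = list(proxy_urls)

    def __len__(self):
        return len(self._proxies)

    def pick(self):
        """Return a proxies dict that uses one proxy for both schemes."""
        proxy = random.choice(self._proxies)
        return {'http': proxy, 'https': proxy}

    def discard(self, proxy_url):
        """Drop a proxy that has failed; unknown URLs are ignored."""
        if proxy_url in self._proxies:
            self._proxies.remove(proxy_url)

def fetch_via_pool(url, pool, attempts=3, timeout=10):
    """Try up to `attempts` proxies, discarding each one that fails to connect."""
    for _ in range(attempts):
        if len(pool) == 0:
            break  # nothing left to try
        proxies = pool.pick()
        try:
            return requests.get(url, proxies=proxies, timeout=timeout)
        except (requests.exceptions.ProxyError,
                requests.exceptions.ConnectTimeout):
            pool.discard(proxies['http'])
    return None
```

A call such as fetch_via_pool('https://api.example.com', ProxyPool(proxy_pool)) returns the response from the first proxy that connects, or None when every attempt fails.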
                    

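7. Timeout Control

By default Requests sets no timeout, so a request can wait indefinitely for the server to respond. That's dangerous in a crawler, because the whole program may hang. So always set a timeout.

7.1 Setting a Timeout

import requests

# 10-second timeout, applied to the connect phase and to each read separately
# (it is not a cap on the total duration of the request)
response = requests.get('https://example.com', timeout=10)

# Set the connect timeout and the read timeout separately
# Connect timeout 3 seconds, read timeout 27 seconds
response = requests.get('https://example.com', timeout=(3, 27))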
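
Because the timeout has to be passed on every call, it is easy to forget. One common workaround is a small Session subclass that fills in a default. A sketch, with TimeoutSession and default_timeout as made-up names; Requests itself does not ship such a class:

```python
import requests

class TimeoutSession(requests.Session):
    """Session that fills in a default timeout when the caller gives none."""

    def __init__(self, timeout=(3, 27)):
        super().__init__()
        self.default_timeout = timeout

    def request(self, method, url, **kwargs):
        # An explicit timeout=... on the call still takes precedence
        kwargs.setdefault('timeout', self.default_timeout)
        return super().request(method, url, **kwargs)

session = TimeoutSession(timeout=(3, 27))
# session.get('https://example.com')             # would use (3, 27)
# session.get('https://example.com', timeout=5)  # would use 5
```

This works because session.get(), session.post() and the other helpers all go through Session.request(). Note that an explicit timeout=None on a call is kept as given and disables the default.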
                    

7.2 Handling Timeout Exceptions

import requests
from requests.exceptions import Timeout, ConnectionError

try:
    response = requests.get('https://example.com', timeout=5)
except Timeout:
    print("Request timed out; the server is responding too slowly")
except ConnectionError:
    print("Connection failed; check the network or the server address")
except Exception as e:
    print(f"Other error: {e}")
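
The order of the except clauses matters because ConnectTimeout inherits from both ConnectionError and Timeout in Requests' exception hierarchy. A sketch of a helper that checks the most specific classes first; describe_failure and its labels are illustrative only:

```python
import requests

def describe_failure(exc):
    """Map an exception to a short label, most specific class first."""
    errors = requests.exceptions
    if isinstance(exc, errors.ConnectTimeout):
        return 'connect timeout'       # could not connect in time
    if isinstance(exc, errors.ReadTimeout):
        return 'read timeout'          # connected, but the server was too slow
    if isinstance(exc, errors.ConnectionError):
        return 'connection error'      # DNS failure, refused connection, ...
    if isinstance(exc, errors.HTTPError):
        return 'bad HTTP status'       # raised by raise_for_status()
    if isinstance(exc, errors.RequestException):
        return 'other request error'
    return 'not a Requests error'

print(describe_failure(requests.exceptions.ConnectTimeout()))  # connect timeout
```

Distinguishing the two timeout kinds is useful when deciding whether to retry: a connect timeout is generally safe to repeat, while a read timeout may mean the server already started processing the request.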
                    

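8. Handling SSL Certificates

Sometimes a request to an HTTPS endpoint fails with an SSL certificate verification error. Skipping verification is acceptable in a development environment, but it is not recommended in production.

8.1 Skipping SSL Verification (for development and debugging)

import requests
import urllib3

# Skipping verification triggers a warning; silence it before making the request
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Skip SSL verification
response = requests.get('https://self-signed.example.com', verify=False)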
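
urllib3.disable_warnings() silences the warning for the whole process. If only one call should be quiet, a possible alternative is to scope the suppression with the standard warnings module. A sketch; get_without_verification is a hypothetical helper, and warnings.catch_warnings() is not thread-safe, so this suits single-threaded debugging scripts:

```python
import warnings

import requests
from urllib3.exceptions import InsecureRequestWarning

def get_without_verification(url, **kwargs):
    """GET with verify=False, hiding InsecureRequestWarning for this call only."""
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', InsecureRequestWarning)
        return requests.get(url, verify=False, **kwargs)
```

Every other unverified request in the program still produces the warning, which makes it harder for verify=False to slip into production unnoticed.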
                    

8.2 Specifying a CA Certificate

import requests

# Use a custom CA certificate bundle
response = requests.get(
    'https://example.com',
    verify='/path/to/ca_bundle.crt'
)
                    

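9. Streaming Large File Downloads

When downloading a large file, reading response.content directly loads the whole file into memory, which can easily exhaust it. Streaming the download lets you save the data as it arrives.

9.1 Streaming a Download

import requests

# Download a large file
url = 'https://example.com/large_file.zip'

# The with block releases the connection even if the download fails midway
with requests.get(url, stream=True, timeout=(3, 27)) as response:
    response.raise_for_status()
    # Write the file chunk by chunk
    with open('large_file.zip', 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)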
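
If a download is interrupted, the loop above leaves a truncated file under the final name. One way to avoid that is to write to a temporary name and rename only after the last chunk. A sketch; save_stream, download, and the '.part' suffix are illustrative choices rather than anything Requests prescribes:

```python
import os

import requests

def save_stream(chunks, dest):
    """Write an iterable of byte chunks to dest via a temporary '.part' file."""
    tmp_path = dest + '.part'
    written = 0
    with open(tmp_path, 'wb') as f:
        for chunk in chunks:
            if chunk:
                f.write(chunk)
                written += len(chunk)
    os.replace(tmp_path, dest)  # rename only once everything is on disk
    return written

def download(url, dest, chunk_size=8192, timeout=(3, 27)):
    """Stream url to dest and return the number of bytes written."""
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        return save_stream(response.iter_content(chunk_size=chunk_size), dest)
```

A call such as download('https://example.com/large_file.zip', 'large_file.zip') then either produces the complete file or leaves only the '.part' file behind.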
                    

9.2 Showing Download Progress

import requests
from tqdm import tqdm

url = 'https://example.com/large_file.zip'

with requests.get(url, stream=True, timeout=(3, 27)) as response:
    response.raise_for_status()
    # File size from Content-Length; None makes tqdm show a counter without a total
    total_size = int(response.headers.get('content-length', 0)) or None

    with open('large_file.zip', 'wb') as f:
        with tqdm(total=total_size, unit='B', unit_scale=True) as pbar:
            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
                    pbar.update(len(chunk))
                    

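10. Summary

This article walked through the advanced Requests features I use most in real projects. Keep these key points in mind:

Feature	When to use it	Key code
Session	Keeping a login state alive	`session = requests.Session()`
Retries	Unstable network that needs automatic retries	`Retry(total=5, backoff_factor=1)`
Proxies	IP banned, rotation needed	`proxies={'http': 'http://ip:port'}`
Timeout control	Preventing requests from hanging	`timeout=(3, 27)`
Streaming download	Downloading large files	`stream=True`

The Requests documentation is actually quite thorough, but many people can't be bothered to read it. My advice: master these most common features first, and look things up in the docs when you hit a specific problem. Practice makes perfect; writing code teaches you more than reading the docs ten times over.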
