Using Regular Expressions in Web Scraping

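Overview: A regular expression (regex) is a compact pattern language for matching text. In scraper development it is widely used for data extraction, text cleanup, and pattern matching. This article covers the basic syntax, common patterns, and practical scraping use cases, with all examples written for Python's re module.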

1. Regular Expression Basics

1.1 Basic metacharacters

import re

# . matches any single character except a newline
re.findall(r'a.b', 'acb aab a1b')  # ['acb', 'aab', 'a1b']

# ^ matches at the start of the string
re.findall(r'^Hello', 'Hello World')  # ['Hello']

# $ matches at the end of the string
re.findall(r'World$', 'Hello World')  # ['World']

# * matches the preceding element zero or more times
re.findall(r'ab*', 'a ab abb abbb')  # ['a', 'ab', 'abb', 'abbb']

# + matches the preceding element one or more times
re.findall(r'ab+', 'a ab abb abbb')  # ['ab', 'abb', 'abbb']

# ? matches the preceding element zero or one time
re.findall(r'ab?', 'a ab abb')  # ['a', 'ab', 'ab']
                    

1.2 Character classes

import re

# [abc] matches a, b, or c
re.findall(r'[abc]', 'a1b2c3')  # ['a', 'b', 'c']

# [a-z] matches one lowercase ASCII letter
re.findall(r'[a-z]', 'AaBbCc')  # ['a', 'b', 'c']

# [0-9] matches one ASCII digit
re.findall(r'[0-9]', 'a1b2c3')  # ['1', '2', '3']

# [^abc] matches any character other than a, b, or c
re.findall(r'[^abc]', 'a1b2c3')  # ['1', '2', '3']
                    

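1.3 Predefined character classes

import re

# \d matches a digit (same as [0-9] for ASCII text; with str patterns it
# also matches other Unicode digits unless re.ASCII is set)
re.findall(r'\d+', 'abc123def456')  # ['123', '456']

# \D matches a non-digit
re.findall(r'\D+', 'abc123def456')  # ['abc', 'def']

# \w matches letters, digits, and underscore (Unicode-aware for str patterns)
re.findall(r'\w+', 'hello_world 123')  # ['hello_world', '123']

# \W matches a non-word character
re.findall(r'\W+', 'hello_world 123')  # [' ']

# \s matches whitespace
re.findall(r'\s+', 'hello  world')  # ['  ']

# \S matches non-whitespace
re.findall(r'\S+', 'hello  world')  # ['hello', 'world']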
                    

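2. Common Regular Expression Patterns

2.1 Extracting URLs

import re

text = "Visit https://www.example.com or http://test.com"
# Everything up to the next whitespace counts as part of the URL, so
# punctuation that directly follows a URL is captured too.
url_pattern = r'https?://[^\s]+'
urls = re.findall(url_pattern, text)
print(urls)  # ['https://www.example.com', 'http://test.com']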
                    

2.2 Extracting email addresses

import re

text = "Contact: user@example.com or admin@test.org"
# A pragmatic pattern for scraping, not a full address validator
email_pattern = r'[\w\.-]+@[\w\.-]+\.\w+'
emails = re.findall(email_pattern, text)
print(emails)  # ['user@example.com', 'admin@test.org']
                    

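2.3 Extracting mobile phone numbers

import re

text = "Phone: 13812345678, 159-8765-4321"
# Either 11 consecutive digits starting with 13 to 19, or the 3-4-4 hyphenated form
phone_pattern = r'1[3-9]\d{9}|\d{3}-\d{4}-\d{4}'
phones = re.findall(phone_pattern, text)
print(phones)  # ['13812345678', '159-8765-4321']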
                    

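2.4 Extracting IP addresses

import re

text = "Server IP: 192.168.1.1, client IP: 10.0.0.1"
# Matches the dotted-quad shape only; it does not check that each part is 0-255
ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
ips = re.findall(ip_pattern, text)
print(ips)  # ['192.168.1.1', '10.0.0.1']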
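
The dotted-quad pattern above also accepts strings such as 999.1.1.1. A minimal sketch of a stricter variant follows, assuming only IPv4 matters; the names OCTET, IPV4_RE, and find_ipv4 are illustrative and not part of the original article.

```python
import re

# One octet in the range 0-255, most specific alternatives first.
OCTET = r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'

# The lookbehinds and lookahead stop the pattern from matching inside a
# longer run of digits and dots (e.g. the tail of "999.1.1.1").
IPV4_RE = re.compile(
    rf'(?<!\d)(?<!\d\.){OCTET}(?:\.{OCTET}){{3}}(?!\d|\.\d)',
    re.ASCII,  # restrict \d to 0-9
)

def find_ipv4(text):
    """Return dotted-quad strings whose four octets are each in 0-255."""
    return IPV4_RE.findall(text)

sample = "ok: 192.168.1.1 and 10.0.0.1; bad: 999.1.1.1 and 256.0.0.1"
print(find_ipv4(sample))  # ['192.168.1.1', '10.0.0.1']
```

A sentence-ending period after an address is tolerated because the lookahead rejects only a digit or a dot followed by a digit.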
                    

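3. Groups and Capturing

3.1 Basic groups

import re

# \w also matches CJK characters in Python 3 str patterns, so (\w+) captures 张三
text = "Name: 张三, Age: 25"
pattern = r'Name: (\w+), Age: (\d+)'
match = re.search(pattern, text)
if match:
    name = match.group(1)  # first parenthesized group
    age = match.group(2)   # second parenthesized group
    print(f"Name: {name}, age: {age}")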
                    

3.2 Named groups

import re

text = "2024-06-14"
# (?P<name>...) gives a group a name that can be used instead of its index
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
match = re.search(pattern, text)
if match:
    print(f"Year: {match.group('year')}")
    print(f"Month: {match.group('month')}")
    print(f"Day: {match.group('day')}")
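
When a page contains several records, named groups combine well with re.finditer and Match.groupdict(), which returns one dictionary per match. The following is a small sketch of that idea; DATE_RE and extract_dates are hypothetical names chosen for illustration.

```python
import re

DATE_RE = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

def extract_dates(text):
    """Return one dict per date found, keyed by group name."""
    return [m.groupdict() for m in DATE_RE.finditer(text)]

sample = "Published 2024-06-14, updated 2024-07-01"
for record in extract_dates(sample):
    print(record)
# {'year': '2024', 'month': '06', 'day': '14'}
# {'year': '2024', 'month': '07', 'day': '01'}
```

Dictionaries keyed by field name tend to be easier to pass on to a storage or export step than positional tuples.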
                    

3.3 Non-capturing groups

import re

# (?:...) groups alternatives without creating a capture,
# so findall returns only the (\d+) group
text = "abc123def456"
pattern = r'(?:abc|def)(\d+)'
matches = re.findall(pattern, text)
print(matches)  # ['123', '456']
                    

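4. Practical Use in Scrapers

4.1 Extracting links from a page

import re
import requests

def extract_links(html):
    """Extract every href value from the page."""
    # Match the href attribute, quoted or unquoted; values may be relative URLs
    pattern = r'href=[\'"]?([^\'" >]+)[\'"]?'
    links = re.findall(pattern, html)
    return links

# Usage example
response = requests.get('https://example.com', timeout=10)  # avoid hanging forever
links = extract_links(response.text)
for link in links[:10]:  # print the first 10 links
    print(link)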
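
The href values returned above are often relative paths, in-page anchors, or duplicates. One possible follow-up step, sketched here with the standard library's urllib.parse, resolves them against the page URL; HREF_RE, extract_absolute_links, and base_url are illustrative names, and the filtering rules are assumptions rather than part of the original article.

```python
import re
from urllib.parse import urljoin, urldefrag

HREF_RE = re.compile(r'href=[\'"]?([^\'" >]+)')

def extract_absolute_links(html, base_url):
    """Resolve hrefs against base_url, dropping anchors, mailto/javascript links, and duplicates."""
    seen = []
    for href in HREF_RE.findall(html):
        if href.startswith(('#', 'mailto:', 'javascript:')):
            continue
        # urljoin resolves relative paths; urldefrag strips any "#fragment"
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in seen:
            seen.append(absolute)
    return seen

sample_html = (
    '<a href="/about">About</a> <a href="news.html#top">News</a> '
    '<a href="#top">Top</a> <a href="/about">Again</a>'
)
print(extract_absolute_links(sample_html, 'https://example.com/blog/'))
# ['https://example.com/about', 'https://example.com/blog/news.html']
```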
                    

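4.2 Extracting JSON data

import re
import json

def extract_json(text):
    """Extract JSON objects and arrays embedded in text."""
    # Matches only innermost objects or arrays: a {...} with no nested
    # braces, or a [...] with no nested brackets
    pattern = r'\{[^{}]*\}|\[[^\[\]]*\]'
    matches = re.findall(pattern, text)

    results = []
    for match in matches:
        try:
            data = json.loads(match)
            results.append(data)
        except json.JSONDecodeError:
            continue  # looked like JSON but did not parse; skip it

    return results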
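
Because the pattern above cannot span nested braces, it returns only the innermost pieces of a nested structure. A hedged alternative is to let the standard library's json.JSONDecoder.raw_decode find where each value ends. The sketch below is a different technique from the regex approach, and extract_json_values is a hypothetical name.

```python
import json

def extract_json_values(text):
    """Scan text and decode every JSON object or array, including nested ones."""
    decoder = json.JSONDecoder()
    results = []
    pos = 0
    while pos < len(text):
        # Jump to the next character that could start an object or array.
        starts = [i for i in (text.find('{', pos), text.find('[', pos)) if i != -1]
        if not starts:
            break
        start = min(starts)
        try:
            value, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON at this position; keep scanning
            continue
        results.append(value)
        pos = end  # continue after the decoded value
    return results

sample = 'var data = {"user": {"id": 1, "tags": ["a", "b"]}}; var list = [1, 2, {"x": null}];'
print(extract_json_values(sample))
# [{'user': {'id': 1, 'tags': ['a', 'b']}}, [1, 2, {'x': None}]]
```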
                    

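4.3 Cleaning text data

import re

def clean_text(text):
    """Clean scraped text."""
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)

    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)

    # Remove special characters (keep word characters, whitespace, and CJK)
    text = re.sub(r'[^\w\s\u4e00-\u9fff]', '', text)

    return text.strip()

# Usage example (the tags in this sample string are illustrative)
html = "<p>  Hello  <b>World</b>!  </p>\n"
cleaned = clean_text(html)
print(cleaned)  # Hello World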
                    

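4.4 Extracting prices

import re

def extract_price(text):
    """Extract price strings."""
    # Matches ¥123.45 or 123.45元; the currency sign and the unit are both
    # optional, so bare numbers in the text match as well
    pattern = r'[¥￥]?\d+(?:\.\d{2})?[元]?'
    prices = re.findall(pattern, text)
    return prices

# Usage example
text = "Price: ¥99.99, sale price: 88.88元"
prices = extract_price(text)
print(prices)  # ['¥99.99', '88.88元']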
                    

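5. Advanced Techniques

5.1 Greedy vs. non-greedy matching

import re

# Sample markup; the <div> tags are illustrative
text = "<div>content 1</div><div>content 2</div>"

# Greedy (default): .* consumes as much as possible
greedy = re.findall(r'<div>.*</div>', text)
print(greedy)  # ['<div>content 1</div><div>content 2</div>']

# Non-greedy: .*? stops at the first closing tag
lazy = re.findall(r'<div>.*?</div>', text)
print(lazy)  # ['<div>content 1</div>', '<div>content 2</div>']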
                    

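5.2 Lookahead assertions

import re

text = "apple1 banana2 cherry3"

# Positive lookahead: word characters that are followed by a digit
pattern1 = r'\w+(?=\d)'
matches1 = re.findall(pattern1, text)
print(matches1)  # ['apple', 'banana', 'cherry']

# Negative lookahead: the match must not be followed by a digit.
# \w also matches digits, so \w+ consumes the trailing digit itself
# and the lookahead then sees a space or the end of the string.
pattern2 = r'\w+(?!\d)'
matches2 = re.findall(pattern2, text)
print(matches2)  # ['apple1', 'banana2', 'cherry3']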
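
In scraping, a positive lookahead is handy for taking a number only when a particular unit follows it, without including the unit in the result. A small sketch, assuming amounts are written directly before 元; AMOUNT_BEFORE_YUAN and amounts_in_yuan are illustrative names.

```python
import re

# Digits with optional decimals, only when immediately followed by "元".
# The lookahead checks the unit but does not add it to the match.
AMOUNT_BEFORE_YUAN = re.compile(r'\d+(?:\.\d+)?(?=元)')

def amounts_in_yuan(text):
    """Return numeric strings that are directly followed by the unit 元."""
    return AMOUNT_BEFORE_YUAN.findall(text)

sample = "list price 128元, now 99.5元, stock 300 units"
print(amounts_in_yuan(sample))  # ['128', '99.5']
```

The stock count 300 is skipped because no 元 follows it.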
                    

5.3 Multiline mode

import re

text = """Line 1
Line 2
Line 3"""

# With re.MULTILINE, ^ and $ match at the start and end of every line
pattern = r'^Line \d+$'
matches = re.findall(pattern, text, re.MULTILINE)
print(matches)  # ['Line 1', 'Line 2', 'Line 3']
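
A related flag that often matters when scraping is re.DOTALL. By default . does not match a newline, so a pattern meant to capture a multi-line block finds nothing. The sketch below assumes a plain <script> tag without attributes; SCRIPT_RE and script_bodies are illustrative names.

```python
import re

# re.DOTALL lets '.' match '\n', so the body may span several lines.
SCRIPT_RE = re.compile(r'<script>(.*?)</script>', re.DOTALL)

def script_bodies(html):
    """Return the text inside each <script>...</script> block, trimmed."""
    return [body.strip() for body in SCRIPT_RE.findall(html)]

sample_html = """<script>
var config = {"page": 1};
</script>"""

# Without the flag, the multi-line block is not matched at all.
print(re.findall(r'<script>(.*?)</script>', sample_html))  # []
print(script_bodies(sample_html))  # ['var config = {"page": 1};']
```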
                    

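6. Performance Tips

Notes:

Avoid complex patterns that backtrack heavily, such as nested quantifiers like (a+)+
Prefer specific patterns (for example [^>]+ instead of .*) to limit backtracking; non-greedy matching changes what is matched but is not automatically faster
Precompile patterns that are reused (re.compile)
For parsing HTML structure, even in simple cases, consider a parser such as BeautifulSoup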

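import re

# Precompile the pattern once
pattern = re.compile(r'https?://[^\s]+')

# Sample inputs standing in for page text fetched elsewhere
text1 = "see https://a.example/1"
text2 = "no links here"
text3 = "http://b.example/2 and https://c.example/3"

# Reusing the compiled object skips the pattern lookup on every call
urls1 = pattern.findall(text1)
urls2 = pattern.findall(text2)
urls3 = pattern.findall(text3)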
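
The note about preferring an HTML parser can be illustrated without installing anything. The sketch below uses the standard library's html.parser as a stand-in for BeautifulSoup, which is a substitution made here to keep the example self-contained; LinkCollector and collect_links are hypothetical names.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def collect_links(html):
    """Parse html and return the href of every <a> tag in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

sample = '<p><a class="x" href="/a">A</a><!-- <a href="/hidden">no</a> --><a href="/b">B</a></p>'
print(collect_links(sample))  # ['/a', '/b']
```

Unlike a bare href regex, the parser ignores the link inside the HTML comment, which is the kind of structural detail that is awkward to express in a pattern.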
                    

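7. Developing with EasySpider

EasySpider's tools can help when developing regular expressions:

Text comparison: verify what a regular expression extracts
URL extraction: analyze URL patterns
Data formatting: inspect the structure of extracted data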

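Summary

Regular expressions are a powerful tool for scraper development. After working through this article, you should be able to:

Understand the basic regular expression syntax
Write commonly used patterns
Apply regular expressions to extract data in a scraper
Tune regular expressions for performance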
