The following walks through several example crawler programs, one technique at a time:
Basic crawler example
Use the `requests` library to send an HTTP request, fetch the page content, and print it.
Example code:
```python
import requests
url = 'https://example.com'
response = requests.get(url)
print(response.text)
```
Simulating browser behavior
To avoid being blocked as an automated client, add request headers that mimic a real browser.
Example code:
```python
import requests
url = 'https://example.com'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)
print(response.text)
```
Parsing HTML with BeautifulSoup
Use the `BeautifulSoup` library to parse the HTML and extract the data you need.
Example code:
```python
import requests
from bs4 import BeautifulSoup
url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.title.string
print(title)
```
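The same parsing works on any HTML string, which makes it easy to experiment offline before pointing the crawler at a live site. A small sketch (the HTML snippet below is made up for illustration):

```python
from bs4 import BeautifulSoup

# A made-up inline HTML snippet for illustration.
html = '''
<html><head><title>Demo Page</title></head>
<body>
  <a href="/a">First</a>
  <a href="/b">Second</a>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)          # the <title> text: Demo Page
for link in soup.find_all('a'):   # iterate over every <a> tag
    print(link['href'], link.get_text())
```

`find_all` returns every matching tag, and each tag exposes its attributes dict-style (`link['href']`) and its text via `get_text()`.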
Extracting data with XPath
Use XPath expressions (via `lxml`) to extract data from the HTML.
Example code:
```python
import requests
from lxml import etree
url = 'https://nba.hupu.com/stats/players'
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
}
res = requests.get(url=url, headers=headers)
e = etree.HTML(res.text)
# Each field lives in a different column, so each needs its own XPath.
# The td[...] indices below are illustrative and must be checked against
# the page's actual layout. Note that <tbody> is often injected by the
# browser and may be absent from the raw HTML; drop it from the paths
# if the queries come back empty.
player = e.xpath('//*[@id="data_js"]/div/div/table/tbody/tr/td[2]/a/text()')
team = e.xpath('//*[@id="data_js"]/div/div/table/tbody/tr/td[3]/a/text()')
hit_rate = e.xpath('//*[@id="data_js"]/div/div/table/tbody/tr/td[9]/text()')
score = e.xpath('//*[@id="data_js"]/div/div/table/tbody/tr/td[4]/text()')
for p, t, hr, s in zip(player, team, hit_rate, score):
    print(f"Player: {p}, Team: {t}, Hit Rate: {hr}, Score: {s}")
```
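Positional `td[n]` indexing is easiest to verify on a small inline snippet before running against a live page. A sketch (the table content is made up):

```python
from lxml import etree

# A made-up stats table for illustration.
html = '''
<table>
  <tr><td><a>Alice</a></td><td><a>Team A</a></td><td>50.1%</td></tr>
  <tr><td><a>Bob</a></td><td><a>Team B</a></td><td>47.3%</td></tr>
</table>
'''

tree = etree.HTML(html)
# td[1], td[2], td[3] select the first, second, and third column of each row.
players = tree.xpath('//tr/td[1]/a/text()')
teams = tree.xpath('//tr/td[2]/a/text()')
rates = tree.xpath('//tr/td[3]/text()')
for p, t, r in zip(players, teams, rates):
    print(p, t, r)
```

Because each column gets its own expression, the three result lists line up row by row and can be zipped together safely.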
Multi-threaded crawler
Use multiple threads to fetch pages concurrently and speed up the crawl.
Example code:
```python
import csv
import re
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# The two regular expressions below lost their original HTML patterns;
# they are placeholders and must be rewritten against the page's markup.
COMMENT_RE = re.compile(r'.*?', re.S)  # placeholder pattern
USER_RE = re.compile(r'([^<]+)')       # placeholder pattern

def main(page):
    url = f'https://tieba.baidu.com/p/page/{page}'
    response = requests.get(url)
    rows = []
    for comment in COMMENT_RE.findall(response.text):
        match = USER_RE.search(comment)
        if match:
            rows.append([comment, match.group(1),
                         time.strftime('%Y-%m-%d %H:%M:%S')])
    return rows

with open('comments.csv', 'a', encoding='utf-8', newline='') as f:
    csvwriter = csv.writer(f)
    csvwriter.writerow(['Comment', 'User', 'Time'])  # header written once
    # Fetch the pages concurrently instead of one by one.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for rows in pool.map(main, range(1, 8)):
            csvwriter.writerows(rows)
```
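The thread-pool pattern itself can be seen in isolation with a stub in place of the network request (`fake_fetch` below is a stand-in, not a real fetch):

```python
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(page):
    # Stand-in for a network request; returns a fake "page body".
    return f'content of page {page}'

# map() runs the calls concurrently but still yields results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_fetch, range(1, 5)))
print(results)
```

Because `pool.map` preserves input order, downstream code (such as the CSV writer above) does not need any extra bookkeeping to match results to pages.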
Crawler framework example
Use a crawler framework such as Scrapy for more complex crawling projects.
Example code: