Motivation

Searching for a keyword, the post does not show up 👇

But the same post can be found in the post list 👇

Paging through posts one by one is tedious, so the idea is to use a crawler to grab the titles, links, and other details of all posts within a given number of pages, making them easy to search.

Preparation

① First open any group (小象乐园 is used as the example here) and click "More group discussions" at the bottom of the page.
② Observe the pattern (a small sketch of it follows this list).
The URL of the first page is:
https://www.douban.com/group/613560/discussion?start=0&type=new

The URL of the second page is:
https://www.douban.com/group/613560/discussion?start=25&type=new

So each page shows 25 posts, and pages are selected by the different values of the start parameter.
③ Open the developer tools (F12), switch to the Network tab, refresh the page, find the request corresponding to this page, and check its request URL and request method.
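
As a quick check on the pattern from step ②, the sketch below (the group ID 613560 is just the example group from above) prints the URL of each listing page from its index:

# Minimal sketch: build the listing-page URLs from the page index,
# assuming 25 posts per page as observed above.
base = 'https://www.douban.com/group/613560/discussion'
for page in range(3):  # first three pages
    start = page * 25  # page 0 -> start=0, page 1 -> start=25, ...
    print(base + '?start=' + str(start) + '&type=new')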

Coding

Required libraries

lxml (used to parse the page's HTML)
requests (used to send the HTTP requests; no further explanation needed)

from lxml import etree
import requests

get_code()

Purpose: fetch the page source.

def get_code(start, group_url):
    url = group_url
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        'Cookie': 'paste your own cookie here (copied from the browser after logging in)'
    }
    data = {
        'start': start,
        'type': 'new'
    }
    request = requests.get(url=url, params=data, headers=headers)
    response = request.text
    return response
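
As a usage sketch (assuming the cookie above has been filled in), fetching the first listing page of the example group looks like this:

# Usage sketch: fetch the first page (start=0) of the example group
html = get_code(0, 'https://www.douban.com/group/613560/discussion')
print(html[:200])  # print the start of the source to confirm the request worked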

list_posts()

Purpose: parse the page source with XPath to extract each post's title, author, reply count, last-reply time, and link, and store them in the corresponding nested lists (one inner list per page).

def list_posts(response, page, titles, urls, dates, authors, replies):
    tree = etree.HTML(response)
    # Post titles
    titles_arr = tree.xpath('//table[@class="olt"]/tr//td/a/@title')
    for t in range(len(titles_arr)):
        titles[page].append(titles_arr[t])
    # Post links
    urls_arr = tree.xpath('//table[@class="olt"]/tr//td[@class="title"]//a/@href')
    for u in range(len(urls_arr)):
        urls[page].append(urls_arr[u])
    # Last-reply times
    dates_arr = tree.xpath('//table[@class="olt"]/tr//td[@class="time"]/text()')
    for d in range(len(dates_arr)):
        dates[page].append(dates_arr[d])
    # Authors
    authors_arr = tree.xpath('//table[@class="olt"]/tr//td[@nowrap="nowrap"]/a/text()')
    for a in range(len(authors_arr)):
        authors[page].append(authors_arr[a])
    # Reply counts (note the trailing space in the class selector)
    replies_arr = tree.xpath('//table[@class="olt"]/tr//td[@class="r-count "]/text()')
    for r in range(len(replies_arr)):
        replies[page].append(replies_arr[r])
    return titles, urls, dates, authors, replies
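
Because all five XPath queries walk the same table of posts, the five lists stay aligned: index j of each inner list refers to the same post. A small sketch of reading them back, assuming page 0 has already been filled by the function above:

# Sketch: print the first post parsed into page 0 (assumes list_posts has run)
page = 0
if titles[page]:
    print(titles[page][0], authors[page][0], replies[page][0], dates[page][0], urls[page][0])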

get_page()

Purpose: given the number of pages requested by the user, call get_code to fetch each page's source and list_posts to parse it, then return the populated lists.

def get_page(all_page, group_url):
    titles = [[] for i in range(all_page)]
    urls = [[] for i in range(all_page)]
    dates = [[] for i in range(all_page)]
    authors = [[] for i in range(all_page)]
    replies = [[] for i in range(all_page)]
    for i in range(all_page):
        start = i * 25
        response = get_code(start, group_url)
        titles, urls, dates, authors, replies = list_posts(response, i, titles, urls, dates, authors, replies)
    return titles, urls, dates, authors, replies
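
Putting the two functions together, crawling the first two pages of the example group could look like the sketch below (the cookie in get_code still has to be filled in first):

# Usage sketch: crawl 2 pages (50 posts) of the example group
titles, urls, dates, authors, replies = get_page(2, 'https://www.douban.com/group/613560/discussion')
print(len(titles[0]), 'posts parsed on page 1')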

select_group()

Purpose: offer a few groups for the user to choose from.

def select_group():
    group_url = ''
    print("=============================")
    print("(1)小象乐园  (2)电竞九组")
    print("(3)社会性死亡 (4)哈组")
    print("=============================")
    group_id = input('Enter the group number: ')
    if group_id == '1':
        group_url = 'https://www.douban.com/group/613560/discussion'
    elif group_id == '2':
        group_url = 'https://www.douban.com/group/705052/discussion'
    elif group_id == '3':
        group_url = 'https://www.douban.com/group/shesizu/discussion'
    elif group_id == '4':
        group_url = 'https://www.douban.com/group/638298/discussion'
    return group_url
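
One design note: if the user types anything other than 1-4, select_group returns an empty string and the later request will fail. A hedged alternative (my own variant, not part of the original script) is a dictionary lookup with a fallback to the first group:

def select_group_v2():
    # Variant sketch: map menu choices to URLs and fall back to group 1
    groups = {
        '1': 'https://www.douban.com/group/613560/discussion',
        '2': 'https://www.douban.com/group/705052/discussion',
        '3': 'https://www.douban.com/group/shesizu/discussion',
        '4': 'https://www.douban.com/group/638298/discussion',
    }
    group_id = input('Enter the group number: ')
    return groups.get(group_id, groups['1'])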

Main function

if __name__ == '__main__':
    group_url = select_group()
    print("➡Usage: press m to reselect the group, p to change the number of pages, q to quit⬅")
    all_page = input('Enter the number of pages to crawl: ')
    print('Crawling ' + str(int(all_page) * 25) + ' posts, please wait...')
    titles, urls, dates, authors, replies = get_page(int(all_page), group_url)
    print('Done!')
    while 1:
        search = input("Enter a search keyword: ")
        isExist = False
        if search == 'q':
            break
        if search == 'm':
            group_url = select_group()
            all_page = input('Enter the number of pages to crawl: ')
            print('Crawling ' + str(int(all_page) * 25) + ' posts, please wait...')
            titles, urls, dates, authors, replies = get_page(int(all_page), group_url)
            print('Done!')
            continue
        if search == 'p':
            all_page = input('Enter the number of pages to crawl: ')
            print('Crawling ' + str(int(all_page) * 25) + ' posts, please wait...')
            titles, urls, dates, authors, replies = get_page(int(all_page), group_url)
            print('Done!')
            continue
        for i in range(int(all_page)):
            for j in range(len(titles[i])):
                if search in titles[i][j] or search.upper() in titles[i][j]:
                    print("[Found! Page %d]" % (i + 1))
                    print("Title: " + titles[i][j])
                    print("Author: " + authors[i][j])
                    print("Replies: " + replies[i][j])
                    print("Last reply: " + dates[i][j])
                    print("Link: " + urls[i][j])
                    print('-----------------------------------------------------')
                    isExist = True
                    break
        if not isExist:
            print("No matching posts found!")

Demo

Unresolved issues

  At the moment the cookie can only be obtained manually and pasted into the code, which is inconvenient. I originally wanted to use selenium to log in to Douban and pass the returned cookie along, but Douban's anti-crawling measures are tough: there is a slider captcha, and it kept reporting an abnormal network environment. That is too hard for a beginner like me, so I gave up and will make do with the manual cookie for now...
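
One small workaround (my own suggestion, not something the script does yet) is to read the cookie from an environment variable instead of hardcoding it, so the source file never needs editing:

import os

# Sketch: take the cookie from an environment variable, e.g.
#   export DOUBAN_COOKIE='...cookie copied from the browser...'
cookie = os.environ.get('DOUBAN_COOKIE', '')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Cookie': cookie
}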

Summary

Mom no longer has to worry about me missing out on the gossip!

OVER