Web scraping is actually quite simple; with a little care you can pick up the technique quickly. Below, we scrape jokes from Qiushibaike (糗事百科) step by step to show why a crawler really isn't a complicated thing.
Goals
- Scrape the hot jokes from Qiushibaike
- On each press of Enter, display one joke's publish time, author, content, and vote count.
Fetching the page source
Fetch the HTML with the Requests library.
```python
import requests
import re

head = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
TimeOut = 30

def requestpageText(url):
    try:
        Page = requests.session().get(url, headers=head, timeout=TimeOut)
        Page.encoding = "utf-8"  # the site serves UTF-8; gb2312 would garble the text
        return Page.text
    except BaseException as e:
        print("Network request failed...", e)

site = "http://www.qiushibaike.com/8hr/page/1"
text = requestpageText(site)  # fetch the page source
print(text)
```
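Why the `Page.encoding` assignment matters: `Response.text` decodes the raw bytes using whatever codec `encoding` names, so setting the wrong one garbles every non-ASCII character. A small offline illustration (no network needed):

```python
# The same bytes decode very differently under different codecs.
raw = "糗事百科".encode("utf-8")               # bytes as the server sends them
good = raw.decode("utf-8")                     # correct text
bad = raw.decode("gb2312", errors="replace")   # mojibake
print(good)
print(good == bad)  # False
```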
Extracting and printing the jokes
Match the joke data with a regular expression.
```python
patterns = re.compile(
    r'<div class="article block untagged mb15".*?title="(.*?)">'
    r'.*?<div class="content">(.*?)</div>'
    r'.*?<i class="number">(.*?)</i> 好笑'
    r'.*?<i class="number">(.*?)</i> 评论',
    re.S)
items = re.findall(patterns, text)
```
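To see what the pattern produces, here is the same regex run against a hand-written fragment that imitates the page markup (the fragment is an illustration, not the site's actual HTML, which may differ):

```python
import re

# Minimal stand-in for one joke entry on the page (illustrative only).
sample = '''
<div class="article block untagged mb15">
<a href="/users/1" title="SomeAuthor">
<div class="content">A short joke body</div>
<i class="number">123</i> 好笑
<i class="number">4</i> 评论
</div>
'''

patterns = re.compile(
    r'<div class="article block untagged mb15".*?title="(.*?)">'
    r'.*?<div class="content">(.*?)</div>'
    r'.*?<i class="number">(.*?)</i> 好笑'
    r'.*?<i class="number">(.*?)</i> 评论',
    re.S)  # re.S lets . match newlines, so .*? can cross tag boundaries

items = re.findall(patterns, sample)
print(items)  # each match is an (author, content, votes, comments) tuple
```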
Stepping through with Enter
Keep the scraped data in a local list; each press of Enter fetches the next item, and when the list runs out, move on to the next page.
```python
index = 0
while index < len(items):
    try:
        x = items[index]
        print("Author: {0}  Funny: {1}  Comments: {2}".format(x[0], x[2], x[3]))
        print(x[1])
        input("Press Enter for the next item")  # don't assign to text: that would shadow the page source
        print()
        print()
    except Exception as e:
        print(e)
    index += 1
```
Putting it all together
Open the Qiushibaike hot-jokes page at http://www.qiushibaike.com/8hr/page/1; flipping through a few pages shows that the trailing digit in the URL is the page number.
Right-click the page and view its source; analyzing the HTML reveals the data we need.
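The observation above means every page URL can be produced from one template with `%d` substitution, which is exactly how the class below builds its URLs:

```python
# URL template with the page number as the only variable part.
url_tpl = "http://www.qiushibaike.com/8hr/page/%d"
page_url = url_tpl % 3
print(page_url)  # http://www.qiushibaike.com/8hr/page/3
```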
```python
import requests
import re

class qiushibaike:
    def __init__(self):
        self.head = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
        self.TimeOut = 30
        self.url = "http://www.qiushibaike.com/8hr/page/%d"
        self.page = 1

    def requestpageText(self, url):
        try:
            print("Fetching:", url)
            Page = requests.session().get(url, headers=self.head, timeout=self.TimeOut)
            Page.encoding = "utf-8"
            return Page.text
        except BaseException as e:
            print("Network request failed...", e)

    def downurl(self, page):
        url = self.url % (page)
        text = self.requestpageText(url)
        patterns = re.compile(
            r'<div class="article block untagged mb15".*?title="(.*?)">'
            r'.*?<div class="content">(.*?)</div>'
            r'.*?<i class="number">(.*?)</i> 好笑'
            r'.*?<i class="number">(.*?)</i> 评论',
            re.S)
        items = re.findall(patterns, text)
        index = 0
        while index < len(items):
            try:
                x = items[index]
                print("Author: {0}  Funny: {1}  Comments: {2}".format(x[0], x[2], x[3]))
                print(x[1])
                input("Press Enter for the next item")
                print()
                print()
            except Exception as e:
                print(e)
            index += 1

    def start(self):
        # Loop over pages instead of having downurl call itself, which
        # would deepen the call stack by one frame per page.
        while True:
            self.downurl(self.page)
            self.page += 1

q = qiushibaike()
q.start()
```
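Pagination can also be expressed as a generator that yields one item at a time, keeping the Enter-driven loop flat regardless of how many pages are consumed. A minimal offline sketch, where `fetch_page` is a hypothetical stand-in for the real scraping call:

```python
def iter_items(fetch_page, max_pages):
    """Yield scraped tuples page by page.

    fetch_page: any callable returning a list of items for a page number
    (a hypothetical helper standing in for the network fetch + regex).
    """
    for page in range(1, max_pages + 1):
        for item in fetch_page(page):
            yield item

# Two fake "pages" of (author, body) pairs instead of live data.
fake = {1: [("a1", "joke1")], 2: [("a2", "joke2")]}
items = list(iter_items(lambda p: fake.get(p, []), 2))
print(items)  # [('a1', 'joke1'), ('a2', 'joke2')]
```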
Sample output
```
C:\Users\Administrator>E:\python\learn\qiushibaike\qiushibaike.py
Fetching: http://www.qiushibaike.com/8hr/page/1
Author: 快乐二霸~武寒  Funny: 3118  Comments: 115
致我们终将逝去的青春
Press Enter for the next item
```