Web scraping is actually quite simple; with a little focus you can pick up the technique quickly. Below, by building a scraper for the Meizitu photo galleries, we'll see why a crawler is, in practice, a very simple thing.
Goals of this article
- Parse the page source of the Meizitu website
- Extract the photo URLs from the site
- Save each photo set to a local folder, one folder per set
Site analysis
Open the Meizitu homepage, click the 清纯美女 ("fresh-faced girls") category, then click "next page". The URL becomes http://www.meizitu.com/a/qingchun_3_2.html; after paging through a few more times, we can see that the trailing number 2 is the page index.
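The URL pattern observed above can be sketched in a couple of lines. This is a minimal illustration of the pagination scheme, assuming the trailing number simply counts pages:

```python
# The listing URLs follow a simple template; only the trailing
# number changes as you page through the category.
base = "http://www.meizitu.com/a/qingchun_3_%d.html"
pages = [base % n for n in range(1, 4)]  # pages 1 through 3
for url in pages:
    print(url)
```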
Continuing the analysis: pick a gallery, click through to its detail page, right-click, and view the page source. There we find the full-size image URLs, such as http://pic.meizitu.com/wp-content/uploads/2016a/07/10/01.jpg
```html
</div>
<div class="postContent">
<p><div id="picture">
<p>
<img alt="短发女孩,第1张" src="http://pic.meizitu.com/wp-content/uploads/2015a/12/10/01.jpg" /><br />
<img alt="短发女孩,第2张" src="http://pic.meizitu.com/wp-content/uploads/2015a/12/10/02.jpg" /><br />
<img alt="短发女孩,第3张" src="http://pic.meizitu.com/wp-content/uploads/2015a/12/10/03.jpg" /><br />
<img alt="短发女孩,第4张" src="http://pic.meizitu.com/wp-content/uploads/2015a/12/10/04.jpg" /><br />
<img alt="短发女孩,第5张" src="http://pic.meizitu.com/wp-content/uploads/2015a/12/10/05.jpg" /><br />
<img alt="短发女孩,第6张" src="http://pic.meizitu.com/wp-content/uploads/2015a/12/10/06.jpg" /><br />
<img alt="短发女孩,第7张" src="http://pic.meizitu.com/wp-content/uploads/2015a/12/10/07.jpg" /><br />
<img alt="短发女孩,第8张" src="http://pic.meizitu.com/wp-content/uploads/2015a/12/10/08.jpg" /><br />
</p>
</div>
<div class="boxinfo">
```
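Extracting the `src` attributes from markup like this can be sketched before writing the full scraper. The HTML fragment and the regex below are illustrative assumptions, not the exact pattern the article settles on later:

```python
import re

# An HTML fragment modeled on the gallery markup shown above.
html = (
    '<img alt="短发女孩,第1张" src="http://pic.meizitu.com/wp-content/uploads/2015a/12/10/01.jpg" /><br />'
    '<img alt="短发女孩,第2张" src="http://pic.meizitu.com/wp-content/uploads/2015a/12/10/02.jpg" /><br />'
)

# Capture every src attribute that points at a .jpg file.
srcs = re.findall(r'src="(http://[^"]+\.jpg)"', html)
print(srcs)
```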
Fetching the page source
We fetch the page source with the Requests library, then run a regular expression over it to find the links to the gallery detail pages.
If you are not familiar with Requests, see the Requests tutorial.
The code:
```python
import requests
import re

head = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
TimeOut = 30

def requestpageText(url):
    try:
        Page = requests.session().get(url, headers=head, timeout=TimeOut)
        Page.encoding = "gb2312"
        return Page.text
    except BaseException as e:
        print("Network request failed...", e)

site = "http://www.meizitu.com/a/qingchun_3_1.html"
text = requestpageText(site)  # fetch the page source
patterns = re.compile(r'http:.*?/\d*?.html')  # match the detail-page links
istp = re.findall(patterns, text)
for photo in istp:
    print(photo)
```
The output:
```
C:\Users\Administrator>E:\python\learn\download_meizitu\down_meizitu_detail_test.py
http://www.w3.org/TR/xhtml
http://www.w3.org/1999/xhtml
http://www.meizitu.com/a/5413.html
http://www.meizitu.com/a/5413.html
http://www.meizitu.com/a/5406.html
http://www.meizitu.com/a/5406.html
http://www.meizitu.com/a/5402.html
http://www.meizitu.com/a/5402.html
http://www.meizitu.com/a/5396.html
http://www.meizitu.com/a/5396.html
http://www.meizitu.com/a/5390.html
http://www.meizitu.com/a/5390.html
```
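Notice that the raw matches include the w3.org namespace URLs and each detail link twice. A small follow-up sketch (my addition, not part of the article's code) filters and deduplicates them while preserving order:

```python
# Filter the regex matches down to unique detail-page links.
matches = [
    "http://www.w3.org/TR/xhtml",
    "http://www.w3.org/1999/xhtml",
    "http://www.meizitu.com/a/5413.html",
    "http://www.meizitu.com/a/5413.html",
    "http://www.meizitu.com/a/5406.html",
    "http://www.meizitu.com/a/5406.html",
]

seen = set()
detail_pages = []
for url in matches:
    # Keep only meizitu detail pages, and skip any URL we've seen before.
    if "meizitu.com/a/" in url and url not in seen:
        seen.add(url)
        detail_pages.append(url)

print(detail_pages)
```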
Downloading and saving the images
We can use Python's file I/O to download the image data and save it to disk.
The code:
```python
import requests

image_url = "http://pic.meizitu.com/wp-content/uploads/2015a/12/10/01.jpg"
pwd = "E:/01.jpg"

def downfile(file, url):
    print("Downloading:", file, url)
    r = requests.get(url, stream=True)
    with open(file, 'wb') as fd:
        for chunk in r.iter_content():
            fd.write(chunk)

downfile(pwd, image_url)
```
The output:
```
C:\Users\Administrator>E:\python\learn\download_meizitu\downloadiamge.py
Downloading: E:/01.jpg http://pic.meizitu.com/wp-content/uploads/2015a/12/10/01.jpg

C:\Users\Administrator>
```
Looking in the E: drive, we can see the image was downloaded successfully.
Full source
We can use a for loop to page through the listings, parse each page's source to get the detail-page URLs, create a matching folder for each gallery on the E: drive, then parse each detail page, extract the full-size image URLs, and download the images to disk.
The full source:
```python
import re
import os
import requests

head = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Py40.com/20161001 Firefox/3.5.6'}
TimeOut = 30

PhotoName = 0
PWD = "E:/meizitu/pic/"

def requestpageText(url):
    try:
        Page = requests.session().get(url, headers=head, timeout=TimeOut)
        Page.encoding = "gb2312"
        return Page.text
    except BaseException as e:
        print("Network request failed...", e)

def downfile(file, url):
    print("Downloading:", file, url)
    r = requests.get(url, stream=True)
    with open(file, 'wb') as fd:
        for chunk in r.iter_content():
            fd.write(chunk)

def getImageDetail(url):
    global PhotoName
    # use the numeric stem of the detail URL as the gallery folder name
    item_start = url.rindex("/") + 1
    item_end = url.rindex(".")
    url_filename = PWD + url[item_start:item_end]
    if not os.path.exists(url_filename):
        os.makedirs(url_filename)
    text = requestpageText(url)
    patterns = re.compile(r'第\d张"\ssrc="(.*.jpg)')
    listp = re.findall(patterns, text)
    print(url, listp)
    for x in listp:
        patterns = re.compile(r'http.*?.jpg')
        image = re.search(patterns, x, flags=0).group(0)
        image_start = image.rindex("/") + 1
        image_filename = url_filename + "/" + image[image_start:]
        if os.path.isfile(image_filename):
            print("File already exists:", image_filename)
            continue
        PhotoName += 1
        downfile(image_filename, url=image)

PWD = PWD + "qingchun/"
if not os.path.exists(PWD):
    os.makedirs(PWD)

for x in range(1, 3):
    site = "http://www.meizitu.com/a/qingchun_3_%d.html" % x
    text = requestpageText(site)
    patterns = re.compile(r'http:.*?/\d*?.html')
    istp = re.findall(patterns, text)
    for href in istp:  # detail pages
        getImageDetail(href)

print("You have downloaded %d photos" % PhotoName)
```
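The folder naming inside `getImageDetail` is worth seeing in isolation: the numeric stem of the detail-page URL becomes the per-gallery directory name. A minimal sketch of that slicing logic:

```python
# The stem between the last "/" and the ".html" extension
# becomes the gallery's folder name (e.g. "5413").
url = "http://www.meizitu.com/a/5413.html"
stem = url[url.rindex("/") + 1 : url.rindex(".")]
print(stem)
```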
The output:
```
C:\Users\Administrator>E:\python\learn\download_meizitu\down_meizitu_detail.py
http://www.w3.org/TR/xhtml []
http://www.w3.org/1999/xhtml []
http://www.meizitu.com/a/5413.html ['http://pic.meizitu.com/wp-content/uploads/2016a/07/10/01.jpg', 'http://pic.meizitu.com/wp-content/uploads/2016a/07/10/02.jpg', 'http://pic.meizitu.com/wp-content/uploads/2016a/07/10/03.jpg', 'http://pic.meizitu.com/wp-content/uploads/2016a/07/10/04.jpg', 'http://pic.meizitu.com/wp-content/uploads/2016a/07/10/05.jpg', 'http://pic.meizitu.com/wp-content/uploads/2016a/07/10/06.jpg']
Downloading: E:/meizitu/pic/qingchun/5413/01.jpg http://pic.meizitu.com/wp-content/uploads/2016a/07/10/01.jpg
Downloading: E:/meizitu/pic/qingchun/5413/02.jpg http://pic.meizitu.com/wp-content/uploads/2016a/07/10/02.jpg
```
Source download
The source has been uploaded to GitHub: https://github.com/dy60420667/python_meizitu