Web scraping is actually quite simple; with a little care you can pick up the technique quickly. Below, by building a scraper that downloads images from Huaban (huaban.com), we will see why crawling really is a simple thing.
Goals of this article
- Fetch the beauty-category images from huaban.com
- Handle the paginated data
- Download images with more than 100 repins or more than 10 likes to local disk
Difficulties
- Fetching the page source and handling pagination
- Getting each image's detail-page URL
- Downloading images with more than 100 repins or more than 10 likes to local disk
1. Fetching the page source and handling pagination

We can fetch the source with the Requests library, and use Fiddler to watch the requests the page makes in order to figure out pagination.

Open the beauty list page on huaban.com; its URL is http://huaban.com/favorite/beauty/. Scroll down to the bottom to load more images and notice that the URL does not change. So the question is: how do we get the next page? This is where the network monitoring tool Fiddler comes in (see the separate article on how to use Fiddler). Open Fiddler, then scroll the list so the next page loads; Fiddler shows the request URL is http://huaban.com/favorite/beauty/?iqkxaeyv&limit=20&wfl=1&max=791718782.

Looking at that URL: http://huaban.com/favorite/beauty/ is the list page itself, and the rest are query parameters. max=791718782 should be an image id. So we can paginate by splicing together http://huaban.com/favorite/beauty/?iqkxaeyv&limit=20&wfl=1&max=pin_id, where pin_id is the id of the last image on the current page; how to obtain it is described later.
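As a quick illustration of that splicing (a minimal sketch; last_pin_id is a placeholder for the id we will later parse out of the page source):

```python
# Minimal sketch of building the next-page URL by string splicing.
# last_pin_id is a placeholder; the real script parses it from the
# last "pin_id" in the previous page's source.
urlNext = "http://huaban.com/favorite/beauty/?iqkxaeyv&limit=20&wfl=1&max="

last_pin_id = "791718782"
print(urlNext + last_pin_id)
```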
```python
import requests

head = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
TimeOut = 30

def requestpageText(url):
    try:
        Page = requests.session().get(url, headers=head, timeout=TimeOut)
        Page.encoding = "utf-8"  # huaban.com serves UTF-8
        return Page.text
    except BaseException as e:
        print("Network request failed...", e)

site = "http://huaban.com/favorite/beauty/"
text = requestpageText(site)  # fetch the page source
print(text)
```
2. Getting the image URL from the detail page

Continue analyzing the source: right-click on the list page and choose View Source. Look for the part containing app.page, for example:
```js
app.page = app.page || {};
app.page["$url"] = "/favorite/beauty/";
app.page["filter"] = "pin:category:beauty";
app.page["boards"] = [{"board_id":30377705, "user_id":18712058, "title":"美人志*古韵如烟",
    "description":"此情可待成追忆,只是当时已惘然。 -李商隐《锦瑟》\n北方有佳人,绝世而独立。一顾倾人城,再顾倾人国。宁不知倾城与倾国,佳人难再得!-李延年《佳人歌》",
    "category_id":"beauty", "seq":30377705, "pin_count":284, "follow_count":64, "like_count":5,
    "created_at":1466594533, "updated_at":1469077282, "deleting":0, "is_private":0, "extra":null,
    "pins":[{"pin_id":791719278, "user_id":18712058, "board_id":30377705, "file_id":41964071,
        "file":{"bucket":"hbimg", "key":"60839419e3bf0d42f7ff437b40f313103e4df2d554db8-M0M3wY", "type":"image/jpeg", "height":750, "width":500, "frames":1},
        "media_type":0, "source":null, "link":null,
        "raw_text":"曼佗罗花开时谁还能够记起从前 谁应了谁的劫谁又变成了谁的执念",
        "text_meta":{}, "via":387449420,
        ......
        "key":"c586defffb4124e6f54b56d518d2dfceb8db429121856-pTQDmI", "type":"image/jpeg", "width":446, "height":600, "frames":1}, "extra":null}}];
app.page["ads"] = {"fixedAds":[], "normalAds":[]};
```
As the name suggests, the part inside app.page["boards"] = [...] is the page data, i.e. the image list.

Paste this chunk of data into an online JSON formatter such as bejson.com.

You can then see that the page data is simply the pins [......] block repeated over and over. Let's copy that block out. In the paging URL, the value after "max=" (791718782 here) is the pin_id of the last record on the page.
...... "pins": [ { "pin_id": 791719278, "user_id": 18712058, "board_id": 30377705, "file_id": 41964071, "file": { "bucket": "hbimg", "key": "60839419e3bf0d42f7ff437b40f313103e4df2d554db8-M0M3wY", "type": "image/jpeg", "height": 750, "width": 500, "frames": 1 }, "media_type": 0, "source": null, "link": null, "raw_text": "曼佗罗花开时谁还能够记起从前 谁应了谁的劫谁又变成了谁的执念", "text_meta": { }, "via": 387449420, "via_user_id": 17022564, "original": 158377505, "created_at": 1469077282, "like_count": 3, "comment_count": 0, "repin_count": 12, "is_private": 0, "orig_source": null }, ...... |
Now open one of the images on the page, go into its detail view, right-click and copy the image address, then paste it into a new browser window:
http://hbimg.b0.upaiyun.com/09bd364bdac4e6031d721347b5d44485d2c5964031ae6-e4ZZqT_fw658
A full-size image opens; this is exactly the data we want. Comparing it with the JSON above, the address is just the CDN prefix http://hbimg.b0.upaiyun.com/ followed by the file's key (here with a _fw658 size suffix appended by the site).
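To make that connection concrete, here is a minimal sketch (the regex and the image_urls function are mine, for illustration; the full script below uses a richer pattern that also captures pin_id, like_count and repin_count):

```python
# Minimal sketch: pull every file key out of the page source and
# turn it into a downloadable image URL. image_urls is a name made
# up for this example, not part of the original script.
import re

url_image = "http://hbimg.b0.upaiyun.com/"

def image_urls(page_text):
    keys = re.findall(r'"key":"(.*?)"', page_text)
    return [url_image + key for key in keys]

sample = '"file":{"bucket":"hbimg", "key":"60839419e3bf0d42f7ff437b40f313103e4df2d554db8-M0M3wY", "type":"image/jpeg"}'
print(image_urls(sample))
# ['http://hbimg.b0.upaiyun.com/60839419e3bf0d42f7ff437b40f313103e4df2d554db8-M0M3wY']
```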
3. Downloading the images to local disk

The download itself is a small helper that streams the response body straight to a file:

```python
def downfile(file, url):
    # Stream the image at `url` into the local path `file`.
    print("Downloading:", file, url)
    try:
        r = requests.get(url, stream=True)
        with open(file, 'wb') as fd:
            for chunk in r.iter_content():
                fd.write(chunk)
    except Exception as e:
        print("Download failed", e)
```
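The stream=True flag tells Requests not to load the whole response body into memory at once; iter_content() then yields it piece by piece. Note that iter_content() with no argument yields a single byte per iteration, so passing a larger chunk size, e.g. iter_content(chunk_size=1024), is usually noticeably faster. A hypothetical call looks like this (the path and key are illustrative only):

```python
# Hypothetical usage of downfile; the path and key are made up for illustration.
downfile("D:/work/python/pic/huaban/791719278.jpg",
         "http://hbimg.b0.upaiyun.com/60839419e3bf0d42f7ff437b40f313103e4df2d554db8-M0M3wY")
```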
Full implementation

Now that all the difficult parts are solved, let's tidy everything up and write the complete script:
```python
import re
import os
import time
import requests

PhotoNum = 0  # running count of downloaded photos

PWD = "D:/work/python/pic/huaban/"
head = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
TimeOut = 30

url = "http://huaban.com/favorite/beauty/"
url_image = "http://hbimg.b0.upaiyun.com/"
urlNext = "http://huaban.com/favorite/beauty/?iqkxaeyv&limit=20&wfl=1&max="

def downfile(file, url):
    # Stream an image to local disk.
    print("Downloading:", file, url)
    try:
        r = requests.get(url, stream=True)
        with open(file, 'wb') as fd:
            for chunk in r.iter_content():
                fd.write(chunk)
    except Exception as e:
        print("Download failed", e)

def requestpageText(url):
    # Fetch a page's source; on failure, pause and retry.
    try:
        Page = requests.session().get(url, headers=head, timeout=TimeOut)
        Page.encoding = "utf-8"
        return Page.text
    except Exception as e:
        print("Network request failed... retrying", e)
        time.sleep(5)
        print("Pause finished")
        return requestpageText(url)

def requestUrl(url):
    global PhotoNum
    print("*******************************************************************")
    print("Requesting:", url)
    text = requestpageText(url)
    # Capture pin_id, file key, like_count and repin_count for every pin on the page.
    pattern = re.compile(r'{"pin_id":(\d*?),.*?"key":"(.*?)",.*?"like_count":(\d*?),.*?"repin_count":(\d*?),.*?}', re.S)
    items = re.findall(pattern, text)
    print(items)
    if not items:
        return  # nothing parsed: stop instead of recursing forever
    max_pin_id = 0
    for item in items:
        max_pin_id = item[0]  # after the loop, this holds the last pin's id
        x_key = item[1]
        x_like_count = int(item[2])
        x_repin_count = int(item[3])
        # Keep only sufficiently popular images.
        if (x_repin_count > 10 and x_like_count > 10) or x_repin_count > 100 or x_like_count > 20:
            print("Downloading photo #{0}".format(PhotoNum))
            url_item = url_image + x_key
            filename = PWD + str(max_pin_id) + ".jpg"
            if os.path.isfile(filename):
                print("File already exists:", filename)
                continue
            downfile(filename, url_item)
            PhotoNum += 1
    requestUrl(urlNext + max_pin_id)  # recurse onto the next page

if not os.path.exists(PWD):
    os.makedirs(PWD)
requestUrl(url)
```
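One caveat worth noting (my observation, not part of the original article): requestUrl calls itself once per page, so a very long crawl will eventually hit Python's default recursion limit of roughly 1000 frames. A sketch of an iterative alternative, assuming the per-page work is factored into a hypothetical process_page(url) helper that returns the page's last pin_id (or None when nothing is left):

```python
# Sketch of an iterative paging loop (not from the original script).
# process_page is a hypothetical helper: it would download one page's
# qualifying images and return the last pin_id it saw, or None to stop.
def crawl(start_url, max_pages=1000):
    next_url = start_url
    for _ in range(max_pages):
        last_pin_id = process_page(next_url)
        if not last_pin_id:
            break
        next_url = urlNext + last_pin_id
```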