How a Crawler Scrapes Web Page Data: Crawling JD.com Phone Images

2023-12-09

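# 1. Fetch the page HTML.
# 2. Use a regular expression to extract the block that holds the product list.
# 3. Within that block, use further regular expressions to match the image URLs.
# 4. Save the images.
import os
import re
import urllib.error
import urllib.request


def craw(url, page):
    html = urllib.request.urlopen(url).read()
    # Decode the bytes; str(html) would yield the "b'...'" repr instead of the text.
    html = html.decode('utf-8', errors='ignore')
    # First extract the section that contains all the product images.
    # re.DOTALL lets "." match across line breaks.
    pat1 = r'<div id="plist".+?<div class="clr">'
    result1 = re.findall(pat1, html, re.DOTALL)
    if result1:
        result1 = result1[0]
        # Pattern for normally loaded images.
        pat2 = r'<img width="220" height="220" data-img="1" src="//(.+?\.jpg)"'
        # Pattern for lazy-loaded images.
        pat3 = r'<img width="220" height="220" data-img="1" data-lazy-img="//(.+?\.jpg)"'
        imagelist1 = re.findall(pat2, result1)
        imagelist2 = re.findall(pat3, result1)
        # Merge both lists of images.
        imagelist = imagelist1 + imagelist2
        # Make sure the output directory exists before saving.
        os.makedirs('jd', exist_ok=True)
        x = 1
        for imageurl in imagelist:
            # Name the saved file after the page and image number.
            imagename = 'jd/' + str(page) + '_' + str(x) + '.jpg'
            # Full image address.
            imageurl = 'http://' + imageurl
            try:
                # Download the image and save it.
                urllib.request.urlretrieve(imageurl, filename=imagename)
            except urllib.error.URLError as e:
                # HTTPError (a URLError subclass) carries a code; both carry a reason.
                if hasattr(e, 'code'):
                    print(e.code)
                if hasattr(e, 'reason'):
                    print(e.reason)
            x += 1
        print('Crawl succeeded')
    else:
        print('Crawl failed: no content found')


# Pagination
for i in range(1, 2):
    # Build the URL for each page.
    url = 'https://list.jd.com/list.html?cat=9987,653,655&page=' + str(i)
    craw(url, i)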
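
The two-stage regex matching in steps 2 and 3 can be tried without any network access. The following is only a minimal sketch: the `SAMPLE_HTML` fragment is written by hand to mimic the structure the patterns expect (the host names and paths in it are made up, and the live JD page may look different), and `extract_image_urls` is a hypothetical helper name that does not appear in the script above.

```python
import re

# Hypothetical sample markup, written by hand to mimic the structure the
# patterns expect; it is an assumption, not a copy of the real page.
SAMPLE_HTML = """
<div id="plist">
  <img width="220" height="220" data-img="1" src="//img10.example.com/a/phone1.jpg">
  <img width="220" height="220" data-img="1" data-lazy-img="//img11.example.com/b/phone2.jpg">
</div>
<div class="clr"></div>
"""

# Stage 1: isolate the product-list block. re.DOTALL lets "." span newlines.
BLOCK_PATTERN = re.compile(r'<div id="plist".+?<div class="clr">', re.DOTALL)
# Stage 2: image paths from normally loaded and lazy-loaded <img> tags.
EAGER_PATTERN = re.compile(
    r'<img width="220" height="220" data-img="1" src="//(.+?\.jpg)"')
LAZY_PATTERN = re.compile(
    r'<img width="220" height="220" data-img="1" data-lazy-img="//(.+?\.jpg)"')


def extract_image_urls(html):
    """Return full image URLs found inside the product-list block."""
    block = BLOCK_PATTERN.search(html)
    if block is None:
        # No product-list block: nothing to extract.
        return []
    section = block.group(0)
    paths = EAGER_PATTERN.findall(section) + LAZY_PATTERN.findall(section)
    # The captured paths are protocol-relative, so prepend a scheme.
    return ["http://" + path for path in paths]


if __name__ == "__main__":
    for url in extract_image_urls(SAMPLE_HTML):
        print(url)
```

Keeping the extraction separate from the download makes it possible to check the patterns against a saved copy of the page before fetching any images.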