Scraping WeChat Official Account Articles with Python and Saving Them as PDF Files (Fixing the Missing-Images Problem)

2023-10-24

Preface

This is my first blog post. It covers scraping articles from a WeChat official account and saving them locally as PDF files.

Scraping WeChat Official Account Articles (Using wechatsogou)

1. Installation

pip install wechatsogou --upgrade

wechatsogou is a scraper interface for WeChat official accounts, built on Sogou's WeChat search.

2. Usage

Basic usage looks like this:

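import wechatsogou

# captcha_break_time is the number of retries after a wrong captcha entry (default: 1)
ws_api = wechatsogou.WechatSogouAPI(captcha_break_time=3)

# Name of the official account
gzh_name = ''

# Returns information on the account's 10 most recent articles as a dict
data = ws_api.get_gzh_article_by_history(gzh_name)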

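Structure of data:

{
    'gzh': {
        'wechat_name': '',     # name
        'wechat_id': '',       # WeChat ID
        'introduction': '',    # introduction
        'authentication': '',  # verification
        'headimage': ''        # avatar
    },
    'article': [
        {
            'send_id': int,         # mass-send ID; not unique, because several messages sent in one batch share the same ID
            'datetime': int,        # send time, 10-digit Unix timestamp
            'type': '',             # message type; always 49, meaning image-and-text (the mobile history page has other types, but the web page listing the latest 10 messages only has 49)
            'main': int,            # whether this is the first message of a mass send: 1 or 0
            'title': '',            # article title
            'abstract': '',         # summary
            'fileid': int,
            'content_url': '',      # article link
            'source_url': '',       # "Read the original" link
            'cover': '',            # cover image
            'author': '',           # author
            'copyright_stat': int,  # article type, e.g. original
        },
        ...
    ]
}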
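The 'datetime' field is a 10-digit Unix timestamp in seconds, which is not very readable as is. A minimal sketch of converting it to a date string, assuming UTC output is acceptable; the helper name format_send_time is made up for this example:

```python
from datetime import datetime, timezone


def format_send_time(timestamp):
    """Convert a 10-digit Unix timestamp (seconds) to a UTC date string."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')


print(format_send_time(1600000000))  # 2020-09-13 12:26:40
```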

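Only two pieces of information are needed from this structure: the article title and the article URL.

Once you have the article URL, you can convert the HTML page at that URL into a PDF file.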
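As a small illustration of pulling out those two fields, the sketch below works on a hand-written dictionary shaped like the structure above, so it runs without network access. The helper name extract_title_and_url and the sample values are invented for the example; only the 'article', 'title', and 'content_url' keys come from the structure described earlier.

```python
def extract_title_and_url(data):
    """Return a list of (title, content_url) pairs from a history dict."""
    pairs = []
    for article in data.get('article', []):
        title = article.get('title', '')
        url = article.get('content_url', '')
        # Skip entries without a link, since there is nothing to convert.
        if url:
            pairs.append((title, url))
    return pairs


# Hypothetical sample data, shaped like the structure described above.
sample = {
    'gzh': {'wechat_name': 'demo'},
    'article': [
        {'title': 'First post', 'content_url': 'https://example.com/a'},
        {'title': 'No link', 'content_url': ''},
    ],
}

print(extract_title_and_url(sample))
```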

Generating PDF Files

1. Install wkhtmltopdf

2. Install pdfkit

pip install pdfkit

3. Usage

import pdfkit

# Generate a PDF from a URL
pdfkit.from_url('http://baidu.com', 'out.pdf')

# Generate a PDF from an HTML file
pdfkit.from_file('test.html', 'out.pdf')

# Generate a PDF from an HTML string
pdfkit.from_string('Hello!', 'out.pdf')
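pdfkit is a wrapper around the wkhtmltopdf executable, so these calls fail if that binary cannot be found. A minimal sketch of checking for it first, assuming that a lookup on PATH is enough for your setup. The helper names find_wkhtmltopdf and make_pdfkit_config are hypothetical; pdfkit.configuration(wkhtmltopdf=...) and the configuration keyword argument are part of pdfkit.

```python
import shutil


def find_wkhtmltopdf():
    """Return the full path of wkhtmltopdf if it is on PATH, else None."""
    return shutil.which('wkhtmltopdf')


def make_pdfkit_config():
    """Build a pdfkit configuration that points at the located binary."""
    import pdfkit  # imported here so the lookup above works without pdfkit

    path = find_wkhtmltopdf()
    if path is None:
        raise RuntimeError('wkhtmltopdf was not found on PATH')
    return pdfkit.configuration(wkhtmltopdf=path)


if __name__ == '__main__':
    print(find_wkhtmltopdf() or 'wkhtmltopdf is not installed or not on PATH')
```

The returned object can then be passed along, for example pdfkit.from_string('Hello!', 'out.pdf', configuration=make_pdfkit_config()).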

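If you generate the PDF directly from the article URL obtained above, the article's images do not appear in the PDF file.

Solution:

# This method processes the HTML of the article at the given URL so that the images display
content_info = ws_api.get_article_content(url)

# Get the HTML code (it is incomplete; tags such as head and body need to be added)
html_code = content_info['content_html']

Then build a complete HTML document from html_code and call pdfkit.from_string() to generate the PDF file. The article's images now appear in the PDF.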
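One way to build that complete document is sketched below. It assumes a bare wrapper with a UTF-8 charset declaration is sufficient; the function name build_full_html is invented for illustration, and the sample fragment stands in for content_info['content_html'].

```python
import html


def build_full_html(title, body_fragment):
    """Wrap an HTML fragment in a complete document with head and body tags."""
    # Escape the title so characters such as & or < cannot break the markup.
    safe_title = html.escape(title)
    return (
        '<!DOCTYPE html>\n'
        '<html>\n'
        '<head>\n'
        '<meta charset="UTF-8">\n'
        f'<title>{safe_title}</title>\n'
        '</head>\n'
        '<body>\n'
        f'{body_fragment}\n'
        '</body>\n'
        '</html>\n'
    )


# Hypothetical fragment standing in for the scraped article HTML.
page = build_full_html('A & B', '<p>Hello</p>')
print(page)
```

The resulting string can be handed to pdfkit.from_string(page, 'out.pdf').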

Complete Code

import os
import datetime

import pdfkit
import wechatsogou

# Initialize the API
ws_api = wechatsogou.WechatSogouAPI(captcha_break_time=3)


def url2pdf(url, title, targetPath):
    '''
    Generate a PDF file with pdfkit.
    :param url: article URL
    :param title: article title
    :param targetPath: directory in which to store the PDF file
    :return: True if a PDF was written, False if the article could not be fetched
    '''
    try:
        content_info = ws_api.get_article_content(url)
    except Exception:
        return False
    # Processed HTML: wrap the article fragment in a minimal page with head and body tags
    html = f'''
    <!DOCTYPE html>
    <html>
    <head>
    <meta charset="UTF-8">
    <title>{title}</title>
    </head>
    <body>
    {content_info['content_html']}
    </body>
    </html>
    '''
    try:
        pdfkit.from_string(html, os.path.join(targetPath, f'{title}.pdf'))
    except Exception:
        # Some article titles contain special characters and cannot be used as file names
        filename = datetime.datetime.now().strftime('%Y%m%d%H%M%S') + '.pdf'
        pdfkit.from_string(html, os.path.join(targetPath, filename))
    return True


if __name__ == '__main__':
    # Name of the official account to scrape
    gzh_name = ''
    targetPath = os.path.join(os.getcwd(), gzh_name)
    # Create the target folder if it does not exist
    os.makedirs(targetPath, exist_ok=True)
    # Returns information on the account's 10 most recent articles as a dict
    data = ws_api.get_gzh_article_by_history(gzh_name)
    article_list = data['article']
    for article in article_list:
        url = article['content_url']
        title = article['title']
        url2pdf(url, title, targetPath)
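The complete code falls back to a timestamp when a title cannot be used as a file name. An alternative, offered here as an assumption rather than as part of the original approach, is to clean the title first so the PDF keeps a recognizable name. The helper sanitize_filename and its fallback value are hypothetical, and the character set targets Windows, which is the strictest common case.

```python
import re

# Characters that Windows forbids in file names, plus ASCII control characters.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def sanitize_filename(title, fallback='untitled'):
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = _INVALID_CHARS.sub('_', title)
    # Trailing dots and spaces are also rejected on Windows.
    cleaned = cleaned.strip().rstrip('. ')
    return cleaned or fallback


print(sanitize_filename('Q&A: what is "PDF"?'))
```

With this in place, the file name could be built as sanitize_filename(title) + '.pdf' before calling pdfkit.from_string().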

Original link: https://hbdhgg.com/3/163779.html
