WeChat Official Account Article Crawler (Scraping Sogou WeChat Official Account Articles with Python)
Category: Scripts
Published: 2021-10-26 11:09:27
I am new to Python. This script scrapes WeChat official account articles through Sogou WeChat search and stores them in MySQL.
MySQL table:
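The table screenshot from the original post is not available here. The sketch below is only a guess at the layout, inferred from the columns the script reads and writes: `hd_gzh` holds the accounts to crawl (the script uses `row[1]` as the account name) and `gzh_article` holds `title`, `picture`, `author` and `content`. The `id` columns, all column types, and the `make_demo_db` helper are assumptions, and SQLite is used as a stand-in so the sketch runs without a MySQL server.

```python
import sqlite3

# Assumed schema, inferred only from the columns the script touches.
# Column types and the id columns are guesses, not the author's definitions.
SCHEMA = """
CREATE TABLE hd_gzh (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL          -- account name; the script reads it as row[1]
);
CREATE TABLE gzh_article (
    id      INTEGER PRIMARY KEY,
    title   TEXT NOT NULL,
    picture TEXT,               -- cover image URL
    author  TEXT,
    content TEXT                -- full article HTML
);
"""


def make_demo_db():
    """Create an in-memory SQLite database with the assumed tables (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


conn = make_demo_db()
conn.execute("INSERT INTO hd_gzh (name) VALUES (?)", ("example-account",))
rows = conn.execute("SELECT * FROM hd_gzh").fetchall()
print(rows[0][1])  # the value the crawler would use as the search query
```

In MySQL itself, a plain `TEXT` column is capped at 65,535 bytes, so a larger type such as `MEDIUMTEXT` or `LONGTEXT` is probably the safer choice for `content`.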
Code:
import json
import re
import socket
import time

import pymysql
import requests
from bs4 import BeautifulSoup

# Open the database connection (replace the placeholders with your own settings)
conn = pymysql.connect(
    host='your-database-host',
    port=3306,  # placeholder: replace with your port
    user='username',
    passwd='password',
    db='database-name',
    charset='utf8',
)
# Create a cursor and load the list of official accounts to crawl
cursor = conn.cursor()
cursor.execute("select * from hd_gzh")
effect_row = cursor.fetchall()

socket.setdefaulttimeout(60)
count = 1
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
}

# Abuyun IP proxy, not used for now
# proxyhost = "http-cla.abuyun.com"
# proxyport = "9030"
# # Proxy tunnel credentials
# proxyuser = "h56761606429t7uc"
# proxypass = "9168eb00c4167176"
# proxymeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
#     "host": proxyhost,
#     "port": proxyport,
#     "user": proxyuser,
#     "pass": proxypass,
# }
# proxies = {
#     "http": proxymeta,
#     "https": proxymeta,
# }


# Return True if no article with this title has been stored yet
def checkdata(name):
    # Parameterized query: the driver escapes the value, which avoids SQL injection
    sql = "select * from gzh_article where title = %s"
    found = cursor.execute(sql, (name,))
    conn.commit()
    return found == 0


# Insert one article
def insertdata(title, picture, author, content):
    sql = "insert into gzh_article (title,picture,author,content) values (%s, %s, %s, %s)"
    cursor.execute(sql, (title, picture, author, content))
    conn.commit()
    print("Inserted one row")


for row in effect_row:
    # Search Sogou for the account name stored in the second column
    newsurl = ('https://weixin.sogou.com/weixin?type=1&s_from=input&query='
               + row[1] + '&ie=utf8&_sug_=n&_sug_type_=')
    res = requests.get(newsurl, headers=headers)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')

    # Follow the first search result to Sogou's redirect page
    url = 'https://weixin.sogou.com' + soup.select('.tit a')[0]['href']
    res2 = requests.get(url, headers=headers)
    res2.encoding = 'utf-8'
    soup2 = BeautifulSoup(res2.text, 'html.parser')

    # The redirect page builds the real URL in an inline script
    pattern = re.compile(r"url \+= '(.*?)';", re.MULTILINE | re.DOTALL)
    script = soup2.find("script")
    url2 = pattern.search(script.text).group(1)
    res3 = requests.get(url2, headers=headers)
    res3.encoding = 'utf-8'
    soup3 = BeautifulSoup(res3.text, 'html.parser')

    # The account page embeds its article list as JSON in "var msglist = ...;"
    pattern2 = re.compile(r"var msglist = (.*?);$", re.MULTILINE | re.DOTALL)
    script2 = soup3.find("script", string=pattern2)
    s2 = json.loads(pattern2.search(script2.text).group(1))
    # Wait 10 s
    time.sleep(10)

    for news in s2["list"]:
        info = news["app_msg_ext_info"]
        # content_url is HTML-escaped in the page source, so restore the ampersands
        articleurl = "https://mp.weixin.qq.com" + info["content_url"]
        articleurl = articleurl.replace('&amp;', '&')
        res4 = requests.get(articleurl, headers=headers)
        res4.encoding = 'utf-8'
        soup4 = BeautifulSoup(res4.text, 'html.parser')
        if checkdata(info["title"]):
            # No manual escaping needed: the query is parameterized
            insertdata(info["title"], info["cover"], info["author"], str(soup4))
            count += 1
        # Wait 10 s
        time.sleep(10)

        # Secondary articles published in the same push
        for news2 in info["multi_app_msg_item_list"]:
            articleurl2 = "https://mp.weixin.qq.com" + news2["content_url"]
            articleurl2 = articleurl2.replace('&amp;', '&')
            res5 = requests.get(articleurl2, headers=headers)
            res5.encoding = 'utf-8'
            soup5 = BeautifulSoup(res5.text, 'html.parser')
            if checkdata(news2["title"]):
                insertdata(news2["title"], news2["cover"], news2["author"], str(soup5))
                count += 1
            # Wait 10 s
            time.sleep(10)

cursor.close()
conn.close()
print("Done")
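The step that pulls the article list out of the account page can be tried in isolation. The sketch below reuses the script's `var msglist = ...;` pattern and its field names (`list`, `app_msg_ext_info`, `content_url`, `multi_app_msg_item_list`), while the `extract_articles` helper and the sample page text are hypothetical, made up to mirror those fields. It uses `html.unescape` from the standard library in place of a manual ampersand replacement, which is a substitution and not what the original script does.

```python
import html
import json
import re

# Same pattern as the script: capture the JSON literal assigned to msglist.
MSGLIST_RE = re.compile(r"var msglist = (.*?);$", re.MULTILINE | re.DOTALL)
BASE = "https://mp.weixin.qq.com"


def extract_articles(script_text):
    """Return (title, absolute_url) pairs found in a msglist script (hypothetical helper)."""
    match = MSGLIST_RE.search(script_text)
    if match is None:
        return []  # no msglist found, e.g. the page layout differs
    articles = []
    for item in json.loads(match.group(1))["list"]:
        info = item["app_msg_ext_info"]
        # The lead article first, then any secondary articles from the same push.
        for entry in [info, *info.get("multi_app_msg_item_list", [])]:
            # content_url is HTML-escaped in the page source (&amp; instead of &).
            url = BASE + html.unescape(entry["content_url"])
            articles.append((entry["title"], url))
    return articles


# Made-up sample shaped like the fields the script reads.
sample = '''
var msglist = {"list": [{"app_msg_ext_info": {"title": "A", "content_url": "/s?x=1&amp;y=2", "multi_app_msg_item_list": [{"title": "B", "content_url": "/s?x=3&amp;y=4"}]}}]};
var other = 1;
'''
for title, url in extract_articles(sample):
    print(title, url)
```

Returning an empty list when the pattern does not match lets the caller skip an account instead of failing with an `AttributeError` on `None`, which is what the unguarded `pattern2.search(...).group(1)` call would raise.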
Original article: https://blog.csdn.net/a2398936046/article/details/88814078