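A Summary of Several Ways to Parse HTML in Python 3
Parsing HTML is an important data-processing step after a crawler has fetched a page. This post records several ways to do it. The listings use Python 2 modules and syntax (`urllib2`, `StringIO`, `sgmllib`, the `print` statement), so they need porting before they run on Python 3.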
First, a basic helper function that fetches the HTML and outputs the parsed result:

    # Python 2 code: urllib2 and StringIO do not exist in Python 3.
    import gzip
    import StringIO
    import urllib2

    # The parser is passed in as an argument so it is easy to swap later.
    # bs4_paraser and lxml_parser (defined below) must exist before this runs.
    def get_html(url, paraser=bs4_paraser):
        headers = {
            'Accept': '*/*',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Host': 'www.360kan.com',
            'Proxy-Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
        }
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request)
        if response.code == 200:
            # The request advertises gzip, so decompress the body before parsing.
            data = StringIO.StringIO(response.read())
            gzipper = gzip.GzipFile(fileobj=data)
            data = gzipper.read()
            value = paraser(data)  # open('E:/h5/haPkY0osd0r5UB.html').read()
            return value
        else:
            # Return an empty list so the loop below does not fail on None.
            return []

    value = get_html('http://www.360kan.com/m/haPkY0osd0r5UB.html', paraser=lxml_parser)
    for row in value:
        print row

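For readers on Python 3, the fetch-and-decompress step of the helper above might look roughly like the sketch below. It is only an illustration: `decode_body`, `fetch_and_parse`, and `GZIP_MAGIC` are hypothetical names, not part of the original post, and the sketch checks the gzip magic bytes before decompressing instead of assuming every response is compressed.

```python
import gzip
import io
import urllib.request

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def decode_body(raw: bytes) -> bytes:
    """Return the decompressed body if raw is gzip data, else raw unchanged."""
    if raw[:2] == GZIP_MAGIC:
        with gzip.GzipFile(fileobj=io.BytesIO(raw)) as gz:
            return gz.read()
    return raw


def fetch_and_parse(url, parser, headers=None):
    """Fetch url, decompress the body if needed, and hand it to parser.

    parser is any callable that takes the page bytes and returns a list.
    """
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request) as response:
        if response.status != 200:
            return []
        return parser(decode_body(response.read()))


# Offline check of the decompression step with a made-up page body.
page = b"<li class='item'>demo</li>"
assert decode_body(gzip.compress(page)) == page
assert decode_body(page) == page
```

Keeping the decompression in its own function makes it testable without any network access, which is why only that part is exercised here.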
1. Parsing with lxml
The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. The latest release works with all CPython versions from 2.6 to 3.5. See the introduction for more information about background and goals of the lxml project. Some common questions are answered in the FAQ. [Official site](http://lxml.de/)

    import re
    from lxml import etree

    def lxml_parser(page):
        data = []
        doc = etree.HTML(page)
        all_li = doc.xpath('//li[@class="yingping-list-wrap"]')
        for row in all_li:
            # Each review is one "item" element.
            all_li_item = row.xpath('.//li[@class="item"]')  # find_all('li', attrs={'class': 'item'})
            for r in all_li_item:
                value = {}
                # Title part of the review
                title = r.xpath('.//li[@class="g-clear title-wrap"][1]')
                value['title'] = title[0].xpath('./a/text()')[0]
                value['title_href'] = title[0].xpath('./a/@href')[0]
                # The score is encoded as a number in the inline style.
                score_text = title[0].xpath('./li/span/span/@style')[0]
                score_text = re.search(r'\d+', score_text).group()
                value['score'] = int(score_text) / 20
                # Time
                value['time'] = title[0].xpath('./li/span[@class="time"]/text()')[0]
                # Number of people who liked the review
                value['people'] = int(
                    re.search(r'\d+', title[0].xpath('./li[@class="num"]/span/text()')[0]).group())
                data.append(value)
        return data

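The score line in the listing above is easy to misread: it takes the first run of digits from the inline `style` attribute and divides by 20. A minimal sketch of that arithmetic follows, assuming the style holds a percentage width such as `width:80%` (an assumption, since the post does not show the markup), so that 0 to 100 maps to 0 to 5 stars. `score_from_style` is a hypothetical helper name. Note that `/` on two ints floors in Python 2 but returns a float in Python 3, so the sketch uses `//` to keep the integer result.

```python
import re


def score_from_style(style: str) -> int:
    """Map the first integer found in a style string to a 0-5 score."""
    match = re.search(r"\d+", style)
    if match is None:
        raise ValueError("no digits in style: %r" % style)
    # Floor division mirrors what int / int does in Python 2.
    return int(match.group()) // 20


print(score_from_style("width:80%"))   # 4
print(score_from_style("width:100%"))  # 5
```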
2. Using BeautifulSoup. It is well documented online, so I won't go into detail here.

    import re
    from bs4 import BeautifulSoup

    def bs4_paraser(html):
        all_value = []
        value = {}
        soup = BeautifulSoup(html, 'html.parser')
        # Section that contains the reviews
        all_li = soup.find_all('li', attrs={'class': 'yingping-list-wrap'}, limit=1)
        for row in all_li:
            # Each review is one "item" element.
            all_li_item = row.find_all('li', attrs={'class': 'item'})
            for r in all_li_item:
                # Title part of the review
                title = r.find_all('li', attrs={'class': 'g-clear title-wrap'}, limit=1)
                if title is not None and len(title) > 0:
                    value['title'] = title[0].a.string
                    value['title_href'] = title[0].a['href']
                    score_text = title[0].li.span.span['style']
                    score_text = re.search(r'\d+', score_text).group()
                    value['score'] = int(score_text) / 20
                    # Time
                    value['time'] = title[0].li.find_all('span', attrs={'class': 'time'})[0].string
                    # Number of people who liked the review
                    value['people'] = int(
                        re.search(r'\d+', title[0].find_all('li', attrs={'class': 'num'})[0].span.string).group())
                # print r
                all_value.append(value)
                # Start a fresh dict so the next review does not overwrite this one.
                value = {}
        return all_value

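3. Using SGMLParser. Parsing is driven by start-tag and end-tag callbacks. The parsing process is easy to follow but somewhat tedious, and this particular case is not a great fit for the approach (haha).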

    import re
    from sgmllib import SGMLParser  # Python 2 only; sgmllib was removed in Python 3

    class CommentParaser(SGMLParser):
        def __init__(self):
            SGMLParser.__init__(self)
            # Flags recording which nested <li> elements are currently open
            self.__start_li_yingping = False
            self.__start_li_item = False
            self.__start_li_gclear = False
            self.__start_li_ratingwrap = False
            self.__start_li_num = False
            # <a>
            self.__start_a = False
            # State of the current <span>
            self.__span_state = 0
            # Data
            self.__value = {}
            self.data = []

        def start_li(self, attrs):
            for k, v in attrs:
                if k == 'class' and v == 'yingping-list-wrap':
                    self.__start_li_yingping = True
                elif k == 'class' and v == 'item':
                    self.__start_li_item = True
                elif k == 'class' and v == 'g-clear title-wrap':
                    self.__start_li_gclear = True
                elif k == 'class' and v == 'rating-wrap g-clear':
                    self.__start_li_ratingwrap = True
                elif k == 'class' and v == 'num':
                    self.__start_li_num = True

        def end_li(self):
            # Close the innermost element that is still open.
            if self.__start_li_yingping:
                if self.__start_li_item:
                    if self.__start_li_gclear:
                        if self.__start_li_num or self.__start_li_ratingwrap:
                            if self.__start_li_num:
                                self.__start_li_num = False
                            if self.__start_li_ratingwrap:
                                self.__start_li_ratingwrap = False
                        else:
                            self.__start_li_gclear = False
                    else:
                        # End of one review item: store it and start a new one.
                        self.data.append(self.__value)
                        self.__value = {}
                        self.__start_li_item = False
                else:
                    self.__start_li_yingping = False

        def start_a(self, attrs):
            if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear:
                self.__start_a = True
                for k, v in attrs:
                    if k == 'href':
                        self.__value['href'] = v

        def end_a(self):
            if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear and self.__start_a:
                self.__start_a = False

        def start_span(self, attrs):
            if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear:
                if self.__start_li_ratingwrap:
                    if self.__span_state != 1:
                        for k, v in attrs:
                            if k == 'class' and v == 'rating':
                                self.__span_state = 1
                            elif k == 'class' and v == 'time':
                                self.__span_state = 2
                    else:
                        # Span nested inside the rating span: the score is in its style.
                        for k, v in attrs:
                            if k == 'style':
                                score_text = re.search(r'\d+', v).group()
                                self.__value['score'] = int(score_text) / 20
                                self.__span_state = 3
                elif self.__start_li_num:
                    self.__span_state = 4

        def end_span(self):
            self.__span_state = 0

        def handle_data(self, data):
            if self.__start_a:
                self.__value['title'] = data
            elif self.__span_state == 2:
                self.__value['time'] = data
            elif self.__span_state == 4:
                score_text = re.search(r'\d+', data).group()
                self.__value['people'] = int(score_text)

    def sgl_parser(html):
        parser = CommentParaser()
        parser.feed(html)
        return parser.data

4. Using HTMLParser. The principle is similar to method 3; only the callback methods differ, and most of the logic can be shared.

    import re
    import HTMLParser  # Python 2 module name; in Python 3 it is html.parser

    class CommentHTMLParser(HTMLParser.HTMLParser):
        def __init__(self):
            HTMLParser.HTMLParser.__init__(self)
            # Flags recording which nested <li> elements are currently open
            self.__start_li_yingping = False
            self.__start_li_item = False
            self.__start_li_gclear = False
            self.__start_li_ratingwrap = False
            self.__start_li_num = False
            # <a>
            self.__start_a = False
            # State of the current <span>
            self.__span_state = 0
            # Data
            self.__value = {}
            self.data = []

        def handle_starttag(self, tag, attrs):
            if tag == 'li':
                for k, v in attrs:
                    if k == 'class' and v == 'yingping-list-wrap':
                        self.__start_li_yingping = True
                    elif k == 'class' and v == 'item':
                        self.__start_li_item = True
                    elif k == 'class' and v == 'g-clear title-wrap':
                        self.__start_li_gclear = True
                    elif k == 'class' and v == 'rating-wrap g-clear':
                        self.__start_li_ratingwrap = True
                    elif k == 'class' and v == 'num':
                        self.__start_li_num = True
            elif tag == 'a':
                if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear:
                    self.__start_a = True
                    for k, v in attrs:
                        if k == 'href':
                            self.__value['href'] = v
            elif tag == 'span':
                if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear:
                    if self.__start_li_ratingwrap:
                        if self.__span_state != 1:
                            for k, v in attrs:
                                if k == 'class' and v == 'rating':
                                    self.__span_state = 1
                                elif k == 'class' and v == 'time':
                                    self.__span_state = 2
                        else:
                            # Span nested inside the rating span: the score is in its style.
                            for k, v in attrs:
                                if k == 'style':
                                    score_text = re.search(r'\d+', v).group()
                                    self.__value['score'] = int(score_text) / 20
                                    self.__span_state = 3
                    elif self.__start_li_num:
                        self.__span_state = 4

        def handle_endtag(self, tag):
            if tag == 'li':
                # Close the innermost element that is still open.
                if self.__start_li_yingping:
                    if self.__start_li_item:
                        if self.__start_li_gclear:
                            if self.__start_li_num or self.__start_li_ratingwrap:
                                if self.__start_li_num:
                                    self.__start_li_num = False
                                if self.__start_li_ratingwrap:
                                    self.__start_li_ratingwrap = False
                            else:
                                self.__start_li_gclear = False
                        else:
                            # End of one review item: store it and start a new one.
                            self.data.append(self.__value)
                            self.__value = {}
                            self.__start_li_item = False
                    else:
                        self.__start_li_yingping = False
            elif tag == 'a':
                if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear and self.__start_a:
                    self.__start_a = False
            elif tag == 'span':
                self.__span_state = 0

        def handle_data(self, data):
            if self.__start_a:
                self.__value['title'] = data
            elif self.__span_state == 2:
                self.__value['time'] = data
            elif self.__span_state == 4:
                score_text = re.search(r'\d+', data).group()
                self.__value['people'] = int(score_text)

    def html_parser(html):
        parser = CommentHTMLParser()
        parser.feed(html)
        return parser.data

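The callback style of methods 3 and 4 is easier to see on a smaller problem. The sketch below uses Python 3's `html.parser.HTMLParser` to collect the link text and `href` of every `<a>` inside an `<li class="item">`. It is a simplified illustration, not the author's parser: `TitleLinkParser`, `parse_links`, and the sample markup are hypothetical, and it tracks open `<li>` elements with a stack instead of the separate boolean flags used in the listings.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect {'href', 'title'} for every <a> inside <li class="item">."""

    def __init__(self):
        super().__init__()
        self._li_stack = []    # one bool per open <li>: is it an "item"?
        self._in_link = False  # True while inside a link we want to record
        self._current = {}
        self.data = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            self._li_stack.append(attrs.get("class") == "item")
        elif tag == "a" and any(self._li_stack):
            self._in_link = True
            self._current = {"href": attrs.get("href"), "title": ""}

    def handle_endtag(self, tag):
        if tag == "li" and self._li_stack:
            self._li_stack.pop()
        elif tag == "a" and self._in_link:
            self._in_link = False
            self.data.append(self._current)

    def handle_data(self, data):
        # Text can arrive in several chunks, so accumulate it.
        if self._in_link:
            self._current["title"] += data


def parse_links(html):
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.data


sample = (
    '<ul><li class="item"><a href="/r/1">Good film</a></li>'
    '<li class="other"><a href="/x">skip</a></li>'
    '<li class="item"><a href="/r/2">Too long</a></li></ul>'
)
print(parse_links(sample))
# [{'href': '/r/1', 'title': 'Good film'}, {'href': '/r/2', 'title': 'Too long'}]
```

A stack handles nested `<li>` elements without one flag per class name, at the cost of being less explicit about which container is open.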
Methods 3 and 4 are admittedly not well suited to this case. I am recording them while I have time, for study purposes.
Original article: https://blog.csdn.net/yilovexing/article/details/79675672