How to Parse HTML - A Full-Stack Engineer's Tech Blog

Blog 2016-03-09

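How to Parse HTML

Introduction:

Day-to-day work brings all sorts of conversion needs: JSON to table, table to JSON, and so on. Here, though, the table to convert is irregular (see the figure below). How do we convert it?

Precondition: the form is stored in a single database field!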
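
Because the HTML lives in a single database column, the first step is simply reading that column back as a string before handing it to a parser. A minimal sketch, assuming SQLite and a hypothetical `forms` table with an `html` column (the post does not say which database or schema is actually used):

```python
import sqlite3


def load_form_html(conn, form_id):
    """Return the stored HTML for one form, or None if the id is unknown."""
    row = conn.execute(
        'SELECT html FROM forms WHERE id = ?', (form_id,)
    ).fetchone()
    return row[0] if row else None


# In-memory database standing in for the real one (assumption).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE forms (id INTEGER PRIMARY KEY, html TEXT)')
conn.execute(
    'INSERT INTO forms (id, html) VALUES (?, ?)',
    (1, '<table><tr><td>Name</td><td>demo</td></tr></table>'),
)

html = load_form_html(conn, 1)
print(html)
```

The returned string can be passed directly to whichever parser is chosen; nothing about the parsing step depends on where the markup came from.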

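Common approaches:

1. Convert with a JavaScript script, mainly using jQuery or similar, which is fairly convenient

2. Use a Python parsing module such as SGMLParser

3. Install Node.js and run the parsing on the server side (a bit of overkill)

The form looks like this:

The HTML content is as follows: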

<table border="1" cellpadding="0" cellspacing="0" style="border-bottom: medium none; border-left: medium none; border-collapse: collapse; border-top: medium none; border-right: medium none" width="650"><tbody><tr style="height: 30px"><td style="border-bottom: windowtext 1pt solid; border-left: windowtext 1pt solid; padding-bottom: 0cm; padding-left: 5.4pt; width: 215px; padding-right: 5.4pt; height: 30px; border-top: windowtext 1pt solid; border-right: windowtext 1pt solid; padding-top: 0cm"><p align="right" style="text-align: right"><span style="font-family: 宋体"><span style="font-size: 12pt">应用名称<span style="font-size: 12pt"><span style="font-family: calibri">(</span></span><strong><span style="color: red"><span style="font-family: 宋体"><span style="font-size: 12pt">必填</span></span></span></strong><span style="font-size: 12pt"><span style="font-family: calibri">)</span></span></span></span></p></td><td style="border-bottom: windowtext 1pt solid; padding-bottom: 0cm; padding-left: 5.4pt; width: 151px; padding-right: 5.4pt; height: 30px; border-left-color: #f0f0f0; border-top: windowtext 1pt solid; border-right: windowtext 1pt solid; padding-top: 0cm"><p><br></p></td><td style="border-bottom: windowtext 1pt solid; padding-bottom: 0cm; padding-left: 5.4pt; width: 123px; padding-right: 5.4pt; height: .......................................................................................... (very long; truncated here)

URL:http://t.mreald.com/py.html

Now let's parse it with Python:

1. Commonly used parsing modules:

HTMLParser, SGMLParser, pyQuery, BeautifulSoup
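
For comparison with BeautifulSoup, here is a minimal sketch of table extraction using only the standard library's `html.parser.HTMLParser` (the Python 3 home of the HTMLParser module in the list above). The class name `TableTextParser`, the helper `parse_rows`, and the sample markup are illustrative assumptions, not code from this post.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of every <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the <tr> currently being read
        self._cell = None   # text fragments of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested <p>/<span>/<strong> tags still lands here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            # '\xa0' is what &nbsp; decodes to; normalise it to a plain space.
            text = ''.join(self._cell).replace('\xa0', ' ').strip()
            self._row.append(text)
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_rows(html):
    parser = TableTextParser()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = ('<table><tbody>'
          '<tr><td><p><span>Name</span><strong>(required)</strong></p></td>'
          '<td><p>demo&nbsp;</p></td></tr>'
          '<tr><td>Owner</td><td>ops</td><td>Phone</td><td><br></td></tr>'
          '</tbody></table>')
print(parse_rows(sample))
```

This works without any third-party install, but every tag has to be tracked by hand, which is the main reason a higher-level library is more comfortable for a table this irregular.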

Download: http://www.crummy.com/software/BeautifulSoup/bs4/download/

Docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#

2. Here we use BeautifulSoup

The code is as follows:

from urllib.request import urlopen

from bs4 import BeautifulSoup

content = urlopen('http://t.mreald.com/py.html').read()
soup = BeautifulSoup(content, 'html.parser')
# print(soup.prettify())  # uncomment to inspect the parsed tree

SEP = '    '
# The table is irregular, so which cells to print depends on the row index.
for i, tritem in enumerate(soup.find_all('tr')):
    cells = [td.get_text() for td in tritem.find_all('td')]
    if i == 3:
        # html.parser decodes &nbsp; to '\xa0', so strip that character
        # (a commented-out variant also appended cells[2] and cells[3])
        print(cells[0] + SEP + cells[1].strip('\xa0'))
    elif i == 4:
        # on this row the label/value pair starts at the second cell
        print(cells[1] + SEP + cells[2])
    elif i in (1, 2) or i > 12:
        # two label/value pairs side by side
        print(cells[0] + SEP + cells[1])
        print(cells[2] + SEP + cells[3])
    else:
        # rows 0 and 5-12: a single label/value pair
        print(cells[0] + SEP + cells[1])
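
The loop above hard-codes which cells to print for each row index. Since the goal stated in the introduction is table-to-JSON conversion, the same idea can be sketched as a small lookup of (label, value) cell positions per row, collected into a dict and serialised with `json`. The function name `rows_to_json`, the `PAIRS_BY_ROW` layout, and the sample rows are illustrative assumptions; the real positions depend on the actual form.

```python
import json

# Which (label_index, value_index) pairs to read on each row.
# Hypothetical layout for illustration; rows not listed use DEFAULT_PAIRS.
DEFAULT_PAIRS = [(0, 1)]
PAIRS_BY_ROW = {
    1: [(0, 1), (2, 3)],   # two label/value pairs side by side
    2: [(1, 2)],           # first cell is a merged heading, so skip it
}


def rows_to_json(rows):
    """Turn a list of rows (lists of cell text) into a JSON object string."""
    record = {}
    for i, cells in enumerate(rows):
        for label_idx, value_idx in PAIRS_BY_ROW.get(i, DEFAULT_PAIRS):
            if max(label_idx, value_idx) < len(cells):  # tolerate short rows
                label = cells[label_idx].strip()
                record[label] = cells[value_idx].replace('\xa0', ' ').strip()
    # ensure_ascii=False keeps Chinese labels readable in the output
    return json.dumps(record, ensure_ascii=False)


sample_rows = [
    ['Name', 'demo\xa0'],
    ['Owner', 'ops', 'Phone', '123'],
    ['Contact', 'Email', 'a@example.com'],
]
print(rows_to_json(sample_rows))
```

Keeping the row layout in a data structure rather than in `if`/`elif` branches means a change to the form only requires editing `PAIRS_BY_ROW`.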

Output:
