Encoding problems when scraping web pages with Python

Source: Baidu Zhidao | Editor: UC Zhidao | Time: 2024/05/28 06:19:53

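I have only just started with Python, and I cannot get past an encoding problem on the very first page I tried to scrape. Could someone more experienced help? The code is below: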

# coding: gb2312
import urllib2
from BeautifulSoup import BeautifulSoup

outfile = open("lsxk.txt", "w")
for i in range(1, 2):  # first page only
    url = ("http://bbs.scu.edu.cn/wForum/disparticle.php?"
           "boardName=SCUExpress&ID=1735295349&pos=-1&page=%d" % i)
    doc = urllib2.urlopen(url).read()
    soup = BeautifulSoup(doc, fromEncoding="gb2312")
    print >> outfile, doc  # write the raw page to the text file
outfile.close()

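The content written to the txt file is readable Chinese, but once the page goes through BeautifulSoup it turns into garbage.
The BS documentation says to use soup = BeautifulSoup(doc, fromEncoding="gb2312"),
but the result is still not encoded correctly. Printing it raises an error saying that gb2312 and utf-8 cannot encode certain characters in the page.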

Decoding the file on its own does not solve it either:
import codecs
print open("lsxk.txt").read().decode("utf-8")
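
The failure above is what one would expect if lsxk.txt holds the page's raw GB-encoded bytes, because such bytes are generally not valid UTF-8. Below is a minimal sketch of reading a file back with a codec that matches how it was written. It is written for Python 3, whereas the thread itself uses Python 2. The sample string, the temporary file, and the helper names write_and_read and utf8_read_fails are all made up for illustration and stand in for the real page and lsxk.txt.

```python
import io
import os
import tempfile

SAMPLE = u"中文网页"  # hypothetical stand-in for the scraped page text


def write_and_read(text, write_codec, read_codec):
    """Write text to a temp file with one codec, read it back with another."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with io.open(path, "w", encoding=write_codec) as out:
            out.write(text)
        with io.open(path, "r", encoding=read_codec) as src:
            return src.read()
    finally:
        os.remove(path)


def utf8_read_fails(text, write_codec):
    """Return True if a file written with write_codec cannot be read as UTF-8."""
    try:
        write_and_read(text, write_codec, "utf-8")
    except UnicodeDecodeError:
        return True
    return False
```

Reading with the same codec that produced the bytes round-trips the text, while reading GB-encoded Chinese as UTF-8 raises UnicodeDecodeError.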

I may not have explained this very clearly. Those are the problems I ran into, and I would be very grateful for any help!

The problem is your BeautifulSoup version. With 3.03 it works:

import urllib2
from BeautifulSoup import BeautifulSoup

# urllib2 is the module imported above, so urlopen must be called on it
f = urllib2.urlopen('http://www.baidu.com')

html = f.read()
f.close()
soup = BeautifulSoup()
soup.feed(html)
print soup

You cannot specify the encoding directly in BeautifulSoup().
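
One way to avoid depending on the constructor for the encoding is to decode the downloaded bytes yourself, so that whatever parser comes next receives text rather than bytes. The sketch below is written for Python 3 and is not the answerer's method. The helper name decode_html and the candidate list are assumptions made for illustration; a charset declared in the HTTP headers or in a meta tag, when present, is usually a better guide than guessing.

```python
def decode_html(raw, candidates=("utf-8", "gb18030")):
    """Return (text, codec) using the first candidate that decodes raw strictly.

    The candidate order is an assumption: UTF-8 is tried first because
    GB-encoded Chinese text is rarely valid UTF-8, so a wrong guess fails fast.
    """
    for codec in candidates:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the final candidate and replace bad bytes
    # with U+FFFD instead of raising.
    fallback = candidates[-1]
    return raw.decode(fallback, errors="replace"), fallback


# Usage with a made-up page body instead of a real download.
page_bytes = u"<p>中文网页</p>".encode("gb18030")
text, used = decode_html(page_bytes)
```

The returned text can then be passed to the parser, and the returned codec name records which guess succeeded.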

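Also, for the encoding, prefer GB18030 over GBK where you can. GB18030 is a superset of GBK, which in turn is a superset of GB2312, so it can handle characters that the narrower codecs reject.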
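
The difference between the three codecs can be checked directly. This is a small Python 3 sketch; the helper name can_encode and the two sample characters are chosen here for illustration and do not come from the page in the question.

```python
def can_encode(text, codec):
    """Return True if every character of text is representable in codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


RARE = u"镕"             # present in GBK but absent from GB2312
BEYOND = u"\U00020000"   # CJK Extension B character, outside GBK

# Collect, per character, which of the three codecs can represent it.
coverage = {
    ch: [codec for codec in ("gb2312", "gbk", "gb18030") if can_encode(ch, codec)]
    for ch in (RARE, BEYOND)
}
```

A page that is labelled gb2312 but contains a character like the first one would fail under the gb2312 codec, which matches the kind of error described in the question.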
