A while back, while scraping data from Kekenet (kekenet.com), I noticed that the site's reported encoding was:
```python
>>> import requests
>>> response = requests.get('http://www.kekenet.com/read/')
>>> response.encoding
'ISO-8859-1'
>>> response.encoding = "utf-8"
```
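As a side note, if I understand requests correctly, it falls back to ISO-8859-1 for text responses whose Content-Type header declares no charset, which is what happened here; `response.apparent_encoding` (a guess made from the raw body bytes) is often closer to the truth. A minimal check:

```python
import requests

response = requests.get('http://www.kekenet.com/read/')
# .encoding comes from the HTTP headers (falling back to ISO-8859-1 when no
# charset is declared); .apparent_encoding is guessed from the body bytes.
print(response.encoding, response.apparent_encoding)
```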
Even after manually setting the encoding to utf-8, I still hit an error later while cleaning the data: I had saved the scraped text to a file, and reading it back in a for loop raised:
```python
with open('a.txt', 'r', encoding='utf-8') as f:
    for i in f:
        print(i.strip())
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 24: invalid start byte
```
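Before working around it, it can help to peek at the raw bytes near the failing spot; a quick diagnostic sketch (binary mode never decodes, so it cannot raise this error; the slice bounds are arbitrary):

```python
with open('a.txt', 'rb') as f:  # binary mode: bytes come back undecoded
    raw = f.read()

# Note: the position in the error message is relative to the chunk being
# decoded, not necessarily to the start of the file, so this is a rough guide.
print(raw[14:35])
```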
The fix is to handle the error up front when decoding:
```python
with open('a1.txt', 'rb') as f:  # read raw bytes, then decode explicitly
    text = f.read().decode('utf-8', errors='replace')
```
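Completing the round trip might look like this (a1_clean.txt is a name I made up for the cleaned copy):

```python
with open('a1.txt', 'rb') as f:
    text = f.read().decode('utf-8', errors='replace')  # bad bytes become U+FFFD

with open('a1_clean.txt', 'w', encoding='utf-8') as f:
    f.write(text)  # the rewritten file is valid UTF-8 throughout
```

Equivalently, the errors argument can be passed straight to open(), as in open('a1.txt', 'r', encoding='utf-8', errors='replace'), which skips the manual decode.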
As the sketch shows, writing the text object back out to a new file is all that's needed. The key player here is decode, whose errors parameter accepts the following values:
Value | Meaning |
---|---|
'strict' | Raise UnicodeError (or a subclass); this is the default. Implemented in strict_errors(). |
'ignore' | Ignore the malformed data and continue without further notice. Implemented in ignore_errors(). |
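To see those two in action, a small demonstration on a couple of bytes that are not valid UTF-8 (the byte values are arbitrary):

```python
data = b'\xb8\xdf'  # 0xb8 is a continuation byte, invalid as a start byte

try:
    data.decode('utf-8')  # 'strict' is the default and raises
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xb8 ... invalid start byte

print(repr(data.decode('utf-8', errors='ignore')))  # '' (bad bytes are dropped)
```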
The following values apply only to text encodings:
Value | Meaning |
---|---|
'replace' | Replace with a suitable replacement marker; on decoding, Python's built-in codecs use the official U+FFFD REPLACEMENT CHARACTER. Implemented in replace_errors(). |
'xmlcharrefreplace' | Replace with the appropriate XML character reference (encoding only). Implemented in xmlcharrefreplace_errors(). |
'backslashreplace' | Replace with backslashed escape sequences. Implemented in backslashreplace_errors(). |
'namereplace' | Replace with \N{...} escape sequences (encoding only). Implemented in namereplace_errors(). |
'surrogateescape' | On decoding, replace each offending byte with an individual surrogate code ranging from U+DC80 to U+DCFF. These codes are then turned back into the same bytes when the 'surrogateescape' error handler is used on encoding. (See PEP 383 for more.) |
In addition, the following error handler is specific to the given codecs:
Value | Codecs | Meaning |
---|---|---|
'surrogatepass' | utf-8, utf-16, utf-32, utf-16-be, utf-16-le, utf-32-be, utf-32-le | Allow encoding and decoding of surrogate codes. These codecs normally treat the presence of surrogates as an error. |
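The decoding-side handlers from the tables above can be compared directly (again, the byte values are arbitrary):

```python
data = b'abc\xb8'  # the trailing byte is not valid UTF-8

print(data.decode('utf-8', errors='replace'))           # abc + U+FFFD
print(data.decode('utf-8', errors='backslashreplace'))  # abc\xb8 (as literal text)

s = data.decode('utf-8', errors='surrogateescape')      # 'abc\udcb8'
print(s.encode('utf-8', errors='surrogateescape') == data)  # True: lossless round trip

# the encoding-only handlers work on the str -> bytes direction:
print('€'.encode('ascii', errors='xmlcharrefreplace'))  # b'&#8364;'
```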
If you're curious, you can reproduce the scene of the accident yourself:
```python
import time
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from multiprocessing import cpu_count


def write_file(content_url):
    content_response = requests.get(url=content_url)
    content_response.encoding = 'utf-8'
    # print(content_response)
    print(content_response.url, content_response.status_code)
    if content_response.status_code == 200:
        with open('a.txt', 'a', encoding='utf8') as f:
            soup = BeautifulSoup(content_response.text, 'html.parser')
            div = soup.find(name='div', attrs={'id': 'article'})
            p_list = div.find_all(name='p')
            for i in p_list:
                f.write(i.text)


def content(detail_url):
    """Scrape an article detail page."""
    response = requests.get(detail_url)
    # print(response.url, response.status_code, response.encoding)
    if response.status_code == 200:  # skip invalid links
        # response.encoding = 'utf8'
        soup = BeautifulSoup(response.text, 'html.parser')
        ul = soup.find(name='ul', attrs={'id': 'menu-list'})
        h2_list = ul.find_all('h2')
        for h2 in h2_list:
            content_url = h2.find_all(name='a')[-1].get('href')
            write_file(content_url)


def category(category_url):
    """Scrape a category listing page."""
    t = ThreadPoolExecutor(cpu_count() * 5)
    response = requests.get(category_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    ul = soup.find(name='div', attrs={'class': 'page th'})
    end = int([i.text for i in ul.find_all(name='a')][-1])  # last pager link = page count
    # print(category_url, end)
    for i in range(1, end + 1):
        detail_url = category_url + 'List_%s.shtml' % i
        # print(detail_url)
        t.submit(content, detail_url)


def spider():
    """Scrape the bilingual news section at http://www.kekenet.com/read/."""
    p = ProcessPoolExecutor(cpu_count() * 2)
    response = requests.get('http://www.kekenet.com/read/')
    soup = BeautifulSoup(response.text, 'html.parser')
    div_list = soup.find_all(name='div', attrs={'class': 'genxin'})
    for i in div_list[1:10]:
        category_url = 'http://www.kekenet.com' + i.find('h1').find('a').get('href')
        # print(category_url)
        p.submit(category, category_url)
    p.shutdown()


if __name__ == '__main__':
    s = time.time()
    spider()  # on my slow connection this fetched about 64 MB, taking about as long as a meal
    print(time.time() - s)
```
P.S. The code above was personally tested and working as of 2019-04-16.
See also: [codecs — Codec registry and base classes](https://docs.python.org/3/library/codecs.html#module-codecs) | [PEP 383](https://www.python.org/dev/peps/pep-0383) | [Stack Overflow: UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte](https://stackoverflow.com/questions/22216076/unicodedecodeerror-utf8-codec-cant-decode-byte-0xa5-in-position-0-invalid-s) Corrections welcome, that's all.