
Some time ago, while scraping data from Kekenet (可可英语), I ran into trouble because the site's encoding was detected as:

```python
>>> import requests
>>> response = requests.get('http://www.kekenet.com/read/')
>>> response.encoding
'ISO-8859-1'
>>> response.encoding = "utf-8"
```

Even though I manually set the encoding to utf-8, the data-cleaning step that followed still failed: after saving the scraped data to a file, reading it back in a for loop raised an error:

```python
with open('a.txt', 'r', encoding='utf-8') as f:
    for i in f:
        print(i.strip())
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 24: invalid start byte
```
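
The byte 0xb8 is telling: it can appear inside GB2312/GBK-encoded text but can never begin a UTF-8 sequence, hence "invalid start byte". A minimal reproduction of this class of failure, assuming the stray bytes came from a GB-encoded page:

```python
>>> '可可'.encode('gbk')          # GBK bytes for the site's name
b'\xbf\xc9\xbf\xc9'
>>> b'\xbf\xc9\xbf\xc9'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 0: invalid start byte
```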

The fix is to handle the error up front:

```python
# Read the raw bytes and decode them with a lenient error handler,
# so malformed bytes become U+FFFD instead of raising mid-read.
with open('a.txt', 'rb') as f:
    text = f.read().decode('utf-8', errors='replace')
```
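
Text-mode open() accepts the same errors parameter, so the lenient read can also be done without the binary round trip:

```python
# Equivalent shortcut: let the text-mode reader substitute U+FFFD
# for malformed bytes as it goes, instead of raising mid-loop.
with open('a.txt', 'r', encoding='utf-8', errors='replace') as f:
    for i in f:
        print(i.strip())
```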

Then just write the text object from the example above out to a new file and the problem is gone. The key here is decode(), whose errors parameter supports the following handlers:

| Value | Meaning |
| --- | --- |
| 'strict' | Raise UnicodeError (or a subclass); this is the default. Implemented in strict_errors(). |
| 'ignore' | Ignore the malformed data and continue without further notice. Implemented in ignore_errors(). |
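
A quick look at both general-purpose handlers, reusing the 0xb8 byte from the traceback above (the surrounding letters are just illustrative padding):

```python
>>> b'abc\xb8def'.decode('utf-8')                  # 'strict' is the default
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 3: invalid start byte
>>> b'abc\xb8def'.decode('utf-8', errors='ignore')
'abcdef'
```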

The following handlers apply to text encodings only:

| Value | Meaning |
| --- | --- |
| 'replace' | Replace with a suitable replacement marker; on decoding, Python's built-in codecs use the official U+FFFD REPLACEMENT CHARACTER. Implemented in replace_errors(). |
| 'xmlcharrefreplace' | Replace with the appropriate XML character reference (encoding only). Implemented in xmlcharrefreplace_errors(). |
| 'backslashreplace' | Replace with backslashed escape sequences. Implemented in backslashreplace_errors(). |
| 'namereplace' | Replace with \N{...} escape sequences (encoding only). Implemented in namereplace_errors(). |
| 'surrogateescape' | On decoding, replace each invalid byte with an individual surrogate code in the range U+DC80 to U+DCFF; these codes turn back into the same bytes when the 'surrogateescape' handler is used on encoding. (See PEP 383 for more.) |
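
A few of these in action on the same invalid input (b'caf\xe9' is Latin-1 bytes, not valid UTF-8):

```python
>>> bad = b'caf\xe9'
>>> bad.decode('utf-8', errors='replace')
'caf�'
>>> bad.decode('utf-8', errors='backslashreplace')
'caf\\xe9'
>>> text = bad.decode('utf-8', errors='surrogateescape')
>>> text
'caf\udce9'
>>> text.encode('utf-8', errors='surrogateescape')    # lossless round trip
b'caf\xe9'
>>> '可'.encode('ascii', errors='xmlcharrefreplace')  # encoding-only handler
b'&#21487;'
```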

In addition, the following error handler is specific to the codecs listed:

| Value | Codecs | Meaning |
| --- | --- | --- |
| 'surrogatepass' | utf-8, utf-16, utf-32, utf-16-be, utf-16-le, utf-32-be, utf-32-le | Allow encoding and decoding of surrogate code points. These codecs normally treat the presence of surrogates as an error. |
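
For example, a lone surrogate such as U+D800 is rejected by the utf-8 codec unless surrogatepass is given:

```python
>>> '\ud800'.encode('utf-8')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
>>> '\ud800'.encode('utf-8', errors='surrogatepass')
b'\xed\xa0\x80'
>>> b'\xed\xa0\x80'.decode('utf-8', errors='surrogatepass')
'\ud800'
```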

If you're curious, you can reconstruct the scene of the accident by hand:

```python
import time
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from multiprocessing import cpu_count


def write_file(content_url):
    """Fetch one article page and append its paragraphs to a.txt."""
    content_response = requests.get(url=content_url)
    content_response.encoding = 'utf-8'  # the manual override from the intro
    # print(content_response)
    print(content_response.url, content_response.status_code)

    if content_response.status_code == 200:
        with open('a.txt', 'a', encoding='utf8') as f:
            soup = BeautifulSoup(content_response.text, 'html.parser')
            div = soup.find(name='div', attrs={'id': 'article'})
            p_list = div.find_all(name='p')
            for i in p_list:
                f.write(i.text)


def content(detail_url):
    """Scrape article links from one list page and save each article."""

    response = requests.get(detail_url)
    # print(response.url, response.status_code, response.encoding)
    if response.status_code == 200:  # skip invalid links
        # response.encoding = 'utf8'
        soup = BeautifulSoup(response.text, 'html.parser')
        ul = soup.find(name='ul', attrs={'id': 'menu-list'})
        h2_list = ul.find_all('h2')
        for h2 in h2_list:
            content_url = h2.find_all(name='a')[-1].get('href')
            write_file(content_url)


def category(category_url):
    """Scrape a category's paginated article-list pages."""
    t = ThreadPoolExecutor(cpu_count() * 5)
    response = requests.get(category_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    pager = soup.find(name='div', attrs={'class': 'page th'})
    end = int([i.text for i in pager.find_all(name='a')][-1])  # last page number
    # print(category_url, end)
    for i in range(1, end + 1):
        detail_url = category_url + 'List_%s.shtml' % i
        # print(detail_url)
        t.submit(content, detail_url)


def spider():
    """Scrape the bilingual news section at http://www.kekenet.com/read/."""
    p = ProcessPoolExecutor(cpu_count() * 2)
    response = requests.get('http://www.kekenet.com/read/')
    soup = BeautifulSoup(response.text, 'html.parser')
    div_list = soup.find_all(name='div', attrs={'class': 'genxin'})
    for i in div_list[1:10]:
        category_url = 'http://www.kekenet.com' + i.find('h1').find('a').get('href')
        # print(category_url)
        p.submit(category, category_url)
    p.shutdown()


if __name__ == '__main__':
    s = time.time()
    spider()  # on my slow connection this fetched roughly 64 MB and took about as long as a meal
    print(time.time() - s)
```
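
If you want the crawler itself to stop producing mojibake in the first place, one option (a sketch, not the original code) is to trust the body-based guess per page rather than forcing utf-8. The helper name write_file_safe is hypothetical:

```python
def write_file_safe(content_url):
    """Variant of write_file that guesses each page's encoding from its body."""
    content_response = requests.get(url=content_url)
    # Use the body-based guess instead of requests' ISO-8859-1 fallback
    # or a hard-coded utf-8.
    content_response.encoding = content_response.apparent_encoding
    if content_response.status_code == 200:
        soup = BeautifulSoup(content_response.text, 'html.parser')
        div = soup.find(name='div', attrs={'id': 'article'})
        if div is not None:  # some pages may lack the article container
            with open('a.txt', 'a', encoding='utf-8') as f:
                for p in div.find_all(name='p'):
                    f.write(p.text)
```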

PS: the code above was personally tested and working as of 2019-04-16.


see also: [codecs - Codec registry and base classes](https://docs.python.org/3/library/codecs.html#module-codecs) | [PEP 383](https://www.python.org/dev/peps/pep-0383) | [Stack Overflow: UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte](https://stackoverflow.com/questions/22216076/unicodedecodeerror-utf8-codec-cant-decode-byte-0xa5-in-position-0-invalid-s)

Corrections welcome. That's all.