
Some time ago, while scraping data from Kekenet (可可英语), I ran into trouble because the site's encoding was detected as:

```python
>>> import requests
>>> response = requests.get('http://www.kekenet.com/read/')
>>> response.encoding
'ISO-8859-1'
>>> response.encoding = "utf-8"
```

Even though I manually set the encoding to utf-8, the data-cleaning step that followed still failed: after saving the scraped data to a file, reading it back in a for loop raised an error:

```python
with open('a.txt', 'r', encoding='utf-8') as f:
    for i in f:
        print(i.strip())
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 24: invalid start byte
```
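
The byte 0xb8 is telling: it can appear inside GB2312/GBK-encoded text but can never begin a UTF-8 sequence, hence "invalid start byte". A minimal reproduction of this class of failure, assuming the stray bytes came from a GB-encoded page:

```python
>>> '可可'.encode('gbk')          # GBK bytes for the site's name
b'\xbf\xc9\xbf\xc9'
>>> b'\xbf\xc9\xbf\xc9'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 0: invalid start byte
```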

The fix is to handle the error up front:

```python
# Read the raw bytes and decode them with a lenient error handler,
# so malformed bytes become U+FFFD instead of raising mid-read.
with open('a.txt', 'rb') as f:
    text = f.read().decode('utf-8', errors='replace')
```
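
Text-mode open() accepts the same errors parameter, so the lenient read can also be done without the binary round trip:

```python
# Equivalent shortcut: let the text-mode reader substitute U+FFFD
# for malformed bytes as it goes, instead of raising mid-loop.
with open('a.txt', 'r', encoding='utf-8', errors='replace') as f:
    for i in f:
        print(i.strip())
```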

Then just write the text object from the example above out to a new file and the problem is gone. The key here is decode(), whose errors parameter supports the following handlers:

| Value | Meaning |
| --- | --- |
| 'strict' | Raise UnicodeError (or a subclass); this is the default. Implemented in strict_errors(). |
| 'ignore' | Ignore the malformed data and continue without further notice. Implemented in ignore_errors(). |
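
A quick look at both general-purpose handlers, reusing the 0xb8 byte from the traceback above (the surrounding letters are just illustrative padding):

```python
>>> b'abc\xb8def'.decode('utf-8')                  # 'strict' is the default
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 3: invalid start byte
>>> b'abc\xb8def'.decode('utf-8', errors='ignore')
'abcdef'
```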

The following handlers apply to text encodings only:

| Value | Meaning |
| --- | --- |
| 'replace' | Replace with a suitable replacement marker; on decoding, Python's built-in codecs use the official U+FFFD REPLACEMENT CHARACTER. Implemented in replace_errors(). |
| 'xmlcharrefreplace' | Replace with the appropriate XML character reference (encoding only). Implemented in xmlcharrefreplace_errors(). |
| 'backslashreplace' | Replace with backslashed escape sequences. Implemented in backslashreplace_errors(). |
| 'namereplace' | Replace with \N{...} escape sequences (encoding only). Implemented in namereplace_errors(). |
| 'surrogateescape' | On decoding, replace each invalid byte with an individual surrogate code in the range U+DC80 to U+DCFF; these codes turn back into the same bytes when the 'surrogateescape' handler is used on encoding. (See PEP 383 for more.) |
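
A few of these in action on the same invalid input (b'caf\xe9' is Latin-1 bytes, not valid UTF-8):

```python
>>> bad = b'caf\xe9'
>>> bad.decode('utf-8', errors='replace')
'caf�'
>>> bad.decode('utf-8', errors='backslashreplace')
'caf\\xe9'
>>> text = bad.decode('utf-8', errors='surrogateescape')
>>> text
'caf\udce9'
>>> text.encode('utf-8', errors='surrogateescape')    # lossless round trip
b'caf\xe9'
>>> '可'.encode('ascii', errors='xmlcharrefreplace')  # encoding-only handler
b'&#21487;'
```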

In addition, the following error handler is specific to the codecs listed:

| Value | Codecs | Meaning |
| --- | --- | --- |
| 'surrogatepass' | utf-8, utf-16, utf-32, utf-16-be, utf-16-le, utf-32-be, utf-32-le | Allow encoding and decoding of surrogate code points. These codecs normally treat the presence of surrogates as an error. |
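
For example, a lone surrogate such as U+D800 is rejected by the utf-8 codec unless surrogatepass is given:

```python
>>> '\ud800'.encode('utf-8')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
>>> '\ud800'.encode('utf-8', errors='surrogatepass')
b'\xed\xa0\x80'
>>> b'\xed\xa0\x80'.decode('utf-8', errors='surrogatepass')
'\ud800'
```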

If you're curious, you can reconstruct the scene of the accident by hand:

```python
import time
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from multiprocessing import cpu_count


def write_file(content_url):
    """Fetch one article page and append its paragraphs to a.txt."""
    content_response = requests.get(url=content_url)
    content_response.encoding = 'utf-8'  # the manual override from the intro
    # print(content_response)
    print(content_response.url, content_response.status_code)

    if content_response.status_code == 200:
        with open('a.txt', 'a', encoding='utf8') as f:
            soup = BeautifulSoup(content_response.text, 'html.parser')
            div = soup.find(name='div', attrs={'id': 'article'})
            p_list = div.find_all(name='p')
            for i in p_list:
                f.write(i.text)


def content(detail_url):
    """Scrape article links from one list page and save each article."""

    response = requests.get(detail_url)
    # print(response.url, response.status_code, response.encoding)
    if response.status_code == 200:  # skip invalid links
        # response.encoding = 'utf8'
        soup = BeautifulSoup(response.text, 'html.parser')
        ul = soup.find(name='ul', attrs={'id': 'menu-list'})
        h2_list = ul.find_all('h2')
        for h2 in h2_list:
            content_url = h2.find_all(name='a')[-1].get('href')
            write_file(content_url)


def category(category_url):
    """Scrape a category's paginated article-list pages."""
    t = ThreadPoolExecutor(cpu_count() * 5)
    response = requests.get(category_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    pager = soup.find(name='div', attrs={'class': 'page th'})
    end = int([i.text for i in pager.find_all(name='a')][-1])  # last page number
    # print(category_url, end)
    for i in range(1, end + 1):
        detail_url = category_url + 'List_%s.shtml' % i
        # print(detail_url)
        t.submit(content, detail_url)


def spider():
    """Scrape the bilingual news section at http://www.kekenet.com/read/."""
    p = ProcessPoolExecutor(cpu_count() * 2)
    response = requests.get('http://www.kekenet.com/read/')
    soup = BeautifulSoup(response.text, 'html.parser')
    div_list = soup.find_all(name='div', attrs={'class': 'genxin'})
    for i in div_list[1:10]:
        category_url = 'http://www.kekenet.com' + i.find('h1').find('a').get('href')
        # print(category_url)
        p.submit(category, category_url)
    p.shutdown()


if __name__ == '__main__':
    s = time.time()
    spider()  # on my slow connection this fetched roughly 64 MB and took about as long as a meal
    print(time.time() - s)
```
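
If you want the crawler itself to stop producing mojibake in the first place, one option (a sketch, not the original code) is to trust the body-based guess per page rather than forcing utf-8. The helper name write_file_safe is hypothetical:

```python
def write_file_safe(content_url):
    """Variant of write_file that guesses each page's encoding from its body."""
    content_response = requests.get(url=content_url)
    # Use the body-based guess instead of requests' ISO-8859-1 fallback
    # or a hard-coded utf-8.
    content_response.encoding = content_response.apparent_encoding
    if content_response.status_code == 200:
        soup = BeautifulSoup(content_response.text, 'html.parser')
        div = soup.find(name='div', attrs={'id': 'article'})
        if div is not None:  # some pages may lack the article container
            with open('a.txt', 'a', encoding='utf-8') as f:
                for p in div.find_all(name='p'):
                    f.write(p.text)
```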

PS: the code above was personally tested and working as of 2019-04-16.


see also: [codecs - Codec registry and base classes](https://docs.python.org/3/library/codecs.html#module-codecs) | [PEP 383](https://www.python.org/dev/peps/pep-0383) | [Stack Overflow: UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte](https://stackoverflow.com/questions/22216076/unicodedecodeerror-utf8-codec-cant-decode-byte-0xa5-in-position-0-invalid-s)

Corrections welcome. That's all.