about

官档：https://beautifulsoup.readthedocs.io/zh_CN/latest/
参考我的老博客：https://www.cnblogs.com/Neeo/articles/7810508.html

简单来说，Beautiful Soup是python的一个库，最主要的功能是，我们通过网络请求抓取到HTML的数据，然后通过Beautiful Soup来解析HTML文档，进而帮助我们提取想要的数据。

官方解释如下：

Beautiful Soup提供一些简单的、python式的函数用来处理导航、搜索、修改分析树等功能。它是一个工具箱，通过解析文档为用户提供需要抓取的数据，因为简单，所以不需要多少代码就可以写出一个完整的应用程序。

安装

# pip install beautifulsoup4==4.12.2
pip install beautifulsoup4==4.12.2 -i https://pypi.tuna.tsinghua.edu.cn/simple

Beautiful Soup在解析文档时，还需要指定解析器，除了默认的Python内置的HTML解析器之外，还支持一些第三方的解析器，这里推荐使用三方解析器lxml，该解析器更加强大，速度更快，推荐安装：

bash

pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple

当然了，不同的解析器虽然大致一致，但是极少的情况会出现解析结果不同的情况，这点如果你遇到这种疑惑的时候，请考虑是否使用的是同一个解释器，或者考虑更换解析器。

各种解析器的对比，来官档：https://beautifulsoup.readthedocs.io/zh_CN/latest/#id12

安装完了之后，我们就来看看怎么使用吧。

快速上手

python

# 1. 导包 虽然我们下载包下载的是beautifulsoup4，但是使用的时候，却是bs4，这点要务必牢记
from bs4 import BeautifulSoup

# 2. 被解析的HTML文档，假如我们从网络上下载了如下一段文档，我们要提取里面的超链接或者文本
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story"><span>xxx</span></p>
"""

# 3. 按照如下方式实例化BeautifulSoup(待解析的文档，解析器)
soup = BeautifulSoup(html_doc, 'lxml')
# 我们会得到一个文档对象，我们通过各种方法来解析查找想要的内容
print(soup.find_all('a'))  # 查找文档中的所有的超链接
"""
[
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
]
"""
for i in soup.find_all('a'):
    print(i.get("href"), i.text)  # 提取超链接中的href值和文本
"""
http://example.com/elsie Elsie
http://example.com/lacie Lacie
http://example.com/tillie Tillie
"""

是不是非常简单！！！

再来几个其它的姿势：

python

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story"><span>xxx</span></p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# 获取第一个匹配的title标签
print(soup.title)  # <title>The Dormouse's story</title>
# 获取第一个匹配的title标签的名字
print(soup.title.name)  # title

# 获取第一个匹配的title标签的文本
print(soup.title.text)  # The Dormouse's story

# 获取第一个匹配的title标签的父级标签
print(soup.title.parent)  # <head><title>The Dormouse's story</title></head>

# 获取第一个匹配的title标签的父级标签的名字
print(soup.title.parent.name)  # head

# 获取第一个匹配的title标签的父级的文本
print(soup.title.parent.text)  # The Dormouse's story

对节点和子节点解析

一个Tag可能包含多个字符串或其它的Tag，这些都是这个Tag的子节点。

Beautiful Soup提供了许多操作和遍历子节点的属性。

注意: Beautiful Soup中字符串节点不支持这些属性因为字符串没有子节点。

python

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story"><span>xxx</span></p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# 拿到指定节点
body = soup.body

# print(body.contents)  # 以列表的形式返回该节点的所有子节点

# 下面这么取值报错了，因为列表的0索引位置取得值不是一个标签，而是字符串，字符串没有子节点
# print(body.contents[0].contents)   # AttributeError: 'NavigableString' object has no attribute 'contents'

p = body.contents[1]  # 可以按照索引取值，取出子节点

# 可以对拿到的新的节点进行操作，获取想要的数据，比如获取p标签的文本
print(p)   # <p class="title"><b>The Dormouse's story</b></p>
print(p.text)  # The Dormouse's story

# 如果该节点有子节点的话，我们仍然可以通过contents继续以列表的形式取子节点
p = body.contents[-2]  # 可以按照索引取值，取出子节点
print(p.contents)   # [<span>xxx</span>]

对节点内容进行解析

对节点进行取值操作，可以通过下面几个方法来处理：

soup.text：获取指定节点的内容，如果该节点内包含一个或多个子节点，那么它会获取所有子节点的内容。
soup.string：

python

PC有道云翻译

stream_rsa

元博网

config

全国招投标网

微信公众平台md5

爬取肯德基门店信息

网易云音乐

试客联盟

长房网

about

安装

快速上手

对节点和子节点解析

对节点内容进行解析

config

about ​

安装 ​

快速上手 ​

对节点和子节点解析 ​

对节点内容进行解析 ​

about

安装

快速上手

对节点和子节点解析

对节点内容进行解析