I get an error when testing against a large batch of files (just over 9,000 HTML files):

Traceback (most recent call last):
  File "G:\2000.py", line 35, in <module>
    content = f2.read()
  File "C:\Programs\Python\Python39\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 338: invalid continuation byte
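
The traceback means the file being read is not valid UTF-8; byte 0xcc at position 338 is not followed by a valid continuation byte. With several thousand HTML files, some are likely saved in a legacy encoding. Below is a minimal sketch of one way to work around this, not a definitive fix. The helper name `read_html` and the candidate encoding list are assumptions made for illustration (`gb18030` is only a guess that suits Chinese-language pages); adjust them to whatever your files actually use.

```python
from pathlib import Path

# Assumption: these candidates are a guess; put the most likely encoding first.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030")


def read_html(path, encodings=CANDIDATE_ENCODINGS):
    """Hypothetical helper: return (text, encoding_used) for the file at path."""
    # Read raw bytes so that decoding happens under our control,
    # instead of inside f.read() where it raised in the traceback.
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            # This encoding does not fit; try the next candidate.
            continue
    # Last resort: keep going, replacing undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (replace)"
```

In the script from the traceback, `content = f2.read()` could then become something like `content, used = read_html(path)`, where `path` is whatever file name the loop is currently processing. Logging `used` for each file makes it easy to spot which ones were not UTF-8.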

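This case is a bit more involved: you first need to collect all the relevant a tags, check whether each one is hidden, and only then click it. The code below shows one way to do it. For more complicated cases, see the documentation at the two links that follow.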

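import time  # needed for time.sleep below

# Wait 2 seconds so that dynamically rendered pages can also be crawled
time.sleep(2)
# Find every a tag that needs to be expanded
elements = page.query_selector_all("a:has-text('展开阅读全文 ∨')")
# Loop over the a tags
for elem in elements:
    # Check whether the a tag is visible
    if elem.is_visible():
        # Click the a tag to expand the text
        elem.click()
        # Wait 2 seconds so the dynamic content finishes loading
        time.sleep(2)
# Read the page content
content = page.content()
# Print the index, the text line (stripped of surrounding whitespace and newlines), and the length of the response content
print('current: ', i, line, len(content))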

https://playwright.dev/python/docs/selectors#text-selector