Got an error when testing on a large batch of files (over 9,000 HTML files):
Traceback (most recent call last):
  File "G:\2000.py", line 35, in <module>
    content = f2.read()
  File "C:\Programs\Python\Python39\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 338: invalid continuation byte
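The traceback means at least one of the files is not valid UTF-8. A minimal sketch of one way to get past it, assuming the batch mixes UTF-8 with a Chinese legacy encoding such as GBK (that is only a guess from the byte 0xcc, the traceback does not confirm it). The helper name `read_text_with_fallback` and the encoding list are made up for illustration:

```python
def read_text_with_fallback(path, encodings=("utf-8", "gb18030")):
    """Return (text, encoding_used) for the file at `path`.

    Tries each encoding in order. The candidate list is an assumption;
    change it to match whatever your files really use.
    """
    # Read raw bytes so decoding can be retried without reopening the file.
    with open(path, "rb") as f:
        data = f.read()

    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one

    # Last resort: keep going, but undecodable bytes become U+FFFD.
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

In the script from the traceback, this would replace the `open(...)` / `f2.read()` pair. Logging the returned encoding for each file shows which of the 9,000 files are the odd ones out.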
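This one is a bit more involved: you first need to collect all the relevant a tags, check whether each one is hidden, and only then click it. The code below shows one way to do it. For more complicated cases, see the documentation in the two links that follow.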
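# Fragment: assumes `import time`, and that `page`, `i` and `line` come from the surrounding loop
# Wait 2 seconds so that dynamically loaded pages can be scraped too
time.sleep(2)
# Find every a tag that needs expanding ('展开阅读全文 ∨' is the link text, "Expand to read full text")
elements = page.query_selector_all("a:has-text('展开阅读全文 ∨')")
# Loop over the a tags
for elem in elements:
    # Check whether the a tag is visible
    if elem.is_visible():
        # Click the a tag to expand it
        elem.click()
        # Wait 2 seconds so the dynamic content finishes loading
        time.sleep(2)
# Read the page content
content = page.content()
# Print the text line (leading/trailing whitespace and newlines stripped) and the length of the response content
print('current: ', i, line, len(content))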
https://playwright.dev/python/docs/selectors#text-selector
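If you need this on many pages, the same steps can be wrapped in a small helper. This is only a sketch: the function name `expand_collapsed_links` and its parameters are made up for illustration, and `page` is assumed to be a Playwright sync-API `Page`. It uses `page.wait_for_timeout` instead of `time.sleep` and pauses once after all clicks rather than after each one, which is a design choice of this sketch, not something the code above does.

```python
def expand_collapsed_links(page, selector="a:has-text('展开阅读全文 ∨')", pause_ms=2000):
    """Click every visible element matching `selector`; return how many were clicked.

    `page` is assumed to be a Playwright sync-API Page. The default selector
    is the one from the snippet above; `pause_ms` is a hypothetical knob.
    """
    clicked = 0
    for elem in page.query_selector_all(selector):
        # Hidden links cannot be clicked, so skip them.
        if elem.is_visible():
            elem.click()
            clicked += 1

    # Pause only if something was expanded, to let the new content render.
    if clicked:
        page.wait_for_timeout(pause_ms)
    return clicked
```

Call `expand_collapsed_links(page)` right before `page.content()`. A return value of 0 means no visible expand link was found, which is worth logging when running over thousands of files.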