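I started programming not long ago and ran into this problem. I want to collect stock data from the site https://statusinvest.com.br/acoes/petr4. Apparently the values are rendered with JavaScript, though, and BeautifulSoup does not pick them up. I would appreciate any help understanding this.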
Posted on 2022-11-19 07:47:13
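That section not only needs JS to load, it doesn't actually load until you scroll to it. You could try to figure out which request and/or which piece of JS renders that section and then replicate it in Python, but I think using Selenium would be easier. I even have this function to make it more convenient to automate some of the simpler, common interactions before scraping the HTML: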
#### FIRST PASTE [or DOWNLOAD&IMPORT] FUNCTION DEF from https://pastebin.com/kEC9gPC8 ####
soup = linkToSoup_selenium(
    'https://statusinvest.com.br/acoes/petr4',
    clickFirst='//strong[@data-item="avg_F"]',  # it actually just has to scroll, not click [but I haven't added an option for that yet]
    ecx='//strong[@data-item="avg_F"][text()!="-"]'  # waits till this loads
)
if soup is not None:
    print({
        t.find_previous_sibling().get_text(' ').strip(): t.get_text(' ').strip()
        for t in soup.select('div#payout-section span.title + strong.value')
    })
prints
{'MÉDIA': '83,32%', 'ATUAL': '124,13% \n ( 48,97% acima da média )', 'MENOR\xa0VALOR': '26,35% \n ( 2019 )', 'MAIOR\xa0VALOR': '144,51% \n \n( 2020 )'}
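The scraped strings still carry line breaks, non-breaking spaces and Brazilian decimal commas. If you want numbers rather than display text, something like the following might do as a starting point. The helper names `clean_text` and `first_percent` are made up for this sketch, and it assumes the first percentage in each cell is the one you care about.

```python
import re
from typing import Optional


def clean_text(raw: str) -> str:
    """Collapse whitespace runs (\\s also matches the non-breaking space) into one space."""
    return re.sub(r'\s+', ' ', raw).strip()


def first_percent(raw: str) -> Optional[float]:
    """Return the first 'NN,NN%' figure in the text as a float, or None if there is none."""
    match = re.search(r'(\d[\d.]*(?:,\d+)?)\s*%', raw)
    if match is None:
        return None
    # Brazilian format: '.' groups thousands, ',' marks the decimals.
    return float(match.group(1).replace('.', '').replace(',', '.'))


# The dictionary printed above, used here as sample input.
scraped = {
    'MÉDIA': '83,32%',
    'ATUAL': '124,13% \n ( 48,97% acima da média )',
    'MENOR\xa0VALOR': '26,35% \n ( 2019 )',
    'MAIOR\xa0VALOR': '144,51% \n \n( 2020 )',
}
cleaned = {clean_text(key): first_percent(value) for key, value in scraped.items()}
print(cleaned)
```

The text in parentheses (the year, or the difference from the average) is dropped here; `clean_text` on the value would keep it as a single tidy line instead.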
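EDIT: I eventually noticed the API used to fetch the data ( https://statusinvest.com.br/acao/payoutresult?code=petr4&companyid=408&type=0 ). You can actually build that URL from the HTML that is available even before the JS loading happens: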
soup.select_one('#payout-section[data-company][data-code]').attrs
should return
{'id': 'payout-section', 'data-company': '408', 'data-code': 'petr4', 'data-category': '1'}
so the URL can be built with
payout = soup.select_one('#payout-section[data-company][data-code]')
if payout:
    compId, dCode = payout.get('data-company'), payout.get('data-code')
    apiUrl = 'https://statusinvest.com.br/acao'
    apiUrl = f'{apiUrl}/payoutresult?code={dCode}&companyid={compId}&type=0'
I think the type parameter is for the time window: 0 for 5 years, 1 for 10 years, and 2 for the maximum window.
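If you want to switch windows without remembering the numbers, a small helper could wrap the URL building. This is only a sketch: `build_payout_url` and the `WINDOW_TYPES` labels are names invented here, and the mapping simply encodes the guess above about what `type` means, which the site does not document.

```python
from urllib.parse import urlencode

# Assumed meaning of the `type` query parameter (a guess, not confirmed by the site).
WINDOW_TYPES = {'5y': 0, '10y': 1, 'max': 2}


def build_payout_url(code: str, company_id: str, window: str = '5y') -> str:
    """Build the payoutresult URL from the data-code / data-company attribute values."""
    query = urlencode({
        'code': code,
        'companyid': company_id,
        'type': WINDOW_TYPES[window],  # raises KeyError for an unknown label
    })
    return f'https://statusinvest.com.br/acao/payoutresult?{query}'


print(build_payout_url('petr4', '408'))
print(build_payout_url('petr4', '408', window='max'))
```

Using `urlencode` instead of an f-string also takes care of escaping if a code ever contains characters that are not URL-safe.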
requests.get(apiUrl, headers=headers).json()
should return something like
{
  "actual": 124.12623323305537,
  "avg": 83.32096287339556,
  "avgDifference": 48.97359434223362,
  "minValue": 26.353309862919502,
  "minValueRank": 2019,
  "maxValue": 144.51093035368598,
  "maxValueRank": 2020,
  "actual_F": "124,13%",
  "avg_F": "83,32%",
  "avgDifference_F": "48,97% acima da m\u00e9dia",
  "minValue_F": "26,35%",
  "minValueRank_F": "2019",
  "maxValue_F": "144,51%",
  "maxValueRank_F": "2020",
  "chart": {
    "categoryUnique": true,
    "category": [
      "2018",
      "2019",
      "2020",
      "2021",
      "2022"
    ],
    "series": {
      "percentual": [
        {"value": 27.189302754606462, "value_F": "27,19%"},
        {"value": 26.353309862919502, "value_F": "26,35%"},
        {"value": 144.51093035368598, "value_F": "144,51%"},
        {"value": 94.42503816271046, "value_F": "94,43%"},
        {"value": 124.12623323305537, "value_F": "124,13%"}
      ],
      "proventos": [
        {"value": 7009130357.11, "value_F": "R$ 7.009.130.357,11", "valueSmall_F": "7,01 B"},
        {"value": 10577427979.68, "value_F": "R$ 10.577.427.979,68", "valueSmall_F": "10,58 B"},
        {"value": 10271836929.54, "value_F": "R$ 10.271.836.929,54", "valueSmall_F": "10,27 B"},
        {"value": 100721299707.4, "value_F": "R$ 100.721.299.707,40", "valueSmall_F": "100,72 B"},
        {"value": 179966901777.61, "value_F": "R$ 179.966.901.777,61", "valueSmall_F": "179,97 B"}
      ],
      "lucroLiquido": [
        {"value": 25779000000.0, "value_F": "R$ 25.779.000.000,00", "valueSmall_F": "25,78 B"},
        {"value": 40137000000.0, "value_F": "R$ 40.137.000.000,00", "valueSmall_F": "40,14 B"},
        {"value": 7108000000.0, "value_F": "R$ 7.108.000.000,00", "valueSmall_F": "7,11 B"},
        {"value": 106668000000.0, "value_F": "R$ 106.668.000.000,00", "valueSmall_F": "106,67 B"},
        {"value": 144987000000.0, "value_F": "R$ 144.987.000.000,00", "valueSmall_F": "144,99 B"}
      ]
    }
  }
}
You can then pull whichever values you want from there. (I think it also includes the chart data.)
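As a rough illustration of pulling the chart data apart, the sketch below pairs each year in `chart.category` with the value at the same position in every series. It assumes the series lists line up with the category list one to one, which is how the response above looks but is not something the API guarantees. `payout_by_year` is a name made up for this example, and `sample` is a cut-down copy of the response shown above.

```python
def payout_by_year(payload: dict) -> dict:
    """Map each chart year to {series name: numeric value} (assumes aligned lists)."""
    chart = payload['chart']
    years = chart['category']
    series = chart['series']
    return {
        year: {name: points[index]['value'] for name, points in series.items()}
        for index, year in enumerate(years)
    }


# Trimmed sample with the first two years of the response shown above.
sample = {
    'chart': {
        'category': ['2018', '2019'],
        'series': {
            'percentual': [
                {'value': 27.189302754606462, 'value_F': '27,19%'},
                {'value': 26.353309862919502, 'value_F': '26,35%'},
            ],
            'proventos': [
                {'value': 7009130357.11, 'value_F': 'R$ 7.009.130.357,11'},
                {'value': 10577427979.68, 'value_F': 'R$ 10.577.427.979,68'},
            ],
        },
    }
}
by_year = payout_by_year(sample)
print(by_year['2019'])
```

With the real response you would pass the result of `.json()` instead of `sample`.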
Posted on 2022-11-19 07:41:35
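Hopefully the OP's next question will include a minimal, reproducible example. In the meantime, here is one way of getting some data from that page using requests and BeautifulSoup: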
from bs4 import BeautifulSoup as bs
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
r = requests.get('https://statusinvest.com.br/acoes/petr4', headers=headers)
soup = bs(r.text, 'html.parser')
valor_atual = soup.select_one('h3:-soup-contains("Valor atual")').find_next('strong').text
min_52_semanas = soup.select_one('h3:-soup-contains("Min. 52 semanas")').find_next('strong').text
print('Valor atual:', valor_atual)
print('Min. 52 semanas:', min_52_semanas)
### and now some values hydrated in page by Javascript, from an API endpoint:
api_url = 'https://statusinvest.com.br/acao/payoutresult?code=petr4&companyid=408&type=0'
api_headers = {
'referer': 'https://statusinvest.com.br/acoes/petr4',