web-scraping-tutorial/
├── scrapy.cfg
└── tutorial/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
Create your first spider
Now that you are all set up, you will write code to extract data from all books
in the Mystery category of books.toscrape.com.
Create a file at tutorial/spiders/books_toscrape_com.py with the following code:
from scrapy import Spider


class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"
    start_urls = [
        "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
    ]

    def parse(self, response):
        next_page_links = response.css(".next a")
        yield from response.follow_all(next_page_links)
        book_links = response.css("article a")
        yield from response.follow_all(book_links, callback=self.parse_book)

    def parse_book(self, response):
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css(".price_color::text").re_first("£(.*)"),
            "url": response.url,
        }
In the code above:

- You define a Scrapy spider class, BooksToScrapeComSpider, whose name is
  "books_toscrape_com".
- Your spider starts by sending a request for the Mystery category URL,
  http://books.toscrape.com/catalogue/category/books/mystery_3/index.html
  (start_urls), and parses the response with the default callback method,
  parse.
- The parse callback method:
  - Finds the link to the next page and, if found, yields a request for it,
    whose response will also be parsed by the parse callback method. As a
    result, the parse callback method eventually parses all pages of the
    Mystery category.
  - Finds links to book detail pages, and yields requests for them, whose
    responses will be parsed by the parse_book callback method. As a result,
    the parse_book callback method eventually parses all book detail pages
    from the Mystery category.
- The parse_book callback method extracts a record of book information with
  the book name, price, and URL.
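To see how the extraction in parse_book works in isolation, here is a minimal sketch using the parsel library (the selector engine Scrapy builds on). The HTML snippet is a hypothetical, simplified book detail page, not the real markup from books.toscrape.com:

    # Sketch of the parse_book extraction logic, outside of Scrapy.
    # Assumes a simplified, made-up HTML snippet for illustration.
    from parsel import Selector

    html = """
    <html><body>
      <h1>Sharp Objects</h1>
      <p class="price_color">£47.82</p>
    </body></html>
    """

    selector = Selector(text=html)

    # "h1::text" selects the text node inside the <h1> element.
    name = selector.css("h1::text").get()

    # re_first applies a regular expression to the selected text and
    # returns the first capture group, dropping the leading pound sign.
    price = selector.css(".price_color::text").re_first("£(.*)")

    print(name)   # Sharp Objects
    print(price)  # 47.82

Inside a spider you never construct a Selector yourself; response.css gives you the same API directly on the downloaded page.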
Now run your code:
scrapy crawl books_toscrape_com -O books.csv
Once execution finishes, the generated books.csv file will contain records for
all books from the Mystery category of books.toscrape.com in CSV format. You
can open books.csv with any spreadsheet app.
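You can also inspect the export programmatically with Python's standard csv module. The sample row below is illustrative; a real books.csv comes from running the crawl above:

    # Reading the exported CSV back with the standard library.
    # The in-memory sample stands in for a real books.csv file.
    import csv
    import io

    sample = io.StringIO(
        "name,price,url\n"
        "Sharp Objects,47.82,"
        "http://books.toscrape.com/catalogue/sharp-objects_997/index.html\n"
    )

    # DictReader maps each row to the column names from the header line.
    for row in csv.DictReader(sample):
        print(row["name"], row["price"])  # Sharp Objects 47.82

With a real file, replace the StringIO sample with open("books.csv", newline="", encoding="utf-8").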
Continue to the next chapter to learn how you can easily deploy and run your
web scraping project on the cloud.