Understanding Web Scraping
Web scraping is the technique of extracting data from web pages in an automated way. Two related terms matter in this context: web crawling and web scraping. Web crawling is the process of navigating through web pages, following links, and collecting their content, while web scraping is the extraction of specific data from a page.
Web crawling is the more complex part, since it involves managing navigation and dealing with blocks, errors, and changes in page structure. Web scraping, in contrast, is simpler: it focuses only on extracting the data you need.
Traditionally, the main tools used for web scraping are Python libraries such as BeautifulSoup and Selenium. However, JavaScript has also stood out in this scenario, with libraries like Puppeteer, Playwright, and Cheerio.
Crawlee and Apify: JavaScript Solutions
Two tools that are transforming the way we handle web scraping in JavaScript are Crawlee and Apify. Crawlee is a framework that abstracts the complexities of web crawling, allowing you to focus only on what you want to collect (the “what”) and where you want to collect it from (the “where”).
Crawlee uses libraries like Puppeteer, Playwright, and Cheerio for navigation and data extraction. It also brings advanced features such as proxy management, IP rotation, and error handling, which are common needs in web scraping projects.
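As a quick illustration of the proxy features, here is a minimal, hedged sketch of plugging a list of proxies into a crawler through ProxyConfiguration; the proxy URLs are placeholders and the target site is used only as an example:
// proxy-example.ts (illustrative sketch; proxy URLs are placeholders)
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Crawlee rotates between the listed proxies for outgoing requests
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1.example.com:8000', 'http://proxy-2.example.com:8000'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, $, log }) => {
        log.info(`Scraped ${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(['https://crawlee.dev']);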
Apify is a platform that lets you deploy and run your crawlers in a scalable and managed way. With Apify, you can upload your crawlers to a serverless infrastructure without worrying about provisioning and maintaining servers.
Advantages of Crawlee and Apify
Crawlee offers several advantages over traditional approaches to web scraping in JavaScript:
- Abstraction of the complexities of web crawling, allowing you to focus on what data to collect and where to collect it from.
- Integration with popular libraries like Puppeteer and Playwright, offering a simple and powerful API.
- Advanced features such as proxy management, IP rotation, and error handling.
- The ability to create data processing pipelines, with integration with Apify.
- Request queue processing with automatic error handling.
- Native parallelism in processing queued requests.
- Helper functions designed to solve both classic and advanced web crawling problems, such as infinite scroll, 2FA, automatic link enqueueing, etc. (see the sketch after this list).
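To give a feel for these helpers, below is a minimal sketch of a PlaywrightCrawler that combines a concurrency limit, the infiniteScroll context helper, and automatic link enqueueing; treat it as an assumption-laden illustration rather than a complete recipe:
// helpers-example.ts (illustrative sketch)
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Process up to 10 queued requests in parallel
    maxConcurrency: 10,
    async requestHandler({ page, request, enqueueLinks, infiniteScroll, log }) {
        // Scroll down to trigger lazily loaded content (infinite scroll)
        await infiniteScroll();
        log.info(`Visited ${request.url}: ${await page.title()}`);
        // Automatically enqueue links on the page that match the glob
        await enqueueLinks({ globs: ['https://crawlee.dev/**'] });
    },
});

await crawler.run(['https://crawlee.dev']);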
Apify offers the following advantages:
- Scalable and managed deployment and execution of crawlers without the need to provision and maintain servers.
- Integration with Crawlee, enabling a smooth development and deployment experience.
- A dataset-based database natively integrated with Crawlee.
- Native functionality to create an API for retrieving data from your crawler (see the sketch after this list).
- Additional services such as captcha resolution and rotating proxies, which help tackle web scraping challenges.
- An extensive free plan with $5.00 credit per month in your account.
- The ability to monetize your crawlers, offering them as a service to other companies.
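To illustrate the dataset and API points above: once a crawler run finishes on Apify, its default dataset can be read through the platform’s REST API. The sketch below assumes Node 18+ (global fetch); the dataset ID and token are placeholders:
// fetch-dataset-example.ts (illustrative sketch; <DATASET_ID> and <APIFY_TOKEN> are placeholders)
const datasetId = '<DATASET_ID>';
const token = '<APIFY_TOKEN>';

// Download the items stored in an Apify dataset as JSON
const response = await fetch(
    `https://api.apify.com/v2/datasets/${datasetId}/items?format=json&token=${token}`
);
const items = await response.json();
console.log(`Fetched ${items.length} items`);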
Creating My First Crawler with Crawlee
The first thing we need to do is create our first project with Crawlee, using the command:
npx crawlee create amazon-crawler
(replace “amazon-crawler” with the name of your crawler)
After this, the CLI asks which template to use to set up the project. For this example, we will select the PlaywrightCrawler [TypeScript] template, which scaffolds a project with two .ts files:
// main.ts
// For more information, see https://crawlee.dev/
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

const crawler = new PlaywrightCrawler({
    // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
    requestHandler: router,
    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 20,
});

await crawler.run(startUrls);
This is the main file: it configures and runs the crawler, holds the top-level settings, and imports the router that handles the enqueued requests.
// routes.ts
import { createPlaywrightRouter } from 'crawlee';

export const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        globs: ['https://crawlee.dev/**'],
        label: 'detail',
    });
});

router.addHandler('detail', async ({ request, page, log, pushData }) => {
    const title = await page.title();
    log.info(`${title}`, { url: request.loadedUrl });
    await pushData({
        url: request.loadedUrl,
        title,
    });
});
This file is responsible for registering the routes, a mechanism that abstracts each data collection step into its own handler.
This routing concept mirrors REST API frameworks: each crawling operation (the “what”) is completely independent of the request queue (the “where”). No matter which routes you add to the queue, the crawler performs the collection, and if an operation fails, the request queue is notified and the failed request is automatically retried on its own.
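To make the retry behavior concrete, here is a small sketch (under the same project setup) showing how a crawler can limit retries and react when a request keeps failing, using Crawlee’s maxRequestRetries option and failedRequestHandler callback:
// retry-example.ts (illustrative sketch)
import { PlaywrightCrawler } from 'crawlee';
import { router } from './routes.js';

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    // Retry each failing request up to 3 times before giving up
    maxRequestRetries: 3,
    // Called once a request has exhausted all of its retries
    failedRequestHandler: async ({ request, log }) => {
        log.error(`Request ${request.url} failed too many times.`);
    },
});

await crawler.run(['https://crawlee.dev']);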
The above example performs the following operations:
- It accesses the initial URL https://crawlee.dev without specifying a handler, so the crawler executes the defaultHandler.
- The defaultHandler then enqueues every page link matching the pattern https://crawlee.dev/** to be processed by the “detail” handler.
- The “detail” handler collects the title of each page queued to it and, via the pushData function, stores it using the native Datasets feature, which works like a database where the crawler’s data is kept without duplication or inconsistency.
At the end of each request execution, the page titles and their respective URLs are saved as JSON files under /storage/datasets/default, which can be consumed in various ways: either directly through the files or served by a REST API, as shown below.
// main.ts
// For more information, see https://crawlee.dev/
import { PlaywrightCrawler } from 'crawlee';
import express from 'express';
import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

const crawler = new PlaywrightCrawler({
    // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
    requestHandler: router,
    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 20,
});

// Create an express server to get the results from the previous crawl
const app = express();

app.get('/datasets/results', async (req, res) => {
    // Gets the results from the previous crawl
    const data = await crawler.getData();
    return res.json(data);
});

app.listen(3000, () => {
    console.log('Server is running on port 3000');
});

await crawler.run(startUrls);
When accessing the route, we get a result similar to the one below.
{
"count": 21,
"desc": false,
"items": [
{
"url": "https://crawlee.dev/docs/examples",
"title": "Examples | Crawlee"
},
{
"url": "https://crawlee.dev/docs/quick-start",
"title": "Quick Start | Crawlee"
},
{
"url": "https://crawlee.dev/api/core",
"title": "@crawlee/core | API | Crawlee"
},
{
"url": "https://crawlee.dev/api/core/changelog",
"title": "Crawlee"
},
{
"url": "https://crawlee.dev/blog",
"title": "Crawlee Blog - learn how to build better scrapers | Crawlee"
},
{
"url": "https://crawlee.dev/docs/3.9/quick-start",
"title": "Crawlee"
},
{
"url": "https://crawlee.dev/docs/next/quick-start",
"title": "Quick Start | Crawlee"
},
{
"url": "https://crawlee.dev/docs/3.8/quick-start",
"title": "Crawlee"
},
{
"url": "https://crawlee.dev/docs/3.6/quick-start",
"title": "Crawlee"
},
{
"url": "https://crawlee.dev/docs/3.5/quick-start",
"title": "Crawlee"
},
{
"url": "https://crawlee.dev/docs/3.7/quick-start",
"title": "Quick Start | Crawlee"
},
{
"url": "https://crawlee.dev/docs/3.3/quick-start",
"title": "Crawlee"
},
{
"url": "https://crawlee.dev/docs/3.2/quick-start",
"title": "Crawlee"
},
{
"url": "https://crawlee.dev/docs/3.4/quick-start",
"title": "Quick Start | Crawlee"
},
{
"url": "https://crawlee.dev/docs/3.1/quick-start",
"title": "Crawlee"
},
{
"url": "https://crawlee.dev/docs/introduction",
"title": "Introduction | Crawlee"
},
{
"url": "https://crawlee.dev/docs/3.0/quick-start",
"title": "Quick Start | Crawlee"
},
{
"url": "https://crawlee.dev/docs/guides/javascript-rendering",
"title":
"JavaScript rendering | Crawlee"
},
{
"url": "https://crawlee.dev/docs/guides/avoid-blocking",
"title": "Crawlee"
},
{
"url": "https://crawlee.dev/docs/guides/typescript-project",
"title": "TypeScript Projects | Crawlee"
},
{
"url": "https://crawlee.dev/docs/guides/cheerio-crawler-guide",
"title": "CheerioCrawler guide | Crawlee"
}
],
"limit": 999999999999,
"offset": 0,
"total": 21
}
Practical Example: Collecting Data from Amazon
Let’s see an example of how to use Crawlee and Apify to create a crawler that collects data on products related to “Zelda” on Amazon.
First, we define the initial search URL on Amazon and the keyword “Zelda”:
import {
    CheerioCrawler,
    Dataset,
    createCheerioRouter,
} from "crawlee";

// create the cheerio router
const router = createCheerioRouter();

// Labels that will be used to identify handlers and routes
const labels = {
    PRODUCT: "PRODUCT",
    OFFERS: "OFFERS",
};

// keyword for the e-commerce search
const keyword = "zelda";
const BASE_URL = "https://www.amazon.com";

// Initial request: the Amazon search results page for the keyword
const initialRequest = {
    url: `${BASE_URL}/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=${keyword}`,
    userData: {
        keyword,
    },
};
Next, we create the handler responsible for accessing the product listing page and extracting the links to each product’s page:
// create the initial handler that will extract the products from the search page
router.addDefaultHandler(async ({ $, request, crawler }) => {
    const { keyword } = request.userData;
    // Each product card on the results page carries a non-empty data-asin attribute
    const products = $('div > div [data-asin]:not([data-asin=""])');

    for (const product of products) {
        const element = $(product);
        const titleElement = $(element.find(".a-text-normal [href]"));
        const url = `${BASE_URL}${titleElement.attr("href")}`;

        // Enqueue the product detail page, passing along the data collected so far
        await crawler.addRequests([
            {
                url,
                label: labels.PRODUCT,
                userData: {
                    data: {
                        title: titleElement.first().text().trim(),
                        asin: element.attr("data-asin"),
                        itemUrl: url,
                        keyword,
                    },
                },
            },
        ]);
    }
});
Then, we create the handler responsible for accessing the product detail page and extracting information such as the description:
// create the handler that will extract product data
router.addHandler(labels.PRODUCT, async ({ $, request, crawler }) => {
    const { data } = request.userData;
    // Product description shown on the detail page
    const element = $("div#productDescription");

    // Enqueue the "all offers" page for this product (by ASIN)
    await crawler.addRequests([
        {
            url: `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${data.asin}&pc=dp`,
            label: labels.OFFERS,
            userData: {
                data: {
                    ...data,
                    description: element.text().trim(),
                },
            },
        },
    ]);
});
Next, we create the handler responsible for obtaining offers for each product:
// create the handler that will extract product offers
router.addHandler(labels.OFFERS, async ({ $, request }) => {
    const { data } = request.userData;

    // Each offer block contains the seller name and the offered price
    for (const offer of $("div#aod-offer")) {
        const element = $(offer);
        await Dataset.pushData({
            ...data,
            sellerName: element
                .find('div[id*="soldBy"] a[aria-label]')
                .text()
                .trim(),
            offer: element.find(".a-price .a-offscreen").text().trim(),
        });
    }
});
Finally, we start the crawler and add the first request:
const crawler = new CheerioCrawler({
    requestHandler: router,
});

// execute the crawler
await crawler.run([
    initialRequest,
]);
This example demonstrates how Crawlee simplifies the web scraping process by abstracting the complexity of navigation and allowing you to focus on data extraction.
Integrating with Apify
Now, imagine that you want to deploy this crawler in a scalable and managed way. This is where Apify comes in. Just integrate your Crawlee code with the Apify SDK, and you can upload your crawler to the Apify platform.
import { Actor } from 'apify';

await Actor.init();

const crawler = new CheerioCrawler({
    // Same code as the previous example
});

// execute the crawler
await crawler.run([
    initialRequest,
]);

await Actor.exit();
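Assuming you have the Apify CLI installed and an Apify account, uploading the project to the platform is typically a matter of two commands:
# authenticate the CLI with your Apify account
apify login
# build and upload the project as an Actor on the Apify platform
apify push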
With this integration, you can enjoy several benefits, such as:
- Scalable and managed deployment and execution of crawlers.
- Access to additional services like captcha resolution and rotating proxies.
- The ability to monetize your crawlers, offering them as a service to other companies.
- Integration of Crawlee’s Datasets with Apify’s Datasets, enabling the creation of automatic APIs that consume the results of the crawler’s execution.
Crawlee and Apify represent a significant shift in how we handle web scraping in JavaScript. They simplify the process, abstract the complexity, and offer a more efficient and scalable development and deployment experience.