Diffbot
Diffbot is a suite of ML-based products that make it easy to structure web data.
Diffbot's Extract API is a service that structures and normalizes data from web pages.
Unlike traditional web scraping tools, Diffbot Extract doesn't require any rules to read the content on a page. It uses a computer vision model to classify a page into one of 20 possible types, and then transforms raw HTML markup into JSON. The resulting structured JSON follows a consistent type-based ontology, which makes it easy to extract data from multiple different web sources with the same schema.
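To make the request and response shape concrete, here is a minimal sketch rather than the official client. It assumes the public Diffbot v3 Analyze endpoint (`https://api.diffbot.com/v3/analyze`) with `token` and `url` query parameters, and it assumes the response JSON carries an `objects` list whose entries have `type` and `title` fields. The helper names `build_extract_request_url` and `summarize_objects` are hypothetical and exist only for this illustration.

```python
from urllib.parse import urlencode

# Assumption: the Diffbot v3 Analyze endpoint, which detects the page type itself.
DIFFBOT_ANALYZE_ENDPOINT = "https://api.diffbot.com/v3/analyze"


def build_extract_request_url(page_url: str, api_token: str) -> str:
    """Build the GET URL that asks Diffbot to classify and structure page_url."""
    # urlencode percent-escapes the target URL so it is safe as a query value.
    query = urlencode({"token": api_token, "url": page_url})
    return f"{DIFFBOT_ANALYZE_ENDPOINT}?{query}"


def summarize_objects(response: dict) -> list:
    """Return (type, title) pairs from a parsed response.

    Assumption: extracted entities live under the "objects" key and each
    one has "type" and "title" fields. Missing fields fall back to "".
    """
    return [
        (obj.get("type", ""), obj.get("title", ""))
        for obj in response.get("objects", [])
    ]
```

Because every page type maps onto the same ontology, one function like `summarize_objects` can process responses for articles, products, and other page types without per-site parsing rules.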
Overview
This guide covers how to extract data from a list of URLs using the Diffbot Extract API into structured JSON that we can use downstream.
Setting up
Start by installing the required packages.
%pip install --upgrade --quiet langchain-community
Diffbot's Extract API requires an API token. Follow these instructions to get a free API token and then set an environment variable.
%env DIFFBOT_API_TOKEN REPLACE_WITH_YOUR_TOKEN
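Before calling the API, it can help to confirm that the token is actually set. The following is a small sketch, not part of the Diffbot or LangChain APIs: `get_diffbot_token` is a hypothetical helper that reads `DIFFBOT_API_TOKEN` from the environment and fails early if the variable is missing or still holds the placeholder value.

```python
import os


def get_diffbot_token(env_var: str = "DIFFBOT_API_TOKEN") -> str:
    """Read the Diffbot API token from the environment, failing early if unset."""
    token = os.environ.get(env_var, "").strip()
    # Treat the placeholder from the setup step as "not configured".
    if not token or token == "REPLACE_WITH_YOUR_TOKEN":
        raise RuntimeError(
            f"Set the {env_var} environment variable to a valid Diffbot API token."
        )
    return token
```

Calling `get_diffbot_token()` once at the top of a notebook surfaces a missing token immediately instead of as a failed request later.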