Overview
The Scrape Webpage action allows you to extract text or HTML from a website by providing its URL. This enables you to programmatically scrape websites on a schedule or on-demand, with options to control the output format and behavior. Common use cases include extracting data from websites for analysis, monitoring changes over time, or repurposing web content for other applications.
Usage Examples
Scrape a website - Build a workflow that allows you to drop in the URL and let the scraping happen. This enables you to scrape websites repetitively, programmatically, and on a schedule. This can be used to:
detect changes in website text like pricing changes or changes in product descriptions
extract page elements such as <H1>, <H2>, and <A> for evaluation and improvement when forming SEO and content strategy
extract specifics from profile pages such as a user’s work history from LinkedIn
extract reviews from a webpage to evaluate customer sentiment
Extracting Meta Information for SEO - The scrape action can be used in conjunction with an extract action to grab the SEO metadata of a page for review or for use in later workflow steps.
Generate a summary of a webpage or PDF - The scrape action can be used in conjunction with the Generate Text action to summarize webpages and content.
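As a rough illustration of the SEO use case above, the sketch below pulls heading text and link targets out of an HTML string, such as the HTML output of a scrape. It uses only Python's standard library. The names `HeadingLinkParser` and `extract_seo_elements` are hypothetical and not part of the Scrape Webpage action; inside a workflow, an extract action would normally do this job.

```python
from html.parser import HTMLParser


class HeadingLinkParser(HTMLParser):
    """Collects the text of <h1>/<h2> elements and the href of <a> elements."""

    def __init__(self):
        super().__init__()
        self.headings = []    # list of (tag, text) tuples
        self.links = []       # list of href strings
        self._current = None  # heading tag currently open, if any
        self._buffer = []     # text fragments seen inside the open heading

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> arrives as "h1".
        if tag in ("h1", "h2"):
            self._current = tag
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace so headings compare cleanly.
            text = " ".join("".join(self._buffer).split())
            self.headings.append((tag, text))
            self._current = None


def extract_seo_elements(html):
    """Hypothetical helper: returns headings and links found in `html`."""
    parser = HeadingLinkParser()
    parser.feed(html)
    parser.close()
    return {"headings": parser.headings, "links": parser.links}


sample_html = """
<html><body>
  <H1>Pricing</H1>
  <h2>Starter   plan</h2>
  <p>See <a href="https://example.com/docs">the docs</a>.</p>
</body></html>
"""
result = extract_seo_elements(sample_html)
print(result)
```

This only works when the Output Type is set to html, since the text only output no longer contains the tags.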
Inputs
URL - The web address of the site to be scraped.
https://example.com
Quality Level - The level of quality for the scrape action. Unless you have a specific reason, you should always pick 'ultra' as it provides the highest quality scrape.
ultra
premium
default
Output Type - This toggle determines whether the output is the text content of the website or the full underlying HTML code.
text only
html
Advanced Inputs
Should the workflow continue if this fails? - An optional setting that allows the workflow to continue executing even if this scrape action fails. Leave it off when the scrape is a prerequisite for an expensive series of subsequent actions and you don't want them running unnecessarily after a failed scrape.
When toggled on, an additional Output Message input appears. It sets the message that is passed along when the scrape fails.
Default:
Failed to scrape the URL
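The behavior of this toggle can be pictured with a small sketch. This is not the platform's implementation; `run_scrape_step`, `continue_on_failure`, and `output_message` are hypothetical names chosen to mirror the inputs described above, and `scrape` stands in for whatever performs the actual request.

```python
DEFAULT_FAILURE_MESSAGE = "Failed to scrape the URL"


def run_scrape_step(scrape, url, continue_on_failure=False,
                    output_message=DEFAULT_FAILURE_MESSAGE):
    """Run `scrape(url)`; on failure either re-raise or return the fallback message."""
    try:
        return {"ok": True, "output": scrape(url)}
    except Exception:
        if not continue_on_failure:
            raise  # toggle off: the failure stops the workflow here
        # toggle on: later steps receive the configured message instead
        return {"ok": False, "output": output_message}


def failing_scrape(url):
    """Stand-in scraper that always fails, used only for this demonstration."""
    raise RuntimeError("could not reach " + url)


result = run_scrape_step(failing_scrape, "https://example.com",
                         continue_on_failure=True)
print(result)
```

With the toggle on, downstream steps should be prepared to receive the failure message in place of scraped content.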
Outputs
The Scrape Webpage action outputs multiple elements related to the website being scraped:
Text: This contains the text content extracted from the website. Depending on the quality level selected, it will try to intelligently extract just the main text content while excluding code, scripts, etc.
HTML: This is the full underlying HTML code of the website. This can be useful if you need access to the raw markup rather than just the text.
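One way to use the Text output for change monitoring, mentioned in the usage examples, is to compare a fingerprint of the current text against one stored from a previous run. The following is a minimal sketch using Python's standard `hashlib`; the function names are hypothetical, and how the previous text is stored between runs is left out.

```python
import hashlib


def fingerprint(text):
    """Hash the text after collapsing whitespace, so layout-only changes are ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_text, current_text):
    """Return True when the scraped text differs in content from the earlier run."""
    return fingerprint(previous_text) != fingerprint(current_text)


print(has_changed("Price: $10", "Price:  $10\n"))  # whitespace-only difference
print(has_changed("Price: $10", "Price: $12"))     # real content difference
```

Normalizing whitespace before hashing is a design choice: it reduces false alarms from formatting differences, at the cost of missing changes that are purely whitespace.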
Troubleshooting
Scrape Fails - By far the most common cause of a failed Scrape Webpage action is an invalid URL. The URL must be publicly accessible and not behind a login wall.
Scraping not returning desired output - The scrape action allows you to specify the output type as either text only or HTML. If you only require the textual content of the webpage, you can select the 'text only' option. However, if you need access to elements embedded in the page's code, such as images or scripts, you should choose the HTML output type. Selecting the appropriate output type ensures that the action returns the desired payload, which can then be used for further processing, text generation, or evaluation purposes.
Scraping not returning full text of page - Some webpages rely on dynamic code to render part or all of the page. While scrape is designed to handle many of these scenarios, some text on such pages may be impossible to scrape.
Workflow failing if scrape action fails - When configuring the scrape action, you have the option to allow the workflow to continue even if this step fails. With the option disabled, a failed scrape stops the workflow and none of the remaining actions run, which avoids wasting resources on expensive subsequent actions and prevents errors or undesired outcomes from propagating through the workflow. Enable the option if you want the workflow to keep running after a failed scrape, with the configured output message passed along in place of the scraped content.
Scraping quality issues - The scrape action offers different quality levels, with 'ultra' being the highest. Unless you have a specific reason to use a lower quality level, it is recommended to select 'ultra' to ensure that the scraping process captures the webpage content as accurately and completely as possible. Lower quality levels may result in missing or incomplete data, which could impact subsequent actions or analyses.
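Since an invalid URL is the most common cause of a failed scrape, a quick check before running the workflow can catch obvious mistakes. The sketch below uses Python's standard `urllib.parse`; `looks_like_public_url` is a hypothetical helper, and it only checks the shape of the URL. It cannot tell whether a page sits behind a login wall.

```python
from urllib.parse import urlsplit


def looks_like_public_url(url):
    """Return True if `url` has an http(s) scheme and a non-local host name."""
    try:
        parts = urlsplit(url.strip())
        host = parts.hostname
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs, e.g. bad IPv6 brackets
        return False
    if parts.scheme not in ("http", "https"):
        return False
    if not host or host == "localhost":
        return False
    return True


for candidate in ["https://example.com", "example.com", "ftp://example.com",
                  "http://localhost:8000"]:
    print(candidate, looks_like_public_url(candidate))
```

A URL without a scheme, such as `example.com`, is rejected here, so adding `https://` is a reasonable first fix when a scrape fails.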
Related Actions
Generate Text - Scrape is often used in conjunction with Generate Text in order to process the scraped inputs in some way such as generating a summary.
Extract Data from Text - Scrape is often used in conjunction with Extract Data from Text to pull specific information, such as SEO metadata, out of the scraped content.