Action: Extract Data From Text
Updated over a week ago

Overview

The Extract Data from Text action allows you to extract specific data elements or a list of specific data elements from a given source text. It is useful for extracting objective information like data points, topics, URLs, or other elements from long-form content sources like meeting transcripts, LinkedIn profiles, SEC filings, or earnings call transcripts.

You can choose to extract either a single instance of each named data element or lists of the same type of element. The action supports adjusting the model quality and performance to balance cost, speed, and accuracy based on the difficulty of the extraction task. The outputs are provided as a JSON object containing the extracted data elements, which can be referenced individually in downstream workflows or actions.

Usage Examples

  • Extract data points from long-form content - Pull specific data points or topics out of lengthy text sources such as meeting transcripts, LinkedIn profiles, SEC filings, and earnings call transcripts.

  • Extract a list of URLs from text content or search results - You can use this action to extract a list of URLs from a large body of text that contains multiple hyperlinks or web addresses. This is helpful for gathering all the links from a document or webpage for further processing. The action can also take the list of search results from the Perform Internet Search action and extract their URLs, titles, or descriptions for later use.

  • Extract elements from a 10-K filing - For SEC 10-K filings, you can configure the action to extract common elements like employee count, yearly revenue, the list of subsidiaries the company owns, and the list of directors and executives. This allows you to quickly surface key data points from these financial disclosure documents.

  • Extract elements from a meeting transcript - When provided with a meeting transcript text, this action enables extracting items like the date, attendee list, and any action items that were discussed during the meeting. This could aid in summarizing and tracking meeting outcomes.

  • Convert elements in an HTML table into a JSON object - By creating a separate data element for each column header and extracting every row's values, you can use List Mode to convert a table from scraped webpage content into a usable data set for your workflow.
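As a rough illustration of the table-to-JSON use case, the sketch below shows the shape of the result: one object per table row, with one key per column header. It only illustrates the target data shape and is not how the action works internally; the function name `rows_to_records` and the sample table are made up for this example.

```python
import json

def rows_to_records(headers, rows):
    """Pair each row's cells with the column headers, producing one dict per row."""
    return [dict(zip(headers, row)) for row in rows]

# Hypothetical table content: one header row and two data rows.
headers = ["Company", "Employee_Count"]
rows = [["Acme", "120"], ["Globex", "4500"]]

records = rows_to_records(headers, rows)

# Prints a JSON list with one object per table row.
print(json.dumps(records, indent=2))
```

Each header becomes a named data element, which mirrors defining one data element per column in the Data to Extract input.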

Inputs

  • Extract List of Objects - A toggle to specify whether you want to extract a single instance of each named data element, or extract lists of the same type of data element. Turn it on to extract lists of a data element from the source text; leave it off to extract a single instance of each element.

  • Source Text - The full text content from which you want to extract specific data elements. This could be a webpage, meeting transcript, SEC filing document, or any other long-form text.

  • Data to Extract - A list of named data elements that you want to extract from the source text. For each element, you need to provide a name and description.

    • Examples of data elements to extract:

      • Employee Count - The total number of employees at the company from an SEC filing

      • Yearly Revenue - The total revenue earned by the company in the last fiscal year from an SEC filing

    • Examples of a list of elements to extract:

      • Hyperlinks in webpage content - All of the URLs listed in scraped webpage content

      • List of Subsidiaries - A list of all subsidiary companies owned by the parent company from an SEC filing

      • List of Directors and Executives - Names and titles of all directors and executives at the company

Advanced Inputs

  • Model Quality and Performance - Allows you to choose between a faster/cheaper model or a more robust/expensive model based on the difficulty and subjectivity of the extraction task.

Outputs

The main output of the Extract Data from Text action is a JSON object of data elements, or a list of JSON objects. The output contains one key in the object for every element that was defined in the "Data to Extract" input.

Example of an output with extracted data points:

{ "Employee_Count": "22668", "Revenue": "$8.971 billion" }

Example of an output with a list:

[ { "URL": "https://copy.ai" }, { "URL": "https://google.com" }, { "URL": "https://perplexity.ai" } ]

This allows you to precisely access and process these extracted data elements in subsequent workflow actions, without having to work with the entire original text content (like the full 10-K document). The JSON output essentially breaks down the source text into the key pieces of information you care about, making it easier to operate on that data going forward.
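Inside a workflow, downstream actions reference these outputs directly. If you ever handle the same JSON outside of a workflow, a minimal Python sketch of reading it might look like the following. The sample strings are based on the example outputs above, and the variable names are assumptions made for illustration.

```python
import json

# Sample outputs modeled on the examples above; note that values arrive as strings.
single_output = '{ "Employee_Count": "22668", "Revenue": "$8.971 billion" }'
list_output = (
    '[ { "URL": "https://copy.ai" }, '
    '{ "URL": "https://google.com" }, '
    '{ "URL": "https://perplexity.ai" } ]'
)

# Single-instance mode: one object with one key per named data element.
single = json.loads(single_output)
employee_count = int(single["Employee_Count"])  # convert the string to a number
revenue = single["Revenue"]

# List mode: a list of objects, each carrying the same keys.
urls = [item["URL"] for item in json.loads(list_output)]

print(employee_count, revenue, urls)
```

Because extracted values come back as text in these examples, numeric fields may need an explicit conversion before you do arithmetic on them.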

Troubleshooting

  • Extracting a list of items vs a single item

    • The "Extract List of Objects" toggle allows you to specify whether you want to extract a list of the same type of elements (e.g. URLs, action items) or just a single element.

    • If you want to extract a list, make sure to toggle this on. This is useful when you want to run a downstream workflow for every element in the extracted list.

    • If you only need to extract a single element, leave this toggle off.

  • Incorrect or imprecise extraction

    • If the action is not extracting the desired data element accurately, you should refine the description provided for that named data element.

    • For example, if you are trying to extract URLs but the action is not returning valid URLs, provide a more precise description of what constitutes a valid URL in the data element description.

    • Being more specific in the description will help the model better understand exactly what kind of data you want extracted, leading to more accurate results.
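When refining a description for URL extraction, it can help to check how many extracted values are actually well-formed. The sketch below is one possible check using Python's standard library, run on the extracted list outside the action. The helper name `is_valid_url`, the rule that only http and https links count as valid, and the sample data are all assumptions for illustration.

```python
from urllib.parse import urlparse

def is_valid_url(value):
    """Return True if value looks like an absolute http(s) URL."""
    try:
        parts = urlparse(value.strip())
    except ValueError:
        # urlparse rejects some malformed inputs outright.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)

# Hypothetical extraction result containing one malformed entry.
extracted = [
    {"URL": "https://copy.ai"},
    {"URL": "copy dot ai"},
    {"URL": "https://google.com"},
]

valid = [item for item in extracted if is_valid_url(item["URL"])]
rejected = [item for item in extracted if not is_valid_url(item["URL"])]

print(len(valid), "valid,", len(rejected), "rejected")
```

The rejected entries are a useful guide for what to spell out in the data element description, for example that each URL must start with "https://".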

Related Actions

  • Run Workflow - This action allows you to queue subsequent workflows using the data elements extracted from the source text via the "Extract Data from Text" action. It can accept an array or list of elements as input, making it useful for running a workflow for each item in the extracted list (e.g. running a workflow for every URL extracted).

  • Run Workflow Inline - Similar to the "Run Workflow" action, this allows you to trigger subsequent workflows inline, using either a single extracted data element or an array of data elements as input. It also allows you to access the outputs generated by the inline workflow downstream in your current workflow.

    • If you’ve extracted a list using Extract Data From Text, you can select “Run in Array Mode” when using Run Workflow Inline to run the nested workflow once for each element in the list.

  • Generate Text - Use Generate Text for further processing of the extracted data elements downstream. For example, you could reference the extracted employee count, revenue, and subsidiaries from a 10-K filing as input to a "Generate Text" action to produce summarized text about the company's financials and structure.
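Conceptually, running a nested workflow in array mode over an extracted list behaves like applying one function to every element and collecting the results. The sketch below illustrates that idea only; it is not the platform's implementation, and `run_in_array_mode` and `summarize_link` are hypothetical names standing in for the Run Workflow Inline action and a nested workflow.

```python
def run_in_array_mode(nested_workflow, items):
    """Call the nested workflow once per extracted element and collect the outputs."""
    return [nested_workflow(item) for item in items]

def summarize_link(item):
    # Hypothetical nested workflow: derive the domain from an extracted URL.
    domain = item["URL"].split("//", 1)[-1]
    return {"URL": item["URL"], "domain": domain}

# A list shaped like the list-mode output of Extract Data From Text.
extracted = [{"URL": "https://copy.ai"}, {"URL": "https://google.com"}]

results = run_in_array_mode(summarize_link, extracted)
print(results)
```

The collected results play the role of the inline workflow's outputs, which remain available to later steps in the current workflow.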
