AmazonTextractPDFParser

AmazonTextractPDFParser(
  self,
  textract_features: Optional[Sequence[int]] = None,
  client: Optional[Any] = None,
  *,
  linearization_config: Optional[TextLinearizationConfig] = None
)

Constructors

constructor

__init__

Name	Type
textract_features	Optional[Sequence[int]]
client	Optional[Any]
linearization_config	Optional[TextLinearizationConfig]

Attributes

attribute

tc: tc

attribute

textractor: textractor

attribute

textract_features: list

attribute

linearization_config: linearization_config

attribute

boto3_textract_client

Methods

method

lazy_parse

Inherited from BaseBlobParser (langchain_core)

Methods

parse

Send PDF files to Amazon Textract and parse them.

For parsing multi-page PDFs, they have to reside on S3.

The AmazonTextractPDFLoader calls the Amazon Textract Service to convert PDFs into a Document structure. Single- and multi-page documents are supported, up to 3000 pages and 512 MB in size.

For the call to be successful an AWS account is required, similar to the AWS CLI requirements.

Besides the AWS configuration, it is very similar to the other PDF loaders, while also supporting JPEG, PNG, TIFF, and non-native PDF formats.

from langchain_community.document_loaders import AmazonTextractPDFLoader

loader = AmazonTextractPDFLoader("example_data/alejandro_rosalez_sample-small.jpeg")
documents = loader.load()
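Because multi-page PDFs have to reside on S3, calling code often needs to tell an S3 URI apart from a local path before deciding how to submit a document. The following is a minimal sketch of such a check using only the standard library. The helper name `is_s3_uri` and the bucket name are hypothetical and not part of the loader's API.

```python
from typing import Optional
from urllib.parse import urlparse


def is_s3_uri(path: Optional[str]) -> bool:
    """Return True if `path` looks like s3://bucket/key (hypothetical helper)."""
    if not path:
        return False
    parsed = urlparse(path)
    # An S3 URI needs the "s3" scheme and a bucket name (the netloc part).
    return parsed.scheme == "s3" and bool(parsed.netloc)


# Illustrative inputs only; the bucket and file names are made up.
for candidate in ["s3://example-bucket/report.pdf", "example_data/sample.jpeg"]:
    kind = "S3 object" if is_s3_uri(candidate) else "local file"
    print(f"{candidate}: {kind}")
```

A check like this only inspects the string; it does not verify that the object exists or that credentials allow access to it.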

One feature is the linearization of the output. When using the features LAYOUT, FORMS or TABLES together with Textract

from langchain_community.document_loaders import AmazonTextractPDFLoader

# you can mix and match each of the features
loader = AmazonTextractPDFLoader(
    "example_data/alejandro_rosalez_sample-small.jpeg",
    textract_features=["TABLES", "LAYOUT"],
)
documents = loader.load()

it will generate output that formats the text in reading order and tries to output the information in a tabular structure or output the key/value pairs with a colon (key: value). This helps most LLMs to achieve better accuracy when processing these texts.
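To make the idea of linearized output concrete, here is a small pure-Python sketch of the two shapes described above: key/value pairs joined with a colon, and rows laid out as a text table. This is not the implementation used by Textract or the textractor package; the function names and the sample data are made up for illustration.

```python
from typing import Dict, List


def linearize_key_values(pairs: Dict[str, str]) -> str:
    """Render form fields one per line as `key: value`."""
    return "\n".join(f"{key}: {value}" for key, value in pairs.items())


def linearize_table(rows: List[List[str]]) -> str:
    """Render rows as aligned, pipe-separated text lines.

    Assumes every row has the same number of cells.
    """
    # Pad every column to the width of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    lines = []
    for row in rows:
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append(" | ".join(cells).rstrip())
    return "\n".join(lines)


# Made-up sample data, for illustration only.
form_text = linearize_key_values({"Name": "Jane Doe", "Date": "2024-01-01"})
table_text = linearize_table([["Item", "Qty"], ["Pen", "2"]])
print(form_text)
print(table_text)
```

The actual formatting is controlled by the linearization_config passed to the parser, so real output may differ from this sketch.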

Document objects are returned with metadata that includes the source and a 1-based index of the page number in page. Note that page represents the index of the result returned from Textract, not necessarily the as-written page number in the document.
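The metadata layout described above can be sketched without any AWS dependency. In the example below, `PageDocument` is a hypothetical stand-in for the LangChain Document class, and the list of page texts stands in for the results returned from Textract; only the `source` and `page` keys come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class PageDocument:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def pages_to_documents(source: str, page_texts: List[str]) -> Iterator[PageDocument]:
    """Yield one document per result page with a 1-based `page` index."""
    # start=1 makes the index 1-based; it counts returned results,
    # not the page numbers printed in the document itself.
    for index, text in enumerate(page_texts, start=1):
        yield PageDocument(
            page_content=text,
            metadata={"source": source, "page": index},
        )


# Made-up source and page texts, for illustration only.
docs = list(
    pages_to_documents(
        "s3://example-bucket/sample.pdf",
        ["first page text", "second page text"],
    )
)
for doc in docs:
    print(doc.metadata["page"], doc.page_content)
```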

Parameters

Name	Type	Description
`textract_features`	`Optional[Sequence[int]]`	Default: `None`. Features to be used for extraction. Each feature should be passed as an int that conforms to the enum Textract_Features; see the amazon-textract-caller package.
`client`	`Optional[Any]`	Default: `None`. boto3 Textract client.
`linearization_config`	`Optional[TextLinearizationConfig]`	Default: `None`. Config to be used for linearization of the output. Should be an instance of TextLinearizationConfig from the textractor package.

`lazy_parse` iterates over the Blob pages and returns an Iterator with a Document for each page, like the other parsers. For a multi-page document, blob.path has to be set to the S3 URI; for single-page documents, blob.data is used.
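Since `textract_features` expects ints that conform to an enum, feature names usually have to be translated before they reach the parser. The sketch below shows one way to do that lookup. `DemoFeatures` is a hypothetical stand-in with placeholder values, and `feature_names_to_ints` is an illustrative helper, not part of any library; in practice the real Textract_Features enum from the amazon-textract-caller package would be passed instead.

```python
from enum import Enum
from typing import List, Sequence, Type


class DemoFeatures(Enum):
    """Hypothetical stand-in enum; the numeric values are placeholders."""

    TABLES = 101
    LAYOUT = 102


def feature_names_to_ints(enum_cls: Type[Enum], names: Sequence[str]) -> List[int]:
    """Look up each name on the enum and return its integer value."""
    # enum_cls[name] raises KeyError for an unknown feature name,
    # which surfaces typos early instead of sending a bad request.
    return [enum_cls[name].value for name in names]


print(feature_names_to_ints(DemoFeatures, ["TABLES", "LAYOUT"]))
```

Looking names up by key rather than hard-coding integers keeps calling code independent of the enum's actual numeric values.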
