AmazonTextractPDFParser(
self,
textract_features: Optional[Sequence[int]] = None,
client: Optional[Any] = None,
*,
linearization_config: Optional[TextLinearizationConfig] = None
)
Send PDF files to Amazon Textract and parse them.
Multi-page PDFs have to reside on S3 to be parsed.
The AmazonTextractPDFLoader calls the Amazon Textract Service to convert PDFs into a Document structure. Single and multi-page documents are supported, up to 3000 pages and 512 MB in size.
For the call to be successful, an AWS account is required, similar to the AWS CLI requirements.
Apart from the AWS configuration, it works much like the other PDF loaders, and it also supports JPEG, PNG, TIFF, and non-native PDF formats.
from langchain_community.document_loaders import AmazonTextractPDFLoader
loader = AmazonTextractPDFLoader("example_data/alejandro_rosalez_sample-small.jpeg")
documents = loader.load()
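The size, page-count, and S3 constraints described above can be checked before a file is sent. The following is a minimal sketch, not part of langchain_community or any AWS SDK: the helper name `check_textract_input` and the constants are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
from typing import List
from urllib.parse import urlparse

# Limits quoted in the text above; 1 MB = 1024 * 1024 bytes is an assumption.
MAX_PAGES = 3000
MAX_BYTES = 512 * 1024 * 1024


def check_textract_input(path: str, size_bytes: int, page_count: int) -> List[str]:
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper for illustration only.
    """
    problems = []
    if page_count > MAX_PAGES:
        problems.append(f"{page_count} pages exceeds the {MAX_PAGES}-page limit")
    if size_bytes > MAX_BYTES:
        problems.append(f"{size_bytes} bytes exceeds the {MAX_BYTES}-byte limit")
    # A path counts as "on S3" when it parses as s3://<bucket>/...
    parsed = urlparse(path)
    on_s3 = parsed.scheme == "s3" and bool(parsed.netloc)
    if page_count > 1 and not on_s3:
        problems.append("multi-page documents have to reside on S3")
    return problems
```

A call such as `check_textract_input("local/report.pdf", 10_000, 12)` would report only the S3 requirement, since the file is small but has more than one page.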
One feature is the linearization of the output. When using the features LAYOUT, FORMS or TABLES together with Textract
from langchain_community.document_loaders import AmazonTextractPDFLoader
# you can mix and match each of the features
loader = AmazonTextractPDFLoader(
    "example_data/alejandro_rosalez_sample-small.jpeg",
    textract_features=["TABLES", "LAYOUT"],
)
documents = loader.load()
it will generate output that formats the text in reading order and tries to present the information in a tabular structure or as key/value pairs separated by a colon (key: value). This helps most LLMs achieve better accuracy when processing these texts.
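To give a feel for what "linearized" text looks like, here is a toy sketch of the two shapes mentioned above. It is not the textractor implementation: the function names are hypothetical, the sample values are made up, and the tab used as a column separator is an assumption.

```python
from typing import Dict, List


def linearize_key_values(pairs: Dict[str, str]) -> str:
    """Render form fields one per line as 'key: value'."""
    return "\n".join(f"{key}: {value}" for key, value in pairs.items())


def linearize_table(rows: List[List[str]], column_separator: str = "\t") -> str:
    """Render table rows one per line; the tab separator is an assumption."""
    return "\n".join(column_separator.join(row) for row in rows)


# Made-up sample values, for illustration only.
form = {"Name": "Jane Doe", "Date": "2024-01-01"}
table = [["Item", "Qty"], ["Widget", "2"]]
text = linearize_key_values(form) + "\n\n" + linearize_table(table)
print(text)
```

The point of the sketch is only that each field or table row ends up on its own line in reading order, which is the kind of flat text an LLM prompt can consume directly.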
Document objects are returned with metadata that includes the source and
a 1-based index of the page number in page. Note that page represents
the index of the result returned from Textract, not necessarily the as-written
page number in the document.
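The page metadata described above can be pictured with a small sketch. It assumes a stand-in class, `PageDocument`, in place of the real LangChain Document, and a hypothetical helper `pages_to_documents`; neither is part of the library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class PageDocument:
    """Stand-in for a LangChain Document; hypothetical, for illustration only."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def pages_to_documents(page_texts: List[str], source: str) -> Iterator[PageDocument]:
    # enumerate(..., start=1) yields the 1-based result index stored under "page".
    for page_number, text in enumerate(page_texts, start=1):
        yield PageDocument(
            page_content=text,
            metadata={"source": source, "page": page_number},
        )


docs = list(pages_to_documents(["cover", "body text"], "s3://my-bucket/report.pdf"))
```

In this sketch a cover sheet that is printed without a number, or numbered "i", would still be reported as page 1, because the index follows the order of results rather than the numbering printed in the document.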
textract_features |
Optional[Sequence[int]] |
Default: None |
Features to be used for extraction. Each feature should be passed as an int that conforms to the enum Textract_Features; see the amazon-textract-caller package.
client |
Optional[Any] |
Default: None |
boto3 Textract client.
linearization_config |
Optional[TextLinearizationConfig] |
Default: None |
Config to be used for linearization of the output. Should be an instance of TextLinearizationConfig from the textractor package.
Iterates over the Blob pages and returns an Iterator with a Document for each page, like the other parsers. For a multi-page document, blob.path has to be set to the S3 URI; for single-page documents, blob.data is used.