# AmazonTextractPDFParser

> **Class** in `langchain_community`

📖 [View in docs](https://reference.langchain.com/python/langchain-community/document_loaders/parsers/pdf/AmazonTextractPDFParser)

Send `PDF` files to `Amazon Textract` and parse them.

Multi-page PDFs must reside on S3 to be parsed.
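
Because of the S3 requirement, it can help to check a location string before deciding how to load it. The following is a minimal sketch using only the standard library; `is_s3_uri` is a hypothetical helper for illustration and is not part of `langchain_community`.

```python
from urllib.parse import urlparse


def is_s3_uri(location: str) -> bool:
    """Return True if `location` looks like s3://bucket/key (hypothetical helper)."""
    parsed = urlparse(location)
    has_bucket = bool(parsed.netloc)
    has_key = bool(parsed.path.strip("/"))
    return parsed.scheme == "s3" and has_bucket and has_key


# Multi-page PDFs would be expected to pass this check before being sent.
s3_ok = is_s3_uri("s3://example-bucket/docs/report.pdf")
local_ok = is_s3_uri("example_data/report.pdf")
```

A local path fails the check, which signals that only single-page input should be sent from that location.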

The AmazonTextractPDFLoader calls the
[Amazon Textract Service](https://aws.amazon.com/textract/)
to convert PDFs into a Document structure.
Single- and multi-page documents are supported, up to 3000 pages
and 512 MB in size.

The call requires an AWS account, configured in the same way as for the
[AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).

Apart from the AWS configuration, it works much like the other PDF
loaders, and it also supports JPEG, PNG, and TIFF images as well as
non-native PDF formats.

```python
from langchain_community.document_loaders import AmazonTextractPDFLoader

# a single-page image can be passed as a local path
loader = AmazonTextractPDFLoader("example_data/alejandro_rosalez_sample-small.jpeg")
documents = loader.load()
```

One feature is the linearization of the output.
When using the features LAYOUT, FORMS or TABLES together with Textract

```python
from langchain_community.document_loaders import AmazonTextractPDFLoader

# you can mix and match each of the features
loader = AmazonTextractPDFLoader(
    "example_data/alejandro_rosalez_sample-small.jpeg",
    textract_features=["TABLES", "LAYOUT"],
)
documents = loader.load()
```

it generates output that formats the text in reading order and
tries to render the information in a tabular structure or
as key/value pairs separated by a colon (key: value).
This helps most LLMs achieve better accuracy when
processing these texts.

``Document`` objects are returned with metadata that includes the ``source`` and
a 1-based index of the page number in ``page``. Note that ``page`` represents
the index of the result returned from Textract, not necessarily the as-written
page number in the document.
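
To make the metadata layout concrete, here is a small sketch that groups the `page` values by `source`. The helper name `pages_by_source` is hypothetical, and the `SimpleNamespace` objects are stand-ins for real `Document` objects so the snippet runs without AWS access. Only the `source` and `page` keys described above are assumed.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def pages_by_source(documents: Iterable[Any]) -> Dict[str, List[int]]:
    """Group the 1-based `page` values by `source` (hypothetical helper)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for doc in documents:
        grouped[doc.metadata["source"]].append(doc.metadata["page"])
    return {source: sorted(pages) for source, pages in grouped.items()}


# Stand-in objects for illustration only; real ones come from loader.load().
fake_documents = [
    SimpleNamespace(
        page_content="first page text",
        metadata={"source": "s3://example-bucket/a.pdf", "page": 1},
    ),
    SimpleNamespace(
        page_content="second page text",
        metadata={"source": "s3://example-bucket/a.pdf", "page": 2},
    ),
]
page_index = pages_by_source(fake_documents)
```

The same function should work unchanged on the list returned by `loader.load()`, since it only reads `metadata`.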

## Signature

```python
AmazonTextractPDFParser(
    self,
    textract_features: Optional[Sequence[int]] = None,
    client: Optional[Any] = None,
    *,
    linearization_config: Optional[TextLinearizationConfig] = None,
)
```

## Parameters

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `textract_features` | `Optional[Sequence[int]]` | No | Features to be used for extraction. Each feature should be passed as an int that conforms to the enum `Textract_Features`; see the `amazon-textract-caller` pkg (default: `None`) |
| `client` | `Optional[Any]` | No | boto3 textract client (default: `None`) |
| `linearization_config` | `Optional[TextLinearizationConfig]` | No | Config to be used for linearization of the output. Should be an instance of `TextLinearizationConfig` from the `textractor` pkg (default: `None`) |
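
As a rough sketch of how these parameters might be combined, the function below builds a parser with two features, an explicit boto3 client, and a linearization config. The function name `build_textract_parser`, the default region, and the chosen config option are assumptions for illustration. It also assumes `boto3`, `amazon-textract-caller`, and `amazon-textract-textractor` are installed and AWS credentials are configured. Imports are deferred into the function so that defining it does not require those packages.

```python
from typing import Any, Optional


def build_textract_parser(
    region_name: str = "us-east-1",  # assumed region, adjust as needed
    client: Optional[Any] = None,
) -> Any:
    """Build an AmazonTextractPDFParser (hypothetical helper)."""
    # Deferred imports: these third-party packages are only needed on call.
    import boto3
    from textractcaller import Textract_Features
    from textractor.data.text_linearization_config import TextLinearizationConfig

    from langchain_community.document_loaders.parsers.pdf import (
        AmazonTextractPDFParser,
    )

    if client is None:
        client = boto3.client("textract", region_name=region_name)

    return AmazonTextractPDFParser(
        # Features are passed as the int values of the Textract_Features enum.
        textract_features=[
            Textract_Features.TABLES.value,
            Textract_Features.LAYOUT.value,
        ],
        client=client,
        # Assumed option: leave figure layout blocks out of the linearized text.
        linearization_config=TextLinearizationConfig(hide_figure_layout=True),
    )
```

Passing a preconfigured client is optional; it is mainly useful when a specific region or session is needed.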

## Extends

- `BaseBlobParser`

## Constructors

```python
__init__(
    self,
    textract_features: Optional[Sequence[int]] = None,
    client: Optional[Any] = None,
    *,
    linearization_config: Optional[TextLinearizationConfig] = None,
) -> None
```

| Name | Type |
|------|------|
| `textract_features` | `Optional[Sequence[int]]` |
| `client` | `Optional[Any]` |
| `linearization_config` | `Optional[TextLinearizationConfig]` |


## Properties

- `tc`
- `textractor`
- `textract_features`
- `linearization_config`
- `boto3_textract_client`

## Methods

- [`lazy_parse()`](https://reference.langchain.com/python/langchain-community/document_loaders/parsers/pdf/AmazonTextractPDFParser/lazy_parse)
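
A hedged sketch of calling `lazy_parse()` directly: the helper below wraps a location in a `Blob` and yields the resulting documents one at a time. `iter_textract_pages` is a hypothetical name, `parser` is assumed to be an already constructed `AmazonTextractPDFParser`, and running it requires AWS credentials plus the Textract dependencies.

```python
from typing import Any, Iterator


def iter_textract_pages(parser: Any, location: str) -> Iterator[Any]:
    """Yield one Document per Textract result page (hypothetical helper)."""
    # Deferred import so the sketch can be defined without langchain installed.
    from langchain_community.document_loaders.blob_loaders import Blob

    if location.startswith("s3://"):
        # S3 objects are referenced by path only; no bytes are read locally.
        blob = Blob(path=location)
    else:
        # Local files are wrapped so the parser can read their bytes.
        blob = Blob.from_path(location)

    yield from parser.lazy_parse(blob)
```

Because the function is a generator, nothing is sent to Textract until the caller starts iterating, for example with `for doc in iter_textract_pages(parser, "s3://example-bucket/report.pdf")`.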

---

[View source on GitHub](https://github.com/langchain-ai/langchain-community/blob/4b280287bd55b99b44db2dd849f02d66c89534d5/libs/community/langchain_community/document_loaders/parsers/pdf.py#L1488)