langchain-unstructured

langchain_unstructured

UnstructuredLoader

Bases: BaseLoader

Unstructured document loader interface.

Setup

Install langchain-unstructured and set environment variable UNSTRUCTURED_API_KEY.

.. code-block:: bash

pip install -U langchain-unstructured
export UNSTRUCTURED_API_KEY="your-api-key"

Instantiate

.. code-block:: python

import os

from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(
    file_path=["example.pdf", "fake.pdf"],
    # Read the API key exported during setup.
    api_key=os.environ["UNSTRUCTURED_API_KEY"],
    partition_via_api=True,
    chunking_strategy="by_title",
    strategy="fast",
)

Lazy load

.. code-block:: python

docs = []
docs_lazy = loader.lazy_load()

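# async variant: alazy_load() returns an async iterator, so consume it
# with `async for` rather than awaiting it:
# async for doc in loader.alazy_load():
#     docs.append(doc)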

for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)

.. code-block:: none

1 2 0 2
{'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-07-25T21:28:58', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}

Async load

.. code-block:: python

docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)

.. code-block:: none

1 2 0 2
{'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-07-25T21:28:58', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}

Load URL

.. code-block:: python

loader = UnstructuredLoader(web_url="https://www.example.com/")
docs = loader.load()  # reload so `docs` holds the web page elements
print(docs[0])

.. code-block:: none

page_content='Example Domain' metadata={'category_depth': 0, 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://www.example.com/', 'category': 'Title', 'element_id': 'fdaa78d856f9d143aeeed85bf23f58f8'}

.. code-block:: python

print(docs[1])

.. code-block:: none

page_content='This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.' metadata={'languages': ['eng'], 'parent_id': 'fdaa78d856f9d143aeeed85bf23f58f8', 'filetype': 'text/html', 'url': 'https://www.example.com/', 'category': 'NarrativeText', 'element_id': '3652b8458b0688639f973fe36253c992'}
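The two outputs above suggest that a child element's ``parent_id`` matches the ``element_id`` of the title it sits under. The sketch below groups element texts by that link. It is illustrative only: it works on plain ``(page_content, metadata)`` pairs trimmed from the output above rather than on real ``Document`` objects, and ``group_by_parent`` is a hypothetical helper, not part of the library.

```python
from collections import defaultdict


def group_by_parent(elements):
    """Map each parent element_id to the texts of its child elements."""
    children = defaultdict(list)
    for text, metadata in elements:
        parent_id = metadata.get("parent_id")
        if parent_id is not None:  # top-level elements have no parent_id
            children[parent_id].append(text)
    return dict(children)


# Stand-in data: (page_content, metadata) pairs trimmed from the output above.
elements = [
    (
        "Example Domain",
        {"category": "Title", "element_id": "fdaa78d856f9d143aeeed85bf23f58f8"},
    ),
    (
        "This domain is for use in illustrative examples in documents.",
        {
            "category": "NarrativeText",
            "parent_id": "fdaa78d856f9d143aeeed85bf23f58f8",
            "element_id": "3652b8458b0688639f973fe36253c992",
        },
    ),
]

# Look up title text by element_id so each group can be labelled.
titles = {m["element_id"]: t for t, m in elements if m.get("category") == "Title"}
for parent_id, texts in group_by_parent(elements).items():
    print(titles[parent_id], "->", texts)
```

With loaded documents, the same helper could be fed ``(doc.page_content, doc.metadata)`` for each ``doc`` in ``docs``.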
References

- https://docs.unstructured.io/api-reference/api-services/sdk
- https://docs.unstructured.io/api-reference/api-services/overview
- https://docs.unstructured.io/open-source/core-functionality/partitioning
- https://docs.unstructured.io/open-source/core-functionality/chunking

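==============  ====================================================================
METHOD          DESCRIPTION
==============  ====================================================================
load            Load data into Document objects.
aload           Asynchronously load data into Document objects.
load_and_split  Load Document and split into chunks. Chunks are returned as Document.
alazy_load      A lazy loader for Document.
__init__        Initialize loader.
lazy_load       Lazily load the file(s), yielding Document objects one at a time.
==============  ====================================================================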

load

load() -> list[Document]

Load data into Document objects.

RETURNS DESCRIPTION
list[Document]

The documents.

aload async

aload() -> list[Document]

Asynchronously load data into Document objects.

RETURNS DESCRIPTION
list[Document]

The documents.

load_and_split

load_and_split(text_splitter: TextSplitter | None = None) -> list[Document]

Load Document and split into chunks. Chunks are returned as Document.

Danger

Do not override this method. It should be considered to be deprecated!

PARAMETER DESCRIPTION
text_splitter

TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

TYPE: TextSplitter | None DEFAULT: None

RAISES DESCRIPTION
ImportError

If langchain-text-splitters is not installed and no text_splitter is provided.

RETURNS DESCRIPTION
list[Document]

List of Document.

alazy_load async

alazy_load() -> AsyncIterator[Document]

A lazy loader for Document.

YIELDS DESCRIPTION
AsyncIterator[Document]

The Document objects.

__init__

__init__(
    file_path: str | Path | list[str] | list[Path] | None = None,
    *,
    file: IO[bytes] | list[IO[bytes]] | None = None,
    partition_via_api: bool = False,
    post_processors: list[Callable[[str], str]] | None = None,
    api_key: str | None = None,
    client: UnstructuredClient | None = None,
    url: str | None = None,
    web_url: str | None = None,
    **kwargs: Any,
)

Initialize loader.
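The signature types ``post_processors`` as ``list[Callable[[str], str]]``, that is, plain functions from string to string. The sketch below shows two such functions and applies them in order. The function names and ``apply_all`` are hypothetical, and the assumption that the loader runs each processor in list order over the extracted text is not confirmed by this page.

```python
import re
from typing import Callable


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def strip_leading_bullet(text: str) -> str:
    """Drop a single leading bullet character and the spaces after it."""
    return re.sub(r"^[•*-]\s*", "", text)


# A list matching the annotated type list[Callable[[str], str]].
post_processors: list[Callable[[str], str]] = [
    collapse_whitespace,
    strip_leading_bullet,
]


def apply_all(text: str, processors: list[Callable[[str], str]]) -> str:
    """Apply each processor in order (illustrative helper only)."""
    for processor in processors:
        text = processor(text)
    return text


print(apply_all("  •  Layout   parser \n", post_processors))  # Layout parser
```

Such a list would then be passed as ``post_processors=post_processors`` when constructing the loader.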

lazy_load

lazy_load() -> Iterator[Document]

Lazily load the file(s), yielding Document objects one at a time.
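An ``Iterator`` can be previewed without exhausting it. The sketch below uses ``itertools.islice`` to take the first few items and confirms that nothing further is pulled from the generator. ``first_n`` and ``_fake_lazy_load`` are hypothetical; how much work the real loader defers per document is an implementation detail this page does not state.

```python
from itertools import islice


def first_n(docs_iter, n):
    """Return the first n items of an iterator without consuming the rest."""
    return list(islice(docs_iter, n))


produced = []  # records how many items the generator actually produced


def _fake_lazy_load():
    """Hypothetical stand-in for loader.lazy_load()."""
    for i in range(1_000):
        produced.append(i)
        yield f"doc-{i}"


preview = first_n(_fake_lazy_load(), 3)
print(preview, len(produced))  # ['doc-0', 'doc-1', 'doc-2'] 3
```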