langchain-unstructured
langchain_unstructured
UnstructuredLoader
Bases: BaseLoader
Unstructured document loader interface.
Setup
Install langchain-unstructured and set the UNSTRUCTURED_API_KEY environment variable.

.. code-block:: bash

    pip install -U langchain-unstructured
    export UNSTRUCTURED_API_KEY="your-api-key"
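A missing key is easier to diagnose before the loader is created than after a failed API call. The following is a minimal sketch of such a check; `require_api_key` is a hypothetical helper for illustration, not part of langchain-unstructured.

```python
import os
from collections.abc import Mapping


def require_api_key(
    env: Mapping[str, str] = os.environ,
    name: str = "UNSTRUCTURED_API_KEY",
) -> str:
    """Return the API key from `env`, or raise a clear error if it is unset."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating the loader.")
    return key


# Demonstration with an explicit mapping instead of the real environment.
print(require_api_key({"UNSTRUCTURED_API_KEY": "your-api-key"}))
```

The returned value can then be passed as the `api_key` argument when instantiating the loader.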
Instantiate
.. code-block:: python

    import os

    from langchain_unstructured import UnstructuredLoader

    loader = UnstructuredLoader(
        file_path=["example.pdf", "fake.pdf"],
        api_key=os.environ["UNSTRUCTURED_API_KEY"],  # exported during Setup
        partition_via_api=True,
        chunking_strategy="by_title",
        strategy="fast",
    )
Lazy load
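.. code-block:: python

    docs = []
    docs_lazy = loader.lazy_load()

    # async variant: alazy_load() returns an async iterator,
    # so consume it with `async for` instead of awaiting it:
    # async for doc in loader.alazy_load():
    #     docs.append(doc)

    for doc in docs_lazy:
        docs.append(doc)
    print(docs[0].page_content[:100])
    print(docs[0].metadata)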
.. code-block:: none

    1 2 0 2
    {'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-07-25T21:28:58', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}
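The `coordinates` entry in the metadata above lists the four corner points of the element's bounding box. As a rough sketch, the box origin and size can be derived with plain Python. The helper name `bounding_box` is hypothetical, and the metadata dict is an excerpt copied from the sample output.

```python
def bounding_box(metadata: dict) -> tuple[float, float, float, float]:
    """Return (x_min, y_min, width, height) from an element's corner points."""
    points = metadata["coordinates"]["points"]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, y_min = min(xs), min(ys)
    return x_min, y_min, max(xs) - x_min, max(ys) - y_min


# Metadata excerpt copied from the sample output above.
sample_metadata = {
    "coordinates": {
        "points": ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)),
        "system": "PixelSpace",
        "layout_width": 612,
        "layout_height": 792,
    },
    "page_number": 1,
}

x_min, y_min, width, height = bounding_box(sample_metadata)
print(f"x_min={x_min:.2f} y_min={y_min:.2f} width={width:.2f} height={height:.2f}")
# prints: x_min=16.34 y_min=213.36 width=20.00 height=40.00
```

Not every element carries a `coordinates` key (the HTML elements in the URL example do not), so real code would need to handle its absence.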
Async load
.. code-block:: python

    docs = await loader.aload()
    print(docs[0].page_content[:100])
    print(docs[0].metadata)

.. code-block:: none

    1 2 0 2
    {'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-07-25T21:28:58', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}
Load URL
.. code-block:: python

    loader = UnstructuredLoader(web_url="https://www.example.com/")
    docs = loader.load()  # load before indexing into docs
    print(docs[0])

.. code-block:: none

    page_content='Example Domain' metadata={'category_depth': 0, 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://www.example.com/', 'category': 'Title', 'element_id': 'fdaa78d856f9d143aeeed85bf23f58f8'}

.. code-block:: python

    print(docs[1])

.. code-block:: none

    page_content='This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.' metadata={'languages': ['eng'], 'parent_id': 'fdaa78d856f9d143aeeed85bf23f58f8', 'filetype': 'text/html', 'url': 'https://www.example.com/', 'category': 'NarrativeText', 'element_id': '3652b8458b0688639f973fe36253c992'}
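In the output above, each element carries a `category`, and the second element's `parent_id` matches the first element's `element_id`. The sketch below shows one way to group elements by category and to look up the children of a title. `Doc` is a stand-in dataclass with the same two attributes as a LangChain `Document`, and `group_by_category` and `children_of` are hypothetical helpers, not library functions.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (page_content plus metadata)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_category(docs: list[Doc]) -> dict[str, list[str]]:
    """Map each metadata 'category' to the page_content of its elements."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata.get("category", "Unknown")].append(doc.page_content)
    return dict(grouped)


def children_of(docs: list[Doc], parent_element_id: str) -> list[Doc]:
    """Return the elements whose 'parent_id' points at the given element."""
    return [d for d in docs if d.metadata.get("parent_id") == parent_element_id]


# Two elements shaped like the sample output (second text abbreviated).
TITLE_ID = "fdaa78d856f9d143aeeed85bf23f58f8"
sample_docs = [
    Doc("Example Domain", {"category": "Title", "element_id": TITLE_ID}),
    Doc(
        "This domain is for use in illustrative examples in documents.",
        {"category": "NarrativeText", "parent_id": TITLE_ID},
    ),
]

print(group_by_category(sample_docs)["Title"])
print(len(children_of(sample_docs, TITLE_ID)))
```

Because the helpers only read `page_content` and `metadata`, they should work unchanged on the list returned by `loader.load()`.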
References
https://docs.unstructured.io/api-reference/api-services/sdk
https://docs.unstructured.io/api-reference/api-services/overview
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking
| METHOD | DESCRIPTION |
|---|---|
| load | Load data into Document objects. |
| aload | Load data into Document objects. |
| load_and_split | Load Documents and split them into chunks. |
| alazy_load | A lazy loader for Document. |
| __init__ | Initialize loader. |
| lazy_load | Load file(s) to the _UnstructuredBaseLoader. |
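load
load() -> list[Document]
Load data into Document objects.
aload
async
aload() -> list[Document]
Load data into Document objects.
load_and_split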
load_and_split(text_splitter: TextSplitter | None = None) -> list[Document]
Load Documents and split them into chunks. The chunks are returned as Document objects.
Danger
Do not override this method. It should be considered deprecated.
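| PARAMETER | DESCRIPTION |
|---|---|
| text_splitter | TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter. TYPE: TextSplitter \| None, DEFAULT: None |

| RAISES | DESCRIPTION |
|---|---|
| ImportError | If langchain-text-splitters is not installed and no text_splitter is provided. |

| RETURNS | DESCRIPTION |
|---|---|
| list[Document] | List of Document objects. |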
alazy_load
async
alazy_load() -> AsyncIterator[Document]
A lazy loader for Document.
| YIELDS | DESCRIPTION |
|---|---|
| AsyncIterator[Document] | The Document objects. |
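Because `alazy_load()` returns an `AsyncIterator`, it is consumed with `async for` rather than awaited directly. The sketch below shows the pattern with a stand-in class. `FakeLoader` is hypothetical and yields plain strings instead of `Document` objects so that the example runs without any API access.

```python
import asyncio
from collections.abc import AsyncIterator


class FakeLoader:
    """Hypothetical stand-in exposing the same alazy_load() shape as the loader."""

    async def alazy_load(self) -> AsyncIterator[str]:
        for text in ("first element", "second element"):
            await asyncio.sleep(0)  # yield control, as real I/O would
            yield text


async def collect(loader) -> list:
    # Iterate the async iterator; items arrive one at a time.
    return [doc async for doc in loader.alazy_load()]


collected = asyncio.run(collect(FakeLoader()))
print(collected)
# prints: ['first element', 'second element']
```

With the real loader, the same `collect` coroutine would presumably gather `Document` objects, since it relies only on the `alazy_load()` signature shown above.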
__init__
__init__(
    file_path: str | Path | list[str] | list[Path] | None = None,
    *,
    file: IO[bytes] | list[IO[bytes]] | None = None,
    partition_via_api: bool = False,
    post_processors: list[Callable[[str], str]] | None = None,
    api_key: str | None = None,
    client: UnstructuredClient | None = None,
    url: str | None = None,
    web_url: str | None = None,
    **kwargs: Any,
)
Initialize loader.
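The `post_processors` parameter takes a list of `Callable[[str], str]`, that is, functions that receive text and return cleaned text. The following is a minimal sketch of two such functions. It assumes the loader applies them in list order to each element's text. `apply_post_processors` is a hypothetical helper that mimics that assumed behavior for demonstration, and both cleaning functions are illustrative rather than library-provided.

```python
import re
from collections.abc import Callable


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def strip_leading_bullet(text: str) -> str:
    """Remove one leading bullet marker (•, - or *) and the space after it."""
    return re.sub(r"^[\u2022\-*]\s+", "", text)


def apply_post_processors(text: str, processors: list[Callable[[str], str]]) -> str:
    """Hypothetical helper: run each processor in order (assumed loader behavior)."""
    for processor in processors:
        text = processor(text)
    return text


post_processors = [collapse_whitespace, strip_leading_bullet]
cleaned = apply_post_processors("  \u2022  Layout   parsing\n pipeline ", post_processors)
print(cleaned)
# prints: Layout parsing pipeline

# Passing the same list to the loader, per the signature above:
# loader = UnstructuredLoader("example.pdf", post_processors=post_processors)
```

Order matters in this sketch: whitespace is collapsed first so that the bullet pattern can anchor at the start of the string.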