Module · Since v0.3

document_transformers

Document Transformers are classes to transform Documents.

Document Transformers are usually used to transform many Documents in a single run.
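
The batch-oriented shape described above can be sketched without the library. The `Document` dataclass and `UppercaseTransformer` below are hypothetical stand-ins written for illustration, not LangChain classes; only the `transform_documents` method name is taken from the example further down this page.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class Document:
    """Stand-in for the library's Document: text plus free-form metadata."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class UppercaseTransformer:
    """Hypothetical transformer that processes a whole batch in one call."""

    def transform_documents(self, documents: Sequence[Document]) -> List[Document]:
        # Build new Document objects instead of mutating the inputs.
        return [
            Document(page_content=doc.page_content.upper(), metadata=dict(doc.metadata))
            for doc in documents
        ]


docs = [Document("first page", {"source": "a.txt"}), Document("second page")]
transformed = UppercaseTransformer().transform_documents(docs)
```

Returning fresh objects keeps the original list usable for comparison, which is convenient when several transformers are applied in sequence.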

Class hierarchy:

.. code-block::

    BaseDocumentTransformer --> <name>  # Examples: DoctranQATransformer, DoctranTextTranslator

Main helpers:

.. code-block::

    Document

Functions

function

get_stateful_documents

Convert a list of documents to a list of documents with state.

Classes

class

BeautifulSoupTransformer

Transform HTML content by extracting specific tags and removing unwanted ones.
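
The idea of keeping text from selected tags while discarding unwanted ones can be illustrated with the standard library alone. This is a minimal sketch that swaps in `html.parser` for BeautifulSoup; the class `TagTextExtractor` and its parameter names are assumptions made for the example and are not part of the library.

```python
from html.parser import HTMLParser
from typing import Iterable, List


class TagTextExtractor(HTMLParser):
    """Collect text found inside wanted tags, skipping anything inside unwanted tags."""

    def __init__(self, tags_to_extract: Iterable[str], unwanted_tags: Iterable[str]) -> None:
        super().__init__()
        self.tags_to_extract = set(tags_to_extract)
        self.unwanted_tags = set(unwanted_tags)
        self._extract_depth = 0   # how many wanted tags are currently open
        self._unwanted_depth = 0  # how many unwanted tags are currently open
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.unwanted_tags:
            self._unwanted_depth += 1
        elif tag in self.tags_to_extract:
            self._extract_depth += 1

    def handle_endtag(self, tag):
        if tag in self.unwanted_tags and self._unwanted_depth > 0:
            self._unwanted_depth -= 1
        elif tag in self.tags_to_extract and self._extract_depth > 0:
            self._extract_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when inside a wanted tag and outside every unwanted tag.
        if text and self._extract_depth > 0 and self._unwanted_depth == 0:
            self.chunks.append(text)


html = (
    "<html><body><script>track()</script>"
    "<p>Hello</p><div>ignored</div><p>World</p></body></html>"
)
parser = TagTextExtractor(tags_to_extract=["p"], unwanted_tags=["script"])
parser.feed(html)
parser.close()
extracted = " ".join(parser.chunks)
```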

class

DoctranPropertyExtractor

Extract properties from text documents using doctran.

class

DoctranQATransformer

Extract QA from text documents using doctran.

class

DoctranTextTranslator

Translate text documents using doctran.

class

EmbeddingsClusteringFilter

Perform K-means clustering on document vectors. Returns an arbitrary number of documents closest to center.
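
As a rough illustration of the idea, the sketch below clusters a few toy vectors with a plain-Python K-means and then picks the vector nearest each center. It is not the library's implementation: the function names, the deterministic initialisation from the first `k` vectors, and the fixed iteration count are all assumptions chosen to keep the example small.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors: Sequence[Vector], k: int, iterations: int = 10) -> List[List[float]]:
    # Deterministic start: use the first k vectors as the initial centers.
    centers = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: squared_distance(v, centers[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # leave an empty cluster's center where it was
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers


def closest_to_centers(vectors: Sequence[Vector], centers: Sequence[Vector]) -> List[int]:
    # For each center, return the index of the document vector nearest to it.
    return [
        min(range(len(vectors)), key=lambda i: squared_distance(vectors[i], center))
        for center in centers
    ]


vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [0.5, 0.5]]
centers = kmeans(vectors, k=2)
representatives = closest_to_centers(vectors, centers)
```

Here one representative index is returned per cluster; a filter built on this idea could keep several documents per cluster instead.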

class

EmbeddingsRedundantFilter

Filter that drops redundant documents by comparing their embeddings.
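
A simplified way to picture this kind of filtering is a greedy pass that keeps a document only if its embedding is not too similar to any document already kept. The sketch below assumes cosine similarity and a hypothetical threshold of 0.95; the function names are illustrative, and the library's actual comparison strategy may differ.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def drop_redundant(
    texts: Sequence[str],
    vectors: Sequence[Sequence[float]],
    threshold: float = 0.95,  # assumed value for this sketch
) -> List[str]:
    kept_indices: List[int] = []
    for i, vec in enumerate(vectors):
        # Keep the document only if it is below the threshold against every kept one.
        if all(cosine_similarity(vec, vectors[j]) < threshold for j in kept_indices):
            kept_indices.append(i)
    return [texts[i] for i in kept_indices]


texts = ["cats purr", "cats purr loudly", "stocks fell"]
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]  # toy embeddings
unique_texts = drop_redundant(texts, vectors)
```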

class

Html2TextTransformer

Convert HTML documents to plain text using the html2text library.

class

LongContextReorder

Reorder long context.

Lost in the middle: Performance degrades when models must access relevant information in the middle of long contexts. See: https://arxiv.org/abs/2307.03172
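
One way to act on this observation is to place the most relevant documents at the two ends of the context and the least relevant ones in the middle. The sketch below assumes the input is already sorted from most to least relevant; `reorder_for_long_context` is a hypothetical helper, and the library's exact ordering may differ.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def reorder_for_long_context(ranked: Sequence[T]) -> List[T]:
    """Alternate items between the front and the back of the result.

    `ranked` is assumed to be sorted from most to least relevant.
    """
    front: List[T] = []
    back: List[T] = []
    for i, item in enumerate(ranked):
        if i % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    # Reversing `back` puts the second most relevant item at the very end.
    return front + back[::-1]


reordered = reorder_for_long_context(["d1", "d2", "d3", "d4", "d5"])
```

With five ranked documents, `d1` and `d2` end up at the start and end, while the lowest-ranked `d5` sits in the middle.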

class

MarkdownifyTransformer

Converts HTML documents to Markdown format with customizable options for handling links, images, other tags and heading styles using the markdownify library.

class

NucliaTextTransformer

Nuclia Text Transformer.

The Nuclia Understanding API splits into paragraphs and sentences, identifies entities, provides a summary of the text and generates embeddings for all sentences.

class

OpenAIMetadataTagger

Extract metadata tags from document contents using OpenAI functions.

Example:
    .. code-block:: python

            from langchain.chains import create_tagging_chain
            from langchain_openai import ChatOpenAI
            from langchain_community.document_transformers import OpenAIMetadataTagger
            from langchain_core.documents import Document

            schema = {
                "properties": {
                    "movie_title": { "type": "string" },
                    "critic": { "type": "string" },
                    "tone": {
                        "type": "string",
                        "enum": ["positive", "negative"]
                    },
                    "rating": {
                        "type": "integer",
                        "description": "The number of stars the critic rated the movie"
                    }
                },
                "required": ["movie_title", "critic", "tone"]
            }

            # Must be an OpenAI model that supports functions
            llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")
            tagging_chain = create_tagging_chain(schema, llm)
            document_transformer = OpenAIMetadataTagger(tagging_chain=tagging_chain)
            original_documents = [
                Document(
                    page_content="Review of The Bee Movie\n\nBy Roger Ebert\n\nThis is the greatest movie ever made. 4 out of 5 stars."
                ),
                Document(
                    page_content="Review of The Godfather By Anonymous\n\nThis movie was super boring. 1 out of 5 stars.",
                    metadata={"reliable": False},
                ),
            ]

            enhanced_documents = document_transformer.transform_documents(original_documents)

deprecated class

GoogleTranslateTransformer

Translate text documents using Google Cloud Translation.

Modules

module

beautiful_soup_transformer

module

nuclia_text_transform

module

long_context_reorder

Reorder documents

module

openai_functions

Document transformers that use OpenAI Functions models

module

doctran_text_extract

module

doctran_text_translate

module

embeddings_redundant_filter

Transform documents
