Document Transformers are classes that transform Documents.
They are typically used to transform many Documents in a single run.
Class hierarchy:

.. code-block::

    BaseDocumentTransformer --> <name>  # Examples: DoctranQATransformer, DoctranTextTranslator

Main helpers:

.. code-block::

    Document

Transform HTML content by extracting specific tags and removing unwanted ones.
Extract properties from text documents using doctran.
Extract QA from text documents using doctran.
Translate text documents using doctran.
Perform K-means clustering on document vectors. Returns a configurable number of documents closest to each cluster center.
Filter that drops redundant documents by comparing their embeddings.
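
The idea of dropping redundant documents by comparing embeddings can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the helper names ``cosine_similarity`` and ``drop_redundant``, the greedy keep-first strategy, the 0.95 threshold, and the toy two-dimensional embeddings are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant(
    texts: List[str],
    embeddings: List[Sequence[float]],
    threshold: float = 0.95,  # hypothetical cutoff chosen for this sketch
) -> List[str]:
    """Keep a text only if it is not too similar to any text already kept."""
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [texts[i] for i in kept]


# Toy data: the first two vectors point in almost the same direction.
texts = ["cats purr", "cats purr loudly", "stocks fell"]
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
unique_texts = drop_redundant(texts, embeddings)
```

With the toy vectors above, the second text is dropped because it is nearly parallel to the first. In practice the embeddings would come from an embedding model rather than being written by hand.
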
Replace occurrences of a particular search pattern with a replacement string.
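
Pattern replacement of this kind can be illustrated with the standard ``re`` module. The sketch below operates on plain strings rather than ``Document`` objects, and the function name ``replace_in_texts`` and the sample pattern are hypothetical, chosen only for the example.

```python
import re
from typing import List


def replace_in_texts(texts: List[str], pattern: str, replacement: str) -> List[str]:
    """Return new strings with every match of `pattern` replaced."""
    compiled = re.compile(pattern)  # compile once, reuse for every text
    return [compiled.sub(replacement, text) for text in texts]


cleaned = replace_in_texts(
    ["Call 555-1234 now", "No number here"],
    r"\d{3}-\d{4}",
    "[REDACTED]",
)
```

Texts without a match are returned unchanged, so the transformation is safe to apply to a whole batch.
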
Reorder long context.
Lost in the middle: Performance degrades when models must access relevant information in the middle of long contexts. See: https://arxiv.org/abs/2307.03172
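
One way to act on this observation is to place the most relevant items at the start and end of the context and the least relevant ones in the middle. The following is a minimal sketch of that idea, assuming the input is already sorted from most to least relevant. The function name ``reorder_for_long_context`` is hypothetical, and the code is not presented as the library's own implementation.

```python
from typing import List, TypeVar

T = TypeVar("T")


def reorder_for_long_context(items: List[T]) -> List[T]:
    """Move the most relevant items to the edges of the list.

    `items` is assumed to be sorted by relevance, most relevant first.
    Items at even positions fill the front; items at odd positions fill
    the back in reverse, so the least relevant end up in the middle.
    """
    front: List[T] = []
    back: List[T] = []
    for index, item in enumerate(items):
        if index % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    return front + back[::-1]


ranked = ["a", "b", "c", "d", "e"]  # "a" is most relevant, "e" least
reordered = reorder_for_long_context(ranked)  # ["a", "c", "e", "d", "b"]
```

Here the two most relevant items, ``"a"`` and ``"b"``, land at the first and last positions, while the least relevant item, ``"e"``, sits in the middle.
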
Converts HTML documents to Markdown format with customizable options for handling links, images, other tags and heading styles using the markdownify library.
Nuclia Text Transformer.
The Nuclia Understanding API splits text into paragraphs and sentences, identifies entities, provides a summary of the text, and generates embeddings for all sentences.
Extract metadata tags from document contents using OpenAI functions.
Example:

.. code-block:: python

    # create_tagging_chain is provided by the ``langchain`` package
    from langchain.chains import create_tagging_chain
    from langchain_openai import ChatOpenAI
    from langchain_community.document_transformers import OpenAIMetadataTagger
    from langchain_core.documents import Document

    # JSON schema describing the metadata fields to extract
    schema = {
        "properties": {
            "movie_title": {"type": "string"},
            "critic": {"type": "string"},
            "tone": {
                "type": "string",
                "enum": ["positive", "negative"],
            },
            "rating": {
                "type": "integer",
                "description": "The number of stars the critic rated the movie",
            },
        },
        "required": ["movie_title", "critic", "tone"],
    }

    # Must be an OpenAI model that supports functions
    llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")
    tagging_chain = create_tagging_chain(schema, llm)
    document_transformer = OpenAIMetadataTagger(tagging_chain=tagging_chain)

    original_documents = [
        Document(
            page_content="Review of The Bee Movie\nBy Roger Ebert\n"
            "This is the greatest movie ever made. 4 out of 5 stars."
        ),
        Document(
            page_content="Review of The Godfather By Anonymous\n"
            "This movie was super boring. 1 out of 5 stars.",
            metadata={"reliable": False},
        ),
    ]

    # Each returned document carries the extracted tags in its metadata
    enhanced_documents = document_transformer.transform_documents(original_documents)

Translate text documents using Google Cloud Translation.
Reorder documents.
Document transformers that use OpenAI Functions models.
Transform documents.