.. title:: Graph Vector Store

Sometimes embedding models don't capture all the important relationships between documents. Graph vector stores extend both vector stores and retrievers by allowing documents to be explicitly connected to each other.

Graph vector store retrievers use both vector similarity and links to find documents related to an unstructured query.

Graphs allow linking between documents. Each document identifies tags that link to and from it. For example, a paragraph of text may be linked to URLs based on the anchor tags in its content and linked from the URL(s) it is published at.

:class:`Link extractors <langchain_community.graph_vectorstores.extractors.link_extractor.LinkExtractor>`
can be used to extract links from documents.

Example::

    graph_vector_store = CassandraGraphVectorStore()
    link_extractor = HtmlLinkExtractor()
    # `document` is assumed to be an existing Document whose content is HTML
    links = link_extractor.extract_one(HtmlInput(document.page_content, "http://mysite"))
    add_links(document, links)
    graph_vector_store.add_documents([document])

.. seealso::

    - :class:`How to use a graph vector store as a retriever <langchain_community.graph_vectorstores.base.GraphVectorStoreRetriever>`
    - :class:`How to create links between documents <langchain_community.graph_vectorstores.links.Link>`
    - :class:`How to link Documents on hyperlinks in HTML <langchain_community.graph_vectorstores.extractors.html_link_extractor.HtmlLinkExtractor>`
    - :class:`How to link Documents on common keywords (using KeyBERT) <langchain_community.graph_vectorstores.extractors.keybert_link_extractor.KeybertLinkExtractor>`
    - :class:`How to link Documents on common named entities (using GliNER) <langchain_community.graph_vectorstores.extractors.gliner_link_extractor.GLiNERLinkExtractor>`
    - `langchain-jieba: link extraction tailored for Chinese language <https://github.com/cqzyys/langchain-jieba>`_

We load the State of the Union text and split it into chunked documents::

    from langchain_community.document_loaders import TextLoader
    from langchain_text_splitters import CharacterTextSplitter

    raw_documents = TextLoader("state_of_the_union.txt").load()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    documents = text_splitter.split_documents(raw_documents)

Links can be added to documents manually, but it's easier to use a
:class:`~langchain_community.graph_vectorstores.extractors.link_extractor.LinkExtractor`.
Several common link extractors are available, and you can build your own.
For this guide, we'll use the
:class:`~langchain_community.graph_vectorstores.extractors.keybert_link_extractor.KeybertLinkExtractor`,
which uses the KeyBERT model to tag documents with keywords and uses these keywords to
create links between documents::

    from langchain_community.graph_vectorstores.extractors import KeybertLinkExtractor
    from langchain_community.graph_vectorstores.links import add_links

    extractor = KeybertLinkExtractor()

    # Attach the extracted keyword links to each document's metadata
    for doc in documents:
        add_links(doc, extractor.extract_one(doc))

We'll use an Apache Cassandra or Astra DB database as an example.
We create a
:class:`~langchain_community.graph_vectorstores.cassandra.CassandraGraphVectorStore`
from the documents and an :class:`~langchain_openai.embeddings.base.OpenAIEmbeddings`
model::

    import cassio
    from langchain_community.graph_vectorstores import CassandraGraphVectorStore
    from langchain_openai import OpenAIEmbeddings

    # Initialize cassio and the Cassandra session from the environment variables
    cassio.init(auto=True)

    store = CassandraGraphVectorStore.from_documents(
        embedding=OpenAIEmbeddings(),
        documents=documents,
    )

If we don't traverse the graph, a graph vector store behaves like a regular vector
store, so all methods available in a vector store are also available in a graph
vector store.
The :meth:`~langchain_community.graph_vectorstores.base.GraphVectorStore.similarity_search`
method returns documents similar to a query without considering
the links between documents::

    docs = store.similarity_search(
        "What did the president say about Ketanji Brown Jackson?"
    )

The :meth:`~langchain_community.graph_vectorstores.base.GraphVectorStore.traversal_search`
method returns documents similar to a query, taking the links
between documents into account. It first does a similarity search and then traverses
the graph to find linked documents::

    docs = list(
        store.traversal_search("What did the president say about Ketanji Brown Jackson?")
    )

The graph vector store has async versions of these methods, prefixed with ``a``::

    # Must run inside an async function (or an environment with top-level await)
    docs = [
        doc
        async for doc in store.atraversal_search(
            "What did the president say about Ketanji Brown Jackson?"
        )
    ]

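
The async comprehension above only works inside a coroutine. The following is a minimal sketch of how such a call could be driven from synchronous code with ``asyncio.run``. It does not touch a real store: ``fake_atraversal_search`` is a hypothetical stand-in for ``store.atraversal_search`` that yields strings instead of documents.

```python
import asyncio


async def fake_atraversal_search(query):
    """Hypothetical stand-in for ``store.atraversal_search`` (an async generator)."""
    for text in ("first result", "second result"):
        await asyncio.sleep(0)  # yield control, as a real I/O call would
        yield f"{text} for {query!r}"


async def collect(query):
    # ``async for`` is only valid inside a coroutine, so the comprehension lives here.
    return [doc async for doc in fake_atraversal_search(query)]


docs = asyncio.run(collect("example query"))
print(docs)
```

With a real store, the body of ``collect`` would iterate over ``store.atraversal_search(query)`` instead of the stand-in.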
The graph vector store can be converted to a retriever.
It is similar to the vector store retriever, but it also supports traversal search
types such as ``traversal`` and ``mmr_traversal``::

    retriever = store.as_retriever(search_type="mmr_traversal")
    docs = retriever.invoke("What did the president say about Ketanji Brown Jackson?")

A link to/from a tag of a given kind.

Documents in a :class:`graph vector store <langchain_community.graph_vectorstores.base.GraphVectorStore>`
are connected via "links".
Links form a bipartite graph between documents and tags: documents are connected
to tags, and tags are connected to other documents.
When documents are retrieved from a graph vector store, a pair of documents is
connected with a depth of one if both documents are connected to the same tag.

Links have a ``kind`` property, used to namespace different tag identifiers.
For example, a link to a keyword might use kind ``kw``, while a link to a URL might
use kind ``url``.
This allows the same tag value to be used in different contexts without causing
name collisions.
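
As a rough illustration of why ``kind`` matters, the sketch below indexes documents by ``(kind, tag)`` pairs using plain Python dictionaries. It is not the library's storage model: the document ids and tags are invented, and link direction is ignored to keep the example short.

```python
from collections import defaultdict

# Hypothetical (doc_id, kind, tag) triples; not produced by any real extractor.
links = [
    ("doc1", "kw", "python"),
    ("doc2", "kw", "python"),
    ("doc3", "url", "python"),  # same tag value, different kind
]

# Index documents by (kind, tag) so equal tag values of different kinds stay apart.
docs_by_tag = defaultdict(set)
for doc_id, kind, tag in links:
    docs_by_tag[(kind, tag)].add(doc_id)


def neighbors(doc_id):
    """Return documents that share at least one (kind, tag) pair with doc_id."""
    found = set()
    for members in docs_by_tag.values():
        if doc_id in members:
            found |= members
    found.discard(doc_id)
    return found


print(sorted(neighbors("doc1")))  # doc2 only; doc3 uses a different kind
```

Keying on the tag value alone would have connected ``doc3`` to the other two documents, which is the collision that ``kind`` prevents.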
Links are directed. The directionality of links controls how the graph is
traversed at retrieval time.
For example, given documents A and B, connected by links to tag T:

+----------+----------+---------------------------------+
| A to T   | B to T   | Result                          |
+==========+==========+=================================+
| outgoing | incoming | Retrieval traverses from A to B |
+----------+----------+---------------------------------+
| incoming | incoming | No traversal from A to B        |
+----------+----------+---------------------------------+
| outgoing | outgoing | No traversal from A to B        |
+----------+----------+---------------------------------+
| bidir    | incoming | Retrieval traverses from A to B |
+----------+----------+---------------------------------+
| bidir    | outgoing | No traversal from A to B        |
+----------+----------+---------------------------------+
| outgoing | bidir    | Retrieval traverses from A to B |
+----------+----------+---------------------------------+
| incoming | bidir    | No traversal from A to B        |
+----------+----------+---------------------------------+

Directed links make it possible to describe relationships such as term references / definitions: term definitions are generally relevant to any documents that use the term, but the full set of documents using a term generally aren't relevant to the term's definition.
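
One way to read the direction rules is as a single condition: retrieval can step from A to B through T when A's link points at the tag and B's link points away from it. The function below is a sketch of that reading, not the library's implementation; ``traverses`` is a hypothetical helper name.

```python
def traverses(a_to_t: str, b_to_t: str) -> bool:
    """Return True if retrieval can step from A to B through tag T.

    Assumed rule: A must point at T ("outgoing" or "bidir") and
    T must point at B ("incoming" or "bidir").
    """
    return a_to_t in {"outgoing", "bidir"} and b_to_t in {"incoming", "bidir"}


for a_dir, b_dir in [
    ("outgoing", "incoming"),
    ("incoming", "incoming"),
    ("outgoing", "outgoing"),
    ("bidir", "incoming"),
    ("bidir", "outgoing"),
    ("outgoing", "bidir"),
    ("incoming", "bidir"),
]:
    print(f"{a_dir:>8} / {b_dir:<8} -> {traverses(a_dir, b_dir)}")
```

In the term/definition example, a document using a term would carry an outgoing link to the term's tag and the definition an incoming one, so traversal runs from usage to definition but not back.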

.. seealso::

    - :mod:`How to use a graph vector store <langchain_community.graph_vectorstores>`
    - :class:`How to link Documents on hyperlinks in HTML <langchain_community.graph_vectorstores.extractors.html_link_extractor.HtmlLinkExtractor>`
    - :class:`How to link Documents on common keywords (using KeyBERT) <langchain_community.graph_vectorstores.extractors.keybert_link_extractor.KeybertLinkExtractor>`
    - :class:`How to link Documents on common named entities (using GliNER) <langchain_community.graph_vectorstores.extractors.gliner_link_extractor.GLiNERLinkExtractor>`

You can create links using the ``Link`` class's constructors :meth:`incoming`,
:meth:`outgoing`, and :meth:`bidir`::

    from langchain_community.graph_vectorstores.links import Link

    print(Link.bidir(kind="location", tag="Paris"))

.. code-block:: output

    Link(kind='location', direction='bidir', tag='Paris')

Now that we know how to create links, let's associate them with some documents. These edges will strengthen the connection between documents that share a keyword when using a graph vector store to retrieve documents.
First, we'll load some text and chunk it into smaller pieces. Then we'll add a link to each document to link them all together::

    from langchain_community.document_loaders import TextLoader
    from langchain_community.graph_vectorstores.links import Link, add_links
    from langchain_text_splitters import CharacterTextSplitter

    loader = TextLoader("state_of_the_union.txt")
    raw_documents = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    documents = text_splitter.split_documents(raw_documents)

    # Give every chunk the same bidirectional link so they are all connected
    for doc in documents:
        add_links(doc, Link.bidir(kind="genre", tag="oratory"))

    print(documents[0].metadata)

.. code-block:: output

    {'source': 'state_of_the_union.txt', 'links': [Link(kind='genre', direction='bidir', tag='oratory')]}

As we can see, each document's metadata now includes a bidirectional link to the
genre ``oratory``.

The documents can then be added to a graph vector store::

    from langchain_community.graph_vectorstores import CassandraGraphVectorStore

    graph_vectorstore = CassandraGraphVectorStore.from_documents(
        documents=documents, embedding=...
    )

Helper for executing an MMR traversal query.

A hybrid vector-and-graph graph store.

Document chunks support vector-similarity search as well as edges linking chunks based on structural and semantic properties.

.. versionadded:: 0.3.1

Retriever for GraphVectorStore.

A graph vector store retriever is a retriever that uses a graph vector store to retrieve documents. It is similar to a vector store retriever, except that it uses both vector similarity and graph connections to retrieve documents. It uses the search methods implemented by a graph vector store, like traversal search and MMR traversal search, to query the texts in the graph vector store.

Example::

    store = CassandraGraphVectorStore(...)
    retriever = store.as_retriever()
    retriever.invoke("What is ...")

.. seealso::

    :mod:`How to use a graph vector store <langchain_community.graph_vectorstores>`

You can build a retriever from a graph vector store using its
:meth:`~langchain_community.graph_vectorstores.base.GraphVectorStore.as_retriever`
method.

First we instantiate a graph vector store.
We will use a store backed by Cassandra,
:class:`~langchain_community.graph_vectorstores.cassandra.CassandraGraphVectorStore`::

    from langchain_community.document_loaders import TextLoader
    from langchain_community.graph_vectorstores import CassandraGraphVectorStore
    from langchain_community.graph_vectorstores.extractors import (
        KeybertLinkExtractor,
        LinkExtractorTransformer,
    )
    from langchain_openai import OpenAIEmbeddings
    from langchain_text_splitters import CharacterTextSplitter

    loader = TextLoader("state_of_the_union.txt")
    documents = loader.load()

    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(documents)

    # Tag each chunk with keyword links before it is stored
    pipeline = LinkExtractorTransformer([KeybertLinkExtractor()])
    pipeline.transform_documents(texts)
    embeddings = OpenAIEmbeddings()
    graph_vectorstore = CassandraGraphVectorStore.from_documents(texts, embeddings)

We can then instantiate a retriever::

    retriever = graph_vectorstore.as_retriever()

This creates a retriever (specifically a ``GraphVectorStoreRetriever``), which we
can use in the usual way::

    docs = retriever.invoke("what did the president say about ketanji brown jackson?")

By default, the graph vector store retriever uses similarity search, then expands the retrieved set by following a fixed number of graph edges. If the underlying graph vector store supports maximum marginal relevance traversal, you can specify that as the search type.

MMR-traversal is a retrieval method combining MMR and graph traversal.
The strategy first retrieves the top ``fetch_k`` results by similarity to the question.
It then iteratively expands the set of fetched documents by following ``adjacent_k``
graph edges and selects the top ``k`` results based on maximum marginal relevance using
the given ``lambda_mult``::

    retriever = graph_vectorstore.as_retriever(search_type="mmr_traversal")

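
To make the role of ``lambda_mult`` concrete, here is a small pure-Python sketch of the MMR selection step alone. It is an illustration under stated assumptions, not the library's code: the graph expansion via ``adjacent_k`` is left out, the 2-D embeddings are invented, and ``cosine`` and ``mmr_select`` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mmr_select(query, candidates, k, lambda_mult):
    """Greedily pick k candidate ids, trading query similarity against redundancy."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:

        def score(doc_id):
            relevance = cosine(query, remaining[doc_id])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (cosine(remaining[doc_id], candidates[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical embeddings: "a" and "b" are near-duplicates, "c" points elsewhere.
candidates = {"a": [1.0, 0.0], "b": [0.99, 0.1], "c": [0.0, 1.0]}
query = [1.0, 0.2]

print(mmr_select(query, candidates, k=2, lambda_mult=1.0))   # ['b', 'a']
print(mmr_select(query, candidates, k=2, lambda_mult=0.25))  # ['b', 'c']
```

With ``lambda_mult=1.0`` only relevance counts, so the two near-duplicates are returned. Lowering it to ``0.25`` penalizes redundancy enough that the second pick becomes the dissimilar document.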
We can pass parameters to the underlying graph vector store's search methods using
``search_kwargs``.

Specifying graph traversal depth
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For example, we can set the graph traversal depth to only return documents reachable through a given number of graph edges::

    retriever = graph_vectorstore.as_retriever(search_kwargs={"depth": 3})

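
The effect of ``depth`` can be pictured as a breadth-first expansion that stops after a fixed number of hops. The sketch below shows that idea on a toy adjacency list. It is an assumption about the general technique rather than the store's actual traversal code, and ``expand`` and the node names are hypothetical.

```python
def expand(seeds, edges, depth):
    """Return seeds plus every node reachable within `depth` edges (breadth-first)."""
    visited = list(seeds)
    frontier = list(seeds)
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for neighbor in edges.get(node, []):
                if neighbor not in visited:
                    visited.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return visited


# Hypothetical adjacency list: a -> b -> c -> d
edges = {"a": ["b"], "b": ["c"], "c": ["d"]}

print(expand(["a"], edges, depth=0))  # ['a']
print(expand(["a"], edges, depth=2))  # ['a', 'b', 'c']
```

Here the seeds play the role of the documents found by vector similarity. Any depth above zero can add documents to that initial set, so the result may be larger than the number of seeds.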

Specifying MMR parameters
^^^^^^^^^^^^^^^^^^^^^^^^^

When using search type ``mmr_traversal``, several parameters of the MMR algorithm
can be configured.

The ``fetch_k`` parameter determines how many documents are fetched using vector
similarity, and the ``adjacent_k`` parameter determines how many documents are fetched
using graph edges.
The ``lambda_mult`` parameter controls how the MMR re-ranking weights similarity to
the query string against diversity among the retrieved documents as fetched documents
are selected for the set of ``k`` final results::

    retriever = graph_vectorstore.as_retriever(
        search_type="mmr_traversal",
        search_kwargs={"fetch_k": 20, "adjacent_k": 20, "lambda_mult": 0.25},
    )

Specifying top k
^^^^^^^^^^^^^^^^

We can also limit the number of documents ``k`` returned by the retriever.

Note that if ``depth`` is greater than zero, the retriever may return more documents
than specified by ``k``, since both the original ``k`` documents retrieved using
vector similarity and any documents connected via graph edges will be returned::

    retriever = graph_vectorstore.as_retriever(search_kwargs={"k": 1})

Similarity score threshold retrieval
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We can also set a similarity score threshold and only return documents with a score above that threshold::

    retriever = graph_vectorstore.as_retriever(search_kwargs={"score_threshold": 0.5})

Node in the GraphVectorStore.

Edges exist from nodes with an outgoing link to nodes with a matching incoming link.

For instance, two nodes ``a`` and ``b`` connected over a hyperlink ``https://some-url``
would look like:

.. code-block:: python

    [
        Node(
            id="a",
            text="some text a",
            links=[
                Link(kind="hyperlink", tag="https://some-url", direction="incoming")
            ],
        ),
        Node(
            id="b",
            text="some text b",
            links=[
                Link(kind="hyperlink", tag="https://some-url", direction="outgoing")
            ],
        ),
    ]
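
To show which way the edge in this example points, the sketch below derives edges from a list of nodes using the rule stated above. It uses small stand-in dataclasses rather than the real ``Node`` and ``Link`` classes; ``SketchNode``, ``SketchLink``, and ``edges`` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SketchLink:
    kind: str
    tag: str
    direction: str  # "incoming", "outgoing", or "bidir"


@dataclass
class SketchNode:
    id: str
    links: list = field(default_factory=list)


def tags(node, directions):
    """Collect the (kind, tag) pairs of a node's links with the given directions."""
    return {(link.kind, link.tag) for link in node.links if link.direction in directions}


def edges(nodes):
    """Return (source_id, target_id) pairs implied by matching links."""
    result = []
    for src in nodes:
        for dst in nodes:
            if src.id == dst.id:
                continue
            # An edge needs an outgoing-capable link on src and a matching
            # incoming-capable link on dst.
            if tags(src, ("outgoing", "bidir")) & tags(dst, ("incoming", "bidir")):
                result.append((src.id, dst.id))
    return result


nodes = [
    SketchNode("a", [SketchLink("hyperlink", "https://some-url", "incoming")]),
    SketchNode("b", [SketchLink("hyperlink", "https://some-url", "outgoing")]),
]
print(edges(nodes))  # [('b', 'a')]
```

Under this reading the only edge runs from ``b`` (which links out to the URL) to ``a`` (which is published at it), and there is no edge from ``a`` back to ``b``.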