Model inference on self-hosted remote hardware.
Supported hardware includes auto-launched instances on AWS, GCP, Azure, and Lambda, as well as servers specified by IP address and SSH credentials (such as on-prem, or another cloud like Paperspace, Coreweave, etc.).
To use, you should have the runhouse python package installed.
Example for custom pipeline and inference functions:
.. code-block:: python
from langchain_community.llms import SelfHostedPipeline from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline import runhouse as rh
def load_pipeline(): tokenizer = AutoTokenizer.from_pretrained("gpt2") model = AutoModelForCausalLM.from_pretrained("gpt2") return pipeline( "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=10 ) def inference_fn(pipeline, prompt, stop = None): return pipeline(prompt)[0]["generated_text"]
gpu = rh.cluster(name="rh-a10x", instance_type="A100:1") llm = SelfHostedPipeline( model_load_fn=load_pipeline, hardware=gpu, model_reqs=model_reqs, inference_fn=inference_fn )
Example for <2GB model (can be serialized and sent directly to the server): .. code-block:: python
from langchain_community.llms import SelfHostedPipeline
import runhouse as rh
gpu = rh.cluster(name="rh-a10x", instance_type="A100:1")
my_model = ...
llm = SelfHostedPipeline.from_pipeline(
pipeline=my_model,
hardware=gpu,
model_reqs=["./", "torch", "transformers"],
)
Example passing model path for larger models: .. code-block:: python
from langchain_community.llms import SelfHostedPipeline
import runhouse as rh
import pickle
from transformers import pipeline
generator = pipeline(model="gpt2")
rh.blob(pickle.dumps(generator), path="models/pipeline.pkl"
).save().to(gpu, path="models")
llm = SelfHostedPipeline.from_pipeline(
pipeline="models/pipeline.pkl",
hardware=gpu,
model_reqs=["./", "torch", "transformers"],
)