WeightOnlyQuantPipeline()
Weight-only quantized model.
To use, you should have the intel-extension-for-transformers package and the
transformers package installed.
intel-extension-for-transformers:
https://github.com/intel/intel-extension-for-transformers
Example using from_model_id:
.. code-block:: python

    from langchain_community.llms import WeightOnlyQuantPipeline
    from intel_extension_for_transformers.transformers import (
        WeightOnlyQuantConfig
    )

    # Instantiate the config (default settings assumed here).
    config = WeightOnlyQuantConfig()
    hf = WeightOnlyQuantPipeline.from_model_id(
        model_id="google/flan-t5-large",
        task="text2text-generation",
        pipeline_kwargs={"max_new_tokens": 10},
        quantization_config=config,
    )
Example passing pipeline in directly:
.. code-block:: python

    from langchain_community.llms import WeightOnlyQuantPipeline
    from intel_extension_for_transformers.transformers import (
        AutoModelForSeq2SeqLM,
        WeightOnlyQuantConfig,
    )
    from transformers import AutoTokenizer, pipeline

    model_id = "google/flan-t5-large"
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Instantiate the config (default settings assumed here).
    config = WeightOnlyQuantConfig()
    model = AutoModelForSeq2SeqLM.from_pretrained(
        model_id,
        quantization_config=config,
    )

    # flan-t5 is a seq2seq model, so use the text2text-generation task.
    pipe = pipeline(
        "text2text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=10,
    )
    hf = WeightOnlyQuantPipeline(pipeline=pipe)
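
For intuition about what "weight-only" quantization means, the following is a minimal pure-Python sketch. It is not the intel-extension-for-transformers implementation. It assumes symmetric per-row int8 quantization, and the helper names (quantize_row, quantize_weights, linear) are hypothetical. Only the weights are stored as integers plus one scale per row, while the input activations stay in floating point.

```python
# Illustrative sketch only: symmetric per-row int8 weight-only quantization.
# This is NOT how intel-extension-for-transformers is implemented; all
# function names here are hypothetical.

def quantize_row(row, bits=8):
    """Map a list of float weights to signed integers plus one float scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale


def quantize_weights(weights, bits=8):
    """Quantize each row of a weight matrix independently."""
    return [quantize_row(row, bits) for row in weights]


def linear(x, qweights):
    """Compute y = W x, dequantizing W on the fly; x stays in float."""
    return [
        sum(scale * qi * xi for qi, xi in zip(q, x))
        for q, scale in qweights
    ]


weights = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
x = [1.0, 2.0, 3.0]

qweights = quantize_weights(weights)
y_quant = linear(x, qweights)
y_exact = [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# Largest absolute difference between quantized and exact outputs.
max_error = max(abs(a - b) for a, b in zip(y_quant, y_exact))
```

The quantized output approximates the exact one at a fraction of the weight storage. Real libraries typically use lower bit widths, group-wise scales, and optimized kernels, none of which are shown here.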