VLLM
Bases: BaseLLM
model: The name or path of a HuggingFace Transformers model.
tensor_parallel_size: The number of GPUs to use for distributed execution with tensor parallelism.
trust_remote_code: Trust remote code (e.g., from HuggingFace) when downloading the model and tokenizer.
n: Number of output sequences to return for the given prompt.
best_of: Number of output sequences that are generated from the prompt.
presence_penalty: Float that penalizes new tokens based on whether they already appear in the generated text.
frequency_penalty: Float that penalizes new tokens based on their frequency in the generated text so far.
temperature: Float that controls the randomness of the sampling.
top_p: Float that controls the cumulative probability of the top tokens to consider.
top_k: Integer that controls the number of top tokens to consider.
use_beam_search: Whether to use beam search instead of sampling.
stop: List of strings that stop the generation when they are generated.
ignore_eos: Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.
max_new_tokens: Maximum number of tokens to generate per output sequence.
logprobs: Number of log probabilities to return per output token.
dtype: The data type for the model weights and activations.
download_dir: Directory to download and load the weights (defaults to the HuggingFace cache directory).
vllm_kwargs: Holds any model parameters valid for the vllm.LLM call that are not explicitly specified above.
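Taken together, the sampling-related options above can be sketched as a plain config object. This is an illustrative mirror, not the class itself; the field names and defaults are assumptions based on vLLM's SamplingParams naming conventions:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class VLLMSamplingConfig:
    """Hypothetical mirror of the generation parameters described above."""
    n: int = 1                        # output sequences returned per prompt
    best_of: Optional[int] = None     # sequences generated before keeping the best n
    presence_penalty: float = 0.0     # penalize tokens already present in the output
    frequency_penalty: float = 0.0    # penalize tokens by their output frequency
    temperature: float = 1.0          # randomness of sampling
    top_p: float = 1.0                # cumulative-probability cutoff for top tokens
    top_k: int = -1                   # -1 conventionally means "consider all tokens"
    use_beam_search: bool = False     # beam search instead of sampling
    stop: Optional[List[str]] = None  # strings that halt generation when produced
    ignore_eos: bool = False          # keep generating past the EOS token
    max_new_tokens: int = 512         # cap on tokens generated per sequence
    logprobs: Optional[int] = None    # log probabilities returned per output token

# Example: a moderately creative, length-capped configuration.
cfg = VLLMSamplingConfig(temperature=0.7, top_p=0.95, max_new_tokens=128)
```

Grouping the knobs this way makes it easy to pass one object around instead of eighteen keyword arguments.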
validate_environment: Validate that the vllm Python package exists in the environment.
VLLM language model.
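To illustrate how the `stop`, `ignore_eos`, and `max_new_tokens` parameters interact during decoding, here is a toy post-processing sketch. The token loop and function name are illustrative assumptions, not vLLM's actual decoding code:

```python
from typing import List, Optional

def finish_text(tokens: List[str],
                stop: Optional[List[str]] = None,
                eos: str = "</s>",
                ignore_eos: bool = False,
                max_new_tokens: int = 16) -> str:
    """Assemble generated tokens, honoring stop strings, EOS, and the length cap."""
    out: List[str] = []
    for tok in tokens[:max_new_tokens]:       # cap the number of new tokens
        if tok == eos and not ignore_eos:     # EOS ends generation unless ignored
            break
        out.append(tok)
        text = "".join(out)
        for s in (stop or []):
            if s in text:                     # a stop string halts generation...
                return text.split(s)[0]       # ...and is trimmed from the output
    return "".join(out)
```

For example, with `stop=[" "]` the first space ends the output, and with `ignore_eos=True` an EOS token is emitted literally and generation continues to the length cap.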