[Doc] fix docs (#1949)
This commit is contained in:
@@ -1,77 +0,0 @@
|
||||
# Choices Methods in SGLang
|
||||
This doc describes the choices methods supported by SGLang.
|
||||
|
||||
The optional `choices_method` arg determines how options supplied to SGLang's `choices` primitive are selected. Only the `RuntimeEndpoint` backend supports the `choices_method` arg. Other backends, such as `OpenAI`, have bespoke selection implementations due to API limitations.
|
||||
|
||||
## Methods
|
||||
|
||||
### Token Length Normalized
|
||||
|
||||
Token length normalized is the default SGLang choices method. It selects the option with the highest average logprob across all of its tokens.
|
||||
|
||||
Usage example (alternatively, simply omit the `choices_method` arg):
|
||||
```python
|
||||
@sgl.function
|
||||
def example(s):
|
||||
s += sgl.user("What is the capital of France?")
|
||||
s += sgl.assistant(
|
||||
sgl.gen(
|
||||
"answer",
|
||||
choices=["London", "Paris", "Berlin"],
|
||||
choices_method=sgl.token_length_normalized,
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
|
||||
This can perform poorly if an option contains many tokens, where its later tokens are predicted with high confidence based on its earlier tokens. For instance, even strong models will fail the above example if the specified options are `["Paris", "Antidisestablishmentarianism"]`.
|
||||
|
||||
### Greedy Token Selection
|
||||
|
||||
Greedy token selection simply selects the option with the highest logprob for its initial token. For overlapping options where one option is a subset of a longer option, the logprobs of the shorter option are extended using its average logprob for comparison against the longer option.
|
||||
|
||||
Usage example:
|
||||
```python
|
||||
@sgl.function
|
||||
def example(s):
|
||||
s += sgl.user("What is the capital of France?")
|
||||
s += sgl.assistant(
|
||||
sgl.gen(
|
||||
"answer",
|
||||
choices=["London", "Paris", "Berlin"],
|
||||
choices_method=sgl.greedy_token_selection,
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
This can perform poorly if an option misleads the model down a bad path based on an attractive initial token. For instance, greedy selection will result in an incorrect response for this example:
|
||||
```python
|
||||
@sgl.function
|
||||
def us_president_example(s):
|
||||
s += sgl.user("Name a US president.")
|
||||
s += sgl.assistant(
|
||||
sgl.gen(
|
||||
"answer",
|
||||
choices=["Donald Duck", "Millard Fillmore"],
|
||||
choices_method=sgl.greedy_token_selection,
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### Unconditional Likelihood Normalized
|
||||
|
||||
Unconditional likelihood normalized selects the option with the highest average token logprob once normalized by the unconditional token logprobs, as described in [this EleutherAI blogpost](https://blog.eleuther.ai/multiple-choice-normalization/). This method incurs an additional LLM call to obtain the unconditional likelihoods.
|
||||
|
||||
Usage example:
|
||||
```python
|
||||
@sgl.function
|
||||
def example(s):
|
||||
s += sgl.user("What is the capital of France?")
|
||||
s += sgl.assistant(
|
||||
sgl.gen(
|
||||
"answer",
|
||||
choices=["London", "Paris", "Berlin"],
|
||||
choices_method=sgl.unconditional_likelihood_normalized,
|
||||
)
|
||||
)
|
||||
```
|
||||
@@ -26,9 +26,9 @@ Data parallelism is better for throughput. When there is enough GPU memory, alwa
|
||||
|
||||
### Avoid out-of-memory by Tuning `--chunked-prefill-size`, `--mem-fraction-static`, `--max-running-requests`
|
||||
If you see out of memory (OOM) errors, you can try to tune the following parameters.
|
||||
If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
|
||||
If OOM happens during decoding, try to decrease `--max-running-requests`.
|
||||
You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
|
||||
- If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
|
||||
- If OOM happens during decoding, try to decrease `--max-running-requests`.
|
||||
- You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
|
||||
|
||||
### Try Advanced Options
|
||||
- To enable the experimental overlapped scheduler, add `--enable-overlap-scheduler`. It overlaps CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currenly.
|
||||
|
||||
@@ -4,9 +4,9 @@ This page lists some common errors and tips for fixing them.
|
||||
|
||||
## CUDA out of memory
|
||||
If you see out of memory (OOM) errors, you can try to tune the following parameters.
|
||||
If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
|
||||
If OOM happens during decoding, try to decrease `--max-running-requests`.
|
||||
You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
|
||||
- If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
|
||||
- If OOM happens during decoding, try to decrease `--max-running-requests`.
|
||||
- You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
|
||||
|
||||
## CUDA error: an illegal memory access was encountered
|
||||
This error may be due to kernel errors or out-of-memory issues.
|
||||
|
||||
Reference in New Issue
Block a user