forked from EngineX-Cambricon/enginex-mlu370-vllm
add qwen3
82 vllm-v0.6.2/examples/production_monitoring/Otel.md Normal file
@@ -0,0 +1,82 @@
# Setup OpenTelemetry POC

1. Install OpenTelemetry packages:
   ```
   pip install \
     'opentelemetry-sdk>=1.26.0,<1.27.0' \
     'opentelemetry-api>=1.26.0,<1.27.0' \
     'opentelemetry-exporter-otlp>=1.26.0,<1.27.0' \
     'opentelemetry-semantic-conventions-ai>=0.4.1,<0.5.0'
   ```

1. Start Jaeger in a Docker container:
   ```
   # From: https://www.jaegertracing.io/docs/1.57/getting-started/
   docker run --rm --name jaeger \
       -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
       -p 6831:6831/udp \
       -p 6832:6832/udp \
       -p 5778:5778 \
       -p 16686:16686 \
       -p 4317:4317 \
       -p 4318:4318 \
       -p 14250:14250 \
       -p 14268:14268 \
       -p 14269:14269 \
       -p 9411:9411 \
       jaegertracing/all-in-one:1.57
   ```

1. In a new shell, export the Jaeger IP:
   ```
   export JAEGER_IP=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' jaeger)
   export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://$JAEGER_IP:4317
   ```
   Then set vLLM's service name for OpenTelemetry, enable insecure connections to Jaeger, and run vLLM:
   ```
   export OTEL_SERVICE_NAME="vllm-server"
   export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
   vllm serve facebook/opt-125m --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
   ```

1. In a new shell, send requests with trace context from a dummy client:
   ```
   export JAEGER_IP=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' jaeger)
   export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://$JAEGER_IP:4317
   export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
   export OTEL_SERVICE_NAME="client-service"
   python dummy_client.py
   ```

1. Open the Jaeger web UI: http://localhost:16686/

   In the search pane, select the `vllm-server` service and hit `Find Traces`. You should get a list of traces, one for each request (a command-line check is sketched after this list).

   

1. Clicking on a trace shows its spans and their tags. In this demo, each trace has two spans: one from the dummy client containing the prompt text, and one from vLLM containing metadata about the request.

   

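If you prefer verifying from the command line, the Jaeger UI is backed by an HTTP query API on the same port. This endpoint is the UI's internal API rather than a stable public interface, so treat the following as a convenience sketch:

```
# List recent traces recorded for the vllm-server service (internal Jaeger UI API).
curl -s "http://localhost:16686/api/traces?service=vllm-server&limit=5" | python3 -m json.tool | head -n 40
```
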
## Exporter Protocol

OpenTelemetry supports either `grpc` or `http/protobuf` as the transport protocol for trace data in the exporter.
By default, `grpc` is used. To set `http/protobuf` as the protocol, configure the `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL` environment variable as follows:
```
export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://$JAEGER_IP:4318/v1/traces
vllm serve facebook/opt-125m --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
```

## Instrumentation of FastAPI

OpenTelemetry allows automatic instrumentation of FastAPI.

1. Install the instrumentation library:
   ```
   pip install opentelemetry-instrumentation-fastapi
   ```

1. Run vLLM with `opentelemetry-instrument`:
   ```
   opentelemetry-instrument vllm serve facebook/opt-125m
   ```

1. Send a request to vLLM (for example, the `curl` command after this list) and find its trace in Jaeger. It should contain spans from FastAPI.

   

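As an illustration, the plain `curl` request below is enough to produce a trace once the server is running under `opentelemetry-instrument`; the model and prompt simply mirror the dummy client above:

```
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "facebook/opt-125m", "prompt": "San Francisco is a", "max_tokens": 10}'
```
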
54 vllm-v0.6.2/examples/production_monitoring/README.md Normal file
@@ -0,0 +1,54 @@
# vLLM + Prometheus/Grafana

This is a simple example that shows you how to connect vLLM metric logging to the Prometheus/Grafana stack. For this example, we launch Prometheus and Grafana via Docker. You can check out other methods through the [Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/) websites.

Install:
- [`docker`](https://docs.docker.com/engine/install/)
- [`docker compose`](https://docs.docker.com/compose/install/linux/#install-using-the-repository)

### Launch

Prometheus metric logging is enabled by default in the OpenAI-compatible server. Launch via the entrypoint:
```bash
vllm serve mistralai/Mistral-7B-v0.1 \
    --max-model-len 2048 \
    --disable-log-requests
```

Launch Prometheus and Grafana servers with `docker compose`:
```bash
docker compose up
```

Submit some sample requests to the server:
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

python3 ../../benchmarks/benchmark_serving.py \
    --model mistralai/Mistral-7B-v0.1 \
    --tokenizer mistralai/Mistral-7B-v0.1 \
    --endpoint /v1/completions \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 3.0
```

Navigating to [`http://localhost:8000/metrics`](http://localhost:8000/metrics) will show the raw Prometheus metrics being exposed by vLLM.

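For a quick sanity check from the command line (assuming the standard `vllm:`-prefixed metric names), you can fetch the endpoint directly:

```bash
# Peek at a few of the vLLM metrics exposed for Prometheus to scrape.
curl -s http://localhost:8000/metrics | grep "^vllm:" | head
```
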
### Grafana Dashboard

Navigate to [`http://localhost:3000`](http://localhost:3000). Log in with the default username (`admin`) and password (`admin`).

#### Add Prometheus Data Source

Navigate to [`http://localhost:3000/connections/datasources/new`](http://localhost:3000/connections/datasources/new) and select Prometheus.

On the Prometheus configuration page, add the `Prometheus Server URL` under `Connection`. For this setup, Grafana and Prometheus run in separate containers, but Docker creates a DNS name for each container, so you can simply use `http://prometheus:9090`.

Click `Save & Test`. You should get a green check saying "Successfully queried the Prometheus API."

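If you would rather script this step, Grafana also exposes an HTTP API for creating data sources. The call below is a sketch using the default `admin`/`admin` credentials and the same container DNS name; adjust either if you changed them:

```bash
# Create a Prometheus data source named "prometheus" via the Grafana API.
curl -u admin:admin -X POST http://localhost:3000/api/datasources \
    -H "Content-Type: application/json" \
    -d '{"name": "prometheus", "type": "prometheus", "url": "http://prometheus:9090", "access": "proxy"}'
```
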
#### Import Dashboard

Navigate to [`http://localhost:3000/dashboard/import`](http://localhost:3000/dashboard/import), upload `grafana.json`, and select the `prometheus` datasource. You should see a screen that looks like the following:


@@ -0,0 +1,19 @@
# docker-compose.yaml
version: "3"

services:
  prometheus:
    image: prom/prometheus:latest
    extra_hosts:
      - "host.docker.internal:host-gateway" # allow a direct connection from container to the local machine
    ports:
      - "9090:9090" # the default port used by Prometheus
    volumes:
      - ${PWD}/prometheus.yaml:/etc/prometheus/prometheus.yml # mount Prometheus config file

  grafana:
    image: grafana/grafana:latest
    depends_on:
      - prometheus
    ports:
      - "3000:3000" # the default port used by Grafana
35 vllm-v0.6.2/examples/production_monitoring/dummy_client.py Normal file
@@ -0,0 +1,35 @@
import requests
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter)
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (BatchSpanProcessor,
                                            ConsoleSpanExporter)
from opentelemetry.trace import SpanKind, set_tracer_provider
from opentelemetry.trace.propagation.tracecontext import (
    TraceContextTextMapPropagator)

# Export spans both to the OTLP endpoint (Jaeger) and to the console.
trace_provider = TracerProvider()
set_tracer_provider(trace_provider)

trace_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

tracer = trace_provider.get_tracer("dummy-client")

url = "http://localhost:8000/v1/completions"
with tracer.start_as_current_span("client-span", kind=SpanKind.CLIENT) as span:
    prompt = "San Francisco is a"
    span.set_attribute("prompt", prompt)
    headers = {}
    # Inject the W3C trace context into the request headers so the server-side
    # vLLM span is linked to this client span.
    TraceContextTextMapPropagator().inject(headers)
    payload = {
        "model": "facebook/opt-125m",
        "prompt": prompt,
        "max_tokens": 10,
        "best_of": 20,
        "n": 3,
        "use_beam_search": "true",
        "temperature": 0.0,
        # "stream": True,
    }
    response = requests.post(url, headers=headers, json=payload)
1557 vllm-v0.6.2/examples/production_monitoring/grafana.json Normal file
File diff suppressed because it is too large
10 vllm-v0.6.2/examples/production_monitoring/prometheus.yaml Normal file
@@ -0,0 +1,10 @@
# prometheus.yaml
global:
  scrape_interval: 5s
  evaluation_interval: 30s

scrape_configs:
  - job_name: vllm
    static_configs:
      - targets:
          - 'host.docker.internal:8000'