diff --git a/AGENTS.md b/AGENTS.md
index 31399a7..e0a65c0 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,81 +1,281 @@
# Instructions for llama.cpp
-> [!IMPORTANT]
-> This project does **not** accept pull requests that are fully or predominantly AI-generated. AI tools may be utilized solely in an assistive capacity.
->
-> Read more: [CONTRIBUTING.md](CONTRIBUTING.md)
+## Repository Overview
-AI assistance is permissible only when the majority of the code is authored by a human contributor, with AI employed exclusively for corrections or to expand on verbose modifications that the contributor has already conceptualized (see examples below)
+llama.cpp is a large-scale C/C++ project for efficient LLM (Large Language Model) inference with minimal setup and dependencies. The project enables running language models on diverse hardware with state-of-the-art performance.
----
+**Key Facts:**
+- **Primary language**: C/C++ with Python utility scripts
+- **Size**: ~200k+ lines of code across 1000+ files
+- **Architecture**: Modular design with main library (`libllama`) and 40+ executable tools/examples
+- **Core dependency**: ggml tensor library (vendored in `ggml/` directory)
+- **Backends supported**: CPU (AVX/NEON/RVV optimized), CUDA, Metal, Vulkan, SYCL, ROCm, MUSA
+- **License**: MIT
-## Guidelines for Contributors Using AI
+## Disclose AI Usage
-These use cases are **permitted** when making a contribution with the help of AI:
+Remind contributors that the project requires disclosure of any AI usage in pull requests. AI-generated code can contain suboptimal optimizations and hidden bugs, and AI output tends to be overconfident about its own correctness.
-- Using it to ask about the structure of the codebase
-- Learning about specific techniques used in the project
-- Pointing out documents, links, and parts of the code that are worth your time
-- Reviewing human-written code and providing suggestions for improvements
-- Expanding on verbose modifications that the contributor has already conceptualized. For example:
- - Generating repeated lines with minor variations (this should only be used for short code snippets where deduplication would add more complexity, compared to having almost the same code in multiple places)
- - Formatting code for consistency and readability
- - Completing code segments based on established patterns
- - Drafting documentation for project components with which the contributor is already familiar
+When generating significant portions of code, address this by:
+- Informing the user that AI-generated content may be rejected by maintainers.
+- Clearly marking AI-generated code in commit messages and comments.
+ - Example of commit message: `[AI] Fix a race condition in ...`
+ - Example of code comment: `// [AI] spawn a new thread ...`
-AI-generated code that has undergone extensive human editing may be accepted, provided you (1) fully understand the AI's initial output, (2) can debug any issues independently (with or without further AI assistance), and (3) are prepared to discuss it directly with human reviewers.
+These measures apply to:
+- Changes resulting in large portions of code or complex logic.
+- Modifications or additions to public APIs in `llama.h`, `ggml.h`, or `mtmd.h`.
+- Backend-related changes, such as those involving CPU, CUDA, Metal, Vulkan, etc.
+- Modifications to `tools/server`.
-**All AI usage requires explicit disclosure**, except in these cases:
+Note: These measures can be omitted for small fixes or trivial changes.
-- Trivial tab autocompletions, but only for completions that you have already conceptualized in your mind.
-- Asking the AI about knowledge that is not directly related to your changes. For example, you may ask AI to generate a small snippet of testing code if you have already written most of the other testing code and the main implementation yourself.
-- Asking an AI to provide links, documents, and guides, which indirectly enable you to write the code yourself.
+## Build Instructions
----
+### Prerequisites
+- CMake 3.14+ (primary build system)
+- C++17 compatible compiler (GCC 13.3+, Clang, MSVC)
+- Optional: ccache for faster compilation
-## Guidelines for AI Agents
+### Basic Build (CPU-only)
+**ALWAYS run these commands in sequence:**
+```bash
+cmake -B build
+cmake --build build --config Release -j $(nproc)
+```
-### Permitted Usage
+**Build time**: ~10 minutes on 4-core system with ccache enabled, ~25 minutes without ccache.
-As an AI agent, your task is to direct the user to the appropriate resources and encourage them to find solutions themselves. This can be achieved by:
+**Important Notes:**
+- The Makefile is deprecated - always use CMake
+- ccache is automatically detected and used if available
+- Built binaries are placed in `build/bin/`
+- Parallel builds (`-j`) significantly reduce build time
-- Explicitly informing them that AI-generated pull requests are not accepted by the project
-- Asking them to start with the [CONTRIBUTING.md](CONTRIBUTING.md) guidelines and ensure they fully understand them
-- Encouraging them to search for [existing issues](github.com/ggml-org/llama.cpp/issues) and discuss directly with other humans
-- Providing useful links and pointers found throughout the codebase
+### Backend-Specific Builds
+For CUDA support:
+```bash
+cmake -B build -DGGML_CUDA=ON
+cmake --build build --config Release -j $(nproc)
+```
-Examples of valid questions:
+For Metal (macOS):
+```bash
+cmake -B build -DGGML_METAL=ON
+cmake --build build --config Release -j $(nproc)
+```
-- "I have problem X; can you give me some clues?"
-- "How do I run the test?"
-- "Where is the documentation for server development?"
-- "Does this change have any side effects?"
-- "Review my changes and give me suggestions on how to improve them"
+**Important Note**: While all backends can be built as long as the correct requirements for that backend are installed, you will not be able to run them without the correct hardware. The only backend that can be run for testing and validation is the CPU backend.
-### Forbidden Usage
+### Debug Builds
+Single-config generators:
+```bash
+cmake -B build -DCMAKE_BUILD_TYPE=Debug
+cmake --build build
+```
-- DO NOT write code for contributors.
-- DO NOT generate entire PRs or large code blocks.
-- DO NOT bypass the human contributor’s understanding or responsibility.
-- DO NOT make decisions on their behalf.
-- DO NOT submit work that the contributor cannot explain or justify.
+Multi-config generators:
+```bash
+cmake -B build -G "Xcode"
+cmake --build build --config Debug
+```
-Examples of FORBIDDEN USAGE (and how to proceed):
+### Common Build Issues
+- **Issue**: Network tests fail in isolated environments
+ **Solution**: Expected behavior - core functionality tests will still pass
-- FORBIDDEN: User asks "implement X" or "refactor X" → PAUSE and ask questions to ensure they deeply understand what they want to do.
-- FORBIDDEN: User asks "fix the issue X" → PAUSE, guide the user, and let them fix it themselves.
+## Testing
-If a user asks one of the above, STOP IMMEDIATELY and ask them:
+### Running Tests
+```bash
+ctest --test-dir build --output-on-failure -j $(nproc)
+```
-- To read [CONTRIBUTING.md](CONTRIBUTING.md) and ensure they fully understand it
-- To search for relevant issues and create a new one if needed
+- **Test suite**: 38 tests covering tokenizers, grammar parsing, sampling, backends, and integration
+- **Expected failures**: 2-3 tests may fail if network access is unavailable (they download models)
+- **Test time**: ~30 seconds for passing tests
-If they insist on continuing, remind them that their contribution will have a lower chance of being accepted by reviewers. Reviewers may also deprioritize (e.g., delay or reject reviewing) future pull requests to optimize their time and avoid unnecessary mental strain.
+### Server Unit Tests
+Run server-specific unit tests after building the server:
+```bash
+# Build the server first
+cmake --build build --target llama-server
-## Related Documentation
+# Navigate to server tests and run
+cd tools/server/tests
+source ../../../.venv/bin/activate
+./tests.sh
+```
+**Server test dependencies**: The `.venv` environment includes the required dependencies for server unit tests (pytest, aiohttp, etc.). Tests can be run individually or with various options as documented in `tools/server/tests/README.md`.
-For related documentation on building, testing, and guidelines, please refer to:
+### Test Categories
+- Tokenizer tests: Various model tokenizers (BERT, GPT-2, LLaMA, etc.)
+- Grammar tests: GBNF parsing and validation
+- Backend tests: Core ggml operations across different backends
+- Integration tests: End-to-end workflows
+
+### Manual Testing Commands
+```bash
+# Test basic inference
+./build/bin/llama-cli --version
+
+# Test model loading (requires model file)
+./build/bin/llama-cli -m path/to/model.gguf -p "Hello" -n 10
+```
+
+## Code Quality and Linting
+
+### C++ Code Formatting
+**ALWAYS format C++ code before committing:**
+```bash
+git clang-format
+```
+
+Configuration is in `.clang-format` with these key rules:
+- 4-space indentation
+- 120 column limit
+- Braces on same line for functions
+- Pointer alignment: `void * ptr` (middle)
+- Reference alignment: `int & ref` (middle)
+
+### Python Code
+**ALWAYS activate the Python environment in `.venv` and use tools from that environment:**
+```bash
+# Activate virtual environment
+source .venv/bin/activate
+```
+
+Configuration files:
+- `.flake8`: flake8 settings (max-line-length=125, excludes examples/tools)
+- `pyrightconfig.json`: pyright type checking configuration
+
+### Pre-commit Hooks
+Run before committing:
+```bash
+pre-commit run --all-files
+```
+
+## Continuous Integration
+
+### GitHub Actions Workflows
+Key workflows that run on every PR:
+- `.github/workflows/build.yml`: Multi-platform builds
+- `.github/workflows/server.yml`: Server functionality tests
+- `.github/workflows/python-lint.yml`: Python code quality
+- `.github/workflows/python-type-check.yml`: Python type checking
+
+### Local CI Validation
+**Run full CI locally before submitting PRs:**
+```bash
+mkdir tmp
+
+# CPU-only build
+bash ./ci/run.sh ./tmp/results ./tmp/mnt
+```
+
+**CI Runtime**: 30-60 minutes depending on backend configuration
+
+### Triggering CI
+Add `ggml-ci` to commit message to trigger heavy CI workloads on the custom CI infrastructure.
+
+## Project Layout and Architecture
+
+### Core Directories
+- **`src/`**: Main llama library implementation (`llama.cpp`, `llama-*.cpp`)
+- **`include/`**: Public API headers, primarily `include/llama.h`
+- **`ggml/`**: Core tensor library (vendored copy of the ggml framework)
+- **`examples/`**: 30+ example applications and tools
+- **`tools/`**: Additional development and utility tools (server benchmarks, tests)
+- **`tests/`**: Comprehensive test suite with CTest integration
+- **`docs/`**: Detailed documentation (build guides, API docs, etc.)
+- **`scripts/`**: Utility scripts for CI, data processing, and automation
+- **`common/`**: Shared utility code used across examples
+
+### Key Files
+- **`CMakeLists.txt`**: Primary build configuration
+- **`include/llama.h`**: Main C API header (~2000 lines)
+- **`src/llama.cpp`**: Core library implementation (~8000 lines)
+- **`CONTRIBUTING.md`**: Coding guidelines and PR requirements
+- **`.clang-format`**: C++ formatting rules
+- **`.pre-commit-config.yaml`**: Git hook configuration
+
+### Built Executables (in `build/bin/`)
+Primary tools:
+- **`llama-cli`**: Main inference tool
+- **`llama-server`**: OpenAI-compatible HTTP server
+- **`llama-quantize`**: Model quantization utility
+- **`llama-perplexity`**: Model evaluation tool
+- **`llama-bench`**: Performance benchmarking
+- **`llama-convert-llama2c-to-ggml`**: Model conversion utility
+
+### Configuration Files
+- **CMake**: `CMakeLists.txt`, `cmake/` directory
+- **Linting**: `.clang-format`, `.clang-tidy`, `.flake8`
+- **CI**: `.github/workflows/`, `ci/run.sh`
+- **Git**: `.gitignore` (includes build artifacts, models, cache)
+
+### Dependencies
+- **System**: OpenMP, libcurl (for model downloading)
+- **Optional**: CUDA SDK, Metal framework, Vulkan SDK, Intel oneAPI
+- **Bundled**: httplib, json (header-only libraries in vendored form)
+
+## Common Validation Steps
+
+### After Making Changes
+1. **Format code**: `git clang-format`
+2. **Build**: `cmake --build build --config Release`
+3. **Test**: `ctest --test-dir build --output-on-failure`
+4. **Server tests** (if modifying server): `cd tools/server/tests && source ../../../.venv/bin/activate && ./tests.sh`
+5. **Manual validation**: Test relevant tools in `build/bin/`
+
+### Performance Validation
+```bash
+# Benchmark inference performance
+./build/bin/llama-bench -m model.gguf
+
+# Evaluate model perplexity
+./build/bin/llama-perplexity -m model.gguf -f dataset.txt
+```
+
+### Backend Validation
+```bash
+# Test backend operations
+./build/bin/test-backend-ops
+```
+
+## Environment Setup
+
+### Required Tools
+- CMake 3.14+ (install via system package manager)
+- Modern C++ compiler with C++17 support
+- Git (for submodule management)
+- Python 3.9+ with virtual environment (`.venv` is provided)
+
+### Optional but Recommended
+- ccache: `apt install ccache` or `brew install ccache`
+- clang-format 15+: Usually included with LLVM/Clang installation
+- pre-commit: `pip install pre-commit`
+
+### Backend-Specific Requirements
+- **CUDA**: NVIDIA CUDA Toolkit 11.2+
+- **Metal**: Xcode command line tools (macOS only)
+- **Vulkan**: Vulkan SDK
+- **SYCL**: Intel oneAPI toolkit
+
+## Important Guidelines
+
+### Code Changes
+- **Minimal dependencies**: Avoid adding new external dependencies
+- **Cross-platform compatibility**: Test on Linux, macOS, Windows when possible
+- **Performance focus**: This is a performance-critical inference library
+- **API stability**: Changes to `include/llama.h` require careful consideration
+- **Disclose AI Usage**: Refer to the "Disclose AI Usage" section earlier in this document
+
+### Git Workflow
+- Always create feature branches from `master`
+- **Never** commit build artifacts (`build/`, `.ccache/`, `*.o`, `*.gguf`)
+- Use descriptive commit messages following project conventions
+
+### Trust These Instructions
+Only search for additional information if these instructions are incomplete or found to be incorrect. This document contains validated build and test procedures that work reliably across different environments.
-- [CONTRIBUTING.md](CONTRIBUTING.md)
-- [Build documentation](docs/build.md)
-- [Server development documentation](tools/server/README-dev.md)
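
Reviewer note: the "After Making Changes" steps added to AGENTS.md in the hunk above (format, build, test) can be chained into one command. The following is a minimal sketch, not part of the patch. It assumes it is run from the repository root with an already configured `build/` directory; `run`, `validate_changes`, and the `DRY_RUN` variable are hypothetical names introduced here for illustration. Only the three commands themselves come from the diff.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the validation steps listed in AGENTS.md.
# Nothing here is provided by llama.cpp itself.

# Run a command, or only print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Format, build, and test, stopping at the first step that fails.
validate_changes() {
    run git clang-format &&
    run cmake --build build --config Release &&
    run ctest --test-dir build --output-on-failure
}
```

Calling `DRY_RUN=1 validate_changes` prints the three commands without executing them, which is a cheap way to check the sequence before a long build.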
diff --git a/CMakeLists.txt b/CMakeLists.txt
index d24fa08..c231ec0 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -111,16 +111,11 @@ option(LLAMA_BUILD_SERVER "llama: build server example" ${LLAMA_STANDALONE})
option(LLAMA_TOOLS_INSTALL "llama: install tools" ${LLAMA_TOOLS_INSTALL_DEFAULT})
# 3rd party libs
-option(LLAMA_HTTPLIB "llama: httplib for downloading functionality" ON)
-option(LLAMA_OPENSSL "llama: use openssl to support HTTPS" ON)
+option(LLAMA_CURL "llama: use libcurl to download model from an URL" ON)
+option(LLAMA_HTTPLIB "llama: if libcurl is disabled, use httplib to download model from an URL" ON)
+option(LLAMA_OPENSSL "llama: use openssl to support HTTPS" OFF)
option(LLAMA_LLGUIDANCE "llama-common: include LLGuidance library for structured output in common utils" OFF)
-# deprecated
-option(LLAMA_CURL "llama: use libcurl to download model from an URL" OFF)
-if (LLAMA_CURL)
- message(WARNING "LLAMA_CURL option is deprecated and will be ignored")
-endif()
-
# Required for relocatable CMake package
include(${CMAKE_CURRENT_SOURCE_DIR}/cmake/build-info.cmake)
include(${CMAKE_CURRENT_SOURCE_DIR}/cmake/common.cmake)
@@ -187,9 +182,6 @@ if (NOT MSVC)
endif()
endif()
-include("cmake/license.cmake")
-license_add_file("llama.cpp" "LICENSE")
-
#
# 3rd-party
#
@@ -217,6 +209,11 @@ add_subdirectory(src)
# utils, programs, examples and tests
#
+if (NOT LLAMA_BUILD_COMMON)
+ message(STATUS "LLAMA_BUILD_COMMON is OFF, disabling LLAMA_CURL")
+ set(LLAMA_CURL OFF)
+endif()
+
if (LLAMA_BUILD_COMMON)
add_subdirectory(common)
if (LLAMA_HTTPLIB)
@@ -238,19 +235,6 @@ if (LLAMA_BUILD_COMMON AND LLAMA_BUILD_TOOLS)
add_subdirectory(tools)
endif()
-# Automatically add all files from the 'licenses' directory
-file(GLOB EXTRA_LICENSES "${CMAKE_SOURCE_DIR}/licenses/LICENSE-*")
-
-foreach(FILE_PATH ${EXTRA_LICENSES})
- get_filename_component(FILE_NAME "${FILE_PATH}" NAME)
- string(REGEX REPLACE "^LICENSE-" "" NAME "${FILE_NAME}")
- license_add_file("${NAME}" "${FILE_PATH}")
-endforeach()
-
-if (LLAMA_BUILD_COMMON)
- license_generate(common)
-endif()
-
#
# install
#
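
Reviewer note: the option changes in the CMakeLists.txt hunks above suggest a precedence for the model-download backend: libcurl when `LLAMA_CURL` is ON, httplib as the fallback, and `LLAMA_CURL` forced OFF when `LLAMA_BUILD_COMMON` is OFF. The sketch below restates that reading in bash. `resolve_download_backend` is a hypothetical helper, not part of the build system, and the "none" result for a build without `common` is an assumption, since the httplib branch is truncated in this diff.

```bash
#!/usr/bin/env bash
# Hypothetical helper: mirrors the option descriptions in the hunk above.
# Args: LLAMA_BUILD_COMMON LLAMA_CURL LLAMA_HTTPLIB (each ON or OFF).
resolve_download_backend() {
    local build_common="$1" curl="$2" httplib="$3"

    # The hunk forces LLAMA_CURL=OFF when common is not built; this sketch
    # assumes no download backend is available at all in that case.
    if [ "$build_common" != "ON" ]; then
        echo "none"
        return 0
    fi

    if [ "$curl" = "ON" ]; then
        echo "curl"        # libcurl takes precedence when enabled
    elif [ "$httplib" = "ON" ]; then
        echo "httplib"     # fallback described by the LLAMA_HTTPLIB option text
    else
        echo "none"
    fi
}
```

For example, `resolve_download_backend ON OFF ON` prints `httplib`, which corresponds to the `-DLLAMA_CURL=OFF` configurations in `build-xcframework.sh` that leave `LLAMA_HTTPLIB` at its default.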
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index c928bc3..4545ff8 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -6,45 +6,21 @@ The project differentiates between 3 levels of contributors:
- Collaborators (Triage): people with significant contributions, who may be responsible for some parts of the code, and are expected to maintain and review contributions for the code they own
- Maintainers: responsible for reviewing and merging PRs, after approval from the code owners
-# AI Usage Policy
-
-> [!IMPORTANT]
-> This project does **not** accept pull requests that are fully or predominantly AI-generated. AI tools may be utilized solely in an assistive capacity.
->
-> Detailed information regarding permissible and restricted uses of AI can be found in the [AGENTS.md](AGENTS.md) file.
-
-Code that is initially generated by AI and subsequently edited will still be considered AI-generated. AI assistance is permissible only when the majority of the code is authored by a human contributor, with AI employed exclusively for corrections or to expand on verbose modifications that the contributor has already conceptualized (e.g., generating repeated lines with minor variations).
-
-If AI is used to generate any portion of the code, contributors must adhere to the following requirements:
-
-1. Explicitly disclose the manner in which AI was employed.
-2. Perform a comprehensive manual review prior to submitting the pull request.
-3. Be prepared to explain every line of code they submitted when asked about it by a maintainer.
-4. Using AI to write pull request descriptions or to respond to human reviewers is strictly prohibited.
-
-For more info, please refer to the [AGENTS.md](AGENTS.md) file.
-
# Pull requests (for contributors & collaborators)
-Before submitting your PR:
-- Search for existing PRs to prevent duplicating efforts
- llama.cpp uses the ggml tensor library for model evaluation. If you are unfamiliar with ggml, consider taking a look at the [examples in the ggml repository](https://github.com/ggml-org/ggml/tree/master/examples/). [simple](https://github.com/ggml-org/ggml/tree/master/examples/simple) shows the bare minimum for using ggml. [gpt-2](https://github.com/ggml-org/ggml/tree/master/examples/gpt-2) has minimal implementations for language model inference using GPT-2. [mnist](https://github.com/ggml-org/ggml/tree/master/examples/mnist) demonstrates how to train and evaluate a simple image classifier
- Test your changes:
- Execute [the full CI locally on your machine](ci/README.md) before publishing
- Verify that the perplexity and the performance are not affected negatively by your changes (use `llama-perplexity` and `llama-bench`)
- If you modified the `ggml` source, run the `test-backend-ops` tool to check whether different backend implementations of the `ggml` operators produce consistent results (this requires access to at least two different `ggml` backends)
- If you modified a `ggml` operator or added a new one, add the corresponding test cases to `test-backend-ops`
-- Create separate PRs for each feature or fix:
- - Avoid combining unrelated changes in a single PR
- - For intricate features, consider opening a feature request first to discuss and align expectations
- - When adding support for a new model or feature, focus on **CPU support only** in the initial PR unless you have a good reason not to. Add support for other backends like CUDA in follow-up PRs
+- Create separate PRs for each feature or fix. Avoid combining unrelated changes in a single PR
+- When adding support for a new model or feature, focus on **CPU support only** in the initial PR unless you have a good reason not to. Add support for other backends like CUDA in follow-up PRs
- Consider allowing write access to your branch for faster reviews, as reviewers can push commits directly
-
-After submitting your PR:
-- Expect requests for modifications to ensure the code meets llama.cpp's standards for quality and long-term maintainability
-- Maintainers will rely on your insights and approval when making a final decision to approve and merge a PR
- If your PR becomes stale, rebase it on top of latest `master` to get maintainers attention
-- Consider adding yourself to [CODEOWNERS](CODEOWNERS) to indicate your availability for fixing related issues and reviewing related PRs
+- Maintainers will rely on your insights and approval when making a final decision to approve and merge a PR
+- Consider adding yourself to [CODEOWNERS](CODEOWNERS) to indicate your availability for reviewing related PRs
+- Using AI to generate PRs is permitted. However, you must (1) explicitly disclose how AI was used and (2) conduct a thorough manual review before publishing the PR. Note that trivial tab autocompletions do not require disclosure.
# Pull requests (for maintainers)
@@ -55,11 +31,6 @@ After submitting your PR:
- When merging a PR, make sure you have a good understanding of the changes
- Be mindful of maintenance: most of the work going into a feature happens after the PR is merged. If the PR author is not committed to contribute long-term, someone else needs to take responsibility (you)
-Maintainers reserve the right to decline review or close pull requests for any reason, particularly under any of the following conditions:
-- The proposed change is already mentioned in the roadmap or an existing issue, and it has been assigned to someone.
-- The pull request duplicates an existing one.
-- The contributor fails to adhere to this contributing guide.
-
# Coding guidelines
- Avoid adding third-party dependencies, extra files, extra headers, etc.
diff --git a/README.md b/README.md
index 9641db6..7e2bdbe 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,11 @@
# llama.cpp
-> Sync with upstream `ggml-org/llama.cpp` tag `b7751`
+> Sync with upstream `ggml-org/llama.cpp` tag `b7516`
## Build Docker Image
```bash
-docker buildx build --build-context hyhal=/opt/hyhal -t enginex-hygon/hygon-llama.cpp:b7751 .
+docker buildx build --build-context hyhal=/opt/hyhal -t enginex-hygon/hygon-llama.cpp:b7516 .
```

@@ -208,7 +208,6 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
*(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*
- [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
-- [BonzAI App](https://apps.apple.com/us/app/bonzai-your-local-ai-agent/id6752847988) (proprietary)
- [cztomsik/ava](https://github.com/cztomsik/ava) (MIT)
- [Dot](https://github.com/alexpinel/Dot) (GPL)
- [eva](https://github.com/ylsdamxssjxxdd/eva) (MIT)
@@ -491,6 +490,21 @@ To learn more about model quantization, [read this documentation](tools/quantize
+## [`llama-run`](tools/run)
+
+#### A comprehensive example for running `llama.cpp` models. Useful for inferencing. Used with RamaLama [^3].
+
+-
+ Run a model with a specific prompt (by default it's pulled from Ollama registry)
+
+ ```bash
+ llama-run granite-code
+ ```
+
+
+
+[^3]: [RamaLama](https://github.com/containers/ramalama)
+
## [`llama-simple`](examples/simple)
#### A minimal example for implementing apps with `llama.cpp`. Useful for developers.
@@ -594,5 +608,7 @@ $ echo "source ~/.llama-completion.bash" >> ~/.bashrc
- [stb-image](https://github.com/nothings/stb) - Single-header image format decoder, used by multimodal subsystem - Public domain
- [nlohmann/json](https://github.com/nlohmann/json) - Single-header JSON library, used by various tools/examples - MIT License
- [minja](https://github.com/google/minja) - Minimal Jinja parser in C++, used by various tools/examples - MIT License
+- [linenoise.cpp](./tools/run/linenoise.cpp/linenoise.cpp) - C++ library that provides readline-like line editing capabilities, used by `llama-run` - BSD 2-Clause License
+- [curl](https://curl.se/) - Client-side URL transfer library, used by various tools/examples - [CURL License](https://curl.se/docs/copyright.html)
- [miniaudio.h](https://github.com/mackron/miniaudio) - Single-header audio format decoder, used by multimodal subsystem - Public domain
- [subprocess.h](https://github.com/sheredom/subprocess.h) - Single-header process launching solution for C and C++ - Public domain
diff --git a/SECURITY.md b/SECURITY.md
index 9a93732..ae496f4 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -1,52 +1,12 @@
# Security Policy
- - [**Reporting a vulnerability**](#reporting-a-vulnerability)
- - [**Requirements**](#requirements)
- - [**Covered Topics**](#covered-topics)
- [**Using llama.cpp securely**](#using-llamacpp-securely)
- [Untrusted models](#untrusted-models)
- [Untrusted inputs](#untrusted-inputs)
- [Data privacy](#data-privacy)
- [Untrusted environments or networks](#untrusted-environments-or-networks)
- [Multi-Tenant environments](#multi-tenant-environments)
-
-## Reporting a vulnerability
-
-If you have discovered a security vulnerability in this project that falls inside the [covered topics](#covered-topics), please report it privately. **Do not disclose it as a public issue.** This gives us time to work with you to fix the issue before public exposure, reducing the chance that the exploit will be used before a patch is released.
-
-Please disclose it as a private [security advisory](https://github.com/ggml-org/llama.cpp/security/advisories/new).
-
-A team of volunteers on a reasonable-effort basis maintains this project. As such, please give us at least 90 days to work on a fix before public exposure.
-
-> [!IMPORTANT]
-> For collaborators: if you are interested in helping out with reviewing privting security disclosures, please see: https://github.com/ggml-org/llama.cpp/discussions/18080
-
-## Requirements
-
-Before submitting your report, ensure you meet the following requirements:
-
-- You have read this policy and fully understand it.
-- AI is only permitted in an assistive capacity as stated in [AGENTS.md](AGENTS.md). We do not accept reports that are written exclusively by AI.
-- Your report must include a working Proof-of-Concept in the form of a script and/or attached files.
-
-Maintainers reserve the right to close the report if these requirements are not fulfilled.
-
-## Covered Topics
-
-Only vulnerabilities that fall within these parts of the project are considered valid. For problems falling outside of this list, please report them as issues.
-
-- `src/**/*`
-- `ggml/**/*`
-- `gguf-py/**/*`
-- `tools/server/*`, **excluding** the following topics:
- - Web UI
- - Features marked as experimental
- - Features not recommended for use in untrusted environments (e.g., router, MCP)
- - Bugs that can lead to Denial-of-Service attack
-
-Note that none of the topics under [Using llama.cpp securely](#using-llamacpp-securely) are considered vulnerabilities in LLaMA C++.
-
-For vulnerabilities that fall within the `vendor` directory, please report them directly to the third-party project.
+ - [**Reporting a vulnerability**](#reporting-a-vulnerability)
## Using llama.cpp securely
@@ -95,3 +55,19 @@ If you intend to run multiple models in parallel with shared memory, it is your
3. Model Sharing: In a multitenant model sharing design, tenants and users must understand the security risks of running code provided by others. Since there are no reliable methods to detect malicious models, sandboxing the model execution is the recommended approach to mitigate the risk.
4. Hardware Attacks: GPUs or TPUs can also be attacked. [Researches](https://scholar.google.com/scholar?q=gpu+side+channel) has shown that side channel attacks on GPUs are possible, which can make data leak from other models or processes running on the same system at the same time.
+
+## Reporting a vulnerability
+
+Beware that none of the topics under [Using llama.cpp securely](#using-llamacpp-securely) are considered vulnerabilities of LLaMA C++.
+
+
+However, if you have discovered a security vulnerability in this project, please report it privately. **Do not disclose it as a public issue.** This gives us time to work with you to fix the issue before public exposure, reducing the chance that the exploit will be used before a patch is released.
+
+Please disclose it as a private [security advisory](https://github.com/ggml-org/llama.cpp/security/advisories/new).
+
+Please note that using AI to identify vulnerabilities and generate reports is permitted. However, you must (1) explicitly disclose how AI was used and (2) conduct a thorough manual review before submitting the report.
+
+A team of volunteers on a reasonable-effort basis maintains this project. As such, please give us at least 90 days to work on a fix before public exposure.
+
+> [!IMPORTANT]
+> For collaborators: if you are interested in helping out with reviewing private security disclosures, please see: https://github.com/ggml-org/llama.cpp/discussions/18080
diff --git a/build-xcframework.sh b/build-xcframework.sh
index 0eec871..81280f7 100755
--- a/build-xcframework.sh
+++ b/build-xcframework.sh
@@ -414,7 +414,7 @@ cmake -B build-ios-sim -G Xcode \
-DCMAKE_XCODE_ATTRIBUTE_SUPPORTED_PLATFORMS=iphonesimulator \
-DCMAKE_C_FLAGS="${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
- -DLLAMA_OPENSSL=OFF \
+ -DLLAMA_CURL=OFF \
-S .
cmake --build build-ios-sim --config Release -- -quiet
@@ -428,7 +428,7 @@ cmake -B build-ios-device -G Xcode \
-DCMAKE_XCODE_ATTRIBUTE_SUPPORTED_PLATFORMS=iphoneos \
-DCMAKE_C_FLAGS="${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
- -DLLAMA_OPENSSL=OFF \
+ -DLLAMA_CURL=OFF \
-S .
cmake --build build-ios-device --config Release -- -quiet
@@ -439,7 +439,7 @@ cmake -B build-macos -G Xcode \
-DCMAKE_OSX_ARCHITECTURES="arm64;x86_64" \
-DCMAKE_C_FLAGS="${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
- -DLLAMA_OPENSSL=OFF \
+ -DLLAMA_CURL=OFF \
-S .
cmake --build build-macos --config Release -- -quiet
@@ -453,7 +453,7 @@ cmake -B build-visionos -G Xcode \
-DCMAKE_XCODE_ATTRIBUTE_SUPPORTED_PLATFORMS=xros \
-DCMAKE_C_FLAGS="-D_XOPEN_SOURCE=700 ${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="-D_XOPEN_SOURCE=700 ${COMMON_CXX_FLAGS}" \
- -DLLAMA_OPENSSL=OFF \
+ -DLLAMA_CURL=OFF \
-DLLAMA_HTTPLIB=OFF \
-DLLAMA_BUILD_SERVER=OFF \
-S .
@@ -469,7 +469,7 @@ cmake -B build-visionos-sim -G Xcode \
-DCMAKE_XCODE_ATTRIBUTE_SUPPORTED_PLATFORMS=xrsimulator \
-DCMAKE_C_FLAGS="-D_XOPEN_SOURCE=700 ${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="-D_XOPEN_SOURCE=700 ${COMMON_CXX_FLAGS}" \
- -DLLAMA_OPENSSL=OFF \
+ -DLLAMA_CURL=OFF \
-DLLAMA_HTTPLIB=OFF \
-DLLAMA_BUILD_SERVER=OFF \
-S .
@@ -487,7 +487,7 @@ cmake -B build-tvos-sim -G Xcode \
-DCMAKE_XCODE_ATTRIBUTE_SUPPORTED_PLATFORMS=appletvsimulator \
-DCMAKE_C_FLAGS="${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
- -DLLAMA_OPENSSL=OFF \
+ -DLLAMA_CURL=OFF \
-S .
cmake --build build-tvos-sim --config Release -- -quiet
@@ -502,7 +502,7 @@ cmake -B build-tvos-device -G Xcode \
-DCMAKE_XCODE_ATTRIBUTE_SUPPORTED_PLATFORMS=appletvos \
-DCMAKE_C_FLAGS="${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
- -DLLAMA_OPENSSL=OFF \
+ -DLLAMA_CURL=OFF \
-S .
cmake --build build-tvos-device --config Release -- -quiet
diff --git a/ci/run.sh b/ci/run.sh
index 6ca6ea5..0a4a0e4 100755
--- a/ci/run.sh
+++ b/ci/run.sh
@@ -45,15 +45,14 @@ sd=`dirname $0`
cd $sd/../
SRC=`pwd`
-CMAKE_EXTRA="-DLLAMA_FATAL_WARNINGS=${LLAMA_FATAL_WARNINGS:-ON} -DLLAMA_OPENSSL=OFF -DGGML_SCHED_NO_REALLOC=ON"
+CMAKE_EXTRA="-DLLAMA_FATAL_WARNINGS=${LLAMA_FATAL_WARNINGS:-ON} -DLLAMA_CURL=ON -DGGML_SCHED_NO_REALLOC=ON"
if [ ! -z ${GG_BUILD_METAL} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_METAL=ON"
fi
if [ ! -z ${GG_BUILD_CUDA} ]; then
- # TODO: Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project
- CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_CUDA=ON -DGGML_CUDA_CUB_3DOT2=ON"
+ CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_CUDA=ON"
if command -v nvidia-smi >/dev/null 2>&1; then
CUDA_ARCH=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | tr -d '.')
@@ -105,20 +104,7 @@ if [ ! -z ${GG_BUILD_VULKAN} ]; then
fi
if [ ! -z ${GG_BUILD_WEBGPU} ]; then
- CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_WEBGPU=1 -DGGML_METAL=OFF -DGGML_BLAS=OFF"
-
- if [ ! -z "${GG_BUILD_WEBGPU_DAWN_PREFIX}" ]; then
- if [ -z "${CMAKE_PREFIX_PATH}" ]; then
- export CMAKE_PREFIX_PATH="${GG_BUILD_WEBGPU_DAWN_PREFIX}"
- else
- export CMAKE_PREFIX_PATH="${GG_BUILD_WEBGPU_DAWN_PREFIX}:${CMAKE_PREFIX_PATH}"
- fi
- fi
-
- # For some systems, Dawn_DIR needs to be set explicitly, e.g., the lib64 path
- if [ ! -z "${GG_BUILD_WEBGPU_DAWN_DIR}" ]; then
- CMAKE_EXTRA="${CMAKE_EXTRA} -DDawn_DIR=${GG_BUILD_WEBGPU_DAWN_DIR}"
- fi
+ CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_WEBGPU=1"
fi
if [ ! -z ${GG_BUILD_MUSA} ]; then
@@ -297,8 +283,7 @@ function gg_sum_test_scripts {
}
function gg_get_model {
- #local gguf_0="$MNT/models/qwen3/0.6B/ggml-model-f16.gguf"
- local gguf_0="$MNT/models/qwen3/0.6B/ggml-model-q4_0.gguf"
+ local gguf_0="$MNT/models/qwen3/0.6B/ggml-model-f16.gguf"
if [[ -s $gguf_0 ]]; then
echo -n "$gguf_0"
else
diff --git a/cmake/download-models.cmake b/cmake/download-models.cmake
deleted file mode 100644
index de25290..0000000
--- a/cmake/download-models.cmake
+++ /dev/null
@@ -1,21 +0,0 @@
-get_filename_component(DEST_DIR "${DEST}" DIRECTORY)
-file(MAKE_DIRECTORY "${DEST_DIR}")
-
-if(NOT EXISTS "${DEST}")
- message(STATUS "Downloading ${NAME} from ggml-org/models...")
-endif()
-
-file(DOWNLOAD
- "https://huggingface.co/ggml-org/models/resolve/main/${NAME}?download=true"
- "${DEST}"
- TLS_VERIFY ON
- EXPECTED_HASH ${HASH}
- STATUS status
-)
-
-list(GET status 0 code)
-
-if(NOT code EQUAL 0)
- list(GET status 1 msg)
- message(FATAL_ERROR "Failed to download ${NAME}: ${msg}")
-endif()
diff --git a/cmake/license.cmake b/cmake/license.cmake
deleted file mode 100644
index de06660..0000000
--- a/cmake/license.cmake
+++ /dev/null
@@ -1,40 +0,0 @@
-define_property(GLOBAL PROPERTY LICENSE_TEXT
- BRIEF_DOCS "Embedded licenses"
- FULL_DOCS "Global string containing all aggregated licenses"
-)
-
-function(license_add_file NAME FILE)
- if(NOT IS_ABSOLUTE "${FILE}")
- set(FILE "${CMAKE_CURRENT_SOURCE_DIR}/${FILE}")
- endif()
- if(EXISTS "${FILE}")
- set(TITLE "License for ${NAME}")
- string(REGEX REPLACE "." "=" UNDERLINE "${TITLE}")
- file(READ "${FILE}" TEXT)
- get_property(TMP GLOBAL PROPERTY LICENSE_TEXT)
- string(APPEND TMP "R\"=L=(${TITLE}\n${UNDERLINE}\n\n${TEXT})=L=\",\n")
- set_property(GLOBAL PROPERTY LICENSE_TEXT "${TMP}")
- else()
- message(WARNING "License file '${FILE}' not found")
- endif()
-endfunction()
-
-function(license_generate TARGET_NAME)
- message(STATUS "Generating embedded license file for target: ${TARGET_NAME}")
- get_property(TEXT GLOBAL PROPERTY LICENSE_TEXT)
-
- set(CPP_CONTENT "// Generated by CMake\n\n")
- string(APPEND CPP_CONTENT "const char* LICENSES[] = {\n")
- string(APPEND CPP_CONTENT "${TEXT}")
- string(APPEND CPP_CONTENT "nullptr\n")
- string(APPEND CPP_CONTENT "};\n")
-
- set(CPP_FILE "${CMAKE_BINARY_DIR}/license.cpp")
- file(WRITE "${CPP_FILE}" "${CPP_CONTENT}")
-
- if(TARGET ${TARGET_NAME})
- target_sources(${TARGET_NAME} PRIVATE "${CPP_FILE}")
- else()
- message(FATAL_ERROR "Target '${TARGET_NAME}' does not exist")
- endif()
-endfunction()
diff --git a/common/CMakeLists.txt b/common/CMakeLists.txt
index 723973e..f7b9915 100644
--- a/common/CMakeLists.txt
+++ b/common/CMakeLists.txt
@@ -60,8 +60,6 @@ add_library(${TARGET} STATIC
common.h
console.cpp
console.h
- debug.cpp
- debug.h
download.cpp
download.h
http.h
@@ -97,7 +95,17 @@ endif()
# TODO: use list(APPEND LLAMA_COMMON_EXTRA_LIBS ...)
set(LLAMA_COMMON_EXTRA_LIBS build_info)
-if (LLAMA_HTTPLIB)
+if (LLAMA_CURL)
+ # Use curl to download model url
+ find_package(CURL)
+ if (NOT CURL_FOUND)
+ message(FATAL_ERROR "Could NOT find CURL. Hint: to disable this feature, set -DLLAMA_CURL=OFF")
+ endif()
+ target_compile_definitions(${TARGET} PUBLIC LLAMA_USE_CURL)
+ include_directories(${CURL_INCLUDE_DIRS})
+ set(LLAMA_COMMON_EXTRA_LIBS ${LLAMA_COMMON_EXTRA_LIBS} ${CURL_LIBRARIES})
+elseif (LLAMA_HTTPLIB)
+ # otherwise, use cpp-httplib
target_compile_definitions(${TARGET} PUBLIC LLAMA_USE_HTTPLIB)
set(LLAMA_COMMON_EXTRA_LIBS ${LLAMA_COMMON_EXTRA_LIBS} cpp-httplib)
endif()
@@ -147,3 +155,27 @@ if (LLAMA_LLGUIDANCE)
endif ()
target_link_libraries(${TARGET} PRIVATE ${LLAMA_COMMON_EXTRA_LIBS} PUBLIC llama Threads::Threads)
+
+
+#
+# copy the license files
+#
+
+# Check if running in GitHub Actions
+if (DEFINED ENV{GITHUB_ACTIONS} AND "$ENV{GITHUB_ACTIONS}" STREQUAL "true")
+ message(STATUS "Running inside GitHub Actions - copying license files")
+
+ # Copy all files from licenses/ to build/bin/
+ file(GLOB LICENSE_FILES "${CMAKE_SOURCE_DIR}/licenses/*")
+ foreach(LICENSE_FILE ${LICENSE_FILES})
+ get_filename_component(FILENAME ${LICENSE_FILE} NAME)
+ add_custom_command(
+ POST_BUILD
+ TARGET ${TARGET}
+ COMMAND ${CMAKE_COMMAND} -E copy_if_different
+ "${LICENSE_FILE}"
+ "$/${FILENAME}"
+ COMMENT "Copying ${FILENAME} to ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}")
+ message(STATUS "Copying ${LICENSE_FILE} to ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/${FILENAME}")
+ endforeach()
+endif()
diff --git a/common/arg.cpp b/common/arg.cpp
index 163c9b7..1302065 100644
--- a/common/arg.cpp
+++ b/common/arg.cpp
@@ -2,11 +2,10 @@
#include "chat.h"
#include "common.h"
-#include "download.h"
#include "json-schema-to-grammar.h"
#include "log.h"
#include "sampling.h"
-#include "preset.h"
+#include "download.h"
// fix problem with std::min and std::max
#if defined(_WIN32)
@@ -48,8 +47,6 @@
#define LLAMA_MAX_URL_LENGTH 2084 // Maximum URL Length in Chrome: 2083
-extern const char * LICENSES[];
-
using json = nlohmann::ordered_json;
using namespace common_arg_utils;
@@ -271,55 +268,6 @@ static void parse_tensor_buffer_overrides(const std::string & value, std::vector
}
}
-static std::string clean_file_name(const std::string & fname) {
- std::string clean_fname = fname;
- string_replace_all(clean_fname, "\\", "_");
- string_replace_all(clean_fname, "/", "_");
- return clean_fname;
-}
-
-static bool common_params_handle_remote_preset(common_params & params, llama_example ex) {
- GGML_ASSERT(!params.model.hf_repo.empty());
-
- // the returned hf_repo is without tag
- auto [hf_repo, hf_tag] = common_download_split_repo_tag(params.model.hf_repo);
-
- // "latest" tag (default if not specified) is translated to "default" preset
- if (hf_tag == "latest") {
- hf_tag = "default";
- }
-
- const bool offline = params.offline;
- std::string model_endpoint = get_model_endpoint();
- auto preset_url = model_endpoint + hf_repo + "/resolve/main/preset.ini";
-
- // prepare local path for caching
- auto preset_fname = clean_file_name(hf_repo + "_preset.ini");
- auto preset_path = fs_get_cache_file(preset_fname);
- const int status = common_download_file_single(preset_url, preset_path, params.hf_token, offline);
- const bool has_preset = status >= 200 && status < 400;
-
- // remote preset is optional, so we don't error out if not found
- if (has_preset) {
- LOG_INF("applying remote preset from %s\n", preset_url.c_str());
- common_preset_context ctx(ex, /* only_remote_allowed */ true);
- common_preset global;
- auto remote_presets = ctx.load_from_ini(preset_path, global);
- remote_presets = ctx.cascade(global, remote_presets);
- if (remote_presets.find(hf_tag) != remote_presets.end()) {
- common_preset preset = remote_presets.at(hf_tag);
- LOG_INF("\n%s", preset.to_ini().c_str()); // to_ini already added trailing newline
- preset.apply_to_params(params);
- } else {
- throw std::runtime_error("Remote preset.ini does not contain [" + std::string(hf_tag) + "] section");
- }
- } else {
- LOG_INF("%s", "no remote preset found, skipping\n");
- }
-
- return has_preset;
-}
-
struct handle_model_result {
bool found_mmproj = false;
common_params_model mmproj;
@@ -341,7 +289,7 @@ static handle_model_result common_params_handle_model(
if (model.path.empty()) {
auto auto_detected = common_get_hf_file(model.hf_repo, bearer_token, offline);
if (auto_detected.repo.empty() || auto_detected.ggufFile.empty()) {
- exit(1); // error message already printed
+ exit(1); // built without CURL, error message already printed
}
model.name = model.hf_repo; // repo name with tag
model.hf_repo = auto_detected.repo; // repo name without tag
@@ -361,7 +309,9 @@ static handle_model_result common_params_handle_model(
// make sure model path is present (for caching purposes)
if (model.path.empty()) {
// this is to avoid different repo having same file name, or same file name in different subdirs
- std::string filename = clean_file_name(model.hf_repo + "_" + model.hf_file);
+ std::string filename = model.hf_repo + "_" + model.hf_file;
+ // to make sure we don't have any slashes in the filename
+ string_replace_all(filename, "/", "_");
model.path = fs_get_cache_file(filename);
}
@@ -475,87 +425,61 @@ static bool common_params_parse_ex(int argc, char ** argv, common_params_context
}
};
- auto parse_cli_args = [&]() {
- std::set<std::string> seen_args;
+ std::set<std::string> seen_args;
- for (int i = 1; i < argc; i++) {
- const std::string arg_prefix = "--";
+ for (int i = 1; i < argc; i++) {
+ const std::string arg_prefix = "--";
- std::string arg = argv[i];
- if (arg.compare(0, arg_prefix.size(), arg_prefix) == 0) {
- std::replace(arg.begin(), arg.end(), '_', '-');
- }
- if (arg_to_options.find(arg) == arg_to_options.end()) {
- throw std::invalid_argument(string_format("error: invalid argument: %s", arg.c_str()));
- }
- if (!seen_args.insert(arg).second) {
- LOG_WRN("DEPRECATED: argument '%s' specified multiple times, use comma-separated values instead (only last value will be used)\n", arg.c_str());
- }
- auto & tmp = arg_to_options[arg];
- auto opt = *tmp.first;
- bool is_positive = tmp.second;
- if (opt.has_value_from_env()) {
- fprintf(stderr, "warn: %s environment variable is set, but will be overwritten by command line argument %s\n", opt.env, arg.c_str());
- }
- try {
- if (opt.handler_void) {
- opt.handler_void(params);
- continue;
- }
- if (opt.handler_bool) {
- opt.handler_bool(params, is_positive);
- continue;
- }
-
- // arg with single value
- check_arg(i);
- std::string val = argv[++i];
- if (opt.handler_int) {
- opt.handler_int(params, std::stoi(val));
- continue;
- }
- if (opt.handler_string) {
- opt.handler_string(params, val);
- continue;
- }
-
- // arg with 2 values
- check_arg(i);
- std::string val2 = argv[++i];
- if (opt.handler_str_str) {
- opt.handler_str_str(params, val, val2);
- continue;
- }
- } catch (std::exception & e) {
- throw std::invalid_argument(string_format(
- "error while handling argument \"%s\": %s\n\n"
- "usage:\n%s\n\nto show complete usage, run with -h",
- arg.c_str(), e.what(), opt.to_string().c_str()));
- }
+ std::string arg = argv[i];
+ if (arg.compare(0, arg_prefix.size(), arg_prefix) == 0) {
+ std::replace(arg.begin(), arg.end(), '_', '-');
}
- };
-
- // parse the first time to get -hf option (used for remote preset)
- parse_cli_args();
-
- // maybe handle remote preset
- if (!params.model.hf_repo.empty()) {
- std::string cli_hf_repo = params.model.hf_repo;
- bool has_preset = common_params_handle_remote_preset(params, ctx_arg.ex);
-
- // special case: if hf_repo explicitly set by preset, we need to preserve it (ignore CLI value)
- // this is useful when we have one HF repo pointing to other HF repos (one model - multiple GGUFs)
- std::string preset_hf_repo = params.model.hf_repo;
- bool preset_has_hf_repo = preset_hf_repo != cli_hf_repo;
-
- if (has_preset) {
- // re-parse CLI args to override preset values
- parse_cli_args();
+ if (arg_to_options.find(arg) == arg_to_options.end()) {
+ throw std::invalid_argument(string_format("error: invalid argument: %s", arg.c_str()));
}
+ if (!seen_args.insert(arg).second) {
+ LOG_WRN("DEPRECATED: argument '%s' specified multiple times, use comma-separated values instead (only last value will be used)\n", arg.c_str());
+ }
+ auto & tmp = arg_to_options[arg];
+ auto opt = *tmp.first;
+ bool is_positive = tmp.second;
+ if (opt.has_value_from_env()) {
+ fprintf(stderr, "warn: %s environment variable is set, but will be overwritten by command line argument %s\n", opt.env, arg.c_str());
+ }
+ try {
+ if (opt.handler_void) {
+ opt.handler_void(params);
+ continue;
+ }
+ if (opt.handler_bool) {
+ opt.handler_bool(params, is_positive);
+ continue;
+ }
- // preserve hf_repo from preset if needed
- if (preset_has_hf_repo) {
- params.model.hf_repo = preset_hf_repo;
+ // arg with single value
+ check_arg(i);
+ std::string val = argv[++i];
+ if (opt.handler_int) {
+ opt.handler_int(params, std::stoi(val));
+ continue;
+ }
+ if (opt.handler_string) {
+ opt.handler_string(params, val);
+ continue;
+ }
+
+ // arg with 2 values
+ check_arg(i);
+ std::string val2 = argv[++i];
+ if (opt.handler_str_str) {
+ opt.handler_str_str(params, val, val2);
+ continue;
+ }
+ } catch (std::exception & e) {
+ throw std::invalid_argument(string_format(
+ "error while handling argument \"%s\": %s\n\n"
+ "usage:\n%s\n\nto show complete usage, run with -h",
+ arg.c_str(), e.what(), opt.to_string().c_str()));
}
}
@@ -755,6 +679,7 @@ static void common_params_print_completion(common_params_context & ctx_arg) {
"llama-quantize",
"llama-qwen2vl-cli",
"llama-retrieval",
+ "llama-run",
"llama-save-load-state",
"llama-server",
"llama-simple",
@@ -929,54 +854,6 @@ bool common_arg_utils::is_autoy(const std::string & value) {
return value == "auto" || value == "-1";
}
-// Simple CSV parser that handles quoted fields and escaped quotes
-// example:
-// input: value1,"value, with, commas","value with ""escaped"" quotes",value4
-// output: [value1] [value, with, commas] [value with "escaped" quotes] [value4]
-static std::vector<std::string> parse_csv_row(const std::string& input) {
- std::vector<std::string> fields;
- std::string field;
- bool in_quotes = false;
-
- for (size_t i = 0; i < input.length(); ++i) {
- char ch = input[i];
-
- if (ch == '"') {
- if (!in_quotes) {
- // start of quoted field (only valid if at beginning of field)
- if (!field.empty()) {
- // quote appeared in middle of unquoted field, treat as literal
- field += '"';
- } else {
- in_quotes = true; // start
- }
- } else {
- if (i + 1 < input.length() && input[i + 1] == '"') {
- // escaped quote: ""
- field += '"';
- ++i; // skip the next quote
- } else {
- in_quotes = false; // end
- }
- }
- } else if (ch == ',') {
- if (in_quotes) {
- field += ',';
- } else {
- fields.push_back(std::move(field));
- field.clear();
- }
- } else {
- field += ch;
- }
- }
-
- // Add the last field
- fields.push_back(std::move(field));
-
- return fields;
-}
-
common_params_context common_params_parser_init(common_params & params, llama_example ex, void(*print_usage)(int, char **)) {
// per-example default params
// we define here to make sure it's included in llama-gen-docs
@@ -1041,16 +918,6 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
exit(0);
}
));
- add_opt(common_arg(
- {"--license"},
- "show source code license and dependencies",
- [](common_params &) {
- for (int i = 0; LICENSES[i]; ++i) {
- printf("%s\n", LICENSES[i]);
- }
- exit(0);
- }
- ));
add_opt(common_arg(
{"-cl", "--cache-list"},
"show list of models in cache",
@@ -1295,7 +1162,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params) {
params.kv_unified = true;
}
- ).set_env("LLAMA_ARG_KV_UNIFIED").set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_PERPLEXITY, LLAMA_EXAMPLE_BATCHED}));
+ ).set_env("LLAMA_ARG_KV_UNIFIED").set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_PERPLEXITY}));
add_opt(common_arg(
{"--context-shift"},
{"--no-context-shift"},
@@ -1383,7 +1250,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
{"--in-file"}, "FNAME",
"an input file (use comma-separated values to specify multiple files)",
[](common_params & params, const std::string & value) {
- for (const auto & item : parse_csv_row(value)) {
+ for (const auto & item : string_split(value, ',')) {
std::ifstream file(item);
if (!file) {
throw std::runtime_error(string_format("error: failed to open file '%s'\n", item.c_str()));
@@ -1530,7 +1397,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params, bool value) {
params.warmup = value;
}
- ).set_examples({LLAMA_EXAMPLE_COMPLETION, LLAMA_EXAMPLE_CLI, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_MTMD, LLAMA_EXAMPLE_EMBEDDING, LLAMA_EXAMPLE_RETRIEVAL, LLAMA_EXAMPLE_PERPLEXITY, LLAMA_EXAMPLE_DEBUG}));
+ ).set_examples({LLAMA_EXAMPLE_COMPLETION, LLAMA_EXAMPLE_CLI, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_MTMD, LLAMA_EXAMPLE_EMBEDDING, LLAMA_EXAMPLE_RETRIEVAL, LLAMA_EXAMPLE_PERPLEXITY}));
add_opt(common_arg(
{"--spm-infill"},
string_format(
@@ -1729,26 +1596,6 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
}
).set_sparam());
- add_opt(common_arg(
- {"--adaptive-target"}, "N",
- string_format("adaptive-p: select tokens near this probability (valid range 0.0 "
- "to 1.0; negative = disabled) (default: %.2f)\n"
- "[(more info)](https://github.com/ggml-org/llama.cpp/pull/17927)",
- (double)params.sampling.adaptive_target),
- [](common_params & params, const std::string & value) {
- params.sampling.adaptive_target = std::stof(value);
- }
- ).set_sparam());
- add_opt(common_arg(
- {"--adaptive-decay"}, "N",
- string_format("adaptive-p: decay rate for target adaptation over time. lower values "
- "are more reactive, higher values are more stable.\n"
- "(valid range 0.0 to 0.99) (default: %.2f)",
- (double)params.sampling.adaptive_decay),
- [](common_params & params, const std::string & value) {
- params.sampling.adaptive_decay = std::stof(value);
- }
- ).set_sparam());
add_opt(common_arg(
{"--dynatemp-range"}, "N",
string_format("dynamic temperature range (default: %.1f, 0.0 = disabled)", (double)params.sampling.dynatemp_range),
@@ -1848,13 +1695,6 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.sampling.grammar = json_schema_to_grammar(json::parse(schema));
}
).set_sparam());
- add_opt(common_arg(
- {"-bs", "--backend-sampling"},
- "enable backend sampling (experimental) (default: disabled)",
- [](common_params & params) {
- params.sampling.backend_sampling = true;
- }
- ).set_sparam().set_env("LLAMA_ARG_BACKEND_SAMPLING"));
add_opt(common_arg(
{"--pooling"}, "{none,mean,cls,last,rank}",
"pooling type for embeddings, use model default if unspecified",
@@ -1866,7 +1706,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
else if (value == "rank") { params.pooling_type = LLAMA_POOLING_TYPE_RANK; }
else { throw std::invalid_argument("invalid value"); }
}
- ).set_examples({LLAMA_EXAMPLE_EMBEDDING, LLAMA_EXAMPLE_RETRIEVAL, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_DEBUG}).set_env("LLAMA_ARG_POOLING"));
+ ).set_examples({LLAMA_EXAMPLE_EMBEDDING, LLAMA_EXAMPLE_RETRIEVAL, LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_POOLING"));
add_opt(common_arg(
{"--attention"}, "{causal,non-causal}",
"attention type for embeddings, use model default if unspecified",
@@ -2155,7 +1995,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
{"--image", "--audio"}, "FILE",
"path to an image or audio file. use with multimodal models, use comma-separated values for multiple files\n",
[](common_params & params, const std::string & value) {
- for (const auto & item : parse_csv_row(value)) {
+ for (const auto & item : string_split(value, ',')) {
params.image.emplace_back(item);
}
}
@@ -2177,7 +2017,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
if (llama_supports_rpc()) {
add_opt(common_arg(
{"--rpc"}, "SERVERS",
- "comma separated list of RPC servers (host:port)",
+ "comma separated list of RPC servers",
[](common_params & params, const std::string & value) {
add_rpc_devices(value);
GGML_UNUSED(params);
@@ -2194,22 +2034,11 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
add_opt(common_arg(
{"--mmap"},
{"--no-mmap"},
- string_format("whether to memory-map model. Explicitly enabling mmap disables direct-io. (if mmap disabled, slower load but may reduce pageouts if not using mlock) (default: %s)", params.use_mmap ? "enabled" : "disabled"),
+ string_format("whether to memory-map model (if disabled, slower load but may reduce pageouts if not using mlock) (default: %s)", params.use_mmap ? "enabled" : "disabled"),
[](common_params & params, bool value) {
params.use_mmap = value;
- if (value) {
- params.use_direct_io = false; // disable direct io when mmap is explicitly enabled
- }
}
).set_env("LLAMA_ARG_MMAP"));
- add_opt(common_arg(
- {"-dio", "--direct-io"},
- {"-ndio", "--no-direct-io"},
- string_format("use DirectIO if available. Takes precedence over --mmap (default: %s)", params.use_direct_io ? "enabled" : "disabled"),
- [](common_params & params, bool value) {
- params.use_direct_io = value;
- }
- ).set_env("LLAMA_ARG_DIO"));
add_opt(common_arg(
{"--numa"}, "TYPE",
"attempt optimizations that help on some NUMA systems\n"
@@ -2258,7 +2087,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
"override tensor buffer type", [](common_params & params, const std::string & value) {
parse_tensor_buffer_overrides(value, params.tensor_buft_overrides);
}
- ).set_env("LLAMA_ARG_OVERRIDE_TENSOR"));
+ ));
add_opt(common_arg(
{"-otd", "--override-tensor-draft"}, "=,...",
"override tensor buffer type for draft model", [](common_params & params, const std::string & value) {
@@ -2308,18 +2137,11 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
}
).set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}).set_env("LLAMA_ARG_N_CPU_MOE_DRAFT"));
- GGML_ASSERT(params.n_gpu_layers < 0); // string_format would need to be extended for a default >= 0
add_opt(common_arg(
{"-ngl", "--gpu-layers", "--n-gpu-layers"}, "N",
- string_format("max. number of layers to store in VRAM, either an exact number, 'auto', or 'all' (default: %s)", params.n_gpu_layers == -1 ? "auto" : "all"),
- [](common_params & params, const std::string & value) {
- if (value == "auto") {
- params.n_gpu_layers = -1;
- } else if (value == "all") {
- params.n_gpu_layers = -2;
- } else {
- params.n_gpu_layers = std::stoi(value);
- }
+ string_format("max. number of layers to store in VRAM (default: %d)", params.n_gpu_layers),
+ [](common_params & params, int value) {
+ params.n_gpu_layers = value;
if (!llama_supports_gpu_offload()) {
fprintf(stderr, "warning: no usable GPU found, --gpu-layers option will be ignored\n");
fprintf(stderr, "warning: one possible reason is that llama.cpp was compiled without GPU support\n");
@@ -2361,7 +2183,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
std::vector<std::string> split_arg{ it, {} };
if (split_arg.size() >= llama_max_devices()) {
throw std::invalid_argument(
- string_format("got %zu input configs, but system only has %zu devices", split_arg.size(), llama_max_devices())
+ string_format("got %d input configs, but system only has %d devices", (int)split_arg.size(), (int)llama_max_devices())
);
}
for (size_t i = 0; i < llama_max_devices(); ++i) {
@@ -2401,28 +2223,10 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
).set_env("LLAMA_ARG_FIT"));
add_opt(common_arg(
- { "-fitt", "--fit-target" }, "MiB0,MiB1,MiB2,...",
- string_format("target margin per device for --fit, comma-separated list of values, "
- "single value is broadcast across all devices, default: %zu", params.fit_params_target[0]/(1024*1024)),
- [](common_params & params, const std::string & value) {
- std::string arg_next = value;
-
- // split string by , and /
- const std::regex regex{ R"([,/]+)" };
- std::sregex_token_iterator it{ arg_next.begin(), arg_next.end(), regex, -1 };
- std::vector<std::string> split_arg{ it, {} };
- if (split_arg.size() >= llama_max_devices()) {
- throw std::invalid_argument(
- string_format("got %zu input configs, but system only has %zu devices", split_arg.size(), llama_max_devices())
- );
- }
- if (split_arg.size() == 1) {
- std::fill(params.fit_params_target.begin(), params.fit_params_target.end(), std::stoul(split_arg[0]) * 1024*1024);
- return;
- }
- for (size_t i = 0; i < split_arg.size(); i++) {
- params.fit_params_target[i] = std::stoul(split_arg[i]) * 1024*1024;
- }
+ { "-fitt", "--fit-target" }, "MiB",
+ string_format("target margin per device for --fit option, default: %zu", params.fit_params_target/(1024*1024)),
+ [](common_params & params, int value) {
+ params.fit_params_target = value * size_t(1024*1024);
}
).set_env("LLAMA_ARG_FIT_TARGET"));
add_opt(common_arg(
@@ -2441,12 +2245,37 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
));
add_opt(common_arg(
{"--override-kv"}, "KEY=TYPE:VALUE,...",
- "advanced option to override model metadata by key. to specify multiple overrides, either use comma-separated values.\n"
+ "advanced option to override model metadata by key. to specify multiple overrides, either use comma-separated or repeat this argument.\n"
"types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false,tokenizer.ggml.add_eos_token=bool:false",
[](common_params & params, const std::string & value) {
- for (const auto & item : parse_csv_row(value)) {
- if (!string_parse_kv_override(item.c_str(), params.kv_overrides)) {
- throw std::runtime_error(string_format("error: Invalid type for KV override: %s\n", item.c_str()));
+ std::vector<std::string> kv_overrides;
+
+ std::string current;
+ bool escaping = false;
+
+ for (const char c : value) {
+ if (escaping) {
+ current.push_back(c);
+ escaping = false;
+ } else if (c == '\\') {
+ escaping = true;
+ } else if (c == ',') {
+ kv_overrides.push_back(current);
+ current.clear();
+ } else {
+ current.push_back(c);
+ }
+ }
+
+ if (escaping) {
+ current.push_back('\\');
+ }
+
+ kv_overrides.push_back(current);
+
+ for (const auto & kv_override : kv_overrides) {
+ if (!string_parse_kv_override(kv_override.c_str(), params.kv_overrides)) {
+ throw std::runtime_error(string_format("error: Invalid type for KV override: %s\n", kv_override.c_str()));
}
}
}
@@ -2463,7 +2292,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
{"--lora"}, "FNAME",
"path to LoRA adapter (use comma-separated values to load multiple adapters)",
[](common_params & params, const std::string & value) {
- for (const auto & item : parse_csv_row(value)) {
+ for (const auto & item : string_split(value, ',')) {
params.lora_adapters.push_back({ item, 1.0, "", "", nullptr });
}
}
@@ -2474,7 +2303,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
"path to LoRA adapter with user defined scaling (format: FNAME:SCALE,...)\n"
"note: use comma-separated values",
[](common_params & params, const std::string & value) {
- for (const auto & item : parse_csv_row(value)) {
+ for (const auto & item : string_split(value, ',')) {
auto parts = string_split(item, ':');
if (parts.size() != 2) {
throw std::invalid_argument("lora-scaled format: FNAME:SCALE");
@@ -2488,7 +2317,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
{"--control-vector"}, "FNAME",
"add a control vector\nnote: use comma-separated values to add multiple control vectors",
[](common_params & params, const std::string & value) {
- for (const auto & item : parse_csv_row(value)) {
+ for (const auto & item : string_split(value, ',')) {
params.control_vectors.push_back({ 1.0f, item, });
}
}
@@ -2498,7 +2327,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
"add a control vector with user defined scaling SCALE\n"
"note: use comma-separated values (format: FNAME:SCALE,...)",
[](common_params & params, const std::string & value) {
- for (const auto & item : parse_csv_row(value)) {
+ for (const auto & item : string_split(value, ',')) {
auto parts = string_split(item, ':');
if (parts.size() != 2) {
throw std::invalid_argument("control-vector-scaled format: FNAME:SCALE");
@@ -2596,7 +2425,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
{"--context-file"}, "FNAME",
"file to load context from (use comma-separated values to specify multiple files)",
[](common_params & params, const std::string & value) {
- for (const auto & item : parse_csv_row(value)) {
+ for (const auto & item : string_split(value, ',')) {
std::ifstream file(item, std::ios::binary);
if (!file) {
throw std::runtime_error(string_format("error: failed to open file '%s'\n", item.c_str()));
@@ -2743,7 +2572,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params, int value) {
params.embd_normalize = value;
}
- ).set_examples({LLAMA_EXAMPLE_EMBEDDING, LLAMA_EXAMPLE_DEBUG}));
+ ).set_examples({LLAMA_EXAMPLE_EMBEDDING}));
add_opt(common_arg(
{"--embd-output-format"}, "FORMAT",
"empty = default, \"array\" = [[],[]...], \"json\" = openai style, \"json+\" = same \"json\" + cosine similarity matrix, \"raw\" = plain whitespace-delimited output (one embedding per line)",
@@ -2821,7 +2650,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params) {
params.embedding = true;
}
- ).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_DEBUG}).set_env("LLAMA_ARG_EMBEDDINGS"));
+ ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_EMBEDDINGS"));
add_opt(common_arg(
{"--rerank", "--reranking"},
string_format("enable reranking endpoint on server (default: %s)", "disabled"),
@@ -2832,13 +2661,9 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_RERANKING"));
add_opt(common_arg(
{"--api-key"}, "KEY",
- "API key to use for authentication, multiple keys can be provided as a comma-separated list (default: none)",
+ "API key to use for authentication (default: none)",
[](common_params & params, const std::string & value) {
- for (const auto & key : parse_csv_row(value)) {
- if (!key.empty()) {
- params.api_keys.push_back(key);
- }
- }
+ params.api_keys.push_back(value);
}
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_API_KEY"));
add_opt(common_arg(
@@ -2852,7 +2677,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
std::string key;
while (std::getline(key_file, key)) {
if (!key.empty()) {
- params.api_keys.push_back(key);
+ params.api_keys.push_back(key);
}
}
key_file.close();
@@ -2874,7 +2699,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_SSL_CERT_FILE"));
add_opt(common_arg(
{"--chat-template-kwargs"}, "STRING",
- "sets additional params for the json template parser, must be a valid json object string, e.g. '{\"key1\":\"value1\",\"key2\":\"value2\"}'",
+ string_format("sets additional params for the json template parser"),
[](common_params & params, const std::string & value) {
auto parsed = json::parse(value);
for (const auto & item : parsed.items()) {
@@ -2897,18 +2722,10 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.n_threads_http = value;
}
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_THREADS_HTTP"));
- add_opt(common_arg(
- {"--cache-prompt"},
- {"--no-cache-prompt"},
- string_format("whether to enable prompt caching (default: %s)", params.cache_prompt ? "enabled" : "disabled"),
- [](common_params & params, bool value) {
- params.cache_prompt = value;
- }
- ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_CACHE_PROMPT"));
add_opt(common_arg(
{"--cache-reuse"}, "N",
string_format(
- "min chunk size to attempt reusing from the cache via KV shifting, requires prompt caching to be enabled (default: %d)\n"
+ "min chunk size to attempt reusing from the cache via KV shifting (default: %d)\n"
"[(card)](https://ggml.ai/f0.png)", params.n_cache_reuse
),
[](common_params & params, int value) {
@@ -3358,19 +3175,11 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.speculative.devices = parse_device_list(value);
}
).set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}));
- GGML_ASSERT(params.speculative.n_gpu_layers < 0); // string_format would need to be extended for a default >= 0
add_opt(common_arg(
{"-ngld", "--gpu-layers-draft", "--n-gpu-layers-draft"}, "N",
- string_format("max. number of draft model layers to store in VRAM, either an exact number, 'auto', or 'all' (default: %s)",
- params.speculative.n_gpu_layers == -1 ? "auto" : "all"),
- [](common_params & params, const std::string & value) {
- if (value == "auto") {
- params.speculative.n_gpu_layers = -1;
- } else if (value == "all") {
- params.speculative.n_gpu_layers = -2;
- } else {
- params.speculative.n_gpu_layers = std::stoi(value);
- }
+ "number of layers to store in VRAM for the draft model",
+ [](common_params & params, int value) {
+ params.speculative.n_gpu_layers = value;
if (!llama_supports_gpu_offload()) {
fprintf(stderr, "warning: no usable GPU found, --gpu-layers-draft option will be ignored\n");
fprintf(stderr, "warning: one possible reason is that llama.cpp was compiled without GPU support\n");
@@ -3520,27 +3329,6 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
}
).set_examples({ LLAMA_EXAMPLE_FINETUNE }));
- add_opt(common_arg(
- {"--save-logits"},
- string_format("save final logits to files for verification (default: %s)", params.save_logits ? "true" : "false"),
- [](common_params & params) {
- params.save_logits = true;
- }
- ).set_examples({LLAMA_EXAMPLE_DEBUG}));
- add_opt(common_arg(
- {"--logits-output-dir"}, "PATH",
- string_format("directory for saving logits output files (default: %s)", params.logits_output_dir.c_str()),
- [](common_params & params, const std::string & value) {
- params.logits_output_dir = value;
- }
- ).set_examples({LLAMA_EXAMPLE_DEBUG}));
- add_opt(common_arg(
- {"--tensor-filter"}, "REGEX",
- "filter tensor names for debug output (regex pattern, can be specified multiple times)",
- [](common_params & params, const std::string & value) {
- params.tensor_filter.push_back(value);
- }
- ).set_examples({LLAMA_EXAMPLE_DEBUG}));
// presets
add_opt(common_arg(
@@ -3730,15 +3518,15 @@ void common_params_add_preset_options(std::vector<common_arg> & args) {
[](common_params &, const std::string &) { /* unused */ }
).set_env(COMMON_ARG_PRESET_LOAD_ON_STARTUP).set_preset_only());
- args.push_back(common_arg(
- {"stop-timeout"}, "SECONDS",
- "in server router mode, force-kill model instance after this many seconds of graceful shutdown",
- [](common_params &, int) { /* unused */ }
- ).set_env(COMMON_ARG_PRESET_STOP_TIMEOUT).set_preset_only());
-
// args.push_back(common_arg(
// {"pin"},
// "in server router mode, do not unload this model if models_max is exceeded",
// [](common_params &) { /* unused */ }
// ).set_preset_only());
+
+ // args.push_back(common_arg(
+ // {"unload-idle-seconds"}, "SECONDS",
+ // "in server router mode, unload models idle for more than this many seconds",
+ // [](common_params &, int) { /* unused */ }
+ // ).set_preset_only());
}
diff --git a/common/arg.h b/common/arg.h
index 55782a1..f5111c6 100644
--- a/common/arg.h
+++ b/common/arg.h
@@ -10,7 +10,6 @@
// pseudo-env variable to identify preset-only arguments
#define COMMON_ARG_PRESET_LOAD_ON_STARTUP "__PRESET_LOAD_ON_STARTUP"
-#define COMMON_ARG_PRESET_STOP_TIMEOUT "__PRESET_STOP_TIMEOUT"
//
// CLI argument parsing
@@ -129,3 +128,11 @@ void common_params_add_preset_options(std::vector<common_arg> & args);
// initialize argument parser context - used by test-arg-parser and preset
common_params_context common_params_parser_init(common_params & params, llama_example ex, void(*print_usage)(int, char **) = nullptr);
+
+struct common_remote_params {
+ std::vector<std::string> headers;
+ long timeout = 0; // CURLOPT_TIMEOUT, in seconds ; 0 means no timeout
+ long max_size = 0; // max size of the response ; unlimited if 0 ; max is 2GB
+};
+// get remote file content, returns <http_code, raw_response_body>
+std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url, const common_remote_params & params);
diff --git a/common/chat-parser.cpp b/common/chat-parser.cpp
index 2f07351..d740dac 100644
--- a/common/chat-parser.cpp
+++ b/common/chat-parser.cpp
@@ -1395,126 +1395,6 @@ static void common_chat_parse_seed_oss(common_chat_msg_parser & builder) {
builder.consume_reasoning_with_xml_tool_calls(form, "", "");
}
-static void common_chat_parse_solar_open(common_chat_msg_parser & builder) {
- builder.try_parse_reasoning("<|think|>", "<|end|><|begin|>assistant<|content|>");
-
- // TODO: Tool calling
-
- builder.add_content(builder.consume_rest());
-}
-
-static void common_chat_parse_exaone_moe_content(common_chat_msg_parser & builder) {
- // 1) { "name": "...", "arguments": {...} }
- // 2) { "id": "...", "type": "function", "function": { "name": "...", "arguments": {...} } }
- static const common_regex tool_call_open(R"(<tool_call[^>]*>)");
-
- if (!builder.syntax().parse_tool_calls) {
- LOG_DBG("%s: not parse_tool_calls\n", __func__);
- builder.add_content(builder.consume_rest());
- return;
- }
-
- LOG_DBG("%s: parse_tool_calls\n", __func__);
-
- // Find all <tool_call></tool_call> blocks
- while (auto first = builder.try_find_regex(tool_call_open, std::string::npos, /* add_prelude_to_content= */ true)) {
- builder.move_to(first->groups[0].end);
- builder.consume_spaces();
-
- builder.try_consume_literal("```json");
- builder.try_consume_literal("```");
- builder.consume_spaces();
-
- // Consume JSON object
- auto data = builder.consume_json();
-
- builder.consume_spaces();
- builder.try_consume_literal("```");
- builder.consume_spaces();
-
- if (!builder.try_consume_literal("</tool_call>")) {
- throw common_chat_msg_partial_exception("incomplete tool call");
- }
- builder.consume_spaces();
-
- // Extract name and arguments
- std::string name;
- std::string id;
- nlohmann::ordered_json arguments;
-
- const auto extract_args = [&](const nlohmann::ordered_json & obj) -> bool {
- if (!obj.contains("name") || !obj.contains("arguments")) {
- return false;
- }
- name = obj.at("name").get<std::string>();
- arguments = obj.at("arguments");
- if (obj.contains("id") && obj.at("id").is_string()) {
- id = obj.at("id").get<std::string>();
- }
- return true;
- };
-
- if (!extract_args(data.json)) {
- if (data.json.contains("function") && data.json.at("function").is_object()) {
- auto fn = data.json.at("function");
- extract_args(fn);
- if (id.empty() && data.json.contains("id") && data.json.at("id").is_string()) {
- id = data.json.at("id").get<std::string>();
- }
- }
- }
-
- // If name is empty, treat the JSON object as content
- if (name.empty()) {
- LOG_DBG("%s: tool call missing name, treating as content\n", __func__);
- builder.add_content(data.json.dump());
- continue;
- }
-
- std::string args_str = arguments.dump();
- if (!builder.add_tool_call(name, id, args_str)) {
- throw common_chat_msg_partial_exception("incomplete tool call");
- }
- }
-
- builder.add_content(builder.consume_rest());
-}
-
-static void common_chat_parse_exaone_moe(common_chat_msg_parser & builder) {
- LOG_DBG("%s: parsing exaone_moe\n", __func__);
- // EXAONE MoE outputs reasoning content between "<think>" and "</think>" tags, followed by regular content
- // First try to parse using the standard reasoning parsing method
- LOG_DBG("%s: thinking_forced_open: %s\n", __func__, std::to_string(builder.syntax().thinking_forced_open).c_str());
-
- auto start_pos = builder.pos();
- auto found_end_think = builder.try_find_literal("</think>");
- builder.move_to(start_pos);
-
- if (builder.syntax().thinking_forced_open && !builder.is_partial() && !found_end_think) {
- LOG_DBG("%s: no end_think, not partial, adding content\n", __func__);
- common_chat_parse_exaone_moe_content(builder);
- } else if (builder.try_parse_reasoning("<think>", "</think>")) {
- // If reasoning was parsed successfully, the remaining content is regular content
- LOG_DBG("%s: parsed reasoning, adding content\n", __func__);
- common_chat_parse_exaone_moe_content(builder);
- } else {
- if (builder.syntax().reasoning_format == COMMON_REASONING_FORMAT_NONE) {
- LOG_DBG("%s: reasoning_format none, adding content\n", __func__);
- common_chat_parse_exaone_moe_content(builder);
- return;
- }
- // If no reasoning tags found, check if we should treat everything as reasoning
- if (builder.syntax().thinking_forced_open) {
- // If thinking is forced open but no tags found, treat everything as reasoning
- LOG_DBG("%s: thinking_forced_open, adding reasoning content\n", __func__);
- builder.add_reasoning_content(builder.consume_rest());
- } else {
- LOG_DBG("%s: no thinking_forced_open, adding content\n", __func__);
- common_chat_parse_exaone_moe_content(builder);
- }
- }
-}
-
static void common_chat_parse_content_only(common_chat_msg_parser & builder) {
builder.try_parse_reasoning("<think>", "</think>");
builder.add_content(builder.consume_rest());
@@ -1599,12 +1479,6 @@ static void common_chat_parse(common_chat_msg_parser & builder) {
case COMMON_CHAT_FORMAT_XIAOMI_MIMO:
common_chat_parse_xiaomi_mimo(builder);
break;
- case COMMON_CHAT_FORMAT_SOLAR_OPEN:
- common_chat_parse_solar_open(builder);
- break;
- case COMMON_CHAT_FORMAT_EXAONE_MOE:
- common_chat_parse_exaone_moe(builder);
- break;
default:
throw std::runtime_error(std::string("Unsupported format: ") + common_chat_format_name(builder.syntax().format));
}
diff --git a/common/chat.cpp b/common/chat.cpp
index d531388..0a426f4 100644
--- a/common/chat.cpp
+++ b/common/chat.cpp
@@ -319,7 +319,7 @@ json common_chat_msgs_to_json_oaicompat(const std::vector & msg
}
}
} else {
- jmsg["content"] = "";
+ jmsg["content"] = json(); // null
}
if (!msg.reasoning_content.empty()) {
jmsg["reasoning_content"] = msg.reasoning_content;
@@ -380,8 +380,8 @@ std::vector common_chat_tools_parse_oaicompat(const json & too
const auto & function = tool.at("function");
result.push_back({
/* .name = */ function.at("name"),
- /* .description = */ function.value("description", ""),
- /* .parameters = */ function.value("parameters", json::object()).dump(),
+ /* .description = */ function.at("description"),
+ /* .parameters = */ function.at("parameters").dump(),
});
}
}
@@ -669,8 +669,6 @@ const char * common_chat_format_name(common_chat_format format) {
case COMMON_CHAT_FORMAT_QWEN3_CODER_XML: return "Qwen3 Coder";
case COMMON_CHAT_FORMAT_APRIEL_1_5: return "Apriel 1.5";
case COMMON_CHAT_FORMAT_XIAOMI_MIMO: return "Xiaomi MiMo";
- case COMMON_CHAT_FORMAT_SOLAR_OPEN: return "Solar Open";
- case COMMON_CHAT_FORMAT_EXAONE_MOE: return "EXAONE MoE";
case COMMON_CHAT_FORMAT_PEG_SIMPLE: return "peg-simple";
case COMMON_CHAT_FORMAT_PEG_NATIVE: return "peg-native";
case COMMON_CHAT_FORMAT_PEG_CONSTRUCTED: return "peg-constructed";
@@ -2066,7 +2064,7 @@ static common_chat_params common_chat_params_init_gpt_oss(const common_chat_temp
// Trigger on tool calls that appear in the commentary channel
data.grammar_triggers.push_back({
COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN,
- "<\\|channel\\|>(?:commentary|analysis) to"
+ "<\\|channel\\|>(commentary|analysis) to"
});
// Trigger tool calls that appear in the role section, either at the
@@ -2399,17 +2397,17 @@ static common_chat_params common_chat_params_init_hermes_2_pro(const common_chat
(inputs.parallel_tool_calls ? "(" + tool_call + ")+" : tool_call));
// Trigger on some common known "good bad" outputs (only from the start and with a json that's about a specific argument name to avoid false positives)
data.grammar_triggers.push_back({
- COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN,
+ COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL,
// If thinking_forced_open, then we capture the tag in the grammar,
// (important for required tool choice) and in the trigger's first capture (decides what is sent to the grammar)
- std::string(data.thinking_forced_open ? "(\\s*)" : "") + (
+ std::string(data.thinking_forced_open ? "[\\s\\S]*?(\\s*)" : "(?:[\\s\\S]*?\\s*)?") + (
"\\s*("
"(?:"
"||||)?"
"\\s*\\{\\s*\"name\"\\s*:\\s*\"(?:" + string_join(escaped_names, "|") + ")\""
")"
- ")"
+ ")[\\s\\S]*"
),
});
data.preserved_tokens = {
@@ -2519,86 +2517,6 @@ static common_chat_params common_chat_params_init_granite(const common_chat_temp
return data;
}
-static common_chat_params common_chat_params_init_solar_open(const common_chat_template & tmpl, const struct templates_params & inputs) {
- common_chat_params data;
-
- // TODO: Reasoning effort
- json additional_context = {};
-
- data.prompt = apply(tmpl, inputs, std::nullopt, std::nullopt, additional_context);
- data.format = COMMON_CHAT_FORMAT_SOLAR_OPEN;
-
- data.preserved_tokens = {
- "<|think|>",
- "<|content|>",
- "<|begin|>",
- "<|end|>",
- };
-
- // TODO: Tool calling
-
- return data;
-}
-
-static common_chat_params common_chat_params_init_exaone_moe(const common_chat_template & tmpl, const struct templates_params & inputs) {
- common_chat_params data;
-
- data.prompt = apply(tmpl, inputs);
- data.format = COMMON_CHAT_FORMAT_EXAONE_MOE;
- if (string_ends_with(data.prompt, "\n")) {
- if (!inputs.enable_thinking) {
- data.prompt += "\n\n";
- } else {
- data.thinking_forced_open = true;
- }
- }
-
- if (inputs.tools.is_array() && !inputs.tools.empty()) {
- data.grammar_lazy = inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_REQUIRED && inputs.json_schema.is_null();
- data.grammar = build_grammar([&](const common_grammar_builder & builder) {
- std::vector<std::string> tool_rules;
- foreach_function(inputs.tools, [&](const json & tool) {
- const auto & function = tool.at("function");
- std::string name = function.at("name");
- auto parameters = function.at("parameters");
- builder.resolve_refs(parameters);
- // Expect: {"name": "", "arguments": {...}}
- tool_rules.push_back(builder.add_rule(
- name + "-call",
- "\"\" space " +
- builder.add_schema(name + "-obj", json{
- {"type", "object"},
- {"properties", {
- {"name", json{{"const", name}}},
- {"arguments", parameters},
- }},
- {"required", json::array({"name", "arguments"})},
- }) +
- " space \"\" space"));
- });
-
- auto tool_call = builder.add_rule("tool_call", string_join(tool_rules, " | "));
- builder.add_rule("root",
- std::string(data.thinking_forced_open ? "( \"\" space )? " : "") +
- (inputs.parallel_tool_calls ? "(" + tool_call + ")+" : tool_call));
-
- data.grammar_triggers.push_back({
- COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL,
- std::string(data.thinking_forced_open ? "[\\s\\S]*?(\\s*)?" : "") +
- "()[\\s\\S]*"
- });
- data.preserved_tokens = {
- "",
- "",
- "",
- "",
- };
- });
- }
-
- return data;
-}
-
static common_chat_params common_chat_params_init_without_tools(const common_chat_template & tmpl, const struct templates_params & inputs) {
common_chat_params data;
data.prompt = apply(tmpl, inputs);
@@ -2769,13 +2687,6 @@ static common_chat_params common_chat_templates_apply_jinja(
return common_chat_params_init_xiaomi_mimo(tmpl, params);
}
- // EXAONE MoE format detection
- if (src.find("") != std::string::npos &&
- src.find("") != std::string::npos &&
- src.find("<|tool_declare|>") != std::string::npos) {
- return common_chat_params_init_exaone_moe(tmpl, params);
- }
-
// Hermes 2/3 Pro, Qwen 2.5 Instruct (w/ tools)
if (src.find("") != std::string::npos && params.json_schema.is_null()) {
return common_chat_params_init_hermes_2_pro(tmpl, params);
@@ -2869,13 +2780,6 @@ static common_chat_params common_chat_templates_apply_jinja(
return common_chat_params_init_magistral(tmpl, params);
}
- // Solar Open
- if (src.find("<|tool_response:begin|>") != std::string::npos &&
- src.find("<|tool_response:name|>") != std::string::npos &&
- src.find("<|tool_response:result|>") != std::string::npos) {
- return common_chat_params_init_solar_open(tmpl, params);
- }
-
// Plain handler (no tools)
if (params.tools.is_null() || inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_NONE) {
return common_chat_params_init_without_tools(tmpl, params);
diff --git a/common/chat.h b/common/chat.h
index 454085e..6085510 100644
--- a/common/chat.h
+++ b/common/chat.h
@@ -124,8 +124,6 @@ enum common_chat_format {
COMMON_CHAT_FORMAT_QWEN3_CODER_XML,
COMMON_CHAT_FORMAT_APRIEL_1_5,
COMMON_CHAT_FORMAT_XIAOMI_MIMO,
- COMMON_CHAT_FORMAT_SOLAR_OPEN,
- COMMON_CHAT_FORMAT_EXAONE_MOE,
// These are intended to be parsed by the PEG parser
COMMON_CHAT_FORMAT_PEG_SIMPLE,
diff --git a/common/common.cpp b/common/common.cpp
index 26250ab..acf2ec8 100644
--- a/common/common.cpp
+++ b/common/common.cpp
@@ -251,7 +251,7 @@ bool set_process_priority(enum ggml_sched_priority prio) {
case GGML_SCHED_PRIO_REALTIME: p = -20; break;
}
- if (setpriority(PRIO_PROCESS, 0, p) != 0) {
+ if (setpriority(PRIO_PROCESS, 0, p) != 0) {
LOG_WRN("failed to set process priority %d : %s (%d)\n", prio, strerror(errno), errno);
return false;
}
@@ -1086,7 +1086,6 @@ struct common_init_result::impl {
std::vector<llama_adapter_lora_ptr> lora;
std::vector<common_sampler_ptr> samplers;
- std::vector<llama_sampler_seq_config> samplers_seq_config;
};
common_init_result::common_init_result(common_params & params) :
@@ -1097,7 +1096,7 @@ common_init_result::common_init_result(common_params & params) :
if (params.fit_params) {
LOG_INF("%s: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on\n", __func__);
llama_params_fit(params.model.path.c_str(), &mparams, &cparams,
- params.tensor_split, params.tensor_buft_overrides.data(), params.fit_params_target.data(), params.fit_params_min_ctx,
+ params.tensor_split, params.tensor_buft_overrides.data(), params.fit_params_target, params.fit_params_min_ctx,
params.verbosity >= 4 ? GGML_LOG_LEVEL_DEBUG : GGML_LOG_LEVEL_ERROR);
}
@@ -1110,25 +1109,6 @@ common_init_result::common_init_result(common_params & params) :
const llama_vocab * vocab = llama_model_get_vocab(model);
- // load and optionally apply lora adapters (must be loaded before context creation)
- for (auto & la : params.lora_adapters) {
- llama_adapter_lora_ptr lora;
- lora.reset(llama_adapter_lora_init(model, la.path.c_str()));
- if (lora == nullptr) {
- LOG_ERR("%s: failed to load lora adapter '%s'\n", __func__, la.path.c_str());
- pimpl->model.reset(model);
- return;
- }
-
- char buf[1024];
- la.ptr = lora.get();
- llama_adapter_meta_val_str(la.ptr, "adapter.lora.task_name", buf, sizeof(buf));
- la.task_name = buf;
- llama_adapter_meta_val_str(la.ptr, "adapter.lora.prompt_prefix", buf, sizeof(buf));
- la.prompt_prefix = buf;
- pimpl->lora.emplace_back(std::move(lora)); // copy to list of loaded adapters
- }
-
// updates params.sampling
// TODO: fix naming
common_init_sampler_from_model(model, params.sampling);
@@ -1163,18 +1143,10 @@ common_init_result::common_init_result(common_params & params) :
// params.sampling.dry_penalty_last_n = llama_n_ctx(lctx);
//}
- // init the backend samplers as part of the context creation
pimpl->samplers.resize(cparams.n_seq_max);
- pimpl->samplers_seq_config.resize(cparams.n_seq_max);
for (int i = 0; i < (int) cparams.n_seq_max; ++i) {
pimpl->samplers[i].reset(common_sampler_init(model, params.sampling));
- pimpl->samplers_seq_config[i] = { i, common_sampler_get(pimpl->samplers[i].get()) };
- }
-
- if (params.sampling.backend_sampling) {
- cparams.samplers = pimpl->samplers_seq_config.data();
- cparams.n_samplers = pimpl->samplers_seq_config.size();
}
llama_context * lctx = llama_init_from_model(model, cparams);
@@ -1198,12 +1170,6 @@ common_sampler * common_init_result::sampler(llama_seq_id seq_id) {
return pimpl->samplers[seq_id].get();
}
-void common_init_result::reset_samplers() {
- for (int i = 0; i < (int) pimpl->samplers.size(); ++i) {
- llama_sampler_reset(common_sampler_get(pimpl->samplers[i].get()));
- }
-}
-
std::vector & common_init_result::lora() {
return pimpl->lora;
}
@@ -1279,6 +1245,24 @@ common_init_result_ptr common_init_from_params(common_params & params) {
}
}
+ // load and optionally apply lora adapters
+ for (auto & la : params.lora_adapters) {
+ llama_adapter_lora_ptr lora;
+ lora.reset(llama_adapter_lora_init(model, la.path.c_str()));
+ if (lora == nullptr) {
+ LOG_ERR("%s: failed to apply lora adapter '%s'\n", __func__, la.path.c_str());
+ return res;
+ }
+
+ char buf[1024];
+ la.ptr = lora.get();
+ llama_adapter_meta_val_str(la.ptr, "adapter.lora.task_name", buf, sizeof(buf));
+ la.task_name = buf;
+ llama_adapter_meta_val_str(la.ptr, "adapter.lora.prompt_prefix", buf, sizeof(buf));
+ la.prompt_prefix = buf;
+ res->lora().emplace_back(std::move(lora)); // copy to list of loaded adapters
+ }
+
if (!params.lora_init_without_apply) {
common_set_adapter_lora(lctx, params.lora_adapters);
}
@@ -1319,9 +1303,6 @@ common_init_result_ptr common_init_from_params(common_params & params) {
llama_synchronize(lctx);
llama_perf_context_reset(lctx);
llama_set_warmup(lctx, false);
-
- // reset samplers to reset RNG state after warmup to the seeded state
- res->reset_samplers();
}
return res;
@@ -1360,12 +1341,14 @@ struct llama_model_params common_model_params_to_llama(common_params & params) {
mparams.devices = params.devices.data();
}
- mparams.n_gpu_layers = params.n_gpu_layers;
+ if (params.n_gpu_layers != -1) {
+ mparams.n_gpu_layers = params.n_gpu_layers;
+ }
+
mparams.main_gpu = params.main_gpu;
mparams.split_mode = params.split_mode;
mparams.tensor_split = params.tensor_split;
mparams.use_mmap = params.use_mmap;
- mparams.use_direct_io = params.use_direct_io;
mparams.use_mlock = params.use_mlock;
mparams.check_tensors = params.check_tensors;
mparams.use_extra_bufts = !params.no_extra_bufts;
diff --git a/common/common.h b/common/common.h
index b9566df..3343720 100644
--- a/common/common.h
+++ b/common/common.h
@@ -80,8 +80,6 @@ int32_t cpu_get_num_math();
//
enum llama_example {
- LLAMA_EXAMPLE_BATCHED,
- LLAMA_EXAMPLE_DEBUG,
LLAMA_EXAMPLE_COMMON,
LLAMA_EXAMPLE_SPECULATIVE,
LLAMA_EXAMPLE_COMPLETION,
@@ -119,7 +117,6 @@ enum common_sampler_type {
COMMON_SAMPLER_TYPE_INFILL = 9,
COMMON_SAMPLER_TYPE_PENALTIES = 10,
COMMON_SAMPLER_TYPE_TOP_N_SIGMA = 11,
- COMMON_SAMPLER_TYPE_ADAPTIVE_P = 12,
};
// dimensionality reduction methods, used by cvector-generator
@@ -167,34 +164,32 @@ enum common_params_sampling_config : uint64_t {
struct common_params_sampling {
uint32_t seed = LLAMA_DEFAULT_SEED; // the seed used to initialize llama_sampler
- int32_t n_prev = 64; // number of previous tokens to remember
- int32_t n_probs = 0; // if greater than 0, output the probabilities of top n_probs tokens.
- int32_t min_keep = 0; // 0 = disabled, otherwise samplers should return at least min_keep tokens
- int32_t top_k = 40; // <= 0 to use vocab size
- float top_p = 0.95f; // 1.0 = disabled
- float min_p = 0.05f; // 0.0 = disabled
- float xtc_probability = 0.00f; // 0.0 = disabled
- float xtc_threshold = 0.10f; // > 0.5 disables XTC
- float typ_p = 1.00f; // typical_p, 1.0 = disabled
- float temp = 0.80f; // <= 0.0 to sample greedily, 0.0 to not output probabilities
- float dynatemp_range = 0.00f; // 0.0 = disabled
- float dynatemp_exponent = 1.00f; // controls how entropy maps to temperature in dynamic temperature sampler
- int32_t penalty_last_n = 64; // last n tokens to penalize (0 = disable penalty, -1 = context size)
- float penalty_repeat = 1.00f; // 1.0 = disabled
- float penalty_freq = 0.00f; // 0.0 = disabled
- float penalty_present = 0.00f; // 0.0 = disabled
- float dry_multiplier = 0.0f; // 0.0 = disabled; DRY repetition penalty for tokens extending repetition:
- float dry_base = 1.75f; // 0.0 = disabled; multiplier * base ^ (length of sequence before token - allowed length)
- int32_t dry_allowed_length = 2; // tokens extending repetitions beyond this receive penalty
- int32_t dry_penalty_last_n = -1; // how many tokens to scan for repetitions (0 = disable penalty, -1 = context size)
- float adaptive_target = -1.0f; // select tokens near this probability (valid range 0.0 to 1.0; negative = disabled)
- float adaptive_decay = 0.90f; // EMA decay for adaptation; history ≈ 1/(1-decay) tokens (0.0 - 0.99)
- int32_t mirostat = 0; // 0 = disabled, 1 = mirostat, 2 = mirostat 2.0
- float top_n_sigma = -1.00f; // -1.0 = disabled
- float mirostat_tau = 5.00f; // target entropy
- float mirostat_eta = 0.10f; // learning rate
+ int32_t n_prev = 64; // number of previous tokens to remember
+ int32_t n_probs = 0; // if greater than 0, output the probabilities of top n_probs tokens.
+ int32_t min_keep = 0; // 0 = disabled, otherwise samplers should return at least min_keep tokens
+ int32_t top_k = 40; // <= 0 to use vocab size
+ float top_p = 0.95f; // 1.0 = disabled
+ float min_p = 0.05f; // 0.0 = disabled
+ float xtc_probability = 0.00f; // 0.0 = disabled
+ float xtc_threshold = 0.10f; // > 0.5 disables XTC
+ float typ_p = 1.00f; // typical_p, 1.0 = disabled
+ float temp = 0.80f; // <= 0.0 to sample greedily, 0.0 to not output probabilities
+ float dynatemp_range = 0.00f; // 0.0 = disabled
+ float dynatemp_exponent = 1.00f; // controls how entropy maps to temperature in dynamic temperature sampler
+ int32_t penalty_last_n = 64; // last n tokens to penalize (0 = disable penalty, -1 = context size)
+ float penalty_repeat = 1.00f; // 1.0 = disabled
+ float penalty_freq = 0.00f; // 0.0 = disabled
+ float penalty_present = 0.00f; // 0.0 = disabled
+ float dry_multiplier = 0.0f; // 0.0 = disabled; DRY repetition penalty for tokens extending repetition:
+ float dry_base = 1.75f; // 0.0 = disabled; multiplier * base ^ (length of sequence before token - allowed length)
+ int32_t dry_allowed_length = 2; // tokens extending repetitions beyond this receive penalty
+ int32_t dry_penalty_last_n = -1; // how many tokens to scan for repetitions (0 = disable penalty, -1 = context size)
+ int32_t mirostat = 0; // 0 = disabled, 1 = mirostat, 2 = mirostat 2.0
+ float top_n_sigma = -1.00f;// -1.0 = disabled
+ float mirostat_tau = 5.00f; // target entropy
+ float mirostat_eta = 0.10f; // learning rate
bool ignore_eos = false;
- bool no_perf = false; // disable performance metrics
+ bool no_perf = false; // disable performance metrics
bool timing_per_token = false;
uint64_t user_sampling_config = 0; // bitfield to track user-specified samplers
@@ -221,8 +216,6 @@ struct common_params_sampling {
std::vector<llama_logit_bias> logit_bias; // logit biases to apply
std::vector<llama_logit_bias> logit_bias_eog; // pre-calculated logit biases for EOG tokens
- bool backend_sampling = false;
-
bool has_logit_bias() const {
return !logit_bias.empty();
}
@@ -336,14 +329,12 @@ struct common_params {
// offload params
std::vector devices; // devices to use for offloading
- int32_t n_gpu_layers = -1; // number of layers to store in VRAM, -1 is auto, <= -2 is all
- int32_t main_gpu = 0; // the GPU that is used for scratch and small tensors
- float tensor_split[128] = {0}; // how split tensors should be distributed across GPUs
- bool fit_params = true; // whether to fit unset model/context parameters to free device memory
- int32_t fit_params_min_ctx = 4096; // minimum context size to set when trying to reduce memory use
-
- // margin per device in bytes for fitting parameters to free memory:
- std::vector<size_t> fit_params_target = std::vector<size_t>(llama_max_devices(), 1024 * 1024*1024);
+ int32_t n_gpu_layers = -1; // number of layers to store in VRAM (-1 - use default)
+ int32_t main_gpu = 0; // the GPU that is used for scratch and small tensors
+ float tensor_split[128] = {0}; // how split tensors should be distributed across GPUs
+ bool fit_params = true; // whether to fit unset model/context parameters to free device memory
+ size_t fit_params_target = 1024 * 1024*1024; // margin per device in bytes for fitting parameters to free memory
+ int32_t fit_params_min_ctx = 4096; // minimum context size to set when trying to reduce memory use
enum llama_split_mode split_mode = LLAMA_SPLIT_MODE_LAYER; // how to split the model across GPUs
@@ -379,11 +370,6 @@ struct common_params {
std::string lookup_cache_dynamic = ""; // path of dynamic ngram cache file for lookup decoding // NOLINT
std::string logits_file = ""; // file for saving *all* logits // NOLINT
- // llama-debug specific options
- std::string logits_output_dir = "data"; // directory for saving logits output files // NOLINT
- bool save_logits = false; // whether to save logits to files // NOLINT
- std::vector<std::string> tensor_filter; // filter tensor names for debug output (regex) // NOLINT
-
std::vector in_files; // all input files
std::vector antiprompt; // strings upon which more user input is prompted (a.k.a. reverse prompts)
std::vector kv_overrides;
@@ -434,8 +420,7 @@ struct common_params {
bool kv_unified = false; // enable unified KV cache
bool input_prefix_bos = false; // prefix BOS to user inputs, preceding input_prefix
- bool use_mmap = true; // enable mmap to use filesystem cache
- bool use_direct_io = true; // read from disk without buffering for faster model loading
+ bool use_mmap = true; // use mmap for faster loads
bool use_mlock = false; // use mlock to keep model in memory
bool verbose_prompt = false; // print prompt tokens before generation
bool display_prompt = true; // print prompt before generation
@@ -479,7 +464,6 @@ struct common_params {
int32_t timeout_write = timeout_read; // http write timeout in seconds
int32_t n_threads_http = -1; // number of threads to process HTTP requests (TODO: support threadpool)
int32_t n_cache_reuse = 0; // min chunk size to reuse from the cache via KV shifting
- bool cache_prompt = true; // whether to enable prompt caching
int32_t n_ctx_checkpoints = 8; // max number of context checkpoints per slot
int32_t cache_ram_mib = 8192; // -1 = no limit, 0 - disable, 1 = 1 MiB, etc.
@@ -705,9 +689,7 @@ struct common_init_result {
llama_model * model();
llama_context * context();
-
common_sampler * sampler(llama_seq_id seq_id);
- void reset_samplers();
std::vector & lora();
diff --git a/common/debug.cpp b/common/debug.cpp
deleted file mode 100644
index fdaddb1..0000000
--- a/common/debug.cpp
+++ /dev/null
@@ -1,165 +0,0 @@
-#include "debug.h"
-
-#include "log.h"
-
-#include <cmath>
-#include <string>
-
-static std::string common_ggml_ne_string(const ggml_tensor * t) {
- std::string str;
- for (int i = 0; i < GGML_MAX_DIMS; ++i) {
- str += std::to_string(t->ne[i]);
- if (i + 1 < GGML_MAX_DIMS) {
- str += ", ";
- }
- }
- return str;
-}
-
-static float common_ggml_get_float_value(const uint8_t * data,
- ggml_type type,
- const size_t * nb,
- size_t i0,
- size_t i1,
- size_t i2,
- size_t i3) {
- size_t i = i3 * nb[3] + i2 * nb[2] + i1 * nb[1] + i0 * nb[0];
- float v;
- if (type == GGML_TYPE_F16) {
- v = ggml_fp16_to_fp32(*(const ggml_fp16_t *) &data[i]);
- } else if (type == GGML_TYPE_F32) {
- v = *(const float *) &data[i];
- } else if (type == GGML_TYPE_I64) {
- v = (float) *(const int64_t *) &data[i];
- } else if (type == GGML_TYPE_I32) {
- v = (float) *(const int32_t *) &data[i];
- } else if (type == GGML_TYPE_I16) {
- v = (float) *(const int16_t *) &data[i];
- } else if (type == GGML_TYPE_I8) {
- v = (float) *(const int8_t *) &data[i];
- } else if (type == GGML_TYPE_BF16) {
- v = ggml_bf16_to_fp32(*(const ggml_bf16_t *) &data[i]);
- } else {
- GGML_ABORT("fatal error");
- }
- return v;
-}
-
-template <bool abort>
-void common_debug_print_tensor(uint8_t * data, ggml_type type, const int64_t * ne, const size_t * nb, int64_t n) {
- GGML_ASSERT(n > 0);
- float sum = 0;
- for (int64_t i3 = 0; i3 < ne[3]; i3++) {
- for (int64_t i2 = 0; i2 < ne[2]; i2++) {
- for (int64_t i1 = 0; i1 < ne[1]; i1++) {
- for (int64_t i0 = 0; i0 < ne[0]; i0++) {
- const float v = common_ggml_get_float_value(data, type, nb, i0, i1, i2, i3);
- sum += v;
- }
- }
- }
- }
- for (int64_t i3 = 0; i3 < ne[3]; i3++) {
- LOG_ERR(" [\n");
- for (int64_t i2 = 0; i2 < ne[2]; i2++) {
- if (i2 == n && ne[2] > 2 * n) {
- LOG_ERR(" ..., \n");
- i2 = ne[2] - n;
- }
- LOG_ERR(" [\n");
- for (int64_t i1 = 0; i1 < ne[1]; i1++) {
- if (i1 == n && ne[1] > 2 * n) {
- LOG_ERR(" ..., \n");
- i1 = ne[1] - n;
- }
- LOG_ERR(" [");
- for (int64_t i0 = 0; i0 < ne[0]; i0++) {
- if (i0 == n && ne[0] > 2 * n) {
- LOG_ERR("..., ");
- i0 = ne[0] - n;
- }
- const float v = common_ggml_get_float_value(data, type, nb, i0, i1, i2, i3);
- LOG_ERR("%12.4f", v);
- if (i0 < ne[0] - 1) {
- LOG_ERR(", ");
- }
- }
- LOG_ERR("],\n");
- }
- LOG_ERR(" ],\n");
- }
- LOG_ERR(" ]\n");
- LOG_ERR(" sum = %f\n", sum);
- }
-
- if constexpr (abort) {
- if (std::isnan(sum)) {
- LOG_ERR("encountered NaN - aborting\n");
- exit(0);
- }
- }
-}
-
-/**
- * GGML operations callback during the graph execution.
- *
- * @param t current tensor
- * @param ask when ask is true, the scheduler wants to know if we are interested in data from this tensor
- * if we return true, a follow-up call will be made with ask=false in which we can do the actual collection.
- * see ggml_backend_sched_eval_callback
- * @param user_data user data to pass at each call back
- * @return true to receive data or continue the graph, false otherwise
- */
-template bool common_debug_cb_eval(struct ggml_tensor * t, bool ask, void * user_data) {
- auto * cb_data = (base_callback_data *) user_data;
-
- const struct ggml_tensor * src0 = t->src[0];
- const struct ggml_tensor * src1 = t->src[1];
-
- if (ask) {
- return true; // Always retrieve data
- }
-
- bool matches_filter = cb_data->tensor_filters.empty();
-
- if (!matches_filter) {
- for (const auto & filter : cb_data->tensor_filters) {
- if (std::regex_search(t->name, filter)) {
- matches_filter = true;
- break;
- }
- }
- }
-
- char src1_str[128] = { 0 };
- if (src1) {
- snprintf(src1_str, sizeof(src1_str), "%s{%s}", src1->name, common_ggml_ne_string(src1).c_str());
- }
-
- if (matches_filter) {
- LOG_ERR("%s: %24s = (%s) %10s(%s{%s}, %s}) = {%s}\n", __func__, t->name, ggml_type_name(t->type),
- ggml_op_desc(t), src0->name, common_ggml_ne_string(src0).c_str(), src1 ? src1_str : "",
- common_ggml_ne_string(t).c_str());
- }
-
- const bool is_host = ggml_backend_buffer_is_host(t->buffer);
-
- if (!is_host) {
- auto n_bytes = ggml_nbytes(t);
- cb_data->data.resize(n_bytes);
- ggml_backend_tensor_get(t, cb_data->data.data(), 0, n_bytes);
- }
-
- if (!ggml_is_quantized(t->type) && matches_filter) {
- uint8_t * data = is_host ? (uint8_t *) t->data : cb_data->data.data();
- common_debug_print_tensor(data, t->type, t->ne, t->nb, 3);
- }
-
- return true;
-}
-
-// Explicit template instantiations
-template bool common_debug_cb_eval<false>(ggml_tensor *, bool, void *);
-template bool common_debug_cb_eval<true>(ggml_tensor *, bool, void *);
-template void common_debug_print_tensor<false>(uint8_t *, ggml_type, const int64_t *, const size_t *, int64_t);
-template void common_debug_print_tensor<true>(uint8_t *, ggml_type, const int64_t *, const size_t *, int64_t);
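
Review note: the removed `common_ggml_get_float_value` locates an element by multiplying each of the four indices by its byte stride in `nb` and summing the results. A minimal standalone sketch of that addressing for plain `float` data follows. It does not use ggml types, and `read_f32_strided` is a hypothetical name chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper, not part of llama.cpp: read one float from a 4-D
// buffer whose layout is described by byte strides nb[0..3].
static float read_f32_strided(const uint8_t * data, const size_t * nb,
                              size_t i0, size_t i1, size_t i2, size_t i3) {
    assert(data != nullptr && nb != nullptr);
    // Same offset formula as the removed helper: sum of index * byte stride.
    const size_t offset = i3 * nb[3] + i2 * nb[2] + i1 * nb[1] + i0 * nb[0];
    float v;
    // memcpy sidesteps alignment and aliasing concerns of a pointer cast.
    std::memcpy(&v, data + offset, sizeof(v));
    return v;
}
```

For a contiguous 2x3 `float` matrix the strides would be `{sizeof(float), 3 * sizeof(float), ...}`, so index `(i0 = 2, i1 = 1)` addresses the last element.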
diff --git a/common/debug.h b/common/debug.h
deleted file mode 100644
index 0c55963..0000000
--- a/common/debug.h
+++ /dev/null
@@ -1,43 +0,0 @@
-#pragma once
-#include "common.h"
-#include
-#include
-#include
-
-// common debug functions and structs
-
-// Print a tensor's detailed data
-// data - the tensor's data in byte format
-// type - the tensor's quantization type
-// ne - the tensor dimensions array
-// nb - the tensor strides array
-// n - the number of rows/columns to fully print
-template void common_debug_print_tensor(uint8_t * data, ggml_type type, const int64_t * ne, const size_t * nb, int64_t n);
-
-// Intended to use as callback for ggml_backend_sched_eval_callback
-// prints tensors that are processed in the computation graph
-// by default prints all tensors, but can be configured by creating a `base_callback_data` instance with
-// non-empty filter_patterns. See examples/debug.ccp for possible usage patterns
-// The template parameter determines whether an error should be thrown whenever a NaN is encountered
-// in a tensor (useful for stopping debug sessions on first erroneous tensor)
-// The callback data will be passed as the third parameter (user_data)
-template bool common_debug_cb_eval(struct ggml_tensor * t, bool ask, void * user_data);
-struct base_callback_data {
- std::vector<uint8_t> data;
- std::vector<std::regex> tensor_filters;
-
- base_callback_data() = default;
-
- base_callback_data(common_params & params, const std::vector<std::string> & filter_patterns) {
- for (const auto & pattern : filter_patterns) {
- try {
- std::string anchored_pattern = "^" + pattern;
- tensor_filters.emplace_back(anchored_pattern, std::regex::optimize);
- } catch (const std::regex_error & e) {
- throw std::runtime_error("Invalid regex pattern '" + pattern + "': " + e.what());
- }
- }
- params.cb_eval = common_debug_cb_eval;
- params.cb_eval_user_data = this;
- }
-};
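
Review note: the removed `base_callback_data` constructor prefixes every pattern with `^`, and the callback then uses `std::regex_search`, so a filter only matches at the start of a tensor name. An empty filter list matches everything. A small sketch of that behavior using only the standard library follows. The function names `compile_filters` and `name_matches_filters` are hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: compile patterns anchored at the start of the name,
// as the removed constructor did with "^" + pattern.
static std::vector<std::regex> compile_filters(const std::vector<std::string> & patterns) {
    std::vector<std::regex> out;
    for (const auto & p : patterns) {
        out.emplace_back("^" + p, std::regex::optimize);
    }
    return out;
}

// Hypothetical helper: true if no filters are set or any filter matches.
static bool name_matches_filters(const std::string & name, const std::vector<std::regex> & filters) {
    if (filters.empty()) {
        return true; // no filters configured: accept every tensor
    }
    for (const auto & f : filters) {
        if (std::regex_search(name, f)) {
            return true;
        }
    }
    return false;
}
```

With the pattern `attn`, a name such as `attn_out-0` passes while `ffn_attn` does not, because the anchor rejects matches that start later in the string.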
diff --git a/common/download.cpp b/common/download.cpp
index a377804..ef87472 100644
--- a/common/download.cpp
+++ b/common/download.cpp
@@ -19,7 +19,10 @@
#include
#include
-#if defined(LLAMA_USE_HTTPLIB)
+#if defined(LLAMA_USE_CURL)
+#include
+#include
+#elif defined(LLAMA_USE_HTTPLIB)
#include "http.h"
#endif
@@ -154,21 +157,322 @@ static std::string read_etag(const std::string & path) {
return none;
}
-static bool is_http_status_ok(int status) {
- return status >= 200 && status < 400;
-}
+#ifdef LLAMA_USE_CURL
-std::pair<std::string, std::string> common_download_split_repo_tag(const std::string & hf_repo_with_tag) {
- auto parts = string_split<std::string>(hf_repo_with_tag, ':');
- std::string tag = parts.size() > 1 ? parts.back() : "latest";
- std::string hf_repo = parts[0];
- if (string_split<std::string>(hf_repo, '/').size() != 2) {
- throw std::invalid_argument("error: invalid HF repo format, expected <user>/<model>[:quant]\n");
+//
+// CURL utils
+//
+
+using curl_ptr = std::unique_ptr<CURL, decltype(&curl_easy_cleanup)>;
+
+// cannot use unique_ptr for curl_slist, because we cannot update without destroying the old one
+struct curl_slist_ptr {
+ struct curl_slist * ptr = nullptr;
+ ~curl_slist_ptr() {
+ if (ptr) {
+ curl_slist_free_all(ptr);
+ }
}
- return {hf_repo, tag};
+};
+
+static CURLcode common_curl_perf(CURL * curl) {
+ CURLcode res = curl_easy_perform(curl);
+ if (res != CURLE_OK) {
+ LOG_ERR("%s: curl_easy_perform() failed\n", __func__);
+ }
+
+ return res;
}
-#if defined(LLAMA_USE_HTTPLIB)
+// Send a HEAD request to retrieve the etag and last-modified headers
+struct common_load_model_from_url_headers {
+ std::string etag;
+ std::string last_modified;
+ std::string accept_ranges;
+};
+
+struct FILE_deleter {
+ void operator()(FILE * f) const { fclose(f); }
+};
+
+static size_t common_header_callback(char * buffer, size_t, size_t n_items, void * userdata) {
+ common_load_model_from_url_headers * headers = (common_load_model_from_url_headers *) userdata;
+ static std::regex header_regex("([^:]+): (.*)\r\n");
+ static std::regex etag_regex("ETag", std::regex_constants::icase);
+ static std::regex last_modified_regex("Last-Modified", std::regex_constants::icase);
+ static std::regex accept_ranges_regex("Accept-Ranges", std::regex_constants::icase);
+ std::string header(buffer, n_items);
+ std::smatch match;
+ if (std::regex_match(header, match, header_regex)) {
+ const std::string & key = match[1];
+ const std::string & value = match[2];
+ if (std::regex_match(key, match, etag_regex)) {
+ headers->etag = value;
+ } else if (std::regex_match(key, match, last_modified_regex)) {
+ headers->last_modified = value;
+ } else if (std::regex_match(key, match, accept_ranges_regex)) {
+ headers->accept_ranges = value;
+ }
+ }
+
+ return n_items;
+}
+
+static size_t common_write_callback(void * data, size_t size, size_t nmemb, void * fd) {
+ return std::fwrite(data, size, nmemb, static_cast<FILE *>(fd));
+}
+
+// helper function to hide password in URL
+static std::string llama_download_hide_password_in_url(const std::string & url) {
+ // Use regex to match and replace the user[:password]@ pattern in URLs
+ // Pattern: scheme://[user[:password]@]host[...]
+ static const std::regex url_regex(R"(^(?:[A-Za-z][A-Za-z0-9+.-]*://)(?:[^/@]+@)?.*$)");
+ std::smatch match;
+
+ if (std::regex_match(url, match, url_regex)) {
+ // match[1] = scheme (e.g., "https://")
+ // match[2] = user[:password]@ part
+ // match[3] = rest of URL (host and path)
+ return match[1].str() + "********@" + match[3].str();
+ }
+
+ return url; // No credentials found or malformed URL
+}
+
+static void common_curl_easy_setopt_head(CURL * curl, const std::string & url) {
+ // Set the URL, allow to follow http redirection
+ curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
+ curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
+
+# if defined(_WIN32)
+ // CURLSSLOPT_NATIVE_CA tells libcurl to use standard certificate store of
+ // operating system. Currently implemented under MS-Windows.
+ curl_easy_setopt(curl, CURLOPT_SSL_OPTIONS, CURLSSLOPT_NATIVE_CA);
+# endif
+
+ curl_easy_setopt(curl, CURLOPT_NOBODY, 1L); // will trigger the HEAD verb
+ curl_easy_setopt(curl, CURLOPT_NOPROGRESS, 1L); // hide head request progress
+ curl_easy_setopt(curl, CURLOPT_HEADERFUNCTION, common_header_callback);
+}
+
+static void common_curl_easy_setopt_get(CURL * curl) {
+ curl_easy_setopt(curl, CURLOPT_NOBODY, 0L);
+ curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, common_write_callback);
+
+ // display download progress
+ curl_easy_setopt(curl, CURLOPT_NOPROGRESS, 0L);
+}
+
+static bool common_pull_file(CURL * curl, const std::string & path_temporary) {
+ if (std::filesystem::exists(path_temporary)) {
+ const std::string partial_size = std::to_string(std::filesystem::file_size(path_temporary));
+ LOG_INF("%s: server supports range requests, resuming download from byte %s\n", __func__, partial_size.c_str());
+ const std::string range_str = partial_size + "-";
+ curl_easy_setopt(curl, CURLOPT_RANGE, range_str.c_str());
+ }
+
+ // Always open the file in append mode, since we could be resuming a download
+ std::unique_ptr<FILE, FILE_deleter> outfile(fopen(path_temporary.c_str(), "ab"));
+ if (!outfile) {
+ LOG_ERR("%s: error opening local file for writing: %s\n", __func__, path_temporary.c_str());
+ return false;
+ }
+
+ common_curl_easy_setopt_get(curl);
+ curl_easy_setopt(curl, CURLOPT_WRITEDATA, outfile.get());
+
+ return common_curl_perf(curl) == CURLE_OK;
+}
+
+static bool common_download_head(CURL * curl,
+ curl_slist_ptr & http_headers,
+ const std::string & url,
+ const std::string & bearer_token) {
+ if (!curl) {
+ LOG_ERR("%s: error initializing libcurl\n", __func__);
+ return false;
+ }
+
+ http_headers.ptr = curl_slist_append(http_headers.ptr, "User-Agent: llama-cpp");
+ // Check if hf-token or bearer-token was specified
+ if (!bearer_token.empty()) {
+ std::string auth_header = "Authorization: Bearer " + bearer_token;
+ http_headers.ptr = curl_slist_append(http_headers.ptr, auth_header.c_str());
+ }
+
+ curl_easy_setopt(curl, CURLOPT_HTTPHEADER, http_headers.ptr);
+ common_curl_easy_setopt_head(curl, url);
+ return common_curl_perf(curl) == CURLE_OK;
+}
+
+// download one single file from remote URL to local path
+static bool common_download_file_single_online(const std::string & url,
+ const std::string & path,
+ const std::string & bearer_token) {
+ static const int max_attempts = 3;
+ static const int retry_delay_seconds = 2;
+ for (int i = 0; i < max_attempts; ++i) {
+ std::string etag;
+
+ // Check if the file already exists locally
+ const auto file_exists = std::filesystem::exists(path);
+ if (file_exists) {
+ etag = read_etag(path);
+ } else {
+ LOG_INF("%s: no previous model file found %s\n", __func__, path.c_str());
+ }
+
+ bool head_request_ok = false;
+ bool should_download = !file_exists; // by default, we should download if the file does not exist
+
+ // Initialize libcurl
+ curl_ptr curl(curl_easy_init(), &curl_easy_cleanup);
+ common_load_model_from_url_headers headers;
+ curl_easy_setopt(curl.get(), CURLOPT_HEADERDATA, &headers);
+ curl_slist_ptr http_headers;
+ const bool was_perform_successful = common_download_head(curl.get(), http_headers, url, bearer_token);
+ if (!was_perform_successful) {
+ head_request_ok = false;
+ }
+
+ long http_code = 0;
+ curl_easy_getinfo(curl.get(), CURLINFO_RESPONSE_CODE, &http_code);
+ if (http_code == 200) {
+ head_request_ok = true;
+ } else {
+ LOG_WRN("%s: HEAD invalid http status code received: %ld\n", __func__, http_code);
+ head_request_ok = false;
+ }
+
+ // if head_request_ok is false, we don't have the etag or last-modified headers
+ // we leave should_download as-is, which is true if the file does not exist
+ bool should_download_from_scratch = false;
+ if (head_request_ok) {
+ // check if ETag or Last-Modified headers are different
+ // if it is, we need to download the file again
+ if (!etag.empty() && etag != headers.etag) {
+ LOG_WRN("%s: ETag header is different (%s != %s): triggering a new download\n", __func__, etag.c_str(),
+ headers.etag.c_str());
+ should_download = true;
+ should_download_from_scratch = true;
+ }
+ }
+
+ const bool accept_ranges_supported = !headers.accept_ranges.empty() && headers.accept_ranges != "none";
+ if (should_download) {
+ if (file_exists &&
+ !accept_ranges_supported) { // Resumable downloads not supported, delete and start again.
+ LOG_WRN("%s: deleting previous downloaded file: %s\n", __func__, path.c_str());
+ if (remove(path.c_str()) != 0) {
+ LOG_ERR("%s: unable to delete file: %s\n", __func__, path.c_str());
+ return false;
+ }
+ }
+
+ const std::string path_temporary = path + ".downloadInProgress";
+ if (should_download_from_scratch) {
+ if (std::filesystem::exists(path_temporary)) {
+ if (remove(path_temporary.c_str()) != 0) {
+ LOG_ERR("%s: unable to delete file: %s\n", __func__, path_temporary.c_str());
+ return false;
+ }
+ }
+
+ if (std::filesystem::exists(path)) {
+ if (remove(path.c_str()) != 0) {
+ LOG_ERR("%s: unable to delete file: %s\n", __func__, path.c_str());
+ return false;
+ }
+ }
+ }
+ if (head_request_ok) {
+ write_etag(path, headers.etag);
+ }
+
+ // start the download
+ LOG_INF("%s: trying to download model from %s to %s (server_etag:%s, server_last_modified:%s)...\n",
+ __func__, llama_download_hide_password_in_url(url).c_str(), path_temporary.c_str(),
+ headers.etag.c_str(), headers.last_modified.c_str());
+ const bool was_pull_successful = common_pull_file(curl.get(), path_temporary);
+ if (!was_pull_successful) {
+ if (i + 1 < max_attempts) {
+ const int exponential_backoff_delay = std::pow(retry_delay_seconds, i) * 1000;
+ LOG_WRN("%s: retrying after %d milliseconds...\n", __func__, exponential_backoff_delay);
+ std::this_thread::sleep_for(std::chrono::milliseconds(exponential_backoff_delay));
+ } else {
+ LOG_ERR("%s: curl_easy_perform() failed after %d attempts\n", __func__, max_attempts);
+ }
+
+ continue;
+ }
+
+ long http_code = 0;
+ curl_easy_getinfo(curl.get(), CURLINFO_RESPONSE_CODE, &http_code);
+ if (http_code < 200 || http_code >= 400) {
+ LOG_ERR("%s: invalid http status code received: %ld\n", __func__, http_code);
+ return false;
+ }
+
+ if (rename(path_temporary.c_str(), path.c_str()) != 0) {
+ LOG_ERR("%s: unable to rename file: %s to %s\n", __func__, path_temporary.c_str(), path.c_str());
+ return false;
+ }
+ } else {
+ LOG_INF("%s: using cached file: %s\n", __func__, path.c_str());
+ }
+
+ break;
+ }
+
+ return true;
+}
+
+std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url, const common_remote_params & params) {
+ curl_ptr curl(curl_easy_init(), &curl_easy_cleanup);
+ curl_slist_ptr http_headers;
+ std::vector<char> res_buffer;
+
+ curl_easy_setopt(curl.get(), CURLOPT_URL, url.c_str());
+ curl_easy_setopt(curl.get(), CURLOPT_NOPROGRESS, 1L);
+ curl_easy_setopt(curl.get(), CURLOPT_FOLLOWLOCATION, 1L);
+ curl_easy_setopt(curl.get(), CURLOPT_VERBOSE, 0L);
+ typedef size_t(*CURLOPT_WRITEFUNCTION_PTR)(void * ptr, size_t size, size_t nmemb, void * data);
+ auto write_callback = [](void * ptr, size_t size, size_t nmemb, void * data) -> size_t {
+ auto data_vec = static_cast<std::vector<char> *>(data);
+ data_vec->insert(data_vec->end(), (char *)ptr, (char *)ptr + size * nmemb);
+ return size * nmemb;
+ };
+ curl_easy_setopt(curl.get(), CURLOPT_WRITEFUNCTION, static_cast<CURLOPT_WRITEFUNCTION_PTR>(write_callback));
+ curl_easy_setopt(curl.get(), CURLOPT_WRITEDATA, &res_buffer);
+#if defined(_WIN32)
+ curl_easy_setopt(curl.get(), CURLOPT_SSL_OPTIONS, CURLSSLOPT_NATIVE_CA);
+#endif
+ if (params.timeout > 0) {
+ curl_easy_setopt(curl.get(), CURLOPT_TIMEOUT, params.timeout);
+ }
+ if (params.max_size > 0) {
+ curl_easy_setopt(curl.get(), CURLOPT_MAXFILESIZE, params.max_size);
+ }
+ http_headers.ptr = curl_slist_append(http_headers.ptr, "User-Agent: llama-cpp");
+ for (const auto & header : params.headers) {
+ http_headers.ptr = curl_slist_append(http_headers.ptr, header.c_str());
+ }
+ curl_easy_setopt(curl.get(), CURLOPT_HTTPHEADER, http_headers.ptr);
+
+ CURLcode res = curl_easy_perform(curl.get());
+
+ if (res != CURLE_OK) {
+ std::string error_msg = curl_easy_strerror(res);
+ throw std::runtime_error("error: cannot make GET request: " + error_msg);
+ }
+
+ long res_code;
+ curl_easy_getinfo(curl.get(), CURLINFO_RESPONSE_CODE, &res_code);
+
+ return { res_code, std::move(res_buffer) };
+}
+
+#elif defined(LLAMA_USE_HTTPLIB)
class ProgressBar {
static inline std::mutex mutex;
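
Review note: the comments in `llama_download_hide_password_in_url` describe three capture groups, but the pattern as shown uses only non-capturing `(?:...)` groups, so `match[1]` and `match[3]` would be empty. The sketch below is an assumption about the intended behavior, not the code from this diff: it uses capturing groups and makes the credentials part mandatory, so URLs without credentials come back unchanged. The name `mask_url_credentials` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical rewrite for illustration: mask "user[:password]@" in a URL.
static std::string mask_url_credentials(const std::string & url) {
    // group 1: scheme including "://"
    // group 2: user[:password]@ (required, so plain URLs do not match)
    // group 3: rest of the URL (host and path)
    static const std::regex url_regex(R"(^([A-Za-z][A-Za-z0-9+.-]*://)([^/@]+@)(.*)$)");
    std::smatch match;
    if (std::regex_match(url, match, url_regex)) {
        return match[1].str() + "********@" + match[3].str();
    }
    return url; // no credentials found, or not a URL of the expected shape
}
```

Because `[^/@]+` cannot cross a `/`, an `@` that appears later in the path is not mistaken for a credentials separator.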
@@ -313,11 +617,9 @@ static bool common_pull_file(httplib::Client & cli,
}
// download one single file from remote URL to local path
-// returns status code or -1 on error
-static int common_download_file_single_online(const std::string & url,
+static bool common_download_file_single_online(const std::string & url,
const std::string & path,
- const std::string & bearer_token,
- const common_header_list & custom_headers) {
+ const std::string & bearer_token) {
static const int max_attempts = 3;
static const int retry_delay_seconds = 2;
@@ -327,9 +629,6 @@ static int common_download_file_single_online(const std::string & url,
if (!bearer_token.empty()) {
default_headers.insert({"Authorization", "Bearer " + bearer_token});
}
- for (const auto & h : custom_headers) {
- default_headers.emplace(h.first, h.second);
- }
cli.set_default_headers(default_headers);
const bool file_exists = std::filesystem::exists(path);
@@ -348,10 +647,8 @@ static int common_download_file_single_online(const std::string & url,
LOG_WRN("%s: HEAD invalid http status code received: %d\n", __func__, head ? head->status : -1);
if (file_exists) {
LOG_INF("%s: Using cached file (HEAD failed): %s\n", __func__, path.c_str());
- return 304; // 304 Not Modified - fake cached response
+ return true;
}
- return head->status; // cannot use cached file, return raw status code
- // TODO: maybe retry only on certain codes
}
std::string etag;
@@ -383,12 +680,12 @@ static int common_download_file_single_online(const std::string & url,
if (file_exists) {
if (!should_download_from_scratch) {
LOG_INF("%s: using cached file: %s\n", __func__, path.c_str());
- return 304; // 304 Not Modified - fake cached response
+ return true;
}
LOG_WRN("%s: deleting previous downloaded file: %s\n", __func__, path.c_str());
if (remove(path.c_str()) != 0) {
LOG_ERR("%s: unable to delete file: %s\n", __func__, path.c_str());
- return -1;
+ return false;
}
}
@@ -400,7 +697,7 @@ static int common_download_file_single_online(const std::string & url,
existing_size = std::filesystem::file_size(path_temporary);
} else if (remove(path_temporary.c_str()) != 0) {
LOG_ERR("%s: unable to delete file: %s\n", __func__, path_temporary.c_str());
- return -1;
+ return false;
}
}
@@ -421,16 +718,15 @@ static int common_download_file_single_online(const std::string & url,
if (std::rename(path_temporary.c_str(), path.c_str()) != 0) {
LOG_ERR("%s: unable to rename file: %s to %s\n", __func__, path_temporary.c_str(), path.c_str());
- return -1;
+ return false;
}
if (!etag.empty()) {
write_etag(path, etag);
}
-
- return head->status; // TODO: use actual GET status?
+ break;
}
- return -1; // max attempts reached
+ return true;
}
 std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url,
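
Review note: both download variants retry up to `max_attempts = 3` times with `retry_delay_seconds = 2`, and the curl hunk computes the wait as `std::pow(retry_delay_seconds, i) * 1000` milliseconds. As far as the curl hunk shows, a failed final attempt hits `continue`, leaves the loop, and reaches `return true`. The sketch below isolates the delay formula and shows a retry loop that reports failure when every attempt fails. `backoff_delay_ms` and `retry_with_backoff` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <thread>

// Delay before the retry that follows 0-based attempt `attempt`:
// base_seconds^attempt seconds, expressed in milliseconds.
static int backoff_delay_ms(int base_seconds, int attempt) {
    return static_cast<int>(std::pow(base_seconds, attempt) * 1000);
}

// Hypothetical generic wrapper: call op() until it succeeds or the
// attempts are exhausted; returns false if all attempts fail.
template <typename F>
static bool retry_with_backoff(F && op, int max_attempts, int base_seconds) {
    assert(max_attempts > 0);
    for (int i = 0; i < max_attempts; ++i) {
        if (op()) {
            return true;
        }
        if (i + 1 < max_attempts) {
            // only sleep when another attempt will follow
            std::this_thread::sleep_for(std::chrono::milliseconds(backoff_delay_ms(base_seconds, i)));
        }
    }
    return false;
}
```

With a base of 2 seconds the waits are 1000 ms and 2000 ms for a three-attempt run.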
@@ -438,9 +734,13 @@ std::pair<long, std::vector<char>> common_remote_get_content(const std::string
auto [cli, parts] = common_http_client(url);
httplib::Headers headers = {{"User-Agent", "llama-cpp"}};
-
for (const auto & header : params.headers) {
- headers.emplace(header.first, header.second);
+ size_t pos = header.find(':');
+ if (pos != std::string::npos) {
+ headers.emplace(header.substr(0, pos), header.substr(pos + 1));
+ } else {
+ headers.emplace(header, "");
+ }
}
if (params.timeout > 0) {
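
Review note: with headers stored as single `"Name: value"` strings, the httplib path splits on the first `:` and passes `header.substr(pos + 1)` as the value, which keeps the space after the colon. A sketch of the same split that also drops leading whitespace from the value is shown below. `split_header` is a hypothetical name, and the trimming is an addition that the diff does not perform.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: split "Name: value" at the first ':'.
// A string without ':' becomes a header name with an empty value.
static std::pair<std::string, std::string> split_header(const std::string & header) {
    const std::size_t pos = header.find(':');
    if (pos == std::string::npos) {
        return {header, ""};
    }
    std::size_t start = pos + 1;
    // skip optional whitespace after the colon (not done in the diff)
    while (start < header.size() && (header[start] == ' ' || header[start] == '\t')) {
        ++start;
    }
    return {header.substr(0, pos), header.substr(start)};
}
```

Splitting at the first colon keeps values such as `Bearer abc:def` intact.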
@@ -465,45 +765,36 @@ std::pair<long, std::vector<char>> common_remote_get_content(const std::string
return { res->status, std::move(buf) };
}
-int common_download_file_single(const std::string & url,
- const std::string & path,
- const std::string & bearer_token,
- bool offline,
- const common_header_list & headers) {
+#endif // LLAMA_USE_CURL
+
+#if defined(LLAMA_USE_CURL) || defined(LLAMA_USE_HTTPLIB)
+
+static bool common_download_file_single(const std::string & url,
+ const std::string & path,
+ const std::string & bearer_token,
+ bool offline) {
if (!offline) {
- return common_download_file_single_online(url, path, bearer_token, headers);
+ return common_download_file_single_online(url, path, bearer_token);
}
if (!std::filesystem::exists(path)) {
LOG_ERR("%s: required file is not available in cache (offline mode): %s\n", __func__, path.c_str());
- return -1;
+ return false;
}
LOG_INF("%s: using cached file (offline mode): %s\n", __func__, path.c_str());
- return 304; // Not Modified - fake cached response
+ return true;
}
// download multiple files from remote URLs to local paths
 // the input is a vector of pairs <url, path>
-static bool common_download_file_multiple(const std::vector<std::pair<std::string, std::string>> & urls,
- const std::string & bearer_token,
- bool offline,
- const common_header_list & headers) {
+static bool common_download_file_multiple(const std::vector<std::pair<std::string, std::string>> & urls, const std::string & bearer_token, bool offline) {
 // Prepare download in parallel
 std::vector<std::future<bool>> futures_download;
- futures_download.reserve(urls.size());
-
for (auto const & item : urls) {
- futures_download.push_back(
- std::async(
- std::launch::async,
- [&bearer_token, offline, &headers](const std::pair<std::string, std::string> & it) -> bool {
- const int http_status = common_download_file_single(it.first, it.second, bearer_token, offline, headers);
- return is_http_status_ok(http_status);
- },
- item
- )
- );
+ futures_download.push_back(std::async(std::launch::async, [bearer_token, offline](const std::pair<std::string, std::string> & it) -> bool {
+ return common_download_file_single(it.first, it.second, bearer_token, offline);
+ }, item));
}
// Wait for all downloads to complete
@@ -516,18 +807,17 @@ static bool common_download_file_multiple(const std::vector(hf_repo_with_tag, ':');
+ std::string tag = parts.size() > 1 ? parts.back() : "latest";
+ std::string hf_repo = parts[0];
 if (string_split<std::string>(hf_repo, '/').size() != 2) {
 throw std::invalid_argument("error: invalid HF repo format, expected <user>/<model>[:quant]\n");
+ }
std::string url = get_model_endpoint() + "v2/" + hf_repo + "/manifests/" + tag;
// headers
- common_header_list headers = custom_headers;
- headers.push_back({"Accept", "application/json"});
+ std::vector<std::string> headers;
+ headers.push_back("Accept: application/json");
if (!bearer_token.empty()) {
- headers.push_back({"Authorization", "Bearer " + bearer_token});
+ headers.push_back("Authorization: Bearer " + bearer_token);
}
// Important: the User-Agent must be "llama-cpp" to get the "ggufFile" field in the response
// User-Agent header is already set in common_remote_get_content, no need to set it here
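
Review note: this hunk inlines the repo/tag split that `common_download_split_repo_tag` used to provide. The repo is the text before the first `:`, the tag is the last `:`-separated part, and the tag defaults to `latest`. A simplified standalone sketch using `std::string` searches instead of the project's `string_split` is shown below. `split_repo_tag` is a hypothetical name, and the slash check is an approximation of the original `size() != 2` test.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: "user/model:tag" -> {"user/model", "tag"},
// "user/model" -> {"user/model", "latest"}.
static std::pair<std::string, std::string> split_repo_tag(const std::string & repo_with_tag) {
    const std::size_t first_colon = repo_with_tag.find(':');
    const std::size_t last_colon  = repo_with_tag.rfind(':');
    const std::string repo = repo_with_tag.substr(0, first_colon);
    const std::string tag  = last_colon == std::string::npos
                                 ? "latest"
                                 : repo_with_tag.substr(last_colon + 1);
    // require exactly one '/' with non-empty text on both sides
    const std::size_t slash = repo.find('/');
    if (slash == std::string::npos || slash == 0 || slash + 1 == repo.size() ||
        repo.find('/', slash + 1) != std::string::npos) {
        throw std::invalid_argument("error: invalid HF repo format, expected <user>/<model>[:quant]");
    }
    return {repo, tag};
}
```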
@@ -661,7 +952,7 @@ common_hf_file_res common_get_hf_file(const std::string & hf_repo_with_tag,
} else if (res_code == 401) {
throw std::runtime_error("error: model is private or does not exist; if you are accessing a gated model, please provide a valid HF token");
} else {
- throw std::runtime_error(string_format("error from HF API (%s), response code: %ld, data: %s", url.c_str(), res_code, res_str.c_str()));
+ throw std::runtime_error(string_format("error from HF API, response code: %ld, data: %s", res_code, res_str.c_str()));
}
// check response
@@ -740,10 +1031,9 @@ std::string common_docker_resolve_model(const std::string & docker) {
const std::string url_prefix = "https://registry-1.docker.io/v2/" + repo;
std::string manifest_url = url_prefix + "/manifests/" + tag;
common_remote_params manifest_params;
- manifest_params.headers.push_back({"Authorization", "Bearer " + token});
- manifest_params.headers.push_back({"Accept",
- "application/vnd.docker.distribution.manifest.v2+json,application/vnd.oci.image.manifest.v1+json"
- });
+ manifest_params.headers.push_back("Authorization: Bearer " + token);
+ manifest_params.headers.push_back(
+ "Accept: application/vnd.docker.distribution.manifest.v2+json,application/vnd.oci.image.manifest.v1+json");
auto manifest_res = common_remote_get_content(manifest_url, manifest_params);
if (manifest_res.first != 200) {
throw std::runtime_error("Failed to get Docker manifest, HTTP code: " + std::to_string(manifest_res.first));
@@ -780,8 +1070,7 @@ std::string common_docker_resolve_model(const std::string & docker) {
std::string local_path = fs_get_cache_file(model_filename);
const std::string blob_url = url_prefix + "/blobs/" + gguf_digest;
- const int http_status = common_download_file_single(blob_url, local_path, token, false, {});
- if (!is_http_status_ok(http_status)) {
+ if (!common_download_file_single(blob_url, local_path, token, false)) {
throw std::runtime_error("Failed to download Docker Model");
}
@@ -795,11 +1084,11 @@ std::string common_docker_resolve_model(const std::string & docker) {
#else
-common_hf_file_res common_get_hf_file(const std::string &, const std::string &, bool, const common_header_list &) {
+common_hf_file_res common_get_hf_file(const std::string &, const std::string &, bool) {
throw std::runtime_error("download functionality is not enabled in this build");
}
-bool common_download_model(const common_params_model &, const std::string &, bool, const common_header_list &) {
+bool common_download_model(const common_params_model &, const std::string &, bool) {
throw std::runtime_error("download functionality is not enabled in this build");
}
@@ -807,15 +1096,7 @@ std::string common_docker_resolve_model(const std::string &) {
throw std::runtime_error("download functionality is not enabled in this build");
}
-int common_download_file_single(const std::string &,
- const std::string &,
- const std::string &,
- bool,
- const common_header_list &) {
- throw std::runtime_error("download functionality is not enabled in this build");
-}
-
-#endif // defined(LLAMA_USE_HTTPLIB)
+#endif // LLAMA_USE_CURL || LLAMA_USE_HTTPLIB
 std::vector<common_cached_model_info> common_list_cached_models() {
 std::vector<common_cached_model_info> models;
diff --git a/common/download.h b/common/download.h
index 1c1d8e6..d1321e6 100644
--- a/common/download.h
+++ b/common/download.h
@@ -1,27 +1,12 @@
#pragma once
#include
-#include
struct common_params_model;
-using common_header = std::pair<std::string, std::string>;
-using common_header_list = std::vector<common_header>;
-
-struct common_remote_params {
- common_header_list headers;
- long timeout = 0; // in seconds, 0 means no timeout
- long max_size = 0; // unlimited if 0
-};
-
-// get remote file content, returns <http_code, raw_response_body>
-std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url, const common_remote_params & params);
-
-// split HF repo with tag into <repo, tag>
-// for example: "user/model:tag" -> <"user/model", "tag">
-// if tag is not present, default to "latest"
-// example: "user/model" -> <"user/model", "latest">
-std::pair<std::string, std::string> common_download_split_repo_tag(const std::string & hf_repo_with_tag);
+//
+// download functionalities
+//
struct common_cached_model_info {
std::string manifest_path;
@@ -56,29 +41,17 @@ struct common_hf_file_res {
common_hf_file_res common_get_hf_file(
const std::string & hf_repo_with_tag,
const std::string & bearer_token,
- bool offline,
- const common_header_list & headers = {}
-);
+ bool offline);
// returns true if download succeeded
bool common_download_model(
const common_params_model & model,
const std::string & bearer_token,
- bool offline,
- const common_header_list & headers = {}
-);
+ bool offline);
// returns list of cached models
 std::vector<common_cached_model_info> common_list_cached_models();
-// download single file from url to local path
-// returns status code or -1 on error
-int common_download_file_single(const std::string & url,
- const std::string & path,
- const std::string & bearer_token,
- bool offline,
- const common_header_list & headers = {});
-
// resolve and download model from Docker registry
// return local path to downloaded model file
std::string common_docker_resolve_model(const std::string & docker);
diff --git a/common/llguidance.cpp b/common/llguidance.cpp
index d58f147..adce620 100644
--- a/common/llguidance.cpp
+++ b/common/llguidance.cpp
@@ -106,16 +106,12 @@ static void llama_sampler_llg_free(llama_sampler * smpl) {
}
static llama_sampler_i llama_sampler_llg_i = {
- /* .name = */ llama_sampler_llg_name,
- /* .accept = */ llama_sampler_llg_accept_impl,
- /* .apply = */ llama_sampler_llg_apply,
- /* .reset = */ llama_sampler_llg_reset,
- /* .clone = */ llama_sampler_llg_clone,
- /* .free = */ llama_sampler_llg_free,
- /* .backend_init = */ NULL,
- /* .backend_accept = */ NULL,
- /* .backend_apply = */ NULL,
- /* .backend_set_input = */ NULL,
+ /* .name = */ llama_sampler_llg_name,
+ /* .accept = */ llama_sampler_llg_accept_impl,
+ /* .apply = */ llama_sampler_llg_apply,
+ /* .reset = */ llama_sampler_llg_reset,
+ /* .clone = */ llama_sampler_llg_clone,
+ /* .free = */ llama_sampler_llg_free,
};
static size_t llama_sampler_llg_tokenize_fn(const void * user_data, const uint8_t * bytes, size_t bytes_len,
diff --git a/common/preset.cpp b/common/preset.cpp
index 57ccd00..e2fc18c 100644
--- a/common/preset.cpp
+++ b/common/preset.cpp
@@ -16,48 +16,6 @@ static std::string rm_leading_dashes(const std::string & str) {
return str.substr(pos);
}
-// only allow a subset of args for remote presets for security reasons
-// do not add more args unless absolutely necessary
-// args that output to files are strictly prohibited
-static std::set<std::string> get_remote_preset_whitelist(const std::map<std::string, common_arg> & key_to_opt) {
- static const std::set<std::string> allowed_options = {
- "model-url",
- "hf-repo",
- "hf-repo-draft",
- "hf-repo-v", // vocoder
- "hf-file-v", // vocoder
- "mmproj-url",
- "pooling",
- "jinja",
- "batch-size",
- "ubatch-size",
- "cache-reuse",
- "chat-template-kwargs",
- "mmap",
- // note: sampling params are automatically allowed by default
- // negated args will be added automatically if the positive arg is specified above
- };
-
- std::set<std::string> allowed_keys;
-
- for (const auto & it : key_to_opt) {
- const std::string & key = it.first;
- const common_arg & opt = it.second;
- if (allowed_options.find(key) != allowed_options.end() || opt.is_sparam) {
- allowed_keys.insert(key);
- // also add variant keys (args without leading dashes and env vars)
- for (const auto & arg : opt.get_args()) {
- allowed_keys.insert(rm_leading_dashes(arg));
- }
- for (const auto & env : opt.get_env()) {
- allowed_keys.insert(env);
- }
- }
- }
-
- return allowed_keys;
-}
-
 std::vector<std::string> common_preset::to_args(const std::string & bin_path) const {
 std::vector<std::string> args;
@@ -163,29 +121,6 @@ void common_preset::merge(const common_preset & other) {
}
}
-void common_preset::apply_to_params(common_params & params) const {
- for (const auto & [opt, val] : options) {
- // apply each option to params
- if (opt.handler_string) {
- opt.handler_string(params, val);
- } else if (opt.handler_int) {
- opt.handler_int(params, std::stoi(val));
- } else if (opt.handler_bool) {
- opt.handler_bool(params, common_arg_utils::is_truthy(val));
- } else if (opt.handler_str_str) {
- // not supported yet
- throw std::runtime_error(string_format(
- "%s: option with two values is not supported yet",
- __func__
- ));
- } else if (opt.handler_void) {
- opt.handler_void(params);
- } else {
- GGML_ABORT("unknown handler type");
- }
- }
-}
-
static std::map<std::string, std::map<std::string, std::string>> parse_ini_from_file(const std::string & path) {
std::map<std::string, std::map<std::string, std::string>> parsed;
@@ -295,16 +230,10 @@ static std::string parse_bool_arg(const common_arg & arg, const std::string & ke
return value;
}
-common_preset_context::common_preset_context(llama_example ex, bool only_remote_allowed)
+common_preset_context::common_preset_context(llama_example ex)
: ctx_params(common_params_parser_init(default_params, ex)) {
common_params_add_preset_options(ctx_params.options);
key_to_opt = get_map_key_opt(ctx_params);
-
- // setup allowed keys if only_remote_allowed is true
- if (only_remote_allowed) {
- filter_allowed_keys = true;
- allowed_keys = get_remote_preset_whitelist(key_to_opt);
- }
}
common_presets common_preset_context::load_from_ini(const std::string & path, common_preset & global) const {
@@ -320,18 +249,7 @@ common_presets common_preset_context::load_from_ini(const std::string & path, co
}
LOG_DBG("loading preset: %s\n", preset.name.c_str());
for (const auto & [key, value] : section.second) {
- if (key == "version") {
- // skip version key (reserved for future use)
- continue;
- }
-
LOG_DBG("option: %s = %s\n", key.c_str(), value.c_str());
- if (filter_allowed_keys && allowed_keys.find(key) == allowed_keys.end()) {
- throw std::runtime_error(string_format(
- "option '%s' is not allowed in remote presets",
- key.c_str()
- ));
- }
if (key_to_opt.find(key) != key_to_opt.end()) {
const auto & opt = key_to_opt.at(key);
if (is_bool_arg(opt)) {
@@ -341,10 +259,7 @@ common_presets common_preset_context::load_from_ini(const std::string & path, co
}
LOG_DBG("accepted option: %s = %s\n", key.c_str(), preset.options[opt].c_str());
} else {
- throw std::runtime_error(string_format(
- "option '%s' not recognized in preset '%s'",
- key.c_str(), preset.name.c_str()
- ));
+ // TODO: maybe warn about unknown key?
}
}
diff --git a/common/preset.h b/common/preset.h
index 11ba6ef..3a84d1b 100644
--- a/common/preset.h
+++ b/common/preset.h
@@ -6,7 +6,6 @@
#include
#include
#include