
# 20.4 Local Deployment of Large Language Models

## llama.cpp

llama.cpp is written in C/C++ and aims to run large-model inference on a wide range of hardware with minimal configuration. It relies on **misc/ggml** as its underlying tensor computation library; this dependency is installed automatically together with **misc/llama-cpp**. To disable GPU acceleration in a CPU-only environment, turn off the `VULKAN` option (`VULKAN=OFF`) in **misc/ggml**, not in `llama-cpp` itself.

### Installation

* Install using pkg:

```sh
# pkg install llama-cpp
```

* Install using Ports:

```sh
# cd /usr/ports/misc/llama-cpp/
# make install clean
```

* View the post-installation message:

```sh
# pkg info -D llama-cpp
```
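
To confirm the result, a short script along these lines can check that `llama-cli` is on the `PATH` and report how the ggml dependency was built. It is only a sketch: the helper name `ggml_option` is made up for this example, and it assumes the installed package is called `ggml` and exposes a build option named `VULKAN`, as described above.

```sh
#!/bin/sh
# Sketch: confirm the install and show how misc/ggml was built.

# Print the value (on/off) of one build option of the installed ggml package.
# Assumption: the package is named "ggml"; %Ok/%Ov are pkg-query(8) option fields.
ggml_option() {
    pkg query '%Ok=%Ov' ggml | awk -F= -v name="$1" '$1 == name { print $2 }'
}

if command -v llama-cli >/dev/null 2>&1; then
    llama-cli --version
    echo "VULKAN option in ggml: $(ggml_option VULKAN)"
else
    echo "llama-cli not found in PATH"
fi
```

If the reported value is not the one you want, change the option in the **misc/ggml** port and rebuild it.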

### Deploying the Qwen Large Model

GGUF is a file format that stores the information needed to run a model. llama.cpp requires models to be stored in this format.

The [Hugging Face](https://huggingface.co/models?sort=trending&search=llama+gguf) platform hosts a large number of GGUF-format models compatible with llama.cpp. Searching for the keyword "llama gguf" lists them.

[Qwen](https://huggingface.co/Qwen/collections) is a family of large language models developed by Alibaba Cloud. The following example uses Qwen/Qwen3-0.6B-GGUF:

```sh
$ llama-cli -hf Qwen/Qwen3-0.6B-GGUF:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 -c 1024 -n 256 --no-context-shift
```

| Parameter | Description |
| --------- | ----------- |
| `-hf Qwen/Qwen3-0.6B-GGUF:Q8_0` | Downloads the model from the Hugging Face Hub; the `:Q8_0` suffix selects the 8-bit quantized weights |
| `--jinja` | Uses the Jinja chat template shipped with the model to format the conversation |
| `--color` | Displays colored output in the terminal, making it easier to distinguish user input from model-generated text |
| `-ngl` | Number of model layers to offload to the GPU (n-gpu-layers); larger values use more GPU memory |
| `-fa` | Enables Flash Attention, which can improve inference speed and reduce VRAM usage |
| `-sm row` | Sets the multi-GPU split mode; `row` splits tensors across GPUs by row |
| `--temp` | Sampling temperature; higher values make the generated text more random |
| `--top-k` | Samples only from the k most probable tokens; smaller values make the output less diverse |
| `--top-p` | Nucleus sampling: samples only from the smallest set of top tokens whose cumulative probability reaches the threshold |
| `--min-p` | Discards tokens whose probability is below this fraction of the most probable token's probability; `0` disables the filter |
| `--presence-penalty` | Penalizes tokens that have already appeared in the text, reducing repetition |
| `-c` | Context window size: the maximum number of tokens (prompt plus generated text) the model can attend to |
| `-n` | Maximum number of tokens to generate per response |
| `--no-context-shift` | Disables context shifting, so older tokens are not discarded to make room once the context window is full |
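
How `--top-k`, `--top-p`, and `--min-p` narrow the candidate list can be illustrated with a toy filter. The sketch below is not llama.cpp's implementation: the function names and the four-token distribution are invented for illustration, and it skips the renormalization a real sampler performs between steps. It only shows which tokens survive each cut.

```sh
#!/bin/sh
# Toy illustration of top-k, top-p and min-p filtering (not llama.cpp code).
# Input on stdin: "token probability" lines, sorted from most to least likely.
# Usage: filter_candidates TOP_K TOP_P MIN_P
filter_candidates() {
    awk -v k="$1" -v p="$2" -v minp="$3" '
        NR == 1          { max = $2 }   # probability of the most likely token
        NR > k           { exit }       # top-k: keep at most k candidates
        $2 < minp * max  { exit }       # min-p: drop tokens far below the best one
                         { print $1; cum += $2 }
        cum >= p         { exit }       # top-p: stop once cumulative mass reaches p
    '
}

# Hypothetical next-token distribution (remaining probability mass omitted).
candidates() {
    printf '%s\n' 'the 0.5' 'a 0.25' 'an 0.125' 'zebra 0.0625'
}

candidates | filter_candidates 20 0.8 0     # top-p cuts off "zebra"
candidates | filter_candidates 2 1.0 0      # top-k keeps two tokens
candidates | filter_candidates 20 1.0 0.3   # min-p drops tokens below 0.3 * 0.5
```

After filtering, the sampler draws one token from the survivors, with `--temp` controlling how strongly it favors the most likely ones.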

For detailed parameter descriptions, refer to the [llama.cpp deployment guide in the Qwen official documentation](https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html).

The output is as follows:

```sh
Downloading Qwen3-0.6B-Q8_0.gguf ─────────────────────────────────── 100%

Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b9033-unknown
model      : Qwen/Qwen3-0.6B-GGUF:Q8_0
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


>> 介绍一下你自己

[Start thinking]
好的，用户问我要介绍一下自己。我需要先确认用户的需求是什么。可能他们想了解我的能力、经验或特点？或者只是想测试是否能回答问题？

接下来，我要考虑用户的身份。可能是学生、职场人士，或者刚接触AI的人。不同背景的用户对介绍的内容可能会有不同侧重。例如，学生可能更关注学习和技能，而职场人士可能关注实际应用。

此后，我需要确保回答友好且实用。要涵盖基本的信息，但不应过于冗长。同时，保持自然的口语化，避免使用专业术语过多，让信息易于理解。

另外，用户可能没有明确说出深层需求，例如他们可能想通过我的介绍找到合适的资源或学习方向。因此，在回答中可以提到一些可能性，帮助用户做出决策。

最后，检查回答是否准确、简洁，并且符合用户期望的语气。确保没有遗漏关键点，同时保持整体连贯性和自然流畅。

[End thinking]

我是一个AI助手，专注于提供帮助和解答问题。我可以协助你完成学习、工作或生活中的各种任务。如果你有任何问题或需要帮助，请随时告诉我！

[ Prompt: 52.0 t/s | Generation: 25.2 t/s ]

> /exit
```

Type `/exit` or press **Ctrl** + **C** to exit. To use it again, simply run the same command.

The model will be cached at **\~/.cache/huggingface/hub**.
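
Since rerunning the model means retyping the same long command, it may be convenient to keep it in a small wrapper script. The following is a minimal sketch, not part of llama.cpp: the function name `qwen` and the variables `MODEL`, `CTX`, `MAX_TOKENS`, and `RUN` are arbitrary names chosen for this example, and the flags are copied from the command shown above.

```sh
#!/bin/sh
# Sketch of a wrapper around the llama-cli command shown above.
# All variable names are arbitrary; override them in the environment as needed.
MODEL="${MODEL:-Qwen/Qwen3-0.6B-GGUF:Q8_0}"
CTX="${CTX:-1024}"                 # -c: context window size
MAX_TOKENS="${MAX_TOKENS:-256}"    # -n: maximum tokens per response

# "$@" is an optional prefix command: "qwen echo" prints the command line
# instead of running it, while plain "qwen" starts llama-cli.
qwen() {
    "$@" llama-cli -hf "$MODEL" --jinja --color -ngl 99 -fa -sm row \
        --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 \
        -c "$CTX" -n "$MAX_TOKENS" --no-context-shift
}

if [ "${RUN:-0}" = 1 ]; then
    qwen          # start the interactive session
else
    qwen echo     # dry run: only show the command
fi
```

Saved under a hypothetical name such as `qwen.sh`, running `RUN=1 CTX=4096 sh qwen.sh` would start a session with a larger context window; without `RUN=1` the script only prints the command it would run.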

## Ollama

Ollama is a tool for running large language models, primarily written in Go and C.

### Installation

* Install using pkg:

```sh
# pkg install ollama
```

* Install using Ports:

```sh
# cd /usr/ports/misc/ollama/
# make install clean
```

* View the post-installation message:

```sh
# pkg info -D ollama
```

### Service Management

Enable the service and set it to start on boot:

```sh
# service ollama enable
```

Start the service immediately:

```sh
# service ollama start
```
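
To check that the service actually came up, one option is to query its HTTP API. The sketch below assumes the server listens on Ollama's default address `127.0.0.1:11434`; `OLLAMA_ADDR` and `ollama_url` are names invented for this example, so adjust the address if your configuration differs.

```sh
#!/bin/sh
# Sketch: check whether the Ollama server answers HTTP requests.
# Assumption: default listen address; OLLAMA_ADDR is a made-up variable name.
OLLAMA_ADDR="${OLLAMA_ADDR:-127.0.0.1:11434}"

# Build a URL for an API path such as /api/version.
ollama_url() {
    printf 'http://%s%s\n' "$OLLAMA_ADDR" "$1"
}

# fetch(1) is part of the FreeBSD base system: -q is quiet, -o - writes to stdout.
if command -v fetch >/dev/null 2>&1 &&
   fetch -qo - "$(ollama_url /api/version)" 2>/dev/null; then
    printf '\nOllama is answering on %s\n' "$OLLAMA_ADDR"
else
    echo "Ollama is not reachable on $OLLAMA_ADDR"
fi
```

`service ollama status` reports whether the daemon process is running; the HTTP check additionally confirms that it accepts requests.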

### Deploying DeepSeek-R1

Download (on first use) and run the 1.5B-parameter DeepSeek-R1 model:

```sh
$ ollama run deepseek-r1:1.5b
```

For more models, see the [Ollama library](https://ollama.com/library/).

The output of the above command is as follows:

```sh
$ ollama run deepseek-r1:1.5b
pulling manifest
pulling aabd4debf0c8: 100% █████████████████████████████████ 1.1 GB
pulling c5ad996bda6e: 100% █████████████████████████████████  556 B
pulling 6e4c38e1172f: 100% █████████████████████████████████ 1.1 KB
pulling f4d24e9138dd: 100% █████████████████████████████████  148 B
pulling a85fe2a2e58e: 100% █████████████████████████████████  487 B
verifying sha256 digest
writing manifest
success
>>> 你好，世界！
你好！很高兴见到你。有什么我可以帮助你的吗？无论是学习、生活还是其他方面的问题，我都很乐意
解答和分享。如果你有任何想法或需要讨论的点，随时告诉我。我会用最真诚的态度去为你服务
！
>>> 你是谁
您好！我是由中国的深度求索（DeepSeek）公司开发的智能助手DeepSeek-R1。如您有任何问题，我
会尽我所能为您提供帮助。

>>>
... Press Enter to send
```

The larger the number of parameters, the larger the model size typically is. The default Ollama storage location is **\~/.ollama/models**.
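
The relationship between parameter count and file size can be approximated with simple arithmetic: size in GB is roughly the number of parameters (in billions) times the bits per weight, divided by 8. The sketch below applies that rule of thumb and then reports the disk usage of the storage directory; `estimate_gb` is a hypothetical helper name, and the bit widths are example inputs rather than properties of any particular model file.

```sh
#!/bin/sh
# Rough estimate: file size in GB ~= parameters (billions) * bits per weight / 8.
# Ignores metadata and tensors stored at a different precision.
estimate_gb() {
    awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.2f\n", params * bits / 8 }'
}

estimate_gb 1.5 4     # 1.5B parameters at 4 bits per weight
estimate_gb 1.5 16    # the same parameter count at 16 bits per weight

# Show how much disk space downloaded models occupy, if the directory exists.
models_dir="$HOME/.ollama/models"
if [ -d "$models_dir" ]; then
    du -sh "$models_dir"
else
    echo "$models_dir not found"
fi
```

Real files tend to be somewhat larger than this estimate, because quantized formats store scale factors and often keep some tensors at higher precision.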

Type `/bye` or press **Ctrl** + **D** to exit. To use it again, simply run the same command.

## Claude Code

Claude Code is an AI programming assistant and automated programming tool that can read and understand complete codebases, edit files, run commands, and work with development tools. It runs in terminal, IDE, desktop application, and browser environments, helping to develop features, fix bugs, and automate development tasks. Claude Code is not free: it requires either a paid Claude subscription or API usage billing.

Claude Code's source code is primarily written in TypeScript and runs on the Bun runtime.

### Installation

* Install using pkg:

```sh
# pkg install claude-code
```

* Install using Ports:

```sh
# cd /usr/ports/misc/claude-code/
# make install clean
```

### Using Claude Code

```sh
$ claude
Welcome to Claude Code v2.1.110
………………………………………………………………………………………………………………………………………………………………

     *                                       █████▓▓░
                                 *         ███▓░     ░░
            ░░░░░░                        ███▓░
    ░░░   ░░░░░░░░░░                      ███▓░
   ░░░░░░░░░░░░░░░░░░░    *                ██▓░░      ▓
                                             ░▓▓███▓▓░
 *                                 ░░░░
                                 ░░░░░░░░
                               ░░░░░░░░░░░░░░░░
       █████████                                        *
      ██▄█████▄██                        *
       █████████      *
…………………█ █   █ █……………………………………………………………………………………………………………………

 Let's get started.

 Choose the text style that looks best with your terminal
 To change this later, run /theme

 > 1. Dark mode ✔
   2. Light mode
   3. Dark mode (colorblind-friendly)
   4. Light mode (colorblind-friendly)
   5. Dark mode (ANSI colors only)
   6. Light mode (ANSI colors only)

 ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
  1  function greet() {
  2 -  console.log("Hello, World!");
  2 +  console.log("Hello, Claude!");
  3  }
 ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
  Syntax theme: Monokai Extended (ctrl+t to disable)
```

Select a theme at this prompt.

```sh
 Claude Code can be used with your Claude subscription or billed based on API usage through
 your Console account.

 Select login method:

 > 1. Claude account with subscription · Pro, Max, Team, or Enterprise

   2. Anthropic Console account · API usage billing

   3. 3rd-party platform · Amazon Bedrock, Microsoft Foundry, or Vertex AI
```

Claude Code requires a subscription or API usage billing. After selecting a login method, sign in to your account:

```sh
 Browser didn't open? Use the url below to sign in (c to copy)

https://platform.claude.com/oauth/authorize?code=true&client_id=9d1c250a-e61b-44d9-88ed-5944d19
62f5e&response_type=code&redirect_uri=https%3A%2F%2Fplatform.claude.com%2Foauth%2Fcode%2Fcallba
ck&scope=org%3Acreate_api_key+user%3Aprofile+user%3Ainference+user%3Asessions%3Aclaude_code+use
r%3Amcp_servers+user%3Afile_upload&code_challenge=elerjEbwwqNNdwh7oGGSpSDZ4qwb8SUV2WrM1VtyQTU&c
ode_challenge_method=S256&state=uI0z-JtwQq9WWw4XVAQPmYK0cxJIXy2Q7vhFY20rUO0	# Copy this URL and open it in a browser


 Paste code here if prompted > ***************************************************************
                               ***********************20rUO0	# This string is copied from the webpage; it must be used after logging in
```

After logging in:

```sh

 Security notes:

 1. Claude can make mistakes
    You should always review Claude's responses, especially when
    running code.

 2. Due to prompt injection risks, only use it with code you trust
    For more details see:
    https://code.claude.com/docs/en/security

 Press Enter to continue…

╭─── Claude Code v2.1.110 ────────────────────────────────────────────────────────────────────╮
│                                                    │ Tips for getting started               │
│                    Welcome back!                   │ Run /init to create a CLAUDE.md file … │
│                                                    │ Note: You have launched claude in you… │
│                       ▐▛███▜▌                      │ ────────────────────────────────────── │
│                      ▝▜█████▛▘                     │ Recent activity                        │
│                        ▘▘ ▝▝                       │ No recent activity                     │
│   Sonnet 4.6 · API Usage Billing · 阅读            │                                        │
│   https://book.bsdcn.org/changel‘s Individual Org  │                                        │
│                     /home/ykla                     │                                        │
╰─────────────────────────────────────────────────────────────────────────────────────────────╯

>

───────────────────────────────────────────────────────────────────────────────────────────────
>
───────────────────────────────────────────────────────────────────────────────────────────────
  ? for shortcuts
```

Once signed in with an active subscription or API billing, you can start using Claude Code. Press **Ctrl** + **C** twice to exit the tool.

## GitHub Copilot CLI

GitHub Copilot CLI is a closed-source project by GitHub. GitHub offers a free tier (2000 code completions and 50 chat requests per month); exceeding the quota or using advanced features requires a paid subscription.

GitHub Copilot CLI integrates an AI programming assistant into the command-line environment, allowing users to write, debug, and understand code through natural language conversation, and integrate with GitHub workflows.

### Installation

* Install using pkg:

```sh
# pkg install github-copilot-cli
```

* Install using Ports:

```sh
# cd /usr/ports/misc/github-copilot-cli/
# make install clean
```

### Using GitHub Copilot CLI

```sh
$ copilot
```

Open <https://github.com/login/device> in a browser and enter the one-time verification code printed by Copilot. Once authorization completes, Copilot is ready to use.

![GitHub Copilot CLI main interface](/files/Y8MugB3t0lz8UuJIARdZ)

Press **Ctrl** + **C** twice to exit the tool.

