persadian/DeepSeek-V4-Flash-GGUF
Abstract
I, Darshani Persadh, present persadian/DeepSeek-V4-Flash-GGUF, a quantized Mixture-of-Experts language model derived from DeepSeek-V4-Flash. Using IQ1_S-XL quantization, the original 500 GB FP8 checkpoint is compressed to 61.5 GB across two shards, a reduction of about 88%, while retaining 284B total parameters, 13B active per token, and a 1M-token context window. The model is distributed as a dual-shard GGUF file compatible with llama.cpp; loading it requires a custom V4-aware fork.
Memory saved = 438.5 GB
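The headline figures can be rechecked with a few lines of arithmetic. The sketch below uses only the sizes and parameter count quoted in the abstract; the assumption that 1 GB means 10^9 bytes, and the resulting bits-per-weight figure, are illustrative and not stated in the original.

```python
# Figures from the abstract. Assumption: decimal gigabytes (1 GB = 1e9 bytes).
ORIGINAL_GB = 500.0     # FP8 checkpoint
QUANTIZED_GB = 61.5     # IQ1_S-XL, both shards combined
TOTAL_PARAMS = 284e9    # total parameter count

saved_gb = ORIGINAL_GB - QUANTIZED_GB
reduction_pct = 100.0 * saved_gb / ORIGINAL_GB
compression_ratio = ORIGINAL_GB / QUANTIZED_GB
# Average storage cost per parameter, including any metadata in the files.
bits_per_weight = QUANTIZED_GB * 1e9 * 8 / TOTAL_PARAMS

print(f"memory saved:      {saved_gb:.1f} GB")
print(f"size reduction:    {reduction_pct:.1f} %")
print(f"compression ratio: {compression_ratio:.2f}x")
print(f"bits per weight:   {bits_per_weight:.2f}")
```

Under these assumptions the average comes out a little under 2 bits per parameter, which is in line with a 1-bit-class quantization that keeps some tensors at higher precision.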
Adoption metrics: 799 downloads within 9 hours of release, reaching 985 downloads within 24 hours, reflecting strong community interest in efficient MoE deployment.
High-level overview
The model tackles memory, compute, and communication challenges typical of large MoE architectures. IQ1_S-XL quantization compresses weights while preserving expert routing fidelity. Unlike the single‑file variant, this distribution uses two shards (50GB + 11.6GB) for compatibility with existing GGUF toolchains.
M_weights (IQ1_S) ≈ 58 GB, M_KV (8k) ≈ 2.4 GB, M_activations ≈ 5 GB
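| Component | Estimated memory |
|---|---|
| Weights | ~58 GB |
| KV cache | ~12 GB |
| Activations | ~8 GB (peak) |
| Recommended total memory | 80–128 GB |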
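As a rough illustration of how the 8k-context estimates combine, the following sketch sums the three components and checks the total against an available memory pool. The function names and the 10% headroom are hypothetical choices made for this example, not part of the original measurements.

```python
def total_memory_gb(weights_gb: float, kv_cache_gb: float, activations_gb: float) -> float:
    """Sum the three resident-memory components, all in GB."""
    return weights_gb + kv_cache_gb + activations_gb


def fits(required_gb: float, available_gb: float, headroom: float = 0.10) -> bool:
    """True if the requirement fits while keeping `headroom` of memory free.

    The 10% default headroom is an assumption for illustration only.
    """
    return required_gb <= available_gb * (1.0 - headroom)


# 8k-context estimates quoted above: weights, KV cache, activations.
required_8k = total_memory_gb(58.0, 2.4, 5.0)
print(f"8k context requirement: {required_8k:.1f} GB")
print("fits in 80 GB:", fits(required_8k, 80.0))
```

The KV cache grows with context length, so the same check should be repeated with a larger KV estimate before raising the context size.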
Profiling: GPU compute & communication
Benchmarks using a custom V4‑aware llama.cpp fork (branch feat/v4-port-cuda) show efficient shard handling and overlap of expert computation.
| Configuration | Tok/s | Communication notes |
|---|---|---|
| CPU only (128GB) | 0.2–0.5 | DRAM bound, no GPU sync |
| RTX 3090 + 80GB | 1–3 | PCIe 4.0, low overhead |
| 2x RTX 3090 (NVLink) | 5–8 | All-reduce ~25 GB/s |
| H100 (80GB) | 15–25 | NVLink switch, expert parallelism |
Architecture & quantization details
Mixture-of-Experts: 256 experts, top-2 routing. IQ1_S-XL is a non-linear quantization method preserving outlier sensitivity for expert gating. The model is split into two GGUF shards: DeepSeek-V4-Flash-IQ1_S-XL-00001-of-00002.gguf (50GB) and 00002-of-00002.gguf (11.6GB).
| Property | Value |
|---|---|
| Original precision | FP8 (500GB, 2‑shard) |
| Quantized format | IQ1_S-XL (GGUF, 2 shards) |
| Context window | 1,048,576 tokens |
| Compression | ~8.13x |
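To make the top-2 routing over 256 experts concrete, here is a minimal sketch of generic top-k gating: pick the k largest router logits and normalize them with a softmax over the selected experts only. This is a textbook formulation and an assumption on my part; the exact gating function used by DeepSeek-V4-Flash is not described in this card, and `top_k_route` is a hypothetical helper name.

```python
import math


def top_k_route(logits, k=2):
    """Return [(expert_index, weight)] for the k largest router logits.

    Weights are a softmax restricted to the selected experts, so they sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]


# Toy router output for one token: 256 experts, two of them clearly preferred.
logits = [0.0] * 256
logits[7] = 2.0
logits[42] = 1.0

routes = top_k_route(logits, k=2)
for expert, weight in routes:
    print(f"expert {expert}: weight {weight:.3f}")
```

Only the selected experts run for a given token, which is why the active parameter count per token is far smaller than the total.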
Primary capability: code generation
HumanEval pass@1 (Python): 67% after quantization (minor variance from single‑file version). Supports completion, debugging, and documentation for multiple languages.
def binary_search(arr, target):
    """Return index of target in sorted array, or -1 if absent."""
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1
Additional tasks: unit test generation, bug detection, refactoring, SQL, and shell scripts.
Throughput benchmarks
| Hardware | Context / batch | Throughput (tok/s) |
|---|---|---|
| RTX 3090 (24GB) | 8k, BS=1 | 2.7 |
| 2x RTX 3090 (NVLink) | 32k, BS=2 | 6.1 |
| H100 (80GB) | 1M tokens (stream) | 6.8 |
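The throughput figures translate directly into wall-clock generation time. The sketch below does that conversion for a 512-token reply; the reply length is an arbitrary example value, and the estimate ignores prompt processing, which can dominate at long contexts.

```python
def generation_seconds(n_tokens: int, tok_per_s: float) -> float:
    """Time to decode n_tokens at a steady decoding rate, in seconds."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return n_tokens / tok_per_s


# Decoding rates copied from the throughput table above.
THROUGHPUT = {
    "RTX 3090 (8k, BS=1)": 2.7,
    "2x RTX 3090 NVLink (32k, BS=2)": 6.1,
    "H100 (1M tokens, stream)": 6.8,
}

REPLY_TOKENS = 512  # hypothetical reply length, chosen for illustration
for name, rate in THROUGHPUT.items():
    print(f"{name}: {generation_seconds(REPLY_TOKENS, rate):.0f} s")
```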
Inference & deployment
Important: This model requires a custom llama.cpp fork with V4 architecture support.
git clone -b feat/v4-port-cuda https://github.com/arishma108/llama.cpp
cd llama.cpp
# CMake build with CUDA enabled; binaries are written to ./build/bin
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# start server with shard detection
./build/bin/llama-server -hf persadian/DeepSeek-V4-Flash-GGUF \
    --jinja --ctx-size 393216 --n-gpu-layers 999
Python (llama-cpp-python with custom build):
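from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="persadian/DeepSeek-V4-Flash-GGUF",
    filename="DeepSeek-V4-Flash-IQ1_S-XL-00001-of-00002.gguf",
    # also download the second shard so the split model can be loaded
    additional_files=["*-00002-of-00002.gguf"],
    n_ctx=8192, n_gpu_layers=35,
)
response = llm.create_chat_completion(messages=[{"role": "user", "content": "Explain MoE routing"}])
print(response["choices"][0]["message"]["content"])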
Docker: docker model run hf.co/persadian/DeepSeek-V4-Flash-GGUF
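Once llama-server is running, it can be queried over its OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server listens on llama.cpp's default address, http://127.0.0.1:8080; the helper names `build_chat_request` and `chat`, the `max_tokens` value, and the timeout are illustrative choices, and the custom fork is assumed to keep the upstream endpoint.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://127.0.0.1:8080", max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt to a running server and return the reply text."""
    request = build_chat_request(prompt, **kwargs)
    # Generous timeout: decoding runs at only a few tokens per second.
    with urllib.request.urlopen(request, timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example call (requires a running server):
# print(chat("Explain MoE routing"))
```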
Validation & integrity
Both shards are verified: GGUF header signature valid, shard detection working (First shard (00001): True; Second shard (00002): True). Model loads successfully on RTX 3090 with 64GB+ RAM. The custom feat/v4-port-cuda branch correctly resolves the V4 attention architecture.
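A similar check can be reproduced locally. The sketch below is not the validation script used for this release; it is a minimal version based on the public GGUF layout (a 4-byte `GGUF` magic followed by a little-endian 32-bit version) and on the `-NNNNN-of-NNNNN.gguf` shard naming seen in the filenames above. The helper names are hypothetical.

```python
import re
import struct

# Matches the split-file suffix, e.g. "-00001-of-00002.gguf".
SHARD_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def parse_shard_name(filename):
    """Return (shard_index, shard_count), or None if the name is not a shard."""
    match = SHARD_RE.search(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def read_gguf_version(path):
    """Validate the GGUF magic bytes and return the format version."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with a GGUF header")
    (version,) = struct.unpack("<I", header[4:8])
    return version


print(parse_shard_name("DeepSeek-V4-Flash-IQ1_S-XL-00001-of-00002.gguf"))
```

Calling `read_gguf_version` on each downloaded shard confirms the header signature before attempting a full model load.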
Citation & license
@misc{persadh2026deepseekv4flashgguf,
  author = {Persadh, Darshani},
  title = {DeepSeek-V4-Flash-GGUF: A Quantized 284B-Parameter Mixture-of-Experts Language Model},
  year = {2026},
  doi = {10.57967/hf/8828}
}
License: MIT. Acknowledgements: DeepSeek AI, llama.cpp community, teamblobfish (IQ1_S), persadian, Hugging Face.
Environmental impact
Total CO2 offset: 50 kg · Reforestation code 9162366.
The two-shard distribution still reduces storage energy compared to the FP8 baseline.