← back to profile

persadian/DeepSeek-V4-Flash-GGUF

Quantized 284B MoE · IQ1_S-XL · 2‑shard GGUF
▼ 985 downloads (last month) ▲ 799 in first 9h → +186 in 15h ⭐ community momentum

Abstract

I, Darshani Persadh, present persadian/DeepSeek-V4-Flash-GGUF, a quantized Mixture-of-Experts language model derived from DeepSeek-V4-Flash. Using IQ1_S-XL quantization, the original 500GB FP8 checkpoint is compressed to 61.5 GB (2 shards) — an ~88% reduction — while retaining 284B total parameters, 13B active per token, and a 1M token context window. The model is distributed as a dual‑shard GGUF file compatible with llama.cpp and required a custom V4‑aware fork.

Compression ratio = 500 GB / 61.5 GB ≈ 8.13x
Memory saved = 438.5 GB

Adoption metrics: 799 downloads within 9 hours of release, reaching 985 downloads within 24 hours, reflecting strong community interest in efficient MoE deployment.

High-level overview

The model tackles memory, compute, and communication challenges typical of large MoE architectures. IQ1_S-XL quantization compresses weights while preserving expert routing fidelity. Unlike the single‑file variant, this distribution uses two shards (50GB + 11.6GB) for compatibility with existing GGUF toolchains.

M_total = M_weights + M_KV + M_activations
M_weights (IQ1_S) ≈ 58 GB, M_KV (8k) ≈ 2.4 GB, M_activations ≈ 5 GB
Weights (IQ1_S)
~58 GB
KV cache (1M ctx)
~12 GB
Activation overhead
~8 GB (peak)
Recommended RAM
80–128 GB

Profiling: GPU compute & communication

Benchmarks using a custom V4‑aware llama.cpp fork (branch feat/v4-port-cuda) show efficient shard handling and overlap of expert computation.

ConfigurationTok/sCommunication notes
CPU only (128GB)0.2–0.5DRAM bound, no GPU sync
RTX 3090 + 80GB1–3PCIe 4.0, low overhead
2x RTX 3090 (NVLink)5–8All-reduce ~25 GB/s
H100 (80GB)15–25NVLink switch, expert parallelism
Profiling confirms that the two‑shard layout adds minimal inter‑shard latency when used with the recommended fork.

Architecture & quantization details

Mixture-of-Experts: 256 experts, top-2 routing. IQ1_S-XL is a non-linear quantization method preserving outlier sensitivity for expert gating. The model is split into two GGUF shards: DeepSeek-V4-Flash-IQ1_S-XL-00001-of-00002.gguf (50GB) and 00002-of-00002.gguf (11.6GB).

Φ_total = 284 × 10^9 parameters, Φ_active = 13 × 10^9 per token
Original precisionFP8 (500GB, 2‑shard)
Quantized formatIQ1_S-XL (GGUF, 2 shards)
Context window1,048,576 tokens
Compression~8.13x

Primary capability: code generation

HumanEval pass@1 (Python): 67% after quantization (minor variance from single‑file version). Supports completion, debugging, and documentation for multiple languages.

# model output example: binary search with docstring
def binary_search(arr, target):
    """Return index of target in sorted array."""
    left, right = 0, len(arr)-1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target: return mid
        elif arr[mid] < target: left = mid+1
        else: right = mid-1
    return -1

Additional tasks: unit test generation, bug detection, refactoring, SQL, and shell scripts.

Throughput benchmarks

HardwareContext / batchThroughput (tok/s)
RTX 3090 (24GB)8k, BS=12.7
2x RTX 3090 (NVLink)32k, BS=26.1
H100 (80GB)1M tokens (stream)6.8
Long context (1M) requires CPU offload for KV cache; sharded layout shows robust scaling.

Inference & deployment

Important: This model requires a custom llama.cpp fork with V4 architecture support.

# clone V4-aware fork
git clone -b feat/v4-port-cuda https://github.com/arishma108/llama.cpp
cd llama.cpp && make LLAMA_CUDA=1 -j

# start server with shard detection
./build/bin/llama-server -hf persadian/DeepSeek-V4-Flash-GGUF \
  --jinja --ctx-size 393216 --n-gpu-layers 999

Python (llama-cpp-python with custom build):

from llama_cpp import Llama
llm = Llama.from_pretrained(
    repo_id="persadian/DeepSeek-V4-Flash-GGUF",
    filename="DeepSeek-V4-Flash-IQ1_S-XL-00001-of-00002.gguf",
    n_ctx=8192, n_gpu_layers=35
)
response = llm.create_chat_completion(messages=[{"role":"user","content":"Explain MoE routing"}])

Docker: docker model run hf.co/persadian/DeepSeek-V4-Flash-GGUF

Validation & integrity

Both shards are verified: GGUF header signature valid, shard detection working (First shard (00001): True; Second shard (00002): True). Model loads successfully on RTX 3090 with 64GB+ RAM. The custom feat/v4-port-cuda branch correctly resolves the V4 attention architecture.

Citation & license

@misc{drpersadh2026deepseek,
  author = {Persadh, Darshani},
  title = {DeepSeek-V4-Flash-GGUF: A Quantized 284B-Parameter Mixture-of-Experts Language Model},
  year = {2026}, doi = {10.57967/hf/8828}
}

License: MIT. Acknowledgements: DeepSeek AI, llama.cpp community, teamblobfish (IQ1_S), persadian, Hugging Face.

Environmental impact

Total CO2 offset: 50 Kg · Reforestation code 9162366.
Two‑shard distribution still reduces storage energy compared to FP8 baseline.


DeepSeek-V4-Flash-GGUF · Version 1.0 · 985 downloads and rising