persadian/DeepSeek-V4-Flash-GGUF | Documentation Framework

Abstract

I, Darshani Persadh, present persadian/DeepSeek-V4-Flash-GGUF, a quantized Mixture-of-Experts language model derived from DeepSeek-V4-Flash. Using IQ1_S-XL quantization, the original 500GB FP8 checkpoint is compressed to 61.5 GB (2 shards), an ~88% reduction, while retaining 284B total parameters, 13B active per token, and a 1M token context window. The model is distributed as a dual‑shard GGUF file compatible with llama.cpp and required a custom V4‑aware fork.

Compression ratio = 500 GB / 61.5 GB \approx 8.13x

Memory saved = 438.5 GB

Adoption metrics: 799 downloads within 9 hours of release, reaching 1350 downloads within 24 hours, reflecting strong community interest in efficient MoE deployment.

High-level overview

The model tackles memory, compute, and communication challenges typical of large MoE architectures. IQ1_S-XL quantization compresses weights while preserving expert routing fidelity. Unlike the single‑file variant, this distribution uses two shards (50GB + 11.6GB) for compatibility with existing GGUF toolchains.

M_total = M_weights + M_KV + M_activations

M_weights (IQ1_S) \approx 58 GB, M_KV (8k) \approx 2.4 GB, M_activations \approx 5 GB

Weights (IQ1_S)
~58 GB

KV cache (1M ctx)
~12 GB

Activation overhead
~8 GB (peak)

Recommended RAM
80–128 GB

Profiling: GPU compute & communication

Benchmarks using a custom V4‑aware llama.cpp fork (branch feat/v4-port-cuda) show efficient shard handling and overlap of expert computation.

Configuration	Tok/s	Communication notes
CPU only (128GB)	0.2–0.5	DRAM bound, no GPU sync
RTX 3090 + 80GB	1–3	PCIe 4.0, low overhead
2x RTX 3090 (NVLink)	5–8	All-reduce ~25 GB/s
H100 (80GB)	15–25	NVLink switch, expert parallelism

Profiling confirms that the two‑shard layout adds minimal inter‑shard latency when used with the recommended fork.

Architecture & quantization details

Mixture-of-Experts: 256 experts, top-2 routing. IQ1_S-XL is a non-linear quantization method preserving outlier sensitivity for expert gating. The model is split into two GGUF shards: DeepSeek-V4-Flash-IQ1_S-XL-00001-of-00002.gguf (50GB) and 00002-of-00002.gguf (11.6GB).

Φ_total = 284 \times 10^9

parameters,

Φ_active = 13 \times 10^9

per token

Original precision	FP8 (500GB, 2‑shard)
Quantized format	IQ1_S-XL (GGUF, 2 shards)
Context window	1,048,576 tokens
Compression	~8.13x

Primary capability: code generation

HumanEval pass@1 (Python): 67% after quantization (minor variance from single‑file version). Supports completion, debugging, and documentation for multiple languages.

                # model output example: binary search with docstring

                def binary_search(arr, target):

                    """Return index of target in sorted array."""

                    left, right = 0, len(arr)-1

                    while left <= right:

                        mid = (left + right) // 2

                        if arr[mid] == target: return mid

                        elif arr[mid] < target: left = mid+1

                        else: right = mid-1

                    return -1

Additional tasks: unit test generation, bug detection, refactoring, SQL, and shell scripts.

Throughput benchmarks

Hardware	Context / batch	Throughput (tok/s)
RTX 3090 (24GB)	8k, BS=1	2.7
2x RTX 3090 (NVLink)	32k, BS=2	6.1
H100 (80GB)	1M tokens (stream)	6.8

Long context (1M) requires CPU offload for KV cache; sharded layout shows robust scaling.

Inference & deployment

Important: This model requires a custom llama.cpp fork with V4 architecture support.

                # clone V4-aware fork

                git clone -b feat/v4-port-cuda https://github.com/arishma108/llama.cpp

                cd llama.cpp && make LLAMA_CUDA=1 -j

                # start server with shard detection

                ./build/bin/llama-server -hf persadian/DeepSeek-V4-Flash-GGUF \

                  --jinja --ctx-size 393216 --n-gpu-layers 999

Python (llama-cpp-python with custom build):

                from llama_cpp import Llama

                llm = Llama.from_pretrained(

                    repo_id="persadian/DeepSeek-V4-Flash-GGUF",

                    filename="DeepSeek-V4-Flash-IQ1_S-XL-00001-of-00002.gguf",

                    n_ctx=8192, n_gpu_layers=35

                )

                response = llm.create_chat_completion(messages=[{"role":"user","content":"Explain MoE routing"}])

Docker: docker model run hf.co/persadian/DeepSeek-V4-Flash-GGUF

Validation & integrity

Both shards are verified: GGUF header signature valid, shard detection working

First shard (00001): True;

Second shard (00002): True.

Model loads successfully on RTX 3090 with 64GB+ RAM. The custom feat/v4-port-cuda branch correctly resolves the V4 attention architecture.

Citation & license

                @misc{drpersadh2026deepseek,

                  author = {Persadh, Darshani},

                  title = {DeepSeek-V4-Flash-GGUF: A Quantized 284B-Parameter Mixture-of-Experts Language Model},

                  year = {2026}, doi = {10.57967/hf/8828}

                }

License: MIT.
Acknowledgements: DeepSeek AI, llama.cpp community, teamblobfish (IQ1_S), persadian, Hugging Face.

Environmental impact

Total CO2 offset: 50 Kg · Two‑shard distribution still reduces storage energy compared to FP8 baseline.