
# 28.2 ZFS Features and Terminology

ZFS is fundamentally different from other file systems: it integrates file system and volume manager functionality into a single unit. New storage devices can be added to a running system, and the new space is immediately available to all existing file systems in the pool. A **vdev** (virtual device) is the basic building block of a pool; it can be a single disk (a leaf vdev) or a redundant group of devices such as a mirror or RAID-Z. All ZFS file systems (**datasets**) share the pool's free space, so blocks used by one file system reduce the space available to all the others. This approach avoids the free space fragmentation and the difficulty of online resizing found in traditional volume management.

```sh
                         ZFS File System Hierarchy
                         Root: Storage Pool (Pool / zpool)
────────────────────────────────────────────

pool
├─ root                        (mountpoint=/)
│  ├─ var                      (inherit mountpoint)
│  └─ usr                      (inherit mountpoint)
│
├─ home                        (mountpoint=/home)
│  ├─ user1                    (inherit → /home/user1)
│  └─ user2                    (inherit → /home/user2)
│
├─ data                        (mountpoint=none)
│  └─ archive                  (not auto-mounted)
│
└─ legacyfs                    (mountpoint=legacy)
   (managed by /etc/fstab)

Notes:
- mountpoint=path: auto-mount
- mountpoint=none: not mounted
- mountpoint=legacy: managed by traditional system


                              Dataset Layer
─────────────────────────────────────────
Types: filesystem | volume | snapshot | bookmark | clone

Filesystem                        Volume (zvol)
──────────────────────             ──────────────────────
* POSIX directory tree            * Raw block device
* Dynamic space allocation        * Fixed size (refreservation)
* Inheritable properties          * Can host UFS / ext4 / NTFS

                ───────────────┬───────────────
                      ┌────────┴────────┐
                      ▼                 ▼
               Snapshot                 Bookmark                 Clone
               ─────────                ─────────                ─────────
               @snap (RO)               #bkmk (metadata)         Derived from snapshot
               Rollback / Clone source  Incremental send source  Shares data blocks (COW)

Core Features: COW / Checksum / Compression / Encryption

                              Pool Layer (Pool / zpool)
───────────────────────────────────────
Pool Status: ONLINE / DEGRADED / FAULTED
Operations: Scrub / Resilver / Trim

Top-Level VDEV × N
Data striped across top-level VDEVs, redundancy within each VDEV

                              VDEV Layer
───────────────────────────────────────

[Data VDEV]                     [Allocation Class Devices]

Mirror        RAID-Z           Special        Log
n-way         z1/z2/z3         Metadata pool   SLOG

   │              │               │            └─► Synchronous write log (ZIL)
   │              │               └────────────► Metadata / small blocks

dRAID         Spare            Cache          Dedup
Distributed    Hot spare        L2ARC          DDT

                                   │            └─► Dedup table
                                   └────────────► Read cache


                              Physical Layer
────────────────────────────────────────

Physical Disk              Disk Partition                File VDEV
/dev/ada0                  /dev/ada0p3                   /tmp/vdev
Whole disk recommended     Generally not recommended     Testing purposes
                           GEOM ensures consistency      No power-loss protection


Notes:
- Pool is the root of the entire ZFS hierarchy
- Inheritable properties (compression / atime / mountpoint, etc.) are inherited top-down
- Snapshot / Bookmark do not occupy mount points
```

## ZFS Hierarchy and VDEV Concepts

* **vdev Hierarchy** A vdev (Virtual Device) is the basic building block of a ZFS storage pool, representing a logical grouping of physical storage devices (such as hard drives, SSDs, or partitions). The ZFS vdev hierarchy, from top to bottom, is:

  * **Root vdev**: The top of the pool hierarchy, aggregating all top-level vdevs into a single logical storage unit (the pool).
  * **Top-level vdev**: Direct children of the root vdev, which can be a single device or a logical group aggregating multiple leaf vdevs (such as a mirror or RAID-Z group). ZFS dynamically stripes data across all top-level vdevs in the pool. The overall performance, capacity, and redundancy of the pool depend on the configuration and type of the top-level vdevs.
  * **Leaf vdev**: The most basic type of vdev, directly corresponding to a physical storage device, serving as the endpoint of the storage hierarchy.

  ZFS is also designed to handle device failures gracefully. When a device fails, ZFS continues operating on the remaining devices as long as the redundancy of the affected top-level vdev permits (as in mirror, RAID-Z, or dRAID configurations). The various states of vdevs and devices in a pool and their meanings are detailed in the "Device Status" section below.

## VDEV Types

* **vdev Types**
  * **Disk**: The most basic vdev type, which can be an entire disk (such as **/dev/ada0**, **/dev/da0**) or a single partition (such as **/dev/ada0p3**). In FreeBSD, partitions generally do not suffer significant performance penalties compared to whole disks, but OpenZFS officially still recommends using whole disks whenever possible.
  * **File**: Regular files can also make up a ZFS pool, but this is strongly discouraged for production environments and is only suitable for testing and experimentation. The fault tolerance of a file depends on the file system it resides on. When creating a pool, the full path of the file must be used as the device path.
  * **Mirror**: Created by specifying the keyword `mirror` followed by a list of mirror member devices. A mirror contains two or more devices, and all data is written to all member devices. The usable capacity of a mirror vdev equals the capacity of the smallest member. Data is not lost as long as at least one member remains healthy; an n-way mirror can survive the failure of n-1 devices.
  * **RAID-Z**: ZFS implements RAID-Z, a variant of standard RAID-5/6. RAID-Z distributes parity more evenly and eliminates the "RAID-5 write hole", in which data and parity become inconsistent after an unexpected restart. Data and parity are striped across all disks in a raidz group. ZFS supports three RAID-Z levels (`raidz1`, `raidz2`, `raidz3`; `raidz` is an alias for `raidz1`), which tolerate one, two, and three disk failures respectively. A RAID-Z group requires at least `parity count + 1` disks, and 3 to 9 disks per vdev are recommended for good performance. With more disks, splitting them into multiple vdevs and letting ZFS stripe data across the vdevs is recommended.

    For example, with four 1 TB disks in RAID-Z1, the actual usable storage is approximately 3 TB, and the pool can operate in a degraded state if one disk fails; however, if a second disk also fails before replacement and rebuild, all data in the pool will be lost.

    For example, with eight 1 TB disks in RAID-Z3, the usable space is approximately 5 TB, tolerating up to three disk failures. Two RAID-Z2 vdevs each containing 8 disks are analogous to a RAID-60 array. The storage capacity of a RAID-Z group is approximately the smallest disk size multiplied by the number of non-parity disks.

    **RAID-Z Space Efficiency**: The actual space occupied by data blocks in RAID-Z depends on several factors. The minimum write unit is the disk sector size (controlled by the `ashift` parameter), and the stripe width is dynamic, ranging from one data chunk to at most `disk count - parity count` data chunks. For a 3-disk raidz1 with `ashift=12` (4 KiB sectors) and `recordsize=4K`, each 4 KiB data block needs an additional 4 KiB parity block, so only 50% of the written space holds data, the same as a 2-disk mirror. For a 3-disk raidz1 with `ashift=12` and `recordsize=128K`, each stripe holds at most 2 × 4 KiB of data plus one 4 KiB parity part, so 128 KiB of data is split into 16 stripes of 8 KiB data plus 4 KiB parity each. In total, 192 KiB is written to store 128 KiB of data, a usable space ratio of about 67%. The more disks, the wider the stripe and the higher the space efficiency. RAID-Z is therefore better suited to large block sizes and sequential workloads.

    **RAID-Z Write Performance**: A stripe spans all disks in the array. A single write sends a portion of the stripe to each disk. Since all stripe portion writes must complete on every disk, the write IOPS of a RAID-Z vdev equals, in the worst case, the IOPS of the slowest disk in the array.
  * **Spare**: A spare disk is a pseudo-vdev type used to track available hot spares. When an active device fails, a hot spare automatically replaces the failed device; after permanently replacing the failed device using `zpool replace`, the hot spare is automatically released and returns to an available state. Hot spares can be shared across multiple pools but cannot replace log devices. A shared hot spare in use prevents the pool from being exported, as other pools may also use the same hot spare, leading to data corruption. Additionally, shared hot spares carry extra risk: if pools are imported on different hosts and devices fail simultaneously, both pools may attempt to use the same hot spare, and ZFS may not detect this situation, resulting in data corruption.
  * **Log**: A log vdev is a separate intent log device, moving the ZIL (ZFS Intent Log) from regular pool devices to dedicated storage devices (typically NVRAM or dedicated SSDs). A dedicated log device can improve performance for high-synchronous-write workloads such as databases. Log devices can be mirrored but do not support RAID-Z; when multiple log devices are used, the write load is evenly distributed. Log devices can be removed via the `zpool remove` command and can also be added or replaced.
  * **Cache**: Adding a cache vdev to a pool adds its storage to the L2ARC. Cache devices cannot be mirrored or configured as RAID-Z groups. Since cache devices only store copies of existing data, their failure carries no risk of data loss. The contents of cache devices persist across reboots and are asynchronously restored into L2ARC when the pool is imported (persistent L2ARC); setting `l2arc_rebuild_enabled=0` disables this. To save space, ZFS does not write the metadata structures needed to rebuild L2ARC on cache devices smaller than 1 GiB; this threshold can be adjusted via `l2arc_rebuild_blocks_min_l2size`.
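The RAID-Z capacity and space-efficiency figures quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch under a simplified model: it counts only data and parity sectors per stripe and ignores allocation padding, compression, and metadata overhead. The function names `raidz_usable_tb` and `raidz_space_ratio` are hypothetical helpers, not part of any ZFS tool.

```python
import math


def raidz_usable_tb(disk_tb, disks, parity):
    """Approximate usable capacity: smallest disk size times non-parity disks."""
    if disks < parity + 1:
        raise ValueError("a RAID-Z group needs at least parity + 1 disks")
    return disk_tb * (disks - parity)


def raidz_space_ratio(block_bytes, disks, parity, ashift=12):
    """Fraction of the written sectors that hold user data for one block.

    Simplified model (assumption): no padding, no compression.
    """
    sector = 1 << ashift                      # ashift=12 means 4 KiB sectors
    data_sectors = math.ceil(block_bytes / sector)
    # Each stripe carries at most (disks - parity) data sectors
    # plus `parity` parity sectors.
    stripes = math.ceil(data_sectors / (disks - parity))
    total_sectors = data_sectors + stripes * parity
    return data_sectors / total_sectors


print(raidz_usable_tb(1, disks=4, parity=1))                     # 3
print(raidz_usable_tb(1, disks=8, parity=3))                     # 5
print(raidz_space_ratio(4 * 1024, disks=3, parity=1))            # 0.5
print(round(raidz_space_ratio(128 * 1024, disks=3, parity=1), 3))  # 0.667
```

Trying wider groups (for example `disks=8, parity=2`) with the same block size shows how the ratio rises as stripes get wider.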

## Allocation Class VDEVs

* **Special Allocation Class (Special)**: A special vdev is dedicated to storing specific types of blocks, which by default include all metadata, indirect blocks of user data, the intent log (when no dedicated log device exists), and deduplication tables. It can also accept small file blocks or volume blocks on a per-dataset basis. High-performance SSDs are typically used to accelerate metadata-intensive operations. There must always be at least one regular (non-dedup/special) vdev in the pool before special class devices can be allocated. If a special vdev becomes full, subsequent allocations will fall back to regular vdevs. Unsetting the `zfs_ddt_data_is_special` module parameter prevents deduplication tables from being placed in the special class. Placing small file blocks or volume blocks in the special class is optional; setting the `special_small_blocks` property on each dataset to a non-zero value controls the upper size limit of small blocks that can be placed in the special class.
* **Dedup Allocation Class (Dedup)**: A dedicated vdev exclusively for storing the deduplication table (DDT). Its redundancy level should match that of other regular vdevs. If multiple dedup devices are specified, allocations are evenly distributed across these devices.
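The placement rules for the special class described above can be sketched as a small decision function. This is an illustrative model only, not OpenZFS code: `allocation_class` and its parameters are hypothetical names, and treating the `special_small_blocks` limit as inclusive is an assumption of this sketch.

```python
def allocation_class(kind, block_size, special_small_blocks=0,
                     has_special=True, special_full=False):
    """Return 'special' or 'normal' for one block (illustrative model)."""
    if not has_special or special_full:
        # A full special vdev falls back to regular vdevs.
        return "normal"
    if kind == "metadata":
        # Metadata goes to the special class by default.
        return "special"
    # User data qualifies only when the dataset opts in with a non-zero limit.
    # Assumption: the limit is inclusive.
    if special_small_blocks > 0 and block_size <= special_small_blocks:
        return "special"
    return "normal"


print(allocation_class("metadata", 16 * 1024))
print(allocation_class("data", 8 * 1024, special_small_blocks=32 * 1024))
print(allocation_class("data", 128 * 1024, special_small_blocks=32 * 1024))
print(allocation_class("data", 8 * 1024, special_small_blocks=32 * 1024,
                       special_full=True))
```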

## Device Status and Pool Health

* **Device Status**:

  Top-level vdevs or component devices in a pool can be in the following states:

| Status   | Description                                                                                                                                                                                                                                                                                                                                                                              |
| -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ONLINE   | The device is functioning normally and is healthy                                                                                                                                                                                                                                                                                                                                        |
| OFFLINE  | The device was manually marked offline by the administrator using the `zpool offline` command. The system will no longer attempt to use this device, but ZFS considers it recoverable and the fault count does not increase. Use `zpool online` to bring it back online                                                                                                                  |
| DEGRADED | Either the number of checksum errors or slow I/Os exceeds acceptable levels, and ZFS marks the device as degraded as a warning while continuing to use it as needed; or the number of I/O errors exceeds acceptable levels, but the device cannot be marked as faulted because there are insufficient replicas to continue functioning without it |
| FAULTED  | The device cannot be opened, or the number of I/O errors exceeds the threshold. The system marks it as faulted to prevent continued use. After the device recovers, use `zpool clear` to clear the error count                                                                                                                                                                           |
| REMOVED  | The device has been physically removed from the system without the administrator issuing `zpool offline`. "Physical" removal means physical disconnection—such as hot-unplugging or switch port disablement—rather than physically taking the disk out of the chassis. If a disk is briefly disconnected, immediately reconnected, and no I/O faults occur, the status may remain ONLINE |
| UNAVAIL  | The device cannot be opened. If a device is unavailable when importing a pool, that device is displayed with a unique identifier rather than a path                                                                                                                                                                                                                                      |

* **Pool**: A storage pool is the most basic building block of ZFS. The top of a storage pool is the root vdev, which aggregates one or more top-level vdevs—the actual devices or device groups that carry data. ZFS dynamically stripes data across all top-level vdevs in the pool. A pool can host one or more file systems (datasets) or block devices (volumes), and these datasets and volumes share the remaining free space in the pool. Each pool has a unique name and GUID, and the pool's feature flags determine which features are available.
* **Pool Health Status**:

  The overall health status of a pool falls into three categories:

| Status   | Description                                                                                                                              |
| -------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| ONLINE   | All devices in the pool are functioning normally                                                                                         |
| DEGRADED | One or more devices have failed or are offline, but due to redundant configuration, data is still accessible                             |
| FAULTED  | Metadata is corrupted, or the number of failed devices exceeds the redundancy limit; the pool is unavailable and data cannot be accessed |
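As a rough illustration of how the pool health categories follow from redundancy, the sketch below classifies a pool from the number of failed members in each top-level vdev. It is a deliberately simplified model: `vdev_state` and `pool_state` are hypothetical names, only mirror, RAID-Z, and single-device vdevs are covered, and real ZFS considers more than failure counts (error thresholds, metadata damage, and so on).

```python
def vdev_state(kind, members, failed, parity=0):
    """Classify one top-level vdev from its count of failed members."""
    if kind == "mirror":
        tolerated = members - 1      # an n-way mirror survives n-1 failures
    elif kind == "raidz":
        tolerated = parity           # raidz1/2/3 survive 1/2/3 failures
    else:
        tolerated = 0                # single disk or file: no redundancy
    if failed == 0:
        return "ONLINE"
    return "DEGRADED" if failed <= tolerated else "FAULTED"


def pool_state(vdevs):
    """The pool is only as healthy as its worst top-level vdev."""
    states = [vdev_state(**v) for v in vdevs]
    if "FAULTED" in states:
        return "FAULTED"
    if "DEGRADED" in states:
        return "DEGRADED"
    return "ONLINE"


# Two 8-disk raidz2 vdevs, one of which has lost two disks.
pool = [
    {"kind": "raidz", "members": 8, "failed": 2, "parity": 2},
    {"kind": "raidz", "members": 8, "failed": 0, "parity": 2},
]
print(pool_state(pool))  # DEGRADED
```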

## Core Mechanisms

* **Copy-on-Write (COW)**: Unlike traditional file systems, ZFS writes new data to different blocks rather than overwriting old data in place. After the write completes, metadata is updated to point to the new location. In the event of a short write (system crash or power loss during a write), the original complete contents of the file remain available, and ZFS simply discards the incomplete write. Therefore, ZFS does not require an fsck(8) after an unexpected shutdown.
* **Transaction Groups (TXG)**: Transaction Groups are the mechanism by which ZFS commits batches of data block changes to the pool, and they serve as the atomic unit of consistency. Each transaction group is assigned a unique 64-bit sequential identifier. At most three active transaction groups can exist simultaneously, each in one of the following states:
  * **Open**: A new transaction group is created in the open state and accepts new writes. Once the size limit is reached or the `vfs.zfs.txg.timeout` interval expires, the transaction group advances to the next state. There is always one transaction group in the open state in the system.
  * **Quiescing**: A brief transitional state that allows all pending operations to complete while not blocking the creation of a new open transaction group. Once all transactions within the group are complete, it advances to the syncing state.
  * **Syncing**: All data within the transaction group is written to stable storage. This process also modifies metadata, space maps, etc., which ZFS writes out as well. The syncing process consists of multiple phases: first, all changed data blocks are written, followed by metadata (which may require multiple phases, since allocating space for data blocks generates new metadata). The syncing state is also where *synctasks* (administrative operations such as creating or destroying snapshots and datasets) complete, ultimately finishing with the update of the uberblock. All administrative functions (such as snapshot writes) are executed as part of a transaction group. ZFS adds created synctasks to the open transaction group and advances that group to the syncing state as quickly as possible to reduce administrative command latency.
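* **Self-Healing**: Each block's checksum is stored separately from the block itself (in the parent block pointer). Combined with storage redundancy (mirror or RAID-Z), this gives ZFS self-healing capability: when a read fails checksum verification, ZFS fetches a good copy from the redundant data and repairs the damaged one. Pools consisting of a single device can detect data corruption but cannot self-heal.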
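The self-healing idea can be sketched with a toy two-way mirror. This is a minimal model, not how ZFS is implemented: `checksum` and `self_healing_read` are hypothetical helpers, each mirror member is represented as a `bytes` object in a list, and SHA-256 stands in for whichever checksum the dataset uses.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def self_healing_read(copies, expected):
    """Return verified data from a list of copies, repairing bad ones in place."""
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        # Corruption is detected, but no redundancy is left to repair it.
        raise IOError("unrecoverable: no copy matches the checksum")
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good        # rewrite the damaged copy from the good one
    return good


block = b"important data"
expected = checksum(block)          # kept apart from the data in this model
mirror = [b"important dat\x00", block]   # first copy silently corrupted

print(self_healing_read(mirror, expected) == block)  # True
print(mirror[0] == block)                            # True: copy was repaired
```

Passing a single corrupted copy raises `IOError`, which mirrors the single-device case: the damage is detected but cannot be healed.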

## Advanced Data Properties

* **Compression**: Each dataset has a compression property, which defaults to `off`. Setting it to `on` means using the current default compression algorithm, which balances compression/decompression speed and compression ratio for a wide range of workloads. Unlike other compression settings, `on` does not select a fixed compression type; instead, as ZFS introduces new compression algorithms and enables them in the pool, it automatically uses the new default algorithm. The current default compression algorithm is `lz4` (falling back to `lzjb` on older pools without the `lz4_compress` feature flag enabled). When compression is enabled, ZFS compresses all newly written data. In addition to saving space, throughput is usually improved because fewer blocks are read and written. Any compression setting other than `off` automatically detects all-zero blocks (NUL bytes) and stores them as holes. Compressed data is rounded up to the sector size (2^`ashift` bytes); if the space saved is less than one sector, the block is stored uncompressed. In addition to sector rounding, a default threshold of 12.5% applies: a block whose compression saves less than that is stored uncompressed.
* **recordsize Property**: The suggested block size, designed primarily for database workloads that access files with fixed-size records. ZFS automatically adjusts block sizes according to internal algorithms, and arbitrary modification is not recommended for typical general-purpose file systems. The default value is `131072` (128 KiB), and the maximum can be set to 16 MiB (requires the `large_blocks` feature flag; on x86\_32 platforms, it is limited to 1 MiB by the `zfs_max_recordsize` module parameter default). Changes only affect files created thereafter.
* **Encryption**: ZFS supports native dataset-level encryption (GUID: `com.datto:encryption`). When encryption is enabled, ZFS encrypts file and ZVOL data, file attributes, ACLs, permission bits, directory listings, FUID mappings, and user/group/project space usage data. Encryption must be specified at dataset creation time and cannot be changed afterwards. Supported encryption suites include `aes-128-gcm`, `aes-192-gcm`, `aes-256-gcm` (current default), as well as `aes-128-ccm`, `aes-192-ccm`, `aes-256-ccm`. Keys can be stored in local files, retrieved remotely via HTTPS/HTTP, or entered via a command-line prompt. In versions prior to OpenZFS 2.2.8 and 2.3.3, performing `zfs send` on an encrypted dataset while simultaneously creating a snapshot carries a risk of data corruption; if using unpatched versions, it is recommended to always use raw send mode (`zfs send -w`) for replication operations on encrypted datasets.
* **Copies**: When the `copies` property is set to a value greater than 1, ZFS maintains multiple copies of each block in the file system or volume, stored on different disks where possible. These copies are in addition to pool-level redundancy (such as mirror or RAID-Z). Setting this property on important datasets provides additional redundancy from which ZFS can recover a block whose checksum does not match. In pools without redundant configurations, the copies feature is the only form of redundancy. It can recover from individual bad sectors or other small-scale failures but does not protect the pool from the loss of an entire disk.
* **Dedup/Deduplication**: The checksum mechanism enables the detection of duplicate blocks at write time. Through deduplication, the reference count of an existing block is incremented, saving storage space. ZFS writes the Deduplication Table (DDT) to disk and caches it in ARC memory. The DDT is used to detect duplicate blocks at write time, containing a list of unique checksums, block locations, and reference counts. When writing new data, ZFS computes the checksum and compares it against the list; if a match is found, the existing block is reused. The `sha256` checksum combined with deduplication provides a secure cryptographic hash. When `dedup` is set to `on`, a checksum match is considered to mean the data is identical; when set to `verify`, ZFS performs a byte-by-byte comparison to ensure consistency, and if a difference is found, the hash collision is logged and the data is stored separately. The DDT needs to store the hash of every unique block, resulting in enormous memory consumption. The official recommendation is to have at least 1.25 GiB of memory per 1 TiB of storage when enabling deduplication; actual requirements depend on the type of data stored. If memory is insufficient to hold the full DDT, the DDT must be read from disk before every write, severely degrading performance and potentially even preventing pool import due to memory exhaustion. Deduplication can leverage L2ARC to store the DDT as a compromise between memory and disk. Compression should be considered first, as it can often achieve similar space savings without the additional memory overhead.
* **Scrub**: Instead of a consistency check such as fsck(8), ZFS provides `scrub`. A scrub reads all data blocks in the pool and verifies them against the known-good checksums stored in metadata. Periodically checking all stored data ensures that corrupted blocks can be recovered before they are needed. A scrub is not required after an unclean shutdown, but running one at least once a month is recommended. ZFS verifies the checksum of every block it reads during normal use, but a scrub ensures that even rarely used blocks are checked for silent corruption, which improves data safety in archival storage scenarios.
* **Resilver**: When replacing a failed disk, ZFS must fill the new disk with the missing data. When a mirror vdev is resilvering, data is copied from the remaining mirror members; when a RAID-Z vdev is resilvering, parity information from the remaining disks is used to calculate the missing data and write it to the new disk.
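The deduplication write path described above (checksum lookup, reference counting, and the `on` versus `verify` distinction) can be sketched with a toy in-memory store. This is an illustration only: `DedupStore` and `ddt_ram_gib` are hypothetical names, the dictionary is not the on-disk DDT format, and the memory estimate simply applies the 1.25 GiB per 1 TiB guideline quoted above.

```python
import hashlib


class DedupStore:
    """Toy deduplicating block store keyed by SHA-256."""

    def __init__(self, verify=False):
        self.verify = verify
        self.ddt = {}    # checksum -> [stored block, reference count]

    def write(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        entry = self.ddt.get(key)
        if entry is not None:
            # dedup=verify compares bytes; dedup=on trusts the checksum alone.
            if self.verify and entry[0] != data:
                raise RuntimeError("hash collision: store the block separately")
            entry[1] += 1            # reuse the existing block
        else:
            self.ddt[key] = [data, 1]
        return key

    def free(self, key: str) -> None:
        entry = self.ddt[key]
        entry[1] -= 1
        if entry[1] == 0:
            del self.ddt[key]        # last reference gone: space is reclaimed


def ddt_ram_gib(pool_tib, gib_per_tib=1.25):
    """Rule-of-thumb RAM estimate for the DDT."""
    return pool_tib * gib_per_tib


store = DedupStore(verify=True)
k1 = store.write(b"same block")
k2 = store.write(b"same block")
print(k1 == k2, len(store.ddt), store.ddt[k1][1])  # True 1 2
print(ddt_ram_gib(8))                              # 10.0
```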

## Cache and Performance Optimization

* **Adaptive Replacement Cache (ARC)**: ZFS uses the Adaptive Replacement Cache (ARC) in place of the traditional Least Recently Used (LRU) cache. An LRU cache is a simple list ordered by recency of use, with new items added to the head of the list and items evicted from the tail when the cache is full. ARC consists of four lists: Most Recently Used (MRU) and Most Frequently Used (MFU) objects, along with their corresponding ghost lists. The ghost lists track evicted objects; when evicted data is requested again (a ghost hit), ARC dynamically adjusts the cache space allocation between MRU and MFU accordingly, improving hit rates. Another advantage of using MRU and MFU is that when scanning an entire file system, an LRU cache would evict all data to accommodate the newly accessed content; ZFS, however, tracks the most frequently accessed objects via MFU, allowing the most frequently accessed cached blocks to be retained.
* **L2ARC**: L2ARC is the second level of the ZFS caching system. The primary ARC is stored in RAM. Since RAM capacity is typically limited, ZFS can also use cache vdevs to extend the cache. SSDs are commonly used as cache devices due to their higher speed and lower latency. L2ARC is entirely optional, but deploying it can improve the speed of reading cached files from SSD, avoiding reads from regular disks. L2ARC can also accelerate deduplication: when the Deduplication Table (DDT) exceeds RAM capacity but fits within L2ARC, its read speed is far faster than disk reads. ZFS limits the rate of data written to cache devices to prevent additional writes from prematurely wearing out SSDs. Before the cache is full (before the first eviction occurs to free space), the L2ARC write rate is limited to the sum of the write limit and the boost limit; afterwards, it is limited to the write limit only.
* **ZIL**: The ZFS Intent Log (ZIL) handles POSIX requirements for synchronous transactions (such as database transactions, NFS operations, and fsync(2) calls). By default, the intent log is allocated from blocks within the main pool. Adding a separate intent log device (such as NVRAM or a dedicated SSD) allows the intent log to be moved to faster storage, significantly reducing synchronous write latency and improving performance. Synchronous workloads such as databases benefit greatly from a dedicated log device, while regular asynchronous writes (such as file copies) do not use the ZIL at all.
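The scan-resistance argument for ARC can be demonstrated with a deliberately simplified comparison. The sketch below is not the ARC algorithm: `TwoListCache` is a hypothetical class that keeps fixed-size "recent" (MRU) and "frequent" (MFU) lists, with no ghost lists and no adaptive resizing. It only shows why tracking frequently used entries separately protects them from a one-pass scan.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, key):
        hit = key in self.items
        if hit:
            self.items.move_to_end(key)          # mark as most recently used
        else:
            self.items[key] = True
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)   # evict least recently used
        return hit


class TwoListCache:
    """Fixed-size MRU and MFU lists (real ARC also keeps ghost lists)."""

    def __init__(self, capacity):
        self.recent = LRUCache(capacity // 2)
        self.frequent = LRUCache(capacity - capacity // 2)

    def access(self, key):
        if key in self.frequent.items:
            return self.frequent.access(key)
        if key in self.recent.items:
            del self.recent.items[key]           # second access: promote to MFU
            self.frequent.access(key)
            return True
        self.recent.access(key)                  # first access: MRU only
        return False


def hits_after_scan(cache):
    hot = ["a", "b"]
    for _ in range(3):                           # make "a" and "b" hot
        for k in hot:
            cache.access(k)
    for i in range(100):                         # one-pass scan of cold data
        cache.access(f"scan-{i}")
    return sum(cache.access(k) for k in hot)     # hot entries still cached?


print(hits_after_scan(LRUCache(4)))      # 0
print(hits_after_scan(TwoListCache(4)))  # 2
```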

## Datasets and Storage Space Management

* **Dataset**: A dataset is the generic term for a ZFS file system, volume, snapshot, or bookmark. Clones are essentially file systems or volumes, not an independent dataset type. Each dataset has a unique name: file systems use the format `pool/path`, snapshots use `pool/path@snapshot`, and bookmarks use `pool/path#bookmark`. The root of the pool is itself a dataset. Child datasets use a hierarchical naming scheme similar to directories; for example, `mypool/home` is a child dataset of `mypool` and inherits its properties, and `mypool/home/user` can be created as a grandchild dataset that inherits properties from both its parent and grandparent datasets. Properties set on a child dataset can override inherited defaults. Management of datasets and their children can be accomplished through delegation.
* **Filesystem**: The most commonly used type of ZFS dataset. A ZFS filesystem is ZFS's own native file system format—it is not FAT32/NTFS/ext4, nor can other file systems be formatted on top of it. It is mounted in the system directory tree (controlled by the `mountpoint` property) and provides a full POSIX file interface (open, read, write, mkdir, etc.). The mount point can be set to a specific path (auto-mount), `legacy` (traditional mounting via **/etc/fstab**), or `none` (not auto-mounted). File systems support hierarchical naming (e.g., `pool/home/user`), and child datasets inherit properties from their parent dataset by default, though they can be overridden with independent values.
* **Volume (ZVOL)**: A ZFS volume is another major dataset type. A volume does not contain a file system itself; it is exposed as a raw block device at **/dev/zvol/pool/path**. Because it is a raw block device, it can be formatted with any file system (FAT32, NTFS, UFS, ext4, etc.) or used directly as a virtual machine disk or iSCSI target, that is, in any scenario requiring a block device. By default, creating a volume establishes a space reservation equal to its size (`refreservation`); the `-s` flag creates a sparse volume without reserving space. Volumes also support ZFS features such as snapshots, clones, and rollbacks, but they cannot be mounted directly like file systems, and quotas are not supported (the `volsize` property itself serves as an implicit quota).
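Property inheritance along the dataset hierarchy can be sketched as a walk from a dataset up to the pool root. This is a minimal model, assuming a plain dictionary of locally set properties; `effective_property` is a hypothetical helper and does not call any ZFS API.

```python
def effective_property(local_props, dataset, prop, default=None):
    """Return (value, source) for `prop`, searching the dataset, then its ancestors."""
    name = dataset
    while True:
        value = local_props.get(name, {}).get(prop)
        if value is not None:
            return value, name                 # set locally on `name`
        if "/" not in name:
            return default, "default"          # reached the pool root
        name = name.rsplit("/", 1)[0]          # move up to the parent dataset


# Only locally set properties are recorded; everything else is inherited.
local_props = {
    "mypool": {"compression": "lz4"},
    "mypool/home/user": {"compression": "off"},
}

print(effective_property(local_props, "mypool/home", "compression"))
print(effective_property(local_props, "mypool/home/user", "compression"))
print(effective_property(local_props, "mypool/home", "atime", default="on"))
```

The first call reports the value inherited from `mypool`, the second the local override on `mypool/home/user`, and the third falls back to the supplied default because nothing in the hierarchy sets `atime`.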

**ZFS can host other file systems through volumes**

```sh
Physical Disk
 └── vdev (Mirror / RAID-Z / Stripe)
    └── Storage Pool (zpool)
         └── ZFS Dataset (Dataset abstraction layer)
              ├── ZFS Filesystem
              │    ├── Files / Directories
              │    ├── Snapshot
              │    ├── Clone
              │    └── Bookmark
              │
              └── Volume (zvol, block device)
                   ├── Other file systems (UFS / ext4 / FAT32, etc.)
                   │      └── Files / Directories
                   ├── Snapshot
                   ├── Clone
                   └── Bookmark
```

* **Filesystem vs. Volume Comparison**:

  | Feature                             | ZFS Filesystem                                    | ZFS Volume (ZVOL)                                         |
  | ----------------------------------- | ------------------------------------------------- | --------------------------------------------------------- |
  | Nature                              | ZFS native file system format                     | Raw block device                                          |
  | Exposed as                          | Mounted as directory tree (`mountpoint` property) | Device file (**/dev/zvol/pool/path**)                     |
  | Is it a file system?                | Yes: ZFS itself is the file system                | No: raw block device, no built-in file system             |
  | Can format other FS?                | No (it is already a file system)                  | Yes (FAT32, NTFS, UFS, ext4, etc.)                        |
  | POSIX file interface                | Directly provided (open/read/write/mkdir)         | Not directly provided (needs formatting and mounting)     |
  | Creation command                    | `zfs create pool/fs`                              | `zfs create -V size pool/vol`                             |
  | Default space behavior              | Shares free space in the pool                     | Reserves equal space by default (`-s` for sparse)         |
  | Use cases                           | General file storage, user and service data       | VM disks, iSCSI, scenarios requiring non-ZFS file systems |
  | Snapshot/Clone/Bookmark/Rollback    | Supported                                         | Supported                                                 |
  | COW/Checksum/Compression/Encryption | Supported                                         | Supported                                                 |
  | Hierarchical naming                 | Supported (nestable)                              | Supported (coexists alongside file systems)               |
  | Quota support                       | Supported (`quota`/`refquota`/user/group/project) | Not supported (`volsize` itself serves as implicit quota) |
  | Independent mounting                | Yes                                               | No (must be formatted and mounted via corresponding FS)   |
* **Snapshot**: ZFS's Copy-on-Write (COW) design enables near-instantaneous creation of consistent snapshots, which can be arbitrarily named. A snapshot can be taken of a single dataset, or recursively of a parent dataset and all of its child datasets. After the snapshot is taken, new data is written to new blocks, and the old blocks are not reclaimed as free space. The snapshot preserves the original version of the file system, while the live file system contains all changes made since the snapshot. Creating the snapshot consumes no additional space by itself; only the changes written afterwards occupy new blocks. The space attributed to a snapshot grows as blocks stop being used by the live file system and remain referenced only by the snapshot. Mounting a snapshot read-only allows recovery of historical file versions; the live file system can also be rolled back to a specific snapshot, undoing all changes made since that snapshot. Every block in the pool has a reference counter tracking which snapshots, clones, datasets, or volumes reference that block. When files and snapshots are deleted, the reference count is decremented, and the space is reclaimed when the count reaches zero. Snapshots can be marked with a hold (`zfs hold`); attempting to destroy a held snapshot returns an `EBUSY` error. Use the `zfs release` command to remove the hold before destroying the snapshot. Volume snapshots can also be cloned and rolled back, but they cannot be mounted independently.
* **Clone**: A snapshot can be used to create a clone. A clone is a writable version of a snapshot that allows a file system to fork as a new dataset. Like snapshots, clones initially consume no new space. New data written to the clone uses new blocks, and the clone's size grows accordingly. When a block in the clone is overwritten, the reference count of the original block is decremented. Because the snapshot is the parent and the clone is the child, the snapshot that a clone depends on cannot be destroyed while the clone exists. A clone can be promoted to reverse this dependency: the clone becomes the parent and the original parent becomes the child. This operation consumes no new space, but because the parent-child space accounting is reversed, it may affect existing quotas and reservations.
* **Block Cloning**: Block cloning is a mechanism that allows cloning of files (or portions of files), creating shallow copies that reference existing data blocks without copying them. Subsequent modifications to the data trigger Copy-on-Write for the affected blocks. This feature is used to implement "reflinks" or "file-level Copy-on-Write." ZFS tracks cloned blocks through a special on-disk structure called the Block Reference Table (BRT). Unlike deduplication, the overhead of this table is minimal, so it can always be enabled. Another difference from deduplication is that cloning must be actively requested by the user program. Many common file copy programs (including newer versions of **/bin/cp**) automatically attempt to create clones. Block cloning has some limitations: only complete blocks can be cloned; blocks that have not yet been written to disk cannot be cloned; blocks cannot be cloned when the source and target have different `recordsize` properties; and encrypted blocks can only be cloned when the source and target datasets share the same master key (as snapshots and clones do), so blocks cannot be cloned between datasets that use different encryption keys.
* **Bookmark**: A bookmark is similar to a snapshot but is faster to create and consumes no additional space. Unlike snapshots, bookmarks cannot be accessed through the file system in any way. From a storage perspective, a bookmark simply records, as an independent object, the point in time at which a snapshot was created. Bookmarks are initially associated with a snapshot rather than directly with a file system or volume; even if the original snapshot is destroyed, the bookmark continues to exist. Because bookmarks are very lightweight, there is usually no need to destroy them. Bookmarks are named in the form `pool/path#bookmark`.
* **Quota**:

  ZFS provides fast and accurate dataset, user, group, and project space accounting, as well as quota and space reservation features, enabling administrators to finely control space allocation and reserve space for critical file systems.

| Quota Type                 | Description                                                                                                                                                                                                            |
| -------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Dataset Quota              | Limits the total size of a dataset and its descendants, including snapshots and child datasets. Volumes do not support quotas because their `volsize` property serves as an implicit quota                             |
| Reference Quota (refquota) | Imposes a hard limit on the maximum space a dataset can use. This limit only includes space referenced by the dataset itself, not space used by descendants (such as file systems or snapshots)                        |
| User Quota                 | Limits the amount of space used by a specified user                                                                                                                                                                    |
| Group Quota                | Limits the amount of space used by a specified group                                                                                                                                                                   |
| Project Quota              | Limits the space usage of a specified project, identified by project ID. Similar to user and group quotas, project quotas can account for and limit space usage of files and directories belonging to the same project |
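
The snapshot, clone, and bookmark entries in the list above, together with the quota types in the table, can be sketched as one short session. This is an illustration under assumptions rather than a recommended procedure: the dataset `storage/home/bob` is borrowed from the reservation example in the text, while the snapshot name `before`, the hold tag `keep`, the clone name `bob-test`, the user `alice`, the group `staff`, project ID `100`, and all sizes are hypothetical. The `run` helper is a hypothetical wrapper that only prints each command unless `DRY_RUN=0` is set.

```sh
#!/bin/sh
# Hypothetical helper: print each command; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

FS=storage/home/bob    # dataset name taken from the text's reservation example

snapshot_demo() {
    # Recursive snapshot of the parent and all child datasets.
    run zfs snapshot -r "storage/home@before"
    # A held snapshot cannot be destroyed until the hold is released.
    run zfs hold keep "${FS}@before"
    run zfs holds "${FS}@before"
    run zfs release keep "${FS}@before"
    # A bookmark records the snapshot's point in time as a separate object.
    run zfs bookmark "${FS}@before" "${FS}#before"
    # Undo all changes made to the live file system since the snapshot.
    run zfs rollback "${FS}@before"
    # Fork a writable clone, then reverse the parent-child dependency.
    run zfs clone "${FS}@before" storage/home/bob-test
    run zfs promote storage/home/bob-test
}

quota_demo() {
    run zfs set quota=50G storage/home       # dataset plus descendants
    run zfs set refquota=10G "$FS"           # the dataset itself only
    run zfs set userquota@alice=5G "$FS"     # per user
    run zfs set groupquota@staff=20G "$FS"   # per group
    run zfs set projectquota@100=2G "$FS"    # per project ID
    run zfs userspace "$FS"                  # per-user usage report
}

snapshot_demo
quota_demo
```

The promotion is placed last on purpose: after `zfs promote`, the `before` snapshot belongs to the promoted clone, so the earlier rollback and bookmark commands would no longer find it under `storage/home/bob`.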

* **Reservation**:
  * **Dataset Reservation (reservation)**: The `reservation` property guarantees a certain amount of storage space for a specific dataset and its descendants. For example, setting a 10 GB reservation for `storage/home/bob` prevents other datasets from exhausting free space, ensuring that the dataset always has at least 10 GB available.

    Reservations of any kind are useful when planning and testing disk space allocation schemes for new systems, or when ensuring that file systems have sufficient space for audit logs and for system recovery procedures and their files.
  * **Reference Reservation (refreservation)**: The `refreservation` property reserves a certain amount of space for a specific dataset, excluding its descendants. For example, after setting a 10 GB `refreservation` for `storage/home/bob`, at least 10 GB remains reserved for that dataset even if other datasets attempt to consume the free space. Unlike a regular `reservation`, space occupied by snapshots and descendant datasets is not counted toward this reservation. Consequently, if a snapshot is created for `storage/home/bob`, there must be sufficient disk space outside the `refreservation` amount for the operation to succeed.
* **Pool Checkpoint**: Before performing a critical operation that may be destructive (such as `zfs destroy`), an administrator can establish a checkpoint of the pool state. If an error occurs, the entire pool can be rolled back to the checkpoint. A checkpoint is similar to a pool-level snapshot and contains the entire state of the pool (properties, vdev configuration, etc.). While a checkpoint exists, operations such as vdev removal, attachment, and detachment, mirror splitting, and changing the pool GUID cannot be executed. Adding new vdevs is allowed, but if the pool is rolled back, they must be added again. After a rollback, the checkpoint is permanently deleted.

> **Note**
>
> While a checkpoint exists, dataset reservations may not be enforced, and data that has been freed since the checkpoint was taken (and is therefore retained only by the checkpoint) is not scanned by scrub.
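
The reservation properties and the checkpoint workflow described above can be sketched as follows. The dataset `storage/home/bob` and the 10 GB figure come from the text; the choice between rewinding and discarding is modeled with a hypothetical `outcome` argument, and the `run` helper is a hypothetical wrapper that only prints each command unless `DRY_RUN=0` is set.

```sh
#!/bin/sh
# Hypothetical helper: print each command; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

POOL=storage
FS=storage/home/bob

reservation_demo() {
    # Guarantee 10G for the dataset including its descendants...
    run zfs set reservation=10G "$FS"
    # ...or 10G for the dataset alone, excluding snapshots and descendants.
    run zfs set refreservation=10G "$FS"
    run zfs get reservation,refreservation,usedbyrefreservation "$FS"
    # Remove a reservation again.
    run zfs set reservation=none "$FS"
}

checkpoint_demo() {
    outcome=$1    # "rewind" or "discard" (illustrative argument)
    run zpool checkpoint "$POOL"
    run zpool get checkpoint "$POOL"    # space consumed by the checkpoint
    # ... the risky operation would be performed here ...
    if [ "$outcome" = rewind ]; then
        # Rewinding requires exporting and re-importing the pool.
        run zpool export "$POOL"
        run zpool import --rewind-to-checkpoint "$POOL"
    else
        # Everything went well: discard the checkpoint.
        run zpool checkpoint -d "$POOL"
    fi
}

reservation_demo
checkpoint_demo discard
```

Discarding the checkpoint once the operation has succeeded lifts the restrictions on vdev operations that apply while it exists.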

## Feature Flags

The ZFS on-disk format initially used a single version number that was incremented with each format change. This numbering scheme was appropriate when ZFS was driven by a single organization. However, for OpenZFS's distributed development, version numbering is no longer suitable—any version number change requires all implementations to agree on each on-disk format change.

OpenZFS feature flags serve as an alternative to traditional version numbering, providing **a uniquely named pool property** for each on-disk format change. This approach supports both independent and interdependent format changes. When all features used by a pool are supported by multiple OpenZFS implementations, the on-disk format is portable across those implementations. Features that make a pool exclusive to one implementation once enabled should be periodically ported to all distributions. Legacy version numbers still exist for pool versions 1 through 28.
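
Because each feature flag is a pool property, it can be inspected and enabled with the usual `zpool` subcommands. The sketch below assumes a pool named `storage` and uses `block_cloning` as the example feature; whether that feature is available depends on the installed OpenZFS version. The `run` helper is a hypothetical wrapper that only prints each command unless `DRY_RUN=0` is set.

```sh
#!/bin/sh
# Hypothetical helper: print each command; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

POOL=storage    # assumed pool name

feature_demo() {
    # List the features this implementation supports.
    run zpool upgrade -v
    # Feature-flag pools report "-" here instead of a legacy version number.
    run zpool get version "$POOL"
    # A feature is reported as disabled, enabled, or active.
    run zpool get feature@block_cloning "$POOL"
    # Enable a single feature rather than all of them at once.
    run zpool set feature@block_cloning=enabled "$POOL"
}

feature_demo
```

Enabling features one at a time, instead of running `zpool upgrade` on the whole pool, makes it easier to keep the pool importable by other implementations that do not support every feature.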

## Differences Between ZFS and Traditional File System Mount Methods

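By default, ZFS does not use **/etc/fstab** to manage file system mounts. Instead, it manages mounts through the `zfs mount` command and the `mountpoint` property of ZFS datasets. The exception is a dataset with `mountpoint=legacy`, which is mounted in the traditional way through **/etc/fstab**. EFI system partitions and swap partitions also still require **/etc/fstab**.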
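
The two mounting styles can be compared in a short sketch. The dataset names follow the hierarchy diagram at the top of this section, with `storage` assumed as the pool name and **/mnt/legacy** as a hypothetical mount point. The script only prints a suggested **/etc/fstab** line and never edits the file; the `run` helper is a hypothetical wrapper that only prints each command unless `DRY_RUN=0` is set.

```sh
#!/bin/sh
# Hypothetical helper: print each command; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

mount_demo() {
    # ZFS-managed mount: the property decides where the dataset appears.
    run zfs set mountpoint=/home storage/home
    run zfs get -r mountpoint,mounted storage/home
    run zfs mount -a
    # Legacy mount: ZFS steps aside and the traditional tools take over.
    run zfs set mountpoint=legacy storage/legacyfs
    # Suggested /etc/fstab entry (printed only, not written to the file):
    printf '%s\n' 'storage/legacyfs  /mnt/legacy  zfs  rw  0  0'
    run mount -t zfs storage/legacyfs /mnt/legacy
}

mount_demo
```

Child datasets such as `storage/home/user1` inherit the `mountpoint` of their parent, so setting the property once on `storage/home` is enough for the whole subtree.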

