
Stratis: Easy local storage management for Linux

May 29, 2018

This article was contributed by Andy Grover

Stratis is a new local storage-management solution for Linux. It can be compared to ZFS, Btrfs, or LVM. Its focus is on simplicity of concepts and ease of use, while giving users access to advanced storage features. Internally, Stratis's implementation favors tight integration of existing components instead of the fully-integrated, in-kernel approach that ZFS and Btrfs use. This has benefits and drawbacks for Stratis, but also greatly decreases the overall time needed to develop a useful and stable initial version, which can then be a base for further improvement in later versions. As the Stratis team lead at Red Hat, I am hoping to raise the profile of the project a bit so that more in our community will have it as an option.

Why make Stratis instead of working on ZFS or Btrfs?

A version of ZFS, originally developed by Sun Microsystems for Solaris (now owned by Oracle), was forked for use on other platforms including Linux (OpenZFS). However, its CDDL-licensed code cannot be merged into the GPLv2-licensed Linux source tree. Whether CDDL and GPLv2 are truly incompatible is a continuing subject for debate, but the uncertainty is enough to make some enterprise Linux vendors unwilling to adopt and support it.

Btrfs is also well-established, and has no licensing issues. It was the anointed Chosen One for years (and years) for many people, but it is a large project that duplicates existing functionality, with a potentially high cost to complete and support over the long term.

Red Hat ultimately made a choice to instead explore the proposed Stratis solution.

How Stratis is different

Both ZFS and Btrfs can be called "volume-managing filesystems" (VMFs). These combine the filesystem and volume-management layers into one. VMFs focus on managing a pool of storage created from one or more block devices, and allowing the creation of multiple filesystems whose data resides in the pool. This model of management has proven attractive to users, since it makes storage easier to use not only for basic tasks, but also for more advanced features that would otherwise be challenging to set up.

Stratis is also a VMF, but unlike the others, it is not implemented entirely as an in-kernel filesystem. Instead, Stratis is a daemon that manages existing layers of functionality in Linux — the device-mapper (DM) subsystem and the XFS non-VMF filesystem — to achieve a similar result. While these components are not part of Stratis per se (and can indeed be used directly or via LVM), Stratis takes on the entire responsibility for configuring, maintaining, and monitoring the pool's layers on behalf of the user.

Although there are drawbacks to forgoing total integration, there are benefits. The natural primary benefit is that Stratis doesn't need to independently develop and debug the many features a VMF is expected to have. Also, it may be easier to incorporate new capabilities more quickly when they become available. Finally, as a new consumer of these components, Stratis may participate in their common upstream development, sharing mutual benefit with the component's other users.

In addition to this main implementation difference, Stratis also makes some different design choices, based on the current state of technology. First, the widespread use of SSDs minimizes the importance of optimizing for access times on rotational media. If performance is important, SSDs should be used either as primary storage, or as a caching tier for the spinning disks. Assuming this is the case lets Stratis focus more on other requirements in the data storage tier. Second, embedded use and automated deployments are now the norm. A new implementation should include an API from the start, so other programs can also configure it easily. Lastly, storage is starting to become commoditized: big enough for most uses, and perhaps no longer something users want to actively manage. Stratis should account for this by being easy to use. Many people will only interact with Stratis when a problem arises. Poor usability feels even worse when the user is responding to a rare storage alert, and also may be worried about losing data.

Implementation

Stratis is implemented as a user-space daemon, written in the Rust language. It uses D-Bus to present a language-neutral API, and also includes a command-line tool written in Python. The API and command-line interface are focused around the three concepts that a user must be familiar with — blockdevs, pools, and filesystems. A pool is made up of one or more blockdevs (block devices, such as a disk or a disk partition), and then once a pool is created, one or more filesystems can be created from the pool. While the pool has a total size, each filesystem does not have a fixed size.
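As a sketch of the intended workflow, the command-line interaction might look like the following (command names are taken from the stratis-cli project as it stands today; exact syntax and device paths may change before 1.0, so treat this as illustrative rather than definitive):

```shell
# Create a pool from one or more block devices
stratis pool create mypool /dev/sdb /dev/sdc

# Create filesystems from the pool -- note no size is given;
# each filesystem draws space from the pool as needed
stratis filesystem create mypool fs1
stratis filesystem create mypool fs2

# The result is an ordinary XFS filesystem that can be mounted as usual
mount /stratis/mypool/fs1 /mnt/fs1
```

The same operations are available over the D-Bus API, so management tools and installers can drive Stratis without shelling out to the CLI.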

Under the hood

Although how the pool works internally is not supposed to be a concern of the user, let's look at how the pool is constructed.

[Stratis layers]

Starting from the bottom of the diagram on the "internal view" side, the layers that manage block devices and add value to them are called the Backstore, which is in turn divided into data and cache tiers. Stratis 1.0 will support a basic set of layers, with additional optional layers planned for integration later that will add more capabilities.

The lowest layer of the data tier is the blockdev layer, which is responsible for initializing and maintaining on-disk metadata regions that are created when a block device is added to a pool. Above that may go support for additional layers, such as providing detection of data corruption (dm-integrity), and providing data redundancy (dm-raid), with the ability to correct data corruption when used in tandem. This would also be where support for compression and deduplication, via the recently open-sourced (but not yet upstream) dm-vdo target, would sit. Since these reduce the available total capacity of the pool and may affect performance, their use will be configurable at time of pool creation.
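To make the division of labor concrete, here is roughly how an integrity-plus-redundancy stack can be assembled by hand today using the integritysetup and mdadm tools (a sketch only — it destroys data on the named devices, requires root, and is not how Stratis itself will build the stack, since Stratis plans to drive the DM targets directly):

```shell
# Add an integrity layer to each disk, so silent corruption
# surfaces as a read error instead of bad data
integritysetup format /dev/sdb
integritysetup open /dev/sdb int-sdb
integritysetup format /dev/sdc
integritysetup open /dev/sdc int-sdc

# RAID-1 on top can then repair any block the integrity layer rejects,
# by reading the good copy from the other leg
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    /dev/mapper/int-sdb /dev/mapper/int-sdc
```

The point of the ordering is that the integrity layer sits below the RAID layer: corruption is detected per-device, and the redundancy above it still has a clean copy to repair from.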

Above this is the cache tier. This tier manages its own set of higher-performance block devices, to act as a non-volatile cache for the data tier. It uses the dm-cache target, but its internal management of blockdevs used for cache is similar to the data tier's management of blockdevs.

On top of the Backstore sits the Thinpool, which encompasses the data and metadata required for the thin-provisioned storage pool that individual filesystems are created from. Using dm-thin, Stratis creates thin volumes with a large virtual size and formats them with XFS. Since storage blocks are only used as needed, the actual size of a filesystem grows as data is stored on it. If this data's size approaches the filesystem's virtual size, Stratis grows the thin volume and the filesystem automatically. Stratis 1.0 will also periodically reclaim no-longer-used space from filesystems using fstrim, so it can be reused by the Thinpool.
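The grow-on-demand behavior can be illustrated with a small sketch of the decision logic (the function name, threshold, and growth factor here are hypothetical, not Stratis's actual code or policy):

```python
def plan_growth(used_bytes, virtual_bytes, threshold=0.75, factor=2):
    """Decide whether a thin volume (and the XFS on it) should be grown.

    Returns the new virtual size in bytes, or None if no growth is needed.
    Illustrative policy only; Stratis's real thresholds may differ.
    """
    if used_bytes < threshold * virtual_bytes:
        return None
    # Grow the virtual size so the filesystem stays well below its limit
    return virtual_bytes * factor

# A filesystem using 800 of 1000 virtual blocks crosses the 75% threshold
assert plan_growth(800, 1000) == 2000
# One at 50% utilization is left alone
assert plan_growth(500, 1000) is None
```

In the real implementation this check is driven by DM events from the thin pool rather than polling, and growing the volume is paired with xfs_growfs so the filesystem sees the new space.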

Along with setting up the pool, Stratis continually monitors and maintains it. This includes watching for DM events such as the Thinpool approaching capacity, as well as udev events, for the possible arrival of new blockdevs. Finally, Stratis responds to incoming calls to its D-Bus API. Monitoring is critical because thin-provisioned storage is sensitive to running out of backing storage, and relieving this condition requires intervention from the user, either by adding more storage to the pool or by reducing the total data stored.

Challenges so far

Since Stratis reuses existing kernel components, the Stratis development team's two primary challenges have been determining exactly how to use them, and then encountering cases where the use of components in a new way can raise issues. For example, in implementing the cache tier using dm-cache, the team had to figure out how to use the DM target so that the cache device could be extended if new storage was added. Another example: Snapshotting XFS on a thin volume is fast, but giving the snapshot a new UUID so it can be mounted causes the XFS log to be cleared, which increases the amount of data written.
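The snapshot-UUID issue arises because a thin snapshot is a block-for-block copy of the origin, so both carry the same XFS UUID and cannot be mounted at the same time. Assigning a fresh UUID works around this, but xfs_admin must first replay and clear the log to do it, which is where the extra writes come from (a sketch; the device names are hypothetical):

```shell
# The snapshot shares the origin's UUID, so XFS will refuse to
# mount both at once. Generate a new UUID for the snapshot:
xfs_admin -U generate /dev/mapper/pool-snap

# Now it mounts alongside the origin
mount /dev/mapper/pool-snap /mnt/snap
```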

Both of these were development hurdles, but also mostly expected, given the chosen approach. In the future, when Stratis has proven its worth and has more users and contributors, Stratis developers could also work more with upstream projects to implement and test features that Stratis could then support.

Current Status and How to Get Involved

Stratis version 0.5 was recently released, which added support for snapshots and the cache tier. This is available now for early testing in Fedora 28. Stratis 1.0, which is targeted for release in September 2018, will be the first version suitable for users, and whose on-disk metadata format will be supported in future versions.

Stratis started as a Red Hat engineering project, but has started to attract community involvement, and hopes to attract more. If you're interested in development, testing, or offering other feedback on Stratis, please join us. To learn more about Stratis's current technical details, check out our Design document [PDF] on the web site. There is also a development mailing list.

Development is on GitHub, both for the daemon and the command-line tool. This is also where bugs should be filed. IRC users will find the team on the Freenode network, on channel #stratis-storage. For periodic news, follow StratisStorage on Twitter.

Conclusion

Stratis is a new approach to constructing a volume-managing filesystem whose primary innovation is — ironically — reusing existing components. This accelerates its development timeline, at the cost of foregoing the potential benefits of committing "rampant layering violations". Do the benefits ascribed to ZFS and Btrfs require this integration-of-implementation approach, or are these benefits also possible with integration at a slightly higher level? Stratis aims to answer this question, with the goal of providing a useful new option for local storage management to the Linux community.


Index entries for this article
GuestArticles: Grover, Andy



Stratis: Easy local storage management for Linux

Posted May 29, 2018 18:14 UTC (Tue) by ejr (subscriber, #51652) [Link]

If the block devices are ceph rbds, and you expose XFS via its pNFS support from multiple hosts... Hilarity ensues?

Stratis: Easy local storage management for Linux

Posted May 29, 2018 18:55 UTC (Tue) by cesarb (subscriber, #6266) [Link] (1 responses)

Why does this remind me of EVMS?

Stratis: Easy local storage management for Linux

Posted May 29, 2018 19:05 UTC (Tue) by Sesse (subscriber, #53779) [Link]

EVMS wasn't nearly as automatic as what this sounds to be. In practice, it was mostly an alternative LVM2 frontend.

Stratis: Easy local storage management for Linux

Posted May 30, 2018 1:34 UTC (Wed) by gfa (guest, #53331) [Link] (2 responses)

Could Stratis grow options to automatically manage the underlying block devices?
I'm thinking resize, add, remove them. In case the storage offers an API to manage it.

TL;DR Could Stratis add EBS volumes to my pool and remove them when they are no longer needed?

Stratis: Easy local storage management for Linux

Posted May 30, 2018 6:44 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

EBS volumes are actually resizable on the fly.

Stratis: Easy local storage management for Linux

Posted May 30, 2018 16:35 UTC (Wed) by agrover (guest, #55381) [Link]

This is a cool idea! I've opened an issue. https://github.com/stratis-storage/stratisd/issues/962

Stratis: Easy local storage management for Linux

Posted May 30, 2018 6:55 UTC (Wed) by nilsmeyer (guest, #122604) [Link] (5 responses)

Would have liked encryption / compression, but otherwise this sounds pretty cool. To be fair, ZFS also has no native encryption on Linux, getting it to work with LUKS was often not easy in my case.

Stratis: Easy local storage management for Linux

Posted May 30, 2018 10:36 UTC (Wed) by evad (subscriber, #60553) [Link] (4 responses)

Compression will be available via the dm-vdo module which provides compression and deduplication (mentioned in the article).

Encryption presumably could be easily added via dm-crypt/luks?

Stratis: Easy local storage management for Linux

Posted May 30, 2018 14:04 UTC (Wed) by agrover (guest, #55381) [Link] (3 responses)

Yeah, it's possible native support for encryption could be added in the future. It would be an integration task instead of a write-from-scratch task.

Stratis: Easy local storage management for Linux

Posted May 31, 2018 2:36 UTC (Thu) by eternaleye (guest, #67051) [Link] (2 responses)

There are definitely questions about where to slot encryption in, though - if it's placed along with dm-integrity it protects more metadata but has to deal with multiple devices to a much greater degree, as well as having to encrypt the parity/replicas from dm-raid, the hot data from dm-cache, etc; if instead placed above dm-cache, its role is drastically simplified (and its performance overhead is reduced), at the cost of leaking RAID topology info and cache metadata.

I currently do the latter using an LVM Klein bottle setup - three hard drives and an SSD in an LVM VG, arranged as RAID-5 under dm-cache hosting a single LV that's encrypted with LUKS, which I then turn around and use as a PV for the same VG (with --pvmetadatacopies 0 to avoid hangs), which then holds swap and root LVs.

This is the script I use to generate it, which I call "lvm-abuse":

#!/bin/bash

set -euo pipefail

declare VGNAME=pool
declare CACHENAME=cache
declare CRYPTNAME=crypt
declare PVNAME=crypt
declare CACHEMODE=writethrough
declare LUKSUUID="$(cat /proc/sys/kernel/random/uuid)"

declare -a SOLIDSTATE
declare -a ROTATIONAL

declare -A SKIP

usage() {
    echo "Usage: $0" \
        "[--cache-mode <writethrough|writeback|passthrough>]" \
        "[--ssd <SSD>]*" \
        "[--hdd <HDD>]+" \
        "[--vg <NAME=pool>]" \
        "[--cache <NAME=cache>]" \
        "[--crypt <NAME=crypt>]" \
        "[--luks-uuid <UUID>]" \
        "[--skip <discard|vg|vgextend|crypt|cache|attach|lvs|swap|btrfs>]*"
    return $1
}

discard_ssds() {
    for SSD in "${SOLIDSTATE[@]}"; do
        blkdiscard "${SSD}"
    done
}

make_vg() {
    lvm vgcreate "${VGNAME}" "${SOLIDSTATE[@]}" "${ROTATIONAL[@]}"
}

make_crypt() {
    if   [[ "${#ROTATIONAL[@]}" -eq 1 ]]; then
        # Single encrypted drive
        lvm lvcreate "${VGNAME}" -n "${CRYPTNAME}" -l 100%FREE "${ROTATIONAL[0]}"
    elif [[ "${#ROTATIONAL[@]}" -eq 2 ]]; then
        # Encrypted RAID-1 pair of drives
        lvm lvcreate "${VGNAME}" -n "${CRYPTNAME}" -l 100%FREE --type raid1 --nosync --mirrors 1 "${ROTATIONAL[@]}"
    elif [[ "${#ROTATIONAL[@]}" -eq 3 ]]; then
        # Encrypted RAID-5 set of 3 drives
        lvm lvcreate "${VGNAME}" -n "${CRYPTNAME}" -l 100%FREE --type raid5 --nosync --stripes 2 "${ROTATIONAL[@]}"
    else
        # Encrypted RAID-6 set of 4 or more drives
        lvm lvcreate "${VGNAME}" -n "${CRYPTNAME}" -l 100%FREE --type raid6 --nosync --stripes $(( "${#ROTATIONAL[@]}" - 2 )) "${ROTATIONAL[@]}"
    fi

    cryptsetup luksFormat -c aes-xts-plain64 -h sha512 -y -s 256 --use-urandom -M luks --uuid="${LUKSUUID}" /dev/"${VGNAME}"/"${CRYPTNAME}"
    PVNAME="luks-$(cryptsetup luksUUID /dev/"${VGNAME}"/"${CRYPTNAME}")"
    cryptsetup luksOpen /dev/"${VGNAME}"/"${CRYPTNAME}" "${PVNAME}"
    # Don't store any metadata on the nested PV, to avoid deadlocks
    lvm pvcreate --pvmetadatacopies 0 --metadatasize 0 --norestorefile --bootloaderareasize 0 /dev/mapper/"${PVNAME}"
    lvm vgextend "${VGNAME}" /dev/mapper/"${PVNAME}"
}

make_cache() {
    local METANAME="$(mktemp -u cachemeta_XXXXXXXXXXXX)"
    if [[ "${#SOLIDSTATE[@]}" -eq 1 ]]; then
        # Single cache SSD
        lvm lvcreate "${VGNAME}" -n "${METANAME}"  -L 2g       "${SOLIDSTATE[0]}"
        lvm lvcreate "${VGNAME}" -n "${CACHENAME}" -l 100%FREE "${SOLIDSTATE[0]}"
    else
        # RAID-1 set of cache SSDs
        lvm lvcreate "${VGNAME}" -n "${METANAME}"  -L 2g       --type raid1 --mirrors $(( "${#SOLIDSTATE[@]}" - 1 )) "${SOLIDSTATE[@]}"
        lvm lvcreate "${VGNAME}" -n "${CACHENAME}" -l 100%FREE --type raid1 --mirrors $(( "${#SOLIDSTATE[@]}" - 1 )) "${SOLIDSTATE[@]}"
    fi
    lvm lvresize "${VGNAME}"/"${METANAME}" -L 1g
    lvm lvconvert --type cache-pool --poolmetadata "${VGNAME}"/"${METANAME}" "${VGNAME}"/"${CACHENAME}" --chunksize 4m
}

cache_crypt() {
    lvm lvconvert --type cache --cachepool "${VGNAME}"/"${CACHENAME}" "${VGNAME}"/"${CRYPTNAME}"
    lvm lvchange --cachemode "${CACHEMODE}" "${VGNAME}"/"${CRYPTNAME}"
}

make_lvs() {
    lvm lvcreate "${VGNAME}" -n swap -L 32g      /dev/mapper/"${PVNAME}"
    lvm lvcreate "${VGNAME}" -n root -l 100%FREE /dev/mapper/"${PVNAME}"
}

make_swap() {
    mkswap /dev/"${VGNAME}"/swap
}

make_btrfs() {
    mkfs.btrfs -m dup -n 16384 -s 4096 -O extref,skinny-metadata,no-holes /dev/"${VGNAME}"/root -f
}

parse_args() {
    while [[ $# -gt 1 ]]; do
        case $1 in
            --ssd)
                SOLIDSTATE+=( "$2" )
                ;;
            --hdd)
                ROTATIONAL+=( "$2" )
                ;;
            --vg)
                VGNAME="$2"
                ;;
            --cache)
                CACHENAME="$2"
                ;;
            --crypt)
                CRYPTNAME="$2"
                ;;
            --skip)
                SKIP[$2]=1
                ;;
            --cache-mode)
                CACHEMODE="$2"
                ;;
            --luks-uuid)
                LUKSUUID="$2"
                ;;
            *)
                usage 1
                ;;
        esac
        shift; shift;
    done

    if [[ $# -ne 0 ]]; then
        usage 1
    fi

    if [[ "${#ROTATIONAL[@]}" -eq 0  ]]; then
        echo "There must be at least one rotational drive"
        usage 2
    fi
}

do_setup() {
    parse_args "$@"

    set -x

    if [[ -z ${SKIP[discard]:-} ]]; then
        discard_ssds
    fi

    if [[ -z ${SKIP[vg]:-} ]]; then
        make_vg
    elif [[ -z ${SKIP[vgextend]:-} ]]; then
        lvm vgextend "${VGNAME}" "${SOLIDSTATE[@]}" "${ROTATIONAL[@]}"
    fi

    if [[ -z ${SKIP[crypt]:-} ]]; then
        make_crypt
    fi

    if [[ -z ${SKIP[cache]:-} ]] && [[ "${#SOLIDSTATE[@]}" -ne 0 ]]; then
        make_cache
    fi

    if [[ -z ${SKIP[attach]:-} ]]; then
        cache_crypt
    fi

    if [[ -z ${SKIP[lvs]:-} ]]; then
        make_lvs
    fi

    if [[ -z ${SKIP[swap]:-} ]]; then
        make_swap
    fi

    if [[ -z ${SKIP[btrfs]:-} ]]; then
        make_btrfs
    fi
}

do_setup "$@"

Stratis: Easy local storage management for Linux

Posted May 31, 2018 18:20 UTC (Thu) by idra (subscriber, #36289) [Link] (1 responses)

What's the point of hosting swap on it?

Stratis: Easy local storage management for Linux

Posted May 31, 2018 21:04 UTC (Thu) by SEJeff (guest, #51588) [Link]

Unless all code uses mlock(2), it could get paged to disk. If said memory includes secrets... do you really want secret variables such as say passwords paged to unencrypted disk? That is why.

Stratis: Easy local storage management for Linux

Posted May 30, 2018 20:23 UTC (Wed) by gioele (subscriber, #61675) [Link] (5 responses)

One of the best features of ZFS is that its ZRAID levels can cope with partial data losses. If some blocks on a dev or a device in a pool are misbehaving, only some of the files become inaccessible. In the RAID world, if a single sector on a disk is unreadable, the whole disk is lost and the whole file system is in danger until the RAID is rebuilt.

ZFS is able to cope with partial data losses because it can "see through the RAID" and understand which files were stored in the section in question. File systems that live on top of dm-raid do not know enough about the underlying layers and their only possibility is to say "the whole device has been lost".

Is Stratis similar to ZFS or to ext4-over-dm-raid in this regard?

Stratis: Easy local storage management for Linux

Posted May 30, 2018 21:58 UTC (Wed) by agrover (guest, #55381) [Link] (2 responses)

Yeah, this is a spot where deep integration is very helpful. I think Stratis can do two things. First, Stratis can use dm-integrity along with (underneath) dm-raid. Since dm-integrity detects previously silent data corruption, if it returns an error in this case then dm-raid is in a great position to reconstruct the correct result from its other raid volumes.

The other thing that Stratis can do is that it does have the ability to "see through" layers. Going downwards, the thin pool makes this available using the thin_rmap tool, so XFS reporting an error at (what it thinks is) LBA 1000 could be translated into the block in the pool. Then Stratis knows what blockdev that pool block is on, and its offset. Similarly, XFS also has reverse mapping data, see https://lwn.net/Articles/695290/, so I would say there are definitely limitations Stratis must overcome (e.g. thin_rmap currently only works when the thinpool is offline) but that doesn't mean info is always unobtainable.

Stratis: Easy local storage management for Linux

Posted Jun 1, 2018 7:13 UTC (Fri) by Felix (subscriber, #36445) [Link] (1 responses)

That sounds very promising. Does that mean we can (eventually) get fast raid resync by only syncing the parts of the disk which contain actual user data?

Stratis: Easy local storage management for Linux

Posted Jun 7, 2018 19:18 UTC (Thu) by Wol (subscriber, #4433) [Link]

md-raid? I'm hoping to add that at some point. Okay it will slow down the system somewhat, but the idea is that "tar cf - /filesystem > /dev/null" will do it :-) Obviously only on raid-6 or a 3-disk mirror - a 2-disk mirror or raid-5 don't have sufficient info to fix broken sectors.

Cheers,
Wol

Stratis: Easy local storage management for Linux

Posted Jun 4, 2018 11:51 UTC (Mon) by hkario (subscriber, #94864) [Link] (1 responses)

Actually, the combination of dm-integrity with mdraid copes with data corruption quite well already, see my article on this for scripts to quickly experiment:
https://securitypitfalls.wordpress.com/2018/05/08/raid-do...

Stratis: Easy local storage management for Linux

Posted Jun 4, 2018 15:40 UTC (Mon) by paulj (subscriber, #341) [Link]

Very useful blog post. Thanks!

Stratis: Easy local storage management for Linux

Posted May 31, 2018 11:56 UTC (Thu) by ptman (subscriber, #57271) [Link] (4 responses)

IIRC it is still not possible to shrink XFS. That might turn out to be tricky.

Stratis: Easy local storage management for Linux

Posted May 31, 2018 16:29 UTC (Thu) by agrover (guest, #55381) [Link]

You can't shrink its size but if it is allocated from a thin-provisioned pool you can reclaim unused portions.

Stratis: Easy local storage management for Linux

Posted May 31, 2018 16:46 UTC (Thu) by otaylor (subscriber, #4190) [Link] (2 responses)

As I understand the article, the idea is that thin provisioning (and fstrim) largely remove the need to shrink filesystems - you can free up space to use for other filesystems without having to change the filesystem size. It seems workable, though it would need to be thoroughly propagated through the user interface to avoid confusing the user, since, for example, "Available space" in 'df' output would now be meaningless.

Stratis: Easy local storage management for Linux

Posted May 31, 2018 17:56 UTC (Thu) by em-bee (guest, #117037) [Link]

probably the same or similar mechanisms as with the quota system.

greetings, eMBee.

Stratis: Easy local storage management for Linux

Posted Jun 4, 2018 7:49 UTC (Mon) by mbukatov (subscriber, #96216) [Link]

Yes, they note this in the Stratis Software Design document:

> In the absence of online shrink, Stratis would rely on trim to
> reclaim space from an enlarged but mostly empty filesystem,
> and return it to the thin pool for use by other filesystems.

Stratis: Easy local storage management for Linux

Posted May 31, 2018 17:52 UTC (Thu) by em-bee (guest, #117037) [Link] (9 responses)

as i am currently setting up an lvm thinpool for glusterfs, i really appreciate that stratis would simplify that process, because in the end i really don't care about how it is laid out underneath.

that's what i like about btrfs. it's all built in.

i'd be interested to know if stratis plans to support different filesystems besides xfs. (ext4 at least)

i'd also like to see if stratis could support in-place conversion.

btrfs can convert other filesystems because it is able to wrap around them.

this should be possible with stratis/lvm too. (i have seen a tool that converts a partition to lvm without wiping the filesystem. and while i would not trust that tool to be safe yet, it shows that it can be done.)

the ability to take an existing semi filled partition and upgrade it to stratis would be a godsend.

greetings, eMBee.

Stratis: Will it ever have the polish btrfs was meant to have but never quite got?

Posted Jun 3, 2018 20:20 UTC (Sun) by sdalley (subscriber, #18550) [Link] (6 responses)

I've really tried hard to like btrfs, but I've just now given up on it and gone to xfs.

btrfs was running my ubuntu root and /home filesystems for about 3 years now, in an as-vanilla-as-possible RAID1-mirrored arrangement. Same partition layout across two HDDs. No playing about with not-yet-proven RAID5/6 for me, I want a quiet life.

To its credit, it's been reliable in normal use, and snapshots are dead-easy. Nice to know that data as well as metadata is CRCd. The occasional manual scrub to shine the data up and clean up the recoverable errors (of which there were surprisingly many) was no great hardship. Scratched my head a bit over "balancing" the mirrors as the story seems to get a bit foggy there [1]. Software updates took at least twice as long to run as they had done under ext4, I put that down to the CoW nature of btrfs.

At one point, after deleting unwanted junk from my thunderbird Inbox, its file fragmentation exceeded 14,000 extents. Not unreasonable I suppose for a CoW filesystem. Attempted to set the auto-defrag file attribute and mount option but it didn't seem to do anything. Recursive defragment of /home made no impression. Resorted to copy-delete-rename of the offending file. The smell of not-quite-finished filesystem was increasing.

Then I started to get stutters and mouse freezes, plus some nasty I/O errors in the syslog. Sure enough, one of the HDDs was gaining lots of bad sectors in one of the root mirror partitions. No applications reporting file errors though, btrfs was robust in that regard. btrfs scrub fixed (temporarily) all but two of them, fortunately those two appeared to be in free space. Time for action!

Rebooted with a recent sysresccd image and manually mounted the failing filesystem. Consulted the google and the btrfs manpages. Uh-oh, btrfs device remove and btrfs device add might be problematic and leave my filesystem stranded read-only. [2]. Another strong whiff of not-quite-finished filesystem. But don't-you-worry, there's now a shiny btrfs replace command for exactly your failing-disk use case - great! Created a new partition on a new disk of the same size as the failing partition. Issued the fine command:

btrfs replace start <bad-partition-id> <new-partition-device-file> <bad-filesystem-mount-point>

Command line accepted. Command immediately terminated with message similar to:
Error: status: Inappropriate ioctl for device

Blast.

btrfs filesystem scrub fixed hundreds of errors (again). Re-attempted btrfs-replace with same-blasted-error. At gone 1AM, I'm out of time and motivation to take this any further. Btrfs just isn't there yet for prime time. Create some xfs partitions, back up the damn thing, go to bed.

[1] https://unix.stackexchange.com/questions/409446/do-i-need...
[2] https://btrfs.wiki.kernel.org/index.php/Gotchas#raid1_vol...

Stratis: Will it ever have the polish btrfs was meant to have but never quite got?

Posted Jun 4, 2018 0:06 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

This pretty much mirrors my experience. I ultimately went with a good old MD RAID and btrfs for my file junkyard server. I haven't found any other Linux filesystem that can do snapshots.

Stratis: Will it ever have the polish btrfs was meant to have but never quite got?

Posted Jun 4, 2018 11:57 UTC (Mon) by hkario (subscriber, #94864) [Link] (2 responses)

And that's the reason why I don't have high hopes for Stratis, the LVM based snapshotting is slow and every new snapshot is expensive. You need COW for snapshots to work well and DM doesn't support that yet, and I haven't seen anything that plans to add it.

Stratis: Will it ever have the polish btrfs was meant to have but never quite got?

Posted Jun 4, 2018 12:51 UTC (Mon) by sync (guest, #39669) [Link] (1 responses)

Stratis uses dm-thinp snapshots. These are fast. I did tests with thousands of lvmthin (dm-thinp) snapshots. Worked fine.

Stratis: Will it ever have the polish btrfs was meant to have but never quite got?

Posted Jun 4, 2018 17:48 UTC (Mon) by hkario (subscriber, #94864) [Link]

with the origin being writeable? what was the performance difference for the origin with and without snapshots?

Stratis: Will it ever have the polish btrfs was meant to have but never quite got?

Posted Oct 2, 2018 7:44 UTC (Tue) by hisdad (guest, #5375) [Link]

I used nilfs for a while some time back on a mail spool. It was impressive.

Stratis: Will it ever have the polish btrfs was meant to have but never quite got?

Posted Oct 2, 2018 14:49 UTC (Tue) by wazoox (subscriber, #69624) [Link]

Late to the show, but my office NAS has been running NILFS2 on its main 15 TB share volume without a hiccup since 2013. Infinite snapshots, excellent performance.

Stratis: Easy local storage management for Linux

Posted Jun 4, 2018 7:57 UTC (Mon) by mbukatov (subscriber, #96216) [Link] (1 responses)

> btrfs can convert other filesystems because it is able to wrap around them.

Do you mean https://btrfs.wiki.kernel.org/index.php/Conversion_from_Ext3 ? I tried it some time ago and it didn't work well for me (conversion seemed to finish fine, but then scrub caused kernel panic), and haven't tried it again as the btrfs wiki now contains this warning:

> Warning: As of 4.0 kernels this feature is not often used or well tested anymore, and there have been some reports that the conversion doesn't work reliably. Feel free to try it out, but make sure you have backups.

Stratis: Easy local storage management for Linux

Posted Jun 4, 2018 15:32 UTC (Mon) by em-bee (guest, #117037) [Link]

yup, that one. i used it successfully. btw: as of i think 4.6 that conversion code has been rewritten and is supposedly good to use again.

greetings, eMBee.


Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds