Hard links?

Posted Aug 15, 2024 15:56 UTC (Thu) by quotemstr (subscriber, #45331)
Parent article: Memory-management: tiered memory, huge pages, and EROFS

SQ: why not just hard link identical files when constructing EROFS images? Most of the problems with hard links occur when mutating files behind the hard link (e.g. atomic rename versus in-place replacement), but what problems would simple hard links cause in an immutable filesystem?

Or is the patch more about sharing content between identical files in *different* filesystems? That seems much riskier. Suppose I have /dev/sda and /dev/sdb mounted on /mnt/foo and /mnt/bar, respectively. Let's suppose both filesystems are EROFS. One program opens /mnt/foo/my.txt and another, unrelated, program opens identical /mnt/bar/my.txt. AIUI, this patch silently re-routes the second open (for the file under /mnt/bar) to the inode already opened under /mnt/foo.

What happens when I surprise-remove /dev/sda? ISTM both programs would break, even though my expectation would be that only the program that opened /mnt/foo/my.txt would break and the program that opened /mnt/bar/my.txt would keep working.



Hard links?

Posted Aug 15, 2024 20:44 UTC (Thu) by iabervon (subscriber, #722) [Link]

My impression is that EROFS isn't for the use case of filesystems mapped one-to-one with physical devices. It's for the case where you have a single physical device storing a dozen images, each used with different overlays to make different containers. I believe the implementation is such that you'd have three storage areas: one with all of the file contents, one with the mapping of paths in /mnt/foo, and one with the mapping of paths in /mnt/bar, and all three would be on the same physical device. Currently, despite the fact that both /mnt/foo/my.txt and /mnt/bar/my.txt use the same storage for the file contents, the cache stores it twice.
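
To make the duplication concrete, here is a rough userspace check of my own (hypothetical paths, nothing from the patch set): read one copy, then ask mincore(2) how much of each file is resident in the page cache. Without cross-mount sharing, only the copy that was actually read shows up as cached.

    /* Rough sketch; paths are hypothetical. Build: cc -O2 cachecheck.c */
    #define _DEFAULT_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static double resident_ratio(const char *path)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) || st.st_size == 0)
            return 0.0;

        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        long pagesz = sysconf(_SC_PAGESIZE);
        size_t pages = (st.st_size + pagesz - 1) / pagesz;
        unsigned char *vec = malloc(pages);
        size_t hot = 0;

        if (map != MAP_FAILED) {
            /* mincore() reports page cache residency for this file-backed
             * mapping without faulting anything in. */
            if (!mincore(map, st.st_size, vec))
                for (size_t i = 0; i < pages; i++)
                    hot += vec[i] & 1;
            munmap(map, st.st_size);
        }
        free(vec);
        close(fd);
        return (double)hot / pages;
    }

    int main(void)
    {
        /* Pull one copy into the page cache... */
        system("cat /mnt/foo/my.txt > /dev/null");
        /* ...then check both copies: today the second one stays cold. */
        printf("foo resident: %.0f%%\n", 100 * resident_ratio("/mnt/foo/my.txt"));
        printf("bar resident: %.0f%%\n", 100 * resident_ratio("/mnt/bar/my.txt"));
        return 0;
    }

With the proposed sharing in place, I'd expect both ratios to come back high after a single read.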

Hard links?

Posted Aug 16, 2024 9:27 UTC (Fri) by hsiangkao (subscriber, #123981) [Link] (5 responses)

> SQ: why not just hard link identical files when constructing EROFS images? Most of the problems with hard links occur when mutating files behind the hard link (e.g. atomic rename versus in-place replacement), but what problems would simple hard links cause in an immutable filesystem?

Simply because hard links cannot handle cases where files have identical data but different metadata (such as timestamps and/or file modes). That is quite common after image rebuilds or the like, but users need to keep the differing metadata.
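
As a quick userspace illustration of that limitation (the file names below are made up): both names of a hard link resolve to one inode, so metadata changed through one name shows up through the other, and per-name timestamps or modes simply cannot be expressed.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* Two names, one inode. */
        int fd = open("a.txt", O_CREAT | O_WRONLY, 0644);
        close(fd);
        link("a.txt", "b.txt");

        /* Try to give "a.txt" different metadata... */
        chmod("a.txt", 0600);

        struct stat a, b;
        stat("a.txt", &a);
        stat("b.txt", &b);
        /* ...and "b.txt" changes too, because the metadata lives in the
         * shared inode. */
        printf("a.txt mode %o, b.txt mode %o, same inode %lu\n",
               a.st_mode & 0777, b.st_mode & 0777, (unsigned long)a.st_ino);

        unlink("a.txt");
        unlink("b.txt");
        return 0;
    }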

> Or is the patch more about sharing content between identical files in *different* filesystems? That seems much riskier. Suppose I have /dev/sda and /dev/sdb mounted on /mnt/foo and /mnt/bar, respectively. Let's suppose both filesystems are EROFS. One program opens /mnt/foo/my.txt and another, unrelated, program opens identical /mnt/bar/my.txt. AIUI, this patch silently re-routes the second open (for the file under /mnt/bar) to the inode already opened under /mnt/foo.
> What happens when I surprise-remove /dev/sda? ISTM both programs would break, even though my expectation would be that only the program that opened /mnt/foo/my.txt would break and the program that opened /mnt/bar/my.txt would keep working.

This patch set is still far from complete; ideally it needs an anonymous address_space shared by both identical inodes. If one instance is unmounted, reads would be re-routed to the source on the other device.

Sorry about the incomplete implementation so far; it reflects the author's current kernel knowledge.

Hard links?

Posted Aug 16, 2024 9:46 UTC (Fri) by hsiangkao (subscriber, #123981) [Link] (1 responses)

>> SQ: why not just hard link identical files when constructing EROFS images? Most of the problems with hard links occur when mutating files behind the hard link (e.g. atomic rename versus in-place replacement), but what problems would simple hard links cause in an immutable filesystem?
> Simply because hard links cannot handle cases where files have identical data but different metadata (such as timestamps and/or file modes). That is quite common after image rebuilds or the like, but users need to keep the differing metadata.

Also, hard links cannot work across separate mount instances, which is of course a common case for image-based golden filesystems.

Hard links?

Posted Sep 4, 2024 1:59 UTC (Wed) by viro (subscriber, #7872) [Link]

Huh? You can't call link(2) between different mounts, but then your filesystem is readonly, so you can't call any directory-modifying syscalls anyway. Different mounts can very well share struct inode instances - otherwise cache coherency would've been an awful PITA.

Metadata differences make for a valid reason; any mount-related stuff is a red herring.

Hard links?

Posted Aug 16, 2024 13:06 UTC (Fri) by hsiangkao (subscriber, #123981) [Link] (2 responses)

>> Or is the patch more about sharing content between identical files in *different* filesystems? That seems much riskier. Suppose I have /dev/sda and /dev/sdb mounted on /mnt/foo and /mnt/bar, respectively. Let's suppose both filesystems are EROFS. One program opens /mnt/foo/my.txt and another, unrelated, program opens identical /mnt/bar/my.txt. AIUI, this patch silently re-routes the second open (for the file under /mnt/bar) to the inode already opened under /mnt/foo.
>> What happens when I surprise-remove /dev/sda? ISTM both programs would break, even though my expectation would be that only the program that opened /mnt/foo/my.txt would break and the program that opened /mnt/bar/my.txt would keep working.
> This patch set is still far from complete; ideally it needs an anonymous address_space shared by both identical inodes. If one instance is unmounted, reads would be re-routed to the source on the other device.
> Sorry about the incomplete implementation so far; it reflects the author's current kernel knowledge.

To add a few words: for the case you mentioned, this is no different from page-based page cache sharing, or even page cache sharing within the same device (considering data link errors, various device-mapper targets, and more).

That is, you turn on this page cache feature _only if_ you trust all data sources, since each data source can contribute data to the system-wide shared page cache.

Hard links?

Posted Aug 16, 2024 14:38 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (1 responses)

Thanks for the explanation. Super helpful; sounds like a nice feature once the kinks get worked out.

Another SQ: might it be possible to piggyback on top of fs-verity and make it general across filesystems --- at least read-only ones? fs-verity already maintains per-file hashes in metadata, and fs-verity ensures that nobody's lying about what's in the file.

Also, have you considered some kind of "RAID 1" striped IO strategy? If I have block devices A and B with read only filesystems containing identical file X, why not read X from A and B in parallel? I can think of all sorts of fun things to do with the knowledge that two files on the same filesystem are identical.
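
Roughly, I'm imagining something like the userspace sketch below (the paths /mnt/a/X and /mnt/b/X are made up, and this is only an illustration of the question, not anything the patch set implements): each identical copy serves half of the file in parallel.

    /* Build with: cc -pthread striped.c */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct half {
        const char *path;   /* one of the identical copies */
        off_t off;
        size_t len;
        char *dst;
    };

    static void *read_half(void *arg)
    {
        struct half *h = arg;
        int fd = open(h->path, O_RDONLY);
        if (fd >= 0) {
            pread(fd, h->dst, h->len, h->off);
            close(fd);
        }
        return NULL;
    }

    int main(void)
    {
        struct stat st;
        if (stat("/mnt/a/X", &st) || st.st_size == 0)
            return 1;

        char *buf = malloc(st.st_size);
        size_t half = st.st_size / 2;
        /* First half from device A, second half from device B. */
        struct half lo = { "/mnt/a/X", 0, half, buf };
        struct half hi = { "/mnt/b/X", (off_t)half, st.st_size - half, buf + half };

        pthread_t t1, t2;
        pthread_create(&t1, NULL, read_half, &lo);
        pthread_create(&t2, NULL, read_half, &hi);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        printf("read %lld bytes, halves fetched from two devices in parallel\n",
               (long long)st.st_size);
        free(buf);
        return 0;
    }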

Hard links?

Posted Aug 16, 2024 15:14 UTC (Fri) by hsiangkao (subscriber, #123981) [Link]

> Another SQ: might it be possible to piggyback on top of fs-verity and make it general across filesystems --- at least read-only ones? fs-verity already maintains per-file hashes in metadata, and fs-verity ensures that nobody's lying about what's in the file.

I tend to keep this feature knob privileged all the time (yeah, I know EROFS mounting needs privilege). The privileged user mount program could decide whether it uses dm-verity/fs-verity or other(TM) technologies to enable this feature safely (so we trust the privileged user mount program).

> Also, have you considered some kind of "RAID 1" striped IO strategy? If I have block devices A and B with read only filesystems containing identical file X, why not read X from A and B in parallel? I can think of all sorts of fun things to do with the knowledge that two files on the same filesystem are identical.

I guess it would be helpful (and it would still be compatible if Linux MM gains a finer-grained page-based page cache sharing mechanism in the future), but I think we should make this feature work for the original use case first.

Also, I think it would be better to reuse the current EROFS trusted domain ID concept: all sources in the same trusted domain (e.g. the same physical device, the same network host, or a user-defined trusted array could form a domain) could share page cache safely, while sources from different trusted domains would not share by design.
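
As a minimal sketch of that rule (my own illustration, not the actual EROFS code): two sources would be allowed to share page cache only when their domain IDs match.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    struct source {
        const char *name;
        const char *domain_id;   /* e.g. same device or same network host */
    };

    /* Sharing is permitted only within one trusted domain. */
    static bool may_share_page_cache(const struct source *a,
                                     const struct source *b)
    {
        return a->domain_id && b->domain_id &&
               strcmp(a->domain_id, b->domain_id) == 0;
    }

    int main(void)
    {
        struct source sda = { "/dev/sda image", "local-nvme" };
        struct source sdb = { "/dev/sdb image", "local-nvme" };
        struct source net = { "network blob",   "host-10.0.0.2" };

        printf("sda & sdb: %s\n", may_share_page_cache(&sda, &sdb) ? "share" : "no sharing");
        printf("sda & net: %s\n", may_share_page_cache(&sda, &net) ? "share" : "no sharing");
        return 0;
    }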

Hard links?

Posted Aug 16, 2024 12:17 UTC (Fri) by HongzhenLuo (guest, #172928) [Link] (1 responses)

> SQ: why not just hard link identical files when constructing EROFS images? Most of the problems with hard links occur when mutating files behind the hard link (e.g. atomic rename versus in-place replacement), but what problems would simple hard links cause in an immutable filesystem?

Let's take the container scenario as an example. Because of the different mount points, accessing the same file from two containers produces two different in-kernel inodes across the two EROFS instances. Therefore, it's not feasible to share the page cache simply through hard links.
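
A tiny illustration (with hypothetical paths): stat(2) on the "same" file through two mounts reports two distinct identities, and each in-kernel inode carries its own page cache, so link(2)-style sharing cannot apply.

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat a, b;
        if (stat("/mnt/foo/my.txt", &a) || stat("/mnt/bar/my.txt", &b)) {
            perror("stat");
            return 1;
        }
        printf("foo: dev=%lu ino=%lu\n", (unsigned long)a.st_dev,
               (unsigned long)a.st_ino);
        printf("bar: dev=%lu ino=%lu\n", (unsigned long)b.st_dev,
               (unsigned long)b.st_ino);
        /* Different st_dev means different superblocks, hence different
         * in-kernel inodes (and page caches) even for identical data. */
        return 0;
    }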

> Or is the patch more about sharing content between identical files in *different* filesystems? That seems much riskier. Suppose I have /dev/sda and /dev/sdb mounted on /mnt/foo and /mnt/bar, respectively. Let's suppose both filesystems are EROFS. One program opens /mnt/foo/my.txt and another, unrelated, program opens identical /mnt/bar/my.txt. AIUI, this patch silently re-routes the second open (for the file under /mnt/bar) to the inode already opened under /mnt/foo.
> What happens when I surprise-remove /dev/sda? ISTM both programs would break, even though my expectation would be that only the program that opened /mnt/foo/my.txt would break and the program that opened /mnt/bar/my.txt would keep working.

As @hsiangkao said, the current patch set is still incomplete, although it has already shown positive effects in saving memory. The current implementation maps files with identical content onto the same list and uses the page cache of one inode in that list to fulfill read requests. When an inode is destroyed, it is removed from the list; however, if the inode being destroyed happens to be the one backing the shared page cache, it must wait for all in-flight reads to complete before it can be removed from the list (this is a drawback of the current implementation). A solution that can alleviate this is to perform reads through an anonymous inode: all files with the same content are mapped to the same anonymous inode, and reads are served from that anonymous inode's page cache. When a real inode is destroyed, the anonymous inode obtains the information needed to complete read requests from one of the remaining inodes. From the perspective of the unmounted instance, this takes almost no waiting time.
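
To sketch the fallback idea in userspace terms (this is only a model with made-up paths, not the kernel code): several identical backing sources feed one shared read path, and when one source goes away, later reads continue from another source in the list.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MAX_SOURCES 4

    struct shared_content {
        int fds[MAX_SOURCES];   /* identical backing files; -1 = gone */
        int nr;
    };

    /* Serve a read from the first still-available backing source. */
    static ssize_t shared_pread(struct shared_content *sc, void *buf,
                                size_t count, off_t off)
    {
        for (int i = 0; i < sc->nr; i++) {
            if (sc->fds[i] < 0)
                continue;
            ssize_t ret = pread(sc->fds[i], buf, count, off);
            if (ret >= 0)
                return ret;
        }
        return -1;   /* every source is gone */
    }

    /* Mimic "unmount": drop one backing source; readers are unaffected
     * as long as at least one identical source remains. */
    static void shared_drop(struct shared_content *sc, int idx)
    {
        if (idx < sc->nr && sc->fds[idx] >= 0) {
            close(sc->fds[idx]);
            sc->fds[idx] = -1;
        }
    }

    int main(void)
    {
        struct shared_content sc = { .nr = 2 };
        sc.fds[0] = open("/mnt/foo/my.txt", O_RDONLY);   /* hypothetical */
        sc.fds[1] = open("/mnt/bar/my.txt", O_RDONLY);   /* hypothetical */

        char buf[64];
        ssize_t n = shared_pread(&sc, buf, sizeof(buf), 0);
        printf("first read: %zd bytes\n", n);

        shared_drop(&sc, 0);                 /* "unmount" the first source */
        n = shared_pread(&sc, buf, sizeof(buf), 0);
        printf("after dropping source 0: %zd bytes\n", n);
        return 0;
    }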

Finally, I am a new kernel developer. Due to my limited knowledge of the kernel, I am sorry for the incompleteness in the current implementation.

Hard links?

Posted Sep 4, 2024 0:52 UTC (Wed) by fest3er (guest, #60379) [Link]

"Finally, I am a new kernel developer. Due to my limited knowledge of the kernel, I am sorry for the incompleteness in the current implementation."

No need for apologies. As a work-in-progress, it is expected to have functionality holes and to change significantly as inefficiencies are found, methods are polished, and you learn more about the kernel innards.

