The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 16:11 UTC (Tue) by NYKevin (subscriber, #129325)
In reply to: The curious case of O_DIRECTORY|O_CREAT by atnot
Parent article: The curious case of O_DIRECTORY|O_CREAT

Well, technically, there is also the option of exposing an API for the secret transaction system that's built into the filesystem. Microsoft actually did that for NTFS, and shortly thereafter turned around and published documentation telling everyone not to use it because they were going to deprecate it next week,[1] although to my knowledge this deprecation never officially happened. Perhaps someone with deeper filesystem knowledge than me can explain why this is so fraught, but from my perspective as an application developer, the whole thing seems a bit silly. You have a series of functions that amount to "atomically read X and then write Y" for numerous different values of X and Y - why not just let userspace directly express the "atomically" part instead?
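For a concrete picture of the "atomically read X and then write Y" functions mentioned above, here is a minimal Python sketch of two of them: `O_CREAT|O_EXCL` ("check that the name is free, then create it") and rename-over ("prepare new contents, then swap them in"). The helper names `create_exclusive` and `replace_contents` are made up for illustration; only the `os` and `tempfile` calls are real.

```python
import os
import tempfile


def create_exclusive(path: str, data: bytes) -> bool:
    """Atomically 'check that path does not exist, then create it'.

    Returns False if the name was already taken (hypothetical helper).
    """
    try:
        # O_EXCL makes the existence check and the creation one atomic step.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return True


def replace_contents(path: str, data: bytes) -> None:
    """Write data to a temporary file, then atomically rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Readers see either the old file or the new one, never a mix.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

Each helper covers exactly one fixed combination of check and update, which is the point being made: anything outside those combinations has no single call that expresses it.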

The obvious concern is about blocking the system, but you could still provide optimistic concurrency control, so that doesn't seem like a valid problem to me. There's also the problem of "this is too complicated and userspace might do something dumb with it" - but that hasn't stopped mmap and similarly weird syscalls from existing. And there's also "we don't know if we made the best possible transaction system, so to avoid breaking backcompat, we'll just avoid exposing it" - but SQL servers have exposed transaction systems since the beginning, and they seem to be doing just fine on the backcompat side of things. I get the sense that the real problem is "this isn't worth the engineering effort in our judgment," but nobody wants to say that out loud.

[1]: https://learn.microsoft.com/en-us/windows/win32/fileio/de...



The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 19:40 UTC (Tue) by roc (subscriber, #30627) [Link] (5 responses)

You don't want userspace code execution to be part of the transaction, that seems really dangerous even with optimistic concurrency. You would want to be able to submit the complete transaction for validation and execution by the kernel. Actually these days you could probably cobble something together using io_uring and eBPF to do this.
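To make "submit the complete transaction for validation and execution" a little more concrete, here is a toy Python sketch. Everything in it is hypothetical: `Store` merely stands in for the kernel side, `Transaction` is an invented description format, and the sketch is single-threaded (a real implementation would need to make validation and application atomic with respect to other submitters). It is not based on io_uring, eBPF, or any existing API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Transaction:
    """A complete, declarative transaction: no caller code runs inside it."""
    # key -> version the caller observed when it read the key
    expected_versions: dict[str, int] = field(default_factory=dict)
    # key -> new value to store if validation succeeds
    writes: dict[str, bytes] = field(default_factory=dict)


class Store:
    """Hypothetical stand-in for the kernel side; not a real API."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[int, bytes]] = {}

    def read(self, key: str) -> tuple[int, bytes]:
        """Return (version, value); a missing key has version 0."""
        return self._data.get(key, (0, b""))

    def submit(self, txn: Transaction) -> bool:
        """Validate the whole transaction, then apply it, or do nothing."""
        # Optimistic check: fail if anything the caller read has changed.
        for key, seen in txn.expected_versions.items():
            if self.read(key)[0] != seen:
                return False
        # Validation passed: apply every write and bump its version.
        for key, value in txn.writes.items():
            version = self.read(key)[0]
            self._data[key] = (version + 1, value)
        return True
```

A caller reads, builds a `Transaction` from what it saw, and calls `submit`; on `False` it re-reads and retries. Nothing blocks while the caller is thinking, and no caller code runs between validation and the writes.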

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 29, 2023 14:59 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (4 responses)

> You don't want userspace code execution to be part of the transaction, that seems really dangerous even with optimistic concurrency. You would want to be able to submit the complete transaction for validation and execution by the kernel.

Isn't that exactly what SQL does with COMMIT right now?

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 29, 2023 16:50 UTC (Wed) by Wol (subscriber, #4433) [Link] (2 responses)

No, SQL doesn't do it ... Oracle or PostgreSQL or ScarletDME or whatever do it.

The database batches things up until userspace/SQL/DataBASIC says "okay I'm done". At present, the kernel does not have the ability to do that.

And actually, it might be quite tricky for the kernel, in general, because with a database COMMIT, the database maintains different versions of reality. Yes, I know POSIX mandates some sort of reality distortion field, but when you're trying to get VFS, and btrfs, and ext?, and whatever whatever whatever all to agree, life gets rather hard ...

Cheers,
Wol

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 29, 2023 18:30 UTC (Wed) by jhoblitt (subscriber, #77733) [Link] (1 responses)

> No, SQL doesn't do it ... Oracle or PostgreSQL or ScarletDME or whatever do it.

AFAIK, ANSI/ISO SQL includes transactions.

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 29, 2023 19:49 UTC (Wed) by Wol (subscriber, #4433) [Link]

Yes it does. But ANSI/ISO SQL doesn't *DO* transactions - it's an interpreted/jit'd language. In the words of the GP, it is "user space code" that shouldn't go anywhere near doing transactions.

COMMIT merely tells the underlying layer (and it doesn't have to be SQL, ScarletDME is NoSQL and there are a heck of a lot of NoSQL databases out there that have "commit" - without it you can't really claim to be a database at all) "this series of actions is supposed to be carried out in an atomic manner".

This is basically the big problem that everything has with transactions. It's easy for COMMIT (or in my case, more likely START TRANSACTION / END TRANSACTION) to *define* *what* *is* *SUPPOSED* *to* *be* *atomic*; it's a much bigger problem for the underlying layer to actually implement it atomically.

Cheers,
Wol

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 29, 2023 18:36 UTC (Wed) by jhoblitt (subscriber, #77733) [Link]

Yes, and transactions can be a useful feature in a filesystem. A few years prior to the existence of AWS S3, I wrote an object-store-like pseudo-filesystem by basically converting the kernel's ext3 headers into a SQL schema. Transactions were useful in several situations, such as atomic "file" renames where the target name was an existing "file" which needed to be unlinked.
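As a rough illustration of the rename case, here is a minimal Python sketch using SQLite. The one-table `files` schema and the function names are invented for this example and are far simpler than anything derived from the ext3 headers; the sketch only shows the unlink of the existing target and the rename committing or rolling back together.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """Create an in-memory store with a hypothetical one-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files (name TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    return conn


def rename(conn: sqlite3.Connection, old: str, new: str) -> None:
    """Rename old to new, unlinking any existing new, as one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        # "Unlink" the existing target, if there is one.
        conn.execute("DELETE FROM files WHERE name = ?", (new,))
        cur = conn.execute(
            "UPDATE files SET name = ? WHERE name = ?", (new, old)
        )
        if cur.rowcount != 1:
            # No source file: raising here also undoes the DELETE above.
            raise FileNotFoundError(old)
```

If the source does not exist, the exception rolls the transaction back, so the previous target is still there afterwards rather than being lost halfway through.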

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 20:12 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> Well, technically, there is also the option of exposing an API for the secret transaction system that's built into the filesystem.

Uhm, it was not secret at any point. Moreover, MS was pushing it pretty hard around the Vista time. The API was actually pretty cool: you could even do distributed transactions (using the Microsoft Distributed Transaction Coordinator) that involved the filesystem and SQL Server. We used it to do transactional file operations in our CRM running under IIS.

There were some problems with it:
1. It was slow as hell. NTFS is not a speed demon in the best of times, and additional journaling slowed it down even more.
2. Pretty much nobody used it, even Microsoft's own Windows Update. So it got dropped from ReFS.

