A critical look at sysfs attribute values

March 17, 2010

This article was contributed by Neil Brown

One of the many memorable lines from Douglas Adams's famous work The Hitchhiker's Guide to the Galaxy was the accusation, probably leveled by supporters of the Encyclopedia Galactica, that the Hitchhiker's Guide was "unevenly edited" and "contains many passages which simply seemed to its editors like a good idea at the time." With small modifications, such as replacing "edited" with "reviewed", this description seems very relevant to the Linux kernel, and undoubtedly many other bodies of software, whether open or closed, free or proprietary. Review is at best "uneven".

It isn't hard to find complaints that the code in the Linux kernel isn't being reviewed enough, or that we need more reviewers. The creation of tags like "Reviewed-by" for patches was in part an attempt to address this by giving more credit to reviewers and thereby encouraging more people to get involved in that role.

However one can equally well find complaints about too much review, where developers cannot make progress with some feature because, every time they post a revision, someone new complains about something else and so, in the pursuit of perfection, the good is lost. Similarly, though it does not seem to be a problem lately, there have been times when lots of review would simply result in complaints about white-space inconsistency and spelling mistakes -- things that are worth correcting, but not worth burying a valuable contribution under.

Finding the right topic, the right level, and the right forum for review is not easy (and finding the time can be even harder). This article doesn't propose to address those questions directly, but rather to present a sample of review - a particular topic at a particular level on a particular forum, in the hope that it will be useful. The topic chosen, largely because it is something that your author has needed to work with lately without completely understanding, is "sysfs", the virtual filesystem that provides access to some of the internals of the Linux kernel. And in particular, the attribute files that expose the fine detail of that access.

The level chosen is a high-level or holistic view, asking whether the implementation matches the goals, and at the same time asking whether the goals are appropriate. And the forum is clearly the present publication.

Sysfs and attribute files

Sysfs has an interesting history and a number of design goals, both of which are worth understanding, but neither of which will be examined here except in as much as they reflect specifically the chosen topic: attribute files. The key design goal relating to attribute files is the stipulation - almost a mantra - of "one file, one value" or sometimes "one item per file". The idea here is that each attribute file should contain precisely one value. If multiple values are needed, then multiple files should be used.

A significant part of the history behind this stipulation is the experience of "procfs" or /proc. /proc is a beautiful idea that unfortunately grew in an almost cancerous way to become widely despised. It is a virtual filesystem that originally had one directory for each process that was running, and that directory contained useful information about the running process in various files.

There is clearly more than just processes that could usefully be put in a virtual filesystem, and, with no clear reason to the contrary, things started being added to procfs. With no real design or structure, more and more information was shoe-horned into procfs until it became an unorganised mess. Even inside the per-process directories procfs isn't a pretty sight. Some files (e.g. limits) contain tables with column headers, others (e.g. mounts) have tables without headers, and still others (e.g. status) have rows labeled rather than columns. Some files have single values (e.g. wchan) while others have lots of assorted and inconsistently formatted values (e.g. mountstats).

Against this background of disorganisation and the attendant difficulty of adding new fields without breaking applications, sysfs was declared to have a new policy - one item per file. In fact, in his excellent (though now somewhat out-dated) article on the Driver Model Core, Greg Kroah-Hartman even asserted that this rule was "enforced" (see the side bar on "sysfs").

It would not be fair to hold Greg accountable to what could have been a throw-away line from years ago, and I don't wish to do that. However that comment serves well in providing a starting point and a focus for reviewing the usage of attribute files in sysfs. We can ask if the rule really is being enforced, whether the rule is sufficient to avoid past mistakes, and whether the rule even makes sense in all cases.

As you might guess the answers will be "no", "no" and "no", but the explanation is far more enlightening than the answer.

Is it enforced?

The best way to test if the rule has been enforced is to survey the contents of sysfs - do files contain simple values, or something more? As a very rough assessment of the complexity of the contents of sysfs attribute files, we can issue a simple command:

 find /sys -mount -type f | xargs wc -w | grep -v ' total$'

to get a count of the number of words in each attribute file (the "-mount" is important if you have /sys/kernel/debug mounted, as reading things in there can cause problems).
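
The same survey can be sketched in Python, which makes it a little easier to skip files that refuse to be read. This is only an illustration: the function name `word_count_histogram` is invented here, the device check is a rough stand-in for `-mount`, and unreadable files are simply skipped rather than counted.

```python
import os
from collections import Counter


def word_count_histogram(root):
    """Return a Counter mapping 'words in a file' to 'number of such files'."""
    histogram = Counter()
    root_dev = os.stat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root):
        # Rough equivalent of "-mount": do not descend into other filesystems.
        dirnames[:] = [
            d for d in dirnames
            if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev
        ]
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Like "-type f": regular files only, no symbolic links.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    data = f.read()
            except OSError:
                # Write-only or otherwise unreadable attribute: skip it.
                continue
            # bytes.split() with no argument splits on runs of whitespace.
            histogram[len(data.split())] += 1
    return histogram
```

Calling `word_count_histogram("/sys")` and looking at the keys greater than one would give roughly the set of files that the rest of this section examines by hand.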

Processing these results from your author's (Linux 2.6.32) notebook shows that of the 9254 files, 1189 are empty and 7168 have only one word. It seems reasonable to assume these represent only one value (though many of the empty files are probably write-only and this mechanism gives no information about what value or values can be written). This leaves 897 (nearly 10%) which need further examination. They range from two words (487 cases) to 297 words (one case).

While there are nearly 900 files, there are fewer than 100 base names. If we filter out some common patterns (e.g. gpe%X), the number of distinct attributes is closer to 62, which is a number that can reasonably be examined manually (with a little help from some scripting). Several of these multi-word attribute files contain non-ASCII data and so are almost certainly single values in some reasonable sense. Others contain strings for which a space is a legal character, such as "Dell Inc.", "i8042 KBD port" or "write back". So they clearly are not aberrations from the rule.

There is a small class of files where the single item stored in the file is of an enumerated type. It is common for the file in these cases to contain all of the possible values listed, which still seems to hold true to the "one item per file" rule. However there are three variations on this theme:

In some cases, such as the "queue/scheduler" attribute of a block device, or the "trigger" attribute of an LED device, all of the possible options are listed, and the currently active one is enclosed in brackets, thus:
```
   noop anticipatory deadline [cfq]
```

In the second variation there are two files, one which contains the list of possibilities, as with "cpufreq/scaling_available_governors" and one which contains the currently-selected value, "cpufreq/scaling_governor".

Finally, and this could be just a special case of one of the above, we have "/sys/power/state" for which there is no current value, so it just contains a list of the possible values.

These are all examples of attribute files that do clearly contain just one value or item, but happen to use multiple words in various ways to describe those values. They are false-positives of our simplistic tool for finding complex attribute values.
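
To make the bracketed convention concrete, here is a minimal sketch of how a userspace tool might read such a file. The function name `parse_enum_attribute` is hypothetical, and the sketch assumes that option names never contain spaces or brackets of their own.

```python
def parse_enum_attribute(text):
    """Split an enumerated attribute into (options, current).

    `current` is None when no option is bracketed, as in a file that
    only lists the possible values.
    """
    options = []
    current = None
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            word = word[1:-1]  # strip the brackets marking the active choice
            current = word
        options.append(word)
    return options, current
```

Given the scheduler line shown above, this returns the four option names with `cfq` as the current one. Given an unbracketed list, it returns the options with `None` as the current value, which covers the first and third variations. The second variation needs no parsing beyond splitting on whitespace.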

However there are other multi-word attribute files that are not so easily explained away. /sys/class/bluetooth contains some class attributes such as rfcomm, l2cap and sco. Each of these contains structured data, one record per line with 3 to 9 different fields per record (depending on the particular file), the first field looking rather like the BD address of a local Bluetooth interface.

This appears to be a clear violation of the "one item per file" policy. The files do appear to be very well structured and so easy to parse, so it is tempting to think that they should be safe enough. However sysfs attribute files are limited in size to one page - typically 4KB. If the number of entries in these files ever gets too large (about 70 lines in the l2cap file), accesses to the file will start corrupting memory, or crashing. Hopefully that will never happen, but "hope" is not normally an acceptable basis for good engineering. From a conversation with the bluetooth maintainer it appears that there are plans to move these files to "debugfs" where they can benefit from the "seq_file" implementation, also used widely in /proc, which allows arbitrarily large files.

Some other examples include "/sys/devices/system/node/node0/meminfo" which appears to be a per-node version of "/proc/meminfo" and is clearly multiple values, and the "options" attributes in /sys/devices/pnp*/* which appear to contain exactly the sort of ad hoc formatting of multiple values of multiple types that people find so unacceptable in /proc. The pnp "resources" files are similarly multi-valued, though to a lesser extent.

As a final example of a lack of enforcement, the PCI device directory for the (Intel 3945) wireless network in this notebook contains a file called "statistics" which contains a hex dump of 240 bytes of data, complete with ASCII decoding at the end of each line such as:

02 00 03 00 d9 05 00 00 28 03 00 00 45 02 00 00  ........(...E...
0d 00 00 00 00 00 00 00 00 00 00 00 d6 00 00 00  ................
b1 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00 00 00 00 00 00 00 00 67 00 00 00 00 00 00 00  ........g.......

This is surely not the sort of thing that sysfs was intended to report. If anything, this looks like it should be a binary attribute, not a doubly-encoded ASCII file.

So to answer our opening question, "no", the one item per file rule is not enforced in any meaningful way. Certainly the vast majority of attribute files do contain just one item and that is good. But there are a number which contain multiple values in a variety of different ways. And this number is only likely to grow as people either copy the current bad examples, or find new use cases that don't seem to fit the existing patterns, so invent new approaches which don't take the holistic view into account.

Is the rule sufficient?

Our next question to ask is whether the stated rule for sysfs attributes is sufficient to avoid an increasingly unorganised and ad hoc sysfs following the unfortunate path of procfs. We have already seen at least one case where it isn't. We do not have a standardised way of representing an enumerated type in a sysfs attribute, and so we have at least two implementations as already mentioned. There is at least one more implementation (exposed in the "md/level" attribute of md/raid devices) where just the current value is visible and the various options are not. Having a standard here would be good for consistency and encourage optimal functionality. But we have no standard.

A similar issue arises with simple numerical values that represent measurable items such as storage size or time. It would be nice if these were reported using standard units, probably bytes and seconds. But we find that this is not the case. Amounts of storage are sometimes reported as bytes (/sys/devices/system/memory/block_size_bytes), sometimes as sectors (/sys/class/block/*/size), and sometimes as kilobytes (block/*/queue/read_ahead_kb).

As these particular examples show, one way to avoid ambiguity is to include the name of the units (bytes or kb here) as part of the attribute name, a practice known as Hungarian notation. However this is far from uniformly applied with the examples given above being more the exception than the rule.
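
A small sketch shows what this inconsistency costs a userspace tool: where the unit is in the name it can be inferred, and everywhere else it has to be known in advance. The helper names are invented for this example, and the multipliers (512-byte sectors, 1024-byte kilobytes) are assumptions stated in the table rather than anything the attribute files themselves declare.

```python
# Assumed multipliers; nothing in the attribute files states them.
UNIT_BYTES = {"bytes": 1, "kb": 1024, "sectors": 512}


def unit_from_name(name):
    """Guess the unit from a suffix such as "_kb" or "_bytes", else None."""
    for suffix, unit in (("_bytes", "bytes"), ("_kb", "kb")):
        if name.endswith(suffix):
            return unit
    return None


def to_bytes(name, value, default_unit=None):
    """Convert an already-parsed integer value to bytes.

    The unit comes from the attribute name when possible; otherwise the
    caller must supply it, which is exactly the out-of-band knowledge
    that a naming convention would make unnecessary.
    """
    unit = unit_from_name(name) or default_unit
    if unit is None:
        raise ValueError(f"no unit known for attribute {name!r}")
    return value * UNIT_BYTES[unit]
```

Here `to_bytes("read_ahead_kb", 128)` needs no help, while `to_bytes("size", 8)` fails unless the caller passes `default_unit="sectors"`.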

Measures of duration face the same problem. Many of the durations that the kernel needs to know about are substantially shorter than one second. However, rather than use the tried-and-true decimal point notation for sub-unit values, some attribute files report in milliseconds (unload_heads in libata devices), some in microseconds (cpuidle/state*/time), and some are even in seconds (/sys/class/firmware/timeout). As an extra confusion there are some (.../bridge/hello_time) which use a unit that varies depending on architecture, from centiseconds to mibiseconds (if that is a valid name for the 1/1024th part of a second). It is probably fortunate that there is no metric/imperial difference in units for time else we would probably find both of those represented too.

And then there are truth values: On, on, 1, Off, off, 0.
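
A reader that wants to cope with all six spellings ends up with something like the following sketch. The name `parse_truth` is hypothetical, and the accepted set is just the six values listed above, compared without regard to case.

```python
_TRUE = {"on", "1"}
_FALSE = {"off", "0"}


def parse_truth(text):
    """Map On/on/1 to True and Off/off/0 to False; reject anything else."""
    word = text.strip().lower()
    if word in _TRUE:
        return True
    if word in _FALSE:
        return False
    raise ValueError(f"not a recognised truth value: {text!r}")
```

Each consumer of sysfs that needs this has to write it again, and has to guess whether further spellings exist.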

So it would seem that the answer to our second question is "no" too, though it is harder to be positive about this as there is no clearly stated goal that we can measure against. If the goal is to have a high degree of uniformity in the representation of values in attributes, then we clearly don't meet that goal.

Does the requirement always make sense?

So the guiding principle of one item per file is not uniformly enforced, and it isn't really enough to avoid needless inconsistencies, but were it to be uniformly applied, would it really give us what we want, or is it too simplistic or too vague to be useful as a strict rule?

A good place to start exploring this question is the "capabilities/key" attribute of "input" devices. The content of this file is a bitmap listing which key-press events the input device can possibly generate. The bitmap is presented in hexadecimal with a space every 64 bits. Clearly this is a single value - a bitmap - but it is also an array of bits. Or maybe an array of "long"s. Does that make it multiple values in a single attribute?

While that is a trivial example which we surely would all accept as being a single value despite being many bits long, it isn't hard to find examples that aren't quite as clear cut. Every block device has an attribute called "inflight" which contains two numbers, the number of read requests that are in-flight (have been submitted, but not yet completed) and the number of write requests that are in-flight. Is this a single array, like the bitmap, or two separate values? There would be little cost to have implemented "inflight" as two separate attributes thus clearly following the rule, but maybe there would be little value either.

The "cpufreq/stats/time_in_state" attribute goes one step further. It contains pairs, one per line, of CPU frequencies (pleasingly in HZ) and the total time spent at that frequency (unfortunately in microseconds). This is more of a dictionary than an array. On reflection, this is really the same as the previous two examples. For both "key" and "inflight" the key is an enumerated type that just happens to be mapped to a zero-based sequence of integers. So in each case we see a dictionary. In this last case the keys are explicit rather than implicit.
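
The dictionary view can be made literal with a short sketch. Both function names are invented for illustration, and the labels "read" and "write" are the implicit keys described above, supplied by the reader rather than by the file.

```python
def parse_pairs(text):
    """Parse "key value" lines, one pair per line, into a dict of ints."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split()  # raises ValueError unless exactly two words
        table[int(key)] = int(value)
    return table


def parse_inflight(text):
    """Parse two numbers whose keys are implied by position."""
    reads, writes = (int(word) for word in text.split())
    return {"read": reads, "write": writes}
```

With explicit keys the file describes itself; with implicit keys the meaning of each position lives only in documentation or in the reader's code.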

If we contrast this last example with the "statistics" directory in any "net" device (net/*/statistics) we see that it is quite possible to put individual statistics in individual files. Were these 23 different values put into one file, one per line with labels, it is unlikely that anyone would accept that there was just one item in that file.

So the question here is: where do we draw the line? In each of these 4 cases (capabilities/key, inflight, time_in_state, statistics) we have a 'dictionary' mapping from an enumerated type to a scalar value. In the first case the scalar value is a truth value represented by a single bit, in the others the scalar is an integer. The size of the dictionary ranges from 2 to 23 to several hundred for "capabilities/key". Is it rational to draw a line based on the size of the dictionary, or on the size of the value? Or should it be left to the developer - a direction that usually produces disastrous results for uniformity.

The implication of these explorations seems to be that we must allow structured data to be stored in attributes, as there is no clear line between structured and non-structured data. "One item per file" is a great heuristic that guides us well most of the time, but as we have seen there are numerous times where developers find that it is not suitable and so deviate from the rules with a disheartening lack of consistency.

It could even be that the firmly stated rule has a negative effect here. Faced with a strong belief that a collection of numbers really forms a single attribute, and the strongly stated rule that multi-valued attributes are not allowed, the path of least resistance is often to quietly implement a multi-valued attribute without telling anyone. There is a reasonable chance that such code will not get reviewed until it is too late to make a change. This can lead multiple developers to solve the same problem in different ways, thus exacerbating a problem that the rule was intended to avoid.

So to answer our third question, "no", the "one item per file" rule doesn't always make sense because it isn't always clear what "one item" is, and those places of uncertainty are holes for chaos to creep into our kernel.

Can we do better?

A review that finds problems without even suggesting a fix is a poor review indeed. The above identifies a number of problems; here we at least discuss solutions.

The problem of existing attributes that are inappropriately complex or inconsistent in their formatting does not permit a quick fix. We cannot just change the format. At best we could provide new ways to access the same information, and then deprecate the old attributes. It is often stated that once something enters the kernel-userspace interface (which includes all of sysfs) it cannot be changed. However the existence of CONFIG_SYSFS_DEPRECATED_V2 disproves this claim. A policy that permits and supports deprecation and removal of sysfs attributes on an on-going basis may cause some pain but would be of long-term benefit to the kernel, especially if we expect our grandchildren to continue developing Linux.

The problem that there is a clear need for structured data in sysfs attributes is probably best addressed by providing for it rather than ignoring or refuting it. Creating a format for representing arbitrarily structured data is not hard. Agreeing on one is much more of a challenge. XML has been enthusiastically suggested and vehemently opposed. Something more akin to the structure initialisations in C might be more pleasing to kernel developers (who already know C).

Your author is currently pondering how best to communicate a list of "known bad blocks" on devices in a RAID between kernel and userspace. sysfs is the obvious place to manage the data, but one file per block would be silly, and a single file listing all bad blocks would hit the one-page maximum at about 300-400 entries, which is many fewer than we want to support. Having support for structured sysfs attributes would help a lot here.
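
The one-page ceiling is easy to check with a little arithmetic. The sketch below counts how many text lines fit in a single page. The 4096-byte page and the helper name `entries_that_fit` are assumptions, and so is any particular line format, since no format for the bad-block list has been settled.

```python
PAGE_SIZE = 4096  # assumed; the limit is one page, typically 4KB


def entries_that_fit(lines, page_size=PAGE_SIZE):
    """Count how many leading lines fit in one page, newlines included."""
    used = 0
    count = 0
    for line in lines:
        needed = len(line.encode()) + 1  # +1 for the trailing newline
        if used + needed > page_size:
            break
        used += needed
        count += 1
    return count
```

With hypothetical entries of 9 to 12 characters each, such as a sector number optionally followed by a short length, between roughly 300 and 400 lines fit, which is consistent with the estimate above.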

The final problem is how to enforce whatever rules we do come up with. Even with a very simple rule that is easily and often repeated and is heard by many, knowing the rule is not enough to cause people to follow the rule. This we have just seen.

The implementation of sysfs attribute files allows each developer to provide an arbitrary text string which is then included in the sysfs file for them. This incredible flexibility is a great temptation to variety rather than uniformity. While it may not be possible to remove that implementation, it could be beneficial to make it a lot easier to build sysfs attributes of particular well supported types, for example duration, temperature, switch, enum, storage-size, brightness, dictionary, etc. We already have a pattern for this in that module parameters are much easier to define when they are of a particular type - as can be seen when exploring include/linux/moduleparam.h. The moduleparam implementation focuses more on basic types such as int, short, long etc. For sysfs we are more interested in higher level types, however the concept is the same.
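
The shape of such an interface can be sketched outside the kernel. The following is written in Python rather than kernel C purely to show the idea: each type fixes one text representation, so a driver would pick a type instead of formatting a string. Every name here (`AttrType`, `SWITCH`, `DURATION`, `STORAGE_SIZE`) is hypothetical, as are the chosen representations (0/1, seconds with a decimal point, plain bytes).

```python
class AttrType:
    """A named value type with exactly one canonical text form."""

    def __init__(self, name, show, store):
        self.name = name
        self.show = show    # value -> text that a read would return
        self.store = store  # text that was written -> value


def _store_switch(text):
    word = text.strip()
    if word not in ("0", "1"):
        raise ValueError(f"switch expects 0 or 1, got {text!r}")
    return word == "1"


# Truth values are always "0" or "1".
SWITCH = AttrType("switch", lambda on: "1\n" if on else "0\n", _store_switch)

# Durations are always seconds, using a decimal point for sub-second values.
DURATION = AttrType(
    "duration",
    lambda seconds: f"{seconds:.6f}\n",
    lambda text: float(text.strip()),
)

# Storage sizes are always a whole number of bytes.
STORAGE_SIZE = AttrType(
    "storage-size",
    lambda nbytes: f"{nbytes}\n",
    lambda text: int(text.strip()),
)
```

For instance, `DURATION.show(0.0025)` yields `0.002500` followed by a newline, whatever unit the driver uses internally. An attribute that bypasses such types would then stand out and could be questioned during review.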

If most of sysfs were converted over to using an interface that enforces standardised appearance, it would become fairly easy to find non-standard attributes and then either challenge them, or enhance the standard interface to support them.

In Closing

It must be said that hindsight gives much clearer vision than foresight. It is easy to see these issues in retrospect, but would have been harder to be ready to guard against them from the start. While sysfs could possibly have had a better design, it could certainly have had a worse one. Creating imperfect solutions and then needing to fix them is an acknowledged part of the continuous development approach we use in the Linux kernel.

For entirely internal subsystems, we can and do fix things regularly without any concern for legacy support. For external interfaces, fixing things isn't as easy. We need to either carry unsightly baggage around indefinitely or work to remove that which doesn't work, and encourage the creation only of that which does. Is it wrong to dream that our grandchildren might work with a uniform and consistent /sys and maybe even a /proc which only contains processes?



A critical look at sysfs attribute values

Posted Mar 17, 2010 16:35 UTC (Wed) by fjpop (guest, #30115) [Link] (1 responses)

Another thing that's inconsistent is how input is accepted. In a lot of
cases a new value can be set by simply doing 'echo 1 >file', but in other
cases you need to suppress the newline and do 'echo -n 1 >file'.

A critical look at sysfs attribute values

Posted Mar 18, 2010 4:01 UTC (Thu) by hmh (subscriber, #3838) [Link]

That's considered a bug. You're supposed to eat all whitespace before AND after whatever you want to parse in a write to a sysfs attribute, and error out if there is any crap at the end.

A critical look at sysfs attribute values

Posted Mar 17, 2010 17:26 UTC (Wed) by adamgundy (subscriber, #5418) [Link] (5 responses)

there's another issue with 'one value per file', which is consistency.

if you're trying to read for example the state of a device, having to read multiple files for different attributes is going to spread your 'snapshot' of the information across time - whereas having all the data in a single file allows for a consistent view of the data.

I'd guess that's the reason for the block device 'inflight' file having two numbers for read and write requests in it..

A critical look at sysfs attribute values

Posted Mar 18, 2010 7:47 UTC (Thu) by ptman (subscriber, #57271) [Link] (1 responses)

POSIX filesystems are not transactional (ACID) like many databases, but it
could probably be done if the user is willing to jump through some hoops.

For example, a sysfs directory might contain the separate files for separate
values and a file called "snapshot" which, when written to, would create a
subdirectory (which probably would need a unique name, and reading the
unique name from the snapshot-file would probably require file locking...)
with the separate files frozen to a specific point in time.

I'm probably missing something here and the idea is too complicated to
actually be used in the real world, I'm just trying to say that it might be
solved, but the solution wouldn't be very nice.

A critical look at sysfs attribute values

Posted Mar 18, 2010 14:47 UTC (Thu) by adamgundy (subscriber, #5418) [Link]

I would think that's an easy denial of service, unless you only allow root to take snapshots (which isn't very helpful). for example, just write a shell script that loops asking for as many snapshots as it can in as many sysfs directories as it can find.. chew up all the kernel memory.

it would probably be better to automatically provide a 'snapshot' file in every directory which can be read to get a consistent view of all files in the directory in some 'well known' format, eg 'key: value' per line like HTTP/SMTP etc.

inter-value consistency

Posted Mar 22, 2010 21:45 UTC (Mon) by neilbrown (subscriber, #359) [Link] (2 responses)

While this is certainly a theoretical limitation of the one-item-per-file approach, I'm not convinced that it is a practical limitation. That is, I'm not aware of any actual situations where the inability to read two attribute files at the same moment causes a real problem.
I don't think the 'inflight' numbers really need to be read concurrently, though I guess the total number in flight might be interesting...

Do you (or anyone) know of a clear example where lack of atomic consistency is a real issue?

inter-value consistency

Posted Mar 22, 2010 22:01 UTC (Mon) by adamgundy (subscriber, #5418) [Link] (1 responses)

I suspect all the cases of 'required consistency' have been shoved elsewhere (ie /proc or /debug) because of the 'one value per file' requirement of sysfs.

for example, /proc/interrupts, /proc/slabinfo etc etc.

without a sensible way of handling consistency, there will always be a demand for putting stuff in /proc or inventing new syscalls/ioctls etc to get the information...

inter-value consistency

Posted Mar 22, 2010 22:39 UTC (Mon) by neilbrown (subscriber, #359) [Link]

You may well be right. That is unfortunate as it is hard to solve problems when they are invisible.

As for your examples, I think both of those files (interrupts and slabinfo) pre-date /sys so they cannot have been put in /proc to avoid /sys rules.
Also I cannot see any atomic-consistency issues that might arise when accessing any of this information. In fact the contents of /proc/slabinfo appear to also exist in /sys/kernel/slab/*/* which does seem to suggest independent access to the values is OK.

If there are genuine cases where consistency is needed it would be really helpful to have them documented so we can work towards making sysfs receptive to such need.

"one file per block would be silly"

Posted Mar 17, 2010 19:19 UTC (Wed) by HelloWorld (guest, #56129) [Link] (6 responses)

Why do you say that putting every block in one file is inefficient? It makes a lot of sense to me. Create a directory called bad_blocks or whatever and put the n-th bad block in a file called n-1. Voila, problem solved. Or is that too inefficient?

"one file per block would be silly"

Posted Mar 18, 2010 2:00 UTC (Thu) by glikely (subscriber, #39601) [Link] (5 responses)

Not inefficient. A sufficiently large number of bad blocks would cause an overrun of the 4k limit for sysfs attributes.

"one file per block would be silly"

Posted Mar 18, 2010 7:49 UTC (Thu) by ptman (subscriber, #57271) [Link] (4 responses)

I think he actually meant separate files, otherwise, why the directory?

"one file per block would be silly"

Posted Mar 18, 2010 9:49 UTC (Thu) by HelloWorld (guest, #56129) [Link] (3 responses)

Yes indeed. The first bad block goes in a file named 0, the second goes in a file called 1, ... the n-th bad block goes in a file called n-1.

"one file per block would be silly"

Posted Mar 18, 2010 13:10 UTC (Thu) by nix (subscriber, #2304) [Link]

... or have a bunch of empty files whose names are block numbers.

Still, this would definitely have to be dynamically-generated at lookup time: one kobject/dentry for every bad block throughout the lifetime of the /sys mountpoint would be madness.

"one file per block would be silly"

Posted Mar 18, 2010 20:10 UTC (Thu) by smurf (subscriber, #17840) [Link] (1 responses)

So if you have 1000 bad blocks, you need 3000+ system calls to read the list.

Sorry, but I don't think that this makes much sense.

Where does that silly 4k limit come from, anyway? debugfs does much better. So provide a symlink to the "real" list which lives somewhere else?

"one file per block would be silly"

Posted Mar 19, 2010 14:00 UTC (Fri) by nix (subscriber, #2304) [Link]

Uh, the syscall underlying readdir() (getdents()) reads a whole bunch of entries at once, so you don't need 3000 syscalls to readdir() through 3000 names. If you want to stat() them, then you're right: but all you need to do in this case is get their names, which is much faster.

Can we admit that sysfs was a really bad idea?

Posted Mar 17, 2010 19:25 UTC (Wed) by quotemstr (subscriber, #45331) [Link] (11 responses)

This article identifies more problems with sysfs, mainly that it's beginning to suffer from the kind of cancer that afflicted procfs. Between the inconsistent formatting, atomicity problems, and awkward shell use*, can we finally concede that sysfs was a Bad Idea, and that something like sysctl works just as well while being safer, more consistent, and more elegant?

* Try using sudo to modify a sysctl value: you have to use something like sudo sh -c 'echo foo > /sys/bar' instead of a nice simple sudo sysctl bar=5'

Can we admit that sysfs was a really bad idea?

Posted Mar 17, 2010 20:27 UTC (Wed) by daniel (guest, #3181) [Link]

"Try using sudo to modify a sysctl value: you have to use something like sudo sh -c 'echo foo > /sys/bar' instead of a nice simple sudo sysctl bar=5'"

maybe try: echo foo | sudo tee /sys/bar

Can we admit that sysfs was a really bad idea?

Posted Mar 17, 2010 20:29 UTC (Wed) by farnz (subscriber, #17727) [Link]

FWIW (and this isn't a disagreement), I use echo foo | sudo tee /sys/bar to do that job. It's still not elegant, but it's a little easier to type than sudo sh -c 'echo foo > /sys/bar'.

Can we admit that sysfs was a really bad idea?

Posted Mar 18, 2010 0:05 UTC (Thu) by mpee (subscriber, #37530) [Link]

No. It's a pretty good idea, with a few warts.

If typing all those extra characters is really causing you pain, you can
write yourself a "sysfssyctl" script in about 4 lines.

Can we admit that sysfs was a really bad idea?

Posted Mar 18, 2010 2:39 UTC (Thu) by glikely (subscriber, #39601) [Link] (6 responses)

There is another aspect to sysfs which this article doesn't address, and that is the purpose of sysfs in the first place. Sysfs' primary job is to provide a user space readable representation of the Linux object model, and methods to manipulate it. udev in particular uses it to make decisions about what to do when new devices are registered into the system (ie. add device files). The sysfs design reflects the boundaries of the problem it is intended to solve.

Sysfs is not intended to be the primary interface to a device. Nor is it intended to be a kernel configuration mechanism. A fair few drivers do use it in that capacity and a lot of the time there isn't anything strictly evil about that, but that usage is peripheral to the primary purpose. It certainly doesn't replace sysctl.

If questions about access atomicity or formatting of large data arise then I strongly suspect that the sysfs interface is being abused and that a char device or debugfs file would be a more appropriate choice.

Can we admit that sysfs was a really bad idea?

Posted Mar 18, 2010 11:13 UTC (Thu) by dgm (subscriber, #49227) [Link] (1 responses)

> It certainly doesn't replace sysctl.

Well, according to Wikipedia it does indeed. sysctl is implemented on top of sysfs and procfs.

Can we admit that sysfs was a really bad idea?

Posted Mar 22, 2010 22:02 UTC (Mon) by neilbrown (subscriber, #359) [Link]

The wikipedia article is wrong in several respects. "sysctl" and "sysfs" are independent concepts that attempt to address somewhat similar goals. sysctl is a "management information base" to which access is provided through the /proc filesystem. It may be that some values accessible through sysctl (i.e. /proc/sys) are also accessible through /sys, but that is certainly not the norm.

Can we admit that sysfs was a really bad idea?

Posted Mar 18, 2010 11:16 UTC (Thu) by mjthayer (guest, #39183) [Link] (3 responses)

>Sysfs is not intended to be the primary interface to a device. Nor is it intended to be a kernel
>configuration mechanism.
[snip]
>If questions about access atomicity or formatting of large data arise then I strongly suspect
>that the sysfs interface is being abused and that a char device or debugfs file would be a
>more appropriate choice.
As someone who has had to fiddle with sysfs in userspace recently I can only second that.

Can we admit that sysfs was a really bad idea?

Posted Mar 22, 2010 22:08 UTC (Mon) by neilbrown (subscriber, #359) [Link] (2 responses)

I would genuinely like to hear more about the problems and issues you experienced. My own belief is that, independent of any intentions to the contrary, sysfs is the de-facto primary interface to devices for configuration (though not for data transfer), and we would be well served by making it fit that role well. To do that we need to understand the problems.

Can we admit that sysfs was a really bad idea?

Posted Mar 25, 2010 4:34 UTC (Thu) by glikely (subscriber, #39601) [Link]

Heh. I was writing a driver a few years back (not mainlined) and completely ignored all the good advice I got about using a normal char dev to control my simple device. Instead I implemented a novel set of sysfs files for obtaining the state of the device. One file was written to choose the channel to read, and another file was read to get the measurement back. Nice and simple to manipulate from the shell for debug, but a complete disaster in real-world use cases (i.e. two processes cannot read from the device at the same time).

I completely abused sysfs in my driver. It was the wrong interface for actually interacting with the device, and I ended up rewriting the interface in the end. In this case the problems and issues were very much of my own making. :-)

Anytime I hear complaints about sysfs atomicity it raises a red flag for me and I wonder if it is a similar situation of using sysfs for IO instead of mere configuration. Using sysfs for IO is in a lot of cases broken.

I spoke too quickly in my earlier comment. Yes, you're right. sysfs has become the natural configuration interface for a lot of devices, and it suits that job very well. However, I still say that it doesn't replace sysctl for the non-device-centric configuration knobs (i.e. the IP stack).
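The select-then-read problem described above can be illustrated with a small in-memory model. This is only a sketch: `FakeDevice`, its attribute names, and the readings are hypothetical stand-ins for the two attribute files, not code from the driver in question, and the "processes" are simply interleaved calls.

```python
class FakeDevice:
    """Hypothetical model of a driver with two attributes:
    'channel' (written to select) and 'measurement' (read back)."""

    def __init__(self, readings):
        self.readings = readings  # channel number -> current value
        self.channel = 0          # shared state: last channel selected

    def write_channel(self, number):
        self.channel = number

    def read_measurement(self):
        return self.readings[self.channel]


def interleaved_reads(dev):
    """Two clients each select a channel and then read, but their
    select steps happen before either read step."""
    dev.write_channel(0)          # client A selects channel 0
    dev.write_channel(1)          # client B selects channel 1
    a = dev.read_measurement()    # client A reads: sees channel 1's value
    b = dev.read_measurement()    # client B reads: sees channel 1's value
    return a, b


dev = FakeDevice({0: 100, 1: 200})
result = interleaved_reads(dev)
print(result)
```

Client A asked for channel 0 but gets channel 1's value, because the selection lives in state shared by every opener of the files rather than in a per-open context.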

Can we admit that sysfs was a really bad idea?

Posted Mar 25, 2010 10:03 UTC (Thu) by mjthayer (guest, #39183) [Link]

I suppose my usage case was a bit special - I had to write code that would
enumerate and get information on devices that would run on as wide a
range of Linux-based systems as possible. Getting information out of sysfs
in a way that is fully backwards-compatible while still being future-proof
proved to be rather hard. In the end, by minimising the use I made of sysfs
to just the most basic device enumeration - the original purpose of sysfs
after all - I was able to write code that worked from early single-digit 2.6
kernels up to current ones, and with some slightly hacky extensions even
worked on all 2.4 series kernels I tried.

Can we admit that sysfs was a really bad idea?

Posted Mar 22, 2010 21:54 UTC (Mon) by neilbrown (subscriber, #359) [Link]

I think that it is a very big step to go from "it has some problems" to "it is a bad idea". To my mind sysfs has a lot of strengths.

And sysctl really isn't measurably better. Check out dev/cdrom/info (in /proc/sys). If sysctl has fewer problems it is only because it has substantially less content.

A critical look at sysfs attribute values

Posted Mar 17, 2010 20:38 UTC (Wed) by Gollum (guest, #25237) [Link] (8 responses)

How about using JSON instead of XML for structured values?

Still self-documenting (to a certain extent), and easy to parse.

A critical look at sysfs attribute values

Posted Mar 17, 2010 23:17 UTC (Wed) by vomlehn (guest, #45588) [Link] (2 responses)

Unmentioned, but implied, was the issue of inconsistency between text and binary data. You can set text values with echo, but binary data is rather less convenient. Supporting structured data, whether it's done with an XML, C, or other syntax, makes good sense. It allows a consistent snapshot to be taken and reduces the number of system calls needed to open, read, and close multiple one-per-attribute sysfs files. It should be mandatory that names and types of attributes mustn't change, though allowing attributes to be added, deleted or even reordered should also be supported.

A critical look at sysfs attribute values

Posted Mar 18, 2010 13:06 UTC (Thu) by bcopeland (subscriber, #51750) [Link] (1 responses)

If only there were some way to get structured binary data out of the kernel in an agreed-upon format... hrm, I know, let's call it ioctl!

A critical look at sysfs attribute values

Posted Mar 22, 2010 15:36 UTC (Mon) by jzbiciak (guest, #5246) [Link]

Hmmm... yes, because ioctl() has always been self-documenting and never problematic, particularly on dual-ABI machines such as x86-64. ;-) (Not to mention TOCTTOU issues with multithreaded programs.)

A critical look at sysfs attribute values

Posted Mar 17, 2010 23:25 UTC (Wed) by yanfali (subscriber, #2949) [Link] (3 responses)

YAML is also a nice simple structured text format, with a lot of available libraries.

A critical look at sysfs attribute values

Posted Mar 18, 2010 7:55 UTC (Thu) by ptman (subscriber, #57271) [Link]

YAML (at least the latest revision of the spec) is a superset of JSON.
Better start simple and add complexity only if it is needed.

A critical look at sysfs attribute values

Posted Mar 19, 2010 11:19 UTC (Fri) by mgedmin (guest, #34497) [Link]

I've read the YAML spec. I can never understand people who proclaim it to
be "simple". Even XML is simpler!

A critical look at sysfs attribute values

Posted Mar 21, 2010 21:08 UTC (Sun) by gerdesj (subscriber, #5446) [Link]

I'll take your YAML or XML or whatever and raise you ASN.1. Heck, you could mount sysmibfs and make it legible - hilarious!

The latest modish markup thingie isn't going to solve the design goals of sysfs, let alone anything else.

I personally like the simple text file with a value in it approach - I don't have to learn Yet Another Markup Language (and I don't just mean YAML 8)

Re-evaluating the goals, perhaps filling in the missing ones, and then having someone enforce the end design decisions will give the grandkids a chance to see this thing.

A critical look at sysfs attribute values

Posted Mar 18, 2010 12:29 UTC (Thu) by jengelh (subscriber, #33263) [Link]

You want to submit that as an April Fool's patch :-)

A critical look at sysfs attribute values

Posted Mar 18, 2010 15:33 UTC (Thu) by jengelh (subscriber, #33263) [Link]

>And then there are truth values: On, on, 1, Off, off, 0.

You have to add Y and N to that list (this is, btw, what sysfs returns on reading the boolean attribute).
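A userspace reader that wants to cope with all of these spellings might normalise them along the following lines. This is a minimal sketch: `parse_bool_attr` is a hypothetical helper, and the accepted spellings are only the ones mentioned in this thread, not an authoritative list of what any kernel attribute emits or accepts.

```python
# Spellings mentioned in the discussion, compared case-insensitively.
TRUE_SPELLINGS = {"1", "y", "on"}
FALSE_SPELLINGS = {"0", "n", "off"}


def parse_bool_attr(text):
    """Map the text read from a boolean attribute to True or False."""
    token = text.strip().lower()  # drop the trailing newline, ignore case
    if token in TRUE_SPELLINGS:
        return True
    if token in FALSE_SPELLINGS:
        return False
    raise ValueError("unrecognised boolean attribute value: %r" % text)


print(parse_bool_attr("Y\n"), parse_bool_attr("off\n"))
```

Raising on unknown input, rather than guessing, keeps a new spelling from being silently misread.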

A critical look at sysfs attribute values

Posted Mar 19, 2010 12:22 UTC (Fri) by ranmachan (guest, #21283) [Link] (2 responses)

One option to handle the bad block list would be to have 3 files:
Say, badblock_count, badblock_number and badblock.
Then you read it using
for (i = 0; i < badblock_count; i++) {
    write i to badblock_number
    read badblock from badblock
}
Might be quite slow though since it involves lots of context switches. :)

A critical look at sysfs attribute values

Posted Mar 21, 2010 0:37 UTC (Sun) by giraffedata (guest, #1954) [Link]

That suffers seriously from the issue mentioned in other comments of cross-file consistency. After you read the number of bad blocks, the number of them changes, and so does the identity of the 35th one.

A critical look at sysfs attribute values

Posted Mar 22, 2010 22:32 UTC (Mon) by neilbrown (subscriber, #359) [Link]

Apart from any consistency problems mentioned elsewhere, this approach really just hides the problem rather than solving it.

It still has many values behind 1 (or 3) attribute files. If the "one value per file" policy stands, it is the wrong thing to do. If the "one value per file" policy can be extended to support arbitrarily large values, then implementing that policy extension through seqfile or similar would be a much more sensible resolution.

A critical look at sysfs attribute values

Posted Mar 19, 2010 20:41 UTC (Fri) by cmccabe (guest, #60281) [Link] (2 responses)

It really seems like getting bad blocks information would best be done with Netlink, rather than sysfs.

With sysfs, if you follow the "one value per file" philosophy, you could create a potentially unlimited number of files, in view of how many blocks hard drives have these days. If you don't follow the philosophy, you end up with something that's probably both ugly and buggy.

With netlink, you don't pay the cost unless someone actually asks for the data. With sysfs, you have to create the inodes, whether or not anyone actually uses them. And of course, there are the atomicity issues. With netlink, it's a lot easier for the application to ask only for the data it wants. Userspace can do something reasonable like reading only part of the data at a time.

It also makes operations on these bad blocks a lot more natural. You can do something like query which blocks are bad, and then run operations on these blocks, as part of your send/receive netlink session.

You'd have to ask upstream if they'd be willing to accept a netlink solution first, of course. But I think this is a situation where sysfs is just a square peg in a round hole.

Netlink??

Posted Mar 22, 2010 22:27 UTC (Mon) by neilbrown (subscriber, #359) [Link] (1 responses)

I fail to understand the general fascination with netlink.

My guess is that it is a bit like debugfs: "the only rule is that there are no rules". With netlink you can do whatever you like - it is like ioctl but without the guilt.

I would always use files in a filesystem for any interaction between kernel and user-space. You can easily do large files (using seqfile). You can easily do query-response transactions using simple_transaction_* (see the nfsd filesystem).

But I don't think we should make these choices 'because it is easy' or 'because we can' but rather 'because it is right'. Determining what is 'right' is a challenge. In a lot of cases I think 'one item per file' is 'right'. But I don't think it is always right. Hence the desire to explore how sysfs is actually used in order to find a new interpretation of "right" that both acknowledges everything that currently works well, but also includes those cases that currently aren't supported well.

Just using netlink (or debugfs) because it doesn't impose rules sounds too much like taking the broad road with the wide gate - I hope you know where that leads.

Netlink??

Posted Mar 25, 2010 7:02 UTC (Thu) by ebiederm (subscriber, #35028) [Link]

For a lot of reasons netlink is a very good fit for working with the networking stack. But note the networking stack has /proc/sys/net also.

Netlink isn't too bad for general event based things.

Beyond that I would not encourage use of netlink.

I'm not certain what to do with bad blocks. They seem like part of an abstraction leaking through, so I'm not certain we want an interface for describing and dealing with them that we have to maintain for all time.

A critical look at sysfs attribute values

Posted Mar 23, 2010 13:43 UTC (Tue) by mezcalero (subscriber, #45103) [Link] (1 responses)

It's also awful that the USB subsystem encodes hexadecimal values without the 0x prefix, but the PCI subsystem includes it:

$ cat /sys/devices/pci0000\:00/0000\:00\:1a.7/vendor
0x8086
$ cat /sys/devices/pci0000\:00/0000\:00\:1a.7/usb1/idVendor
1d6b
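For what it's worth, a reader can paper over this particular inconsistency, since a base-16 parse that tolerates an optional 0x prefix handles both forms. A minimal sketch in Python, where `parse_hex_attr` is a hypothetical helper and the input strings are the two values shown above:

```python
def parse_hex_attr(text):
    """Parse a hexadecimal attribute value, with or without a 0x prefix."""
    # int() with an explicit base of 16 accepts an optional "0x" prefix.
    return int(text.strip(), 16)


pci_vendor = parse_hex_attr("0x8086\n")  # PCI style, with prefix
usb_vendor = parse_hex_attr("1d6b\n")    # USB style, without prefix
print(hex(pci_vendor), hex(usb_vendor))
```

This only works because the reader already knows both attributes are hexadecimal; an unprefixed value that happens to contain only decimal digits would still be ambiguous on its own.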

A critical look at sysfs attribute values

Posted Apr 12, 2010 21:46 UTC (Mon) by k8to (guest, #15413) [Link]

It seems a bit sad that you can't get the text representation of either of these.