.\" format with ditroff -me
.\" $FreeBSD$
.\" format made to look as a paper for the proceedings is to look
.\" (as specified in the text)
.if n \{ .po 0
.	ll 78n
.	na
.\}
.if t \{ .po 1.0i
.	ll 6.5i
.	nr pp 10		\" text point size
.	nr sp \n(pp+2		\" section heading point size
.	nr ss 1.5v		\" spacing before section headings
.\}
.nr tm 1i
.nr bm 1i
.nr fm 2v
.he ''''
.de bu
.ip \0\s-2\(bu\s+2
..
.lp
.rs
.ce 5
.sp
.sz 14
.b "Rethinking /dev and devices in the UNIX kernel"
.sz 12
.sp
.i "Poul-Henning Kamp"
.sp .1
.i "<phk@FreeBSD.org>"
.i "The FreeBSD Project"
.i
.sp 1.5
.b Abstract
.lp
An outstanding novelty in UNIX at its introduction was the notion
of ``a file is a file is a file and even a device is a file.''
Going from ``hardware only changes when the DEC Field engineer is here''
to ``my toaster has USB'' has put serious strain on the rather crude
implementation of the ``devices as files'' concept, an implementation which
has survived practically unchanged for 30 years in most UNIX variants.
Starting from a high-level view of devices and the semantics that
have grown around them over the years, this paper takes the audience on a
grand tour of the redesigned FreeBSD device-I/O system, 
to convey an overview of how it all fits together, and to explain why
things ended up as they did, how to use the new features and 
in particular how not to.
.sp
.if t \{ 
.2c
.\}
.\" end boilerplate... paper starts here.
.sh 1 "Introduction"
.sp
There are really only two fundamental ways to conceptualise
I/O devices in an operating system:
The usual way and the UNIX way.
.lp
The usual way is to treat I/O devices as their own class of things,
possibly several classes of things, and provide APIs tailored
to the semantics of the devices.
In practice this means that a program must know what it is dealing
with: it has to interact with disks one way, tapes another and
rodents yet a third way, all of which are different from how it
interacts with a plain disk file.
.lp
The UNIX way has never been described better than in the very first
paper 
published on UNIX by Ritchie and Thompson [Ritchie74]:
.(q
Special files constitute the most unusual feature of the UNIX filesystem.
Each supported I/O device is associated with at least one such file.
Special files are read and written just like ordinary disk files,
but requests to read or write result in activation of the associated device.
An entry for each special file resides in directory /dev,
although a link may be made to one of these files just as it may to an
ordinary file.
Thus, for example, to write on a magnetic tape one may write on the file /dev/mt.

Special files exist for each communication line, each disk, each tape drive,
and for physical main memory.
Of course, the active disks and the memory special files are protected from indiscriminate access.

There is a threefold advantage in treating I/O devices this way:
file and device I/O are as similar as possible;
file and device names have the same syntax and meaning,
so that a program expecting a file name as a parameter can be passed a device name;
finally, special files are subject to the same protection mechanism as regular files.
.)q
.lp
.\" (Why was this so special at the time?)
At the time, this was quite a strange concept; it was totally accepted,
for instance, that neither the system administrator nor the users were
able to interact with a disk as a disk.
Operating systems simply
did not provide access to disk other than as a filesystem.
Most vendors did not even release a program to initialise a
disk-pack with a filesystem: selling pre-initialised and ``quality
tested'' disk-packs was quite a profitable business.
.lp
In many cases some kind of API for reading and
writing individual sectors on a disk pack
did exist in the operating system,
but more often than not
it was not listed in the public documentation.
.sh 2 "The traditional implementation"
.lp
.\" (Explain how opening /dev/lpt0 lands you in the right device driver)
The initial implementation used hardcoded inode numbers [Ritchie98].
The console
device would be inode number 5, the paper-tape-punch number 6 and so on,
even if those inodes were also actual regular files in the filesystem.
.lp
For reasons one can only too vividly imagine, this was changed and 
Thompson
[Thompson78]
describes how the implementation now used ``major and minor''
device numbers to index through the devsw array to the correct device driver.
.lp
For all intents and purposes, this is the implementation which survives
in most UNIX-like systems even to this day.
Apart from the access control and timestamp information which is
found in all inodes, the special inodes in the filesystem contain only
one piece of information: the major and minor device numbers, often
logically OR'ed to one field.
.lp
When a program opens a special file, the kernel uses the major number
to find the entry points in the device driver, and passes the combined
major and minor numbers as a parameter to the device driver.
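.lp
The dispatch described above can be sketched in a few lines of user-space C.
This is only an illustration under stated assumptions: the 8-bit split
between major and minor numbers, and every name starting with
``demo_'' or ``DEMO_'', are invented here and are not taken from any
particular kernel.
.(b M
```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: major number in the high bits, minor in the low 8 bits. */
#define DEMO_MINOR_BITS 8
#define DEMO_MAKEDEV(maj, min) (((maj) << DEMO_MINOR_BITS) | (min))
#define DEMO_MAJOR(dev) ((dev) >> DEMO_MINOR_BITS)
#define DEMO_MINOR(dev) ((dev) & ((1u << DEMO_MINOR_BITS) - 1))

typedef unsigned int demo_dev_t;

/* One entry per driver: the entry points the kernel calls. */
struct demo_devsw {
    const char *d_name;
    int (*d_open)(demo_dev_t dev);
};

/* Two toy drivers that report which unit (minor number) was opened. */
static int console_open(demo_dev_t dev) { return 100 + (int)DEMO_MINOR(dev); }
static int tape_open(demo_dev_t dev)    { return 200 + (int)DEMO_MINOR(dev); }

/* The table is indexed by major number. */
static const struct demo_devsw demo_devsw[] = {
    { "console", console_open },   /* major 0 */
    { "mt",      tape_open },      /* major 1 */
};

#define DEMO_NDEVSW (sizeof(demo_devsw) / sizeof(demo_devsw[0]))

/* What opening a special file boils down to: pick the driver by major
 * number and hand it the combined number. Returns -1 for an unknown major. */
static int demo_spec_open(demo_dev_t dev)
{
    unsigned int maj = DEMO_MAJOR(dev);

    if (maj >= DEMO_NDEVSW)
        return -1;
    assert(demo_devsw[maj].d_open != NULL);
    return demo_devsw[maj].d_open(dev);
}
```
.)b
.lp
Calling demo_spec_open(DEMO_MAKEDEV(1, 3)) lands in the toy tape driver,
which sees unit 3; a major number with no table entry is rejected before
any driver code runs.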
.sh 1 "The challenge"
.lp
Now, we did not talk much about where the special inodes came from
to begin with.
They were created by hand, using the
mknod(2) system call, usually through the mknod(8) program.
.lp
In those days a
computer had a very static hardware configuration\**
.(f
\** Unless your assigned field engineer was present on site.
.)f
and it certainly did not
change while the system was up and running, so creating device nodes
by hand was certainly an acceptable solution.
.lp
The first sign that this would not hold up as a solution came with
the advent of TCP/IP and the telnet(1) program, or more precisely 
with the telnetd(8) daemon.
In order to support remote login a ``pseudo-tty'' device driver was implemented,
basically a tty driver which instead of hardware had another device which
would allow a process to ``act as hardware'' for the tty.
The telnetd(8) daemon would read and write data on the ``master'' side of
the pseudo-tty and the user would be running on the ``slave'' side,
which would act just like any other tty: you could change the erase 
character if you wanted to and all the signals and all that stuff worked.
.lp
Obviously with a device requiring no hardware, you can compile as many
instances into the kernel as you like, as long as you do not use
too much memory.
As system after system was connected
to the ARPANet, ``increasing number of ptys'' became a regular task
for system administrators, and part of this task was to create
more special nodes in the filesystem.
.lp
Several UNIX vendors also noticed an issue when they sold minicomputers
in many different configurations: explaining to system administrators
just which special nodes they would need and how to create them was
a significant documentation hassle.  Some opted for the simple solution
and pre-populated /dev with every conceivable device node, resulting
in a predictable slowdown on access to filenames in /dev.
.lp
System V UNIX provided a band-aid solution:
a special boot sequence would take effect if the kernel or
the hardware had changed since last reboot.
This boot procedure would
amongst other things create the necessary special files in the filesystem,
based on an intricate system of per device driver configuration files.
.lp
In recent years, we have become used to hardware which changes
configuration at any time: people plug USB, Firewire and PCCard
devices into their computers.
These devices can be anything from modems and disks to GPS receivers 
and fingerprint authentication hardware.
Suddenly maintaining the
correct set of special devices in ``/dev'' became a major headache.
.lp
Along the way, UNIX kernels had learned to deal with multiple filesystem
types [Heidemann91a] and a ``device-pseudo-filesystem'' was a pretty
obvious idea.
The device drivers have a pretty good idea which
devices they have found in the configuration, so all that is needed is
to present this information as a filesystem filled with just the right
special files.
Experience has shown that this, like most other ``pseudo
filesystems'', sounds a lot simpler in theory than in practice.
.sh 1 "Truly understanding devices"
.lp
Before we continue, we need to fully understand the
``device special file'' in UNIX.
.lp
First we need to realize that a special file has the nature of
a pointer from the filesystem into a different namespace;
a little understood fact with far reaching consequences.
.lp
One implication of this is that several special files can
exist in the filename namespace all pointing to the same device
but each having their own access and timestamp attributes:
.lp
.(b M
.vs -3
\fC\s-3guest# ls -l /dev/fd0 /tmp/fd0
crw-r----- 1 root operator 9, 0 Sep 27 19:21 /dev/fd0
crw-rw-rw- 1 root wheel    9, 0 Sep 27 19:24 /tmp/fd0\fP\s+3
.vs +3
.)b
Obviously, the administrator needs to be on top of this:
one popular way to exploit an unguarded root prompt is
to create a replica of the special file /dev/kmem 
in a location where it will not be noticed.
Since /dev/kmem gives access to the kernel memory, 
gaining any particular
privilege can be arranged by suitably modifying the kernel's
data structures through the illicit special file.
.lp
When NFS appeared it opened a new avenue for this attack:
People may have root privilege on one machine but not another.
Since device nodes are not interpreted on the NFS server
but rather on the local computer,
a user with root privilege on an NFS client
computer can create a device node to his liking on a filesystem
mounted from an NFS server.
This device node can in turn be used to 
circumvent the security of other computers which mount that filesystem,
including the server, unless they protect themselves by not
trusting any device entries on untrusted filesystems by mounting such
filesystems with the \fCnodev\fP mount-option.
.lp
The fact that the device itself does not actually exist inside the
filesystem which holds the special file makes it possible
to perform boot-strapping stunts in the spirit 
of Baron Von Münchausen [raspe1785],
where a filesystem is (re)mounted using one of its own
device vnodes:
.(b M
.vs -3
\fC\s-2guest# mount -o ro /dev/fd0 /mnt
guest# fsck /mnt/dev/fd0
guest# mount -u -o rw /mnt/dev/fd0 /mnt\fP\s+2
.vs +3
.)b
.lp
Other interesting details are chroot(2) and jail(2) [Kamp2000] which
provide filesystem isolation for process-trees.
Whereas chroot(2) was not implemented as a security tool [Mckusick1999]
(although it has been widely used as such), the jail(2) security
facility in FreeBSD provides a pretty convincing ``virtual machine''
where even the root privilege is isolated and restricted to the designated
area of the machine.
Obviously chroot(2) and jail(2) may require access to a well-defined
subset of devices like /dev/null, /dev/zero and /dev/tty,
whereas access to other devices such as /dev/kmem
or any disks could be used to compromise the integrity of the jail(2)
confinement.
.lp
For a long time FreeBSD, like almost all UNIX-like systems, had two kinds
of devices, ``block'' and
``character'' special files, the difference being that ``block''
devices would provide caching and alignment for disk device access.
This was one of those minor architectural mistakes which took
forever to correct.
.lp
The argument that block devices were a mistake is really very
very simple:  Many devices other than disks have multiple modes
of access which you select by choosing which special file to use.
.lp
Pick any old timer and he will be able to recite painful
sagas about the crucial difference between the /dev/rmt 
and /dev/nrmt devices for tape access.\**
.(f
\** Make absolutely sure you know the difference before you take
important data on a multi-file 9-track tape to remote locations.
.)f
.lp
Tapes, asynchronous ports, line printer ports and many other devices
have implemented submodes, selectable by the user
at a special filename level, but that has not earned them their
own special file types.
Only disks\**
.(f
\** Well, OK: and some 9-track tapes.
.)f
have enjoyed the privilege of getting an entire file type dedicated to
a minor device mode.
.lp
Caching and alignment modes should have been enabled by setting
some bit in the minor device number on the disk special file,
not by polluting the filesystem code with another file type.
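.lp
As a rough illustration of what ``some bit in the minor device number''
could look like, the sketch below decodes a unit number and two mode
bits from a minor number.
The bit assignments and all names are assumptions made up for this
example; no real driver is claimed to use this layout.
.(b M
```c
#include <assert.h>

/* Assumed minor-number layout: the low 4 bits select the unit,
 * the next bits select submodes. */
#define DEMO_UNIT_MASK   0x0fu
#define DEMO_MODE_NOREW  0x10u  /* tape: do not rewind on close */
#define DEMO_MODE_CACHED 0x20u  /* disk: cached, alignment-free access */

struct demo_minor {
    unsigned int unit;
    int no_rewind;
    int cached;
};

/* Split a minor number into the unit and the submode flags. */
static struct demo_minor demo_decode_minor(unsigned int minor)
{
    struct demo_minor m;

    assert(minor <= 0xffu);   /* this sketch assumes 8-bit minors */
    m.unit = minor & DEMO_UNIT_MASK;
    m.no_rewind = (minor & DEMO_MODE_NOREW) != 0;
    m.cached = (minor & DEMO_MODE_CACHED) != 0;
    return m;
}
```
.)b
.lp
With such a scheme the filesystem only ever sees one kind of special
file; the choice between cached and raw access stays inside the driver,
next to the other submodes.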
.lp
In FreeBSD block devices were not even implemented in a fashion
which would be of any use, since any write errors would never be
reported to the writing process.  For this reason, and since no
applications 
were found to be in existence which relied on block devices
and since historical usage was indeed historical [Mckusick2000],
block devices were removed from the FreeBSD system.
This greatly simplified the task of keeping track of open(2) 
reference counts for disks and
removed much magic special-case code throughout.
.lp
.sh 1 "Files, sockets, pipes, SVID IPC and devices"
.sp
It is an instructive lesson in inconsistency to look at the
various types of ``things'' a process can access in UNIX-like
systems today.
.lp
First there are normal files, which are our reference yardstick here:
they are accessed with open(2), read(2), write(2), mmap(2), close(2)
and various other auxiliary system calls.
.lp
Sockets and pipes are also accessed via file handles but each has
its own namespace.  That means you cannot open(2) a socket,\**
.(f
\** This is particularly bizarre in the case of UNIX domain sockets
which use the filesystem as their namespace and appear in directory
listings.
.)f
but you can read(2) and write(2) to it.
Sockets and pipes vector off at the file descriptor level and do
not get in touch with the vnode based part of the kernel at all.
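.lp
The asymmetry is easy to observe from userland.
The sketch below assumes a POSIX system with a writable /tmp; the two
function names are invented for the example.
The first function binds a UNIX domain socket to a name and then tries
to open(2) that name, the second pushes a few bytes through a socket
pair with plain write(2) and read(2).
.(b M
```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Bind a UNIX domain socket under a fresh directory, then try to
 * open(2) the resulting name. Returns 1 if open(2) was refused,
 * 0 if it succeeded and -1 if the set-up itself failed. */
static int demo_open_refuses_socket(void)
{
    char dir[] = "/tmp/sockdemoXXXXXX";
    struct sockaddr_un addr;
    int s, fd, refused = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    snprintf(addr.sun_path, sizeof(addr.sun_path), "%s/sock", dir);

    s = socket(AF_UNIX, SOCK_STREAM, 0);
    if (s >= 0 && bind(s, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        /* The name now shows up in directory listings ... */
        fd = open(addr.sun_path, O_RDWR);
        refused = (fd < 0);        /* ... but it cannot be opened. */
        if (fd >= 0)
            close(fd);
        unlink(addr.sun_path);
    }
    if (s >= 0)
        close(s);
    rmdir(dir);
    return refused;
}

/* read(2) and write(2) work on a socket descriptor as on any other.
 * Returns the number of bytes that made the round trip, or -1. */
static int demo_socket_roundtrip(void)
{
    int sv[2];
    char buf[8];
    ssize_t n = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (write(sv[0], "ping", 4) == 4)
        n = read(sv[1], buf, sizeof(buf));
    assert(n < 0 || memcmp(buf, "ping", (size_t)n) == 0);
    close(sv[0]);
    close(sv[1]);
    return (int)n;
}
```
.)b
.lp
The error code returned by the refused open(2) differs between systems,
so the sketch only checks that the call fails.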
.lp
Devices land somewhere in the middle between pipes and sockets on
one side and normal files on the other.
They use the filesystem 
namespace, are implemented with vnodes, and can be operated
on like normal files, but don't actually live in the filesystem.
.lp
Devices are in fact special-cased all the way through the vnode system.
For one thing devices break the ``one file-one vnode''
rule, making it necessary to chain all vnodes for the same
device together in
order to be able to find ``the canonical vnode for this device node'',
but more importantly, many operations have to be specifically denied
on special file vnodes since they do not make any sense.
.lp
For true inconsistency, consider the SVID IPC mechanisms - not
only do they not operate via file handles,
but they also sport a singularly
ill-conceived 32-bit numeric namespace and a dedicated set of
system calls for access.
.lp
Several people have convincingly argued that this is an inconsistent
mess, and have proposed and implemented more consistent operating systems
like Plan 9 from Bell Labs [Pike90a] [Pike92a].
Unfortunately reality is that people are not interested in learning a new
operating system when the one they have is pretty darn good, and
consequently research into better and more consistent ways is
a pretty frustrating [Pike2000] but by no means irrelevant topic.
.sh 1 "Solving the /dev maintenance problem"
.lp
There are a number of obvious, simple but wrong ways one could
go about solving the ``/dev'' maintenance problem.
.lp
The very straightforward way is to hack the namei() kernel function
responsible for filename translation and lookup.
It is only a minor matter of programming to
add code to special-case any lookup which ends up in ``/dev''.
But this leads to problems:  in the case of chroot(2) or jail(2), the
administrator will want to present only a subset of the available
devices in ``/dev'', so some kind of state will have to be kept per
chroot(2)/jail(2) about which devices are visible and
which devices are hidden, but no obvious location for this information
is available in the absence of a mount data structure.
.lp
It also leads to some unpleasant issues
because of the fact that ``/dev/foo'' is a synthesised directory
entry which may or may not actually be present on the filesystem 
which seems to provide ``/dev''.
The vnodes either have to belong to a filesystem or they
must be special-cased throughout the vnode layer of the kernel.
.lp
Finally there is the simple matter of generality:
hardcoding the string "/dev" in the kernel is not very general.
.lp
A cruder solution is to leave it to a daemon: make a special
device driver, have a daemon read messages from it and create and
destroy nodes in ``/dev'' in response to these messages.
.lp
The main drawback to this idea is that now we have added IPC
to the mix introducing new and interesting race conditions.
.lp
Otherwise this solution is surprisingly effective,
but chroot(2)/jail(2) requirements prevent a simple implementation 
and running a daemon per jail would become an administrative
nightmare.
.lp
Another pitfall of
this approach is that we are not able to remount the root filesystem
read-write at boot until we have a device node for the root device,
but if this node is missing we cannot create it with a daemon since
the root filesystem (and hence /dev) is read-only.
Adding a read-write memory-filesystem mount on /dev to solve this problem
does not improve
the architectural qualities further and certainly the KISS principle has
been violated by now.
.lp
The final and in the end only satisfactory solution is to write a ``DEVFS''
which mounts on ``/dev''.
.lp
The good news is that it does solve the problem with chroot(2) and jail(2):
just mount a DEVFS instance on the ``dev'' directory inside the filesystem
subtree where the chroot or jail lives.  Having a mountpoint gives us
a convenient place to keep track of the local state of this DEVFS mount.
.lp
The bad news is that it takes a lot of cleanup and care to implement
a DEVFS into a UNIX kernel.
.sh 1 "DEVFS architectural decisions"
.lp
Before implementing a DEVFS, it is necessary to decide on a range
of corner cases in behaviour, and some of these choices have proved
surprisingly hard to settle for the FreeBSD project.
.sh 2 "The ``persistence'' issue"
.lp
When DEVFS in FreeBSD was initially presented at a BoF at the 1995
USENIX Technical Conference in New Orleans,
a group of people demanded that it provide ``persistence''
for administrative changes.
.lp
When trying to get a definition of ``persistence'', people can generally
agree that if the administrator changes the access control bits of
a device node, they want that mode to survive across reboots.
.lp
Once more tricky examples of the sort of manipulations one can do
on special files are proposed, people rapidly disagree about what
should be supported and what should not.
.lp
For instance, imagine a
system with one floppy drive which appears in DEVFS as ``/dev/fd0''.
Now the administrator, in order to get some badly written software
to run, links this to ``/dev/fd1'':
.(b M
\fC\s-2ln /dev/fd0 /dev/fd1\fP\s+2
.)b
This works as expected and with persistence in DEVFS, the link is
still there after a reboot.
But what if after a reboot another floppy drive has been connected
to the system?
This drive would naturally have the name ``/dev/fd1'',
but this name is now occupied by the administrator's hard link.
Should the link be broken?
Should the new floppy drive be called
``/dev/fd2''?  Nobody can agree on anything but the ugliness of the
situation.
.lp
Given that we are no longer dependent on DEC Field engineers to
change all four wheels to see which one is flat, the basic assumption
that the machine has a constant hardware configuration is simply no
longer true.
The new assumption one should start from when analysing this
issue is that when the system boots, we cannot know what devices we
will find, and we cannot know if the devices we do find
are the same ones we had when the system was last shut down.
.lp
And in fact, this is very much the case with laptops today:  if I attach
my IOmega Zip drive to my laptop it appears like a SCSI disk named
``/dev/da0'', but so does the RAID-5 array attached to the PCI SCSI controller
installed in my laptop's docking station.  If I change mode to ``a+rw''
on the Zip drive, do I want that mode to apply to the RAID-5 as well?
Unlikely.
.lp
And what if we have persistent information about the mode of
device ``/dev/sio0'', but we boot and do not find any sio devices?
Do we keep the information in our device-persistence registry?
How long do we keep it?  If I borrow a modem card,
set the permissions to some non-standard value like 0666,
and then attach some other serial device a year from now - do I
want some old permission changes to come back and haunt me,
just because they both happened to be ``/dev/sio0''?
Unlikely.
.lp
The fact that more people have laptop computers today than
five years ago, and the fact that nobody has been able to credibly
propose where a persistent DEVFS would actually store the 
information about these things in the first place has settled the issue.
.lp
Persistence may be the right answer, but to the
wrong question: persistence is not a desirable property for a DEVFS
when the hardware configuration may change literally at any time.
.sh 2 "Who decides on the names?"
.lp
In a DEVFS-enabled system, the responsibility for creating nodes in
/dev shifts to the device drivers, and consequently the device
drivers get to choose the names of the device files.
In addition, initial values for the owner, group and mode bits are
provided by the device driver.
.lp
But should it be possible to rename ``/dev/lpt0'' to ``/dev/myprinter''?
While the obvious affirmative answer is easy to arrive at, it leaves
a lot to be desired once the implications are unmasked.
.lp
Most device drivers know their own name and use it purposefully in
their debug and log messages to identify themselves.
Furthermore, the ``NewBus'' [NewBus] infrastructure facility,
which ties hardware to device drivers, identifies things by name 
and unit numbers.
.lp
In fact, a very common way to report errors is:
.(b M
.vs -3
\fC\s-2#define LPT_NAME "lpt" /* our official name */
[...]
printf(LPT_NAME
    ": cannot alloc ppbus (%d)!", error);\fP\s+2
.vs +3
.)b
.lp
So despite the user renaming the device node pointing to the printer
to ``myprinter'', this has absolutely no effect in the kernel and can
be considered a userland aliasing operation.
.lp
The decision was therefore made that it should not be possible to rename
device nodes since it would only lead to confusion and because the desired
effect could be attained by giving the user the ability to create
symlinks in DEVFS.
.sh 2 "On-demand device creation"
.lp
Pseudo-devices like pty, tun and bpf,
but also some real devices, may not pre-emptively create entries for all
possible device nodes.  It would be a pointless waste of resources
to always create 1000 ptys just in case they are needed,
and in the worst case more than 1800 device nodes would be needed per 
physical disk to represent all possible slices and partitions.
.lp
For pseudo-devices the task at hand is to make a magic device node,
``/dev/pty'', which when opened will magically transmogrify into the
first available pty subdevice, maybe ``/dev/pty123''.
.lp
Device submodes, on the other hand, work by having multiple
entries in /dev, each with a different minor number, as a way to instruct
the device driver in aspects of its operation.  The most widespread
example is probably ``/dev/mt0'' and ``/dev/nmt0'', where the node
with the extra ``n''
instructs the tape device driver to not rewind on close.\**
.(f
\** This is the answer to the question in footnote number 2.
.)f
.lp
Some UNIX systems have solved the problem for pseudo-devices by
creating magic cloning devices like ``/dev/tcp''.
When a cloning device is opened,
it finds a free instance and through vnode and file descriptor mangling
returns this new device to the opening process.
.lp
This scheme has two disadvantages: the complexity of switching vnodes
in midstream is non-trivial, but even worse is the fact that it 
does not work for
submodes for a device because it only reacts to one particular /dev entry.
.lp
The solution for both needs is a more flexible on-demand device
creation, implemented in FreeBSD as a two-level lookup.
When a
filename is looked up in DEVFS, a match in the existing device nodes is
sought first and if found, returned.
If no match is found, device drivers are polled in turn to ask if
they would be able to synthesise a device node of the given name.
.lp
The device driver gets a chance to modify the name
and create a device with make_dev().
If one of the drivers succeeds in this, the lookup is started over and
the newly found device node is returned:
.(b M
.vs -3
\fC\s-2pty_clone()
   if (strcmp(name, "pty") != 0)
      return(NULL); /* no luck */
   n = find_next_unit();
   dev = make_dev(...,n,"pty%d",n);
   name = dev->name;
   return(dev);\fP\s+2
.vs +3
.)b
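.lp
To make the control flow concrete, here is a self-contained user-space
model of the two-level lookup.
It is a sketch under stated assumptions, not the FreeBSD code:
demo_make_dev() merely stands in for make_dev(), whose real arguments
are not reproduced here, and the node table, the handler list and all
``demo_'' names are invented for the example.
.(b M
```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MAXNODES 16
#define DEMO_NAMELEN  16

/* The nodes that currently exist in this toy DEVFS. */
static char demo_nodes[DEMO_MAXNODES][DEMO_NAMELEN];
static int demo_nnodes;

/* Stand-in for make_dev(): register a node and return its index, or -1. */
static int demo_make_dev(const char *name)
{
    if (demo_nnodes >= DEMO_MAXNODES || strlen(name) >= DEMO_NAMELEN)
        return -1;
    strcpy(demo_nodes[demo_nnodes], name);
    return demo_nnodes++;
}

/* First level: look among the nodes that already exist. */
static int demo_find(const char *name)
{
    for (int i = 0; i < demo_nnodes; i++)
        if (strcmp(demo_nodes[i], name) == 0)
            return i;
    return -1;
}

/* A clone handler: if the name is ours, create a node and rewrite
 * the name to that of the new node. Returns 1 on success. */
static int demo_pty_clone(char *name, size_t len)
{
    static int next_unit;

    if (strcmp(name, "pty") != 0)
        return 0;                      /* not ours */
    snprintf(name, len, "pty%d", next_unit++);
    return demo_make_dev(name) >= 0;
}

typedef int (*demo_clone_fn)(char *name, size_t len);
static const demo_clone_fn demo_cloners[] = { demo_pty_clone };

/* Two-level lookup: existing nodes first, then poll the drivers and,
 * if one of them created something, look again under the new name.
 * The name buffer is updated to the name that was actually found. */
static int demo_lookup(char *name, size_t len)
{
    int idx = demo_find(name);

    assert(len >= DEMO_NAMELEN);
    if (idx >= 0)
        return idx;
    for (size_t i = 0; i < sizeof(demo_cloners) / sizeof(demo_cloners[0]); i++)
        if (demo_cloners[i](name, len))
            return demo_find(name);    /* lookup started over */
    return -1;
}
```
.)b
.lp
Looking up ``pty'' twice in this model yields ``pty0'' and then ``pty1'',
while a later lookup of ``pty0'' is satisfied at the first level
without consulting any driver.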
.lp
An interesting mixed use of this mechanism is with the sound device drivers.
Modern sound devices have multiple channels, presumably to allow the
user to listen to CNN, Napstered MP3 files and Quake sound effects at
the same time.
The only problem is that all applications attempt to open ``/dev/dsp''
since they have no concept of multiple sound devices.
The sound device drivers use the cloning facility to direct ``/dev/dsp''
to the first available sound channel completely transparently to the
process.
.lp
There are very few drawbacks to this mechanism, the major one being
that ``ls /dev'' now errs on the sparse side instead of the rich when used
as a system device inventory, a practice which has always been 
of dubious precision at best.
.sh 2 "Deleting and recreating devices"
.lp
Deleting device nodes is no problem to implement, but as likely as not,
some people will want a method to get them back.
Since only the device driver knows how to create a given device,
recreation cannot be performed solely on the basis of the parameters 
provided by a process in userland.
.lp
In order to not complicate the code which updates the directory
structure for a mountpoint to reflect changes in the DEVFS inode list,
a deleted entry is merely marked with DE_WHITEOUT instead of being
removed entirely.
Otherwise a separate list would be needed for inodes which we had
deleted so that they would not be mistaken for new inodes.
.lp
The obvious way to recreate deleted devices is to let mknod(2) do it
by matching the name and disregarding the major/minor arguments.
Recreating the device with mknod(2) will simply remove the DE_WHITEOUT
flag.
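.lp
A minimal sketch of the whiteout idea, assuming a flat array of
directory entries per mountpoint.
Only the flag name DE_WHITEOUT comes from the text; its value, the
structure and the function names are invented for illustration, and the
error codes are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DE_WHITEOUT 0x1		/* flag name from the text; value assumed */

/* Hypothetical per-mountpoint directory entry. */
struct dirent_sketch {
	const char *name;
	int         flags;
};

static struct dirent_sketch *
find_entry(struct dirent_sketch *dir, size_t n, const char *name)
{
	assert(dir != NULL && name != NULL);
	for (size_t i = 0; i < n; i++)
		if (strcmp(dir[i].name, name) == 0)
			return (&dir[i]);
	return (NULL);
}

/* Deleting hides the entry but keeps it on the list. */
static int
sketch_remove(struct dirent_sketch *dir, size_t n, const char *name)
{
	struct dirent_sketch *de = find_entry(dir, n, name);

	if (de == NULL || (de->flags & DE_WHITEOUT))
		return (ENOENT);
	de->flags |= DE_WHITEOUT;
	return (0);
}

/* mknod matches on the name only; major/minor arguments are ignored. */
static int
sketch_mknod(struct dirent_sketch *dir, size_t n, const char *name)
{
	struct dirent_sketch *de = find_entry(dir, n, name);

	if (de == NULL)
		return (ENOENT);	/* only a driver can create it */
	if (!(de->flags & DE_WHITEOUT))
		return (EEXIST);
	de->flags &= ~DE_WHITEOUT;
	return (0);
}

/* Lookups and directory reads must skip whited-out entries. */
static int
sketch_visible(struct dirent_sketch *dir, size_t n, const char *name)
{
	struct dirent_sketch *de = find_entry(dir, n, name);

	return (de != NULL && !(de->flags & DE_WHITEOUT));
}
```

Because the entry never leaves the list, the code which brings the
directory up to date can tell a deleted device from a new one without
any extra bookkeeping.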
.sh 2 "Jail(2), chroot(2) and DEVFS"
.lp
The primary requirement from facilities like jail(2) and chroot(2)
is that it must be possible to control the contents of a DEVFS mount
point.
.lp
Obviously, it would not be desirable for dynamic devices to pop
into existence in the carefully pruned /dev of jails so it must be
possible to mark a DEVFS mountpoint as ``no new devices''.
And in the same way, the jailed root should not be able to recreate
device nodes which the real root has removed.
.lp
These behaviours will be controlled with mount options, but these have not
yet been implemented because FreeBSD has run out of bitmap flags for
mount options, and a new unlimited mount option implementation is
still not in place at the time of writing.
.lp
One mount option, ``jaildevfs'', will restrict the contents of the
DEVFS mountpoint to the ``normal set'' of devices for a jail and
automatically hide all future devices and make it impossible
for a jailed root to un-hide hidden entries while letting an un-jailed
root do so.
.lp
Mounting or remounting read-only will prevent all future
devices from appearing and will make it impossible to
hide or un-hide entries in the mountpoint.
This is probably only useful for chroots or jails where no tty
access is intended since cloning will not work either.
.lp
More mount options may be needed as more experience is gained.
.sh 2 "Default mode, owner & group"
.lp
When a device driver creates a device node, and a DEVFS mount adds it
to its directory tree, it needs to have some values for the access
control fields: mode, owner and group.
.lp
Currently, the device driver specifies the initial values in the
make_dev() call, but this is far from optimal.
For one thing, embedding magic UIDs and GIDs in the kernel is simply
bad style unless they are numerically zero.
More seriously, they represent compile-time defaults, which in these
enlightened days are rather old-fashioned.
.lp
.sh 1 "Cleaning up before we build: struct specinfo and dev_t"
.lp
Most of the rest of the paper will be about the various challenges
and issues in the implementation of DEVFS in FreeBSD.
All of this should be applicable to other systems derived from
4.4BSD-Lite as well.
.lp
POSIX has defined a type called ``dev_t'' which is the identity of a device.
This is mainly for use in the few system calls which know about devices:
stat(2), fstat(2) and mknod(2).
A dev_t is constructed by logically OR'ing
the major# and minor# for the device.
Since those have been defined
as having no overlapping bits, the major# and minor#
can be retrieved from the dev_t by a simple masking operation.
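.lp
As a worked example of the OR and masking operations, assume a layout
with the minor# in bits 0-7 and the major# in bits 8-15.
This layout and the sketch_ names are chosen for illustration only;
real systems use wider and differently arranged fields.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for illustration only. */
#define MAJOR_SHIFT 8
#define MAJOR_MASK  0xff00u
#define MINOR_MASK  0x00ffu

typedef uint32_t sketch_dev_t;

/* Build the device identity by OR'ing the two non-overlapping fields. */
static sketch_dev_t
sketch_makedev(unsigned major, unsigned minor)
{
	assert(major <= 0xff && minor <= 0xff);
	return ((major << MAJOR_SHIFT) | minor);
}

/* Recover each field by masking (and shifting the major# down). */
static unsigned
sketch_major(sketch_dev_t dev)
{
	return ((dev & MAJOR_MASK) >> MAJOR_SHIFT);
}

static unsigned
sketch_minor(sketch_dev_t dev)
{
	return (dev & MINOR_MASK);
}
```

With this layout major# 4 and minor# 2 combine to 0x0402, and both
numbers are recovered unchanged by the two accessor functions.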
.lp
Although the kernel had a well-defined concept of any particular
device, it did not have a data structure to represent ``a device''.
The device driver has such a structure, traditionally called ``softc'',
but the high kernel does not (and should not!) have access to the
device driver's private data structures.
.lp
It is an interesting tale how things got to be this way,\**
.(f
\** Basically, devices should have been moved up with sockets and
pipes at the file descriptor level when the VFS layering was introduced,
rather than have all the special casing throughout the vnode system.
.)f
but for now we simply record what the actual relationship between the
data structures was in the 4.4BSD release (Fig. 1). [44BSDBook]
.(z
.PS 3
F: box "file" "handle"
arrow down from F.s
V: box "vnode"
arrow right from V.e
S: box "specinfo"
arrow down from V.s
I: box "inode"
arrow right from I.e
C: box invis "devsw[]" "[major#]"
arrow down from C.s
D: box "device" "driver"
line right from D.e
box invis "softc[]" "[minor#]"
F2: box "file" "handle" at F + (2.5,0)
arrow down from F2.s
V2: box "vnode"
arrow right from V2.e
S2: box "specinfo"
arrow down from V2.s
I2: box "inode"
arrow left from I2.w
.PE
.ce 1
Fig. 1 - Data structures in 4.4BSD
.)z
.lp
As for all other files, a vnode references a filesystem inode, but
in addition it points to a ``specinfo'' structure.  In the inode
we find the dev_t which is used to reference the device driver.
.lp
Access to the device driver happens by extracting the major# from
the dev_t, indexing through the global devsw[] array to locate
the device driver's entry point.
.lp
The device driver will extract the minor# from the dev_t and use
that as the index into the softc array of private data per device.
.lp
The ``specinfo'' structure is a little sidekick which vnodes grew along the way,
and is used to find all vnodes which reference the same device (i.e.
they have the same major# and minor#).
This linkage is used to determine
which vnode is the ``chosen one'' for this device, and to keep track of
open(2)/close(2) against this device.
The actual implementation was an inefficient hash table
which, depending on the vnode reclamation rate and /dev directory lookup
traffic, could become a measurable performance liability.
.sh 2 "The new vnode/inode/dev_t layout"
.lp
In the new layout (Fig. 2) the specinfo structure takes a central
role.  There is only one instance of struct specinfo per
device (i.e. unique major#
and minor# combination) and all vnodes referencing this device point
to this structure directly.
.(z
.PS 2.25
F: box "file" "handle"
arrow down from F.s
V: box "vnode"
arrow right from V.e
S: box "specinfo"
arrow down from V.s
I: box "inode"
F2: box "file" "handle" at F + (2.5,0)
arrow down from F2.s
V2: box "vnode"
arrow left from V2.w
arrow down from V2.s
I2: box "inode"
arrow down from S.s
D: box "device" "driver"
.PE
.ce 1
Fig. 2 - The new FreeBSD data structures.
.)z
.lp
In userland, a dev_t is still the logical OR of the major# and
minor#, but this entity is now called a udev_t in the kernel.
In the kernel a dev_t is now a pointer to a struct specinfo.
.lp
All vnodes referencing a device are linked to a list hanging
directly off the specinfo structure, removing the need for the
hash table and consequently simplifying and speeding up a lot
of code dealing with vnode instantiation, retirement and
name-caching.
.lp
The entry points to the device driver are stored in the specinfo
structure, removing the need for the devsw[] array and allowing
device drivers to use separate entrypoints for various minor numbers.
.lp
This is very convenient for devices which have a ``control''
device for management and tuning.  The control device almost always
has entirely separate open/close/ioctl implementations [MD.C].
.lp
In addition to this, two data elements are included in the specinfo
structure but ``owned'' by the device driver.  Typically the
device driver will store a pointer to the softc structure in
one of these, and unit number or mode information in the other.
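.lp
The following sketch shows, under simplifying assumptions, how per-device
entry points and the two driver-owned fields fit together.
Only the field name si_drv1 is taken from the driver example in this paper;
si_devsw, si_drv2, struct sketch_devsw and the function names are
assumptions made for the sake of a self-contained example.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_specinfo;

/* Hypothetical table of driver entry points. */
struct sketch_devsw {
	int (*d_open)(struct sketch_specinfo *dev);
};

/* Much reduced struct specinfo. */
struct sketch_specinfo {
	const struct sketch_devsw *si_devsw;	/* entry points, per device */
	void *si_drv1;				/* driver-owned: softc pointer */
	void *si_drv2;				/* driver-owned: unit or mode */
};

struct foo_softc {
	int opens;
};

/* Data device: finds its softc without any array indexing. */
static int
foo_open(struct sketch_specinfo *dev)
{
	struct foo_softc *sc = dev->si_drv1;

	sc->opens++;
	return (0);
}

/* Control device: an entirely separate implementation. */
static int
fooctl_open(struct sketch_specinfo *dev)
{
	(void)dev;
	return (0);
}

static const struct sketch_devsw foo_devsw = { foo_open };
static const struct sketch_devsw fooctl_devsw = { fooctl_open };

/* The kernel side dispatches through the specinfo, not devsw[major#]. */
static int
sketch_open(struct sketch_specinfo *dev)
{
	assert(dev != NULL && dev->si_devsw != NULL);
	return (dev->si_devsw->d_open(dev));
}
```

Two specinfo structures belonging to the same driver can thus carry
different entry points, which is all a ``control'' device needs.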
.lp
This removes the need for drivers to find the softc using array
indexing based on the minor#, and at the same time has obviated
the need for the compiled-in ``NFOO'' constants which traditionally
determined how many softc structures and therefore devices
the driver could support.\**
.(f
\** Not to mention all the drivers which implemented panic(2) 
because they forgot to perform bounds checking on the index before
using it on their softc arrays.
.)f
.lp
There are some trivial technical issues relating to allocating
the storage for specinfo early in the boot sequence and how to
find a specinfo from the udev_t/major#+minor#, but they will
not be discussed here.
.sh 2 "Creating and destroying devices"
.lp
Ideally, devices should only be created and
destroyed by the device drivers which know what devices are present.
This is accomplished with the make_dev() and destroy_dev()
function calls.
.lp
Life is seldom quite that simple.  The operating system might be called
on to act as an NFS server for a diskless workstation, possibly even
of a different architecture, so we still need to be able to represent
device nodes with no device driver backing in the filesystems and
consequently we need to be able to create a specinfo from
the major#+minor# in these inodes when we encounter them.
In practice this is quite trivial, but in a few places in the code
one needs to be aware of the existence
of both ``named'' and ``anonymous'' specinfo structures.
.lp
The make_dev() call creates a specinfo structure and populates
it with driver entry points, major#, minor#, device node name
(for instance ``lpt0''), UID, GID and access mode bits.  The return
value is a dev_t (i.e., a pointer to struct specinfo).
If the device driver determines that the device is no longer
present, it calls destroy_dev(), giving a dev_t as argument
and the dev_t will be cleaned and converted to an anonymous dev_t.
.lp
Once created with make_dev() a named dev_t exists until destroy_dev()
is called by the driver.  The driver can rely on this and keep state
in the fields in dev_t which are reserved for driver use.
.sh 1 "DEVFS"
.lp
By now we have all the relevant information about each device node
collected in struct specinfo but we still have one problem to
solve before we can add the DEVFS filesystem on top of it.
.sh 2 "The interrupt problem"
.lp
Some device drivers, notably the CAM/SCSI subsystem in FreeBSD,
will discover changes in the device configuration inside an interrupt
routine.
.lp
This imposes some limitations on what can and should be done:
first one should minimise the amount
of work done in an interrupt routine for performance reasons;
second, to avoid deadlocks, vnodes and mountpoints should not be
accessed from an interrupt routine.
.lp
Also, in addition to the locking issue,
a machine can have many instances of DEVFS mounted:
for a jail(8) based virtual-machine web-server several hundred instances
are not unheard of, making it far too expensive to update all of them
in an interrupt routine.
.lp
The solution to this problem is to do all the filesystem work on
the filesystem side of DEVFS and use atomically manipulated integer indices
(``inode numbers'') as the barrier between the two sides.
.lp
The functions called from the device drivers, make_dev(), destroy_dev()
&c. only manipulate the DEVFS inode number of the dev_t in
question and do not even get near any mountpoints or vnodes.
.lp
For make_dev() the task is to assign a unique inode number to the
dev_t and store the dev_t in the DEVFS-global inode-to-dev_t array.
.(b M
.vs -3
\fC\s-2make_dev(...)
    store argument values in dev_t
    assign unique inode number to dev_t
    atomically insert dev_t into inode_array\fP\s+2
.vs +3
.)b
.lp
For destroy_dev() the task is the opposite: clear the inode number
in the dev_t and NULL the pointer in the devfs-global inode-to-dev_t
array.
.(b M
.vs -3
\fC\s-2destroy_dev(...)
    clear fields in dev_t
    zero dev_t inode number.
    atomically clear entry in inode_array\fP\s+2
.vs +3
.)b
.lp
Both functions conclude by atomically incrementing a global variable
\fCdevfs_generation\fP to leave an indication to the filesystem
side that something has changed.
.lp
By modifying the global state only with atomic instructions, locks
have been entirely avoided in this part of the code which means that
the make_dev() and destroy_dev() functions can be called from practically
anywhere in the kernel at any time.
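.lp
A lock-free driver side along these lines can be sketched with the C11
atomic operations from <stdatomic.h>.
This is an illustration rather than the kernel code: the table size,
struct sketch_dev and the function names are assumptions, and the sketch
folds the choice of inode number into the compare-and-swap which claims
the slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NDEVINO 64	/* table size: an assumption for this sketch */

struct sketch_dev {
	const char *name;
	int         inode;	/* 0 means "no inode assigned" */
};

static _Atomic(struct sketch_dev *) inode_array[NDEVINO];
static atomic_uint devfs_generation;

/* Claim the first free slot with compare-and-swap; no lock is taken. */
static int
sketch_make_dev(struct sketch_dev *dev, const char *name)
{
	dev->name = name;
	for (int i = 1; i < NDEVINO; i++) {
		struct sketch_dev *expected = NULL;

		dev->inode = i;
		if (atomic_compare_exchange_strong(&inode_array[i],
		    &expected, dev)) {
			/* Tell the filesystem side that something changed. */
			atomic_fetch_add(&devfs_generation, 1);
			return (i);
		}
	}
	dev->inode = 0;
	return (0);	/* table full */
}

static void
sketch_destroy_dev(struct sketch_dev *dev)
{
	int i = dev->inode;

	assert(i > 0 && i < NDEVINO);
	dev->name = NULL;
	dev->inode = 0;
	atomic_store(&inode_array[i], (struct sketch_dev *)NULL);
	atomic_fetch_add(&devfs_generation, 1);
}
```

Neither function sleeps or takes a lock, and neither touches a vnode or
a mountpoint, which is the property that makes them usable from an
interrupt routine.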
.lp
On the filesystem side of DEVFS, the only two vnode methods which examine
or rely on the directory structure, VOP_LOOKUP and VOP_READDIR,
call the function devfs_populate() to update their mountpoint's view
of the device hierarchy to match current reality before doing any work.
.(b M
.vs -3
\fC\s-2devfs_readdir(...)
    devfs_populate(...)
    ...\fP\s+2
.vs +3
.)b
.lp
The devfs_populate() function compares the current \fCdevfs_generation\fP
to the value saved in the mountpoint last time devfs_populate() completed
and if (actually: while) they differ a linear run is made through the
devfs-global inode-array and the directory tree of the mountpoint is
brought up to date.
.lp
The actual code is slightly more complicated than shown in the pseudo-code
here because it has to deal with subdirectories and hidden entries.
.(b M
.vs -3
\fC\s-2devfs_populate(...)
  while (mount->generation != devfs_generation)
    mount->generation = devfs_generation
    for i in all inodes
      if (inode created)
        create directory entry
      else if (inode destroyed)
        remove directory entry\fP\s+2
.vs +3
.)b
.lp
Access to the global DEVFS inode table is again implemented
with atomic instructions and failsafe retries to avoid the
need for locking.
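.lp
The retry loop on the filesystem side can be modelled in C11 as shown here.
The sketch is deliberately flat: it ignores subdirectories and hidden
entries, represents a device by its name only, and uses an assumed table
size and the hypothetical names struct sketch_mount and sketch_populate().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NDEVINO 8	/* table size: an assumption for this sketch */

/* Driver side: written only with atomic operations. */
static _Atomic(const char *) inode_array[NDEVINO];	/* name or NULL */
static atomic_uint devfs_generation;

/* Filesystem side: one of these per mounted DEVFS instance. */
struct sketch_mount {
	unsigned    generation;		/* seen by the last complete pass */
	const char *dirent[NDEVINO];	/* this mount's view; NULL = absent */
};

static void
sketch_populate(struct sketch_mount *dm)
{
	unsigned gen;

	assert(dm != NULL);
	/* Retry until a whole pass completes with no change in between. */
	while (dm->generation != (gen = atomic_load(&devfs_generation))) {
		dm->generation = gen;
		for (int i = 0; i < NDEVINO; i++) {
			const char *dev = atomic_load(&inode_array[i]);

			/* Covers both created and destroyed inodes. */
			if (dm->dirent[i] != dev)
				dm->dirent[i] = dev;
		}
	}
}
```

The generation is recorded before the scan starts, so a device which
appears while the scan is running changes the global counter and forces
one more pass instead of being missed.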
.lp
From a performance point of view this scheme also means that a particular
DEVFS mountpoint is not updated until it needs to be, and then always by
a process belonging to the jail in question thus minimising and 
distributing the CPU load.
.sh 1 "Device-driver impact"
.lp
All these changes have had a significant impact on how device drivers
interact with the rest of the kernel regarding registration of
devices.
.lp
If we look first at the ``before'' image in Fig. 3, we notice first
the NFOO define which imposes a firm upper limit on the number of
devices the kernel can deal with.
Also notice that the softc structure for all of them is allocated
at compile time.
This is because most device drivers (and texts on writing device
drivers) are from before the general
kernel malloc facility [Mckusick1988] was introduced into the BSD kernel.
.lp
.(b M
.vs -3
\fC\s-2
#ifndef NFOO
#	define NFOO	4
#endif

struct foo_softc {
	...
} foo_softc[NFOO];

int nfoo = 0;

foo_open(dev, ...)
{
	int unit = minor(dev);
	struct foo_softc *sc;

	if (unit >= NFOO || unit >= nfoo)
		return (ENXIO);
	
	sc = &foo_softc[unit];

	...
}

foo_attach(...)
{
	struct foo_softc *sc;
	static int once;

	...
	if (nfoo >= NFOO) {
		/* Have hardware, can't handle */
		return (-1);
	}
	sc = &foo_softc[nfoo++];
	if (!once) {
		cdevsw_add(&cdevsw);
		once++;
	}
	...
}
\fP\s+2
Fig. 3 - Device-driver, old style.
.vs +3
.)b
.lp
Also notice how range checking is needed to make sure that the
minor# is within range.  This code gets more complex if device-numbering
is sparse.  Code equivalent to that shown in the foo_open() routine
would also be needed in foo_read(), foo_write(), foo_ioctl() &c.
.lp
Finally notice how the attach routine needs to remember to register
the cdevsw structure (not shown) when the first device is found.
.lp
Now, compare this to our ``after'' image in Fig. 4.
NFOO is totally gone and so is the compile time allocation
of space for softc structures.
.lp
The foo_open (and foo_close, foo_ioctl &c) functions can now
derive the softc pointer directly from the dev_t they receive
as an argument.
.lp
.(b M
.vs -3
\fC\s-2
struct foo_softc {
	....
};

int nfoo;

foo_open(dev, ...)
{
	struct foo_softc *sc = dev->si_drv1;

	...
}

foo_attach(...)
{
	struct foo_softc *sc;

	...
	sc = MALLOC(..., M_ZERO);
	if (sc == NULL) {
		/* Have hardware, can't handle */
		return (-1);
	}
	sc->dev = make_dev(&cdevsw, nfoo,
	    UID_ROOT, GID_WHEEL, 0644,
	    "foo%d", nfoo);
	nfoo++;
	sc->dev->si_drv1 = sc;
	...
}
\fP\s+2
Fig. 4 - Device-driver, new style.
.vs +3
.)b
.lp
In foo_attach() we can now attach to all the devices we can
allocate memory for and we register the cdevsw structure per
dev_t rather than globally.
.lp
This last trick is what allows us to discard all bounds checking
in the foo_open() &c. routines, because they can only be
called through the cdevsw, and the cdevsw is only attached to
dev_t's which foo_attach() has created.
There is no way to end
up in foo_open() with a dev_t not created by foo_attach().
.lp
In the two examples here, the difference is only 10 lines of source
code, primarily because only one of the worker functions of the
device driver is shown.
In real device drivers it is not uncommon to save 50 or more lines
of source code which typically is about a percent or two of the
entire driver.
.sh 1 "Future work"
.lp
Apart from some minor issues to be cleaned up, DEVFS is now a reality
and future work therefore is likely to concentrate on applying the
facilities and functionality of DEVFS to FreeBSD.
.sh 2 "devd"
.lp
It would be logical to complement DEVFS with a ``device-daemon'' which
could configure and de-configure devices as they come and go.
When a disk appears, mount it.
When a network interface appears, configure it.
And in some configurable way allow the user to customise the action,
so that for instance images will automatically be copied off the
flash-based media from a camera, &c.
.lp
In this context it is good to question how we view dynamic devices.
If for instance a printer is removed in the middle of a print job
and another printer arrives a moment later, should the system
automatically continue the print job on this new printer?
When a disk-like device arrives, should we always mount it?  Should
we have a database of known disk-like devices to tell us where to
mount it and what permissions to give the mountpoint?
Some computers come in multiple configurations, for instance laptops
with and without their docking station.  How do we want to present
this to the users and what behaviour do the users expect?
.sh 2 "Pathname length limitations"
.lp
In order to simplify memory management in the early stages of boot,
the pathname relative to the mountpoint is presently stored in a 
small fixed size buffer inside struct specinfo.
It should be possible to use filenames as long as the system otherwise
permits, so some kind of extension mechanism is called for.
.lp
Since it cannot be guaranteed that memory can be allocated in
all the possible scenarios where make_dev() can be called, it may
be necessary to mandate that the caller allocates the buffer if
the content will not fit inside the default buffer size.
.sh 2 "Initial access parameter selection"
.lp
As it is now, device drivers propose the initial mode, owner and group
for the device nodes, but it would be more flexible if it were possible
to give the kernel a set of rules, much like packet filtering rules,
which allow the user to set the wanted policy for new devices.
Such a mechanism could also be used to filter new devices for mount
points in jails and to determine other behaviour.
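.lp
No such rule mechanism exists yet, so the following is purely a sketch
of what a first-match rule table could look like.
The rule format, the example patterns and the numeric IDs are all
invented; only the POSIX fnmatch(3) function is a real interface.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical rule: a name pattern and the access parameters to apply. */
struct sketch_rule {
	const char *pattern;	/* shell-style pattern on the node name */
	uid_t       uid;
	gid_t       gid;
	mode_t      mode;
};

/* Example policy; the patterns and numbers are made up. */
static const struct sketch_rule rules[] = {
	{ "lpt*", 0, 7, 0660 },
	{ "tty*", 0, 4, 0620 },
	{ "*",    0, 0, 0600 },		/* default */
};

/* The first matching rule wins, as in many packet filters. */
static const struct sketch_rule *
sketch_match(const char *name)
{
	assert(name != NULL);
	for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++)
		if (fnmatch(rules[i].pattern, name, 0) == 0)
			return (&rules[i]);
	return (NULL);
}
```

With a catch-all rule last, every new device node gets a defined mode,
owner and group without any numeric IDs being compiled into the drivers.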
.lp
Doing these things from userland results in some awkward race conditions
and software bloat for embedded systems, so a kernel approach may be more
suitable.
.sh 2 "Applications of on-demand device creation"
.lp
The facility for on-demand creation of devices has some very interesting
possibilities.
.lp
One planned use is to enable user-controlled labelling
of disks.
Today disks have names like /dev/da0, /dev/ad4, but since
this numbering is topological any change in the hardware configuration
may rename the disks, causing /etc/fstab and backup procedures
to get out of sync with the hardware.
.lp
The current idea is to store on the media of the disk a user-chosen
disk name and allow access through this name, so that for instance 
/dev/mydisk0
would be a symlink to whatever topological name the disk might have
at any given time.
.lp
To simplify this and to avoid a forest of symlinks, it will probably
be decided to move all the sub-divisions of a disk into one subdirectory
per disk so just a single symlink can do the job.
In practice that means that the current /dev/ad0s2f will become
something like /dev/ad0/s2f and so on.
Obviously, in the same way, disks could also be accessed by their
topological address, down to the specific path in a SAN environment.
.lp
Another potential use could be for automated offline data media libraries.
It would be quite trivial to make it possible to access all the media
in the library using /dev/lib/$LABEL which would be a remarkable
simplification compared with most current automated retrieval facilities.
.lp
Another use could be to access devices by parameter rather than by
name.  One could imagine sending a printjob to /dev/printer/color/A2
and behind the scenes a search would be made for a device with the
correct properties and paper-handling facilities.
.sh 1 "Conclusion"
.lp
DEVFS has been successfully implemented in FreeBSD,
including a powerful, simple and flexible solution supporting 
pseudo-devices and on-demand device node creation.
.lp
Contrary to the trend, the implementation added functionality
with a net decrease in source lines,
primarily because of the improved API as seen from the device drivers' point of view.
.lp
Even if DEVFS is not desired, other 4.4BSD derived UNIX variants
would probably benefit from adopting the dev_t/specinfo related
cleanup.
.sh 1 "Acknowledgements"
.lp
I first got started on DEVFS in 1989 because the abysmal performance
of the Olivetti M250 computer forced me to implement a network-disk-device
for Minix in order to retain my sanity.
That initial work led to a
crude but working DEVFS for Minix, so obviously both Andrew Tanenbaum
and Olivetti deserve credit for inspiration.
.lp
Julian Elischer implemented a DEVFS for FreeBSD around 1994 which never
quite made it to maturity and subsequently was abandoned.
.lp
Bruce Evans deserves special credit not only for his keen eye for detail,
and his competent criticism but also for his enthusiastic resistance to the
very concept of DEVFS.
.lp
Many thanks to the people who took time to help me stamp out ``Danglish''
through their reviews and comments:  Chris Demetriou, Paul Richards,
Brian Somers, Nik Clayton, and Hanne Munkholm.
Any remaining insults to proper use of the English language are my own fault.
.\" (list & why)
.sh 1 "References"
.lp
[44BSDBook]
McKusick, Bostic, Karels & Quarterman:
``The Design and Implementation of the 4.4BSD Operating System.''
Addison Wesley, 1996, ISBN 0-201-54979-4.
.lp
[Heidemann91a]
John S. Heidemann:
``Stackable layers: an architecture for filesystem development.''
Master's thesis, University of California, Los Angeles, July 1991.
Available as UCLA technical report CSD-910056.
.lp
[Kamp2000]
Poul-Henning Kamp and Robert N. M. Watson:
``Confining the Omnipotent root.''
Proceedings of the SANE 2000 Conference.
Available in FreeBSD distributions in \fC/usr/share/papers\fP.
.lp
[MD.C]
Poul-Henning Kamp et al:
FreeBSD memory disk driver:
\fCsrc/sys/dev/md/md.c\fP
.lp
[Mckusick1988]
Marshall Kirk McKusick, Mike J. Karels:
``Design of a General Purpose Memory Allocator for the 4.3BSD UNIX Kernel''
Proceedings of the San Francisco USENIX Conference, pp. 295-303, June 1988.
.lp
[Mckusick1999]
Dr. Marshall Kirk McKusick:
Private email communication.
\fI``According to the SCCS logs, the chroot call was added by Bill Joy
on March 18, 1982 approximately 1.5 years before 4.2BSD was released.
That was well before we had ftp servers of any sort (ftp did not
show up in the source tree until January 1983).  My best guess as
to its purpose was to allow Bill to chroot into the /4.2BSD build
directory and build a system using only the files, include files,
etc contained in that tree.  That was the only use of chroot that
I remember from the early days.''\fP
.lp
[Mckusick2000]
Dr. Marshall Kirk McKusick:
Private communication at BSDcon2000 conference.
\fI``I have not used block devices since I wrote FFS and that
was \fPmany\fI years ago.''\fP
.lp
[NewBus]
NewBus is a subsystem which provides most of the glue between
hardware and device drivers.  Despite the importance of this
there has never been published any good overview documentation
for it.
The following article by Alexander Langer in ``Dæmonnews'' is 
the best reference I can come up with:
\fC\s-2http://www.daemonnews.org/200007/newbus-intro.html\fP\s+2
.lp
[Pike2000]
Rob Pike:
``Systems Software Research is Irrelevant.''
\fC\s-2http://www.cs.bell\-labs.com/who/rob/utah2000.pdf\fP\s+2
.lp
[Pike90a]
Rob Pike, Dave Presotto, Ken Thompson and Howard Trickey:
``Plan 9 from Bell Labs.''
Proceedings of the Summer 1990 UKUUG Conference.
.lp
[Pike92a]
Rob Pike, Dave Presotto, Ken Thompson, Howard Trickey and Phil Winterbottom:
``The Use of Name Spaces in Plan 9.''
Proceedings of the 5th ACM SIGOPS Workshop.
.lp
[Raspe1785]
Rudolf Erich Raspe:
``Baron Münchhausen's Narrative of his marvellous Travels and Campaigns in Russia.''
Kearsley, 1785.
.lp
[Ritchie74]
D.M. Ritchie and K. Thompson:
``The UNIX Time-Sharing System''
Communications of the ACM, Vol. 17, No. 7, July 1974.
.lp
[Ritchie98]
Dennis Ritchie: private conversation at USENIX Annual Technical Conference
New Orleans, 1998.
.lp
[Thompson78]
Ken Thompson:
``UNIX Implementation''
The Bell System Technical Journal, vol 57, 1978, number 6 (part 2) p. 1931ff.