.\"	$NetBSD: fsinterface.ms,v 1.4 2003/08/07 10:30:42 agc Exp $
.\"
.\" Copyright (c) 1986 The Regents of the University of California.
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)fsinterface.ms	1.4 (Berkeley) 4/16/91
.\"
.if \nv .rm CM
.de UX
.ie \\n(UX \s-1UNIX\s0\\$1
.el \{\
\s-1UNIX\s0\\$1\(dg
.FS
\(dg \s-1UNIX\s0 is a registered trademark of AT&T.
.FE
.nr UX 1
.\}
..
.TL
Toward a Compatible Filesystem Interface
.AU
Michael J. Karels
Marshall Kirk McKusick
.AI
Computer Systems Research Group
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, California  94720
.AB
.LP
As network or remote filesystems have been implemented for
.UX ,
several stylized interfaces between the filesystem implementation
and the rest of the kernel have been developed.
.FS
This is an update of a paper originally presented
at the September 1986 conference of the European
.UX
Users' Group.
Last modified April 16, 1991.
.FE
Notable among these are Sun Microsystems' Virtual Filesystem interface (VFS)
using vnodes, Digital Equipment's Generic File System (GFS) architecture,
and AT&T's File System Switch (FSS).
Each design attempts to isolate filesystem-dependent details
below a generic interface and to provide a framework within which
new filesystems may be incorporated.
However, each of these interfaces is different from
and incompatible with the others.
Each of them addresses somewhat different design goals.
Each was based on a different starting version of
.UX ,
targeted a different set of filesystems with varying characteristics,
and uses a different set of primitive operations provided by the filesystem.
The current study compares the various filesystem interfaces.
Criteria for comparison include generality, completeness, robustness,
efficiency and esthetics.
Several of the underlying design issues are examined in detail.
As a result of this comparison, a proposal for a new filesystem interface
is advanced that includes the best features of the existing implementations.
The proposal adopts the calling convention for name lookup introduced
in 4.3BSD, but is otherwise closely related to Sun's VFS.
A prototype implementation is now being developed at Berkeley.
This proposal and the rationale underlying its development
have been presented to major software vendors
as an early step toward convergence on a compatible filesystem interface.
.AE
.SH
Introduction
.PP
As network communications and workstation environments
have become common elements in
.UX
systems, several vendors of
.UX
systems have designed and built network file systems
that allow client processes on one
.UX
machine to access files on a server machine.
Examples include Sun's Network File System, NFS [Sandberg85],
AT&T's recently-announced Remote File Sharing, RFS [Rifkin86],
the LOCUS distributed filesystem [Walker85],
and Masscomp's extended filesystem [Cole85].
Other remote filesystems have been implemented in research or university groups
for internal use, notably the network filesystem in the Eighth Edition
.UX
system [Weinberger84] and two different filesystems used at Carnegie-Mellon
University [Satyanarayanan85].
Numerous other remote file access methods have been devised for use
within individual
.UX
processes,
many of them by modifications to the C I/O library
similar to those in the Newcastle Connection [Brownbridge82].
.PP
Multiple network filesystems may frequently
be found in use within a single organization.
These circumstances make it highly desirable to be able to transport filesystem
implementations from one system to another.
Such portability is considerably enhanced by the use of a stylized interface
with carefully-defined entry points to separate the filesystem from the rest
of the operating system.
This interface should be similar to the interface between device drivers
and the kernel.
Although varying somewhat among the common versions of
.UX ,
the device driver interfaces are sufficiently similar that device drivers
may be moved from one system to another without major problems.
A clean, well-defined interface to the filesystem also allows a single
system to support multiple local filesystem types.
.PP
For reasons such as these, several filesystem interfaces have been used
when integrating new filesystems into the system.
The best-known of these are Sun Microsystems' Virtual File System interface,
VFS [Kleiman86], and AT&T's File System Switch, FSS.
Another interface, known as the Generic File System, GFS,
has been implemented for the ULTRIX\(dd
.FS
\(dd ULTRIX is a trademark of Digital Equipment Corp.
.FE
system by Digital [Rodriguez86].
There are numerous differences among these designs.
The differences may be understood from the varying philosophies
and design goals of the groups involved, from the systems under which
the implementations were done, and from the filesystems originally targeted
by the designs.
These differences are summarized in the following sections
within the limitations of the published specifications.
.SH
Design goals
.PP
There are several design goals which, in varying degrees,
have driven the various designs.
Each attempts to divide the filesystem into a filesystem-type-independent
layer and individual filesystem implementations.
The division between these layers occurs at somewhat different places
in these systems, reflecting different views of the diversity and types
of the filesystems that may be accommodated.
Compatibility with existing local filesystems has varying importance;
at the user-process level, each attempts to be completely transparent
except for a few filesystem-related system management programs.
The AT&T interface also makes a major effort to retain familiar internal
system interfaces, and even to retain object-file-level binary compatibility
with operating system modules such as device drivers.
Both Sun and DEC were willing to change internal data structures and interfaces
so that other operating system modules might require recompilation
or source-code modification.
.PP
AT&T's interface both allows and requires filesystems to support the full
and exact semantics of their previous filesystem,
including interruptions of system calls on slow operations.
System calls that deal with remote files are encapsulated
with their environment and sent to a server where execution continues.
The system call may be aborted by either client or server, returning
control to the client.
Most system calls that descend into the filesystem-dependent layer
of a filesystem other than the standard local filesystem do not return
to the higher-level kernel calling routines.
Instead, the filesystem-dependent code completes the requested
operation and then executes a non-local goto (\fIlongjmp\fP) to exit the
system call.
These efforts to avoid modification of main-line kernel code
indicate a far greater emphasis on internal compatibility than on modularity,
clean design, or efficiency.
.PP
In contrast, the Sun VFS interface makes major modifications to the internal
interfaces in the kernel, with a very clear separation
of filesystem-independent and -dependent data structures and operations.
The semantics of the filesystem are largely retained for local operations,
although this is achieved at some expense where it does not fit the internal
structuring well.
The filesystem implementations are not required to support the same
semantics as local
.UX
filesystems.
Several historical features of
.UX
filesystem behavior are difficult to achieve using the VFS interface,
including the atomicity of file and link creation and the use of open files
whose names have been removed.
.PP
A major design objective of Sun's network filesystem,
statelessness,
permeates the VFS interface.
No locking may be done in the filesystem-independent layer,
and locking in the filesystem-dependent layer may occur only during
a single call into that layer.
.PP
A final design goal of most implementors is performance.
For remote filesystems,
this goal tends to be in conflict with the goals of complete semantic
consistency, compatibility and modularity.
Sun has chosen performance over modularity in some areas,
but elsewhere has emphasized clean separation of the layers within the filesystem
at the expense of performance.
Although the performance of RFS is yet to be seen,
AT&T seems to have considered compatibility far more important than modularity
or performance.
.SH
Differences among filesystem interfaces
.PP
The existing filesystem interfaces may be characterized
in several ways.
Each system is centered around a few data structures or objects,
along with a set of primitives for performing operations upon these objects.
In the original
.UX
filesystem [Ritchie74],
the basic object used by the filesystem is the inode, or index node.
The inode contains all of the information about a file except its name:
its type, identification, ownership, permissions, timestamps and location.
Inodes are identified by the filesystem device number and the index within
the filesystem.
The major entry points to the filesystem are \fInamei\fP,
which translates a filesystem pathname into the underlying inode,
and \fIiget\fP, which locates an inode by number and installs it in the in-core
inode table.
\fINamei\fP performs name translation by iterative lookup
of each component name in its directory to find its inumber,
then using \fIiget\fP to return the actual inode.
If the last component has been reached, this inode is returned;
otherwise, the inode describes the next directory to be searched.
The inode returned may be used in various ways by the caller;
it may be examined, the file may be read or written,
types and access may be checked, and fields may be modified.
Modified inodes are automatically written back to the filesystem
on disk when the last reference is released with \fIiput\fP.
Although the details are considerably different,
the same general scheme is used in the faster filesystem in 4.2BSD
.UX
[McKusick85].
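.LP
The iterative translation performed by \fInamei\fP can be illustrated
with a minimal sketch in C.
It is not the historical kernel code: the table \fItoy_dir\fP,
the routine names and the inumbers are all hypothetical,
and the in-core inode table maintained by \fIiget\fP is left out,
so that the sketch returns an inumber rather than an inode.
.DS L
```c
#include <stddef.h>
#include <string.h>

/* Hypothetical toy directory entry: (parent inumber, name) -> inumber. */
struct toy_dirent {
    int parent;
    const char *name;
    int inumber;
};

static const struct toy_dirent toy_dir[] = {
    { 2, "usr", 3 },
    { 3, "bin", 4 },
    { 4, "ls", 5 },
};

#define TOY_ROOT_INO 2

/* Scan one directory for one component; return its inumber, or 0 if absent. */
static int toy_dirlookup(int dir, const char *name, size_t len)
{
    size_t i;

    for (i = 0; i < sizeof(toy_dir) / sizeof(toy_dir[0]); i++) {
        if (toy_dir[i].parent == dir &&
            strlen(toy_dir[i].name) == len &&
            memcmp(toy_dir[i].name, name, len) == 0)
            return toy_dir[i].inumber;
    }
    return 0;
}

/* Translate an absolute pathname to an inumber, one component at a time. */
int toy_namei(const char *path)
{
    int ino = TOY_ROOT_INO;

    while (*path != 0) {
        size_t len;

        while (*path == '/')
            path++;                 /* skip separators */
        if (*path == 0)
            break;
        len = strcspn(path, "/");   /* length of this component */
        ino = toy_dirlookup(ino, path, len);
        if (ino == 0)
            return 0;               /* component not found */
        path += len;
    }
    return ino;
}
```
.DE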
.PP
Both the AT&T interface and, to a lesser extent, the DEC interface
attempt to preserve the inode-oriented interface.
Each modifies the inode to allow different varieties of the structure
for different filesystem types by separating the filesystem-dependent
parts of the inode into a separate structure or one arm of a union.
Both interfaces allow operations
equivalent to the \fInamei\fP and \fIiget\fP operations
of the old filesystem to be performed in the filesystem-independent
layer, with entry points to the individual filesystem implementations to support
the type-specific parts of these operations.  Implicit in this interface
is that files may conveniently be named by and located using a single
index within a filesystem.
The GFS provides specific entry points to the filesystems
to change most file properties rather than allowing arbitrary changes
to be made to the generic part of the inode.
.PP
In contrast, the Sun VFS interface replaces the inode as the primary object
with the vnode.
The vnode contains no filesystem-dependent fields except the pointer
to the set of operations implemented by the filesystem.
Properties of a vnode that might be transient, such as the ownership,
permissions, size and timestamps, are maintained by the lower layer.
These properties may be presented in a generic format upon request;
callers are expected not to hold this information for any length of time,
as it may not be up-to-date later on.
The vnode operations do not include a counterpart for \fIiget\fP;
the only external interface for obtaining vnodes for specific files
is the name lookup operation.
(Separate procedures are provided outside of this interface
that obtain a ``file handle'' for a vnode which may be given
to a client by a server, such that the vnode may be retrieved
upon later presentation of the file handle.)
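.LP
The separation described above may be sketched in C as a generic object
that carries only a pointer to a per-filesystem operations vector.
The structure and field names below are hypothetical and are not taken
from the Sun sources;
in particular, the private pointer \fIv_data\fP is an assumption
about how a filesystem might attach its own state,
and a single attribute operation stands in for the full set.
.DS L
```c
#include <stddef.h>

struct toy_vnode;

/* Generic attributes, produced on request rather than stored in the vnode. */
struct toy_vattr {
    long size;
    int uid;
};

/* Per-filesystem operations vector (hypothetical one-entry subset). */
struct toy_vnodeops {
    int (*vn_getattr)(struct toy_vnode *vp, struct toy_vattr *vap);
};

/* The generic object: nothing filesystem-dependent except these pointers. */
struct toy_vnode {
    const struct toy_vnodeops *v_op;
    void *v_data;               /* assumed: state private to the filesystem */
};

/* A hypothetical local filesystem keeps attributes in its own inode. */
struct toy_inode {
    long i_size;
    int i_uid;
};

static int toy_ufs_getattr(struct toy_vnode *vp, struct toy_vattr *vap)
{
    const struct toy_inode *ip = vp->v_data;

    vap->size = ip->i_size;
    vap->uid = ip->i_uid;
    return 0;
}

const struct toy_vnodeops toy_ufs_ops = { toy_ufs_getattr };

/* Filesystem-independent caller: dispatches through the operations vector. */
int toy_getattr(struct toy_vnode *vp, struct toy_vattr *vap)
{
    return vp->v_op->vn_getattr(vp, vap);
}
```
.DE
.LP
Because the attributes are copied out on each request,
a caller that keeps the \fItoy_vattr\fP for long may see stale values,
which is the hazard noted above.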
.SH
Name translation issues
.PP
Each of the systems described includes a mechanism for performing
pathname-to-internal-representation translation.
The style of the name translation function is very different in all
three systems.
As described above, the AT&T and DEC systems retain the \fInamei\fP function.
The two are quite different, however, as the ULTRIX interface uses
the \fInamei\fP calling convention introduced in 4.3BSD.
The parameters and context for the name lookup operation
are collected in a \fInameidata\fP structure which is passed to \fInamei\fP
for operation.
Intent to create or delete the named file is declared in advance,
so that the final directory scan in \fInamei\fP may retain information
such as the offset in the directory at which the modification will be made.
Filesystems that use such mechanisms to avoid redundant work
must therefore lock the directory to be modified so that it may not
be modified by another process before completion.
In the System V filesystem, as in previous versions of
.UX ,
this information is stored in the per-process \fIuser\fP structure
by \fInamei\fP for use by a low-level routine called after performing
the actual creation or deletion of the file itself.
In 4.3BSD and in the GFS interface, these side effects of \fInamei\fP
are stored in the \fInameidata\fP structure given as argument to \fInamei\fP,
which is also presented to the routine implementing file creation or deletion.
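.LP
The following C sketch suggests how declaring intent in advance lets
the final directory scan save work for the later creation.
It is a toy model under stated assumptions: the directory is a small
array of fixed-size slots, the names \fItoy_nameidata\fP, \fIni_op\fP
and \fIni_offset\fP are hypothetical reductions of the real structure,
and the directory locking that the text requires is omitted.
.DS L
```c
#include <string.h>

#define TOY_NSLOTS 4
#define TOY_NAMELEN 16

enum toy_op { TOY_LOOKUP, TOY_CREATE, TOY_DELETE };

/* One directory: an empty name marks a free slot. */
struct toy_dir {
    char slot[TOY_NSLOTS][TOY_NAMELEN];
};

/* Hypothetical reduction of nameidata: arguments plus lookup side effects. */
struct toy_nameidata {
    const char *ni_name;    /* component to translate */
    enum toy_op ni_op;      /* intent, declared in advance */
    int ni_offset;          /* slot of the name, or free slot for creation */
    int ni_found;           /* nonzero if the name already exists */
};

/* Single directory scan; records where a later create or delete should act. */
void toy_namei(const struct toy_dir *dp, struct toy_nameidata *ndp)
{
    int i, freeslot = -1;

    ndp->ni_found = 0;
    ndp->ni_offset = -1;
    for (i = 0; i < TOY_NSLOTS; i++) {
        if (dp->slot[i][0] == 0) {
            if (freeslot < 0)
                freeslot = i;       /* remember the first free slot */
        } else if (strcmp(dp->slot[i], ndp->ni_name) == 0) {
            ndp->ni_found = 1;
            ndp->ni_offset = i;
            return;
        }
    }
    if (ndp->ni_op == TOY_CREATE)
        ndp->ni_offset = freeslot;
}

/* Create using the offset saved by toy_namei; no second scan is made. */
int toy_create(struct toy_dir *dp, const struct toy_nameidata *ndp)
{
    if (ndp->ni_op != TOY_CREATE || ndp->ni_found || ndp->ni_offset < 0)
        return -1;
    if (strlen(ndp->ni_name) >= TOY_NAMELEN)
        return -1;
    strcpy(dp->slot[ndp->ni_offset], ndp->ni_name);
    return 0;
}
```
.DE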
.PP
The ULTRIX \fInamei\fP routine is responsible for the generic
parts of the name translation process, such as copying the name into
an internal buffer, validating it, interpolating
the contents of symbolic links, and indirecting at mount points.
As in 4.3BSD, the name is copied into the buffer in a single call,
according to the location of the name.
After determining the type of the filesystem at the start of translation
(the current directory or root directory), it calls the filesystem's
\fInamei\fP entry with the same structure it received from its caller.
The filesystem-specific routine translates the name, component by component,
as long as no mount points are reached.
It may return after any number of components have been processed.
\fINamei\fP performs any processing at mount points, then calls
the correct translation routine for the next filesystem.
Network filesystems may pass the remaining pathname to a server for translation,
or they may look up the pathname components one at a time.
The former strategy would be more efficient,
but the latter scheme allows mount points within a remote filesystem
without server knowledge of all client mounts.
.PP
The AT&T \fInamei\fP interface is presumably the same as that in previous
.UX
systems, accepting the name of a routine to fetch pathname characters
and an operation (one of: lookup, lookup for creation, or lookup for deletion).
It translates, component by component, as before.
If it detects that a mount point crosses to a remote filesystem,
it passes the remainder of the pathname to the remote server.
A pathname-oriented request other than open may be completed
within the \fInamei\fP call,
avoiding return to the (unmodified) system call handler
that called \fInamei\fP.
.PP
In contrast to the first two systems, Sun's VFS interface has replaced
\fInamei\fP with \fIlookupname\fP.
This routine simply calls a new pathname-handling module to allocate
a pathname buffer and copy in the pathname (copying a character per call),
then calls \fIlookuppn\fP.
\fILookuppn\fP performs the iteration over the directories leading
to the destination file; it copies each pathname component to a local buffer,
then calls the filesystem \fIlookup\fP entry to locate the vnode
for that file in the current directory.
Per-filesystem \fIlookup\fP routines may translate only one component
per call.
For creation and deletion of new files, the lookup operation is unmodified;
the lookup of the final component only serves to check for the existence
of the file.
The subsequent creation or deletion call, if any, must repeat the final
name translation and associated directory scan.
For new file creation in particular, this is rather inefficient,
as file creation requires two complete scans of the directory.
.PP
Several of the important performance improvements in 4.3BSD
were related to the name translation process [McKusick85][Leffler84].
The following changes were made:
.IP 1. 4
A system-wide cache of recent translations is maintained.
The cache is separate from the inode cache, so that multiple names
for a file may be present in the cache.
The cache does not hold ``hard'' references to the inodes,
so that the normal reference pattern is not disturbed.
.IP 2.
A per-process cache is kept of the directory and offset
at which the last successful name lookup was done.
This allows sequential lookups of all the entries in a directory to be done
in linear time.
.IP 3.
The entire pathname is copied into a kernel buffer in a single operation,
rather than using two subroutine calls per character.
.IP 4.
A pool of pathname buffers is held by \fInamei\fP, avoiding allocation
overhead.
.LP
All of these performance improvements from 4.3BSD are well worth using
within a more generalized filesystem framework.
The generalization of the structure may otherwise make an already-expensive
function even more costly.
Most of these improvements are present in the GFS system, as it derives
from the beta-test version of 4.3BSD.
The Sun system uses a name-translation cache generally like that in 4.3BSD.
The name cache is a filesystem-independent facility provided for the use
of the filesystem-specific lookup routines.
The Sun cache, like that first used at Berkeley but unlike that in 4.3,
holds a ``hard'' reference to the vnode (increments the reference count).
The ``soft'' reference scheme in 4.3BSD cannot be used with the current
NFS implementation, as NFS allocates vnodes dynamically and frees them
when the reference count returns to zero rather than caching them.
As a result, fewer names may be held in the cache
than (local filesystem) vnodes, and the cache distorts the normal reference
patterns otherwise seen by the LRU cache.
As the name cache references overflow the local filesystem inode table,
the name cache must be purged to make room in the inode table.
Also, to determine whether a vnode is in use (for example,
before mounting upon it), the cache must be flushed to free any
cache reference.
These problems should be corrected
by the use of the soft cache reference scheme.
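.LP
One way to realize a ``soft'' cache reference is sketched below in C.
The mechanism shown, a generation number compared at lookup time,
is an assumption made for illustration and is not a description
of the 4.3BSD code; all names are hypothetical.
The sketch also assumes that in-core inodes live in a fixed table
and are reused rather than freed, which is the property that
dynamically allocated NFS vnodes lack.
.DS L
```c
#include <stddef.h>
#include <string.h>

struct toy_inode {
    int i_number;
    unsigned i_gen;     /* bumped whenever the in-core slot is reused */
};

struct toy_ncentry {
    const char *nc_name;
    struct toy_inode *nc_ip;
    unsigned nc_gen;    /* generation seen when the entry was made */
};

/* Record a translation without touching any reference count. */
void toy_nc_enter(struct toy_ncentry *ncp, const char *name,
    struct toy_inode *ip)
{
    ncp->nc_name = name;
    ncp->nc_ip = ip;
    ncp->nc_gen = ip->i_gen;
}

/* Return the cached inode, or NULL if the name differs or the entry is stale. */
struct toy_inode *toy_nc_lookup(const struct toy_ncentry *ncp,
    const char *name)
{
    if (ncp->nc_ip == NULL || strcmp(ncp->nc_name, name) != 0)
        return NULL;
    if (ncp->nc_ip->i_gen != ncp->nc_gen)
        return NULL;    /* slot was reused; the soft reference has lapsed */
    return ncp->nc_ip;
}

/* Reuse an in-core inode slot for another file; invalidates soft references. */
void toy_inode_reuse(struct toy_inode *ip, int new_number)
{
    ip->i_number = new_number;
    ip->i_gen++;
}
```
.DE
.LP
Since the cache entry holds no reference count, the inode may be
reclaimed in normal LRU order and no cache flush is needed to learn
whether it is otherwise in use.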
.PP
A final observation on the efficiency of name translation in the current
Sun VFS architecture is that the number of subroutine calls used
by a multi-component name lookup is dramatically larger
than in the other systems.
The name lookup scheme in GFS suffers from this problem much less,
without any violation of layering.
.PP
A final problem to be considered is synchronization and consistency.
As the filesystem operations are more stylized and broken into separate
entry points for parts of operations, it is more difficult to guarantee
consistency throughout an operation and/or to synchronize with other
processes using the same filesystem objects.
The Sun interface suffers most severely from this,
as it forbids the filesystems from locking objects across calls
to the filesystem.
It is possible that a file may be created between the time that a lookup
is performed and a subsequent creation is requested.
Perhaps more strangely, after a lookup fails to find the target
of a creation attempt, the actual creation might find that the target
now exists and is a symbolic link.
The call will either fail unexpectedly, as the target is of the wrong type,
or the generic creation routine will have to note the error
and restart the operation from the lookup.
This problem will always exist in a stateless filesystem,
but the VFS interface forces all filesystems to share the problem.
This restriction against locking between calls also
forces duplication of work during file creation and deletion.
This is considered unacceptable.
.SH
Support facilities and other interactions
.PP
Several support facilities are used by the current
.UX
filesystem and require generalization for use by other filesystem types.
For filesystem implementations to be portable,
it is desirable that these modified support facilities
also have a uniform interface and
behave in a consistent manner in target systems.
A prominent example is the filesystem buffer cache.
The buffer cache in a standard (System V or 4.3BSD)
.UX
system contains physical disk blocks with no reference to the files containing
them.
This works well for the local filesystem, but has obvious problems
for remote filesystems.
Sun has modified the buffer cache routines to describe buffers by vnode
rather than by device.
For remote files, the vnode used is that of the file, and the block
numbers are virtual data blocks.
For local filesystems, a vnode for the block device is used for cache reference,
and the block numbers are filesystem physical blocks.
Use of per-file cache description does not easily accommodate
caching of indirect blocks, inode blocks, superblocks or cylinder group blocks.
However, the vnode describing the block device for the cache
is one created internally,
rather than the vnode for the device looked up when mounting,
and it is located by searching a private list of vnodes
rather than by holding it in the mount structure.
Although the Sun modification makes it possible to use the buffer
cache for data blocks of remote files, a better generalization
of the buffer cache is needed.
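.LP
Describing buffers by vnode rather than by device amounts to changing
the key under which a cached block is found.
The C sketch below illustrates the idea with hypothetical names
and a tiny fixed pool;
buffer contents, locking and the replacement policy are omitted.
.DS L
```c
#include <stddef.h>

struct toy_vnode {
    int v_id;
};

struct toy_buf {
    const struct toy_vnode *b_vp;   /* identity: the vnode ... */
    long b_blkno;                   /* ... plus a block number */
    int b_valid;
};

#define TOY_NBUF 4
static struct toy_buf toy_bufs[TOY_NBUF];

/* Find a cached block by (vnode, block number); NULL on a miss. */
struct toy_buf *toy_incore(const struct toy_vnode *vp, long blkno)
{
    int i;

    for (i = 0; i < TOY_NBUF; i++) {
        if (toy_bufs[i].b_valid && toy_bufs[i].b_vp == vp &&
            toy_bufs[i].b_blkno == blkno)
            return &toy_bufs[i];
    }
    return NULL;
}

/* Return the buffer for a block, claiming a free one on a miss. */
struct toy_buf *toy_getblk(const struct toy_vnode *vp, long blkno)
{
    struct toy_buf *bp = toy_incore(vp, blkno);
    int i;

    if (bp != NULL)
        return bp;
    for (i = 0; i < TOY_NBUF; i++) {
        if (!toy_bufs[i].b_valid) {
            toy_bufs[i].b_vp = vp;
            toy_bufs[i].b_blkno = blkno;
            toy_bufs[i].b_valid = 1;
            return &toy_bufs[i];
        }
    }
    return NULL;    /* pool exhausted; replacement is not modeled */
}
```
.DE
.LP
A remote file would pass its own vnode with a virtual block number,
while a local filesystem would pass the block-device vnode
with a physical block number.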
.PP
The RFS filesystem used by AT&T does not currently cache data blocks
on client systems, thus the buffer cache is probably unmodified.
The form of the buffer cache in ULTRIX is unknown to us.
.PP
Another subsystem that has a large interaction with the filesystem
is the virtual memory system.
The virtual memory system must read data from the filesystem
to satisfy fill-on-demand page faults.
For efficiency, this read call is arranged to place the data directly
into the physical pages assigned to the process (a ``raw'' read) to avoid
copying the data.
Although the read operation normally bypasses the filesystem buffer cache,
consistency must be maintained by checking the buffer cache and copying
or flushing modified data not yet stored on disk.
The 4.2BSD virtual memory system, like that of Sun and ULTRIX,
maintains its own cache of reusable text pages.
This creates additional complications.
As the virtual memory systems are redesigned, these problems should be
resolved by reading through the buffer cache, then mapping the cached
data into the user address space.
If the buffer cache or the process pages are changed while the other reference
remains, the data would have to be copied (``copy-on-write'').
.PP
In the meantime, the current virtual memory systems must be used
with the new filesystem framework.
Both the Sun and AT&T filesystem interfaces
provide entry points to the filesystem for optimization of the virtual
memory system by performing logical-to-physical block number translation
when setting up a fill-on-demand image for a process.
The VFS provides a vnode operation analogous to the \fIbmap\fP function of the
.UX
filesystem.
Given a vnode and logical block number, it returns a vnode and block number
which may be read to obtain the data.
If the filesystem is local, it returns the private vnode for the block device
and the physical block number.
As the \fIbmap\fP operations are all performed at one time, during process
startup, any indirect blocks for the file will remain in the cache
once they have been read.
In addition, the interface provides a \fIstrategy\fP entry that may be used
for ``raw'' reads from a filesystem device,
used to read data blocks into an address space without copying.
This entry uses a buffer header (\fIbuf\fP structure)
to describe the I/O operation
instead of a \fIuio\fP structure.
The buffer-style interface is the same as that used by disk drivers internally.
This difference allows the current \fIuio\fP primitives to be avoided,
as they copy all data to/from the current user process address space.
Instead, for local filesystems these operations could be done internally
with the standard raw disk read routines,
which use a \fIuio\fP interface.
When loading from a remote filesystem,
the data will be received in a network buffer.
If network buffers are suitably aligned,
the data may be mapped into the process address space by a page swap
without copying.
In either case, it should be possible to use the standard filesystem
read entry from the virtual memory system.
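.PP
The \fIbmap\fP-style translation described above can be sketched
in a few lines.
This is only an illustration under stated assumptions:
\fItoy_vnode\fP, \fItoy_bmap\fP and the fixed-size block map
are hypothetical names invented here, and indirect blocks are ignored.
.DS
```c
#include <assert.h>
#include <stddef.h>

#define	TOY_NDADDR	4	/* hypothetical: block slots per toy file */

struct toy_vnode {
	struct toy_vnode *tv_devvp;	/* private vnode of the block device */
	long	tv_blocks[TOY_NDADDR];	/* logical -> physical, -1 is a hole */
};

/*
 * Given a vnode and a logical block number, return the vnode and
 * block number that may be read to obtain the data.
 * Returns 0 on success, -1 if lbn is out of range or unallocated.
 */
int
toy_bmap(struct toy_vnode *vp, long lbn, struct toy_vnode **vpp, long *bnp)
{
	assert(vp != NULL && vpp != NULL && bnp != NULL);
	if (lbn < 0 || lbn >= TOY_NDADDR || vp->tv_blocks[lbn] < 0)
		return (-1);
	*vpp = vp->tv_devvp;		/* local case: the device vnode */
	*bnp = vp->tv_blocks[lbn];	/* and the physical block number */
	return (0);
}
```
.DE
.LP
A remote filesystem would presumably return its own vnode and the
logical block number unchanged, leaving the translation to the server.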
.PP
Other issues that must be considered in devising a portable
filesystem implementation include kernel memory allocation,
the implicit use of user-structure global context,
which may create problems with reentrancy,
the style of the system call interface,
and the conventions for synchronization
(sleep/wakeup, handling of interrupted system calls, semaphores).
.SH
The Berkeley Proposal
.PP
The Sun VFS interface has been the most widely used of the three described here.
It is also the most general of the three, in that filesystem-specific
data and operations are best separated from the generic layer.
Although it has several disadvantages which were described above,
most of them may be corrected with minor changes to the interface
(and, in a few areas, philosophical changes).
The DEC GFS has other advantages, in particular the use of the 4.3BSD
\fInamei\fP interface and optimizations.
It allows single or multiple components of a pathname
to be translated in a single call to the specific filesystem
and thus accommodates filesystems with either preference.
The FSS is least well understood, as there is little public information
about the interface.
However, the design goals are the least consistent with those of the Berkeley
research groups.
Accordingly, a new filesystem interface has been devised to avoid
some of the problems in the other systems.
The proposed interface derives directly from Sun's VFS,
but, like GFS, uses a 4.3BSD-style name lookup interface.
Additional context information has been moved from the \fIuser\fP structure
to the \fInameidata\fP structure so that name translation may be independent
of the global context of a user process.
This is especially desired in any system where kernel-mode servers
operate as light-weight or interrupt-level processes,
or where a server may store or cache context for several clients.
This calling interface has the additional advantage
that the call parameters need not all be pushed onto the stack for each call
through the filesystem interface,
and they may be accessed using short offsets from a base pointer
(unlike global variables in the \fIuser\fP structure).
.PP
The proposed filesystem interface is described very tersely here.
For the most part, data structures and procedures are analogous
to those used by VFS, and only the changes will be treated here.
See [Kleiman86] for complete descriptions of the vfs and vnode operations
in Sun's interface.
.PP
The central data structure for name translation is the \fInameidata\fP
structure.
The same structure is used to pass parameters to \fInamei\fP,
to pass these same parameters to filesystem-specific lookup routines,
to communicate completion status from the lookup routines back to \fInamei\fP,
and to return completion status to the calling routine.
For creation or deletion requests, the parameters to the filesystem operation
to complete the request are also passed in this same structure.
The form of the \fInameidata\fP structure is:
.br
.ne 2i
.ID
.nf
.ta .5i +\w'caddr_t\0\0\0'u +\w'struct\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
/*
 * Encapsulation of namei parameters.
 * One of these is located in the u. area to
 * minimize space allocated on the kernel stack
 * and to retain per-process context.
 */
struct nameidata {
		/* arguments to namei and related context: */
	caddr_t	ni_dirp;		/* pathname pointer */
	enum	uio_seg ni_seg;		/* location of pathname */
	short	ni_nameiop;		/* see below */
	struct	vnode *ni_cdir;		/* current directory */
	struct	vnode *ni_rdir;		/* root directory, if not normal root */
	struct	ucred *ni_cred;		/* credentials */

		/* shared between namei, lookup routines and commit routines: */
	caddr_t	ni_pnbuf;		/* pathname buffer */
	char	*ni_ptr;		/* current location in pathname */
	int	ni_pathlen;		/* remaining chars in path */
	short	ni_more;		/* more left to translate in pathname */
	short	ni_loopcnt;		/* count of symlinks encountered */

		/* results: */
	struct	vnode *ni_vp;		/* vnode of result */
	struct	vnode *ni_dvp;		/* vnode of intermediate directory */

/* BEGIN UFS SPECIFIC */
	struct diroffcache {		/* last successful directory search */
		struct	vnode *nc_prevdir;	/* terminal directory */
		long	nc_id;			/* directory's unique id */
		off_t	nc_prevoffset;		/* where last entry found */
	} ni_nc;
/* END UFS SPECIFIC */
};
.DE
.DS
.ta \w'#define\0\0'u +\w'WANTPARENT\0\0'u +\w'0x40\0\0\0\0\0\0\0'u
/*
 * namei operations and modifiers
 */
#define	LOOKUP	0	/* perform name lookup only */
#define	CREATE	1	/* setup for file creation */
#define	DELETE	2	/* setup for file deletion */
#define	WANTPARENT	0x10	/* return parent directory vnode also */
#define	NOCACHE	0x20	/* name must not be left in cache */
#define	FOLLOW	0x40	/* follow symbolic links */
#define	NOFOLLOW	0x0	/* don't follow symbolic links (pseudo) */
.DE
As in current systems other than Sun's VFS, \fInamei\fP is called
with an operation request, one of LOOKUP, CREATE or DELETE.
For a LOOKUP, the operation is exactly like the lookup in VFS.
CREATE and DELETE allow the filesystem to ensure consistency
by locking the parent inode (private to the filesystem),
and (for the local filesystem) to avoid duplicate directory scans
by storing the new directory entry and its offset in the directory
in the \fIndirinfo\fP structure.
This is intended to be opaque to the filesystem-independent levels.
Not all lookups for creation or deletion are actually followed
by the intended operation; permission may be denied, the filesystem
may be read-only, etc.
Therefore, an entry point to the filesystem is provided
to abort a creation or deletion operation
and allow release of any locked internal data.
After a \fInamei\fP with a CREATE or DELETE flag, the pathname pointer
is set to point to the last filename component.
Filesystems that choose to implement creation or deletion entirely
within the subsequent call to a create or delete entry
are thus free to do so.
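.PP
As a rough illustration of how the operation and its modifiers
share the \fIni_nameiop\fP field, the following sketch separates the two.
The mask \fITOY_OPMASK\fP and both helper functions are assumptions
made for this example; the interface above does not define them.
.DS
```c
#include <assert.h>

#define	LOOKUP		0	/* perform name lookup only */
#define	CREATE		1	/* setup for file creation */
#define	DELETE		2	/* setup for file deletion */
#define	WANTPARENT	0x10	/* return parent directory vnode also */
#define	NOCACHE		0x20	/* name must not be left in cache */
#define	FOLLOW		0x40	/* follow symbolic links */

#define	TOY_OPMASK	0x0f	/* hypothetical: low bits hold the operation */

/* Extract the operation (LOOKUP, CREATE or DELETE) from a nameiop value. */
int
toy_nameiop_op(short nameiop)
{
	return (nameiop & TOY_OPMASK);
}

/*
 * Nonzero if the filesystem may be holding locked state that a later
 * create, delete or abort entry must release.
 */
int
toy_needs_commit(short nameiop)
{
	int op = toy_nameiop_op(nameiop);

	assert(op >= LOOKUP && op <= DELETE);
	return (op == CREATE || op == DELETE);
}
```
.DE
.LP
A caller might pass CREATE | FOLLOW to \fInamei\fP and use a test such as
\fItoy_needs_commit\fP to decide whether an abort call is owed
when the intended operation is abandoned.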
.PP
The \fInameidata\fP is used to store context used during name translation.
The current and root directories for the translation are stored here.
For the local filesystem, the per-process directory offset cache
is also kept here.
A file server could leave the directory offset cache empty,
could use a single cache for all clients,
or could hold caches for several recent clients.
.PP
Several other data structures are used in the filesystem operations.
One is the \fIucred\fP structure which describes a client's credentials
to the filesystem.
This is modified slightly from the Sun structure;
the ``accounting'' group ID has been merged into the groups array.
The actual number of groups in the array is given explicitly
to avoid use of a reserved group ID as a terminator.
Also, typedefs introduced in 4.3BSD for user and group IDs have been used.
The \fIucred\fP structure is thus:
.DS
.ta .5i +\w'caddr_t\0\0\0'u +\w'struct\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
/*
 * Credentials.
 */
struct ucred {
	u_short	cr_ref;			/* reference count */
	uid_t	cr_uid;			/* effective user id */
	short	cr_ngroups;		/* number of groups */
	gid_t	cr_groups[NGROUPS];	/* groups */
	/*
	 * The following either should not be here,
	 * or should be treated as opaque.
	 */
	uid_t   cr_ruid;		/* real user id */
	gid_t   cr_svgid;		/* saved set-group id */
};
.DE
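.PP
Because the number of groups is given explicitly, a membership test
needs no terminator value.
The sketch below shows one way such a test might look;
\fItoy_ucred\fP, \fItoy_groupmember\fP and \fITOY_NGROUPS\fP
are illustrative names, and only the fields needed here are reproduced.
.DS
```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>		/* uid_t, gid_t */

#define	TOY_NGROUPS	16	/* hypothetical stand-in for NGROUPS */

struct toy_ucred {
	unsigned short	cr_ref;		/* reference count */
	uid_t	cr_uid;			/* effective user id */
	short	cr_ngroups;		/* number of groups */
	gid_t	cr_groups[TOY_NGROUPS];	/* groups */
};

/* Return 1 if gid is among the first cr_ngroups entries, else 0. */
int
toy_groupmember(gid_t gid, const struct toy_ucred *cred)
{
	int i;

	assert(cred != NULL && cred->cr_ngroups <= TOY_NGROUPS);
	for (i = 0; i < cred->cr_ngroups; i++)
		if (cred->cr_groups[i] == gid)
			return (1);
	return (0);
}
```
.DE
.LP
Slots beyond \fIcr_ngroups\fP are never examined,
so any group ID, including zero, remains usable as a real group.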
.PP
A final structure used by the filesystem interface is the \fIuio\fP
structure mentioned earlier.
This structure describes the source or destination of an I/O
operation, with provision for scatter/gather I/O.
It is used in the read and write entries to the filesystem.
The \fIuio\fP structure presented here is modified from the one
used in 4.2BSD to specify the location of each vector of the operation
(user or kernel space)
and to allow an alternate function to be used to implement the data movement.
The alternate function might perform page remapping rather than a copy,
for example.
.DS
.ta .5i +\w'caddr_t\0\0\0'u +\w'struct\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
/*
 * Description of an I/O operation which potentially
 * involves scatter-gather, with individual sections
 * described by iovec, below.  uio_resid is initially
 * set to the total size of the operation, and is
 * decremented as the operation proceeds.  uio_offset
 * is incremented by the amount of each operation.
 * uio_iov is incremented and uio_iovcnt is decremented
 * after each vector is processed.
 */
struct uio {
	struct	iovec *uio_iov;
	int	uio_iovcnt;
	off_t	uio_offset;
	int	uio_resid;
	enum	uio_rw uio_rw;
};

enum	uio_rw { UIO_READ, UIO_WRITE };
.DE
.DS
.ta .5i +\w'caddr_t\0\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
/*
 * Description of a contiguous section of an I/O operation.
 * If iov_op is non-null, it is called to implement the copy
 * operation, possibly by remapping, with the call
 *	(*iov_op)(from, to, count);
 * where from and to are caddr_t and count is int.
 * Otherwise, the copy is done in the normal way,
 * treating base as a user or kernel virtual address
 * according to iov_segflg.
 */
struct iovec {
	caddr_t	iov_base;
	int	iov_len;
	enum	uio_seg iov_segflg;
	int	(*iov_op)();
};
.DE
.DS
.ta .5i +\w'UIO_USERISPACE\0\0\0\0\0'u
/*
 * Segment flag values.
 */
enum	uio_seg {
	UIO_USERSPACE,		/* from user data space */
	UIO_SYSSPACE,		/* from system space */
	UIO_USERISPACE		/* from user I space */
};
.DE
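.PP
The bookkeeping described in the comments above can be sketched
as a data-movement loop.
This is a user-space approximation, not the kernel routine:
the \fItoy_\fP names are invented here, the segment flag is omitted
because everything lives in one address space,
and a plain memory copy stands in for the normal copy path.
.DS
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef char *toy_caddr_t;

struct toy_iovec {
	toy_caddr_t	iov_base;
	int	iov_len;
	int	(*iov_op)(toy_caddr_t from, toy_caddr_t to, int count);
};

enum toy_uio_rw { TOY_UIO_READ, TOY_UIO_WRITE };

struct toy_uio {
	struct	toy_iovec *uio_iov;
	int	uio_iovcnt;
	long	uio_offset;
	int	uio_resid;
	enum	toy_uio_rw uio_rw;
};

/*
 * Move up to n bytes between the buffer cp and the vectors of uio.
 * Returns 0, or the nonzero result of a failing iov_op.
 */
int
toy_uiomove(toy_caddr_t cp, int n, struct toy_uio *uio)
{
	struct toy_iovec *iov;
	toy_caddr_t from, to;
	int cnt, error;

	assert(cp != NULL && uio != NULL && n >= 0);
	while (n > 0 && uio->uio_resid > 0 && uio->uio_iovcnt > 0) {
		iov = uio->uio_iov;
		cnt = iov->iov_len;
		if (cnt > n)
			cnt = n;
		if (cnt > uio->uio_resid)
			cnt = uio->uio_resid;
		if (cnt > 0) {
			/* a read fills the vectors; a write drains them */
			from = uio->uio_rw == TOY_UIO_READ ? cp : iov->iov_base;
			to = uio->uio_rw == TOY_UIO_READ ? iov->iov_base : cp;
			if (iov->iov_op != NULL) {
				error = (*iov->iov_op)(from, to, cnt);
				if (error != 0)
					return (error);
			} else
				memcpy(to, from, (size_t)cnt);
			iov->iov_base += cnt;
			iov->iov_len -= cnt;
			uio->uio_resid -= cnt;
			uio->uio_offset += cnt;
			cp += cnt;
			n -= cnt;
		}
		if (iov->iov_len == 0) {	/* vector done: step to next */
			uio->uio_iov++;
			uio->uio_iovcnt--;
		}
	}
	return (0);
}
```
.DE
.LP
An \fIiov_op\fP that remaps pages would be supplied per vector;
vectors that leave it null fall back to the ordinary copy.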
.SH
File and filesystem operations
.PP
With the introduction of the data structures used by the filesystem
operations, the complete list of filesystem entry points may be listed.
As noted, they derive mostly from the Sun VFS interface.
Lines marked with \fB+\fP are additions to the Sun definitions;
lines marked with \fB!\fP are modified from VFS.
.PP
The structure describing the externally-visible features of a mounted
filesystem, \fIvfs\fP, is:
.DS
.ta .5i +\w'struct vfsops\0\0\0'u +\w'*vfs_vnodecovered;\0\0\0\0\0'u
/*
 * Structure per mounted file system.
 * Each mounted file system has an array of
 * operations and an instance record.
 * The file systems are put on a doubly linked list.
 */
struct vfs {
	struct vfs	*vfs_next;		/* next vfs in vfs list */
\fB+\fP	struct vfs	*vfs_prev;		/* prev vfs in vfs list */
	struct vfsops	*vfs_op;		/* operations on vfs */
	struct vnode	*vfs_vnodecovered;	/* vnode we mounted on */
	int	vfs_flag;		/* flags */
\fB!\fP	int	vfs_fsize;		/* fundamental block size */
\fB+\fP	int	vfs_bsize;		/* optimal transfer size */
\fB!\fP	uid_t	vfs_exroot;		/* exported fs uid 0 mapping */
	short	vfs_exflags;		/* exported fs flags */
	caddr_t	vfs_data;		/* private data */
};
.DE
.DS
.ta \w'\fB+\fP 'u +\w'#define\0\0'u +\w'VFS_EXPORTED\0\0'u +\w'0x40\0\0\0\0\0'u
	/*
	 * vfs flags.
	 * VFS_MLOCK lock the vfs so that name lookup cannot proceed past the vfs.
	 * This keeps the subtree stable during mounts and unmounts.
	 */
	#define	VFS_RDONLY	0x01		/* read only vfs */
\fB+\fP	#define	VFS_NOEXEC	0x02		/* can't exec from filesystem */
	#define	VFS_MLOCK	0x04		/* lock vfs so that subtree is stable */
	#define	VFS_MWAIT	0x08		/* someone is waiting for lock */
	#define	VFS_NOSUID	0x10		/* don't honor setuid bits on vfs */
	#define	VFS_EXPORTED	0x20		/* file system is exported (NFS) */

	/*
	 * exported vfs flags.
	 */
	#define	EX_RDONLY	0x01		/* exported read only */
.DE
.LP
The operations supported by the filesystem-specific layer
on an individual filesystem are:
.DS
.ta .5i +\w'struct vfsops\0\0\0'u +\w'*vfs_vnodecovered;\0\0\0\0\0'u
/*
 * Operations supported on virtual file system.
 */
struct vfsops {
\fB!\fP	int	(*vfs_mount)(		/* vfs, path, data, datalen */ );
\fB!\fP	int	(*vfs_unmount)(		/* vfs, forcibly */ );
\fB+\fP	int	(*vfs_mountroot)();
	int	(*vfs_root)(		/* vfs, vpp */ );
\fB!\fP	int	(*vfs_statfs)(		/* vfs, vp, sbp */ );
\fB!\fP	int	(*vfs_sync)(		/* vfs, waitfor */ );
\fB+\fP	int	(*vfs_fhtovp)(		/* vfs, fhp, vpp */ );
\fB+\fP	int	(*vfs_vptofh)(		/* vp, fhp */ );
};
.DE
.LP
The \fIvfs_statfs\fP entry returns a structure of the form:
.DS
.ta .5i +\w'struct vfsops\0\0\0'u +\w'*vfs_vnodecovered;\0\0\0\0\0'u
/*
 * file system statistics
 */
struct statfs {
\fB!\fP	short	f_type;			/* type of filesystem */
\fB+\fP	short	f_flags;		/* copy of vfs (mount) flags */
\fB!\fP	long	f_fsize;		/* fundamental file system block size */
\fB+\fP	long	f_bsize;		/* optimal transfer block size */
	long	f_blocks;		/* total data blocks in file system */
	long	f_bfree;		/* free blocks in fs */
	long	f_bavail;		/* free blocks avail to non-superuser */
	long	f_files;		/* total file nodes in file system */
	long	f_ffree;		/* free file nodes in fs */
	fsid_t	f_fsid;			/* file system id */
\fB+\fP	char	*f_mntonname;		/* directory on which mounted */
\fB+\fP	char	*f_mntfromname;		/* mounted filesystem */
	long	f_spare[7];		/* spare for later */
};

typedef long fsid_t[2];			/* file system id type */
.DE
.LP
The modifications to Sun's interface at this level are minor.
Additional arguments are present for the \fIvfs_mount\fP and \fIvfs_unmount\fP
entries.
\fIvfs_statfs\fP accepts a vnode as well as filesystem identifier,
as the information may not be uniform throughout a filesystem.
For example,
if a client may mount a file tree that spans multiple physical
filesystems on a server, different sections may have different amounts
of free space.
(NFS does not allow remotely-mounted file trees to span physical filesystems
on the server.)
The final additions are the entries that support file handles.
\fIvfs_vptofh\fP is provided for the use of file servers,
which need to obtain an opaque
file handle to represent the current vnode for transmission to clients.
This file handle may later be used to relocate the vnode using \fIvfs_fhtovp\fP
without requiring the vnode to remain in memory.
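.PP
The file handle round trip can be illustrated with a toy filesystem.
Everything in the sketch is an assumption made for the example:
the handle layout, the generation number used to detect a reused file,
and the table-based lookup are not specified by the interface,
which treats the handle as opaque.
.DS
```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define	TOY_NFILES	8	/* hypothetical table size */

struct toy_vnode {
	long	fileid;		/* index of the file in the table */
	long	gen;		/* hypothetical generation number */
	int	in_use;
};

struct toy_fh {			/* opaque to everyone but the filesystem */
	long	fh_fileid;
	long	fh_gen;
};

struct toy_vfs {
	struct	toy_vnode files[TOY_NFILES];
};

/* Build a handle naming vp, for transmission to a client. */
int
toy_vptofh(const struct toy_vnode *vp, struct toy_fh *fhp)
{
	assert(vp != NULL && fhp != NULL);
	fhp->fh_fileid = vp->fileid;
	fhp->fh_gen = vp->gen;
	return (0);
}

/* Relocate the vnode named by fhp; ESTALE if it no longer exists. */
int
toy_fhtovp(struct toy_vfs *vfs, const struct toy_fh *fhp,
    struct toy_vnode **vpp)
{
	long id;

	assert(vfs != NULL && fhp != NULL && vpp != NULL);
	id = fhp->fh_fileid;
	if (id < 0 || id >= TOY_NFILES)
		return (ESTALE);
	if (!vfs->files[id].in_use || vfs->files[id].gen != fhp->fh_gen)
		return (ESTALE);
	*vpp = &vfs->files[id];
	return (0);
}
```
.DE
.LP
The server keeps no per-client state between the two calls;
the handle alone is enough to find the file again or to learn that it is gone.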
.PP
Finally, the external form of a filesystem object, the \fIvnode\fP, is:
.DS
.ta .5i +\w'struct vnodeops\0\0'u +\w'*v_vfsmountedhere;\0\0\0'u
/*
 * vnode types. VNON means no type.
 */
enum vtype 	{ VNON, VREG, VDIR, VBLK, VCHR, VLNK, VSOCK };

struct vnode {
	u_short	v_flag;			/* vnode flags (see below) */
	u_short	v_count;		/* reference count */
	u_short	v_shlockc;		/* count of shared locks */
	u_short	v_exlockc;		/* count of exclusive locks */
	struct vfs	*v_vfsmountedhere;	/* ptr to vfs mounted here */
	struct vfs	*v_vfsp;		/* ptr to vfs we are in */
	struct vnodeops	*v_op;			/* vnode operations */
\fB+\fP	struct text	*v_text;		/* text/mapped region */
	enum vtype	v_type;			/* vnode type */
	caddr_t	v_data;			/* private data for fs */
};
.DE
.DS
.ta \w'#define\0\0'u +\w'NOFOLLOW\0\0'u +\w'0x40\0\0\0\0\0\0\0'u
/*
 * vnode flags.
 */
#define	VROOT	0x01	/* root of its file system */
#define	VTEXT	0x02	/* vnode is a pure text prototype */
#define	VEXLOCK	0x10	/* exclusive lock */
#define	VSHLOCK	0x20	/* shared lock */
#define	VLWAIT	0x40	/* proc is waiting on shared or excl. lock */
.DE
.LP
The operations supported by the filesystems on individual \fIvnode\fP\^s
are:
.DS
.ta .5i +\w'int\0\0\0\0\0'u  +\w'(*vn_getattr)(\0\0\0\0\0'u
/*
 * Operations on vnodes.
 */
struct vnodeops {
\fB!\fP	int	(*vn_lookup)(		/* ndp */ );
\fB!\fP	int	(*vn_create)(		/* ndp, vap, fflags */ );
\fB+\fP	int	(*vn_mknod)(		/* ndp, vap, fflags */ );
\fB!\fP	int	(*vn_open)(		/* vp, fflags, cred */ );
	int	(*vn_close)(		/* vp, fflags, cred */ );
	int	(*vn_access)(		/* vp, fflags, cred */ );
	int	(*vn_getattr)(		/* vp, vap, cred */ );
	int	(*vn_setattr)(		/* vp, vap, cred */ );

\fB+\fP	int	(*vn_read)(		/* vp, uiop, offp, ioflag, cred */ );
\fB+\fP	int	(*vn_write)(		/* vp, uiop, offp, ioflag, cred */ );
\fB!\fP	int	(*vn_ioctl)(		/* vp, com, data, fflag, cred */ );
	int	(*vn_select)(		/* vp, which, cred */ );
\fB+\fP	int	(*vn_mmap)(		/* vp, ..., cred */ );
	int	(*vn_fsync)(		/* vp, cred */ );
\fB+\fP	int	(*vn_seek)(		/* vp, offp, off, whence */ );

\fB!\fP	int	(*vn_remove)(		/* ndp */ );
\fB!\fP	int	(*vn_link)(		/* vp, ndp */ );
\fB!\fP	int	(*vn_rename)(		/* src ndp, target ndp */ );
\fB!\fP	int	(*vn_mkdir)(		/* ndp, vap */ );
\fB!\fP	int	(*vn_rmdir)(		/* ndp */ );
\fB!\fP	int	(*vn_symlink)(		/* ndp, vap, nm */ );
	int	(*vn_readdir)(		/* vp, uiop, offp, ioflag, cred */ );
	int	(*vn_readlink)(		/* vp, uiop, ioflag, cred */ );

\fB+\fP	int	(*vn_abortop)(		/* ndp */ );
\fB+\fP	int	(*vn_lock)(		/* vp */ );
\fB+\fP	int	(*vn_unlock)(		/* vp */ );
\fB!\fP	int	(*vn_inactive)(		/* vp */ );
};
.DE
.DS
.ta \w'#define\0\0'u +\w'NOFOLLOW\0\0'u +\w'0x40\0\0\0\0\0'u
/*
 * flags for ioflag
 */
#define	IO_UNIT	0x01		/* do io as atomic unit for VOP_RDWR */
#define	IO_APPEND	0x02		/* append write for VOP_RDWR */
#define	IO_SYNC	0x04		/* sync io for VOP_RDWR */
.DE
.LP
The argument types listed in the comments following each operation are:
.sp
.IP ndp 10
A pointer to a \fInameidata\fP structure.
.IP vap
A pointer to a \fIvattr\fP structure (vnode attributes; see below).
.IP fflags
File open flags, possibly including O_APPEND, O_CREAT, O_TRUNC and O_EXCL.
.IP vp
A pointer to a \fIvnode\fP previously obtained with \fIvn_lookup\fP.
.IP cred
A pointer to a \fIucred\fP credentials structure.
.IP uiop
A pointer to a \fIuio\fP structure.
.IP ioflag
Any of the IO flags defined above.
.IP com
An \fIioctl\fP command, with type \fIunsigned long\fP.
.IP data
A pointer to a character buffer used to pass data to or from an \fIioctl\fP.
.IP which
One of FREAD, FWRITE or 0 (select for exceptional conditions).
.IP off
A file offset of type \fIoff_t\fP.
.IP offp
A pointer to file offset of type \fIoff_t\fP.
.IP whence
One of L_SET, L_INCR, or L_XTND.
.IP fhp
A pointer to a file handle buffer.
.sp
.PP
Several changes have been made to Sun's set of vnode operations.
Most obviously, the \fIvn_lookup\fP receives a \fInameidata\fP structure
containing its arguments and context as described.
The same structure is also passed to one of the creation or deletion
entries if the lookup operation is for CREATE or DELETE to complete
an operation, or to the \fIvn_abortop\fP entry if no operation
is undertaken.
For filesystems that perform no locking between lookup for creation
or deletion and the call to implement that action,
the final pathname component may be left untranslated by the lookup
routine.
In any case, the pathname pointer points at the final name component,
and the \fInameidata\fP contains a reference to the vnode of the parent
directory.
The interface is thus flexible enough to accommodate filesystems
that are fully stateful or fully stateless, while avoiding redundant
operations whenever possible.
One operation remains problematic: the \fIvn_rename\fP call.
It is tempting to look up the source of the rename for deletion
and the target for creation.
However, filesystems that lock directories during such lookups must avoid
deadlock if the two paths cross.
For that reason, the source is translated for LOOKUP only,
with the WANTPARENT flag set;
the target is then translated with an operation of CREATE.
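.PP
The sequence of a lookup for CREATE followed by either the creation
entry or \fIvn_abortop\fP might look like the following sketch.
The toy directory, its lock flag, and all \fItoy_\fP functions
are hypothetical; a real filesystem would hold far more state.
.DS
```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define	LOOKUP	0
#define	CREATE	1
#define	DELETE	2
#define	TOY_OPMASK	0x0f	/* hypothetical: low bits hold the operation */

struct toy_dir {
	int	locked;		/* held between lookup and commit/abort */
	int	readonly;
	int	nentries;
};

struct toy_nameidata {
	short	ni_nameiop;
	struct	toy_dir *ni_dvp;	/* parent directory */
};

/* Lookup: lock the parent when a creation or deletion may follow. */
static int
toy_lookup(struct toy_nameidata *ndp)
{
	int op = ndp->ni_nameiop & TOY_OPMASK;

	if (op == CREATE || op == DELETE)
		ndp->ni_dvp->locked = 1;
	return (0);
}

/* Commit: add the entry and release the lock taken by the lookup. */
static int
toy_create(struct toy_nameidata *ndp)
{
	ndp->ni_dvp->nentries++;
	ndp->ni_dvp->locked = 0;
	return (0);
}

/* Abort: release the lock without changing the directory. */
static int
toy_abortop(struct toy_nameidata *ndp)
{
	ndp->ni_dvp->locked = 0;
	return (0);
}

/* Filesystem-independent caller: every lookup is committed or aborted. */
int
toy_do_create(struct toy_dir *dir)
{
	struct toy_nameidata nd;
	int error;

	assert(dir != NULL);
	nd.ni_nameiop = CREATE;
	nd.ni_dvp = dir;
	if ((error = toy_lookup(&nd)) != 0)
		return (error);
	if (dir->readonly) {
		(void) toy_abortop(&nd);
		return (EROFS);
	}
	return (toy_create(&nd));
}
```
.DE
.LP
For a rename, the same pattern would apply to the target only,
since the source is translated with LOOKUP and takes no such lock.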
.PP
In addition to the changes concerned with the \fInameidata\fP interface,
several other changes were made in the vnode operations.
The \fIvn_rdwr\fP entry was split into \fIvn_read\fP and \fIvn_write\fP;
frequently, the read/write entry amounts to a routine that checks
the direction flag, then calls either a read routine or a write routine.
The two entries may be identical for any given filesystem;
the direction flag is contained in the \fIuio\fP given as an argument.
.PP
All of the read and write operations use a \fIuio\fP to describe
the file offset and buffer locations.
All of these fields must be updated before return.
In particular, the \fIvn_readdir\fP entry uses this
to return a new file offset token for its current location.
.PP
Several new operations have been added.
The first, \fIvn_seek\fP, is a concession to record-oriented files
such as directories.
It allows the filesystem to verify that a seek leaves a file at a sensible
offset, or to return a new offset token relative to an earlier one.
For most filesystems and files, this operation amounts to performing
simple arithmetic.
Another new entry point is \fIvn_mmap\fP, for use in mapping device memory
into a user process address space.
Its semantics are not yet decided.
The final additions are the \fIvn_lock\fP and \fIvn_unlock\fP entries.
These are used to request that the underlying file be locked against
changes for short periods of time if the filesystem implementation allows it.
They are used to maintain consistency
during internal operations such as \fIexec\fP,
and may not be used to construct atomic operations from other filesystem
operations.
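.PP
For an ordinary byte-addressed file, the ``simple arithmetic''
performed by \fIvn_seek\fP might be sketched as follows.
The \fItoy_\fP names are invented for the example, the three
whence values stand in for L_SET, L_INCR and L_XTND,
and rejecting a negative result is an assumption about what
a sensible offset means.
.DS
```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define	TOY_L_SET	0	/* stands in for L_SET: absolute offset */
#define	TOY_L_INCR	1	/* stands in for L_INCR: relative to current */
#define	TOY_L_XTND	2	/* stands in for L_XTND: relative to end */

struct toy_vnode {
	long	size;		/* current file size in bytes */
};

/*
 * Compute a new offset from *offp, off and whence.
 * On success store it in *offp and return 0;
 * otherwise leave *offp alone and return EINVAL.
 */
int
toy_seek(const struct toy_vnode *vp, long *offp, long off, int whence)
{
	long newoff;

	assert(vp != NULL && offp != NULL);
	switch (whence) {
	case TOY_L_SET:
		newoff = off;
		break;
	case TOY_L_INCR:
		newoff = *offp + off;
		break;
	case TOY_L_XTND:
		newoff = vp->size + off;
		break;
	default:
		return (EINVAL);
	}
	if (newoff < 0)
		return (EINVAL);
	*offp = newoff;
	return (0);
}
```
.DE
.LP
A directory implementation would replace the arithmetic with whatever
is needed to turn one offset token into another valid one.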
.PP
The attributes of a vnode are not stored in the vnode,
as they might change with time and may need to be read from a remote
source.
Attributes have the form:
.DS
.ta .5i +\w'struct vnodeops\0\0'u +\w'*v_vfsmountedhere;\0\0\0'u
/*
 * Vnode attributes.  A field value of -1
 * represents a field whose value is unavailable
 * (getattr) or which is not to be changed (setattr).
 */
struct vattr {
	enum vtype	va_type;	/* vnode type (for create) */
	u_short	va_mode;	/* files access mode and type */
\fB!\fP	uid_t	va_uid;		/* owner user id */
\fB!\fP	gid_t	va_gid;		/* owner group id */
	long	va_fsid;	/* file system id (dev for now) */
\fB!\fP	long	va_fileid;	/* file id */
	short	va_nlink;	/* number of references to file */
	u_long	va_size;	/* file size in bytes (quad?) */
\fB+\fP	u_long	va_size1;	/* reserved if not quad */
	long	va_blocksize;	/* blocksize preferred for i/o */
	struct timeval	va_atime;	/* time of last access */
	struct timeval	va_mtime;	/* time of last modification */
	struct timeval	va_ctime;	/* time file changed */
	dev_t	va_rdev;	/* device the file represents */
	u_long	va_bytes;	/* bytes of disk space held by file */
\fB+\fP	u_long	va_bytes1;	/* reserved if va_bytes not a quad */
};
.DE
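.PP
The convention that a field value of \-1 means ``do not change''
can be illustrated with a reduced attribute structure.
The sketch assumes a caller first fills the request with \-1
and then sets only the fields of interest;
\fItoy_vattr\fP and both functions are hypothetical,
and only four fields are shown.
.DS
```c
#include <assert.h>
#include <stddef.h>

struct toy_vattr {
	unsigned short	va_mode;	/* access mode */
	long	va_uid;			/* owner user id */
	long	va_gid;			/* owner group id */
	unsigned long	va_size;	/* file size in bytes */
};

/* Mark every field as "not to be changed". */
void
toy_vattr_null(struct toy_vattr *vap)
{
	assert(vap != NULL);
	vap->va_mode = (unsigned short)-1;
	vap->va_uid = -1;
	vap->va_gid = -1;
	vap->va_size = (unsigned long)-1;
}

/* Apply to cur only those fields of vap that are not -1. */
void
toy_setattr(struct toy_vattr *cur, const struct toy_vattr *vap)
{
	assert(cur != NULL && vap != NULL);
	if (vap->va_mode != (unsigned short)-1)
		cur->va_mode = vap->va_mode;
	if (vap->va_uid != -1)
		cur->va_uid = vap->va_uid;
	if (vap->va_gid != -1)
		cur->va_gid = vap->va_gid;
	if (vap->va_size != (unsigned long)-1)
		cur->va_size = vap->va_size;
}
```
.DE
.LP
Truncating a file while changing its mode, for example,
would set only \fIva_size\fP and \fIva_mode\fP in the request
and leave the owner fields at \-1.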
.SH
Conclusions
.PP
The Sun VFS filesystem interface is the most widely used generic
filesystem interface.
Of the interfaces examined, it creates the cleanest separation
between the filesystem-independent and -dependent layers and data structures.
It has several flaws, but it is felt that certain changes in the interface
can ameliorate most of them.
The interface proposed here includes those changes.
The proposed interface is now being implemented by the Computer Systems
Research Group at Berkeley.
If the design succeeds in improving the flexibility and performance
of the filesystem layering, it will be advanced as a model interface.
.SH
Acknowledgements
.PP
The filesystem interface described here is derived from Sun's VFS interface.
It also includes features similar to those of DEC's GFS interface.
We are indebted to members of the Sun and DEC system groups
for long discussions of the issues involved.
.br
.ne 2i
.SH
References

.IP Brownbridge82 \w'Satyanarayanan85\0\0'u
Brownbridge, D.R., L.F. Marshall, B. Randell,
``The Newcastle Connection, or UNIXes of the World Unite!,''
\fISoftware\- Practice and Experience\fP, Vol. 12, pp. 1147-1162, 1982.

.IP Cole85
Cole, C.T., P.B. Flinn, A.B. Atlas,
``An Implementation of an Extended File System for UNIX,''
\fIUsenix Conference Proceedings\fP,
pp. 131-150, June, 1985.

.IP Kleiman86
Kleiman, S.R.,
``Vnodes: An Architecture for Multiple File System Types in Sun UNIX,''
\fIUsenix Conference Proceedings\fP,
pp. 238-247, June, 1986.

.IP Leffler84
Leffler, S., M.K. McKusick, M. Karels,
``Measuring and Improving the Performance of 4.2BSD,''
\fIUsenix Conference Proceedings\fP, pp. 237-252, June, 1984.

.IP McKusick84
McKusick, M.K., W.N. Joy, S.J. Leffler, R.S. Fabry,
``A Fast File System for UNIX,'' \fITransactions on Computer Systems\fP,
Vol. 2, pp. 181-197,
ACM, August, 1984.

.IP McKusick85
McKusick, M.K., M. Karels, S. Leffler,
``Performance Improvements and Functional Enhancements in 4.3BSD,''
\fIUsenix Conference Proceedings\fP, pp. 519-531, June, 1985.

.IP Rifkin86
Rifkin, A.P., M.P. Forbes, R.L. Hamilton, M. Sabrio, S. Shah, and K. Yueh,
``RFS Architectural Overview,'' \fIUsenix Conference Proceedings\fP,
pp. 248-259, June, 1986.

.IP Ritchie74
Ritchie, D.M. and K. Thompson, ``The Unix Time-Sharing System,''
\fICommunications of the ACM\fP, Vol. 17, pp. 365-375, July, 1974.

.IP Rodriguez86
Rodriguez, R., M. Koehler, R. Hyde,
``The Generic File System,'' \fIUsenix Conference Proceedings\fP,
pp. 260-269, June, 1986.

.IP Sandberg85
Sandberg, R., D. Goldberg, S. Kleiman, D. Walsh, B. Lyon,
``Design and Implementation of the Sun Network Filesystem,''
\fIUsenix Conference Proceedings\fP,
pp. 119-130, June, 1985.

.IP Satyanarayanan85
Satyanarayanan, M., \fIet al.\fP,
``The ITC Distributed File System: Principles and Design,''
\fIProc. 10th Symposium on Operating Systems Principles\fP, pp. 35-50,
ACM, December, 1985.

.IP Walker85
Walker, B.J. and S.H. Kiser, ``The LOCUS Distributed Filesystem,''
\fIThe LOCUS Distributed System Architecture\fP,
G.J. Popek and B.J. Walker, ed., The MIT Press, Cambridge, MA, 1985.

.IP Weinberger84
Weinberger, P.J., ``The Version 8 Network File System,''
\fIUsenix Conference presentation\fP,
June, 1984.