XFS: High-Performance Linux File System
;LOGIN: VOL. 34, NO. 5
new standard file system. As a new design that includes advanced manage-
ment and self-healing features, btrfs will compete heavily with XFS on the
lower end of the XFS market, but we will have to see how well it does on the
extreme high end.
Today XFS is used by many well-known institutions, with CERN and Fer-
milab managing petabytes of storage for scientific experiments using XFS,
and [Link] serving the source code to the Linux kernel and many other
projects from XFS file systems.
ber of inodes when creating the file system, with the possibility of under- or
overprovision. Because every block in the file system can now possibly con-
tain inodes, an additional data structure is needed to keep track of inode lo-
cations and allocations. For this, each allocation group contains another B+
tree tracking the inodes allocated within it.
Because of this, XFS uses a sparse inode number scheme where inode numbers
encode the location of the inode on disk. While this has advantages when
looking up inodes, it also means that for large file systems, inode numbers
can easily exceed the range encodable by a 32-bit integer. Although Linux
has supported 64-bit inode numbers for over 10 years, many user-space
applications on 32-bit systems still cannot accommodate large inode
numbers. Thus by default XFS limits the allocation of inodes to the first
allocation groups, in order to ensure all inode numbers fit into 32 bits.
This restriction can have a significant performance impact, however, and
can be lifted with the inode64 mount option.
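The effect of encoding the on-disk location in the inode number can be sketched in a few lines of Python. The bit layout assumed here (allocation group number in the high bits, then the block within the group, then the inode's slot within the block), the helper names, and the geometry values are illustrative assumptions rather than a transcription of the XFS on-disk format.

```python
def ceil_log2(n):
    """Number of bits needed to index n distinct positions."""
    return (n - 1).bit_length()


def encode_inode_number(agno, agbno, slot, ag_blocks,
                        block_size=4096, inode_size=256):
    """Pack an inode's location into one integer (assumed layout)."""
    slot_bits = ceil_log2(block_size // inode_size)  # inodes per block
    block_bits = ceil_log2(ag_blocks)                # blocks per allocation group
    return (agno << (block_bits + slot_bits)) | (agbno << slot_bits) | slot


def fits_in_32_bits(ino):
    return ino < 2 ** 32


# Hypothetical geometry: 1 TiB allocation groups made of 4 KiB blocks.
AG_BLOCKS = 2 ** 28

last_in_ag0 = encode_inode_number(0, AG_BLOCKS - 1, 15, AG_BLOCKS)
first_in_ag1 = encode_inode_number(1, 0, 0, AG_BLOCKS)

print(fits_in_32_bits(last_in_ag0))   # last inode of the first allocation group
print(fits_in_32_bits(first_in_ag1))  # first inode of the second allocation group
```

Under these assumptions only inodes in the first allocation group stay within 32 bits, which is why restricting inode allocation to the first groups keeps numbers small at the cost of placing inodes far from their data.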
Directories
XFS supports two major forms of directories. If a directory contains only a
few entries and is small enough to fit into the inode, a simple unsorted lin-
ear format can store all data inside the inode’s data fork. The advantage of
this format is that no external block is used and access to the directory is
extremely fast, since it will already be completely cached in memory once
it is accessed. Linear algorithms, however, do not scale to large directories
with millions of entries. XFS thus again uses B+ trees to manage large di-
rectories. Compared to simple hashing schemes such as the htree option in
ext3 and ext4, a full B+ tree provides better ordering of readdir results and
allows for returning unused blocks to the space allocator when a directory
shrinks. The much improved ordering of readdir results can be seen in Fig-
ure 2, which compares the read rates of files in readdir order in a directory
with 100,000 entries.
Figure 2: Comparison of reading a large (100,000 entry) directory, then
reading each file (plot: reading 100,000 4 KiB files in readdir order on a
Seagate ST373454SS SATA disk; y-axis: rate in files/s; series: XFS, ext4,
ext3)
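A measurement of the kind shown in Figure 2 can be approximated with a short Python sketch that reads every file in the order the directory returns it and reports files per second. The function name is hypothetical and this is not the benchmark used for the figure; a meaningful disk measurement would also need cold caches, which the sketch does not arrange.

```python
import os
import time


def read_rate_in_readdir_order(directory):
    """Read every regular file in the order the directory returns them.

    Returns a tuple (files_read, files_per_second).
    """
    start = time.perf_counter()
    with os.scandir(directory) as entries:
        # Deliberately unsorted: keep the order produced by readdir.
        paths = [e.path for e in entries if e.is_file(follow_symlinks=False)]
    for path in paths:
        with open(path, "rb") as f:
            f.read()
    elapsed = time.perf_counter() - start
    rate = len(paths) / elapsed if elapsed > 0 else float("inf")
    return len(paths), rate
```

When readdir order matches the on-disk order of the inodes and data, the loop turns into a mostly sequential read; when it does not, each file costs a seek.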
Direct I/O
XFS offers a feature called direct I/O that provides the semantics of a
UNIX raw device inside the file system namespace. Reads and writes to a
file opened for direct I/O bypass the kernel file cache and go directly
from the user buffer to the underlying I/O hardware. Bypassing the file
cache gives the application full control over the I/O request size and
caching policy. Avoiding the copy into the kernel address space
significantly reduces CPU utilization for large I/O requests. Thus direct
I/O allows applications such as databases, which traditionally used raw
devices, to operate within the file system hierarchy.
Figure 3: Comparing block device, XFS, ext4, and ext3 when writing a 10 GB
file (plot: streaming write performance, direct and buffered, on a RAID 0 of
6 Seagate ST373454SS SATA disks; y-axis: throughput in MiB/s)
Figure 4: Comparing sequential I/O performance between XFS, ext4, and ext3
(plot: reads and writes with 4, 8, and 16 threads; y-axis: throughput in
MiB/s)
Figure 5: Comparing random I/O performance between XFS, ext4, and ext3
(plot: reads and writes with 4, 8, and 16 threads; y-axis: throughput in
MiB/s)
Direct I/O has been adopted by all major Linux file systems, but the support
outside of XFS is rather limited. While XFS guarantees the uncached I/O
behavior under all circumstances, other file systems fall back to buffered
I/O for many non-trivial cases such as appending writes, hole filling, or
writing into preallocated blocks. A major semantic difference between direct
I/O and buffered I/O in XFS is that XFS allows multiple parallel writers to
a file using direct I/O, instead of imposing the single-writer limit
specified in POSIX for buffered I/O. Serialization of I/O requests hitting
the same region is left to the application, which allows databases to access
a table in a single file in parallel from multiple threads or processes.
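From an application's point of view, the parallel-writer pattern might look like the following Python sketch: several threads write disjoint, block-aligned regions of one file opened with O_DIRECT. The block size, function names, and fill pattern are assumptions made for illustration. Because not every file system accepts O_DIRECT, the sketch falls back to a buffered open when the kernel rejects the flag with EINVAL, and it relies on the application to keep writers on separate regions.

```python
import errno
import mmap
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096  # assumed to satisfy the device's alignment requirement


def open_direct(path):
    """Open path for direct I/O; return (fd, direct_in_use)."""
    flags = os.O_RDWR | os.O_CREAT
    o_direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    if o_direct:
        try:
            return os.open(path, flags | o_direct, 0o600), True
        except OSError as exc:
            if exc.errno != errno.EINVAL:
                raise
    # The file system refused O_DIRECT: fall back to buffered I/O.
    return os.open(path, flags, 0o600), False


def write_block(fd, index, fill):
    """Write one BLOCK-sized region from a page-aligned buffer."""
    buf = mmap.mmap(-1, BLOCK)  # anonymous mappings are page-aligned
    try:
        buf.write(bytes([fill]) * BLOCK)
        os.pwrite(fd, buf, index * BLOCK)
    finally:
        buf.close()


def parallel_fill(path, nblocks):
    """Fill nblocks disjoint regions of path from nblocks threads."""
    fd, direct = open_direct(path)
    try:
        with ThreadPoolExecutor(max_workers=nblocks) as pool:
            jobs = [pool.submit(write_block, fd, i, 65 + i)
                    for i in range(nblocks)]
            for job in jobs:
                job.result()  # re-raise any I/O error from a worker
    finally:
        os.close(fd)
    return direct
```

The buffer comes from an anonymous mmap because direct I/O generally requires the user buffer, the file offset, and the transfer size to be aligned; a plain bytes object gives no such guarantee.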
Crash Recovery
For today’s large file systems, a full file system check on an unclean
shutdown is not acceptable because it would take too long. To avoid the
requirement for regular file system checks, XFS uses a write-ahead logging
scheme that enables atomic updates of the file system. XFS only logs
structural updates to the file system metadata, but not the actual user
data, for which the POSIX file system interface does not provide useful
atomicity guarantees.
XFS logs every update to the file system data structures and does not batch
changes from multiple transactions into a single log write, as is done by
ext3. This means that XFS must write significantly more data to the log
when a single metadata structure is modified again and again in short
sequence (e.g., when removing a large number of small files). To mitigate
the impact of log writes on system performance, an external log device can
be used. With an external log the additional seeks on the main device are
reduced, and the log can use the full sequential performance of the log
device.
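The difference in log volume can be illustrated with a deliberately simplified model in Python. It assumes that every update logs a fixed number of bytes for the structure it touches, and that a batching scheme writes that structure only once per commit regardless of how often it changed in between. The sizes and counts are hypothetical and do not describe the actual XFS or ext3 log formats.

```python
import math


def log_bytes_unbatched(updates, bytes_per_update):
    """Every update to the structure is written to the log separately."""
    return updates * bytes_per_update


def log_bytes_batched(updates, bytes_per_update, updates_per_commit):
    """Repeated updates collapse into one log write per commit."""
    commits = math.ceil(updates / updates_per_commit)
    return commits * bytes_per_update


# Hypothetical workload: 10,000 file removals that all modify the same
# 4096-byte metadata structure, with 100 removals per batched commit.
unbatched = log_bytes_unbatched(10_000, 4096)
batched = log_bytes_batched(10_000, 4096, 100)

print(unbatched, batched, unbatched // batched)
```

In this model the unbatched log traffic grows with the number of updates, while the batched traffic grows only with the number of commits, which is the gap an external log device helps to absorb.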
Unfortunately, transaction logging does not help to protect against
hardware-induced errors. To deal with these problems, XFS has an offline
file system checking and repair tool called xfs_repair. To cope with
ever-growing disk sizes and worsening seek rates, xfs_repair has undergone
a major overhaul in the past few years to perform efficient read-ahead and
caching and to make use of multiple processors in SMP systems [6].
Disk Quotas
XFS provides an enhanced implementation of the BSD disk quotas. It sup-
ports the normal soft and hard limits for disk space usage and number of
inodes as an integral part of the file system. Both the per-user and per-group
quotas supported in BSD and other Linux file systems are supported. In ad-
dition to group quotas, XFS alternatively can support project quotas, where
a project is an arbitrary integer identifier assigned by the system adminis-
trator. The project quota mechanism in XFS is used to implement directory
tree quota, where a specified directory and all of the files and subdirectories
below it are restricted to using a subset of the available space in the file sys-
tem. For example, the sequence below restricts the size of the log files in
/var/log to 1 gigabyte of space:
# mount -o prjquota /dev/sda6 /var
Day-to-Day Use
A file system in use should be boring and mostly invisible to the system
administrator and user. But to get to that state the file system must first
be created. An XFS file system is created with the mkfs.xfs command, which
is trivial to use:
# mkfs.xfs /dev/vg00/scratch
meta-data=/dev/vg00/scratch  isize=256    agcount=4, agsize=1245184 blks
         =                   sectsz=512   attr=2
data     =                   bsize=4096   blocks=4980736, imaxpct=25
         =                   sunit=0      swidth=0 blks
naming   =version 2          bsize=4096   ascii-ci=0
log      =internal log       bsize=4096   blocks=2560, version=2
         =                   sectsz=512   sunit=0 blks, lazy-count=0
realtime =none               extsz=4096   blocks=0, rtextents=0
For more details, see the mkfs.xfs man page and the XFS training material
[5].
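The geometry printed above can be cross-checked with some simple arithmetic. The following Python sketch copies the numbers from that output and derives the sizes they imply; the dictionary keys and helper name are chosen here for illustration only.

```python
# Values copied from the file system creation output shown above.
geometry = {
    "agcount": 4,         # number of allocation groups
    "agsize": 1245184,    # blocks per allocation group
    "bsize": 4096,        # bytes per file system block
    "blocks": 4980736,    # total data blocks
    "log_blocks": 2560,   # blocks in the internal log
}

GIB = 2 ** 30
MIB = 2 ** 20


def derived_sizes(g):
    """Return the byte sizes implied by the block counts."""
    return {
        "data_bytes": g["blocks"] * g["bsize"],
        "ag_bytes": g["agsize"] * g["bsize"],
        "log_bytes": g["log_blocks"] * g["bsize"],
    }


sizes = derived_sizes(geometry)

# The allocation groups exactly tile the data area.
assert geometry["agcount"] * geometry["agsize"] == geometry["blocks"]

print(sizes["data_bytes"] / GIB)  # file system size in GiB
print(sizes["ag_bytes"] / GIB)    # allocation group size in GiB
print(sizes["log_bytes"] / MIB)   # internal log size in MiB
```

The four allocation groups of 4.75 GiB each add up to a 19 GiB file system, and the internal log occupies 10 MiB of it.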
A command worth noting is xfs_fsr. FSR stands for file system reorganizer
and is the XFS equivalent to the Windows defrag tool. It allows
defragmentation of the extent lists of all files in a file system and can
be run in the background from cron. It may also be used on a single file.
Although all normal backup applications can be used for XFS file systems,
the xfsdump command is specifically designed for XFS backup. Unlike
traditional dump tools such as dump for ext2 and ext3, xfsdump uses a
special API to perform I/O based on file handles similar to those used in
the NFS over-the-wire protocol. That way, xfsdump does not suffer from the
inconsistent device snapshots on the raw block device that plague
traditional dump tools. The xfsdump command can perform backups to regular
files and tapes on local and remote systems, and it supports incremental
backups with a sophisticated inventory management system.
XFS file systems can be grown while mounted using the xfs_growfs com-
mand, but there is not yet the ability to shrink.
Conclusion
This article gave a quick overview of the features of XFS, the Linux file sys-
tem for large storage systems. I hope it clearly explains why Linux needs a
file system that differs from the default and also shows the benefits of a file
system designed for large storage from day one.
Acknowledgments
I would like to thank Eric Sandeen for reviewing this article carefully.
References
[1] Adam Sweeney et al., “Scalability in the XFS File System,” Proceedings
of the USENIX 1996 Annual Technical Conference.
[2] Silicon Graphics Inc., XFS Filesystem Structure, 2nd edition,
[Link]
[3] Linux attr(5) man page: [Link]
[4] Dave Chinner and Jeremy Higdon, “Exploring High Bandwidth Filesystems
on Large Systems,” Proceedings of the Ottawa Linux Symposium 2006:
[Link]
[5] Silicon Graphics Inc., XFS Overview and Internals: [Link]
projects/xfs/training/[Link].
[6] Dave Chinner and Barry Naujok, Fixing XFS Filesystems Faster:
http://[Link]/pub/[Link]/2008/slides/135-fixing_xfs_faster.pdf.
[7] Dr. Stephen Tweedie, “EXT3, Journaling Filesystem,” transcript of a
presentation at the Ottawa Linux Symposium 2000: [Link]
net/release/OLS2000-ext3/[Link].
[8] xfs_quota(8)—Linux manpage: [Link]