Module 6: Distributed File Systems (DFS)
What is DFS (Distributed File System)?
A distributed file system (DFS) is a networked architecture that allows multiple users and
applications to access and manage files across various machines as if they were on a local
storage device. Instead of storing data on a single server, a DFS spreads files across multiple
locations, enhancing redundancy and reliability.
• This setup not only improves performance by enabling parallel access but also
simplifies data sharing and collaboration among users.
• By abstracting the complexities of the underlying hardware, a distributed file
system provides a seamless experience for file operations, making it easier to
manage large volumes of data in a scalable manner.
Components of DFS
• Location transparency: Achieved through the namespace component.
• Redundancy: Achieved through the file replication component.
Under failure or heavy load, these components together improve data availability
by allowing data shared from different locations to be logically grouped under one
folder, which is known as the "DFS root". The two components need not be used
together: the namespace component can be used without the file replication
component, and the file replication component can be used between servers without
the namespace component.
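The way a namespace hides file locations can be sketched in a few lines of Python. This is a minimal illustration, not the design of any particular DFS product: the class `DfsNamespace`, the root `/corp`, and the server and share names are all hypothetical. Listing more than one target for a folder stands in for replication, so a request can fall back to another copy when a server is down.

```python
class DfsNamespace:
    """Hypothetical namespace: maps logical folders under one root to server targets."""

    def __init__(self, root):
        self.root = root.rstrip("/")
        self.links = {}  # logical folder -> list of (server, share) targets

    def add_link(self, folder, targets):
        # Several targets for one folder model replicated copies.
        self.links[f"{self.root}/{folder.strip('/')}"] = list(targets)

    def resolve(self, logical_path, is_up=lambda server: True):
        """Return (server, physical_path) for the first reachable target."""
        # Try the longest logical folder first so nested links take priority.
        for folder in sorted(self.links, key=len, reverse=True):
            if logical_path == folder or logical_path.startswith(folder + "/"):
                rest = logical_path[len(folder):]
                for server, share in self.links[folder]:
                    if is_up(server):
                        return server, share + rest
                raise ConnectionError(f"no reachable target for {folder}")
        raise FileNotFoundError(logical_path)


ns = DfsNamespace("/corp")
ns.add_link("docs", [("srv-a", "/export/docs"), ("srv-b", "/mirror/docs")])

# The client always uses the same logical path, whichever server answers.
primary = ns.resolve("/corp/docs/plan.txt")
failover = ns.resolve("/corp/docs/plan.txt", is_up=lambda s: s != "srv-a")
```

The client never names a server: the same logical path resolves to `srv-a` normally and to `srv-b` when `srv-a` is reported down.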
Features of DFS
1. Transparency
• Structure transparency: There is no need for the client to know about the
number or locations of file servers and the storage devices. Multiple file
servers should be provided for performance, adaptability, and
dependability.
• Access transparency: Both local and remote files should be accessible in
the same manner. The file system should automatically locate the
accessed file and deliver it to the client.
• Naming transparency: The name of a file should give no hint of the file's
location. Once a name is given to a file, it should not change when the
file is moved from one node to another.
• Replication transparency: If a file is replicated on multiple nodes, both the
existence of the copies and their locations should be hidden from
clients.
2. User mobility: The system should automatically bring the user's home directory to
the node where the user logs in.
3. Performance: Performance is measured as the average amount of time needed to
satisfy client requests. This time covers the CPU time + time taken to
access secondary storage + network access time. Ideally, the
performance of a Distributed File System should be comparable to that of a
centralized file system.
4. Simplicity and ease of use: The user interface of a file system should be simple and
the number of commands in the file system should be small.
5. High availability: A Distributed File System should be able to continue operating in
case of partial failures such as a link failure, a node failure, or a storage drive crash.
A highly reliable and adaptable distributed file system should have different and
independent file servers controlling different and independent storage devices.
6. Scalability: Since growing the network by adding new machines or joining two
networks together is routine, the distributed system will inevitably grow over time.
As a result, a good distributed file system should be built to scale quickly as the
number of nodes and users in the system grows. Service should not be substantially
disrupted as the number of nodes and users grows.
7. Data integrity: Multiple users frequently share a file system. The integrity of data
saved in a shared file must be guaranteed by the file system. That is, concurrent
access requests from many users who are competing for access to the same file must
be correctly synchronized using a concurrency control method. Atomic transactions
are a high-level concurrency management mechanism for data integrity that is
frequently offered to users by a file system.
8. Security: A distributed file system should be secure so that its users may trust that
their data will be kept private. To safeguard the information contained in the file
system from unwanted & unauthorized access, security mechanisms must be
implemented.
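The data integrity requirement in item 7 can be illustrated with a small sketch. It uses a plain lock as the concurrency control method, which is an assumption made for brevity and is simpler than the atomic transactions mentioned above. `SharedFile` and `run_clients` are hypothetical names, and the "file" is an in-memory list rather than real storage.

```python
import threading


class SharedFile:
    """Hypothetical in-memory stand-in for a file shared by many clients."""

    def __init__(self):
        self._lines = []
        self._lock = threading.Lock()

    def append_numbered(self, client):
        # Reading the current length and appending form one critical section,
        # so two clients can never claim the same line number.
        with self._lock:
            number = len(self._lines) + 1
            self._lines.append(f"{number}:{client}")

    def lines(self):
        with self._lock:
            return list(self._lines)


def run_clients(shared, clients=4, writes=100):
    """Start several competing writer threads and wait for all of them."""
    def worker(name):
        for _ in range(writes):
            shared.append_numbered(name)

    threads = [
        threading.Thread(target=worker, args=(f"client{i}",))
        for i in range(clients)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


shared = SharedFile()
run_clients(shared)
numbers = [int(line.split(":")[0]) for line in shared.lines()]
```

With the lock in place, `numbers` is the unbroken sequence 1 to 400 regardless of how the threads interleave. Without it, two clients could read the same length and write duplicate line numbers.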
File Models in Distributed System
What is the File Model in Distributed Systems?
A file model in distributed systems refers to the way data and files are organized, accessed,
and managed across multiple nodes or locations within a network. It encompasses the
structure, organization, and methods used to store, retrieve, and manipulate files in a
distributed environment. File models define how data is stored physically, how it can be
accessed, and what operations can be performed on it.
Importance of File Models in Distributed Systems
• Organize and Structure Data: File models provide a framework for organizing data
into logical units, making it easier to manage and query data across distributed
nodes.
• Ensure Data Consistency and Integrity: By defining how data is structured and
accessed, file models help maintain data consistency and integrity, crucial for
reliable operations in distributed environments.
• Support Scalability: Different file models offer varying levels of scalability, allowing
distributed systems to efficiently handle growing amounts of data and increasing
user demands.
• Enable Efficient Access and Retrieval: Depending on the file model chosen,
distributed systems can optimize data access patterns, ensuring that data retrieval
operations are efficient and responsive.
• Facilitate Collaboration and Sharing: File models in distributed systems enable
seamless collaboration and sharing of data among users and applications, regardless
of geographical location or network configuration.
Types of File Models in Distributed Systems
File models in distributed systems dictate how data is organized, accessed, and
managed across multiple nodes within a network. These models are classified based on
their structure and modifiability criteria, each offering distinct advantages and
functionalities.
1. Based on Structure Criteria:
• Unstructured Files:
o Description: An unstructured file is a collection of data stored as an
uninterpreted sequence of bytes, without any predefined format or internal
structure.
o Characteristics:
o Simplest and most commonly used model.
o Data can be interpreted differently by different applications.
o Suitable for storing diverse data types (text, multimedia, binary).
o Example: The file systems of traditional operating systems such as UNIX or DOS.
• Structured Files:
o Description: A structured file organizes data into a predefined schema or
format, typically using records and fields.
o Characteristics:
o Data is organized into records with defined attributes.
o Supports complex querying and indexing.
o Ensures data consistency and integrity.
o Types:
o Files with Non-Indexed Records: Records accessed by position in
the file.
o Files with Indexed Records: Records accessed by key fields using
data structures like B-trees or hash tables.
o Example: Relational databases (e.g., MySQL, PostgreSQL).
2. Based on Modifiability Criteria:
• Mutable Files:
o Description: Mutable files allow data to be modified, updated, or deleted
after initial creation.
o Characteristics:
o Supports dynamic updates and real-time data manipulation.
o Requires concurrency control mechanisms for simultaneous access.
o Example: Traditional file systems and databases supporting CRUD
operations.
• Immutable Files:
o Description: Immutable files prohibit modifications once created,
maintaining data integrity and auditability.
o Characteristics:
o Each update creates a new version of the file.
o Ensures consistent data sharing and replication.
o Reduces risks associated with accidental or malicious alterations.
o Example: Cedar File System (CFS) where multiple versions of a file are
managed.
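The immutable model described above, where each update creates a new version, can be sketched as follows. This is an illustrative toy and not a description of how the Cedar File System is implemented. The class `ImmutableFileStore` and its method names are hypothetical.

```python
class ImmutableFileStore:
    """Hypothetical store in which a write never alters existing data."""

    def __init__(self):
        self._versions = {}  # file name -> list of contents; index i holds version i + 1

    def write(self, name, data):
        """Store data as a new version and return its version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(bytes(data))  # copy, so later changes by the caller cannot leak in
        return len(versions)

    def read(self, name, version=None):
        """Return the requested version, or the latest one if none is given."""
        versions = self._versions[name]
        if version is None:
            version = len(versions)
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return versions[version - 1]


store = ImmutableFileStore()
v1 = store.write("report.txt", b"first draft")
v2 = store.write("report.txt", b"second draft")

latest = store.read("report.txt")
original = store.read("report.txt", version=v1)
```

Because old versions are never overwritten, a replica or a reader holding version 1 keeps seeing consistent data even after version 2 is written.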
File Accessing Models in Distributed System
In a Distributed File System (DFS), multiple machines are used to provide the file system's
service. Different file systems use different conceptual models of a file. The two most
commonly used criteria for file modeling are structure and modifiability; file models based
on these criteria are described in the preceding section.
File Accessing Models:
The file accessing model depends on two factors:
• The unit of data access/transfer
• The method used for accessing remote files
Based on the unit of data access, the following file access models may be used to access a
particular file.
1. File-level transfer model: In the file-level transfer model, whenever an operation requires
file data, the entire file is moved across the distributed computing network between client
and server. This model has better scalability and is efficient.
2. Block-level transfer model: In the block-level transfer model, file data is transferred
across the network between a client and a server in units of file blocks. Thus, the unit of
data transfer in the block-level transfer model is the file block. The block-level transfer
model may be used in a distributed computing environment containing several diskless
workstations.
3. Byte-level transfer model: In the byte-level transfer model, file data is transferred
across the network between a client and a server in units of bytes. Thus, the unit of data
transfer in the byte-level transfer model is the byte. The byte-level transfer model offers
greater flexibility than the other file transfer models because it permits retrieval and
storage of an arbitrary sequential subrange of a file. The major drawback of the byte-level
transfer model is the difficulty of cache management caused by the variable-length data of
different access requests.
4. Record-level transfer model: The record-level transfer model may be used with file models
in which the file contents are structured as records. In the record-level transfer model,
file data is transferred across the network between a client and a server in units of
records. Thus, the unit of data transfer in the record-level transfer model is the record.
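The four transfer units above can be compared side by side in a short sketch. The `FileServer` class, its method names, and the block and record sizes are hypothetical, chosen small so the results are easy to check by hand. Fixed-length records are assumed for simplicity, and the network is replaced by direct method calls.

```python
class FileServer:
    """Hypothetical server exposing one file through four transfer units."""

    BLOCK_SIZE = 4   # bytes per block (assumed, deliberately tiny)
    RECORD_SIZE = 6  # bytes per fixed-length record (assumed)

    def __init__(self, data):
        self.data = data

    def read_file(self):
        # File-level: the whole file is sent in one transfer.
        return self.data

    def read_block(self, index):
        # Block-level: the unit of transfer is one fixed-size block.
        start = index * self.BLOCK_SIZE
        return self.data[start:start + self.BLOCK_SIZE]

    def read_bytes(self, offset, length):
        # Byte-level: any sequential subrange may be requested.
        return self.data[offset:offset + length]

    def read_record(self, index):
        # Record-level: the unit of transfer is one record.
        start = index * self.RECORD_SIZE
        return self.data[start:start + self.RECORD_SIZE]


server = FileServer(b"AAAAAABBBBBBCCCCCC")  # three 6-byte records

whole = server.read_file()        # all 18 bytes
block = server.read_block(1)      # bytes 4..7, which straddle two records
span = server.read_bytes(5, 3)    # arbitrary subrange of 3 bytes
record = server.read_record(2)    # the third record
```

Note how `read_bytes` returns a different amount of data for each request. That variable length is what makes cache management harder in the byte-level model than in the block-level one.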
File Caching in Distributed File Systems
File caching enhances I/O performance because previously read files are kept in main
memory. Because the files are available locally, no network transfer is needed when
requests for these files are repeated. The performance improvement depends
on the locality of the file access pattern. Caching also helps reliability and scalability.
File caching is an important feature of distributed file systems that helps to improve
performance by reducing network traffic and minimizing disk access. In a distributed file
system, files are stored across multiple servers or nodes, and file caching involves
temporarily storing frequently accessed files in memory or on local disks to reduce the need
for network access or disk access.
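The dependence on locality can be made concrete with a small bounded cache. The text above does not name a replacement policy. Least-recently-used (LRU) eviction is assumed here purely for illustration, and `LruFileCache` and its `fetch` callback (standing in for a network or disk read) are hypothetical.

```python
from collections import OrderedDict


class LruFileCache:
    """Hypothetical bounded file cache with least-recently-used eviction."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch            # called only on a miss (the slow path)
        self.entries = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def read(self, name):
        if name in self.entries:
            self.hits += 1
            self.entries.move_to_end(name)  # mark as most recently used
            return self.entries[name]
        self.misses += 1
        data = self.fetch(name)
        self.entries[name] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return data


cache = LruFileCache(capacity=2, fetch=lambda name: f"contents of {name}".encode())
for name in ["a", "b", "a", "a", "c", "b", "a"]:
    cache.read(name)
```

In this access sequence only the two repeated reads of `a` are hits. Once the requests start cycling through three files with room for two, every read misses, which shows that caching pays off only when accesses have locality.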
Here are some ways file caching is implemented in distributed file systems:
Client-side caching: In this approach, the client machine stores a local copy of frequently
accessed files. When the file is requested, the client checks if the local copy is up-to-date
and, if so, uses it instead of requesting the file from the server. This reduces network traffic
and improves performance by reducing the need for network access.
Server-side caching: In this approach, the server stores frequently accessed files in memory
or on local disks to reduce the need for disk access. When a file is requested, the server
checks if it is in the cache and, if so, returns it without accessing the disk. The file still
travels from server to client, so this approach saves disk access rather than client-server
network traffic.
Distributed caching: In this approach, the file cache is distributed across multiple servers or
nodes. When a file is requested, the system checks if it is in the cache and, if so, returns it
from the nearest server. This approach reduces network traffic by minimizing the need for
data to be transferred across the network.
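The client-side approach, where the client checks whether its local copy is up to date before using it, can be sketched as below. Comparing version numbers is one assumed way to do that check. The source does not specify the validation mechanism, and the `Server` and `CachingClient` classes are hypothetical. Method calls stand in for network requests.

```python
class Server:
    """Hypothetical file server that keeps a version number per file."""

    def __init__(self):
        self.files = {}     # name -> (version, data)
        self.transfers = 0  # counts full file transfers to clients

    def write(self, name, data):
        version = self.files.get(name, (0, b""))[0] + 1
        self.files[name] = (version, data)

    def version(self, name):
        # Cheap validation request: returns only the version number.
        return self.files[name][0]

    def fetch(self, name):
        # Expensive request: returns the whole file.
        self.transfers += 1
        return self.files[name]


class CachingClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}  # name -> (version, data)

    def read(self, name):
        cached = self.cache.get(name)
        # Use the local copy only if the server still holds the same version.
        if cached is not None and cached[0] == self.server.version(name):
            return cached[1]
        self.cache[name] = self.server.fetch(name)
        return self.cache[name][1]


server = Server()
server.write("notes.txt", b"v1")
client = CachingClient(server)

first = client.read("notes.txt")    # miss: full transfer
second = client.read("notes.txt")   # hit: validated, no transfer
transfers_before_update = server.transfers

server.write("notes.txt", b"v2")    # file changes on the server
third = client.read("notes.txt")    # stale copy detected, fetched again
```

The validation call still crosses the network, but it carries a version number instead of the file contents, which is where the traffic saving comes from in this sketch.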
Advantages of file caching in distributed file systems include:
1. Improved performance: By reducing network traffic and minimizing disk access, file
caching can significantly improve the performance of distributed file systems.
2. Reduced latency: File caching can reduce latency by allowing files to be accessed
more quickly without the need for network access or disk access.
3. Better resource utilization: File caching allows frequently accessed files to be stored
in memory or on local disks, reducing the need for network or disk access and
improving resource utilization.