
Overview of Google File System Features

This document summarizes a research article about the Google File System (GFS). GFS is designed to manage large amounts of data across many servers. It divides files into 64 MB chunks that are replicated across multiple servers for fault tolerance. The system uses a single master server to manage metadata and coordinate access from clients. Overall, GFS provides high throughput, reliability, and scalability for large data applications like Google search.

Uploaded by

Pankaj Gupta

International Journal of Computer Science Trends and Technology (IJCST) – Volume 4 Issue 4, Jul - Aug 2016

RESEARCH ARTICLE OPEN ACCESS

A Review on Google File System


Richa Pandey [1], S.P Sah [2]
Department of Computer Science
Graphic Era Hill University
Uttarakhand – India

ABSTRACT
Google is an American multinational technology company specializing in Internet-related services and products. The
Google File System (GFS) helps manage the large amount of data that is spread across various databases.
A good distributed file system is one that can handle faults, replicate data, and use storage efficiently.
Big data must be kept in a form in which it can be managed easily. Many applications, such as Gmail and
Facebook, rely on file systems that organize their data in a suitable format.
In conclusion, the paper introduces approaches used in distributed file systems, such as spreading a file's data
across storage, a single master, and append writes.
Keywords:- GFS, NFS, AFS, HDFS

I. INTRODUCTION
The Google File System is designed so that the data spread across the system is stored in an organized manner.
The data is arranged so as to reduce the load on the servers, increase availability, deliver high aggregate
throughput, and keep the stored data accurate and reliable. Several techniques are used to achieve this.

II. GFS EVOLUTION
The need to evolve GFS arose from its original design. In particular, the single-master design was not very
efficient and carried considerable risk. Google therefore investigated a distributed master design to address
the challenges it faced.

Some of the problems that Google faced:
1) Storage grew into the petabyte range. The single master started to become a problem when thousands of
client requests arrived simultaneously.
2) The fixed 64 MB chunk size caused problems, because the system had to deal with applications that generate
a large number of small files.

III. KEY IDEAS

3.1. Design and Architecture: A GFS cluster consists of a single master and multiple chunk servers used by
multiple clients. Because the files stored in GFS are large, processing and transferring them can consume a lot
of bandwidth. To use bandwidth efficiently, files are divided into large 64 MB chunks, each identified by a
unique 64-bit chunk handle assigned by the master.

3.2. No caching: File data is not cached by the client or the chunk server. Large streaming reads offer little
caching benefit, since most of the cached data would be overwritten before it could be reused.

3.3. Single Master: A single master simplifies the design and allows simple centralized management. The master
stores metadata and coordinates access. All metadata is kept in the master's memory, which makes operations
fast. The master maintains 64 bytes of metadata per chunk, so master memory is not a problem. To reduce master
involvement, a lease mechanism is used: a lease maintains a consistent mutation (append or write) order across
replicas.

3.4. Garbage collection: The system takes a special approach to deletion. Once a file is deleted, its storage is
not reclaimed immediately. Deleted files are removed during the regular scan if they have existed for 3 days.
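The delayed deletion described in 3.4 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the way GFS implements it: the `Namespace` class, its `live` and `deleted` tables, and the `undelete` method are hypothetical names introduced here, and time is passed in as plain seconds so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field

GRACE_PERIOD_SECONDS = 3 * 24 * 60 * 60  # the 3-day period mentioned in 3.4


@dataclass
class Namespace:
    """Toy stand-in for the master's file table (hypothetical structure)."""
    live: dict = field(default_factory=dict)     # path -> list of chunk handles
    deleted: dict = field(default_factory=dict)  # path -> (deletion time, chunk handles)

    def delete(self, path, now):
        # Deleting only records the time; no storage is reclaimed yet.
        self.deleted[path] = (now, self.live.pop(path))

    def undelete(self, path):
        # Until the scan reclaims the file, the deletion can still be undone.
        _, chunks = self.deleted.pop(path)
        self.live[path] = chunks

    def scan(self, now):
        """Regular scan: reclaim files deleted at least 3 days ago."""
        expired = [path for path, (deleted_at, _) in self.deleted.items()
                   if now - deleted_at >= GRACE_PERIOD_SECONDS]
        reclaimed = []
        for path in expired:
            _, chunks = self.deleted.pop(path)
            reclaimed.extend(chunks)
        return reclaimed  # chunk handles whose replicas may now be erased
```

Because `delete` only moves an entry between two tables, the expensive work happens in `scan`, which can be run whenever it is convenient.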

ISSN: 2347-8578 [Link] Page 177



The advantages of this approach are:
1) It is simple in operation.
2) Files can be deleted during the master's idle periods.
3) It provides safety against accidental deletion.

3.5. Consistency model:
1) File namespace changes are always atomic.
2) A file region is consistent if all clients read the same values from all replicas.
3) A file region is defined if clients see what a mutation writes in its entirety.

IV. GFS FEATURES INCLUDE

• Fault tolerance
• Critical data replication
• Automatic and efficient data recovery
• High aggregate throughput
• Reduced client and master interaction because of the large chunk size
• Namespace management and locking
• High availability

The largest GFS clusters have more than 1,000 nodes with 300 TB of disk storage capacity.
The Google File System is a distributed file system built for large, distributed, data-intensive applications
such as Gmail. It was built to store the data generated by Google's large crawling and indexing system. The
files generated by this system were usually huge. Maintaining and managing such huge files, and meeting the
associated data processing demands, was a challenge for the existing file systems. The main objective of the
designers was to build a highly fault-tolerant system that runs on inexpensive hardware.

GFS design assumptions:
1) Components fail often, and GFS should be able to recover from such failures.
2) Stored files are large, typically multiple GB in size.
3) Reads are of two types: large streaming reads and small random reads.
4) Once files are written, they are mostly read. Most write operations are appends.
5) Concurrent appends by multiple clients to the same file must be supported.
6) High sustained bandwidth and throughput are more important than low latency.

V. GENERAL ARCHITECTURE OF GOOGLE FILE SYSTEM

GFS is organized into clusters of computers. A cluster is simply a network of computers, and each cluster might
contain hundreds or even thousands of machines. In each GFS cluster there are three main entities:
1. Clients
2. Master servers
3. Chunk servers

1. Clients are other computers or computer applications that make file requests. Requests can range from
retrieving and manipulating existing files to creating new files on the system. Clients can be thought of as
the customers of GFS.

2. The Master Server is the manager of the cluster. Its tasks include:
(a) Maintaining an operation log that keeps track of the activities of the cluster. The operation log helps
keep service interruptions to a minimum if the master server crashes.
(b) Keeping track of metadata, which is the information that describes chunks. The metadata tells the master
server which files the chunks belong to and where they fit in the overall file.

3. Chunk Servers are the workhorses of GFS. They store 64 MB file chunks and send requested chunks directly to
the client.
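The division of labour between the master (metadata only) and the chunk servers (file data sent directly to the client) can be illustrated with a small Python sketch. All class and function names here (`Master`, `ChunkServer`, `client_read`, `lookup`) are hypothetical, one chunk server per chunk is assumed for brevity, and the read is assumed to fall within a single chunk; a real client would also handle reads that cross chunk boundaries and choose among replicas.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as stated in the text


class ChunkServer:
    """Stores chunk contents keyed by chunk handle."""
    def __init__(self):
        self.chunks = {}  # chunk handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]


class Master:
    """Holds metadata only; file data never passes through it."""
    def __init__(self):
        self.file_chunks = {}  # path -> ordered list of chunk handles
        self.locations = {}    # chunk handle -> ChunkServer holding that chunk

    def lookup(self, path, chunk_index):
        handle = self.file_chunks[path][chunk_index]
        return handle, self.locations[handle]


def client_read(master, path, offset, length, chunk_size=CHUNK_SIZE):
    """Read bytes that lie within one chunk (a simplification)."""
    # Translate the byte offset into a chunk index and an offset inside that chunk.
    chunk_index, chunk_offset = divmod(offset, chunk_size)
    # Step 1: ask the master where the chunk lives (metadata only).
    handle, server = master.lookup(path, chunk_index)
    # Step 2: fetch the data directly from the chunk server.
    return server.read(handle, chunk_offset, length)
```

For experimentation, `chunk_size` can be set to a small value such as 4 so that a short byte string spans several chunks.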


GFS copies every chunk multiple times and stores the copies on different chunk servers. Each copy is called a
replica. By default, GFS makes three replicas per chunk, but users can change the setting and make more or
fewer replicas as desired.

VI. COMPARISON

Comparing GFS with other distributed file systems such as the Sun Network File System (NFS), the Andrew File
System (AFS), and the Hadoop File System (HDFS):

GFS                        | NFS                              | AFS                        | HDFS
Cluster-based architecture | Client-server based architecture | Cluster-based architecture | Cluster-based architecture
No caching                 | Client and server caching        | Client caching             | No caching
Not similar to UNIX        | Similar to UNIX                  | Similar to UNIX            | Not similar to UNIX
End users do not interact  | End users interact               | End users interact         | End users interact
Server replication         | No replication                   | Server replication         | Server replication

VII. PROS AND CONS

Pros:
1) Very high availability and fault tolerance through replication: a) chunk and master replication and b)
chunk and master recovery.
2) Simple and efficient centralized design with a single master. It delivers good performance for what it was
designed for, i.e. large sequential reads.
3) Concurrent writes to the same file region are not serializable. Replicas might therefore contain duplicates,
but records are never interleaved. To ensure data integrity, each chunk server verifies the integrity of its
own copy using checksums.
4) Read operations span at least a few 64 KB blocks, so the relative cost of checksumming is small.
5) Batch operations, such as writing to the operation log and garbage collection, help increase the bandwidth.
6) Atomic append operations ensure that no synchronization is needed at the client end.
7) The absence of caching eliminates cache coherence issues.
8) Decoupling the flow of data from the flow of control allows the network to be used efficiently.
9) Orphaned chunks are automatically reclaimed by garbage collection.
10) The GFS master constantly monitors each chunk server through continuous messages.

Cons:
1) The special-purpose design is a limitation when GFS is applied to general-purpose workloads.
2) Inefficient for small files:
i) Small files have a small number of chunks, so the chunk servers storing them can become hot spots when
many clients request them.
ii) If there are many such small files, master involvement increases and can become a problem. Thus, the
single master node can become an issue.
3) Slow garbage collection can become a problem when files are not static. If there are many deletions, the
delay in reclaiming storage can cause trouble.
4) Since a relaxed consistency model is used, clients have to perform consistency checks on their own.
5) Performance can degrade when there are many writers and many random writes.
6) Master memory is a limitation.
7) The whole system is tailored to the workloads present at Google. GFS and the applications are adjusted and
tuned together as necessary, since both are controlled by Google.
8) No reason is given for the choice of the standard chunk size (64 MB).

Future relevance: GFS is good for the application it was designed for, i.e. sequential reads of large files by
data-parallel workloads. Since HDFS has become something of an industry standard for storing large amounts of
data, it is increasingly being used for other types of workloads. HBase is one example of this (a more
database-like column store), which performs far more random I/O.

The GFS node cluster consists of a single master with multiple chunk servers that are continuously accessed by
different client systems. Chunk servers store data as Linux files on local disks. Stored data is divided into
large chunks (64 MB), which are replicated in the network a minimum of three times. The large chunk size
reduces network overhead.
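A short calculation shows why a large chunk size keeps the amount of bookkeeping small. The sketch below is illustrative only: it takes the 64 MB chunk size and the figure of 64 bytes of master metadata per chunk from the text, treats "64 MB" as 64 × 2^20 bytes, and compares against a hypothetical 4 KB block size that is not from the paper.

```python
CHUNK_SIZE = 64 * 2**20    # 64 MB chunk size, from the text
METADATA_PER_CHUNK = 64    # bytes of master metadata per chunk, from section 3.3


def chunk_count(file_size, chunk_size=CHUNK_SIZE):
    """Number of chunks needed to hold file_size bytes (ceiling division)."""
    return -(-file_size // chunk_size)


def master_metadata_bytes(file_size, chunk_size=CHUNK_SIZE):
    """Master memory needed to describe a file of file_size bytes."""
    return chunk_count(file_size, chunk_size) * METADATA_PER_CHUNK


one_tb = 2**40
print(chunk_count(one_tb))             # 16384 chunks of 64 MB
print(master_metadata_bytes(one_tb))   # 1048576 bytes, i.e. 1 MB of metadata
print(chunk_count(one_tb, 4 * 2**10))  # 268435456 blocks at a hypothetical 4 KB
```

With 64 MB chunks, a client reading 1 TB sequentially needs at most 16,384 chunk lookups, whereas the hypothetical 4 KB block size would require over 268 million.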


GFS is designed to accommodate Google's large cluster requirements without burdening applications. Files are
stored in hierarchical directories and identified by path names. Metadata, such as the namespace, access
control data, and mapping information, is controlled by the master, which interacts with each chunk server and
monitors its status through timed heartbeat messages. A more efficient file system must be designed that
overcomes the shortcomings of the current GFS.
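The heartbeat monitoring mentioned above can be sketched as follows. This is a simplified illustration rather than the GFS protocol: the `HeartbeatMonitor` class, its method names, and the 60-second timeout in the usage note are all assumptions made for the example, and timestamps are passed in explicitly as seconds.

```python
class HeartbeatMonitor:
    """Tracks when each chunk server last reported in (hypothetical helper)."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # server id -> time of the most recent heartbeat

    def heartbeat(self, server_id, now):
        # Called whenever a heartbeat message arrives from a chunk server.
        self.last_seen[server_id] = now

    def dead_servers(self, now):
        # A server is presumed failed if it has been silent longer than the timeout.
        return sorted(server for server, seen in self.last_seen.items()
                      if now - seen > self.timeout)
```

For example, with a 60-second timeout, a server last heard from at time 0 is reported by `dead_servers(70)`, while one heard from at time 50 is not. The master could then re-replicate the chunks held by the silent server from the remaining replicas.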

