Google Case Study
IT332 – Distributed Systems
Google Company
Google, a US-based corporation, grew out of a research
project at Stanford; the company was launched in 1998.
Offers Internet search and broader web applications.
Earns revenue largely from advertising associated with such
services.
Google Distributed System: Design Strategy
Google has diversified: as well as providing a search engine,
it is now a major player in cloud computing.
It handled 88 billion queries a month by the end of 2010, and a
user can expect a query result in 0.2 seconds.
Delivers good scalability, reliability, performance and
openness.
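To put the query figures in perspective, the sketch below converts 88 billion queries a month into an average rate and, using Little's law, estimates how many queries are being served at any instant. The 30-day month and the function names are assumptions made for illustration; real traffic is bursty, so peak load would be higher than this average.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month


def average_qps(queries_per_month, seconds_per_month=SECONDS_PER_MONTH):
    """Average queries per second over the month."""
    return queries_per_month / seconds_per_month


def queries_in_flight(qps, latency_seconds):
    """Little's law: average number in the system = arrival rate * time in system."""
    return qps * latency_seconds


qps = average_qps(88e9)                 # figure from the slide
in_flight = queries_in_flight(qps, 0.2)  # 0.2 s response time from the slide
print(f"{qps:,.0f} queries/s on average, about {in_flight:,.0f} in flight")
```

The average works out to roughly 34,000 queries per second, which is why the design has to scale across many machines rather than rely on a single large server.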
Google Search Engine
Consists of a set of services:
Crawling:
Locates and retrieves the contents of the web and passes them on to the
indexing subsystem. Performed by software called Googlebot.
Indexing:
Produces an index for the contents of the web, similar to the index at the
back of a book but on a much larger scale.
Ranking:
Determines the relevance of the retrieved links. The ranking algorithm is called
PageRank: a page is viewed as important if it is linked to by a large number of other pages.
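The ranking idea can be sketched with a simplified PageRank computed by power iteration. This is a teaching sketch, not Google's production algorithm: the damping factor of 0.85, the iteration count, and the tiny four-page web are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical web: pages a, b and d all link to c; c links back to a.
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page c ends up with the highest score because three other pages link to it, which matches the intuition stated above.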
Google as a cloud provider
A set of Internet-based application, storage and computing
services sufficient to support most users' needs.
Software as a service:
Offers application-level software over the Internet as web applications.
Ex: Gmail, Google Docs, Google Talk and Google Calendar. Aims to
replace traditional office suites. (More examples in the following table.)
Platform as a service:
Offers distributed system APIs and services across the Internet, with
these APIs used to support the development and hosting of web
applications.
Ex: Google App Engine
Example Google applications
Physical Model of a Google DS
Levels: commodity PC, rack, cluster, data center
Rack:
Approx 40 to 80 PCs
One Ethernet switch (internal = 100 Mbps, external = 1 Gbps)
Cluster:
Approx 30 racks (around 2400 PCs)
2 high-bandwidth switches (each rack connected to both switches for redundancy)
Placement and replication generally done at cluster level
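The cluster size quoted above follows from simple multiplication. The short sketch below (names are illustrative) shows that "around 2400 PCs" corresponds to 30 racks filled to the upper bound of 80 PCs each, and that the lower bound of 40 PCs per rack would give half that.

```python
RACKS_PER_CLUSTER = 30          # "approx 30 racks" from the slide
PCS_PER_RACK_RANGE = (40, 80)   # "approx 40 to 80 PCs" from the slide


def cluster_size(racks, pcs_per_rack):
    """Number of PCs in a cluster with the given rack count and rack density."""
    return racks * pcs_per_rack


smallest = cluster_size(RACKS_PER_CLUSTER, PCS_PER_RACK_RANGE[0])
largest = cluster_size(RACKS_PER_CLUSTER, PCS_PER_RACK_RANGE[1])
print(f"A cluster holds between {smallest} and {largest} PCs")
```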
Key Requirements
Scalability: (i) deal with more data, (ii) deal with more queries and (iii)
seek better results.
Reliability: There is a need to provide 24/7 availability. Google offers a 99.9%
service level agreement to paying customers of Google Apps, covering
Gmail, Google Calendar, Google Docs, Google Sites and Google Talk.
Performance: Low latency of user interaction, achieving the throughput to
respond to all incoming requests while dealing with very large datasets
over the network.
Openness: Core services and applications should be open to allow
innovation and new applications.
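To make the 99.9% figure concrete, the sketch below converts an availability target into the downtime it permits. The 30-day month and 365-day year are assumptions for illustration; how the actual agreement measures downtime is not covered here.

```python
def allowed_downtime_minutes(availability, period_days):
    """Downtime budget, in minutes, for a given availability over a period."""
    total_minutes = period_days * 24 * 60
    return (1.0 - availability) * total_minutes


per_month = allowed_downtime_minutes(0.999, 30)
per_year = allowed_downtime_minutes(0.999, 365)
print(f"99.9% allows about {per_month:.1f} minutes per month "
      f"or {per_year / 60:.2f} hours per year")
```

A 99.9% target therefore leaves well under an hour of downtime per month, which is why replication and redundancy appear throughout the design.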
The overall Google systems architecture
Google infrastructure
Google Infrastructure
Communication paradigms: services for both remote invocation and indirect
communication.
• Protocol buffers offer a common serialization format, including the serialization of requests and
replies in remote invocation.
• Publish-subscribe supports the efficient dissemination of events to large numbers of subscribers.
Data and coordination services: unstructured and semi-structured abstractions for the
storage of data, coupled with services to support access to the data.
• GFS offers a distributed file system optimized for Google applications and services, including large file storage.
• Chubby supports coordination services and the ability to store small volumes of data.
• Bigtable provides a distributed database offering access to semi-structured data.
Distributed computation services: means for carrying out parallel and distributed
computation over the physical infrastructure.
• MapReduce supports distributed computation over potentially very large datasets, for example those stored in
Bigtable.
• Sawzall provides a higher-level language for the execution of such distributed computations.
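As an illustration of indirect communication, here is a minimal in-process sketch of topic-based publish-subscribe. It is not Google's publish-subscribe system: the `Broker` class, the topic name and the event contents are all hypothetical, and a real system would deliver events across the network to very many subscribers.

```python
from collections import defaultdict


class Broker:
    """Hypothetical in-memory broker: publishers and subscribers only share topic names."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        """Deliver the event to every subscriber of the topic; return how many received it."""
        delivered = 0
        for callback in self._subscribers.get(topic, []):
            callback(event)
            delivered += 1
        return delivered


broker = Broker()
inbox_a, inbox_b = [], []
broker.subscribe("crawl.done", inbox_a.append)
broker.subscribe("crawl.done", inbox_b.append)
delivered = broker.publish("crawl.done", {"url": "http://example.com/"})
print(delivered, inbox_a, inbox_b)
```

The publisher never names its receivers, which is the decoupling that distinguishes indirect communication from remote invocation.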
Summary of design choices related to communication paradigms - part 1
Summary of design choices related to communication paradigms - part 2
Google File System
Companies like Amazon and Google offer services to Web clients,
resulting in reads and updates to a massive number of files
distributed across literally tens of thousands of computers.
To address this problem, Google has developed its own Google File
System (GFS).
GFS offers abstractions similar to those of a conventional file system
but is specialized for the storage of, and access to, very large
quantities of data (not a huge number of files, but each file is
massive: 100 MB or 1 GB).
It is also optimized for sequential reads and sequential writes, as
opposed to random reads and writes.
GFS Architecture
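[Figure: GFS architecture. The GFS client sends (file name, chunk index) to the master and receives a contact address. The client then sends (chunk id, range) to a chunk server and receives chunk data. The master sends instructions to the chunk servers and collects chunk-server state. Each chunk server stores its chunks on a Linux file system.]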
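The read path in the architecture figure can be sketched as follows. This is a single-process illustration, not GFS code: the class names, server addresses and file name are hypothetical, the 64 MB default chunk size comes from the published GFS design rather than from this slide, and reads that span two chunks are deliberately left out.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # assumption: 64 MB, as in the published GFS design


class Master:
    """Metadata only: maps (file name, chunk index) to (chunk id, server address)."""

    def __init__(self):
        self._table = {}

    def register(self, file_name, chunk_index, chunk_id, address):
        self._table[(file_name, chunk_index)] = (chunk_id, address)

    def locate(self, file_name, chunk_index):
        return self._table[(file_name, chunk_index)]


class ChunkServer:
    """Stores chunk data; a dict stands in for the local Linux file system."""

    def __init__(self):
        self._chunks = {}

    def store(self, chunk_id, data):
        self._chunks[chunk_id] = data

    def read(self, chunk_id, start, length):
        return self._chunks[chunk_id][start:start + length]


class Client:
    def __init__(self, master, servers, chunk_size=CHUNK_SIZE):
        self.master = master
        self.servers = servers  # address -> ChunkServer
        self.chunk_size = chunk_size

    def read(self, file_name, offset, length):
        # Simplification: the requested range must lie within one chunk.
        chunk_index, start = divmod(offset, self.chunk_size)
        # Step 1: ask the master where the chunk lives (metadata only).
        chunk_id, address = self.master.locate(file_name, chunk_index)
        # Step 2: fetch the bytes directly from the chunk server.
        return self.servers[address].read(chunk_id, start, length)


# Tiny 4-byte chunks keep the demo readable.
master = Master()
servers = {"cs-1": ChunkServer(), "cs-2": ChunkServer()}
servers["cs-1"].store("chunk-0", b"abcd")
servers["cs-2"].store("chunk-1", b"efgh")
master.register("crawl.log", 0, "chunk-0", "cs-1")
master.register("crawl.log", 1, "chunk-1", "cs-2")

client = Client(master, servers, chunk_size=4)
data = client.read("crawl.log", offset=5, length=2)
print(data)
```

The point of the split is that file data never flows through the master, so the master does not become a bandwidth bottleneck.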
Chubby API
Four distinct capabilities:
1. Distributed locks to synchronize
distributed activities in a large-scale
asynchronous environment.
2. A file system offering reliable storage of
small files, complementing the service
offered by GFS.
3. Support for the election of a primary in a
set of replicas.
4. Use as a name service within Google.
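The capabilities listed above combine naturally: replicas compete for a lock on a small file, the winner writes its identity into that file, and other processes read the file to find the primary. The sketch below illustrates this pattern with a hypothetical in-memory `LockService`; it is not the Chubby API, it runs in one process with no replication or failure handling, and the path and replica names are invented.

```python
class LockService:
    """Hypothetical in-memory stand-in for a lock service that also stores small files."""

    def __init__(self):
        self._owners = {}  # path -> current lock holder
        self._files = {}   # path -> small file contents

    def try_acquire(self, path, holder):
        """Take the lock if it is free; never blocks."""
        if path in self._owners:
            return False
        self._owners[path] = holder
        return True

    def release(self, path, holder):
        if self._owners.get(path) == holder:
            del self._owners[path]

    def write(self, path, holder, contents):
        if self._owners.get(path) != holder:
            raise PermissionError(f"{holder} does not hold the lock on {path}")
        self._files[path] = contents

    def read(self, path):
        return self._files.get(path)


def elect_primary(service, path, replicas):
    """Each replica tries the lock; the winner records its name in the file."""
    for replica in replicas:
        if service.try_acquire(path, replica):
            service.write(path, replica, replica)
    return service.read(path)


service = LockService()
path = "/ls/example-cell/service/primary"  # hypothetical name
primary = elect_primary(service, path, ["replica-1", "replica-2", "replica-3"])
print("primary is", primary)
```

Reading the file back is the name-service role in miniature: clients look up a well-known path to discover which replica to contact.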
Overall architecture of Chubby
Overall architecture of Bigtable
• A Bigtable is broken up into tablets, with a given tablet being approximately 100 to
200 megabytes in size. It uses both GFS and Chubby for data storage and distributed
coordination.
• Three major components:
• A library component on the client side
• A master server
• A potentially large number of tablet servers
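One way to picture how the client library finds the right tablet server is a sorted directory of row-key ranges searched with binary search. This is a simplified sketch under stated assumptions: the `TabletDirectory` class, the end keys and the server names are hypothetical, and the real lookup mechanism is not described on this slide.

```python
import bisect


class TabletDirectory:
    """Hypothetical directory: each tablet covers row keys up to and including its end key."""

    def __init__(self, tablets):
        # tablets: list of (end_row_key, tablet_server), sorted by end_row_key
        self._end_keys = [end for end, _ in tablets]
        self._servers = [server for _, server in tablets]

    def locate(self, row_key):
        """Return the tablet server responsible for row_key."""
        i = bisect.bisect_left(self._end_keys, row_key)
        if i == len(self._end_keys):
            raise KeyError(row_key)
        return self._servers[i]


directory = TabletDirectory([
    ("g", "tablet-server-1"),       # rows up to "g"
    ("p", "tablet-server-2"),       # rows after "g" up to "p"
    ("\uffff", "tablet-server-3"),  # all remaining rows
])
for key in ["com.example", "org.wikipedia", "www.stanford.edu"]:
    print(key, "->", directory.locate(key))
```

Because rows are kept in key order, splitting a tablet that has grown too large only requires adding one more end key to the directory.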
The storage architecture in Bigtable
Summary of design choices related to data storage and coordination
Distributed Computation Services
The Google infrastructure supports distributed computation
through the MapReduce service and the higher-level Sawzall
language.
MapReduce
Google reimplemented its main production indexing system using
MapReduce in 2003, reducing the number of lines of C++ code
from 3,800 to 700: a significant reduction, albeit in a relatively
small system.
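The programming model behind that reduction can be sketched in a few lines. The example below is a single-process word count, a common teaching example rather than Google's indexing code: the function names and the two sample documents are assumptions, and a real MapReduce run would distribute the map and reduce tasks across many machines.

```python
from collections import defaultdict


def map_phase(documents, map_fn):
    """Apply map_fn to every (key, value) input and stream out intermediate pairs."""
    for doc_id, text in documents.items():
        yield from map_fn(doc_id, text)


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn to each key and its list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# The programmer supplies only these two functions.
def word_count_map(doc_id, text):
    for word in text.lower().split():
        yield word, 1


def word_count_reduce(word, counts):
    return sum(counts)


documents = {"d1": "the web is large", "d2": "the index of the web"}
counts = reduce_phase(shuffle(map_phase(documents, word_count_map)), word_count_reduce)
print(counts)
```

Only the two small user-supplied functions are specific to word counting; partitioning, grouping and (in a real deployment) fault tolerance belong to the framework, which is where the savings in application code come from.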
Examples of the use of MapReduce
References
George F. Coulouris and Jean Dollimore. 2012. Distributed
Systems: Concepts and Design. Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA.