
Overview of BigTable and Cloud Services

BigTable is a distributed storage system developed at Google for managing large amounts of structured data. It stores data as multidimensional sorted maps and provides real-time read/write access to petabytes of data across thousands of commodity servers. BigTable's data is distributed across many machines and it uses Google File System for storage. It was inspired by Google's need to manage user data for services like Search, Analytics, Maps and Gmail.

Uploaded by

sharath_rakki

BigTable

• A BigTable table may be petabytes in size and distributed among
tens to thousands of machines. It is designed for storing items such as
billions of URLs, with many versions per page; over 100 TB of satellite
image data; and data for hundreds of millions of users, while serving
thousands of queries a second.
• BigTable was developed at Google and has been in use since 2005 in
dozens of Google services. An open source version, HBase, was
created by the Apache project on top of the Hadoop core. Apache
Cassandra, first developed at Facebook to power its inbox search,
is similar to BigTable but has a tunable consistency model and no master.
• BigTable is designed with semi-structured data storage in mind. It is a
large map that is indexed by a row key, column key, and a timestamp.
Each value within the map is an array of bytes that is interpreted by
the application. Every read or write of data to a row is atomic,
regardless of how many different columns are read or written within
that row.
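The map described above can be sketched as a dictionary keyed by (row key, column key, timestamp). This is only a toy in-memory model, not BigTable's client API; the class name `TinyTable`, the row key, and the column names are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the data model: (row key, column key, timestamp) -> bytes.
Key = Tuple[str, str, int]


class TinyTable:
    """Toy in-memory map; it models the indexing only, not storage or distribution."""

    def __init__(self) -> None:
        self._cells: Dict[Key, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        # Values are opaque byte strings; the application decides how to interpret them.
        self._cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str, timestamp: int) -> Optional[bytes]:
        # Exact-match lookup on all three parts of the key.
        return self._cells.get((row, column, timestamp))


table = TinyTable()
table.put("com.example.www", "contents:html", 3, b"<html>v3</html>")
table.put("com.example.www", "anchor:home", 3, b"Example")
```

Because only the cells that were written exist in the dictionary, rows are free to use different columns, which is what makes the table sparse.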
Characteristics of BigTable:
• Map: A map is a data structure that allows one to quickly look up the value
associated with a key. BigTable is a collection of (key, value) pairs where the
key identifies a row and the value is the set of columns.
• Persistent: The data is stored persistently on disk.
• Distributed: BigTable's data is distributed among many independent machines.
At Google, BigTable is built on top of GFS (Google File System). The Apache open
source version of BigTable, HBase, is built on top of HDFS (Hadoop Distributed File
System) or Amazon S3.
• Sparse: The table is sparse, meaning that different rows in a table may use
different columns, with many of the columns empty for a particular row.
• Sorted: Most associative arrays are not sorted. A key is hashed to a position in a
table. BigTable sorts its data by keys. This helps keep related data close together,
usually on the same machine. For example, if domain names are used as keys in a
BigTable
[Link]
[Link]
[Link]
• Multidimensional: A table is indexed by rows. Each row contains one or more
named column families. Column families are defined when the table is first
created. Within a column family, one may have one or more named columns. All
data within a column family is usually of the same type. Columns within a column
family can be created on the fly. Rows, column families and columns provide a
three-level naming hierarchy in identifying data. For example:
• Time-based: Time is another dimension in BigTable data.
Every column family may keep multiple versions of column
family data. If an application does not specify a timestamp, it
will retrieve the latest version of the column family.
Alternatively, it can specify a timestamp and get the latest
version that is earlier than or equal to that timestamp.
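The "Sorted" property above can be illustrated with a short sketch. Reversing the components of a domain name is an assumption made here (it is a common row-key convention), and the `example.com` hosts and helper names are hypothetical. The point is that, once keys are sorted, related rows sit next to each other and a prefix scan touches one contiguous range.

```python
import bisect
from typing import List


def reverse_domain(host: str) -> str:
    """Turn 'www.example.com' into 'com.example.www' so related hosts sort together."""
    return ".".join(reversed(host.split(".")))


def scan_prefix(sorted_keys: List[str], prefix: str) -> List[str]:
    """Return the contiguous slice of keys that start with the given prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    # '\uffff' sorts after any character used in these keys, so it bounds the range.
    end = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:end]


hosts = ["www.example.com", "mail.example.com", "www.example.org", "maps.example.com"]
row_keys = sorted(reverse_domain(h) for h in hosts)
example_com_rows = scan_prefix(row_keys, "com.example.")
```

With unreversed keys, `mail.example.com` and `www.example.com` would be separated by every other host starting with letters between "m" and "w".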
BigTable: Columns and column families
BigTable: Rows and partitioning
• A table is logically split along row boundaries into
multiple sub-tables called tablets. A tablet is a set of
consecutive rows of a table and is the unit of
distribution and load balancing within BigTable.
Because the table is always sorted by row key, reads
of short row ranges are efficient: they typically
involve only a small number of machines.
Hence, choosing row keys carefully is the key to
achieving a high degree of locality.
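A minimal sketch of the idea, assuming tablets are fixed-size slices of the sorted rows (real tablets are split by size and load, which this toy ignores). The function names and the `row000`-style keys are hypothetical.

```python
import bisect
from typing import List


def split_into_tablets(sorted_rows: List[str], rows_per_tablet: int) -> List[List[str]]:
    """Cut the sorted row keys into consecutive chunks, one chunk per tablet."""
    return [
        sorted_rows[i:i + rows_per_tablet]
        for i in range(0, len(sorted_rows), rows_per_tablet)
    ]


def find_tablet(end_keys: List[str], row_key: str) -> int:
    """Index of the first tablet whose last row key is >= row_key.

    Keys beyond the final end key are assigned to the last tablet.
    """
    return min(bisect.bisect_left(end_keys, row_key), len(end_keys) - 1)


rows = sorted(f"row{n:03d}" for n in range(10))
tablets = split_into_tablets(rows, 4)
end_keys = [tablet[-1] for tablet in tablets]

# A short range scan maps to very few tablets because neighbouring keys live together.
touched = {find_tablet(end_keys, key) for key in ("row004", "row005", "row006")}
```

Here the three-row scan lands entirely in one tablet, which is the locality benefit that good row-key design aims for.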
BigTable: Timestamps
• Each cell in a column family can contain multiple versions of
content. For example, in the earlier example, we may have
several timestamped versions of page contents associated with
a URL. Each version is identified by a 64-bit timestamp that
either represents real time or is a value assigned by the client.
Reading column data retrieves the most recent version if no
timestamp is specified, or otherwise the latest version that is
no later than the specified timestamp.
• A table is configured with per-column-family settings for
garbage collection of old versions. A column family can be
defined to keep only the latest n versions or to keep only the
versions written since some time t.
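The versioning and garbage-collection rules above can be sketched for a single cell. This is an illustrative model only: the class `VersionedCell` and its method names are hypothetical, and the timestamp bound on reads is treated as inclusive, which is an assumption.

```python
from typing import List, Optional, Tuple


class VersionedCell:
    """Toy model of one cell holding several timestamped versions."""

    def __init__(self) -> None:
        self._versions: List[Tuple[int, bytes]] = []  # kept sorted by timestamp

    def write(self, timestamp: int, value: bytes) -> None:
        self._versions.append((timestamp, value))
        self._versions.sort(key=lambda version: version[0])

    def read(self, at: Optional[int] = None) -> Optional[bytes]:
        """Latest version overall, or the latest with timestamp <= at."""
        candidates = [v for v in self._versions if at is None or v[0] <= at]
        return candidates[-1][1] if candidates else None

    def gc_keep_latest(self, n: int) -> None:
        """Policy 1: keep only the latest n versions."""
        self._versions = self._versions[-n:] if n > 0 else []

    def gc_keep_since(self, t: int) -> None:
        """Policy 2: keep only versions written at or after time t."""
        self._versions = [v for v in self._versions if v[0] >= t]


cell = VersionedCell()
cell.write(10, b"v1")
cell.write(20, b"v2")
cell.write(30, b"v3")
latest = cell.read()
as_of_25 = cell.read(at=25)
cell.gc_keep_latest(2)  # drops the version written at timestamp 10
```

After the collection step, a read as of timestamp 10 finds nothing, because the only version old enough to qualify has been discarded.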
BigTable: Chubby
• Chubby is a highly available and persistent distributed lock service
that manages leases for resources and stores configuration
information. The service runs with five active replicas, one of which
is elected as the master to serve requests. A majority of the replicas
must be running for the service to work. Paxos is used to keep the
replicas consistent. Chubby provides a namespace of files and
directories. Each file or directory can be used as a lock.
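Two of the ideas above, the majority requirement and a named file acting as a lock, can be sketched in a few lines. This is a hypothetical in-memory toy, not Chubby's interface: `ToyLockService`, the lock path, and the client names are all assumptions, and there are no leases, replication, or Paxos here. One use for such a lock is making sure that only one of several candidate processes acts as master.

```python
from typing import Dict, Optional


def has_quorum(running: int, total: int = 5) -> bool:
    """A majority of replicas must be running; with five replicas that means three."""
    return running > total // 2


class ToyLockService:
    """Hypothetical lock namespace: each path can be held by at most one client."""

    def __init__(self) -> None:
        self._holders: Dict[str, str] = {}

    def try_acquire(self, path: str, client: str) -> bool:
        # Succeeds if the lock is free, or if this client already holds it.
        if path in self._holders:
            return self._holders[path] == client
        self._holders[path] = client
        return True

    def holder(self, path: str) -> Optional[str]:
        return self._holders.get(path)


locks = ToyLockService()
first = locks.try_acquire("/bigtable/master-lock", "master-a")
second = locks.try_acquire("/bigtable/master-lock", "master-b")
```

The second candidate fails to acquire the lock, so it knows another master is already active.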

In BigTable, Chubby is used to:

• ensure there is only one active master
• store the bootstrap location of BigTable data
• discover tablet servers
• store BigTable schema information
• store access control lists
BigTable indexing hierarchy
Google Big Data services
• Search
• Analytics
• Maps
• Gmail
OpenStack
• OpenStack is a project originally started by NASA and Rackspace for delivering a
cloud computing and storage platform. Today, OpenStack is a global
collaboration of developers and technologists producing an open source cloud
computing platform for public and private clouds.
• The technology consists of a series of interrelated projects delivering various
components for a cloud infrastructure solution. OpenStack software delivers a
massively scalable cloud operating system consisting of three major
components:
  • Compute: open source software designed to provision and manage large
  networks of virtual machines, creating a redundant and scalable cloud
  computing platform (code-named "Nova").
  • Object Storage: open source software for creating redundant, scalable object
  storage using clusters of standardized servers to store petabytes of accessible
  data (code-named "Swift").
  • Image Service: provides discovery, registration, and delivery services for virtual
  disk images (code-named "Glance").
• OpenStack has attracted more than 500 member organizations, including Dell,
Cisco, Citrix, HP, EMC, VMware, Red Hat, IBM and Intel, and the project is
currently managed by the non-profit OpenStack Foundation.
Microsoft Azure
• Microsoft Azure is widely considered both a
Platform as a Service (PaaS) and an Infrastructure
as a Service (IaaS) offering.
• Microsoft Azure is one of several major public
cloud service providers operating on a large
global scale. Other major providers include
Google Cloud Platform (GCP), Amazon Web
Services (AWS) and IBM.
Azure products and services
Microsoft categorizes Azure cloud services into 18 main product types, including:
• Compute -- These services enable a user to deploy and manage virtual
machines (VMs), containers and batch processing, as well as support
remote application access.
• Web -- These services support the development and deployment of web
applications, and also offer features for search, content delivery,
application programming interface (API) management, notification and
reporting.
• Data storage -- This category of services provides scalable cloud storage
for structured and unstructured data and also supports big data projects,
persistent storage (for containers) and archival storage.
• Analytics -- These services provide distributed analytics and storage, as
well as features for real-time analytics, big data analytics, data lakes,
machine learning, business intelligence (BI), internet of things (IoT) data
streams and data warehousing.
• Networking -- This group includes virtual networks, dedicated
connections and gateways, as well as services for traffic management and
diagnostics, load balancing, domain name system (DNS) hosting, and
network protection against distributed denial-of-service (DDoS) attacks.
• Media and content delivery network (CDN) -- These services include on-
demand streaming, digital rights protection, encoding and media
playback and indexing.
• Hybrid integration -- These are services for server backup, site recovery
and connecting private and public clouds.
• Identity and access management (IAM) -- These offerings ensure only
authorized users can access Azure services, and help protect encryption
keys and other sensitive information in the cloud. Services include
support for Azure Active Directory and multifactor authentication (MFA).
• Internet of things -- These services help users capture, monitor and
analyze IoT data from sensors and other devices. Services include
notifications, analytics, monitoring and support for coding and execution.
• Development -- These services help application developers share code,
test applications and track potential issues. Azure supports a range of
application programming languages, including JavaScript, Python, .NET
and [Link].
• Security -- These products provide capabilities to identify and respond to
cloud security threats, as well as manage encryption keys and other
sensitive assets.
• Artificial intelligence (AI) and machine learning -- This is a wide range of
services that a developer can use to infuse machine learning, AI and
cognitive computing capabilities into applications and data sets.
• Containers -- These services help an enterprise to create, register, arrange
and manage huge volumes of containers in the Azure cloud, using
common platforms such as Docker and Kubernetes.
• Databases -- This category includes Database as a Service (DBaaS)
offerings for SQL and NoSQL, as well as other database instances, such as
Azure Cosmos DB and Azure Database for PostgreSQL. It also includes SQL
Data Warehouse support, caching, and hybrid database integration and
migration features.
• Migration -- This suite of tools helps an organization estimate
workload migration costs, and perform the actual migration of
workloads from local data centers to the Azure cloud.
• Mobile -- These products help a developer build cloud applications
for mobile devices, providing notification services, support for back-
end tasks, tools for building APIs and the ability to couple geospatial
(location) context with data.
• Management -- These services provide a range of backup, recovery,
compliance, automation, scheduling and monitoring tools that can
help a cloud administrator manage an Azure deployment.
Integrating Data Sources
• A primary purpose of data integration is to present the data in
new and unique ways, in order to gain new insights and, in
business, new advantages. Recognizing the needs of the organization
prior to “organizing” the data is useful in a broad range of Big
Data projects, including business and scientific research. Big
Data Integration combines traditional data, social media data,
data from the Internet of Things (IoT), and transactional data. Data
that is not compatible, or has not been
translated/transformed, is essentially useless for such projects.
• Organizations use Master Data Management (MDM) systems to promote
the collection, aggregation, consolidation, and delivery of reliable
data throughout the organization. Additionally, new tools, such as
Scribe and Sqoop, are being used to support the integration of
Big Data.
• Managing “integrated” Big Data gives more
confidence in decision-making and provides
superior insights. The process of integrating
huge data sets can be quite complicated and
can present several challenges. Some
challenges faced during the integration
process include: uncertainty of data
management, syncing across data sources,
finding insights, and skill availability.
Big Data Integration Tools
• As “traditional” tools for data integration continue to evolve, they
should be re-evaluated for their abilities to process the ever-
increasing variety of unstructured data, as well as the growing
volume of Big Data. Integration technologies must have a
common platform to support Data Quality and profiling.
• In traditional data warehouses, ETL (extract, transform, and load)
technologies are used to organize data. Those technologies have
evolved, and continue to evolve, to work within Big Data
environments.
• When using the cloud, data can be organized using integration
Platform-as-a-Service (iPaaS). This service is generally easy to use
and can include data from Cloud-based sources, such as Software-
as-a-Service (SaaS).
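The extract, transform, and load steps mentioned above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not a production pipeline: the CSV contents, the `users` table, and the cleaning rules (drop rows without a signup date, upper-case the country code) are all hypothetical.

```python
import csv
import io
import sqlite3
from typing import Dict, List, Tuple

# Hypothetical raw export from a source system.
RAW_CSV = """user_id,signup_date,country
1,2021-03-05,us
2,2021-04-17,DE
3,,fr
"""


def extract(text: str) -> List[Dict[str, str]]:
    """Extract: parse the raw CSV text into one dictionary per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: List[Dict[str, str]]) -> List[Tuple[int, str, str]]:
    """Transform: drop incomplete rows and normalize types and casing."""
    cleaned = []
    for row in rows:
        if not row["signup_date"]:
            continue  # skip records missing a required field
        cleaned.append((int(row["user_id"]), row["signup_date"], row["country"].upper()))
    return cleaned


def load(conn: sqlite3.Connection, rows: List[Tuple[int, str, str]]) -> None:
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id INTEGER PRIMARY KEY, signup_date TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute("SELECT user_id, country FROM users ORDER BY user_id").fetchall()
```

Keeping the three stages as separate functions makes it easy to swap the source (for example, a SaaS export) or the target without touching the cleaning rules.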
The Challenges of Big Data Integration

• Finding Staff
• Bringing in the Data
• Synchronization
• Data Management Tools
• Choosing a Strategy
