MapReduce Concepts in NoSQL Databases

Uploaded by

Raghu Nayak
Copyright
© © All Rights Reserved
NOSQL Database 21CS745 Question Bank & Answers

MODULE 3

Question Bank with Answers

1 Explain with a neat diagram, the partitioning and combining in MapReduce


Parallelism with Partitioning:

• In a basic setup, the outputs of all mappers are concatenated and sent into a single
reduce function. This becomes a bottleneck as the size of the data grows.

• To increase parallelism and minimize bottlenecks, we partition the output of the
mappers. Each reducer operates on the subset of data belonging to its own set of keys,
and all values for a given key go to the same reducer.
This allows multiple reduce tasks to run in parallel, speeding up the process.

• In this setup, the key-value pairs are grouped into partitions based on the key. These
partitions are then shuffled and distributed to the corresponding reducers. Multiple
reducers work on different partitions in parallel, and the results are merged at the end.
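
The partition, shuffle, and merge steps above can be sketched in a few lines of Python. This is an illustration only, not how any particular framework does it: the function names, the CRC32-modulo partitioner, and the sample tea products are all assumptions made for the example.

```python
import zlib
from collections import defaultdict

def partition_for(key, num_reducers):
    # Deterministic hash, so the same key always lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    # One dict per partition: key -> list of values gathered from all mappers.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            partitions[partition_for(key, num_reducers)][key].append(value)
    return partitions

def reduce_partition(partition):
    # Each reducer only sees the keys in its own partition.
    return {key: sum(values) for key, values in partition.items()}

# Hypothetical (product, quantity) pairs emitted by two mappers.
mapper_outputs = [
    [("puerh", 3), ("dragonwell", 5)],
    [("puerh", 2), ("genmaicha", 1)],
]
partitions = shuffle(mapper_outputs, num_reducers=2)

# The reducers could run in parallel; their results are merged at the end.
merged = {}
for partition in partitions:
    merged.update(reduce_partition(partition))
```

Because a key never appears in two partitions, merging the reducer outputs is a plain union with no further aggregation.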

Data Transfer Reduction with Combining:

• A significant issue in map-reduce jobs is the amount of data being transferred
between the map and reduce phases. Much of the data consists of repeated key-value
pairs for the same key.

• The solution to this is a combiner function, which processes the data on the map side
before it is transferred to the reducers. The combiner aggregates values for the same
key, reducing the amount of data transferred. This helps cut down on network
overhead.

• A combiner function is essentially a mini-reduce function. In many cases, the
combiner function can be the same as the reducer function, but with a constraint: the
output of the reduce function must have the same form as its input, so that its results
can be fed into another reduce step. Reduce functions with this property are called
combinable reducers.
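
A small sketch of a combinable reducer, assuming (product, quantity) pairs as the data. The function and variable names are made up for the example. The same summing function serves as combiner and reducer because its output has the same shape as its input.

```python
from collections import defaultdict

def combine(pairs):
    # Sum the values per key. Input and output are both (key, quantity) pairs.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

# The reducer has the same shape, so the same function can be reused.
reduce_pairs = combine

# Hypothetical output of two mappers.
mapper_a = [("puerh", 1), ("puerh", 4), ("dragonwell", 2)]
mapper_b = [("puerh", 3), ("dragonwell", 2), ("dragonwell", 1)]

without_combiner = mapper_a + mapper_b                 # 6 pairs cross the network
with_combiner = combine(mapper_a) + combine(mapper_b)  # 4 pairs cross the network

result_plain = dict(reduce_pairs(without_combiner))
result_combined = dict(reduce_pairs(with_combiner))
```

Both paths give the same totals; the combined path simply ships fewer pairs to the reducer.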

Non-Combinable Reducers:

• Some reduce functions cannot be used as combiners. For instance, a reduce function
that counts unique customers for a product might not be combinable. This is because
the output of such a reduce function (the total count) differs from the input (individual
product-customer pairs).

• In such cases, a different approach is used, such as eliminating duplicates before they
reach the reducer, but this doesn’t combine the data in the same way as a combiner
would.
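
To see why the unique-customer count is not combinable, here is a minimal sketch with invented sample data. Applying the counting reducer on each mapper and adding the counts double-counts a customer who appears on both, whereas map-side de-duplication keeps the pair shape and stays correct.

```python
def count_unique_customers(pairs):
    # Reducer: (product, customer) pairs in, {product: count} out.
    customers = {}
    for product, customer in pairs:
        customers.setdefault(product, set()).add(customer)
    return {product: len(names) for product, names in customers.items()}

def dedupe(pairs):
    # Map-side step: drops repeats but keeps the (product, customer) shape.
    return sorted(set(pairs))

# Hypothetical pairs from two mappers; "ann" appears on both.
mapper_a = [("puerh", "ann"), ("puerh", "ann"), ("puerh", "bob")]
mapper_b = [("puerh", "ann"), ("puerh", "cy")]

# Correct: de-duplicate locally, then count once in the reducer.
correct = count_unique_customers(dedupe(mapper_a) + dedupe(mapper_b))

# Wrong: treating the reducer as a combiner and adding per-mapper counts.
wrong = sum(count_unique_customers(m)["puerh"] for m in (mapper_a, mapper_b))
```

Here `correct` reports 3 customers for puerh, while `wrong` reports 4.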

Koustav Biswas. Dept. Of CSE, DSATM

Combining Across Mappers:

• When using combinable reducers, not only can the map-reduce job run in parallel on
different partitions, but combining can occur across nodes as well. This flexibility
allows for earlier combining before all the mappers have completed, and even allows
some data combining to happen on the map side before it’s sent over the network.

Framework Considerations:

• Some map-reduce frameworks require all reducers to be combinable, which
maximizes flexibility by allowing parallel and serial reductions. If a non-combinable
reducer is necessary, it’s typically handled by breaking the processing into pipelined
map-reduce steps.


2 Explain two stages Map reduce example, with neat diagram


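A complex calculation can be broken into several map-reduce stages, with the output of one
stage serving as the input to the next. This "pipes-and-filters" model is beneficial when
processing tasks involve multiple phases, each of which can build upon the output of the
previous stage.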

Stage 1: Aggregate Monthly Sales

In the first stage, the goal is to summarize sales by product and month for each year. This
stage involves:

1. Mapping: Each input record (a single sale) is mapped to a key-value pair where the
key combines the year, month, and product, and the value is the quantity sold.

2. Reducing: All records with the same key (i.e., the same product in the same month of
the same year) are aggregated, summing up quantities. This gives the total sales for
each product in each month.

Example: For each sales record, the mapper might output:

• Key: the year, month, and product combined (e.g., 2011, 12, puerh)

• Value: quantity

The reducer then aggregates these records to produce one record per product per month, such
as:

• {year: 2011, month: 12, product: puerh, quantity: 1200}.

Stage 2: Year-on-Year Comparison

In the second stage, the output from Stage 1 is processed to compare the sales of each product
in a given month with the previous year. This is achieved by:

1. Mapping: Each Stage 1 record is emitted under a key of product and month, and the
mapper notes whether it belongs to the current year (2011) or the previous year (2010).

2. Reducing: The reducer merges records for the same product and month from both
years, calculates the percentage increase or decrease, and produces a final record
showing the comparison.

Example: For the same product "puerh" in December 2011 and 2010, the reducer might
produce:

• {product: puerh, month: 12, current_quantity: 1200, prior_quantity: 1000, increase:
20%}.
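
The two stages can be chained as ordinary functions. The sketch below is a single-process illustration with invented sale records; the field names follow the example records above, while `increase_pct` and the function names are assumptions for this example.

```python
from collections import defaultdict

# Hypothetical sale records.
sales = [
    {"year": 2011, "month": 12, "product": "puerh", "quantity": 700},
    {"year": 2011, "month": 12, "product": "puerh", "quantity": 500},
    {"year": 2010, "month": 12, "product": "puerh", "quantity": 1000},
]

def stage1(records):
    """Stage 1: sum quantities per (year, month, product)."""
    totals = defaultdict(int)
    for rec in records:
        totals[(rec["year"], rec["month"], rec["product"])] += rec["quantity"]
    return [{"year": y, "month": m, "product": p, "quantity": q}
            for (y, m, p), q in totals.items()]

def stage2(monthly, current_year):
    """Stage 2: pair each product/month with the prior year and compare."""
    merged = defaultdict(lambda: {"current_quantity": 0, "prior_quantity": 0})
    for rec in monthly:
        if rec["year"] == current_year:
            field = "current_quantity"
        elif rec["year"] == current_year - 1:
            field = "prior_quantity"
        else:
            continue  # other years are not part of this comparison
        merged[(rec["product"], rec["month"])][field] += rec["quantity"]
    results = []
    for (product, month), q in merged.items():
        prior = q["prior_quantity"]
        change = None if prior == 0 else 100 * (q["current_quantity"] - prior) / prior
        results.append({"product": product, "month": month, **q, "increase_pct": change})
    return results

# The output of stage 1 is the input of stage 2.
comparison = stage2(stage1(sales), current_year=2011)
```

With these records, stage 1 yields 1200 for December 2011 and 1000 for December 2010, and stage 2 reports a 20% increase.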

Benefits of the Two-Stage Approach


• Parallelism: Each map and reduce task can be executed in parallel, making it efficient
for large datasets.

• Reusability: The intermediate data can be stored, reused, or analyzed separately.

• Cluster-Suitability: The final outputs are ideal for distributed storage, which enables
quick data access for downstream processing.

Using tools like Apache Pig or Hive on Hadoop further simplifies this model by providing
high-level abstractions for MapReduce operations. This is particularly helpful as data scales
and demands for high-volume processing increase.

Reusable Intermediate Outputs: Intermediate results from MapReduce can be stored as
materialized views, saving time and resources for future calculations.

Optimizing Query Patterns: Build materialized views based on actual queries, as
speculative reuse can be inefficient.

Language Support: Tools like Apache Pig and Hive simplify MapReduce with user-friendly
scripting and SQL-like syntax, making it easier to use with Hadoop.

Beyond NoSQL: MapReduce is useful in many data environments, not just NoSQL, and is
ideal for distributed processing on large datasets.

Cluster-Friendly: MapReduce is well-suited for handling large volumes of data across
clusters, making it a crucial tool as data processing demands grow.


3 Explain basic map reduce, with neat diagram


The MapReduce framework is a programming model designed to handle large-scale data
processing across distributed systems. It allows complex computations on large datasets by
breaking down tasks into parallelizable units, making it especially effective for handling tasks
like data aggregation and analysis.

Core Components of MapReduce

1. Map Function: The first phase of MapReduce is the map function, which processes
each data record independently. Each record, or "aggregate" in database terms, is
converted into a series of key-value pairs. For example, when processing orders that
contain line items (product IDs, quantities, and prices), the map function extracts each
product and associates it with its details (product ID as the key, quantity, and price as
values). This setup enables efficient data processing by focusing only on relevant
details for each record.

2. Parallelism and Independence: The map function processes each aggregate (order)
independently, making it highly parallelizable. Since each map operation works
without reference to others, the framework can assign these tasks across multiple
nodes in a cluster. This parallelism enables faster data processing by distributing tasks
across the system.

3. Reduce Function: The second phase, known as the reduce function, aggregates data
by combining all values associated with each unique key. The reduce function
processes collections of values with the same key—such as all orders containing a
specific product—and consolidates them into a single output. For example, if the map
phase produced several entries for a product (each detailing quantity and revenue
from different orders), the reduce function sums these values to yield total sales for
that product.

4. Framework Coordination: The MapReduce framework automatically manages data
flow between the map and reduce phases, including moving and sorting key-value
pairs and ensuring the appropriate data reaches the reduce function. This coordination
allows developers to focus on writing the map and reduce functions without needing
to handle data shuffling or parallel task management directly.
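
As a rough single-process illustration of the components above, the sketch below maps order aggregates to per-product pairs and reduces them to totals. The order layout, field names, and the `run` helper that stands in for the framework are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical order aggregates with line items.
orders = [
    {"id": 1, "items": [{"product": "puerh", "quantity": 2, "price": 10.0},
                        {"product": "dragonwell", "quantity": 1, "price": 8.0}]},
    {"id": 2, "items": [{"product": "puerh", "quantity": 3, "price": 10.0}]},
]

def map_order(order):
    # Works on one order only; emits one (product, details) pair per line item.
    for item in order["items"]:
        yield item["product"], {"quantity": item["quantity"],
                                "revenue": item["quantity"] * item["price"]}

def reduce_product(product, values):
    # Works on all values for one key and collapses them into a single record.
    return {"product": product,
            "quantity": sum(v["quantity"] for v in values),
            "revenue": sum(v["revenue"] for v in values)}

def run(all_orders):
    # Stand-in for the framework: group mapper output by key, then reduce.
    grouped = defaultdict(list)
    for order in all_orders:
        for key, value in map_order(order):
            grouped[key].append(value)
    return [reduce_product(key, grouped[key]) for key in sorted(grouped)]

totals = run(orders)
```

Only `map_order` and `reduce_product` contain application logic; the grouping inside `run` is the part a real framework would distribute across nodes.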

4 How are calculations composed in Map reduce? Explain with neat diagram
The MapReduce approach is a model designed for concurrent data processing, prioritizing
ease of parallelization over flexibility. Here’s an overview of its core principles and
limitations:

Constraints in MapReduce

• Single Aggregate per Map Task: Each map task can only work with individual
records or aggregates (e.g., single orders), meaning that processing must be designed
to operate independently on each data entry without reference to others.

• Single Key per Reduce Task: Each reduce task operates on values associated with a
specific key (e.g., one product ID), so computations must be structured around
aggregating values that share the same key.

Structuring Calculations

To use MapReduce effectively, calculations must fit within the model’s constraints. Here’s
how different calculations are handled:

1. Non-Composable Calculations (e.g., Averages):

o Calculating averages illustrates a limitation in MapReduce because averages
are not composable: you can’t merge two average values directly.

o Instead, each map task must output the total sum and count of quantities,
allowing the reduce function to combine these values. The final average is
computed from the combined sum and count, not from intermediate averages.

2. Counting Operations:

o Counts are straightforward in MapReduce. Each map task emits a count of 1
for each occurrence, and the reduce function simply sums these to get the total
count.


Example Workflows:

• In a product order analysis, each map function could output entries with a product ID
key, a count of 1, and a quantity. The reduce function then combines all entries with
the same key to produce total counts and quantities, enabling further calculations like
averages based on the combined data.
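
A short worked example, with made-up quantities, of why sums and counts compose while averages do not. The names `map_item` and `reduce_values` are hypothetical.

```python
def map_item(product, quantity):
    # Emit a count of 1 alongside the quantity so both can be summed later.
    return product, {"count": 1, "quantity": quantity}

def reduce_values(values):
    # Composable: the output has the same shape as each input value.
    return {"count": sum(v["count"] for v in values),
            "quantity": sum(v["quantity"] for v in values)}

# Hypothetical mapper output on two nodes for the same product.
node_a = [map_item("puerh", 10), map_item("puerh", 20)]
node_b = [map_item("puerh", 60)]

partial_a = reduce_values([value for _, value in node_a])
partial_b = reduce_values([value for _, value in node_b])
total = reduce_values([partial_a, partial_b])

# Correct: divide the combined sum by the combined count.
average = total["quantity"] / total["count"]

# Incorrect: averaging the per-node averages weights both nodes equally.
naive = (partial_a["quantity"] / partial_a["count"]
         + partial_b["quantity"] / partial_b["count"]) / 2
```

The combined sum is 90 over 3 items, so `average` is 30.0, while `naive` comes out at 37.5.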

5 What are key value stores? List out some popular key value databases. Explain how all
data is stored in a single bucket of a key value data store

Key-value stores are among the simplest and most high-performing types of NoSQL
databases, using a straightforward API model focused on basic operations for managing data.

Core Characteristics:

1. Basic Operations:

o Get: Retrieve the value associated with a key.

o Put: Insert or update a value for a key.

o Delete: Remove a key and its associated value.

2. Data Structure:

o The value in a key-value store is an opaque blob (binary large object),
meaning the database stores it without needing to interpret its content.

o Responsibility for understanding and managing the structure of stored data lies
entirely with the application.


3. Primary-Key Access:

o Key-value stores operate solely on primary keys, allowing efficient, direct
access to data and making these databases highly performant and scalable.
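
The three operations and the opaque-value idea can be illustrated with a toy in-memory store. This is a teaching sketch, not the API of any real product; the class name and the `user:42` key are invented for the example.

```python
import json

class KeyValueStore:
    """Toy in-memory store: values are opaque bytes looked up by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Insert or overwrite; the store never inspects the bytes.
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)

    def delete(self, key):
        # Removing a missing key is treated as a no-op here.
        self._data.pop(key, None)

store = KeyValueStore()

# The application decides the format (JSON here) and must decode it itself.
profile = {"language": "en", "timezone": "UTC"}
store.put("user:42", json.dumps(profile).encode("utf-8"))
loaded = json.loads(store.get("user:42").decode("utf-8"))

store.delete("user:42")
after_delete = store.get("user:42")
```

All access goes through the key, and the JSON encoding and decoding live entirely in application code.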

Popular Key-Value Databases:

• Riak: Uses a "bucket" structure for segmenting keys, aiding organization.

• Redis: Often referred to as a data structure server, supports complex structures like
lists, sets, and hashes, enabling more versatile use.

• Memcached, Berkeley DB, HamsterDB, Amazon DynamoDB, Project Voldemort.

Advanced Features in Key-Value Databases:

• Some stores, such as Redis, offer data structure support for lists, sets, and hashes,
allowing for a range of operations like unions and intersections.

Bucket Organization in Key-Value Stores:

• Single Bucket Approach: All data (e.g., session data, shopping carts) can be stored
within a single bucket under one key-value pair, creating a unified object. However,
this can risk key conflicts due to different data types being stored under the same
bucket.

• Separate Buckets for Data Types: By appending object names to keys (e.g.,
sessionID_userProfile) or creating specific buckets for each data type, it’s possible to
avoid key conflicts and access only the necessary object types without needing extensive
key design changes.
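
The two layouts can be contrasted with a plain dictionary standing in for a bucket. The session ID, object names, and `make_key` helper below are assumptions made for the example.

```python
# A plain dict stands in for one bucket of a key-value store.
bucket = {}

# Single-bucket approach: one value per session holds every object.
bucket["session-123"] = {
    "userProfile": {"language": "en"},
    "shoppingCart": ["puerh", "dragonwell"],
}
whole_object = bucket["session-123"]  # reading the cart also loads the profile

def make_key(session_id, object_name):
    # Appending the object name keeps different object types from colliding.
    return f"{session_id}_{object_name}"

# Namespaced keys: each object type is stored and fetched on its own.
bucket[make_key("session-123", "userProfile")] = {"language": "en"}
bucket[make_key("session-123", "shoppingCart")] = ["puerh", "dragonwell"]
cart_only = bucket[make_key("session-123", "shoppingCart")]
```

With namespaced keys the application can fetch just the cart, at the cost of one extra key per object type.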


Example of Redis Use:

• Redis supports lists, sets, and hashes, allowing it to store more structured information
like states, visit logs, or address types, making it ideal for data that requires order or
grouping.

6 What are the features of key value stores? Explain in detail

The key-value store model provides a simple and efficient approach to data management,
offering features that differ significantly from those of traditional relational databases.

1. Consistency

• Key-value stores are typically optimized for high performance, particularly in
distributed settings, using an eventually consistent model. This means that changes
made to the data may take time to propagate across all nodes, which can lead to
temporary inconsistencies. For instance, in Riak, users can choose either "last write
wins" or "multiple values returned" for handling conflicting writes; in the latter case
the client resolves the conflict.

• These consistency settings can be defined at the bucket level, where
options such as allowSiblings, nVal (replication factor), and w (write quorum)
enable control over the balance between data consistency and performance.

2. Transactions

• Transactions in key-value stores are limited or non-existent due to the lack of support
for multi-key or multi-document transactions. To manage transactional requirements,
some key-value stores, like Riak, employ a quorum model for writes and reads. By
configuring values like N (total replicas), W (write quorum), and R (read quorum),
users can achieve a level of reliability in write success and data availability.
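
The arithmetic behind N, W, and R can be made concrete with a few helper functions. These are hypothetical helpers, not Riak API calls. The R + W > N condition is the general quorum-overlap rule rather than something stated in the text above.

```python
def write_succeeds(acks, w):
    # A write is reported successful once at least W replicas acknowledge it.
    return acks >= w

def read_overlaps_write(n, r, w):
    # When R + W > N, every read quorum shares at least one replica
    # with every write quorum, so a read reaches the latest acknowledged write.
    return r + w > n

def tolerated_write_failures(n, w):
    # Number of replicas that can be down while writes still reach quorum.
    return n - w

# Example: N = 3 replicas with W = 2 and R = 2.
ok = write_succeeds(acks=2, w=2)
overlap = read_overlaps_write(n=3, r=2, w=2)
spare = tolerated_write_failures(n=3, w=2)

# Lower quorums respond faster but no longer guarantee overlap.
weak_overlap = read_overlaps_write(n=3, r=1, w=1)
```

With N = 3, choosing W = 2 and R = 2 tolerates one failed replica on writes while keeping read and write quorums overlapping.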

3. Query Features

• Key-value stores primarily support direct key-based lookups, without the complex
query capabilities found in SQL databases. This design is fast but limits flexibility, as
querying by fields within the value requires either application-level filtering or special
indexing capabilities (like Riak Search, which enables Lucene-style querying).

• Key design becomes crucial, as the application must generate or derive meaningful
keys for efficient data retrieval. This constraint makes key-value stores ideal for
applications where queries are predictable, such as session storage or shopping carts.

4. Structure of Data

• The value part of key-value pairs is typically stored as a blob, leaving the content and
structure to the application. This flexibility allows for storing various data types (e.g.,
JSON, XML, text), but it also shifts the responsibility of data interpretation to the
client application.

• For instance, Riak allows users to specify data types in requests via the Content-Type
header, which can simplify deserialization but does not affect how the database stores
the blob.

5. Scaling

• Sharding, or partitioning data across multiple nodes based on keys, enables key-value
stores to scale horizontally. Each node handles a subset of keys, based on a
deterministic function, allowing seamless expansion by adding more nodes to the
cluster.

• However, this approach also introduces risks; if a node responsible for certain keys
fails, data with those keys becomes unavailable until the node is restored. Key-value
stores address these issues with replication and tunable settings (e.g., the N, R, and W
values in Riak), which let users trade off consistency, availability, and partition
tolerance as described by the CAP theorem.
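
A simplified sketch of key-based sharding with replication. The hash-modulo placement, node names, and helper functions are assumptions for illustration; real stores typically use more elaborate schemes such as consistent hashing.

```python
import zlib

def replicas_for(key, nodes, n):
    # Deterministic placement: hash the key to a starting node, then take
    # the next n nodes in order as its replicas.
    start = zlib.crc32(key.encode("utf-8")) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(n)]

def is_available(key, nodes, n, failed):
    # The key can still be served if at least one of its replicas is up.
    return any(node not in failed for node in replicas_for(key, nodes, n))

nodes = ["node-a", "node-b", "node-c"]  # hypothetical cluster
primary = replicas_for("user:42", nodes, 1)[0]

# Without replication, losing the owning node makes the key unavailable.
single_copy = is_available("user:42", nodes, 1, failed={primary})

# With two replicas, the key survives the loss of its first node.
two_copies = is_available("user:42", nodes, 2, failed={primary})
```

Here `single_copy` is False and `two_copies` is True, which is the availability benefit that replication buys on top of sharding.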

7 Explain with suitable use cases of key value stores


Key-value stores offer a simple and efficient storage model suitable for applications where
data can be represented as individual items with unique keys. Typical use cases include:

1. Storing Session Information:

• Use Case: Each web session is assigned a unique sessionid.

• Advantage: Fast retrieval and storage in a single PUT or GET request, ideal for
storing session data.

• Example Solution: Memcached or Riak can be used, with Riak being the better fit
when availability is important.

2. User Profiles and Preferences:

• Use Case: User-specific settings such as language, timezone, or access permissions.

• Advantage: All user profile data can be stored in a single object, allowing quick
retrieval of preferences.

• Example Solution: The profile can be stored with a unique user ID as the key,
making it simple to access user settings with a single GET.

3. Shopping Cart Data:

• Use Case: Shopping carts tied to individual users across sessions, browsers, and
devices.

• Advantage: All cart information is stored in the value under a unique userid key, so
the cart can be fetched from any session, browser, or device.

• Example Solution: A Riak cluster, which maintains availability and fault tolerance,
making it suitable for this application.

When Not to Use Key-Value Stores

While key-value stores are effective for certain types of data storage, they are not ideal for
every scenario:

1. Data Relationships:

• Challenge: Complex relationships or associations between data items are difficult to
model in a key-value store.

• Limitation: Key-value stores lack the querying capability and relational structure that
relational databases provide.

• Alternative: Consider a relational database or a graph database where relationships
among entities are critical.

----------------------------------------END OF MODULE 3----------------------------------------------
