Big Data Analytics Practical Answers
(BDA-22684)
1. Pandas Program – Excel Import & Column Access
```python
import pandas as pd
df = pd.read_excel('[Link]')
print([Link])
print(df[['Name', 'Age']])
```
2. Pandas Program – Summary Stats & Skipping Rows/Cols
```python
print(df['Age'].sum())
print(df['Age'].mean())
print(df['Age'].max())
print(df['Age'].min())
df_skip = pd.read_excel('[Link]', skiprows=2, usecols="B:D")
```
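The spreadsheet used above is not shown. A minimal sketch of the same calls, assuming a small in-memory DataFrame with hypothetical names and ages. `read_csv` on an in-memory string stands in for `read_excel` here, so `usecols` takes column positions instead of the Excel range `"B:D"`:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the spreadsheet.
df = pd.DataFrame({"Name": ["Asha", "Ravi", "Meena"], "Age": [25, 32, 41]})

stats = {
    "sum": df["Age"].sum(),
    "mean": df["Age"].mean(),
    "max": df["Age"].max(),
    "min": df["Age"].min(),
}
print(stats)

# Two junk lines precede the header row; column 0 is an unwanted index column.
raw = (
    "report title\n"
    "generated by tool\n"
    "idx,Name,Age,City\n"
    "0,Asha,25,Pune\n"
    "1,Ravi,32,Delhi\n"
)
# skiprows=2 drops the junk lines; usecols keeps columns 1..3 (like "B:D" in Excel).
df_skip = pd.read_csv(io.StringIO(raw), skiprows=2, usecols=[1, 2, 3])
print(df_skip)
```

`df['Age'].describe()` would return count, mean, min, max, and quartiles in one call if all the statistics are needed together.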
3. Select & Delete Rows/Columns from DataFrame
```python
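# .loc is label-based and the slice end is inclusive: rows labeled 0 through 5
print(df.loc[0:5, ['Name', 'Age']])
df.drop('Age', axis=1, inplace=True)  # delete the 'Age' column
df.drop(0, axis=0, inplace=True)      # delete the row labeled 0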
```
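A minimal sketch of the same selection and deletion without `inplace=True`, assuming a hypothetical three-row DataFrame. Each `drop` returns a new DataFrame, so the original stays available:

```python
import pandas as pd

# Hypothetical sample data; the original spreadsheet is not shown.
df = pd.DataFrame(
    {
        "Name": ["Asha", "Ravi", "Meena"],
        "Age": [25, 32, 41],
        "City": ["Pune", "Delhi", "Kochi"],
    }
)

# Label-based selection: the slice 0:1 includes both endpoints, so two rows come back.
subset = df.loc[0:1, ["Name", "Age"]]

# drop() returns a new DataFrame by default and leaves df unchanged.
without_age = df.drop(columns="Age")
without_first_row = df.drop(index=0)

print(subset)
print(without_age.columns.tolist())
print(without_first_row.index.tolist())
```

Use `df.iloc[0:2]` instead of `df.loc[0:1]` when position-based slicing with an exclusive end is wanted.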
4. Import Modules and Download File
```python
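import urllib.request
import pandas as pd
# Assumes urllib.request.urlretrieve(url, filename); the URL and file name are elided in the source
urllib.request.urlretrieve('[Link]', '[Link]')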
df = pd.read_excel('[Link]')
```
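A minimal sketch of the download step that runs offline, assuming `urllib.request.urlretrieve` is the intended call. The `download` helper and the file names are hypothetical, and a local `file://` URL stands in for the real address:

```python
import tempfile
import urllib.request
from pathlib import Path


def download(url: str, dest: Path) -> Path:
    """Fetch url into dest and return dest (hypothetical helper)."""
    urllib.request.urlretrieve(url, str(dest))
    return dest


# Offline demonstration: a local file exposed through a file:// URL.
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.csv"
source.write_text("Name,Age\nAsha,25\n")

saved = download(source.as_uri(), workdir / "copy.csv")
print(saved.read_text())
```

With a real `http(s)` URL the call is the same; only the first argument changes.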
5. Extract ZIP, Load Data, Log Each Phase
```python
import zipfile
import pandas as pd

def log(msg):
    print('[LOG]', msg)

# Phase 1: extract the archive
with zipfile.ZipFile('[Link]', 'r') as z:
    z.extractall('data_folder')
log("Zip extracted")

# Phase 2: load the extracted file
df = pd.read_csv('data_folder/[Link]')
log("Data loaded")
```
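A self-contained sketch of the same extract, load, and log phases, assuming hypothetical file names and contents. It swaps `print` for the standard `logging` module and reads the CSV with `csv.DictReader` so it runs without pandas:

```python
import csv
import logging
import tempfile
import zipfile
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="[%(levelname)s] %(message)s")
log = logging.getLogger("pipeline")

workdir = Path(tempfile.mkdtemp())
archive = workdir / "data.zip"

# Demo setup only: build a small archive with hypothetical contents.
with zipfile.ZipFile(archive, "w") as z:
    z.writestr("people.csv", "Name,Age\nAsha,25\nRavi,32\n")
log.info("Archive created at %s", archive)

# Phase 1: extract.
target = workdir / "data_folder"
with zipfile.ZipFile(archive, "r") as z:
    z.extractall(target)
log.info("Zip extracted to %s", target)

# Phase 2: load. csv.DictReader keeps the sketch dependency-free.
with open(target / "people.csv", newline="") as f:
    rows = list(csv.DictReader(f))
log.info("Data loaded: %d rows", len(rows))
```

Logging after each phase rather than at the end makes it clear which step failed if the script stops partway.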
6. Create Hive Tables
```sql
CREATE EXTERNAL TABLE emp_ext(id INT, name STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' LOCATION '/external';
LOAD DATA LOCAL INPATH '/home/[Link]' INTO TABLE emp_ext;
CREATE TABLE emp_int(id INT, name STRING);
LOAD DATA LOCAL INPATH '/home/[Link]' INTO TABLE emp_int;
LOAD DATA INPATH '/hdfs/[Link]' INTO TABLE emp_int;
```
7. Hive Table Storage Formats
```sql
CREATE TABLE text_format(id INT) STORED AS TEXTFILE;
CREATE TABLE seq_format(id INT) STORED AS SEQUENCEFILE;
CREATE TABLE rc_format(id INT) STORED AS RCFILE;
```
8. Spark Count WARN Logs
```python
from pyspark import SparkContext
sc = SparkContext("local", "WarnCount")
logs = sc.textFile("[Link]")                      # one RDD element per line
warns = logs.filter(lambda line: "WARN" in line)  # keep lines containing WARN
print(warns.count())
```
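The filter predicate can be checked locally before running it on a cluster. A plain-Python sketch (not Spark), assuming hypothetical log lines, that applies the same substring test and contrasts it with a stricter whole-token match:

```python
# Hypothetical log lines; the real log file is not shown.
lines = [
    "2024-01-01 10:00:01 INFO  Starting job",
    "2024-01-01 10:00:02 WARN  Disk usage high",
    "2024-01-01 10:00:03 WARNING Deprecated config key",
    "2024-01-01 10:00:04 ERROR Task failed",
    "2024-01-01 10:00:05 warn  lowercase is not matched",
]


def is_warn(line: str) -> bool:
    # Same predicate as the lambda passed to filter() above.
    return "WARN" in line


# Substring match: also counts lines that contain WARNING.
substring_count = sum(1 for line in lines if is_warn(line))

# Stricter variant: WARN must appear as a whole whitespace-separated token.
token_count = sum(1 for line in lines if "WARN" in line.split())

print(substring_count, token_count)
```

The substring test is case-sensitive and also matches `WARNING`; which behavior is wanted depends on the log format.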
9. Create [Link]
```python
with open("[Link]", "w") as f:
    # Record contents are elided in the source
    f.write("[Link],[Link],[Link]")
```
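A minimal sketch of writing such a log file with the `csv` module, assuming the columns `ip`, `url`, and `location` used by the later queries. The file name and the records are hypothetical:

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical records; column names follow the later queries.
records = [
    ("10.0.0.1", "http://www.flipkart.com/home", "Pune"),
    ("10.0.0.2", "http://www.example.com/", "Delhi"),
]

path = Path(tempfile.mkdtemp()) / "access_log.csv"
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ip", "url", "location"])  # header row
    writer.writerows(records)

# Read the file back to confirm the layout.
with open(path, newline="") as f:
    loaded = list(csv.DictReader(f))
print(loaded[0])
```

Writing a header row lets a CSV reader pick up the column names instead of assigning generic ones.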
10. Spark SQL Flipkart Access
```python
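from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Flipkart").getOrCreate()
# Assumption: the file is a CSV whose header row names the ip, url, and location columns
df = spark.read.csv("[Link]", header=True)
df.createOrReplaceTempView("logdata")
spark.sql("SELECT location, COUNT(*) FROM logdata WHERE url LIKE '%flipkart%' "
          "GROUP BY location").show()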
```
11. Spark SQL Distinct IPs
```python
spark.sql("SELECT DISTINCT ip FROM logdata").show()
spark.sql("SELECT location, COUNT(DISTINCT ip) FROM logdata "
          "GROUP BY location").show()
```
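The Q10 and Q11 queries can be tried without a Spark cluster. A sketch that runs the same SQL against an in-memory SQLite table (a stand-in, not Spark SQL), assuming hypothetical rows. `ORDER BY` is added only to make the output order predictable:

```python
import sqlite3

# Hypothetical rows for the logdata table (ip, url, location).
rows = [
    ("10.0.0.1", "http://www.flipkart.com/home", "Pune"),
    ("10.0.0.2", "http://www.flipkart.com/cart", "Pune"),
    ("10.0.0.1", "http://www.example.com/", "Pune"),
    ("10.0.0.3", "http://www.flipkart.com/home", "Delhi"),
    ("10.0.0.4", "http://www.example.com/", "Delhi"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logdata (ip TEXT, url TEXT, location TEXT)")
conn.executemany("INSERT INTO logdata VALUES (?, ?, ?)", rows)

# Flipkart hits per location.
flipkart_by_location = conn.execute(
    "SELECT location, COUNT(*) FROM logdata "
    "WHERE url LIKE '%flipkart%' GROUP BY location ORDER BY location"
).fetchall()

# All distinct IPs, then distinct IPs per location.
distinct_ips = conn.execute(
    "SELECT DISTINCT ip FROM logdata ORDER BY ip"
).fetchall()
ips_by_location = conn.execute(
    "SELECT location, COUNT(DISTINCT ip) FROM logdata "
    "GROUP BY location ORDER BY location"
).fetchall()
conn.close()

print(flipkart_by_location)
print(distinct_ips)
print(ips_by_location)
```

One difference to keep in mind: SQLite's `LIKE` ignores ASCII case by default, while Spark SQL's `LIKE` is case-sensitive.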
12. Data Science Responsibilities
- Data collection, cleaning, visualization, modeling, and reporting.
- Use ML and statistical tools to derive insights.
13. Big Data Terminologies
- HDFS, YARN, MapReduce, Hive, Pig, Spark, Zookeeper, NoSQL
14. Big Data Stack
- Data Sources → Storage (HDFS, NoSQL) → Processing (Spark, MapReduce) → Access (Hive, Pig) → Visualization (Tableau)
15. Analytics Patterns
- Descriptive, Diagnostic, Predictive, Prescriptive, Streaming, Batch Analytics
16. Big Data Challenges
- Volume, Variety, Velocity, Veracity, Security, Scalability, Integration
17. Big Data Analytics Classification
- Descriptive, Predictive, Prescriptive, Diagnostic
18. Advantages of Hadoop
- Scalable, Cost-effective, Fault-tolerant, Open-source, Handles all data types
19. RDBMS vs Hadoop
- An RDBMS stores structured data under a fixed schema. Hadoop stores structured, semi-structured, and unstructured data, and scales out across a distributed cluster.
20. Hadoop Architecture
- HDFS for storage, YARN for resource management, MapReduce for processing.
21. HDFS Description
- NameNode (metadata), DataNode (blocks), replication, fault tolerance.
22. Advantages of Hadoop
- Cost-effective, scalable, flexible, fault-tolerant, supports large datasets.
23. RDBMS vs Hadoop Table
| RDBMS | Hadoop |
|-------|--------|
| Centralized | Distributed |
| Structured only | All data types |
24. Hadoop Architecture
- HDFS stores, YARN schedules, MapReduce processes.
25. HDFS
- Splits files into blocks, stores in DataNodes, NameNode manages metadata.
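The block split can be worked out with simple arithmetic. A minimal sketch, assuming a 128 MB block size and a replication factor of 3 (the usual defaults, both configurable). The `hdfs_footprint` helper is hypothetical:

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size
REPLICATION = 3      # assumed replication factor


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return the block count, replica count, and raw storage for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        # The last block is not padded, so raw storage is size times replication.
        "raw_storage_mb": file_size_mb * replication,
    }


print(hdfs_footprint(300))
print(hdfs_footprint(1024))
```

A 300 MB file therefore occupies three blocks (128 MB, 128 MB, 44 MB), each stored on three DataNodes, while the NameNode keeps only the mapping from file to blocks to DataNodes.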
26. Hadoop in Detail
- Framework for distributed storage and processing of big data using HDFS + MapReduce.
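The MapReduce model can be illustrated on a single machine. A plain-Python word-count sketch (not Hadoop code), assuming hypothetical input lines, with the map, shuffle, and reduce phases as separate functions:

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every word in the line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Combine the values for one key into a single result.
    return key, sum(values)


lines = ["big data needs big storage", "spark and hadoop process big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

In Hadoop the map and reduce calls run in parallel on different nodes, and the shuffle moves data across the network. Here everything runs in one process.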
27. HDFS Diagram
Client → NameNode (returns block locations) → DataNodes (client reads and writes blocks directly). Each block is stored with replication.
28. Use of Hive
- SQL-like interface for querying large datasets in Hadoop.
29. Hive Architecture
- UI, Driver, Compiler, Metastore, Execution Engine, HDFS.
30. SERDE with Diagram
- SerDe (Serializer/Deserializer): tells Hive how to read rows from, and write rows to, a custom file format.
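The idea can be sketched in Python. This is a toy analogy, not Hive's Java SerDe interface: the `Employee` row type and the `PipeDelimitedSerDe` class are hypothetical, and the row layout `id|name` mirrors the `emp` tables from Q6.

```python
from dataclasses import dataclass


@dataclass
class Employee:
    id: int
    name: str


class PipeDelimitedSerDe:
    """Toy serializer/deserializer for 'id|name' rows (hypothetical)."""

    delimiter = "|"

    def deserialize(self, raw: str) -> Employee:
        # Storage format -> row object (what happens when a table is read).
        id_text, name = raw.rstrip("\n").split(self.delimiter, 1)
        return Employee(id=int(id_text), name=name)

    def serialize(self, row: Employee) -> str:
        # Row object -> storage format (what happens when a table is written).
        return f"{row.id}{self.delimiter}{row.name}"


serde = PipeDelimitedSerDe()
row = serde.deserialize("7|Asha\n")
print(row)
print(serde.serialize(row))
```

Deserialization runs on the read path (file → row) and serialization on the write path (row → file).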
31. Hadoop Features
- Distributed, scalable, open-source, fault-tolerant, flexible.
32. Use of Apache Spark
- Real-time processing, machine learning, SQL, streaming, fast computation.
33. Apache Spark Architecture
- Driver → Cluster Manager → Executors. Processes tasks in memory.
34. Hive Architecture
- Same as Q29: UI, Driver, Compiler, Metastore, Execution Engine.
35. Spark vs MapReduce
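- Spark keeps intermediate data in memory, so it is usually faster, especially for iterative jobs. MapReduce writes intermediate results to disk between stages. Spark also offers more APIs (SQL, streaming, MLlib).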
36. RDBMS vs Hadoop
- Hadoop is more scalable and handles all data types, unlike RDBMS.
37. Hadoop Explanation
- Full Hadoop architecture: HDFS + YARN + MapReduce. Open-source big data framework.