Big Data Analytics with Python & Hive

BDA answer

Uploaded by

yuvraj120555

Big Data Analytics Practical Answers

(BDA-22684)
1. Pandas Program – Excel Import & Column Access
```python
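import pandas as pd
df = pd.read_excel('[Link]')        # load the first sheet into a DataFrame
print(df.columns)                   # list the column names
print(df[['Name', 'Age']])          # access specific columns by name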
```

2. Pandas Program – Summary Stats & Skipping Rows/Cols


```python
print(df['Age'].sum())
print(df['Age'].mean())
print(df['Age'].max())
print(df['Age'].min())
df_skip = pd.read_excel('[Link]', skiprows=2, usecols="B:D")
```

3. Select & Delete Rows/Columns from DataFrame


```python
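# Rows labelled 0 through 5 (label slices with .loc include both ends)
print(df.loc[0:5, ['Name', 'Age']])
df.drop('Age', axis=1, inplace=True)  # delete the 'Age' column
df.drop(0, axis=0, inplace=True)      # delete the row with index label 0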
```

4. Import Modules and Download File


```python
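import urllib.request
import pandas as pd

# Download the file at the given URL and save it under a local name
urllib.request.urlretrieve('[Link]', '[Link]')
df = pd.read_excel('[Link]')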
```

5. Extract ZIP, Load Data, Log Each Phase


```python
import zipfile
import pandas as pd

def log(msg):
    print('[LOG]', msg)

with zipfile.ZipFile('[Link]', 'r') as z:
    z.extractall('data_folder')
log("Zip extracted")   # phase 1: extraction finished

df = pd.read_csv('data_folder/[Link]')
log("Data loaded")     # phase 2: load finished
```
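
The same extract-then-load flow can also be logged with the standard `logging` module, which adds timestamps and levels. The following is a minimal sketch rather than the expected answer: the helper `extract_and_load`, the logger name `etl`, and the throwaway `sample.zip` / `sample.csv` files are all hypothetical names introduced here so that the example runs on its own, and it uses the `csv` module instead of pandas to stay dependency-free.

```python
import csv
import logging
import tempfile
import zipfile
from pathlib import Path

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s [%(levelname)s] %(message)s")
log = logging.getLogger("etl")  # hypothetical logger name


def extract_and_load(zip_path, out_dir):
    """Extract zip_path into out_dir, then read the first CSV found there."""
    log.info("Extracting %s", zip_path)
    with zipfile.ZipFile(zip_path, "r") as z:
        z.extractall(out_dir)
    log.info("Zip extracted to %s", out_dir)

    csv_path = next(Path(out_dir).glob("*.csv"))
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    log.info("Data loaded: %d rows from %s", len(rows), csv_path.name)
    return rows


# Build a tiny throwaway archive (invented data) so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "sample.zip"
    with zipfile.ZipFile(archive, "w") as z:
        z.writestr("sample.csv", "Name,Age\nAsha,30\nRavi,25\n")
    rows = extract_and_load(archive, Path(tmp) / "data_folder")
```

Each phase writes one log line when it starts or finishes, so a failed run shows which phase was reached last.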

6. Create Hive Tables


```sql
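-- External table: Hive tracks only the schema; the files stay under LOCATION
CREATE EXTERNAL TABLE emp_ext(id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/external';
LOAD DATA LOCAL INPATH '/home/[Link]' INTO TABLE emp_ext;

-- Internal (managed) table: Hive owns the data and deletes it on DROP TABLE
CREATE TABLE emp_int(id INT, name STRING);

-- LOCAL copies from the client file system; without LOCAL the file is moved within HDFS
LOAD DATA LOCAL INPATH '/home/[Link]' INTO TABLE emp_int;
LOAD DATA INPATH '/hdfs/[Link]' INTO TABLE emp_int;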
```

7. Hive Table Storage Formats


```sql
CREATE TABLE text_format(id INT) STORED AS TEXTFILE;
CREATE TABLE seq_format(id INT) STORED AS SEQUENCEFILE;
CREATE TABLE rc_format(id INT) STORED AS RCFILE;
```

8. Spark Count WARN Logs


```python
from pyspark import SparkContext
sc = SparkContext("local", "WarnCount")
logs = sc.textFile("[Link]")
warns = logs.filter(lambda line: "WARN" in line)  # keep only lines containing "WARN"
print(warns.count())
```
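
To sanity-check the expected count on a small sample without a Spark installation, the same filter-then-count logic can be written in plain Python. This is only a sketch: `count_warn_lines` and the sample log lines are invented for illustration.

```python
def count_warn_lines(lines):
    """Count lines containing the substring 'WARN' (same test as the Spark filter)."""
    return sum(1 for line in lines if "WARN" in line)


# Hypothetical sample lines, invented for illustration.
sample = [
    "2024-01-01 10:00:00 INFO  Starting job",
    "2024-01-01 10:00:01 WARN  Disk usage high",
    "2024-01-01 10:00:02 ERROR Task failed",
    "2024-01-01 10:00:03 WARN  Retrying task",
]
warn_count = count_warn_lines(sample)
print(warn_count)
```

Because this is a substring test, a line containing `WARNING` is counted as well; match on the log-level field instead if that matters.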

9. Create [Link]
```python
with open("[Link]", "w") as f:
    f.write("[Link],[Link],[Link]")
```

10. Spark SQL Flipkart Access


```python
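from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Flipkart").getOrCreate()
# header=True assumes the first CSV row holds the column names used in the query
df = spark.read.csv("[Link]", header=True)
df.createOrReplaceTempView("logdata")
spark.sql("""
    SELECT location, COUNT(*)
    FROM logdata
    WHERE url LIKE '%flipkart%'
    GROUP BY location
""").show()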
```

11. Spark SQL Distinct IPs


```python
spark.sql("SELECT DISTINCT ip FROM logdata").show()
spark.sql(
    "SELECT location, COUNT(DISTINCT ip) FROM logdata GROUP BY location"
).show()
```
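
The three queries from sections 10 and 11 can be dry-run on a handful of rows using Python's built-in `sqlite3`, which is handy for checking the SQL logic before running it on a cluster. This is a sketch under assumptions: the rows below are invented, and the column names `ip`, `url`, `location` are inferred from the queries themselves. SQLite's `LIKE` is case-insensitive for ASCII by default while Spark SQL's is case-sensitive, so results can differ on mixed-case URLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logdata (ip TEXT, url TEXT, location TEXT)")

# Hypothetical rows, invented for illustration.
sample_rows = [
    ("10.0.0.1", "www.flipkart.com/a", "Pune"),
    ("10.0.0.2", "www.flipkart.com/b", "Pune"),
    ("10.0.0.1", "www.example.com", "Pune"),
    ("10.0.0.3", "www.flipkart.com/c", "Mumbai"),
    ("10.0.0.3", "www.flipkart.com/d", "Mumbai"),
]
conn.executemany("INSERT INTO logdata VALUES (?, ?, ?)", sample_rows)

# Section 10: Flipkart accesses per location.
flipkart_hits = dict(conn.execute(
    "SELECT location, COUNT(*) FROM logdata "
    "WHERE url LIKE '%flipkart%' GROUP BY location"))

# Section 11: distinct IPs overall, then distinct IPs per location.
distinct_ips = sorted(r[0] for r in conn.execute("SELECT DISTINCT ip FROM logdata"))
ips_per_location = dict(conn.execute(
    "SELECT location, COUNT(DISTINCT ip) FROM logdata GROUP BY location"))

print(flipkart_hits, distinct_ips, ips_per_location)
```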

12. Data Science Responsibilities


- Data collection, cleaning, visualization, modeling, and reporting.
- Use ML and statistical tools to derive insights.

13. Big Data Terminologies


- HDFS, YARN, MapReduce, Hive, Pig, Spark, Zookeeper, NoSQL

14. Big Data Stack


- Data Sources → Storage (HDFS, NoSQL) → Processing (Spark, MapReduce) → Access (Hive,
Pig) → Visualization (Tableau)

15. Analytics Patterns


- Descriptive, Diagnostic, Predictive, Prescriptive, Streaming, Batch Analytics

16. Big Data Challenges


- Volume, Variety, Velocity, Veracity, Security, Scalability, Integration

17. Big Data Analytics Classification


- Descriptive, Predictive, Prescriptive, Diagnostic

18. Advantages of Hadoop


- Scalable, Cost-effective, Fault-tolerant, Open-source, Handles all data types

19. RDBMS vs Hadoop


- RDBMS for structured data, Hadoop for all data types. Hadoop is distributed and scalable.

20. Hadoop Architecture


- HDFS for storage, YARN for resource management, MapReduce for processing.

21. HDFS Description


- NameNode (metadata), DataNode (blocks), replication, fault tolerance.

22. Advantages of Hadoop


- Cost-effective, scalable, flexible, fault-tolerant, supports large datasets.

23. RDBMS vs Hadoop Table

| RDBMS | Hadoop |
|-------|--------|
| Centralized | Distributed |
| Structured only | All data types |

24. Hadoop Architecture


- HDFS stores, YARN schedules, MapReduce processes.

25. HDFS
- Splits files into blocks, stores in DataNodes, NameNode manages metadata.
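
The block-splitting and metadata idea can be illustrated with a toy model. This is a simplified sketch, not how HDFS is implemented: `split_into_blocks` and `place_replicas` are hypothetical helpers, the DataNode names are made up, and round-robin placement stands in for HDFS's rack-aware policy. The 128 MB block size and replication factor of 3 are the usual defaults.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is split into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication):
    """Map each block index to `replication` distinct DataNodes (round-robin).

    Requires replication <= len(datanodes). Real HDFS placement is rack-aware.
    """
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)  # two full blocks plus a 44 MB tail
# block_map plays the role of the NameNode's metadata: block -> DataNodes holding it.
block_map = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
print(blocks, block_map)
```

If one DataNode fails, every block in `block_map` still has two other copies, which is the basis of HDFS fault tolerance.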

26. Hadoop in Detail


- Framework for distributed storage and processing of big data using HDFS + MapReduce.
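
The MapReduce processing model can be sketched in a single process with the classic word count. This is an illustration of the map, shuffle, and reduce phases only, assuming hypothetical function names and made-up input lines; a real Hadoop job distributes these phases across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data big ideas", "data beats ideas"]  # invented input
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```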

27. HDFS Diagram


Client → NameNode (returns block locations) → DataNodes (client reads/writes blocks directly). Each block is replicated on multiple DataNodes (3 by default).

28. Use of Hive


- SQL-like interface for querying large datasets in Hadoop.

29. Hive Architecture


- UI, Driver, Compiler, Metastore, Execution Engine, HDFS.

30. SERDE with Diagram


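- SerDe (Serializer/Deserializer): tells Hive how to turn a stored record into table columns when reading (deserialize) and how to turn a row back into the storage format when writing (serialize). Used for custom formats.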
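
The two directions of a SerDe can be sketched for a comma-delimited text record. This is a conceptual illustration in Python, not Hive's actual Java SerDe interface: `COLUMNS`, `deserialize`, and `serialize` are hypothetical names, and the schema mirrors `emp_ext(id INT, name STRING)` from section 6. Fields that cannot be parsed become `None`, assuming the usual Hive behaviour of surfacing malformed values as NULL.

```python
COLUMNS = [("id", int), ("name", str)]  # mirrors emp_ext(id INT, name STRING)


def deserialize(line, delimiter=","):
    """Stored record -> row: split the text and cast each field to its column type."""
    fields = line.rstrip("\n").split(delimiter)
    row = {}
    for (col, typ), raw in zip(COLUMNS, fields):
        try:
            row[col] = typ(raw)
        except ValueError:
            row[col] = None  # malformed field is treated as NULL
    return row


def serialize(row, delimiter=","):
    """Row -> stored record: write the columns back as delimited text."""
    return delimiter.join(
        "" if row[col] is None else str(row[col]) for col, _ in COLUMNS
    )


row = deserialize("1,Asha\n")
line = serialize(row)
print(row, line)
```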

31. Hadoop Features


- Distributed, scalable, open-source, fault-tolerant, flexible.

32. Use of Apache Spark


- Fast in-memory computation for batch jobs, SQL queries, machine learning, and near-real-time stream processing.

33. Apache Spark Architecture


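- Driver → Cluster Manager → Executors. The driver splits a job into tasks, the cluster manager allocates resources, and executors run the tasks, keeping intermediate data in memory.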

34. Hive Architecture


- Same as Q29: UI, Driver, Compiler, Metastore, Execution Engine.

35. Spark vs MapReduce


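- Spark keeps intermediate results in memory, so it is usually faster for iterative and interactive jobs; MapReduce writes intermediate results to disk between stages. Spark also offers higher-level APIs (SQL, streaming, MLlib).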

36. RDBMS vs Hadoop


- Hadoop is more scalable and handles all data types, unlike RDBMS.

37. Hadoop Explanation
- Full Hadoop architecture: HDFS + YARN + MapReduce. Open-source big data framework.
