0% found this document useful (0 votes)
18 views4 pages

PySpark Functions Overview

The document provides a comprehensive overview of PySpark functions, covering RDD creation, transformations, actions, DataFrame methods, SQL operations, date and time functions, window functions, and storage levels. It details various methods for manipulating and processing data using PySpark, including both narrow and wide transformations, as well as optimization techniques for performance. Additionally, it outlines persistence methods and the importance of partitioning in data processing.

Uploaded by

Govind Gupta
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
18 views4 pages

PySpark Functions Overview

The document provides a comprehensive overview of PySpark functions, covering RDD creation, transformations, actions, DataFrame methods, SQL operations, date and time functions, window functions, and storage levels. It details various methods for manipulating and processing data using PySpark, including both narrow and wide transformations, as well as optimization techniques for performance. Additionally, it outlines persistence methods and the importance of partitioning in data processing.

Uploaded by

Govind Gupta
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

PySpark Functions Summary

PySpark Core (RDD Functions)

RDD Creation and Loading


[Link]() - Load text file into RDD

[Link]() - Create RDD from collection with partitions

range() - Generate range of numbers

Transformations (Narrow)
map() - Apply function to each element

flatMap() - Apply function and flatten results

filter() - Filter elements based on condition

distinct() - Remove duplicates

union() - Combine RDDs

sortBy() - Sort by key with random key sort

mapPartitions() - Apply function to each partition

Transformations (Wide)
groupByKey() - Group by key (Note: aggregation by key, must use map)

reduceByKey() - Reduce by key (Note: perform local aggregation before final reduce, very less
shuffling, high performance)
sortByKey() - Sort by key (Note: sorting in each partition is done separately, then results are merged)

join() - Inner join

rightOuterJoin() - Right outer join

leftOuterJoin() - Left outer join

cogroup() - Group multiple RDDs

Transformations on Paired RDDs


mapValues() - Apply function to values only

keys() - Extract keys

values() - Extract values

Actions
collect() - Return RDD as list
take(n) - Take first n records

first() - Take only 1st element

count() - Count number of records in RDD

reduce() - Aggregate elements (only commutative and associative operations allowed)

saveAsTextFile() - Save RDD to text file

countByKey() - Count occurrences, return FlatMap

collectAsMap() - Convert to Python dictionary

Spark Joins
join() - Inner join

rightOuterJoin() - Key present in 1st RDD

leftOuterJoin() - Key present in 2nd RDD

PySpark DataFrame

DataFrame Creation
From collections: createDataFrame()
Import named tuple: NamedTuple

From Spark RDD: toDF()

DataFrame Methods
toDf() - Convert to DataFrame

createDataFrame() - Create DataFrame from RDD

show() - Display DataFrame

select() - Select specific columns

filter() - Filter rows based on condition

where() - Alternative to filter

groupBy() - Group by column, then apply aggregation (sum, max, count)

agg() - Aggregation function

limit(n) - First n records

distinct() - Unique values from column, return DataFrame

orderBy() - Sort by column (False = Descending)

printSchema() - Display DataFrame schema

DataFrame Metadata
[Link] - Column names
[Link] - Data types of each column

[Link] - Structure with column and type

PySpark SQL

SQL Context and Temporary Views


createOrReplaceTempView() - Create temporary view (table)

[Link]() - Execute SQL queries

Date and Time Functions


current_date() - Get current date

current_date().show(n) - Limited rows

date_format() - Format date to other format

to_date() - Convert string to date

date_add() - Add specific days to date column

date_sub() - Subtract specific days from column

months_between() - Similar to date_diff() but months difference in float

year() , month() , next_day() , week_of_year() - Extract date parts from given date column

current_timestamp() - Current timestamp

hour() , minute() , second() - Same as year(), month()

to_timestamp() - Convert string to timestamp

Window Functions
Used with frame and partition concepts

Storage Levels and Persistence

Persistence Methods
cache() - Cache RDD (doesn't take any parameter)

persist() - Same as persist(pyspark, StorageLevel, MEMORY_ONLY)

Storage Levels
1. MEMORY_ONLY - Only in RAM

2. MEMORY_AND_DISK - If RAM full, use DISK

3. MEMORY_ONLY_SER - RDD as serialized Java object only on RAM

4. MEMORY_AND_DISK_SER - Serialized Java object on RAM, if RAM full then DISK

5. DISK_ONLY - Only DISK


6. MEMORY_ONLY_2 - Same as level above but replicate each partition on 2 cluster nodes
7. MEMORY_AND_DISK_2 - Similar concept

Key Concepts

Partitioning
Data distribution across partitions (P1, P2, P3)

Narrow vs Wide Transformations

Shuffling operations and performance impact

Optimization Techniques
RDD persistence and caching

Avoiding wide transformations when possible

Using appropriate storage levels

Network congestion and performance issues with shuffling

Common questions

Powered by AI

To minimize the impact of shuffling during wide transformations in PySpark, strategies include using reduceByKey() instead of groupByKey() to perform local aggregation before data is shuffled across nodes . Another strategy involves partitioning data appropriately before wide operations, leveraging a consistent hashing strategy to minimize unnecessary data movement . Utilizing caching and persistence to retain frequently accessed RDDs in memory also reduces recomputation and related shuffling overheads . These approaches collectively enhance performance by minimizing data transfer and processing costs.

Using persistence and selecting appropriate storage levels in PySpark helps enhance computational efficiency by preventing repeated computation of RDDs that are reused multiple times . The choice of storage level depends on resource availability and use cases. For instance, MEMORY_ONLY is suitable when RAM is sufficient and serialization is unnecessary, while MEMORY_AND_DISK is beneficial when RAM is limited, allowing spillover to disk . Serialized storage levels like MEMORY_ONLY_SER reduce memory footprint by storing RDDs as serialized Java objects, improving efficiency in memory-critical applications .

Window functions in PySpark are crucial for complex data analyses that require operations over subsets of data within rows, allowing calculations such as running totals, moving averages, and rank assignments . Unlike groupBy, which collapses data, window functions can operate on entire datasets while respecting frame boundaries, providing further insights across data ranges . They support partitioned computations, enabling efficient, informative analytics on streaming or time series data, and facilitate tasks like temporal trend analysis or cohort studies by retaining context in individual row evaluations .

DataFrame operations in PySpark generally outperform RDD operations due to their optimizations such as catalyst optimizer and custom execution plans that optimize queries for performance . They offer higher-level abstractions and SQL-like syntax, enhancing usability and reducing the need for verbose code . In contrast, RDD operations offer more low-level data manipulation capabilities, suitable for tasks requiring fine-grained control over data processing. However, they lack the optimization benefits inherent in DataFrames, making them less efficient for large-scale data transformations .

Narrow transformations in PySpark, such as map() and filter(), perform operations on each partition independently and don't require data reshuffling across nodes, making them more efficient and faster . Wide transformations, such as groupByKey() and reduceByKey(), involve shuffling data across partitions or nodes, which can lead to increased network I/O and processing time, impacting performance negatively . These processes require coordination and data transfer between executors, affecting the overall speed and resource usage of a PySpark application.

The sc.parallelize() method in PySpark efficiently creates RDDs from Python collections, enabling distributed execution and parallel processing of data . Benefits include easy partition control and scalability, which facilitates handling large datasets without explicit data pre-partitioning . However, limitations include initial overheads in setting up partitions and inefficiencies when dealing with extremely large collections, as performance may bottleneck due to these overheads . Optimal use of sc.parallelize() generally involves moderately sized datasets, making it less effective for massive, already partitioned data.

PySpark provides different join operations: join(), leftOuterJoin(), and rightOuterJoin(). The join() operation performs an inner join, returning only matching key-value pairs from both RDDs . This is effective when both datasets contribute equally to the final result. The leftOuterJoin() returns all keys from the first RDD with matching key-value pairs from the second RDD, filling in with nulls where no match is found, suitable for preserving data from the primary dataset . The rightOuterJoin() works oppositely, useful when emphasis is on the second dataset . These operations help combine data from different sources based on specific needs.

The mapValues() function is advantageous in scenarios where transformations need to be applied solely to the values in paired RDDs, preserving key-value pair relationships without modifying keys . This is particularly useful when key-dependent operations are unnecessary or when maintaining key integrity is critical, such as when performing successive operations like sorting or joining with other datasets based on keys . Compared to map(), which alters both keys and values, mapValues() provides a more efficient and semantically clear approach for value-centric transformations within paired RDDs.

The date_format() function in PySpark is instrumental in converting and formatting date columns into different string representations, supporting diverse date-time formatting requirements . It complements functions like to_date(), which converts strings to date types, and date_add()/date_sub(), which adjust dates by specific intervals . By using date_format() in conjunction with these functions, users can achieve comprehensive date manipulation and formatting tasks, ensuring consistent and readable date-time representations across DataFrame operations .

DataFrame methods such as filter() and groupBy() are pivotal in managing large datasets within PySpark. The filter() method allows for efficient row-wise data selection based on specified conditions, enabling streamlined processing of relevant data subsets . On the other hand, groupBy(), followed by aggregations like sum or count, facilitates summary statistics computation and segmentation analysis, crucial for gaining insights from voluminous data . These methods optimize query performance by limiting data operations to relevant segments, ensuring efficient resource use and quicker processing times.

You might also like