Pyspark Union and UnionByName Guide

The document provides an overview of PySpark transformations, specifically focusing on the union and unionByName functions for merging DataFrames with the same or different schemas. It includes syntax examples and practical use cases demonstrating how to combine DataFrames while handling missing columns. Additionally, it introduces PySpark window functions for performing statistical operations within groups of data.

Uploaded by

Rambabu Giduturi

30/05/2025, 08:13 Pyspark practice - Databricks

Pyspark practice
Union and UnionByName transformation
Concept :-
union works when the columns of both DataFrames being joined are in the same order. It can give surprisingly wrong
results when the schemas aren’t the same.

unionByName works when both DataFrames have the same columns, but in a different order.

union and unionByName transformation are used to merge two or more dataframes of the same schema and structure.

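unionByName() takes an optional parameter, allowMissingColumns (default False, available from Spark 3.1). When it is True, a column that exists in only one of the DataFrames is kept and filled with null for the rows of the other.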

Syntax of unionByName()

unionByName(df, allowMissingColumns = True)

Union() function in pyspark

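The PySpark union() function is used to combine two or more data frames having the same structure or schema. It matches columns by position, not by name. It raises an error only when the number of columns differs; when the column names differ it still merges the data, which can silently put unrelated values in the same column.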

Syntax: data_frame1.union(data_frame2)

Where,

data_frame1 and data_frame2 are the dataframes.


Example 1
americans = spark.createDataFrame(
    [("bob", 42), ("lisa", 59)], ["first_name", "age"]
)
colombians = spark.createDataFrame(
    [("maria", 20), ("camilo", 31)], ["first_name", "age"]
)
res = americans.union(colombians)
res.show()

+----------+---+
|first_name|age|
+----------+---+
| bob| 42|
| lisa| 59|
| maria| 20|
| camilo| 31|
+----------+---+

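details = [(1, 'Krishna', 'IT', 'male')]
column = ['id', 'name', 'department', 'gender']
details1 = [(1, 'Krishna', 'IT', 10000)]
# Note: `column` is reassigned here before df1 is created, so both DataFrames
# get the header ending in 'salary' (this is what the output below shows).
# Use two separate variables if df1 should keep the 'gender' column name.
column = ['id', 'name', 'department', 'salary']
df1 = spark.createDataFrame(details, column)
df2 = spark.createDataFrame(details1, column)
df1.show()
df2.show()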

+---+-------+----------+------+
| id|   name|department|salary|
+---+-------+----------+------+
|  1|Krishna|        IT|  male|
+---+-------+----------+------+

+---+-------+----------+------+
| id| name|department|salary|
+---+-------+----------+------+
| 1|Krishna| IT| 10000|
+---+-------+----------+------+

df1.union(df2).show()

+---+-------+----------+------+
| id| name|department|salary|
+---+-------+----------+------+
| 1|Krishna| IT| male|
| 1|Krishna| IT| 10000|
+---+-------+----------+------+

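df1.unionByName(df2, allowMissingColumns=True).show()

# With allowMissingColumns=True the union succeeds even if a column is missing on one side;
# the missing values are filled with null. The intent here was a gender/salary difference,
# but both DataFrames ended up with the same header, so the result matches union().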

+---+-------+----------+------+
| id| name|department|salary|
+---+-------+----------+------+
| 1|Krishna| IT| male|
| 1|Krishna| IT| 10000|
+---+-------+----------+------+


Remember: if the column names differ or the number of columns differs, use unionByName with
allowMissingColumns=True.
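
The difference between the two functions can be sketched in plain Python, without Spark. This is only an illustrative model of the semantics described above; the helper names `union_by_position` and `union_by_name` are hypothetical and are not part of PySpark.

```python
# Plain-Python model of the two union semantics (illustration only, not Spark code).

def union_by_position(cols1, rows1, cols2, rows2):
    """Model of union(): columns are matched by position; the second header is ignored."""
    if len(cols1) != len(cols2):
        raise ValueError("union requires the same number of columns")
    return cols1, rows1 + rows2


def union_by_name(cols1, rows1, cols2, rows2, allow_missing_columns=False):
    """Model of unionByName(): columns are matched by name, optionally filling gaps with None."""
    if set(cols1) != set(cols2) and not allow_missing_columns:
        raise ValueError("column names differ; pass allow_missing_columns=True")
    # The result keeps the first input's columns, then any extra columns of the second.
    out_cols = list(cols1) + [c for c in cols2 if c not in cols1]

    def realign(cols, rows):
        index = {c: i for i, c in enumerate(cols)}
        return [tuple(r[index[c]] if c in index else None for c in out_cols) for r in rows]

    return out_cols, realign(cols1, rows1) + realign(cols2, rows2)


cols_a, rows_a = ["id", "name", "gender"], [(1, "Krishna", "male")]
cols_b, rows_b = ["id", "name", "salary"], [(1, "Krishna", 10000)]

positional = union_by_position(cols_a, rows_a, cols_b, rows_b)
by_name = union_by_name(cols_a, rows_a, cols_b, rows_b, allow_missing_columns=True)
print(positional)  # salary value lands in the 'gender' column
print(by_name)     # separate 'gender' and 'salary' columns, gaps filled with None
```

The positional version mixes salary into the gender column, which is the "surprisingly wrong result" mentioned in the concept section.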

Example 2
# union

data_frame1 = spark.createDataFrame(
    [("Nitya", 82.98), ("Abhishek", 80.31)],
    ["Student Name", "Overall Percentage"]
)

# Creating another dataframe

data_frame2 = spark.createDataFrame(
    [("Sandeep", 91.123), ("Rakesh", 90.51)],
    ["Student Name", "Overall Percentage"]
)

# union()
UnionEXP = data_frame1.union(data_frame2)

UnionEXP.show()

+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
|       Nitya|             82.98|
|    Abhishek|             80.31|
|     Sandeep|            91.123|
|      Rakesh|             90.51|
+------------+------------------+

UnionByName() function in pyspark

The PySpark unionByName() function is also used to combine two or more data frames, but it can combine
data frames whose columns are in a different order. This is because it matches columns by name and not by
position.

Syntax: data_frame1.unionByName(data_frame2)

Where, data_frame1 and data_frame2 are the dataframes

data_frame1 = spark.createDataFrame(
    [("Nitya", 82.98), ("Abhishek", 80.31)],
    ["Student Name", "Overall Percentage"]
)

# Creating another data frame

data_frame2 = spark.createDataFrame(
    [(91.123, "Naveen"), (90.51, "Sandeep"), (87.67, "Rakesh")],
    ["Overall Percentage", "Student Name"]
)

# Union both the dataframes using unionByName() method

byName = data_frame1.unionByName(data_frame2)

byName.show()

In this example, data_frame1 and data_frame2 list their columns in a different order, yet the output is the desired one because the columns are matched by name.

+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
|       Nitya|             82.98|
|    Abhishek|             80.31|
|      Naveen|            91.123|
|     Sandeep|             90.51|
|      Rakesh|             87.67|
+------------+------------------+

data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98, "Computer Science"), ("Harshit", 80.31, "Information Technology")],
    ["Student Name", "Overall Percentage", "Department"]
)

# Creating another dataframe

data_frame2 = spark.createDataFrame(
    [("Naveen", 91.123), ("Piyush", 90.51)],
    ["Student Name", "Overall Percentage"]
)

# Union both the dataframes using unionByName() method

column_name_morein1df = data_frame1.unionByName(data_frame2, allowMissingColumns=True)

column_name_morein1df.show()

In this example the first dataframe has more columns than the second one.
With allowMissingColumns=True, unionByName() still produces the desired result and fills the missing Department values with null.

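+------------+------------------+--------------------+
|Student Name|Overall Percentage|          Department|
+------------+------------------+--------------------+
|   Bhuwanesh|             82.98|    Computer Science|
|     Harshit|             80.31|Information Techn...|
|      Naveen|            91.123|                null|
|      Piyush|             90.51|                null|
+------------+------------------+--------------------+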

pySpark Window Ranking Functions

Definitions:

Window function:
PySpark window functions are useful when you want to examine relationships within a group of data rather than
between groups of data. They perform statistical operations such as the ones explained below.

A PySpark window function performs statistical operations such as rank, row number, etc. on a group, frame, or
collection of rows and returns a result for each row individually. Window functions are also increasingly used for
data transformations.

There are mainly three types of Window function:

Analytical Function
Ranking Function
Aggregate Function

Analytical functions:

An analytic function returns a result after operating on a finite set of rows, defined by the PARTITION BY and
ORDER BY clauses of the window. It returns as many rows as there are input rows.
E.g. lead(), lag(), cume_dist().


from pyspark.sql.window import Window
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark_window").getOrCreate()
sampleData = (("Nitya", 28, "Sales", 3000),
              ("Abhishek", 33, "Sales", 4600),
              ("Sandeep", 40, "Sales", 4100),
              ("Rakesh", 25, "Finance", 3000),
              ("Ram", 28, "Sales", 3000),
              ("Srishti", 46, "Management", 3300),
              ("Arbind", 26, "Finance", 3900),
              ("Hitesh", 30, "Marketing", 3000),
              ("Kailash", 29, "Marketing", 2000),
              ("Sushma", 39, "Sales", 4100)
              )
# column names
columns = ["Employee_Name", "Age", "Department", "Salary"]

# creating the dataframe df
df = spark.createDataFrame(data=sampleData, schema=columns)
# window used below: one partition per department, rows ordered by age
windowPartition = Window.partitionBy("Department").orderBy("Age")
df.printSchema()
df.show()
display(df)

root
|-- Employee_Name: string (nullable = true)
|-- Age: long (nullable = true)
|-- Department: string (nullable = true)
|-- Salary: long (nullable = true)

+-------------+---+----------+------+
|Employee_Name|Age|Department|Salary|
+-------------+---+----------+------+
|        Nitya| 28|     Sales|  3000|
|     Abhishek| 33|     Sales|  4600|
|      Sandeep| 40|     Sales|  4100|
|       Rakesh| 25|   Finance|  3000|
|          Ram| 28|     Sales|  3000|
|      Srishti| 46|Management|  3300|
|       Arbind| 26|   Finance|  3900|
|       Hitesh| 30| Marketing|  3000|
|      Kailash| 29| Marketing|  2000|
|       Sushma| 39|     Sales|  4100|
+-------------+---+----------+------+

Table
Employee_Name Age Department Salary
1 Nitya 28 Sales 3000
2 Abhishek 33 Sales 4600
3 Sandeep 40 Sales 4100
4 Rakesh 25 Finance 3000
5 Ram 28 Sales 3000
6 Srishti 46 Management 3300
7 Arbind 26 Finance 3900
8 Hitesh 30 Marketing 3000
9 Kailash 29 Marketing 2000
10 Sushma 39 Sales 4100

10 rows


Using cume_dist():
cume_dist() window function is used to get the cumulative distribution within a window partition.

from pyspark.sql.functions import cume_dist

df.withColumn("cume_dist", cume_dist().over(windowPartition)).display()

Table
    
Employee_Name Age Department Salary cume_dist
1 Rakesh 25 Finance 3000 0.5
2 Arbind 26 Finance 3900 1
3 Srishti 46 Management 3300 1
4 Kailash 29 Marketing 2000 0.5
5 Hitesh 30 Marketing 3000 1
6 Nitya 28 Sales 3000 0.4
7 Ram 28 Sales 3000 0.4
10 rows
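
To see where values such as 0.4 and 0.5 come from, the calculation can be sketched in plain Python: for each row, the cumulative distribution is the fraction of rows in the partition whose ordering value is less than or equal to the current one. The function below is an illustrative model, not Spark's implementation; the ages are those of the Sales and Finance partitions from the sample data.

```python
def cume_dist(values):
    """Fraction of rows in the partition with an ordering value <= the current row's value."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


sales_ages = [28, 28, 33, 39, 40]   # Sales partition, ordered by Age
finance_ages = [25, 26]             # Finance partition, ordered by Age

sales_cume_dist = cume_dist(sales_ages)
finance_cume_dist = cume_dist(finance_ages)
print(sales_cume_dist)
print(finance_cume_dist)
```

The two employees aged 28 tie, so both get 2/5 = 0.4, matching rows 6 and 7 of the table.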

Using lag()
A lag() function is used to access previous rows’ data as per the defined offset value in the function.

from pyspark.sql.functions import lag

df.withColumn("Lag", lag("Salary", 2).over(windowPartition)) \
    .display()

Table
Employee_Name Age Department Salary Lag
1 Rakesh 25 Finance 3000 null
2 Arbind 26 Finance 3900 null
3 Srishti 46 Management 3300 null
4 Kailash 29 Marketing 2000 null
5 Hitesh 30 Marketing 3000 null
6 Nitya 28 Sales 3000 null
7 Ram 28 Sales 3000 null
10 rows

Using lead()
A lead() function is used to access data from a following row, at the offset given in the function.

from pyspark.sql.functions import lead

df.withColumn("Lead", lead("salary", 2).over(windowPartition)) \
    .display()

Table
    
Employee_Name Age Department Salary Lead
1 Rakesh 25 Finance 3000 null
2 Arbind 26 Finance 3900 null
3 Srishti 46 Management 3300 null
4 Kailash 29 Marketing 2000 null
5 Hitesh 30 Marketing 3000 null
6 Nitya 28 Sales 3000 4600
7 Ram 28 Sales 3000 4100
10 rows
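
The offset logic of lag() and lead() can be modelled with list indexing. This is a plain-Python sketch of the behaviour inside one partition, not Spark code; the salaries are those of the Sales partition ordered by Age.

```python
def lag(values, offset=1, default=None):
    """Value from `offset` rows earlier in the partition, or `default` if there is none."""
    return [values[i - offset] if i - offset >= 0 else default for i in range(len(values))]


def lead(values, offset=1, default=None):
    """Value from `offset` rows later in the partition, or `default` if there is none."""
    return [values[i + offset] if i + offset < len(values) else default for i in range(len(values))]


# Sales partition ordered by Age: Nitya, Ram, Abhishek, Sushma, Sandeep
sales_salaries = [3000, 3000, 4600, 4100, 4100]

sales_lag = lag(sales_salaries, 2)
sales_lead = lead(sales_salaries, 2)
print(sales_lag)
print(sales_lead)
```

Partitions with fewer than three rows (Finance, Management, Marketing) can never reach an offset of 2, which is why every visible row in those partitions shows null.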

Ranking Function
A ranking function returns the rank of each row within its partition or group. It numbers the rows in the
result column according to the ordering defined in the window specification, separately for each partition
specified in the OVER clause. E.g. row_number(), rank(), dense_rank(), etc.

A. row_number(): gives a sequential row number, starting from 1, to each row of a window partition.

B. rank(): gives a rank to each row within a window partition. This function leaves gaps in the ranking
when there are ties.
Example: 1 1 1 4

C. dense_rank(): gives a rank to each row within a window partition without gaps. It is similar to rank();
the difference is that rank() leaves gaps in the ranking when there are ties.
Example: 1 1 1 2

Syntax for Window Functions:

DataFrame.withColumn("col_name", Window_function().over(Window_partition))
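
The three numbering schemes can be compared side by side with a small plain-Python model. It assumes the values of one partition are already sorted by the ordering column; the function names mirror the PySpark ones, but this is only a sketch of the semantics.

```python
def row_number(sorted_values):
    """Sequential numbering from 1, ties are still numbered separately."""
    return list(range(1, len(sorted_values) + 1))


def rank(sorted_values):
    """Ties share a rank; the next distinct value jumps to its row position (gaps)."""
    ranks = []
    for i, v in enumerate(sorted_values):
        if i > 0 and v == sorted_values[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(i + 1)
    return ranks


def dense_rank(sorted_values):
    """Ties share a rank; the next distinct value gets the next integer (no gaps)."""
    ranks = []
    for i, v in enumerate(sorted_values):
        if i == 0:
            ranks.append(1)
        elif v == sorted_values[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(ranks[-1] + 1)
    return ranks


marks = [70, 70, 70, 90]  # hypothetical partition with a three-way tie
print(row_number(marks), rank(marks), dense_rank(marks))
```

With a three-way tie followed by one more row, rank() yields 1 1 1 4 and dense_rank() yields 1 1 1 2, the two examples given above.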


root
|-- Roll_No: long (nullable = true)
|-- Student_Name: string (nullable = true)
|-- Subject: string (nullable = true)
|-- Marks: long (nullable = true)

Table
Roll_No Student_Name Subject Marks
1 101 Ram Biology 80
2 103 Sita Social Science 78
3 104 Lakshman Sanskrit 58
4 102 Kunal Phisycs 89
5 101 Ram Biology 80
6 106 Srishti Maths 70
7 108 Sandeep Physics 75
8 107 Hitesh Maths 88
9 109 Kailash Maths 90
10 105 Abhishek Social Science 84
10 rows

Using row_number().
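row_number() gives a sequential number, starting at 1, to each row within its window partition.

from pyspark.sql.functions import row_number

# The definitions of the student DataFrame and of windowPartition are not shown in this
# section; `df` is assumed to be the student DataFrame above, and the output implies
# a window partitioned by Subject and ordered by Marks.
df.withColumn("row_number", row_number().over(windowPartition)).display()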

Table
Roll_No Student_Name Subject Marks row_number
1 101 Ram Biology 80 1
2 101 Ram Biology 80 2
3 106 Srishti Maths 70 1
4 107 Hitesh Maths 88 2
5 109 Kailash Maths 90 3
6 102 Kunal Phisycs 89 1
7 108 Sandeep Physics 75 1
8 104 Lakshman Sanskrit 58 1
9 103 Sita Social Science 78 1
10 105 Abhishek Social Science 84 2

10 rows

Using rank()
The rank function is used to give ranks to rows specified in the window partition. This function leaves gaps in rank
if there are ties.

from pyspark.sql.functions import rank

df.withColumn("rank", rank().over(windowPartition)) \
    .display()

Table
Roll_No Student_Name Subject Marks rank
1 101 Ram Biology 80 1
2 101 Ram Biology 80 1
3 106 Srishti Maths 70 1
4 107 Hitesh Maths 88 2
5 109 Kailash Maths 90 3
6 102 Kunal Phisycs 89 1
7 108 Sandeep Physics 75 1
10 rows

Using percent_rank()
This function is similar to rank(), but it expresses the rank as a fraction between 0 and 1, computed as
(rank - 1) / (number of rows in the partition - 1).

from pyspark.sql.functions import percent_rank

df.withColumn("percent_rank", percent_rank().over(windowPartition)).display()

Table
    
Roll_No Student_Name Subject Marks percent_rank
1 101 Ram Biology 80 0
2 101 Ram Biology 80 0
3 106 Srishti Maths 70 0
4 107 Hitesh Maths 88 0.5
5 109 Kailash Maths 90 1
6 102 Kunal Phisycs 89 0
7 108 Sandeep Physics 75 0
10 rows
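
The percent_rank values in the table can be reproduced by hand. The sketch below is a plain-Python model (not Spark code) that applies (rank - 1) / (rows in partition - 1) to the Maths and Biology partitions; a single-row partition is treated as 0.0.

```python
def percent_rank(sorted_values):
    """(rank - 1) / (n - 1) for each row of a partition sorted by the ordering column."""
    n = len(sorted_values)
    if n == 1:
        return [0.0]
    ranks = []
    for i, v in enumerate(sorted_values):
        # Ties share the rank of the first row with that value.
        if i > 0 and v == sorted_values[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(i + 1)
    return [(r - 1) / (n - 1) for r in ranks]


maths_marks = [70, 88, 90]
biology_marks = [80, 80]

print(percent_rank(maths_marks))
print(percent_rank(biology_marks))
```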

Using dense_rank()
This function gives each row a rank within its window partition, with no gaps in the ranking. It is similar to
rank(); the only difference is that rank() leaves gaps when there are ties.

from pyspark.sql.functions import dense_rank

df.withColumn("dense_rank", dense_rank().over(windowPartition)).display()

Table
    
Roll_No Student_Name Subject Marks dense_rank
1 101 Ram Biology 80 1
2 101 Ram Biology 80 1
3 106 Srishti Maths 70 1
4 107 Hitesh Maths 88 2
5 109 Kailash Maths 90 3
6 102 Kunal Phisycs 89 1
7 108 Sandeep Physics 75 1
10 rows

Aggregate functions
An aggregate function or aggregation function is a function where the values of multiple rows are grouped to form a
single summary value. In SQL the groups of rows are usually defined with the GROUP BY clause; when an aggregate is
used as a window function, it is computed over each window partition and the result is repeated on every row of that
partition. E.g. AVG, SUM, MIN, MAX, etc.


root
|-- Employee_Name: string (nullable = true)
|-- Department: string (nullable = true)
|-- Salary: long (nullable = true)

Table
Employee_Name Department Salary
1 Ram Sales 3000
2 Meena Sales 4600
3 Abhishek Sales 4100
4 Kunal Finance 3000
5 Ram Sales 3000
6 Srishti Management 3300
7 Sandeep Finance 3900
8 Hitesh Marketing 3000
9 Kailash Marketing 2000
10 Shyam Sales 4100

10 rows

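from pyspark.sql.window import Window
# Note: importing sum, min and max from pyspark.sql.functions shadows the Python built-ins.
from pyspark.sql.functions import col, avg, sum, min, max, row_number

# `df` is assumed to be the employee DataFrame shown above; its definition is not shown.
windowPartitionAgg = Window.partitionBy("Department")
# avg()
df.withColumn("Avg", avg(col("salary")).over(windowPartitionAgg)).show()
# sum()
df.withColumn("Sum", sum(col("salary")).over(windowPartitionAgg)).show()
# min()
df.withColumn("Min", min(col("salary")).over(windowPartitionAgg)).show()
# max()
df.withColumn("Max", max(col("salary")).over(windowPartitionAgg)).show()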

+-------------+----------+------+------+
|Employee_Name|Department|Salary|   Avg|
+-------------+----------+------+------+
|        Kunal|   Finance|  3000|3450.0|
|      Sandeep|   Finance|  3900|3450.0|
|      Srishti|Management|  3300|3300.0|
|       Hitesh| Marketing|  3000|2500.0|
|      Kailash| Marketing|  2000|2500.0|
|          Ram|     Sales|  3000|3760.0|
|        Meena|     Sales|  4600|3760.0|
|     Abhishek|     Sales|  4100|3760.0|
|          Ram|     Sales|  3000|3760.0|
|        Shyam|     Sales|  4100|3760.0|
+-------------+----------+------+------+

+-------------+----------+------+-----+
|Employee_Name|Department|Salary| Sum|
+-------------+----------+------+-----+
| Kunal| Finance| 3000| 6900|
| Sandeep| Finance| 3900| 6900|
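
Unlike a GROUP BY, an aggregate over a window keeps every input row and attaches the partition's summary value to it. A minimal plain-Python sketch of that behaviour (the helper name `with_partition_avg` is hypothetical, and this is not Spark code):

```python
from collections import defaultdict


def with_partition_avg(rows):
    """Append the average salary of each row's department to that row."""
    salaries_by_dept = defaultdict(list)
    for _, dept, salary in rows:
        salaries_by_dept[dept].append(salary)
    return [
        (name, dept, salary, sum(salaries_by_dept[dept]) / len(salaries_by_dept[dept]))
        for name, dept, salary in rows
    ]


rows = [
    ("Kunal", "Finance", 3000),
    ("Sandeep", "Finance", 3900),
    ("Srishti", "Management", 3300),
]
result = with_partition_avg(rows)
for row in result:
    print(row)
```

Both Finance rows receive 3450.0, the same value shown in the Avg column above, and the number of rows is unchanged.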

Date and timestamp functions in PySpark

from pyspark.sql.functions import current_timestamp, to_timestamp
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, DateType, BooleanType

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("timestamp") \
    .getOrCreate()
df = spark.createDataFrame(
    [["1", "2019-07-01 12:01:19.000"], ["2", "2019-06-24 12:01:19.000"]],
    ["id", "input_timestamp"])
df.printSchema()
df.display()

root
|-- id: string (nullable = true)
|-- input_timestamp: string (nullable = true)

Table
id input_timestamp
1 1 2019-07-01 12:01:19.000
2 2 2019-06-24 12:01:19.000

2 rows

# Converting string datatype to timestamp

df1 = df.withColumn("timestamptype", to_timestamp("input_timestamp"))

df1.display()

Table
  
id input_timestamp timestamptype
1 1 2019-07-01 12:01:19.000 2019-07-01T12:01:19.000+0000
2 2 2019-06-24 12:01:19.000 2019-06-24T12:01:19.000+0000

2 rows

# selecting only necessary column and renaming

df2 = df1.select("id", "timestamptype").withColumnRenamed("timestamptype", "input_timestamp")
df2.display()

Table
 
id input_timestamp
1 1 2019-07-01T12:01:19.000+0000
2 2 2019-06-24T12:01:19.000+0000

2 rows


# using cast to convert the timestamp back to a string

df3 = df2.select(col("id"), col("input_timestamp").cast('string'))
df3.display()

Table
 
id input_timestamp
1 1 2019-07-01 12:01:19
2 2 2019-06-24 12:01:19

2 rows

# timestamp type to datetype

df4 = df2.select(col("id"), to_date(col("input_timestamp")))
df4.display()

Table
 
id to_date(input_timestamp)
1 1 2019-07-01
2 2 2019-06-24

2 rows
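
The same three conversions (string to timestamp, timestamp back to string, timestamp to date) have a close analogue in Python's standard datetime module. The sketch below is only that analogue, shown to make the steps concrete; it does not use Spark, and the format string is an assumption matching the sample values.

```python
from datetime import datetime

raw = "2019-07-01 12:01:19.000"

# string -> timestamp (analogue of to_timestamp)
ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")

# timestamp -> string (analogue of cast('string'))
as_string = ts.strftime("%Y-%m-%d %H:%M:%S")

# timestamp -> date (analogue of to_date)
as_date = ts.date()

print(ts, as_string, as_date)
```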


Select top N rows from each group

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import *

spark = SparkSession \
    .builder \
    .appName("TopN") \
    .master("local[*]") \
    .getOrCreate()
sampledata = (("Nitya", "Sales", 3000),
              ("Abhi", "Sales", 4600),
              ("Rakesh", "Sales", 4100),
              ("Sandeep", "finance", 3000),
              ("Abhishek", "Sales", 3000),
              ("Shyan", "finance", 3300),
              ("Madan", "finance", 3900),
              ("Jarin", "marketing", 3000),
              ("kumar", "marketing", 2000))

columns = ["employee_name", "department", "Salary"]

df = spark.createDataFrame(data=sampledata, schema=columns)
df.display()

Table
employee_name department Salary
1 Nitya Sales 3000
2 Abhi Sales 4600
3 Rakesh Sales 4100
4 Sandeep finance 3000
5 Abhishek Sales 3000
6 Shyan finance 3300
7 Madan finance 3900
8 Jarin marketing 3000
9 kumar marketing 2000

9 rows

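# Ordering by salary ascending numbers the lowest salaries first;
# use orderBy(col("salary").desc()) to number the highest salaries first.
windowSpec = Window.partitionBy("department").orderBy("salary")
df1 = df.withColumn("row", row_number().over(windowSpec))  # applying row_number
df1.display()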

Table
   
employee_name department Salary row
1 Nitya Sales 3000 1
2 Abhishek Sales 3000 2
3 Rakesh Sales 4100 3
4 Abhi Sales 4600 4
5 Sandeep finance 3000 1
6 Shyan finance 3300 2
7 Madan finance 3900 3
8 kumar marketing 2000 1
9 Jarin marketing 3000 2

9 rows

# keep the first two rows of each department
df2 = df1.filter(col("row") < 3)

df2.display()


Table
   
employee_name department Salary row
1 Nitya Sales 3000 1
2 Abhishek Sales 3000 2
3 Sandeep finance 3000 1
4 Shyan finance 3300 2
5 kumar marketing 2000 1
6 Jarin marketing 3000 2

6 rows
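
The pattern above (number the rows inside each group, then keep the rows whose number is at most N) can be sketched in plain Python. The helper `top_n_per_group` and its `descending` flag are hypothetical names used only for this illustration; it is not Spark code.

```python
from itertools import groupby


def top_n_per_group(rows, n, descending=False):
    """Number rows within each department by salary and keep the first n of each."""
    # Sort by salary first, then (stably) by department, so each group stays salary-ordered.
    ordered = sorted(rows, key=lambda r: r[2], reverse=descending)
    ordered = sorted(ordered, key=lambda r: r[1])
    result = []
    for _, group in groupby(ordered, key=lambda r: r[1]):
        for row_num, row in enumerate(group, start=1):
            if row_num <= n:
                result.append(row + (row_num,))
    return result


sampledata = [
    ("Nitya", "Sales", 3000), ("Abhi", "Sales", 4600), ("Rakesh", "Sales", 4100),
    ("Sandeep", "finance", 3000), ("Abhishek", "Sales", 3000), ("Shyan", "finance", 3300),
    ("Madan", "finance", 3900), ("Jarin", "marketing", 3000), ("kumar", "marketing", 2000),
]

lowest_two = top_n_per_group(sampledata, 2)
highest_two = top_n_per_group(sampledata, 2, descending=True)
print([r[0] for r in lowest_two])
print([r[0] for r in highest_two])
```

With ascending order the result matches the table above (the two lowest salaries per department); the descending variant returns the two highest, which is usually what "top N" means.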


Drop duplicate records

# drop duplicates in the emp dataframe using distinct() or dropDuplicates()

from pyspark.sql import SparkSession

from pyspark.sql.functions import *

spark = SparkSession \
    .builder \
    .appName("droppingDublicates") \
    .master("local[*]") \
    .getOrCreate()

sample_data = ([1, "ramesh", 1000], [2, "Krishna", 2000], [3, "Shri", 3000], [4, "Pradip", 4000],
               [1, "ramesh", 1000], [2, "Krishna", 2000], [3, "Shri", 3000], [4, "Pradip", 4000])

columns = ["id", "name", "salary"]

df = spark.createDataFrame(data=sample_data, schema=columns)

df.display()

Table
id name salary
1 1 ramesh 1000
2 2 Krishna 2000
3 3 Shri 3000
4 4 Pradip 4000
5 1 ramesh 1000
6 2 Krishna 2000
7 3 Shri 3000
8 4 Pradip 4000

8 rows

# show() returns None, so its result is not assigned to a variable
df.distinct().show()

+---+-------+------+
| id| name|salary|
+---+-------+------+
| 1| ramesh| 1000|
| 2|Krishna| 2000|
| 3| Shri| 3000|
| 4| Pradip| 4000|
+---+-------+------+

df.dropDuplicates().show()

+---+-------+------+
| id| name|salary|
+---+-------+------+
| 1| ramesh| 1000|
| 2|Krishna| 2000|
| 3| Shri| 3000|
| 4| Pradip| 4000|
+---+-------+------+

df.select(["id", "name"]).distinct().show()

+---+-------+
| id| name|
+---+-------+
| 1| ramesh|
| 2|Krishna|
| 3| Shri|
| 4| Pradip|
+---+-------+
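
The difference between removing fully identical rows and removing rows that repeat only on some columns can be sketched in plain Python. This is an illustrative model, not Spark code: Spark does not guarantee which duplicate it keeps or the output order, whereas this sketch keeps the first occurrence. The row with salary 2500 is a hypothetical addition to make the difference visible.

```python
def distinct(rows):
    """Keep the first occurrence of each fully identical row."""
    seen, out = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out


def drop_duplicates(rows, columns, subset=None):
    """Keep the first row for each distinct combination of the `subset` columns."""
    key_cols = subset if subset is not None else columns
    positions = [columns.index(c) for c in key_cols]
    seen, out = set(), []
    for row in rows:
        key = tuple(row[i] for i in positions)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


columns = ["id", "name", "salary"]
rows = [(1, "ramesh", 1000), (2, "Krishna", 2000), (1, "ramesh", 1000), (2, "Krishna", 2500)]

unique_rows = distinct(rows)
unique_ids = drop_duplicates(rows, columns, subset=["id"])
print(unique_rows)
print(unique_ids)
```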

Explode nested array in pyspark

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, flatten

spark = SparkSession.builder.appName("explodeempdata").master("local[*]").getOrCreate()
arrayArrayData = [("Abhishek", [["Java", "scala", "perl"], ["spark", "java"]]),
                  ("Nitya", [["spark", "java", "c++"], ["spark", "java"]]),
                  ("Sandeep", [["csharp", "vb"], ["spark", "python"]])]

df = spark.createDataFrame(data=arrayArrayData, schema=['name', 'subjects'])

df.printSchema()
df.show()
# explode: one output row per inner array
df.select(df.name, explode(df.subjects)).show(truncate=False)
# flatten: merge the inner arrays into a single array per row
df.select(df.name, flatten(df.subjects)).show(truncate=False)

root
|-- name: string (nullable = true)
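
Since most of the output of this example is missing here, the two operations can be illustrated with plain Python lists. This is a sketch of the semantics only, not Spark code: explode produces one row per element of the outer array, while flatten merges the inner arrays into one array and keeps a single row.

```python
def explode(name, array):
    """One (name, element) row per element of the outer array."""
    return [(name, element) for element in array]


def flatten(nested):
    """Merge an array of arrays into a single array."""
    return [item for inner in nested for item in inner]


subjects = [["Java", "scala", "perl"], ["spark", "java"]]

exploded = explode("Abhishek", subjects)
flattened = ("Abhishek", flatten(subjects))
print(exploded)
print(flattened)
```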

Splitting multi delimiter row

data = [(1, "Abhishek", "10|30|40"),
(2, "Krishna", "50|40|70"),
(3, "rakesh", "20|70|90")]

df = spark.createDataFrame(data, schema=["id", "name", "marks"])

df.show()

+---+--------+--------+
| id| name| marks|
+---+--------+--------+
| 1|Abhishek|10|30|40|
| 2| Krishna|50|40|70|
| 3| rakesh|20|70|90|
+---+--------+--------+


from pyspark.sql.functions import split, col

# the split pattern is a regular expression, so the pipe is written as "[|]"
df_s = df.withColumn("mark_details", split(col("marks"), "[|]")) \
    .withColumn("maths", col("mark_details")[0]) \
    .withColumn("physics", col("mark_details")[1]) \
    .withColumn("chemistry", col("mark_details")[2])
display(df_s)

Table
id name marks mark_details maths physics chemistry
1 1 Abhishek 10|30|40 ["10", "30", "40"] 10 30 40
2 2 Krishna 50|40|70 ["50", "40", "70"] 50 40 70
3 3 rakesh 20|70|90 ["20", "70", "90"] 20 70 90

3 rows

from pyspark.sql.functions import split, col

Table
id name maths physics chemistry
1 1 Abhishek 10 30 40
2 2 Krishna 50 40 70
3 3 rakesh 20 70 90

3 rows
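
The pattern "[|]" matters because the split pattern is a regular expression, where a bare pipe means alternation. The sketch below uses Python's re module as a stand-in (Spark uses Java regular expressions, but the pipe is special in both); it shows two equivalent ways to match a literal pipe.

```python
import re

marks = "10|30|40"

# A character class and a backslash escape both match a literal pipe.
parts_bracket = re.split("[|]", marks)
parts_escaped = re.split(r"\|", marks)

maths, physics, chemistry = parts_bracket
print(parts_bracket, parts_escaped)
print(maths, physics, chemistry)
```

Note that the pieces are still strings; cast them to a numeric type before doing arithmetic on the marks.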

How to check whether a required column exists

datacolumn = ['empid', 'empname']
data = [(10, 'Krishna'), (20, 'mahesh'), (30, 'Rakesh')]
df = spark.createDataFrame(data=data, schema=datacolumn)
df.show()

+-----+-------+
|empid|empname|
+-----+-------+
| 10|Krishna|
| 20| mahesh|
| 30| Rakesh|
+-----+-------+

print(df.schema.fieldNames())

['empid', 'empname']

columns = df.schema.fieldNames()
if columns.count('empid') > 0:
    print('empid exists in the dataframe')
else:
    print('not exists')

empid exists in the dataframe
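
Because the column names are an ordinary Python list, the membership operator `in` is a more direct test than `count(...) > 0`. The helper below is a hypothetical sketch; its case-insensitive default reflects the assumption that Spark resolves column names case-insensitively unless spark.sql.caseSensitive is enabled.

```python
def has_column(columns, name, case_sensitive=False):
    """Return True if `name` is one of the column names."""
    if case_sensitive:
        return name in columns
    return name.lower() in [c.lower() for c in columns]


columns = ["empid", "empname"]  # what df.schema.fieldNames() returned above

print(has_column(columns, "empid"))
print(has_column(columns, "EMPID", case_sensitive=True))
print(has_column(columns, "salary"))
```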


pySpark Join
Join is used to combine two or more dataframes based on columns in the dataframes.

Syntax: dataframe1.join(dataframe2, dataframe1.column_name == dataframe2.column_name, "type")

where

dataframe1 is the first dataframe
dataframe2 is the second dataframe
column_name is the column that is matched in both dataframes
type is the type of join to perform

Table
ID NAME Company
1 1 Saroj company 1
2 2 Nitya company 1
3 3 Abhishek company 2
4 4 Sandeep company 1
5 5 Rakesh company 1

5 rows

Table
  
ID salary department
1 1 45000 IT
2 2 145000 Manager
3 6 45000 HR
4 5 34000 Sales

4 rows

Inner join
This will join the two PySpark dataframes on key columns, which are common in both dataframes.

Syntax: dataframe1.join(dataframe2, dataframe1.column_name == dataframe2.column_name, "inner")

Table
ID NAME Company ID salary department
1 1 sravan company 1 1 45000 IT
2 2 ojaswi company 1 2 145000 Manager
3 5 bobby company 1 5 34000 Sales

3 rows

Full Outer Join
This join returns all rows from both dataframes, matching and non-matching; columns from the side without a match
are filled with null. The join type can be written in three equivalent ways.

Syntax:

outer: dataframe1.join(dataframe2, dataframe1.column_name == dataframe2.column_name, "outer")
full: dataframe1.join(dataframe2, dataframe1.column_name == dataframe2.column_name, "full")
fullouter: dataframe1.join(dataframe2, dataframe1.column_name == dataframe2.column_name, "fullouter")


#Using outer keyword

Table
ID NAME Company ID salary department
1 1 Nitya company 1 1 45000 IT
2 2 Ramesh company 1 2 145000 Manager
3 3 Abhishek company 2 null null null
4 4 Sandeep company 1 null null null
5 5 Manisha company 1 5 34000 Sales
6 null null null 6 45000 HR

6 rows

# using full keyword

Table
ID NAME Company ID salary department
1 1 Nitya company 1 1 45000 IT
2 2 Rakesh company 1 2 145000 Manager
3 3 Abhishek company 2 null null null
4 4 Anjali company 1 null null null
5 5 Saviya company 1 5 34000 Sales
6 null null null 6 45000 HR

6 rows

Left Join
This join returns all rows from the first dataframe and only the matching rows from the second dataframe; where
there is no match, the columns of the second dataframe are null. We can perform this type of join using left or
leftouter.

Syntax:
left: dataframe1.join(dataframe2, dataframe1.column_name == dataframe2.column_name, "left")
leftouter: dataframe1.join(dataframe2, dataframe1.column_name == dataframe2.column_name, "leftouter")

Table
ID NAME Company ID salary department
1 1 Amit company 1 1 45000 IT
2 2 Rakesh company 1 2 145000 Manager
3 3 Abhishek company 2 null null null
4 4 Sri company 1 null null null
5 5 Sachin company 1 5 34000 Sales

5 rows

Right Join
This join returns all rows from the second dataframe and only the matching rows from the first dataframe; where
there is no match, the columns of the first dataframe are null. We can perform this type of join using right or
rightouter.

Syntax:

right: dataframe1.join(dataframe2, dataframe1.column_name == dataframe2.column_name, "right")
rightouter: dataframe1.join(dataframe2, dataframe1.column_name == dataframe2.column_name, "rightouter")


# right join on two dataframes

dataframe.join(dataframe1, dataframe.ID == dataframe1.ID, "right").display()

Table
     
ID NAME Company ID salary department
1 1 Manisha company 1 1 45000 IT
2 2 Aarti company 1 2 145000 Manager
3 null null null 6 45000 HR
4 5 Virat company 1 5 34000 Sales

4 rows


Leftsemi join
This join returns only the rows of the first dataframe that have a match in the second dataframe. No columns from the second dataframe are included in the result.

Syntax: dataframe1.join(dataframe2, dataframe1.column_name == dataframe2.column_name, "leftsemi")

# leftsemi join on two dataframes

dataframe.join(dataframe1, dataframe.ID == dataframe1.ID, "leftsemi").display()

Table
ID NAME Company
1 1 Mitchell company 1
2 2 Rachin company 1
3 5 Thomas company 1

3 rows

LeftAnti join
This join returns only the rows of the first dataframe that have no match in the second dataframe, with the columns of the first dataframe only.

Syntax: dataframe1.join(dataframe2, dataframe1.column_name == dataframe2.column_name, "leftanti")

Table
ID NAME Company
1 3 Rohit company 2
2 4 Srini company 1

2 rows
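
The join types above differ only in which keys survive and which columns are returned. A compact plain-Python model, using the IDs from the two input tables and assuming each ID appears at most once per side, might look like this (the `join` helper is hypothetical and is not the PySpark API):

```python
def join(left, right, how):
    """Model of the join types on two {id: value} dicts with unique keys."""
    if how == "leftsemi":   # left rows that have a match, left columns only
        return [(k, left[k]) for k in left if k in right]
    if how == "leftanti":   # left rows without a match, left columns only
        return [(k, left[k]) for k in left if k not in right]
    if how == "inner":
        keys = [k for k in left if k in right]
    elif how == "left":
        keys = list(left)
    elif how == "right":
        keys = list(right)
    elif how == "outer":
        keys = list(left) + [k for k in right if k not in left]
    else:
        raise ValueError(f"unsupported join type: {how}")
    # Missing sides are filled with None, like null in the tables above.
    return [(k, left.get(k), right.get(k)) for k in keys]


employees = {1: "Saroj", 2: "Nitya", 3: "Abhishek", 4: "Sandeep", 5: "Rakesh"}
salaries = {1: 45000, 2: 145000, 6: 45000, 5: 34000}

for how in ["inner", "left", "right", "outer", "leftsemi", "leftanti"]:
    print(how, join(employees, salaries, how))
```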

SQL Expression
We can perform all of the joins above with an SQL expression by naming the join type in the query. To do this, we
first have to register each dataframe as a temporary view.

Syntax: dataframe.createOrReplaceTempView("name")

where

dataframe is the input dataframe
name is the view name

Syntax: spark.sql("select * from dataframe1, dataframe2 where dataframe1.column_name == dataframe2.column_name")

where

dataframe1 is the view of the first dataframe
dataframe2 is the view of the second dataframe
column_name is the column to be joined on


# creating a dataframe from the lists of data

dataframe1 = spark.createDataFrame(data1, columns)

# create a view for dataframe named student

dataframe.createOrReplaceTempView("student")

# create a view for dataframe1 named department

dataframe1.createOrReplaceTempView("department")

# use sql expression to join on the ID column

spark.sql("select * from student, department where student.ID == department.ID").display()

Table
ID NAME Company ID salary department
1 1 Manoj company 1 1 45000 IT
2 2 Manisha company 1 2 145000 Manager
3 5 RAkesh company 1 5 34000 Sales

3 rows

Syntax: spark.sql("select * from dataframe1 JOIN_TYPE dataframe2 ON dataframe1.column_name == dataframe2.column_name")

where JOIN_TYPE is any of the join types above, for example INNER JOIN.


# create a view for dataframe named student

dataframe.createOrReplaceTempView("student")

# create a view for dataframe1 named department

dataframe1.createOrReplaceTempView("department")

# inner join on id column using sql expression

spark.sql("select * from student INNER JOIN department on student.ID == department.ID").show()

+---+--------+---------+---+------+----------+
| ID|    NAME|  Company| ID|salary|department|
+---+--------+---------+---+------+----------+
|  1|  Ramesh|company 1|  1| 45000|        IT|
|  2|  Rakesh|company 1|  2|145000|   Manager|
|  5|Ravindra|company 1|  5| 34000|     Sales|
+---+--------+---------+---+------+----------+

Using functools
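The functools module provides functions for working with other functions and callable objects, to use or extend
them without completely rewriting them. Here functools.reduce() is used to union a whole list of dataframes,
selecting the columns of each dataframe in the order of the first one before the positional union.

Syntax:

functools.reduce(lambda df1, df2: df1.union(df2.select(df1.columns)), dfs)

where

df1 is the first dataframe
df2 is the second dataframe
dfs is the list of dataframes to combine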

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('pyspark - example join').getOrCreate()
data = [(('Ram'), '1991-04-01', 'M', 3000),
        (('Mike'), '2000-05-19', 'M', 4000),
        (('Rohini'), '1978-09-05', 'M', 4000),
        (('Maria'), '1967-12-01', 'F', 4000),
        (('Jenis'), '1980-02-17', 'F', 1200)]
columns = ["Name", "DOB", "Gender", "salary"]
df1 = spark.createDataFrame(data=data, schema=columns)
df1.show()

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+

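data2 = [(('Mohi'), '1991-04-01', 'M', 3000),
         (('Ani'), '2000-05-19', 'F', 4300),
         (('Shipta'), '1978-09-05', 'F', 4200),
         (('Jessy'), '1967-12-01', 'F', 4010),
         (('kanne'), '1980-02-17', 'F', 1200)]
columns = ["Name", "DOB", "Gender", "salary"]
# Note: this call passes `data`, not `data2`, so df2 repeats the rows of df1
# (which is why the output below lists them twice). Pass data=data2 to use the second list.
df2 = spark.createDataFrame(data=data, schema=columns)
df2.show()

import functools

def unionAll(dfs):
    # align each dataframe to the first one's column order before the positional union
    return functools.reduce(lambda df1, df2: df1.union(
        df2.select(df1.columns)), dfs)

result3 = unionAll([df1, df2])
result3.display()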

Table
Name DOB Gender salary
1 Ram 1991-04-01 M 3000
2 Mike 2000-05-19 M 4000
3 Rohini 1978-09-05 M 4000
4 Maria 1967-12-01 F 4000
5 Jenis 1980-02-17 F 1200
6 Ram 1991-04-01 M 3000
7 Mike 2000-05-19 M 4000
8 Rohini 1978-09-05 M 4000
9 Maria 1967-12-01 F 4000
10 Jenis 1980-02-17 F 1200

10 rows
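
How reduce() folds a list of inputs into one result, re-ordering columns at each step, can be sketched without Spark. In this illustrative model a "frame" is a (columns, rows) pair, and `union_aligned` is a hypothetical stand-in for `df1.union(df2.select(df1.columns))`.

```python
from functools import reduce


def union_aligned(frame1, frame2):
    """Reorder frame2's columns to match frame1, then append its rows."""
    cols1, rows1 = frame1
    cols2, rows2 = frame2
    index = [cols2.index(c) for c in cols1]
    return cols1, rows1 + [tuple(r[i] for i in index) for r in rows2]


frames = [
    (["Name", "salary"], [("Ram", 3000)]),
    (["salary", "Name"], [(4300, "Ani")]),     # same columns, different order
    (["Name", "salary"], [("Jessy", 4010)]),
]

# reduce applies union_aligned pairwise: ((frames[0] + frames[1]) + frames[2])
combined = reduce(union_aligned, frames)
print(combined)
```

The same helper works for any number of frames, which is the main reason to prefer reduce() over chaining union calls by hand.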

Filter Columns with None or Null Values

DataFrames often contain NULL/None values in their columns. In many cases these values must be handled before any other
operation is performed, typically by filtering them out, in order to get the desired result.

df.filter(condition) : Returns a new DataFrame containing only the rows that satisfy the given condition.
df.column_name.isNotNull() : Returns a condition that is true for rows whose value in that column is not NULL/None; pass it to filter() to keep only those rows.
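As a rough illustration of how the two pieces fit together, the sketch below models them in plain Python, with rows as dictionaries. The helpers filter_rows and is_not_null are hypothetical names for this example only, and the two rows are made up.

```python
# Made-up rows for illustration; None plays the role of NULL.
rows = [
    {"Name": "Pulkit", "Class ID": "CS32"},
    {"Name": "Reshav", "Class ID": None},
]


def filter_rows(rows, condition):
    # Like filter(condition): returns a new list and leaves the input unchanged.
    return [row for row in rows if condition(row)]


def is_not_null(column):
    # Like column.isNotNull(): true when the row has a non-None value there.
    return lambda row: row[column] is not None


kept = filter_rows(rows, is_not_null("Class ID"))
print(kept)  # [{'Name': 'Pulkit', 'Class ID': 'CS32'}]

# PySpark form, assuming a DataFrame df with the same column (not run here):
#   df.filter(df["Class ID"].isNotNull())
```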


Drop rows with null values

PySpark dropna() Syntax
The dropna() method (also available as df.na.drop()) takes three optional arguments that control how rows with NULL values
are removed, based on NULLs in any, all, or a chosen subset of DataFrame columns. Because dropna() is a transformation, it
returns a new DataFrame with the matching rows removed and leaves the current DataFrame unchanged. It should not be
confused with DataFrame.drop(), which removes columns.

dropna(how='any', thresh=None, subset=None)

All of these arguments are optional.

how – Accepts 'any' or 'all'. With 'any', a row is dropped if it contains a NULL in any column. With 'all', a row is
dropped only if every column is NULL. The default is 'any'.
thresh – An int; rows with fewer than thresh non-null values are dropped. When set, it overrides how. The default is None.
subset – The column names to check for NULL values. The default is None, which checks all columns.
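The three arguments interact in ways that are easier to see on a small example. The following is a plain-Python sketch of the rules, not the PySpark implementation: drop_null_rows is a hypothetical helper, the rows are made up, and it assumes that thresh takes priority over how when both are given.

```python
def drop_null_rows(rows, how="any", thresh=None, subset=None):
    """Plain-Python model of the how / thresh / subset rules."""
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row)
        non_null = sum(row[c] is not None for c in columns)
        if thresh is not None:
            keep = non_null >= thresh        # thresh takes priority over how
        elif how == "any":
            keep = non_null == len(columns)  # drop if any checked value is null
        elif how == "all":
            keep = non_null > 0              # drop only if every value is null
        else:
            raise ValueError("how must be 'any' or 'all'")
        if keep:
            kept.append(row)
    return kept


# Made-up rows for illustration only.
rows = [
    {"name": "a", "class_id": "CS32", "hobby": "Writing"},
    {"name": "b", "class_id": "BB21", "hobby": None},
    {"name": "c", "class_id": None, "hobby": None},
    {"name": None, "class_id": None, "hobby": None},
]

print(len(drop_null_rows(rows, how="any")))            # 1
print(len(drop_null_rows(rows, how="all")))            # 3
print(len(drop_null_rows(rows, thresh=2)))             # 2
print(len(drop_null_rows(rows, subset=["class_id"])))  # 2
```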

pip install findspark

Python interpreter will be restarted.

Collecting findspark
Using cached [Link] (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1
Python interpreter will be restarted.


from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType
from pyspark.sql import SparkSession
import findspark

findspark.init('_path-to-spark_')
data2 = [("Pulkit", 12, "CS32", 82, "Programming"),
         ("Ritika", 20, "CS32", 94, "Writing"),
         ("Atirikt", 4, "BB21", 78, None),
         ("Reshav", 18, None, 56, None)
         ]
spark = SparkSession.builder.appName("Student_Info").getOrCreate()
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Roll Number", IntegerType(), True),
    StructField("Class ID", StringType(), True),
    StructField("Marks", IntegerType(), True),
    StructField("Extracurricular", StringType(), True)
])
df = spark.createDataFrame(data=data2, schema=schema)

# drop rows that contain None in any column
df.dropna(how="any").show(truncate=False)

# stop spark session
spark.stop()
The spark context has stopped and the driver is restarting. Your notebook will be automatically reattached.


Remove all columns where the entire column is null

spark.createDataFrame()

Parameters:

data: An RDD of any kind of SQL data representation (e.g. Row, tuple, int, boolean, etc.), or a list, or a
pandas.DataFrame.
schema: A datatype string or a list of column names; default is None.
samplingRatio: The sample ratio of rows used for inferring the schema.
verifySchema: Verify data types of every row against the schema. Enabled by default.
Returns: DataFrame


from pyspark.sql import SparkSession
import pyspark.sql.types as T

spark = SparkSession.builder.appName('My App').getOrCreate()

actor_data = [
    ("James", None, "Bond", "M", 6000),
    ("Michael", None, None, "M", 4000),
    ("Robert", None, "Pattinson", "M", 4000),
    ("Natalie", None, "Portman", "F", 4000),
    ("Julia", None, "Roberts", "F", 1000)
]
actor_schema = T.StructType([
    T.StructField("firstname", T.StringType(), True),
    T.StructField("middlename", T.StringType(), True),
    T.StructField("lastname", T.StringType(), True),
    T.StructField("gender", T.StringType(), True),
    T.StructField("salary", T.IntegerType(), True)
])

df = spark.createDataFrame(data=actor_data, schema=actor_schema)
df.show(truncate=False)


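+---------+----------+---------+------+------+
|firstname|middlename|lastname |gender|salary|
+---------+----------+---------+------+------+
|James    |null      |Bond     |M     |6000  |
|Michael  |null      |null     |M     |4000  |
|Robert   |null      |Pattinson|M     |4000  |
|Natalie  |null      |Portman  |F     |4000  |
|Julia    |null      |Roberts  |F     |1000  |
+---------+----------+---------+------+------+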

import pyspark.sql.functions as F

# Count the NULL values in every column in a single pass
null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(
    c) for c in df.columns]).collect()[0].asDict()
print(null_counts)
df_size = df.count()
# A column is entirely NULL when its NULL count equals the row count
to_drop = [k for k, v in null_counts.items() if v == df_size]
print(to_drop)
output_df = df.drop(*to_drop)

output_df.show(truncate=False)

{'firstname': 0, 'middlename': 5, 'lastname': 1, 'gender': 0, 'salary': 0}

['middlename']
+---------+---------+------+------+
|firstname|lastname |gender|salary|
+---------+---------+------+------+
|James    |Bond     |M     |6000  |
|Michael  |null     |M     |4000  |
|Robert   |Pattinson|M     |4000  |
|Natalie  |Portman  |F     |4000  |
|Julia    |Roberts  |F     |1000  |
+---------+---------+------+------+


Filtering rows based on column values

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]
columns = ['ID', 'NAME', 'Company']
dataframe = spark.createDataFrame(data, columns)

dataframe.show()

+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  1| sravan|company 1|
|  2| ojaswi|company 1|
|  3| rohith|company 2|
|  4|sridevi|company 1|
|  1| sravan|company 1|
|  4|sridevi|company 1|
+---+-------+---------+

Using where() function

Syntax: dataframe.where(condition)

Filter rows in the dataframe where ID = 1


dataframe.where(dataframe.ID == '1').show()

+---+------+---------+
| ID|  NAME|  Company|
+---+------+---------+
|  1|sravan|company 1|
|  1|sravan|company 1|
+---+------+---------+

dataframe.where(dataframe.NAME != 'sravan').show()

+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  2| ojaswi|company 1|
|  3| rohith|company 2|
|  4|sridevi|company 1|
|  4|sridevi|company 1|
+---+-------+---------+

dataframe.where(dataframe.ID > '3').show()

+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  4|sridevi|company 1|
|  4|sridevi|company 1|
+---+-------+---------+
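One caveat about the last filter: the ID values here are strings, and they are compared with the string '3'. Assuming the comparison is done on strings, the ordering is lexicographic rather than numeric, which works for single-digit IDs but not for longer ones. The plain-Python sketch below shows the difference; the ID "10" is added purely for illustration.

```python
ids = ["1", "2", "3", "4", "10"]  # "10" is a made-up extra value

# String comparison is lexicographic: "10" sorts before "3".
as_strings = [i for i in ids if i > "3"]

# Converting to integers first gives the numeric result.
as_numbers = [i for i in ids if int(i) > 3]

print(as_strings)  # ['4']
print(as_numbers)  # ['4', '10']
```

In PySpark, one way to get the numeric behaviour is to cast the column first, for example dataframe.where(dataframe.ID.cast('int') > 3).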


Count rows based on condition in Pyspark

Using where() function.
Using filter() function.

+---+-------+-------+
| ID|   NAME|college|
+---+-------+-------+
|  1|  Nitya| vignan|
|  2| Nitesh|   vvit|
|  3|   Neha|   vvit|
|  4| Neerak| vignan|
|  1|Neekung| vignan|
|  5| Neelam|    iit|
+---+-------+-------+

dataframe.count()

Out[8]: 6


# dataframe.where(condition)
print(dataframe.where(dataframe.ID == '1').count())
print('They are ')
dataframe.where(dataframe.ID == '1').show()

2
They are
+---+-------+-------+
| ID|   NAME|college|
+---+-------+-------+
|  1|  Nitya| vignan|
|  1|Neekung| vignan|
+---+-------+-------+

# dataframe.filter(condition); filter() is an alias of where()
print(dataframe.filter(dataframe.ID != '1').count())
print(dataframe.filter(dataframe.college == 'vignan').count())
print(dataframe.filter(dataframe.ID > 2).count())

4
3
3


Scenario based question

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("StorageVolume").getOrCreate()

# (room name, product_id, units stored)
storage_houseData = [("Room1", 1, 1),
                     ("Room1", 2, 10),
                     ("Room1", 3, 5),
                     ("Room2", 1, 2),
                     ("Room2", 2, 2),
                     ("Room3", 4, 1)
                     ]
storage_house_df = spark.createDataFrame(storage_houseData, ["name", "product_id", "units"])

product_data = [
    (1, "Mencollection", 5, 50, 40),
    (2, "Girlcollection", 5, 5, 5),
    (3, "Childcollection", 2, 10, 10),
    (4, "Womencollection", 4, 10, 20)
]

product_df = spark.createDataFrame(product_data, ["product_id", "product_name", "Width", "Length", "Height"])

# calculating the volume for each item in the storage house
result_df = storage_house_df.join(product_df, on="product_id", how="inner") \
    .withColumn("volume", col("Width") * col("Length") * col("Height")) \
    .groupBy("name") \
    .agg({"volume": "sum"}) \
    .withColumnRenamed("sum(volume)", "volume")  # must match the aggregate column name exactly

result_df.show()

+-----+------+
| name|volume|
+-----+------+
|Room3|   800|
|Room2| 10125|
|Room1| 10325|
+-----+------+
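The per-room totals above can be checked by hand. The short plain-Python sketch below repeats the same arithmetic on the values from the example: each stored product contributes Width * Length * Height once, and the results are summed per room. Note that, as written, the query does not use the units column.

```python
# (name, product_id, units) rows and product_id -> (width, length, height),
# copied from the example data.
storage = [("Room1", 1, 1), ("Room1", 2, 10), ("Room1", 3, 5),
           ("Room2", 1, 2), ("Room2", 2, 2), ("Room3", 4, 1)]
dims = {1: (5, 50, 40), 2: (5, 5, 5), 3: (2, 10, 10), 4: (4, 10, 20)}

totals = {}
for name, product_id, _units in storage:  # units is not used by the query
    width, length, height = dims[product_id]
    totals[name] = totals.get(name, 0) + width * length * height

print(totals)  # {'Room1': 10325, 'Room2': 10125, 'Room3': 800}
```

If the intent were the total volume occupied by all stored units, each product's volume would presumably need to be multiplied by units before summing; that is an assumption about the question, not something the notebook states.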


Common questions

Aggregate functions in PySpark operate across a group of rows and return a single summary value for the group, such as SUM, AVG, MIN, or MAX, typically in combination with a GROUP BY clause or groupBy(). Analytical functions instead operate over a window of rows and return a result for each row within that window, so the output usually has the same number of rows as the input. This enables row-wise comparisons with functions such as LEAD, LAG, or the window ranking functions.

Choosing SQL expressions over PySpark methods for joining dataframes can offer advantages, such as better readability for those familiar with SQL syntax and access to the more complex querying features that SQL provides. SQL can also simplify working with the various join types and group-based operations, which helps maintainability and makes the logic clearer, particularly in complex queries that involve multiple conditions or datasets.

PySpark window functions like lead() and lag() allow analysis relative to other rows within the same group by providing access to subsequent or previous rows' data, respectively. The lag function returns the value from a previous row, as defined by an offset, enabling comparison of the current row against earlier data. Conversely, the lead function returns the value from a later row, which is useful for projections or comparisons with succeeding data points within a window partition.
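The offset behaviour can be sketched in plain Python for a single, already ordered partition. The lag and lead functions below are hypothetical helpers that mimic the semantics described above (with a non-negative offset); they are not the PySpark functions themselves, and the salary values are made up.

```python
def lag(values, offset=1, default=None):
    # Value from `offset` rows earlier in the ordered partition, else default.
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values, offset=1, default=None):
    # Value from `offset` rows later in the ordered partition, else default.
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]


salaries = [1200, 3000, 4000]  # one partition, already ordered

print(lag(salaries))   # [None, 1200, 3000]
print(lead(salaries))  # [3000, 4000, None]

# PySpark form, assuming a DataFrame with Gender and salary columns (not run here):
#   from pyspark.sql import Window
#   import pyspark.sql.functions as F
#   w = Window.partitionBy("Gender").orderBy("salary")
#   df.withColumn("prev_salary", F.lag("salary", 1).over(w))
```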

A full outer join in PySpark combines rows from both dataframes based on a joining condition, including all matching and non-matching rows from both sides. It preserves all available data from both datasets, filling in nulls where a row has no match on the other side. It is often used when a complete view of both datasets is necessary, especially for identifying mismatches or when retaining unmatched data is critical for further analysis.
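A minimal plain-Python sketch of this behaviour, assuming each side has at most one row per join key: full_outer_join is a hypothetical helper, each side is modelled as a dictionary from key to value, and the data is made up.

```python
def full_outer_join(left, right):
    """left and right map a join key to a value; a missing side becomes None."""
    keys = sorted(set(left) | set(right))
    return [(key, left.get(key), right.get(key)) for key in keys]


left = {1: "Ram", 2: "Mike"}
right = {2: "HR", 3: "Sales"}

print(full_outer_join(left, right))
# [(1, 'Ram', None), (2, 'Mike', 'HR'), (3, None, 'Sales')]

# PySpark form, assuming DataFrames df1 and df2 that share an "id" column (not run here):
#   df1.join(df2, on="id", how="full_outer")
```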

The unionByName method in PySpark is used to union two DataFrames whose schemas may differ. Columns are aligned by name rather than by position. When the schemas differ, columns that are missing in one dataframe can be handled by setting the allowMissingColumns parameter to True, in which case those columns are filled with null values in the resulting dataframe.

In PySpark, the unionByName function can handle additional columns present in one dataframe and not the other by using allowMissingColumns=True: the extra columns are kept in the result, and rows that come from the dataframe lacking them get nulls in those columns. This allows datasets with inconsistent schemas to be combined while preserving the values of columns that exist on only one side.
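The name-based alignment and null filling can be modelled in plain Python with rows as dictionaries. This is a sketch of the described behaviour only: union_by_name is a hypothetical helper, it raises ValueError where PySpark would report its own error, and the rows are made up.

```python
def union_by_name(rows_a, rows_b, allow_missing_columns=False):
    """Rows are dicts keyed by column name; columns are matched by name."""
    cols_a = list(rows_a[0]) if rows_a else []
    cols_b = list(rows_b[0]) if rows_b else []
    if set(cols_a) != set(cols_b) and not allow_missing_columns:
        raise ValueError("column sets differ")
    # Keep the first frame's column order, then append columns only in the second.
    all_cols = cols_a + [c for c in cols_b if c not in cols_a]
    return [{c: row.get(c) for c in all_cols} for row in rows_a + rows_b]


a = [{"Name": "Ram", "salary": 3000}]
b = [{"Name": "Ani", "Gender": "F"}]

merged = union_by_name(a, b, allow_missing_columns=True)
print(merged)
# [{'Name': 'Ram', 'salary': 3000, 'Gender': None},
#  {'Name': 'Ani', 'salary': None, 'Gender': 'F'}]

# PySpark form (not run here):
#   df1.unionByName(df2, allowMissingColumns=True)
```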

A left anti join is useful when one needs to retrieve the records from the first dataframe that have no corresponding match in the second dataframe, for example to identify elements that the two datasets do not have in common. Conversely, a left semi join returns only the records from the first dataframe that do have a match, essentially filtering the first dataset to its intersection with the second; it is often used to confirm inclusion. In both cases only the first dataframe's columns are returned.
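The two joins are complementary filters on the left side, which a plain-Python sketch makes concrete. The data is made up, and the right side is reduced to its set of join keys since neither join returns its columns.

```python
left = [(1, "Ram"), (2, "Mike"), (3, "Rohini")]  # (id, name) rows
right_keys = {2, 3, 4}                           # join keys present on the right

left_semi = [row for row in left if row[0] in right_keys]
left_anti = [row for row in left if row[0] not in right_keys]

print(left_semi)  # [(2, 'Mike'), (3, 'Rohini')]
print(left_anti)  # [(1, 'Ram')]

# PySpark form, assuming DataFrames df1 and df2 that share an "id" column (not run here):
#   df1.join(df2, on="id", how="left_semi")
#   df1.join(df2, on="id", how="left_anti")
```

Together the two results always partition the left side: every left row lands in exactly one of them.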

The cume_dist() function calculates the cumulative distribution within a window partition: for each row it returns the proportion of rows whose value is less than or equal to the current row's value. Its results are greater than 0 and at most 1, providing a normalized ranking. The rank() function, on the other hand, assigns a rank to rows within a partition based on the specified ordering, but it leaves gaps in rank values when there are ties, resulting in non-continuous ranking numbers.

The dense_rank() function in PySpark ensures continuous ranking by assigning consecutive ranks to rows in a window partition without leaving gaps for ties, unlike the rank function, which skips ranking numbers when ties occur. This continuity is useful for generating a tightly packed ranking order, making it easier to evaluate positions without accounting for skipped numbers.

The percent_rank() function in PySpark calculates the relative rank of each row within its partition, computed as (rank - 1) / (number of rows in the partition - 1), which allows a normalized comparison across partitions of different sizes. Unlike the rank function, which gives ordinal positions, percent_rank outputs a value between 0 and 1, making it suitable for scoring or ranking data on a uniform scale.
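The four ranking functions discussed above can be compared side by side on one small partition. The sketch below is a plain-Python model of their definitions for an ascending ordering, not the PySpark implementation; window_ranks is a hypothetical helper and the values are made up to include a tie.

```python
def window_ranks(values):
    """Return (value, rank, dense_rank, percent_rank, cume_dist) per row."""
    ordered = sorted(values)
    n = len(ordered)
    distinct = sorted(set(ordered))
    out = []
    for v in ordered:
        rank = ordered.index(v) + 1               # 1 + rows strictly before v
        dense = distinct.index(v) + 1             # no gaps after ties
        pct = (rank - 1) / (n - 1) if n > 1 else 0.0
        cume = sum(x <= v for x in ordered) / n   # share of rows <= v
        out.append((v, rank, dense, pct, cume))
    return out


ranks = window_ranks([1200, 3000, 3000, 4000])
for row in ranks:
    print(row)
```

With the tie at 3000, rank gives 1, 2, 2, 4 (skipping 3) while dense_rank gives 1, 2, 2, 3; cume_dist gives 0.25, 0.75, 0.75, 1.0, and percent_rank runs from 0.0 for the first row to 1.0 for the last.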
