Pyspark Union and UnionByName Guide
Pyspark Union and UnionByName Guide
Pyspark practice
([Link]
Union and UnionByName transformation
Concept :-
union works when the columns of both DataFrames being joined are in the same order. It can give surprisingly wrong
results when the schemas aren’t the same.
unionByName works when both DataFrames have the same columns, but in a different order.
union and unionByName transformation are used to merge two or more dataframes of the same schema and structure.
Syntax of unionByName()
Syntax: data_frame1.union(data_frame2)
Where,
[Link] 1/66
30/05/2025, 08:13 Pyspark practice - Databricks
Example 1
americans = [Link](
[("bob", 42), ("lisa", 59)], ["first_name", "age"]
)
colombians = [Link](
[("maria", 20), ("camilo", 31)], ["first_name", "age"]
)
res = [Link](colombians)
[Link]()
+----------+---+
|first_name|age|
+----------+---+
| bob| 42|
| lisa| 59|
| maria| 20|
| camilo| 31|
+----------+---+
+---+-------+----------+------+
| id| name|department|salary|
+---+-------+----------+------+
[Link] 2/66
30/05/2025, 08:13 Pyspark practice - Databricks
+---+-------+----------+------+
| id| name|department|salary|
+---+-------+----------+------+
| 1|Krishna| IT| 10000|
+---+-------+----------+------+
[Link](df2).show()
+---+-------+----------+------+
| id| name|department|salary|
+---+-------+----------+------+
| 1|Krishna| IT| male|
| 1|Krishna| IT| 10000|
+---+-------+----------+------+
[Link](df2, allowMissingColumns=True).show()
# if columns will miss then also it will union. like here was a difference gender and salary.
+---+-------+----------+------+
| id| name|department|salary|
+---+-------+----------+------+
| 1|Krishna| IT| male|
| 1|Krishna| IT| 10000|
+---+-------+----------+------+
[Link] 3/66
30/05/2025, 08:13 Pyspark practice - Databricks
Remember - If number will mismatch(if different column name or different number of column) then we will use
unionByName with True.
Example 2
# union
data_frame1 = [Link](
[("Nitya", 82.98), ("Abhishek", 80.31)],
["Student Name", "Overall Percentage"]
)
# union()
UnionEXP = data_frame1.union(data_frame2)
[Link]()
+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
| Nitya| 82.98|
| Abhishek| 80.31|
| Sandeep| 91.123|
| Rakesh| 90.51|
[Link] 4/66
30/05/2025, 08:13 Pyspark practice - Databricks
+------------+------------------+
Syntax: data_frame1.unionByName(data_frame2)
data_frame1 = [Link](
[("Nitya", 82.98), ("Abhishek", 80.31)],
["Student Name", "Overall Percentage"]
)
[Link]()
see in this example , data_frame1 and data_frame2 are of different schema but the output is the desired one.
+------------+------------------+
|Student Name|Overall Percentage|
[Link] 5/66
30/05/2025, 08:13 Pyspark practice - Databricks
+------------+------------------+
| Nitya| 82.98|
| Abhishek| 80.31|
| Naveen| 91.123|
| Sandeep| 90.51|
| Rakesh| 87.67|
+------------+------------------+
data_frame1 = [Link](
[("Bhuwanesh", 82.98, "Computer Science"), ("Harshit", 80.31, "Information Technology")],["Student Name", "Overall
Percentage", "Department"]
)
column_name_morein1df.show()
in this example we have more columnname in 1st dataframe but less in 2nd dataframe.
but we are able to do our desired result with the help of unionByName().
+------------+------------------+--------------------+
|Student Name|Overall Percentage| Department|
+------------+------------------+--------------------+
| Bhuwanesh| 82.98| Computer Science|
| Harshit| 80.31|Information Techn...|
| Naveen| 91.123| null|
[Link] 6/66
30/05/2025, 08:13 Pyspark practice - Databricks
Window function:
pySpark window functions are useful when you want to examine relationships within group of data rather than
between group of data. It performs statictical operations like below explained.
PySpark Window function performs statistical operations such as rank, row number, etc. on a group, frame, or
collection of rows and returns results for each row individually. It is also popularly growing to perform data
transformations.
Analytical Function
Ranking Function
Aggregate Function
Analytical functions:
An analytic function is a function that returns a result after operating on data or a finite set of rows partitioned
by a SELECT clause or in the ORDER BY clause. It returns a result in the same number of rows as the number of input
rows. E.g. lead(), lag(), cume_dist().
[Link] 7/66
30/05/2025, 08:13 Pyspark practice - Databricks
root
+-------------+---+----------+------+
[Link] 8/66
30/05/2025, 08:13 Pyspark practice - Databricks
|Employee_Name|Age|Department|Salary|
+-------------+---+----------+------+
| Nitya| 28| Sales| 3000|
| Abhishek| 33| Sales| 4600|
| Sandeep| 40| Sales| 4100|
| Rakesh| 25| Finance| 3000|
| Ram| 28| Sales| 3000|
| Srishti| 46|Management| 3300|
| Arbind| 26| Finance| 3900|
| Hitesh| 30| Marketing| 3000|
| Kailash| 29| Marketing| 2000|
| Sushma| 39| Sales| 4100|
+-------------+---+----------+------+
Table
Employee_Name Age Department Salary
1 Nitya 28 Sales 3000
2 Abhishek 33 Sales 4600
10 rows
[Link] 9/66
30/05/2025, 08:13 Pyspark practice - Databricks
Using cume_dist():
cume_dist() window function is used to get the cumulative distribution within a window partition.
Table
Employee_Name Age Department Salary cume_dist
1 Rakesh 25 Finance 3000 0.5
2 Arbind 26 Finance 3900 1
3 Srishti 46 Management 3300 1
4 Kailash 29 Marketing 2000 0.5
5 Hitesh 30 Marketing 3000 1
6 Nitya 28 Sales 3000 0.4
7 Ram 28 Sales 3000 04
10 rows
Using lag()
A lag() function is used to access previous rows’ data as per the defined offset value in the function.
Table
[Link] 10/66
30/05/2025, 08:13 Pyspark practice - Databricks
Using lead()
A lead() function is used to access next rows data as per the defined offset value in the function.
Table
Employee_Name Age Department Salary Lead
1 Rakesh 25 Finance 3000 null
2 Arbind 26 Finance 3900 null
3 Srishti 46 Management 3300 null
4 Kailash 29 Marketing 2000 null
5 Hitesh 30 Marketing 3000 null
6 Nitya 28 Sales 3000 4600
7 Ram 28 Sales 3000 4100
10 rows
[Link] 11/66
30/05/2025, 08:13 Pyspark practice - Databricks
Ranking function
Ranking Function
The function returns the statistical rank of a given value for each row in a partition or group. The goal of this
function is to provide consecutive numbering of the rows in the resultant column, set by the order selected in the
[Link] for each partition specified in the OVER clause. E.g. row_number(), rank(), dense_rank(), etc.
A. row_number() : row_number() window functions is used to give the sequential row number starting from 1 to the
result of each
window partition.
B. rank():
rank window functions is used to provide a rank to the result within a window partition. this functions leaves
gaps in rank when there are ties.
Example :- 1 1 1 4 this is rank
C. dense_rank(): dense_rank() window function is used to get the result with rank of rows within a window partition
without a gaps.
This is similar to rank function difference being rank function leaves the gaps in rank when there
are ties.
Example : - 1 1 1 2 this is dense rank.
[Link]("col_name", Window_function().over(Window_partition))
[Link] 12/66
30/05/2025, 08:13 Pyspark practice - Databricks
root
|-- Roll_No: long (nullable = true)
|-- Student_Name: string (nullable = true)
|-- Subject: string (nullable = true)
|-- Marks: long (nullable = true)
Table
Roll_No Student_Name Subject Marks
[Link] 13/66
30/05/2025, 08:13 Pyspark practice - Databricks
101 R Bi l 80
2 103 Sita Social Science 78
3 104 Lakshman Sanskrit 58
4 102 Kunal Phisycs 89
5 101 Ram Biology 80
6 106 Srishti Maths 70
7 108 Sandeep Physics 75
8 107 Hitesh Maths 88
9 109 Kailash Maths 90
10 105 Abhishek Social Science 84
10 rows
Using row_number().
row_number() function is used to gives a sequential number to each row present in the table.
Table
Roll_No Student_Name Subject Marks row_number
1 101 Ram Biology 80 1
2 101 Ram Biology 80 2
3 106 Srishti Maths 70 1
4 107 Hitesh Maths 88 2
5 109 Kailash Maths 90 3
6 102 Kunal Phisycs 89 1
[Link] 14/66
30/05/2025, 08:13 Pyspark practice - Databricks
10 rows
Using rank()
The rank function is used to give ranks to rows specified in the window partition. This function leaves gaps in rank
if there are ties.
Table
Roll_No Student_Name Subject Marks rank
1 101 Ram Biology 80 1
2 101 Ram Biology 80 1
3 106 Srishti Maths 70 1
4 107 Hitesh Maths 88 2
5 109 Kailash Maths 90 3
6 102 Kunal Phisycs 89 1
7 108 Sandeep Physics 75 1
[Link] 15/66
30/05/2025, 08:13 Pyspark practice - Databricks
10 rows
Using percent_rank()
This function is similar to rank() function. It also provides rank to rows but in a percentile format.
Table
Roll_No Student_Name Subject Marks percent_rank
1 101 Ram Biology 80 0
2 101 Ram Biology 80 0
3 106 Srishti Maths 70 0
4 107 Hitesh Maths 88 0.5
5 109 Kailash Maths 90 1
6 102 Kunal Phisycs 89 0
7 108 Sandeep Physics 75 0
10 rows
Using dense_rank()
This function is used to get the rank of each row in the form of row numbers. This is similar to rank() function,
there is only one difference the rank function leaves gaps in rank when there are ties.
[Link] 16/66
30/05/2025, 08:13 Pyspark practice - Databricks
Table
Roll_No Student_Name Subject Marks dense_rank
1 101 Ram Biology 80 1
2 101 Ram Biology 80 1
3 106 Srishti Maths 70 1
4 107 Hitesh Maths 88 2
5 109 Kailash Maths 90 3
6 102 Kunal Phisycs 89 1
7 108 Sandeep Physics 75 1
10 rows
Aggregate functions
Aggregate function
An aggregate function or aggregation function is a function where the values of multiple rows are grouped to form a
single summary value. The definition of the groups of rows on which they operate is done by using the SQL GROUP BY
clause. E.g. AVERAGE, SUM, MIN, MAX, etc.
[Link] 17/66
30/05/2025, 08:13 Pyspark practice - Databricks
root
|-- Employee_Name: string (nullable = true)
|-- Department: string (nullable = true)
|-- Salary: long (nullable = true)
Table
Employee_Name Department Salary
1 Ram Sales 3000
2 Meena Sales 4600
3 Abhishek Sales 4100
4 Kunal Finance 3000
[Link] 18/66
30/05/2025, 08:13 Pyspark practice - Databricks
10 rows
+-------------+----------+------+------+
|Employee_Name|Department|Salary| Avg|
+-------------+----------+------+------+
| Kunal| Finance| 3000|3450.0|
| Sandeep| Finance| 3900|3450.0|
| Srishti|Management| 3300|3300.0|
| Hitesh| Marketing| 3000|2500.0|
| Kailash| Marketing| 2000|2500.0|
| Ram| Sales| 3000|3760.0|
| Meena| Sales| 4600|3760.0|
| Abhishek| Sales| 4100|3760.0|
[Link] 19/66
30/05/2025, 08:13 Pyspark practice - Databricks
+-------------+----------+------+-----+
|Employee_Name|Department|Salary| Sum|
+-------------+----------+------+-----+
| Kunal| Finance| 3000| 6900|
| Sandeep| Finance| 3900| 6900|
spark = [Link] \
.master("local[*]") \
.appName("timestamp") \
.getOrCreate()
df = [Link]([["1", "2019-07-01 12:01:19.000"], ["2", "2019-06-24 12:01:19.000"]], ["id",
"input_timestamp"])
[Link]()
[Link]()
root
|-- id: string (nullable = true)
|-- input_timestamp: string (nullable = true)
Table
id input_timestamp
[Link] 20/66
30/05/2025, 08:13 Pyspark practice - Databricks
1 1 2019-07-01 12:01:19.000
2 2 2019-06-24 12:01:19.000
2 rows
Table
id input_timestamp timestamptype
1 1 2019-07-01 12:01:19.000 2019-07-01T12:01:19.000+0000
2 2 2019-06-24 12:01:19.000 2019-06-24T12:01:19.000+0000
2 rows
Table
id input_timestamp
1 1 2019-07-01T12:01:19.000+0000
2 2 2019-06-24T12:01:19.000+0000
2 rows
[Link] 21/66
30/05/2025, 08:13 Pyspark practice - Databricks
df3=[Link](col("id"), col("input_timestamp").cast('string'))
[Link]()
Table
id input_timestamp
1 1 2019-07-01 12:01:19
2 2 2019-06-24 12:01:19
2 rows
Table
id to_date(input_timestamp)
1 1 2019-07-01
2 2 2019-06-24
2 rows
[Link] 22/66
30/05/2025, 08:13 Pyspark practice - Databricks
Table
employee_name department Salary
1 Nitya Sales 3000
2 Abhi Sales 4600
3 Rakesh Sales 4100
4 Sandeep finance 3000
[Link] 23/66
30/05/2025, 08:13 Pyspark practice - Databricks
9 rows
windowSpec = [Link]("department").orderBy("salary")
df1 = [Link]("row", row_number().over(windowSpec)) # applying row_number
[Link]()
Table
employee_name department Salary row
1 Nitya Sales 3000 1
2 Abhishek Sales 3000 2
3 Rakesh Sales 4100 3
4 Abhi Sales 4600 4
5 Sandeep finance 3000 1
6 Shyan finance 3300 2
7 Madan finance 3900 3
8 kumar marketing 2000 1
9 Jarin marketing 3000 2
9 rows
[Link] 24/66
30/05/2025, 08:13 Pyspark practice - Databricks
Table
employee_name department Salary row
1 Nitya Sales 3000 1
2 Abhishek Sales 3000 2
3 Sandeep finance 3000 1
4 Shyan finance 3300 2
5 kumar marketing 2000 1
6 Jarin marketing 3000 2
6 rows
[Link] 25/66
30/05/2025, 08:13 Pyspark practice - Databricks
spark = SparkSession \
.builder \
.appName("droppingDublicates") \
.master("local[*]") \
.getOrCreate()
sample_data = ([1, "ramesh", 1000], [2, "Krishna", 2000], [3, "Shri", 3000], [4, "Pradip", 4000],
[1, "ramesh", 1000], [2, "Krishna", 2000], [3, "Shri", 3000], [4, "Pradip", 4000])
Table
id name salary
1 1 ramesh 1000
2 2 Krishna 2000
3 3 Shri 3000
4 4 Pradip 4000
5 1 ramesh 1000
6 2 Krishna 2000
[Link] 26/66
30/05/2025, 08:13 Pyspark practice - Databricks
7 3 Shri 3000
8 4 Pradip 4000
8 rows
df1= [Link]().show()
+---+-------+------+
| id| name|salary|
+---+-------+------+
| 1| ramesh| 1000|
| 2|Krishna| 2000|
| 3| Shri| 3000|
| 4| Pradip| 4000|
+---+-------+------+
df3 = [Link]().show()
+---+-------+------+
| id| name|salary|
+---+-------+------+
| 1| ramesh| 1000|
| 2|Krishna| 2000|
| 3| Shri| 3000|
| 4| Pradip| 4000|
+---+-------+------+
[Link] 27/66
30/05/2025, 08:13 Pyspark practice - Databricks
+---+-------+
| id| name|
+---+-------+
| 1| ramesh|
| 2|Krishna|
| 3| Shri|
| 4| Pradip|
+---+-------+
[Link]()
[Link]()
[Link]([Link], explode([Link])).show(truncate=False)
[Link]([Link], flatten([Link])).show(truncate=False)
root
[Link] 28/66
30/05/2025, 08:13 Pyspark practice - Databricks
+--------+--------------------+
| name| subjects|
+--------+--------------------+
|Abhishek|[[Java, scala, pe...|
| Nitya|[[spark, java, c+...|
| Sandeep|[[csharp, vb], [s...|
+--------+--------------------+
+--------+-------------------+
|name |col |
+--------+-------------------+
|Abhishek|[Java, scala, perl]|
|Abhishek|[spark, java] |
|Nitya |[spark, java, c++] |
| i |[ k j ] |
[Link]([Link], explode([Link])).show(truncate=False)
+--------+-------------------+
|name |col |
+--------+-------------------+
|Abhishek|[Java, scala, perl]|
|Abhishek|[spark, java] |
|Nitya |[spark, java, c++] |
|Nitya |[spark, java] |
|Sandeep |[csharp, vb] |
|Sandeep |[spark, python] |
+--------+-------------------+
[Link]([Link], flatten([Link])).show(truncate=False)
[Link] 29/66
30/05/2025, 08:13 Pyspark practice - Databricks
+--------+--------------------------------+
|name |flatten(subjects) |
+--------+--------------------------------+
|Abhishek|[Java, scala, perl, spark, java]|
|Nitya |[spark, java, c++, spark, java] |
|Sandeep |[csharp, vb, spark, python] |
+--------+--------------------------------+
+---+--------+--------+
| id| name| marks|
+---+--------+--------+
| 1|Abhishek|10|30|40|
| 2| Krishna|50|40|70|
| 3| rakesh|20|70|90|
+---+--------+--------+
[Link] 30/66
30/05/2025, 08:13 Pyspark practice - Databricks
Table
id name marks mark_details maths physics chemistry
1 Abhishek 10|30|40 ["10", "30", 10 30 40
1
"40"]
2 Krishna 50|40|70 ["50", "40", 50 40 70
2
"70"]
3 rows
Table
id name maths physics chemistry
1 1 Abhishek 10 30 40
2 2 Krishna 50 40 70
3 3 rakesh 20 70 90
[Link] 31/66
30/05/2025, 08:13 Pyspark practice - Databricks
3 rows
+-----+-------+
|empid|empname|
+-----+-------+
| 10|Krishna|
| 20| mahesh|
| 30| Rakesh|
+-----+-------+
print([Link]())
['empid', 'empname']
columns = [Link]()
if [Link]('empid')>0:
print('empid exists in the dataframe')
else:
print('not exists')
[Link] 32/66
30/05/2025, 08:13 Pyspark practice - Databricks
pySpark Join
Join is used to combine two or more dataframes based on columns in the dataframe.
where-
Table
ID NAME Company
1 1 Saroj company 1
2 2 Nitya company 1
3 3 Abhishek company 2
[Link] 33/66
30/05/2025, 08:13 Pyspark practice - Databricks
4 4 Sandeep company 1
5 5 Rakesh company 1
5 rows
Table
ID salary department
1 1 45000 IT
2 2 145000 Manager
3 6 45000 HR
4 5 34000 Sales
4 rows
[Link] 34/66
30/05/2025, 08:13 Pyspark practice - Databricks
Inner Join
Inner join
This will join the two PySpark dataframes on key columns, which are common in both dataframes.
Table
ID NAME Company ID salary department
1 1 sravan company 1 1 45000 IT
2 2 ojaswi company 1 2 145000 Manager
[Link] 35/66
30/05/2025, 08:13 Pyspark practice - Databricks
3 rows
Outer join
Full Outer Join
This join joins the two dataframes with all matching and non-matching rows, we can perform this join in three ways
Syntax:
[Link] 36/66
30/05/2025, 08:13 Pyspark practice - Databricks
Table
ID NAME Company ID salary department
1 1 Nitya company 1 1 45000 IT
2 2 Ramesh company 1 2 145000 Manager
3 3 Abhishek company 2 null null null
4 4 Sandeep company 1 null null null
5 5 Manisha company 1 5 34000 Sales
6 null null null 6 45000 HR
[Link] 37/66
30/05/2025, 08:13 Pyspark practice - Databricks
6 rows
Table
ID NAME Company ID salary department
1 1 Nitya company 1 1 45000 IT
2 2 Rakesh company 1 2 145000 Manager
3 3 Abhishek company 2 null null null
4 4 Anjali company 1 null null null
5 5 Saviya company 1 5 34000 Sales
6 null null null 6 45000 HR
[Link] 38/66
30/05/2025, 08:13 Pyspark practice - Databricks
6 rows
Left Join
Here this join joins the dataframe by returning all rows from the first dataframe and only matched rows from the
second dataframe with respect to the first dataframe. We can perform this type of join using left and leftouter.
Syntax:
left: [Link](dataframe2,dataframe1.column_name == dataframe2.column_name,”left”)
leftouter: [Link](dataframe2,dataframe1.column_name == dataframe2.column_name,”leftouter”)
Table
ID NAME Company ID salary department
[Link] 39/66
30/05/2025, 08:13 Pyspark practice - Databricks
5 rows
Right Join
Here this join joins the dataframe by returning all rows from the second dataframe and only matched rows from the
first dataframe with respect to the second dataframe. We can perform this type of join using right and rightouter.
Syntax:
[Link] 40/66
30/05/2025, 08:13 Pyspark practice - Databricks
Table
ID NAME Company ID salary department
1 1 Manisha company 1 1 45000 IT
2 2 Aarti company 1 2 145000 Manager
3 null null null 6 45000 HR
4 5 Virat company 1 5 34000 Sales
4 rows
[Link] 41/66
30/05/2025, 08:13 Pyspark practice - Databricks
Leftsemi join
This join will all rows from the first dataframe and return only matched rows from the second dataframe
Table
ID NAME Company
1 1 Mitchell company 1
2 2 Rachin company 1
3 5 Thomas company 1
[Link] 42/66
30/05/2025, 08:13 Pyspark practice - Databricks
3 rows
LeftAnti join
This join returns only columns from the first dataframe for non-matched records of the second dataframe
Table
ID NAME Company
1 3 Rohit company 2
2 4 Srini company 1
[Link] 43/66
30/05/2025, 08:13 Pyspark practice - Databricks
2 rows
SQL Expression
We can perform all types of the above joins using an SQL expression, we have to mention the type of join in this
expression. To do this, we have to create a temporary view.
Syntax: [Link](“name”)
where
where,
[Link] 44/66
30/05/2025, 08:13 Pyspark practice - Databricks
Table
ID NAME Company ID salary department
1 1 Manoj company 1 1 45000 IT
2 2 Manisha company 1 2 145000 Manager
[Link] 45/66
30/05/2025, 08:13 Pyspark practice - Databricks
3 rows
[Link] 46/66
30/05/2025, 08:13 Pyspark practice - Databricks
+---+--------+---------+---+------+----------+
[Link] 47/66
30/05/2025, 08:13 Pyspark practice - Databricks
Using functools
Functools module provides functions for working with other functions and callable objects to use or extend them
without completely rewriting them.
Syntax:
where,
+------+----------+------+------+
| Name| DOB|Gender|salary|
+------+----------+------+------+
| Ram|1991-04-01| M| 3000|
| Mike|2000-05-19| M| 4000|
|Rohini|1978-09-05| M| 4000|
| Maria|1967-12-01| F| 4000|
| Jenis|1980-02-17| F| 1200|
[Link] 48/66
30/05/2025, 08:13 Pyspark practice - Databricks
+------+----------+------+------+
+------+----------+------+------+
| Name| DOB|Gender|salary|
+------+----------+------+------+
| Ram|1991-04-01| M| 3000|
| Mike|2000-05-19| M| 4000|
|Rohini|1978-09-05| M| 4000|
| Maria|1967-12-01| F| 4000|
| Jenis|1980-02-17| F| 1200|
+------+----------+------+------+
import functools
def unionAll(dfs):
return [Link](lambda df1, df2: [Link](
[Link]([Link])), dfs)
result3 = unionAll([df1, df2])
[Link]()
Table
Name DOB Gender salary
[Link] 49/66
30/05/2025, 08:13 Pyspark practice - Databricks
10 rows
[Link](condition) : This function returns the new dataframe with the values which satisfies the given condition.
df.column_name.isNotNull() : This function is used to filter the rows that are not NULL/None in the dataframe column.
[Link] 50/66
30/05/2025, 08:13 Pyspark practice - Databricks
how – This accepts any or all values. Drop a row if it includes NULLs in any column by using the ‘any’ operator. Drop
a row only if all columns contain NULL values if you use the ‘all’ option. The default value is ‘any’.
thresh – This is an int quantity; rows with less than thresh hold non-null values are dropped. ‘None’ is the default.
subset – This is used to select the columns that contain NULL values. ‘None’ is the default.
[Link] 51/66
30/05/2025, 08:13 Pyspark practice - Databricks
[Link] 52/66
30/05/2025, 08:13 Pyspark practice - Databricks
Parameters:
dataRDD: An RDD of any kind of SQL data representation(e.g. Row, tuple, int, boolean, etc.), or list, or
[Link].
schema: A datatype string or a list of column names, default is None.
samplingRatio: The sample ratio of rows used for inferring
verifySchema: Verify data types of every row against schema. Enabled by default.
Returns: Dataframe
[Link] 53/66
30/05/2025, 08:13 Pyspark practice - Databricks
actor_data = [
("James", None, "Bond", "M", 6000),
("Michael", None, None, "M", 4000),
("Robert", None, "Pattinson", "M", 4000),
("Natalie", None, "Portman", "F", 4000),
("Julia", None, "Roberts", "F", 1000)
]
actor_schema = [Link]([
[Link]("firstname", [Link](), True),
[Link]("middlename", [Link](), True),
[Link]("lastname", [Link](), True),
[Link]("gender", [Link](), True),
[Link]("salary", [Link](), True)
])
df = [Link](data=actor_data, schema=actor_schema)
[Link](truncate=False)
+---------+----------+---------+------+------+
|firstname|middlename|lastname |gender|salary|
+---------+----------+---------+------+------+
|James |null |Bond |M |6000 |
|Michael |null |null |M |4000 |
|Robert |null |Pattinson|M |4000 |
|Natalie |null |Portman |F |4000 |
|Julia |null |Roberts |F |1000 |
[Link] 54/66
30/05/2025, 08:13 Pyspark practice - Databricks
+---------+----------+---------+------+------+
import [Link] as F
null_counts = [Link]([[Link]([Link]([Link](c).isNull(), c)).alias(
c) for c in [Link]]).collect()[0].asDict()
print(null_counts)
df_size = [Link]()
to_drop = [k for k, v in null_counts.items() if v == df_size]
print(to_drop)
output_df = [Link](*to_drop)
output_df.show(truncate=False)
[Link] 55/66
30/05/2025, 08:13 Pyspark practice - Databricks
[Link]()
+---+-------+---------+
| ID| NAME| Company|
+---+-------+---------+
| 1| sravan|company 1|
| 2| ojaswi|company 1|
| 3| rohith|company 2|
| 4|sridevi|company 1|
| 1| sravan|company 1|
| 4|sridevi|company 1|
+---+-------+---------+
Syntax: [Link](condition)
[Link] 56/66
30/05/2025, 08:13 Pyspark practice - Databricks
[Link]([Link]=='1').show()
+---+------+---------+
| ID| NAME| Company|
+---+------+---------+
| 1|sravan|company 1|
| 1|sravan|company 1|
+---+------+---------+
[Link]([Link] != 'sravan').show()
+---+-------+---------+
| ID| NAME| Company|
+---+-------+---------+
| 2| ojaswi|company 1|
| 3| rohith|company 2|
| 4|sridevi|company 1|
| 4|sridevi|company 1|
+---+-------+---------+
[Link]([Link]>'3').show()
+---+-------+---------+
| ID| NAME| Company|
+---+-------+---------+
| 4|sridevi|company 1|
| 4|sridevi|company 1|
+---+-------+---------+
[Link] 57/66
30/05/2025, 08:13 Pyspark practice - Databricks
+---+-------+-------+
| ID| NAME|college|
+---+-------+-------+
| 1| Nitya| vignan|
| 2| Nitesh| vvit|
| 3| Neha| vvit|
| 4| Neerak| vignan|
| 1|Neekung| vignan|
| 5| Neelam| iit|
+---+-------+-------+
[Link]()
Out[8]: 6
[Link] 58/66
30/05/2025, 08:13 Pyspark practice - Databricks
# [Link](condition)
print([Link]([Link] == '1').count())
2
They are
+---+-------+-------+
| ID| NAME|college|
+---+-------+-------+
| 1| Nitya| vignan|
| 1|Neekung| vignan|
+---+-------+-------+
print([Link]([Link] != '1').count())
print([Link]([Link] == 'vignan').count())
4
3
3
[Link] 59/66
30/05/2025, 08:13 Pyspark practice - Databricks
product_data = [
(1, "Mencollection", 5, 50, 40),
(2, "Girlcollection", 5,5,5),
(3, "Childcollectgion", 2,10,10),
(4, "Womencollection", 4, 10, 20)
]
[Link] 60/66
30/05/2025, 08:13 Pyspark practice - Databricks
result_df.show()
+-----+-----------+
| name|sum(volume)|
+-----+-----------+
|Room3| 800|
|Room2| 10125|
|Room1| 10325|
+-----+-----------+
[Link] 61/66
30/05/2025, 08:13 Pyspark practice - Databricks
[Link] 62/66
30/05/2025, 08:13 Pyspark practice - Databricks
[Link] 63/66
30/05/2025, 08:13 Pyspark practice - Databricks
[Link] 64/66
30/05/2025, 08:13 Pyspark practice - Databricks
[Link] 65/66
30/05/2025, 08:13 Pyspark practice - Databricks
[Link] 66/66
Aggregate functions in PySpark operate across a group of rows and return a single summary value, such as SUM, AVG, MIN, or MAX, based on those rows, often using the GROUP BY clause. Analytical functions, however, perform operations over a selected range of values to return individual results for each row within that context, often maintaining the same number of output rows as input and enabling row-wise comparisons with functions like LEAD, LAG, or window rankings .
Choosing SQL expressions over PySpark methods for joining dataframes can offer advantages, such as better readability for those familiar with SQL syntax, and the capacity to utilize complex querying capacities provided by SQL. Additionally, SQL can simplify interaction with various types of joins or group-based operations, enhancing maintainability and clear expression of logic, which is particularly beneficial in complex queries that involve multiple conditions or datasets .
PySpark window functions like lead() and lag() allow analysis relative to other rows within the same group by providing access to subsequent or previous rows' data, respectively. The lag function helps to access previous rows' data as defined by an offset, enabling comparative analysis of a current row against previous data. Conversely, the lead function allows access to future rows, which is useful for projections or comparisons to succeeding data points within a window partition .
A full outer join in PySpark combines rows from both dataframes based on a joining condition, including all matching and non-matching rows from both sides. It distinguishes itself by preserving all available data from both datasets, filling in with nulls where children have no pair. It's often used when a complete view of both datasets is necessary, especially for identifying mismatches or when retaining unmatched data is critical for further analysis .
The unionByName method in PySpark is used to union two DataFrames that may have different schemas. It allows for a union operation where the column names are aligned based on their names rather than their positions. When handling differing schemas, it can manage columns that are missing in one dataframe by setting the allowMissingColumns parameter to True, allowing those columns to have null values in the resulting dataframe .
In PySpark, the unionByName function can handle additional columns present in one dataframe and not the other by using allowMissingColumns=True, ensuring that the union operation can include these columns by introducing nulls for missing data across rows in the respective dataframe. This capability allows for combining datasets with inconsistent schemas while maintaining the integrity of unique column data when present .
A left anti join is more useful in scenarios where one needs to find and retrieve records from the first dataframe that do not have corresponding matches in the second dataframe, which is useful for identifying non-common elements between two datasets. Conversely, a left semi join returns only the matching records from the first dataframe, essentially filtering the first dataset to its intersection with the second, often used to confirm inclusion .
The cume_dist() function calculates the cumulative distribution within a window partition and returns a value indicating the proportion of rows that are the same or less than the current row. It yields results in the range of 0 to 1, providing a normalized ranking. The rank() function, on the other hand, assigns a rank to rows within a partition based on specified criteria, but it leaves gaps in rank values when there are ties, resulting in non-continuous ranking numbers .
The dense_rank() function in PySpark ensures continuous ranking by assigning consecutive ranks to rows in a window partition without leaving gaps for ties, unlike the rank function, which skips ranking numbers when ties occur. This continuity is beneficial for generating a tightly packed ranking order, making it easier to evaluate positions without accounting for skipped numbers in an analysis .
The percent_rank() function in PySpark provides an advanced utility by calculating the relative rank of each row within its partition as a percentage of the total number of values, facilitating a normalized comparison across partitioned data. Unlike the rank function, which gives ordinal positions, percent_rank outputs a value between 0 and 1, making it ideal for scoring or ranking data uniformly, significant in comparative analysis needing normalized values .