[ Exploratory Data Analysis (EDA) with PySpark ] {CheatSheet}
1. Data Loading
● Read CSV File: df = spark.read.csv('[Link]', header=True, inferSchema=True)
● Read Parquet File: df = spark.read.parquet('[Link]')
● Read from JDBC (Databases): df = spark.read.format("jdbc").options(url="jdbc_url", dbtable="table_name").load()
2. Basic Data Inspection
● Display Top Rows: df.show()
● Print Schema: df.printSchema()
● Summary Statistics: df.describe().show()
● Count Rows: df.count()
● Display Columns: df.columns
3. Data Cleaning
● Drop Missing Values: df.na.drop()
● Fill Missing Values: df.na.fill(value)
● Drop Column: df.drop('column_name')
● Rename Column: df.withColumnRenamed('old_name', 'new_name')
4. Data Transformation
● Select Columns: df.select('column1', 'column2')
● Add New or Transform Column: df.withColumn('new_column', expression)
● Filter Rows: df.filter(df['column'] > value)
● Group By and Aggregate: df.groupBy('column').agg({'column': 'sum'})
● Sort Rows: df.orderBy(df['column'].desc())
5. SQL Queries on DataFrames
● Create Temporary View: df.createOrReplaceTempView('view_name')
By: Waleed Mousa
● SQL Query: spark.sql('SELECT * FROM view_name WHERE condition')
6. Statistical Analysis
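● Correlation Matrix: from pyspark.ml.stat import Correlation; Correlation.corr(df, 'column') (here 'column' must be a Vector column, e.g. one built with VectorAssembler)
● Covariance: df.stat.cov('column1', 'column2')
● Frequency Items: df.stat.freqItems(['column1', 'column2'])
● Sample By: df.stat.sampleBy('column', fractions={'class1': 0.1, 'class2': 0.2})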
7. Handling Missing and Duplicated Data
● Fill Missing Values in Column: df.fillna({'column': value})
● Drop Duplicates: df.dropDuplicates()
● Replace Value: df.na.replace(['old_value'], ['new_value'], 'column')
8. Data Conversion and Export
● Convert to Pandas DataFrame: pandas_df = df.toPandas()
● Write DataFrame to CSV: df.write.csv('path_to_save.csv')
● Write DataFrame to Parquet: df.write.parquet('path_to_save.parquet')
9. Column Operations
● Change Column Type: df.withColumn('column', df['column'].cast('new_type'))
● Split Column into Multiple Columns: from pyspark.sql.functions import split; df.withColumn('new_col1', split(df['column'], 'delimiter')[0])
● Concatenate Columns: from pyspark.sql.functions import concat_ws; df.withColumn('new_column', concat_ws(' ', df['col1'], df['col2']))
10. Date and Time Operations
● Current Date and Time: from pyspark.sql.functions import current_date; df.withColumn('current_date', current_date())
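● Date Formatting: from pyspark.sql.functions import date_format; df.withColumn('formatted_date', date_format('dateColumn', 'yyyyMMdd'))
● Date Arithmetic: from pyspark.sql.functions import date_add; df.withColumn('date_plus_days', date_add(df['date'], 5))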
11. Advanced Data Processing
● Window Functions: from pyspark.sql.window import Window; from pyspark.sql.functions import rank; df.withColumn('rank', rank().over(Window.partitionBy('column').orderBy('other_column')))
● Pivot Table: df.groupBy('column').pivot('pivot_column').sum('sum_column')
● UDF (User Defined Functions): from pyspark.sql.functions import udf; my_udf = udf(my_python_function); df.withColumn('new_col', my_udf(df['col']))
12. Performance Optimization
● Caching DataFrame: df.cache()
● Repartitioning: df.repartition(10)
● Broadcast Join Hint: from pyspark.sql.functions import broadcast; df1.join(broadcast(df2), 'key', 'inner')
13. Exploratory Data Analysis Specifics
● Column Value Counts: df.groupBy('column').count().show()
● Distinct Values in a Column: df.select('column').distinct().show()
● Aggregations (sum, max, min, avg): df.groupBy().sum('column').show()
14. Working with Complex Data Types
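● Exploding Arrays: from pyspark.sql.functions import explode; df.withColumn('exploded', explode(df['array_column']))
● Working with Structs: df.select(df['struct_column']['field'])
● Handling Maps: from pyspark.sql.functions import map_keys; df.select(map_keys(df['map_column']))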
15. Joins
● Inner Join: df1.join(df2, df1['id'] == df2['id'])
● Left Outer Join: df1.join(df2, df1['id'] == df2['id'], 'left_outer')
● Right Outer Join: df1.join(df2, df1['id'] == df2['id'], 'right_outer')
16. Saving and Loading Models
● Saving ML Model: model.save('model_path')
● Loading ML Model: from pyspark.ml.classification import LogisticRegressionModel; LogisticRegressionModel.load('model_path')
17. Handling JSON and Complex Files
● Read JSON: df = spark.read.json('path_to_file.json')
● Explode JSON Object: df.select('json_column.*')
18. Custom Aggregations
● Custom Aggregate Function: from pyspark.sql import functions as F; df.groupBy('group_column').agg(F.sum('sum_column'))
19. Working with Null Values
● Counting Nulls in Each Column: df.select([F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])
● Drop Rows with Null Values: df.dropna()
20. Data Import/Export Tips
● Read Text Files: df = spark.read.text('path_to_file.txt')
● Write Data to JDBC: df.write.format("jdbc").options(url="jdbc_url", dbtable="table_name").save()
21. Advanced SQL Operations
● Register DataFrame as Table: df.createOrReplaceTempView('temp_table')
● Perform SQL Queries: spark.sql('SELECT * FROM temp_table WHERE condition')
22. Dealing with Large Datasets
● Sampling Data: sampled_df = df.sample(withReplacement=False, fraction=0.1)
● Approximate Count Distinct: from pyspark.sql.functions import approx_count_distinct; df.select(approx_count_distinct('column')).show()
23. Data Quality Checks
● Checking Data Integrity: [Link]()
● Asserting Conditions: df.filter(df['column'] > 0).count()
24. Advanced File Handling
● Specify Schema While Reading: from pyspark.sql.types import StructType; schema = StructType([...]); df = spark.read.csv('[Link]', schema=schema)
● Writing in Overwrite Mode: df.write.mode('overwrite').csv('path_to_file.csv')
25. Debugging and Error Handling
● Collecting Data Locally for Debugging: local_data = df.take(5)
● Handling Exceptions in UDFs:
    def safe_udf(my_udf):
        def wrapper(*args, **kwargs):
            try:
                return my_udf(*args, **kwargs)
            except Exception:
                return None
        return wrapper
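The wrapper above is plain Python, so it can be exercised without a Spark cluster before being registered as a UDF. A minimal sketch follows; `parse_ratio` is a hypothetical function invented for illustration.

```python
import functools


def safe_udf(func):
    """Wrap func so that any exception yields None instead of failing the Spark task."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            return None
    return wrapper


@safe_udf
def parse_ratio(text):
    # Hypothetical parser: "1/4" -> 0.25; raises on malformed input or zero division.
    num, den = text.split("/")
    return float(num) / float(den)


results = [parse_ratio("1/4"), parse_ratio("1/0"), parse_ratio("oops"), parse_ratio(None)]
```

To apply it to a DataFrame, the wrapped function would then be passed to `udf` from `pyspark.sql.functions` together with a return type such as `DoubleType()`. Rows that fail show up as nulls, which can be counted afterwards to see how often the function failed.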
26. Machine Learning Integration
● Creating Feature Vector: from pyspark.ml.feature import VectorAssembler; assembler = VectorAssembler(inputCols=['col1', 'col2'], outputCol='features'); feature_df = assembler.transform(df)
27. Advanced Joins and Set Operations
● Cross Join: df1.crossJoin(df2)
● Set Operations (Union, Intersect, Minus): df1.union(df2); df1.intersect(df2); df1.subtract(df2)
28. Dealing with Network Data
● Reading Data from HTTP Source: spark.read.format("csv").option("url", "[Link]
29. Integration with Visualization Libraries
● Convert to Pandas for Visualization: pandas_df = df.toPandas(); pandas_df.plot(kind='bar')
30. Spark Streaming for Real-Time EDA
● Reading from a Stream: df = spark.readStream.format('source').load()
● Writing to a Stream: df.writeStream.format('console').start()
31. Advanced Window Functions
● Cumulative Sum: from pyspark.sql.window import Window; from pyspark.sql import functions as F; df.withColumn('cum_sum', F.sum('column').over(Window.partitionBy('group_column').orderBy('order_column')))
● Row Number: df.withColumn('row_num', F.row_number().over(Window.orderBy('column')))
32. Handling Complex Analytics
● Rollup: df.rollup('column1', 'column2').agg(F.sum('column3'))
● Cube for Multi-Dimensional Aggregation: df.cube('column1', 'column2').agg(F.sum('column3'))
33. Dealing with Geospatial Data
● Using GeoSpark for Geospatial Data: from geospark.register import GeoSparkRegistrator; GeoSparkRegistrator.registerAll(spark)
34. Advanced File Formats
● Reading ORC Files: df = spark.read.orc('[Link]')
● Writing Data to ORC: df.write.orc('path_to_file.orc')
35. Dealing with Sparse Data
● Using Sparse Vectors: from pyspark.ml.linalg import SparseVector; sparse_vec = SparseVector(size, {index: value})
36. Handling Binary Data
● Reading Binary Files: df = spark.read.format('binaryFile').load('path_to_binary_file')
37. Efficient Data Transformation
● Using mapPartitions for Transformation: rdd = df.rdd.mapPartitions(lambda partition: [transform(row) for row in partition])
38. Advanced Machine Learning Operations
● Using ML Pipelines: from pyspark.ml import Pipeline; pipeline = Pipeline(stages=[stage1, stage2]); model = pipeline.fit(df)
● Model Evaluation: from pyspark.ml.evaluation import BinaryClassificationEvaluator; evaluator = BinaryClassificationEvaluator(); evaluator.evaluate(predictions)
39. Optimization Techniques
● Broadcast Variables for Efficiency: from pyspark.sql.functions import broadcast; df1.join(broadcast(df2), 'key')
● Using Accumulators for Global Aggregates: accumulator = spark.sparkContext.accumulator(0); rdd.foreach(lambda x: accumulator.add(x))
40. Advanced Data Import/Export
● Reading Data from Multiple Sources: df = spark.read.format('format').option('option', 'value').load(['path1', 'path2'])
● Writing Data to Multiple Formats: df.write.format('format').save('path', mode='overwrite')
41. Utilizing External Data Sources
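● Connecting to External Data Sources (e.g., Kafka, S3): df = spark.read.format('kafka').option('kafka.bootstrap.servers', 'host1:port1').option('subscribe', 'topic').load() (the Kafka source also needs a topic selection such as 'subscribe'; 'topic' is a placeholder)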
42. Efficient Use of SQL Functions
● Using Built-in SQL Functions: from pyspark.sql.functions import col, lit; df.withColumn('new_column', col('existing_column') + lit(1))
43. Exploring Data with GraphFrames
● Using GraphFrames for Graph Analysis: from graphframes import
GraphFrame; g = GraphFrame(vertices_df, edges_df)
44. Working with Nested Data
● Exploding Nested Arrays: df.selectExpr('id', 'explode(nestedArray) as element')
● Handling Nested Structs: df.select('struct_column.*')
45. Advanced Statistical Analysis
● Hypothesis Testing: from pyspark.ml.stat import ChiSquareTest; r = ChiSquareTest.test(df, 'features', 'label')
● Statistical Functions (e.g., mean, stddev): from pyspark.sql.functions import mean, stddev; df.select(mean('column'), stddev('column'))
46. Customizing Spark Session
● Configuring SparkSession: from pyspark.sql import SparkSession; spark = SparkSession.builder.appName('app').config('[Link]n', 'value').getOrCreate()