Spark ETL
How to create an optimal daily fantasy baseball roster.
Chicago Hadoop Users Group
May 12, 2015
Don Drake
don@drakeconsulting.com
@dondrake
Overview
• Who am I?
• ETL
• Daily Fantasy Baseball
• Spark
• Data Flow
• Extracting - web crawler
• Parquet
• DataFrames
• Python or Scala?
• Transforming - RDDs
• Transforming - Moving Average
Who Am I?
• Don Drake @dondrake
• Currently: Principal Big Data Consultant @ Allstate
• 5 years consulting on Hadoop
• Independent consultant for last 14 years
• Previous clients:
• Navteq / Nokia
• Sprint
• Mobile Meridian - Co-Founder
• MailLaunder.com - my SaaS anti-spam service
• cars.com
• Tribune Media Services
• Family Video
• Museum of Science and Industry
ETL
• Informally: Any repeatable programmed data movement
• Extraction - Get data from another source
• Oracle/PostgreSQL
• CSV
• Web crawler
• Transform - Normalize formatting of phone #’s, addresses
• Create surrogate keys
• Joining data sources
• Aggregate
• Load -
• Load into data warehouse
• CSV
• Sent to predictive model
Spark
• Apache Spark is a fast and general purpose engine for large-scale data
processing.
• It provides high-level API’s in Java, Scala, and Python (REPL’s for Scala
and Python)
• Includes an advanced DAG execution engine that supports in-memory
computing
• RDD - Resilient Distributed Dataset — Core construct of the framework
• Includes a set of high-level tools including Spark SQL for SQL and
structured data processing, MLlib for machine learning, GraphX for graph
processing and Spark Streaming
• Can run in a cluster (Hadoop (YARN), EC2, Mesos), Standalone, Local
• Open Source, core committers from DataBricks
• Latest version is 1.3.1, which includes DataFrames
• Started in 2009 (AMPLab) as research project, Apache project since 2013.
• LOTS of momentum.
Daily Fantasy Baseball
Spark 101- Execution
• Driver - your program’s main() method
• Only 1 per application
• Executors - do the distributed work
• As many as your cluster can handle
• You determine ahead of time how many
• You determine the amount of RAM required
Spark 101 - RDD
• RDD - Resilient Distributed Dataset
• Can be created from Hadoop Input formats (text
file, sequence file, parquet file, HBase, etc.) OR by
transforming other RDDs.
• RDDs have actions and transformations which
return pointers to new RDDs
• RDD’s can contain anything

scala> val textFile = sc.textFile("README.md") 

textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> textFile.count() // Number of items in this RDD 

res0: Long = 126 

scala> textFile.first() // First item in this RDD 

res1: String = # Apache Spark
DEMO
Spark 101 - DEMO
1 #!/bin/env python
2
3 import pyspark.SparkContext
4
5
6 sc = new SparkContext()
7
8 # create some numbers
9 nums = sc.parallelize(xrange(1000))
10
11 nums.getNumPartitions()
12
13 nums.count()
Spark 101 - Lazy Demo
1 #!/bin/env python
2
3 import pyspark.SparkContext
4
5 sc = new SparkContext()
6
7 # create some numbers
8 nums = sc.parallelize(xrange(1000))
9
10 check = sc.accumulator(0)
11
12 def isEven(x):
13 check.add(1)
14 return x % 2 == 0
15
16 evens = nums.filter(isEven)
17
18 evens.count()
19
20 evens.collect()
21
22 check
Spark 101 - Lazy RDDs
• RDD’s are executed *only* when an action is called
upon it.
• This allows multiple transformations on a RDD,
allowing Spark to compute an optimal execution plan.
• Uncached RDD’s are evaluated *every* time an action
is called upon it
• Cache RDD’s if you know you will iterate over an RDD
more than once.
• You can cache to RAM + Disk, by default, persists to
RAM only.
Data Flow
Overview
Extraction & Ingestion
Extraction
• We have seen it’s easy to create a parallelized data
structure that we can execute code against
• Pro-tip: If extracting from relational database, use
Sqoop + save as Parquet
• We need to download a set of files for each
baseball game previously played (player statistics)
• TODO List Pattern
• We know the filenames to download for each
game (they are static)
• Download all files for a game in parallel
1 def getFiles(game):
2 session = requests.Session()
3 files = scrape.get_files([game], session=session)
4 count, fails = scrape.download(files, DownloadMLB.cache)
5 return (count, fails)
6
7 def summarize(a, x):
8 total = a[0] + x[0]
9 alist = a[1]
10 alist.extend(x[1])
11 return (total, alist)
12
13 def run(self):
14
15 sc = SparkContext()
16 start_scrape = datetime.now()
17 begin, begin_parts = scrape.get_boundary(self.begin)
18 end, end_parts = scrape.get_boundary(self.end)
19
20 session = requests.Session()
21
22 print "here"
23 all_years_months_days = self.getYearsMonths(self.WEB_ROOT,
session)
24 games = scrape.get_games(all_years_months_days, session=session)
25
26 gamesRDD = sc.parallelize(games)
27 print "fileRDD=", gamesRDD
28
29 gamesRDD.foreach(dump)
30 print "# parttions:", gamesRDD.getNumPartitions()
31 print "count=", gamesRDD.count()
32 res = gamesRDD.map(getFiles).reduce(summarize)
33 print "res=", res
34
35 count = res[0]
36 fails = res[1]
37 end_scrape = datetime.now()
38 self.log.info("%d files downloaded in %s", count,
39 str(end_scrape - start_scrape))
40 if fails:
41 for url in fails:
42 self.log.error("failed to download %s", url)
43
44 sc.stop()
TODO DEMO
Butter…..Butter……
Parquet?
• Features
• Interoperability - Spark, Impala, Sqoop, much more
• Space efficient
• Query efficient
• Schema - (field name, data type, nullable)
• Columnar Format
• Different Encoding (Compression) Algorithms
• Delta Encoding - diffs per row per column
• Dictionary Encoding (~60k items). e.g. Dates, IP address,
• Run Length Encoding - (for repetitive data)
• Example: http://blog.cloudera.com/blog/2014/05/using-
impala-at-scale-at-allstate/
http://parquet.apache.org/presentations/
http://parquet.apache.org/presentations/
Using DataFrame -
SparkSQL
• Previously called SchemaRDD, now with DataFrame contain extra
functionality to query/filter
• Allow you to write SQL (joins, filter, etc.) against DataFrames (or
RDD’s with a little effort)
• All DataFrame’s contain a schema
1 batter_mov_avg = sqlContext.parquetFile(self.rddDir + "/" +
2 "batter_moving_averages.parquet")
3 batter_mov_avg.registerTempTable("bma")
4 batter_mov_avg.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)
5 print "batter_mov_avg=", batter_mov_avg.take(2)
6
7 batter_games = sqlContext.sql("select * from games g, bma,
8 game_players gp 
9 where bma.player_id = gp.player_id 
10 and bma.game_date = gp.game_date 
11 and gp.game_id = g.game_id 
12 and g.game_date = gp.game_date 
13 and g.game_date = bma.game_date")
DataFrame’s to the rescue
• DataFrames offer a DSL that provides a distributed
data manipulation
• In Python, you can convert a DF to Pandas data
frame and vice versa
More DataFrames
• We can select columns from DataFrame
• Run our own transform on it with map()
• Transform function gets a Row() object, and returns one
• The toDF() function will infer data types for you.
• https://issues.apache.org/jira/browse/SPARK-7182
1 batting_features = batter_games.select(*unique_cols)
2 print "batting_features=", batting_features.schema
3 #print "batting_features=", batting_features.show()
4
5 def transformBatters(row_object):
6 row = row_object.asDict()
7 row = commonTransform(row)
8
9 return Row(**row)
10
11 batting_features = batting_features.map(transformBatters).toDF()
12 batting_features.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)
13
14 self.rmtree(self.rddDir + "/" + "batting_features.parquet")
15 batting_features.saveAsParquetFile(self.rddDir + "/" +
16 "batting_features.parquet")
Programmatically Specifying
the Schema1 class Game(AbstractDF):
2 schema = StructType( sorted(
3 [
4 StructField("game_id", StringType()),
5 StructField("game_date", DateType()),
6 StructField("id", IntegerType()),
7 StructField("type", StringType()),
8 StructField("local_game_time", StringType()),
9 StructField("game_pk", IntegerType()),
10 StructField("gameday_sw", StringType()),
11 StructField("game_time_et", TimestampType()),
12 StructField("home_code", StringType()),
13 StructField("home_abbrev", StringType()),
14 StructField("home_name", StringType()),
15 StructField("home_won", IntegerType()),
16 StructField("home_loss", IntegerType()),
17 StructField("home_division_id", IntegerType()),
18 StructField("home_league", StringType()),
19 StructField("away_code", StringType()),
20 StructField("away_abbrev", StringType()),
21 StructField("away_name", StringType()),
22 StructField("away_won", IntegerType()),
23 StructField("away_loss", IntegerType()),
24 StructField("away_division_id", IntegerType()),
25 StructField("away_league", StringType()),
26 StructField("stadium_id", IntegerType()),
27 StructField("stadium_name", StringType()),
28 StructField("stadium_venue_w_chan_loc", StringType()),
29 StructField("stadium_location", StringType()),
30 StructField("modified", TimestampType()),
31 ],
32 key = lambda x: x.name))
33 skipSelectFields = ['modified']
AbstractDF
• Download: https://gist.github.com/dondrake/
c7fcf42cf051492fdd91
• AbstractDF adds 3 major features to the class
• Creates a python object with attributes based on the field
name defined in the schema
• e.g. g = Game(); g.game_id = ‘123’
• Exposes a helper function to create a Row object
containing all of the fields in schema stored with correct
data types.
• Needed so Scala can correctly infer data types of the
values sent.
• Provides method that will return a list of columns to use in
a select statement.
Python or Scala??
Python or Scala???
• Use Python for prototyping
• Access to pandas, scikit-learn, etc.
• Python is strongly typed (but not statically typed)
• Spark Python API support lags, not as popular as Scala. Bugs might
not get fixed right away or at all.
• Python is slower due to Gateway required. All data must be serialized
and sent through Gateway to/from Scala. (Not as bad for DataFrames)
• Use Scala for application development
• Scala learning curve is steep.
• Functional Programming learning curve is steep.
• Scala is a statically typed language
• Java was intentionally left off
• Don’t bother with Java 7 (IMO)
Transformations
Building a moving average
Example of a moving average
How do we compute a 5-
day MA for batting average?
• For an individual player, on an particular date, we need
the previous 5 days batting average.
• Please note: not all players play every day.
• To build a predictive model, we need historical data, so
we would need to calculate this for each day, every
player, for (possibly) each metric.
• We would also want to use different Moving Average
durations (e.g. a 5, 7 ,14 day moving average) for each
player - day.
• Our compute space just got big and is also
embarrassingly parallel
Some Stats
• load table (2 full years + this year so far ~17%):
• gamePlayers = sqlContext.parquetFile(rddDir +
‘game_players.parquet').cache()
• # Hitters:
• gamePlayers.filter (gamePlayers.fd_position !=
‘P').count()
• 248856L (# of hitter-gamedate combinations)
• # Pitchers
• gamePlayers.filter (gamePlayers.fd_position ==
'P').count()
• 244926L
• Season-level moving average would require about 248,000 *
(162/2) rows of data (~20 million rows)
MapReduce is dead.
Long live MapReduce.
• Spark has a set of transformations that can perform on
RDD’s of key / value pairs.
• Transformations
• groupByKey([numTasks])
• reduceByKey(func, [numTasks])
• aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
• sortByKey([ascending], [numTasks])
• join(otherDataset, [numTasks])
• cogroup(otherDataset, [numTasks])
• Actions
• reduce(func)
• countByKey()
Broadcast Map-side join
• “Broadcast” a dictionary of lists of game_dates,
keys in dictionary is the year the game took place.
• FlatMap operation looping over broadcast
game_dates for the stats took place
• We use flatMap because we emit many rows
from a single input (7-day, 14-day, etc)
• Key: player_id + asOfDate + moving_length
• Value: Game Stats dictionary from input
• Since this generates a lot of output, I
repartition(50) the output of flatMap
Reduce
• now call groupByKey()
• Creates a RDD that have a list of values with the same key.
These values are the stats to calculate the average
• If performing aggregation, use reduceByKey, less shuffling
involved
• Calculate the average (and stddev, etc.)
• Emit new key of player_id + asOfDate
• Value is the dictionary of all moving average fields
• Run reduceByKey() to combine (concatenate) 7-day, 14-day
metrics to a single row
• Save results as Parquet, will be joined with other features to
create a predictive model
Useful Links
1. https://databricks.com/blog/2015/04/28/project-
tungsten-bringing-spark-closer-to-bare-metal.html
2. https://www.usenix.org/system/files/conference/
nsdi15/nsdi15-paper-ousterhout.pdf
3. https://zeppelin.incubator.apache.org/
4. https://codewords.recurse.com/issues/one/an-
introduction-to-functional-programming
5. http://apache-spark-user-list.
1001560.n3.nabble.com/
Spark Books
Q & A