Cascalog Workshop
Example query
Execution

1. Pre-aggregation
2. Aggregation
3. Post-aggregation
Variable dependencies
Pre-aggregation
• Start from generator variables
• Resolve as many variables as possible using:
 • Joins
 • Functions
• Use as many filters as possible
• Join all sources into one set of tuples
Aggregation
• Group by resolved output variables
• Apply all aggregators to each group
Post-aggregation
• Resolve the rest of the variables
• Apply the rest of the filters
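The three phases can be sketched in plain Python. This is a model of the execution order only, not Cascalog code: the input tuples, the decade function, and both filters are hypothetical and chosen just to show which work lands in which phase.

```python
from collections import Counter

# Hypothetical generator output: (person, age) tuples.
AGES = [("alice", 28), ("bob", 33), ("chris", 40), ("david", 25), ("emily", 31)]

def pre_aggregation(rows):
    # Resolve a new variable with a function (decade), then apply a filter
    # that only needs pre-aggregation variables.
    with_decade = [(person, age, age // 10) for person, age in rows]
    return [row for row in with_decade if row[1] >= 26]

def aggregation(rows):
    # Group by the resolved output variable (decade) and count each group.
    return Counter(decade for _, _, decade in rows)

def post_aggregation(groups):
    # A filter on an aggregate value can only run after aggregation.
    return {decade: count for decade, count in groups.items() if count >= 2}

result = post_aggregation(aggregation(pre_aggregation(AGES)))
print(result)
```

The age filter runs before grouping because it depends only on generator variables, while the count filter has to wait until the aggregator has produced its value.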
Example query
Query planner
1. Start with generators
2. Add functions and filters until fixed point
   [?person2 ?age2 ?double-age2]
3. Do a join
   [?person1 ?person2 ?age2 ?double-age2]
4. Add functions and filters until fixed point
5. Do a join
   [?person1 ?age1 ?person2 ?age2 ?double-age2]
6. Add functions and filters until fixed point
   [?person1 ?age1 ?person2 ?age2 ?double-age2 ?delta]
7. Group by already satisfied output vars
   Group by ?delta
8. Execute aggregators on each group
   [?delta ?count]
9. Add functions and filters until fixed point
10. Project fields to [?delta ?count]
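The "add functions and filters until fixed point" step can be sketched as a loop that keeps applying any operation whose input variables are already resolved. This is a Python model, not the planner's actual code. The operation names and their input variables below are assumptions, since the slides show only the resulting variable sets and not the operations themselves.

```python
def apply_until_fixed_point(available, operations):
    """Apply every operation whose inputs are resolved; repeat until nothing changes."""
    available = set(available)
    pending = list(operations)
    applied = []
    changed = True
    while changed:
        changed = False
        for op in list(pending):  # iterate over a copy so we can remove safely
            name, inputs, outputs = op
            if set(inputs) <= available:
                available |= set(outputs)
                pending.remove(op)
                applied.append(name)
                changed = True
    return available, applied, pending

# Hypothetical operations: (name, input variables, output variables).
OPS = [
    ("delta", ("?age1", "?age2"), ("?delta",)),
    ("double", ("?age2",), ("?double-age2",)),
]

# With only ?person2 and ?age2 resolved, "double" can run but "delta" cannot.
vars1, applied1, pending1 = apply_until_fixed_point({"?person2", "?age2"}, OPS)

# After joins bring in ?person1 and ?age1, "delta" becomes applicable.
vars2, applied2, pending2 = apply_until_fixed_point(vars1 | {"?person1", "?age1"}, pending1)
```

Operations that cannot run yet stay pending and are retried after the next join adds more variables.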
Cascading pipes

• Each: can occur in Map or Reduce
• GroupBy: Causes a Reduce step
• Every: One or more follow GroupBy
• CoGroup: Join implementation, causes
  Reduce step
To Cascading
• Each: produces [?person2 ?age2 ?double-age2]
• CoGroup: join producing [?person1 ?person2 ?age2 ?double-age2]
• CoGroup: join producing [?person1 ?age1 ?person2 ?age2 ?double-age2]
• Each, Each: produce [?person1 ?age1 ?person2 ?age2 ?double-age2 ?delta]
• GroupBy: Group by ?delta
• Every: Execute aggregators on each group, producing [?delta ?count]
• Each: applied to [?delta ?count] after aggregation
• Each: Project fields to [?delta ?count]
To MapReduce
• Job 1: first join, producing [?person1 ?person2 ?age2 ?double-age2]
• Job 2: second join, producing [?person1 ?age1 ?person2 ?age2 ?double-age2]
• Job 3: Group by ?delta, aggregation to [?delta ?count], and Project fields to [?delta ?count]
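Since GroupBy and CoGroup each cause a Reduce step, the number of MapReduce jobs for a linear pipe sequence can be estimated by walking the pipes in order. The sketch below is a simplified Python model, not Cascading's planner: it assumes a single linear sequence and that a second reduce-causing pipe always starts a new job. The pipe list is the one from the "To Cascading" diagrams.

```python
REDUCE_PIPES = {"GroupBy", "CoGroup"}

def assign_jobs(pipes):
    """Return (pipe, job number, phase) for each pipe in a linear sequence."""
    job, phase = 1, "map"
    plan = []
    for pipe in pipes:
        if pipe in REDUCE_PIPES:
            if phase == "reduce":
                job += 1  # the current job already has a reduce step
            phase = "reduce"
        plan.append((pipe, job, phase))
    return plan

PIPES = ["Each", "CoGroup", "CoGroup", "Each", "Each",
         "GroupBy", "Every", "Each", "Each"]

plan = assign_jobs(PIPES)
job_count = plan[-1][1]
for pipe, job, phase in plan:
    print(f"Job {job} ({phase}): {pipe}")
```

Under these assumptions the two CoGroup pipes and the GroupBy pipe yield three jobs, with the Each pipes that follow a reduce-causing pipe running in that job's reduce phase.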
defmapop
[A1, B1, C1]  ->  [A1, B1, C1, D1, E1]
[A2, B2, C2]  ->  [A2, B2, C2, D2, E2]
[A3, B3, C3]  ->  [A3, B3, C3, D3, E3]
Appends fields to tuple
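The "appends fields" behaviour can be modelled in a few lines of Python. This is an illustration of the semantics in the diagram, not Cascalog's defmapop itself. The helper name map_op and the example function are hypothetical.

```python
def map_op(fn, tuples):
    """Call fn once per tuple and append the fields it returns to that tuple."""
    return [tuple(t) + tuple(fn(*t)) for t in tuples]

def sum_and_product(a, b, c):
    # Hypothetical operation producing two new fields (the D and E of the diagram).
    return (a + b, b * c)

rows = [(1, 2, 3), (4, 5, 6)]
extended = map_op(sum_and_product, rows)
print(extended)
```

Each input tuple yields exactly one output tuple, so the row count never changes.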
deffilterop
[A1, B1, C1]  ->  true   (kept)
[A2, B2, C2]  ->  false  (dropped)
[A3, B3, C3]  ->  true   (kept)
Output: [A1, B1, C1], [A3, B3, C3]
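A filter operation keeps a tuple unchanged when its predicate returns true and drops it otherwise. The Python sketch below models that behaviour. The helper filter_op and the predicate are hypothetical and are not part of Cascalog.

```python
def filter_op(pred, tuples):
    """Keep only the tuples for which pred returns a truthy value."""
    return [t for t in tuples if pred(*t)]

def first_two_exceed_third(a, b, c):
    # Hypothetical predicate used only for illustration.
    return a + b > c

rows = [(3, 4, 5), (1, 2, 9), (6, 1, 2)]
kept = filter_op(first_two_exceed_third, rows)
print(kept)
```

Unlike a map operation, a filter adds no fields; it only changes how many tuples survive.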
defmapcatop
Map:
[“a red dog”]  ->  [ [“a red dog”, “a”], [“a red dog”, “red”], [“a red dog”, “dog”] ]
[“ ”]          ->  []
[“hello”]      ->  [ [“hello”, “hello”] ]
Concat:
[“a red dog”, “a”]
[“a red dog”, “red”]
[“a red dog”, “dog”]
[“hello”, “hello”]
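The Map and Concat stages in the diagram can be reproduced with a short Python model. It is a sketch of the semantics rather than Cascalog's defmapcatop, and the names mapcat_op and split_words are hypothetical. The input data is taken from the diagram.

```python
def mapcat_op(fn, tuples):
    """fn returns zero or more result tuples per input; results are concatenated."""
    out = []
    for t in tuples:
        for emitted in fn(*t):
            out.append(tuple(t) + tuple(emitted))
    return out

def split_words(sentence):
    # One single-field result per word; a blank sentence yields no results.
    return [(word,) for word in sentence.split()]

sentences = [("a red dog",), (" ",), ("hello",)]
words = mapcat_op(split_words, sentences)
print(words)
```

An input that produces an empty list, such as the blank sentence, contributes nothing to the output.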
Aggregators
Map Task 1 emits: [“key1”, 1], [“key3”, 3]
Map Task 2 emits: [“key2”, 3], [“key1”, 2], [“key3”, 1]
Reduce Task 1 receives: [“key1”, 1], [“key1”, 2]  ->  [“key1”, 3]
Reduce Task 2 receives: [“key2”, 3], [“key3”, 3], [“key3”, 1]  ->  [“key2”, 3], [“key3”, 4]
Regular aggregators - all data goes to reducers
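A regular aggregator sees every tuple only after the shuffle. The Python sketch below models this with the data from the diagram and a sum as the aggregate, which matches the totals shown there. The function names are hypothetical, and the split of keys across reduce tasks is not modelled.

```python
from collections import defaultdict

MAP_OUTPUTS = [
    [("key1", 1), ("key3", 3)],               # Map Task 1
    [("key2", 3), ("key1", 2), ("key3", 1)],  # Map Task 2
]

def shuffle(map_outputs):
    """Group every emitted value by key and count the tuples sent to reducers."""
    groups = defaultdict(list)
    shuffled = 0
    for output in map_outputs:
        for key, value in output:
            groups[key].append(value)
            shuffled += 1
    return groups, shuffled

def reduce_sum(groups):
    return {key: sum(values) for key, values in groups.items()}

groups, shuffled = shuffle(MAP_OUTPUTS)
totals = reduce_sum(groups)
print(totals, shuffled)
```

All five tuples cross the network here, because no aggregation happens on the map side.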
defparallelagg
Map Task 1
  Input:          [“nathan”], [“alice”], [“nathan”]
  Init:           [“nathan”, 1], [“alice”, 1], [“nathan”, 1]
  Combine (Map):  [“nathan”, 2], [“alice”, 1]
Map Task 2
  Input:          [“nathan”], [“sally”]
  Init:           [“nathan”, 1], [“sally”, 1]
  Combine (Map):  [“nathan”, 1], [“sally”, 1]
Combine (Reduce)
  Reduce Task 1:  [“nathan”, 3]
  Reduce Task 2:  [“sally”, 1], [“alice”, 1]
Parallel aggregators - partial aggregation done in mappers
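The Init and Combine steps can be modelled in Python to show why fewer tuples reach the reducers. This is a sketch of the idea, not Cascalog's defparallelagg. The function names are hypothetical, and the input data is taken from the diagram.

```python
def init(_name):
    # Init: turn each input tuple into a partial aggregate (a count of 1).
    return 1

def combine(a, b):
    # Combine: merge two partial aggregates; used on both map and reduce side.
    return a + b

def merge_into(totals, key, value):
    totals[key] = combine(totals[key], value) if key in totals else value

def map_side(names):
    partial = {}
    for name in names:
        merge_into(partial, name, init(name))
    return partial

def reduce_side(partials):
    totals = {}
    for partial in partials:
        for name, value in partial.items():
            merge_into(totals, name, value)
    return totals

partials = [map_side(["nathan", "alice", "nathan"]),  # Map Task 1
            map_side(["nathan", "sally"])]            # Map Task 2
sent_to_reducers = sum(len(p) for p in partials)
totals = reduce_side(partials)
print(totals, sent_to_reducers)
```

Because combine is applied on the map side first, four partial counts are shuffled instead of the five original tuples.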
combine
Source 1: [1], [2], [3]
Source 2: [3], [4], [5]
Output:   [1], [2], [3], [3], [4], [5]
union
Source 1: [1], [2], [3]
Source 2: [3], [4], [5]
Output:   [1], [2], [3], [4], [5]
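The combine and union diagrams differ only in how the shared tuple [3] is handled: combine keeps both copies, while union keeps one. A small Python model of that difference follows. The two functions here are illustrative stand-ins, not Cascalog's own implementations.

```python
def combine(*sources):
    """Concatenate the sources, keeping duplicate tuples."""
    return [t for source in sources for t in source]

def union(*sources):
    """Concatenate the sources, keeping only the first copy of each tuple."""
    seen = set()
    out = []
    for t in combine(*sources):
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

source1 = [(1,), (2,), (3,)]
source2 = [(3,), (4,), (5,)]
print(combine(source1, source2))
print(union(source1, source2))
```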
ElephantDB
Key/Value pairs -> Pre-shard and index in MapReduce -> Shards 0-5 on the Distributed Filesystem
Generation of domain of data
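Pre-sharding means deciding, for every key, which shard file it belongs to before the data is written. The Python sketch below illustrates the idea with six shards as in the diagram. The hash function is an assumption made for the example (CRC32 modulo the shard count), and ElephantDB's real sharding scheme may differ.

```python
import zlib

NUM_SHARDS = 6  # Shard 0 .. Shard 5, as in the diagram

def shard_for(key, num_shards=NUM_SHARDS):
    # Assumption: a deterministic hash of the key modulo the shard count.
    return zlib.crc32(key.encode("utf-8")) % num_shards

def pre_shard(pairs, num_shards=NUM_SHARDS):
    """Distribute key/value pairs into one dict per shard."""
    shards = [dict() for _ in range(num_shards)]
    for key, value in pairs:
        shards[shard_for(key, num_shards)][key] = value
    return shards

# Hypothetical key/value pairs.
pairs = [("nathan", 3), ("alice", 1), ("sally", 1), ("bob", 7)]
shards = pre_shard(pairs)

# A reader locates a key by recomputing the same shard function.
value = shards[shard_for("alice")]["alice"]
print(value)
```

Whatever function is used, it has to be deterministic so that a reader can recompute the shard for a key and look only there.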
ElephantDB
DFS: Shards 0-5 -> ElephantDB Server (x3)
Serving domain of data