Intermediate SQL Training Notes
Intermediate SQL Training Notes
Section I - Setup
To set up, you will need two things: SQLiteStudio and the files for this class (which
you likely have already if you are reading this document).
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY
FROM CUSTOMER_ORDER
WHERE CUSTOMER_ID IN (
SELECT CUSTOMER_ID
FROM CUSTOMER
WHERE STATE = 'TX'
)
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY,
(
SELECT AVG(QUANTITY)
FROM CUSTOMER_ORDER c2
WHERE c2.CUSTOMER_ID = c1.CUSTOMER_ID
AND c2.PRODUCT_ID = c1.PRODUCT_ID
) AS AVG_QUANTITY
FROM CUSTOMER_ORDER c1
Depending on how they are used, subqueries can be more expensive or less expensive
than joins. Subqueries that generate a value for each record tend to me more
expensive, like the example above.
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ORDER.CUSTOMER_ID,
ORDER_DATE,
CUSTOMER_ORDER.PRODUCT_ID,
QUANTITY,
cust_avgs.avg_qty
FROM CUSTOMER_ORDER
INNER JOIN
(
SELECT CUSTOMER_ID,
PRODUCT_ID,
AVG(QUANTITY) as avg_qty
FROM CUSTOMER_ORDER
GROUP BY 1, 2
) cust_avgs
ON CUSTOMER_ORDER.CUSTOMER_ID = cust_avgs.CUSTOMER_ID
AND CUSTOMER_ORDER.PRODUCT_ID = cust_avgs.PRODUCT_ID
WITH CUST_AVGS AS (
SELECT CUSTOMER_ID,
PRODUCT_ID,
AVG(QUANTITY) AS AVG_QTY
FROM CUSTOMER_ORDER
GROUP BY 1, 2
)
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ORDER.CUSTOMER_ID,
ORDER_DATE,
CUSTOMER_ORDER.PRODUCT_ID,
QUANTITY,
AVG_QTY
ON CUSTOMER_ORDER.CUSTOMER_ID = cust_avgs.CUSTOMER_ID
AND CUSTOMER_ORDER.PRODUCT_ID = cust_avgs.PRODUCT_ID
For instance, we can create two derived tables "TX_CUSTOMERS" and "TX_ORDERS" but give
them names as common table expressions. Then we can proceed to use those two derived
tables like this.
WITH TX_CUSTOMERS AS
(
SELECT * FROM CUSTOMER
WHERE STATE = 'TX'
),
TX_ORDERS AS
(
SELECT * FROM CUSTOMER_ORDER
WHERE CUSTOMER_ID IN (SELECT CUSTOMER_ID FROM TX_CUSTOMERS)
)
Conditional Subqueries
2.5 - Unions
To simply append two queries (with identical fields) together, put a UNION ALL
between them.
SELECT
'FEB' AS MONTH,
PRODUCT.PRODUCT_ID,
PRODUCT_NAME,
SUM(PRICE * QUANTITY) AS REV
FROM PRODUCT LEFT JOIN CUSTOMER_ORDER
ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID
UNION ALL
SELECT
'MAR' AS MONTH,
PRODUCT.PRODUCT_ID,
PRODUCT_NAME,
SUM(PRICE * QUANTITY) AS REV
FROM PRODUCT LEFT JOIN CUSTOMER_ORDER
ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID
Using UNION instead of UNION ALL will remove duplicates, which should not be
necessary in this case.
You should strive not to use unions as they often encourage bad, inefficient SQL.
Strive to use CASE statements or other tools instead. In this example, it would have
been better to do this:
SELECT
CASE
WHEN ORDER_DATE BETWEEN '2017-02-01' AND '2017-02-28' THEN 'FEB'
WHEN ORDER_DATE BETWEEN '2017-03-01' AND '2017-03-31' THEN 'MAR'
END AS MONTH,
PRODUCT.PRODUCT_ID,
PRODUCT_NAME,
SUM(PRICE * QUANTITY) AS REV
FROM PRODUCT LEFT JOIN CUSTOMER_ORDER
ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID
This shows a concatenated list of values of PRODUCT_ID's ordered for each ORDER_DATE.
SELECT ORDER_DATE,
group_concat(PRODUCT_ID) as product_ids_ordered
FROM CUSTOMER_ORDER
WHERE ORDER_DATE BETWEEN '2017-02-01' AND '2017-02-28'
GROUP BY ORDER_DATE
Putting the DISTINCT keyword inside of it will only concatenate the DISTINCT product
ID's.
SELECT ORDER_DATE,
group_concat(DISTINCT PRODUCT_ID) as product_ids_ordered
FROM CUSTOMER_ORDER
WHERE ORDER_DATE BETWEEN '2017-02-01' AND '2017-02-28'
GROUP BY ORDER_DATE
Note that GROUP_CONCAT is used in MySQL and SQLite, but is often called STRING_AGG
on other platforms such as Oracle, PostgreSQL, and Microsoft SQL Server.
Exercise 1
Bring in all records for CUSTOMER_ORDER , but also bring in the total quantities ever
ordered each given PRODUCT_ID and CUSTOMER_ID .
ANSWER:
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ORDER.CUSTOMER_ID,
ORDER_DATE,
CUSTOMER_ORDER.PRODUCT_ID,
QUANTITY,
sum_qty
FROM CUSTOMER_ORDER
INNER JOIN
(
SELECT CUSTOMER_ID,
PRODUCT_ID,
SUM(QUANTITY) AS sum_qty
FROM CUSTOMER_ORDER
GROUP BY 1, 2
) total_ordered
ON CUSTOMER_ORDER.CUSTOMER_ID = total_ordered.CUSTOMER_ID
AND CUSTOMER_ORDER.PRODUCT_ID = total_ordered.PRODUCT_ID
WITH total_ordered AS (
SELECT CUSTOMER_ID,
PRODUCT_ID,
SUM(QUANTITY) AS sum_qty
FROM CUSTOMER_ORDER
GROUP BY 1, 2
)
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ORDER.CUSTOMER_ID,
ORDER_DATE,
CUSTOMER_ORDER.PRODUCT_ID,
QUANTITY,
sum_qty
ON CUSTOMER_ORDER.CUSTOMER_ID = total_ordered.CUSTOMER_ID
AND CUSTOMER_ORDER.PRODUCT_ID = total_ordered.PRODUCT_ID
While regular expressions can be used to split and search text, we will primarily be
using it to match text, much like wildcards.
The REGEXP operator is used in SQLite for matching a text to a regex pattern. Like all
boolean operations, it will return a 1 for true or 0 for false.
Here are some interactive training sessions on O'Reilly's platform to learn regular
expressions. [Link]
Note carefully that the REGEXP operator is not implemented by default with
SQLite. SQLiteStudio does implement it for you, so it will work while using
SQLiteStudio. However, if you use SQLite with Python, Java, or other programming
platforms you will have to implement it yourself.
3.1 - Literals
Literals are characters in a regex pattern that have no special function, and
represent that character verbatim. Numbers and letters are literals. For example, The
regex TX will match the string TX
Some characters, as we have seen, have special functionality in a regex. If you want
to use these characters as literals, sometimes you have to escape them with a
preceding \ . These characters include:
[\^$.|?*+()
So to qualify a U.S. currency amount, you will need to escape the dollar sign $ and
the decimal place .
We can also specify certain characters, and they don't necessarily have to be ranges:
3.2 - Anchoring
If you don't want partial matches but rather full matches, you have to anchor the
beginning and end of the String with ^ and $ respectively.
For instance, [A-Z][A-Z] would qualify with SMU . This is because it found two
alphabetic characters within those three characters.
If you don't want that, you will need to qualify start and end anchors, which
effectively demands a full match when both are used:
You can also anchor to just the beginning or end of the string to check, for instance,
if a string starts with a number followed by an alphabetic character:
We can instead explicitly identify in curly brackets we want to repeat that alphabetic
character 3 times, by following it with a {3} .
We can also specify a min and max number of repetitions, such as a minimum of 2 but
max of 3.
Leaving the second argument blank will result in only requiring a minimum of
repetitions:
To allow 1 or more repetitions, use the + . This will qualify with 1 or more
alphanumeric characters.
To allow 0 or 1 repetitions (an optional character), follow the item with a ? . This
will allow two characters to be preceded with a number, but it doesn't have to:
You can use several repeaters for different clauses in a regex. Below, we qualify a
string of alphabetic characters, a dash - followed by a string of numbers, and then
another - with a string of alphabetic characters.
3.4 Wildcards
A dot . represents any character, even whitespaces.
We can also qualify a string of letters, a slash `/`, and a string of numbers, but
qualify any number of repetitions of this entire pattern:
```sql
SELECT 'WHISKY/23482374/ZULU/23423234/FOXTROT/6453' REGEXP '^([A-Z]+/[0-9]+/?)+$' --
true
## EXERCISE
Find all customers with a 3-4 digit street number. Note the literal space before the
wildcard `.*`:
```sql
SELECT * FROM CUSTOMER
WHERE ADDRESS REGEXP '^[0-9]{3,4} .*$'
```sql
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_NAME,
ORDER_DATE,
PRODUCT_ID,
PRODUCT_NAME,
PRODUCT_GROUP
QUANTITY,
PRICE
FROM CUSTOMER_ORDER
SELECT PRODUCT_ID,
PRODUCT_NAME,
SUM(QUANTITY) as total_quantity
ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID
AND ORDER_DATE = '2017-03-01'
GROUP BY 1, 2
Note you can also create a temporary (or permanent) table from a SELECT query. This is
helpful to persist expensive query results and reuse it multiple times during a
session. SQLite is a bit more convoluted to do this than other platforms:
SELECT CUSTOMER_ORDER.*,
PRICE,
DISCOUNT_RATE,
PRICE * (1 - DISCOUNT_RATE) AS DISCOUNTED_PRICE
FROM CUSTOMER_ORDER
INNER JOIN CUSTOMER
ON CUSTOMER_ORDER.CUSTOMER_ID = CUSTOMER.CUSTOMER_ID
If you expect records to possibly get multiple discounts, then sum the discounts and
GROUP BY everything else:
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_NAME,
STATE,
ORDER_DATE,
CUSTOMER_ORDER.PRODUCT_ID,
PRODUCT_NAME,
PRODUCT_GROUP
QUANTITY,
PRICE,
SUM(DISCOUNT_RATE) as TOTAL_DISCOUNT_RATE,
PRICE * (1 - SUM(DISCOUNT_RATE)) AS DISCOUNTED_PRICE
FROM CUSTOMER_ORDER
INNER JOIN CUSTOMER
ON CUSTOMER_ORDER.CUSTOMER_ID = CUSTOMER.CUSTOMER_ID
GROUP BY 1,2,3,4,5,6,7,8
SELECT o1.CUSTOMER_ORDER_ID,
o1.CUSTOMER_ID,
o1.PRODUCT_ID,
o1.ORDER_DATE,
[Link],
[Link] AS PREV_DAY_QUANTITY
FROM CUSTOMER_ORDER o1
LEFT JOIN CUSTOMER_ORDER o2
ON o1.CUSTOMER_ID = o2.CUSTOMER_ID
AND o1.PRODUCT_ID = o2.PRODUCT_ID
AND o2.ORDER_DATE = date(o1.ORDER_DATE, '-1 day')
Note if you want to get the previous quantity ordered for that record's given
CUSTOMER_ID and PRODUCT_ID , even if it wasn't strictly the day before, you can use a
subquery instead that qualifies previous dates and orders them descending. Then you
can use LIMIT 1 to grab the most recent at the top.
SELECT ORDER_DATE,
PRODUCT_ID,
CUSTOMER_ID,
QUANTITY,
(
SELECT QUANTITY
FROM CUSTOMER_ORDER c2
WHERE c1.ORDER_DATE > c2.ORDER_DATE
AND c1.PRODUCT_ID = c2.PRODUCT_ID
AND c1.CUSTOMER_ID = c2.CUSTOMER_ID
ORDER BY ORDER_DATE DESC
LIMIT 1
) as PREV_QTY
FROM CUSTOMER_ORDER c1
This MANAGER_ID points to another EMPLOYEE record. If you want to bring in Daniel
and his superior's information, this isn't hard to do with a self join.
SELECT e1.FIRST_NAME,
e1.LAST_NAME,
[Link],
e2.FIRST_NAME AS MANAGER_FIRST_NAME,
e2.LAST_NAME AS MANAGER_LAST_NAME
But what if you wanted to display the entire hierarchy above Daniel? Well shoot, this
is hard because now you have to do several self joins to daisy-chain your way to the
top. What makes this even harder is you don't know how many self joins you will need
to do. For cases like this, it can be helpful to leverage recursive queries.
A recursion is a special type of common table expression (CTE). Typically, you "seed"
a starting value and then use UNION or UNION ALL to append the results of a query
that uses each "seed", and the result becomes the next seed.
In this case, we will use a RECURSIVE common table expression to seed Daniel's ID,
and then append each MANAGER_ID of each EMPLOYEE_ID that matches the seed. This will
give a set of ID's for employees hierarchical to Daniel. We can then use these ID's to
navigate Daniel's hierarchy via JOINS, IN, or other SQL operators.
Recursive queries are a bit tricky to get right, but practice them if you deal
frequently with hierarchical records. You will likely use them with a specific part of
the hierarchy in focus (e.g. Daniel's superiors). It's harder to show the hierarchy
for everyone at once, but there are ways. For instance, you can put a RECURSIVE
operation in a subquery and use GROUP_CONCAT .
SELECT e1.* ,
(
WITH RECURSIVE hierarchy_of(x) AS (
SELECT [Link]
UNION ALL -- append each manager ID recursively
SELECT MANAGER_ID
FROM hierarchy_of INNER JOIN EMPLOYEE
ON [Link] = hierarchy_of.x -- employee ID must equal previous recursion
)
FROM EMPLOYEE e1
Senior
17 Stephanus Palatino Operations 15 1,3,14,15,1
Analyst
Business
19 Desdemona Farmar Operations 15 1,3,14,15,1
Consultant
Senior
22 Quill Pinder Operations 20 1,3,14,20,2
Analyst
Business
23 Maybelle Freiburger Operations 20 1,3,14,20,2
Consultant
Business
24 Angelique Havis Operations 20 1,3,14,20,2
Consultant
Technician
27 Junie Blanque Help Desk 26 1,4,25,26,2
I
Note recursive queries also can be used to improvise a set of consecutive values
without creating a table. For instance, we can generate a set of consecutive integers.
Here is how you create a set of integers from 1 to 1000.
We can break up each booking into individual bookings for each passenger (e.g create 3
records off of a booking with 3 passengers).
SELECT BOOKING_ID,
BOOKED_EMPLOYEE_ID,
DEPARTURE_DATE,
ORIGIN,
DESTINATION,
FARE_PRICE,
repeat_helper.x AS PASSENGER_NUMBER
FROM EMPLOYEE_AIR_TRAVEL CROSS JOIN repeat_helper
ON repeat_helper.x <= NUM_OF_PASSENGERS
You can also use some clever CASE expression logic with an integer generator to find
total costs of sending employees to each airport.
SELECT
CASE WHEN repeat_helper.x == 1 THEN ORIGIN ELSE DESTINATION END AS AIRPORT,
SUM(FARE_PRICE * NUM_OF_PASSENGERS) AS AIRPORT_REVENUE
FROM EMPLOYEE_AIR_TRAVEL CROSS JOIN repeat_helper
ON repeat_helper.x <= 2
GROUP BY AIRPORT
AIRPORT AIRPORT_REVENUE
DFW 4840
JFK 1290
LAX 740
ORD 1460
SFO 1350
You can apply the same concept to generate a set of chronological dates. This
recursive query will generate all dates from today to '2030-12-31':
SELECT ORDER_DATE,
PRODUCT_ID,
SUM(QUANTITY) as TOTAL_QTY
FROM CUSTOMER_ORDER
GROUP BY 1, 2
We should use a cross join to resolve this problem. For instance, we can leverage a
CROSS JOIN query to generate every possible combination of PRODUCT_ID and
CUSTOMER_ID .
SELECT
CUSTOMER_ID,
PRODUCT_ID
FROM CUSTOMER
CROSS JOIN PRODUCT
In this case we should bring in CALENDAR_DATE and cross join it with PRODUCT_ID to
get every possible combination of calendar date and product. Note the CALENDAR_DATE
comes from the CALENDAR table, which acts as a simple list of consecutive calendar
dates. Note we could also have used a recursive query, as shown in the previous
example, to generate the dates. We'll stick with a simple table instead for now in
case you are not comfortable with recursion yet. We should only filter the calendar to
a date range of interest, like 2017-01-01 and 2017-03-31 .
SELECT
CALENDAR_DATE,
PRODUCT_ID
FROM PRODUCT
CROSS JOIN CALENDAR
WHERE CALENDAR_DATE BETWEEN '2017-01-01' and '2017-03-31'
Then we can LEFT JOIN to our previous query to get every product quantity sold by
calendar date, even if there were no orders that day:
SELECT CALENDAR_DATE,
all_combos.PRODUCT_ID,
TOTAL_QTY
FROM
(
SELECT
CALENDAR_DATE,
PRODUCT_ID
FROM PRODUCT
CROSS JOIN CALENDAR
WHERE CALENDAR_DATE BETWEEN '2017-01-01' and '2017-03-31'
) all_combos
LEFT JOIN
(
SELECT ORDER_DATE,
PRODUCT_ID,
SUM(QUANTITY) as TOTAL_QTY
FROM CUSTOMER_ORDER
GROUP BY 1, 2
) totals
ON all_combos.CALENDAR_DATE = totals.ORDER_DATE
AND all_combos.PRODUCT_ID = totals.PRODUCT_ID
WITH all_combos AS (
SELECT
CALENDAR_DATE,
PRODUCT_ID
FROM PRODUCT
CROSS JOIN CALENDAR
WHERE CALENDAR_DATE BETWEEN '2017-01-01' and '2017-03-31'
),
totals AS (
SELECT ORDER_DATE,
PRODUCT_ID,
SUM(QUANTITY) as TOTAL_QTY
FROM CUSTOMER_ORDER
GROUP BY 1, 2
)
SELECT CALENDAR_DATE,
all_combos.PRODUCT_ID,
TOTAL_QTY
SELECT c1.ORDER_DATE,
c1.PRODUCT_ID,
c1.CUSTOMER_ID,
[Link],
SUM([Link]) as ROLLING_QTY
GROUP BY 1, 2, 3, 4
Exercise 4
For every CALENDAR_DATE and CUSTOMER_ID , show the total QUANTITY ordered for the
date range of 2017-01-01 to 2017-03-31 :
ANSWER:
SELECT CALENDAR_DATE,
all_combos.CUSTOMER_ID,
coalesce(TOTAL_QTY, 0) AS TOTAL_QTY
FROM
(
SELECT
CALENDAR_DATE,
CUSTOMER_ID
FROM CUSTOMER
CROSS JOIN CALENDAR
WHERE CALENDAR_DATE BETWEEN '2017-01-01' and '2017-03-31'
) all_combos
LEFT JOIN
(
SELECT ORDER_DATE,
CUSTOMER_ID,
SUM(QUANTITY) as TOTAL_QTY
FROM CUSTOMER_ORDER
GROUP BY 1, 2
) totals
ON all_combos.CALENDAR_DATE = totals.ORDER_DATE
AND all_combos.CUSTOMER_ID = totals.CUSTOMER_ID
WITH all_combos AS (
SELECT
CALENDAR_DATE,
CUSTOMER_ID
FROM CUSTOMER
CROSS JOIN CALENDAR
WHERE CALENDAR_DATE BETWEEN '2017-01-01' and '2017-03-31'
),
totals AS (
SELECT ORDER_DATE,
CUSTOMER_ID,
SUM(QUANTITY) as TOTAL_QTY
FROM CUSTOMER_ORDER
GROUP BY 1, 2
)
SELECT CALENDAR_DATE,
all_combos.CUSTOMER_ID,
coalesce(TOTAL_QTY, 0) AS TOTAL_QTY
ON all_combos.CALENDAR_DATE = totals.ORDER_DATE
AND all_combos.CUSTOMER_ID = totals.CUSTOMER_ID
Section V - Windowing
Windowing functions allow you to greater contextual aggregations in ways much more
flexible than GROUP BY. Many major database platforms support windowing functions.
5.1 PARTITION BY
Sometimes it can be helpful to create a contextual aggregation for each record in a
query. Windowing functions can make this much easier and save us a lot of subquery
work.
For instance, it may be helpful to not only get each CUSTOMER_ORDER for the month of
MARCH, but also the average quantity that customer purchased for that PRODUCT_ID . We
can do that with an OVER PARTITION BY combined with the AVG() function. We have done
this operation with derived tables and common table expressions in the past, but now
we can do it in a one-line window function!
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY,
AVG(QUANTITY) OVER(PARTITION BY PRODUCT_ID, CUSTOMER_ID) as AVG_PRODUCT_QTY_ORDERED
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY,
MAX(QUANTITY) OVER(PARTITION BY PRODUCT_ID, CUSTOMER_ID) as MAX_PRODUCT_QTY_ORDERED
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
You can have multiple windowed fields in a query. Below, we get a MIN, MAX, and AVG
for that given window.
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY,
MIN(QUANTITY) OVER(PARTITION BY PRODUCT_ID, CUSTOMER_ID) as MIN_PRODUCT_QTY_ORDERED,
MAX(QUANTITY) OVER(PARTITION BY PRODUCT_ID, CUSTOMER_ID) as MAX_PRODUCT_QTY_ORDERED,
AVG(QUANTITY) OVER(PARTITION BY PRODUCT_ID, CUSTOMER_ID) as AVG_PRODUCT_QTY_ORDERED
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
You can also mix and match scopes which is difficult to do with derived tables.
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY,
MIN(QUANTITY) OVER(PARTITION BY PRODUCT_ID, CUSTOMER_ID) as
MIN_PRODUCT_CUSTOMER_QTY_ORDERED,
MIN(QUANTITY) OVER(PARTITION BY PRODUCT_ID) as MIN_PRODUCT_QTY_ORDERED,
MIN(QUANTITY) OVER(PARTITION BY CUSTOMER_ID) as MIN_CUSTOMER_QTY_ORDERED
FROM CUSTOMER_ORDER
When you are declaring your window redundantly, you can reuse it using a WINDOW
declaration, which goes between the WHERE and the ORDER BY .
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY,
MIN(QUANTITY) OVER(w) as MIN_PRODUCT_QTY_ORDERED,
MAX(QUANTITY) OVER(w) as MAX_PRODUCT_QTY_ORDERED,
AVG(QUANTITY) OVER(w) as AVG_PRODUCT_QTY_ORDERED
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
5.2 ORDER BY
You can also use an ORDER BY in your window to only consider values that
comparatively come before that record.
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY,
SUM(QUANTITY) OVER (ORDER BY ORDER_DATE) AS ROLLING_QUANTITY
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
Note you can precede the ORDER BY clause with a DESC keyword to window in the
opposite direction.
Above, notice our example output has the same rolling total for all records on a given
date. This is because the ORDER BY in a window function by default does a logical
boundary, which in this case is the ORDER_DATE . This means it is rolling up
everything on that ORDER_DATE and previous to it. A side effect is all records with
the same ORDER_DATE are going to get the same rolling total.
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY,
SUM(QUANTITY) OVER (ORDER BY ORDER_DATE RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT
ROW) AS ROLLING_QUANTITY
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
If you want to incrementally roll the quantity by each row's physical order (not
logical order by the entire ORDER_DATE ), you can use ROWS BETWEEN instead of RANGE
BETWEEN .
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY,
SUM(QUANTITY) OVER (ORDER BY ORDER_DATE ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
ROW) AS ROLLING_QUANTITY
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
Note the AND CURRENT ROW is a default, so you can shorthand it like this:
In this particular example, you could have avoided using a physical boundary by
specifying your window with an ORDER BY CUSTOMER_ORDER_ID . But we covered the
previous strategy anyway to see how to execute physical boundaries. Here is an
excellent overview of windowing functions and bounds:
[Link]
For example, for each record we can get the max quantity ordered up to that date
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY,
MAX(QUANTITY) OVER(PARTITION BY PRODUCT_ID, CUSTOMER_ID ORDER BY ORDER_DATE) as
MAX_TO_DATE_PRODUCT_QTY
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
SELECT CUSTOMER_ORDER_ID,
ORDER_DATE,
CUSTOMER_ID,
PRODUCT_ID,
QUANTITY,
SUM(QUANTITY) OVER(PARTITION BY PRODUCT_ID ORDER BY ORDER_DATE ROWS UNBOUNDED
PRECEDING) as total_qty_for_customer_and_product
FROM CUSTOMER_ORDER
WHERE ORDER_DATE BETWEEN '2017-03-01' AND '2017-03-31'
You need to be very careful mixing PARITITION BY with an ORDER BY that uses a physical
boundary! If you sort the results, it can get confusing very quickly because you lose
that physical ordered context.
SELECT CUSTOMER_ORDER_ID,
ORDER_DATE,
CUSTOMER_ID,
PRODUCT_ID,
QUANTITY,
SUM(QUANTITY) OVER(PARTITION BY PRODUCT_ID ORDER BY ORDER_DATE) as
total_qty_for_product
FROM CUSTOMER_ORDER
WHERE ORDER_DATE BETWEEN '2017-03-01' AND '2017-03-31'
ORDER BY ORDER_DATE
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY,
AVG(QUANTITY) OVER (ORDER BY ORDER_DATE ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS
ROLLING_AVG
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
SELECT *,
LAG (QUANTITY, 1, 0) OVER (ORDER BY ORDER_DATE) AS PREV_QTY
FROM CUSTOMER_ORDER
LAG() will retrieve the quantity of the next record on that ordering.
SELECT *,
LEAD (QUANTITY, 1, 0) OVER (ORDER BY ORDER_DATE) AS NEXT_QTY
FROM CUSTOMER_ORDER
You can also use LEAD() and LAG() with a PARTIION BY , making it possible to silo
the previous quantity only within records that share each given record's PRODUCT_ID
and CUSTOMER_ID .
SELECT *,
LAG (QUANTITY, 1, 0) OVER (PARTITION BY PRODUCT_ID, CUSTOMER_ID ORDER BY ORDER_DATE)
AS PREV_QTY
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
For example, say I wanted to get the top 3 selling PRODUCTs by CUSTOMER. I can use a
ROW_NUMBER() function to assign a ranking number to each sorted quantity by
CUSTOMER_ID and PRODUCT_ID . Then I can filter for only the first three items.
WITH TOTAL_QTYS AS (
SELECT CUSTOMER_ID, PRODUCT_ID, SUM(QUANTITY) AS TOTAL_QTY
FROM CUSTOMER_ORDER
GROUP BY 1,2
),
PRODUCT_SALES_BY_CUSTOMER AS (
SELECT CUSTOMER_ID, PRODUCT_ID, TOTAL_QTY,
ROW_NUMBER() OVER (PARTITION BY CUSTOMER_ID ORDER BY TOTAL_QTY DESC) AS RANKING
FROM TOTAL_QTYS
)
SELECT * FROM PRODUCT_SALES_BY_CUSTOMER
WHERE RANKING <= 3
Note if you want identical values to receive the same ranking, use the RANK()
function instead of row_number() . Use DENSE_RANK() if you want to force the values
to be consecutive rather than dupes causing ranks to be skipped.
EXERCISE
For the month of March, bring in the rolling sum of quantity ordered (up to each
ORDER_DATE ) by CUSTOMER_ID and PRODUCT_ID .
SELECT CUSTOMER_ORDER_ID,
ORDER_DATE,
CUSTOMER_ID,
PRODUCT_ID,
QUANTITY,
SUM(QUANTITY) OVER(PARTITION BY CUSTOMER_ID, PRODUCT_ID ORDER BY ORDER_DATE) as
total_qty_for_customer_and_product
FROM CUSTOMER_ORDER
WHERE ORDER_DATE BETWEEN '2017-03-01' AND '2017-03-31'
ORDER BY CUSTOMER_ORDER_ID
engine = create_engine('sqlite:///thunderbird_manufacturing.db')
conn = [Link]()
for r in results:
print(r)
engine = create_engine('sqlite:///thunderbird_manufacturing.db')
conn = [Link]()
def get_all_customers():
stmt = text("SELECT * FROM CUSTOMER")
return list([Link](stmt))
print(get_all_customers())
engine = create_engine('sqlite:///thunderbird_manufacturing.db')
conn = [Link]()
def get_all_customers():
stmt = text("SELECT * FROM CUSTOMER")
return list([Link](stmt))
def customer_for_id(customer_id):
stmt = text("SELECT * FROM CUSTOMER WHERE CUSTOMER_ID = :id")
return [Link](stmt, id=customer_id).first()
print(customer_for_id(3))
import pandas as pd
import sqlite3
conn = [Link]('thunderbird_manufacturing.db')
data_frame_from_sql = pd.read_sql_query("SELECT * FROM CUSTOMER", conn)
print(data_frame_from_sql)
setwd('c:\\my_folder')
library(DBI)
library(RSQLite)
dbClearResult(my_query)
print(my_data)
remove(my_query)
dbDisconnect(db)
You can get detailed information on how to work with R and SQL in the official DBI
documentation:
Keep in mind there are many solutions and libraries that abstract away SQL operations,
which can be good or bad depending on how much control you want to maintain:
Hibernate - ORM technology that's been around since 2001 and has a mature
implementation. However, Hibernate is notorious for its strange loading
mechanisms, and can be a hindrance if you want to maintain control of how and
when data is loaded from a database.
jOOQ - A more modern (but commercial) ORM that fluently allows working with
databases in a type-safe manner.
Speedment - Another fast turnaround, fluent API that compiles pure Java code
from table schemas to work with databases.
If you are going to go the vanilla JDBC route, it is a good idea to use a connection
pool so you can persist and reuse several connections safely in a multithreaded
environment. HikariCP is a leading option to achieve this and provides an optimal
DataSource implementation, which is Java's recommended interface for a database
connection pool.
A helpful resource to learning how to work with JDBC is Jenkov's in-depth tutorial:
[Link]
dependencies {
compile '[Link]:sqlite-jdbc:3.19.3'
compile '[Link]:HikariCP:2.6.3'
compile 'org.slf4j:slf4j-simple:1.7.25'
}
Below, we create a simple Java application that creates a Hikari data source with a
minimum of 1 connection and a maximum of 5. Then we create a query and loop through
it's ResultSet .
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
public class JavaLauncher {
try {
HikariConfig config = new HikariConfig();
[Link]("jdbc:sqlite:/c:/git/oreilly_intermediate_sql_for_data/thunderbird_man
[Link](1);
[Link](5);
while ([Link]()) {
[Link]([Link]("CUSTOMER_ID") + " " +
[Link]("CUSTOMER_NAME"));
}
} catch (Exception e) {
[Link]();
}
}
}
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
try {
HikariConfig config = new HikariConfig();
[Link]("jdbc:sqlite:/c:/git/oreilly_intermediate_sql_for_data/thunderbird_man
[Link](1);
[Link](5);
ResultSet rs = [Link]();
while ([Link]()) {
[Link]([Link]("CUSTOMER_ID") + " " +
[Link]("CUSTOMER_NAME"));
}
} catch (Exception e) {
[Link]();
}
}
}
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
try {
HikariConfig config = new HikariConfig();
[Link]("jdbc:sqlite:/c:/git/oreilly_intermediate_sql_for_data/thunderbird_man
[Link](1);
[Link](5);
[Link](1,"Kry Kall");
[Link](2,"BETA");
[Link](3, [Link](35.0));
[Link]();
} catch (Exception e) {
[Link]();
}
}
}