SQL Joins and Query Optimization

SQL Joins
and Query Optimization

How it works

by Brian Gallagher
www.BrianGallagher.com
G

Why
Wh JOIN?
 Combine data from multiple tables
 Reduces the number of queries needed
 Moves processing burden to database
server
 Improves data integrity (over combining
data in
d t i program code)
d )

Types of JOINs
T f JOIN
 INNER
 the most common, return all rows with matching records
, g
 SELECT * FROM T1 INNER JOIN T2 ON T1.fld1 = T2.fld2
 LEFT or LEFT OUTER
 return all rows on the left (first) table and right (second)
 If no matching record on the right side, NULL-values for each field are returned
 SELECT * FROM T1 LEFT OUTER JOIN T2 ON T1 fld1 = T2 fld2 T1.fld1 T2.fld2
 FULL or FULL OUTER
 return all rows on the left (first) table and right (second)
 If no matching record on the left or right side, NULL-values for each field are returned
 S
SELECT * FROM T1 FULL OU
C O U OUTER JO JOIN T2 ON T1.fld1 = T2.fld2
O . d . d
 CROSS
 the least common, return all possible row combinations
 SELECT * FROM T1, T2
 RIGHT or RIGHT OUTER
 Don’t use them without a very good reason.
 They do not add any functionality over a LEFT JOIN and make code more confusing
 Works the same as the LEFT and LEFT OUTER JOINs, but the second table is the one
which all rows are returned from. These two queries are functionally identical:
 SELECT * FROM T1 LEFT JOIN T2 ON T1.fld1 = T2.fld2
 SELECT * FROM T2 RIGHT JOIN T1 ON T1.fld1 = T2.fld2

Sample T t Data
S l Test D t
 Sample Customer and Job records

LEFT (OUTER) JOIN
( )

One-To-Many LEFT JOIN
y

RIGHT (OUTER) JOIN
( )

FULL (OUTER) JOIN
( )

FULL (OUTER) JOIN
 FULL JOIN is not supported in MS-Access
 Can be emulated using UNION queries

 They
Th are rarely used
l d

CROSS JOINs

The Join you never use…
except every time

Making Big
M ki it Bi
 A CROSS JOIN combines every row on the left
table with every row in the right table
 Resulting recordset will b th t t l of b th row
R lti d t ill be the total f both
counts multiplied together
 Leftrow has 10 000 records
10,000
 Right row has 30,000 records
 Resulting recordset has 300,000,000 records
 If no JOIN condition is specified, a CROSS JOIN
will be executed

Joins
J i each record t th other t bl
h d to the th table

CROSS JOIN’s resulting set of all
JOIN s
record combinations

How C
H Cross J i M k I
Joins Make Inner J i
Joins
 Tables are joined using a CROSS JOIN
 JOIN criteria is evaluated, removing all records
which d ’t match th “ON” condition
hi h don’t t h the diti
SELECT *
FROM Customers C
INNER JOIN Job J
ON C.CustID = J.CustID
 The remaining rows are the recordset returned
for the INNER JOIN

Using “ON” criteria to select records

One-to-Many CROSS JOIN
y

One to Many: Cross Join to Inner Join

Unpredictable R
U di t bl Record O d
d Orders
 Databases are not required to return records in
any particular order
 Records may be returned by the order they are
stored on disk or may be unordered, depending
on how the query engine handles data
 If you need data in a particular order, use an
dd t i ti l d
ORDER BY clause
 Some databases may return data presorted by
Primary key or Clustered index
 DO NOT DEPEND on this behavior
 It is not reliably portable across different databases
 Do not make a program rely on a behavior not
specified by the program – bad programming style

Indexes

Unique, clustered, etc.

What
Wh t are Indexes
I d
 Indexes are a pre-sorted list of database
field values
 This list speeds up queries A LOT
 An index is much smaller than the full
recordset
 An index tracks only the fields specified when
created

Indexes and P f
I d d Performance
 Indexes GREATLY speed up queries on the
fields being indexed
 Indexes SLIGHTLY slow d
I d l down INSERTS
INSERTS,
UPDATES and DELETES
 The indexes need to be updated anytime a record
changes
 This adds a small bit of overhead to the system
 The increase in query speed usually greatly
offsets the slight extra load on INSERTS,
UPDATES and DELETES
 Most database use is typically reading (instead of
writing)

Types of Indexes
T fI d
 Simple Index
 Indexesthe field specified
 Can have multiple simple indexes
 Speeds up queries using the indexed field
 Composite Index
 Indexescombinations of fields
 Can have multiple composite indexes
 Speeds up queries using the specified
combination of fi ld
bi ti f fields

Simple Indexes
Si l I d
 An Index contains a list of all field values
and pointers to the record with that value
p

How Indexes help SELECTs
p

Composite I d
C it Indexes
 Useful only when you will be querying multiple
fields simultaneously
 Most selective (resulting in the fewest records
matching) fields should be listed first
 Slight performance gain over multiple Simple
Indexes when used correctly
 No performance gain (possibly even a
performance loss) when used incorrectly.

Searching I d
S hi Indexes
 There are high performance algorithms for
high-performance
searching pre-sorted data
I d
Index data stored i t
d t t d internally as bi
ll binary t
trees
(typically)
 Searching through
binary tree much
faster than reading
through table

Design Strategy

Make it work
Make it fast
Make it pretty
p y

Focus on what’s needed
F h t’ d d
 Make your query as specific as possible
 Makes joins simpler (and therefore faster)
 Returns no unwanted data
 Increases response time
 Reduces network and server load

 Put as much as possible in the WHERE
clause
 J i your fi ld properly
Join fields l

Maximum S l ti it / S
M i Selectivity Specificity
ifi it

 If differents parts of the WHERE clause
will result in smaller data sets than other
ones, list them the ones that will most
g
greatly reduce the data first
y
 List the others in the same order

Optimizing Queries

Faster is Better

Horizontal P titi i
H i t l Partitioning
 Avoid extremely “wide” records (records
with lots of fields)
)
 Split data into two tables with a common
ID field
 All the record data is rarely used in all
queries,
queries so it reduces traffic and speeds
up processing and disk reads

Add I d
Indexes
 Make sure all fields searched on regularly
are indexed

Define Clustered Index
D fi Cl t d I d
 The most-likely criteria to be sorted on
should be defined as your clustered index
y
 Clustered index defines the order the
records will physically be stored on disk

Normalize D t
N li Data
 No redundant data
 Link related data
 Don’t go crazy

Denormalize D t
D li Data
 If joining more 4 tables or more in a single
q y,
query, consider de-normalizing data
g
(combining data from external tables back
into the main table))
 Can return faster query results
 Risks include having to manage duplicate
data in database and larger disk usage

Size records to fit database page
size
 SQL Server 7.0, 2000, and 2005 data pages are
8K (8192 bytes) in size.
 Of the
th 8192 available b t i each d t page, only
il bl bytes in h data l
8060 bytes are used to store a row. The rest is used
for overhead.
 If you have a row that was 4031 bytes long, then only
one row could fit into a data page, and the remaining
4029 bytes left in the page would be empty
y p g py
 Making each record 1 byte shorter would halve the disk
access required to read the table

Use Primary K
U aPi Key
 Make sure you have a primary key defined
 Ideally, it should be a single unique field
 You can define composite keys, but they are
generally not needed if the database is designed
p p y
properly
 Most tables should have a unique TableNameID
field
 This allows you to identify any row by a single unique
value

Use
U an IDENTITY C l
Column
 If there is:
 no Primary Key on the table
 no unique index on the table
 Then
 Add an IDENTITY (unique value) column to
the table
 Optionally, index the IDENTITY column if you
will query it regularly

Move TEXT, NTEXT, and IMAGE
data into table
 These types normally stored outside the
table (uses a pointer to the data in the
( p
field)
 If these types will be searched frequently
frequently,
consider moving them into the database
table s
table’s storage with:
sp_tableoption 'tablename', 'text in row', 'on'
or
sp_tableoption 'tablename', 'text in row', 'size'

Consider R li ti i advance
C id Replication in d
 If Replication is to be used, factor the
decision into the original design of the
g g
database

Use Built-in Referential I t it
U B ilt i R f ti l Integrity
 Use foreign keys and validation
constraints built into the database
 Don’t manage it in the application

 Benefits:
 Faster execution
 Can’t mess it up with application errors

Re-test h
R t t when changing servers
h i
 Retest and rebenchmark applications when
moving to a new server, such as:
 Development
 Staging
 Production
 Problems often caused b
P bl ft d by:
 More rows in test data on servers “closer” to
production
 Server configurations different
 Tables not indexed the same way

Constraints are F t
C t i t Fast
 Constraints on fields are faster than
 Triggers
 Rules
 Defaults

Don’t duplicate ff t
D ’t d li t effort
 Don’t check for the same thing twice
 (duh, but it happens)
 Don’t use a trigger and a constraint to do the
same thing g
 Same for constraints and defaults
 Same for constraints and rules

Limit
Li it records t b JOIN d
d to be JOINed
 Use WHERE clause to minimize rows to
be JOINed.
 Particularly in the OUTER table of an OUTER
JOIN

Index F i K
I d Foreign Keys
 Fields in a table that are a foreign key are
not indexed automatically j
y just because
they are a foreign key.
 Add these indexes manually

Minimize Duplicate
Mi i i D li t JOIN fi ld d t
field data

 JOINs are slower when there are few
different keys in the j
y joining table
g

JOIN on unique indexed fi ld
i i d d fields
 JOINs will perform fastest when joining
indexed fields with no duplicate data
p

JOIN Numeric Fi ld
N i Fields
 JOINs on numeric fields perform much
faster than JOINs on other datatype fields.
yp

JOIN th exact same d t t
the t datatypes

 JOINs should be to the exact same
datatype for best p
yp performance
 Same type of field
 Same field length
 Same encoding (ASCII vs. Unicode, etc)

Use
U ANSI JOIN syntax
t
 Improves readability
 Less likely to cause programmer errors
 No Aliases:
SELECT fname, lname, department
FROM names INNER JOIN departments ON
names.employeeid
names employeeid = departments employeeid
departments.employeeid
 Microsoft syntax example:
FROM names departments
names,
WHERE names.employeeid = departments.employeeid
 Code is more portable between databases

Use Table Aliases
U T bl Ali
 Shortens code
 Makes it easier to follow, especially with long queries
 Identifies which table each field is coming from
 No table aliases:
FROM names INNER JOIN departments ON
names.employeeid = departments.employeeid
 Microsoft syntax example:
SELECT N.fname, N.lname, D.department
FROM names N INNER JOIN departments D ON
N.employeeid = D.employeeid

Don’t
D ’t use SELECT *
 Requires additional parsing on the server to
extract field names
 Returns unneccesary data (unless you are
actually using every field)
 Returns duplicate data on JOINs (do you really
need two copies of the RecordID field?)
 Can cause errors (in some databases) if JOINed
tables have fields with the same name

Store i separate fil in fil
St in t files i filegroup

 For very large joins placing tables to be
j
joined in separate p y
p physical files within the
same filegroup can improve performance
 SQL Server can spawn separate threads
for processing each file

Don’t
D ’t use CROSS JOINs
JOIN
 Unless actually needed (rarely) do not use
a CROSS JOIN (returns all combinations
(
of records on both sides of JOIN)
 People sometimes will do a CROSS JOIN
and then use DISTINCT and GROUP BY
to eliminate all the duplication
 Don’t do that.

JOINs
JOIN vs. Subquery
S b
 Depending on the specifics, either could be faster. Write
both and test performance to be sure which is best.
 JOIN:
SELECT a.*
FROM Table1 a INNER JOIN Table2 b
ON a.Table1ID = b.Table1ID
WHERE b.Table2ID = 'SomeValue'
 Subquery:
SELECT a.* FROM Table1 a WHERE Table1ID IN
a.
(SELECT Table1ID FROM Table2 WHERE Table2ID =
'SomeValue')

Avoid Subqueries unless needed
A id S b i l d d

 Avoid subqueries unless actually required
 Most subqueries can be expressed as
JOINs
 Subqueries are generally harder to read
and understand (for humans) and,
therefore,
therefore maintain

Use an Indexed View
(Enterprise 2000 and later only)
 An Indexed View maintains an updated
record of how the tables are joined via a
j
clustered index
 This slows INSERTs, UPDATEs and
INSERTs
DELETEs a bit, so consider the tradeoff
for the faster queries

Use Database’s Performance
Database s
Optimization Tools
 Most major databases have methods for
monitoring and optimizing database
g p g
performance
 Google “databaseName optimizing” for
databaseName optimizing
tons of links

Avoid
A id DISTINCT when possible
h ibl
 DISTINCT clauses are frequently (and usually
unintentionally) used to hide an incorrect JOIN
 Properly normalized database will not frequently
need DISTINCT clauses
 Look for incorrect JOINs or create more explicit
WHERE clauses to avoid needing DISTINCT
 Of course, use it when appropriate

The End
(for now)

Google: Query Optimization
for lots more information

SQL Joins and Query Optimization

SQL Joins and Query Optimization

More Related Content

What's hot

Similar to SQL Joins and Query Optimization

Recently uploaded

SQL Joins and Query Optimization