SQL Joins
and Query Optimization

How it works


by Brian Gallagher
www.BrianGallagher.com
         G
Why
Wh JOIN?
 Combine data from multiple tables
 Reduces the number of queries needed
 Moves processing burden to database
  server
 Improves data integrity (over combining
  data in
  d t i program code)
                    d )
Types of JOINs
T      f JOIN
   INNER
        the most common, return all rows with matching records
                        ,                             g
        SELECT * FROM T1 INNER JOIN T2 ON T1.fld1 = T2.fld2
   LEFT or LEFT OUTER
        return all rows on the left (first) table and right (second)
        If no matching record on the right side, NULL-values for each field are returned
        SELECT * FROM T1 LEFT OUTER JOIN T2 ON T1 fld1 = T2 fld2  T1.fld1   T2.fld2
   FULL or FULL OUTER
        return all rows on the left (first) table and right (second)
        If no matching record on the left or right side, NULL-values for each field are returned
        S
         SELECT * FROM T1 FULL OU
               C         O         U       OUTER JO JOIN T2 ON T1.fld1 = T2.fld2
                                                                O     . d        . d
   CROSS
        the least common, return all possible row combinations
        SELECT * FROM T1, T2
   RIGHT or RIGHT OUTER
      Don’t use them without a very good reason.
      They do not add any functionality over a LEFT JOIN and make code more confusing
      Works the same as the LEFT and LEFT OUTER JOINs, but the second table is the one
       which all rows are returned from. These two queries are functionally identical:
      SELECT * FROM T1 LEFT JOIN T2 ON T1.fld1 = T2.fld2
      SELECT * FROM T2 RIGHT JOIN T1 ON T1.fld1 = T2.fld2
Sample T t Data
S   l Test D t
   Sample Customer and Job records
INNER JOIN
LEFT (OUTER) JOIN
     (     )
One-To-Many LEFT JOIN
          y
RIGHT (OUTER) JOIN
      (     )
FULL (OUTER) JOIN
     (     )
FULL (OUTER) JOIN
 FULL JOIN is not supported in MS-Access
 Can be emulated using UNION queries


   They
    Th are rarely used
               l     d
CROSS JOINs

The Join you never use…
except every time
Making Big
M ki it Bi
   A CROSS JOIN combines every row on the left
    table with every row in the right table
   Resulting recordset will b th t t l of b th row
    R     lti       d t ill be the total f both
    counts multiplied together
     Leftrow has 10 000 records
                   10,000
     Right row has 30,000 records
     Resulting recordset has 300,000,000 records
   If no JOIN condition is specified, a CROSS JOIN
    will be executed
Joins
J i each record t th other t bl
       h      d to the th table
CROSS JOIN’s resulting set of all
        JOIN s
record combinations
How C
H   Cross J i M k I
          Joins Make Inner J i
                           Joins
   Tables are joined using a CROSS JOIN
   JOIN criteria is evaluated, removing all records
    which d ’t match th “ON” condition
      hi h don’t    t h the         diti
    SELECT *
    FROM Customers C
    INNER JOIN Job J
    ON C.CustID = J.CustID
   The remaining rows are the recordset returned
    for the INNER JOIN
Using “ON” criteria to select records
One-to-Many CROSS JOIN
          y
One to Many: Cross Join to Inner Join
Unpredictable R
U    di t bl Record O d
                  d Orders
   Databases are not required to return records in
    any particular order
   Records may be returned by the order they are
    stored on disk or may be unordered, depending
    on how the query engine handles data
   If you need data in a particular order, use an
              dd t i         ti l      d
    ORDER BY clause
   Some databases may return data presorted by
    Primary key or Clustered index
     DO NOT DEPEND on this behavior
     It is not reliably portable across different databases
     Do not make a program rely on a behavior not
      specified by the program – bad programming style
Indexes

Unique, clustered, etc.
What
Wh t are Indexes
         I d
 Indexes are a pre-sorted list of database
  field values
 This list speeds up queries A LOT
     An index is much smaller than the full
      recordset
     An index tracks only the fields specified when
      created
Indexes and P f
I d       d Performance
   Indexes GREATLY speed up queries on the
    fields being indexed
   Indexes SLIGHTLY slow d
    I d                  l down INSERTS
                                INSERTS,
    UPDATES and DELETES
     The indexes need to be updated anytime a record
      changes
     This adds a small bit of overhead to the system
   The increase in query speed usually greatly
    offsets the slight extra load on INSERTS,
    UPDATES and DELETES
     Most  database use is typically reading (instead of
      writing)
Types of Indexes
T      fI d
   Simple Index
     Indexesthe field specified
     Can have multiple simple indexes
     Speeds up queries using the indexed field
   Composite Index
     Indexescombinations of fields
     Can have multiple composite indexes
     Speeds up queries using the specified
      combination of fi ld
         bi ti     f fields
Simple Indexes
Si l I d
   An Index contains a list of all field values
    and pointers to the record with that value
        p
How Indexes help SELECTs
               p
Composite I d
C     it Indexes
   Useful only when you will be querying multiple
    fields simultaneously
   Most selective (resulting in the fewest records
    matching) fields should be listed first
   Slight performance gain over multiple Simple
    Indexes when used correctly
   No performance gain (possibly even a
    performance loss) when used incorrectly.
Composite Index
   p
Searching I d
S    hi Indexes
 There are high performance algorithms for
              high-performance
  searching pre-sorted data
I d
  Index data stored i t
         d t t d internally as bi
                           ll  binary t
                                      trees
  (typically)
 Searching through
  binary tree much
  faster than reading
  through table
Design Strategy

Make it work
Make it fast
Make it pretty
        p y
Focus on what’s needed
F         h t’     d d
   Make your query as specific as possible
     Makes  joins simpler (and therefore faster)
     Returns no unwanted data
     Increases response time
     Reduces network and server load

 Put as much as possible in the WHERE
  clause
 J i your fi ld properly
  Join     fields      l
Maximum S l ti it / S
M i     Selectivity Specificity
                        ifi it

 If differents parts of the WHERE clause
  will result in smaller data sets than other
  ones, list them the ones that will most
  g
  greatly reduce the data first
          y
 List the others in the same order
Optimizing Queries

Faster is Better
Horizontal P titi i
H i    t l Partitioning
 Avoid extremely “wide” records (records
  with lots of fields)
                     )
 Split data into two tables with a common
  ID field
 All the record data is rarely used in all
  queries,
  queries so it reduces traffic and speeds
  up processing and disk reads
Add I d
    Indexes
   Make sure all fields searched on regularly
    are indexed
Define Clustered Index
D fi Cl t d I d
 The most-likely criteria to be sorted on
  should be defined as your clustered index
                         y
 Clustered index defines the order the
  records will physically be stored on disk
Normalize D t
N    li Data
 No redundant data
 Link related data
 Don’t go crazy
Denormalize D t
D      li Data
   If joining more 4 tables or more in a single
    q y,
    query, consider de-normalizing data
                                   g
    (combining data from external tables back
    into the main table))
     Can  return faster query results
     Risks include having to manage duplicate
      data in database and larger disk usage
Size records to fit database page
size
   SQL Server 7.0, 2000, and 2005 data pages are
    8K (8192 bytes) in size.
     Of  the
          th 8192 available b t i each d t page, only
                       il bl bytes in    h data         l
      8060 bytes are used to store a row. The rest is used
      for overhead.
     If you have a row that was 4031 bytes long, then only
      one row could fit into a data page, and the remaining
      4029 bytes left in the page would be empty
              y              p g               py
          Making each record 1 byte shorter would halve the disk
           access required to read the table
Use Primary K
U aPi       Key
   Make sure you have a primary key defined
   Ideally, it should be a single unique field
     You can define composite keys, but they are
      generally not needed if the database is designed
      p p y
      properly
   Most tables should have a unique TableNameID
    field
     This allows you to identify any row by a single unique
      value
Use
U an IDENTITY C l
              Column
   If there is:
     no Primary Key on the table
     no unique index on the table
   Then
     Add   an IDENTITY (unique value) column to
      the table
     Optionally, index the IDENTITY column if you
      will query it regularly
Move TEXT, NTEXT, and IMAGE
data into table
 These types normally stored outside the
  table (uses a pointer to the data in the
         (      p
  field)
 If these types will be searched frequently
                                  frequently,
  consider moving them into the database
  table s
  table’s storage with:
    sp_tableoption 'tablename', 'text in row', 'on'
      or
    sp_tableoption 'tablename', 'text in row', 'size'
Consider R li ti i advance
C   id Replication in d
   If Replication is to be used, factor the
    decision into the original design of the
                          g         g
    database
Use Built-in Referential I t it
U B ilt i R f       ti l Integrity
 Use foreign keys and validation
  constraints built into the database
 Don’t manage it in the application


   Benefits:
     Faster execution
     Can’t mess it up with application errors
Re-test h
R t t when changing servers
            h   i
   Retest and rebenchmark applications when
    moving to a new server, such as:
     Development
     Staging
     Production
   Problems often caused b
    P bl      ft        d by:
     More rows in test data on servers “closer” to
      production
     Server configurations different
     Tables not indexed the same way
Constraints are F t
C   t i t       Fast
   Constraints on fields are faster than
     Triggers
     Rules
     Defaults
Don’t duplicate ff t
D ’t d li t effort
   Don’t check for the same thing twice
     (duh, but it happens)
     Don’t use a trigger and a constraint to do the
      same thing g
     Same for constraints and defaults
     Same for constraints and rules
Limit
Li it records t b JOIN d
           d to be JOINed
   Use WHERE clause to minimize rows to
    be JOINed.
     Particularly   in the OUTER table of an OUTER
     JOIN
Index F i K
I d Foreign Keys
 Fields in a table that are a foreign key are
  not indexed automatically j
                            y just because
  they are a foreign key.
 Add these indexes manually
Minimize Duplicate
Mi i i D li t JOIN fi ld d t
                   field data

   JOINs are slower when there are few
    different keys in the j
                y         joining table
                                g
JOIN on unique indexed fi ld
          i    i d   d fields
   JOINs will perform fastest when joining
    indexed fields with no duplicate data
                             p
JOIN Numeric Fi ld
     N    i Fields
   JOINs on numeric fields perform much
    faster than JOINs on other datatype fields.
                                    yp
JOIN th exact same d t t
     the    t      datatypes

   JOINs should be to the exact same
    datatype for best p
         yp           performance
     Same type of field
     Same field length
     Same encoding (ASCII vs. Unicode, etc)
Use
U ANSI JOIN syntax
               t
   Improves readability
   Less likely to cause programmer errors
   No Aliases:
    SELECT fname, lname, department
      FROM names INNER JOIN departments ON
      names.employeeid
      names employeeid = departments employeeid
                         departments.employeeid
   Microsoft syntax example:
    SELECT fname, lname, department
      FROM names departments
           names,
      WHERE names.employeeid = departments.employeeid
   Code is more portable between databases
Use Table Aliases
U T bl Ali
   Shortens code
   Makes it easier to follow, especially with long queries
   Identifies which table each field is coming from
   No table aliases:
    SELECT fname, lname, department
      FROM names INNER JOIN departments ON
      names.employeeid = departments.employeeid
   Microsoft syntax example:
    SELECT N.fname, N.lname, D.department
      FROM names N INNER JOIN departments D ON
      N.employeeid = D.employeeid
Don’t
D ’t use SELECT *
   Requires additional parsing on the server to
    extract field names
   Returns unneccesary data (unless you are
    actually using every field)
   Returns duplicate data on JOINs (do you really
    need two copies of the RecordID field?)
   Can cause errors (in some databases) if JOINed
    tables have fields with the same name
Store i separate fil in fil
St    in      t files i filegroup

 For very large joins placing tables to be
  j
  joined in separate p y
              p      physical files within the
  same filegroup can improve performance
 SQL Server can spawn separate threads
  for processing each file
Don’t
D ’t use CROSS JOINs
               JOIN
 Unless actually needed (rarely) do not use
  a CROSS JOIN (returns all combinations
                     (
  of records on both sides of JOIN)
 People sometimes will do a CROSS JOIN
  and then use DISTINCT and GROUP BY
  to eliminate all the duplication
     Don’t   do that.
JOINs
JOIN vs. Subquery
         S b
   Depending on the specifics, either could be faster. Write
    both and test performance to be sure which is best.
   JOIN:
    SELECT a.*
      FROM Table1 a INNER JOIN Table2 b
      ON a.Table1ID = b.Table1ID
      WHERE b.Table2ID = 'SomeValue'
   Subquery:
    SELECT a.* FROM Table1 a WHERE Table1ID IN
           a.
      (SELECT Table1ID FROM Table2 WHERE Table2ID =
      'SomeValue')
Avoid Subqueries unless needed
A id S b     i     l       d d

 Avoid subqueries unless actually required
 Most subqueries can be expressed as
  JOINs
 Subqueries are generally harder to read
  and understand (for humans) and,
  therefore,
  therefore maintain
Use an Indexed View
(Enterprise 2000 and later only)
 An Indexed View maintains an updated
  record of how the tables are joined via a
                               j
  clustered index
 This slows INSERTs, UPDATEs and
              INSERTs
  DELETEs a bit, so consider the tradeoff
  for the faster queries
Use Database’s Performance
    Database s
Optimization Tools
 Most major databases have methods for
  monitoring and optimizing database
             g     p      g
  performance
 Google “databaseName optimizing” for
            databaseName optimizing
  tons of links
Avoid
A id DISTINCT when possible
               h       ibl
   DISTINCT clauses are frequently (and usually
    unintentionally) used to hide an incorrect JOIN
   Properly normalized database will not frequently
    need DISTINCT clauses
   Look for incorrect JOINs or create more explicit
    WHERE clauses to avoid needing DISTINCT
   Of course, use it when appropriate
The End
               (for now)

Google: Query Optimization
for lots more information