Friday, February 24, 2012

JOIN Process Order and Performance Comparisons

Hi all,
A common SQL that I do is joining parent and child tables together (1-M
relationship), e.g. Invoice and InvoiceItem tables. These tables have huge
number of rows.
Q1) Compare the two statements (that give the same result) below, from a
programming point of view, which one is more efficient?
Statement 1
--
SELECT *
FROM Invoice Ivo
INNER JOIN InvoiceItem IvoItem ON Ivo.RecNum = IvoItm.InvRecNum
WHERE Ivo.Date IS BETWEEN '2004-01-01 00:00:00' TO '2004-12-31 23:59:59'
AND IvoItm.ProductType = 1 --This line is processed in WHERE.
Statement 2
--
SELECT *
FROM Invoice Ivo
INNER JOIN InvoiceItem IvoItm ON Ivo.RecNum = IvoItm.InvRecNum
AND IvoItm.ProductType = 1 --This line is processed in JOIN.
WHERE Ivo.Date IS BETWEEN '2004-01-01 00:00:00' TO '2004-12-31 23:59:59'
This is something which I have been wondering for quite sometime. After
reading MSDN article "Join Fundamentals" stating, it says the JOIN statement
s
are processed first.
Q2) So in statement 2, does SQL Server process the JOIN 1st, then process
this filter "AND IvoItm.ProductType = 1", OR process that filter 1st, then
process the JOIN?
Q3) If it does the latter 1st, would it filter out the MANY rows in IvoItm,
before doing the JOINS? Therefore improving performance, as the amount of
data to join is reduced in the IvoItm?
Q4) Using the same analogy in Q3, would there be performance gain if I
rewrite the statement using sub-query to do the filtering 1st?
SELECT *
FROM Invoice Ivo
INNER JOIN (
SELECT *
FROM InvoiceItem
WHERE ProductType = 1 --This line is processed in sub-query.
) IvoItm ON Ivo.RecNum = IvoItm.InvRecNum
WHERE Ivo.Date IS BETWEEN '2004-01-01 00:00:00' TO '2004-12-31 23:59:59'
Q5) And would above be efficient than using the Statement 1 and 2?Answers inline:
--
----
Louis Davidson - http://spaces.msn.com/members/drsql/
SQL Server MVP
"HardKhor" <HardKhor@.discussions.microsoft.com> wrote in message
news:3E92F8CB-3E8A-4A4C-A0D9-71B707658D1A@.microsoft.com...
> Hi all,
> A common SQL that I do is joining parent and child tables together (1-M
> relationship), e.g. Invoice and InvoiceItem tables. These tables have huge
> number of rows.
> Q1) Compare the two statements (that give the same result) below, from a
> programming point of view, which one is more efficient?
> Statement 1
> --
> SELECT *
> FROM Invoice Ivo
> INNER JOIN InvoiceItem IvoItem ON Ivo.RecNum = IvoItm.InvRecNum
> WHERE Ivo.Date IS BETWEEN '2004-01-01 00:00:00' TO '2004-12-31 23:59:59'
> AND IvoItm.ProductType = 1 --This line is processed in WHERE.
> Statement 2
> --
> SELECT *
> FROM Invoice Ivo
> INNER JOIN InvoiceItem IvoItm ON Ivo.RecNum = IvoItm.InvRecNum
> AND IvoItm.ProductType = 1 --This line is processed in JOIN.
> WHERE Ivo.Date IS BETWEEN '2004-01-01 00:00:00' TO '2004-12-31 23:59:59'
These two statements are probably going to perform equivalently, as they are
mathematically equivalent. If this was an outer join, then it will make a
difference. First thing to do is to check the plan using Query analyzer. It
should be the exact same plan.
Logically, all of the JOIN operators will be dealt with, building the set
with Invoice.* + InvoiceLineItem, eliminating rows where the join criteria
fails. Then for every row in the output you apply the where clause.
However, the optimizer can can reorganize the query to make it execute
better as long as the same results would be the same.
For the rest of your questions, try looking at the plan first. It may answer
the questions for you as the answers may be different based on the number of
rows in each table.

> This is something which I have been wondering for quite sometime. After
> reading MSDN article "Join Fundamentals" stating, it says the JOIN
> statements
> are processed first.
>
This is true logically, but it is not required if the results are the same

> Q2) So in statement 2, does SQL Server process the JOIN 1st, then process
> this filter "AND IvoItm.ProductType = 1", OR process that filter 1st, then
> process the JOIN?
> Q3) If it does the latter 1st, would it filter out the MANY rows in
> IvoItm,
> before doing the JOINS? Therefore improving performance, as the amount of
> data to join is reduced in the IvoItm?
> Q4) Using the same analogy in Q3, would there be performance gain if I
> rewrite the statement using sub-query to do the filtering 1st?
> SELECT *
> FROM Invoice Ivo
> INNER JOIN (
> SELECT *
> FROM InvoiceItem
> WHERE ProductType = 1 --This line is processed in sub-query.
> ) IvoItm ON Ivo.RecNum = IvoItm.InvRecNum
> WHERE Ivo.Date IS BETWEEN '2004-01-01 00:00:00' TO '2004-12-31 23:59:59'
> Q5) And would above be efficient than using the Statement 1 and 2?|||As noted by Louis, for Inner Joins it doesn't matter whether you specify
the predicates in the WHERE clause or in the JOIN ON clause. For Outer
Joins the meaning is different.
What the query optimizer will do, is analyse which indexes your tables
have and whether they can be sed or scanned in order to reduce the
I/O needed to retrieve the actual data. Then it will do an access path
analysis to see in which order the joins would be fastest. If there is
an appropriate index, then physically, the (partial) filtering will
occur before the join.
Suppose you have a clustered index on Invoice(Date). Then you will
probably see a clustered index s on table Invoice, regardless whether
you used syntax 1 or 2. BTW: the only way to really tell is check the
query plan.
HTH,
Gert-Jan
HardKhor wrote:
> Hi all,
> A common SQL that I do is joining parent and child tables together (1-M
> relationship), e.g. Invoice and InvoiceItem tables. These tables have huge
> number of rows.
> Q1) Compare the two statements (that give the same result) below, from a
> programming point of view, which one is more efficient?
> Statement 1
> --
> SELECT *
> FROM Invoice Ivo
> INNER JOIN InvoiceItem IvoItem ON Ivo.RecNum = IvoItm.InvRecNum
> WHERE Ivo.Date IS BETWEEN '2004-01-01 00:00:00' TO '2004-12-31 23:59:59'
> AND IvoItm.ProductType = 1 --This line is processed in WHERE.
> Statement 2
> --
> SELECT *
> FROM Invoice Ivo
> INNER JOIN InvoiceItem IvoItm ON Ivo.RecNum = IvoItm.InvRecNum
> AND IvoItm.ProductType = 1 --This line is processed in JOIN.
> WHERE Ivo.Date IS BETWEEN '2004-01-01 00:00:00' TO '2004-12-31 23:59:59'
> This is something which I have been wondering for quite sometime. After
> reading MSDN article "Join Fundamentals" stating, it says the JOIN stateme
nts
> are processed first.
> Q2) So in statement 2, does SQL Server process the JOIN 1st, then process
> this filter "AND IvoItm.ProductType = 1", OR process that filter 1st, then
> process the JOIN?
> Q3) If it does the latter 1st, would it filter out the MANY rows in IvoItm
,
> before doing the JOINS? Therefore improving performance, as the amount of
> data to join is reduced in the IvoItm?
> Q4) Using the same analogy in Q3, would there be performance gain if I
> rewrite the statement using sub-query to do the filtering 1st?
> SELECT *
> FROM Invoice Ivo
> INNER JOIN (
> SELECT *
> FROM InvoiceItem
> WHERE ProductType = 1 --This line is processed in sub-query.
> ) IvoItm ON Ivo.RecNum = IvoItm.InvRecNum
> WHERE Ivo.Date IS BETWEEN '2004-01-01 00:00:00' TO '2004-12-31 23:59:59'
> Q5) And would above be efficient than using the Statement 1 and 2?

No comments:

Post a Comment