join or subselect: JOIN with table valued function very slow

If have a tabled-valued function uCalendar that returns a two-column table
with attributes dayno (number of days past 19000101) and caldate, which is a
formatted date based on dayno.
DDL for uCalendar:
CREATE FUNCTION uCalendar (@.startdate datetime = '19000101', @.enddate
datetime )
RETURNS @.calendar TABLE (dayno bigint, caldate char(20))
AS
BEGIN
DECLARE @.firstday bigint
DECLARE @.lastday bigint
SELECT @.firstday = DATEDIFF(dd, 0, @.startdate)
SELECT @.lastday = DATEDIFF(dd, 0, @.enddate)
WHILE (@.firstday <= @.lastday)
BEGIN
INSERT INTO @.calendar VALUES (@.firstday, CONVERT(char(20), DATEADD(dd,
@.firstday, 0), 107))
SET @.firstday = @.firstday + 1
END
RETURN
END
Calling the function like so
select * from uCalendar('20050101','20051231')
returns the result set very quickly.
I have another query that reports the number of hits against a website per
day:
SELECT DATEDIFF(dd, 0, [time]) AS dayno, COUNT(*) AS "hits" FROM weblog
GROUP BY DATEDIFF(dd, 0, [time])
weblog is a view that references a base table with approximately 500,000
rows. The above query finishes in about 2 seconds.
However, if I try
select t1.dayno from uCalendar('20050101','20051231') AS t1
LEFT JOIN
(SELECT DATEDIFF(dd, 0, [time]) AS dayno, COUNT(*) AS "hits" FROM weblog
GROUP BY DATEDIFF(dd, 0, [time])) t2
ON t1.dayno=t2.dayno
It seems to hang. The longest I let it run was about 3 minutes. If the
result sets from each "side" of the join are produced rapidly, why doesn't
this query produce its results quickly? Is it because it is constantly
re-evaluating the function over and over?
I'm working on a better solution using a stored procedure to get the entire
result. The point of my question is *why* is it slow, not "this will work
instead".
Thanks,
-Mark WilliamsHi Mark
Why are you not using a calendar table for this
http://www.aspfaq.com/show.asp?id=2519?
John
"Mark Williams" <MarkWilliams@.discussions.microsoft.com> wrote in message
news:F3466971-ED3D-4026-91B1-0EAFD26DBDB1@.microsoft.com...
> If have a tabled-valued function uCalendar that returns a two-column table
> with attributes dayno (number of days past 19000101) and caldate, which is
> a
> formatted date based on dayno.
> DDL for uCalendar:
> CREATE FUNCTION uCalendar (@.startdate datetime = '19000101', @.enddate
> datetime )
> RETURNS @.calendar TABLE (dayno bigint, caldate char(20))
> AS
> BEGIN
> DECLARE @.firstday bigint
> DECLARE @.lastday bigint
> SELECT @.firstday = DATEDIFF(dd, 0, @.startdate)
> SELECT @.lastday = DATEDIFF(dd, 0, @.enddate)
> WHILE (@.firstday <= @.lastday)
> BEGIN
> INSERT INTO @.calendar VALUES (@.firstday, CONVERT(char(20), DATEADD(dd,
> @.firstday, 0), 107))
> SET @.firstday = @.firstday + 1
> END
> RETURN
> END
> Calling the function like so
> select * from uCalendar('20050101','20051231')
> returns the result set very quickly.
> I have another query that reports the number of hits against a website per
> day:
> SELECT DATEDIFF(dd, 0, [time]) AS dayno, COUNT(*) AS "hits" FROM weblog
> GROUP BY DATEDIFF(dd, 0, [time])
> weblog is a view that references a base table with approximately 500,000
> rows. The above query finishes in about 2 seconds.
> However, if I try
> select t1.dayno from uCalendar('20050101','20051231') AS t1
> LEFT JOIN
> (SELECT DATEDIFF(dd, 0, [time]) AS dayno, COUNT(*) AS "hits" FROM weblog
> GROUP BY DATEDIFF(dd, 0, [time])) t2
> ON t1.dayno=t2.dayno
> It seems to hang. The longest I let it run was about 3 minutes. If the
> result sets from each "side" of the join are produced rapidly, why
> doesn't
> this query produce its results quickly? Is it because it is constantly
> re-evaluating the function over and over?
> I'm working on a better solution using a stored procedure to get the
> entire
> result. The point of my question is *why* is it slow, not "this will work
> instead".
> Thanks,
> -Mark Williams|||Mark,
Best I can tell, the query optimizer has no information about the
number of rows or distribution of values in
uCalendar('20050101','20051231'),
and so instead of materializing the grouped table, then joining
it with the UDF, it chooses a query plan that runs a count from
the weblog table for each row of the calendar table. When I run
this against the Northwind Orders table, I see that the optimizer thinks
there are only one or two rows in the UDF result set. There is no
way to tell the optimizer to think agian.
The quickest solution is create a permanent (not a UDF) calendar
table indexed on at least the bigint column, and then join against that.
Here I do so, and put a wide range of dates into the calendar table.
I'm writing this against Northwind..Orders so I can test it, but the
improvement in the query plan should translate to your situation
as well.
CREATE TABLE tCalendar (
dayno bigint primary key,
caldate char(20) unique
)
GO
insert into tCalendar
select dayno, caldate
from uCalendar('20010101','20101231')
go
select t1.dayno
from tCalendar AS t1
left outer join (
SELECT
DATEDIFF(dd, 0, [OrderDate]) AS dayno,
COUNT(*) AS "hits" FROM Northwind..Orders
GROUP BY DATEDIFF(dd, 0, [OrderDate])
) T
on t1.dayno = T.dayno
WHERE t1.dayno between
datediff(day,0,'20050101') and datediff(day,0,'20051231')
It's important here to be sure the WHERE clause is a SARG.
Since you don't use any datetime data types here, you can't
compare anything directly against your two datetime strings,
and you don't want to put t1.dayno into an expression.
Ideally, you would use datetime as the type to store dates
with, not bigint and not a string, but perhaps this will help
you out until you can make other improvements to your design.
A permanent calendar table is always a good idea, and if you
use uCalendar widely, you could rewrite it to select from a
permanent table (using the same WHERE clause I show here
outside the join), so you don't have to rewrite as many queries.
I assume you know that your sample query is not too practical,
since because of the outer join with no where clause, your result
will just be all dayno values in the UDF. But the optimizer doesn't
manage to catch that...
Steve Kass
Drew University
Mark Williams wrote:

>If have a tabled-valued function uCalendar that returns a two-column table
>with attributes dayno (number of days past 19000101) and caldate, which is
a
>formatted date based on dayno.
>DDL for uCalendar:
>CREATE FUNCTION uCalendar (@.startdate datetime = '19000101', @.enddate
>datetime )
>RETURNS @.calendar TABLE (dayno bigint, caldate char(20))
>AS
>BEGIN
> DECLARE @.firstday bigint
> DECLARE @.lastday bigint
> SELECT @.firstday = DATEDIFF(dd, 0, @.startdate)
> SELECT @.lastday = DATEDIFF(dd, 0, @.enddate)
> WHILE (@.firstday <= @.lastday)
> BEGIN
> INSERT INTO @.calendar VALUES (@.firstday, CONVERT(char(20), DATEADD(dd,
>@.firstday, 0), 107))
> SET @.firstday = @.firstday + 1
> END
> RETURN
>END
>Calling the function like so
>select * from uCalendar('20050101','20051231')
>returns the result set very quickly.
>I have another query that reports the number of hits against a website per
>day:
>SELECT DATEDIFF(dd, 0, [time]) AS dayno, COUNT(*) AS "hits" FROM weblog
>GROUP BY DATEDIFF(dd, 0, [time])
>weblog is a view that references a base table with approximately 500,000
>rows. The above query finishes in about 2 seconds.
>However, if I try
>select t1.dayno from uCalendar('20050101','20051231') AS t1
>LEFT JOIN
>(SELECT DATEDIFF(dd, 0, [time]) AS dayno, COUNT(*) AS "hits" FROM weblog
>GROUP BY DATEDIFF(dd, 0, [time])) t2
>ON t1.dayno=t2.dayno
>It seems to hang. The longest I let it run was about 3 minutes. If the
>result sets from each "side" of the join are produced rapidly, why doesn't
>this query produce its results quickly? Is it because it is constantly
>re-evaluating the function over and over?
>I'm working on a better solution using a stored procedure to get the entire
>result. The point of my question is *why* is it slow, not "this will work
>instead".
>Thanks,
>-Mark Williams
>|||I'm aware of caledar tables, but my question was more oriented toward why it
was slow, and not "what is another solution?" I did come up with another
solution, which produced the results very quickly:
CREATE PROCEDURE hitsByDay
@.startdate datetime = '19000101',
@.enddate datetime
AS
BEGIN
--DECLARE @.calendar TABLE (dayno bigint, caldate char(20))
CREATE TABLE #calendar (dayno bigint, caldate char(20))
DECLARE @.firstday bigint
DECLARE @.lastday bigint
SELECT @.firstday = DATEDIFF(dd, 0, @.startdate)
SELECT @.lastday = DATEDIFF(dd, 0, @.enddate)
WHILE (@.firstday <= @.lastday)
BEGIN
INSERT INTO #calendar VALUES (@.firstday, CONVERT(char(20), DATEADD(dd,
@.firstday, 0), 107))
SET @.firstday = @.firstday + 1
END
SELECT t1.caldate, ISNULL(t2.hits,0) from #calendar AS t1
LEFT JOIN
(SELECT DATEDIFF(dd, 0, [time]) AS dayno, COUNT(*) AS "hits" FROM weblog
GROUP BY DATEDIFF(dd, 0, [time])) t2
ON t1.dayno=t2.dayno
ORDER BY t1.dayno
DROP TABLE #calendar
END
EXEC dbo.hitsByDay '20050101','20051231'
It should be noted that there are no indexes in the base table or the view
that referenced it. There are no natural candidate keys because of the natur
e
of the data (it's a web site log, lots of duplicates). I tried creating an
index the [time] column in the view, but it complained that is was
non-deterministic. (I don't buy that one).
So, the question is, why, specifically, the join with the table-valued
function is so slow. The join with the temporary #calendar table is pretty
quick, even without an index.
--
"John Bell" wrote:

> Hi Mark
> Why are you not using a calendar table for this
> http://www.aspfaq.com/show.asp?id=2519?
> John
> "Mark Williams" <MarkWilliams@.discussions.microsoft.com> wrote in message
> news:F3466971-ED3D-4026-91B1-0EAFD26DBDB1@.microsoft.com...
>
>|||Thank you; very well worded and insightful.
I ended up creating a stored procedure that dynamically creates a temporary
calendar based on the input start and end dates.
CREATE PROCEDURE hitsByDay
@.startdate datetime = '19000101',
@.enddate datetime
AS
BEGIN
--DECLARE @.calendar TABLE (dayno bigint, caldate char(20))
CREATE TABLE #calendar (dayno bigint, caldate char(20))
DECLARE @.firstday bigint
DECLARE @.lastday bigint
SELECT @.firstday = DATEDIFF(dd, 0, @.startdate)
SELECT @.lastday = DATEDIFF(dd, 0, @.enddate)
WHILE (@.firstday <= @.lastday)
BEGIN
INSERT INTO #calendar VALUES (@.firstday, CONVERT(char(20), DATEADD(dd,
@.firstday, 0), 107))
SET @.firstday = @.firstday + 1
END
SELECT t1.caldate, ISNULL(t2.hits,0) from #calendar AS t1
LEFT JOIN
(SELECT DATEDIFF(dd, 0, [time]) AS dayno, COUNT(*) AS "hits" FROM weblog
WHERE [time] BETWEEN @.startdate AND @.enddate
GROUP BY DATEDIFF(dd, 0, [time])) t2
ON t1.dayno=t2.dayno
ORDER BY t1.dayno
DROP TABLE #calendar
END
EXEC dbo.hitsByDay '20051201','20051231'
If you posted to this forum through TechNet, and you found my answers
helpful, please mark them as answers.
"Steve Kass" wrote:

> Mark,
> Best I can tell, the query optimizer has no information about the
> number of rows or distribution of values in
> uCalendar('20050101','20051231'),
> and so instead of materializing the grouped table, then joining
> it with the UDF, it chooses a query plan that runs a count from
> the weblog table for each row of the calendar table. When I run
> this against the Northwind Orders table, I see that the optimizer thinks
> there are only one or two rows in the UDF result set. There is no
> way to tell the optimizer to think agian.
> The quickest solution is create a permanent (not a UDF) calendar
> table indexed on at least the bigint column, and then join against that.
> Here I do so, and put a wide range of dates into the calendar table.
> I'm writing this against Northwind..Orders so I can test it, but the
> improvement in the query plan should translate to your situation
> as well.
> CREATE TABLE tCalendar (
> dayno bigint primary key,
> caldate char(20) unique
> )
> GO
> insert into tCalendar
> select dayno, caldate
> from uCalendar('20010101','20101231')
> go
> select t1.dayno
> from tCalendar AS t1
> left outer join (
> SELECT
> DATEDIFF(dd, 0, [OrderDate]) AS dayno,
> COUNT(*) AS "hits" FROM Northwind..Orders
> GROUP BY DATEDIFF(dd, 0, [OrderDate])
> ) T
> on t1.dayno = T.dayno
> WHERE t1.dayno between
> datediff(day,0,'20050101') and datediff(day,0,'20051231')
>
> It's important here to be sure the WHERE clause is a SARG.
> Since you don't use any datetime data types here, you can't
> compare anything directly against your two datetime strings,
> and you don't want to put t1.dayno into an expression.
> Ideally, you would use datetime as the type to store dates
> with, not bigint and not a string, but perhaps this will help
> you out until you can make other improvements to your design.
> A permanent calendar table is always a good idea, and if you
> use uCalendar widely, you could rewrite it to select from a
> permanent table (using the same WHERE clause I show here
> outside the join), so you don't have to rewrite as many queries.
> I assume you know that your sample query is not too practical,
> since because of the outer join with no where clause, your result
> will just be all dayno values in the UDF. But the optimizer doesn't
> manage to catch that...
> Steve Kass
> Drew University
> Mark Williams wrote:
>
>|||Hi Mark
You may find a more permanent calendar table would be useful elsewhere,
Check where you use data functions and comparisons to see if using one would
be more efficient. Searching Google for "UDF AND SLOW" turns up many hits,
although most of these are related to scalar functions e.g.
http://www.sql-server-performance.c...server_udfs.asp
Your function(s) are not set based solutions which included looping which
would be expected to perform worse with a larger number of iterations, this
is where a calendar table would be significantly faster, therefore I would
make sure that you test it with a full date range.
John
"Mark Williams" <MarkWilliams@.discussions.microsoft.com> wrote in message
news:D4E2BC61-680E-4B03-9535-4E0015828A5C@.microsoft.com...
> I'm aware of caledar tables, but my question was more oriented toward why
> it
> was slow, and not "what is another solution?" I did come up with another
> solution, which produced the results very quickly:
> CREATE PROCEDURE hitsByDay
> @.startdate datetime = '19000101',
> @.enddate datetime
> AS
> BEGIN
> --DECLARE @.calendar TABLE (dayno bigint, caldate char(20))
> CREATE TABLE #calendar (dayno bigint, caldate char(20))
> DECLARE @.firstday bigint
> DECLARE @.lastday bigint
> SELECT @.firstday = DATEDIFF(dd, 0, @.startdate)
> SELECT @.lastday = DATEDIFF(dd, 0, @.enddate)
> WHILE (@.firstday <= @.lastday)
> BEGIN
> INSERT INTO #calendar VALUES (@.firstday, CONVERT(char(20), DATEADD(dd,
> @.firstday, 0), 107))
> SET @.firstday = @.firstday + 1
> END
> SELECT t1.caldate, ISNULL(t2.hits,0) from #calendar AS t1
> LEFT JOIN
> (SELECT DATEDIFF(dd, 0, [time]) AS dayno, COUNT(*) AS "hits" FROM weblog
> GROUP BY DATEDIFF(dd, 0, [time])) t2
> ON t1.dayno=t2.dayno
> ORDER BY t1.dayno
> DROP TABLE #calendar
> END
> EXEC dbo.hitsByDay '20050101','20051231'
> It should be noted that there are no indexes in the base table or the view
> that referenced it. There are no natural candidate keys because of the
> nature
> of the data (it's a web site log, lots of duplicates). I tried creating an
> index the [time] column in the view, but it complained that is was
> non-deterministic. (I don't buy that one).
> So, the question is, why, specifically, the join with the table-valued
> function is so slow. The join with the temporary #calendar table is pretty
> quick, even without an index.
> --
> "John Bell" wrote:
>

Friday, March 9, 2012

JOIN with table valued function very slow

No comments:

Post a Comment

join or subselect

Blog Archive

About Me