Questions › SQL Server JOINs executing very slowly with large tables

Below I have a query that takes an email from one table and joins three other tables to match that email. It also filters on two columns (utm_campaign and utm_source) to make sure they are not empty.

Two of the tables have close to a million rows, and the other two are around 100,000 rows.

Currently it takes approximately 60 seconds to output 100 rows. I'm expecting this SELECT statement to output between 500,000 and 1,000,000 rows, which at this rate might take 4-5 days to complete.

I don't understand why the server's processors are only at 27% usage, or what I could be doing differently with the JOINs to make this faster. I have refined the JOINs as well as I could and increased the number of processors on the server, to no avail. I'm not very familiar with indexing, and I don't know whether it could be applied to most of this data.

Has anyone had experience with doing JOINs on such large tables who could identify flaws in the logic of my query, or maybe come up with a more efficient way of matching rows from other tables? Please see the complete query below for reference:

select
    PU.recip_id,
    PU.email,
    PU.date_joined,
    VP.utm_source AS VP_source,
    VP.utm_med AS VP_medium,
    VP.utm_camp AS VP_campaign,
    VP.created AS VP_created,
    SCH.utm_source AS SCH_source,
    SCH.utm_med AS SCH_medium,
    SCH.utm_camp AS SCH_campaign,
    SCH.created AS SCH_created,
    GF.utm_source AS GF_source,
    GF.utm_medium AS GF_medium,
    GF.utm_campaign AS GF_campaign,
    GF.created AS GF_created

FROM [Digital].[dbo].[postup_recipients] AS PU

LEFT JOIN [Digital].[dbo].[vp_charges] AS VP
    ON PU.email = '"' + VP.email + '"'

LEFT JOIN [Digital].[dbo].[stripe_customers] AS SCU
    ON PU.email = '"' + SCU.email + '"'

LEFT JOIN [Digital].[dbo].[stripe_charges] AS SCH
    ON SCU.cust_id = SCH.cust_id

LEFT JOIN [Digital].[dbo].[gform_entries] AS GF
    ON PU.email = '"' + GF.email + '"'

WHERE   (   GF.utm_source IS NOT NULL AND GF.utm_source != ''
                AND GF.utm_campaign IS NOT NULL AND GF.utm_campaign != '')
    OR
        (   VP.utm_source IS NOT NULL AND VP.utm_source != ''
                AND VP.utm_camp IS NOT NULL AND VP.utm_camp != '')
    OR
        (   SCH.utm_source IS NOT NULL AND SCH.utm_source != ''
                AND SCH.utm_camp IS NOT NULL AND SCH.utm_camp != '')
Comments:

The clearest sign that your DB design is very seriously flawed is that you are actually hitting 27% CPU usage on a query that should be trivially light on CPU resources. A query finding 100 (indexed) rows from a table with 1,000,000 rows should complete faster than you can blink - so you are doing millions of unnecessary string comparisons because of faulty DB design, which is where the performance drag is originating.

usr replied

Post the actual execution plan. It's probably terrible. With existing data, schema and query this should take a few seconds for everything using hash joins.
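
For example, in SSMS you can capture the actual plan and basic timings with standard session settings before running the SELECT:

SET STATISTICS IO, TIME ON;  -- per-table reads plus CPU/elapsed time
SET STATISTICS XML ON;       -- returns the actual execution plan as XML

-- ... run the SELECT here ...

SET STATISTICS XML OFF;
SET STATISTICS IO, TIME OFF;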

usr replied

It is true that the data model seems badly designed but that does not explain 60s per 100 rows of performance. Right now there is not enough information to understand the problem.


2 Answers:
Bruce Dunwiddie answered

Create an index on VP.email, SCU.email, SCH.cust_id, and GF.email.
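
Spelled out, that could look roughly like this (a sketch - the index names are made up, and it assumes the email and cust_id columns are of an indexable type such as varchar(255)):

-- Single-column indexes to support the join lookups; names are hypothetical.
CREATE NONCLUSTERED INDEX IX_vp_charges_email
    ON [Digital].[dbo].[vp_charges] (email);
CREATE NONCLUSTERED INDEX IX_stripe_customers_email
    ON [Digital].[dbo].[stripe_customers] (email);
CREATE NONCLUSTERED INDEX IX_stripe_charges_cust_id
    ON [Digital].[dbo].[stripe_charges] (cust_id);
CREATE NONCLUSTERED INDEX IX_gform_entries_email
    ON [Digital].[dbo].[gform_entries] (email);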

Reverse your join logic on the three joins where you're computing a value, e.g. PU.email = '"' + VP.email + '"' => VP.email = SUBSTRING(PU.email, 2, LEN(PU.email) - 2), as sketched below.
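
Applied to the query above, the rewritten joins would look roughly like this (a sketch - it assumes, as the original concatenation implies, that PU.email is stored wrapped in literal double quotes while the other tables store the bare address):

-- Strip the surrounding quotes from PU.email instead of adding quotes
-- around the other side, so each joined column stays bare and an index
-- on VP.email / SCU.email / GF.email can be used for the seek.
LEFT JOIN [Digital].[dbo].[vp_charges] AS VP
    ON VP.email = SUBSTRING(PU.email, 2, LEN(PU.email) - 2)

LEFT JOIN [Digital].[dbo].[stripe_customers] AS SCU
    ON SCU.email = SUBSTRING(PU.email, 2, LEN(PU.email) - 2)

LEFT JOIN [Digital].[dbo].[gform_entries] AS GF
    ON GF.email = SUBSTRING(PU.email, 2, LEN(PU.email) - 2)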

You might be able to play with your filters, but that gets a little tricky. I think VP.utm_source IS NOT NULL AND VP.utm_source != '' collapses to VP.utm_source > '', and you can create an index on VP.utm_source, but it will only be used if only a few rows are populated. You could also add it as a secondary column to the index on VP.email. This part, however, is the lesser of your problems; the joins above are most likely your biggest issue.
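
If you do go that route, a covering variant of the email index might look like this (again a sketch - the INCLUDE list is an assumption based on the VP columns the SELECT pulls):

-- email stays the seek key for the join; the utm columns are included
-- so the filter and the SELECT list need no key lookups.
CREATE NONCLUSTERED INDEX IX_vp_charges_email_covering
    ON [Digital].[dbo].[vp_charges] (email)
    INCLUDE (utm_source, utm_med, utm_camp, created);

Note that VP.utm_source > '' quietly covers the NULL check too, since a comparison against NULL never evaluates to true.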

Frumples replied
This is just what I needed, and thank you for not including negative side comments. In particular, it was the SUBSTRING() that reduced the execution time to 14 seconds; trying to add the quotation marks around the values was the biggest mistake. I also added indices on the tables.
TomTom answered

Let's start with a reality check. This will help you down the road:

Two of the tables have close to a million rows, and the other two are around 100,000 rows.

Good. Where is "large"? A million rows was small when I started working with SQL Server - version 4.21, about 20 years ago. Today, do not say "large" unless every one of your tables in that join has a billion rows or more.

This is slow because of bad code and possibly bad database design. Not because SQL Server is bad.

I don't understand why the server's processors are only using 27% of its resources,

WHAT resources? Typically SQL Servers (database servers in general) are limited by disk IO or by memory, as most smaller shops keep them starving for memory and rarely invest in a proper disk subsystem. The CPU is rarely busy because the disks cannot keep up - it is hard to do processing while you wait for data to be loaded. That is a reason why database servers have been moving to all-SSD setups for years now: effectively cheaper, because SSDs are hundreds of times faster than spinning disks.

The problem with your joins is that they are bad. REALLY bad.

  • You do not use the standard join syntax at all - join and subselect. Look up the proper syntax for joins.

  • Your comparison is overcomplicated - or someone did not understand the value of NULL when loading the databases. This join should require only one condition per table, not two.

  • Do you really join via email address? This is super bad database design. There should be ONE table with email addresses, and all other tables should have a numeric foreign key to it - see the sketch right after this list.
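
A minimal sketch of such a layout, with all names hypothetical rather than taken from the OP's schema:

-- One canonical email table; everything else references it by id.
CREATE TABLE dbo.email_addresses (
    email_id INT IDENTITY(1,1) PRIMARY KEY,
    email    VARCHAR(255) NOT NULL UNIQUE  -- stored once, without quotes
);

-- Example of a referencing table: the numeric foreign key replaces
-- the string column, so joins become cheap integer comparisons.
CREATE TABLE dbo.charges (
    charge_id INT IDENTITY(1,1) PRIMARY KEY,
    email_id  INT NOT NULL
        REFERENCES dbo.email_addresses (email_id),
    created   DATETIME2 NULL
);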

In general, you will also want to look at your query plan (which you can publish here) to see how SQL Server processes it. Do you have the proper indices on the tables? Without wanting to be too negative, it looks like whoever designed the database did not have a lot of knowledge, so I would give it a high chance that the necessary indices are missing, which forces full table scans. The query plan will tell you that. If that is the case, consider yourself lucky: performance will increase dramatically once the necessary indices are added.
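
As a quick first pass before reading the plan, SQL Server also keeps missing-index suggestions in DMVs you can query directly (a sketch - the DMV and column names are standard, but treat the output as hints, not gospel):

-- Surface the missing-index suggestions accumulated since the last
-- restart, weighted by how often and how much they would have helped.
SELECT TOP (20)
    d.statement AS table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    s.user_seeks,
    s.avg_user_impact
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
    ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s
    ON s.group_handle = g.index_group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;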

TomTom replied
Oh, and make sure you need no SUBSTRING or anything similar in the join - that (generally) KILLS ALL INDEX USE on the column it wraps. There should be a simple "=" in the join, or the db is badly designed.
Frumples replied
I realized that I had posted an experimental version of my query; I edited and posted the correct one. Still digesting the answers and comments.
TomTom replied
Well, it still is not exactly a query as it should be. My answer stands. The comment is obsolete.