Cross Joins

Sixth in a series of posts in response to Tim Ford's #EntryLevel Challenge.

Cross joins give all combinations of rows from two tables. They aren't normally useful, but they can be so in the right circumstances. One use for cross joins lies in generating large amounts of data for performance testing or similar purposes.

Programmers would sometimes write cross joins by mistake. This was back in the day of comma-delimited join syntax. All it took was to mention two tables in a FROM clause and then forget to place the intended join conditions in the WHERE clause. (Which is why I prefer the explicit join syntax I wrote about in March). Here's an example that you shouldn't execute, because it returns almost four billion rows:

FROM Sales.SalesOrderHeader, Sales.SalesOrderDetail;

Now it's obvious I've omitted the WHERE clause in this case. But just imagine a page-long query with many predicates in the WHERE clause, and perhaps with a few subqueries into the bargain, each with their own WHERE clause predicates. Programmers would sometimes lose track and end up with a cross join by mistake. 

So cross joins aren't too useful, but one use for them lies in generating large amounts of data for testing, especially for performance testing. Following is a cross join -- don't execute it! -- that generates almost 400 million first and last name combinations:

SELECT ppA.FirstName, ppB.LastName
FROM Person.Person AS ppA
     Person.Person AS ppB;

OK. Execute the query if you like, but you might not want quite so many rows. As well, you'll end up with a good many duplicate name combinations, and that's because there are numerous given names and surnames that appear multiple times in the Person table. 

You might prefer only the unique combinations that are possible. How many rows would that yield? Execute the following query to count the number of distinct given names and surnames:

FROM Person.Person AS ppA
FROM Person.Person AS ppB;


Multiply 1018 unique first names by 1206 unique last names, and you get 1,227,708 unique combinations. That's around a million and a quarter rows, which you can generate by cross joining two subqueries as follows:

    SELECT DISTINCT (ppA.FirstName)
    FROM Person.Person AS ppA
    SELECT DISTINCT (ppB.LastName)
    FROM Person.Person ppB) AS ppD;
FirstName LastName
--------- -----------
Gustavo   Abbas
Gustavo   Abel
Gustavo   Abercrombie

The preceding query may look intimidating at first. Stand up to it though. Don't back down from the syntax, and the query will begin to make sense as you study it. Here's what's happening:

  1. The FROM clause begins with a subquery generating a table of unique first names.
  2. That table is named as ppC in the context of the main query.
  3. A second subquery generates a table of unique last names.
  4. The two tables are cross joined.
  5. The result of the cross join is given the name ppD.
  6. The SELECT clause returns all the columns from the two subsqueries. 

The difficult part for me in writing the query was to realize that I had to give each table a name. My first attempt omitted all the table aliases, and that was a fail. Then I rewrote the query, paying attention as I went to be sure that I named each intermediate table, and then the final table that is the query's result. 

Normally it's a bad thing when cross joins generate overwhelmingly large row counts. But that sort of "explosion" of row combinations turns to your advantage when creating test data. Using cross joins, you can create large volumes of data for performance testing, and in large numbers of combinations such as our one and one-quarter million distinct names generated from just over 2200 input names. Cross joins are not something you want to write by accident, but they can be just the ticket when you need to generate volumes of data for load- and performance testing.