Grouping and Aggregation
Grouping, in essence, collapses multiple rows into single rows based on some criteria. Hand-in-hand with performing a grouping operation, some fields in the output rowset must be aggregated into some meaningful value (or discarded if no possible or meaningful aggregation can be done).
We can witness this behavior by building up to it in stages.
// list all session durations.
@output =
SELECT Duration
FROM @searchlog;
This creates a simple list of integers.
Duration |
---|
73 |
614 |
74 |
24 |
1213 |
241 |
502 |
60 |
1270 |
610 |
422 |
283 |
305 |
10 |
612 |
1220 |
691 |
63 |
30 |
119 |
732 |
183 |
630 |
Now, let's add all the numbers together. This yields a RowSet with
exactly one row and one column.
// Find the total duration for all sessions combined
@output =
SELECT
SUM(Duration) AS TotalDuration
FROM @searchlog;
Duration |
---|
9981 |
Now let's use the GROUP BY operator to break apart the totals byRegion
.
// find the total Duration by Region
@output =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM searchlog
GROUP BY Region;
This returns:
en_ca | 24 |
---|---|
en_ch | 10 |
en_fr | 241 |
en_gb | 688 |
en_gr | 305 |
en_mx | 422 |
en_us | 8291 |
This is a good opportunity to explore a common use of the HAVING operator. We can use HAVING to restrict the output RowSet to those rows that have aggregate values we are interested in. For example, perhaps we want to find all the Regions where total dwell time is above some value.
// find all the Regions where the total dwell time is > 200
@output =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM @searchlog
GROUP BY Region
HAVING TotalDuration > 200;
en-fr | 241 |
---|---|
en-gb | 688 |
en-gr | 305 |
en-mx | 422 |
en-us | 8291 |
// Count the number of total sessions.
@output =
SELECT
COUNT() AS NumSessions
FROM @searchlog;
23 |
---|
Count the number of total sessions by Region.
@output =
SELECT
COUNT() AS NumSessions,
Region
FROM @searchlog
GROUP BY Region;
1 | en_ca |
---|---|
1 | en_ch |
1 | en_fr |
2 | en_gb |
1 | en_gr |
1 | en_mx |
16 | en_us |
Count the number of total sessions by Region and include total duration for that language.
@output =
SELECT
COUNT() AS NumSessions,
Region,
SUM(Duration) AS TotalDuration,
AVG(Duration) AS AvgDwellTtime,
MAX(Duration) AS MaxDuration,
MIN(Duration) AS MinDuration
FROM @searchlog
GROUP BY Region;
NumSessions | Region | TotalDuration | AvgDuration | MaxDuration | MinDuration |
---|---|---|---|---|---|
1 | en_ca | 24 | 24 | 24 | 24 |
1 | en_ch | 10 | 10 | 10 | 10 |
1 | en_fr | 241 | 241 | 241 | 241 |
2 | en_gb | 688 | 344 | 614 | 74 |
1 | en_gr | 305 | 305 | 305 | 305 |
1 | en_mx | 422 | 422 | 422 | 422 |
16 | en_us | 8291 | 518.1875 | 1270 | 30 |
Data types coming from aggregate functions
You should be aware of how some aggregation operators deal with data types. Some aggregations will promote a numeric type to a "larger" type.
For example:
SUM(floatexpr)
-> doubleSUM(doubleexpr)
-> doubleSUM(intexpr)
-> longSUM(byteexpr)
-> long
DISTINCT with Aggregates
Every aggregate function can take a DISTINCT qualifier.
For example
COUNT(DISTINCT x)
Notes
Aggregates can ONLY appear in a SELECT clause.