Filtering records with WHERE
Now let's transform the data by filtering out records with the WHERE clause
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int,
Urls string,
ClickedUrls string
FROM "/SearchLog.tsv"
USING Extractors.Tsv();
@output =
SELECT *
FROM @searchlog
WHERE Region == "en-gb";
OUTPUT @output
TO "/SearchLog_output.tsv"
USING Outputters.Tsv();
The SQL Logical Operators
The AND/OR/NOT operators can be combined with parentheses to create more complex logical expressions
@output =
SELECT Start, Region, Duration
FROM @searchlog
WHERE (Duration >= 2*60 AND Duration <= 5*60) OR NOT (Region == "en-gb");
The C# Logical Operators
U-SQL also supports the C# logical operators
@output =
SELECT Start, Region, Duration
FROM @searchlog
WHERE (Duration >= 2*60 && Duration <= 5*60) || (!(Region == "en-gb"));
SQL Logical Operators (AND OR) versus C# Operators ( && || )
These operators behave the same except for their short-circuiting behavior:
- SQL-style logical operators: These DO NOT short-circuit
- C#-style logical operators: These DO short-circuit
Use the SQL-style logical operators unless you MUST have the short-circuiting behavior. The reasons why this is important are covered in a later chapter that describes the order of evaluation of predicates in expressions.
Find all the sessions occurring before a date
@output =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Start <= DateTime.Parse("2012/02/17");
Find all the sessions occurring between two dates
@output =
SELECT Start, Region, Duration
FROM @searchlog
WHERE
Start >= DateTime.Parse("2012/02/16")
AND Start <= DateTime.Parse("2012/02/17");
Filtering on calculated columns
Consider this case where SELECT is used to define a new column called DurationInMinutes
@output =
SELECT
Start,
Region,
Duration/60.0 AS DurationInMinutes
FROM @searchlog;
There are a couple of approaches for filtering rows based on the DurationInMinutes
value
The first option is to use RowSet refinement
@output =
SELECT
Start,
Region,
Duration/60.0 AS DurationInMinutes
FROM @searchlog;
@output =
SELECT *
FROM @output
WHERE DurationInMinutes>= 20;
The second option is to repeat the expression in the WHERE clause
@output =
SELECT
Start,
Region,
Duration/60.0 AS DurationInMinutes
FROM @searchlog
WHERE Duration/60.0>= 20;
WHERE does not work on calculated columns in the same statement
WHERE
filters rows coming into to the statement. The DurationInMinutes column doesn't exist in the input. Therefore WHERE cannot operate on it. So, the example below will not compile
// SYNTAX ERROR: WHERE cannot be used with columns created by SELECT
@output =
SELECT Start, Region,Duration/60.0 AS DurationInMinutes
FROM @searchlog
WHERE DurationInMinutes>= 20;