
Thursday, November 10, 2011

Inline Variable Assignment in SQL Server 2008


SQL Server 2008 allows you to declare a variable and initialize it at the same time, as most other programming languages do. Until this release, for some reason, that convenience was never extended to the variable declaration itself.


In previous versions, we could declare a parameter in a stored procedure and assign it a default value. However, we could never declare a variable and assign a value to it at the same time. All we could do was declare the variable first and assign it later.


This article illustrates how to declare a variable and assign its value in a single statement.


It is a very small change in SQL Server 2008, but it makes a lot of difference when we declare multiple variables.


Example:

---- SQL Server 2005 Way
DECLARE @Var INT
SET @Var = 5
SELECT @Var AS TestVar
GO
---- SQL Server 2008 Way
DECLARE @Var INT = 5
SELECT @Var AS TestVar
GO
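Where this really pays off is when several variables are declared together. Here is a minimal sketch of declaring and initializing multiple variables in one statement (the variable names are made up for illustration):

---- Declare and initialize several variables in a single statement (SQL Server 2008)
DECLARE @Counter INT = 0,
        @StartDate DATETIME = GETDATE(),
        @Name VARCHAR(50) = 'Test'
SELECT @Counter AS Counter, @StartDate AS StartDate, @Name AS [Name]
GO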


GROUP BY with ROLLUP, CUBE, and GROUPING SETS in SQL Server 2008

Introduction:
This post describes a new feature in SQL Server 2008 – GROUP BY with ROLLUP, CUBE, and GROUPING SETS


The ROLLUP, CUBE, and GROUPING SETS operators are extensions of the GROUP BY clause. The ROLLUP, CUBE, or GROUPING SETS operators can generate the same result set as when you use UNION ALL to combine single grouping queries; however, using one of the GROUP BY operators is usually more efficient.


ROLLUP:
The ROLLUP clause performs aggregations at multiple levels of a hierarchy, producing a subtotal row for each level of the grouping columns plus a grand total.


CUBE: 
CUBE is similar to WITH ROLLUP in that it adds new summary rows to the result. However, rather than producing a hierarchy of subtotals, CUBE produces totals for every possible combination of the grouping columns. In a pivot table layout, every total would be calculated.


GROUPING SETS:
The GROUPING SETS operator can generate the same result set as that generated by using a simple GROUP BY, ROLLUP, or CUBE operator. When all the groupings that are generated by using a full ROLLUP or CUBE operator are not required, you can use GROUPING SETS to specify only the groupings that you want. The GROUPING SETS list can contain duplicate groupings; and, when GROUPING SETS is used with ROLLUP and CUBE, it might generate duplicate groupings. Duplicate groupings are retained as they would be by using UNION ALL.



Multiple columns that appear in inner parentheses in the GROUPING SETS list are treated as a single set. For example, in the clause GROUP BY GROUPING SETS ((Column1, Column2), Column3, Column4), Column1 and Column2 are treated as one grouping element. The queries later in this post use GROUPING SETS with a composite element, (ProductId, ProductName).


When the GROUP BY clause contains more than one GROUPING SETS specification, separated by commas, the output of the sets is concatenated. The result set is the cross product (Cartesian product) of the grouping sets, as sketched below.
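As a rough sketch of this concatenation behaviour (the table T and its columns are hypothetical, not part of the examples below), the two GROUP BY forms shown here are equivalent:

-- Two GROUPING SETS lists in one GROUP BY are concatenated into their cross product
SELECT C1, C2, C3, SUM(Amt) AS Amt
FROM T
GROUP BY GROUPING SETS (C1, C2), GROUPING SETS (C3)

-- ...which is equivalent to:
-- GROUP BY GROUPING SETS ((C1, C3), (C2, C3))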


Example:
First, let's see how we rewrite simple WITH ROLLUP and CUBE queries using the new syntax.  I'll use the same schema and queries as in my previous posts:


CREATE TABLE #Product (ProductId INT, ProductName Varchar(500), Price MONEY)
INSERT INTO #Product (ProductId ,ProductName,Price)
VALUES(1, 'Test1', 12000),
 (2, 'Test2', 18000),
 (3, 'Test3', 25000),
 (4, 'Test4', 15000),
 (1, 'Test4', 15000)


SELECT ProductId ,ProductName, SUM(Price) AS Price
FROM #Product
GROUP BY ProductId ,ProductName WITH ROLLUP


SELECT ProductId ,ProductName, SUM(Price) AS Price
FROM #Product
GROUP BY ProductId ,ProductName WITH CUBE


We can rewrite these two queries using the new syntax as:


SELECT ProductId ,ProductName, SUM(Price) AS Price
FROM #Product
GROUP BY ROLLUP(ProductId ,ProductName)


SELECT ProductId ,ProductName, SUM(Price) AS Price
FROM #Product
GROUP BY CUBE(ProductId ,ProductName)


These new queries are semantically equivalent to and use the same query plans as the original queries.  Note that the new ROLLUP and CUBE syntax is only available in compatibility level 100.  The more general GROUPING SETS syntax, which I will discuss next, is also available in earlier compatibility levels.
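If the database was created on an earlier version and still runs under an older compatibility level, it can be checked and raised; a quick sketch (the database name is just a placeholder):

---- Check the current compatibility level and raise it to 100 (SQL Server 2008)
SELECT name, compatibility_level FROM sys.databases WHERE name = 'YourDatabase'
ALTER DATABASE YourDatabase SET COMPATIBILITY_LEVEL = 100
GO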


The new GROUPING SETS syntax is considerably more powerful.  It allows us to specify precisely which aggregations we want to compute. 


ROLLUP and CUBE are just shorthand for two common usages of GROUPING SETS.  We can express the above ROLLUP query as:
SELECT ProductId, ProductName, SUM(Price) AS Price
FROM #Product
GROUP BY GROUPING SETS((ProductId, ProductName), (ProductId), ())
Result:



This query explicitly asks SQL Server to aggregate Price by ProductId and ProductName, to aggregate by ProductId only, and to compute the grand total over all products. The () syntax with no GROUP BY columns denotes the grand total.  Similarly, we can express the above CUBE query by asking SQL Server to compute all possible aggregate combinations:


SELECT ProductId, ProductName, SUM(Price) AS Price
FROM #Product
GROUP BY GROUPING SETS((ProductId, ProductName), (ProductId), (ProductName), ())
Result:


We can also skip certain rollup levels.  For example, we can compute the total price by ProductId and ProductName and the grand total over all products, without computing any of the intermediate results:


SELECT ProductId, ProductName, SUM(Price) AS Price
FROM #Product
GROUP BY GROUPING SETS((ProductId, ProductName), ())
Result:






Wednesday, November 9, 2011

Row Constructors in SQL Server 2008

SQL Server 2008 provides a new method of inserting data into SQL Server tables, referred to as row constructors. Row constructors simplify data insertion by allowing multiple rows of data to be specified in a single DML statement. A row constructor specifies a set of row value expressions to be constructed into a data row.


Row constructors can be specified in the VALUES clause of the INSERT statement, in the USING clause of the MERGE statement, and in the definition of a derived table in the FROM clause. The general syntax of the row constructor is as follows:


VALUES ( { expression | DEFAULT | NULL } [ ,...n ] ) [ ,...n ]


Each column of data defined in the VALUES clause is separated from the next with a comma. Multiple rows (which may also contain multiple columns) are separated from each other with parentheses and a comma. When multiple rows are specified, the corresponding column values must be of the same data type or of implicitly convertible data types.
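As a quick sketch of that rule (the derived table and aliases are made up): in the first query below, column a takes the data type INT by precedence, so 'abc' causes a conversion error, while a value such as '2' is converted implicitly and succeeds:

-- Fails: 'abc' cannot be implicitly converted to INT
SELECT a, b
FROM (VALUES (1, 'x'), ('abc', 'y')) AS T(a, b)

-- Works: '2' is implicitly converted to INT
SELECT a, b
FROM (VALUES (1, 'x'), ('2', 'y')) AS T(a, b)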


The following example shows the row constructor VALUES clause being used within a SELECT statement to define a set of rows and columns with explicit values:

  SELECT a, b 
  FROM (VALUES (1, 2), (3, 4), (5, 6), (7, 8), (9, 10) )
  AS MyTable(a, b);


Result:



The following example shows the row constructor VALUES clause being used within an INSERT statement:

   Create Table #tmpTest
   (
   ID INT,
   TestValue Varchar(20)
   )
   
   INSERT INTO #tmpTest(ID,TestValue)
   Values (1,'Test1'),
 (2,'Test2'),
 (3,'Test3')
 
   Select * from #tmpTest


Result:


The following example shows the row constructor VALUES clause being used within a MERGE statement:

Create Table #tmpTest
   (
   ID INT,
   TestValue Varchar(20)
   )
   
   INSERT INTO #tmpTest(ID,TestValue)
   Values (1,'Test1'),
 (2,'Test2'),
 (3,'Test3')
 
MERGE
     INTO #tmpTest as s
     USING
       (VALUES
          (41   , 'Test41'),
          (125  , 'Test125'),
          (1 , 'Test1')
       ) as i (ID, TestValue)
     ON s.ID = i.ID
     WHEN MATCHED AND s.TestValue <> i.TestValue
          THEN UPDATE
               SET s.TestValue = i.TestValue
     WHEN NOT MATCHED
          THEN INSERT (ID, TestValue)
          VALUES (i.ID, i.TestValue)
OUTPUT $action,
        isnull(inserted.ID, 0) as src_ID,
        isnull(inserted.TestValue, '') as src_TestValue,
        isnull(deleted.ID, 0) as tgt_ID,
        isnull(deleted.TestValue, '') as tgt_TestValue
;
go


Result:


Information Schema Views in SQL Server 2005/2008

Introduction
The INFORMATION_SCHEMA views allow you to retrieve metadata about the objects within a database.  These views live in the master database under Views / System Views and can be called from any database in your SQL Server instance.  They were developed to provide a standard way of retrieving metadata that works across database platforms.  In SQL Server 2005 and 2008 these Information Schema views comply with the ISO standard.


To retrieve the views available under the INFORMATION_SCHEMA schema, along with their definitions, the following query can be used: 
-- retrieving objects' definition for INFORMATION_SCHEMA 
SELECT s.name [schema_name] 
, o.name [object_name] 
, sm.definition 
FROM sys.all_sql_modules sm 
    JOIN sys.all_objects o 
       ON sm.object_id = o.object_id 
   JOIN sys.schemas s 
      ON o.schema_id = s.schema_id 
WHERE s.name = 'INFORMATION_SCHEMA' 
ORDER BY o.name 


The query result is shown below:


Here is the list of views, each with a short description: 
  1. CHECK_CONSTRAINTS: Returns one row for each CHECK constraint
  2. COLUMN_DOMAIN_USAGE: Returns one row for each column that has an alias data type
  3. COLUMN_PRIVILEGES: Returns one row for each column that has a privilege that is either granted to or granted by the current user in the current database
  4. COLUMNS: Returns one row for each column
  5. CONSTRAINT_COLUMN_USAGE: Returns one row for each column that has a constraint defined on it
  6. CONSTRAINT_TABLE_USAGE: Returns one row for each table that has a constraint defined on it
  7. DOMAIN_CONSTRAINTS: Returns one row for each alias data type that has a rule bound to it
  8. DOMAINS: Returns one row for each alias data type
  9. KEY_COLUMN_USAGE: Returns one row for each column that is constrained as a key
  10. PARAMETERS: Returns one row for each parameter of a user-defined function or stored procedure
  11. REFERENTIAL_CONSTRAINTS: Returns one row for each FOREIGN KEY constraint
  12. ROUTINES: Returns one row for each stored procedure and function
  13. ROUTINE_COLUMNS: Returns one row for each column returned by table-valued functions
  14. SCHEMATA: Returns one row for each schema
  15. TABLE_CONSTRAINTS: Returns one row for each table constraint
  16. TABLE_PRIVILEGES: Returns one row for each table privilege that is granted to or granted by the current user in the current database
  17. TABLES: Returns one row for each table
  18. VIEW_COLUMN_USAGE: Returns one row for each column that is used in a view definition
  19. VIEW_TABLE_USAGE: Returns one row for each table that is used in a view
  20. VIEWS: Returns one row for each view
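As a simple usage sketch (the table name Product is only an example), INFORMATION_SCHEMA.COLUMNS can be queried to list a table's columns and their data types:

-- List the columns of a table named Product, with their data types
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'Product'
ORDER BY ORDINAL_POSITION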



Tuesday, November 8, 2011

Lookup cache modes in SSIS 2008

Integration Services 2008 includes a redesigned Lookup transformation with caching options that allow lookups to execute much faster than before. There are three caching modes and a new Cache Connection Manager.




When adding a Lookup transformation to a data flow task you are requesting just what it says: a lookup. Based on a matching column, a lookup table is used to retrieve other data columns that the data flow can use. For instance, based on a ProductId we can look up the ProductName for that ID and add it to the data flow. There are different flows for Matched and Not Matched records.


Note that these cache modes apply when you're using an OLE DB connection manager - using the new Cache connection manager is similar to using a Full Cache mode.


Full Cache


The default cache mode for the lookup is Full cache. In this mode, the database is queried once during the pre-execute phase of the data flow. The entire reference set is pulled into memory. This approach uses the most memory, and adds additional startup time for your data flow, as all of the caching takes place before any rows are read from the data flow source(s). The trade off is that the lookup operations will be very fast during execution. One thing to note is that the lookup will not swap memory out to disk, so your data flow will fail if you run out of memory.

When to use this cache mode
  • When you're accessing a large portion of your reference set
  • When you have a small reference table
  • When your database is remote or under heavy load, and you want to reduce the number of queries sent to the server


Keys to using this cache mode
  • Ensure that you have enough memory to fit your cache
  • Ensure that you don't need to pick up any changes made to the reference table
    • Since the lookup query is executed before the data flow begins, any changes made to the reference table during the data flow execution will not be reflected in the cache
Partial Cache



In this mode, the lookup cache starts off empty at the beginning of the data flow. When a new row comes in, the lookup transform checks its cache for the matching values. If no match is found, it queries the database. If a match is found in the database, the values are cached so they can be used the next time a matching row comes in.

Since no caching is done during the pre-execute phase, the startup time using a partial cache mode is less than it would be for a full cache. However, your lookup operations would be slower, as you will most likely be hitting the database more often.

When running in partial cache mode, you can configure the maximum size of the cache. This setting can be found on the Advanced Options page of the lookup UI. There are actually two separate values - one for 32bit execution, and one for 64bit. If the cache gets filled up, the lookup transform will start dropping the least seen rows from the cache to make room for the new ones.

In 2008 there is a new Miss Cache feature that allows you to allocate a certain percentage of your cache to remembering rows that had no match in the database. This is useful in a lot of situations, as it prevents the transform from querying the database multiple times for values that don't exist. However, there are cases where you don't want to remember the misses - for example, if your data flow is adding new rows to your reference table. The Miss Cache is disabled by default.

When to use this cache mode
  • When you're processing a small number of rows and it's not worth the time to charge the full cache
  • When you have a large reference table
  • When your data flow is adding new rows to your reference table
  • When you want to limit the size of your reference table by modifying the query with parameters from the data flow
Keys to using this cache mode
  • Ensure that your cache size setting is large enough
  • Use the Miss Cache appropriately
  • If the cache size isn't large enough for your rows, sort on lookup index columns if possible


No Cache




As the name implies, in this mode the lookup transform doesn't maintain a lookup cache (actually, that's not quite true: it keeps the last match around, since the memory has already been allocated). In most situations, this means that you'll be hitting the database for every row.


When to use this cache mode


  • When you're processing a small number of rows
  • When you have non-repeating lookup indexes
  • When your reference table is changing (inserts, updates, deletes)
  • When you have severe memory limitations


Keys to using this cache mode


  • Ensure that the partial cache mode isn't the better choice





Custom Error Handling in SSIS using Event Handlers

SQL Server Integration Services (SSIS) contains some really useful logging facilities but, as with most things in SSIS, they are extensible. There are two methods of extending the logging capability of SSIS:
  • Build a custom log provider
  • Use event handlers
Error logging is crucial to troubleshooting and auditing efforts. If an error occurs in your SSIS package while it is executing production loads, transformations, or transfers, you will want to know as much as possible about it.


As I said we are going to use event handlers to demonstrate the custom logging ability of SSIS. The event handlers provided with SSIS (N.B. The event handlers themselves are also extensible) are:
  • OnError
  • OnExecStatusChanged
  • OnInformation
  • OnPostExecute
  • OnPostValidate
  • OnPreExecute
  • OnPreValidate
  • OnProgress
  • OnQueryCancel
  • OnTaskFailed
  • OnVariableValueChanged
  • OnWarning
For the purposes of logging the most important ones are OnPostExecute, OnError & OnWarning.

Example:
Follow the below steps to implement the custom error handling in SSIS using event handlers.

Create a new SSIS project named ErrorTest

 Drag an Execute SQL Task onto the Control Flow

 Configure it (double-click or right-click and select Edit to open the editor) as shown below. Note any connection will suffice, so long as the SQL Server is running. Also note the SQL will generate an exception (intentionally).
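The original screenshot of the task configuration is not reproduced here; any SQL statement that raises an error will do. A couple of illustrative choices (not necessarily what the original package used):

-- Either of these will make the Execute SQL Task fail intentionally
SELECT 1/0
-- or
RAISERROR('Intentional failure to fire the OnError event handler', 16, 1)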

 Click OK to close the editor. Right-click the Task and select Execute Task to test. The task should fail.

 Stop execution and click on the Event Handlers tab.

Check to be sure the Executable dropdown is set to the ErrorTestPkg (the package) and the Event handler dropdown is set to OnError. Click the link to create a new OnError Event Handler for the package. From the Toolbox, drag and drop a Data Flow Task onto the Event Handler 

 Double-click the Data Flow Task to edit it. From the toolbox, drag a Script Component transformation onto the Data Flow.

 The Script Component can be configured as a Source, Transformation, or Destination. Select Source

 Since a Source has only outputs (a destination has only inputs, and a transformation has both inputs and outputs), you will need to configure the Output. Rename the generic Output 0 to ErrorOutput.

 Click on Output Columns and click Add Column. Rename the column ErrDescription.

 Edit the ErrDescription column's properties: change the data type to String (DT_STR) and the length to 5000

Add the ErrorDescription System variable to the ReadOnlyVariables property of the Script Component

 There are other variables you will add to your error handling script component. Remember variable names are case-sensitive, comma-separated, with no spaces in the list.
Click the Design Script button and add the following code snippet in the CreateNewOutputRows subroutine as shown below:
        With ErrorOutputBuffer 
            .AddRow() 
            .ErrDescription = Me.Variables.ErrorDescription 
        End With 

Some explanation: The name of the ErrorOutputBuffer was determined by the name you gave the Script Component Output earlier. When you changed the Output name to ErrorOutput, the script component's output buffer was changed to ErrorOutputBuffer.

To populate the buffer, you must add rows to it. Think of this as starting a new row in a table. If you're familiar with tables in Microsoft Word documents, you know when you start a new row it is empty until you populate it. The same is true here. You are creating a new, empty row in the output buffer to populate with data.

The final step in the script is to load the row. Again, in your production-ready error event handler you will populate more columns in the output buffer than this one; this is for demonstration purposes only.

 Drag a Flat File Destination onto the Data Flow and connect it to the Script Component.
 Configure the Flat File Destination. When you configure a flat file destination, you are prompted for the type of file. Delimited files are fairly simple to read, especially with few fields.

Select a location and name for your flat file. I chose C:\ErrorOutput1.txt

Click Columns to view the columns the file will contain. Note: the columns would be empty if you had not connected a data path from the Script Component prior to configuring the Flat File Destination

Click OK to proceed. Click the Mappings page to assign mappings between fields in the pipeline and fields in the destination

Since you built the destination from the metadata contained in the pipeline, auto-mapping should occur when you click Mappings

Click OK. Your Event Handler should appear as shown


Start debugging the package by pressing F5 or clicking the Debug Run arrow. Your Execute SQL Task should fail as before

Click on the Event Handlers tab to determine the run status of your Package OnError Event Handler. Double-click the Data Flow Task and note one row was written to your error file (ErrorOutput1.txt)

Open the ErrorOutput1.txt file to view the error stored there

In this example, we built a package to demonstrate custom error handling in an SSIS package and stored the captured error in a flat file.

Thursday, November 3, 2011

Snowflake Schema in Data Warehouse


Introduction
The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in a snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.
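As a minimal sketch (all table and column names are illustrative), a product dimension snowflaked into subcategory and category lookup tables might look like this; in a star schema the same attributes would be denormalized into a single DimProduct table:

CREATE TABLE DimProductCategory
(
   CategoryKey INT PRIMARY KEY,
   CategoryName VARCHAR(50)
)

CREATE TABLE DimProductSubcategory
(
   SubcategoryKey INT PRIMARY KEY,
   SubcategoryName VARCHAR(50),
   CategoryKey INT REFERENCES DimProductCategory(CategoryKey)
)

CREATE TABLE DimProduct
(
   ProductKey INT PRIMARY KEY,
   ProductName VARCHAR(100),
   SubcategoryKey INT REFERENCES DimProductSubcategory(SubcategoryKey)
)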




Advantages:

  1. The biggest advantage of the snowflake schema is improved query performance, due to minimized disk storage requirements and joins against smaller lookup tables.
  2. Reduces dimension table size.

Disadvantage:

  1. Additional maintenance effort is needed due to the increased number of lookup or detail tables