Tuesday, November 8, 2011

Lookup cache modes in SSIS 2008

Integration Services 2008 has a new Lookup transformation that includes a caching option that allows lookups to execute much faster than before. There are three modes of caching and a new Cache Connection Manager.

When adding a Lookup transformation to a data flow task you are requesting just what it says, a lookup. Based on a matching column, a lookup table is used to retrieve other data columns that the data flow can use. For instance, based on an ProductId we can look up Product Name based on that id and add that to the data flow. There are different flows for Matched records and Not Matched records.

Note that these cache modes apply when you're using an OLE DB connection manager - using the new Cache connection manager is similar to using a Full Cache mode.

Full Cache

The default cache mode for the lookup is Full cache. In this mode, the database is queried once during the pre-execute phase of the data flow. The entire reference set is pulled into memory. This approach uses the most memory, and adds additional startup time for your data flow, as all of the caching takes place before any rows are read from the data flow source(s). The trade off is that the lookup operations will be very fast during execution. One thing to note is that the lookup will not swap memory out to disk, so your data flow will fail if you run out of memory.

When to use this cache mode
  • When you're accessing a large portion of your reference set
  • When you have a small reference table
  • When your database is remote or under heavy load, and you want to reduce the number of queries sent to the server

Keys to using this cache mode
  • Ensure that you have enough memory to fit your cache
  • Ensure that you don't need to pick up any changes made to the reference table
    • Since the lookup query is executed before the data flow begins, any changes made to the reference table during the data flow execution will not be reflected in the cache
Partial Cache

In this mode, the lookup cache starts off empty at the beginning of the data flow. When a new row comes in, the lookup transform checks its cache for the matching values. If no match is found, it queries the database. If the match is found at the database, the values are cached so they can be used the next time a matching row comes in.

Since no caching is done during the pre-execute phase, the startup time using a partial cache mode is less than it would be for a full cache. However, your lookup operations would be slower, as you will most likely be hitting the database more often.

When running in partial cache mode, you can configure the maximum size of the cache. This setting can be found on the Advanced Options page of the lookup UI. There are actually two separate values - one for 32bit execution, and one for 64bit. If the cache gets filled up, the lookup transform will start dropping the least seen rows from the cache to make room for the new ones.

In 2008 there is a new Miss Cache feature that allows you to allocate a certain percentage of your cache to remembering rows that had no match in the database. This is useful in a lot of situations, as it prevents the transform from querying the database multiple times for values that don't exist. However, there are cases where you don't want to remember the misses - for example, if your data flow is adding new rows to your reference table. The Miss Cache is disabled by default.

When to use this cache mode
  • When you're processing a small number of rows and it's not worth the time to charge the full cache
  • When you have a large reference table
  • When your data flow is adding new rows to your reference table
  • When you want to limit the size of your reference table by modifying query with parameters from the data flow
Keys to using this cache mode
  • Ensure that your cache size setting is large enough
  • Use the Miss Cache appropriately
  • If the cache size isn't large enough for your rows, sort on lookup index columns if possible

No Cache

As the name implies, in this mode the lookup transform doesn't maintain a lookup cache (actually, not quite true - we keep the last match around, as the memory has already been allocated). In most situations, this means that you'll be hitting the database for every row.

When to use this cache mode

  • When you're processing a small number of rows
  • When you have non-repeating lookup indexes
  • When your reference table is changing (inserts, updates, deletes)
  • When you have severe memory limitations

Keys to using this cache mode

  • Ensure that the partial cache mode isn't the better choice

No comments: