Since the release of Visual Studio 2012, business intelligence developers have been limited in how much they could use the tool because it did not support BI project types (SSIS, SSAS, and SSRS). That limitation is now gone with Microsoft's release of SQL Server Data Tools – Business Intelligence for Visual Studio 2012. With this release, BI professionals can fully move to Visual Studio 2012 for SQL Server 2012 project initiatives.
You can read more about this release on the SSRS team blog and SSAS team blog, or download the code from the Microsoft website.
Note: This will be the first post in a short series on using Object typed variables in SQL Server Integration Services.
When defining variables in SSIS, the ETL developer has several data type options to choose from depending on the information to be stored in each variable. Included in the options are String, Boolean, Char, DateTime, and several flavors and sizes of Int. However, there’s another variable data type that is very handy but also underutilized: the Object data type.
Why Object Variables?
Variables with a data type of Object are the most flexible variables in SSIS. Data types such as Int, String, and DateTime are designed to store just one type of data; an Object typed SSIS variable, however, can store almost any type of data, even information that can’t otherwise be represented in Integration Services. In most cases, SSIS doesn’t even have to be configured to know what type of data you’re storing in an Object typed variable – usually, it can simply pass the value along the wire as a bunch of bits without knowing or caring what’s in there.
When considering the use of Object typed variables in SSIS, I’ll give the same disclaimer that I give for using script tasks/components in SSIS: Just because you can doesn’t mean you should. If a native data type will work to store any possible value for a particular variable, by all means, don’t complicate your code by adding unnecessary moving parts. Use Object typed variables only when a native type won’t do – otherwise, stick to the well-worn path.
There are a few cases that come to mind that lend themselves to using Object typed variables in SSIS:
- Iterating over a result set. This is probably the most common and well-documented use of the Object variable, as well as the easiest to implement, as it requires no manual coding. This pattern allows you to retrieve a set of data from a database and then perform some operation for each row in that result set. I’ll work through the mechanics of how to do this in my next post.
- Handling binary data. A common example of this is extracting binary data from or writing binary data to a VARBINARY field in a relational database. If you need an interim mechanism in which to store this binary data, an Object variable can be a good solution. In this case, the ETL pipeline doesn’t necessarily need to be aware of what is stored in the Object typed variable.
- Creating or consuming binary data in SSIS. Let’s say you need to either generate or process binary data as part of your ETL. For example, you might need to retrieve a JPEG or PNG file from the file system and write it into a relational database, or retrieve a binary object from a database and process the various elements of that object. By storing the data in an Object typed variable, you can write to or read from it directly within your code, as in the sketch below.
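To make this concrete, here is a minimal Script Task sketch that reads an image file into an Object typed variable. The variable names (User::ImagePath as a String, User::ImageData as an Object) are mine, for illustration only; they would need to be listed in the task’s ReadOnlyVariables and ReadWriteVariables lists, respectively.

public void Main()
{
    // Read the entire file from disk into a byte array
    string path = (string)Dts.Variables["User::ImagePath"].Value;
    byte[] fileBytes = System.IO.File.ReadAllBytes(path);

    // An Object typed variable can hold the byte array as-is;
    // SSIS simply carries the bits without inspecting them
    Dts.Variables["User::ImageData"].Value = fileBytes;

    Dts.TaskResult = (int)ScriptResults.Success;
}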
Risks and challenges
Naturally, with a construct as flexible as the Object typed variable, there are a few challenges to be aware of when deciding when and how to use objects in your SSIS packages. Among the risks:
- Some coding required. In many cases, especially when you’re processing the information contained in the Object typed variable (as opposed to simply passing the value through from a source to a destination), you’re going to have to write some code to address that object.
- SSIS expressions not allowed. Because they are designed to store a variety of information structures, Object typed variables cannot be used in an SSIS expression. Even if the underlying data stored in the variable is of a type that could be stored in a native SSIS type, attempting to add an Object typed variable to an SSIS expression will throw an error.
- Debugging challenges. If you overflow an Int32, or happen to truncate a string in SSIS, you’ll get a (mostly) friendly error message telling you what went wrong. Often, when dealing with Object typed variables, you don’t get that luxury in SSIS.
Object typed variables in SQL Server Integration Services allow a great deal of flexibility when dealing with atypical data structures in the ETL cycle. Although they are not as commonly used as simple native types, Object typed variables can make otherwise difficult ETL tasks easier.
In my experience, I’ve found that Object typed variables are sometimes avoided in SSIS packages simply because they’re misunderstood or believed to be too difficult to use. In the next few posts in this series, I’ll illustrate how the Object variable can be leveraged in your SSIS packages by demonstrating a few use cases where Object typed variables are appropriate.
Earlier this week, Artis Consulting welcomed a new addition to our team. Technical expert Erin Welker (b|t), a well-known and experienced business intelligence professional, has joined the BI practice here at Artis. Erin brings a great deal to the table – she has been a SQL Server BI consultant for a number of years, has served the community in various capacities (including serving on the PASS board of directors), is a board member for the North Texas SQL Server User Group, and is a past recipient of the SQL Server MVP award.
The BI practice here at Artis has been the strongest team I’ve ever worked with, and I’m very excited to have my friend Erin joining us here. If you know Erin, be sure to give her a shout of congratulations.
Frequently I am asked how to properly use SQL Server Integration Services to handle ragged flat files, an example of which is shown below. In this small sample file, you can see that we are dealing with medical data that has an inconsistent format. As shown, the rows in the pipe-delimited file can have as few as seven and as many as eleven columns.
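An invented stand-in with the same shape looks like this – seven mandatory pipe-delimited fields (MRN, patient last name, physician ID, procedure date, procedure ID, description, and status), plus zero to four trailing risk factor codes (all values here are made up for illustration):

10000412|Sanders|236|2012-02-14|71010|Chest X-ray|Complete|453
10000413|Reichert|781|2012-02-14|73520|Hip X-ray|Complete
10000414|Barnes|443|2012-02-15|70450|CT Head|Pending|453|228|997|121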
Although you and I can look at this file and interpret its meaning, previous versions of SSIS cannot (see the note below regarding SSIS in SQL Server 2012). You can see that we’re missing not only the data but the delimiters as well, which is the root of the problem. This type of format, where delimiters are simply left off when values at the end of the row are missing or NULL, is unfortunately a relatively common occurrence that ETL developers must address.
With a file like this, SSIS will continue to consume data for each field until it finds the specified field delimiter – even if that means that data is “gobbled up” from the next line in the file! Let’s assume that the outputs on my Flat File Source component define 11 fields (the largest number of fields found in this file snippet). When SSIS processes the first row of data, it’s going to get to the end of that row still expecting to find a pipe delimiter to mark the end of the eighth field. Rather than treating the end of the line of data (represented by an unprintable newline character) as the end of the column, the Flat File Source component will continue to consume data until either it finds the specified delimiter or the length of the data consumed exceeds the width specified for that column. In this example, the eighth field would not be the literal “453” at the end of the line in the file, but would contain a newline character followed by the first value of the following line of data. The next field, which should be NULL, would then contain “Reichert”, followed by the incorrect value of “781”, and so forth. As shown below, this slip would cascade through the remainder of the file.
In a case like this, the very best you can hope for is that the data length would overflow the allowed field width, or a data type conflict would cause an error and fail the package. In a worst-case scenario, SSIS could actually consume the entire file with this misalignment as shown above, which may not raise an error in the package but would almost certainly cause downstream issues in the destination systems.
Scripting to the rescue!
Never fear – there is a solution to this problem. By using just a bit of code in SSIS, we can easily get around this issue and process the ragged flat file. The methodology we’ll use is as follows:
- Using the Flat File Source, consume each line of text as one big string (rather than breaking it apart into individual fields)
- Using the Script Component, parse each line of the file. Dynamically populate the output fields with the data in each line of the file, and assign NULLs to those fields where no value is present.
- Send the cleansed output to the destination table
Sounds easy enough, right? Let’s go through this step by step.
Process each line of data
As mentioned, we’re going to trick our Flat File Source component into processing this file by forcing it to handle each line of data as if it were a single text field. As shown below, I’ve configured a single output field, named [Column 0], as a field of type String with a width of 2000 characters. This will include everything on the line of data, including the pipe delimiters, in the output.
Next, we’ll add a script component into the data flow pane. We’ll configure the input of the script component to use the single field generated by the flat file source above.
We also need to add all of the output fields manually, which we will populate from within the script. As shown below, I’ve already added the eleven fields to the script component output, configuring each one with the appropriate data type and length.
Now on to the fun part: the code. Within the script, our methodology will be as follows:
- Break apart the big text string into its individual data fields, splitting on the specified delimiter (in this case, the pipe symbol).
- If there are mandatory fields (i.e., any fields which must be present for a valid row of data), assign these to the appropriate output columns.
- For the optional output columns, process the values retrieved from the input string, in order, assigning each one to the next output column until no other input values exist.
First things first – let’s grab and shred the input text. As shown below, I’m going to split the input on the pipe delimiter, resulting in an array of strings, each with a single value.
// Break apart the input string into its atomic elements
string[] allValues = Row.Column0.Split('|');
Next, we’ll work through all of the fields that are expected to be present in every valid row of data. Since these mandatory fields should always be present, no dynamic processing is required on their values, so we can simply assign them in order. Note that I’ve used an incrementing counter variable to work through the array of input data – this is simply a shortcut that lets me copy and paste the same code rather than hard-coding the index for each element.
// Address the values known to be present in every row
int i = 0;
Row.MRN = allValues[i++];
Row.PatientLastName = allValues[i++];
Row.PhysicianID = allValues[i++];
Row.ProcedureDate = allValues[i++];
Row.ProcedureID = allValues[i++];
Row.ProcedureDescription = allValues[i++];
Row.ProcedureStatus = allValues[i++];
Now we move on to the dynamic portion of the processing. From this point, all columns are optional, so we need to go through the remaining values, one by one, and assign them to the final four output columns (all related to risk factor codes in this example). As shown in the code snippet below, I’m continuing to work through the string array created earlier to process all of the values. As long as values remain in the array, we’ll continue assigning them to the next risk factor code in order; when we run out of values, all remaining risk factor columns will be set to NULL.
// The next four values are variable. We may have zero to four of these, so
// we'll add them dynamically as they appear. Where no value remains, the
// column is set to NULL via the script component's <Column>_IsNull flag
// (assigning null directly to a string buffer column fails at runtime)
if (allValues.Length > i) Row.RiskFactor1 = allValues[i++]; else Row.RiskFactor1_IsNull = true;
if (allValues.Length > i) Row.RiskFactor2 = allValues[i++]; else Row.RiskFactor2_IsNull = true;
if (allValues.Length > i) Row.RiskFactor3 = allValues[i++]; else Row.RiskFactor3_IsNull = true;
if (allValues.Length > i) Row.RiskFactor4 = allValues[i]; else Row.RiskFactor4_IsNull = true;
We only have four possible additional columns in this scenario, but this pattern can easily be applied to similar scenarios with more than just a few optional elements.
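One way to apply the pattern when there are many more optional elements – a sketch only, reusing the same risk factor columns for illustration – is to map each optional output column to a pair of Action delegates (from the System namespace, already referenced in a script component) and loop over them:

// Sketch: generalize the trailing-column logic with delegate arrays.
// Add one entry per optional output column.
Action<string>[] setValue =
{
    v => Row.RiskFactor1 = v,
    v => Row.RiskFactor2 = v,
    v => Row.RiskFactor3 = v,
    v => Row.RiskFactor4 = v
};
Action[] setNull =
{
    () => Row.RiskFactor1_IsNull = true,
    () => Row.RiskFactor2_IsNull = true,
    () => Row.RiskFactor3_IsNull = true,
    () => Row.RiskFactor4_IsNull = true
};

for (int j = 0; j < setValue.Length; j++)
{
    // Assign the next leftover input value, or NULL if none remain
    if (allValues.Length > i) setValue[j](allValues[i++]);
    else setNull[j]();
}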
Consume the cleansed data
Finally, we just need to route the cleansed data out to our destination. As shown below, we’re sending the processed data to a SQL Server table.
SQL Server 2012 changes
As mentioned earlier, this is a design pattern that I recommend for versions of SSIS prior to 2012. If you are using SSIS in SQL Server 2012, this workaround is unnecessary: the Flat File Source was updated in that version to easily process these types of ragged files.
Ragged flat file processing can be a pain when working with older versions of SQL Server Integration Services. This post demonstrates how, with just a few lines of code, you can avoid issues with ragged files by dynamically “peeling off” values and assigning them to the appropriate outputs.
Note: The package and sample data file I used for this demonstration are attached here.
I’m happy to announce that the book I’ve been working on for the past two years is complete and has been released for sale as of today. I got the unique privilege to work alongside SSIS rock stars Andy Leonard, Jessica Moss, Matt Masson, and Michelle Ufford to scribe what will hopefully be a must-have book for anyone working in the SQL Server Integration Services space.
I’m immensely grateful for the opportunity to have participated in this project. Thanks first and foremost to my four fine coauthors – I’m humbled to even be mentioned in the same context as you folks. Thanks also to the good people at Apress – Jonathan Gennick, Mark Powers, and many others who kept us on track and helped to point out our dangling modifiers and run-on sentences. Thanks also to the folks at Artis Consulting (my employer) who were gracious in their support and encouragement during this effort.
Along with the book release, the five of us are putting together a full-day seminar to be delivered at the PASS Summit this fall. We’ll expand on the topics covered in the book, with more examples, more design patterns, and lively discussion among the authors. Registration for this event is still open. Hope to see you there!
This week, Artis Consulting will be a part of the SQL Rally event in downtown Dallas. This annual event brings together 500-600 speakers, PASS board members and staff, and attendees from across the country, making it the largest midyear gathering of SQL Server professionals in the nation.
Artis is a silver sponsor of this event, and will have a booth in the main vendor area. We will have folks at the event all week long, so if you’re in attendance, please do stop by our booth. If you’ve got a Business Intelligence or SharePoint question, or if you are a senior BI or SharePoint developer looking for a new challenge, we’d love to chat with you!
In addition to our sponsorship, two of us will be presenting sessions at the event. My colleague Greg Galloway will be presenting twice: once in a full-day preconference seminar and again in a regular 75-minute session:
99 Tips for Tuning and Enhancing Analysis Services (full-day precon): Come hear Greg Galloway share all his best tips for optimizing the performance of Analysis Services and for improving the user experience and reach of your cubes. This session is targeted at BI developers responsible for maintaining Analysis Services cubes and at DBAs who are tasked with monitoring and troubleshooting Analysis Services issues. Expect to leave with an actionable checklist of ideas to try when you return to work on Monday, along with a deeper grasp of the internals of Analysis Services, MDX optimization experience, and best practices that ensure your projects succeed. This preconference session will focus on Multidimensional cubes and MDX, but will briefly discuss the advantages and limitations of the new Tabular models and DAX in Analysis Services 2012.
Making Heads or Tails of Analysis Services Storage (regular session): If you are about to start a new Microsoft business intelligence project, you have more Analysis Services choices and flexibility than ever before. Come learn what's new in Analysis Services 2012 and learn the advantages and limitations of the new Tabular models compared with classic Multidimensional models. In this session we will take a deep dive into storage modes in Analysis Services. We’ll discuss the two storage modes a Multidimensional model can use: MOLAP and ROLAP. We’ll discuss the storage modes in new Tabular models: VertiPaq and DirectQuery. We’ll preview the new columnstore index in SQL Server 2012 and discuss how it impacts Analysis Services. And we’ll take a deep dive into two successful Analysis Services deployments at two Fortune 500 companies and discuss why one chose ROLAP and the other chose MOLAP.
In addition, I’ll be presenting a Data Quality Services session on Thursday:
Introduction to Data Quality Services (regular session): In this session, we will take a quick tour of the new data quality tool released with SQL Server 2012. With SQL Server Data Quality Services, data professionals now have an easy-to-use framework with which they can analyze and maintain data quality. This session will serve as an introduction to this new product - we will discuss DQS concepts and architecture, review the server and client components of DQS, and will demonstrate the DQS component for SSIS.
Registration is still open for the event, and we’d love to see you in one of our sessions or at the booth.
In Data Quality Services, composite domains can be created to associate two or more natural domains within a knowledge base. Like natural domains, composite domains can contain one or more validation rules to govern which domain values are valid. In my last post, I discussed the use of validation rules against natural domains. In this post, I’ll continue the thread by covering the essentials of composite domain rules and demonstrating how they can be used to create relationships between data domains.
What is a composite domain?
Before we break off into discussing the relationships between the member domains of a composite domain, we’ll first touch on the essentials of the composite domain itself.
Simply put, a composite domain is a wrapper for two or more natural domains in a knowledge base. Think of a composite domain as a virtual collection of dissimilar yet related properties. As best I can tell, the composite domain is not materialized in the DQS knowledge base, but is simply a meta wrapper pointing to the underlying values.
To demonstrate, I’ve created a knowledge base using a list of automobile makes and models, along with a few other properties (car type and seating capacity). I should be able to derive a loose association between automobile type and seating capacity, so I’ll create a composite domain with those two domains as shown below.
As shown above, creating a composite domain requires nothing more than selecting two or more domains from an existing knowledge base. After the composite domain has been created, your configuration options are generally limited to attaching the composite domain to a reference data provider (which I’ll cover in a future post) and adding composite domain rules.
Value association via composite domain rules
The most straightforward way to associate the values of a composite domain is to create one or more rules against it. Rules created against a composite domain can declare if/then scenarios describing the allowable combinations of values therein.
Back in the day, before marriage, kids, and a mortgage, I used to drive sports cars. Even though that was a long time ago, I do remember a few things about that type of automobile: they are fast, expensive to insure, and don’t have a lot of passenger capacity. It’s on that last point that we’ll focus our data quality efforts for now. I want to make sure that some sneaky manufacturer doesn’t falsely identify some big, roomy 4-door sedan as a sports car. Therefore, I’m going to create a rule that will restrict the valid seating capacity values for sports cars.
I’ll start with some business assumptions. What’s the minimum number of seats a sports car should have? I think it’s probably 2, but I suppose if some enterprising gearhead decided to convert an Indy car into a street-legal machine, it would likely be classified as a sports car too. Therefore, it would be reasonable to assume that, in an edge case, a sports car could have just a single seat, so our minimum seating capacity for a sports car would be 1. On the high side, the design of sports cars dictates that there aren’t many seats. For example, the Chevrolet Camaro I had in high school could seat 4 people, assuming that 2 of them were small children with stunted growth and no claustrophobic tendencies. However, we can give a little on this issue and assume that someone somehow managed to squeeze a third row of seats into a Dodge Magnum, so we’ll say that a sports car can have a maximum seating capacity of 6 people.
Now, with that information in hand, I’m going to use the Domain Management component of the DQS client to set up the new rule against the “Type and Capacity” composite domain from above. As shown below, I can set value-specific constraints on the seating capacity based on the automobile type of Sports Car.
As shown, any valid record with a car type of Sports Car must have a seating capacity of between 1 and 6 persons.
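DQS evaluates this rule declaratively – no code is required – but its logical equivalent, written here in C# purely as an illustration, is a simple conditional check:

// Illustrative only: the logical equivalent of the composite domain rule.
// DQS evaluates this declaratively; no code is written in practice.
bool IsValidTypeAndCapacity(string carType, int seatingCapacity)
{
    // If the type is Sports Car, capacity must fall between 1 and 6
    if (carType == "Sports Car")
        return seatingCapacity >= 1 && seatingCapacity <= 6;

    // Other car types are not constrained by this particular rule
    return true;
}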
Of course, sports cars aren’t the only types of automobiles (gasp!), so this approach would likely involve multiple rules. Fortunately, composite domains allow for many such rules, which would permit the creation of additional restrictions for other automobile types. You could also expand the Sports Car rule and add more values on the left side of the operator (the if side of the equation). For example, you might call this instead a “Small Car rule” and include both sports cars and compact cars in this seating capacity restriction.
Although we’ve limited our exploration to simply interrogating the value of the natural domains within a composite domain, this is by no means our only option for validation. For example, when dealing with string data you can inspect the length of the string, search for patterns, use regular expressions, and test for an empty string in addition to checking against the actual value. Shown below are some of the options you can use to query against a string value in a domain rule.
When dealing with date or numerical data, you have the expected comparison operators including less than, greater than, less than or equal to, etc.
This post has briefly explored composite domains and shown how to add validation rules to a composite domain in an existing knowledge base. In my next DQS post, I’ll continue with composite domains to illustrate a potential misunderstanding in the way composite domains treat value combinations in cleansing operations.
A compelling feature of the new Data Quality Services in SQL Server 2012 is the ability to apply rules to fields (domains) to describe what makes up a valid value. In this brief post, I’d like to walk through the concepts of domain validation and demonstrate how this can be implemented in DQS.
Domain validation essentials
Let’s ponder domain validation by way of a concrete example. Consider the concept of age: it’s typically expressed in discrete, non-negative whole numbers. However, the expected values of the ages of things will vary greatly depending on the context. An age of 10 years seems reasonable for a building, but sounds ridiculous when describing fossilized remains. A date of “1/1/1950” is a valid date and would be appropriate for classifying a person’s date of birth, but would be out of context if describing when a server was last restarted. In a nutshell, the purpose of domain validation is to allow context-specific rules to provide reasonableness checks on the data.
A typical first step in data validation would involve answering the following questions:
- Is the data of the right type? This helps us to eliminate values such as the number “purple” and the date “3.14159”.
- Does the data have the right precision? This is similar to the point above: If I’m expecting to store the cost of goods at a retail store, I’m probably not going to configure the downstream elements to store a value of $100 million for a single item.
- Is the data present where required? When expressing address data, the first line of an address might be required while a second line could be optional.
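These baseline checks are easy enough to express in code. Purely as an illustration (the field name and thresholds here are invented), the three questions above might look like this in C#:

// Illustrative only: baseline checks for a hypothetical unit-cost field.
bool PassesBaselineChecks(string rawCost)
{
    // Present where required?
    if (string.IsNullOrEmpty(rawCost)) return false;

    // Of the right type? It must parse as a decimal.
    decimal cost;
    if (!decimal.TryParse(rawCost, out cost)) return false;

    // Of a sane precision and magnitude? A single retail item
    // shouldn't cost anywhere near $100 million.
    return cost >= 0m && cost < 100000000m;
}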
Domain validation goes one step further by answering the question, “Is a given value valid when used in this context?” It takes otherwise valid data and validates it to be sure it fits the scenario in play.
Domain validation in DQS
Even if you don’t use this term to describe it, you’re probably already doing some sort of domain validation as part of your ETL or data maintenance routines. Every well-designed ETL system has some measure of sanity checking to make sure data fits semantically as well as technically.
The downside to many of these domain validation scenarios is that they can be inconsistent and are usually decentralized. Perhaps they are implemented at the outer layer of the ETL before data is passed downstream. Maybe the rules are applied as stored procedures after they are loaded, or even as (yikes!) triggers on the destination tables.
Data Quality Services seeks to remedy the inconsistency and decentralization issue, as well as make the process easier, by way of domain validation rules. When creating a domain in DQS, you are presented with the option of creating domain rules that govern what constitutes a valid value for that domain. For the example below, I’m using data for automobile makes and models, and am implementing a domain rule to constrain the value for the number of doors for a given model.
With the rule created, I can apply one or more conditions to each of the rules. As shown, I am going to constrain the valid values to lie between 1 and 9 inclusive, which should account for the smallest and largest automobile types (such as limousines and buses).
For this rule, I’m setting the conditions that the value must be greater than zero and less than ten. Note that there is no requirement to use this bookend qualification process; you can specify a single qualifier (for example, greater than zero) or have multiple conditions strung together in the same rule. You can even change the AND qualifier to an OR if the rule should be met when either condition is true – though I would caution you when mixing 3 or more conditions using both AND and OR, as the behavior may not yield what you might expect.
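As a C#-flavored illustration of why grouping matters (this is about boolean logic generally, not a statement of how DQS itself groups conditions), consider a hypothetical rule meant to allow door counts of 1 through 9 plus a special “unknown” marker of 0:

int doors = 0;  // suppose 0 were allowed as an "unknown" marker

// "greater than 0 AND less than 10 OR equal to 0", grouped two ways:
bool groupedFirst = (doors > 0 && doors < 10) || doors == 0;  // true:  0 passes
bool groupedLast  = doors > 0 && (doors < 10 || doors == 0);  // false: 0 fails

Three or more conditions chained with mixed AND and OR can silently mean either reading, so it’s safest to keep each rule to a single connective and split anything more complex into separate rules.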
That’s all there is to creating a simple domain validation rule. Remember that for the condition qualifiers, you can set greater than, less than, greater than/equal to, etc., for the inclusion rule when dealing with numerical or date domain data types. For string data types, the number of options is even greater, as shown below:
Of particular interest here is that you can leverage regular expressions and patterns to look for partial or pattern matches within the string field. You can also check the string value to see if it can be converted to numeric or date/time.
The rule in action
With the new domain validation rule in place, let’s run some test data through it. I’m going to create a few test records, some of which violate the rule we just created, and run them through a new DQS project using the knowledge base we modified with this rule.
I’ll start off with the dirty data as shown below. You can probably infer that we’ve got a few rows that do not comply with the rule we created, on both ends of the value scale:
After creating a new data cleansing project, I use the data shown above to test the rule constraining the number of doors. As shown below in the New output tab, we have several rows that comply with this new rule:
In addition, there are two distinct values found that do not meet the criteria specified in the new rule. Selecting the Invalid tab, I see the values 0 and 12 have failed validation, as they fall outside the range specified by the rule. In the Reason column, you can see that we get feedback indicating that our new rule is the reason that these records are marked as Invalid:
So by implementing this rule against my data, I am able to validate not only that the value is present and of the correct type, but that it is reasonable for this scenario.
In this post we’ve reviewed the essentials of domain validation and how we can implement these checks through domain rules in SQL Server Data Quality Services. In my next post, I’ll continue the discussion around domain rules by reviewing how these rules can be applied to composite domains in DQS.