Artis Consulting at SQL Rally

This week, Artis Consulting will be a part of the SQL Rally event in downtown Dallas.  This annual event hosts 500-600 speakers, PASS board members and staffers, and attendees from across the nation in the largest midyear gathering of SQL Server professionals in the nation.

Artis is a silver sponsor of this event, and will have a booth in the main vendor area.  We will have folks at the event all week long, so if you’re in attendance, please do stop by our booth.  If you’ve got a Business Intelligence or SharePoint question, or if you are a senior BI or SharePoint developer looking for a new challenge, we’d love to chat with you!

In addition to our sponsorship, two of us will be presenting sessions at the event.  My colleague Greg Galloway will be presenting twice, once as a full-day preconference seminar and another regular 75-minute session:

99 Tips for Tuning and Enhancing Analysis Services (full day precon): Come hear Greg Galloway share all his best tips for optimizing performance of Analysis Services and for improving the user experience and reach of your cubes. This session is targeted at BI developers responsible for maintaining Analysis Services cubes and to DBAs who are tasked with monitoring and troubleshooting Analysis Services issues. Expect to leave with an actionable checklist of ideas to try on Monday, when you return to work with a deeper grasp of the internals of Analysis Services, with MDX optimization experience, and with best practices that ensure your projects succeed. This pre-conference session will focus on Multidimensional cubes and MDX, but will briefly discuss the advantages and limitations of new Tabular models and DAX in Analysis Services 2012.

Making Heads or Tails of Analysis Services Storage (regular session): If you are about to start a new Microsoft business intelligence project, you have more Analysis Services choices and flexibility than ever before. Come learn what's new in Analysis Services 2012 and learn the advantages and limitations of the new Tabular models compared with classic Multidimensional models. In this session we will take a deep dive into storage modes in Analysis Services. We’ll discuss the two storage modes a Multidimensional model can use: MOLAP and ROLAP. We’ll discuss the storage modes in new Tabular models: VertiPaq and DirectQuery. We’ll preview the new columnstore index in SQL Server 2012 and discuss how it impacts Analysis Services. And we’ll take a deep dive into two successful Analysis Services deployments at two Fortune 500 companies and discuss why one chose ROLAP and the other chose MOLAP.

In addition, I’ll be presenting a Data Quality Services session on Thursday:

Introduction to Data Quality Services (regular session): In this session, we will take a quick tour of the new data quality tool released with SQL Server 2012. With SQL Server Data Quality Services, data professionals now have an easy-to-use framework with which they can analyze and maintain data quality. This session will serve as an introduction to this new product - we will discuss DQS concepts and architecture, review the server and client components of DQS, and will demonstrate the DQS component for SSIS.

Registration is still open for the event, and we’d love to see you in one of our sessions or at the booth.

DQS Validation Rules on Composite Domains

In Data Quality Services, composite domains can be created to associate together two or more natural domains within a knowledge base.  Like natural domains, composite domains can also contain one or more validation rules to govern which domain values are valid.  In my last post, I discussed the use of validation rules against natural domains.  In this post, I’ll continue the thread by covering the essentials of composite domain rules and demonstrating how these can be used to create relationships between data domains.

What is a composite domain?

Before we break off into discussing the relationships between member domains of a composite domain, we’ll first touch on the essentials of the latter.

Simply, a composite domain is a wrapper for two or more organic domains in a knowledge base.  Think of a composite domain as a virtual collection of dissimilar yet related properties.  As best I can tell, the composite domain is not materialized in the DQS knowledge base, but is simply a meta wrapper pointing to the underlying values.

To demonstrate, I’ve created a knowledge base using a list of automobile makes and models, along with a few other properties (car type and seating capacity).  I should be able to derive a loose association between automobile type and seating capacity, so I’ll create a composite domain with those two domains as shown below.

image

As shown above, creating a composite domain requires nothing more than selecting two or more domains from an existing knowledge base.  After the composite domain has been created, your configuration options are generally limited to attaching the composite domain to a reference data provider (which I’ll cover in a future post) and adding composite domain rules.

Value association via composite domain rules

The most straightforward way to associate the values of a composite domain is to create one or more rules against that composite domain.  When created against a composite domain, you can use rules to declare if/then scenarios to describe allowable combinations therein.

Back in the day, before marriage, kids, and a mortgage, I used to drive sports cars.  Even though that was a long time ago, I do remember a few things about that type of automobile: they are fast, expensive to insure, and don’t have a lot of passenger capacity.  It’s on that last point that we’ll focus our data quality efforts for now.  I want to make sure that some sneaky manufacturer doesn’t falsely identify as a sports car some big and roomy 4-door sedan.  Therefore, I’m going to create a rule that will restrict the valid domain values for seating capacity for sports cars.

I’ll start with some business assumptions.  What’s the minimum number of seats a sports car should have?  I think it’s probably 2, but I suppose if some enterprising gearhead decided to convert an Indy Car into a street-legal machine, it would likely be classified as a sports car too.  Therefore, it would be reasonable to assume that, in an edge case, a sports car could have just a single seat, so our minimum seating capacity for a sports car would be 1.  On the high side, design of sports cars should dictate that there aren’t many seats.  For example, the Chevrolet Camaro I had in high school could seat 4 people, assuming that 2 of the people were small children with stunted growth who had no claustrophobic tendencies.  However, we can give a little on this issue and assume that they somehow manage to squeeze a third rows of seats into a Dodge Magnum, so we’ll say that a sports car can have a maximum seating capacity of 6 people.

Now, with that information in hand, I’m going to use the Domain Management component of the DQS client to set up the new rule against the “Type and Capacity” composite domain from above.  As shown below, I can set value-specific constraints on the seating capacity based on the automobile type of Sports Car.

image

As shown, any valid record with a car type of Sports Car must have a seating capacity of between 1 and 6 persons.

Of course, sports cars aren’t the only types of automobiles (gasp!), so this approach would likely involve multiple rules.  Fortunately, composite domains allow for many such rules, which would permit the creation of additional restrictions for other automobile types.  You could also expand the Sports Car rule and add more values on the left side of the operator (the if side of the equation).  For example, you might call this instead a “Small Car rule” and include both sports cars and compact cars in this seating capacity restriction.

Other uses

Although we’ve limited our exploration to simply interrogating the value of the natural domains within a composite domain, this is by no means our only option for validation.  For example, when dealing with string data you can inspect the length of the string, search for patterns, use regular expressions, and test for an empty string in addition to checking against the actual value.  Shown below are some of the options you can use to query against a string value in a domain rule.

image

When dealing with date or numerical data, you have the expected comparison operators including less than, greater than, less than or equal to, etc.

Conclusion

This post has briefly explored composite domains and shown how to add validation rules to a composite domain in an existing knowledge base.  In my next DQS post, I’ll continue with composite domains to illustrate a potential misunderstanding in the way composite domains treat value combinations in cleansing operations.

DQS Domain Validation Rules

dqsA compelling feature of the new Data Quality Services in SQL Server 2012 is the ability to apply rules to fields (domains) to describe what makes up a valid value.  In this brief post, I’d like to walk through the concepts of domain validation and demonstrate how this can be implemented in DQS.

Domain validation essentials

Let’s ponder domain validation by way of a concrete example.  Consider the concept of age: it’s typically expressed in discrete, non-negative whole numbers.  However, the expected values of the ages of things will vary greatly depending on the context.  An age of 10 years seems reasonable for a building, but sounds ridiculous when describing fossilized remains.  A date of “1/1/1950” is a valid date and would be appropriate for classifying a person’s date of birth, but would be out of context if describing when a server was last restarted.  In a nutshell, the purpose of domain validation is to allow context-specific rules to provide reasonableness checks on the data.

A typical first step in data validation would involve answering the following questions:

  • Is the data of the right type?  This helps us to eliminate values such as the number “purple” and the date “3.14159”.
  • Does the data have the right precision? This is similar to the point above: If I’m expecting to store the cost of goods at a retail store, I’m probably not going to configure the downstream elements to store a value of $100 million for a single item.
  • Is the data present where required?  When expressing address data, the first line of an address might be required while a second line could be optional.

Domain validation goes one step further by answering the question, “Is a given value valid when used in this context?”  It takes otherwise valid data and validates it to be sure it fits the scenario in play.

Domain validation in DQS

Even if you don’t use this term to describe it, you’re probably already doing some sort of domain validation as part of your ETL or data maintenance routines.  Every well-designed ETL system has some measure sanity check to make sure data fits semantically as well as technically.

The downside to many of these domain validation scenarios is that they can be inconsistent and are usually decentralized.  Perhaps they are implemented at the outer layer of the ETL before data is passed downstream.  Maybe the rules are applied as stored procedures after they are loaded, or even as (yikes!) triggers on the destination tables.

Data Quality Services seeks to remedy the inconsistency and decentralization issue, as well as make the process easier, by way of domain validation rules.  When creating a domain in DQS, you are presented with the option of creating domain rules that govern what constitutes a valid value for that domain.  For the example below, I’m using data for automobile makes and models, and am implementing a domain rule to constrain the value for the number of doors for a given model.

SNAGHTML612bbb0

With the rule created, I can apply one or more conditions to each of the rules.  As shown, I am going to constrain the valid values to lie between 1 and 9 inclusive, which should account for the smallest and largest automobile types (such as limousines and buses).

SNAGHTML6292bf0

For this rule, I’m setting the conditions that the value must be greater than zero or less than ten.  Note that there is no requirement to use this bookend qualification process; you can specify a single qualifier (for example, greater than zero) or have multiple conditions strung together in the same rule.  You can even change the AND qualifier to an OR if the rule should be met if either condition is true – though I would caution you when mixing 3 or more conditions using both AND and OR, as the behavior may not yield what you might expect.

That’s all there is to creating a simple domain validation rule.  Remember that for the condition qualifiers, you can set greater than, less than, greater than/equal to, etc., for the inclusion rule when dealing with numerical or date domain data types.  For string data types, the number of options is even greater, as shown below:

image

Of particular interest here is that you can leverage regular expressions and patterns to look for partial or pattern matches within the string field.  You can also check the string value to see if it can be converted to numeric or date/time.

The rule in action

With the new domain validation rule in place, let’s run some test data through it.  I’m going to create a few test records, some of which violate the rule we just created, and run them through a new DQS project using the knowledge base we modified with this rule.

I’ll start off with the dirty data as shown below.  You can probably infer that we’ve got a few rows that do not comply with the rule we created, on both ends of the value scale:

image

After creating a new data cleansing project, I use the data shown above to test the rule constraining the number of doors.  As shown below in the New output tab, we have several rows that comply with this new rule:

SNAGHTML65f97ec

In addition, there are two distinct values found that do not meet the criteria specified in the new rule.  Selecting the Invalid tab, I see the values 0 and 12 have failed validation, as they fall outside the range specified by the rule.  In the Reason column, you can see that we get feedback indicating that our new rule is the reason that these records are marked as Invalid:

SNAGHTML661229e

So by implementing this rule against my data, I am able to validate not only that the value is present and of the correct type, but that it is reasonable for this scenario.

Conclusion

In this post we’ve reviewed the essentials of domain validation and how we can implement these checks through domain rules in SQL Server Data Quality Services.  In my next post, I’ll continue the discussion around domain rules by reviewing how these rules can be applied to composite domains in DQS.

Skip to main content