Data Glass: February 2006

Tuesday, February 21, 2006

Reasons for Partitioning Your Data

Partitioning splits data across multiple tables, databases, and/or database servers. There are two types: vertical and horizontal partitioning. Horizontal partitioning divides data across multiple tables based on which rows fall within each partition's rule. Each partition has the same columns, but has its own partition rule (Example: Between 01/01/2005-02/01/2005). Vertical partitioning splits the table definition into two or more tables based on the columns.

Here are some basic reasons why a system may need to have its data partitioned: Performance, Workflow, Security, & Change Tracking.

Performance: Volume
Performance issues due to data volume are the most common reason for partitioning tables. Data warehouses commonly use horizontal partitioning based on a clearly defined set of rules. If a row fits within a partition's rule set, it is inserted into that partition. When defining partitions, database designers look for natural partitions for the business. Examples:

· Date Range: By Year, Month, or Day
· By Database Source: Database System X, Y, & Z
· By Collecting Point: Data is collected by N data collectors
· By Business Unit
· By User
· By User Location
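
As a rough illustration of the date-range case, here is a minimal sketch in Python using the built-in sqlite3 module as a stand-in for a real database server. The partition tables (Sale200501, Sale200502), the SaleAll view, and the helper functions are all hypothetical names invented for this example; a production system would more likely rely on the partitioning features of its database engine.

```python
import sqlite3

# Hypothetical partitions: one table per month, identical columns.
# Each rule is a half-open date range [start, end) in ISO format.
PARTITIONS = {
    "Sale200501": ("2005-01-01", "2005-02-01"),
    "Sale200502": ("2005-02-01", "2005-03-01"),
}

def create_schema(conn):
    for table in PARTITIONS:
        conn.execute(
            f"CREATE TABLE {table} ("
            "SaleID INTEGER PRIMARY KEY, SaleDate TEXT NOT NULL, Amount REAL NOT NULL)"
        )
    # A view that unions the partitions so reports can read them as one table.
    union = " UNION ALL ".join(f"SELECT * FROM {table}" for table in PARTITIONS)
    conn.execute(f"CREATE VIEW SaleAll AS {union}")

def partition_for(sale_date):
    """Return the name of the partition whose rule covers the ISO date string."""
    for table, (start, end) in PARTITIONS.items():
        if start <= sale_date < end:
            return table
    raise ValueError(f"no partition rule covers {sale_date}")

def insert_sale(conn, sale_id, sale_date, amount):
    """Route the row to the partition that matches its date."""
    table = partition_for(sale_date)
    conn.execute(
        f"INSERT INTO {table} (SaleID, SaleDate, Amount) VALUES (?, ?, ?)",
        (sale_id, sale_date, amount),
    )
    return table

conn = sqlite3.connect(":memory:")
create_schema(conn)
insert_sale(conn, 1, "2005-01-15", 10.0)
insert_sale(conn, 2, "2005-02-03", 25.0)
```

Each row lands in exactly one partition, while SaleAll still presents the data as a single table for reporting.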

Workflow: Status
Workflow partitioning is used in systems like document tracking and management services. It is a horizontal partitioning scheme based on workflow status rules. Example:

· (Partition 1) Draft Incomplete, Draft Complete, Draft Cancelled, Draft Approved
· (Partition 2) Active, Suspended, Expired
· (Partition 3) Archived
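
The status rules above can be sketched as a simple lookup. This is only an illustration in Python: the partition names (DocumentDraft, DocumentActive, DocumentArchive) and the helper functions are hypothetical, while the status values come from the list above.

```python
# Hypothetical partition names; the status values come from the example list.
STATUS_PARTITIONS = {
    "DocumentDraft": ["Draft Incomplete", "Draft Complete", "Draft Cancelled", "Draft Approved"],
    "DocumentActive": ["Active", "Suspended", "Expired"],
    "DocumentArchive": ["Archived"],
}

# Invert the mapping into a status -> partition lookup.
PARTITION_BY_STATUS = {
    status: partition
    for partition, statuses in STATUS_PARTITIONS.items()
    for status in statuses
}

def partition_for_status(status):
    return PARTITION_BY_STATUS[status]

def move_target(old_status, new_status):
    """Return the partition a document must move to, or None if it stays put."""
    old_partition = partition_for_status(old_status)
    new_partition = partition_for_status(new_status)
    return new_partition if new_partition != old_partition else None
```

A status change inside the same group (Active to Suspended) leaves the row where it is; a change across groups (Draft Approved to Active) means the row moves to another partition.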

Workflow: User Type
Workflow partitions may also be driven by the types of users who work with the data. Contracts are the most commonly managed documents and are used by several different types of users. Example:

· Sales Department
· Legal Department
· Service Fulfillment Departments
· Accounting Department

Security: PII
Partitioning data based on levels of security risk is used to isolate PII (Personally Identifiable Information). Example:

· SSN
· Phone and Address
· Email
· Name
· Credit Cards and Accounts
· Passport ID

Security: Need To Know
Another type of information that may be partitioned for security reasons is information that requires a security clearance and/or a need to know. Example:

· Medical History
· Sensitive Documents
· Trade Secrets

Security partitions are usually horizontally partitioned using tables, databases, and/or servers. This type of partitioning lowers the security risk and makes security requirements easier to enforce.

Change Tracking
Change tracking and history tracking partitions are very common in mission-critical applications that require full tracking of data changes within a database. These partitions are horizontal partitions with extra tracking attributes added. Each partition is usually managed by a trigger on its corresponding master table. This trigger inserts a new row into the partition every time there is an update or delete. With this technique, it is easy to determine what changed, when, and by whom.
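
A minimal sketch of the trigger technique, written in Python with the built-in sqlite3 module standing in for a production database. The PersonHistory table, the ModifiedBy, ChangeType, and ChangeDate columns, and the trigger names are assumptions made for this example, and SQLite's trigger syntax differs from SQL Server's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (
    PersonID   INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL,
    ModifiedBy TEXT NOT NULL
);

-- History partition: the master columns plus extra tracking attributes.
CREATE TABLE PersonHistory (
    PersonID   INTEGER NOT NULL,
    Name       TEXT NOT NULL,
    ModifiedBy TEXT NOT NULL,
    ChangeType TEXT NOT NULL,                            -- 'U' = update, 'D' = delete
    ChangeDate TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP   -- when the change happened
);

-- Copy the previous version of the row into the history table on every update.
CREATE TRIGGER PersonAfterUpdate AFTER UPDATE ON Person
BEGIN
    INSERT INTO PersonHistory (PersonID, Name, ModifiedBy, ChangeType)
    VALUES (OLD.PersonID, OLD.Name, OLD.ModifiedBy, 'U');
END;

-- Do the same on every delete.
CREATE TRIGGER PersonAfterDelete AFTER DELETE ON Person
BEGIN
    INSERT INTO PersonHistory (PersonID, Name, ModifiedBy, ChangeType)
    VALUES (OLD.PersonID, OLD.Name, OLD.ModifiedBy, 'D');
END;
""")

conn.execute("INSERT INTO Person VALUES (1, 'John Doe', 'alice')")
conn.execute("UPDATE Person SET Name = 'John Q. Doe', ModifiedBy = 'bob' WHERE PersonID = 1")
conn.execute("DELETE FROM Person WHERE PersonID = 1")

history = conn.execute(
    "SELECT Name, ModifiedBy, ChangeType FROM PersonHistory ORDER BY rowid"
).fetchall()
```

The history table ends up holding each superseded version of the row in the order the changes were made, so earlier states can be reconstructed without touching the master table.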

Sunday, February 12, 2006

How To: Hierarchal Lookups Without a Cursor

Reporting on hierarchal tables (Child/Parent relationship) can be a huge time bottleneck when using an iterative cursor process. Here is a speedy way to look up hierarchal information.

Original Hierarchal Table
Traditionally, a hierarchal table references its own primary key through a foreign key that is given a parent role name. Example:

Person
PersonID (PK) , Name (Attributes) , ParentPersonID (Self Ref. Person.PersonID FK)
1 , John Doe , NULL
2 , Tom Doe , 1
3 , Jill Doe , 1
4 , Harry Doe , 2
5 , Jim Smith , NULL

Hierarchal Index Table
The key is building an index table that maps each ancestor to every child, grandchild, great-grandchild, etc., and records the generation level of each relationship. A generation level of 0 maps a person to themselves. Example:

PersonHierarchyIndex
AncestoryID (Person.PersonID PK, FK), ChildID (Person.PersonID PK, FK), GenerationLevel
1 (John Doe), 1 (John Doe), 0
1 (John Doe), 2 (Tom Doe), 1
1 (John Doe), 3 (Jill Doe), 1
1 (John Doe), 4 (Harry Doe), 2
2 (Tom Doe), 2 (Tom Doe), 0
2 (Tom Doe), 4 (Harry Doe), 1
3 (Jill Doe), 3 (Jill Doe), 0
4 (Harry Doe), 4 (Harry Doe), 0
5 (Jim Smith), 5 (Jim Smith), 0

Index Usage
With the aid of the hierarchal index you will not need a cursor for your reports. Example:

-- Get all Progeny (Children, Grand Children, etc…)
Select Person.Name, Parent.Name, PersonHierarchyIndex.GenerationLevel
From PersonHierarchyIndex
Join Person
On Person.PersonID = PersonHierarchyIndex.ChildID
And PersonHierarchyIndex.AncestoryID = 1 -- (John Doe)
Join Person Parent
On Parent.PersonID = PersonHierarchyIndex.AncestoryID
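
To see the lookup run end to end, here is a small sketch in Python with the built-in sqlite3 module standing in for SQL Server. It loads the sample Person and PersonHierarchyIndex rows from the tables above and wraps a query in a hypothetical progeny helper. Unlike the query above, it filters out the generation 0 row so that the ancestor is not listed among their own progeny; that filter is an assumption of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (
    PersonID INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    ParentPersonID INTEGER REFERENCES Person (PersonID)
);
CREATE TABLE PersonHierarchyIndex (
    AncestoryID INTEGER NOT NULL REFERENCES Person (PersonID),
    ChildID INTEGER NOT NULL REFERENCES Person (PersonID),
    GenerationLevel INTEGER NOT NULL,
    PRIMARY KEY (AncestoryID, ChildID)
);
""")
conn.executemany("INSERT INTO Person VALUES (?, ?, ?)", [
    (1, "John Doe", None), (2, "Tom Doe", 1), (3, "Jill Doe", 1),
    (4, "Harry Doe", 2), (5, "Jim Smith", None),
])
conn.executemany("INSERT INTO PersonHierarchyIndex VALUES (?, ?, ?)", [
    (1, 1, 0), (1, 2, 1), (1, 3, 1), (1, 4, 2),
    (2, 2, 0), (2, 4, 1), (3, 3, 0), (4, 4, 0), (5, 5, 0),
])

def progeny(conn, ancestor_id):
    """Return (name, generation) for every descendant, with no cursor or loop."""
    return conn.execute("""
        SELECT Person.Name, Idx.GenerationLevel
        FROM PersonHierarchyIndex AS Idx
        JOIN Person ON Person.PersonID = Idx.ChildID
        WHERE Idx.AncestoryID = ?
          AND Idx.GenerationLevel > 0      -- skip the self row
        ORDER BY Idx.GenerationLevel, Person.PersonID
    """, (ancestor_id,)).fetchall()

john_progeny = progeny(conn, 1)
```

One indexed join answers the question regardless of how deep the hierarchy is.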

Index Population
The following example uses a feature that is new in SQL Server 2005: a recursive query written as a CTE (Common Table Expression). The function is used to build the index at the time the hierarchy is created or modified. Most hierarchies are slowly changing domain data. With that in mind, it makes sense to pay the cost of building the index when the domain changes rather than every time a report selects from the hierarchy.

Use this function to then insert into the PersonHierarchyIndex table:

-- Recursive Query using Common Table Expression
CREATE FUNCTION dbo.BuildPersonHierarchyIndex () RETURNS TABLE
AS RETURN (
WITH AncestoryTree (PPID, PID, PersonID, ParentPersonID, DirectChild, GenerationLevel)
AS (
-- Anchor: every person paired with their direct parent
SELECT PPID = Person.ParentPersonID, PID = Person.PersonID, Person.PersonID, Person.ParentPersonID, DirectChild = 1, GenerationLevel = 1
FROM Person
UNION ALL
-- Recursion: step up one ancestor at a time, keeping the original person in PID
SELECT AncestoryTree.PPID, AncestoryTree.PID, Person.PersonID, Person.ParentPersonID, DirectChild = 0, GenerationLevel = AncestoryTree.GenerationLevel + 1
FROM Person
JOIN AncestoryTree
ON AncestoryTree.ParentPersonID = Person.PersonID
WHERE Person.PersonID <> Person.ParentPersonID)
SELECT PersonID = AncestoryTree.PID, AncestoryTree.ParentPersonID, AncestoryTree.DirectChild, AncestoryTree.GenerationLevel
FROM AncestoryTree
WHERE AncestoryTree.PPID IS NOT NULL
UNION ALL
-- Every person is also indexed against themselves at generation 0
SELECT PersonID, PersonID, 0, 0
FROM Person

-- OPTION (MAXRECURSION 200) -- Default is 100. A query hint is not allowed inside the function body,
-- so add it to the statement that selects from this function if the hierarchy is deeper than 100 levels.
);
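
For readers who want to experiment without a SQL Server instance, here is a minimal sketch of the same idea in Python with the built-in sqlite3 module. It is not a port of the function above: it uses a simpler recursive CTE that walks down from each ancestor, it assumes the hierarchy contains no cycles, and the rebuild_index helper is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (
    PersonID INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    ParentPersonID INTEGER REFERENCES Person (PersonID)
);
CREATE TABLE PersonHierarchyIndex (
    AncestoryID INTEGER NOT NULL,
    ChildID INTEGER NOT NULL,
    GenerationLevel INTEGER NOT NULL,
    PRIMARY KEY (AncestoryID, ChildID)
);
""")
conn.executemany("INSERT INTO Person VALUES (?, ?, ?)", [
    (1, "John Doe", None), (2, "Tom Doe", 1), (3, "Jill Doe", 1),
    (4, "Harry Doe", 2), (5, "Jim Smith", None),
])

BUILD_INDEX_SQL = """
WITH RECURSIVE AncestoryTree (AncestoryID, ChildID, GenerationLevel) AS (
    -- Anchor: every person is their own ancestor at generation 0.
    SELECT PersonID, PersonID, 0 FROM Person
    UNION ALL
    -- Step: extend each known path by one more generation of children.
    SELECT AncestoryTree.AncestoryID, Person.PersonID, AncestoryTree.GenerationLevel + 1
    FROM AncestoryTree
    JOIN Person ON Person.ParentPersonID = AncestoryTree.ChildID
    WHERE Person.PersonID <> Person.ParentPersonID   -- guard against self-parenting rows
)
INSERT INTO PersonHierarchyIndex (AncestoryID, ChildID, GenerationLevel)
SELECT AncestoryID, ChildID, GenerationLevel FROM AncestoryTree
"""

def rebuild_index(conn):
    """Rebuild the whole index in one transaction whenever the hierarchy changes."""
    with conn:
        conn.execute("DELETE FROM PersonHierarchyIndex")
        conn.execute(BUILD_INDEX_SQL)

rebuild_index(conn)
index_rows = conn.execute(
    "SELECT AncestoryID, ChildID, GenerationLevel FROM PersonHierarchyIndex "
    "ORDER BY AncestoryID, ChildID"
).fetchall()
```

Rebuilding from scratch keeps the sketch short; a larger hierarchy might instead update only the affected branch.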

Friday, February 03, 2006

Best Practices for SQL Design Patterns

I’m a firm believer in creating symmetry through design patterns where it is applicable. Using good modeling techniques and naming conventions on data and process definitions will enable easier code generation. 80% of an application can be code generated based on templates that enforce design patterns. The remaining 20% may need to vary a little from the standards when following them is too “costly”. The following naming and coding suggestions have proved to be the most practical for me in starting any project.

Table Name convention

In the past I’ve seen several databases whose difficult naming conventions increased the cost of ownership through confusion. The databases that followed the suggestions below lowered the cost of ownership by being easy to understand.

Pascal naming convention: The first letter of every word is capitalized.

Good Example: Person, PersonAddress, GBITax
Bad Example: personaddress, Person_Address, personAddress, tbl_PersonAddress

Do not pluralize the table name: It is best not to place an 'S' at the end of every table name. It can be assumed that a table contains many entries.

Good Example: Person, Country, UserSignup, DetailRecord
Bad Example: Persons, DetailRecords

General rules for prefixing tables: Prefixes can be useful for classifying tables into business areas when the number of tables becomes too unwieldy to deal with in a single database. Usually, when this starts to happen, one should ask whether the database should be split into multiple databases, but there are cases when this may not be the desired direction. In those cases you should start thinking about prefixes. You can use abbreviations as long as they are consistent.

Good Example: CatalogItem, UtilMessageLog, PromoRule
Bad Example: tblPerson

Do not use the ‘_’ as a spacer: From a developer's and analyst's standpoint these little guys can be quite annoying. And with the Pascal naming convention the use of ‘_’ for readability becomes redundant.
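
Some of these table naming rules can be checked mechanically. The following is a rough sketch in Python, not part of any existing tool: the table_name_problems function and its messages are hypothetical, and it only catches what a pattern can see. It cannot tell whether a name is pluralized or whether an inner word should have been capitalized.

```python
import re

# Starts with an uppercase letter, then only letters and digits (no '_' spacer).
PASCAL_CASE = re.compile(r"^[A-Z][A-Za-z0-9]*$")

def table_name_problems(name):
    """Return a list of rule violations for a proposed table name."""
    problems = []
    if "_" in name:
        problems.append("uses '_' as a spacer")
    if name.startswith("tbl"):
        problems.append("uses the 'tbl' prefix")
    if not PASCAL_CASE.match(name):
        problems.append("is not Pascal case")
    return problems

# Names taken from the good and bad examples above.
for candidate in ["PersonAddress", "GBITax", "personaddress", "Person_Address", "tblPerson"]:
    print(candidate, table_name_problems(candidate) or "ok")
```

A check like this fits naturally into a schema review script or a code generation step.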

Column Name convention

Pascal naming convention: The first letter of every word is capitalized.

Good Example: PersonID, FirstName, ZipCode, SSN
Bad Example: Personid, Firstname, lastName

Do not place the name of the data type in the column name: If you change the data type of the column, the column name becomes misleading, and renaming a column forces troublesome changes everywhere it is used. There are a few exceptions: Date is one example.

Good Example: StartBatchDate, FirstName, USDAmount
Bad Example: ti_event_id, vcFirstName

Usage of Abbreviations: Abbreviations are good for reducing the size of a column name. One should always be consistent and only use abbreviations that are commonly understood. Remember that the name of the column should be self-descriptive. Avoid overusing abbreviations, which can make names difficult to understand and support. When in doubt, spell it out.

Good Example: StartBatchDate, UserNum, SubscriptionDesc
Bad Example: ti_event_id, vcFirstName, SbDc, X

Schema Structure

Use Constraints: It is sad to say that there are many databases in this world that do not use primary key (PK), foreign key (FK), and alternate key (AK) constraints. Unless there are technical reasons for not having these, you should always use them, if for nothing else, to ensure referential integrity. The use of data integrity constraints also makes the database design self-documenting (Example: Schema extraction process into Visio or Erwin.)

Name Your Indexes: If you name your indexes instead of having SQL Server generate the names dynamically, you will have an easier time altering those indexes in the future. You will then explicitly know the name of the index rather than having to look it up dynamically in order to alter it. If you let the name be generated dynamically, it will be different from environment to environment (Example Environments: Development, Test, & Production).
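
A small sketch of why a known index name helps, using Python's built-in sqlite3 module in place of SQL Server. The index name IXPersonLastName is a hypothetical choice, and SQLite's DROP INDEX syntax differs slightly from SQL Server's, which also requires the table name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (
    PersonID INTEGER PRIMARY KEY,
    LastName TEXT NOT NULL
);
-- The index is named explicitly, so every environment uses the same name.
CREATE INDEX IXPersonLastName ON Person (LastName);
""")

# Because the name is known up front, the same change script can run
# unchanged in Development, Test, and Production.
conn.executescript("""
DROP INDEX IXPersonLastName;
CREATE INDEX IXPersonLastName ON Person (LastName, PersonID);
""")

index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'Person'"
    )
]
index_columns = [row[2] for row in conn.execute("PRAGMA index_info('IXPersonLastName')")]
```

With a generated name, the change script would first have to query the catalog to discover what the index is called in each environment.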

Stored Procedure Naming Convention

Name Format: Don’t use SP as a prefix for your stored procedures. In SQL Server the sp_ prefix marks system stored procedures (Example: sp_helptext). Also, if you use the SP prefix, SQL Server will check the master database first and then the application database, so it is a performance issue as well.

Good Example: GetPerson, ImportProductKey, CountPIDs, CommitPayments, GetProductReport
Bad Example: SP_GetPerson

Pascal naming convention: The first letter of every word is capitalized.

Good Example: GetPerson, ImportProductKey, CountPIDs, CommitPayments
Bad Example: Get_Person, IProductKey, COUNTPIDS
Examples of names: Get; Set; Ins (Insert); Upd (Update); Del (Delete); Import; Extract; Load; Transform; Export; Count; Process; Commit;

Abbreviation / Acronym dictionary: You should have a list of the most commonly used abbreviations. It’s not a hard and fast rule that a person can’t use an abbreviation that is not on the list. It is just a means to help create symmetry into the system. This list should be in a database design / development standards document.

Examples:
Abbreviation - Name
Num - Number
Ind - Indicator
Dtm - Date and Time (Millisecond)
Id - Identifier
Cnt - Count
Amt - Amount
Acct - Account
USD - United States Dollar Currency Code

Stored Procedure Argument Naming Convention
This is where coding preferences really start to vary from developer to developer. The concepts here are just helpful tips that I have found very useful as a SQL Developer.

Argument Prefix: @Arg. Looking through a complex stored procedure and attempting to quickly find where the stored procedure arguments are being used can be time consuming. The use of the @Arg prefix can ease this issue. I’ve found myself avoiding lots of coding mistakes by having done so.

Example: GetPerson(@ArgPersonID int)

Coding Stored Procedures

The Return statement: Avoid using the return statement to return any values other than error codes, if even that. One should use Output parameters where possible for returning values. (This is for stored procedures. Functions are a different beast altogether.)

Set NoCount On: Place Set NoCount On at the beginning of every stored procedure to reduce the noise coming back to Query Analyzer.

Error Trapping: Always check for errors at every statement and log these errors to a MessageLog table. Every stored procedure must do this.

Transactions: K.I.S.S. is the general rule here. Keeping it simple will avoid some of the pitfalls many fall into with complex rollback designs. Always think of a unit of work in which several processes must live and die together. If any process within that unit of work fails, then the unit of work should be rolled back. If a unit of work is extensive, it would be a good idea to break it up into multiple units of work, with each unit of work isolated in its own stored procedure. That stored procedure can have multiple stored procedure calls, but try to keep the design flat so as not to have a stored procedure that calls a stored procedure that calls a stored procedure, etc. Avoid combining units of work into one gigantic transaction. Otherwise, people from all walks of life will walk over to your desk and slap you with a wet noodle.
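
The "live and die together" idea can be sketched outside of a stored procedure as well. The following Python example uses the built-in sqlite3 module; the Account table, the CHECK constraint, and the transfer function are assumptions made for illustration, not a recommended banking design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Account ("
    "AccountID INTEGER PRIMARY KEY, "
    "Balance INTEGER NOT NULL CHECK (Balance >= 0))"
)
conn.executemany("INSERT INTO Account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, from_id, to_id, amount):
    """One flat unit of work: both updates commit together or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE Account SET Balance = Balance + ? WHERE AccountID = ?",
                (amount, to_id),
            )
            conn.execute(
                "UPDATE Account SET Balance = Balance - ? WHERE AccountID = ?",
                (amount, from_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint failed, so the credit was rolled back too

def balances(conn):
    return dict(conn.execute("SELECT AccountID, Balance FROM Account ORDER BY AccountID"))

first_ok = transfer(conn, 1, 2, 30)     # succeeds
second_ok = transfer(conn, 1, 2, 500)   # would overdraw account 1, so it is rolled back
final_balances = balances(conn)
```

The failed transfer leaves no trace: the credit that had already been applied to the second account is undone along with the failed debit.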

Avoid Deadlocks: To avoid deadlocks, get in the habit of accessing tables in the same order throughout your system. This will alleviate the most common cause of deadlocks.
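
The same ordering principle can be illustrated with ordinary locks. This Python sketch is an analogy only: the TABLE_LOCKS dictionary stands in for the locks a database would take on two tables, and run_with_tables is a hypothetical helper that always acquires them in alphabetical order no matter how the caller lists them.

```python
import threading

# Hypothetical stand-ins for the locks a database would take on two tables.
TABLE_LOCKS = {"Account": threading.Lock(), "MessageLog": threading.Lock()}

def run_with_tables(table_names, work):
    """Lock the named tables in one agreed (alphabetical) order, then run work."""
    ordered = sorted(table_names)
    for name in ordered:
        TABLE_LOCKS[name].acquire()
    try:
        return work()
    finally:
        for name in reversed(ordered):
            TABLE_LOCKS[name].release()

completed = []

def worker(label, table_names, repeats=500):
    for _ in range(repeats):
        run_with_tables(table_names, lambda: completed.append(label))

# The two callers list the tables in opposite order, but the helper normalizes
# the order, so neither can hold one lock while waiting for the other.
threads = [
    threading.Thread(target=worker, args=("A", ["Account", "MessageLog"])),
    threading.Thread(target=worker, args=("B", ["MessageLog", "Account"])),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

If each worker instead acquired the locks in the order it listed them, the two threads could each grab one lock and wait forever for the other.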

InLine Code Documentation: Document the “You Know What” out of the stored procedure within the code. No excuses or exceptions here.

Stored Procedure Code Generator: This is a very useful tool for creating fundamental (basic) insert, update, delete, get, and set stored procedures. I’ve called them fundamental stored procedures as they are generated per table. These stored procedures are mainly used by the UI. I’ve found that generating these stored procedures increases symmetry and stability and reduces human error and deviation. This kind of tool can be used to generate 80% of all stored procedures where a UI is needed. The other 20% are complex stored procedures that need more TLC (Tender Loving Care) from the developers.

Once you’ve generated these stored procedures, you’re better off maintaining them manually, though some generators allow for “Re-Entry”, in which manual code changes are kept while the rest of the stored procedure is regenerated.

I do have a stored procedure code generator that I’ve created over the years. It needs a little bit of updating, but works fine as-is and can save lots of time and money. It will generate very clean and supportable code. There are many code generators available. I’ve created my own, because I like total control of my tools and I’ve had it before many code generators existed.
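
To give a feel for what template-based generation looks like, here is a minimal sketch in Python. It is not the generator described above: the GET_TEMPLATE text and the generate_get_procedure function are hypothetical, and a real tool would read the table definition from the database catalog rather than take it as arguments. The template applies the conventions from this post (no SP prefix, Pascal naming, the @Arg prefix, and Set NoCount On).

```python
from string import Template

# Template for a fundamental "Get" stored procedure, one per table.
GET_TEMPLATE = Template("""\
CREATE PROCEDURE Get$table
    @Arg$key $key_type
AS
BEGIN
    SET NOCOUNT ON;

    SELECT $columns
    FROM $table
    WHERE $key = @Arg$key;
END
""")

def generate_get_procedure(table, key, key_type, columns):
    """Fill the template from a table description and return the T-SQL text."""
    return GET_TEMPLATE.substitute(
        table=table,
        key=key,
        key_type=key_type,
        columns=", ".join(columns),
    )

get_person_sql = generate_get_procedure(
    "Person", "PersonID", "int", ["PersonID", "Name", "ParentPersonID"]
)
print(get_person_sql)
```

Because every procedure comes from the same template, a convention change only has to be made in one place and then regenerated.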
