Wednesday, September 12, 2012

Building Critical High Volume Cubes: The Challenge

The past 4 years I've been involved in delivering mission-critical line-of-business analysis and reporting cubes that feed off high volume data. Though I've moved on to another job, I learned a lot and wanted to pass on some of what I learned and experienced to others who may stumble upon my blog. This entry is not meant to provide a solution but to tell a story of the high volume data space, which is suffering from an astounding rate of growth. We weren't the most successful team out there, I know. I've heard of wonderful stories like this one (2PB Data and 24TB Cubes with high reporting speeds); sadly, I just was not part of those stories. But I did enjoy the work and the people that I worked with.

One day there will be an extreme high volume cloud solution that the development community can easily use to provide reporting solutions at the speed of business, but right now it still seems that we struggle. There are Hadoop and Self-Learning Bitmaps, so there is hope. I have faith that some brilliant engineers out there will come up with something that will work.
(Edit: That is coming, as these technologies have really matured, and of course going from batch to interactive querying with Apache Tez is very promising indeed!)

--------------------------------------------------------------------------------------------

Keeping Pace with the Speed of Business Challenge: Our business needed to do complex web analytics, and the available tools at our disposal were a Hadoop-like grid processing and storage environment (which was already in place and managed by another team upstream from us) and SSAS cubes (our team's responsibility). Using SSAS cubes as the reporting solution was dictated by executive management. Our business needed new dimensions and measures added to their cubes at the speed of business, which usually meant within 2-4 weeks or less. These cubes also needed to be filled with at least two years of historical data at a daily grain. The data we needed to process came from service logs whose events added up to 24 TB a day.

Reality: We rolled out new dimensions and measures to cubes every 6 months with no historical reprocessing. This meant two years of data was available for existing dimensions/measures, while new dimensions could only have data from the point we released the new cubes into pre-production, which was a month and a half before we released the cubes to business. That month-and-a-half block of time was used for processing data, data validation, and user acceptance testing.

Reason for 6 Month Delivery Cycles: We were limited by other teams' schedules. First, the teams needed to budget for and commit to providing the new data streams. Each team had its own development cycle, so all teams participating in the pipeline had to be coordinated and data contracts put into place. The schedules usually resulted in a 6 month development cycle.

Reason for No Historical Reprocessing for New Dimensions/Measures: Reprocessing two years of historical high volume data on limited hardware was too costly and would have prevented critical daily ETL pipeline processing. Business considered data analysis a cost center and always wanted more than what they were willing to pay for.

--------------------------------------------------------------------------------------------

Data Size Challenge: We needed to process data sets adding up to 24 TB a day within 24 hours after log collection had been completed for the day, and store at least two years of pre-aggregated data.

Reality: We used a Hadoop-like processing and storage grid that processed the data into pre-aggregated data sets. We were able to store two years of data in the storage grid. The aggregated data sets that fed each data mart and cube were 10-15 GB a day. The cubes were 500 GB to 1 TB in size with two years of historical data at a daily grain. It usually took 36 hours to process the data through the ETL pipeline and into the cubes after the close of daily logs.
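
To make the pre-aggregation step concrete, here is a minimal sketch (not our actual grid job; the dimension names and field layout are assumptions for illustration) of rolling raw events up to a daily grain for a fixed set of dimensions before loading a cube:

```python
# Illustrative pre-aggregation step (not the actual grid job): roll raw events
# up to the daily grain for a fixed, pre-agreed set of dimensions.
from collections import Counter

def daily_preaggregate(events):
    """events: iterable of dicts with 'day', 'country', 'page' and a 'views' count.
    Returns a Counter keyed by the cube's dimensional grain."""
    totals = Counter()
    for e in events:
        key = (e["day"], e["country"], e["page"])  # hypothetical grain
        totals[key] += e.get("views", 1)
    return totals
```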

Reason for 36 Hour Latency: It took 24 hours to enrich the logs, which included identifying user uniqueness without exposing user identities downstream, detecting fraud, and sessionizing the data. It took another 12 hours to join and merge other external data sets, dimensionalize and aggregate the data, and load it into the cubes.
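
As an illustration of the sessionization step, here is a minimal sketch, assuming a 30-minute inactivity timeout and simple event dictionaries (both are illustrative assumptions, not details from our pipeline):

```python
# Minimal sessionization sketch: assign a session id to each event,
# starting a new session after 30 minutes of user inactivity.
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed timeout for illustration

def sessionize(events):
    """events: iterable of dicts with 'user_id' and 'timestamp' (datetime)."""
    by_user = defaultdict(list)
    for e in events:
        by_user[e["user_id"]].append(e)

    for user_id, user_events in by_user.items():
        user_events.sort(key=lambda e: e["timestamp"])
        session_no = 0
        last_ts = None
        for e in user_events:
            # Gap larger than the timeout starts a new session.
            if last_ts is None or e["timestamp"] - last_ts > SESSION_TIMEOUT:
                session_no += 1
            e["session_id"] = f"{user_id}-{session_no}"
            last_ts = e["timestamp"]
    return by_user

# Example usage
events = [
    {"user_id": "u1", "timestamp": datetime(2012, 9, 12, 8, 0)},
    {"user_id": "u1", "timestamp": datetime(2012, 9, 12, 8, 10)},
    {"user_id": "u1", "timestamp": datetime(2012, 9, 12, 9, 30)},  # new session
]
sessionize(events)
```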

--------------------------------------------------------------------------------------------

Tuning Challenge: Business expected all cube queries to return results within 2 seconds for 100 concurrent users, in order to satisfy business reporting and business troubleshooting analysis usually done at the last minute at the end of the month. You know, the instant-information-at-your-fingertips kind of expectation.

Reality: Our average query ranged from 1 second to 5 minutes. Aggregation indexes needed to be added for the key dimension combinations used by business for common reports. Any query running past 5 minutes was automatically cancelled; we did this because a query taking that long kept other users from getting their simple query results until the rogue query completed. Cubes didn't handle concurrent users very well. Caching critical monthly/weekly/daily canned reports was the most reliable way to deliver quick results. This reduced the overall number of users on the cubes at any given moment, allowing deeper analysis to be done by key people with a stronger understanding of cube technology. This played in our favor, as this group of people was so much more appreciative and understanding of what magical things we did for them. Everyone else got canned reports :)
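
In practice the canned-report caching amounts to warming the cubes with the known report queries right after processing, so business-hours traffic hits warm caches. A rough sketch of the idea follows; run_mdx_query, the cube and dimension names, and the MDX statements are hypothetical placeholders, not our actual reports or tooling:

```python
# Illustrative cache-warming sketch. run_mdx_query is a hypothetical helper
# that executes an MDX statement against the cube (e.g. via ADOMD/XMLA);
# the queries below are made-up stand-ins for the canned report queries.

CANNED_REPORT_QUERIES = [
    "SELECT [Measures].[Page Views] ON 0, [Date].[Day].Members ON 1 FROM [Traffic]",
    "SELECT [Measures].[Visits] ON 0, [Geography].[Country].Members ON 1 FROM [Traffic]",
]

def warm_cube_cache(run_mdx_query):
    """Execute each canned query once; results are discarded, caches stay warm."""
    for mdx in CANNED_REPORT_QUERIES:
        try:
            run_mdx_query(mdx)
        except Exception as exc:  # a failed warm-up should not block the pipeline
            print(f"warm-up query failed: {exc}")
```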

--------------------------------------------------------------------------------------------

Data Quality Challenge: Business made business-critical decisions based on the data from our cubes. So the data had to be of the highest quality and had to compare consistently and accurately with other related data available internally or externally to the business.

Reality: ETL pipelines are living organisms. Different teams release code changes into production weekly. Though everyone understood that protecting and managing the data contracts between teams was key to the pipeline's predictability and stability, something always happened that affected the quality of the data.

A data quality team was always validating new cube releases as well as checking the data weekly for anomalies. Each finding resulted in new automated data validation checks being put in place.
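
One simple example of the kind of automated check such a team might add is a day-over-day volume comparison; the thresholds and function below are illustrative, not the checks we actually ran:

```python
# Sketch of one kind of automated validation check: flag a metric whose
# day-over-day change falls outside an expected band (thresholds are assumed).

def check_day_over_day(metric_name, today_value, yesterday_value,
                       max_ratio=1.5, min_ratio=0.5):
    """Return an anomaly message if today's value is outside the expected band."""
    if yesterday_value == 0:
        return f"{metric_name}: yesterday was zero, cannot compare"
    ratio = today_value / yesterday_value
    if ratio > max_ratio or ratio < min_ratio:
        return (f"{metric_name}: day-over-day ratio {ratio:.2f} outside "
                f"[{min_ratio}, {max_ratio}] -- investigate logging/ETL")
    return None
```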

The most horrible bugs were bugs in the instrumentation, which can significantly impact the quality of the data logs themselves. Double logging, no logging, or recording the wrong value were the most common bugs. Bad instrumentation might not be detected right away if the data verification checks weren't good enough to catch the anomaly, so it could be weeks if not months before someone noticed the problem. The solution is to fix the instrumentation; unfortunately you can't fix the data logs (or I should say, you shouldn't, as the logs are your empirical data source). So adjustments are made downstream in the reporting systems to account for the bug.

ETL pipeline code bugs can also go undetected, but once found they can usually be fixed by reprocessing the data. Unfortunately, if we found the problem two months into the anomaly, the business had to determine whether the benefits outweighed the cost of reprocessing the data versus just applying an adjustment to the reporting system downstream.

Adjustments can be made to fix the reports, but an adjustment tracking system should be built to record and apply the adjustments in a predictable way, so that analysts can see what adjustments were made, and for what reasons, on any report they view.
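
A minimal sketch of what such an adjustment tracking structure could look like (the field names and the multiplicative correction are illustrative assumptions, not our actual system):

```python
# Minimal adjustment-tracking sketch: every correction applied to a report is
# recorded with its reason and affected date range, so analysts can see what
# was adjusted, why, and by how much.
from dataclasses import dataclass
from datetime import date

@dataclass
class Adjustment:
    metric: str      # e.g. "page_views"
    start: date      # first affected day
    end: date        # last affected day
    factor: float    # multiplicative correction, e.g. 0.5 for double logging
    reason: str      # e.g. "bug #1234: client double-logged impressions"

def apply_adjustments(metric, day, raw_value, adjustments):
    """Return (adjusted_value, reasons applied) for one report cell."""
    value, reasons = raw_value, []
    for adj in adjustments:
        if adj.metric == metric and adj.start <= day <= adj.end:
            value *= adj.factor
            reasons.append(adj.reason)
    return value, reasons
```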

--------------------------------------------------------------------------------------------

Cube Hardware Challenge: We needed hardware available that could process and store the data, load balance the cubes and reporting services, and allow the old and new cubes to exist in production in parallel for a one to two month period during releases.

Reality: Hardware is cheap relative to human resources, but the fact of the matter is that operations teams (those who manage and maintain the production environments) think very differently than the engineering teams that are focused on a product or service. The operations team may have thousands of servers from various groups to manage and a mandate to increase server utilization, since many servers have storage and processing power that go unused. We didn't have a cloud solution available that enabled easier sharing, scaling, and managing of hardware and storage, so we had to justify new hardware and storage, which was like pulling teeth. Sometimes we were required to reconfigure our production environments and move cubes to share hardware with other cubes in order to get the hardware we needed. And to top it all off, we needed to have our hardware ordered 9 months before we needed it, and the order had to be placed 3 months before the beginning of the fiscal year. The real challenge was estimating how much hardware and storage we would need for the next 9-12 months before we even knew what our next delivery would require. So we had to provide an educated guess. That was hard and annoying, but we survived.

One other point about standards with the operations team: they required us to use standard hardware from an approved list they provided. This meant specialized hardware to handle large scale cubes was out of the question, a limitation that also impacted our design and cube performance.

--------------------------------------------------------------------------------------------

Unique User Count Challenge: The web analytics community has moved on from page views, clicks, and click-throughs to much more in-depth analysis based on distinct users. We needed to provide unique user counts for any number of dimension combinations.

Reality: As you should know, distinct counts of anything are not summable. For example, I cannot add yesterday's distinct user counts to today's distinct user counts and get a correct answer; a recalculation of distinct user counts across the two days is required (see the sketch below). Our number of distinct users visiting the website per day ran into the hundreds of millions. Producing distinct user counts in the cube was simply not possible with the dimensionality we required while keeping any reasonable performance; the result would just render the cube unusable. So we used our Hadoop-like grid processing environment to preprocess the distinct user counts. This was easy enough, but it meant we had to know in advance exactly which combination of dimensions and which grain each report required. This limited the reporting to select combinations of dimensions, and if business needed another combination not currently supported, a development cycle was required. And the distinct user count reports were not in the cube, but in a separate reporting system. We could have used drillthrough, but that was not a very user-friendly option.
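
Here is a tiny example of why the counts are not summable; the user sets are made up:

```python
# Why distinct user counts cannot simply be added across days.
yesterday = {"alice", "bob", "carol"}
today = {"bob", "carol", "dave"}

print(len(yesterday) + len(today))  # 6 -- naive sum over-counts returning users
print(len(yesterday | today))       # 4 -- true distinct users across both days
```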

All of these constraints and challenges caused a lot of frustration and problems for the business trying to keep up with the speed of business. But on the positive side, the business learned to ask for exactly what they needed rather than a large list of wants.
