For the past four years I was involved in delivering mission-critical
line-of-business analysis and reporting cubes that feed off high-volume data.
Though I've moved on to another job, I learned a lot and wanted to pass on
some of what I learned and experienced to others who may stumble upon my
blog. This entry is not meant to provide a solution but to tell a story of the
high-volume data space, which is growing at an astounding rate. We
weren't the most successful team out there, I know. I've heard of wonderful
stories like this one (2PB Data and 24TB Cubes with high reporting speeds); I just was not part of those stories, sadly. But I did enjoy the work
and the people that I worked with.
One day there will be an extreme high-volume cloud solution that the development
community can easily use to provide reporting solutions at the speed of business, but
right now it still seems that we struggle. There are Hadoop and self-learning
bitmaps, so there is hope. I have faith that some brilliant engineers out there
will come up with something that will work.
(Edit: That is coming, as these technologies have really matured, and of course going from batch to interactive querying with
Apache Tez is very promising indeed!)
--------------------------------------------------------------------------------------------
Keeping Pace with the Speed of Business Challenge: Our
business needed to do complex web analytics, and the tools at
our disposal were a Hadoop-like grid processing and storage environment (which
was already in place and managed by another team upstream from us) and SSAS
cubes (our team's responsibility). Using SSAS cubes as the reporting solution was dictated by executive management. The business needed new
dimensions and measures added to their cubes at the speed of business, which
usually meant within 2-4 weeks or less. These cubes also needed to be filled
with at least two years of historical data at a daily grain. The data we needed to
process came from service logs whose events added up to 24 GB a day.
Reality: We rolled out new dimensions and measures to cubes
every 6 months with no historical reprocessing. This meant two years of data was
available for existing dimensions/measures, while new dimensions could only have data
from the point we released the new cubes into pre-production, which was a month
and a half before we released the cubes to business. That month and a half we used
for processing data, data validation, and user acceptance testing.
Reason for 6-Month Delivery Cycles: We were limited by
other teams' schedules. First, the teams needed to budget for and commit to
providing the new data streams. Each team had its own development cycle,
so all teams participating in the pipeline had to be coordinated and
data contracts put into place. The schedules usually resulted in a 6-month
development cycle.
Reason for No Historical Reprocessing for New Dimensions/Measures: Reprocessing
two years of historical high-volume data on limited hardware was too costly and
would have blocked the critical daily ETL pipeline processing. Business considered data
analysis a cost center and always wanted more than what they were willing to
pay for.
--------------------------------------------------------------------------------------------
Data Size Challenge: We needed to process data sets adding
up to 24 TB a day within 24 hours after log collection was completed
for the day, and store at least two years of pre-aggregated data.
Reality: We used a Hadoop-like processing and storage grid
that turned the raw data into pre-aggregated data, and we were able to store two
years of data in the storage grid. The aggregated data sets that fed each data
mart and cube were 10-15 GB a day. The cubes were 500 GB to 1 TB in size with
two years of historical data at a daily grain. It usually took 36 hours after the
close of daily logs to process the data through the ETL pipeline and into the
cubes.
Reason for the 36-Hour Latency: It took 24 hours to enrich the
logs, which included identifying user uniqueness without exposing user
identification downstream, detecting fraud, and sessionizing the data (a rough
sketch of the sessionizing step is below). It took another 12 hours to join and
merge other external data sets, dimensionalize and aggregate the data, and load
it into the cubes.
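To give a sense of what the sessionizing step involves, here is a minimal sketch (not our actual pipeline code, which ran on the grid): it groups events by an anonymized user id and starts a new session whenever the gap between events exceeds a timeout. The 30-minute timeout and the event field names are assumptions for illustration.

```python
from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed timeout, for illustration only

def sessionize(events):
    """Assign a session number to each event.

    `events` is an iterable of dicts with an anonymized 'user_id' and a
    'timestamp' (datetime) -- assumed field names for this sketch. Events
    are processed per user in time order; a gap longer than the timeout
    starts a new session for that user.
    """
    state = {}  # user_id -> (last_timestamp, session_number)
    for event in sorted(events, key=lambda e: (e["user_id"], e["timestamp"])):
        last_seen, session_no = state.get(event["user_id"], (None, 0))
        if last_seen is None or event["timestamp"] - last_seen > SESSION_TIMEOUT:
            session_no += 1  # gap too long (or first event): start a new session
        state[event["user_id"]] = (event["timestamp"], session_no)
        yield {**event, "session_no": session_no}
```

At our volumes this kind of logic ran as a distributed grid job rather than a single in-memory pass, but the session boundary rule is the same.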
--------------------------------------------------------------------------------------------
Tuning Challenge: Business expected all cube queries
to return results within 2 seconds for 100 concurrent users, in order
to satisfy business reporting and business troubleshooting analysis usually
done at the last minute at the end of the month. You know, the instant-information-at-your-fingertips
kind of expectation.
Reality: Our query times ranged from 1 second to 5
minutes. Aggregations needed to be added for the key dimension
combinations business used in common reports. Any query running past 5
minutes was automatically cancelled (a sketch of that watchdog idea is below).
We did this because a long-running rogue query kept other users from getting
their simple query results until it completed; the cubes didn't handle
concurrent users very well.
Caching critical monthly/weekly/daily canned reports was the most reliable
way to deliver quick results. This reduced the number of users on
the cubes at any given moment, leaving deeper analysis to key people
with a stronger understanding of cube technology. That played in our favor, as
this group of people was much more appreciative and understanding of the
magical things we did for them. All other people got canned reports :)
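The automatic cancellation amounted to a watchdog that periodically looked for long-running queries and killed anything past the cutoff. Here is a minimal sketch of that idea; `list_running_queries` and `cancel_query` are hypothetical helpers standing in for whatever management interface your OLAP server exposes, and the 5-minute cutoff mirrors the limit described above.

```python
import time

QUERY_CUTOFF_SECONDS = 5 * 60  # matches the 5-minute limit described above


def watchdog(list_running_queries, cancel_query, poll_interval=30):
    """Poll for long-running queries and cancel anything past the cutoff.

    `list_running_queries` and `cancel_query` are hypothetical callables:
    the first yields (query_id, elapsed_seconds) pairs, the second cancels
    a query by id. Swap in your server's real management API.
    """
    while True:
        for query_id, elapsed_seconds in list_running_queries():
            if elapsed_seconds > QUERY_CUTOFF_SECONDS:
                cancel_query(query_id)  # rogue query: free the server for everyone else
        time.sleep(poll_interval)
```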
--------------------------------------------------------------------------------------------
Data Quality Challenge: Business made business-critical
decisions based on the data from our cubes, so the data had to be of the highest
quality and had to compare consistently and accurately with other related data
available internally or externally to the business.
Reality: ETL pipelines are living organisms. Different
teams release code changes into production weekly. Though everyone
understood that protecting and managing data contracts between teams was key
to the pipeline's predictability and stability, there was always something that
happened that affected the quality of the data.
A data quality team was always validating new cube releases as well as
checking the data weekly for anomalies, which resulted in new automated
data validation checks being put into place (a simple example is sketched below).
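As an example of the kind of automated check the data quality team would add, here is a minimal sketch, assuming we have daily row counts (or measure totals) for a data stream. It flags any day whose volume deviates too far from the trailing average; the window size and tolerance are assumptions for illustration, but a check this simple is often enough to catch double logging or missing logging early.

```python
def flag_volume_anomalies(daily_counts, window=7, tolerance=0.30):
    """Flag days whose volume deviates from the trailing average.

    `daily_counts` is a list of (date, count) pairs in date order; the
    7-day window and 30% tolerance are illustrative defaults. Returns
    (date, count, baseline) for days that look anomalous, e.g. double
    logging (count too high) or missing logging (count too low).
    """
    flagged = []
    for i in range(window, len(daily_counts)):
        day, count = daily_counts[i]
        baseline = sum(c for _, c in daily_counts[i - window:i]) / window
        if baseline and abs(count - baseline) / baseline > tolerance:
            flagged.append((day, count, baseline))
    return flagged
```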
The most horrible bugs were instrumentation bugs, which can significantly
impact the quality of the data logs themselves. Double logging, no logging, or
recording the wrong value were the most common. Bad instrumentation may
not be detected right away if the data verification checks aren't good enough
to catch the anomaly, so it could be weeks if not months before someone
notices the problem. The solution is to fix the instrumentation; unfortunately
you can't fix the data logs (or I should say, you shouldn't, as the logs are
your empirical data source). So adjustments are made downstream in the
reporting systems to account for the bug.
ETL pipeline code bugs can also go undetected, but once
found they can usually be fixed by reprocessing the data. Unfortunately, if we found the
problem two months into the anomaly, the business might have to determine whether the
benefits outweighed the cost of reprocessing the data versus just applying an adjustment
to the reporting system downstream.
Adjustments can be made to fix the reports, but an adjustment tracking
system should be built to track and apply the adjustments in a predictable way
so that analysts can see what adjustments were made, and for what reasons, on any
report they view (a sketch of the idea follows).
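The adjustment tracking idea boils down to never editing the facts in place: each adjustment is recorded as its own entry with a reason and a date range, and it is applied on top of the raw numbers at report time so analysts can always see what was changed and why. A minimal sketch, with field names that are assumptions for illustration:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Adjustment:
    # Field names are illustrative, not an actual schema.
    report: str    # which report the adjustment applies to
    start: date    # first affected day
    end: date      # last affected day
    factor: float  # multiplicative correction, e.g. 0.5 to undo double logging
    reason: str    # why the adjustment exists, surfaced to analysts


def apply_adjustments(report, day, raw_value, adjustments):
    """Apply every adjustment covering this report and day to the raw value."""
    value = raw_value
    reasons = []
    for adj in adjustments:
        if adj.report == report and adj.start <= day <= adj.end:
            value *= adj.factor
            reasons.append(adj.reason)
    return value, reasons  # the reasons can be shown right next to the number
```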
--------------------------------------------------------------------------------------------
Cube Hardware Challenge: We needed to have hardware
available that could process and store the data, load-balance the cubes and
reporting services, and allow the old and new cubes to exist in production
in parallel for a one- to two-month period during releases.
Reality: Hardware is cheap relative to human resources, but
the fact of the matter is that operations teams (those that
manage and maintain the production environments) think very differently than the
engineering teams that are focused on a product or service. The operations team
may be managing thousands of servers from various groups and may have
been mandated to increase server utilization, since many servers have
storage and processing power that go unused. We didn't have a cloud solution
available that enabled easier sharing, scaling, and managing of hardware and
storage, so we had to justify new hardware and storage, which was like pulling
teeth. Sometimes we were required to reconfigure our production environments
and move cubes onto hardware shared with other cubes in order to get the hardware we
needed. And to top it all off, we needed to have our hardware ordered 9 months
before we needed it, and the order had to be placed 3 months before the beginning of
the fiscal year. The real challenge was estimating how much hardware and storage we would
need for the next 9-12 months before we even knew what our next delivery would
require. So we had to provide an educated guess. That was hard and annoying,
but we survived.
One other point about standards with the operations team: they required us to use standard hardware from an approved list they provided. This meant specialized hardware to handle large-scale cubes was out of the question, a limitation that also impacted our design and cube performance.
--------------------------------------------------------------------------------------------
Unique User Count Challenge: The web analytics community has moved on
from page views, clicks, and click-throughs to much more in-depth analysis
based on distinct users. We needed to provide unique user counts based on any
number of dimension combinations.
Reality: As you should know, distinct counts of anything are
not summable. Example: I cannot add yesterday's distinct user counts to today's
distinct user counts and get a correct answer; a recalculation of distinct user
counts across the two days is required. Our number of distinct users visiting the website per day went into the
hundreds of millions. Producing distinct user counts in the cube with the
dimensionality we required was simply not possible with any reasonable
performance; it would just render the cube unusable. So we used our Hadoop-like grid processing environment
to preprocess the distinct user counts. This was easy enough, but it meant we
had to know in advance exactly what combination of dimensions and what grain each
report required. This limited the reporting to select combinations of
dimensions, and if business needed another combination not currently supported,
a development cycle was required. The distinct user count reports were also not
in the cube but in a separate reporting system. We could have used drillthrough,
but that was not a very user-friendly option. (Both points are illustrated in
the small sketch below.)
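Both points above, that distinct counts don't add up across days and that every dimension combination has to be precomputed from the raw user ids, are easy to see in a small sketch. The data here is made up purely for illustration.

```python
# Distinct users per day, keyed by a (day, country) dimension combination.
monday  = {("mon", "US"): {"u1", "u2"}, ("mon", "CA"): {"u3"}}
tuesday = {("tue", "US"): {"u2", "u4"}, ("tue", "CA"): {"u3"}}

# Summing daily distinct counts overstates the true two-day distinct count,
# because u2 and u3 visited on both days; a correct answer needs a recount.
summed = sum(len(users) for users in monday.values()) + \
         sum(len(users) for users in tuesday.values())              # 6
recounted = len(set().union(*monday.values(), *tuesday.values()))   # 4

# And each precomputed combination only answers the question it was built
# for: a two-day count for the US requires its own pass over the user ids.
us_two_day = len(monday[("mon", "US")] | tuesday[("tue", "US")])     # 3

print(summed, recounted, us_two_day)  # 6 4 3
```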
All of these constraints and challenges caused a lot
of frustration and made it hard to keep up with the speed of business. But on the positive side, the business learned to ask for exactly what it needed rather than a large list of wants.