I was assigned the task of providing tuning recommendations to improve the price-performance of under-performing TPCx-BB queries when scaling from an 8-node cluster to a 21-node cluster and vice versa. The first stage was to identify why these queries behaved this way.
Hence I used the Intel PAT tool (https://github.com/asonje/PAT) to collect cluster node telemetry data, which also gives high-level information on which hot modules of the stack deployment consume the most system resources. I then used Jupyter notebooks to plot visualizations of the results obtained for each performance use-case experiment run on the big data cluster. The idea was to correlate resource-usage metrics with Hadoop counter data and then with the corresponding tuning parameters used. This turned out to be a valuable way to understand Hadoop counter and cluster resource-usage patterns for a given set of tuning parameters. I could not think of any other big data tool better suited than Jupyter notebooks for fine-grained analysis (please let me know if you know of one, maybe Zeppelin?). The scale factors of the big data cluster ranged from about 1 TB to 30 TB for different numbers of datanodes. I cannot discuss the nitty-gritty of the technical cohort and cluster hardware stack configuration details, such as processor SKU type/model, memory DIMMs, HDDs/SSDs, or the specific Cloudera tunes used, as this work is still in progress (optimizing the machine learning models) and some patents have been filed.
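The correlation step above can be sketched as follows. This is a minimal illustration, assuming telemetry and Hadoop counters have been exported to tabular form per query; the column names and values are hypothetical, not PAT's actual schema or measured results:

```python
import pandas as pd

# Hypothetical per-query telemetry exported from a PAT run
# (columns and numbers are illustrative only).
telemetry = pd.DataFrame({
    "query": ["q05", "q16", "q28"],
    "cpu_util_pct": [62.0, 91.5, 48.3],
    "disk_mb_s": [310.0, 120.0, 540.0],
})

# Hypothetical Hadoop counter export for the same queries.
counters = pd.DataFrame({
    "query": ["q05", "q16", "q28"],
    "spilled_records": [1.2e6, 9.8e6, 0.4e6],
    "response_time_s": [410.0, 980.0, 290.0],
})

# Join on query id, then compute pairwise correlations to spot which
# metrics and counters track response time most closely.
merged = telemetry.merge(counters, on="query")
corr = merged.drop(columns="query").corr()["response_time_s"]
print(corr.sort_values(ascending=False))
```

In a notebook, the resulting correlation series (or a heatmap of the full matrix) makes it easy to eyeball which counters move with the response-time KPI before feeding them to a model.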
I initially conceptualized the idea on a simple cluster to convince stakeholders of the potential of recommending optimal configurations, as shown in the conceptual diagram below.
A lot of effort initially went into defining the Hadoop/Spark cluster sizing, assuming the already-defined Cloudera best practices for each service of the big data deployment for the benchmark.
The process involved analyzing why query response times degraded even though resources were available, and understanding data distribution across the cluster nodes with different data formats (Avro, Parquet, CSV, JSON). Since this required a huge amount of manual work executing different scripts and analyzing data, I designed and developed a Jenkins framework that reads key-value pairs of Spark and Hive tuning parameters, with a predefined range of values and tuning options, from a dataset whose input is the tuning parameters and whose output is the benchmark response-time metric. These tuning parameters are pushed to the CDH cluster deployment by the Jenkins data pipeline through the Cloudera REST API and configured across the big data components. An example of this dataset is shown in the figure below; it contained about 2000 tuning experiments, one per row, which the Jenkins pipeline reads before executing the required benchmark queries, dumping the results into CSV files, and running machine learning models over the history of results obtained. Sub-datasets of the results included resource utilizations and some of the big data software counter information, used to find relationships between performance KPIs and the counters, helping the model make better tuning recommendations. This is valuable input for RFIs and RFPs, or RA/RC best practices.
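The per-experiment update step can be sketched roughly as below: read one row of the experiment dataset and turn it into the JSON body that a Cloudera Manager service-config update expects. The parameter names, values, and CSV layout are illustrative, not the actual dataset; the real pipeline would then PUT this payload to the Cloudera Manager REST API and trigger the benchmark run:

```python
import csv
import io
import json

# One row per tuning experiment; each column is a Spark/Hive parameter.
# (Illustrative parameters and values, not the real dataset.)
dataset_csv = """experiment_id,spark.executor.memory,spark.executor.cores,hive.exec.parallel
1,8g,4,true
2,16g,8,false
"""

def row_to_cm_payload(row):
    """Convert one experiment row into the JSON body used by a
    Cloudera Manager service-config update
    (PUT /api/v<N>/clusters/<cluster>/services/<service>/config)."""
    items = [
        {"name": key, "value": value}
        for key, value in row.items()
        if key != "experiment_id"
    ]
    return {"items": items}

rows = list(csv.DictReader(io.StringIO(dataset_csv)))
payload = row_to_cm_payload(rows[0])
print(json.dumps(payload, indent=2))
```

Driving this from a Jenkins stage is then just a loop over the rows: apply the config, restart the affected services, run the queries, and append the measured response time to the results CSV.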
Dataset – Input to framework:
Data pipeline Framework:
Data Pipeline in Action:
With massive amounts of raw data growing every day, data analytics platforms, including parallel and distributed data processing systems like Hadoop MapReduce and Spark, have emerged to address the Big Data challenges of collecting, processing, and analyzing huge volumes of data. Achieving system performance at such a scale is directly linked to the capacity of the systems executing these tasks. Because of the exponential growth in data volume, more resources are required in terms of processors, memory, and disks to meet customers' SLAs (latency, query execution time). Every extra node or piece of hardware required to meet the SLAs also increases costs, due to the larger hardware footprint and, for example with a Cloudera-based Big Data solution stack, licensing. This is why capacity planning (resource demands) for Big Data workloads is so important: it reduces TCO while still meeting SLAs at a competitive price. For customers who require service guarantees, the performance question to answer is the following: given a Big Data workload W with input dataset D, what cluster size, in terms of compute, memory, and disks, should be allocated so that the workload finishes within deadline (SLA) T? Further, given the time spent arriving at an optimal configuration for varied customer use cases and enormous year-on-year (YoY) data growth expectations, choosing the right workload and configuration to meet use-case-specific resource demands is hard, and the resulting costs can amount to nearly millions of dollars.
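To make the sizing question concrete, here is a deliberately naive back-of-the-envelope sketch. It assumes runtime grows linearly with data volume and shrinks with node count at a fixed parallel efficiency; all numbers are illustrative, and the actual approach in this work replaces such hand-waving with a model learned from measured benchmark runs:

```python
import math

def estimate_nodes(baseline_nodes, baseline_runtime_s, data_scale,
                   deadline_s, scaling_efficiency=0.8):
    """Naive node-count estimate for meeting a deadline T.

    Assumes runtime scales linearly with data volume and that adding
    nodes yields speedup = efficiency * (nodes / baseline_nodes).
    Purely illustrative; real workloads rarely scale this cleanly.
    """
    # Projected runtime on the baseline cluster at the new data scale.
    projected = baseline_runtime_s * data_scale
    # Smallest node count such that projected / speedup <= deadline.
    nodes = baseline_nodes * projected / (deadline_s * scaling_efficiency)
    return max(baseline_nodes, math.ceil(nodes))

# e.g. an 8-node baseline finishing in 1 h at 1 TB: how many nodes
# for 3x the data under a 2 h SLA?
print(estimate_nodes(8, 3600, 3.0, 7200))  # -> 15
```

The gap between this kind of linear estimate and what clusters actually do under skewed data, shuffle pressure, and service overheads is exactly why a data-driven predictive model is worth building.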
To right-size the infrastructure, we used the industry-standard TPCx-BB Express Benchmark, which is designed to measure the performance of Big Data analytics systems. TPCx-BB has been used extensively to benchmark our company's servers when competing on Intel's/AMD's performance leaderboards, creating marketing collateral, and bidding on customers' Requests for Proposals, and it has significantly influenced various customer bids. The benchmark contains 30 use cases (queries) that simulate big data processing, storage, analytics, and reporting, providing enough scalability to address the challenges of scaling both data size (volume) and cluster size (nodes). Our company's large server portfolio, supporting an as-a-service model, will benefit greatly from sizing guidance based on a predictive analytical model that captures the customer's workload across growing data volumes and can produce an offer that meets the customer's SLAs with minimum TCO. Hence we require a method that addresses capacity planning for customer-specific Big Data requirements across this portfolio.
We propose a framework we have developed that helps gather data relevant to customer-specific data growth requirements and predict resource demands for those specific customer use cases. Our solution involves three phases: "Requirements collection", "Identification and simulated data collection for Big Data customer requirements by domain expertise", and "Predictive analytical model to predict optimal configuration recommendations". Figure 1 depicts the workflow architecture with its prime offerings.
The requirements collection phase accepts user inputs in JSON format as key-value pairs. Essentially, the inputs detail customer requirements spanning workload resource characteristics (CPU/memory/disk/network), data growth size, current data scale factors, deployment details for batch- and stream-based Big Data components such as HDFS, Spark, Hive, Kafka, and YARN, and performance SLA metrics.
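A requirements payload along those lines might look like the following. This is an illustrative shape only, not the framework's exact schema; the field names are hypothetical:

```python
import json

# Illustrative requirements document (hypothetical field names).
requirements = """
{
  "workload_profile": {"cpu": "high", "mem": "medium", "disk": "high", "network": "low"},
  "current_scale_factor_tb": 1,
  "projected_data_growth_tb": 30,
  "components": ["HDFS", "Spark", "Hive", "Kafka", "YARN"],
  "sla": {"metric": "query_response_time_s", "target": 900}
}
"""

parsed = json.loads(requirements)
print(parsed["sla"]["target"])  # -> 900
```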
The next phase uses these inputs to generate randomized subsets of cluster configuration parameters at uniform data-growth steps for the simulated Big Data component parameters, and it measures the performance of key identified parameters at different workload levels. System telemetry data capturing usage patterns, Big Data application-level performance metrics, and system-level counters are collected during this process and in turn fed as input to the predictive modeling phase.
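Generating those randomized configuration subsets can be sketched as random sampling from the cross product of per-parameter value ranges, without ever enumerating the full product. The parameter names and ranges here are illustrative stand-ins; the real ranges come from the requirements phase and best-practice bounds:

```python
import random

# Illustrative parameter ranges (not the actual tuned set).
param_space = {
    "spark.executor.memory": ["4g", "8g", "16g"],
    "spark.sql.shuffle.partitions": [200, 400, 800],
    "hive.exec.reducers.bytes.per.reducer": [64 << 20, 256 << 20],
}

def sample_configs(space, n, seed=0):
    """Draw n random configurations, one value per parameter, uniformly
    from the cross product of the given ranges."""
    rng = random.Random(seed)  # seeded for reproducible experiment sets
    return [
        {key: rng.choice(values) for key, values in space.items()}
        for _ in range(n)
    ]

for cfg in sample_configs(param_space, 3):
    print(cfg)
```

Each sampled configuration becomes one experiment row for the pipeline, so the sample size directly controls how much of the search space a benchmark campaign covers.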
Predictive modeling identifies the significant counters responsible for performance variations and the corresponding usage patterns for different data-growth scale factors. It builds a historical knowledge base (KB) of these performance-delta key-value mappings for each classified Big Data component and cluster configuration parameter. This KB is used to build a machine learning model, which learns adaptively as different Big Data components and use cases are trained. The model essentially employs gradient-boosted decision trees to derive optimal configurations for each Big Data component classification and use case. It improves as we add more Big Data deployments and customer use cases, which we do by providing the framework as a tool to different teams evaluating Big Data solution stacks. Hence the framework is offered as a service to build the knowledge base from different data sources across our company's large server portfolio supporting an as-a-service model. As an end product, it is also provided to customers to suggest optimal configuration recommendations, with further reinforcement learning from those service engagements.
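As a minimal sketch of the gradient-boosted-trees step, the snippet below fits scikit-learn's `GradientBoostingRegressor` on synthetic stand-in data (three made-up features, a toy response-time target) and reads off feature importances, which is one way such a model surfaces the tunes that matter. The data and feature names are fabricated for illustration, not results from the actual KB:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the knowledge base: rows are tuning
# experiments (executor memory GB, shuffle partitions, data scale TB),
# target is benchmark response time in seconds. Illustrative only.
rng = np.random.default_rng(0)
X = rng.uniform([4, 200, 1], [64, 2000, 30], size=(200, 3))
# Toy ground truth: more memory helps, more data hurts.
y = 600 - 3 * X[:, 0] + 20 * X[:, 2] + rng.normal(0, 5, 200)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Importances hint at which parameters drive the response-time KPI.
print(dict(zip(
    ["executor_mem_gb", "shuffle_partitions", "data_scale_tb"],
    model.feature_importances_.round(3),
)))
```

On real KB data one would of course hold out test experiments and validate predictions before trusting the importances or the recommended configurations.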
As a next step, we are also considering building a similar knowledge base of tuning-parameter history for our company's large server portfolio, with cost optimizations via a DevOps approach: automated dynamic sizing of cluster configurations, plus hardware and software parameter tuning driven by a real-time workload classifier for each Big Data deployment/solution stack.
Thanks to my colleagues Mukund Kumar and Vishal Daga, who helped integrate some of the Python scripts into the above automation.
Impact of above work:
- Patent filed
- Aggregation of results from the above tool into our company's solution-sizing tool for Big Data, which is widely used for capacity planning of customer deployments with a BOM (Bill of Materials) for different big data component stacks, helping win RFPs and close deals faster with optimal TCO.
Model results identifying the important tunes we can use to reduce resource footprints: