Big Data Solution – Teaching platform

A couple of months back my elder brother (48+ years old) asked me a question that prompted me to write this blog after researching real-world implementations of a Big Data solution for a teaching platform.

His question was simple: it takes him ages to send the ~2 GB video files he records to teach his students. How can he manage his content (PDF documents, video recordings), and get an interactive platform that records what he writes on the board and automatically turns it into material his students can refer to after class, anytime, for later study?

Hence, I have come up with the Big Data solution below, to be built into a viable product that any tutoring group can make use of. Please feel free to augment or advise on the design and implementation details – a reference architecture for a real-world implementation from scratch, comparing deployment on Google Anthos (a hybrid solution combined with on-prem solution providers) and GCP (public cloud). I also compare the hybrid and public cloud solutions for this teaching domain.

High Level Design:

1) Know your customer – Business goal ?

2) Small, medium or large customer? – Define Technologies for scalability

3) Define Business model – how are you going to make money?

     – Single user money – discounts

     – Subscription based – value for 1 yr contract with lower monthly pay – rewards to consumers (certificates)

4) Advertising & Sponsorship – provide data of consumer/client so that customer sponsors

     – Marketing channels – emails, FB, youtube, and google Ads etc.

     – Customer management and automation – First impression

     – Transactional emails help build trust and help improve client’s side revenue

5) Value based subscription model (Attractive content costs money)

6) Content Assessment and quality

7) Identify the platform – Create a marketplace for customer

8) Personalization for customer

9) Start development – Minimum viable product (MVP) – core features only [POC]

    – Validate your idea asap without much of spending on designs, features etc [Product Design workshops – Design thinking]

    – Get Feedback – future development

    – User acceptance – Plan future budget

    – Estimate for development

10) Choosing the best Techstack

     – Purpose: are you building a prototype or enterprise-level (resilient) software?

     – Use case: e.g. developing an app which relies on ML or data science

     – Architecture: Monolith (one app – many sub functionalities) / Micro-services (Services for different functionalities)

     – Popularity :  Identify Tech & SME’s (specialists) in that field who can help

        Establish requirements, build solution and measure – repeat

       – Backend : Python, Django, Go , PHP https://www.merixstudio.com/blog/backend-development/

       – Frontend : https://www.merixstudio.com/work/u-project/

       – Mobile : 

            cross-platform mobile development solution, using frameworks like React Native or Flutter. 

            https://www.merixstudio.com/work/sportshi/    &  https://www.merixstudio.com/work/ginny/

11) Hire the best people – product designers & project managers

         – Outsource if necessary

12) Core features (basic ones) of the product

         a) Search tool : Category & filtering

         b) Recommendation : Machine learning

         c) Dashboard : Analytics

         d) Course Page and Reviews

         e)  Reviews & Ratings

         f)  Payment systems : local and global payment methods

         g) Customized Notifications

         h) Admin Panel:  For managing content and user accounts , generate statistics

Etc

Low Level Design :

1) Understand customer – customer growth over years – scalability – design architecture – proper technologies & tools

2) Budget : CAPEX, OPEX

3) Resources –   skilled  workers

4) Timeline – deadlines

5) Security of the data

6) Data ingestion : Lecture Videos from Mobile Laptops (Teachers)  + Documents (word, pdfs) etc

7) Data processing: Classification of videos, documents, pdfs etc into topics

8) Data Consumption: End users (students) , subscription to topics contents.

9) 2 Services defined for Recorded replay, shared links of documents

10) 2 Services defined for Users (students – consumers) and (Teachers – producers)

11) 3 Services defined for Announcements, community interaction and Q&As.

12) Storage – Data Lake

13) 2 Load balancers, 2 gateways, 1 DNS, 1 DHCP, 1 Active directory security, 8 VMs for different services + 2 VMs for security storage and backup recovery.  Total : 17 VMs

14) Provision for data growth yearly and users growth (20%)

15) Big Data tools: Kafka, Spark, Hive and Redis, Cassandra.

Account for the compression ratio, intermediate data, ~70% data consumption, and quarterly/yearly growth in data and users (a rough sizing sketch is given below).
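To make the sizing notes above concrete, here is a rough, illustrative estimate in Python; every input number below (daily ingest, compression ratio, replication, growth) is an assumption for the example, not a measured value.

daily_raw_gb = 50          # assumed daily ingest: lecture videos + documents
compression_ratio = 0.7    # assumed stored/raw size after compression
replication = 3            # assumed replication factor in the data lake
yearly_growth = 0.20       # 20% yearly growth in data and users (point 14 above)
intermediate_frac = 0.25   # assumed scratch space for intermediate/processing data

def storage_needed_tb(years):
    """Rough provisioned-storage estimate after a number of years of growth."""
    raw_tb_year1 = daily_raw_gb * 365 / 1024.0
    raw_total = sum(raw_tb_year1 * (1 + yearly_growth) ** y for y in range(years))
    stored = raw_total * compression_ratio * replication
    return stored * (1 + intermediate_frac)

for y in (1, 2, 3):
    print("Year %d: provision ~%.1f TB" % (y, storage_needed_tb(y)))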

Micro services defined for each independent service module :

16) Payment system : Gateway server and secure networking for payments

17) Reviews management system : Content server and associated Frontend-Backend service

18) Recommendation system : Service defined for recommendation – popular and personalized using Machine learning.

Brother’s use case : Small customer –

A service module (say Serv_gd) keeps his videos in Google Drive, syncing from his mobile device. All the content – the PDFs prescribed during his lectures and the video/audio files of the classes – gets its own service, made highly available and spread across regions, data centers and racks. For this small use case caching servers may not be necessary, but the services must be kept consistent by syncing data across this granular architecture.

The frontend and backend are architected to support the above use case. Services may have to be defined for replaying content, or CDNs used, to get the high availability, consistency and scalability that are critical for the business.
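As a minimal sketch of the Serv_gd idea (assuming a service account, the google-api-python-client library, and a hypothetical folder ID and file name; this is not the full service), a resumable, chunked upload copes well with the ~2 GB lecture videos over slow links:

from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload
from google.oauth2 import service_account

SCOPES = ["https://www.googleapis.com/auth/drive.file"]
creds = service_account.Credentials.from_service_account_file("serv_gd_key.json", scopes=SCOPES)
drive = build("drive", "v3", credentials=creds)

media = MediaFileUpload("lecture_2021_05_01.mp4",
                        mimetype="video/mp4",
                        resumable=True,                 # chunked upload survives flaky links
                        chunksize=8 * 1024 * 1024)
request = drive.files().create(
    body={"name": "lecture_2021_05_01.mp4", "parents": ["<drive-folder-id>"]},
    media_body=media,
    fields="id")

response = None
while response is None:                                 # upload one chunk at a time
    status, response = request.next_chunk()
    if status:
        print("Uploaded %d%%" % int(status.progress() * 100))
print("File ID: %s" % response["id"])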

Resilience – 10 Days

Day1- Growth Mindset:

Acknowledgement source by : https://www.youtube.com/watch?v=bL1m5nM6JnU

Find someone who is in a situation where they doubt they can be successful. Now write them a letter telling them how you know they can succeed, and give them conviction with evidence you have seen in their past.

Letter to Ashok anna:

Dear Ashok Anna,

I know you have a lot of financial troubles now, with no family member supporting you. I have seen you in the past working hard, getting up at 5:30 am and making idly at a small hotel in Bangalore near our Basavaraj home, doing odd jobs in Big Bazaar, trying to make a living in Bangalore. Although you failed at them, you accepted your capabilities and moved to BDVT to develop our father's lands. I have seen you spend so many years trying to understand how our land's arecanut trees that turn yellow can be made green again: when trenches are made, moisture is distributed through the land and the trees are kept from turning yellow. With the experience you have gained, and taking advice from the farmer community and friends in Masarahalli, you have learnt how this problem can be solved. Today, because of this, the farm land is giving better yields.

Therefore, I am sure the financial situation you are in now will also pass and you will gain new perspectives to become a better farmer. I would suggest walking every day for about 45 minutes and doing Kapala Bhati, Anulom Vilom and Bhastrika pranayama, each from 2 to 5 minutes, gradually increasing the time for each practice. This will not only reduce the anxiety you are having at the moment but also give you more awareness of the situation and new perspectives to overcome it.

Your loving brother,

Harsha

So how can you help yourself? Write a letter like this to yourself now, to overcome your problems and to say good things to yourself. One cannot build resilience by running away from problems, but by facing them and overcoming them with acceptance, telling yourself what you can do about them in the future by learning what went wrong and why. You cannot change the past, but you can certainly change the future.

Day 2 – Reflection

Acknowledgement source by https://www.youtube.com/watch?v=mZFkHXqkgtM [Gemma Leigh Roberts – Psychologist]

When you are faced with a big challenge or a difficult situation, how do you feel about it? Today's aim is to accept that situation and reflect on what went wrong and what was or was not in your control. Try to understand the perspectives you held at that time and think about how differently you could have handled it. It is okay to be retrospective about that situation and to accept it. Now think about the tough situations you are in and plan the actions to be taken, rather than jumping straight in to finish them.

Algorithm:

Big challenge –> Situation –> Accept –>Don’t Jump to solve it –> Rather Reflect –> what went wrong and right –> what was in control and Not –> think of perspectives of handling them in reflection –> register that in mind –> Plan the action items with different perspectives for the next big challenge.

My personal experience on this front :

Big Challenge:

I was successful in my previous job, having received rewards for the development support work with my stakeholders in the US, and I even got promoted for the research, analysis and inferences I made w.r.t. new JVM tunings and the evaluation of energy management algorithms for different server hardware configurations. That work changed an energy management algorithm's defaults in the company's server-class firmware and led to a patent opportunity (my data analysis showed memory workloads were not sensitive to DVFS; this was used to introduce a new hardware counter to monitor memory bandwidth and drop DVFS to lower power modes, since performance was not impacted), yielding huge savings in energy costs. All my leads and mentors helped me in the process.

Hence I was offered a Big Data project (back in 2013-2014), on my demand that I wanted something challenging. To this day I am glad I went out of my comfort zone then, but I faced tough challenges delivering results in the initial 3-6 months of the project. My status updates were almost the same throughout that period, as I was not a developer and yet was put into that role with no ramp-up time. I had to understand Maven, Jenkins, Java/JVM internals (proprietary to the company – a known variable) and how big data workloads are designed and stressed, and work out which experiments I needed to run to find bottlenecks in our hardware platforms to optimize for customer deployments. I was not an expert in hardware platform knowledge either – architecture, compute topology and network topology. I had to learn how to create Hadoop and Spark clusters on Ambari, work with a global team across Japan, China, the US and Brazil, and use micro-benchmark and macro-benchmark data analysis to make an impact on the software/hardware stack. There were too many unknown variables, and too few known ones from which to derive useful information, to take up this big challenge and deliver the project.

Reflection: what went wrong and What i felt at that moment ?

I was struggling to learn Maven and understand the Java code, but my senior in the US was too busy to lend a helping hand. I was alone in the project, with no one to discuss the JVM or Java code with, as the rest of the team was dealing with other projects in C/C++. I was struggling with Maven and the Java code, lacking knowledge of the JVM internals and the hardware knobs they affected. My seniors with a Java background had left the company or were too busy to be approachable, although I am in touch with them to this day. I wondered why I had chosen this path, and was very stressed, with sleepless nights and my stakeholders sitting on top of me for delivery asap. On a few days I felt like giving up and telling my manager I wanted to quit the project.

Reflection : what was in my control?

I could have mastered a few pieces of the project, like learning Java in, say, 2 weeks and understanding JVM internals quickly in the next 2 weeks (somewhat known variables), and reached out to mentors for help at the pain points. Jenkins was an easier piece to pick up, and that I did. I also had to multitask between two projects (Big Data and energy efficiency).

Reflection : what was not in my control?

Time, skills, knowledge, help from others, the vision of the project and end goals. The big picture.

Reflection : Perspectives to handle in retrospect?

I think extra help from any senior across the globe would have helped me a lot; extra effort from my side to fill the skill gaps, along with mentoring to check whether I was on the right track; better time management lessons from my manager; and better communication with stakeholders, being honest about what I had and how much I could deliver. I had not gained these skills in my previous job – which I had quit after facing similar situations.

Reflection: what went right?

During this time it was my manager who was of tremendous support and managed the stakeholders so that I would not be kicked off the project. Maybe he liked my determination not to quit no matter how many failures I saw. My motivation levels were high, as this was the Big Data project everyone else in the team wanted to be on, but I had got it thanks to my previous success.

My US lead was furious about the situation, but I started communicating with the US manager and lead about my capabilities and what I could deliver. I was taken off the developer role and put back into installation, configuration, tuning and deployment work. In the process I understood the software and hardware stack in depth and did my JVM analysis again, profiling the benchmarks at the disk, memory, CPU and network levels. The impact was that I could improve the performance of one of the key Big Data benchmarks by 2x compared to the Intel platform, as my lead and my own curiosity helped identify bottlenecks at the hardware level. Partitioning the data, tuning the JVM for GC, using optimized non-Java libraries, identifying disk IO as the bottleneck, and better utilization of logical cores (in all, data parallelism for a machine learning workload/application) helped me achieve this.

I presented my work for the first time in Seville, Spain, at a Linux Foundation event, with the help of my US lead – a solution architect from the US joined me to help organize things at the event, as it is a key marketing event for selling the platform to other vendors in the industry. I successfully delivered the talk, which earned me recognition among my peers and the larger US team. I started collaborating with the Brazil and Japan teams on optimizing the streaming benchmarks for Big Data on Spark. The Jenkins automation was also handed over to me, to lead and automate the benchmark version regression process for the different JEMT systems we used to certify for performance. After attending the event I realized the big picture of the work I was doing – providing Big Data collateral that helped the company sell servers, at per-core / per-thread density levels.

In hindsight, I should have continued to build my skills by developing some projects of interest in Big Data outside of work, and gained more of a network in the process.

I did plan that and executed it, to become a “Data Scientist” – I did Dataquest certifications and learnt Python, and today I am glad I spent so much time on it, spread over the long term. But I did not do anything outside of work, nor did I build a network with other interested people. I did, however, dare to understand Spark in depth, did Coursera courses, looked for MOOC material, and learnt RStudio and machine learning concepts with practical implementations in my next job – which is helping me now!

DevOps Automation worth more than a million!

The title may sound like an exaggeration, leading one's mind astray, but the technical work I am blogging about here is no misnomer.

In our enterprise Solution R&D division I was assigned a task to automate the execution of a workload on different SUTs (System Under Test); the workload has 4 metrics and takes nearly 2-3 days to complete. Seven team members scheduling these workloads on 15+ different SUTs almost every week, hand-holding each metric one after another with weekends in between, adds up to long delays in getting the end results – and what about the SLAs? This exercise repeats roughly every 3 months, 3 times a year, and it costs a high price when a client about to buy your servers worth millions (for their on-prem/colo/hybrid cloud environments) is not ready to wait that long, with requirements that are competitively addressed sometimes in the initial RFI (Request for Information) and at other times by negotiating over RFPs (Requests for Proposal) against other vendors too!

So, I developed a Jenkins framework which executes these workload metrics, scheduled on different SUTs, as and when requests from team members hit a queue that is drained every 5 minutes. It is just a wrapper over an already-automated Python package, which did not support a multi-tenant use case. This was challenging because it is a multi-tenant use case, with reports generated for each execution run that have to be tagged to end users, SUTs and metrics all together, and with debug logs required for each stage of the pipeline.

Objective and Goal of the Automation:

[Concept workflow diagram: Dev DataOps – automated performance-test scheduling on different SUTs (SUT = System Under Test)]

The flow chart I designed for this use case is shown below.

[Figure: Automation flow chart]

Applications:

1. Even a less experienced person can execute these applications, hence cost savings.

2. Sanity tests – ILO, SUT, OS , Network  etc

3. Regression tests: Early performance bugs detection across stacks

4. Optimization pipelines – RFI, RFP

5. In-house pipelines tools – python / java plugins and more….

Impact:

1. Helped to successfully close RFP deals worth 5 to 8+ million US dollars. (Recent example: an SSA customer bid worth 173 million won in 2 months' time, an 8-year contract.)

2. Faster turnaround time for RFP requests, and customer delight.

Although I am a Linux geek, I had to learn Windows batch scripting, as the automation framework executes the application, collects logs at the system level and dumps the results onto an NTFS share used across the team. The initial part was to get the user input defined as key-value pairs holding the SUT (System Under Test) IP, credentials, BIOS tunes, OS tunes etc., so that the Jenkins framework shown in the flow chart above can read these key-value pairs and apply the BIOS and OS tunes to the SUT, followed by the application execution, with the choice of application/metric made in a UI-style drop-down in the Jenkins framework itself.

The users edit their run parameters as a text file, copy it to the Input folder, and the Jenkins framework reads these files every 5 minutes and schedules the application execution on the different SUTs automatically. Once the runs complete, the results are dumped into user-specific folders with tuning-comment strings that identify which user executed which application on a given SUT with which tuning applied. This history of tunes and application results can be used to build machine learning models and drive recommendations for RFP/RFI requests. Errors during execution are currently not handled automatically, but they are reported through Slack from the Jenkins workflow.
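For illustration only (the real scripts are in the GitHub repository mentioned at the end of this post; the share path, file layout and key names below are assumptions), the scheduling step boils down to polling the Input folder and parsing each key-value file into a run request:

import glob
import os

INPUT_DIR = r"\\perf-share\Input"   # hypothetical NTFS share, polled every 5 minutes

def parse_request(path):
    """Read one test*.dat file of 'key=value' lines into a dict (SUT IP, credentials, tunes...)."""
    request = {}
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                request[key.strip()] = value.strip()
    return request

for dat_file in glob.glob(os.path.join(INPUT_DIR, "test*.dat")):
    run_request = parse_request(dat_file)
    # Hand the request to the pipeline stage that applies BIOS/OS tunes and launches
    # the chosen metric on the requested SUT (not shown here).
    print("Queued %s for SUT %s" % (dat_file, run_request.get("sut_ip", "?")))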

Some screenshots of the automation are shown below:

[Screenshot: Input folder with test*.dat files containing key-value pairs]
[Screenshot: Results of various systems and applications such as intspeed, fpspeed, fprate etc. under the jadmin user]
[Screenshot: The pipeline can also be launched manually with a choice of different systems]
[Screenshot: The Jenkins pipeline used for the above]

Since the SUTs on which the application is launched had to be on an isolated network (for the performance tests), the Jenkins agents were launched with the JVM parameters shown below, so that the agents have no impact on the performance results.

[Screenshot: JVM options used to launch the Jenkins agents on the isolated network]
[Screenshot: Different users controlled through matrix-based authentication in Jenkins]
[Screenshot: Matrix-based user authentication]
[Screenshot: Choices of the different applications]
[Screenshot: Slack notification plugin in Jenkins reporting test workflow results]

The rest of the Python, Windows batch and Jenkins Groovy scripts are at the GitHub link.

Due to paucity of time and some company policy constraints, I am unable to blog in more detail, but reach out to me for a demo or details of the workflow and the Python/Groovy scripts.

What are BigData and distributed file systems (e.g. HDFS)?

A) Unix Command Line Interface (CLI)
B) Distributed File Systems (DFS), HDFS (Hadoop DFS) Architecture and Scalability Problems
C) Tuning Distributed Storage Platform with File Types

D) Docker installation guide:
To make the learning process more convenient, our team has built Docker containers for each block of assignments. These containers are all you need to install to pass all our courses. With these containers deployed you can work with the same Jupyter notebooks as in Coursera's interface and be independent of internet connection problems or any other issues.

To use our containers, with Jupyter notebooks and the appropriate Hadoop services installed, you should do the steps below. First, install the latest version of Docker. Then deploy the necessary containers from our repository on DockerHub. Finally, if you want, you can use a container from a terminal; this part of the guide is optional, but it is a good choice if you are interested in learning how a Hadoop cluster works and not only in practising the APIs of the Hadoop services.

I. Installing Docker
To install Docker and use it with our containers for this course, you should have a machine meeting the following requirements.

The minimal requirements for running notebooks on your computer are:

2-core processor
4 GB of RAM
20 GB of free disk space
The recommended requirements are:

4-core processor
8GB of RAM
50 GB of free disk space

Installing Docker on Unix
To install Docker on any of the most popular Linux operating systems (Ubuntu, Debian, CentOS, Fedora, etc.) you should run the following command with sudo privileges:
>curl -sSL https://get.docker.com/ | sh

Then optionally you can add your user account to the docker group:
>usermod -aG docker $USER

This allows you to deploy containers without sudo but we don’t recommend doing this on CentOS for security reasons.

Checking the installation
Now run the test container to check that the installation was successful:
>docker run hello-world

You will see some Docker deployment logs on the screen, followed by a “Hello from Docker!” message from the successfully deployed container.

II. Deploying containers
You can find all of them in our DockerHub repository at the following link. The appropriate container is specified at the bottom of the self-reading for each assignment, e.g. “If you want to deploy the environment on your own machine, please use the bigdatateam/spark-course1 Docker container.”

To run a Jupyter notebook on your own machine you should do the following steps.

1. Pull the actual version of the container from DockerHub
>docker pull <container-name>

2. Deploy the container
>docker run --rm -it -p <local-port>:8888 <container-name>

3. Open Jupyter environment

Now Jupyter can be opened in a browser by typing localhost:<local-port>. If you're working on Windows, you should open <docker-machine-ip>:<local-port> instead of localhost. You can find it in the deployment logs of the container.

So now you have a full environment on your own machine, with Hadoop services installed and configured, prepared datasets and demo code.

III. Working with docker through a terminal
To start the container and work with it via Unix terminal you should do the following steps.
1. Start a Tmux session:
>tmux new -s my_docker

2. Run your container in terminal opened:
>docker run --rm -it -p 8888:8888 -p 50070:50070 -p 8088:8088 bigdatateam/hdfs-notebook

(please, take into account that you can forward some additional ports with Hadoop UIs, not only Jupyter’s port).

3. Detach the Tmux session by typing “Ctrl+b” and then “d”.

4. Check your container’s id:
>docker ps

In the output of the command you should find something like this “da1d48ac25fc”. This is your container’s id.

5. Finally, you can open the terminal within the container by executing:
>docker exec -it <container-id> /bin/bash

6. Now you’ve logged in the container as root and you can execute Hadoop shell commands.
>root@da1d48ac25fc:~# hdfs dfs -ls /data
Found 1 items
drwxrwxrwx – jovyan supergroup 0 2017-10-15 16:30 /data/wiki
root@da1d48ac25fc:~#

Simultaneously, you can work with Jupyter via browser.

To stop the docker container you should do the following steps.
1. Attach the Tmux session.
>tmux a -t my_docker
(if you have forgotten the name of Tmux session, you can type ‘tmux ls’)

2. Stop the container via Ctrl+C.

3. Exit from the Tmux session.

We recommend learning tmux; it has many interesting features (e.g., you can start from this guide). However, these three commands will be enough to work with our Docker containers:
>tmux new -s <session-name>
tmux a -t <session-name>
tmux ls

Name node architecture:
The discussion of WAL and NFS goes beyond the scope of this course. If you are not familiar with them, I suggest you start with the Wikipedia articles:
https://en.wikipedia.org/wiki/Write-ahead_logging
https://en.wikipedia.org/wiki/Network_File_System
WAL helps you to persist changes into the storage before applying them.
NFS helps you to overcome node crashes so you will be able to restore changes from a remote storage.

E) First assignment :
==============================================
!hdfs dfs -du -h /user/jovyan/assignment1/test.txt is not correct; you should use hdfs dfs -ls [filename] instead, because it gives more details about the file.
!hdfs dfs -chown is not correct; chown changes the owner of the file, not the permissions on it – use chmod instead.
For the last question, the used-space answer is wrong.

==============================================

Estimate minimum Namenode RAM size for HDFS with 1 PB capacity, block size 64 MB, average metadata size for each block is 300 B, replication factor is 3. Provide the formula for calculations and the result.

Ans:
1) Formula for minimum Namenode RAM size for HDFS = (size of original data) / (size of a single data block * replication factor) * (average metadata size of each block on the Namenode)

Minimum Namenode RAM size for the stated requirements = 1 PB / (64 MB * 3) * 300 B = (10^15 B) / (64 * 1000 * 1000 * 3) * 300 B = 1562500000 bytes ≈ 1.56 GB

Remember that the level of granularity on the Namenode is the block, not the replica. The number of unique blocks is (size of original data) / (size of a single data block * replication factor), say “x” blocks. Storing the metadata for one block takes about 300 bytes on the Namenode on average, hence for “x” blocks it is 300x bytes.

2) HDDs in your cluster have the following characteristics: average reading speed is 60 MB/s, seek time is 5 ms. You want to spend 0.5 % time for seeking the block, i.e. seek time should be 200 times less than the time to read the block. Estimate the minimum block size.

Ans:
Check the calculations and the result, they should both be correct.

block_size / 60 MB/s * 0.5 / 100 >= 5 ms

block_size >= 60 MB/s * 0.005 s / 0.005 = 60 MB

So, the minimum block size is 60 MB or 64 MB.
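For a quick sanity check of both estimates, here is the same arithmetic in plain Python (decimal units, as in the answers above):

PB = 10 ** 15
MB = 10 ** 6

# 1) Minimum Namenode RAM: 1 PB of capacity holds 1 PB / 3 of unique data (replication 3),
#    and the Namenode keeps ~300 B of metadata per unique 64 MB block.
blocks = PB / (64.0 * MB * 3)
print("Namenode RAM: %.2f GB" % (blocks * 300 / 1e9))      # ~1.56 GB

# 2) Minimum block size: the 5 ms seek should be at most 0.5% of the read time at 60 MB/s.
seek_s = 0.005       # 5 ms seek time
fraction = 0.005     # seek <= 0.5% of read time
min_block = seek_s / fraction * 60 * MB                    # one second of reading
print("Minimum block size: %d MB" % (min_block / MB))      # 60 MB, rounded up to 64 MB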

3) To complete this task use the ‘HDFS CLI Playground’ item.

Create text file ‘test.txt’ in a local fs. Use HDFS CLI to make the following operations:

create directory ‘assignment1’ in your home directory in HDFS (you can use a relative path or prescribe it explicitly “/user/jovyan/…”)
put test.txt in it
output the size and the owner of the file
revoke ‘read’ permission for ‘other users’
read the first 10 lines of the file
rename it to ‘test2.txt’.
Provide all the commands to HDFS CLI.

Ans:
https://linode.com/docs/tools-reference/tools/modify-file-permissions-with-chmod/

!touch test.txt
!for i in 1 2 3 4 5 6 7 8 9 0 10 ; do echo line_No_$i >> test.txt ; done
!hdfs dfs -mkdir -p /user/jovyan/assignment1
!hdfs dfs -put test.txt /user/jovyan/assignment1/
!hdfs dfs -du -h /user/jovyan/assignment1/test.txt
141 /user/jovyan/assignment1/test.txt
!hdfs dfs -chown -R r-ou /user/jovyan/assignment1/test.txt
!hdfs dfs -cat /user/jovyan/assignment1/test.txt | head -10
!hdfs dfs -mv /user/jovyan/assignment1/test.txt /user/jovyan/assignment1/test2.txt

4)

To complete this task use the ‘HDFS CLI Playground’ item.

Use HDFS CLI to investigate the file ‘/data/wiki/en_articles_part/articles-part’ in HDFS:

get blocks and their locations in HDFS for this file, show the command without an output
get the information about any block of the file, show the command and the block locations from the output

!hdfs fsck -blockId blk_1073741825

Connecting to namenode via http://localhost:50070/fsck?ugi=jovyan&blockId=blk_1073741825+&path=%2F
FSCK started by jovyan (auth:SIMPLE) from /127.0.0.1 at Sun Apr 15 10:40:54 UTC 2018

Block Id: blk_1073741825
Block belongs to: /data/wiki/en_articles_part/articles-part
No. of Expected Replica: 1
No. of live Replica: 1
No. of excess Replica: 0
No. of stale Replica: 0
No. of decommissioned Replica: 0
No. of decommissioning Replica: 0
No. of corrupted Replica: 0
Block replica on datanode/rack: eca2de932db3/default-rack is HEALTHY —-> shows the block location ie datanode/rack: eca2de932db3/default-rack

RIGHT ANSWERS BELOW:

Check the formula and the result, they should both be correct.

1 PB / 64 MB / 3 * 300 B = 1024 * 1024 * 1024 / 64 / 3 * 300 = 1600 MB

The result may be not exactly the same, rounding and other units are possible. So 1600 MB, 1.6 GB, 1.56 GB are all allowed.

=====

Check the calculations and the result, they should both be correct.

block_size / 60 MB/s * 0.5 / 100 >= 5 ms

block_size >= 60 MB/s * 0.005 s / 0.005 = 60 MB

So, the minimum block size is 60 MB or 64 MB.

=====

Check the commands, they should be like these:

$ hdfs dfs -mkdir assignment1

$ hdfs dfs -put test.txt assignment1/

$ hdfs dfs -ls assignment1/test.txt or hdfs dfs -stat "%b %u" assignment1/test.txt

$ hdfs dfs -chmod o-r assignment1/test.txt

$ hdfs dfs -cat assignment1/test.txt | head -10

$ hdfs dfs -mv assignment1/test.txt assignment1/test2.txt

There can be the following differences:

‘hdfs dfs’ and ‘hadoop fs’ are the same
absolute paths are also allowed: ‘/user/<your-login>/assignment1/test.txt’ instead of ‘assignment1/test.txt’
the permissions argument can be in an octal form, like 0640
the ‘text’ command can be used instead of ‘cat’
======

Blocks and locations of ‘/data/wiki/en_articles_part/articles-part’:
$ hdfs fsck /data/wiki/en_articles_part/articles-part -files -blocks -locations

Block information (block id may be different):
$ hdfs fsck -blockId blk_1073971670

It outputs the block locations, example (nodes list will be different):

Block replica on datanode/rack: some_node_hostname/default-rack is HEALTHY

=====

Total capacity: 2.14 TB

Used space: 242.12 GB (=DFS Used) or 242.12+35.51 = 277.63 GB (=DFS Used + Non DFS Used) – the latter answer is more precise, but the former is also possible

Data nodes in the cluster: 4

F)

1. The name of the notebook may contain only Roman letters, numbers and characters “-” or “_”

2. You have to send the worked out notebook (the required outputs in all cells should be filled).

3. While checking, the system runs the code from the notebook and reads the output of only the last filled cell; obviously, to get this output all the preceding cells must also contain working code. If you decide to write some text in a cell, you should change the style of the cell to Markdown (Cell -> Cell type -> Markdown).

4. Only the answer to your task should be printed to the output stream (stdout); there should be no other output in stdout. To get rid of junk lines, redirect the unwanted output to /dev/null.

Let’s look at this code snippet:

%%bash

OUT_DIR="wordcount_result_"$(date +"%s%6N")
NUM_REDUCERS=8

hdfs dfs -rm -r -skipTrash ${OUT_DIR}

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.job.name="Streaming wordCount" \
-D mapreduce.job.reduces=${NUM_REDUCERS} \
-files mapper.py,reducer.py \
-mapper "python mapper.py" \
-combiner "python reducer.py" \
-reducer "python reducer.py" \
-input /data/wiki/en_articles_part1 \
-output ${OUT_DIR}

hdfs dfs -cat ${OUT_DIR}/part-00000 | head

F.1)
The commands in lines (6) and (8) may output some junk lines to stdout, so you need to eliminate the output of these commands by redirecting it to /dev/null. Let's look at the fixed version of the snippet:

%%bash

OUT_DIR="wordcount_result_"$(date +"%s%6N")
NUM_REDUCERS=8

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.job.name="Streaming wordCount" \
-D mapreduce.job.reduces=${NUM_REDUCERS} \
-files mapper.py,reducer.py \
-mapper "python mapper.py" \
-combiner "python reducer.py" \
-reducer "python reducer.py" \
-input /data/wiki/en_articles_part1 \
-output ${OUT_DIR} > /dev/null

hdfs dfs -cat ${OUT_DIR}/part-00000 | head

Please take into account that you must not redirect stderr anywhere. Hadoop, Hive, and Spark print their logs to stderr, and the grading system also reads and analyses it.

5. One cell must contain only one programming language.

E.g. if a cell contains Python code and you also want to call a bash-command (using “!”) in it, you should move the bash to another cell.

6. When you submit the solution, the grading system converts it from an IPython notebook into a plain Python script for execution on the Hadoop cluster. When the standard converter (nbconvert) finds Jupyter magic commands, or anything else that cannot be found in plain Python, it generates a stub function “get_ipython()”, which does not work by default. Our IPython converter is an improved version of nbconvert and can process most magics correctly (e.g. it can convert “%%bash” into “subprocess.check_output()”). Despite this, we recommend avoiding magics where possible, because in the cases where our converter cannot process a magic correctly it falls back to nbconvert, and nbconvert generates the “get_ipython()” stubs.
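As a rough illustration of what that conversion amounts to (this is not the converter's actual output), a "%%bash" cell behaves roughly like a subprocess call in plain Python:

import subprocess

# A notebook cell such as:
#   %%bash
#   hdfs dfs -ls /data
# is roughly equivalent to:
print(subprocess.check_output("hdfs dfs -ls /data", shell=True))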

Week2:

Question 1. Select the type of failure when a server in a cluster goes out of service because its power supply has burnt out.

– Fail-Stop
– Fail-Recovery (should not be selected: it is not a Fail-Recovery because it requires somebody to fix the problem first and then turn the server back on)
– Byzantine failure

Question 2 (correct, 1/1 points). Select the type of failure when a server in a cluster reboots because of a momentary power failure.

– Fail-Stop
– Byzantine failure
– Fail-Recovery (correct: the server recovers after the failure by itself)

Question 3 (correct, 1/1 points). Select the type of failure when a server in a cluster outputs unexpected results of the calculations.

– Fail-Recovery
– Fail-Stop
– Byzantine failure (correct: it looks like the server doesn't behave according to the protocol)

Question 4 (correct, 1/1 points). Select the type of failure when some packets are lost regardless of the contents of the message.

– Fair-Loss Link (correct: when packets are lost regardless of their contents, this is exactly a Fair-Loss Link)
– Byzantine Link

Question 5 (correct, 1/1 points). Select the facts about a Byzantine Link:

– Some packets are modified (correct: in case of a Byzantine Link packets can be modified)
– Some packets are created out of nowhere (correct: in case of a Byzantine Link packets can be created)
– Some packets are lost regardless of the contents of the message (un-selected is correct)

Question 6 (correct, 1/1 points). Select a mechanism for capturing events in chronological order in a distributed system.

– Synchronize the clocks on all the servers in the system
– Use logical clocks, for example Lamport timestamps
– Both ways are possible; logical clocks hide inaccuracies of clock synchronization (correct: use logical clocks, but don't forget about clock synchronization)

Question 7 (correct, 1/1 points). Select the failure types specific to distributed computing, for example for Hadoop stack products.

– Fail-Stop, Perfect Link and Synchronous model
– Byzantine failure, Byzantine Link and Asynchronous model
– Fail-Recovery, Fair-Loss Link and Asynchronous model (correct: these types of failures are inherent in Hadoop)

Question 8 (correct, 1/1 points). Which phase in the MapReduce paradigm is better for aggregating records by key?

– Map
– Shuffle & sort
– Reduce (correct: the reducer gets records sorted by key, so they are suitable for aggregation)

Question 9 (incorrect, 0/1 points). Select the MapReduce phase whose input records are sorted by key:

– Map (should not be selected: mapper input records can be, and usually are, unsorted)
– Shuffle & sort
– Reduce

Question 10 (0.60/1 points). What map and reduce functions should be used (in terms of Unix utilities) to select only the unique input records?

– map='uniq', reduce=None (un-selected is correct)
– map='uniq', reduce='uniq' (this should be selected)
– map='cat', reduce='uniq' (correct: 'uniq' on the sorted input records gives the required result)
– map='sort -u', reduce='sort -u' (un-selected is correct)
– map='cat', reduce='sort -u' (should not be selected: the records have already been sorted before the Reduce phase, so using 'sort' here is overkill and can lead to an 'out of memory' error, though on small data the result will be correct)

Question 11 (correct, 1/1 points). What map and reduce functions should be used (in terms of Unix utilities) to select only the repeated input records?

– map='uniq', reduce=None (un-selected is correct)
– map='uniq', reduce='cat' (un-selected is correct)
– map='uniq -d', reduce='uniq -d' (un-selected is correct)
– map='cat', reduce='uniq -d' (correct: mappers pass all the records to reducers and then 'uniq -d' on the sorted records solves the task)

Question 12 (incorrect, 0/1 points). What service in Hadoop MapReduce v1 is responsible for running map and reduce tasks?

– TaskTracker
– JobTracker (should not be selected: JobTracker is responsible for jobs, whereas tasks are run by the TaskTracker)

Question 13 (correct, 1/1 points). In YARN, the Application Master is…

– A service which processes requests for cluster resources
– A process on a cluster node to run a YARN application, for example a MapReduce job (correct: the Application Master executes a YARN application on a node)
– A service to run and monitor containers for application-specific processes on cluster nodes

 

=================== WEEK3 ====================================

In this assignment Hadoop Streaming is used to process Wikipedia articles dump.

Dataset location: /data/wiki/en_articles

The stop words list is in the ‘/datasets/stop_words_en.txt’ file in the local filesystem.

Format: article_id <tab> article_text

To parse the articles, don't forget about Unicode (even though this is an English Wikipedia dump, there are many characters from other languages); remove punctuation marks and transform words to lowercase to get the correct counts. To cope with Unicode we recommend using the following tokenizer:


#!/usr/bin/env python

import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8')

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError as e:
        continue
    text = re.sub("^\W+|\W+$", "", text, flags=re.UNICODE)
    words = re.split("\W*\s+\W*", text, flags=re.UNICODE)

    # your code goes here

%%writefile mapper.py

import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8')  # required to convert to unicode

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError as e:
        continue
    words = re.split("\W*\s+\W*", text, flags=re.UNICODE)
    for word in words:
        print >> sys.stderr, "reporter:counter:Wiki stats,Total words,%d" % 1
        print "%s\t%d" % (word.lower(), 1)

%%writefile reducer.py

import sys

current_key = None
word_sum = 0

%%writefile -a reducer.py

for line in sys.stdin:
    try:
        key, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError as e:
        continue
    if current_key != key:
        if current_key:
            print "%s\t%d" % (current_key, word_sum)
        word_sum = 0
        current_key = key
    word_sum += count

if current_key:
    print "%s\t%d" % (current_key, word_sum)

! hdfs dfs -ls /data/wiki

Found 1 items
drwxrwxrwx – jovyan supergroup 0 2017-10-17 13:15 /data/wiki/en_articles_part

%%bash

OUT_DIR="wordcount_result_"$(date +"%s%6N")
NUM_REDUCERS=8

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.job.name="Streaming wordCount" \
-D mapreduce.job.reduces=${NUM_REDUCERS} \
-files mapper.py,reducer.py \
-mapper "python mapper.py" \
-combiner "python reducer.py" \
-reducer "python reducer.py" \
-input /data/wiki/en_articles_part \
-output ${OUT_DIR} > /dev/null

hdfs dfs -cat ${OUT_DIR}/part-00000 | head

0%however	1
0&\mathrm{if	1
0(8)320-1234	1
0)).(1	2
0,03	1
0,1,...,n	1
0,1,0	1
0,1,\dots,n	1
0,5	1
0,50	1
rm: `wordcount_result_1504600278440756': No such file or directory
17/09/05 12:31:23 INFO client.RMProxy: Connecting to ResourceManager at mipt-master.atp-fivt.org/93.175.29.106:8032
17/09/05 12:31:23 INFO client.RMProxy: Connecting to ResourceManager at mipt-master.atp-fivt.org/93.175.29.106:8032
17/09/05 12:31:24 INFO mapred.FileInputFormat: Total input paths to process : 1
17/09/05 12:31:24 INFO mapreduce.JobSubmitter: number of splits:3
17/09/05 12:31:24 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1503302131685_0379
17/09/05 12:31:25 INFO impl.YarnClientImpl: Submitted application application_1503302131685_0379
17/09/05 12:31:25 INFO mapreduce.Job: The url to track the job: http://mipt-master.atp-fivt.org:8088/proxy/application_1503302131685_0379/
17/09/05 12:31:25 INFO mapreduce.Job: Running job: job_1503302131685_0379
17/09/05 12:31:31 INFO mapreduce.Job: Job job_1503302131685_0379 running in uber mode : false
17/09/05 12:31:31 INFO mapreduce.Job:  map 0% reduce 0%
17/09/05 12:31:47 INFO mapreduce.Job:  map 33% reduce 0%
17/09/05 12:31:49 INFO mapreduce.Job:  map 55% reduce 0%
17/09/05 12:31:55 INFO mapreduce.Job:  map 67% reduce 0%
17/09/05 12:32:01 INFO mapreduce.Job:  map 78% reduce 0%
17/09/05 12:32:11 INFO mapreduce.Job:  map 100% reduce 0%
17/09/05 12:32:19 INFO mapreduce.Job:  map 100% reduce 100%
17/09/05 12:32:19 INFO mapreduce.Job: Job job_1503302131685_0379 completed successfully
17/09/05 12:32:19 INFO mapreduce.Job: Counters: 54
	File System Counters
		FILE: Number of bytes read=4851476
		FILE: Number of bytes written=11925190
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=76993447
		HDFS: Number of bytes written=5370513
		HDFS: Number of read operations=33
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=16
	Job Counters 
		Launched map tasks=3
		Launched reduce tasks=8
		Rack-local map tasks=3
		Total time spent by all maps in occupied slots (ms)=351040
		Total time spent by all reduces in occupied slots (ms)=231588
		Total time spent by all map tasks (ms)=87760
		Total time spent by all reduce tasks (ms)=38598
		Total vcore-milliseconds taken by all map tasks=87760
		Total vcore-milliseconds taken by all reduce tasks=38598
		Total megabyte-milliseconds taken by all map tasks=179732480
		Total megabyte-milliseconds taken by all reduce tasks=118573056
	Map-Reduce Framework
		Map input records=4100
		Map output records=11937375
		Map output bytes=97842436
		Map output materialized bytes=5627293
		Input split bytes=390
		Combine input records=11937375
		Combine output records=575818
		Reduce input groups=427175
		Reduce shuffle bytes=5627293
		Reduce input records=575818
		Reduce output records=427175
		Spilled Records=1151636
		Shuffled Maps =24
		Failed Shuffles=0
		Merged Map outputs=24
		GC time elapsed (ms)=1453
		CPU time spent (ms)=126530
		Physical memory (bytes) snapshot=4855193600
		Virtual memory (bytes) snapshot=32990945280
		Total committed heap usage (bytes)=7536115712
		Peak Map Physical memory (bytes)=906506240
		Peak Map Virtual memory (bytes)=2205544448
		Peak Reduce Physical memory (bytes)=281800704
		Peak Reduce Virtual memory (bytes)=3315249152
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	Wiki stats
		Total words=11937375
	File Input Format Counters 
		Bytes Read=76993057
	File Output Format Counters 
		Bytes Written=5370513
17/09/05 12:32:19 INFO streaming.StreamJob: Output directory: wordcount_result_1504600278440756
cat: Unable to write to output stream.

===== Testing (num. 700): STARTING =====
Executing notebook with kernel: python2

===== Testing (num. 700): SUMMARY =====
Tests passed: crs700_1 mrb17 sp7_1 res1 res2_1 res3 res6_1 res7_1
Tests failed: – – | – –
==================================================
Duration: 49.0 sec
VCoreSeconds: 178 sec

 

============================================================================

Hadoop Streaming assignment 1: Words Rating

============================================================================

Create your own WordCount program and process the Wikipedia dump. Use a second job to sort the words by count in reverse order (most popular first). Output format:

word <tab> count

The result is the 7th word by popularity and its quantity.

Hint: it is possible to use exactly one reducer in the second job to obtain a totally ordered result.
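For illustration only (this is not the graded solution; the file names mapper2.py/reducer2.py and the launch settings are assumptions), the second, sorting job can simply swap the two fields so that the framework itself sorts by count. Launched with a single reducer (-D mapreduce.job.reduces=1) and a numeric, reversed key comparator, the single output file comes out totally ordered with the most popular words first.

#!/usr/bin/env python
# mapper2.py (illustrative): turn "word<TAB>count" into "count<TAB>word"
# so that the shuffle phase sorts records by their count.
import sys

for line in sys.stdin:
    try:
        word, count = line.strip().split('\t', 1)
        int(count)                     # drop malformed records
    except ValueError:
        continue
    sys.stdout.write("%s\t%s\n" % (count, word))

#!/usr/bin/env python
# reducer2.py (illustrative): swap the fields back to "word<TAB>count";
# with exactly one reducer the output is totally ordered by count.
import sys

for line in sys.stdin:
    try:
        count, word = line.strip().split('\t', 1)
    except ValueError:
        continue
    sys.stdout.write("%s\t%s\n" % (word, count))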


 Helped:
=======================================================================
working with Hadoop streaming and python
=======================================================================

Hadoop streaming may give non-informative exceptions when mappers’ or reducers’ scripts work incorrectly:

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads():

subprocess failed with code 1

It is better to find most of the bugs in your code before running it on Hadoop (which, by the way, saves your time). You can debug mapper and reducer code locally (as simple Python scripts) before running it on Hadoop; to do this, emulate Hadoop's workflow using bash.

You already know that every Hadoop job has 3 stages:

  • Map,
  • Shuffle & sort,
  • Reduce.

Since you already have mapper’s and reducer’s code you can emulate Hadoop’s behavior using a simple bash:

cat my_data | python ./mapper.py | sort | python ./reducer.py

It helps you to understand if your code is working at all.

 

================ Another note on using hadoop streams and python=============

1. Logs of the task should be displayed in the standard error stream (stderr).

Do not redirect stderr to /dev/null because in this case the system won’t receive any information about running Jobs.

2. Output directories (including intermediates) should have randomly generated paths, for example:

OUT_DIR=”wordcount_result_”$(date +”%s%6N”)

3. All the letters in the names of Hadoop counters, except the first one, should be lowercase (the first letter can be either case).

4. Please, use relative HDFS-paths, i.e. dir1/file1 instead of /user/jovyan/dir1/file1. When you submit the code it will be executed on a real Hadoop cluster. And ‘jovyan’ user’s account may not exist in it.

5. In the Hadoop logs the stop-words counter should appear before the total-words counter. To achieve this, take into account that counters are printed in lexicographical order by default; for example:
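(assuming the same counter group as in the mapper shown earlier; the exact names are up to you)

import sys
# "Stop words" sorts before "Total words", so the stop-words counter is printed first.
sys.stderr.write("reporter:counter:Wiki stats,Stop words,1\n")
sys.stderr.write("reporter:counter:Wiki stats,Total words,1\n")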

 

==========END ====== Another note on using hadoop streams and python=========

SPARK ASSIGNMENT 

Week5

In this assignment you will use Spark to compute various statistics for word pairs. At the same time, you will learn some simple techniques of natural language processing.

Dataset location: /data/wiki/en_articles

Format: article_id <tab> article_text

While parsing the articles, do not forget about Unicode (even though this is an English Wikipedia dump, there are many characters from other languages), remove punctuation marks and transform words to lowercase to get the correct quantities. Here is a starting snippet:

#!/usr/bin/env python

from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf().setAppName("MyApp").setMaster("local[2]"))

import re

def parse_article(line):
    try:
        article_id, text = unicode(line.rstrip()).split('\t', 1)
    except ValueError as e:
        return []
    text = re.sub("^\W+|\W+$", "", text, flags=re.UNICODE)
    words = re.split("\W*\s+\W*", text, flags=re.UNICODE)
    return words

wiki = sc.textFile("/data/wiki/en_articles_part1/articles-part", 16).map(parse_article)
result = wiki.take(1)[0]

You can use this code as a starter template. Now proceed to LTI assignments.

If you want to deploy the environment on your own machine, please use bigdatateam/spark-course1 Docker container.

for word in result[:50]:
    print word
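As an illustrative next step (not the graded solution), consecutive word pairs can be produced and counted from the parsed articles roughly like this, reusing the wiki RDD from the starter template:

from operator import add

def to_pairs(words):
    # consecutive (lowercased) word pairs from one article
    words = [w.lower() for w in words]
    return [((words[i], words[i + 1]), 1) for i in range(len(words) - 1)]

pair_counts = wiki.flatMap(to_pairs).reduceByKey(add)   # total count per word pair

for (w1, w2), n in pair_counts.take(10):
    print("%s %s %d" % (w1, w2, n))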

 

Spark QUIZ 1:

Question 1. What functions must a dataset implement in order to be an RDD?

– foo, bar, baz
– getSplits, getRecordReader

Question 2. Mark all the things that can be used as RDDs.

Question 3. Is it possible to access a MySQL database from within the predicate in the ‘filter’ transformation?

– No, it is not possible to use database handles in predicates at all.
– Yes, it is possible to create a database handle and capture it within the closure.

Question 4. True or false? Mark only the correct statements about the ‘filter’ transform.

Question 5. True or false? Mark only the incorrect statements.

Question 6. Mark all the transformations with wide dependencies. Try to do this without sneaking into the documentation.

Question 7. Imagine you would like to print your dataset on the display. Which code is correct (in Python)?

myRDD.foreach(print)

Question 8. Imagine you would like to count items in your dataset. Which code is correct (in Python)?

count = 0
myRDD.foreach(lambda x: count += 1)

myRDD.reduce(lambda x, y: x + y)

Question 9. Consider the following implementation of the ‘sample’ transformation:

class MyRDD(RDD):
    def my_super_sample(self, ratio):
        return this.filter(lambda x: random.random() < ratio)

Are there any issues with the implementation?

– No, it is completely valid implementation.
– Yes, it is written in Python and thus very slow.

Question 10. Consider the following action that updates a counter in a MySQL database:

def update_mysql_counter():
    handle = connect_to_mysql()
    handle.execute("UPDATE counter SET value = value + 1")
    handle.close()
myRDD.foreach(update_mysql_counter)

Are there any issues with the implementation?

– Yes, the action may produce incorrect results due to non-atomic increments.
– Yes, the action is inefficient; it would be better to use ‘foreachPartition’.

Spark QUIZ 2:

=================

Congratulations! You passed!
Question 1. What is a job?

– A unit of work performed by the executor.
– That is how Spark calls my application.
– A pipelineable part of the computation.
– A dependency graph for the RDDs.
– An activity you get paid for.

Question 2. What is a task?

– An activity you get paid for.
– A pipelineable part of the computation.
– That is how Spark calls my application.
– An activity spawned in response to a Spark action.
– A dependency graph for the RDDs.

Question 3. What is a job stage?

– A subset of the dependency graph.
– A single step of the job.
– A place where a job is performed.
– An activity spawned in response to a Spark action.
– A particular shuffle operation within the job.

Question 4. How does your application find out the executors to work with?

– You statically define them in the configuration file.
– The SparkContext object queries a discovery service to find them out.

Question 5. Mark all the statements that are true.

Question 6. Imagine that you need to deliver three floating-point parameters for a machine learning algorithm used in your tasks. What is the best way to do it?

– Hardcode them into the algorithm and redeploy the application.
– Make a broadcast variable and put these parameters there.

Question 7. Imagine that you need to somehow print corrupted records from the log file to the screen. How can you do that?

– Use an accumulator variable to collect all the records and pass them back to the driver.
– Use a broadcast variable to broadcast the corrupted records and listen for these events in the driver.

Question 8. How are broadcast variables distributed among the executors?

– The executors are organized in a tree-like hierarchy, and the distribution follows the tree structure.
– The driver sends the content in parallel to every executor.
– The driver sends the content one-by-one to every executor.

Question 9. What will happen if you use a non-associative, non-commutative operator in the accumulator variables?

– Spark will not allow me to do that.
– I have tried that – everything works just fine.
– The cluster will crash.

Question 10. Mark all the operators that are both associative and commutative.

Question 11. Does Spark guarantee that accumulator updates originating from actions are applied only once?

– No.

Question 12. Does Spark guarantee that accumulator updates originating from transformations are applied at least once?

– Yes.

Automation of Big Data Stack Tuning – Cloudera REST API

I was assigned a task to provide tuning recommendations to improve price-performance for under-performing TPCx-BB queries when scaling from an 8-node cluster to a 21-node cluster and vice versa. The first stage was to identify why these queries behaved that way.

Hence I used the Intel PAT tool (https://github.com/asonje/PAT) to collect cluster-node telemetry data, which also gives high-level information on the hot modules of the deployed stack that consume the most system resources. Later I used Jupyter notebooks to plot visualizations of the results obtained for each performance use-case experiment done on the big data cluster. The idea was to correlate resource-usage metrics with Hadoop counter data and then with the corresponding tuning parameters used. This turned out to be a valuable piece of information for understanding Hadoop counter and cluster resource-usage patterns for a given set of tuning parameters. I could not think of any better tool than Jupyter notebooks for this fine-grained analysis (please do let me know of alternatives – maybe Zeppelin?). The scale factors of the big data cluster ranged from about 1 TB to 30 TB for different numbers of datanodes. I cannot talk about the nitty-gritty of the technical cohort and the cluster hardware stack configuration details, like processor SKU type/model, memory DIMMs, HDDs/SSDs, or the specific Cloudera tunes used, as this work is still in progress (optimizing the machine learning models) and some patents have been filed.

I conceptualized the idea initially on a simple cluster, to convince stakeholders of the potential of recommending optimal configurations, as shown in the conceptual diagram below.

Conceptual Diagram

A lot of effort went initially into defining the Hadoop/Spark cluster sizing, assuming the already defined Cloudera best practices for each service of the big data deployment for the benchmark.

[Figure: initial cluster sizing as per Cloudera best practices]
[Figure: CDH 6 tuning values based on the sizing assumptions above]
[Figure: YARN and MapReduce tuning parameters used for sizing – cluster floor plan]
[Figure: data block distribution across cluster nodes after rebalancing]

The process involved analyzing the reasons for query response-time degradation even though resources were available, and understanding data distribution across the cluster nodes for different data formats (Avro, Parquet, CSV, JSON). Since this took a huge amount of manual work executing different scripts and analyzing data, I designed and developed a Jenkins framework which reads key-value pairs of Spark and Hive tuning parameters, with a predefined range of values and tuning options, from a dataset whose inputs are the tuning parameters and whose output is the benchmark response-time metric. These tuning parameters are applied to the CDH cluster deployment by the Jenkins data pipeline through the Cloudera REST API, configured across the big data components. An example of this dataset is shown in the figure below; it contained about 2000 tuning experiments, one per row, which the Jenkins pipeline reads in order to execute the required benchmark queries, dump the results into CSV files and then run machine learning models over the history of results. Sub-datasets of the results were the resource utilizations and some of the big data software counters, used to find relationships between the performance KPIs and the counters, helping the model make better tuning recommendations – valuable for RFIs and RFPs, or RA/RC best practices.
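As a minimal sketch of that update step (not the production pipeline; the host, credentials, API version and cluster/service names below are assumptions), one row of tuning parameters can be pushed to a CDH service configuration through the Cloudera Manager REST API roughly like this:

import requests

CM_HOST = "http://cm-host.example.com:7180"      # hypothetical Cloudera Manager host
API_BASE = CM_HOST + "/api/v19"                  # the API version depends on the CM release
AUTH = ("admin", "admin")                        # assumption: basic-auth credentials

def apply_tunes(cluster, service, tunes):
    """Push one row of the tuning dataset ({parameter: value}) to a CDH service config."""
    payload = {"items": [{"name": name, "value": str(value)} for name, value in tunes.items()]}
    url = "%s/clusters/%s/services/%s/config" % (API_BASE, cluster, service)
    response = requests.put(url, json=payload, auth=AUTH)
    response.raise_for_status()
    return response.json()

# 'row' would come from the tuning dataset read by the Jenkins pipeline; the parameter
# names must be the Cloudera Manager configuration names for that service.
# apply_tunes("ExampleCluster", "hive", row)

After such an update the affected services typically need a restart (and a client-configuration redeploy) before the benchmark queries are run, which the pipeline would also have to trigger.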

Dataset – Input to framework:

one of the Input dataset fed to the pipeline

Data pipeline Framework:

Data Pipeline in Action:

Automated Data pipeline

Problem statement:

With massive amounts of ever-increasing raw data every day, data analytics platforms, including parallel and distributed data processing systems (like Hadoop MapReduce and Spark), have emerged to address the Big Data challenges of collecting, processing and analyzing huge volumes of data. Achieving system performance at such a scale is directly linked to the capacity of the systems executing these tasks. Because of the exponential growth in data volume, more resources are required, in terms of processors, memory and disks, to meet customer SLAs (latency, query execution time). Every extra node or piece of hardware needed to meet the SLAs also increases costs due to the hardware footprint and licensing (for example, if you are using a Cloudera-based Big Data solution stack), which is why capacity planning (resource demands) for Big Data workloads is so important: it reduces TCO while still meeting SLAs at a competitive price. For customers who require service guarantees, the performance question to be answered is: given a Big Data workload W with input dataset D, what cluster size, in terms of the number of compute, memory and disk resources, is required so that this workload finishes within deadline (SLA) T? Also, with the time spent arriving at an optimal configuration for varied customer use cases and enormous year-on-year (YoY) data growth expectations, choosing the right workload and configuration to meet use-case-specific resource demands becomes difficult, resulting in costs amounting to nearly millions of dollars.
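
As a naive, back-of-envelope illustration of this sizing question (and only that; the predictive model described later replaces this with measured data), one can scale from a known baseline run, assuming runtime grows roughly linearly with data volume and shrinks roughly linearly with node count. All numbers below are made up.

# Naive first-cut node estimate from a measured baseline run.
import math

def naive_node_estimate(base_nodes, base_data_tb, base_runtime_hr,
                        target_data_tb, sla_hr):
    # Assume runtime scales ~linearly with data volume and ~inversely with node count.
    # This ignores shuffle overheads, skew and diminishing returns, so it is only a
    # starting point that the measurement- and ML-based approach in this post refines.
    per_node_tb_per_hr = base_data_tb / (base_runtime_hr * base_nodes)
    required = target_data_tb / (sla_hr * per_node_tb_per_hr)
    return math.ceil(required)

# Hypothetical example: 8 nodes process 1 TB in 2 hours; nodes needed for 30 TB under a 6-hour SLA?
print(naive_node_estimate(8, 1, 2, 30, 6))   # -> 80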

For right-sizing the infrastructure we used the industry-standard big data workload TPCx-BB Express Benchmark, which is designed to measure the performance of Big Data analytics systems. TPCx-BB has been widely used to benchmark our company's servers for competing on Intel's/AMD's performance leaderboards, creating marketing collateral and bidding in customers' Requests for Proposals, and it has significantly influenced various customer bids. The benchmark contains 30 use cases (queries) that simulate big data processing, big data storage, big data analytics and reporting, which provides enough scalability to address the challenges of scaling data size (volume) and nodes (cluster size). Our company's large server portfolio supporting the as-a-service model will greatly benefit from sizing guidance based on a predictive analytical model that captures the customer's workload across growing data volumes and can make an offer meeting customer SLAs with minimum TCO. Hence we require a method that addresses capacity planning based on Big Data customers' specific requirements for our company's large server portfolio supporting the as-a-service model.

Solution:

We propose a framework we have developed, which helps in gathering data relevant to customer-specific data growth requirements and predicting resource demands for those specific customer use cases. Our solution involves three phases: "Requirements collection", "Identification & simulated data collection for Big Data customer requirements by domain expertise" and "Predictive analytical model to predict optimal configuration recommendations". Figure 1 depicts the workflow architecture with its prime offerings.

Figure 1

The requirements collection phase accepts user inputs in JSON format as key-value pairs. Essentially, the inputs detail customer requirements spanning workload resource characteristics (CPU/MEM/DISK/NETWORK), data growth size, current data scale factors, deployment details w.r.t. batch- and stream-based Big Data components such as HDFS, Spark, Hive, Kafka and YARN, and performance SLA metrics.
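
A hypothetical example of such a requirements JSON is shown below; the field names are illustrative and not the exact schema of our tool.

# Illustrative requirements input; field names are placeholders, not the tool's real schema.
import json

requirements = json.loads("""
{
  "workload_profile": {"cpu": "high", "memory": "medium", "disk": "high", "network": "medium"},
  "current_scale_factor_tb": 3,
  "yearly_data_growth_pct": 40,
  "components": ["HDFS", "YARN", "Spark", "Hive", "Kafka"],
  "sla": {"batch_window_hr": 6, "p95_query_sec": 120}
}
""")
print(requirements["sla"]["batch_window_hr"])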

The next phase uses these inputs to generate randomized subsets of cluster configuration parameters at uniform data growth for the simulated Big Data component parameters, and measures the performance of the key identified parameters at different workload levels. System telemetry data capturing usage patterns, Big Data application-level performance metrics and system-level counters are collected during this process and in turn fed as input to the predictive modelling phase.

Predictive modelling identifies the significant counters responsible for performance variations and the corresponding usage-pattern characteristics for different data-growth scale factors. It builds a history, a knowledge base (KB), of these performance-delta key-value mappings for each classified Big Data component and cluster configuration parameter. This KB is used for building a machine learning model, which learns adaptively as different Big Data components and use cases are trained. The machine learning model essentially employs gradient boosted decision trees to derive optimal configurations for each Big Data component classification and use case. The model improves as we add more Big Data deployments and customer use cases by providing this framework as a tool to the different teams evaluating Big Data solution stacks. Hence the framework is offered as a service to build the knowledge database from different data sources across our company's large server portfolio supporting the as-a-service model. As an end product it is also provided to customers for suggesting optimal configuration recommendations, with further reinforcement learning from those service engagements.
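
A minimal sketch of that modelling step, using scikit-learn's gradient boosted trees on the kind of results CSV the pipeline produces (column names are placeholders and the tuning options are assumed to be numerically encoded already; the production model, feature engineering and KB lookups are more involved):

# Sketch: train a gradient boosted regressor on tuning params + counters, then rank candidate configs.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("results.csv")            # hypothetical: tuning params + counters + runtime label
X = df.drop(columns=["runtime_sec"])       # features must be numeric (encode categorical tunes beforehand)
y = df["runtime_sec"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = GradientBoostingRegressor().fit(X_train, y_train)
print("R^2 on held-out experiments:", model.score(X_test, y_test))

# Rank candidate configurations by predicted runtime and recommend the best one
candidates = pd.read_csv("candidate_configs.csv")    # hypothetical file with the same feature columns as X
best = candidates.assign(pred=model.predict(candidates)).nsmallest(1, "pred")
print(best)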

As a next step, we also plan to build a similar knowledge base of tuning parameter history for our company's large server portfolio, with cost optimizations through a DevOps approach to managing automated dynamic sizing of cluster configurations and hardware/software parameter tuning, using a real-time workload classifier for each Big Data deployment/solution stack.

Acknowledgements:

Mukund Kumar and Vishal Daga, colleagues who helped integrate some of the Python scripts in the above automation.

Impact of above work:

  1. Patent filed
  2. Aggregation of results from the above tool into our company's solution sizing tool for Big Data, which is used widely for capacity planning of customer deployments with a BOM (Bill of Materials) for the different big data component stacks, helping win RFPs and close deals in less time with optimal TCO.

Model results to identify important tunes we can use to reduce resource footprints:

Jenkins Pipeline execution steps:

Prerequisites to executing the Jenkins pipeline to generate the ML dataset:

  1. Sanity-test a Hive query to be executed, so that we know all the roles/services are up in CDH (Cloudera).
  2. Make sure /home/tpc/hs/tpcxBB-pipeline/hivejenkinsfile and the input test*.dat files are in the expected format containing the YARN, Hive execution and other commands, and that the Python scripts are pushed to the GitHub link below using the command [./git_push.sh hiveJenkinsfile "<commit message>"]

https://github.com/harshagn/tpcxBBAuto

  • If the Jenkins login page is not up, follow the steps below to bring up the Jenkins home page at http://XX.XX.239.13:28080/. If Jenkins is started from a directory other than JENKINS_HOME=/var/jenkins_home/, you will not find the jobs previously created, so always cd to $JENKINS_HOME and then run the command below on the CLI (namenode/master node).
  • cd to /var/jenkins_home
  • execute "/home/tpc/hs/openjdk11-28/jdk-11/bin/java -jar /home/tpc/hs/jenkins/jenkins.war --httpPort=28080" (in place of 28080, any free port can be used)
  • Test the data pipeline*1 by copying 2-3 of the test*.dat files to /var/jenkins_home/Files with the corresponding SF, say 100 or 300, and then place all the test*.dat files under /var/jenkins_home/Files. During this phase, note how much time the query execution takes with those few test*.dat files, change the sleep time in the datasetRemotebuild Jenkins job accordingly as shown below, and save it.
Jenkins workflow: Reading the predefined key-value pairs of spark, hive tuning parameters
  • To reduce the number of datanodes, say from 7 to 5, in CDH go to the YARN-MR2 service -> Instances tab and select the NodeManager roles to be stopped (out of 7, two can be stopped, so that we have 5 NodeManagers in the "Running" state). Once the service is down for the selected NodeManagers, decommission those datanodes. After this, the Jenkins pipeline can be started by launching the datasetRemotebuild Jenkins job.
  • Again, if we want to execute data points with a different number of datanodes, first back up the artifacts into a folder with suitable nomenclature and start with step 1 above. For example, the Hive 256 data points we generated first are named as below:

*1 data pipeline means a set of Jenkins jobs put in a queue to be executed. In our case, the following data pipeline is used to generate the execution times as the "label" (query run time) for the feature set, i.e. the data points of the ML dataset.

Locations:

  1. Input test*.dat files:

               /var/jenkins_home/Files (the Jenkins jobs only pick up files named test*.dat)

  2. Output artifacts:

               /var/jenkins_home/workspace/TPCx-BB_hive/artifacts/

Ansible to setup chronyd (ntp) service

Chrony is a service for keeping your servers' time in sync, similar to ntpd; see https://chrony.tuxfamily.org/ for more.

– Chrony provides another implementation of NTP.
– Chrony is designed for systems that are often powered down or disconnected from the network.
– The main configuration file is /etc/chrony.conf.
– Parameters are similar to those in the /etc/ntp.conf file.
– chronyd is the daemon that runs in user space.
– chronyc is a command-line program that provides a command prompt and a number of commands. Examples:
tracking: Displays system time information
sources: Displays information about current sources.

source of above: https://www.thegeekdiary.com/centos-rhel-7-configuring-ntp-using-chrony/

Steps to deploy chronyd on the 1+3 node cluster with CentOS7 as operating system on each node:

  1. Our cluster has 1+3 nodes, named master, slave1, slave2 and slave3, with the following DHCP-assigned IPs.

slave1: 10.xx.0.xx

slave2: 10.xx.0.xx

slave3: 10.xx.0.xx

2. On the master node (the controller node, where Ansible is installed), install Ansible and configure chronyd with the conf file configuration below. (Ansible installation steps source: https://unixcop.com/ansible-installation-configuration-on-centos-7-8/)

On the master node, under /etc:

#start of chronyd.conf : source – https://www.ibm.com/docs/en/db2/11.5?topic=servers-setting-up-network-time-protocol

server 127.xxx.x.x # This is the IP of our Lab NTP server

driftfile /var/lib/chrony/drift

local stratum 10

allow all

#end of chronyd.conf

On the slave nodes, under /etc:

#start of chronyd.conf-slaves

rtcsync

server <master node IP 10.xx.0.xx> iburst prefer

driftfile /var/lib/chrony/drift

logdir /var/log/chrony

#end of chronyd.conf-slaves

3. Make a copy of the above chronyd.conf-slaves file as /home/chrony.conf.j2, or under any path where the Ansible playbook is run.

4. The Ansible YAML configuration for chronyd is created as below:

“chrony-config.yml”: source – https://www.redhat.com/sysadmin/ansible-chrony-daemon

chrony-config.yml file

Under /home, execute below command:

“ansible-playbook chrony-config.yml”

output:

chronyd-config.yml ansible execution

Now manually sync the time on each node including master as shown below:

timedatectl set-time "2021-11-17 11:53:59" -> my current time in IST. Run this command on the master and slave nodes simultaneously (I used MobaXterm multi-execution mode for this).

Multi-Execution in Mobaxterm

The above command is used to set the time manually to the approximate local time obtained from a wall clock, your phone, the Internet, etc.

Now restart the chronyd service on all nodes, including the master.

Once the above is completed, verify the chronyd status as shown below:

Hence you can see that, using the Ansible chronyd configuration and a few commands, we have time synced across the cluster nodes. The time is set to 2021-11-17 13:18:38 IST on all nodes, including the master node.

So now I can proceed with the Spark cluster configuration on this cluster!! Once I am done with it, I will post the link to the same.

Cloudera (CDP7.1) setup & Deployment with spark 3.0 cluster

Goal: To set up CDP 7.1 on a 3-node cluster for benchmark publications and marketing collateral, helping in the sale of servers in different domains like retail, banking and insurance.

A) The first step is to download CDP 7.1 (the license is applied as a txt file). There are a few prerequisites one needs to complete before installing CDP 7.1.

B) Get the hardware required. In our case I am choosing three homogeneous servers with the configuration below. However, it is better to make this choice based on the application or workload resource characteristics and the price-performance of the server (price-performance {PP} is to a server what mileage is to a car: performance delivered per unit of cost). If the application is multithreaded, then choosing processors with the maximum number of cores and sufficient memory will help. However, if the application is single threaded, then SKUs/CPUs with higher frequency and larger L1 cache could help, and for memory-intensive workloads, DIMMs with higher memory speeds would benefit application performance. Also, SSDs are preferred over HDDs for Big Data processing, due to better hardware reliability and faster data processing.
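
To make the mileage analogy concrete, here is a tiny illustration of how two candidate servers can be compared on price-performance; the benchmark scores and prices below are made-up numbers, not measured results.

# Compare two hypothetical servers on performance delivered per dollar.
def price_performance(benchmark_score, system_cost_usd):
    # Higher is better: performance per dollar, analogous to mileage per litre.
    return benchmark_score / system_cost_usd

# Hypothetical candidates: (benchmark-style score, total cost of the configured server)
server_a = price_performance(1500, 60000)
server_b = price_performance(1800, 85000)
print(f"Server A: {server_a:.4f} score/$  |  Server B: {server_b:.4f} score/$")
# Here Server A wins on price-performance even though Server B has the higher raw score.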

Each server configuration: a 2-socket server with 22 cores per CPU, so with HT enabled we have 88 threads in total, 384 GB of RAM per CPU and 2 x 3.2 TB SSDs.

C) Download the trial version of CDP 7.1 from link and follow the system security, network requirements from this link. Key requirements and commands are listed below.

  1. On CentOS 7, disable the firewall and set SELinux as below, as some ports need to be opened by the CDP Manager installer.

systemctl disable firewalld
systemctl stop firewalld

vim /etc/selinux/config

SELINUX=disabled

setenforce 0

2. Network configuration involves setting up DHCP, hostnames, DNS and NTP (to have identical time on each node of the cluster), and passwordless ssh.

DHCP is set up by the lab admins, with a central server assigning IPs to the nodes in the network. However, depending on the operating system, one can configure a system or VM as the DHCP server.

setting hostnames :

a)

[root@localhost ~]# hostnamectl
Static hostname: localhost.localdomain
Icon name: computer-server
Chassis: server
Machine ID: ****
Boot ID: **
Operating System: CentOS Linux 7 (Core)
CPE OS Name: cpe:/o:centos:centos:7
Kernel: Linux 3.10.0-1160.el7.x86_64
Architecture: x86-64
[root@localhost ~]#

b) Edit and update the network file (command: vim /etc/sysconfig/network) as shown below on the respective servers.

HOSTNAME=master.local.com

HOSTNAME=slave1.local.com

HOSTNAME=slave2.local.com

HOSTNAME=slave3.local.com

c) Setup domain name as below

[root@localhost ~]# domainname
(none)
[root@localhost ~]# domainname master.local.com
[root@localhost ~]# domainname
master.local.com
[root@localhost ~]#

d) On Linux application servers using DNS, edit the /etc/hosts file and add the IP addresses of each node of the cluster so that the nodes can ping each other:
10.xx.x.xx master.local.com master
10.xx.x.xx slave1.local.com slave1
10.xx.x.xx slave2.local.com slave2
10.xx.x.xx slave3.local.com slave3

e) Verify by pinging the hostnames above using "ping <master>", "ping <slave#>", etc. on each node.

f) Setting up ntp service with chronyd : reference link

g) Set up passwordless ssh on the master node and copy the public key to the slave nodes.

[root@master-spark ~]# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Created directory ‘/root/.ssh’.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:***************************** root@master-spark
The key's randomart image is:
(randomart image output not shown)

[root@master-spark ~]# ssh-copy-id root@slave1
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: “/root/.ssh/id_rsa.pub”
The authenticity of host ‘slave1 (10.xx.xx.xx)’ can’t be established.
ECDSA key fingerprint is SHA256:*******************************************
ECDSA key fingerprint is MD5:***************************************
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed — if you are prompted now it is to install the new keys
root@slave1’s password:

Number of key(s) added: 1

Now try logging into the machine, with: “ssh ‘root@slave1′”
and check to make sure that only the key(s) you wanted were added.

[root@master-spark ~]# ssh-copy-id root@slave2
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: “/root/.ssh/id_rsa.pub”
The authenticity of host ‘slave2 (10.10.0.73)’ can’t be established.
ECDSA key fingerprint is SHA256:*******************************************
ECDSA key fingerprint is MD5:***************************************
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed — if you are prompted now it is to install the new keys
root@slave2’s password:

Number of key(s) added: 1

Now try logging into the machine, with: “ssh ‘root@slave2′”
and check to make sure that only the key(s) you wanted were added.

[root@master-spark ~]#

similarly do it on slave3 too.

3. Other dependencies to be installed on each node are shown below. One can use Ansible for these system security and network configurations, and for the CDP installation itself, by cloning the GitHub repository and editing some of the YAML files. Please check the Ansible reference link.

yum install -y java-1.8.0-openjdk-devel (java)

yum install postgresql-server

verify below :

rpm -qa iptables-services
iptables-services-1.4.21-34.el7.x86_64
systemctl status iptables
iptables -L -n -v
iptables -F
iptables -L -n -v
systemctl stop iptables
systemctl disable iptables

chkconfig iptables off
Note: Forwarding request to ‘systemctl disable iptables.service’.

service iptables status --> output
Redirecting to /bin/systemctl status iptables.service
● iptables.service – IPv4 firewall with iptables
Loaded: loaded (/usr/lib/systemd/system/iptables.service; disabled; vendor preset: disabled)
Active: inactive (dead)

vm.swappiness=10 (set at run time with "sysctl vm.swappiness=10" and persist it in /etc/sysctl.conf)

D) Installation of the CDP 7.1 by following steps in the link

On the master node, run the below commands to install the Cloudera Manager server:

wget https://archive.cloudera.com/cm7/7.4.4/cloudera-manager-installer.bin

$ chmod u+x cloudera-manager-installer.bin

$ sudo ./cloudera-manager-installer.bin

Installer First page
Cloudera Manager server Page
Continue Installation by following directions in above screenshot

If the domain name of the master server doesn't work, use the link http://<master node IP>:7180 with credentials admin/admin.

Pardon the screenshots, which aren't very clear; I have masked the IPs of our systems for security purposes.

License applied

TLS (Transport Layer security) security Page: Ignore & continue
I had tried this installation a couple of times due to some misconfigurations, and so got the above error.

I tried multiple things like checking the configurations and the logs (under /var/log/cloudera-scm-manager/) for errors, but nothing was obvious, and running the installer many times did not help either; it was hard to get past the above page. This link from the Cloudera community helped me get over it: I had to reboot my master node, and that resolved the issue.

Finally got the login page again:

Login Page (admin/admin)
Name your cdp cluster as you wish
Read carefully when adding the slave1 to slave3 nodes to the cluster. Using hostnames of the form slave[1-3].local.com helped add all the nodes easily in my cluster setup.
Due to some hardware issue I had to add my slave3 later.
I chose Kafka too
Choice of the Cloudera-provided JDK, as shown above

Install agents
Follow the inspections for network performance
Follow the further few steps as directed
Once the inspection completes successfully you get the above page
I got a couple of warnings asking to set some kernel parameters for the proper functioning of the Cloudera cluster services and roles.
Cloudera recommends setting /proc/sys/vm/swappiness to a maximum of 10. Current setting is 60. Use the sysctl command to change this setting at run time and edit /etc/sysctl.conf for this setting to be saved after a reboot. You can continue with installation, but Cloudera Manager might report that your hosts are unhealthy because they are swapping. The following hosts are affected: slave[1-2].local.com

Transparent Huge Page Compaction is enabled and can cause significant performance problems. Run "echo never > /sys/kernel/mm/transparent_hugepage/defrag" and "echo never > /sys/kernel/mm/transparent_hugepage/enabled" to disable this, and then add the same command to an init script such as /etc/rc.local so it will be set on system reboot. The following hosts are affected: slave[1-2].local.com

From <http://XXXXXX:7180/cmf/inspector?commandId=15####$$87>

[root@master home]# ssh root@slave1 “sudo sysctl vm.swappiness=10”

vm.swappiness = 10

[root@master home]# ssh root@slave2 “sudo sysctl vm.swappiness=10”

vm.swappiness = 10

[root@master home]#

[root@master ~]# tail -f /var/log/cloudera-manager-installer/3.install-cloudera-manager-server.log

Total download size: 1.6 G

Installed size: 1.9 G

Downloading packages:

——————————————————————————–

Total                                               11 MB/s | 1.6 GB  02:35

Running transaction check

Running transaction test

Transaction test succeeded

Running transaction

Warning: RPMDB altered outside of yum.

  Installing : cloudera-manager-daemons-7.4.4-15850731.el7.x86_64           1/2

  Installing : cloudera-manager-server-7.4.4-15850731.el7.x86_64            2/2

  Verifying  : cloudera-manager-daemons-7.4.4-15850731.el7.x86_64           1/2

  Verifying  : cloudera-manager-server-7.4.4-15850731.el7.x86_64            2/2

Installed:

  cloudera-manager-server.x86_64 0:7.4.4-15850731.el7

Dependency Installed:

  cloudera-manager-daemons.x86_64 0:7.4.4-15850731.el7

Complete!

Inspection of network performance and hosts is completed. These steps take a while; be patient!!

Depending on requirements, one may choose specific services. I chose "Custom Services", so that I could focus on my batch programs to be deployed on the Spark cluster.

Make the choice of services based on your requirements. This does require some level of Big Data expertise and requirements gathering, keeping in mind the end goal of this Spark cluster deployment.

For our cluster I chose HDFS, HBase, YARN, Spark, Hive, Kafka, ZooKeeper and Sqoop, along with the Zeppelin service.

I missed updating the database passwords in the previous step

Another error:

As per the stderr log, there seemed to be some data pre-existing in /dfs/nn/

[root@master home]# ls -lrt /dfs/nn
total 4
drwx——. 2 hdfs hdfs 4096 Nov 18 01:42 current
[root@master home]# date
Thu Nov 18 01:49:05 IST 2021
[root@master home]# rm -rf /dfs/nn/current/
edits_0000000000000000001-0000000000000004325 edits_0000000000000009076-0000000000000009077 fsimage_0000000000000000000
edits_0000000000000004326-0000000000000009059 edits_0000000000000009078-0000000000000009085 fsimage_0000000000000000000.md5
edits_0000000000000009060-0000000000000009067 edits_0000000000000009086-0000000000000009093 seen_txid
edits_0000000000000009068-0000000000000009075 edits_inprogress_0000000000000009094 VERSION
[root@master home]# rm -rf /dfs/nn/current/*
[root@master home]#

Soln:

Continued errors:

Soln:

[root@slave3 process]# find . -name webapp.properties
./1546334388-queuemanager-QUEUEMANAGER_STORE/conf/webapp.properties
./1546334346-queuemanager-QUEUEMANAGER_STORE/conf/webapp.properties
./1546334344-queuemanager-QUEUEMANAGER_WEBAPP/conf/webapp.properties
./1546333931-queuemanager-QUEUEMANAGER_STORE/conf/webapp.properties
./1546333929-queuemanager-QUEUEMANAGER_WEBAPP/conf/webapp.properties
[root@slave3 process]# pwd
/var/run/cloudera-scm-agent/process
[root@slave3 process]#

Final assignments:

Hive

Type: PostgreSQL | Database Hostname: master.hpelocal.com:7432 | Database Name: hive1 | Username: hive1 | Password: iyxJSXOqQV

Reports Manager

Currently assigned to run on master.hpelocal.com. Type: PostgreSQL | Database Hostname: master.hpelocal.com:7432 | Database Name: rman | Username: rman | Password: pcELjCr8J4

Oozie Server

Currently assigned to run on master.hpelocal.com. Type: PostgreSQL | Database Hostname: master.hpelocal.com:7432 | Database Name: oozie_oozie_server1 | Username: oozie_oozie_server1 | Password: k8mKDQ322k

I removed the YARN Queue Manager service:

Updated /etc/hosts on all nodes as below:

[root@slave3 process]# cat /etc/hosts
10.1x.xx.xx master.hpelocal.com
10.1x.xx.xx slave1.hpelocal.com
10.1x.xx.xx slave2.hpelocal.com
10.1x.xx.xx slave3.hpelocal.com
[root@slave3 process]#

[root@slave1-spark process]# python -c "import socket; print socket.getfqdn(); print socket.gethostbyname(socket.getfqdn())"
slave1.hpelocal.com
10.1x.xx.xx
[root@slave1-spark process]#

https://community.cloudera.com/t5/Support-Questions/Bad-health-issue-DNS-resolve/m-p/37209

Below solved the DNS resolution:

Issue:

Configuration issues: DNS resolution unresolved with yarn and hosts failures

Solution (do this on all nodes):

 hostname --fqdn

master.hpelocal.com

 sudo hostnamectl set-hostname slave1.hpelocal.com

[root@slave1-spark process]# hostname --fqdn

slave1.hpelocal.com

 sudo hostnamectl set-hostname slave2.hpelocal.com

[root@slave2-spark process]# hostname --fqdn

slave2.hpelocal.com

[root@slave2-spark process]#

 sudo hostnamectl set-hostname slave3.hpelocal.com

[root@slave3 process]# hostname --fqdn

slave3.hpelocal.com

[root@slave3 process]#

And then restart cloudera scm agent:

 service cloudera-scm-agent restart

Redirecting to /bin/systemctl restart cloudera-scm-agent.service

[root@master process]#

References:

https://stackoverflow.com/questions/35472852/hadoop-dns-resolution/35517749 — helped

https://stackoverflow.com/questions/19639561/cloudera-cdh4-cant-add-a-host-to-my-cluster-because-canonical-name-is-not-cons — good to know

https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/configure_network_names.html

The Spark and other Big Data services on the CDP 7.1 cluster are finally up and running!!

CDP 7.1 cluster

Uninstallation of CDP:

https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/installation/topics/cdpdc-uninstallation.html

1) Stop all services and the cluster.

2) Remove user data:

Record the location of the user data paths by checking the configuration in each service.

The user data paths listed in the topic Remove User Data (/var/lib/flume-ng, /var/lib/hadoop*, /var/lib/hue, /var/lib/navigator, /var/lib/oozie, /var/lib/solr, /var/lib/sqoop*, /var/lib/zookeeper, data_drive_path/dfs, data_drive_path/mapred, data_drive_path/yarn) are the default settings.

3) Deactivate and Remove Parcels

If you installed using packages, skip this step and go to Uninstall the Cloudera Manager Server; you will remove packages in Uninstall Cloudera Manager Agent and Managed Software. If you installed using parcels remove them as follows:

  1. Click the parcel indicator  in the left-hand navigation bar.
  2. In the Location selector on the left, select All Clusters.
  3. For each activated parcel, select Actions > Deactivate. When this action has completed, the parcel button changes to Activate.
  4. For each activated parcel, select Actions > Remove from Hosts. When this action has completed, the parcel button changes to Distribute.
  5. For each activated parcel, select Actions > Delete. This removes the parcel from the local parcel repository.

There might be multiple parcels that have been downloaded and distributed, but that are not active. If this is the case, you should also remove those parcels from any hosts onto which they have been distributed, and delete the parcels from the local repository.

click on deactivate

4) Delete the Cluster

5) Uninstall the Cloudera Manager Server

6) Uninstall Cloudera Manager Agent and Managed Software

do the same on all slave nodes

7) Remove Cloudera Manager, User Data, and Databases

Permanently remove Cloudera Manager data, the Cloudera Manager lock file, and user data. Then stop and remove the databases.

a) On all Agent hosts, kill any running Cloudera Manager and managed processes:

for u in cloudera-scm flume hadoop hdfs hbase hive httpfs hue impala llama mapred oozie solr spark sqoop sqoop2 yarn zookeeper; do sudo kill $(ps -u $u -o pid=); done

Note: This step should not be necessary if you stopped all the services and the Cloudera Manager Agent correctly.

b) If you are uninstalling on RHEL, run the following commands on all Agent hosts to permanently remove Cloudera Manager data. If you want to be able to access any of this data in the future, you must back it up before removing it. If you used an embedded PostgreSQL database, that data is stored in /var/lib/cloudera-scm-server-db.

sudo umount cm_processes
sudo rm -Rf /usr/share/cmf /var/lib/cloudera* /var/cache/yum/cloudera* /var/log/cloudera* /var/run/cloudera*

c) On all Agent hosts, run this command to remove the Cloudera Manager lock file:

sudo rm /tmp/.scm_prepare_node.lock

This step permanently removes all user data. To preserve the data, copy it to another cluster using the distcp command before starting the uninstall process.

d) On all Agent hosts, run the following command:

sudo rm -Rf /var/lib/flume-ng /var/lib/hadoop* /var/lib/hue /var/lib/navigator /var/lib/oozie /var/lib/solr /var/lib/sqoop* /var/lib/zookeeper

e) Run the following command on each data drive on all Agent hosts (adjust the paths for the data drives on each host):

sudo rm -Rf data_drive_path/dfs data_drive_path/mapred data_drive_path/yarn

Stop and remove the databases. If you chose to store Cloudera Manager or user data in an external database, see the database vendor documentation for details on how to remove the databases.

A Lazy Yercaud

Yes, we decided on Yercaud for a lazy boys' trip after dropping Pondicherry and Goa! Can you believe it? We dropped them because the former was a frequently visited spot and the latter too far to drive in our friend's brand new SUV. Guys generally hate to drive long distances on a boys' trip, especially with a capped speed and an unsettled new vehicle 🙂

Nonetheless, it was indeed a calm, slow and soothing trip away from the humdrum of routine chores at home and office, especially after rusting through the corona outbreak, which, though reduced in intensity, is still lurking around like "shadows of our souls". When guys turn into men, close to age 40, they suffer IBS (irritable bowel syndrome); that's what we felt after our early morning brunch as we entered Tamil Nadu, missing the much-awaited breakfast at Sarvana Bhavan, which we never managed in the whole four-day trip 🙂

So we drove through the Bengaluru, Hosur, Krishnagiri, Dharmapuri, Salem route with not many hiccups in our friend's new KIA with a sunroof, at speeds not crossing 90-100 km/hr. As a twist to the story, we visited Hogenakkal. Since it was a Sunday, there was a mad rush of people around 9:30 am and no parking available; everyone was parking in a nearby hotel's ground just a few meters before the falls entry. It was fun maneuvering in the circular (teppa) boat on the deep, fierce river amidst huge rocky mountains. We also had a good little dip in the water, although it is considered dangerous to go into the strong streams, as many have lost their lives at Hogenakkal falls. Therefore, be cautious of the streams and do not go alone into deserted areas in the water. One piece of advice is not to give extra money to the boatmen there who take you around. It costs only Rs 750 for the complete ride (4 people in one boat), but the person in blue (maybe the local boating uniform) fleeced us of an extra Rs 600, saying he would drop us close to the car parking area by floating down the river. Please check some pics of Hogenakkal below.

After playing in the river, it felt just like the quotes below say: "Let your arms be wide open to every moment you meet" and "Learn to flow like the waves that make a sea".

It was almost 2 pm once we were out of this expedition, and we made the mistake of not having lunch there; it felt just like the saying, "if you are carrying your restlessness in your stomach you are dead". We then moved on towards Salem and did not find any good restaurants until we came across Hotel Selvi near Salem 🙂 The food was awesome, Andhra style, and we almost felt we had found an 'oasis in the desert!'

We reached Salem around 4 pm, and since it was a Sunday we saw a stream of vehicles winding down the hairpin bends as we drove up to our Sterling booking for 3 nights. I must say the food and ambience on the hill top at Sterling were like a gust of wind telling us to live free, and our hearts said, "May every moment gift you a new sight to greet". Check out some of the wonderful pics at Sterling Yercaud.

We enjoyed our stay at Sterling Yercaud, with a nice buffet breakfast and in-room dining options. We played TT, Scramble, carrom and shuttle (badminton). The next day we spent most of the time in the resort itself, with a lazy walk to Gent's Seat (the better one) and Lady's Seat. These are the hill-top panoramic viewpoints. When you walk up to Gent's Seat above Sterling on the hill top, there is a children's play park, and the view is very beautiful.

Story around Gent's Seat: I am not sure about this anecdote, but I heard it from one of the coffee/tea stall and security people here. Lady's Seat and Gent's Seat were known as suicide points in the olden days. That thought made me feel we were staying on a cemetery 🙂 Sitting at Gent's Seat, we discussed how the government could plan a cable car from this spot down to Salem, with a coffee shop or a film-shooting spot, as the garden around and the skyway views were very good.

Gent’s seat Pics:

At this view I could not stop thinking, "if you are carrying the spark of your dreams in your eyes you are alive, if you are carrying wonder in your eyes you are alive".

Weather-wise, Yercaud is a very good place to stay, but there are not many places to see. Do not buy any chocolates or perfumes, as they are nothing special. We loafed around the lake and had bread omelettes with lemon tea near the lake, which were again very delicious. We then went to visit the Shevaroy temple, which has a large level ground on the hill top, again with many fairground games to try, like balloon shooting, a giant wheel and a rolling boat, as shown in the pics below (which include lake pics too). One could also try the coffee estate treks around here, which we did not explore.

The next day we ventured out and had breakfast at the old Venkateshwara hotel, run by a guy from Karnataka, after checking Google reviews. The food was decent, but coffee was not available and we were directed to a nearby bakery for it. For lunch we wanted to visit the Rascal restaurant, but despite the hype it was mainly meant for non-vegetarians and needed prior booking 🙂 This Rascal restaurant has nice quotes all around, but we did not feel it was as great as one of the TripAdvisor comments claimed. We again landed at "Selvam hotel", which had decent Andhra-style food. To our surprise there were not many wine shops in Yercaud, so we bought our bottles from the "TASMAC Mall shop" as per Google, where I felt the guy was selling the bottles like black-market tickets out of a house turned into a shop.

Breakfast Pics:

Hence, a lazy trip to Yercaud…..