Jenkins Pipeline execution steps:

Prerequisites for running the Jenkins pipeline that generates the ML dataset:

  1. Sanity-test a Hive query of the kind to be executed, to confirm that all roles/services are up in CDH (Cloudera).
  2. Make sure that /home/tpc/hs/tpcxBB-pipeline/hivejenkinsfile and the input test*.dat files are in the expected format (containing the yarn and hive executions and the other commands), and that the Python scripts are pushed to the GitHub repository below using the command [./git_push.sh hiveJenkinsfile "<commit message>"]

https://github.com/harshagn/tpcxBBAuto
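The first prerequisite can be scripted as a quick smoke test. The sketch below is only an illustration: it assumes the `hive` CLI is on the PATH of the node where it runs, `SHOW DATABASES;` is a placeholder for whichever query you actually sanity-test, and the function name `sanity_check_hive` and the `HIVE_BIN` variable are made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a trivial Hive query and report whether it succeeded.
# HIVE_BIN is an assumption of this sketch; it defaults to the `hive` CLI.
sanity_check_hive() {
    local hive_bin="${HIVE_BIN:-hive}"
    # `hive -e` runs the quoted query string; a non-zero exit status means failure.
    if "$hive_bin" -e 'SHOW DATABASES;' >/dev/null 2>&1; then
        echo "hive query OK"
    else
        echo "hive query FAILED: check the roles/services in CDH" >&2
        return 1
    fi
}
```

Calling `sanity_check_hive` on the master node before launching the pipeline gives a quick yes/no answer without opening the Cloudera UI.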

  • If the Jenkins login page is not up, follow the steps below to bring up the Jenkins home page http://XX.XX.239.13:28080/. If Jenkins is started from a directory other than JENKINS_HOME=/var/jenkins_home/, you will not find the previously created jobs, so always cd to $JENKINS_HOME and then run the command below on the CLI (namenode/master node):
  • cd to /var/jenkins_home
  • execute "/home/tpc/hs/openjdk11-28/jdk-11/bin/java -jar /home/tpc/hs/jenkins/jenkins.war --httpPort=28080" (any free port can be used in place of 28080)
  • Test the data-pipeline*1 by first copying 2-3 of the test*.dat files to /var/jenkins_home/Files with the corresponding scale factor (SF), say 100 or 300. Note how long query execution takes for these few files, change the sleep time in the datasetRemotebuild Jenkins job accordingly, and save it. Then place all the test*.dat files under /var/jenkins_home/Files.
Jenkins workflow: reading the predefined key-value pairs of Spark and Hive tuning parameters
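The Jenkins start-up steps described above can be wrapped in a small function. This is a minimal sketch, not the script used on the cluster: the paths and default port are the ones quoted in the steps, while the function name `start_jenkins` and the `DRY_RUN`, `JAVA_BIN` and `JENKINS_WAR` variables are assumptions introduced for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: start Jenkins from JENKINS_HOME on the given port.
# With DRY_RUN=1 it only prints the command instead of running it.
start_jenkins() {
    local port="${1:-28080}"
    local home="${JENKINS_HOME:-/var/jenkins_home}"
    local java_bin="${JAVA_BIN:-/home/tpc/hs/openjdk11-28/jdk-11/bin/java}"
    local war="${JENKINS_WAR:-/home/tpc/hs/jenkins/jenkins.war}"

    # Always start from the Jenkins home directory so the existing jobs are found.
    cd "$home" || { echo "cannot cd to $home" >&2; return 1; }

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$java_bin -jar $war --httpPort=$port"
    else
        # Jenkins also honours the JENKINS_HOME environment variable,
        # so it is passed explicitly in addition to the cd above.
        JENKINS_HOME="$home" "$java_bin" -jar "$war" --httpPort="$port"
    fi
}
```

Running `DRY_RUN=1 start_jenkins 28080` first is a cheap way to confirm the command line before actually launching Jenkins.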
  • To run with fewer datanodes, say 5 instead of 7: in CDH, go to the Yarn-MR2 service → Instances tab and select the NodeManager services to be stopped (two of the 7 can be stopped, leaving 5 NodeManagers in the "Running" state). Once the selected NodeManagers are down, decommission those datanodes. After this, the Jenkins pipeline can be started by launching the datasetRemotebuild Jenkins job.
  • To generate data points with a different number of datanodes, first back up the artifacts into a folder with a suitable name and start again from step 1 above. For example, the first 256 Hive data points we generated are named as below:

*1 data-pipeline means a set of Jenkins jobs placed in a queue to be executed. In our case, the following is the data pipeline used to generate the execution times that serve as the "label" (query run time) for the feature set of each data point in the ML dataset.

Locations:

  1. Input test*.dat files:

               /var/jenkins_home/Files (the Jenkins jobs pick up only files named test*.dat)

  2. Artifacts:

               /var/jenkins_home/workspace/TPCx-BB_hive/artifacts/
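Two of the manual steps in this post touch these locations: staging a few test*.dat files for a trial run, and backing up the artifacts before changing the number of datanodes. The sketch below shows one way to script both. The function names, argument order and the backup folder name in the usage note are hypothetical; only the two default paths come from this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy at most COUNT test*.dat files from SRC into the
# Jenkins input directory and print how many were copied.
# Usage: stage_sample_files SRC [COUNT] [DEST]
stage_sample_files() {
    local src="$1" count="${2:-3}" dest="${3:-/var/jenkins_home/Files}"
    local copied=0 f
    mkdir -p "$dest" || return 1
    for f in "$src"/test*.dat; do
        [ -e "$f" ] || continue            # the glob matched nothing
        if [ "$copied" -ge "$count" ]; then
            break
        fi
        cp "$f" "$dest"/ || return 1
        copied=$((copied + 1))
    done
    echo "$copied"
}

# Hypothetical helper: copy the artifacts directory to a new backup folder.
# It refuses to overwrite an existing backup.
# Usage: backup_artifacts DEST [ARTIFACTS_DIR]
backup_artifacts() {
    local dest="$1"
    local artifacts="${2:-/var/jenkins_home/workspace/TPCx-BB_hive/artifacts}"
    if [ -z "$dest" ]; then
        echo "usage: backup_artifacts DEST [ARTIFACTS_DIR]" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "refusing to overwrite existing backup: $dest" >&2
        return 1
    fi
    cp -r "$artifacts" "$dest"
}
```

For example, `stage_sample_files /path/to/all/dat/files 3` stages three files for the timing run, and `backup_artifacts /some/backup/dir/my_run_name` keeps a copy of the results before the next run; both paths here are placeholders.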
