**Explain the steps in making a decision tree.**

Take the entire data set as input.

Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.

Apply the split to the input data (divide step).

Re-apply the previous two steps (find a split, then apply it) to each subset of the divided data.

Stop when you meet some stopping criteria.

Clean up the tree if you went too far doing splits. This step is called pruning.
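The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: it assumes numeric features, uses Gini impurity as the separation measure, and uses a hypothetical `max_depth` parameter as the stopping criterion. Pruning is left out, and all function names are made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Search every (feature, threshold) test for the lowest weighted impurity."""
    best = None  # (score, feature index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # the test does not actually divide the data
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Split recursively; stop when a node is pure or max_depth is reached."""
    split = None
    if depth < max_depth and gini(labels) > 0:
        split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the tests from the root down to a leaf label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Toy data: one feature, two well-separated classes.
rows = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
labels = ["a", "a", "a", "b", "b", "b"]
tree = build_tree(rows, labels)
print(predict(tree, [2.5]), predict(tree, [10.5]))  # a b
```

On this toy data a single split at threshold 3.0 separates the classes completely, so both children become leaves.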

**What is root cause analysis?**

Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.

**What is logistic regression?**

Logistic regression is also known as the logit model. It is a technique for predicting a binary outcome from a linear combination of predictor variables.
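As a rough sketch of what "a linear combination of predictor variables" means here: the model passes the weighted sum through the sigmoid function to get a probability, then thresholds it to get the binary outcome. The weights, bias and 0.5 threshold below are hand-picked, hypothetical values for illustration only, not fitted coefficients.

```python
import math

def predict_proba(features, weights, bias):
    """Probability of the positive class: sigmoid of a linear combination."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen by hand for the example.
weights = [0.8, -0.5]
bias = 0.1

p = predict_proba([2.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0  # binary outcome
print(round(p, 3), label)
```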

**What are the important skills to have in Python with regard to data analysis?**

The following are some of the important skills that come in handy when performing data analysis using Python.

Good understanding of the built-in data types especially lists, dictionaries, tuples and sets.

Mastery of N-dimensional NumPy arrays.

Mastery of pandas dataframes.

Ability to perform element-wise vector and matrix operations on NumPy arrays. This requires the biggest shift in mindset for someone coming from a traditional software development background who’s used to for loops.

Knowing that you should use the Anaconda distribution and the conda package manager.

Familiarity with scikit-learn.

Ability to write efficient list comprehensions instead of traditional for loops.

Ability to write small, clean functions (important for any developer), preferably pure functions that don’t alter objects.

Knowing how to profile the performance of a Python script and how to optimize bottlenecks.

These skills will help you tackle most problems in data analytics and machine learning.
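Two of the skills listed above, list comprehensions and profiling, can be tried together in a short sketch. It uses only the standard library's `timeit` module. The function names are made up for the example, and the timings will vary by machine and Python version, so no particular speed-up is claimed.

```python
import timeit

values = list(range(1000))

def squares_loop(xs):
    """Traditional for loop that builds the result step by step."""
    out = []
    for x in xs:
        out.append(x * x)
    return out

def squares_comprehension(xs):
    """Same result as a list comprehension; a pure function that leaves xs untouched."""
    return [x * x for x in xs]

# Measure both versions before deciding which one to keep.
loop_time = timeit.timeit(lambda: squares_loop(values), number=200)
comp_time = timeit.timeit(lambda: squares_comprehension(values), number=200)
print(f"loop: {loop_time:.4f}s, comprehension: {comp_time:.4f}s")
```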

**What is Selection Bias?**

Selection bias is the bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. It is the distortion of a statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate.

The types of selection bias include:

Sampling bias: It is a systematic error due to a non-random sample of a population causing some members of the population to be less likely to be included than others resulting in a biased sample.

Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.

Data: When specific subsets of data are chosen to support a conclusion or rejection of bad data on arbitrary grounds, instead of according to previously stated or generally agreed criteria.

Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants) discounting trial subjects/tests that did not run to completion.

**What is the goal of A/B Testing?**

A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B.

The goal of A/B testing is to identify which changes to a web page maximize or increase an outcome of interest.

An example would be comparing the click-through rate of two versions of a banner ad.
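One common way to run the hypothesis test for click-through rates is a two-proportion z-test. This is a sketch under stated assumptions: the click and view counts are invented, the function name is hypothetical, and the normal approximation is assumed to be reasonable for samples of this size.

```python
import math

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for H0: both variants share one click-through rate."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Standard normal CDF written with erf, then doubled for a two-sided test.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)

# Invented counts: variant A gets 100 clicks from 1000 views, variant B gets 130.
z, p_value = two_proportion_z_test(clicks_a=100, views_a=1000, clicks_b=130, views_b=1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value suggests that the difference between the two variants is unlikely to be due to chance alone.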

**A fair six-sided die is rolled twice. What is the probability of getting 2 on the first roll and not getting 4 on the second roll?**

A) 1/36

B) 1/18

C) 5/36

D) 1/6

E) 1/3

Ans: (C)

The two events mentioned are independent. The first roll of the die is independent of the second roll. Therefore the probabilities can be directly multiplied.

P(getting first 2) = 1/6

P(no second 4) = 5/6

Therefore P(getting first 2 and no second 4) = 1/6 * 5/6 = 5/36
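The multiplication can be double-checked by brute force. This short sketch enumerates all 36 equally likely outcomes and counts the favorable ones; the variable names are chosen for the example.

```python
from fractions import Fraction
from itertools import product

# Every (first roll, second roll) pair for two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))
favorable = [(first, second) for first, second in outcomes
             if first == 2 and second != 4]

probability = Fraction(len(favorable), len(outcomes))
print(probability)  # 5/36
```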

**Consider a tetrahedral die and roll it twice. What is the probability that the number on the first roll is strictly higher than the number on the second roll?**

Note: A tetrahedral die has only four sides (1, 2, 3 and 4).

A) 1/2

B) 3/8

C) 7/16

D) 9/16

Ans: (B)

(1,1) (2,1) (3,1) (4,1)

(1,2) (2,2) (3,2) (4,2)

(1,3) (2,3) (3,3) (4,3)

(1,4) (2,4) (3,4) (4,4)

There are 6 out of 16 possibilities where the first roll is strictly higher than the second roll: (2,1), (3,1), (4,1), (3,2), (4,2) and (4,3). The probability is therefore 6/16 = 3/8.
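The same count can be reproduced by enumerating the 16 outcomes in code. This is a small sketch with names chosen for the example.

```python
from fractions import Fraction
from itertools import product

# Every (first roll, second roll) pair for two four-sided dice.
tetra_outcomes = list(product(range(1, 5), repeat=2))
first_higher = [(first, second) for first, second in tetra_outcomes
                if first > second]

tetra_probability = Fraction(len(first_higher), len(tetra_outcomes))
print(tetra_probability)  # 3/8
```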

**What do you understand by statistical power of sensitivity and how do you calculate it?**

Sensitivity is commonly used to validate the accuracy of a classifier (Logistic, SVM, Random Forest etc.).

Sensitivity is simply “correctly predicted true events / total true events”. The correctly predicted true events are the events that were actually true and that the model also predicted as true.

Calculating sensitivity is straightforward.

Sensitivity = ( True Positives ) / ( Positives in Actual Dependent Variable )

where true positives are positive events which are correctly classified as positives.
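A minimal sketch of the ratio in code, assuming labels are encoded as 1 (positive) and 0 (negative). The sample labels are invented for illustration.

```python
def sensitivity(actual, predicted):
    """True positives divided by all actual positives (TP + FN)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp / (tp + fn)

# Invented labels: 5 actual positives, of which the model catches 3.
actual    = [1, 1, 1, 1, 0, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 1, 0, 0]
print(sensitivity(actual, predicted))  # 0.6
```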

**Can you cite some examples where a false negative is more important than a false positive?**

Example 1: Assume there is an airport ‘A’ which has received high-security threats, and based on certain characteristics they identify whether a particular passenger can be a threat or not. Due to a shortage of staff, they decide to scan only the passengers predicted as risk positives by their predictive model. What will happen if a passenger who is a true threat is flagged as a non-threat by the airport's model?

Example 2: What if a jury or judge decides to let a criminal go free?

Example 3: What if you declined to marry a very good person based on your predictive model, then happened to meet him/her after a few years and realized that you had a false negative?

**Can you cite some examples where both false positive and false negatives are equally important?**

In the banking industry giving loans is the primary source of making money but at the same time if your repayment rate is not good you will not make any profit, rather you will risk huge losses.

Banks don’t want to lose good customers and, at the same time, they don’t want to acquire bad customers. In this scenario, both the false positives and false negatives become very important to measure.

**Can you explain the difference between a Validation Set and a Test Set?**

A validation set can be considered part of the training set, as it is used for parameter selection and to avoid overfitting of the model being built.

On the other hand, a test set is used for testing or evaluating the performance of a trained machine learning model.

In simple terms, the differences can be summarized as follows: the training set is used to fit the parameters (i.e. weights), the validation set is used to select among candidate models and guard against overfitting, and the test set is used to assess the performance of the final model, i.e. its predictive power and generalization.
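One simple way to carve a data set into the three parts is sketched below. The 60/20/20 proportions, the fixed seed and the function name are assumptions made for illustration; the right proportions depend on how much data is available.

```python
import random

def train_val_test_split(items, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then slice into disjoint train, validation and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                 # held out until the very end
    val = shuffled[n_test:n_test + n_val]    # used while tuning the model
    train = shuffled[n_test + n_val:]        # used to fit the weights
    return train, val, test

train, val, test = train_val_test_split(range(10))
print(len(train), len(val), len(test))  # 6 2 2
```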

**Explain cross-validation.**

Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is prediction and one wants to estimate how accurately a model will perform in practice.

The goal of cross-validation is to set aside a data set to test the model during the training phase (i.e. a validation data set) in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.
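A common variant is k-fold cross-validation, where every observation is used for validation exactly once. The sketch below only produces the index splits; fitting and scoring a model on each fold is left out. The function name is hypothetical, and shuffling the data before splitting is assumed to happen elsewhere.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(val_idx)
```

The model would be trained on `train_idx` and scored on `val_idx` for each fold, and the scores averaged.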

**What is Machine Learning?**

Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. It is closely related to computational statistics. It is used to devise complex models and algorithms that lend themselves to prediction, which in commercial use is known as predictive analytics.

**What is Supervised Learning?**

Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples.

Algorithms: Support Vector Machines, Regression, Naive Bayes, Decision Trees, K-nearest Neighbor Algorithm and Neural Networks

E.g. If you built a fruit classifier, the labels will be “this is an orange, this is an apple and this is a banana”, based on showing the classifier examples of apples, oranges and bananas.

**What is Unsupervised learning?**

Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses.

Algorithms: Clustering, Anomaly Detection, Neural Networks and Latent Variable Models

E.g. In the same example, a fruit clustering would group the fruit into categories such as “fruits with soft skin and lots of dimples”, “fruits with shiny hard skin” and “elongated yellow fruits”.

**What is logistic regression? State an example when you have used logistic regression recently.**

Logistic regression, often referred to as the logit model, is a technique for predicting a binary outcome from a linear combination of predictor variables.

For example, suppose you want to predict whether a particular political leader will win an election or not. In this case, the outcome of the prediction is binary, i.e. 0 or 1 (lose/win). The predictor variables here would be the amount of money spent on the candidate's election campaign, the amount of time spent campaigning, etc.

**Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?**

For better predictions, a categorical variable should be treated as a continuous variable only when the variable is ordinal in nature.

**When does regularization become necessary in Machine Learning?**

Regularization becomes necessary when the model begins to overfit. The technique adds a penalty (cost) term to the objective function for bringing in more features. It therefore tries to push the coefficients of many variables toward zero, which reduces the penalty term. This reduces model complexity so that the model becomes better at predicting (generalizing).
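The shrinking effect of the penalty can be seen in a one-feature sketch. It assumes an L2 (ridge) penalty and a model with no intercept, for which the minimizing slope has a closed form. Note that an L2 penalty shrinks coefficients toward zero without making them exactly zero; an L1 (lasso) penalty is the variant that can zero them out. The data and names are invented for illustration.

```python
def ridge_slope(xs, ys, penalty):
    """Slope w minimizing sum((y - w*x)**2) + penalty * w**2 (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

unregularized = ridge_slope(xs, ys, penalty=0.0)
regularized = ridge_slope(xs, ys, penalty=14.0)
print(unregularized, regularized)  # 2.0 1.0
```

A larger penalty pulls the slope further toward zero, trading some fit on the training data for a simpler model.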

**What do you understand by the bias-variance trade-off?**

The error of any model can be broken down mathematically into three components: bias error, variance error and irreducible error (noise). The first two are described below.

Bias error quantifies how much, on average, the predicted values differ from the actual values. A high bias error means we have an under-performing model that keeps missing important trends. Variance, on the other hand, quantifies how much the predictions made for the same observation differ from each other. A high-variance model will overfit the training population and perform badly on any observation beyond the training data. Reducing one of the two typically increases the other, hence the trade-off.
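For squared-error loss the decomposition can be written out explicitly. The notation here is an assumption, not taken from the text above: the data are generated as $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, $\hat{f}$ is the fitted model, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The last term does not depend on the model, so only the first two can be traded against each other.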

**OLS is to linear regression. Maximum likelihood is to logistic regression. Explain the statement.**

OLS and maximum likelihood are the methods used by the respective regression models to estimate the unknown parameter (coefficient) values. In simple words:

Ordinary least squares (OLS) is the method used in linear regression; it chooses the parameters that minimize the sum of squared distances between actual and predicted values. Maximum likelihood chooses the parameter values under which the observed data are most likely.
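The contrast can be sketched with one predictor. The OLS fit below uses the closed-form least-squares solution, while the logistic fit maximizes the Bernoulli log-likelihood by plain gradient ascent. The data, learning rate and step count are invented for illustration; real libraries typically use more sophisticated optimizers.

```python
import math

def ols_fit(xs, ys):
    """Closed-form OLS for y = a + b*x: minimizes the sum of squared residuals."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def log_likelihood(a, b, xs, ys):
    """Bernoulli log-likelihood of binary labels ys under a logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def mle_fit(xs, ys, lr=0.05, steps=5000):
    """Maximize the log-likelihood by gradient ascent (assumed settings)."""
    a = b = 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p          # derivative with respect to the intercept
            grad_b += (y - p) * x    # derivative with respect to the slope
        a += lr * grad_a
        b += lr * grad_b
    return a, b

# Linear regression on data that lie exactly on y = 1 + 2x.
intercept, slope = ols_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(intercept, slope)  # 1.0 2.0

# Logistic regression on invented, non-separable binary data.
bin_xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
bin_ys = [0, 0, 1, 0, 1, 1]
a_hat, b_hat = mle_fit(bin_xs, bin_ys)
print(log_likelihood(a_hat, b_hat, bin_xs, bin_ys) > log_likelihood(0.0, 0.0, bin_xs, bin_ys))  # True
```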