**Explain selection bias.**

Selection bias is the error introduced when a sample is not drawn at random, so that it does not represent the population being studied.

**What are the types of biases that can occur during sampling?**

Selection bias

Undercoverage bias

Survivorship bias

**Explain survivorship bias.**

It is the logical error of focusing on the people or things that survived some selection process while overlooking those that did not, typically because the latter are less visible. This can lead to wrong conclusions in several ways.

**How do you work towards a random forest?**

The underlying principle of this technique is that several weak learners are combined to form a strong learner. The steps involved are:

Build several decision trees on bootstrapped samples of the training data.

In each tree, every time a split is considered, a random sample of $m$ predictors is chosen as split candidates out of all $p$ predictors.

Rule of thumb: at each split, $m \approx \sqrt{p}$.

Predictions: made by majority vote across the trees.
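The three ingredients above (bootstrapped samples, a random subset of roughly √p predictors per split, and a majority vote) can be sketched in plain Python. The helper names below are illustrative assumptions rather than any library's API, and the tree-growing step itself is deliberately left out.

```python
import math
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrapped training set)."""
    return [rng.choice(rows) for _ in rows]


def split_candidates(p, rng):
    """Pick about sqrt(p) distinct predictor indices to consider at one split."""
    m = max(1, round(math.sqrt(p)))
    return rng.sample(range(p), m)


def majority_vote(tree_predictions):
    """Combine the class labels predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [(5.1, 3.5, "a"), (6.2, 2.9, "b"), (5.9, 3.0, "b"), (4.7, 3.2, "a")]
sample = bootstrap_sample(rows, rng)        # same size as rows, repeats allowed
features = split_candidates(p=16, rng=rng)  # 4 of the 16 predictor indices
label = majority_vote(["a", "b", "a"])      # the most common label wins
```

In a full implementation, each tree would be grown on its own `sample`, calling something like `split_candidates` again at every split.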

**What are the basic assumptions to be made for linear regression?**

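Linearity and additivity of the relationship between the predictors and the response, statistical independence of the errors, constant variance of the errors (homoscedasticity), and normality of the error distribution.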

**Can you write the formula to calculate R-square?**

R-square can be calculated using the formula below:

R-square = 1 - (Residual Sum of Squares / Total Sum of Squares)
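As a small illustration of the formula, the sketch below computes R-square from observed and predicted values. The function name and the sample numbers are made up for the example.

```python
def r_squared(y_true, y_pred):
    """R-square = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    tss = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - rss / tss


# Hypothetical observations and model predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.2, 8.9]
score = r_squared(y_true, y_pred)  # roughly 0.995 for these numbers
```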

**Differentiate between univariate, bivariate and multivariate analysis.**

These are descriptive statistical analysis techniques that are differentiated by the number of variables involved at a given point in time. For example, a pie chart of sales by territory involves only one variable and can be referred to as univariate analysis.

If the analysis attempts to understand the relationship between two variables at a time, as in a scatterplot, it is referred to as bivariate analysis. For example, analysing sales volume against spending is bivariate analysis.

Analysis that studies more than two variables to understand their effect on the responses is referred to as multivariate analysis.

**What is Linear Regression?**

Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.
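A minimal sketch of this idea for one predictor uses the ordinary least squares formulas for the slope and intercept. The function name and the data points are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for predicting Y from X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical scores: X is the predictor, Y the criterion
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
predicted_y = slope * 6 + intercept  # predicted Y for a new X of 6
```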

**What is Interpolation and Extrapolation?**

Interpolation is estimating a value that lies between two known values in a list of values. Extrapolation is approximating a value outside the known range by extending a known set of values or facts.
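The difference can be shown with a straight line through two known points: evaluating it between the points is interpolation, and evaluating it beyond them is extrapolation. The linear form and the function name are assumptions chosen for simplicity; other curves can be used in the same way.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Known points: (2, 10) and (4, 20)
inside = linear_estimate(2, 10, 4, 20, 3)   # interpolation, x lies between 2 and 4
outside = linear_estimate(2, 10, 4, 20, 6)  # extrapolation, x lies beyond 4
```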

**A coin of diameter 1 inch is thrown on a table covered with a grid of lines two inches apart. What is the probability that the coin lands inside a square without touching any of the lines of the grid? You can assume that the person throwing has no skill in throwing the coin and is throwing it randomly.**

A) 1/2

B) 1/4

C) π/3

D) 1/3

Ans: (B)

Think about where the center of the coin can be when it lands in a 2-inch grid square without touching the grid lines.

The coin has a radius of 0.5 inch, so its center must be at least 0.5 inch away from every side of the square. That leaves an inner 1-inch square in the middle of the 2-inch square. If the center falls in this inner region, the coin will not touch a grid line. Since the total area is 4 square inches and the area of the inner region is 1 square inch, the probability is 1/4.
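The geometric argument can be checked with a quick Monte Carlo simulation. This is only a sanity check under the stated assumption that the coin's center is uniformly distributed within a grid cell; the function name, seed and trial count are arbitrary choices.

```python
import random


def estimate_clear_landing(trials=200_000, cell=2.0, diameter=1.0, seed=0):
    """Estimate the chance that a randomly thrown coin touches no grid line."""
    rng = random.Random(seed)
    radius = diameter / 2
    clear = 0
    for _ in range(trials):
        # The grid repeats, so it is enough to place the center in one cell.
        x = rng.uniform(0, cell)
        y = rng.uniform(0, cell)
        # The coin is clear if its center is more than one radius from every side.
        if radius < x < cell - radius and radius < y < cell - radius:
            clear += 1
    return clear / trials


estimate = estimate_clear_landing()  # should land close to 1/4
```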

**Consider the following probability density function. What is the probability that X ≤ 6, i.e. P(X ≤ 6)?**

$f(x) = \frac{1}{8} e^{-x/8}$ for $x \ge 0$

A) 0.3935

B) 0.5276

C) 0.1341

D) 0.4724

Ans: (B)

To find the probability over a region of a probability density function, integrate the function between the bounds of that region.

Integrating the given function from 0 to 6 gives $P(X \le 6) = \int_0^6 \frac{1}{8} e^{-x/8}\,dx = 1 - e^{-6/8} \approx 0.5276$.
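The result can be verified numerically. The sketch below compares the closed-form value 1 - e^(-6/8) with a simple midpoint-rule integration of the density; the function names and step count are illustrative choices.

```python
import math


def pdf(x, scale=8.0):
    """Density f(x) = (1/scale) * exp(-x/scale) for x >= 0."""
    return math.exp(-x / scale) / scale


def cdf_closed_form(x, scale=8.0):
    """P(X <= x) obtained by integrating the density analytically."""
    return 1 - math.exp(-x / scale)


def cdf_midpoint_rule(x, steps=100_000, scale=8.0):
    """P(X <= x) approximated by summing thin rectangles under the density."""
    width = x / steps
    return sum(pdf((i + 0.5) * width, scale) for i in range(steps)) * width


exact = cdf_closed_form(6)     # about 0.5276
approx = cdf_midpoint_rule(6)  # agrees with the closed form to many decimals
```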

**Do gradient descent methods always converge to same point?**

No, they do not, because in some cases the method reaches a local minimum (a local optimum) rather than the global optimum. Where it ends up depends on the data and the starting conditions.
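A small sketch makes the dependence on the starting point concrete. The function f(x) = (x² - 1)² + 0.3x is an arbitrary example with two minima, and the learning rate and step count are assumptions chosen so that plain gradient descent settles in both cases.

```python
def gradient_descent(grad, start, learning_rate=0.01, steps=5000):
    """Plain gradient descent on a one-dimensional function."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


def grad_f(x):
    # Derivative of f(x) = (x**2 - 1)**2 + 0.3*x, which has two minima
    return 4 * x * (x ** 2 - 1) + 0.3


from_left = gradient_descent(grad_f, start=-2.0)   # settles near x = -1.04
from_right = gradient_descent(grad_f, start=2.0)   # settles near x = 0.96
```

Both runs stop where the gradient is zero, but at different points; only the left one is the global minimum of this example function.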

**During analysis, how do you treat missing values?**

First identify the variables that have missing values, then assess the extent of the missing values. If any patterns are identified, the analyst should concentrate on them, as they could lead to interesting and meaningful business insights. If no patterns are identified, the missing values can be substituted with mean or median values (imputation), or they can simply be ignored. There are various factors to consider when answering this question:

Understand the problem statement and the data before giving an answer; getting into the data is important. A default value can be assigned, which can be the mean, minimum or maximum value.

If it is a categorical variable, the missing value is assigned a default value.

If the data follow a normal distribution, use the mean value.

Whether missing values should be treated at all is another important point to consider. If 80% of the values for a variable are missing, you can answer that you would drop the variable instead of treating the missing values.
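The rules of thumb above can be sketched with the standard library. Here `None` marks a missing value, the 80% drop threshold follows the text, and using the most frequent category as the default for categorical data is an assumption made for this example; all function names are illustrative.

```python
from collections import Counter
from statistics import mean, median


def missing_fraction(values):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def impute_numeric(values, strategy="mean"):
    """Replace missing numbers with the mean or median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


def impute_categorical(values):
    """Replace missing categories with a default (here: the most frequent one)."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def treat_column(values, numeric=True, drop_threshold=0.8):
    """Return an imputed column, or None to signal that it should be dropped."""
    if missing_fraction(values) >= drop_threshold:
        return None
    return impute_numeric(values) if numeric else impute_categorical(values)
```

For example, `treat_column([1.0, None, 3.0])` fills the gap with the mean of the observed values, while a column that is 90% missing comes back as `None`.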

**Explain the Box-Cox transformation in regression models.**

For some reason or other, the response variable for a regression analysis might not satisfy one or more assumptions of an ordinary least squares regression. The residuals could either curve as the prediction increases or follow a skewed distribution. In such scenarios, it is necessary to transform the response variable so that the data meets the required assumptions. A Box-Cox transformation is a statistical technique for transforming a non-normal dependent variable into an approximately normal shape. Many statistical techniques assume normality, so applying a Box-Cox transformation to non-normal data means that you can run a broader number of tests.
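For reference, the Box-Cox transform of a positive value y with parameter λ is (y^λ - 1)/λ when λ ≠ 0 and log(y) when λ = 0. The sketch below applies it for a given λ; the function name is illustrative, and choosing λ (usually estimated from the data) is outside the scope of this example.

```python
import math


def box_cox(values, lam):
    """Apply the Box-Cox transform with parameter lam to positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # The limiting case of the formula as lam approaches 0
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


skewed = [1.0, 2.0, 4.0, 8.0, 16.0]   # hypothetical right-skewed response
logged = box_cox(skewed, 0)           # evenly spaced after the log transform
rooted = box_cox([1.0, 4.0, 9.0], 0.5)
```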

**Can you use machine learning for time series analysis?**

Yes, it can be used, but whether it is appropriate depends on the application.

**Write a function that takes in two sorted lists and outputs a sorted list that is their union.**

The first solution that will come to mind is to merge the two lists and sort them afterwards.

Python code-

```python
def return_union(list_a, list_b):
    # Concatenate the two lists, then let the built-in sort order the result.
    return sorted(list_a + list_b)
```

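R code-

```r
return_union <- function(list_a, list_b)
{
  # Flatten both lists into one vector, wrapped in a list
  list_c <- list(c(unlist(list_a), unlist(list_b)))
  # Reorder that vector and return it inside a list
  return(list(list_c[[1]][order(list_c[[1]])]))
}
```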

Generally, the tricky part of the question is not to use any sorting or ordering function. In that case you will have to write your own logic to answer the question and impress your interviewer.

Python code-

```python
def return_union(list_a, list_b):
    len1 = len(list_a)
    len2 = len(list_b)
    final_sorted_list = []
    j = 0  # index into list_b
    k = 0  # index into list_a
    for i in range(len1 + len2):
        if k == len1:
            # list_a is exhausted: append the rest of list_b
            final_sorted_list.extend(list_b[j:])
            break
        elif j == len2:
            # list_b is exhausted: append the rest of list_a
            final_sorted_list.extend(list_a[k:])
            break
        elif list_a[k] < list_b[j]:
            final_sorted_list.append(list_a[k])
            k += 1
        else:
            final_sorted_list.append(list_b[j])
            j += 1
    return final_sorted_list
```

A similar function can be written in R by following the same steps.

```r
return_union <- function(list_a, list_b)
{
  # Initializing length variables
  len_a <- length(list_a)
  len_b <- length(list_b)
  len <- len_a + len_b

  # Initializing counter variables
  j <- 1
  k <- 1

  # Creating an empty list whose length equals the sum of both lists
  list_c <- vector("list", len)

  # Here goes our for loop (seq_len also handles two empty inputs)
  for (i in seq_len(len))
  {
    if (j > len_a)
    {
      # list_a is exhausted: copy the rest of list_b
      list_c[i:len] <- list_b[k:len_b]
      break
    }
    else if (k > len_b)
    {
      # list_b is exhausted: copy the rest of list_a
      list_c[i:len] <- list_a[j:len_a]
      break
    }
    else if (list_a[[j]] <= list_b[[k]])
    {
      list_c[[i]] <- list_a[[j]]
      j <- j + 1
    }
    else
    {
      list_c[[i]] <- list_b[[k]]
      k <- k + 1
    }
  }

  return(list(unlist(list_c)))
}
```
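The merge logic in this answer keeps repeated values. If the interviewer means "union" in the set sense, where each value appears once, the same two-pointer idea can skip duplicates. The sketch below is one possible variant under that assumption; the function name is made up for the example.

```python
def sorted_union_unique(list_a, list_b):
    """Merge two sorted lists into one sorted list without repeated values."""
    result = []
    i = j = 0
    while i < len(list_a) or j < len(list_b):
        # Take from list_a when list_b is exhausted or list_a has the smaller head.
        if j == len(list_b) or (i < len(list_a) and list_a[i] <= list_b[j]):
            value = list_a[i]
            i += 1
        else:
            value = list_b[j]
            j += 1
        # Inputs are sorted, so a duplicate can only equal the last value added.
        if not result or result[-1] != value:
            result.append(value)
    return result


merged = sorted_union_unique([1, 2, 2, 5], [2, 3, 5, 7])
```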

**What is the difference between Bayesian Estimate and Maximum Likelihood Estimation (MLE)?**

In Bayesian estimation we have some prior knowledge about the data or problem (the prior). There may be several values of the parameters that explain the data, so we can consider multiple parameter values, such as 5 gammas and 5 lambdas. As a result, Bayesian estimation gives multiple models for making multiple predictions, i.e. one for each pair of parameters, all under the same prior. So, if a new example needs to be predicted, computing the weighted sum of these predictions serves the purpose.

Maximum likelihood does not take the prior into consideration, so it is like being a Bayesian who uses a flat prior.
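A coin-flip example illustrates the flat-prior remark. With a Beta(a, b) prior on the probability of heads, the most probable parameter value after seeing the data (the posterior mode) coincides with the maximum likelihood estimate when the prior is flat, i.e. Beta(1, 1). The Beta prior and the function names are assumptions chosen for this sketch.

```python
def mle(heads, flips):
    """Maximum likelihood estimate of the probability of heads."""
    return heads / flips


def map_estimate(heads, flips, a, b):
    """Posterior mode under a Beta(a, b) prior (needs heads + a > 1 and tails + b > 1)."""
    return (heads + a - 1) / (flips + a + b - 2)


def posterior_mean(heads, flips, a, b):
    """Posterior mean under a Beta(a, b) prior, a weighted average over all parameter values."""
    return (heads + a) / (flips + a + b)


heads, flips = 7, 10
flat = map_estimate(heads, flips, a=1, b=1)         # same as the MLE
informed = map_estimate(heads, flips, a=5, b=5)     # pulled towards 0.5 by the prior
averaged = posterior_mean(heads, flips, a=5, b=5)   # 12 / 20
```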

**What is Machine Learning?**

The simplest way to answer this question is: we give the data and an equation to the machine, and ask the machine to look at the data and identify the coefficient values in the equation.

For example, for the linear regression y = mx + c, we give the data for the variables x and y, and the machine learns the values of m and c from the data.
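One way the machine can "learn" m and c is gradient descent on the mean squared error. The sketch below is a minimal illustration under assumed settings (learning rate, step count, and toy data generated from y = 2x + 1), not the only way to fit the line.

```python
def learn_line(xs, ys, learning_rate=0.05, steps=2000):
    """Learn m and c in y = m*x + c by gradient descent on mean squared error."""
    m, c = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(m * x + c) - y for x, y in zip(xs, ys)]
        grad_m = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_c = 2 / n * sum(errors)
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c


# Toy data that follows y = 2x + 1 exactly
m, c = learn_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
```

Starting from m = 0 and c = 0, the learned values approach 2 and 1 for this data.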

**What is the difference between skewed and uniform distribution?**

When the observations in a dataset are spread equally across the range of the distribution, it is referred to as a uniform distribution. There are no clear peaks in a uniform distribution. Distributions that have more observations on one side of the graph than the other are referred to as skewed distributions. Distributions with fewer observations on the left (towards lower values) are said to be skewed left, and distributions with fewer observations on the right (towards higher values) are said to be skewed right.
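The direction of skew can be quantified with the moment-based skewness coefficient (the third central moment divided by the second central moment raised to the power 1.5): it is about zero for evenly spread data, positive for right-skewed data and negative for left-skewed data. The function name and the sample data are illustrative.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 using central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


evenly_spread = skewness([1, 2, 3, 4, 5, 6])    # no skew
right_skewed = skewness([1, 1, 2, 2, 3, 10])    # few observations at high values
left_skewed = skewness([1, 8, 9, 9, 10, 10])    # few observations at low values
```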

**You created a predictive model of a quantitative outcome variable using multiple regression. What are the steps you would follow to validate the model?**

Since the question is about the post-model-building exercise, we will assume that you have already tested the null hypothesis, multicollinearity and the standard errors of the coefficients.

Once you have built the model, you should check the following:

· Global F-test to see the significance of the group of independent variables on the dependent variable

· R^2

· Adjusted R^2

· RMSE, MAPE

In addition to the above-mentioned quantitative metrics, you should also check:

· Residual plot

· Assumptions of linear regression.
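Several of the quantitative checks listed above are straightforward to compute once predictions are available. The sketch below shows RMSE, MAPE, adjusted R^2 and the residuals used for a residual plot; the function names and the sample numbers are assumptions made for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (y_true must not contain zeros)."""
    n = len(y_true)
    return 100 * sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / n


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjust R^2 for the number of predictors used by the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


def residuals(y_true, y_pred):
    """Observed minus predicted values, the input for a residual plot."""
    return [y - p for y, p in zip(y_true, y_pred)]


# Hypothetical observed values and model predictions
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 310, 390]
error_rmse = rmse(y_true, y_pred)
error_mape = mape(y_true, y_pred)
adj_r2 = adjusted_r_squared(0.9, n_samples=11, n_predictors=2)
resid = residuals(y_true, y_pred)
```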