October 12, 2018

Sreekanth B

Dell Boomi Most Frequently Asked Data Science Interview Questions Answers

How is kNN different from kmeans clustering?

Don’t be misled by the ‘k’ in their names. The fundamental difference between these two algorithms is that kmeans is unsupervised and kNN is supervised: kmeans is a clustering algorithm, whereas kNN is a classification (or regression) algorithm.

The kmeans algorithm partitions a data set into clusters such that each cluster is homogeneous, i.e. the points within a cluster are close to each other, while keeping the clusters well separated from one another. Because the algorithm is unsupervised, the clusters have no labels.

The kNN algorithm classifies an unlabeled observation based on its k nearest neighbors (k can be any positive integer). It is also known as a lazy learner because it involves minimal training: it does not build a generalized model from the training data in advance, but stores the data and consults it when a prediction is needed.
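The contrast can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a production implementation: the function names, the toy points, and the naive "first k points" initialization for kmeans are all assumptions made for the example. Note that knn_predict needs labels, while kmeans never sees any.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Supervised: majority vote among the labels of the k nearest training points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def kmeans(points, k=2, iterations=10):
    """Unsupervised: returns a cluster index per point; no labels are used."""
    centroids = [points[i] for i in range(k)]  # naive initialization (assumption)
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return assignment

# Hypothetical toy data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, query=(2, 2)))  # uses the labels
print(kmeans(points, k=2))                        # ignores the labels
```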

How are True Positive Rate and Recall related? Write the equation.

True Positive Rate = Recall. They are the same quantity, with the formula TP / (TP + FN).
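As a quick illustration, the formula translates directly into code. The function name and the counts below are made up for the example.

```python
def recall(tp, fn):
    """True positive rate (recall): share of actual positives that were detected."""
    return tp / (tp + fn)

# Hypothetical confusion-matrix counts: 40 true positives, 10 false negatives.
print(recall(tp=40, fn=10))  # 40 / (40 + 10)
```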

You have built a multiple regression model. Your model R² isn’t as good as you wanted. For improvement, you remove the intercept term, and your model R² becomes 0.8 from 0.3. Is it possible? How?

Yes, it is possible. We need to understand the significance of the intercept term in a regression model. The intercept term is the model’s prediction when every independent variable is zero; a model with only an intercept predicts the mean of y. The formula is R² = 1 – ∑(y – y´)²/∑(y – ymean)², where y´ is the predicted value.

When the intercept term is present, R² evaluates your model with respect to the mean model. Without the intercept term, no such comparison can be made, and R² is computed as 1 – ∑(y – y´)²/∑(y)² instead. Because the denominator ∑(y)² is larger than ∑(y – ymean)², the ratio becomes smaller than it would otherwise be, resulting in a higher R².
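The effect is easy to reproduce. The sketch below fits a one-variable regression with and without an intercept on synthetic data and applies the two R² formulas from the text. The data, the sine-based "noise", and the helper names are assumptions chosen for illustration; the exact values depend on the data.

```python
import math

def fit_with_intercept(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def fit_through_origin(xs, ys):
    """Least squares for y = b*x (no intercept); returns b."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Synthetic data: y sits far from zero and depends only weakly on x.
xs = [1 + i / 100 for i in range(101)]
ys = [10 + 0.5 * x + math.sin(7 * i) for i, x in enumerate(xs)]
mean_y = sum(ys) / len(ys)

a, b = fit_with_intercept(xs, ys)
sse_with = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
r2_with = 1 - sse_with / sum((y - mean_y) ** 2 for y in ys)   # centered denominator

b0 = fit_through_origin(xs, ys)
sse_without = sum((y - b0 * x) ** 2 for x, y in zip(xs, ys))
r2_without = 1 - sse_without / sum(y * y for y in ys)          # uncentered denominator

print(f"with intercept:    R2 = {r2_with:.3f}")
print(f"without intercept: R2 = {r2_without:.3f}")
```

The model without an intercept fits the data worse (its residual sum of squares is larger), yet it reports the higher R², which is why the two numbers should not be compared directly.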

After analyzing the model, your manager has informed you that your regression model is suffering from multicollinearity. How would you check whether he is right? Without losing any information, can you still build a better model?

To check for multicollinearity, we can create a correlation matrix to identify and remove variables with a correlation above 75% (deciding on a threshold is subjective). In addition, we can calculate the VIF (variance inflation factor) to check for the presence of multicollinearity. A VIF value <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity. We can also use tolerance as an indicator of multicollinearity.

However, removing correlated variables might lead to a loss of information. To retain those variables, we can use penalized regression models such as ridge or lasso regression. We can also add some random noise to the correlated variables so that they become different from each other. But adding noise might affect the prediction accuracy, so this approach should be used with care.
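For intuition, here is a minimal sketch of the VIF check in the simplest case of exactly two predictors, where regressing one predictor on the other gives an R² equal to their squared Pearson correlation, so VIF = 1 / (1 – r²). With more predictors each variable has to be regressed on all the others, which this sketch does not do. The helper names and data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    """VIF when the model has only these two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical predictors: x2 is almost a multiple of x1, x3 is unrelated.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
x3 = [2, 5, 1, 6, 3, 4]

print(vif_two_predictors(x1, x2))  # far above the threshold of 10
print(vif_two_predictors(x1, x3))  # close to 1
```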

Some test scores follow a normal distribution with a mean of 18 and a standard deviation of 6. What proportion of test takers have scored between 18 and 24?

A) 20%
B) 22%
C) 34%
D) None of the above

Ans: (C)

Here we need to calculate the Z scores for the values 18 and 24, using the mean μ = 18 and the standard deviation σ = 6.

Z = (X – μ)/σ

For 18 as X,

Z = (18 – 18)/6 = 0. Looking at the Z table, we find that 50% of people have scores below 18.

For 24 as X,

Z = (24 – 18)/6 = 1. Looking at the Z table, we find that about 84% of people have scores below 24.

Therefore around 34% of people have scores between 18 and 24.
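The Z-table lookup can be double-checked with Python's standard library, which provides the normal cumulative distribution function. The variable names are only for illustration.

```python
from statistics import NormalDist

scores = NormalDist(mu=18, sigma=6)

# P(18 < X < 24) = P(X < 24) - P(X < 18)
proportion = scores.cdf(24) - scores.cdf(18)
print(f"{proportion:.4f}")  # about 0.3413
```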

A jar contains 4 marbles: 3 red and 1 white. Two marbles are drawn, with replacement after each draw. What is the probability that the same color marble is drawn twice?

A) 1/2
B) 1/3
C) 5/8
D) 1/8

Ans: (C)

Both marbles are the same color if both are red or both are white, so the probability is 3/4 * 3/4 + 1/4 * 1/4 = 10/16 = 5/8.
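The same arithmetic with exact fractions, as a small check (variable names are illustrative):

```python
from fractions import Fraction

p_red = Fraction(3, 4)
p_white = Fraction(1, 4)

# With replacement the two draws are independent, so probabilities multiply.
p_same_color = p_red * p_red + p_white * p_white
print(p_same_color)  # 5/8
```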

Which of the following events is most likely?

A) At least one 6, when 6 dice are rolled
B) At least 2 sixes when 12 dice are rolled
C) At least 3 sixes when 18 dice are rolled
D) All the above have same probability

Ans: (A)

The probability of a ‘6’ turning up in a single roll is P(6) = 1/6, and P(6’) = 5/6. Each event is the complement of rolling too few sixes, so the probabilities are:

Case 1: 1 – (5/6)⁶ = 0.6651

Case 2: 1 – (5/6)¹² – 12 * (1/6) * (5/6)¹¹ = 0.6187

Case 3: 1 – (5/6)¹⁸ – 18 * (1/6) * (5/6)¹⁷ – 153 * (1/6)² * (5/6)¹⁶ = 0.5973

Thus, the highest probability is Case 1.
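The three cases can be checked numerically with the binomial distribution: the probability of at least k sixes in n rolls is one minus the probability of fewer than k sixes. The function name below is an assumption for the example.

```python
from math import comb

def prob_at_least(k, n, p=1/6):
    """P(at least k successes in n independent trials with success probability p)."""
    fewer_than_k = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))
    return 1 - fewer_than_k

p_a = prob_at_least(1, 6)    # at least one 6 in 6 rolls
p_b = prob_at_least(2, 12)   # at least two 6s in 12 rolls
p_c = prob_at_least(3, 18)   # at least three 6s in 18 rolls

print(f"A: {p_a:.4f}  B: {p_b:.4f}  C: {p_c:.4f}")
```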

Suppose you were interviewed for a technical role. 50% of the people who sat for the first interview received the call for second interview. 95% of the people who got a call for second interview felt good about their first interview. 75% of people who did not receive a second call, also felt good about their first interview. If you felt good after your first interview, what is the probability that you will receive a second interview call?

A) 66%
B) 56%
C) 75%
D) 85%

Ans: (B)

Let’s assume 100 people took the first interview. Of these, 50 received a call for the second round, and 95% of them felt good about their first interview, which is 47.5 people. The other 50 did not receive a call, and 75% of them also felt good, which is 37.5 people. Thus, the total number of people who felt good after their first interview is 47.5 + 37.5 = 85. Out of these 85 people, only 47.5 received a call for the next round. Hence, the probability of success is 47.5/85 = 0.558.

Another, more standard way to solve this problem is Bayes’ theorem. I leave it to you to check for yourself.
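For readers who want to compare their result, here is a short sketch of the Bayes' theorem calculation using the numbers from the question. The variable names are illustrative.

```python
p_call = 0.50                 # P(second interview call)
p_good_given_call = 0.95      # P(felt good | call)
p_good_given_no_call = 0.75   # P(felt good | no call)

# Law of total probability: P(felt good)
p_good = p_good_given_call * p_call + p_good_given_no_call * (1 - p_call)

# Bayes' theorem: P(call | felt good)
p_call_given_good = p_good_given_call * p_call / p_good
print(f"{p_call_given_good:.3f}")  # 47.5 / 85, roughly 56%
```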

When is Ridge regression favorable over Lasso regression?

You can quote ISLR’s authors Hastie and Tibshirani, who asserted that in the presence of a few variables with medium or large effects, lasso regression should be used, while in the presence of many variables with small or medium effects, ridge regression should be used.

Conceptually, lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression (L2) only does parameter shrinkage and ends up including all the coefficients in the model. In the presence of correlated variables, ridge regression might be the preferred choice. Ridge regression also works best in situations where the least squares estimates have high variance. Therefore, the choice depends on our model objective.
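The difference between shrinkage and selection shows up clearly in a simplified special case (as many observations as predictors, an identity design matrix, and no intercept), where both estimators have a closed form in terms of the least squares estimate: ridge divides every coefficient by 1 + λ, while lasso subtracts λ/2 and sets anything smaller than that to exactly zero. The sketch below assumes that special case only; the function names and numbers are illustrative, and general problems need a numerical solver.

```python
def ridge_coef(ls_estimate, lam):
    """Ridge in the identity-design special case: proportional shrinkage."""
    return ls_estimate / (1 + lam)

def lasso_coef(ls_estimate, lam):
    """Lasso in the identity-design special case: soft thresholding at lam / 2."""
    if ls_estimate > lam / 2:
        return ls_estimate - lam / 2
    if ls_estimate < -lam / 2:
        return ls_estimate + lam / 2
    return 0.0

# Hypothetical least squares estimates: two large effects, two small ones.
estimates = [3.0, -2.0, 0.3, -0.1]
lam = 1.0

print([ridge_coef(b, lam) for b in estimates])  # all shrunk, none exactly zero
print([lasso_coef(b, lam) for b in estimates])  # small effects become exactly zero
```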

Rise in global average temperature led to decrease in number of pirates around the world. Does that mean that decrease in number of pirates caused the climate change?

After reading this question, you should have understood that this is a classic case of “causation and correlation”. No, we can’t conclude that the decrease in the number of pirates caused the climate change, because there might be other factors (lurking or confounding variables) influencing this phenomenon.

Therefore, there might be a correlation between global average temperature and the number of pirates, but based on this information we can’t say that either one caused the other.

While working on a data set, how do you select important variables? Explain your methods.

Following are the methods of variable selection you can use:

Remove the correlated variables prior to selecting important variables

Use linear regression and select variables based on p values

Use Forward Selection, Backward Selection, Stepwise Selection

Use Random Forest, Xgboost and plot variable importance chart

Use Lasso Regression

Measure information gain for the available set of features and select top n features accordingly.
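As an example of the last method, information gain for a categorical feature can be computed as the entropy of the target minus the weighted entropy of the target within each feature value. This is a minimal sketch; the function names and the toy data are assumptions for illustration.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical data: one feature separates the classes perfectly, one not at all.
target = [1, 1, 0, 0]
features = {
    "informative": ["a", "a", "b", "b"],
    "useless": ["a", "b", "a", "b"],
}

# Rank features by information gain, highest first.
ranked = sorted(features,
                key=lambda name: information_gain(features[name], target),
                reverse=True)
print(ranked)
```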
