**‘People who bought this, also bought…’ recommendations seen on Amazon are a result of which algorithm?**

The basic idea for this kind of recommendation engine comes from collaborative filtering.

Collaborative filtering algorithms consider user behavior when recommending items. They exploit the behavior of other users in terms of transaction history, ratings, selections and purchase information. Other users' behavior and preferences over the items are used to recommend items to new users. In this approach, the features of the items do not need to be known.
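
As a rough illustration of the "also bought" idea, the sketch below counts how often other items appear in the same purchase history as a given item. The `purchases` data and the `also_bought` helper are made up for this example; real recommendation engines use far more refined variants of collaborative filtering.

```python
from collections import Counter

# Hypothetical purchase history: user -> set of items bought.
purchases = {
    "u1": {"book", "lamp", "pen"},
    "u2": {"book", "pen"},
    "u3": {"book", "lamp"},
    "u4": {"pen", "mug"},
}

def also_bought(item, history, top_n=2):
    """Rank other items by how many users bought them together with `item`."""
    counts = Counter()
    for basket in history.values():
        if item in basket:
            counts.update(basket - {item})
    # Sort by count (descending), then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]

recommendations = also_bought("book", purchases)  # ['lamp', 'pen']
```

Note that only user behavior is used here; nothing about the items themselves (price, category, description) enters the calculation.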

**What do you understand by Type I vs Type II error ?**

A Type I error is committed when the null hypothesis is true and we reject it; it is also known as a ‘False Positive’. A Type II error is committed when the null hypothesis is false and we fail to reject it; it is also known as a ‘False Negative’.

In the context of a confusion matrix, a Type I error occurs when we classify a value as positive (1) when it is actually negative (0). A Type II error occurs when we classify a value as negative (0) when it is actually positive (1).
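
A small sketch of how the two error types can be counted from labels and predictions. The `actual` and `predicted` lists are invented for illustration.

```python
def error_counts(actual, predicted):
    """Return (Type I count, Type II count) for binary 0/1 labels."""
    # Type I: predicted positive (1) but actually negative (0).
    type_1 = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    # Type II: predicted negative (0) but actually positive (1).
    type_2 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return type_1, type_2

actual    = [0, 0, 1, 1, 0, 1]
predicted = [1, 0, 1, 0, 0, 0]
type_1, type_2 = error_counts(actual, predicted)  # (1, 2)
```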

**You are working on a classification problem. For validation purposes, you’ve randomly sampled the training data set into train and validation. You are confident that your model will work incredibly well on unseen data since your validation accuracy is high. However, you are shocked to get poor test accuracy. What went wrong?**

For a classification problem, we should use stratified sampling instead of simple random sampling, especially when the classes are imbalanced. Random sampling doesn’t take into consideration the proportion of the target classes, so the validation set can end up with a different class mix from the data the model will face. Stratified sampling, by contrast, maintains the distribution of the target variable in the resulting samples.
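
The following is a minimal sketch of a stratified split written with the standard library only. The function name `stratified_split`, the 25% validation fraction and the toy `labels` list are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, valid_fraction=0.25, seed=0):
    """Return (train_idx, valid_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_valid = round(len(indices) * valid_fraction)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

labels = [0] * 16 + [1] * 4   # imbalanced toy data: 80% class 0, 20% class 1
train_idx, valid_idx = stratified_split(labels)
```

Both resulting subsets keep the 80/20 class mix of the toy data. In scikit-learn, `train_test_split` accepts a `stratify` argument for the same purpose.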

**You have been asked to evaluate a regression model based on R², adjusted R² and tolerance. What will be your criteria?**

Tolerance (1 / VIF) is used as an indicator of multicollinearity. It indicates the proportion of variance in a predictor that cannot be accounted for by the other predictors. Large values of tolerance are desirable.

We will consider adjusted R² as opposed to R² to evaluate model fit, because R² never decreases as we add more variables, irrespective of any improvement in prediction accuracy. Adjusted R², on the other hand, only increases if an additional variable improves the model by more than would be expected by chance; otherwise it stays the same or decreases. It is difficult to commit to a general threshold value for adjusted R² because it varies between data sets. For example, a gene mutation data set might result in a lower adjusted R² and still provide fairly good predictions, whereas for stock market data a lower adjusted R² implies that the model is not good.
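
To make the two criteria concrete, here is a short sketch using the standard formulas: adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), and tolerance = 1 − R²ⱼ = 1/VIF, where R²ⱼ comes from regressing predictor j on the other predictors. The sample sizes and R² values below are hypothetical.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

def tolerance(r2_j):
    """Tolerance of predictor j = 1 - R²_j, which equals 1 / VIF_j."""
    return 1 - r2_j

# Hypothetical case: adding 5 predictors nudges R² from 0.700 to 0.705.
before = adjusted_r2(0.700, n_samples=50, n_predictors=3)  # about 0.680
after = adjusted_r2(0.705, n_samples=50, n_predictors=8)   # about 0.647
```

R² went up slightly, but adjusted R² went down, which suggests the extra predictors did not earn their place in the model.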

**In k-means or kNN, we use Euclidean distance to calculate the distance between nearest neighbors. Why not Manhattan distance?**

Manhattan distance sums the absolute differences along each coordinate axis, so it measures distance along horizontal and vertical moves only. Euclidean distance is the straight-line distance between two points and treats every direction equally. Both metrics are defined in any number of dimensions, but since data points can lie in any direction from one another in the feature space, Euclidean distance is usually the more viable default.

Example: Think of a chess board. The movement made by a rook is measured by Manhattan distance because it moves only vertically and horizontally; a bishop, by contrast, moves diagonally.
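
A quick sketch comparing the two metrics on a pair of 2-D points. The helper names and the sample points are chosen for illustration only.

```python
import math

def euclidean(p, q):
    """Straight-line distance between points p and q."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences (axis-aligned moves only)."""
    return sum(abs(a - b) for a, b in zip(p, q))

a, b = (0, 0), (3, 4)
d_euclid = euclidean(a, b)     # 5.0
d_manhattan = manhattan(a, b)  # 7
```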

**What is the difference between Supervised Learning and Unsupervised Learning?**

If an algorithm learns from labeled training data, where each example comes with a known response variable, so that the knowledge can be applied to unseen test data, it is referred to as supervised learning. Classification is an example of supervised learning. If the algorithm has no response variable to learn from and must find structure in the data on its own, it is referred to as unsupervised learning. Clustering is an example of unsupervised learning.

**What is the goal of A/B Testing?**

It is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify which changes to a web page maximize or increase an outcome of interest. An example is comparing the click-through rate of two versions of a banner ad.

**What is an Eigenvalue and Eigenvector?**

Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching. The eigenvalue is the strength of the transformation in the direction of its eigenvector, that is, the factor by which the stretching or compression occurs.
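
The defining relationship is A·v = λ·v. The sketch below checks it for a small symmetric matrix chosen for this example, using plain Python lists rather than a linear algebra library.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

A = [[2, 1],
     [1, 2]]

v1, lambda1 = [1, 1], 3    # direction that A stretches by a factor of 3
v2, lambda2 = [1, -1], 1   # direction that A leaves unchanged

stretched = mat_vec(A, v1)  # [3, 3], which is 3 * v1
unchanged = mat_vec(A, v2)  # [1, -1], which is 1 * v2
```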

**How can outlier values be treated?**

Outlier values can be identified by using univariate or any other graphical analysis method. If the number of outlier values is small, they can be assessed individually, but for a large number of outliers the values can be substituted with either the 99th or the 1st percentile values. Not all extreme values are outliers. The most common ways to treat outlier values are:

1) To change the value and bring it within a range

2) To just remove the value.
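
The percentile substitution described above can be sketched as follows. This version uses a simple nearest-rank percentile, which is an assumption of the example; other percentile conventions give slightly different cut-offs. The `values` list is synthetic.

```python
def cap_outliers(values, lower_pct=1, upper_pct=99):
    """Clip values to the [lower_pct, upper_pct] percentile range.

    Percentiles must be integers; nearest-rank definition is used.
    """
    ordered = sorted(values)
    n = len(ordered)

    def percentile(pct):
        # Integer ceiling of pct * n / 100, avoiding float rounding issues.
        rank = max(1, -(-pct * n // 100))
        return ordered[rank - 1]

    low, high = percentile(lower_pct), percentile(upper_pct)
    return [min(max(v, low), high) for v in values]

# Synthetic data: 1..198 plus one very low and one very high value.
values = [-500] + list(range(1, 199)) + [1000]
capped = cap_outliers(values)
```

Here the value -500 is raised to the 1st percentile (1) and 1000 is lowered to the 99th percentile (197), while the bulk of the data is unchanged.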

**How can you assess a good logistic model?**

There are various methods to assess the results of a logistic regression analysis:

• Using a classification matrix (confusion matrix) to look at the true negatives and false positives.

• Concordance helps identify the ability of the logistic model to differentiate between the event happening and not happening.

• Lift helps assess the logistic model by comparing it with random selection.
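
As an illustration of the last two measures, the sketch below computes concordance as the share of (event, non-event) pairs in which the event received the higher predicted score, and lift as the event rate among the top-scored cases divided by the overall event rate. The function names, labels and scores are hypothetical.

```python
def concordance(actual, scores):
    """Share of (event, non-event) pairs where the event scores higher."""
    events = [s for a, s in zip(actual, scores) if a == 1]
    non_events = [s for a, s in zip(actual, scores) if a == 0]
    concordant = sum(1 for e in events for n in non_events if e > n)
    return concordant / (len(events) * len(non_events))

def lift_at_top(actual, scores, top_n):
    """Event rate in the top_n highest-scored cases vs. the overall rate."""
    ranked = sorted(zip(scores, actual), reverse=True)
    top_rate = sum(a for _, a in ranked[:top_n]) / top_n
    base_rate = sum(actual) / len(actual)
    return top_rate / base_rate

actual = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]   # hypothetical predicted probabilities

c = concordance(actual, scores)        # 5 of 6 pairs are concordant
lift = lift_at_top(actual, scores, 2)  # 0.5 / 0.4 = 1.25
```

A lift above 1 means the model's top-ranked cases contain more events than a random selection of the same size would.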

**What are various steps involved in an analytics project?**

• Understand the business problem

• Explore the data and become familiar with it.

• Prepare the data for modelling by detecting outliers, treating missing values, transforming variables, etc.

• After data preparation, start running the model, analyse the result and tweak the approach. This is an iterative step till the best possible outcome is achieved.

• Validate the model using a new data set.

• Start implementing the model and track the result to analyse the performance of the model over the period of time.

**How can you iterate over a list and also retrieve element indices at the same time?**

This can be done using the built-in enumerate function, which takes any iterable, such as a list, and yields each element paired with its index, with the index placed first.
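
A short example, with a made-up `fruits` list:

```python
fruits = ["apple", "banana", "cherry"]

# Each item is an (index, element) pair; indices start at 0 by default.
indexed = list(enumerate(fruits))
# [(0, 'apple'), (1, 'banana'), (2, 'cherry')]

# The optional `start` argument changes the first index.
numbered = [f"{position}. {fruit}" for position, fruit in enumerate(fruits, start=1)]
# ['1. apple', '2. banana', '3. cherry']
```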

**A roulette wheel has 38 slots, 18 are red, 18 are black, and 2 are green. You play five games and always bet on red. What is the probability that you win all the 5 games?**

A) 0.0368

B) 0.0238

C) 0.0526

D) 0.0473

Ans: (B)

The probability of red on any single spin is 18/38. You play 5 games, and the games are independent of each other, so the probability that you win all of them is (18/38)⁵ ≈ 0.0238.
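
The arithmetic can be checked with exact fractions:

```python
from fractions import Fraction

p_red = Fraction(18, 38)       # probability of red on one spin
p_five_wins = p_red ** 5       # five independent spins, all red
approx = round(float(p_five_wins), 4)  # 0.0238
```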

**Let A and B be events on the same sample space, with P(A) = 0.6 and P(B) = 0.7. Can these two events be disjoint?**

A) Yes

B) No

Ans: (B)

These two events cannot be disjoint because P(A) + P(B) > 1.

P(A∪B) = P(A) + P(B) − P(A∩B).

Two events are disjoint if P(A∩B) = 0. If A and B were disjoint, then P(A∪B) = 0.6 + 0.7 = 1.3.

Since a probability cannot be greater than 1, these two events cannot be disjoint.

**Alice has 2 kids and one of them is a girl. What is the probability that the other child is also a girl?**

You can assume that there are an equal number of males and females in the world.

A) 0.5

B) 0.25

C) 0.333

D) 0.75

Ans: (C)

The outcomes for two kids can be {BB, BG, GB, GG}

Since it is mentioned that one of them is a girl, we can remove the BB option from the sample space. The remaining sample space has 3 equally likely options, {BG, GB, GG}, and only one of them, GG, has both children as girls. Therefore the probability that the other child is also a girl is 1/3.
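
The same reasoning can be reproduced by enumerating the sample space, reading "one of them is a girl" as "at least one is a girl", as the answer does:

```python
from fractions import Fraction
from itertools import product

# All equally likely (first child, second child) outcomes.
outcomes = list(product("BG", repeat=2))

# Condition: at least one of the two children is a girl.
at_least_one_girl = [o for o in outcomes if "G" in o]
both_girls = [o for o in at_least_one_girl if o == ("G", "G")]

probability = Fraction(len(both_girls), len(at_least_one_girl))  # 1/3
```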

**Which of the following options cannot be the probability of any event?**

A) -0.00001

B) 0.5

C) 1.001

A) Only A

B) Only B

C) Only C

D) A and B

E) B and C

F) A and C

Ans: (F)

A probability always lies between 0 and 1, so -0.00001 and 1.001 cannot be the probability of any event.

**Anita randomly picks 4 cards from a deck of 52 cards and places them back into the deck (any set of 4 cards is equally likely). Then, Babita randomly chooses 8 cards out of the same deck (any set of 8 cards is equally likely). Assume that the choice of 4 cards by Anita and the choice of 8 cards by Babita are independent. What is the probability that all 4 cards chosen by Anita are in the set of 8 cards chosen by Babita?**

A) 48C4 x 52C4

B) 48C4 x 52C8

C) 48C8 x 52C8

D) None of the above

Ans: (A)

The total number of possible combinations is 52C4 (for Anita selecting 4 cards) x 52C8 (for Babita selecting 8 cards).

Since the 4 cards that Anita chooses must be among the 8 cards that Babita chooses, the number of favorable combinations is 52C4 (for the 4 cards selected by Anita) x 48C4 (for the other 4 cards selected by Babita, since the 4 cards selected by Anita are common). Option A gives this favorable count; dividing it by the total number of combinations gives the probability, 48C4 / 52C8.
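
The counts above can be evaluated with `math.comb` (available in Python 3.8 and later):

```python
from fractions import Fraction
from math import comb

# Favorable: Anita's 4 cards, times Babita's 4 remaining cards from the other 48.
favorable = comb(52, 4) * comb(48, 4)
# Total: any 4 cards for Anita, times any 8 cards for Babita.
total = comb(52, 4) * comb(52, 8)

probability = Fraction(favorable, total)  # simplifies to 48C4 / 52C8
approx = float(probability)               # roughly 0.00026
```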

**Question Context:**

A player is randomly dealt a sequence of 13 cards from a deck of 52 cards. All sequences of 13 cards are equally likely. In an equivalent model, the cards are chosen and dealt one at a time. When choosing a card, the dealer is equally likely to pick any of the cards that remain in the deck.