1
Question, Answer(s), Asker Name, Asker Email
2
Where are earlier recorded sessions posted, please?
They will be on the My Courses page.
Go to My Courses, and you will find them under topics 10 and 11.
AG
abhijeetgadgil@gmail.com
3
Thanks
AG
abhijeetgadgil@gmail.com
4
I could not find yesterday’s session under topic 11 (Session-3 Recording)
Please send us an email with a screenshot, and we will look into it.
Please check under topic 10:
https://cloudxlab.com/assessment/displayslide/5014/understanding-and-visualisation-of-data?course_id=84&playlist_id=464
Kajal
kajalchatterjee@gmail.com
5
Sure, will do.
Kajal
kajalchatterjee@gmail.com
6
Seed function
The seed method is used to initialize the pseudorandom number generator in Python. The random module uses the seed value as a base to generate random numbers. If no seed value is given, it uses the current system time (or an operating-system randomness source when one is available). If you provide the same seed value before generating random data, it will produce the same data.
VED
parmarvedpro5@gmail.com
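A minimal sketch of the behaviour described above, using Python's built-in random module. The seed value 42 and the variable names are arbitrary choices for illustration.

```python
import random

# Seeding with the same value restarts the generator from the same state.
random.seed(42)
first_run = [random.random() for _ in range(3)]

random.seed(42)
second_run = [random.random() for _ in range(3)]

# Calling seed() with no argument reseeds from the OS or the system time,
# so the numbers that follow will usually differ from run to run.
random.seed()
unseeded_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True: same seed, same sequence
```

The same idea applies to other generators that accept a seed: fix the seed when you want a shuffle or split to be repeatable.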
7
Got it. Thank you.
Kajal
kajalchatterjee@gmail.com
8
How do we determine the values of the parameters of the seed function?
You can check all the methods of the random library at the link below:

https://docs.python.org/3/library/random.html
VED
parmarvedpro5@gmail.com
9
OK
VED
parmarvedpro5@gmail.com
10
Will we get the same output as you if we use 42 in seed?
Yes
Nishant Singh
nishant1695@gmail.com
11
What are we trying to achieve with randomness?
Randomness is for shuffling the data.
Karan K
karankarnik47@gmail.com
12
Is there a shortcut to open the API doc from a line of code?
You can use help(<func>) in Jupyter, e.g. help(pd.read_csv).
jia sharma
jiavidhi.sharma@gmail.com
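A short sketch of the help() approach mentioned above. It uses random.seed from the standard library as the example target so that it runs without pandas installed; any function can be passed in the same way.

```python
import random

# help() prints the signature and docstring of any object.
help(random.seed)

# The docstring itself is also available as a plain string.
seed_doc = random.seed.__doc__
print(seed_doc.splitlines()[0])
```

In Jupyter, appending a question mark to a name (for example pd.read_csv?) should show similar information in a side pane.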
13
How do we know which function is part of which library, e.g. which one belongs to numpy and which one belongs to pandas?
Once you start practicing, you will know the common ones, as you will be using them frequently. If you need anything special, you can always search on Google.
Hemanta Lenka
hemanta.lenka@gmai.com
14
Why do we compare the contents of the test data and the train data?
VED
parmarvedpro5@gmail.com
15
Is there a more extensive course for Pandas? The prerequisites in the course have brief content about DataFrame, but it seems rather limited compared to the coverage of the overall Pandas topic.
Rohit Arora
rohit.arora@creation-tec.com
16
How do we decide which existing features to use to engineer a new feature? Is it intuition based?
You may have to iterate over various combinations, evaluate the performance of each, and decide which one to choose. That is how you can choose. - Prof Durga
Nishant Singh
nishant1695@gmail.com
17
Why is data engineering required here?
DIvya Pathak
dev.feb88@gmail.com
18
?
DIvya Pathak
dev.feb88@gmail.com
19
Can you repeat the drop() part at line
Nitin Nigam
nknigam@gmail.com
20
#113
Nitin Nigam
nknigam@gmail.com
21
The data we were looking at, was that not clean data?
So far the data is not clean; we will clean the data.
AG
abhijeetgadgil@gmail.com
22
Higher side, as in is there any percentage we take into account?
Sugandhita
sugandhitap@gmail.com
23
Sir, don't data cleaning and tidying come before creating the train and test data sets?
Sanjeeb Bose
sanjeeb.bose@oracle.com
24
Instead of dropping, can we not assign it some default value?
Chinmay Athavale
chinmayat@gmail.com
25
We created the train_set and test_set data structures. The incomplete rows that Prof is showing now as an example have to be finally taken out of the train_set data structure, right?
AG
abhijeetgadgil@gmail.com
26
Observations and predictors aren’t the same?
Observations are input data records (rows). Predictors are variables used for prediction (columns).
Srihari M
srihariblr12@gmail.com
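The row/column distinction above can be sketched with a tiny table in plain Python. The column names and values here are invented for illustration and are not taken from the session's data.

```python
# Each dict is one observation (a row); each key is one predictor (a column).
# The column names and values are invented for illustration.
observations = [
    {"median_income": 8.3, "total_rooms": 880, "housing_median_age": 41},
    {"median_income": 7.2, "total_rooms": 1467, "housing_median_age": 52},
    {"median_income": 5.6, "total_rooms": 1274, "housing_median_age": 52},
]

predictors = list(observations[0].keys())

print(len(observations), "observations")
print(len(predictors), "predictors:", predictors)
```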
27
so that all rows of housing are deleted
Dr. Santosh Kumar
ksantosh.11@gmail.com
28
How do we do data cleaning in the case of categorical values?
DIvya Pathak
dev.feb88@gmail.com
29
As you just explained, the median is NOT sensitive to outliers. What if we have a data set such as 5, 6, 8, 10, 7, 8, 9, 200? In this case I guess it is affected by the outlier (200).
Manoj Kumar
manoj.gupta.91@gmail.com
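The numbers in this question can be checked with Python's statistics module. With this data the median stays at 8 whether or not 200 is included, while the mean moves from about 7.57 to 31.625. The variable names below are chosen for this example only.

```python
import statistics

data = [5, 6, 8, 10, 7, 8, 9, 200]
without_outlier = [x for x in data if x != 200]

median_with = statistics.median(data)
median_without = statistics.median(without_outlier)
mean_with = statistics.mean(data)
mean_without = statistics.mean(without_outlier)

print("median:", median_without, "->", median_with)  # stays at 8
print("mean:  ", mean_without, "->", mean_with)      # jumps to 31.625
```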
30
We saw that total_bedroom has very low correlation. Why do we need to correct the column? We probably need to drop those irrelevant columns.
Nitin Nigam
nknigam@gmail.com
31
Won't imputation create an overfitting scenario?
Aakash Sinha
post2aakash@gmail.com
32
Can we not use numerical translation for categorical values?
Vikas Bhartiya
ghivikas@gmail.com
33
If we remove categorical attributes before imputation, do we merge the data frame back into the main one after imputation so that we don't miss the categorical values in the data set? The categorical attributes may still be important for making predictions.
Preedesh M
Preedesh@Gmail.com
34
I mean merge the imputed dataframe into the original
Preedesh M
Preedesh@Gmail.com
35
Still not completely clear about stratified sampling. Can you point to some resources? Thanks
Karan K
karankarnik47@gmail.com
36
What is the difference between imputer.fit(housing_num) and X = imputer.transform(housing_num)?
With fit, the imputer calculates the means of the columns from some data; with transform, it applies those means to some data (which is just replacing missing values with the means). - Prof Durga
imputer.fit takes the data in and analyses it. It does not transform the data. imputer.transform then uses the results of that analysis to transform the data it is given.
"Means" is one example of what is computed during the analysis and then used for the transform.
Nitin Nigam
nknigam@gmail.com
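The fit/transform split described above can be sketched in plain Python. MeanImputer below is a hypothetical stand-in written for this example, not scikit-learn's implementation; it uses None for missing values and the column mean as the learned statistic.

```python
class MeanImputer:
    """Hypothetical stand-in that mirrors the fit/transform split; not scikit-learn's code."""

    def fit(self, rows):
        # Learn one mean per column, ignoring missing values (None).
        n_columns = len(rows[0])
        self.means_ = []
        for col in range(n_columns):
            present = [row[col] for row in rows if row[col] is not None]
            self.means_.append(sum(present) / len(present))
        return self  # the data itself is left untouched

    def transform(self, rows):
        # Apply the stored means: replace each missing value with its column mean.
        return [
            [self.means_[col] if value is None else value
             for col, value in enumerate(row)]
            for row in rows
        ]


# Toy data invented for illustration; None marks a missing value.
train_rows = [[1.0, 10.0], [3.0, None], [None, 30.0]]
new_rows = [[None, None]]

imputer = MeanImputer().fit(train_rows)
filled_train = imputer.transform(train_rows)
filled_new = imputer.transform(new_rows)

print(imputer.means_)  # [2.0, 20.0], learned by fit
print(filled_train)
print(filled_new)      # [[2.0, 20.0]], filled by transform
```

Keeping the two steps separate means the statistics learned from one data set can later be applied to different data, as with new_rows here.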
37
Cool
Aakash Sinha
post2aakash@gmail.com
38
imputer.fit(housing_num) - Does this impute all numerical columns with the median value where there are NaN values?
Srini Boddu
siliconfish@yahoo.com
39
Bins = labels = categories can be imputed, but if you use numerical values like 1, 2, 3, 4, 5 then it is not a good idea to impute, right?
AG
abhijeetgadgil@gmail.com
40
A recorded video of every class will be available before the next day of the session. Instructors will keep adding Slides, Questions, and Projects on a weekly basis.
Where can I find the Questions and Projects, sir, as mentioned above?
They will be updated in the LMS.
sshrivastava
sshrivastava@imtnag.ac.in
41
Is it recommended to apply the OrdinalEncoder object to fit & transform the test data as well?
Prakhar Prasad
prakhar.prasad@gmail.com
42
When, Sir?
sshrivastava
sshrivastava@imtnag.ac.in
43
I can't find any questions or projects to date.
They are available under the slides on the session page.
sshrivastava
sshrivastava@imtnag.ac.in
44
Can we do PCA after encoding?
Suppose we have 50 categories; it will create 50 more columns. How do we deal with that?
We can't control the number of categories if the column has a strong relationship.
Arpit vijaywargiya
arpitvw16@gmail.com
45
Sir, does it come under the Machine Learning Specialization?
sshrivastava
sshrivastava@imtnag.ac.in
46
Please suggest
sshrivastava
sshrivastava@imtnag.ac.in
47
Although my course is Artificial Intelligence Deep Learning IIT Roorkee
sshrivastava
sshrivastava@imtnag.ac.in
48
Which one is usually used, StandardScaler or MinMaxScaler, and will there be a difference in model performance due to the scaler chosen?
Prakhar Prasad
prakhar.prasad@gmail.com
49
Is it necessary to create a CombinedAttributesAdder class for newly added features?
Nitin Nigam
nknigam@gmail.com
50
Can't we use the inbuilt transform function?
Nitin Nigam
nknigam@gmail.com
51
Since we already added to df
Nitin Nigam
nknigam@gmail.com
52
What was , 16
AG
abhijeetgadgil@gmail.com
53
Correct me if I'm wrong, but the transformer that we built is just automating the feature engineering process and appending it to our data set, right?
Nishant Singh
nishant1695@gmail.com
54
Difference between data fit vs transform, with an example?
kunal upadhyay
kupadhy@gmail.com
55
But we will be using any one transformation on our data, right?
What is the benefit of combining?
Dr. Santosh Kumar
ksantosh.11@gmail.com
56
I mean 16 in
AG
abhijeetgadgil@gmail.com
57
(16512, 16)
AG
abhijeetgadgil@gmail.com
58
Will PCA affect encoding?
Arpit vijaywargiya
arpitvw16@gmail.com
59
Can we get some study material to understand the transform concept for numerical and categorical variables, or some reference links?
You can use the O’Reilly book for reference.
jia sharma
jiavidhi.sharma@gmail.com
60
Suppose we apply PCA on encoded data. How will it affect the model?
Arpit vijaywargiya
arpitvw16@gmail.com
61
Please provide some study material or more references.
Refer to the O’Reilly book which we referred to earlier.
Sarbjit Singh
ssingh@imtnag.ac.in
62
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
SUVAIN G
brusuvain@gmail.com
63
As PCA is a projection of data onto eigenvectors, how will it lose the information of the encoded data?
Arpit vijaywargiya
arpitvw16@gmail.com
64
Can you please explain the above code lines once again? These are from CombinedAttributesAdder
SUVAIN G
brusuvain@gmail.com
65
Can you please explain the last 2 code lines once again, from CombinedAttributesAdder?
SUVAIN G
brusuvain@gmail.com
66
Or, instead of one-hot, if we use ordinal encoding and then use PCA,
Arpit vijaywargiya
arpitvw16@gmail.com
67
so information will be projected as per categories
Arpit vijaywargiya
arpitvw16@gmail.com
68
Thanks
Sarbjit Singh
ssingh@imtnag.ac.in
69
When should we build a pipeline?
anantpadmanabh divanji
apgd14@gmail.com
70
So should we make it a habit, as beginners, to incorporate transformers and pipelines in our models at this point of time? What would you suggest?
Nishant Singh
nishant1695@gmail.com
71
Which one should we go for first: missing value analysis or encoding the categorical variables?
anantpadmanabh divanji
apgd14@gmail.com
72
Should we follow the same steps as you are teaching, as beginners?
Try to follow. If you fall behind, try later using the video and slides.
We recommend that you try along as far as possible. You may not understand everything, but you can run it as is, without any edits.
VED
parmarvedpro5@gmail.com
73
I was reading about sparse matrices and it was mentioned that they are usually time consuming to work with owing to very few non-zero values. Are there ways to structure them better?
Puneet Rastogi
puneetrstg@gmail.com
74
What do we have in housing_labels?
Nitin Nigam
nknigam@gmail.com
75
Why did we use a decision tree here?
AG
abhijeetgadgil@gmail.com
76
Sir, what does CV=10 mean?
Sanjeeb Bose
sanjeeb.bose@oracle.com
77
10 rows for cross validation?
Sanjeeb Bose
sanjeeb.bose@oracle.com
78
I didn't get the concept of negative mean squared error. Could you please repeat it?
VED
parmarvedpro5@gmail.com
79
Why can't I use my training set for cross validation?
Because then the hyperparameters that you get from cross-validation might be overfitted to your training set, and might not perform well on the test set.

So we want to tune hyperparameters on data different from the training set, which is why we do it on the cross-validation set.
Nini Nursiah
nursiah.neelesh28@gmail.com
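One common way to get validation data that differs from the data used for fitting is k-fold cross-validation: the training set is cut into k folds, and each fold takes one turn as the held-out validation part. A minimal sketch in plain Python follows; k_fold_indices is a hypothetical helper written for this example (not a library function), and no shuffling is done.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "fit on", train_idx)
```

In every round the model is fitted on one part and scored on another, and the separate test set is not touched at all.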
80
What is random forest?
It's a model.
VED
parmarvedpro5@gmail.com
81
Where can we get more information about models, e.g. RandomForestRegressor?
Rohit Arora
rohit.arora@creation-tec.com
82
Is there a way to know when I should stop fine tuning?
Nini Nursiah
nursiah.neelesh28@gmail.com
83
So pretty much we need to get the lowest RMSE across various models and select that model. Is that what we are trying to do here?
Preedesh M
Preedesh@Gmail.com
84
Shouldn't the RMSE value lie between 0 and 1?
Swetha Lakshmipathy
swethalpathy@gmail.com
85
Will we always get better results from random forest?
Dr. Santosh Kumar
ksantosh.11@gmail.com
86
Why is there a negative in np.sqrt() for a few models and not for Random Forest?
Nitin Nigam
nknigam@gmail.com
87
Does it cause overfitting if we keep checking performance and tuning the features?
Srihari M
srihariblr12@gmail.com
88
But then if we just compare how it performs on the test set, if the error is decreasing, then we might get stuck in a local minimum, right?
Nini Nursiah
nursiah.neelesh28@gmail.com
89
How do we account for the time dimension of data observations, because some observations will become obsolete in due course of time?
Rohit Arora
rohit.arora@creation-tec.com
90
In case the model evaluation is not promising on the validation data, we go back to the training data, revisit the model, and iteratively check the performance on the test data. But this would also at some point result in overfitting on the validation data. Should we then have an unseen dataset on which we evaluate only when we are fully satisfied with the model performance?
Prakhar Prasad
prakhar.prasad@gmail.com
91
Which one should we choose? We have others as well: MSE or MAE?
When we want to magnify large errors, we use MSE; otherwise MAE.
VED
parmarvedpro5@gmail.com
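The "magnify errors" point can be seen with a small worked example in plain Python. The values are invented for illustration: three predictions are off by 1 and one is off by 10.

```python
import math

# Toy values invented for illustration: three small errors and one large one.
y_true = [10, 12, 14, 16]
y_pred = [11, 11, 15, 26]

errors = [p - t for p, t in zip(y_pred, y_true)]
mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e ** 2 for e in errors) / len(errors)
rmse = math.sqrt(mse)

print(mae)   # 3.25
print(mse)   # 25.75
print(rmse)  # a little over 5, in the same units as the target
```

The single error of 10 contributes 100 of the 103 total squared error, so it dominates MSE far more than it dominates MAE.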
92
In the real world, should we have a predetermined target for the confidence/error range of the predictions?
It depends on the real-world domain and problem that you're trying to solve, and how much error in predictions is acceptable for that kind of problem
Puneet Rastogi
puneetrstg@gmail.com
93
When we are using K-Fold cross validation, the training set is different for each training epoch, so why did we have to set the seed to get a pseudo-random split of the train and test sets? If we had continued without setting the seed, it would have behaved as K-Fold cross validation. Am I missing something here about splitting the train and test sets?
Vinod
vinods.kumar@gmail.com
94
How do I know which model is performing better? Only from the mean error, is it enough to know how good a model is?
Nini Nursiah
nursiah.neelesh28@gmail.com
95
What is the meaning of negative mean squared error? Does it mean that the model is bad?
Srini Boddu
siliconfish@yahoo.com
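On the negative sign: scikit-learn's scoring convention treats higher values as better, so an error metric such as MSE is reported with its sign flipped, and a negative value does not by itself say the model is bad. A minimal sketch of the usual conversion back to RMSE follows, using made-up scores rather than results from the session.

```python
import math

# Hypothetical scores of the kind returned with scoring="neg_mean_squared_error".
neg_mse_scores = [-4.0, -9.0, -16.0]

# Flip the sign before taking the square root to get one RMSE per fold.
rmse_scores = [math.sqrt(-score) for score in neg_mse_scores]
mean_rmse = sum(rmse_scores) / len(rmse_scores)

print(rmse_scores)  # [2.0, 3.0, 4.0]
print(mean_rmse)    # 3.0
```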
96
What is a good RMSE to tell whether the model gives satisfactory predictions?
Nitin Nigam
nknigam@gmail.com
97
OK, I get it, CV is within the train data set
Vinod
vinods.kumar@gmail.com
98
What are random_state and bootstrap?
Nitin Nigam
nknigam@gmail.com
99
Can we apply A/B testing to this dataset?
anantpadmanabh divanji
apgd14@gmail.com
100
Can you explain, with an example, hyper large vs scale space?
AG
abhijeetgadgil@gmail.com