AI-DL Session 4 - Q&A Report

1	Question	Answer(s)			Asker Name	Asker Email
2	Where are the earlier recorded sessions posted, please?	They will be on the My Courses page.	Go to My Courses; you will find them under topics 10 and 11.		AG	abhijeetgadgil@gmail.com
3	Thanks				AG	abhijeetgadgil@gmail.com
4	I could not find yesterday’s session under topic 11 (Session-3 Recording)	Please send us an email with a screenshot and we will look into it.	Please check under topic 10	https://cloudxlab.com/assessment/displayslide/5014/understanding-and-visualisation-of-data?course_id=84&playlist_id=464	Kajal	kajalchatterjee@gmail.com
5	sure, will do.				Kajal	kajalchatterjee@gmail.com
6	Seed function	The seed method initializes Python's pseudorandom number generator. The random module uses the seed value as the starting point for the sequence of random numbers it generates. If no seed value is given, it uses an operating-system randomness source when one is available, and otherwise the current system time. If you provide the same seed value before generating random data, it will produce the same data.			VED	parmarvedpro5@gmail.com
7	Got it. Thank you.				Kajal	kajalchatterjee@gmail.com
8	How do we determine the parameter values for the seed function?	You can check all the methods of the random library at the link below: https://docs.python.org/3/library/random.html			VED	parmarvedpro5@gmail.com
9	ok				VED	parmarvedpro5@gmail.com
10	Will we get the same output as you if we use 42 in seed?	Yes			Nishant Singh	nishant1695@gmail.com
11	What are we trying to achieve with randomness?	Randomness is used for shuffling the data.			Karan K	karankarnik47@gmail.com
12	Is there a shortcut to open the API docs from a line of code?	You can use help(<func>) in Jupyter, e.g. help(pd.read_csv).			jia sharma	jiavidhi.sharma@gmail.com
13	How do we know which function is part of which library, i.e. which one belongs to numpy and which one belongs to pandas?	Once you start practicing, you will get to know the common ones because you will be using them frequently. If you need anything special, you can always search on Google.			Hemanta Lenka	hemanta.lenka@gmai.com
14	Why do we compare the contents of the test data and the train data?				VED	parmarvedpro5@gmail.com
15	Is there a more extensive course for Pandas? The prerequisites in the course have brief content about DataFrame, but it seems quite limited compared to the coverage of the overall Pandas topic.				Rohit Arora	rohit.arora@creation-tec.com
16	How do we decide which existing features to use to engineer a new feature? Is it intuition based?				Nishant Singh	nishant1695@gmail.com	You may have to iterate over various combinations, evaluate performance for each combination, and decide which one to choose	So this way you can choose - Prof Durga
17	Why is data engineering required here?				DIvya Pathak	dev.feb88@gmail.com
18	?				DIvya Pathak	dev.feb88@gmail.com
19	Can you repeat the drop() part at line				Nitin Nigam	nknigam@gmail.com
20	#113				Nitin Nigam	nknigam@gmail.com
21	Was the data we were looking at not clean data?	So far the data is not clean; we will clean it.			AG	abhijeetgadgil@gmail.com
22	By "higher side", is there a specific percentage we take into account?				Sugandhita	sugandhitap@gmail.com
23	Sir, don't data cleaning and tidying come before creating the train and test data sets?				Sanjeeb Bose	sanjeeb.bose@oracle.com
24	Instead of dropping it, can we not assign it some default value?				Chinmay Athavale	chinmayat@gmail.com
25	We created the train_set and test_set data structures. The incomplete rows the professor is showing now as an example ultimately have to be removed from the train_set data structure, right?				AG	abhijeetgadgil@gmail.com
26	Aren’t observations and predictors the same?	Observations are input data records (rows). Predictors are variables used for prediction (columns).			Srihari M	srihariblr12@gmail.com
27	So that all rows of housing are deleted?				Dr. Santosh Kumar	ksantosh.11@gmail.com
28	How do we do data cleaning in the case of categorical values?				DIvya Pathak	dev.feb88@gmail.com
29	As you just explained, the median is NOT sensitive to outliers. What if we have a data set such as 5, 6, 8, 10, 7, 8, 9, 200? In this case I guess it is affected by the outlier (200).				Manoj Kumar	manoj.gupta.91@gmail.com
30	We saw that total_bedroom has very low correlation, so why do we need to correct the column? We probably need to drop those irrelevant columns.				Nitin Nigam	nknigam@gmail.com
31	Won't imputation create an overfitting scenario?				Aakash Sinha	post2aakash@gmail.com
32	Can we not use numerical translation for categorical values?				Vikas Bhartiya	ghivikas@gmail.com
33	If we remove categorical attributes before imputation, do we merge the data frame back into the main one after imputation so that we don't lose the categorical values in the data set? The categorical attributes may still be important for making predictions.				Preedesh M	Preedesh@Gmail.com
34	I mean merge the imputed dataframe into the original				Preedesh M	Preedesh@Gmail.com
35	I'm still not completely clear about stratified sampling. Can you point to some resources? Thanks.				Karan K	karankarnik47@gmail.com
36	What is the difference between imputer.fit(housing_num) and X = imputer.transform(housing_num)?	With fit, the imputer calculates the means of the columns from some data; with transform, it applies those means to some data (which is just replacing missing values with the means). Prof Durga			Nitin Nigam	nknigam@gmail.com		imputer.fit takes the data in and analyses it. It does not transform the data. imputer.transform then uses the results of that analysis to transform the data it is given.	And "means" is one example of such an analysis	Using "means" for analysis or for doing the transform
37	cool				Aakash Sinha	post2aakash@gmail.com
38	imputer.fit(housing_num) - Does this impute all numerical columns with the median value where there are NaN values?				Srini Boddu	siliconfish@yahoo.com
39	bins = labels = categories can be imputed, but if you use numeric values like 1, 2, 3, 4, 5, then it is not a good idea to impute, right?				AG	abhijeetgadgil@gmail.com
40	"A recorded video of every class will be available before the next day of the session. Instructors will keep adding Slides, Questions, and Projects on a weekly basis." Where can I find the Questions and Projects mentioned above, sir?		It will be updated in the LMS		sshrivastava	sshrivastava@imtnag.ac.in
41	Is it recommended to apply the OrdinalEncoder object to fit and transform the test data as well?				Prakhar Prasad	prakhar.prasad@gmail.com
42	When, sir?				sshrivastava	sshrivastava@imtnag.ac.in
43	I can't find any questions or projects to date		They are available under the slides on the session page		sshrivastava	sshrivastava@imtnag.ac.in
44	Can we do PCA after encoding? Suppose we have 50 categories: encoding will create 50 more columns, so how do we deal with that? We can't control the number of categories if the column has a strong relationship.				Arpit vijaywargiya	arpitvw16@gmail.com
45	Sir, does it come under the Machine Learning Specialization?				sshrivastava	sshrivastava@imtnag.ac.in
46	Please suggest				sshrivastava	sshrivastava@imtnag.ac.in
47	Although my course is Artificial Intelligence Deep Learning IIT Roorkee				sshrivastava	sshrivastava@imtnag.ac.in
48	Which one is usually used, StandardScaler or MinMaxScaler, and will there be a difference in model performance due to the scaler chosen?				Prakhar Prasad	prakhar.prasad@gmail.com
49	Is it necessary to create the CombinedAttributesAdder class for newly added features?				Nitin Nigam	nknigam@gmail.com
50	Can't we use an inbuilt transform function?				Nitin Nigam	nknigam@gmail.com
51	Since we already added them to the df				Nitin Nigam	nknigam@gmail.com
52	What was the 16?				AG	abhijeetgadgil@gmail.com
53	Correct me if I'm wrong, but the transformer that we built is just automating the feature engineering process and appending the result to our data set, right?				Nishant Singh	nishant1695@gmail.com
54	What is the difference between fit and transform on data, with an example?				kunal upadhyay	kupadhy@gmail.com
55	But we will be using only one transformation on our data, right? What is the benefit of combining?				Dr. Santosh Kumar	ksantosh.11@gmail.com
56	I mean 16 in				AG	abhijeetgadgil@gmail.com
57	(16512, 16)				AG	abhijeetgadgil@gmail.com
58	Will PCA affect the encoding?				Arpit vijaywargiya	arpitvw16@gmail.com
59	Can we get some study material to understand the transform concept for numerical and categorical variables, or some reference links?		You can use the O’Reilly book for reference.		jia sharma	jiavidhi.sharma@gmail.com
60	Suppose we apply PCA to encoded data. How will it affect the model?				Arpit vijaywargiya	arpitvw16@gmail.com
61	Please provide some study material or more references		Refer to the O’Reilly book which we referred to earlier		Sarbjit Singh	ssingh@imtnag.ac.in
62	attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False); housing_extra_attribs = attr_adder.transform(housing.values)				SUVAIN G	brusuvain@gmail.com
63	As PCA is a projection of the data onto eigenvectors, how will it lose the information of the encoded data?				Arpit vijaywargiya	arpitvw16@gmail.com
64	Can you please explain the above code lines once again? These are from CombinedAttributesAdder				SUVAIN G	brusuvain@gmail.com
65	Can you please explain the last 2 code lines from CombinedAttributesAdder once again?				SUVAIN G	brusuvain@gmail.com
66	Or what if, instead of one-hot, we use ordinal encoding and then use PCA,				Arpit vijaywargiya	arpitvw16@gmail.com
67	so the information will be projected as per the categories?				Arpit vijaywargiya	arpitvw16@gmail.com
68	thanks				Sarbjit Singh	ssingh@imtnag.ac.in
69	When should we build a pipeline?				anantpadmanabh divanji	apgd14@gmail.com
70	So should we make it a habit, as beginners, to incorporate transformers and pipelines in our models at this point of time? What would you suggest?				Nishant Singh	nishant1695@gmail.com
71	Which should come first: missing value analysis or encoding the categorical variables?				anantpadmanabh divanji	apgd14@gmail.com
72	Should we follow the same steps as you are teaching, as beginners?		Try to follow. If you fall behind, try later using the video and slides.		VED	parmarvedpro5@gmail.com	We recommend that you try along as far as possible. You may not understand everything, but you can run it as is, without any edits.
73	I was reading about sparse matrices and it was mentioned that they are usually time-consuming to work with owing to their very few non-zero values. Are there ways to structure them better?				Puneet Rastogi	puneetrstg@gmail.com
74	What do we have in housing_labels?				Nitin Nigam	nknigam@gmail.com
75	Why did we use a decision tree here?				AG	abhijeetgadgil@gmail.com
76	Sir, what does CV=10 mean?				Sanjeeb Bose	sanjeeb.bose@oracle.com
77	10 rows for cross validation?				Sanjeeb Bose	sanjeeb.bose@oracle.com
78	I didn't get the concept of negative mean squared error. Please repeat?				VED	parmarvedpro5@gmail.com
79	Why can't I use my training set for cross validation?			Because then the hyperparameters that you get from cross-validation might be overfitted to your training set and might not perform well on the test set. We therefore want to tune hyperparameters on data different from the training set, so we do it on the cross-validation set.	Nini Nursiah	nursiah.neelesh28@gmail.com
80	What is random forest?			It's a model	VED	parmarvedpro5@gmail.com
81	Where can we get more information about models, e.g. RandomForestRegressor?				Rohit Arora	rohit.arora@creation-tec.com
82	Is there a way to know when I should stop fine-tuning?				Nini Nursiah	nursiah.neelesh28@gmail.com
83	So pretty much we need to get the lowest RMSE across various models and select that model. Is that what we are trying to do here?				Preedesh M	Preedesh@Gmail.com
84	Shouldn't the RMSE value lie between 0 and 1?				Swetha Lakshmipathy	swethalpathy@gmail.com
85	Will we always get a better result from random forest?				Dr. Santosh Kumar	ksantosh.11@gmail.com
86	Why is there a negative sign inside np.sqrt() for a few models and not for Random Forest?				Nitin Nigam	nknigam@gmail.com
87	Does it cause overfitting if we keep checking performance and tuning the features?				Srihari M	srihariblr12@gmail.com
88	But then if we just compare how it performs on the test set, and the error is decreasing, we might get stuck in a local minimum, right?				Nini Nursiah	nursiah.neelesh28@gmail.com
89	How do we account for the time dimension of data observations, given that some observations will become obsolete in due course of time?				Rohit Arora	rohit.arora@creation-tec.com
90	In case the model evaluation is not promising on the validation data, we go back to the training data, revisit the model, and iteratively check the performance on the test data. But this would also at some point result in overfitting on the validation data. Should we then have an unseen dataset on which we evaluate only when we are fully satisfied with the model performance?				Prakhar Prasad	prakhar.prasad@gmail.com
91	Which one should we choose? We have others as well: MSE or MAE?				VED	parmarvedpro5@gmail.com				When we want to magnify errors, we use MSE; otherwise MAE
92	In the real world, should we have a predetermined target for the confidence/error range on the predictions?		It depends on the real-world domain and problem that you're trying to solve, and how much error in predictions is acceptable for that kind of problem		Puneet Rastogi	puneetrstg@gmail.com
93	When we use K-fold cross-validation, the training set is different for each training epoch, so why did we have to set the seed to get a pseudorandom split of the train and test sets? If we had continued without setting the seed, it would have behaved like K-fold cross-validation. Am I missing something about splitting the train and test sets?				Vinod	vinods.kumar@gmail.com
94	How do I know which model is performing better? Only from the mean error, is it enough to know how good a model is?				Nini Nursiah	nursiah.neelesh28@gmail.com
95	What is the meaning of negative mean squared error? Does it mean that the model is bad?				Srini Boddu	siliconfish@yahoo.com
96	What is a good RMSE for judging whether the model gives satisfactory predictions?				Nitin Nigam	nknigam@gmail.com
97	OK, I get it, CV is within the train data set				Vinod	vinods.kumar@gmail.com
98	What are random_state and bootstrap?				Nitin Nigam	nknigam@gmail.com
99	Can we apply A/B testing to this dataset?				anantpadmanabh divanji	apgd14@gmail.com
100	can you explain with example hyper large vs scale space?				AG	abhijeetgadgil@gmail.com
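
Illustration for rows 6, 10 and 36 (not part of the session material): the sketch below shows, using only the Python standard library, why setting the same seed reproduces the same random data, and how fit differs from transform. The MedianImputer class and the train_column / test_column values are hypothetical, made up for this example; the class is a simplified stand-in for the session's imputer object, not its actual implementation, and it uses the median rather than the mean as the learned statistic.

```python
import random
import statistics

# 1) Seeding: the same seed reproduces the same pseudorandom sequence.
random.seed(42)
first_run = [random.random() for _ in range(3)]
random.seed(42)
second_run = [random.random() for _ in range(3)]
same_sequence = first_run == second_run


# 2) fit vs transform, with a hypothetical single-column median imputer.
class MedianImputer:
    """Simplified stand-in for an imputer; missing values are None."""

    def fit(self, values):
        # fit only analyses the data: it learns the median of the known
        # values and leaves the input untouched.
        known = [v for v in values if v is not None]
        self.median_ = statistics.median(known)
        return self

    def transform(self, values):
        # transform applies what fit learned: it returns a new list with
        # each missing value replaced by the stored median.
        return [self.median_ if v is None else v for v in values]


train_column = [5, 6, None, 8, 10]  # made-up training values
test_column = [None, 7]             # made-up test values

imputer = MedianImputer().fit(train_column)    # learn from training data only
train_filled = imputer.transform(train_column)
test_filled = imputer.transform(test_column)   # reuse the training median

print(same_sequence)
print(imputer.median_, train_filled, test_filled)
```

The point of keeping the two steps separate is that the statistic is learned once, from the training data, and can then be applied unchanged to any other data, such as the test set.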