Master of Science in Data Science, University of Colorado Boulder
My notes on the topics of the courses.
Peter Myers
Table of Contents
Chapter 1 - Algorithms for Searching, Sorting, and Indexing
Chapter 2 - Trees and Graphs: Basics
Chapter 3 - Dynamic Programming, Greedy Algorithms
Chapter 4 - Data Science as a Field
Chapter 5 - Ethical Issues in Data Science
Chapter 6 - Cybersecurity for Data Science
Chapter 7 - Fundamentals of Data Visualization
Chapter 8 - Data Mining Pipeline
Chapter 9 - Data Mining Methods
Chapter 10 - Data Mining Project
Chapter 11 - Introduction to Machine Learning: Supervised Learning
Chapter 12 - Unsupervised algorithms in Machine Learning
Chapter 13 - Introduction to Deep Learning
Chapter 14 - Modern Regression Analysis in R
Chapter 15 - ANOVA and Experimental Design
Chapter 16 - Generalized Linear Models and Nonparametric Regression
Chapter 17 - Relational Database Design
Chapter 18 - The Structured Query Language (SQL)
Conclusion
CHAPTER ONE
Algorithms for Searching, Sorting, and Indexing
How do we compare algorithms?
Insertion sort:
Binary search tree or heap sort:
Data structures and algorithms
Merge sort:
Binary search
arr = [1, 5, 8, 15, 26, 30, 50, 83, 86]
search = 45
left_i, right_i = 0, len(arr) - 1
answer = -1  # -1 means not found
while left_i <= right_i:
    i = (left_i + right_i) // 2  # middle of the remaining range
    if arr[i] == search:
        answer = i
        break
    elif arr[i] < search:
        left_i = i + 1    # discard the left half
    else:
        right_i = i - 1   # discard the right half
print(answer)  # -1 here, since 45 is not in arr
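As a cross-check (this is my addition, not part of the course notes), Python's standard library already provides the same search via the bisect module, which is the idiomatic choice in practice:

```python
import bisect

arr = [1, 5, 8, 15, 26, 30, 50, 83, 86]

def binary_search(arr, search):
    # bisect_left returns the leftmost index where `search` could be
    # inserted while keeping arr sorted
    i = bisect.bisect_left(arr, search)
    if i < len(arr) and arr[i] == search:
        return i
    return -1

print(binary_search(arr, 45))  # -1 (not present)
print(binary_search(arr, 30))  # 5
```

Both versions run in O(log n) on a sorted array.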
Quicksort:
Hash tables:
Hash functions written from scratch:
Heaps:
Bloom Filters:
Count-Min Sketching:
Rabin-Karp algorithm:
CHAPTER TWO
Trees and Graphs: Basics
Binary search tree:
Red-black trees
Nodes and edges
Linked lists:
Skip lists:
Graphs:
Stack:
Queue:
Graph Traversals:
Topological Sorting:
Strongly Connected Components (SCC):
SCC Identification Algorithm - Tarjan’s Algorithm
Amortized Analysis:
Minimum Spanning Trees (MST):
Shortest Path Part 1:
Shortest Path Part 2:
All Pairs Shortest Path (APSP):
CHAPTER THREE
Dynamic Programming, Greedy Algorithms
Divide and Conquer:
Longest Common Subsequence:
Longest Common Substring:
Saving intermediate results in dictionaries
Max Subarray Problem:
Karatsuba Multiplication:
Master Method for Recursion:
Fast Fourier Transform (FFT):
Introduction to dynamic programming using the rod cutting problem:
Coin changing problem:
Knapsack problem:
When optimal substructure fails:
Introduction to greedy algorithms:
Greedy interval scheduling:
Huffman Codes:
Decision problem and languages:
CHAPTER FOUR
Data Science as a Field
Data science adds a lot of value, making it necessary for a business to use:
The skills you would need to achieve these goals are:
Some interesting NLP areas:
An overview of data science
Issues with most DS code today:
Solution:
With this setup, you're ready to make 1% improvements to the model every other day, and in no time you'll have a model with much higher performance, trust, and maintainability than a messy-code solution.
R:
Presenting:
Not-so-good question: "Do I know everything needed to do good data science?" It's infinitely hard to prove this with a "yes".
Better question: "Can I do my job?" This is easy to answer "yes" in no time. You know you did your job if the metrics improve on your task. Tasks are likely one of these:
There are also some basic skills that are optional but very nice to have:
CHAPTER FIVE
Ethical Issues in Data Science
Introduction:
Different theories of ethics:
It is vital to ensure ethics in your actions and a statistical model’s actions.
Ethical analysis 1:
I read the case of medical implants and the potential for a hack to restart the implant.
It was concluded that the risk of harm was negligible because of the difficulty of the hack and the limited consequences of the hack.
First, looking through the lens of consequences, I find the consequences are clear: the company runs a bug bounty program, so they agree there could be issues if someone hacks into the device. A device reset might not be the worst type of hack, but it is still important. It could leave the device without the appropriate settings configured when it restarts, which could result in harm or death, or raise the chances of harm or death if the reset state opens the opportunity for further hacks.
Those are the consequences of not doing a recall; the consequence of doing a recall would likely be spending a lot of money on it.
In terms of duty, and doing the right thing for its own sake: the product is intended to work a certain way and should do no harm, so the company should take this as an opportunity to rework the software so that patch fixes are possible for any future issue. If they go down the route of not fixing the issue, they seem more prone to make the same decision again and skip a patch they should ship; the software should be made compatible with quick patch fixes.
In terms of virtues, I would say their decision is not virtuous. It lacks trustworthiness, in that they do not intend to fix something that could lead to death. It does not respect the customer, who was given a product that should do no harm and was not alerted to the issue. They are not taking responsibility; they are sweeping it under the rug. It is not fair to the people who are harmed for the company's profit savings, and it is not caring toward people's lives. Overall, it very much fails the six core values of our society described by the Josephson Institute of Ethics in Marina del Rey.
The ethical framework that appeals to me most in almost any situation is Kant's: doing the right thing can only be done by doing it because it is the right thing. I have come to believe this is the framework that most closely aligns with objective morality.
What appeals to me second is the virtues framework, which makes logical sense: a person of very high character will statistically do the right thing more often.
What appeals to me least is the consequentialist framework, which closely aligns with the phrase "the ends justify the means", which doesn't seem like a very nice thing to say.
Ethical analysis 2:
Kant. YouTube recommendation engine.
I am a data scientist who has spent a long time with recommendation engines and retention dashboards. Based on public information about these popular recommendation systems, their goal is to improve engagement and retention.
I like the YouTube recommendation engine; I think it's pretty good. I do think they should add a feature they probably never will: make the recommendation engine try to get the person to quit after they've been on for 2 to 4 hours in a single day, or at least stop leading them from video to video through recommendations.
The one downside is competition: you might just push viewers to another platform, and that platform might then be seen as outperforming yours. It's a tough situation, but the ethical thing would be to make sure users don't overuse your own system in a single day, and to take the small hit of losing engagement past whatever threshold, say two to four hours, you deem too habitual.
We should look to avoid habitual behavior that harms society by keeping people addicted to YouTube or other apps longer than they should be in a given period. We have clearly seen this in the news, with research linking habitual social media use to depression, including social media companies' own research into depression among habitual users, and an overall decline in people interacting with the world away from screens.
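A minimal sketch of the nudge described above, with hypothetical names and thresholds of my own invention (this is not YouTube's actual API or logic):

```python
# Hypothetical illustration: demote recommendations once a day's
# watch time passes a "habitual" threshold.
HABITUAL_HOURS = 3  # somewhere in the 2-4 hour range discussed above

def rank_recommendations(videos, hours_watched_today):
    """Rank candidate videos by score; past the threshold, return no
    candidates at all, breaking the video-to-video autoplay chain."""
    if hours_watched_today >= HABITUAL_HOURS:
        return []  # the nudge: stop feeding further recommendations
    return sorted(videos, key=lambda v: v["score"], reverse=True)

videos = [{"id": "a", "score": 0.7}, {"id": "b", "score": 0.9}]
print(rank_recommendations(videos, hours_watched_today=1.5))  # ranked list
print(rank_recommendations(videos, hours_watched_today=3.5))  # []
```

A gentler variant could down-weight scores gradually instead of cutting off recommendations entirely.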
Comment made on another post:
That's true, they don't provide the service out of goodwill but rather to make a profit. I wonder if this applies to every for-profit company. They often speak of the value they add to society by providing entertainment, but their main goal is usually ever-increasing profit. It's surprising that communism and fascism, which might originally sound humanitarian, lead to security states, while capitalism, which sounds selfish, can create good; I'm not too well versed in these topics.
I sometimes think about how the pressures of competition lead to unethical choices: "If we don't do it, we will go bankrupt and the unethical competition will do it anyway", or in other words, "if we don't do it, someone else will", almost like a prisoner's dilemma. I even think countries (rather than companies) could be acting this way, wanting ever-growing GDP at the expense of pollution, because rival countries will just do it anyway and win on GDP.
Ethical analysis 3:
I believe, much like Google's opinion in the reading, that there are some extreme cases where action needs to be taken, but in most cases people are being a little too sensitive, as in the example where a news station wants to remove another news station's article.
Ethical analysis 4:
I believe the "Right to be Forgotten" sounds okay in extremely disturbing cases, just as any app needs to address bad actors. I would hate to see it abused for trivial cases, like a news station wanting another news station's article removed.
Each complaint should be sent to the search engine, or legal action taken. The complaint should be assessed for the damages caused and the ethics of the situation, and then appropriate action taken. If the search engine doesn't act in good faith, it's only a matter of time before legal action is taken against it by the countries or individuals who care to argue their case. If a country makes a policy like GDPR, it should be respected by all applications; however, as we have seen with GDPR, it doesn't need to be enforced on US user data or other countries' user data.
From Kant's framework, doing the right thing seems obvious for extremely disturbing cases, even if it harms freedom of speech of bad actors. Their information is still on the internet, it's just too extreme and is found to be one of the rare cases it is delisted from the search engine. There is no law that a search engine must show your material, their algorithm is their own, and Google has had great integrity and held high public opinion overall through the years to do what is right to provide knowledge to all.
In terms of the decision not to allow anyone to delist something from Google, I may have used the virtues framework. Aristotle valued rational thinking highly, so rationally I figured a good way to go is to take an action equal to the size of the harm. If there is no harm no action should be taken. If it hurts someone's reputation for something to be on the internet, they can plead their case and if it is an honorable cause hopefully they get justice in the end. Google and the courts are at people's reach to judge their concerns. Each case needs to be judged individually and rationally.
Comment made on another post:
Yes, I've had a YouTube video uploaded about me, where I got stuck in something, that I would have liked removed. The person took it down, and it's probably off YouTube and not disseminated by others, since it was pretty boring.
Good point, different countries have different values.
Yes, it's interesting to view this through the question of whether a good actor would have uploaded the information; if only a bad actor would do it, then it's not good. I see how good/bad is hard to determine, like you said, but it makes sense to me.
Professional society codes of ethics:
Ethics Analysis 5
The article about a very strange year at Uber: I found it striking that there was so much sexual harassment, abuse, and dysfunction at the company, and such a lack of cohesion and integrity from the leaders in her experiences. The biggest ethical issue seems to be a lack of virtue in the leaders, based on my understanding of her account.
The article with the PhD AI paper: what I found most compelling was that these large language models definitely lack understanding; they are predicting the next word without much understanding, just enough to trick the users. It cost so much to train them, and I'm pretty sure it costs a whole lot more these days, which contributes to carbon emissions. I like the idea of teaching the computer to have better understanding so it doesn't need to burn so much carbon training something that doesn't necessarily understand. The 2017 Davinci Code book speaks of an AI coming about through both a logical side and a probabilistic side, similar to the human brain, which perhaps is what we could build by creating a more logical side instead of focusing only on probabilistic word prediction. I'm sure this idea has been debated a lot, but from my understanding, and in the opinion of the AI paper in the article, the money is following the probabilistic side.
Ethics Analysis 6
I believe facial recognition should be used for social media filters, and there should be strict laws against using facial recognition for other purposes.
Even once perfected, it should not be used for those other purposes, and we need to work harder toward strict laws against it.
I recognize the following benefits of automation, but any of these is a slippery slope to mass surveillance:
These are just my opinions; everyone has their own views on privacy vs. security. Overall, facial recognition can't be used carelessly, because it's too easy for people to start using it for mass surveillance once it's accepted in one place.
Comment I made on another post:
I agree about unlocking the phone, and that social media tagging might not be harmful; you probably aren't the only one who feels slightly weird when tagged automatically. I think they were sued over it and have abandoned automated tagging, but I'm not sure. And yes, getting arrested because of a face match could be bad.
I agree that finding suspected terrorists, finding criminals, and screening at airports would be the best uses of it when perfected, though I'm not fully convinced either way whether it should be used at all. With the right laws in place, we could keep facial recognition at airports.
Mass surveillance is a powerful tool that would likely require a high level of virtue to use for an overall good. Mass surveillance is also a large step toward some of the dystopias of science fiction, but if terrorists and criminals are a huge problem, a huge solution might be required.
Ethical Analysis 7
Gene editing
Data science stance on gene editing today:
(no comment made since no other comments were there)
Ethical Analysis 8
Yes, taking away someone's job seems unethical, and we should support those who are most vulnerable with the education required to keep up in the workplace.
With the trending skills known by data scientists, and jobs very focused on data professionals, I believe all trained data scientists should take on the obligation of helping others upskill for future jobs, as much as their ability allows. My hesitation: what if I built a system that made data science easy, shared it with everyone, and then faced so much competition that I struggled to keep a job I needed to pay a mortgage? But I suppose I could downsize.
One area where I would want to help is a cloud computing website that makes it very easy to use cloud computing services cheaply and to write very clean, maintainable modeling code. Code written by an average person following key fundamentals unknown to most would easily be more maintainable and faster than the average senior data scientist's.
Another area where I would like to help is the education of kids from elementary school through high school. I have created a 12-point learning plan that would produce a very well-rounded, educated individual; currently it's just an ebook of a few dozen pages outlining the plan for those 13 years of education. This would prepare them with the cognitive abilities needed for future jobs.
Lastly, I have thought about tutoring data scientists. I am a bit hesitant, with ChatGPT able to fill that role pretty well these days; I wouldn't feel right charging an hourly rate when they could use free resources and ChatGPT and get almost the same benefit. I had thought to teach out of my short ebook that simplifies how to follow best practices.
The government should have very good virtual career services to help people choose careers that interest them and that are in high demand and low supply: for example, lesson plans for how to become a landscaper through a YouTube course, then various low-, mid-, and high-cost learning programs once they get their first few clients and like the work. Education at the free, low, and mid levels is often just a series of topics and resources for reaching the milestones that build the skills needed to do a job.
The government should support education for kids under the age of 18, and should speak very supportively of college to further improve people's cognitive abilities after 18.
Say we enter a world with sufficient wealth but not enough work for all. I trust the way we have things today in the USA; a lot of idealistic ideas have led to arguably bad results in other countries. In other words, keep doing what we've been doing, but look for ways to live sustainably, with technology as one thing we should change. Our current government is quite good to those without jobs, and I would imagine it would help those who feel unqualified for the open jobs.
(no comment made since no other comments were there)
CHAPTER SIX
Cybersecurity for Data Science
Triad of cybersecurity:
Basics of cryptography:
Keeping private and sensitive data safe to the best of our abilities.
Columnar transposition cipher:
Enigma:
Hacking:
Internet of things:
Social engineering:
Facial recognition:
Reflecting to improve listening:
Passwords:
How to do well in the cybersecurity space:
CHAPTER SEVEN
Fundamentals of Data Visualization
Basics of visualization:
Altair:
Basics on making data visually appealing.
Graph tips:
Interactions:
Visualization tasks:
Pitfalls that can occur in a design study:
Graph tips:
Qualitative evaluation:
Give an example of how you would design a visualization experiment:
I would like to build a visualization that lets someone slice and dice the data to determine key factors that lead to a student dropping out, for the application-prioritization departments of universities. Overall, the user's goal would be to find patterns that correlate well with dropping out. These findings could lead to more support for groups at risk of dropping out, and to prioritizing merit-based attributes in the application priority order.
I would recruit random adults rather than expert data analysts, to ensure the tool is usable by everyone.
The approach would be a semi-structured interview with random adult participants who have used a website before; I wouldn't look specifically for expert data analysts, and would be content with any adult participant to check that the visualization works well. I would prepare a website where participants can do tasks with the visualization, and I would need to simulate data or ensure I have approval and sufficient privacy protections to use real university dropout data.
The structure of the interview would be as follows with the following process and open-ended questions:
Lastly, I would use computer analysis to analyze the experiment text data carefully and maintain participant privacy.
CHAPTER EIGHT
Data Mining Pipeline
Drowning in data but starving for knowledge:
How do you go about using data:
Common techniques:
Ethics:
Building a pipeline to move data from one place to another.
Data basics:
Data similarity:
Data cleaning and data integration:
Data transformations:
Data warehouse, data cube, and OLAP:
Building a data cube and data warehouse:
CHAPTER NINE
Data Mining Methods
Frequent pattern analysis:
The methods you use to achieve the use case or gain knowledge from the data.
Decision Tree Classifier:
Naive Bayes Classifier:
Support Vector Classification:
Neural Network Classification:
Ensemble Classification:
Model evaluation:
Clusters:
Anomaly Detection Methods:
CHAPTER TEN
Data Mining Project
Project ideas prompt:
Data mining projects that sound interesting to me in ranked order:
1) Doing frequent pattern analysis on online purchase data.
2) Analysis on user data looking for engagement and retention.
3) Simply trying to get the most accurate model for regression or classification.
I would need to search free data sets online to get this data.
Starting questions.
Project proposal 5-10 slides:
Exploring data’s depths: Unearthing patterns through advanced mining.
Project proposal report:
Checkpoint slides:
Checkpoint report:
Project final slides:
Project final report:
Project conclusion:
CHAPTER ELEVEN
Introduction to Machine Learning: Supervised Learning
Introduction:
Linear Regression:
Making predictions with data.
Linear Regression (continued):
Logistic regression:
Confusion matrix:
Regularization hyperparameters and cross validation:
K-Nearest Neighbor Classifier:
Decision Tree Classifier:
Avoid Overfitting Decision Tree Classifier:
Ensemble with Random Forest:
Adaboost Boosting:
Gradient Boosting:
SVM:
CHAPTER TWELVE
Unsupervised Algorithms in Machine Learning
Introduction:
PCA:
Clustering methods:
Recommendation Systems:
Algorithms that work well without labels.
Matrix Factorization:
CHAPTER THIRTEEN
Introduction to Deep Learning
It takes a lot of data to train, and I would summarize its most common use cases as: creation/identification of images/audio, vectorizing words, predicting the next word, time series, and game AI.
You can make use of existing deep learning models without retraining from scratch, such as audio transcription (Google's speech_recognition library), image to text (pytesseract), word2vec (convert a word to a vector), InferSent (convert a sentence to a vector), and ResNet50 (train it on 20 images and it will often be accurate; transfer learning).
I am interested in building game AI with XGBoost, perhaps moving to deep learning as more data is collected. This would solve the game and create a simulation that generates data with optimal decisions being made. Then you use decision trees to learn from that data and find a simple strategy that is almost as strong and can be copied by a person.
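A toy sketch of that distillation idea, under stated assumptions: I stand in for the solved game with a simple hidden rule that generates "optimal" decisions, then fit a shallow decision tree (scikit-learn here instead of XGBoost, purely for brevity) to recover a strategy simple enough for a person to copy.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-in for data from a solved game: each row is a game state,
# and the "optimal" move happens to follow a simple hidden rule.
X = rng.uniform(size=(5000, 2))
y = (X[:, 0] > 0.5).astype(int)  # optimal decision per simulated state

# A shallow tree recovers a simple, human-readable strategy
# (here, essentially one threshold rule).
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print("train accuracy:", tree.score(X, y))
```

In a real game the labels would come from the solver's play rather than a known rule, and the tree depth trades off strategy strength against how easily a person can memorize it.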
Neural network models
Perceptron and multi-layer perceptron:
Other topics:
CHAPTER FOURTEEN
Modern Regression Analysis in R
Frameworks and Goals of Statistical Modeling:
Predicting a numeric outcome in R
Coefficient t-tests:
Confidence intervals:
Ethics in statistics:
Tests:
Selection:
Collinearity:
CHAPTER FIFTEEN
ANOVA and Experimental Design
Introduction to Experimental Design
ANOVA:
Statistical tests
Post-Hoc Test in Python:
False negatives:
Two-way ANOVA:
Experimental Design:
Sampling techniques:
CHAPTER SIXTEEN
Generalized Linear Models and Nonparametric Regression
Introduction:
Components of GLM:
Binomial Regression Parameter estimation:
GLMs and their benefits.
Interpreting betas:
Poisson Regression:
Poisson Regression Parameter Estimation:
Interpretability:
Goodness of Fit for Poisson Regression:
Overdispersion:
Nonparametric:
Generalized Additive Models (GAM)
CHAPTER SEVENTEEN
Relational Database Design
Benefits of a database:
Benefits of a file system:
Key concepts:
Designing a database such as MySQL.
Data storage tips
CHAPTER EIGHTEEN
The Structured Query Language (SQL)
Introduction:
Writing queries
End