
Master of Science in Data Science, University of Colorado Boulder

My notes on the topics of the courses.

Peter Myers


Table of Contents

Chapter 1 - Algorithms for Searching, Sorting, and Indexing

Chapter 2 - Trees and Graphs: Basics

Chapter 3 - Dynamic Programming, Greedy Algorithms

Chapter 4 - Data Science as a Field

Chapter 5 - Ethical Issues in Data Science

Chapter 6 - Cybersecurity for Data Science

Chapter 7 - Fundamentals of Data Visualization

Chapter 8 - Data Mining Pipeline

Chapter 9 - Data Mining Methods

Chapter 10 - Data Mining Project

Chapter 11 - Introduction to Machine Learning: Supervised Learning

Chapter 12 - Unsupervised algorithms in Machine Learning

Chapter 13 - Introduction to Deep Learning

Chapter 14 - Modern Regression Analysis in R

Chapter 15 - ANOVA and Experimental Design

Chapter 16 - Generalized Linear Models and Nonparametric Regression

Chapter 17 - Relational Database Design

Chapter 18 - The Structured Query Language (SQL)

Conclusion


CHAPTER ONE

Algorithms for Searching, Sorting, and Indexing




How do we compare algorithms?

  • Firstly, we consider worst-case time complexity, expressed in big-O notation such as O(n) for a running time f(n).
  • Secondly, we consider space complexity g(n), i.e. how much extra memory the algorithm needs (for example, whether it requires extra gigabytes to run).
  • Thirdly, in rare cases we might consider average time complexity, but this is typically best measured by statistical experiments, so it's more work to figure out.
  • Algorithms can use a heuristic function h(n) that improves efficiency by reducing the exploration of unpromising paths. For example, in the travelling salesman problem the heuristic can be the "distance between two cities". This lets us explore cities close to each other instead of pursuing cities that are far away.

Insertion sort:

  • Imagine you start with an array: [1 10 9 3 2 7 6 0 8 5]
  • Firstly, you imagine you just have [1 10]. You check if 10 is bigger than 1, it is, so keep it the same.
  • Secondly, you imagine you have just [1 10 9]. The last digit is 9, so you need to check 9 against everything else. So, you check 9 and 10 first and see they are out of order so you swap it to [1 9 10]. You then check 9 and 1 and see it’s in the right order.
  • Then you just keep doing that for bigger and bigger prefixes until you have done it for the entire array (a short Python sketch follows this list).
  • Speed: O(n^2)
  • Why this speed: each of the approximately n elements is inserted with an O(n) insertion.
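
A minimal Python sketch of insertion sort (my own illustration of the steps above, not code from the course):

def insertion_sort(arr):
    # Grow a sorted prefix one element at a time.
    for i in range(1, len(arr)):
        value = arr[i]
        j = i - 1
        # Shift larger elements one slot right until the spot for `value` opens up.
        while j >= 0 and arr[j] > value:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = value
    return arr

print(insertion_sort([1, 10, 9, 3, 2, 7, 6, 0, 8, 5]))  # [0, 1, 2, 3, 5, 6, 7, 8, 9, 10]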

Binary search tree or heap sort:

  • Start with an empty binary search tree or heap
  • Insert all elements into the tree or heap
  • Do an in-order traversal of the tree (or repeatedly pop the minimum from the heap). That is the sorted data.
  • Speed: O(nlogn)
  • Issues: You have to initialize the tree or heap, and it is not space efficient. Other sorting algorithms might be simpler to understand and run a little faster because they have no setup cost.

Data structures and algorithms


Merge sort:

  • Start with an array: [10 8 2 5 9 0 6 1 7 3]
  • Break it down into single element arrays [10] [8] [2] [5] [9] [0] [6] [1] [7] [3]
  • Pair them up and merge two arrays together five times: [8 10] [2 5] [0 9] [1 6] [3 7]. It is trivial to sort these two-element arrays, so I won't explain that.
  • Next we keep merging. Say we merge [8 10] and [2 5] together. The algorithm is to first compare 2 and 8. Since 2 is smaller we push it into an empty array [], so it becomes [2]. Next we compare 8 (still unchosen) with 5. 5 is smaller, so we push 5 and get [2 5]. Since one array is now empty, we append the rest of the other array [8 10] onto it. We end up with [2 5 8 10].
  • To recap, to merge two sorted arrays like [8 10] and [2 5], we just compare one element from each, push the smaller one into a new array, and remember not to consider it again (see the sketch after this list).
  • Speed: O(nlogn)
  • Issues: Does not sort in place. This means it requires more memory, but is fast at O(nlogn)
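
A small Python sketch of merge sort along these lines (my own illustration):

def merge(left, right):
    # Merge two already-sorted lists by repeatedly taking the smaller front element.
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    # One side is exhausted now; append whatever remains of the other.
    return out + left[i:] + right[j:]

def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    return merge(merge_sort(arr[:mid]), merge_sort(arr[mid:]))

print(merge_sort([10, 8, 2, 5, 9, 0, 6, 1, 7, 3]))  # [0, 1, 2, 3, 5, 6, 7, 8, 9, 10]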

Binary search

  • Start with a sorted array, and look for a number say 45 inside it: [1 5 8 15 26 30 50 83 86]
  • There are 9 elements so the left one has index 0 and right has index 8
  • Iteration 1
    • Left = 0
    • Right = 8
    • Mid = (Left + Right)//2 = 4 (integer division, can’t have fractions)
    • So we check the value at index “Mid”. That gives us 26. Our number is higher so we need to change Left from 0 to Mid+1. So we have the following
  • Iteration 2
    • Left = 5
    • Right = 8
    • Mid = 6
    • We check and get the number 50. Too high, so we need to look lower.
  • Iteration 3
    • Left = 5
    • Right = 5
    • Mid = 5
    • We check and get the number 30. Too low, and since Left equals Right we have run out of places to look, so we're done.
  • Returns: The index of the number. In this case we didn't find the number, so we return -1 to indicate that the number does not exist in the array.

arr = [1, 5, 8, 15, 26, 30, 50, 83, 86]
search = 45
left_i, right_i = 0, len(arr) - 1
answer = -1  # default: not found

while left_i <= right_i:
    i = (left_i + right_i) // 2  # integer division, can't have fractions
    if arr[i] == search:
        answer = i
        break
    elif arr[i] < search:
        left_i = i + 1   # too low, look in the upper half
    else:
        right_i = i - 1  # too high, look in the lower half

print(answer)  # -1 here, because 45 is not in the array


Quicksort:

  • What is it: A fast average runtime of nlogn sorting method commonly used.
  • How does it work: Take an element in an array like [5, 8, 7, 1, 9, 10], such as 7 (the pivot). Then put everything smaller on its left and everything bigger on its right, for example [5, 1, 7, 8, 9, 10]. Then call quicksort on both sides: [5, 1] and [8, 9, 10]. Repeat until it's sorted (see the sketch after this list).
  • How do you move the data around without using more memory: You just move through the array on the left and right side. If you find the left is too big and the right is too small, you swap them.
  • What if I only find one value out of place and nothing to swap with it: Then you need to be creative with moving the center value, let’s look at an example. You start with an array [8, 5, 7, 11, 9, 10]. When examining the left you find 8 is too high and needs a swap, when examining the right you find no swaps are needed. So, you simply move the 8 to the far right of the left side like this: [5, 8, 7, 11, 9, 10] (we swapped 5 and 8). Then you swap the 8 with the pivot 7 like this: [5, 7, 8, 11, 9, 10].
  • How do you handle sorting small arrays of length 1-4 for example: The quicksort will use an insertion sort instead of a quicksort for tiny arrays.
  • What is the worst case: The worst case runtime is O(n^2), but that mostly happens if the implementation uses a bad strategy for choosing pivots. If you do something like a random choice, it tends to be nlogn runtime on average.
  • Another trick: By partially applying quicksort, you can quickly find quantiles without fully sorting the data. Say you're looking for the 50% quantile, called the median. If your data is 101 elements long, then that quantile is the 51st element of the sorted array. We intend to partially sort, simply looking for the 51st element of the sorted array. So once you do a quicksort step, check whether the left side or the right side holds the 51st element; then you only need to run quicksort again on that side. You keep doing that until you find the 51st element. On average the runtime is just "n" instead of "nlogn". Worst case is still O(n^2); see above on how to choose pivots to avoid that.
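
A short Python sketch of quicksort with a random pivot (my own simplified version; unlike the swapping scheme described above, it is not in-place):

import random

def quicksort(arr):
    # Tiny arrays: nothing to do (a production version might switch to insertion sort here).
    if len(arr) <= 1:
        return arr
    pivot = random.choice(arr)
    smaller = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    bigger = [x for x in arr if x > pivot]
    # Everything smaller goes left of the pivot, everything bigger goes right.
    return quicksort(smaller) + equal + quicksort(bigger)

print(quicksort([5, 8, 7, 1, 9, 10]))  # [1, 5, 7, 8, 9, 10]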

Hash tables:

  • What is it: Just a giant 7,777 length array.
  • Insert "pineapple" into it: Run a function that, when given "pineapple", returns the same ID every time, say 523426153412. Then divide by 7,777 and keep the remainder (the modulo operation); that's the new ID, in this case 5,706. Place "pineapple" in the array's 5,706 position.
  • Read from it: Recompute the ID of “pineapple” and find it is 5,706. Then search the array at location 5,706.
  • When to use it: In Python, this is used when you use dictionaries. Insertion time runs in constant time, and reading happens in constant time, so it should be used a lot.
  • What can go wrong: If two things are given the same ID, you can correct it by 1) assigning it another ID nearby like 5,707…5708… 2) storing an array in each ID location. 3) checking NEW_ID = ID + (1…4…9…16…25…) for each collision of two IDs. 4) Have multiple hashing functions to try every time you get a collision. 5) If keys are known in advance, then just assign each key a spot.

Hash functions written from scratch:

  • ((integer_1 * your_value + integer_2) % a_prime) % hash_table_size
  • Run that expression and you have your hash ID.
  • When hashing strings, just loop over each character and add the results together, then do "% hash_table_size" again. Each character can be given a numeric value of your choosing (see the sketch below).
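
A sketch of this kind of hash function in Python (the prime and the two integers here are arbitrary choices of mine):

A_PRIME = 2_147_483_647          # a large prime (arbitrary choice)
INT_1, INT_2 = 92_821, 48_611    # arbitrary constants for the hash
TABLE_SIZE = 7_777

def hash_int(value):
    # ((integer_1 * your_value + integer_2) % a_prime) % hash_table_size
    return ((INT_1 * value + INT_2) % A_PRIME) % TABLE_SIZE

def hash_string(s):
    # Loop over each character, add the per-character results, then reduce again.
    total = sum(hash_int(ord(ch)) for ch in s)
    return total % TABLE_SIZE

print(hash_int(523426153412), hash_string("pineapple"))  # two bucket indexes in [0, 7776]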


Heaps:

  • Heaps are a tree-based technique that lets you extract the smallest number from a collection, and this is faster than maintaining a sorted array.
  • Heaps have the benefits of sorted arrays, but the inserts are O(logn) instead of O(n).
  • Take an array such as [9, 5, 2, 10, 7].
  • Run heapq.heapify(my_list)
  • This converts the list into a heap, which is commonly stored as an array [2, 5, 9, 10, 7].
  • Now simply run “heapq.heappop(my_heap)” and it will always extract the smallest number from the collection.
  • If you run that command 5 times you’ll get the following results: 2, 5, 7, 9, 10.

Bloom Filters:

  • Result: A Python dictionary of 3 million keys, each a string about 1,000 characters long, would take about 3 GB, but a bloom filter would only be about 7.5 MB. It works best when the expected keys number 3 million or fewer and the keys are very long strings; otherwise just use a Python dictionary instead.
  • Use case:
    • SQL queries come in. You run them, get a result, save some to a cache.
    • Which ones to save? The ones that come in more than once during that hour.
    • How do you know if one came in more than once? A bloom filter can be used.
  • How a bloom filter works:
    • It’s just an array of bits (boolean True/False). It starts as all zeros.
    • Say you put a SQL query into the bloom filter as a key. What you do is apply 7 hash functions and swap those bits from zeros to ones.
    • Now if you want to see if a key is new or not, you run the 7 hash functions. If any of the bits at those locations are 0, then the key is new and does not exist. If all are 1, then usually you implement it to be 99% sure that the key exists, but there is a 1% chance that it doesn’t exist (called a false positive). False positives are usually no big deal, just a bit of extra computation that rarely happens.
  • Implementation details:
    • Reset the bloom filters at the start of every hour.
    • (A SQL query comes in.) Check bloom filter 1 (what's in the cache); it isn't found. Run the query and return the result. Check bloom filter 2 (queries seen this hour); it isn't found either, so insert the query into bloom filter 2.
    • (The same query comes in.) Check bloom filter 1; it isn't found. Run the query and return the result. Check bloom filter 2 and see that the query has now come in twice. Place the result in the cache, and insert the query into bloom filter 1, which says what's in the cache.
    • (The same query comes in again). Check bloom filter 1, and it is found. Check the cache result. Return the cache result, skipping the need to re-run the query.
    • End result: First run was normal. Second run was normal but saved the result in the cache. The third run onward will use the cache instead of re-running it.
  • Pitfalls: Bloom filters don't scale well past a few million expected keys. Also, if the keys are integers or short 10-character strings, a Python dictionary will probably only be around 40 MB compared to a bloom filter's 8 MB, so the simpler Python dictionary approach would be just about as good and easier to do (a toy sketch follows).
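
A toy Bloom filter in Python (my own sketch; the sizes and the way the 7 bit positions are derived from one md5 digest are simplifications, and a real implementation would size the bit array from the expected key count and target false-positive rate):

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1_000_000, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive several positions from two halves of an md5 digest (double hashing).
        digest = hashlib.md5(key.encode()).hexdigest()
        h1, h2 = int(digest[:16], 16), int(digest[16:], 16)
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely new; True means "probably seen" (small false-positive chance).
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("SELECT * FROM sales WHERE day = '2024-01-01'")
print(bf.might_contain("SELECT * FROM sales WHERE day = '2024-01-01'"))  # True
print(bf.might_contain("a query never seen before"))                     # almost certainly False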


Count-Min Sketching:

  • Use case:
    • Say you want the word count of 1 billion words across documents, with common English words removed, and with the most common remaining words expected to show up about 100,000 times.
    • A dictionary would be many GB of memory.
    • Instead you could use a count-min sketch
  • How it works:
    • First you set up an array of length say 5x100,000=500,000 based on the use case above.
    • Then you define around 100 hash functions.
    • Each word is put through all 100 hash functions, and the count stored at each resulting array location is increased by 1 (so 100 single increments each time a word is seen).
    • Say you want to know the approximate word count of the word "sketch".
    • You simply run all 100 hash functions and look up the 100 counts at those locations. The answer you use is the minimum of those 100 numbers (see the sketch after this list).
    • This method can overestimate how popular a word is, but it will never underestimate it. This means if a word is said to be uncommon it is definitely true. If it is said to be popular it is 99% likely to be true if you set it up correctly, but there is still that 1% error where it might say it is popular when it is not.
  • Pros:
    • Saves space
    • Gives a good estimate for the count per word
  • Cons:
    • It might say a word is a bit more popular than it actually is
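
A small count-min sketch in Python (my own sketch; I use 5 salted hash functions, each with its own row of counters, which is a common formulation, rather than 100 hash functions over one shared array as described above, but the idea of incrementing on insert and taking the minimum on read is the same):

import hashlib

class CountMinSketch:
    def __init__(self, width=500_000, depth=5):
        self.width = width
        self.counts = [[0] * width for _ in range(depth)]

    def _index(self, word, row):
        # Salting with the row number makes each row behave like a different hash function.
        digest = hashlib.md5(f"{row}:{word}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, word):
        for row in range(len(self.counts)):
            self.counts[row][self._index(word, row)] += 1

    def estimate(self, word):
        # Minimum over rows: may overestimate because of collisions, never underestimates.
        return min(self.counts[row][self._index(word, row)] for row in range(len(self.counts)))

cms = CountMinSketch()
for w in ["sketch", "sketch", "count", "sketch"]:
    cms.add(w)
print(cms.estimate("sketch"), cms.estimate("count"), cms.estimate("missing"))  # expect 3 1 0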

Rabin-Karp algorithm:

  • Use case: Ctrl+f functionality (finding a search word on a document)
  • How to implement it:
    • Compute the hash value of the search term.
    • Find the length of the search term.
    • Move a window over all possible subsets of the document with the length of the search term.
    • While moving over all possible windows of the document, calculate the hash value of each one. If the hash value is identical to the search term's hash, then check whether all the characters actually match.
    • When computing the hash value of each window, you can use an optimization to calculate it faster (a rolling hash): subtract the contribution of the character that just slid out of the window, then add the contribution of the new character that slid in (see the sketch after this list).
  • Benefits:
    • A faster method to find a search term in a document.
    • The check of a single window for a match is O(1). This is an improvement over checking character by character which is O(p) where p is the length of the search term.
  • Another use case:
    • Finding common substrings of size p between two documents. Store all hash values in a collection for document 1. When searching document 2, check all hash values from document 1 and if there is a match then do a character by character check.
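
A rolling-hash sketch of Rabin-Karp in Python (the base and modulus are arbitrary choices of mine):

def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    n, p = len(text), len(pattern)
    if p > n:
        return -1
    # Hash of the pattern and of the first window of the text.
    pattern_hash = window_hash = 0
    for i in range(p):
        pattern_hash = (pattern_hash * base + ord(pattern[i])) % mod
        window_hash = (window_hash * base + ord(text[i])) % mod
    high = pow(base, p - 1, mod)  # weight of the character about to leave the window
    for start in range(n - p + 1):
        # Only do the character-by-character check when the hashes match.
        if window_hash == pattern_hash and text[start:start + p] == pattern:
            return start
        if start < n - p:
            # Roll the hash: drop the leading character, shift, add the next character.
            window_hash = ((window_hash - ord(text[start]) * high) * base + ord(text[start + p])) % mod
    return -1

print(rabin_karp("the quick brown fox", "brown"))  # 10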


CHAPTER TWO

Trees and Graphs: Basics




Binary search tree:

  • Typically all search, inserts, and deletes are O(logn), but if the tree is built imbalanced these become O(n). There are self balancing trees that are more complicated.
  • Each element is called a node.
  • Each node can have children called “left” and “right”.
  • “Left” is a value less than the parent node or null to represent no child.
  • “Right” is a value higher than the parent node or null to represent no child.
  • Implementation of functionality:
    • Search: Say you are searching for the value 50 in the binary tree. If the first node is 30 then you know to go to the “right” child since that is where 50 might be, it can’t be on the “left” side. You just repeat this until you reach the value 50 or a dead end of null.
    • Delete: Search for it. When found you want to delete it. If it has no children, just delete it. If it has one child, replace it with that child. If it has two children, replace its value with its in-order successor (the smallest value in its right subtree), then delete that successor node, which has at most one child.
    • Insert: Keep moving down the tree until a null is found. Then place it in the null spot.
    • Traversal: A traversal is where you take an action on every element in the binary search tree. Traversing "left", then self, then "right" is how you traverse in sorted order, but feel free to traverse in any order you want. This traverse function calls itself, so it is recursive (see the sketch below).
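
A bare-bones binary search tree in Python (insert, search, and in-order traversal only; delete is left out to keep it short):

class Node:
    def __init__(self, value):
        self.value = value
        self.left = None    # child holding values smaller than self.value
        self.right = None   # child holding values larger than self.value

def insert(root, value):
    if root is None:
        return Node(value)  # found a null spot, place the new node here
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def search(root, value):
    while root is not None and root.value != value:
        root = root.left if value < root.value else root.right
    return root is not None

def in_order(root):
    # Left, then self, then right gives the values in sorted order.
    if root is None:
        return []
    return in_order(root.left) + [root.value] + in_order(root.right)

root = None
for v in [30, 50, 20, 40, 10]:
    root = insert(root, v)
print(search(root, 50), in_order(root))  # True [10, 20, 30, 40, 50]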

Red-black trees

  • The same idea as a binary search tree, but it self balances to ensure O(logn) for the searches, inserts, and deletes.
  • Every insert and delete does complex operations to ensure the tree is as balanced as possible.
  • Rules: The root is black. All nodes are either red or black. All paths must have the same # of black nodes. All red nodes must have black children. Null values are black. Inserts are initially red.
  • Inserts: if you get a red-red violation there are two remedies. The easier one is to perform color changes while preserving the rules. The second way is to do a rotation:
    • Right rotation: The left child becomes the parent. The parent becomes that one’s right child. The original left child’s right child becomes the original parent’s left child.
    • Left rotation: The right child becomes the parent. The parent becomes that one’s left child. The original right child’s left child becomes the original parent’s right child.
  • Deletes: Lots of edge cases to handle and complex steps to keep the rules unviolated.

Nodes and edges


Linked lists:

  • What is it: It's just a series of nodes. Each node has the data it stores (like the number 5, etc.) and a reference to the next node. It's a chain of nodes with values and references. It's very easy to do inserts and deletes in O(1) time.
  • How it works: Say you want to insert 5 into a linked list: [1, 4, 7, 8, 10]. All you do is create a node "5", point node "1"'s reference at node "5", and point node "5"'s reference at node "4". The result is: [1, 5, 4, 7, 8, 10].
  • Downsides: They can be quite slow to do searches as it takes O(n) time even when sorted.
  • Should it be sorted: Some linked lists are sorted and others are not sorted. It’s up to your need, similar to lists.

Skip lists:

  • Why it exists: A sorted linked list is a data structure with O(1) inserts and deletes (once you know the position) but slow O(n) searches. Skip lists are an improvement that brings searches down to about O(sqrt(n)) with one extra layer, or O(logn) with many layers.
  • Explained with an example:
    • Say you have a sorted linked list with 100 elements whose values are 1, 2, …, 100 for this simple example. Let's say you build a second linked list with 11 elements: 1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100; each element of this secondary linked list points at roughly every 10th value in the primary linked list.
    • So say you want to search for the number 55 in the linked list. You first start with the secondary linked list of 1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100. You keep moving forward from 1, 10, … onwards until you reach a number higher than 55, in this case 60. So you go back one to 50.
    • Next you use that secondary linked list to find the location of 50 on the primary linked list. Then you just keep stepping forward until you reach 55 or a number higher than 55. If 55 is found then you did it. If 55 is not found but a number like 56 is found then you know 55 does not exist in the linked list.
  • Inserts and when to build new layers: Whether we are starting from a blank skip list, or appending to it, we follow the same process. We have to continue to grow layer 2-5 even as we do inserts to layer 1. We do this by giving a 50% chance that an insert to layer 1 will also be placed in layer 2. We give a 50% chance that an insert to layer 2 will be placed in layer 3. We do the same for layer 4 and 5 respectively. You can even make it work for unbounded layers.

Graphs:

  • What is it: It is a set of nodes which are the objects, and edges which are the relationships between objects.
  • Types of directions: Directed graphs are where edges have a one-way direction. Undirected graphs are where edges do not have a direction. Weighted graphs are where each edge has a weight representing the cost to travel it. Acyclic graphs are directed graphs with no cycles or loops where you can arrive back at a previously visited point. A cyclic graph is a directed graph where those cycles or loops can occur.
  • Data Structure: Any technique to keep track of the data (nodes) and the edges.
    • Lists of nodes/edges
    • Or it could be an n×n matrix, where n is the number of nodes and each 0 or 1 in the matrix says whether an edge exists
    • Etc.


Stack:

  • What is it: Think of it as a normal array, but to access the elements you always take from the right side of the array and remove it from the array. When you place new data into it it goes on the right side of the array.

Queue:

  • What is it: Think of it as a normal array, but to access the elements you always take from the right side of the array and remove it from the array. When you place new data into it it goes on the left side of the array.

Graph Traversals:

  • Types of traversals: Imagine there is a cave with two paths, "left" and "right". No matter which way you go, there is a "left" and "right" choice several times. Only the exact order "left", "right", "right", "right", "right" leads out of the cave. To traverse the cave and discover this exit, you can pursue two strategies:
    • Depth first search: Keeps going until a dead end before backtracking.
    • Breadth first search: Backtracks constantly, exploring level by level. Tends to find the shortest paths (by number of edges) earlier.
  • How to implement types of traversals:
    • A traversal: You place nodes into a collection (stack or queue). You pop out a node, act on it, then push all of the nodes it is attached to into the collection if they haven't been visited yet (see the sketch after this list).
    • Depth first search: Place each node into a stack.
    • Breadth first search: Place each node into a queue.
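
A sketch of both traversals in Python; the only difference is whether you pop from the end of the collection (stack, depth first) or from the front (queue, breadth first). The graph here is a made-up adjacency list:

from collections import deque

graph = {
    "entrance": ["A", "B"],
    "A": ["C"],
    "B": ["D"],
    "C": [],
    "D": ["exit"],
    "exit": [],
}

def traverse(start, depth_first=True):
    collection = deque([start])
    visited = {start}
    order = []
    while collection:
        node = collection.pop() if depth_first else collection.popleft()
        order.append(node)  # "act on" the node
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                collection.append(neighbor)
    return order

print(traverse("entrance", depth_first=True))   # depth first order
print(traverse("entrance", depth_first=False))  # breadth first order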

Topological Sorting:

  • Requirements: It must be a directed acyclic graph (see previous page for definition).
  • What is it: Say you have 10 Python scripts and each of them might rely on other scripts running first. We are coming up with a run order that accounts for the dependencies.
  • How to determine this: Just start at the leaf nodes and pop them out of the graph. As you pop them out, they are the last Python scripts to run. Say you popped the following Python scripts in this order: [10, 9, 5, 6, 3, 8, 7, 4, 2, 1]. Then the topological sort is just the same thing but the last one popped out is first and so on: [1, 2, 4, 7, 8, 3, 6, 5, 9, 10] is the final topological sort. Any node without connections to any other nodes are placed first as well.
  • Is there one correct way to sort it: No, there are multiple ways, it just must be in the sorted order where there are no problems. For example, with a topological sorted result of [1, 2, 4, 7, 8, 3, 6, 5, 9, 10], if 9 is a dependency of 4, then it’s a problem because all dependencies of 4 need to run before 4 (in this case 1, 2 run before 4).


Strongly Connected Components (SCC):

  • Among the nodes on a directed graph, a set of nodes are strongly connected if all nodes can reach all other nodes. Remember from the last page, a directed graph is when each edge has a single direction. This strongly connected components definition means a set of nodes where all nodes can reach any other node using these single direction edges.
  • Properties:
    • The biggest SCC set is the SCC: Each SCC is the biggest set possible. Say 4 nodes belong to an SCC; then you would consider all 4 to be part of that SCC, not, say, 2 of the nodes as one SCC and the other 2 as another. You only consider the biggest set as the SCC.
    • Mutually exclusive nodes: One SCC can't contain nodes from another SCC, otherwise they would have become one combined SCC. All remaining nodes not part of an SCC are actually SCCs of size 1; no arrow pointing to itself required.
    • A path always exists: If looking to get from one node to another in a SCC, there is always a path to do so.
    • Condensation Graph: You can condense each 2+ node SCC into a single node. This condenses the graph to be smaller. It can simplify some graph problems. The new graph is now a directed acyclic graph (DAG).
    • Exit from a cycle: If a 2+ node SCC reaches a one-node SCC, we know the cycle among the 2+ nodes is over. The one-node SCC is sometimes known as a terminal node, since the cycle possible within the 2+ node SCC is no longer possible once we have reached the one-node SCC. In other words, if we were able to get back to the 2+ node SCC from the one-node SCC, then the one-node SCC would by definition actually be part of the 2+ node SCC instead of its own SCC.
    • Bridges: A bridge is when you exit one SCC and enter a new SCC.
    • Topological Order: This can be applied once it becomes a DAG using the condensation graph mentioned above.

SCC Identification Algorithm - Tarjan’s Algorithm

  • Do a DFS labeling the nodes as they are first visited.
  • The starting node is node_id 0 and its low-link value is 0; each node visited from there increments by one.
  • You also keep track of all nodes visited. You also keep track of all nodes visited since the last deadend.
  • When your DFS takes you to a:
    • Node visited before: Check if it's on the list of nodes visited since the last dead end. If it is not, then you don't do anything. If it is, then you set the current node's low-link value to the minimum of the low-link values of the two nodes on either side of the edge that triggered this.
    • Dead-end:
      • No more routes to check.
      • You backtrack over all the nodes visited backwards (called a stack, it’s the list of nodes visited since the last deadend).
        • First you take the minimum of the low-link values of the nodes on either side of the edge (the backtracked node and the node we backtracked from).
        • Then you continue the DFS as normal from this node’s remaining unused edges.
        • When you reach a final dead-end on all the nodes involved, you look at all the low-link values and set all those that are identical low-link values as a SCC.
        • Pop those SCC nodes found from the stack.
        • Start again from a random remaining node.


Amortized Analysis:

  • This is like saying a pop costs 1 operation. To amortize it, we instead say a pop costs 0 operations but a push costs 2 operations. That is, a push really costs 1 operation to push and 1 to eventually pop, but we can say the push is 2 and the pop is 0 because we prepaid the operation cost during the push.
  • Some operations might require n+1 operations, like adding 1 to 9,999,999 by hand: you have to carry the 1 seven times, but the result is 10,000,000, which means you won't have to carry during increments as often for a while. So you've prepaid the cost of carrying, and it will be quite a while before you get lots of 9s again. Computers work in binary, but this example uses base-10 numbers.

Minimum Spanning Trees (MST):

  • Valid Graphs: A graph where edges don’t have a direction. All edges have a weight.
  • What is MST: Find the tree that connects all nodes with the minimum cost. There are no cycles because connecting two nodes already part of the tree would be wasted cost.
  • Prim’s Algorithm: Select a random node. Place all of its edges in a priority queue. Select the edge that has the lowest cost. Then it becomes a two node tree. Add that new node’s edges into the priority queue. Select the edge that has the lowest cost. Keep repeating this until you get the final solution.
  • Kruskal's Algorithm: Sort the edges by cost. Iterate from the smallest to the largest cost. If an edge does not make a cycle, use it and continue iterating. If it does create a cycle, it isn't useful, so you can skip it. This gives the final solution once all edges have been iterated over or once you realize that all nodes are connected.
  • Union-Find Rank Compression Optimization: This is a fancy way of saying that each set of nodes already connected to each other in Kruskal's algorithm should be given an id. That way you can efficiently check whether adding an edge will create a cycle (it will if both endpoints already have the same id). See the sketch below.
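
A compact sketch of Kruskal's algorithm with union-find in Python (edges are (weight, u, v) tuples for a made-up graph):

def kruskal(num_nodes, edges):
    parent = list(range(num_nodes))          # union-find: every node starts as its own set

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path compression
            x = parent[x]
        return x

    mst, total = [], 0
    for weight, u, v in sorted(edges):       # smallest edge cost first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                 # same root would mean a cycle, so skip those
            parent[root_u] = root_v          # union the two sets
            mst.append((u, v, weight))
            total += weight
    return mst, total

edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]
print(kruskal(4, edges))  # ([(0, 1, 1), (1, 3, 2), (1, 2, 3)], 6)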


Shortest Path Part 1:

  • Common problems:
    • Shortest path from a starting location to all other locations.
    • Shortest path to an ending location from all other locations.
    • Shortest path from a starting location and an ending location.
    • Shortest path from all possible start and end locations.
  • Properties:
    • Keeping track of the shortest path of smaller journeys: Sometimes solving the shortest path is done by finding the shortest path of a smaller journey and combining the results. The best path for smaller journeys can be saved in a dictionary with keys (start_node, end_node) and the value is information about “the shortest path between them”.
    • Negative weights don't work in all algorithms. If you have negative weights, you will need to use certain algorithms that allow them. For example, sometimes you can follow cycles of negative weights infinitely many times to find a path that has a negative infinity cost.
  • Bellman-Ford Algorithm: Works with negative weights.
    • Keep a table to keep track of distances. The starting node has distance 0 and the others have distance infinity.
    • Just go through all the edges (u, v)
      • If node u has distance infinity then just ignore it for now.
      • Otherwise, use the distance of u (distance from s to u) plus the edge weight to update v’s distance if it’s lower than the previously calculated shortest path to v. Also, keep track of the path taken to reach this minimum distance for each node, as an array of nodes.
    • Repeat this again (all edges) “Nodes-1” times to reach a final solution. If the weights don’t change at all for all edges, you can stop early, because you found a final solution.
    • If you ran it "Nodes-1" times, then run it one more time and still find improvements, you've detected a negative-infinity-cost cycle. To detect all nodes that need a weight of negative infinity due to this negative cycle, review the path taken to reach the node that decreased in that extra iteration; once you find a node repeating itself, you know you've identified the negative cycle.
  • Dijkstra's Shortest Path Algorithm: Works with cycles, but requires non-negative edge weights (a sketch follows this list).
    • Output: For each node:
      • What is the cost to get there from the starting node A
      • What is the previous vertex (check this repeatedly to build the shortest path; for example, A to E might be A -> C -> E if E's previous vertex is C and C's is A).
    • Useful variables:
      • Nodes visited
      • Nodes not visited
    • Steps
      • Set the shortest path to A as 0 and for others as infinity.
      • Start at start vertex.
      • (Distance from A of the current node) + edge distance: if that is less than the distance to the neighboring node in the table, we update its shortest distance and previous-vertex values. Then mark the current node as visited and remove it from the unvisited set in the variables mentioned above.
      • Among the unvisited nodes, choose the one with the shortest distance to A in the table. Repeat until this is done for all nodes one time.
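
A sketch of Dijkstra's algorithm in Python using heapq as the priority queue (the graph is a made-up example; all edge weights must be non-negative):

import heapq

def dijkstra(graph, start):
    # graph: {node: [(neighbor, weight), ...]}
    dist = {node: float("inf") for node in graph}
    prev = {node: None for node in graph}
    dist[start] = 0
    heap = [(0, start)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)          # unvisited node currently closest to the start
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph[node]:
            if d + weight < dist[neighbor]:    # found a shorter route to this neighbor
                dist[neighbor] = d + weight
                prev[neighbor] = node
                heapq.heappush(heap, (dist[neighbor], neighbor))
    return dist, prev

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 2), ("E", 7)], "C": [("E", 3)], "E": []}
print(dijkstra(graph, "A"))  # dist: A=0, B=1, C=3, E=6; prev traces A -> B -> C -> E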


Shortest Path Part 2:

  • Directed Acyclic Graphs (DAGs) Shortest Path Algorithm:
    • First perform a topological sort
    • Set the shortest distance per node. This initializes as 0 for the starting node and infinity for the other nodes.
    • Just use the topological sort to decide the order the nodes are processed. Just go through all edges that leave from that node, and update the destinations with the shortest path found so far (this path or any prior paths).
    • That’s it, it runs in O(V+E) time. Notably, if you need to find the longest path you can multiply all edge weights by -1 and solve it.

All Pairs Shortest Path (APSP):

  • Finding the shortest path for all pairs of nodes. Runs in O(V^3) so it’s good for graphs no more than a few hundred nodes.
  • Floyd-Warshall Algorithm:
    • Data is stored in a two dimensional matrix.
      • The rows are the starting locations, the columns are the destination locations.
      • Initialize with all values as infinity.
      • Set the diagonals to 0 (where row_i = column_j)
      • Use all the edges to initialize all other values in the matrix
      • When we get our final matrix answer later on, infinity will represent destination nodes (column_j) that are impossible to reach from the starting node (row_i).
    • The algorithm is three nested loops over all nodes, relaxing each (i, j) distance through an intermediate node k:

for k in range(n):
    for i in range(n):
        for j in range(n):
            if dist[i][j] > dist[i][k] + dist[k][j]:
                dist[i][j] = dist[i][k] + dist[k][j]


CHAPTER THREE

Dynamic Programming, Greedy Algorithms




Divide and Conquer:

  • Three steps: Divide, conquer, and combine.

Longest Common Subsequence:

  • For example, the longest common subsequence of abcdaf and acbcf is: "abcf"
  • Matrix of size m+1 and n+1 where m and n are the lengths of the two strings.
  • Anything with row 0 or column 0 is a zero.
  • Iterating by row and then by column works
  • For a new cell you are filling in the matrix, check if the two characters match. Say it’s the 3rd column and 2nd row, check if the 3rd character matches the 2nd character of the two strings respectively.
    • If they match: Do 1 + the value “diagonal up/left”
    • If they don’t match: Take the max of the two values: the “left” value and the value “above” in the matrix
  • The answer is the final bottom-right value in the matrix, filled in last. To find the sequence, follow the numbers back from there: when the two characters match, take that character and move diagonally up and left; otherwise move to whichever of the left/up cells holds the larger value. Keep doing that until you have found all the characters that contributed (see the sketch after this list).
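
A sketch of the longest common subsequence table in Python; it returns the length and walks back through the matrix to recover the subsequence:

def longest_common_subsequence(s1, s2):
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 and column 0 stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = 1 + dp[i - 1][j - 1]             # match: 1 + diagonal up/left
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])  # no match: best of up/left
    # Walk back from the bottom-right corner to recover the characters used.
    chars, i, j = [], m, n
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            chars.append(s1[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return dp[m][n], "".join(reversed(chars))

print(longest_common_subsequence("abcdaf", "acbcf"))  # (4, 'abcf')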

Longest Common Substring:

  • For example, the longest common substring of abcdaf and zbcdf is: “bcd”
  • Matrix of size m+1 and n+1 where m and n are the lengths of the two strings.
  • Anything with row 0 or column 0 is a zero.
  • Iterating by row and then by column works
  • For a new cell you are filling in the matrix, check if the two characters match. Say it’s the 3rd column and 2nd row, check if the 3rd character matches the 2nd character of the two strings respectively.
    • If they match: Do 1 + the value “diagonal up/left”
    • If they don't match: Then place a 0 there.
  • Iterate over everything to find the max value to find the max length. It is trivial to find the common substring based on the location of the max value in the matrix. Say the max value is found at row 4 and column 3 (ignoring the 0 index) and the length is 2. Then the longest substring is the 3rd/4th character in word1 and the 2nd/3rd character in word2.

Saving intermediate results in dictionaries


Max Subarray Problem:

  • Array of integers. Find the maximum sum using a sequence of adjacent integers. For example [-2, 1, -3, 4, -1, 2, 1, -5, 4]
  • One solution:
    • Iterate through and combine adjacent runs of negative or positive values: [-2, 1, -3, 4, -1, 3, -5, 4] (combined the 2 and 1 into 3). This is probably optional; it just seems simpler to reason about with them combined.
    • Start on the left and just iterate across and you’ll be done:
      • -2: Skip since it's negative. Array so far: []
      • 1: Positive so we consider it. Array so far: [1]
      • -3: This brings the sum to -2 (1 + -3), therefore we drop this and everything to the left because it brings the total below 0. Array so far: []
      • 4: Positive so we consider it. Array so far: [4]
      • -1: This brings the sum to 3 (4+-1). Array so far: [4, -1]
      • 3: This is positive. It gives a sum of 6. Array so far: [4, -1, 3]
      • -5: This is negative. It gives the sum of 1. Array so far: [4, -1, 3, -5]
      • 4: This is positive, it gives the sum of 5. Array so far: [4, -1, 3, -5, 4]
      • End. The highest sum we had was 6. So we use the array [4, -1, 3]. We expand it back to the original values of [4, -1, 2, 1]
    • Running time O(n)
  • The course teaches a suboptimal solution that runs in O(nlogn) instead of the O(n) solution above. This version essentially does the same divide and conquer recursion as mergesort. The max subarray is then:
    • The best of three candidates: the best subarray entirely in the left half, the best entirely in the right half, and the best subarray that crosses the middle.
    • To get the crossing candidate, expand outward from the middle one element at a time, to the left and to the right, keeping the best sums found.
    • This approach seems more confusing, but it works as well (a sketch of the O(n) version follows this list).
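
A sketch of the O(n) approach (often called Kadane's algorithm) in Python; it tracks the running sum and the best slice seen so far:

def max_subarray(arr):
    best_sum = current_sum = arr[0]
    best_start = best_end = current_start = 0
    for i in range(1, len(arr)):
        if current_sum < 0:
            # Everything to the left drags the total below zero, so restart here.
            current_sum = arr[i]
            current_start = i
        else:
            current_sum += arr[i]
        if current_sum > best_sum:
            best_sum = current_sum
            best_start, best_end = current_start, i
    return best_sum, arr[best_start:best_end + 1]

print(max_subarray([-2, 1, -3, 4, -1, 2, 1, -5, 4]))  # (6, [4, -1, 2, 1])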

Karatsuba Multiplication:

  • Multiplying large numbers together takes O(n^2) the classic way. This way runs in about O(n^1.6) time so quite a bit faster.
  • Implementation:
    • Input: Say 9876*5432
    • Split it into 4 variables a=98, b=76, c=54, d=32
    • Step 1) a*c=5,292
    • Step 2) b*d=2,432
    • Step 3) (a+b)(c+d)=14,964
    • Step 4) (3)-(2)-(1)=7,240
    • Step 5) 5,292*10^4 + 2,432 + 7,240*10^2 = 53,646,432
    • If you have larger numbers, just recursively call this procedure whenever the multiplication of two large numbers is involved. For cases where the factors are 1-2 digits, use the normal multiplication method (a recursive sketch follows this list).
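
A recursive Karatsuba sketch in Python; it splits each number in half by digit count and falls back to normal multiplication for tiny numbers:

def karatsuba(x, y):
    if x < 100 or y < 100:          # 1-2 digit numbers: normal multiplication
        return x * y
    digits = max(len(str(x)), len(str(y)))
    half = digits // 2
    split = 10 ** half
    a, b = divmod(x, split)         # x = a * 10^half + b
    c, d = divmod(y, split)         # y = c * 10^half + d
    ac = karatsuba(a, c)            # step 1
    bd = karatsuba(b, d)            # step 2
    cross = karatsuba(a + b, c + d) - ac - bd                # steps 3 and 4
    return ac * 10 ** (2 * half) + cross * 10 ** half + bd   # step 5

print(karatsuba(9876, 5432), 9876 * 5432)  # both 53646432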


Master Method for Recursion:

  • Recursion is broken into two steps:
    • The recursive calls. We will call this a*T(n/b), where a is the number of recursive calls per level and "b" is the factor the problem size shrinks by for each subproblem (i.e. 2 if each subproblem is half the size).
    • Extra operations. We will call this f(n) (such as merging the results of the subproblems, or any processing before/after the recursion calls).
  • Time complexity
    • If the recursive calls are by far the highest-cost part, then O(n^log_b(a)). This is when n^log_b(a) grows faster than f(n).
    • If the extra operations are by far the highest-cost part, then O(f(n)). This is when f(n) grows faster than n^log_b(a).
    • If they are both similar in cost, then O(n^log_b(a)*log(n)). This is when f(n) is about n^log_b(a).
  • Examples
    • Mergesort: a=2, b=2, f(n)=n (c=1) ends up with nlog(n), which is right.
    • Karatsuba Multiplication: a=3, b=2, f(n)=n (c=1) ends up with about n^1.6 (n^log_2(3)), which is right.

Fast Fourier Transform (FFT):

  • Use cases:
    • Solving partial differential equations (PDEs)
    • Data compression for audio data, to remove noise from the data, or data analysis.
  • Running time: O(nlogn). An improvement over DFT which is O(n^2).
  • What is it doing: It takes a series of data values and compresses (decomposes) it into a combination of sines and cosines. The summation of sines and cosines is infinite, but you can truncate it for an approximation.
  • Implementations: It works by divide and conquer, looking for symmetry. The most common implementation is radix-2 FFT, which requires the input data size N to be a power of 2; there are radix-3 and radix-4 FFTs too. Others are Cooley-Tukey, Bluestein, and prime-factor FFT. The inverse FFT can do signal reconstruction to reverse a previously applied FFT transformation.
  • Complex numbers: The Fourier transform involves e^(iθ), which traces out a spiral over time; this spiral, when viewed at the right angle, is a cosine wave or a sine wave.
  • Complex roots of unity: Helps with computing the DFT for FFT. This is just solving for z so that z^n=1 (n is input size).
  • DeMoivre’s Theorem: States that z^n = r^n*(cos(nθ)+i*sin(nθ)).
  • Polar form: Complex numbers are often represented in polar form as z = r(cos(θ) + i*sin(θ))
  • Changing the dimensions: In math you can have as many dimensions as you want. For imaginary numbers we can use the x-axis and the imaginary axis to represent two dimensions. Positive i values are in one direction, negative i values are in another direction, and for the x axis there can be positive and negative values as well.
  • Application 1, polynomial multiplication with FFT: Instead of the standard way of multiplying polynomials together (Reminder: a polynomial is like this: ax^2 + bx + c), we can use FFT. What you do is apply FFT to the input polynomial coefficients (0 if one doesn’t exist). Then do pointwise multiplication in the frequency domain, followed by an inverse FFT.
  • Application 2, data analysis using FFT: Converting audio, sensors, images, or time series into a spectrogram (frequency domain), features, to remove noise, or to visualize the spectrogram.
    • For example, you represent a time series as an infinite series of sine/cosine waves, but then select the first five sine/cosine waves to create a new time series with less noise.


Introduction to dynamic programming using the rod cutting problem:

  • Say you have a rod of length 8. You are given the prices of the rod for various sizes: [1, 5, 8, 9, 10, 17, 17, 20]. Find the optimal cuts to maximize profit. For example, cutting into sizes 2, 2, and 4 would give 19 dollars, which is definitely not the answer, since even a single uncut length-8 rod is worth 20 dollars.
  • How to solve it:
    • Save a dictionary variable called dp (short for dynamic programming) that stores the optimal profit from size n.
    • Start with solving the optimal profit for size 1. The prices again are [1, 5, 8, 9, 10, 17, 17, 20]. That is obviously $1. So you do “dp[1] = 1”
    • Next find the optimal profit for size 2. That is either 5 or dp[1]+dp[1] = 1+1. So the result is “dp[2] = 5”.
    • For the optimal profit for size 3, you check dp[2] + dp[1] = 5 + 1 and compare that with 8. So you say “dp[3] = 8”
    • For optimal profit for size 4, you check dp[1]+dp[3], dp[2]+dp[2], and 9. The best result is dp[2]+dp[2] = 5+5 = 10. So "dp[4] = 10"
    • For the optimal profit for size 5, you check 10, dp[1]+dp[4]=11, dp[2]+dp[3]=13 so “dp[5] = 13”.
    • For optimal profit of size 6, you check 17, dp[1]+dp[5], dp[2]+dp[4], dp[3]+dp[3], so “dp[6] = 17”.
    • For optimal profit of size 7, you check 17, dp[1]+dp[6], dp[2]+dp[5], dp[3]+dp[4], so “dp[7] = 18”
    • For optimal profit of size 8, you check 20, dp[1]+dp[7], dp[2]+dp[6], dp[3]+dp[5], dp[4]+dp[4]. We get the final answer as size 2 and size 6 for a profit of $22.
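
A sketch of the rod cutting DP in Python (prices[i] is the price of a piece of length i+1):

def rod_cutting(prices, length):
    dp = {0: 0}                                   # dp[size] = optimal profit for that size
    for size in range(1, length + 1):
        best = prices[size - 1] if size <= len(prices) else 0   # option: sell it uncut
        for cut in range(1, size // 2 + 1):
            best = max(best, dp[cut] + dp[size - cut])          # option: best two-piece split
        dp[size] = best
    return dp[length]

prices = [1, 5, 8, 9, 10, 17, 17, 20]
print(rod_cutting(prices, 8))  # 22, from a size-2 piece plus a size-6 piece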

Coin changing problem:

  • The problem: Find the minimum coins to give the exact change. Let’s say the minimum coins needed for exact change of $0.98 given normal US coins.
  • To solve it:
    • So, dp[0.98] will be the answer, so dp must represent the minimum coins to give the exact change. So just keep solving dp[0.01], dp[0.02] onwards until you reach 0.98.
    • So dp[0] is zero and dp[0.01] is one for a penny
    • dp[0.02], dp[0.03], and dp[0.04] are two, three, and four for penny change.
    • dp[0.05] is solved by checking dp[0] + 1 for a nickel and dp[0.04] + 1 for a penny. The best answer is 1 for the nickel.
    • Just keep going checking all coins: penny, nickel, dime, quarter combined with a prior dp[n] (n>=0).
    • The answer is simple with US coins but can get complex with other denominations. Here it is clearly 3 quarters, 2 dimes, and 3 pennies. A greedy approach that works with denominations like the US coins is: use the biggest coins first without going over. For 0.98 you try 3 quarters, then 2 dimes, then 3 pennies, and you reach the optimal result of 8 coins. Dynamic programming works with any denominations every time, while this greedy approach only works with the right denominations, such as US coins (a sketch follows this list).
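
A sketch of the coin changing DP in Python, working in integer cents so the keys aren't floats:

def min_coins(target_cents, coins=(1, 5, 10, 25)):
    dp = [0] + [float("inf")] * target_cents     # dp[x] = fewest coins that make x cents
    for amount in range(1, target_cents + 1):
        for coin in coins:
            if coin <= amount:
                dp[amount] = min(dp[amount], dp[amount - coin] + 1)
    return dp[target_cents]

print(min_coins(98))  # 8 -> 3 quarters, 2 dimes, 3 pennies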


Knapsack problem:

  • Goal: Find the subset of items with the highest value without surpassing a weight limit.
  • Dynamic programming solution:
    • The dynamic programming table will be a 2D matrix. The columns are the weights from zero to the total weight allowed; the rows are the number of objects considered so far.
    • The first row and column represent w=0 (column) and i=0 (row) and have value 0.
    • So start filling the i=1 row next. Consider only the first item. When it doesn’t fit in the bag place 0. When it does fit in the bag, place the value of that item. Now you’re done with the first row.
    • Now start filling the i=2 row next. Find the maximum profit at each weight considering this new item as well. Always be willing to look at the values populated in the i=1 row as needed.
    • Formula: For every value consider the following to fill in a value. The maximum of one of these two values:
      • Option 1: Not using this item: One row before, and the same column. In other words (i-1, w).
      • Option 2: Using this item: Again, one row before, but go enough columns back to fit this new item. Say this item weighs 3, then you have to go 3 columns back on the previous row to fit this item. Also, add the value of the item to see which is greater. In other words: (i-1, w-w(i)) + Profit(i).
  • Greedy solution:
    • Sort the objects by profit/weight. Select the items in this sorted order. When you start to approach the weight limit, carefully consider which items to include/exclude to maximize profit, which typically lands very close to the weight limit. You aren't guaranteed to get the optimal answer, but you'll get something very close (a sketch of the dynamic programming table follows this list).
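
A sketch of the 0/1 knapsack DP table in Python (rows are items considered, columns are weight capacity; the items are made up):

def knapsack(weights, values, capacity):
    n = len(weights)
    dp = [[0] * (capacity + 1) for _ in range(n + 1)]   # row 0 and column 0 stay 0
    for i in range(1, n + 1):
        for w in range(1, capacity + 1):
            dp[i][w] = dp[i - 1][w]                     # option 1: skip item i
            if weights[i - 1] <= w:                     # option 2: use item i if it fits
                dp[i][w] = max(dp[i][w], dp[i - 1][w - weights[i - 1]] + values[i - 1])
    return dp[n][capacity]

weights = [3, 4, 5, 2]
values = [30, 50, 60, 10]
print(knapsack(weights, values, 8))  # 90, by taking the weight-3 and weight-5 items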

When optimal substructure fails:

  • This just means dynamic programming (dp) isn’t always the solution.
  • To identify if “dp” is right for a problem: The solution is the final value in the “dp” variable, and solving smaller problems guarantees the answer for a slightly larger problem: See rod cutting, coin changing, or knapsack above for examples of this.
  • Other solution techniques: As seen in the knapsack problem “greedy solution”, you can solve the problem with other approaches such as greedy algorithms.

Introduction to greedy algorithms:

  • State your assumptions and constraints at the start
  • Select the best available option at each moment in time.
  • That's pretty much it. Greedy algorithms don't always give the optimal answer, but often it's good enough.

Greedy interval scheduling:

  • Problem: Find the schedule with the most events, without overlapping events.
  • Solution:
    • Initialize: Sort the intervals by their end time ascending order. Have an answers=[] variable.
    • Add one at a time: Go through the list in order. If the new event's start time doesn't overlap with the previously added event's end time, add it to answers (see the sketch after this list).
    • That’s it and it works pretty well.
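
A sketch of the greedy interval scheduling in Python (intervals are (start, end) tuples):

def schedule(intervals):
    answers = []
    last_end = float("-inf")
    for start, end in sorted(intervals, key=lambda iv: iv[1]):  # earliest end time first
        if start >= last_end:        # no overlap with the previously added event
            answers.append((start, end))
            last_end = end
    return answers

events = [(1, 4), (3, 5), (0, 6), (5, 7), (3, 9), (8, 10)]
print(schedule(events))  # [(1, 4), (5, 7), (8, 10)]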


Huffman Codes:

  • What is it: A compression algorithm for strings.
  • How is a string normally encoded: A single character can take many bits to be encoded. For example “a” is encoded as 01100001.
  • How it works: We create a binary tree where each node is assigned a character. A traversal to a left child is represented by “0”. Traversal to a right child has a value of “1”.
  • Goal: Make traversals shortest for characters with the highest frequency. So overall less bits are needed.
  • Steps:
    1. Get the frequency per character
    2. Get the two characters with the lowest frequency. Make a new parent for them called "n1" and make these two its children. Add "n1" to the frequency table; its frequency is the left child's plus the right child's frequency.
    3. Repeat this many times until the entire tree is done.
  • End result: The most common characters are often represented by only a couple of bits, while the less common characters are represented by longer codes (a small sketch follows this list).
  • How do you know what bits belong to characters if they are variable length bits: Just follow the tree and when you hit a dead-end you know that is the character and the next character will begin on the next bit.
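
A compact Huffman code builder in Python using heapq (my own sketch; left edges get "0" and right edges get "1", and the exact codes can vary between equally valid Huffman trees):

import heapq
from collections import Counter

def huffman_codes(text):
    # One weighted leaf per character; the integer counter just breaks ties when frequencies match.
    heap = [(freq, i, {ch: ""}) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two lowest-frequency subtrees under a new parent node.
        freq1, _, codes1 = heapq.heappop(heap)
        freq2, _, codes2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in codes1.items()}
        merged.update({ch: "1" + code for ch, code in codes2.items()})
        heapq.heappush(heap, (freq1 + freq2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_codes("aaaabbc"))  # {'c': '00', 'b': '01', 'a': '1'}: the most frequent character gets the shortest code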

Decision problem and languages:

  • How fast can you solve it: For some problems it is not certain whether a faster solution exists, and the known solutions run very slowly on huge datasets. For other problems there are proofs that no faster solution exists, so they must run very slowly on huge datasets, as in cryptography.
  • One solution might be easier to find: Like when your puzzle is to check whether there exists a way to achieve an outcome. It is easier to answer "yes", because you only have to exhibit one solution that works. To answer "no" you have to try all possible ways to prove none exists, so it takes more computation to arrive at "no".
    • "For all x, it is true": "yes" typically takes more computation.
    • "There exists an x which makes it true": "no" typically takes more computation.


CHAPTER FOUR

Data Science as a Field




Data science adds a lot of value, making it necessary for a business to use:

  • Improving engagement/retention is done with customer data (i.e. clicks) by means of goal setting, dashboards, data analysis, and A/B testing.
  • A statistical model can do almost any work task a human can do and complete it much faster and cheaper. For example, it can look at 100 million records and identify the profitable records and the company can then take action on them.
  • Just having the data available on a website is valuable to customers or customer support.

The skills you would need to achieve these goals are:

  • Regular Supervised Machine learning for most statistical models.
  • Product analytics for engagement/retention improvement.
  • Data management and distributed computing to organize and process data.
  • NLP to use text data usually for features for your model. Other interesting use cases are question answering, summarization, sentence similarity, translation, and auto-complete.
  • Cloud computing, starting with launching a computer/cluster for a Jupyter notebook or to host a dashboard, and using storage to hold CSV and Parquet files, similar to Dropbox. Most other cloud tools are optional and expensive.
  • Other Skills:
    • Coding puzzles for fast code.
    • Recommendation engines.
    • Time series forecasting.
    • Persuasion and web design to complement product analytic skills.
    • Staying motivated.
    • Tuning gradient boosting models.
    • Deep learning, most commonly to identify images and create word embeddings. Other uses are creating/identifying images/audio, NLP tasks such as next-word prediction or translation, and time series with large datasets.

Some interesting NLP areas:

  • Question answering and summarization.

An overview of data science


Issues with most DS code today:

  • Too many cells: 2500 cells of jupyter notebooks per project run out of order.
  • Software engineers think "notebooks = bad code" due to this.
  • Slow improvements: If they make a discovery that improves the model, it is too messy to implement into their final model, and months go by without progress.
  • Unmaintainable by creator and others: No one really wants to touch code like this. Sometimes people would rather leave their jobs than make a change to their models.
  • Slow ETL and slow Spark usage: It's essential to learn fast ETL concepts to ensure simple and fast code.

Solution:

  • One cell: All code in a notebook must be in the first cell only whenever you save it for the day.
  • Start from scratch: If you're improving someone's 2500 cells of code, it's best to start from scratch to ensure long-term reproducibility. I've created 10x higher profit models with a few hours of effort by starting from scratch rather than trying to maintain a very messy model.
  • Constantly delete your code: Move unnecessary code that would otherwise make your codebase messy into an old.txt file. This is like cleaning your house by putting everything in the closet. With this old.txt file you can do cmd+f or ctrl+f to find any code you have tried in the past.
  • 1.1 notebook: I have a GUI that generates this notebook. It exists in all projects. It's intended to be one cell long and extracts all the slow data you need for the project, like from databases. It thins it and ensures usage of it runs fast.
  • 1.2 data split notebook: This is where you make final touches to your dataset and then split it into 11 datasets: train-cv1-5, val-cv1-5, and a test dataset.
  • ETL Note: You only need to run the ETL approximately once a year so it's encapsulated away so you never have to look at it again. Save everything to the cloud storage.
  • 2.X feature engineering notebooks: Create unlimited feature engineering notebooks using the open-closed principle. Anyone who wants to improve the model can just create a new feature engineering notebook and does not need to modify other data scientists' code. My GUI creates the boilerplate code needed. As you become an advanced user, you can create multiple forms of features to test which one is best later on with sklearn pipelines.
  • 3.1 model notebook: This is where you load your 11 datasets mentioned in the 1.2 notebook, join it with all the features, and evaluate your model with 5-fold CV.
  • Local development: Not super important but no lag is nice. Also always consider having up-to-date production scripts, which could be the same or different notebooks.
  • To clarify: You are allowed one cell per notebook, you should delete tons of code by placing it in old.txt, and you should only have a few ETL notebooks (1.X notebooks), any number of feature engineering notebooks that are successful, and one model notebook.

With this setup, you're ready to make 1% improvements to the model every other day and in no time create a model that can have much higher performance, trust, and maintainability than messy code solutions.


R:

  • Install R and then RStudio
  • Create a new R Markdown file. Use Knit on save. You can knit to PDF or HTML.
  • Some useful things to do:
    • Load in the data.
    • Tidy the data: For example, changing data types and looking for missing values.
    • Visualizing the data.
    • Analyzing the data: do I trust the data?
    • Modeling the data.

Presenting:

  • Pictures are good.
  • Lots of words and reading slides isn’t good.
  • Practice, be enthusiastic, and maintain eye contact.

Not so good question: "Do I know everything to do good data science?" It's infinitely hard to prove this with a "yes".

Better question: "Can I do my job?" This is easy to answer "yes" in no time. You know you did your job if the metrics improve for your task. Tasks are likely one of these:

  • Statistical models
  • Improving engagement/retention
  • Providing data on a website as needed by others

Although there are some basic skills that are optional but very nice to have:

  • Easy coding puzzles: Able to do these
  • Keep ETL separate for statistical models so you only have to run it once a year instead of every day.
  • Extract data quickly using RAPID: "R" stands for "ready to use" or caching. "A" stands for aggregation, "P" for partitioning, "I" for indexing, "D" for distributed computing. Most of the time, just check if the database has an index or partition.
  • Shrink your data: Try to make your data as small as is needed. For example, a 2.5TB dataset can be reduced to 100MB if you change datatypes and drop unlabeled rows. Don't be deceived, almost any dataset can easily be reduced to under 1 GB with enough compression and optimization; there is just usually little point in reducing it below 1 GB, so you can stop at that point.
  • Normal tools first, distributed computing when needed: Try to use normal tools until it's far too slow, then go for distributed computing to speed up 1-5 lines of code rather than 100s of lines of code. The truth is distributed computing has a greater cost to maintainability and is almost never actually needed.


CHAPTER FIVE

Ethical Issues in Data Science




Introduction:

  • Key areas for ethics considerations are usage of private data, social media data, facial recognition accuracy by race, and any statistical model’s bias towards gender, race, or age.
  • The goal of this course is to be more aware of ethical issues during data science work.

Different theories of ethics:

  • Consider taking good care of yourself so you can help others, consequences, doing the right thing, and having virtues.

It is vital to ensure ethics in your actions and a statistical model’s actions.


Ethical analysis 1:

I read the case of medical implants and the potential for a hack to restart the implant.

It was concluded that the risk of harm was negligible because of the difficulty of the hack and the limited consequences of the hack.

Firstly, looking through the lens of consequences, I find that the consequences are clear in the sense that the company has a bug bounty program, so they agree there could be issues if someone hacks into the device. A device reset might not be the worst type of hack, but it's still important: it could lead to the device not having the appropriate settings configured when it restarts, which could result in harm or death, or higher chances of harm or death if it opens the opportunity for further hacks in the reset state.

those are the consequences of not doing a recall but the consequences of doing a recall would likely mean they have to spend a lot of money on a recall.

in terms of doing the right thing for the right thing and doing their duty the product is intended to work a certain way and to be and should not do any harm so they should take this as an opportunity to redo the program so that it's possible to do patch fixes for any type of future issue that comes up if they go down the route of not fixing the issue then it seems like they're more prone to make the same decision again to not patch something when they should make it more compatible for patches for very quick fix on the software of the program

In terms of virtues, I would say their decision is not virtuous: it lacks trustworthiness in that they don't intend to fix something that could lead to death; it does not respect the customer by giving them a product that will do no harm, and they are not alerting customers of the issue when that is what the product is intended to do; they are not taking responsibility for the issue, just pushing it under the rug; it's not fair to the people who are harmed for the company's profit savings; and it's not caring for people's lives. Overall it very much does not follow the virtues framework of the six core values in our society mentioned by the Josephson Institute of Ethics in Marina del Rey.

The ethical framework that appeals to me most in pretty much any situation is the Kantian point of view that doing the right thing can only be done by doing it for the purpose of it being the right thing, which I've come to believe is the framework that most closely aligns with objective morality.

What appeals to me secondly is the virtues framework; it makes logical sense that a person of very high character will statistically end up doing the right thing more often.

What appeals to me least is the consequences framework, which closely aligns with the phrase "the ends justify the means," which doesn't seem like a very nice thing to say.

31 of 83

Ethical analysis 2:

Kant. YouTube recommendation engine.

I am a data scientist who has spent a long time with recommendation engines and retention dashboards. Based on the public information about these popular recommendation systems, their goal is to improve engagement/retention.

I like the YouTube recommendation engine; I think it's pretty good. I do think they should add a feature that they probably may never add: make the recommendation engine try to get the person to quit after they've been on for 2 to 4 hours in a single day, just to very casually nudge them to get off, or at least not keep going video to video through recommendations.

The one downside is that they have competition, and you might just push users off onto another system, and that system might be seen as outperforming yours. It's a tough situation, but the ethical thing would be to make sure users don't overuse your system in a single day and to take the small hit of not getting the engagement past the 2 to 4 hours, or whatever they deem too habitual.

We should look to avoid this habitual behavior that will cause harm to society by having people addicted to YouTube or other apps longer than they should be in a period of time. We have seen this habitual use clearly be a problem in the news, with social media use linked to depression, research by social media companies into depression among habitual users, and the overall decline in people interacting with the world without using screens.

Comment made on another post:

That's true, they aren't providing the service out of good will but rather to make a profit. I wonder if this applies to every for-profit company. They will often speak of the value they add to society by giving entertainment, but their main goal is often ever-increasing profit. It's surprising that communism and fascism, which might originally sound humanitarian, lead to security states, while capitalism, which sounds selfish, can create good; I'm not too well versed in these topics.

I sometimes think about how the pressures of competition lead to unethical choices: "If we don't do it, we will go bankrupt and the unethical competition will do it anyway," or in other words "if we don't do it, someone else will," almost like a prisoner's dilemma. I even think countries could be acting this way (rather than companies) in that they want to ever grow GDP at the expense of pollution, because rival countries will just do it anyway and win in GDP.

Ethical analysis 3:

I believe, much like Google's opinion in the matter in the reading, that there are some extreme cases where action needs to be taken, but in most cases people are being a little too sensitive, like the example where a news station wants to remove another news station's article.

32 of 83

Ethical analysis 4:

I believe the "Right to be Forgotten" sounds okay in extremely disturbing cases, just like how any app needs to address bad actors. I would dislike to see it abused for trivial cases like a news station wanting to remove another news station's article.

Each complaint should be sent to the search engine or legal action taken. The complaint should be assessed for damages caused and the ethics of the situation. Then appropriate action should be taken. If the search engine doesn't act in good faith, it's only a matter of time before legal action is taken against them by the countries or individuals that care to argue their case. If a country makes a policy like GDPR, then it should be respected by all applications. However, as we have seen with GDPR, we don't need to enforce the GDPR on the US user data or other country user data.

From Kant's framework, doing the right thing seems obvious for extremely disturbing cases, even if it harms freedom of speech of bad actors. Their information is still on the internet, it's just too extreme and is found to be one of the rare cases it is delisted from the search engine. There is no law that a search engine must show your material, their algorithm is their own, and Google has had great integrity and held high public opinion overall through the years to do what is right to provide knowledge to all.

In terms of the decision not to allow anyone to delist something from Google, I may have used the virtues framework. Aristotle valued rational thinking highly, so rationally I figured a good way to go is to take an action equal to the size of the harm. If there is no harm no action should be taken. If it hurts someone's reputation for something to be on the internet, they can plead their case and if it is an honorable cause hopefully they get justice in the end. Google and the courts are at people's reach to judge their concerns. Each case needs to be judged individually and rationally.

Comment made on another post:

Yes, I've had a YouTube video uploaded about me where I got stuck in something and would like to have it removed. The person removed it and it's probably down from YouTube and not disseminated by others since it was pretty boring.

Good point, different countries have different values.

Yes, that's interesting to view from the idea of if someone is a good actor would they have uploaded the information. If only a bad actor would do it, then it's not good. And I see how good/bad is hard to determine like you said, but it makes sense to me.

33 of 83

Professional society codes of ethics:

  • Convey results honestly
  • Share data when that doesn’t violate confidentiality
  • Help to educate the public
  • Treat other people decently
  • Avoid harm
  • Respect privacy
  • Design and implement systems that are robustly and usably secure
  • Contribute to society and to human well-being, acknowledging that all people are stakeholders in computing
  • Be fair and take action not to discriminate
  • Perform work only in areas of competence
  • Foster public awareness and understanding of computing, related technologies, and their consequences
  • Ensure that the public good is the central concern during all professional computing work
  • Manage personnel and resources to enhance the quality of working life

Ethics Analysis 5

The article about a strange year at Uber: I found it very striking that there was so much sexual harassment, abuse, and dysfunction at the company, and a lack of cohesion and good integrity from leaders in her experiences. The biggest ethical issue seems to be a lack of virtues in the leaders, based on my understanding of her experiences.

The article with the PhD AI paper: what I felt was most compelling about this article was that these large language models definitely lack understanding; it's more that they are predicting the next word without much understanding, just to trick the users. It cost so much to train them, and I'm pretty sure it's costing a whole lot more these days, which contributes to carbon emissions. I like the idea of teaching the computer to have better understanding so it doesn't need to use so much carbon to train something that doesn't necessarily have understanding. The 2017 Da Vinci Code book speaks of an AI coming about through both a logical side and a probabilistic side, similar to the human brain, which perhaps is what we can do by creating a more logical side instead of focusing only on probabilistic word predictions. I'm sure this idea has been debated a lot, but it seems the money is following the probabilistic side, from my understanding and the opinion of the AI paper in the article.

34 of 83

Ethics Analysis 6

I believe facial recognition should be used for social media filters and there should be strict laws not to use facial recognition for other purposes.

Even once perfected, it should not be used, and we need to work harder towards making strict laws against it.

I recognize the following benefits to automation, but any of these are a slippery slope to mass surveillance:

  • Verifying the person works here after scanning an ID card: A security guard can do facial recognition, no automation is needed. Good locations for this are: government buildings, military buildings, power plants, water supply, research laboratories, prisons, and companies.
  • Checking all people leaving and entering a country by border or airport: Some benefit, but this should be made illegal by laws.
  • Checking all cameras in the world for terrorists: This might be necessary, I’m not really one to say.
  • Checking all cameras in the world for criminals: This is essentially mass surveillance, and we should avoid it.
  • Surveillance for high security events: looking for criminals at sporting events, concerts, amusement parks, conferences, political rallies, and holiday celebrations. This seems fine, but it's too easy for it to turn into mass surveillance a few years afterward.
  • Facebook automatic facial recognition tagging. This should not be done. They already know who is friends with who to share photos with friends, no need for this.

These are just my opinions, everyone has their own opinions on privacy vs security. Overall, usage of any facial recognition can't be used carelessly because it's too easy for people to suddenly use it for mass surveillance once it's seen as accepted in one place.

Comment I made on another post:

I agree about unlocking the phone and that social media tagging might be not harmful; you probably aren't the only one to feel slightly weird when tagged automatically. I think they were sued for some reason and have abandoned automated tagging, but I'm not sure. Yeah getting arrested for your face could be bad.

I agree that finding suspected terrorists and criminals and spotting them at airports would be the best use of it when perfected, if it is used at all; I am not fully convinced either way whether it should be used. With the right laws in place we could keep the facial recognition at airports.

Mass surveillance is a powerful tool that would likely require a high level of virtue to use for an overall good. Further mass surveillance is a large step towards some of the dystopias in science fiction, but if terrorists and criminals are a huge problem, a huge solution might be required.

35 of 83

Ethical Analysis 7

Gene editing

  • Gene editing seems ethical if it extends the life expectancy from 20 and younger to something closer to the normal human life expectancy.
  • Gene editing seems ethical if it removes a disease guaranteeing an IQ below 80 at the time of reaching age 20.
  • Gene editing seems ethical if given consent by someone 30+ years old and it extends their ability for rational thought by 10+ years. While gene editing seems impossible so late in life it may become a possibility in the distant future.
  • I have these beliefs partially because Aristotle says that a human being living a life consists of engaging in rational activity, so I find that a life cut short, or a life cut short of rational activity, should be considered a big issue, possibly warranting a big solution.
  • Some science fiction has considered the possibility of gene editing allowing humans to travel through space but I would think that a more ethical technology would allow humans to travel through space efficiently.
  • I believe superhumans through modification of the body are more of a dystopian future we shouldn't pursue. Modification of the body with technology seems fine if consent is given and it brings someone with a disability to equal ability with other humans. For Neuralink, the ability to run scripting commands through thought doesn't seem super useful because the person can just use a smartphone, but the ability to move a mouse with your mind sounds ethical for someone with a disability who has given consent.

Data science stance on gene editing today:

  • I believe the stance of data scientists should be to avoid ethically grey areas as your first job. This way you build up confidence and feel like you can quit your job at any time if it becomes ethically wrong work.
  • I would say the stance of the data science community should be one of holding off and not pursuing it at all until the right laws are in place to ensure that bad actors with the technology won’t begin using it to create a dystopian world.
  • It’s too powerful of a technology to be put in the hands of people without the right laws in place so the data science community should not condone work in this area; even if people give excuses of which diseases they will cure the right laws needs to be put in place first.
  • Some limitations to what I've said are, firstly, that powerful corporations could lobby and change the laws in the future, and secondly, that I haven't thought about this for very long, so the exact constraints of what should be pursued for a cure now versus what should not need to be thought through more carefully.

(no comment made since no other comments were there)

36 of 83

Ethical Analysis 8

Yes, taking away someone's job seems unethical, and we should support those who are most vulnerable with the education required to keep up in the workplace.

With the trending skills being known by data scientists and jobs being very focused on data professionals, I believe all trained data scientists should take on the obligation of helping others upskill for future jobs as much as their ability allows. My hesitation: what if I found a system that made data science so easy, shared it with everyone, and then there was significant competition and I struggled to get a job I needed to pay a mortgage? I suppose I could downsize.

One area I would want to help with is a cloud computing website that makes it very easy to use cloud computing services cheaply and to write very clean, maintainable modeling code. By following key fundamentals unknown by most, code written by an average person could easily be more maintainable and faster than that of the average senior data scientist.

Another area I would like to help with is the education of kids going through elementary school to high school. I have created a 12-point learning plan that would create a very well-rounded, educated individual; currently it's just an ebook of a few dozen pages highlighting the learning plan for those 13 years of education. This would get them ready for the cognitive abilities needed for future jobs.

Lastly, I have thought about tutoring data scientists. I am a bit hesitant, with ChatGPT being able to fill that role pretty well these days; I wouldn't feel right charging an hourly rate when they could take resources plus ChatGPT and get almost the same benefit. I had thought to teach out of my short ebook that simplifies how to follow best practices.

The government should have very good virtual career services to help people choose careers of interest, high demand, and low supply. For example, creating lesson plans for how to become a landscaper through a YouTube course, then various low-, mid-, and high-cost learning programs once they get their first few clients and like it. Education at the free, low, and mid levels is oftentimes just a series of topics and resources for learning how to achieve milestones and gain the skills or abilities needed to do a job.

The government should support education for kids under the age of 18. The government should also speak very supportively of college to improve people's cognitive abilities further after the age of 18.

Say we enter a world with sufficient wealth but not enough work for all. I trust the way we have things today in the USA; a lot of idealistic ideas have led to arguably bad results in other countries. In other words, keep doing what we've been doing, but look for ways toward sustainable living, with technology as one thing we should change. Our current government is quite good to those without jobs, and I would imagine it would help those who feel they aren't qualified for the open jobs.

(no comment made since no other comments were there)

37 of 83

CHAPTER SIX

Cybersecurity for Data Science

6

38 of 83

Cybersecurity for Data Science

Triad of cybersecurity:

  • Confidentiality: Limit read access, encrypt data, and require authentication.
  • Integrity: Limit write access, data validation, and data backups.
  • Availability: Make it accessible always. To ensure this safeguard against denial-of-service, have redundant systems, and implement disaster recovery plans.

Basics of cryptography:

  • Encryption: Making plain text unreadable until it is decrypted by a decryption key.
  • Decryption: Reverses an encryption.
  • Private key: There is a secret key that allows decryption.
  • Data integrity with hash functions: The data is sent alongside the value of a hash function applied to that data. To verify the data has not been tampered with, the receiving side runs the same hash function on the data and compares the result against the hash value sent alongside it; matching values show the data has not been tampered with.
  • Password storage with hash functions: The password is originally plain text but is stored as a hash value string. When a password is entered to be checked, the hash function first converts it to a hash value, and that hash value is compared against the stored one to see if it's a matching password. SHA-256 is collision resistant. A minimal sketch follows this list.
  • SHA-256 and other hashing algorithms: You should not rely on the secrecy of the algorithm but on the secrecy of the key, so it shouldn't be a concern if other people know you're using SHA-256; you just have to keep the key very secret.
  • Symmetric key cryptography: This is where both people have the same key to encrypt and decrypt messages
  • Asymmetric key cryptography: The person who wants to receive the message (person 1) first creates a public and private key pair. They then give the public key to the person who will send the message (person 2). Person 2 encrypts the message with the public key, and when the message reaches person 1, he or she can decrypt it with the private key.
  • What was used long ago: Basic substitution ciphers where every character represents another character. Hackers can crack these by looking at letter frequencies to figure out which letters most likely represent which other letters, because in any given message certain characters tend to have a higher frequency than others.
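A minimal sketch of hashed password storage with Python's hashlib, assuming SHA-256 as described above; the random per-password salt is an extra hardening step, not something spelled out in these notes:

```python
import hashlib
import os

def hash_password(password, salt=None):
    """Hash a password with SHA-256 plus a random salt; return (salt, digest)."""
    salt = salt if salt is not None else os.urandom(16)   # random salt per password
    digest = hashlib.sha256(salt + password.encode()).hexdigest()
    return salt, digest

def check_password(password, salt, stored_digest):
    """Re-hash the entered password and compare it against the stored digest."""
    return hashlib.sha256(salt + password.encode()).hexdigest() == stored_digest

salt, stored = hash_password("correct horse battery staple")
print(check_password("correct horse battery staple", salt, stored))  # True
print(check_password("wrong guess", salt, stored))                   # False
```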

Keeping private and sensitive data safe to the best of our abilities.

39 of 83

Columnar transposition cipher:

  • For example, write a message into a 5x5 grid, one character per space. Instead of reading from left to right up to down you read from down to up left to right. This creates a new message and then someone who understands what the grid size should be and also in which order the character should be read can decrypt the message.
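A toy implementation of one variant of the idea (my own sketch, not the course's code): write the message row by row into a fixed-width grid, then read it out column by column.

```python
import math

def columnar_encrypt(message, width=5):
    """Write the message into rows of `width` characters, then read down the columns."""
    message = message.replace(" ", "").upper()
    rows = math.ceil(len(message) / width)
    message = message.ljust(rows * width, "X")             # pad the final row
    grid = [message[i * width:(i + 1) * width] for i in range(rows)]
    return "".join(grid[r][c] for c in range(width) for r in range(rows))

def columnar_decrypt(cipher, width=5):
    """Reverse the process; the reader must know the grid width and read order."""
    rows = len(cipher) // width
    cols = [cipher[c * rows:(c + 1) * rows] for c in range(width)]
    return "".join(cols[c][r] for r in range(rows) for c in range(width))

secret = columnar_encrypt("meet me at the old bridge")
print(secret, columnar_decrypt(secret))
```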

Enigma:

  • A complicated setup that leads to every character pressed on the typewriter producing a different character, and the configuration could be changed every day. One weakness was that a character could never be substituted with itself, which made it slightly easier to crack.
  • With this device it could change the substitution cipher mid message which makes it very difficult to decrypt.
  • Part of the solution was to guess the configurations and then run it through simulations.

Hacking:

  • What is it: Unauthorized access to computer systems through security vulnerabilities and weaknesses
  • Intent of hackers: There are three types of hackers. The first find vulnerabilities as part of their job. The second engage in hacking with malicious intent. The third engage in hacking without malicious intent and will sometimes notify system owners when they have hacked into something.

Internet of things:

  • What is it: A device that can send and receive data with the internet. Objects often have sensors for data insights and are controllable like smart home applications. This control can be manual or automatic based on the sensor gathered data.
  • Problems: They are new and vulnerable to security breaches, so having very good security is essential when developing these systems.

Social engineering:

  • Pretending to be something they are not to gain sensitive data. They might have urgency, threats, or bribes to convince you.
  • Following someone with access.
  • Checking discarded items for data.

Facial recognition:

  • It's dangerous, and to avoid mass surveillance we need to pass laws. In the possible future it threatens our privacy for anything we do in public. It is already used in airports and other areas.

Reflecting to improve listening:

  • Paraphrasing the speaker’s words and feelings to show understanding and encourage them to continue talking. It enables the listener to empathize with the speaker’s experiences and emotions leading to better communication.

Passwords:

  • To have a good password you need a password manager, very long passwords, and two-factor authentication set up as much as possible. If you handle passwords, you should hash them as described earlier. If you manage passwords at a corporation, you should ensure that access and passwords expire within one year of creation.

40 of 83

How to do well in the cybersecurity space:

  • You need to go on LinkedIn and Twitter and make connections to stay informed in the cybersecurity space.
  • You can also conduct interviews with people to have them share their professional experience in cybersecurity:
    • Be respectful.
    • Start with personal questions to relax the subject. Like how they reached their current position or dealt with challenges.
    • Ask open-ended questions. Talk about their experiences building relationships with key stakeholders and improving at skills that you’ve identified as your weaknesses

41 of 83

CHAPTER SEVEN

Fundamentals of Data Visualization

7

42 of 83

Fundamentals of Data Visualization

Basics of visualization:

  • Visualizations let us see patterns in data.
  • Ensure that there is a purpose to the visualization.
  • Sunburst, horizon graph, and stacked graph are some unique ones I like.
  • Altair is a python library that uses a Javascript library called Vega-Lite.
  • pip install altair vega_datasets

Altair:

  • alt.Chart(data) is how you start any plot
  • Then place .mark_bar(), .mark_circle(), or .mark_line() to choose between things like bar charts, scatterplot circles, or line charts.
  • Then place .encode(...)
  • Inside encode you can place visualization grammar, the most basic of which is x=... and y=..., color=....
  • If you assign it to a variable named "graph", then you can save the graph with: graph.save("graph.html", embed_options={'renderer': 'svg'})
  • This allows us to have interactive visualizations that we can place on any website.
  • Setting color to alt.Color(“my_column”, scale= alt.Scale(...)) you can use categorical schemes, sequential schemes, and diverging schemes; which are a group of colors.
    • Categorical schemes: Good for string columns.
    • Sequential Schemes: Good for numeric data.
    • Diverging Schemes: Are good for numeric data and furthermore to show deviation from a central point (hot/cold etc).
  • tooltip=... shows information when you hover over anything.
  • To do facets (two plots side by side), you build 2 plots. Then you combine them with the following code: “graph1 | graph2”. Use “graph1 + graph2” to make them in the same plot area, while pipe puts them side by side. Using “graph1 & graph2” is vertical facets. Using .repeat allows horizontal and vertical facets.
  • Setting a .properties(...) you can define a title, width, and height.
  • Often you want to aggregate your data in a tool like pandas before using a plot.
  • Use one of the interactive keywords to make the plot interactive.
  • Use .project(...) to apply a map projection inside the plot, for example with geolocation data.
  • Overall Altair is a very simple tool that is easy to place on a website. Very easy to edit compared to some much more complicated JavaScript libraries. A minimal example follows this list.
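A minimal example tying the pieces above together; the `Horsepower`, `Miles_per_Gallon`, `Origin`, and `Name` columns come from the cars dataset in vega_datasets, used here only for illustration:

```python
import altair as alt
from vega_datasets import data

cars = data.cars()  # small demo dataset shipped with vega_datasets

graph = (
    alt.Chart(cars)
    .mark_circle()                                    # scatterplot circles
    .encode(
        x="Horsepower",
        y="Miles_per_Gallon",
        color=alt.Color("Origin", scale=alt.Scale(scheme="category10")),
        tooltip=["Name", "Horsepower", "Miles_per_Gallon"],
    )
    .properties(title="Cars demo", width=400, height=300)
    .interactive()                                    # pan/zoom interactivity
)

graph.save("graph.html", embed_options={"renderer": "svg"})
```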

Basics on making data visually appealing.

43 of 83

Graph tips:

  • Consider log scale for data with some very high values.
  • When color represents numeric values, darker should mean higher values. The background should be white.
  • Dark blue is a great color to use. It is also good to use dark blue fading to light blue.
  • Y-axis should start at 0.
  • Pie charts should add to 100%.
  • Dual axis should usually be avoided; in other words, have all data share the same axis.
  • It’s typically best to avoid unnecessary clutter.
  • Typically avoid 3D plots.
  • One way to present data is to put all the data focus in blue and the data that shouldn’t be focused on in gray.

Interactions:

  • Sorting.
  • Dropdowns for filters.
  • Clicking for detail: You can show the overall data as well with the data clicked on in color and other data not in color. Near it you can show the data detail.
  • Clicking for filter: Say you click on something in one plot, then it will filter all plots on the dashboard to that group.
  • Mouse hover: These are tooltips to see the values for data

Visualization tasks:

  • Visualization loop
    • You start with questions.
    • Explore the data with visualizations.
    • Generate knowledge.
    • This prompts you with more questions, and you repeat the process.
  • To enable this visualization loop, a dashboard designer needs to think about what is the purpose of the dashboard, what is the user doing, and what metrics does the user need. This allows the visualization loop to go on longer before it reaches a dead-end of not being able to answer the user’s questions.
  • Talk with experts and visualization-savvy individuals to see what exactly is needed in an interactive dashboard tool.
  • Prototype something quickly to get feedback quickly. One way to do that is the five design sheets method.
    • Sheet one: Sketch ideas.
    • Sheet two: Sketch, with details, interactions, discuss, and reflect.
    • Sheet 3-5: Iterate the same as sheet two but three more times
  • Design studies
    • Try to have a real problem for real users. A clear idea of what you need and how to do it are key.
    • Create a solution for them, and get feedback.
    • Think carefully about their feedback and make changes.
    • Repeat as needed.

44 of 83

Pitfalls that can occur in a design study:

  • Collaborate with the right people.
  • Have real data ready.
  • Problem can be automated without a visualization.
  • Existing tools are good enough.
  • Not a real important recurring task.
  • No rapport with collaborators.
  • Mistaking fellow tool builders for real end users.
  • Focusing on the design of visualizations instead of solving the problem.
  • Being comfortable with visualization tools but not going overboard with what it can do.
  • Slow prototyping
  • Interactions are too much or too little.

Graph tips:

  • An interesting way to visualize a scatterplot with many colors is to draw hexagons and color each hexagon with the majority color of the points inside it. If a dot falls on a hexagon of its own color, it can be removed. This leaves you with colored hexagons plus dots of color on hexagons where that color isn't the majority.
  • Effective visualization simplifies the cognitive processes required to understand the data.
  • Start with a task and find the solution. If the solution happens to be a graph, then use a graph, but don’t decide to use a graph at the start.
  • You can add error bars to your visualizations; much like how t-tests use standard error.

Qualitative evaluation:

  • Purpose: Gain deep understanding of the visualization and explore it from multiple perspectives. Answering “how” and “why” questions. Understand user experiences, opinions, and develop testable theories to solve the problem quicker.
  • Data collection: Interviews, open-ended questionnaires, and focus groups.
  • Sampling: Random sampling is ideal. Try to set it up so anyone can participate instead of experts.
  • Data analysis: Analyze the text responses with a computer. Present in written form. Ensure the privacy of the participant responses.

45 of 83

Give an example of how you would design a visualization experiment:

I would like to build a visualization that allows someone to slice and dice the data to determine key factors that lead to a student dropping out, for use by the application prioritization departments of universities. Overall, the goal of a user would be to find patterns that correlate well with someone dropping out. The results of these findings could lead to more support for certain groups that are at risk of dropping out, and also to prioritization of merit-based attributes for the application priority order.

I would recruit random adults as opposed to expert data analysts to ensure the tool is usable for all.

The approach I would use is a semi-structured interview for random adult participants who have operated a website. I wouldn't look specifically for expert data analysts but would be content with a random adult participant to check if the visualization works well. I would prepare a website where participants can do tasks with the visualization. I would need to simulate data or ensure I have approval and sufficient privacy to use university dropout data.

The structure of the interview would be as follows with the following process and open-ended questions:

  • They would come in for the virtual interview. I would ask them to introduce themselves and tell me a bit about their background in data analysis and using a computer. If they've done data analysis, I would follow up by asking a bit more about that. If they have never used a computer, I would tell them we are sorry but we need someone with a bit more computer experience.
  • I would ask them to visit a website where they can complete a series of tasks today.
  • I would have them have the visualization open and ask them what they expect this tool will accomplish for them.
  • I would ask them to find a variable that might correlate with them dropping out from the variables listed.
  • I would ask what their thoughts and feelings are as they finish the task.
  • I would ask them if they had any challenges while doing the task.
  • I would have them repeat this process with three more tasks.
  • I would ask if the tool was clear and informative.
  • I would ask if they felt in control while using the visualization.
  • I would ask what they liked particularly from visualization.
  • I would ask what they think needs improvement with the visualization.
  • I would ask if they would recommend the visualization to someone working in university application prioritization.
  • I would ask them to describe their overall experience.
  • I would ask them if there's anything they want to share about the visualization.
  • I would thank them for their time and ensure they are compensated by the virtual interview software.

Lastly, I would use computer analysis to analyze the experiment text data carefully and maintain participant privacy.

46 of 83

CHAPTER EIGHT

Data Mining Pipeline

8

47 of 83

Data Mining Pipeline

Drowning in data but starving for knowledge:

  • Need automated analysis of massive data.
  • Find a new pattern, and find a potential use for it.

How do you go about using data:

  • Data: What kind of data and how large is it.
  • Technique: The technique you use to achieve the use case or gain knowledge from data.
  • Use case or knowledge: What you might be able to do with it or learn from it.

Common techniques:

  • Classification: For example, is the picture of a dog or a cat?
  • Clustering: Similar points are put together in a cluster.
  • Co-Occurrence analysis: For example, what two items tend to be in the cart together.
  • Regression: For example, predicting the house price.
  • Anomaly detection: Finding things that deviate from the norm.
  • Sequential pattern mining: Analyzing a sequence of behavior such as a customer click/purchase history.
  • Text mining: Finding useful information from text data.
  • Time series analysis: For example, forecasting stock prices over time.
  • Decision trees: Models that represent decisions and their consequences and are interpretable.
  • Neural networks: Focused on tasks like image recognition, natural language processing, and speech recognition.
  • Support vector machines: Find the optimal hyperplane that best separates classes of data for classification or best draws a line through the data for regression.
  • Principal component analysis (PCA): Transforms high-dimensional data into a lower-dimensional space while preserving essential characteristics.
  • Ensemble methods: Combines multiple models to improve predictive performance.
  • Recommendation systems: The first step in recommendation systems is collaborative filtering or content-based filtering. The next step is to use regression or classification. The final step is to use a rules-based logic to adjust the final results.

Ethics:

  • Data access/ownership, privacy, model validation, fair model, and consequences considered.

Building a pipeline to move data from one place to another.

48 of 83

Data basics:

  • Data is made up of objects with attributes and attributes can be numeric or categorical.
  • Some statistics to describe the attributes are the central tendency and dispersion of the attribute values.
    • Box plot: Shows explicitly the central point and dispersion.
    • Histograms: Show the dispersion with binned counts, but the central point can be harder to identify.
    • Quantile plot: Sorts data in ascending order. Compute many quantile values and plot it.
    • Quantile-quantile plot: Checks how much the data differs from a distribution, usually the normal distribution. Points on the diagonal line are close to the normal distribution.
    • Scatterplot: Plots the x and y points on a graph as small circles. Say you have 4 attributes, you can create 6 scatterplots to see the scatterplot of each combination (4 choose 2 = 6), often called a pairplot.
    • Other plots: There are other common and uncommon plots. Use the one that is going to likely be effective and simple to understand.

Data similarity:

  • Looking for a similarity or dissimilarity patterns.
  • You have to start with a matrix where each row is an object and each column is an attribute (n objects as rows and p attributes as columns).
  • A similarity or dissimilarity matrix will be nxn (n objects as rows and n objects as columns).
  • For categorical attributes you can say they’re similar if they’re exactly equal or you can come up with a way for them to be partially equal for instance if the color is the attribute and the colors are pretty similar.
  • To check if two binary attributes are similar or dissimilar all you have to do is check if they’re equal to each other if they are symmetric (yes/no occur at similar frequencies).
  • If yes and no do not occur at similar frequencies, then you can ignore cases where both objects have the common, uninteresting value, such as a "no" for purchasing a specific item, since that isn't too informative. The same consideration can be made for null values.
  • It's good to have each attribute range from 0 to 1 to keep the same scale.
  • For ordinal values you can have (1, 2, 3) and map it to (0, 0.5, 1.0).
  • Once the data is all converted as needed, use one of four common similarity calculations (a small sketch follows this list):
    • Manhattan distance: For numeric differences, sum the absolute value differences, like abs(distance_attribute1) + … + abs(distance_attribute10).
    • Euclidean distance: For numeric differences, you can also use another formula, like square_root(square(distance_attribute1) + … + square(distance_attribute10)).
    • Cosine similarity: Usually used with high-dimensional or text data it measures the angle between two vectors usually like the percentage word frequency as opposed to the absolute word frequency.
    • Jaccard similarity: Suitable for categorical data or sets; it is the size of the intersection divided by the size of the union of the two sets.
    • A mix of multiple methods and then a weighted average afterwards.
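A small sketch of the four calculations above, written with plain NumPy and made-up values:

```python
import numpy as np

a = np.array([0.2, 0.5, 1.0])
b = np.array([0.1, 0.9, 0.7])

manhattan = np.sum(np.abs(a - b))                         # sum of absolute differences
euclidean = np.sqrt(np.sum((a - b) ** 2))                 # straight-line distance
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # angle between the vectors

set_a, set_b = {"milk", "bread", "eggs"}, {"milk", "eggs", "cheese"}
jaccard = len(set_a & set_b) / len(set_a | set_b)         # intersection over union

print(manhattan, euclidean, cosine, jaccard)
```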

49 of 83

Data cleaning and data integration:

  • Cleaning: Handling missing values, outliers, standardization (formats etc.), deduplication, inconsistencies (such as typos or naming), and validation of the data.
  • Integration: If data is from various sources you may need a common structure, aligning the schemas, or aggregating the data. Additionally, you may be merging records that represent the same entities across different sources, addressing differences in terminology across sources, and running checks for accuracy and completeness.

Data transformations:

  • Smoothing, aggregation, generalization (using a more general categorical field than the most granular one), feature scaling, converting continuous values to intervals, and attribute construction.
  • Feature scaling: You can do min-max scaling; you can do (x - mean)/(max - min), called mean normalization; and lastly you can convert values into z-scores (called z-score normalization), where you compute the mean and standard deviation to determine each value's z-score. A quick sketch follows this list.
  • Data reduction: Reduce the rows or columns to make it more efficient.
  • Attribute selection: Forward selection (start with a single-feature model and work forward), backward selection (start with all features and iteratively remove some), and feature engineering.
  • PCA for dimension reduction into orthogonal vectors.
  • Sampling considerations.
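A quick sketch of the three scaling options with made-up numbers:

```python
import numpy as np

x = np.array([10.0, 12.0, 15.0, 20.0, 40.0])

min_max = (x - x.min()) / (x.max() - x.min())     # scales into [0, 1]
mean_norm = (x - x.mean()) / (x.max() - x.min())  # centered roughly around 0
z_score = (x - x.mean()) / x.std()                # mean 0, standard deviation 1

print(min_max, mean_norm, z_score, sep="\n")
```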

Data warehouse, data cube, and OLAP:

  • OLAP data warehouse: This is for complex queries and it's separate from the operational database. You have fact tables and dimension tables that can be joined together, most commonly as a star schema (one fact table and multiple dimension tables).
  • Data cube: Store 3 dimensions such as month, customer, and source. Also store the numeric measures like sales and revenue. You pre-compute the data so it can rapidly be queried.

Building a data cube and data warehouse:

  • Data cube timing options: You can pre-compute it before using it, or you can compute on demand much like treating it as a dynamic programming problem where you compute parts as they are needed and delete data from storage/memory when it’s stale. Or you can do a mix of pre-computing and computation on demand.
  • Data cube building efficiently: Say you extract data into Python and are building eight data cubes. You can iterate over the data with one for-loop and, inside that single loop, build up all eight data cubes, so one pass over the data builds all eight. See the sketch after this list.
  • Data warehousing building: You start with the data sources and you put them in a back room area. Then you put them in a front room area. But you aren’t done there, you can do things like aggregate the data to get other front room tables. One way to think about it is that it is a data pipeline going from start to finish.
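A sketch of building several aggregates in a single pass over the data; the file name, columns, and choice of cubes are hypothetical:

```python
from collections import defaultdict
import csv

# Hypothetical sales file with columns: month, customer, source, revenue.
cubes = {name: defaultdict(float)
         for name in ["month", "customer", "source", ("month", "source")]}

with open("sales.csv", newline="") as f:                  # single pass over the data
    for row in csv.DictReader(f):
        revenue = float(row["revenue"])
        cubes["month"][row["month"]] += revenue
        cubes["customer"][row["customer"]] += revenue
        cubes["source"][row["source"]] += revenue
        cubes[("month", "source")][(row["month"], row["source"])] += revenue

print(dict(cubes["month"]))
```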

50 of 83

CHAPTER NINE

Data Mining Methods

9

51 of 83

Data Mining Methods

Frequent pattern analysis:

  • Example: Which items were purchased together (1 or more items).
  • Item set example: {milk, bread, eggs} {milk, bread}, {bread, eggs}, {milk, eggs}, {milk, bread, eggs, cheese}
  • Closed pattern definition: {milk, bread}, {bread, eggs}, {milk, eggs}. Why not {milk, bread, eggs} and {milk, bread, eggs, cheese}? Because extending to those doesn't decrease our frequency (those 3-4 item sets have the same frequency as {milk, bread}).
  • Maximal patterns: Frequent item sets none of whose supersets are frequent. Keeping only these lets an efficient algorithm reduce the number of item sets, often by 100x, by removing patterns that don't add much information based on the "support" value.
  • Apriori Algorithm: Starts with single item sets and gradually builds to longer item sets. Note, keep item sets alphabetical to avoid duplicates item-sets in other orders. It comes up with potential item sets but removes those that are not frequent. After finding all candidate item-sets the algorithm determines the “support” metric of these candidates and prunes those that don’t meet the minimum “support” parameter. It again repeats generation of candidate item-sets and continues to repeat these steps for some time. One output is that it can recommend item sets if another item set is purchased.
  • FP-Growth Algorithm: Improves on the Apriori algorithm for larger datasets using an FP-tree, fewer scans, and fewer candidates. It starts by getting the frequency of each single item to help in guessing which item-set candidates might be frequent. Then it builds an FP-tree by scanning the dataset again and linking items that share common prefixes. It recursively grows conditional FP-trees one depth at a time and is then able to extract patterns.
  • Association: Finding interesting relationships between items in a dataset. “If X item-set then Y item-set with this confidence probability”.
  • Correlation: Positive and negative correlations between item-sets. The formula (lift) is P(A&B)/(P(A)P(B)). If it is greater than 1 then it's a positive correlation; if it's less than 1 it's a negative correlation. A toy example follows this list.
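A toy example of support, confidence, and the lift formula above, on made-up transactions:

```python
transactions = [
    {"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread", "eggs"},
    {"milk", "eggs"}, {"milk", "bread", "cheese"},
]
n = len(transactions)

def support(items):
    """Fraction of transactions containing all of `items`."""
    return sum(items <= t for t in transactions) / n

a, b = {"milk"}, {"bread"}
confidence = support(a | b) / support(a)             # P(B | A)
lift = support(a | b) / (support(a) * support(b))    # >1 positive, <1 negative correlation

print(round(confidence, 2), round(lift, 2))          # 0.75 and about 0.94 here
```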

The methods you use to achieve the use case or gain knowledge from the data.

52 of 83

Decision Tree Classifier:

  • Classification is like identifying if something is one thing or another. For the purpose of explanation, I’ll say the classes are “a good loan” or “a bad loan” based on the applicant.
  • It’s like asking 20 questions and each question takes you to the left branch or right branch of a large tree.
  • When you reach the end of the tree, it lists the probability of it being a good loan (good loans / total loans)
  • It decides what a good question is by trying tons of questions and sticking with the best one that splits up the training data of good loans from bad loans.
  • This is a greedy algorithm, because it builds itself over time, one question at a time. It would take too much computation to build the perfect decision tree, but this greedy approach works well.

Naive Bayes Classifier:

  • Say you are classifying the classes: “a good loan” or “a bad loan”.
  • You calculate the probability of the feature being that value given it is one class and also calculate given it is the other class.
  • Say 60% are good loans, that is your initial guess.
  • Calculate the probability of a good loan: 60%*P(feature given good loan)*....
  • Do the same for bad loans:40%*P(feature given bad loan)*...
  • Whichever is greater is the prediction.
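A minimal sketch of the idea using scikit-learn's GaussianNB; the loan features and labels below are made up for illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical loan data: [income_in_thousands, debt_ratio]; 1 = good loan, 0 = bad loan.
X = np.array([[80, 0.2], [65, 0.3], [90, 0.1], [30, 0.8], [25, 0.9], [40, 0.7]])
y = np.array([1, 1, 1, 0, 0, 0])

model = GaussianNB().fit(X, y)             # learns class priors and P(feature | class)
print(model.predict([[70, 0.25]]))         # predicted class for a new applicant
print(model.predict_proba([[70, 0.25]]))   # the class probabilities behind the prediction
```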

Support Vector Classification:

  • Draws a line to separate the classes.
  • If it isn’t possible to separate all classes, it allows some misclassifications.
  • It uses kernel tricks to find non-linear lines to separate the classes.

Neural Network Classification:

  • Neurons work together in layers.
  • Input layer is of size equal to your features.
  • The hidden layers are in the middle.
  • The output layer are the predictions. If there are two classes, then the output layer has two nodes.
  • The neurons connect to the next layer. These connections have weights. The weights are adjusted during training to find optimal values.

Ensemble Classification

  • Combining the scores of multiple models.
  • One simple one is majority vote. Another is weighted averaging of scores.
  • Another method is bagging, where you train multiple models with a random sample of the data with replacement.
  • Another method is boosting, where you train many models. As it progresses, each new model is trained to fit the residual of the previous models' predictions. To balance the impact of the many models, there is a learning rate so it learns more slowly.

Model evaluation:

  • Train/val/test split is the simple method.
  • K-fold cross validation is preferred.

53 of 83

Clusters:

  • The measure of how well a clustering does is the similarity of data points within the same cluster and the dissimilarity of data points in other clusters.
  • K-means clustering: Set centroids (center of each cluster) in some random way. Calculate which points are in each cluster. Recompute the centroids. Repeat many times. Try with different values for k.
  • Hierarchical clustering: All data is a cluster. Cluster similarity is measured. If two clusters are similar enough they are merged together.
  • Gaussian Mixture Model: Imagine your data came from different clusters generating data with a normal distribution away from a central value. This aims to find those clusters and their central/distribution properties. It iterates on the expectation maximization to determine the answer.
  • Incorporate domain knowledge.

Anomaly Detection Methods:

  • Differs from the norm. Could be fraud, errors, or perhaps something else.
  • It could be a global anomaly, an anomaly within a subset, or anomaly when considered together with other data points (spike in the count of events at a point in time/location).
  • Simple techniques: z-score and percentiles.
  • Clustering: See the section above. Look for points in a small cluster or far from other clusters, or examine each cluster to understand it.
  • Isolation forest: Decision trees where short paths lead to data points that are easily isolated; those are considered the anomalies. The average path length across many trees is used.
  • Support Vector Regression: Fit a hyperplane through the data. The data that is far away from the hyperplane going through the data is an anomaly.
  • Time series algorithms: Anything far from the fitted line.
  • Autoencoder and SVD: Start with a matrix. For an autoencoder you use a tiny hidden layer of a neural network. For SVD you use two smaller matrices. What is common about these algorithms is that you start with the large matrix, pass it through the smaller structure (layers or matrices), and try to reconstruct it at the end. This denoises it; the points that reconstruct poorly (the noise) are the anomalies.
  • Domain knowledge on what anomalies you’re looking for and what is uncommon. For instance you could mix this with z-scores to flag online fraud behavior. For credit card fraud perhaps someone buys something they normally don’t buy at that time. Make assumptions of what is an anomaly and look for it, and look at it closely to ensure it is a good rule.
  • Applications: Fraudulent transactions, hackers, malfunctions in machines or people’s health, and spikes in energy consumption.

54 of 83

CHAPTER TEN

Data Mining Project

10

55 of 83

Data Mining Project

Project ideas prompt:

Data mining projects that sound interesting to me in ranked order:

1) Doing frequent pattern analysis on online purchase data.

2) Analysis on user data looking for engagement and retention.

3) Simply trying to get the most accurate model for regression or classification.

I would need to search free data sets online to get this data.

Starting questions.

  • What is it?
  • Why is it important?
  • Why are other solutions lacking?
  • What is the benefit of your solution?

Project proposal 5-10 slides:

  • What data and techniques you will use.
  • Slide Style: Clean, large font, color, and pictures.
  • Slides: Title, problem statement, related work, proposed work, evaluation, and timeline.
  • Checking your presentation: Is it clear? Does it show value? Is it a reasonable amount of work? Is the evaluation plan good?

Exploring data’s depths: Unearthing patterns through advanced mining.

56 of 83

Project proposal report:

  • ACM Template.
  • Two-column format.
  • Word document.
  • Latex or Overleaf “sigconf” are other templates.
  • Outline: Title, abstract, introduction, related work, proposed work, evaluation, discussion, conclusion, and references. See KDD papers for examples.
  • Abstract: 1-2 paragraphs summarizing the project
  • Introduction: Go back to the “starting questions” above and write about that.
  • Related work: What has been done? What methods? How does your work build on prior work?
  • Proposed work: What dataset? What tools? What tasks are involved?
  • Evaluation: Which metrics or experiment?
  • Discussion: Timelines and challenges.
  • Checking your report: Is it clear? Does it show value? Is it a reasonable amount of work? Is the evaluation plan good?

Checkpoint slides:

  • 10-15 slides highlighting progress and improvements over time.
  • Copy your proposal slides next and edit them to show how it’s going compared to the proposal plan.
  • For example, you can include your evaluation results and checking how you’re doing with the timeline proposed, challenges/backup plans, and changes.

Checkpoint report:

  • Should have the same sections as the proposal. Update each section, adding more filling up to 3-5 pages.

Project final slides:

  • Use the checkpoint slides and have 15-20 slides.
  • What have you accomplished? What have you learned?
  • Title, executive summary at start and timeline near the end.

Project final report:

  • Use the checkpoint report. Ensure it is 5-10 pages. Ensure it has a conclusion near the end before the references.
  • A conclusion has the project summary and key findings and improvement ideas.

Project conclusion:

  • At the end you should often write a summary, key findings, and improvement ideas.

57 of 83

CHAPTER ELEVEN

Introduction to Machine Learning: Supervised Learning

11

58 of 83

Introduction to Machine Learning: Supervised Learning

Introduction:

  • Supervised learning: Has two types, classification and regression.
  • Unsupervised learning: Includes clustering, anomaly detection, and dimension reduction.
  • Help improve supervised learning: Getting more data, generating new features, analyzing the computer’s errors and improving it to make less errors.
  • Classification types: Classification can be binary classification (1/0 class labels) or multi-class classification (when there are more than 2 classes).
  • Parametric models examples: Linear regression, logistic regression, poisson regression, LDA, Gaussian naive bayes, ridge regression, lasso regression, exponential distribution, bernoulli distribution, multinomial distribution. The number of parameters it estimates are fixed; these are best for simple problems and are less prone to overfitting.
  • Non-parametric models examples: K-Nearest Neighbor, decision tree, random forest, support vector machine, neural networks, kernel density estimation, gaussian process regression, hierarchical clustering, nonparametric bayesian models, local regression. The parameters can grow with the size of the data and can overfit more easily.

Linear Regression:

  • Estimates the coefficients that create a prediction with the formula: prediction = coef_1*feature_1 + coef_2*feature_2 + … + coef_n*feature_n. It will draw a line through the data.
  • The RSS is minimized using calculus and a formula that solves directly for “coefficient 0” (intercept) and “coefficient 1”.
  • Another solution is B = (X_transpose * X)_inverse * X_transpose * Y.
  • RSS is the sum of squared differences between the predictions and the validation actuals. TSS is the sum of squared differences between the validation actuals and the average of the training set actuals.
  • R squared equals 1 minus RSS/TSS.
  • You can get a prediction interval for each new observation prediction for the range of values the true prediction should be with 95% confidence.
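A sketch of the closed-form solution and R squared on simulated data (here R squared is computed on the training data for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=50)    # true intercept 3, slope 2, plus noise

X = np.column_stack([np.ones_like(x), x])        # column of 1s for the intercept
beta = np.linalg.inv(X.T @ X) @ X.T @ y          # B = (X^T X)^-1 X^T Y
pred = X @ beta

rss = np.sum((y - pred) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r_squared = 1 - rss / tss

print(beta, r_squared)
```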

Making predictions with data.

59 of 83

Linear Regression (continued):

  • Feature selection is done by an algorithm that adds/removes features one at a time.
  • A single categorical feature can be transformed into many boolean features.
  • Creating features that are combinations of two important features works well.
  • New features that are other features raised to different powers works well (polynomial features)
  • Issues that can come up:
    • Not flexible enough, which is mitigated by polynomial features.
    • Watch for instances where data is gathered through convenience sampling or when the decision of collecting data for each sample is influenced by the sampler or surveyor.
    • Strange feature values or target values; typically high values.
    • When two features are highly correlated with each other, often the model doesn’t improve much by having both of them and the coefficients can act wildly when both are included in the model. A correlation matrix can see this in detail. For example, say house price = square_foot_coefficient*square_foot + grade_coefficient*grade; but let’s say grade and square foot are highly correlated; the coefficient for a single feature model could be positive for both single feature models, but one coefficient can become negative with the two feature model.
  • The best model comes about going between improving bias or variance. You know you have bias if your training set’s accuracy is far less than the estimated optimal accuracy. To fix this, you should make the model more flexible to get a better training set accuracy. You know you have variance when your validation set’s accuracy is far less than the training set’s accuracy. To fix this, you make the model less flexible which results in a lower training set accuracy but a higher validation set accuracy.

Logistic regression:

  • Steps:
    • Probability = 1 / (1 + e^-(coefficient-weighted sum of the feature values)).
    • Compute the log loss.
    • Gradient descent.

Confusion matrix:

  • Four values: True/False positives/negatives (T/F P/N)
  • Accuracy: (TP + TN) / (TP + TN + FP + FN).
  • Precision: TP / (TP + FP)
  • Recall: TP / (all positives)
  • ROC AUC: Built from the true positive rate (TPR = TP / (TP + FN)) and the false positive rate (FPR = FP / (FP + TN)) across classification thresholds.
  • PR AUC: This one is more focused on TP. It is built from precision (TP / (TP + FP)) and recall (TP / (TP + FN)) across thresholds, so a lot of the focus is on TP since it shows up in both precision and recall. A sketch of these metrics follows this list.
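A sketch of these metrics with scikit-learn; the labels and scores are made up:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, average_precision_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.7, 0.3, 0.2, 0.6, 0.8, 0.1]   # model probabilities
y_pred = [1 if s >= 0.5 else 0 for s in y_score]     # hard labels at a 0.5 threshold

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))                # ROC AUC uses scores, not labels
print(average_precision_score(y_true, y_score))      # a summary of the PR curve
```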

60 of 83

Regularization hyperparameters and cross validation:

  • Ridge: Linear regression with a regularization term that adds the sum of squared coefficients (a squared penalty) to the loss, to avoid any coefficient getting too large compared to the others. This often pulls all coefficients closer to zero but rarely all the way to zero.
  • Lasso: Same as ridge but with a linear (absolute value) penalty instead of a squared penalty. This often reduces the coefficients of underperforming features to exactly zero.
  • Elastic Net: A mix of both ridge and lasso together.
  • Hyperparameters: Tweakable input parameters to models. For ridge and lasso, it is lambda. Lambda affects how much regularization to use: a little (creates less bias) or a lot (creates less variance). Adjusting these tend to create a best validation accuracy. Furthermore, tuning regularization can get interpretable coefficients even with collinearity occurring.
  • Cross-Validation:
    • One common cross-validation method is a train/val/test split. The train set is used to train the model, the val set is used to check the accuracy on new unseen data, and the test set is checked at the very end to see how accurate it is on unseen data (since we start to overfit the validation set as we check it so often to optimize it).
    • Another common cross-validation method is k-fold cross-validation, which divides the data into k subsets where one is the test subset and the remaining are used for training. So k tests are done, where all data is at some point part of a test set, and the resulting k accuracies are averaged together. A sketch follows this list.
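A small sketch of k-fold cross-validation with scikit-learn, using a built-in demo dataset and a Ridge model:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5-fold cross-validation: each fold takes a turn as the held-out set.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(scores, scores.mean())
```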

K-Nearest Neighbor Classifier:

  • Finds the k nearest neighbors to the datapoint. Assigns the class that shows up the most often.
  • A lower k value is more flexible (it can overfit to individual points), while a higher k value is less flexible (smoother decision boundaries).
  • Can suffer a lot from higher number of features, even 4+ features can become too much for the algorithm.
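
A short sklearn sketch comparing k values on the iris dataset (illustrative only):

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    X, y = load_iris(return_X_y=True)

    # Small k = flexible, large k = smoother
    for k in [1, 5, 25]:
        model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
        print(k, cross_val_score(model, X, y, cv=5).mean())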

Decision Tree Classifier:

  • Classification is like identifying if something is one thing or another. For the purpose of explanation, I’ll say the classes are “a good loan” or “a bad loan” based on the applicant.
  • It’s like asking 20 questions and each question takes you to the left branch or right branch of a large tree.
  • When you reach the end of the tree, it lists the probability of it being a good loan (good loans / total loans)
  • It decides what a good question is by trying tons of questions and sticking with the best one that splits up the training data of good loans from bad loans.
  • This is a greedy algorithm because it builds itself over time, one question at a time. It would take too much computation to build the perfect decision tree, but this greedy approach works well.

Avoid Overfitting Decision Tree Classifier:

  • Early stopping: Stop splitting at certain conditions.
  • Pruning: Build the full tree, then remove some decisions. This ends up with a smaller tree, just like early stopping. The more complex the tree, the higher the cost, much like ridge penalizes large coefficients (the penalty strength is called ccp_alpha in sklearn; see the sketch after this list). The complexity penalty is alpha times the number of leaf nodes below a decision.
  • Ensembling: A model that uses many decision trees like XGBoost.
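
A small sklearn sketch of cost-complexity pruning via ccp_alpha (the dataset and alpha values are illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    # Larger ccp_alpha = stronger complexity penalty = smaller pruned tree
    for ccp_alpha in [0.0, 0.005, 0.02]:
        tree = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=0)
        print(ccp_alpha, cross_val_score(tree, X, y, cv=5).mean())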

61 of 83

Ensemble with Random Forest:

  • Sample random records with replacement (bagging), and at each split consider only a random sample of roughly the square root of the number of features for the columns.

Adaboost Boosting:

  • It is an ensemble of decision trees without bagging. The best performing trees have more voting power.
  • The weights of individual observations are increased when the current ensemble is not good at predicting them. Because of this, it has to build one tree at a time, instead of making all of them in parallel.

Gradient Boosting:

  • This is an ensemble of decision trees as well. It can do bagging, but by default it is turned off. Every tree has a prediction and when summed up, it gives the final prediction.
  • A tree is built, the prediction is multiplied by the learning rate usually 1% to 30%. The target of each observation becomes the residual between the prediction from all trees built so far and the original target.
  • It ends up building 100+ trees one at a time. Each one is attempting to explain the leftover residuals from the ensemble built so far. Because of the learning rate, it takes a while before the residual gets small (i.e. learning rate of 1% might require 100+ trees before it is able to get close to the true target, inching one percent at a time towards it).
  • While all trees can’t run in parallel in boosting methods, it can be computationally heavy to build a single decision tree, so parallel computing can be helpful to build each single tree.
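
A rough sketch of the mechanism using sklearn decision trees on generated data; this hand-rolled loop is for intuition only, not how a production library implements it:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=300, n_features=5, noise=5, random_state=0)

    learning_rate = 0.1
    prediction = np.zeros(len(y))      # ensemble prediction so far
    trees = []

    for _ in range(100):
        residual = y - prediction                      # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residual)                          # each tree explains leftover residuals
        prediction += learning_rate * tree.predict(X)  # shrink each tree's contribution
        trees.append(tree)

    print(np.mean((y - prediction) ** 2))              # training MSE shrinks as trees are added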

SVM:

  • Draws a line which separates the two classes.
  • It allows for error, in which some datapoints of class 0 might end up on the wrong side of the line (the side for class 1), etc. A hyperparameter controls how much leniency it has with how many of these datapoints end up on the wrong side.
  • Rather than drawing any line between the two classes, it tries to draw it in the center to avoid being too close to the data points of either class. It does this mathematically by maximizing the margin: the distance from the line out to the nearest record of class zero, and the same distance in the other direction to the nearest record of class one.
  • This can be applied to regression as well; for regression it finds the line that follows the data without too many errors. An error is defined as being too far away from the line, and “too far” is defined by the parameter epsilon.
  • One of the best things about SVMs is the kernel tricks which allow it to draw lines that curve.
  • Needs feature scaling. Becomes slow with many rows of data. Becomes fast in comparison to other methods with many columns (scales linearly with more columns while others run in polynomial time).
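
A brief sklearn sketch with feature scaling and the RBF kernel trick; the dataset and C values are illustrative:

    from sklearn.datasets import load_breast_cancer
    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    # C controls the leniency about points on the wrong side; kernel="rbf" lets the boundary curve
    for C in [0.1, 1, 10]:
        model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C))
        print(C, cross_val_score(model, X, y, cv=5).mean())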

62 of 83

CHAPTER TWELVE

Unsupervised Algorithms in Machine Learning

12

63 of 83

Unsupervised Algorithms in Machine Learning

Introduction:

  • Common methods: Dimension reduction, clustering, collaborative filtering, and autoencoders.

PCA:

  • Reduces the features into a lower dimensional space while keeping as much of the original variance as possible.
  • To implement it, you use the following concepts: standardize the data, the covariance matrix, eigenvectors, and eigenvalues.
  • The amount of dimensions to keep is often decided by the cumulative percent of variance explained by the first n-components and choosing the best value for n after looking at various values.
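
A short sklearn sketch of choosing n by cumulative explained variance (dataset and component count are illustrative):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    X, _ = load_breast_cancer(return_X_y=True)
    X_std = StandardScaler().fit_transform(X)       # standardize first

    pca = PCA().fit(X_std)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    print(cumulative[:10])                          # pick n where the curve flattens out

    X_reduced = PCA(n_components=5).fit_transform(X_std)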

Clustering methods:

  • Some common algorithms are: K-means, hierarchical clustering, mean shift, and gaussian mixture models.
  • Evaluate the fit with sum of squares of all datapoints to their clusters’ centroid.
  • When using hierarchical clustering, dendrograms can be looked at as well. They are read the following way: the height at which two clusters merge in the diagram is the distance between those two clusters when they merged. Cutting the diagram at a given height therefore tells you what is clustered together for that number of clusters.
  • Hierarchical clustering looks closely at the distance between clusters, four choices to do this are finding the distance between the closest points in two clusters (single), farthest points in two clusters (complete), average distance between all pairs between the two clusters (average), or distance between centroids (centroid).
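
A small sketch of both ideas using sklearn and scipy on generated blobs (illustrative only):

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from scipy.cluster.hierarchy import linkage, dendrogram
    import matplotlib.pyplot as plt

    X, _ = make_blobs(n_samples=150, centers=4, random_state=0)

    # K-means: inertia_ is the sum of squared distances to each point's centroid
    for k in range(2, 8):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, km.inertia_)

    # Hierarchical clustering with "average" linkage, visualized as a dendrogram
    Z = linkage(X, method="average")
    dendrogram(Z)
    plt.show()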

Recommendation Systems:

  • Step 1, choose an approach to get a pool of users+items to score: Popularity, Content-based (similar attributes to what they’ve been interested in before), collaborative filtering (items liked by similar users).
  • Step 2, use supervised learning to train a model on past labeled data and score it on the selected users+items.
  • Step 3, apply rules-based adjustments to the results.
  • The simplest of a dozen collaborative filtering algorithms is to multiply a similarity matrix (a measure of similarity between users, often based on the items each person liked) by an affinity matrix (which items each user liked); see the sketch below.
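
A minimal numpy sketch of that simplest algorithm; the tiny affinity matrix and the cosine-similarity choice are assumptions for illustration:

    import numpy as np

    # Rows = users, columns = items; 1 means the user liked the item (toy data)
    affinity = np.array([
        [1, 0, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
    ])

    # User-user similarity based on the items each person liked (cosine similarity)
    norms = np.linalg.norm(affinity, axis=1, keepdims=True)
    similarity = (affinity @ affinity.T) / (norms @ norms.T)

    # Score = similarity-weighted sum of what similar users liked
    scores = similarity @ affinity
    scores[affinity == 1] = -np.inf        # don't re-recommend items already liked
    print(scores.argmax(axis=1))           # top new item per user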

Algorithms that work well without labels.

64 of 83

Matrix Factorization:

  • A technique to take the original large matrix and convert it into two smaller matrices. Then, just like the previous topic mentioned, you can do simple algorithms like multiplying the two smaller matrices together to get the step 1 results from the previous section “Recommendation Systems”.
  • One algorithm to do this is SVD. If you multiply the matrices back together you get a denoised matrix. Consider mean/median/zero imputation, a more complex method, or a version of SVD that handles missing values well.
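
A minimal numpy sketch of truncated SVD as matrix factorization; the toy ratings matrix and the chosen rank are made up for illustration:

    import numpy as np

    # Toy user-item matrix with a little noise added (illustrative only)
    rng = np.random.default_rng(0)
    true_rank2 = rng.random((6, 2)) @ rng.random((2, 5))
    ratings = true_rank2 + 0.05 * rng.standard_normal((6, 5))

    # Keep only the top-k singular values: two small matrices whose product
    # approximates (and denoises) the original matrix
    U, S, Vt = np.linalg.svd(ratings, full_matrices=False)
    k = 2
    user_factors = U[:, :k] * S[:k]        # 6 x k
    item_factors = Vt[:k, :]               # k x 5
    denoised = user_factors @ item_factors
    print(np.abs(denoised - true_rank2).max())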

65 of 83

CHAPTER THIRTEEN

Introduction to Deep Learning

13

66 of 83

Introduction to Deep Learning

It takes a lot of data to train, and I would summarize its most common use cases as: creation/identification of images/audio, vectorizing words, predicting the next word, time series, and game AI.

You can make use of the deep learning models out there without retraining from scratch such as audio transcription (google's speech_recognition library), image to text (pytesseract), and word2vec (convert a word to a vector), infersent (convert a sentence to a vector), resnet50 (can train it on 20 images and it will often be accurate; transfer learning).

I am interested in building game AI with XGBoost and perhaps using deep learning as more data is collected. This would solve the game and create a simulation that generates data with optimal decisions being made. Then you use decision trees to learn from that data and find a simple strategy that is almost as strong and can be copied by a person.

Neural network models

67 of 83

Perceptron and multi-layer perceptron:

  • It is made up of a lot of nodes. Each connection between nodes has a weight. You pass data along; the weights and the input data are combined, and based on the result each node has an activation which is passed to the next layer of nodes it is connected to.
  • It aims to minimize a loss function. For binary classification it uses binary cross entropy (called log loss), which gives a big penalty when the final layer is confidently wrong and a smaller penalty when it is merely unsure. For multi-class it uses categorical cross entropy, which is the same idea summed over all final nodes, where each final node represents one of the many multi-class outcomes. Sigmoid is a good final activation for binary classification, while softmax is used for multi-class classification: softmax normalizes the final-layer nodes into probabilities, and the one that activates the most is the predicted class. For numerical outputs it uses mean squared error.
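
A minimal Keras sketch of a multi-layer perceptron for binary classification, assuming TensorFlow/Keras is installed; the data shapes and layer sizes are made up:

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    X = np.random.rand(500, 20)                  # toy data: 20 features
    y = (X[:, 0] + X[:, 1] > 1).astype(int)      # toy binary target

    model = keras.Sequential([
        layers.Dense(32, activation="relu", input_shape=(20,)),   # hidden layer
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),   # sigmoid output for binary classification
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)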

Other topics:

  • Gradient descent: Move in the direction that minimizes the cost function. Adding momentum can make it go faster; there are various optimizers that use momentum.
  • Ways to reduce overfitting: Dropout and batch normalization are two methods.
  • Hardware: GPUs.
  • Library: Keras is a good one.
  • Convolutions in a convolutional neural network (CNN):
    • A convolution slides a small 2D or 3D grid (the kernel) across the input; at each position the kernel weights are multiplied with the values underneath and summed into a single number, then the grid is shifted a bit and this is repeated.
    • After a convolution layer, max pooling is often used to make the result simpler: for example a 2x2 grid with stride 2 (so shifts don’t overlap) simply takes the max of the 4 values in the grid and builds a matrix of those maxima.
    • Padding can be considered since each of these operations makes the matrix smaller; padding the matrix with zeros around the edges as needed keeps it from shrinking as fast.
    • Very often a convolution layer will be followed by many other convolution layers and max pooling layers. A neural network has an input layer, n layers of transformations (hidden layers), and an output layer. See the CNN sketch at the end of this list.
  • Imagenet is a huge image dataset.
  • VGGNet architecture hidden layers: 2 convolution layers, max pooling layer, repeat n times, 3 fully connected layers
  • GoogLeNet InceptionNet: A more complex architecture. Often a mix of 1x1, 3x3, and 5x5 convolutions in one layer.
  • ResNet: The ability was added to skip layers.
  • Batch normalization: (activation - mean) / sigma, where the mean and sigma are calculated from the current mini-batch (with running averages used at inference time).
  • Dropout: Randomly select some weights to be zero.
  • CNN suggestions: learning_rate=0.01 to 0.0001; Adam/RMSProp, ReLU/PReLU for hidden layers and anything for output layer, 3x3 filters, 2 convolutions then a max pool, L2 regularization, dropout, and batch normalization.
  • Transfer learning: Remove output layer, freeze weights, add more CNN layers.
  • Autoencoding: Wide input layer, tiny hidden layers, and wide output layers. Very similar to the topic previously mentioned “matrix factorization”. The true labels for the output layer are the input layer (reconstruction error). The output layer can also be thought of as a completely new thing; and you can use this output as the input in another autoencoder to create something similar to that.
  • Generative Adversarial Networks (GAN): One agent creates, the other decides if it was a real (original data) or a fake (created by the first agent).
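
A hedged Keras sketch pulling several of these pieces together (convolutions, max pooling, padding, batch normalization, dropout, softmax); the 28x28 input shape and layer sizes are assumptions for illustration:

    from tensorflow import keras
    from tensorflow.keras import layers

    # 28x28 grayscale inputs, 10 classes (MNIST-like shapes, illustrative only)
    model = keras.Sequential([
        layers.Conv2D(32, (3, 3), padding="same", activation="relu", input_shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),             # 2x2 max pool with stride 2 halves the size
        layers.BatchNormalization(),
        layers.Dropout(0.25),                    # randomly zero some activations
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(10, activation="softmax"),  # softmax picks the most activated class
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.summary()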

68 of 83

CHAPTER FOURTEEN

Modern Regression Analysis in R

14

69 of 83

Modern Regression Analysis in R

Frameworks and Goals of Statistical Modeling:

  • In almost all cases, you as the statistician don’t control the input values, but rather observe the inputs and the eventual output (outside a controlled experiment environment).
  • You learn from the sample to generalize to the population.
  • The task of explanation and the task of prediction are the two key tasks in regression.
  • Yi = B0 + B1*Xi,1 + … + Bp*Xi,p + epsilon_i.
  • It can handle lots of complicated non-linear relationships by adjusting the terms next to each coefficient; say X^2 or sin(X).
  • Assumptions: Linear, independent samples, constant variance at each prediction value, and errors are normally distributed. Can test these assumptions. If a test fails the conclusions might be misleading so that might need to be fixed first.
  • When using exploratory data analysis and testing hypotheses, you should not test your hypotheses against data you’ve seen already.
  • B0: The average value of Y when all X are zero
  • B1: The average change in Y when X1 increases by one unit while all other X2, …, Xp values remain constant.
  • Least squares method: Can be derived from RSS = sum of squared residuals. Taking the derivative and setting it to zero gives us B = (X^T X)^-1 * X^T y as the solution for B. Computing the inverse explicitly doesn’t scale well, so in practice this is solved with more stable numerical methods.
  • R^2 is a metric to see the goodness of fit. It is equal to 1 - RSS/TSS. TSS is the sum of squared differences between the observed y values and the average y value. RSS is the sum of squared differences between the observed y values and the model’s predicted y values. R^2 will typically keep getting better as you add more features, but more features isn’t always a better model. To account for this, there is the adjusted R^2, which adds a penalty for the number of features used.
  • If the task of explanation is being focused on, choosing one feature to drop among correlated features is usually the right action. Say the true coefficient is 2. If you have two collinear features, one coefficient might be 3 and the other -1, but it’s better to drop one and get the coefficient of 2 which is much better for the task of explanation.
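
The course works in R, but here is a quick numpy sketch of the least squares solution and R^2 on generated data, for illustration only:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])  # intercept + 2 features
    beta_true = np.array([2.0, 1.5, -0.7])
    y = X @ beta_true + rng.normal(scale=0.5, size=n)      # Y = XB + epsilon

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)           # B = (X^T X)^-1 X^T y (solved, not inverted)

    y_hat = X @ beta_hat
    rss = np.sum((y - y_hat) ** 2)                         # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)                      # total sum of squares
    print(beta_hat, 1 - rss / tss)                         # estimated coefficients and R^2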

Predicting a numeric outcome in R

70 of 83

Coefficient t-tests:

  • If the p-value for a coefficient is less than 5% then it’s likely it’s a good feature. This is because it rejects the null hypothesis that the coefficient is zero.
  • It’s best to use a single test to see if a linear regression model works well. That’s where the F test comes in, which will return less than a 5% p-value if at least one feature is useful in predicting the outcome.

Confidence intervals:

  • You can generate these 95% confidence intervals for coefficients and also for a model’s prediction.

Ethics in statistics:

  • Transparent in everything you do.
  • Accept improvement ideas.
  • Stats is a science of uncertainty, and we need to acknowledge the limitations of statistical methods to avoid making dramatic deterministic claims.

Tests:

  • Graphs are a great way to validate the assumptions of linear regression.
  • Linear assumption graph: y-axis: residuals, x-axis fitted values. Look for a straight line horizontal around y=0.
  • Independence assumption graph: y-axis residuals, x-axis index. Look for time series patterns.
  • Independence assumption graph 2: y-axis residual of i-th row, x-axis is the residual of the (i-1) row. Then look for a correlation.
  • Constant variance assumption graph: y-axis: residuals, x-axis fitted values. Look for changes in variance as the x value changes.
  • Normality assumption graph: QQ plots.
  • Normality assumption graph 2: y-axis: residuals, x-axis fitted values. Look for normality going from the left to right for the x values.
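
A sketch of these diagnostic plots in Python with statsmodels and matplotlib (the course itself uses R; the generated data is illustrative):

    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 1 + 2 * x + rng.normal(scale=0.5, size=200)        # toy data satisfying the assumptions

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    residuals, fitted = fit.resid, fit.fittedvalues

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    axes[0].scatter(fitted, residuals)            # linearity / constant variance: flat band around 0
    axes[0].axhline(0)
    axes[1].plot(residuals)                       # independence: look for time-ordered patterns
    sm.qqplot(residuals, line="s", ax=axes[2])    # normality: points should follow the reference line
    plt.show()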

Selection:

  • Forward/backward selection. Chosen based on the lowest p-value of the coefficient.
  • AIC and BIC both have a penalty term for when there are a larger number of features. AIC tends to be best for prediction tasks and BIC for explanation tasks.
  • R squared is 1-RSS/TSS. Adjusted R squared is similar with a penalty for the number of features p: 1 - (RSS/(n-(p+1)))/(TSS/(n-1)).
  • MSE can also be used to validate the results on a separate test set.

Collinearity:

  • Try to keep the variables that are causally related to the target. If two variables are highly correlated, their coefficients are no longer interpretable, so you should remove one. Collecting more data can help reduce this correlation.
  • Correlation matrix or VIF are two ways to identify collinearity. Condition number is another way (finding low eigenvalues).
  • Ways to remove collinearity: Manually, ridge, and PCA.
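
A short statsmodels sketch of VIF on deliberately collinear toy data (the square_foot/grade names echo the earlier house-price example and are illustrative):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(0)
    square_foot = rng.normal(1500, 300, size=200)
    grade = 0.9 * square_foot + rng.normal(0, 50, size=200)   # deliberately collinear with square_foot
    age = rng.normal(30, 10, size=200)

    X = sm.add_constant(pd.DataFrame({"square_foot": square_foot, "grade": grade, "age": age}))
    for i, col in enumerate(X.columns):
        print(col, variance_inflation_factor(X.values, i))    # very large VIFs flag collinearity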

71 of 83

CHAPTER FIFTEEN

ANOVA and Experimental Design

15

72 of 83

ANOVA and Experimental Design

Introduction to Experimental Design

  • Experimental study: Where you can control the inputs
  • Observational study: Where you can’t control the inputs.
  • ANOVA: Checking if there are differences in the mean across groups.
  • Each group is typically defined by a categorical column (a string group name).

ANOVA:

  • “Total SS” = sum of “squares between each datapoint and the overall mean”
  • “Between SS” = sum of “squares between each group mean and overall mean, also multiplied by the observation count for that group”
  • “Within SS” = sum of “squares between every data point and its group’s mean”
  • “Total SS” = “Between SS” + “Within SS”
  • You can solve the left hand side or right hand side to get the “Total SS”
  • ANOVA is really just the regression F test.
  • Null hypothesis: No differences across groups.
  • Alternative hypothesis: Some differences between groups.
  • If you reject the null hypothesis: Are all groups different, or are some different? You don’t know until you run other tests.
  • ANCOVA differs from ANOVA in that it controls for confounding variables, for example controlling for students’ initial knowledge when analyzing two teaching methods. This can be as simple as adding the confounding variable into the linear regression.
  • For ANOVA or ANCOVA, regression can be used with dummy variables to determine which group performs differently from other groups.
  • One way to visualize it is to plot a regression line for each group.
  • By adding interaction terms, you can see if the coefficient changes for each group. This can be as simple as checking the p-value of the coefficient.
  • Two bad things are running too many tests or running tests after looking at the data. Post-hoc comparisons are a technique that makes running tests after looking at the data slightly less bad.
  • A viable set of post-hoc tests is group1=group2, group2=group3, and group1=group3. Other sets of tests get complicated quickly because the planned contrasts all have to be orthogonal.
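
A small statsmodels sketch of ANOVA and ANCOVA as regression on generated data (the group and confounder names are made up for illustration):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "method": np.repeat(["A", "B", "C"], 30),        # teaching method (group)
        "prior":  rng.normal(50, 10, size=90),           # confounder: initial knowledge
    })
    df["score"] = 60 + 5 * (df["method"] == "B") + 0.4 * df["prior"] + rng.normal(0, 5, size=90)

    anova_fit  = ols("score ~ C(method)", data=df).fit()            # one-way ANOVA as regression
    ancova_fit = ols("score ~ C(method) + prior", data=df).fit()    # ANCOVA: control for the confounder
    print(sm.stats.anova_lm(anova_fit, typ=2))
    print(sm.stats.anova_lm(ancova_fit, typ=2))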

Statistical tests

73 of 83

Post-Hoc Test in Python:

  • Statsmodel library.
  • An OLS model with the group as a categorical predictor is equivalent to an ANOVA model.
  • The post-hoc test is MultiComparison. It finds which groups are significantly different from each other.
  • Unadjusted pairwise comparisons will produce more false positives as you run more tests; if you want to avoid those, use an adjustment such as Bonferroni or Tukey’s HSD:
    • The function pairwise_tukeyhsd applies Tukey’s HSD adjustment (a Bonferroni adjustment is available through MultiComparison instead), which gives more conservative p-values. With such an adjustment, a false positive will only occur about 5% of the time across the whole family of tests, even if you ran many tests, for example 10 tests.
    • Bonferroni is a generic method to avoid false positives when doing many tests.
    • Tukey’s HSD is used when you want to determine which groups are significantly different from each other while avoiding false positives when doing many tests.
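
A short statsmodels sketch of both adjustments on generated groups; the data is illustrative and the Bonferroni call via MultiComparison.allpairtest is my reading of the API:

    import numpy as np
    import pandas as pd
    from scipy import stats
    from statsmodels.stats.multicomp import MultiComparison, pairwise_tukeyhsd

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "group": np.repeat(["g1", "g2", "g3"], 25),
        "y": np.concatenate([rng.normal(10, 2, 25), rng.normal(12, 2, 25), rng.normal(10, 2, 25)]),
    })

    # Tukey's HSD: which group pairs differ, with family-wise error control
    print(pairwise_tukeyhsd(df["y"], df["group"], alpha=0.05))

    # Bonferroni-adjusted pairwise t-tests via MultiComparison
    mc = MultiComparison(df["y"], df["group"])
    print(mc.allpairtest(stats.ttest_ind, method="bonf")[0])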

False negatives:

  • The best way to avoid this is to analyze beforehand how much data you will need to collect to avoid these false negatives.
  • You must first decide the significance level (usually 5%) and the alternative hypothesis (mean1 != mean2 or mean1 > mean2 are common). Next you estimate the variance of the distribution and the smallest effect size you care about. Lastly, you get an output for the sample size required to keep the false negative rate acceptably low.
  • It’s important to do this beforehand. Once you look at the data and change the test it is highly likely data dredging will occur.
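
A brief statsmodels power-analysis sketch; the effect size, alpha, and power values below are assumptions you would choose before collecting data:

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    n_per_group = analysis.solve_power(effect_size=0.5,        # (mean1 - mean2) / sd, assumed
                                       alpha=0.05,             # significance level
                                       power=0.8,              # 1 - false negative rate
                                       alternative="two-sided")
    print(n_per_group)    # required sample size per group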

Two-way ANOVA:

  • This tests for two group variables and an outcome variable. Effects of the two variables and interactions between them as well.
  • A single study is more efficient in terms of sample size needed.
  • A diagram similar to the one used for chi-squared tests can be created.
  • Another view is a line plot; X axis is different groups, each line is also a different group.
  • What’s most interesting is setting up the null hypotheses. There are usually three null hypotheses at the same time:
    • There is no significant difference among the levels of factor A
    • There is no significant difference among the levels of factor B
    • There is no significant difference among the levels of the interaction of factor A and factor B
    • Adjust your p-values with methods like bonferroni.

Experimental Design:

  • How the data is collected matters for your conclusions.
  • Often a designed randomized experiment is needed to support causal claims. For example, the conclusion from an A/B test is a causal claim about which variant is better.
  • Causal conclusions: A correlation between the two, the cause happens before the effect, and no other variable is really causing the effect (similar to collinearity, sometimes it isn’t tracked so it goes unnoticed).

Sampling techniques:

  • Randomized.
  • Stratified: One idea is to do a two-way ANOVA where the stratum is one factor.
  • Fully factorial design: Assigning lots of different factors randomly to each user.

74 of 83

CHAPTER SIXTEEN

Generalized Linear Models and Nonparametric Regression

16

75 of 83

Generalized Linear Models and Nonparametric Regression

Introduction:

  • GLMs reduce to linear regression for continuous predictions and to logistic regression for binary classification predictions (1/0).
  • GLMs are a statistical framework that handles many types of probability distributions/regressions: Poisson, Gamma, Inverse Gaussian, Binomial, Negative Binomial, Tweedie, Multinomial (multi-class), Ordinal, Proportional hazards, Robust.

Components of GLM:

  • Random component - The distribution of the response (Gaussian for continuous responses, binomial for binary, Poisson for count data, and inverse Gaussian for skewed continuous data).
  • Systematic component - The predictor which is coefficients times the terms.
  • The link function transforms the systematic component. For Gaussian it’s the identity (no transformation), for binomial it’s the logit link, for Poisson it’s the log link, and for the inverse Gaussian the canonical link is the inverse squared (1/mu^2).
  • Mean function - The expected value of the response, obtained by applying the inverse link to the linear predictor.
  • Dispersion parameter - To account for overdispersion when the variance is greater than what would be expected by the chosen distribution.
  • Some exponential-family distributions - binomial, Poisson, normal, and exponential.

Binomial Regression Parameter estimation:

  • Start with the pmf for a single observation.
  • Convert it to the joint pmf by multiplying all cases (all observations).
  • Swap out p(i) with e^(beta formula) / (1 + e^(beta formula)) as seen in logistic regression.
  • For easier calculation you log both sides, so the product over all cases becomes a summation over all cases, since products turn into sums when you take the log.
  • Find the values that maximize the summation formula using an optimization library. For example the iteratively reweighted least squares algorithm.
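
A compact statsmodels sketch that fits this binomial/logistic GLM by maximum likelihood (iteratively reweighted least squares under the hood); the simulated data is illustrative:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
    p = 1 / (1 + np.exp(-(0.5 + 1.2 * df["x1"] - 0.8 * df["x2"])))   # true probabilities
    df["y"] = rng.binomial(1, p)

    fit = smf.glm("y ~ x1 + x2", data=df, family=sm.families.Binomial()).fit()
    print(fit.summary())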

GLMs and their benefits.

76 of 83

Interpreting betas:

  • In linear regression, if B1 is 10 it can be interpreted as an increase in X1 by 1 results in a change in the outcome by 10 within the range of common X1 values.
  • In logistic regression, it is very similar, but an increase in X1 would instead increase the log-odds by that amount. The outcome ranges from 0% to 100%, so higher log-odds lead to a predicted probability closer to 100%.
  • To consider this on the odds scale, we can exponentiate. So if the B1 is equal to 0.021 you would do e^0.021 which equals 1.02. The increase is multiplicative when we consider it this way since we’ve exponentiated it. Linear regression is additive, as is log-odds in logistic regression.

Poisson Regression:

  • To keep the result positive after multiplying the predictors by the coefficients, you use an exponential. Since poisson is count data, exponential also helps it be right-skewed.
  • Say coefficients times values equals 6. Then you say lambda equals e^6. Lambda is now predicted as e^6, and poisson has properties such as expected value equals lambda and variance equals lambda, so you know the expected count will be e^6.

Poisson Regression Parameter Estimation:

  • Follow the same steps as the binary version, just using a poisson pmf.
  • Keep in mind the exponent rule (e^x)^y = e^(xy) when exchanging lambda with e raised to the coefficients times the observation values.
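
A compact statsmodels sketch of Poisson regression on simulated count data (the variable names are illustrative):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    df = pd.DataFrame({"activity": rng.normal(size=400)})
    lam = np.exp(1.0 + 0.6 * df["activity"])       # lambda = e^(coefficients times values)
    df["sales"] = rng.poisson(lam)

    fit = smf.glm("sales ~ activity", data=df, family=sm.families.Poisson()).fit()
    print(fit.params)                  # coefficients on the log scale
    print(np.exp(fit.params))          # exponentiate for a multiplicative rate interpretation
    print(fit.deviance)                # goodness of fit: smaller deviance = better fit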

Interpretability:

  • This is very good for count data. The alternatives have flaws; for example, using ridge regression to predict online sales counts from online activity would likely require logging the outcome first, which is less stable than using Poisson regression.
  • Coefficients are understood to be the rate of change of the log of the expected count from changes in a predictor variable, with the other variables held constant.

Goodness of Fit for Poisson Regression:

  • Deviance: A smaller deviance means a better fit.
  • Pearson Chi-Square Statistic: A significant chi-square indicates a lack of a fit.
  • Residual analysis: Patterns can suggest areas the model may be lacking.
  • Others: Dispersion test, AIC, BIC, and the Hosmer-Lemeshow test.

Overdispersion:

  • The variance will be greater than the mean, which the Poisson distribution assumes are equal. This can lead to confidence intervals that are too narrow, and other distributions such as negative binomial regression might be better. Quasi-Poisson regression adds a dispersion parameter to relax the equal mean/variance assumption.

Nonparametric:

  • Make it as flexible as is reasonable.
  • If you set no limit to the flexibility it will just memorize the data.
  • There is no theory to tell you the exact best way to do this.

77 of 83

Generalized Additive Models (GAM)

  • Extends linear models to be more flexible and allow non-linear relationships.
  • Uses regularization and piecewise smoothing (spline functions).
  • Interpretation can be tough due to the smoothing.
  • To get a good fit visualization tools are key.
  • It works like this:
    • Look at the x1 and y plot. Fit different smooth curves at different sections. That builds a good fit for that relationship.
    • Do that for more features, even interaction features. Some can be kept linear, or you can allow flexibility.
    • The flexibility is automatic by an algorithm that finds how flexible it should be but should be validated visually as well.
    • Fit the model which adds the results together.
  • You can blend things like poisson learned previously with GAMs.
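
A minimal sketch using the third-party pygam library (assumed installed via pip install pygam); the generated data and smooth terms are illustrative:

    import numpy as np
    from pygam import LinearGAM, s

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(300, 2))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.2, size=300)   # non-linear in the first feature

    # s(0) and s(1) are spline (piecewise smooth) terms for features 0 and 1;
    # the amount of smoothing is chosen automatically but should also be checked visually
    gam = LinearGAM(s(0) + s(1)).fit(X, y)
    gam.summary()
    print(gam.predict(X[:5]))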

78 of 83

CHAPTER SEVENTEEN

Relational Database Design

17

79 of 83

Relational Database Design

Benefits of a database:

  • A database works well with structured data, in other words predefined columns and data types.
  • It also works well with complex SQL queries.
  • Multiple users can modify the data simultaneously.
  • Can scale with indexing and partitioning.
  • Is good for star schemas where you have a fact table with lots of rows and dimension tables with lots of columns.
  • It is consistent and reliable because transactions follow ACID (Atomicity, Consistency, Isolation, and Durability)

Benefits of a file system:

  • Something simple, using folders, runs fast, easy to use, and for static content.

Key concepts:

  • Indexes keep sorted copies of the data to allow quick searches in a database, but each index is another copy sorted differently, which makes writes slower because every copy must be updated on each write.
  • Each table is a type of entity. Each record is an entity. Entities have attributes, primary identifiers, relationship keys, constraints (like this attribute is always an integer between 0-20, etc.), a data type assigned to each attribute, default values, and nullability.
  • Relationships can be one-to-one, one-to-many, or many-to-many (called cardinality). Different design techniques are required depending on this cardinality.
  • An Entity Relationship Diagram (ERD) is where you define the entities as boxes, attributes as text in the boxes, and relationships as lines between attributes between boxes. Crow’s Foot notation has a single line meaning one, a circle meaning zero, and 3 lines meaning multiple. You can mix them to say “one and only one” or “one or zero” etc. You can place a word in the center of the line to represent the type of relationship that best describes this relationship.

Designing a database such as MySQL.

80 of 83

Data storage tips

  • Whenever a string field is duplicated in storage there is potential to store the string as an integer with a lookup table.
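
A small sqlite3 sketch of that tip: store the repeated string once in a lookup table and keep an integer key in the main table (the table and column names are made up for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    cur.execute("CREATE TABLE color_lookup (color_id INTEGER PRIMARY KEY, color_name TEXT UNIQUE)")
    cur.execute("""CREATE TABLE product (
                       product_id INTEGER PRIMARY KEY,
                       name TEXT,
                       color_id INTEGER REFERENCES color_lookup(color_id))""")

    cur.executemany("INSERT INTO color_lookup (color_name) VALUES (?)", [("red",), ("blue",)])
    cur.execute("INSERT INTO product (name, color_id) VALUES (?, ?)", ("chair", 1))

    # Join back to the string only when it is needed
    for row in cur.execute("""SELECT p.name, c.color_name
                              FROM product p JOIN color_lookup c USING (color_id)"""):
        print(row)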

81 of 83

CHAPTER EIGHTEEN

The Structured Query Language (SQL)

18

82 of 83

The Structured Query Language (SQL)

Introduction:

  • A

Writing queries

83 of 83

End