Fields: Department | Course Number | Course Name | Prerequisites | Corequisites | Assignment | Topic | Question Number | Part Number | Percentage of Total Grade | Question Type | Question | Solution Type | Solution | Few shot question 1 | Few shot solution 1 | Few shot question 2 | Few shot solution 2 | Few shot question 3 | Few shot solution 3
2
73 | Mathematics | 18.300 | Principles of Continuum Applied Mathematics | Prerequisites: 18.02, 18.03 | Corequisites: None | Problem Set 5 | Traveling Waves | Question 1a | 1.09375% of total grade | Text
Imagine that someone tells you that the following equation is a model for traffic flow:
$$
c_{t}+c c_{x}=\nu c_{x t},
$$
where $\nu>0$ is "small" and $c$ is the wave velocity, related to the car density via $c=\frac{d Q}{d \rho}$. The objective of this problem is to ascertain if (1.1) makes sense as a model for traffic flow. To this end, answer the two questions below.
Does the model have acceptable traveling wave "shock" solutions $c=F(z)$, where $z=\frac{x-Ut}{\nu}$ and $U$ is a constant?
Here "acceptable" means the following
1a. The function $F$ has finite limits as $z \rightarrow \pm \infty$, i.e.: $\quad c_{L}=\lim _{z \rightarrow-\infty} F(z) \quad$ and $\quad c_{R}=\lim _{z \rightarrow+\infty} F(z)$. Further: the derivatives of $F$ vanish as $z \rightarrow \pm \infty$, and $c_{L} \neq c_{R}$.
This means that, as $\nu \rightarrow 0$, the solution $c$ becomes a discontinuity traveling at speed $U$, with $c=c_{L}$ for $x<U t$ and $c=c_{R}$ for $x>U t$. That is, a shock wave.
1b. The solution satisfies the Rankine-Hugoniot jump condition $U=\frac{[Q]}{[\rho]}$,
where $\rho_{L}$ and $\rho_{R}$ are related to $c_{L}$ and $c_{R}$ via $c_{L}=\frac{d Q}{d \rho}\left(\rho_{L}\right)$ and $c_{R}=\frac{d Q}{d \rho}\left(\rho_{R}\right)$.
Assume that $Q=Q(\rho)$ is a quadratic traffic flow function - see remark 1.1.
1c. The solution satisfies the entropy condition $c_{L}>U>c_{R}$.
To answer this question:
\begin{itemize}
\item Find all the solutions satisfying 1a. Get explicit formulas for $F$ and $U$ in terms of $c_{L}$, $c_{R}$, and $z$.
\item Check if the solutions that you found satisfy 1b.
\item Check if the solutions that you found satisfy 1c.
\item Finally, given the three checks above: Does, so far, the equation make sense as a model for traffic flow?
\end{itemize}
Hints.
\begin{itemize}
\item Find the ODE $F$ satisfies. Show it can be reduced to the form $F^{\prime}=P(F)$, where $P$ is a second-order polynomial.
\item Write $P$ in terms of its two zeroes, $c_{1}$ and $c_{2}$, and express all the constants (e.g.: $U$) in terms of $c_{1}$ and $c_{2}$.
\item Solve now the equation, and relate $c_{1}$ and $c_{2}$ to $c_{L}$ and $c_{R}$. You are now ready to proceed with the items above.
\item Remember that, while the density $\rho$ has to be non-negative, wave speeds can have any sign.
\end{itemize}
Open
Substituting $c=F(z)$, where $z=\frac{x-U t}{\nu}$, into the pde gives the ode
$$
(F-U) F^{\prime}=-U F^{\prime \prime},
$$
where the primes indicate differentiation with respect to $z$. This equation can be integrated once, to obtain
$$
U F^{\prime}=U F-\frac{1}{2} F^{2}+\kappa,
$$
where $\kappa$ is a constant of integration. The right hand side in this equation is a quadratic function of $F$, with maximum value at $F=U$, where it reaches the value $\frac{1}{2} U^{2}+\kappa$. Therefore, for $\kappa$ large enough, the right hand side in (1.5) has two real zeros, and the equation can be written in the form
$$
U F^{\prime}=-\frac{1}{2}\left(F-c_{1}\right)\left(F-c_{2}\right),
$$
where $c_{1} \geq c_{2}$ are constants,
$$
U=\frac{1}{2}\left(c_{1}+c_{2}\right), \quad \text { and } \quad \kappa=-\frac{1}{2} c_{1} c_{2} .
$$
As long as $U \neq 0$, the appropriate (and explicit) solutions to (1.6) follow from the hyperbolic tangent - since $y=\tanh (s)$ satisfies $y^{\prime}=1-y^{2}$. That is
$$
F=U+\frac{c_{1}-c_{2}}{2} \tanh \left(\frac{c_{1}-c_{2}}{4 U} z\right) .
$$
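One can verify this formula symbolically; a minimal sketch, assuming sympy is available:

import sympy as sp

z, c1, c2 = sp.symbols('z c1 c2', positive=True)
U = (c1 + c2) / 2                                      # wave speed, from U = (c1 + c2)/2 above
F = U + (c1 - c2) / 2 * sp.tanh((c1 - c2) / (4 * U) * z)

# Residual of U F' = -(1/2)(F - c1)(F - c2); should simplify to 0
residual = U * sp.diff(F, z) + sp.Rational(1, 2) * (F - c1) * (F - c2)
print(sp.simplify(residual))  # -> 0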
Assume that $c_{1}>c_{2}$, since the case $c_{1}=c_{2}$ is trivial. Then, with $c_{L}=\lim _{z \rightarrow-\infty} F(z)$ and $c_{R}=\lim _{z \rightarrow+\infty} F(z)$,
$$
\begin{array}{ll}
\text { If } U>0, & c_{L}=c_{2} \text { and } c_{R}=c_{1} . \text { Thus } c_{L}<U<c_{R} . \\
\text { If } U<0, & c_{L}=c_{1} \text { and } c_{R}=c_{2} . \text { Thus } c_{L}>U>c_{R} .
\end{array}
$$
Thus the model has traveling wave solutions satisfying 1a for all values of $U$, except $U=0$ (note that equation (1.5) yields $F=$ constant for $U=0$). However:
1. Rankine-Hugoniot jump conditions: satisfied, since the wave velocity is given as the average of the characteristic velocities, which is correct for a quadratic flow function - see remark 1.1.
2. Entropy condition: violated when $U>0$, as shown by (1.9).
Thus, we must conclude that (1.1) is NOT a good model for traffic flow. Yet we seem to have a derivation of it in remark 1.1. However, note:
It is, indeed, reasonable to assume that the drivers respond to the rate of change of the density. But why should they respond to $\rho_{t}$ alone, as in (1.2)? The rate of change of the density a driver sees is $r=\rho_{t}+u \rho_{x}$, not $\rho_{t}$! The drivers should also respond to the local gradient of the density $\rho_{x}$, especially when it is large.
Since $\rho_{t} \approx-c \rho_{x}$ (at least away from shocks), it follows that $r=\rho_{t}+u \rho_{x} \approx(u-c) \rho_{x}$. But $u \geq c$, so $\rho_{x}$ and $r$ have the same sign. Thus a model with $q=Q(\rho)-\nu \rho_{x}$ will behave properly [this model is actually used], but one based on (1.2) will not. In fact $r \approx-\frac{u-c}{c} \rho_{t}$, so that (1.2) gives the wrong sign for the correction whenever positive wave speeds are involved!
Let us now consider solutions of the form $c=c_{0}+u$, where $c_{0}$ is a constant and $u$ is very small. Then
$$
u_{t}+c_{0} u_{x}=\nu u_{x t}.
$$
This has solutions of the form $u=e^{i k x+\lambda t}$, provided that
$$
\lambda=\frac{-i c_{0} k}{1-i \nu k}=\frac{\nu c_{0} k^{2}}{1+\nu^{2} k^{2}}-i \frac{c_{0} k}{1+\nu^{2} k^{2}} .
$$
If $c_{0}>0$, these solutions grow (exponentially), which means that any uniform traffic flow with $c_{0}>0$ is unstable, at least according to this model. This is rather strange, since $c_{0}>0$ corresponds to light traffic. Another indication that this is not a good model for traffic flow.
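A quick symbolic check of this dispersion relation (a sketch, assuming sympy):

import sympy as sp

k, nu, c0 = sp.symbols('k nu c0', positive=True)
lam = -sp.I * c0 * k / (1 - sp.I * nu * k)   # from substituting u = e^{ikx + lambda t}
# Real part should be nu c0 k^2 / (1 + nu^2 k^2) > 0, i.e. exponential growth
print(sp.simplify(sp.re(lam) - nu * c0 * k**2 / (1 + nu**2 * k**2)))  # -> 0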
Consider the traffic flow equation
$$
\rho_{t}+q_{x}=0,
$$
for a flow $q=Q(\rho)$ that is a quadratic function of $\rho$. In this case $c=\mathrm{d} Q / \mathrm{d} \rho$ is a conserved quantity as well (why?). Thus the problem (including shocks, if any) can be entirely formulated in terms of $c$, which satisfies
$$
c_{t}+\left(\frac{1}{2} c^{2}\right)_{x}=0 .
$$
Consider the initial value problem determined by (5.2) and
$$
c(x, 0)=0 \text { for } x \leq 0 \text { and } c(x, 0)=2 \sqrt{x} \geq 0 \text { for } x \geq 0 .
$$
Without actually solving the problem, argue that the solution to this problem must have the form
$$
c=t f\left(x / t^{2}\right) \text { for } t>0, \text { for some function } f \text {. }
$$
Hint. Let $c=c(x, t)$ be the solution. For any constant $a>0$, define $\mathcal{C}=\mathcal{C}(x, t)$ by $\mathcal{C}(x, t)=\frac{1}{a} c\left(a^{2} x, a t\right)$. What problem does $\mathcal{C}$ satisfy? Use now the fact that the solution to (5.2-5.3) is unique to show that (5.4) must apply, by selecting the constant $a$ appropriately at any fixed time $t>0$.
Now we proceed with the answer to the problem. Note that
$$
\mathcal{C}_{t}=c_{t} \quad \text { and } \quad\left(\mathcal{C}^{2}\right)_{x}=\left(c^{2}\right)_{x}
$$
where $c$ and its derivatives are evaluated at $\left(a^{2} x, a t\right)$. Further:
$\mathcal{C}(x, 0)=c(x, 0)$. It follows that $\mathcal{C}=c$, that is:
$$
c(x, t)=\frac{1}{a} c\left(a^{2} x, a t\right) \quad \text { for any } a>0 .
$$
Now, evaluate (5.6) at $t=1 / a$. Since $a>0$ is arbitrary, it follows that
$$
c(x, t)=t\, c\left(x / t^{2}, 1\right) \quad \text { for any } t>0,
$$
which is (5.4) with $f(\xi)=c(\xi, 1)$.
Consider the traffic flow equation
$$
\rho_{t}+q_{x}=0,
$$
for a flow $q=Q(\rho)$ that is a quadratic function of $\rho$. In this case $c=\mathrm{d} Q / \mathrm{d} \rho$ is a conserved quantity as well (why?). Thus the problem (including shocks, if any) can be entirely formulated in terms of $c$, which satisfies
$$
c_{t}+\left(\frac{1}{2} c^{2}\right)_{x}=0 .
$$
For the solution obtained in item 2, evaluate $c_{x}$ at $x=0$ for $t>0$. Note that this derivative is discontinuous there, so it has two values (left and right).
Finally, from (5.8-5.9), at $x=0$ and $t>0$,
$$
c_{x}=0 \text { from the left, and } c_{x}=1 / t \text { from the right. }
$$
Consider the traffic flow equation
$$
\rho_{t}+q_{x}=0,
$$
for a flow $q=Q(\rho)$ that is a quadratic function of $\rho$. In this case $c=\mathrm{d} Q / \mathrm{d} \rho$ is a conserved quantity as well (why?). Thus the problem (including shocks, if any) can be entirely formulated in terms of $c$, which satisfies
$$
c_{t}+\left(\frac{1}{2} c^{2}\right)_{x}=0 .
$$
Use the method of characteristics to solve the problem in (5.2-5.3). Write the solution explicitly for all $t>0$, and verify that it satisfies (5.4). Warning: the solution involves a square root. Be careful to select the correct sign, and to justify your choice.
Next we solve (5.2-5.3) using characteristics. For $\zeta \leq 0$ we obtain $c=0$ along $x=\zeta$. Hence these characteristics give
$$
c=0 \text { for } x \leq 0 .
$$
On the other hand, for $\zeta \geq 0$ the characteristics give $c=2 \sqrt{\zeta}$ along $x=2 \sqrt{\zeta} t+\zeta$. Thus
$$
c=2\left(\sqrt{x+t^{2}}-t\right)=2 t\left(\sqrt{1+\frac{x}{t^{2}}}-1\right) \text { for } x \geq 0 .
$$
As $\zeta$ varies from $\zeta=0$ to $\zeta=\infty$, the characteristics $x=2 \sqrt{\zeta} t+\zeta$ cover the entire region $x \geq 0$. Further, they do so one-to-one, since $\partial_{\zeta} x=1+t / \sqrt{\zeta}>0$. Hence we can solve for $\zeta$ as a function of $(x, t)$. To do so we write these characteristics in the form $(t+\sqrt{\zeta})^{2}=x+t^{2}$, so that $\sqrt{\zeta}=-t+\sqrt{x+t^{2}}$. Note that, since $\sqrt{\zeta} \geq 0$ is required, the positive square root $\sqrt{x+t^{2}}$ must be selected.
The solution to $(5.2-5.3)$ is given by $(5.8-5.9)$. This clearly satisfies (5.4), with $f(z)=0$ for $z<0$, and $f(z)=2(\sqrt{1+z}-1)$ for $z>0$.
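As a numerical sanity check, one can confirm that the explicit solution matches the similarity form (5.4); a sketch assuming numpy:

import numpy as np

def c(x, t):
    # Solution (5.8)-(5.9): c = 0 for x <= 0, else 2(sqrt(x + t^2) - t)
    return np.where(x <= 0, 0.0, 2.0 * (np.sqrt(np.maximum(x, 0.0) + t**2) - t))

f = lambda z: np.where(z <= 0, 0.0, 2.0 * (np.sqrt(1.0 + np.maximum(z, 0.0)) - 1.0))

x = np.linspace(-2.0, 8.0, 11)
for t in (0.5, 1.0, 2.5):
    assert np.allclose(c(x, t), t * f(x / t**2))   # c = t f(x/t^2), as in (5.4)
print("similarity form verified")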
3
121 | EECS | 6.191 | Computation Structures | Prerequisites: 6.100A, 8.02 | Corequisites: None | Midterm Exam 1 | Binary Arithmetic | Question 1d | 0.9% of total grade | Text
Using 8-bit 2's complement encoding, compute $((\sim(0xAB \& 0x55))\ //\ 0x04) * 0x04$, where "$\sim$" represents bitwise negation, "\&" represents a bitwise AND, "//" represents integer division (ignores remainder), and "*" represents multiplication.
For performing // and *, the only operations allowed are $>>_{a}$, $>>_{l}$, and $<<_{l}$, which correspond to arithmetic right shift, logical right shift, and logical left shift respectively. Clearly specify which operation to use for performing // and *. Write the intermediate and final answers in 8-bit 2's complement.
Open
$\sim(0xAB \& 0x55)$: $0b1111\_1110$.
Operation used for performing "//": $>>_{a}$.
Operation used for performing "*": $<<_{l}$.
$((\sim(0xAB \& 0x55)) / / 0x04) * 0x04$: $0b1111\_1100$.
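The whole computation can be replayed in plain Python (a sketch; the helper names are ours, not part of the problem):

def to_signed8(x):
    # Interpret the low 8 bits as a 2's complement value
    x &= 0xFF
    return x - 0x100 if x & 0x80 else x

def asr2(x):
    # Arithmetic right shift by 2 on an 8-bit value (replicates the sign bit)
    return ((x >> 2) | 0xC0) if x & 0x80 else (x >> 2)

v = (~(0xAB & 0x55)) & 0xFF   # 0b1111_1110 (-2)
v = asr2(v)                   # // 0x04 via >>_a : 0b1111_1111 (-1)
v = (v << 2) & 0xFF           # *  0x04 via <<_l : 0b1111_1100 (-4)
print(f"{v:#010b}", to_signed8(v))   # 0b11111100 -4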
Write -3 and -4 in 4-bit 2's complement notation, then add them together using fixed width 2's complement arithmetic. Show your work. Provide your result in binary, and decimal. For each computation also specify whether or not overflow occurred.
$-3 = 0b1101$ and $-4 = 0b1100$. Adding: $0b1101 + 0b1100 = 0b1\_1001$; discarding the carry-out bit leaves $0b1001$.
Sum in binary: $0b1001$
Sum in decimal: $-7$
Did overflow occur? (Yes/No): No
Write 7 and 4 in 4-bit 2's complement notation, then add them together using fixed width 2's complement arithmetic. Show your work. Provide your result in binary, and decimal. For each computation also specify whether or not overflow occurred.
$7 = 0b0111$ and $4 = 0b0100$. Adding: $0b0111 + 0b0100 = 0b1011$. Two positive operands produced a negative result, so overflow occurred.
Sum in binary: $0b1011$
Sum in decimal: $-5$
Did overflow occur? (Yes/No): Yes
You are given the truth table for a circuit that takes a 3-bit unsigned binary input ($\mathrm{X}=\mathrm{ABC}$), multiplies it by 2 (mod 8) and adds 1 (mod 8) to it to produce a 3-bit unsigned binary output ($\mathrm{Y}=\mathrm{A}^{\prime} \mathrm{B}^{\prime} \mathrm{C}^{\prime}$).
\begin{tabular}{|c|c|c|c|c|c|}
\hline $\mathrm{A}$ & $\mathrm{B}$ & $\mathrm{C}$ & $\mathrm{A}^{\prime}$ & $\mathrm{B}^{\prime}$ & $\mathrm{C}^{\prime}$ \\
\hline 0 & 0 & 0 & 0 & 0 & 1 \\
\hline 0 & 0 & 1 & 0 & 1 & 1 \\
\hline 0 & 1 & 0 & 1 & 0 & 1 \\
\hline 0 & 1 & 1 & 1 & 1 & 1 \\
\hline 1 & 0 & 0 & 0 & 0 & 1 \\
\hline 1 & 0 & 1 & 0 & 1 & 1 \\
\hline 1 & 1 & 0 & 1 & 0 & 1 \\
\hline 1 & 1 & 1 & 1 & 1 & 1 \\
\hline
\end{tabular}
For the above truth table, write out a minimal sum-of-products for each function $\mathrm{A}^{\prime}(\mathrm{A}, \mathrm{B}, \mathrm{C}), \mathrm{B}^{\prime}(\mathrm{A}, \mathrm{B}, \mathrm{C})$, and $\mathrm{C}^{\prime}(\mathrm{A}, \mathrm{B}, \mathrm{C})$.
Minimal sum-of-products for $A^{\prime}(A, B, C)=$ B.
Minimal sum-of-products for $\mathrm{B}^{\prime}(\mathrm{A}, \mathrm{B}, \mathrm{C})=$ C.
Minimal sum-of-products for $C^{\prime}(A, B, C)=$ 1.
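A short loop confirms these expressions against the specification $Y=(2X+1) \bmod 8$ (a sketch in Python):

for x in range(8):
    a, b, c = (x >> 2) & 1, (x >> 1) & 1, x & 1
    y = (2 * x + 1) % 8
    # Minimal SOP answers: A' = B, B' = C, C' = 1
    assert ((y >> 2) & 1, (y >> 1) & 1, y & 1) == (b, c, 1)
print("A' = B, B' = C, C' = 1 hold for all eight inputs")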
4
31 | Mathematics | 18.01 | Calculus I | Prerequisites: None | Corequisites: None | Problem Set 1 | Tangent Lines | Question 14b | 0.07919746568% of total grade | Text
The tangent line is closely related to linear approximation. This little problem should help clarify that. Let $f(x)=x^{2}$ and consider the tangent line to the graph at $x=1$. This line has the form $y=L(x)$, and you computed $L(x)$ in the last problem.
Compute $L(1.2)$ and $L(1.4)$.
Numerical
At $x=1$, using the results of Problem 13, the tangent line is
$$
L(x)=2 x-1 .
$$
Using $L(x)$ to approximate $f(x)$ :
$$
f(1.2) \approx 2 \times 1.2-1=1.4 \text {, }
$$
and
$$
f(1.4) \approx 2 \times 1.4-1=1.8 \text {. }
$$
These results agree with the linear approximations in part (a), as they should because the tangent line is the linear approximation.
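For a concrete check of these numbers (a sketch in Python):

f = lambda x: x**2
L = lambda x: 2 * x - 1          # tangent line at x = 1
for x in (1.2, 1.4):
    print(x, L(x), f(x), f(x) - L(x))
# L(1.2) = 1.4, f(1.2) = 1.44, error 0.04
# L(1.4) = 1.8, f(1.4) = 1.96, error 0.16

The errors 0.04 and 0.16 are exactly $(\Delta x)^{2}$, as expected since $f^{\prime \prime}=2$.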
The tangent line is closely related to linear approximation. This little problem should help clarify that. Let $f(x)=x^{2}$ and consider the tangent line to the graph at $x=1$. This line has the form $y=L(x)$, and you computed $L(x)$ in the last problem.
Use the linear approximation of $f$ around $x=1$ to approximate $f(1.2)$ and $f(1.4)$.
The linear approximation to $f(x)=x^{2}$ at $x=1$ is
$$
f(1+\Delta x) \approx 1+2 \Delta x .
$$
Thus,
$$
f(1.2) \approx 1+2 \times 0.2=1.4,
$$
and
$$
f(1.4) \approx 1+2 \times 0.4=1.8.
$$
Let $f(x)=x^{2}$. Compute the tangent line to the graph of $f$ at $x=-1$, at $x=0$, and at $x=1$. Write each line in the form $y=m x+b$. Then sketch the graph of $f(x)$ and the three tangent lines.
With $f(x)=x^{2}, f^{\prime}(x)=2 x$. The equation of the tangent line at $x=a$ is
$$
L(x)=f(a)+f^{\prime}(a)(x-a) .
$$
In $m x+b$ form, it's
$$
L(x)=\underbrace{f^{\prime}(a)}_{m} x+\underbrace{\left(f(a)-a f^{\prime}(a)\right)}_{b} .
$$
With $a=-1, f(a)=1$, and $f^{\prime}(a)=-2$. Thus,
$$
L(x)=-2 x+(1-(-1) \times(-2))=-2 x-1 .
$$
With $a=0, f(a)=0$, and $f^{\prime}(a)=0$. Thus, $L(x)=0$.
With $a=1, f(a)=1$, and $f^{\prime}(a)=2$. Thus,
$$
L(x)=2 x-1 .
$$
Here is a graph of $f$ with the three tangent lines below.
Suppose that $L(x)=f(1)+f^{\prime}(1)(x-1)$ is the linear approximation of $f(x)$ around $x=1$. Here is a picture of the graph of $f^{\prime}(x)$ and the graph of $L^{\prime}(x)$ below.
Here $f^{\prime}(1)=10$ and so $L^{\prime}(x)=10$ for all $x$.
Compare your answer to b with the bound from Taylor's theorem.
The bound from Taylor's theorem requires us to find $M$, an upper bound for $\left|f^{\prime \prime}(x)\right|$ when $1 \leq x \leq 1.1$. Since $f^{\prime}$ is steepest when $x=1$, $\left|f^{\prime \prime}(x)\right| \leq\left|f^{\prime \prime}(1)\right| \approx 3$ when $1 \leq x \leq 1.1$. Taylor's theorem then gives the error bound $\frac{1}{2} \cdot 3 \cdot(0.1)^{2}=0.015$, which is exactly what we obtained in (b).
5
138 | Mathematics | 18.02 | Calculus II | Prerequisites: 18.01 | Corequisites: None | Final Exam | Double Integrals | Question 10 | 1.8% of total grade | Text
Set up the integral $\iint_R f(x, y) d A$ where $R$ is the region bounded by the four curves $x^2 y=4, x^2 y=9, \frac{y}{x}=1$, and $\frac{y}{x}=2$ as a double integral in the variables $u=x^2 y$ and $v=\frac{y}{x}$. (Your answer should be completely ready to integrate, once the function $f$ is given.)
Note: the inverse transformation is given by $\quad x=u^{\frac{1}{3}} v^{-\frac{1}{3}}, \quad y=u^{\frac{1}{3}} v^{\frac{2}{3}}$.
Expression
$$
\iint_R f(x, y)\, d A=\int_1^2 \int_4^9 f\left(u^{1 / 3} v^{-1 / 3}, u^{1 / 3} v^{2 / 3}\right)\left(\frac{1}{3} u^{-1 / 3} v^{-2 / 3}\right) d u\, d v
$$
$\iint_R f d A=\int_0^2 \int_{x^2}^{2 \sqrt{2 x}} f(x, y) d y d x$.
Sketch the region $R$.
The sketch is below.
$\iint_R f d A=\int_0^2 \int_{x^2}^{2 \sqrt{2 x}} f(x, y) d y d x$.
Rewrite the double integral as an iterated integral with the order interchanged.
$$
R=\left\{\begin{array}{l}
y^2 / 8 \leq x \leq \sqrt{y} \\
0 \leq y \leq 4
\end{array} \Rightarrow \iint_R f d A=\int_0^4 \int_{y^2 / 8}^{\sqrt{y}} f(x, y) d x d y\right.
$$
Using the coordinate change $u=x y, v=y / x$, set up and evaluate an iterated integral for the moment of inertia around the $z$-axis of the region bounded by the hyperbola $x y=1$, the $x$-axis, and the two lines $x=1$ and $x=2$. Choose the order of integration which makes the limits simplest.
The inertia is
$$
\begin{aligned}
\iint_{R}\left(x^{2}+y^{2}\right) d A &=\iint_{S}\left(\frac{u}{v}+u v\right) \frac{d A}{2 v} \\
&=\frac{1}{2} \int_{0}^{1} \int_{u / 4}^{u}\left(\frac{u}{v^{2}}+u\right) d v d u \\
&=\frac{1}{2} \int_{0}^{1}\left(\frac{3}{4} u^{2}+3\right) d u \\
&=\frac{13}{8} .
\end{aligned}
$$
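The same number can be checked by integrating directly in $(x, y)$ coordinates (a sketch, assuming scipy):

from scipy.integrate import dblquad

# Region: 1 <= x <= 2, 0 <= y <= 1/x; integrand x^2 + y^2
val, err = dblquad(lambda y, x: x**2 + y**2, 1, 2, lambda x: 0.0, lambda x: 1 / x)
print(val, 13 / 8)   # both 1.625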
6
84 | EECS | 6.390 | Introduction to Machine Learning | Prerequisites: 6.1010/6.1210, 18.06/18.C06 | Corequisites: None | Exercise 12 | Q-Learning | Question 1aii | 0.01157407407% of total grade | Text
Let's simulate the Q-learning algorithm! Assume there are states $(0,1,2,3)$ and actions ('b', 'c'), and discount factor $\gamma=0.9$. Furthermore, assume that all the $\mathrm{Q}$ values are initialized to 0 (for all state-action pairs) and that the learning rate $\alpha=0.5$.
Experience is represented as a list of 4-element tuples: the $t$ th element of the experience corresponds to a record of experience at time $t:\left(s_{t}, a_{t}, s_{t+1}, r_{t}\right)$ (state, action, next state, reward).
After each step $t$, indicate what update $Q\left(s_{t}, a_{t}\right) \leftarrow q$ will be made by the Q learning algorithm based on $\left(s_{t}, a_{t}, s_{t+1}, r_{t}\right)$. You will want to keep track of the overall table $Q\left(s_{t}, a_{t}\right)$ as these updates take place, spanning the multiple parts of this question.
As a reminder, the Q-learning update formula is the following:
$$
Q(s, a) \leftarrow(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right)
$$
You are welcome to do this problem by hand, by drawing a table specifying $Q(s, a)$ for all possible $s$ and $a$. Alternatively, you may write a program which takes in the following history of experience:
experience = [(0, 'b', 2, 0), #t = 0
(2, 'b', 3, 0),
(3, 'b', 0, 2),
(0, 'b', 2, 0), #t = 3
(2, 'b', 3, 0),
(3, 'c', 0, 2),
(0, 'c', 1, 0), #t = 6
(1, 'b', 0, 1),
(0, 'b', 2, 0),
(2, 'c', 3, 0), #t = 9
(3, 'c', 0, 2),
(0, 'c', 1, 0)]
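A minimal program of that sort (a sketch in plain Python; variable names are ours):

from collections import defaultdict

gamma, alpha = 0.9, 0.5
actions = ('b', 'c')
Q = defaultdict(float)   # Q[(s, a)], all initialized to 0

for t, (s, a, s2, r) in enumerate(experience):
    target = r + gamma * max(Q[(s2, ap)] for ap in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    print(f"t={t}: Q({s},'{a}') <- {Q[(s, a)]}")

Running it shows, e.g., that the t = 0 tuple (0, 'b', 2, 0) updates Q(0, 'b') and leaves it at 0.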
What is the action in the state-action pair that is updated?
Multiple Choice
b.
Since action b was used in this experience, we update the Q value for the state-action pair with action b.
Let's simulate the Q-learning algorithm! Assume there are states $(0,1,2,3)$ and actions ('b', 'c'), and discount factor $\gamma=0.9$. Furthermore, assume that all the $\mathrm{Q}$ values are initialized to 0 (for all state-action pairs) and that the learning rate $\alpha=0.5$.
Experience is represented as a list of 4-element tuples: the $t$ th element of the experience corresponds to a record of experience at time $t:\left(s_{t}, a_{t}, s_{t+1}, r_{t}\right)$ (state, action, next state, reward).
After each step $t$, indicate what update $Q\left(s_{t}, a_{t}\right) \leftarrow q$ will be made by the Q learning algorithm based on $\left(s_{t}, a_{t}, s_{t+1}, r_{t}\right)$. You will want to keep track of the overall table $Q\left(s_{t}, a_{t}\right)$ as these updates take place, spanning the multiple parts of this question.
As a reminder, the Q-learning update formula is the following:
$$
Q(s, a) \leftarrow(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right)
$$
You are welcome to do this problem by hand, by drawing a table specifying $Q(s, a)$ for all possible $s$ and $a$. Alternatively, you may write a program which takes in the following history of experience:
experience = [(0, 'b', 2, 0), #t = 0
(2, 'b', 3, 0),
(3, 'b', 0, 2),
(0, 'b', 2, 0), #t = 3
(2, 'b', 3, 0),
(3, 'c', 0, 2),
(0, 'c', 1, 0), #t = 6
(1, 'b', 0, 1),
(0, 'b', 2, 0),
(2, 'c', 3, 0), #t = 9
(3, 'c', 0, 2),
(0, 'c', 1, 0)]
What is the action of the state-action pair that is updated?
b.
Since action $b$ was used in this experience, we update the Q value for action b.
Let's simulate the Q-learning algorithm! Assume there are states $(0,1,2,3)$ and actions ('b', 'c'), and discount factor $\gamma=0.9$. Furthermore, assume that all the $\mathrm{Q}$ values are initialized to 0 (for all state-action pairs) and that the learning rate $\alpha=0.5$.
Experience is represented as a list of 4-element tuples: the $t$ th element of the experience corresponds to a record of experience at time $t:\left(s_{t}, a_{t}, s_{t+1}, r_{t}\right)$ (state, action, next state, reward).
After each step $t$, indicate what update $Q\left(s_{t}, a_{t}\right) \leftarrow q$ will be made by the Q learning algorithm based on $\left(s_{t}, a_{t}, s_{t+1}, r_{t}\right)$. You will want to keep track of the overall table $Q\left(s_{t}, a_{t}\right)$ as these updates take place, spanning the multiple parts of this question.
As a reminder, the Q-learning update formula is the following:
$$
Q(s, a) \leftarrow(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right)
$$
You are welcome to do this problem by hand, by drawing a table specifying $Q(s, a)$ for all possible $s$ and $a$. Alternatively, you may write a program which takes in the following history of experience:
experience = [(0, 'b', 2, 0), #t = 0
(2, 'b', 3, 0),
(3, 'b', 0, 2),
(0, 'b', 2, 0), #t = 3
(2, 'b', 3, 0),
(3, 'c', 0, 2),
(0, 'c', 1, 0), #t = 6
(1, 'b', 0, 1),
(0, 'b', 2, 0),
(2, 'c', 3, 0), #t = 9
(3, 'c', 0, 2),
(0, 'c', 1, 0)]
Putting it all together, what is the new Q-value for the state-action pair that is updated?
0.
We will make use of the Q-learning update rule for an action $a$ that takes state $s$ to $s^{\prime}$ with reward $r$:
$$
\begin{gathered}
Q(s, a)=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) . \\
Q_{n e w}(0, b)=0.5 \cdot Q_{\text {old }}(0, b)+0.5\left(0+0.9 \cdot \max _{a^{\prime}} Q_{\text {old }}\left(2, a^{\prime}\right)\right)=0
\end{gathered}
$$
Let's simulate the Q-learning algorithm! Assume there are states $(0,1,2,3)$ and actions ('b', 'c'), and discount factor $\gamma=0.9$. Furthermore, assume that all the $\mathrm{Q}$ values are initialized to 0 (for all state-action pairs) and that the learning rate $\alpha=0.5$.
Experience is represented as a list of 4-element tuples: the $t$ th element of the experience corresponds to a record of experience at time $t:\left(s_{t}, a_{t}, s_{t+1}, r_{t}\right)$ (state, action, next state, reward).
After each step $t$, indicate what update $Q\left(s_{t}, a_{t}\right) \leftarrow q$ will be made by the Q learning algorithm based on $\left(s_{t}, a_{t}, s_{t+1}, r_{t}\right)$. You will want to keep track of the overall table $Q\left(s_{t}, a_{t}\right)$ as these updates take place, spanning the multiple parts of this question.
As a reminder, the Q-learning update formula is the following:
$$
Q(s, a) \leftarrow(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right)
$$
You are welcome to do this problem by hand, by drawing a table specifying $Q(s, a)$ for all possible $s$ and $a$. Alternatively, you may write a program which takes in the following history of experience:
experience = [(0, 'b', 2, 0), #t = 0
(2, 'b', 3, 0),
(3, 'b', 0, 2),
(0, 'b', 2, 0), #t = 3
(2, 'b', 3, 0),
(3, 'c', 0, 2),
(0, 'c', 1, 0), #t = 6
(1, 'b', 0, 1),
(0, 'b', 2, 0),
(2, 'c', 3, 0), #t = 9
(3, 'c', 0, 2),
(0, 'c', 1, 0)]
t: S A S' R
---------------
1: 2 'b' 3 0
The $t=1$ step of Q-learning will update the Q value of some state-action pair based on the experience tuple $\left(s_{1}, a_{1}, s_{2}, r_{1}\right)$.
After observing this tuple, what is the state of the state-action pair that is updated?
2.
Since we observe an experience in state 2, we update the Q value for state 2.
7
362 | EECS | 6.390 | Introduction to Machine Learning | Prerequisites: 6.1010/6.1210, 18.06/18.C06 | Corequisites: None | Problem Set 1 | NumPy | Question 1cvii | 0.01157407407% of total grade | Text
The shape of the resulting array is different depending on if you use indexing or slicing. Indexing refers to selecting particular elements of an array by using a single number (the index) to specify a particular row or column. Slicing refers to selecting a subset of the array by specifying a range of indices.
If you're unfamiliar with these terms, and the indexing and slicing rules of arrays, please see the indexing and slicing sections of this link: Numpy. Overview (Same as the Numpy Overview link from the introduction). You can also look at the official numpy documentation here.
In the following questions, let A = np.array([[5,7,10,14],[2,4,8,9]]). Tell us what the output would be for each of the following expressions. Use brackets [] as necessary. If the operation is invalid, write the python string "none".
Note: Remember that Python uses zero-indexing and thus starts counting from 0, not 1. This is different from R and MATLAB.
Reminder: A = np.array([[5,7,10,14],[2,4,8,9]])
A[:, 1:2]
Expression
[[7], [4]]
The shape of the resulting array is different depending on if you use indexing or slicing. Indexing refers to selecting particular elements of an array by using a single number (the index) to specify a particular row or column. Slicing refers to selecting a subset of the array by specifying a range of indices.
If you're unfamiliar with these terms, and the indexing and slicing rules of arrays, please see the indexing and slicing sections of this link: Numpy. Overview (Same as the Numpy Overview link from the introduction). You can also look at the official numpy documentation here.
In the following questions, let A = np.array([[5,7,10,14],[2,4,8,9]]). Tell us what the output would be for each of the following expressions. Use brackets [] as necessary. If the operation is invalid, write the python string "none".
Note: Remember that Python uses zero-indexing and thus starts counting from 0, not 1. This is different from R and MATLAB.
Reminder: A = np.array([[5,7,10,14],[2,4,8,9]])
A[:, 1]
[7, 4]
The shape of the resulting array is different depending on if you use indexing or slicing. Indexing refers to selecting particular elements of an array by using a single number (the index) to specify a particular row or column. Slicing refers to selecting a subset of the array by specifying a range of indices.
If you're unfamiliar with these terms, and the indexing and slicing rules of arrays, please see the indexing and slicing sections of this link: Numpy. Overview (Same as the Numpy Overview link from the introduction). You can also look at the official numpy documentation here.
In the following questions, let A = np.array([[5,7,10,14],[2,4,8,9]]). Tell us what the output would be for each of the following expressions. Use brackets [] as necessary. If the operation is invalid, write the python string "none".
Note: Remember that Python uses zero-indexing and thus starts counting from 0, not 1. This is different from R and MATLAB.
Reminder: A = np.array([[5,7,10,14],[2,4,8,9]])
A[1:,:2]
[[2, 4]]
The shape of the resulting array is different depending on if you use indexing or slicing. Indexing refers to selecting particular elements of an array by using a single number (the index) to specify a particular row or column. Slicing refers to selecting a subset of the array by specifying a range of indices.
If you're unfamiliar with these terms, and the indexing and slicing rules of arrays, please see the indexing and slicing sections of this link: Numpy. Overview (Same as the Numpy Overview link from the introduction). You can also look at the official numpy documentation here.
In the following questions, let A = np.array([[5,7,10,14],[2,4,8,9]]). Tell us what the output would be for each of the following expressions. Use brackets [] as necessary. If the operation is invalid, write the python string "none".
Note: Remember that Python uses zero-indexing and thus starts counting from 0, not 1. This is different from R and MATLAB.
Reminder: A = np.array([[5,7,10,14],[2,4,8,9]])
A[0:1,1:3]
[[7, 10]]
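The indexing-versus-slicing distinction behind these answers is easy to see interactively (a sketch assuming numpy):

import numpy as np

A = np.array([[5, 7, 10, 14], [2, 4, 8, 9]])
print(A[:, 1:2], A[:, 1:2].shape)      # [[7] [4]]  (2, 1) -- slicing keeps 2 dimensions
print(A[:, 1],   A[:, 1].shape)        # [7 4]      (2,)   -- indexing drops one
print(A[1:, :2], A[1:, :2].shape)      # [[2 4]]    (1, 2)
print(A[0:1, 1:3], A[0:1, 1:3].shape)  # [[7 10]]   (1, 2)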
8
10 | EECS | 6.100A | Introduction to Computer Science and Programming in Python | Prerequisites: None | Corequisites: None | Finger Exercise Lecture 8 | Classes | Question 1 | 1.428571429% of total grade | Text
In this problem, you will implement three classes according to the specification below: one Container class, one Stack class (a subclass of Container), and one Queue class (a subclass of Container). Our Container class will initialize an empty list. The two methods we will have are to calculate the size of the list and to add an element. The second method will be inherited by the two subclasses. We now want to create two subclasses of this generic Container class so that we can add more functionality -- the ability to remove elements from the list. A Stack and a Queue will add elements to the list in the same way, but will behave differently when removing an element.
A stack is a last in, first out data structure. Think of a stack of pancakes. As you make pancakes, you create a stack of them with older pancakes going on the bottom and newer pancakes on the top. As you start eating the pancakes, you pick one off the top so you are removing the newest pancake added to the stack. When implementing your Stack class, you will have to think about which end of your list contains the element that has been in the list the shortest amount of time. This is the element you will want to remove and return.
A queue is a first in, first out data structure. Think of a store checkout queue. The customer who has been in the line the longest gets the next available cashier. When implementing your Queue class, you will have to think about which end of your list contains the element that has been in the list the longest. This is the element you will want to remove and return.
class Container(object):
    """
    A container object is a list and can store elements of any type
    """
    def __init__(self):
        """
        Initializes an empty list
        """
        self.myList = []

    def size(self):
        """
        Returns the length of the container list
        """
        # Your code here

    def add(self, elem):
        """
        Adds the elem to one end of the container list, keeping the end
        you add to consistent. Does not return anything
        """
        # Your code here

class Stack(Container):
    """
    A subclass of Container. Has an additional method to remove elements.
    """
    def remove(self):
        """
        The newest element in the container list is removed
        Returns the element removed or None if the stack contains no elements
        """
        # Your code here
Programming
class Container(object):
    """
    A container object is a list and can store elements of any type
    """
    def __init__(self):
        """
        Initializes an empty list
        """
        self.myList = []

    def size(self):
        """
        Returns the length of the container list
        """
        # Your code here
        return len(self.myList)

    def add(self, elem):
        """
        Adds the elem to one end of the container list, keeping the end
        you add to consistent. Does not return anything
        """
        # Your code here
        self.myList.append(elem)

class Stack(Container):
    """
    A subclass of Container. Has an additional method to remove elements.
    """
    def remove(self):
        """
        The newest element in the container list is removed
        Returns the element removed or None if the stack contains no elements
        """
        # Your code here
        if self.size() == 0:
            return None
        return self.myList.pop()
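The Queue subclass is not shown in the provided solution; a sketch consistent with the specification above (first in, first out, so we pop from the front):

class Queue(Container):
    """
    A subclass of Container. Has an additional method to remove elements.
    """
    def remove(self):
        """
        The oldest element in the container list is removed
        Returns the element removed or None if the queue contains no elements
        """
        if self.size() == 0:
            return None
        return self.myList.pop(0)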
In this problem, we’ll explore an interesting kind of data structure called a \textit{cycle}. Cycles are like lists in that they represent an ordered collection of elements. Unlike lists, though, cycles have no beginning or end; the elements in a cycle repeat endlessly, with the first element following the last.
For this problem, we will represent cycles as a particular kind of linked list. Similarly to our representation from lab 9, we can represent regular linked lists as two-element Python lists, where the first element represents the first value in the linked list, and where the second element represents the remainder of the linked list (either as None, representing the empty list, or as another linked list). For example, we would represent a linked list containing 1, 1, 2, 3, and 5 (in that order) as: [1, [1, [2, [3, [5, None]]]]].
In our environment diagram notation, this linked list would look like the following.
A \textit{cycle} containing those same elements looks very similar, but with one key difference: the first node in the list follows from the last node, leading to the following circular structure.
Answer the questions on the following pages about various operations on cycles.
We would also like to be able to modify cycles. To this end, we will implement a function delete_node(node), which should mutate the cycle that the given node belongs to (and/or the node itself) such that the node is no longer part of the cycle and the node no longer points back into the cycle. For example, using the test cycle from the previous page, the following REPL transcript shows the desired behavior:
>>> x = next_node(test_cycle)
>>> y = next_node(x)
>>> value(x)
8
>>> delete_node(x)
>>> value(test_cycle)
4
>>> value(next_node(test_cycle))
15
>>> next_node(test_cycle) is y
True
>>> next_node(x) is y
False
Fill in the definition of delete_node below. You may assume that the given node is part of a well-formed cycle with at least 2 nodes. For full credit, your code should not use any built-in list manipulations, but rather should only use the helper functions defined throughout this problem (you may assume working versions of all helper functions from part 1, and a working version of prev_node).
def delete_node(node):
    set_next(prev_node(node), next_node(node))
    set_next(node, None)
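This relies on prev_node, which the problem says we may assume works; a minimal O(n) sketch using only the helpers above (walk around the cycle until the successor is the given node):

def prev_node(node):
    # Return the node whose successor is `node` in its cycle
    cur = node
    while next_node(cur) is not node:
        cur = next_node(cur)
    return cur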
In this problem set, we will be using a provided Node object in tree.py to represent trees.
The following simple tree can be initialized with the Node object as follows:
example_tree = Node(1, Node(2), Node(5, Node(7), Node(8)))
A brief explanation of the Node class is below.
• You can initialize a node with Node(value, left_child, right_child). value holds the value held in the node, left_child optionally holds the Node constructing the left subtree, and right_child does the same for the right subtree. If there is no subtree, either omit that parameter or pass in None.
• You can get the Node object holding the left or right subtree with get_left_child() or get_right_child() respectively. If there is no child this function returns None.
• You can get the value held by a Node with get_value().
We will practice initializing trees in this part. For the trees shown below, create objects accurately representing the data. Put them into the variables at the top of ps4a.py, named tree1, tree2, and tree3.
tree1 = None #TODO
tree2 = None #TODO
tree3 = None #TODO
tree1 = Node(8, Node(2, Node(1), Node(5)), Node(10)) #TODO
tree2 = Node(7, Node(2, Node(1), Node(5, Node(4), Node(6))), Node(9, Node(8), Node(10))) #TODO
tree3 = Node(5, Node(3, Node(2), Node(4)), Node(14, Node(12), Node(21, Node(19), Node(26)))) #TODO
In this problem, we’ll explore an interesting kind of data structure called a \textit{cycle}. Cycles are like lists in that they represent an ordered collection of elements. Unlike lists, though, cycles have no beginning or end; the elements in a cycle repeat endlessly, with the first element following the last.
For this problem, we will represent cycles as a particular kind of linked list. Similarly to our representation from lab 9, we can represent regular linked lists as two-element Python lists, where the first element represents the first value in the linked list, and where the second element represents the remainder of the linked list (either as None, representing the empty list, or as another linked list). For example, we would represent a linked list containing 1, 1, 2, 3, and 5 (in that order) as: [1, [1, [2, [3, [5, None]]]]].
In our environment diagram notation, this linked list would look like the following.
A \textit{cycle} containing those same elements looks very similar, but with one key difference: the first node in the list follows from the last node, leading to the following circular structure.
Answer the questions on the following pages about various operations on cycles.
Consider the following set of functions designed to operate on linked lists and/or cycles.
def value(node):
    # Return the value associated with the given node
    return node[0]

def next_node(node):
    # Return the node that follows from the given node in the cycle containing
    # that node
    return node[-1]

def set_value(node, val):
    # Mutate the given node such that its value is val
    node[0] = val

def set_next(node, target):
    # Mutate the first argument (a node) such that it is now followed by the
    # second argument (another node)
    node[-1] = target

def last(inp):
    # Return the last element in a non-cyclic linked list (not expected to work
    # for cycles), without mutating the given input.
    n = next_node(inp)
    if n is None:
        return inp
    return last(n)

def make_cycle(inp):
    # Create a cycle containing all of the elements from the given input (a
    # non-cyclic linked list), without mutating the given input.
    out = inp
    set_next(last(out), out)
    return out
Note that the is keyword can be used to perform an identity check. For example, x is y will evaluate to True if x and y refer to exactly the same object in memory (i.e., if they are aliases of each other).
Each of the functions on the facing page includes a comment at the top describing its intended behavior. Are all of these functions implemented in ways that are consistent with their stated behavior (from the comments)?
If not, please specify which functions are inconsistent with their comments, and briefly describe the issue with each:
No.
Most of these functions are indeed completely correct.
However, there is a slight issue with make_cycle. It produces the correct output, but it also mutates its input, which is inconsistent with the documentation.
9
615 | EECS | 6.390 | Introduction to Machine Learning | Prerequisites: 6.1010/6.1210, 18.06/18.C06 | Corequisites: None | Final Exam | Convolutional Neural Networks | Question 8h | 0.35% of total grade | Text
MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel $3 \times 3$ images of $2 \mathrm{D}$ tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one $2 \times 2$ filter. Let's help Rec win this competition.
If Rec instead labeled line-shaped pieces as "1" and corner-shaped pieces as "0" then what values of $\mathrm{w}$ and $\mathrm{b}$ of the output layer give perfect classification and outputs that are close to 0 for corners and close to 1 for lines?
Open
The same as above, with opposite signs.
MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel $3 \times 3$ images of $2 \mathrm{D}$ tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one $2 \times 2$ filter. Let's help Rec win this competition.
Rec labels corner-shaped tetris pieces as "1" and line-shaped tetris pieces as "0". Using this labeling, what values of $\mathrm{w}$ and $\mathrm{b}$ of the output layer give perfect classification and outputs that are close to 1 for corners and close to 0 for lines? (Assume the examples in (e) are representative of the entire dataset.)
$\sigma(2.95) \approx 0.95$, so we need the pre-sigmoid output to be at least $\sim 3$ for corners and at most $-3$ for lines. The value of a_sum is 0 for lines, while for corners the max is 2 and the min is 1. Therefore, the gap between the outputs at a_sum $=0$ and a_sum $=1$ needs to be 6, so we scale by 6 and shift by $-3$. Therefore, $w \geq 6$ with $b=-w/2$. Full credit will be given for "large and positive" for $w$ with $b=-w/2$.
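Plugging in the boundary case $w=6$, $b=-3$ shows the separation (a sketch in Python):

import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
w, b = 6, -3   # w >= 6 with b = -w/2, per the solution above
for a_sum, label in [(0, "line"), (1, "corner"), (2, "corner")]:
    print(label, round(sigmoid(w * a_sum + b), 4))
# line 0.0474, corner 0.9526, corner 0.9999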
MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel $3 \times 3$ images of $2 \mathrm{D}$ tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one $2 \times 2$ filter. Let's help Rec win this competition.
What are the dimensions of $\mathrm{w}$ and $\mathrm{b}$ for i) binary classification vs. ii) $k$-class classification?
For binary classification $\mathrm{w}$ is: $[1,1] \quad$ and $\mathrm{b}$ is: $[1]$.
For $k$-class classification $\mathrm{w}$ is: $[1, \mathrm{k}] \quad$ and $\mathrm{b}$ is: $[\mathrm{k}]$.
MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel $3 \times 3$ images of $2 \mathrm{D}$ tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one $2 \times 2$ filter. Let's help Rec win this competition.
Using your answers from above, write an expression for the gradient of the loss with respect to $\mathrm{w}$ and $\mathrm{b}$ of the output layer. You may express your answers in terms of a_sum.
$\frac{\partial \mathcal{L}}{\partial w}=a_{\text{sum}}(g-y)$.
$\frac{\partial \mathcal{L}}{\partial b}=(g-y)$.
10
388 | EECS | 6.390 | Introduction to Machine Learning | Prerequisites: 6.1010/6.1210, 18.06/18.C06 | Corequisites: None | Problem Set 2 | Regression | Question 1c | 0.03472222222% of total grade | Text
You are given the following data, where $d=1, n=4$.
$$
D=\{[[1], 2],[[2], 7],[[3],-3],[[4], 1]\}
$$
You want to use analytic linear regression to solve the problem.
Using Python and numpy (you might want to fire up a co-lab session), compute the $\theta, \theta_{0}$ that minimize MSE on this data. Provide your answer as a list.
As a reminder, you can use np.linalg.inv to take inverses in numpy.
Open
[-1.3, 5]
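A sketch of the computation with numpy (using the X and Y conventions set up in the parts below, where each data point is a column of X):

import numpy as np

X = np.array([[1, 2, 3, 4], [1, 1, 1, 1]])   # last row: the always-1 offset feature
Y = np.array([[2, 7, -3, 1]])
theta = np.linalg.inv(X @ X.T) @ X @ Y.T     # analytic solution (X X^T)^{-1} X Y^T
print(theta.ravel())                         # -> [-1.3  5. ]
print(np.mean((theta.T @ X - Y) ** 2))       # MSE -> 10.575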
You are given the following data, where $d=1, n=4$.
$$
D=\{[[1], 2],[[2], 7],[[3],-3],[[4], 1]\}
$$
You want to use analytic linear regression to solve the problem.
What is the $Y$ vector? Provide a list of lists that would be an argument to np.array().
[[2, 7, -3, 1]]
You are given the following data, where $d=1, n=4$.
$$
D=\{[[1], 2],[[2], 7],[[3],-3],[[4], 1]\}
$$
You want to use analytic linear regression to solve the problem.
What is the $X$ matrix? Remember to include an extra input dimension that always has the value 1. Provide a list of lists that would be an argument to np.array(), where each data point is a column.
[[1, 2, 3, 4], [1, 1, 1, 1]]
You are given the following data, where $d=1, n=4$.
$$
D=\{[[1], 2],[[2], 7],[[3],-3],[[4], 1]\}
$$
You want to use analytic linear regression to solve the problem.
What is the MSE of the hypothesis you found on the data (any answer within the right order of magnitude will be fine)?
10.575
11
21 | Mathematics | 18.200 | Principles of Discrete Applied Mathematics | Prerequisites: None | Corequisites: 18.C06 | Problem Set 4 | Generating Function | Question 3 | 2.037037037% of total grade | Text
Solve the following recurrence using generating functions:
$$
a_{n}=3 a_{n-1}+4 a_{n-2} \text { for } n \geq 2
$$
with the initial conditions $a_{0}=3, a_{1}=2$.
Open
We let
$$
A(x)=a_{0}+a_{1} x+a_{2} x^{2}+\ldots=\sum_{i=0}^{\infty} a_{i} x^{i}
$$
be the generating function for our sequence $\left\{a_{i}\right\}_{i \in \mathbb{N}}$. Then we get that
$$
A(x)=3 x A(x)+4 x^{2} A(x)-7 x+3.
$$
Rearranging the terms, we get that
$$
A(x)\left(1-3 x-4 x^{2}\right)=-7 x+3,
$$
or equivalently,
$$
A(x)=\frac{-7 x+3}{1-3 x-4 x^{2}}.
$$
We have the factorization $1-3 x-4 x^{2}=(1-4 x)(1+x)$, so we will get a partial fraction decomposition of the form
$$
A(x)=\frac{A}{1-4 x}+\frac{B}{1+x}.
$$
This tells us that $A+B=3$ and $A-4 B=-7$. Solving these equations, we get $A=1$ and $B=2$, so
$$
A(x)=\frac{1}{1-4 x}+\frac{2}{1+x}.
$$
Recalling that $\frac{1}{1-x}=\sum_{i=0}^{\infty} x^{i}$, we get that
$$
A(x)=\sum_{i=0}^{\infty}\left(4^{i}+2 \times(-1)^{i}\right) x^{i}
$$
Thus, we find that $a_{i}=4^{i}+2 \times(-1)^{i}$.
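A quick check of the closed form against the recurrence (a sketch in Python):

def a(n, memo={0: 3, 1: 2}):
    # a_n = 3 a_{n-1} + 4 a_{n-2}, with a_0 = 3, a_1 = 2
    if n not in memo:
        memo[n] = 3 * a(n - 1) + 4 * a(n - 2)
    return memo[n]

print(all(a(n) == 4**n + 2 * (-1)**n for n in range(20)))   # -> True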
Find the generating function of the following sequence $\left(a_{n}\right)_{n \geq 0}$, where $a_{0}=1$, $a_{1}=3$, and $a_{n}=6 a_{n-1}-6 a_{n-2}$ for $n \geq 2$. (You do not need to find a formula for $a_{n}$.)
We have that
$$
\begin{aligned}
A(x) & =a_{0}+a_{1} x+a_{2} x^{2}+a_{3} x^{3}+\ldots \\
x A(x) & =0+a_{0} x+a_{1} x^{2}+a_{2} x^{3}+\ldots \\
x^{2} A(x) & =0+0 x+a_{0} x^{2}+a_{1} x^{3}+\ldots \\
\left(1-6 x+6 x^{2}\right) A(x) & =a_{0}+\left(a_{1}-6 a_{0}\right) x=1-3 x \\
A(x) & =\frac{1-3 x}{1-6 x+6 x^{2}}
\end{aligned}
$$
Define the Tribonacci Sequence as $a_{1}=a_{2}=a_{3}=1$ and $a_{n}=a_{n-1}+a_{n-2}+a_{n-3}$ for $n \geq 4$. Prove, using strong induction, that $a_{n}<2^{n}$ for all $n>0$.
We prove, by strong induction, that $(*) a_{n}<2^{n}$.
The base cases for $n \leq 3$ are clearly true. Assume that $\left(^{*}\right)$ holds for $k<n$. Then,
$$
\begin{aligned}
a_{n} & =a_{n-1}+a_{n-2}+a_{n-3} \\
& <2^{n-1}+2^{n-2}+2^{n-3} \\
& =2^{n} \cdot\left(\frac{1}{2}+\frac{1}{4}+\frac{1}{8}\right) \\
& =\frac{7}{8} 2^{n} \\
& <2^{n}
\end{aligned}
$$
In this problem we will be interested in the recurrence
$$
x_{k}=3 x_{k-1}+\beta x_{k-2}
$$
where $\beta$ is a nonnegative parameter to be chosen later.
What is the largest value of $\beta$ for which the recurrence satisfies the growth condition
$$
x_{k} \leq C 4^{k}
$$
for any initial condition? Note: The constant $C$ can depend on the initial condition itself.
Using the expression for $A$ above, we compute the eigenvalues through the trace and determinant. In particular we have $\lambda_{1}+\lambda_{2}=3$ and $\lambda_{1} \lambda_{2}=-\beta$. Substituting for $\lambda_{2}$ we obtain
$$
\lambda_{1}\left(3-\lambda_{1}\right)=-\beta
$$
and using the quadratic equation we obtain
$$
\lambda_{1}=\frac{3 \pm \sqrt{9+4 \beta}}{2}
$$
Thus we get that $\beta \leq 4$ is a necessary condition (since otherwise $\lambda_{1}$ would be too big). We can check that for this value we get $\lambda_{2}=-1$, and thus neither eigenvalue is too big.
12
9 | Mathematics | 18.704 | Seminar in Algebra | Prerequisites: 18.701 | Corequisites: None | Problem Set 2 | Linear Representation | Question 2 | 1.5% of total grade | Text
Let $\rho: G \rightarrow \mathrm{GL}(V)$ be a representation and let
$$
V^{*}:=\left\{v^{*}: V \rightarrow \mathbb{C} \mid v^{*} \text { is linear }\right\}
$$
be the dual of $V$. For $x \in V, x^{*} \in V^{*}$, let $\left\langle x, x^{*}\right\rangle$ denote the value of the linear form $x^{*}$ at $x$. Show that there exists a unique linear representation $\rho^{*}: G \rightarrow \operatorname{GL}\left(V^{*}\right)$ such that
$$
\left\langle\rho_{s} x, \rho_{s}^{*} x^{*}\right\rangle=\left\langle x, x^{*}\right\rangle \quad \text { for } s \in G, x \in V, x^{*} \in V^{*} \text {. }
$$
Open
Define $\rho^{*}: G \rightarrow \mathrm{GL}\left(V^{*}\right)$ such that $\rho_{s}^{*} x^{*}=x^{*} \circ \rho_{s^{-1}}$.
Claim. $\rho^{*}$ is a representation.
Proof. $\rho_{s}^{*}$ is linear since the scalars can be factored out from the composition, and composition distributes over addition. $\rho_{s}^{*}$ is invertible since its inverse is $\left(\rho_{s}^{*}\right)^{-1} x^{*}=x^{*} \circ \rho_{s}$. Check:
$$
\begin{aligned}
\left(\left(\rho_{s}^{*}\right)^{-1} \rho_{s}^{*}\right) x^{*} & =\left(\rho_{s}^{*}\right)^{-1}\left(\rho_{s}^{*} x^{*}\right) \\
& =\left(\rho_{s}^{*}\right)^{-1}\left(x^{*} \circ \rho_{s^{-1}}\right) \\
& =x^{*} \circ \rho_{s^{-1}} \circ \rho_{s} \\
& =x^{*} \\
& =x^{*} \circ \rho_{s} \circ \rho_{s^{-1}} \\
& =\rho_{s}^{*}\left(\left(\rho_{s}^{*}\right)^{-1} x^{*}\right) \\
& =\left(\rho_{s}^{*}\left(\rho_{s}^{*}\right)^{-1}\right) x^{*}
\end{aligned}
$$
Thus, $\left(\rho_{s}^{*}\right)^{-1} \rho_{s}^{*}=\mathrm{Id}=\rho_{s}^{*}\left(\rho_{s}^{*}\right)^{-1}$, so $\rho_{s}^{*} \in \mathrm{GL}\left(V^{*}\right)$. The group operation is also preserved.
$$
\begin{aligned}
\left(\rho_{g}^{*} \rho_{h}^{*}\right) x^{*} & =\rho_{g}^{*}\left(x^{*} \circ \rho_{h^{-1}}\right) \\
& =x^{*} \circ \rho_{h^{-1}} \circ \rho_{g^{-1}} \\
& =x^{*} \circ\left(\rho_{h^{-1}} \rho_{g^{-1}}\right) \\
& =x^{*} \circ \rho_{(g h)^{-1}} \\
& =\rho_{g h}^{*} x^{*}
\end{aligned}
$$
Thus $\rho^{*}$ is a representation.
Claim. $\rho^{*}$ satisfies the given equation.
Proof.
$$
\begin{aligned}
\left\langle\rho_{s} x, \rho_{s}^{*} x^{*}\right\rangle & =\left(\rho_{s}^{*} x^{*}\right)\left(\rho_{s} x\right) \\
& =\left(x^{*} \circ \rho_{s^{-1}}\right)\left(\rho_{s} x\right) \\
& =x^{*} x \\
& =\left\langle x, x^{*}\right\rangle
\end{aligned}
$$
Claim. If there exists a representation that satisfies the equation, then it is unique.
Proof. Suppose that the representations $\rho^{*}$ and $\sigma^{*}$ both satisfy the equation. Then
$$
\left\langle\rho_{s} x, \rho_{s}^{*} x^{*}\right\rangle=\left\langle\rho_{s} x, \sigma_{s}^{*} x^{*}\right\rangle \quad \forall s \in G, x \in V, x^{*} \in V^{*}
$$
Let $x^{\prime}=\rho_{s} x$. Since $\rho_{s}$ is invertible, the equation becomes
$$
\left\langle x^{\prime}, \rho_{s}^{*} x^{*}\right\rangle=\left\langle x^{\prime}, \sigma_{s}^{*} x^{*}\right\rangle \quad \forall s \in G, x^{\prime} \in V, x^{*} \in V^{*}
$$
Evaluating the pairing,
$$
\left(\rho_{s}^{*} x^{*}\right) x^{\prime}=\left(\sigma_{s}^{*} x^{*}\right) x^{\prime} \quad \forall s \in G, x^{\prime} \in V, x^{*} \in V^{*}
$$
Since this holds for all $x^{\prime} \in V$, the functionals are equal:
$$
\rho_{s}^{*} x^{*}=\sigma_{s}^{*} x^{*} \quad \forall s \in G, x^{*} \in V^{*}
$$
Since this in turn holds for all $x^{*} \in V^{*}$, the maps are equal:
$$
\rho_{s}^{*}=\sigma_{s}^{*} \quad \forall s \in G
$$
And since it holds for all $s \in G$:
$$
\rho^{*}=\sigma^{*}
$$
Thus, the representations are equal, so a representation that satisfies the equation is unique.
Thus the defined $\rho^{*}$ is the unique representation which satisfies the equation.
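In matrix terms, writing functionals in the dual basis, this construction gives $[\rho_{s}^{*}]=([\rho_{s}]^{-1})^{T}$. A numerical illustration on a made-up example (a rotation representation of a cyclic group, assuming numpy):

import numpy as np

rho = np.array([[0.0, -1.0], [1.0, 0.0]])   # generator: rotation by 90 degrees
rho_star = np.linalg.inv(rho).T             # the dual representation's matrix

x = np.array([2.0, 3.0])       # x in V
xstar = np.array([5.0, -1.0])  # x* in V*, in the dual basis
# <rho_s x, rho*_s x*> = <x, x*>:
print((rho @ x) @ (rho_star @ xstar), x @ xstar)   # both 7.0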
Let $X$ be a finite set on which $G$ acts. Let $V$ be a vector space with a basis $\left(e_{x}\right)_{x \in X}$ indexed by the elements of $X$. For $g \in G$, let $\rho_{g}$ be the linear map of $V$ into $V$ which sends $e_{x}$ to $e_{g * x}$.
Prove that $g \mapsto \rho_{g}$ defines a representation $\rho$ of $G$. (That is, prove that the map $G \rightarrow \operatorname{GL}(V)$ given by $g \mapsto \rho_{g}$ is a group homomorphism.)
Proof. It suffices to check (i) $\rho_{g} \in G L(V)$ for all $g \in G$ and the homomorphism axioms, namely (ii) $\rho_{1_{G}}=1_{G L(V)}$ and (iii) $\rho_{a b}=\rho_{a} \rho_{b}$ for all $a, b \in G$.
Since $\rho_{g}$ permutes the set $E:=\left\{e_{x} \mid x \in X\right\}$, it maps $E$, which is a basis for $V$, to itself, which is again a basis for $V$. Hence it is invertible and this proves (i).
(ii) is trivial as $\rho_{1}$ acts on $E$, the basis, as the identity.
Finally, for (iii), it suffices to note that
$$
\rho_{a b}\left(e_{x}\right)=e_{(a b) * x}=e_{a *(b * x)}=\rho_{a}\left(e_{b * x}\right)=\rho_{a}\left(\rho_{b}\left(e_{x}\right)\right)=\left(\rho_{a} \rho_{b}\right)\left(e_{x}\right), \forall x \in X,
$$
which proves (iii) and thus finishes the proof.
Let $G$ be a finite abelian group and let $\rho: G \rightarrow \operatorname{GL}(V)$ be any complex representation ("complex representation" means that $V$ is a $\mathbb{C}$-vector space). Prove that there exists a basis $\mathcal{B}$ of $V$ such that for all $g \in G$, the matrix $[\rho(g)]_{\mathcal{B}}$ is diagonal.
Consider the corresponding $F G$-module $V$ with multiplication $v g=v(g \rho)$ for $v \in V, g \in G$. Since our field is $\mathbb{C}$ and $G$ is finite, we know by Theorem $8.7$ that we can decompose
$$
V=U_{1} \oplus \cdots \oplus U_{n},
$$
where each $U_{i}$ is an irreducible submodule of $V$.
By Theorem 9.5, we know that since $G$ is also abelian, each $U_{i}$ has dimension 1. If we let $U_{i}=\operatorname{span}\left(u_{i}\right)$ for each $i$, then from our definition of direct sum (2.9), we know that $\mathcal{B}=\left\{u_{1}, \ldots, u_{n}\right\}$ forms a basis of $V$.
Returning to our representation, we consider the matrix of the transformation $g \rho$ for an arbitrary $g \in G$ with respect to $\mathcal{B}$. For $u_{i} \in \mathcal{B}$, since $U_{i}$ is closed (stable), $u_{i}(g \rho)$ must be a vector also in $U_{i}$, and since $u_{i}$ is the basis vector, we can express the resulting vector as $\lambda_{i} u_{i}$. Hence the matrix is diagonal:
$$
[g \rho]_{\mathcal{B}}=\left[\begin{array}{lll}
\lambda_{1} & & \\
& \ddots & \\
& & \lambda_{n}
\end{array}\right].
$$
Let $G$ be a finite group and let $G$ act on itself by right multiplication. Let $V$ be a vector space with basis $\left(e_{x}\right)_{x \in G}$ indexed by the elements of $G$. For $g \in G$, let $\rho_{g}$ be the linear map of $V$ into $V$ which sends $e_{x}$ to $e_{x g}$. Recall that we proved on the previous problem set that the $\operatorname{map} \rho: G \rightarrow \operatorname{GL}(V)$ defined as $g \mapsto \rho_{g}$ is a representation of $G$. Prove that the $F G$-module corresponding to $\rho$ (in the sense of Theorem 4.4(1)) is isomorphic to the regular $F G$-module.
Let $\vartheta: V \rightarrow F G$ be the linear map such that $e_{x} \vartheta=x$ for all $x \in G$. This is clearly a bijection, since each basis element in $V$ gets sent to the corresponding basis element in $F G$. Furthermore, we have
$$
\left(e_{x} g\right) \vartheta=\rho_{g}\left(e_{x}\right) \vartheta=e_{x g} \vartheta=x g=\left(e_{x} \vartheta\right) g
$$
for all $g \in G$. Since $\rho_{g}$ and $\vartheta$ are linear, we have $(v g) \vartheta=(v \vartheta) g$ for all $g \in G, v \in V$, so $\vartheta$ is a bijective homomorphism, and $V$ is isomorphic to $F G$.
13
42 | EECS | 6.1220 | Design and Analysis of Algorithms | Prerequisites: 6.1210 | Corequisites: None | Problem Set 4 | Probability | Question 2c | 0.2727272727% of total grade | Text
Sophie Germain is organizing a dinner party where $n$ guests are seated around a round table with $N=2 n$ seats, numbered 1 through $N$ clockwise.
The guests arrive one by one, and each guest takes his or her seat before the next guest arrives. To optimize the amount of networking between the guests, Sophie has come up with an unusual way of seating them. Namely, to find his or her seat, each guest $i$ first receives a seat number $r_i$ chosen uniformly and independently at random. Then, if seat $r_i$ is not currently occupied, the guest sits there; otherwise, the guest walks clockwise around the table until he or she finds the first unoccupied seat, and sits there.
A block is a set of consecutive seats that are all occupied but with the seats before and after it being unoccupied. Let $p_{j, k}$ be the probability that after all the guests have taken their seats there is a block of length $k$ starting at seat $j$, i.e. that $\{j, \ldots, j+k-1\}$ is a block in the final configuration.
Show that $\operatorname{Pr}\left[E_{j, k}\right] \leq \frac{1}{c^k}$ for some constant $c>1$ independent of the problem parameters. Hint: Consider random variables $X_i$, for each guest $i$, where $X_i=1$ if $r_i \in\{j, \ldots, j+k-1\}$ and $X_i=0$ otherwise; and recall that $e>1$.
Open
Defining random variables as in the hint, we see that $\mathbb{E}\left[X_i\right]=\frac{k}{N}$ and thus letting $X=X_1+\ldots+X_n$ we have $\mathbb{E}[X]=\frac{n k}{N}=\frac{k}{2}$ by linearity of expectation. We thus want to bound the probability that
$$
\operatorname{Pr}[X \geq k]=\operatorname{Pr}[X \geq(1+1) \mathbb{E}[X]]
$$
We will use (the multiplicative form of) the Chernoff bound for that. Since $\beta=1$ here, either the form for $\beta \leq 1$ or the form for $\beta \geq 1$ applies. With this, we get
$$
\operatorname{Pr}[X \geq k] \leq e^{-k / 6}=\frac{1}{\left(e^{1 / 6}\right)^k}
$$
so we can use $c=e^{1 / 6}>1$.
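For reference, the multiplicative Chernoff bound being invoked is the standard form (the exact constants in the course's statement may differ slightly):
$$
\operatorname{Pr}[X \geq(1+\beta) \mu] \leq \exp \left(-\frac{\beta^{2} \mu}{2+\beta}\right), \quad \text { so with } \beta=1 \text { and } \mu=\mathbb{E}[X]=\frac{k}{2}, \quad \exp \left(-\frac{k / 2}{3}\right)=e^{-k / 6} .
$$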
Sophie Germain is organizing a dinner party where $n$ guests are seated around a round table with $N=2 n$ seats, numbered 1 through $N$ clockwise.
The guests arrive one by one, and each guest takes his or her seat before the next guest arrives. To optimize the amount of networking between the guests, Sophie has come up with an unusual way of seating them. Namely, to find his or her seat, each guest $i$ first receives a seat number $r_i$ chosen uniformly and independently at random. Then, if seat $r_i$ is not currently occupied, the guest sits there; otherwise, the guest walks clockwise around the table until he or she finds the first unoccupied seat, and sits there.
A block is a set of consecutive seats that are all occupied but with the seats before and after it being unoccupied. Let $p_{j, k}$ be the probability that after all the guests have taken their seats there is a block of length $k$ starting at seat $j$, i.e. that $\{j, \ldots, j+k-1\}$ is a block in the final configuration.
Let $E_{j, k}$ be the event that at least $k$ of the random numbers $r_1, \ldots, r_n$ given to the guests lie in the set $\{j, \ldots, j+k-1\}$. Argue that $p_{j, k} \leq \operatorname{Pr}\left[E_{j, k}\right]$.
Denote $B=\{j, \ldots, j+k-1\}$. Suppose that $B$ is a block in the final configuration. Then nobody who got an $r_i \notin B$ is sitting in any of the seats of $B$: otherwise they would have walked past seat $j-1$, which would mean someone was sitting there, whereas seat $j-1$ is empty in the final configuration and was thus empty during the entire seating process. Thus all the people sitting in the seats of $B$ had their $r_i$'s in $B$, which means that at least $k$ of the $r_i$'s were in the set $B$. So the event that $B$ is a block implies that the event $E_{j, k}$ happened, hence $p_{j, k} \leq \operatorname{Pr}\left[E_{j, k}\right]$.
Sophie Germain is organizing a dinner party where $n$ guests are seated around a round table with $N=2 n$ seats, numbered 1 through $N$ clockwise.
The guests arrive one by one, and each guest takes his or her seat before the next guest arrives. To optimize the amount of networking between the guests, Sophie has come up with an unusual way of seating them. Namely, to find his or her seat, each guest $i$ first receives a seat number $r_i$ chosen uniformly and independently at random. Then, if seat $r_i$ is not currently occupied, the guest sits there; otherwise, the guest walks clockwise around the table until he or she finds the first unoccupied seat, and sits there.
A block is a set of consecutive seats that are all occupied but with the seats before and after it being unoccupied. Let $p_{j, k}$ be the probability that after all the guests have taken their seats there is a block of length $k$ starting at seat $j$, i.e. that $\{j, \ldots, j+k-1\}$ is a block in the final configuration.
After all the guests have taken their seats, Sophie finally joins them and heads for her favorite seat (which of course is number 1). However, it is quite possible that by that time there is already somebody sitting there because nobody saved Sophie's favorite seat!
Show that if Sophie follows the same protocol for resolving this problem (i.e., walk clockwise until the first unoccupied seat), with probability at least $1-\frac{1}{N^2}$, the number of the seat she ends up taking will be $O(\log n)$.
Hint: Prove that with probability at least $1-\frac{1}{N^2}$, there is no block of length $C \cdot \log n$ at the table, for some sufficiently large constant $C$.
The hint asks us to prove a stronger claim than necessary, but in doing so, we'll have solved the original problem with nicer bounds.
Let $q_{j, k}$ denote the probability that there exists a block of length at least $k$ starting at seat $j$. Then,
$$
q_{j, k} \leq \sum_{r=k}^n p_{j, r} \leq \sum_{r=k}^n \frac{1}{c^r} \leq \frac{d}{c^k}
$$
by Union Bound and part (b), where $d=\frac{1}{1-1 / c}$ is a constant.
Let $s_k$ be the probability that there exists a block of length at least $k$ anywhere at the table. Taking another Union Bound over all possible start indices for the block, we have
$$
s_k \leq \sum_{j=1}^N q_{j, k}=N \frac{d}{c^k} .
$$
Plugging in $k=4 \log _{c} N=O(\log n)$ (or more precisely, $k=3 \log _{c} N+\log _{c} d$), we have
$$
s_{C \log n} \leq \frac{1}{N^2}
$$
for sufficiently large $C$.
Sophie Germain is organizing a dinner party where $n$ guests are seated around a round table with $N=2 n$ seats, numbered 1 through $N$ clockwise.
The guests arrive one by one, and each guest takes his or her seat before the next guest arrives. To optimize the amount of networking between the guests, Sophie has come up with an unusual way of seating them. Namely, to find his or her seat, each guest $i$ first receives a seat number $r_i$ chosen uniformly and independently at random. Then, if seat $r_i$ is not currently occupied, the guest sits there; otherwise, the guest walks clockwise around the table until he or she finds the first unoccupied seat, and sits there.
A block is a set of consecutive seats that are all occupied but with the seats before and after it being unoccupied. Let $p_{j, k}$ be the probability that after all the guests have taken their seats there is a block of length $k$ starting at seat $j$, i.e. that $\{j, \ldots, j+k-1\}$ is a block in the final configuration.
Suppose $n=4$. Given the sequence $\left(r_1, r_2, r_3, r_4\right) = (1, 3, 1, 1)$, what is the size of the largest block?
4
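As a sanity check, here is a minimal simulation of the seating protocol (not part of the original; the function names are illustrative) confirming that $(1, 3, 1, 1)$ with $N=8$ yields a largest block of size 4:
# Minimal sketch (assumed): simulate the clockwise seating protocol.
def seat_guests(N, rs):
    occupied = [False] * (N + 1)  # seats numbered 1..N; index 0 unused
    for r in rs:
        s = r
        while occupied[s]:
            s = s % N + 1  # walk clockwise, wrapping from N back to 1
        occupied[s] = True
    return occupied

def largest_block(occupied):
    N = len(occupied) - 1
    best = 0
    for j in range(1, N + 1):
        length, s = 0, j
        while occupied[s] and length < N:
            length, s = length + 1, s % N + 1
        best = max(best, length)
    return best

print(largest_block(seat_guests(8, [1, 3, 1, 1])))  # prints 4 (seats 1-4)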
14
120EECS6.122
Design and Analysis of Algorithms
6.121NoneMidterm Exam 2
Linear Programming
3b2.5Text
Let $G=(V, E)$ be a weighted directed graph, where all weights are non-negative. Let $w_{u v}$ denote the weight of each edge $(u, v) \in E$. Alicia wrote down the following linear program with variables $\left\{x_{v}: v \in V\right\}$ that she claims computes the length of the shortest path from a source vertex $s$ to a target vertex $t$. (By convention, if there is no path from $s$ to $t$, the shortest path length is $\infty$.)
$$
\begin{array}{lll}
\operatorname{maximize} & x_{t} & \\
\text { subject to } & x_{v}-x_{u} \leq w_{u v} & \text { for all edges }(u, v) \in E \\
& x_{v} \geq 0 & \text { for all vertices } v \in V . \\
& x_{s}=0 &
\end{array}
$$
Is Alicia's claim correct? If yes, prove that the optimal solution to her linear program is indeed the length of a shortest path from $s$ to $t$. If not, correct the linear program and prove that your (corrected) LP computes the shortest path length from $s$ to $t$.
Open
Alicia's claim is correct. The correct linear program for the shortest path somewhat counter-intuitively maximizes $x_{t}$ subject to the given constraints. Think of each edge as a string of length $w_{u v}$. The constraints tell us how far the vertex $v$ can get from the source given the values $x_{u}$ of its in-neighbors and the weights $w_{u v}$ of its in-edges. The problem then is to see how far the target $t$ can get. A rigorous argument follows.
Let $d(u, v)$ denote the length of the shortest path between $u$ and $v$. Then, $x_{u}=d(s, u)$ for all $u$ is a feasible solution to the LP because $x_{v}=d(s, v) \geq 0$ and $x_{s}=d(s, s)=0$.
Moreover, $x_{v}-x_{u}=d(s, v)-d(s, u) \leq w_{u v}$ for every edge $(u, v)$, because otherwise following a shortest path $s \longrightarrow u$ and then the edge $(u, v)$ would give a path to $v$ shorter than $d(s, v)$. As this gives a feasible solution where $x_{t}=d(s, t)$ and this is a maximization LP, we have shown that the optimal $x_{t} \geq d(s, t)$.
Now, we show that every feasible solution satisfies $x_{t} \leq d(s, t)$. Let $v_{1}, v_{2}, \ldots, v_{n}$ be a shortest path from $s$ to $t$ (so $v_{1}=s$ and $v_{n}=t$); then
$$
\begin{gathered}
x_{t}=x_{t}-x_{s}=x_{v_{n}}-x_{v_{1}} \\
=\left(x_{v_{n}}-x_{v_{n-1}}\right)+\left(x_{v_{n-1}}-x_{v_{n-2}}\right)+\cdots+\left(x_{v_{2}}-x_{v_{1}}\right) \\
\leq w_{v_{n-1} v_{n}}+w_{v_{n-2} v_{n-1}}+\cdots+w_{v_{1} v_{2}} \\
=d(s, t)
\end{gathered}
$$
Hence every feasible solution has $x_{t} \leq d(s, t)$. Therefore, $d(s, t)$ is the optimal value of the LP.
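To make the argument concrete, here is a small sanity check of Alicia's LP (not part of the original) on a toy graph, using scipy.optimize.linprog; the vertex numbering and edge set are made-up illustrative data.
import numpy as np
from scipy.optimize import linprog

edges = {(0, 1): 2.0, (1, 2): 3.0, (0, 2): 7.0}  # hypothetical example graph
n, s, t = 3, 0, 2

# One inequality row per edge (u, v): x_v - x_u <= w_uv.
A_ub = np.zeros((len(edges), n))
b_ub = np.zeros(len(edges))
for i, ((u, v), w) in enumerate(edges.items()):
    A_ub[i, v], A_ub[i, u], b_ub[i] = 1.0, -1.0, w

c = np.zeros(n)
c[t] = -1.0  # maximize x_t by minimizing -x_t
bounds = [(0, 0) if v == s else (0, None) for v in range(n)]  # x_s = 0, x_v >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x[t])  # 5.0, the shortest path length s -> 1 -> t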
You are given a weighted directed graph $G=(V, E, w)$ where each node $v$ has a color v.color (which is part of the input). Assume that there are $k$ possible colors and that the weights can be both positive and negative. Describe an algorithm that computes the length of the shortest path from a designated source $s$ to a given destination $t$, where every time the path repeats colors, you incur a cost of $\ell$. Here, we say that a path repeats colors if two consecutive nodes in the path have the same color. So, for example, going RED, BLUE, RED does not repeat colors but going RED, BLUE, BLUE does. You can assume that there is at least one path from $s$ to $t$.
For full credit, provide a short description of the algorithm and an analysis of its run time. The runtime should be expressed in terms of $|V|,|E|$ and/or $k$. Faster algorithms will receive more credit. You can invoke any algorithm discussed in lecture, recitation or p-sets.
Add $\ell$ to the weight of every edge whose endpoints have the same color, then run Bellman-Ford from $s$. The cost is $O(|V| \cdot|E|)$.
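A minimal sketch of this solution (assumed, not from the original; the function name and graph encoding are illustrative), assuming there are no negative cycles:
def shortest_with_color_penalty(n, edges, color, l, s, t):
    # Reweight: an edge between same-colored nodes costs l extra.
    reweighted = [(u, v, w + (l if color[u] == color[v] else 0))
                  for (u, v, w) in edges]
    INF = float("inf")
    dist = [INF] * n
    dist[s] = 0
    # Bellman-Ford: n - 1 rounds of relaxing every edge.
    for _ in range(n - 1):
        for (u, v, w) in reweighted:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist[t]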
Consider a directed graph $G$ with non-negative weights with two vertices $s$ and $t$ connected via a shortest path $p$ which goes through vertices $a$ and $b$. That is, the path $p$ starts at $s$, eventually reaches $a$, then $b$ and ends at $t$. Which of the following sub-paths are shortest paths as well?
$\mathrm{a} \quad s \rightarrow a$.
$\mathrm{b} \quad a \rightarrow b$.
$\mathrm{c} \quad s \rightarrow b$.
$\mathrm{d} \quad b \rightarrow t$.
$\mathrm{e} \quad a \rightarrow t$.
All of the above.
Indeed, any sub-path of a shortest path is a shortest path.
Please select True or False for the following.
Suppose that an undirected graph $G=(V, E)$ with positive edge weights has a minimum spanning tree of weight $W$. For all pairs of nodes $u, v \in V$ the shortest path between $u$ and $v$ in $G$ has length less than or equal to $W$.
True. We can walk from any node to any other node using only MST edges. Since all edge weights are positive, the total weight of the tree path we walk is at most the weight $W$ of the whole MST, so the shortest path between $u$ and $v$ has length at most $W$.
15
112EECS6.411
Representation, Inference, and Reasoning in AI
6.1010, 6.1210, 18.600
NoneProblem Set 3
Propositional Proof
4bii0.2232142857Text
There are three suspects for a murder: Adams, Brown, and Clark.
1. Adams says "I didn't do it. The victim was an old acquaintance of Brown's. But Clark hated him."
2. Brown states "I didn't do it. I didn't know the guy. Besides I was out of town all week."
3. Clark says "I didn't do it. I saw both Adams and Brown in town around the victim that day; one of them must have done it."
4. We know that exactly one of the suspects is guilty.
Assume that the two innocent people are telling the truth, but that the guilty people might not be. So, the statements from the suspects can be encoded as "If suspect_is_innocent, then some other facts are true".
Let the propositional variables have the following definitions:
\begin{itemize}
\item $A=$ Adams is innocent
\item $B=$ Brown is innocent
\item $C=$ Clark is innocent
\item $X=$ Brown knew the victim
\item $Y=$ Brown was out of town
\item $Z=$ Adams was out of town
\item $W=$ Clark hated the victim
\end{itemize}
Now, enter the steps in a valid resolution proof. Each step will indicate the indices of two parent clauses (as integers) and the resolvent clause (as a list of literal strings), e.g. [3, 4, ['~X', 'A']]. The resolvent clause in each of the steps entered can be used in subsequent steps by using its index, starting with 13. Indicate a contradiction by entering an empty list for the clause, e.g. [1, 2, []]
The entries below illustrate the format; the number of required steps in the proof is not necessarily as indicated.
proof = [
[0, 0, ['A']], # 13
[0, 0, ['B']], # 14
[0, 0, ['C']], # 15
[0, 0, ['D']], # 16
[0, 0, ['E']], # 17
[0, 0, []], # 18
]
Open
proof = [[3, 12, ["~X"]], [1, 13, ["~A"]], [4, 12, ["Y"]], [5, 15, ["~C"]], [8, 16, ["A"]], [14, 17, []]]
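A mechanical check of this proof (not part of the original; the helper names are illustrative). Clauses 1-11 are from the table in the next part, and clause 12 = ['B'] is the negated goal added for the refutation:
def resolvents(c1, c2):
    # All clauses obtainable by resolving c1 with c2 on a complementary pair.
    out = []
    for lit in c1:
        neg = lit[1:] if lit.startswith("~") else "~" + lit
        if neg in c2:
            out.append(frozenset((c1 - {lit}) | (c2 - {neg})))
    return out

clauses = {1: {"~A", "X"}, 2: {"~A", "W"}, 3: {"~B", "~X"}, 4: {"~B", "Y"},
           5: {"~C", "~Y"}, 6: {"~C", "~Z"}, 7: {"~C", "~A", "~B"},
           8: {"A", "C"}, 9: {"A", "B"}, 10: {"B", "C"}, 11: {"A", "B", "C"},
           12: {"B"}}
proof = [[3, 12, ["~X"]], [1, 13, ["~A"]], [4, 12, ["Y"]],
         [5, 15, ["~C"]], [8, 16, ["A"]], [14, 17, []]]
for step, (i, j, res) in enumerate(proof, start=13):
    assert frozenset(res) in resolvents(clauses[i], clauses[j]), step
    clauses[step] = set(res)
print("empty clause derived:", clauses[18] == set())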
There are three suspects for a murder: Adams, Brown, and Clark.
1. Adams says "I didn't do it. The victim was an old acquaintance of Brown's. But Clark hated him."
2. Brown states "I didn't do it. I didn't know the guy. Besides I was out of town all week."
3. Clark says "I didn't do it. I saw both Adams and Brown in town around the victim that day; one of them must have done it."
4. We know that exactly one of the suspects is guilty.
Assume that the two innocent people are telling the truth, but that the guilty people might not be. So, the statements from the suspects can be encoded as "If suspect_is_innocent, then some other facts are true".
Let the propositional variables have the following definitions:
\begin{itemize}
\item $A=$ Adams is innocent
\item $B=$ Brown is innocent
\item $C=$ Clark is innocent
\item $X=$ Brown knew the victim
\item $Y=$ Brown was out of town
\item $Z=$ Adams was out of town
\item $W=$ Clark hated the victim
\end{itemize}
We will continue our investigation of the Adams, Brown and Clark affair. In this episode, we prove conclusively that Brown is guilty. We will do that by using resolution refutation starting from the set of clauses that we derived in the previous episode.
\begin{tabular}{|l|l|}
\hline 1 & $\neg A \vee X$ \\
\hline 2 & $\neg A \vee W$ \\
\hline 3 & $\neg B \vee \neg X$ \\
\hline 4 & $\neg B \vee Y$ \\
\hline 5 & $\neg C \vee \neg Y$ \\
\hline 6 & $\neg C \vee \neg Z$ \\
\hline 7 & $\neg C \vee \neg A \vee \neg B$ \\
\hline 8 & $A \vee C$ \\
\hline 9 & $A \vee B$ \\
\hline 10 & $B \vee C$ \\
\hline 11 & $A \vee B \vee C$ \\
\hline
\end{tabular}
Note that we have dropped a duplicate clause that we derived from both the third and fourth of the original axioms.
In addition to the clauses derived from the original axioms, what additional clause do we need to add so as to be able to conclude that Brown is guilty using resolution refutation?
Enter clause 12 as a list of literal strings.
['B']
There are three suspects for a murder: Adams, Brown, and Clark.
1. Adams says "I didn't do it. The victim was an old acquaintance of Brown's. But Clark hated him."
2. Brown states "I didn't do it. I didn't know the guy. Besides I was out of town all week."
3. Clark says "I didn't do it. I saw both Adams and Brown in town around the victim that day; one of them must have done it."
4. We know that exactly one of the suspects is guilty.
Assume that the two innocent people are telling the truth, but that the guilty people might not be. So, the statements from the suspects can be encoded as "If suspect_is_innocent, then some other facts are true".
Let the propositional variables have the following definitions:
\begin{itemize}
\item $A=$ Adams is innocent
\item $B=$ Brown is innocent
\item $C=$ Clark is innocent
\item $X=$ Brown knew the victim
\item $Y=$ Brown was out of town
\item $Z=$ Adams was out of town
\item $W=$ Clark hated the victim
\end{itemize}
We can write down propositional logic axioms for each of the four statements defining this problem. For propositional resolution, we need to convert these sentences to CNF. We will ask you to convert one sentence at a time. Enter one CNF formula corresponding to the specified sentence in each of the answer spaces below.
Enter each CNF formula as a list of lists of literal strings. A literal string is either a propositional symbol, e.g. 'A', or the negation of a propositional symbol, e.g. '~A'. A typical clause will look like: ['A', '~B', '~C']. And a CNF formula is a list of clauses. Do not include any spaces in the strings.
The second axiom is: $B \Rightarrow(\neg X \wedge Y)$
Enter the CNF as a formula following the syntax described above.
[['~B', '~X'], ['~B', 'Y']]
There are three suspects for a murder: Adams, Brown, and Clark.
1. Adams says "I didn't do it. The victim was an old acquaintance of Brown's. But Clark hated him."
2. Brown states "I didn't do it. I didn't know the guy. Besides I was out of town all week."
3. Clark says "I didn't do it. I saw both Adams and Brown in town around the victim that day; one of them must have done it."
4. We know that exactly one of the suspects is guilty.
Assume that the two innocent people are telling the truth, but that the guilty people might not be. So, the statements from the suspects can be encoded as "If suspect_is_innocent, then some other facts are true".
Let the propositional variables have the following definitions:
\begin{itemize}
\item $A=$ Adams is innocent
\item $B=$ Brown is innocent
\item $C=$ Clark is innocent
\item $X=$ Brown knew the victim
\item $Y=$ Brown was out of town
\item $Z=$ Adams was out of town
\item $W=$ Clark hated the victim
\end{itemize}
We can write down propositional logic axioms for each of the four statements defining this problem. For propositional resolution, we need to convert these sentences to CNF. We will ask you to convert one sentence at a time. Enter one CNF formula corresponding to the specified sentence in each of the answer spaces below.
Enter each CNF formula as a list of lists of literal strings. A literal string is either a propositional symbol, e.g. 'A', or the negation of a propositional symbol, e.g. '~A'. A typical clause will look like: ['A', '~B', '~C']. And a CNF formula is a list of clauses. Do not include any spaces in the strings.
The first axiom is: $A \Rightarrow(X \wedge W)$
Enter the CNF as a formula following the syntax described above.
[['~A', 'X'], ['~A', 'W']]
16
16Mathematics18.03
Differential Equations
None18.02Problem Set 2
Complex Numbers
2b0.1608579088Text
Express the following complex numbers in polar form $r e^{i \theta}$ with $\theta \in(-\pi, \pi]$.
$1-i$.
Expression
The modulus is $|1-i|=\sqrt{1^{2}+(-1)^{2}}=\sqrt{2}$. Since $1-i$ lies in the fourth quadrant, the argument in $(-\pi, \pi]$ is $\arctan \frac{-1}{1}=-\frac{\pi}{4}$. So $1-i=\sqrt{2} e^{-i \frac{\pi}{4}}$.
Express the following complex numbers in polar form $r e^{i \theta}$ with $\theta \in(-\pi, \pi]$.
$1+i \sqrt{3}$.
The modulus is $|1+i \sqrt{3}|=\sqrt{1^{2}+(\sqrt{3})^{2}}=\sqrt{4}=2$. The argument is $\arctan \frac{\sqrt{3}}{1}=\frac{\pi}{3}$. So $1+i \sqrt{3}=2 e^{i \frac{\pi}{3}}$.
Compute the real and imaginary parts of the following complex numbers. Simplify as much as possible.
$1+e^{-\frac{\pi i}{6}}$.
$$
\begin{gathered}
1+e^{-\frac{\pi i}{6}}=1+\cos \left(-\frac{\pi}{6}\right)+i \sin \left(-\frac{\pi}{6}\right)=1+\frac{\sqrt{3}}{2}-i \frac{1}{2} \\
\operatorname{Re}\left(1+e^{-\frac{\pi i}{6}}\right)=1+\frac{\sqrt{3}}{2}, \quad \operatorname{Im}\left(1+e^{-\frac{\pi i}{6}}\right)=-\frac{1}{2}
\end{gathered}
$$
Answer the following questions without the use of a calculator or computer. Briefly explain your answers.
Determine the real and imaginary parts of $1 /\left(e^{j 3 \pi / 4}+\frac{1}{\sqrt{2}} e^{j \pi / 2}\right)$.
We can simplify the denominator by converting each of the complex exponentials to Cartesian form:
$$
\frac{1}{e^{j 3 \pi / 4}+\frac{1}{\sqrt{2}} e^{j \pi / 2}}=\frac{1}{-\frac{1}{\sqrt{2}}+\frac{j}{\sqrt{2}}+\frac{j}{\sqrt{2}}}=\frac{1}{\frac{-1+2 j}{\sqrt{2}}}=\frac{\sqrt{2}}{-1+2 j} .
$$
Then multiply by $\frac{-1-2 j}{-1-2 j}$ to make the denominator real:
$$
\left(\frac{\sqrt{2}}{-1+2 j}\right)\left(\frac{-1-2 j}{-1-2 j}\right)=-\frac{\sqrt{2}}{5}(1+2 j)
$$
Thus the real part is $-\frac{\sqrt{2}}{5}$ and the imaginary part is $-\frac{2 \sqrt{2}}{5}$.
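A quick numeric check (not part of the original) using Python's cmath:
import cmath
z = 1 / (cmath.exp(3j * cmath.pi / 4) + cmath.exp(1j * cmath.pi / 2) / cmath.sqrt(2))
print(z.real, z.imag)  # approx -0.2828 and -0.5657, i.e. -sqrt(2)/5 and -2*sqrt(2)/5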
17
32Mathematics18.2
Principles of Discrete Applied Mathematics
None18.C06Problem Set 6Linear Program2c0.6790123457Text
Consider the LP
$$
\begin{aligned}
\max & 4 x_{1}+x_{2} \\
\text { s.t. } & x_{1}-x_{2} \leq 2 \\
& x_{1}+x_{2} \leq 8 \\
& x_{1}, x_{2} \geq 0
\end{aligned}
$$
Write the dual LP for the original problem in $x_{1}, x_{2}$, and use complementary slackness to determine the optimal solution for this dual.
Expression
The dual LP is given by
$$
\begin{array}{cl}
\min & 2 y_{1}+8 y_{2} \\
\text { s.t. } & y_{1}+y_{2} \geq 4 \\
& -y_{1}+y_{2} \geq 1 \\
& y_{1}, y_{2} \geq 0 .
\end{array}
$$
The optimum value of the original problem will be achieved at one of the vertices of its feasible region. Testing all four vertices, one finds that the optimal value is 23, attained at $\left(x_{1}, x_{2}\right)=(5,3)$. There both primal constraints are tight ($x_{1}-x_{2}=2$ and $x_{1}+x_{2}=8$), so complementary slackness does not force either $y_{i}$ to be zero. Since $x_{1}$ and $x_{2}$ are nonzero, complementary slackness requires both dual constraints to be tight: $y_{1}+y_{2}=4$ and $-y_{1}+y_{2}=1$. Solving this gives us $y_{1}=3 / 2$ and $y_{2}=5 / 2$.
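As a sanity check, strong duality confirms this answer: the dual objective at $\left(y_{1}, y_{2}\right)=(3 / 2,5 / 2)$ equals the primal optimum,
$$
2 y_{1}+8 y_{2}=2 \cdot \frac{3}{2}+8 \cdot \frac{5}{2}=3+20=23=4 x_{1}+x_{2} \quad \text { at }\left(x_{1}, x_{2}\right)=(5,3) .
$$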
Consider the LP
$$
\begin{aligned}
\max & 4 x_{1}+x_{2} \\
\text { s.t. } & x_{1}-x_{2} \leq 2 \\
& x_{1}+x_{2} \leq 8 \\
& x_{1}, x_{2} \geq 0
\end{aligned}
$$
Convert the linear program to standard form, and construct a basic feasible solution.
Adding slack variables $s_{1}$ and $s_{2}$ to the first and second equations respectively, we get
$$
\begin{aligned}
\max & 4 x_{1}+x_{2} \\
\text { s.t. } & x_{1}-x_{2}+s_{1}=2 \\
& x_{1}+x_{2}+s_{2}=8 \\
& x_{1}, x_{2}, s_{1}, s_{2} \geq 0
\end{aligned}
$$
If we take the basis consisting of $s_{1}$ and $s_{2}$, then we get a basic feasible solution with $\left(x_{1}, x_{2}, s_{1}, s_{2}\right)=(0,0,2,8)$.
Consider the LP
$$
\begin{aligned}
\max & 4 x_{1}+x_{2} \\
\text { s.t. } & x_{1}-x_{2} \leq 2 \\
& x_{1}+x_{2} \leq 8 \\
& x_{1}, x_{2} \geq 0
\end{aligned}
$$
Draw a graphical representation of the feasible region of this LP (in terms of $x_{1}$ and $\left.x_{2}\right)$.
The region given by the constraints is the quadrilateral with vertices $(0,0),(0,8),(5,3)$, and $(2,0)$.
Consider the following linear program and its dual
$$
\begin{array}{ccrl}
\operatorname{maximize} & 6 x+6 y & \text { minimize } & a+2 b \\
\text { subject to } & x+2 y=1 & \text { subject to } & a+3 b \geq 6, \\
& 3 x+y=2, & & 2 a+b \geq 6, \\
& x, y \geq 0, & &
\end{array}
$$
Suppose we add the constraint $a+b \geq 4$ to the dual. Fill in the corresponding primal
$$
\begin{array}{r}
\text { maximize } \\
\text { subject to }
\end{array}
$$
$$
\begin{array}{cc}
\operatorname{minimize} & a+2 b \\
\text { subject to } & a+3 b \geq 6, \\
& 2 a+b \geq 6, \\
& a+b \geq 4,
\end{array}
$$
The new primal and dual are
$$
\begin{array}{ccrl}
\operatorname{maximize} & 6 x+6 y+4 z & \text { minimize } & a+2 b \\
\text { subject to } & x+2 y+z=1, & \text { subject to } & a+3 b \geq 6, \\
& 3 x+y+z=2, & & 2 a+b \geq 6, \\
& x, y, z \geq 0, & & a+b \geq 4,
\end{array}
$$
18
148Mathematics18.6
Probability and Random Variables
18.02NoneFinal Exam
Central Limit Theorem
5b1.25Text
In this problem, we'll try to approximate $\pi$ by throwing darts at a dartboard. To do this, suppose we have a square dartboard with sides of length 5. Inside it, we'll draw a circle with radius 1.
When I throw a dart, suppose that the spot where it hits is uniformly distributed over the dartboard. The probability that it lands inside the circle is thus given by
$$
\mathrm{P}(\text { lands in circle })=\frac{\text { Area }(\text { circle })}{\text { Area }(\text { dartboard })}=\frac{\pi}{25} .
$$
Use the Central Limit Theorem to approximate the probability that $Z \in[\pi-0.1, \pi+$ 0.1]. Write your answer in terms of $\Phi(x)$, the CDF of a standard normal.
Expression
Let $\sigma=\sqrt{\operatorname{Var}(Z)}$ (using the value we computed in part (a)). By the Central Limit Theorem, we can approximate $Z$ by a normal with parameters $\left(\mu, \sigma^{2}\right)$, where $\mu=\pi$; thus $(Z-\pi) / \sigma$ is approximately a standard normal. This gives us
$$
\begin{aligned}
P(Z \in[\pi-0.1, \pi+0.1]) & =P\left(\frac{Z-\pi}{\sigma} \in\left[\frac{-0.1}{\sigma}, \frac{0.1}{\sigma}\right]\right) \approx \Phi\left(\frac{0.1}{\sigma}\right)-\Phi\left(\frac{-0.1}{\sigma}\right) \\
& =2 \Phi\left(\frac{0.1}{\sigma}\right)-1.
\end{aligned}
$$
In this problem, we'll try to approximate $\pi$ by throwing darts at a dartboard. To do this, suppose we have a square dartboard with sides of length 5. Inside it, we'll draw a circle with radius 1.
When I throw a dart, suppose that the spot where it hits is uniformly distributed over the dartboard. The probability that it lands inside the circle is thus given by
$$
\mathrm{P}(\text { lands in circle })=\frac{\text { Area }(\text { circle })}{\text { Area }(\text { dartboard })}=\frac{\pi}{25} .
$$
Suppose I throw 1000 darts independently, and I let
$$
X_{i}= \begin{cases}1 & \text { if the } i^{\text {th }} \text { dart lands in the circle, and } \\ 0 & \text { otherwise. }\end{cases}
$$
My estimate of $\pi$ will be given by
$$
Z=\left(X_{1}+X_{2}+\cdots+X_{1000}\right) / 40
$$
Compute the expectation and variance of $Z$.
$$
E[Z]=\frac{1}{40} \sum_{i=1}^{1000} E\left[X_{i}\right]=\frac{1000}{40} \frac{\pi}{25}=\pi.
$$
$X_{i}$ is a Bernoulli random variable with parameter $\pi / 25$, so
$$
\operatorname{Var}\left(X_{i}\right)=\left(\frac{\pi}{25}\right)\left(1-\frac{\pi}{25}\right).
$$
Since the $X_{i}$ are independent, we have
$$
\begin{aligned}
\operatorname{Var}(Z) & =\operatorname{Var}\left(\frac{\sum_{i=1}^{1000} X_{i}}{40}\right)=\frac{1}{40^{2}} \sum_{i=1}^{1000} \operatorname{Var}\left(X_{i}\right) \\
& =\frac{1000}{40^{2}}\left(\frac{\pi}{25}\right)\left(1-\frac{\pi}{25}\right)=\frac{\pi}{40}\left(1-\frac{\pi}{25}\right).
\end{aligned}
$$
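A Monte Carlo sanity check of these two moments (not part of the original; the setup below is illustrative):
import numpy as np

rng = np.random.default_rng(0)
trials, n_darts = 2000, 1000
# Darts land uniformly on the 5-by-5 board, with the radius-1 circle at its center.
x = rng.uniform(-2.5, 2.5, size=(trials, n_darts))
y = rng.uniform(-2.5, 2.5, size=(trials, n_darts))
Z = (x**2 + y**2 <= 1.0).sum(axis=1) / 40  # the estimator from the problem
print(Z.mean())  # close to pi ~ 3.1416
print(Z.var())   # close to (pi/40)(1 - pi/25) ~ 0.0687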
In this problem, we'll try to approximate $\pi$ by throwing darts at a dartboard. To do this, suppose we have a square dartboard with sides of length 5. Inside it, we'll draw a circle with radius 1.
When I throw a dart, suppose that the spot where it hits is uniformly distributed over the dartboard. The probability that it lands inside the circle is thus given by
$$
\mathrm{P}(\text { lands in circle })=\frac{\text { Area }(\text { circle })}{\text { Area }(\text { dartboard })}=\frac{\pi}{25} .
$$
Suppose that I drew two overlapping radius 1 circles, $C_{1}$ and $C_{2}$, on the dartboard. Let
$$
Y_{i}= \begin{cases}1 & \text { if the } i^{\text {th }} \text { dart lands in } C_{1}, \text { and } \\ 0 & \text { otherwise }\end{cases}
$$
and let
$$
Z_{i}= \begin{cases}1 & \text { if the } i^{\text {th }} \text { dart lands in } C_{2}, \text { and } \\ 0 & \text { otherwise }\end{cases}
$$
Let $t=\operatorname{area}\left(C_{1} \cap C_{2}\right)$. Compute $\operatorname{Cov}\left(Y_{i}, Z_{i}\right)$.
Note that the variable $Y_{i} Z_{i}$ is 1 if the dart lands in both circles and 0 otherwise. It has expected value $t / 25$. We therefore have
$$
\operatorname{Cov}\left(Y_{i}, Z_{i}\right)=E\left[Y_{i} Z_{i}\right]-E\left[Y_{i}\right] E\left[Z_{i}\right]=\frac{t}{25}-\left(\frac{\pi}{25}\right)^{2}.
$$
In this problem, we'll try to approximate $\pi$ by throwing darts at a dartboard. To do this, suppose we have a square dartboard with sides of length 5. Inside it, we'll draw a circle with radius 1.
When I throw a dart, suppose that the spot where it hits is uniformly distributed over the dartboard. The probability that it lands inside the circle is thus given by
$$
\mathrm{P}(\text { lands in circle })=\frac{\text { Area }(\text { circle })}{\text { Area }(\text { dartboard })}=\frac{\pi}{25} .
$$
Is there some $t$ for which $Y_{i}$ and $Z_{i}$ are independent? If so, you now get two independent variables with just one dart. Does this mean that you can get as good an estimate as in parts (a) and (b) while only throwing 500 darts? And, if so, could you get the same quality estimate by drawing 1000 circles and just throwing 1 dart? Why or why not?
Yes, there is such a value. Indicator variables are independent when the associated events are independent, i.e., when
$$
P\left(\text { dart lands in } C_{1} \cap C_{2}\right)=P\left(\text { dart lands in } C_{1}\right) P\left(\text { dart lands in } C_{2}\right) \text {. }
$$
The left-hand side equals $t / 25$, and the right-hand side equals $(\pi / 25)^{2}$, so this occurs when $t=\pi^{2} / 25$. (One needs to check that it is possible to achieve this intersection. This is possible since we can achieve any $t$ between 0 and $\pi$, and $t \approx 0.395$.)
This actually does mean that you can get the same quality estimate by throwing only 500 darts. However, you can't just draw 1000 circles and throw 1 dart, because it won't be possible to make all 1000 events independent. You can't even do it with 3 circles. To see this, note that if you take three circles $C_{1}, C_{2}$, and $C_{3}$ and make
$$
\operatorname{area}\left(C_{1} \cap C_{2}\right)=\operatorname{area}\left(C_{1} \cap C_{3}\right)=\operatorname{area}\left(C_{2} \cap C_{3}\right)=t,
$$
this determines the area $A=\operatorname{area}\left(C_{1} \cap C_{2} \cap C_{3}\right)$. If the three corresponding events were independent, we'd need $\frac{A}{25}=\left(\frac{\pi}{25}\right)^{3}$, and the determined area does not satisfy this. (And, in fact, you can't even draw 4 circles such that the area of $C_{i} \cap C_{j}$ equals $t$ for all $i$ and $j$.)
19
422EECS6.39
Introduction to Machine Learning
6.1010/6.1210, 18.06/18.C06
NoneProblem Set 2Regression6a0.05208333333Text
We will now try to synthesize what we've learned in order to perform ridge regression on the DataCommons public health dataset that we explored in Lab 2. Unlike in Lab 2, where we did some simple linear regressions, here we now employ and explore regularization, with the goal of building a model which generalizes better (than without regularization) to unseen data.
The overall objective function for ridge regression is
$$
J_{\text {ridge }}\left(\theta, \theta_{0}\right)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}+\theta_{0}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
Remarkably, there is an analytical function giving $\Theta=\left(\theta, \theta_{0}\right)$ which minimizes this objective, given $X, Y$, and $\lambda$. But how should we choose $\lambda$?
To choose an optimum $\lambda$, we can use the following approach. Each particular value of $\lambda$ gives us a different linear regression model. And we want the best model: one which balances providing good predictions (fitting well to given training data) with generalizing well (avoiding overfitting training data). And as we saw in the notes on Regression, we can employ cross-validation to evaluate and compare different models.
Let us begin by implementing this algorithm for cross-validation:
CROSS-VALIDATE $(\mathcal{D}, k)$
1 divide $\mathcal{D}$ into $k$ chunks $\mathcal{D}_{1}, \mathcal{D}_{2}, \ldots \mathcal{D}_{k}$ (of roughly equal size)
2 for $i=1$ to $k$
3 train $h_{i}$ on $\mathcal{D} \backslash \mathcal{D}_{i}$ (withholding chunk $\mathcal{D}_{i}$ )
4 compute "test" error $\mathcal{E}_{i}\left(h_{i}\right)$ on withheld data $\mathcal{D}_{i}$
5 return $\frac{1}{k} \sum_{i=1}^{k} \mathcal{E}_{i}\left(h_{i}\right)$
Let's implement the cross-validation algorithm as the procedure cross_validate, which takes the following input arguments:
\begin{itemize}
\item $\mathrm{x}$ : the list of data points $(d \times n)$
\item $\mathrm{Y}$ : the true values of the responders $(1 \times n)$
\item n_splits: the number of chunks to divide the dataset into
\item lam: the regularization parameter
\item learning_algorithm: a function that takes $X$, $Y$, and lam, and returns th, th0
\item loss_function: a function that takes $X$, $Y$, th, and th0, and returns a $1 \times 1$ array
\end{itemize}
cross_validate should return a scalar, the cross-validation error of applying the learning algorithm on the list of data points.
Note that this is a generic version of cross-validation, that can be applied to any learning algorithm and any loss function. Later in this problem, we will use cross-validation specifically for ridge regression and mean square loss.
You have the following function available to you:
def make_splits(X, Y, n_splits):
'''
Splits the dataset into n_splits chunks, creating n_splits sets of
cross-validation data.
Returns a list of n_splits tuples (X_train, Y_train, X_test, Y_test).
For the ith returned tuple:
* X_train and Y_train include all data except the ith chunk, and
* X_test and Y_test are the ith chunk.
X : d x n numpy array (d = #features, n = #data points)
Y : 1 x n numpy array
n_splits : integer
'''
def cross_validate(X, Y, n_splits, lam,
learning_algorithm, loss_function):
pass
Programming
def cross_validate(X, Y, n_splits, lam,
learning_algorithm, loss_function):
test_errors = []
for (X_train, Y_train, X_test, Y_test) in make_splits(X, Y, n_splits):
th, th0 = learning_algorithm(X_train, Y_train, lam)
test_errors.append(loss_function(X_test, Y_test, th, th0))
return np.array(test_errors).mean()
We will now try to synthesize what we've learned in order to perform ridge regression on the DataCommons public health dataset that we explored in Lab 2. Unlike in Lab 2, where we did some simple linear regressions, here we now employ and explore regularization, with the goal of building a model which generalizes better (than without regularization) to unseen data.
The overall objective function for ridge regression is
$$
J_{\text {ridge }}\left(\theta, \theta_{0}\right)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}+\theta_{0}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
Remarkably, there is an analytical function giving $\Theta=\left(\theta, \theta_{0}\right)$ which minimizes this objective, given $X, Y$, and $\lambda$. But how should we choose $\lambda$?
To choose an optimum $\lambda$, we can use the following approach. Each particular value of $\lambda$ gives us a different linear regression model. And we want the best model: one which balances providing good predictions (fitting well to given training data) with generalizing well (avoiding overfitting training data). And as we saw in the notes on Regression, we can employ cross-validation to evaluate and compare different models.
Let us begin by implementing this algorithm for cross-validation:
CROSS-VALIDATE $(\mathcal{D}, k)$
1 divide $\mathcal{D}$ into $k$ chunks $\mathcal{D}_{1}, \mathcal{D}_{2}, \ldots \mathcal{D}_{k}$ (of roughly equal size)
2 for $i=1$ to $k$
3 train $h_{i}$ on $\mathcal{D} \backslash \mathcal{D}_{i}$ (withholding chunk $\mathcal{D}_{i}$ )
4 compute "test" error $\mathcal{E}_{i}\left(h_{i}\right)$ on withheld data $\mathcal{D}_{i}$
5 return $\frac{1}{k} \sum_{i=1}^{k} \mathcal{E}_{i}\left(h_{i}\right)$
Below, X and Y are sample data, and lams is a list of possible values of lambda. Write code to set errors as a list of corresponding cross-validation errors. Use the cross_validate function above to run cross-validation with three splits. Use the following functions (which we implement for you, per the specifications below) as the learning algorithm and loss function:
def ridge_analytic(X_train, Y_train, lam):
'''Applies analytic ridge regression on the given training data.
Returns th, th0.
X : d x n numpy array (d = # features, n = # data points)
Y : 1 x n numpy array
lam : (float) regularization strength parameter
th : d x 1 numpy array
th0 : 1 x 1 numpy array'''
def mse(x, y, th, th0):
'''Calculates the mean-squared loss of a linear regression.
Returns a scalar.
x : d x n numpy array
y : 1 x n numpy array
th : d x 1 numpy array
th0 : 1 x 1 numpy array'''
X = np.array([[4, 6, 8, 2, 9, 10, 11, 17],
[1, 1, 6, 0, 5, 8, 7, 9],
[2, 2, 2, 6, 7, 4, 9, 8],
[1, 2, 3, 4, 5, 6, 7, 8]])
Y = np.array([[1, 3, 3, 4, 7, 6, 7, 7]])
lams = [0, 0.01, 0.02, 0.1]
errors = [] # your code here
X = np.array([[4, 6, 8, 2, 9, 10, 11, 17],
[1, 1, 6, 0, 5, 8, 7, 9],
[2, 2, 2, 6, 7, 4, 9, 8],
[1, 2, 3, 4, 5, 6, 7, 8]])
Y = np.array([[1, 3, 3, 4, 7, 6, 7, 7]])
lams = [0, 0.01, 0.02, 0.1]
errors = [cross_validate(X, Y, 3, lam, ridge_analytic, mse) for lam in lams]
We will now try to synthesize what we've learned in order to perform ridge regression on the DataCommons public health dataset that we explored in Lab 2. Unlike in Lab 2, where we did some simple linear regressions, here we now employ and explore regularization, with the goal of building a model which generalizes better (than without regularization) to unseen data.
The overall objective function for ridge regression is
$$
J_{\text {ridge }}\left(\theta, \theta_{0}\right)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}+\theta_{0}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
Remarkably, there is an analytical function giving $\Theta=\left(\theta, \theta_{0}\right)$ which minimizes this objective, given $X, Y$, and $\lambda$. But how should we choose $\lambda$?
To choose an optimum $\lambda$, we can use the following approach. Each particular value of $\lambda$ gives us a different linear regression model. And we want the best model: one which balances providing good predictions (fitting well to given training data) with generalizing well (avoiding overfitting training data). And as we saw in the notes on Regression, we can employ cross-validation to evaluate and compare different models.
Let us begin by implementing this algorithm for cross-validation:
CROSS-VALIDATE $(\mathcal{D}, k)$
1 divide $\mathcal{D}$ into $k$ chunks $\mathcal{D}_{1}, \mathcal{D}_{2}, \ldots \mathcal{D}_{k}$ (of roughly equal size)
2 for $i=1$ to $k$
3 train $h_{i}$ on $\mathcal{D} \backslash \mathcal{D}_{i}$ (withholding chunk $\mathcal{D}_{i}$ )
4 compute "test" error $\mathcal{E}_{i}\left(h_{i}\right)$ on withheld data $\mathcal{D}_{i}$
5 return $\frac{1}{k} \sum_{i=1}^{k} \mathcal{E}_{i}\left(h_{i}\right)$
As the number of data points increases, we tend to see:
(a) The optimal $\lambda$ increase
(b) The optimal $\lambda$ decrease
(c) The minimum cross-validation error increase
(d) The minimum cross-validation error decrease
(b) The optimal $\lambda$ decrease
(d) The minimum cross-validation error decrease
We will now try to synthesize what we've learned in order to perform ridge regression on the DataCommons public health dataset that we explored in Lab 2. Unlike in Lab 2, where we did some simple linear regressions, here we now employ and explore regularization, with the goal of building a model which generalizes better (than without regularization) to unseen data.
The overall objective function for ridge regression is
$$
J_{\text {ridge }}\left(\theta, \theta_{0}\right)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}+\theta_{0}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
Remarkably, there is an analytical function giving $\Theta=\left(\theta, \theta_{0}\right)$ which minimizes this objective, given $X, Y$, and $\lambda$. But how should we choose $\lambda$?
To choose an optimum $\lambda$, we can use the following approach. Each particular value of $\lambda$ gives us a different linear regression model. And we want the best model: one which balances providing good predictions (fitting well to given training data) with generalizing well (avoiding overfitting training data). And as we saw in the notes on Regression, we can employ cross-validation to evaluate and compare different models.
Let us begin by implementing this algorithm for cross-validation:
CROSS-VALIDATE $(\mathcal{D}, k)$
1 divide $\mathcal{D}$ into $k$ chunks $\mathcal{D}_{1}, \mathcal{D}_{2}, \ldots \mathcal{D}_{k}$ (of roughly equal size)
2 for $i=1$ to $k$
3 train $h_{i}$ on $\mathcal{D} \backslash \mathcal{D}_{i}$ (withholding chunk $\mathcal{D}_{i}$ )
4 compute "test" error $\mathcal{E}_{i}\left(h_{i}\right)$ on withheld data $\mathcal{D}_{i}$
5 return $\frac{1}{k} \sum_{i=1}^{k} \mathcal{E}_{i}\left(h_{i}\right)$
Based on our observations, does regularization have a larger effect when we have more or less data?
Regularization is much more important when we have:
(a) more data
(b) less data
(c) around the same for both
(b) less data
20
303EECS6.39
Introduction to Machine Learning
6.1010/6.1210, 18.06/18.C06
NoneLab 12
Reinforcement Learning
2av0.04166666667Text
Now, we'll take a look at Q-learning in a simple 2D grid setting but with a single goal location. We'll adopt the same state space and action space as in MDP lab.
Specifically, our state space is a 10-by-10 grid, with the bottom-left state indexed as state $(0,0)$, and our four possible actions are moving North, South, East or West (note that if the robot tries to move off the board, it cannot; it just stays in the same state). Our single goal is at state $(3,6)$.
Remember that for Q-learning, the transition probabilities and reward values are not known in advance by the method-the agent just gets to observe state, action, and reward values as it moves through the domain.
Some notes (please read!):
\begin{itemize}
\item A new episode is started by sampling the first state uniformly at random.
\item The agent follows an epsilon-greedy policy with $\epsilon=0.1$.
\item Every action taken from the goal state leads to a zero-reward state that can never be escaped. Thus, to continue learning, we repeat the steps above. Note that we start a new/reset episode only after the agent reaches the goal state.
\item In the case of a tie in the value $\max _{a} Q(s, a)$ across actions $a$, we choose the "best" action randomly.
\item All transitions are noisy; that is, there is some non-zero probability of the agent moving somewhere different from its desired destination. For example, if the agent is in state $(0,0)$ and takes a "North" action, there is a non-zero chance that it actually ends up in state $(1,1)$.
\item Our $\gamma$ (discount factor) is set to $0.9$ and our $\alpha$ (learning rate) is set to $0.5$.
\end{itemize}
Note that the scale of the colors changes across the different plots, per the bar on the right of each plot.
At iteration 10,000, how close are the values plotted (our estimates of the true value of taking the best actions starting at the given state) to the actual value of taking the best actions starting from the given state?
In particular, what should the value of the bottom-right corner be? (To make this easier to think about, you can assume that the transitions are deterministic; that is, the robot always moves in the direction it is "aiming".)
Open
State $(9,0)$ is 12 steps away from the goal, and so, without any errors, the robot will reach the goal in 12 steps. Given the discount factor of $0.9$, if we get a reward of 100 after 12 steps, the state has a value of about $100 \cdot 0.9^{12} \approx 28$. The value in our table seems to be pretty close. It makes sense that it's slightly under, because we have a learning rate of $0.5$. (If our learning rate were 1, it'd be much closer to 28.)
Now, we'll take a look at Q-learning in a simple 2D grid setting but with a single goal location. We'll adopt the same state space and action space as in MDP lab.
Specifically, our state space is a 10-by-10 grid, with the bottom-left state indexed as state $(0,0)$, and our four possible actions are moving North, South, East or West (note that if the robot tries to move off the board, it cannot; it just stays in the same state). Our single goal is at state $(3,6)$.
Remember that for Q-learning, the transition probabilities and reward values are not known in advance by the method-the agent just gets to observe state, action, and reward values as it moves through the domain.
Some notes (please read!):
\begin{itemize}
\item A new episode is started by sampling the first state uniformly at random.
\item The agent follows an epsilon-greedy policy with $\epsilon=0.1$.
\item Every action taken from the goal state leads to a zero-reward state that can never be escaped. Thus, to continue learning, we repeat the steps above. Note that we start a new/reset episode only after the agent reaches the goal state.
\item In the case of a tie in the value $\max _{a} Q(s, a)$ across actions $a$, we choose the "best" action randomly.
\item All transitions are noisy; that is, there is some non-zero probability of the agent moving somewhere different from its desired destination. For example, if the agent is in state $(0,0)$ and takes a "North" action, there is a non-zero chance that it actually ends up in state $(1,1)$.
\item Our $\gamma$ (discount factor) is set to $0.9$ and our $\alpha$ (learning rate) is set to $0.5$.
\end{itemize}
Note that the scale of the colors changes across the different plots, per the bar on the right of each plot.
At iteration 1000, how many more times do you think the agent reached the goal state?
The value of the goal state is now something like 85. This would happen if it entered the goal state two more times. Why? Because after entering one time, as we saw above, the $Q$ value was 50. When it enters again, the new $Q$ value will be $0.5 \cdot 50+0.5 \cdot 100=75$. When it enters the third time, the new $Q$ value will be $0.5 \cdot 75+0.5 \cdot 100=87.5$.
Now, we'll take a look at Q-learning in a simple 2D grid setting but with a single goal location. We'll adopt the same state space and action space as in MDP lab.
Specifically, our state space is a 10-by-10 grid, with the bottom-left state indexed as state $(0,0)$, and our four possible actions are moving North, South, East or West (note that if the robot tries to move off the board, it cannot; it just stays in the same state). Our single goal is at state $(3,6)$.
Remember that for Q-learning, the transition probabilities and reward values are not known in advance by the method-the agent just gets to observe state, action, and reward values as it moves through the domain.
Some notes (please read!):
\begin{itemize}
\item A new episode is started by sampling the first state uniformly at random.
\item The agent follows an epsilon-greedy policy with $\epsilon=0.1$.
\item Every action taken from the goal state leads to a zero-reward state that can never be escaped. Thus, to continue learning, we repeat the steps above. Note that we start a new/reset episode only after the agent reaches the goal state.
\item In the case of a tie in the value $\max _{a} Q(s, a)$ across actions $a$, we choose the "best" action randomly.
\item All transitions are noisy; that is, there is some non-zero probability of the agent moving somewhere different from its desired destination. For example, if the agent is in state $(0,0)$ and takes a "North" action, there is a non-zero chance that it actually ends up in state $(1,1)$.
\item Our $\gamma$ (discount factor) is set to $0.9$ and our $\alpha$ (learning rate) is set to $0.5$.
\end{itemize}
Note that the scale of the colors changes across the different plots, per the bar on the right of each plot.
At iteration 500, why does the state at $(3,6)$ have value 50?
Recall that the learning rate is $0.5$! So the robot finally randomly entered the goal state once and did a Q-learning update based on the fact that it got a reward of 100. Since the old $Q$ value was 0, the updated value is $0.5 \cdot 0+0.5 \cdot 100=50$.
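A minimal sketch of the tabular update driving these numbers (not part of the original): $Q \leftarrow (1-\alpha) Q + \alpha\left(r + \gamma \max_{a'} Q(s', a')\right)$.
alpha, gamma = 0.5, 0.9
q = 0.0
for visit in range(1, 4):
    # Entering the goal gives r = 100; the successor is the absorbing
    # zero-reward state, so the bootstrap term gamma * max Q' is 0.
    q = (1 - alpha) * q + alpha * (100 + gamma * 0.0)
    print(visit, q)  # 1 50.0, 2 75.0, 3 87.5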
Now, we'll take a look at Q-learning in a simple 2D grid setting but with a single goal location. We'll adopt the same state space and action space as in MDP lab.
Specifically, our state space is a 10-by-10 grid, with the bottom-left state indexed as state $(0,0)$, and our four possible actions are moving North, South, East or West (note that if the robot tries to move off the board, it cannot; it just stays in the same state). Our single goal is at state $(3,6)$.
Remember that for Q-learning, the transition probabilities and reward values are not known in advance by the method-the agent just gets to observe state, action, and reward values as it moves through the domain.
Some notes (please read!):
\begin{itemize}
\item A new episode is started by sampling the first state uniformly at random.
\item The agent follows an epsilon-greedy policy with $\epsilon=0.1$.
\item Every action taken from the goal state leads to a zero-reward state that can never be escaped. Thus, to continue learning, we repeat the steps above. Note that we start a new/reset episode only after the agent reaches the goal state.
\item In the case of a tie in the value $\max _{a} Q(s, a)$ across actions $a$, we choose the "best" action randomly.
\item All transitions are noisy; that is, there is some non-zero probability of the agent moving somewhere different from its desired destination. For example, if the agent is in state $(0,0)$ and takes a "North" action, there is a non-zero chance that it actually ends up in state $(1,1)$.
\item Our $\gamma$ (discount factor) is set to $0.9$ and our $\alpha$ (learning rate) is set to $0.5$.
\end{itemize}
Note that the scale of the colors changes across the different plots, per the bar on the right of each plot.
These are plots of the maximum $\mathrm{Q}$ values of the states $\max _{a} Q(s, a)$ using the SSP formulation as we run 10,000 iterations of Q-value learning, plotting at specific iterations. Note that the scale of the colors changes across the different plots, per the bar on the right of each plot.
At iteration 10,000, how close are the values plotted (our estimates of the true value of taking the best actions starting at the given state) to the actual value of taking the best actions starting from the given state? Roughly what should the value in the bottom right corner be? (Assume that the transitions are deterministic.)
This state is still 12 steps from the goal. That means 12 steps on hot lava (ouch, ouch, ouch...). But, lucky for us, lava in the future doesn't seem as bad as lava in the present (because of the discounting of negative reward). So we end up with the value being a discounted sum of 12 rewards, each of which is $-1$. So the value should be $-1 \cdot\left(0.9^{0}+0.9^{1}+\cdots+0.9^{11}\right) \approx-7.17$, and the plotted value is close. (The same comment about the learning rate from $2.1.5$ applies.)
21
81Mathematics18.2
Principles of Discrete Applied Mathematics
None18.C06Midterm Exam 3Miller-Rabin Test2b1.25Text
Consider the following three numbers:
$$
N_{1}=257=2^{8}+1, \quad N_{2}=4097=2^{12}+1, \quad N_{3}=1537=3 \cdot 2^{9}+1.
$$
For each of these three numbers, we would like to test whether it is a prime or not.
Let $a_{2}=233$. For each $k=1,2,4,8,16,32,64,128,256,512,1024,2048,4096$, we compute $a_{2}^{k}$ modulo $N_{2}=4097$. The results of the computation are:
$$
233,1028,3855,1206,1,1,1,1,1,1,1,1,1.
$$
Based on these results, do you think $N_{2}$ is a prime? explain.
Open
Here we fail the Miller-Rabin test, because the last element preceding a 1 is 1206, which is not $1$ or $-1$ modulo 4097. So $N_{2}=4097$ is composite.
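A quick reproduction of the table above (not part of the original), using Python's built-in modular exponentiation:
N2, a2 = 4097, 233
print([pow(a2, 1 << i, N2) for i in range(13)])
# [233, 1028, 3855, 1206, 1, 1, ...]: 1206^2 = 1 (mod 4097) with 1206 not
# congruent to +-1, so 1206 witnesses that 4097 is composite.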
Consider the following three numbers:
$$
N_{1}=257=2^{8}+1, \quad N_{2}=4097=2^{12}+1, \quad N_{3}=1537=3 \cdot 2^{9}+1.
$$
For each of these three numbers, we would like to test whether it is a prime or not.
Let $a_{3}=2$. For each $k=3,6,12,24,48,96,192,384,768,1536$ we compute $a_{3}^{k}$ modulo $N_{3}=1537$. The results are:
$$
8,64,1022,861,487,471,513,342,152,49
$$
Again, do you think $N_{3}$ is a prime? Explain.
Here the Fermat test fails, since $2^{1536}$ is not 1 modulo 1537. So 1537 is composite.
Consider the following three numbers:
$$
N_{1}=257=2^{8}+1, \quad N_{2}=4097=2^{12}+1, \quad N_{3}=1537=3 \cdot 2^{9}+1.
$$
For each of these three numbers, we would like to test whether it is a prime or not.
Let $a_{1}=51$. For each $k=1,2,4,8,16,32,64,128,256$ (i.e., all powers of two up to $2^{8}$ ), we compute $a_{1}^{k}$ modulo $N_{1}=257$. The results of the computation are given below.
$$
51,31,190,120,8,64,241,256,1.
$$
Based on these results, do you think $N_{1}$ is a prime? Explain.
The pair $N_{1}=257$ and $a_{1}=51$ passes both the Fermat test (since $a_{1}^{256} \equiv 1 \bmod 257$) and the Miller-Rabin test (since right before the first occurrence of 1 we have $256 \equiv-1 \bmod 257$). So as far as we can tell, 257 looks like a prime. And indeed, 257 turns out to be a prime!
One has the factorization $340561=13 \times 17 \times 23 \times 67$. Show that for any integer $a$ relatively prime to 340561, one has
$$
a^{340560} \equiv 1 \quad \bmod 340561 .
$$
By the Chinese-Remainder Theorem, it suffices to show that
$$
a^{340560} \equiv 1 \quad \bmod p
$$
for $p=13,17,23,67$ and $a$ relatively prime to $p$. One can compute that 340560 is divisible by each of 12, 16, 22, and 66, so Fermat's little theorem shows that the equation holds.
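Concretely, the divisibility claim checks out:
$$
340560=12 \cdot 28380=16 \cdot 21285=22 \cdot 15480=66 \cdot 5160 .
$$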
22
91Mathematics18.01Calculus INoneNoneProblem Set 3Integration2a0.03959873284Text
On Problem Set 2, Problem 1, you estimated the area of an annulus, where the inner circle has radius $r$ and the outer circle has radius $r+\Delta r$. When $\Delta r$ is small, you showed that the area is approximately $2 \pi r \Delta r$. If you want a reminder about this, you could read the solution to pset 2. We're going to build on that in an integration problem.
Suppose we have a disk of radius 3 meters. The disk has variable density. Near a point at distance $r$ from the center, the disk has density $4-r$ kilograms per square meter. So the densest part of the disk is at the center, where the density is $4 \mathrm{~kg} / \mathrm{m}^{2}$. The least dense part of the disk is at the edge, where the density is $1 \mathrm{~kg} / \mathrm{m}^{2}$.
Approximate the mass of the part of the disk where the distance to the center is between 2 meters and $2.1$ meters. (This part of the disk is an annulus with inner radius 2 and outer radius 2.1.) Which is the best approximation: $8 \pi$ kg, $4 \pi$ kg, $2 \pi$ kg, $(0.8) \pi$ kg, $(0.4) \pi$ kg, or $(0.2) \pi$ kg?
Multiple Choice
The approximate area of this annulus is $2 \pi(2)(2.1-2)=0.4 \pi$ (units: $\mathrm{m}^{2}$). The density on this annulus is approximately $4-2=2$ (units: $\mathrm{kg} / \mathrm{m}^{2}$). Multiplying the density by the area gives an approximate mass of $0.8 \pi$ (units: $\mathrm{kg}$).
On Problem Set 2, Problem 1, you estimated the area of an annulus, where the inner circle has radius $r$ and the outer circle has radius $r+\Delta r$. When $\Delta r$ is small, you showed that the area is approximately $2 \pi r \Delta r$. If you want a reminder about this, you could read the solution to pset 2. We're going to build on that in an integration problem.
Suppose we have a disk of radius 3 meters. The disk has variable density. Near a point at distance $r$ from the center, the disk has density $4-r$ kilograms per square meter. So the densest part of the disk is at the center, where the density is $4 \mathrm{~kg} / \mathrm{m}^{2}$. The least dense part of the disk is at the edge, where the density is $1 \mathrm{~kg} / \mathrm{m}^{2}$.
Approximate the mass of the part of the disk where the distance to the center is between $r$ meters and $r+\Delta r$ meters.
Repeating the steps in part a, the approximate area is $2 \pi r \Delta r$ and the approximate density is $4-r$, so the approximate mass is $2 \pi r(\Delta r)(4-r)$.
On Problem Set 2, Problem 1, you estimated the area of an annulus, where the inner circle has radius $r$ and the outer circle has radius $r+\Delta r$. When $\Delta r$ is small, you showed that the area is approximately $2 \pi r \Delta r$. If you want a reminder about this, you could read the solution to pset 2. We're going to build on that in an integration problem.
Suppose we have a disk of radius 3 meters. The disk has variable density. Near a point at distance $r$ from the center, the disk has density $4-r$ kilograms per square meter. So the densest part of the disk is at the center, where the density is $4 \mathrm{~kg} / \mathrm{m}^{2}$. The least dense part of the disk is at the edge, where the density is $1 \mathrm{~kg} / \mathrm{m}^{2}$.
Write an integral for the total mass of the disk.
$\int_{0}^{3} 2 \pi r(4-r) d r$.
On Problem Set 2, Problem 1, you estimated the area of an annulus, where the inner circle has radius $r$ and the outer circle has radius $r+\Delta r$. When $\Delta r$ is small, you showed that the area is approximately $2 \pi r \Delta r$. If you want a reminder about this, you could read the solution to pset 2. We're going to build on that in an integration problem.
Suppose we have a disk of radius 3 meters. The disk has variable density. Near a point at distance $r$ from the center, the disk has density $4-r$ kilograms per square meter. So the densest part of the disk is at the center, where the density is $4 \mathrm{~kg} / \mathrm{m}^{2}$. The least dense part of the disk is at the edge, where the density is $1 \mathrm{~kg} / \mathrm{m}^{2}$.
Evaluate the integral to find the total mass.
$$
\begin{aligned}
2 \pi \int_{0}^{3}\left(4 r-r^{2}\right) d r & =2 \pi\left[\frac{4 r^{2}}{2}-\frac{r^{3}}{3}\right]_{0}^{3} \\
& =2 \pi\left[\frac{4 \cdot 3^{2}}{2}-\frac{3^{3}}{3}\right]=2 \pi(18-9)=18 \pi
\end{aligned}
$$
where the units are $k g$.
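A quick symbolic check of the integral (not part of the original), assuming sympy is available:
import sympy as sp
r = sp.symbols('r')
print(sp.integrate(2 * sp.pi * r * (4 - r), (r, 0, 3)))  # 18*pi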
23
56Mathematics18.02Calculus II18.01NoneProblem Set 6Vector Fields5a0.3205128205Text
Let $C$ be the unit circle, oriented counterclockwise, and consider the field $\vec{F}=\hat{i}+\hat{j}$. Which portions of $C$ contribute positively to the line integral of $\vec{F} ?$ Which portions contribute negatively?
Open
Consider the parametrization $(x, y)=(\cos t, \sin t)$. The velocity vector $\vec{v}$ at a point $(x, y)$ on the curve $C$ is
$$
\vec{v}(x, y)=(-\sin t) \hat{i}+(\cos t) \hat{j}=-y \hat{i}+x \hat{j} \text {. }
$$
Thus, $\vec{F} \cdot \vec{v}(x, y)>0$ if and only if $-y+x>0$; and $\vec{F} \cdot \vec{v}(x, y)<0$ if and only if $-y+x<0$. That is to say, the portion $\left\{(\cos t, \sin t) \in C:-\frac{3 \pi}{4}<t<\frac{\pi}{4}\right\}$ contributes positively to the line integral, while the portion $\left\{(\cos t, \sin t) \in C: \frac{\pi}{4}<t<\frac{5 \pi}{4}\right\}$ contributes negatively.
Let $C$ be the unit circle, oriented counterclockwise, and consider the field $\vec{F}=x^{2} y \hat{i}+x y^{2} \hat{j}$. Which portions of $C$ contribute positively to the line integral of $\vec{F} ?$ Which portions contribute negatively?
Since $\vec{F} \cdot \vec{v}(x, y)=x^{2} y \cdot(-y)+x y^{2} \cdot x=0$, at every point of $C$ the vector field $\vec{F}$ and the velocity vector $\vec{v}(x, y)$ are perpendicular to each other. Thus, the line integral is zero and every point on $C$ contributes zero to the line integral.
Let $\mathbf{F}(x, y)=x(\mathbf{i}+\mathbf{j})$, and let $C$ be the closed curve in the $xy$-plane formed by the triangle with vertices at the origin and the points $(1,0)$ and $(0,1)$.
Give a rough sketch of the field $\mathbf{F}$ in the first quadrant, and use it to predict whether the net flux out of the region $R=$ the interior of $C$ will be positive or negative.
Net flux out of $R$ will be positive (more flow out than into $R$ ).
Consider the vector field
$$
\vec{F}(x, y)=\frac{-y \hat{i}+x \hat{j}}{x^{2}+y^{2}} .
$$
Suppose $C$ is a smooth curve in the right half-plane $x>0$ joining two points $A=\left(x_{1}, y_{1}\right)$ and $B=\left(x_{2}, y_{2}\right)$. Express $\int_{C} \vec{F} \cdot d \vec{r}$ in terms of the polar coordinates $\left(r_{1}, \theta_{1}\right)$ and $\left(r_{2}, \theta_{2}\right)$ of $A$ and $B$.
The fundamental theorem of calculus for line integrals gives
$$
\int_{C} \vec{F} \cdot d \vec{r}=\theta\left(x_{2}, y_{2}\right)-\theta\left(x_{1}, y_{1}\right)=\theta_{2}-\theta_{1}
$$
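A numerical cross-check with code (a sketch, not part of the original solution; the endpoints $A=(1,-1)$, $B=(1,1)$ and the circular-arc path are illustrative choices):
import math

# Integrate F . dr along the arc x = sqrt(2) cos t, y = sqrt(2) sin t in x > 0.
theta1, theta2 = -math.pi / 4, math.pi / 4    # polar angles of A and B
n = 200000
dt = (theta2 - theta1) / n
total = 0.0
for i in range(n):
    t = theta1 + (i + 0.5) * dt
    x, y = math.sqrt(2) * math.cos(t), math.sqrt(2) * math.sin(t)
    dx, dy = -math.sqrt(2) * math.sin(t) * dt, math.sqrt(2) * math.cos(t) * dt
    r2 = x * x + y * y
    total += (-y / r2) * dx + (x / r2) * dy
print(total, theta2 - theta1)                 # both ~ pi/2 = theta_2 - theta_1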
24
42EECS6.102
Elements of Software Construction
6.101NoneMidterm Exam 2Concurrency3c0.48Text
/**
* Immutable type representing a strand of DNA.
*/
class DNA {
/** omitted */
public constructor(bases: string) {
// omitted
}
/**
* @returns zero-based index of first occurrence of `dna` as a substring of this strand,
* or undefined if `dna` never occurs.
*/
public find(dna: DNA): number|undefined {
// omitted
}
/**
* @returns true iff this and that are observationally equivalent
*/
public equalValue(that: DNA): boolean {
// omitted
}
// other code omitted
}
/**
* Immutable type representing a gene-editing process.
*/
interface Crispr {
/**
* Simulates this gene-editing process entirely in software, without using chemicals or a lab.
* @returns DNA strand that would result from this process
*/
simulate(): DNA;
/**
* Run this gene-editing process using the given `lab`.
* @returns the tube of `lab` in which the final DNA from this process
* can be found.
*/
async fabricate(lab: Lab): Promise<Tube>;
// other code omitted
}
/**
* Represents an already-existing DNA strand (a "precursor") in a gene-editing
* process. Precursors are bought premade from a supplier.
*/
class Precursor implements Crispr {
/**
* Make a gene-splicing step that results in the given `dna` strand.
*/
public constructor(private readonly dna: DNA) {
}
// other code omitted
}
/**
* Represents a gene-splicing step in a gene-editing process,
* which replaces all instances of one gene with another.
*/
class Splice implements Crispr {
/**
* Make a gene-splicing step that finds all occurrences of
* oldGene in target and substitutes newGene in place of each one.
*/
public constructor(
private readonly target: Crispr,
private readonly oldGene: Crispr,
private readonly newGene: Crispr
) {
}
// other code omitted
}
/**
* Mutable type controlling an automated gene-editing machine.
*/
class Lab {
/**
* Modifies the DNA in targetTube to replace all occurrences of the DNA from oldGeneTube with the
* DNA from newGeneTube.
* @returns a promise that fulfills with the same tube as targetTube, after the process is complete
*/
public async splice(targetTube: Tube, oldGeneTube: Tube, newGeneTube: Tube): Promise<Tube> {
// omitted
}
private tubeMap: Map<Tube, DNA> = new Map();
/**
* @returns a tube containing DNA strands corresponding to `dna`
*/
public async get(dna: DNA): Promise<Tube> {
for (const tube of this.tubeMap.keys()) {
if (this.tubeMap.get(tube).equalValue(dna)) {
return tube;
}
}
const tube = new Tube();
this.tubeMap.set(tube, dna);
await this.load(tube, dna); // "line 6a" is the load() call, "line 6b" is the await
return tube;
}
/**
* Ask a human to order premade DNA from a supplier
* and load it into the tube.
* @returns a promise that fulfills once this tube contains `dna`.
*/
private async load(tube: Tube, dna: DNA): Promise<void> {
// omitted
}
// other code omitted
}
/**
* Mutable type representing a test tube containing DNA.
*/
class Tube {
/** Make a new Tube. */
public constructor() {
}
// other code omitted
}
Suppose that:
• two different gene-editing processes A and B are running asynchronously using the same Lab
• A and B both call lab.get(dnaX) for the same precursor dnaX
• no other asynchronous processes are using lab
For each of the following interleavings, referring to the line numbers 1-7 in get() in the provided code, decide whether the
interleaving is impossible, leads to a race condition or deadlock, or runs safely; then explain your answer in one sentence.
A runs lines 1, 4, 5, 6a; then B runs lines 1, 4, 5, 6a; then A finishes lines 6b and 7, then B finishes lines 6b and 7.
Open
impossible; after A runs line 5, there will be at least one tube in tubeMap, so B must proceed from line 1 to line 2.
/**
* Immutable type representing a strand of DNA.
*/
class DNA {
/** omitted */
public constructor(bases: string) {
// omitted
}
/**
* @returns zero-based index of first occurrence of `dna` as a substring of this strand,
* or undefined if `dna` never occurs.
*/
public find(dna: DNA): number|undefined {
// omitted
}
/**
* @returns true iff this and that are observationally equivalent
*/
public equalValue(that: DNA): boolean {
// omitted
}
// other code omitted
}
/**
* Immutable type representing a gene-editing process.
*/
interface Crispr {
/**
* Simulates this gene-editing process entirely in software, without using chemicals or a lab.
* @returns DNA strand that would result from this process
*/
simulate(): DNA;
/**
* Run this gene-editing process using the given `lab`.
* @returns the tube of `lab` in which the final DNA from this process
* can be found.
*/
async fabricate(lab: Lab): Promise<Tube>;
// other code omitted
}
/**
* Represents an already-existing DNA strand (a "precursor") in a gene-editing
* process. Precursors are bought premade from a supplier.
*/
class Precursor implements Crispr {
/**
* Make a gene-splicing step that results in the given `dna` strand.
*/
public constructor(private readonly dna: DNA) {
}
// other code omitted
}
/**
* Represents a gene-splicing step in a gene-editing process,
* which replaces all instances of one gene with another.
*/
class Splice implements Crispr {
/**
* Make a gene-splicing step that finds all occurrences of
* oldGene in target and substitutes newGene in place of each one.
*/
public constructor(
private readonly target: Crispr,
private readonly oldGene: Crispr,
private readonly newGene: Crispr
) {
}
// other code omitted
}
/**
* Mutable type controlling an automated gene-editing machine.
*/
class Lab {
/**
* Modifies the DNA in targetTube to replace all occurrences of the DNA from oldGeneTube with the
* DNA from newGeneTube.
* @returns a promise that fulfills with the same tube as targetTube, after the process is complete
*/
public async splice(targetTube: Tube, oldGeneTube: Tube, newGeneTube: Tube): Promise<Tube> {
// omitted
}
private tubeMap: Map<Tube, DNA> = new Map();
/**
* @returns a tube containing DNA strands corresponding to `dna`
*/
public async get(dna: DNA): Promise<Tube> {
for (const tube of this.tubeMap.keys()) {
if (this.tubeMap.get(tube).equalValue(dna)) {
return tube;
}
}
const tube = new Tube();
this.tubeMap.set(tube, dna);
await this.load(tube, dna); // "line 6a" is the load() call, "line 6b" is the await
return tube;
}
/**
* Ask a human to order premade DNA from a supplier
* and load it into the tube.
* @returns a promise that fulfills once this tube contains `dna`.
*/
private async load(tube: Tube, dna: DNA): Promise<void> {
// omitted
}
// other code omitted
}
/**
* Mutable type representing a test tube containing DNA.
*/
class Tube {
/** Make a new Tube. */
public constructor() {
}
// other code omitted
}
Suppose that:
• two different gene-editing processes A and B are running asynchronously using the same Lab
• A and B both call lab.get(dnaX) for the same precursor dnaX
• no other asynchronous processes are using lab
For each of the following interleavings, referring to the line numbers 1-7 in get() in the provided code, decide whether the
interleaving is impossible, leads to a race condition or deadlock, or runs safely; then explain your answer in one sentence.
A runs lines 1, 4, 5, 6a, then B runs lines 1, 2, 3, then A finishes lines 6b and 7.
race condition; B returns the tube that A is loading before the tube has finished loading, violating the postcondition of get().
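You can reproduce this check-then-act race with code (a minimal Python asyncio sketch with hypothetical names, not the exam's TypeScript Lab; the dictionary stands in for tubeMap and the sleep stands in for the await inside load()):
import asyncio

tube_map = {}                                  # dna -> "loading" or "ready"

async def get(dna, who):
    if dna in tube_map:                        # lines 1-3: reuse an existing tube
        print(who, "returns tube in state:", tube_map[dna])
        return dna
    tube_map[dna] = "loading"                  # lines 4-5: record the new tube
    await asyncio.sleep(0.01)                  # line 6: the await inside load()
    tube_map[dna] = "ready"
    print(who, "returns tube in state:", tube_map[dna])
    return dna

async def main():
    await asyncio.gather(get("dnaX", "A"), get("dnaX", "B"))

asyncio.run(main())
# Prints that B returns the tube while it is still "loading": the race above.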
/**
* Immutable type representing a strand of DNA.
*/
class DNA {
/** omitted */
public constructor(bases: string) {
// omitted
}
/**
* @returns zero-based index of first occurrence of `dna` as a substring of this strand,
* or undefined if `dna` never occurs.
*/
public find(dna: DNA): number|undefined {
// omitted
}
/**
* @returns true iff this and that are observationally equivalent
*/
public equalValue(that: DNA): boolean {
// omitted
}
// other code omitted
}
/**
* Immutable type representing a gene-editing process.
*/
interface Crispr {
/**
* Simulates this gene-editing process entirely in software, without using chemicals or a lab.
* @returns DNA strand that would result from this process
*/
simulate(): DNA;
/**
* Run this gene-editing process using the given `lab`.
* @returns the tube of `lab` in which the final DNA from this process
* can be found.
*/
async fabricate(lab: Lab): Promise<Tube>;
// other code omitted
}
/**
* Represents an already-existing DNA strand (a "precursor") in a gene-editing
* process. Precursors are bought premade from a supplier.
*/
class Precursor implements Crispr {
/**
* Make a gene-splicing step that results in the given `dna` strand.
*/
public constructor(private readonly dna: DNA) {
}
// other code omitted
}
/**
* Represents a gene-splicing step in a gene-editing process,
* which replaces all instances of one gene with another.
*/
class Splice implements Crispr {
/**
* Make a gene-splicing step that finds all occurrences of
* oldGene in target and substitutes newGene in place of each one.
*/
public constructor(
private readonly target: Crispr,
private readonly oldGene: Crispr,
private readonly newGene: Crispr
) {
}
// other code omitted
}
/**
* Mutable type controlling an automated gene-editing machine.
*/
class Lab {
/**
* Modifies the DNA in targetTube to replace all occurrences of the DNA from oldGeneTube with the
* DNA from newGeneTube.
* @returns a promise that fulfills with the same tube as targetTube, after the process is complete
*/
public async splice(targetTube: Tube, oldGeneTube: Tube, newGeneTube: Tube): Promise<Tube> {
// omitted
}
private tubeMap: Map<Tube, DNA> = new Map();
/**
* @returns a tube containing DNA strands corresponding to `dna`
*/
public async get(dna: DNA): Promise<Tube> {
for (const tube of this.tubeMap.keys()) {
if (this.tubeMap.get(tube).equalValue(dna)) {
return tube;
}
}
const tube = new Tube();
this.tubeMap.set(tube, dna);
await this.load(tube, dna); // "line 6a" is the load() call, "line 6b" is the await
return tube;
}
/**
* Ask a human to order premade DNA from a supplier
* and load it into the tube.
* @returns a promise that fulfills once this tube contains `dna`.
*/
private async load(tube: Tube, dna: DNA): Promise<void> {
// omitted
}
// other code omitted
}
/**
* Mutable type representing a test tube containing DNA.
*/
class Tube {
/** Make a new Tube. */
public constructor() {
}
// other code omitted
}
Suppose that:
• two different gene-editing processes A and B are running asynchronously using the same Lab
• A and B both call lab.get(dnaX) for the same precursor dnaX
• no other asynchronous processes are using lab
For each of the following interleavings, referring to the line numbers 1-7 in get() in the provided code, decide whether the
interleaving is impossible, leads to a race condition or deadlock, or runs safely; then explain your answer in one sentence.
A runs lines 1 and 4, then B runs lines 1 and 4, then A runs lines 5, 6, 7, then B runs lines 5, 6, 7.
impossible; A cannot lose control to B after line 4; it can only lose control at the await in line 6.
/**
* Immutable type representing a strand of DNA.
*/
class DNA {
/** omitted */
public constructor(bases: string) {
// omitted
}
/**
* @returns zero-based index of first occurrence of `dna` as a substring of this strand,
* or undefined if `dna` never occurs.
*/
public find(dna: DNA): number|undefined {
// omitted
}
/**
* @returns true iff this and that are observationally equivalent
*/
public equalValue(that: DNA): boolean {
// omitted
}
// other code omitted
}
/**
* Immutable type representing a gene-editing process.
*/
interface Crispr {
/**
* Simulates this gene-editing process entirely in software, without using chemicals or a lab.
* @returns DNA strand that would result from this process
*/
simulate(): DNA;
/**
* Run this gene-editing process using the given `lab`.
* @returns the tube of `lab` in which the final DNA from this process
* can be found.
*/
async fabricate(lab: Lab): Promise<Tube>;
// other code omitted
}
/**
* Represents an already-existing DNA strand (a "precursor") in a gene-editing
* process. Precursors are bought premade from a supplier.
*/
class Precursor implements Crispr {
/**
* Make a gene-splicing step that results in the given `dna` strand.
*/
public constructor(private readonly dna: DNA) {
}
// other code omitted
}
/**
* Represents a gene-splicing step in a gene-editing process,
* which replaces all instances of one gene with another.
*/
class Splice implements Crispr {
/**
* Make a gene-splicing step that finds all occurrences of
* oldGene in target and substitutes newGene in place of each one.
*/
public constructor(
private readonly target: Crispr,
private readonly oldGene: Crispr,
private readonly newGene: Crispr
) {
}
// other code omitted
}
/**
* Mutable type controlling an automated gene-editing machine.
*/
class Lab {
/**
* Modifies the DNA in targetTube to replace all occurrences of the DNA from oldGeneTube with the
* DNA from newGeneTube.
* @returns a promise that fulfills with the same tube as targetTube, after the process is complete
*/
public async splice(targetTube: Tube, oldGeneTube: Tube, newGeneTube: Tube): Promise<Tube> {
// omitted
}
private tubeMap: Map<Tube, DNA> = new Map();
/**
* @returns a tube containing DNA strands corresponding to `dna`
*/
public async get(dna: DNA): Promise<Tube> {
for (const tube of this.tubeMap.keys()) {
if (this.tubeMap.get(tube).equalValue(dna)) {
return tube;
}
}
const tube = new Tube();
this.tubeMap.set(tube, dna);
await this.load(tube, dna); // "line 6a" is the load() call, "line 6b" is the await
return tube;
}
/**
* Ask a human to order premade DNA from a supplier
* and load it into the tube.
* @returns a promise that fulfills once this tube contains `dna`.
*/
private async load(tube: Tube, dna: DNA): Promise<void> {
// omitted
}
// other code omitted
}
/**
* Mutable type representing a test tube containing DNA.
*/
class Tube {
/** Make a new Tube. */
public constructor() {
}
// other code omitted
}
Suppose that:
• two different gene-editing processes A and B are running asynchronously using the same Lab
• A and B both call lab.get(dnaX) for the same precursor dnaX
• no other asynchronous processes are using lab
For each of the following interleavings, referring to the line numbers 1-7 in get() in the provided code, decide whether the
interleaving is impossible, leads to a race condition or deadlock, or runs safely; then explain your answer in one sentence.
A runs lines 1, 4, 5, 6, 7, then B runs lines 1, 4, 5, 6, 7.
impossible; after A runs line 5, there will be at least one tube in tubeMap, so B must proceed from line 1 to line 2.
25
257Mathematics18.01Calculus INoneNoneProblem Set 6Taylor Series10b0.07919746568Text
There is no formula for the antiderivative of $e^{-x^{2}}$, and so there is no formula for integrals like $\int_{0}^{.1} e^{-x^{2}} d x$. This type of integral is actually quite important in probability and statistics as we will see later. In this problem, we use Taylor series to approximate it.
There is a quicker way to find the Taylor series of $e^{-x^{2}}$. Earlier in the problem set you found the degree 3 Taylor series of $e^{-u}$ around $u=0$. Plugging in $u=x^{2}$ gives the degree 6 Taylor series of $e^{-x^{2}}$ around $x=0$.
Expression
Recall our formula for the degree 3 Taylor series of $e^{-u}$ from earlier:
$$
1-u+\frac{1}{2} u^{2}-\frac{1}{6} u^{3}
$$
and plug in $u=x^{2}$ to get the degree 6 Taylor series for $e^{-x^{2}}$ :
$$
T(x)=1-x^{2}+\frac{1}{2} x^{4}-\frac{1}{6} x^{6} .
$$
There is no formula for the antiderivative of $e^{-x^{2}}$, and so there is no formula for integrals like $\int_{0}^{.1} e^{-x^{2}} d x$. This type of integral is actually quite important in probability and statistics as we will see later. In this problem, we use Taylor series to approximate it.
Using the Taylor series from part b, approximate $\int_{0}^{.1} e^{-x^{2}} d x$. At the end, you'll get a sum of fractions. Use a calculator to turn this into a decimal and record the first six digits.
$$
\begin{aligned}
\int_{0}^{.1} e^{-x^{2}} d x & \approx \int_{0}^{.1}\left(1-x^{2}+\frac{1}{2} x^{4}-\frac{1}{6} x^{6}\right) d x \\
& =\left[x-\frac{x^{3}}{3}+\frac{1}{2} \frac{x^{5}}{5}-\frac{1}{6} \frac{x^{7}}{7}\right]_{0}^{.1} \\
& =.1-\frac{(.1)^{3}}{3}+\frac{(.1)^{5}}{10}-\frac{(.1)^{7}}{42} \\
& \approx 0.0996676
\end{aligned}
$$
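You can also verify this with code (a sketch, not part of the original solution): compare the integrated Taylor polynomial with a midpoint-rule estimate of the true integral.
import math

a = 0.1
taylor = a - a**3 / 3 + a**5 / 10 - a**7 / 42
n = 100000
h = a / n
midpoint = sum(math.exp(-((i + 0.5) * h) ** 2) for i in range(n)) * h
print(taylor, midpoint)          # both ~ 0.0996676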
There is no formula for the antiderivative of $e^{-x^{2}}$, and so there is no formula for integrals like $\int_{0}^{.1} e^{-x^{2}} d x$. This type of integral is actually quite important in probability and statistics as we will see later. In this problem, we use Taylor series to approximate it.
Compute the first two derivatives of $e^{-x^{2}}$, and use them to find the degree 2 Taylor series of $e^{-x^{2}}$ around $x=0$.
We need $f(0), f^{\prime}(0), f^{\prime \prime}(0)$ :
$$
\begin{aligned}
f(x)=e^{-x^{2}} & \longrightarrow \quad f(0)=1 \\
f^{\prime}(x)=-2 x e^{-x^{2}} & \longrightarrow \quad f^{\prime}(0)=0 \\
f^{\prime \prime}(x)=-2 e^{-x^{2}}+4 x^{2} e^{-x^{2}} & \longrightarrow \quad f^{\prime \prime}(0)=-2
\end{aligned}
$$
Use the degree 2 Taylor formula to get
$$
T(x)=1+(0)(x-0)+\frac{1}{2}(-2)(x-0)^{2}=1-x^{2}.
$$
There is no formula for the antiderivative of $e^{-x^{2}}$, and so there is no formula for integrals like $\int_{0}^{.1} e^{-x^{2}} d x$. This type of integral is actually quite important in probability and statistics as we will see later. In this problem, we use Taylor series to approximate it.
Using the Taylor series from part b, approximate $e^{-.1^{2}}$. At the end, you will get a sum of fractions. Use a calculator to turn this into a decimal and record the first six digits.
$$
T(.1)=1-(.1)^{2}+\frac{1}{2}(.1)^{4}-\frac{1}{6}(.1)^{6} \approx 0.990049
$$
26
55EECS6.1800
Computer Systems Engineering
6.1010, 6.1910NoneMidterm Exam 1
DCTCP and MapReduce
9c0.45Text
We have a large-scale web application using the partition/aggregate design pattern with many workers and real-time requirements. Assume there is NO background traffic (no other applications on the network) and the responses from workers are all small (1-2 packets each).
The MapReduce paper doesn't mention incast as an issue. Why not?
(a) Actually it does, but MapReduce talks about "stragglers" instead of "incast."
(b) The MapReduce paper doesn't describe many details of the reshuffling that happens between stages.
(c) Incast problems only started after the MapReduce paper had already been published.
(d) MapReduce operates in rigidly-scheduled stages, so it's impossible to have multiple messages arrive at a single switch at the same time.
(e) Google used exotic custom switches, described in the paper, to avoid these kinds of networking problems.
Multiple Choice
(b).
We have a large-scale web application using the partition/aggregate design pattern with many workers and real-time requirements. Assume there is NO background traffic (no other applications on the network) and the responses from workers are all small (1-2 packets each).
Which of these problems is likely to be a concern in our network? Select ALL correct answers:
(a) Incast.
(b) Queue buildup.
(c) Buffer pressure.
(d) None of these problems is likely.
(a).
We have a large-scale web application using the partition/aggregate design pattern with many workers and real-time requirements. Assume there is NO background traffic (no other applications on the network) and the responses from workers are all small (1-2 packets each).
Max has a bright idea of changing the aggregator structure for the partition/aggregate pattern. Instead of having one aggregator to aggregate the answer of 1000 workers, Max proposes using 10 aggregators each aggregating the answer from 100 workers, and then 1 aggregator to aggregate the answer from the 10 aggregators.
For our application, what is the most significant potential problem with this approach?
Select the BEST answer:
(a) It increases the number of machines required.
(b) It roughly doubles the time required for aggregation.
(c) It creates longer queues.
(d) It intensifies incast.
(b).
We have a large-scale web application using the partition/aggregate design pattern with many workers and real-time requirements. Assume there is NO background traffic (no other applications on the network) and the responses from workers are all small (1-2 packets each).
TCP with RED/ECN has some similarities to DCTCP. What distinguishes DCTCP from TCP with RED/ECN? Select ALL correct statements on how DCTCP is DIFFERENT:
(a) DCTCP implements a central service to actively optimize the queue length of each switch.
(b) DCTCP can be implemented in commercially-available switches.
(c) DCTCP automatically determines the queue-length threshold $\mathrm{K}$.
(d) DCTCP does not adjust the size of the congestion window.
(e) DCTCP tries to minimize queue occupancy, but at the cost of lower throughput than TCP with RED/ECN.
(f) None of the above statements distinguishes DCTCP from TCP with RED/ECN.
(f).
27
63Mathematics18.102
Introduction to Functional Analysis
18.C06, 18.100BNoneProblem Set 9
Bounded Linear Operators
4nan0.5Text
Show that if $V \in \mathcal{C}^{0}[0, \pi]$ then the linear map
$$
Q_{V}: H_{0}^{2}([0, \pi]) \ni u \longmapsto-F_{2}+V u \in L^{2}(0, \pi)
$$
is bounded and reduces to
$$
u \longmapsto-u^{\prime \prime}+V u \text { on } \mathcal{C}_{0}^{2}([0, \pi])
$$
Open
Let $\mathscr{H}=H_{0}^{2}(0, \pi), \mathscr{L}=L^{2}(0, \pi)$. The map $u \rightarrow-F_{2}$ is a bounded map from $\mathscr{H}$ to $\mathscr{L}$. In fact it maps $u=\sum c_{k} \sin k x$ to $-F_{2}=\sum k^{2} c_{k} \sin k x$, and $\|u\|_{\mathscr{H}}=\left\|F_{2}\right\|_{\mathscr{L}}$. The map $u \rightarrow u$ is a bounded map from $\mathscr{H}$ to $\mathscr{L}$, as $\|u\|_{\mathscr{H}} \geq\|u\|_{\mathscr{L}}$. Finally, multiplication by $V$ is a bounded map from $\mathscr{L}$ to $\mathscr{L}$. Combining the above, we conclude that $Q_{V}$ is a bounded map from $\mathscr{H}$ to $\mathscr{L}$. It reduces to $-u^{\prime \prime}+V u$ by Problem 9.3.
With $F_{1}$ and $F_{2}$ as in (2) for $u \in H_{0}^{2}([0, \pi])$ show that
$$
\int_{(0, \pi)} u \phi^{\prime}=-\int_{(0, \pi)} F_{1} \phi, \int_{(0, \pi)} u \phi^{\prime \prime}=\int_{(0, \pi)} F_{2} \phi, \forall \phi \in \mathcal{C}_{0}^{2}([0, \pi])
$$
and show that if $u \in \mathcal{C}_{0}^{2}([0, \pi]) \subset H_{0}^{2}([0, \pi])$ then $F_{1}=u^{\prime}, F_{2}=u^{\prime \prime}$.
Since $u_{N}$ vanishes at 0 and $\pi$, and $u, F_{1}, \phi$ and $\phi^{\prime}$ are integrable, we can use Lebesgue dominated convergence to exchange limit and integration and conclude that
$$
\int_{0}^{\pi} u \phi^{\prime}=\int_{0}^{\pi} \lim _{N \rightarrow \infty} u_{N} \phi^{\prime}=\lim _{N \rightarrow \infty}\left[\left[u_{N} \phi\right]_{0}^{\pi}-\int_{0}^{\pi} \frac{d u_{N}}{d x} \phi\right]=-\int_{0}^{\pi} \lim _{N \rightarrow \infty} \frac{d u_{N}}{d x} \phi=-\int_{0}^{\pi} F_{1} \phi .
$$
Similarly,
$$
\int_{0}^{\pi} u \phi^{\prime \prime}=-\int_{0}^{\pi} F_{1} \phi^{\prime}=-\lim _{N \rightarrow \infty}\left[\left[\frac{d u_{N}}{d x} \phi\right]_{0}^{\pi}-\int_{0}^{\pi} \frac{d^{2} u_{N}}{d x^{2}} \phi\right]=\int_{0}^{\pi} \lim _{N \rightarrow \infty} \frac{d^{2} u_{N}}{d x^{2}} \phi=\int_{0}^{\pi} F_{2} \phi
$$
where the final equality holds because $\frac{d^{2} u_{N}}{d x^{2}} \rightarrow F_{2}$ in $\|\cdot\|_{2}$ and because integration against $\phi$ is a continuous linear functional on $L^{2}(0, \pi)$. Finally, suppose $u \in C_{0}^{2}([0, \pi])$. Then
$$
-\int_{0}^{\pi} F_{1} \phi=\int_{0}^{\pi} u \phi^{\prime}=-\int_{0}^{\pi} u^{\prime} \phi
$$
for all $\phi \in C_{0}^{2}([0, \pi])$. This implies that $u^{\prime}=F_{1}$ in the $L^{2}$ sense, and since $u^{\prime}$ and $F_{1}$ are continuous, that $u^{\prime}=F_{1}$ everywhere. Similarly, we find
$$
\int_{0}^{\pi} F_{2} \phi=\int_{0}^{\pi} u^{\prime \prime} \phi
$$
and therefore that $F_{2}=u^{\prime \prime}$ in the $L^{2}$ sense.
Show (it is really a matter of recalling) that the inverse $Q_{0}^{-1}=A^{2}$ is the square of a compact self-adjoint non-negative operator on $L^{2}(0, \pi)$ and that
$$
Q_{V}^{-1}=A(\mathrm{Id}+A V A)^{-1} A
$$
(where we are assuming that $0 \leq V \in \mathcal{C}^{0}[0, \pi]$). Using results from class or the notes on the Dirichlet problem (or otherwise ...) show that if $V \geq 0$ then $Q_{V}$ is an isomorphism (meaning just a bounded bijection with bounded inverse) of $H_{0}^{2}([0, \pi])$ to $L^{2}(0, \pi)$.
Let $\mathscr{H}=H_{0}^{2}(0, \pi), \mathscr{L}=L^{2}(0, \pi)$. The only difference here is that we use the interval $[0, \pi]$ instead of $[0,2 \pi]$ as in the textbook. We define $A$ by $A(\sin k x)=\frac{1}{k} \sin k x$. This extends to a bounded operator from $\mathscr{L} \rightarrow \mathscr{L}$ which is compact and self-adjoint. By definition $Q_{0}^{-1}=A^{2}$, and the same deduction as in the textbook shows $Q_{V}^{-1}=A(1+A V A)^{-1} A$ (we use 1 to represent Id). The bounded operator $A V A$ is compact and self-adjoint, and since $V \geq 0$, its eigenvalues are non-negative. Therefore, by the spectral theorem for compact self-adjoint operators, we may diagonalize $A V A$ and find $(1+A V A)^{-1}=1+Q$ for a bounded and compact operator $Q$.
We need to show $Q=A F A$ for some bounded operator $F: \mathscr{L} \rightarrow \mathscr{L}$. By definition $(1+A V A)(1+Q)=1$. Expanding this, we find
$$
\begin{aligned}
Q & =-(1+A V A)^{-1} A V A=-(1+A V A-A V A)(1+A V A)^{-1} A V A \\
& =-A V A+A V A(1+A V A)^{-1} A V A=A\left(-V+V A(1+A V A)^{-1} A V\right) A
\end{aligned}
$$
Therefore, the bounded operator $F=-V+V A(1+A V A)^{-1} A V$ will do the job. Finally, we need to show $Q_{V}^{-1}$ is bounded from $\mathscr{L}$ to $\mathscr{H}$. By the above computation, we find $Q_{V}^{-1}=A(1+A F A) A=A\left(A+A F A^{2}\right)=A^{2}\left(1+F A^{2}\right)$. Here $1+F A^{2}$ is a bounded operator from $\mathscr{L}$ to $\mathscr{L}$ and $A^{2}$ by its definition is a bounded operator from $\mathscr{L}$ to $\mathscr{H}$. We conclude $Q_{V}^{-1}$ is bounded from $\mathscr{L}$ to $\mathscr{H}$ and this finishes the proof.
Show that if $u \in H_{0}^{2}([0, \pi])$ and $u_{N}$ is the sum of the first $N$ terms in the Fourier-Bessel series for $u$ (which is in $\left.L^{2}(0, \pi)\right)$ then
$$
u_{N} \rightarrow u, \frac{d u_{N}}{d x} \rightarrow F_{1}, \frac{d^{2} u_{N}}{d x^{2}} \rightarrow F_{2}
$$
where in the first two cases we have convergence in supremum norm and in the third, convergence in $L^{2}(0, \pi)$. Deduce that $u \in \mathcal{C}^{0}[0, \pi], u(0)=u(\pi)=0$ and $F_{1} \in \mathcal{C}^{0}[0, \pi]$ whereas $F_{2} \in L^{2}(0, \pi)$.
After normalisation, the set $\left\{\sqrt{\frac{2}{\pi}} \sin k x\,(k=1,2, \ldots)\right\}$ forms an orthonormal basis of $L^{2}(0, \pi)$. Let $u \in H_{0}^{2}([0, \pi])$ and $c_{k}=\int_{0}^{\pi} \sin k x\, u(x) d x$. We have $\sum\left|k^{2} c_{k}\right|^{2}<\infty$. Therefore there exist functions $F_{1}$ and $F_{2}$ in $L^{2}(0, \pi)$ such that the following equations hold in $L^{2}(0, \pi)$:
$$
u=\frac{2}{\pi} \sum_{k=1}^{\infty} c_{k} \sin k x, \quad F_{1}=\frac{2}{\pi} \sum_{k=1}^{\infty} k c_{k} \cos k x, \quad F_{2}=-\frac{2}{\pi} \sum_{k=1}^{\infty} k^{2} c_{k} \sin k x .
$$
This shows the convergence in (2) holds at least in the $L^{2}$ sense. Our goal is to show the first two summations actually converge in supremum norm (uniform convergence of continuous functions). This will imply that $u$ and $F_{1}$ are uniform limits of continuous functions and therefore are continuous, and $u(0)=u(\pi)=0$.
We first show the summation defining $F_{1}$ converges at $x=0$; namely, $\sum k c_{k}$ converges. This follows from the Cauchy-Schwarz inequality $\sum\left|k c_{k}\right| \leq\left(\sum\left|k^{2} c_{k}\right|^{2}\right)^{1/2}\left(\sum \frac{1}{k^{2}}\right)^{1/2}$. Now, we show the summation defining $F_{1}$ converges in the supremum norm. For any $x \in[0, \pi]$, we compute the partial sum from $N_{1}$ to $N_{2}$:
$$
\sum_{k=N_{1}}^{N_{2}} k c_{k} \cos k x=\sum_{k=N_{1}}^{N_{2}} k c_{k}(\cos k x-1)+\sum_{k=N_{1}}^{N_{2}} k c_{k}=-\sum_{k=N_{1}}^{N_{2}} k^{2} c_{k} \int_{0}^{x} \sin k t\, d t+\sum_{k=N_{1}}^{N_{2}} k c_{k}
$$
Now, taking absolute values and using $\sum\left|k^{2} c_{k}\right|^{2}<\infty$, we find this tends to 0 as long as $N_{1}, N_{2}$ are large (independent of $x$). This finishes the uniform convergence of $F_{1}$. Similarly, the summation defining $u$ converges at $x=0$, since every term equals 0. For any $x \in[0, \pi]$, we compute
$$
\sum_{k=N_{1}}^{N_{2}} c_{k} \sin k x=\sum_{k=N_{1}}^{N_{2}} k c_{k} \int_{0}^{x} \cos k t\, d t
$$
This tends to 0 as long as $N_{1}, N_{2}$ are large (independent of $x$). This proves the uniform convergence of $u$.
28
103Mathematics18.102
Introduction to Functional Analysis
18.C06, 18.100BNoneFinal Exam
Orthogonal Complements
8nan4.5Text
Suppose that $f \in L^{2}(\mathbb{R})$ is such that there exists a function $v \in L^{2}(\mathbb{R})$ satisfying
$$
\int_{\mathbb{R}} f \phi^{\prime}=\int_{\mathbb{R}} v \widehat{\phi} \quad \forall \phi \in \mathcal{S}(\mathbb{R})
$$
where $\widehat{\phi}$ is the Fourier transform of $\phi$. Show that $f \in C_{0}(\mathbb{R}) \subset L^{2}(\mathbb{R})$, where $C_{0}(\mathbb{R})$ is the space of continuous functions on $\mathbb{R}$ with zero limit at $\pm \infty$.
You may use that if $h$ is a locally $L^{2}$ function on $\mathbb{R}$ such that $\int h \phi=0$ for every $\phi \in C_{c}^{\infty}(\mathbb{R})$ then $h=0$ a.e. (as $C_{c}^{\infty}(I)$ is dense in $L^{2}(I)$ for every interval $I$).
Open
Let $\psi:=\widehat{\phi}$. Then $\phi(x)=\frac{1}{2 \pi} \widehat{\psi}(-x)$, so
$$
\int_{\mathbb{R}} f(-x) \widehat{i \xi \psi}(x)=\int v \psi \quad \forall \psi \in \mathcal{S}(\mathbb{R}).
$$
So
$$
\int_{\mathbb{R}} \widehat{f}(-\xi) i \xi \psi(\xi)=\int v \psi
$$
from which it follows that the locally $L^{2}$-function $h(\xi):=i \xi \widehat{f}(-\xi)-v(\xi)$ is orthogonal to $\mathcal{S}(\mathbb{R})$, hence to its subspace $C_{c}^{\infty}(\mathbb{R})$. Thus $h=0$ a.e., i.e., $\xi \widehat{f} \in L^{2}(\mathbb{R})$. This means $\widehat{f} \in L^{1}(\mathbb{R})$, since
$$
\left(\int|\widehat{f}|\right)^{2} \leq \int \frac{1}{1+\xi^{2}} \cdot \int\left(1+\xi^{2}\right)|\widehat{f}|^{2}=\pi \int\left(1+\xi^{2}\right)|\widehat{f}|^{2}<\infty.
$$
so $f$ is (almost everywhere equal to) the inverse Fourier transform of an $L^{1}$ function, hence continuous and going to zero at infinity.
Suppose that $f: \mathbb{R} \rightarrow \mathbb{C}$ is a locally $\mathcal{L}^{1}$ function and $\phi \in \mathcal{C}_{\mathrm{c}}(\mathbb{R})$. Explain why $f \phi \in \mathcal{L}^{1}(\mathbb{R})$.
Suppose $\phi$ is supported on $[-R, R]$ and let $f_{R}(x)=f(x)$ if $x \in[-R, R]$ and $f(x)=0$ otherwise. Then $f_{R} \in \mathcal{L}^{1}(\mathbb{R})$ and $f \phi=f_{R} \phi$. So it suffices to show that if $f \in \mathcal{L}^{1}(\mathbb{R})$ then so is $f \phi$. But this is easy since if $u_{n}$ is an approximating sequence for $f$ then $u_{n} \phi$ is an approximating sequence for $f \phi$. Indeed, $u_{n}(x) \phi(x) \rightarrow f(x) \phi(x)$ a.e. and
$$
\sum_{n} \int\left|\left(u_{n}-u_{n-1}\right) \phi\right| \leq \max |\phi| \sum_{n} \int\left|u_{n}-u_{n-1}\right|<\infty
$$
(where $u_{0}=0$).
If $U \subset \mathbb{R}$ is measurable and $f \in \mathcal{L}^{1}(\mathbb{R})$ show that
$$
\int_{U} f=\int \chi_{U} f \in \mathbb{C}
$$
is well-defined. Prove that if $f \in \mathcal{L}^{1}(\mathbb{R})$ then
$$
I_{f}(x)= \begin{cases}\int_{(0, x)} f & x \geq 0 \\ -\int_{(x, 0)} f & x<0\end{cases}
$$
is a bounded continuous function on $\mathbb{R}$.
The integral is well-defined by Problem 4.3.3. The function $I_{f}(x)$ is well defined and bounded since $I_{|f|}(x)$ is bounded by $\int|f|<\infty$.
To prove continuity it is sufficient to check that the sequence $\int_{\left(x_{n}, x\right)} f$ for $x_{n}<x$ and $\int_{\left(x, x_{n}\right)} f$ for $x \leq x_{n}$ tends to 0 as $x_{n}$ tends to $x$. The sequence $\chi_{\left(x_{n}, x\right)} f$ (resp. $\chi_{\left(x, x_{n}\right)} f$) is dominated by $|f| \in \mathcal{L}^{1}(\mathbb{R})$ and its limit is 0 a.e. The statement now follows from the dominated convergence theorem.
Define $\mathcal{L}^{\infty}(\mathbb{R})$ as the set of functions $g: \mathbb{R} \longrightarrow \mathbb{C}$ such that there exists $C>0$ and a sequence $v_{n} \in \mathcal{C}(\mathbb{R})$ with $\left|v_{n}(x)\right| \leq C$ and $v_{n}(x) \rightarrow g(x)$ a.e.
Show that if $g \in \mathcal{L}^{\infty}(\mathbb{R})$ and $f \in \mathcal{L}^{1}(\mathbb{R})$ then $g f \in \mathcal{L}^{1}(\mathbb{R})$ and that this defines a map
$$
L^{\infty}(\mathbb{R}) \times L^{1}(\mathbb{R}) \longrightarrow L^{1}(\mathbb{R})
$$
which satisfies $\|g f\|_{L^{1}} \leq\|g\|_{L^{\infty}}\|f\|_{L^{1}}$.
Proof. For $g$ we keep the notations from the definition. For $f$ let $w_{n}$ be the absolutely summable series converging to $f$ a.e.
Note that the sequence $u_{n}=v_{k} w_{n}$ is absolutely summable, as $\sum_{n} \int\left|v_{k} w_{n}(x)\right| \leq C \sum_{n} \int\left|w_{n}(x)\right|<\infty$, and converges to $v_{k} f$ a.e., which is thus in $\mathcal{L}^{1}(\mathbb{R})$. Now $t_{n}=v_{n} f$ is dominated by $C|f|$ and converges to $f g$, so $f g \in \mathcal{L}^{1}(\mathbb{R})$.
If either $f$ or $g$ is in $\mathcal{N}$ then $f g \in \mathcal{N}$, which ensures that the map descends from $\mathcal{L}$ to $L$.
Finally, since $|g| \leq\|g\|_{L^{\infty}}$ a.e.
$$
\|g f\|_{L^{1}}=\int|g f| \leq \int\|g\|_{L^{\infty}}|f|=\|g\|_{L^{\infty}}\|f\|_{L^{1}}.
$$
29
3Mathematics18.404
Theory of Computation
6.1210/18.200NoneProblem Set 1
Regular Expression
3b0.5555555556Text
For any regular expression $R$ and $k\ge0$, let $R^k$ be $R$ self-concatenated $k$ times, $\underbrace{RR\cdots R}_k$.
Let $\Sigma=\{0,1\}$.
Let $B=\{0^k u 1^k \mid k\ge1 \text{ and } u\in 1\Sigma^*\}$.
Show $B$ is not regular.
Open
Assume for contradiction that $B$ is regular. Use the pumping lemma to get the pumping length $p$. Letting $s=0^{p} 11^{p}$ we have $s \in B$ (here $u=1$ ) and so we can divide up $s=x y z$ according to the conditions of the pumping lemma. By condition 3, $x y$ has only 0s, hence the string $x y y z$ is $0^{l} 11^{p}$ for some $l>p$. But then $0^{l} 11^{p}$ isn't equal to $0^{k} 1 u 1^{k}$ for any $u \in \Sigma^{*}$ and $k$, because the left-hand part of the string requires $k=l$ and the right-hand part requires $k \leq p$. Both together are impossible, because $l>p$. That contradicts the pumping lemma and we conclude that $B$ isn't regular.
For any regular expression $R$ and $k\ge0$, let $R^k$ be $R$ self-concatenated $k$ times, $\underbrace{RR\cdots R}_k$.
Let $\Sigma=\{0,1\}$.
Let $A=\{0^k u 1^k \mid k\ge1 \text{ and } u\in \Sigma^*\}$.
Show $A$ is regular.
Any string that doesn't begin with 0 and end with 1 obviously cannot be a member of $A$. If string $w$ does begin with 0 and end with 1 then $w=0 u 1$ for some string $u$. Hence $A=0 \Sigma^{*} 1$ and therefore $A$ is regular.
Let $\Sigma = \{0,1\}$. For $k\ge1$, let $E_k=\{w \mid |w| \ge k$ and the $k$th symbol from the end of $w$ is a $1\}$.
Here, $|w|$ means the length of $w$.
Given $k$, describe a regular expression for $E_k$.
You may use the exponentiation notation given in problem 4.
$\Sigma^{*} 1 \Sigma^{k-1}$.
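A quick sanity check of this regular expression with code (a sketch using Python's re module; the bounded repetition {k-1} translates the exponentiation notation, and the test strings are illustrative):
import re

k = 3
pattern = re.compile("^[01]*1[01]{%d}$" % (k - 1))   # Sigma* 1 Sigma^(k-1)
print(bool(pattern.match("0100")))   # True: 3rd symbol from the end is 1
print(bool(pattern.match("0010")))   # False: 3rd symbol from the end is 0
print(bool(pattern.match("11")))     # False: shorter than k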
Let $\Sigma = \{0,1\}$. For $k\ge1$, let $E_k=\{w \mid |w| \ge k$ and the $k$th symbol from the end of $w$ is a $1\}$.
Here, $|w|$ means the length of $w$.
Prove that for each $k$, no DFA can recognize $E_k$ with fewer than $2^k$ states.
Assume for contradiction that a DFA $C$ recognizes $E_{k}$ with fewer than $2^{k}$ states. Consider all $2^{k}$ strings of length $k$. When $C$ reads each of these strings, it must end up in the same state for at least two of them, $s=s_{1} s_{2} \cdots s_{k}$ and $t=t_{1} t_{2} \cdots t_{k}$. For some $i$ we have $s_{i} \neq t_{i}$. Let $s^{\prime}=s 0^{i-1}$ and $t^{\prime}=t 0^{i-1}$. The $k$th symbols from the ends of $s^{\prime}$ and $t^{\prime}$ are unequal, hence only one of $s^{\prime}$ and $t^{\prime}$ is in $E_{k}$. However, $C$ ends up in the same state on $s$ and $t$, so it also ends up in the same state on $s^{\prime}$ and $t^{\prime}$, so $C$ acts the same on $s^{\prime}$ and $t^{\prime}$, i.e., it accepts both or rejects both, a contradiction.
30
72Mathematics18.03
Differential Equations
None18.02Problem Set 6Linear Algebra4b0.2412868633Text
Let $S_{3}$ be the vector space of polynomials of degree at most 3. Consider the linear map $T: S_{3} \rightarrow \mathbb{R}^{2}$ defined by
$$
T(p)=\left(p(0), p^{\prime}(1)\right)
$$
You may assume this map is linear (it follows from problem (4)).
Find a basis for $\operatorname{Ker}(T):=\left\{p \in S_{3}: T(p)=0\right\}$ and compute the dimension of this vector space. Note in particular that we have the equality
$$
\operatorname{dim} \operatorname{Ker}(T)+\operatorname{dim} \operatorname{Im}(T)=\operatorname{dim} S_{3}
$$
Open
Let $p(x)=a_{3} x^{3}+a_{2} x^{2}+a_{1} x+a_{0}$ be an element of $S_{3}$. Notice that $T(p)=\left(a_{0}, 3 a_{3}+2 a_{2}+a_{1}\right)$. Hence, if $p$ belongs to the kernel of $T$, we have $a_{0}=0$ and $a_{1}=-3 a_{3}-2 a_{2}$. Consequently, we have $p(x)=a_{3}\left(x^{3}-3 x\right)+a_{2}\left(x^{2}-2 x\right)$. We see that the kernel of $T$ is contained in the span of $x^{3}-3 x$ and $x^{2}-2 x$. Conversely, it follows from our computation above that $x^{3}-3 x$ and $x^{2}-2 x$ belong to the kernel of $T$. Thus, the kernel of $T$ is exactly the span of $x^{3}-3 x$ and $x^{2}-2 x$.
Let us prove that $x^{3}-3 x$ and $x^{2}-2 x$ are linearly independent. Let $\lambda$ and $\mu$ be real numbers such that $\lambda\left(x^{3}-3 x\right)+\mu\left(x^{2}-2 x\right)=0$ for every $x \in \mathbb{R}$. Differentiating this equality three times and evaluating at $x=1$, we find that $\lambda=0$. Hence, we have $\mu\left(x^{2}-2 x\right)=0$, and evaluating at $x=1$, we find that $\mu=0$. We proved that $\lambda=\mu=0$; it follows that $x^{3}-3 x$ and $x^{2}-2 x$ are linearly independent.
Consequently, $x^{3}-3 x$ and $x^{2}-2 x$ form a basis of the kernel of $T$, which is thus of dimension 2. It follows from the previous question that the dimension of the image of $T$ is 2, so that we have
$$
\operatorname{dim} \operatorname{Ker}(T)+\operatorname{dim} \operatorname{Im}(T)=2+2=4=\operatorname{dim} S_{3} .
$$
Let $S_{3}$ be the vector space of polynomials of degree at most 3. Consider the linear map $T: S_{3} \rightarrow \mathbb{R}^{2}$ defined by
$$
T(p)=\left(p(0), p^{\prime}(1)\right)
$$
You may assume this map is linear (it follows from problem (4)).
Show that this map is surjective. Concretely, for any $\left(a_{1}, a_{2}\right) \in \mathbb{R}^{2}$, find a polynomial $p$ so that $T(p)=\left(a_{1}, a_{2}\right)$.
For $\left(a_{1}, a_{2}\right) \in \mathbb{R}^{2}$, let $p$ be the polynomial $p(x)=a_{2} x+a_{1}$ and notice that $T(p)=\left(a_{1}, a_{2}\right)$. The map $T$ is consequently surjective.
Let $x$ be a real variable and for $k \geq 0$ consider
$$
S_{k}=\{\text { real polynomials } p(x) \text { with degree } \leq k\} .
$$
Show that $S_{k}$ is a vector space (over $\mathbb{R}$ ), find a basis for $S_{k}$ and compute $\operatorname{dim} S_{k}$.
Let us define the space $S_{k}$ more precisely as the set of all polynomials $p(x)$ that can be expressed as $p(x)=\sum_{n=0}^{k} p_{n} x^{n}$, where each $p_{n} \in \mathbb{R}$. To check that $S_{k}$ is a vector space over $\mathbb{R}$, we only need to check three requirements. Each mathematical statement is given first, followed by a translation into plainer English.
1. If $v, w \in S_{k}$, then $v+w \in S_{k}$. In other words, the sum of two polynomials, each of degree not exceeding $k$, is a polynomial with degree not exceeding $k$.
2. If $v \in S_{k}, c \in \mathbb{R}$, then $c \cdot v \in S_{k}$. In other words, the product of a real number and a polynomial of degree not exceeding $k$ is a polynomial with degree not exceeding $k$.
3. There exists an element $\mathbf{0}$ in $S_{k}$ such that $v+\mathbf{0}=v \quad \forall v \in S_{k}$. In other words, there is a 0 polynomial with degree not exceeding $k$.
To facilitate checking of the requirements, write $v(x)=\sum_{n=0}^{k} v_{n} x^{n}, w(x)=\sum_{n=0}^{k} w_{n} x^{n}$. Then, we can check each requirement in turn.
1. $v+w=\sum_{n=0}^{k}\left(v_{n}+w_{n}\right) x^{n} \in S_{k}$.
2. $c \cdot v=\sum_{n=0}^{k}\left(c v_{n}\right) x^{n} \in S_{k}$.
3. We can construct $\mathbf{0}=\sum_{n=0}^{k} p_{n} x^{n}$ with $p_{n}=0 \forall n$. Then, see that $v+\mathbf{0}=$ $\sum_{n=0}^{k} v_{n} x^{n}+\sum_{n=0}^{k} 0 x^{n}=\sum_{n=0}^{k} v_{n} x^{n}=v$ for any $v \in S_{k}$.
One basis for $S_{k}$ is given by $\left\{1, x, x^{2}, x^{3} \ldots x^{k}\right\}$, which contains $k+1$ terms, and so the dimension of $S_{k}$ is $k+1$.
Recall the definition of a linear map: If $V_{1}, V_{2}$ are vector spaces, both over $\mathbb{R}$, or both over $\mathbb{C}$, then a map $T: V_{1} \rightarrow V_{2}$ is said to be linear if: for any scalars $c_{1}, c_{2}$ and any vectors $\vec{v}_{1}, \vec{v}_{2}$ we have
$$
T\left(c_{1} \vec{v}_{1}+c_{2} \vec{v}_{2}\right)=c_{1} T\left(\vec{v}_{1}\right)+c_{2} T\left(\vec{v}_{2}\right)
$$
Let $S_{k}:=\{p(x)$ polynomials of degree $\leq k\}$. Show that the following maps are linear:
Consider the basis $\left\{1, x, x^{2}, x^{3}\right\}$ of $S_{3}$. With this choice of basis we may identify a polynomial $p(x) \in S_{3}$ with a vector as follows:
$$
p(x)=a_{0}+a_{1} x+a_{2} x^{2}+a_{3} x^{3} \longleftrightarrow\left(\begin{array}{l}
a_{0} \\
a_{1} \\
a_{2} \\
a_{3}
\end{array}\right)
$$
Similarly, let $\left\{1, x, x^{2}\right\}$ be a basis of $S_{2}$. Then we may identify a polynomial $q(x) \in S_{2}$ with a vector as
$$
q(x)=b_{0}+b_{1} x+b_{2} x^{2} \longleftrightarrow\left(\begin{array}{l}
b_{0} \\
b_{1} \\
b_{2}
\end{array}\right)
$$
With these choices, express the map $D: S_{3} \rightarrow S_{2}$ as a $3 \times 4$ matrix.
Suppose $A$ is the matrix representation of $D$ with respect to the bases above. Let $A_{i}$ be the $i$-th column of $A$ for $1 \leq i \leq 4$. Then we have
$$
\begin{aligned}
&A_{1}=A \cdot\left[\begin{array}{l}
1 \\
0 \\
0 \\
0
\end{array}\right] \longleftrightarrow D(1)=0 \longleftrightarrow\left[\begin{array}{l}
0 \\
0 \\
0
\end{array}\right] \\
&A_{2}=A \cdot\left[\begin{array}{l}
0 \\
1 \\
0 \\
0
\end{array}\right] \longleftrightarrow D(x)=1 \longleftrightarrow\left[\begin{array}{l}
1 \\
0 \\
0
\end{array}\right] \\
&A_{3}=A \cdot\left[\begin{array}{l}
0 \\
0 \\
1 \\
0
\end{array}\right] \longleftrightarrow D\left(x^{2}\right)=2 x \longleftrightarrow\left[\begin{array}{l}
0 \\
2 \\
0
\end{array}\right] \\
&A_{4}=A \cdot\left[\begin{array}{l}
0 \\
0 \\
0 \\
1
\end{array}\right] \longleftrightarrow D\left(x^{3}\right)=3 x^{2} \longleftrightarrow\left[\begin{array}{l}
0 \\
0 \\
3
\end{array}\right]
\end{aligned}
$$
Therefore, $A=\left[\begin{array}{llll}0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3\end{array}\right]$.
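You can confirm with code that this matrix differentiates coefficient vectors (a sketch, not part of the original solution; the test polynomial is arbitrary):
import numpy as np

A = np.array([[0, 1, 0, 0],
              [0, 0, 2, 0],
              [0, 0, 0, 3]])
p = np.array([5, -1, 2, 4])   # 5 - x + 2x^2 + 4x^3
print(A @ p)                  # [-1  4 12], i.e. -1 + 4x + 12x^2 = p'(x)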
31
74EECS6.390
Introduction to Machine Learning
6.1010/6.1210, 18.06/18.C06
NoneExercise 10State Machines1a0.1041666667Text
For each of the following state machines, provide the output sequence $\left[y_{1}, y_{2}, \ldots, y_{T}\right]$ given the input sequence $\left[x_{1}, x_{2}, \ldots, x_{T}\right]$. Notice that the inputs start with subscript 1:
Input: $[0,1,2,1]$
$s_{0}=0$
$s_{t}=f_{s}\left(s_{t-1}, x_{t}\right)=\max \left(s_{t-1}, x_{t}\right)$
$y_{t}=f_{o}\left(s_{t}\right)=2 \times s_{t}$
Enter a Python list of four numbers:
Expression
[0, 2, 4, 4].
Passing in $x_{1}$:
$s_{0}=0$
$s_{1}=f_{s}\left(s_{0}, x_{1}\right)=\max (0,0)=0$
$y_{1}=f_{o}\left(s_{1}\right)=2 \times 0=0$
Passing in $x_{2}$:
$s_{1}=0$
$s_{2}=f_{s}\left(s_{1}, x_{2}\right)=\max (0,1)=1$
$y_{2}=f_{o}\left(s_{2}\right)=2 \times 1=2$
Similar calculations for $x_{3}$ and $x_{4}$.
You can also solve the problem with code:
def f_s(s, x_t):
    return max(s, x_t)
def f_o(s):
    return s * 2
s = 0
inputs = [0, 1, 2, 1]
outputs = []
for x in inputs:
    s = f_s(s, x)
    y = f_o(s)
    outputs.append(y)
print(outputs)
For each of the following state machines, provide the output sequence $\left[y_{1}, y_{2}, \ldots, y_{T}\right]$ given the input sequence $\left[x_{1}, x_{2}, \ldots, x_{T}\right]$. Notice that the inputs start with subscript 1:
Input: $[0,1,2,1]$
$s_{0}=(0,0)$
$s_{t}=f_{s}\left(s_{t-1}, x_{t}\right)=\left(s_{t-1}[0]+x_{t}, s_{t-1}[1]+1\right)$
$y_{t}=f_{o}\left(s_{t}\right)=s_{t}[0] / s_{t}[1]$
Note that the state is two-dimensional.
Enter a Python list of four numbers:
[0, 0.5, 1, 1].
Passing in $x_{1}$:
$s_{0}=(0,0)$
$s_{1}=f_{s}\left(s_{0}, x_{1}\right)=(0+0,0+1)=(0,1)$
$y_{1}=f_{o}\left(s_{1}\right)=s_{1}[0] / s_{1}[1]=0 / 1=0$
Passing in $x_{2}$:
$s_{1}=(0,1)$
$s_{2}=f_{s}\left(s_{1}, x_{2}\right)=(0+1,1+1)=(1,2)$
$y_{2}=f_{o}\left(s_{2}\right)=s_{2}[0] / s_{2}[1]=1 / 2=0.5$
Similar calculations for $x_{3}$ and $x_{4}$.
You can also solve the problem with code:
def f_s(s, x_t):
    return s[0] + x_t, s[1] + 1
def f_o(s):
    return s[0] / s[1]
s = (0, 0)
inputs = [0, 1, 2, 1]
outputs = []
for x in inputs:
    s = f_s(s, x)
    y = f_o(s)
    outputs.append(y)
print(outputs)
As in the last problem, suppose that $x_{1}(t)$ obeys $x_{1}^{\prime}(t)=-x_{1}(t)$ and $x_{2}(t)$ obeys $x_{2}^{\prime}(t)=-x_{2}(t)^{2}$. Suppose that $x_{1}(0)=x_{2}(0)=1$.
Compute $x_{1}(T)$.
You can just guess it: $x_{1}(T)=e^{-T}$. It satisfies $x_{1}^{\prime}=-x_{1}$ and $x_{1}(0)=1$.
Let $\Sigma=\{0,1,2\}$ be the alphabet for the languages in all parts of this problem.
Let $A=\left\{0^{i} 1^{j} 2^{k} \mid i, j, k \geq 0\right\}$. Give the state diagram of a DFA with 4 states that recognizes $\bar{A}$, the complement of $A$.
Here's a description of the diagram. Put self-loops on each of the states $q_{0}, q_{1}, q_{2}$, and $q_{3}$ with labels $0$, $1$, $2$, and $\{0,1,2\}$ respectively. Additional transitions are: $q_{0}$ to $q_{1}$ labeled $1$, $q_{0}$ to $q_{2}$ labeled $2$, $q_{1}$ to $q_{2}$ labeled $2$, $q_{1}$ to $q_{3}$ labeled $0$, and $q_{2}$ to $q_{3}$ labeled $\{0,1\}$. The start state is $q_{0}$, and $q_{3}$ is the only accept state.
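You can simulate the described DFA with code (a sketch; the transition table below is a direct transcription of the description, using the same state names):
delta = {
    ("q0", "0"): "q0", ("q0", "1"): "q1", ("q0", "2"): "q2",
    ("q1", "0"): "q3", ("q1", "1"): "q1", ("q1", "2"): "q2",
    ("q2", "0"): "q3", ("q2", "1"): "q3", ("q2", "2"): "q2",
    ("q3", "0"): "q3", ("q3", "1"): "q3", ("q3", "2"): "q3",
}

def accepts(w):
    state = "q0"
    for ch in w:
        state = delta[(state, ch)]
    return state == "q3"

print(accepts("0012"))   # False: 0012 is in A, so not in the complement
print(accepts("0201"))   # True: a 0 after a 2 violates the 0*1*2* pattern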
32
382Mathematics18.01Calculus INoneNoneProblem Set 8Variance10a0.1583949314Text
Suppose that we flip a coin a hundred times and count how many heads we get. On average the number of heads is 50. On the other hand, if we flip a coin a hundred times, it's unlikely that we will get exactly 50 heads. Usually the number of heads will be somewhat above average or somewhat below average. The variance helps us understand how far away the number of heads typically is from the mean. (Remember, mean is another word for average.)
If we have a probability distribution for $x$ with mean $M$, then the variance is the average value of $(x-M)^{2}$. Let's illustrate this with an example.
Example. Suppose we flip a coin two times and count the number of heads. The probability distribution for the number of heads is
\begin{tabular}{|c|c|}
\hline Number of heads & Probability \\
\hline 0 & $1 / 4$ \\
\hline 1 & $1 / 2$ \\
\hline 2 & $1 / 4$ \\
\hline
\end{tabular}
The average number of heads is $M=1$. To find the variance, we want to find the average value of (The number of heads $-M)^{2}$. To compute it, we first include this information in our table:
\begin{tabular}{|c|c|c|}
\hline Number of heads & Probability & (The number of heads $-M)^{2}$ \\
\hline 0 & $1 / 4$ & $(0-1)^{2}=1$ \\
\hline 1 & $1 / 2$ & $(1-1)^{2}=0$ \\
\hline 2 & $1 / 4$ & $(2-1)^{2}=1$ \\
\hline
\end{tabular}
The variance is the average value of (The number of heads $-M)^{2}$, which is $(1 / 4) \cdot 1+(1 / 2) \cdot 0+(1 / 4) \cdot 1=2 / 4=1 / 2$.
Suppose we flip a coin three times and count the number of heads. Find the mean and the variance.
Numerical
The mean value $M$ is $0(1 / 8)+1(3 / 8)+2(3 / 8)+3(1 / 8)=3 / 2$.
The variance is $(9 / 4)(1 / 8)+(1 / 4)(3 / 8)+(1 / 4)(3 / 8)+(9 / 4)(1 / 8)=\frac{9+3+3+9}{32}=$ $\frac{24}{32}=3 / 4$.
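You can also check this by brute force with code (a sketch, not part of the original solution), enumerating all 8 equally likely outcomes:
from itertools import product

heads = [sum(flips) for flips in product([0, 1], repeat=3)]
mean = sum(heads) / len(heads)
variance = sum((h - mean) ** 2 for h in heads) / len(heads)
print(mean, variance)   # 1.5 0.75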
Suppose that we flip a coin a hundred times and count how many heads we get. On average the number of heads is 50. On the other hand, if we flip a coin a hundred times, it's unlikely that we will get exactly 50 heads. Usually the number of heads will be somewhat above average or somewhat below average. The variance helps us understand how far away the number of heads typically is from the mean. (Remember, mean is another word for average.)
If we have a probability distribution for $x$ with mean $M$, then the variance is the average value of $(x-M)^{2}$. Let's illustrate this with an example.
Example. Suppose we flip a coin two times and count the number of heads. The probability distribution for the number of heads is
\begin{tabular}{|c|c|}
\hline Number of heads & Probability \\
\hline 0 & $1 / 4$ \\
\hline 1 & $1 / 2$ \\
\hline 2 & $1 / 4$ \\
\hline
\end{tabular}
The average number of heads is $M=1$. To find the variance, we want to find the average value of (The number of heads $-M)^{2}$. To compute it, we first include this information in our table:
\begin{tabular}{|c|c|c|}
\hline Number of heads & Probability & (The number of heads $-M)^{2}$ \\
\hline 0 & $1 / 4$ & $(0-1)^{2}=1$ \\
\hline 1 & $1 / 2$ & $(1-1)^{2}=0$ \\
\hline 2 & $1 / 4$ & $(2-1)^{2}=1$ \\
\hline
\end{tabular}
The variance is the average value of (The number of heads $-M)^{2}$, which is $(1 / 4) \cdot 1+(1 / 2) \cdot 0+(1 / 4) \cdot 1=2 / 4=1 / 2$.
Suppose we flip a coin one time and count the number of heads. Find the mean and the variance.
The mean is $(0)(1 / 2)+(1)(1 / 2)=1 / 2$.
The variance is $(0-1 / 2)^{2}(1 / 2)+$ $(1-1 / 2)^{2}(1 / 2)=1 / 4$.
Finally, we write down a general Gaussian, with an arbitrary mean and variance. A Gaussian with mean $M$ and variance $V$ is given by the following formula: $\frac{1}{\sqrt{2 \pi V}} e^{-\frac{(x-M)^{2}}{2 V}} d x$
As we discussed in class, if we flip a coin 100 times and count the number of heads, the probability distribution for the number of heads is very close to a Gaussian. Find that Gaussian. Recall that if we flip a coin $N$ times, the mean number of heads is $N / 2$, and the variance in the number of heads is $N / 4$.
We want a Gaussian distribution with mean equal to $\frac{100}{2}=50$ and variance equal to $\frac{100}{4}=25$. The Gaussian distribution with mean $M$ and variance $V$ is $\frac{1}{\sqrt{2 \pi V}} e^{-\frac{(x-M)^{2}}{2 V}} d x$. Plug in $M=50$ and $V=25$ to get the distribution
$$
\frac{1}{\sqrt{50 \pi}} e^{-\frac{(x-50)^{2}}{50}} d x
$$
The average value of a probability distribution $f(x) d x$ is $\int_{-\infty}^{\infty} x f(x) d x$. If you imagine a biased spinner which spins a number $x$ according to the probability distribution $f(x) d x$, then this integral is the average value of $x$. The average value of a probability distribution is also called the mean value.
Here is a picture of a probability density function $f(x)$ below.
Suppose that $\int_{-\infty}^{\infty} x f(x) d x=M$ (so the mean value of $f(x) d x$ is $\left.M\right)$. Suppose that $g(x)=f(x-2)$. Find the mean value of the probability distribution $g(x) d x$. In other words, compute $\int_{-\infty}^{\infty} x g(x) d x$. Give your answer in terms of $M$.
Note that
$$
\int_{-\infty}^{\infty} x g(x) d x=\int_{-\infty}^{\infty} x f(x-2) d x .
$$
Do the substitution $u=x-2, d u=d x$ to get
$$
\int_{-\infty}^{\infty}(u+2) f(u) d u .
$$
Split this integral into
$$
\int_{-\infty}^{\infty} u f(u) d u+\int_{-\infty}^{\infty} 2 f(u) d u .
$$
The first integral is the average value of $f(x) d x$, which is $M$. Since $f$ is a probability density function, $\int_{-\infty}^{\infty} f(u) d u=1$. Thus the average value of $g(x) d x$ is $M+2$.
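A numerical illustration with code (a sketch; it assumes for concreteness that $f$ is the standard normal density, so $M=0$, which is not specified in the problem):
import math

def f(x):
    # standard normal density, so the mean M is 0
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def mean_of(density, lo=-20.0, hi=20.0, n=200000):
    h = (hi - lo) / n
    return sum((lo + (i + 0.5) * h) * density(lo + (i + 0.5) * h) for i in range(n)) * h

print(mean_of(f))                     # ~ 0.0 = M
print(mean_of(lambda x: f(x - 2)))    # ~ 2.0 = M + 2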
33
28Mathematics18.600
Probability and Random Variables
18.02NoneProblem Set 3
Cumulative Distribution Function
3b0.25Text
Let $X$ be a random variable with cumulative distribution function $F$.
What is the cumulative distribution function of $aX+b$, where $a$ and $b$ are constants and $a \neq 0$? (Remember that $a$ could be positive or negative.)
Expression
Let $Y = aX + b$.
\textbf{Case 1}: $a > 0$.
$F_Y(c) = P\{Y \leq c\} = P\{aX + b \leq c\} = P\{X \leq \frac{c - b}{a}\} = F_X(\frac{c - b}{a})$.
\textbf{Case 2}: $a < 0$.
$F_Y(c) = P\{Y \leq c\} = P\{aX + b \leq c\} = P\{X \geq \frac{c - b}{a}\} = 1 - P\{X < \frac{c - b}{a}\} = 1 - P\{X \leq \frac{c - b}{a}\} + P\{X = \frac{c - b}{a}\} = 1 - F_X(\frac{c - b}{a}) + p_X(\frac{c - b}{a})$.
Let $X$ be a random variable with cumulative distribution function $F$.
What is the cumulative distribution function of $e^X$?
Let $Y = e^X$.
\textbf{Case 1}: $c > 0$.
$F_Y(c) = P\{Y \leq c\} = P\{e^X \leq c\} = P\{X \leq \ln c\} = F_X(\ln c)$.
\textbf{Case 2}: $c \leq 0$.
$F_Y(c) = 0$.
Let $X$ be a continuous random variable with cumulative distribution function $F$. Define the random variable $Y$ by $Y = F(X)$. Show that $Y$ is uniformly distributed over $(0, 1)$.
Since $F_X$ is a cumulative distribution function of a continuous random variable, $0 \leq F_X \leq 1$ and $0 \leq Y \leq 1$. When $0 < x < 1$, $F_Y(x) = P\{Y \leq x\} = P\{F_X(X) \leq x\} = P\{X \leq F_X^{-1}(x)\} = F_X(F_X^{-1}(x)) = x$. Thus, $Y$ has the cumulative distribution function of a uniform random variable, so $Y$ is uniformly distributed over $(0, 1)$.
Let $X$ have probability density $f_X$. Find the probability density function of the random variable $Y$ defined by $Y = aX + b$.
\textbf{Case 1}: $a = 0$.
In this case $Y = b$ is constant, so $F_Y(c) = 0$ for $c < b$ and $F_Y(c) = 1$ for $c \geq b$; $Y$ has no density function.
\textbf{Case 2}: $a < 0$.
$F_Y(c) = P\{Y \leq c\} = P\{aX + b \leq c\} = P\{X \geq \frac{c - b}{a}\} = 1 - P\{X \leq \frac{c - b}{a}\} = 1 - F_X(\frac{c - b}{a})$, since $X$ has a density and so $P\{X = \frac{c - b}{a}\} = 0$.
$f_Y(t) = \frac{d}{dt}\left(1 - F_X(\frac{t - b}{a})\right) = -\frac{1}{a}f_X(\frac{t-b}{a})$.
\textbf{Case 3}: $a > 0$.
$F_Y(c) = P\{Y \leq c\} = P\{aX + b \leq c\} = P\{X \leq \frac{c - b}{a}\} = F_X(\frac{c - b}{a})$.
$f_Y(t) = \frac{d}{dt} F_X(\frac{t - b}{a}) = \frac{1}{a}f_X(\frac{t-b}{a})$.
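A Monte Carlo sanity check with code (a sketch; it assumes for concreteness that $X$ is standard normal and picks arbitrary values $a<0$ and $b$, neither of which is specified in the problem):
import math, random

a, b = -2.0, 1.0
samples = [a * random.gauss(0, 1) + b for _ in range(200000)]

def f_X(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

for t in (-3.0, 0.0, 2.0):
    width = 0.2
    empirical = sum(1 for y in samples if abs(y - t) < width / 2) / (len(samples) * width)
    formula = f_X((t - b) / a) / abs(a)
    print(t, round(empirical, 4), round(formula, 4))   # the two columns agree closely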
34
417Mathematics18.01Calculus INoneNoneMidterm Exam 1Critical Points3a0.3333333333Text
Suppose that $f(x)=4 x^{3}-6 x^{2}+1$.
Find the values of $x$ where $f^{\prime}(x)=0$. These are called the critical points of $f$.
Numerical
First we calculate $f^{\prime}(x)$. We find that $f^{\prime}(x)=3 \times 4 x^{3-1}-2 \times 6 x^{2-1}+0=12 x^{2}-12 x$. You might be able to see straight away that $12\left(x^{2}-x\right)=0$ at $x=0$ and $x=1$. Otherwise we can solve using the quadratic formula:
$$
x=\frac{-(-12) \pm \sqrt{(-12)^{2}-4 \times 12 \times 0}}{2 \times 12}=\frac{12 \pm \sqrt{12^{2}}}{24}=\frac{12 \pm 12}{24}=0,1
$$
Let $f(x)=(1 / 3) x^{3}-x$.
Find all the values of $x$ where $f^{\prime}(x)=0$. These are called critical points of $f$. For each critical point $x$, compute $f(x)$.
Now $f(x)=x^{3} / 3-x$.
$f^{\prime}(x)=0$ when $x=\pm 1$. At $x=-1, f(x)=2 / 3$. At $x=+1, f(x)=-2 / 3$.
Suppose that $f(x)=4 x^{3}-6 x^{2}+1$.
Label the critical point of $f$ on the x-axis, and then label the places where $f^{\prime}(x)>0$ and where $f^{\prime}(x)<0$.
As we found in part a, $f^{\prime}(x)=12 x^{2}-12 x$. This means that when $x<0$ both $12 x^{2}$ and $-12 x$ are positive, so that $12 x^{2}-12 x>0$. When $0<x<1$ we have by multiplying through by $x$ that $0 \times x<x \times x<1 \times x \Rightarrow x^{2}<x \Rightarrow 12 x^{2}<12 x \Rightarrow 12 x^{2}-12 x<0$. Lastly, when $x>1$ we get that $x^{2}>x$ so that again $12 x^{2}>12 x$ and $12 x^{2}-12 x>0$. The critical points we get from part a.
In the graph below there is a thickened line to indicate critical points. Plus for $f^{\prime}(x)>0$ and minus for $f^{\prime}(x)<0$.
Suppose that $f(x)=4 x^{3}-6 x^{2}+1$.
Compute $f(x)$ for each critical point $x$.
$$
f(0)=4 \times 0^{3}-6 \times 0^{2}+1=0+0+1=1 \text { and } f(1)=4 \times 1^{3}-6 \times 1^{2}+1=4-6+1=-1
$$
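The critical points and their values can be double-checked symbolically; a minimal sketch using sympy:
import sympy as sp

x = sp.symbols('x')
f = 4 * x**3 - 6 * x**2 + 1

crit = sp.solve(sp.diff(f, x), x)    # roots of f'(x) = 12x^2 - 12x
print(crit)                          # [0, 1]
print([f.subs(x, c) for c in crit])  # [1, -1]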
35
51EECS6.411
Representation, Inference, and Reasoning in AI
6.1010, 6.1210, 18.600
NoneProblem Set 1
Monte-Carlo Tree Search
4bi0.1302083333Text
Consider MCTS on a problem where the initial state $s_{0}$ has two actions $a_{0}$ and $a_{1}$. The UCB parameter $C$ is $1.4$. Suppose the search has completed 8 full iterations:
\begin{itemize}
\item It selected action $a_{0}$ 3 times, receiving utilities $[0.1,0.7,0.3]$.
\item It selected action $a_{1}$ 5 times, receiving utilities $[0.4,0.4,0.4,0.4,0.4]$.
\end{itemize}
On the 9th iteration, what is the UCB value for action $a_{0}$ when expanding from $s_{0}$? (Enter a number accurate to 2 decimal places).
Numerical1.532.
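This value is straightforward to reproduce, assuming the standard UCB1 form $\mathrm{UCB}(a)=\bar{u}(a)+C\sqrt{\ln N / n(a)}$ with $N$ the total number of completed iterations; a sketch:
import math

# UCB(a) = mean_utility(a) + C * sqrt(ln(N) / n(a)), N = total iterations so far.
C = 1.4
utilities = {'a0': [0.1, 0.7, 0.3], 'a1': [0.4, 0.4, 0.4, 0.4, 0.4]}
N = sum(len(u) for u in utilities.values())  # 8 completed iterations

for action, u in utilities.items():
    ucb = sum(u) / len(u) + C * math.sqrt(math.log(N) / len(u))
    print(action, round(ucb, 3))  # a0 -> 1.532, a1 -> 1.303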
Consider MCTS on a problem where the initial state $s_{0}$ has two actions $a_{0}$ and $a_{1}$. The UCB parameter $C$ is $1.4$. Suppose the search has completed 8 full iterations:
\begin{itemize}
\item It selected action $a_{0}$ 3 times, receiving utilities $[0.1,0.7,0.3]$.
\item It selected action $a_{1}$ 5 times, receiving utilities $[0.4,0.4,0.4,0.4,0.4]$.
\end{itemize}
On the 9th iteration, what is the UCB value for action $a_{1}$ when expanding from $s_{0}$? (Enter a number accurate to 2 decimal places).
1.303.
Consider MCTS on a problem where the initial state $s_{0}$ has two actions $a_{0}$ and $a_{1}$. The UCB parameter $C$ is $1.4$. Suppose the search has completed 8 full iterations:
\begin{itemize}
\item It selected action $a_{0}$ 3 times, receiving utilities $[0.1,0.7,0.3]$.
\item It selected action $a_{1}$ 5 times, receiving utilities $[0.4,0.4,0.4,0.4,0.4]$.
\end{itemize}
Which of the two actions will be selected on the 9th iteration?
(a) $a_{0}$
(b) $a_{1}$
(a) $a_{0}$
Consider a tiny MDP with states $(0,1,2,3)$ and actions $(b, c)$.
Given the reward and transition functions below with an infinite horizon and a discount factor of $0.9$, compute three iterations of value iteration. Don't assume a particular policy. Assume that:
\begin{itemize}
\item All the value estimates start at 0: meaning, at iteration $0, Q(s, a)=0$ for all $s, a$ pairs.
\item You operate synchronously: that is, on iteration $t$ of value iteration, you only use values that were computed on iteration $t-1$.
\end{itemize}
We recommend you compute the Q-value iteration by hand to get a better understanding of the algorithm.
For each iteration, enter eight numbers corresponding to our value function estimate, expressed as
$$
[Q(0, b), Q(0, c), Q(1, b), Q(1, c), Q(2, b), Q(2, c), Q(3, b), Q(3, c)]
$$
at that iteration, accurate to three decimal places.
Here are the reward and transition functions:
$$
\begin{gathered}
R(s, a)=\left\{\begin{array}{lll}
1 & \text { if } s=1 \\
2 & \text { if } s=3 \\
0 & \text { otherwise }
\end{array}\right. \\
T\left(s_{t}, \mathrm{~b}, s_{t+1}\right)=\left[\begin{array}{llll}
0.0 & 0.9 & 0.1 & 0.0 \\
0.9 & 0.1 & 0.0 & 0.0 \\
0.0 & 0.0 & 0.1 & 0.9 \\
0.9 & 0.0 & 0.0 & 0.1
\end{array}\right] \\
T\left(s_{t}, \mathrm{c}, s_{t+1}\right)=\left[\begin{array}{llll}
0.0 & 0.1 & 0.9 & 0.0 \\
0.9 & 0.1 & 0.0 & 0.0 \\
0.0 & 0.0 & 0.1 & 0.9 \\
0.9 & 0.0 & 0.0 & 0.1
\end{array}\right]
\end{gathered}
$$
After the third iteration, what action would you select in state 0?
Action c.
$Q(0, c)>Q(0, b)$.
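A minimal sketch of the synchronous Q-value iteration described above, assuming the standard update $Q(s, a) \leftarrow R(s, a)+\gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')$, with rows of $T$ indexing $s_t$ and columns indexing $s_{t+1}$:
import numpy as np

gamma = 0.9
R = np.array([0.0, 1.0, 0.0, 2.0])  # R(s, a) depends only on s here
T = {'b': np.array([[0.0, 0.9, 0.1, 0.0],
                    [0.9, 0.1, 0.0, 0.0],
                    [0.0, 0.0, 0.1, 0.9],
                    [0.9, 0.0, 0.0, 0.1]]),
     'c': np.array([[0.0, 0.1, 0.9, 0.0],
                    [0.9, 0.1, 0.0, 0.0],
                    [0.0, 0.0, 0.1, 0.9],
                    [0.9, 0.0, 0.0, 0.1]])}

Q = {a: np.zeros(4) for a in 'bc'}
for it in range(1, 4):
    V = np.maximum(Q['b'], Q['c'])               # max_a' Q(s', a') from iteration it-1
    Q = {a: R + gamma * T[a] @ V for a in 'bc'}  # synchronous Bellman update
    print(it, np.round(Q['b'], 3), np.round(Q['c'], 3))
# After iteration 3, Q(0, c) > Q(0, b), so action c is selected in state 0.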
36
32EECS18.C06
Linear Algebra and Optimization
18.02NoneProblem Set 2Vector Spaces2c0.06172839506Text
Is the following set a vector space (with the usual definitions of multiplication by real scalars and addition)?
The set of vectors that solve the equation $A x=b$ for some fixed nonzero $b$.
Open
False. In general, this set does not include the origin (the zero vector is not a solution, since $A \cdot 0=0 \neq b$), so it cannot be a subspace.
Is the following set a vector space (with the usual definitions of multiplication by real scalars and addition)?
Given a subspace $V$ of $\mathbb{R}^{n}$ and an $m \times n$ matrix $A$, the set of all vectors $A x$, where $x \in V$.
True. If we take two vectors $u=A x$ and $v=A y$ with $x, y \in V$, then $u+v$ is also in the set because $u+v=A x+A y=A(x+y)$ and $x+y \in V$ (as $V$ is a subspace). A similar argument works for scalar multiplication.
Is the following set a vector space (with the usual definitions of multiplication by real scalars and addition)?
The set of $m \times n$ real matrices.
True. You can always multiply a matrix by a scalar. And as long as the dimensions match, you can add matrices too. Thus the axioms of being a vector space are satisfied.
Is the following set a vector space (with the usual definitions of multiplication by real scalars and addition)?
The set of vectors in $\mathbb{R}^{5}$ whose first two coordinates are equal.
True. Suppose $x$ and $y$ are in the set and their first and second coordinates are $a$ and $b$ respectively. Then the first and second coordinate of $x+y$ are both $a+b$. Similarly scalar multiplication does not change the fact that the first and second coordinates match. Thus the axioms of being a vector space are satisfied.
37
416Mathematics18.01Calculus INoneNoneMidterm Exam 1
Linear Approximation
2b0.8333333333Text
Approximate the cube root of the number $8.24$. Is the true value closest to 2.02 or $2.03$ or $2.04$ or $2.06 ?$ Explain your reasoning.
Open
As above, we have the linear approximation $(2+\Delta x)^{3} \approx 8+12 \Delta x$. We are looking for $(2+\Delta x)^{3} \approx 8.24$ so we solve $8.24 \approx 8+12 \Delta x \Rightarrow .24=12 \Delta x \Rightarrow \Delta x=.02$. This gives us $\sqrt[3]{8.24} \approx 2+.02=2.02$.
Approximate $(2.01)^{3}$. Is the true value closest to $8.04$ or $8.06$ or $8.08$ or $8.12$ ? Briefly explain your reasoning.
We have that $(2.01)^{3}=(2+.01)^{3}$. So we use linear approximation of $f(x)=x^{3}$ near $x=2$. We use that $f^{\prime}(x)=\left(x^{3}\right)^{\prime}=3 x^{2}$. This gives $f(2+\Delta x) \approx f(2)+f^{\prime}(2) \Delta x=$ $2^{3}+3 \times 2^{2} \Delta x=8+12 \Delta x$. Since $\Delta x=.01$ this is $8+.12=8.12$.
Using the linear approximation of $x^{2}$ around 2, estimate the square root of $4.1$.
Here $f^{\prime}(x)=2 x$, so $f^{\prime}(2)=4$. Thus, the linear approximation to $f(x)$ around $x=2$ is
$$
f(2+\Delta x) \approx \underbrace{4}_{f(2)}+\underbrace{4}_{f^{\prime}(2)} \cdot \Delta x .
$$
Here you still use $f(x)=x^{2}$ but invert the linear approximation. To make $4+4 \Delta x$ equal 4.1, the $4 \Delta x$ must be 0.1. So, $\Delta x=0.025$. Thus,
$$
\sqrt{4.1} \approx 2.025 .
$$
Using the linear approximation of $x^{2}$ around $x=2$, estimate the square root of $3.9$.
Again we use
$$
f(2+\Delta x) \approx 4+4 \Delta x .
$$
Now invert the linear approximation to find $\Delta x$.
$$
\begin{aligned}
& f(2+\Delta x) \approx 4+4 \Delta x=3.9, \\
& \text { so } 4 \Delta x=-0.1 \text {, and } \\
& \Delta x=-0.025 .
\end{aligned}
$$
In other words,
$$
(2-0.025)^{2} \approx 3.9 .
$$
Taking the square root of both sides,
$$
\sqrt{3.9} \approx 2-0.025=1.975 \text {. }
$$
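A quick numeric check of all four approximations (cubing or squaring the estimates directly):
# Cube or square the linear-approximation estimates and compare.
print(2.02 ** 3)   # 8.242408 -> cube root of 8.24 is closest to 2.02
print(2.01 ** 3)   # 8.120601 -> close to the estimate 8.12
print(2.025 ** 2)  # 4.100625 -> close to 4.1
print(1.975 ** 2)  # 3.900625 -> close to 3.9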
38
85EECS6.411
Representation, Inference, and Reasoning in AI
6.1010, 6.1210, 18.600
NoneProblem Set 2
Propositional Logic
3biii0.0744047619Text
Consider a domain with propositions $\mathrm{A}, \mathrm{B}, \mathrm{C}$, and $\mathrm{D}$, and the particular model $m=\{A=t, B=f, C=t, D=f\}$. For each of these sentences, indicate whether it is valid, unsatisfiable, not valid but true in $\mathrm{m}$, or not unsatisfiable but false in $\mathrm{m}$.
$$
B \Rightarrow C \wedge D
$$
Multiple Choicenot valid, but true in m.
Consider a domain with propositions $\mathrm{A}, \mathrm{B}, \mathrm{C}$, and $\mathrm{D}$, and the particular model $m=\{A=t, B=f, C=t, D=f\}$. For each of these sentences, indicate whether it is valid, unsatisfiable, not valid but true in $\mathrm{m}$, or not unsatisfiable but false in $\mathrm{m}$.
$A \Rightarrow C \wedge D$
not unsatisfiable, but false in m.
Consider a domain with propositions $\mathrm{A}, \mathrm{B}, \mathrm{C}$, and $\mathrm{D}$, and the particular model $m=\{A=t, B=f, C=t, D=f\}$. For each of these sentences, indicate whether it is valid, unsatisfiable, not valid but true in $\mathrm{m}$, or not unsatisfiable but false in $\mathrm{m}$.
$$
\begin{aligned}
& (A \wedge C) \Leftrightarrow(B \wedge D)\\
\end{aligned}
$$
not unsatisfiable, but false in m.
Consider a domain with propositions $\mathrm{A}, \mathrm{B}, \mathrm{C}$, and $\mathrm{D}$, and the particular model $m=\{A=t, B=f, C=t, D=f\}$. For each of these sentences, indicate whether it is valid, unsatisfiable, not valid but true in $\mathrm{m}$, or not unsatisfiable but false in $\mathrm{m}$.
$$
D \Leftrightarrow \neg D
$$
unsatisfiable.
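These classifications can be verified mechanically by enumerating all 16 truth assignments; a minimal sketch (the Python encodings of the four sentences are mine):
from itertools import product

# Classify each sentence by enumerating all 16 models of A, B, C, D.
m = {'A': True, 'B': False, 'C': True, 'D': False}
sentences = {
    'B => (C and D)':          lambda A, B, C, D: (not B) or (C and D),
    'A => (C and D)':          lambda A, B, C, D: (not A) or (C and D),
    '(A and C) <=> (B and D)': lambda A, B, C, D: (A and C) == (B and D),
    'D <=> not D':             lambda A, B, C, D: D == (not D),
}

for name, s in sentences.items():
    truths = [s(*vals) for vals in product([True, False], repeat=4)]
    if all(truths):
        verdict = 'valid'
    elif not any(truths):
        verdict = 'unsatisfiable'
    else:
        verdict = 'true in m' if s(**m) else 'false in m'
    print(name, '->', verdict)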
39
61Mathematics18.6
Probability and Random Variables
18.02NoneProblem Set 5Probability6b0.35Text
Suppose that $A$, $B$, $C$, are independent random variables, each being uniformly distributed over $(0,1)$.
What is the probability that all of the roots of the equation $Ax^2 + Bx + C = 0$ are real?
Numerical
The roots of the equation $Ax^2 + Bx + C = 0$ are real if $B^2 - 4AC \geq 0$. The desired probability, $P\{B^2 - 4AC \geq 0\}$, is obtained as follows
$P\{B^2 - 4AC \geq 0\} = 1 - P\{B^2 - 4AC < 0\} = 1 - \iiint_{B^2 - 4AC < 0} f(x, y, z) \,dx\,dy\,dz = 1 - \int_{0}^{1} \int_{\frac{y^2}{4}}^{1} \int_{\frac{y^2}{4z}}^{1} \,dx\,dz\,dy = 1 - \int_{0}^{1} \int_{\frac{y^2}{4}}^{1} (1 - \frac{y^2}{4z}) \,dz\,dy = 1 - \int_{0}^{1} (1 - \frac{y^2}{4} + \frac{y^2}{4}\ln({\frac{y^2}{4}})) \,dy = \frac{1}{6}\ln({2}) + \frac{5}{36}$.
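A Monte Carlo estimate agrees with the closed form $\frac{1}{6}\ln 2 + \frac{5}{36} \approx 0.2544$; a sketch:
import numpy as np

# Monte Carlo check of P{B^2 - 4AC >= 0} with A, B, C ~ Uniform(0, 1).
rng = np.random.default_rng(0)
A, B, C = rng.random((3, 1_000_000))
print(np.mean(B**2 - 4 * A * C >= 0))  # ~0.254
print(np.log(2) / 6 + 5 / 36)          # 0.25441...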
Suppose that $A$, $B$, $C$, are independent random variables, each being uniformly distributed over $(0,1)$.
What is the joint cumulative distribution function of $A$, $B$, $C$?
\[
F_{A, B, C}(x, y, z) = P\{A \leq x, B \leq y, C \leq z\} =
\begin{cases}
0 & \mbox{if } x < 0 \mbox{ or } y < 0 \mbox{ or } z < 0\\
\min(1, x) \, \min(1, y) \, \min(1, z) & \mbox{otherwise}
\end{cases}
\]
Suppose that $A$, $B$, and $C$ are independent random variables, where
• $A$ and $B$ are normal with mean 2 and variance 1.
• $C$ is uniform on $(0,3)$.
Let $X=4 A-3 B$, and let $Y=A-2 B+2 C$.
Compute $P(X>0)$. You should write your answer as $\Phi(a)$ for some $a>0$, where $\Phi(x)=\int_{-\infty}^{x} \frac{1}{\sqrt{2 \pi}} e^{-t^{2} / 2} d t$ is the CDF of a standard normal random variable.
$X$ is normal with $E[X]=4 E[A]-3 E[B]=4 \cdot 2-3 \cdot 2=2$ and (using the fact that $A$ and $B$ are independent) $\operatorname{Var}(X)=4^{2} \cdot \operatorname{Var}(A)+(-3)^{2} \operatorname{Var}(B)=16 \cdot 1+9 \cdot 1=25=5^{2}$. We can thus write $X$ as $5 Z+2$, where $Z$ is a standard normal, so
$$
P(X>0)=P(5 Z+2>0)=P\left(Z>-\frac{2}{5}\right)
$$
By the symmetry of the normal distribution,
$$
P\left(Z>-\frac{2}{5}\right)=P\left(Z<\frac{2}{5}\right)=P\left(Z \leq \frac{2}{5}\right)=\boldsymbol{\Phi}\left(\frac{\mathbf{2}}{\mathbf{5}}\right) .
$$
Alternatively, instead of using symmetry, one could obtain this answer by noting that $-Z$ is also a standard normal, and $P\left(Z>-\frac{2}{5}\right)=P\left(-Z<\frac{2}{5}\right)$.
Suppose that $A$, $B$, and $C$ are independent random variables, where
• $A$ and $B$ are normal with mean 2 and variance 1.
• $C$ is uniform on $(0,3)$.
Let $X=4 A-3 B$, and let $Y=A-2 B+2 C$.
Compute the CDF of the variable $W=\min (A, B)$.
We first note that $A$ and $B$ have the same distribution, and $A-2$ and $B-2$ are both standard normal random variables, so
$$
F_{A}(t)=F_{B}(t)=P(B \leq t)=P(B-2 \leq t-2)=\Phi(t-2) .
$$
Using the fact that $A$ and $B$ are independent, we can thus write the CDF of $W$ as
$$
\begin{aligned}
F_{W}(t) & =P(W \leq t)=P(\min (A, B) \leq t)=1-P((A>t) \cap(B>t)) \\
& =1-P(A>t) P(B>t)=1-(1-P(A \leq t))(1-P(B \leq t)) \\
& =1-\left(1-F_{A}(t)\right)\left(1-F_{B}(t)\right)=\mathbf{1}-(\mathbf{1}-\mathbf{\Phi}(\mathbf{t}-\mathbf{2}))^{\mathbf{2}} .
\end{aligned}
$$
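A simulation check of this CDF (a sketch; scipy's norm.cdf plays the role of $\Phi$):
import numpy as np
from scipy.stats import norm

# Check F_W(t) = 1 - (1 - Phi(t - 2))^2 for W = min(A, B), A, B independent N(2, 1).
rng = np.random.default_rng(0)
A = rng.normal(2, 1, 1_000_000)
B = rng.normal(2, 1, 1_000_000)
W = np.minimum(A, B)
for t in (0.5, 1.5, 2.5):
    print(t, np.mean(W <= t), 1 - (1 - norm.cdf(t - 2)) ** 2)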
40
48EECS6.191
Computation Structures
6.100A, 8.02None
Prelab Questions 4
Conditionals2b0.05Text
Now consider the following code.
function Bit#(1) conditional(Bit#(1) a, Bit#(1) b);
    Bit#(1) ret = 0;
    Integer c = 1;
    if (c == 1) begin
        ret = comb(a, b);
    end else begin
        ret = comb2(a, b);
    end
    return ret;
endfunction
How many instances of comb and comb2 will be synthesized for this specification of conditional?
(a) 1 comb and 0 comb2.
(b) 0 comb and 1 comb2.
(c) 1 comb and 1 comb2.
Multiple Choice
(a) 1 comb and 0 comb2.
In this function, the type Integer is used for the c variable. Integer types cannot be synthesized into hardware, so their values must be known at compile time, allowing each Integer to be replaced with an actual value. In this Minispec code, the Integer c is defined to be 1. Thus c is replaced by 1, and the conditional statement is simplified to instantiate only the if branch: the comb circuit is synthesized but the comb2 circuit is not.
Consider the following code.
function Bit#(1) conditional(Bit#(1) a, Bit#(1) b, Bool c);
    Bit#(1) ret = 0;
    if (c) begin
        ret = comb(a, b);
    end else begin
        ret = comb2(a, b);
    end
    return ret;
endfunction
How many instances of comb and comb2 will be synthesized for the conditional circuit?
(a) 1 comb and 0 comb2.
(b) 0 comb and 1 comb2.
(c) 1 comb and 1 comb2.
(d) depends on what the value of c is.
(c) 1 comb and 1 comb2.
It is important to note that Minispec conditional code does not work like conditional code in a software program. In a software program only one of the two conditional paths will be executed. In hardware, both paths must be synthesized and then a mux is used to select the output of one of the two paths. At synthesis time, it is unknown what the value of c is so the hardware must be able to execute both conditional paths.
Consider the following code.
function Bit#(1) for_loop(Bit#(4) a);
    Bit#(1) ret = a[0];
    for (Integer i = 1; i < 4; i = i + 1) begin
        ret = comb(ret, a[i]);
    end
    return ret;
endfunction
How many instances of comb will be synthesized for the for_loop circuit?
(a) 1 comb.
(b) 3 comb.
(c) depends on what the value of a is.
(b) 3 comb.
Minispec unrolls loops. This means that it effectively replaces the for loop with multiple instances of the code inside the for loop. The effect of this is that for each value of i, there is an additional instantiation of the comb circuit.
Consider the following code
function Bit#(1) complex_circuit(Bit#(1) a, Bit#(1) b, Bit#(1) c, Bit#(1) d);
    return comb(comb(comb(a, b), c), d);
endfunction
where comb is an unspecified combinational circuit that takes in 2 1-bit inputs and outputs a 1-bit output.
The following longer code is equivalent to the code above.
function Bit#(1) complex_circuit(Bit#(1) a, Bit#(1) b, Bit#(1) c, Bit#(1) d);
    Bit#(1) ab = comb(a, b);
    Bit#(1) abc = comb(ab, c);
    Bit#(1) abcd = comb(abc, d);
    return abcd;
endfunction
You might find the longer code more helpful to answer the following questions.
Now suppose comb is defined to be an xor gate, which is known to be an associative function. What circuit will Minispec produce pre-optimization?
(a)
(b)
(a).
The idea here is that the serial circuit implementation is directly encoded in how the circuit's variables are defined, regardless of the underlying functionality of comb. So, pre-optimization, the answer must remain the same. The second picture is achievable only after the optimizer looks at the circuit and realizes that comb is actually associative and can be combined in a tree-like manner.
However, the optimizer may not be able to recognize all associative combinational circuits. Hence, it is always best to describe your circuits in Minispec in a way that can more easily lead to optimized circuit implementations.
41
15EECS6.100A
Introduction to Computer Science Programming in Python
NoneNoneProblem Set 2Hangman1a1.875Text
Implement the function has_player_won according to its docstrings. This function will be useful in determining when the hangman game has been won (i.e. the user has guessed all the letters in the secret word).
Example Usage:
>>> secret_word = 'apple'
>>> letters_guessed = ['e', 'i', 'k', 'p', 'r', 's']
>>> print(has_player_won(secret_word, letters_guessed))
False
Testing: Navigate to the test_ps2_student.py file and run it in Spyder. This will run a series of unit tests on your code. Note that this file contains tests for functions you will implement later on in this pset, so not all of them will pass right away. Examine the tests that start with test_has_player_won. If your function is correct, you should see the following printout:
test_has_player_won (__main__.TestPS2) ... ok
test_has_player_won_empty_list (__main__.TestPS2) ... ok
test_has_player_won_empty_string (__main__.TestPS2) ... ok
test_has_player_won_repeated_letters (__main__.TestPS2) ... ok
def has_player_won(secret_word, letters_guessed):
    '''
    secret_word: string, the lowercase word the user is guessing
    letters_guessed: list (of lowercase letters), the letters that have been
        guessed so far
    returns: boolean, True if all the letters of secret_word are in letters_guessed,
        False otherwise
    '''
    # FILL IN YOUR CODE HERE AND DELETE "pass"
    pass
Programming
def has_player_won(secret_word, letters_guessed):
    '''
    secret_word: string, the lowercase word the user is guessing
    letters_guessed: list (of lowercase letters), the letters that have been
        guessed so far
    returns: boolean, True if all the letters of secret_word are in letters_guessed,
        False otherwise
    '''
    for character in secret_word:
        if character not in letters_guessed:
            return False
    return True
Next, implement the function get_word_progress according to its docstrings. This should be fairly similar to
has_player_won.
Hint: Think about...
• if you need to store information as you loop over a data structure
• how you want to add information to your accumulated result
Example Usage:
>>> secret_word = 'apple'
>>> letters_guessed = ['e', 'i', 'k', 'p', 'r', 's']
>>> print(get_word_progress(secret_word, letters_guessed))
+pp+e
Testing: Run test_ps2_student.py. If your function is correct, the test printout should read:
test_get_word_progress (__main__.TestPS2) ... ok
test_get_word_progress_empty_list (__main__.TestPS2) ... ok
test_get_word_progress_empty_string (__main__.TestPS2) ... ok
test_get_word_progress_repeated_letters (__main__.TestPS2) ... ok
def get_word_progress(secret_word, letters_guessed):
    '''
    secret_word: string, the lowercase word the user is guessing
    letters_guessed: list (of lowercase letters), the letters that have been
        guessed so far
    returns: string, comprised of letters and plus signs (+) that represents
        which letters in secret_word have not been guessed so far
    '''
    # FILL IN YOUR CODE HERE AND DELETE "pass"
    pass
def get_word_progress(secret_word, letters_guessed):
    '''
    secret_word: string, the lowercase word the user is guessing
    letters_guessed: list (of lowercase letters), the letters that have been
        guessed so far
    returns: string, comprised of letters and plus signs (+) that represents
        which letters in secret_word have not been guessed so far
    '''
    display_word = []
    for character in secret_word:
        display_word.append("+")
    for i in range(len(secret_word)):
        if secret_word[i] in letters_guessed:
            display_word[i] = secret_word[i]
    return "".join(display_word)
1. The secret_word along with the boolean with_help are passed into the hangman function as parameters.
2. At the start of the game, display how many letters the computer's word contains.
3. Users start with 10 guesses.
Example Game Implementation:
Loading word list from file...
55900 words loaded.
Welcome to Hangman!
I am thinking of a word that is 4 letters long.
1. Before each guess, you should display to the user:
• Some dashes (--------------) to separate individual guesses from each other. Leaving out the row of dashes will cause the tester to fail - however, just make sure the number of dashes is at least 3.
• How many guesses they have remaining
• All the letters that have not yet been guessed
2. Ask the user to supply one guess at a time.
• The user can type any number, symbol, or letter. Your code should only accept capital and lowercase single letters as valid guesses!
• If the game is played with help, your code should also accept the help character (!)
3. Immediately after each guess, you should display:
• Whether or not the letter is in the secret word
• The word with guessed letters revealed and unguessed letters as plus signs ( + )
Example Game Implementation:
Loading word list from file...
55900 words loaded.
Welcome to Hangman!
I am thinking of a word that is 4 letters long.
--------------
You have 10 guesses left.
Available letters: abcdefghijklmnopqrstuvwxyz
Please guess a letter: a # This is the user input
Good guess: +a++
--------------
You have 10 guesses left.
Available letters: bcdefghijklmnopqrstuvwxyz
Please guess a letter: b # This is the user input
Oops! That letter is not in my word: +a++
--------------
You have 9 guesses left.
Available letters: cdefghijklmnopqrstuvwxyz
Please guess a letter: 2 # This is the user input
Oops! That is not a valid letter. Please input a letter from
the alphabet: +a++
--------------
You have 9 guesses left.
Available letters: cdefghijklmnopqrstuvwxyz
Please guess a letter: foo # This is the user input
Oops! That is not a valid letter. Please input a letter from
the alphabet: +a++
--------------
You have 9 guesses left.
Available letters: cdefghijklmnopqrstuvwxyz
Please guess a letter: & # This is the user input
Oops! That is not a valid letter. Please input a letter from
the alphabet: +a++
Hints:
1. Use calls to the input() function to get the user's guess.
• Check that the user input is an alphabet letter (or the help character if the game is played with help).
• If the user does not input a valid letter/character, tell them they can only input a letter from the alphabet.
2. Since the words in words.txt are lowercase, we suggest converting user input to lowercase so the program only needs to handle lowercase characters.
3. You may find the string functions str.isalpha() and str.lower() helpful! You can type help(str.isalpha) or help(str.lower) in the Spyder shell to see documentation for the functions.
>>> my_string = "HeLLoWoRlD"
>>> my_string.isalpha()
True
>>> my_string.lower()
'helloworld'
If the user inputs:
1. Anything besides a letter in the alphabet (e.g. symbols or numbers), tell the user that they can only input an alphabet letter. The user loses no guesses. Note: When the game is being played with help, the '!' is also a valid input.
2. A letter that has already been guessed, print a message telling the user the letter has already been guessed before. The user loses no guesses.
3. Any letter that hasn't been guessed before and the letter is in the secret word, the user loses no guesses.
4. Consonants: If the user inputs a consonant that hasn't been guessed and the consonant is not in the secret word, the user loses one guess.
5. Vowels: If the user inputs a vowel that hasn't been guessed and the vowel is not in the secret word, the user loses two guesses. Vowels are a, e, i, o, and u. The letter y does not count as a vowel. Note: if a user inputs an incorrect vowel that hasn't been guessed and there is only one guess remaining, the user loses and the game is over.
Example Game Implementation (continued):
You have 9 guesses left.
Available letters: bcdefghijklmnopqrtuvwxyz
Please guess a letter: t
Good guess: ta+t
--------------
You have 9 guesses left.
Available letters: bcdefghijklmnopqruvwxyz
Please guess a letter: e
Oops! That letter is not in my word: ta+t
--------------
You have 7 guesses left.
Available letters: bcdfghijklmnopqruvwxyz
Please guess a letter: e
Oops! You've already guessed that letter: ta+t
It isn't always easy to beat the computer, especially when it selects an esoteric word. It might be nice if you could ask for some help.
To do this you will create a feature of the game that works as follows:
• If you type the special character "!", the computer will provide you with one of the missing letters in the secret word at a cost of three guesses. This should be the only non-letter-character input that your game accepts as a guess.
• If you do not have at least three guesses remaining, the computer will warn you of this and let you try again. You lose no guesses.
Note: The user can play the game with this feature only when the with_help parameter is True.
As a starting point, we suggest writing a helper function that chooses a letter to reveal. It should take two arguments: the secret word and the string of available letters (from get_available_letters). This helper function should create a string choose_from, containing the unique letters that are in both the secret word and the available letters. You can then use the
following statements to pick a random character revealed_letter from that string:
new = random.randint(0, len(choose_from)-1)
revealed_letter = choose_from[new]
Your helper function should then return this revealed_letter. Back in your original game logic, you'll need to add a conditional statement to catch the case of the user inputting "!". This case, if triggered, can add the letter returned by your helper function to letters_guessed, show the new guessed word, decrement the remaining guesses by 3, and continue the gameplay.
Example Implementation:
Welcome to Hangman!
I am thinking of a word that is 7 letters long.
--------------
You currently have 10 guesses left.
Available letters: abcdefghijklmnopqrstu
Please guess a letter: !
Letter revealed: r
r+++++r
--------------
You currently have 7 guesses left.
Available letters: abcdefghijklmnopqrstu
Please guess a letter: !
Letter revealed: a
ra+++ar
--------------
You currently have 4 guesses left.
Available letters: abdefghijklmnopqrstu
Please guess a letter: !
Letter revealed: e
ra+e+ar
--------------
You currently have 1 guess left.
Available letters: abdefghijklmnopqstu
Please guess a letter: !
Oops! Not enough guesses left: ra+e+ar
Please refer to the appendix at the end of this handout for an example of a complete game of hangman with help.
1. The game ends when the user guesses all the letters in secret_word or has 0 guesses.
2. If the user wins, print a congratulatory message, and tell the user their score.
• Total score = (4 * number of unique letters in secret_word * guesses_remaining) + (2 * length of secret_word)
• Example: For a game with secret word “asleep” with 6 guesses remaining, there are a total of 5 unique letters (i.e. 'a', 's', 'l', 'e', and 'p'). Then, the final score is: (4 * 5 * 6) + (2 * 6) = 132.
3. If the player runs out of guesses before completing the word, tell them they lost and reveal the word to the user when the game ends.
Example Implementation (win):
You have 5 guesses left.
Available letters: abcgnqrstuvwxyz
Please guess a letter: n
Good guess: dolphin
--------------
Congratulations, you won!
Your total score for this game is: 154
Example Implementation (Lose):
You have 1 guess left.
Available Letters: ghijklmnopqrstuvwxyz
Please guess a letter: i
Oops! That letter is not in my word: e++e
--------------
Sorry, you ran out of guesses. The word was else.
Look carefully at the example hangman games in the handout appendix and make your print statements as close to the
example games as possible! If you run into issues, try consulting the debugging hints.
If you scroll to the bottom of hangman.py, you will see the lines below:
if __name__ == "__main__":
    # secret_word = choose_word(wordlist)
    # with_help = False
    # hangman(secret_word, with_help)
Uncomment the bottom three lines to choose a random secret word and play hangman with the provided secret word. Feel free to pass in your own secret word when testing your program.
2.6.1) Student Tester
In order to test if your game runs properly, please run test_ps2_student.py. If your function is correct, you should see the following in the test printout:
test_play_game_short (__main__.TestPS2) ... ok
test_play_game_short_fail (__main__.TestPS2) ... ok
test_play_game_with_help (__main__.TestPS2) ... ok
You might see some additional messages printed out between the ... and the ok. For example, you might see the following:
Problem Set 2 Unit Test Results:
All correct!
Points for these tests: 5/5
(Please note that this is not your final pset score, additional test cases will be run on submissions)
ok
This is fine.
Appendix
Hangman Example (Winning Game)
Loading word list from file...
55900 words loaded.
Welcome to Hangman!
I am thinking of a word that is 4 letters long.
--------------
You have 10 guesses left.
Available letters: abcdefghijklmnopqrstuvwxyz
Please guess a letter: a
Good guess: +a++
--------------
You have 10 guesses left.
Available letters: bcdefghijklmnopqrstuvwxyz
Please guess a letter: a
Oops! You've already guessed that letter: +a++
--------------
You have 10 guesses left.
Available letters: bcdefghijklmnopqrstuvwxyz
Please guess a letter: s
Oops! That letter is not in my word: +a++
--------------
You have 9 guesses left.
Available letters: bcdefghijklmnopqrtuvwxyz
Please guess a letter: $
Oops! That is not a valid letter. Please input a letter from the alphabet: +a++
--------------
You have 9 guesses left.
Available letters: bcdefghijklmnopqrtuvwxyz
Please guess a letter: t
Good guess: ta+t
--------------
You have 9 guesses left.
Available letters: bcdefghijklmnopqruvwxyz
Please guess a letter: e
Oops! That letter is not in my word: ta+t
--------------
You have 7 guesses left.
Available letters: bcdfghijklnopquvwxyz
Please guess a letter: c
Good guess: tact
--------------
Congratulations, you won!
Your total score for this game is: 92
Hangman Example (Losing Game)
Loading word list from file...
55900 words loaded.
Welcome to Hangman!
I am thinking of a word that is 4 letters long
--------------
You have 10 guesses left.
Available Letters: abcdefghijklmnopqrstuvwxyz
Please guess a letter: a
Oops! That letter is not in my word: ++++
--------------
You have 8 guesses left.
Available Letters: bcdefghijklmnopqrstuvwxyz
Please guess a letter: b
Oops! That letter is not in my word: ++++
--------------
You have 7 guesses left.
Available Letters: cdefghijklmnopqrstuvwxyz
Please guess a letter: c
Oops! That letter is not in my word: ++++
--------------
You have 6 guesses left.
Available Letters: defghijklmnopqrstuvwxyz
Please guess a letter: 2
Oops! That is not a valid letter. Please input a letter from the alphabet: ++++
--------------
You have 6 guesses left.
Available Letters: defghijklmnopqrstuvwxyz
Please guess a letter: d
Oops! That letter is not in my word: ++++
--------------
You have 5 guesses left.
Available Letters: efghijklmnopqrstuvwxyz
Please guess a letter: u
Oops! That letter is not in my word: ++++
--------------
You have 3 guesses left.
Available Letters: efghijklmnopqrstvwxyz
Please guess a letter: e
Good guess: e++e
--------------
You have 3 guesses left.
Available Letters: fghijklmnopqrstuvwxyz
Please guess a letter: f
Oops! That letter is not in my word: e++e
--------------
You have 2 guesses left.
Available Letters: ghijklmnopqrstuvwxyz
Please guess a letter: o
Oops! That letter is not in my word: e++e
--------------
Sorry, you ran out of guesses. The word was else.
Hangman with Help
Loading word list from file...
55900 words loaded.
Welcome to Hangman!
I am thinking of a word that is 7 letters long
--------------
You currently have 10 guesses left
Available letters: abcdefghijklmnopqrstuvwxyz
Please guess a letter: r
Good guess: r+++++r
--------------
You currently have 10 guesses left
Available letters: abcdefghijklmnopqstuvwxyz
Please guess a letter: !
Letter revealed: c
r+c+c+r
--------------
You currently have 7 guesses left
Available letters: abdeghijklmnopqstuvwxyz
Please guess a letter: !
Letter revealed: a
rac+car
--------------
You currently have 4 guesses left
Available letters: bdeghijklmnopqstuvwxyz
Please guess a letter: e
Good guess: racecar
--------------
Congratulations, you won!
Your total score for this game is: 78
def hangman(secret_word, with_help):
    '''
    secret_word: string, the secret word to guess.
    with_help: boolean, this enables help functionality if true.
    Starts up an interactive game of Hangman.
    * At the start of the game, let the user know how many
      letters the secret_word contains and how many guesses they start with.
    * The user should start with 10 guesses.
    * Before each round, you should display to the user how many guesses
      they have left and the letters that the user has not yet guessed.
    * Ask the user to supply one guess per round. Remember to make
      sure that the user puts in a single letter (or help character '!'
      for with_help functionality)
    * If the user inputs an incorrect consonant, then the user loses ONE guess,
      while if the user inputs an incorrect vowel (a, e, i, o, u),
      then the user loses TWO guesses.
    * The user should receive feedback immediately after each guess
      about whether their guess appears in the computer's word.
    * After each guess, you should display to the user the
      partially guessed word so far.
    -----------------------------------
    with_help functionality
    -----------------------------------
    * If the guess is the symbol !, you should reveal to the user one of the
      letters missing from the word at the cost of 3 guesses. If the user does
      not have 3 guesses remaining, print a warning message. Otherwise, add
      this letter to their guessed word and continue playing normally.
    Follows the other limitations detailed in the problem write-up.
    '''
    # FILL IN YOUR CODE HERE AND DELETE "pass"
    pass
def get_revealed_letter(secret_word, avail_letters):
    # Assumes `import random` at the top of hangman.py.
    choose_from = ""
    for letter in secret_word:
        if letter in avail_letters and letter not in choose_from:
            choose_from += letter
    revealed_letter = choose_from[random.randint(0, len(choose_from) - 1)]
    return revealed_letter

def hangman(secret_word, with_help):
    '''
    secret_word: string, the secret word to guess.
    with_help: boolean, this enables help functionality if true.
    Starts up an interactive game of Hangman.
    * At the start of the game, let the user know how many
      letters the secret_word contains and how many guesses they start with.
    * The user should start with 10 guesses.
    * Before each round, you should display to the user how many guesses
      they have left and the letters that the user has not yet guessed.
    * Ask the user to supply one guess per round. Remember to make
      sure that the user puts in a single letter (or help character '!'
      for with_help functionality)
    * If the user inputs an incorrect consonant, then the user loses ONE guess,
      while if the user inputs an incorrect vowel (a, e, i, o, u),
      then the user loses TWO guesses.
    * The user should receive feedback immediately after each guess
      about whether their guess appears in the computer's word.
    * After each guess, you should display to the user the
      partially guessed word so far.
    -----------------------------------
    with_help functionality
    -----------------------------------
    * If the guess is the symbol !, you should reveal to the user one of the
      letters missing from the word at the cost of 3 guesses. If the user does
      not have 3 guesses remaining, print a warning message. Otherwise, add
      this letter to their guessed word and continue playing normally.
    Follows the other limitations detailed in the problem write-up.
    '''
    num_letters = len(secret_word)
    num_guesses = 10
    letters_guessed = []
    dash = "-------------------"
    print("Welcome to Hangman!")
    print("I am thinking of a word that is", num_letters, "letters long.")
    while num_guesses > 0:
        print(dash)
        if has_player_won(secret_word, letters_guessed):
            print("Congratulations, you won!")
            # Count the unique letters in secret_word for the score formula.
            unique_letters = 0
            letters_seen = ""
            for letter in secret_word:
                if letter not in letters_seen:
                    unique_letters += 1
                    letters_seen += letter
            total_score = 4 * unique_letters * num_guesses + 2 * len(secret_word)
            print("Your total score for this game is:", total_score)
            break
        print("You have", num_guesses, "guesses left.")
        print("Available letters:", get_available_letters(letters_guessed))
        user_guess = input("Please guess a letter: ")
        if user_guess == "!" and with_help:
            if num_guesses <= 3:
                print("Oops! Not enough guesses left:", get_word_progress(secret_word, letters_guessed))
            else:
                num_guesses -= 3
                avail_letters = get_available_letters(letters_guessed)
                letter_revealed = get_revealed_letter(secret_word, avail_letters)
                print("Letter revealed:", letter_revealed)
                letters_guessed.append(letter_revealed)
                print(get_word_progress(secret_word, letters_guessed))
        elif not user_guess.isalpha() or len(user_guess) > 1:
            print("Oops! That is not a valid letter. Please input a letter from the alphabet:", get_word_progress(secret_word, letters_guessed))
        else:
            user_guess = user_guess.lower()
            # Check repeated guesses first so the user is never penalized twice.
            if user_guess in letters_guessed:
                print("Oops! You've already guessed that letter:", get_word_progress(secret_word, letters_guessed))
            elif user_guess in secret_word:
                letters_guessed.append(user_guess)
                print("Good guess:", get_word_progress(secret_word, letters_guessed))
            elif user_guess in "aeiou":
                letters_guessed.append(user_guess)
                num_guesses -= 2
                print("Oops! That letter is not in my word:", get_word_progress(secret_word, letters_guessed))
            else:
                letters_guessed.append(user_guess)
                num_guesses -= 1
                print("Oops! That letter is not in my word:", get_word_progress(secret_word, letters_guessed))
    if num_guesses <= 0:
        print(dash)
        print("Sorry, you ran out of guesses. The word was", secret_word + ".")
Next, implement the function get_available_letters according to its docstring. This function should return the letters in alphabetical order.
Hint: You might consider using string.ascii_lowercase, which is a string comprised of all lowercase letters:
>>> import string
>>> print(string.ascii_lowercase)
abcdefghijklmnopqrstuvwxyz
Example Usage:
>>> letters_guessed = ['e', 'i', 'k', 'p', 'r', 's']
>>> print(get_available_letters(letters_guessed))
abcdfghjlmnoqtuvwxyz
Testing: Run test_ps2_student.py. If your function is correct, the test printout should read:
test_get_available_letters (__main__.TestPS2) ... ok
test_get_available_letters_empty_list (__main__.TestPS2) ... ok
test_get_available_letters_empty_string (__main__.TestPS2) ... ok
def get_available_letters(letters_guessed):
    '''
    letters_guessed: list (of lowercase letters), the letters that have been
        guessed so far
    returns: string, comprised of letters that represents which
        letters have not yet been guessed. The letters should be returned in
        alphabetical order
    '''
    # FILL IN YOUR CODE HERE AND DELETE "pass"
    pass
def get_available_letters(letters_guessed):
    '''
    letters_guessed: list (of lowercase letters), the letters that have been
        guessed so far
    returns: string, comprised of letters that represents which
        letters have not yet been guessed. The letters should be returned in
        alphabetical order
    '''
    # Assumes `import string` at the top of hangman.py (see the hint above).
    all_letters = []
    for character in string.ascii_lowercase:
        if character not in letters_guessed:
            all_letters.append(character)
    return "".join(all_letters)
42
87EECS6.39
Introduction to Machine Learning
6.1010/6.1210, 18.06/18.C06
NoneExercise 12Q-Learning1av0.01157407407Text
Let's simulate the Q-learning algorithm! Assume there are states $(0,1,2,3)$ and actions ('b', 'c'), and discount factor $\gamma=0.9$. Furthermore, assume that all the $\mathrm{Q}$ values are initialized to 0 (for all state-action pairs) and that the learning rate $\alpha=0.5$.
Experience is represented as a list of 4-element tuples: the $t$ th element of the experience corresponds to a record of experience at time $t:\left(s_{t}, a_{t}, s_{t+1}, r_{t}\right)$ (state, action, next state, reward).
After each step $t$, indicate what update $Q\left(s_{t}, a_{t}\right) \leftarrow q$ will be made by the Q learning algorithm based on $\left(s_{t}, a_{t}, s_{t+1}, r_{t}\right)$. You will want to keep track of the overall table $Q\left(s_{t}, a_{t}\right)$ as these updates take place, spanning the multiple parts of this question.
As a reminder, the Q-learning update formula is the following:
$$
Q(s, a) \leftarrow(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right)
$$
You are welcome to do this problem by hand, by drawing a table specifying $Q(s, a)$ for all possible $s$ and $a$. Alternatively, you may write a program which takes in the following history of experience:
experience = [(0, 'b', 2, 0), #t = 0
(2, 'b', 3, 0),
(3, 'b', 0, 2),
(0, 'b', 2, 0), #t = 3
(2, 'b', 3, 0),
(3, 'c', 0, 2),
(0, 'c', 1, 0), #t = 6
(1, 'b', 0, 1),
(0, 'b', 2, 0),
(2, 'c', 3, 0), #t = 9
(3, 'c', 0, 2),
(0, 'c', 1, 0)]
t: S A S' R
---------------
1: 2 'b' 3 0
The $t=1$ step of Q-learning will update the Q value of some state-action pair based on the experience tuple $\left(s_{1}, a_{1}, s_{2}, r_{1}\right)$.
After observing this tuple, what is the state of the state-action pair that is updated?
Numerical
2.
Since we observe an experience in state 2, we update the Q value for state 2.
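For reference, here is a minimal sketch of the tabular Q-learning loop the problem invites you to write; it replays the full experience list and prints each update:
# Replay the given experience with tabular Q-learning (alpha = 0.5, gamma = 0.9).
alpha, gamma = 0.5, 0.9
states, actions = (0, 1, 2, 3), ('b', 'c')
Q = {(s, a): 0.0 for s in states for a in actions}

experience = [(0, 'b', 2, 0), (2, 'b', 3, 0), (3, 'b', 0, 2),
              (0, 'b', 2, 0), (2, 'b', 3, 0), (3, 'c', 0, 2),
              (0, 'c', 1, 0), (1, 'b', 0, 1), (0, 'b', 2, 0),
              (2, 'c', 3, 0), (3, 'c', 0, 2), (0, 'c', 1, 0)]

for t, (s, a, s_next, r) in enumerate(experience):
    target = r + gamma * max(Q[(s_next, ap)] for ap in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    print(t, (s, a), '->', round(Q[(s, a)], 4))  # e.g. t = 2 gives Q(3, 'b') = 1.0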
Let's simulate the Q-learning algorithm! Assume there are states $(0,1,2,3)$ and actions ('b', 'c'), and discount factor $\gamma=0.9$. Furthermore, assume that all the $\mathrm{Q}$ values are initialized to 0 (for all state-action pairs) and that the learning rate $\alpha=0.5$.
Experience is represented as a list of 4-element tuples: the $t$ th element of the experience corresponds to a record of experience at time $t:\left(s_{t}, a_{t}, s_{t+1}, r_{t}\right)$ (state, action, next state, reward).
After each step $t$, indicate what update $Q\left(s_{t}, a_{t}\right) \leftarrow q$ will be made by the Q learning algorithm based on $\left(s_{t}, a_{t}, s_{t+1}, r_{t}\right)$. You will want to keep track of the overall table $Q\left(s_{t}, a_{t}\right)$ as these updates take place, spanning the multiple parts of this question.
As a reminder, the Q-learning update formula is the following:
$$
Q(s, a) \leftarrow(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right)
$$
You are welcome to do this problem by hand, by drawing a table specifying $Q(s, a)$ for all possible $s$ and $a$. Alternatively, you may write a program which takes in the following history of experience:
experience = [(0, 'b', 2, 0), #t = 0
(2, 'b', 3, 0),
(3, 'b', 0, 2),
(0, 'b', 2, 0), #t = 3
(2, 'b', 3, 0),
(3, 'c', 0, 2),
(0, 'c', 1, 0), #t = 6
(1, 'b', 0, 1),
(0, 'b', 2, 0),
(2, 'c', 3, 0), #t = 9
(3, 'c', 0, 2),
(0, 'c', 1, 0)]
t: S A S' R
---------------
0: 0 'b' 2 0
The $t=0$ step of Q-learning will update the $\mathrm{Q}$ value of some state-action pair based on the experience tuple $\left(s_{0}, a_{0}, s_{1}, r_{0}\right)$. After observing this tuple, the Q-value for one specific state-action pair is updated. What is the state in this state-action pair?
0.
Since we observe an experience in state 0, we update the Q value for state 0.
Let's simulate the Q-learning algorithm! Assume there are states $(0,1,2,3)$ and actions ('b', 'c'), and discount factor $\gamma=0.9$. Furthermore, assume that all the $\mathrm{Q}$ values are initialized to 0 (for all state-action pairs) and that the learning rate $\alpha=0.5$.
Experience is represented as a list of 4-element tuples: the $t$ th element of the experience corresponds to a record of experience at time $t:\left(s_{t}, a_{t}, s_{t+1}, r_{t}\right)$ (state, action, next state, reward).
After each step $t$, indicate what update $Q\left(s_{t}, a_{t}\right) \leftarrow q$ will be made by the Q learning algorithm based on $\left(s_{t}, a_{t}, s_{t+1}, r_{t}\right)$. You will want to keep track of the overall table $Q\left(s_{t}, a_{t}\right)$ as these updates take place, spanning the multiple parts of this question.
As a reminder, the Q-learning update formula is the following:
$$
Q(s, a) \leftarrow(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right)
$$
You are welcome to do this problem by hand, by drawing a table specifying $Q(s, a)$ for all possible $s$ and $a$. Alternatively, you may write a program which takes in the following history of experience:
experience = [(0, 'b', 2, 0), #t = 0
(2, 'b', 3, 0),
(3, 'b', 0, 2),
(0, 'b', 2, 0), #t = 3
(2, 'b', 3, 0),
(3, 'c', 0, 2),
(0, 'c', 1, 0), #t = 6
(1, 'b', 0, 1),
(0, 'b', 2, 0),
(2, 'c', 3, 0), #t = 9
(3, 'c', 0, 2),
(0, 'c', 1, 0)]
t: S A S' R
---------------
2: 3 'b' 0 2
The $t=2$ step of Q-learning will update the $\mathrm{Q}$ value of some state-action pair based on the experience tuple $\left(s_{2}, a_{2}, s_{3}, r_{2}\right)$.
What is the updated Q value that Q-learning computes at $t=2$? Recall that $\alpha=0.5$ and $\gamma=0.9$.
1.
$$
Q_{\text{new}}(3, b)=0.5 \cdot Q_{\text{old}}(3, b)+0.5\left(2+0.9 \cdot \max_{a^{\prime}} Q_{\text{old}}\left(0, a^{\prime}\right)\right)=0.5 \cdot 0+0.5 \cdot 2=1 .
$$
Let's simulate the Q-learning algorithm! Assume there are states $(0,1,2,3)$ and actions ('b', 'c'), and discount factor $\gamma=0.9$. Furthermore, assume that all the $\mathrm{Q}$ values are initialized to 0 (for all state-action pairs) and that the learning rate $\alpha=0.5$.
Experience is represented as a list of 4-element tuples: the $t$ th element of the experience corresponds to a record of experience at time $t:\left(s_{t}, a_{t}, s_{t+1}, r_{t}\right)$ (state, action, next state, reward).
After each step $t$, indicate what update $Q\left(s_{t}, a_{t}\right) \leftarrow q$ will be made by the Q learning algorithm based on $\left(s_{t}, a_{t}, s_{t+1}, r_{t}\right)$. You will want to keep track of the overall table $Q\left(s_{t}, a_{t}\right)$ as these updates take place, spanning the multiple parts of this question.
As a reminder, the Q-learning update formula is the following:
$$
Q(s, a) \leftarrow(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right)
$$
You are welcome to do this problem by hand, by drawing a table specifying $Q(s, a)$ for all possible $s$ and $a$. Alternatively, you may write a program which takes in the following history of experience:
experience = [(0, 'b', 2, 0), #t = 0
(2, 'b', 3, 0),
(3, 'b', 0, 2),
(0, 'b', 2, 0), #t = 3
(2, 'b', 3, 0),
(3, 'c', 0, 2),
(0, 'c', 1, 0), #t = 6
(1, 'b', 0, 1),
(0, 'b', 2, 0),
(2, 'c', 3, 0), #t = 9
(3, 'c', 0, 2),
(0, 'c', 1, 0)]
What is the action of the state-action pair that is updated?
b.
Since action $b$ was used in this experience, we update the Q value for action b.
43
58EECS6.411
Representation, Inference, and Reasoning in AI
6.1010, 6.1210, 18.600
NoneProblem Set 1
Monte-Carlo Tree Search
4cv0.05580357143Text
Now, for each of the problems defined in get_fractal_problems, let us compare the performances of MCTS vs. UCS empirically by running run_mcts_search and run_uniform_cost_search. In particular, you should:
\begin{itemize}
\item Fix step_budget for both algorithms to 2500. Set iteration_budget for MCTS to infinity.
\item Run MCTS 20 times and record the average cumulative reward.
\item Run UCS once:
\begin{itemize}
\item If it fails (due to running out of step budget), record the cumulative reward as 0.
\item If it succeeds, record the obtained cumulative reward. Hint: you might need to recover rewards from path costs.
\end{itemize}
\item Repeat the above for all three problems in get_fractal_problems.
\end{itemize}
What is the average cumulative reward obtained by MCTS in reward-field-3?
Numerical1.2.
Now, for each of the problems defined in get_fractal_problems, let us compare the performances of MCTS vs. UCS empirically by running run_mcts_search and run_uniform_cost_search. In particular, you should:
\begin{itemize}
\item Fix step_budget for both algorithms to 2500. Set iteration_budget for MCTS to infinity.
\item Run MCTS 20 times and record the average cumulative reward.
\item Run UCS once:
\begin{itemize}
\item If it fails (due to running out of step budget), record the cumulative reward as 0.
\item If it succeeds, record the obtained cumulative reward. Hint: you might need to recover rewards from path costs.
\end{itemize}
\item Repeat the above for all three problems in get_fractal_problems.
\end{itemize}
What is the average cumulative reward obtained by MCTS in reward-field-1?
4.2.
Now, for each of the problems defined in get_fractal_problems, let us compare the performances of MCTS vs. UCS empirically by running run_mcts_search and run_uniform_cost_search. In particular, you should:
\begin{itemize}
\item Fix step_budget for both algorithms to 2500. Set iteration_budget for MCTS to infinity.
\item Run MCTS 20 times and record the average cumulative reward.
\item Run UCS once:
\begin{itemize}
\item If it fails (due to running out of step budget), record the cumulative reward as 0.
\item If it succeeds, record the obtained cumulative reward. Hint: you might need to recover rewards from path costs.
\end{itemize}
\item Repeat the above for all three problems in get_fractal_problems.
\end{itemize}
What is the average cumulative reward obtained by MCTS in reward-field-2?
3.1.
Now, for each of the problems defined in get_fractal_problems, let us compare the performances of MCTS vs. UCS empirically by running run_mcts_search and run_uniform_cost_search. In particular, you should:
\begin{itemize}
\item Fix step_budget for both algorithms to 2500. Set iteration_budget for MCTS to infinity.
\item Run MCTS 20 times and record the average cumulative reward.
\item Run UCS once:
\begin{itemize}
\item If it fails (due to running out of step budget), record the cumulative reward as 0.
\item If it succeeds, record the obtained cumulative reward. Hint: you might need to recover rewards from path costs.
\end{itemize}
\item Repeat the above for all three problems in get_fractal_problems.
\end{itemize}
What is the cumulative reward obtained by UCS in reward-field-3?
3.1911534984544563.
44
566EECS6.39
Introduction to Machine Learning
6.1010/6.1210, 18.06/18.C06
NoneFinal ExamNeural Networks1d0.7Text
Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store.
Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac's first attempt at machine learning to predict the sales volume (the setup of (b)) uses all customer data from 2020. He randomly partitions the data into train ($80\%$) and validation ($20\%$), and uses the same number of units, activation function(s), and loss function as in (b). To prevent overfitting, he uses ridge regularization of the weights $W$, minimizing the optimization objective
$$
J(W ; \lambda)=\sum_{i=1}^{n} \mathcal{L}\left(h\left(x^{(i)} ; W\right), y^{(i)}\right)+\lambda\|W\|^{2},
$$
where $\|W\|^{2}$ is the sum over the square of all output units' weights.
Mac discovers that it's possible to find a value of $W$ such that $J(W ; \lambda)=0$ even when $\lambda$ is very large, nearing $\infty$. Mac suspects that he might have an error in the code that he wrote to derive the labels (i.e., the monthly sales volumes). Let's see why. First, what can Mac conclude about $W$ from this finding? Second, what does this imply about the labels?
Open
(1) Since the loss is always non-negative and the penalty is always non-negative, the only way to get $J(W ; \lambda)=0$ is for both terms to equal 0. The only way the penalty can equal 0 is if every element of $W$ equals 0.
(2) When $W$ has all entries equal to 0 , the prediction at every data point is a constant (the offset). The only way for the squared error to be 0 is for the label of every data point to equal that offset. It seems unlikely that every data label would be exactly the same in this data set, which we assume ranges over a wide number of apps.
Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store.
Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac wants to predict the sales volume (how many times someone will purchase the app each month) for his new app. The sales volume can be negative if many people returned the app for a refund in a given month. What should Mac choose for the number of units in the output layer, the activation function(s) in the output layer (linear, ReLU, sigmoid, softmax), and the loss function (negative log likelihood, quadratic)?
One unit, because the output is a single number (which may be negative); a linear activation function; and a quadratic loss function.
Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store.
Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
The initial results look promising. Mac now wants to add in data from additional, earlier, years. (He is confident his customers have been behaving similarly over many years, so the earlier data is relevant.)
Before curating the older data, Mac decides to use the training data that he has to get a sense of whether more data would help. He creates a learning curve where on the horizontal axis he varies the amount of training data used and on the vertical axis he shows the validation error, using a fixed validation set across all settings considered. He experiments with $\lambda=1,10,100$, but again forgot to include a legend. Fill in the below legend by labeling the curves with the value of $\lambda$ that each corresponds to:
The plot is below.
Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store.
Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac experiments with even more training data and additional values of $\lambda$, but finds that he cannot decrease the validation error further. Are there changes to the neural network architecture that Mac could make to try to improve prediction performance? Explain.
Mac could add hidden layers with nonlinear activation functions to the neural network.
45
247Mathematics18.01Calculus INoneNoneProblem Set 6Approximations6c0.05279831045Text
Taylor's theorem tells us that when the second derivative of a function is really big, then the linear approximation is not so accurate. Here's an example. Let $f(x)=\sin (100 x)$.
Compute $f^{\prime \prime}(x)$. Check that $f^{\prime \prime}(0)=0$. Let $M$ denote the maximum of $\left|f^{\prime \prime}(x)\right|$ for $0 \leq x \leq .1$. Find $M$.
Numerical
$f^{\prime \prime}(x)=-100^{2} \sin (100 x)$, so $f^{\prime \prime}(0)=0$. When $x=0$, $100 x=0$, and when $x=.1$, $100 x=10$. Thus the input of sine ranges over $[0,10]$ when $0 \leq x \leq .1$. Since sine achieves its maximum value 1 at $\frac{\pi}{2}$ (which corresponds to $x=\pi / 200$, and $0 \leq \pi / 200 \leq .1$), we have
$$
\left|f^{\prime \prime}(x)\right| = 100^{2}|\sin (100 x)| \leq 100^{2} \sin (\pi / 2)=100^{2}
$$
whenever $0 \leq x \leq .1$. Thus $M=100^{2}=10^{4}$.
Taylor's theorem tells us that when the second derivative of a function is really big, then the linear approximation is not so accurate. Here's an example. Let $f(x)=\sin (100 x)$.
Approximate $f(.1)$ by taking the linear approximation of $f$ around $x=0$.
$f(0)=0, f^{\prime}(x)=100 \cos (100 x)$ and $f^{\prime}(0)=100$. The linear approximation is therefore
$$
f(.1) \approx f(0)+f^{\prime}(0)(.1)=100(.1)=10.
$$
Taylor's theorem tells us that when the second derivative of a function is really big, then the linear approximation is not so accurate. Here's an example. Let $f(x)=\sin (100 x)$.
Approximate the magnitude of the error in this linear approximation. Is it about $.1$, 1, 10, or 100? Hint: You don't need to compute $f(.1)$ exactly to do this!
Recall that $f(.1)=\sin (10)$ and $\sin$ takes values between $-1$ and 1. Thus the difference between our approximation (which was 10) and the actual value of $f(.1)$ is between 9 and 11. (So the magnitude of the error is about 10.)
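As a numerical sanity check of this estimate (illustrative, not part of the solution; math.sin works in radians):

import math

approx = 10.0                   # the linear approximation of f(.1)
actual = math.sin(100 * 0.1)    # f(.1) = sin(10) ≈ -0.544
print(abs(approx - actual))     # ≈ 10.54: error magnitude of about 10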
Suppose that $L(x)=f(1)+f^{\prime}(1)(x-1)$ is the linear approximation of $f(x)$ around $x=1$. Here is a picture of the graph of $f^{\prime}(x)$ and the graph of $L^{\prime}(x)$ below.
Here $f^{\prime}(1)=10$ and so $L^{\prime}(x)=10$ for all $x$.
Compare your answer to b with the bound from Taylor's theorem.
The bound from Taylor's theorem requires us to find $M$, an upper bound for $\left|f^{\prime \prime}(x)\right|$ when $1 \leq x \leq 1.1$. Since $f^{\prime}$ is steepest when $x=1$, $\left|f^{\prime \prime}(x)\right| \leq\left|f^{\prime \prime}(1)\right| \approx 3$ when $1 \leq x \leq 1.1$. Taylor's theorem then gives the error bound $\frac{1}{2} \cdot 3 \cdot(.1)^{2}=.015$, which is exactly what we obtained in (b).
46
25EECS18.C06
Linear Algebra and Optimization
18.02NoneProblem Set 1
Matrix Multiplication
9a0.1851851852Text
We want to compute the product $A B C$ of three matrices $A, B, C$. The matrices have dimensions $n \times m, m \times p$, and $p \times r$, respectively. Recall that matrix multiplication is an associative operation, i.e., $(A B) C=A(B C)$.
For this problem we will use the standard formula for multiplying matrices from class:
$$
\sum_{k} A_{i, k} B_{k, j},
$$
and when counting arithmetic operations (addition, multiplication) you should only aim to get the answer right up to a constant factor.
If we compute the product as $(A B) C$, how many arithmetic operations are required (as a function of $n, m, p, r)$?
Expression
The number of operations to compute $A B$ is the number of entries $(n p)$ times the number of arithmetic operations we need to make to compute each entry ( $m$ multiplications $+(m-1)$ additions $=2 m-1$ operations $)$, for a total of $n p(2 m-1)$ operations. To compute $(A B) C$ now, we argue similarly to conclude that we make $n r(2 p-1)$ additional arithmetic operations. Adding these up, we get that the total number of arithmetic operations is $n(p(2 m-1)+r(2 p-1))$.
We want to compute the product $A B C$ of three matrices $A, B, C$. The matrices have dimensions $n \times m, m \times p$, and $p \times r$, respectively. Recall that matrix multiplication is an associative operation, i.e., $(A B) C=A(B C)$.
For this problem we will use the standard formula for multiplying matrices from class:
$$
\sum_{k} A_{i, k} B_{k, j},
$$
and when counting arithmetic operations (addition, multiplication) you should only aim to get the answer right up to a constant factor.
If we compute the product as $A(B C)$, how many arithmetic operations are required?
Arguing as before, we find that the total number of arithmetic operations is $r(m(2 p-1)+n(2 m-1))$.
We want to compute the product $A B C$ of three matrices $A, B, C$. The matrices have dimensions $n \times m, m \times p$, and $p \times r$, respectively. Recall that matrix multiplication is an associative operation, i.e., $(A B) C=A(B C)$.
For this problem we will use the standard formula for multiplying matrices from class:
$$
\sum_{k} A_{i, k} B_{k, j},
$$
and when counting arithmetic operations (addition, multiplication) you should only aim to get the answer right up to a constant factor.
Let $n=1000, m=20, p=1000, r=1$. How many operations does each method need? Which of the two methods is faster (and by how much)?
One could give a quick and probably correct answer by thinking asymptotically, but here we have plenty of time to compute (and precise formulas at hand), so we will just compare those. For $(A B) C$, we plug into the formula from part (a) to find that we make 40,999,000 operations. For $A(B C)$, we plug into the formula from part (b) to find that we make 78,980 operations. Computing the product as $A(B C)$ turns out to be much faster (by a factor of roughly 500).
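To make the comparison easy to reproduce, here is a small sketch (the helper names are hypothetical) that evaluates the two exact operation counts derived above:

def cost_AB_C(n, m, p, r):
    # (AB) first: n*p entries, 2m-1 ops each; then (AB)C: n*r entries, 2p-1 ops each.
    return n * p * (2 * m - 1) + n * r * (2 * p - 1)

def cost_A_BC(n, m, p, r):
    # (BC) first: m*r entries, 2p-1 ops each; then A(BC): n*r entries, 2m-1 ops each.
    return m * r * (2 * p - 1) + n * r * (2 * m - 1)

print(cost_AB_C(1000, 20, 1000, 1))  # 40999000
print(cost_A_BC(1000, 20, 1000, 1))  # 78980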
We want to compute the product $A B C$ of three matrices $A, B, C$. The matrices have dimensions $n \times m, m \times p$, and $p \times r$, respectively. Recall that matrix multiplication is an associative operation, i.e., $(A B) C=A(B C)$.
For this problem we will use the standard formula for multiplying matrices from class:
$$
\sum_{k} A_{i, k} B_{k, j},
$$
and when counting arithmetic operations (addition, multiplication) you should only aim to get the answer right up to a constant factor.
Let's test numerically the associative property in Julia. For this, generate some random matrices $A, B, C$ of compatible dimensions using the command randn (e.g., $\operatorname{randn}(20,30)$), and compute the difference between the two results, i.e., $A(B C)-(A B) C$. What do you expect to happen? What actually happens? Explain the results.
By associativity, $A(B C)-(A B) C=0$. However, we get some small floating point errors in Julia, and the output of $A(B C)-(A B) C$ is a matrix with entries that are very close to 0 (the absolute values of entries could be like $10^{-14}$). Here is some example code for testing this out:
> A = randn(20,30); B = randn(30,40); C = randn(40,10);
> A*(B*C) - (A*B)*C
47
192EECS18.C06
Linear Algebra and Optimization
18.02NoneFinal ExamInverse Matrix13c0.6956521739Text
Let $a, b$ be vectors in $\mathbb{R}^{n}$. In this problem we will find an expression for the inverse of $M=I-a b^{T}$ and explore some implications for optimization.
Recall that for any matrix $A$ with $\|A\|<1$ we have the identity
$$
(I-A)^{-1}=\sum_{k=0}^{\infty} A^{k}=I+A+A^{2}+\cdots
$$
Use this formula to compute $M^{-1}$ and simplify to get an expression of the form
$$
M^{-1}=I+\alpha a b^{T}
$$
What is the value of $\alpha$?
Expression
Applying the formula, we have
$$
\begin{aligned}
\left(I-a b^{T}\right)^{-1} & =I+a b^{T}+\left(a b^{T}\right)\left(a b^{T}\right)+\left(a b^{T}\right)\left(a b^{T}\right)\left(a b^{T}\right)+\cdots \\
& =I+a b^{T}+\left(b^{T} a\right)\left(a b^{T}\right)+\left(b^{T} a\right)^{2}\left(a b^{T}\right)+\cdots \\
& =I+a b^{T}\left(1+\left(b^{T} a\right)+\left(b^{T} a\right)^{2}+\cdots\right) \\
& =I+a b^{T}\left(1-b^{T} a\right)^{-1},
\end{aligned}
$$
i.e., $\alpha=1 /\left(1-b^{T} a\right)$.
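A quick numerical sanity check of this formula (an illustrative sketch, not part of the solution; the vectors are scaled so that $\|a\|,\|b\|<1$ and the series converges):

import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=5); a *= 0.5 / np.linalg.norm(a)   # ensure ||a|| < 1
b = rng.normal(size=5); b *= 0.5 / np.linalg.norm(b)   # ensure ||b|| < 1
M = np.eye(5) - np.outer(a, b)
alpha = 1.0 / (1.0 - b @ a)
M_inv = np.eye(5) + alpha * np.outer(a, b)
print(np.allclose(M @ M_inv, np.eye(5)))               # True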
Let $a, b$ be vectors in $\mathbb{R}^{n}$. In this problem we will find an expression for the inverse of $M=I-a b^{T}$ and explore some implications for optimization.
Suppose that $a^{T} b=1$. Find a nonzero vector in $N(M)$.
By the assumption, both $a$ and $b$ are nonzero. We have
$$
M a=\left(I-a b^{T}\right) a=a-a\left(b^{T} a\right)=\left(1-a^{T} b\right) a=0,
$$
so $a \in N(M)$.
Let $a, b$ be vectors in $\mathbb{R}^{n}$. In this problem we will find an expression for the inverse of $M=I-a b^{T}$ and explore some implications for optimization.
For the rest of the problem we will assume that $a^{T} b \neq 1$. Show that when $\|a\|<1$ and $\|b\|<1$, we have $\left\|a b^{T}\right\|<1$.
Hint: Can you bound the maximum of $\left\|a b^{T} x\right\|$ over $x$ which is a unit vector?
We have
$$
\left\|a b^{T}\right\|=\max _{x:\|x\|=1}\left\|a b^{T} x\right\| \leq \max _{x:\|x\|=1}\|a\|\left|b^{T} x\right|=\|a\| \max _{x:\|x\|=1}\left|b^{T} x\right|=\|a\|\|b\|<1.
$$
Let $a, b$ be vectors in $\mathbb{R}^{n}$. In this problem we will find an expression for the inverse of $M=I-a b^{T}$ and explore some implications for optimization.
What is the minimum of $f(x)$ when $\|c\|>1$? Give a geometric interpretation.
When $\|c\|>1$ the function is unbounded below, so the minimum is $-\infty$. To see this, notice that for $x=\lambda c$ we have
$$
f(\lambda c)=\lambda^{2}\left(c^{T} c\right)\left(1-c^{T} c\right)-\lambda d^{T} c,
$$
which goes to $-\infty$ as $\lambda \rightarrow \infty$.
48
143EECS6.122
Design and Analysis of Algorithms
6.121NoneFinal Exam
Randomized Algorithms
1o0.375Text
Please select True or False for the following.
Consider the following algorithm for testing if a given list $L$ of $n>3$ distinct numbers is sorted:
Repeat $\Theta(\log n)$ times: Pick three indices $i<j<k$ uniformly at random and return NO if the following is false: $L_{i}<L_{j}<L_{k}$. At the end of the $\Theta(\log n)$ iterations, return YES.
Is it true that this algorithm always returns YES if $L$ is sorted and returns $\mathrm{NO}$ with probability at least $3 / 4$ if $L$ is not $\epsilon$-close to sorted for small constant $\epsilon$?
Multiple Choice
False. Consider the list $L$ as follows, for $n$ divisible by 2 :
$$
2,1,4,3, \ldots, 2 i, 2 i-1, \ldots, n, n-1 .
$$
This list is definitely very far from sorted: it is not $\epsilon$-close to sorted for any $\epsilon<1 / 2$, as we need to remove at least half of the numbers to make it sorted.
In order for us to pick $L_{i}, L_{j}, L_{k}(i<j<k)$ such that the algorithm returns NO, we need to have either $i=j-1$, or $j=k-1$, as only consecutive indices are in the wrong order. The probability of this happening in a single iteration is at most $O(1 / n)$. Even if we repeat $O(\log n)$ times, by a union bound, the probability that it happens in one of the iterations is at most $O(\log n / n)$. So we definitely will not return NO with constant probability.
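A small Monte Carlo sketch (illustrative; the list size and trial count are hypothetical choices) that estimates the per-iteration detection probability on this adversarial list:

import random

n = 1000
L = [x for i in range(n // 2) for x in (2 * i + 2, 2 * i + 1)]  # 2,1,4,3,...,n,n-1

trials, detected = 100000, 0
for _ in range(trials):
    i, j, k = sorted(random.sample(range(n), 3))
    if not (L[i] < L[j] < L[k]):
        detected += 1
print(detected / trials)  # roughly O(1/n), far below a constant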
Please select True or False for the following.
Consider an $O\left(n^{2}\right)$-time Monte Carlo randomized algorithm for some problem, which uses 1000 randomly generated integers between 1 and 1000, and gives the correct answer with probability $\frac{2}{3}$. Then, there is also a $O\left(n^{2}\right)$-time deterministic algorithm for the same problem.
True. Simply try all $1000^{1000}$ possible random integers, and then take the majority answer. The runtime is $O\left(n^{2}\right)$ times $1000^{1000}=O(1)$, which is still $O\left(n^{2}\right)$.
Please select True or False for the following.
A Las Vegas algorithm with expected $O(n)$ runtime may run in $\Omega\left(2^{n}\right)$ time in the worst case.
True.
Please select True or False for the following.
Suppose we have a recurrence $T(n)=T(0.9999 n)+T(0.0001 n)+O(n)$ with $T(n)=O(1)$ for small $n$. Then, the recurrence solves to $T(n)=\Theta(n \log n)$.
False. The $O(n)$ term is only an upper bound and can be zero (or sub-linear), in which case the recurrence produces $T(n)=O(n)$ rather than $\Theta(n \log n)$.
49
13Mathematics18.102
Introduction to Functional Analysis
18.C06, 18.100BNoneProblem Set 3Lp Spaces1nan0.5Text
Let $f(x)=x^{a} \log (x)^{b}$ for $x>2$ and $f(x)=0$ otherwise. For which real $a, b$ is $f$ in $\mathcal{L}^{1}(\mathbb{R})$? Justify your answer.
Open
We first claim that $f(x) \in \mathcal{L}^{1}(\mathbb{R})$ iff $\lim _{R \rightarrow \infty} \int_{2}^{R} f(x) d x<\infty$ : Assuming $f(x) \in \mathcal{L}^{1}(\mathbb{R})$, then $\int_{2}^{R} f(x) d x<\int f(x)<\infty$, and therefore it converges. Conversely, assuming the integral converges, define $f_{n}(x)=f(x) \chi_{[0, n]}$, then $f_{n}(x)$ is a monotone sequence converging pointwise to $f(x)$, by Lemma $2.7$ we conclude $f(x) \in \mathcal{L}^{1}(\mathbb{R})$.
Therefore we need to check the convergence of the integral $\int_{2}^{\infty} f(x) d x$. We distinguish different cases:
(1) $a>-1$. We have $x^{a} \log (x)^{b}>x^{-1}$ when $x$ is large, therefore the integral diverges.
(2) $a<-1$. Choose any $a<c<-1$, we have $x^{a} \log (x)^{b}<x^{c}$ when $x$ is large, therefore the integral converges.
(3) $a=-1$. We compute directly,
$$
\int x^{-1} \log (x)^{b}= \begin{cases}\frac{\log (x)^{b+1}}{b+1} & b \neq-1 \\ \log (\log (x)) & b=-1\end{cases}
$$
Therefore, the integral diverges when $b \geq-1$ and converges when $b<-1$.
In summary, $f(x) \in \mathcal{L}^{1}(\mathbb{R})$ iff in the following cases:
\begin{itemize}
\item $a<-1$.
\item $a=-1$ and $b<-1$.
\end{itemize}
Give an example of a function $f: \mathbb{R} \rightarrow \mathbb{C}$ which is in $\mathcal{L}^{1}(\mathbb{R})$ but $f \log (1+|f|)$ is not in $\mathcal{L}^{1}(\mathbb{R})$ and another example of a function $f: \mathbb{R} \rightarrow \mathbb{C}$ such that $f \log (1+|f|)$ is in $\mathcal{L}^{1}(\mathbb{R})$ but $f \notin \mathcal{L}^{1}(\mathbb{R})$; justify both.
The first example: $f(x)=\frac{1}{|x| |\log| x||^{3 / 2}}$ if $0<|x|<1 / 2$ and $f(x)=0$ if $x=0$ or $|x| \geq 1 / 2$. We have $\int|f|=\int f<\infty$ but
$$
\int|f| \log (1+|f|)=\infty.
$$
The second example: $f(x)=\frac{1}{1+|x|}$. We have $\int|f|=\int f=\infty$ but
$$
\int|f| \log (1+|f|)<\infty.
$$
Suppose that $f(x)=\ln \left(2 x^{4}-x^{3}\right)$. Remember that $\ln x$ is short for $\log _{e} x$, and $\frac{d}{d x} \ln x=\frac{1}{x}$
Suppose we want to approximate $f(1.01)$. It sounds pretty complicated at first, but we can do it using the things we know if we go in two steps.
Using linear approximation again, estimate the logarithm of the number you found in part a.
Write $p(x)=2 x^{4}-x^{3}$ such that $f(x)=\ln (p(x))$. The goal is then to approximate $\ln (p(1.01))$, so we begin by approximating $p(1.01)$.
$\ln (1)=0$ and $\ln ^{\prime}(1)=1$ implies $\ln (1.05) \approx \ln (1)+.05 \cdot \ln ^{\prime}(1)=.05$. Thus,
$$
f(1.01)=\ln (p(1.01)) \approx \ln (1.05) \approx .05 .
$$
Show that the function with $F(0)=0$ and
$$
F(x)= \begin{cases}0 & x>1 \\ \exp (i / x) & 0<|x| \leq 1 \\ 0 & x<-1\end{cases}
$$
is an element of $\mathcal{L}^{1}(\mathbb{R})$.
Let $f_{n}=\left(\chi_{\left[-1,-\frac{1}{n}\right]}+\chi_{\left[\frac{1}{n}, 1\right]}\right) F$; by Lemma $2.2, f_{n} \in \mathcal{L}^{1}(\mathbb{R})$. Moreover, $f_{n} \rightarrow F$ pointwise almost everywhere. Thus, because $\chi_{[-1,1]} \in$ $\mathcal{L}^{1}(\mathbb{R})$ and $\left|f_{n}(x)\right| \leq \chi_{[-1,1]}(x)$ everywhere, it follows by the Lebesgue dominated convergence theorem that $F \in \mathcal{L}^{1}(\mathbb{R})$.
50
172EECS6.191
Computation Structures
6.100A, 8.02NoneMidterm Exam 3Caches1a0.45Text
Cache Ketchum wants to design a cache to help keep track of his Pokedex entries. He’s enlisted your help as a talented 6.191 student!
Ketchum wants to build a direct-mapped cache with a block size of eight words. He also wants the cache to hold a total of $2^9 = 512$ data words. Which address bits should be used for the block offset, cache index, and tag? Assume that data words and addresses are 32 bits wide.
Numerical
Address bits used for block offset: A[ __4__ : __2__ ]
Address bits used for cache index: A[ __10__ : __5__ ]
Address bits used for tag: A[ __31__ : __11__ ]
Cache Ketchum wants to design a cache to help keep track of his Pokedex entries. He’s enlisted your help as a talented 6.191 student!
Ketchum ponders over the design and decides that he wants to double the number of cache lines in his direct-mapped cache. However, he wants to keep the total number of words in the cache the same. How will the number of bits used to represent the block offset change as a result?
(a) UNCHANGED.
(b) +1.
(c) -1.
(d) 2x.
(e) 0.5x.
(f) CAN'T TELL.
(c) -1.
Cache Ketchum wants to design a cache to help keep track of his Pokedex entries. He’s enlisted your help as a talented 6.191 student!
Ketchum decides he doesn’t want a direct-mapped cache at all! He wants a two-way set-associative cache.
The remainder of the problem will consider this 2-way set-associative cache with a capacity of 32 words. Below is a snapshot of this cache during the execution of some unknown code. V is the valid bit and D is the dirty bit of each set. Assume an LRU replacement policy and that Way 0 currently holds the LRU cache line for all sets.
Way 0
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline V & D & Tag & Word 0 & Word 1 & Word 2 & Word 3 \\
\hline 1 & 0 & 0x28 & 0xA65 & 0x521 & 0xA2C & 0x947 \\
\hline 1 & 1 & 0x1D & 0xB54 & 0xE95 & 0x9AA & 0xC7A \\
\hline 1 & 0 & 0x4D & 0xE71 & 0x2FE & 0xC58 & 0x4C4 \\
\hline 1 & 0 & 0x085 & 0xB6B & 0xD55 & 0x27D & 0xE1E \\
\hline
\end{tabular}
Way 1
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline V & D & Tag & Word 0 & Word 1 & Word 2 & Word 3 \\
\hline 1 & 1 & 0x093 & 0x2EA & 0x4CE & 0x42D & 0x462 \\
\hline 1 & 1 & 0x093 & 0x3C2 & 0x152 & 0xB9C & 0xC23 \\
\hline 1 & 0 & 0xAF & 0xC05 & 0xE81 & 0xCEA & 0x60B \\
\hline 1 & 0 & 0xA5 & 0x57B & 0xC5F & 0xA1F & 0xAF5 \\
\hline
\end{tabular}
Identify whether each of the following memory accesses is a hit or a miss. Consider each memory access independently. If it is a hit, specify what value is returned; if it is a miss, write N/A. In addition, if it is a miss, determine if any values need to be written back to main memory, and if so, to which location(s) in main memory? List all updated main memory word addresses. If no writes to main memory are needed, write N/A.
Load from address 0x2974
Load from address 0x11D8
Load from address 0x2974
0x2974 = 0010_1001_0111_0100
tag = 0xA5, index = 3, block offset = 1
Hit.
Returned value if hit or N/A if miss: __C5F_______
All updated main memory word addresses or N/A: ___N/A________________________________
Load from address 0x11D8
0x11D8 = 0001_0001_1101_1000
tag = 0x47, index = 1, block offset = 2
miss -> replaces way 0 line 1 which is dirty
must first write this cache line back to memory
if tag = 0x1D and index = 1, then the memory addresses are 0000_0111_0101_XX00 = 0x750, 0x754, 0x758, 0x75C.
Miss.
Returned value if hit or N/A if miss: ___N/A______
All updated main memory word addresses or N/A: ___0x750, 0x754, 0x758, 0x75C _____________
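The tag/index/offset decompositions above can be reproduced mechanically; a sketch for this cache geometry (4-word blocks, 4 sets, byte addresses; the helper name is hypothetical):

def decode(addr):
    byte_off = addr & 0x3           # bits [1:0]: byte within word
    block_off = (addr >> 2) & 0x3   # bits [3:2]: word within block
    index = (addr >> 4) & 0x3       # bits [5:4]: set index
    tag = addr >> 6                 # remaining high bits
    return hex(tag), index, block_off

print(decode(0x2974))  # ('0xa5', 3, 1)
print(decode(0x11D8))  # ('0x47', 1, 2)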
Cache Ketchum wants to design a cache to help keep track of his Pokedex entries. He’s enlisted your help as a talented 6.191 student!
After testing, Ketchum decides to use the cache with the following RISC-V assembly program that increments every element in an array and stores the changed elements in another array.
// Assume the following registers are initialized:
// x1 = 0xC0 (base address of input array)
// x2 = 0x80 (base address of output array)
// x3 = 4 (number of elements in input and output arrays)
. = 0x100 // The following code starts at address 0x100
slli x6, x3, 2
add x6, x1, x6 // address of end of input array
loop:
lw x4, 0(x1) // get array element
addi x4, x4, 1 // increment element
sw x4, 0(x2) // store element into output array
addi x1, x1, 4 // compute next address for input array
addi x2, x2, 4 // compute next address for output array
blt x1, x6, loop // continue looping
Answer the following questions about the behavior of the cache during execution of the above code. Note the cache has 2 ways and uses an LRU replacement policy. Assume that the cache is initially empty.
Ketchum wants to get the best performance out of his cache. He is considering modifying his current cache to double the number of cache lines while leaving all other parameters of the cache the same (2-way set associative and a block size of 4), thus doubling the total capacity of the cache. However, this new cache is a lot more expensive! Ketchum wants to choose the cheapest cache that maximizes the hit ratio. Which one should he choose? Explain your answer.
New Cache.
Currently, the memory accesses overwrite each other every time since the indices overlap. If we have three bits to represent the index instead of two, the indices will no longer conflict, reducing the misses to 2/8.
51
83Mathematics18.03
Differential Equations
None18.02Problem Set 7
Gaussian Elimination
5a0.08042895442Text
One of the black boxes we used in class was the theorem that an $n \times n$ matrix $A$ has $A \vec{v}=0$ for some non-zero vector $\vec{v} \in \mathbb{R}^{n}$ (or $\mathbb{C}^{n}$) if and only if $\operatorname{det}(A)=0$ (see, e.g., MITx 20.7). The goal of this problem is to work out why this is true (at least in the case of $3 \times 3$ matrices). The only black box we will use is the properties of the determinant. Recall that $\operatorname{dim} \operatorname{Ker}(A)=0$ means that $\operatorname{Ker}(A)$ contains only the zero vector.
(Story time begins) The way we are going to go about showing that a $3 \times 3$ matrix has $\operatorname{det} A=0$ if and only if $\operatorname{dim} \operatorname{Ker}(A)>0$ is by using Gaussian elimination to reduce the statement to the case of upper triangular (or rather, RREF) matrices. So, as a first step, we're going to check that the theorem is true for this model case. (Story time ends)
Suppose $A$ is a $3 \times 3$ matrix which is upper triangular; that is
$$
A=\left(\begin{array}{ccc}
p_{1} & a & b \\
0 & p_{2} & c \\
0 & 0 & p_{3}
\end{array}\right) \text {. }
$$
Show that $\operatorname{det} A=p_{1} p_{2} p_{3}$. In particular, $\operatorname{det}(A)=0$ if and only if at least one of $p_{1}, p_{2}, p_{3}$ is 0.
Open
Using the Laplace expansion, the only non-zero term is $p_{1} \cdot\left|\left(\begin{array}{cc}p_{2} & c \\ 0 & p_{3}\end{array}\right)\right|=$ $p_{1} p_{2} p_{3}$. Or you may use the fact that eigenvalues are $p_{1}, p_{2}, p_{3}$ and the determinant is the product of them.
One of the black boxes we used in class was the theorem that an $n \times n$ matrix $A$ has $A \vec{v}=0$ for some non-zero vector $\vec{v} \in \mathbb{R}^{n}$ (or $\mathbb{C}^{n}$) if and only if $\operatorname{det}(A)=0$ (see, e.g., MITx 20.7). The goal of this problem is to work out why this is true (at least in the case of $3 \times 3$ matrices). The only black box we will use is the properties of the determinant. Recall that $\operatorname{dim} \operatorname{Ker}(A)=0$ means that $\operatorname{Ker}(A)$ contains only the zero vector.
(Story time begins) The way we are going to go about showing that a $3 \times 3$ matrix has $\operatorname{det} A=0$ if and only if $\operatorname{dim} \operatorname{Ker}(A)>0$ is by using Gaussian elimination to reduce the statement to the case of upper triangular (or rather, RREF) matrices. So, as a first step, we're going to check that the theorem is true for this model case. (Story time ends)
Combine (b), (e) with (B) to show that $\operatorname{dim} \operatorname{Ker}(A)>0$ if and only if $\operatorname{det} A=0$.
$(\Rightarrow)$ Suppose that $\operatorname{dim} \operatorname{Ker}(A)>0$; that is, there is a non-zero vector $\vec{v}$ such that $A \vec{v}=0$. Note that (B) implies that $B \vec{v}=0$. Therefore $\operatorname{dim} \operatorname{Ker}(B)>0$, and it follows from (b) that $\operatorname{det}(B)=0$, which proves $\operatorname{det}(A)=0$.
$(\Leftarrow)$ Suppose that $\operatorname{det}(A)=0$, which implies $\operatorname{det}(B)=0$. It also follows from (b) that $\operatorname{dim} \operatorname{Ker}(B)>0$; otherwise $\operatorname{det}(B)=1$. This means there is a non-zero vector $\vec{v}$ such that $B \vec{v}=0$. Note that (B) implies $A \vec{v}=0$. Therefore $\operatorname{dim} \operatorname{Ker}(A)>0$.
One of the black boxes we used in class was the theorem that an $n \times n$ matrix $A$ has $A \vec{v}=0$ for some non-zero vector $\vec{v} \in \mathbb{R}^{n}$ (or $\mathbb{C}^{n}$) if and only if $\operatorname{det}(A)=0$ (see, e.g., MITx 20.7). The goal of this problem is to work out why this is true (at least in the case of $3 \times 3$ matrices). The only black box we will use is the properties of the determinant. Recall that $\operatorname{dim} \operatorname{Ker}(A)=0$ means that $\operatorname{Ker}(A)$ contains only the zero vector.
(Story time begins) The way we are going to go about showing that a $3 \times 3$ matrix has $\operatorname{det} A=0$ if and only if $\operatorname{dim} \operatorname{Ker}(A)>0$ is by using Gaussian elimination to reduce the statement to the case of upper triangular (or rather, RREF) matrices. So, as a first step, we're going to check that the theorem is true for this model case. (Story time ends)
Now suppose that $B$ is a $3 \times 3$ matrix in reduced row echelon form. Show that
(i) If $\operatorname{dim} \operatorname{Ker}(B)=0$ then $\operatorname{det} B=1$. (Hint: It may be helpful to recall problem $(4))$
(ii) If $\operatorname{dim} \operatorname{Ker}(B)>0$, then $\operatorname{det} B=0$. (Hint: If $B$ is in rref then $B$ is upper triangular. What does $\operatorname{dim} \operatorname{Ker}(B)>0$ tell you about the pivots of $B$ ? Combine this with part (a).)
If $\operatorname{dim} \operatorname{Ker}(B)=0$, we have $\operatorname{Ker}(B)=\{0\}$, and by problem (4) this implies that $B$ is the identity matrix. Therefore its determinant is 1.
Since $B$ is upper triangular, if all diagonal entries are non-zero, then each diagonal entry is the first non-zero element of its row, hence a leading one, and thus every column is a pivot column.
On the other hand, it follows from the rank-nullity theorem that the number of pivots is $3-\operatorname{dim} \operatorname{Ker}(B)<3$. So not every diagonal entry can be non-zero; there is some zero diagonal entry. By part (a), the determinant is 0.
One of the black boxes we used in class was the theorem that an $n \times n$ matrix $A$ has $A \vec{v}=0$ for some non-zero vector $\vec{v} \in \mathbb{R}^{n}$ (or $\mathbb{C}^{n}$) if and only if $\operatorname{det}(A)=0$ (see, e.g., MITx 20.7). The goal of this problem is to work out why this is true (at least in the case of $3 \times 3$ matrices). The only black box we will use is the properties of the determinant. Recall that $\operatorname{dim} \operatorname{Ker}(A)=0$ means that $\operatorname{Ker}(A)$ contains only the zero vector.
(Story time begins) The way we are going to go about showing that a $3 \times 3$ matrix has $\operatorname{det} A=0$ if and only if $\operatorname{dim} \operatorname{Ker}(A)>0$ is by using Gaussian elimination to reduce the statement to the case of upper triangular (or rather, RREF) matrices. So, as a first step, we're going to check that the theorem is true for this model case. (Story time ends)
(Story time begins, again) Let's remember what Gauss-Jordan elimination tells us. Gauss-Jordan says that if we have a matrix $A$, then we can perform a sequence of row operations to bring $A$ into reduced row echelon form. Let's set $B=\operatorname{rref}(A)$. Each row operation corresponds to left multiplication by one of the elementary matrices. So, if we write Gauss-Jordan elimination in terms of the elementary matrices, what we have is
$$
E_{s(N)} E_{s(N-1)} \cdots E_{s(1)} A=B
$$
Here we have written $E_{s(i)}$ to denote the elementary matrix corresponding to the row operation performed at step $i$ of the Gauss-Jordan algorithm. Furthermore, from Gauss-Jordan elimination we know that
$$
\vec{v} \in \operatorname{Ker}(A) \text { if and only if } \vec{v} \in \operatorname{Ker}(B) .
$$
If this is unclear to you, it might be worth reflecting on Gauss-Jordan elimination.
We are now very close to being finished. Recall that if $M_{1}, M_{2}$ are $n \times n$ matrices then
$$
\operatorname{det}\left(M_{1} M_{2}\right)=\operatorname{det}\left(M_{1}\right) \cdot \operatorname{det}\left(M_{2}\right) .
$$
From this we will finish our proof. (Story time ends, again).
By applying the multiplication property of the determinant (C) iteratively, show that (A) together with part (d) implies
$$
\operatorname{det} A=0 \text { if and only if } \operatorname{det} B=0 .
$$
From (A) and (C) we have
$$
\begin{aligned}
\operatorname{det}(B) &=\operatorname{det}\left(E_{s(N)} E_{s(N-1)} \cdots E_{s(1)} A\right)=\operatorname{det}\left(E_{s(N)} E_{s(N-1)} \cdots E_{s(1)}\right) \operatorname{det}(A) \\
&=\operatorname{det}\left(E_{s(N)} E_{s(N-1)} \cdots E_{s(2)}\right) \operatorname{det}\left(E_{s(1)}\right) \operatorname{det}(A)=\cdots \\
&=\operatorname{det}\left(E_{s(N)}\right) \operatorname{det}\left(E_{s(N-1)}\right) \cdots \operatorname{det}\left(E_{s(1)}\right) \operatorname{det}(A) .
\end{aligned}
$$
Since $\operatorname{det}\left(E_{s(i)}\right) \neq 0$ for every $i$, we conclude that $\operatorname{det}(A)=0$ if and only if $\operatorname{det}(B)=0$.
52
115Mathematics18.01Calculus INoneNoneProblem Set 3Chain Rule10a0.05279831045Text
Find the derivatives of the following functions:
$e^{5 x}$.
Expression
$\frac{d}{d x} e^{5 x}=e^{5 x}(5)=5 e^{5 x}$.
Find the derivatives of the following functions:
$e^{x+1}$.
$\frac{d}{d x} e^{x+1}=e^{x+1}(1)=e^{x+1}$.
Using the product rule, compute the derivative of each of the following functions.
$x e^{x}$.
$\frac{d}{d x} x e^{x}=e^{x}+x e^{x}$.
Compute the derivatives of the following functions.
$e^{-x^{2}}$.
Here we use the chain rule. That gives us that $\left(e^{\left(-x^{2}\right)}\right)^{\prime}=\left(-x^{2}\right)^{\prime} e^{-x^{2}}=-2 x e^{-x^{2}}$.
53
610EECS6.39
Introduction to Machine Learning
6.1010/6.1210, 18.06/18.C06
NoneFinal Exam
Convolutional Neural Networks
8c0.35Text
MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel $3 \times 3$ images of $2 \mathrm{D}$ tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one $2 \times 2$ filter. Let's help Rec win this competition.
If Rec wants to allow for more than two classes, which activation function should they use for final_act and which loss function?
Open
Softmax + Cross Entropy.
MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel $3 \times 3$ images of $2 \mathrm{D}$ tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one $2 \times 2$ filter. Let's help Rec win this competition.
What does each filter do? Which filter is best for distinguishing line-shaped tetris pieces vs. corner-shaped pieces? Why?
The first filter detects pixels on the diagonal and ignores vertical and horizontal lines. The second filter only detects vertical lines. The first filter is best for distinguishing corners from lines: after applying ReLU to the output, it linearly separates corners and lines, whereas the second filter does not.
MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel $3 \times 3$ images of $2 \mathrm{D}$ tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one $2 \times 2$ filter. Let's help Rec win this competition.
If Rec instead labeled line-shaped pieces as "1" and corner-shaped pieces as "0" then what values of $\mathrm{w}$ and $\mathrm{b}$ of the output layer give perfect classification and outputs that are close to 0 for corners and close to 1 for lines?
The same as above with opposite sign.
MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel $3 \times 3$ images of $2 \mathrm{D}$ tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one $2 \times 2$ filter. Let's help Rec win this competition.
Write an expression for the derivative of the binary classification loss with respect to z2, the input of final_act. You may express your answer using $g$ for the output of final_act and $y$ for the example label.
The derivative of the negative log likelihood loss with respect to the argument of the Sigmoid function is very elegant. It is $g-y$.
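For completeness, here is the standard derivation (not spelled out in the original solution), writing $g=\sigma(z_2)$ for the sigmoid output:
$$
\mathcal{L}=-\big(y \log g+(1-y) \log (1-g)\big), \qquad \frac{\partial \mathcal{L}}{\partial z_{2}}=\frac{\partial \mathcal{L}}{\partial g} \cdot \sigma^{\prime}\left(z_{2}\right)=\frac{g-y}{g(1-g)} \cdot g(1-g)=g-y.
$$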
54
222EECS6.411
Representation, Inference, and Reasoning in AI
6.1010, 6.1210, 18.600
NoneProblem Set 6Particle Filter5dii0.08333333333Text
Consider a domain in which the forward transition dynamics are "hybrid" in the sense that
$$
P\left(X_{t}=x_{t} \mid X_{t-1}=x_{t-1}\right)=p * N\left(x_{t-1}+1,0.1\right)\left(x_{t}\right)+(1-p) * N\left(x_{t-1}-1,0.1\right)\left(x_{t}\right)
$$
that is, that the state will hop forward one unit in expectation with probability $p$, or backward one unit in expectation with probability $1-p$, with variance $0.1$ in each case.
Assume additionally that the observation model $P\left(Y_{t}=y_{t} \mid X_{t}=x_{t}\right)=\operatorname{Uniform}\left(x_{t}-1, x_{t}+1\right)\left(y_{t}\right)$.
Norm runs the filter for two steps with no observations several times and is trying to decide whether there could be bugs in the code. Assuming $p=0.5$, for each of the following sets of particles, indicate whether it is (a) fairly likely (b) quite unlikely (c) completely impossible: {-2.01, -1.9, -1.0, 0.1, 0, 2.1}.
Multiple Choice
b.
Consider a domain in which the forward transition dynamics are "hybrid" in the sense that
$$
P\left(X_{t}=x_{t} \mid X_{t-1}=x_{t-1}\right)=p * N\left(x_{t-1}+1,0.1\right)\left(x_{t}\right)+(1-p) * N\left(x_{t-1}-1,0.1\right)\left(x_{t}\right)
$$
that is, that the state will hop forward one unit in expectation with probability $p$, or backward one unit in expectation with probability $1-p$, with variance $0.1$ in each case.
Assume additionally that the observation model $P\left(Y_{t}=y_{t} \mid X_{t}=x_{t}\right)=\operatorname{Uniform}\left(x_{t}-1, x_{t}+1\right)\left(y_{t}\right)$.
Norm runs the filter for two steps with no observations several times and is trying to decide whether there could be bugs in the code. Assuming $p=0.5$, for each of the following sets of particles, indicate whether it is (a) fairly likely (b) quite unlikely (c) completely impossible: {-2.05, -1.95, -0.1, 0.1, 1.9, 2.1}.
a.
Consider a domain in which the forward transition dynamics are "hybrid" in the sense that
$$
P\left(X_{t}=x_{t} \mid X_{t-1}=x_{t-1}\right)=p * N\left(x_{t-1}+1,0.1\right)\left(x_{t}\right)+(1-p) * N\left(x_{t-1}-1,0.1\right)\left(x_{t}\right)
$$
that is, that the state will hop forward one unit in expectation with probability $p$, or backward one unit in expectation with probability $1-p$, with variance $0.1$ in each case.
Assume additionally that the observation model $P\left(Y_{t}=y_{t} \mid X_{t}=x_{t}\right)=\operatorname{Uniform}\left(x_{t}-1, x_{t}+1\right)\left(y_{t}\right)$.
Norm runs the filter for two steps with no observations several times and is trying to decide whether there could be bugs in the code. Assuming $p=0.5$, for each of the following sets of particles, indicate whether it is (a) fairly likely (b) quite unlikely (c) completely impossible: {-20, -2.01, -2.001, .01, .001, 1.99, 1.999}.
b.
Consider a domain in which the forward transition dynamics are "hybrid" in the sense that
$$
P\left(X_{t}=x_{t} \mid X_{t-1}=x_{t-1}\right)=p * N\left(x_{t-1}+1,0.1\right)\left(x_{t}\right)+(1-p) * N\left(x_{t-1}-1,0.1\right)\left(x_{t}\right)
$$
that is, that the state will hop forward one unit in expectation with probability $p$, or backward one unit in expectation with probability $1-p$, with variance $0.1$ in each case.
Assume additionally that the observation model $P\left(Y_{t}=y_{t} \mid X_{t}=x_{t}\right)=\operatorname{Uniform}\left(x_{t}-1, x_{t}+1\right)\left(y_{t}\right)$.
Norm initializes the filter with particles {-2.05, -1.95, -0.1, 0.1, 1.9, 2.1} and then gets an observation of -1.0. Which of the following is a plausible posterior, assuming resampling?
(a) {-1.95, -0.1, -1.95, -1.95, -0.1, -0.1, -0.1}
(b) {-1.95, -.01}
(c) {0.1, 0.1, 0.1, -0.1, -0.1, -0.1}
(d) {-1.96, -0.11, -1.94, -1.97, -0.11, -0.09, -0.12}
(a) {-1.95, -0.1, -1.95, -1.95, -0.1, -0.1, -0.1}
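To see concretely what two no-observation prediction steps produce, here is a minimal simulation sketch (assuming a hypothetical particle count of 6; the variance 0.1 corresponds to a standard deviation of $\sqrt{0.1}$):

import random

p, steps = 0.5, 2
particles = [0.0] * 6                 # hypothetical: all particles start at the origin

for _ in range(steps):                # propagate through the transition model only
    particles = [x + (1 if random.random() < p else -1)
                 + random.gauss(0, 0.1 ** 0.5)   # variance 0.1 -> std sqrt(0.1)
                 for x in particles]

print(sorted(round(x, 2) for x in particles))    # clusters near -2, 0, and 2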
55
24EECS6.191
Computation Structures
6.100A, 8.02None
Prelab Questions 2
RISC-V Calling Convention
1a0.025Text
Consider the following C function, `f`:
int f(int x, int y) {
int w = y;
int z = mul(x, 2) + y + w;
return z;
}
Suppose that you wanted to implement the function, f, in RISC-V assembly.
Which register(s) should be used to pass the arguments to the function f? Select all correct answers.
(a) zero.
(b) ra.
(c) a0.
(d) a1.
(e) a2.
(f) t0.
(g) t1.
(h) s0.
(i) s1.
Multiple Choice
(c) a0.
(d) a1.
According to the calling conventions, the arguments to the function are passed in through the a registers, starting from a0, a1, and so on. In this question, there are 2 arguments (x and y) that need to be passed in, so only a0 and a1 are used.
Consider the following C function, `f`:
int f(int x, int y) {
int w = y;
int z = mul(x, 2) + y + w;
return z;
}
Suppose that you wanted to implement the function, f, in RISC-V assembly.
When the function returns, which register(s) should be set to the returned value? Select all correct answers.
(a) zero.
(b) ra.
(c) a0.
(d) a1.
(e) a2.
(f) t0.
(g) t1.
(h) s0.
(i) s1.
(c) a0.
According to the calling conventions, the return value from a function should always be returned in the a0 register (or in a0 and a1 if 2 values need to be returned).
Consider the following incorrect RISC-V implementation of the function f.
Assume that its calling function believes that f is properly implemented and follows the RISC-V calling conventions. Also
assume that the function mul, being called by f, is implemented correctly and follows the RISC-V calling convention. The
function mul takes two unsigned integer arguments and returns the result of multiplying those inputs.
f:
// (1)
mv a2, a1
mv s0, a1
li a1, 2
// (2)
call mul
// (3)
add a0, a0, a2
add a0, a0, s0
// (4)
ret
// (5)
Given that mul follows the calling convention, which of the following register(s) are guaranteed to hold the same value at both points (2) and (3) in the code execution? Select all correct answers.
(a) zero.
(b) ra.
(c) a0.
(d) a1.
(e) a2.
(f) t0.
(g) t1.
(h) s0.
(i) s1.
(a) zero.
(h) s0.
(i) s1.
According to the calling convention, s registers are callee-saved registers. Callee-saved registers must be returned unaltered from any function call. In addition, the zero register is hardwired to 0 and its value can never be changed. The a registers are caller-saved, however, so there is no guarantee that their value will be the same both before and after the call to mul.
Consider the following incorrect RISC-V implementation of the function f.
Assume that its calling function believes that f is properly implemented and follows the RISC-V calling conventions. Also
assume that the function mul, being called by f, is implemented correctly and follows the RISC-V calling convention. The
function mul takes two unsigned integer arguments and returns the result of multiplying those inputs.
f:
// (1)
mv a2, a1
mv s0, a1
li a1, 2
// (2)
call mul
// (3)
add a0, a0, a2
add a0, a0, s0
// (4)
ret
// (5)
Given that mul follows the calling convention, but f does not, which register(s) are not handled properly by the code above? Select all correct answers.
(a) zero.
(b) ra.
(c) a0.
(d) a1.
(e) a2.
(f) t0.
(g) t1.
(h) s0.
(i) s1.
(b) ra.
(e) a2.
(h) s0.
According to the calling convention, ra and a2 are caller-saved registers. Caller-saved registers must be saved onto the stack prior to calling another procedure if their original value will be needed upon return from the called procedure. ra must be saved in order for the ret pseudoinstruction of function f to return to the correct location in the calling code. a2 must be saved onto the stack in order to make sure that when it is used, it holds its original value and was not modified by the mul function. In addition, callee-saved registers must be saved onto the stack before being modified. Thus, s0 must be saved onto the stack prior to the pseudoinstruction mv s0, a1.
56
19Mathematics18.102
Introduction to Functional Analysis
18.C06, 18.100BNoneProblem Set 3Lp Spaces5nan0.5Text
Suppose $f \in \mathcal{L}^{1}(\mathbb{R})$ is real-valued. Show that there is a sequence $f_{n} \in \mathcal{C}_{\mathrm{c}}(\mathbb{R})$ and another element $F \in \mathcal{L}^{1}(\mathbb{R})$ such that
$$
f_{n}(x) \rightarrow f(x) \text { a.e. on } \mathbb{R},\left|f_{n}(x)\right| \leq F(x) \text { a.e. }
$$
Hint: Take an approximating series $u_{n}$ as in the definition and think about $\left|u_{n}\right|$.
Remark: The converse of this, where the $f_{n}$ are allowed to be in $\mathcal{L}^{1}(\mathbb{R})$ is 'Lebesgue Dominated Convergence'.
Open
Let $f_{n} \in C_{c}(\mathbb{R})$ be such that if $w_{1}=f_{1}$ and $w_{k}=f_{k}-f_{k-1}$, $\left(w_{n}\right)$ is absolutely summable and $f_{n}(x) \rightarrow f(x)$ almost everywhere (such a sequence $\left(f_{n}\right)$ exists by the definition of $\mathcal{L}^{1}(\mathbb{R})$ ). Then, define
$$
F(x)= \begin{cases}\sum_{n}\left|w_{n}(x)\right| & \text {if } \sum_{n}\left|w_{n}(x)\right|<\infty \\ 0 & \text {otherwise.}\end{cases}
$$
By the definition of measure zero, $F(x)=\sum_{n}\left|w_{n}(x)\right|$ almost everywhere, so by Proposition $2.5$ it follows that $F \in \mathcal{L}^{1}(\mathbb{R})$. Moreover, because whenever $F(x)=\sum_{n}\left|w_{n}(x)\right|$ we have that
$$
\left|f_{n}(x)\right|=\left|\sum_{k=1}^{n} w_{k}(x)\right| \leq \sum_{k=1}^{n}\left|w_{k}(x)\right| \leq F(x),
$$
it follows that $\left|f_{n}(x)\right| \leq F(x)$ almost everywhere. The desired conclusion follows.
Let $[f] \in L^{1}(\mathbb{R})$ be the image of $f \in \mathcal{L}^{1}(\mathbb{R})$. Suppose that $\left[f_{j}\right] \in L^{1}(\mathbb{R})$ is a Cauchy sequence. Show that $f_{j}$ has a subsequence which converges almost everywhere.
Let $n_{k}$ be such that for $m, n \geq n_{k}$ we have
$$
\left\|f_{n}-f_{m}\right\|<2^{-k}.
$$
Then setting $g_{1}=f_{n_{1}}$ and $g_{k}:=f_{n_{k}}-f_{n_{k-1}}$ for $k \geq 2$, by the triangle inequality we have
$$
\left\|g_{k}\right\|<2^{-k+1}, k \geq 2
$$
So the series $\sum_{k} g_{k}$ is absolutely summable, and its partial sums are equal to $f_{n_{k}}$. This implies that $f_{n_{k}}$ converges almost everywhere (Proposition $2.5$ in the notes), as claimed.
Suppose $u_{n} \in \mathcal{C}_{\mathrm{c}}(\mathbb{R})$ form an absolutely summable series with respect to the $L^{1}$ norm and set
$$
E=\left\{x \in \mathbb{R} ; \sum_{n}\left|u_{n}(x)\right|=\infty\right\}
$$
Deduce that if $\epsilon>0$ is given then there is an open set $O_{\epsilon} \supset E$ with $\sum_{n}\left|u_{n}(x)\right|>1 / \epsilon$ for each $x \in O_{\epsilon}$.
The subset $Z_{\epsilon}=\left\{x \in \mathbb{R} ; \sum_{n}\left|u_{n}(x)\right| \leq 1 / \epsilon\right\}$ is closed by 1 . Then $O_{\epsilon}=\mathbb{R} \backslash Z_{\epsilon}$ is open and, obviously, $E \subset O_{\epsilon}$.
Define $\mathcal{L}^{\infty}(\mathbb{R})$ as the set of functions $g: \mathbb{R} \longrightarrow \mathbb{C}$ such that there exists $C>0$ and a sequence $v_{n} \in \mathcal{C}(\mathbb{R})$ with $\left|v_{n}(x)\right| \leq C$ and $v_{n}(x) \rightarrow g(x)$ a.e.
Show that if $g \in \mathcal{L}^{\infty}(\mathbb{R})$ and $f \in \mathcal{L}^{1}(\mathbb{R})$ then $g f \in \mathcal{L}^{1}(\mathbb{R})$ and that this defines a map
$$
L^{\infty}(\mathbb{R}) \times L^{1}(\mathbb{R}) \longrightarrow L^{1}(\mathbb{R})
$$
which satisfies $\|g f\|_{L^{1}} \leq\|g\|_{L^{\infty}}\|f\|_{L^{1}}$.
Proof. For $g$ we keep the notations from the definition. For $f$ let $w_{n}$ be the absolutely summable series converging to $f$ a.e.
Note that the sequence $u_{n}=v_{k} w_{n}$ is absolutely summable, since $\sum_{n} \int\left|v_{k} w_{n}(x)\right| \leq C \sum_{n} \int\left|w_{n}(x)\right|<\infty$, and converges to $v_{k} f$ a.e., which is thus in $\mathcal{L}^{1}(\mathbb{R})$. Now $t_{n}=v_{n} f$ is dominated by $C|f|$ and converges a.e. to $f g$, so by dominated convergence $f g \in \mathcal{L}^{1}(\mathbb{R})$.
If either $f$ or $g$ is in $\mathcal{N}$ then $f g \in \mathcal{N}$, which ensures that the map descends from $\mathcal{L}$ to $L$.
Finally, since $|g| \leq\|g\|_{L^{\infty}}$ a.e.
$$
\|g f\|_{L^{1}}=\int|g f| \leq \int\|g\|_{L^{\infty}}|f|=\|g\|_{L^{\infty}}\|f\|_{L^{1}}.
$$
57
234Mathematics18.01Calculus INoneNoneProblem Set 5
Differential Equations
19b0.07919746568Text
In this section, we give an oversimplified model of how blood sugar and insulin work, and we consider the problem of designing an artificial pancreas. The biology is over-simplified, but the issues we will explore are still relevant in more accurate and complex models.
Let $S(t)$ denote the blood sugar level at time $t$. Suppose $S=10$ is a good level of blood sugar, $S$ above 12 is too high, and $S$ below 8 is too low. The blood sugar reacts to insulin levels. Let $I(t)$ denote the insulin level in the blood at time $t$. Suppose that
$$
S^{\prime}(t)=5-I(t) .
$$
So if $I(t)>5$, then blood sugar goes down, and if $I(t)<5$ then blood sugar goes up. In patients with severe diabetes, the pancreas doesn't make insulin. The artificial pancreas is a fairly recent medical technology where a medical device installed in the patient makes insulin and has to adjust insulin levels to regulate blood sugar. Figuring out when to increase/decrease the insulin level is a mathematical problem. One approach is the following: if the patient's blood sugar is too high, the artificial pancreas increases the insulin level. If the patient's blood sugar is too low, the artificial pancreas decreases the insulin level. This approach can be modelled by the following differential equation:
$$
I^{\prime}(t)=S(t)-10 .
$$
At the moment we have two equations involving two functions:
$$
S^{\prime}(t)=5-I(t) \text { and } I^{\prime}(t)=S(t)-10 .
$$
To get an equation for just one function, we can differentiate the first equation $S^{\prime}(t)=5-I(t)$, which gives $S^{\prime \prime}(t)=-I^{\prime}(t)$ and then plug in the equation for $I^{\prime}(t)$. This leads to an equation for $S(t)$ which is similar to the ones in the last few problems.
Suppose that at time $0, S(0)=13$ (too high) and $I(0)=5$.
Is the artificial pancreas doing a good job or a bad job? Explain.
Open
It's bad. Because the coefficient of the sinusoidal term in $S$ is 3 (recall $S=10+3 \cos t$), the blood sugar swings between 7 and 13. Thus, it keeps swinging into the too-high zone $(>12)$ and the too-low zone $(<8)$. Another objection to this design is that the blood sugar never stops swinging.
In this section, we give an oversimplified model of how blood sugar and insulin work, and we consider the problem of designing an artificial pancreas. The biology is over-simplified, but the issues we will explore are still relevant in more accurate and complex models.
Let $S(t)$ denote the blood sugar level at time $t$. Suppose $S=10$ is a good level of blood sugar, $S$ above 12 is too high, and $S$ below 8 is too low. The blood sugar reacts to insulin levels. Let $I(t)$ denote the insulin level in the blood at time $t$. Suppose that
$$
S^{\prime}(t)=5-I(t) .
$$
So if $I(t)>5$, then blood sugar goes down, and if $I(t)<5$ then blood sugar goes up. In patients with severe diabetes, the pancreas doesn't make insulin. The artificial pancreas is a fairly recent medical technology where a medical device installed in the patient makes insulin and has to adjust insulin levels to regulate blood sugar. Figuring out when to increase/decrease the insulin level is a mathematical problem. One approach is the following: if the patient's blood sugar is too high, the artificial pancreas increases the insulin level. If the patient's blood sugar is too low, the artificial pancreas decreases the insulin level. This approach can be modelled by the following differential equation:
$$
I^{\prime}(t)=S(t)-10 .
$$
At the moment we have two equations involving two functions:
$$
S^{\prime}(t)=5-I(t) \text { and } I^{\prime}(t)=S(t)-10 .
$$
To get an equation for just one function, we can differentiate the first equation $S^{\prime}(t)=5-I(t)$, which gives $S^{\prime \prime}(t)=-I^{\prime}(t)$ and then plug in the equation for $I^{\prime}(t)$. This leads to an equation for $S(t)$ which is similar to the ones in the last few problems.
Suppose that at time $0, S(0)=13$ (too high) and $I(0)=5$.
Find $S(t)$ and $I(t)$.
From $S^{\prime \prime}=-I^{\prime}$ (in the discussion just before Problem 19 officially begins) and $I^{\prime}=S-10$ (the second differential equation),
$$
S^{\prime \prime}=10-S \text {. }
$$
It has the same form as $x^{\prime \prime}=1-x$ (Problem 18), so its solution is, by analogy,
$$
S=10+A \sin t+B \cos t .
$$
To find $A$ and $B$, find $S(0)$ and $S^{\prime}(0)$. $S(0)$ is given as 13, and $S^{\prime}(0)=5-I(0)$. With $I(0)=5$, $S^{\prime}(0)=0$. Thus, $S$ has no sine term ($A=0$), which would otherwise give $S$ a nonzero derivative at 0. To make $S(0)=13$, set $B=3$.
$$
S=10+3 \cos t .
$$
To find $I$, use the first differential equation $S^{\prime}=5-I$, or $I=5-S^{\prime}$. Differentiating the solution for $S$ gives $S^{\prime}=-3 \sin t$. Thus,
$$
I=5+3 \sin t .
$$
As a sanity check: $I$ is increasing at $t=0$, as it should (the blood sugar started out too high).
Let us remember where we left off trying to design a good feedback loop for an artificial pancreas. We let $S(t)$ denote the blood sugar at time $t$ and $I(t)$ the insulin level at time $t$. A blood sugar $S=10$ is normal, $S>12$ is too high, and $S<8$ is too low. Blood sugar is regulated by insulin according to the equation
$$
S^{\prime}(t)=5-I(t) .
$$
The artificial pancreas can measure the blood sugar and respond by increasing or decreasing the insulin level, and we have to design exactly how it does so. In our first model, the artificial pancreas increased insulin when the blood sugar was above 10, and decreased insulin when the blood sugar was below 10. This plan led us to the equation
$$
I^{\prime}(t)=S(t)-10 .
$$
Combining the equations, we found
$$
S^{\prime \prime}(t)=-I^{\prime}(t)=10-S(t) .
$$
On the last problem set, we supposed that $S(0)=13$ (blood sugar too high) and $I(0)=5$, and we solved for $S(t)$ and $I(t)$. This describes a situation where the patient's blood sugar starts off too high, and we want to see if the artificial pancreas can restore blood sugar to a normal level. When we solved the equations, we found
$$
S(t)=10+3 \cos t \text { and } I(t)=5+3 \sin t .
$$
These pictures below show the sugar level and insulin level over time.
The patient's blood sugar drops too low, then goes up too high, then drops too low again, and repeats forever. Instead of this blood sugar roller coaster, we would like the patient's blood sugar to go down from 13 to the normal range and then stay in the normal range. How can we fix the artificial pancreas?
Diagnosis: We discussed in class what is going wrong. At time $t=\pi / 2$, we have $S(t)=10$ which looks normal, but $I(t)=8$. An insulin level of 8 is high and it's going to drive the blood sugar down. Having $S=10$ and $I=8$ is not a stable situation and not a particularly good situation. The goal is to get $S$ close to 10 and $I$ close to 5.
Here are four ways we could change the feedback loop of our artificial pancreas. Which one will best fix this problem? Explain your reasoning.
a. $I^{\prime}(t)=2(S(t)-10)$.
b. $I^{\prime}(t)=(1 / 2)(S(t)-10)$.
c. $I^{\prime}(t)=(S(t)-10)+(I(t)-5)$.
d. $I^{\prime}(t)=(S(t)-10)-(I(t)-5)$.
The best fix is option (d). Initially, using $S(0)=13$ and $I(0)=5, I^{\prime}(0)$ is positive for all the options. This means that $I(t)$ increases from 5 , which makes $S^{\prime}(t)=5-I(t)$ decrease from 13. As $S$ decreases from 13 to $10, S-10>0$ and $S^{\prime}<0$. Meanwhile, $I(t)$ is increasing until it reaches a maximum $I_{\max }$ when $I^{\prime}(t)=0$. Analyze each of the options to think about when this happens:
a. $I^{\prime}(t)=0 \quad \longrightarrow \quad S(t)=10$
b. $I^{\prime}(t)=0 \quad \longrightarrow \quad S(t)=10$
c. $I^{\prime}(t)=0 \quad \longrightarrow \quad S(t)-10=S^{\prime}(t)$
d. $I^{\prime}(t)=0 \quad \longrightarrow \quad S(t)-10=-S^{\prime}(t)$.
We should exclude (a) and (b) because they mean that $I(t)$ reaches $I_{\max }$ when $S(t)=10$, which means that the next thing that happens is a sugar crash $(S(t)$ decreases at a rapid rate). For (c), as $S(t)>10, S^{\prime}(t)$ should still be negative, so $I(t)$ is still increasing until $S(t) \leq 10$. This is also an unstable situation because $I(t)$ reaches $I_{\max }$ after $S(t) \leq 10$, causing another sugar crash. The last option is (d), which makes sense since $S(t)-10=-S^{\prime}(t)$ (a positive number) has a chance of happening before $S(t)$ reaches 10, so the insulin level can come down from $I_{\max }$ before $S(t)=10$.
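To check this reasoning numerically, here is a small forward-Euler sketch (illustrative; the initial conditions $S(0)=13$, $I(0)=5$ follow the problem, while the step size and horizon are hypothetical choices) showing that option (d) damps the oscillation:

dt, T = 0.001, 20.0
S, I = 13.0, 5.0                     # initial conditions from the problem
t = 0.0
while t < T:
    dS = 5.0 - I                     # S'(t) = 5 - I(t)
    dI = (S - 10.0) - (I - 5.0)      # option (d)
    S, I, t = S + dt * dS, I + dt * dI, t + dt
print(round(S, 3), round(I, 3))      # S settles near 10 and I near 5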
Differential equations model feedback loops. Here is a description of a feedback loop from biology. Your job is to decide which differential equations are a reasonable model of this feedback loop. This probably won't be a very accurate model, but it's really good practice in understanding what differential equations mean, and that skill would help you find a more accurate model based on more data.
Hormone $\mathrm{X}$ tells the liver to make more of protein P. However, when the liver is exposed to hormone $X$, it slowly becomes less sensitive to hormone $X$, and the effect of a given amount of hormone $\mathrm{X}$ is smaller.
We let $P(t)$ be the amount of protein $P$ in the bloodstream at time $t, X(t)$ be the amount of hormone $X$, and $S(t)$ be a measure of the sensitivity of the liver to hormone $X$. The sensitivity $S$ lies in between 0 and 1 , with 1 being the most sensitive and 0 being the least sensitive. Which differential equations approximately match the description in the last paragraph? Explain your reasoning.
b. $P^{\prime}(t)=S(t)+X(t)$ and $S^{\prime}(t)=-\frac{1}{100} P(t)$.
Choice $\mathrm{c}$ is no good because its second equation, $S^{\prime}=X-S$, says that hormone $X$ increases the sensitivity $S$-contrary to the description that exposure to hormone $X$ decreases sensitivity. Choice $b$ is also no good because $S^{\prime}$ doesn't depend on $X$.
Fortunately, choice a does show a reasonable dependence for $S^{\prime}:$ more $X$ makes $S^{\prime}$ more negative, which indeed decreases $S$. This choice also has a reasonable behavior for $P$. More $X$ means more $P^{\prime}$ (increasing $P$ ), as it should. And more $S$ (more sensitivity) means that $P^{\prime}$ is bigger, consistent with the meaning of sensitivity.
58
311EECS6.39
Introduction to Machine Learning
6.1010/6.1210, 18.06/18.C06
NoneLab 12
Reinforcement Learning
3aiii0.06944444444Text
Recall that the hyperparameter epsilon $(\epsilon)$ characterizes a trade-off between exploration and exploitation in reinforcement learning. When we use an " $\epsilon$-greedy" strategy in Q learning, we take a completely random action with probability $\epsilon$; and with probability $1-\epsilon$, we take the action that'd lead to the highest $Q$ value, i.e. we take $\arg \max _{a} Q(s, a)$.
We'll explore how choosing the value of epsilon affects the performance of $Q$ learning in a very simple game.
The choice of epsilon can affect the overall behavior of Q-learning. Let's consider three possible values for epsilon: $0.0$, $0.5$, and $1.0$.
Which of these epsilon values is guaranteed to cause optimal behavior during learning?
Multiple Choice
none: No algorithm can guarantee optimal behavior during training. The algorithm is oblivious to the true transition and reward functions, and it needs to explore to get a reward signal and understand the world (eps $>0$); but then even if it eventually converged, it would still take sub-optimal actions due to the eps $>0$ that was necessary in the first place.
Recall that the hyperparameter epsilon $(\epsilon)$ characterizes a trade-off between exploration and exploitation in reinforcement learning. When we use an " $\epsilon$-greedy" strategy in Q learning, we take a completely random action with probability $\epsilon$; and with probability $1-\epsilon$, we take the action that'd lead to the highest $Q$ value, i.e. we take $\arg \max _{a} Q(s, a)$.
We'll explore how choosing the value of epsilon affects the performance of $Q$ learning in a very simple game.
The choice of epsilon can affect the overall behavior of Q-learning. Let's consider three possible values for epsilon: $0.0$, $0.5$, and $1.0$.
Which of these epsilon values risks never finding the optimal policy?
eps=0: No exploration means there is a good chance you will miss the optimal.
Recall that the hyperparameter epsilon $(\epsilon)$ characterizes a trade-off between exploration and exploitation in reinforcement learning. When we use an " $\epsilon$-greedy" strategy in Q learning, we take a completely random action with probability $\epsilon$; and with probability $1-\epsilon$, we take the action that'd lead to the highest $Q$ value, i.e. we take $\arg \max _{a} Q(s, a)$.
We'll explore how choosing the value of epsilon affects the performance of $Q$ learning in a very simple game.
The choice of epsilon can affect the overall behavior of Q-learning. Let's consider three possible values for epsilon: $0.0$, $0.5$, and $1.0$.
Which of these epsilon values has the highest risk of spending way too much time exploring parts of the space that are unlikely to be useful?
eps=1: Completely random exploration means that a lot of time might be spent exploring clearly suboptimal strategies.
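For reference, a minimal sketch of the $\epsilon$-greedy action choice described above (the Q-table, state, and actions are hypothetical placeholders):

import random

def epsilon_greedy(Q, state, actions, eps):
    # With probability eps explore uniformly; otherwise exploit argmax_a Q(s, a).
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

Q = {(0, "up"): 1.0, (0, "down"): 0.5}
print(epsilon_greedy(Q, 0, ["up", "down"], eps=0.0))  # always "up": pure exploitation
print(epsilon_greedy(Q, 0, ["up", "down"], eps=1.0))  # uniformly random: pure exploration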
Recall that the hyperparameter epsilon $(\epsilon)$ characterizes a trade-off between exploration and exploitation in reinforcement learning. When we use an " $\epsilon$-greedy" strategy in Q learning, we take a completely random action with probability $\epsilon$; and with probability $1-\epsilon$, we take the action that'd lead to the highest $Q$ value, i.e. we take $\arg \max _{a} Q(s, a)$.
We'll explore how choosing the value of epsilon affects the performance of $Q$ learning in a very simple game.
For this part, you will use a Colab notebook we have prepared for you. You can find the Colab notebook here.
Once you run the code, wait patiently until you see a yellow and purple square on a teal background (you may need to scroll down from the "score" and "reward" text lines printed out). Ignore everything else for now. Click the play button right below the square. This is a movie of a policy playing the game No Exit. It's kind of like Pong: the purple square is the "ball" and the yellow square is your "paddle". The actions are to move the paddle up, down, or keep it still.
The state is specified by the positions and velocities of the ball and paddle, with a special added "game over" state.
The transition model is a very approximate physics model of the ball reflecting off walls and the paddle, except if the ball gets past the paddle in the positive $x$ direction, the game is over.
The agent gets a reward of $+1$ on every step it manages to survive.
When watching the game play out, you'll sometimes see that the purple square gets near the right-hand border and then suddenly it changes to a state with the purple square in the bottom left and the yellow one in the upper right; this means that the game terminated and then reset to the initial state.
Now we can go back and look at the other output in the notebook:
\begin{itemize}
\item First, we print what happens during learning in the format (number of iterations, average score): after every 10 iterations of batch Q learning, we take the current greedy policy and run it to see what its average score is. This score represents how long the episode ran before the ball ran off the map, or 100 if it lasted for that long.
\item Next is a plot of the score as a function of number of iterations.
\item Finally, we run the greedy policy with respect to the last Q-value function for 10 games and report the rewards achieved on each game. We also show a movie clip from a handful of these 10 games.
\end{itemize}
Run the code given on the notebook for values of $\epsilon$ in the set $0,0.5,1$. Does your observation match your answers from $3.1$?
Remember that this is a small instance, so sometimes the random noise of the environment might prevent you from seeing any useful information. Run the notebook two or three times if something doesn't line up with your expectation, and then ask for help.
For part (b), ask to see their plots, and whether they match up with their answers. Take time to explain anything that might make the plots not match.
It appears that the epsilon $=0$ model performs poorly, and the epsilon $=0.5$ and epsilon $=1$ models perform better. There is randomness here, so student results may vary. Have them try to explain why they got the results that they did.
59
13Mathematics18.2
Principles of Discrete Applied Mathematics
None18.C06Problem Set 3Bijection2b0.7638888889Text
In the notes on counting on Canvas, Section $4.2$ describes a map $\Psi$ that takes a binary tree $B$ with $n$ nodes (remember these are those vertices with 2 children, as opposed to leaves, which have no children) and maps it to a lattice path.
Furthermore, prove that $\Psi$ is a bijection between $\mathcal{B}_{n}$ and $\mathcal{D}_{n}$.
Open
To show that $\Psi$ is a bijection, we construct an inverse $\Phi$, so that given a Dyck path $D$, we construct a binary tree $B=\Phi(D)$ such that $\Psi(B)=D$. Since the first step of $D$ is an upwards step, we add a vertex to $B$ and place ourselves on that vertex. We now go step-by-step through $D$ to generate $B$. At each step, we will be growing a 2-tree in such a way that any vertex with a single child must be an ancestor of the vertex we are on. If we are at an upwards step and the vertex we are on has fewer than two children, we add a child and move to the new vertex. If the vertex has two children, we move up the tree until we are at an ancestor vertex with fewer than two children, then add a child and move to the new vertex. (We show later that this is always possible.) This new vertex will become a node of $B$, and it will be a node of the generated tree after the next step. If we are at a downwards step and the current vertex has fewer than two children, we add a child but stay on the current vertex. This new vertex will become a leaf of $B$. If there are two children, then like before we move up until we are at a vertex $v$ with fewer than two children, and add a child while staying at $v$. Each of these procedures preserves the fact that the generated tree is a 2-tree and that all vertices with a single child are ancestors of the current vertex.
We now show that, when needed, we can always find an ancestor with only one child. Let $T$ be the tree we have generated so far, and suppose we need to check the ancestors for a place to add a vertex, but no ancestor has space. This implies that every vertex of $T$ has zero or two children, and so $T$ is a binary tree. Then $n(T)=l(T)-1$ by the lemma. The previous step of the Dyck path must have produced a leaf, since otherwise there would have been space at the current vertex. This means that, by this point, every vertex produced by an upwards step has at least one child. Thus, every node of $T$ corresponds to an upwards step, and every leaf corresponds to a downwards step. If $k$ is the number of upwards steps so far and $m$ is the number of downwards steps so far, then $k=n(T)=l(T)-1=m-1$. Then $k<m$, which contradicts the fact that $D$ is a Dyck path. Thus, we can never reach this state, and we can always find an ancestor with one child when necessary.
After the last step of the Dyck path, we will have just added a new leaf vertex, so if $G$ is the tree generated so far, then $l(G)=n$ and $n(G)=n$. Then from the lemma, $G$ is not yet a binary tree, and either the current node has space or one of its ancestors does. We add a leaf to this node, and call this tree $B$. We have $l(B)=l(G)+1$, so $l(B)-1=n(G)=n(B)$, which makes $B$ a binary tree.
The order in which we added vertices is exactly the order in which a depth-first search would visit them, so $\Psi(\Phi(D))=\Psi(B)=D$. Conversely, we have $\Phi(\Psi(B))=B$ for all binary trees $B$. Thus, $\Phi$ and $\Psi$ are inverses and we get that these maps are bijections.
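As an informal cross-check of the bijection (not part of the required proof), one can count both sides for small $n$ and compare against the Catalan numbers; a brute-force Python sketch:

from itertools import product
from math import comb

def num_binary_trees(n):
    # Number of binary trees with n nodes (every node has exactly 2
    # children); satisfies the Catalan recurrence with num_binary_trees(0) = 1
    # (the single-leaf tree).
    if n == 0:
        return 1
    return sum(num_binary_trees(k) * num_binary_trees(n - 1 - k) for k in range(n))

def num_dyck_paths(n):
    # Enumerate all +1/-1 step sequences of length 2n and keep those whose
    # partial sums stay nonnegative and end at 0.
    count = 0
    for steps in product((1, -1), repeat=2 * n):
        total, ok = 0, True
        for s in steps:
            total += s
            if total < 0:
                ok = False
                break
        if ok and total == 0:
            count += 1
    return count

for n in range(1, 7):
    catalan = comb(2 * n, n) // (n + 1)
    assert num_binary_trees(n) == num_dyck_paths(n) == catalan
print("binary trees and Dyck paths agree with the Catalan numbers for n <= 6")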
In the notes on counting in canvas, Section $4.2$ describes a map $\Psi$ that takes a binary tree $B$ with $n$ nodes (remember these are those vertices with 2 children, as opposed to leaves who have no children) and maps it to a lattice path.
Prove that $\Psi(B)$ is a Dyck path with $2 n$ steps.
We define a 2-tree to be a plane tree where each vertex has at most 2 children. Similar to the definition of the binary tree, each vertex with at least one child is called a node, and each vertex with no children is called a leaf. In order to relate binary trees to Dyck paths, we first prove the following lemma to relate the number of leaves to the number of nodes.
Lemma 1. Let $T$ be a 2-tree, and let $l(T)$ be the number of leaves, $n(T)$ be the number of nodes. Then $n(T) \geq l(T)-1$, with equality if and only if $T$ is a binary tree.
Proof. We prove this by induction on the number of vertices. If $T$ is the trivial tree with only one vertex, then $l(T)=1$ and $n(T)=0$, so we get the statement. Now suppose we have proved the statement for all trees with at most $k$ vertices, and $T$ has $k+1$ vertices. Let $v \in T$ be a leaf of maximum depth. Then it has a parent node $w$ and at most one other sibling $v^{\prime}$. If we remove both $v$ and $v^{\prime}$ (if $v^{\prime}$ exists) to get a new tree $T^{\prime}$, then $w$ will become a leaf node. If $v^{\prime}$ exists, then $l\left(T^{\prime}\right)=l(T)-1$ and $n\left(T^{\prime}\right)=n(T)-1$, and if $v^{\prime}$ does not exist, then $l\left(T^{\prime}\right)=l(T)$ and $n\left(T^{\prime}\right)=n(T)-1$. Since $n\left(T^{\prime}\right) \geq l\left(T^{\prime}\right)-1$ by induction, we know that
$$
n(T)=n\left(T^{\prime}\right)+1 \geq l\left(T^{\prime}\right) \geq l(T)-1 .
$$
Equality only holds if $v^{\prime}$ exists and $n\left(T^{\prime}\right)=l\left(T^{\prime}\right)-1$, which only holds if $T^{\prime}$ is a binary tree. This implies that we have equality if and only if $T$ is a binary tree, as desired.

Now suppose we are given a binary tree $G$. In order for $\Psi(G)$ to be a Dyck path, we need to show that as we traverse $G$ using a depth-first search, the number of leaves encountered does not exceed the number of nodes encountered until we reach the last vertex in the search. Let $T$ be a subtree of $G$ consisting of all nodes traversed at some point in the depth-first search. Note that $T$ is a 2-tree and each node of $T$ must be a node of $G$. Then $n(T) \geq l(T)-1$ with equality only if $T$ is a binary tree. If $T$ is not a binary tree, then the number of nodes encountered at this point in the search must exceed $n(T)$ and the number of leaves encountered is at most $l(T)$, so the number of encountered leaves does not exceed the number of nodes. If $T$ is a binary tree and $T \neq G$, then one of the leaves of $T$ must be a node of $G$ (in particular, the parent of the next vertex encountered in the search). Then the number of leaves encountered is at most $l(T)-1$ and we are still fine. This shows that the number of leaves encountered does not exceed the number of nodes encountered until the last vertex, so $\Psi(G)$ is a Dyck path.
The set of $n \times n$ matrices can be identified with the space $\mathbb{R}^{n \times n}$. Let $G$ be a subgroup of $G L_n(\mathbb{R})$. With the notation of the previous problem, prove:
If $A, B, C, D$ are in $G$, and if there are paths in $G$ from $A$ to $B$ and from $C$ to $D$, then there is a path in $G$ from $A C$ to $B D$.
If $X(t)$ is a path from $A$ to $B$ in $G L_{n}$ and $Y(t)$ is a path from $C$ to $D$, then the matrix product $X(t) Y(t)$ defines a path from $A C$ to $B D$. It is continuous because matrix multiplication is continuous.
Prove that $D_{2 n}$ has exactly 4 one-dimensional complex representations if $n$ is even and exactly 2 one-dimensional representations if $n$ is odd.
Let $G$ be the commutator subgroup of $D_{2 n}$ (the subgroup generated by the commutators). We know that the number of distinct linear characters (and thus the number of non-isomorphic one-dimensional representations) is the order of the quotient group $D_{2 n} / G$, so it is enough to identify $G$. Write $D_{2 n}=\left\langle a, b: a^{n}=1, b^{2}=1, b a^{i}=a^{-i} b\right\rangle$. Then the commutators
$$
\begin{aligned}
{\left[a^{i} b, a^{j} b\right] } & =\left(a^{i} b\right)^{-1}\left(a^{j} b\right)^{-1} a^{i} b a^{j} b=b a^{-i} b a^{-j} a^{i-j}=a^{2(i-j)}, \\
{\left[a^{i} b, a^{j}\right] } & =b a^{-i} a^{-j} a^{i} b a^{j}=b a^{-2 j} b=a^{2 j}, \\
{\left[a^{i}, a^{j} b\right] } & =a^{-i} b a^{-j} a^{i} a^{j} b=a^{-2 i}, \\
{\left[a^{i}, a^{j}\right] } & =1 .
\end{aligned}
$$
Therefore the derived subgroup of $D_{2 n}$ is the cyclic group $G=\left\langle a^{2}\right\rangle$. If $n$ is odd, $\left\langle a^{2}\right\rangle=\langle a\rangle$ and $D_{2 n} / G$ is isomorphic to $\langle b\rangle$, which has order 2. If $n$ is even, we have four cosets: $G, G a, G b$, and $G a b$. Thus $D_{2 n}$ has 4 one-dimensional representations for $n$ even and 2 one-dimensional representations for $n$ odd. $\diamond$
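A quick computational sanity check of this count (representing elements of $D_{2n}$ as pairs $(i, s)$ meaning $a^{i} b^{s}$; the code is illustrative, not part of the solution):

def num_linear_chars(n):
    # D_{2n} with relations a^n = 1, b^2 = 1, b a^i = a^{-i} b.
    elems = [(i, s) for i in range(n) for s in range(2)]

    def mul(x, y):
        (i, s), (j, t) = x, y
        return ((i + (-1) ** s * j) % n, (s + t) % 2)

    def inv(x):
        i, s = x
        return ((-i) % n, 0) if s == 0 else x

    # Commutator subgroup = closure of all [x, y] = x^-1 y^-1 x y.
    subgroup = {mul(mul(inv(x), inv(y)), mul(x, y)) for x in elems for y in elems}
    while True:
        new = {mul(g, h) for g in subgroup for h in subgroup} - subgroup
        if not new:
            break
        subgroup |= new
    # Number of one-dimensional representations = |G| / |[G, G]|.
    return (2 * n) // len(subgroup)

for n in range(3, 10):
    assert num_linear_chars(n) == (4 if n % 2 == 0 else 2)
print("D_2n has 4 linear characters for even n and 2 for odd n (n = 3..9)")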
60
50Mathematics18.100BReal Analysis18.02NoneMidterm Exam 1Metric Spaces3nan5Text
Let $K$ be a non-empty compact set in a metric space $X$ and suppose $p \in$ $X \backslash K$. Show that there exists a point $q \in K$ such that
$$
d(p, q)=\inf \{d(p, x) ; x \in K\}
$$
Open
The set $\{d(p, x) ; x \in K\}$ is a non-empty subset of $(0, \infty)$ so the infimum exists and there is a sequence $x_{n} \in K$ such that $d\left(p, x_{n}\right) \rightarrow$ $D=\inf \{d(p, x) ; x \in K\}$. Now, as a sequence in a compact set, $\left\{x_{n}\right\}$ has a convergent subsequence. Since $d\left(p, x_{n_{k}}\right)$ also converges to $D$, we may just assume that $x_{n} \rightarrow q$ in $X$. Since compact sets are closed, $q \in K$ and we just need to check that $d(p, q)=D$. By the definition of infimum and the convergence of the distance, given $\epsilon>0$ there exists $n$ such that
$$
\begin{array}{r}
\left|d\left(p, x_{n}\right)-D\right|<\epsilon / 2 \text { and } d\left(x_{n}, q\right)<\epsilon / 2 \text {, but this implies that } \\
|d(p, q)-D| \leq\left|d(p, q)-d\left(p, x_{n}\right)\right|+\left|d\left(p, x_{n}\right)-D\right| \leq d\left(x_{n}, q\right)+\left|d\left(p, x_{n}\right)-D\right|<\epsilon
\end{array}
$$
for any $\epsilon>0$. Thus $d(p, q)=D$ as desired and the infimum of the distance is attained.
Here is a direct approach that a couple of people used. Set $D=$ $\inf \{d(p, x) ; x \in K\}$ and suppose that this is not attained on $K$, so for all $x \in K, d(p, x)>D$. Thus
$$
K \subset \bigcup_{x \in K} B\left(x, \frac{1}{2}(d(p, x)-D)\right)
$$
is an open cover, which therefore has a finite subcover since $K$ is compact. Let $x_{i}, i=1, \ldots, N$ be the centers of such a cover with $\epsilon_{i}=\frac{1}{2}\left(d\left(p, x_{i}\right)-D\right)$ and $\epsilon=\min _{i} \epsilon_{i}>0$. Then, each $x \in K$ is in one of these balls, so from the triangle inequality, for the appropriate $i$,
$$
d(p, x) \geq d\left(p, x_{i}\right)-\epsilon_{i} \geq D+\epsilon_{i} \geq D+\epsilon .
$$
This however shows that $D$ is not the infimum as it is defined to be, so there must be a point $q \in K$ with $d(p, q)=D$.
There is an even simpler direct approach used by several people. Suppose that $d(p, x)>r=\inf \{d(y, p) ; y \in K\}$ for all $x \in K$. Then the open sets
$$
G(n)=\{x \in K ; d(x, p)>r+1 / n\}
$$
form an open cover of $K$ which therefore must have a finite subcover by compactness. Since the $G(n)$ increase with $n, K \subset G(N)$ for some $N$ and hence $d(x, p)>r+1 / N$ for all $x \in K$, contradicting the definition of the infimum.
Another variant of this is to define $r=\inf \{d(x, p) ; x \in K\}$ and then to set
$$
K(n)=K \cap\{x \in X ; d(x, p) \leq r+1 / n\} .
$$
Since the second sets are closed, these are compact sets, being closed subsets of $K$, which are non-empty, by the definition of infimum, and decreasing as $n$ increases. Thus, by a theorem in Rudin, $T=\cap_{n} K(n) \neq \emptyset$. If $q \in T \subset K$ then $d(p, q)=r$ since $d(p, q) \leq r+1 / n$ for all $n$ and $d(p, q) \geq r$.
Suppose that $X$ is a metric space in which $d(x, y)$ is always a (nonnegative) integer. Show that $X$ is complete.
Set $\epsilon=1$ in the Cauchy property. There is an $N$ such that $d\left(x_{n}, x_{m}\right)<1$ for $m, n \geq N$. By assumption, this means that $d\left(x_{n}, x_{m}\right)=0$, so $x_{n}=x_{m}$, meaning that the sequence is eventually constant, $x_{n}=x$ for $n \geq N$. It is clear from the definition that $x_{n}$ converges to $x$.
Suppose $E \subset \mathbb{R}$ has the property that for every non-empty $B \subset E$ which is bounded, $\sup B$ and $\inf B$ are in $E$. Show that $E$ is closed with respect to the standard metric.
By a theorem in Rudin, any limit point of a set in a metric space is the limit of a sequence of points of the set. Thus if $x \in E^{\prime}$ is a limit point of $E$ then there is a sequence $x_{n} \in E$ with $x_{n} \rightarrow x$ in $\mathbb{R}$. Consider all $n \in \mathbb{N}$ such that $x_{n} \leq x$. If this set is infinite, then there is a subsequence $x_{n_{j}}$ with $x_{n_{j}} \leq x$; if not, then there is a subsequence $x_{n_{j}}$ with $x_{n_{j}}>x$. In either case $x_{n_{j}} \rightarrow x$, so we can change notation and just assume that either $x_{n} \leq x$ for all $n$ or $x_{n}>x$ for all $n$. Let $B \subset E$ be the range of this sequence; this set is bounded, since any convergent sequence is bounded. Moreover, in the first case $\sup B=x$ and in the second $\inf B=x$, since otherwise the sequence could not converge to $x$. Thus $x \in E$, and hence $E^{\prime} \subset E$ and $E$ is necessarily closed.
Of course there are many variants of this. One can certainly avoid using sequences. For instance, suppose $x \in E^{\prime}$ but $x \notin E$. Then the sets $B(x, 1 / n) \cap E$ are all infinite, for $n \in \mathbb{N}$. Consider $(x, x+1 / n) \cap E$; either this is infinite for all $n$ or else it is empty for large $n$, in which case $(x-1 / n, x) \cap E$ must be infinite for all $n$. So, we can choose either $x_{n} \in(x, x+1 / n)$ for all $n$ or $x_{n} \in(x-1 / n, x)$ for all $n$. Let $B$ be the subset of $E$ consisting of these choices; then $x=\inf B$ in the first case and $x=\sup B$ in the second case, and in both cases $B$ is bounded. Thus in fact $x \in E$ by the assumption, contradicting the assumption that $x \notin E$. Thus $E$ is closed.
I rather like the following proof from Yunjian Xu which neatly avoids the division into two pieces: Let $p$ be a limit point of $E$. Then for every $n \in \mathbb{N}$, $D_{n}=B(p, 1 / n) \cap E$ is a bounded, infinite subset of $E$, so by assumption $q_{n}=\sup D_{n} \in E$. This sequence is bounded since it lies in $B(p, 1)$; let $B$ be its range. This is again a bounded nonempty subset of $E$ and we claim $p=\inf B$, so $p \in E$. Indeed $q_{n}$ is a non-increasing (the sets are getting smaller) sequence which is bounded below so it converges to the infimum of its range, but since $\left|q_{n}-p\right| \leq 1 / n$ the limit must be $p$.
Main shortcomings: not making sure that $x$ was the sup or inf of a chosen subset. Minor problems included assuming that just because $(x-1, x) \cap E$ is infinite, $x$ has to be its supremum: $E \cap(x-1, x) \subset\left(x-1, x-\frac{1}{2}\right)$ is a possibility (but then of course $x=\inf ((x, x+1) \cap E)$).
Let $X$ be a metric space which is totally bounded. Show that there is a countable subset $B \subset X$ such that, for every point $l \in X$, there is a sequence in $B$ which converges to $l$.
For each $n \geq 1$, set $\varepsilon=\frac{1}{n}$. Since $X$ is totally bounded, there exists a finite set $F_{n}$ such that for each $x \in X$, there is a $y \in F_{n}$ with $d(x, y)<\varepsilon$. Let $B=\bigcup_{n \geq 1} F_{n}$. Since $B$ is a countable union of finite sets, it is countable.
Let $l \in X$ be given. For each $n \in \mathbb{N}$, let $x_{n}$ be a point in $F_{n}$ such that $d\left(x_{n}, l\right)<\frac{1}{n}$. Then $\left(x_{n}\right)$ is a sequence in $B$ converging to $l$.
61
5Mathematics18.100BReal Analysis18.02NoneProblem Set 1
Axioms of Arithmetic
6nan1.071428571Text
Show that the axioms of arithmetic and the axioms of ordering imply the following: if $x>y$, then $x^{3}>y^{3}$. [Besides the axioms, you can use any theorem proved in the first two lectures. If you feel underwhelmed by this pset, you can try to also prove the converse implication to this problem; however, no credit will be awarded for it.]
Open
We need to show that if $x-y \in P$, then $x^{3}-y^{3} \in P$. Write
$$
2\left(x^{3}-y^{3}\right)=(x-y)\left(2 x^{2}+2 x y+2 y^{2}\right)=(x-y)\left(x^{2}+y^{2}+(x+y)^{2}\right).
$$
The three squares $x^{2}, y^{2}$, and $(x+y)^{2}$ are either zero or in $P$ (theorem from the class: the square of any nonzero element lies in $P$ ). Moreover, since $x \neq y$ by assumption, at least one of the squares $x^{2}, y^{2}$ must be in $P$ (same theorem). It follows that $x^{2}+y^{2}+(x+y)^{2} \in P$. We now know, from the equation above and the assumption that $x-y \in P$, that $2\left(x^{3}-y^{3}\right) \in P$. If $x^{3}-y^{3}$ were not in $P$, it would either have to be zero, or $-\left(x^{3}-y^{3}\right)$ would have to be in $P$ (trichotomy), and then $2\left(x^{3}-y^{3}\right)$ would inherit the same properties, which is a contradiction. Hence, $x^{3}-y^{3}$ must be in $P$.
Now, the solution above definitely qualifies as sneaky (the much-hated "pull a formula out of a hat" trick). A more reasonable alternative (only sketched here) would be to first show the following Lemma: if $a$ is positive, and $b>c$, then $a b>a c$. (This follows directly from the axioms, all we're saying is that if $a \in P$ and $b-c \in P$, then $a b-a c=a(b-c) \in P$.). Using that, one can show the desired inequality if both $x$ and $y$ are positive:
$$
x^{3}=\left(x^{2}\right) x>\left(x^{2}\right) y=(x y) x>(x y) y=\left(y^{2}\right) x>\left(y^{2}\right) y=y^{3}.
$$
What about all the other situations? If both $x$ and $y$ are negative and satisfy $x>y$, then $(-x)$ and $(-y)$ are both positive, and $-x<-y$ (one can easily check that by reducing both properties to $x-y \in P$ ). The previous case tells us that $(-x)^{3}<(-y)^{3}$, but (using the fact that $(-a) \cdot(-b)=a b$ from the lecture) one sees easily that $-x^{3}=(-x)^{3}$ and $-y^{3}=(-y)^{3}$, so we now know that $-x^{3}<-y^{3}$, which (as before) yields $x^{3}>y^{3}$. There are still more cases, namely when one of the two numbers is zero, or when $x$ is positive and $y$ is negative; but those can be dealt with case-by-case quite easily.
Continuing the previous problem, suppose that our originally given numbers had a subset $P$ which satisfies the axioms of ordering (with respect to $+$ and $\cdot$ ). Is there a subset which does the same for our new operations $+$ and $\odot$? [Note that the axiom of completeness is not part of the axioms of ordering.]
We use $P^{\prime}=-P=\{-x: x \in P\}$ as the subset of positive numbers for our new operations. Trichotomy for this $P^{\prime}$ says that for each $x$, either $x=0$, or $-x \in P$, or $-(-x) \in P$. But $-(-x)=x$, because both those numbers are the additive inverse of $-x$, and additive inverses are unique. So this statement is the same as trichotomy for $P$, which we know.
Suppose $x, y \in P^{\prime}$, so $-x \in P$ and $-y \in P$. Now $(-x)+(-y)$ is the additive inverse of $x+y$, because $(-x)+(-y)+x+y=((-x)+x)+((-y)+y)=0+0=0$. Therefore, it follows from the axioms of ordering for $P$ that $(-x)+(-y)=-(x+y) \in P$, which shows that $x+y \in P^{\prime}$.
Suppose $x, y \in P^{\prime}$, so $-x \in P$ and $-y \in P$. The statement that $x \odot y \in P^{\prime}$ means that $-(-(x \cdot y)) \in P$, or equivalently (by what I've observed above) that $x \cdot y \in P$. But we know that to be true, because (as proved in lecture) $x \cdot y=(-x) \cdot(-y)$, where the right hand side lies in $P$ because of the axiom of ordering for $\cdot$.
Suppose that we have any notion of number, satisfying the axioms of arithmetic. Let's change the operations as follows: we keep addition, but change multiplication to $x \odot y=-(x \cdot y)$, where $-(\cdots)$ is the additive inverse for the old operation $+$. Do our new operations satisfy the axioms of arithmetic? Explain your answer.
During these computations, we will use $-(a \cdot b)=(-a) \cdot b=a \cdot(-b)$ many times. (Axiomatically, this follows from the distributive axiom, which shows that $(-a) \cdot b$ is an additive inverse to $a \cdot b$.)
Addition did not change, so we don't have to check any of its properties.
When we spell out the axioms for $\odot$ in terms of the old operations, we get:
$$
\begin{array}{ll}
-(x \cdot y)=-(y \cdot x) & \text { commutativity } \\
-(x \cdot(-(y \cdot z)))=-((-(x \cdot y)) \cdot z) & \text { associativity } \\
-(x \cdot(y+z))=(-(x \cdot y))+(-(x \cdot z)) & \text { distributivity. }
\end{array}
$$
The first line, commutativity, is obviously true. For the second line, we see (using the fact mentioned at the beginning) that $-(x \cdot(-(y \cdot z)))=x \cdot y \cdot z$, and the same applies to $-((-(x \cdot y)) \cdot z)$. Distributivity uses the same strategy: $-(x \cdot(y+z))=(-x) \cdot(y+z)=(-x) \cdot y+(-x) \cdot z=(-(x \cdot y))+(-(x \cdot z))$.
The final step is the multiplicative neutral element and inverses. One has $-((-1) \cdot x)=1 \cdot x=x$, so $-1$ is a multiplicative neutral element for $\odot$. For inverses, we need $x \odot y$ to equal the new neutral element $-1$; since $-\left(x \cdot x^{-1}\right)=-1$, the old inverse $x^{-1}$ also serves as a multiplicative inverse with respect to $\odot$.
Prove that the commutativity and associativity axioms for addition, together with the axiom of the existence of a neutral element for addition, imply that each $x$ can have at most one additive inverse. [This is Lemma 1.2 from the class summaries; obviously, you can't use either that Lemma, or anything that came after that. Argue strictly axiomatically.]
Suppose that $y$ and $z$ are both additive inverses of $x$, so $x+y=0$ and $x+z=0$. Then,
$$
y=y+0=y+(x+z).
$$
Here, we have used the defining property of the neutral element 0 , as well as the fact that $z$ is an inverse of $x$. Now we use associativity and commutativity:
$$
y+(x+z)=(y+x)+z=(x+y)+z .
$$
Now we use that $y$ is an inverse, and the defining property of the neutral element 0:
$$
(x+y)+z=0+z=z .
$$
Taking all that together, we have shown that $y=z$: any two additive inverses of $x$ must be equal, so there is at most one.
62
12Mathematics18.6
Probability and Random Variables
18.02NoneProblem Set 1Counting7d0.1Text
In how many ways can 8 people be seated in a row if there are 5 men and they must sit next to each other?
Numerical
If there are 5 men and they must sit next to each other, then there are $5! \cdot 4!=2,880$ possible seating arrangements, because the 5 men can be bundled into a single block: there are $4!$ ways to order the block together with the remaining 3 people, and $5!$ ways to order the men within the block.
In how many ways can 8 people be seated in a row if there are 4 men and 4 women and no 2 men or 2 women can sit next to each other?
If there are 4 men and 4 women and no 2 men or 2 women can sit next to each other, then the seating must alternate, so there are $4! \cdot 4! \cdot 2!=1,152$ possible seating arrangements: $4!$ permutations of the men, $4!$ permutations of the women, and $2!$ choices for which group takes the odd-numbered seats.
In how many ways can 8 people be seated in a row if there are 4 married couples and each couple must sit together?
If there are 4 married couples and each couple must sit together, then there are $4! \cdot(2!)^{4}=384$ possible seating arrangements: $4!$ permutations of the couples and $2!$ orderings within each of the 4 couples.
In how many ways can 8 people be seated in a row if there are no restrictions on the seating arrangement?
If there are no restrictions, then there are $8!=40,320$ possible seating arrangements.
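Since $8!$ is small, all four counts can be confirmed by brute force; a short Python sketch (the labeling of people as men, women, and couples below is an arbitrary choice for the check):

from itertools import permutations

people = range(8)

# Part with 5 men: people 0-4 are the men.
men = set(range(5))
def men_together(seating):
    pos = sorted(i for i, p in enumerate(seating) if p in men)
    return pos[-1] - pos[0] == 4  # five men occupy consecutive seats
assert sum(men_together(s) for s in permutations(people)) == 2880  # 5! * 4!

# Part with 4 men and 4 women: people 0-3 are men, 4-7 are women.
def alternating(seating):
    return all((seating[i] < 4) != (seating[i + 1] < 4) for i in range(7))
assert sum(alternating(s) for s in permutations(people)) == 1152  # 4! * 4! * 2!

# Part with 4 couples: (0,1), (2,3), (4,5), (6,7), each seated together.
def couples_together(seating):
    return all(abs(seating.index(2 * c) - seating.index(2 * c + 1)) == 1
               for c in range(4))
assert sum(couples_together(s) for s in permutations(people)) == 384  # 4! * 2^4

print("all counts verified by brute force over 8! = 40,320 seatings")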
63
143EECS18.C06
Linear Algebra and Optimization
18.02NoneMidterm Exam 1Projection Matrix6a0.9375Text
Consider two $n \times n$ projection matrices
$$
P=I-v_{1} v_{1}^{\top} \quad \text { and } \quad Q=I-v_{2} v_{2}^{\top}
$$
where $v_{1}$ and $v_{2}$ have unit norm and are orthogonal to each other. Let $A=P Q$.
What is the dimension of $N(A)$? Find an orthonormal basis for $N(A)$.
Hint: You can find vectors in $N(A)$, but to show a vector is not in $N(A)$ you may want to use an orthogonal decomposition.
Expression
The dimension of $N(A)$ is two and $\left\{v_{1}, v_{2}\right\}$ forms an orthonormal basis. It is easy to see that $A v_{i}=0$. Moreover, for any vector $v$ we can form an orthogonal decomposition $v=u+w$ where $u$ is in the span of $v_{1}$ and $v_{2}$ and $w$ is in the orthogonal complement. Then $A v=w$, so if $v$ is not in the span of $v_{1}$ and $v_{2}$, then $w \neq 0$ and $v$ is not in the nullspace.
Consider two $n \times n$ projection matrices
$$
P=I-v_{1} v_{1}^{\top} \quad \text { and } \quad Q=I-v_{2} v_{2}^{\top}
$$
where $v_{1}$ and $v_{2}$ have unit norm and are orthogonal to each other. Let $A=P Q$.
What is the rank of $A$?
By the rank-nullity theorem, we have that
$$
\operatorname{rank}(A)+\operatorname{dim} N(A)=n .
$$
By the previous item, $\operatorname{dim} N(A)=2$, and thus the rank of $A$ is $n-2$.
Consider two $n \times n$ projection matrices
$$
P=I-v_{1} v_{1}^{\top} \quad \text { and } \quad Q=I-v_{2} v_{2}^{\top}
$$
where $v_{1}$ and $v_{2}$ have unit norm and are orthogonal to each other. Let $A=P Q$.
Is $A$ a projection matrix?
Yes. We can write out
$$
A=\left(I-v_{1} v_{1}^{\top}\right)\left(I-v_{2} v_{2}^{\top}\right)=I-v_{1} v_{1}^{\top}-v_{2} v_{2}^{\top}+v_{1}\left(v_{1}^{\top} v_{2}\right) v_{2}^{\top}=I-v_{1} v_{1}^{\top}-v_{2} v_{2}^{\top},
$$
where the cross term vanishes because $v_{1}^{\top} v_{2}=0$. So this is the projection onto the orthogonal complement of $\operatorname{span}\left\{v_{1}, v_{2}\right\}$. We can also verify that $A^{2}=A$.
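A small numerical sanity check of these identities (using randomly generated orthonormal $v_1, v_2$; not part of the solution):

import numpy as np

rng = np.random.default_rng(0)
n = 5
# Two orthonormal vectors via a reduced QR factorization.
V = np.linalg.qr(rng.standard_normal((n, 2)))[0]
v1, v2 = V[:, :1], V[:, 1:]

P = np.eye(n) - v1 @ v1.T
Q = np.eye(n) - v2 @ v2.T
A = P @ Q

# The cross term v1 (v1^T v2) v2^T vanishes, so A = I - v1 v1^T - v2 v2^T,
# and A is idempotent: a projection that kills v1 and v2.
assert np.allclose(A, np.eye(n) - v1 @ v1.T - v2 @ v2.T)
assert np.allclose(A @ A, A)
assert np.allclose(A @ v1, 0) and np.allclose(A @ v2, 0)
print("rank of A:", np.linalg.matrix_rank(A))  # prints n - 2 = 3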
Consider the following matrix.
$$
A=\left[\begin{array}{ccccc}
0 & 2 & 4 & 5 & 6 \\
0 & 1 & 2 & 5 & 8 \\
0 & 0 & 0 & -3 & -6
\end{array}\right].
$$
Give a basis for $N(A)$, and also state the dimension of $N(A)$.
The null-space has dimension 3, and is
$$
N(A)=\operatorname{span}\left\{\left[\begin{array}{l}
1 \\
0 \\
0 \\
0 \\
0
\end{array}\right],\left[\begin{array}{c}
0 \\
2 \\
0 \\
-2 \\
1
\end{array}\right],\left[\begin{array}{c}
0 \\
-2 \\
1 \\
0 \\
0
\end{array}\right]\right\}
$$
Notice that this is consistent with the rank-nullity theorem, since
$$
\operatorname{dim} C(A)+\operatorname{dim} N(A)=2+3=5=\operatorname{dim} \mathbb{R}^{5}.
$$
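As a quick check, the claimed basis can be verified numerically; a short numpy sketch:

import numpy as np

A = np.array([[0, 2, 4, 5, 6],
              [0, 1, 2, 5, 8],
              [0, 0, 0, -3, -6]])
basis = np.array([[1, 0, 0, 0, 0],
                  [0, 2, 0, -2, 1],
                  [0, -2, 1, 0, 0]]).T  # basis vectors as columns

# Each basis vector lies in N(A), the three are independent, and
# rank + nullity = 2 + 3 = 5 as required.
assert np.allclose(A @ basis, 0)
assert np.linalg.matrix_rank(basis) == 3
assert np.linalg.matrix_rank(A) == 2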
64
38Mathematics18.701Algebra I18.100BNoneProblem Set 5
Orthogonal Matrices and Rotations
3a0.25Text
Let $A$ be a $3 \times 3$ orthogonal matrix with det $A=1$, whose angle of rotation is different from 0 or $\pi$, and let $M=A-A^t$.
Show that $M$ has rank 2, and that a nonzero vector $X$ in the nullspace of $M$ is an eigenvector of $A$ with eigenvalue 1.
Open
Let $A$ be a rotation matrix, an element of $S O_{3}$. If a vector $X$ is fixed by $A$, it is also fixed by its inverse $A^{-1}=A^{t}$, and therefore $M X=\left(A-A^{t}\right) X=0$. The rank of $M$ is thus less than 3. Conversely, if $M X=0$, then $A X=A^{-1} X$. When the angle of rotation isn't 0 or $\pi$, this happens only for vectors $X$ on the axis of rotation, so the rank of $M$ is 2.
Let $A$ be a $3 \times 3$ orthogonal matrix with det $A=1$, whose angle of rotation is different from 0 or $\pi$, and let $M=A-A^t$.
Find such an eigenvector explicitly in terms of the entries of the matrix $A$.
A fixed vector can be found by solving the equation $M X=0$, and this isn't difficult. The result is this: Let $u=a_{12}-a_{21}, v=a_{13}-a_{31}$, and $w=a_{23}-a_{32}$. Then
$$
M=\left(\begin{array}{ccc}
0 & u & v \\
-u & 0 & w \\
-v & -w & 0
\end{array}\right)
$$
and $(w,-v, u)^{t}$ is in the nullspace of $M$ and is a fixed vector of $A$.
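A numerical illustration of this fixed vector (building a rotation via Rodrigues' formula; the axis and angle below are arbitrary):

import numpy as np

def rotation_matrix(axis, theta):
    # Rodrigues' formula: rotation by theta about a unit axis k.
    k = axis / np.linalg.norm(axis)
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

A = rotation_matrix(np.array([1.0, 2.0, 2.0]), 0.9)
M = A - A.T
u, v, w = M[0, 1], M[0, 2], M[1, 2]
x = np.array([w, -v, u])

assert np.linalg.matrix_rank(M) == 2
assert np.allclose(M @ x, 0)
assert np.allclose(A @ x, x)  # x lies on the axis of rotation
print("fixed vector (normalized):", x / np.linalg.norm(x))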
One of the black boxes we used in class was the theorem that an $n \times n$ matrix $A$ has $A \vec{v}=0$ for some non-zero vector $\vec{v} \in \mathbb{R}^{n}$ (or $\mathbb{C}^{n}$) if and only if $\operatorname{det}(A)=0$ (see, e.g., MITx 20.7). The goal of this problem is to work out why this is true (at least in the case of $3 \times 3$ matrices). The only black box we will use is the properties of the determinant. Recall that $\operatorname{dim} \operatorname{Ker}(A)=0$ means that $\operatorname{Ker}(A)$ contains only the zero vector.
(Story time begins) The way we are going to go about showing that a $3 \times 3$ matrix has $\operatorname{det} A=0$ if and only if $\operatorname{dim} \operatorname{Ker}(A)>0$ is by using Gaussian elimination to reduce the statement to the case of upper triangular (or rather, RREF) matrices. So, as a first step, we're going to check that the theorem is true for this model case. (Story time ends)
Suppose $A$ is a $3 \times 3$ matrix which is upper triangular; that is
$$
A=\left(\begin{array}{ccc}
p_{1} & a & b \\
0 & p_{2} & c \\
0 & 0 & p_{3}
\end{array}\right) \text {. }
$$
Show that $\operatorname{det} A=p_{1} p_{2} p_{3}$. In particular, $\operatorname{det}(A)=0$ if and only if at least one of $p_{1}, p_{2}, p_{3}$ is 0.
Using the Laplace expansion, the only non-zero term is $p_{1} \cdot\left|\left(\begin{array}{cc}p_{2} & c \\ 0 & p_{3}\end{array}\right)\right|=$ $p_{1} p_{2} p_{3}$. Or you may use the fact that eigenvalues are $p_{1}, p_{2}, p_{3}$ and the determinant is the product of them.
There is a $3 \times 3$ real matrix $A$ so that $A^{2}=-I$. Hint: Think about determinants.
False. Suppose there is such an $A$. Then $\operatorname{det}\left(A^{2}\right)=\operatorname{det}(A)^{2}>0$. But $\operatorname{det}(-I)=(-1)^{3}=-1$. Thus we reach a contradiction and there can be no such $A$.
65
68EECS6.191
Computation Structures
6.100A, 8.02None
Prelab Questions 7
Ideal Cache Behavior
2d0.012Text
We will be using the following program to examine our cache behavior. Let N = 16 be the size of the data region, in words. Let A be an array of N elements, located initially at 0x240. Note that these values are hardcoded into the program below but we will be changing them later.
// A = 0x240, starting address of array
// N = 16, size of data region
// this program adds 16 words from array A, then repeats.
. = 0x200
test:
li a0, 16 // initialize loop index i
li a1, 0 // sum = 0
loop: // add up elements in array
addi a0, a0, -1 // decrement index
slli a2, a0, 2 // convert to index byte offset
lw a3, 0x240(a2) // load value of A[i]
add a1, a1, a3 // add to sum
bnez a0, loop // loop until all words are summed
j test // perform test again!
// Array
. = 0x240
.word ... // A[0]
.word ... // A[1]
...
.word ... // A[15]
Our cache has a total of 64 words. The initial configuration is direct mapped, with 1 word per line, so the cache has 64 lines numbered 0-63 (0x00 - 0x3F).
To achieve 100% steady state hit ratio, it must be the case that the instructions and array data can reside in the cache at the same time. Let's check if this is currently the case.
Which cache line (index) does the last data element, A[15], map to? Provide your answer in hexadecimal.
Numerical
0x1F.
The address of the last data element, A[15], is 0x27C = 0b_0010_0111_1100. Bits[7:2] are the index bits = 0b011111 = 0x1F (or line 31).
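The same index extraction can be expressed as a tiny helper; a sketch assuming the direct-mapped, 1-word-per-line configuration described above:

def cache_line_index(addr, num_lines=64, words_per_line=1):
    # Drop the 2 byte-offset bits (word alignment), then any block-offset
    # bits, then keep log2(num_lines) index bits.
    word_addr = addr >> 2
    block_addr = word_addr // words_per_line
    return block_addr % num_lines

for label, addr in [("li a0, 16", 0x200), ("j test", 0x21C),
                    ("A[0]", 0x240), ("A[15]", 0x27C)]:
    print(f"{label:10s} -> line {cache_line_index(addr):#04x}")
# Prints 0x00, 0x07, 0x10, 0x1f, matching the answers in this problem.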
We will be using the following program to examine our cache behavior. Let N = 16 be the size of the data region, in words. Let A be an array of N elements, located initially at 0x240. Note that these values are hardcoded into the program below but we will be changing them later.
// A = 0x240, starting address of array
// N = 16, size of data region
// this program adds 16 words from array A, then repeats.
. = 0x200
test:
li a0, 16 // initialize loop index i
li a1, 0 // sum = 0
loop: // add up elements in array
addi a0, a0, -1 // decrement index
slli a2, a0, 2 // convert to index byte offset
lw a3, 0x240(a2) // load value of A[i]
add a1, a1, a3 // add to sum
bnez a0, loop // loop until all words are summed
j test // perform test again!
// Array
. = 0x240
.word ... // A[0]
.word ... // A[1]
...
.word ... // A[15]
Our cache has a total of 64 words. The initial configuration is direct mapped, with 1 word per line, so the cache has 64 lines numbered 0-63 (0x00 - 0x3F).
To achieve 100% steady state hit ratio, it must be the case that the instructions and array data can reside in the cache at the same time. Let's check if this is currently the case.
Which cache line (index) does A[0] map to? Provide your answer in hexadecimal.
0x10.
The address of A[0] is 0x240 = 0b_0010_0100_0000. The bottom two bits are used for word alignment. Bits[7:2] are the index bits = 0b010000 = 0x10. So A[0] maps to index 0x10 (or line 16).
We will be using the following program to examine our cache behavior. Let N = 16 be the size of the data region, in words. Let A be an array of N elements, located initially at 0x240. Note that these values are hardcoded into the program below but we will be changing them later.
// A = 0x240, starting address of array
// N = 16, size of data region
// this program adds 16 words from array A, then repeats.
. = 0x200
test:
li a0, 16 // initialize loop index i
li a1, 0 // sum = 0
loop: // add up elements in array
addi a0, a0, -1 // decrement index
slli a2, a0, 2 // convert to index byte offset
lw a3, 0x240(a2) // load value of A[i]
add a1, a1, a3 // add to sum
bnez a0, loop // loop until all words are summed
j test // perform test again!
// Array
. = 0x240
.word ... // A[0]
.word ... // A[1]
...
.word ... // A[15]
Our cache has a total of 64 words. The initial configuration is direct mapped, with 1 word per line, so the cache has 64 lines numbered 0-63 (0x00 - 0x3F).
To achieve 100% steady state hit ratio, it must be the case that the instructions and array data can reside in the cache at the same time. Let's check if this is currently the case.
Which cache line (index) does the last instruction j test map to? Provide your answer in hexadecimal.
0x7.
The address of the last instruction is 0x21C = 0b_0010_0001_1100. Bits[7:2] are the index bits = 0b000111 = 0x7.
We will be using the following program to examine our cache behavior. Let N = 16 be the size of the data region, in words. Let A be an array of N elements, located initially at 0x240. Note that these values are hardcoded into the program below but we will be changing them later.
// A = 0x240, starting address of array
// N = 16, size of data region
// this program adds 16 words from array A, then repeats.
. = 0x200
test:
li a0, 16 // initialize loop index i
li a1, 0 // sum = 0
loop: // add up elements in array
addi a0, a0, -1 // decrement index
slli a2, a0, 2 // convert to index byte offset
lw a3, 0x240(a2) // load value of A[i]
add a1, a1, a3 // add to sum
bnez a0, loop // loop until all words are summed
j test // perform test again!
// Array
. = 0x240
.word ... // A[0]
.word ... // A[1]
...
.word ... // A[15]
Our cache has a total of 64 words. The initial configuration is direct mapped, with 1 word per line, so the cache has 64 lines numbered 0-63 (0x00 - 0x3F).
To achieve 100% steady state hit ratio, it must be the case that the instructions and array data can reside in the cache at the same time. Let's check if this is currently the case.
Since there are 64 lines in the cache, we need log2(64) = 6 index bits to select a cache line. Which cache line (index) does the first instruction li a0, 16 map to? Provide your answer in hexadecimal.
0x0.
The address of the first instruction is 0x200 = 0b_0010_0000_0000. The bottom two bits are used for word alignment. Since the cache has a block size of one, there are no block offset bits. Since there are 64 lines in the cache, there are 6 index bits (bits[7:2]). Since the index = 0x0 for the first instruction, this instruction will go in line 0x0 of the cache.
66
35Mathematics18.01Calculus INoneNoneProblem Set 1
Exponentials and Logarithms
16b0.07919746568Text
If $2^{100}=10^{t}$, which of the following is the best approximation of $t$: 10, 20, 30, 40, or 50? (If you want, you can use that $\log _{2} 10=3.32 \ldots$)
Multiple Choice
In words, $\log _{2} 10 \approx 3.32$ means that there are approximately $3.32$ factors of 2 in a factor of 10. Thus, 100 factors of 2 are, approximately, $100 / 3.32 \approx 30$ factors of 10.
$$
2^{100} \approx 10^{30} .
$$
Given that $\log _{2} 10=3.32 \ldots$, give a reasonable approximation for $\log _{2} 100 ?$ What about $\log _{2} 10^{10} ?$
First,
$$
\log _{2} 100=\log _{2} 10^{2}=2 \times \underbrace{\log _{2} 10}_{\approx 3.32} \approx 6.64 .
$$
Similarly,
$$
\log _{2} 10^{10}=10 \log _{2} 10 \approx 33.2 \text {. }
$$
If $100^{10}=10^{t}$, what is $t$?
Since $100=10^{2}$,
$$
100^{10}=\left(10^{2}\right)^{10}=10^{20} .
$$
Thus, $t=20$.
Recall that $e$ is the number 2.71... It plays a special role in calculus because $\frac{d}{d x} e^{x}=e^{x}$.
Approximate $10^{.01}$. First write $10^{.01}=e^{t}$ and approximate $t$. You can use that $\log _{e} 10=2.30 \ldots$ Then use linear approximation to approximate $e^{t}$. Give an answer that is accurate to within .01.
Start with
$$
10 \equiv e^{\log _{e} 10} .
$$
Then,
$$
10^{0.01}=\left(e^{\log _{e} 10}\right)^{0.01}=e^{0.01 \times \log _{e} 10} \approx e^{0.023} .
$$
Using the linear approximation for $e^{x}$ gives
$$
10^{0.01} \approx 1.023 .
$$
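These approximations are easy to confirm numerically; a short Python check:

import math

# log2(10) = 3.321..., so 2^100 = 10^(100 / log2(10)) ~ 10^30.
t = 100 / math.log2(10)
print(round(t))  # 30

# 10^0.01 = e^(0.01 * ln 10) ~ e^0.023 ~ 1 + 0.023 by linear approximation.
print(math.exp(0.01 * math.log(10)))  # 1.02329..., within .01 of 1.023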
67
346EECS6.39
Introduction to Machine Learning
6.1010/6.1210, 18.06/18.C06
NoneLab 13
Nearest Neighbors
4nan0.4166666667Text
Suppose we are interested in comparing decision trees and kNNs in a new application. We have 10 million data points for training. Suppose every leaf in our trained decision tree has a depth of 5. Now a test point comes along, and we are interested in making a prediction at the new point. At testing time, about how many operations would it take to make a prediction using our decision tree? At testing time, about how many operations would it take to make a prediction using kNNs?
Open
For the decision tree: we need to check our test point against a split at every level of the tree. If every leaf is at a depth of 5, we expect to check 5 splits.
For kNNs: it seems like we have to compare our test point to all 10 million training points. There are more clever ways to handle this in practice, but kNNs can be expensive at test time if there's a lot of data.
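To make the contrast concrete, here is an illustrative numpy sketch of what a single kNN prediction costs (scaled down to one million points; the problem's ten million behaves the same way, just ten times slower):

import numpy as np

rng = np.random.default_rng(0)
n_train = 1_000_000  # scaled-down stand-in for the 10 million points
X_train = rng.standard_normal((n_train, 2))
y_train = rng.integers(0, 2, size=n_train)
x_test = rng.standard_normal(2)

# A depth-5 decision tree answers after ~5 comparisons; kNN instead touches
# every training point: one distance computation per point.
dists = np.linalg.norm(X_train - x_test, axis=1)     # n_train distances
nearest = np.argpartition(dists, 5)[:5]              # indices of the 5 closest
prediction = np.bincount(y_train[nearest]).argmax()  # majority vote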
For this section, we will be looking at a dataset with 10 points that has three classes, shown below. We're going to apply the BuildTree algorithm from the notes, but for classification instead of regression.
When we use BuildTree for classification, there are two main differences relative to the BuildTree algorithm for regression:
\begin{itemize}
\item In classification, we decide where to split a node based on one of the classification-specific criteria, such as weighted average entropy. (In regression, we decide to split based on squared error loss.)
\item In classification, we predict the majority vote of training data points at a leaf. (In regression, we predict the empirical average of training data points at a leaf.)
\end{itemize}
We're now going to step through building a decision tree for classification. First, we'll need to choose how to split our tree at the root node. To decide how to split our tree at the root node, we'll need to compute the weighted average entropy for all possible splits.
What if our data set looked like the right plot below? How does this new data set relate to the old data set (repeated on the left below for easy comparison)? If you were to repeat the BuildTree$(I=\{1, \ldots, 10\}, k=4)$ computations from above but now on this new data set, how would the decision tree you find be different from on the old data set? How would the accuracies change? (Optional: what if we were to repeat the call to BuildTree$(I=\{1, \ldots, 10\}, k=1)$ on the new data set? Would the resulting tree and accuracy change compared with when we ran BuildTree$(I=\{1, \ldots, 10\}, k=1)$ on the old data set?)
We notice that the new dataset is the same as the old dataset except that both the $x_{1}$ and $x_{2}$ features have been shifted and scaled. Thus, the decision tree boundaries will have the same shape and placement relative to the data points, but the values at the splits will be different. The accuracies of the decision trees stay the same.
Consider the following training set used to train a kNN for classification. The feature space is two-dimensional and the dataset contains two classes: orange stars and blue dots. Points 1 and 2 are test data points. Assume we use Euclidean distance.
Draw the decision boundaries (a rough sketch is fine) for the following values of $k: 1,5,9$. Be prepared to show your sketches during check-off. What is going on when $k=9$ ?
These are the plots for $k=1,5,9$ respectively:
When $k=9$, everything is classified as orange star because there are only 9 training examples, 5 of which are orange stars.
For this section, we will be looking at a dataset with 10 points that has three classes, shown below. We're going to apply the BuildTree algorithm from the notes, but for classification instead of regression.
When we use BuildTree for classification, there are two main differences relative to the BuildTree algorithm for regression:
\begin{itemize}
\item In classification, we decide where to split a node based on one of the classification-specific criteria, such as weighted average entropy. (In regression, we decide to split based on squared error loss.)
\item In classification, we predict the majority vote of training data points at a leaf. (In regression, we predict the empirical average of training data points at a leaf.)
\end{itemize}
We're now going to step through building a decision tree for classification. First, we'll need to choose how to split our tree at the root node. To decide how to split our tree at the root node, we'll need to compute the weighted average entropy for all possible splits.
Without actually doing all the BuildTree$(I=\{1, \ldots, 10\}, k=1)$ computations, can you say what the accuracy of the resulting tree would be on the training data?
The accuracy would be 10/10 because every data point would get its own leaf.
68
3EECS18.C06
Linear Algebra and Optimization
18.02NoneProblem Set 1Vector Spaces2a0.1851851852Text
True or False: The columns of the matrix $\left[\begin{array}{lll}0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1\end{array}\right]$ span $\mathbb{R}^{3}$.
Multiple Choice
True. We can write any vector as a linear combination of the columns, since these are the vectors of the standard basis (i.e., columns of the identity matrix) in a different order.
True or False: The columns of the matrix $\left[\begin{array}{lll}2 & 0 & 0 \\ 1 & 0 & 0 \\ 4 & 3 & 2\end{array}\right]$ span $\mathbb{R}^{3}$.
False. The last two columns are multiples of each other, so the column space is a two-dimensional subspace, not $\mathbb{R}^{3}$.
True or False: The columns of the matrix $\left[\begin{array}{ccc}1 & -3 & -1 \\ 0 & 3 & -6 \\ 0 & 0 & 4\end{array}\right]$ span $\mathbb{R}^{3}$.
True. If we try to express a vector as a linear combination of the columns, we can always solve the corresponding linear system, since the given matrix is upper triangular with nonzero diagonal entries.
Use Gaussian elimination to find all the solutions to the following system of linear equations
$$
\left[\begin{array}{ccc}
2 & -1 & 4 \\
-3 & 2 & 5 \\
-5 & 3 & 1
\end{array}\right]\left[\begin{array}{l}
x \\
y \\
z
\end{array}\right]=\left[\begin{array}{c}
-2 \\
0 \\
2
\end{array}\right]
$$
You should express your answer in the form where there are some free variables that can be set independently and the rest of the variables are then determined. What can you say about the span of the three columns of this matrix? Do they span $\mathbb{R}^{3}$ or can you express one column as a linear combination of the others?
Consider the augmented matrix $[A \mid b]$ and its RREF form
$$
\left[\begin{array}{ccc|c}
2 & -1 & 4 & -2 \\
-3 & 2 & 5 & 0 \\
-5 & 3 & 1 & 2
\end{array}\right] \quad \stackrel{\text { RREF }}{\longrightarrow} \quad\left[\begin{array}{ccc|c}
1 & 0 & 13 & -4 \\
0 & 1 & 22 & -6 \\
0 & 0 & 0 & 0
\end{array}\right]
$$
Backsolving from this, we obtain the general solution
$$
x=-4-13 t, \quad y=-6-22 t, \quad z=t .
$$
The column space is two-dimensional, since the first two columns are linearly independent, and the last column is 13 times the first one plus 22 times the second one. Equivalently, the null-space of $A$ is spanned by the vector $[13,22,-1]$.
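A quick numpy verification of the general solution and the null-space claim (not part of the solution):

import numpy as np

A = np.array([[2, -1, 4], [-3, 2, 5], [-5, 3, 1]], dtype=float)
b = np.array([-2, 0, 2], dtype=float)

# The general solution x = -4 - 13t, y = -6 - 22t, z = t solves A v = b
# for every t, and [13, 22, -1] spans the one-dimensional null space.
for t in (0.0, 1.0, -2.5):
    v = np.array([-4 - 13 * t, -6 - 22 * t, t])
    assert np.allclose(A @ v, b)
assert np.allclose(A @ np.array([13.0, 22.0, -1.0]), 0)
assert np.linalg.matrix_rank(A) == 2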
69
30EECS6.18
Computer Systems Engineering
6.1010, 6.1910NoneHands-on 5Traceroute5nan0.1666666667Text
At the command prompt, type:
traceroute 18.31.0.200
Describe what is strange about the observed output, and why traceroute gives you such an output. Refer to the traceroute man page for useful hints. Copy/paste any of the relevant portions of output below.
Open
After the 4th hop, the hops begin to oscillate between 18.69.3.2 and 18.4.7.65, until the maximum number of hops is reached.
5 MITNET.CORE-1-EXT.CSAIL.MIT.EDU (18.4.7.65) 9.058 ms 9.099 ms 9.344 ms
...
25 MITNET.CORE-1-EXT.CSAIL.MIT.EDU (18.4.7.65) 9.040 ms 9.077 ms 9.035 ms
26 DMZ-RTR-2-CSAIL.MIT.EDU (18.4.7.1) 8.764 ms 8.878 ms 8.836 ms
27 MITNET.CORE-1-EXT.CSAIL.MIT.EDU (18.4.7.65) 8.813 ms 9.049 ms 9.021 ms
28 DMZ-RTR-2-CSAIL.MIT.EDU (18.4.7.1) 8.562 ms 8.726 ms 8.797 ms
29 MITNET.CORE-1-EXT.CSAIL.MIT.EDU (18.4.7.65) 9.335 ms 9.373 ms 9.381 ms
30 DMZ-RTR-2-CSAIL.MIT.EDU (18.4.7.1) 8.980 ms 9.224 ms 9.000 ms
One possible reason for this is the existence of a “routing loop” by which each of the two routers thinks the other has a route to the destination, i.e. that each has a path to the destination that contains the other. This can happen in distance-vector routing, in part because details of the path being advertised are not shared.
For this exercise, you need to use the traceroute server at http://www.slac.stanford.edu/cgi-bin/nph-traceroute.pl. You'll use this server to execute a traceroute to your own machine.
To figure out your machine's IP address, run /sbin/ifconfig. You'll get a lot of information, including its IP address.
Once you have your IP, use Stanford's server to execute a traceroute to it.
Then run your own traceroute to Stanford's server, via
traceroute [IP ADDRESS FROM STANFORD]
You can get Stanford's IP address from the website.
Describe anything unusual about the output. Are the same routers traversed in both directions? If not, why might this happen? Be sure to copy/paste any relevant portions of your traceroute output here.
I used the looking-glass at www.net.princeton.edu; both outputs are below:
traceroute to 18.9.64.24 (18.9.64.24), 30 hops max, 40 byte packets
1 core-87-router (128.112.128.2) 0.743 ms 0.586 ms 0.438 ms
2 border-87-router (128.112.12.142) 0.888 ms 0.701 ms 0.638 ms
3 local1.princeton.magpi.net (216.27.98.113) 11.788 ms 1.824 ms 1.815 ms
4 216.27.100.18 (216.27.100.18) 2.166 ms 2.009 ms 2.062 ms
5 et-7-1-0.4079.rtsw.newy32aoa.net.internet2.edu (162.252.70.102) 4.070 ms 4.019 ms 4.051 ms
6 nox300gw1-i2-re.nox.org (192.5.89.221) 9.263 ms 9.165 ms 9.262 ms
7 192.5.89.22 (192.5.89.22) 9.220 ms 9.407 ms 9.199 ms
8 external-rtr-3-nox.mit.edu (18.32.4.110) 8.428 ms 8.682 ms 8.334 ms
9 dmz-rtr-1-external-rtr-3.mit.edu (18.69.7.1) 8.715 ms 8.603 ms 8.589 ms
10 backbone-rtr-1-dmz-rtr-1.mit.edu (18.69.1.2) 8.665 ms 8.759 ms 8.622 ms
11 oc11-rtr-1-backbone-rtr-1.mit.edu (18.123.69.2) 8.858 ms 8.459 ms 8.457 ms
12 buzzword-bingo.mit.edu (18.9.64.24) 16.011 ms 16.173 ms 15.738 ms
traceroute to www.net.princeton.edu (128.112.128.55), 30 hops max, 60 byte packets
1 18.9.64.3 (18.9.64.3) 8.164 ms 8.192 ms 8.173 ms
2 BACKBONE-RTR-1-OC11-RTR-1.MIT.EDU (18.123.69.1) 8.367 ms 8.493 ms 8.473 ms
3 DMZ-RTR-1-BACKBONE-RTR-1.MIT.EDU (18.69.1.1) 8.488 ms 8.468 ms 8.442 ms
4 EXTERNAL-RTR-3-DMZ-RTR-1.MIT.EDU (18.69.7.2) 8.498 ms 8.386 ms 8.380 ms
5 NOX-CPS-EXTERNAL-RTR-3.MIT.EDU (18.32.132.109) 8.533 ms 8.633 ms 8.610 ms
6 10ge5-7.core1.bos1.he.net (206.108.236.30) 22.500 ms 15.543 ms 8.487 ms
7 100ge12-2.core1.nyc4.he.net (184.105.64.53) 13.449 ms 13.479 ms 13.526 ms
8 princeton-university.10gigabitethernet1-1-6.switch1.nyc8.he.net (216.66.49.74) 15.948 ms 15.732 ms 15.834 ms
9 core-87-router.Princeton.EDU (128.112.12.130) 16.998 ms 16.951 ms 17.037 ms
10 www.net.Princeton.EDU (128.112.128.55) 16.329 ms 23.503 ms 16.239 ms
The routers shown are definitely not the same for both directions; however, the ends of the paths are very similar. This suggests that each AS tends to route the packets through the same routers in both directions, but that BGP chooses different paths for the two directions. This would arise from ties between two routers in path selection being decided arbitrarily, and from the fact that BGP typically attempts to hand off packets to other ASes at the first possible point, which can be different for the different directions.
In at most 50 words, explain how traceroute discovers a path to a remote host. The man page might be useful in answering this question.
Traceroute works by sending probe packets to random, typically unused ports on a given destination with small but increasing TTL values, and then listening for “time exceeded” responses from each gateway along the path; the destination itself answers with a “port unreachable” response, which marks the end of the path.
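For illustration only, this mechanism is short to sketch with scapy (the probe port 33434 is the conventional traceroute base port; running this requires root privileges and scapy installed):

from scapy.all import ICMP, IP, UDP, sr1

def simple_traceroute(dst, max_hops=30):
    # Send UDP probes with increasing TTL; each router that drops a probe
    # answers with ICMP "time exceeded" (type 11), and the destination
    # answers the final probe with ICMP "port unreachable".
    for ttl in range(1, max_hops + 1):
        probe = IP(dst=dst, ttl=ttl) / UDP(dport=33434)
        reply = sr1(probe, timeout=2, verbose=0)
        if reply is None:
            print(f"{ttl:2d}  *")
        elif reply.haslayer(ICMP) and reply[ICMP].type == 11:
            print(f"{ttl:2d}  {reply.src}")
        else:
            print(f"{ttl:2d}  {reply.src}  (destination reached)")
            break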
What are the IP addresses of maple and willow on this network? (Hint: Check the man page of tcpdump to discover how you can obtain the IP addresses)
If we redo the given command: tcpdump -r tcpdump.dat > /tmp/outfile.pcap ; mv /tmp/outfile.pcap outfile.txt but with the added modifier “-n” in front of the “-r” modifier, the new outfile has the numeric IP addresses instead of the domain names. These are 128.30.4.223 for maple, and 128.30.4.222 for willow.
70
461EECS6.39
Introduction to Machine Learning
6.1010/6.1210, 18.06/18.C06
NoneProblem Set 3
Gradient Descent
10aii0.02083333333Text
Last week, we defined the _ridge regression_ objective function.
$$
J_{\text {ridge }}\left(\theta, \theta_{0}\right)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}+\theta_{0}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
Recall that in last week's homework, we derived the closed-form solution for the $\theta$ value that minimizes the least squares loss function. However, it is computationally challenging to compute the closed-form solution for $\theta$ on high-dimensional data and large datasets.
This week, we wish to apply gradient descent and stochastic gradient descent to minimize the ridge regression function. In order to use the gradient descent and stochastic gradient descent functions implemented previously in the homework, we must add ones to the end of each datapoint in $X$, which we implemented in the add_ones_row function in homework 1.
In the next subsections, we assume that $X$ is a $d \times n$ matrix, $Y$ is a $1 \times n$ matrix, and $\theta$ is a $d \times 1$ matrix. Rewriting the ridge objective through matrix operations, we find that:
$$
J_{\text {ridge }}(\theta)=\frac{1}{n}\left(\theta^{T} X-Y\right)\left(\theta^{T} X-Y\right)^{T}+\lambda\|\theta\|^{2}
$$
For the rest of the problem, assume that $\mathrm{X}$ already has ones at the end of each datapoint. You do not need to call the add_one_rows function.
Implement objective_func. objective_func returns a function that computes $J_{\text {ridge }}(\theta)$.
inputs:
X: a (dxn) numpy array.
Y: a (1xn) numpy array
lam: regularization parameter
outputs:
f : a function that takes in a (dx1) numpy array "theta" and returns *as a float* the value of the ridge regression objective when theta(the variable)="theta"(the numpy array)
def objective_func(X, Y, lam):
    def f(theta):
        # write your implementation here
        pass
    return f
Programming
def objective_func(X, Y, lam):
    def f(theta):
        n = X.shape[1]
        sq_loss = (1 / n) * (theta.T @ X - Y) @ (theta.T @ X - Y).T
        regularizer = lam * theta.T @ theta
        return (sq_loss + regularizer).item()
    return f
Last week, we defined the _ridge regression_ objective function.
$$
J_{\text {ridge }}\left(\theta, \theta_{0}\right)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}+\theta_{0}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
Recall that in last week's homework, we derived the closed-form solution for the $\theta$ value that minimizes the least squares loss function. However, it is computationally challenging to compute the closed-form solution for $\theta$ on high-dimensional data and large datasets.
This week, we wish to apply gradient descent and stochastic gradient descent to minimize the ridge regression function. In order to use the gradient descent and stochastic gradient descent functions implemented previously in the homework, we must add ones to the end of each datapoint in $X$, which we implemented in the add_ones_row function in homework 1.
In the next subsections, we assume that $X$ is a $d \times n$ matrix, $Y$ is a $1 \times n$ matrix, and $\theta$ is a $d \times 1$ matrix. Rewriting the ridge objective through matrix operations, we find that:
$$
J_{\text {ridge }}(\theta)=\frac{1}{n}\left(\theta^{T} X-Y\right)\left(\theta^{T} X-Y\right)^{T}+\lambda\|\theta\|^{2}
$$
For the rest of the problem, assume that $\mathrm{X}$ already has ones at the end of each datapoint. You do not need to call the add_one_rows function.
Implement objective_func_grad. objective_func_grad returns a function that computes $\nabla J_{\text {ridge }}(\theta)$.
inputs:
X: a (dxn) numpy array.
Y: a (1xn) numpy array
lam: regularization parameter
outputs:
df : a function that takes in a (dx1) numpy array "theta" and returns the gradient of the ridge regression objective when theta(the variable)="theta"(the numpy array)
def objective_func_grad(X, Y, lam):
    def df(theta):
        # write your implementation here
        pass
    return df
def objective_func_grad(X, Y, lam):
    def df(theta):
        n = X.shape[1]
        sq_loss = (2 / n) * X @ (theta.T @ X - Y).T
        regularizer = 2 * lam * theta
        return sq_loss + regularizer
    return df
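A standard way to validate such a gradient is a central finite-difference check; a sketch that reuses the objective_func and objective_func_grad defined above (the dimensions and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 4, 20, 0.1
X = rng.standard_normal((d, n))
Y = rng.standard_normal((1, n))
theta = rng.standard_normal((d, 1))

f = objective_func(X, Y, lam)
df = objective_func_grad(X, Y, lam)

eps = 1e-6
num_grad = np.zeros((d, 1))
for i in range(d):
    e = np.zeros((d, 1))
    e[i] = eps
    num_grad[i] = (f(theta + e) - f(theta - e)) / (2 * eps)  # central difference
assert np.allclose(num_grad, df(theta), atol=1e-5)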
Last week, we defined the _ridge regression_ objective function.
$$
J_{\text {ridge }}\left(\theta, \theta_{0}\right)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}+\theta_{0}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
Recall that in last week's homework, we derived the closed-form solution for the $\theta$ value that minimizes the least squares loss function. However, it is computationally challenging to compute the closed-form solution for $\theta$ on high-dimensional data and large datasets.
This week, we wish to apply gradient descent and stochastic gradient descent to minimize the ridge regression function. In order to use the gradient descent and stochastic gradient descent functions implemented previously in the homework, we must add ones to the end of each datapoint in $X$, which we implemented in the add_ones_row function in homework 1.
In the next subsections, we assume that $X$ is a $d \times n$ matrix, $Y$ is a $1 \times n$ matrix, and $\theta$ is a $d \times 1$ matrix. Rewriting the ridge objective through matrix operations, we find that:
$$
J_{\text {ridge }}(\theta)=\frac{1}{n}\left(\theta^{T} X-Y\right)\left(\theta^{T} X-Y\right)^{T}+\lambda\|\theta\|^{2}
$$
For the rest of the problem, assume that $X$ already has ones at the end of each datapoint. You do not need to call the add_ones_row function.
Write an expression for $\nabla J_{\text {ridge }}(\theta)$ with respect to $\theta$.
Enter your answers as mathematical expressions. You should use transpose(m) for the transpose of an array m, f(x) for a function f applied to a scalar or vector x, and p@q to indicate a matrix product of two arrays/matrices p and q. Remember that p*q denotes component-wise multiplication.
Enter a Python expression involving X, Y, lambda, n, and theta. You will also need to use transpose, @ and * appropriately.
(2 / n) * X @ transpose(transpose(theta) @ X - Y) + 2 * lambda * theta
Last week, we defined the _ridge regression_ objective function.
$$
J_{\text {ridge }}\left(\theta, \theta_{0}\right)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}+\theta_{0}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
Recall that in last week's homework, we derived the closed-form solution for the $\theta$ value that minimizes the least squares loss function. However, it is computationally challenging to compute the closed-form solution for $\theta$ on high-dimensional data and large datasets.
This week, we wish to apply gradient descent and stochastic gradient descent to minimize the ridge regression function. In order to use the gradient descent and stochastic gradient descent functions implemented previously in the homework, we must add ones to the end of each datapoint in $X$, which we implemented in the add_ones_row function in homework 1.
So far in the course, you've learned about two different hyperparameters: the regularization rate $\lambda$ and the step size/learning rate $\eta$. You might be wondering, how do we pick the best hyperparameters for minimizing the loss function? One of the most basic ways to pick regularization rate and learning rate is to use grid search. The basic idea behind grid search is to select several possible $(\lambda, \eta)$ pairs, train models with every combination of these hyperparameters, and evaluate each trained model to select the best hyperparameters.
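In code, grid search amounts to a pair of nested loops over the candidate values; the sketch below assumes hypothetical train_model and evaluate helpers and illustrative grids, since the actual training loop lives in the colab:
```
import itertools

lams = [0.0, 0.01, 0.1, 1.0]    # illustrative regularization rates
etas = [0.0001, 0.001, 0.01]    # illustrative learning rates

best_score, best_params = float("inf"), None
for lam, eta in itertools.product(lams, etas):
    # train_model, evaluate, and the data splits are hypothetical helpers.
    model = train_model(X_train, Y_train, lam=lam, eta=eta)
    score = evaluate(model, X_val, Y_val)
    if score < best_score:
        best_score, best_params = score, (lam, eta)

print(best_params)
```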
We will be running grid search over the Boston Housing dataset. For more information about this dataset, please visit this link.
For the rest of this exercise, we will be predicting the median value of houses in the Boston area using linear regression and gradient descent. Please visit the colab notebook linked on the top of this page to collect metrics on how well gradient descent works on this regression problem.
Among the grid of values specified in the colab, what is the best value of $\lambda$ and $\eta$ when using gradient descent? Enter your answer as a tuple $(\lambda, \eta)$.
(0.1, 0.001)
71
33EECS6.2
Electrical Circuits: Modeling and Design of Physical Systems
8.02NoneProblem Set 7Thermal System2e0.2380952381Text
As mentioned in our first lecture, an electrical circuit is a good mathematical language for modelling non-electrical systems such as mechanical and biological systems. Here we are going to use a circuit to analyze a thermal system.
Prior to $t=0$, the building considered in (D) reaches thermal equilibrium with the environment, i.e. $T_{i}=T_{e}$. Then at $t=0$, the owner of the building turns on the heater which has a constant power Q. What is the final temperature that the interior of the building will reach? Express the interior temperature as a function of time $T_{i}(t)$ for $t>$ 0. Note that the environment temperature remains constant at $T_{\mathrm{e}}$.
Expression
$T_{i}(t \rightarrow \infty)=T_{e}+\frac{Q}{\frac{1}{R_{1}}+\frac{1}{R_{2}}+\frac{1}{R_{3}}+\frac{1}{R_{4}}+\frac{1}{R_{5}}+\frac{1}{R_{6}}}$.
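Numerically, the steady-state temperature rise is the heater power divided by the total thermal conductance of the six surfaces; a minimal sketch with made-up values for Q, T_e, and the resistances:
```
# Illustrative values only: heater power Q in watts, thermal resistances in K/W.
Q = 1000.0
R = [0.05, 0.02, 0.02, 0.02, 0.03, 0.03]   # R1..R6, hypothetical
T_e = 283.0                                # environment temperature, K

# Parallel combination: the total conductance is the sum of the 1/R_i.
G_total = sum(1.0 / r for r in R)
T_final = T_e + Q / G_total
print(T_final)
```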
As mentioned in our first lecture, an electrical circuit is a good mathematical language for modelling non-electrical systems such as mechanical and biological systems. Here we are going to use a circuit to analyze a thermal system.
Now consider a small building that is attached to a big building as shown in the figure below. The small building exchanges heat with the big building through one shared wall with thermal resistance $R_{1}$. Meanwhile, the small building exchanges heat with the external environment through the remaining three walls plus the roof and the floor $\left(R_{2} \sim R_{6}\right)$. Because the big building has a very large volume, its temperature $T_{0}$ remains almost unchanged for the considered time. In order to calculate the temperature change of the small building, what circuit element will you use to model the influence from the big building? Draw a circuit to model the thermal system associated with the small building shown in the figure below. Label node voltages, currents, resistances, and the values of all the other circuit elements you use. In this part, the heater in the small building has a constant power $Q$ which turns on at $t=0$. For $t<0$, the small building's temperature has reached a stable point by exchanging thermal energy with the big building and the environment for a very long time. For $t>0$, express the interior temperature of the small building as a function of time, $T_{i}(t)$.
Since $T_o$ of the big building is constant, we can use a voltage source with $V=T_o - T_e$ to model it.
The circuit is below.
As mentioned in our first lecture, an electrical circuit is a good mathematical language for modelling non-electrical systems such as mechanical and biological systems. Here we are going to use a circuit to analyze a thermal system.
When the heater turns on, the room's temperature does not jump to a high value immediately. Instead, the air, furniture, and everything else inside the room absorb thermal energy and slowly raise their temperature. The temperature change rate is proportional to the net heat flow into the room via $q=C_{t h} \frac{d T_{i}}{d t}$, where $C_{t h}$ is the "thermal capacity", in units of Joule/Kelvin, and $q$ is the net heat flow (the heater power minus the heat flow to the environment). Here for simplicity we assume that the whole interior of the building has the same temperature $T_{i}$. What circuit element will you use to model the thermal capacity of the building? Draw a circuit to model the heat generation, thermal capacity, and heat flow process of this building. Label node voltages, current, resistances, and the other circuit element you use.
We use an electrical capacitor to model the thermal capacity.
The circuit is below.
As mentioned in our first lecture, an electrical circuit is a good mathematical language for modelling non-electrical systems such as mechanical and biological systems. Here we are going to use a circuit to analyze a thermal system.
The building in (B) is equipped with a heating device which generates heat with a constant power of $Q$ (in units of Watt, the same as the unit of heat flow q). What circuit element will you use to model this heat generation device? Draw a circuit to model the heat generation and heat transfer process of this building. Label node voltages, current and resistances in the circuit below.
A current source with $I=Q$ can be used to model the heater.
The circuit is below.
72
194EECS6.411
Representation, Inference, and Reasoning in AI
6.1010, 6.1210, 18.600
NoneProblem Set 5
Localization with Viterbi
4d0.2604166667Text
In this section, we will implement an HMM for a robot that is moving around randomly in a 2D grid with obstacles. The robot has sensors that allow it to detect obstacles in its immediate vicinity. It knows the grid map, with the locations of all obstacles, but it is uncertain about its own location in the grid. We will use Viterbi to determine the most likely locations for the robot given a sequence of local and potentially noisy observations.
Concretely, we will represent the 2D grid with obstacles as a list of lists, where 1s represent obstacles and 0s represent free space. Example:
obstacle_map = [
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
[1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1],
[1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
]
The state of the robot is its location in the grid, represented as a tuple of ints, (row, col). Transitions are uniformly distributed amongst the robot's current location and the neighboring free (not obstacle) locations, where neighboring = 4 cardinal directions (up, down, left, right).
Observations are a 4-tuple that list which directions have obstacles, in order [N E S W], with a 1 for an obstacle and 0 for no obstacle. Observations that are "off the map" are 1, as though they are obstacles. For instance, in the map above, if there were no observation noise, then the observation for the top left corner (state=(0, 0)) would be (1, 0, 1, 1). Observations can also be corrupted with noise; see the create_observation_potential docstring for more details.
Our ultimate task will be to take in a sequence of observations and return the corresponding sequence of most likely states.
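For concreteness, here is a minimal, self-contained check of the observation convention against the example map above (the helper function is illustrative, not part of the assignment's API):
```
obstacle_map = [
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
]

def noise_free_observation(r, c):
    # Look up the four neighbors in order [N E S W]; off-map cells read as 1.
    def at(rr, cc):
        if 0 <= rr < len(obstacle_map) and 0 <= cc < len(obstacle_map[0]):
            return obstacle_map[rr][cc]
        return 1
    return (at(r - 1, c), at(r, c + 1), at(r + 1, c), at(r, c - 1))

print(noise_free_observation(0, 0))  # (1, 0, 1, 1), matching the text
```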
Write a function that creates a random variable for an observation at a given time step in an obstacle HMM. See docstring for description.
For reference, our solution is 3 line(s) of code.
def create_observation_variable(name):
    '''Creates a RV for the HMM observation with the given name.
    Observations are a 4-tuple that list which directions have obstacles,
    in order [N E S W], with a 1 for an obstacle and 0 for no obstacle.
    Observations that are "off the map" are 1, as though they are obstacles.
    For instance, in the following map:
    obstacle_map = [
        [0, 1],
        [0, 0],
    ]
    if there were no observation noise, then the observation for the top left
    location would be (1, 1, 0, 1).
    The domain of the observation variable should be a list of 4-tuples.
    Hint: you may find it useful to use `itertools.product`. For example,
    see what happens with `list(itertools.product(["foo", "bar"], repeat=2))`.
    Args:
        name: A str name for the variable.
    Returns:
        zt: A RV as described above.
    '''
    raise NotImplementedError("Implement me!")
Programming
def create_observation_variable(name):
    '''Creates a RV for the HMM observation with the given name.
    Observations are a 4-tuple that list which directions have obstacles,
    in order [N E S W], with a 1 for an obstacle and 0 for no obstacle.
    Observations that are "off the map" are 1, as though they are obstacles.
    For instance, in the following map:
    obstacle_map = [
        [0, 1],
        [0, 0],
    ]
    if there were no observation noise, then the observation for the top left
    location would be (1, 1, 0, 1).
    The domain of the observation variable should be a list of 4-tuples.
    Hint: you may find it useful to use `itertools.product`. For example,
    see what happens with `list(itertools.product(["foo", "bar"], repeat=2))`.
    Args:
        name: A str name for the variable.
    Returns:
        zt: A RV as described above.
    '''
    domain = list(itertools.product([0, 1], repeat=4))
    return RV(name, domain)
In this section, we will implement an HMM for a robot that is moving around randomly in a 2D grid with obstacles. The robot has sensors that allow it to detect obstacles in its immediate vicinity. It knows the grid map, with the locations of all obstacles, but it is uncertain about its own location in the grid. We will use Viterbi to determine the most likely locations for the robot given a sequence of local and potentially noisy observations.
Concretely, we will represent the 2D grid with obstacles as a list of lists, where 1s represent obstacles and 0s represent free space. Example:
obstacle_map = [
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
[1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1],
[1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
]
The state of the robot is its location in the grid, represented as a tuple of ints, (row, col). Transitions are uniformly distributed amongst the robot's current location and the neighboring free (not obstacle) locations, where neighboring = 4 cardinal directions (up, down, left, right).
Observations are a 4-tuple that list which directions have obstacles, in order [N E S W], with a 1 for an obstacle and 0 for no obstacle. Observations that are "off the map" are 1, as though they are obstacles. For instance, in the map above, if there were no observation noise, then the observation for the top left corner (state=(0, 0)) would be (1, 0, 1, 1). Observations can also be corrupted with noise; see the create_observation_potential docstring for more details.
Our ultimate task will be to take in a sequence of observations and return the corresponding sequence of most likely states.
Write a function that creates a random variable for a state at a given time step in an obstacle HMM. The domain of the state variable should be a list of (row, col) indices into the map. Only free positions (not obstacles) should be included in the domain of the state variable. See docstring for more description.
For reference, our solution is 4 line(s) of code.
def create_state_variable(obstacle_map, name):
    '''Creates a RV for the HMM state.
    The state can be any position in the map.
    Example map:
    obstacle_map = [
        [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
        [1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1],
        [1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0],
        [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
    ]
    Ones are obstacles and zeros are free positions.
    The domain of the state variable should be a list of (row, col)
    indices into the map. Only free positions (not obstacles) should
    be included in the domain of the state variable.
    The domain should be in row-major order. For example, an empty
    2x2 obstacle map should lead to the domain:
    [(0, 0), (0, 1), (1, 0), (1, 1)].
    Args:
        obstacle_map: A list of lists of ints, see example above.
        name: A str name for the state variable.
    Returns:
        state_var: A RV as described above.
    '''
    raise NotImplementedError("Implement me!")
def create_state_variable(obstacle_map, name):
    '''Creates a RV for the HMM state.
    The state can be any position in the map.
    Example map:
    obstacle_map = [
        [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
        [1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1],
        [1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0],
        [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
    ]
    Ones are obstacles and zeros are free positions.
    The domain of the state variable should be a list of (row, col)
    indices into the map. Only free positions (not obstacles) should
    be included in the domain of the state variable.
    The domain should be in row-major order. For example, an empty
    2x2 obstacle map should lead to the domain:
    [(0, 0), (0, 1), (1, 0), (1, 1)].
    Args:
        obstacle_map: A list of lists of ints, see example above.
        name: A str name for the state variable.
    Returns:
        state_var: A RV as described above.
    '''
    domain = [(r, c) for r in range(len(obstacle_map))
              for c in range(len(obstacle_map[0])) if obstacle_map[r][c] == 0]
    return RV(name, domain)
In this section, we will implement an HMM for a robot that is moving around randomly in a 2D grid with obstacles. The robot has sensors that allow it to detect obstacles in its immediate vicinity. It knows the grid map, with the locations of all obstacles, but it is uncertain about its own location in the grid. We will use Viterbi to determine the most likely locations for the robot given a sequence of local and potentially noisy observations.
Concretely, we will represent the 2D grid with obstacles as a list of lists, where 1s represent obstacles and 0s represent free space. Example:
obstacle_map = [
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
[1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1],
[1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
]
The state of the robot is its location in the grid, represented as a tuple of ints, (row, col). Transitions are uniformly distributed amongst the robot's current location and the neighboring free (not obstacle) locations, where neighboring = 4 cardinal directions (up, down, left, right).
Observations are a 4-tuple that list which directions have obstacles, in order [N E S W], with a 1 for an obstacle and 0 for no obstacle. Observations that are "off the map" are 1, as though they are obstacles. For instance, in the map above, if there were no observation noise, then the observation for the top left corner (state=(0, 0)) would be (1, 0, 1, 1). Observations can also be corrupted with noise; see the create_observation_potential docstring for more details.
Our ultimate task will be to take in a sequence of observations and return the corresponding sequence of most likely states.
Write a function that creates a potential for the observation distribution between $s_{t}$ and $z_{t}$.
For reference, our solution is 29 line(s) of code.
def create_observation_potential(obstacle_map, state_rv, observation_rv,
                                 noise_prob=0.):
    '''Write a function to create a potential between state_rv
    and observation_rv in an HMM that corresponds to the map.
    You can assume that state_rv was created by `create_state_variable`
    and observation_rv was created by `create_observation_variable`.
    See `create_observation_variable` for a description of the
    observation model. Recall the order is [N E S W].
    If noise_prob = 0., then the observations are noise-free. That is,
    you observe 0 if there is a free space and 1 otherwise.
    In general, for each of the four observation entries, with
    probability 1 - noise_prob, the entry will be "correct"; with
    probability noise_prob, the entry will be incorrect, that is,
    the opposite of the true occupancy.
    So if the noise-free observation would be (1, 0, 0, 1), then
    the probability of observation (1, 1, 0, 1) would be
    noise_prob*(1 - noise_prob)^3.
    Args:
        obstacle_map: A list of lists of ints;
            see example and description in `create_state_variable`.
        state_rv: An RV representing the state at time t.
        observation_rv: An RV representing the observation at time t.
        noise_prob: A float between 0 and 1 indicating the probability
            that an observation flips.
    Returns:
        potential: A Potential for the distribution between st and zt.
    '''
    raise NotImplementedError("Implement me!")
def create_observation_potential(obstacle_map, state_rv, observation_rv,
                                 noise_prob=0.):
    '''Write a function to create a potential between state_rv
    and observation_rv in an HMM that corresponds to the map.
    You can assume that state_rv was created by `create_state_variable`
    and observation_rv was created by `create_observation_variable`.
    See `create_observation_variable` for a description of the
    observation model. Recall the order is [N E S W].
    If noise_prob = 0., then the observations are noise-free. That is,
    you observe 0 if there is a free space and 1 otherwise.
    In general, for each of the four observation entries, with
    probability 1 - noise_prob, the entry will be "correct"; with
    probability noise_prob, the entry will be incorrect, that is,
    the opposite of the true occupancy.
    So if the noise-free observation would be (1, 0, 0, 1), then
    the probability of observation (1, 1, 0, 1) would be
    noise_prob*(1 - noise_prob)^3.
    Args:
        obstacle_map: A list of lists of ints;
            see example and description in `create_state_variable`.
        state_rv: An RV representing the state at time t.
        observation_rv: An RV representing the observation at time t.
        noise_prob: A float between 0 and 1 indicating the probability
            that an observation flips.
    Returns:
        potential: A Potential for the distribution between st and zt.
    '''
    def get_obs_for_loc(r, c):
        # Out of bounds
        if not (0 <= r < len(obstacle_map) and 0 <= c < len(obstacle_map[0])):
            return 1
        return obstacle_map[r][c]

    def get_obs_prob(obs, true_obs):
        p = 1.
        for i, j in zip(obs, true_obs):
            if i == j:
                p *= (1 - noise_prob)
            else:
                p *= noise_prob
        return p

    table = np.zeros((state_rv.dim, observation_rv.dim))
    for i, (r, c) in enumerate(state_rv.domain):
        true_obs = (
            get_obs_for_loc(r - 1, c),  # North
            get_obs_for_loc(r, c + 1),  # East
            get_obs_for_loc(r + 1, c),  # South
            get_obs_for_loc(r, c - 1),  # West
        )
        for j, obs in enumerate(observation_rv.domain):
            table[i, j] = get_obs_prob(obs, true_obs)
    return Potential([state_rv, observation_rv], table)
In this section, we will implement an HMM for a robot that is moving around randomly in a 2D grid with obstacles. The robot has sensors that allow it to detect obstacles in its immediate vicinity. It knows the grid map, with the locations of all obstacles, but it is uncertain about its own location in the grid. We will use Viterbi to determine the most likely locations for the robot given a sequence of local and potentially noisy observations.
Concretely, we will represent the 2D grid with obstacles as a list of lists, where 1s represent obstacles and 0s represent free space. Example:
obstacle_map = [
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
[1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1],
[1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
]
The state of the robot is its location in the grid, represented as a tuple of ints, (row, col). Transitions are uniformly distributed amongst the robot's current location and the neighboring free (not obstacle) locations, where neighboring = 4 cardinal directions (up, down, left, right).
Observations are a 4-tuple that list which directions have obstacles, in order [N E S W], with a 1 for an obstacle and 0 for no obstacle. Observations that are "off the map" are 1, as though they are obstacles. For instance, in the map above, if there were no observation noise, then the observation for the top left corner (state=(0, 0)) would be (1, 0, 1, 1). Observations can also be corrupted with noise; see the create_observation_potential docstring for more details.
Our ultimate task will be to take in a sequence of observations and return the corresponding sequence of most likely states.
Write a function that creates a potential for the transition distribution between states $s_{t}$ and $s_{t+1}$. Refer to the previous question for more information about the state variables and their domains.
For reference, our solution is 13 line(s) of code.
def create_transition_potential(obstacle_map, st, st1):
    '''Write a function to create a potential for the transition from state s_t
    to s_{t+1}, in an HMM that corresponds to the map.
    Transitions are uniformly distributed amongst the robot's current
    location and the neighboring free (not obstacle) locations, where
    neighboring = 4 cardinal directions (up, down, left, right).
    Hint: remember that if we have a potential with two variables A and B,
    with dimension N and M, then the potential table will be a numpy array
    of shape (N, M). Furthermore, the potential value for the i^{th} domain
    value of A and the j^{th} domain value of B will be table[i, j]. With
    this in mind, you may find it useful to use the following pattern in
    your code somewhere:
    ```
    for i, (prev_r, prev_c) in enumerate(st.domain):
        ...
        for j, (next_r, next_c) in enumerate(st1.domain):
            ...
            table[i, j] = ...
    ```
    Args:
        st: An RV representing the state at time t.
        st1: An RV representing the state at time t+1.
        obstacle_map: A list of lists of ints;
            see example and description in `create_state_variable`.
    Returns:
        potential: A Potential for the transition between st and st1.
    '''
    raise NotImplementedError("Implement me!")
def create_transition_potential(obstacle_map, st, st1):
    '''Write a function to create a potential for the transition from state s_t
    to s_{t+1}, in an HMM that corresponds to the map.
    Transitions are uniformly distributed amongst the robot's current
    location and the neighboring free (not obstacle) locations, where
    neighboring = 4 cardinal directions (up, down, left, right).
    Hint: remember that if we have a potential with two variables A and B,
    with dimension N and M, then the potential table will be a numpy array
    of shape (N, M). Furthermore, the potential value for the i^{th} domain
    value of A and the j^{th} domain value of B will be table[i, j]. With
    this in mind, you may find it useful to use the following pattern in
    your code somewhere:
    ```
    for i, (prev_r, prev_c) in enumerate(st.domain):
        ...
        for j, (next_r, next_c) in enumerate(st1.domain):
            ...
            table[i, j] = ...
    ```
    Args:
        st: An RV representing the state at time t.
        st1: An RV representing the state at time t+1.
        obstacle_map: A list of lists of ints;
            see example and description in `create_state_variable`.
    Returns:
        potential: A Potential for the transition between st and st1.
    '''
    table = np.zeros((st.dim, st1.dim))
    for i, (prev_r, prev_c) in enumerate(st.domain):
        possible_next_loc_idxs = set()
        for j, (next_r, next_c) in enumerate(st1.domain):
            # Check if neighbors or self
            if abs(prev_r - next_r) + abs(prev_c - next_c) <= 1:
                possible_next_loc_idxs.add(j)
        # Next locs have uniform probability
        p = 1. / len(possible_next_loc_idxs)
        for j in possible_next_loc_idxs:
            table[i, j] = p
    return Potential([st, st1], table)
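To tie the pieces together, the most likely state sequence can be recovered with max-product dynamic programming (Viterbi); the sketch below is a standalone numpy version over plain probability tables, so it does not use the course's RV/Potential classes, and the uniform initial distribution is an assumption:
```
import numpy as np

def viterbi(trans, obs_lik):
    # trans[i, j] = P(s_{t+1} = j | s_t = i), rows/cols indexed like the
    # state domain; obs_lik[t, i] = P(z_t | s_t = i) for each time step t.
    T, S = obs_lik.shape
    with np.errstate(divide="ignore"):          # log(0) = -inf is fine here
        log_trans = np.log(trans)
        log_obs = np.log(obs_lik)
    delta = np.empty((T, S))                    # best log-prob ending in state i
    back = np.zeros((T, S), dtype=int)          # backpointers
    delta[0] = np.log(1.0 / S) + log_obs[0]     # assumed uniform start
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # shape (S, S)
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(S)] + log_obs[t]
    # Trace back from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```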
73
116EECS6.191
Computation Structures
6.100A, 8.02NoneProblem Set 7Caches2nan2.7Text
Implement a direct-mapped cache by completing the DirectMappedCache module in DirectMappedCache.ms. Note that you should also keep track of hit and miss counts.

import CacheTypes;
import CacheHelpers;
import MainMemory;

// ReqStatus (defined in CacheTypes.ms) is used to keep track of the state of the current request
//typedef enum {
// Ready, // The cache is ready for a new request
// Lookup, // Issued a lookup to tag/data arrays
// Writeback, // In main memory access for dirty writeback
// Fill // In main memory access for requested data
//} ReqStatus;
//
// Possible flows:
// HIT: Ready -> Lookup -> Ready
// MISS, line is clean: Ready -> Lookup -> Fill
// MISS, line is dirty: Ready -> Lookup -> Writeback -> Fill

// Cache SRAM Synonyms (defined in CacheTypes.ms)
// You may find the following type synonyms helpful to access the tag/data/status arrays
// typedef SRAMReq#(logCacheSets, CacheTag) TagReq;
// typedef SRAMReq#(logCacheSets, Line) DataReq;
// typedef SRAMReq#(logCacheSets, CacheStatus) StatusReq;

// TODO: Complete the implementation of DirectMappedCache
// NOTE: Implementing this module requires about 50 lines of additional code
// (~40 lines in rule tick, ~5-10 lines in method data, 1 line in method reqEnabled, 1 line in function isHit)
module DirectMappedCache(MainMemory mainMem);
// SRAM arrays. Note that, for a direct-mapped cache,
// number of cache sets == number of cache lines
SRAM#(logCacheSets, Line) dataArray;
SRAM#(logCacheSets, CacheTag) tagArray;
SRAM#(logCacheSets, CacheStatus) statusArray;

// Registers for holding the current state of the cache and how far along
// it is in processing a request.
RegU#(MemReq) curReq;
Reg#(ReqStatus) state(Ready);

// Hit/miss counters
Reg#(Word) hits(0);
Reg#(Word) misses(0);

input Maybe#(MemReq) req default = Invalid;

// TODO return True if the cache can accept a new request
method Bool reqEnabled = False;

// TODO return True if the cache is in lookup and it is a hit
function Bool isHit;
return False;
endfunction

rule tick;
if (state == Ready && isValid(req)) begin
// TODO Your code here
end else if (state == Lookup) begin
// TODO Your code here
end else if (state == Writeback && mainMem.reqEnabled) begin
// TODO Your code here
end else if (state == Fill && isValid(mainMem.data)) begin
// TODO Your code here
end
endrule

method Maybe#(Word) data;
// This method should return a Valid output in only two cases:
// 1. On a load hit (it is a hit, and curReq.op == Ld).
// 2. On a fill for a load request (we're in the Fill state,
// mainMem.data is valid, and curReq.op == Ld).
// In all other cases, the output should be Invalid
//
// NOTE: You should be checking the above conditions explicitly in
// THIS method so you can return data as soon as possible.
// DO NOT place your output into a register in the rule and then
// simply return that register here.
// This function should take about 4-8 lines of code to implement.
// TODO Your code here.
return Valid(0);
endmethod
method Bit#(32) getHits = hits;
method Bit#(32) getMisses = misses;
endmodule
Programming
import CacheTypes;
import CacheHelpers;
import MainMemory;

// ReqStatus (defined in CacheTypes.ms) is used to keep track of the state of the current request
//typedef enum {
// Ready, // The cache is ready for a new request
// Lookup, // Issued a lookup to tag/data arrays
// Writeback, // In main memory access for dirty writeback
// Fill // In main memory access for requested data
//} ReqStatus;
//
// Possible flows:
// HIT: Ready -> Lookup -> Ready
// MISS, line is clean: Ready -> Lookup -> Fill
// MISS, line is dirty: Ready -> Lookup -> Writeback -> Fill

// Cache SRAM Synonyms (defined in CacheTypes.ms)
// You may find the following type synonyms helpful to access the tag/data/status arrays
// typedef SRAMReq#(logCacheSets, CacheTag) TagReq;
// typedef SRAMReq#(logCacheSets, Line) DataReq;
// typedef SRAMReq#(logCacheSets, CacheStatus) StatusReq;

// TODO: Complete the implementation of DirectMappedCache
// NOTE: Implementing this module requires about 50 lines of additional code
// (~40 lines in rule tick, ~5-10 lines in method data, 1 line in method reqEnabled, 1 line in function isHit)
module DirectMappedCache(MainMemory mainMem);
// SRAM arrays. Note that, for a direct-mapped cache,
// number of cache sets == number of cache lines
SRAM#(logCacheSets, Line) dataArray;
SRAM#(logCacheSets, CacheTag) tagArray;
SRAM#(logCacheSets, CacheStatus) statusArray;

// Registers for holding the current state of the cache and how far along
// it is in processing a request.
RegU#(MemReq) curReq;
Reg#(ReqStatus) state(Ready);

// Hit/miss counters
Reg#(Word) hits(0);
Reg#(Word) misses(0);

input Maybe#(MemReq) req default = Invalid;

// TODO return True if the cache can accept a new request
method Bool reqEnabled = state == Ready;

// TODO return True if the cache is in lookup and it is a hit
function Bool isHit(CacheTag tag);
return state == Lookup && getTag(curReq.addr) == tag;
endfunction

rule tick;
if (state == Ready && isValid(req)) begin
// TODO Your code here
curReq <= fromMaybe(?, req);
MemReq newReq = fromMaybe(?, req);
CacheIndex index = getIndex(newReq.addr);
tagArray.req = Valid(TagReq{addr: index, write: False, data: ?});
statusArray.req = Valid(StatusReq{addr: index, write: False, data: ?});
dataArray.req = Valid(DataReq{addr: index, write: False, data: ?});
state <= Lookup;
end else if (state == Lookup) begin
// TODO Your code here
CacheIndex index = getIndex(curReq.addr);
let tag = fromMaybe(?, tagArray.data);
let status = fromMaybe(?, statusArray.data);
let data = fromMaybe(?, dataArray.data);
if (isHit(tag) && status != NotValid) begin
hits <= hits + 1;
if (curReq.op == St) begin
Line newLine = data;
newLine[getWordOffset(curReq.addr)] = curReq.data;
dataArray.req = Valid(DataReq{addr: index, write: True, data: newLine});
statusArray.req = Valid(StatusReq{addr: index, write: True, data: Dirty});
end
state <= Ready;
end else begin
misses <= misses + 1;
if (status == NotValid || status == Clean) begin
mainMem.req = Valid(LineReq{op: Ld, lineAddr: getLineAddr(curReq.addr), data: ?});
state <= Fill;
end else if (status == Dirty) begin
mainMem.req = Valid(LineReq{op: St, lineAddr: {tag, index}, data: data});
state <= Writeback;
end
end
end else if (state == Writeback) begin
// TODO Your code here
if (mainMem.reqEnabled) begin
mainMem.req = Valid(LineReq{op: Ld, lineAddr: getLineAddr(curReq.addr), data: ?});
state <= Fill;
end
end else if (state == Fill) begin
// TODO Your code here
if (isValid(mainMem.data)) begin
CacheIndex index = getIndex(curReq.addr);
CacheTag newTag = getTag(curReq.addr);
Line newLine = fromMaybe(?, mainMem.data);
if (curReq.op == Ld) begin
statusArray.req = Valid(StatusReq{addr: index, write: True, data: Clean});
end else
if (curReq.op == St) begin
statusArray.req = Valid(StatusReq{addr: index, write: True, data: Dirty});
newLine[getWordOffset(curReq.addr)] = curReq.data;
end
tagArray.req = Valid(TagReq{addr: index, write: True, data: newTag});
dataArray.req = Valid(DataReq{addr: index, write: True, data: newLine});
state <= Ready;
end
end
endrule

method Maybe#(Word) data;
// This method should return a Valid output in only two cases:
// 1. On a load hit (it is a hit, and curReq.op == Ld).
// 2. On a fill for a load request (we're in the Fill state,
// mainMem.data is valid, and curReq.op == Ld).
// In all other cases, the output should be Invalid
//
// NOTE: You should be checking the above conditions explicitly in
// THIS method so you can return data as soon as possible.
// DO NOT place your output into a register in the rule and then
// simply return that register here.
// This function should take about 4-8 lines of code to implement.
// TODO Your code here.
let tag = fromMaybe(?, tagArray.data);
if (isHit(tag) && curReq.op == Ld) begin
Line newLine = fromMaybe(?, dataArray.data);
return Valid(newLine[getWordOffset(curReq.addr)]);
end else if (state == Fill && isValid(mainMem.data) && curReq.op == Ld) begin
Line newLine = fromMaybe(?, mainMem.data);
return Valid(newLine[getWordOffset(curReq.addr)]);
end else return Invalid;
endmethod
method Bit#(32) getHits = hits;
method Bit#(32) getMisses = misses;
endmodule
The Processor module in Processor.ms should implement the single-cycle processor. We have provided skeleton code for Processor, which instantiates all state elements and has a single rule that should execute an instruction. Before you can test your processor, you need to complete the skeleton code for this rule. Fortunately, because we have structured the code to have most of the logic in the decode and execute functions (Section 1), the Processor code you need to write is quite short, less than 20 lines of code.
As you fill in your processor and decode and execute functions, you can build your processor by running make Processor, and you can run the microtests or fullasmtests on the processor by running ./test.py. After filling in your processor and at least parts of your decode and execute functions, you should be able to pass some early microtests (./test.py 1, ./test.py 2, etc.). To get credit for finishing your processor, you should pass the microtests (./test.py a) and the fullasmtests (./test.py f).
Complete the Processor module in Processor.ms.
Note: The processor code will not be ready for testing yet: you will be testing it once you finish each instruction class in both Decode.ms and Execute.ms.
Overall, your processor needs to do these things every cycle:
1. Fetch the instruction your processor should decode and execute from memory, i.e., load it. The program counter pc will hold the address of this instruction. For example, at the start of every microtest, when pc is 0, your processor should load the word at address 0.
NOTE: The two memory modules, iMem and dMem, have combinational reads: memory reads return data in the same cycle. This is unrealistic, and hence these memories are MagicMemory modules. In future lectures and in the design project, you will learn how to implement a processor with memories where reads return data one or several cycles later.
For this part of the processor, you should use the instruction memory, iMem. To load from a memory, call its read() method, which accepts a Word as the address you want to access. For example, iMem.read(32'd4) reads the word at address 4. Addresses must be multiples of 4.
2. Decode the instruction to figure out its instruction type, ALU operation, and operands. For example, in microtest 1, the first instruction in raw hexadecimal is 0x000010b7. To decode it, you must find that it is an LUI instruction and then determine its destination register and immediate.
In Processor.ms, the decode function is already imported from Decode.ms. It takes in a single argument, the instruction as a Word, and returns a struct of type DecodedInst. For now, just call decode with the instruction you loaded, and put the result in a variable; you will fill in decode in Section 4.
3. Read from the registers any values that the instruction might need. We have provided you with a register file rf, of type RegisterFile, which contains 32 registers, where register 0 is hardwired to 0. In every cycle, you can read from two of its registers and write to one of its registers. The code to read from rf is rf.rd1(x) or rf.rd2(x) (there are two methods because you should only call each of these methods once per cycle), where x is a Bit#(5), the x-number of the register.
You can get the x-numbers of the registers to read from your DecodedInst, which has src1 and src2 fields. Note that it is safe, and actually simpler than the alternative, to always read from two registers every cycle, even for instructions that only need values from zero or one registers, since reading unnecessary values doesn't have side effects; the results can simply be ignored by the next step.
4. Execute the instruction to figure out what you need to do. Effects of the instruction include that you might need to write to a register, load data from memory, store data to memory, and update the program counter pc to the next value.
In Processor.ms, the execute function is already imported from Execute.ms. It takes in four arguments:
(a) the decoded instruction as a DecodedInst;
(b) the value in the first register to be read (rs1), as a Word, if any;
(c) the value in the second register to be read (rs2), as a Word, if any;
(d) the current program counter (pc), as a Word.
It returns a struct of type ExecInst. You should call execute with the instruction you decoded and the other information required, and put the result in a variable; you will fill in execute in Section 5.
5. Load from or store to memory, if the instruction requires you to, using dMem. Like iMem, you can load from dMem with the read() method. To write to the data memory, use the write input to dMem, which accepts a Maybe#(MemWriteReq). The MemWriteReq struct has the following format:
typedef struct { Word addr; Word data; } MemWriteReq;
For example, to write 0x1234 to address 0x100:
dMem.write = Valid(MemWriteReq{addr: 32'h100, data: 32'h1234});
If you are executing a LW instruction, the data you load from memory needs to get written to a register. You can put the loaded data in the data field of your ExecInst so that the logic for writing a register (in the next step) can handle it like all other register writes.
6. Write to a register, if the instruction requires you to. Writing rf, allowed only once each cycle, is done by setting the register file's wr input:
rf.wr = Valid(RegWriteArgs{index: x, data: data});
where x is the x-number of the register and data is a Word, the data you are writing into the register.
You can get the register you might need to write from your ExecInst, which has a dst field.
Note that unlike the reading from registers step, if an instruction isn't supposed to write to a register, then you need to make sure no register is written to. To do this, set rf.wr to Invalid (or don't set it).
7. Update the program counter pc. You can again get this from your ExecInst, which has a nextPc field.

import ProcTypes;
import RegisterFile;
import Decode;
import Execute;
import MagicMemory;

module Processor;
Reg#(Word) pc(0);
RegisterFile rf;
MagicMemory iMem; // Memory for loading instructions
MagicMemory dMem; // Memory for loading and storing data

rule doSingleCycle;
// Load the instruction from instruction memory (iMem)
Word inst = 0; // TODO Replace 0 with the correct value

// Decode the instruction
DecodedInst dInst = unpack(0); // TODO Replace unpack(0) with the correct value

// Read the register values used by the instruction
Word rVal1 = 0; // TODO Replace 0 with the correct value
Word rVal2 = 0; // TODO Replace 0 with the correct value

// Compute all outputs of the instruction
ExecInst eInst = unpack(0); // TODO Replace unpack(0) with the correct value

if (eInst.iType == LOAD) begin
// TODO: Load from data memory (dMem) if the instruction requires it
end else if (eInst.iType == STORE) begin
// TODO: Store to data memory (dMem) if the instruction requires it
end

if (isValid(eInst.dst)) begin
// TODO: Write to a register if the instruction requires it
end

// TODO: Update pc to the next pc

// If unsupported instruction, stops simulation and print the state of the processor
// IMPORTANT: Do not modify this code! The microtests check for it.
if (eInst.iType == Unsupported) begin
$display("Reached unsupported instruction (0x%x)", inst);
$display("Dumping the state of the processor");
$display("pc = 0x%x", pc);
$display(rf.fshow);
$display("Quitting simulation.");
$finish;
end
endrule

// This method exists to make the processor synthesizable: synth removes
// circuits without outputs, so we need some non-trivial output to avoid
// removing the processor :)
method Word getPc = pc;
endmodule
import ProcTypes;
import RegisterFile;
import Decode;
import Execute;
import MagicMemory;

module Processor;
Reg#(Word) pc(0);
RegisterFile rf;
MagicMemory iMem; // Memory for loading instructions
MagicMemory dMem; // Memory for loading and storing data

rule doSingleCycle;
// Load the instruction from instruction memory (iMem)
Word inst = iMem.read(pc); // TODO Replace 0 with the correct value

// Decode the instruction
DecodedInst dInst = decode(inst); // TODO Replace unpack(0) with the correct value

// Read the register values used by the instruction
Word rVal1 = rf.rd1(dInst.src1); // TODO Replace 0 with the correct value
Word rVal2 = rf.rd2(dInst.src2); // TODO Replace 0 with the correct value

// Compute all outputs of the instruction
ExecInst eInst = execute(dInst, rVal1, rVal2, pc); // TODO Replace unpack(0) with the correct value

if (eInst.iType == LOAD) begin
// TODO: Load from data memory (dMem) if the instruction requires it
eInst.data = dMem.read(eInst.addr);
end else if (eInst.iType == STORE) begin
// TODO: Store to data memory (dMem) if the instruction requires it
dMem.write = Valid(MemWriteReq{addr: eInst.addr, data: eInst.data});
end

if (isValid(eInst.dst)) begin
// TODO: Write to a register if the instruction requires it
rf.wr = Valid(RegWriteArgs{index: fromMaybe(?, eInst.dst), data: eInst.data});
end

// TODO: Update pc to the next pc
pc <= eInst.nextPc;

// If unsupported instruction, stops simulation and print the state of the processor
// IMPORTANT: Do not modify this code! The microtests check for it.
if (eInst.iType == Unsupported) begin
$display("Reached unsupported instruction (0x%x)", inst);
$display("Dumping the state of the processor");
$display("pc = 0x%x", pc);
$display(rf.fshow);
$display("Quitting simulation.");
$finish;
end
endrule

// This method exists to make the processor synthesizable: synth removes
// circuits without outputs, so we need some non-trivial output to avoid
// removing the processor :)
method Word getPc = pc;
endmodule
We will be using the following program to examine our cache behavior. Let N = 16 be the size of the data region, in words. Let A be an array of N elements, located initially at 0x240. Note that these values are hardcoded into the program below, but we will be changing them later.
// A = 0x240, starting address of array
// N = 16, size of data region
// this program adds 16 words from array A, then repeats.
. = 0x200
test:
li a0, 16 // initialize loop index i
li a1, 0 // sum = 0
loop: // add up elements in array
addi a0, a0, -1 // decrement index
slli a2, a0, 2 // convert to index byte offset
lw a3, 0x240(a2) // load value of A[i]
add a1, a1, a3 // add to sum
bnez a0, loop // loop until all words are summed
j test // perform test again!
// Array
. = 0x240
.word ... // A[0]
.word ... // A[1]
...
.word ... // A[15]
Our cache has a total of 64 words. The initial configuration is direct mapped, with 1 word per line, so the cache has 64 lines numbered 0-63 (0x00 - 0x3F).
To achieve a 100% steady-state hit ratio, the instructions and the array data must be able to reside in the cache at the same time. Let's check whether this is currently the case.
All of the instructions and all of the data elements happen to have the same tag. What is the value of this tag? Provide your answer in hexadecimal.
0x2.
The address of the first instruction is 0x200 = 0b_0010_0000_0000.
The address of the last instruction is 0x21C = 0b_0010_0001_1100.
The address of the first data element, A[0] is 0x240 = 0b_0010_0100_0000.
The address of the last data element, A[15], is 0x27C = 0b_0010_0111_1100.
The bottom 8 bits of the address are used for the 6 bit index and 2 bit word alignment. The remaining bits make up the tag which is 0x2 for all of the instructions and data elements.
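These splits are easy to verify programmatically; a small sketch for this specific configuration (2 offset bits, 6 index bits, and the remaining bits as the tag):
```
def split_address(addr):
    # 2 offset bits (word alignment), 6 index bits, remaining bits are the tag.
    offset = addr & 0x3
    index = (addr >> 2) & 0x3F
    tag = addr >> 8
    return tag, index, offset

for addr in (0x200, 0x21C, 0x240, 0x27C):
    tag, index, offset = split_address(addr)
    print(f"{addr:#05x}: tag={tag:#x} index={index:#04x} offset={offset}")
# All four addresses have tag 0x2; A[0] at 0x240 maps to index 0x10.
```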
We will be using the following program to examine our cache behavior. Let N = 16 be the size of the data region, in words. Let A be an array of N elements, located initially at 0x240. Note that these values are hardcoded into the program below, but we will be changing them later.
// A = 0x240, starting address of array
// N = 16, size of data region
// this program adds 16 words from array A, then repeats.
. = 0x200
test:
li a0, 16 // initialize loop index i
li a1, 0 // sum = 0
loop: // add up elements in array
addi a0, a0, -1 // decrement index
slli a2, a0, 2 // convert to index byte offset
lw a3, 0x240(a2) // load value of A[i]
add a1, a1, a3 // add to sum
bnez a0, loop // loop until all words are summed
j test // perform test again!
// Array
. = 0x240
.word ... // A[0]
.word ... // A[1]
...
.word ... // A[15]
Our cache has a total of 64 words. The initial configuration is direct mapped, with 1 word per line, so the cache has 64 lines numbered 0-63 (0x00 - 0x3F).
To achieve a 100% steady-state hit ratio, the instructions and the array data must be able to reside in the cache at the same time. Let's check whether this is currently the case.
Which cache line (index) does A[0] map to? Provide your answer in hexadecimal.
0x10.
The address of A[0] is 0x240 = 0b_0010_0100_0000. The bottom two bits are used for word alignment. Bits[7:2] are the index bits = 0b010000 = 0x10. So A[0] maps to index 0x10 (or line 16).
74
205Mathematics18.01Calculus INoneNoneProblem Set 5
Second Derivatives
9c0.04751847941Text
A graphing problem using second derivatives. Let $f(x)=x e^{-x}$. We will graph $f(x)$ on the range $0 \leq x$. But first we compute some information about $f(x)$.
For which $x$ is $f^{\prime \prime}(x)=0$ ? For which $x$ is $f^{\prime \prime}(x)>0$ and for which $x$ is $f^{\prime \prime}(x)<0$ ?
Expression
$f^{\prime \prime}(x)=0$ (inflection point) when $x=2$. $f^{\prime \prime}(x)>0$ (concave up) when $x>2$. $f^{\prime \prime}(x)<0$ (concave down) when $x<2$.
A graphing problem using second derivatives. Let $f(x)=x e^{-x}$. We will graph $f(x)$ on the range $0 \leq x$. But first we compute some information about $f(x)$.
For which $x$ is $f^{\prime}(x)=0$ ? For which $x$ is $f^{\prime}(x)>0$ and for which $x$ is $f^{\prime}(x)<0$?
$f^{\prime}(x)=0$ (maximum) when $x=1$. $f^{\prime}(x)>0$ (increasing) when $x<1$. $f^{\prime}(x)<0$ (decreasing) when $x>1$.
A graphing problem using second derivatives. Let $f(x)=x e^{-x}$. We will graph $f(x)$ on the range $0 \leq x$. But first we compute some information about $f(x)$.
Compute $f^{\prime}(x)$ and $f^{\prime \prime}(x)$.
$f^{\prime}(x)=(1-x) e^{-x}$ (product rule) and $f^{\prime \prime}(x)=(x-2) e^{-x}$ (product rule again).
A graphing problem using second derivatives. Let $f(x)=x e^{-x}$. We will graph $f(x)$ on the range $0 \leq x$. But first we compute some information about $f(x)$.
For which value of $x$ is $f^{\prime}(x)$ the most negative?
When $f^{\prime}(x)$ is the most negative, $f^{\prime}(x)$ is at a minimum. Thus, $f^{\prime \prime}(x)$ must be zero, which happens only when $x=2$.
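These derivatives are easy to verify symbolically; a minimal check using sympy:
```
import sympy as sp

x = sp.symbols('x')
f = x * sp.exp(-x)

f1 = sp.diff(f, x)
f2 = sp.diff(f, x, 2)

# Both differences simplify to zero, confirming the product-rule results above.
print(sp.simplify(f1 - (1 - x) * sp.exp(-x)))  # 0
print(sp.simplify(f2 - (x - 2) * sp.exp(-x)))  # 0
print(sp.solve(f2, x))                         # [2], the inflection point
```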
75
181EECS6.191
Computation Structures
6.100A, 8.02NoneMidterm Exam 3
Operating Systems
2b0.45Text
Consider the following two processes running RISC-V programs labeled with virtual addresses. Note that all pseudoinstructions in this code translate into a single RISC-V instruction.
Program for Process A
.=0x200
li t0, 5000
li t1, 0
loop:
li a0, 0x300
li a7, 0x13 // print system call
ecall
addi t1, t1, 1
ble t1, t0, loop
j exit // exit process
.=0x300
.ascii "Hello from process A\n"
Program for Process B
.=0x450
li t0, 50
li t1, 10
div t2, t0, t1
sw t2, 0x900(x0)
li a0, 0x600
li a7, 0x13 // print system call
ecall
j exit // exit process
.=0x600
.ascii "Hello from Process B\n"
Assume the OS schedules Process B first. For the following questions, if you can’t tell a value based on the information given, write CAN’T TELL.
What are the values in the following registers just after the first ecall in Process A completes?
t0:
t1:
pc:
Numerical
t0: 5000
t1: 0
pc: 0x214
Consider the following two processes running RISC-V programs labeled with virtual addresses. Note that all pseudoinstructions in this code translate into a single RISC-V instruction.
Program for Process A
.=0x200
li t0, 5000
li t1, 0
loop:
li a0, 0x300
li a7, 0x13 // print system call
ecall
addi t1, t1, 1
ble t1, t0, loop
j exit // exit process
.=0x300
.ascii "Hello from process A\n"
Program for Process B
.=0x450
li t0, 50
li t1, 10
div t2, t0, t1
sw t2, 0x900(x0)
li a0, 0x600
li a7, 0x13 // print system call
ecall
j exit // exit process
.=0x600
.ascii "Hello from Process B\n"
Assume the OS schedules Process B first. For the following questions, if you can’t tell a value based on the information given, write CAN’T TELL.
A timer interrupt occurs just prior to the execution of li t1, 10 in Process B. Process A runs for some time, then another timer interrupt occurs and control is returned to Process B. What are the values in the following registers immediately after returning to Process B?
t0:
t1:
pc:
t0: 50
t1: CAN’T TELL or 0
pc: 0x454
Consider the following two processes running RISC-V programs labeled with virtual addresses. Note that all pseudoinstructions in this code translate into a single RISC-V instruction.
Program for Process A
.=0x200
li t0, 5000
li t1, 0
loop:
li a0, 0x300
li a7, 0x13 // print system call
ecall
addi t1, t1, 1
ble t1, t0, loop
j exit // exit process
.=0x300
.ascii "Hello from process A\n"
Program for Process B
.=0x450
li t0, 50
li t1, 10
div t2, t0, t1
sw t2, 0x900(x0)
li a0, 0x600
li a7, 0x13 // print system call
ecall
j exit // exit process
.=0x600
.ascii "Hello from Process B\n"
Assume the OS schedules Process B first. For the following questions, if you can’t tell a value based on the information given, write CAN’T TELL.
The RISC-V processor does not have hardware to support a div (integer division) instruction, so the OS must emulate it. What are the values in the following registers after div is emulated?
t0:
t1:
t2:
pc:
t0: 50
t1: 10
t2: 5
pc: 0x45C
Please assume that all registers are initialized to 0, and the instructions in the section below are executed one after another.
addi a1, a2, 9
lui a2, 3
li a3, 3
sub a3, a3, a1
slt a4, a3, zero
sltu a5, a3, zero
li a4, 1
beq a5, zero, L1
slli a4, a4, 2
L1:
slli a4, a4, 5
What final value (answer in 32-bit hexadecimal format like 0x0CDEF1AF) will be inside register a4 after the entire code snippet above is executed?
0x00000020.
The instruction li a4, 1 first sets the value inside a4 to be 1.
Then, the instruction beq a5, zero, L1 checks whether the value inside a5 is equal to the value in zero or not. In this case, the value inside a5 from the previous section is 0, so the beq jumps to the specified label, L1, and continues its execution there.
Finally, the instruction slli a4, a4, 5 shifts the value inside a4 to the left by 5 bits and writes the shifted result back into a4. a4 was just set to 1, or 0b0000 0000 0000 0000 0000 0000 0000 0001. Shifting this value left by 5 bits results in 0b0000 0000 0000 0000 0000 0000 0010 0000, which is 0x00000020 in hexadecimal. This result is written to the destination register, which also happens to be a4 in this instruction.
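The entire snippet can be traced with a few lines of Python (a sketch that models 32-bit wrap-around and the signed vs. unsigned comparisons explicitly):
```
MASK = 0xFFFFFFFF

def signed(v):
    # Interpret a 32-bit value as a signed integer.
    return v - (1 << 32) if v & 0x80000000 else v

a1 = (0 + 9) & MASK               # addi a1, a2, 9   (a2 starts at 0)
a2 = (3 << 12) & MASK             # lui a2, 3
a3 = 3                            # li a3, 3
a3 = (a3 - a1) & MASK             # sub a3, a3, a1 -> -6 in two's complement
a4 = 1 if signed(a3) < 0 else 0   # slt: signed compare, -6 < 0 -> 1
a5 = 1 if a3 < 0 else 0           # sltu: unsigned compare, large value < 0 is false
a4 = 1                            # li a4, 1
if a5 != 0:                       # beq a5, zero, L1 is taken, so this is skipped
    a4 = (a4 << 2) & MASK         # slli a4, a4, 2
a4 = (a4 << 5) & MASK             # L1: slli a4, a4, 5
print(hex(a4))                    # 0x20
```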
76
160EECS6.191
Computation Structures
6.100A, 8.02NoneMidterm Exam 2
Sequential Circuits in Minispec
4c1.2Text
You are frustrated with the 77 Mass. Ave crosswalk and decide to design a better traffic signal in Minispec. To start, you want to make sure the traffic light will function well in the daytime when there’s lots of traffic. After carefully analyzing traffic patterns, you define the following specification:
• The traffic light should be red for 4 cycles, then green for 10 cycles, then yellow for 1 cycle, and repeat this pattern indefinitely.
• The light starts red (and should stay red for 4 cycles before turning green).
• Pedestrians can only cross when the light is red.
Now you want to add a new feature to your traffic light. During the daytime, you want it to work as in part (A). But during the nighttime, the traffic light should work differently:
• By default, the light should be green.
• When a pedestrian requests to cross the street and the light is green, it should remain green for 3 more cycles, turn yellow for 1 cycle, then red for 4 cycles. Then it should go back to being green indefinitely.
• If a pedestrian requests to cross the street when the light is yellow or red, this request should be ignored and have no effect.
• If a pedestrian requests to cross the street while the light is green, and a pedestrian requests to cross the street in a following cycle when the light is still green, this request should also have no effect.
You also want to add a feature for emergency pedestrian requests. In an emergency, if a pedestrian requests to cross, the light should immediately turn yellow on the next cycle. The pedestrian request is provided as a Maybe#(PedestrianRequest) type – on each cycle it will either be:
• Invalid (no pedestrian request)
• Standard (a standard pedestrian request was made)
• Emergency (an emergency pedestrian request was made)
Note that your implementation should still work when the input transitions from daytime to nighttime, even though in daytime the Green light is 10 cycles and in nighttime it is only 3 cycles following a pedestrian request. Thus, if it is nighttime and our counter variable is too large (because we were counting down from a larger value during the daytime), we should “clamp” it to be no larger than it can be in nighttime. We have provided a currentCounter variable to use for this purpose – i.e. it will be clamped to the maximum value the counter can be during nighttime.
Fill in the Minispec module below to add this functionality. We have provided two inputs – one for whether it is currently nighttime or daytime, and one for whether a pedestrian has requested to cross the street in this cycle.
typedef enum { Green, Yellow, Red } LightState;
typedef enum { Daytime, Nighttime } TimeOfDay;
typedef enum { Standard, Emergency } PedestrianRequest;
module TrafficLight;
Reg#(LightState) light(_<answer from Part A>_);
Reg#(Bit#(_<answer from Part A>_)) counter(_<answer from part A>_);
input TimeOfDay timeOfDay default = Nighttime;
input Maybe#(PedestrianRequest) pedestrianRequest default = Invalid;
method Bool pedestriansCanCross = <answer from part A>;
method LightState currentLight = <answer from part A>;
rule tick;
if (timeOfDay == Daytime) begin
<Your answer from Part A>
end else begin
if (light == Green) begin
Bit#(<answer from part A>) currentCounter;
// Clamp currentCounter to the maximum value counter
// can be for a Green light at night
currentCounter = counter > ____ ? ____ : counter;
if (currentCounter == 0) begin
light <= __________;
// Check if received pedestrian request this cycle
end else if (___________________________________)
begin
// Handle emergency request
if (__________________________________) begin
light <= ___________________________;
end else begin
counter <= _________________________;
end
end else if (currentCounter < _______________) begin
counter <= ___________________________;
end else begin
counter <= ___________________________;
end
end else if (light == Yellow) begin
<Your answer from Part A>
end else if (light == Red) begin
<Your answer from Part A>
end
end
endrule
endmodule
Programming
typedef enum { Green, Yellow, Red } LightState;
typedef enum { Daytime, Nighttime } TimeOfDay;
typedef enum { Standard, Emergency } PedestrianRequest;
module TrafficLight;
Reg#(LightState) light(_<answer from Part A>_);
Reg#(Bit#(_<answer from Part A>_)) counter(_<answer from part A>_);
input TimeOfDay timeOfDay default = Nighttime;
input Maybe#(PedestrianRequest) pedestrianRequest default = Invalid;
method Bool pedestriansCanCross = <answer from part A>;
method LightState currentLight = <answer from part A>;
rule tick;
if (timeOfDay == Daytime) begin
<Your answer from Part A>
end else begin
if (light == Green) begin
Bit#(<answer from part A>) currentCounter;
// Clamp currentCounter to the maximum value counter
// can be for a Green light at night
currentCounter = counter > ___3_ ? ___3_ : counter;
if (currentCounter == 0) begin
light <= ___Yellow___;
// Check if received pedestrian request this cycle
end else if (_____isValid(pedestrianRequest)_____)
begin
// Handle emergency request
if (fromMaybe(?, pedestrianRequest) ==
Emergency) begin
light <= ____Yellow_________________;
end else begin
counter <= ___currentCounter - 1____;
end
end else if (currentCounter < ______3________) begin
counter <= ___currentCounter - 1______;
end else begin
counter <= ___currentCounter_(or 3)____;
end
end else if (light == Yellow) begin
<Your answer from Part A>
end else if (light == Red) begin
<Your answer from Part A>
end
end
endrule
endmodule
You are frustrated with the 77 Mass. Ave crosswalk and decide to design a better traffic signal in Minispec. To start, you want to make sure the traffic light will function well in the daytime when there’s lots of traffic. After carefully analyzing traffic patterns, you define the following specification:
• The traffic light should be red for 4 cycles, then green for 10 cycles, then yellow for 1 cycle, and repeat this pattern indefinitely.
• The light starts red (and should stay red for 4 cycles before turning green).
• Pedestrians can only cross when the light is red.
Fill in the Minispec module on the next page to track the traffic light state as a sequential circuit.
• The pedestriansCanCross method should return True if and only if the light is in a state where pedestrians are allowed to cross.
• The currentLight method should return the current state of the traffic light.
• We have provided a counter register – use this to count down to the next state transition.
typedef enum { Green, Yellow, Red } LightState;
module TrafficLight;
Reg#(LightState) light(_________);
Reg#(Bit#(_________)) counter(___________);
method Bool pedestriansCanCross = _______________;
method LightState currentLight = _______________;
rule tick;
if (light == Green) begin
if (counter == 0) begin
light <= _____________;
end else begin
counter <= ____________;
end
end else if (light == Yellow) begin
light <= ______________;
counter <= ____________;
end else if (light == Red) begin
if (counter == 0) begin
light <= ____________;
counter <= ___________;
end else begin
counter <= ___________;
end
end
endrule
endmodule
typedef enum { Green, Yellow, Red } LightState;
module TrafficLight;
Reg#(LightState) light(___Red___);
Reg#(Bit#(____4_____)) counter(_____3______);
method Bool pedestriansCanCross = _light == Red__;
method LightState currentLight = __light_________;
rule tick;
if (light == Green) begin
if (counter == 0) begin
light <= __Yellow_____;
end else begin
counter <= __counter - 1__;
end
end else if (light == Yellow) begin
light <= ___Red________;
counter <= ______3______;
end else if (light == Red) begin
if (counter == 0) begin
light <= ___Green____;
counter <= ____9_______;
end else begin
counter <= __counter - 1__;
end
end
endrule
endmodule
You are frustrated with the 77 Mass. Ave crosswalk and decide to design a better traffic signal in Minispec. To start, you want to make sure the traffic light will function well in the daytime when there’s lots of traffic. After carefully analyzing traffic patterns, you define the following specification:
• The traffic light should be red for 4 cycles, then green for 10 cycles, then yellow for 1 cycle, and repeat this pattern indefinitely.
• The light starts red (and should stay red for 4 cycles before turning green).
• Pedestrians can only cross when the light is red.
To ensure that your module behaves as expected, fill in the timing chart below with the register values and outputs for the first 6 cycles.
\begin{tabular}{|l|l|l|l|l|l|l|}
\hline Cycle & $\mathbf{0}$ & $\mathbf{1}$ & $\mathbf{2}$ & $\mathbf{3}$ & $\mathbf{4}$ & $\mathbf{5}$ \\
\hline counter & & & & & & \\
\hline light & & & & & & \\
\hline currentLight & & & & & & \\
\hline pedestriansCanCross & & & & & & \\
\hline
\end{tabular}
\begin{tabular}{|l|l|l|l|l|l|l|}
\hline Cycle & $\mathbf{0}$ & $\mathbf{1}$ & $\mathbf{2}$ & $\mathbf{3}$ & $\mathbf{4}$ & $\mathbf{5}$ \\
\hline counter & 3 & 2 & 1 & 0 & 9 & 8 \\
\hline light & Red & Red & Red & Red & Green & Green \\
\hline currentLight & Red & Red & Red & Red & Green & Green \\
\hline pedestriansCanCross & True & True & True & True & False & False \\
\hline
\end{tabular}
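For a quick cross-check of the chart, here is a minimal Python sketch of the same state machine (illustrative only, not Minispec; it assumes the solution's initialization of light = Red and counter = 3). It reproduces the six columns above:

# Python model of the Part A daytime traffic light.
def simulate(cycles=6):
    light, counter = "Red", 3
    for cycle in range(cycles):
        # Outputs are combinational reads of the current register values;
        # currentLight equals light, pedestriansCanCross is (light == Red).
        print(cycle, counter, light, light == "Red")
        # Register updates take effect at the end of the cycle.
        if light == "Green":
            if counter == 0:
                light = "Yellow"
            else:
                counter -= 1
        elif light == "Yellow":
            light, counter = "Red", 3
        else:  # Red
            if counter == 0:
                light, counter = "Green", 9
            else:
                counter -= 1

simulate()  # counter: 3,2,1,0,9,8; light: Red x4, then Green x2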
For each of the following questions, you are asked to specify the result that the provided Minispec code will produce.
Consider the following sequential module:
module Test;
Reg#(Word) x(3);
Reg#(Word) y(7);
Reg#(Bit#(6)) cycle(0);
rule runCycle;
Word temp = x;
x <= y;
y <= temp;
cycle <= cycle + 1;
$display("cycle = %d, x = %d, y = %d", cycle, x, y);
if (cycle >= 2) $finish;
endrule
endmodule
Suppose that the runCycle code was modified so that it no longer used the temp variable and instead it just specified: x <= y; y <= x. Would this code behave differently from the way it was originally written?
(a) Yes.
(b) No.
(b) No.
Since all registers are updated at the same time, at the end of each cycle, there is no need to use the temp variable to swap x and y. The sequence x <= y; y <= x; puts the old value of x into y and the old value of y into x. In fact, the order of those two statements does not change the behavior of the code.
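As an aside (purely an analogy, not Minispec): Python's tuple assignment has the same read-everything-then-write semantics as simultaneous register updates, which makes the swap-without-temp behavior easy to see:

# Both right-hand sides are evaluated before either name is written,
# mirroring how x <= y; y <= x both read the old register values.
x, y = 3, 7
x, y = y, x
print(x, y)  # prints: 7 3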
77
54Mathematics18.2
Principles of Discrete Applied Mathematics
None18.C06Problem Set 10
Discrete Fourier Transform
3a0.6790123457Text
What is the Discrete Fourier Transform over $\mathbb{C}$ of $y=(1,1,0, i)$?
Expression
We use the formula
$$
c_{k}=\sum_{j=0}^{n-1} y_{j} e^{-2 \pi i j k / n} .
$$
Since $n=4$, we have the roots
$$
r=\left(e^{-2 \pi i 0 / n}, e^{-2 \pi i 1 / n}, e^{-2 \pi i 2 / n}, e^{-2 \pi i 3 / n}\right)=(1,-i,-1, i).
$$
We need the different powers of these roots: $r_{k}=\left(1^{k},(-i)^{k},(-1)^{k}, i^{k}\right)$ for $k=0,1,2,3$. The dot product with $y$ gives us the transform, i.e.
$$
c_{k}=y \cdot r_{k}
$$
We then obtain $c=(2+i,-i,-i, 2+i)$.
Using the result in (a) above, what is the Discrete Fourier Transform of $z=y * y$, the cyclic convolution of $y$ with itself?
Let $\hat{z}$ be the Fourier transform of $z$. By the convolution theorem, $\hat{z}=\widehat{y * y}=\hat{y} \times \hat{y}=c \times c$, where $\times$ denotes pointwise multiplication. From this, we obtain
$$
\hat{z}=(3+4 i,-1,-1,3+4 i).
$$
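Both transforms can be checked numerically; numpy.fft.fft uses the same sign convention as the formula above (a sketch, with the convolution theorem applied pointwise):

import numpy as np

y = np.array([1, 1, 0, 1j])
c = np.fft.fft(y)           # [2+1j, -1j, -1j, 2+1j]
z_hat = c * c               # DFT of z = y * y (cyclic convolution)
print(np.round(c, 10))
print(np.round(z_hat, 10))  # [3+4j, -1, -1, 3+4j]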
Let $x[n]$ represent a discrete time signal whose DTFT is given by
$$
X(\Omega)= \begin{cases}1 & \text { if }|\Omega|<\frac{\pi}{5} \\ 0 & \text { if } \frac{\pi}{5}<|\Omega|<\pi\end{cases}
$$
and is periodic in $\Omega$ with period $2 \pi$ as shown below.
Determine an expression for $Y_{1}(\Omega)$ (the Fourier transform of $y_{1}[n]$ ) in terms of $Y_{0}(\Omega)$.
Make a plot of $Y_{1}(\Omega)$.
Briefly describe the relation between $Y_{0}(\Omega)$ and $Y_{1}(\Omega)$.
$$
\begin{aligned}
Y_{1}(\Omega) & =\sum_{n=-\infty}^{\infty} y_{1}[n] e^{-j \Omega n}=\sum_{n=-\infty}^{\infty}\left(\frac{1}{2} y_{0}[n-1]+y_{0}[n]+\frac{1}{2} y_{0}[n+1]\right) e^{-j \Omega n} \\
& =\frac{1}{2} e^{-j \Omega} Y_{0}(\Omega)+Y_{0}(\Omega)+\frac{1}{2} e^{j \Omega} Y_{0}(\Omega)=(1+\cos (\Omega)) Y_{0}(\Omega)
\end{aligned}
$$
The plot of $Y_{1}(\Omega)$ is below.
The overall amplitude of $Y_{1}(\Omega)$ is twice that of $Y_{0}(\Omega)$. This results because the values of $y_{0}[n]$ are zero for odd values of $n$, while those for $y_{1}[n]$ are not. Components of $Y_{1}(\Omega)$ near $\Omega=\pi$ are greatly reduced in magnitude relative to those in $Y_{0}(\Omega)$.
The net effect of these changes is to generate a new signal $Y_{1}(\Omega)$ with half the bandwidth of $X(\Omega)$.
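As a numerical sanity check of the frequency response derived above (a sketch; the taps 1/2, 1, 1/2 come directly from the definition of $y_1[n]$):

import numpy as np

omega = np.linspace(-np.pi, np.pi, 101)
# Frequency response of y1[n] = 0.5*y0[n-1] + y0[n] + 0.5*y0[n+1]
H = 0.5 * np.exp(-1j * omega) + 1 + 0.5 * np.exp(1j * omega)
print(np.allclose(H, 1 + np.cos(omega)))  # True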
Let $y[n]$ represent the discrete-time signal that results from sampling $x(t)$ once every $\Delta=\frac{1}{2}$ second. Determine $Y(\Omega)$, which represents the discrete-time Fourier transform (DTFT) of $y$ [n]. Plot the magnitude and angle of $Y(\Omega)$ on the axes below.
Determine an expression for $y[n]$.
$$
\begin{aligned}
& y[n]=\frac{\sin \frac{\pi(n+2)}{2}}{\frac{\pi(n+2)}{2}} \\
& y[n]=x(n \Delta)=\left.\frac{\sin (\pi(t+1))}{\pi(t+1)}\right|_{t=n \Delta}=\frac{\sin \left(\pi\left(\frac{n}{2}+1\right)\right)}{\pi\left(\frac{n}{2}+1\right)}=2 \frac{\sin \left(\frac{\pi}{2}(n+2)\right)}{\pi(n+2)}
\end{aligned}
$$
Since $y[n]$ is a sinc function of $n$, $Y(\Omega)$ is a lowpass filter with cutoff frequency $\frac{\pi}{2}$ and DC value 2. There is also a time shift of 2 since $n$ has been replaced by $n+2$. This time shift introduces a phase lead of $2 \Omega$. The final answer is
$$
Y(\Omega)= \begin{cases}2 e^{j 2 \Omega} & \text { if }-\pi<(\Omega+2 \pi m)<\pi \text { for some integer } m \\ 0 & \text { otherwise }\end{cases}
$$
The plot is below.
78
246EECS6.39
Introduction to Machine Learning
6.1010/6.1210, 18.06/18.C06
NoneLab 9
Convolutional Neural Networks
4a0.1041666667Text
Many of us have seen applications of facial recognition in real life (e.g., unlocking a phone) and movies (e.g., Jarvis, built by fictional MIT alum Tony Stark, has facial recognition capabilities). In this lab, we are going to introduce some of the ethical and social considerations that arise in the development, deployment, and very existence of facial recognition systems.
Convolutional Neural Networks require a LOT of data to train well. What are some ethical and social implications in deciding whether and how to collect such data for facial recognition systems (e.g., labeled images of faces)?
Tip: Having trouble getting started? Think about this question from a security perspective, privacy perspective, consent perspective, etc. See the interesting Nature article "The ethical questions that haunt facial recognition research".
Open
We want students to think about data privacy, security, consent, and the monetization of software that uses the images of unknowing, unconsenting individuals.
The Nature article is really worth a read. To summarize, most research groups scrape the Internet/public sources to collect facial images (and do not ask for permission). Examples: In 2015, Stanford published a set of 12,000 images from a webcam in a cafe in SF that had been live-streamed online. Researchers at Duke released more than 2 million frames of video footage of students walking around campus. MegaFace and MSCeleb are large datasets that have been posted online and were collected from the internet. Many of these larger datasets have been used to evaluate and improve commercial surveillance products. And quite a few of these datasets were published online in one place without a password in place.
There are also issues to consider around data privacy. When you use a facial recognition model, your personal data is being sent to a model, and there are both privacy and security risks associated with that.
Many of us have seen applications of facial recognition in real life (e.g., unlocking a phone) and movies (e.g., Jarvis, built by fictional MIT alum Tony Stark, has facial recognition capabilities). In this lab, we are going to introduce some of the ethical and social considerations that arise in the development, deployment, and very existence of facial recognition systems.
Even if we can make a perfect or "fair" facial recognition system, the first question we should ask (and one that is often skipped) is whether we should develop such technology at all. What are some of the potential ethical and social impacts facial recognition surveillance could have on society (Assuming it can be applied to any images or video feed in the world)? Think about this question from a security perspective, privacy perspective, consent perspective, and what such technologies could do in the hands of bad actors (and even well-intentioned actors).
Facial Recognition is dangerous. The ACLU is actually trying to get the government to regulate surveillance technology. To quote this ACLU article, "Face recognition surveillance presents an unprecedented threat to our privacy and civil liberties. It gives governments, companies, and individuals the power to spy on us wherever we go - tracking our faces at protests, political rallies, places of worship, and more." There is already an issue of governments using this to target particular minorities.
Many of us have seen applications of facial recognition in real life (e.g., unlocking a phone) and movies (e.g., Jarvis, built by fictional MIT alum Tony Stark, has facial recognition capabilities). In this lab, we are going to introduce some of the ethical and social considerations that arise in the development, deployment, and very existence of facial recognition systems.
Suppose you tested the facial recognition systems from Microsoft and IBM on a dataset of 1,270 faces, and you found these systems achieved accuracy rates of $93.7 \%$ and $87.9 \%$ respectively. Are these accuracy rates sufficiently high? Would you feel comfortable deploying either of these systems? What other tests might you want to conduct?
The data set is relatively small and the accuracy isn't that high either, so it is perhaps not a good idea to deploy either system as is. It would probably be wise to test and validate against larger and more diverse data sets before full deployment.
Many of us have seen applications of facial recognition in real life (e.g., unlocking a phone) and movies (e.g., Jarvis, built by fictional MIT alum Tony Stark, has facial recognition capabilities). In this lab, we are going to introduce some of the ethical and social considerations that arise in the development, deployment, and very existence of facial recognition systems.
Error analysis reveals that $93.6 \%$ of Microsoft's mislabeled faces were those of people with darker skin, and that the IBM system's error rate is $34.4 \%$ higher for women with darker skin than men with lighter skin (for more info, see Dr. Joy Buolamwini's research on gendershades.org). As we've learned, these discrepancies could be a consequence of a confluence of factors: imbalanced representation in the dataset, a bias introduced or amplified through architectural choices in modeling or training the neural network, etc. How might we fix this performance imbalance?
This is meant to be an open-ended question. Some ideas: diversify the data set; add regularization to the architecture to prevent overfitting; and, connecting back to the last lab, robust network training could also help here.
79
180EECS6.191
Computation Structures
6.100A, 8.02NoneMidterm Exam 3
Operating Systems
2a0.45Text
Consider the following two processes running RISC-V programs labeled with virtual addresses. Note that all pseudoinstructions in this code translate into a single RISC-V instruction.
Program for Process A
.=0x200
li t0, 5000
li t1, 0
loop:
li a0, 0x300
li a7, 0x13 // print system call
ecall
addi t1, t1, 1
ble t1, t0, loop
j exit // exit process
.=0x300
.ascii "Hello from process A\n"
Program for Process B
.=0x450
li t0, 50
li t1, 10
div t2, t0, t1
sw t2, 0x900(x0)
li a0, 0x600
li a7, 0x13 // print system call
ecall
j exit // exit process
.=0x600
.ascii "Hello from Process B\n"
Assume the OS schedules Process B first. For the following questions, if you can’t tell a value based on the information given, write CAN’T TELL.
A timer interrupt occurs just prior to the execution of li t1, 10 in Process B. Process A runs for some time, then another timer interrupt occurs and control is returned to Process B. What are the values in the following registers immediately after returning to Process B?
t0:
t1:
pc:
Numerical
t0: _______50__________
t1: __CAN’T TELL or 0_
pc: _____0x454_________
Consider the following two processes running RISC-V programs labeled with virtual addresses. Note that all pseudoinstructions in this code translate into a single RISC-V instruction.
Program for Process A
.=0x200
li t0, 5000
li t1, 0
loop:
li a0, 0x300
li a7, 0x13 // print system call
ecall
addi t1, t1, 1
ble t1, t0, loop
j exit // exit process
.=0x300
.ascii "Hello from process A\n"
Program for Process B
.=0x450
li t0, 50
li t1, 10
div t2, t0, t1
sw t2, 0x900(x0)
li a0, 0x600
li a7, 0x13 // print system call
ecall
j exit // exit process
.=0x600
.ascii "Hello from Process B\n"
Assume the OS schedules Process B first. For the following questions, if you can’t tell a value based on the information given, write CAN’T TELL.
What are the values in the following registers just after the first ecall in Process A completes?
t0:
t1:
pc:
t0: ______5000_________
t1: _______0___________
pc: _____0x214_________
Consider the following two processes running RISC-V programs labeled with virtual addresses. Note that all pseudoinstructions in this code translate into a single RISC-V instruction.
Program for Process A
.=0x200
li t0, 5000
li t1, 0
loop:
li a0, 0x300
li a7, 0x13 // print system call
ecall
addi t1, t1, 1
ble t1, t0, loop
j exit // exit process
.=0x300
.ascii "Hello from process A\n"
Program for Process B
.=0x450
li t0, 50
li t1, 10
div t2, t0, t1
sw t2, 0x900(x0)
li a0, 0x600
li a7, 0x13 // print system call
ecall
j exit // exit process
.=0x600
.ascii "Hello from Process B\n"
Assume the OS schedules Process B first. For the following questions, if you can’t tell a value based on the information given, write CAN’T TELL.
The RISC-V processor does not have hardware to support a div (integer division) instruction, so the OS must emulate it. What are the values in the following registers after div is emulated?
t0:
t1:
t2:
pc:
t0: _______50_________
t1: _______10_________
t2: ________5_________
pc: ____0x45C________
Now let's look back at the main code.
li a0, 0x2000
li a7, 0
lw a1, 0(a0)
// start of the code piece
L1 :
andi a2, a1, 1
beq a2, zero, L2
addi a7, a7, 1
L2 :
srli a1, a1, 1
// end of the code piece
bnez a1, L1
unimp
. = 0x2000
.word 0x12345678
Execution begins with the first instruction li a0, 0x2000 which initializes register a0 to 0x00002000. The next instruction sets register a7 to 0.
What does the lw a1, 0(a0) instruction write into register a1 (answer in 32-bit hexadecimal format like 0x0CDEF1AF)?
0x12345678.
This instruction loads the contents of the memory location whose address is computed by adding 0 to a0. Thus, a1 is initialized to the value in memory location 0x2000 which is 0x12345678.
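For reference, the loop between L1 and the final bnez repeatedly tests the low bit of a1 and shifts right, so a7 appears to accumulate the number of 1-bits in the loaded word. A direct Python transcription (illustrative, not part of the original question):

# Mirrors the andi/beq/addi/srli loop: count set bits of a1 into a7.
a1, a7 = 0x12345678, 0
while a1 != 0:          # the asm's bnez a1, L1 (do-while; a1 starts nonzero)
    a7 += a1 & 1        # andi a2, a1, 1; the addi is skipped when the bit is 0
    a1 >>= 1            # srli a1, a1, 1
print(a7)               # 13 one-bits in 0x12345678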
80
35EECS6.18
Computer Systems Engineering
6.1010, 6.1910NoneMidterm Exam 1Unix2b0.375Text
As a result of a fork, there are two processes running on a machine: the parent and the child $\mathrm{A}$. Immediately after returning from the fork() call, the parent forks again, creating child B. Neither child process has been scheduled yet (i.e., they have not yet had an opportunity to execute anything after the return from fork()). We are asking about an instant when the two children are fully created and completely ready to run, but before either has had a chance to run.
Select True or False for the following statement:
Processes $\mathrm{A}$ and $\mathrm{B}$ have identical file descriptors.
Multiple Choice
True.
As a result of a fork, there are two processes running on a machine: the parent and the child $\mathrm{A}$. Immediately after returning from the fork() call, the parent forks again, creating child B. Neither child process has been scheduled yet (i.e., they have not yet had an opportunity to execute anything after the return from fork()). We are asking about an instant when the two children are fully created and completely ready to run, but before either has had a chance to run.
Select True or False for the following statement:
Child A knows the pid of Child B.
False.
As a result of a fork, there are two processes running on a machine: the parent and the child $\mathrm{A}$. Immediately after returning from the fork() call, the parent forks again, creating child B. Neither child process has been scheduled yet (i.e., they have not yet had an opportunity to execute anything after the return from fork()). We are asking about an instant when the two children are fully created and completely ready to run, but before either has had a chance to run.
Select True or False for the following statement:
Child B knows the pid of Child A. (The parent knows the pid of $A$ when forking $B$, so $B$ knows it too.)
True.
As a result of a fork, there are two processes running on a machine: the parent and the child $\mathrm{A}$. Immediately after returning from the fork() call, the parent forks again, creating child B. Neither child process has been scheduled yet (i.e., they have not yet had an opportunity to execute anything after the return from fork()). We are asking about an instant when the two children are fully created and completely ready to run, but before either has had a chance to run.
Select True or False for the following statement:
If virtual address $a$ maps to physical address $p$ in process $A$, then virtual address $a$ maps to physical address $p$ in process $B$.
False.
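A minimal POSIX sketch in Python (not part of the original exam; requires a Unix-like OS) makes the pid-visibility answers concrete: the parent records A's pid before forking B, so B inherits it, while A is created before B exists:

import os, sys

pid_a = os.fork()
if pid_a == 0:
    # Child A: fork() returned 0 here, and B does not exist yet,
    # so A cannot know B's pid.
    sys.exit(0)

pid_b = os.fork()
if pid_b == 0:
    # Child B: its memory is a copy of the parent's taken after pid_a
    # was assigned, so B does know A's pid.
    print("B sees pid_a =", pid_a)
    sys.exit(0)

os.waitpid(pid_a, 0)
os.waitpid(pid_b, 0)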
81
155Mathematics18.03
Differential Equations
None18.02Final Exam
Boundary Value Problems
9nan3.614457831Text
Consider the function
$$
f(x)=x(\pi-x)
$$
Find the function $u(x, t)$ solving the partial differential equation
$$
\frac{\partial u}{\partial t}=\frac{\partial^{2} u}{\partial x^{2}} \quad u(0, t)=u(\pi, t)=0
$$
for $x \in[0, \pi]$ and $t \geq 0$ subject to the initial condition
$$
u(x, 0)=f(x) .
$$
Expression
We recall that a solution to the heat equation with Dirichlet boundary condition is
$$
c e^{-n^{2} t} \sin (n x)
$$
Now we find a sine series for the initial condition by finding the Fourier series of the odd extension of $f(x)$. Since it is an odd function, we have $\tilde{a}_{0}=0$ and $a_{n}=0$ for all $n \geq 1$. For $n \geq 1$
$$
\begin{aligned}
b_{n} &=\frac{2}{\pi} \int_{0}^{\pi} x(\pi-x) \sin (n x) d x \\
&=\frac{2}{\pi}\left[\frac{-x(\pi-x)}{n} \cos (n x)\right]_{0}^{\pi}+\frac{2}{\pi} \int_{0}^{\pi} \frac{\pi-2 x}{n} \cos (n x) d x \\
&=0+\frac{2}{\pi}\left[\frac{\pi-2 x}{n^{2}} \sin (n x)\right]_{0}^{\pi}+\frac{4}{n^{2} \pi} \int_{0}^{\pi} \sin (n x) d x \\
&=0-\frac{4}{n^{3} \pi}[\cos (n x)]_{0}^{\pi}=\frac{4}{n^{3} \pi}\left(1-(-1)^{n}\right)
\end{aligned}
$$
where in the second and the third line we have used integration by parts. In the last equality, we have also used $\cos (n \pi)=(-1)^{n}$. Thus, a sine series for $f$ on $[0, \pi]$ is given by
$$
f(x)=\sum_{n \geq 1, n \text { odd }} \frac{8}{n^{3} \pi} \sin (n x) .
$$
We use this series to find the solution
$$
u(x, t)=\sum_{n \geq 1, n \text { odd }} \frac{8}{n^{3} \pi} e^{-n^{2} t} \sin (n x).
$$
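A quick numerical check (a sketch) that the truncated sine series reproduces the initial condition $f(x)=x(\pi-x)$ at $t=0$:

import numpy as np

x = np.linspace(0, np.pi, 201)
f = x * (np.pi - x)
# Partial sum over odd n of (8 / (n^3 pi)) sin(nx), i.e. u(x, 0).
series = sum(8 / (n**3 * np.pi) * np.sin(n * x) for n in range(1, 200, 2))
print(np.max(np.abs(series - f)))  # on the order of 1e-5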
Consider the function $f(x)=x(\pi-x)$ on $[0, \pi]$.
Find the solution of the equation
$$
\begin{aligned}
&\frac{\partial}{\partial t} u(x, t)=\frac{\partial^{2}}{\partial x^{2}} u(x, t) \quad x \in(0, \pi), t>0 \\
&u(0, t)=u(\pi, t)=0 \\
&u(x, 0)=f(x)
\end{aligned}
$$
Your solution may be left in complex form. (Hint: You may want to recall what you computed on Problem Set 8).
From Problem Set 8, we have a sine series for $f(x)$.
$$
f(x)=\sum_{n \geq 1, n \text { odd }} \frac{8}{n^{3} \pi} \sin (n x) .
$$
Thus, for Dirichlet boundary conditions, the general solution is,
$$
u(x, t)=\sum_{n \geq 1, n \text { odd }} \frac{8}{n^{3} \pi} e^{-n^{2} t} \sin (n x) .
$$
Find the solution of the equation
$$
\begin{gathered}
\frac{\partial}{\partial t} u(x, t)=\frac{\partial^{2}}{\partial x^{2}} u(x, t) \quad x \in(0, \pi), t>0 \\
\frac{\partial u}{\partial x}(0, t)=\frac{\partial u}{\partial x}(\pi, t)=0 \\
u(x, 0)=f(x)
\end{gathered}
$$
Your solution may be left in complex form.
We want a cosine series for $f(x)$. We have,
$$
a_{0}=\frac{2}{\pi} \int_{0}^{\pi} x(\pi-x) d x=\frac{\pi^{2}}{3} .
$$
For $k \geq 1$, we have,
$$
a_{k}=\frac{2}{\pi} \int_{0}^{\pi} x(\pi-x) \cos (k x) d x .
$$
Thus, we have,
$$
a_{k}= \begin{cases}-\frac{4}{k^{2}} & \text { for } k \text { even } \\ 0 & \text { otherwise. }\end{cases}
$$
Thus,
$$
f(x)=\frac{\pi^{2}}{6}-\sum_{k \geq 2, k \text { even }} \frac{4}{k^{2}} \cos (k x)
$$
Thus, for Neumann boundary conditions, the general solution is,
$$
u(x, t)=\frac{\pi^{2}}{6}-\sum_{k \geq 2, k \text { even }} \frac{4}{k^{2}} e^{-k^{2} t} \cos (k x).
$$
Let $u=u(x, t)$ be a solution of the heat equation
$$
u_{t}=u_{x x} .
$$
What equation does $\phi=-\frac{1}{u} u_{x}$ satisfy?
Hint. Calculate $\phi_{t}$ and use the equation for $u$. Calculate $\phi_{x}$ and write it in terms of $u, u_{x x}$, and $\phi^{2}$. Then compute $\phi_{x x}$. You should now be able to write $\phi_{t}$ in terms of $\phi, \phi_{x}$, and $\phi_{x x}$.
We have
$$
\begin{aligned}
& \phi_{t}=-\frac{1}{u} u_{x t}+\frac{1}{u^{2}} u_{x} u_{t}=-\frac{1}{u} u_{x x x}+\frac{1}{u^{2}} u_{x} u_{x x} \text {, where we have used } u_{t}=u_{x x} . \\
& \phi_{x}=-\frac{1}{u} u_{x x}+\phi^{2} . \\
& \phi_{x x}=-\frac{1}{u} u_{x x x}+\frac{1}{u^{2}} u_{x} u_{x x}+\left(\phi^{2}\right)_{x} . \\
& \phi_{t}+\left(\phi^{2}\right)_{x}=\phi_{x x} .
\end{aligned}
$$
Thus $\phi$ satisfies $\phi_{t} + (\phi^{2})_{x} = \phi_{xx}$.
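A symbolic spot-check (a sketch using one concrete positive solution of the heat equation, $u=2+e^{-t} \sin x$) confirms the resulting viscous Burgers-type equation:

import sympy as sp

x, t = sp.symbols("x t")
u = 2 + sp.exp(-t) * sp.sin(x)   # satisfies u_t = u_xx and stays positive
assert sp.simplify(sp.diff(u, t) - sp.diff(u, x, 2)) == 0
phi = -sp.diff(u, x) / u
residual = sp.diff(phi, t) + sp.diff(phi**2, x) - sp.diff(phi, x, 2)
print(sp.simplify(residual))     # 0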
82
101EECS6.39
Introduction to Machine Learning
6.1010/6.1210, 18.06/18.C06
NoneExercise 13Decision Trees1bv0.02083333333Text
We won't go through the algorithm of building a tree in full, but we will look at the first step of the algorithm, which is deciding where to first split the data, based on minimizing the weighted average entropy.
Recall the following notation from the notes. We consider a data set $\mathcal{D}$, and let $I$ be an indicator set of all the elements within $\mathcal{D}$, so that $I=\{1, \ldots, n\}$ for our whole data set.
We consider the features $x^{(i)}$ for examples $i \in I$, and split on the $j$ th dimension at location $s$, i.e., based on $x_{j}^{(i)} \geq s$. We let $I_{j, s}^{+}$represent the set of points on the "right" side of the split:
$$
I_{j, s}^{+}=\left\{i \in I \mid x_{j}^{(i)} \geq s\right\} .
$$
Similarly the points on the left side of the same split are
$$
I_{j, s}^{-}=\left\{i \in I \mid x_{j}^{(i)}<s\right\} .
$$
We can define $I_{m}$ as the subset of data samples that are in region $R_{m}$. We then define $\hat{P}_{m, k}$ as the empirical probability of class $k$ in $I_{m}$, where $\hat{P}_{m, k}$ is the fraction of data points in $I_{m}$ that are labeled with class $k$.
The entropy of data points in $I_{m}$ is given by
$$
H\left(I_{m}\right)=-\sum_{k} \hat{P}_{m, k} \log _{2} \hat{P}_{m, k}
$$
where we stipulate that $0 \log _{2} 0=0$.
Finally, the weighted average entropy $\hat{H}$ of a split on the $j$ th dimension at location $s$ is:
$$
\begin{aligned}
\hat{H} & =(\text { fraction of points in left data set }) \cdot H\left(I_{j, s}^{-}\right)+(\text {fraction of points in right data set }) \cdot H\left(I_{j, s}^{+}\right) \\
& =\frac{\left|I_{j, s}^{-}\right|}{N_{m}} \cdot H\left(I_{j, s}^{-}\right)+\frac{\left|I_{j, s}^{+}\right|}{N_{m}} \cdot H\left(I_{j, s}^{+}\right),
\end{aligned}
$$
where $N_{m}=\left|I_{m}\right|$.
For the dataset above, we will compute the weighted average entropies for the following potential splits of the data:
\begin{itemize}
\item Feature $x_{1} \geq 1.5$
\item Feature $x_{1} \geq-1.5$
\item Feature $x_{1} \geq 0.0$
\item Feature $x_{2} \geq 0.0$
\end{itemize}
We will compute the weighted average entropy of the first split $x_{1} \geq 1.5$ as an example:
There are two points in the right split, and both have the same class. So $H\left(I_{1,1.5}^{+}\right)=0-1 \cdot \log _{2} 1=0$.
In the left split there are four points, where $\frac{3}{4}$ of the points are positively labeled and $\frac{1}{4}$ are negatively labeled, so $H\left(I_{1,1.5}^{-}\right)=$ $-\left(\frac{3}{4} \log _{2}\left(\frac{3}{4}\right)+\frac{1}{4} \log _{2}\left(\frac{1}{4}\right)\right) \approx 0.8113$
Thus, the weighted average entropy is
$$
\frac{4}{6} H\left(I_{1,1.5}^{-}\right)+\frac{2}{6} H\left(I_{1,1.5}^{+}\right) \approx 0.541
$$
Make sure that you use log base 2 in computing entropies. You may use $\log 2$ when entering Python expressions or values in the boxes below.
What is the accuracy of the tree using only the split $x_{1} \geq 1.5$?:
Numerical
0.8333333333.
We are able to classify all the points correctly in the right split and 3 out of 4 points correctly in the left split, for a total of classifying 5 out of 6 points correctly.
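All of the entropy computations in this and the following parts follow the same pattern; a small Python helper (a sketch using the per-side class counts read off the dataset, which is not reproduced here) matches the quoted numbers:

import math

def H(counts):
    # Entropy (base 2) of a list of class counts, with 0*log2(0) = 0.
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def weighted_H(left, right):
    n = sum(left) + sum(right)
    return sum(left) / n * H(left) + sum(right) / n * H(right)

print(weighted_H([3, 1], [2, 0]))  # x1 >= 1.5  -> ~0.541
print(weighted_H([1, 0], [2, 3]))  # x1 >= -1.5 -> ~0.809
print(weighted_H([2, 1], [1, 2]))  # x1 >= 0.0  -> ~0.918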
We won't go through the algorithm of building a tree in full, but we will look at the first step of the algorithm, which is deciding where to first split the data, based on minimizing the weighted average entropy.
Recall the following notation from the notes. We consider a data set $\mathcal{D}$, and let $I$ be an indicator set of all the elements within $\mathcal{D}$, so that $I=\{1, \ldots, n\}$ for our whole data set.
We consider the features $x^{(i)}$ for examples $i \in I$, and split on the $j$ th dimension at location $s$, i.e., based on $x_{j}^{(i)} \geq s$. We let $I_{j, s}^{+}$represent the set of points on the "right" side of the split:
$$
I_{j, s}^{+}=\left\{i \in I \mid x_{j}^{(i)} \geq s\right\} .
$$
Similarly the points on the left side of the same split are
$$
I_{j, s}^{-}=\left\{i \in I \mid x_{j}^{(i)}<s\right\} .
$$
We can define $I_{m}$ as the subset of data samples that are in region $R_{m}$. We then define $\hat{P}_{m, k}$ as the empirical probability of class $k$ in $I_{m}$, where $\hat{P}_{m, k}$ is the fraction of data points in $I_{m}$ that are labeled with class $k$.
The entropy of data points in $I_{m}$ is given by
$$
H\left(I_{m}\right)=-\sum_{k} \hat{P}_{m, k} \log _{2} \hat{P}_{m, k}
$$
where we stipulate that $0 \log _{2} 0=0$.
Finally, the weighted average entropy $\hat{H}$ of a split on the $j$ th dimension at location $s$ is:
$$
\begin{aligned}
\hat{H} & =(\text { fraction of points in left data set }) \cdot H\left(I_{j, s}^{-}\right)+(\text {fraction of points in right data set }) \cdot H\left(I_{j, s}^{+}\right) \\
& =\frac{\left|I_{j, s}^{-}\right|}{N_{m}} \cdot H\left(I_{j, s}^{-}\right)+\frac{\left|I_{j, s}^{+}\right|}{N_{m}} \cdot H\left(I_{j, s}^{+}\right),
\end{aligned}
$$
where $N_{m}=\left|I_{m}\right|$.
For the dataset above, we will compute the weighted average entropies for the following potential splits of the data:
\begin{itemize}
\item Feature $x_{1} \geq 1.5$
\item Feature $x_{1} \geq-1.5$
\item Feature $x_{1} \geq 0.0$
\item Feature $x_{2} \geq 0.0$
\end{itemize}
We will compute the weighted average entropy of the first split $x_{1} \geq 1.5$ as an example:
There are two points in the right split, and both have the same class. So $H\left(I_{1,1.5}^{+}\right)=0-1 \cdot \log _{2} 1=0$.
In the left split there are four points, where $\frac{3}{4}$ of the points are positively labeled and $\frac{1}{4}$ are negatively labeled, so $H\left(I_{1,1.5}^{-}\right)=$ $-\left(\frac{3}{4} \log _{2}\left(\frac{3}{4}\right)+\frac{1}{4} \log _{2}\left(\frac{1}{4}\right)\right) \approx 0.8113$
Thus, the weighted average entropy is
$$
\frac{4}{6} H\left(I_{1,1.5}^{-}\right)+\frac{2}{6} H\left(I_{1,1.5}^{+}\right) \approx 0.541
$$
Make sure that you use log base 2 in computing entropies. You may use $\log 2$ when entering Python expressions or values in the boxes below.
Compute the weighted average entropy for the split $x_{1} \geq-1.5$, with at least two decimal places:
0.809.
Feature $x_{1} \geq-1.5$ : the weighted average entropy is $\hat{H}=\frac{1}{6} * 0+\frac{5}{6} *-\left(\frac{2}{5} * \log _{2}\left(\frac{2}{5}\right)+\frac{3}{5} *\right.$ $\left.\log _{2}\left(\frac{3}{5}\right)\right) \approx 0.809$.
In this case the left side contains items that all have the same label, so the entropy on the left side of this split is 0.
We won't go through the algorithm of building a tree in full, but we will look at the first step of the algorithm, which is deciding where to first split the data, based on minimizing the weighted average entropy.
Recall the following notation from the notes. We consider a data set $\mathcal{D}$, and let $I$ be an indicator set of all the elements within $\mathcal{D}$, so that $I=\{1, \ldots, n\}$ for our whole data set.
We consider the features $x^{(i)}$ for examples $i \in I$, and split on the $j$ th dimension at location $s$, i.e., based on $x_{j}^{(i)} \geq s$. We let $I_{j, s}^{+}$represent the set of points on the "right" side of the split:
$$
I_{j, s}^{+}=\left\{i \in I \mid x_{j}^{(i)} \geq s\right\} .
$$
Similarly the points on the left side of the same split are
$$
I_{j, s}^{-}=\left\{i \in I \mid x_{j}^{(i)}<s\right\} .
$$
We can define $I_{m}$ as the subset of data samples that are in region $R_{m}$. We then define $\hat{P}_{m, k}$ as the empirical probability of class $k$ in $I_{m}$, where $\hat{P}_{m, k}$ is the fraction of data points in $I_{m}$ that are labeled with class $k$.
The entropy of data points in $I_{m}$ is given by
$$
H\left(I_{m}\right)=-\sum_{k} \hat{P}_{m, k} \log _{2} \hat{P}_{m, k}
$$
where we stipulate that $0 \log _{2} 0=0$.
Finally, the weighted average entropy $\hat{H}$ of a split on the $j$ th dimension at location $s$ is:
$$
\begin{aligned}
\hat{H} & =(\text { fraction of points in left data set }) \cdot H\left(I_{j, s}^{-}\right)+(\text {fraction of points in right data set }) \cdot H\left(I_{j, s}^{+}\right) \\
& =\frac{\left|I_{j, s}^{-}\right|}{N_{m}} \cdot H\left(I_{j, s}^{-}\right)+\frac{\left|I_{j, s}^{+}\right|}{N_{m}} \cdot H\left(I_{j, s}^{+}\right),
\end{aligned}
$$
where $N_{m}=\left|I_{m}\right|$.
For the dataset above, we will compute the weighted average entropies for the following potential splits of the data:
\begin{itemize}
\item Feature $x_{1} \geq 1.5$
\item Feature $x_{1} \geq-1.5$
\item Feature $x_{1} \geq 0.0$
\item Feature $x_{2} \geq 0.0$
\end{itemize}
We will compute the weighted average entropy of the first split $x_{1} \geq 1.5$ as an example:
There are two points in the right split, and both have the same class. So $H\left(I_{1,1.5}^{+}\right)=0-1 \cdot \log _{2} 1=0$.
In the left split there are four points, where $\frac{3}{4}$ of the points are positively labeled and $\frac{1}{4}$ are negatively labeled, so $H\left(I_{1,1.5}^{-}\right)=$ $-\left(\frac{3}{4} \log _{2}\left(\frac{3}{4}\right)+\frac{1}{4} \log _{2}\left(\frac{1}{4}\right)\right) \approx 0.8113$
Thus, the weighted average entropy is
$$
\frac{4}{6} H\left(I_{1,1.5}^{-}\right)+\frac{2}{6} H\left(I_{1,1.5}^{+}\right) \approx 0.541
$$
Make sure that you use log base 2 in computing entropies. You may use $\log 2$ when entering Python expressions or values in the boxes below.
Compute the weighted average entropy for the split $x_{1} \geq 0$, with at least two decimal places:
0.918.
Feature $x_{1} \geq 0.0: \hat{H}=\frac{3}{6} *-\left(\frac{1}{3} * \log _{2}\left(\frac{1}{3}\right)+\frac{2}{3} * \log _{2}\left(\frac{2}{3}\right)\right)+\frac{3}{6} *-\left(\frac{1}{3} * \log _{2}\left(\frac{1}{3}\right)+\frac{2}{3} *\right.$ $\left.\log _{2}\left(\frac{2}{3}\right)\right)=0.918$.
We won't go through the algorithm of building a tree in full, but we will look at the first step of the algorithm, which is deciding where to first split the data, based on minimizing the weighted average entropy.
Recall the following notation from the notes. We consider a data set $\mathcal{D}$, and let $I$ be an indicator set of all the elements within $\mathcal{D}$, so that $I=\{1, \ldots, n\}$ for our whole data set.
We consider the features $x^{(i)}$ for examples $i \in I$, and split on the $j$ th dimension at location $s$, i.e., based on $x_{j}^{(i)} \geq s$. We let $I_{j, s}^{+}$represent the set of points on the "right" side of the split:
$$
I_{j, s}^{+}=\left\{i \in I \mid x_{j}^{(i)} \geq s\right\} .
$$
Similarly the points on the left side of the same split are
$$
I_{j, s}^{-}=\left\{i \in I \mid x_{j}^{(i)}<s\right\} .
$$
We can define $I_{m}$ as the subset of data samples that are in region $R_{m}$. We then define $\hat{P}_{m, k}$ as the empirical probability of class $k$ in $I_{m}$, where $\hat{P}_{m, k}$ is the fraction of data points in $I_{m}$ that are labeled with class $k$.
The entropy of data points in $I_{m}$ is given by
$$
H\left(I_{m}\right)=-\sum_{k} \hat{P}_{m, k} \log _{2} \hat{P}_{m, k}
$$
where we stipulate that $0 \log _{2} 0=0$.
Finally, the weighted average entropy $\hat{H}$ of a split on the $j$ th dimension at location $s$ is:
$$
\begin{aligned}
\hat{H} & =(\text { fraction of points in left data set }) \cdot H\left(I_{j, s}^{-}\right)+(\text {fraction of points in right data set }) \cdot H\left(I_{j, s}^{+}\right) \\
& =\frac{\left|I_{j, s}^{-}\right|}{N_{m}} \cdot H\left(I_{j, s}^{-}\right)+\frac{\left|I_{j, s}^{+}\right|}{N_{m}} \cdot H\left(I_{j, s}^{+}\right),
\end{aligned}
$$
where $N_{m}=\left|I_{m}\right|$.
For the dataset above, we will compute the weighted average entropies for the following potential splits of the data:
\begin{itemize}
\item Feature $x_{1} \geq 1.5$
\item Feature $x_{1} \geq-1.5$
\item Feature $x_{1} \geq 0.0$
\item Feature $x_{2} \geq 0.0$
\end{itemize}
We will compute the weighted average entropy of the first split $x_{1} \geq 1.5$ as an example:
There are two points in the right split, and both have the same class. So $H\left(I_{1,1.5}^{+}\right)=0-1 \cdot \log _{2} 1=0$.
In the left split there are four points, where $\frac{3}{4}$ of the points are positively labeled and $\frac{1}{4}$ are negatively labeled, so $H\left(I_{1,1.5}^{-}\right)=$ $-\left(\frac{3}{4} \log _{2}\left(\frac{3}{4}\right)+\frac{1}{4} \log _{2}\left(\frac{1}{4}\right)\right) \approx 0.8113$
Thus, the weighted average entropy is
$$
\frac{4}{6} H\left(I_{1,1.5}^{-}\right)+\frac{2}{6} H\left(I_{1,1.5}^{+}\right) \approx 0.541
$$
Make sure that you use log base 2 in computing entropies. You may use $\log 2$ when entering Python expressions or values in the boxes below.
Compute the weighted average entropy for the split $x_{2} \geq 0$, with at least two decimal places:
0.918.
Feature $x_{2} \geq 0.0: \hat{H}=\frac{3}{6} *-\left(\frac{1}{3} * \log _{2}\left(\frac{1}{3}\right)+\frac{2}{3} * \log _{2}\left(\frac{2}{3}\right)\right)+\frac{3}{6} *-\left(\frac{1}{3} * \log _{2}\left(\frac{1}{3}\right)+\frac{2}{3} *\right.$ $\left.\log _{2}\left(\frac{2}{3}\right)\right)=0.918$.
83
51EECS6.39
Introduction to Machine Learning
6.1010/6.1210, 18.06/18.C06
NoneExercise 7Neural Networks3d0.01736111111Text
When Alex was first studying machine learning, he sometimes wondered about the relationship between linear regression, logistic regression, and neural networks. Is there actually any? For each of the neural-network architectures described below, help him identify whether it is equivalent to some previous model we have studied.
In each case, we will specify the number of layers, the activation function for each layer, and the loss function. In all cases, please assume the last layer outputs a scalar. Let $f$ be the activation function in a single-layer network, and let $f^{1}$ and $f^{2}$ be the activations in the first and second layers of a two-layer network, respectively.
Two layers, $f^{1}$ is identity, $f^{2}$ is sigmoid, loss is NLL.
Which of the following is this equivalent to:
(a) linear regression.
(b) logistic regression.
(c) a different kind of sensible neural network.
(d) an ill-formed neural network.
Multiple Choice
(b) logistic regression.
Since the output of the network is a scalar, the weight matrix in the second layer must be a vector of the same shape as the output of the first layer. We have that the whole network takes the form
$$
\begin{aligned}
f^{2}\left(w_{2}^{T} f^{1}\left(W_{1}^{T} x+b_{1}\right)+b_{2}\right) & =\sigma\left(w_{2}^{T}\left(W_{1}^{T} x+b_{1}\right)+b_{2}\right) \\
& =\sigma\left(\left(w_{2}^{T} W_{1}^{T}\right) x+\left(w_{2}^{T} b_{1}+b_{2}\right)\right)
\end{aligned}
$$
Letting $\theta=W_{1} w_{2}$ (note $\theta$ is a vector) and $\theta_{0}=w_{2}^{T} b_{1}+b_{2}$ (note $\theta_{0}$ is a scalar), we can rewrite the network as $\sigma\left(\theta^{T} x+\theta_{0}\right)$, which together with the NLL loss we recognize as a logistic regression.
Then in the reverse direction, starting with a logistic regression model $\sigma\left(\theta^{T} x+\theta_{0}\right)$, we can take $w_{2}=\theta$, $b_{2}=\theta_{0}$, $W_{1}=I$, and $b_{1}=0$ to get an output of the form
$$
\begin{aligned}
\sigma\left(w_{2}^{T}\left(W_{1}^{T} x+b_{1}\right)+b_{2}\right) & =\sigma\left(\theta^{T}\left(I^{T} x+0\right)+\theta_{0}\right) \\
& =\sigma\left(\theta^{T} x+\theta_{0}\right),
\end{aligned}
$$
meaning this two-layer net can express any logistic regression model.
Since we can express any instance of logistic regression as an instance of this two-layer net and any instance of this two-layer net as an instance of logistic regression, we say that the two models are equivalent.
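A numerical illustration (a sketch with randomly chosen weights; the dimensions d and m are hypothetical) that the identity-first-layer network computes exactly a logistic regression with the collapsed parameters above:

import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 3                                   # input dim, hidden width
W1, b1 = rng.normal(size=(d, m)), rng.normal(size=m)
w2, b2 = rng.normal(size=m), rng.normal()
x = rng.normal(size=d)

sigmoid = lambda t: 1 / (1 + np.exp(-t))
net = sigmoid(w2 @ (W1.T @ x + b1) + b2)      # two-layer network output
theta, theta0 = W1 @ w2, w2 @ b1 + b2         # collapsed parameters
print(np.isclose(net, sigmoid(theta @ x + theta0)))  # True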
When Alex was first studying machine learning, he sometimes wondered about the relationship between linear regression, logistic regression, and neural networks. Is there actually any? For each of the neural-network architectures described below, help him identify whether it is equivalent to some previous model we have studied.
In each case, we will specify the number of layers, the activation function for each layer, and the loss function. In all cases, please assume the last layer outputs a scalar. Let $f$ be the activation function in a single-layer network, and let $f^{1}$ and $f^{2}$ be the activations in the first and second layers of a two-layer network, respectively.
Two layers, $f^{1}$ is sigmoid, $f^{2}$ is identity, loss is NLL.
Which of the following is this equivalent to:
(a) linear regression.
(b) logistic regression.
(c) a different kind of sensible neural network.
(d) an ill-formed neural network.
(d) an ill-formed neural network.
The weights in layer 2 can push the output outside $(0,1)$, so NLL loss would be ill-suited and undefined for some outputs.
When Alex was first studying machine learning, he sometimes wondered about the relationship between linear regression, logistic regression, and neural networks. Is there actually any? For each of the neural-network architectures described below, help him identify whether it is equivalent to some previous model we have studied.
In each case, we will specify the number of layers, the activation function for each layer, and the loss function. In all cases, please assume the last layer outputs a scalar. Let $f$ be the activation function in a single-layer network, and let $f^{1}$ and $f^{2}$ be the activations in the first and second layers of a two-layer network, respectively.
Two layers, $f^{1}$ is sigmoid, $f^{2}$ is sigmoid, loss is NLL.
Which of the following is this equivalent to:
(a) linear regression.
(b) logistic regression.
(c) a different kind of sensible neural network.
(d) an ill-formed neural network.
(c) a different kind of sensible neural network.
Note that since the output of this network is a scalar, it takes the form
$$
\sigma\left(w_{2}^{T} \sigma\left(W_{1}^{T} x+b_{1}\right)+b_{2}\right) .
$$
This network is sensible because its output is a scalar in the interval $(0,1)$ (the range of sigmoid), and the NLL loss is well-defined for any input in $(0,1)$. Then taking $W_{1}=I, b_{1}=0, b_{2}=0$, and $w_{2}=$ $[1,0, \cdots, 0]^{T}$, we get an output of
$$
\sigma\left([1,0, \cdots, 0] \sigma\left(I^{T} x+0\right)+0\right)=\sigma\left(\sigma\left(x_{0}\right)\right),
$$
a function which cannot be expressed via either a linear or logistic regression.
When Alex was first studying machine learning, he sometimes wondered about the relationship between linear regression, logistic regression, and neural networks. Is there actually any? For each of the neural-network architectures described below, help him identify whether it is equivalent to some previous model we have studied.
In each case, we will specify the number of layers, the activation function for each layer, and the loss function. In all cases, please assume the last layer outputs a scalar. Let $f$ be the activation function in a single-layer network, and let $f^{1}$ and $f^{2}$ be the activations in the first and second layers of a two-layer network, respectively.
One layer, $f$ is sigmoid, loss is NLL.
Which of the following is this equivalent to:
(a) linear regression.
(b) logistic regression.
(c) a different kind of sensible neural network.
(d) an ill-formed neural network.
(b) logistic regression.
As in previous questions, since the output of the neural net is a scalar, the net takes the form
$$
f\left(w^{T} x+b\right)=\sigma\left(w^{T} x+b\right)
$$
which together with the NLL loss we recognize as a logistic regression with $\theta=w$ and $\theta_{0}=b$ in our usual notation.
84
67Mathematics18.404
Theory of Computation
6.1210/18.200NoneMidterm ExamFinite Automata9nan2.222222222Text
A 2-way counter automaton (2WAY-CA) is a deterministic counter automaton where the head on the input tape can move left or right at each step, as specified by its transition function. The input tape is still read-only. For example, a 2WAY-CA can recognize the language $B=\left\{x \# x \mid x \in\{\mathrm{a}, \mathrm{b}\}^{*}\right\}$ as follows. First it scans the entire input to check that it is of the form $y \# z$ where $y, z \in\{\mathrm{a}, \mathrm{b}\}^{*}$ and it uses the counter to check that $y$ and $z$ agree in length. Then it makes multiple scans over the input to check that corresponding symbols in $y$ and $z$ agree, using the counter to identify corresponding locations. If all checks succeed, then it accepts.
Let $E_{2\text{WAY-CA}}=\{\langle C\rangle \mid C$ is a 2WAY-CA and $L(C)=\emptyset\}$. Show $E_{2\text{WAY-CA}}$ is undecidable. (Give enough detail to show how the counter and the 2-way head are needed, such as in the example above.)
Open
To prove that $E_{2\text{WAY-CA}}$ is undecidable, reduce $A_{\mathrm{TM}}$ to $E_{2\text{WAY-CA}}$ by using the computation history method. Assume TM $R$ decides $E_{2\text{WAY-CA}}$ and construct TM $S$ deciding $A_{\mathrm{TM}}$. This proof is similar to the proof that $E_{\mathrm{LBA}}$ is undecidable.
$S =$ "On input $\langle M, W \rangle$
1. Construct 2WAY-CA $C_{M, w}$ as follows. $\left(C_{M, w}\right.$ is designed to accept all strings $u$ that are an accepting computation history of $M$ on $w$. We refer to the parts of $u$ separated by \# symbols as blocks, so $u=b_{1} \# b_{2} \# \cdots \# b_{k}$.) $C_{M, w}={ }^{\text {"On input } u}$
1. Check (by using a built-in string) that $b_{1}$ is the start configuration of $M$ on $w$. If not then reject.
2. For each pair of blocks $b_{i}$ and $b_{i+1}$, check that $b_{i}$ yields $b_{i+1}$ as configurations according to $M$ 's rules. Do so by using the counter to identify corresponding locations in the blocks (as in the example) and going back and forth to check the entire configuration. Reject if any of these pairs fail.
3. Check that the last block $b_{k}$ contains the accept state. Reject if not.
4. Accept if all checks pass."
2. Run $R$ on input $\left\langle C_{M, w}\right\rangle$ to determine whether its language is empty.
3. Accept if $R$ rejects. Reject if $R$ accepts."
If $M$ accepts $w$ then some string $u$ is an accepting computation history of $M$ on $w$, so $C_{M, w}$ accepts $u$ and its language is nonempty. Hence $R$ rejects $\left\langle C_{M, w}\right\rangle$. If $M$ rejects $w$ then no string $u$ is an accepting computation history of $M$ on $w$, so $C_{M, w}$ rejects all strings, its language is empty, and so $R$ accepts $\left\langle C_{M, w}\right\rangle$. Therefore $M$ accepts $w$ iff $R$ rejects $\left\langle C_{M, w}\right\rangle$. Hence $S$ decides $A_{\mathrm{TM}}$, a contradiction. Therefore $E_{2\text{WAY-CA}}$ is undecidable.
A counter automaton has a single, read-only finite input tape that is just large enough to contain its input, and it also has a counter which contains a non-negative integer value. The counter initially starts at 0 . Under control of the transition function, at each step the counter automaton can add 1, subtract 1, or leave the counter value unchanged, and it can test whether the counter value is 0 . At each step it can also read the symbol under its tape head, and it can test whether the head is at the beginning or end of the input tape. It accepts its input by entering an accept state. Here we consider only deterministic counter automata. A 1-way counter automaton (1WAY-CA) is a counter automaton where the head on the input tape moves one symbol right at each step. In other words, a 1WAY-CA is a DFA with a counter. For example, a 1WAY-CA can recognize the language $A=\left\{\mathrm{a}^{k} \mathrm{~b}^{k} \mid k \geq 0\right\}$ as follows. It scans its input until it reaches the end, adding 1 to the counter for each a and subtracting 1 for each $\mathrm{b}$. When it reaches the end of the input, it accepts if the counter is 0 , and no a's come after b's.
Let $E_{1 \text { WAY-CA }}=\{\langle C\rangle \mid C$ is a 1WAY-CA and $L(C)=\emptyset\}$. Show that $E_{1 \text { WAY-CA }}$ is decidable. (Hint: A theorem we've seen before is useful here.)
We can convert a 1WAY-CA to an equivalent PDA which simulates the counter by using the stack. Hence we can decide $E_{\text {1WAY-CA }}$ by using the $E_{\mathrm{PDA}}$ (or $E_{\mathrm{CFG}}$ ) decider.
In a two-dimensional finite automaton (2DIM-DFA) the input is an $m \times n$ rectangle, for any $m, n \geq 2$. The squares along the boundary of the rectangle contain the symbol \# and the internal squares contain symbols over the input alphabet $\Sigma$. The transition function $\delta: Q \times(\Sigma \cup\{\#\}) \rightarrow Q \times\{\mathrm{L}, \mathrm{R}, \mathrm{U}, \mathrm{D}\}$ indicates the next state and the new head position (Left, Right, Up, Down). The machine accepts when it enters one of the designated accept states. It rejects if it tries to move off the input rectangle or if it never halts. Two such machines are equivalent if they accept the same rectangles.
Let $A_{2\text{DIM-DFA}}=\{\langle B, r\rangle \mid B$ is a 2DIM-DFA and $B$ accepts rectangle $r\}$. Show that $A_{2\text{DIM-DFA}}$ is decidable.
The following TM decides $A_{2\text{DIM-DFA}}$:
"On input $\langle B, r\rangle$ :
1. Run $B$ for $|Q| m n$ steps.
2. If $B$ has accepted, then accept. Otherwise reject."
If $B$ runs for $|Q| m n$ steps without halting, then by the pigeonhole principle it must have repeated some state at the same head location (there are only $|Q| m n$ state-location pairs), so it will loop forever.
Consider the problem of determining whether a single-tape Turing machine
ever writes a blank symbol over a nonblank symbol during the course of
its computation on any input string.
Formulate this problem as a language and show that it is undecidable.
Let $E=\{\langle M\rangle \mid M$ is a single-tape TM which writes a blank symbol over a nonblank symbol for some input $\}$. Reduce $A_{\mathrm{TM}}$ to $E$ as follows. Assume for the sake of contradiction that TM $R$ decides $E$. Construct TM $S$ that uses $R$ to decide $A_{\mathrm{TM}}$.
$S=$ "On input $\langle M, w\rangle$ :
1. Use $M$ and $w$ to construct the following TM $T_{M, w}$.
$T_{M, w}=$ "On any input:
1. Simulate $M$ on $w$. Use a new symbol $\sqcup^{\prime}$ instead of a true blank $\sqcup$ when writing, and treat it like a true blank when reading it.
2. If $M$ accepts, write a true blank over some nonblank."
2. Run $R$ on $\left\langle T_{M, w}\right\rangle$ to determine whether $T_{M, w}$ ever writes a blank.
3. If $R$ accepts, $M$ accepts $w$, therefore accept. Otherwise reject."
85
3EECS6.122
Design and Analysis of Algorithms
6.121NoneProblem Set 1Probability2c0.1818181818Text
Asami and Bolin are playing a dice game in which a pair of dice is rolled repeatedly. Asami wins if a sum of 6 is rolled before any sum greater than or equal to 9, and Bolin wins if any sum greater than or equal to 9 is rolled first. We will find the probability that Asami wins the game.
Let $E_n$ denote the event that a 6 occurs on the $n^{\text {th }}$ roll and neither 6 nor any number greater than or equal to 9 occurs on any of the first $(n-1)$ rolls.
Compute $\sum_{n=1}^{\infty} P\left(E_n\right)$ and argue rigorously that it is the desired probability. HINT: $\sum_{n=0}^{\infty} r^n=\frac{1}{1-r}$ for $|r|<1$.
Open
Since the events $E_n$ are mutually disjoint and their union is precisely the event whose probability we want, we can add their probabilities to obtain the desired result:
$$
P(6 \text { before } \geq 9)=\sum_{n=1}^{\infty} P\left(E_n\right)=\sum_{n=1}^{\infty}\left(\frac{21}{36}\right)^{n-1} \cdot \frac{5}{36} .
$$
This can be evaluated using the hint in the problem to be
$$
\frac{1}{1-21 / 36} \cdot \frac{5}{36}=\frac{1}{3} \text {. }
$$
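A Monte Carlo sketch of the game agrees with the value 1/3:

import random

def asami_wins():
    # Roll two dice until a sum of 6 (Asami wins) or >= 9 (Bolin wins).
    while True:
        s = random.randint(1, 6) + random.randint(1, 6)
        if s == 6:
            return True
        if s >= 9:
            return False

trials = 100_000
print(sum(asami_wins() for _ in range(trials)) / trials)  # ~0.333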
Asami and Bolin are playing a dice game in which a pair of dice is rolled repeatedly. Asami wins if a sum of 6 is rolled before any sum greater than or equal to 9, and Bolin wins if any sum greater than or equal to 9 is rolled first. We will find the probability that Asami wins the game.
Let $E_n$ denote the event that a 6 occurs on the $n^{\text {th }}$ roll and neither 6 nor any number greater than or equal to 9 occurs on any of the first $(n-1)$ rolls.
Compute $P\left(E_n\right)$.
As we found in the previous part, 6 occurs with probability $\frac{5}{36}$ and any number greater than or equal to 9 with probability $\frac{10}{36}$. Therefore, neither occurs on a given roll with probability
$$
1-\frac{5}{36}-\frac{10}{36}=\frac{21}{36} .
$$
Each of the $n$ rolls required for $E_n$ to occur is independent, so to find the probability $P\left(E_n\right)$, we can multiply the probabilities that the first $(n-1)$ rolls result in neither 6 nor any number greater than or equal to 9 and the probability that the $n^{\text {th }}$ roll results in a 6, yielding
$$
P\left(E_n\right)=\left(\frac{21}{36}\right)^{n-1} \cdot \frac{5}{36} .
$$
Asami and Bolin are playing a dice game in which a pair of dice is rolled repeatedly. Asami wins if a sum of 6 is rolled before any sum greater than or equal to 9, and Bolin wins if any sum greater than or equal to 9 is rolled first. We will find the probability that Asami wins the game.
Let $E_n$ denote the event that a 6 occurs on the $n^{\text {th }}$ roll and neither 6 nor any number greater than or equal to 9 occurs on any of the first $(n-1)$ rolls.
What is the probability that a sum of 6 is rolled on any given roll of the pair of dice? What is the probability that any number greater than or equal to 9 is rolled on any given roll of the pair of dice?
Of the 36 equally likely possible rolls, 5 result in a sum of 6: $(1,5)$, $(2,4)$, $(3,3)$, $(4,2)$, and $(5,1)$. Thus, the probability that a 6 is rolled on any given roll of the pair of dice is $\frac{5}{36}$.
We can analyze the other probability similarly. Of the 36 equally likely possible rolls, 10 result in a total roll greater than or equal to 9 , with four corresponding to a total of 9 , three corresponding to a total of 10, two corresponding to a total of 11, and one corresponding to a total of 12 . Thus, the probability that any number greater than or equal to 9 is rolled on any given roll of the pair of dice is $\frac{10}{36}=\frac{5}{18}$.
Asami and Bolin are playing a dice game in which a pair of dice is rolled repeatedly. Asami wins if a sum of 6 is rolled before any sum greater than or equal to 9, and Bolin wins if any sum greater than or equal to 9 is rolled first. We will find the probability that Asami wins the game.
Let $E_n$ denote the event that a 6 occurs on the $n^{\text {th }}$ roll and neither 6 nor any number greater than or equal to 9 occurs on any of the first $(n-1)$ rolls.
Now, suppose that in a run of this game, the dice are rolled ten times with neither a 6 nor any number greater than or equal to 9 appearing. Given this, let X be the sum of these ten rolls. We would now like to find an upper bound for the following probability $P[|X − \mathbb{E}[X]| \geq 10]$.
Given that each of these ten rolls results in neither a 6 nor any number greater than or equal to 9, enumerate the possible totals for each roll and their respective probabilities.
Because we know that we roll neither a 6 nor any number greater than or equal to 9 , the only possible rolls are $2,3,4,5,7,8$. There are $36-5-10=21$ rolls that result in these totals, with counts $1,2,3,4$, 6, and 5, respectively.
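From these counts, the conditional distribution of a single roll, and hence the mean and variance needed to bound $P[|X-\mathbb{E}[X]| \geq 10]$ (e.g., via Chebyshev's inequality), can be tabulated exactly; a sketch:

from fractions import Fraction as F

# Conditional pmf of one roll given that it lands in {2, 3, 4, 5, 7, 8}.
pmf = {2: F(1, 21), 3: F(2, 21), 4: F(3, 21), 5: F(4, 21),
       7: F(6, 21), 8: F(5, 21)}
mean = sum(v * p for v, p in pmf.items())
var = sum((v - mean) ** 2 * p for v, p in pmf.items())
# X is the sum of 10 independent such rolls.
print(mean, 10 * mean, 10 * var)  # per-roll mean 122/21, E[X] = 1220/21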
86
58EECS6.3
Signal Processing
6.100A, 18.03NoneProblem Set 5
Fourier Transforms
2c0.1171875Text
Find the Discrete-Time Fourier Transform of $f_{3}[n]$:
$$
f_{3}[n]=\left(\frac{1}{2}\right)^{|n|}
$$
Expression
$$
\begin{aligned}
F_{3}(\Omega) & =\sum_{n=-\infty}^{\infty}\left(\frac{1}{2}\right)^{|n|} e^{-j \Omega n}=2 \operatorname{Re}\left\{\sum_{n=0}^{\infty}\left(\frac{1}{2} e^{-j \Omega}\right)^{n}\right\}-1 \\
& =2 \operatorname{Re}\left\{\frac{1}{1-\frac{1}{2} e^{-j \Omega}}\right\}-1=\frac{3}{5-4 \cos \Omega}
\end{aligned}
$$
Here the negative-$n$ half of the sum is the complex conjugate of the positive-$n$ half, and the $n=0$ term is counted twice, hence the factor of 2, the real part, and the $-1$. Note that the result is real and even in $\Omega$, as it must be for a real, even signal.
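This closed form is easy to verify numerically; a minimal numpy sketch comparing a truncated two-sided sum (the cutoff 200 is arbitrary, since the terms decay like $(1/2)^{|n|}$) against $3 /(5-4 \cos \Omega)$:
import numpy as np

Omega = np.linspace(-np.pi, np.pi, 7)
n = np.arange(-200, 201)                      # truncated two-sided sum
terms = (0.5 ** np.abs(n))[None, :] * np.exp(-1j * np.outer(Omega, n))
assert np.allclose(terms.sum(axis=1), 3.0 / (5.0 - 4.0 * np.cos(Omega)))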
Find the Discrete-Time Fourier Transform of $f_{4}[n]$ :
$$
f_{4}[n]= \begin{cases}n\left(\frac{1}{2}\right)^{n} & \text { if } n \geq 0 \\ 0 & \text { otherwise }\end{cases}
$$
Recall the geometric series, valid for $|a|<1$:
$$
\sum_{n=0}^{\infty} a^{n}=\frac{1}{1-a}
$$
Differentiate both sides with respect to $a$, then multiply by $a$:
$$
\begin{aligned}
& \sum_{n=0}^{\infty} n a^{n-1}=\frac{1}{(1-a)^{2}} \quad \Longrightarrow \quad \sum_{n=0}^{\infty} n a^{n}=\frac{a}{(1-a)^{2}} \\
& F_{4}(\Omega)=\sum_{n=0}^{\infty} n\left(\frac{1}{2}\right)^{n} e^{-j \Omega n}=\sum_{n=0}^{\infty} n\left(\frac{1}{2} e^{-j \Omega}\right)^{n}=\frac{\frac{1}{2} e^{-j \Omega}}{\left(1-\frac{1}{2} e^{-j \Omega}\right)^{2}}
\end{aligned}
$$
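The same truncated-sum check works here (a numpy sketch; the cutoff 400 is arbitrary):
import numpy as np

Omega = np.linspace(-np.pi, np.pi, 7)
n = np.arange(0, 400)
truncated = ((n * 0.5 ** n)[None, :] * np.exp(-1j * np.outer(Omega, n))).sum(axis=1)
a = 0.5 * np.exp(-1j * Omega)                 # a = (1/2) e^{-j Omega}
assert np.allclose(truncated, a / (1 - a) ** 2)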
Each part of this problem describes a different discrete-time signal $f_{i}[n]$ and then asks you to determine the $k=3$ component of the DFT of that signal, where the DFT is computed with analysis window $N=16$:
$$
F_{i}[3]=\frac{1}{16} \sum_{n=0}^{15} f_{i}[n] e^{-j 2 \pi 3 n / 16}
$$
Let $f_{3}[n]=\cos (3 \pi n / 8-9 \pi / 8)$. Enter a closed form expression for $F_{3}[3]$ below.
$$
F_{3}[3]=\frac{1}{2} e^{-j 2 \pi \cdot 9 / 16}
$$
Since, with $f_{2}[n]=\cos (3 \pi n / 8)$ from the previous part,
$$
f_{3}[n]=f_{2}[n-3]
$$
it follows that
$$
F_{3}[k]=e^{-j 2 \pi k \cdot 3 / 16} F_{2}[k]
$$
Therefore $F_{3}[3]=\frac{1}{2} e^{-j 2 \pi \cdot 9 / 16}$.
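A direct numerical evaluation of the defining DFT sum confirms this value (a numpy sketch):
import numpy as np

n = np.arange(16)
f3 = np.cos(3 * np.pi * n / 8 - 9 * np.pi / 8)
F3_3 = (f3 * np.exp(-1j * 2 * np.pi * 3 * n / 16)).mean()   # (1/16) times the sum
assert np.isclose(F3_3, 0.5 * np.exp(-1j * 2 * np.pi * 9 / 16))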
Each part of this problem describes a different discrete-time signal $f_{i}[n]$ and then asks you to determine the $k=3$ component of the DFT of that signal, where the DFT is computed with analysis window $N=16$:
$$
F_{i}[3]=\frac{1}{16} \sum_{n=0}^{15} f_{i}[n] e^{-j 2 \pi 3 n / 16}
$$
Determine a closed form expression for $F_{5}[3]$ where
$$
f_{5}[n]=\left(\frac{1}{2}\right)^{n} u[n]
$$
$$
F_{5}[3]=\frac{1}{16}\left(\frac{1-\left(\frac{1}{2}\right)^{16}}{1-\frac{1}{2} e^{-j 2 \pi 3 / 16}}\right)
$$
$$
\begin{aligned}
& F_{5}[k]=\frac{1}{N} \sum_{n=0}^{N-1}\left(\frac{1}{2}\right)^{n} e^{-j 2 \pi k n / N}=\frac{1}{N}\left(\frac{1-\left(\frac{1}{2}\right)^{N} e^{-j 2 \pi k N / N}}{1-\frac{1}{2} e^{-j 2 \pi k / N}}\right)=\frac{1}{N}\left(\frac{1-\left(\frac{1}{2}\right)^{N}}{1-\frac{1}{2} e^{-j 2 \pi k / N}}\right)
\end{aligned}
$$
Substituting $k=3$ and $N=16$ yields
$$
F_{5}[3]=\frac{1}{16}\left(\frac{1-\left(\frac{1}{2}\right)^{16}}{1-\frac{1}{2} e^{-j 2 \pi 3 / 16}}\right)
$$
87
18Mathematics18.3
Principles of Continuum Applied Mathematics
18.02, 18.03NoneProblem Set 1
Differentiation Within Integrals
8b0.2430555556Text
In each case compute $u_{x}=\frac{\partial u}{\partial x}$ and $u_{p}=\frac{\partial u}{\partial p}$ (as functions of $u, x$, and $p$ ), given that $u=u(x, p)$ satisfies: $u=\int_{0}^{x} \sin \left(p u\left(s^{2}, s\right)+x s\right) d s$.
Expression
Clearly $u_p=\int_0^x u\left(s^2, s\right) \cos \left(p u\left(s^2, s\right)+x s\right) d s$ and $u_x=\sin \left(p u\left(x^2, x\right)+x^2\right)+\int_0^x s \cos \left(p u\left(s^2, s\right)+x s\right) d s$.
In each case compute $u_{x}=\frac{\partial u}{\partial x}$ and $u_{p}=\frac{\partial u}{\partial p}$ (as functions of $u, x$, and $p$ ), given that $u=u(x, p)$ satisfies: $p=\int_{x}^{u} \cos \left(p \sin (s)+x s^{2}\right) d s$.
$1=\cos \left(p \sin (u)+x u^{2}\right) u_{p}-\int_{x}^{u} \sin (s) \sin \left(p \sin (s)+x s^{2}\right) d s$,
so that
$$
u_{p}=\frac{1+\int_{x}^{u} \sin (s) \sin \left(p \sin (s)+x s^{2}\right) d s}{\cos \left(p \sin (u)+x u^{2}\right)}
$$
$0=\cos \left(p \sin (u)+x u^{2}\right) u_{x}-\cos \left(p \sin (x)+x^{3}\right)-\int_{x}^{u} \sin \left(p \sin (s)+x s^{2}\right) s^{2} d s$,
so that
$$
u_{x}=\frac{\cos \left(p \sin (x)+x^{3}\right)+\int_{x}^{u} \sin \left(p \sin (s)+x s^{2}\right) s^{2} d s}{\cos \left(p \sin (u)+x u^{2}\right)} .
$$
In each case compute $u_{x}=\frac{\partial u}{\partial x}$ and $u_{p}=\frac{\partial u}{\partial p}$ (as functions of $u, x$, and $p$ ), given that $u=u(x, p)$ satisfies: $p=\int_{0}^{u} \exp \left(p \sin (s)+x s^{2}\right) d s$.
$1=\exp \left(p \sin (u)+x u^{2}\right) u_{p}+\int_{0}^{u} \exp \left(p \sin (s)+x s^{2}\right) \sin (s) d s$
so that
$$
u_{p}=\frac{1-\int_{0}^{u} \exp \left(p \sin (s)+x s^{2}\right) \sin (s) d s}{\exp \left(p \sin (u)+x u^{2}\right)}
$$
$0=\exp \left(p \sin (u)+x u^{2}\right) u_{x}+\int_{0}^{u} \exp \left(p \sin (s)+x s^{2}\right) s^{2} d s$
so that
$$
u_{x}=-\frac{\int_{0}^{u} \exp \left(p \sin (s)+x s^{2}\right) s^{2} d s}{\exp \left(p \sin (u)+x u^{2}\right)} .
$$
In each case compute $u_{x}=\frac{\partial u}{\partial x}$ and $u_{p}=\frac{\partial u}{\partial p}$ (as functions of $u, x$, and $p$ ), given that $u=u(x, p)$ satisfies: $p=\cos (x+u)$.
Upon taking partial derivatives with respect to $x$ and $p, p=\cos (x+u)$ yields:
$0=-\left(1+u_{x}\right) \sin (x+u) \quad$ and $\quad 1=-u_{p} \sin (x+u)$.
Thus: $u_{x}=-1 \quad$ and $\quad u_{p}=-\frac{1}{\sin (x+u)}$.
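For this last case the implicit differentiation can also be checked symbolically; a minimal sketch, assuming sympy is available:
import sympy as sp

x, p = sp.symbols('x p')
u = sp.Function('u')(x, p)
rhs = sp.cos(x + u)

# p = cos(x + u): differentiate in x (LHS gives 0) and in p (LHS gives 1).
ux = sp.solve(sp.Eq(0, sp.diff(rhs, x)), sp.diff(u, x))[0]
up = sp.solve(sp.Eq(1, sp.diff(rhs, p)), sp.diff(u, p))[0]
print(ux, up)   # -1 and -1/sin(x + u)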
88
331Mathematics18.01Calculus INoneNoneProblem Set 7Probability13b0.03167898627Text
Suppose we flip a coin three times. There are eight possible sequences of heads and tails that we could get.
As we discussed in class each of these sequences is equally likely. Each sequence occurs with probability 1/8. For instance, the chance of flipping $\mathrm{H}$ then $\mathrm{T}$ then $\mathrm{H}$ is $1 / 8$.
If we flip a coin three times, what is the probability of getting at least one $\mathrm{H}$?
Numerical
$\operatorname{Prob}($ at least one $\mathrm{H})=1-\operatorname{Prob}($ only $\mathrm{T})=1-\operatorname{Prob}(\mathrm{TTT})=1-\frac{1}{8}=\frac{7}{8}$.
Suppose we flip a coin three times. There are eight possible sequences of heads and tails that we could get.
If we flip a coin three times, what is the probability that all three flips are the same?
$\operatorname{Prob}($ All flips are the same $)=\operatorname{Prob}(\mathrm{TTT})+\operatorname{Prob}(\mathrm{HHH})=\frac{1}{8}+\frac{1}{8}=\frac{1}{4}$.
Now suppose we flip a coin four times.
What is the probability of getting $\mathrm{H}$ then $\mathrm{H}$ then $\mathrm{T}$ then $\mathrm{H}$?
$\operatorname{Prob}(\mathrm{HHTH})=2^{-4}=\frac{1}{16}$.
Suppose we flip a coin three times. There are eight possible sequences of heads and tails that we could get.
Write down all eight sequences of heads and tails in an organized way.
Counting in binary with T before H, the eight sequences are: TTT, TTH, THT, THH, HTT, HTH, HHT, HHH. (The original solution displays these as a tree diagram, with left branches for tails and right branches for heads.)
89
98Mathematics18.02Calculus II18.01NoneMidterm Exam 3
Directional Derivative
1c1.125Text
What is the directional derivative of the function $g(x, y, z)$ in the direction $\hat{i}+\hat{j}+\hat{k}$ at the point $(1,0,-1)$?
Expression
The unit vector in the direction $\hat{i}+\hat{j}+\hat{k}$ is $\vec{u}=1 / \sqrt{3}\langle 1,1,1\rangle$ hence $D_{\vec{u}} g(1,0,-1)=\nabla g(1,0,-1) \cdot \vec{u}=\langle-2,-2,2\rangle \cdot 1 / \sqrt{3}\langle 1,1,1\rangle=-2 / \sqrt{3}$.
Consider again $g(x, y, z)=x y+3 y z+2 x z$. In which direction is $g$ decreasing most rapidly at the point $(1,0,-1)$ ? Express your answer in the form $a \hat{i}+b \hat{j}+c \hat{k}$. (You do not need to normalize this vector.)
It should be in the opposite direction of the gradient. Since
$$
\left.\nabla g\right|_{(1,0,-1)}=\left.\langle y+2 z, x+3 z, 3 y+2 x\rangle\right|_{(1,0,-1)}=\langle-2,-2,2\rangle
$$
the direction is $2 \hat{i}+2 \hat{j}-2 \hat{k}$.
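Both the gradient and the directional derivative are easy to cross-check numerically; a central-difference sketch in numpy:
import numpy as np

def g(v):
    x, y, z = v
    return x * y + 3 * y * z + 2 * x * z

p0, h = np.array([1.0, 0.0, -1.0]), 1e-6
grad = np.array([(g(p0 + h * e) - g(p0 - h * e)) / (2 * h) for e in np.eye(3)])
u = np.ones(3) / np.sqrt(3)
print(grad, grad @ u)   # approximately [-2, -2, 2] and -2/sqrt(3)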
$\mathbf{F}(x, y, z)=\left(y+y^2 z\right) \mathbf{i}+(x-z+2 x y z) \mathbf{j}+\left(-y+x y^2\right) \mathbf{k}$. Show that $\mathbf{F}(x, y, z)$ is a gradient field using the derivative conditions.
We have $\mathbf{F}=\langle P, Q, R\rangle$, where $P=y+y^2 z, \quad Q=x-z+2 x y z, \quad R=-y+x y^2$.
$$
\frac{\partial P}{\partial z}=y^2=\frac{\partial R}{\partial x} ; \quad \frac{\partial Q}{\partial z}=-1+2 x y=\frac{\partial R}{\partial y} ; \quad \frac{\partial P}{\partial y}=1+2 y z=\frac{\partial Q}{\partial x} .
$$
Consider the function $g(x, y, z)=x y+3 y z+2 x z$ and consider its level surface $\{(x, y, z) \mid x y+3 y z+2 x z=9\}$. Find the equation of the tangent plane to this surface at the point $(0,1,3)$ in the form $a x+b y+c z=d$.
$$
7 x+9 y+3 z=18 \text {. }
$$
90
70Mathematics18.01Calculus INoneNoneProblem Set 2Integration10b0.03959873284Text
Suppose that a train is moving. It starts at time $t=0$ and ends at time $t=5$. At time $t$, its velocity is equal to $5-t$.
Approximately how far did the train go from time $t$ to time $t+\Delta t$?
Expression
From time $t$ to $t+\Delta t$ the velocity is approximately $5-t$. This goes on for an amount of time $\Delta t$. The distance traversed in this time is approximately $(5-t) \Delta t$.
Suppose that a train is moving. It starts at time $t=0$ and ends at time $t=5$. At time $t$, its velocity is equal to $5-t$.
Approximately how far did the train go from time $t=2$ to time $t=2.1$? Is it $5$, $3$, $.5$, $.3$, or $.1$?
From time 2 to $2.1$ the velocity is approximately $5-2=3$. This goes on for an amount of time $\Delta t=.1$. The distance traversed in this time is approximately $3 \Delta t=.3$.
Suppose that a train is moving. It starts at time $t=0$ and ends at time $t=5$. At time $t$, its velocity is equal to $5-t$.
Write down an integral for the total change in the position of the train.
As the times $0=t_{0}, t_{1}, \ldots, t_{n}=5$ get closer together $\left(\Delta t_{i}=t_{i+1}-t_{i} \rightarrow 0\right)$, the sum $\sum_{i=0}^{n-1}\left(5-t_{i}\right) \Delta t_{i}$ approximating the total distance converges to the integral:
$$
\int_{0}^{5}(5-t) d t .
$$
Suppose that a train is moving. It starts at time $t=0$ and ends at time $t=5$. At time $t$, its velocity is equal to $5-t$.
Compute the integral.
$$
\int_{0}^{5}(5-t) d t=\left.\left(5 t-\frac{t^{2}}{2}\right)\right|_{0} ^{5}=\frac{25}{2} .
$$
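One can watch the Riemann sums converge to this value; a short numpy sketch using left endpoints:
import numpy as np

for n in (10, 100, 1000):
    t = np.linspace(0, 5, n, endpoint=False)   # left endpoints, width 5/n
    print(n, ((5 - t) * (5.0 / n)).sum())      # 13.75, 12.625, 12.5125 -> 25/2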
91
391EECS6.39
Introduction to Machine Learning
6.1010/6.1210, 18.06/18.C06
NoneProblem Set 2Regression1fi0.01736111111Text
You are given the following data, where $d=1, n=4$.
$$
D=\{[[1], 2],[[2], 7],[[3],-3],[[4], 1]\}
$$
You want to use analytic linear regression to solve the problem.
Using the same data as in the previous question (with two added noise dimensions) but using ridge regression with $\lambda=1 \times 10^{-10}$, we get the following parameters:
$$
\theta=\left[\begin{array}{l}
-1.377711 \times 10^{0} \\
-2.574581 \times 10^{5} \\
-5.070563 \times 10^{4}
\end{array}\right], \theta_{0}=7.260269 \times 10^{0}
$$
Does this hypothesis have higher or lower training set error than the result without ridge regression?
Multiple ChoiceHigher.
You are given the following data, where $d=1, n=4$.
$$
D=\{[[1], 2],[[2], 7],[[3],-3],[[4], 1]\}
$$
You want to use analytic linear regression to solve the problem.
Using the same data as in the previous question (with two added noise dimensions) but using ridge regression with $\lambda=1 \times 10^{-10}$, we get the following parameters:
$$
\theta=\left[\begin{array}{l}
-1.377711 \times 10^{0} \\
-2.574581 \times 10^{5} \\
-5.070563 \times 10^{4}
\end{array}\right], \theta_{0}=7.260269 \times 10^{0}
$$
Does this hypothesis have higher or lower testing set error than the result without ridge regression?
Lower.
You are given the following data, where $d=1, n=4$.
$$
D=\{[[1], 2],[[2], 7],[[3],-3],[[4], 1]\}
$$
You want to use analytic linear regression to solve the problem.
Now, if we change the data to have a small amount of noise so that
$$
D=\{[[1, \epsilon, \epsilon], 2],[[2, \epsilon, \epsilon], 7],[[3, \epsilon, \epsilon],-3],[[4, \epsilon, \epsilon], 1]\}
$$
where the $\epsilon$ values are all different, randomly chosen independently in the range $\left(-1 \times 10^{-5},+1 \times 10^{-5}\right)$, we get the following parameters:
$$
\theta=\left[\begin{array}{c}
4.70697 \times 10^{0} \\
-1.28107 \times 10^{6} \\
1.10581 \times 10^{7}
\end{array}\right], \theta_{0}=-1.03437 \times 10^{2}
$$
This hypothesis has the following MSE on the training data: $1.6970 \times 10^{-24}$.
Consider a "testing" set, which is very similar to the training set:
$$
D=\{[[1,0,0], 2],[[2,0,0], 7],[[3,0,0],-3],[[4,0,0], 1]\} .
$$
What's the MSE of th, th0 on the testing set?
8782.9
You are given the following data, where $d=1, n=4$.
$$
D=\{[[1], 2],[[2], 7],[[3],-3],[[4], 1]\}
$$
You want to use analytic linear regression to solve the problem.
What is the MSE of the hypothesis you found on the data (any answer within the right order of magnitude will be fine)?
10.575
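This figure can be reproduced in a few lines; a minimal numpy sketch that appends a row of ones to the data so the offset $\theta_{0}$ is learned jointly:
import numpy as np

X = np.array([[1.0, 2.0, 3.0, 4.0]])        # d x n, with d = 1
Y = np.array([[2.0, 7.0, -3.0, 1.0]])       # 1 x n
Xa = np.vstack([X, np.ones((1, 4))])        # ones row for the offset

th = np.linalg.inv(Xa @ Xa.T) @ Xa @ Y.T    # analytic least squares
print(th.ravel())                           # slope -1.3, offset 5.0
print(np.mean((th.T @ Xa - Y) ** 2))        # 10.575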
92
423EECS6.39
Introduction to Machine Learning
6.1010/6.1210, 18.06/18.C06
NoneProblem Set 2Regression6b0.05208333333Text
We will now try to synthesize what we've learned in order to perform ridge regression on the DataCommons public health dataset that we explored in Lab 2. Unlike in Lab 2, where we did some simple linear regressions, here we now employ and explore regularization, with the goal of building a model which generalizes better (than without regularization) to unseen data.
The overall objective function for ridge regression is
$$
J_{\text {ridge }}\left(\theta, \theta_{0}\right)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}+\theta_{0}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
Remarkably, there is an analytical function giving $\Theta=\left(\theta, \theta_{0}\right)$ which minimizes this objective, given $X, Y$, and $\lambda$. But how should we choose $\lambda$?
To choose an optimum $\lambda$, we can use the following approach. Each particular value of $\lambda$ gives us a different linear regression model. And we want the best model: one which balances providing good predictions (fitting well to given training data) with generalizing well (avoiding overfitting training data). And as we saw in the notes on Regression, we can employ cross-validation to evaluate and compare different models.
Let us begin by implementing this algorithm for cross-validation:
CROSS-VALIDATE $(\mathcal{D}, k)$
1 divide $\mathcal{D}$ into $k$ chunks $\mathcal{D}_{1}, \mathcal{D}_{2}, \ldots \mathcal{D}_{k}$ (of roughly equal size)
2 for $i=1$ to $k$
3 train $h_{i}$ on $\mathcal{D} \backslash \mathcal{D}_{i}$ (withholding chunk $\mathcal{D}_{i}$ )
4 compute "test" error $\mathcal{E}_{i}\left(h_{i}\right)$ on withheld data $\mathcal{D}_{i}$
5 return $\frac{1}{k} \sum_{i=1}^{k} \mathcal{E}_{i}\left(h_{i}\right)$
Below, X and Y are sample data, and lams is a list of possible values of lambda. Write code to set errors as a list of corresponding cross-validation errors. Use the cross_validate function above to run cross-validation with three splits. Use the following functions (which we implement for you, per the specifications below) as the learning algorithm and loss function:
def ridge_analytic(X_train, Y_train, lam):
'''Applies analytic ridge regression on the given training data.
Returns th, th0.
X : d x n numpy array (d = # features, n = # data points)
Y : 1 x n numpy array
lam : (float) regularization strength parameter
th : d x 1 numpy array
th0 : 1 x 1 numpy array'''
def mse(x, y, th, th0):
'''Calculates the mean-squared loss of a linear regression.
Returns a scalar.
x : d x n numpy array
y : 1 x n numpy array
th : d x 1 numpy array
th0 : 1 x 1 numpy array'''
X = np.array([[4, 6, 8, 2, 9, 10, 11, 17],
[1, 1, 6, 0, 5, 8, 7, 9],
[2, 2, 2, 6, 7, 4, 9, 8],
[1, 2, 3, 4, 5, 6, 7, 8]])
Y = np.array([[1, 3, 3, 4, 7, 6, 7, 7]])
lams = [0, 0.01, 0.02, 0.1]
errors = [] # your code here
Programming
X = np.array([[4, 6, 8, 2, 9, 10, 11, 17],
[1, 1, 6, 0, 5, 8, 7, 9],
[2, 2, 2, 6, 7, 4, 9, 8],
[1, 2, 3, 4, 5, 6, 7, 8]])
Y = np.array([[1, 3, 3, 4, 7, 6, 7, 7]])
lams = [0, 0.01, 0.02, 0.1]
errors = [cross_validate(X, Y, 3, lam, ridge_analytic, mse) for lam in lams]
We will now try to synthesize what we've learned in order to perform ridge regression on the DataCommons public health dataset that we explored in Lab 2. Unlike in Lab 2, where we did some simple linear regressions, here we now employ and explore regularization, with the goal of building a model which generalizes better (than without regularization) to unseen data.
The overall objective function for ridge regression is
$$
J_{\text {ridge }}\left(\theta, \theta_{0}\right)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}+\theta_{0}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
Remarkably, there is an analytical function giving $\Theta=\left(\theta, \theta_{0}\right)$ which minimizes this objective, given $X, Y$, and $\lambda$. But how should we choose $\lambda$?
To choose an optimum $\lambda$, we can use the following approach. Each particular value of $\lambda$ gives us a different linear regression model. And we want the best model: one which balances providing good predictions (fitting well to given training data) with generalizing well (avoiding overfitting training data). And as we saw in the notes on Regression, we can employ cross-validation to evaluate and compare different models.
Let us begin by implementing this algorithm for cross-validation:
CROSS-VALIDATE $(\mathcal{D}, k)$
1 divide $\mathcal{D}$ into $k$ chunks $\mathcal{D}_{1}, \mathcal{D}_{2}, \ldots \mathcal{D}_{k}$ (of roughly equal size)
2 for $i=1$ to $k$
3 train $h_{i}$ on $\mathcal{D} \backslash \mathcal{D}_{i}$ (withholding chunk $\mathcal{D}_{i}$ )
4 compute "test" error $\mathcal{E}_{i}\left(h_{i}\right)$ on withheld data $\mathcal{D}_{i}$
5 return $\frac{1}{k} \sum_{i=1}^{k} \mathcal{E}_{i}\left(h_{i}\right)$
Let's implement the cross-validation algorithm as the procedure cross_validate, which takes the following input arguments:
\begin{itemize}
\item $X$: the list of data points $(d \times n)$
\item $Y$: the true values of the responders $(1 \times n)$
\item n_splits: the number of chunks to divide the dataset into
\item lam: the regularization parameter
\item learning_algorithm: a function that takes $X$, $Y$, and lam, and returns th, th0
\item loss_function: a function that takes $X$, $Y$, th, and th0, and returns a $1 \times 1$ array
\end{itemize}
cross_validate should return a scalar, the cross-validation error of applying the learning algorithm on the list of data points.
Note that this is a generic version of cross-validation, that can be applied to any learning algorithm and any loss function. Later in this problem, we will use cross-validation specifically for ridge regression and mean square loss.
You have the following function available to you:
def make_splits(X, Y, n_splits):
'''
Splits the dataset into n_split chunks, creating n_split sets of
cross-validation data.
Returns a list of n_split tuples (X_train, Y_train, X_test, Y_test).
For the ith returned tuple:
*X_train and Y_train include all data except the ith chunk, and
* X_test and Y_test are the ith chunk.
X : d x n numpy array (d = #features, n = #data points)
Y : 1 x n numpy array
n_splits : integer
'''
def cross_validate(X, Y, n_splits, lam,
learning_algorithm, loss_function):
pass
def cross_validate(X, Y, n_splits, lam,
learning_algorithm, loss_function):
test_errors = []
for (X_train, Y_train, X_test, Y_test) in make_splits(X, Y, n_splits):
th, th0 = learning_algorithm(X_train, Y_train, lam)
test_errors.append(loss_function(X_test, Y_test, th, th0))
return np.array(test_errors).mean()
We are interested in performing ordinary least squares regression given data $X, Y$ to find parameters $\theta, \theta_{0}$ that minimize the mean squared error objective:
$$
J\left(\theta, \theta_{0}\right)=\frac{1}{n} \sum_{i=1}^{n} L_{s}\left(x^{(i)}, y^{(i)} ; \theta, \theta_{0}\right)
$$
where the squared loss is
$$
L_{s}\left(x^{(i)}, y^{(i)} ; \theta, \theta_{0}\right)=\left(\hat{y}^{(i)}-y^{(i)}\right)^{2}=\left(\theta^{T} x^{(i)}+\theta_{0}-y^{(i)}\right)^{2}
$$
Note that the $L_{s}\left(x^{(i)}, y^{(i)} ; \theta, \theta_{0}\right)$ notation here is used to emphasize that the loss depends on both the data sample $\left(x^{(i)}, y^{(i)}\right)$, and the parameters, $\theta$ and $\theta_{0}$. Compared to the $\mathcal{L}_{s}$ notation as used in some part of the notes and the lab, we see that $L_{s}\left(x^{(i)}, y^{(i)} ; \theta, \theta_{0}\right)=\mathcal{L}_{s}\left(h\left(x^{(i)} ; \theta, \theta_{0}\right), y^{(i)}\right)$, where $h\left(x^{(i)} ; \theta, \theta_{0}\right)=\theta^{T} x^{(i)}+\theta_{0}$.
Now implement $\theta^{*}$ as found in the previous problem, using symbols $\mathrm{X}$ and $\mathrm{Y}$ for the data matrix and outputs, and np.dot(or @ shorthand), np.transpose (or .T shorthand), np.linalg.inv.
# Enter an expression to compute and set th to the optimal theta
th = None
th = np.dot(np.linalg.inv(np.dot(X, X.T)), np.dot(X, Y.T))
We are beginning our study of machine learning with linear regression which is a fundamental problem in supervised learning. Please study Sections $2.1$ through $2.4$ of the Chapter 2 - Regression lecture notes before starting in on these problems.
A hypothesis in linear regression has the form
$$
y=\theta^{T} x+\theta_{0}
$$
where $x$ is a $d \times 1$ input vector, $y$ is a scalar output prediction, $\theta$ is a $d \times 1$ parameter vector and $\theta_{0}$ is a scalar offset parameter.
This week, just to get warmed up, we will consider a simple algorithm for trying to find a hypothesis that fits the data well: we will generate a lot of random hypotheses and see which one has the smallest error on this data, and return that one as our answer. (We don't recommend this method in actual practice, but it gets us started and makes some useful points.)
Use the mse and lin_reg_predict procedures to implement a procedure that takes
\begin{itemize}
\item $\mathrm{X}: d \times n$ input array representing $n$ points in $d$ dimensions
\item Y: $1 \times n$ output vector representing output values for $n$ points
\item th: parameter vector $d \times 1$
\item th0: offset $1 \times 1$ (or scalar)
\end{itemize}
and returns
\begin{itemize}
\item $1 \times 1$ (or scalar) value representing the MSE of hypothesis th, th0 on the data set $X, Y$.
\item Read about the axis argument to np.mean.
\end{itemize}
import numpy as np
def lin_reg_err(X, Y, th, th0):
pass
import numpy as np
def lin_reg_err(X, Y, th, th0):
return mse(Y, lin_reg_predict(X, th, th0))
93
498EECS6.39
Introduction to Machine Learning
6.1010/6.1210, 18.06/18.C06
NoneProblem Set 4
Logistic Regression
4ci0.01275510204Text
Our eventual goal is to do gradient descent on the logistic regression objective $J_{\text {nll }}$. In this problem, we'll take the first step toward deriving that gradient update. We'll focus on the gradient of the loss at a single point with respect to parameters $\theta$ and $\theta_{0}$.
What is the maximum value of $\frac{\partial \sigma(z)}{\partial z}$?
Numerical0.25.
Our eventual goal is to do gradient descent on the logistic regression objective $J_{\text {nll }}$. In this problem, we'll take the first step toward deriving that gradient update. We'll focus on the gradient of the loss at a single point with respect to parameters $\theta$ and $\theta_{0}$.
What is the largest number that is always less than any actual value of $\frac{\partial \sigma(z)}{\partial z}$?
0
Our eventual goal is to do gradient descent on the logistic regression objective $J_{\text {nll }}$. In this problem, we'll take the first step toward deriving that gradient update. We'll focus on the gradient of the loss at a single point with respect to parameters $\theta$ and $\theta_{0}$.
What is an expression for the derivative of the sigmoid function $\sigma(z)=\frac{1}{1+e^{-z}}$ with respect to $z$, expressed as a function of $z$, its input? Enter a Python expression (use ** for exponentiation) involving e and $z$.
Solution 1: e**(-z)/(1 + e**(-z))**2
Solution 2: (1/(1+e**(-z)))*(1-(1/(1+e**(-z))))
Our eventual goal is to do gradient descent on the logistic regression objective $J_{\text {nll }}$. In this problem, we'll take the first step toward deriving that gradient update. We'll focus on the gradient of the loss at a single point with respect to parameters $\theta$ and $\theta_{0}$.
What is an expression for the derivative of the sigmoid with respect to $z$, but this time expressed as a function of $o=\sigma(z)=\frac{1}{1+e^{-z}}$ ? (It's beautifully simple!)
Hint: Think about the expression $1-\frac{1}{1+e^{-z}}$. (Here is a review of computing derivatives.)
Enter a Python expression (use ** for exponentiation) involving only $o$. e and $z$ are not allowed, and remember $o=\sigma(z)$.
o*(1-o)
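A short numerical sketch confirms both the $o(1-o)$ identity and the maximum value of $0.25$ at $z=0$:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 1001)
o = sigmoid(z)
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6   # central differences
assert np.allclose(numeric, o * (1 - o))
print(numeric.max())                                       # about 0.25, at z = 0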
94
317Mathematics18.01Calculus INoneNoneProblem Set 7
Second Order Differential Equations
9a0.04751847941TextCheck that the equation $x^{\prime}(t)=-x(t)$ is linear.Open
Suppose functions $x(t)$ and $z(t)$ are solutions to $x^{\prime}(t)=-x(t)$, and let $s(t)=x(t)+z(t)$ be their sum. Since $x$ and $z$ are solutions we have
$$
\begin{aligned}
& x^{\prime}(t)=-x(t), \\
& z^{\prime}(t)=-z(t)
\end{aligned}
$$
Adding the two equations we have
$$
s^{\prime}(t)=x^{\prime}(t)+z^{\prime}(t)=-x(t)-z(t)=-s(t)
$$
So $s(t)$ also solves the differential equation. And for any constant $A, A x(t)$ also solves the differential equation since
$$
x^{\prime}(t)=-x(t) \quad \Rightarrow \quad A x^{\prime}(t)=A \cdot(-x(t))=-(A x(t))
$$
If $C$ is a constant, then check that the equation $x^{\prime \prime}(t)=-C x(t)$ is linear.
Suppose functions $x(t)$ and $z(t)$ are solutions to $x^{\prime \prime}(t)=-C x(t)$, and let $s(t)=x(t)+z(t)$ be their sum. The fact that $x$ and $z$ are solutions says that
$$
\begin{aligned}
& x^{\prime \prime}(t)=-C x(t), \\
& z^{\prime \prime}(t)=-C z(t) .
\end{aligned}
$$
Adding the two equations yields
$$
s^{\prime \prime}(t)=x^{\prime \prime}(t)+z^{\prime \prime}(t)=-C x(t)-C z(t)=-C s(t),
$$
So function $s$ solves the differential equation. Furthermore,
$$
x^{\prime \prime}(t)=-C x(t) \Rightarrow A x^{\prime \prime}(t)=A \cdot\left(-C x(t)\right)=-C \cdot(A x(t)),
$$
so $A x$ solves the differential equation.
Consider the equation $x^{\prime \prime}(t)=-x(t)$.
Check that $\sin t$ and $\cos t$ obey this equation.
For $x(t)=\sin t, x^{\prime \prime}=-\sin t$, which indeed is $-x$. Similarly, for $x(t)=\cos t$, $x^{\prime \prime}=-\cos t$, which again is $-x$.
Is the equation $x^{\prime \prime}(t)=x(t)-2$ linear? Explain your reasoning.
If the equation were linear, then constant multiples of solutions would again be solutions; in particular, the zero function $z_{0}(t)=0$ would have to be a solution. But for the zero function, the left-hand side $z_{0}^{\prime \prime}(t)=0$ does not equal the right-hand side $z_{0}(t)-2=-2$. So the differential equation is not linear.
95
220EECS6.411
Representation, Inference, and Reasoning in AI
6.1010, 6.1210, 18.600
NoneProblem Set 6Particle Filter5c0.25Text
Consider a domain in which the forward transition dynamics are "hybrid" in the sense that
$$
P\left(X_{t}=x_{t} \mid X_{t-1}=x_{t-1}\right)=p * N\left(x_{t-1}+1,0.1\right)\left(x_{t}\right)+(1-p) * N\left(x_{t-1}-1,0.1\right)\left(x_{t}\right)
$$
that is, that the state will hop forward one unit in expectation with probability $p$, or backward one unit in expectation with probability $1-p$, with variance $0.1$ in each case.
Assume additionally that the observation model $P\left(Y_{t}=y_{t} \mid X_{t}=x_{t}\right)=\operatorname{Uniform}\left(x_{t}-1, x_{t}+1\right)\left(y_{t}\right)$.
You know the initial state of the system $X_{0}=0$. Your friend Norm thinks it's fine to initialize a particle filter with a single particle at 0. What do you think?
(a) This is fine and we can continue with a single particle.
(b) We should initialize our pf with $N$ copies of this particle.
Multiple Choice(b) We should initialize our pf with $N$ copies of this particle.
Consider a domain in which the forward transition dynamics are "hybrid" in the sense that
$$
P\left(X_{t}=x_{t} \mid X_{t-1}=x_{t-1}\right)=p * N\left(x_{t-1}+1,0.1\right)\left(x_{t}\right)+(1-p) * N\left(x_{t-1}-1,0.1\right)\left(x_{t}\right)
$$
that is, that the state will hop forward one unit in expectation with probability $p$, or backward one unit in expectation with probability $1-p$, with variance $0.1$ in each case.
Assume additionally that the observation model $P\left(Y_{t}=y_{t} \mid X_{t}=x_{t}\right)=\operatorname{Uniform}\left(x_{t}-1, x_{t}+1\right)\left(y_{t}\right)$.
Norm runs the filter for two steps with no observations several times and is trying to decide whether there could be bugs in the code. Assuming $p=0.5$, for each of the following sets of particles, indicate whether it is (a) fairly likely (b) quite unlikely (c) completely impossible: {-2.01, -1.9, -1.0, 0.1, 0, 2.1}.
b.
Consider a domain in which the forward transition dynamics are "hybrid" in the sense that
$$
P\left(X_{t}=x_{t} \mid X_{t-1}=x_{t-1}\right)=p * N\left(x_{t-1}+1,0.1\right)\left(x_{t}\right)+(1-p) * N\left(x_{t-1}-1,0.1\right)\left(x_{t}\right)
$$
that is, that the state will hop forward one unit in expectation with probability $p$, or backward one unit in expectation with probability $1-p$, with variance $0.1$ in each case.
Assume additionally that the observation model $P\left(Y_{t}=y_{t} \mid X_{t}=x_{t}\right)=\operatorname{Uniform}\left(x_{t}-1, x_{t}+1\right)\left(y_{t}\right)$.
Norm runs the filter for two steps with no observations several times and is trying to decide whether there could be bugs in the code. Assuming $p=0.5$, for each of the following sets of particles, indicate whether it is (a) fairly likely (b) quite unlikely (c) completely impossible: {-2.05, -1.95, -0.1, 0.1, 1.9, 2.1}.
a.
Consider a domain in which the forward transition dynamics are "hybrid" in the sense that
$$
P\left(X_{t}=x_{t} \mid X_{t-1}=x_{t-1}\right)=p * N\left(x_{t-1}+1,0.1\right)\left(x_{t}\right)+(1-p) * N\left(x_{t-1}-1,0.1\right)\left(x_{t}\right)
$$
that is, that the state will hop forward one unit in expectation with probability $p$, or backward one unit in expectation with probability $1-p$, with variance $0.1$ in each case.
Assume additionally that the observation model $P\left(Y_{t}=y_{t} \mid X_{t}=x_{t}\right)=\operatorname{Uniform}\left(x_{t}-1, x_{t}+1\right)\left(y_{t}\right)$.
Norm runs the filter for two steps with no observations several times and is trying to decide whether there could be bugs in the code. Assuming $p=0.5$, for each of the following sets of particles, indicate whether it is (a) fairly likely (b) quite unlikely (c) completely impossible: {-20, -2.01, -2.001, .01, .001, 1.99, 1.999}.
b.
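These judgments are easy to check by simulation; a sketch that propagates $N$ particles two steps under the hybrid dynamics with $p=0.5$ (the seed and particle count are arbitrary):
import numpy as np

rng = np.random.default_rng(0)
p, n_particles = 0.5, 7

x = np.zeros(n_particles)                    # N copies of the known X_0 = 0
for _ in range(2):                           # two transition steps, no observations
    hop = np.where(rng.random(n_particles) < p, 1.0, -1.0)
    x = x + hop + rng.normal(0.0, np.sqrt(0.1), n_particles)

print(np.sort(x))   # clusters near -2, 0, 2; per-mode std is sqrt(0.2), about 0.45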
96
103EECS6.18
Computer Systems Engineering
6.1010, 6.1910NoneMidterm Exam 2Meltdown13a0.3Text
A colleague is proposing mitigation strategies for Meltdown in an architecture they are designing. For each proposal, choose the BEST response.
"We could eliminate speculative execution."
(a) No Meltdown attacks would be affected by that change.
(b) That could help against Meltdown, but cost too much performance.
(c) That is a mitigating technique recommended in the published paper.
(d) That would fix the problem, and eliminate stack smashing too.
Multiple Choice(b).
A colleague is proposing mitigation strategies for Meltdown in an architecture they are designing. For each proposal, choose the BEST response.
"We could eliminate caching."
(a) No Meltdown attacks would be affected by that change.
(b) That could help against Meltdown, but cost too much performance.
(c) That is a mitigating technique recommended in the published paper.
(d) That would fix the problem, and eliminate stack smashing too.
(b).
A colleague is proposing mitigation strategies for Meltdown in an architecture they are designing. For each proposal, choose the BEST response.
"We could change the operating system to map in less of the kernel's address space."
(a) No Meltdown attacks would be affected by that change.
(b) That could help against Meltdown, but cost too much performance.
(c) That is a mitigating technique recommended in the published paper.
(d) That would fix the problem, and eliminate stack smashing too.
(c).
A colleague is proposing mitigation strategies for Meltdown in an architecture they are designing. For each proposal, choose the BEST response.
"We could add a mode that splits the address space."
(a) No Meltdown attacks would be affected by that change.
(b) That could help against Meltdown, but cost too much performance.
(c) That is a mitigating technique recommended in the published paper.
(d) That would fix the problem, and eliminate stack smashing too.
(c).
97
14Mathematics18.701Algebra I18.100BNoneProblem Set 2Cyclic Groups5nan0.5Text
Prove that the two matrices
$$
E=\left[\begin{array}{ll}
1 & 1 \\
0 & 1
\end{array}\right], E^{\prime}=\left[\begin{array}{ll}
1 & 0 \\
1 & 1
\end{array}\right]
$$
generate the group $S L_2(\mathbb{Z})$ of all integer matrices with determinant 1. Remember that the subgroup they generate consists of all elements that can be expressed as products using the four elements $E, E^{\prime}, E^{-1}, E^{\prime-1}$.
Hint: Do not try to write a matrix directly as a product of the generators. Use row reduction.
Open
It is hard to use the fact that $S L_{2}(\mathbb{R})$ is generated by elementary matrices of the first type here. One has to start over.
As always, the method is to reduce a matrix $A$ in $S L_{2}(\mathbb{Z})$ to the identity using the given elementary matrices $E$ and $E^{\prime}$ and their inverses. What multiplication by a power of $E$ or $E^{\prime}$ does to a matrix $A$ is add a (positive or negative) integer multiple of one row to the other.
Let's work on the first column of
$$
A=\left(\begin{array}{ll}
a & b \\
c & d
\end{array}\right)
$$
using division with remainder. Also, let's denote the entries of any one of the matrices that we get along the way by $a, b, c, d$. We don't need to change notation at each step.
Note first that because $\operatorname{det} A=1$, the entries $a$ and $c$ of the first column can't both be zero.
Step 1: We make one of the entries $a$ or $c$ of the first column positive. If $c \neq 0$, we add a large positive or negative integer multiple of the second row to the first to make $a>0$. If $c=0$, then $a \neq 0$. In this case we do the analogous thing to make $c>0$.
Step 2: If $a>0$, we divide $c$ by $a$, writing $c=a q+r$ where $q$ and $r$ are integers and $0 \leq r<a$. Then we add $-q$ (row 1) to row 2. This replaces $c$ by $r$. We change notation, writing $c$ for $r$ in the new matrix, and $d$ for the other entry of row 2. Now $0 \leq c<a$. If $c=0$, we stop.
Step 3: If $c>0$, we divide $a$ by $c$: $a=c q^{\prime}+r^{\prime}$, where $0 \leq r^{\prime}<c$. We add $q^{\prime}$ (row 2) to row 1, which changes $a$ to $r^{\prime}$. We adjust notation, writing $a$ for $r^{\prime}$. If $a=0$ we stop. If $a>0$, we go back to Step 2.
Since the entries of the first column decrease at each step, the process must stop at some point, with either $c=0$ or $a=0$. Then since $\operatorname{det} A=a d-b c=1$, the nonzero entry of the first column must be 1.
Step 4: If the entry 1 of the first column is the "$c$" entry, we add (row 2) to (row 1) to get $a=c=1$. Then we subtract (row 1) from (row 2) to get $a=1, c=0$.
Step 5: The matrix is now $A=\left(\begin{array}{ll}1 & b \\ 0 & d\end{array}\right)$. Since $\operatorname{det} A=1$, $d=1$. We subtract $b$ (row 2) from (row 1) to get the identity matrix.
Prove that the elementary matrices of the first type generate $S L_n(\mathbb{R})$. Do the $2 \times 2$ case first.
Let's do the $2 \times 2$ case. Let $A$ be a matrix $\left(\begin{array}{ll}a & b \\ c & d\end{array}\right)$ with determinant equal to 1. We must show that $A$ can be reduced to the identity using the first type of elementary row operations. In describing this process, we'll use the symbols $a, b, c, d$ to denote the matrix entries in all of the matrices we get as we go along. Our end result should be $a=d=1$, $b=c=0$.
If $c=0$, then $a$ can't be zero; in that case, we add row 1 to row 2 so that $c \neq 0$. Next, since $c \neq 0$ in our new matrix, we can add a multiple of row 2 to row 1 to change $a$ to 1. Then we add a multiple of row 1 to row 2 to change $c$ to 0. The new matrix has $a=1$ and $c=0$. Elementary operations of the first type don't change the determinant, so the determinant of the new matrix with $a=1$ and $c=0$ is still equal to 1. Therefore $d=1$ in this matrix, and one further row operation reduces the matrix to the identity.
The group $S L_n(\mathbb{R})$ is generated by elementary matrices of the first type (see Exercise 4.8). Use this fact to prove that $S L_n(\mathbb{R})$ is path-connected.
We know from a previous assignment that $S L_{n}$ is generated by elementary matrices of the first type: $E=I+a e_{i j}$. Such a matrix is connected to the identity by the path $E_{t}=I+a t e_{i j}$ in $S L_{n}$. Then $A$ connects to $E A$ by the path $E_{t} A$. The relation $A \sim B$ defined in Problem M.6 is an equivalence relation, so any two elements of $S L_{n}$ can be connected by a path.
One of the black boxes we used in class was the theorem that an $n \times n$ matrix $A$ has $A \vec{v}=0$ for some non-zero vector $\vec{v} \in \mathbb{R}^{n}$ (or $\mathbb{C}^{n}$) if and only if $\operatorname{det}(A)=0$ (see, e.g., MITx 20.7). The goal of this problem is to work out why this is true (at least in the case of $3 \times 3$ matrices). The only black box we will use is the properties of the determinant. Recall that $\operatorname{dim} \operatorname{Ker}(A)=0$ means that $\operatorname{Ker}(A)$ contains only the zero vector.
(Story time begins) The way we are going to go about showing that a $3 \times 3$ matrix has $\operatorname{det} A=0$ if and only if $\operatorname{dim} \operatorname{Ker}(A)>0$ is by using Gaussian elimination to reduce the statement to the case of upper triangular (or rather, RREF) matrices. So, as a first step, we're going to check that the theorem is true for this model case. (Story time ends)
(Story time begins, again) At this point we can make the following conclusion: If $A$ is a $3 \times 3$ matrix in reduced row echelon form, then $\operatorname{dim} \operatorname{Ker}(A)>0$ if and only if $\operatorname{det} A=0$. The next step is to turn this into a statement about general $3 \times 3$ matrices using Gauss-Jordan elimination. (Story time ends, again)
Recall that the elementary row operations consist of swapping rows, multiplying a row by a non-zero constant, and adding rows. Find matrices $E_{i}$ such that each row operation corresponds to multiplication on the left by one of the $E_{i}$. For example, multiplying the top row of $A$ by a non-zero constant $c$ corresponds to
$$
\left(\begin{array}{lll}
c & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{array}\right) A
$$
We call the matrices $E_{i}$ elementary matrices. (Hint: including the one given to you above, there should be 12 such matrices. Once you recognize the pattern you should be able to write them all down fairly quickly.)
You can do the same elementary row operation to the identity matrix to find the corresponding elementary matrix.
The twelve elementary matrices $E_i$ are
\begin{enumerate}[label=(\roman*)]
\item Swapping the first row of $A$ with the second row of $A$ corresponds to
\begin{center}
\boxed{E_1 = \begin{pmatrix}
0 & 1 & 0 \\
1 & 0 & 0 \\
0 & 0 & 1
\end{pmatrix}}
\end{center}
\item Swapping the first row of $A$ with the third row of $A$ corresponds to
\begin{center}
\boxed{E_2 = \begin{pmatrix}
0 & 0 & 1 \\
0 & 1 & 0 \\
1 & 0 & 0
\end{pmatrix}}
\end{center}
\item Swapping the second row of $A$ with the third row of $A$ corresponds to
\begin{center}
\boxed{E_3 = \begin{pmatrix}
1 & 0 & 0 \\
0 & 0 & 1 \\
0 & 1 & 0
\end{pmatrix}}
\end{center}
\item Multiplying the top row of $A$ by a non-zero constant $c$ corresponds to
\begin{center}
\boxed{E_4 = \begin{pmatrix}
c & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix}}
\end{center}
\item Multiplying the middle row of $A$ by a non-zero constant $c$ corresponds to
\begin{center}
\boxed{E_5 = \begin{pmatrix}
1 & 0 & 0 \\
0 & c & 0 \\
0 & 0 & 1
\end{pmatrix}}
\end{center}
\item Multiplying the bottom row of $A$ by a non-zero constant $c$ corresponds to
\begin{center}
\boxed{E_6 = \begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & c
\end{pmatrix}}
\end{center}
\item Adding the second row of $A$ to the first row of $A$ corresponds to
\begin{center}
\boxed{E_7 = \begin{pmatrix}
1 & 1 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix}}
\end{center}
\item Adding the third row of $A$ to the first row of $A$ corresponds to
\begin{center}
\boxed{E_8 = \begin{pmatrix}
1 & 0 & 1 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix}}
\end{center}
\item Adding the first row of $A$ to the second row of $A$ corresponds to
\begin{center}
\boxed{E_9 = \begin{pmatrix}
1 & 0 & 0 \\
1 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix}}
\end{center}
\item Adding the third row of $A$ to the second row of $A$ corresponds to
\begin{center}
\boxed{E_{10} = \begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 1 \\
0 & 0 & 1
\end{pmatrix}}
\end{center}
\item Adding the first row of $A$ to the third row of $A$ corresponds to
\begin{center}
\boxed{E_{11} = \begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
1 & 0 & 1
\end{pmatrix}}
\end{center}
\item Adding the second row of $A$ to the third row of $A$ corresponds to
\begin{center}
\boxed{E_{12} = \begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 1 & 1
\end{pmatrix}}
\end{center}
\end{enumerate}
98
134EECS6.121
Introduction to Algorithms
6.100A, 6.12006.101Midterm Exam 2
Weighted Directed Graph
2b2Text
We are given a directed weighted graph $G$ with both positive and negative weights. Provide an algorithm that returns a directed cycle of positive weight in $G$, or report that none exists. Correct algorithms with better running times will receive more credit. You can invoke any algorithm discussed in lecture, recitations or p-sets.
Open
Create a graph $G^{\prime}$ with the same vertices and edges, but with the edge weights negated. Add a super node $s$ with directed edges with weight 0 to all original nodes. Run Bellman-Ford on $G^{\prime}$ starting from $s$ to either find a negative-weight cycle (by following parent pointers from a witness), or report that none exists. A negative-weight cycle in $G^{\prime}$ is a positive-weight cycle in $G$. This takes $O(|V| \cdot|E|)$ time.
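A Python sketch of this reduction, assuming vertices are labeled $0, \ldots, n-1$ and edges are given as (u, v, w) triples (the function name is illustrative):
from math import inf

def positive_cycle(n, edges):
    # Negate weights; add a super source n with 0-weight edges to every vertex.
    neg = [(u, v, -w) for (u, v, w) in edges] + [(n, v, 0.0) for v in range(n)]
    dist = [inf] * n + [0.0]
    parent = [None] * (n + 1)
    for _ in range(n):                       # |V| - 1 = n relaxation rounds
        for u, v, w in neg:
            if dist[u] + w < dist[v]:
                dist[v], parent[v] = dist[u] + w, u
    witness = None
    for u, v, w in neg:                      # still relaxable => negative cycle
        if dist[u] + w < dist[v]:
            parent[v], witness = u, v
            break
    if witness is None:
        return None                          # no positive-weight cycle in G
    for _ in range(n + 1):                   # walk back |V| steps onto the cycle
        witness = parent[witness]
    cycle, v = [witness], parent[witness]
    while v != witness:
        cycle.append(v)
        v = parent[v]
    return cycle[::-1]                       # a positive-weight cycle of G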
Your friend Mash Coney was shopping online and noticed that someone was selling 3 TVs in exchange for 5 laptops. She saw another deal, selling 10 shirts for $1 \mathrm{TV}$, and another one selling 5 laptops for 3 shirts.
Mash worked out the math and realized that if she invests 5 laptops, she can exchange it for 3 TVs, exchange those for 30 shirts, to finally get 50 laptops. She just $10 \times$ 'ed her laptops! Notice that this sort of opportunity would not exist if the second deal was 1 shirt for 1 TV.
Sadly, Mash can only handle the business side, and reaches out to you, a $6.006$ student, to code the algorithm to find these money-making opportunities from online stores. We're going to use graphs to solve this problem!
Given a graph $G=(V, E)$, recall that the Bellman-Ford algorithm is able to tell us if a given graph contains a negative-weight cycle or not in $O(|V||E|)$ time. However, it does not tell us where the cycle is.
Modify the Bellman-Ford algorithm such that it can also return an array of all nodes that are on a negative-weight cycle, if one exists. Justify why your algorithm is $O(|V||E|)$ time. If the graph has multiple negative-weight cycles, you are free to return any one of them. Note that in your output array, it does not matter what the first element is, as long as it is in cyclical order. You can assume that each node is labeled with a unique integer from 0 to $|V|-1$, inclusive.
Bellman-Ford relaxes every edge $|V|$ times. If we can relax an edge during the last round, then we know that there is a negative-weight cycle in this graph. The vertices incident to the edges that could be relaxed on the last round are surely reachable from a negative-weight cycle. Note that they are not necessarily on the cycle themselves. We call such vertices "witnesses". Using the parent pointer chain, we back-track from any such witness, using a direct-access array (DAA) to keep track of seen nodes in $O(1)$ time per lookup/insert. If we encounter a vertex that we have already seen, that vertex has to be on the negative-weight cycle. We then repeat the parent-pointer back-track starting at that vertex until we see it again, inserting each vertex into a linked list, which we return.
Since the length of this cycle is at most $|V|$, our tracing algorithm is within the $O(|V||E|)$ budget.
Consider the weighted directed graph below.
Modify the weights so that they are all non-negative and yet all shortest paths are preserved. No justification needed, though you might want to show your work in case you make a small mistake.
Run Johnson's first step. We add a super-node $s$ connected to all other vertices with edges of weight 0 . The shortest distances from $s$ to all other vertices are $\delta(z)=0, \delta(x)=-7, \delta(y)=-4, \delta(w)=-1$. Using $w^{\prime}(u, v)=w(u, v)+\delta(u)-\delta(v)$, we get the following.
There are alternative solutions. For example, given that the graph consists of a single strongly connected component, one does not need to add a super-node to do BF on the first step. One can start BF from any node in the graph. Different choices will end up with a different set of weights, all of them correct.
You are given a weighted directed graph $G=(V, E, w)$ where each node $v$ has a color v.color (which is part of the input). Assume that there are $k$ possible colors and that the weights can be both positive and negative. Describe an algorithm that computes the length of the shortest path from a designated source $s$ to a given destination $t$, where every time the path repeats colors, you incur a cost of $\ell$. Here, we say that a path repeats colors if two consecutive nodes in the path have the same color. So, for example, going RED, BLUE, RED does not repeat colors but going RED, BLUE, BLUE does. You can assume that there is at least one path from $s$ to $t$.
For full credit, provide a short description of the algorithm and an analysis of its run time. The runtime should be expressed in terms of $|V|,|E|$ and/or $k$. Faster algorithms will receive more credit. You can invoke any algorithm discussed in lecture, recitation or p-sets.
Change the weights to add $\ell$ to the weight of any edge between nodes of the same color. Then run Bellman-Ford. The cost is $O(|V| \cdot|E|)$.
99
0EECS6.191
Computation Structures
6.100A, 8.02None
Prelab Questions 1
Binary Representation
1a0.03333333333TextWrite 5 as a 4-bit binary number:Numerical
0b0101.
As $5=0 \times 2^{3}+1 \times 2^{2}+0 \times 2^{1}+1 \times 2^{0}$, this is equivalent to 0b0101. The 0b prefix indicates that it is a binary number.
Write -11 as a 5-bit binary number using 2's complement representation:
0b10101.
As $11=0 \times 2^{4}+1 \times 2^{3}+0 \times 2^{2}+1 \times 2^{1}+1 \times 2^{0}$, this is equivalent to 0b01011. The 2's complement representation of $-11$ can be found by inverting all the bits and adding 1.
$$
0b01011 \stackrel{\text { invert bits }}{\longrightarrow} 0b10100 \stackrel{+1}{\longrightarrow} 0b10101
$$
Write 7 and 4 in 4-bit 2's complement notation, then add them together using fixed width 2's complement arithmetic. Show your work. Provide your result in binary, and decimal. For each computation also specify whether or not overflow occurred.
Sum in binary: $0b1011$
Sum in decimal: $-5$
Did overflow occur? (Yes/No): Yes
Write -3 and -4 in 4-bit 2's complement notation, then add them together using fixed width 2's complement arithmetic. Show your work. Provide your result in binary, and decimal. For each computation also specify whether or not overflow occurred.
Sum in binary: $0b1001$
Sum in decimal: -7
Did overflow occur? (Yes/No): No
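Both additions can be reproduced with a small fixed-width helper (a sketch; the function name is ours):
def add_twos_complement(a, b, bits=4):
    # Keep only the low `bits` bits of the true sum.
    mask = (1 << bits) - 1
    raw = (a + b) & mask
    # Reinterpret the sign bit to recover the signed value.
    value = raw - (1 << bits) if raw >> (bits - 1) else raw
    # Overflow iff the true sum falls outside the representable range.
    overflow = not (-(1 << (bits - 1)) <= a + b < (1 << (bits - 1)))
    return format(raw, '0%db' % bits), value, overflow

print(add_twos_complement(7, 4))     # ('1011', -5, True)
print(add_twos_complement(-3, -4))   # ('1001', -7, False)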
100
33EECS6.121
Introduction to Algorithms
6.100A, 6.12006.101Problem Set 3Critter Sort1e0.125Text
Professor Oak is trying to organize his $n$ Critters so he can study them more efficiently. For each of the following scenarios, provide the most efficient algorithm for sorting the Critters, and state whether the asymptotic complexity is $\Theta(n), \Theta(n \log n)$, or $\Theta\left(n^{2}\right)$. Briefly justify your answers (a sentence or two suffices). Choose only one of the following sorting algorithms from lecture and recitation: insertion sort, selection sort, merge sort, counting sort, and radix sort.
Professor Oak's Critters like to battle each other. Each Critter has participated in at least one battle, and each pair of Critters has battled each other at most once. Professor Oak wants to sort his Critters by the number of times that they have battled.
Open
The sorting key is an integer in the range $[1, n-1]$, and so radix sort takes $\Theta(n)$ time.
Professor Oak is trying to organize his $n$ Critters so he can study them more efficiently. For each of the following scenarios, provide the most efficient algorithm for sorting the Critters, and state whether the asymptotic complexity is $\Theta(n), \Theta(n \log n)$, or $\Theta\left(n^{2}\right)$. Briefly justify your answers (a sentence or two suffices). Choose only one of the following sorting algorithms from lecture and recitation: insertion sort, selection sort, merge sort, counting sort, and radix sort.
Professor Oak wants to sort his Critters alphabetically by the names he has given them, which are strings of length at most $\log _{2} n+1$, containing lowercase letters of the English alphabet. Each string is stored as contiguous bits in memory.
There are 26 choices for each letter. Therefore, the names can be interpreted as positive integers bounded by $26^{\log _{2} n+1}=O\left(n^{6}\right)$, so we can use radix sort in $\Theta(n)$ time.
Professor Oak is trying to organize his $n$ Critters so he can study them more efficiently. For each of the following scenarios, provide the most efficient algorithm for sorting the Critters, and state whether the asymptotic complexity is $\Theta(n), \Theta(n \log n)$, or $\Theta\left(n^{2}\right)$. Briefly justify your answers (a sentence or two suffices). Choose only one of the following sorting algorithms from lecture and recitation: insertion sort, selection sort, merge sort, counting sort, and radix sort.
Professor Oak wants to sort his Critters by their species' ID number, which is a positive integer less than 894.
The sorting key comes from a constant-size set, and so counting sort runs in $\Theta(n)$ time.
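A minimal sketch of counting sort with bounded integer keys (the sample records are made up for illustration):
def counting_sort(items, key, k):
    # Stable: items with equal keys keep their relative order.
    buckets = [[] for _ in range(k)]
    for item in items:
        buckets[key(item)].append(item)
    return [item for bucket in buckets for item in bucket]

critters = [('a', 500), ('b', 3), ('c', 893), ('d', 3)]
print(counting_sort(critters, key=lambda c: c[1], k=894))   # sorts by species ID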
Professor Oak is trying to organize his $n$ Critters so he can study them more efficiently. For each of the following scenarios, provide the most efficient algorithm for sorting the Critters, and state whether the asymptotic complexity is $\Theta(n), \Theta(n \log n)$, or $\Theta\left(n^{2}\right)$. Briefly justify your answers (a sentence or two suffices). Choose only one of the following sorting algorithms from lecture and recitation: insertion sort, selection sort, merge sort, counting sort, and radix sort.
After Professor Oak sorts his Critters by their weight, one of the Critters got heavier. Professor Oak wants to resort his Critters by weight.
Only one Critter has a different weight from before, and so insertion sort will only need to do $\Theta(n)$ work to put this Critter into its new correct position.