Jitts for 574
Random Variable as a Function
Give a formal example of a random variable for the St. Petersburg game.
F vs Q
For scalar random variables, given a cadlag distribution function F we can find a caglad quantile function Q. Can this also be done for bivariate random variables?
Generating Functions
Verify the calculations for the three-dice example with the help of Mathematica.
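If you prefer R to Mathematica, here is a minimal sketch, assuming the example concerns the distribution of the sum of three fair dice (obtained by convolving the single-die probabilities, i.e. multiplying the probability generating functions):

  die <- rep(1/6, 6)                             # probabilities for faces 1..6
  s2  <- convolve(die, rev(die), type = "open")  # sum of two dice, values 2..12
  s3  <- convolve(s2, rev(die), type = "open")   # sum of three dice, values 3..18
  round(rbind(value = 3:18, probability = s3), 4)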
Gaussian Characteristic Function
Verify Example 1 of L1 Section 9.
Momentary Lapses
Moments tell us something about probabilities, but what?
Chebyshev
What was Chebyshev's given name?
L2 Presentations
Choose one of the following proofs to present in class: Thm 1, 2, 3, or 6.
Bounds
In recent work on partial identification in econometrics the Roy model plays an important role. Try to come up with a model of a binary treatment effect that mimics the Roy setup.
qth mean implies "in probability"
A standard argument for consistency, say in regression settings, is to compute the mean and variance of an estimator and show that the bias and the variance both tend to zero. Illustrate with the least squares estimator in the classical linear model and relate to Thm 1 of section 6 of L3.
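A rough R illustration of the idea in a toy simple-regression model of my own (not the notation of the notes): the OLS slope is unbiased and its variance is sigma^2 / sum((x_i - xbar)^2), which is O(1/n) for well-behaved x's, so Chebyshev's inequality delivers consistency. The simulated variance of the slope should shrink roughly by a factor of 10 each time n grows by a factor of 10:

  set.seed(1)
  bhat <- function(n) {                      # OLS slope in y = 1 + 2 x + error
    x <- rnorm(n); y <- 1 + 2 * x + rnorm(n)
    coef(lm(y ~ x))[2]
  }
  sapply(c(100, 1000, 10000), function(n) var(replicate(200, bhat(n))))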
Strong vs Weak Convergence
Give an example in which Xn converges in probability, i.e. weakly, but doesn't satisfy the Kolmogorov condition of the third theorem of section 6.
Slutsky Thm
The Slutsky theorem is just a special case of the continuous mapping theorem given below. Explain why.
Delta Method
The delta method often provides an extremely convenient first approximation to the limiting behavior of an estimator. Review the role of the Slutsky theorem in establishing the validity of the delta method, and illustrate its use in the classical regression setting.
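A hedged R illustration with a toy example of my own (not the one in the notes): if sqrt(n)(Xbar - mu) converges to N(0, sigma^2), then sqrt(n)(g(Xbar) - g(mu)) converges to N(0, g'(mu)^2 sigma^2). With g(x) = 1/x and Exponential(1) data, mu = sigma^2 = 1 and g'(mu) = -1, so the limiting variance should be 1:

  set.seed(2)
  n <- 5000
  z <- replicate(2000, {x <- rexp(n); sqrt(n) * (1 / mean(x) - 1)})
  c(simulated_variance = var(z), delta_method_prediction = 1)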
Chebyshev WLLN
In many realistic settings we have models like the simple one posited in the notes for the Chebyshev WLLN; that is, we don't really believe that all the observations are drawn from a population with one mean and one variance, but instead there may be many means and variances, and we would like to estimate the mean of the means. Relate this to the simple binary treatment model with heterogeneous treatment effects.
Kolmogorov SLLN
The proof of the iid form of the KSLLN can be viewed as a rehearsal for the standard characteristic function argument for the simple CLT given in L4.  Be prepared to present this proof in class.
CLT
As you see in the notes, all the complication of the argument for the simplest form of the CLT is hidden away in the moment expansion arguments of L1. What is left is very mechanical, especially given the use of the o() notational trick. Be prepared to explain this step by step.
Why normal?
Does the Breiman argument help to provide some intuition for why the limiting form of the CLT is Gaussian?
Monte Carlo Rejection
The rejection method developed in L5 is an important tool in many Bayesian Markov chain Monte Carlo problems. Illustrate the method using the example in the text and explain, via the bivariate geometry, how it "works."
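A minimal rejection-sampling sketch in R, using a stand-in target of my own rather than the example in L5: draw from the Beta(2, 2) density f(x) = 6 x (1 - x) with a U(0, 1) proposal g and envelope constant M = 1.5 (the maximum of f), accepting a proposal y when u <= f(y) / (M g(y)). The acceptance rate should be about 1/M = 2/3, and the accepted points are exactly those lying under the graph of f, which is the bivariate geometry in question:

  set.seed(3)
  M <- 1.5
  y <- runif(100000)                    # proposals from g = U(0, 1)
  u <- runif(100000)                    # uniform acceptance variables
  x <- y[u <= 6 * y * (1 - y) / M]      # accept when u <= f(y) / (M g(y))
  c(acceptance_rate = length(x) / length(y), theory = 1 / M)
  hist(x, breaks = 50, freq = FALSE)
  curve(dbeta(x, 2, 2), add = TRUE)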
Problem Set 1
I've posted a set of accumulated problems for the first part of the course linked to the class web page as PS1.pdf. The first four problems are sort of warm-up exercises for L1. The next three problems are part of my ongoing campaign to induce skepticism about the method of moments. And the remaining problems are more like proposed paper topics. For class on Thursday, Feb 6, provided it isn't too snowy, please come prepared to discuss Questions 6 and 7.
KGF for McCullagh Distribution
If you would like to check the kgf given in PS1.7 it can be done in Mathematica but is a bit painful. At least when I tried it, my first answer involved some quite mysterious imaginary factors, but if you persist by asking Mathematica to ExpToTrig, then Simplify, then TrigReduce, you eventually get to something recognizable from the PS.
Why not histograms?
Histograms are sometimes taught in elementary school, where they usually use equally spaced bins and simply count the number of observations in each bin to get bin heights. This is also the default in R. Criticize both procedures.
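For concreteness, R's hist() defaults: breaks are chosen by Sturges' rule, ceiling(log2(n)) + 1 equally spaced bins, and the default freq = TRUE plots raw counts rather than a density. A quick look (the contrast with the Freedman-Diaconis rule is just my own choice of comparison):

  set.seed(4)
  x <- rnorm(10000)
  nclass.Sturges(x)                     # default number of bins grows only like log2(n)
  hist(x)                               # default: equal bins, counts on the vertical axis
  hist(x, breaks = "FD", freq = FALSE)  # Freedman-Diaconis bins, density scale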
Quincunx
In PS 2 Q2 there is R code for a quincunx. Download the code and run it a couple of times to get a feel for the de Moivre-Laplace version of the CLT and the relationship between histograms and densities for the simplest CLT, in which the summands take values +/- 1 with equal probability. Note the smooth density estimate plotted above the histogram produced by the quincunx.
MSE rates
Explain why centering the bins of the histogram allows one to get the remarkable improvement in the MSE of the pointwise estimate from O(n^(-2/3)) to O(n^(-4/5)).
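A skeleton of the standard calculation, as I recall it (check the constants against the notes). For a bin [t_j, t_{j+1}) of width h containing x,

  E \hat f(x) = \frac{F(t_{j+1}) - F(t_j)}{h} = f(x) + f'(x)\Big(\frac{t_j + t_{j+1}}{2} - x\Big) + O(h^2), \qquad \mathrm{Var}\, \hat f(x) \approx \frac{f(x)}{nh}.

At a generic x the bias is O(h), so MSE = O(h^2) + O(1/(nh)), minimized by h ~ n^(-1/3), giving O(n^(-2/3)); at the bin center the O(h) term vanishes, the bias is O(h^2), so MSE = O(h^4) + O(1/(nh)), minimized by h ~ n^(-1/5), giving O(n^(-4/5)).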
Stein Shrinkage
Review the argument for the James-Stein (1961) result and see if you find places where the steps are obscure. Observe that the result can be extended from multivariate normal means to regression to show that the OLS estimator is inadmissible for p >= 3.
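For reference, the standard form of the estimator for Y ~ N_p(mu, sigma^2 I) with sigma^2 known is

  \hat\mu_{JS} = \Big(1 - \frac{(p-2)\,\sigma^2}{\|Y\|^2}\Big)\, Y,

which dominates Y itself under squared error loss, for every mu, whenever p >= 3.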
Minimaxity and Empirical Bayes
Robbins (1951) proposes the following problem to illustrate why minimax decision rules are unsatisfactory. Suppose you observe Y_i = mu_i + u_i for i = 1, …, n with u_i iid N(0, 1). And suppose that you know that each mu_i is either -1 or +1 and that you would like to estimate the n-vector of mu's subject to squared error loss. The minimax decision rule is muhat_i = sign(Y_i), but suppose you see a fairly large sample and most of the Y_i are positive: does the minimax rule seem reasonable?
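A small simulation of Robbins's point, with details of my own choosing: when 90% of the mu_i are +1, the minimax rule does appreciably worse than an empirical Bayes rule that first estimates the mixing proportion p = P(mu_i = +1) (here by the method of moments, since E Y = 2p - 1) and then uses the estimated posterior mean of mu_i given Y_i:

  set.seed(5)
  n  <- 10000
  mu <- ifelse(runif(n) < 0.9, 1, -1)
  y  <- mu + rnorm(n)
  p  <- (mean(y) + 1) / 2                   # method-of-moments estimate of P(mu = +1)
  post <- p * dnorm(y - 1) / (p * dnorm(y - 1) + (1 - p) * dnorm(y + 1))
  eb <- 2 * post - 1                        # estimated posterior mean of mu given Y
  mm <- sign(y)                             # minimax rule
  c(minimax_MSE = mean((mm - mu)^2), empirical_Bayes_MSE = mean((eb - mu)^2))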
Complete Class Theorems
Priors emerge from decision theory in somewhat the same way that prices emerge from general equilibrium theory. Discuss.
Sufficiency
Usually when we reduce the dimension of sample information something is "lost in translation," but in special cases this isn't true. Be prepared to present the U[0, theta] example on p. 5 of the notes.
Moments of Sufficient Statistics
The sufficient statistic for theta in the one-parameter exponential family (1pxf) is sum T(Z_i). Explain how to compute moments of this statistic, in particular how the transition from the unnatural parameter theta to the natural parameter eta helps.
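A sketch of why the natural parameterization helps, assuming the notes write the density as f(z | eta) = h(z) exp(eta T(z) - A(eta)):

  E\, e^{s T(Z)} = e^{A(\eta + s) - A(\eta)}, \qquad K_{\sum_i T(Z_i)}(s) = n\,[A(\eta + s) - A(\eta)],

so differentiating the cumulant generating function at s = 0 gives E sum T(Z_i) = n A'(eta) and Var sum T(Z_i) = n A''(eta); in the theta parameterization the same moments require pushing the chain rule through eta(theta).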
Cramér-Rao Inequality
… follows from the correlation inequality.  Explain.
Compound Decisions
The empirical Bayes methods of Robbins, now more than 50 years old, are receiving increased attention. Interpret the Tweedie formula, or Bayes rule, and be prepared to discuss its derivation.
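As a reminder, in the Gaussian case (sigma^2 the known noise variance, f the marginal density of Y) the Tweedie formula reads

  Y \mid \mu \sim N(\mu, \sigma^2), \ \mu \sim g \;\Longrightarrow\; E[\mu \mid y] = y + \sigma^2 \frac{d}{dy} \log f(y),

so the Bayes rule requires only an estimate of the marginal density f, not of the prior g itself, which is what gives the compound decision / empirical Bayes approach its leverage.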
Wald Argument
Reducing consistency of the MLE to Jensen's inequality was a considerable insight.  Be prepared to discuss the proof and the limitations of the theorem at the bottom of L9 p1.
Consistent Roots
Why should we be skeptical about the consistent root argument on p2?
Asymptotic Normality
The CLT seems to be about means, but MLEs are quite unmean; for example, they include the median. Does the Lehmann argument work for the median when sampling from the Laplace distribution? What about the Cauchy MLE?
One steps
One-step estimators seem to be another "almost free lunch" -- why might we want to be skeptical about their asymptotic efficiency?
Super Efficiency
Hodges' example created some serious consternation among high-brow statisticians in the 1950s. Connect it to the James-Stein estimator.
Super Efficiency and the Heavy Weights rematch
Aad van der Vaart's discussion of super-efficiency includes a figure (Fig. 8.1 of AS) that purports to show that Hodges' example is really not so bad after all. In effect, he argues, one pays for good behavior at zero with very bad behavior "near" zero. This figure appears in the Wikipedia article on super-efficiency and also in a blog post by Larry Wasserman; google "Super-efficiency: the nasty ugly little fact." Both Ruchi and Ignacio asked about this after class on Thursday, so we should talk about it on Tuesday. I'll include some bad R code to generate the picture in the software link for 574. One thing you might notice, and should think about a bit more, is that van der Vaart's discussion and his figure caption are a bit misleading, since the figure is based on a rescaled version of the MSE that is multiplied by the sample size. This is helpful for keeping the curves for different sample sizes on the same scale, but it obscures the fact that the MSE itself is going to zero, so the dramatic risks are exaggerated in the figure.
The Pseudo True
When we use the model f to describe the data generating process g, the parameters of f only rarely have the same interpretation they would have had if the data had come from f. Give some examples where the pseudo-true parameter is nicely linked to the true one.
The Art of Sandwiches
Sandwiches arise due to a mismatch, a breakdown in the information identity, when models are misspecified. An early econometrics example is the Eicker-White covariance matrix for the least squares estimator when there is heteroscedasticity. Explain.
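A bare-hands R check of the Eicker-White idea on heteroscedastic toy data of my own (the HC0 form, no degrees-of-freedom correction): the sandwich (X'X)^{-1} X' diag(uhat_i^2) X (X'X)^{-1} versus the classical sigmahat^2 (X'X)^{-1}:

  set.seed(6)
  n <- 500
  x <- runif(n)
  y <- 1 + 2 * x + rnorm(n, sd = x)          # error variance grows with x
  X <- cbind(1, x)
  b <- solve(crossprod(X), crossprod(X, y))
  u <- as.vector(y - X %*% b)
  bread <- solve(crossprod(X))
  meat  <- crossprod(X * u)                  # X' diag(u_i^2) X
  V_sandwich  <- bread %*% meat %*% bread
  V_classical <- bread * sum(u^2) / (n - 2)
  cbind(sandwich_se = sqrt(diag(V_sandwich)), classical_se = sqrt(diag(V_classical)))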
Making Sandwiches from Bootstraps
A rather unappealing way to build sandwiches is to replace the delicious crab salad with old bootstraps.
How many words?
A variant of the numbered cars problem in L13 is the problem of missing species: how to estimate the number of butterfly species in the Amazon, for example. A very cogent discussion of the development of some of these ideas is available in Brad Efron's paper here: http://projecteuclid.org/download/pdf_1/euclid.aos/1051027871  Let's plan to discuss this on Tuesday. Not the microarray part, just the Shakespeare part.
Reproducibility
An important aspect of the design of simulation experiments is the careful documentation and archiving of results. Some discussion of these issues is available from my web page via the link to Reproducibility. There is a paper with Achim Zeileis that appeared in JAE, and a protocol for simulations in R, among other things. We could discuss these issues as well.
Thursday, March 6
Given that Sims's seminar doesn't end until 12, I think we should just skip class today and resume next Tuesday. Have a good weekend!
How Many Moments?
A perennial question in GMM problems asks: how many moments should be used? There is a large literature on this subject. One attempt to partially answer the question is given here: http://www.sciencedirect.com/science/article/pii/S0304407699000147 where it is suggested that q_n = O(n^(1/3)) moments suffice under some regularity conditions. A semi-satirical paper that extends this result is available here: http://www.sciencedirect.com/science/article/pii/S0165176598001220  It might be fun to briefly discuss what is going on in the latter paper.
EL and Continuously Updated GMM
The brief exposition of EL methods in L14 eventually yields a formulation that involves optimizing an expression in the cumulant generating function.  This approach is closely related to work by Hansen, Heaton and Yaron on what they called continuously updated GMM.  Again, the difficulty of estimating the weighting matrix for GMM is revealed and the advantage of pushing the analysis back towards a proper likelihood formulation is established.  This would be another possibly worthwhile topic for discussion.
Robustness as Continuity Requirement
Be prepared to describe Hampel's definition of qualitative robustness.
Influence functions
The other central concept of Hampel's path-breaking PhD thesis is the definition of the influence function. Although the corresponding notion of differentiability is a bit slippery, the IF is an extremely important tool. We should discuss why.
Huber's Location Estimator
See the notes on Calculus of Variations, Lecture 12a,  for a heuristic view of the Huber problem.  We should discuss this too.
Kernel Regression, if you must
Locally constant (kernel) regression estimates the conditional mean function. The theory underlying it is all quite familiar from kernel density estimation. I'd like to discuss why local polynomial methods are preferable in most circumstances, and the connection to model selection described in the last pages of L16.
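A quick R comparison on toy data of my own: a locally constant fit via ksmooth() next to a local-polynomial fit via loess.smooth(); the boundary behavior of the locally constant estimate is usually where the difference shows up most clearly.

  set.seed(9)
  x <- sort(runif(300))
  y <- sin(2 * pi * x) + rnorm(300, sd = 0.3)
  plot(x, y, col = "grey")
  lines(ksmooth(x, y, kernel = "normal", bandwidth = 0.2), col = "red")    # locally constant
  lines(loess.smooth(x, y, span = 0.3, degree = 2), col = "blue")          # local quadratic
  curve(sin(2 * pi * x), add = TRUE, lty = 2)                              # truth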
Splines, Sieves, and all that
L17 deals with smoothing splines and we should discuss the relative merits of regression splines as introduced in 508 and smoothing splines.  Again, a challenging issue is the choice of smoothing parameters and the connection to model selection.
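On the smoothing-parameter question, it may help to recall what R's smooth.spline() does by default: lambda is chosen by generalized cross-validation (cv = FALSE), and setting cv = TRUE switches to ordinary leave-one-out cross-validation. A quick comparison on toy data of my own:

  set.seed(10)
  x <- sort(runif(200))
  y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
  fit_gcv <- smooth.spline(x, y)              # GCV (the default)
  fit_cv  <- smooth.spline(x, y, cv = TRUE)   # ordinary leave-one-out CV
  c(effective_df_gcv = fit_gcv$df, effective_df_cv = fit_cv$df)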
Smoothing Splines
Class on Tuesday, March 31 will be only 50 minutes, since I have a meeting I need to attend at noon. I'd like to discuss the advantages of smoothing splines and the various connections among methods for choosing smoothing parameters.
Log-splines and their offspring
L18 deals with log-spline density and hazard estimation. I'll try to convince you that this is preferable to kernel methods even though it is hardly visible on the econometric radar. There is also an L18a, which consists of slides for a talk I gave in Aarhus a couple of years ago. It provides an overview of some work I was doing on density estimation before I got captivated by empirical Bayes. Please consider the nice connections to exponential families for the original log-spline methods. We should discuss the tradeoffs involved in knot selection vs smoothness penalties for these methods.
SIC
L19's discussion of the Schwarz paper is almost as long as the Schwarz paper itself. A main objective is to try to explain what the underlying decision problem is and why it is really quite distinct from ordinary garden-variety hypothesis testing. Let's begin by discussing this.
Lemma 1
Lemma 1 is the heart of the Schwarz result, and for linear regression it is all that is really needed, since the log-likelihood is quadratic. My interpretation of this result is that it demonstrates that in his Bayesian setup the role of the prior is like the Cheshire Cat in Alice in Wonderland -- as the sample size increases only the dimensionality of the prior is left, just like the smile of the Cheshire Cat that remained after the cat disappeared.
Lasso, Compressed Sensing and the Dantzig Selector
If there is time it would be fun to discuss the Lasso and the secret decoder ring.
AR(1)'s and All That
From 508, or somewhere else, we know that the Gaussian AR(1) model yields a stationary distribution for y that is Gaussian with mean mu/(1-rho) and variance sigma^2/(1-rho^2). So if we were to simulate such a process and collect a sample of realizations, what would we expect to see?
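A quick R check with parameter values of my own: compare a long simulated path of y_t = mu + rho y_{t-1} + e_t, e_t ~ N(0, sigma^2), with the claimed stationary moments (ergodicity lets a single long realization stand in for many independent ones):

  set.seed(7)
  mu <- 1; rho <- 0.8; sigma <- 1; n <- 50000
  e <- rnorm(n, sd = sigma)
  y <- numeric(n)
  y[1] <- mu / (1 - rho)                         # start at the stationary mean
  for (t in 2:n) y[t] <- mu + rho * y[t - 1] + e[t]
  rbind(simulated  = c(mean = mean(y), sd = sd(y)),
        stationary = c(mean = mu / (1 - rho), sd = sigma / sqrt(1 - rho^2)))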
Ruin
Read through the coin-flipping discussion of ruin problems and see if there are points that are still obscure. Does the discussion of the matrix formulations help to clarify things?
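One way to make the matrix formulation concrete is to solve the classical ruin problem numerically and compare with the closed form; a small R sketch with parameters of my own (the gambler starts at k, wins each round with probability p, and plays until reaching N or 0):

  p <- 0.45; q <- 1 - p; N <- 10
  # u_k = P(reach N before 0 | start at k) solves u_k = p u_{k+1} + q u_{k-1},
  # with boundary conditions u_0 = 0 and u_N = 1
  A <- diag(N - 1)
  for (k in 1:(N - 1)) {
    if (k > 1)     A[k, k - 1] <- -q
    if (k < N - 1) A[k, k + 1] <- -p
  }
  b <- c(rep(0, N - 2), p)                 # the u_N = 1 boundary enters the last equation
  u <- solve(A, b)
  closed_form <- (1 - (q / p)^(1:(N - 1))) / (1 - (q / p)^N)
  round(cbind(k = 1:(N - 1), linear_system = u, closed_form = closed_form), 4)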
Eigenvalues and Stationary Distributions
The theorem on p. 6 of L20 is central to the subject and leads to a simple characterization of the stationary equilibrium as an eigenvalue problem. Compare the discussion from 508 of the infamous Google eigenvalue problem in L20. Is it coincidental that this result occurs in L20 in both courses?
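A compact R illustration with a 3-state chain of my own: the stationary distribution is the left eigenvector of the transition matrix P associated with the eigenvalue 1, normalized to sum to one.

  P <- matrix(c(0.9, 0.1, 0.0,
                0.2, 0.7, 0.1,
                0.1, 0.3, 0.6), nrow = 3, byrow = TRUE)
  e <- eigen(t(P))                               # left eigenvectors of P are right eigenvectors of t(P)
  pi_stat <- Re(e$vectors[, which.max(Re(e$values))])
  pi_stat <- pi_stat / sum(pi_stat)
  rbind(pi = pi_stat, pi_P = as.vector(pi_stat %*% P))   # the two rows should agree: pi = pi P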
Gibbs v Metropolis
Alternating draws from conditionals (the Gibbs sampler) offers a very convenient way to simulate from the stationary distribution of a suitably constructed Markov chain. Consider the limitations of this strategy and how it can be extended via the Metropolis algorithm.
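A minimal Gibbs sketch on the standard bivariate normal toy example (my choice, not necessarily the one in the notes): alternate draws from the two conditionals of a standard bivariate normal with correlation rho. The chain has the joint as its stationary distribution, but its mixing degrades as rho approaches 1, which is one of the limitations that motivates Metropolis-Hastings moves.

  set.seed(8)
  rho <- 0.95; n <- 10000
  x <- y <- numeric(n)
  for (t in 2:n) {
    x[t] <- rnorm(1, mean = rho * y[t - 1], sd = sqrt(1 - rho^2))
    y[t] <- rnorm(1, mean = rho * x[t],     sd = sqrt(1 - rho^2))
  }
  c(sample_correlation = cor(x, y), lag_1_autocorrelation = acf(x, plot = FALSE)$acf[2])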
Treatment Effects
L21 is quite a grab bag of topics.  Rather than try to do a little bit with each of them, I'd like each of you to prepare a brief presentation on one of the topics.  Ideally, you should try to coordinate so you don't all do the same thing.  
Survival Models
The first part of L22 is a review of material from 508. The parametric framework is rather simple and appealing, except that it is difficult to justify the choice of a particular functional form. Similarly, the Kaplan-Meier method is appealing, but it doesn't offer any strategy for incorporating covariates. Enter Cox: let's focus on the nice way that Murphy and van der Vaart get the partial likelihood.
Competing Risks
In L2 I mentioned that Peterson (http://www.pnas.org/content/73/1/11.full.pdf) provided an interesting extension of the Frechet bounds. If there is time we might want to discuss this as it relates to L24.