## Data and Measurement

What is Data? (Data = the results of a measurement)

• Definition of the thing being measured
• Measurement value (number plus units)
• Estimate of the uncertainty of each measurement
• Experimental context (measurement method + environment)
• Context uncertainty (uncertainty of controlled and uncontrolled input parameters)
• Measurement model (theory, assumptions and definitions used in making the measurement)
几乎所有的测量都是indirect，我们实际是不是测量我们真正想要的东西，而是测量something else and then we're relating the thing we measure to the thing we want.

The Measurement Model

• Virtually all measurements are indirect
—  We have a set of theories that relate what is actually being measured to what we want to measure (measurement model)
• Ex: Measuring temperature with a thermometer
• Thermal expansion of mercury (or spirits) turns length measurement into temperature measurement
• Theory: linear model of liquid thermal expansion
• Ex: Measuring temperature with a thermocouple
• Junction of dissimilar metals turns temperature change into voltage change (Seebeck effect)
• Theory: polynomial model of voltage versus temperature, coupled with digital (or analog) voltage measurement.

Measurement is not a Passive Act

• Measurement can change the thing you are measuring
— Measurement is often not observation without disturbance
• Example: SEM measurement
• Sample charging
• Physical and chemical changes of sample due to electron bombardment 电子轰击 (carbon deposition, etc.)
• Effects can be current and voltage dependent

All Measurements are Uncertain

• Measurement error exists, but we do not know what it is
• If we knew the measurement error, we would subtract it out!
• Unknown errors are called uncertainties
• Our goal is to estimate the uncertainty in our measurements
• Random errors can be estimated using repeated measurements
• Systematic errors require a sophisticated understanding of the measurement process (and the measurement model)

Measurement Stability

• We often make assumptions of spatial and/or temporal stability of measurements.
• Repeated measurements are taken at different times, locations, conditions, etc.
• How constant is the sample?
• How constant is the measurement process?
• How constant is the measurement context?

Measurement Terms

• Metrology(计量学/ 量测学/ 量衡学): the science of measurement.
• Measurand(被测物): the thing being measured.
• True Value: the unknown (and unknowable) thing we wish to estimate.
• Error: true value – measured value.
• Measurement Uncertainty】: an estimate of the dispersion of measurement values around the true value.

Uncertainty Components

• Systematic errors
• Produce a bias in the measurement result.
• We look for and try to correct for all systematic errors, but we are never totally successful.
• We try to put an upper limit on how big we expect systematic errors could be.
• Random errors
— Can be evaluated statistically, through repeated measurements.

Other Measurement Terms

• Accuracy(准确度): the same as error, it can never be known, but is estimated as the maximum systematic error that might reasonably exist
• Precision(精密度): the standard deviation of repeated measurements (random error component). 精密仪器Precision Instrument
• Repeatability: standard deviation of repeated measurements under conditions as nearly identical as possible. 侧重于自己在原来的仪器(样品)上重复试验。
• Reproducibility: standard deviation of repeated measurements under conditions that vary in a typical way (different operators, instruments, days).  侧重比如自己观察到的实验现象，别人是否能在他自己的仪器上观察到。
• 对于上面右侧的图，Wiki给的说明是Accuracy is the proximity of measurement results to the accepted value; precision is the degree to which repeated (or reproducible) measurements under unchanged conditions show the same results.

Measurement Model Example

• How wide is this feature?
• Most critical dimension measurement tools use a trapezoidal (梯形的) feature model
• What criterion should be used to best fit a trapezoid to this complicated shape?
• Accuracy of the width measurement is a strong function of the appropriateness of the feature model (not a good fit means not a good measurement)
• The measurement model of an SEM = the feature model + a model for how electrons interact with the sample

## Data Example

Chemical vapor deposition (CVD) is used to deposit a tungsten film on a silicon wafer

• There is no mention of deposition rate measurement error
• "The temperature was controlled to within ± 2ºC of the desired value within the deposition zone."
• The context of the experiment is well described
• There was no mention of context uncertainty (ex: pressure, flow rates)
• "From the Arrhenius plots an activation energy, Ea, of 0.71 eV/atom (69,000 J/mol) was calculated."实际上不同的线拟合的结果应该有差异。

• Deposition rate is the slope of these lines (presumably determined by ordinary least squares regression)
• What is the uncertainty (standard error) in the slopes?
• What assumptions are being made here? Can the assumptions be tested?  (指的是沉积厚度和沉积时间的线性关系)

## Process Modeling

Introduction

• Introduction to Process Modeling-NIST
• Process modeling is the concise description of the total variation in one quantity $$y$$ (called the response variable) by partitioning (分割，分开) it into
• A deterministic component (确定性的成分) given by a mathematical function of one or more other quantities, $$x_{1}, x_{2}, \ldots$$ and possibly unknown coefficients $$\beta_{0}, \beta_{1}, \ldots$$
• A random component $$\varepsilon$$ that follows a particular probability distribution $$y=f(\boldsymbol{x}, \boldsymbol{\beta})+\varepsilon \quad \boldsymbol{x}=\left(x_{1}, x_{2}, \ldots\right) ; \boldsymbol{\beta}=\left(\beta_{0}, \beta_{1}, \ldots\right)$$
• Generally, we require $$\mathrm{E}[\varepsilon]=0$$.
— Thus $$f(\boldsymbol{x}, \boldsymbol{\beta})$$ describes the average response, $$\mathrm{E}[y]$$, if the experiment is repeated many times, not the actual response for a given trial
• Our three tasks in modeling:
• Find the equation from $$f(\boldsymbol{x}, \boldsymbol{\beta})$$ that meets our goals
• Find the values of the coefficients $$\beta$$ that are "best" in some sense
• Characterize the nature of $$\varepsilon$$ (distribution of errors)
• The perfect model has
• The correct set of input variables $$x_{1}, x_{2}, \ldots$$
• The correct model form $$f(\boldsymbol{x}, \boldsymbol{\beta})$$
• The correct values for the coefficients $$\beta_{0}, \beta_{1}, \ldots$$
• The correct probability distribution for $$\varepsilon$$, including parameters such as its standard deviation $$\sigma$$
• Picking the right model (form and predictor variables) is called modeling building
• Finding the best estimate of the parameter values and the properties of the random variable $$\varepsilon$$ is called regression

Model Generalizability

$$y=f(\boldsymbol{x}, \boldsymbol{\beta})+\varepsilon$$

• The three aspects of our model (equation, coefficients, and errors) can have different levels of generalizability
— We often want to know the levels of generalizability
• Example: model of thermal stress on polymer
• The equation form applies to all materials (under certain conditions)
• The parameters change for different materials
• The errors are a function of measurement and experimental methods, independent of materials

Some Terminology

• $$y=$$ response variable, response, dependent variable
• $$x=$$ predictor variable, explanatory variable, independent variable, predictor, regressor
• Our "model" is both the function $$f(\boldsymbol{x}, \boldsymbol{\beta})$$ and the assumed distribution of $$\varepsilon$$

Regression

• Regression involves three things:
• Data (a response variable as a function of one or more predictor variables)
• Model (fixed form and predictor variables, but with unknown parameters)
• Method (a statistical regression technique appropriate for the model and the data to find the “best” values of the model parameters)
• High quality regression requires high quality in all three items

The Model

• Statistical Relationship: $$y_{i}=f\left(x_{i}, \beta\right)+\varepsilon_{i}=\hat{y}_{i}+\varepsilon_{i}$$
• Functional Relationship: $$\hat{y}=f(x, \beta)$$ or $$E[Y \mid X]=f(X, \beta)$$
$$X, Y=$$ random variables (probability terminology)
$$\hat{y}=$$ predicted (mean) response
$$y_{i}=$$ measured response for $$\mathrm{i}^{\text {th }}$$ data point
$$x_{i}=$$ value of explanatory variable for $$\mathrm{ith}^{\text {th }}$$ data point
$$X, Y=$$ random variables (probability terminology)
$$\hat{y}=$$ predicted (mean) response
$$Y_{i}=$$ measured response for $$\mathrm{i}^{\text {th }}$$ data point
$$X_{i}=$$ value of explanatory variable (解释变量，其实就是自变量) for $$\mathrm{i}^{\text {th }}$$ data point
$$\beta_{k}=$$ true model parameters (can never be known)
$$b_{k}=$$ best fit model parameters for this data set (sample); our estimate for $$\beta_{k}$$.
$$\varepsilon_{i}=$$ true value of $$\mathrm{i}^{\text {th }}$$ residual (from true model, not known)
$$e_{i}=$$ actual ith residual for the current model

Example Model

• Straight line model:\begin{aligned} &f(x, \beta)=\beta_{0}+\beta_{1} x \\ &\hat{y}_{i}=\beta_{0}+\beta_{1} x_{i} \\ &y_{i}=\beta_{0}+\beta_{1} x_{i}+\varepsilon_{i} \end{aligned}
• Regression produces the "best" estimate of the model given the data $$\left(x_{i}, y_{i}\right)$$ :$$y_{i}=b_{0}+b_{1} x_{i}+e_{i}$$

Models for Linear Regression

• We use linear regression for linear parameter models: $$\hat{y}$$ is directly proportional to each unknown model coefficient
• $$\hat{y}=\sum_{k} \beta_{k} f_{k}(x)$$ for bivariate data (双变量数据)
• Example: $$\hat{y}=\beta_{0}+\beta_{1} x+\beta_{2} x^{2}+\beta_{3} \ln (x)$$
• Multivariate data: two or more explanatory variables (we'll call them $$x_{1}, x_{2}$$, etc.)

Nonlinear Regression

• We call our regression nonlinear if it is nonlinear in the coefficients
• Linear regression: $$\hat{y}=\beta_{0}+\beta_{1} \ln (x)+\beta_{2} x^{3}$$
• Nonlinear regression: $$\hat{y}=\beta_{0} e^{\beta_{1} x}$$
• (非)线性回归，其实就是说应变量能不能表示为多个待定系数的线性组合，其中每一个待定系数都和一个含有自变量$$x$$的表达式相乘。比如函数$$f(x, \boldsymbol{\beta})=\displaystyle\frac{\beta_{1} x}{\beta_{2}+x}$$是非线性的，因为它不能表示为两个$$\beta$$的线性组合。非线性函数的其他示例包括指数函数 ， 对数函数 ， 三角函数 ， 幂函数 ， 高斯函数和洛伦兹曲线 。 某些函数（如指数函数或对数函数）可以进行转换，以使它们是线性的。 如此转换，可以执行标准线性回归，但必须谨慎应用。
• Linear regression is relatively easy
— Numerically stable with unique solution given a reasonable definition of "best" fit
• Nonlinear regression is relatively hard

## Regression Review

Vocabulary

• Experiment — explanatory variables are manipulated (controlled), all other inputs are held constant, and the response variable is measured
但是对于observation来说，我们测试inputs和outputs，也不控制inputs
• Model — a mathematical function that approximates the data and is useful for making predictions
• Regression — a method of finding the best fit of a model to a given set of data through the adjustment of model parameters (coefficients)
— What do we mean by “best fit”?
• Residual: $$\varepsilon_{i}=y_{i}-f\left(\boldsymbol{x}_{\boldsymbol{i}}, \boldsymbol{\beta}\right), e_{i}=y_{i}-f\left(\boldsymbol{x}_{\boldsymbol{i}}, \boldsymbol{b}\right)$$
residual可能是error in the mdel and/or model；这里的$$\boldsymbol{\beta}$$是model理论上参数，我们无法知道，但是可以用$$\boldsymbol{b}$$去估计。
• A "good" model has small residuals
• Small mean, $$|\bar{\varepsilon}|$$
• Small variance (方差), $$\sigma_{\varepsilon}^{2}$$
• Much of our model and regression checking will involve studying and plotting residuals
— Plot $$e_{i} v s . \hat{y}_{i}$$ (works for multiple regression)

What Do We Mean by “Best Fit”?

• First, assume $$\varepsilon$$ is a random variable that does not depend on $$x$$
• There are no systematic errors
• The model is perfect
• Desired properties of a "best fit" estimator
• Unbiased, $$\sum \varepsilon_{i}=\sum e_{i}=0\left(\right.$$ and $$\left.E\left[b_{k}\right]=\beta_{k}\right)$$(其实就是实验次数无穷大时的收敛值)
• Efficient, minimum variance $$\sigma_{e}^{2}\left(\right.$$ and $$\left.\operatorname{var}\left(b_{k}\right)\right)$$
• Maximum likelihood (for an assumed pdf of $$\varepsilon$$ )
• Robust (to be discussed later)
鲁棒性，if I have a or a few bad data point(s), how much does it screw up answer。

Maximum Likelihood Estimation

• Likelihood function: the probability of getting this exact data set given a known model and model parameters
— To calculate the likelihood function, we need to know the joint probability distribution function (pdf) of $$\varepsilon$$
• Maximum Likelihood: what parameter values maximize the likelihood function?
• Take the partial derivative of the likelihood function (or more commonly the log-likelihood function) with respect to each parameter, set it equal to zero
• Solve resulting equations simultaneously

MLE Example – straight line

• Let $$y_{i}=\beta_{0}+\beta_{1} x_{i}+\varepsilon_{i}, \varepsilon_{i} \sim N\left(0, \sigma_{\varepsilon}\right)$$, iid
• Since each $$\varepsilon_{i}$$ is independent, the likelihood function for the entire data set is$$L=\prod_{i=1}^{n} \mathbb{P}\left(y_{i}\right)=\prod_{i=1}^{n} \mathbb{P}\left(\varepsilon_{i}\right)$$
• Since the residuals are iid Normal,$$L=\left(\frac{1}{\sqrt{2 \pi} \sigma_{\varepsilon}}\right)^{n} \exp \left[-\frac{1}{2} \sum_{i=1}^{n} \frac{\varepsilon_{i}^{2}}{\sigma_{\varepsilon}^{2}}\right]$$
• Define chi-square as$$\chi^{2}=\sum_{i=1}^{n} \frac{\varepsilon_{i}^{2}}{\sigma_{\varepsilon}^{2}} \quad \text { (also called weighted SSE) }$$
• Maximizing $$L$$ is the same as minimizing $$\chi^{2}$$
• Solve these equations simultaneously$$\frac{\partial \chi^{2}}{\partial \beta_{k}}=0 \text { for all } k \text { (all coefficients) }$$
• For our line model, $$\varepsilon_{i}=y_{i}-\left(\beta_{0}+\beta_{1} x_{i}\right)$$
• Substitute our estimate for $$\beta_{0}$$ into $$\chi^{2}$$

Assumptions for this MLE

• We assumed three things:
• Each residual (and $$y_{i}$$ ) is independent
• Each residual (and $$y_{i}$$ ) is identically distributed
• Each $$\varepsilon_{i} \sim N\left(0, \sigma_{\varepsilon}\right)$$, and thus $$y_{i} \sim N\left(\hat{y}, \sigma_{\varepsilon}\right)$$
• We call this 【ordinary least squares】 (OLS)
• If any of these assumptions are invalid, then our least-squares estimates will not be the maximum likelihood estimates

Properties of our OLS Solution

• The parameters are unbiased estimators of the true parameters, with minimum variance compared to all other unbiased estimators (if our assumptions are correct)
• $$\displaystyle\sum_{i=1}^{n} e_{i}=0$$
• $$\displaystyle\sum_{i=1}^{n} y_{i}=\displaystyle\sum_{i=1}^{n} \hat{y}_{i}$$, so that $$\bar{y}=\overline{\hat{y}}$$
• $$\displaystyle\sum_{i=1}^{n} \hat{y}_{i} e_{i}=0$$
• $$\displaystyle\sum_{i=1}^{n} x_{i} e_{i}=0$$
• The best fit line goes through the point $$(\bar{x}, \bar{y})$$

OLS Matrix Formulation

When we have multivariate data, it is most convenient to formulate the OLS/MLE problem using matrix math

• $$i=1,2,3, \ldots, n$$ data point index
• $$k=0,1,2, \ldots, p-1$$ predictor variable index
• $$x_{i, k}=i^{t h}$$ value for the $$k^{t h}$$ predictor variable
• $$y_{i}=i^{t h}$$ response data value
• $$\hat{y}=\beta_{0}+\beta_{1} x_{1}+\beta_{2} x_{2}+\cdots+\beta_{p-1} x_{p-1}$$

Model in matrix form: $$\boldsymbol{Y}=\boldsymbol{X} \boldsymbol{\beta}+\boldsymbol{\varepsilon}$$
(each row in $$X$$ and $$\boldsymbol{Y}$$ is a "data point")

Sum of square errors$$S S E=\sum_{i=1}^{n} \varepsilon_{i}^{2}=\boldsymbol{\varepsilon}^{T} \boldsymbol{\varepsilon}=(\boldsymbol{Y}-\boldsymbol{X} \boldsymbol{\beta})^{T}(\boldsymbol{Y}-\boldsymbol{X} \boldsymbol{\beta})$$

Maximum Likelihood estimate (minimum SSE): $$\frac{\partial \chi^{2}}{\partial \beta_{k}}=0=\sum_{i=1}^{n} x_{i, k} \varepsilon_{i} \text { giving } \boldsymbol{X}^{\boldsymbol{T}} \boldsymbol{\varepsilon}=\mathbf{0}$$Using the definition of the residual, $$\boldsymbol{X}^{T} \boldsymbol{X} \boldsymbol{\beta}=\boldsymbol{X}^{T} \boldsymbol{Y}$$

Minimum SSE occurs when the coefficients are estimated as $$\begin{gathered} \widehat{\boldsymbol{\beta}}=\left(\boldsymbol{X}^{T} \boldsymbol{X}\right)^{-1} \boldsymbol{X}^{T} \boldsymbol{Y} \\ \widehat{\boldsymbol{Y}}=\boldsymbol{X} \widehat{\boldsymbol{\beta}}=\boldsymbol{X}\left(\boldsymbol{X}^{T} \boldsymbol{X}\right)^{-1} \boldsymbol{X}^{T} \boldsymbol{Y}=\boldsymbol{H} \boldsymbol{Y} \\ \boldsymbol{H}=\boldsymbol{X}\left(\boldsymbol{X}^{T} \boldsymbol{X}\right)^{-1} \boldsymbol{X}^{T}=\text { hat matrix } \\ \boldsymbol{e}=\boldsymbol{Y}-\widehat{\boldsymbol{Y}}=(\boldsymbol{I}-\boldsymbol{H}) \boldsymbol{Y} \end{gathered}$$注意$$\boldsymbol{I}$$表示的是identity matrix；$$\boldsymbol{H}$$和$$\boldsymbol{I}-\boldsymbol{H}$$都是symmetric；common alternate notation: $$b_{k}=\hat{\beta}_{k}$$

We can use our solution to calculate the covariance matrices:$$\begin{gathered} \operatorname{cov}(\boldsymbol{e})=s_{e}^{2}(\boldsymbol{I}-\boldsymbol{H}), \quad \operatorname{var}\left(e_{i}\right)=s_{e}^{2}\left(1-h_{i i}\right) \\ \operatorname{cov}\left(e_{i}, e_{j}\right)=-s_{e}^{2} h_{i j} \\ \operatorname{cov}(\widehat{\boldsymbol{Y}})=s_{e}^{2} \boldsymbol{H}, \quad \operatorname{var}\left(\hat{y}_{i}\right)=s_{e}^{2} h_{i i} \\ \operatorname{cov}(\widehat{\boldsymbol{\beta}})=s_{e}^{2}\left(\boldsymbol{X}^{T} \boldsymbol{X}\right)^{-1} \end{gathered}$$其中$$s_{e}^{2}$$是Variance of the residuals。

MLE Straight-Line Regression

• Our model: $$E[Y \mid X]=\beta_{0}+\beta_{1} X$$\begin{aligned} &\beta_{1}=\frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} \\ &\beta_{0}=E[Y]-\beta_{1} E[X] \end{aligned} \quad \rho=\frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X) \operatorname{var}(Y)}}
• Ordinary least squares (OLS) estimators:$$b_{1}=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} \quad b_{0}=\bar{y}-b_{1} \bar{x}$$

Uncertainty of Fit Parameters

• The regression fit is based on a sample of data$$\text { Population Model: } y_{i}=\beta_{0}+\beta_{1} x_{i}+\varepsilon_{i}$$
• To create confidence intervals for $$b_{0}, b_{1}$$, and $$\hat{y}_{i}$$, we need to know their sampling distributions
— Given the assumption of $$\varepsilon_{i} \sim N\left(0, \sigma_{\varepsilon}^{2}\right)$$, the parameter sampling distributions are unbiased and t-distributed (DF = n-2) degrees of freedom)
• $$\begin{gathered} \operatorname{var}\left(b_{1}\right)=\frac{\operatorname{var}(\varepsilon)}{n \operatorname{var}(X)} \quad \operatorname{var}\left(b_{0}\right)=\frac{\operatorname{var}(\varepsilon)}{n}\left(1+\frac{\bar{x}^{2}}{\operatorname{var}(X)}\right) \\ \operatorname{cov}\left(b_{0}, b_{1}\right)=-\bar{x} \operatorname{var}\left(b_{1}\right) \end{gathered}$$
• Estimators: (其中$$p$$表示number of model parameters)$$\begin{gathered} s_{e}^{2}=\frac{1}{n-p} \sum_{i=1}^{n} e_{i}^{2} \\ s_{b 1}^{2}=\frac{s_{e}^{2}}{(n-1) s_{x}^{2}} \quad s_{b 0}^{2}=\frac{s_{\varepsilon}^{2}}{n}+s_{b 1}^{2} \bar{x}^{2} \end{gathered}$$

Confidence Intervals

• The sampling distributions are Student's t with DF = n – 2
• Slope CI: $$b_{1} \pm t_{n-2, \alpha} s_{b 1} \quad s_{b 1}^{2}=\displaystyle\frac{s_{\varepsilon}^{2}}{(n-1) s_{x}^{2}}$$
• Intercept CI: $$b_{0} \pm t_{n-2, \alpha} s_{b 0} \quad s_{b 0}^{2}=\displaystyle\frac{s_{\varepsilon}^{2}}{n}+s_{b 1}^{2} \bar{x}^{2}$$
其中$$n$$是Critical t-value，$$\alpha=$$是significance level (e.g., 0.05)

Uncertainty of Predictions

• Uncertainty in $$\hat{y}_{i}$$ comes from the spread of the residuals and from uncertainty in the best fit parameters $$b_{1}$$ and $$b_{0}$$\begin{aligned} \operatorname{var}\left(\hat{y}_{i}\right) &=\frac{\operatorname{var}(\varepsilon)}{n}\left(1+\frac{\left(x_{i}-\bar{x}\right)^{2}}{\operatorname{var}(X)}\right) \\ s_{\hat{y}_{i}}^{2} &=\frac{s_{\varepsilon}^{2}}{n}+s_{b 1}^{2}\left(x_{i}-\bar{x}\right)^{2} \end{aligned}
• Uncertainty in a predicted new measurement $$\hat{y}_{\text {new }}$$ adds additional uncertainty of a single measurement$$s_{\hat{y}_{n e w}}^{2}=s_{\hat{y}_{i}}^{2}+s_{\varepsilon}^{2}$$

Uncertainty of Correlation Coefficient

$$\rho=\frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X) \operatorname{var}(Y)}} \quad r=\frac{1}{n-1} \sum_{i=1}^{n}\left(\frac{x_{i}-\bar{x}}{s_{x}}\right)\left(\frac{y_{i}-\bar{y}}{s_{y}}\right)$$

• The sampling distribution for $$r$$ is about Student's t (D F=n-2) only for $$\rho=0$$. For this case$$S_{r}=\frac{s_{\varepsilon}}{\sqrt{n-1} s_{y}}$$
• For $$\rho \neq 0$$, the sampling distribution is complicated
• We’ll use the Fisher z-transformation: $$z=\frac{1}{2} \ln \left(\frac{1+r}{1-r}\right)$$
• When $$n>25, z$$ is about normal with$$E[z]=\frac{1}{2} \ln \left(\frac{1+\rho}{1-\rho}\right) \quad \operatorname{var}(z)=\frac{1}{n-3}$$

Assumptions in OLS Regression

1. $$\varepsilon$$ is a random variable that does not depend on $$x$$ (i.e., the model is perfect, it properly accounts for the role of $$x$$ in predicting $$y$$ )
2. $$\mathrm{E}\left[\varepsilon_{l}\right]=0$$ (the population mean of the true residual is zero); this will always be true for a model with an intercept
3. All $$\varepsilon_{i}$$ are independent of each other (uncorrelated for the population, but not for a sample)
4. All $$\varepsilon_{j}$$ have the same probability density function (pdf), and thus the same variance (called homoscedasticity)
5. $$\varepsilon \sim \mathrm{N}\left(0, \sigma_{\varepsilon}\right)$$ (the residuals, and thus the $$y_{i}$$, are normally distributed)
6. The values of each $$x_{i}$$ are known exactly

Checking the Assumptions

• Do the assumptions in OLS regression hold?
— Which assumptions can you validate?
— If an assumption is violated, how far off is it?
• If one or more assumptions do not hold, does the observed violation invalidate the statistical procedure used?
— If so, what next?

Failed Assumptions – the Anscombe Problems

F. J. Anscombe, “Graphs in Statistical Analysis”, The American Statistician, Vol. 27, No. 1
(Feb., 1973) pp. 17 – 21.

What Happens When OLS Assumptions are Violated?

• At best, the regression becomes inefficient
— Do the assumptions in OLS regression holdThe uncertainty around the estimates is larger than you think: $$\operatorname{var}[\hat{\theta}]$$ for some parameter $$\theta$$
• At worst, the regression becomes biased
— The results may be misleading: bias $$[\hat{\theta}]$$
• We want small mean square error (MSE)$$\operatorname{MSE}(\hat{\theta})=\operatorname{var}[\hat{\theta}]+(\operatorname{bias}[\hat{\theta}])^{2}$$

Checking Our Assumptions

• Regression Diagnostics: checking for violations in any of the OLS assumptions
• Topics we'll address:
• Normality of residuals
• Outliers (identically distributed)
• Leverage and influence
• Heteroscedasticity (variation in variance)
• Error in predictor variables
• The wrong model
• Correlated residuals

Fixing Problems

• Regression Remediation: changing our regression to address diagnostic problems
• Topics we'll address:
• Outlier removal or adjustment
• Data transformation
• Weighted regression
• Total regression
• Model building
• Autocorrelation analysis