From Data to Decisions

课程网站-Chris Mack-The University of Texas at Austin
NIST/SEMATECH e-Handbook of Statistical Methods
Six Sigma-Response Surface Modeling
Response Surface Methodology-荧光粉

Data and Measurement

What is Data? (Data = the results of a measurement)

  • Definition of the thing being measured
  • Measurement value (number plus units)
  • Estimate of the uncertainty of each measurement
  • Experimental context (measurement method + environment)
  • Context uncertainty (uncertainty of controlled and uncontrolled input parameters)
  • Measurement model (theory, assumptions and definitions used in making the measurement)
    几乎所有的测量都是indirect,我们实际是不是测量我们真正想要的东西,而是测量something else and then we’re relating the thing we measure to the thing we want.


The Measurement Model

  • Virtually all measurements are indirect
    —  We have a set of theories that relate what is actually being measured to what we want to measure (measurement model)
  • Ex: Measuring temperature with a thermometer
      • Thermal expansion of mercury (or spirits) turns length measurement into temperature measurement
      • Theory: linear model of liquid thermal expansion
  • Ex: Measuring temperature with a thermocouple
      • Junction of dissimilar metals turns temperature change into voltage change (Seebeck effect)
      • Theory: polynomial model of voltage versus temperature, coupled with digital (or analog) voltage measurement.


Measurement is not a Passive Act

  • Measurement can change the thing you are measuring
    — Measurement is often not observation without disturbance
  • Example: SEM measurement
      • Sample charging
      • Physical and chemical changes of sample due to electron bombardment 电子轰击 (carbon deposition, etc.)
      • Effects can be current and voltage dependent


All Measurements are Uncertain

  • Measurement error exists, but we do not know what it is
      • If we knew the measurement error, we would subtract it out!
      • Unknown errors are called uncertainties
  • Our goal is to estimate the uncertainty in our measurements
      • Random errors can be estimated using repeated measurements
      • Systematic errors require a sophisticated understanding of the measurement process (and the measurement model)


Measurement Stability

  • We often make assumptions of spatial and/or temporal stability of measurements.
      • Repeated measurements are taken at different times, locations, conditions, etc.
      • How constant is the sample?
      • How constant is the measurement process?
      • How constant is the measurement context?


Measurement Terms

  • Metrology(计量学/ 量测学/ 量衡学): the science of measurement.
  • Measurand(被测物): the thing being measured.
  • True Value: the unknown (and unknowable) thing we wish to estimate.
  • Error: true value – measured value.
  • Measurement Uncertainty】: an estimate of the dispersion of measurement values around the true value.


Uncertainty Components

  • Systematic errors
      • Produce a bias in the measurement result.
      • We look for and try to correct for all systematic errors, but we are never totally successful.
      • We try to put an upper limit on how big we expect systematic errors could be.
  • Random errors
    — Can be evaluated statistically, through repeated measurements.


Other Measurement Terms

  • Accuracy(准确度): the same as error, it can never be known, but is estimated as the maximum systematic error that might reasonably exist
  • Precision(精密度): the standard deviation of repeated measurements (random error component). 精密仪器Precision Instrument
      • Repeatability: standard deviation of repeated measurements under conditions as nearly identical as possible. 侧重于自己在原来的仪器(样品)上重复试验。  
      • Reproducibility: standard deviation of repeated measurements under conditions that vary in a typical way (different operators, instruments, days).  侧重比如自己观察到的实验现象,别人是否能在他自己的仪器上观察到。
  • 对于上面右侧的图,Wiki给的说明是Accuracy is the proximity of measurement results to the accepted value; precision is the degree to which repeated (or reproducible) measurements under unchanged conditions show the same results.


Measurement Model Example

  • How wide is this feature?
  • Most critical dimension measurement tools use a trapezoidal (梯形的) feature model
      • What criterion should be used to best fit a trapezoid to this complicated shape?
      • Accuracy of the width measurement is a strong function of the appropriateness of the feature model (not a good fit means not a good measurement)
  • The measurement model of an SEM = the feature model + a model for how electrons interact with the sample



Data Example

Chemical vapor deposition (CVD) is used to deposit a tungsten film on a silicon wafer


  • There is no mention of deposition rate measurement error
  • “The temperature was controlled to within ± 2ºC of the desired value within the deposition zone.”
  • The context of the experiment is well described
  • There was no mention of context uncertainty (ex: pressure, flow rates)
  • “From the Arrhenius plots an activation energy, Ea, of 0.71 eV/atom (69,000 J/mol) was calculated.”实际上不同的线拟合的结果应该有差异。


  • Deposition rate is the slope of these lines (presumably determined by ordinary least squares regression)
  • What is the uncertainty (standard error) in the slopes?
  • What assumptions are being made here? Can the assumptions be tested?  (指的是沉积厚度和沉积时间的线性关系)



Process Modeling


  • Introduction to Process Modeling-NIST
  • Process modeling is the concise description of the total variation in one quantity \(y\) (called the response variable) by partitioning (分割,分开) it into
        • A deterministic component (确定性的成分) given by a mathematical function of one or more other quantities, \(x_{1}, x_{2}, \ldots\) and possibly unknown coefficients \(\beta_{0}, \beta_{1}, \ldots\)
        • A random component \(\varepsilon\) that follows a particular probability distribution $$ y=f(\boldsymbol{x}, \boldsymbol{\beta})+\varepsilon \quad \boldsymbol{x}=\left(x_{1}, x_{2}, \ldots\right) ; \boldsymbol{\beta}=\left(\beta_{0}, \beta_{1}, \ldots\right) $$
  • Generally, we require \(\mathrm{E}[\varepsilon]=0\).
    — Thus \(f(\boldsymbol{x}, \boldsymbol{\beta})\) describes the average response, \(\mathrm{E}[y]\), if the experiment is repeated many times, not the actual response for a given trial
  • Our three tasks in modeling:
      • Find the equation from \(f(\boldsymbol{x}, \boldsymbol{\beta})\) that meets our goals
      • Find the values of the coefficients \(\beta\) that are “best” in some sense
      • Characterize the nature of \(\varepsilon\) (distribution of errors)
  • The perfect model has
      • The correct set of input variables \(x_{1}, x_{2}, \ldots\)
      • The correct model form \(f(\boldsymbol{x}, \boldsymbol{\beta})\)
      • The correct values for the coefficients \(\beta_{0}, \beta_{1}, \ldots\)
      • The correct probability distribution for \(\varepsilon\), including parameters such as its standard deviation \(\sigma\)
  • Picking the right model (form and predictor variables) is called modeling building
  • Finding the best estimate of the parameter values and the properties of the random variable \(\varepsilon\) is called regression


Model Generalizability

\(y=f(\boldsymbol{x}, \boldsymbol{\beta})+\varepsilon\)

  • The three aspects of our model (equation, coefficients, and errors) can have different levels of generalizability
    — We often want to know the levels of generalizability
  • Example: model of thermal stress on polymer
      • The equation form applies to all materials (under certain conditions)
      • The parameters change for different materials
      • The errors are a function of measurement and experimental methods, independent of materials


Some Terminology

  • \(y=\) response variable, response, dependent variable
  • \(x=\) predictor variable, explanatory variable, independent variable, predictor, regressor
  • Our “model” is both the function \(f(\boldsymbol{x}, \boldsymbol{\beta})\) and the assumed distribution of \(\varepsilon\)



  • Regression involves three things:
      • Data (a response variable as a function of one or more predictor variables)
      • Model (fixed form and predictor variables, but with unknown parameters)
      • Method (a statistical regression technique appropriate for the model and the data to find the “best” values of the model parameters)
  • High quality regression requires high quality in all three items


The Model

  • Statistical Relationship: \(y_{i}=f\left(x_{i}, \beta\right)+\varepsilon_{i}=\hat{y}_{i}+\varepsilon_{i}\)
  • Functional Relationship: \(\hat{y}=f(x, \beta)\) or \(E[Y \mid X]=f(X, \beta)\)
    \(X, Y=\) random variables (probability terminology)
    \(\hat{y}=\) predicted (mean) response
    \(y_{i}=\) measured response for \(\mathrm{i}^{\text {th }}\) data point
    \(x_{i}=\) value of explanatory variable for \(\mathrm{ith}^{\text {th }}\) data point
    \(X, Y=\) random variables (probability terminology)
    \(\hat{y}=\) predicted (mean) response
    \(Y_{i}=\) measured response for \(\mathrm{i}^{\text {th }}\) data point
    \(X_{i}=\) value of explanatory variable (解释变量,其实就是自变量) for \(\mathrm{i}^{\text {th }}\) data point
    \(\beta_{k}=\) true model parameters (can never be known)
    \(b_{k}=\) best fit model parameters for this data set (sample); our estimate for \(\beta_{k}\).
    \(\varepsilon_{i}=\) true value of \(\mathrm{i}^{\text {th }}\) residual (from true model, not known)
    \(e_{i}=\) actual ith residual for the current model


Example Model

  • Straight line model:$$ \begin{aligned} &f(x, \beta)=\beta_{0}+\beta_{1} x \\ &\hat{y}_{i}=\beta_{0}+\beta_{1} x_{i} \\ &y_{i}=\beta_{0}+\beta_{1} x_{i}+\varepsilon_{i} \end{aligned} $$
  • Regression produces the “best” estimate of the model given the data \(\left(x_{i}, y_{i}\right)\) :$$ y_{i}=b_{0}+b_{1} x_{i}+e_{i} $$

Models for Linear Regression

  • We use linear regression for linear parameter models: \(\hat{y}\) is directly proportional to each unknown model coefficient
      • \(\hat{y}=\sum_{k} \beta_{k} f_{k}(x)\) for bivariate data (双变量数据)
      • Example: \(\hat{y}=\beta_{0}+\beta_{1} x+\beta_{2} x^{2}+\beta_{3} \ln (x)\)
  • Multivariate data: two or more explanatory variables (we’ll call them \(x_{1}, x_{2}\), etc.)

Nonlinear Regression

  • We call our regression nonlinear if it is nonlinear in the coefficients
      • Linear regression: \(\hat{y}=\beta_{0}+\beta_{1} \ln (x)+\beta_{2} x^{3}\)
      • Nonlinear regression: \(\hat{y}=\beta_{0} e^{\beta_{1} x}\)
      • (非)线性回归,其实就是说应变量能不能表示为多个待定系数的线性组合,其中每一个待定系数都和一个含有自变量\(x \)的表达式相乘。比如函数\(f(x, \boldsymbol{\beta})=\displaystyle\frac{\beta_{1} x}{\beta_{2}+x}\)是非线性的,因为它不能表示为两个\(\beta\)的线性组合。非线性函数的其他示例包括指数函数 , 对数函数 , 三角函数 , 幂函数 , 高斯函数和洛伦兹曲线 。 某些函数(如指数函数或对数函数)可以进行转换,以使它们是线性的。 如此转换,可以执行标准线性回归,但必须谨慎应用。
  • Linear regression is relatively easy
    — Numerically stable with unique solution given a reasonable definition of “best” fit
  • Nonlinear regression is relatively hard



Regression Review


  • Experiment — explanatory variables are manipulated (controlled), all other inputs are held constant, and the response variable is measured
  • Model — a mathematical function that approximates the data and is useful for making predictions
  • Regression — a method of finding the best fit of a model to a given set of data through the adjustment of model parameters (coefficients)
    — What do we mean by “best fit”?
  • Residual: \(\varepsilon_{i}=y_{i}-f\left(\boldsymbol{x}_{\boldsymbol{i}}, \boldsymbol{\beta}\right), e_{i}=y_{i}-f\left(\boldsymbol{x}_{\boldsymbol{i}}, \boldsymbol{b}\right)\)
    residual可能是error in the mdel and/or model;这里的\(\boldsymbol{\beta}\)是model理论上参数,我们无法知道,但是可以用\(\boldsymbol{b}\)去估计。
  • A “good” model has small residuals
      • Small mean, \(|\bar{\varepsilon}|\)
      • Small variance (方差), \(\sigma_{\varepsilon}^{2}\)
  • Much of our model and regression checking will involve studying and plotting residuals
    — Plot \(e_{i} v s . \hat{y}_{i}\) (works for multiple regression)


What Do We Mean by “Best Fit”?

  • First, assume \(\varepsilon\) is a random variable that does not depend on \(x\)
      • There are no systematic errors
      • The model is perfect
  • Desired properties of a “best fit” estimator
      • Unbiased, \(\sum \varepsilon_{i}=\sum e_{i}=0\left(\right.\) and \(\left.E\left[b_{k}\right]=\beta_{k}\right)\)(其实就是实验次数无穷大时的收敛值)
      • Efficient, minimum variance \(\sigma_{e}^{2}\left(\right.\) and \(\left.\operatorname{var}\left(b_{k}\right)\right)\)
      • Maximum likelihood (for an assumed pdf of \(\varepsilon\) )
      • Robust (to be discussed later)
        鲁棒性,if I have a or a few bad data point(s), how much does it screw up answer。


Maximum Likelihood Estimation

  • Likelihood function: the probability of getting this exact data set given a known model and model parameters
    — To calculate the likelihood function, we need to know the joint probability distribution function (pdf) of \(\varepsilon\)
  • Maximum Likelihood: what parameter values maximize the likelihood function?
      • Take the partial derivative of the likelihood function (or more commonly the log-likelihood function) with respect to each parameter, set it equal to zero
      • Solve resulting equations simultaneously


MLE Example – straight line

  • Let \(y_{i}=\beta_{0}+\beta_{1} x_{i}+\varepsilon_{i}, \varepsilon_{i} \sim N\left(0, \sigma_{\varepsilon}\right)\), iid
  • Since each \(\varepsilon_{i}\) is independent, the likelihood function for the entire data set is$$ L=\prod_{i=1}^{n} \mathbb{P}\left(y_{i}\right)=\prod_{i=1}^{n} \mathbb{P}\left(\varepsilon_{i}\right) $$
  • Since the residuals are iid Normal,$$ L=\left(\frac{1}{\sqrt{2 \pi} \sigma_{\varepsilon}}\right)^{n} \exp \left[-\frac{1}{2} \sum_{i=1}^{n} \frac{\varepsilon_{i}^{2}}{\sigma_{\varepsilon}^{2}}\right] $$
  • Define chi-square as$$ \chi^{2}=\sum_{i=1}^{n} \frac{\varepsilon_{i}^{2}}{\sigma_{\varepsilon}^{2}} \quad \text { (also called weighted SSE) } $$
  • Maximizing \(L\) is the same as minimizing \(\chi^{2}\)
  • Solve these equations simultaneously$$ \frac{\partial \chi^{2}}{\partial \beta_{k}}=0 \text { for all } k \text { (all coefficients) } $$
  • For our line model, \(\varepsilon_{i}=y_{i}-\left(\beta_{0}+\beta_{1} x_{i}\right)\)
  • Substitute our estimate for \(\beta_{0}\) into \(\chi^{2}\)


Assumptions for this MLE

  • We assumed three things:
    • Each residual (and \(y_{i}\) ) is independent
    • Each residual (and \(y_{i}\) ) is identically distributed
    • Each \(\varepsilon_{i} \sim N\left(0, \sigma_{\varepsilon}\right)\), and thus \(y_{i} \sim N\left(\hat{y}, \sigma_{\varepsilon}\right)\)
    • We call this 【ordinary least squares】 (OLS)
  • If any of these assumptions are invalid, then our least-squares estimates will not be the maximum likelihood estimates


Properties of our OLS Solution

  • The parameters are unbiased estimators of the true parameters, with minimum variance compared to all other unbiased estimators (if our assumptions are correct)
  • \(\displaystyle\sum_{i=1}^{n} e_{i}=0\)
  • \(\displaystyle\sum_{i=1}^{n} y_{i}=\displaystyle\sum_{i=1}^{n} \hat{y}_{i}\), so that \(\bar{y}=\overline{\hat{y}}\)
  • \(\displaystyle\sum_{i=1}^{n} \hat{y}_{i} e_{i}=0\)
  • \(\displaystyle\sum_{i=1}^{n} x_{i} e_{i}=0\)
  • The best fit line goes through the point \((\bar{x}, \bar{y})\)

OLS Matrix Formulation

When we have multivariate data, it is most convenient to formulate the OLS/MLE problem using matrix math

  • \(i=1,2,3, \ldots, n\) data point index
  • \(k=0,1,2, \ldots, p-1\) predictor variable index
  • \(x_{i, k}=i^{t h}\) value for the \(k^{t h}\) predictor variable
  • \(y_{i}=i^{t h}\) response data value
  • \(\hat{y}=\beta_{0}+\beta_{1} x_{1}+\beta_{2} x_{2}+\cdots+\beta_{p-1} x_{p-1}\)

Model in matrix form: \(\boldsymbol{Y}=\boldsymbol{X} \boldsymbol{\beta}+\boldsymbol{\varepsilon}\)
(each row in \(X\) and \(\boldsymbol{Y}\) is a “data point”)

Sum of square errors$$ S S E=\sum_{i=1}^{n} \varepsilon_{i}^{2}=\boldsymbol{\varepsilon}^{T} \boldsymbol{\varepsilon}=(\boldsymbol{Y}-\boldsymbol{X} \boldsymbol{\beta})^{T}(\boldsymbol{Y}-\boldsymbol{X} \boldsymbol{\beta}) $$

Maximum Likelihood estimate (minimum SSE): $$ \frac{\partial \chi^{2}}{\partial \beta_{k}}=0=\sum_{i=1}^{n} x_{i, k} \varepsilon_{i} \text { giving } \boldsymbol{X}^{\boldsymbol{T}} \boldsymbol{\varepsilon}=\mathbf{0} $$Using the definition of the residual, \( \boldsymbol{X}^{T} \boldsymbol{X} \boldsymbol{\beta}=\boldsymbol{X}^{T} \boldsymbol{Y}\)

Minimum SSE occurs when the coefficients are estimated as $$ \begin{gathered} \widehat{\boldsymbol{\beta}}=\left(\boldsymbol{X}^{T} \boldsymbol{X}\right)^{-1} \boldsymbol{X}^{T} \boldsymbol{Y} \\ \widehat{\boldsymbol{Y}}=\boldsymbol{X} \widehat{\boldsymbol{\beta}}=\boldsymbol{X}\left(\boldsymbol{X}^{T} \boldsymbol{X}\right)^{-1} \boldsymbol{X}^{T} \boldsymbol{Y}=\boldsymbol{H} \boldsymbol{Y} \\ \boldsymbol{H}=\boldsymbol{X}\left(\boldsymbol{X}^{T} \boldsymbol{X}\right)^{-1} \boldsymbol{X}^{T}=\text { hat matrix } \\ \boldsymbol{e}=\boldsymbol{Y}-\widehat{\boldsymbol{Y}}=(\boldsymbol{I}-\boldsymbol{H}) \boldsymbol{Y} \end{gathered} $$注意\(\boldsymbol{I}\)表示的是identity matrix;\(\boldsymbol{H}\)和\(\boldsymbol{I}-\boldsymbol{H}\)都是symmetric;common alternate notation: \(b_{k}=\hat{\beta}_{k}\)

We can use our solution to calculate the covariance matrices:$$ \begin{gathered} \operatorname{cov}(\boldsymbol{e})=s_{e}^{2}(\boldsymbol{I}-\boldsymbol{H}), \quad \operatorname{var}\left(e_{i}\right)=s_{e}^{2}\left(1-h_{i i}\right) \\ \operatorname{cov}\left(e_{i}, e_{j}\right)=-s_{e}^{2} h_{i j} \\ \operatorname{cov}(\widehat{\boldsymbol{Y}})=s_{e}^{2} \boldsymbol{H}, \quad \operatorname{var}\left(\hat{y}_{i}\right)=s_{e}^{2} h_{i i} \\ \operatorname{cov}(\widehat{\boldsymbol{\beta}})=s_{e}^{2}\left(\boldsymbol{X}^{T} \boldsymbol{X}\right)^{-1} \end{gathered} $$其中\(s_{e}^{2}\)是Variance of the residuals。


MLE Straight-Line Regression

  • Our model: \(E[Y \mid X]=\beta_{0}+\beta_{1} X\)$$ \begin{aligned} &\beta_{1}=\frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} \\ &\beta_{0}=E[Y]-\beta_{1} E[X] \end{aligned} \quad \rho=\frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X) \operatorname{var}(Y)}} $$
  • Ordinary least squares (OLS) estimators:$$ b_{1}=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} \quad b_{0}=\bar{y}-b_{1} \bar{x} $$

Uncertainty of Fit Parameters

  • The regression fit is based on a sample of data$$ \text { Population Model: } y_{i}=\beta_{0}+\beta_{1} x_{i}+\varepsilon_{i} $$
  • To create confidence intervals for \(b_{0}, b_{1}\), and \(\hat{y}_{i}\), we need to know their sampling distributions
    — Given the assumption of \(\varepsilon_{i} \sim N\left(0, \sigma_{\varepsilon}^{2}\right)\), the parameter sampling distributions are unbiased and t-distributed (DF = n-2) degrees of freedom)
  • $$ \begin{gathered} \operatorname{var}\left(b_{1}\right)=\frac{\operatorname{var}(\varepsilon)}{n \operatorname{var}(X)} \quad \operatorname{var}\left(b_{0}\right)=\frac{\operatorname{var}(\varepsilon)}{n}\left(1+\frac{\bar{x}^{2}}{\operatorname{var}(X)}\right) \\ \operatorname{cov}\left(b_{0}, b_{1}\right)=-\bar{x} \operatorname{var}\left(b_{1}\right) \end{gathered} $$
  • Estimators: (其中\(p\)表示number of model parameters)$$ \begin{gathered} s_{e}^{2}=\frac{1}{n-p} \sum_{i=1}^{n} e_{i}^{2} \\ s_{b 1}^{2}=\frac{s_{e}^{2}}{(n-1) s_{x}^{2}} \quad s_{b 0}^{2}=\frac{s_{\varepsilon}^{2}}{n}+s_{b 1}^{2} \bar{x}^{2} \end{gathered} $$

Confidence Intervals

  • The sampling distributions are Student’s t with DF = n – 2
    • Slope CI: \(b_{1} \pm t_{n-2, \alpha} s_{b 1} \quad s_{b 1}^{2}=\displaystyle\frac{s_{\varepsilon}^{2}}{(n-1) s_{x}^{2}}\)
    • Intercept CI: \(b_{0} \pm t_{n-2, \alpha} s_{b 0} \quad s_{b 0}^{2}=\displaystyle\frac{s_{\varepsilon}^{2}}{n}+s_{b 1}^{2} \bar{x}^{2}\)
      其中\(n\)是Critical t-value,\(\alpha=\)是significance level (e.g., 0.05)

Uncertainty of Predictions

  • Uncertainty in \(\hat{y}_{i}\) comes from the spread of the residuals and from uncertainty in the best fit parameters \(b_{1}\) and \(b_{0}\)$$ \begin{aligned} \operatorname{var}\left(\hat{y}_{i}\right) &=\frac{\operatorname{var}(\varepsilon)}{n}\left(1+\frac{\left(x_{i}-\bar{x}\right)^{2}}{\operatorname{var}(X)}\right) \\ s_{\hat{y}_{i}}^{2} &=\frac{s_{\varepsilon}^{2}}{n}+s_{b 1}^{2}\left(x_{i}-\bar{x}\right)^{2} \end{aligned} $$
  • Uncertainty in a predicted new measurement \(\hat{y}_{\text {new }}\) adds additional uncertainty of a single measurement$$ s_{\hat{y}_{n e w}}^{2}=s_{\hat{y}_{i}}^{2}+s_{\varepsilon}^{2} $$

Uncertainty of Correlation Coefficient

$$ \rho=\frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X) \operatorname{var}(Y)}} \quad r=\frac{1}{n-1} \sum_{i=1}^{n}\left(\frac{x_{i}-\bar{x}}{s_{x}}\right)\left(\frac{y_{i}-\bar{y}}{s_{y}}\right) $$

  • The sampling distribution for \(r\) is about Student’s t (D F=n-2) only for \(\rho=0\). For this case$$ S_{r}=\frac{s_{\varepsilon}}{\sqrt{n-1} s_{y}} $$
  • For \(\rho \neq 0\), the sampling distribution is complicated
  • We’ll use the Fisher z-transformation: $$ z=\frac{1}{2} \ln \left(\frac{1+r}{1-r}\right) $$
  • When \(n>25, z\) is about normal with$$ E[z]=\frac{1}{2} \ln \left(\frac{1+\rho}{1-\rho}\right) \quad \operatorname{var}(z)=\frac{1}{n-3} $$

Assumptions in OLS Regression

  1. \(\varepsilon\) is a random variable that does not depend on \(x\) (i.e., the model is perfect, it properly accounts for the role of \(x\) in predicting \(y\) )
  2. \(\mathrm{E}\left[\varepsilon_{l}\right]=0\) (the population mean of the true residual is zero); this will always be true for a model with an intercept
  3. All \(\varepsilon_{i}\) are independent of each other (uncorrelated for the population, but not for a sample)
  4. All \(\varepsilon_{j}\) have the same probability density function (pdf), and thus the same variance (called homoscedasticity)
  5. \(\varepsilon \sim \mathrm{N}\left(0, \sigma_{\varepsilon}\right)\) (the residuals, and thus the \(y_{i}\), are normally distributed)
  6. The values of each \(x_{i}\) are known exactly

Checking the Assumptions

  • Do the assumptions in OLS regression hold?
    — Which assumptions can you validate?
    — If an assumption is violated, how far off is it?
  • If one or more assumptions do not hold, does the observed violation invalidate the statistical procedure used?
    — If so, what next?

Failed Assumptions – the Anscombe Problems

F. J. Anscombe, “Graphs in Statistical Analysis”, The American Statistician, Vol. 27, No. 1
(Feb., 1973) pp. 17 – 21.

What Happens When OLS Assumptions are Violated?

  • At best, the regression becomes inefficient
    — Do the assumptions in OLS regression holdThe uncertainty around the estimates is larger than you think: \(\operatorname{var}[\hat{\theta}]\) for some parameter \(\theta\)
  • At worst, the regression becomes biased
    — The results may be misleading: bias \([\hat{\theta}]\)
  • We want small mean square error (MSE)$$\operatorname{MSE}(\hat{\theta})=\operatorname{var}[\hat{\theta}]+(\operatorname{bias}[\hat{\theta}])^{2}$$

Checking Our Assumptions

  • Regression Diagnostics: checking for violations in any of the OLS assumptions
  • Topics we’ll address:
    • Normality of residuals
    • Outliers (identically distributed)
    • Leverage and influence
    • Heteroscedasticity (variation in variance)
    • Error in predictor variables
    • The wrong model
    • Correlated residuals

Fixing Problems

  • Regression Remediation: changing our regression to address diagnostic problems
  • Topics we’ll address:
    • Outlier removal or adjustment
    • Data transformation
    • Weighted regression
    • Total regression
    • Model building
    • Autocorrelation analysis



Response surface methodology



Leave a Reply