Interpretability is one of the biggest challenges in machine learning. A model has more interpretability than another one if its decisions are easier for a human to comprehend. Some models are so complex and are internally structured in such a way that it’s almost impossible to understand how they reached their final results. These black boxes seem to break the association between raw data and final output, since several processes happen in between.
But in the universe of machine learning algorithms, some models are more transparent than others. Decision Trees are definitely one of them, and Linear Regression models are another. Their simplicity and straightforward approach turn them into an ideal tool for approaching different problems. Let's see how.
You can use Linear Regression models to analyze how salaries in a given place depend on features like experience, level of education, role, city they work in, and so on. Similarly, you can analyze if real estate prices depend on factors such as their areas, numbers of bedrooms, or distances to the city center.
In this post, I'll focus on Linear Regression models that examine the linear relationship between a dependent variable and one (Simple Linear Regression) or more (Multiple Linear Regression) independent variables.
Simple Linear Regression (SLR)
It is the simplest form of Linear Regression, used when there is a single input variable (predictor) for the output variable (target):
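Since the equation image from the original figure is not reproduced here, the SLR model can be written in the post's own notation as:

y = β0 + β1·X + ε

where β0 and β1 are the parameters to be estimated and ε is the random error term discussed below.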
The input or predictor variable is the variable that helps predict the value of the output variable. It is commonly referred to as X.
The output or target variable is the variable that we want to predict. It is commonly referred to as y.
The value of β0, also called the intercept, shows the point where the estimated regression line crosses the y axis, while the value of β1 determines the slope of the estimated regression line. The random error describes the random component of the linear relationship between the dependent and independent variable (the disturbance of the model, the part of y that X is unable to explain). The true regression model is usually never known (since we are not able to capture all the effects that impact the dependent variable), and therefore the value of the random error term corresponding to observed data points remains unknown. However, the regression model can be estimated by calculating the parameters of the model for an observed data set.
The idea behind regression is to estimate the parameters β0 and β1 from a sample. If we are able to determine the optimum values of these two parameters, then we will have the line of best fit that we can use to predict the values of y, given the value of X. In other words, we try to fit a line that captures the relationship between the input and output variables, and then use it to predict the output for unseen inputs.
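Concretely, once the estimates of β0 and β1 have been computed (written here as b0 and b1 to distinguish them from the true, unknown parameters), the line of best fit used for prediction is:

ŷ = b0 + b1·X

where ŷ denotes the predicted value of y.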
How do we estimate β0 and β1? We can use a method called Ordinary Least Squares (OLS). The goal is to make the distance from the observed data points to the fitted regression line as close to zero as possible, which is done by minimizing the squared differences between the actual and predicted outcomes.
The difference between the actual and predicted values is called the residual (e), and it can be negative or positive depending on whether the model overpredicted or underpredicted the outcome. Hence, to calculate the net error, adding all the residuals directly can lead to the cancellation of terms and a reduction of the net effect. To avoid this, we take the sum of squares of these error terms, which is called the Residual Sum of Squares (RSS).
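Written out in the same notation, the residual for each observation i and the resulting RSS are:

eᵢ = yᵢ − ŷᵢ
RSS = e₁² + e₂² + … + eₙ² = Σ (yᵢ − ŷᵢ)²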
The Ordinary Least Squares (OLS) method minimizes the residual sum of squares: its objective is to fit a regression line that minimizes the squared distances from the observed values to the values predicted by that line.
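For SLR, this minimization problem has a well-known closed-form solution (a standard result, stated here for reference since the original post does not show the derivation):

b1 = Σ (xᵢ − x̄)(yᵢ − ȳ) / Σ (xᵢ − x̄)²
b0 = ȳ − b1·x̄

where x̄ and ȳ are the sample means of X and y.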
Multiple Linear Regression (MLR)
It is the form of Linear Regression used when there are two or more predictors or input variables. Similar to the SLR model described before, it includes additional predictors:
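As before, the equation image from the original figure is not reproduced; in the same notation, the MLR model can be written as:

y = β0 + β1·X1 + β2·X2 + … + βn·Xn + ε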
Notice that the equation is just an extension of the Simple Linear Regression one, in which each input/predictor has its corresponding slope coefficient (β). The first β term (β0) is the intercept constant and is the value of y in the absence of all predictors (i.e. when all X terms are 0).
As the number of features grows, the complexity of our model increases and it becomes more difficult to visualize, or even comprehend, our data. Because there are more parameters in these models compared to SLR ones, more care is needed when working with them. Adding more terms will inherently improve the fit to the data, but the new terms may not have any real significance. This is dangerous because it can lead to a model that fits that data but doesn’t actually mean anything useful.
An example
The advertising dataset consists of the sales of a product in 200 different markets, along with advertising budgets for three different media: TV, radio, and newspaper. We'll use the dataset to predict the amount of sales (dependent variable), based on the TV, radio and newspaper advertising budgets (independent variables).
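A minimal sketch of how this model could be fit in Python with pandas and statsmodels is shown below; the file name advertising.csv and the column names TV, radio, newspaper and sales are assumptions for illustration, not taken from the original post:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Load the advertising dataset (assumed file name and column names)
df = pd.read_csv("advertising.csv")  # columns: TV, radio, newspaper, sales

# Fit a Multiple Linear Regression model with OLS:
# sales = b0 + b1*TV + b2*radio + b3*newspaper + error
model = smf.ols("sales ~ TV + radio + newspaper", data=df).fit()

# Inspect the estimated intercept and slope coefficients,
# along with their standard errors, t-statistics and R-squared
print(model.summary())

# Predict sales for a new combination of advertising budgets
new_budgets = pd.DataFrame({"TV": [100.0], "radio": [25.0], "newspaper": [10.0]})
print(model.predict(new_budgets))
```

The summary output reports one estimated coefficient per advertising channel plus the intercept, which mirrors the β0, β1, …, βn structure of the MLR equation above.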