
Lasso algorithm

In statistics and machine learning, the lasso (least absolute shrinkage and selection operator) is a regression analysis method that performs both feature selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model. It was introduced in 1996 by Robert Tibshirani, professor of statistics at Stanford University, building on Leo Breiman's nonnegative garrote (NNG).[1][2] The lasso was originally formulated for least squares models, and this simple case reveals a substantial amount about the behavior of the estimator, including its relationship to ridge regression (also known as Tikhonov regularization) and best subset selection, and the connection between lasso coefficient estimates and so-called soft thresholding. It also reveals that (as in standard linear regression) the coefficient estimates need not be unique if covariates are collinear.

Although originally defined for least squares, lasso regularization is easily extended to a wide variety of statistical models, including generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators.[3][4] The lasso's ability to perform subset selection relies on the form of the constraint and admits a variety of interpretations, including in terms of geometry, Bayesian statistics, and convex analysis.

The lasso is closely related to basis pursuit denoising.

History

Tibshirani originally introduced the lasso to improve the prediction accuracy and interpretability of regression models. It modifies the model-fitting process so that only a subset of the covariates is used in the final model rather than all of them. It builds on Breiman's nonnegative garrote, which has a similar purpose but operates somewhat differently.

Before the lasso, the most widely used method for choosing which covariates to include was stepwise selection. That approach only improves prediction accuracy in certain cases, such as when only a few covariates have a strong relationship with the outcome; in other situations, however, it can make prediction worse. At the time, ridge regression was the most popular technique for improving prediction accuracy. Ridge regression improves prediction error by shrinking large regression coefficients in order to reduce overfitting, but it does not perform covariate selection and therefore does not help to make the model more interpretable.

The lasso combines features of both of these approaches. By forcing the sum of the absolute values of the regression coefficients to be less than a fixed value, it forces certain coefficients to zero, effectively selecting a simpler model that excludes the corresponding covariates. This idea is similar to ridge regression, in which the sum of the squares of the coefficients is forced to be less than a fixed value; the difference is that ridge regression only shrinks the coefficients and never sets any of them to zero.

Basic form

The lasso was originally devised for least squares, and the least squares case is instructive because it illustrates many of the lasso's properties in a straightforward way.

Least squares

Consider a sample consisting of N cases, each of which consists of p covariates and a single outcome. Let $y_i$ be the outcome and $x_i := (x_1, x_2, \ldots, x_p)^T$ the covariate vector for the i-th case. Then the objective of the lasso is to solve

$$\min_{\beta_0, \beta} \left\{ \frac{1}{N} \sum_{i=1}^N \left( y_i - \beta_0 - x_i^T \beta \right)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \leq t. \; [1]$$

Here $t$ is a prespecified free parameter that determines the degree of regularization. Letting $X$ denote the covariate matrix, so that $X_{ij} = (x_i)_j$ and $x_i^T$ is the i-th row of $X$, the problem can be written more compactly as

$$\min_{\beta_0, \beta} \left\{ \frac{1}{N} \left\| y - \beta_0 1_N - X \beta \right\|_2^2 \right\} \quad \text{subject to} \quad \| \beta \|_1 \leq t,$$

where $\|\beta\|_p = \left( \sum_i |\beta_i|^p \right)^{1/p}$ is the standard $\ell^p$ norm and $1_N$ is an $N \times 1$ vector of ones.

Since $\hat\beta_0 = \bar{y} - \bar{x}^T \beta$, where $\bar{y}$ and $\bar{x}$ are the sample means of the outcome and of the covariates, we have

$$y_i - \hat\beta_0 - x_i^T \beta = y_i - (\bar{y} - \bar{x}^T \beta) - x_i^T \beta = (y_i - \bar{y}) - (x_i - \bar{x})^T \beta.$$

It is therefore standard to work with centered variables. In addition, the covariates are typically standardized so that $\sum_{i=1}^N x_{ij}^2 = 1$, which makes the solution independent of the measurement scale.

The objective can also be written as

$$\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{N} \left\| y - X \beta \right\|_2^2 \right\} \quad \text{subject to} \quad \| \beta \|_1 \leq t,$$

or, in the so-called Lagrangian form,

$$\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{N} \left\| y - X \beta \right\|_2^2 + \lambda \| \beta \|_1 \right\},$$

where the exact relationship between $t$ and $\lambda$ depends on the data.
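As an illustration of the Lagrangian form, the sketch below fits a lasso model with scikit-learn on synthetic data (the data and the chosen penalty are purely illustrative) and shows that some coefficients are driven exactly to zero. Note that scikit-learn's `alpha` plays the role of $\lambda$ up to the library's own scaling convention: it minimizes $(1/(2N))\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$.

```python
# Minimal sketch: lasso in Lagrangian form via scikit-learn (illustrative data).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p = 100, 10
X = rng.standard_normal((N, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]           # only the first three covariates matter
y = X @ beta_true + 0.5 * rng.standard_normal(N)

# scikit-learn minimizes (1/(2N)) * ||y - X b||_2^2 + alpha * ||b||_1,
# so alpha corresponds to the lambda above up to a factor of 2.
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)                          # several entries are exactly 0
print("selected covariates:", np.flatnonzero(model.coef_))
```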

Orthonormal covariates

Some basic properties of the lasso estimator can now be considered.

First, suppose that the covariates are orthonormal, so that $(x_i \mid x_j) = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta, or, equivalently, $X^T X = I$. Then subgradient methods can be used to show that

$$\hat\beta_j = S_{N\lambda}\!\left( \hat\beta_j^{\text{OLS}} \right) = \hat\beta_j^{\text{OLS}} \max\!\left( 0, 1 - \frac{N\lambda}{\left| \hat\beta_j^{\text{OLS}} \right|} \right), \qquad \text{where } \hat\beta^{\text{OLS}} = (X^T X)^{-1} X^T y. \; [1]$$

Here $S_\alpha$ denotes the soft thresholding operator, which translates values towards zero, setting them exactly to zero when they are small enough. A closely related operator, written $H_\alpha$, is the hard thresholding operator, which sets small values to zero while leaving larger values unchanged.
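Both operators can be written down directly; the following NumPy sketch implements them and applies them to an illustrative coefficient vector.

```python
# Soft and hard thresholding operators used in the closed-form solutions below.
import numpy as np

def soft_threshold(b, alpha):
    """S_alpha: move every entry towards zero by alpha; zero it if |b| <= alpha."""
    return np.sign(b) * np.maximum(np.abs(b) - alpha, 0.0)

def hard_threshold(b, alpha):
    """H_alpha: keep entries with |b| >= alpha unchanged, zero out the rest."""
    return b * (np.abs(b) >= alpha)

b_ols = np.array([3.0, -0.4, 1.2, 0.05, -2.5])   # illustrative OLS-style estimates
print(soft_threshold(b_ols, 0.5))   # every entry moved 0.5 towards zero
print(hard_threshold(b_ols, 0.5))   # small entries zeroed, others untouched
```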

This can be compared with ridge regression, whose objective is to minimize

$$\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{N} \left\| y - X \beta \right\|_2^2 + \lambda \| \beta \|_2^2 \right\},$$

which yields

$$\hat\beta_j = (1 + N\lambda)^{-1} \hat\beta_j^{\text{OLS}}.$$

Ridge regression therefore shrinks all of the OLS coefficients by the uniform factor $(1 + N\lambda)^{-1}$ and does not perform variable selection.

The lasso can likewise be compared with best subset selection, whose goal is to minimize

$$\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{N} \left\| y - X \beta \right\|_2^2 + \lambda \| \beta \|_0 \right\},$$

where $\|\cdot\|_0$ is the "$\ell^0$ norm", defined as the number of nonzero components of a vector. In this case, it can be shown that

$$\hat\beta_j = H_{\sqrt{N\lambda}}\!\left( \hat\beta_j^{\text{OLS}} \right) = \hat\beta_j^{\text{OLS}} \, \mathrm{I}\!\left( \left| \hat\beta_j^{\text{OLS}} \right| \geq \sqrt{N\lambda} \right),$$

where $H_\alpha$ is the hard thresholding operator introduced above and $\mathrm{I}$ is the indicator function.

In summary, the lasso estimates share features of both ridge regression and best subset selection: like ridge regression, the lasso shrinks coefficients, and like best subset selection, it sets some of them exactly to zero. However, whereas ridge regression scales all coefficients by a constant factor, the lasso translates them towards zero by a constant amount and sets them to zero once they reach it.
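Under an orthonormal design the three estimators can be compared directly through their closed forms. The sketch below (illustrative only, with an assumed value of $N\lambda$) contrasts the uniform shrinkage of ridge regression with the shrink-and-select behavior of the lasso and the keep-or-kill behavior of best subset selection.

```python
# Closed-form comparison of lasso, ridge and best subset selection when
# X^T X = I, using illustrative OLS estimates and an assumed value of N*lambda.
import numpy as np

def soft_threshold(b, alpha):
    return np.sign(b) * np.maximum(np.abs(b) - alpha, 0.0)

def hard_threshold(b, alpha):
    return b * (np.abs(b) >= alpha)

beta_ols = np.array([3.0, -0.4, 1.2, 0.05, -2.5])
n_lambda = 1.0                                             # N * lambda (assumed)

beta_lasso = soft_threshold(beta_ols, n_lambda)            # shrink and select
beta_ridge = beta_ols / (1.0 + n_lambda)                   # uniform shrinkage only
beta_subset = hard_threshold(beta_ols, np.sqrt(n_lambda))  # keep or kill

print(beta_lasso, beta_ridge, beta_subset, sep="\n")
```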


Correlated covariates

In the general case the covariates need not be independent. A special case occurs when two covariates, say j and k, are identical for every observation, so that $x_{(j)} = x_{(k)}$, i.e. $x_{(j),i} = x_{(k),i}$ for all i. In this case the values of $\beta_j$ and $\beta_k$ that minimize the lasso objective are not uniquely determined.

In fact, if there is a solution $\hat\beta$ with $\hat\beta_j \hat\beta_k \geq 0$, then replacing $\hat\beta_j$ by $s(\hat\beta_j + \hat\beta_k)$ and $\hat\beta_k$ by $(1 - s)(\hat\beta_j + \hat\beta_k)$ for any $s \in [0, 1]$, while keeping all other parameters fixed, yields another solution, so the lasso objective has a continuum of valid minimizers. Several variants of the lasso, such as elastic net regularization, were designed to address this shortcoming.
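The contrast can be seen by fitting both estimators on data with a duplicated column. In this sketch (synthetic data, illustrative penalties) a coordinate-descent lasso solver typically places all of the weight on one copy, whereas the elastic net, whose penalty adds an $\ell^2$ term, tends to split the weight between the copies.

```python
# Duplicated covariates: the lasso solution is not unique, and a solver will
# typically put all weight on one copy; the elastic net tends to share it.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(1)
N = 200
x = rng.standard_normal(N)
X = np.column_stack([x, x, rng.standard_normal(N)])   # columns 0 and 1 identical
y = 2.0 * x + 0.1 * rng.standard_normal(N)

print(Lasso(alpha=0.1).fit(X, y).coef_)                      # weight on one copy
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)   # weight shared
```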

General form

Lasso regularization can be extended to a wide variety of objective functions, such as those of generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators.[1][5] Given an objective function

$$\frac{1}{N} \sum_{i=1}^N f(x_i, y_i, \alpha, \beta),$$

the lasso regularized version of the estimator is the solution to

$$\min_{\alpha, \beta} \frac{1}{N} \sum_{i=1}^N f(x_i, y_i, \alpha, \beta) \quad \text{subject to} \quad \| \beta \|_1 \leq t,$$

where only $\beta$ is penalized, while $\alpha$ is free to take any permitted value, just as $\beta_0$ was not penalized in the basic case.
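For example, an $\ell^1$-penalized logistic regression, a lasso-regularized generalized linear model, can be fitted with scikit-learn as sketched below (synthetic data; the library's `C` is the inverse of the regularization strength rather than $\lambda$ itself, and the fitted intercept plays the role of the unpenalized $\alpha$ above).

```python
# A lasso-penalized GLM: l1-regularized logistic regression (illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
N, p = 300, 20
X = rng.standard_normal((N, p))
logits = 2.0 * X[:, 0] - 1.5 * X[:, 1]          # only two covariates are relevant
y = (logits + rng.logistic(size=N) > 0).astype(int)

# C is the inverse regularization strength; smaller C means a stronger penalty.
clf = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000)
clf.fit(X, y)
print("selected covariates:", np.flatnonzero(clf.coef_[0]))
```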

Interpretations

Geometric interpretation

 
Figure: Forms of the constraint regions for lasso and ridge regression.

The lasso can set coefficients to zero, while the superficially similar ridge regression cannot; the difference lies in the shape of their constraint regions. Both can be interpreted as minimizing the same objective function

$$\min_{\beta_0, \beta} \left\{ \frac{1}{N} \left\| y - \beta_0 1_N - X \beta \right\|_2^2 \right\},$$

but subject to different constraints: $\| \beta \|_1 \leq t$ for the lasso and $\| \beta \|_2^2 \leq t$ for ridge regression.

The figure shows that the constraint region defined by the $\ell^1$ norm is a square rotated so that its corners lie on the axes (in general a cross-polytope), while the region defined by the $\ell^2$ norm is a circle (in general an n-sphere), which is rotationally invariant and therefore has no corners. As seen in the figure, a convex object that lies tangent to the boundary, such as the line shown, is likely to encounter a corner (or a higher-dimensional equivalent) of the cross-polytope, for which some components of $\beta$ are identically zero, while in the case of an n-sphere the points on the boundary for which some of the components of $\beta$ are zero are not distinguished from the others, and the convex object is no more likely to contact a point at which some components of $\beta$ are zero than one for which none of them are.

Making λ easier to interpret with an accuracy-simplicity tradeoff

The lasso can be rescaled so that it becomes easy to anticipate and influence the degree of shrinkage associated with a given value of $\lambda$.[6] It is assumed that $X$ is standardized with z-scores and that $y$ is centered (zero mean). Let $\beta_0$ represent the hypothesized regression coefficients and let $b_{\text{OLS}}$ refer to the data-optimized ordinary least squares solutions. We can then define the Lagrangian as a tradeoff between the in-sample accuracy of the data-optimized solutions and the simplicity of sticking to the hypothesized values.[7] This results in

$$\min_{\beta \in \mathbb{R}^p} \left\{ \frac{(y - X\beta)'(y - X\beta)}{(y - X\beta_0)'(y - X\beta_0)} + 2\lambda \sum_{i=1}^p \frac{|\beta_i - \beta_{0,i}|}{q_i} \right\},$$

where $q_i$ is specified below. The first fraction represents relative accuracy, the second fraction relative simplicity, and $\lambda$ balances between the two.

 
Figure: Solution paths for the $\ell^1$ norm and $\ell^2$ norm when $b_{\text{OLS}} = 2$ and $\beta_0 = 0$.

Given a single regressor, relative simplicity can be defined by specifying $q_i$ as $|b_{\text{OLS}} - \beta_0|$, which is the maximum amount of deviation from $\beta_0$ when $\lambda = 0$. Assuming that $\beta_0 = 0$, the solution path can be defined in terms of $R^2$:

$$b_{\ell_1} = \begin{cases} (1 - \lambda / R^2) \, b_{\text{OLS}} & \text{if } \lambda \leq R^2, \\ 0 & \text{if } \lambda > R^2. \end{cases}$$

If $\lambda = 0$, the ordinary least squares (OLS) solution is used. The hypothesized value $\beta_0 = 0$ is selected if $\lambda$ is bigger than $R^2$. Furthermore, if $R^2 = 1$, then $\lambda$ represents the proportional influence of $\beta_0 = 0$. In other words, $\lambda \times 100\%$ measures in percentage terms the minimal amount of influence of the hypothesized value relative to the data-optimized OLS solution.

If an $\ell_2$-norm is used to penalize deviations from zero given a single regressor, the solution path is given by

$$b_{\ell_2} = \left( 1 + \frac{\lambda}{R^2 (1 - \lambda)} \right)^{-1} b_{\text{OLS}}.$$

Like $b_{\ell_1}$, $b_{\ell_2}$ moves in the direction of the point $(\lambda = R^2, b = 0)$ when $\lambda$ is close to zero; but unlike $b_{\ell_1}$, the influence of $R^2$ diminishes in $b_{\ell_2}$ if $\lambda$ increases (see figure).
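As a small illustration of these two single-regressor solution paths (a sketch with an arbitrary $b_{\text{OLS}}$ and $R^2$), the piecewise $\ell_1$ path and the smooth $\ell_2$ path can be evaluated directly:

```python
# Rescaled l1 and l2 solution paths for a single regressor, following the
# formulas above. The values of b_OLS and R^2 are illustrative assumptions.
import numpy as np

b_ols, r2 = 2.0, 0.75
lambdas = np.linspace(0.0, 0.9, 10)    # keep lambda < 1 to avoid dividing by zero

b_l1 = np.where(lambdas <= r2, (1.0 - lambdas / r2) * b_ols, 0.0)
b_l2 = b_ols / (1.0 + lambdas / (r2 * (1.0 - lambdas)))

for lam, b1, b2 in zip(lambdas, b_l1, b_l2):
    print(f"lambda={lam:.1f}  b_l1={b1:+.3f}  b_l2={b2:+.3f}")
```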
Given multiple regressors, the moment that a parameter is activated (i.e. allowed to deviate from $\beta_0$) is also determined by a regressor's contribution to $R^2$ accuracy. First,

$$R^2 = 1 - \frac{(y - Xb)'(y - Xb)}{(y - X\beta_0)'(y - X\beta_0)}.$$

An $R^2$ of 75% means that in-sample accuracy improves by 75% if the unrestricted OLS solutions are used instead of the hypothesized $\beta_0$ values. The individual contribution of deviating from each hypothesis can be computed with the $p \times p$ matrix

$$R^{\otimes} = (X' \tilde{y}_0)(X' \tilde{y}_0)' (X'X)^{-1} (\tilde{y}_0' \tilde{y}_0)^{-1},$$

where $\tilde{y}_0 = y - X\beta_0$. If $b = b_{\text{OLS}}$ when $R^2$ is computed, then the diagonal elements of $R^{\otimes}$ sum to $R^2$. The diagonal values of $R^{\otimes}$ may be smaller than 0 or, less often, larger than 1. If regressors are uncorrelated, then the $i$-th diagonal element of $R^{\otimes}$ simply corresponds to the $r^2$ value between $x_i$ and $y$.
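The matrix can be computed directly from its definition. The NumPy sketch below uses synthetic data and assumes $\beta_0 = 0$, so that $\tilde{y}_0 = y$; with $b = b_{\text{OLS}}$ the trace of $R^{\otimes}$ then matches $R^2$ up to floating-point error.

```python
# Computing R^{otimes} from its definition, assuming beta_0 = 0 (so y_tilde = y).
import numpy as np

rng = np.random.default_rng(3)
N, p = 500, 4
Z = rng.standard_normal((N, p))
X = (Z - Z.mean(axis=0)) / Z.std(axis=0)        # z-scored regressors, as assumed
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.standard_normal(N)
y = y - y.mean()                                # centered outcome, as assumed

Xy = (X.T @ y).reshape(-1, 1)                   # X' y_0, a p x 1 vector (y_0 = y here)
R_otimes = (Xy @ Xy.T) @ np.linalg.inv(X.T @ X) / (y @ y)

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
r2 = 1.0 - np.sum((y - X @ b_ols) ** 2) / (y @ y)
print(np.trace(R_otimes), r2)                   # the two values agree
```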

A rescaled version of the adaptive lasso can be obtained by setting $q_{\text{adaptive lasso},i} = |b_{\text{OLS},i} - \beta_{0,i}|$.[8] If regressors are uncorrelated, the moment that the $i$-th parameter is activated is given by the $i$-th diagonal element of $R^{\otimes}$. Assuming for convenience that $\beta_0$ is a vector of zeros,

$$b_i = \begin{cases} (1 - \lambda / R^{\otimes}_{ii}) \, b_{\text{OLS},i} & \text{if } \lambda \leq R^{\otimes}_{ii}, \\ 0 & \text{if } \lambda > R^{\otimes}_{ii}. \end{cases}$$

That is, if regressors are uncorrelated, $\lambda$ again specifies the minimal influence of $\beta_0$. Even when regressors are correlated, the first time that a regression parameter is activated occurs when $\lambda$ is equal to the highest diagonal element of $R^{\otimes}$.

These results can be compared to a rescaled version of the lasso by defining $q_{\text{lasso},i} = \frac{1}{p} \sum_l |b_{\text{OLS},l} - \beta_{0,l}|$, which is the average absolute deviation of $b_{\text{OLS}}$ from $\beta_0$. Assuming that regressors are uncorrelated, the moment of activation of the $i$-th regressor is given by

$$\tilde{\lambda}_{\text{lasso},i} = \frac{1}{p} \sqrt{R^{\otimes}_i} \sum_{l=1}^p \sqrt{R^{\otimes}_l}.$$

For $p = 1$, the moment of activation is again given by $\tilde{\lambda}_{\text{lasso},i} = R^2$. If $\beta_0$ is a vector of zeros and a subset of $p_B$ relevant parameters are equally responsible for a perfect fit of $R^2 = 1$, then this subset is activated at a $\lambda$ value of $\frac{1}{p}$. The moment of activation of a relevant regressor then equals $\frac{1}{p} \frac{1}{\sqrt{p_B}} p_B \frac{1}{\sqrt{p_B}} = \frac{1}{p}$. In other words, the inclusion of irrelevant regressors delays the moment that relevant regressors are activated by this rescaled lasso. The adaptive lasso and the lasso are special cases of a '1ASTc' estimator. The latter only groups parameters together if the absolute correlation among regressors is larger than a user-specified value.[6]

Bayesian interpretation

 
Figure: Laplace distributions are sharply peaked at their mean, with more probability density concentrated there than in a normal distribution.

Just as ridge regression can be interpreted as linear regression for which the coefficients have been assigned normal prior distributions, lasso can be interpreted as linear regression for which the coefficients have Laplace prior distributions. The Laplace distribution is sharply peaked at zero (its first derivative is discontinuous at zero) and it concentrates its probability mass closer to zero than does the normal distribution. This provides an alternative explanation of why lasso tends to set some coefficients to zero, while ridge regression does not.[1]
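Concretely, the correspondence can be sketched as follows (a standard derivation; the noise variance $\sigma^2$ and the Laplace scale $b$ are introduced here for illustration): with a Gaussian likelihood and independent Laplace(0, b) priors on the coefficients, maximum a posteriori (MAP) estimation is exactly a lasso problem.

```latex
% MAP estimation with a Laplace prior reduces to the lasso (sketch).
% Likelihood: y | X, beta ~ N(X beta, sigma^2 I); prior: beta_j ~ Laplace(0, b).
\begin{aligned}
\hat\beta_{\text{MAP}}
  &= \arg\max_{\beta} \; \log p(y \mid X, \beta) + \log p(\beta) \\
  &= \arg\max_{\beta} \; -\frac{1}{2\sigma^2}\,\|y - X\beta\|_2^2 \;-\; \frac{1}{b}\,\|\beta\|_1 + \text{const} \\
  &= \arg\min_{\beta} \; \frac{1}{N}\,\|y - X\beta\|_2^2 + \lambda\,\|\beta\|_1,
  \qquad \lambda = \frac{2\sigma^2}{N b}.
\end{aligned}
```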

Convex relaxation interpretation

Lasso can also be viewed as a convex relaxation of the best subset selection regression problem, which is to find the subset of at most $k$ covariates that results in the smallest value of the objective function for some fixed $k \leq n$, where n is the total number of covariates. The "$\ell^0$ norm", $\|\cdot\|_0$ (the number of nonzero entries of a vector), is the limiting case of "$\ell^p$ norms" of the form $\|x\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}$ (where the quotation marks signify that these are not really norms for $p < 1$, since $\|\cdot\|_p$ is not convex for $p < 1$, so the triangle inequality does not hold). Therefore, since $p = 1$ is the smallest value for which the "$\ell^p$ norm" is convex (and therefore actually a norm), lasso is, in some sense, the best convex approximation to the best subset selection problem, since the region defined by $\|x\|_1 \leq t$ is the convex hull of the region defined by $\|x\|_p \leq t$ for $p < 1$.

Applications

The LASSO has been applied in economics and finance, where it can improve prediction and select variables that are sometimes neglected, for example in corporate bankruptcy prediction[9] and in forecasting high-growth firms.[10]

See also

Dimensionality reduction
Feature selection

References

  1. ^ Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the lasso". Journal of the Royal Statistical Society, Series B (Methodological) 58 (1): 267–288. http://www.jstor.org/stable/2346178 (archived at the Internet Archive).
  2. ^ Breiman, Leo (1995). "Better Subset Regression Using the Nonnegative Garrote". Technometrics 37 (4): 373–384. ISSN 0040-1706. doi:10.2307/1269730. (Archived from the original on 2020-06-08.)
  3. ^ Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the Lasso". Journal of the Royal Statistical Society, Series B (Methodological) 58 (1): 267–288. (Archived from the original on 2020-11-17.)
  4. ^ Tibshirani, Robert (1997). "The Lasso Method for Variable Selection in the Cox Model". Statistics in Medicine 16 (4): 385–395. ISSN 1097-0258. doi:10.1002/(sici)1097-0258(19970228)16:4%3C385::aid-sim380%3E3.0.co;2-3. [permanent dead link]
  5. ^ Cite error: no content was provided for the reference named "Tibshirani 1997".
  6. ^ Hoornweg, Victor (2018). "Chapter 8". Science: Under Submission. Hoornweg Press. ISBN 978-90-829188-0-9.
  7. ^ Motamedi, Fahimeh; Sanchez, Horacio; Mehri, Alireza; Ghasemi, Fahimeh (October 2021). "Accelerating Big Data Analysis through LASSO-Random Forest Algorithm in QSAR Studies". Bioinformatics 37 (19): 469–475. ISSN 1367-4803. PMID 34979024. doi:10.1093/bioinformatics/btab659.
  8. ^ Zou, Hui (2006). "The Adaptive Lasso and Its Oracle Properties" (PDF).
  9. ^ Tian, Shaonan; Yu, Yan; Guo, Hui (2015). "Variable selection and corporate bankruptcy forecasts". Journal of Banking & Finance 52 (1): 89–100. doi:10.1016/j.jbankfin.2014.12.003.
  10. ^ Coad, Alex; Srhoj, Stjepan (2020). "Catching Gazelles with a Lasso: Big data techniques for the prediction of high-growth firms". Small Business Economics 55 (1): 541–565. doi:10.1007/s11187-019-00203-3.
