Probabilistic latent semantic analysis

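Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI, especially in information retrieval), is a statistical technique for analysing two-mode and co-occurrence data. In effect, one can derive a low-dimensional representation of the observed variables in terms of their affinity to certain hidden variables, just as in latent semantic analysis, from which PLSA evolved.

Standard latent semantic analysis stems from linear algebra and downsizes the occurrence tables (usually via a singular value decomposition). Probabilistic latent semantic analysis, by contrast, is based on a mixture decomposition derived from a latent class model.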

Model

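Graphical-model representation of the PLSA model ("asymmetric" formulation). $d$ is the document index variable, $c$ is a word's topic drawn from the document's topic distribution $P(c|d)$, and $w$ is a word drawn from the word distribution $P(w|c)$ of that topic. $d$ and $w$ are observable variables; the topic $c$ is a latent variable.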

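Given observations in the form of co-occurrences $(w,d)$ of words and documents, PLSA models the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions: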

$$P(w,d)=\sum_{c}P(c)\,P(d|c)\,P(w|c)=P(d)\sum_{c}P(c|d)\,P(w|c)$$

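where $c$ is the word's topic. Note that the number of topics is a hyperparameter that must be chosen in advance; it is not estimated from the data. The first formulation is the symmetric one, in which $w$ and $d$ are both generated from the latent class $c$ in a similar way (using the conditional probabilities $P(d|c)$ and $P(w|c)$). The second formulation is the asymmetric one: for each document $d$, a latent class $c$ is chosen conditionally on the document according to $P(c|d)$, and a word is then generated from that class according to $P(w|c)$. Although words and documents are used in this example, the co-occurrence of any pair of discrete variables can be modelled in exactly the same way.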
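The two formulations agree because of Bayes' rule, $P(c)P(d|c)=P(d)P(c|d)$. As a minimal sketch, the following Python snippet checks this numerically on a toy model. The probability tables and function names are made up for illustration and are not taken from any reference implementation.

```python
# Toy model: 2 topics, 2 documents, 3 words. All numbers are illustrative.
P_c = [0.6, 0.4]                                   # P(c)
P_d_given_c = [[0.5, 0.5], [0.2, 0.8]]             # P(d|c), one row per topic
P_w_given_c = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]   # P(w|c), one row per topic
TOPICS = range(len(P_c))


def joint_symmetric(w, d):
    """P(w,d) = sum_c P(c) P(d|c) P(w|c)."""
    return sum(P_c[c] * P_d_given_c[c][d] * P_w_given_c[c][w] for c in TOPICS)


def doc_marginal(d):
    """P(d) = sum_c P(c) P(d|c)."""
    return sum(P_c[c] * P_d_given_c[c][d] for c in TOPICS)


def topic_given_doc(c, d):
    """P(c|d) obtained from P(c) and P(d|c) by Bayes' rule."""
    return P_c[c] * P_d_given_c[c][d] / doc_marginal(d)


def joint_asymmetric(w, d):
    """P(w,d) = P(d) sum_c P(c|d) P(w|c)."""
    return doc_marginal(d) * sum(
        topic_given_doc(c, d) * P_w_given_c[c][w] for c in TOPICS
    )


for d in range(2):
    for w in range(3):
        # The two columns printed here should match up to rounding error.
        print(d, w, joint_symmetric(w, d), joint_asymmetric(w, d))
```

Both functions describe the same joint distribution; they differ only in which conditional tables are treated as the parameters.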

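The number of parameters is therefore $cd+wc$, where $c$, $d$ and $w$ here denote the numbers of topics, documents and distinct words; it grows linearly with the number of documents. In addition, although PLSA is a generative model of the documents in the collection it is estimated on, it is not a generative model of new documents.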
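To make the linear growth concrete, here is a small sketch that evaluates $cd+wc$ for a few collection sizes. The helper name and the sizes (20 topics, a 5,000-word vocabulary) are hypothetical, and the formula counts table entries rather than free parameters after normalisation.

```python
def plsa_parameter_count(num_topics, num_documents, vocabulary_size):
    """Entries in the P(c|d) and P(w|c) tables, i.e. c*d + w*c."""
    return num_topics * num_documents + vocabulary_size * num_topics


# Doubling the number of documents adds a fixed num_topics-per-document cost.
for num_documents in (1_000, 2_000, 4_000):
    print(num_documents, plsa_parameter_count(20, num_documents, 5_000))
```

Only the $cd$ term depends on the collection size, which is one way to see why the model has no parameters available for a document that was not in the training collection.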

The parameters are learned using the expectation-maximization (EM) algorithm.
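As an illustration, a plain EM loop for the asymmetric formulation might look like the sketch below. It assumes a dense count matrix `counts[d][w]` in which every document contains at least one word, uses random initialisation, and runs a fixed number of iterations. The function names (`fit_plsa`, `log_likelihood`) and the toy data are hypothetical, and refinements such as tempering or early stopping are left out.

```python
import math
import random


def normalize(weights):
    """Scale non-negative weights to sum to 1 (uniform if all are zero)."""
    total = sum(weights)
    if total == 0.0:
        return [1.0 / len(weights)] * len(weights)
    return [x / total for x in weights]


def fit_plsa(counts, num_topics, num_iters=50, seed=0):
    """Estimate P(c|d) and P(w|c) from counts[d][w] with plain EM."""
    rng = random.Random(seed)
    num_docs, num_words = len(counts), len(counts[0])
    p_c_given_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
                   for _ in range(num_docs)]
    p_w_given_c = [normalize([rng.random() + 0.1 for _ in range(num_words)])
                   for _ in range(num_topics)]
    for _ in range(num_iters):
        doc_topic = [[0.0] * num_topics for _ in range(num_docs)]
        topic_word = [[0.0] * num_words for _ in range(num_topics)]
        for d in range(num_docs):
            for w in range(num_words):
                if counts[d][w] == 0:
                    continue
                # E-step: P(c|d,w) is proportional to P(c|d) * P(w|c).
                post = [p_c_given_d[d][c] * p_w_given_c[c][w]
                        for c in range(num_topics)]
                norm = sum(post)
                if norm == 0.0:
                    continue  # pair has zero probability under current tables
                for c in range(num_topics):
                    expected = counts[d][w] * post[c] / norm
                    doc_topic[d][c] += expected
                    topic_word[c][w] += expected
        # M-step: renormalise the expected counts into probability tables.
        p_c_given_d = [normalize(row) for row in doc_topic]
        p_w_given_c = [normalize(row) for row in topic_word]
    return p_c_given_d, p_w_given_c


def log_likelihood(counts, p_c_given_d, p_w_given_c):
    """Sum over (d,w) of n(d,w) * log sum_c P(c|d) P(w|c)."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n_dw in enumerate(row):
            if n_dw:
                p_w_given_d = sum(p_c_given_d[d][c] * p_w_given_c[c][w]
                                  for c in range(len(p_w_given_c)))
                total += n_dw * math.log(p_w_given_d)
    return total


# Toy collection: 4 documents over a 6-word vocabulary (made-up counts).
toy_counts = [
    [4, 3, 2, 0, 0, 0],
    [3, 5, 1, 0, 1, 0],
    [0, 0, 1, 4, 3, 2],
    [0, 1, 0, 2, 5, 3],
]
p_c_given_d, p_w_given_c = fit_plsa(toy_counts, num_topics=2)
print(log_likelihood(toy_counts, p_c_given_d, p_w_given_c))
```

EM only finds a local optimum, so the fitted tables depend on the random seed; in practice several restarts are commonly compared by their log-likelihood.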

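Applications

PLSA may be used in a discriminative setting via Fisher kernels.^[1]

PLSA has applications in information retrieval and filtering, natural language processing, machine learning from text, and related areas.

The aspect model used in probabilistic latent semantic analysis has been reported to suffer from severe overfitting.^[2]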

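Extensions

Hierarchical extensions:
- Asymmetric: MASHA (Multinomial ASymmetric Hierarchical Analysis)^[3]
- Symmetric: HPLSA (Hierarchical Probabilistic Latent Semantic Analysis)^[4]

Generative models: the following models have been developed to address an often-criticized shortcoming of PLSA, namely that it is not a proper generative model for new documents.
- Latent Dirichlet allocation (LDA), which adds a Dirichlet prior on the per-document topic distribution

Higher-order data: although this is rarely discussed in the scientific literature, PLSA extends naturally to higher-order data (three modes and higher), i.e. it can model co-occurrences over three or more variables. In the symmetric formulation above, this is done simply by adding conditional probability distributions for the additional variables. This is the probabilistic analogue of non-negative tensor factorisation.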
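For instance, with one additional observed variable, written here as $z$ (a symbol introduced only for this illustration, e.g. an author or a time stamp), the symmetric formulation would presumably become:

$$P(w,d,z)=\sum_{c}P(c)\,P(d|c)\,P(w|c)\,P(z|c)$$

Each observed variable contributes one conditional distribution given the latent class $c$, and all observed variables are assumed conditionally independent given $c$.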

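History

PLSA is an instance of a latent class model (see the references therein), and it is related to non-negative matrix factorization.^[5]^[6] The present terminology was coined by Thomas Hofmann in 1999.^[7]

See also

Vector space model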

References

[1] Thomas Hofmann, Learning the Similarity of Documents: an information-geometric approach to document retrieval and categorization, Advances in Neural Information Processing Systems 12, pp. 914-920, MIT Press, 2000
[2] Blei, David M.; Andrew Y. Ng; Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research. 2003, 3: 993–1022. doi:10.1162/jmlr.2003.3.4-5.993.
[3] Alexei Vinokourov and Mark Girolami, A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections, in Information Processing and Management, 2002
[4] Eric Gaussier, Cyril Goutte, Kris Popat and Francine Chen, A Hierarchical Model for Clustering and Categorising Documents, in "Advances in Information Retrieval -- Proceedings of the 24th BCS-IRSG European Colloquium on IR Research (ECIR-02)", 2002
[5] Chris Ding, Tao Li, Wei Peng (2006). "Nonnegative Matrix Factorization and Probabilistic Latent Semantic Indexing: Equivalence Chi-Square Statistic, and a Hybrid Method". AAAI 2006
[6] Chris Ding, Tao Li, Wei Peng (2008). "On the equivalence between Non-negative Matrix Factorization and Probabilistic Latent Semantic Indexing"
[7] Thomas Hofmann, Probabilistic Latent Semantic Indexing, Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999

