{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 泊松混合模型" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 泊松分布\n", "\n", "泊松分布(Poisson distribution)的概率分布函数为:\n", "\n", "$$P(X = k) = \\frac{\\lambda^{k}}{k !} e^{- \\lambda}, k = 0, 1, \\cdots \\tag {1}$$\n", "\n", "泊松分布的参数$\\lambda$表示单位时间(单位面积)内随机事件的平均发生次数。泊松分布适合于描述单位时间内随机事件发生的次数。\n", "\n", "泊松分布的期望$\\lambda$:,方差:$\\lambda$,特征函数:$\\phi(t) = \\exp(\\lambda (e^{i t} - 1))$\n", "\n", "### 泊松分布与二项分布\n", "\n", "当二项分布的$n$很大而$p$很小时,二项分布可近似为泊松分布,其中$\\lambda = np$。通常当$n \\geq 20, p \\leq 0.05$时,就可以用泊松分布近似。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 泊松混合模型\n", "\n", "泊松混合模型(Poisson mixture model)的概率密度函数为:\n", "\n", "$$p(y; \\theta) = \\sum_{k}^{K} \\alpha_{k} \\frac{\\lambda^{k}}{k !} e^{- \\lambda} \\tag {2}$$\n", "\n", "其中,$\\alpha_{k} \\geq 0$,$\\sum_{k}^{K} \\alpha_{k} = 1$。\n", "\n", "给定由$K$个Poisson分布生成的$N$条独立数据构成数据集$D = {y_{j} | j = 1, 2, \\cdots, N}$,估计泊松混合模型的参数$\\theta_{k} = (\\alpha_{k}, \\lambda_{k} | k = 1, 2, \\cdots, K)$。\n", "\n", "1. 隐变量\n", "\n", "观测数据$y_{j}$已知,反映观测数据$y_{j}$来自第$k$个分模型的数据未知,以隐变量$\\gamma_{k}$表示:\n", "\n", "$$\\gamma_{jk} = \\begin{cases}\n", "1, & \\text{第}j\\text{个观测来自第}k\\text{个分模型} \\\\\n", "0, & 否则\n", "\\end{cases}\n", "j = 1, 2, \\cdots, N; k = 1, 2, \\cdots, K \\tag {3}$$\n", "\n", "未观测数据$\\gamma_{jk}$为`0-1`随机变量,则完全数据为\n", "\n", "$$(y_{j}, \\gamma_{j1}, \\cdots, \\gamma_{jK}), j = 1, 2, \\cdots, N$$\n", "\n", "完全数据的似然函数:\n", "\n", "$$\\begin{aligned}\n", "P(y, \\gamma; \\theta) & = \\prod_{j = 1}^{N} P(y_{j}, \\gamma_{j1}, \\cdots, \\gamma_{jK}; \\theta) \\\\\n", "& = \\prod_{j = 1}^{N} \\prod_{k = 1}^{K} [\\alpha_{k} P(y_{j}; \\theta_{k})]^{\\gamma_{jk}} \\\\\n", "& = \\prod_{k = 1}^{K} \\alpha_{k}^{n_{k}} \\prod_{j = 1}^{N} [P(y_{j}; \\theta_{k})]^{\\gamma_{jk}} \\\\\n", "& = \\prod_{k = 1}^{K} \\alpha_{k}^{n_{k}} \\prod_{j = 1}^{N} \\left[\n", " \\frac{\\lambda_{k}^{y_{j}}}{y_{j} !} e^{- \\lambda_{k}}\n", "\\right]^{\\gamma_{jk}}\n", "\\end{aligned}$$\n", "\n", "其中,$n_{k} = \\sum_{j = 1}^{N} \\gamma_{jk}$,$N = \\sum_{k = 1}^{K} n_{k}$。完全数据的对数似然函数为\n", "\n", "$$\\begin{aligned}\n", "\\log P(y, \\gamma; \\theta) = \\sum_{k = 1}^{K} \\left( n_{k} \\log \\alpha_{k} + \\sum_{j = 1}^{N} \\gamma_{jk} \\left[\n", " y_{j} \\log \\lambda_{k} - \\log y_{j}! - \\lambda_{k}\n", "\\right] \\right) \\\\\n", "\\end{aligned} \\tag {4}$$\n", "\n", "2. E步,$Q$函数\n", "\n", "$$\\begin{aligned}\n", "Q(\\theta, \\theta^{(i)})\n", "& = \\text{E}_{\\gamma} \\left[ \\log P(y, \\gamma; \\theta) | y; \\theta^{(i)} \\right] \\\\\n", "& = \\sum_{k = 1}^{K} \\left(\n", " \\sum_{j = 1}^{N} \\text{E}[\\gamma_{jk} | y_{j}; \\theta^{(i)}] \\log \\alpha_{k} + \\sum_{j = 1}^{N} \\text{E}[\\gamma_{jk} | y_{j}; \\theta^{(i)}] \\left[\n", " y_{j} \\log \\lambda_{k} - \\log y_{j}! - \\lambda_{k}\n", "\\right] \\right) \\\\\n", "\\end{aligned}$$\n", "\n", "令$\\hat{\\gamma}_{jk} = \\text{E}[\\gamma_{jk} | y_{j}; \\theta^{(i)}]$,表示当前模型参数下第$j$个观测数据来自第$k$个分模型的概率,称为分模型$k$对观测数据$y_{j}$的响应度。\n", "\n", "$$\\begin{aligned}\n", "\\hat{\\gamma}_{jk} & = \\text{E}[\\gamma_{jk} | y_{j}; \\theta^{(i)}] \\\\\n", "& = \\frac{\n", " P(y_{j} | \\gamma_{jk} = 1; \\theta^{(i)}) P(\\gamma_{jk} = 1; \\theta^{(i)})\n", "}{\n", " \\sum_{k = 1}^{K} P(y_{j} | \\gamma_{jk} = 1; \\theta^{(i)}) P(\\gamma_{jk} = 1; \\theta^{(i)})\n", "} \\\\\n", "& = \\frac{\n", " \\alpha_{k}^{(i)} \\frac{\\left( \\lambda_{k}^{(i)} \\right)^{y_{j}}}{y_{j} !} e^{- \\lambda_{k}^{(i)}}\n", "}{\n", " \\sum_{k = 1}^{K} \\alpha_{k}^{(i)} \\frac{ \\left( \\lambda_{k}^{(i)} \\right)^{y_{j}}}{y_{j} !} e^{- \\lambda_{k}^{(i)}}\n", "} \\\\\n", "\\end{aligned} \\tag {5}$$\n", "\n", "则$n_{k} = \\sum_{j = 1}^{N} E[\\hat{\\gamma}_{jk}]$\n", "\n", "$$\\begin{aligned}\n", "Q(\\theta, \\theta^{(i)}) = \\sum_{k = 1}^{K} \\left(\n", " \\sum_{j = 1}^{N} \\hat{\\gamma}_{jk} \\log \\alpha_{k} + \\sum_{j = 1}^{N} \\hat{\\gamma}_{jk} \\left[\n", " y_{j} \\log \\lambda_{k} - \\log y_{j}! - \\lambda_{k}\n", "\\right] \\right) \\\\\n", "\\end{aligned} \\tag {6}$$\n", "\n", "3. M步\n", "\n", "求使函数$Q(\\theta, \\theta^{(i)})$取极大值时的参数$\\theta^{(i + 1)} = (\\hat{\\lambda}_{k}, \\hat{\\alpha}_{k} | k = 1, 2, \\cdots, K)$:\n", "\n", "$$\\begin{aligned}\n", "\\theta^{(i + 1)} & = \\argmax_{\\theta} Q(\\theta, \\theta^{(i)}) \\\\\n", "& \\text{s.t.} \\sum_{k = 1}^{K} \\alpha_{k} = 1, \\ \\alpha_{k} \\geq 1\n", "\\end{aligned}$$\n", "\n", "通过拉格朗日乘数法可得:\n", "\n", "$$\\hat{\\lambda}_{k} = \\frac{\\sum_{j = 1}^{N} \\hat{\\gamma}_{jk} y_{j}}{\\sum_{j = 1}^{N} \\hat{\\gamma}_{jk}} \\tag {7}$$\n", "\n", "$$\\hat{\\alpha}_{k} = \\frac{n_k}{N} = \\frac{\\sum_{j = 1}^{N} \\hat{\\gamma}_{jk}}{N} \\tag {8}$$\n", "\n", "### 泊松混合模型EM参数估计\n", "\n", "输入:观测数据$y_{1}, y_{2}, \\cdots,y_{N}$,泊松混合模型;\n", "\n", "输出:泊松混合模型参数。\n", "\n", "(1)参数初始化\n", "\n", "(2)E步:依据当前模型参数,计算分模型$k$对观测数据$y_{j}$的响应度\n", "\n", "$$\\begin{aligned}\n", "\\hat{\\gamma}_{jk} = \\frac{\n", " \\alpha_{k}^{(i)} \\frac{\\left( \\lambda_{k}^{(i)} \\right)^{y_{j}}}{y_{j} !} e^{- \\lambda_{k}^{(i)}}\n", "}{\n", " \\sum_{k = 1}^{K} \\alpha_{k}^{(i)} \\frac{ \\left( \\lambda_{k}^{(i)} \\right)^{y_{j}}}{y_{j} !} e^{- \\lambda_{k}^{(i)}}\n", "} \\\\\n", "\\end{aligned}$$\n", "\n", "(3)M步:计算新一轮迭代的模型参数\n", "\n", "$$\\begin{aligned}\n", "\\hat{\\lambda}_{k} & = \\frac{\\sum_{j = 1}^{N} \\hat{\\gamma}_{jk} y_{j}}{\\sum_{j = 1}^{N} \\hat{\\gamma}_{jk}} \\\\\n", "\\hat{\\alpha}_{k} & = \\frac{n_k}{N} = \\frac{\\sum_{j = 1}^{N} \\hat{\\gamma}_{jk}}{N} \\\\\n", "k & = 1, 2, \\cdots, K\n", "\\end{aligned}$$\n", "\n", "(4)重复(2)、(3),直至收敛(对数似然函数值不再有明显的变化)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }