fisher精确检验

OR值计算

目标是不使用非python自带的模块完成fisher精确检验的计算。包括OR值、P值以及置信区间。

	GroupA	GroupB
StatusA	a	b
StatusB	c	d

fisher精确检验是小样本下对二联表进行计算。其中odds ratio即OR值的计算方式比较简单

\[OR = ad / bc\]

由于4个值中可能会存在0，导致计算所得的OR值为0或者无限大，这时可以进行haldane-anscombe校正，即将所有值都加上0.5

def cal_fisher_odds_ratio(a, b, c, d):
    if any(x == 0 for x in [a, b, c, d]):
        ha_scale = 0.5
        a, b, c, d = [x + ha_scale for x in [a, b, c, d]]
    odds_ratio = (a * d) / (b * c)
    return odds_ratio

P值计算

	GroupA	GroupB	求和
StatusA	3	0	3
StatusB	4	3	7
求和	7	3	10

fisher精确检验的P值需要先找到所有的状态。计算每个状态发生的概率，再把小于等于当前状态的概率求和，即是P值。

如上面的二联表，可以有以下4种状态

	GroupA	GroupB
StatusA	0	3
StatusB	7	0

	GroupA	GroupB
StatusA	1	2
StatusB	6	1

	GroupA	GroupB
StatusA	2	1
StatusB	5	2

当前状态	GroupA	GroupB
StatusA	3	0
StatusB	4	3

每个状态的概率计算公式如下：

\[prob = \frac{(a+b)!(c+d)!(a+d)!(b+d)!}{(a+b+c+d)!a!b!c!d!}\]

因此选择分两步做，先找到所有状态组合，然后再计算每个状态组合的概率，最后求得P值。这里存在的问题是OR值可能是经过校正后获得的，而P值则是使用未经校正的数值计算所得。

import math

# 计算概率
def cal_prob(a, b, c, d):
    n1 = a + b
    n2 = c + d
    n_1 = a + c
    n_2 = b + d
    total = n1 + n2
    numerator = math.factorial(n1) * math.factorial(n2) * math.factorial(n_1) * math.factorial(n_2)
    denominator = math.factorial(total) * math.factorial(a) * math.factorial(b) * math.factorial(c) * math.factorial(d)
    prob = numerator / denominator
    return prob

# 计算P值
def cal_fisher_p_value(a, b, c, d):
    table = [[a, b], [c, d]]
    current_prob = cal_prob(table)
    row_total = [a + b, c + d]
    col_total = [a + c, b + d]
    group = []
    # 取一个最小值
    set_range = min(row_total[0], col_total[0])
    for a_ in range(0, set_range + 1):
        if a_ != a:
            b_ = row_total[0] - a_
            c_ = col_total[0] - a_
            d_ = row_total[1] - c_
            check_value = any(j < 0 for j in [a_, b_, c_, d_])
            if not check_value:
                group.append([[a_, b_], [c_, d_]])

    # 计算各个组合的概率
    p = current_prob
    for g in group:
        g_prob = cal_prob(g)
        if g_prob <= current_prob:
            p += g_prob
    return p

置信区间计算

注意fisher方法的置信区间不是传统意义下的置信区间，一般取使得 P-value 大于0.05组成一个95%置信区间，会用到卡方检验临界值。

import math

def cal_fisher_limit(a, b, c, d, alpha=0.05):
    odds_ratio = cal_fisher_odds_ratio(a, b, c, d)
    p_value = cal_fisher_p_value(a, b, c, d)
    chi_squared_critical = -2 * math.log(alpha / 2)
    log_odds_ratio = math.log(odds_ratio)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower_limit = math.exp(log_odds_ratio - standard_error * math.sqrt(chi_squared_critical))
    upper_limit = math.exp(log_odds_ratio + standard_error * math.sqrt(chi_squared_critical))
    return [lower_limit, upper_limit]

上篇WordWriter，python自动化Word报告

下篇ggplot2绘制三角热图