apriori: Frequent Itemsets via the Apriori Algorithm

Apriori function to extract frequent itemsets for association rule mining

> from mlxtend.frequent_patterns import apriori
Overview

Apriori is a popular algorithm [1] for extracting frequent itemsets with applications in association rule learning. The Apriori algorithm has been designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered "frequent" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database.
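To make the support definition above concrete, here is a minimal, self-contained sketch; the tiny example transactions below are made up for illustration and are not part of the dataset used later in this guide:

```python
# Hypothetical mini-database of four transactions, for illustration only
transactions = [
    {'Milk', 'Bread'},
    {'Milk', 'Eggs'},
    {'Milk', 'Bread', 'Eggs'},
    {'Bread'},
]

itemset = {'Milk', 'Bread'}

# Support = fraction of transactions that contain all items of the itemset
support = sum(itemset <= t for t in transactions) / len(transactions)
print(support)  # 0.5 -> "frequent" under a 0.5 (50%) support threshold
```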
References

[1] Agrawal, Rakesh, and Ramakrishnan Srikant. "Fast algorithms for mining association rules." Proc. 20th Int. Conf. Very Large Data Bases, VLDB. Vol. 1215. 1994.
Example 1 -- Generating Frequent Itemsets

The apriori function expects data in a one-hot encoded pandas DataFrame. Suppose we have the following transaction data:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
We can transform it into the right format via the TransactionEncoder as follows:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df
|   | Apple | Corn | Dill | Eggs | Ice cream | Kidney Beans | Milk | Nutmeg | Onion | Unicorn | Yogurt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | True | False | True | True | True | True | False | True |
| 1 | False | False | True | True | False | True | False | True | True | False | True |
| 2 | True | False | False | True | False | True | True | False | False | False | False |
| 3 | False | True | False | False | False | True | True | False | False | True | True |
| 4 | False | True | False | True | True | True | False | False | True | False | False |
Now, let us return the items and itemsets with at least 60% support:
from mlxtend.frequent_patterns import apriori
apriori(df, min_support=0.6)
|   | support | itemsets |
|---|---|---|
| 0 | 0.8 | (3) |
| 1 | 1.0 | (5) |
| 2 | 0.6 | (6) |
| 3 | 0.6 | (8) |
| 4 | 0.6 | (10) |
| 5 | 0.8 | (3, 5) |
| 6 | 0.6 | (8, 3) |
| 7 | 0.6 | (5, 6) |
| 8 | 0.6 | (8, 5) |
| 9 | 0.6 | (10, 5) |
| 10 | 0.6 | (8, 3, 5) |
By default, apriori returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set use_colnames=True to convert these integer values into the respective item names:
apriori(df, min_support=0.6, use_colnames=True)
|   | support | itemsets |
|---|---|---|
| 0 | 0.8 | (Eggs) |
| 1 | 1.0 | (Kidney Beans) |
| 2 | 0.6 | (Milk) |
| 3 | 0.6 | (Onion) |
| 4 | 0.6 | (Yogurt) |
| 5 | 0.8 | (Eggs, Kidney Beans) |
| 6 | 0.6 | (Eggs, Onion) |
| 7 | 0.6 | (Kidney Beans, Milk) |
| 8 | 0.6 | (Kidney Beans, Onion) |
| 9 | 0.6 | (Yogurt, Kidney Beans) |
| 10 | 0.6 | (Kidney Beans, Eggs, Onion) |
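As an aside, the index-to-name conversion that use_colnames performs can also be reproduced by hand from the DataFrame's columns. The following is just an illustrative sketch (frequent_raw is an arbitrary variable name chosen here), not how apriori works internally:

```python
# Run apriori with the default integer column indices ...
frequent_raw = apriori(df, min_support=0.6)

# ... and map each index back to the corresponding item name via df.columns
frequent_raw['itemsets'] = frequent_raw['itemsets'].apply(
    lambda idxs: frozenset(df.columns[i] for i in idxs))
frequent_raw
```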
Example 2 -- Selecting and Filtering Results

One advantage of working with pandas DataFrames is that we can use their convenient features to filter the results. For instance, let's assume we are only interested in itemsets of length 2 that have a support of at least 80 percent. First, we create the frequent itemsets via apriori and add a new column that stores the length of each itemset:
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets
|   | support | itemsets | length |
|---|---|---|---|
| 0 | 0.8 | (Eggs) | 1 |
| 1 | 1.0 | (Kidney Beans) | 1 |
| 2 | 0.6 | (Milk) | 1 |
| 3 | 0.6 | (Onion) | 1 |
| 4 | 0.6 | (Yogurt) | 1 |
| 5 | 0.8 | (Eggs, Kidney Beans) | 2 |
| 6 | 0.6 | (Eggs, Onion) | 2 |
| 7 | 0.6 | (Kidney Beans, Milk) | 2 |
| 8 | 0.6 | (Kidney Beans, Onion) | 2 |
| 9 | 0.6 | (Yogurt, Kidney Beans) | 2 |
| 10 | 0.6 | (Kidney Beans, Eggs, Onion) | 3 |
Then, we can select the results that satisfy our desired criteria as follows:
frequent_itemsets[ (frequent_itemsets['length'] == 2) &
(frequent_itemsets['support'] >= 0.8) ]
|   | support | itemsets | length |
|---|---|---|---|
| 5 | 0.8 | (Eggs, Kidney Beans) | 2 |
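The same filter can also be written with pandas' query method if you prefer that style; this is merely an equivalent alternative to the boolean indexing shown above:

```python
# Equivalent selection using DataFrame.query
frequent_itemsets.query("length == 2 and support >= 0.8")
```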
Similarly, using the pandas API, we can select entries based on the 'itemsets' column:
frequent_itemsets[ frequent_itemsets['itemsets'] == {'Onion', 'Eggs'} ]
|   | support | itemsets | length |
|---|---|---|---|
| 6 | 0.6 | (Eggs, Onion) | 2 |
Frozensets

Note that the entries in the 'itemsets' column are of type frozenset, which is a built-in Python type that behaves like a Python set but is immutable, making it more efficient for certain query or comparison operations (https://docs.python.org/3.6/library/stdtypes.html#frozenset). Since frozensets are sets, the item order does not matter. That is, the query
frequent_itemsets[ frequent_itemsets['itemsets'] == {'Onion', 'Eggs'} ]
is equivalent to any of the following three:
frequent_itemsets[ frequent_itemsets['itemsets'] == {'Eggs', 'Onion'} ]
frequent_itemsets[ frequent_itemsets['itemsets'] == frozenset(('Eggs', 'Onion')) ]
frequent_itemsets[ frequent_itemsets['itemsets'] == frozenset(('Onion', 'Eggs')) ]
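Besides exact matches, it is often useful to select every frequent itemset that contains a particular item. Here is a small sketch using frozenset's issuperset method (picking 'Onion' purely as an example):

```python
# Select all frequent itemsets that contain 'Onion'
frequent_itemsets[
    frequent_itemsets['itemsets'].apply(lambda x: x.issuperset({'Onion'}))
]
```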
Example 3 -- Working with Sparse Representations

To save memory, you may want to represent your transaction data in a sparse format. This is especially useful if you have lots of products and small transactions.
oht_ary = te.fit(dataset).transform(dataset, sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary, columns=te.columns_)
sparse_df
|   | Apple | Corn | Dill | Eggs | Ice cream | Kidney Beans | Milk | Nutmeg | Onion | Unicorn | Yogurt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | True | False | True | True | True | True | False | True |
| 1 | False | False | True | True | False | True | False | True | True | False | True |
| 2 | True | False | False | True | False | True | True | False | False | False | False |
| 3 | False | True | False | False | False | True | True | False | False | True | True |
| 4 | False | True | False | True | True | True | False | False | True | False | False |
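Optionally, you can check how sparse the encoded data actually is via pandas' sparse accessor; the lower the density, the more memory the sparse representation saves. This check is just a suggestion and is not required for apriori:

```python
# Fraction of stored (True) values in the sparse DataFrame
sparse_df.sparse.density
```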
apriori(sparse_df, min_support=0.6, use_colnames=True, verbose=1)
Processing 21 combinations | Sampling itemset size 3
|   | support | itemsets |
|---|---|---|
| 0 | 0.8 | (Eggs) |
| 1 | 1.0 | (Kidney Beans) |
| 2 | 0.6 | (Milk) |
| 3 | 0.6 | (Onion) |
| 4 | 0.6 | (Yogurt) |
| 5 | 0.8 | (Eggs, Kidney Beans) |
| 6 | 0.6 | (Eggs, Onion) |
| 7 | 0.6 | (Kidney Beans, Milk) |
| 8 | 0.6 | (Kidney Beans, Onion) |
| 9 | 0.6 | (Yogurt, Kidney Beans) |
| 10 | 0.6 | (Kidney Beans, Eggs, Onion) |
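The examples above only use min_support and use_colnames; the remaining parameters documented in the API section below can be combined in the same way. For instance, here is a usage sketch (output omitted) that caps the itemset length and enables the memory-frugal search:

```python
# Only generate itemsets with at most 2 items, using the slower,
# iterator-based search that keeps memory usage low
apriori(df, min_support=0.6, use_colnames=True, max_len=2, low_memory=True)
```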
API
apriori(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0, low_memory=False)
Get frequent itemsets from a one-hot DataFrame
Parameters
- df : pandas DataFrame

  pandas DataFrame in one-hot encoded format. Also supports DataFrames with sparse data; for more info, please see https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparse-data-structures

  Please note that the old pandas SparseDataFrame format is no longer supported in mlxtend >= 0.17.2.

  The allowed values are either 0/1 or True/False. For example,

    Apple Bananas Beer Chicken Milk Rice
    0 True False True True False True
    1 True False True False False True
    2 True False True False False False
    3 True True False False False False
    4 False False True True True True
    5 False False True False True True
    6 False False True False True False
    7 True True False False False False

- min_support : float (default: 0.5)

  A float between 0 and 1 for the minimum support of the itemsets returned. The support is computed as the fraction transactions_where_item(s)_occur / total_transactions.

- use_colnames : bool (default: False)

  If True, uses the DataFrames' column names in the returned DataFrame instead of column indices.

- max_len : int (default: None)

  Maximum length of the itemsets generated. If None (default), all possible itemset lengths (under the apriori condition) are evaluated.

- verbose : int (default: 0)

  Shows the number of iterations if >= 1 and low_memory is True. If >= 1 and low_memory is False, shows the number of combinations.

- low_memory : bool (default: False)

  If True, uses an iterator to search for combinations above min_support. Note that low_memory=True should only be used for large datasets when memory resources are limited, because this implementation is approx. 3-6x slower than the default.
Returns
pandas DataFrame with columns ['support', 'itemsets'] of all itemsets
that are >= min_support and whose length is <= max_len
(if max_len is not None).
Each itemset in the 'itemsets' column is of type frozenset,
which is a Python built-in type that behaves similarly to
sets except that it is immutable
(For more info, see
https://docs.python.org/3.6/library/stdtypes.html#frozenset).
Examples
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/