apriori: Frequent Itemsets via the Apriori Algorithm

Apriori function to extract frequent itemsets for association rule mining

> from mlxtend.frequent_patterns import apriori
Overview

Apriori is a popular algorithm [1] for extracting frequent itemsets with applications in association rule learning. The Apriori algorithm has been designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered "frequent" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database.
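To make the support definition above concrete, here is a minimal, self-contained sketch; the tiny example transactions below are made up for illustration and are not part of the dataset used later in this guide:

```python
# Hypothetical mini-database of four transactions, for illustration only
transactions = [
    {'Milk', 'Bread'},
    {'Milk', 'Eggs'},
    {'Milk', 'Bread', 'Eggs'},
    {'Bread'},
]

itemset = {'Milk', 'Bread'}

# Support = fraction of transactions that contain all items of the itemset
support = sum(itemset <= t for t in transactions) / len(transactions)
print(support)  # 0.5 -> "frequent" under a 0.5 (50%) support threshold
```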
References

[1] Agrawal, Rakesh, and Ramakrishnan Srikant. "Fast algorithms for mining association rules." Proc. 20th Int. Conf. Very Large Data Bases, VLDB. Vol. 1215. 1994.
Example 1 -- Generating Frequent Itemsets

The apriori function expects data in a one-hot encoded pandas DataFrame. Suppose we have the following transaction data:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
We can transform it into the right format via the TransactionEncoder as follows:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df
|   | Apple | Corn | Dill | Eggs | Ice cream | Kidney Beans | Milk | Nutmeg | Onion | Unicorn | Yogurt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | True | False | True | True | True | True | False | True |
| 1 | False | False | True | True | False | True | False | True | True | False | True |
| 2 | True | False | False | True | False | True | True | False | False | False | False |
| 3 | False | True | False | False | False | True | True | False | False | True | True |
| 4 | False | True | False | True | True | True | False | False | True | False | False |
Now, let us return the items and itemsets with at least 60% support:
from mlxtend.frequent_patterns import apriori
apriori(df, min_support=0.6)
|   | support | itemsets |
|---|---|---|
| 0 | 0.8 | (3) |
| 1 | 1.0 | (5) |
| 2 | 0.6 | (6) |
| 3 | 0.6 | (8) |
| 4 | 0.6 | (10) |
| 5 | 0.8 | (3, 5) |
| 6 | 0.6 | (8, 3) |
| 7 | 0.6 | (5, 6) |
| 8 | 0.6 | (8, 5) |
| 9 | 0.6 | (10, 5) |
| 10 | 0.6 | (8, 3, 5) |
By default, apriori returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set use_colnames=True to convert these integer values into the respective item names:
apriori(df, min_support=0.6, use_colnames=True)
|   | support | itemsets |
|---|---|---|
| 0 | 0.8 | (Eggs) |
| 1 | 1.0 | (Kidney Beans) |
| 2 | 0.6 | (Milk) |
| 3 | 0.6 | (Onion) |
| 4 | 0.6 | (Yogurt) |
| 5 | 0.8 | (Eggs, Kidney Beans) |
| 6 | 0.6 | (Eggs, Onion) |
| 7 | 0.6 | (Kidney Beans, Milk) |
| 8 | 0.6 | (Kidney Beans, Onion) |
| 9 | 0.6 | (Yogurt, Kidney Beans) |
| 10 | 0.6 | (Kidney Beans, Eggs, Onion) |
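As an aside, the index-to-name conversion that use_colnames performs can also be reproduced by hand from the DataFrame's columns. The following is just an illustrative sketch (frequent_raw is an arbitrary variable name chosen here), not how apriori works internally:

```python
# Run apriori with the default integer column indices ...
frequent_raw = apriori(df, min_support=0.6)

# ... and map each index back to the corresponding item name via df.columns
frequent_raw['itemsets'] = frequent_raw['itemsets'].apply(
    lambda idxs: frozenset(df.columns[i] for i in idxs))
frequent_raw
```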
Example 2 -- Selecting and Filtering Results

One advantage of working with pandas DataFrames is that we can use their convenient features to filter the results. For instance, let's assume we are only interested in itemsets of length 2 that have a support of at least 80 percent. First, we create the frequent itemsets via apriori and add a new column that stores the length of each itemset:
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets
|   | support | itemsets | length |
|---|---|---|---|
| 0 | 0.8 | (Eggs) | 1 |
| 1 | 1.0 | (Kidney Beans) | 1 |
| 2 | 0.6 | (Milk) | 1 |
| 3 | 0.6 | (Onion) | 1 |
| 4 | 0.6 | (Yogurt) | 1 |
| 5 | 0.8 | (Eggs, Kidney Beans) | 2 |
| 6 | 0.6 | (Eggs, Onion) | 2 |
| 7 | 0.6 | (Kidney Beans, Milk) | 2 |
| 8 | 0.6 | (Kidney Beans, Onion) | 2 |
| 9 | 0.6 | (Yogurt, Kidney Beans) | 2 |
| 10 | 0.6 | (Kidney Beans, Eggs, Onion) | 3 |
Then, we can select the results that satisfy our desired criteria as follows:
frequent_itemsets[ (frequent_itemsets['length'] == 2) &
(frequent_itemsets['support'] >= 0.8) ]
|   | support | itemsets | length |
|---|---|---|---|
| 5 | 0.8 | (Eggs, Kidney Beans) | 2 |
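The same filter can also be written with pandas' query method if you prefer that style; this is merely an equivalent alternative to the boolean indexing shown above:

```python
# Equivalent selection using DataFrame.query
frequent_itemsets.query("length == 2 and support >= 0.8")
```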
Similarly, using the pandas API, we can select entries based on the 'itemsets' column:
frequent_itemsets[ frequent_itemsets['itemsets'] == {'Onion', 'Eggs'} ]
|   | support | itemsets | length |
|---|---|---|---|
| 6 | 0.6 | (Eggs, Onion) | 2 |
Frozensets

Note that the entries in the 'itemsets' column are of type frozenset, which is a built-in Python type that behaves like a Python set but is immutable, making it more efficient for certain query or comparison operations (https://docs.python.org/3.6/library/stdtypes.html#frozenset). Since frozensets are sets, the item order does not matter. That is, the query
frequent_itemsets[ frequent_itemsets['itemsets'] == {'Onion', 'Eggs'} ]
is equivalent to any of the following three:
frequent_itemsets[ frequent_itemsets['itemsets'] == {'Eggs', 'Onion'} ]
frequent_itemsets[ frequent_itemsets['itemsets'] == frozenset(('Eggs', 'Onion')) ]
frequent_itemsets[ frequent_itemsets['itemsets'] == frozenset(('Onion', 'Eggs')) ]
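Besides exact matches, it is often useful to select every frequent itemset that contains a particular item. Here is a small sketch using frozenset's issuperset method (picking 'Onion' purely as an example):

```python
# Select all frequent itemsets that contain 'Onion'
frequent_itemsets[
    frequent_itemsets['itemsets'].apply(lambda x: x.issuperset({'Onion'}))
]
```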
Example 3 -- Working with Sparse Representations

To save memory, you may want to represent your transaction data in a sparse format. This is especially useful if you have lots of products and small transactions.
oht_ary = te.fit(dataset).transform(dataset, sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary, columns=te.columns_)
sparse_df
|   | Apple | Corn | Dill | Eggs | Ice cream | Kidney Beans | Milk | Nutmeg | Onion | Unicorn | Yogurt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | True | False | True | True | True | True | False | True |
| 1 | False | False | True | True | False | True | False | True | True | False | True |
| 2 | True | False | False | True | False | True | True | False | False | False | False |
| 3 | False | True | False | False | False | True | True | False | False | True | True |
| 4 | False | True | False | True | True | True | False | False | True | False | False |
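Optionally, you can check how sparse the encoded data actually is via pandas' sparse accessor; the lower the density, the more memory the sparse representation saves. This check is just a suggestion and is not required for apriori:

```python
# Fraction of stored (True) values in the sparse DataFrame
sparse_df.sparse.density
```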
apriori(sparse_df, min_support=0.6, use_colnames=True, verbose=1)
Processing 21 combinations | Sampling itemset size 3
|   | support | itemsets |
|---|---|---|
| 0 | 0.8 | (Eggs) |
| 1 | 1.0 | (Kidney Beans) |
| 2 | 0.6 | (Milk) |
| 3 | 0.6 | (Onion) |
| 4 | 0.6 | (Yogurt) |
| 5 | 0.8 | (Eggs, Kidney Beans) |
| 6 | 0.6 | (Eggs, Onion) |
| 7 | 0.6 | (Kidney Beans, Milk) |
| 8 | 0.6 | (Kidney Beans, Onion) |
| 9 | 0.6 | (Yogurt, Kidney Beans) |
| 10 | 0.6 | (Kidney Beans, Eggs, Onion) |
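The examples above only use min_support and use_colnames; the remaining parameters documented in the API section below can be combined in the same way. For instance, here is a usage sketch (output omitted) that caps the itemset length and enables the memory-frugal search:

```python
# Only generate itemsets with at most 2 items, using the slower,
# iterator-based search that keeps memory usage low
apriori(df, min_support=0.6, use_colnames=True, max_len=2, low_memory=True)
```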
API
apriori(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0, low_memory=False)
Get frequent itemsets from a one-hot DataFrame
Parameters
- df : pandas DataFrame

  pandas DataFrame in one-hot encoded format. Also supports DataFrames with sparse data; for more info, please see https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparse-data-structures

  Please note that the old pandas SparseDataFrame format is no longer supported in mlxtend >= 0.17.2.

  The allowed values are either 0/1 or True/False. For example,

    Apple Bananas Beer Chicken Milk Rice
    0 True False True True False True
    1 True False True False False True
    2 True False True False False False
    3 True True False False False False
    4 False False True True True True
    5 False False True False True True
    6 False False True False True False
    7 True True False False False False

- min_support : float (default: 0.5)

  A float between 0 and 1 for the minimum support of the itemsets returned. The support is computed as the fraction transactions_where_item(s)_occur / total_transactions.

- use_colnames : bool (default: False)

  If True, uses the DataFrames' column names in the returned DataFrame instead of column indices.

- max_len : int (default: None)

  Maximum length of the itemsets generated. If None (default), all possible itemset lengths (under the apriori condition) are evaluated.

- verbose : int (default: 0)

  Shows the number of iterations if >= 1 and low_memory is True. If >= 1 and low_memory is False, shows the number of combinations.

- low_memory : bool (default: False)

  If True, uses an iterator to search for combinations above min_support. Note that low_memory=True should only be used for large datasets when memory resources are limited, because this implementation is approx. 3-6x slower than the default.
Returns
pandas DataFrame with columns ['support', 'itemsets'] of all itemsets
that are >= min_support and whose length is <= max_len
(if max_len is not None).
Each itemset in the 'itemsets' column is of type frozenset,
which is a Python built-in type that behaves similarly to
sets except that it is immutable
(For more info, see
https://docs.python.org/3.6/library/stdtypes.html#frozenset).
Examples
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/