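What is Featuretools?#
Featuretools is a framework for performing automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.

## 5 Minute Quick Start

Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions.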
[1]:
import featuretools as ft
Load Mock Data#
[2]:
data = ft.demo.load_mock_customer()
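Prepare data#
In this toy dataset, there are 3 DataFrames.
- customers: unique customers who had sessions
- sessions: unique sessions and associated attributes
- transactions: list of events in this session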
[3]:
customers_df = data["customers"]
customers_df
[3]:
| | customer_id | zip_code | join_date | birthday |
|---|---|---|---|---|
| 0 | 1 | 60091 | 2011-04-17 10:48:33 | 1994-07-18 |
| 1 | 2 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 |
| 2 | 3 | 13244 | 2011-08-13 15:42:34 | 2003-11-21 |
| 3 | 4 | 60091 | 2011-04-08 20:08:14 | 2006-08-15 |
| 4 | 5 | 60091 | 2010-07-17 05:27:50 | 1984-07-28 |
[4]:
sessions_df = data["sessions"]
sessions_df.sample(5)
[4]:
| | session_id | customer_id | device | session_start |
|---|---|---|---|---|
| 13 | 14 | 1 | tablet | 2014-01-01 03:28:00 |
| 6 | 7 | 3 | tablet | 2014-01-01 01:39:40 |
| 1 | 2 | 5 | mobile | 2014-01-01 00:17:20 |
| 28 | 29 | 1 | mobile | 2014-01-01 07:10:05 |
| 24 | 25 | 3 | desktop | 2014-01-01 05:59:40 |
[5]:
transactions_df = data["transactions"]
transactions_df.sample(5)
[5]:
| | transaction_id | session_id | transaction_time | product_id | amount |
|---|---|---|---|---|---|
| 74 | 232 | 5 | 2014-01-01 01:20:10 | 1 | 139.20 |
| 231 | 27 | 17 | 2014-01-01 04:10:15 | 2 | 90.79 |
| 434 | 36 | 31 | 2014-01-01 07:50:10 | 3 | 62.35 |
| 420 | 56 | 30 | 2014-01-01 07:35:00 | 3 | 72.70 |
| 54 | 444 | 4 | 2014-01-01 00:58:30 | 4 | 43.59 |
First, we specify a dictionary with all the DataFrames in our dataset. Each DataFrame is passed in with its index column and, if it has one, its time index column.
[6]:
dataframes = {
    "customers": (customers_df, "customer_id"),
    "sessions": (sessions_df, "session_id", "session_start"),
    "transactions": (transactions_df, "transaction_id", "transaction_time"),
}
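Second, we specify how the DataFrames are related. When two DataFrames have a one-to-many relationship, we call the "one" DataFrame the "parent" DataFrame. A relationship between a parent and a child is defined like this:

(parent_dataframe, parent_column, child_dataframe, child_column)

In this dataset, we have two relationships.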
[7]:
relationships = [
    ("sessions", "session_id", "transactions", "session_id"),
    ("customers", "customer_id", "sessions", "customer_id"),
]
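To make the one-to-many idea concrete, here is a minimal standard-library sketch that checks whether a relationship tuple is well formed. It is not how Featuretools validates relationships internally; the `tables` rows and the `check_relationship` helper are hypothetical stand-ins invented for this illustration.

```python
# Hypothetical miniature rows standing in for the DataFrames above.
tables = {
    "customers": [{"customer_id": 1}, {"customer_id": 2}],
    "sessions": [
        {"session_id": 1, "customer_id": 2},
        {"session_id": 2, "customer_id": 1},
        {"session_id": 3, "customer_id": 2},
    ],
    "transactions": [
        {"transaction_id": 10, "session_id": 1},
        {"transaction_id": 11, "session_id": 1},
        {"transaction_id": 12, "session_id": 3},
    ],
}

relationships = [
    ("sessions", "session_id", "transactions", "session_id"),
    ("customers", "customer_id", "sessions", "customer_id"),
]


def check_relationship(tables, relationship):
    """Return True if the tuple describes a valid one-to-many link."""
    parent, parent_col, child, child_col = relationship
    parent_keys = [row[parent_col] for row in tables[parent]]
    # The "one" side: each key appears exactly once in the parent.
    if len(parent_keys) != len(set(parent_keys)):
        return False
    # The "many" side: every child row points at an existing parent key.
    known = set(parent_keys)
    return all(row[child_col] in known for row in tables[child])


results = [check_relationship(tables, rel) for rel in relationships]
print(results)
```

Swapping parent and child, for example `("sessions", "customer_id", "customers", "customer_id")`, fails the check because `customer_id` repeats in the sessions rows, which is why the "one" side has to be listed first.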
Note
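To manage setting up DataFrames and relationships, we recommend using the EntitySet class, which offers convenient APIs for managing data like this. See Representing Data with EntitySets for more information.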
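Run Deep Feature Synthesis#
The minimal input to DFS is a dictionary of DataFrames, a list of relationships, and the name of the target DataFrame whose features we want to calculate. The output of DFS is a feature matrix and the corresponding list of feature definitions. Let's first create a feature matrix for each customer in the data.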
[8]:
feature_matrix_customers, features_defs = ft.dfs(
    dataframes=dataframes,
    relationships=relationships,
    target_dataframe_name="customers",
)
feature_matrix_customers
[8]:
| customer_id | zip_code | COUNT(sessions) | MODE(sessions.device) | NUM_UNIQUE(sessions.device) | COUNT(transactions) | MAX(transactions.amount) | MEAN(transactions.amount) | MIN(transactions.amount) | MODE(transactions.product_id) | NUM_UNIQUE(transactions.product_id) | ... | STD(sessions.SKEW(transactions.amount)) | STD(sessions.SUM(transactions.amount)) | SUM(sessions.MAX(transactions.amount)) | SUM(sessions.MEAN(transactions.amount)) | SUM(sessions.MIN(transactions.amount)) | SUM(sessions.NUM_UNIQUE(transactions.product_id)) | SUM(sessions.SKEW(transactions.amount)) | SUM(sessions.STD(transactions.amount)) | MODE(transactions.sessions.device) | NUM_UNIQUE(transactions.sessions.device) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60091 | 8 | mobile | 3 | 126 | 139.43 | 71.631905 | 5.81 | 4 | 5 | ... | 0.589386 | 279.510713 | 1057.97 | 582.193117 | 78.59 | 40.0 | -0.476122 | 312.745952 | mobile | 3 |
| 2 | 13244 | 7 | desktop | 3 | 93 | 146.81 | 77.422366 | 8.73 | 4 | 5 | ... | 0.509798 | 251.609234 | 931.63 | 548.905851 | 154.60 | 35.0 | -0.277640 | 258.700528 | desktop | 3 |
| 3 | 13244 | 6 | desktop | 3 | 93 | 149.15 | 67.060430 | 5.89 | 1 | 5 | ... | 0.429374 | 219.021420 | 847.63 | 405.237462 | 66.21 | 29.0 | 2.286086 | 257.299895 | desktop | 3 |
| 4 | 60091 | 8 | mobile | 3 | 109 | 149.95 | 80.070459 | 5.73 | 2 | 5 | ... | 0.387884 | 235.992478 | 1157.99 | 649.657515 | 131.51 | 37.0 | 0.002764 | 356.125829 | mobile | 3 |
| 5 | 60091 | 6 | mobile | 3 | 79 | 149.02 | 80.375443 | 7.55 | 5 | 5 | ... | 0.415426 | 402.775486 | 839.76 | 472.231119 | 86.49 | 30.0 | 0.014384 | 259.873954 | mobile | 3 |
5 rows × 75 columns
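Column names such as SUM(sessions.MEAN(transactions.amount)) read from the inside out: an aggregation over transactions is computed per session, and a second aggregation over sessions is computed per customer. The sketch below reproduces that stacking by hand with the standard library. The rows are made-up values, and this is only an illustration of the naming, not Featuretools' actual computation path.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (session_id, customer_id) and (session_id, amount).
sessions = [(1, 1), (2, 1), (3, 2)]
transactions = [(1, 10.0), (1, 30.0), (2, 50.0), (3, 5.0), (3, 15.0), (3, 40.0)]

# Depth 1: MEAN(transactions.amount) for each session.
amounts_by_session = defaultdict(list)
for session_id, amount in transactions:
    amounts_by_session[session_id].append(amount)
session_mean = {sid: mean(vals) for sid, vals in amounts_by_session.items()}

# Depth 2: SUM(sessions.MEAN(transactions.amount)) for each customer.
customer_sum = defaultdict(float)
for session_id, customer_id in sessions:
    customer_sum[customer_id] += session_mean[session_id]

print(dict(customer_sum))
```

Each level of nesting in a feature name corresponds to one hop along a relationship, which is where the "deep" in Deep Feature Synthesis comes from.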
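We now have dozens of new features to describe a customer's behavior.

#### Change target DataFrame

One of the reasons DFS is so powerful is that it can create a feature matrix for any DataFrame in our EntitySet. For example, if we wanted to build features for sessions: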
[10]:
feature_matrix_sessions, features_defs = ft.dfs(
    dataframes=dataframes,
    relationships=relationships,
    target_dataframe_name="sessions",
)
feature_matrix_sessions.head(5)
[10]:
| session_id | customer_id | device | COUNT(transactions) | MAX(transactions.amount) | MEAN(transactions.amount) | MIN(transactions.amount) | MODE(transactions.product_id) | NUM_UNIQUE(transactions.product_id) | SKEW(transactions.amount) | STD(transactions.amount) | ... | customers.STD(transactions.amount) | customers.SUM(transactions.amount) | customers.DAY(birthday) | customers.DAY(join_date) | customers.MONTH(birthday) | customers.MONTH(join_date) | customers.WEEKDAY(birthday) | customers.WEEKDAY(join_date) | customers.YEAR(birthday) | customers.YEAR(join_date) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | desktop | 16 | 141.66 | 76.813125 | 20.91 | 3 | 5 | 0.295458 | 41.600976 | ... | 37.705178 | 7200.28 | 18 | 15 | 8 | 4 | 0 | 6 | 1986 | 2012 |
| 2 | 5 | mobile | 10 | 135.25 | 74.696000 | 9.32 | 5 | 5 | -0.160550 | 45.893591 | ... | 44.095630 | 6349.66 | 28 | 17 | 7 | 7 | 5 | 5 | 1984 | 2010 |
| 3 | 4 | mobile | 15 | 147.73 | 88.600000 | 8.70 | 1 | 5 | -0.324012 | 46.240016 | ... | 45.068765 | 8727.68 | 15 | 8 | 8 | 4 | 1 | 4 | 2006 | 2011 |
| 4 | 1 | mobile | 25 | 129.00 | 64.557200 | 6.29 | 5 | 5 | 0.234349 | 40.187205 | ... | 40.442059 | 9025.62 | 18 | 17 | 7 | 4 | 0 | 6 | 1994 | 2011 |
| 5 | 4 | mobile | 11 | 139.20 | 70.638182 | 7.43 | 5 | 5 | 0.336381 | 48.918663 | ... | 45.068765 | 8727.68 | 15 | 8 | 8 | 4 | 1 | 4 | 2006 | 2011 |
5 rows × 44 columns
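Columns prefixed with `customers.` in the sessions matrix carry a value from the parent DataFrame down to each child row. As a rough hand-rolled illustration (not Featuretools code), the sketch below rebuilds customers.YEAR(join_date) for the five sessions shown, using the join dates from the customers table and the `customer_id` column of the matrix above. The dictionary names are invented for this example.

```python
from datetime import datetime

# customer_id -> join_date, copied from the customers table above.
customer_join_date = {
    1: datetime(2011, 4, 17, 10, 48, 33),
    2: datetime(2012, 4, 15, 23, 31, 4),
    4: datetime(2011, 4, 8, 20, 8, 14),
    5: datetime(2010, 7, 17, 5, 27, 50),
}

# session_id -> customer_id, copied from the sessions feature matrix above.
session_customer = {1: 2, 2: 5, 3: 4, 4: 1, 5: 4}

# YEAR(join_date): a transform applied to each parent row.
join_year = {cid: joined.year for cid, joined in customer_join_date.items()}

# customers.YEAR(join_date): each session looks up its parent's value.
session_join_year = {
    sid: join_year[cid] for sid, cid in session_customer.items()
}

print(session_join_year)
```

The result lines up with the last column of the matrix: sessions that share a customer (here sessions 3 and 5) receive the same parent value.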
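Understanding Feature Output#
In general, Featuretools references generated features through the feature name. To make features easier to understand, Featuretools offers two additional tools, featuretools.graph_feature() and featuretools.describe_feature(), to help explain what a feature is and the steps Featuretools took to generate it. Let's look at this example feature: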
[11]:
feature = features_defs[18]
feature
[11]:
<Feature: MODE(transactions.WEEKDAY(transaction_time))>
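Feature lineage graphs#
Feature lineage graphs visually walk through how a feature is generated. Starting from the base data, they show step by step the primitives applied and the intermediate features generated to create the final feature.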
[12]:
ft.graph_feature(feature)
[12]:
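Feature descriptions#
Featuretools can also automatically generate English sentence descriptions of features. Feature descriptions help to explain what a feature is, and can be further improved by including manually defined custom definitions. For more details on how to customize automatically generated feature descriptions, see the guide at /guides/feature_descriptions.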
[13]:
ft.describe_feature(feature)
[13]:
'The most frequently occurring value of the day of the week of the "transaction_time" of all instances of "transactions" for each "session_id" in "sessions".'
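The description above can be traced by hand: first a transform turns each transaction time into a day of the week, then an aggregation picks the most frequent value per session. The sketch below does this for one session with the standard library. The timestamps are hypothetical, and it assumes the WEEKDAY primitive numbers days with Monday as 0, the same convention as Python's `datetime.weekday()`.

```python
from collections import Counter
from datetime import datetime

# Hypothetical transaction times for a single session.
transaction_times = [
    datetime(2014, 1, 1, 0, 58, 30),
    datetime(2014, 1, 1, 1, 20, 10),
    datetime(2014, 1, 2, 4, 10, 15),
]

# WEEKDAY(transaction_time): transform step, Monday is 0 (assumed convention).
weekdays = [t.weekday() for t in transaction_times]

# MODE(...): aggregation step over all transactions in the session.
mode_weekday = Counter(weekdays).most_common(1)[0][0]

print(mode_weekday)
```

Two of the three timestamps fall on Wednesday, 2014-01-01, so the mode is that day's weekday number.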