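Woodwork Typing in Featuretools#
Featuretools relies on consistent typing across the creation of EntitySets, Primitives, Features, and feature matrices. Previously, Featuretools used its own type system built around objects called Variables. Now and going forward, Featuretools uses an external data typing library for its typing: Woodwork.
Understanding which Woodwork types exist and how Featuretools uses Woodwork's type system allows users to:
- build EntitySets that best represent their data
- understand the possible input and return types of Featuretools' Primitives
- understand which features will be generated from a given set of data and Primitives
Read the Understanding Woodwork Logical Types and Semantic Tags guide for an in-depth walkthrough of the available Woodwork types outlined below. For users who are familiar with the old Variable objects, the Transitioning to Featuretools Version 1.0 guide will help with converting Variable types to Woodwork types.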
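Physical Types#
Physical types define how the data in a Woodwork DataFrame is stored on disk or in memory. You may also see the physical type of a column referred to as the column's dtype. Knowing a Woodwork DataFrame's physical types is important because Pandas relies on these types when performing DataFrame operations. Each Woodwork LogicalType class has a single physical type associated with it.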
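Logical Types#
Logical types add information about how data should be interpreted or parsed beyond what a physical type can express. In fact, multiple logical types share the same physical type, and each conveys a different meaning that the physical type alone does not capture. In Featuretools, a column's logical type informs how data is read into an EntitySet and how it is used later in Deep Feature Synthesis. Woodwork provides many different logical types, which can be viewed with the list_logical_types function.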
[1]:
import featuretools as ft
ft.list_logical_types()
[1]:
| | name | type_string | description | physical_type | standard_tags | is_default_type | is_registered | parent_type |
|---|---|---|---|---|---|---|---|---|
| 0 | Address | address | Represents Logical Types that contain address ... | string | {} | True | True | None |
| 1 | Age | age | Represents Logical Types that contain whole nu... | int64 | {numeric} | True | True | Integer |
| 2 | AgeFractional | age_fractional | Represents Logical Types that contain non-nega... | float64 | {numeric} | True | True | Double |
| 3 | AgeNullable | age_nullable | Represents Logical Types that contain whole nu... | Int64 | {numeric} | True | True | IntegerNullable |
| 4 | Boolean | boolean | Represents Logical Types that contain binary v... | bool | {} | True | True | BooleanNullable |
| 5 | BooleanNullable | boolean_nullable | Represents Logical Types that contain binary v... | boolean | {} | True | True | None |
| 6 | Categorical | categorical | Represents Logical Types that contain unordere... | category | {category} | True | True | None |
| 7 | CountryCode | country_code | Represents Logical Types that use the ISO-3166... | category | {category} | True | True | Categorical |
| 8 | CurrencyCode | currency_code | Represents Logical Types that use the ISO-4217... | category | {category} | True | True | Categorical |
| 9 | Datetime | datetime | Represents Logical Types that contain date and... | datetime64[ns] | {} | True | True | None |
| 10 | Double | double | Represents Logical Types that contain positive... | float64 | {numeric} | True | True | None |
| 11 | EmailAddress | email_address | Represents Logical Types that contain email ad... | string | {} | True | True | Unknown |
| 12 | Filepath | filepath | Represents Logical Types that specify location... | string | {} | True | True | None |
| 13 | IPAddress | ip_address | Represents Logical Types that contain IP addre... | string | {} | True | True | Unknown |
| 14 | Integer | integer | Represents Logical Types that contain positive... | int64 | {numeric} | True | True | IntegerNullable |
| 15 | IntegerNullable | integer_nullable | Represents Logical Types that contain positive... | Int64 | {numeric} | True | True | None |
| 16 | LatLong | lat_long | Represents Logical Types that contain latitude... | object | {} | True | True | None |
| 17 | NaturalLanguage | natural_language | Represents Logical Types that contain text or ... | string | {} | True | True | None |
| 18 | Ordinal | ordinal | Represents Logical Types that contain ordered ... | category | {category} | True | True | Categorical |
| 19 | PersonFullName | person_full_name | Represents Logical Types that may contain firs... | string | {} | True | True | None |
| 20 | PhoneNumber | phone_number | Represents Logical Types that contain numeric ... | string | {} | True | True | Unknown |
| 21 | PostalCode | postal_code | Represents Logical Types that contain a series... | category | {category} | True | True | Categorical |
| 22 | SubRegionCode | sub_region_code | Represents Logical Types that use the ISO-3166... | category | {category} | True | True | Categorical |
| 23 | Timedelta | timedelta | Represents Logical Types that contain values s... | timedelta64[ns] | {} | True | True | Unknown |
| 24 | URL | url | Represents Logical Types that contain URLs, wh... | string | {} | True | True | Unknown |
| 25 | Unknown | unknown | Represents Logical Types that cannot be inferr... | string | {} | True | True | None |
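If no logical types are provided, Featuretools performs type inference to assign logical types to the data in an EntitySet. It is also possible to specify which logical type should be set for any column, provided that the data in that column is compatible with the logical type. To learn more about how logical types are used in EntitySets, see the Creating EntitySets guide. To learn more about setting logical types directly on a DataFrame, see the Woodwork guide on working with Logical Types.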
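Semantic Tags#
Semantic tags provide additional information about the meaning or potential uses of the data in a column. A column can have many semantic tags or none. Some tags are added by Woodwork, some are added by Featuretools, and users can add further tags as they see fit. To learn more about setting semantic tags directly on a DataFrame, see the Woodwork guide on working with Semantic Tags.
Woodwork-defined Semantic Tags#
Woodwork adds certain semantic tags to columns at initialization. These can be standard tags, which are associated with specific sets of logical types, or index tags. There are also tags that users can add to give a column a suggested meaning within Woodwork. To get a list of these tags, use the list_semantic_tags function.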
[2]:
ft.list_semantic_tags()
[2]:
| | name | is_standard_tag | valid_logical_types |
|---|---|---|---|
| 0 | numeric | True | [Age, AgeFractional, AgeNullable, Double, Inte... |
| 1 | category | True | [Categorical, CountryCode, CurrencyCode, Ordin... |
| 2 | index | False | Any LogicalType |
| 3 | time_index | False | [Datetime, Age, AgeFractional, AgeNullable, Do... |
| 4 | date_of_birth | False | [Datetime] |
| 5 | ignore | False | Any LogicalType |
| 6 | passthrough | False | Any LogicalType |
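Above we see the semantic tags that are defined within Woodwork. These tags inform how Featuretools interprets data. One example is the Age primitive, which requires the date_of_birth semantic tag to be present on a column. Woodwork does not add the date_of_birth tag automatically, so for Featuretools to be able to use the Age primitive, the date_of_birth tag must be added manually to every column to which it applies.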
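Featuretools-defined Semantic Tags#
Just as Woodwork specifies semantic tags internally, Featuretools defines a few tags of its own that allow the full set of Features to be generated. These tags have specific meanings when they are present on a column.
- 'last_time_index': added by Featuretools to the last time index column of a DataFrame. Indicates that the column was created by Featuretools.
- 'foreign_key': indicates that the column is the child column of a relationship, meaning that it is related to a corresponding index column of another DataFrame in the EntitySet.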
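Woodwork Throughout Featuretools#
Now that we have described the elements that make up Woodwork's type system, let's see them in action in Featuretools.
Woodwork in EntitySets#
For more information on building EntitySets with Woodwork, see the EntitySet guide. Let's look at the Woodwork typing information stored in the retail demo EntitySet: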
[3]:
es = ft.demo.load_retail()
es
[3]:
Entityset: demo_retail_data
DataFrames:
order_products [Rows: 401604, Columns: 8]
products [Rows: 3684, Columns: 4]
orders [Rows: 22190, Columns: 6]
customers [Rows: 4372, Columns: 3]
Relationships:
order_products.product_id -> products.product_id
order_products.order_id -> orders.order_id
orders.customer_name -> customers.customer_name
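Woodwork typing information is not stored on the EntitySet object itself, but on the individual DataFrames that make up the EntitySet. To look at the Woodwork typing information, we first select a single DataFrame from the EntitySet and then access the Woodwork information through the ww namespace: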
[4]:
df = es["products"]
df.head()
[4]:
| | product_id | description | first_order_products_time | _ft_last_time |
|---|---|---|---|---|
| 85123A | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 2010-12-01 08:26:00 | 2011-12-09 11:34:00 |
| 71053 | 71053 | WHITE METAL LANTERN | 2010-12-01 08:26:00 | 2011-12-07 14:12:00 |
| 84406B | 84406B | CREAM CUPID HEARTS COAT HANGER | 2010-12-01 08:26:00 | 2011-12-05 14:30:00 |
| 84029G | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 2010-12-01 08:26:00 | 2011-12-09 11:26:00 |
| 84029E | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 2010-12-01 08:26:00 | 2011-12-09 09:07:00 |
[5]:
df.ww
[5]:
| Column | Physical Type | Logical Type | Semantic Tag(s) |
|---|---|---|---|
| product_id | category | Categorical | ['index'] |
| description | string | NaturalLanguage | [] |
| first_order_products_time | datetime64[ns] | Datetime | ['time_index'] |
| _ft_last_time | datetime64[ns] | Datetime | ['last_time_index'] |
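Notice that the three columns showing this DataFrame's typing information are the three elements of typing information outlined at the beginning of this guide. To reiterate: by defining a physical type, a logical type, and semantic tags for each column in a DataFrame, we define a Woodwork schema for that DataFrame, and through that schema we can understand the contents of each column.
This column-specific typing information, which exists for every column of every DataFrame in an EntitySet, is an integral part of Deep Feature Synthesis' ability to generate features for an EntitySet.
Woodwork in DFS#
As the units of computation in Featuretools, Primitives need to be able to specify the input types they allow and to have a predictable return type. For an in-depth explanation of Primitives in Featuretools, see the Feature Primitives guide. Here, we look at how Woodwork types come together in a ColumnSchema object to describe a Primitive's input and return types. Below is the Woodwork ColumnSchema obtained from the 'product_id' column of the products DataFrame in the retail EntitySet.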
[6]:
products_df = es["products"]
product_ids_series = products_df.ww["product_id"]
column_schema = product_ids_series.ww.schema
column_schema
[6]:
<ColumnSchema (Logical Type = Categorical) (Semantic Tags = ['index'])>
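This combination of logical type and semantic tag typing information is a ColumnSchema. In the case above, the ColumnSchema describes the type definition of a single column of data. Notice that there is no physical type in a ColumnSchema. This is because a ColumnSchema is a collection of Woodwork types that has no data tied to it and therefore has no physical representation. Because a ColumnSchema object is not tied to any data, it can also be used to describe a type space into which other columns may or may not fall. This flexibility allows ColumnSchema objects to be used both as the type definition of every column in an EntitySet and as the input and return type spaces of every Primitive in Featuretools. Let's look at a different column in a different DataFrame to see how this works: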
[7]:
order_products_df = es["order_products"]
order_products_df.head()
[7]:
| | order_product_id | order_id | product_id | quantity | order_date | unit_price | total | _ft_last_time |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 536365 | 85123A | 6 | 2010-12-01 08:26:00 | 4.2075 | 25.245 | 2010-12-01 08:26:00 |
| 1 | 1 | 536365 | 71053 | 6 | 2010-12-01 08:26:00 | 5.5935 | 33.561 | 2010-12-01 08:26:00 |
| 2 | 2 | 536365 | 84406B | 8 | 2010-12-01 08:26:00 | 4.5375 | 36.300 | 2010-12-01 08:26:00 |
| 3 | 3 | 536365 | 84029G | 6 | 2010-12-01 08:26:00 | 5.5935 | 33.561 | 2010-12-01 08:26:00 |
| 4 | 4 | 536365 | 84029E | 6 | 2010-12-01 08:26:00 | 5.5935 | 33.561 | 2010-12-01 08:26:00 |
[8]:
quantity_series = order_products_df.ww["quantity"]
column_schema = quantity_series.ww.schema
column_schema
[8]:
<ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>
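The ColumnSchema above was pulled from the 'quantity' column of the order_products DataFrame in the retail EntitySet. It is a type definition. If we look at the Woodwork typing information for the order_products DataFrame, we can see that several columns have similar ColumnSchema type definitions. If we wanted to describe subsets of those columns, we could define several ColumnSchema type spaces.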
[9]:
es["order_products"].ww
[9]:
| Column | Physical Type | Logical Type | Semantic Tag(s) |
|---|---|---|---|
| order_product_id | int64 | Integer | ['index'] |
| order_id | category | Categorical | ['category', 'foreign_key'] |
| product_id | category | Categorical | ['category', 'foreign_key'] |
| quantity | int64 | Integer | ['numeric'] |
| order_date | datetime64[ns] | Datetime | ['time_index'] |
| unit_price | float64 | Double | ['numeric'] |
| total | float64 | Double | ['numeric'] |
| _ft_last_time | datetime64[ns] | Datetime | ['last_time_index'] |
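Below are several ColumnSchemas that all include our quantity column, but each describes a different type space. These ColumnSchemas become more restrictive as we go down:
Entire DataFrame#
No restrictions are placed on the columns; any column falls into this definition, so it includes the whole DataFrame.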
[10]:
from woodwork.column_schema import ColumnSchema
ColumnSchema()
[10]:
<ColumnSchema>
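An example of a Primitive with this ColumnSchema as its input type is the IsNull transform primitive.
By Semantic Tag#
Only columns with the numeric tag apply. This can include columns with the Double, Integer, and Age logical types. It does not include the index column which, despite containing integers, has had its standard tags replaced by the 'index' tag.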
[11]:
ColumnSchema(semantic_tags={"numeric"})
[11]:
<ColumnSchema (Semantic Tags = ['numeric'])>
[12]:
df = es["order_products"].ww.select(include="numeric")
df.ww
[12]:
| Column | Physical Type | Logical Type | Semantic Tag(s) |
|---|---|---|---|
| quantity | int64 | Integer | ['numeric'] |
| unit_price | float64 | Double | ['numeric'] |
| total | float64 | Double | ['numeric'] |
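An example of a Primitive with this ColumnSchema as its input type is the Mean aggregation primitive.
By Logical Type#
Only columns with a logical type of Integer are included in this definition. The numeric tag is not required, so an index column (which has had its standard tags removed) still applies.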
[13]:
from woodwork.logical_types import Integer
ColumnSchema(logical_type=Integer)
[13]:
<ColumnSchema (Logical Type = Integer)>
[14]:
df = es["order_products"].ww.select(include="Integer")
df.ww
[14]:
| Column | Physical Type | Logical Type | Semantic Tag(s) |
|---|---|---|---|
| order_product_id | int64 | Integer | ['index'] |
| quantity | int64 | Integer | ['numeric'] |
By Logical Type and Semantic Tag#
The column must have a logical type of Integer and carry the numeric semantic tag, which excludes index columns.
[15]:
ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
[15]:
<ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>
[16]:
df = es["order_products"].ww.select(include="numeric")
df = df.ww.select(include="Integer")
df.ww
[16]:
| Column | Physical Type | Logical Type | Semantic Tag(s) |
|---|---|---|---|
| quantity | int64 | Integer | ['numeric'] |
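In this way, a ColumnSchema can define a type space into which columns of a Woodwork DataFrame can fall. This is how Featuretools determines which columns in a DataFrame are valid for building Features during DFS. Each Primitive has input_types and a return_type that are described by Woodwork ColumnSchemas. Every DataFrame in an EntitySet has Woodwork initialized on it. This means that when an EntitySet is passed into DFS, Featuretools can select the columns in a DataFrame that are valid for a Primitive's input_types. We then get a Feature with a column_schema property that indicates the Feature's type definition, which lets DFS stack features on top of one another. Featuretools is thus able to take the base unit of Woodwork typing information, the ColumnSchema, and use it together with an EntitySet of Woodwork DataFrames to build Features with Deep Feature Synthesis.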