{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Representing Data with EntitySets\n", "An ``EntitySet`` is a collection of dataframes and the relationships between them. EntitySets are useful for preparing raw, structured datasets for feature engineering. While many functions in Featuretools take ``dataframes`` and ``relationships`` as separate arguments, it is recommended to create an ``EntitySet`` so that you can more easily manipulate your data as needed.\n", "\n", "## The Raw Data\n", "Below we have two tables of data (represented as Pandas DataFrames) related to customer transactions. The first is a merge of transactions, sessions, and customers, so the result looks like something you might see in a log file:\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import featuretools as ft\n", "\n", "\n", "data = ft.demo.load_mock_customer()\n", "\n", "transactions_df = data[\"transactions\"].merge(data[\"sessions\"]).merge(data[\"customers\"])\n", "\n", "\n", "transactions_df.sample(10)\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The second dataframe is a list of the products involved in those transactions.\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["products_df = data[\"products\"]\n", "products_df\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Creating an EntitySet\n", "First, we initialize an ``EntitySet``. If you would like to give it a name, you can optionally provide an ``id`` to the constructor.\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["es = ft.EntitySet(id=\"customer_data\")\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Adding Dataframes\n", "To get started, we add the transactions dataframe to the `EntitySet`. In the call to `add_dataframe`, we specify three important parameters:\n", "* The `index` parameter specifies the column that uniquely identifies rows in the dataframe.\n", "* The `time_index` parameter names the column that tells Featuretools when the data in each row was created.\n", "* The `logical_types` parameter indicates that \"product_id\" should be interpreted as a Categorical column, even though it is just an integer in the underlying data, and that \"zip_code\" should be interpreted as a PostalCode column.\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from woodwork.logical_types import Categorical, PostalCode\n", "\n", "es = es.add_dataframe(\n", "    dataframe_name=\"transactions\",\n", "    dataframe=transactions_df,\n", "    index=\"transaction_id\",\n", "    time_index=\"transaction_time\",\n", "    logical_types={\n", "        \"product_id\": Categorical,\n", "        \"zip_code\": PostalCode,\n", "    },\n", ")\n", "\n", "es\n"]}, {"cell_type": "markdown", "metadata": {}, "source": 
["You can also use a setter on the ``EntitySet`` object to add dataframes.\n"]}, {"cell_type": "raw", "metadata": {"raw_mimetype": "text/restructuredtext"}, "source": [".. currentmodule:: featuretools\n", "\n", "\n", ".. note ::\n", "\n", "    You can also use a setter on the ``EntitySet`` object to add dataframes\n", "\n", "    ``es[\"transactions\"] = transactions_df``\n", "\n", "    Note that this will use the default implementation of `add_dataframe`, notably the following:\n", "\n", "    * if the DataFrame does not have `Woodwork <https://woodwork.alteryx.com/>`_ initialized, the first column will be the index column\n", "    * if the DataFrame does not have Woodwork initialized, the types of all columns will be inferred by Woodwork.\n", "    * if control over the time index column and logical types is needed, Woodwork should be initialized before adding the dataframe.\n", "\n", ".. note ::\n", "\n", "    You can also display your `EntitySet` structure graphically by calling :meth:`.EntitySet.plot`."]}, {"cell_type": "markdown", "metadata": {}, "source": ["The `add_dataframe` method associates each column in the dataframe with a [Woodwork](https://woodwork.alteryx.com/) logical type. Each logical type can have an associated standard semantic tag that helps define the column's data type. If you do not specify a logical type for a column, it is inferred from the underlying data. The logical types and semantic tags are listed in the schema of the dataframe. For more information on working with logical types and semantic tags, take a look at the [Woodwork documentation](https://woodwork.alteryx.com/).\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["es[\"transactions\"].ww.schema\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, we can do the same thing with our products dataframe.\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["es = es.add_dataframe(\n", "    dataframe_name=\"products\", dataframe=products_df, index=\"product_id\"\n", ")\n", "\n", "es\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["With two dataframes in our `EntitySet`, we can add a relationship between them.\n", "\n", "## Adding a Relationship\n", "\n", "We want to relate these two dataframes by the column called \"product_id\" in each dataframe. Each product has multiple transactions associated with it, so the products dataframe is called the **parent dataframe**, while the transactions dataframe is known as the **child dataframe**. When specifying a relationship, we need four parameters: the parent dataframe name, the parent column name, the child dataframe name, and the child column name. Note that each relationship must denote a one-to-many relationship, not a one-to-one or many-to-many relationship.\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": 
["es = es.add_relationship(\"products\", \"product_id\", \"transactions\", \"product_id\")\n", "es\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, we see the relationship has been added to our `EntitySet`.\n", "\n", "## Creating a Dataframe from an Existing Table\n", "When working with raw data, it is common to have enough information to justify creating new dataframes. To create a new dataframe and relationship for sessions, we \"normalize\" the transactions dataframe.\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["es = es.normalize_dataframe(\n", "    base_dataframe_name=\"transactions\",\n", "    new_dataframe_name=\"sessions\",\n", "    index=\"session_id\",\n", "    make_time_index=\"session_start\",\n", "    additional_columns=[\n", "        \"device\",\n", "        \"customer_id\",\n", "        \"zip_code\",\n", "        \"session_start\",\n", "        \"join_date\",\n", "    ],\n", ")\n", "es\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Looking at the output above, we see that this method performed two operations:\n", "\n", "1. It created a new dataframe called \"sessions\" based on the \"session_id\" and \"session_start\" columns in \"transactions\".\n", "2. It added a relationship connecting \"transactions\" and \"sessions\".\n", "\n", "If we look at the schemas of the \"transactions\" dataframe and the new \"sessions\" dataframe, we see two more operations that were performed automatically:\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["es[\"transactions\"].ww.schema\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["es[\"sessions\"].ww.schema\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["1. It removed \"device\", \"customer_id\", \"zip_code\", and \"join_date\" from \"transactions\" and created new columns in the sessions dataframe. This reduces redundant information, because those properties of a session do not change between transactions.\n", "\n", "2. 
It copied \"session_start\" into the new sessions dataframe and marked it as the time index column, to indicate the beginning of a session. If the base dataframe has a time index and ``make_time_index`` is not set, ``normalize_dataframe`` will create a time index for the new dataframe. In this case, it would create a new time index called \"first_transactions_time\" using the time of the first transaction of each session. If we do not want this time index to be created, we can set ``make_time_index=False``.\n", "\n", "If we look at the dataframes, we can see what ``normalize_dataframe`` did to the actual data.\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["es[\"sessions\"].head(5)\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["es[\"transactions\"].head(5)\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["To finish preparing this dataset, create a \"customers\" dataframe using the same method call.\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["es = es.normalize_dataframe(\n", "    base_dataframe_name=\"sessions\",\n", "    new_dataframe_name=\"customers\",\n", "    index=\"customer_id\",\n", "    make_time_index=\"join_date\",\n", "    additional_columns=[\"zip_code\", \"join_date\"],\n", ")\n", "\n", "es\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Using the EntitySet\n", "Finally, we are ready to use this EntitySet with any functionality within Featuretools. For example, let's build a feature matrix for each product in our dataset.\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name=\"products\")\n", "\n", "feature_matrix\n"]}, {"cell_type": "raw", "metadata": {"raw_mimetype": "text/restructuredtext", "vscode": {"languageId": "raw"}}, "source": ["As we can see, the features from DFS use the relational structure of our `EntitySet`. 
Therefore it is important to think carefully about the dataframes that we create."]}], "metadata": {"celltoolbar": "Raw Cell Format", "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2"}}, "nbformat": 4, "nbformat_minor": 4}