Data Transformations

版本v3目前处于测试阶段。本文档反映了正在开发中的功能和特性，在最终发布前可能会有所变更。

PandasAI中的数据转换

PandasAI提供了一套丰富的数据转换功能，可应用于您的数据。这些转换可以在模式文件中指定或以编程方式应用。

字符串转换

transformations:
  # Convert text to lowercase
  - type: to_lowercase
    params:
      column: product_name

  # Convert text to uppercase
  - type: to_uppercase
    params:
      column: category

  # Remove leading/trailing whitespace
  - type: strip
    params:
      column: description

  # Truncate text to specific length
  - type: truncate
    params:
      column: description
      length: 100
      add_ellipsis: true  # Optional, adds "..." to truncated text

  # Pad strings to fixed width
  - type: pad
    params:
      column: product_code
      width: 10
      side: left  # Optional: "left" or "right", default "left"
      pad_char: "0"  # Optional, default " "

  # Extract text using regex
  - type: extract
    params:
      column: product_code
      pattern: "^[A-Z]+-(\d+)"  # Extracts numbers after hyphen

数值转换

transformations:
  # Round numbers to specified decimals
  - type: round_numbers
    params:
      column: price
      decimals: 2

  # Scale values by a factor
  - type: scale
    params:
      column: price
      factor: 1.1  # 10% increase

  # Clip values to bounds
  - type: clip
    params:
      column: quantity
      lower: 0  # Optional
      upper: 100  # Optional

  # Normalize to 0-1 range
  - type: normalize
    params:
      column: score

  # Standardize using z-score
  - type: standardize
    params:
      column: score

  # Ensure positive values
  - type: ensure_positive
    params:
      column: amount
      drop_negative: false  # Optional, drops rows with negative values if true

  # Bin continuous data
  - type: bin
    params:
      column: age
      bins: [0, 18, 35, 50, 65, 100]  # Or specify number of bins: bins: 5
      labels: ["0-18", "19-35", "36-50", "51-65", "65+"]  # Optional

日期与时间转换

transformations:
  # Convert timezone
  - type: convert_timezone
    params:
      column: timestamp
      to: "US/Pacific"

  # Format dates
  - type: format_date
    params:
      column: date
      format: "%Y-%m-%d"

  # Convert to datetime
  - type: to_datetime
    params:
      column: date
      format: "%Y-%m-%d"  # Optional
      errors: "coerce"  # Optional: "raise", "coerce", or "ignore"

  # Validate date range
  - type: validate_date_range
    params:
      column: date
      start_date: "2024-01-01"
      end_date: "2024-12-31"
      drop_invalid: false  # Optional

数据清洗转换

transformations:
  # Fill missing values
  - type: fill_na
    params:
      column: quantity
      value: 0

  # Replace values
  - type: replace
    params:
      column: status
      old_value: "inactive"
      new_value: "disabled"

  # Remove duplicates
  - type: remove_duplicates
    params:
      columns: ["order_id", "product_id"]
      keep: "first"  # Optional: "first", "last", or false

  # Normalize phone numbers
  - type: normalize_phone
    params:
      column: phone
      country_code: "+1"  # Optional, default "+1"

分类转换

transformations:
  # One-hot encode categories
  - type: encode_categorical
    params:
      column: category
      drop_first: true  # Optional

  # Map values using dictionary
  - type: map_values
    params:
      column: grade
      mapping:
        "A": 4.0
        "B": 3.0
        "C": 2.0

  # Standardize categories
  - type: standardize_categories
    params:
      column: company
      mapping:
        "Apple Inc.": "Apple"
        "Apple Computer": "Apple"

重命名列

将列重命名为新名称。

参数：

column (str): 当前列名
new_name (str): 列的新名称

示例：

transformations:
  - type: rename
    params:
      column: old_name
      new_name: new_name

这将把列名从old_name重命名为new_name。

验证转换

transformations:
  # Validate email format
  - type: validate_email
    params:
      column: email
      drop_invalid: false  # Optional

  # Validate foreign key references
  - type: validate_foreign_key
    params:
      column: user_id
      ref_df: users  # Reference DataFrame
      ref_column: id
      drop_invalid: false  # Optional

隐私与安全转换

transformations:
  # Anonymize sensitive data
  - type: anonymize
    params:
      column: email  # Replaces username in emails with asterisks

类型转换变换

transformations:
  # Convert to numeric type
  - type: to_numeric
    params:
      column: amount
      errors: "coerce"  # Optional: "raise", "coerce", or "ignore"

链式转换

你可以按顺序链式执行多个转换操作。转换将按照指定的顺序依次应用：

transformations:
  - type: to_lowercase
    params:
      column: product_name
  - type: strip
    params:
      column: product_name
  - type: truncate
    params:
      column: product_name
      length: 50

编程化使用

虽然模式文件适用于静态转换，但您也可以通过编程方式使用TransformationManager来应用转换：

import pandasai as pai

df = pai.read_csv("data.csv")
manager = TransformationManager(df)
result = (manager
    .validate_email("email", drop_invalid=True)
    .normalize_phone("phone")
    .validate_date_range("birth_date", "1900-01-01", "2024-01-01")
    .remove_duplicates("user_id")
    .ensure_positive("amount")
    .standardize_categories("company", {"Apple Inc.": "Apple"})
    .df)

这种方法支持流畅的接口，可以将多个转换操作串联起来。每个方法都会返回管理器实例，以便进行后续转换。最终的.df属性会返回转换后的DataFrame。

完整示例

让我们通过一个完整的销售数据集转换示例来演示。这个例子展示了如何清洗、验证数据并为分析做好准备。

示例数据

考虑一个名为sales_data.csv的CSV文件，其结构如下：

date,store_id,product_name,category,quantity,unit_price,customer_email
2024-01-15, ST001,  iPhone 13 Pro,Electronics,2,999.99,john.doe@email.com
2024-01-15,ST002,macBook Pro ,Electronics,-1,1299.99,invalid.email
2024-01-16,ST001,AirPods Pro,Electronics,3,249.99,jane@example.com
2024-01-16,ST003,iMac 27" ,Electronics,1,1799.99,

模式文件

创建一个schema.yaml文件来定义转换规则：

name: sales_data
description: "Daily sales data from retail stores"
source:
  type: csv
  path: "sales_data.csv"

transformations:
  # Clean up product names
  - type: strip
    params:
      column: product_name
  - type: standardize_categories
    params:
      column: product_name
      mapping:
        "iPhone 13 Pro": "iPhone 13 Pro"
        "macBook Pro": "MacBook Pro"
        "AirPods Pro": "AirPods Pro"
        "iMac 27\"": "iMac 27-inch"

  # Format dates
  - type: to_datetime
    params:
      column: date
      format: "%Y-%m-%d"

  # Validate and clean store IDs
  - type: pad
    params:
      column: store_id
      width: 5
      side: "right"
      pad_char: "0"

  # Ensure valid quantities
  - type: ensure_positive
    params:
      column: quantity
      drop_negative: true

  # Format prices
  - type: round_numbers
    params:
      column: unit_price
      decimals: 2

  # Validate emails
  - type: validate_email
    params:
      column: customer_email
      drop_invalid: false

  # Add derived columns
  - type: scale
    params:
      column: unit_price
      factor: 1.1  # Add 10% tax

columns:
  date:
    type: datetime
    description: "Date of sale"
  store_id:
    type: string
    description: "Store identifier"
  product_name:
    type: string
    description: "Product name"
  category:
    type: string
    description: "Product category"
  quantity:
    type: integer
    description: "Number of units sold"
  unit_price:
    type: float
    description: "Price per unit"
  customer_email:
    type: string
    description: "Customer email address"

Python代码

以下是如何在代码中使用模式和转换：

import pandasai as pai

# Load and transform the data of the schema we just created
df = pai.load("my-org/sales-data")

# The resulting DataFrame will have:
# - Cleaned and standardized product names
# - Properly formatted dates
# - Padded store IDs (e.g., "ST001000")
# - Only positive quantities
# - Rounded prices with tax
# - Validated email addresses

# You can now analyze the data
response = df.chat("What's our best-selling product?")

# Or export the transformed data
df.to_csv("cleaned_sales_data.csv")

结果

转换后的数据将如下所示：

date,store_id,product_name,category,quantity,unit_price,customer_email,email_valid
2024-01-15,ST001000,iPhone 13 Pro,Electronics,2,1099.99,john.doe@email.com,true
2024-01-16,ST001000,AirPods Pro,Electronics,3,274.99,jane@example.com,true
2024-01-16,ST003000,iMac 27-inch,Electronics,1,1979.99,,false

注意这些转换具有以下特点：

标准化产品名称
填充后的店铺ID
移除了数量为负的行
价格增加10%税费
已验证的电子邮件地址
新增了电子邮件验证列

概述

数据层

自然语言

数据平台

高级用法

向后兼容性

关于

数据转换

PandasAI中的数据转换

字符串转换

数值转换

日期与时间转换

数据清洗转换

分类转换

重命名列

验证转换

隐私与安全转换

类型转换变换

链式转换

编程化使用

完整示例

示例数据

模式文件

Python代码

结果

概述

数据层

自然语言

数据平台

高级用法

向后兼容性

关于

​PandasAI中的数据转换

​字符串转换

​数值转换

​日期与时间转换

​数据清洗转换

​分类转换

​重命名列

​验证转换

​隐私与安全转换

​类型转换变换

​链式转换

​编程化使用

​完整示例

​示例数据

​模式文件

​Python代码

​结果

PandasAI中的数据转换

字符串转换

数值转换

日期与时间转换

数据清洗转换

分类转换

重命名列

验证转换

隐私与安全转换

类型转换变换

链式转换

编程化使用

完整示例

示例数据

模式文件

Python代码

结果