版本v3目前处于测试阶段。本文档反映了正在开发中的功能和特性,在最终发布前可能会有所变更。

PandasAI中的数据转换

PandasAI提供了一套丰富的数据转换功能,可应用于您的数据。这些转换可以在模式文件中指定或以编程方式应用。

字符串转换

transformations:
  # Convert text to lowercase
  - type: to_lowercase
    params:
      column: product_name

  # Convert text to uppercase
  - type: to_uppercase
    params:
      column: category

  # Remove leading/trailing whitespace
  - type: strip
    params:
      column: description

  # Truncate text to specific length
  - type: truncate
    params:
      column: description
      length: 100
      add_ellipsis: true  # Optional, adds "..." to truncated text

  # Pad strings to fixed width
  - type: pad
    params:
      column: product_code
      width: 10
      side: left  # Optional: "left" or "right", default "left"
      pad_char: "0"  # Optional, default " "

  # Extract text using regex
  - type: extract
    params:
      column: product_code
      pattern: "^[A-Z]+-(\d+)"  # Extracts numbers after hyphen

数值转换

transformations:
  # Round numbers to specified decimals
  - type: round_numbers
    params:
      column: price
      decimals: 2

  # Scale values by a factor
  - type: scale
    params:
      column: price
      factor: 1.1  # 10% increase

  # Clip values to bounds
  - type: clip
    params:
      column: quantity
      lower: 0  # Optional
      upper: 100  # Optional

  # Normalize to 0-1 range
  - type: normalize
    params:
      column: score

  # Standardize using z-score
  - type: standardize
    params:
      column: score

  # Ensure positive values
  - type: ensure_positive
    params:
      column: amount
      drop_negative: false  # Optional, drops rows with negative values if true

  # Bin continuous data
  - type: bin
    params:
      column: age
      bins: [0, 18, 35, 50, 65, 100]  # Or specify number of bins: bins: 5
      labels: ["0-18", "19-35", "36-50", "51-65", "65+"]  # Optional

日期与时间转换

transformations:
  # Convert timezone
  - type: convert_timezone
    params:
      column: timestamp
      to: "US/Pacific"

  # Format dates
  - type: format_date
    params:
      column: date
      format: "%Y-%m-%d"

  # Convert to datetime
  - type: to_datetime
    params:
      column: date
      format: "%Y-%m-%d"  # Optional
      errors: "coerce"  # Optional: "raise", "coerce", or "ignore"

  # Validate date range
  - type: validate_date_range
    params:
      column: date
      start_date: "2024-01-01"
      end_date: "2024-12-31"
      drop_invalid: false  # Optional

数据清洗转换

transformations:
  # Fill missing values
  - type: fill_na
    params:
      column: quantity
      value: 0

  # Replace values
  - type: replace
    params:
      column: status
      old_value: "inactive"
      new_value: "disabled"

  # Remove duplicates
  - type: remove_duplicates
    params:
      columns: ["order_id", "product_id"]
      keep: "first"  # Optional: "first", "last", or false

  # Normalize phone numbers
  - type: normalize_phone
    params:
      column: phone
      country_code: "+1"  # Optional, default "+1"

分类转换

transformations:
  # One-hot encode categories
  - type: encode_categorical
    params:
      column: category
      drop_first: true  # Optional

  # Map values using dictionary
  - type: map_values
    params:
      column: grade
      mapping:
        "A": 4.0
        "B": 3.0
        "C": 2.0

  # Standardize categories
  - type: standardize_categories
    params:
      column: company
      mapping:
        "Apple Inc.": "Apple"
        "Apple Computer": "Apple"

重命名列

将列重命名为新名称。

参数:

  • column (str): 当前列名
  • new_name (str): 列的新名称

示例:

transformations:
  - type: rename
    params:
      column: old_name
      new_name: new_name

这将把列名从old_name重命名为new_name

验证转换

transformations:
  # Validate email format
  - type: validate_email
    params:
      column: email
      drop_invalid: false  # Optional

  # Validate foreign key references
  - type: validate_foreign_key
    params:
      column: user_id
      ref_df: users  # Reference DataFrame
      ref_column: id
      drop_invalid: false  # Optional

隐私与安全转换

transformations:
  # Anonymize sensitive data
  - type: anonymize
    params:
      column: email  # Replaces username in emails with asterisks

类型转换变换

transformations:
  # Convert to numeric type
  - type: to_numeric
    params:
      column: amount
      errors: "coerce"  # Optional: "raise", "coerce", or "ignore"

链式转换

你可以按顺序链式执行多个转换操作。转换将按照指定的顺序依次应用:

transformations:
  - type: to_lowercase
    params:
      column: product_name
  - type: strip
    params:
      column: product_name
  - type: truncate
    params:
      column: product_name
      length: 50

编程化使用

虽然模式文件适用于静态转换,但您也可以通过编程方式使用TransformationManager来应用转换:

import pandasai as pai

df = pai.read_csv("data.csv")
manager = TransformationManager(df)
result = (manager
    .validate_email("email", drop_invalid=True)
    .normalize_phone("phone")
    .validate_date_range("birth_date", "1900-01-01", "2024-01-01")
    .remove_duplicates("user_id")
    .ensure_positive("amount")
    .standardize_categories("company", {"Apple Inc.": "Apple"})
    .df)

这种方法支持流畅的接口,可以将多个转换操作串联起来。每个方法都会返回管理器实例,以便进行后续转换。最终的.df属性会返回转换后的DataFrame。

完整示例

让我们通过一个完整的销售数据集转换示例来演示。这个例子展示了如何清洗、验证数据并为分析做好准备。

示例数据

考虑一个名为sales_data.csv的CSV文件,其结构如下:

date,store_id,product_name,category,quantity,unit_price,customer_email
2024-01-15, ST001,  iPhone 13 Pro,Electronics,2,999.99,john.doe@email.com
2024-01-15,ST002,macBook Pro ,Electronics,-1,1299.99,invalid.email
2024-01-16,ST001,AirPods Pro,Electronics,3,249.99,jane@example.com
2024-01-16,ST003,iMac 27" ,Electronics,1,1799.99,

模式文件

创建一个schema.yaml文件来定义转换规则:

name: sales_data
description: "Daily sales data from retail stores"
source:
  type: csv
  path: "sales_data.csv"

transformations:
  # Clean up product names
  - type: strip
    params:
      column: product_name
  - type: standardize_categories
    params:
      column: product_name
      mapping:
        "iPhone 13 Pro": "iPhone 13 Pro"
        "macBook Pro": "MacBook Pro"
        "AirPods Pro": "AirPods Pro"
        "iMac 27\"": "iMac 27-inch"

  # Format dates
  - type: to_datetime
    params:
      column: date
      format: "%Y-%m-%d"

  # Validate and clean store IDs
  - type: pad
    params:
      column: store_id
      width: 5
      side: "right"
      pad_char: "0"

  # Ensure valid quantities
  - type: ensure_positive
    params:
      column: quantity
      drop_negative: true

  # Format prices
  - type: round_numbers
    params:
      column: unit_price
      decimals: 2

  # Validate emails
  - type: validate_email
    params:
      column: customer_email
      drop_invalid: false

  # Add derived columns
  - type: scale
    params:
      column: unit_price
      factor: 1.1  # Add 10% tax

columns:
  date:
    type: datetime
    description: "Date of sale"
  store_id:
    type: string
    description: "Store identifier"
  product_name:
    type: string
    description: "Product name"
  category:
    type: string
    description: "Product category"
  quantity:
    type: integer
    description: "Number of units sold"
  unit_price:
    type: float
    description: "Price per unit"
  customer_email:
    type: string
    description: "Customer email address"

Python代码

以下是如何在代码中使用模式和转换:

import pandasai as pai

# Load and transform the data of the schema we just created
df = pai.load("my-org/sales-data")

# The resulting DataFrame will have:
# - Cleaned and standardized product names
# - Properly formatted dates
# - Padded store IDs (e.g., "ST001000")
# - Only positive quantities
# - Rounded prices with tax
# - Validated email addresses

# You can now analyze the data
response = df.chat("What's our best-selling product?")

# Or export the transformed data
df.to_csv("cleaned_sales_data.csv")

结果

转换后的数据将如下所示:

date,store_id,product_name,category,quantity,unit_price,customer_email,email_valid
2024-01-15,ST001000,iPhone 13 Pro,Electronics,2,1099.99,john.doe@email.com,true
2024-01-16,ST001000,AirPods Pro,Electronics,3,274.99,jane@example.com,true
2024-01-16,ST003000,iMac 27-inch,Electronics,1,1979.99,,false

注意这些转换具有以下特点:

  • 标准化产品名称
  • 填充后的店铺ID
  • 移除了数量为负的行
  • 价格增加10%税费
  • 已验证的电子邮件地址
  • 新增了电子邮件验证列

本示例演示如何结合使用多种转换方法来清洗和准备数据以供分析。这些转换按顺序应用,每个转换都基于前一个转换的结果进行构建。