数据层
数据转换
PandasAI 中可用的数据转换功能
版本v3目前处于测试阶段。本文档反映了正在开发中的功能和特性,在最终发布前可能会有所变更。
PandasAI中的数据转换
PandasAI提供了一套丰富的数据转换功能,可应用于您的数据。这些转换可以在模式文件中指定或以编程方式应用。
字符串转换
transformations:
# Convert text to lowercase
- type: to_lowercase
params:
column: product_name
# Convert text to uppercase
- type: to_uppercase
params:
column: category
# Remove leading/trailing whitespace
- type: strip
params:
column: description
# Truncate text to specific length
- type: truncate
params:
column: description
length: 100
add_ellipsis: true # Optional, adds "..." to truncated text
# Pad strings to fixed width
- type: pad
params:
column: product_code
width: 10
side: left # Optional: "left" or "right", default "left"
pad_char: "0" # Optional, default " "
# Extract text using regex
- type: extract
params:
column: product_code
pattern: "^[A-Z]+-(\d+)" # Extracts numbers after hyphen
数值转换
transformations:
# Round numbers to specified decimals
- type: round_numbers
params:
column: price
decimals: 2
# Scale values by a factor
- type: scale
params:
column: price
factor: 1.1 # 10% increase
# Clip values to bounds
- type: clip
params:
column: quantity
lower: 0 # Optional
upper: 100 # Optional
# Normalize to 0-1 range
- type: normalize
params:
column: score
# Standardize using z-score
- type: standardize
params:
column: score
# Ensure positive values
- type: ensure_positive
params:
column: amount
drop_negative: false # Optional, drops rows with negative values if true
# Bin continuous data
- type: bin
params:
column: age
bins: [0, 18, 35, 50, 65, 100] # Or specify number of bins: bins: 5
labels: ["0-18", "19-35", "36-50", "51-65", "65+"] # Optional
日期与时间转换
transformations:
# Convert timezone
- type: convert_timezone
params:
column: timestamp
to: "US/Pacific"
# Format dates
- type: format_date
params:
column: date
format: "%Y-%m-%d"
# Convert to datetime
- type: to_datetime
params:
column: date
format: "%Y-%m-%d" # Optional
errors: "coerce" # Optional: "raise", "coerce", or "ignore"
# Validate date range
- type: validate_date_range
params:
column: date
start_date: "2024-01-01"
end_date: "2024-12-31"
drop_invalid: false # Optional
数据清洗转换
transformations:
# Fill missing values
- type: fill_na
params:
column: quantity
value: 0
# Replace values
- type: replace
params:
column: status
old_value: "inactive"
new_value: "disabled"
# Remove duplicates
- type: remove_duplicates
params:
columns: ["order_id", "product_id"]
keep: "first" # Optional: "first", "last", or false
# Normalize phone numbers
- type: normalize_phone
params:
column: phone
country_code: "+1" # Optional, default "+1"
分类转换
transformations:
# One-hot encode categories
- type: encode_categorical
params:
column: category
drop_first: true # Optional
# Map values using dictionary
- type: map_values
params:
column: grade
mapping:
"A": 4.0
"B": 3.0
"C": 2.0
# Standardize categories
- type: standardize_categories
params:
column: company
mapping:
"Apple Inc.": "Apple"
"Apple Computer": "Apple"
重命名列
将列重命名为新名称。
参数:
column(str): 当前列名new_name(str): 列的新名称
示例:
transformations:
- type: rename
params:
column: old_name
new_name: new_name
这将把列名从old_name重命名为new_name。
验证转换
transformations:
# Validate email format
- type: validate_email
params:
column: email
drop_invalid: false # Optional
# Validate foreign key references
- type: validate_foreign_key
params:
column: user_id
ref_df: users # Reference DataFrame
ref_column: id
drop_invalid: false # Optional
隐私与安全转换
transformations:
# Anonymize sensitive data
- type: anonymize
params:
column: email # Replaces username in emails with asterisks
类型转换变换
transformations:
# Convert to numeric type
- type: to_numeric
params:
column: amount
errors: "coerce" # Optional: "raise", "coerce", or "ignore"
链式转换
你可以按顺序链式执行多个转换操作。转换将按照指定的顺序依次应用:
transformations:
- type: to_lowercase
params:
column: product_name
- type: strip
params:
column: product_name
- type: truncate
params:
column: product_name
length: 50
编程化使用
虽然模式文件适用于静态转换,但您也可以通过编程方式使用TransformationManager来应用转换:
import pandasai as pai
df = pai.read_csv("data.csv")
manager = TransformationManager(df)
result = (manager
.validate_email("email", drop_invalid=True)
.normalize_phone("phone")
.validate_date_range("birth_date", "1900-01-01", "2024-01-01")
.remove_duplicates("user_id")
.ensure_positive("amount")
.standardize_categories("company", {"Apple Inc.": "Apple"})
.df)
这种方法支持流畅的接口,可以将多个转换操作串联起来。每个方法都会返回管理器实例,以便进行后续转换。最终的.df属性会返回转换后的DataFrame。
完整示例
让我们通过一个完整的销售数据集转换示例来演示。这个例子展示了如何清洗、验证数据并为分析做好准备。
示例数据
考虑一个名为sales_data.csv的CSV文件,其结构如下:
date,store_id,product_name,category,quantity,unit_price,customer_email
2024-01-15, ST001, iPhone 13 Pro,Electronics,2,999.99,john.doe@email.com
2024-01-15,ST002,macBook Pro ,Electronics,-1,1299.99,invalid.email
2024-01-16,ST001,AirPods Pro,Electronics,3,249.99,jane@example.com
2024-01-16,ST003,iMac 27" ,Electronics,1,1799.99,
模式文件
创建一个schema.yaml文件来定义转换规则:
name: sales_data
description: "Daily sales data from retail stores"
source:
type: csv
path: "sales_data.csv"
transformations:
# Clean up product names
- type: strip
params:
column: product_name
- type: standardize_categories
params:
column: product_name
mapping:
"iPhone 13 Pro": "iPhone 13 Pro"
"macBook Pro": "MacBook Pro"
"AirPods Pro": "AirPods Pro"
"iMac 27\"": "iMac 27-inch"
# Format dates
- type: to_datetime
params:
column: date
format: "%Y-%m-%d"
# Validate and clean store IDs
- type: pad
params:
column: store_id
width: 5
side: "right"
pad_char: "0"
# Ensure valid quantities
- type: ensure_positive
params:
column: quantity
drop_negative: true
# Format prices
- type: round_numbers
params:
column: unit_price
decimals: 2
# Validate emails
- type: validate_email
params:
column: customer_email
drop_invalid: false
# Add derived columns
- type: scale
params:
column: unit_price
factor: 1.1 # Add 10% tax
columns:
date:
type: datetime
description: "Date of sale"
store_id:
type: string
description: "Store identifier"
product_name:
type: string
description: "Product name"
category:
type: string
description: "Product category"
quantity:
type: integer
description: "Number of units sold"
unit_price:
type: float
description: "Price per unit"
customer_email:
type: string
description: "Customer email address"
Python代码
以下是如何在代码中使用模式和转换:
import pandasai as pai
# Load and transform the data of the schema we just created
df = pai.load("my-org/sales-data")
# The resulting DataFrame will have:
# - Cleaned and standardized product names
# - Properly formatted dates
# - Padded store IDs (e.g., "ST001000")
# - Only positive quantities
# - Rounded prices with tax
# - Validated email addresses
# You can now analyze the data
response = df.chat("What's our best-selling product?")
# Or export the transformed data
df.to_csv("cleaned_sales_data.csv")
结果
转换后的数据将如下所示:
date,store_id,product_name,category,quantity,unit_price,customer_email,email_valid
2024-01-15,ST001000,iPhone 13 Pro,Electronics,2,1099.99,john.doe@email.com,true
2024-01-16,ST001000,AirPods Pro,Electronics,3,274.99,jane@example.com,true
2024-01-16,ST003000,iMac 27-inch,Electronics,1,1979.99,,false
注意这些转换具有以下特点:
- 标准化产品名称
- 填充后的店铺ID
- 移除了数量为负的行
- 价格增加10%税费
- 已验证的电子邮件地址
- 新增了电子邮件验证列
本示例演示如何结合使用多种转换方法来清洗和准备数据以供分析。这些转换按顺序应用,每个转换都基于前一个转换的结果进行构建。
助手
响应内容由AI生成,可能包含错误。
