Synthetic data generation (Part 1)

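Generating synthetic data with large language models (LLMs) offers a powerful answer to a common problem: obtaining high-quality, diverse, and privacy-compliant data. The approach applies in many scenarios, such as training classical machine learning models (SVMs, decision trees, KNNs), fine-tuning different GPT models, addressing the cold-start problem, building compelling demos and apps with realistic data, and scenario testing.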

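There are several key drivers that might lead you to use synthetic data.

Human data may carry privacy restrictions and/or identifiable information that we do not want used.
Synthetic data can be far more structured than real data and is therefore easier to manipulate.
In domains where data is sparse, or where certain categories are scarce, we may want to augment the data.
When dealing with imbalanced datasets or datasets that lack diversity, we may want to create data to enrich them.

Unlike traditional data augmentation or manual data creation, LLMs can generate rich, nuanced, and contextually relevant datasets, which makes them considerably more useful to businesses and developers.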

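We split this tutorial into two parts. In this part, we will cover the following agenda:

CSV with a structured prompt
CSV with a Python program
Multitable CSV with a Python program
Simply creating textual data
Dealing with imbalanced or non-diverse textual data

In part 2, we will look at prompting strategies for getting better textual data.

The last two items are especially useful for creating synthetic data to fine-tune another GPT model, for example using high-quality data produced by gpt-4o to fine-tune the cheaper and faster gpt-3.5-turbo, improving its performance while keeping costs down.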

Getting set up

%pip install openai
%pip install pandas
%pip install scikit-learn
%pip install matplotlib

from openai import OpenAI
import os
import re
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import json
import matplotlib

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

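1. CSV with a structured prompt

Here we create data in the simplest way. You can generate data quickly by addressing 3 key points: telling the model the format of the data (CSV), the schema, and useful information about how the columns relate to one another (the LLM can deduce this from the column names, but a helping hand improves performance).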

datagen_model = "gpt-4o-mini"
question = """
Create a CSV file with 10 rows of housing data.
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense). Also only respond with the CSV.
"""

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)
res = response.choices[0].message.content
print(res)

```csv
id,house_size_m2,house_price,location,number_of_bedrooms
1,50,150000,Suburban,2
2,75,250000,City Center,3
3,100,350000,Suburban,4
4,120,450000,Suburban,4
5,80,300000,City Center,3
6,90,400000,City Center,3
7,150,600000,Premium Area,5
8,200,750000,Premium Area,5
9,55,180000,Suburban,2
10,300,950000,Premium Area,6
```
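The response above arrives wrapped in a Markdown code fence even though the prompt asked for CSV only, so it helps to strip the fence before loading the rows. The following is a minimal sketch using only the standard library; `parse_csv_response` is a hypothetical helper name, and the sample string is a shortened copy of the output above.

```python
import csv
import io
import re

def parse_csv_response(text):
    """Strip an optional Markdown fence and parse CSV text into a list of dicts."""
    # Look for a fenced block such as ```csv ... ```; fall back to the raw text.
    fenced = re.search(r"```(?:\w+)?\s*\n(.*?)```", text, re.DOTALL)
    body = fenced.group(1) if fenced else text
    return list(csv.DictReader(io.StringIO(body.strip())))

sample = """```csv
id,house_size_m2,house_price,location,number_of_bedrooms
1,50,150000,Suburban,2
2,75,250000,City Center,3
```"""

rows = parse_csv_response(sample)
print(len(rows), rows[0]["location"])
```

All values come back as strings here. If typed columns are needed, the same cleaned text can be passed to `pd.read_csv(io.StringIO(...))` instead.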

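2. CSV with a Python program

The issue with generating data directly is that the amount we can produce is limited by the context window. Instead, we can ask the LLM to generate a Python program that creates the synthetic data. This lets us scale to much more data, and inspecting the Python program also shows us how the data was generated.

It also lets us edit the Python program as we see fit, while giving us a good starting point.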

question = """
Create a Python program to generate 100 rows of housing data.
I want you to at the end of it output a pandas dataframe with 100 rows of data.
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense).
"""

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)
res = response.choices[0].message.content
print(res)

Certainly! Below is a Python program that generates synthetic housing data according to your specifications. We will create a pandas DataFrame with the defined fields and characteristics.

```python
import pandas as pd
import random

def generate_housing_data(num_rows):
    data = []
    
    locations = [
        ('City Center', 10000, 150),  # (location name, base price per m², base size)
        ('Suburban Area', 8000, 100),
        ('Country Side', 5000, 80),
        ('Coastal Region', 12000, 110),
        ('Urban Neighborhood', 9000, 130)
    ]
    
    for i in range(1, num_rows + 1):
        # Randomly pick a location
        location, base_price_per_m2, base_size = random.choice(locations)
        
        # Generate number of bedrooms (1 to 5)
        number_of_bedrooms = random.randint(1, 5)
        
        # Calculate house size based on the number of bedrooms
        house_size = base_size + (10 * number_of_bedrooms) + random.randint(-5, 15)  # Adding some noise
        
        # Calculate house price based on house size and location
        house_price = base_price_per_m2 * house_size + random.randint(-5000, 10000)  # Adding some noise

        # Append the generated data to the list
        data.append({
            'id': i,
            'house_size_m2': house_size,
            'house_price': house_price,
            'location': location,
            'number_of_bedrooms': number_of_bedrooms
        })

    # Create a pandas DataFrame
    df = pd.DataFrame(data)
    return df

# Generate 100 rows of housing data
housing_data_df = generate_housing_data(100)

# Show the result
print(housing_data_df)
```

### Explanation:
- The `generate_housing_data` function creates synthetic housing data for a specified number of rows (`num_rows`).
- We define different locations with corresponding base prices per square meter and average house sizes.
- For each house, we randomly select a location, number of bedrooms, and calculate house size and price to ensure a sensible correlation between the values.
- Finally, we create a pandas DataFrame from the generated data and return it.

You can run this program in your Python environment, and it will output a DataFrame containing 100 rows of synthetic housing data.

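We need to make sure we parse this output correctly, since there is often surrounding text around the Python code. We can also explicitly ask the model to state every assumption it made about the data it is generating; in this case it told us automatically.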
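One way to separate the program from the surrounding prose is to pull out the first fenced block with a regular expression. This is a sketch, not the notebook's own parsing code; `extract_python_code` is a hypothetical helper, and the sample response is a shortened stand-in for the output above.

```python
import re

def extract_python_code(response_text):
    """Return the contents of the first fenced code block, or None if there is none."""
    match = re.search(r"```(?:python|py)?\s*\n(.*?)```", response_text, re.DOTALL)
    return match.group(1).strip() if match else None

sample_response = (
    "Certainly! Below is a program.\n\n"
    "```python\n"
    "x = 1 + 1\n"
    "print(x)\n"
    "```\n\n"
    "### Explanation:\n- It adds two numbers."
)

code = extract_python_code(sample_response)
print(code)
```

It is worth reading the extracted program before running it, since it is model-generated code.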

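3. Multitable CSV with a Python program

For more complex relationships, however, we need to make sure we specify a few more characteristics.

To create multiple datasets that relate to one another (for example housing, location, and house type), we need, as before, to specify the format, the schema, and useful information. However, more useful information is now required to get good performance. It is case-specific, but important things to describe include how the datasets relate to each other, how large the datasets should be relative to one another, making sure foreign and primary keys are set up appropriately, and ideally using previously generated datasets to populate new ones so that the actual data values match where necessary.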

question = """
Create a Python program to generate 3 different pandas dataframes.

1. Housing data
I want 100 rows. Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms
 - house type
 + any relevant foreign keys

2. Location
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - country
 - city
 - population
 - area (m^2)
 + any relevant foreign keys

 3. House types
 - id (incrementing integer starting at 1)
 - house type
 - average house type price
 - number of houses
 + any relevant foreign keys

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense).
Make sure that the dataframe generally follow common sense checks, e.g. the size of the dataframes make sense in comparison with one another.
Make sure the foreign keys match up and you can use previously generated dataframes when creating each consecutive dataframes.
You can use the previously generated dataframe to generate the next dataframe.
"""

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)
res = response.choices[0].message.content
print(res)

Certainly! Below is a Python program that generates the three specified pandas DataFrames for housing data, location data, and house types. Each DataFrame will include the necessary fields, and the foreign keys will ensure proper relationships among them.

```python
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(0)

# Function to generate location DataFrame
def generate_location_data(num_locations):
    locations = {
        "id": range(1, num_locations + 1),
        "country": np.random.choice(['USA', 'Canada', 'UK'], num_locations),
        "city": np.random.choice(['New York', 'Toronto', 'London', 'Vancouver', 'Manchester'], num_locations),
        "population": np.random.randint(50000, 1000000, num_locations),
        "area": np.random.randint(10000, 500000, num_locations)
    }
    return pd.DataFrame(locations)

# Function to generate house types DataFrame
def generate_house_type_data(num_house_types):
    house_types = {
        "id": range(1, num_house_types + 1),
        "house_type": np.random.choice(['Detached', 'Semi-Detached', 'Terraced', 'Flat'], num_house_types),
        "average_house_type_price": np.random.randint(100000, 1000000, num_house_types),
        "number_of_houses": np.random.randint(10, 1000, num_house_types)
    }
    return pd.DataFrame(house_types)

# Function to generate housing data DataFrame
def generate_housing_data(num_houses, location_df, house_type_df):
    house_sizes = np.random.randint(50, 300, num_houses)  # size in m^2
    location_ids = np.random.choice(location_df['id'], num_houses)
    house_type_ids = np.random.choice(house_type_df['id'], num_houses)
    
    # Generate prices based on size, location, and house type
    house_prices = (house_sizes * np.random.randint(2000, 5000, num_houses) // 10) + \
                   (location_ids * 1000) + \
                   (house_type_df.loc[house_type_ids - 1, 'average_house_type_price'].values // 4)
    
    housing_data = {
        "id": range(1, num_houses + 1),
        "house_size": house_sizes,
        "house_price": house_prices,
        "location_id": location_ids,
        "bedrooms": np.random.randint(1, 6, num_houses),
        "house_type_id": house_type_ids
    }
    
    return pd.DataFrame(housing_data)

# Generate DataFrames
num_locations = 10
num_house_types = 4
num_houses = 100

location_df = generate_location_data(num_locations)
house_type_df = generate_house_type_data(num_house_types)
housing_df = generate_housing_data(num_houses, location_df, house_type_df)

# Display the generated DataFrames
print("Location DataFrame:")
print(location_df.head(), "\n")

print("House Types DataFrame:")
print(house_type_df.head(), "\n")

print("Housing DataFrame:")
print(housing_df.head(), "\n")

# Printing the DataFrame shapes
print(f"Shapes: \nLocation: {location_df.shape}, House Types: {house_type_df.shape}, Housing: {housing_df.shape}")
```

### Explanation of the Code:
1. **Location DataFrame:** 
   - Generates random locations with attributes such as country, city, population, and area.
  
2. **House Types DataFrame:** 
   - Generates different types of houses along with average prices and quantity available.
  
3. **Housing DataFrame:** 
   - Generates housing data with increments on price based on house size, location, and house type, while also ensuring foreign keys (IDs) for location and house type.

### Output:
The three DataFrames generated will logically relate to one another with consistent data types and primary–foreign key relationships, resulting in a coherent representation of the housing dataset. The output displays heads of each DataFrame and their shapes for verification.
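Since the prompt asks for foreign keys that match up, a quick referential-integrity check on the generated tables is a cheap safeguard. The sketch below works on plain lists of dicts with hypothetical toy values; `find_orphan_keys` is an illustrative helper, not part of the generated program.

```python
def find_orphan_keys(child_rows, fk_column, parent_rows, pk_column="id"):
    """Return the sorted foreign-key values in child_rows that have no matching parent row."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return sorted({row[fk_column] for row in child_rows} - parent_keys)

# Toy tables (hypothetical values) mirroring the housing -> location relationship.
locations = [{"id": 1, "city": "London"}, {"id": 2, "city": "Toronto"}]
houses = [
    {"id": 1, "location_id": 1},
    {"id": 2, "location_id": 2},
    {"id": 3, "location_id": 7},  # deliberately broken reference
]

orphans = find_orphan_keys(houses, "location_id", locations)
print(orphans)
```

With the DataFrames from the generated program, an equivalent check would be `housing_df["location_id"].isin(location_df["id"]).all()`.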

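4. Simply creating textual data

Here we take a first look at creating textual data. This can be used, for example, to fine-tune another GPT model. In this case, we imagine ourselves as a retailer trying to streamline the creation of descriptions for the items it sells. We again need to specify the format of the data; in particular, here we want one that is easy to parse as an output.

The example we consider below creates input-output training pairs for fine-tuning a GPT model. The input is the product name and the category it belongs to, and the output is a description.

Specifying the structure of the output explicitly and instructing the model not to deviate from it helps keep the output format consistent. You can run this in a loop and append the data to generate more synthetic data. As before, we need to parse the data carefully so that the code does not break further downstream.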

output_string = ""
for i in range(3):
  question = f"""
  I am creating input output training pairs to fine tune my gpt model. The usecase is a retailer generating a description for a product from a product catalogue. I want the input to be product name and category (to which the product belongs to) and output to be description.
  The format should be of the form:
  1.
  Input: product_name, category
  Output: description
  2.
  Input: product_name, category
  Output: description

  Do not add any extra characters around that formatting as it will make the output parsing break.
  Create as many training pairs as possible.
  """

  response = client.chat.completions.create(
    model=datagen_model,
    messages=[
      {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
      {"role": "user", "content": question}
    ]
  )
  res = response.choices[0].message.content
  output_string += res + "\n" + "\n"
print(output_string[:1000]) #displaying truncated response

1.
Input: Wireless Bluetooth Headphones, Electronics
Output: Immerse yourself in high-quality sound with these Wireless Bluetooth Headphones, featuring active noise cancellation and a comfortable over-ear design for extended listening sessions.

2.
Input: Organic Green Tea, Beverages
Output: Enjoy a refreshing cup of Organic Green Tea, sourced from the finest leaves, packed with antioxidants, and perfect for a healthy, invigorating boost anytime.

3.
Input: Stainless Steel Kitchen Knife, Kitchenware
Output: Cut with precision and ease using this Stainless Steel Kitchen Knife, designed with an ergonomic handle and a sharp blade for all your culinary tasks.

4.
Input: Hiking Backpack, Outdoor Gear
Output: Explore the great outdoors with this durable Hiking Backpack, featuring multiple compartments for optimal organization and a breathable design for ultimate comfort on long treks.

5.
Input: Air Fryer, Kitchen Appliances
Output: Cook your favorite meals with less oil using this Air Fryer

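Note: the above output is truncated. We can now parse it as below to get a list of products, categories, and their descriptions. For example, let's take a look at the products it has generated.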

#regex to parse data
pattern = re.compile(r'Input:\s*(.+?),\s*(.+?)\nOutput:\s*(.+?)(?=\n\n|\Z)', re.DOTALL)
matches = pattern.findall(output_string)
products = []
categories = []
descriptions = []

for match in matches:
    product, category, description = match
    products.append(product.strip())
    categories.append(category.strip())
    descriptions.append(description.strip())
products

['Wireless Bluetooth Headphones',
 'Organic Green Tea',
 'Stainless Steel Kitchen Knife',
 'Hiking Backpack',
 'Air Fryer',
 "Kids' Educational Tablet",
 'Bluetooth Speaker',
 'Yoga Mat',
 'Memory Foam Mattress',
 'Smartwatch',
 'Leather Wallet',
 'Portable Phone Charger',
 'Non-Stick Cookware Set',
 'Pet Dog Bed',
 'Fitness Tracker',
 'Wireless Earbuds',
 'Organic Green Tea',
 'Reusable Water Bottle',
 'Yoga Mat',
 'Leather Wallet',
 'Air Fryer',
 'Gaming Mouse',
 'Crochet Kit',
 'Hiking Boots',
 'Scented Candles',
 'Bluetooth Speaker',
 'Stainless Steel Cookware Set',
 'Fitness Tracker',
 'Decorative Throw Pillows',
 'Eco-Friendly Cleaning Supplies',
 'Wireless Noise Cancelling Headphones',
 'Organic Green Tea',
 'Adjustable Yoga Mat',
 'Bluetooth Smart Scale',
 'Stainless Steel Water Bottle',
 'Soft Cotton Bedding Set',
 'Multi-Functional Kitchen Blender',
 'Eco-Friendly Reusable Bags',
 'Portable Phone Charger',
 'Classic Leather Wallet',
 'Suede Chelsea Boots',
 'Non-Stick Cookware Set',
 'Pet-Friendly Indoor Plants',
 'High-Protein Snack Bars',
 'LED Desk Lamp with USB Port']
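The list above contains repeats across the three loop iterations (for example, 'Organic Green Tea' appears three times), which is an early sign of limited diversity. A small sketch for quantifying this is shown below; `duplicate_report` is a hypothetical helper, and the sample list is a short excerpt rather than the full output.

```python
from collections import Counter

def duplicate_report(items):
    """Return (unique_ratio, duplicates), where duplicates maps each repeated item to its count."""
    counts = Counter(item.strip().lower() for item in items)
    duplicates = {name: n for name, n in counts.items() if n > 1}
    unique_ratio = len(counts) / len(items) if items else 1.0
    return unique_ratio, duplicates

sample_products = [
    "Organic Green Tea", "Yoga Mat", "Air Fryer",
    "Organic Green Tea", "Yoga Mat", "Gaming Mouse",
]
ratio, dupes = duplicate_report(sample_products)
print(round(ratio, 2), dupes)
```

A low unique ratio suggests that each call is drawing from the same narrow set of products, which the next section addresses.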

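5. Dealing with imbalanced or non-diverse textual data

Some of the most important aspects of generating high-quality synthetic data are accuracy (does the data make sense), consistency (are two separate data points for the same input roughly the same), and diversity (making sure our data distribution matches the distribution found in production as closely as possible).

To increase the diversity of our data, we start by clustering it. This tells us which clusters are underrepresented (an imbalanced dataset) and which data is not covered at all (a chance to widen the data distribution). Then we either suggest new clusters (using a self-reflection type call to GPT) or ask the next iteration of our synthetic generation calls to explicitly target the underrepresented clusters.

We can then run this generate-and-analyze loop over the clusters recursively to automate the generation of diverse synthetic data.

For demonstration purposes, we explicitly prompt the LLM to generate information about 4 different topic areas: vehicle, clothing, toiletries, food. We will then cluster the data and see whether it manages to find these 4 topic areas.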

output_string = ""
for i in range(3):
  question = f"""
  I am creating input output training pairs to fine tune my gpt model. I want the input to be product name and category and output to be description. the category should be things like: mobile phones, shoes, headphones, laptop, electronic toothbrush, etc. and also more importantly the categories should come under 4 main topics: vehicle, clothing, toiletries, food)
  After the number of each example also state the topic area. The format should be of the form:
  1. topic_area
  Input: product_name, category
  Output: description

  Do not add any extra characters around that formatting as it will make the output parsing break.

  Here are some helpful examples so you get the style of output correct.

  1) clothing
  Input: "Shoe Name, Shoes"
  Output: "Experience unparalleled comfort. These shoes feature a blend of modern style and the traditional superior cushioning, perfect for those always on the move."
  """

  response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
      {"role": "user", "content": question}
    ]
  )
  res = response.choices[0].message.content
  output_string += res + "\n" + "\n"
print(output_string[:1000]) #displaying truncated response

1. vehicle  
Input: "Tesla Model 3, Electric Car"  
Output: "The Tesla Model 3 is a revolutionary electric car with impressive range and cutting-edge technology, designed to provide an exhilarating driving experience while minimizing environmental impact."

2. clothing  
Input: "Nike Air Max, Shoes"  
Output: "Elevate your sneaker game with Nike Air Max. Combining iconic style with superior comfort and support, these shoes are perfect for both workouts and casual outings."

3. toiletries  
Input: "Oral-B Pro 1000, Electronic Toothbrush"  
Output: "Achieve a superior clean with the Oral-B Pro 1000. This electronic toothbrush features 3D cleaning action that pulsates and oscillates to remove more plaque than a regular manual toothbrush."

4. food  
Input: "Chobani Greek Yogurt, Yogurt"  
Output: "Indulge in a nutritious snack with Chobani Greek Yogurt. Packed with protein and delicious flavors, it’s the perfect choice for a healthy breakfast or a satisfying treat anytime."

5. vehicle

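Note: the above output is truncated. In the example above, we explicitly include the topic area as part of each response, because it helps condition the output that follows and tends to give better performance. We also give the model an actual example of what the output should look like, so that it picks up the right style, which also helps enforce the structure.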

pattern = re.compile(r'(\d+)\.\s*(\w+)\s*Input:\s*"(.+?),\s*(.+?)"\s*Output:\s*"(.*?)"', re.DOTALL)
matches = pattern.findall(output_string)

topics = []
products = []
categories = []
descriptions = []

for match in matches:
    number, topic, product, category, description = match
    topics.append(topic)
    products.append(product)
    categories.append(category)
    descriptions.append(description)

products

['Tesla Model 3',
 'Nike Air Max',
 'Oral-B Pro 1000',
 'Chobani Greek Yogurt',
 'Ford F-150',
 "Levi's 511",
 'Philips Sonicare',
 'Quaker Oatmeal',
 'Toyota Camry',
 'Adidas Ultraboost',
 'Toyota Camry',
 'Nike Air Max',
 'Colgate Electric Toothbrush',
 'Blue Diamond Almonds',
 'Harley Davidson Fat Boy',
 'Adidas UltraBoost',
 "Dove Men's Body Wash",
 'Quaker Oats',
 'Ford F-150',
 "Levi's 501 Jeans",
 'Tesla Model 3',
 'Nike Air Max',
 'Oral-B Pro 1000',
 'Organic Almond Butter',
 'Yamaha YZF-R3',
 'Adidas Ultraboost',
 'Philips Sonicare',
 'Organic Quinoa']

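We will now cluster the data to analyze it. We will use K-means clustering to segregate the data. An important parameter of K-means to set is K, the number of clusters.

We know there should be 4 clusters (4 topics), since we specified this in the prompt: vehicle, clothing, toiletries, food. In general, however, we do not know how many clusters our data contains, so we will use the elbow method to find the optimal number of clusters.

In the elbow method, we iterate through a range of values of K, recording the inertia each time. Inertia is the sum of the squared distances between each point in a cluster and the centroid of that cluster, so it tells us how compact the clusters are. If we plot K against inertia, we can see how the inertia drops and choose the number of clusters at the point where the drop starts to slow down (often forming an elbow shape).

First, let's store our data in a pandas DataFrame for ease of analysis.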
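To make the definition of inertia concrete, here is a small worked sketch on hypothetical one-dimensional-style points. It computes the sum of squared distances by hand for one cluster and then for two, showing how inertia falls when K matches the structure of the data; `inertia` here is an illustrative function, not the scikit-learn attribute.

```python
def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total

# Two obvious groups of toy points (hypothetical values).
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]

one_cluster = inertia(points, [0, 0, 0, 0], [(6.0, 0.0)])
two_clusters = inertia(points, [0, 0, 1, 1], [(1.0, 0.0), (11.0, 0.0)])
print(one_cluster, two_clusters)
```

Going from one cluster to two cuts the inertia sharply; adding further clusters to this toy data would only yield small gains, which is the bend the elbow method looks for.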

data = {
    'Product': products,
    'Category': categories,
    'Description': descriptions
}

df = pd.DataFrame(data)

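Next, let's embed our data, since the embeddings are what we will cluster: items that are close to each other in vector space should be similar. Note that the code below embeds the Category column.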

def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")

    response = client.embeddings.create(input=[text], model=model)

    return response.data[0].embedding

embedding_model = "text-embedding-3-small"
df["embedding"] = df.Category.apply(lambda x: get_embedding(x, model=embedding_model))

# Ensure there are embeddings to concatenate
if len(df.embedding.values) > 0:
    matrix = np.vstack(df.embedding.values)
else:
    matrix = np.array([])  # Handle the case where there are no embeddings

df

	Product	Category	Description	embedding
0	Tesla Model 3	Electric Car	The Tesla Model 3 is a revolutionary electric car...	[0.003255360759794712, -0.039260633289813995, ...
1	Nike Air Max	Shoes	Elevate your sneaker game with Nike Air Max. C...	[0.03943369910120964, 0.022045187652111053, -0...
2	Oral-B Pro 1000	Electronic Toothbrush	Achieve a superior clean with the Oral-B Pro 1000...	[-0.003470012918114662, -0.01911414973437786, ...
3	Chobani Greek Yogurt	Yogurt	Indulge in a nutritious snack with Chobani Greek Yogurt...	[0.0208318829536438, -0.02645781636238098, -0....
4	Ford F-150	Pickup Truck	The Ford F-150 is the ultimate pickup truck, ...	[0.007467855699360371, -0.05288049206137657, -...
5	Levi's 511	Jeans	Step out in style with Levi's 511 jeans. These...	[0.0037206460256129503, 0.022772302851080894, ...
6	Philips Sonicare	Electric Toothbrush	Discover a new level of oral care with the Philips Sonicare...	[-0.00724813062697649, -0.011600878089666367, ...
7	Quaker Oatmeal	Breakfast Cereal	Start your day right with Quaker Oatmeal. This...	[-0.006529285106807947, 0.007865572348237038, ...
8	Toyota Camry	Sedan	The Toyota Camry stands out in the sedan category...	[-0.02088991366326809, -0.006191295105963945, ...
9	Adidas Ultraboost	Running Shoes	Run like never before in the Adidas Ultraboost...	[0.02679188922047615, 0.014639599248766899, 8....
10	Toyota Camry	Car	The Toyota Camry is a reliable midsize sedan...	[0.008056452497839928, -0.007912316359579563, ...
11	Nike Air Max	Shoes	Elevate your sneaker game with Nike Air Max...	[0.03943241760134697, 0.02208484522998333, -0....
12	Colgate Electric Toothbrush	Electronic Toothbrush	Transform your oral care routine with this C...	[-0.003470012918114662, -0.01911414973437786, ...
13	Blue Diamond Almonds	Nuts	Snack healthy with Blue Diamond Almonds. These...	[-0.013289917260408401, 0.036334190517663956, ...
14	Harley Davidson Fat Boy	Motorcycle	Experience the thrill of the open road...	[0.012365399859845638, 0.03552943095564842, -0...
15	Adidas UltraBoost	Sneakers	Enjoy the perfect blend of comfort and performance...	[0.013107392005622387, 0.02963760495185852, -0...
16	Dove Men's Body Wash	Body Wash	Refresh and hydrate your skin with Dove Men's Body Wash...	[0.03760576993227005, -0.008475445210933685, -...
17	Quaker Oats	Oats	Start your day right with Quaker Oats. Packed with...	[-0.00903365109115839, 0.00896345917135477, 0....
18	Ford F-150	Truck	The Ford F-150 is a durable and rugged truck...	[0.023461222648620605, -0.026651185005903244, ...
19	Levi's 501 Jeans	Jeans	Discover the timeless style of Levi's 501 Jeans...	[0.003762696636840701, 0.02275814116001129, -0...
20	Tesla Model 3	Mobile Phones	Explore the future of driving with the Tesla M...	[0.03703858703374863, 0.03407958149909973, 0.0...
21	Nike Air Max	Shoes	Elevate your athletic performance with Nike Air Max. Designed...	[0.03943369910120964, 0.022045187652111053, -0...
22	Oral-B Pro 1000	Electronic Toothbrush	Achieve a superior clean with the Oral-B Pro 1000...	[-0.003470012918114662, -0.01911414973437786, ...
23	Organic Almond Butter	Food	Indulge in the creamy goodness of Organic Almond Butter...	[-0.014613640494644642, -0.002179765608161688,...
24	Yamaha YZF-R3	Mobile Phones	Introducing the Yamaha YZF-R3, the ultimate sport...	[0.03703858703374863, 0.03407958149909973, 0.0...
25	Adidas Ultraboost	Shoes	Discover the Adidas Ultraboost, a shoe that...	[0.03944042697548866, 0.022062409669160843, -0...
26	Philips Sonicare	Electronic Toothbrush	Experience a dental care revolution with Philips Sonicare...	[-0.003470012918114662, -0.01911414973437786, ...
27	Organic Quinoa	Food	Nourish your body with Organic Quinoa, a nutritious...	[-0.014613640494644642, -0.002179765608161688,...
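Closeness in vector space is commonly measured with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors in place of real embeddings (which have far more dimensions) to show that related categories score higher than unrelated ones.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for category embeddings.
shoes = [0.9, 0.1, 0.0]
sneakers = [0.8, 0.2, 0.1]
yogurt = [0.0, 0.2, 0.9]

print(cosine_similarity(shoes, sneakers) > cosine_similarity(shoes, yogurt))
```

K-means itself works with Euclidean distance; for vectors of unit length the two measures rank neighbours in the same order.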

Now we perform the elbow method.

# Determine the optimal number of clusters using the elbow method
inertias = []
range_of_clusters = range(1, 13)  # Adjust the range as necessary

for n_clusters in range_of_clusters:
    kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42, n_init=10)
    kmeans.fit(matrix)
    inertias.append(kmeans.inertia_)

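The plotting code below outputs a chart in which we have to judge visually where the optimal number of clusters lies. In the chart, the inertia decreases gradually rather than bending sharply, but the steepest decrease appears to occur around 3, 4, or 5 clusters, which lines up with our expectations given our prompt.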

# Plotting the elbow plot
plt.figure(figsize=(10, 6))
plt.plot(range_of_clusters, inertias, '-o')
plt.title('Elbow Method to Determine Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.xticks(range_of_clusters)
plt.show()

[Figure: elbow chart plotting inertia against the number of clusters]
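Reading the elbow off a chart is subjective. One simple heuristic, offered here as an assumption rather than the method used in this guide, is to pick the K where the inertia curve bends most sharply, that is, where its second difference is largest. The inertia values below are hypothetical toy numbers.

```python
def elbow_by_second_difference(ks, inertias):
    """Return the k at which the inertia curve bends most sharply (largest second difference)."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertias) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

ks = [1, 2, 3, 4, 5, 6]
toy_inertias = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]  # hypothetical values
print(elbow_by_second_difference(ks, toy_inertias))
```

On smooth curves like the one described above, the heuristic can be unstable, so it is best treated as a starting point to confirm visually.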

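For demonstration purposes we will pick 5 as the optimal number of clusters, to show that the exact number does not matter as long as it is approximately right. There are many correct ways to categorize data. We also store which cluster each data point belongs to.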

n_clusters = 5

kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42)
kmeans.fit(matrix)
labels = kmeans.labels_
df["Cluster"] = labels

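We will now analyze the clustered data. There are two separate things we will look to address: 1. imbalanced data, 2. expanding the data distribution.

First, for imbalanced data, we count the number of examples in each cluster. Then we select a few examples from each cluster at random and ask the LLM what topics these map to.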

cluster_counts = df["Cluster"].value_counts().sort_index()
print(cluster_counts)

Cluster
0    5
1    7
2    8
3    6
4    2
Name: count, dtype: int64
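With the counts in hand, underrepresented clusters can be flagged programmatically. The sketch below uses the counts printed above; the rule (a cluster is underrepresented if it holds fewer than half the mean count) and the name `underrepresented` are assumptions chosen for illustration.

```python
def underrepresented(cluster_counts, threshold=0.5):
    """Return cluster ids whose count is below threshold * mean count."""
    mean_count = sum(cluster_counts.values()) / len(cluster_counts)
    return sorted(c for c, n in cluster_counts.items() if n < threshold * mean_count)

# Counts copied from the value_counts output above.
counts = {0: 5, 1: 7, 2: 8, 3: 6, 4: 2}
print(underrepresented(counts))
```

The threshold is a tuning knob: a higher value flags more clusters for extra generation.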

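We can see that the topics the LLM identifies for these clusters (Automotive, Personal Care, Footwear, Food, and a second Automotive cluster, as shown in the mapping further below) match our initial prompt of vehicle, clothing, toiletries, food well enough, though not exactly.

Because we chose 5 clusters, the vehicle examples end up split across two clusters, which does not affect us much further downstream.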

df

	Product	Category	Description	embedding	Cluster
0	Tesla Model 3	Electric Car	The Tesla Model 3 is a revolutionary electric car...	[0.003255360759794712, -0.039260633289813995, ...	1
1	Nike Air Max	Shoes	Elevate your sneaker game with Nike Air Max. C...	[0.03943369910120964, 0.022045187652111053, -0...	2
2	Oral-B Pro 1000	Electronic Toothbrush	Achieve a superior clean with the Oral-B Pro 1000...	[-0.003470012918114662, -0.01911414973437786, ...	1
3	Chobani Greek Yogurt	Yogurt	Indulge in a nutritious snack with Chobani Greek Yogurt...	[0.0208318829536438, -0.02645781636238098, -0....	3
4	Ford F-150	Pickup Truck	The Ford F-150 is the ultimate pickup truck, ...	[0.007467855699360371, -0.05288049206137657, -...	0
5	Levi's 511	Jeans	Step out in style with Levi's 511 jeans. These...	[0.0037206460256129503, 0.022772302851080894, ...	2
6	Philips Sonicare	Electric Toothbrush	Discover a new level of oral care with the Philips Sonicare...	[-0.00724813062697649, -0.011600878089666367, ...	1
7	Quaker Oatmeal	Breakfast Cereal	Start your day right with Quaker Oatmeal. This...	[-0.006529285106807947, 0.007865572348237038, ...	3
8	Toyota Camry	Sedan	The Toyota Camry stands out in the sedan category...	[-0.02088991366326809, -0.006191295105963945, ...	0
9	Adidas Ultraboost	Running Shoes	Run like never before in the Adidas Ultraboost...	[0.02679188922047615, 0.014639599248766899, 8....	2
10	Toyota Camry	Car	The Toyota Camry is a reliable midsize sedan...	[0.008056452497839928, -0.007912316359579563, ...	0
11	Nike Air Max	Shoes	Elevate your sneaker game with Nike Air Max...	[0.03943241760134697, 0.02208484522998333, -0....	2
12	Colgate Electric Toothbrush	Electronic Toothbrush	Transform your oral care routine with this C...	[-0.003470012918114662, -0.01911414973437786, ...	1
13	Blue Diamond Almonds	Nuts	Snack healthy with Blue Diamond Almonds. These...	[-0.013289917260408401, 0.036334190517663956, ...	3
14	Harley Davidson Fat Boy	Motorcycle	Experience the thrill of the open road...	[0.012365399859845638, 0.03552943095564842, -0...	0
15	Adidas UltraBoost	Sneakers	Enjoy the perfect blend of comfort and performance...	[0.013107392005622387, 0.02963760495185852, -0...	2
16	Dove Men's Body Wash	Body Wash	Refresh and hydrate your skin with Dove Men's Body Wash...	[0.03760576993227005, -0.008475445210933685, -...	1
17	Quaker Oats	Oats	Start your day right with Quaker Oats. Packed with...	[-0.00903365109115839, 0.00896345917135477, 0....	3
18	Ford F-150	Truck	The Ford F-150 is a durable and rugged truck...	[0.023461222648620605, -0.026651185005903244, ...	0
19	Levi's 501 Jeans	Jeans	Discover the timeless style of Levi's 501 Jeans...	[0.003762696636840701, 0.02275814116001129, -0...	2
20	Tesla Model 3	Mobile Phones	Explore the future of driving with the Tesla M...	[0.03703858703374863, 0.03407958149909973, 0.0...	4
21	Nike Air Max	Shoes	Elevate your athletic performance with Nike Air Max. Designed...	[0.03943369910120964, 0.022045187652111053, -0...	2
22	Oral-B Pro 1000	Electronic Toothbrush	Achieve a superior clean with the Oral-B Pro 1000...	[-0.003470012918114662, -0.01911414973437786, ...	1
23	Organic Almond Butter	Food	Indulge in the creamy goodness of Organic Almond Butter...	[-0.014613640494644642, -0.002179765608161688,...	3
24	Yamaha YZF-R3	Mobile Phones	Introducing the Yamaha YZF-R3, the ultimate sport...	[0.03703858703374863, 0.03407958149909973, 0.0...	4
25	Adidas Ultraboost	Shoes	Discover the Adidas Ultraboost, a shoe that...	[0.03944042697548866, 0.022062409669160843, -0...	2
26	Philips Sonicare	Electronic Toothbrush	Experience a dental care revolution with Philips Sonicare...	[-0.003470012918114662, -0.01911414973437786, ...	1
27	Organic Quinoa	Food	Nourish your body with Organic Quinoa, a nutritious...	[-0.014613640494644642, -0.002179765608161688,...	3

selected_examples = df.groupby('Cluster').apply(lambda x: x.sample(3, replace=True)).reset_index(drop=True)

# Format the selected examples
formatted_examples = "\n".join(
    f'Input: "{row["Product"]}, {row["Category"]}"\nOutput: "{row["Description"]}"\nCluster: "{row["Cluster"]}"'
    for _, row in selected_examples.iterrows()
)

topic_prompt = f"""
    I previously generated some examples of input output trainings pairs and then I clustered them based on category. From each cluster I picked 3 example data point which you can find below.
    I want you identify the broad topic areas these clusters belong to.
    Previous examples:
    {formatted_examples}


    Your output should be strictly of the format:
    Cluster: number, topic: topic
    Cluster: number, topic: topic
    Cluster: number, topic: topic

    Do not add any extra characters around that formatting as it will make the output parsing break.
    """

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to analyze clustered data"},
    {"role": "user", "content": topic_prompt}
  ]
)
res = response.choices[0].message.content

pattern = r"Cluster: (\d+), topic: ([^\n]+)"
matches = re.findall(pattern, res)
clusters = [{"cluster": int(cluster), "topic": topic} for cluster, topic in matches]
json_output = json.dumps(clusters, indent=2)
print(json_output)

[
  {
    "cluster": 0,
    "topic": "Automotive  "
  },
  {
    "cluster": 1,
    "topic": "Personal Care  "
  },
  {
    "cluster": 2,
    "topic": "Footwear  "
  },
  {
    "cluster": 3,
    "topic": "Food  "
  },
  {
    "cluster": 4,
    "topic": "Automotive  "
  }
]

We now have the clusters and their counts, so we could prompt the LLM to generate more examples within the topics we want. However, for this example we won't take that further, as the clusters are reasonably well split; you would simply follow the procedure above for prompting the model to generate data while passing in the underrepresented topics.
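As a rough illustration of passing in the underrepresented topics, the sketch below builds a generation prompt restricted to chosen clusters. `build_targeted_prompt`, its wording, and the `pairs_per_topic` parameter are hypothetical; the topic strings mimic the trailing whitespace seen in the JSON output above.

```python
def build_targeted_prompt(cluster_topics, target_clusters, pairs_per_topic=5):
    """Build a generation prompt restricted to the topics of the given clusters."""
    topics = sorted(
        {entry["topic"].strip() for entry in cluster_topics if entry["cluster"] in target_clusters}
    )
    return (
        f"Create {pairs_per_topic} input output training pairs for each of these topic areas: "
        f"{', '.join(topics)}.\n"
        "After the number of each example also state the topic area. "
        "The format should be of the form:\n"
        "1. topic_area\n"
        "Input: product_name, category\n"
        "Output: description\n"
    )

cluster_topics = [
    {"cluster": 2, "topic": "Footwear  "},
    {"cluster": 3, "topic": "Food  "},
    {"cluster": 4, "topic": "Automotive  "},
]
prompt = build_targeted_prompt(cluster_topics, target_clusters={4})
print(prompt)
```

The resulting string can be sent as the user message in the same `client.chat.completions.create` call used throughout this guide.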

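Next, we will try to deal with increasing the diversity of our data distribution.

First we start in a similar way, picking a few examples from each cluster at random and asking the LLM what topics these map to. In addition, in the same LLM call, we ask it to generate more topics to increase the diversity of our data. We do this in one call to save time and cost.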

selected_examples = df.groupby('Cluster').apply(lambda x: x.sample(3, replace=True)).reset_index(drop=True)

# Format the selected examples
formatted_examples = "\n".join(
    f'Input: "{row["Product"]}, {row["Category"]}"\nOutput: "{row["Description"]}"\nCluster: "{row["Cluster"]}"'
    for _, row in selected_examples.iterrows()
)

topic_prompt = f"""
    I previously generated some examples of input output trainings pairs and then I clustered them based on category. From each cluster I picked 3 example data point which you can find below.
    I want to promote diversity in my examples across categories so follow the procedure below:
    1. You must identify the broad topic areas these clusters belong to.
    2. You should generate further topic areas which don't exist so I can generate data within these topics to improve diversity.


    Previous examples:
    {formatted_examples}


    Your output should be strictly of the format:

    1. Cluster topic mapping
    Cluster: number, topic: topic
    Cluster: number, topic: topic
    Cluster: number, topic: topic

    2. New topics
    1. topic
    2. topic
    3. topic
    4. topic

    Do not add any extra characters around that formatting as it will make the output parsing break. It is very important you stick to that output format
    """

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to analyze clustered data"},
    {"role": "user", "content": topic_prompt}
  ]
)
res = response.choices[0].message.content
print(res)

1. Cluster topic mapping
Cluster: 0, topic: Automotive
Cluster: 1, topic: Personal Care
Cluster: 2, topic: Footwear
Cluster: 3, topic: Food
Cluster: 4, topic: Electric Vehicles

2. New topics
1. topic: Home Appliances
2. topic: Outdoor Equipment
3. topic: Smart Home Technology
4. topic: Fitness Equipment

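We can see here again that we explicitly prompt the output structure it should follow. I also tell it the purpose of generating topics (to promote diversity) so the model has full context.

We then parse the data into a list of cluster-mapping JSON objects and a list of topics.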

parts = res.split("\n\n")
cluster_mapping_part = parts[0]
new_topics_part = parts[1]

# Parse cluster topic mapping
cluster_topic_mapping_lines = cluster_mapping_part.split("\n")[1:]  # Skip the header line
cluster_topic_mapping = [{"cluster": int(line.split(",")[0].split(":")[1].strip()), "topic": line.split(":")[2].strip()} for line in cluster_topic_mapping_lines]

# Parse new topics
new_topics_lines = new_topics_part.split("\n")[1:]  # Skip the first line
new_topics = [line.split(". ")[1] for line in new_topics_lines]

cluster_topic_mapping, new_topics

([{'cluster': 0, 'topic': 'Automotive'},
  {'cluster': 1, 'topic': 'Personal Care'},
  {'cluster': 2, 'topic': 'Footwear'},
  {'cluster': 3, 'topic': 'Food'},
  {'cluster': 4, 'topic': 'Electric Vehicles'}],
 ['topic: Home Appliances',
  'topic: Outdoor Equipment',
  'topic: Smart Home Technology',
  'topic: Fitness Equipment'])
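The split-based parsing above depends on the exact blank-line layout and leaves the `topic: ` prefix on each new topic. A regex-based alternative, sketched below under the assumption that the response keeps the "New topics" header, tolerates small formatting drift; `parse_topic_response` is a hypothetical helper and the sample is a shortened response.

```python
import re

def parse_topic_response(text):
    """Parse the cluster-to-topic mapping and the list of new topics from the response."""
    mapping = [
        {"cluster": int(num), "topic": topic.strip()}
        for num, topic in re.findall(r"Cluster:\s*(\d+),\s*topic:\s*([^\n]+)", text)
    ]
    # Only numbered lines after the "New topics" header count as new topics.
    _, _, tail = text.partition("New topics")
    new_topics = [
        topic.strip()
        for topic in re.findall(r"^\s*\d+\.\s*(?:topic:\s*)?([^\n]+)", tail, re.MULTILINE)
    ]
    return mapping, new_topics

sample = (
    "1. Cluster topic mapping\n"
    "Cluster: 0, topic: Automotive\n"
    "Cluster: 1, topic: Personal Care\n\n"
    "2. New topics\n"
    "1. topic: Home Appliances\n"
    "2. Outdoor Equipment\n"
)
mapping, new_topics = parse_topic_response(sample)
print(mapping, new_topics)
```

If the header is missing, the function returns an empty list of new topics rather than raising, which makes failures easier to detect downstream.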

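And finally we can use this information to further prompt a model to keep generating synthetic data. We do this by passing all the topics in the list of JSONs to the prompt below.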

output_string = ""
for i in range(3):
  question = f"""
  I am creating input output training pairs to fine tune my gpt model. I want the input to be product name and category and output to be description. the category should be things like: mobile phones, shoes, headphones, laptop, electronic toothbrush, etc. and also more importantly the categories should come under some main topics: {[entry['topic'] for entry in cluster_topic_mapping]})
  After the number of each example also state the topic area. The format should be of the form:
  1. topic_area
  Input: product_name, category
  Output: description

  Do not add any extra characters around that formatting as it will make the output parsing break.

  Here are some helpful examples so you get the style of output correct.

  1) clothing
  Input: "Shoe Name, Shoes"
  Output: "Experience unparalleled comfort. These shoes feature a blend of modern style and the traditional superior cushioning, perfect for those always on the move."
  """

  response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
      {"role": "user", "content": question}
    ]
  )
  res = response.choices[0].message.content
  output_string += res + "\n" + "\n"
print(output_string)

1. Automotive
Input: "Tesla Model S, Electric Vehicles"
Output: "The Tesla Model S delivers exhilarating performance with advanced electric technology, offering a sleek design, impressive range, and an industry-leading infotainment system."

2. Personal Care
Input: "Oral-B Pro 1000, Electronic Toothbrush"
Output: "The Oral-B Pro 1000 features a 3D cleaning action that oscillates, rotates, and pulsates to remove plaque, ensuring a deeper clean for healthier gums."

3. Footwear
Input: "Nike Air Max 270, Shoes"
Output: "Step into comfort and style with Nike Air Max 270, designed with a large Max Air unit for superior cushioning and a breathable upper for a snug fit."

4. Electronics
Input: "Apple iPhone 12, Mobile Phones"
Output: "The Apple iPhone 12 combines powerful performance with stunning design, equipped with A14 Bionic chip and advanced camera systems for capturing every moment in stunning detail."

5. Food
Input: "Nature Valley Granola Bars, Snacks"
Output: "Nature Valley Granola Bars offer a wholesome crunch made from simple, delicious ingredients, providing a perfect snack that fuels your adventure."

6. Automotive
Input: "Ford F-150, Electric Vehicles"
Output: "The Ford F-150 stands at the forefront of durability and innovation, with its powerful electric version setting new standards for strength and sustainability in the truck category."

7. Personal Care
Input: "Philips Sonicare, Electronic Toothbrush"
Output: "Philips Sonicare delivers superior cleaning with dynamic technology that provides up to 31,000 strokes per minute for a healthier mouth and brighter smile."

8. Footwear
Input: "Adidas Ultraboost, Shoes"
Output: "The Adidas Ultraboost is a game-changer in running footwear, featuring responsive cushioning and a knit upper for a snug, supportive fit that adapts to any run."

9. Electronics
Input: "Dell XPS 13, Laptop"
Output: "The Dell XPS 13 is a remarkable laptop with an ultra-thin design, featuring a stunning InfinityEdge display and powerful performance to accommodate your multitasking needs."

10. Food
Input: "Kraft Macaroni & Cheese, Instant Food"
Output: "Kraft Macaroni & Cheese offers quick and convenient comfort food, combining creamy cheese sauce with perfectly cooked pasta for a simple meal that satisfies."

1. Automotive
Input: "Toyota Camry, Mobile Phones"
Output: "The Toyota Camry is a midsize sedan that combines efficiency with modern technology. It offers a spacious interior and the latest features for an enjoyable driving experience."

2. Personal Care
Input: "Oral-B Pro 1000, Electronic Toothbrush"
Output: "The Oral-B Pro 1000 not only provides powerful cleaning action but also enhances your oral hygiene routine with its smart pressure sensor and various cleaning modes."

3. Footwear
Input: "Nike Air Max, Shoes"
Output: "Step into comfort with the Nike Air Max. With cutting-edge technology and a sleek design, these shoes are perfect for athletes and casual wearers alike."

4. Food
Input: "Nature's Valley Granola Bar, Food"
Output: "Savor the wholesome goodness of Nature's Valley Granola Bar, crafted with real ingredients to fuel your day with delicious flavor and crunchy satisfaction."

5. Electric Vehicles
Input: "Tesla Model 3, Mobile Phones"
Output: "The Tesla Model 3 is a revolutionary electric vehicle that combines performance with sustainability, featuring an intuitive interface and cutting-edge technology for an exceptional driving experience."

1. Automotive
Input: "Tesla Model 3, Electric Vehicles"
Output: "The Tesla Model 3 combines cutting-edge technology with eco-friendly driving. Enjoy a sleek design, impressive range, and top-notch safety features, making it the perfect electric car for the modern driver."

2. Personal Care
Input: "Oral-B Pro 1000, Electronic Toothbrush"
Output: "Achieve a superior clean with the Oral-B Pro 1000. Featuring advanced 3D cleaning action, this electronic toothbrush ensures effective plaque removal while being gentle on gums, allowing you to maintain optimum oral health."

3. Footwear
Input: "Nike Air Max, Shoes"
Output: "Step up your game with Nike Air Max shoes. Combining iconic cushioning technology and bold style, these shoes provide ultimate comfort and support, perfect for both casual wear and athletic performance."

4. Food
Input: "Oreo Cookies, Snacks"
Output: "Indulge in the classic taste of Oreo Cookies. With their irresistible cream filling sandwiched between two crunchy chocolate wafers, these treats are perfect for satisfying your sweet tooth any time of the day."

5. Personal Care
Input: "Garnier Micellar Water, Skincare"
Output: "Garnier Micellar Water gently removes makeup and impurities while hydrating the skin. This soothing formula is suitable for all skin types, making it a must-have in your daily skincare routine."

6. Automotive
Input: "Ford F-150, Trucks"
Output: "The Ford F-150 is the quintessential pickup truck, combining power, reliability, and innovative technology. Equipped with advanced towing capabilities and a spacious interior, it's designed for both work and play."

7. Electronics
Input: "Samsung Galaxy S21, Mobile Phones"
Output: "Experience the future of mobile technology with the Samsung Galaxy S21. This smartphone features a stunning display, powerful processor, and multiple camera options, perfect for capturing life's moments in high definition."

8. Footwear
Input: "Adidas Ultraboost, Shoes"
Output: "Run in style with Adidas Ultraboost shoes. Known for their comfort and performance, these shoes utilize responsive cushioning to provide unmatched energy return with every step you take."

9. Electronics
Input: "Dell XPS 13, Laptops"
Output: "The Dell XPS 13 redefines the laptop experience with its stunning InfinityEdge display, powerful performance, and sleek design. Ideal for both professionals and students looking for portability and functionality."

10. Personal Care
Input: "Philips Sonicare, Electronic Toothbrush"
Output: "Philips Sonicare's electronic toothbrush guarantees a superior cleaning experience with its advanced sonic technology. This toothbrush not only helps remove plaque but also promotes healthier gums for a brighter smile."
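Because the prompt fixes the output format, the generated text can be turned back into structured records and checked for topic balance. The sketch below is one way to do that, assuming the "number, topic, Input, Output" layout shown above. The names `EXAMPLE_PATTERN` and `parse_examples` are hypothetical, and the sample text is an abbreviated excerpt in the same format rather than the full output.

```python
import re
from collections import Counter

# Hypothetical pattern for one generated example:
#   <n>. <topic>
#   Input: "<product>, <category>"
#   Output: "<description>"
EXAMPLE_PATTERN = re.compile(
    r'^[ \t]*\d+[.)][ \t]*(?P<topic>[^\n]+?)[ \t]*\n'
    r'[ \t]*Input:[ \t]*"?(?P<input>[^\n]+?)"?[ \t]*\n'
    r'[ \t]*Output:[ \t]*"?(?P<output>[^\n]+?)"?[ \t]*$',
    re.MULTILINE,
)

def parse_examples(text):
    """Extract topic, product, category and description from generated text."""
    records = []
    for match in EXAMPLE_PATTERN.finditer(text):
        raw_input = match.group("input")
        # Split on the last comma so product names containing commas survive.
        product, sep, category = raw_input.rpartition(",")
        if not sep:
            product, category = raw_input, ""
        records.append({
            "topic": match.group("topic"),
            "product": product.strip(),
            "category": category.strip(),
            "description": match.group("output"),
        })
    return records

# Abbreviated sample in the same format as the output above.
sample_output = '''
1. Automotive
Input: "Tesla Model S, Electric Vehicles"
Output: "The Tesla Model S delivers exhilarating performance."

2. Footwear
Input: "Nike Air Max 270, Shoes"
Output: "Step into comfort and style with Nike Air Max 270."

1. Automotive
Input: "Ford F-150, Trucks"
Output: "The Ford F-150 is the quintessential pickup truck."
'''

example_records = parse_examples(sample_output)
topic_counts = Counter(record["topic"] for record in example_records)
print(topic_counts)
```

Counting topics this way makes it easy to see whether some topic areas are over-represented. It also makes it possible to spot rows where the category does not fit the product, such as the "Toyota Camry, Mobile Phones" pair in the output above.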

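You can run this in a loop and append the results to your previous data. This lets you keep generating synthetic text data to train another GPT model, while addressing imbalanced datasets and producing diverse data.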
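Repeated runs tend to return the same products more than once (the Oral-B Pro 1000 appears in every batch above), so it can help to deduplicate before fine-tuning. The sketch below assumes each generated example has already been parsed into a dict with the hypothetical keys `product`, `category` and `description`. It then serializes the unique examples as JSONL in the chat fine-tuning format (one `messages` list per line). The helper names and the system prompt are illustrative assumptions, and the descriptions are abbreviated.

```python
import json

def dedupe_records(records):
    """Keep the first record seen for each (product, category) pair."""
    seen = set()
    unique = []
    for record in records:
        key = (record["product"].lower(), record["category"].lower())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

def to_finetune_jsonl(records, system_prompt):
    """Serialize records as one chat-format training example per line."""
    lines = []
    for record in records:
        example = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"{record['product']}, {record['category']}"},
                {"role": "assistant", "content": record["description"]},
            ]
        }
        lines.append(json.dumps(example, ensure_ascii=False))
    return "\n".join(lines)

# Abbreviated records for illustration; in practice these come from several runs.
accumulated_records = [
    {"product": "Oral-B Pro 1000", "category": "Electronic Toothbrush",
     "description": "Features a 3D cleaning action."},
    {"product": "Oral-B Pro 1000", "category": "Electronic Toothbrush",
     "description": "Achieve a superior clean."},
    {"product": "Oreo Cookies", "category": "Snacks",
     "description": "Indulge in the classic taste of Oreo Cookies."},
]

unique_records = dedupe_records(accumulated_records)
# The system prompt below is a placeholder assumption, not from the notebook.
jsonl_text = to_finetune_jsonl(unique_records, "You write product descriptions.")
print(jsonl_text)
```

Keeping only the first description for each product is a deliberate simplification; if several phrasings per product are useful for training, the key could include the description as well.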

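You have now completed Part 1 of the synthetic data generation tutorial, where we covered:

CSV with a structured prompt
CSV with a Python program
Multi-table CSV with a Python program
Simply creating textual data
Dealing with imbalanced or non-diverse textual data

In Part 2, you will learn prompting strategies for improving the quality of the synthetic text data an LLM generates.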