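This notebook covers use cases where your data is unlabelled but has features that can be used to cluster it into meaningful categories. The challenge with clustering is making the features that distinguish those clusters human-readable, and that is where we'll use GPT-3 to generate meaningful cluster descriptions for us. We can then use these descriptions to apply labels to a previously unlabelled dataset.
To feed the model we use embeddings created with the approach shown in the notebook Multiclass classification for transactions Notebook, applied to all 359 transactions in the dataset to give us a bigger pool for learning.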
# optional env import
from dotenv import load_dotenv
load_dotenv()
# imports
from openai import OpenAI
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
import matplotlib
import matplotlib.pyplot as plt
import os
from ast import literal_eval
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))
COMPLETIONS_MODEL = "gpt-3.5-turbo"
# This path leads to a file with data and precomputed embeddings
embedding_path = "data/library_transactions_with_embeddings_359.csv"
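We'll reuse the approach from the Clustering Notebook, using K-Means to cluster our dataset on the feature embeddings we created previously. We'll then use the Chat Completions endpoint to generate cluster descriptions for us and judge their effectiveness.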
df = pd.read_csv(embedding_path)
df.head()
| | Date | Supplier | Description | Transaction value (£) | combined | n_tokens | embedding |
|---|---|---|---|---|---|---|---|
| 0 | 21/04/2016 | M & J Ballantyne Ltd | George IV Bridge Work | 35098.0 | Supplier: M & J Ballantyne Ltd; Description: G... | 118 | [-0.013169967569410801, -0.004833734128624201,... |
| 1 | 26/04/2016 | Private Sale | Literary & Archival Items | 30000.0 | Supplier: Private Sale; Description: Literary... | 114 | [-0.019571533426642418, -0.010801066644489765,... |
| 2 | 30/04/2016 | City Of Edinburgh Council | Non Domestic Rates | 40800.0 | Supplier: City Of Edinburgh Council; Description... | 114 | [-0.0054041435942053795, -6.548957026097924e-0... |
| 3 | 09/05/2016 | Computacenter Uk | Kelvin Hall | 72835.0 | Supplier: Computacenter Uk; Description: Kelvin... | 113 | [-0.004776035435497761, -0.005533686839044094,... |
| 4 | 09/05/2016 | John Graham Construction Ltd | Causewayside Refurbishment | 64361.0 | Supplier: John Graham Construction Ltd; Description... | 117 | [0.003290407592430711, -0.0073441751301288605,... |
embedding_df = pd.read_csv(embedding_path)
embedding_df["embedding"] = embedding_df.embedding.apply(literal_eval).apply(np.array)
matrix = np.vstack(embedding_df.embedding.values)
matrix.shape
(359, 1536)
n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42, n_init=10)
kmeans.fit(matrix)
labels = kmeans.labels_
embedding_df["Cluster"] = labels
# Project the 1536-dimensional embeddings down to 2D for plotting
tsne = TSNE(
    n_components=2, perplexity=15, random_state=42, init="random", learning_rate=200
)
vis_dims2 = tsne.fit_transform(matrix)
x = [x for x, y in vis_dims2]
y = [y for x, y in vis_dims2]
# Plot each cluster in its own color and mark its 2D mean with an "x"
for category, color in enumerate(["purple", "green", "red", "blue", "yellow"]):
    xs = np.array(x)[embedding_df.Cluster == category]
    ys = np.array(y)[embedding_df.Cluster == category]
    plt.scatter(xs, ys, color=color, alpha=0.3)
    avg_x = xs.mean()
    avg_y = ys.mean()
    plt.scatter(avg_x, avg_y, marker="x", color=color, s=100)
plt.title("Clusters identified visualized in language 2d using t-SNE")
Text(0.5, 1.0, 'Clusters identified visualized in language 2d using t-SNE')
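The value `n_clusters = 5` above is chosen by hand. One possible way to sanity-check that choice, sketched here on synthetic data rather than the transaction embeddings, is to compare silhouette scores across several values of k. The helper name `silhouette_by_k` and the `make_blobs` stand-in data are assumptions for illustration only; in the notebook you would pass `matrix` instead of `demo_matrix`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the embedding matrix (assumption: not the real data)
demo_matrix, _ = make_blobs(
    n_samples=300, centers=4, n_features=16, cluster_std=1.0, random_state=42
)


def silhouette_by_k(data, k_values, random_state=42):
    """Fit K-Means once per k and return a {k: silhouette score} dict."""
    scores = {}
    for k in k_values:
        km = KMeans(n_clusters=k, init="k-means++", random_state=random_state, n_init=10)
        cluster_labels = km.fit_predict(data)
        # Silhouette ranges from -1 to 1; higher means tighter, better separated clusters
        scores[k] = silhouette_score(data, cluster_labels)
    return scores


scores = silhouette_by_k(demo_matrix, range(2, 9))
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

A high silhouette score only says the clusters are geometrically well separated; whether they are meaningful spending categories still has to be judged by reading the sampled transactions.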
# We'll read 10 transactions per cluster as we're expecting some variation
transactions_per_cluster = 10
for i in range(n_clusters):
    print(f"Cluster {i} Theme:\n")
    # Sample transactions from this cluster and strip the field prefixes to shorten the prompt
    transactions = "\n".join(
        embedding_df[embedding_df.Cluster == i]
        .combined.str.replace("Supplier: ", "")
        .str.replace("Description: ", ": ")
        .str.replace("Value: ", ": ")
        .sample(transactions_per_cluster, random_state=42)
        .values
    )
    response = client.chat.completions.create(
        model=COMPLETIONS_MODEL,
        # We'll include a prompt to instruct the model what sort of description we're looking for
        messages=[
            {"role": "user",
             "content": f'''We want to group these transactions into meaningful clusters so we can target the areas we are spending the most money.
What do the following transactions have in common?\n\nTransactions:\n"""\n{transactions}\n"""\n\nTheme:'''}
        ],
        temperature=0,
        max_tokens=100,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    print(response.choices[0].message.content.replace("\n", ""))
    print("\n")
    # Re-sample with the same random_state so the printed rows match the ones sent in the prompt
    sample_cluster_rows = embedding_df[embedding_df.Cluster == i].sample(transactions_per_cluster, random_state=42)
    for j in range(transactions_per_cluster):
        print(sample_cluster_rows.Supplier.values[j], end=", ")
        print(sample_cluster_rows.Description.values[j], end="\n")
    print("-" * 100)
    print("\n")
Cluster 0 Theme:

The common theme among these transactions is that they all involve spending money on various expenses such as electricity, non-domestic rates, IT equipment, computer equipment, and the purchase of an electric van.

EDF ENERGY, Electricity Oct 2019 3 buildings
City Of Edinburgh Council, Non Domestic Rates
EDF, Electricity
EX LIBRIS, IT equipment
City Of Edinburgh Council, Non Domestic Rates
CITY OF EDINBURGH COUNCIL, Rates for 33 Salisbury Place
EDF Energy, Electricity
XMA Scotland Ltd, IT equipment
Computer Centre UK Ltd, Computer equipment
ARNOLD CLARK, Purchase of an electric van
----------------------------------------------------------------------------------------------------

Cluster 1 Theme:

The common theme among these transactions is that they all involve payments for various goods and services. Some specific examples include student bursary costs, collection of papers, architectural works, legal deposit services, papers related to Alisdair Gray, resources on slavery abolition and social justice, collection items, online/print subscriptions, ALDL charges, and literary/archival items.

Institute of Conservation, This payment covers 2 invoices for student bursary costs
PRIVATE SALE, Collection of papers of an individual
LEE BOYD LIMITED, Architectural Works
ALDL, Legal Deposit Services
RICK GEKOSKI, Papers 1970's to 2019 Alisdair Gray
ADAM MATTHEW DIGITAL LTD, Resource - slavery abolution and social justice
PROQUEST INFORMATION AND LEARN, This payment covers multiple invoices for collection items
LM Information Delivery UK LTD, Payment of 18 separate invoice for Online/Print subscriptions Jan 20-Dec 20
ALDL, ALDL Charges
Private Sale, Literary & Archival Items
----------------------------------------------------------------------------------------------------

Cluster 2 Theme:

The common theme among these transactions is that they all involve spending money at Kelvin Hall.

CBRE, Kelvin Hall
GLASGOW CITY COUNCIL, Kelvin Hall
University Of Glasgow, Kelvin Hall
GLASGOW LIFE, Oct 20 to Dec 20 service charge - Kelvin Hall
Computacenter Uk, Kelvin Hall
XMA Scotland Ltd, Kelvin Hall
GLASGOW LIFE, Service Charges Kelvin Hall 01/07/19-30/09/19
Glasgow Life, Kelvin Hall Service Charges
Glasgow City Council, Kelvin Hall
GLASGOW LIFE, Quarterly service charge KH
----------------------------------------------------------------------------------------------------

Cluster 3 Theme:

The common theme among these transactions is that they all involve payments for facility management fees and services provided by ECG Facilities Service.

ECG FACILITIES SERVICE, This payment covers multiple invoices for facility management fees
ECG FACILITIES SERVICE, Facilities Management Charge
ECG FACILITIES SERVICE, Inspection and Maintenance of all Library properties
ECG Facilities Service, Facilities Management Charge
ECG FACILITIES SERVICE, Maintenance contract - October
ECG FACILITIES SERVICE, Electrical and mechanical works
ECG FACILITIES SERVICE, This payment covers multiple invoices for facility management fees
ECG FACILITIES SERVICE, CB Bolier Replacement (1),USP Batteries,Gutter Works & Cleaning of pigeon fouling
ECG Facilities Service, Facilities Management Charge
ECG Facilities Service, Facilities Management Charge
----------------------------------------------------------------------------------------------------

Cluster 4 Theme:

The common theme among these transactions is that they all involve construction or refurbishment work.

M & J Ballantyne Ltd, George IV Bridge Work
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
ARTHUR MCKAY BUILDING SERVICES, Causewayside Work
John Graham Construction Ltd, Causewayside Refurbishment
Morris & Spottiswood Ltd, George IV Bridge Work
ECG FACILITIES SERVICE, Causewayside IT Work
John Graham Construction Ltd, Causewayside Refurbishment
----------------------------------------------------------------------------------------------------
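We now have five new clusters that we can use to describe our data. The visualisation shows that some clusters overlap, so some tuning is still needed, but we can already see that GPT-3 has made some effective inferences. In particular, it picked up that items involving legal deposits are related to literary archival material, which is true even though the model was given no clues about it. Very cool, and with further tuning we can create a base set of clusters and then use them with a multiclass classifier to generalise to other transaction datasets we might work with.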
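As a rough illustration of applying cluster descriptions as labels, the sketch below names each cluster and then labels new rows by their nearest K-Means centroid using `KMeans.predict`. This is a simpler stand-in for the multiclass classifier mentioned above, not the notebook's own method. The data is synthetic, and the label names (`Utilities`, `Construction`, `Facilities`) and variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic stand-in for embeddings: three well-separated groups of 8-dimensional vectors
centers = np.array([[5.0] * 8, [-5.0] * 8, [5.0, -5.0] * 4])
train = np.vstack([c + rng.normal(scale=0.5, size=(30, 8)) for c in centers])

kmeans_demo = KMeans(n_clusters=3, init="k-means++", random_state=42, n_init=10).fit(train)

# Hypothetical human-approved names. K-Means cluster ids are arbitrary, so each
# name is attached to whichever id a known example of that group is assigned to.
example_names = ["Utilities", "Construction", "Facilities"]
cluster_names = {
    int(kmeans_demo.predict(c.reshape(1, -1))[0]): name
    for c, name in zip(centers, example_names)
}

# Label previously unseen rows by their nearest centroid
new_points = np.vstack([centers[1] + 0.1, centers[2] - 0.1])
new_df = pd.DataFrame({"Cluster": kmeans_demo.predict(new_points)})
new_df["Label"] = new_df["Cluster"].map(cluster_names)
print(new_df)
```

In practice the names would come from reviewing the generated themes by hand, and overlapping clusters would need tuning before nearest-centroid labels could be trusted.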