Oct 20, 2022

Clustering for Transaction Classification

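This notebook covers use cases where your data is unlabelled but has features that can be used to cluster it into meaningful categories. The challenge with clustering is making the features that distinguish each cluster human-readable, and that is where we'll use GPT-3 to generate meaningful cluster descriptions for us. We can then use these descriptions to apply labels to a previously unlabelled dataset.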

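To feed the model we use embeddings created with the approach shown in the Multiclass classification for transactions Notebook, applied to all 359 transactions in the dataset to give us a larger pool for learning.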
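For readers who do not have the precomputed file, the following is a minimal sketch of how the text to embed might be assembled and sent to an embeddings endpoint. The "Supplier: ...; Description: ...; Value: ..." layout is inferred from the prefixes stripped later in this notebook. The helper names, the `Transaction value (£)` column name, and the embedding model name are assumptions, not necessarily what the referenced notebook used.

```python
import pandas as pd


def build_combined_text(df: pd.DataFrame) -> pd.Series:
    """Join the fields of each transaction into one string to embed.

    Column names are assumptions about the raw transactions table.
    """
    return (
        "Supplier: " + df["Supplier"].str.strip()
        + "; Description: " + df["Description"].str.strip()
        + "; Value: " + df["Transaction value (£)"].astype(str)
    )


def embed_texts(client, texts, model="text-embedding-3-small"):
    """Return one embedding (a list of floats) per input text.

    `client` is expected to be an openai.OpenAI instance.
    The model name is an assumption; any embedding model would do.
    """
    response = client.embeddings.create(model=model, input=list(texts))
    return [item.embedding for item in response.data]


# Small illustrative frame; no API call is made here
sample = pd.DataFrame(
    {
        "Supplier": ["M & J Ballantyne Ltd", "Private Sale"],
        "Description": ["George IV Bridge Work", "Literary & Archival Items"],
        "Transaction value (£)": [35098.0, 30000.0],
    }
)
combined = build_combined_text(sample)
print(combined.tolist())
```

Calling `embed_texts(client, combined)` would then return one vector per row, which could be stored in an `embedding` column and saved to CSV.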

# optional env import
from dotenv import load_dotenv
load_dotenv()
True
# imports
 
from openai import OpenAI
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
import matplotlib
import matplotlib.pyplot as plt
import os
from ast import literal_eval

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))
COMPLETIONS_MODEL = "gpt-3.5-turbo"

# This path leads to a file with data and precomputed embeddings
embedding_path = "data/library_transactions_with_embeddings_359.csv"

Clustering

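We'll reuse the approach from the Clustering Notebook, using K-Means to cluster our dataset on the feature embeddings we created previously. We'll then use the Chat Completions endpoint to generate cluster descriptions for us and judge their effectiveness.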

df = pd.read_csv(embedding_path)
df.head()
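  | Date | Supplier | Description | Transaction value (£) | combined | n_tokens | embedding
0 | 21/04/2016 | M & J Ballantyne Ltd | George IV Bridge Work | 35098.0 | Supplier: M & J Ballantyne Ltd; Description: G... | 118 | [-0.013169967569410801, -0.004833734128624201,...
1 | 26/04/2016 | Private Sale | Literary & Archival Items | 30000.0 | Supplier: Private Sale; Description: Literary ... | 114 | [-0.019571533426642418, -0.010801066644489765,...
2 | 30/04/2016 | City Of Edinburgh Council | Non Domestic Rates | 40800.0 | Supplier: City Of Edinburgh Council; Description... | 114 | [-0.0054041435942053795, -6.548957026097924e-0...
3 | 09/05/2016 | Computacenter Uk | Kelvin Hall | 72835.0 | Supplier: Computacenter Uk; Description: Kelvin... | 113 | [-0.004776035435497761, -0.005533686839044094,...
4 | 09/05/2016 | John Graham Construction Ltd | Causewayside Refurbishment | 64361.0 | Supplier: John Graham Construction Ltd; Description... | 117 | [0.003290407592430711, -0.0073441751301288605,...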
embedding_df = pd.read_csv(embedding_path)
embedding_df["embedding"] = embedding_df.embedding.apply(literal_eval).apply(np.array)
matrix = np.vstack(embedding_df.embedding.values)
matrix.shape
(359, 1536)
n_clusters = 5

kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42, n_init=10)
kmeans.fit(matrix)
labels = kmeans.labels_
embedding_df["Cluster"] = labels
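The value `n_clusters = 5` is fixed by hand here. One common way to sanity-check that choice is to compare silhouette scores across several candidate cluster counts. The sketch below illustrates the idea on synthetic data so that it runs on its own. The helper `score_cluster_counts` and the demo matrix are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def score_cluster_counts(matrix, candidates, random_state=42):
    """Fit K-Means for each candidate k and return {k: silhouette score}."""
    scores = {}
    for k in candidates:
        model = KMeans(
            n_clusters=k, init="k-means++", random_state=random_state, n_init=10
        )
        cluster_ids = model.fit_predict(matrix)
        # Silhouette ranges from -1 to 1; higher means tighter, better separated clusters
        scores[k] = silhouette_score(matrix, cluster_ids)
    return scores


# Synthetic stand-in for the embedding matrix: three well separated groups
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
demo_matrix = np.vstack(
    [center + rng.normal(scale=0.5, size=(40, 2)) for center in centers]
)

scores = score_cluster_counts(demo_matrix, range(2, 7))
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the real data one would pass the `matrix` built above instead of `demo_matrix`. Silhouette is only one heuristic, so the result is best read alongside the cluster descriptions rather than as a definitive answer.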
tsne = TSNE(
    n_components=2, perplexity=15, random_state=42, init="random", learning_rate=200
)
vis_dims2 = tsne.fit_transform(matrix)

x = [x for x, y in vis_dims2]
y = [y for x, y in vis_dims2]

for category, color in enumerate(["purple", "green", "red", "blue","yellow"]):
    xs = np.array(x)[embedding_df.Cluster == category]
    ys = np.array(y)[embedding_df.Cluster == category]
    plt.scatter(xs, ys, color=color, alpha=0.3)

    avg_x = xs.mean()
    avg_y = ys.mean()

    plt.scatter(avg_x, avg_y, marker="x", color=color, s=100)
plt.title("Clusters visualized in 2D using t-SNE")
Text(0.5, 1.0, 'Clusters visualized in 2D using t-SNE')
image generated by notebook
# We'll read 10 transactions per cluster as we're expecting some variation
transactions_per_cluster = 10

for i in range(n_clusters):
    print(f"Cluster {i} Theme:\n")

    transactions = "\n".join(
        embedding_df[embedding_df.Cluster == i]
        .combined.str.replace("Supplier: ", "")
        .str.replace("Description: ", ":  ")
        .str.replace("Value: ", ":  ")
        .sample(transactions_per_cluster, random_state=42)
        .values
    )
    response = client.chat.completions.create(
        model=COMPLETIONS_MODEL,
        # We'll include a prompt to instruct the model what sort of description we're looking for
        messages=[
            {"role": "user",
             "content": f'''We want to group these transactions into meaningful clusters so we can target the areas we are spending the most money. 
                What do the following transactions have in common?\n\nTransactions:\n"""\n{transactions}\n"""\n\nTheme:'''}
        ],
        temperature=0,
        max_tokens=100,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    print(response.choices[0].message.content.replace("\n", ""))
    print("\n")

    sample_cluster_rows = embedding_df[embedding_df.Cluster == i].sample(transactions_per_cluster, random_state=42)
    for j in range(transactions_per_cluster):
        print(sample_cluster_rows.Supplier.values[j], end=", ")
        print(sample_cluster_rows.Description.values[j], end="\n")

    print("-" * 100)
    print("\n")
Cluster 0 Theme:

The common theme among these transactions is that they all involve spending money on various expenses such as electricity, non-domestic rates, IT equipment, computer equipment, and the purchase of an electric van.


EDF ENERGY, Electricity Oct 2019 3 buildings
City Of Edinburgh Council, Non Domestic Rates 
EDF, Electricity
EX LIBRIS, IT equipment
City Of Edinburgh Council, Non Domestic Rates 
CITY OF EDINBURGH COUNCIL, Rates for 33 Salisbury Place
EDF Energy, Electricity
XMA Scotland Ltd, IT equipment
Computer Centre UK Ltd, Computer equipment
ARNOLD CLARK, Purchase of an electric van
----------------------------------------------------------------------------------------------------


Cluster 1 Theme:

The common theme among these transactions is that they all involve payments for various goods and services. Some specific examples include student bursary costs, collection of papers, architectural works, legal deposit services, papers related to Alisdair Gray, resources on slavery abolition and social justice, collection items, online/print subscriptions, ALDL charges, and literary/archival items.


Institute of Conservation, This payment covers 2 invoices for student bursary costs
PRIVATE SALE, Collection of papers of an individual
LEE BOYD LIMITED, Architectural Works
ALDL, Legal Deposit Services
RICK GEKOSKI, Papers 1970's to 2019 Alisdair Gray
ADAM MATTHEW DIGITAL LTD, Resource -  slavery abolution and social justice
PROQUEST INFORMATION AND LEARN, This payment covers multiple invoices for collection items
LM Information Delivery UK LTD, Payment of 18 separate invoice for Online/Print subscriptions Jan 20-Dec 20
ALDL, ALDL Charges
Private Sale, Literary & Archival Items
----------------------------------------------------------------------------------------------------


Cluster 2 Theme:

The common theme among these transactions is that they all involve spending money at Kelvin Hall.


CBRE, Kelvin Hall
GLASGOW CITY COUNCIL, Kelvin Hall
University Of Glasgow, Kelvin Hall
GLASGOW LIFE, Oct 20 to Dec 20 service charge - Kelvin Hall
Computacenter Uk, Kelvin Hall
XMA Scotland Ltd, Kelvin Hall
GLASGOW LIFE, Service Charges Kelvin Hall 01/07/19-30/09/19
Glasgow Life, Kelvin Hall Service Charges
Glasgow City Council, Kelvin Hall
GLASGOW LIFE, Quarterly service charge KH
----------------------------------------------------------------------------------------------------


Cluster 3 Theme:

The common theme among these transactions is that they all involve payments for facility management fees and services provided by ECG Facilities Service.


ECG FACILITIES SERVICE, This payment covers multiple invoices for facility management fees
ECG FACILITIES SERVICE, Facilities Management Charge
ECG FACILITIES SERVICE, Inspection and Maintenance of all Library properties
ECG Facilities Service, Facilities Management Charge
ECG FACILITIES SERVICE, Maintenance contract - October
ECG FACILITIES SERVICE, Electrical and mechanical works
ECG FACILITIES SERVICE, This payment covers multiple invoices for facility management fees
ECG FACILITIES SERVICE, CB Bolier Replacement (1),USP Batteries,Gutter Works & Cleaning of pigeon fouling
ECG Facilities Service, Facilities Management Charge
ECG Facilities Service, Facilities Management Charge
----------------------------------------------------------------------------------------------------


Cluster 4 Theme:

The common theme among these transactions is that they all involve construction or refurbishment work.


M & J Ballantyne Ltd, George IV Bridge Work
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
ARTHUR MCKAY BUILDING SERVICES, Causewayside Work
John Graham Construction Ltd, Causewayside Refurbishment
Morris & Spottiswood Ltd, George IV Bridge Work
ECG FACILITIES SERVICE, Causewayside IT Work
John Graham Construction Ltd, Causewayside Refurbishment
----------------------------------------------------------------------------------------------------
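Once the themes have been reviewed, they can be condensed into short labels and written back onto the data. The sketch below shows one way to do that with a plain dictionary and `Series.map`. The label wording is a hand-written summary of the themes printed above, and the helper `apply_cluster_labels` and the `Category` column are hypothetical.

```python
import pandas as pd

# Hand-written short labels summarising the generated themes (illustrative only)
cluster_labels = {
    0: "Utilities, rates and IT equipment",
    1: "Collections and subscriptions",
    2: "Kelvin Hall",
    3: "Facilities management",
    4: "Construction and refurbishment",
}


def apply_cluster_labels(df, labels, cluster_col="Cluster", label_col="Category"):
    """Return a copy of df with a readable label for each cluster id."""
    out = df.copy()
    # Cluster ids without a label are marked explicitly rather than left as NaN
    out[label_col] = out[cluster_col].map(labels).fillna("Unlabelled")
    return out


# Small illustrative frame standing in for embedding_df
demo = pd.DataFrame(
    {
        "Supplier": ["ECG FACILITIES SERVICE", "GLASGOW LIFE", "John Graham Construction Ltd"],
        "Cluster": [3, 2, 4],
    }
)
labelled = apply_cluster_labels(demo, cluster_labels)
print(labelled)
```

With the real data, `apply_cluster_labels(embedding_df, cluster_labels)` would add the same column to every transaction.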


Conclusion

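We now have five new clusters that we can use to describe our data. Looking at the visualisation, some of the clusters overlap and will need tuning to get to the right place, but we can already see that GPT-3 has made some effective inferences. In particular, it picked up that items involving legal deposits were related to literature archival, which is true, although the model was given no clues about it. Very cool, and with more tuning we can create a base set of clusters to use with a multiclass classifier, generalising to other transactional datasets we might use.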
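As a rough illustration of that last step, the sketch below trains a simple classifier to predict cluster ids from embedding vectors, so that new transactions could be assigned to an existing cluster. Logistic regression is used here as a stand-in and is not necessarily the classifier used in the multiclass notebook. The synthetic data and the helper `train_cluster_classifier` are assumptions made so the example is self-contained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train_cluster_classifier(matrix, cluster_ids, random_state=42):
    """Fit a classifier on (embedding, cluster id) pairs and report held-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        matrix,
        cluster_ids,
        test_size=0.25,
        random_state=random_state,
        stratify=cluster_ids,
    )
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return clf, clf.score(X_test, y_test)


# Synthetic stand-in for embeddings: three separated groups in 8 dimensions
rng = np.random.default_rng(0)
group_centers = rng.normal(scale=5.0, size=(3, 8))
demo_X = np.vstack(
    [center + rng.normal(scale=0.5, size=(50, 8)) for center in group_centers]
)
demo_y = np.repeat(np.arange(3), 50)

clf, accuracy = train_cluster_classifier(demo_X, demo_y)
print(f"held-out accuracy: {accuracy:.2f}")
```

On the real data one would pass `matrix` and `embedding_df["Cluster"]`, then call `clf.predict` on embeddings of new transactions. Because the targets are themselves cluster assignments, the accuracy measures agreement with K-Means rather than correctness of the categories.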