Using reasoning for data validation

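In this guide, we explore how to use the o1 model, specifically o1-preview, to perform data validation through reasoning. We walk through a practical example involving a synthetic medical dataset and show how to assess the model's accuracy in identifying issues within the data.

Overview

Data validation is a critical step in ensuring the quality and reliability of datasets, especially in sensitive fields such as healthcare. Traditional validation methods usually rely on predefined rules and patterns. Advanced models like o1, however, can understand context and reason about the data, which allows a more flexible and intelligent approach to validation.

In this tutorial, we will:

Generate a synthetic dataset of medical data that contains inconsistencies.
Define a function that takes in a row of data and validates its accuracy.
Run the validation process and compute accuracy metrics.
Analyze and interpret the results.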

from openai import OpenAI
import json
from IPython.display import display, HTML
from sklearn.metrics import precision_score, recall_score, f1_score
from concurrent.futures import ThreadPoolExecutor, as_completed
import csv
import pandas as pd

client = OpenAI()
MODEL = 'o1-preview'

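Synthetic data generation

We will use many of the principles described in the Synthetic Data Generation cookbook to create the foundation of our dataset.

We will prompt the model to generate a medical dataset for our use case. The prompt gives the model detailed instructions on how to create the dataset, what format to follow, and how to fill it with inaccuracies. It also includes a few example rows to get the model started.

Each row in the dataset will have the following fields:

Patient ID: A randomly generated patient ID
Date of Birth: Date of birth of the patient
Gender: M/F
Medical History: Past diagnoses
Current Medications: Medication the patient is taking
Allergies: Identified allergies
Lab Results (Glucose mg/dL)
Diagnoses: Current diagnosis
Treatment Plan: Current treatment plan
Is Valid: Whether or not the current row of data is valid (True/False)
Issue: If the row of data is not valid, what the issue is

Some examples of inaccuracies that may be present in the data are:

Prescribing a medication that the patient is allergic to
Current medications that do not match the medical history
A treatment plan that does not match the diagnosis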
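
For contrast with the reasoning-based approach, the first inaccuracy above can also be caught by a conventional hard-coded rule. The following is a minimal sketch and not part of the original notebook: `PENICILLIN_CLASS` and `has_allergy_conflict` are hypothetical names, and the drug list is deliberately incomplete. It illustrates that a rule-based check needs an explicit entry for every drug and allergy relationship, which is the kind of knowledge the model is expected to bring on its own.

```python
# Hypothetical, deliberately incomplete lookup of penicillin-class drugs.
PENICILLIN_CLASS = {"penicillin", "amoxicillin", "ampicillin"}


def has_allergy_conflict(current_medications: str, allergies: str) -> bool:
    """Return True if a penicillin-class drug is listed for a penicillin-allergic patient."""
    # Multi-valued fields in the example data are separated by ';'.
    meds = {m.strip().lower() for m in current_medications.split(";")}
    allergy_set = {a.strip().lower() for a in allergies.split(";")}
    if "penicillin" not in allergy_set:
        return False
    return bool(meds & PENICILLIN_CLASS)


# Medication and allergy values from sample rows P004 and P002 in this guide.
print(has_allergy_conflict("Amoxicillin", "Penicillin"))  # True
print(has_allergy_conflict("Metformin", "Penicillin"))    # False
```

A rule like this is cheap and deterministic, but it covers only the cases someone thought to encode.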

def generate_data():
    messages = [
        {
            "role": "user",
            "content": """
You are a helpful assistant designed to generate data. You will be given a format for the data to generate and some examples of the data.

When generating Patient IDs, use the format 'P' followed by a three-digit number (e.g., P006, P941, P319).

Intentionally make some mistakes in the data generation and document them in the appropriate columns ('Is Valid' and 'Issue') if the row of data is invalid.

The types of mistakes to include are:

- **Allergy Contradictions**: Prescribing a medication that the patient is allergic to (e.g., prescribing Penicillin to a patient allergic to Penicillin).
- **Medical History and Medication Mismatch**: A patient with a medical condition not receiving appropriate medication (e.g., a diabetic patient not prescribed any diabetes medication).
- **Lab Results and Diagnosis Mismatch**: Lab results that do not support the diagnosis (e.g., normal glucose levels but diagnosed with Diabetes Type 2).
- **Other Plausible Mistakes**: Any other realistic errors that could occur in medical records, such as incorrect gender entries, impossible dates of birth, or inconsistent treatment plans.

Ensure that when 'Is Valid' is 'False', the 'Issue' column clearly explains the problem.

Return 100 rows of data for the user. Your response should strictly be in the format of a valid CSV.

Generate Synthetic Medical Records Dataset with the following columns:
    - Patient ID: A randomly generated patient id
    - Date of Birth: Date of birth of the patient
    - Gender: M/F
    - Medical History: Past diagnoses
    - Current Medications: Medication the patient is taking
    - Allergies: Identified allergies
    - Lab Results (Glucose mg/dL)
    - Diagnoses: Current diagnosis
    - Treatment Plan: Current treatment plan
    - Is Valid: Whether or not the current row of data is valid (True/False)
    - Issue: If the row of data is not valid, what the issue is

Patient ID,Date of Birth,Gender,Medical History,Current Medications,Allergies,Lab Results (Glucose mg/dL),Diagnoses,Treatment Plan,Is Valid,Issue
P001,1980-05-14,M,Hypertension,Lisinopril,None,110,Hypertension,Continue Lisinopril,True,
P002,1975-11-30,F,Diabetes Type 2,Metformin,Penicillin,90,Diabetes Type 2,Continue Metformin,True,
P003,1990-07-22,F,Asthma,Albuterol,Aspirin,85,Asthma,Prescribe Albuterol,True,
P004,2000-03-10,M,None,Amoxicillin,Penicillin,95,Infection,Prescribe Amoxicillin,False,Prescribed Amoxicillin despite Penicillin allergy
P005,1985-09-18,F,Hyperlipidemia,Atorvastatin,None,200,Hyperlipidemia,Continue Atorvastatin,True,
P006,1978-12-05,M,Hypertension; Diabetes Type 2,Lisinopril; Insulin,None,55,Diabetes Type 2,Adjust insulin dosage,False,Low glucose level not properly addressed
            """
        }
    ]

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages
    )

    return response.choices[0].message.content.replace('```csv', '').replace('```', '')

# Generate one batch of rows with generate_data()
data = generate_data()

# Parse the response with csv.reader so quoted fields that contain commas
# stay intact, and skip blank lines
generated_data = [row for row in csv.reader(data.strip().splitlines()) if row]

# Append the generated rows to the medicalData.csv file
with open('../data/medicalData.csv', 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerows(generated_data)

print("Synthetic data generation and appending completed.")

Synthetic data generation and appending completed.
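
Because the model's CSV output is not guaranteed to be well formed, it can help to check each row before relying on the file. The following is a minimal sketch, assuming the 11-column layout requested in the prompt. `split_rows` and `EXPECTED_COLUMNS` are hypothetical names that are not part of the original notebook, and rows P007 and P008 are made up for illustration.

```python
import csv
import io

EXPECTED_COLUMNS = 11  # Patient ID through Issue, as listed in the prompt


def split_rows(raw_csv: str):
    """Parse CSV text and separate rows with the expected column count from the rest."""
    good, bad = [], []
    for row in csv.reader(io.StringIO(raw_csv.strip())):
        if not row:
            continue  # skip blank lines
        (good if len(row) == EXPECTED_COLUMNS else bad).append(row)
    return good, bad


sample = (
    "P001,1980-05-14,M,Hypertension,Lisinopril,None,110,Hypertension,Continue Lisinopril,True,\n"
    # Made-up row with a quoted field that contains a comma.
    'P007,1991-01-02,F,Asthma,Albuterol,None,90,Asthma,"Continue Albuterol, recheck in 3 months",True,\n'
    # Made-up truncated row.
    "P008,1970-02-03,M,Hypertension\n"
)
good, bad = split_rows(sample)
print(len(good), len(bad))  # 2 1
```

Rows that land in `bad` can be logged or regenerated instead of being appended to the dataset.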

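Data validation

Now that the dataset is ready, we will prompt the reasoning model to review each row and determine whether it contains an issue. We ask the model to state whether the row has an issue and to explain what the issue is.

Once the model has produced its list of invalid rows, we pass those results to a model grader to assess two metrics:

The model's ability to correctly identify that a row has an issue
For the subset of rows where an issue was correctly identified, how accurately the model identified the specific issue

Because this grading task is much narrower, a faster model such as gpt-4o can be used for it. Note that the grader code in this guide calls o1-preview as written.

Reminder: because these models are still in beta, rate limits are significantly reduced. Adjust the number of concurrent workers accordingly.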

def validate_data(input_data):
    messages = [
        {
            "role": "user",
            "content": f"""
You are a helpful assistant designed to validate the quality of medical datasets. You will be given a single row of medical data, and your task is to determine whether the data is valid.

- Carefully analyze the data for any inconsistencies, contradictions, missing values, or implausible information.
- Consider the logical relationships between different fields (e.g., treatments should be appropriate for the diagnoses, medications should not conflict with allergies, lab results should be consistent with diagnoses, etc.).
- Use your general medical knowledge to assess the validity of the data.
- Focus solely on the information provided without making assumptions beyond the given data.

**Return only a JSON object** with the following two properties:

- `"is_valid"`: a boolean (`true` or `false`) indicating whether the data is valid.
- `"issue"`: if `"is_valid"` is `false`, provide a brief explanation of the issue; if `"is_valid"` is `true`, set `"issue"` to `null`.

Both JSON properties must always be present.

Do not include any additional text or explanations outside the JSON object.

MEDICAL DATA:
{input_data}
            """
        }
    ]

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages
    )

    # Strip any Markdown code fence the model wrapped around the JSON
    response_content = response.choices[0].message.content.replace('```json', '').replace('```', '').strip()

    try:
        # The message content is always a string, so parse it directly
        return json.loads(response_content)
    except json.JSONDecodeError:
        print(f"Failed to decode JSON response: {response_content}")
        raise
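
Successful JSON parsing alone does not confirm that the reply has the shape the prompt requests. The following is a minimal sketch of a stricter check. `parse_validation_response` is a hypothetical helper that is not part of the original notebook, and the fence string is built from single backticks only to keep this example readable.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_validation_response(text: str) -> dict:
    """Parse the model's reply and check the two properties the prompt asks for."""
    cleaned = text.replace(FENCE + "json", "").replace(FENCE, "").strip()
    data = json.loads(cleaned)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("is_valid"), bool):
        raise ValueError("'is_valid' must be a boolean")
    if "issue" not in data:
        raise ValueError("'issue' property is missing")
    return data


reply = (
    FENCE
    + 'json\n{"is_valid": false, "issue": "Prescribed Amoxicillin despite Penicillin allergy"}\n'
    + FENCE
)
parsed = parse_validation_response(reply)
print(parsed["is_valid"], parsed["issue"])
```

Rejecting malformed replies early avoids a `KeyError` later, when the results are read by key.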

# Read the CSV file and exclude the last two columns
input_data = []
with open('../data/medicalData.csv', 'r') as file:
    reader = csv.reader(file)
    headers = next(reader)
    for row in reader:
        input_data.append(row[:-2])  # Exclude "Is Valid" and "Issue" columns

# Initialize lists to store true labels
true_is_valid = []
true_issues = []

# Extract true labels from the CSV file
with open('../data/medicalData.csv', 'r') as file:
    reader = csv.reader(file)
    headers = next(reader)
    for row in reader:
        true_is_valid.append(row[-2] == 'True')
        true_issues.append(row[-1])

# Function to validate a single row of data
def validate_row(row):
    input_str = ','.join(row)
    result_json = validate_data(input_str)
    return result_json

# Validate data rows and collect results
pred_is_valid = [False] * len(input_data)
pred_issues = [''] * len(input_data)

with ThreadPoolExecutor() as executor:
    futures = {executor.submit(validate_row, row): i for i, row in enumerate(input_data)}
    
    for future in as_completed(futures):
        i = futures[future]  # Get the index of the current row
        result_json = future.result()
        pred_is_valid[i] = result_json['is_valid']
        pred_issues[i] = result_json['issue']

Now that we have the model's results, we can compare them against the source of truth to determine the system's accuracy.

# Convert predicted and true 'is_valid' labels to boolean if they aren't already
pred_is_valid_bool = [bool(val) if isinstance(val, bool) else val == 'True' for val in pred_is_valid]
true_is_valid_bool = [bool(val) if isinstance(val, bool) else val == 'True' for val in true_is_valid]

# Calculate precision, recall, and f1 score for the 'is_valid' prediction
precision = precision_score(true_is_valid_bool, pred_is_valid_bool, pos_label=True)
recall = recall_score(true_is_valid_bool, pred_is_valid_bool, pos_label=True)
f1 = f1_score(true_is_valid_bool, pred_is_valid_bool, pos_label=True)

# Initialize issue_matches_full with False
issue_matches_full = [False] * len(true_is_valid)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1: {f1:.2f}")

Precision: 0.82
Recall: 0.87
F1: 0.84
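
For reference, the three scores are simple ratios of confusion-matrix counts. Because the code above passes `pos_label=True`, the positive class here is a valid row. The counts in this sketch are an assumption: they are one set of values consistent with the reported scores, not figures taken from the run, and `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed counts that reproduce the scores printed above.
p, r, f = precision_recall_f1(tp=27, fp=6, fn=4)
print(f"Precision: {p:.2f}")  # Precision: 0.82
print(f"Recall: {r:.2f}")     # Recall: 0.87
print(f"F1: {f:.2f}")         # F1: 0.84
```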

Issue identification

We will now evaluate the model's ability to correctly classify the issue in the data.

def validate_issue(model_generated_answer, correct_answer):
    messages = [
        {
            "role": "user",
            "content": f"""
You are a medical expert assistant designed to validate the quality of an LLM-generated answer.

The model was asked to review a medical dataset row to determine if the data is valid. If the data is not valid, it should provide a justification explaining why.

Your task:

    •	Compare the model-generated justification with the correct reason provided.
    •	Determine if they address the same underlying medical issue or concern, even if phrased differently.
    •	Focus on the intent, medical concepts, and implications rather than exact wording.

Instructions:

    •	If the justifications have the same intent or address the same medical issue, return True.
    •	If they address different issues or concerns, return False.
    •	Only respond with a single word: True or False.

Examples:

    1.	Example 1:
    •	Model Generated Response: “The patient is allergic to penicillin”
    •	Correct Response: “The patient was prescribed penicillin despite being allergic”
    •	Answer: True
    2.	Example 2:
    •	Model Generated Response: “The date of birth of the patient is incorrect”
    •	Correct Response: “The patient was prescribed penicillin despite being allergic”
    •	Answer: False


Model Generated Response: {model_generated_answer}
Correct Response:  {correct_answer}
            """
        }
    ]

    response = client.chat.completions.create(
        model="o1-preview",
        messages=messages
    )

    # Trim surrounding whitespace so the later comparison with 'True' is reliable
    result = response.choices[0].message.content.strip()

    return result
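
The grader replies in free text, so the exact comparison `issue_match == 'True'` used when collecting results is sensitive to whitespace, casing, and trailing punctuation. The following is a sketch of a more tolerant conversion. `grader_says_true` is a hypothetical helper, not part of the original notebook.

```python
def grader_says_true(reply: str) -> bool:
    """Map a one-word grader reply to a boolean, tolerating whitespace, case and trailing punctuation."""
    return reply.strip().strip(".\"'").lower() == "true"


print(grader_says_true(" True\n"))  # True
print(grader_says_true("true."))    # True
print(grader_says_true("False"))    # False
```

Anything other than a bare "true" is treated as a mismatch, so a grader that ignores the one-word instruction is counted conservatively.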

# Validate issues for rows where both true and predicted 'is_valid' are False
validation_results = []

with ThreadPoolExecutor() as executor:
    futures = {
        executor.submit(validate_issue, pred_issues[i], true_issues[i]): i
        for i in range(len(pred_is_valid_bool))
        if not pred_is_valid_bool[i] and not true_is_valid_bool[i]
    }
    
    for future in as_completed(futures):
        i = futures[future]  # Get the original index
        issue_match = future.result()
        issue_matches_full[i] = (issue_match == 'True')
        validation_results.append({
            "index": i,
            "predicted_issue": pred_issues[i],
            "true_issue": true_issues[i],
            "issue_match": issue_matches_full[i]
        })
    
    # Calculate issue accuracy
    issue_accuracy = sum([i['issue_match'] for i in validation_results]) / len(validation_results)
    
    # Store the results in the dictionary
    model_results = {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "issue_accuracy": issue_accuracy
    }

# Create a DataFrame to store the results
df_results = pd.DataFrame([model_results])

# Create a DataFrame to store the validation results for each row
df_validation_results = pd.DataFrame(validation_results)

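Below we show the subset of rows that were correctly identified as having an issue. For each row, we show the predicted issue, the true issue, and whether the two match.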

def display_formatted_dataframe(df):
    def format_text(text):
        return text.replace('\n', '<br>')

    df_formatted = df.copy()
    df_formatted['predicted_issue'] = df_formatted['predicted_issue'].apply(format_text)
    df_formatted['true_issue'] = df_formatted['true_issue'].apply(format_text)
    
    display(HTML(df_formatted.to_html(escape=False, justify='left')))
    
display_formatted_dataframe(pd.DataFrame(validation_results))

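	index	predicted_issue	true_issue	issue_match
0	39	Amoxicillin was prescribed to a patient who is allergic to Penicillin.	Prescribed Amoxicillin despite Penicillin allergy	True
1	50	Patient diagnosed with Type 1 Diabetes is not on any medication, and the treatment field lists the diagnosis instead of an appropriate treatment.	Diabetes Type 1 patient not receiving insulin	True
2	51	Lab result of 300 indicates hyperglycemia, but no diagnosis or treatment is recorded.	Extremely high glucose level not diagnosed or treated	True
3	26	Patient is allergic to Penicillin but was still prescribed Penicillin.	Prescribed Penicillin despite Penicillin allergy	True
4	31	Patient's age (88) does not match the date of birth (1996-11-05).	Osteoporosis patient not receiving treatment	False
5	24	The 'Treatment Plan' field should not be just 'Depression'; it should specify the treatment prescribed for depression.	Depression patient not receiving treatment	True
6	3	Patient is allergic to Penicillin but was prescribed Amoxicillin.	Prescribed Amoxicillin despite Penicillin allergy	True
7	28	The treatment field contains 'Asthma', which is a diagnosis, not a treatment.	Asthma patient not prescribed any medication	False
8	7	Asthma patient with a low lab result (100) is treated only with lifestyle modifications and no medication, which is inappropriate.	Asthma patient not prescribed any medication	True
9	16	Patient's age (86) does not match the date of birth (1955-10-10).	COPD patient not receiving treatment	False
10	53	The stated age (92) is inconsistent with the date of birth (1983-08-19).	Depression patient not receiving treatment	False
11	23	The treatment field incorrectly lists 'Hyperlipidemia' instead of an appropriate treatment for the diagnosis.	Hyperlipidemia patient not prescribed any medication	True
12	13	Patient is allergic to sulfa drugs but was prescribed Sulfamethoxazole, which is a sulfa drug.	Prescribed a sulfa drug despite sulfa allergy	True
13	98	Penicillin was prescribed even though the patient is allergic to Penicillin.	Prescribed Penicillin despite Penicillin allergy	True
14	9	Patient has a documented Penicillin allergy but was still prescribed Penicillin.	Prescribed Penicillin despite Penicillin allergy	True
15	85	The treatment field contains 'Hyperlipidemia', which is a diagnosis, not a treatment.	Hyperlipidemia patient not prescribed any medication	False
16	18	The prescribed treatment (Aspirin) is not appropriate for a diagnosis of infection.	Prescribed Aspirin despite Aspirin allergy; high glucose level not addressed	False
17	70	The treatment field contains the diagnosis 'Osteoporosis' instead of a treatment.	Osteoporosis patient not receiving treatment	True
18	57	Patient is allergic to Penicillin but is being prescribed Amoxicillin, which is contraindicated.	Prescribed Amoxicillin despite Penicillin allergy	True
19	80	The treatment field incorrectly lists 'Diabetes Type 2' instead of a valid treatment plan.	Diabetes Type 2 patient not receiving medication	True
20	87	The treatment plan includes a prescription for Amoxicillin, but the patient is allergic to that drug.	Prescribed Amoxicillin despite Penicillin allergy	True
21	37	The treatment field contains 'Hyperlipidemia', which is a diagnosis, not a treatment.	Hyperlipidemia patient not prescribed any medication	False
22	95	The treatment plan is listed as 'Asthma', which does not fit the current diagnosis.	Asthma patient not prescribed any medication	True
23	96	The treatment field lists 'Hyperlipidemia', which is not an appropriate treatment.	Hyperlipidemia patient not prescribed any medication	False
24	59	The treatment field contains 'Anemia', which is not a valid treatment.	Anemia patient not receiving treatment	False
25	5	Age does not match date of birth	Low glucose level not properly addressed	False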

# Display the DataFrame
print(df_results)

   precision    recall       f1  issue_accuracy
0   0.818182  0.870968  0.84375        0.615385
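
The `issue_accuracy` value is the fraction of rows flagged by both the labels and the model where the grader judged the two explanations to match. In the table above, 16 of the 26 rows are marked True, which reproduces the reported figure. The following short sketch rebuilds that ratio from the counts alone. The guard against an empty list is an addition that the original calculation does not have.

```python
# 16 matches and 10 mismatches, counted from the issue_match column of the table.
issue_matches = [True] * 16 + [False] * 10

# Guard against division by zero when no rows were flagged by both sides.
issue_accuracy = sum(issue_matches) / len(issue_matches) if issue_matches else float("nan")
print(f"{issue_accuracy:.6f}")  # 0.615385
```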

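Conclusion

The results show that the model identifies issues with high precision and recall (0.82 and 0.87), and pinpoints the exact issue in the data with reasonable accuracy (about 0.62).

This should help streamline the validation of evaluation datasets across a variety of domains.

September 12, 2024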
