YouTube Transcript QA Bot with NodeJS

Use LanceDB's Javascript API and OpenAI to build a QA bot for YouTube transcripts


This Q&A bot lets you search YouTube video transcripts using natural language! We'll show how LanceDB's Javascript API makes it easy to store and manage your data.

npm install vectordb

Download the data

This example uses a sample of a HuggingFace dataset that contains YouTube transcriptions: jamescalam/youtube-transcriptions. Download the sample file and save it under the data folder:

wget -c https://eto-public.s3.us-west-2.amazonaws.com/datasets/youtube_transcript/youtube-transcriptions_sample.jsonl

Prepare the context

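Each entry in the dataset contains only a short chunk of text. We need to merge several of these chunks together on a rolling basis. In this demo, we look back 20 records to build a fuller context for each sentence.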
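To make the rolling merge concrete, here is a minimal sketch on a flat list of text chunks. The helper name rollingContext and the sample chunks are hypothetical and exist only for illustration; the tutorial's own function additionally groups entries by video.

```javascript
// Hypothetical helper: build a rolling context for a flat list of text chunks.
// windowSize is how many earlier chunks are joined in front of each chunk.
function rollingContext (texts, windowSize) {
  return texts.map((_, i) => {
    const start = Math.max(0, i - windowSize)
    // slice's end index is exclusive, so i + 1 includes the current chunk
    return texts.slice(start, i + 1).join(' ')
  })
}

const demoChunks = ['a', 'b', 'c', 'd']
console.log(rollingContext(demoChunks, 2))
// [ 'a', 'a b', 'a b c', 'b c d' ]
```

With a window of 2, the fourth chunk carries only the two chunks before it, which is the same look-back behaviour the demo applies with a window of 20.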

First, we need to read and parse the input file.

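const fs = require('fs/promises')

// Path assumed from the download step above; adjust it to where you saved the file
const INPUT_FILE_NAME = 'data/youtube-transcriptions_sample.jsonl'

// Run this inside an async function, since CommonJS does not allow top-level await
const lines = (await fs.readFile(INPUT_FILE_NAME, 'utf-8'))
  .split('\n')
  .filter(line => line.length > 0)
  .map(line => JSON.parse(line))

// Attach a rolling context of up to 20 preceding entries, grouped by video
const data = contextualize(lines, 20, 'video_id')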
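If the input file might contain Windows line endings or a malformed record, a slightly more defensive parser can help. The following is a sketch only; parseJsonl is a hypothetical helper, and the sample records simply reuse the video_id and text field names from this tutorial.

```javascript
// Hypothetical helper: parse JSON Lines text into an array of objects.
function parseJsonl (text) {
  const rows = []
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim().length === 0) return // skip blank lines
    try {
      rows.push(JSON.parse(line))
    } catch (err) {
      // Report the 1-based line number so the bad record is easy to find
      throw new Error(`Invalid JSON on line ${index + 1}: ${err.message}`)
    }
  })
  return rows
}

// In-memory sample with mixed line endings and a blank line
const sampleJsonl = '{"video_id":"v1","text":"hello"}\r\n\n{"video_id":"v1","text":"world"}\n'
console.log(parseJsonl(sampleJsonl).length) // 2
```

Failing with a line number is a design choice; silently skipping bad records would also work, but it can hide a truncated download.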

The contextualize function groups the transcripts by video_id and then builds the expanded context for each entry.

function contextualize (rows, contextSize, groupColumn) {
  // Group rows by the value of groupColumn (a plain object keyed by that value)
  const grouped = {}
  rows.forEach(row => {
    if (!grouped[row[groupColumn]]) {
      grouped[row[groupColumn]] = []
    }
    grouped[row[groupColumn]].push(row)
  })

  const data = []
  Object.keys(grouped).forEach(key => {
    for (let i = 0; i < grouped[key].length; i++) {
      // Join the current entry with up to contextSize entries before it
      const start = Math.max(0, i - contextSize)
      grouped[key][i].context = grouped[key].slice(start, i + 1).map(r => r.text).join(' ')
    }
    data.push(...grouped[key])
  })
  return data
}

Create the LanceDB table

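To load our data into LanceDB, we need to create an embedding (vector) for each item. In this example, we use the OpenAI embedding function, which has native integration with LanceDB.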

const lancedb = require('vectordb')

// You need to provide an OpenAI API key, here we read it from the OPENAI_API_KEY environment variable
const apiKey = process.env.OPENAI_API_KEY
// The embedding function will create embeddings for the 'context' column
const embedFunction = new lancedb.OpenAIEmbeddingFunction('context', apiKey)
// Connects to LanceDB
const db = await lancedb.connect('data/youtube-lancedb')
const tbl = await db.createTable('vectors', data, embedFunction)

Create and answer the prompt

We will accept questions in natural language and answer them using the corpus stored in LanceDB. First, we need to set up the OpenAI client:

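// Configuration and OpenAIApi are exported by v3 of the openai npm package
const { Configuration, OpenAIApi } = require('openai')
const configuration = new Configuration({ apiKey })
const openai = new OpenAIApi(configuration)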

Then we can ask a question and use LanceDB to retrieve the three transcripts most relevant to the prompt.

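const readline = require('readline/promises')
const rl = readline.createInterface({ input: process.stdin, output: process.stdout })

const query = await rl.question('Prompt: ')
// Embed the question and fetch the three closest rows
const results = await tbl
  .search(query)
  .select(['title', 'text', 'context'])
  .limit(3)
  .execute()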
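Conceptually, a vector search ranks stored rows by how close their embeddings are to the embedding of the query. The sketch below shows that idea as a brute-force scan over toy 2-D vectors. It is not how LanceDB implements search; the helper names l2Distance and topK are hypothetical, and Euclidean (L2) distance is an assumption made for illustration.

```javascript
// Euclidean (L2) distance between two equal-length vectors
function l2Distance (a, b) {
  let sum = 0
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i]
    sum += d * d
  }
  return Math.sqrt(sum)
}

// Hypothetical helper: return the k rows closest to the query vector
function topK (rows, queryVector, k) {
  return rows
    .map(row => ({ ...row, distance: l2Distance(row.vector, queryVector) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k)
}

// Toy 2-D vectors; real embeddings have many more dimensions
const toyRows = [
  { text: 'moon landing', vector: [1, 0] },
  { text: 'sentence transformers', vector: [0, 1] },
  { text: 'apollo program', vector: [0.9, 0.1] }
]
console.log(topK(toyRows, [1, 0], 2).map(r => r.text))
// [ 'moon landing', 'apollo program' ]
```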

The query and the context of the retrieved transcripts are combined into a single prompt:

function createPrompt (query, context) {
    let prompt =
        'Answer the question based on the context below.\n\n' +
        'Context:\n'

    // need to make sure our prompt is not larger than max size
    prompt = prompt + context.map(c => c.context).join('\n\n---\n\n').substring(0, 3750)
    prompt = prompt + `\n\nQuestion: ${query}\nAnswer:`
    return prompt
}
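The substring call in createPrompt cuts the joined context at a fixed character count, which can end in the middle of a chunk. One possible alternative, sketched here under stated assumptions, keeps whole chunks until the budget is used up. The helper name fitContext is hypothetical, and the budget is counted in characters, as in createPrompt, not in tokens.

```javascript
// Hypothetical helper: keep whole context chunks until the character budget is used up
function fitContext (chunks, maxChars, separator = '\n\n---\n\n') {
  const kept = []
  let used = 0
  for (const chunk of chunks) {
    // Every chunk after the first also costs one separator
    const cost = chunk.length + (kept.length > 0 ? separator.length : 0)
    if (used + cost > maxChars) break
    kept.push(chunk)
    used += cost
  }
  return kept.join(separator)
}

const demoPieces = ['aaaa', 'bbbb', 'cccc']
console.log(JSON.stringify(fitContext(demoPieces, 10, '|')))
// "aaaa|bbbb"
```

The trade-off is that a single chunk longer than the budget yields an empty context, so a caller may still want the substring cut as a fallback.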

Now we can use the OpenAI Completion API to process our custom prompt and produce an answer.

const response = await openai.createCompletion({
  model: 'text-davinci-003',
  prompt: createPrompt(query, results),
  max_tokens: 400,
  temperature: 0,
  top_p: 1,
  frequency_penalty: 0,
  presence_penalty: 0
})
console.log(response.data.choices[0].text)
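Completion text often starts with whitespace, and a response may come back without any choices. A small guard, sketched below, handles both cases. The helper name extractAnswer is hypothetical, and the mocked object only mirrors the response.data.choices[0].text shape used in the snippet above, so it runs without calling the API.

```javascript
// Hypothetical helper: pull the answer text out of a completion response.
// Returns null when the response has no usable choice.
function extractAnswer (response) {
  const choices = response && response.data && response.data.choices
  if (!Array.isArray(choices) || choices.length === 0) {
    return null
  }
  const text = choices[0].text
  // Trim the leading whitespace that completions often begin with
  return typeof text === 'string' ? text.trim() : null
}

// Mocked response object, so no API call is made here
const mockResponse = { data: { choices: [{ text: ' NLI with multiple negative ranking loss.' }] } }
console.log(extractAnswer(mockResponse))
// NLI with multiple negative ranking loss.
```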

Putting it all together

Now we can enter a query and get an answer based on your local LanceDB data.

Prompt: who was the 12th person on the moon and when did they land?
 The 12th person on the moon was Harrison Schmitt and he landed on December 11, 1972.
Prompt: Which training method should I use for sentence transformers when I only have pairs of related sentences?
 NLI with multiple negative ranking loss.

That's a wrap

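In this example, you learned how to use LanceDB to store and query embedding representations of your local data. The complete example code is on GitHub, and you can also download the LanceDB dataset using this link.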