Python source code

Source code for the llms_txt Python module, containing helpers to create and use llms.txt files

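Introduction

The llms.txt file spec is for files located at the path llms.txt of a website (or, optionally, in a subpath). llms-sample.txt is a simple example. A file following the spec contains the following sections as markdown, in this specific order:

- An H1 with the name of the project or site. This is the only required section
- A blockquote with a short summary of the project, containing key information necessary for understanding the rest of the file
- Zero or more markdown sections of any type (e.g. paragraphs, lists, etc.) except headings, containing more detailed information about the project and how to interpret the provided files
- Zero or more markdown sections delimited by H2 headers, containing "file lists" of URLs where further detail is available
  - Each "file list" is a markdown list, containing a required markdown hyperlink [name](url), then optionally a : and notes about the file.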
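
To make the section order concrete, here is a small sketch of a file that follows the spec, together with two simple checks. The project name, URLs, and the helper names has_required_h1 and h2_names are made up for illustration and are not part of llms_txt.

```python
# Hypothetical llms.txt content; the project name and URLs are placeholders.
minimal = """# Example Project

> One-sentence summary of what the project is.

Free-form notes (paragraphs or lists, but no headings).

## Docs

- [Quick start](https://example.com/quickstart.md): A brief overview

## Optional

- [Full reference](https://example.com/reference.md)
"""

def has_required_h1(txt):
    "True if the first non-blank line is an H1, the only required section."
    lines = [l for l in txt.splitlines() if l.strip()]
    return bool(lines) and lines[0].startswith('# ')

def h2_names(txt):
    "Names of the H2-delimited file lists, in document order."
    return [l[3:].strip() for l in txt.splitlines() if l.startswith('## ')]

print(has_required_h1(minimal), h2_names(minimal))
```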

Here's the start of a sample llms.txt file that we'll use for testing:

samp = Path('llms-sample.txt').read_text()
print(samp[:480])

# FastHTML

> FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore's `FT` "FastTags" into a library for creating server-rendered hypermedia applications.

Remember:

- Use `serve()` for running uvicorn (`if __name__ == "__main__"` is not needed since it's automatic)
- When a title is needed with a response, use `Titled`; note that that already wraps children in `Container`, and already includes both the meta title as well as the H1 element

Reading

We'll implement parse_llms_file to pull out the sections of llms.txt into a simple data structure.

source

search

 search (pat, txt, flags=0)

Dictionary of matched groups in pat within txt
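
The body of search is not shown on this page. One plausible way to get this behaviour with the standard re module is sketched below; the name search_sketch and the choice to return None when nothing matches are assumptions.

```python
import re

def search_sketch(pat, txt, flags=0):
    "Dict of named groups from the first match of `pat` in `txt`, or None if there is no match."
    m = re.search(pat, txt, flags=flags)
    return m.groupdict() if m else None

print(search_sketch(r'(?P<word>\w+)', 'hello world'))
```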

source

named_re

 named_re (nm, pat)

Pattern to match pat in a named capture group
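
A helper like this can be a one-liner that wraps the pattern in Python's (?P<name>...) syntax. The sketch below assumes that is all it does; named_re_sketch is a stand-in name, not the library function.

```python
import re

def named_re_sketch(nm, pat):
    "Wrap `pat` in a capture group named `nm`."
    return f'(?P<{nm}>{pat})'

pat = named_re_sketch('year', r'\d{4}')
print(pat)
print(re.search(pat, 'Released in 2024').groupdict())
```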

source

opt_re

 opt_re (s)

Pattern to optionally match s
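
An optional pattern is commonly written as a non-capturing group followed by ?. The sketch below assumes that form; opt_re_sketch is a stand-in name. Named groups inside an optional part that does not match come back as None.

```python
import re

def opt_re_sketch(s):
    "Make `s` optional by wrapping it in a non-capturing group followed by `?`."
    return f'(?:{s})?'

# A key, optionally followed by `=value`.
pat = r'(?P<key>\w+)' + opt_re_sketch(r'=(?P<val>\w+)')
print(re.search(pat, 'a=1').groupdict())
print(re.search(pat, 'flag').groupdict())
```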

We'll work "outside in", so we can test the innermost matches as we go.

Parsing links

link = '- [FastHTML quick start](https://docs.fastht.ml/tutorials/quickstart_for_web_devs.html.md): A brief overview of FastHTML features'

Parse the first part of link into a dict

title = named_re('title', r'[^\]]+')
pat =  fr'-\s*\[{title}\]'
search(pat, samp)

{'title': 'internal docs - ed'}

Do the next bit.

url = named_re('url', r'[^\)]+')
pat += fr'\({url}\)'
search(pat, samp)

{'title': 'internal docs - ed', 'url': 'https://llmstxt.org/ed.html'}

Do the final bit. Note that this part is optional.

desc = named_re('desc', r'.*')
pat += opt_re(fr':\s*{desc}')
search(pat, link)

{'title': 'FastHTML quick start',
 'url': 'https://docs.fastht.ml/tutorials/quickstart_for_web_devs.html.md',
 'desc': 'A brief overview of FastHTML features'}

Combine those pieces into a function parse_link(txt)

source

parse_link

 parse_link (txt)

Parse a link section from llms.txt

parse_link(link)

{'title': 'FastHTML quick start',
 'url': 'https://docs.fastht.ml/tutorials/quickstart_for_web_devs.html.md',
 'desc': 'A brief overview of FastHTML features'}

parse_link('-[foo](http://foo)')

{'title': 'foo', 'url': 'http://foo', 'desc': None}
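
Putting the three pieces together, a standard-library sketch of such a function could look like the following. The pattern is written out in full rather than built with named_re and opt_re, and parse_link_sketch is a stand-in name rather than the library's implementation.

```python
import re

# title in [...], url in (...), then an optional ": description" to end of line.
LINK_PAT = r'-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^\)]+)\)(?::\s*(?P<desc>.*))?'

def parse_link_sketch(txt):
    "Dict with title, url and desc (None when absent), or None if `txt` is not a link item."
    m = re.search(LINK_PAT, txt)
    return m.groupdict() if m else None

print(parse_link_sketch('- [foo2](http://foo2): stuff'))
print(parse_link_sketch('-[foo](http://foo)'))
```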

Parsing sections

sections = '''First bit.

## S1

-[foo](http://foo)
- [foo2](http://foo2): stuff

## S2

- [foo3](http://foo3)'''

start,*rest = re.split(fr'^##\s*(.*?$)', sections, flags=re.MULTILINE)
start

'First bit.\n\n'

rest

['S1',
 '\n\n-[foo](http://foo)\n- [foo2](http://foo2): stuff\n\n',
 'S2',
 '\n\n- [foo3](http://foo3)']

Concisely create a dict from the pairs in rest.

d = dict(chunked(rest, 2))
d

{'S1': '\n\n-[foo](http://foo)\n- [foo2](http://foo2): stuff\n\n',
 'S2': '\n\n- [foo3](http://foo3)'}
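
The chunked call above is not imported on this page; it is presumably fastcore's helper. If that dependency is unwanted, the same pairing can be sketched with slicing and zip from the standard library:

```python
# re.split with one capture group yields [name, body, name, body, ...].
rest = ['S1', 'body of S1', 'S2', 'body of S2']

# Even positions are section names, odd positions are section bodies.
d = dict(zip(rest[::2], rest[1::2]))
print(d)
```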

links = d['S1']
links.strip()

'-[foo](http://foo)\n- [foo2](http://foo2): stuff'

Parse links into a list of links. There can be multiple newlines between them.

_parse_links(links)

[{'title': 'foo', 'url': 'http://foo', 'desc': None},
 {'title': 'foo2', 'url': 'http://foo2', 'desc': 'stuff'}]
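
The body of _parse_links is not shown here. A minimal sketch, assuming it splits on runs of newlines and parses each remaining line, is given below. Skipping lines that are not link items is a choice made for this sketch, and parse_links_sketch is a stand-in name.

```python
import re

LINK_PAT = r'-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^\)]+)\)(?::\s*(?P<desc>.*))?'

def parse_links_sketch(links):
    "Parse each non-blank line of `links` as a link item, skipping lines that do not match."
    out = []
    for line in re.split(r'\n+', links.strip()):
        m = re.search(LINK_PAT, line)
        if m: out.append(m.groupdict())
    return out

print(parse_links_sketch('\n\n-[foo](http://foo)\n\n\n- [foo2](http://foo2): stuff\n\n'))
```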

Create a function that uses the above steps to parse an llms.txt into start and a dict with keys like d and parsed lists of links as values.

start, sects = _parse_llms(samp)
start

'# FastHTML\n\n> FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore\'s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.\n\nRemember:\n\n- Use `serve()` for running uvicorn (`if __name__ == "__main__"` is not needed since it\'s automatic)\n- When a title is needed with a response, use `Titled`; note that that already wraps children in `Container`, and already includes both the meta title as well as the H1 element.'
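
A self-contained sketch of that function, using only the standard library, might look like this. It reuses the split pattern from above and strips start, which matches the output shown; parse_llms_sketch and the inlined link pattern are stand-ins, not the library code.

```python
import re

LINK_PAT = r'-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^\)]+)\)(?::\s*(?P<desc>.*))?'

def parse_llms_sketch(txt):
    "Split `txt` on H2 headers; return the leading text and {section name: [link dicts]}."
    start, *rest = re.split(r'^##\s*(.*?$)', txt, flags=re.MULTILINE)
    sects = {}
    # `rest` alternates section name, section body.
    for name, body in zip(rest[::2], rest[1::2]):
        matches = (re.search(LINK_PAT, line) for line in body.splitlines())
        sects[name] = [m.groupdict() for m in matches if m]
    return start.strip(), sects

sample = 'First bit.\n\n## S1\n\n-[foo](http://foo)\n- [foo2](http://foo2): stuff\n\n## S2\n\n- [foo3](http://foo3)'
start, sects = parse_llms_sketch(sample)
print(start, list(sects))
```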

title = named_re('title', r'.+?$')
summ = named_re('summary', '.+?$')
summ_pat = opt_re(fr"^>\s*{summ}$")
info = named_re('info', '.*')

pat = fr'^#\s*{title}\n+{summ_pat}\n+{info}'
search(pat, start, (re.MULTILINE|re.DOTALL))

{'title': 'FastHTML',
 'summary': 'FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore\'s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.',
 'info': 'Remember:\n\n- Use `serve()` for running uvicorn (`if __name__ == "__main__"` is not needed since it\'s automatic)\n- When a title is needed with a response, use `Titled`; note that that already wraps children in `Container`, and already includes both the meta title as well as the H1 element.'}
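
Because the blockquote is wrapped in an optional group, the same pattern should also accept a header with no summary. The sketch below writes the pattern out by hand, assuming named_re and opt_re expand to (?P<name>...) and (?:...)? respectively, and runs it on a made-up header.

```python
import re

# Hand-expanded form of the pattern built above (an assumption about the helpers).
HEADER_PAT = r'^#\s*(?P<title>.+?$)\n+(?:^>\s*(?P<summary>.+?$)$)?\n+(?P<info>.*)'

# Hypothetical header with a title and notes, but no blockquote summary.
header = '# Proj\n\nSome info.'
parsed = re.search(HEADER_PAT, header, re.MULTILINE | re.DOTALL).groupdict()
print(parsed)
```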

Let's finish it off!

source

parse_llms_file

 parse_llms_file (txt)

Parse llms.txt file contents in txt to an AttrDict

llmsd = parse_llms_file(samp)
llmsd.summary

'FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore\'s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.'

llmsd.sections.Examples

(#1) [{'title': 'Todo list application', 'url': 'https://raw.githubusercontent.com/AnswerDotAI/fasthtml/main/examples/adv_app.py', 'desc': 'Detailed walk-thru of a complete CRUD app in FastHTML showing idiomatic use of FastHTML and HTMX patterns.'}]

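XML conversion

For some LLMs, such as Claude, XML format is preferred, so we'll provide a function to create that format.

source

get_doc_content

 get_doc_content (url)

Fetch content from a local file if in an nbdev repo.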

source

mk_ctx

 mk_ctx (d, optional=True, n_workers=None)

Create a Project with a Section for each H2 part in d, optionally skipping the 'optional' section.

ctx = mk_ctx(llmsd)
print(to_xml(ctx, do_escape=False)[:260]+'...')

<project title="FastHTML" summary='FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore&#39;s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.'>Remember:

- Use `serve()` for running uvic...
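
To illustrate the shape of the result without the library, here is a rough analogue built with xml.etree.ElementTree. It is not how mk_ctx is implemented: the element names, the use of lowercased section names as tags, and putting the URL where the downloaded document text would go are all assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

def mk_ctx_sketch(d, optional=True):
    "Build a <project> element with one child element per H2 section in `d`."
    proj = ET.Element('project', title=d['title'], summary=d.get('summary') or '')
    proj.text = d.get('info') or ''
    for name, links in d['sections'].items():
        # Skip the section named "Optional" unless it was asked for.
        if name.lower() == 'optional' and not optional: continue
        sec = ET.SubElement(proj, name.lower())
        for link in links:
            doc = ET.SubElement(sec, 'doc', title=link['title'])
            if link.get('desc'): doc.set('desc', link['desc'])
            # Placeholder: the url stands in for the downloaded document text.
            doc.text = link['url']
    return proj

# Hypothetical parsed file, in the same shape parse_llms_file returns.
parsed = {'title': 'Demo', 'summary': 'A demo.', 'info': 'Notes.',
          'sections': {'Docs': [{'title': 'Guide', 'url': 'https://example.com/guide.md', 'desc': 'Start here'}],
                       'Optional': [{'title': 'Full', 'url': 'https://example.com/full.md', 'desc': None}]}}
root = mk_ctx_sketch(parsed, optional=False)
print(ET.tostring(root, encoding='unicode'))
```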

source

get_sizes

 get_sizes (ctx)

Get the size of each section of the LLM context

get_sizes(ctx)

{'docs': {'internal docs - ed': 34464,
  'FastHTML quick start': 27383,
  'HTMX reference': 26812,
  'Starlette quick guide': 7936},
 'examples': {'Todo list application': 18558},
 'optional': {'Starlette full documentation': 48331}}
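
The nested result maps each section to its documents and their sizes. A comparable computation over an ElementTree document is sketched below, measuring size as the character count of each document's text. The XML layout is a made-up stand-in for the real context object, and get_sizes_sketch is not the library function.

```python
import xml.etree.ElementTree as ET

def get_sizes_sketch(ctx):
    "Map each section tag to {document title: number of characters of text}."
    return {sec.tag: {doc.get('title'): len(doc.text or '') for doc in sec} for sec in ctx}

# Hypothetical context with two sections of one document each.
ctx = ET.fromstring(
    '<project title="Demo">'
    '<docs><doc title="Guide">hello world</doc></docs>'
    '<optional><doc title="Full">abc</doc></optional>'
    '</project>')
print(get_sizes_sketch(ctx))
```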

Path('../fasthtml.md').write_text(to_xml(ctx, do_escape=False))

source

create_ctx

 create_ctx (txt, optional=False, n_workers=None)

A Project with a Section for each H2 part in txt, optionally skipping the 'optional' section.

source

llms_txt2ctx

 llms_txt2ctx (fname:str, optional:<function bool_arg>=False,
               n_workers:int=None, save_nbdev_fname:str=None)

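Print a Project with a Section for each H2 part in the file read from fname, optionally skipping the 'optional' section.

	Type	Default	Details
fname	str		File name to read
optional	bool_arg	False	Include 'optional' section?
n_workers	int	None	Number of threads to use for parallel downloading
save_nbdev_fname	str	None	Save output to nbdev `{docs_path}` instead of emitting to stdout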

!llms_txt2ctx llms-sample.txt > ../fasthtml.md