torch_frame.datasets.MultimodalTextBenchmark

class MultimodalTextBenchmark(root: str, name: str, text_stype: torch_frame.stype = stype.text_embedded, col_to_text_embedder_cfg: dict[str, TextEmbedderConfig] | TextEmbedderConfig | None = None, col_to_text_tokenizer_cfg: dict[str, TextTokenizerConfig] | TextTokenizerConfig | None = None)

Bases: Dataset

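The benchmark datasets of tabular data with text columns used in "Benchmarking Multimodal AutoML for Tabular Data with Text Fields". For some regression datasets, the target column has been transformed from log scale back to the original scale.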

Parameters:

name (str) – The name of the dataset to download.
text_stype (torch_frame.stype) – The text stype to use for the text columns in the dataset. (default: torch_frame.text_embedded)
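
The parameters above can be put together as in the following minimal sketch. It assumes that `torch` and `torch_frame` are installed and that `TextEmbedderConfig` is importable from `torch_frame.config.text_embedder`. The helpers `toy_embed_one`, `toy_text_embedder`, and `load_benchmark`, as well as the constant `EMBED_DIM`, are hypothetical names introduced only for illustration. The hash-based embedder is a stand-in, not a recommended encoder; in practice a real sentence-embedding model would take its place.

```python
import hashlib

EMBED_DIM = 16  # hypothetical embedding width, chosen only for this sketch


def toy_embed_one(text, dim: int = EMBED_DIM) -> list[float]:
    """Hash-based bag-of-words vector; a stand-in for a real text encoder."""
    vec = [0.0] * dim
    for token in str(text).lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0  # count the token in one hashed bucket
    return vec


def toy_text_embedder(sentences: list[str]):
    """Map a batch of strings to a [batch_size, EMBED_DIM] float tensor."""
    import torch  # imported lazily so toy_embed_one stays dependency-free
    return torch.tensor([toy_embed_one(s) for s in sentences],
                        dtype=torch.float)


def load_benchmark(root: str, name: str):
    """Download `name`, embed its text columns, and materialize the dataset."""
    from torch_frame.config.text_embedder import TextEmbedderConfig
    from torch_frame.datasets import MultimodalTextBenchmark

    dataset = MultimodalTextBenchmark(
        root=root,
        name=name,
        # text_stype is left at its default (text_embedded), so an embedder
        # config is supplied; a single config applies to every text column.
        col_to_text_embedder_cfg=TextEmbedderConfig(
            text_embedder=toy_text_embedder, batch_size=32),
    )
    dataset.materialize()  # runs the embedder and builds the tensor frame
    return dataset
```

Calling, for example, `load_benchmark("/tmp/multimodal_text_benchmark", "wine_reviews")` would download the data on first use; the root path here is an arbitrary choice. Passing a `dict[str, TextEmbedderConfig]` instead of a single config allows a different embedder per text column, as the signature indicates.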

Statistics:

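Name	#rows	#cols (numerical)	#cols (categorical)	#cols (text)	#cols (other)	#classes	Task	Missing value ratio
product_sentiment_machine_hack	6,364	0	1	1	0	4	multiclass_classification	0.0%
jigsaw_unintended_bias100K	125,000	29	0	1	0	2	binary_classification	41.4%
news_channel	25,355	14	0	1	0	6	multiclass_classification	0.0%
wine_reviews	105,154	2	2	1	0	30	multiclass_classification	1.0%
data_scientist_salary	19,802	0	3	2	1	6	multiclass_classification	12.3%
melbourne_airbnb	22,895	26	47	13	3	10	multiclass_classification	9.6%
imdb_genre_prediction	1,000	7	1	2	1	2	binary_classification	0.0%
kick_starter_funding	108,128	1	3	3	2	2	binary_classification	0.0%
fake_job_postings2	15,907	0	3	2	0	2	binary_classification	23.8%
google_qa_answer_type_reason_explanation	6,079	0	1	3	0	1	regression	0.0%
google_qa_question_type_reason_explanation	6,079	0	1	3	0	1	regression	0.0%
bookprice_prediction	6,237	2	3	3	0	1	regression	1.7%
jc_penney_products	13,575	2	1	2	0	1	regression	13.7%
women_clothing_review	23,486	1	3	2	0	1	regression	1.8%
news_popularity2	30,009	3	0	1	0	1	regression	0.0%
ae_price_prediction	28,328	2	5	1	3	1	regression	6.1%
california_house_price	47,439	18	8	2	11	1	regression	13.8%
mercari_price_suggestion100K	125,000	0	6	2	1	1	regression	3.4%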