torch_frame.datasets.HuggingFaceDatasetDict

class HuggingFaceDatasetDict(path: str, name: str | None = None, columns: list[str] | None = None, col_to_stype: dict[str, stype] | None = None, target_col: str | None = None, **kwargs)

Bases: Dataset

Loads a Hugging Face datasets.DatasetDict dataset into a torch_frame.data.Dataset with pre-defined split information. To use this class, install the Datasets package first. For all available dataset paths and names, see Hugging Face Datasets.
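
Because the class depends on the optional Datasets package, it can help to check for it before constructing the dataset. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of torch_frame, and the install hint assumes the package is published on PyPI under the name datasets.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path.

    Hypothetical helper for illustration; it is not part of torch_frame.
    """
    return importlib.util.find_spec(package_name) is not None


# Check for the Hugging Face `datasets` package before using the class.
if is_installed("datasets"):
    print("datasets is available")
else:
    # Assumption: the package is installable from PyPI as `datasets`.
    print("datasets is missing; try: pip install datasets")
```

A check like this only confirms that the package can be located; it does not import it or verify its version.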

Parameters:
  • path (str) – Path or name of the dataset.

  • name (str, optional) – Name of the dataset configuration. (default: None)

  • columns (list, optional) – List of columns to include. (default: None)

  • col_to_stype (dict, optional) – Mapping from column name to its semantic type (stype). (default: None)

  • target_col (str, optional) – Name of the target column. (default: None)

Example

Load the spotify-tracks-dataset dataset from the Hugging Face Hub into a torch_frame.data.Dataset:

>>> from torch_frame.datasets import HuggingFaceDatasetDict
>>> from torch_frame.config.text_embedder import TextEmbedderConfig
>>> from torch_frame.testing.text_embedder import HashTextEmbedder
>>> dataset = HuggingFaceDatasetDict(
...     path="maharshipandya/spotify-tracks-dataset",
...     columns=["artists", "album_name", "track_name",
...              "popularity", "duration_ms", "explicit",
...              "danceability", "energy", "key", "loudness",
...              "mode", "speechiness", "acousticness",
...              "instrumentalness", "liveness", "valence",
...              "tempo", "time_signature", "track_genre"
...     ],
...     target_col="track_genre",
...     col_to_text_embedder_cfg=TextEmbedderConfig(
...         text_embedder=HashTextEmbedder(10)),
... )
>>> dataset.materialize()
>>> dataset.tensor_frame
TensorFrame(
    num_cols=18,
    num_rows=114000,
    numerical (11): [
        'acousticness',
        'danceability',
        'duration_ms',
        'energy',
        'instrumentalness',
        'liveness',
        'loudness',
        'popularity',
        'speechiness',
        'tempo',
        'valence',
    ],
    categorical (4): [
        'explicit',
        'key',
        'mode',
        'time_signature',
    ],
    embedding (3): ['artists', 'album_name', 'track_name'],
    has_target=True,
    device='cpu',
)
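
The column counts in the output above can be cross-checked by hand: 19 columns are requested, one of them is the target, and the remaining 18 are reported as features. The sketch below does that bookkeeping in plain Python using only the names shown in the example; it does not call torch_frame, and the variable names are illustrative.

```python
# Columns requested in the example, in the order they were passed.
columns = [
    "artists", "album_name", "track_name",
    "popularity", "duration_ms", "explicit",
    "danceability", "energy", "key", "loudness",
    "mode", "speechiness", "acousticness",
    "instrumentalness", "liveness", "valence",
    "tempo", "time_signature", "track_genre",
]
target_col = "track_genre"

# Feature columns are the requested columns minus the target.
feature_cols = [c for c in columns if c != target_col]

# Groups as reported in the TensorFrame output of the example.
numerical = [
    "acousticness", "danceability", "duration_ms", "energy",
    "instrumentalness", "liveness", "loudness", "popularity",
    "speechiness", "tempo", "valence",
]
categorical = ["explicit", "key", "mode", "time_signature"]
embedding = ["artists", "album_name", "track_name"]

grouped = numerical + categorical + embedding

# Every feature column lands in exactly one group.
print(len(columns), len(feature_cols), len(grouped))
print(sorted(feature_cols) == sorted(grouped))
```

The target column does not appear in any of the three groups, which is consistent with has_target=True being reported separately from num_cols.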