torch_frame.datasets.HuggingFaceDatasetDict
- class HuggingFaceDatasetDict(path: str, name: str | None = None, columns: list[str] | None = None, col_to_stype: dict[str, stype] | None = None, target_col: str | None = None, **kwargs)
Bases: Dataset
Load a Hugging Face datasets.DatasetDict dataset into a torch_frame.data.Dataset with predefined split information. To use this class, first install the Datasets package. For all available dataset paths and names, see Hugging Face Datasets.
- Parameters:
Example
Load the spotify-tracks-dataset dataset from the Hugging Face Hub into a torch_frame.data.Dataset:
>>> from torch_frame.datasets import HuggingFaceDatasetDict
>>> from torch_frame.config.text_embedder import TextEmbedderConfig
>>> from torch_frame.testing.text_embedder import HashTextEmbedder
>>> dataset = HuggingFaceDatasetDict(
...     path="maharshipandya/spotify-tracks-dataset",
...     columns=["artists", "album_name", "track_name",
...              "popularity", "duration_ms", "explicit",
...              "danceability", "energy", "key", "loudness",
...              "mode", "speechiness", "acousticness",
...              "instrumentalness", "liveness", "valence",
...              "tempo", "time_signature", "track_genre"
...     ],
...     target_col="track_genre",
...     col_to_text_embedder_cfg=TextEmbedderConfig(
...         text_embedder=HashTextEmbedder(10)),
... )
>>> dataset.materialize()
>>> dataset.tensor_frame
TensorFrame(
    num_cols=18,
    num_rows=114000,
    numerical (11): [
        'acousticness',
        'danceability',
        'duration_ms',
        'energy',
        'instrumentalness',
        'liveness',
        'loudness',
        'popularity',
        'speechiness',
        'tempo',
        'valence',
    ],
    categorical (4): [
        'explicit',
        'key',
        'mode',
        'time_signature',
    ],
    embedding (3): ['artists', 'album_name', 'track_name'],
    has_target=True,
    device='cpu',
)
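
The class description says the splits of a datasets.DatasetDict end up in a single dataset "with predefined split information". The following is a minimal conceptual sketch of that idea in plain Python, not the library's implementation: the names SPLIT_COL, SPLIT_TO_INDEX, merge_splits, and get_split are hypothetical, and the integer encoding of the splits is an assumption made for illustration.

```python
# Conceptual sketch only: a plain-Python stand-in for merging the splits of a
# DatasetDict-like mapping into one table that remembers each row's split.
# All names below are hypothetical and are NOT part of the torch_frame API.

SPLIT_COL = "split"  # assumed name of the column that stores the split label
SPLIT_TO_INDEX = {"train": 0, "validation": 1, "test": 2}  # assumed encoding


def merge_splits(dataset_dict):
    """Concatenate per-split rows into one list, tagging each row with its split."""
    merged = []
    for split_name, rows in dataset_dict.items():
        split_index = SPLIT_TO_INDEX[split_name]
        for row in rows:
            tagged = dict(row)  # copy so the input rows are left untouched
            tagged[SPLIT_COL] = split_index
            merged.append(tagged)
    return merged


def get_split(rows, split_name):
    """Return only the rows that were tagged with the given split."""
    split_index = SPLIT_TO_INDEX[split_name]
    return [row for row in rows if row[SPLIT_COL] == split_index]


# Toy data shaped like a two-split dataset (values are made up).
toy = {
    "train": [
        {"track_name": "a", "popularity": 10},
        {"track_name": "b", "popularity": 20},
    ],
    "test": [{"track_name": "c", "popularity": 30}],
}

merged = merge_splits(toy)
train_rows = get_split(merged, "train")
test_rows = get_split(merged, "test")
```

Keeping the split as an ordinary column means the merged table can be processed once (for example, when computing column statistics) while the original partition can still be recovered afterwards by filtering on that column.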