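Using RetrieveChat Powered by PGVector for Retrieval-Augmented Code Generation and Question Answering
AutoGen offers conversable agents powered by LLMs, tools, or humans, which can perform tasks collectively via automated chat. The framework supports tool use and human participation through multi-agent conversation. See the AutoGen documentation for details on this feature.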
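RetrieveChat is a conversational system for retrieval-augmented code generation and question answering. In this notebook, we demonstrate how to use RetrieveChat to generate code and answer questions based on customized documents that are not present in the LLM's training dataset. RetrieveChat uses `AssistantAgent` and `RetrieveUserProxyAgent`, much as other notebooks use `AssistantAgent` and `UserProxyAgent` (for example, automated task solving with code generation, execution, and debugging). Essentially, `RetrieveUserProxyAgent` implements a different auto-reply mechanism that corresponds to the RetrieveChat prompts.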
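Table of Contents
We'll demonstrate six examples of using RetrieveChat for code generation and question answering.
This notebook requires some additional dependencies, which can be installed via pip: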
pip install "autogen-agentchat[retrievechat-pgvector]~=0.2" "flaml[automl]"
For more information, please refer to the installation guide.
Ensure you have a PGVector instance.
If not, a test version can be deployed quickly using Docker.
docker-compose.yml
version: '3.9'
services:
  pgvector:
    image: pgvector/pgvector:pg16
    shm_size: 128mb
    restart: unless-stopped
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: <postgres-user>
      POSTGRES_PASSWORD: <postgres-password>
      POSTGRES_DB: <postgres-database>
    volumes:
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
Create the init.sql file:
CREATE EXTENSION IF NOT EXISTS vector;
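Once the container is up, it can be worth confirming that the extension is actually enabled before pointing RetrieveChat at the database. The following is a minimal sketch and not part of the original notebook: `build_conninfo` and `vector_extension_enabled` are hypothetical helper names, and the credentials shown are placeholders for whatever you set in docker-compose.yml. It assumes the `psycopg` (version 3) package that this notebook imports later.

```python
from urllib.parse import quote


def build_conninfo(user: str, password: str, host: str = "localhost",
                   port: int = 5432, dbname: str = "postgres") -> str:
    """Assemble a PostgreSQL connection URI, percent-encoding the credentials."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{dbname}"
    )


def vector_extension_enabled(conninfo: str) -> bool:
    """Return True if the `vector` extension is installed in the target database."""
    import psycopg  # imported lazily so build_conninfo works without a driver

    with psycopg.connect(conninfo) as conn:
        row = conn.execute(
            "SELECT 1 FROM pg_extension WHERE extname = 'vector'"
        ).fetchone()
    return row is not None


if __name__ == "__main__":
    # Placeholder credentials; replace with the values from docker-compose.yml.
    print(build_conninfo("postgres", "postgres"))
```

Calling `vector_extension_enabled(build_conninfo(...))` against the running container should return `True` if init.sql was applied on first start.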
Set your API Endpoint
The `config_list_from_json` function loads a list of configurations from an environment variable or a JSON file.
import json
import os
import chromadb
import psycopg
from sentence_transformers import SentenceTransformer
import autogen
from autogen import AssistantAgent
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent
# Accepted file formats that can be stored in
# a vector database instance
from autogen.retrieve_utils import TEXT_FORMATS
config_list = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    file_location=".",
)
assert len(config_list) > 0
print("models to use: ", [config_list[i]["model"] for i in range(len(config_list))])
models to use: ['gpt4-1106-preview', 'gpt-4o', 'gpt-35-turbo', 'gpt-35-turbo-0613']
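For reference, the configuration source is a JSON list with one object per model endpoint. The sketch below uses only the standard library to show that shape and how a list might be narrowed to particular models. The sample contents, the placeholder API keys, and the helper name `filter_models` are illustrative assumptions, not output from this notebook.

```python
import json

# Example contents of an OAI_CONFIG_LIST file (placeholder keys, not real ones).
SAMPLE_CONFIG = """
[
    {"model": "gpt-4o", "api_key": "<your-api-key>"},
    {"model": "gpt-35-turbo", "api_key": "<your-api-key>"}
]
"""


def filter_models(config_list, wanted):
    """Keep only the entries whose "model" field is in `wanted`."""
    return [cfg for cfg in config_list if cfg.get("model") in wanted]


config_list_example = json.loads(SAMPLE_CONFIG)
selected = filter_models(config_list_example, {"gpt-4o"})
print([cfg["model"] for cfg in selected])
```

Because `retrieve_config` in this notebook reads `config_list[0]["model"]`, the order of entries in the file decides which model name is used for token counting.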
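Learn more about configuring LLMs for agents in the AutoGen documentation.
Construct agents for RetrieveChat
We start by initializing the `AssistantAgent` and `RetrieveUserProxyAgent`. The system message needs to be set to "You are a helpful assistant." for `AssistantAgent`. The detailed instructions are given in the user message. Later we will use `RetrieveUserProxyAgent.message_generator` to combine the instructions with a retrieval-augmented generation task into an initial prompt to be sent to the LLM assistant.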
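To make the idea concrete, here is a simplified sketch of how an instruction template, the user's problem, and the retrieved documents can be merged into one initial prompt. It is not the library's implementation: `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the template only loosely mirrors the wording of the prompt printed later in this notebook.

```python
PROMPT_TEMPLATE = """You're a retrieve augmented coding assistant. You answer user's questions based on your own knowledge and the
context provided by the user.
If you can't answer the question with or without the current context, you should reply exactly `UPDATE CONTEXT`.

User's question is: {input_question}

Context is: {input_context}
"""


def build_prompt(problem: str, docs: list[str]) -> str:
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n".join(docs)
    return PROMPT_TEMPLATE.format(input_question=problem, input_context=context)


prompt = build_prompt(
    "Who is the author of FLAML?",
    ["# Research", "For technical details, please check our research publications."],
)
print(prompt)
```

Keeping the instructions in the user message rather than the system message lets the same assistant serve both code-generation and question-answering prompts.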
print("Accepted file formats for `docs_path`:")
print(TEXT_FORMATS)
Accepted file formats for `docs_path`:
['yaml', 'ppt', 'rst', 'jsonl', 'xml', 'txt', 'yml', 'log', 'rtf', 'msg', 'xlsx', 'htm', 'pdf', 'org', 'pptx', 'md', 'docx', 'epub', 'tsv', 'csv', 'html', 'doc', 'odt', 'json']
# 1. create an AssistantAgent instance named "assistant"
assistant = AssistantAgent(
    name="assistant",
    system_message="You are a helpful assistant. You must always reply with some form of text.",
    llm_config={
        "timeout": 600,
        "cache_seed": 42,
        "config_list": config_list,
    },
)
# Optionally create psycopg conn object
# conn = psycopg.connect(conninfo="postgresql://postgres:postgres@localhost:5432/postgres", autocommit=True)
# Optionally create embedding function object
sentence_transformer_ef = SentenceTransformer("all-distilroberta-v1").encode
# 2. create the RetrieveUserProxyAgent instance named "ragproxyagent"
# Refer to https://microsoft.github.io/autogen/docs/reference/agentchat/contrib/retrieve_user_proxy_agent
# and https://microsoft.github.io/autogen/docs/reference/agentchat/contrib/vectordb/pgvectordb
# for more information on the RetrieveUserProxyAgent and PGVectorDB
ragproxyagent = RetrieveUserProxyAgent(
    name="ragproxyagent",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=1,
    retrieve_config={
        "task": "code",
        "docs_path": [
            "https://raw.githubusercontent.com/microsoft/FLAML/main/website/docs/Examples/Integrate%20-%20Spark.md",
            "https://raw.githubusercontent.com/microsoft/FLAML/main/website/docs/Research.md",
        ],
        "chunk_token_size": 2000,
        "model": config_list[0]["model"],
        "vector_db": "pgvector",  # PGVector database
        "collection_name": "flaml_collection",
        "db_config": {
            "connection_string": "postgresql://postgres:postgres@localhost:5432/postgres",  # Optional - connect to an external vector database
            # "host": "postgres", # Optional vector database host
            # "port": 5432, # Optional vector database port
            # "dbname": "postgres", # Optional vector database name
            # "username": "postgres", # Optional vector database username
            # "password": "postgres", # Optional vector database password
            # "conn": conn, # Optional - conn object to connect to database
        },
        "get_or_create": True,  # set to False if you don't want to reuse an existing collection
        "overwrite": True,  # set to True if you want to overwrite an existing collection
        "embedding_function": sentence_transformer_ef,  # If left out SentenceTransformer("all-MiniLM-L6-v2").encode will be used
    },
    code_execution_config=False,  # set to False if you don't want to execute the code
)
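The `chunk_token_size` setting caps how much text goes into each stored chunk. As a rough illustration of the idea, the sketch below splits text on line boundaries under a token budget. It is a deliberate simplification: it counts whitespace-separated words as tokens (a real tokenizer would count differently), and `split_by_budget` is a hypothetical helper, not an AutoGen function.

```python
def split_by_budget(text: str, max_tokens: int = 2000) -> list[str]:
    """Greedily pack whole lines into chunks of at most `max_tokens` words.

    A single line longer than the budget becomes its own oversized chunk.
    """
    chunks, current, used = [], [], 0
    for line in text.splitlines():
        n = len(line.split())
        # Close the current chunk when adding this line would exceed the budget.
        if current and used + n > max_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(line)
        used += n
    if current:
        chunks.append("\n".join(current))
    return chunks


doc = "one two three\nfour five\nsix seven eight nine"
parts = split_by_budget(doc, max_tokens=5)
print(parts)
```

A larger budget yields fewer, longer chunks, which gives the assistant more context per retrieved document at the cost of a longer prompt.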
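Example 1
Use RetrieveChat to help generate sample code, then automatically run the code and fix errors, if any.
Problem: Which API should I use if I want to use FLAML for a classification task and train the model within 30 seconds? Use Spark to parallelize the training. Force cancel the jobs if the time limit is reached.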
# reset the assistant. Always reset the assistant before starting a new conversation.
assistant.reset()
# Given a problem, we use the ragproxyagent to generate a prompt to be sent to the assistant as the initial message.
# The assistant receives the message and generates a response, which is sent back to the ragproxyagent for processing.
# The conversation continues until the termination condition is met. In RetrieveChat, with no human in the loop,
# the termination condition is that no code block is detected in the reply.
# With a human in the loop, the conversation continues until the user says "exit".
code_problem = "How can I use FLAML to perform a classification task and use spark to do parallel training. Train for 30 seconds and force cancel jobs if time limit is reached."
chat_result = ragproxyagent.initiate_chat(
    assistant, message=ragproxyagent.message_generator, problem=code_problem, search_string="spark"
)
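The termination rule described in the comments above can be approximated with a few lines of standard-library code. This is a hedged sketch rather than AutoGen's actual logic: `has_code_block`, `wants_context_update`, and `should_continue` are hypothetical names, and the regular expression only recognizes triple-backtick fences.

```python
import re

# Build the fence marker programmatically to keep this example easy to embed.
FENCE = "`" * 3

# A fenced block: fence, optional language tag, newline, body, closing fence.
CODE_FENCE = re.compile(FENCE + r"[\w+-]*\n.*?" + FENCE, re.DOTALL)


def has_code_block(reply: str) -> bool:
    """True if the reply contains at least one fenced code block."""
    return CODE_FENCE.search(reply) is not None


def wants_context_update(reply: str) -> bool:
    """True if the reply asks for more context instead of answering."""
    return reply.strip().upper().startswith("UPDATE CONTEXT")


def should_continue(reply: str) -> bool:
    """Continue only if there is code to act on or more context is requested."""
    return has_code_block(reply) or wants_context_update(reply)


print(should_continue("The author is Chi Wang."))
```

Under this rule, a plain-text answer ends the chat, while a reply containing code or `UPDATE CONTEXT` keeps it going for another turn.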
Trying to create collection.
VectorDB returns doc_ids: [['bdfbc921', '7968cf3c']]
Adding content of doc bdfbc921 to context.
Adding content of doc 7968cf3c to context.
You're a retrieve augmented coding assistant. You answer user's questions based on your own knowledge and the
context provided by the user.
If you can't answer the question with or without the current context, you should reply exactly `UPDATE CONTEXT`.
For code generation, you must obey the following rules:
Rule 1. You MUST NOT install any packages because all the packages needed are already installed.
Rule 2. You must follow the formats below to write your code:
```language
# your code
```
User's question is: How can I use FLAML to perform a classification task and use spark to do parallel training. Train for 30 seconds and force cancel jobs if time limit is reached.
Context is: # Integrate - Spark
FLAML has integrated Spark for distributed training. There are two main aspects of integration with Spark:
- Use Spark ML estimators for AutoML.
- Use Spark to run training in parallel spark jobs.
## Spark ML Estimators
FLAML integrates estimators based on Spark ML models. These models are trained in parallel using Spark, so we called them Spark estimators. To use these models, you first need to organize your data in the required format.
### Data
For Spark estimators, AutoML only consumes Spark data. FLAML provides a convenient function `to_pandas_on_spark` in the `flaml.automl.spark.utils` module to convert your data into a pandas-on-spark (`pyspark.pandas`) dataframe/series, which Spark estimators require.
This utility function takes data in the form of a `pandas.Dataframe` or `pyspark.sql.Dataframe` and converts it into a pandas-on-spark dataframe. It also takes `pandas.Series` or `pyspark.sql.Dataframe` and converts it into a [pandas-on-spark](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html) series. If you pass in a `pyspark.pandas.Dataframe`, it will not make any changes.
This function also accepts optional arguments `index_col` and `default_index_type`.
- `index_col` is the column name to use as the index, default is None.
- `default_index_type` is the default index type, default is "distributed-sequence". More info about default index type could be found on Spark official [documentation](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type)
Here is an example code snippet for Spark Data:
```python
import pandas as pd
from flaml.automl.spark.utils import to_pandas_on_spark
# Creating a dictionary
data = {
"Square_Feet": [800, 1200, 1800, 1500, 850],
"Age_Years": [20, 15, 10, 7, 25],
"Price": [100000, 200000, 300000, 240000, 120000],
}
# Creating a pandas DataFrame
dataframe = pd.DataFrame(data)
label = "Price"
# Convert to pandas-on-spark dataframe
psdf = to_pandas_on_spark(dataframe)
```
To use Spark ML models you need to format your data appropriately. Specifically, use [`VectorAssembler`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html) to merge all feature columns into a single vector column.
Here is an example of how to use it:
```python
from pyspark.ml.feature import VectorAssembler
columns = psdf.columns
feature_cols = [col for col in columns if col != label]
featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
psdf = featurizer.transform(psdf.to_spark(index_col="index"))["index", "features"]
```
Later in conducting the experiment, use your pandas-on-spark data like non-spark data and pass them using `X_train, y_train` or `dataframe, label`.
### Estimators
#### Model List
- `lgbm_spark`: The class for fine-tuning Spark version LightGBM models, using [SynapseML](https://microsoft.github.io/SynapseML/docs/features/lightgbm/about/) API.
#### Usage
First, prepare your data in the required format as described in the previous section.
By including the models you intend to try in the `estimators_list` argument to `flaml.automl`, FLAML will start trying configurations for these models. If your input is Spark data, FLAML will also use estimators with the `_spark` postfix by default, even if you haven't specified them.
Here is an example code snippet using SparkML models in AutoML:
```python
import flaml
# prepare your data in pandas-on-spark format as we previously mentioned
automl = flaml.AutoML()
settings = {
"time_budget": 30,
"metric": "r2",
"estimator_list": ["lgbm_spark"], # this setting is optional
"task": "regression",
}
automl.fit(
dataframe=psdf,
label=label,
**settings,
)
```
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/automl_bankrupt_synapseml.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/automl_bankrupt_synapseml.ipynb)
## Parallel Spark Jobs
You can activate Spark as the parallel backend during parallel tuning in both [AutoML](/docs/Use-Cases/Task-Oriented-AutoML#parallel-tuning) and [Hyperparameter Tuning](/docs/Use-Cases/Tune-User-Defined-Function#parallel-tuning), by setting the `use_spark` to `true`. FLAML will dispatch your job to the distributed Spark backend using [`joblib-spark`](https://github.com/joblib/joblib-spark).
Please note that you should not set `use_spark` to `true` when applying AutoML and Tuning for Spark Data. This is because only SparkML models will be used for Spark Data in AutoML and Tuning. As SparkML models run in parallel, there is no need to distribute them with `use_spark` again.
All the Spark-related arguments are stated below. These arguments are available in both Hyperparameter Tuning and AutoML:
- `use_spark`: boolean, default=False | Whether to use spark to run the training in parallel spark jobs. This can be used to accelerate training on large models and large datasets, but will incur more overhead in time and thus slow down training in some cases. GPU training is not supported yet when use_spark is True. For Spark clusters, by default, we will launch one trial per executor. However, sometimes we want to launch more trials than the number of executors (e.g., local mode). In this case, we can set the environment variable `FLAML_MAX_CONCURRENT` to override the detected `num_executors`. The final number of concurrent trials will be the minimum of `n_concurrent_trials` and `num_executors`.
- `n_concurrent_trials`: int, default=1 | The number of concurrent trials. When n_concurrent_trials > 1, FLAML performes parallel tuning.
- `force_cancel`: boolean, default=False | Whether to forcely cancel Spark jobs if the search time exceeded the time budget. Spark jobs include parallel tuning jobs and Spark-based model training jobs.
An example code snippet for using parallel Spark jobs:
```python
import flaml
automl_experiment = flaml.AutoML()
automl_settings = {
"time_budget": 30,
"metric": "r2",
"task": "regression",
"n_concurrent_trials": 2,
"use_spark": True,
"force_cancel": True, # Activating the force_cancel option can immediately halt Spark jobs once they exceed the allocated time_budget.
}
automl.fit(
dataframe=dataframe,
label=label,
**automl_settings,
)
```
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/integrate_spark.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/integrate_spark.ipynb)
# Research
For technical details, please check our research publications.
- [FLAML: A Fast and Lightweight AutoML Library](https://www.microsoft.com/en-us/research/publication/flaml-a-fast-and-lightweight-automl-library/). Chi Wang, Qingyun Wu, Markus Weimer, Erkang Zhu. MLSys 2021.
```bibtex
@inproceedings{wang2021flaml,
title={FLAML: A Fast and Lightweight AutoML Library},
author={Chi Wang and Qingyun Wu and Markus Weimer and Erkang Zhu},
year={2021},
booktitle={MLSys},
}
```
- [Frugal Optimization for Cost-related Hyperparameters](https://arxiv.org/abs/2005.01571). Qingyun Wu, Chi Wang, Silu Huang. AAAI 2021.
```bibtex
@inproceedings{wu2021cfo,
title={Frugal Optimization for Cost-related Hyperparameters},
author={Qingyun Wu and Chi Wang and Silu Huang},
year={2021},
booktitle={AAAI},
}
```
- [Economical Hyperparameter Optimization With Blended Search Strategy](https://www.microsoft.com/en-us/research/publication/economical-hyperparameter-optimization-with-blended-search-strategy/). Chi Wang, Qingyun Wu, Silu Huang, Amin Saied. ICLR 2021.
```bibtex
@inproceedings{wang2021blendsearch,
title={Economical Hyperparameter Optimization With Blended Search Strategy},
author={Chi Wang and Qingyun Wu and Silu Huang and Amin Saied},
year={2021},
booktitle={ICLR},
}
```
- [An Empirical Study on Hyperparameter Optimization for Fine-Tuning Pre-trained Language Models](https://aclanthology.org/2021.acl-long.178.pdf). Susan Xueqing Liu, Chi Wang. ACL 2021.
```bibtex
@inproceedings{liuwang2021hpolm,
title={An Empirical Study on Hyperparameter Optimization for Fine-Tuning Pre-trained Language Models},
author={Susan Xueqing Liu and Chi Wang},
year={2021},
booktitle={ACL},
}
```
- [ChaCha for Online AutoML](https://www.microsoft.com/en-us/research/publication/chacha-for-online-automl/). Qingyun Wu, Chi Wang, John Langford, Paul Mineiro and Marco Rossi. ICML 2021.
```bibtex
@inproceedings{wu2021chacha,
title={ChaCha for Online AutoML},
author={Qingyun Wu and Chi Wang and John Langford and Paul Mineiro and Marco Rossi},
year={2021},
booktitle={ICML},
}
```
- [Fair AutoML](https://arxiv.org/abs/2111.06495). Qingyun Wu, Chi Wang. ArXiv preprint arXiv:2111.06495 (2021).
```bibtex
@inproceedings{wuwang2021fairautoml,
title={Fair AutoML},
author={Qingyun Wu and Chi Wang},
year={2021},
booktitle={ArXiv preprint arXiv:2111.06495},
}
```
- [Mining Robust Default Configurations for Resource-constrained AutoML](https://arxiv.org/abs/2202.09927). Moe Kayali, Chi Wang. ArXiv preprint arXiv:2202.09927 (2022).
```bibtex
@inproceedings{kayaliwang2022default,
title={Mining Robust Default Configurations for Resource-constrained AutoML},
author={Moe Kayali and Chi Wang},
year={2022},
booktitle={ArXiv preprint arXiv:2202.09927},
}
```
- [Targeted Hyperparameter Optimization with Lexicographic Preferences Over Multiple Objectives](https://openreview.net/forum?id=0Ij9_q567Ma). Shaokun Zhang, Feiran Jia, Chi Wang, Qingyun Wu. ICLR 2023 (notable-top-5%).
```bibtex
@inproceedings{zhang2023targeted,
title={Targeted Hyperparameter Optimization with Lexicographic Preferences Over Multiple Objectives},
author={Shaokun Zhang and Feiran Jia and Chi Wang and Qingyun Wu},
booktitle={International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=0Ij9_q567Ma},
}
```
- [Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference](https://arxiv.org/abs/2303.04673). Chi Wang, Susan Xueqing Liu, Ahmed H. Awadallah. ArXiv preprint arXiv:2303.04673 (2023).
```bibtex
@inproceedings{wang2023EcoOptiGen,
title={Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference},
author={Chi Wang and Susan Xueqing Liu and Ahmed H. Awadallah},
year={2023},
booktitle={ArXiv preprint arXiv:2303.04673},
}
```
- [An Empirical Study on Challenging Math Problem Solving with GPT-4](https://arxiv.org/abs/2306.01337). Yiran Wu, Feiran Jia, Shaokun Zhang, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, Qingyun Wu, Chi Wang. ArXiv preprint arXiv:2306.01337 (2023).
```bibtex
@inproceedings{wu2023empirical,
title={An Empirical Study on Challenging Math Problem Solving with GPT-4},
author={Yiran Wu and Feiran Jia and Shaokun Zhang and Hangyu Li and Erkang Zhu and Yue Wang and Yin Tat Lee and Richard Peng and Qingyun Wu and Chi Wang},
year={2023},
booktitle={ArXiv preprint arXiv:2306.01337},
}
```
--------------------------------------------------------------------------------
Based on the provided context which details the integration of Spark with FLAML for distributed training, and the requirement to perform a classification task with parallel training in Spark, here's a code snippet that configures FLAML to train a classification model for 30 seconds and cancels the jobs if the time limit is reached.
```python
from flaml import AutoML
from flaml.automl.spark.utils import to_pandas_on_spark
import pandas as pd
# Your pandas DataFrame 'data' goes here
# Assuming 'data' is already a pandas DataFrame with appropriate data for classification
# and 'label_column' is the name of the column that we want to predict.
# First, convert your pandas DataFrame to a pandas-on-spark DataFrame
psdf = to_pandas_on_spark(data)
# Now, we prepare the settings for the AutoML training with Spark
automl_settings = {
"time_budget": 30, # Train for 30 seconds
"metric": "accuracy", # Assuming you want to use accuracy as the metric
"task": "classification",
"n_concurrent_trials": 2, # Adjust the number of concurrent trials depending on your cluster setup
"use_spark": True,
"force_cancel": True, # Force cancel jobs if time limit is reached
}
# Create an AutoML instance
automl = AutoML()
# Run the AutoML search
# You need to replace 'psdf' with your actual pandas-on-spark DataFrame variable
# and 'label_column' with the name of your label column
automl.fit(dataframe=psdf, label=label_column, **automl_settings)
```
This code snippet assumes that the `data` variable contains the pandas DataFrame you want to classify and that `label_column` is the name of the target variable for the classification task. Make sure to replace 'data' and 'label_column' with your actual data and label column name before running this code.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
UPDATE CONTEXT
--------------------------------------------------------------------------------
2024-06-11 19:57:44,122 - autogen.agentchat.contrib.retrieve_user_proxy_agent - INFO - Found 2 chunks.
Model gpt4-1106-preview not found. Using cl100k_base encoding.
Model gpt4-1106-preview not found. Using cl100k_base encoding.
Example 2
Use RetrieveChat to answer a question that is not related to code generation.
Problem: Who is the author of FLAML?
# reset the assistant. Always reset the assistant before starting a new conversation.
assistant.reset()
# Optionally create psycopg conn object
conn = psycopg.connect(conninfo="postgresql://postgres:postgres@localhost:5432/postgres", autocommit=True)
ragproxyagent = RetrieveUserProxyAgent(
    name="ragproxyagent",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=1,
    retrieve_config={
        "task": "code",
        "docs_path": [
            "https://raw.githubusercontent.com/microsoft/FLAML/main/website/docs/Examples/Integrate%20-%20Spark.md",
            "https://raw.githubusercontent.com/microsoft/FLAML/main/website/docs/Research.md",
            os.path.join(os.path.abspath(""), "..", "website", "docs"),
        ],
        "custom_text_types": ["non-existent-type"],
        "chunk_token_size": 2000,
        "model": config_list[0]["model"],
        "vector_db": "pgvector",  # PGVector database
        "collection_name": "flaml_collection",
        "db_config": {
            # "connection_string": "postgresql://postgres:postgres@localhost:5432/postgres", # Optional - connect to an external vector database
            # "host": "postgres", # Optional vector database host
            # "port": 5432, # Optional vector database port
            # "dbname": "postgres", # Optional vector database name
            # "username": "postgres", # Optional vector database username
            # "password": "postgres", # Optional vector database password
            "conn": conn,  # Optional - conn object to connect to database
        },
        "get_or_create": True,  # set to False if you don't want to reuse an existing collection
        "overwrite": True,  # set to True if you want to overwrite an existing collection
    },
    code_execution_config=False,  # set to False if you don't want to execute the code
)
qa_problem = "Who is the author of FLAML?"
chat_result = ragproxyagent.initiate_chat(assistant, message=ragproxyagent.message_generator, problem=qa_problem)
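The `db_config` above accepts either a full connection string, individual fields (host, port, dbname, username, password), or a ready-made `conn` object. As a rough sketch of how the individual fields map onto the connection string passed to `psycopg.connect`, the hypothetical helper `build_conninfo` below assembles a `postgresql://` URI using only the standard library. The default values are assumptions taken from the example cell, not library defaults.

```python
from urllib.parse import quote


def build_conninfo(username, password, host="localhost", port=5432, dbname="postgres"):
    """Assemble a postgresql:// URI, percent-encoding the credentials."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"postgresql://{user}:{pwd}@{host}:{port}/{dbname}"


# Reproduces the connection string used in the example cell.
conninfo = build_conninfo("postgres", "postgres")
print(conninfo)
```

Percent-encoding the credentials keeps the URI valid when a password contains characters such as `@` or `/`.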
2024-06-11 19:58:21,076 - autogen.agentchat.contrib.retrieve_user_proxy_agent - INFO - Found 2 chunks.
Model gpt4-1106-preview not found. Using cl100k_base encoding.
Model gpt4-1106-preview not found. Using cl100k_base encoding.
Trying to create collection.
VectorDB returns doc_ids: [['7968cf3c', 'bdfbc921']]
Adding content of doc 7968cf3c to context.
Adding content of doc bdfbc921 to context.
You're a retrieve augmented coding assistant. You answer user's questions based on your own knowledge and the
context provided by the user.
If you can't answer the question with or without the current context, you should reply exactly `UPDATE CONTEXT`.
For code generation, you must obey the following rules:
Rule 1. You MUST NOT install any packages because all the packages needed are already installed.
Rule 2. You must follow the formats below to write your code:
```language
# your code
```
User's question is: Who is the author of FLAML?
Context is: # Research
For technical details, please check our research publications.
- [FLAML: A Fast and Lightweight AutoML Library](https://www.microsoft.com/en-us/research/publication/flaml-a-fast-and-lightweight-automl-library/). Chi Wang, Qingyun Wu, Markus Weimer, Erkang Zhu. MLSys 2021.
```bibtex
@inproceedings{wang2021flaml,
title={FLAML: A Fast and Lightweight AutoML Library},
author={Chi Wang and Qingyun Wu and Markus Weimer and Erkang Zhu},
year={2021},
booktitle={MLSys},
}
```
- [Frugal Optimization for Cost-related Hyperparameters](https://arxiv.org/abs/2005.01571). Qingyun Wu, Chi Wang, Silu Huang. AAAI 2021.
```bibtex
@inproceedings{wu2021cfo,
title={Frugal Optimization for Cost-related Hyperparameters},
author={Qingyun Wu and Chi Wang and Silu Huang},
year={2021},
booktitle={AAAI},
}
```
- [Economical Hyperparameter Optimization With Blended Search Strategy](https://www.microsoft.com/en-us/research/publication/economical-hyperparameter-optimization-with-blended-search-strategy/). Chi Wang, Qingyun Wu, Silu Huang, Amin Saied. ICLR 2021.
```bibtex
@inproceedings{wang2021blendsearch,
title={Economical Hyperparameter Optimization With Blended Search Strategy},
author={Chi Wang and Qingyun Wu and Silu Huang and Amin Saied},
year={2021},
booktitle={ICLR},
}
```
- [An Empirical Study on Hyperparameter Optimization for Fine-Tuning Pre-trained Language Models](https://aclanthology.org/2021.acl-long.178.pdf). Susan Xueqing Liu, Chi Wang. ACL 2021.
```bibtex
@inproceedings{liuwang2021hpolm,
title={An Empirical Study on Hyperparameter Optimization for Fine-Tuning Pre-trained Language Models},
author={Susan Xueqing Liu and Chi Wang},
year={2021},
booktitle={ACL},
}
```
- [ChaCha for Online AutoML](https://www.microsoft.com/en-us/research/publication/chacha-for-online-automl/). Qingyun Wu, Chi Wang, John Langford, Paul Mineiro and Marco Rossi. ICML 2021.
```bibtex
@inproceedings{wu2021chacha,
title={ChaCha for Online AutoML},
author={Qingyun Wu and Chi Wang and John Langford and Paul Mineiro and Marco Rossi},
year={2021},
booktitle={ICML},
}
```
- [Fair AutoML](https://arxiv.org/abs/2111.06495). Qingyun Wu, Chi Wang. ArXiv preprint arXiv:2111.06495 (2021).
```bibtex
@inproceedings{wuwang2021fairautoml,
title={Fair AutoML},
author={Qingyun Wu and Chi Wang},
year={2021},
booktitle={ArXiv preprint arXiv:2111.06495},
}
```
- [Mining Robust Default Configurations for Resource-constrained AutoML](https://arxiv.org/abs/2202.09927). Moe Kayali, Chi Wang. ArXiv preprint arXiv:2202.09927 (2022).
```bibtex
@inproceedings{kayaliwang2022default,
title={Mining Robust Default Configurations for Resource-constrained AutoML},
author={Moe Kayali and Chi Wang},
year={2022},
booktitle={ArXiv preprint arXiv:2202.09927},
}
```
- [Targeted Hyperparameter Optimization with Lexicographic Preferences Over Multiple Objectives](https://openreview.net/forum?id=0Ij9_q567Ma). Shaokun Zhang, Feiran Jia, Chi Wang, Qingyun Wu. ICLR 2023 (notable-top-5%).
```bibtex
@inproceedings{zhang2023targeted,
title={Targeted Hyperparameter Optimization with Lexicographic Preferences Over Multiple Objectives},
author={Shaokun Zhang and Feiran Jia and Chi Wang and Qingyun Wu},
booktitle={International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=0Ij9_q567Ma},
}
```
- [Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference](https://arxiv.org/abs/2303.04673). Chi Wang, Susan Xueqing Liu, Ahmed H. Awadallah. ArXiv preprint arXiv:2303.04673 (2023).
```bibtex
@inproceedings{wang2023EcoOptiGen,
title={Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference},
author={Chi Wang and Susan Xueqing Liu and Ahmed H. Awadallah},
year={2023},
booktitle={ArXiv preprint arXiv:2303.04673},
}
```
- [An Empirical Study on Challenging Math Problem Solving with GPT-4](https://arxiv.org/abs/2306.01337). Yiran Wu, Feiran Jia, Shaokun Zhang, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, Qingyun Wu, Chi Wang. ArXiv preprint arXiv:2306.01337 (2023).
```bibtex
@inproceedings{wu2023empirical,
title={An Empirical Study on Challenging Math Problem Solving with GPT-4},
author={Yiran Wu and Feiran Jia and Shaokun Zhang and Hangyu Li and Erkang Zhu and Yue Wang and Yin Tat Lee and Richard Peng and Qingyun Wu and Chi Wang},
year={2023},
booktitle={ArXiv preprint arXiv:2306.01337},
}
```
# Integrate - Spark
FLAML has integrated Spark for distributed training. There are two main aspects of integration with Spark:
- Use Spark ML estimators for AutoML.
- Use Spark to run training in parallel spark jobs.
## Spark ML Estimators
FLAML integrates estimators based on Spark ML models. These models are trained in parallel using Spark, so we called them Spark estimators. To use these models, you first need to organize your data in the required format.
### Data
For Spark estimators, AutoML only consumes Spark data. FLAML provides a convenient function `to_pandas_on_spark` in the `flaml.automl.spark.utils` module to convert your data into a pandas-on-spark (`pyspark.pandas`) dataframe/series, which Spark estimators require.
This utility function takes data in the form of a `pandas.Dataframe` or `pyspark.sql.Dataframe` and converts it into a pandas-on-spark dataframe. It also takes `pandas.Series` or `pyspark.sql.Dataframe` and converts it into a [pandas-on-spark](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html) series. If you pass in a `pyspark.pandas.Dataframe`, it will not make any changes.
This function also accepts optional arguments `index_col` and `default_index_type`.
- `index_col` is the column name to use as the index, default is None.
- `default_index_type` is the default index type, default is "distributed-sequence". More info about default index type could be found on Spark official [documentation](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type)
Here is an example code snippet for Spark Data:
```python
import pandas as pd
from flaml.automl.spark.utils import to_pandas_on_spark
# Creating a dictionary
data = {
"Square_Feet": [800, 1200, 1800, 1500, 850],
"Age_Years": [20, 15, 10, 7, 25],
"Price": [100000, 200000, 300000, 240000, 120000],
}
# Creating a pandas DataFrame
dataframe = pd.DataFrame(data)
label = "Price"
# Convert to pandas-on-spark dataframe
psdf = to_pandas_on_spark(dataframe)
```
To use Spark ML models you need to format your data appropriately. Specifically, use [`VectorAssembler`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html) to merge all feature columns into a single vector column.
Here is an example of how to use it:
```python
from pyspark.ml.feature import VectorAssembler
columns = psdf.columns
feature_cols = [col for col in columns if col != label]
featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
psdf = featurizer.transform(psdf.to_spark(index_col="index"))["index", "features"]
```
Later in conducting the experiment, use your pandas-on-spark data like non-spark data and pass them using `X_train, y_train` or `dataframe, label`.
### Estimators
#### Model List
- `lgbm_spark`: The class for fine-tuning Spark version LightGBM models, using [SynapseML](https://microsoft.github.io/SynapseML/docs/features/lightgbm/about/) API.
#### Usage
First, prepare your data in the required format as described in the previous section.
By including the models you intend to try in the `estimator_list` argument to `flaml.AutoML.fit`, FLAML will start trying configurations for these models. If your input is Spark data, FLAML will also use estimators with the `_spark` postfix by default, even if you haven't specified them.
Here is an example code snippet using SparkML models in AutoML:
```python
import flaml
# prepare your data in pandas-on-spark format as we previously mentioned
automl = flaml.AutoML()
settings = {
"time_budget": 30,
"metric": "r2",
"estimator_list": ["lgbm_spark"], # this setting is optional
"task": "regression",
}
automl.fit(
dataframe=psdf,
label=label,
**settings,
)
```
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/automl_bankrupt_synapseml.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/automl_bankrupt_synapseml.ipynb)
## Parallel Spark Jobs
You can activate Spark as the parallel backend during parallel tuning in both [AutoML](/docs/Use-Cases/Task-Oriented-AutoML#parallel-tuning) and [Hyperparameter Tuning](/docs/Use-Cases/Tune-User-Defined-Function#parallel-tuning), by setting the `use_spark` to `true`. FLAML will dispatch your job to the distributed Spark backend using [`joblib-spark`](https://github.com/joblib/joblib-spark).
Please note that you should not set `use_spark` to `true` when applying AutoML and Tuning for Spark Data. This is because only SparkML models will be used for Spark Data in AutoML and Tuning. As SparkML models run in parallel, there is no need to distribute them with `use_spark` again.
All the Spark-related arguments are stated below. These arguments are available in both Hyperparameter Tuning and AutoML:
- `use_spark`: boolean, default=False | Whether to use spark to run the training in parallel spark jobs. This can be used to accelerate training on large models and large datasets, but will incur more overhead in time and thus slow down training in some cases. GPU training is not supported yet when use_spark is True. For Spark clusters, by default, we will launch one trial per executor. However, sometimes we want to launch more trials than the number of executors (e.g., local mode). In this case, we can set the environment variable `FLAML_MAX_CONCURRENT` to override the detected `num_executors`. The final number of concurrent trials will be the minimum of `n_concurrent_trials` and `num_executors`.
- `n_concurrent_trials`: int, default=1 | The number of concurrent trials. When n_concurrent_trials > 1, FLAML performs parallel tuning.
- `force_cancel`: boolean, default=False | Whether to forcibly cancel Spark jobs if the search time exceeds the time budget. Spark jobs include parallel tuning jobs and Spark-based model training jobs.
An example code snippet for using parallel Spark jobs:
```python
import flaml
automl_experiment = flaml.AutoML()
automl_settings = {
"time_budget": 30,
"metric": "r2",
"task": "regression",
"n_concurrent_trials": 2,
"use_spark": True,
"force_cancel": True, # Activating the force_cancel option can immediately halt Spark jobs once they exceed the allocated time_budget.
}
automl_experiment.fit(
dataframe=dataframe,
label=label,
**automl_settings,
)
```
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/integrate_spark.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/integrate_spark.ipynb)
--------------------------------------------------------------------------------
The authors of FLAML are Chi Wang, Qingyun Wu, Markus Weimer, and Erkang Zhu.
--------------------------------------------------------------------------------
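The answer above matches the `author` field of the first BibTeX entry in the retrieved context. As a quick sanity check, the sketch below extracts that field with a regular expression. The helper name `extract_authors` is hypothetical, and the pattern assumes a simple `author={...}` field without nested braces.

```python
import re

# First BibTeX entry from the retrieved Research.md context.
BIBTEX_ENTRY = """@inproceedings{wang2021flaml,
    title={FLAML: A Fast and Lightweight AutoML Library},
    author={Chi Wang and Qingyun Wu and Markus Weimer and Erkang Zhu},
    year={2021},
    booktitle={MLSys},
}"""


def extract_authors(entry: str) -> list[str]:
    """Return the author names from a BibTeX entry, or an empty list if none is found."""
    match = re.search(r"author\s*=\s*\{([^}]*)\}", entry)
    if match is None:
        return []
    # BibTeX separates author names with the word "and".
    return [name.strip() for name in match.group(1).split(" and ")]


print(extract_authors(BIBTEX_ENTRY))
```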