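Multiple
Dealing with multiple files.
Polars can handle multiple files in different ways, depending on your needs and memory pressure.
Let's create some files to give us some context: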
import polars as pl
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "ham", "spam"]})
# Write the same frame to five CSV files (the target directory must already exist).
for i in range(5):
    df.write_csv(f"docs/assets/data/my_many_files_{i}.csv")
Reading into a single DataFrame
To read multiple files into a single DataFrame, we can use a globbing pattern:
df = pl.read_csv("docs/assets/data/my_many_files_*.csv")
print(df)
shape: (15, 2)
┌─────┬──────┐
│ foo ┆ bar  │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 1   ┆ null │
│ 2   ┆ ham  │
│ 3   ┆ spam │
│ 1   ┆ null │
│ 2   ┆ ham  │
│ …   ┆ …    │
│ 2   ┆ ham  │
│ 3   ┆ spam │
│ 1   ┆ null │
│ 2   ┆ ham  │
│ 3   ┆ spam │
└─────┴──────┘
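To see how this works, we can take a look at the query plan. Below we see that all files are read separately and concatenated into a single DataFrame. Polars will try to parallelize the reading.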
pl.scan_csv("docs/assets/data/my_many_files_*.csv").show_graph()
[Figure: query plan graph rendered by show_graph()]
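Reading and processing in parallel
If your files don't have to be in a single table, you can also build a query plan for each file and execute them in parallel on the Polars thread pool.
All query plan execution is embarrassingly parallel and doesn't require any communication.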
import glob
import polars as pl
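queries = []
# Build one lazy query per file; nothing is executed yet.
for file in glob.glob("docs/assets/data/my_many_files_*.csv"):
    q = pl.scan_csv(file).group_by("bar").agg(pl.len(), pl.sum("foo"))
    queries.append(q)
# Execute all query plans in parallel on the Polars thread pool.
dataframes = pl.collect_all(queries)
print(dataframes)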
[shape: (3, 3)
┌──────┬─────┬─────┐
│ bar  ┆ len ┆ foo │
│ ---  ┆ --- ┆ --- │
│ str  ┆ u32 ┆ i64 │
╞══════╪═════╪═════╡
│ ham  ┆ 1   ┆ 2   │
│ spam ┆ 1   ┆ 3   │
│ null ┆ 1   ┆ 1   │
└──────┴─────┴─────┘, shape: (3, 3)
┌──────┬─────┬─────┐
│ bar  ┆ len ┆ foo │
│ ---  ┆ --- ┆ --- │
│ str  ┆ u32 ┆ i64 │
╞══════╪═════╪═════╡
│ null ┆ 1   ┆ 1   │
│ ham  ┆ 1   ┆ 2   │
│ spam ┆ 1   ┆ 3   │
└──────┴─────┴─────┘, shape: (3, 3)
┌──────┬─────┬─────┐
│ bar  ┆ len ┆ foo │
│ ---  ┆ --- ┆ --- │
│ str  ┆ u32 ┆ i64 │
╞══════╪═════╪═════╡
│ ham  ┆ 1   ┆ 2   │
│ spam ┆ 1   ┆ 3   │
│ null ┆ 1   ┆ 1   │
└──────┴─────┴─────┘, shape: (3, 3)
┌──────┬─────┬─────┐
│ bar  ┆ len ┆ foo │
│ ---  ┆ --- ┆ --- │
│ str  ┆ u32 ┆ i64 │
╞══════╪═════╪═════╡
│ ham  ┆ 1   ┆ 2   │
│ spam ┆ 1   ┆ 3   │
│ null ┆ 1   ┆ 1   │
└──────┴─────┴─────┘, shape: (3, 3)
┌──────┬─────┬─────┐
│ bar  ┆ len ┆ foo │
│ ---  ┆ --- ┆ --- │
│ str  ┆ u32 ┆ i64 │
╞══════╪═════╪═════╡
│ spam ┆ 1   ┆ 3   │
│ ham  ┆ 1   ┆ 2   │
│ null ┆ 1   ┆ 1   │
└──────┴─────┴─────┘]