Vaex in 11 minutes#

Because vaex goes up to 11

If you want to try out this notebook with a live Python kernel, use mybinder:

https://mybinder.org/badge_logo.svg

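DataFrame#

At the heart of Vaex is the DataFrame (similar to a Pandas DataFrame, but more efficient), which we conventionally refer to with the variable df. A DataFrame is an efficient representation of a large tabular dataset and has:

  • A number of columns, say x, y and z, which are:

    • backed by a Numpy array;

    • wrapped by an expression system, e.g. df.x, df['x'] or df.col.x is an Expression;

    • columns/expressions can be evaluated lazily, e.g. df.x * np.sin(df.y) does nothing until the result is needed.

  • A set of virtual columns, i.e. columns backed by a (lazy) computation, e.g. df['r'] = df.x/df.y

  • A set of selections that can be used to explore the dataset, e.g. df.select(df.x < 0)

  • Filtered DataFrames that do not copy the data, e.g. df_negative = df[df.x < 0]

Let's start with an example dataset that is included in Vaex.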

[1]:
import vaex
df = vaex.example()
df  # Since this is the last statement in a cell, it will print the DataFrame in a nice HTML format.
[1]:
#  id  x  y  z  vx  vy  vz  E  L  Lz  FeH
0  0  1.2318683862686157  -0.39692866802215576  -0.598057746887207  301.1552734375  174.05947875976562  27.42754554748535  -149431.40625  407.38897705078125  333.9555358886719  -1.0053852796554565
1  23  -0.16370061039924622  3.654221296310425  -0.25490644574165344  -195.00022888183594  170.47216796875  142.5302276611328  -124247.953125  890.2411499023438  684.6676025390625  -1.7086670398712158
2  32  -2.120255947113037  3.326052665710449  1.7078403234481812  -48.63423156738281  171.6472930908203  -2.079437255859375  -138500.546875  372.2410888671875  -202.17617797851562  -1.8336141109466553
3  8  4.7155890464782715  4.5852508544921875  2.2515437602996826  -232.42083740234375  -294.850830078125  62.85865020751953  -60037.0390625  1297.63037109375  -324.6875  -1.4786882400512695
4  16  7.21718692779541  11.99471664428711  -1.064562201499939  -1.6891745328903198  181.329345703125  -11.333610534667969  -83206.84375  1332.7989501953125  1328.948974609375  -1.8570483922958374
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
329,995  21  1.9938701391220093  0.789276123046875  0.22205990552902222  -216.92990112304688  16.124420166015625  -211.244384765625  -146457.4375  457.72247314453125  203.36758422851562  -1.7451677322387695
329,996  25  3.7180912494659424  0.721337616443634  1.6415337324142456  -185.92160034179688  -117.25082397460938  -105.4986572265625  -126627.109375  335.0025634765625  -301.8370056152344  -0.9822322130203247
329,997  14  0.3688507676124573  13.029608726501465  -3.633934736251831  -53.677146911621094  -145.15771484375  76.70909881591797  -84912.2578125  817.1375732421875  645.8507080078125  -1.7645612955093384
329,998  18  -0.11259264498949051  1.4529125690460205  2.168952703475952  179.30865478515625  205.79710388183594  -68.75872802734375  -133498.46875  724.000244140625  -283.6910400390625  -1.8808952569961548
329,999  4  20.796220779418945  -3.331387758255005  12.18841552734375  42.69000244140625  69.20479583740234  29.54275131225586  -65519.328125  1843.07470703125  1581.4151611328125  -1.1231083869934082


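The preview above shows that this dataset contains \(> 300,000\) rows, with columns named x, y, z (positions), vx, vy, vz (velocities), E (energy), L (angular momentum), and id (subgroup of samples). When we print out a column, we can see that it is not a Numpy array but an Expression.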

[2]:
df.x  # df.col.x and df['x'] are equivalent; df.col.x is friendlier to tab completion, df['x'] to programmatic access
[2]:
Expression = x
Length: 330,000 dtype: float32 (column)
---------------------------------------
     0    1.23187
     1  -0.163701
     2   -2.12026
     3    4.71559
     4    7.21719
       ...
329995    1.99387
329996    3.71809
329997   0.368851
329998  -0.112593
329999    20.7962

The .values property gives an in-memory representation of an expression. The same can be applied to a DataFrame.

[3]:
df.x.values
[3]:
array([ 1.2318684 , -0.16370061, -2.120256  , ...,  0.36885077,
       -0.11259264, 20.79622   ], dtype=float32)

Most Numpy functions (ufuncs) can be applied to expressions. They do not produce a direct result, but a new expression.

[4]:
import numpy as np
np.sqrt(df.x**2 + df.y**2 + df.z**2)
[4]:
Expression = sqrt((((x ** 2) + (y ** 2)) + (z ** 2)))
Length: 330,000 dtype: float32 (expression)
-------------------------------------------
     0  1.42574
     1  3.66676
     2  4.29824
     3  6.95203
     4   14.039
      ...
329995  2.15587
329996  4.12785
329997  13.5319
329998  2.61304
329999  24.3339

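Virtual columns#

Sometimes it is convenient to store an expression as a column. We call this a virtual column, since it does not take up any memory and is computed on the fly when needed. A virtual column is treated just like a normal column.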

[5]:
df['r'] = np.sqrt(df.x**2 + df.y**2 + df.z**2)
df[['x', 'y', 'z', 'r']]
[5]:
#  x  y  z  r
0  1.2318683862686157  -0.39692866802215576  -0.598057746887207  1.425736665725708
1  -0.16370061039924622  3.654221296310425  -0.25490644574165344  3.666757345199585
2  -2.120255947113037  3.326052665710449  1.7078403234481812  4.298235893249512
3  4.7155890464782715  4.5852508544921875  2.2515437602996826  6.952032566070557
4  7.21718692779541  11.99471664428711  -1.064562201499939  14.03902816772461
...  ...  ...  ...  ...
329,995  1.9938701391220093  0.789276123046875  0.22205990552902222  2.155872344970703
329,996  3.7180912494659424  0.721337616443634  1.6415337324142456  4.127851963043213
329,997  0.3688507676124573  13.029608726501465  -3.633934736251831  13.531896591186523
329,998  -0.11259264498949051  1.4529125690460205  2.168952703475952  2.613041877746582
329,999  20.796220779418945  -3.331387758255005  12.18841552734375  24.333894729614258

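Selections and filtering#

Vaex is efficient at exploring subsets of the data, for instance to remove outliers or to inspect only part of the data. Instead of making copies, Vaex internally keeps track of which rows are selected.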

[6]:
df.select(df.x < 0)
df.evaluate(df.x, selection=True)
[6]:
array([-0.16370061, -2.120256  , -7.7843747 , ..., -8.126636  ,
       -3.9477386 , -0.11259264], dtype=float32)

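Selections are useful when you frequently change the portion of the data you want to visualize, or when you want to efficiently compute statistics on several portions of the data.

Alternatively, you can create filtered datasets. This is similar to using Pandas, except that Vaex does not copy the data.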

[7]:
df_negative = df[df.x < 0]
df_negative[['x', 'y', 'z', 'r']]
[7]:
#  x  y  z  r
0  -0.16370061039924622  3.654221296310425  -0.25490644574165344  3.666757345199585
1  -2.120255947113037  3.326052665710449  1.7078403234481812  4.298235893249512
2  -7.784374713897705  5.989774703979492  -0.682695209980011  9.845809936523438
3  -3.5571861267089844  5.413629055023193  0.09171556681394577  6.478376865386963
4  -20.813940048217773  -3.294677495956421  13.486607551574707  25.019264221191406
...  ...  ...  ...  ...
166,274  -2.5926425457000732  -2.871671676635742  -0.18048334121704102  3.8730955123901367
166,275  -0.7566012144088745  2.9830434322357178  -6.940553188323975  7.592250823974609
166,276  -8.126635551452637  1.1619765758514404  -1.6459038257598877  8.372657775878906
166,277  -3.9477386474609375  -3.0684902667999268  -1.5822702646255493  5.244411468505859
166,278  -0.11259264498949051  1.4529125690460205  2.168952703475952  2.613041877746582

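Statistics on N-d grids#

A core feature of Vaex is the very efficient calculation of statistics on N-dimensional grids. This is particularly useful for visualizing large datasets.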

[8]:
df.count(), df.mean(df.x), df.mean(df.x, selection=True)
[8]:
(array(330000), array(-0.0632868), array(-5.18457762))

Similar to SQL's groupby, Vaex uses the binby concept, which tells Vaex that a statistic should be calculated on a regular grid (for performance reasons).

[9]:
counts_x = df.count(binby=df.x, limits=[-10, 10], shape=64)
counts_x
[9]:
array([1374, 1350, 1459, 1618, 1706, 1762, 1852, 2007, 2240, 2340, 2610,
       2840, 3126, 3337, 3570, 3812, 4216, 4434, 4730, 4975, 5332, 5800,
       6162, 6540, 6805, 7261, 7478, 7642, 7839, 8336, 8736, 8279, 8269,
       8824, 8217, 7978, 7541, 7383, 7116, 6836, 6447, 6220, 5864, 5408,
       4881, 4681, 4337, 4015, 3799, 3531, 3320, 3040, 2866, 2629, 2488,
       2244, 1981, 1905, 1734, 1540, 1437, 1378, 1233, 1186])

This results in a Numpy array holding the counts in 64 bins distributed between x = -10 and x = 10. We can quickly visualize this using Matplotlib.

[10]:
import matplotlib.pyplot as plt
plt.plot(np.linspace(-10, 10, 64), counts_x)
plt.show()
_images/tutorial_20_0.png
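
As a rough sketch of what a binby count on a regular grid amounts to, the plain-Python snippet below maps each value directly to a bin index, which is what makes a regular grid cheap. It is an illustration only, not how Vaex implements it; `bin_counts`, `bin_centers` and the sample values are hypothetical. It also shows that the bin centers, rather than the interval end points, are the natural x coordinates for the counts.

```python
def bin_counts(values, limits, shape):
    """Count values on a regular 1-d grid; values outside limits are ignored."""
    lo, hi = limits
    counts = [0] * shape
    for v in values:
        if lo <= v < hi:
            # Direct index computation: no search over bin edges is needed.
            index = min(int((v - lo) / (hi - lo) * shape), shape - 1)
            counts[index] += 1
    return counts


def bin_centers(limits, shape):
    """Midpoint of each bin on the regular grid."""
    lo, hi = limits
    width = (hi - lo) / shape
    return [lo + (i + 0.5) * width for i in range(shape)]


values = [-9.9, -9.8, 0.1, 0.2, 9.9, 25.0]  # made-up sample data
counts = bin_counts(values, limits=[-10, 10], shape=64)
centers = bin_centers(limits=[-10, 10], shape=64)
print(counts[0], counts[32], counts[63], sum(counts))
print(centers[0], centers[-1])
```

The value 25.0 falls outside the limits and is not counted, so the counts sum to 5 rather than 6.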

We can do the same in 2-d (this can in fact be generalized to N-d!) and display the result with Matplotlib.

[11]:
xycounts = df.count(binby=[df.x, df.y], limits=[[-10, 10], [-10, 20]], shape=(64, 128))
xycounts
[11]:
array([[ 5,  2,  3, ...,  3,  3,  0],
       [ 8,  4,  2, ...,  5,  3,  2],
       [ 5, 11,  7, ...,  3,  3,  1],
       ...,
       [ 4,  8,  5, ...,  2,  0,  2],
       [10,  6,  7, ...,  1,  1,  2],
       [ 6,  7,  9, ...,  2,  2,  2]])
[12]:
plt.imshow(xycounts.T, origin='lower', extent=[-10, 10, -10, 20])
plt.show()
_images/tutorial_23_0.png
[13]:
v = np.sqrt(df.vx**2 + df.vy**2 + df.vz**2)
xy_mean_v = df.mean(v, binby=[df.x, df.y], limits=[[-10, 10], [-10, 20]], shape=(64, 128))
xy_mean_v
[13]:
array([[156.15283203, 226.0004425 , 206.95940653, ...,  90.0340627 ,
        152.08784485,          nan],
       [203.81366634, 133.01436043, 146.95962524, ..., 137.54756927,
         98.68717448, 141.06020737],
       [150.59178772, 188.38820371, 137.46753802, ..., 155.96900177,
        148.91660563, 138.48191833],
       ...,
       [168.93819809, 187.75943136, 137.318647  , ..., 144.83927917,
                 nan, 107.7273407 ],
       [154.80492783, 140.55182203, 180.30700166, ..., 184.01670837,
         95.10913086, 131.18122864],
       [166.06868235, 150.54079764, 125.84606828, ..., 130.56007385,
        121.04217911, 113.34659195]])
[14]:
plt.imshow(xy_mean_v.T, origin='lower', extent=[-10, 10, -10, 20])
plt.show()
_images/tutorial_25_0.png
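
The nan entries in the binned mean above come from bins that contain no rows. A minimal plain-Python sketch of a binned mean on a regular 2-d grid makes this visible; `binned_mean` and the three data points are hypothetical and only illustrate the idea, they are not Vaex's implementation.

```python
import math


def binned_mean(xs, ys, ws, limits, shape):
    """Mean of ws on a regular 2-d grid over (xs, ys); empty bins become NaN."""
    (xlo, xhi), (ylo, yhi) = limits
    nx, ny = shape
    sums = [[0.0] * ny for _ in range(nx)]
    counts = [[0] * ny for _ in range(nx)]
    for x, y, w in zip(xs, ys, ws):
        if xlo <= x < xhi and ylo <= y < yhi:
            i = min(int((x - xlo) / (xhi - xlo) * nx), nx - 1)
            j = min(int((y - ylo) / (yhi - ylo) * ny), ny - 1)
            sums[i][j] += w
            counts[i][j] += 1
    # A mean is undefined for a bin without samples, hence NaN.
    return [[sums[i][j] / counts[i][j] if counts[i][j] else math.nan
             for j in range(ny)] for i in range(nx)]


# Made-up data: three points, two of which share a bin.
xs = [-5.0, -4.0, 5.0]
ys = [-5.0, -1.0, 15.0]
ws = [100.0, 200.0, 50.0]
grid = binned_mean(xs, ys, ws, limits=[[-10, 10], [-10, 20]], shape=(2, 2))
print(grid)
```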

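Other statistics can be computed in the same way; see the API documentation for the full list.

Getting your data in#

Before continuing with this tutorial, you may want to read in your own data. Ultimately, a Vaex DataFrame just wraps a set of Numpy arrays. If you can access your data as a set of Numpy arrays, you can easily construct a DataFrame using from_arrays.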

[15]:
import vaex
import numpy as np
x = np.arange(5)
y = x**2
df = vaex.from_arrays(x=x, y=y)
df
[15]:
# x y
0 0 0
1 1 1
2 2 4
3 3 9
4 4 16

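Vaex offers other quick ways to get your data in, and exporting or converting a DataFrame to a different data structure is just as simple.

Nowadays it is common to keep data, especially larger datasets, in the cloud. Vaex can read data straight from S3 in a lazy manner, meaning that only the data that is needed is downloaded and cached on disk.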

[16]:
# Read in the NYC Taxi dataset straight from S3
nyctaxi = vaex.open('s3://vaex/taxi/nyc_taxi_2015_mini.hdf5?anon=true')
nyctaxi.head(5)
[16]:
#  vendor_id  pickup_datetime  dropoff_datetime  passenger_count  payment_type  trip_distance  pickup_longitude  pickup_latitude  rate_code  store_and_fwd_flag  dropoff_longitude  dropoff_latitude  fare_amount  surcharge  mta_tax  tip_amount  tolls_amount  total_amount
0  VTS  2009-01-04 02:52:00.000000000  2009-01-04 03:02:00.000000000  1  CASH  2.63  -73.992  40.7216  nan  nan  -73.9938  40.6959  8.9  0.5  nan  0  0  9.4
1  VTS  2009-01-04 03:31:00.000000000  2009-01-04 03:38:00.000000000  3  Credit  4.55  -73.9821  40.7363  nan  nan  -73.9558  40.768  12.1  0.5  nan  2  0  14.6
2  VTS  2009-01-03 15:43:00.000000000  2009-01-03 15:57:00.000000000  5  Credit  10.35  -74.0026  40.7397  nan  nan  -73.87  40.7702  23.7  0  nan  4.74  0  28.44
3  DDS  2009-01-01 20:52:58.000000000  2009-01-01 21:14:00.000000000  1  CREDIT  5  -73.9743  40.791  nan  nan  -73.9966  40.7318  14.9  0.5  nan  3.05  0  18.45
4  DDS  2009-01-24 16:18:23.000000000  2009-01-24 16:24:56.000000000  1  CASH  0.4  -74.0016  40.7194  nan  nan  -74.0084  40.7203  3.7  0  nan  0  0  3.7

Plotting#

1-d and 2-d#

Most visualizations are done in 1 or 2 dimensions, and Vaex wraps Matplotlib to cover a variety of common use cases.

[17]:
import vaex
import numpy as np
df = vaex.example()

The simplest visualization is a 1-d plot using DataFrame.viz.histogram. Here we only show 99.7% of the data.

[1]:
df.viz.histogram(df.x, limits='99.7%')
_images/tutorial_35_0.png

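A slightly more involved visualization plots not the counts, but a different statistic for each bin. In most cases, passing the what='<statistic>(<expression>)' argument will do, where <statistic> is any statistic mentioned earlier or in the API documentation.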

[19]:
df.viz.histogram(df.x, what='mean(E)', limits='99.7%');
_images/tutorial_37_0.png

An equivalent approach is to use the vaex.stat.<statistic> functions, e.g. vaex.stat.mean.

[20]:
df.viz.histogram(df.x, what=vaex.stat.mean(df.E), limits='99.7%');
_images/tutorial_39_0.png

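The vaex.stat.<statistic> objects are very similar to Vaex expressions in that they represent an underlying calculation. Typical arithmetic and Numpy functions can be applied to them. However, these objects compute a single statistic and do not return a column or expression.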

[21]:
np.log(vaex.stat.mean(df.x)/vaex.stat.std(df.x))
[21]:
log((mean(x) / std(x)))

These statistical objects can be passed to the what argument. The advantage is that the data only has to be passed over once.

[22]:
df.viz.histogram(df.x, what=np.clip(np.log(-vaex.stat.mean(df.E)), 11, 11.4), limits='99.7%');
_images/tutorial_43_0.png

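A similar result can be obtained by calculating the statistic ourselves and passing it to the grid argument of DataFrame.viz.histogram. Take care that the limits used for calculating the statistic and for plotting are the same; otherwise the x axis may not correspond to the real data.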

[3]:
limits = [-30, 30]
shape  = 64
meanE  = df.mean(df.E, binby=df.x, limits=limits, shape=shape)
grid   = np.clip(np.log(-meanE), 11, 11.4)
df.viz.histogram(df.x, grid=grid, limits=limits, ylabel='clipped E');

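Besides plotting densities along one dimension (a histogram), we can also plot them in 2-d. This is done with the DataFrame.viz.heatmap function, which shares many arguments with the histogram and behaves very similarly.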

[24]:
df.viz.heatmap(df.x, df.y, what=vaex.stat.mean(df.E)**2, limits='99.7%');
_images/tutorial_47_0.png

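Selections for plotting#

While filtering is useful for narrowing down the contents of a DataFrame (e.g. df_negative = df[df.x < 0]), it has a few downsides. First, a practical issue: when you filter in 4 different ways, you need 4 different DataFrames polluting your namespace. More importantly, when Vaex executes a series of statistical computations, it does so per DataFrame, meaning that 4 passes over the data are made even though all 4 DataFrames point to the same underlying data.

If instead we have 4 (named) selections in our DataFrame, we can calculate the statistics in a single pass over the data, which is significantly faster, especially when the dataset is larger than memory.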
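
To illustrate why several selections can share one pass, here is a plain-Python sketch that reads the data in chunks and updates the running statistics of every selection while each chunk is in hand. The helper names, the chunk size and the data are all hypothetical; this only mirrors the idea and makes no claim about Vaex internals.

```python
def chunked(rows, size):
    """Yield consecutive slices of rows, mimicking chunked out-of-core reads."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def means_per_selection(rows, selections, chunk_size=2):
    """Mean of x for every selection, reading each chunk exactly once."""
    sums = {name: 0.0 for name in selections}
    counts = {name: 0 for name in selections}
    chunk_reads = 0
    for chunk in chunked(rows, chunk_size):
        chunk_reads += 1  # grows with the data size, not with the number of selections
        for x, y in chunk:
            for name, predicate in selections.items():
                if predicate(x, y):
                    sums[name] += x
                    counts[name] += 1
    means = {name: sums[name] / counts[name] if counts[name] else float("nan")
             for name in selections}
    return means, chunk_reads


rows = [(-1.0, 0.0), (2.0, 3.0), (-3.0, -4.0), (4.0, 1.0)]  # made-up (x, y) rows
selections = {
    "all": lambda x, y: True,
    "negative_x": lambda x, y: x < 0,
    "x_below_y": lambda x, y: x < y,
}
means, chunk_reads = means_per_selection(rows, selections)
print(means, chunk_reads)
```

With filtered copies, the same three means would require reading the four rows three times; here the two chunks are each read once.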

In the plot below we show three selections, which by default are blended together, requiring just one pass over the data.

[25]:
df.viz.heatmap(df.x, df.y, what=np.log(vaex.stat.count()+1), limits='99.7%',
        selection=[None, df.x < df.y, df.x < -10]);
_images/tutorial_49_1.png

Advanced Plotting#

Say we would like to see two plots next to each other. To achieve this, we can pass a list of expression pairs.

[26]:
df.viz.heatmap([["x", "y"], ["x", "z"]], limits='99.7%',
        title="Face on and edge on", figsize=(10,4));
_images/tutorial_51_1.png

By default, multiple plots are shown as columns, multiple selections are overplotted, and multiple 'whats' (statistics) are shown as rows.

[27]:
df.viz.heatmap([["x", "y"], ["x", "z"]],
        limits='99.7%',
        what=[np.log(vaex.stat.count()+1), vaex.stat.mean(df.E)],
        selection=[None, df.x < df.y],
        title="Face on and edge on", figsize=(10,10));
_images/tutorial_53_0.png

Note that the selection has no effect in the bottom row.

However, this behaviour can be changed using the visual argument.

[28]:
df.viz.heatmap([["x", "y"], ["x", "z"]],
        limits='99.7%',
        what=vaex.stat.mean(df.E),
        selection=[None, df.Lz < 0],
        visual=dict(column='selection'),
        title="Face on and edge on", figsize=(10,10));
_images/tutorial_55_0.png

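Slices in a 3rd dimension#

If a third axis (z) is given, you can 'slice' through the data, displaying the z slices as rows. Note that the rows are wrapped here, which can be changed using the wrap_columns argument.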

[29]:
df.viz.heatmap("Lz", "E",
        limits='99.7%',
        z="FeH:-2.5,-1,8", show=True, visual=dict(row="z"),
        figsize=(12,8), f="log", wrap_columns=3);
_images/tutorial_57_0.png

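Visualization of smaller datasets#

Although Vaex focuses on large datasets, sometimes you end up with a small fraction of the data (e.g. due to a selection) and want to make a scatter plot. You can do so as follows: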

[30]:
import vaex
df = vaex.example()
[31]:
import matplotlib.pyplot as plt
x = df.evaluate("x", selection=df.Lz < -2500)
y = df.evaluate("y", selection=df.Lz < -2500)
plt.scatter(x, y, c="red", alpha=0.5, s=4);
_images/tutorial_60_0.png

Using DataFrame.viz.scatter:

[32]:
df.viz.scatter(df.x, df.y, selection=df.Lz < -2500, c="red", alpha=0.5, s=4)
df.viz.scatter(df.x, df.y, selection=df.Lz > 1500, c="green", alpha=0.5, s=4);
_images/tutorial_62_0.png

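In control#

While Vaex provides a wrapper for Matplotlib, there are situations where you want to use the DataFrame.viz methods but still control the plot yourself. Vaex simply uses the current figure and axes objects, so this is easy to do.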

[33]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14,7))
plt.sca(ax1)
selection = df.Lz < -2500
x = df[selection].x.evaluate()
y = df[selection].y.evaluate()
df.viz.heatmap(df.x, df.y)
plt.scatter(x, y)
plt.xlabel(r'my own label $\gamma$')  # raw string, so \g is not treated as an escape sequence
plt.xlim(-20, 20)
plt.ylim(-20, 20)

plt.sca(ax2)
df.viz.histogram(df.x, label='counts', n=True)
x = np.linspace(-30, 30, 100)
std = df.std(df.x.expression)
y = np.exp(-(x**2/std**2/2)) / np.sqrt(2*np.pi) / std
plt.plot(x, y, label='gaussian fit')
plt.legend()
plt.show()
_images/tutorial_64_0.png

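Healpix (Plotting)#

Healpix plotting is supported via the healpy package. Vaex does not need special support for healpix, except for plotting, but some helper functions are included to make working with healpix easier.

In the following example we will use the TGAS astronomy dataset.

To understand healpix better, we will start from the beginning. If we want to make a density sky plot, we would like to pass healpy a 1-d Numpy array in which each value represents the density at a location on the sphere, where the location is determined by the array size (the healpix level) and the offset (the position). The TGAS (and Gaia) data include the healpix index encoded in the source_id. Dividing the source_id by 34359738368 gives a healpix index at level 12, and dividing further takes you to lower levels.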
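
The arithmetic described above can be sketched in plain Python. The helper names and the example source_id are illustrative assumptions; the factor of 4 per level reflects each healpix pixel being split into 4 children at the next level, consistent with the factor 34359738368 * (4**(12-level)) used in the cell below.

```python
LEVEL12_DIVISOR = 34359738368  # equals 2**35, the divisor quoted in the text


def healpix_index(source_id, level):
    """Healpix index (nested scheme) at the given level, for 0 <= level <= 12."""
    index_level12 = source_id // LEVEL12_DIVISOR
    # Each level down merges 4 child pixels into one parent pixel.
    return index_level12 // 4 ** (12 - level)


def npix(level):
    """Number of healpix pixels at a level: 12 * nside**2 with nside = 2**level."""
    return 12 * (2 ** level) ** 2


source_id = 7627862074752  # example value, assumed to follow the Gaia encoding
print(healpix_index(source_id, 12), healpix_index(source_id, 10), npix(2))
```

At level 2 there are 192 pixels, which matches the length of the counts array computed in the next cell.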

[34]:
import vaex
import healpy as hp
tgas = vaex.datasets.tgas(full=True)

We start by showing how you could manually compute statistics on healpix bins using the count method. We will use a very coarse healpix scheme (level 2).

[35]:
level = 2
factor = 34359738368 * (4**(12-level))
nmax = hp.nside2npix(2**level)
epsilon = 1e-16
counts = tgas.count(binby=tgas.source_id/factor, limits=[-epsilon, nmax-epsilon], shape=nmax)
counts
[35]:
array([ 4021,  6171,  5318,  7114,  5755, 13420, 12711, 10193,  7782,
       14187, 12578, 22038, 17313, 13064, 17298, 11887,  3859,  3488,
        9036,  5533,  4007,  3899,  4884,  5664, 10741,  7678, 12092,
       10182,  6652,  6793, 10117,  9614,  3727,  5849,  4028,  5505,
        8462, 10059,  6581,  8282,  4757,  5116,  4578,  5452,  6023,
        8340,  6440,  8623,  7308,  6197, 21271, 23176, 12975, 17138,
       26783, 30575, 31931, 29697, 17986, 16987, 19802, 15632, 14273,
       10594,  4807,  4551,  4028,  4357,  4067,  4206,  3505,  4137,
        3311,  3582,  3586,  4218,  4529,  4360,  6767,  7579, 14462,
       24291, 10638, 11250, 29619,  9678, 23322, 18205,  7625,  9891,
        5423,  5808, 14438, 17251,  7833, 15226,  7123,  3708,  6135,
        4110,  3587,  3222,  3074,  3941,  3846,  3402,  3564,  3425,
        4125,  4026,  3689,  4084, 16617, 13577,  6911,  4837, 13553,
       10074,  9534, 20824,  4976,  6707,  5396,  8366, 13494, 19766,
       11012, 16130,  8521,  8245,  6871,  5977,  8789, 10016,  6517,
        8019,  6122,  5465,  5414,  4934,  5788,  6139,  4310,  4144,
       11437, 30731, 13741, 27285, 40227, 16320, 23039, 10812, 14686,
       27690, 15155, 32701, 18780,  5895, 23348,  6081, 17050, 28498,
       35232, 26223, 22341, 15867, 17688,  8580, 24895, 13027, 11223,
        7880,  8386,  6988,  5815,  4717,  9088,  8283, 12059,  9161,
        6952,  4914,  6652,  4666, 12014, 10703, 16518, 10270,  6724,
        4553,  9282,  4981])

Using healpy's mollview, we can visualize this.

[36]:
hp.mollview(counts, nest=True)
_images/tutorial_70_0.png

To simplify life, Vaex includes DataFrame.healpix_count to take care of this.

[37]:
counts = tgas.healpix_count(healpix_level=6)
hp.mollview(counts, nest=True)
_images/tutorial_72_0.png

Or even simpler, use DataFrame.viz.healpix_heatmap:

[38]:
tgas.viz.healpix_heatmap(
    f="log1p",
    healpix_level=6,
    figsize=(10,8),
    healpix_output="ecliptic"
)
_images/tutorial_74_0.png

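xarray support#

The df.count method can also return an xarray data array instead of a Numpy array. This is easily done via the array_type keyword. Building on top of Numpy, xarray adds dimension labels, coordinates and attributes, which makes working with multi-dimensional arrays more convenient.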

[39]:
xarr = df.count(binby=[df.x, df.y], limits=[-10, 10], shape=64, array_type='xarray')
xarr
[39]:
xarray.DataArray (x: 64, y: 64)
    array([[ 6,  3,  7, ..., 15, 10, 11],
           [10,  3,  7, ..., 10, 13, 11],
           [ 5, 15,  5, ..., 12, 18, 12],
           ...,
           [ 7,  8, 10, ...,  6,  7,  7],
           [12, 10, 17, ..., 11,  8,  2],
           [ 7, 10, 13, ...,  6,  5,  7]])
    Coordinates:
    • x (x) float64 -9.844 -9.531 ... 9.531 9.844
      array([-9.84375, -9.53125, -9.21875, -8.90625, -8.59375, -8.28125, -7.96875,
             -7.65625, -7.34375, -7.03125, -6.71875, -6.40625, -6.09375, -5.78125,
             -5.46875, -5.15625, -4.84375, -4.53125, -4.21875, -3.90625, -3.59375,
             -3.28125, -2.96875, -2.65625, -2.34375, -2.03125, -1.71875, -1.40625,
             -1.09375, -0.78125, -0.46875, -0.15625,  0.15625,  0.46875,  0.78125,
              1.09375,  1.40625,  1.71875,  2.03125,  2.34375,  2.65625,  2.96875,
              3.28125,  3.59375,  3.90625,  4.21875,  4.53125,  4.84375,  5.15625,
              5.46875,  5.78125,  6.09375,  6.40625,  6.71875,  7.03125,  7.34375,
              7.65625,  7.96875,  8.28125,  8.59375,  8.90625,  9.21875,  9.53125,
              9.84375])
    • y (y) float64 -9.844 -9.531 ... 9.531 9.844
      array([-9.84375, -9.53125, -9.21875, -8.90625, -8.59375, -8.28125, -7.96875,
             -7.65625, -7.34375, -7.03125, -6.71875, -6.40625, -6.09375, -5.78125,
             -5.46875, -5.15625, -4.84375, -4.53125, -4.21875, -3.90625, -3.59375,
             -3.28125, -2.96875, -2.65625, -2.34375, -2.03125, -1.71875, -1.40625,
             -1.09375, -0.78125, -0.46875, -0.15625,  0.15625,  0.46875,  0.78125,
              1.09375,  1.40625,  1.71875,  2.03125,  2.34375,  2.65625,  2.96875,
              3.28125,  3.59375,  3.90625,  4.21875,  4.53125,  4.84375,  5.15625,
              5.46875,  5.78125,  6.09375,  6.40625,  6.71875,  7.03125,  7.34375,
              7.65625,  7.96875,  8.28125,  8.59375,  8.90625,  9.21875,  9.53125,
              9.84375])

In addition, xarray has a convenient plotting method. Since the xarray object knows the label of each dimension, the plot axes are labelled automatically.

[40]:
xarr.plot();
_images/tutorial_78_0.png

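Having xarray as output helps us explore the contents of our data faster. In the following example we show how easy it is to plot the 2-d distribution of the sample positions (x, y) for each id group.

Notice how xarray automatically adds the appropriate titles and axis labels to the figure.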

[41]:
df.categorize('id', inplace=True)  # treat the id as a categorical column - automatically adjusts limits and shape
xarr = df.count(binby=['x', 'y', 'id'], limits='95%', array_type='xarray')
np.log1p(xarr).plot(col='id', col_wrap=7);
_images/tutorial_80_0.png

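Interactive widgets#

Note: The interactive widgets require a running Python kernel. If you are viewing this documentation online you can get a feeling for what the widgets can do, but computation will not be possible!

Using the vaex-jupyter package, we get access to interactive widgets (see the Vaex Jupyter tutorial for a more in-depth treatment).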

[42]:
import vaex
import vaex.jupyter
import numpy as np
import matplotlib.pyplot as plt
df = vaex.example()

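The simplest way to get a more interactive visualization (or even to print out statistics) is to use the vaex.jupyter.interactive_selection decorator, which executes the decorated function each time the selection changes.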

[43]:
df.select(df.x > 0)
@vaex.jupyter.interactive_selection(df)
def plot(*args, **kwargs):
    print("Mean x for the selection is:", df.mean(df.x, selection=True))
    df.viz.heatmap(df.x, df.y, what=np.log(vaex.stat.count()+1), selection=[None, True], limits='99.7%')
    plt.show()

After changing the selection programmatically, the visualization updates, as does the printed output.

[44]:
df.select(df.x > df.y)

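However, to get truly interactive visualization we need to use widgets, such as those of the bqplot library. Again, if we make a selection here, the visualization above will also update, so let's select a square region.

See the Vaex Jupyter tutorial for more interactive widgets.

Joining#

Joining in Vaex is similar to Pandas, except that the data is not copied. Internally, an index array is kept for each row of the left DataFrame, pointing to the right DataFrame, which requires about 8GB for a billion-row \(10^9\) dataset. Let's start with two small DataFrames, df1 and df2: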

[47]:
a = np.array(['a', 'b', 'c'])
x = np.arange(1,4)
df1 = vaex.from_arrays(a=a, x=x)
df1
[47]:
#  a  x
0  a  1
1  b  2
2  c  3
[48]:
b = np.array(['a', 'b', 'd'])
y = x**2
df2 = vaex.from_arrays(b=b, y=y)
df2
[48]:
#  b  y
0  a  1
1  b  4
2  d  9

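The default join is a 'left' join, in which all rows of the left DataFrame (df1) are kept and matching rows of the right DataFrame (df2) are added. As expected, some values are missing for the columns b and y.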

[49]:
df1.join(df2, left_on='a', right_on='b')
[49]:
#  a  x  b   y
0  a  1  a   1
1  b  2  b   4
2  c  3  --  --
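
The left join above can be sketched with an explicit index array in plain Python. This is a minimal illustration under stated assumptions: `left_join_indices` is a hypothetical helper, -1 marks a missing match, and only the first match per key is kept; it does not reproduce Vaex's implementation.

```python
def left_join_indices(left_keys, right_keys):
    """For each left row, the index of the matching right row, or -1 if none."""
    lookup = {}
    for j, key in enumerate(right_keys):
        lookup.setdefault(key, j)  # keep the first match per key
    return [lookup.get(key, -1) for key in left_keys]


a, x = ['a', 'b', 'c'], [1, 2, 3]   # columns of df1
b, y = ['a', 'b', 'd'], [1, 4, 9]   # columns of df2
indices = left_join_indices(a, b)
# Right-hand columns are looked up through the index array, not copied up front.
joined_y = [y[j] if j >= 0 else None for j in indices]
print(indices, joined_y)

# One 8-byte integer per left row gives the memory estimate quoted earlier.
index_array_gb = 10**9 * 8 / 1e9
print(index_array_gb, "GB")
```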

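A 'right' join is basically the same, but with the roles of the left and right DataFrames swapped, so now some values of the columns a and x are missing.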

[50]:
df1.join(df2, left_on='a', right_on='b', how='right')
[50]:
#  b  y  a   x
0  a  1  a   1
1  b  4  b   2
2  d  9  --  --

We can also do an 'inner' join, in which the output DataFrame contains only the rows common to df1 and df2.

[51]:
df1.join(df2, left_on='a', right_on='b', how='inner')
[51]:
#  a  x  b  y
0  a  1  a  1
1  b  2  b  4

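Other joins (e.g. outer) are currently not supported. Feel free to open an issue on GitHub for this.

Group-by#

With Vaex one can also do fast group-by aggregations. The output is a Vaex DataFrame. Let's look at a few examples.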

[52]:
import vaex
animal = ['dog', 'dog', 'cat', 'guinea pig', 'guinea pig', 'dog']
age = [2, 1, 5, 1, 3, 7]
cuteness = [9, 10, 5, 8, 4, 8]
df_pets = vaex.from_arrays(animal=animal, age=age, cuteness=cuteness)
df_pets
[52]:
#  animal      age  cuteness
0  dog         2    9
1  dog         1    10
2  cat         5    5
3  guinea pig  1    8
4  guinea pig  3    4
5  dog         7    8

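The syntax for group-by operations is virtually identical to that of Pandas. Note that when multiple aggregations are passed for a single column or expression, the output columns are named accordingly.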

[53]:
df_pets.groupby(by='animal').agg({'age': 'mean',
                                  'cuteness': ['mean', 'std']})
[53]:
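#  animal      age      cuteness_mean  cuteness_std
0  dog         3.33333  9              0.816497
1  cat         5        5              0
2  guinea pig  2        6              2

Vaex supports a number of aggregation functions, for example vaex.agg.mean, vaex.agg.nunique and vaex.agg.min.

In addition, we can specify the aggregation operations inside the groupby method, and name the resulting aggregate columns as we wish.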

[54]:
df_pets.groupby(by='animal',
                agg={'mean_age': vaex.agg.mean('age'),
                     'cuteness_unique_values': vaex.agg.nunique('cuteness'),
                     'cuteness_unique_min': vaex.agg.min('cuteness')})
[54]:
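#  animal      mean_age  cuteness_unique_values  cuteness_unique_min
0  dog         3.33333   3                       8
1  cat         5         1                       5
2  guinea pig  2         2                       4

A powerful feature of the aggregation functions in Vaex is that they support selections. This gives us the flexibility to make selections while aggregating. For example, let's calculate the mean cuteness of the pets in this example DataFrame, separated by age.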

[55]:
df_pets.groupby(by='animal',
                agg={'mean_cuteness_old': vaex.agg.mean('cuteness', selection='age>=5'),
                     'mean_cuteness_young': vaex.agg.mean('cuteness', selection='~(age>=5)')})
[55]:
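#  animal      mean_cuteness_old  mean_cuteness_young
0  dog         8                  9.5
1  cat         5                  nan
2  guinea pig  nan                6

Note that in the last example, the grouped DataFrame contains NaN for the groups that have no samples.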
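
The numbers in the last table can be checked by hand with a small plain-Python sketch. `grouped_mean` is a hypothetical helper that mimics an aggregation with a selection; it is not the Vaex API.

```python
import math

animal = ['dog', 'dog', 'cat', 'guinea pig', 'guinea pig', 'dog']
age = [2, 1, 5, 1, 3, 7]
cuteness = [9, 10, 5, 8, 4, 8]


def grouped_mean(keys, values, mask):
    """Mean of values per key, using only rows where mask is True."""
    # Every group gets an entry, even if none of its rows is selected.
    totals = {key: [0.0, 0] for key in dict.fromkeys(keys)}
    for key, value, selected in zip(keys, values, mask):
        if selected:
            totals[key][0] += value
            totals[key][1] += 1
    return {key: (s / n if n else math.nan) for key, (s, n) in totals.items()}


old = [a >= 5 for a in age]
mean_cuteness_old = grouped_mean(animal, cuteness, old)
mean_cuteness_young = grouped_mean(animal, cuteness, [not o for o in old])
print(mean_cuteness_old)
print(mean_cuteness_young)
```

Groups are kept even when the selection leaves them empty, which is where the NaN values come from.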

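String processing#

String processing is similar to Pandas, except that all operations are performed lazily, are multi-threaded, and are faster (implemented in C++). Check the API docs for more examples.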

[56]:
import vaex
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
df = vaex.from_arrays(text=text)
df
[56]:
#  text
0  Something
1  very pretty
2  is coming
3  our
4  way.
[57]:
df.text.str.upper()
[57]:
Expression = str_upper(text)
Length: 5 dtype: str (expression)
---------------------------------
0    SOMETHING
1  VERY PRETTY
2    IS COMING
3          OUR
4         WAY.
[58]:
df.text.str.title().str.replace('et', 'ET')
[58]:
Expression = str_replace(str_title(text), 'et', 'ET')
Length: 5 dtype: str (expression)
---------------------------------
0    SomEThing
1  Very PrETty
2    Is Coming
3          Our
4         Way.
[59]:
df.text.str.contains('e')
[59]:
Expression = str_contains(text, 'e')
Length: 5 dtype: bool (expression)
----------------------------------
0   True
1   True
2  False
3  False
4  False
[60]:
df.text.str.count('e')
[60]:
Expression = str_count(text, 'e')
Length: 5 dtype: int64 (expression)
-----------------------------------
0  1
1  2
2  0
3  0
4  0

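Propagation of uncertainties#

In science one often deals with measurement uncertainties (sometimes referred to as measurement errors). When transformations are applied to quantities that have associated uncertainties, Vaex can automatically compute the uncertainties of the transformed quantities. Note that propagation of uncertainties requires derivatives and matrix multiplications of lengthy equations, which is not complex, but tedious. Vaex can automatically work out all dependencies and derivatives, and compute the full covariance matrix.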
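
As a minimal sketch of what first-order propagation involves, the plain-Python snippet below computes the covariance of y = f(x) as J cov J^T, with J the Jacobian of f. It uses a numerical (finite-difference) Jacobian purely for illustration, which is an assumption of this sketch and not a description of how Vaex obtains its derivatives; `propagate` and the input numbers (a parallax and its error) are likewise illustrative.

```python
import math


def propagate(f, x, cov, eps=1e-6):
    """First-order propagation: covariance of y = f(x), i.e. J cov J^T."""
    n = len(x)
    m = len(f(x))
    # Numerical Jacobian J[i][k] = d y_i / d x_k via central differences.
    J = [[0.0] * n for _ in range(m)]
    for k in range(n):
        step = eps * max(1.0, abs(x[k]))
        upper = list(x)
        lower = list(x)
        upper[k] += step
        lower[k] -= step
        f_upper, f_lower = f(upper), f(lower)
        for i in range(m):
            J[i][k] = (f_upper[i] - f_lower[i]) / (2 * step)
    return [[sum(J[i][k] * cov[k][l] * J[j][l]
                 for k in range(n) for l in range(n))
             for j in range(m)] for i in range(m)]


# Distance as the inverse of the parallax, with an uncertainty on the parallax.
parallax, parallax_error = 6.35295075173405, 0.3079103606852086
cov_distance = propagate(lambda v: [1.0 / v[0]], [parallax], [[parallax_error ** 2]])
sigma_distance = math.sqrt(cov_distance[0][0])
print(1.0 / parallax, sigma_distance)
```

For this one-dimensional case the result reduces to the familiar rule sigma_d = sigma_p / p**2; with several correlated inputs the same function handles the full covariance matrix.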

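As an example, let us use the TGAS astronomy dataset once again. Even though the TGAS dataset already contains galactic sky coordinates (l and b), let's add them again by performing a coordinate system rotation from right ascension and declination. We can apply a similar transformation to convert the spherical galactic coordinates to Cartesian coordinates.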

[61]:
# convert parallax to distance
tgas.add_virtual_columns_distance_from_parallax(tgas.parallax)
# 'overwrite' the real columns 'l' and 'b' with virtual columns
tgas.add_virtual_columns_eq2gal('ra', 'dec', 'l', 'b')
# and combined with the galactic sky coordinates gives galactic cartesian coordinates of the stars
tgas.add_virtual_columns_spherical_to_cartesian(tgas.l, tgas.b, tgas.distance, 'x', 'y', 'z')
[61]:
# astrometric_delta_q astrometric_excess_noise astrometric_excess_noise_sig astrometric_n_bad_obs_ac astrometric_n_bad_obs_al astrometric_n_good_obs_ac astrometric_n_good_obs_al astrometric_n_obs_ac astrometric_n_obs_al astrometric_primary_flag astrometric_priors_used astrometric_relegation_factor astrometric_weight_ac astrometric_weight_al b dec dec_error dec_parallax_corr dec_pmdec_corr dec_pmra_corr duplicated_source ecl_lat ecl_lon hip l matched_observations parallax parallax_error parallax_pmdec_corr parallax_pmra_corr phot_g_mean_flux phot_g_mean_flux_error phot_g_mean_mag phot_g_n_obs phot_variable_flag pmdec pmdec_error pmra pmra_error pmra_pmdec_corr ra ra_dec_corr ra_error ra_parallax_corr ra_pmdec_corr ra_pmra_corr random_index ref_epoch scan_direction_mean_k1 scan_direction_mean_k2 scan_direction_mean_k3 scan_direction_mean_k4 scan_direction_strength_k1 scan_direction_strength_k2 scan_direction_strength_k3 scan_direction_strength_k4 solution_id source_id tycho2_id distance x y z
0 1.9190566539764404 0.7171010000916003 412.6059727233687 1 0 78 79 79 79 84 3 2.9360971450805664 1.2669624084082898e-05 1.818157434463501 -16.1210428281140140.23539164875137225 0.21880220693566088-0.4073381721973419 0.06065881997346878 -0.09945132583379745 70 -16.12105217335385342.64182504417002 13989 42.641804308626725 9 6.35295075173405 0.3079103606852086 -0.10195717215538025 -0.001576789305545389710312332.17299333210577.365273118843 7.991377829505826 77 b'NOT_AVAILABLE' -7.641989988351149 0.0874017933455474743.75231341609215 0.070542206426400810.21467718482017517 45.03433035439128 -0.41497212648391724 0.305989282002827270.17996619641780853-0.08575969189405441 0.15920649468898773 243619 2015.0 -113.76032257080078 21.39291763305664 -41.67839813232422 26.201841354370117 0.3823484778404236 0.5382660627365112 0.3923785090446472 0.9163063168525696 16353784107819335687627862074752 b'' 0.157407170160582170.111236040400056370.10243667003803988 -0.04370685490397632
1 nan 0.2534628812968044 47.316290890180255 2 0 55 57 57 57 84 5 2.6523141860961914 3.1600175134371966e-05 12.861557006835938 -16.19302376369384 0.2000676896877873 1.1977893944215496 0.8376259803771973 -0.9756439924240112 0.9725773334503174 70 -16.19303311057312 42.761180489478576-214748364842.76115974936648 8 3.90032893506844 0.3234880030045522 -0.8537789583206177 0.8397389650344849 949564.6488279914 1140.173576223928 10.58095871890025662 b'NOT_AVAILABLE' -55.10917285969142 2.522928801165149 10.03626300124532 4.611413518289133 -0.9963987469673157 45.1650067708984 -0.9959233403205872 2.583882288511597 -0.86091065406799320.9734798669815063 -0.9724165201187134 487238 2015.0 -156.432861328125 22.76607322692871 -36.23965835571289 22.890602111816406 0.7110026478767395 0.9659702777862549 0.6461148858070374 0.8671600818634033 16353784107819335689277129363072 b'55-28-1' 0.256388631996868450.1807701962996959 0.16716755815017084 -0.07150016957395491
2 nan 0.3989006354041912 221.18496561724646 4 1 57 60 61 61 84 5 3.9934017658233643 2.5633918994572014e-05 5.767529487609863 -16.12335382439265 0.24882543945301736 0.1803264123376257 -0.39189115166664124-0.193255528807640080.08942046016454697 70 -16.12336317040229642.69750168007008 -214748364842.69748094193635 7 3.15531322003673730.2734838183180671 -0.11855248361825943 -0.0418587327003479 817837.6000768564 1827.3836759985832 10.74310238043427360 b'NOT_AVAILABLE' -1.602867102186794 1.0352589283446592 2.9322836829569003 1.908644426623371 -0.9142706990242004 45.08615483797584 -0.1774432212114334 0.2138361631952843 0.30772241950035095-0.1848166137933731 0.04686680808663368 1948952 2015.0 -117.00751495361328 19.772153854370117 -43.108219146728516 26.7157039642334 0.4825277626514435 0.4287584722042084 0.5241528153419495 0.9030616879463196 163537841078193356813297218905216 b'55-1191-1' 0.316925747228465950.223761030194755460.2064625216744117 -0.08801225918215205
3 nan 0.4224923646481251 179.98201436339852 1 0 51 52 52 52 84 5 4.215157985687256 2.8672602638835087e-05 5.3608622550964355 -16.1182068792970340.24821079122833972 0.20095844850181172-0.33721715211868286-0.223501190543174740.13181143999099731 70 -16.11821622503516 42.67779093546686 -214748364842.67777019818556 7 2.292366835156796 0.2809724206784257 -0.10920235514640808 -0.049440864473581314 602053.4754362862 905.8772856344845 11.07568239443574561 b'NOT_AVAILABLE' -18.4149121148257321.1298513589995536 3.661982345981763 2.065051873379775 -0.9261773228645325 45.06654155758114 -0.36570677161216736 0.2760390513575931 0.2028782218694687 -0.058928851038217545 -0.050908856093883514102321 2015.0 -132.42112731933594 22.56928253173828 -38.95445251464844 25.878559112548828 0.4946548640727997 0.6384561061859131 0.5090736746788025 0.8989177942276001 163537841078193356813469017597184 b'55-624-1' 0.436230355745659160.308100140405318630.2840853806346911 -0.12110624783986161
4 nan 0.3175001122010629 119.74837853832186 2 3 85 84 87 87 84 5 3.2356362342834473 2.22787512029754e-05 8.080779075622559 -16.0554718307503740.33504360351532875 0.1701298562030361 -0.43870800733566284-0.279348850250244140.12179157137870789 70 -16.0554811777948 42.77336987816832 -214748364842.77334913546197 11 1.582076960273368 0.2615394689640736 -0.329196035861969 0.10031197965145111 1388122.242048847 2826.428866453177 10.16870078127108896 b'NOT_AVAILABLE' -2.379387386351838 0.7106320061478508 0.340802333695025161.2204755227890713 -0.8336043357849121 45.13603822322069 -0.0490525588393211360.170696952833767760.4714251756668091 -0.1563923954963684 -0.15207625925540924 409284 2015.0 -106.85968017578125 4.452099323272705 -47.8953971862793 26.755468368530273 0.5206537842750549 0.23930974304676056 0.653376579284668 0.8633849024772644 163537841078193356815736760328576 b'55-849-1' 0.6320805024726543 0.445878380954020440.41250283253756015 -0.17481316927621393
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2,057,04525.898868560791016 0.6508009723190962 172.3136755413185 0 0 54 54 54 54 84 3 6.386378765106201 1.8042501324089244e-05 2.2653496265411377 16.006806970347426 -0.423196860251580430.249741476396420750.00821441039443016 0.2133195698261261 -0.00080527918180450870 16.006807041815204 317.0782357688112 103561 -42.92178788756781 8 5.07430693974197760.2840892420661878 -0.0308084636926651 -0.03397708386182785 4114975.455725508 3447.5776608146016 8.988851940956916 69 b'NOT_AVAILABLE' -4.440524133201202 0.0474329790178223721.970772995655643 0.078468931186690470.3920176327228546 314.741700437929240.08548042178153992 0.2773321068969684 0.2473779171705246 -0.00060404307441785930.11652233451604843 1595738 2015.0 -18.078920364379883 -17.731922149658203 38.27400588989258 27.63787269592285 0.29217642545700073 0.11402469873428345 0.0404343381524086 0.937016487121582 16353784107819335686917488998546378368b'' 0.197071247733951380.13871698568448773-0.129002113090694430.054342703136315784
2,057,046nan 0.17407523451856974 28.886549102578012 0 2 54 52 54 54 84 5 1.9612410068511963 2.415467497485224e-05 24.774322509765625 16.12926993546893 -0.324975343682328940.148233655691999750.8842677474021912 -0.9121489524841309 -0.8994856476783752 70 16.129270018016896 317.0105462544942 -2147483648-42.98947742356782 7 1.69834808174399220.7410137777358506 -0.9793509840965271 -0.9959075450897217 1202425.5197785893871.2480333575235 10.32462460143572359 b'NOT_AVAILABLE' -10.4012251112689621.4016954983272711 -1.28356129908418742.7416807292293637 0.980453610420227 314.643817893111930.8981446623802185 0.3590974400544809 0.9818224906921387 -0.9802247881889343 -0.9827051162719727 2019553 2015.0 -87.07184600830078 -31.574886322021484 -36.37055206298828 29.130958557128906 0.22651544213294983 0.07730517536401749 0.2675701975822449 0.9523505568504333 16353784107819335686917493705830041600b'5179-753-1' 0.5888074481016426 0.4137467499267554 -0.385683048078504840.16357391078619246
2,057,047nan 0.47235246463190794 92.12190417660749 2 0 34 36 36 36 84 5 4.68601131439209 2.138371200999245e-05 3.9279115200042725 15.92496896432183 -0.343177320443203870.20902981533215972-0.2000708132982254 0.31042322516441345 -0.3574342727661133 70 15.924968943694909 317.6408327998631 -2147483648-42.3591908420944146 6.036938108863445 0.39688014089787665-0.7275367975234985 -0.25934046506881714 3268640.52536146954918.5087736624755 9.238852161621992 51 b'NOT_AVAILABLE' -27.8523447526722451.2778575351686428 15.713555906870294 0.9411842746983148 -0.1186852976679802 315.2828795933192 -0.47665935754776 0.4722647631556871 0.704002320766449 -0.77033931016922 0.12704335153102875 788948 2015.0 -21.23501205444336 20.132535934448242 33.55913162231445 26.732301712036133 0.41511622071266174 0.5105549693107605 0.15976844727993011 0.9333845376968384 16353784107819335686917504975824469248b'5192-877-1' 0.165646886214022630.11770477437507047-0.107325590749532430.045449912782963474
2,057,048nan 0.3086465263182493 76.66564461310193 1 2 52 51 53 53 84 5 3.154139280319214 1.9043474821955897e-05 9.627826690673828 16.193728871838935 -0.228113600435448820.131650037775767 0.3082593083381653 -0.5279345512390137 -0.4065483510494232 70 16.193728933791913 317.1363617703344 -2147483648-42.86366191921117 7 1.484142306295484 0.34860128377301614-0.7272516489028931 -0.9375584125518799 4009408.31726829061929.9834553649182 9.017069346445364 60 b'NOT_AVAILABLE' 1.8471079057572073 0.7307171627866237 11.352888915160555 1.219847308406543 0.7511345148086548 314.7406481637209 0.41397571563720703 0.192052966417785630.7539510726928711 -0.7239754796028137 -0.7911394238471985 868066 2015.0 -89.73970794677734 -25.196216583251953 -35.13546371459961 29.041872024536133 0.21430812776088715 0.06784655898809433 0.2636755108833313 0.9523414969444275 16353784107819335686917517998165066624b'5179-1401-1'0.6737898352187435 0.4742760432178817 -0.440164289459801350.18791055094922077
2,057,049nan 0.4329850465924866 60.789771079095715 0 0 26 26 26 26 84 5 4.3140177726745605 2.7940122890868224e-05 4.742301940917969 16.135962442685898 -0.221300816243519350.2686748166142929 -0.466053694486618040.30018869042396545 -0.3290684223175049 70 16.13596246842634 317.3575812619557 -2147483648-42.6424424173883245 2.680111343641743 0.4507741964825321 -0.689416229724884 -0.1735922396183014 2074338.153903563 4136.498086035368 9.732571175024953 31 b'NOT_AVAILABLE' 3.15173423618292 1.4388911228835037 2.897878776243949 1.0354817855168323 -0.21837876737117767314.960730599014 -0.4467950165271759 0.491820509447922160.7087226510047913 -0.8360105156898499 0.2156151533126831 1736132 2015.0 -63.01319885253906 18.303699493408203 -49.05630111694336 28.76698875427246 0.3929939866065979 0.32352808117866516 0.24211134016513824 0.9409775733947754 16353784107819335686917521537218608640b'5179-1719-1'0.3731188267130712 0.2636519673685346 -0.242801102164863340.10369630532457579

Since RA and Dec are in degrees, while ra_error and dec_error are in milliarcseconds, we need to put them on the same scale.

[62]:
tgas['ra_error'] = tgas.ra_error / 1000 / 3600
tgas['dec_error'] = tgas.dec_error / 1000 / 3600
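The divisors above follow from 1 arcsecond = 1000 milliarcseconds and 1 degree = 3600 arcseconds. A minimal standalone sketch of that arithmetic, where the helper name `mas_to_deg` is ours and not part of Vaex:

```python
MAS_PER_ARCSEC = 1000   # milliarcseconds in one arcsecond
ARCSEC_PER_DEG = 3600   # arcseconds in one degree

def mas_to_deg(value_mas):
    """Convert an angle from milliarcseconds to degrees."""
    return value_mas / MAS_PER_ARCSEC / ARCSEC_PER_DEG

# 3.6 million milliarcseconds make up exactly one degree.
converted = mas_to_deg(3_600_000)
print(converted)  # 1.0
```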

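We now let Vaex work out the covariance matrix for the Cartesian coordinates x, y, and z. We then take 50 samples from the dataset for visualization.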

[63]:
tgas.propagate_uncertainties([tgas.x, tgas.y, tgas.z])
tgas_50 = tgas.sample(50, random_state=42)

For this small subset of the dataset, we can visualize the uncertainties, both with and without the covariance.

[64]:
tgas_50.scatter(tgas_50.x, tgas_50.y, xerr=tgas_50.x_uncertainty, yerr=tgas_50.y_uncertainty)
plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.show()
tgas_50.scatter(tgas_50.x, tgas_50.y, xerr=tgas_50.x_uncertainty, yerr=tgas_50.y_uncertainty, cov=tgas_50.y_x_covariance)
plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.show()
_images/tutorial_120_0.png
_images/tutorial_120_1.png

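From the second plot, we see that showing error ellipses (so narrow that they appear as lines) instead of error bars reveals that the distance information dominates the uncertainty in this case.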
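The narrow ellipses can be reproduced with a small first-order error propagation in NumPy. The sketch below is not Vaex's implementation: it uses a 2D polar-to-Cartesian toy case with made-up values (`r`, `phi`, `sigma_r`, `sigma_phi`) and propagates the input covariance through the Jacobian, \(\Sigma_{xy} = J \Sigma_{r\phi} J^T\).

```python
import numpy as np

# Hypothetical measurement: distance r with a large error, angle phi with a tiny one.
r, phi = 5.0, np.deg2rad(30.0)
sigma_r, sigma_phi = 1.0, 1e-4

# Jacobian of (x, y) = (r cos(phi), r sin(phi)) with respect to (r, phi).
J = np.array([[np.cos(phi), -r * np.sin(phi)],
              [np.sin(phi),  r * np.cos(phi)]])

cov_in = np.diag([sigma_r**2, sigma_phi**2])  # uncorrelated inputs
cov_xy = J @ cov_in @ J.T                     # first-order propagation

# The semi-axes of the error ellipse are the square roots of the eigenvalues
# (eigvalsh returns them in ascending order).
minor, major = np.sqrt(np.linalg.eigvalsh(cov_xy))
print(cov_xy)
print(major, minor)
```

The large off-diagonal term in `cov_xy` is what turns two similarly sized error bars into an ellipse stretched along the direction of the distance error.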

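Just-In-Time compilation#

Let us start with a function that calculates the angular distance between two points on the surface of a sphere. The input of the function is a pair of 2 angular coordinates, in radians.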

[65]:
import vaex
import numpy as np
# From http://pythonhosted.org/pythran/MANUAL.html
def arc_distance(theta_1, phi_1, theta_2, phi_2):
    """
    Calculates the pairwise arc distance
    between all points in vector a and b.
    """
    temp = (np.sin((theta_2-theta_1)/2)**2
           + np.cos(theta_1)*np.cos(theta_2) * np.sin((phi_2-phi_1)/2)**2)
    distance_matrix = 2 * np.arctan2(np.sqrt(temp), np.sqrt(1-temp))
    return distance_matrix
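As a sanity check, the haversine formula behind this kind of function can be evaluated on a few points whose separation is known in advance. The sketch below is a standalone NumPy version with our own names (`haversine_arc`, `lat_*`, `lon_*`), assuming the first angle of each pair plays the role of latitude, since it is the one that appears inside the cosine.

```python
import numpy as np

def haversine_arc(lat_1, lon_1, lat_2, lon_2):
    """Great-circle angle (radians) between two points given in radians."""
    temp = (np.sin((lat_2 - lat_1) / 2) ** 2
            + np.cos(lat_1) * np.cos(lat_2) * np.sin((lon_2 - lon_1) / 2) ** 2)
    return 2 * np.arctan2(np.sqrt(temp), np.sqrt(1 - temp))

# Two points on the equator, a quarter turn apart in longitude.
quarter = haversine_arc(0.0, 0.0, 0.0, np.pi / 2)
# From the equator to the north pole.
to_pole = haversine_arc(0.0, 0.0, np.pi / 2, 0.0)
# A point compared with itself.
same = haversine_arc(0.3, 1.2, 0.3, 1.2)
print(quarter, to_pole, same)
```

Both non-trivial cases should come out at a quarter of a full circle, pi/2 radians, and the last one at zero.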

Let us use the New York taxi dataset of 2015, which can be downloaded in hdf5 format.

[66]:
# nytaxi = vaex.open('s3://vaex/taxi/yellow_taxi_2009_2015_f32.hdf5?anon=true')
nytaxi = vaex.open('/Users/jovan/Work/vaex-work/vaex-taxi/data/yellow_taxi_2009_2015_f32.hdf5')
# let's use just 20% of the data, since we want to make sure it fits
# into memory (so we don't measure just hdd/ssd speed)
nytaxi.set_active_fraction(0.2)

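Although the function above expects Numpy arrays, Vaex can pass in columns or expressions instead. This delays the execution until the result is needed, and adds the resulting expression as a virtual column.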

[67]:
nytaxi['arc_distance'] = arc_distance(nytaxi.pickup_longitude * np.pi/180,
                                      nytaxi.pickup_latitude * np.pi/180,
                                      nytaxi.dropoff_longitude * np.pi/180,
                                      nytaxi.dropoff_latitude * np.pi/180)

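When we calculate the mean angular distance of a taxi trip, we encounter some invalid data that triggers warnings, which we can safely ignore for this demonstration.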

[68]:
%%time
nytaxi.mean(nytaxi.arc_distance)
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/functions.py:121: RuntimeWarning: invalid value encountered in sqrt
  return function(*args, **kwargs)
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/functions.py:121: RuntimeWarning: invalid value encountered in sin
  return function(*args, **kwargs)
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/functions.py:121: RuntimeWarning: invalid value encountered in cos
  return function(*args, **kwargs)
CPU times: user 44.5 s, sys: 5.03 s, total: 49.5 s
Wall time: 6.14 s
[68]:
array(1.99993285)

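This computation uses quite a few heavy mathematical operations, and since it (internally) uses Numpy arrays, it also uses quite a few temporary arrays. We can optimize this calculation with Just-In-Time compilation based on numba, pythran, or, if you have an NVIDIA graphics card, cuda. Choose whichever gives the best performance or is easiest to install.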

[69]:
nytaxi['arc_distance_jit'] = nytaxi.arc_distance.jit_numba()
# nytaxi['arc_distance_jit'] = nytaxi.arc_distance.jit_pythran()
# nytaxi['arc_distance_jit'] = nytaxi.arc_distance.jit_cuda()
[70]:
%%time
nytaxi.mean(nytaxi.arc_distance_jit)
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/expression.py:1038: RuntimeWarning: invalid value encountered in f
  return self.f(*args, **kwargs)
CPU times: user 25.7 s, sys: 330 ms, total: 26 s
Wall time: 2.31 s
[70]:
array(1.9999328)

In this case we get a significant speedup (\(\sim 3 x\)): the wall time drops from 6.14 s to 2.31 s.
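To see where the temporaries come from: every NumPy operation in an expression allocates a full-length intermediate array, whereas a compiled kernel can evaluate the whole formula element by element in a single pass. The sketch below imitates such a fused loop in plain Python purely for illustration. It is far slower than NumPy and is not what numba, pythran, or cuda actually generate; the function names and the toy formula are ours. It only shows that the two formulations compute the same numbers.

```python
import math
import numpy as np

def expr_vectorized(a, b):
    # Each operation (sin, **, cos, +, sqrt) creates a temporary array.
    return np.sqrt(np.sin(a) ** 2 + np.cos(b) ** 2)

def expr_fused(a, b):
    # One pass over the data and one output array, no intermediate arrays.
    out = np.empty(len(a))
    for i in range(len(a)):
        out[i] = math.sqrt(math.sin(a[i]) ** 2 + math.cos(b[i]) ** 2)
    return out

rng = np.random.default_rng(0)
a = rng.uniform(-1, 1, size=1000)
b = rng.uniform(-1, 1, size=1000)
same_result = bool(np.allclose(expr_vectorized(a, b), expr_fused(a, b)))
print(same_result)
```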

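Parallel computations#

As mentioned in the sections on selections, Vaex can do computations in parallel. Often this is taken care of automatically, for instance when passing multiple selections to a method, or multiple arguments to one of the statistical functions. However, sometimes it is difficult or impossible to express a computation in one expression, and we need to resort to so-called "delayed" computation, similar to what joblib and dask do.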

[71]:
import vaex
df = vaex.example()
limits = [-10, 10]
delayed_count = df.count(df.E, binby=df.x, limits=limits,
                         shape=4, delay=True)
delayed_count
[71]:
<vaex.promise.Promise at 0x7ffbd64072d0>

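Note that the returned value is now a promise (TODO: a more Pythonic way would be to return a Future). This may be subject to change, and the best way to work with it is to use the delayed decorator, and to call DataFrame.execute when the results are needed.

In addition to the delayed computation above, we schedule more computations, such that both the count and the mean are executed in parallel and only a single pass over the data is needed. We schedule the execution of two extra functions using the vaex.delayed decorator, and run the whole pipeline using df.execute().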

[72]:
delayed_sum = df.sum(df.E, binby=df.x, limits=limits,
                         shape=4, delay=True)

@vaex.delayed
def calculate_mean(sums, counts):
    print('calculating mean')
    return sums/counts

print('before calling mean')
# since calculate_mean is decorated with vaex.delayed
# this now also returns a 'delayed' object (a promise)
delayed_mean = calculate_mean(delayed_sum, delayed_count)

# if we'd like to perform operations on that, we can again
# use the same decorator
@vaex.delayed
def print_mean(means):
    print('means', means)
print_mean(delayed_mean)

print('before calling execute')
df.execute()

# Using the .get on the promise will also return the result
# However, this will only work after execute, and may be
# subject to change
means = delayed_mean.get()
print('same means', means)

before calling mean
before calling execute
calculating mean
means [ -94323.68051598 -118749.23850834 -119119.46292653  -95021.66183457]
same means [ -94323.68051598 -118749.23850834 -119119.46292653  -95021.66183457]
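The pattern behind the output order above (nothing runs until execute, and all queued reductions share one pass) can be imitated in a few lines of plain Python. The sketch below is a toy model and not Vaex's implementation: `Promise`, `delayed`, and `OnePassTable` are hypothetical names introduced only for this illustration.

```python
class Promise:
    """Holds a value that becomes available only after the scheduler runs."""
    def __init__(self):
        self._done = False
        self._value = None
        self._callbacks = []

    def resolve(self, value):
        self._value, self._done = value, True
        for callback in self._callbacks:
            callback(value)

    def then(self, callback):
        if self._done:
            callback(self._value)
        else:
            self._callbacks.append(callback)

    def get(self):
        if not self._done:
            raise RuntimeError("result not ready; call execute() first")
        return self._value


def delayed(func):
    """Call func once all of its promise arguments have been resolved."""
    def wrapper(*promises):
        result = Promise()
        remaining = [len(promises)]

        def on_ready(_):
            remaining[0] -= 1
            if remaining[0] == 0:
                result.resolve(func(*(p.get() for p in promises)))

        for p in promises:
            p.then(on_ready)
        return result
    return wrapper


class OnePassTable:
    """Queues reductions and evaluates all of them in a single loop."""
    def __init__(self, values):
        self.values = values
        self.passes = 0
        self._tasks = []  # (initial value, update function, promise)

    def _schedule(self, initial, update):
        promise = Promise()
        self._tasks.append((initial, update, promise))
        return promise

    def sum(self):
        return self._schedule(0.0, lambda acc, v: acc + v)

    def count(self):
        return self._schedule(0, lambda acc, v: acc + 1)

    def execute(self):
        accumulators = [initial for initial, _, _ in self._tasks]
        for v in self.values:  # the only pass over the data
            for i, (_, update, _) in enumerate(self._tasks):
                accumulators[i] = update(accumulators[i], v)
        self.passes += 1
        for acc, (_, _, promise) in zip(accumulators, self._tasks):
            promise.resolve(acc)
        self._tasks = []


table = OnePassTable([1.0, 2.0, 3.0, 6.0])
delayed_sum = table.sum()
delayed_count = table.count()

@delayed
def calculate_mean(total, n):
    return total / n

delayed_mean = calculate_mean(delayed_sum, delayed_count)
table.execute()
print(delayed_mean.get(), table.passes)  # 3.0 1
```

Calling `delayed_mean.get()` before `table.execute()` raises an error in this toy model, which mirrors the remark above that `.get` only works after execute.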

Extending Vaex#

Vaex can be extended using several mechanisms.

Adding functions#

Use the vaex.register_function decorator API to add new functions.

[73]:
import vaex
import numpy as np
@vaex.register_function()
def add_one(ar):
    return ar+1

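The function can be invoked using the df.func accessor, which returns a new expression. When the expression is evaluated in any Vaex context, each argument that is an expression is replaced by a Numpy array.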

[74]:
df = vaex.from_arrays(x=np.arange(4))
df.func.add_one(df.x)
[74]:
Expression = add_one(x)
Length: 4 dtype: int64 (expression)
-----------------------------------
0  1
1  2
2  3
3  4

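By default (passing on_expression=True), the function is also available as a method on expressions, where the expression itself is automatically set as the first argument (since this is a very common use case).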

[75]:
df.x.add_one()
[75]:
Expression = add_one(x)
Length: 4 dtype: int64 (expression)
-----------------------------------
0  1
1  2
2  3
3  4

In case the first argument is not an expression, pass on_expression=False, and use the df.func accessor to build a new expression using the function:

[76]:
@vaex.register_function(on_expression=False)
def addmul(a, b, x, y):
    return a*x + b * y
[77]:
df = vaex.from_arrays(x=np.arange(4))
df['y'] = df.x**2
df.func.addmul(2, 3, df.x, df.y)
[77]:
Expression = addmul(2, 3, x, y)
Length: 4 dtype: int64 (expression)
-----------------------------------
0   0
1   5
2  16
3  33
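Because each expression argument is replaced by a Numpy array at evaluation time, the values above can be checked outside Vaex by calling the undecorated function on plain arrays. A minimal sketch, assuming the same addmul body as in the cell above:

```python
import numpy as np

def addmul(a, b, x, y):
    # Same body as the registered function, without the decorator.
    return a * x + b * y

x = np.arange(4)
y = x ** 2
result = addmul(2, 3, x, y)
print(result.tolist())  # [0, 5, 16, 33]
```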

These expressions can be added as virtual columns, as expected.

[78]:
df = vaex.from_arrays(x=np.arange(4))
df['y'] = df.x**2
df['z'] = df.func.addmul(2, 3, df.x, df.y)
df['w'] = df.x.add_one()
df
[78]:
# x y z w
0 0 0 0 1
1 1 1 5 2
2 2 4 16 3
3 3 9 33 4

Adding DataFrame accessors#

When adding methods that operate on DataFrames, it makes sense to group them together in a single namespace.

[79]:
@vaex.register_dataframe_accessor('scale', override=True)
class ScalingOps(object):
    def __init__(self, df):
        self.df = df

    def mul(self, a):
        df = self.df.copy()
        for col in df.get_column_names(strings=False):
            if df[col].dtype:
                df[col] = df[col] * a
        return df

    def add(self, a):
        df = self.df.copy()
        for col in df.get_column_names(strings=False):
            if df[col].dtype:
                df[col] = df[col] + a
        return df
[80]:
df.scale.add(1)
[80]:
# x y z w
0 1 1 1 2
1 2 2 6 3
2 3 5 17 4
3 4 10 34 5
[81]:
df.scale.mul(2)
[81]:
# x y z w
0 0 0 0 2
1 2 2 10 4
2 4 8 32 6
3 6 18 66 8

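Convenience methods#

Get column names#

Often we want to work with a subset of the columns in our DataFrame. With the get_column_names method, Vaex makes it easy and convenient to get exactly the columns you need. By default, get_column_names returns all columns: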

[1]:
import vaex
df = vaex.datasets.titanic()
print(df.get_column_names())
['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home_dest']

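The same method has a number of arguments that make it easy to get the desired subset of columns. For example, one can pass a regular expression to select columns based on their names. In the cell below we select all columns whose names are exactly 5 characters long: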

[2]:
print(df.get_column_names(regex='^[a-zA-Z]{5}$'))
['sibsp', 'parch', 'cabin']
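The pattern can be tried out on the column list printed earlier using only the standard library. How Vaex applies the pattern internally is not shown in this tutorial, so the sketch below simply assumes a plain `re` match against each name:

```python
import re

# Column names copied from the output of get_column_names() above.
column_names = ['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch',
                'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home_dest']

# ^ and $ anchor the match, {5} requires exactly five letters in between.
pattern = re.compile(r'^[a-zA-Z]{5}$')
five_letter = [name for name in column_names if pattern.match(name)]
print(five_letter)  # ['sibsp', 'parch', 'cabin']
```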

We can also select columns based on their type. Below we select all columns that are integers or floats:

[3]:
df.get_column_names(dtype=['int', 'float'])
[3]:
['pclass', 'age', 'sibsp', 'parch', 'fare', 'body']

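The escape hatch: apply#

In case a calculation cannot be expressed as a Vaex expression, one can use the apply method as a last resort. This can be useful if the function you want to apply is written in pure Python, or comes from a third-party library, and is difficult or impossible to vectorize.

We think apply should only be used as a last resort, because it needs to use multiprocessing (which spawns new processes) to avoid the Python Global Interpreter Lock (GIL) and make use of multiple cores. This comes at the cost of having to transfer the data between the main and child processes.

Here is an example which uses the apply method: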

[1]:
import vaex

def slow_is_prime(x):
    return x > 1 and all((x % i) != 0 for i in range(2, x))

df = vaex.from_arrays(x=vaex.vrange(0, 100_000, dtype='i4'))
# you need to explicitly specify which arguments you need
df['is_prime'] = df.apply(slow_is_prime, arguments=[df.x])
df.head(10)
[1]:
# x is_prime
0 0 False
1 1 False
2 2 True
3 3 True
4 4 False
5 5 True
6 6 False
7 7 True
8 8 False
9 9 False
[2]:
prime_count = df.is_prime.sum()
print(f'There are {prime_count} prime numbers between 0 and {len(df)}')
There are 9592 prime numbers between 0 and 100000
[3]:
# both of these are equivalent
df['is_prime'] = df.apply(slow_is_prime, arguments=[df.x])
# but this form only works for a single argument
df['is_prime'] = df.x.apply(slow_is_prime)
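The count can be cross-checked without apply. The sketch below is an independent pure-Python sieve of Eratosthenes (the name `count_primes_below` is ours); it is unrelated to Vaex and only confirms the figure printed above.

```python
def count_primes_below(n):
    """Count primes p with p < n using the sieve of Eratosthenes."""
    if n < 3:
        return 0
    is_prime = bytearray([1]) * n   # is_prime[k] == 1 means k is still a candidate
    is_prime[0] = is_prime[1] = 0
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            # Cross out the multiples of i, starting at i*i.
            is_prime[i * i::i] = bytearray(len(range(i * i, n, i)))
    return sum(is_prime)

prime_count = count_primes_below(100_000)
print(prime_count)  # 9592
```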

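When not to use apply#

You should not use apply when your function can be vectorized. When you use Vaex's expression system, we know what you are doing: we see the expression, and can manipulate it in order to achieve optimal performance. An apply function is like a black box. We cannot do anything with it, such as JIT-compiling it.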

[4]:
df = vaex.from_arrays(x=vaex.vrange(0, 10_000_000, dtype='f4'))
[5]:
# ideal case
df['y'] = df.x**2
[6]:
%%timeit
df.y.sum()
29.6 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
[7]:
# will transfer the data to child processes, and execute the ** operation in Python for each element
df['y_slow'] = df.x.apply(lambda x: x**2)
[8]:
%%timeit
df.y_slow.sum()
353 ms ± 40 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[9]:
# bad idea: it will transfer the data to the child process, where it will be executed in vectorized form
df['y_slow_vectorized'] = df.x.apply(lambda x: x**2, vectorize=True)
[10]:
%%timeit
df.y_slow_vectorized.sum()
82.8 ms ± 525 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
[11]:
# bad idea: same performance as just df['y'], but we lose the information about what was done
df['y_fast'] = df.x.apply(lambda x: x**2, vectorize=True, multiprocessing=False)
[12]:
%%timeit
df.y_fast.sum()
28.8 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
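Putting the timings side by side makes the cost of each variant explicit. A small arithmetic sketch using only the mean values printed in the cells above (the dictionary keys are our own labels, and timings will differ on other machines):

```python
# Mean time per loop copied from the %%timeit outputs above, in milliseconds.
timings_ms = {
    'y (expression)': 29.6,
    'y_slow (apply)': 353.0,
    'y_slow_vectorized': 82.8,
    'y_fast (no multiprocessing)': 28.8,
}

baseline = timings_ms['y (expression)']
# Factor relative to the plain expression, rounded to one decimal.
slowdown = {name: round(t / baseline, 1) for name, t in timings_ms.items()}
for name, factor in slowdown.items():
    print(f'{name}: {factor}x')
```

On these numbers, the per-element apply is roughly an order of magnitude slower than the expression, and the vectorized apply still pays close to a factor of three for moving data between processes.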