Vaex introduction in 11 minutes#

If you want to try out this notebook with a live Python kernel, use mybinder.

DataFrame#

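Central to Vaex is the DataFrame (similar to a Pandas DataFrame, but more efficient), and we often use the variable df to represent it. A DataFrame is an efficient representation of a large tabular dataset, and has:

- A set of columns, say x, y and z, which are:
  - backed by a Numpy array;
  - wrapped by an expression system, e.g. df.x, df['x'] or df.col.x is an Expression;
  - able to perform lazy computations, e.g. df.x * np.sin(df.y) does nothing until the result is needed.
- A set of virtual columns, which are backed by a (lazy) computation, e.g. df['r'] = df.x/df.y
- A set of selections, which can be used to explore the dataset, e.g. df.select(df.x < 0)
- Filtered DataFrames, which do not copy the data, e.g. df_negative = df[df.x < 0]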
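
The idea of a lazy expression can be sketched in a few lines of plain Python. This is only a toy illustration: the `LazyExpr` class and the `load` helper are hypothetical names invented here, and they do not reflect how Vaex implements its expression system.

```python
import numpy as np

evaluations = []  # records which columns were actually read


class LazyExpr:
    """Toy lazy expression (hypothetical, not Vaex's Expression class)."""

    def __init__(self, text, produce):
        self.text = text        # human-readable form of the computation
        self.produce = produce  # callable that yields the values on demand

    def __mul__(self, other):
        # Build a new expression; no array arithmetic happens here.
        return LazyExpr(f"({self.text} * {other.text})",
                        lambda: self.produce() * other.produce())

    def evaluate(self):
        return self.produce()


def load(name, values):
    """Wrap an in-memory array as a lazy column that logs each read."""
    def produce():
        evaluations.append(name)
        return np.asarray(values, dtype=float)
    return LazyExpr(name, produce)


x = load("x", [1.0, 2.0, 3.0])
y = load("y", [10.0, 20.0, 30.0])

product = x * y                        # only the expression is built
reads_before = len(evaluations)        # nothing has been read yet
result = product.evaluate()            # the computation happens here
print(product.text, result)
```

The point of the design is that building `product` is essentially free; the cost is paid only when a result is requested.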

Let us start with an example dataset, which is included in Vaex.

[1]:

import vaex
df = vaex.example()
df  # Since this is the last statement in a cell, it will print the DataFrame in a nice HTML format.

[1]:

#	id	x	y	z	vx	vy	vz	E	L	Lz	FeH
0	0	1.2318683862686157	-0.39692866802215576	-0.598057746887207	301.1552734375	174.05947875976562	27.42754554748535	-149431.40625	407.38897705078125	333.9555358886719	-1.0053852796554565
1	23	-0.16370061039924622	3.654221296310425	-0.25490644574165344	-195.00022888183594	170.47216796875	142.5302276611328	-124247.953125	890.2411499023438	684.6676025390625	-1.7086670398712158
2	32	-2.120255947113037	3.326052665710449	1.7078403234481812	-48.63423156738281	171.6472930908203	-2.079437255859375	-138500.546875	372.2410888671875	-202.17617797851562	-1.8336141109466553
3	8	4.7155890464782715	4.5852508544921875	2.2515437602996826	-232.42083740234375	-294.850830078125	62.85865020751953	-60037.0390625	1297.63037109375	-324.6875	-1.4786882400512695
4	16	7.21718692779541	11.99471664428711	-1.064562201499939	-1.6891745328903198	181.329345703125	-11.333610534667969	-83206.84375	1332.7989501953125	1328.948974609375	-1.8570483922958374
...	...	...	...	...	...	...	...	...	...	...	...
329,995	21	1.9938701391220093	0.789276123046875	0.22205990552902222	-216.92990112304688	16.124420166015625	-211.244384765625	-146457.4375	457.72247314453125	203.36758422851562	-1.7451677322387695
329,996	25	3.7180912494659424	0.721337616443634	1.6415337324142456	-185.92160034179688	-117.25082397460938	-105.4986572265625	-126627.109375	335.0025634765625	-301.8370056152344	-0.9822322130203247
329,997	14	0.3688507676124573	13.029608726501465	-3.633934736251831	-53.677146911621094	-145.15771484375	76.70909881591797	-84912.2578125	817.1375732421875	645.8507080078125	-1.7645612955093384
329,998	18	-0.11259264498949051	1.4529125690460205	2.168952703475952	179.30865478515625	205.79710388183594	-68.75872802734375	-133498.46875	724.000244140625	-283.6910400390625	-1.8808952569961548
329,999	4	20.796220779418945	-3.331387758255005	12.18841552734375	42.69000244140625	69.20479583740234	29.54275131225586	-65519.328125	1843.07470703125	1581.4151611328125	-1.1231083869934082

Columns#

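The preview above shows that this dataset contains \(> 300,000\) rows, and columns named x, y, z (positions), vx, vy, vz (velocities), E (energy), L (angular momentum), and id (subgroup of samples). When we print out a column, we can see that it is not a Numpy array, but an Expression.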

[2]:

df.x  # df.col.x and df['x'] are equivalent to df.x; df.col.x is friendlier to tab completion, df['x'] to programmatic access

[2]:

Expression = x
Length: 330,000 dtype: float32 (column)
---------------------------------------
     0    1.23187
     1  -0.163701
     2   -2.12026
     3    4.71559
     4    7.21719
       ...
329995    1.99387
329996    3.71809
329997   0.368851
329998  -0.112593
329999    20.7962

One can use the .values attribute to get an in-memory representation of an expression. The same attribute is available on a DataFrame as well.

[3]:

df.x.values

[3]:

array([ 1.2318684 , -0.16370061, -2.120256  , ...,  0.36885077,
       -0.11259264, 20.79622   ], dtype=float32)

Most Numpy functions (ufuncs) can be applied to expressions. They do not produce a result directly, but a new expression instead.

[4]:

import numpy as np
np.sqrt(df.x**2 + df.y**2 + df.z**2)

[4]:

Expression = sqrt((((x ** 2) + (y ** 2)) + (z ** 2)))
Length: 330,000 dtype: float32 (expression)
-------------------------------------------
     0  1.42574
     1  3.66676
     2  4.29824
     3  6.95203
     4   14.039
      ...
329995  2.15587
329996  4.12785
329997  13.5319
329998  2.61304
329999  24.3339

Virtual columns#

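Sometimes it is convenient to store an expression as a column. We call this a virtual column, since it does not take up any memory and is computed on the fly when needed. A virtual column is treated just like a normal column.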

[5]:

df['r'] = np.sqrt(df.x**2 + df.y**2 + df.z**2)
df[['x', 'y', 'z', 'r']]

[5]:

#	x	y	z	r
0	1.2318683862686157	-0.39692866802215576	-0.598057746887207	1.425736665725708
1	-0.16370061039924622	3.654221296310425	-0.25490644574165344	3.666757345199585
2	-2.120255947113037	3.326052665710449	1.7078403234481812	4.298235893249512
3	4.7155890464782715	4.5852508544921875	2.2515437602996826	6.952032566070557
4	7.21718692779541	11.99471664428711	-1.064562201499939	14.03902816772461
...	...	...	...	...
329,995	1.9938701391220093	0.789276123046875	0.22205990552902222	2.155872344970703
329,996	3.7180912494659424	0.721337616443634	1.6415337324142456	4.127851963043213
329,997	0.3688507676124573	13.029608726501465	-3.633934736251831	13.531896591186523
329,998	-0.11259264498949051	1.4529125690460205	2.168952703475952	2.613041877746582
329,999	20.796220779418945	-3.331387758255005	12.18841552734375	24.333894729614258
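
As a quick sanity check of what the virtual column r represents, the same formula can be applied to the first three rows of the table above with plain NumPy. The values are copied from the preview; the use of float32 mirrors the dtype reported for the columns, and the tolerance is an arbitrary choice for this sketch.

```python
import numpy as np

# First three rows of the preview table (x, y, z), stored as float32
x = np.array([1.2318683862686157, -0.16370061039924622, -2.120255947113037], dtype=np.float32)
y = np.array([-0.39692866802215576, 3.654221296310425, 3.326052665710449], dtype=np.float32)
z = np.array([-0.598057746887207, -0.25490644574165344, 1.7078403234481812], dtype=np.float32)

# Same expression as df['r'], evaluated eagerly on three rows
r = np.sqrt(x**2 + y**2 + z**2)

# Values of r shown in the table above
r_from_table = np.array([1.425736665725708, 3.666757345199585, 4.298235893249512])
print(np.allclose(r, r_from_table, atol=1e-4))
```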

Selections and filtering#

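Vaex can be very efficient when exploring subsets of the data, for instance to remove outliers or to inspect only a part of the data. Instead of making copies, Vaex internally keeps track of which rows are selected.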

[6]:

df.select(df.x < 0)
df.evaluate(df.x, selection=True)

[6]:

array([-0.16370061, -2.120256  , -7.7843747 , ..., -8.126636  ,
       -3.9477386 , -0.11259264], dtype=float32)

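Selections are useful when you frequently modify the portion of the data you want to visualize, or when you want to efficiently compute statistics on several portions of the data.

Alternatively, you can also create filtered datasets. This is similar to using Pandas, except that Vaex does not copy the data.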

[7]:

df_negative = df[df.x < 0]
df_negative[['x', 'y', 'z', 'r']]

[7]:

#	x	y	z	r
0	-0.16370061039924622	3.654221296310425	-0.25490644574165344	3.666757345199585
1	-2.120255947113037	3.326052665710449	1.7078403234481812	4.298235893249512
2	-7.784374713897705	5.989774703979492	-0.682695209980011	9.845809936523438
3	-3.5571861267089844	5.413629055023193	0.09171556681394577	6.478376865386963
4	-20.813940048217773	-3.294677495956421	13.486607551574707	25.019264221191406
...	...	...	...	...
166,274	-2.5926425457000732	-2.871671676635742	-0.18048334121704102	3.8730955123901367
166,275	-0.7566012144088745	2.9830434322357178	-6.940553188323975	7.592250823974609
166,276	-8.126635551452637	1.1619765758514404	-1.6459038257598877	8.372657775878906
166,277	-3.9477386474609375	-3.0684902667999268	-1.5822702646255493	5.244411468505859
166,278	-0.11259264498949051	1.4529125690460205	2.168952703475952	2.613041877746582
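
Conceptually, a selection behaves like a boolean mask over the rows. The NumPy sketch below is an analogy under that assumption, not a description of Vaex internals; the array values are made up. It contrasts computing a statistic through a mask with materializing the selected rows, which in NumPy creates a copy.

```python
import numpy as np

x = np.array([1.2, -0.2, -2.1, 4.7, 7.2], dtype=np.float32)  # made-up values

mask = x < 0                           # the "selection": one boolean per row

# Statistic computed through the mask, without building a filtered array
mean_via_mask = np.mean(x, where=mask)

# Boolean indexing materializes the selected rows as a new array (a copy)
filtered = x[mask]
is_copy = not np.shares_memory(x, filtered)

print(mean_via_mask, filtered.mean(), is_copy)
```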

Statistics on N-dimensional grids#

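A core feature of Vaex is the extremely efficient calculation of statistics on N-dimensional grids. This is rather useful for visualizing large datasets.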

[8]:

df.count(), df.mean(df.x), df.mean(df.x, selection=True)

[8]:

(array(330000), array(-0.0632868), array(-5.18457762))

Similar to SQL's groupby, Vaex uses the binby concept, which tells Vaex that a statistic should be calculated on a regular grid (for performance reasons).

[9]:

counts_x = df.count(binby=df.x, limits=[-10, 10], shape=64)
counts_x

[9]:

array([1374, 1350, 1459, 1618, 1706, 1762, 1852, 2007, 2240, 2340, 2610,
       2840, 3126, 3337, 3570, 3812, 4216, 4434, 4730, 4975, 5332, 5800,
       6162, 6540, 6805, 7261, 7478, 7642, 7839, 8336, 8736, 8279, 8269,
       8824, 8217, 7978, 7541, 7383, 7116, 6836, 6447, 6220, 5864, 5408,
       4881, 4681, 4337, 4015, 3799, 3531, 3320, 3040, 2866, 2629, 2488,
       2244, 1981, 1905, 1734, 1540, 1437, 1378, 1233, 1186])

This results in a Numpy array with the counts in 64 bins distributed between x = -10 and x = 10. We can quickly visualize this using Matplotlib.

[10]:

import matplotlib.pyplot as plt
plt.plot(np.linspace(-10, 10, 64), counts_x)
plt.show()
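
For readers who want to see what a regular-grid count amounts to, here is a minimal NumPy sketch that uses np.histogram as a stand-in for binby. The random data is made up, and edge handling (for example whether the upper limit is inclusive) may differ from Vaex. The sketch also computes the bin centers, which place each count at the middle of its bin.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 5.0, 10_000)       # made-up data, not the example dataset

limits = (-10, 10)
shape = 64

# Counts on a regular grid of `shape` equal-width bins between the limits
counts, edges = np.histogram(x, bins=shape, range=limits)

# Centers of the bins: midpoints between consecutive edges
centers = 0.5 * (edges[:-1] + edges[1:])

print(counts.shape, centers[0], centers[-1])
```

Passing `centers` as the x values to `plt.plot` aligns each count with the middle of its bin.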

We can do the same in 2-D as well (this generalizes to N-D, in fact!), and display it with Matplotlib.

[11]:

xycounts = df.count(binby=[df.x, df.y], limits=[[-10, 10], [-10, 20]], shape=(64, 128))
xycounts

[11]:

array([[ 5,  2,  3, ...,  3,  3,  0],
       [ 8,  4,  2, ...,  5,  3,  2],
       [ 5, 11,  7, ...,  3,  3,  1],
       ...,
       [ 4,  8,  5, ...,  2,  0,  2],
       [10,  6,  7, ...,  1,  1,  2],
       [ 6,  7,  9, ...,  2,  2,  2]])

[12]:

plt.imshow(xycounts.T, origin='lower', extent=[-10, 10, -10, 20])
plt.show()

[13]:

v = np.sqrt(df.vx**2 + df.vy**2 + df.vz**2)
xy_mean_v = df.mean(v, binby=[df.x, df.y], limits=[[-10, 10], [-10, 20]], shape=(64, 128))
xy_mean_v

[13]:

array([[156.15283203, 226.0004425 , 206.95940653, ...,  90.0340627 ,
        152.08784485,          nan],
       [203.81366634, 133.01436043, 146.95962524, ..., 137.54756927,
         98.68717448, 141.06020737],
       [150.59178772, 188.38820371, 137.46753802, ..., 155.96900177,
        148.91660563, 138.48191833],
       ...,
       [168.93819809, 187.75943136, 137.318647  , ..., 144.83927917,
                 nan, 107.7273407 ],
       [154.80492783, 140.55182203, 180.30700166, ..., 184.01670837,
         95.10913086, 131.18122864],
       [166.06868235, 150.54079764, 125.84606828, ..., 130.56007385,
        121.04217911, 113.34659195]])

[14]:

plt.imshow(xy_mean_v.T, origin='lower', extent=[-10, 10, -10, 20])
plt.show()
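
The nan entries in the output above appear in bins that contain no samples. A rough NumPy sketch of a binned mean shows why: the mean is the per-bin sum divided by the per-bin count, and an empty bin gives 0/0. The data here is random and made up, and np.histogram2d is used only as a stand-in for binby.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000
x = rng.uniform(-10, 10, n)            # made-up positions
y = rng.uniform(-10, 20, n)
v = rng.uniform(0, 300, n)             # made-up speeds

limits = [[-10, 10], [-10, 20]]
shape = (64, 128)

# Number of samples and sum of v in every (x, y) bin
counts, _, _ = np.histogram2d(x, y, bins=shape, range=limits)
sums, _, _ = np.histogram2d(x, y, bins=shape, range=limits, weights=v)

# Mean per bin; empty bins give 0/0, which becomes nan
with np.errstate(invalid="ignore", divide="ignore"):
    mean_v = sums / counts

print(mean_v.shape, int(np.isnan(mean_v).sum()), "empty bins")
```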

Other statistics can be computed as well; see the API docs for the full list.

Getting your data in#

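Before continuing with this tutorial, you may want to read in your own data. Ultimately, a Vaex DataFrame just wraps a set of Numpy arrays. If you can access your data as a set of Numpy arrays, you can easily construct a DataFrame using from_arrays.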

[15]:

import vaex
import numpy as np
x = np.arange(5)
y = x**2
df = vaex.from_arrays(x=x, y=y)
df

[15]:

#	x	y
0	0	0
1	1	1
2	2	4
3	3	9
4	4	16

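Other quick ways to get your data in are:

from_arrow_table: Arrow table support
from_csv: comma-separated files
from_ascii: space/tab-separated files
from_pandas: converts a pandas DataFrame
from_astropy_table: converts an astropy table

Exporting a DataFrame, or converting it to a different data structure, is also quite easy.

Nowadays, it is common to put data, especially larger datasets, on the cloud. Vaex can read data straight from S3 in a lazy manner, meaning that only the data that is needed will be downloaded, and it is cached on disk.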

[16]:

# Read in the NYC Taxi dataset straight from S3
nyctaxi = vaex.open('s3://vaex/taxi/nyc_taxi_2015_mini.hdf5?anon=true')
nyctaxi.head(5)

[16]:

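#	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	payment_type	trip_distance	pickup_longitude	pickup_latitude	rate_code	store_and_fwd_flag	dropoff_longitude	dropoff_latitude	fare_amount	surcharge	mta_tax	tip_amount	total_amount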
0	VTS	2009-01-04 02:52:00.000000000	2009-01-04 03:02:00.000000000	1	CASH	2.63	-73.992	40.7216	nan	nan	-73.9938	40.6959	8.9	0.5	nan	0	9.4
1	VTS	2009-01-04 03:31:00.000000000	2009-01-04 03:38:00.000000000	3	Credit	4.55	-73.9821	40.7363	nan	nan	-73.9558	40.768	12.1	0.5	nan	2	14.6
2	VTS	2009-01-03 15:43:00.000000000	2009-01-03 15:57:00.000000000	5	Credit	10.35	-74.0026	40.7397	nan	nan	-73.87	40.7702	23.7	0	nan	4.74	28.44
3	DDS	2009-01-01 20:52:58.000000000	2009-01-01 21:14:00.000000000	1	CREDIT	5	-73.9743	40.791	nan	nan	-73.9966	40.7318	14.9	0.5	nan	3.05	18.45
4	DDS	2009-01-24 16:18:23.000000000	2009-01-24 16:24:56.000000000	1	CASH	0.4	-74.0016	40.7194	nan	nan	-74.0084	40.7203	3.7	0	nan	0	3.7

Plotting#

1-D and 2-D#

Most visualizations are done in 1 or 2 dimensions, and Vaex nicely wraps Matplotlib to satisfy a variety of frequent use cases.

[17]:

import vaex
import numpy as np
df = vaex.example()

The simplest visualization is a 1-D plot using DataFrame.viz.histogram. Here, we only show 99.7% of the data.

[1]:

df.viz.histogram(df.x, limits='99.7%')

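A slightly more complicated visualization is to plot not the counts, but a different statistic for each bin. In most cases, passing the what='<statistic>(<expression>)' argument will do, where <statistic> is any of the statistics mentioned above or in the API docs.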

[19]:

df.viz.histogram(df.x, what='mean(E)', limits='99.7%');

An equivalent method is to use the vaex.stat.<statistic> functions, e.g. vaex.stat.mean.

[20]:

df.viz.histogram(df.x, what=vaex.stat.mean(df.E), limits='99.7%');

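The vaex.stat.<statistic> objects are very similar to Vaex expressions, in that they represent an underlying calculation. Typical arithmetic and Numpy functions can be applied to these calculations. However, these objects compute a single statistic, and do not return a column or expression.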

[21]:

np.log(vaex.stat.mean(df.x)/vaex.stat.std(df.x))

[21]:

log((mean(x) / std(x)))

These statistical objects can be passed to the what argument. The advantage is that the data only needs to be passed over once.

[22]:

df.viz.histogram(df.x, what=np.clip(np.log(-vaex.stat.mean(df.E)), 11, 11.4), limits='99.7%');

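A similar result can be obtained by calculating the statistic ourselves and passing it to the grid argument of df.viz.histogram. Care has to be taken that the limits used for calculating the statistic and for the plot are the same; otherwise the x axis may not correspond to the real data.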

[3]:

limits = [-30, 30]
shape  = 64
meanE  = df.mean(df.E, binby=df.x, limits=limits, shape=shape)
grid   = np.clip(np.log(-meanE), 11, 11.4)
df.viz.histogram(df.x, grid=grid, limits=limits, ylabel='clipped E');

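Instead of plotting the density across one dimension (a histogram), we can also plot the density across two dimensions. This is done with the DataFrame.viz.heatmap function. It shares many arguments with the histogram and is used in a very similar way.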

[24]:

df.viz.heatmap(df.x, df.y, what=vaex.stat.mean(df.E)**2, limits='99.7%');

Selections for plotting#

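While filtering is useful for narrowing down the contents of a DataFrame (e.g. df_negative = df[df.x < 0]), there are a few downsides to this. First, a practical issue is that when you filter in 4 different ways, you will need 4 different DataFrames polluting your namespace. More importantly, when Vaex executes a series of statistical computations, it will do so per DataFrame, meaning that 4 passes over the data will be made, even though all 4 DataFrames point to the same underlying data.

If instead we have 4 (named) selections in our DataFrame, we can calculate the statistics in a single pass over the data, which can be significantly faster, especially when the dataset is larger than memory.

In the plot below, we show three selections, which by default are blended together, requiring just one pass over the data.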

[25]:

df.viz.heatmap(df.x, df.y, what=np.log(vaex.stat.count()+1), limits='99.7%',
        selection=[None, df.x < df.y, df.x < -10]);

Advanced plotting#

Let us say we would like to see two plots next to each other. To achieve this, we can pass a list of expression pairs.

[26]:

df.viz.heatmap([["x", "y"], ["x", "z"]], limits='99.7%',
        title="Face on and edge on", figsize=(10,4));

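By default, if you have multiple plots, they are shown as columns, multiple selections are overplotted, and multiple 'whats' (statistics) are shown as rows.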

[27]:

df.viz.heatmap([["x", "y"], ["x", "z"]],
        limits='99.7%',
        what=[np.log(vaex.stat.count()+1), vaex.stat.mean(df.E)],
        selection=[None, df.x < df.y],
        title="Face on and edge on", figsize=(10,10));

Note that the selection has no effect in the bottom row.

However, this behaviour can be changed using the visual argument.

[28]:

df.viz.heatmap([["x", "y"], ["x", "z"]],
        limits='99.7%',
        what=vaex.stat.mean(df.E),
        selection=[None, df.Lz < 0],
        visual=dict(column='selection'),
        title="Face on and edge on", figsize=(10,10));

Slices in a 3rd dimension#

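If a third axis (z) is given, you can 'slice' through the data, displaying the z slices as rows. Note that here the rows are wrapped, which can be changed using the wrap_columns argument.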

[29]:

df.viz.heatmap("Lz", "E",
        limits='99.7%',
        z="FeH:-2.5,-1,8", show=True, visual=dict(row="z"),
        figsize=(12,8), f="log", wrap_columns=3);

Visualization of smaller datasets#

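Although Vaex focuses on large datasets, sometimes you end up with a fraction of the data (e.g. due to a selection) and you want to make a scatter plot. You can do so with the following approach: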

[30]:

import vaex
df = vaex.example()

[31]:

import matplotlib.pyplot as plt
x = df.evaluate("x", selection=df.Lz < -2500)
y = df.evaluate("y", selection=df.Lz < -2500)
plt.scatter(x, y, c="red", alpha=0.5, s=4);

Using DataFrame.viz.scatter:

[32]:

df.viz.scatter(df.x, df.y, selection=df.Lz < -2500, c="red", alpha=0.5, s=4)
df.viz.scatter(df.x, df.y, selection=df.Lz > 1500, c="green", alpha=0.5, s=4);

In control#

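While Vaex provides a wrapper for Matplotlib, there are situations where you want to use the DataFrame.viz methods but still be in control of the plot. Vaex simply uses the current figure and axes objects, so this is easy to do.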

[33]:

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14,7))
plt.sca(ax1)
selection = df.Lz < -2500
x = df[selection].x.evaluate()
y = df[selection].y.evaluate()
df.viz.heatmap(df.x, df.y)
plt.scatter(x, y)
plt.xlabel(r'my own label $\gamma$')  # raw string so that \g is not treated as an escape sequence
plt.xlim(-20, 20)
plt.ylim(-20, 20)

plt.sca(ax2)
df.viz.histogram(df.x, label='counts', n=True)
x = np.linspace(-30, 30, 100)
std = df.std(df.x.expression)
y = np.exp(-(x**2/std**2/2)) / np.sqrt(2*np.pi) / std
plt.plot(x, y, label='gaussian fit')
plt.legend()
plt.show()

Healpix (plotting)#

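Healpix plotting is supported via the healpy package. Vaex does not need special support for healpix, only for plotting, but some helper functions are introduced to make working with healpix easier.

In the following example we will use the TGAS astronomy dataset.

To understand healpix better, we will start from the beginning. If we want to make a density sky plot, we would like to pass healpy a 1-D Numpy array where each value represents the density at a location on the sphere, where the location is determined by the array size (the healpix level) and the offset (the location). The TGAS (and Gaia) data includes the healpix index encoded in the source_id. Dividing the source_id by 34359738368 gives a healpix index at level 12, and dividing it further takes you to lower levels.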
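
The division described above can be sketched with plain integer arithmetic. The helper names `healpix_index` and `number_of_pixels` are invented for this illustration; the divisor is the one quoted in the text, the factor of 4 per level matches the factor used in the cell further down, and the two source_id values are taken from the first and last rows of the TGAS table shown later in this tutorial.

```python
GAIA_LEVEL12_DIVISOR = 34359738368  # the divisor quoted in the text; equals 2**35


def healpix_index(source_id, level):
    """Healpix index at `level` (0..12) decoded from a Gaia-style source_id."""
    # Each step down from level 12 merges 4 pixels into one.
    factor = GAIA_LEVEL12_DIVISOR * 4 ** (12 - level)
    return source_id // factor


def number_of_pixels(level):
    """Pixels covering the sphere: 12 base pixels, each split in 4 per level."""
    return 12 * 4 ** level


first_id = 7627862074752            # source_id of the first TGAS row
last_id = 6917504975824469248       # source_id of the last TGAS row shown

index_level12 = healpix_index(first_id, 12)
index_level2 = healpix_index(last_id, 2)
print(index_level12, index_level2, number_of_pixels(2))
```

At level 2 the sphere has 192 pixels, which matches the length of the counts array computed below.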

[34]:

import vaex
import healpy as hp
tgas = vaex.datasets.tgas(full=True)

We will start by showing how you could manually do statistics on healpix bins using DataFrame.count. We will use a really coarse healpix scheme (level 2).

[35]:

level = 2
factor = 34359738368 * (4**(12-level))
nmax = hp.nside2npix(2**level)
epsilon = 1e-16
counts = tgas.count(binby=tgas.source_id/factor, limits=[-epsilon, nmax-epsilon], shape=nmax)
counts

[35]:

array([ 4021,  6171,  5318,  7114,  5755, 13420, 12711, 10193,  7782,
       14187, 12578, 22038, 17313, 13064, 17298, 11887,  3859,  3488,
        9036,  5533,  4007,  3899,  4884,  5664, 10741,  7678, 12092,
       10182,  6652,  6793, 10117,  9614,  3727,  5849,  4028,  5505,
        8462, 10059,  6581,  8282,  4757,  5116,  4578,  5452,  6023,
        8340,  6440,  8623,  7308,  6197, 21271, 23176, 12975, 17138,
       26783, 30575, 31931, 29697, 17986, 16987, 19802, 15632, 14273,
       10594,  4807,  4551,  4028,  4357,  4067,  4206,  3505,  4137,
        3311,  3582,  3586,  4218,  4529,  4360,  6767,  7579, 14462,
       24291, 10638, 11250, 29619,  9678, 23322, 18205,  7625,  9891,
        5423,  5808, 14438, 17251,  7833, 15226,  7123,  3708,  6135,
        4110,  3587,  3222,  3074,  3941,  3846,  3402,  3564,  3425,
        4125,  4026,  3689,  4084, 16617, 13577,  6911,  4837, 13553,
       10074,  9534, 20824,  4976,  6707,  5396,  8366, 13494, 19766,
       11012, 16130,  8521,  8245,  6871,  5977,  8789, 10016,  6517,
        8019,  6122,  5465,  5414,  4934,  5788,  6139,  4310,  4144,
       11437, 30731, 13741, 27285, 40227, 16320, 23039, 10812, 14686,
       27690, 15155, 32701, 18780,  5895, 23348,  6081, 17050, 28498,
       35232, 26223, 22341, 15867, 17688,  8580, 24895, 13027, 11223,
        7880,  8386,  6988,  5815,  4717,  9088,  8283, 12059,  9161,
        6952,  4914,  6652,  4666, 12014, 10703, 16518, 10270,  6724,
        4553,  9282,  4981])

Using healpy's mollview, we can visualize this.

[36]:

hp.mollview(counts, nest=True)

To simplify life, Vaex includes DataFrame.healpix_count to take care of this.

[37]:

counts = tgas.healpix_count(healpix_level=6)
hp.mollview(counts, nest=True)

Or even simpler, use DataFrame.viz.healpix_heatmap:

[38]:

tgas.viz.healpix_heatmap(
    f="log1p",
    healpix_level=6,
    figsize=(10,8),
    healpix_output="ecliptic"
)

xarray support#

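The df.count method can also return an xarray DataArray instead of a Numpy array. This is easily done via the array_type keyword. Building on top of Numpy, xarray adds dimension labels, coordinates and attributes, which makes working with multi-dimensional arrays more convenient.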

[39]:

xarr = df.count(binby=[df.x, df.y], limits=[-10, 10], shape=64, array_type='xarray')
xarr

[39]:

xarray.DataArray

x: 64
y: 64

6 3 7 9 10 13 6 13 17 7 12 15 7 14 ... 11 8 10 6 7 5 17 9 10 10 6 5 7

array([[ 6,  3,  7, ..., 15, 10, 11],
       [10,  3,  7, ..., 10, 13, 11],
       [ 5, 15,  5, ..., 12, 18, 12],
       ...,
       [ 7,  8, 10, ...,  6,  7,  7],
       [12, 10, 17, ..., 11,  8,  2],
       [ 7, 10, 13, ...,  6,  5,  7]])

Coordinates: (2)

(x)

float64

-9.844 -9.531 ... 9.531 9.844

array([-9.84375, -9.53125, -9.21875, -8.90625, -8.59375, -8.28125, -7.96875,
       -7.65625, -7.34375, -7.03125, -6.71875, -6.40625, -6.09375, -5.78125,
       -5.46875, -5.15625, -4.84375, -4.53125, -4.21875, -3.90625, -3.59375,
       -3.28125, -2.96875, -2.65625, -2.34375, -2.03125, -1.71875, -1.40625,
       -1.09375, -0.78125, -0.46875, -0.15625,  0.15625,  0.46875,  0.78125,
        1.09375,  1.40625,  1.71875,  2.03125,  2.34375,  2.65625,  2.96875,
        3.28125,  3.59375,  3.90625,  4.21875,  4.53125,  4.84375,  5.15625,
        5.46875,  5.78125,  6.09375,  6.40625,  6.71875,  7.03125,  7.34375,
        7.65625,  7.96875,  8.28125,  8.59375,  8.90625,  9.21875,  9.53125,
        9.84375])

(y)

float64

-9.844 -9.531 ... 9.531 9.844

array([-9.84375, -9.53125, -9.21875, -8.90625, -8.59375, -8.28125, -7.96875,
       -7.65625, -7.34375, -7.03125, -6.71875, -6.40625, -6.09375, -5.78125,
       -5.46875, -5.15625, -4.84375, -4.53125, -4.21875, -3.90625, -3.59375,
       -3.28125, -2.96875, -2.65625, -2.34375, -2.03125, -1.71875, -1.40625,
       -1.09375, -0.78125, -0.46875, -0.15625,  0.15625,  0.46875,  0.78125,
        1.09375,  1.40625,  1.71875,  2.03125,  2.34375,  2.65625,  2.96875,
        3.28125,  3.59375,  3.90625,  4.21875,  4.53125,  4.84375,  5.15625,
        5.46875,  5.78125,  6.09375,  6.40625,  6.71875,  7.03125,  7.34375,
        7.65625,  7.96875,  8.28125,  8.59375,  8.90625,  9.21875,  9.53125,
        9.84375])

Attributes: (0)

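In addition, xarray also has a plotting method that can be quite convenient. Since the xarray object has information about the labels of each dimension, the plot axes will be labeled automatically.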

[40]:

xarr.plot();

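Having xarray as output helps us explore the contents of our data faster. In the following example, we show how easy it is to plot the 2-D distribution of the positions of the samples (x, y), per id group.

Notice how xarray automatically adds the appropriate titles and axis labels to the figure.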

[41]:

df.categorize('id', inplace=True)  # treat the id as a categorical column - automatically adjusts limits and shape
xarr = df.count(binby=['x', 'y', 'id'], limits='95%', array_type='xarray')
np.log1p(xarr).plot(col='id', col_wrap=7);

Interactive widgets#

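Note: the interactive widgets require a running Python kernel. If you are viewing this documentation online, you can get a feeling for what the widgets can do, but computation will not be possible!

Using the vaex-jupyter package, we get access to interactive widgets (see the Vaex Jupyter tutorial for a more in-depth treatment).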

[42]:

import vaex
import vaex.jupyter
import numpy as np
import matplotlib.pyplot as plt
df = vaex.example()

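The simplest way to get a more interactive visualization (or even print out statistics) is to use the vaex.jupyter.interactive_selection decorator, which will execute the decorated function each time the selection is changed.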

[43]:

df.select(df.x > 0)
@vaex.jupyter.interactive_selection(df)
def plot(*args, **kwargs):
    print("Mean x for the selection is:", df.mean(df.x, selection=True))
    df.viz.heatmap(df.x, df.y, what=np.log(vaex.stat.count()+1), selection=[None, True], limits='99.7%')
    plt.show()

After changing the selection programmatically, the visualization will update, as well as the print output.

[44]:

df.select(df.x > df.y)

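However, to get truly interactive visualization, we need to use widgets, such as those from the bqplot library. Again, if we make a selection here, the above visualization will also update, so let us select a square region.

See more interactive widgets in the Vaex Jupyter tutorial.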

Joining#

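Joining in Vaex is similar to Pandas, except that the data will not be copied. Internally, an index array is kept for each row of the left DataFrame, pointing to the right DataFrame, requiring about 8 GB for a billion-row \(10^9\) dataset. Let us start with 2 small DataFrames, df1 and df2: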

[47]:

a = np.array(['a', 'b', 'c'])
x = np.arange(1,4)
df1 = vaex.from_arrays(a=a, x=x)
df1

[47]:

#	a	x
0	a	1
1	b	2
2	c	3

[48]:

b = np.array(['a', 'b', 'd'])
y = x**2
df2 = vaex.from_arrays(b=b, y=y)
df2

[48]:

#	b	y
0	a	1
1	b	4
2	d	9

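The default join is a 'left' join, where all rows of the left DataFrame (df1) are kept, and matching rows of the right DataFrame (df2) are added. We see that for the columns b and y, some values are missing, as expected.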

[49]:

df1.join(df2, left_on='a', right_on='b')

[49]:

#	a	x	b	y
0	a	1	a	1
1	b	2	b	4
2	c	3	--	--
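
The index-array idea mentioned at the start of this section can be sketched with NumPy, using the same keys as df1 and df2. This is an illustration only: the use of int64 indices and of -1 as the "no match" marker are assumptions made for the sketch, not documented Vaex internals.

```python
import numpy as np

left_keys = np.array(['a', 'b', 'c'])      # column a of df1
right_keys = np.array(['a', 'b', 'd'])     # column b of df2
right_y = np.array([1, 4, 9])              # column y of df2

# Position of every right-hand key
lookup = {key: position for position, key in enumerate(right_keys.tolist())}

# One index per left row, pointing into the right table (-1 means no match)
index = np.array([lookup.get(key, -1) for key in left_keys.tolist()], dtype=np.int64)

# The joined column is a lookup through the index, masked where nothing matched
matched = index >= 0
y_joined = np.ma.masked_array(right_y[np.where(matched, index, 0)], mask=~matched)

# Memory for the index alone on a billion-row left table, in bytes
index_bytes_for_1e9_rows = index.itemsize * 10**9

print(index.tolist(), y_joined.tolist(), index_bytes_for_1e9_rows)
```

With 8 bytes per index, a billion rows need about 8 GB for the index, in line with the figure quoted above, while the columns of the right table are never duplicated.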

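A 'right' join is basically the same, but now the roles of the left and right DataFrames are swapped, so now some values of the columns a and x are missing.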

[50]:

df1.join(df2, left_on='a', right_on='b', how='right')

[50]:

#	b	y	a	x
0	a	1	a	1
1	b	4	b	2
2	d	9	--	--

We can also do an 'inner' join, in which the output DataFrame contains only the rows that are common to df1 and df2.

[51]:

df1.join(df2, left_on='a', right_on='b', how='inner')

[51]:

#	a	x	b	y
0	a	1	a	1
1	b	2	b	4

Other joins (e.g. outer) are currently not supported. Feel free to open an issue on GitHub for this.

Group-by#

With Vaex one can also do fast group-by aggregations. The output is a Vaex DataFrame. Let us see a few examples.

[52]:

import vaex
animal = ['dog', 'dog', 'cat', 'guinea pig', 'guinea pig', 'dog']
age = [2, 1, 5, 1, 3, 7]
cuteness = [9, 10, 5, 8, 4, 8]
df_pets = vaex.from_arrays(animal=animal, age=age, cuteness=cuteness)
df_pets

[52]:

#	animal	age	cuteness
0	dog	2	9
1	dog	1	10
2	cat	5	5
3	guinea pig	1	8
4	guinea pig	3	4
5	dog	7	8

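The syntax for doing group-by operations is virtually identical to that of Pandas. Note that when multiple aggregations are passed to a single column or expression, the output columns are appropriately named.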

[53]:

df_pets.groupby(by='animal').agg({'age': 'mean',
                                  'cuteness': ['mean', 'std']})

[53]:

#	animal	age	cuteness_mean	cuteness_std
0	dog	3.33333	9	0.816497
1	cat	5	5	0
2	guinea pig	2	6	2
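
The numbers in this table can be reproduced with a small NumPy sketch over the same six pets. One detail worth noting: the standard deviations shown (for example 0.816497 for dog) match the population formula, which is NumPy's default (ddof=0), rather than the sample formula (ddof=1). This is an observation about the output above, not a statement about every Vaex version.

```python
import numpy as np

animal = np.array(['dog', 'dog', 'cat', 'guinea pig', 'guinea pig', 'dog'])
age = np.array([2, 1, 5, 1, 3, 7])
cuteness = np.array([9, 10, 5, 8, 4, 8])

result = {}
for name in ['dog', 'cat', 'guinea pig']:
    in_group = animal == name                    # rows belonging to this group
    result[name] = (
        age[in_group].mean(),                    # mean age
        cuteness[in_group].mean(),               # mean cuteness
        cuteness[in_group].std(),                # population std (ddof=0)
    )

for name, values in result.items():
    print(name, values)
```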

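Vaex supports a number of aggregation functions:

vaex.agg.count: number of elements in a group
vaex.agg.first: the first element in a group
vaex.agg.max: the largest value in a group
vaex.agg.min: the smallest value in a group
vaex.agg.sum: the sum of a group
vaex.agg.mean: the mean value of a group
vaex.agg.std: the standard deviation of a group
vaex.agg.var: the variance of a group
vaex.agg.nunique: number of unique elements in a group

In addition, we can specify the aggregation operations inside the groupby method. We can also name the resulting aggregate columns as we wish.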

[54]:

df_pets.groupby(by='animal',
                agg={'mean_age': vaex.agg.mean('age'),
                     'cuteness_unique_values': vaex.agg.nunique('cuteness'),
                     'cuteness_unique_min': vaex.agg.min('cuteness')})

[54]:

#	animal	mean_age	cuteness_unique_values	cuteness_unique_min
0	dog	3.33333	3	8
1	cat	5	1	5
2	guinea pig	2	2	4

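A powerful feature of the aggregation functions in Vaex is that they support selections. This gives us the flexibility to make selections while aggregating. For example, let us calculate the mean cuteness of the pets in this example DataFrame, but separated by age.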

[55]:

df_pets.groupby(by='animal',
                agg={'mean_cuteness_old': vaex.agg.mean('cuteness', selection='age>=5'),
                     'mean_cuteness_young': vaex.agg.mean('cuteness', selection='~(age>=5)')})

[55]:

#	animal	mean_cuteness_old	mean_cuteness_young
0	dog	8	9.5
1	cat	5	nan
2	guinea pig	nan	6

Note that in the last example, the grouped DataFrame contains NaNs for the groups in which there are no samples.
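
The same split can be written out in plain Python to show where the NaNs come from: a group with no rows in a selection has nothing to average. The helper `mean_or_nan` is a name invented for this sketch.

```python
import math

animal = ['dog', 'dog', 'cat', 'guinea pig', 'guinea pig', 'dog']
age = [2, 1, 5, 1, 3, 7]
cuteness = [9, 10, 5, 8, 4, 8]


def mean_or_nan(values):
    """Mean of a list, or NaN when the list is empty."""
    return sum(values) / len(values) if values else math.nan


table = {}
for name in dict.fromkeys(animal):             # unique names, in order of appearance
    rows = [(a, c) for n, a, c in zip(animal, age, cuteness) if n == name]
    old = [c for a, c in rows if a >= 5]        # selection 'age>=5'
    young = [c for a, c in rows if not a >= 5]  # selection '~(age>=5)'
    table[name] = (mean_or_nan(old), mean_or_nan(young))

for name, values in table.items():
    print(name, values)
```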

String processing#

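String processing is similar to Pandas, except that all operations are performed lazily, multi-threaded, and faster (in C++). Check the API docs for more examples.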

[56]:

import vaex
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
df = vaex.from_arrays(text=text)
df

[56]:

#	text
0	Something
1	very pretty
2	is coming
3	our
4	way.

[57]:

df.text.str.upper()

[57]:

Expression = str_upper(text)
Length: 5 dtype: str (expression)
---------------------------------
0    SOMETHING
1  VERY PRETTY
2    IS COMING
3          OUR
4         WAY.

[58]:

df.text.str.title().str.replace('et', 'ET')

[58]:

Expression = str_replace(str_title(text), 'et', 'ET')
Length: 5 dtype: str (expression)
---------------------------------
0    SomEThing
1  Very PrETty
2    Is Coming
3          Our
4         Way.

[59]:

df.text.str.contains('e')

[59]:

Expression = str_contains(text, 'e')
Length: 5 dtype: bool (expression)
----------------------------------
0   True
1   True
2  False
3  False
4  False

[60]:

df.text.str.count('e')

[60]:

Expression = str_count(text, 'e')
Length: 5 dtype: int64 (expression)
-----------------------------------
0  1
1  2
2  0
3  0
4  0
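
For comparison, the four operations above correspond to the following built-in Python string methods applied row by row. This eager, single-threaded version is only meant to make the semantics concrete; it says nothing about how Vaex executes them.

```python
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

upper = [s.upper() for s in text]                        # like str.upper()
titled = [s.title().replace('et', 'ET') for s in text]   # like str.title().str.replace(...)
contains_e = ['e' in s for s in text]                    # like str.contains('e')
count_e = [s.count('e') for s in text]                   # like str.count('e')

print(upper)
print(titled)
print(contains_e)
print(count_e)
```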

Propagation of uncertainties#

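In science one often deals with measurement uncertainties (sometimes referred to as measurement errors). When transformations are made with quantities that have uncertainties associated with them, Vaex can automatically calculate the uncertainties of the transformed quantities. Note that propagation of uncertainties requires derivatives and matrix multiplications of lengthy equations, which is not complex, but tedious. Vaex can automatically calculate all dependencies and derivatives, and compute the full covariance matrix.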
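
For reference, the textbook first-order (linear) propagation rule is given below. It is assumed here that this is the rule the paragraph above refers to. The symbols are introduced only for this note: \(x\) are the measured quantities with covariance matrix \(\Sigma_x\), \(y = f(x)\) are the transformed quantities, and \(J\) is the Jacobian of \(f\).

```latex
\[
\Sigma_y \approx J \, \Sigma_x \, J^{\mathsf{T}},
\qquad
J_{ij} = \frac{\partial y_i}{\partial x_j}
\]
```

For a single quantity \(y = f(x)\) this reduces to \(\sigma_y \approx \left| \mathrm{d}f/\mathrm{d}x \right| \sigma_x\).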

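As an example, let us use the TGAS astronomy dataset once again. Even though the TGAS dataset already contains galactic sky coordinates (l and b), let us add them again by performing a coordinate system rotation from RA and Dec. We can apply a similar transformation and convert from spherical galactic to Cartesian coordinates.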
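
As a rough illustration of the first step in the cell below (parallax to distance), the sketch here uses the parallax and parallax_error values of the first row of the resulting table. It assumes distance = 1 / parallax, which is consistent with the distance value shown in that row, and applies standard first-order propagation by hand. It does not call any Vaex uncertainty API.

```python
parallax = 6.35295075173405          # first TGAS row, in milliarcseconds
parallax_error = 0.3079103606852086  # its quoted uncertainty

# Distance in kpc when the parallax is given in milliarcseconds
distance = 1.0 / parallax

# First-order propagation: sigma_d = |d(distance)/d(parallax)| * sigma_parallax
derivative = -1.0 / parallax**2
distance_error = abs(derivative) * parallax_error

print(distance, distance_error)
```

The computed distance agrees with the distance column of the first row in the output below.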

[61]:

# convert parallax to distance
tgas.add_virtual_columns_distance_from_parallax(tgas.parallax)
# 'overwrite' the real columns 'l' and 'b' with virtual columns
tgas.add_virtual_columns_eq2gal('ra', 'dec', 'l', 'b')
# and combined with the galactic sky coordinates gives galactic cartesian coordinates of the stars
tgas.add_virtual_columns_spherical_to_cartesian(tgas.l, tgas.b, tgas.distance, 'x', 'y', 'z')

[61]:

#	astrometric_delta_q	astrometric_excess_noise	astrometric_excess_noise_sig	astrometric_n_bad_obs_ac	astrometric_n_bad_obs_al	astrometric_n_good_obs_ac	astrometric_n_good_obs_al	astrometric_n_obs_ac	astrometric_n_obs_al	astrometric_primary_flag	astrometric_priors_used	astrometric_relegation_factor	astrometric_weight_ac	astrometric_weight_al	b	dec	dec_error	dec_parallax_corr	dec_pmdec_corr	dec_pmra_corr	duplicated_source	ecl_lat	ecl_lon	hip	l	matched_observations	parallax	parallax_error	parallax_pmdec_corr	parallax_pmra_corr	phot_g_mean_flux	phot_g_mean_flux_error	phot_g_mean_mag	phot_g_n_obs	phot_variable_flag	pmdec	pmdec_error	pmra	pmra_error	pmra_pmdec_corr	ra	ra_dec_corr	ra_error	ra_parallax_corr	ra_pmdec_corr	ra_pmra_corr	random_index	ref_epoch	scan_direction_mean_k1	scan_direction_mean_k2	scan_direction_mean_k3	scan_direction_mean_k4	scan_direction_strength_k1	scan_direction_strength_k2	scan_direction_strength_k3	scan_direction_strength_k4	solution_id	source_id	tycho2_id	distance	x	y	z
0	1.9190566539764404	0.7171010000916003	412.6059727233687	1	0	78	79	79	79	84	3	2.9360971450805664	1.2669624084082898e-05	1.818157434463501	-16.121042828114014	0.23539164875137225	0.21880220693566088	-0.4073381721973419	0.06065881997346878	-0.09945132583379745	70	-16.121052173353853	42.64182504417002	13989	42.641804308626725	9	6.35295075173405	0.3079103606852086	-0.10195717215538025	-0.0015767893055453897	10312332.172993332	10577.365273118843	7.991377829505826	77	b'NOT_AVAILABLE'	-7.641989988351149	0.08740179334554747	43.75231341609215	0.07054220642640081	0.21467718482017517	45.03433035439128	-0.41497212648391724	0.30598928200282727	0.17996619641780853	-0.08575969189405441	0.15920649468898773	243619	2015.0	-113.76032257080078	21.39291763305664	-41.67839813232422	26.201841354370117	0.3823484778404236	0.5382660627365112	0.3923785090446472	0.9163063168525696	1635378410781933568	7627862074752	b''	0.15740717016058217	0.11123604040005637	0.10243667003803988	-0.04370685490397632
1	nan	0.2534628812968044	47.316290890180255	2	0	55	57	57	57	84	5	2.6523141860961914	3.1600175134371966e-05	12.861557006835938	-16.19302376369384	0.2000676896877873	1.1977893944215496	0.8376259803771973	-0.9756439924240112	0.9725773334503174	70	-16.19303311057312	42.761180489478576	-2147483648	42.76115974936648	8	3.90032893506844	0.3234880030045522	-0.8537789583206177	0.8397389650344849	949564.6488279914	1140.173576223928	10.580958718900256	62	b'NOT_AVAILABLE'	-55.10917285969142	2.522928801165149	10.03626300124532	4.611413518289133	-0.9963987469673157	45.1650067708984	-0.9959233403205872	2.583882288511597	-0.8609106540679932	0.9734798669815063	-0.9724165201187134	487238	2015.0	-156.432861328125	22.76607322692871	-36.23965835571289	22.890602111816406	0.7110026478767395	0.9659702777862549	0.6461148858070374	0.8671600818634033	1635378410781933568	9277129363072	b'55-28-1'	0.25638863199686845	0.1807701962996959	0.16716755815017084	-0.07150016957395491
2	nan	0.3989006354041912	221.18496561724646	4	1	57	60	61	61	84	5	3.9934017658233643	2.5633918994572014e-05	5.767529487609863	-16.12335382439265	0.24882543945301736	0.1803264123376257	-0.39189115166664124	-0.19325552880764008	0.08942046016454697	70	-16.123363170402296	42.69750168007008	-2147483648	42.69748094193635	7	3.1553132200367373	0.2734838183180671	-0.11855248361825943	-0.0418587327003479	817837.6000768564	1827.3836759985832	10.743102380434273	60	b'NOT_AVAILABLE'	-1.602867102186794	1.0352589283446592	2.9322836829569003	1.908644426623371	-0.9142706990242004	45.08615483797584	-0.1774432212114334	0.2138361631952843	0.30772241950035095	-0.1848166137933731	0.04686680808663368	1948952	2015.0	-117.00751495361328	19.772153854370117	-43.108219146728516	26.7157039642334	0.4825277626514435	0.4287584722042084	0.5241528153419495	0.9030616879463196	1635378410781933568	13297218905216	b'55-1191-1'	0.31692574722846595	0.22376103019475546	0.2064625216744117	-0.08801225918215205
3	nan	0.4224923646481251	179.98201436339852	1	0	51	52	52	52	84	5	4.215157985687256	2.8672602638835087e-05	5.3608622550964355	-16.118206879297034	0.24821079122833972	0.20095844850181172	-0.33721715211868286	-0.22350119054317474	0.13181143999099731	70	-16.11821622503516	42.67779093546686	-2147483648	42.67777019818556	7	2.292366835156796	0.2809724206784257	-0.10920235514640808	-0.049440864473581314	602053.4754362862	905.8772856344845	11.075682394435745	61	b'NOT_AVAILABLE'	-18.414912114825732	1.1298513589995536	3.661982345981763	2.065051873379775	-0.9261773228645325	45.06654155758114	-0.36570677161216736	0.2760390513575931	0.2028782218694687	-0.058928851038217545	-0.050908856093883514	102321	2015.0	-132.42112731933594	22.56928253173828	-38.95445251464844	25.878559112548828	0.4946548640727997	0.6384561061859131	0.5090736746788025	0.8989177942276001	1635378410781933568	13469017597184	b'55-624-1'	0.43623035574565916	0.30810014040531863	0.2840853806346911	-0.12110624783986161
4	nan	0.3175001122010629	119.74837853832186	2	3	85	84	87	87	84	5	3.2356362342834473	2.22787512029754e-05	8.080779075622559	-16.055471830750374	0.33504360351532875	0.1701298562030361	-0.43870800733566284	-0.27934885025024414	0.12179157137870789	70	-16.0554811777948	42.77336987816832	-2147483648	42.77334913546197	11	1.582076960273368	0.2615394689640736	-0.329196035861969	0.10031197965145111	1388122.242048847	2826.428866453177	10.168700781271088	96	b'NOT_AVAILABLE'	-2.379387386351838	0.7106320061478508	0.34080233369502516	1.2204755227890713	-0.8336043357849121	45.13603822322069	-0.049052558839321136	0.17069695283376776	0.4714251756668091	-0.1563923954963684	-0.15207625925540924	409284	2015.0	-106.85968017578125	4.452099323272705	-47.8953971862793	26.755468368530273	0.5206537842750549	0.23930974304676056	0.653376579284668	0.8633849024772644	1635378410781933568	15736760328576	b'55-849-1'	0.6320805024726543	0.44587838095402044	0.41250283253756015	-0.17481316927621393
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2,057,045	25.898868560791016	0.6508009723190962	172.3136755413185	0	0	54	54	54	54	84	3	6.386378765106201	1.8042501324089244e-05	2.2653496265411377	16.006806970347426	-0.42319686025158043	0.24974147639642075	0.00821441039443016	0.2133195698261261	-0.000805279181804508	70	16.006807041815204	317.0782357688112	103561	-42.92178788756781	8	5.0743069397419776	0.2840892420661878	-0.0308084636926651	-0.03397708386182785	4114975.455725508	3447.5776608146016	8.988851940956916	69	b'NOT_AVAILABLE'	-4.440524133201202	0.04743297901782237	21.970772995655643	0.07846893118669047	0.3920176327228546	314.74170043792924	0.08548042178153992	0.2773321068969684	0.2473779171705246	-0.0006040430744178593	0.11652233451604843	1595738	2015.0	-18.078920364379883	-17.731922149658203	38.27400588989258	27.63787269592285	0.29217642545700073	0.11402469873428345	0.0404343381524086	0.937016487121582	1635378410781933568	6917488998546378368	b''	0.19707124773395138	0.13871698568448773	-0.12900211309069443	0.054342703136315784
2,057,046	nan	0.17407523451856974	28.886549102578012	0	2	54	52	54	54	84	5	1.9612410068511963	2.415467497485224e-05	24.774322509765625	16.12926993546893	-0.32497534368232894	0.14823365569199975	0.8842677474021912	-0.9121489524841309	-0.8994856476783752	70	16.129270018016896	317.0105462544942	-2147483648	-42.98947742356782	7	1.6983480817439922	0.7410137777358506	-0.9793509840965271	-0.9959075450897217	1202425.5197785893	871.2480333575235	10.324624601435723	59	b'NOT_AVAILABLE'	-10.401225111268962	1.4016954983272711	-1.2835612990841874	2.7416807292293637	0.980453610420227	314.64381789311193	0.8981446623802185	0.3590974400544809	0.9818224906921387	-0.9802247881889343	-0.9827051162719727	2019553	2015.0	-87.07184600830078	-31.574886322021484	-36.37055206298828	29.130958557128906	0.22651544213294983	0.07730517536401749	0.2675701975822449	0.9523505568504333	1635378410781933568	6917493705830041600	b'5179-753-1'	0.5888074481016426	0.4137467499267554	-0.38568304807850484	0.16357391078619246
2,057,047	nan	0.47235246463190794	92.12190417660749	2	0	34	36	36	36	84	5	4.68601131439209	2.138371200999245e-05	3.9279115200042725	15.92496896432183	-0.34317732044320387	0.20902981533215972	-0.2000708132982254	0.31042322516441345	-0.3574342727661133	70	15.924968943694909	317.6408327998631	-2147483648	-42.359190842094414	6	6.036938108863445	0.39688014089787665	-0.7275367975234985	-0.25934046506881714	3268640.5253614695	4918.5087736624755	9.238852161621992	51	b'NOT_AVAILABLE'	-27.852344752672245	1.2778575351686428	15.713555906870294	0.9411842746983148	-0.1186852976679802	315.2828795933192	-0.47665935754776	0.4722647631556871	0.704002320766449	-0.77033931016922	0.12704335153102875	788948	2015.0	-21.23501205444336	20.132535934448242	33.55913162231445	26.732301712036133	0.41511622071266174	0.5105549693107605	0.15976844727993011	0.9333845376968384	1635378410781933568	6917504975824469248	b'5192-877-1'	0.16564688621402263	0.11770477437507047	-0.10732559074953243	0.045449912782963474
2,057,048	nan	0.3086465263182493	76.66564461310193	1	2	52	51	53	53	84	5	3.154139280319214	1.9043474821955897e-05	9.627826690673828	16.193728871838935	-0.22811360043544882	0.131650037775767	0.3082593083381653	-0.5279345512390137	-0.4065483510494232	70	16.193728933791913	317.1363617703344	-2147483648	-42.86366191921117	7	1.484142306295484	0.34860128377301614	-0.7272516489028931	-0.9375584125518799	4009408.3172682906	1929.9834553649182	9.017069346445364	60	b'NOT_AVAILABLE'	1.8471079057572073	0.7307171627866237	11.352888915160555	1.219847308406543	0.7511345148086548	314.7406481637209	0.41397571563720703	0.19205296641778563	0.7539510726928711	-0.7239754796028137	-0.7911394238471985	868066	2015.0	-89.73970794677734	-25.196216583251953	-35.13546371459961	29.041872024536133	0.21430812776088715	0.06784655898809433	0.2636755108833313	0.9523414969444275	1635378410781933568	6917517998165066624	b'5179-1401-1'	0.6737898352187435	0.4742760432178817	-0.44016428945980135	0.18791055094922077
2,057,049	nan	0.4329850465924866	60.789771079095715	0	0	26	26	26	26	84	5	4.3140177726745605	2.7940122890868224e-05	4.742301940917969	16.135962442685898	-0.22130081624351935	0.2686748166142929	-0.46605369448661804	0.30018869042396545	-0.3290684223175049	70	16.13596246842634	317.3575812619557	-2147483648	-42.642442417388324	5	2.680111343641743	0.4507741964825321	-0.689416229724884	-0.1735922396183014	2074338.153903563	4136.498086035368	9.732571175024953	31	b'NOT_AVAILABLE'	3.15173423618292	1.4388911228835037	2.897878776243949	1.0354817855168323	-0.21837876737117767	314.960730599014	-0.4467950165271759	0.49182050944792216	0.7087226510047913	-0.8360105156898499	0.2156151533126831	1736132	2015.0	-63.01319885253906	18.303699493408203	-49.05630111694336	28.76698875427246	0.3929939866065979	0.32352808117866516	0.24211134016513824	0.9409775733947754	1635378410781933568	6917521537218608640	b'5179-1719-1'	0.3731188267130712	0.2636519673685346	-0.24280110216486334	0.10369630532457579

Since RA and Dec are in degrees, while ra_error and dec_error are in milliarcseconds, we need to put them on the same scale.

[62]:

tgas['ra_error'] = tgas.ra_error / 1000 / 3600
tgas['dec_error'] = tgas.dec_error / 1000 / 3600
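The two divisions above can be read as a unit conversion: 1000 milliarcseconds per arcsecond, then 3600 arcseconds per degree. A minimal sketch of the same arithmetic on plain numbers (the helper name `mas_to_degrees` is made up for illustration and is not part of Vaex):

```python
MAS_PER_ARCSEC = 1000      # milliarcseconds in one arcsecond
ARCSEC_PER_DEGREE = 3600   # arcseconds in one degree

def mas_to_degrees(value_mas):
    """Convert an angle from milliarcseconds to degrees."""
    return value_mas / MAS_PER_ARCSEC / ARCSEC_PER_DEGREE

# 3,600,000 mas is exactly one degree
print(mas_to_degrees(3_600_000))
```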

We now let Vaex compute the covariance matrix for the Cartesian coordinates x, y, and z. Then we take 50 samples from the dataset for visualization.

[63]:

tgas.propagate_uncertainties([tgas.x, tgas.y, tgas.z])
tgas_50 = tgas.sample(50, random_state=42)

For this small subset of the dataset we can visualize the uncertainties, both with and without the covariance.

[64]:

tgas_50.scatter(tgas_50.x, tgas_50.y, xerr=tgas_50.x_uncertainty, yerr=tgas_50.y_uncertainty)
plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.show()
tgas_50.scatter(tgas_50.x, tgas_50.y, xerr=tgas_50.x_uncertainty, yerr=tgas_50.y_uncertainty, cov=tgas_50.y_x_covariance)
plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.show()

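In the second plot we see error ellipses (so narrow that they appear as lines) instead of error bars, revealing that the distance information dominates the uncertainty in this case.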
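To see why a dominant distance error produces such narrow ellipses, here is a minimal sketch of first-order (linear) error propagation in a 2D analogue: a point at radius r and angle phi, with independent errors on both. It assumes the usual rule Sigma_out = J Sigma_in J^T; the function name is hypothetical, and the sketch is not a description of what Vaex does internally.

```python
import math

def propagate_polar_to_cartesian(r, phi, sigma_r, sigma_phi):
    """First-order propagation of independent errors on (r, phi) to (x, y).

    Returns (var_x, var_y, cov_xy) for x = r*cos(phi), y = r*sin(phi).
    """
    c, s = math.cos(phi), math.sin(phi)
    # Jacobian of (x, y) with respect to (r, phi)
    dx_dr, dx_dphi = c, -r * s
    dy_dr, dy_dphi = s, r * c
    var_r, var_phi = sigma_r ** 2, sigma_phi ** 2
    # Sigma_out = J * diag(var_r, var_phi) * J^T, written out term by term
    var_x = dx_dr ** 2 * var_r + dx_dphi ** 2 * var_phi
    var_y = dy_dr ** 2 * var_r + dy_dphi ** 2 * var_phi
    cov_xy = dx_dr * dy_dr * var_r + dx_dphi * dy_dphi * var_phi
    return var_x, var_y, cov_xy

# Radial (distance-like) error much larger than the angular error
var_x, var_y, cov_xy = propagate_polar_to_cartesian(
    r=2.0, phi=math.pi / 4, sigma_r=0.5, sigma_phi=0.001)
correlation = cov_xy / math.sqrt(var_x * var_y)
print(correlation)
```

When the radial error dominates, x and y are almost perfectly correlated, so the error ellipse collapses toward a line along the radial direction.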

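Just-In-Time compilation#

Let us start with a function that calculates the angular distance between two points on the surface of a sphere. The input of the function is a pair of 2 angular coordinates, in radians.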

[65]:

import vaex
import numpy as np
# From http://pythonhosted.org/pythran/MANUAL.html
def arc_distance(theta_1, phi_1, theta_2, phi_2):
    """
    Calculates the pairwise arc distance
    between all points in vector a and b.
    """
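    # NOTE: the standard haversine formula uses (theta_2 - theta_1); the extra
    # "-2" is kept as in the source, and the mean of about 2.0 recorded in the
    # outputs below is consistent with it.
    temp = (np.sin((theta_2-2-theta_1)/2)**2
           + np.cos(theta_1)*np.cos(theta_2) * np.sin((phi_2-phi_1)/2)**2)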
    distance_matrix = 2 * np.arctan2(np.sqrt(temp), np.sqrt(1-temp))
    return distance_matrix
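For reference, a minimal scalar sketch of the standard haversine form, using only the `math` module. It assumes theta is latitude and phi is longitude, and it uses the plain difference `theta_2 - theta_1`, so it is a sanity check on the formula rather than a copy of the function above. The name `haversine_arc` is made up for this example.

```python
import math

def haversine_arc(theta_1, phi_1, theta_2, phi_2):
    """Angular distance in radians between two points given as
    (latitude theta, longitude phi), using the haversine formula."""
    temp = (math.sin((theta_2 - theta_1) / 2) ** 2
            + math.cos(theta_1) * math.cos(theta_2)
            * math.sin((phi_2 - phi_1) / 2) ** 2)
    return 2 * math.atan2(math.sqrt(temp), math.sqrt(1 - temp))

# A quarter turn along the equator, and one pole to the other
quarter_turn = haversine_arc(0.0, 0.0, 0.0, math.pi / 2)
pole_to_pole = haversine_arc(math.pi / 2, 0.0, -math.pi / 2, 0.0)
print(quarter_turn, pole_to_pole)
```

The two test points have known answers (a quarter and a half of a great circle), which makes them convenient checks.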

Let us use the New York taxi dataset of 2015, which can be downloaded in hdf5 format.

[66]:

# nytaxi = vaex.open('s3://vaex/taxi/yellow_taxi_2009_2015_f32.hdf5?anon=true')
nytaxi = vaex.open('/Users/jovan/Work/vaex-work/vaex-taxi/data/yellow_taxi_2009_2015_f32.hdf5')
# let's use just 20% of the data, since we want to make sure it fits
# into memory (so we don't measure just hdd/ssd speed)
nytaxi.set_active_fraction(0.2)

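Although the function above expects Numpy arrays, Vaex can pass in columns or expressions instead. This delays the execution until it is needed, and the resulting expression is added as a virtual column.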

[67]:

nytaxi['arc_distance'] = arc_distance(nytaxi.pickup_longitude * np.pi/180,
                                      nytaxi.pickup_latitude * np.pi/180,
                                      nytaxi.dropoff_longitude * np.pi/180,
                                      nytaxi.dropoff_latitude * np.pi/180)
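The idea of delaying execution can be illustrated with a toy model. The sketch below is purely hypothetical (the `LazyExpr` class and `column` helper are invented here and are far simpler than Vaex's expression system): arithmetic on an expression only records what to do, and values are computed when `evaluate` is called.

```python
import math

class LazyExpr:
    """Toy stand-in for a lazy column expression (not Vaex's class)."""
    def __init__(self, text, compute):
        self.text = text          # human-readable form of the expression
        self._compute = compute   # function: row dict -> value

    def __mul__(self, other):
        # Build a new expression; no data is touched here.
        return LazyExpr(f"({self.text} * {other})",
                        lambda row: self._compute(row) * other)

    def evaluate(self, rows):
        # Work happens only here, when results are requested.
        return [self._compute(row) for row in rows]

def column(name):
    return LazyExpr(name, lambda row: row[name])

lon_rad = column("pickup_longitude") * (math.pi / 180)
print(lon_rad.text)  # only the recorded expression, nothing computed yet
print(lon_rad.evaluate([{"pickup_longitude": 180.0}]))
```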

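When we calculate the mean angular distance of a taxi trip, we encounter some invalid data, which produces the warnings below. We can safely ignore them for this demonstration.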

[68]:

%%time
nytaxi.mean(nytaxi.arc_distance)

/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/functions.py:121: RuntimeWarning: invalid value encountered in sqrt
  return function(*args, **kwargs)
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/functions.py:121: RuntimeWarning: invalid value encountered in sin
  return function(*args, **kwargs)
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/functions.py:121: RuntimeWarning: invalid value encountered in cos
  return function(*args, **kwargs)

CPU times: user 44.5 s, sys: 5.03 s, total: 49.5 s
Wall time: 6.14 s

[68]:

array(1.99993285)

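This computation uses quite a few heavy mathematical operations, and since it (internally) uses Numpy arrays, it also allocates quite a few temporary arrays. We can optimize the calculation with Just-In-Time compilation, based on numba, pythran, or, if you have an NVIDIA graphics card, cuda. Choose whichever gives the best performance or is easiest to install.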

[69]:

nytaxi['arc_distance_jit'] = nytaxi.arc_distance.jit_numba()
# nytaxi['arc_distance_jit'] = nytaxi.arc_distance.jit_pythran()
# nytaxi['arc_distance_jit'] = nytaxi.arc_distance.jit_cuda()

[70]:

%%time
nytaxi.mean(nytaxi.arc_distance_jit)

/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/expression.py:1038: RuntimeWarning: invalid value encountered in f
  return self.f(*args, **kwargs)

CPU times: user 25.7 s, sys: 330 ms, total: 26 s
Wall time: 2.31 s

[70]:

array(1.9999328)

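We can get a significant speedup in this case (\(\sim 2.7 x\) in wall time, from 6.14 s to 2.31 s).

Parallel computations#

As mentioned in the section on selections, Vaex can do computations in parallel. Often this is taken care of automatically, for instance when passing multiple selections to a method, or multiple arguments to one of the statistical functions. However, sometimes it is difficult or impossible to express a computation in one expression, and we need to resort to doing so-called "delayed" computations, similar to joblib and dask.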

[71]:

import vaex
df = vaex.example()
limits = [-10, 10]
delayed_count = df.count(df.E, binby=df.x, limits=limits,
                         shape=4, delay=True)
delayed_count

[71]:

<vaex.promise.Promise at 0x7ffbd64072d0>

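Note that the returned value is now a promise (TODO: a more Pythonic way would be to return a Future). This may be subject to change, and the best way to work with it is to use the delayed decorator, and to call DataFrame.execute when the result is needed.

In addition to the delayed computation above, we schedule more computations, such that both the count and the mean are executed in parallel and only a single pass over the data is needed. We schedule the execution of two extra functions using the vaex.delayed decorator, and run the whole pipeline using df.execute().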

[72]:

delayed_sum = df.sum(df.E, binby=df.x, limits=limits,
                     shape=4, delay=True)

@vaex.delayed
def calculate_mean(sums, counts):
    print('calculating mean')
    return sums/counts

print('before calling mean')
# since calculate_mean is decorated with vaex.delayed
# this now also returns a 'delayed' object (a promise)
delayed_mean = calculate_mean(delayed_sum, delayed_count)

# if we'd like to perform operations on that, we can again
# use the same decorator
@vaex.delayed
def print_mean(means):
    print('means', means)
print_mean(delayed_mean)

print('before calling execute')
df.execute()

# Using the .get on the promise will also return the result
# However, this will only work after execute, and may be
# subject to change
means = delayed_mean.get()
print('same means', means)

before calling mean
before calling execute
calculating mean
means [ -94323.68051598 -118749.23850834 -119119.46292653  -95021.66183457]
same means [ -94323.68051598 -118749.23850834 -119119.46292653  -95021.66183457]
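The benefit of scheduling the sum and the count together is that both can be accumulated in the same loop over the data. A minimal pure-Python sketch of that idea follows; the function name and the small input lists are made up for illustration, and Vaex's real implementation is parallel and considerably more involved.

```python
def binned_sum_and_count(xs, values, limits, shape):
    """Accumulate per-bin sums and counts of `values`, binned by `xs`, in one pass."""
    lo, hi = limits
    width = (hi - lo) / shape
    sums = [0.0] * shape
    counts = [0] * shape
    for x, value in zip(xs, values):      # the single pass over the data
        if lo <= x < hi:
            index = int((x - lo) / width)
            sums[index] += value
            counts[index] += 1
    return sums, counts

xs = [-9.0, -7.5, -1.0, 2.0, 3.0, 8.0]      # made-up positions
values = [10.0, 20.0, 4.0, 6.0, 8.0, 1.0]   # made-up values
sums, counts = binned_sum_and_count(xs, values, limits=(-10, 10), shape=4)
# The mean per bin only needs the two accumulated arrays, not the data again.
means = [s / c if c else float("nan") for s, c in zip(sums, counts)]
print(means)
```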

Extending Vaex#

Vaex can be extended using several mechanisms.

Adding functions#

Use the vaex.register_function decorator API to add new functions.

[73]:

import vaex
import numpy as np
@vaex.register_function()
def add_one(ar):
    return ar+1

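The function can be invoked using the df.func accessor, which returns a new expression. When the expression is evaluated in any Vaex context, each argument that is an expression is replaced by a Numpy array.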

[74]:

df = vaex.from_arrays(x=np.arange(4))
df.func.add_one(df.x)

[74]:

Expression = add_one(x)
Length: 4 dtype: int64 (expression)
-----------------------------------
0  1
1  2
2  3
3  4

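By default (passing on_expression=True), the function is also available as a method on expressions, where the expression itself is automatically set as the first argument (since this is a very common use case).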

[75]:

df.x.add_one()

[75]:

Expression = add_one(x)
Length: 4 dtype: int64 (expression)
-----------------------------------
0  1
1  2
2  3
3  4
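How a decorator can make a function available both through a registry and as a method is easiest to see in a toy version. Everything below (`REGISTRY`, `Expr`, and this `register_function`) is a hypothetical stand-in written for illustration; it mimics the behaviour described above but is not Vaex's implementation.

```python
REGISTRY = {}   # hypothetical global table of registered functions

class Expr:
    """Toy expression that only tracks its textual form."""
    def __init__(self, text):
        self.text = text

def register_function(on_expression=True):
    def decorator(f):
        REGISTRY[f.__name__] = f
        if on_expression:
            # Expose the function as a method; `self` becomes the first argument.
            def method(self, *args):
                arg_text = ", ".join([self.text, *map(str, args)])
                return Expr(f"{f.__name__}({arg_text})")
            setattr(Expr, f.__name__, method)
        return f
    return decorator

@register_function()
def add_one(ar):
    return ar + 1

@register_function(on_expression=False)
def addmul(a, b, x, y):
    return a * x + b * y

print(Expr("x").add_one().text)    # add_one(x)
print(hasattr(Expr, "addmul"))     # False
```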

If the first argument is not an expression, pass on_expression=False and use the df.func accessor to build a new expression with the function:

[76]:

@vaex.register_function(on_expression=False)
def addmul(a, b, x, y):
    return a*x + b * y

[77]:

df = vaex.from_arrays(x=np.arange(4))
df['y'] = df.x**2
df.func.addmul(2, 3, df.x, df.y)

[77]:

Expression = addmul(2, 3, x, y)
Length: 4 dtype: int64 (expression)
-----------------------------------
0   0
1   5
2  16
3  33

These expressions can be added as virtual columns, as expected.

[78]:

df = vaex.from_arrays(x=np.arange(4))
df['y'] = df.x**2
df['z'] = df.func.addmul(2, 3, df.x, df.y)
df['w'] = df.x.add_one()
df

[78]:

#	x	y	z	w
0	0	0	0	1
1	1	1	5	2
2	2	4	16	3
3	3	9	33	4

Adding DataFrame accessors#

When adding methods that operate on DataFrames, it makes sense to group them together in a single namespace.

[79]:

@vaex.register_dataframe_accessor('scale', override=True)
class ScalingOps(object):
    def __init__(self, df):
        self.df = df

    def mul(self, a):
        df = self.df.copy()
        for col in df.get_column_names(strings=False):
            if df[col].dtype:
                df[col] = df[col] * a
        return df

    def add(self, a):
        df = self.df.copy()
        for col in df.get_column_names(strings=False):
            if df[col].dtype:
                df[col] = df[col] + a
        return df

[80]:

df.scale.add(1)

[80]:

#	x	y	z	w
0	1	1	1	2
1	2	2	6	3
2	3	5	17	4
3	4	10	34	5

[81]:

df.scale.mul(2)

[81]:

#	x	y	z	w
0	0	0	0	2
1	2	2	10	4
2	4	8	32	6
3	6	18	66	8
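The accessor pattern itself is small: a class is instantiated with the object it is attached to and exposed under a chosen attribute name. A minimal sketch under that assumption, with a made-up `register_accessor` helper and a toy `Table` class standing in for a DataFrame:

```python
def register_accessor(cls, name):
    """Hypothetical helper: expose `accessor_cls(obj)` as `obj.<name>`."""
    def decorator(accessor_cls):
        setattr(cls, name, property(lambda self: accessor_cls(self)))
        return accessor_cls
    return decorator

class Table:
    """Toy table: a dict of column name -> list of numbers."""
    def __init__(self, **columns):
        self.columns = columns

@register_accessor(Table, "scale")
class ScalingOps:
    def __init__(self, table):
        self.table = table

    def add(self, a):
        # Return a new Table; the original is left untouched.
        return Table(**{k: [v + a for v in vals]
                        for k, vals in self.table.columns.items()})

t = Table(x=[0, 1, 2, 3])
print(t.scale.add(1).columns)   # {'x': [1, 2, 3, 4]}
```

Returning a new object from each method, as the `ScalingOps` class above does with `self.df.copy()`, keeps the accessor free of side effects.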

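Convenience methods#

Get column names#

Often we want to work with a subset of the columns in a DataFrame. With the get_column_names method, Vaex makes it easy to get exactly the columns you need. By default, get_column_names returns all columns: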

[1]:

import vaex
df = vaex.datasets.titanic()
print(df.get_column_names())

['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home_dest']

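The same method has a number of arguments that make it easy to get the desired subset of columns. For example, one can pass a regular expression to select columns based on their names. In the cell below, we select all columns whose names are 5 characters long: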

[2]:

print(df.get_column_names(regex='^[a-zA-Z]{5}$'))

['sibsp', 'parch', 'cabin']
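The same pattern can be checked against the column names listed earlier using only Python's `re` module. This is a sketch of the matching logic, assuming the pattern is applied to each name in turn; it says nothing about how Vaex applies the regex internally.

```python
import re

column_names = ['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch',
                'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home_dest']

# ^ and $ anchor the match, so exactly five ASCII letters are required
pattern = re.compile(r'^[a-zA-Z]{5}$')
selected = [name for name in column_names if pattern.match(name)]
print(selected)  # ['sibsp', 'parch', 'cabin']
```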

We can also select columns based on their type. Below we select all columns that are integers or floats:

[3]:

df.get_column_names(dtype=['int', 'float'])

[3]:

['pclass', 'age', 'sibsp', 'parch', 'fare', 'body']

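Escape hatch: apply#

If a calculation cannot be expressed as a Vaex expression, one can use the apply method as a last resort. This can be useful if the function you want to apply is written in pure Python or comes from a third-party library, and is difficult or impossible to vectorize.

We think apply should only be used as a last resort, because it needs multiprocessing (spawning new processes) to avoid the Python Global Interpreter Lock (GIL) and make use of multiple cores. This comes at the cost of having to transfer the data between the main process and the child processes.

Here is an example that uses the apply method: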

[1]:

import vaex

def slow_is_prime(x):
    return x > 1 and all((x % i) != 0 for i in range(2, x))

df = vaex.from_arrays(x=vaex.vrange(0, 100_000, dtype='i4'))
# you need to explicitly specify which arguments you need
df['is_prime'] = df.apply(slow_is_prime, arguments=[df.x])
df.head(10)

[1]:

#	x	is_prime
0	0	False
1	1	False
2	2	True
3	3	True
4	4	False
5	5	True
6	6	False
7	7	True
8	8	False
9	9	False

[2]:

prime_count = df.is_prime.sum()
print(f'There are {prime_count} prime numbers between 0 and {len(df)}')

There are 9592 prime numbers between 0 and 100000
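As a cross-check that does not depend on Vaex, the same count can be reproduced with a sieve of Eratosthenes in plain Python. The function name is made up for this sketch.

```python
def count_primes_below(n):
    """Count primes p with 0 <= p < n using the sieve of Eratosthenes."""
    if n < 3:
        return 0
    is_prime = bytearray([1]) * n
    is_prime[0] = is_prime[1] = 0
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            # Mark multiples of i, starting at i*i, as composite.
            is_prime[i * i::i] = bytearray(len(range(i * i, n, i)))
    return sum(is_prime)

print(count_primes_below(100_000))  # 9592
```

Unlike slow_is_prime, which tests each number independently, the sieve works on the whole range at once, which is the kind of restructuring that often makes apply unnecessary.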

[3]:

# both of these are equivalent
df['is_prime'] = df.apply(slow_is_prime, arguments=[df.x])
# but this form only works for a single argument
df['is_prime'] = df.x.apply(slow_is_prime)

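When not to use apply#

You should not use apply when your function can be vectorized. When you use Vaex's expression system, we know what you are doing: we see the expression and can manipulate it to achieve optimal performance. An apply function is like a black box, and we cannot do anything with it, such as JIT compilation.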

[4]:

df = vaex.from_arrays(x=vaex.vrange(0, 10_000_000, dtype='f4'))

[5]:

# ideal case
df['y'] = df.x**2

[6]:

%%timeit
df.y.sum()

29.6 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

[7]:

# will transfer the data to child processes, and execute the ** operation in Python for each element
df['y_slow'] = df.x.apply(lambda x: x**2)

[8]:

%%timeit
df.y_slow.sum()

353 ms ± 40 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

[9]:

# bad idea: it will transfer the data to the child process, where it will be executed in vectorized form
df['y_slow_vectorized'] = df.x.apply(lambda x: x**2, vectorize=True)

[10]:

%%timeit
df.y_slow_vectorized.sum()

82.8 ms ± 525 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

[11]:

# bad idea: same performance as just df['y'], but we lose the information about what was done
df['y_fast'] = df.x.apply(lambda x: x**2, vectorize=True, multiprocessing=False)

[12]:

%%timeit
df.y_fast.sum()

28.8 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
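To put the four measurements side by side, the short calculation below expresses each one relative to the plain expression. The numbers are copied from the %%timeit outputs above; they come from a single machine and will differ elsewhere.

```python
# Timings copied from the %%timeit outputs above, in milliseconds.
timings_ms = {
    "y (expression)": 29.6,
    "y_slow (apply)": 353.0,
    "y_slow_vectorized (apply, vectorize=True)": 82.8,
    "y_fast (apply, vectorize=True, multiprocessing=False)": 28.8,
}

baseline = timings_ms["y (expression)"]
for name, ms in timings_ms.items():
    print(f"{name}: {ms / baseline:.1f}x the expression time")
```

In these measurements the element-wise apply is roughly 12 times slower than the expression, the vectorized apply with multiprocessing is roughly 3 times slower, and the last variant matches the expression in speed while hiding the computation from Vaex.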
