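Version 0.18.0 (March 13, 2016)#

This is a major release from 0.17.1 and includes a small number of API changes, several new features, enhancements, and performance improvements, along with a large number of bug fixes. We recommend that all users upgrade to this version.

Warning

pandas >= 0.18.0 no longer supports compatibility with Python versions 2.6 and 3.3 (GH 7718, GH 11273)

Warning

numexpr version 2.4.4 will now show a warning and will not be used as a computation back-end for pandas because of some buggy behavior. This does not affect other versions (>= 2.1 and >= 2.4.6). (GH 12489)

Highlights include:

Moving and expanding window functions are now methods on Series and DataFrame, similar to .groupby; see the corresponding section below.
Added support for a RangeIndex as a specialized form of the Int64Index for memory savings; see the corresponding section below.
API breaking change to the .resample method to make it more .groupby-like; see the corresponding section below.
Removed support for positional indexing with floats, which has been deprecated since 0.14.0. This will now raise a TypeError; see the corresponding section below.
The .to_xarray() function has been added for compatibility with the xarray package; see the corresponding section below.
The read_sas function has been enhanced to read sas7bdat files; see the corresponding section below.
Added the .str.extractall() method, and API changes to the .str.extract() method and the .str.cat() method.
pd.test() top-level nose test runner is available (GH 4327).

Check the API changes and deprecations before updating.

New features#

Window functions are now methods#

Window functions have been refactored to be methods on Series/DataFrame objects, rather than top-level functions, which are now deprecated. This gives these window-type functions an API similar to that of .groupby. See the full documentation on window functions (GH 11603, GH 12373)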

In [1]: np.random.seed(1234)

In [2]: df = pd.DataFrame({'A': range(10), 'B': np.random.randn(10)})

In [3]: df
Out[3]: 
   A         B
0  0  0.471435
1  1 -1.190976
2  2  1.432707
3  3 -0.312652
4  4 -0.720589
5  5  0.887163
6  6  0.859588
7  7 -0.636524
8  8  0.015696
9  9 -2.242685

[10 rows x 2 columns]

Previous behavior:

In [8]: pd.rolling_mean(df, window=3)
        FutureWarning: pd.rolling_mean is deprecated for DataFrame and will be removed in a future version, replace with
                       DataFrame.rolling(window=3,center=False).mean()
Out[8]:
    A         B
0 NaN       NaN
1 NaN       NaN
2   1  0.237722
3   2 -0.023640
4   3  0.133155
5   4 -0.048693
6   5  0.342054
7   6  0.370076
8   7  0.079587
9   8 -0.954504

New behavior:

In [4]: r = df.rolling(window=3)

These show a descriptive repr

In [5]: r
Out[5]: Rolling [window=3,center=False,method=single]

with tab-completion of available methods and properties.

In [9]: r.<TAB>
r.A           r.agg         r.apply       r.count       r.exclusions  r.max         r.median      r.name        r.skew        r.sum
r.B           r.aggregate   r.corr        r.cov         r.kurt        r.mean        r.min         r.quantile    r.std         r.var

The methods operate on the Rolling object itself

In [6]: r.mean()
Out[6]: 
     A         B
0  NaN       NaN
1  NaN       NaN
2  1.0  0.237722
3  2.0 -0.023640
4  3.0  0.133155
5  4.0 -0.048693
6  5.0  0.342054
7  6.0  0.370076
8  7.0  0.079587
9  8.0 -0.954504

[10 rows x 2 columns]

They provide getitem accessors

In [7]: r['A'].mean()
Out[7]: 
0    NaN
1    NaN
2    1.0
3    2.0
4    3.0
5    4.0
6    5.0
7    6.0
8    7.0
9    8.0
Name: A, Length: 10, dtype: float64

And multiple aggregations

In [8]: r.agg({'A': ['mean', 'std'],
   ...:        'B': ['mean', 'std']})
   ...: 
Out[8]: 
     A              B          
  mean  std      mean       std
0  NaN  NaN       NaN       NaN
1  NaN  NaN       NaN       NaN
2  1.0  1.0  0.237722  1.327364
3  2.0  1.0 -0.023640  1.335505
4  3.0  1.0  0.133155  1.143778
5  4.0  1.0 -0.048693  0.835747
6  5.0  1.0  0.342054  0.920379
7  6.0  1.0  0.370076  0.871850
8  7.0  1.0  0.079587  0.750099
9  8.0  1.0 -0.954504  1.162285

[10 rows x 4 columns]
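As a rough migration sketch (the data is made up for illustration and does not come from the release notes), the deprecated top-level calls map onto methods of the objects returned by .rolling(), .expanding() and .ewm():

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # illustrative data only

# Method form of the deprecated pd.rolling_mean(s, window=2)
rolling_mean = s.rolling(window=2).mean()

# Method form of the deprecated pd.expanding_sum(s)
expanding_sum = s.expanding().sum()

# Method form of the deprecated pd.ewma family; with adjust=False the
# result follows y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
ewm_mean = s.ewm(alpha=0.5, adjust=False).mean()

print(rolling_mean.tolist())
print(expanding_sum.tolist())
print(ewm_mean.tolist())
```

The first rolling value is NaN because a window of 2 has only one observation at that point.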

Changes to rename#

Series.rename and NDFrame.rename_axis can now take a scalar or list-like argument for altering the Series or axis name, in addition to their old behavior of altering labels. (GH 9494, GH 11965)

In [9]: s = pd.Series(np.random.randn(5))

In [10]: s.rename('newname')
Out[10]: 
0    1.150036
1    0.991946
2    0.953324
3   -2.021255
4   -0.334077
Name: newname, Length: 5, dtype: float64

In [11]: df = pd.DataFrame(np.random.randn(5, 2))

In [12]: (df.rename_axis("indexname")
   ....:    .rename_axis("columns_name", axis="columns"))
   ....: 
Out[12]: 
columns_name         0         1
indexname                       
0             0.002118  0.405453
1             0.289092  1.321158
2            -1.546906 -0.202646
3            -0.655969  0.193421
4             0.553439  1.318152

[5 rows x 2 columns]

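The new functionality works well in method chains. Previously these methods only accepted functions or dicts mapping a label to a new label. This continues to work as before for function or dict-like values.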
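A minimal sketch of how the argument type selects the behavior of Series.rename (the Series and the names used here are illustrative assumptions):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Scalar argument: sets Series.name and leaves the labels untouched
renamed = s.rename("total")

# Function argument: maps every index label, the name stays unset
relabeled = s.rename(str.upper)

# Dict argument: maps only the listed labels
remapped = s.rename({"a": "x"})

print(renamed.name, list(relabeled.index), list(remapped.index))
```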

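Range Index#

A RangeIndex has been added to the Int64Index sub-classes to support a memory-saving alternative for common use cases. Its implementation is similar to the Python range object (xrange in Python 2), in that it only stores the start, stop, and step values for the index. It interacts transparently with the user API, converting to Int64Index if needed.

This is now the default constructed index for NDFrame objects, rather than an Int64Index as before. (GH 939, GH 12070, GH 12071, GH 12109, GH 12888)

Previous behavior: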

In [3]: s = pd.Series(range(1000))

In [4]: s.index
Out[4]:
Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            990, 991, 992, 993, 994, 995, 996, 997, 998, 999], dtype='int64', length=1000)

In [6]: s.index.nbytes
Out[6]: 8000

New behavior:

In [13]: s = pd.Series(range(1000))

In [14]: s.index
Out[14]: RangeIndex(start=0, stop=1000, step=1)

In [15]: s.index.nbytes
Out[15]: 128
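The sketch below illustrates the points above on a recent pandas: a RangeIndex is defined by start, stop and step, it is the default index, and an operation whose result is no longer a regular range falls back to a materialized integer index. The values are illustrative assumptions.

```python
import pandas as pd

# Only start, stop and step are stored, as with Python's range
idx = pd.RangeIndex(start=0, stop=10, step=2)
values = list(idx)

# Objects built without an explicit index get a RangeIndex
default_index = pd.Series(["x", "y", "z"]).index

# The union below cannot be expressed as a range, so the result
# is converted to a regular integer index
combined = idx.union(pd.Index([1, 100]))

print(values, type(default_index).__name__, type(combined).__name__)
```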

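Changes to str.extract#

The .str.extract method takes a regular expression with capture groups, finds the first match in each subject string, and returns the contents of the capture groups (GH 11386).

In v0.18.0, the expand argument was added to extract.

expand=False: it returns a Series, Index, or DataFrame, depending on the subject and regular expression pattern (same behavior as pre-0.18.0).
expand=True: it always returns a DataFrame, which is more consistent and less confusing from the perspective of a user.

Currently the default is expand=None, which gives a FutureWarning and uses expand=False. To avoid this warning, please explicitly specify expand.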

In [1]: pd.Series(['a1', 'b2', 'c3']).str.extract(r'[ab](\d)', expand=None)
FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame)
but in a future version of pandas this will be changed to expand=True (return DataFrame)

Out[1]:
0      1
1      2
2    NaN
dtype: object

Extracting a regular expression with one group returns a Series if expand=False.

In [16]: pd.Series(['a1', 'b2', 'c3']).str.extract(r'[ab](\d)', expand=False)
Out[16]: 
0      1
1      2
2    NaN
Length: 3, dtype: object

It returns a DataFrame with one column if expand=True.

In [17]: pd.Series(['a1', 'b2', 'c3']).str.extract(r'[ab](\d)', expand=True)
Out[17]: 
     0
0    1
1    2
2  NaN

[3 rows x 1 columns]

Calling extract on an Index with a regex that has exactly one capture group returns an Index if expand=False.

In [18]: s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])

In [19]: s.index
Out[19]: Index(['A11', 'B22', 'C33'], dtype='object')

In [20]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
Out[20]: Index(['A', 'B', 'C'], dtype='object', name='letter')

It returns a DataFrame with one column if expand=True.

In [21]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
Out[21]: 
  letter
0      A
1      B
2      C

[3 rows x 1 columns]

Calling extract on an Index with a regex that has more than one capture group raises ValueError if expand=False.

>>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
ValueError: only one regex group is supported with Index

It returns a DataFrame if expand=True.

In [22]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
Out[22]: 
  letter   1
0      A  11
1      B  22
2      C  33

[3 rows x 2 columns]

In summary, extract(expand=True) always returns a DataFrame with a row for every subject string and a column for every capture group.
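The return types described above can be checked directly. This is a small sketch that reuses the pattern from the examples and inspects the types:

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])

# One capture group, expand=False: a Series comes back
as_series = s.str.extract(r"[ab](\d)", expand=False)

# Same pattern, expand=True: always a DataFrame, one column per group
as_frame = s.str.extract(r"[ab](\d)", expand=True)

print(type(as_series).__name__, type(as_frame).__name__, as_frame.shape)
```

Passing expand explicitly keeps the code independent of the default, which these notes say will change in a future version.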

Addition of str.extractall#

The .str.extractall method was added (GH 11386). It differs from extract, which returns only the first match.

In [23]: s = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])

In [24]: s
Out[24]: 
A    a1a2
B      b1
C      c1
Length: 3, dtype: object

In [25]: s.str.extract(r"(?P<letter>[ab])(?P<digit>\d)", expand=False)
Out[25]: 
  letter digit
A      a     1
B      b     1
C    NaN   NaN

[3 rows x 2 columns]

The extractall method returns all matches.

In [26]: s.str.extractall(r"(?P<letter>[ab])(?P<digit>\d)")
Out[26]: 
        letter digit
  match             
A 0          a     1
  1          a     2
B 0          b     1

[3 rows x 2 columns]
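Because extractall returns one row per match, indexed by the subject label and a match counter, per-subject summaries follow from a groupby on the first index level. A short sketch using the same data as above (the groupby step is an illustrative addition):

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
matches = s.str.extractall(r"(?P<letter>[ab])(?P<digit>\d)")

# Level 0 is the original label, level 1 ("match") numbers the matches
per_subject = matches.groupby(level=0).size()

print(per_subject.to_dict())
```

Subjects without any match, such as "C" here, do not appear in the result at all.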

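Changes to str.cat#

The method .str.cat() concatenates the members of a Series. Previously, if NaN values were present in the Series, calling .str.cat() on it would return NaN, unlike the rest of the Series.str.* API. This behavior has been amended to ignore NaN values by default. (GH 11435).

A new, friendlier ValueError has been added to protect against the mistake of supplying sep as a positional argument rather than as a keyword argument. (GH 11334).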

In [27]: pd.Series(['a', 'b', np.nan, 'c']).str.cat(sep=' ')
Out[27]: 'a b c'

In [28]: pd.Series(['a', 'b', np.nan, 'c']).str.cat(sep=' ', na_rep='?')
Out[28]: 'a b ? c'

In [2]: pd.Series(['a', 'b', np.nan, 'c']).str.cat(' ')
ValueError: Did you mean to supply a ``sep`` keyword?

Datetimelike rounding#

DatetimeIndex, Timestamp, TimedeltaIndex, and Timedelta have gained the .round(), .floor() and .ceil() methods for datetimelike rounding, flooring and ceiling. (GH 4314, GH 11963)

Naive datetimes

In [29]: dr = pd.date_range('20130101 09:12:56.1234', periods=3)

In [30]: dr
Out[30]: 
DatetimeIndex(['2013-01-01 09:12:56.123400', '2013-01-02 09:12:56.123400',
               '2013-01-03 09:12:56.123400'],
              dtype='datetime64[ns]', freq='D')

In [31]: dr.round('s')
Out[31]: 
DatetimeIndex(['2013-01-01 09:12:56', '2013-01-02 09:12:56',
               '2013-01-03 09:12:56'],
              dtype='datetime64[ns]', freq=None)

# Timestamp scalar
In [32]: dr[0]
Out[32]: Timestamp('2013-01-01 09:12:56.123400')

In [33]: dr[0].round('10s')
Out[33]: Timestamp('2013-01-01 09:13:00')

Tz-aware values are rounded, floored and ceiled in local time

In [34]: dr = dr.tz_localize('US/Eastern')

In [35]: dr
Out[35]: 
DatetimeIndex(['2013-01-01 09:12:56.123400-05:00',
               '2013-01-02 09:12:56.123400-05:00',
               '2013-01-03 09:12:56.123400-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq=None)

In [36]: dr.round('s')
Out[36]: 
DatetimeIndex(['2013-01-01 09:12:56-05:00', '2013-01-02 09:12:56-05:00',
               '2013-01-03 09:12:56-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq=None)

Timedeltas

In [37]: t = pd.timedelta_range('1 days 2 hr 13 min 45 us', periods=3, freq='d')

In [38]: t
Out[38]:
TimedeltaIndex(['1 days 02:13:00.000045', '2 days 02:13:00.000045',
                '3 days 02:13:00.000045'],
               dtype='timedelta64[ns]', freq='D')

In [39]: t.round('10min')
Out[39]:
TimedeltaIndex(['1 days 02:10:00', '2 days 02:10:00',
                '3 days 02:10:00'],
               dtype='timedelta64[ns]', freq=None)

# Timedelta scalar
In [40]: t[0]
Out[40]: Timedelta('1 days 02:13:00.000045')

In [41]: t[0].round('2h')
Out[41]: Timedelta('1 days 02:00:00')

In addition, .round(), .floor() and .ceil() are available through the .dt accessor of Series.

In [37]: s = pd.Series(dr)

In [38]: s
Out[38]: 
0   2013-01-01 09:12:56.123400-05:00
1   2013-01-02 09:12:56.123400-05:00
2   2013-01-03 09:12:56.123400-05:00
Length: 3, dtype: datetime64[ns, US/Eastern]

In [39]: s.dt.round('D')
Out[39]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
Length: 3, dtype: datetime64[ns, US/Eastern]
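The examples above only exercise .round(). As a hedged sketch of the other two methods on scalars (the timestamps and the "min" resolution are illustrative choices):

```python
import pandas as pd

ts = pd.Timestamp("2013-01-01 09:12:56.1234")

floored = ts.floor("min")   # drops to the start of the minute
ceiled = ts.ceil("min")     # moves up to the next full minute
rounded = ts.round("min")   # 56.1234s is past the half minute, so it rounds up

td = pd.Timedelta("1 days 02:13:45")
td_floor = td.floor("min")
td_ceil = td.ceil("min")

print(floored, ceiled, rounded, td_floor, td_ceil)
```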

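Formatting of integers in FloatIndex#

Integers in a FloatIndex, e.g. 1., are now formatted with a decimal point and a 0 digit, e.g. 1.0 (GH 11713). This change affects not only the console display but also the output of IO methods like .to_csv or .to_html.

Previous behavior: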

In [2]: s = pd.Series([1, 2, 3], index=np.arange(3.))

In [3]: s
Out[3]:
0    1
1    2
2    3
dtype: int64

In [4]: s.index
Out[4]: Float64Index([0.0, 1.0, 2.0], dtype='float64')

In [5]: print(s.to_csv(path=None))
0,1
1,2
2,3

New behavior:

In [40]: s = pd.Series([1, 2, 3], index=np.arange(3.))

In [41]: s
Out[41]: 
0.0    1
1.0    2
2.0    3
Length: 3, dtype: int64

In [42]: s.index
Out[42]: Index([0.0, 1.0, 2.0], dtype='float64')

In [43]: print(s.to_csv(path_or_buf=None, header=False))
0.0,1
1.0,2
2.0,3

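Changes to dtype assignment behaviors#

When a DataFrame's slice is updated with a new slice of the same dtype, the dtype of the DataFrame will now remain the same. (GH 10503)

Previous behavior: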

In [5]: df = pd.DataFrame({'a': [0, 1, 1],
                           'b': pd.Series([100, 200, 300], dtype='uint32')})

In [7]: df.dtypes
Out[7]:
a     int64
b    uint32
dtype: object

In [8]: ix = df['a'] == 1

In [9]: df.loc[ix, 'b'] = df.loc[ix, 'b']

In [11]: df.dtypes
Out[11]:
a    int64
b    int64
dtype: object

New behavior:

In [44]: df = pd.DataFrame({'a': [0, 1, 1],
   ....:                    'b': pd.Series([100, 200, 300], dtype='uint32')})
   ....: 

In [45]: df.dtypes
Out[45]: 
a     int64
b    uint32
Length: 2, dtype: object

In [46]: ix = df['a'] == 1

In [47]: df.loc[ix, 'b'] = df.loc[ix, 'b']

In [48]: df.dtypes
Out[48]: 
a     int64
b    uint32
Length: 2, dtype: object

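When a DataFrame's integer slice is partially updated with a new slice of floats that could potentially be down-casted to integer without losing precision, the dtype of the slice will be set to float instead of integer.

Previous behavior: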

In [4]: df = pd.DataFrame(np.array(range(1,10)).reshape(3,3),
                          columns=list('abc'),
                          index=[[4,4,8], [8,10,12]])

In [5]: df
Out[5]:
      a  b  c
4 8   1  2  3
  10  4  5  6
8 12  7  8  9

In [7]: df.ix[4, 'c'] = np.array([0., 1.])

In [8]: df
Out[8]:
      a  b  c
4 8   1  2  0
  10  4  5  1
8 12  7  8  9

New behavior:

In [49]: df = pd.DataFrame(np.array(range(1,10)).reshape(3,3),
   ....:                   columns=list('abc'),
   ....:                   index=[[4,4,8], [8,10,12]])
   ....: 

In [50]: df
Out[50]: 
      a  b  c
4 8   1  2  3
  10  4  5  6
8 12  7  8  9

[3 rows x 3 columns]

In [51]: df.loc[4, 'c'] = np.array([0., 1.])

In [52]: df
Out[52]: 
      a  b  c
4 8   1  2  0
  10  4  5  1
8 12  7  8  9

[3 rows x 3 columns]

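Method to_xarray#

In a future version of pandas, we will be deprecating Panel and other > 2 ndim objects. To provide continuity, all NDFrame objects have gained the .to_xarray() method for converting to xarray objects, which have a pandas-like interface for > 2 ndim. (GH 11972)

See the full xarray documentation.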

In [1]: p = Panel(np.arange(2*3*4).reshape(2,3,4))

In [2]: p.to_xarray()
Out[2]:
<xarray.DataArray (items: 2, major_axis: 3, minor_axis: 4)>
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
Coordinates:
  * items       (items) int64 0 1
  * major_axis  (major_axis) int64 0 1 2
  * minor_axis  (minor_axis) int64 0 1 2 3

Latex representation#

DataFrame has gained a ._repr_latex_() method to allow conversion to latex in an ipython/jupyter notebook using nbconvert. (GH 11778)

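Note that this must be activated by setting the option display.latex.repr to True, e.g. pd.set_option('display.latex.repr', True) (GH 12182)

For example, if you have a jupyter notebook you plan to convert to latex using nbconvert, place the statement pd.set_option('display.latex.repr', True) in the first cell so that the contained DataFrame output is also stored as latex.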

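The options display.latex.escape and display.latex.longtable have also been added to the configuration and are used automatically by the to_latex method. See the available options documentation for more info.

pd.read_sas() changes#

read_sas has gained the ability to read SAS7BDAT files, including compressed files. The files can be read in their entirety, or incrementally. For full details see the read_sas documentation. (GH 4052)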

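Other enhancements#

Handle truncated floats in SAS xport files (GH 11713)
Added option to hide the index in Series.to_string (GH 11729)
read_excel now supports s3 urls of the format s3://bucketname/filename (GH 11447)
Added support for the AWS_S3_HOST env variable when reading from s3 (GH 12198)
A simple version of Panel.round() is now implemented (GH 11763)
For Python 3.x, round(DataFrame), round(Series), round(Panel) will work (GH 11763)
sys.getsizeof(obj) returns the memory usage of a pandas object, including the values it contains (GH 11597)
Series gained an is_unique attribute (GH 11946)
DataFrame.quantile and Series.quantile now accept the interpolation keyword (GH 10174).
Added DataFrame.style.format for more flexible formatting of cell values (GH 11692)
DataFrame.select_dtypes now allows the np.float16 type code (GH 11990)
pivot_table() now accepts most iterables for the values parameter (GH 12017)
Added Google BigQuery service account authentication support, which enables authentication on remote servers. (GH 11881, GH 12572). See the BigQuery documentation for more details
HDFStore is now iterable: for k in store is equivalent to for k in store.keys() (GH 12221).
Added missing methods/fields to .dt for Period (GH 8848)
The entire code base has been PEP-ified (GH 12096)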
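Two of the enhancements listed above, Series.is_unique and the interpolation keyword of quantile, can be sketched as follows (the input values are illustrative assumptions):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# is_unique reports whether every value occurs exactly once
unique_flag = s.is_unique
dupes_flag = pd.Series([1, 1, 2]).is_unique

# The median of 1..4 falls between 2 and 3; interpolation picks how to resolve it
q_linear = s.quantile(0.5)                         # default: linear interpolation
q_lower = s.quantile(0.5, interpolation="lower")   # take the lower neighbour
q_higher = s.quantile(0.5, interpolation="higher") # take the higher neighbour

print(unique_flag, dupes_flag, q_linear, q_lower, q_higher)
```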

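Backwards incompatible API changes#

The leading whitespace has been removed from the output of the .to_string(index=False) method (GH 11833)
The out parameter has been removed from the Series.round() method. (GH 11763)
DataFrame.round() leaves non-numeric columns unchanged in its return value, rather than raising. (GH 11885)
DataFrame.head(0) and DataFrame.tail(0) return empty frames, rather than self. (GH 11937)
Series.head(0) and Series.tail(0) return empty series, rather than self. (GH 11937)
to_msgpack and read_msgpack encoding now defaults to 'utf-8'. (GH 12170)
The order of keyword arguments to text file parsing functions (.read_csv(), .read_table(), .read_fwf()) changed to group related arguments. (GH 11555)
NaTType.isoformat now returns the string 'NaT' to allow the result to be passed to the constructor of Timestamp. (GH 12300)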
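A brief sketch of two of the changes listed above, DataFrame.round() with a non-numeric column and head(0), using a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.234, 5.678], "label": ["a", "b"]})

# Numeric columns are rounded; the string column passes through unchanged
rounded = df.round(1)

# head(0) gives an empty frame with the same columns, not the original object
empty = df.head(0)

print(rounded)
print(len(empty), list(empty.columns))
```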

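NaT and Timedelta operations#

NaT and Timedelta have expanded arithmetic operations, which are extended to Series arithmetic where applicable. Operations defined for datetime64[ns] or timedelta64[ns] are now also defined for NaT (GH 11564).

NaT now supports arithmetic operations with integers and floats.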

In [53]: pd.NaT * 1
Out[53]: NaT

In [54]: pd.NaT * 1.5
Out[54]: NaT

In [55]: pd.NaT / 2
Out[55]: NaT

In [56]: pd.NaT * np.nan
Out[56]: NaT

NaT defines more arithmetic operations with datetime64[ns] and timedelta64[ns].

In [57]: pd.NaT / pd.NaT
Out[57]: nan

In [58]: pd.Timedelta('1s') / pd.NaT
Out[58]: nan

NaT may represent either a datetime64[ns] null or a timedelta64[ns] null. Given the ambiguity, it is treated as a timedelta64[ns], which allows more operations to succeed.

In [59]: pd.NaT + pd.NaT
Out[59]: NaT

# same as
In [60]: pd.Timedelta('1s') + pd.Timedelta('1s')
Out[60]: Timedelta('0 days 00:00:02')

as opposed to

In [3]: pd.Timestamp('19900315') + pd.Timestamp('19900315')
TypeError: unsupported operand type(s) for +: 'Timestamp' and 'Timestamp'

However, when wrapped in a Series whose dtype is datetime64[ns] or timedelta64[ns], the dtype information is respected.

In [1]: pd.Series([pd.NaT], dtype='<M8[ns]') + pd.Series([pd.NaT], dtype='<M8[ns]')
TypeError: can only operate on a datetimes for subtraction,
           but the operator [__add__] was passed

In [61]: pd.Series([pd.NaT], dtype='<m8[ns]') + pd.Series([pd.NaT], dtype='<m8[ns]')
Out[61]: 
0   NaT
Length: 1, dtype: timedelta64[ns]

Timedelta division by floats now works.

In [62]: pd.Timedelta('1s') / 2.0
Out[62]: Timedelta('0 days 00:00:00.500000')

Subtracting a Series of Timedelta values from a Timestamp now works (GH 11925)

In [63]: ser = pd.Series(pd.timedelta_range('1 day', periods=3))

In [64]: ser
Out[64]: 
0   1 days
1   2 days
2   3 days
Length: 3, dtype: timedelta64[ns]

In [65]: pd.Timestamp('2012-01-01') - ser
Out[65]: 
0   2011-12-31
1   2011-12-30
2   2011-12-29
Length: 3, dtype: datetime64[ns]

NaT.isoformat() now returns 'NaT'. This change allows pd.Timestamp to rehydrate any timestamp-like object from its isoformat (GH 12300).
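The round trip this enables can be sketched as below; the concrete timestamp is an illustrative assumption:

```python
import pandas as pd

ts = pd.Timestamp("2012-01-01 10:30:00")

# A regular timestamp survives the isoformat round trip
restored = pd.Timestamp(ts.isoformat())

# So does the missing value, because isoformat() yields the parseable string 'NaT'
nat_text = pd.NaT.isoformat()
restored_nat = pd.Timestamp(nat_text)

print(restored, nat_text, restored_nat)
```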

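Changes to msgpack#

Forward incompatible changes in the msgpack writing format were made over 0.17.0 and 0.18.0; older versions of pandas cannot read files packed by newer versions (GH 12129, GH 10527)

Bugs in to_msgpack and read_msgpack introduced in 0.17.0 and fixed in 0.18.0 caused files packed in Python 2 to be unreadable by Python 3 (GH 12142). The following table describes the backward and forward compatibility of msgpacks.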

Packed with            Can be unpacked with
pre-0.17 / Python 2    any
pre-0.17 / Python 3    any
0.17 / Python 2        ==0.17 / Python 2; >=0.18 / any Python
0.17 / Python 3        >=0.18 / any Python
0.18                   >= 0.18

Warning

0.18.0 is backward-compatible for reading files packed by older versions, except for files packed with 0.17 in Python 2, which can only be unpacked in Python 2.

Signature change for .rank#

Series.rank and DataFrame.rank now have the same signature (GH 11759)

Previous signature

In [3]: pd.Series([0,1]).rank(method='average', na_option='keep',
                              ascending=True, pct=False)
Out[3]:
0    1
1    2
dtype: float64

In [4]: pd.DataFrame([0,1]).rank(axis=0, numeric_only=None,
                                 method='average', na_option='keep',
                                 ascending=True, pct=False)
Out[4]:
   0
0  1
1  2

New signature

In [66]: pd.Series([0,1]).rank(axis=0, method='average', numeric_only=False,
   ....:                       na_option='keep', ascending=True, pct=False)
   ....: 
Out[66]: 
0    1.0
1    2.0
Length: 2, dtype: float64

In [67]: pd.DataFrame([0,1]).rank(axis=0, method='average', numeric_only=False,
   ....:                          na_option='keep', ascending=True, pct=False)
   ....: 
Out[67]: 
     0
0  1.0
1  2.0

[2 rows x 1 columns]

Bug in QuarterBegin with n=0#

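In previous versions, the behavior of the QuarterBegin offset was inconsistent, depending on the date, when the n parameter was 0. (GH 11406)

The general semantics of anchored offsets for n=0 is to not move the date when it is an anchor point (e.g., a quarter start date), and otherwise to roll forward to the next anchor point.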

In [68]: d = pd.Timestamp('2014-02-01')

In [69]: d
Out[69]: Timestamp('2014-02-01 00:00:00')

In [70]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2)
Out[70]: Timestamp('2014-02-01 00:00:00')

In [71]: d + pd.offsets.QuarterBegin(n=0, startingMonth=1)
Out[71]: Timestamp('2014-04-01 00:00:00')

For the QuarterBegin offset in previous versions, the date would be rolled backwards if it was in the same month as the quarter start date.

In [3]: d = pd.Timestamp('2014-02-15')

In [4]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2)
Out[4]: Timestamp('2014-02-01 00:00:00')

This behavior has been corrected in version 0.18.0 and is now consistent with other anchored offsets like MonthBegin and YearBegin.

In [72]: d = pd.Timestamp('2014-02-15')

In [73]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2)
Out[73]: Timestamp('2014-05-01 00:00:00')
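The n=0 semantics can be compared across anchored offsets. In this sketch the MonthBegin comparison is an illustrative addition, and the QuarterBegin values repeat the ones shown above:

```python
import pandas as pd

on_anchor = pd.Timestamp("2014-02-01")   # already a month and quarter start
off_anchor = pd.Timestamp("2014-02-15")  # in the middle of the month

month = pd.offsets.MonthBegin(n=0)
quarter = pd.offsets.QuarterBegin(n=0, startingMonth=2)

# On an anchor point n=0 leaves the date alone; otherwise it rolls forward
month_same = on_anchor + month
month_next = off_anchor + month
quarter_same = on_anchor + quarter
quarter_next = off_anchor + quarter

print(month_same, month_next, quarter_same, quarter_next)
```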

Resample API#

Like the change in the window functions API above, .resample(...) is changing to have a more groupby-like API. (GH 11732, GH 12702, GH 12202, GH 12332, GH 12334, GH 12348, GH 12448).

In [74]: np.random.seed(1234)

In [75]: df = pd.DataFrame(np.random.rand(10,4),
   ....:                   columns=list('ABCD'),
   ....:                   index=pd.date_range('2010-01-01 09:00:00',
   ....:                                       periods=10, freq='s'))
   ....: 

In [76]: df
Out[76]: 
                            A         B         C         D
2010-01-01 09:00:00  0.191519  0.622109  0.437728  0.785359
2010-01-01 09:00:01  0.779976  0.272593  0.276464  0.801872
2010-01-01 09:00:02  0.958139  0.875933  0.357817  0.500995
2010-01-01 09:00:03  0.683463  0.712702  0.370251  0.561196
2010-01-01 09:00:04  0.503083  0.013768  0.772827  0.882641
2010-01-01 09:00:05  0.364886  0.615396  0.075381  0.368824
2010-01-01 09:00:06  0.933140  0.651378  0.397203  0.788730
2010-01-01 09:00:07  0.316836  0.568099  0.869127  0.436173
2010-01-01 09:00:08  0.802148  0.143767  0.704261  0.704581
2010-01-01 09:00:09  0.218792  0.924868  0.442141  0.909316

[10 rows x 4 columns]

Previous API:

You would write a resampling operation that evaluated immediately. If a how parameter was not provided, it would default to how='mean'.

In [6]: df.resample('2s')
Out[6]:
                         A         B         C         D
2010-01-01 09:00:00  0.485748  0.447351  0.357096  0.793615
2010-01-01 09:00:02  0.820801  0.794317  0.364034  0.531096
2010-01-01 09:00:04  0.433985  0.314582  0.424104  0.625733
2010-01-01 09:00:06  0.624988  0.609738  0.633165  0.612452
2010-01-01 09:00:08  0.510470  0.534317  0.573201  0.806949

You could also specify a how directly

In [7]: df.resample('2s', how='sum')
Out[7]:
                         A         B         C         D
2010-01-01 09:00:00  0.971495  0.894701  0.714192  1.587231
2010-01-01 09:00:02  1.641602  1.588635  0.728068  1.062191
2010-01-01 09:00:04  0.867969  0.629165  0.848208  1.251465
2010-01-01 09:00:06  1.249976  1.219477  1.266330  1.224904
2010-01-01 09:00:08  1.020940  1.068634  1.146402  1.613897

New API:

Now, you can write .resample(..) as a 2-stage operation like .groupby(...), which yields a Resampler.

In [77]: r = df.resample('2s')

In [78]: r
Out[78]: <pandas.core.resample.DatetimeIndexResampler object at 0xfffeaec75300>

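Downsampling#

You can then use this object to perform operations. These are downsampling operations (going from a higher frequency to a lower one).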

In [79]: r.mean()
Out[79]: 
                            A         B         C         D
2010-01-01 09:00:00  0.485748  0.447351  0.357096  0.793615
2010-01-01 09:00:02  0.820801  0.794317  0.364034  0.531096
2010-01-01 09:00:04  0.433985  0.314582  0.424104  0.625733
2010-01-01 09:00:06  0.624988  0.609738  0.633165  0.612452
2010-01-01 09:00:08  0.510470  0.534317  0.573201  0.806949

[5 rows x 4 columns]

In [80]: r.sum()
Out[80]: 
                            A         B         C         D
2010-01-01 09:00:00  0.971495  0.894701  0.714192  1.587231
2010-01-01 09:00:02  1.641602  1.588635  0.728068  1.062191
2010-01-01 09:00:04  0.867969  0.629165  0.848208  1.251465
2010-01-01 09:00:06  1.249976  1.219477  1.266330  1.224904
2010-01-01 09:00:08  1.020940  1.068634  1.146402  1.613897

[5 rows x 4 columns]

Furthermore, resample now supports getitem operations to perform the resample on specific columns.

In [81]: r[['A','C']].mean()
Out[81]: 
                            A         C
2010-01-01 09:00:00  0.485748  0.357096
2010-01-01 09:00:02  0.820801  0.364034
2010-01-01 09:00:04  0.433985  0.424104
2010-01-01 09:00:06  0.624988  0.633165
2010-01-01 09:00:08  0.510470  0.573201

[5 rows x 2 columns]

and .aggregate type operations.

In [82]: r.agg({'A' : 'mean', 'B' : 'sum'})
Out[82]: 
                            A         B
2010-01-01 09:00:00  0.485748  0.894701
2010-01-01 09:00:02  0.820801  1.588635
2010-01-01 09:00:04  0.433985  0.629165
2010-01-01 09:00:06  0.624988  1.219477
2010-01-01 09:00:08  0.510470  1.068634

[5 rows x 2 columns]

These accessors can of course be combined

In [83]: r[['A','B']].agg(['mean','sum'])
Out[83]: 
                            A                   B          
                         mean       sum      mean       sum
2010-01-01 09:00:00  0.485748  0.971495  0.447351  0.894701
2010-01-01 09:00:02  0.820801  1.641602  0.794317  1.588635
2010-01-01 09:00:04  0.433985  0.867969  0.314582  0.629165
2010-01-01 09:00:06  0.624988  1.249976  0.609738  1.219477
2010-01-01 09:00:08  0.510470  1.020940  0.534317  1.068634

[5 rows x 4 columns]

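Upsampling#

Upsampling operations take you from a lower frequency to a higher frequency. These are now performed on the Resampler object with the backfill(), ffill(), fillna() and asfreq() methods.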

In [89]: s = pd.Series(np.arange(5, dtype='int64'),
              index=pd.date_range('2010-01-01', periods=5, freq='Q'))

In [90]: s
Out[90]:
2010-03-31    0
2010-06-30    1
2010-09-30    2
2010-12-31    3
2011-03-31    4
Freq: Q-DEC, Length: 5, dtype: int64

Previously

In [6]: s.resample('M', fill_method='ffill')
Out[6]:
2010-03-31    0
2010-04-30    0
2010-05-31    0
2010-06-30    1
2010-07-31    1
2010-08-31    1
2010-09-30    2
2010-10-31    2
2010-11-30    2
2010-12-31    3
2011-01-31    3
2011-02-28    3
2011-03-31    4
Freq: M, dtype: int64

New API

In [91]: s.resample('M').ffill()
Out[91]:
2010-03-31    0
2010-04-30    0
2010-05-31    0
2010-06-30    1
2010-07-31    1
2010-08-31    1
2010-09-30    2
2010-10-31    2
2010-11-30    2
2010-12-31    3
2011-01-31    3
2011-02-28    3
2011-03-31    4
Freq: M, Length: 13, dtype: int64

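Note

In the new API, you can either downsample OR upsample. The prior implementation allowed you to pass an aggregator function (like mean) even though you were upsampling, which caused some confusion.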
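To make the difference between the upsampling methods concrete, here is a small sketch on a made-up minute-frequency Series upsampled to 30 seconds. It uses ffill(), asfreq() and bfill(), the short spelling of backfill:

```python
import pandas as pd

s = pd.Series([0, 1, 2],
              index=pd.date_range("2010-01-01", periods=3, freq="min"))

r = s.resample("30s")   # deferred Resampler; nothing is computed yet

forward = r.ffill()     # new slots take the previous observation
backward = r.bfill()    # new slots take the next observation
gaps = r.asfreq()       # new slots are left as NaN

print(forward.tolist(), backward.tolist(), int(gaps.isna().sum()))
```

None of these calls aggregates anything, which is why the new API keeps them separate from reductions such as mean().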

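Previous API will work but with deprecations#

Warning

This new API for resample includes some internal changes to accommodate the pre-0.18.0 API: in most cases the resample operation returns a deferred object and works with a deprecation warning. We can intercept operations and do what the pre-0.18.0 API did (with a warning). Here is a typical use case: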

In [4]: r = df.resample('2s')

In [6]: r*10
pandas/tseries/resample.py:80: FutureWarning: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)

Out[6]:
                      A         B         C         D
2010-01-01 09:00:00  4.857476  4.473507  3.570960  7.936154
2010-01-01 09:00:02  8.208011  7.943173  3.640340  5.310957
2010-01-01 09:00:04  4.339846  3.145823  4.241039  6.257326
2010-01-01 09:00:06  6.249881  6.097384  6.331650  6.124518
2010-01-01 09:00:08  5.104699  5.343172  5.732009  8.069486

However, getting and assignment operations directly on a Resampler will raise a ValueError:

In [7]: r.iloc[0] = 5
ValueError: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)

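There is one situation where the new API cannot reproduce the old result when the original code is used unchanged. The following code is intended to resample every 2s, take the mean, and then take the min of those results.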

In [4]: df.resample('2s').min()
Out[4]:
A    0.433985
B    0.314582
C    0.357096
D    0.531096
dtype: float64

The new API will:

In [84]: df.resample('2s').min()
Out[84]: 
                            A         B         C         D
2010-01-01 09:00:00  0.191519  0.272593  0.276464  0.785359
2010-01-01 09:00:02  0.683463  0.712702  0.357817  0.500995
2010-01-01 09:00:04  0.364886  0.013768  0.075381  0.368824
2010-01-01 09:00:06  0.316836  0.568099  0.397203  0.436173
2010-01-01 09:00:08  0.218792  0.143767  0.442141  0.704581

[5 rows x 4 columns]

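The good news is that the return dimensions differ between the new API and the old API, so this should loudly raise an exception.

To replicate the original operation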

In [85]: df.resample('2s').mean().min()
Out[85]: 
A    0.433985
B    0.314582
C    0.357096
D    0.531096
Length: 4, dtype: float64

Changes to eval#

In prior versions, new column assignments in an eval expression resulted in an inplace change to the DataFrame. (GH 9297, GH 8664, GH 10486)

In [86]: df = pd.DataFrame({'a': np.linspace(0, 10, 5), 'b': range(5)})

In [87]: df
Out[87]: 
      a  b
0   0.0  0
1   2.5  1
2   5.0  2
3   7.5  3
4  10.0  4

[5 rows x 2 columns]

In [12]: df.eval('c = a + b')
FutureWarning: eval expressions containing an assignment currentlydefault to operating inplace.
This will change in a future version of pandas, use inplace=True to avoid this warning.

In [13]: df
Out[13]:
      a  b     c
0   0.0  0   0.0
1   2.5  1   3.5
2   5.0  2   7.0
3   7.5  3  10.5
4  10.0  4  14.0

In version 0.18.0, a new inplace keyword was added to choose whether the assignment should be done inplace or return a copy.

In [88]: df
Out[88]: 
      a  b     c
0   0.0  0   0.0
1   2.5  1   3.5
2   5.0  2   7.0
3   7.5  3  10.5
4  10.0  4  14.0

[5 rows x 3 columns]

In [89]: df.eval('d = c - b', inplace=False)
Out[89]: 
      a  b     c     d
0   0.0  0   0.0   0.0
1   2.5  1   3.5   2.5
2   5.0  2   7.0   5.0
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

[5 rows x 4 columns]

In [90]: df
Out[90]: 
      a  b     c
0   0.0  0   0.0
1   2.5  1   3.5
2   5.0  2   7.0
3   7.5  3  10.5
4  10.0  4  14.0

[5 rows x 3 columns]

In [91]: df.eval('d = c - b', inplace=True)

In [92]: df
Out[92]: 
      a  b     c     d
0   0.0  0   0.0   0.0
1   2.5  1   3.5   2.5
2   5.0  2   7.0   5.0
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

[5 rows x 4 columns]

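Warning

For backwards compatibility, inplace defaults to True if not specified. This will change in a future version of pandas. If your code depends on an inplace assignment, you should update it to explicitly set inplace=True

The inplace keyword parameter was also added to the query method.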

In [93]: df.query('a > 5')
Out[93]: 
      a  b     c     d
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

[2 rows x 4 columns]

In [94]: df.query('a > 5', inplace=True)

In [95]: df
Out[95]: 
      a  b     c     d
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

[2 rows x 4 columns]

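Warning

Note that the default value for inplace in a query is False, which is consistent with prior versions.

eval has also been updated to allow multi-line expressions for multiple assignments. These expressions will be evaluated one at a time, in order. Only assignments are valid for multi-line expressions.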

In [96]: df
Out[96]: 
      a  b     c     d
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

[2 rows x 4 columns]

In [97]: df.eval("""
   ....: e = d + a
   ....: f = e - 22
   ....: g = f / 2.0""", inplace=True)
   ....: 

In [98]: df
Out[98]: 
      a  b     c     d     e    f    g
3   7.5  3  10.5   7.5  15.0 -7.0 -3.5
4  10.0  4  14.0  10.0  20.0 -2.0 -1.0

[2 rows x 7 columns]
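Multi-line assignment also combines with inplace=False, in which case the original frame is left untouched and a new one is returned. A minimal sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Each line is an assignment and may refer to columns created by earlier lines
out = df.eval("""
c = a + b
d = c * 2
""", inplace=False)

print(list(df.columns))    # the original frame keeps its two columns
print(list(out.columns))   # the returned copy carries the new ones
```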

Other API changes#

DataFrame.between_time and Series.between_time now only parse a fixed set of time strings. Parsing of date strings is no longer supported and raises a ValueError. (GH 11818)

In [107]: s = pd.Series(range(10), pd.date_range('2015-01-01', freq='H', periods=10))

In [108]: s.between_time("7:00am", "9:00am")
Out[108]:
2015-01-01 07:00:00    7
2015-01-01 08:00:00    8
2015-01-01 09:00:00    9
Freq: H, Length: 3, dtype: int64

This will now raise.

In [2]: s.between_time('20150101 07:00:00','20150101 09:00:00')
ValueError: Cannot convert arg ['20150101 07:00:00'] to a time.

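.memory_usage() now includes values in the index, as does memory_usage in .info() (GH 11597)
DataFrame.to_latex() now supports non-ascii encodings (e.g. utf-8) in Python 2 with the parameter encoding (GH 7061)
pandas.merge() and DataFrame.merge() will show a specific error message when trying to merge with an object that is not of type DataFrame or a subclass (GH 12081)
DataFrame.unstack and Series.unstack now take the fill_value keyword to allow direct replacement of missing values when an unstack results in missing values in the resulting DataFrame. As an added benefit, specifying fill_value will preserve the data type of the original stacked data. (GH 9746)
As part of the new API for window functions and resampling, aggregation functions have been clarified, raising more informative error messages on invalid aggregations. (GH 9052). A full set of examples is presented in the groupby documentation.
Statistical functions for NDFrame objects (like sum(), mean(), min()) will now raise if non-numpy-compatible arguments are passed in for **kwargs (GH 12301)
.to_latex and .to_html gain a decimal parameter like .to_csv; the default is '.' (GH 12031)
More helpful error message when constructing a DataFrame with empty data but with indices (GH 8020)
.describe() will now properly handle bool dtype as a categorical (GH 6625)
More helpful error message for an invalid .transform with user defined input (GH 10165)
Exponentially weighted functions now allow specifying alpha directly (GH 10789) and raise ValueError if parameters violate 0 < alpha <= 1 (GH 12492)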
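Two items from this list, the fill_value keyword of unstack and the alpha validation of exponentially weighted functions, can be sketched as follows (all data and labels are illustrative assumptions):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([("x", "a"), ("x", "b"), ("y", "a")])
s = pd.Series([1, 2, 3], index=index)

# ("y", "b") does not exist; fill_value supplies it and keeps the integer dtype
wide = s.unstack(fill_value=0)

# alpha outside (0, 1] is rejected with a ValueError
try:
    pd.Series([1.0, 2.0]).ewm(alpha=1.5).mean()
except ValueError as exc:
    alpha_error = str(exc)
else:
    alpha_error = None

print(wide)
print(alpha_error)
```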

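Deprecations#

The functions pd.rolling_*, pd.expanding_*, and pd.ewm* are deprecated and replaced by the corresponding method call. Note that the new suggested syntax includes all of the arguments (even if default) (GH 11603)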

In [1]: s = pd.Series(range(3))

In [2]: pd.rolling_mean(s,window=2,min_periods=1)
        FutureWarning: pd.rolling_mean is deprecated for Series and
             will be removed in a future version, replace with
             Series.rolling(min_periods=1,window=2,center=False).mean()
Out[2]:
        0    0.0
        1    0.5
        2    1.5
        dtype: float64

In [3]: pd.rolling_cov(s, s, window=2)
        FutureWarning: pd.rolling_cov is deprecated for Series and
             will be removed in a future version, replace with
             Series.rolling(window=2).cov(other=<Series>)
Out[3]:
        0    NaN
        1    0.5
        2    0.5
        dtype: float64

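The freq and how arguments to the .rolling, .expanding, and .ewm (new) functions are deprecated and will be removed in a future version. You can simply resample the input prior to creating a window function. (GH 11603).

For example, instead of s.rolling(window=5,freq='D').max() to get the max value on a rolling 5 day window, one could use s.resample('D').mean().rolling(window=5).max(), which first resamples the data to daily data, then provides a rolling 5 day window.
The pd.tseries.frequencies.get_offset_name function is deprecated. Use the offset's .freqstr property as an alternative (GH 11192)
pandas.stats.fama_macbeth routines are deprecated and will be removed in a future version (GH 6077)
pandas.stats.ols, pandas.stats.plm and pandas.stats.var routines are deprecated and will be removed in a future version (GH 6077)
Show a FutureWarning rather than a DeprecationWarning on using long-deprecated syntax in HDFStore.select, where the where clause is not string-like (GH 12027)
The pandas.options.display.mpl_style configuration has been deprecated and will be removed in a future version of pandas. This functionality is better handled by matplotlib's style sheets (GH 11783).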
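The resample-then-roll replacement for the deprecated freq argument might look like the sketch below. The data, with two observations per day, and the window of 2 are assumptions chosen to keep the numbers easy to follow:

```python
import pandas as pd

# Two observations per day over four days (720 minutes = 12 hours apart)
idx = pd.date_range("2016-01-01", periods=8, freq="720min")
s = pd.Series(range(8), index=idx, dtype="float64")

# Step 1: bring the data to daily frequency explicitly
daily = s.resample("D").mean()

# Step 2: apply the window function to the regular daily series
rolling_max = daily.rolling(window=2).max()

print(daily.tolist())
print(rolling_max.tolist())
```

Splitting the two steps makes the choice of aggregation (here mean) visible instead of leaving it to the how argument.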

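Removal of deprecated float indexers#

In GH 4892, indexing with floating point numbers on a non-Float64Index was deprecated (in version 0.14.0). In 0.18.0, this deprecation warning is removed and these operations will now raise a TypeError. (GH 12165, GH 12333)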

In [99]: s = pd.Series([1, 2, 3], index=[4, 5, 6])

In [100]: s
Out[100]: 
4    1
5    2
6    3
Length: 3, dtype: int64

In [101]: s2 = pd.Series([1, 2, 3], index=list('abc'))

In [102]: s2
Out[102]: 
a    1
b    2
c    3
Length: 3, dtype: int64

Previous behavior:

# this is label indexing
In [2]: s[5.0]
FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[2]: 2

# this is positional indexing
In [3]: s.iloc[1.0]
FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[3]: 2

# this is label indexing
In [4]: s.loc[5.0]
FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[4]: 2

# .ix would coerce 1.0 to the positional 1, and index
In [5]: s2.ix[1.0] = 10
FutureWarning: scalar indexers for index type Index should be integers and not floating point

In [6]: s2
Out[6]:
a     1
b    10
c     3
dtype: int64

New behavior:

For iloc, getting and setting via a float scalar will always raise.

In [3]: s.iloc[2.0]
TypeError: cannot do label indexing on <class 'pandas.indexes.numeric.Int64Index'> with these indexers [2.0] of <type 'float'>

Other indexers will coerce integer-like floats to integers for both getting and setting. The FutureWarning has been dropped for .loc, .ix and [].

In [103]: s[5.0]
Out[103]: 2

In [104]: s.loc[5.0]
Out[104]: 2

and setting

In [105]: s_copy = s.copy()

In [106]: s_copy[5.0] = 10

In [107]: s_copy
Out[107]: 
4     1
5    10
6     3
Length: 3, dtype: int64

In [108]: s_copy = s.copy()

In [109]: s_copy.loc[5.0] = 10

In [110]: s_copy
Out[110]: 
4     1
5    10
6     3
Length: 3, dtype: int64

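Positional setting with .ix and a float indexer will now add this value to the index as a new label, rather than setting the value by position as it did previously.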

In [3]: s2.ix[1.0] = 10
In [4]: s2
Out[4]:
a       1
b       2
c       3
1.0    10
dtype: int64

Slicing will also coerce integer-like floats to integers for a non-Float64Index.

In [111]: s.loc[5.0:6]
Out[111]: 
5    2
6    3
Length: 2, dtype: int64

Note that for floats that are not coercible to integers, the label-based bounds will be excluded.

In [112]: s.loc[5.1:6]
Out[112]: 
6    3
Length: 1, dtype: int64

Float indexing on a Float64Index is unchanged.

In [113]: s = pd.Series([1, 2, 3], index=np.arange(3.))

In [114]: s[1.0]
Out[114]: 2

In [115]: s[1.0:2.5]
Out[115]: 
1.0    2
2.0    3
Length: 2, dtype: int64

Removal of prior version deprecations/changes#

Removal of rolling_corr_pairwise in favor of .rolling().corr(pairwise=True) (GH 4950)
Removal of expanding_corr_pairwise in favor of .expanding().corr(pairwise=True) (GH 4950)
Removal of the DataMatrix module. This was not imported into the pandas namespace in any event (GH 12111)
Removal of the cols keyword in favor of subset in DataFrame.duplicated() and DataFrame.drop_duplicates() (GH 6680)
Removal of the read_frame and frame_query (both aliases for pd.read_sql) and write_frame (alias of to_sql) functions in the pd.io.sql namespace, deprecated since 0.14.0 (GH 6292).
Removal of the order keyword from .factorize() (GH 6930)
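Two of the replacements listed above can be sketched as follows. The frame and its column names are made up for this illustration. The container returned by the pairwise correlation depends on the pandas version: the sketch assumes a recent release, where the result is a DataFrame with a MultiIndex, which is not what 0.18.0 itself returned.

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
df = pd.DataFrame(
    {
        "k": ["a", "a", "b", "b"],
        "x": [1.0, 2.0, 3.0, 5.0],
        "y": [2.0, 1.0, 4.0, 3.0],
    }
)

# `subset` replaces the removed `cols` keyword.
is_dup = df.duplicated(subset=["k"])
deduped = df.drop_duplicates(subset=["k"])

# Rolling pairwise correlation replaces the removed rolling_corr_pairwise.
# Assumption: a recent pandas, where this returns a MultiIndexed DataFrame
# with one (row label, column name) entry per input row and column.
pairwise = df[["x", "y"]].rolling(window=3).corr(pairwise=True)
```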

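Performance improvements#

Improved performance of andrews_curves (GH 11534)
Improved performance of operations on huge DatetimeIndex, PeriodIndex and TimedeltaIndex objects, including those containing NaT (GH 10277)
Improved performance of pandas.concat (GH 11958)
Improved performance of StataReader (GH 11591)
Improved performance in construction of Categoricals with Series of datetimes containing NaT (GH 12077)
Improved performance of ISO 8601 date parsing for dates without separators (GH 11899), with leading zeros (GH 11871) and with white space preceding the time zone (GH 9714)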

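Bug fixes#

Bug in GroupBy.size when the data frame is empty. (GH 11699)
Bug in Period.end_time when a multiple of a time period is requested (GH 11738)
Bug in .clip when used with tz-aware datetimes (GH 11838)
Bug in date_range when the boundaries fell on the frequency (GH 11804, GH 12409)
Bug in consistency of passing nested dicts to .groupby(...).agg(...) (GH 9052)
Accept unicode in the Timedelta constructor (GH 11995)
Bug in value label reading for StataReader when reading incrementally (GH 12014)
Bug in vectorized DateOffset when the n parameter is 0 (GH 11370)
Compatibility with numpy 1.11 regarding NaT comparison changes (GH 12049)
Bug in read_csv when reading from a StringIO (GH 11790)
Bug in not treating NaT as a missing value when factorizing and with Categoricals (GH 12077)
Bug in getitem when the values of a Series were tz-aware (GH 12089)
Bug in Series.str.get_dummies when one of the variables was 'name' (GH 12180)
Bug in pd.concat while concatenating tz-aware NaT series. (GH 11693, GH 11755, GH 12217)
Bug in pd.read_stata with version <= 108 files (GH 12232)
Bug in Series.resample using a frequency of Nano when the index is a DatetimeIndex and contains non-zero nanosecond parts (GH 12037)
Bug in resampling with .nunique and a sparse index (GH 12352)
Removed some compiler warnings (GH 12471)
Work around compatibility issues with boto in Python 3.5 (GH 11915)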
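Bug in NaT subtraction from Timestamp or DatetimeIndex with timezones (GH 11718)
Bug in subtraction of a Series of a single tz-aware Timestamp (GH 12290)
Use compat iterators in PY2 to support .next() (GH 12299)
Bug in Timedelta.round with negative values (GH 11690)
Bug in .loc against CategoricalIndex that may result in a normal Index (GH 11586)
Bug in DataFrame.info when duplicated column names exist (GH 11761)
Bug in .copy of datetime tz-aware objects (GH 11794)
Bug in Series.apply and Series.map where timedelta64 was not boxed (GH 11349)
Bug in DataFrame.set_index() with a tz-aware Series (GH 12358)
Bug in subclasses of DataFrame where AttributeError did not propagate (GH 11808)
Bug in groupby on tz-aware data where selection did not return Timestamp (GH 11616)
Bug in pd.read_clipboard and pd.to_clipboard functions not supporting Unicode; the bundled pyperclip was upgraded to v1.5.15 (GH 9263)
Bug in DataFrame.query containing an assignment (GH 8664)
Bug in from_msgpack where __contains__() fails for columns of the unpacked DataFrame, if the DataFrame has object columns. (GH 11880)
Bug in .resample on categorical data with a TimedeltaIndex (GH 12169)
Bug where timezone info was lost when broadcasting a scalar datetime to a DataFrame (GH 11682)
Bug in Index creation from Timestamp where mixed timezones were coerced to UTC (GH 11488)
Bug in to_numeric where it does not raise if the input has more than one dimension (GH 11776)
Bug in parsing timezone offset strings with non-zero minutes (GH 11708)
Bug in df.plot using incorrect colors for bar plots under matplotlib 1.5+ (GH 11614)
Bug in the groupby plot method when using keyword arguments (GH 11805).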
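Bug in DataFrame.duplicated and drop_duplicates causing spurious matches when setting keep=False (GH 11864)
Bug in .loc where a result with a duplicated key may have an Index with an incorrect dtype (GH 11497)
Bug in pd.rolling_median where memory allocation failed even with sufficient memory (GH 11696)
Bug in DataFrame.style with spurious zeros (GH 12134)
Bug in DataFrame.style with integer columns not starting at 0 (GH 12125)
Bug in .style.bar that may not render properly in specific browsers (GH 11678)
Bug in rich comparison of Timedelta with a numpy.array of Timedelta that caused an infinite recursion (GH 11835)
Bug in DataFrame.round dropping the column index name (GH 11986)
Bug in df.replace while replacing values in a mixed-dtype DataFrame (GH 11698)
Bug in Index that prevented copying the name of a passed Index when a new name is not provided (GH 11193)
Bug in read_excel failing to read any non-empty sheets when empty sheets exist and sheetname=None (GH 11711)
Bug in read_excel failing to raise a NotImplemented error when the keywords parse_dates and date_parser are provided (GH 11544)
Bug in read_sql with pymysql connections failing to return chunked data (GH 11522)
Bug in .to_csv ignoring the formatting parameters decimal, na_rep and float_format for float indexes (GH 11553)
Bug in Int64Index and Float64Index preventing the use of the modulo operator (GH 9244)
Bug in MultiIndex.drop for MultiIndexes that are not lexsorted (GH 12078)
Bug in DataFrame when masking an empty DataFrame (GH 11859)
Bug in .plot potentially modifying the colors input when the number of columns did not match the number of series provided (GH 12039).
Bug in Series.plot failing when the index has a CustomBusinessDay frequency (GH 7222).
Bug in .to_sql for datetime.time values with the sqlite fallback (GH 8341)
Bug in read_excel failing to read data with one column when squeeze=True (GH 12157)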
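Bug in read_excel failing to read one empty column (GH 12292, GH 9002)
Bug in .groupby where a KeyError was not raised for a wrong column if there was only one row in the data frame (GH 11741)
Bug in .read_csv with dtype specified on empty data producing an error (GH 12048)
Bug in .read_csv where strings like '2E' are treated as valid floats (GH 12237)
Bug in building pandas with debugging symbols (GH 12123)
Removed the millisecond property of DatetimeIndex. This would always raise a ValueError (GH 12019).
Bug in the Series constructor with read-only data (GH 11502)
Removed pandas._testing.choice(). Use np.random.choice() instead. (GH 12386)
Bug in the .loc setitem indexer preventing the use of a tz-aware DatetimeIndex (GH 12050)
Bug in .style where indexes and MultiIndexes did not appear (GH 11655)
Bug in to_msgpack and from_msgpack which did not correctly serialize or deserialize NaT (GH 12307).
Bug in .skew and .kurt due to roundoff error for highly similar values (GH 11974)
Bug in the Timestamp constructor where microsecond resolution was lost if HHMMSS were not separated with ':' (GH 10041)
Bug in buffer_rd_bytes where src->buffer could be freed more than once if reading failed, causing a segfault (GH 12098)
Bug in crosstab where arguments with non-overlapping indexes would return a KeyError (GH 10291)
Bug in DataFrame.apply in which reduction was not being prevented for cases in which dtype was not a numpy dtype (GH 12244)
Bug when initializing a categorical series with a scalar value. (GH 12336)
Bug when specifying a UTC DatetimeIndex by setting utc=True in .to_datetime (GH 11934)
Bug when increasing the buffer size of the CSV reader in read_csv (GH 12494)
Bug when setting columns of a DataFrame with duplicate column names (GH 12344)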

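Contributors#

A total of 101 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.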

ARF +
Alex Alekseyev +
Andrew McPherson +
Andrew Rosenfeld
Andy Hayden
Anthonios Partheniou
Anton I. Sipos
Ben +
Ben North +
Bran Yang +
Chris
Chris Carroux +
Christopher C. Aycock +
Christopher Scanlin +
Cody +
Da Wang +
Daniel Grady +
Dorozhko Anton +
Dr-Irv +
Erik M. Bray +
Evan Wright
Francis T. O’Donovan +
Frank Cleary +
Gianluca Rossi
Graham Jeffries +
Guillaume Horel
Henry Hammond +
Isaac Schwabacher +
Jean-Mathieu Deschenes
Jeff Reback
Joe Jevnik +
John Freeman +
John Fremlin +
Jonas Hoersch +
Joris Van den Bossche
Joris Vankerschaver
Justin Lecher
Justin Lin +
Ka Wo Chen
Keming Zhang +
Kerby Shedden
Kyle +
Marco Farrugia +
MasonGallo +
MattRijk +
Matthew Lurie +
Maximilian Roos
Mayank Asthana +
Mortada Mehyar
Moussa Taifi +
Navreet Gill +
Nicolas Bonnotte
Paul Reiners +
Philip Gura +
Pietro Battiston
RahulHP +
Randy Carnevale
Rinoc Johnson
Rishipuri +
Sangmin Park +
Scott E Lasley
Sereger13 +
Shannon Wang +
Skipper Seabold
Thierry Moisan
Thomas A Caswell
Toby Dylan Hocking +
Tom Augspurger
Travis +
Trent Hauck
Tux1
Varun
Wes McKinney
Will Thompson +
Yoav Ram
Yoong Kang Lim +
Yoshiki Vázquez Baeza
Young Joong Kim +
Younggun Kim
Yuval Langer +
alex argunov +
behzad nouri
boombard +
brian-pantano +
chromy +
daniel +
dgram0 +
gfyoung +
hack-c +
hcontrast +
jfoo +
kaustuv deolal +
llllllllll
ranarag +
rockg
scls19fr
seales +
sinhrks
srib +
surveymedia.ca +
tworec +
