Column operations - MaxCompute, a cloud-native big data computing service


# `o` is assumed to be an already-initialized MaxCompute (ODPS) entry object.
from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))
lens = DataFrame(o.get_table('pyodps_ml_100k_lens'))

When you add a constant to a Sequence or apply the sin function to it, the operation is applied to every element of the Sequence.

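The DataFrame API provides several built-in functions for handling NULL values: isnull checks whether a field is NULL, notnull checks whether a field is not NULL, and fillna replaces NULL with a value that you specify.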

>>> iris.sepallength.isnull().head(5)
   sepallength
0        False
1        False
2        False
3        False
4        False
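
To make the semantics of these three functions concrete, the following is a minimal plain-Python sketch that models a column as a list and NULL as None. The helper functions reuse the DataFrame method names for readability, but they are hypothetical stand-ins and not the PyODPS implementation.

```python
# Plain-Python model of isnull / notnull / fillna on a single column.
# Assumption: NULL is represented by None; these helpers are illustrative only.

def isnull(column):
    """Return True for every element that is NULL."""
    return [value is None for value in column]

def notnull(column):
    """Return True for every element that is not NULL."""
    return [value is not None for value in column]

def fillna(column, fill_value):
    """Replace every NULL element with fill_value."""
    return [fill_value if value is None else value for value in column]

sepallength = [5.1, None, 4.7, None, 5.0]  # made-up sample data
null_flags = isnull(sepallength)
filled = fillna(sepallength, 0.0)
print(null_flags)
print(filled)
```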


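ifelse applies to BOOLEAN fields. When the condition is true, it returns the first argument (index 0); otherwise, it returns the second argument (index 1).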

>>> (iris.sepallength > 5).ifelse('gt5', 'lte5').rename('cmp5').head(5)
0   gt5
1  lte5
2  lte5
3  lte5
4  lte5
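
The element-wise behavior of ifelse can be sketched in plain Python as follows. The ifelse function here is a hypothetical local helper over lists, not the PyODPS method, and it does not model how NULL conditions are treated.

```python
# Plain-Python model of ifelse: pick one of two values per element.
# Assumption: the condition column contains only True/False (no NULLs).

def ifelse(condition, then_value, else_value):
    """Return then_value where the condition holds, else_value otherwise."""
    return [then_value if flag else else_value for flag in condition]

sepallength = [5.1, 4.9, 4.7, 4.6, 5.0]  # first five rows shown above
cmp5 = ifelse([value > 5 for value in sepallength], 'gt5', 'lte5')
print(cmp5)
```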


switch handles cases that require multiple conditions.

>>> iris.sepallength.switch(4.9, 'eq4.9', 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
   equalness
0       noeq
1      eq4.9
2       noeq
3       noeq
4      eq5.0

>>> from odps.df import switch
>>> switch(iris.sepallength == 4.9, 'eq4.9', iris.sepallength == 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
   equalness
0       noeq
1      eq4.9
2       noeq
3       noeq
4      eq5.0
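
As a rough mental model, switch can be read as a first-match lookup over (value, result) pairs with a fallback. The sketch below is plain Python over a list; the switch function is a hypothetical helper written for this illustration, and first-match ordering is an assumption rather than documented behavior.

```python
# Plain-Python model of the method-style switch shown above.
# Assumption: cases are tested in order and the first match wins.

def switch(column, *cases, default=None):
    """cases is a flat sequence: value1, result1, value2, result2, ..."""
    if len(cases) % 2 != 0:
        raise ValueError('cases must be given as value/result pairs')
    pairs = list(zip(cases[0::2], cases[1::2]))
    result = []
    for value in column:
        for candidate, label in pairs:
            if value == candidate:
                result.append(label)
                break
        else:
            # No case matched this element.
            result.append(default)
    return result

sepallength = [5.1, 4.9, 4.7, 4.6, 5.0]
equalness = switch(sepallength, 4.9, 'eq4.9', 5.0, 'eq5.0', default='noeq')
print(equalness)
```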

PyODPS 0.7.8 and later support modifying part of the values in a column of a dataset based on a condition, as follows.

>>> iris[iris.sepallength > 5, 'cmp5'] = 'gt5'
>>> iris[iris.sepallength <= 5, 'cmp5'] = 'lte5'
>>> iris.head(5)
0   gt5
1  lte5
2  lte5
3  lte5
4  lte5

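Numeric fields support addition (+), subtraction (-), multiplication (*), division (/), and similar operations, as well as mathematical functions such as log and sin.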

>>> (iris.sepallength * 10).log().head(5)
   sepallength
0     3.931826
1     3.891820
2     3.850148
3     3.828641
4     3.912023

>>> fields = [iris.sepallength,
...           (iris.sepallength / 2).rename('sepallength除以2'),
...           (iris.sepallength ** 2).rename('sepallength的平方')]
>>> iris[fields].head(5)
   sepallength  sepallength除以2  sepallength的平方
0          5.1              2.55             26.01
1          4.9              2.45             24.01
2          4.7              2.35             22.09
3          4.6              2.30             21.16
4          5.0              2.50             25.00
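
The outputs above can be cross-checked with the standard library. The sketch below applies the same element-wise arithmetic to the five sample values in plain Python; it assumes that log is the natural logarithm, which is consistent with the values shown.

```python
import math

# Element-wise arithmetic on the first five sepallength values shown above.
# Assumption: log() is the natural logarithm (base e).
sepallength = [5.1, 4.9, 4.7, 4.6, 5.0]

log_scaled = [math.log(value * 10) for value in sepallength]
halved = [value / 2 for value in sepallength]
squared = [value ** 2 for value in sepallength]

for row in zip(sepallength, log_scaled, halved, squared):
    print('%.1f  %.6f  %.2f  %.2f' % row)
```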

The supported arithmetic operations are as follows.

The DataFrame API does not support chained comparisons such as 3 <= iris.sepallength <= 5, but the between function can be used to check whether iris.sepallength falls within a given interval.

>>> (iris.sepallength.between(3, 5)).head(5)
   sepallength
0        False
1         True
2         True
3         True
4         True

By default, between includes both endpoints of the interval. To test an open interval, set inclusive=False.

>>> (iris.sepallength.between(3, 5, inclusive=False)).head(5)
   sepallength
0        False
1         True
2         True
3         True
4        False
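
The difference between the closed and open interval can be modeled in a few lines of plain Python. The between function below is a hypothetical list-based helper that only mirrors the inclusive flag described above; it is not the PyODPS implementation.

```python
# Plain-Python model of between with the inclusive flag.

def between(column, left, right, inclusive=True):
    """Closed interval [left, right] by default, open (left, right) otherwise."""
    if inclusive:
        return [left <= value <= right for value in column]
    return [left < value < right for value in column]

sepallength = [5.1, 4.9, 4.7, 4.6, 5.0]
closed = between(sepallength, 3, 5)
opened = between(sepallength, 3, 5, inclusive=False)
print(closed)
print(opened)
```

Only the last element (5.0) changes between the two calls, because it sits exactly on the right endpoint.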

The DataFrame API provides a set of operations for Sequence or Scalar objects of the STRING type.

>>> fields = [
...     iris.name.upper().rename('upper_name'),
...     iris.name.extract('Iris(.*)', group=1)
... ]
>>> iris[fields].head(5)
    upper_name     name
0  IRIS-SETOSA  -setosa
1  IRIS-SETOSA  -setosa
2  IRIS-SETOSA  -setosa
3  IRIS-SETOSA  -setosa
4  IRIS-SETOSA  -setosa
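
For a single value, the two string operations above behave like the following plain-Python sketch. The extract helper is hypothetical and assumes regular-expression semantics comparable to Python's re module; the engine that MaxCompute actually uses may differ in details.

```python
import re

# Plain-Python model of upper() and extract() on one string value.
# Assumption: extract() returns the requested capture group, or None
# when the pattern does not match.

def extract(text, pattern, group=0):
    match = re.search(pattern, text)
    return match.group(group) if match else None

name = 'Iris-setosa'
upper_name = name.upper()
suffix = extract(name, 'Iris(.*)', group=1)
print(upper_name, suffix)
```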

The STRING-related operations are as follows.

For a Sequence or Scalar of the DATETIME type, you can call the built-in time-related functions.

>>> df = lens[[lens.unix_timestamp.astype('datetime').rename('dt')]]
>>> df[df.dt,
...    df.dt.year.rename('year'),
...    df.dt.month.rename('month'),
...    df.dt.day.rename('day'),
...    df.dt.hour.rename('hour')].head(5)
                    dt  year  month  day  hour
0  1998-04-08 11:02:00  1998      4    8    11
1  1998-04-08 10:57:55  1998      4    8    10
2  1998-04-08 10:45:26  1998      4    8    10
3  1998-04-08 10:25:52  1998      4    8    10
4  1998-04-08 10:44:19  1998      4    8    10
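
The same conversion and attribute access can be tried locally with the standard library. The sketch below uses a made-up Unix timestamp and fixes the time zone to UTC; the time zone that MaxCompute applies during astype('datetime') is not stated here, so treat that choice as an assumption.

```python
from datetime import datetime, timezone

# Convert a Unix timestamp to a datetime and read its parts.
# Assumptions: the timestamp is a made-up sample and UTC is used.
unix_timestamp = 1_000_000_000
dt = datetime.fromtimestamp(unix_timestamp, tz=timezone.utc)

parts = {
    'year': dt.year,
    'month': dt.month,
    'day': dt.day,
    'hour': dt.hour,
}
print(dt.isoformat(), parts)
```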

The time-related attributes are as follows.

                           a                          b
0 2016-12-06 16:43:12.460001 2016-12-06 17:43:12.460018
1 2016-12-06 16:43:12.460012 2016-12-06 17:43:12.460021
2 2016-12-06 16:43:12.460015 2016-12-06 17:43:12.460022
>>> from odps.df import day
>>> df.a - day(3)
0 2016-12-03 16:43:12.460001
1 2016-12-03 16:43:12.460012
2 2016-12-03 16:43:12.460015
>>> (df.b - df.a).dtype
int64
>>> (df.b - df.a).rename('a')
0 3600000
1 3600000
2 3600000
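
The date arithmetic in the example above, subtracting three days and taking the difference of two datetime columns as an integer, can be reproduced for the first row with the standard library. The sketch assumes that the integer result is a count of whole milliseconds, which matches the value 3600000 shown for columns that are one hour apart.

```python
from datetime import datetime, timedelta

# First row of the sample data shown above.
a = datetime(2016, 12, 6, 16, 43, 12, 460001)
b = datetime(2016, 12, 6, 17, 43, 12, 460018)

# Counterpart of df.a - day(3).
three_days_earlier = a - timedelta(days=3)

# Counterpart of df.b - df.a, assuming the result is whole milliseconds.
diff_ms = (b - a) // timedelta(milliseconds=1)

print(three_days_earlier)
print(diff_ms)
```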

The supported time types are listed in the following table.

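In addition, both collection types (List and Dict) provide an explode method that expands the contents of a collection. For a List, explode returns one column by default; when the pos argument is passed, it returns two columns, one of which holds the index of each value in the array (similar to Python's enumerate function). For a Dict, explode returns two columns that hold the keys and the values, respectively. You can also pass column names to explode to name the generated columns.

For example: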

   id         a                            b
0   1  [a1, b1]  {'a2': 0, 'b2': 1, 'c2': 2}
1   2      [c1]           {'d2': 3, 'e2': 4}
>>> df[df.id, df.a[0], df.b['b2']]
   id   a    b
0   1  a1    1
1   2  c1  NaN
>>> df[df.id, df.a.len(), df.b.len()]
   id  a  b
0   1  2  3
1   2  1  2
>>> df.a.explode()
0  a1
1  b1
2  c1
>>> df.a.explode(pos=True)
   a_pos   a
0      0  a1
1      1  b1
2      0  c1
>>> # Specify column names.
>>> df.a.explode(['pos', 'value'], pos=True)
   pos value
0    0    a1
1    1    b1
2    0    c1
>>> df.b.explode()
  b_key  b_value
0    a2        0
1    b2        1
2    c2        2
3    d2        3
4    e2        4
>>> # Specify column names.
>>> df.b.explode(['key', 'value'])
  key  value
0  a2      0
1  b2      1
2  c2      2
3  d2      3
4  e2      4

explode can also be combined with side-by-side multi-row output, so that the original columns appear next to the explode results. For example:

>>> df[df.id, df.a.explode()]
   id   a
0   1  a1
1   1  b1
2   2  c1
>>> df[df.id, df.a.explode(), df.b.explode()]
   id   a b_key  b_value
0   1  a1    a2        0
1   1  a1    b2        1
2   1  a1    c2        2
3   1  b1    a2        0
4   1  b1    b2        1
5   1  b1    c2        2
6   2  c1    d2        3
7   2  c1    e2        4
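
The row counts in the combined output follow from a per-row Cartesian product: each source row contributes (number of list items) x (number of dict entries) output rows. The following plain-Python sketch reproduces that result for the sample data; explode_list and explode_dict are hypothetical helpers written for this illustration and are not part of PyODPS.

```python
from itertools import product

# Plain-Python model of explode on List and Dict columns.

def explode_list(values, pos=False):
    """One tuple per element; with pos=True, prepend the element index."""
    if pos:
        return list(enumerate(values))
    return [(value,) for value in values]

def explode_dict(mapping):
    """One (key, value) tuple per dict entry."""
    return list(mapping.items())

rows = [
    {'id': 1, 'a': ['a1', 'b1'], 'b': {'a2': 0, 'b2': 1, 'c2': 2}},
    {'id': 2, 'a': ['c1'], 'b': {'d2': 3, 'e2': 4}},
]

# Combine the original id column with both exploded columns.
combined = []
for row in rows:
    pairs = product(explode_list(row['a']), explode_dict(row['b']))
    for (a_value,), (b_key, b_value) in pairs:
        combined.append((row['id'], a_value, b_key, b_value))

for record in combined:
    print(record)
```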

In addition to indexing and the two shared methods len and explode, List also supports the following methods.


isin checks whether each element of a Sequence is in a given collection; notin checks the opposite.

>>> iris.sepallength.isin([4.9, 5.1]).rename('sepallength').head(5)
   sepallength
0         True
1         True
2        False
3        False
4        False
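
In plain Python, the two membership tests correspond to the following sketch. The isin and notin functions are hypothetical list-based helpers that only illustrate the element-wise semantics.

```python
# Plain-Python model of isin / notin on a single column.

def isin(column, values):
    """True where the element is one of the given values."""
    lookup = set(values)
    return [value in lookup for value in column]

def notin(column, values):
    """True where the element is none of the given values."""
    lookup = set(values)
    return [value not in lookup for value in column]

sepallength = [5.1, 4.9, 4.7, 4.6, 5.0]
in_flags = isin(sepallength, [4.9, 5.1])
out_flags = notin(sepallength, [4.9, 5.1])
print(in_flags)
print(out_flags)
```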

cut provides discretization: it splits the data of a Sequence into several bins.

>>> iris.sepallength.cut(range(6), labels=['0-1', '1-2', '2-3', '3-4', '4-5']).rename('sepallength_cut').head(5)
   sepallength_cut
0             None
1              4-5
2              4-5
3              4-5
4              4-5

include_under and include_over add the open-ended interval below the lowest bin edge and above the highest bin edge, respectively.

>>> labels = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-']
>>> iris.sepallength.cut(range(6), labels=labels, include_over=True).rename('sepallength_cut').head(5)
   sepallength_cut
0               5-
1              4-5
2              4-5
3              4-5
4              4-5
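
The binning rule can be approximated with the bisect module. The sketch below assumes right-closed bins, that is, (left, right], which is what the outputs above suggest (5.0 is labeled 4-5 and 5.1 falls outside the last bin). The cut function is a hypothetical helper; how PyODPS treats a value exactly equal to the lowest edge is not shown in this section, so that part is an assumption.

```python
from bisect import bisect_left

# Plain-Python model of cut with optional open-ended bins.
# Assumption: bins are right-closed, i.e. (left, right].

def cut(column, bins, labels, include_under=False, include_over=False):
    bins = list(bins)
    expected = len(bins) - 1 + int(include_under) + int(include_over)
    if len(labels) != expected:
        raise ValueError('expected %d labels, got %d' % (expected, len(labels)))
    offset = 1 if include_under else 0  # labels[0] is the "under" label
    result = []
    for value in column:
        # After bisect_left: bins[idx - 1] < value <= bins[idx]
        idx = bisect_left(bins, value)
        if idx == 0:
            result.append(labels[0] if include_under else None)
        elif idx == len(bins):
            result.append(labels[-1] if include_over else None)
        else:
            result.append(labels[idx - 1 + offset])
    return result

sepallength = [5.1, 4.9, 4.7, 4.6, 5.0]
plain = cut(sepallength, range(6), ['0-1', '1-2', '2-3', '3-4', '4-5'])
with_over = cut(sepallength, range(6),
                ['0-1', '1-2', '2-3', '3-4', '4-5', '5-'],
                include_over=True)
print(plain)
print(with_over)
```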

Calling MaxCompute built-in or user-defined functions

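To call a built-in or user-defined MaxCompute function to generate a column, use the func interface. By default, func assumes that the function returns STRING; use the rtype parameter to specify a different return type.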

>>> from odps.df import func
>>> iris[iris.name, func.rand(rtype='float').rename('rand')][:4]
>>> iris[iris.name, func.rand(10, rtype='float').rename('rand')][:4]
>>> # Call a UDF defined on ODPS; specify the column name manually when it cannot be determined.
>>> iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float').rename('new_col')]
>>> # Call a UDF from another project; the column name can also be specified with the name parameter.
>>> iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float', project='udf_project', name='new_col')]

Note: The Pandas backend does not support executing expressions that contain func.