Version 1.7.0

@itholic itholic released this 08 Mar 10:22
· 63 commits to master since this release
05c8b4d

Switch the default plotting backend to Plotly

We switched the default plotting backend from Matplotlib to Plotly (#2029, #2033). In addition, we added more Plotly methods such as DataFrame.plot.kde and Series.plot.kde (#2028).

import databricks.koalas as ks
kdf = ks.DataFrame({
    'a': [1, 2, 2.5, 3, 3.5, 4, 5],
    'b': [1, 2, 3, 4, 5, 6, 7],
    'c': [0.5, 1, 1.5, 2, 2.5, 3, 3.5]})
kdf.plot.hist()

[Figure: Koalas Plotly histogram plot]

The plotting backend can be switched back to Matplotlib by setting ks.options.plotting.backend to "matplotlib".

ks.options.plotting.backend = "matplotlib"

Add Int64Index, Float64Index, DatetimeIndex

We added more types of Index such as Int64Index, Float64Index and DatetimeIndex (#2025, #2066).

Previously, creating an index always returned an Index instance regardless of the data type.

Now an Int64Index, Float64Index or DatetimeIndex is returned, depending on the data type of the index.

>>> import datetime
>>> type(ks.Index([1, 2, 3]))
<class 'databricks.koalas.indexes.numeric.Int64Index'>
>>> type(ks.Index([1.1, 2.5, 3.0]))
<class 'databricks.koalas.indexes.numeric.Float64Index'>
>>> type(ks.Index([datetime.datetime(2021, 3, 9)]))
<class 'databricks.koalas.indexes.datetimes.DatetimeIndex'>
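The dispatch shown above can be pictured with a small pure-Python sketch. It is not Koalas's implementation: the helper name `index_class_name` is hypothetical, and it inspects Python values only to illustrate the idea of choosing an Index subclass from the data type.

```python
import datetime


def index_class_name(values):
    """Hypothetical helper: name the Index subclass a dtype-aware factory might pick."""
    values = list(values)
    if not values:
        return "Index"
    if all(isinstance(v, datetime.datetime) for v in values):
        return "DatetimeIndex"
    # bool is a subclass of int in Python, so exclude it explicitly.
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)
    if numeric and all(isinstance(v, int) for v in values):
        return "Int64Index"
    if numeric:
        return "Float64Index"
    # Anything else falls back to the generic Index.
    return "Index"


print(index_class_name([1, 2, 3]))
print(index_class_name([1.1, 2.5, 3.0]))
print(index_class_name([datetime.datetime(2021, 3, 9)]))
```

Anything that is not purely integral, fractional or datetime data falls through to the generic Index, which matches the behaviour before this release.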

In addition, we added many properties for DatetimeIndex such as year, month, day, hour, minute, second, etc. (#2074) and added APIs for DatetimeIndex such as round(), floor(), ceil(), normalize(), strftime(), month_name() and day_name() (#2082, #2086, #2089).
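As a rough illustration of what these methods compute, here is a standard-library sketch for a single timestamp. It assumes the Koalas methods follow their pandas namesakes and uses an hourly frequency for floor and ceil; the helper functions are hypothetical and are not part of Koalas.

```python
import calendar
import datetime


def floor_to_hour(ts):
    """Round down to the start of the hour (the idea behind floor with an hourly frequency)."""
    return ts.replace(minute=0, second=0, microsecond=0)


def ceil_to_hour(ts):
    """Round up to the next full hour unless ts is already on the hour."""
    floored = floor_to_hour(ts)
    return floored if floored == ts else floored + datetime.timedelta(hours=1)


def normalize(ts):
    """Reset the time of day to midnight, keeping the date."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


ts = datetime.datetime(2021, 3, 9, 10, 22, 30)
print(floor_to_hour(ts))                # 2021-03-09 10:00:00
print(ceil_to_hour(ts))                 # 2021-03-09 11:00:00
print(normalize(ts))                    # 2021-03-09 00:00:00
print(ts.strftime("%Y-%m-%d"))          # 2021-03-09
print(calendar.month_name[ts.month])    # month name in the default locale
print(calendar.day_name[ts.weekday()])  # day name in the default locale
```

On a DatetimeIndex the same operations apply element-wise to every timestamp in the index.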

Create Index from Series or Index objects

An Index can now be created from a Series or Index object (#2071).

>>> import datetime
>>> kser = ks.Series([1, 2, 3], name="a", index=[10, 20, 30])
>>> ks.Index(kser)
Int64Index([1, 2, 3], dtype='int64', name='a')
>>> ks.Int64Index(kser)
Int64Index([1, 2, 3], dtype='int64', name='a')
>>> ks.Float64Index(kser)
Float64Index([1.0, 2.0, 3.0], dtype='float64', name='a')
>>> kser = ks.Series([datetime.datetime(2021, 3, 1), datetime.datetime(2021, 3, 2)], index=[10, 20])
>>> ks.Index(kser)
DatetimeIndex(['2021-03-01', '2021-03-02'], dtype='datetime64[ns]', freq=None)
>>> ks.DatetimeIndex(kser)
DatetimeIndex(['2021-03-01', '2021-03-02'], dtype='datetime64[ns]', freq=None)

Extension dtypes support

We added basic extension dtypes support (#2039).

>>> kdf = ks.DataFrame(
...     {
...         "a": [1, 2, None, 3],
...         "b": [4.5, 5.2, 6.1, None],
...         "c": ["A", "B", "C", None],
...         "d": [False, None, True, False],
...     }
... ).astype({"a": "Int32", "b": "Float64", "c": "string", "d": "boolean"})
>>> kdf
      a    b     c      d
0     1  4.5     A  False
1     2  5.2     B   <NA>
2  <NA>  6.1     C   True
3     3  NaN  <NA>  False
>>> kdf.dtypes
a      Int32
b    float64
c     string
d    boolean
dtype: object

The supported extension dtypes depend on the installed pandas version:

  • pandas >= 0.24
    • Int8Dtype
    • Int16Dtype
    • Int32Dtype
    • Int64Dtype
  • pandas >= 1.0
    • BooleanDtype
    • StringDtype
  • pandas >= 1.2
    • Float32Dtype
    • Float64Dtype
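The version thresholds above can be expressed as a small lookup. This is only a sketch of the rule as listed, not code from Koalas; the function name `supported_extension_dtypes` and the simple version parsing are assumptions made for illustration.

```python
# Minimum pandas (major, minor) version for each group of extension dtypes, as listed above.
DTYPES_BY_MIN_VERSION = [
    ((0, 24), ["Int8Dtype", "Int16Dtype", "Int32Dtype", "Int64Dtype"]),
    ((1, 0), ["BooleanDtype", "StringDtype"]),
    ((1, 2), ["Float32Dtype", "Float64Dtype"]),
]


def supported_extension_dtypes(pandas_version):
    """Hypothetical helper: list the extension dtypes available for a pandas version string."""
    # Only the major and minor components matter for the thresholds above.
    major_minor = tuple(int(part) for part in pandas_version.split(".")[:2])
    supported = []
    for minimum, names in DTYPES_BY_MIN_VERSION:
        if major_minor >= minimum:
            supported.extend(names)
    return supported


print(supported_extension_dtypes("1.1.5"))
```

With pandas 1.1.5, for example, the integer, boolean and string dtypes are listed, while Float32Dtype and Float64Dtype are not.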

Binary operations and type casting are supported:

>>> kdf.a + kdf.b
0       5
1       7
2    <NA>
3    <NA>
dtype: Int64
>>> kdf + kdf
      a     b
0     2     8
1     4    10
2  <NA>    12
3     6  <NA>
>>> kdf.a.astype('Float64')
0     1.0
1     2.0
2    <NA>
3     3.0
Name: a, dtype: Float64
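The way missing values propagate through these binary operations can be sketched in plain Python, with None standing in for <NA>. The helper `add_nullable` and the input lists are hypothetical; this only illustrates the propagation rule and is not how Koalas evaluates the expression.

```python
def add_nullable(left, right):
    """Element-wise addition where a missing operand (None) makes the result missing."""
    return [
        None if x is None or y is None else x + y
        for x, y in zip(left, right)
    ]


a = [1, 2, None, 3]
b = [4, 5, 6, None]
print(add_nullable(a, b))  # [5, 7, None, None]
```

A row is missing in the result whenever either operand is missing, rather than raising an error or falling back to a float NaN.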

Other new features, improvements and bug fixes

We added the following new features:

koalas:

Series:

DataFrame:

Along with the following fixes:

  • PySpark 3.1.1 Support
  • Preserve index for statistical functions with axis==1 (#2036)
  • Use iloc to make sure it retrieves the first element (#2037)
  • Fix numeric_only to follow pandas (#2035)
  • Fix DataFrame.merge to work properly (#2060)
  • Fix astype(str) for some data types (#2040)
  • Fix binary operations Index by Series (#2046)
  • Fix bug on pow and rpow (#2047)
  • Support bool list-like column selection for loc indexer (#2057)
  • Fix window functions to resolve (#2090)
  • Refresh GitHub workflow matrix (#2083)
  • Restructure the hierarchy of Index unit tests (#2080)
  • Fix to delegate dtypes (#2061)