Version 1.7.0
Switch the default plotting backend to Plotly
We switched the default plotting backend from Matplotlib to Plotly (#2029, #2033). In addition, we added more Plotly methods such as DataFrame.plot.kde and Series.plot.kde (#2028).
import databricks.koalas as ks
kdf = ks.DataFrame({
    'a': [1, 2, 2.5, 3, 3.5, 4, 5],
    'b': [1, 2, 3, 4, 5, 6, 7],
    'c': [0.5, 1, 1.5, 2, 2.5, 3, 3.5]})
kdf.plot.hist()
The plotting backend can be switched back to Matplotlib by setting ks.options.plotting.backend to matplotlib.
ks.options.plotting.backend = "matplotlib"
Add Int64Index, Float64Index, DatetimeIndex
We added more types of Index such as Int64Index, Float64Index and DatetimeIndex (#2025, #2066).
Previously, creating an index always returned an Index instance regardless of the data type. Now Int64Index, Float64Index or DatetimeIndex is returned depending on the data type of the index.
>>> import datetime
>>> type(ks.Index([1, 2, 3]))
<class 'databricks.koalas.indexes.numeric.Int64Index'>
>>> type(ks.Index([1.1, 2.5, 3.0]))
<class 'databricks.koalas.indexes.numeric.Float64Index'>
>>> type(ks.Index([datetime.datetime(2021, 3, 9)]))
<class 'databricks.koalas.indexes.datetimes.DatetimeIndex'>
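The type-dependent behaviour shown above can be pictured with a small pure-Python sketch. The helper index_class_name is hypothetical and is not how Koalas implements the dispatch; it only mirrors the rule stated in these notes. Treating a mix of integers and floats as Float64Index is an assumption based on the usual pandas upcasting.

```python
import datetime


def index_class_name(values):
    """Return the name of the Index subclass a list of values would map to.

    Hypothetical helper for illustration only; not part of Koalas.
    """
    def is_int(v):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(v, int) and not isinstance(v, bool)

    if not values:
        return "Index"
    if all(is_int(v) for v in values):
        return "Int64Index"
    # Assumption: mixed ints and floats are upcast to float, as in pandas.
    if all(is_int(v) or isinstance(v, float) for v in values):
        return "Float64Index"
    if all(isinstance(v, datetime.datetime) for v in values):
        return "DatetimeIndex"
    # Anything else (strings, mixed types, ...) falls back to the base class.
    return "Index"


print(index_class_name([1, 2, 3]))
print(index_class_name([1.1, 2.5, 3.0]))
print(index_class_name([datetime.datetime(2021, 3, 9)]))
print(index_class_name(["a", "b"]))
```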
In addition, we added many properties for DatetimeIndex such as year, month, day, hour, minute, second, etc. (#2074) and added APIs for DatetimeIndex such as round(), floor(), ceil(), normalize(), strftime(), month_name() and day_name() (#2082, #2086, #2089).
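As a rough illustration of what these properties and methods return, here is a sketch written against plain pandas, whose DatetimeIndex API the Koalas one is modelled on. It assumes pandas is installed; the sample timestamps are made up for the example. With Koalas available, the same calls on a ks.DatetimeIndex would be expected to behave analogously, though that is not verified here.

```python
import pandas as pd

# Two made-up timestamps used only for this illustration.
idx = pd.DatetimeIndex(["2021-03-09 13:45:30", "2021-03-10 08:05:00"])

# Component properties return one value per element.
years = idx.year.tolist()
hours = idx.hour.tolist()

# floor("D") truncates to the start of the day; normalize() sets the time to midnight.
floored = idx.floor("D")
normalized = idx.normalize()

# String-producing methods.
labels = idx.strftime("%Y-%m-%d").tolist()
months = idx.month_name().tolist()
days = idx.day_name().tolist()

print(years, hours)
print(labels)
print(months, days)
```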
Create Index from Series or Index objects
An Index can now be created from Series or Index objects (#2071).
>>> kser = ks.Series([1, 2, 3], name="a", index=[10, 20, 30])
>>> ks.Index(kser)
Int64Index([1, 2, 3], dtype='int64', name='a')
>>> ks.Int64Index(kser)
Int64Index([1, 2, 3], dtype='int64', name='a')
>>> ks.Float64Index(kser)
Float64Index([1.0, 2.0, 3.0], dtype='float64', name='a')
>>> from datetime import datetime
>>> kser = ks.Series([datetime(2021, 3, 1), datetime(2021, 3, 2)], index=[10, 20])
>>> ks.Index(kser)
DatetimeIndex(['2021-03-01', '2021-03-02'], dtype='datetime64[ns]', freq=None)
>>> ks.DatetimeIndex(kser)
DatetimeIndex(['2021-03-01', '2021-03-02'], dtype='datetime64[ns]', freq=None)
Extension dtypes support
We added basic extension dtypes support (#2039).
>>> kdf = ks.DataFrame(
...     {
...         "a": [1, 2, None, 3],
...         "b": [4.5, 5.2, 6.1, None],
...         "c": ["A", "B", "C", None],
...         "d": [False, None, True, False],
...     }
... ).astype({"a": "Int32", "b": "Float64", "c": "string", "d": "boolean"})
>>> kdf
      a    b     c      d
0     1  4.5     A  False
1     2  5.2     B   <NA>
2  <NA>  6.1     C   True
3     3  NaN  <NA>  False
>>> kdf.dtypes
a      Int32
b    float64
c     string
d    boolean
dtype: object
The following types are supported, depending on the installed pandas version:
- pandas >= 0.24
  - Int8Dtype
  - Int16Dtype
  - Int32Dtype
  - Int64Dtype
- pandas >= 1.0
  - BooleanDtype
  - StringDtype
- pandas >= 1.2
  - Float32Dtype
  - Float64Dtype
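The version thresholds above can be expressed as a small lookup. This is only a sketch: the table DTYPES_BY_MIN_PANDAS and the function supported_extension_dtypes are hypothetical names introduced for illustration, not Koalas APIs, and the version string is assumed to start with a numeric major.minor pair.

```python
# Minimum pandas version (major, minor) -> extension dtypes, as listed above.
DTYPES_BY_MIN_PANDAS = [
    ((0, 24), ["Int8Dtype", "Int16Dtype", "Int32Dtype", "Int64Dtype"]),
    ((1, 0), ["BooleanDtype", "StringDtype"]),
    ((1, 2), ["Float32Dtype", "Float64Dtype"]),
]


def supported_extension_dtypes(pandas_version):
    """Return the extension dtype names available for a pandas version string.

    Hypothetical helper; assumes the string begins with "major.minor".
    """
    major, minor = (int(part) for part in pandas_version.split(".")[:2])
    supported = []
    for min_version, dtype_names in DTYPES_BY_MIN_PANDAS:
        # Tuple comparison orders versions by major first, then minor.
        if (major, minor) >= min_version:
            supported.extend(dtype_names)
    return supported


print(supported_extension_dtypes("0.25.3"))
print(supported_extension_dtypes("1.1.5"))
print(supported_extension_dtypes("1.2.0"))
```

In practice the string passed in would typically be pandas.__version__.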
Binary operations and type casting are supported:
>>> kdf.a + kdf.b
0       5
1       7
2    <NA>
3    <NA>
dtype: Int64
>>> kdf + kdf
      a     b
0     2     8
1     4    10
2  <NA>    12
3     6  <NA>
>>> kdf.a.astype('Float64')
0     1.0
1     2.0
2    <NA>
3     3.0
Name: a, dtype: Float64
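The missing-value behaviour in the outputs above follows the pandas nullable dtypes, where an operation involving <NA> yields <NA>. The following sketch uses plain pandas (assumed to be version 0.24 or later) rather than Koalas, with made-up sample values, to show that propagation and a cast between nullable integer types.

```python
import pandas as pd

# Nullable integer Series; None is stored as <NA>.
a = pd.Series([1, 2, None, 3], dtype="Int32")
b = pd.Series([4, 5, 6, None], dtype="Int32")

# Element-wise addition: any row with a missing operand stays missing.
total = a + b
missing = total.isna().tolist()
present_values = total.dropna().tolist()

# Casting to another nullable integer type keeps the missing entries.
widened = a.astype("Int64")

print(total)
print(widened.dtype)
```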
Other new features, improvements and bug fixes
We added the following new features:
koalas:
Series:
- align (#2019)
DataFrame:
Along with the following fixes:
- PySpark 3.1.1 Support
- Preserve index for statistical functions with axis==1 (#2036)
- Use iloc to make sure it retrieves the first element (#2037)
- Fix numeric_only to follow pandas (#2035)
- Fix DataFrame.merge to work properly (#2060)
- Fix astype(str) for some data types (#2040)
- Fix binary operations Index by Series (#2046)
- Fix bug on pow and rpow (#2047)
- Support bool list-like column selection for loc indexer (#2057)
- Fix window functions to resolve (#2090)
- Refresh GitHub workflow matrix (#2083)
- Restructure the hierarchy of Index unit tests (#2080)
- Fix to delegate dtypes (#2061)