Using Databricks' koalas instead of pandas in Spark

leebaro 2020. 12. 3. 12:45

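pandas code is not distributed or parallelized by Spark; it runs on a single machine, so it hits limits with large datasets. Spark DataFrames, on the other hand, can be less convenient than pandas or lack some of its features.

In those cases you can use koalas, a library made by Databricks.

koalas supports distributed, parallel processing, and its syntax is close to pandas, so it is easy to pick up.

Below is an example of calling describe() on a koalas DataFrame.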

import databricks.koalas as ks

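# Assumes an active SparkSession named spark (e.g. in a notebook or the pyspark shell)
sdf = spark.sql("select cnt from table")

# Convert to a koalas DataFrame
kdf = sdf.to_koalas()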

kdf.describe()


## Result
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.0
50%      2.0
75%      3.0
max      3.0
dtype: float64
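
To make the rows of that summary concrete, here is a minimal pure-Python sketch that recomputes them for the values 1, 2, 3. It is not how koalas computes describe() on Spark; the helper name describe and the nearest-rank percentile rule are assumptions chosen for illustration. pandas would report 1.5 and 2.5 for the 25% and 75% rows of the same data because it interpolates linearly by default, so the sketch takes percentiles without interpolation to match the numbers shown above.

```python
import math
import statistics


def describe(values):
    """Return describe()-style summary statistics for a list of numbers.

    Hypothetical helper for illustration; not part of koalas or pandas.
    """
    data = sorted(float(v) for v in values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values for a sample standard deviation")

    def percentile(p):
        # Nearest-rank rule (an assumption): take the value at rank ceil(p * n),
        # with no interpolation between neighbouring values.
        rank = max(1, math.ceil(p * n))
        return data[rank - 1]

    return {
        "count": float(n),
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),  # sample standard deviation (divides by n - 1)
        "min": data[0],
        "25%": percentile(0.25),
        "50%": percentile(0.50),
        "75%": percentile(0.75),
        "max": data[-1],
    }


summary = describe([1, 2, 3])
for name, value in summary.items():
    print(f"{name:<8} {value}")
```

The std row is the sample standard deviation, which is why it comes out as 1.0 rather than the population value of about 0.82.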

To use koalas, install the package first, as shown below.

# Conda
conda install koalas -c conda-forge

# pip
pip install koalas

References

databricks.com/blog/2020/08/11/interoperability-between-koalas-and-apache-spark.html
How Koalas-Spark Interoperability Helps pandas Users Scale - The Databricks Blog

pypi.org/project/koalas/
Koalas: pandas API on Apache Spark

koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.describe.html
databricks.koalas.DataFrame.describe - Koalas 1.4.0 documentation
