How to use collect() to gather data on the master (driver) node in Spark

leebaro 2020. 10. 19.

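When you work with a DataFrame in Spark, the data is stored in a distributed fashion across the worker nodes.

If for some reason you need to bring that data to the master node (more precisely, the driver process), you can use collect().

In my case, I needed this in order to send the data in a DataFrame (df) to an Oracle database.

The master node could reach the Oracle DB, but the worker nodes could not because of a security policy.

So I tried to move the data to the master node and then send it from there to the other database.

In the end this failed and I used a different approach, but if you want to gather data on the master node, this is how to do it.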

df.collect()
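As a minimal sketch of what this call returns: collect() brings every row to the driver as a Python list of Row objects. The helper names, the sample data, and the column names (col1, col2) below are assumptions for illustration and are not taken from the original job.

```python
def collect_as_dicts(df):
    """Collect every row of `df` to the driver and return plain dicts.

    df.collect() returns a Python list of pyspark.sql.Row objects that
    lives entirely in the driver's memory; Row.asDict() converts each one.
    """
    rows = df.collect()
    return [row.asDict() for row in rows]


def demo():
    """Hypothetical demo; needs a working PySpark installation."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("collect-demo")
        .getOrCreate()
    )
    try:
        # Made-up sample data standing in for the real DataFrame.
        df = spark.createDataFrame([(1, "a"), (2, "b")], ["col1", "col2"])
        for record in collect_as_dicts(df):
            print(record)
    finally:
        spark.stop()
```

Calling demo() on a machine with PySpark installed prints one dict per row. The point is that the result is an ordinary Python list on the driver, no longer a distributed dataset.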

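When the data is large, moving it takes a lot of time and resources, and the entire result has to fit in the driver's memory, so take this into account before proceeding.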
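One way to limit the memory pressure on the driver, sketched here as an alternative and not as what I originally ran, is to stream rows with toLocalIterator() and handle them in batches. The function names and the handle_batch callback are hypothetical.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most `batch_size` items from any iterable of rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_in_batches(df, batch_size, handle_batch):
    """Stream `df` to the driver instead of collecting it all at once.

    df.toLocalIterator() fetches the data partition by partition, so the
    driver needs room for the largest partition, not the whole dataset.
    `handle_batch` is a hypothetical callback, e.g. a database insert.
    Returns the number of rows processed.
    """
    total = 0
    for batch in iter_batches(df.toLocalIterator(), batch_size):
        handle_batch(batch)
        total += len(batch)
    return total
```

For a quick look at the data, df.take(10) or df.limit(10).collect() also keeps the transfer small.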

If you only need the data from specific columns, use select() together with collect(), as shown below.

df.select("col1").collect()
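The collected result is still a list of Row objects, so a small helper can turn it into plain values. This is a sketch; the function name collect_column is an assumption, and "col1" is just the example column used above.

```python
def collect_column(df, column):
    """Return the values of one column as a plain Python list on the driver.

    select() is applied before collect(), so only that column is
    transferred to the driver. Each Row is indexed by column name.
    """
    return [row[column] for row in df.select(column).collect()]
```

With a real DataFrame this would be called as collect_column(df, "col1").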

References

sparkbyexamples.com/pyspark/pyspark-collect/

PySpark Collect() - Retrieve data from DataFrame — Spark by {Examples}

stackoverflow.com/questions/44174747/spark-dataframe-collect-vs-select

Spark dataframe: collect () vs select ()

