* Create the case2 table in Scala
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> sqlContext.sql("create table IF NOT EXISTS case2 (case_id int, province string, city string, group string, infection_case string, confirmed int, latitude int, longitude int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
res2: org.apache.spark.sql.DataFrame = []
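* (Optional sanity check, a minimal sketch: describe lists the column names and types of the table just created.)
scala> sqlContext.sql("describe case2").show()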
scala> sqlContext.sql("LOAD DATA LOCAL INPATH '/home/scott/Case.txt' INTO TABLE case2")
res3: org.apache.spark.sql.DataFrame = []
scala> sql("select * from case2").show()
* Save the result printed above to a CSV file (written under /home/scott/dd)
scala> sql("""select province,sum(confirmed)
from case2
group by province""").coalesce(1).write.option("header","true").option("sep",",").mode("overwrite").csv("/home/scott/dd")
* Check the data as the scott user and rename the file
(py389) [scott@centos ~]$ ls -ld dd
drwxrwxr-x. 2 scott scott 170 Jan 12 03:16 dd
(py389) [scott@centos ~]$ cd dd
(py389) [scott@centos dd]$ ls
part-r-00000-0aae2303-ff23-4da1-9aff-cb6bffd70122.csv _SUCCESS
(py389) [scott@centos dd]$ mv part-r-00000-0aae2303-ff23-4da1-9aff-cb6bffd70122.csv case.csv
(py389) [scott@centos dd]$ cat case.csv
province,sum(confirmed)
Sejong,49
Ulsan,51
Chungcheongbuk-do,60
Gangwon-do,62
Gwangju,43
province,
Gyeongsangbuk-do,1324
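* The stray "province," row is the header line of Case.txt: LOAD DATA copies the file verbatim, so the header row is loaded as data (its confirmed value parses to NULL, hence the empty sum). A sketch of one fix is to filter it out in the query; alternatively, create the table with TBLPROPERTIES("skip.header.line.count"="1") so Hive skips the first line.
scala> sqlContext.sql("""select province,sum(confirmed)
from case2
where province <> 'province'
group by province""").show()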
* (No need to save the data by hand in a text editor)
* Plot the result directly in pandas
import pandas as pd
import matplotlib.pyplot as plt

emp = pd.read_csv("/home/scott/dd/case.csv")   # CSV written by Spark above
result = emp['sum(confirmed)']                 # confirmed-case totals
result.index = emp['province']                 # label each bar with its province
result.plot(kind='bar', color='pink')          # bar chart of confirmed cases per province
plt.show()                                     # display the figure outside Jupyter