[python] Python Data Analysis - Time Series Data (using pandas, visualization)

Dev/Python 2023. 1. 16. 02:35


Time Series Data

A set of observations recorded in sequential time order
Usually recorded at fixed time intervals

Pandas

Provides the DatetimeIndex type for time series data
Converts year, month, and day strings to a DatetimeIndex (pd.to_datetime())
Visualization with pd.DataFrame.plot()

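Resampling with Pandas

Readjusts the time interval of the data
Down-sampling: the original rows are grouped into coarser bins, so each bin needs a representative value such as the mean (e.g. daily data -> monthly data)
Up-sampling: creates rows that do not exist in the original data, so they must be filled in (e.g. monthly data -> daily data)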

Time Series Data Basics

pd.to_datetime

In [2]:

import pandas as pd
import numpy as np

In [3]:

# Create a list of arbitrary date strings
# pd.to_datetime() accepts several separator styles, such as '/', '-', and '.'.

date = ["2020/01/01", "2020/02/01", "2020/03/01", "2020/04/01"] # date strings
date_idx = pd.to_datetime(date) # date strings -> DatetimeIndex

print(date_idx) # inspect the result

DatetimeIndex(['2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01'], dtype='datetime64[ns]', freq=None)
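The separator remark in the cell above can be checked with a small sketch. The sample strings and the `formats` mapping are illustrative assumptions, not part of the original notebook. Passing `format` explicitly avoids relying on automatic format inference, which may behave differently across pandas versions.

```python
import pandas as pd

# Hypothetical sample strings: the same date written with three separators.
formats = {
    "2020/01/01": "%Y/%m/%d",
    "2020-01-01": "%Y-%m-%d",
    "2020.01.01": "%Y.%m.%d",
}

# Parse each string with an explicit format so the result does not depend on inference.
parsed = [pd.to_datetime(text, format=fmt) for text, fmt in formats.items()]

# All three strings should resolve to the same Timestamp.
all_equal = all(stamp == pd.Timestamp(2020, 1, 1) for stamp in parsed)
print(parsed, all_equal)
```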

In [4]:

np.random.seed(1) # fix the seed so the same random numbers are produced on every run

# Create a NumPy array of 4 random integers from 3 (inclusive) to 10 (exclusive)
random_int = np.random.randint(3, 10, size = 4)
random_int

Out[4]:

array([8, 6, 7, 3])
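As a quick check of the bounds used by `np.random.randint`, the sketch below draws a larger sample and inspects its range. The sample size of 1000 and the name `draws` are arbitrary choices for illustration.

```python
import numpy as np

# low=3 is inclusive, high=10 is exclusive, so values fall in 3..9.
draws = np.random.randint(3, 10, size=1000)

# Both comparisons hold for any seed, because 10 can never be drawn.
print(draws.min() >= 3, draws.max() <= 9)
```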

In [6]:

# Convert the NumPy array to a Series indexed by the dates
series = pd.Series(random_int, index=date_idx)
series

Out[6]:

2020-01-01    8
2020-02-01    6
2020-03-01    7
2020-04-01    3
dtype: int32
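Once a Series carries a DatetimeIndex, rows can be selected with date strings. The following is a minimal sketch that rebuilds the same four values by hand; the variable names `feb_value` and `feb_to_mar` are illustrative.

```python
import pandas as pd

date_idx = pd.to_datetime(["2020/01/01", "2020/02/01", "2020/03/01", "2020/04/01"])
series = pd.Series([8, 6, 7, 3], index=date_idx)

# Label-based lookup with a full date string returns a single value.
feb_value = series.loc["2020-02-01"]

# Slicing with date strings includes both endpoints.
feb_to_mar = series.loc["2020-02-01":"2020-03-01"]

print(feb_value)
print(feb_to_mar)
```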

pd.date_range

In [7]:

# date_range: takes a start date plus either an end date or a number of periods
pd.date_range('2020-1-1', '2020-05-31') # returns every day in the given range

Out[7]:

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10',
               ...
               '2020-05-22', '2020-05-23', '2020-05-24', '2020-05-25',
               '2020-05-26', '2020-05-27', '2020-05-28', '2020-05-29',
               '2020-05-30', '2020-05-31'],
              dtype='datetime64[ns]', length=152, freq='D')

In [9]:

# Monthly dates
pd.date_range('2020-1-1', '2020-05-31', freq='M') # freq='M': last day of each month (pandas 2.2+ spells this alias 'ME')

Out[9]:

DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
               '2020-05-31'],
              dtype='datetime64[ns]', freq='M')

In [10]:

# Monthly dates
pd.date_range('2020-1-1', '2020-05-31', freq='MS') # freq='MS': first day of each month

Out[10]:

DatetimeIndex(['2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01',
               '2020-05-01'],
              dtype='datetime64[ns]', freq='MS')
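The month-end and month-start frequencies can also be written as offset objects instead of string aliases. This is a sketch of an alternative spelling, not what the original notebook uses; it sidesteps the alias rename ('M' to 'ME') in newer pandas releases.

```python
import pandas as pd

# Offset objects play the same role as the 'M' / 'MS' string aliases.
month_end = pd.date_range("2020-01-01", "2020-05-31", freq=pd.offsets.MonthEnd())
month_start = pd.date_range("2020-01-01", "2020-05-31", freq=pd.offsets.MonthBegin())

print(month_end.strftime("%Y-%m-%d").tolist())
print(month_start.strftime("%Y-%m-%d").tolist())
```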

In [11]:

# No end date; specify the number of periods instead
pd.date_range(start="2020-1-1", periods=45) # daily frequency by default

Out[11]:

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10', '2020-01-11', '2020-01-12',
               '2020-01-13', '2020-01-14', '2020-01-15', '2020-01-16',
               '2020-01-17', '2020-01-18', '2020-01-19', '2020-01-20',
               '2020-01-21', '2020-01-22', '2020-01-23', '2020-01-24',
               '2020-01-25', '2020-01-26', '2020-01-27', '2020-01-28',
               '2020-01-29', '2020-01-30', '2020-01-31', '2020-02-01',
               '2020-02-02', '2020-02-03', '2020-02-04', '2020-02-05',
               '2020-02-06', '2020-02-07', '2020-02-08', '2020-02-09',
               '2020-02-10', '2020-02-11', '2020-02-12', '2020-02-13',
               '2020-02-14'],
              dtype='datetime64[ns]', freq='D')

In [12]:

# Specify both periods and freq
pd.date_range(start="2020-1-1", periods=45, freq='MS') # first day of each month, 45 months

Out[12]:

DatetimeIndex(['2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01',
               '2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01',
               '2020-09-01', '2020-10-01', '2020-11-01', '2020-12-01',
               '2021-01-01', '2021-02-01', '2021-03-01', '2021-04-01',
               '2021-05-01', '2021-06-01', '2021-07-01', '2021-08-01',
               '2021-09-01', '2021-10-01', '2021-11-01', '2021-12-01',
               '2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01',
               '2022-05-01', '2022-06-01', '2022-07-01', '2022-08-01',
               '2022-09-01', '2022-10-01', '2022-11-01', '2022-12-01',
               '2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01',
               '2023-05-01', '2023-06-01', '2023-07-01', '2023-08-01',
               '2023-09-01'],
              dtype='datetime64[ns]', freq='MS')

Shifting a time series (shift)

In [13]:

np.random.seed(1)

arr = np.random.randn(12) # 12 random numbers drawn from the standard normal distribution

ts_idx = pd.date_range('2020-1-1', periods=12, freq="MS") # date index for the random values

# Build a Series from the date index and arr
ts = pd.Series(arr, index=ts_idx)
print(ts)

2020-01-01    1.624345
2020-02-01   -0.611756
2020-03-01   -0.528172
2020-04-01   -1.072969
2020-05-01    0.865408
2020-06-01   -2.301539
2020-07-01    1.744812
2020-08-01   -0.761207
2020-09-01    0.319039
2020-10-01   -0.249370
2020-11-01    1.462108
2020-12-01   -2.060141
Freq: MS, dtype: float64

In [14]:

# Shift the values down by one row (one month); the first row becomes NaN
ts.shift(1)

Out[14]:

2020-01-01         NaN
2020-02-01    1.624345
2020-03-01   -0.611756
2020-04-01   -0.528172
2020-05-01   -1.072969
2020-06-01    0.865408
2020-07-01   -2.301539
2020-08-01    1.744812
2020-09-01   -0.761207
2020-10-01    0.319039
2020-11-01   -0.249370
2020-12-01    1.462108
Freq: MS, dtype: float64

In [15]:

# Shift the values down by four rows (four months)
ts.shift(4)

Out[15]:

2020-01-01         NaN
2020-02-01         NaN
2020-03-01         NaN
2020-04-01         NaN
2020-05-01    1.624345
2020-06-01   -0.611756
2020-07-01   -0.528172
2020-08-01   -1.072969
2020-09-01    0.865408
2020-10-01   -2.301539
2020-11-01    1.744812
2020-12-01   -0.761207
Freq: MS, dtype: float64

In [16]:

# With freq given, the index (not the values) is shifted: each date moves forward to the next Sunday
ts.shift(1, freq='W')

Out[16]:

2020-01-05    1.624345
2020-02-02   -0.611756
2020-03-08   -0.528172
2020-04-05   -1.072969
2020-05-03    0.865408
2020-06-07   -2.301539
2020-07-05    1.744812
2020-08-02   -0.761207
2020-09-06    0.319039
2020-10-04   -0.249370
2020-11-08    1.462108
2020-12-06   -2.060141
dtype: float64
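The difference between shifting values and shifting the index can be seen side by side in a small sketch. The four values below are hypothetical, chosen only so the movement is easy to follow.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", periods=4, freq="MS")
ts = pd.Series([10.0, 20.0, 30.0, 40.0], index=idx)  # hypothetical values

moved_values = ts.shift(1)            # index unchanged, values move down one row
moved_index = ts.shift(1, freq="MS")  # values unchanged, index moves one month forward

print(moved_values)
print(moved_index)
```

Without `freq`, the last value drops off and a NaN appears at the top; with `freq`, no data is lost because only the labels change.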

diff(n): each row minus the row n periods earlier

In [17]:

# Compute the one-month difference
ts.diff(1) # difference = current month - previous month

Out[17]:

2020-01-01         NaN
2020-02-01   -2.236102
2020-03-01    0.083585
2020-04-01   -0.544797
2020-05-01    1.938376
2020-06-01   -3.166946
2020-07-01    4.046350
2020-08-01   -2.506019
2020-09-01    1.080246
2020-10-01   -0.568409
2020-11-01    1.711478
2020-12-01   -3.522249
Freq: MS, dtype: float64

In [18]:

# Check how the difference is computed (positional access via .iloc)
ts.iloc[1] - ts.iloc[0]

Out[18]:

-2.236101777313317

In [19]:

# Compute the three-month difference (current month - value three months earlier)
ts.diff(3)

Out[19]:

2020-01-01         NaN
2020-02-01         NaN
2020-03-01         NaN
2020-04-01   -2.697314
2020-05-01    1.477164
2020-06-01   -1.773367
2020-07-01    2.817780
2020-08-01   -1.626615
2020-09-01    2.620578
2020-10-01   -1.994182
2020-11-01    2.223315
2020-12-01   -2.379180
Freq: MS, dtype: float64
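The relationship between `diff` and `shift` can be verified directly: `diff(n)` should match subtracting the series shifted by `n`. This sketch regenerates the same seeded series; the names `manual` and `builtin` are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
ts = pd.Series(np.random.randn(12), index=pd.date_range("2020-01-01", periods=12, freq="MS"))

n = 3
manual = ts - ts.shift(n)  # current row minus the row n periods earlier
builtin = ts.diff(n)

# Raises AssertionError if the two series differ (NaN positions are treated as equal).
pd.testing.assert_series_equal(manual, builtin)
print(builtin.iloc[n])
```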

resample

In [21]:

np.random.seed(1)

arr = np.random.randn(365) # 365 random numbers, one year's worth
time_idx = pd.date_range('2021-1-1', periods=365, freq='D') # 365 daily dates covering 2021

ts = pd.Series(arr, index=time_idx) # one year of time series data
print(ts)

2021-01-01    1.624345
2021-01-02   -0.611756
2021-01-03   -0.528172
2021-01-04   -1.072969
2021-01-05    0.865408
                ...   
2021-12-27   -0.557495
2021-12-28    0.939169
2021-12-29   -1.943323
2021-12-30    0.352494
2021-12-31   -0.236437
Freq: D, Length: 365, dtype: float64

down-sampling

In [22]:

# Down-sampling example
ts.resample('M').mean() # convert daily data to monthly means (pandas 2.2+ spells this alias 'ME')

Out[22]:

2021-01-31   -0.080317
2021-02-28    0.075127
2021-03-31    0.186964
2021-04-30   -0.036879
2021-05-31    0.183157
2021-06-30    0.083610
2021-07-31    0.135417
2021-08-31    0.106348
2021-09-30    0.100417
2021-10-31   -0.057832
2021-11-30    0.255580
2021-12-31   -0.305686
Freq: M, dtype: float64
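The mean is only one possible representative value for each bin. As a sketch, the snippet below computes several summaries per month at once and cross-checks January against a direct selection. It uses the 'MS' alias (bins labeled by the first day of the month) rather than 'M', which is a choice made here for compatibility across pandas versions.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
ts = pd.Series(np.random.randn(365), index=pd.date_range("2021-01-01", periods=365, freq="D"))

# 'MS' labels each monthly bin by its first day; agg computes several summaries at once.
monthly = ts.resample("MS").agg(["mean", "min", "max"])

# Cross-check January against a direct selection of the same rows.
january_mean = ts.loc["2021-01"].mean()

print(monthly.head(3))
print(january_mean)
```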

up-sampling

In [23]:

# Up-sampling example
arr = np.random.randn(10) # 10 new random numbers
time_idx = pd.date_range('2021-1-1', periods=10, freq='W') # 10 weekly dates (each Sunday)

ts = pd.Series(arr, index=time_idx)
print(ts)

2021-01-03    0.727813
2021-01-10    0.515074
2021-01-17   -2.782534
2021-01-24    0.584647
2021-01-31    0.324274
2021-02-07    0.021863
2021-02-14   -0.468674
2021-02-21    0.853281
2021-02-28   -0.413029
2021-03-07    1.834718
Freq: W-SUN, dtype: float64

In [24]:

# Forward filling
ts.resample('D').ffill().head(21) # ffill: fill each gap with the most recent earlier observation
# Rows that would otherwise be NaN are filled from the preceding data point.

Out[24]:

2021-01-03    0.727813
2021-01-04    0.727813
2021-01-05    0.727813
2021-01-06    0.727813
2021-01-07    0.727813
2021-01-08    0.727813
2021-01-09    0.727813
2021-01-10    0.515074
2021-01-11    0.515074
2021-01-12    0.515074
2021-01-13    0.515074
2021-01-14    0.515074
2021-01-15    0.515074
2021-01-16    0.515074
2021-01-17   -2.782534
2021-01-18   -2.782534
2021-01-19   -2.782534
2021-01-20   -2.782534
2021-01-21   -2.782534
2021-01-22   -2.782534
2021-01-23   -2.782534
Freq: D, dtype: float64

In [25]:

# Backward filling: fill each gap with the next later observation
ts.resample('D').bfill().head(21)
# Rows that would otherwise be NaN are filled from the following data point.

Out[25]:

2021-01-03    0.727813
2021-01-04    0.515074
2021-01-05    0.515074
2021-01-06    0.515074
2021-01-07    0.515074
2021-01-08    0.515074
2021-01-09    0.515074
2021-01-10    0.515074
2021-01-11   -2.782534
2021-01-12   -2.782534
2021-01-13   -2.782534
2021-01-14   -2.782534
2021-01-15   -2.782534
2021-01-16   -2.782534
2021-01-17   -2.782534
2021-01-18    0.584647
2021-01-19    0.584647
2021-01-20    0.584647
2021-01-21    0.584647
2021-01-22    0.584647
2021-01-23    0.584647
Freq: D, dtype: float64
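To see what up-sampling produces before any filling, the rows can be created with `asfreq()`, which leaves the new rows as NaN. The sketch below uses three hypothetical weekly values and also shows linear interpolation, an alternative fill strategy that the original notebook does not cover.

```python
import pandas as pd

# Hypothetical weekly observations on three consecutive Sundays.
weekly = pd.Series([0.0, 7.0, 14.0], index=pd.date_range("2021-01-03", periods=3, freq="W"))

daily_gaps = weekly.resample("D").asfreq()  # new daily rows exist but hold NaN
daily_ffill = weekly.resample("D").ffill()  # carry the previous observation forward
daily_linear = daily_gaps.interpolate()     # linear interpolation between observations

print(daily_gaps.isna().sum())
print(daily_ffill.iloc[1], daily_linear.iloc[1])
```

Forward and backward filling repeat observed values, while interpolation invents intermediate ones, so the right choice depends on what the data represents.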

Time Series Data Visualization

In [27]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # library for fine-grained plot settings

In [29]:

np.random.seed(1)

ts_data = np.random.randn(365, 3) # random data in a 365 x 3 array
ts_idx = pd.date_range(start='2021-1-1', periods=365, freq='D') # time series index with 365 dates

ts_df = pd.DataFrame(ts_data, index=ts_idx) # build the time series DataFrame
ts_df.head() # head(): show only the first rows

Out[29]:

	0	1	2
2021-01-01	1.624345	-0.611756	-0.528172
2021-01-02	-1.072969	0.865408	-2.301539
2021-01-03	1.744812	-0.761207	0.319039
2021-01-04	-0.249370	1.462108	-2.060141
2021-01-05	-0.322417	-0.384054	1.133769

In [30]:

ts_df.columns = ['X', 'Y', 'Z']
ts_df.head()

Out[30]:

	X	Y	Z
2021-01-01	1.624345	-0.611756	-0.528172
2021-01-02	-1.072969	0.865408	-2.301539
2021-01-03	1.744812	-0.761207	0.319039
2021-01-04	-0.249370	1.462108	-2.060141
2021-01-05	-0.322417	-0.384054	1.133769

In [31]:

ts_df.plot()

Out[31]:

<AxesSubplot:>

In [32]:

ts_df.plot()
plt.title('Time Series Plot of Random Numbers')
plt.show() # render the figure

In [33]:

# Replace each series with its cumulative sum so the trends are easier to see
ts_df = ts_df.cumsum() # cumulative sum
ts_df.plot()
plt.title("Time Series Plot of Random Numbers (Cumulative)")
plt.show()

In [35]:

ts_df.plot(figsize=(10,3))
plt.title('Time Series Plot of Random Numbers')
plt.xlabel('Date') # x-axis label
plt.ylabel('Unit') # y-axis label
plt.show()
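The same figure can be built through an explicit Axes object instead of the `plt.*` state functions, which tends to be easier to manage once a script draws more than one plot. This is a sketch under assumptions: the non-interactive "Agg" backend is selected so the code runs without a display, and the data is regenerated with the same seed.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(1)
ts_df = pd.DataFrame(
    np.random.randn(365, 3),
    index=pd.date_range("2021-01-01", periods=365, freq="D"),
    columns=["X", "Y", "Z"],
).cumsum()

fig, ax = plt.subplots(figsize=(10, 3))
ts_df.plot(ax=ax)  # draw the three columns onto the Axes created above
ax.set_title("Time Series Plot of Random Numbers (Cumulative)")
ax.set_xlabel("Date")
ax.set_ylabel("Unit")
fig.tight_layout()
```

In a notebook, the default backend can be kept and the figure is shown automatically at the end of the cell.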
