User Segmentation - retail data

User Segmentation - retail data - 1 (Preprocessing)

남생이a 2024. 3. 23. 14:37

select *
from retail_data rd

import numpy as np
import pandas as pd
import plotly.express as px  # data visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from wordcloud import WordCloud
from mpl_toolkits.mplot3d import Axes3D
import re  # text cleaning -> tokenization, morphological analysis of words
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import KMeans  # clustering the data
from sklearn.metrics import silhouette_score  # evaluating clustering quality

# Identify outliers using z-score method
from scipy import stats  # z-score outlier detection: how many standard deviations a value lies from the mean

# Outlier treatment - Winsorization
from scipy.stats.mstats import winsorize  # replaces extreme values with less extreme ones -> reduces the influence of outliers

from sklearn.feature_extraction.text import TfidfVectorizer  # weights each word by its importance
from sklearn.metrics.pairwise import linear_kernel  # item-to-item similarity -> personalized recommendations
from sklearn.metrics.pairwise import cosine_similarity  # item-to-item similarity -> personalized recommendations
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
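
The comments on the `stats` and `winsorize` imports describe two ideas that are easier to see on a small example. The following is a minimal sketch, not part of the original notebook: the `values` array is made up, and the 2.5 cutoff and the 10% limits are assumptions chosen for this toy data.

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

# Hypothetical toy values; 500 is the obvious extreme value.
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 500], dtype=float)

# z-score: distance from the mean, measured in standard deviations.
z = np.abs(stats.zscore(values))
outlier_mask = z > 2.5  # the cutoff is an assumption, not a fixed rule

# Winsorization: the lowest 10% and highest 10% of values are replaced
# by the nearest remaining value instead of being dropped.
capped = np.asarray(winsorize(values, limits=[0.1, 0.1]))

print(values[outlier_mask])
print(capped.max())
```

With only ten observations a single extreme value cannot reach a z-score much above 3, which is why a lower cutoff is used here; on a large dataset a cutoff of 3 is a common choice.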

Inspecting the data

Loading the data

df = pd.read_csv('retail_data.csv', encoding='ISO-8859-1')
df
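
Why pass `encoding='ISO-8859-1'`? `read_csv` assumes UTF-8 by default, and a file saved in a single-byte encoding can contain bytes that are not valid UTF-8. The sketch below illustrates this with a made-up, in-memory one-row file; the column names and the pound sign are assumptions for illustration, not taken from `retail_data.csv`.

```python
import io
import pandas as pd

# Hypothetical one-row file. The pound sign is stored as the single byte 0xA3,
# which is valid ISO-8859-1 but not valid UTF-8.
raw = "InvoiceNo,Description\n536365,GIFT BOX \u00a35\n".encode("ISO-8859-1")

try:
    pd.read_csv(io.BytesIO(raw))  # default encoding is UTF-8
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading with the matching encoding decodes the byte correctly.
sample = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(utf8_ok)
print(sample)
```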

df.info()

df.describe()

Detecting missing values

df.isna().sum()
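
Raw counts are hard to judge without the table size, so it can help to show the share of missing values next to them. A minimal sketch on a made-up frame follows; only `CustomerID` is a column name from this post, the other columns and all values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the retail data.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "536368"],
    "Description": ["LANTERN", None, "MUG", "CANDLE"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
})

# Count and percentage of missing values per column.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean().mul(100).round(1),
})
print(missing)
```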

Detecting duplicates

df.duplicated().sum()

Remove duplicates and drop rows with a missing ID, because a customer ID is required for customer-centric analysis

# Dropping duplicated rows
df.drop_duplicates(inplace=True)

# Handling missing values
# Since CustomerID is crucial for customer-centric analysis, we'll drop rows with missing CustomerID
df.dropna(subset=['CustomerID'], inplace=True)

# Checking the shape of the dataset after cleaning
df.shape
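
The same two steps can also be written without `inplace=True`, which keeps the raw frame untouched and makes it easy to report how many rows each step removed. This is a sketch under assumptions: `clean_retail` is a hypothetical helper name, and the toy frame is made up.

```python
import numpy as np
import pandas as pd

def clean_retail(frame: pd.DataFrame, id_col: str = "CustomerID") -> pd.DataFrame:
    """Drop exact duplicate rows, then rows that have no customer ID."""
    deduped = frame.drop_duplicates()
    cleaned = deduped.dropna(subset=[id_col])
    print(f"duplicates removed: {len(frame) - len(deduped)}")
    print(f"rows without {id_col} removed: {len(deduped) - len(cleaned)}")
    return cleaned.reset_index(drop=True)

# Hypothetical toy frame: one exact duplicate and one row without an ID.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536367"],
    "CustomerID": [17850.0, 17850.0, np.nan, 13047.0],
})
cleaned = clean_retail(toy)
print(cleaned.shape)
```

Returning a new frame means the original `toy` still has all four rows, so the cleaning can be rerun or compared against the raw data later.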

Checking the data again (to confirm the preprocessing worked)

df.isna().sum()
