How to remove punctuation and filter out stopwords when processing string data for machine learning

iminu 2022. 5. 10. 17:30

import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
my_stopwords = stopwords.words('english')  # load the English stopword list

def message_cleaning(sentence):
    # 1. Remove punctuation: keep only the characters that are not in string.punctuation.
    Test_punc_removed = [char for char in sentence if char not in string.punctuation]
    # 2. Join the remaining characters back into a single string.
    Test_punc_removed_join = ''.join(Test_punc_removed)
    # 3. Split into words and drop any word whose lowercase form is a stopword.
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in my_stopwords]
    # 4. Return only the words that remain.
    return Test_punc_removed_join_clean
    
message_cleaning('Hello~~! my name is, heheheh! nice to meet you!!!@')
# ['Hello', 'name', 'heheheh', 'nice', 'meet']

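When a string is passed in as sentence, the function first walks through it one character (char) at a time and checks whether each char is one of the symbols in string.punctuation, keeping only the characters that are not. It then joins the remaining characters back into a single string.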
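To see the punctuation step on its own, here is a minimal sketch that uses only the standard library. The first version is the same character-by-character filter as in the function above; the second, str.translate with a deletion table, is an alternative I am adding for comparison and is not part of the original code. The variable names are made up for this example.

```python
import string

sentence = 'Hello~~! my name is, heheheh! nice to meet you!!!@'

# Approach from the post: keep each character that is not punctuation, then join.
by_comprehension = ''.join(ch for ch in sentence if ch not in string.punctuation)

# Alternative (not in the original post): a translation table that
# deletes every character listed in string.punctuation.
delete_punct = str.maketrans('', '', string.punctuation)
by_translate = sentence.translate(delete_punct)

print(by_comprehension)                  # Hello my name is heheheh nice to meet you
print(by_comprehension == by_translate)  # True
```

Note that string.punctuation only contains ASCII punctuation, so symbols such as curly quotes or '…' would pass through both versions untouched.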

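Next, the split function breaks that string into words on whitespace. Each word is lowercased and checked against the stopwords list, and only the words that are not stopwords are returned, in their original casing.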
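The effect of lowercasing before the lookup can be shown in isolation. This is only a sketch: demo_stopwords is a small hand-picked set I made up so the example runs without downloading anything, and drop_stopwords is a hypothetical helper name. The real list comes from stopwords.words('english').

```python
# Hypothetical, hand-picked stopwords for illustration only;
# the real list comes from stopwords.words('english').
demo_stopwords = {'my', 'is', 'to', 'you'}

text = 'Hello My name IS heheheh nice to meet you'


def drop_stopwords(text, stop_set):
    """Split on whitespace and keep words whose lowercase form is not a stopword."""
    return [word for word in text.split() if word.lower() not in stop_set]


# Same filter without .lower(), to show why the lowercase step matters.
case_sensitive = [word for word in text.split() if word not in demo_stopwords]

kept = drop_stopwords(text, demo_stopwords)
print(kept)            # ['Hello', 'name', 'heheheh', 'nice', 'meet']
print(case_sensitive)  # ['Hello', 'My', 'name', 'IS', 'heheheh', 'nice', 'meet']
```

Without the lowercase step, 'My' and 'IS' slip through because the stopword entries are all lowercase. Using a set here also makes each membership check fast, which may matter when cleaning many messages.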