NLP — Text Preprocessing
Because NLP projects work with text, the data is unstructured, and, as in any other project, it is important to prepare it before moving on to the model. The preprocessing steps depend on the purpose of the project, so each one can be revised as needed. Steps can also be added depending on the data: for example, if you are working with tweets and the collected data includes user names, an additional step can extract them. The usual steps are as follows.
- Removing spaces
- Removing punctuation
- Removing Numbers
- Lower Casing
- Stopwords Removal
- Stemming
- Lemmatization
- Tokenization
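As an illustration of a project-specific step like the tweet example mentioned earlier, here is a minimal sketch using only Python's standard `re` module. The pattern, the function names `extract_mentions` and `remove_mentions`, and the sample tweet are all assumptions made for this example; real user-name rules vary by platform.

```python
import re

# Assumed pattern: "@" followed by letters, digits or underscores.
# It is deliberately simple and would also match the part after "@"
# in an e-mail address, so adapt it to your data.
MENTION_PATTERN = re.compile(r"@(\w+)")


def extract_mentions(text):
    """Return the user names mentioned in a tweet, without the leading '@'."""
    return MENTION_PATTERN.findall(text)


def remove_mentions(text):
    """Delete mentions, then collapse the spaces left behind."""
    without_mentions = MENTION_PATTERN.sub("", text)
    return re.sub(r" +", " ", without_mentions).strip()


# Made-up sample tweet, for illustration only.
sample = "@alice loved it, but @bob_92 fell asleep"
mentions = extract_mentions(sample)
cleaned = remove_mentions(sample)
```

Whether to keep the extracted names as a separate feature or simply discard them depends on the goal of the project.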
We will use the movie reviews data available on Kaggle to examine these techniques in practice.
Removing spaces
There may be extra spaces at the beginning or end of sentences, or between words. Removing them gives us cleaner data to work with.
# Collapse runs of spaces between words into a single space
df["review"] = df["review"].str.replace(r' +', ' ', regex=True)
# Remove leading and trailing spaces.
df["review"] =…
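To try whitespace cleanup without the Kaggle data, the following self-contained sketch applies the same idea to a couple of made-up strings. It is not necessarily the exact code used in this article: the helper name `normalize_whitespace` and the sample reviews are assumptions for this example, and it uses `\s+` so that tabs and newlines are treated like spaces.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace into one space and trim both ends."""
    # \s+ matches spaces, tabs and newlines; use r" +" to target spaces only.
    return re.sub(r"\s+", " ", text).strip()


# Made-up reviews, for illustration only.
reviews = ["  A  great   movie. ", "Too long,\tbut fun.\n"]
cleaned_reviews = [normalize_whitespace(review) for review in reviews]
```

With a DataFrame, the same helper could be applied to a text column through `Series.apply`, for example `df["review"].apply(normalize_whitespace)`, assuming `df` has a `review` column of strings.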