Data Preprocessing

What is Data Preprocessing?

Data preprocessing is the process of transforming raw data into a form that an algorithm can understand. Real-world (raw) data is often incomplete and inconsistent. Data preprocessing makes that raw data useful.

Data Preprocessing Step By Step

Step 1 : Import the libraries

Step 2 : Import the data set

Step 3 : Data Cleaning

Step 4 : Data Transformation

Step 5 : Data Reduction

Step 6 : Feature Scaling

Import the libraries in Python

First, import the pandas and NumPy libraries and give them aliases.

import pandas as pd
import numpy as np

Import the data set

Import the data using pandas. You can import data from CSV, Excel, SAS, delimited text, SQL, and URLs.

# Import CSV File

Data = pd.read_csv("Train.csv")

# Import CSV File Using URL

Data = pd.read_csv("")

# import TXT File

Data = pd.read_table("train1.txt")

# Import Excel File

Data = pd.read_excel("train.xls", sheet_name="June", skiprows=2)

# Import from a SQLite 3 database

import sqlite3
conn = sqlite3.connect('forecasting.db')
query = "SELECT * FROM forecasting"
results = pd.read_sql(query, con=conn)
print(results.head())

Data Cleaning

Find Missing Data

First, check whether the data set has missing values.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Data = pd.read_csv("~/Downloads/Data Science/data set/train.csv")


For a large data set, use isnull().sum() to get a count of missing values per column rather than inspecting the full isnull() output.
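
As a minimal sketch of that check, the snippet below counts missing values per column. The frame `sample` and its column names are made up for illustration and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "price": [10.0, np.nan, 12.5, np.nan],
    "city": ["Pune", "Delhi", None, "Surat"],
})

# Number of missing values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# True if the data set contains at least one missing value
has_missing = bool(sample.isnull().values.any())
print(has_missing)
```

The per-column counts stay readable no matter how many rows the data set has, which is why they scale better than looking at the boolean table itself.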


Sometimes null values are encoded differently, for example as ? or as a blank. Convert them to NaN so that later steps can handle them.

# Replace the '?' placeholder with NaN in every column
for col in Data.columns:
    Data.loc[Data[col] == '?', col] = np.nan

Visualizing Null Values

The seaborn library is used for statistical graphics. A heatmap of isnull() shows where the null values are.

import seaborn as sns
sns.heatmap(Data.isnull(), cbar=False)

Drop Null Value

Using dropna()
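
A minimal sketch of dropna(), assuming a small hypothetical frame (the names `sample`, `a`, `b`, and `c` are illustrative only). It shows dropping rows, dropping columns, and restricting the check to one column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, 9.0],
})

# Drop every row that contains at least one null value
rows_dropped = sample.dropna()

# Drop every column that contains at least one null value
cols_dropped = sample.dropna(axis=1)

# Drop only the rows where column "a" is null
subset_dropped = sample.dropna(subset=["a"])
```

Each call returns a new DataFrame and leaves `sample` unchanged, so assign the result if you want to keep it.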


Using the column name

If the graph shows a column that is no longer useful, or one with around 75% null values that is not worth imputing, the better option is to remove the column.

Data.drop(['Unnamed: 0'], axis=1, inplace=True)

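Filling null values using the mean or mode

# Find the mode (the most frequent value); category_ID is categorical, so a mean does not apply
result = Data.category_ID.mode()

# Replace the '?' placeholder with a known category
Data.loc[Data.category_ID == '?', 'category_ID'] = 'Category 26'

# Then fill the remaining null values with the mode
Data["category_ID"] = Data["category_ID"].fillna(result[0])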
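
For a numeric column, the mean is the usual fill value. A minimal sketch, assuming a hypothetical numeric column named `price` that is not part of the original data set:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column, for illustration only
sample = pd.DataFrame({"price": [10.0, np.nan, 20.0, np.nan, 30.0]})

# mean() skips NaN values by default
mean_price = sample["price"].mean()

# Fill the null values with the mean and assign the result back
sample["price"] = sample["price"].fillna(mean_price)
print(sample)
```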

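Fill null values with the previous or next value

# Both calls return a new DataFrame; assign the result to keep it
Data.bfill()  # fill each null with the next valid value
Data.ffill()  # fill each null with the previous valid value (same as method='pad')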

Fill NaN values using replace()

# replace() returns a new Series, so assign the result back to the column
Data['category_ID'] = Data['category_ID'].replace(to_replace=np.nan, value='Category 26')

Data Transformation

When your data mixes attributes measured in different units, for example currency, kilograms, and sales volume, you often need to transform them.


Normalization rescales real-valued numeric attributes into the range 0 to 1.

Use it when you do not know the distribution of your data, or when you know the distribution is not Gaussian (bell curve). Examples: k-nearest neighbors and artificial neural networks.

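# Import the sklearn module used for normalization
from sklearn import preprocessing
# All values must be numeric; non-numeric columns are not converted automatically
# minmax_scale rescales each column to the range 0 to 1
# (preprocessing.normalize would instead scale each row to unit norm)
normalized_Data = preprocessing.minmax_scale(Data)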


Standardization shifts and rescales the distribution of each attribute so that it has a mean of zero and a standard deviation of one.

Use it when your data follows a Gaussian distribution (bell curve). This is not compulsory, but the technique is more effective if your attributes are Gaussian and on varying scales. Examples: linear regression, logistic regression.

standardized_Data = preprocessing.scale(Data)
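
To make the two definitions concrete, the sketch below computes min-max rescaling and standardization by hand on one hypothetical attribute and compares the results with sklearn's `minmax_scale` and `scale`. The array `x` is made up for illustration.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical single numeric attribute, for illustration only
x = np.array([[2.0], [4.0], [6.0], [8.0]])

# Min-max rescaling to the range 0 to 1: (x - min) / (max - min)
manual_minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
sklearn_minmax = preprocessing.minmax_scale(x)

# Standardization: (x - mean) / standard deviation (population std, ddof=0)
manual_standard = (x - x.mean(axis=0)) / x.std(axis=0)
sklearn_standard = preprocessing.scale(x)

print(np.allclose(manual_minmax, sklearn_minmax))
print(np.allclose(manual_standard, sklearn_standard))
```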

Data Discretization

When your data is continuous and you need to convert it into discrete values, use discretization.
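
One common way to do this in pandas is binning. A minimal sketch, assuming a hypothetical series of ages; the bin edges and labels are illustrative choices, not values from the original data set.

```python
import pandas as pd

# Hypothetical ages, for illustration only
ages = pd.Series([3, 17, 25, 42, 68, 90])

# Fixed-edge bins: (0, 18], (18, 65], (65, 120]
age_group = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])

# Equal-frequency bins: each of the 3 bins receives about the same number of rows
age_tercile = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages, "group": age_group, "tercile": age_tercile}))
```

`pd.cut` suits cases where the boundaries have a meaning of their own, while `pd.qcut` suits cases where you want balanced bin sizes.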


Data Reduction

Data reduction (dimension reduction) techniques include the filter method using Pearson correlation, the wrapper method using p-values, the embedded method using Lasso regularization, and PCA (Principal Component Analysis). The embedded method is iterative: each training iteration checks which features contribute the most, and if a feature is irrelevant, the Lasso penalty shrinks its coefficient to 0 (zero), which removes it.

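#Filter method: drop columns that correlate above 0.1 with at most 4 columns (itself included)
corr = Data.corr(numeric_only=True)
drop_cols = []
for col in corr.columns:
    # Count the columns whose absolute Pearson correlation with col exceeds 0.1
    if sum(corr[col].map(lambda x: abs(x) > 0.1)) <= 4:
        drop_cols.append(col)
Data.drop(drop_cols, axis=1, inplace=True)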
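
The embedded method can be sketched in a similar way. The example below is only an illustration on synthetic data: the feature names `x0` to `x4`, the target, and `alpha=0.1` are all assumptions. It fits sklearn's `Lasso` and keeps the features whose coefficient was not shrunk to zero.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration): only x0 and x1 drive the target
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=[f"x{i}" for i in range(5)])
y = 3.0 * X["x0"] - 2.0 * X["x1"] + rng.normal(scale=0.1, size=500)

# alpha sets the strength of the L1 penalty; 0.1 is an arbitrary choice here
model = Lasso(alpha=0.1).fit(X, y)

# Features whose coefficient is exactly zero are candidates for removal
selected = [col for col, coef in zip(X.columns, model.coef_) if coef != 0]
X_reduced = X[selected]
print(selected)
```

On real data the features should be on comparable scales before fitting, because the penalty acts on coefficient size, and `alpha` is usually tuned rather than fixed.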

Feature Scaling

Feature scaling is a form of data transformation. It is applied to the independent variables to bring the data into a particular range. It can also speed up an algorithm's calculations.

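from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit on the numeric independent variables and rescale them to mean 0 and standard deviation 1
scaled_Data = scaler.fit_transform(Data)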

When to Use Feature Scaling

Use feature scaling when attributes have large or very different scales, so that their raw magnitudes would be misleading, and your algorithm is distance based. Examples: K-Means, K-Nearest Neighbours, PCA.
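
The effect on a distance-based algorithm can be sketched with a few hypothetical samples (the ages and incomes below are made up for illustration). Before scaling, income dominates the Euclidean distance. After `StandardScaler`, both attributes contribute.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: [age in years, income in currency units]
X = np.array([
    [25.0, 50_000.0],
    [60.0, 51_000.0],
    [26.0, 52_000.0],
    [58.0, 90_000.0],
])

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Unscaled: income dominates, so sample 0 looks closer to sample 1 than to sample 2
raw_01 = distance(X[0], X[1])
raw_02 = distance(X[0], X[2])

# Scaled: each column has mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
scaled_01 = distance(X_scaled[0], X_scaled[1])
scaled_02 = distance(X_scaled[0], X_scaled[2])

print(raw_01 < raw_02)        # nearest neighbour of sample 0 before scaling is sample 1
print(scaled_02 < scaled_01)  # after scaling it is sample 2, which has a similar age
```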