Intermediate · 30 min · 140 XP

Data Preprocessing & Feature Engineering

Scale features, encode categories, handle missing values, and engineer new features for better ML models

Garbage in, garbage out. The quality of your data preprocessing directly determines model performance. This lesson covers the essential transformations every ML practitioner must know.

Why Preprocessing Matters

Raw data is messy:

  • Different scales (age: 0-100, salary: 0-500,000)
  • Missing values
  • Categorical text (can't feed "Male"/"Female" to an algorithm)
  • Outliers
  • Skewed distributions
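
Each of these problems is easy to see in a toy example. The following sketch (with entirely made-up values) builds a small DataFrame that exhibits mismatched scales, a missing value, and text categories:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: different scales, a missing value, text categories.
df = pd.DataFrame({
    'age': [25, 38, np.nan, 52],                     # missing value, 0-100 scale
    'salary': [40_000, 85_000, 120_000, 410_000],    # much larger scale
    'gender': ['Male', 'Female', 'Female', 'Male'],  # text, not numeric
})

print(df.isna().sum())   # one missing 'age'
print(df.dtypes)         # 'gender' is object, not numeric
```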

Feature Scaling

StandardScaler (Z-score normalisation)

Best for: algorithms sensitive to scale (SVM, KNN, neural networks, PCA)

python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use same scaler!

Result: mean=0, std=1 for each feature
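
You can verify this property directly. A minimal check on made-up data (one small-scale and one large-scale feature):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 'age'-like and 'salary'-like columns.
X = np.array([[25, 40_000], [38, 85_000], [47, 120_000]], dtype=float)

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # ≈ [0, 0]
print(X_scaled.std(axis=0))   # ≈ [1, 1]
```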

MinMaxScaler (0-1 normalisation)

Best for: neural networks, image data

python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

Encoding Categorical Variables

Label Encoding (for ordinal data)

python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['size_encoded'] = le.fit_transform(df['size'])  # NB: codes are alphabetical (L→0, M→1, S→2)
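
One caveat: LabelEncoder assigns codes in alphabetical order, so it does not preserve the S < M < L ranking. To keep the true order, one option is OrdinalEncoder with the categories spelled out explicitly (a sketch, assuming a `size` column holding those three values):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'size': ['S', 'L', 'M', 'S']})

# List categories in their true order so S→0, M→1, L→2.
enc = OrdinalEncoder(categories=[['S', 'M', 'L']])
df['size_encoded'] = enc.fit_transform(df[['size']]).ravel()
print(df['size_encoded'].tolist())  # [0.0, 2.0, 1.0, 0.0]
```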

One-Hot Encoding (for nominal data)

python
from sklearn.preprocessing import OneHotEncoder
# or use pandas:
df = pd.get_dummies(df, columns=['colour'], drop_first=True)
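
As a quick sketch with a hypothetical `colour` column: get_dummies creates one indicator column per category, and `drop_first=True` drops the first (alphabetically) to avoid redundancy:

```python
import pandas as pd

df = pd.DataFrame({'colour': ['red', 'green', 'blue', 'red']})
df = pd.get_dummies(df, columns=['colour'], drop_first=True)

# 'colour_blue' is dropped; a row of all zeros now means blue.
print(df.columns.tolist())  # ['colour_green', 'colour_red']
```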

Handling Missing Values

python
from sklearn.impute import SimpleImputer

# Numerical: fill with mean/median
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

# Categorical: fill with most frequent
cat_imputer = SimpleImputer(strategy='most_frequent')
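
The two imputers can be applied to the right columns in a single step with ColumnTransformer. A sketch with hypothetical column names:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'age': [25.0, np.nan, 47.0, 31.0],
    'city': ['Leeds', 'York', 'Leeds', np.nan],
})

# Median-impute numeric columns, mode-impute categorical ones.
preprocessor = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), ['age']),
    ('cat', SimpleImputer(strategy='most_frequent'), ['city']),
])
X_imputed = preprocessor.fit_transform(df)
print(X_imputed)  # NaN age → 31.0 (median), NaN city → 'Leeds' (most frequent)
```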

Feature Engineering

Creating new features from existing ones:

python
import numpy as np

# Polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # Adds x1², x2², x1*x2

# Log transform for skewed features
df['log_salary'] = np.log1p(df['salary'])

# Interaction features
df['age_experience'] = df['age'] * df['years_experience']
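
The effect of the log transform is measurable: it pulls in the long right tail of a skewed feature. A sketch on synthetic lognormal "salaries" (parameters are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({'salary': rng.lognormal(mean=10.5, sigma=0.8, size=1000)})

df['log_salary'] = np.log1p(df['salary'])

# Skewness drops from strongly positive to near zero.
print(df['salary'].skew(), df['log_salary'].skew())
```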

Train-Test Split

Critical rule: Never fit preprocessing on test data!

python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
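
A reliable way to enforce that rule is to wrap the scaler and model in a Pipeline: fit() learns scaling statistics from the training data only, and the same statistics are reused at prediction time. A sketch on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaler is fitted inside fit(), on X_train only — no test-set leakage.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```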
