Here is example code for a basic data-cleaning process using pandas:
1. Import the necessary libraries:
import pandas as pd
import numpy as np
2. Load the dataset:
df = pd.read_csv('dataset.csv')
3. Check for missing values:
print(df.isnull().sum())
This will print the count of missing values in each column of the dataset.
4. Remove rows with missing values:
df = df.dropna()
This will remove every row that contains at least one missing value.
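To make steps 3 and 4 concrete, here is a minimal sketch on a toy DataFrame. The column names and values are invented for illustration and are not taken from dataset.csv. It also shows the standard subset argument of dropna(), which limits the check to chosen columns when dropping every incomplete row would discard too much data.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; not from dataset.csv.
toy = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Oslo", "Lima", None, "Pune"],
})

# Step 3: count missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Step 4: drop every row that has at least one missing value.
dropped_any = toy.dropna()
print(len(toy), "->", len(dropped_any))

# Variation: drop rows only when "age" is missing,
# keeping rows whose "city" is missing.
dropped_age_only = toy.dropna(subset=["age"])
print(len(toy), "->", len(dropped_age_only))
```

Comparing the row counts before and after is a quick way to see how much data each choice costs.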
5. Check for duplicates:
print(df.duplicated().sum())
This will print the number of rows that are exact copies of an earlier row in the dataset.
6. Remove duplicates:
df = df.drop_duplicates()
This will remove all the duplicate rows from the dataset.
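The duplicate check and removal in steps 5 and 6 can be tried on a small example. The data below is made up for illustration. Note that duplicated() flags only the second and later copies of a row, and drop_duplicates() keeps the first occurrence by default.

```python
import pandas as pd

# Toy data invented for illustration: the row (2, 20) appears twice.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "score": [10, 20, 20, 30],
})

# Step 5: count rows that repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

# Step 6: keep the first copy of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```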
7. Check for outliers:
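Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).sum())
This will print, for each column, the number of values that lie more than 1.5 * IQR below the first quartile or above the third quartile. The check only makes sense for numeric columns, and recent pandas versions may raise an error if df contains non-numeric columns.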
8. Remove outliers:
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
This will remove every row that contains an outlier in at least one column.
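The IQR rule from steps 7 and 8 can be wrapped in a small helper that looks only at numeric columns, so that text columns do not interfere. This is a sketch under assumptions: the function name iqr_outlier_mask, the parameter k, and the toy data are hypothetical and not part of the original code.

```python
import pandas as pd

def iqr_outlier_mask(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Return True for rows with an IQR outlier in any numeric column."""
    # Restrict the calculation to numeric columns only.
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # A value is an outlier if it falls outside [q1 - k*iqr, q3 + k*iqr].
    outside = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return outside.any(axis=1)

# Toy data invented for illustration; 100 is far from the other values.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "value": [10, 12, 11, 13, 100],
})

mask = iqr_outlier_mask(toy)
cleaned = toy[~mask]
print(cleaned)
```

Here the quartiles of value are 11 and 13, so the accepted range is 8 to 16 and only the row with 100 is dropped.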
9. Save the cleaned dataset:
df.to_csv('cleaned_dataset.csv', index=False)
This will save the cleaned dataset to a CSV file.
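Putting the steps together, the whole process might look like the sketch below. The function name clean_dataframe and the toy data are assumptions made for this example, and the result is written to an in-memory buffer rather than to cleaned_dataset.csv so that the example leaves no file behind.

```python
import io

import numpy as np
import pandas as pd

def clean_dataframe(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Drop missing rows, duplicate rows, then IQR outliers, in that order."""
    frame = frame.dropna()
    frame = frame.drop_duplicates()
    # Quartiles are computed after the earlier steps, on numeric columns only.
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return frame[~outside.any(axis=1)].reset_index(drop=True)

# Toy data invented for illustration: one missing value,
# one duplicated row ("b", 12.0), and one extreme value (100.0).
raw = pd.DataFrame({
    "label": ["a", "b", "b", "c", "d", "e", "f"],
    "value": [10.0, 12.0, 12.0, np.nan, 11.0, 13.0, 100.0],
})

cleaned = clean_dataframe(raw)

# Write the CSV to a string buffer; pass a file path instead to save to disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The order of the steps matters: because the quartiles are computed after missing and duplicate rows are removed, those rows do not influence the outlier thresholds.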