Your Scraped Data is a Mess. Here's How to Clean It.
So you scraped 10,000 pages. Now you have a CSV file full of garbage. Here's how to turn that mess into data that actually works.
So you did it. You scraped 10,000 product pages. You watched the logs scroll by, feeling like a genius.
Then you opened the CSV file.
The prices are a mix of $5.99, 5.99, and Contact for price. The product names are full of weird \n characters. Half the rows are missing a category, and you somehow have 2,000 duplicate entries.
Suddenly, your treasure trove of data looks like a cat walked across your keyboard.
This is the unglamorous, behind-the-scenes part of data scraping that nobody talks about. Getting the data is the fun part. Making it not suck is where the real work begins.
The "I'll Fix It Later" Trap
Most people throw their messy data into a folder, promising they'll clean it up later. They never do.
Unclean data is useless data. You can't analyze it, you can't put it in a database, and you certainly can't make decisions with it. It's just a pile of digital junk.
Here’s how to turn that junk into something that actually works, using Python and the Pandas library—the standard tool for this kind of job.
Let's imagine we scraped some product data and it looks like this:
import pandas as pd
# This is our messy data
data = {
    'product_name': [' Awesome Gadget\n', 'Another Thing ', 'Awesome Gadget\n', 'Super Widget '],
    'price': ['$49.99', '50 USD', '$49.99', None],
    'in_stock': [' In Stock ', 'True', ' In Stock ', 'Out of Stock'],
    'reviews': ['1,200 reviews', '500 reviews', '1,200 reviews', '30 reviews']
}
df = pd.DataFrame(data)
print("--- Messy Data ---")
print(df)
--- Messy Data ---
       product_name   price      in_stock        reviews
0  Awesome Gadget\n  $49.99      In Stock  1,200 reviews
1     Another Thing  50 USD          True    500 reviews
2  Awesome Gadget\n  $49.99      In Stock  1,200 reviews
3      Super Widget    None  Out of Stock     30 reviews
It's a complete mess. Let's fix it, step by step.
1. Get Rid of the Junk (Whitespace and Newlines)
First things first, let's trim all the extra spaces and weird newline characters (\n).
# Strip leading/trailing whitespace from every string (object) column
for col in df.select_dtypes(include=['object']):
    df[col] = df[col].str.strip()
print("--- After Stripping Whitespace ---")
print(df['product_name'])
--- After Stripping Whitespace ---
0    Awesome Gadget
1     Another Thing
2    Awesome Gadget
3      Super Widget
Name: product_name, dtype: object
Simple, but it makes a huge difference.
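One caveat: .str.strip() only trims the edges of each string. If your scraped text has newlines or doubled spaces in the middle (common when a product name spans multiple HTML tags), a regex replace handles those too. A minimal sketch:
# Collapse internal runs of whitespace (newlines, tabs, doubled
# spaces) into a single space, then trim the edges
for col in df.select_dtypes(include=['object']):
    df[col] = df[col].str.replace(r'\s+', ' ', regex=True).str.strip()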
2. Normalize the Important Stuff (Prices and Text)
"Normalization" is a fancy word for "making things consistent." A price should just be a number, not a string with currency symbols. A category should be lowercase so "Gadget" and "gadget" are treated the same.
# Normalize the 'price' column to just numbers
df['price'] = df['price'].str.replace('$', '', regex=False).str.replace(' USD', '', regex=False)
# Convert it to a numeric type, making non-numbers (like None) into NaN (Not a Number)
df['price'] = pd.to_numeric(df['price'], errors='coerce')
# Normalize the 'reviews' column to just numbers
df['reviews'] = df['reviews'].str.replace(' reviews', '', regex=False).str.replace(',', '', regex=False)
df['reviews'] = pd.to_numeric(df['reviews'], errors='coerce')
print("--- After Normalizing Price & Reviews ---")
print(df[['price', 'reviews']])
--- After Normalizing Price & Reviews ---
   price  reviews
0  49.99     1200
1  50.00      500
2  49.99     1200
3    NaN       30
Now we have actual numbers we can do math with.
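Two related tricks. If you have a text column like a category, lowercase it so "Gadget" and "gadget" count as the same thing. And if prices arrive in more formats than you can enumerate, pulling the number out with a regex is sturdier than chaining .str.replace() calls. A sketch (the category column is hypothetical; our sample data doesn't have one):
# Lowercase a text column so 'Gadget' and 'gadget' match
# (illustrative -- our sample data has no 'category' column)
# df['category'] = df['category'].str.strip().str.lower()

# Alternative price parser: extract the first decimal number from
# each raw string, whatever symbols or words surround it
raw = pd.Series(['$49.99', '50 USD', 'Contact for price', None])
print(pd.to_numeric(raw.str.extract(r'(\d+(?:\.\d+)?)', expand=False),
                    errors='coerce'))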
3. Handle the Empty Spaces (Missing Data)
Our price for "Super Widget" is now NaN because it was None to begin with. What do we do? We have two choices:
- Fill it: Replace the missing value with something sensible, like 0 or the average price (see the sketch below).
- Drop it: If a row is missing critical data (like a price), just delete the entire row.
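Here's what the "fill it" option looks like, shown on a copy so it doesn't change the steps below. The median is a common choice because, unlike the mean, it isn't dragged around by outlier prices:
# Option 1 (not applied here): fill missing prices with the median
filled = df['price'].fillna(df['price'].median())
print(filled)  # Super Widget's NaN becomes 49.99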
For this, let's drop any row that doesn't have a price.
df.dropna(subset=['price'], inplace=True)
print("--- After Dropping Missing Prices ---")
print(df)
--- After Dropping Missing Prices ---
     product_name  price  in_stock  reviews
0  Awesome Gadget  49.99  In Stock     1200
1   Another Thing  50.00      True      500
2  Awesome Gadget  49.99  In Stock     1200
The "Super Widget" is gone. Ruthless, but necessary for clean data.
4. Kill the Clones (Deduplication)
We have two entries for "Awesome Gadget." It's a duplicate. Let's kill it.
df.drop_duplicates(inplace=True)
print("--- After Dropping Duplicates ---")
print(df)
--- After Dropping Duplicates ---
     product_name  price  in_stock  reviews
0  Awesome Gadget  49.99  In Stock     1200
1   Another Thing  50.00      True      500
Now every row is unique.
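One caveat worth knowing: drop_duplicates() only removes rows that are identical in every column. If the same product was scraped twice with slightly different review counts, you'd deduplicate on the key column instead. A sketch:
# Deduplicate on product name alone, keeping the first occurrence
df = df.drop_duplicates(subset=['product_name'], keep='first')

# Row labels survive drops, so reset them if you want a clean 0..n index
df = df.reset_index(drop=True)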
5. Give Everything a Type That Makes Sense
Finally, our in_stock column is a mix of In Stock and True. We should make this a simple boolean (True/False).
# Map different string values to a boolean
stock_map = {'In Stock': True, 'True': True, 'Out of Stock': False}
df['in_stock'] = df['in_stock'].map(stock_map)
print("--- Final Clean Data ---")
print(df)
--- Final Clean Data ---
     product_name  price  in_stock  reviews
0  Awesome Gadget  49.99      True     1200
1   Another Thing  50.00      True      500
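A word of warning about .map(): any value not in your dictionary silently becomes NaN. If a future scrape adds a label like "Backordered" (hypothetical), it would vanish into missing data. A more defensive version of the mapping step:
# Unmapped labels (e.g. 'Backordered') become NaN under .map();
# default them to False instead of letting them slip through
labels = pd.Series(['In Stock', 'Backordered'])
print(labels.map(stock_map).fillna(False).astype(bool))  # True, False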
Look at that. From a complete mess to a clean, usable table of data. Now you can actually calculate the average price, find the most-reviewed item, or load it into a database without everything catching fire.
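And to prove that last point, a quick sketch of the payoff (the output file name is just an example):
# Questions that were unanswerable an hour ago
print(f"Average price: ${df['price'].mean():.2f}")
print("Most reviewed:", df.loc[df['reviews'].idxmax(), 'product_name'])

# And the data is finally safe to persist
df.to_csv('clean_products.csv', index=False)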
The Bottom Line
Scraping is not just about writing a script that pulls HTML. That's step one. The real value comes from the boring, methodical process of cleaning up the garbage you get back.
Stop hoarding messy CSVs you'll never use. Spend an hour cleaning your data. It’s the difference between having a folder full of junk and having a dataset that actually works.