ByteZero
Web Scraping
8 min read

Your Scraped Data is a Mess. Here's How to Clean It.

So you scraped 10,000 pages. Now you have a CSV file full of garbage. Here's how to turn that mess into data that actually works.

ByteZero Team

So you did it. You scraped 10,000 product pages. You watched the logs scroll by, feeling like a genius.

Then you opened the CSV file.

The prices are a mix of $5.99, 5.99, and Contact for price. The product names are full of weird \n characters. Half the rows are missing a category, and you somehow have 2,000 duplicate entries.

Suddenly, your treasure trove of data looks like a cat walked across your keyboard.

This is the unglamorous, behind-the-scenes part of data scraping nobody talks about. Getting the data is the fun part. Making it not suck is where the real work begins.

The "I'll Fix It Later" Trap

Most people throw their messy data into a folder, promising they'll clean it up later. They never do.

Unclean data is useless data. You can't analyze it, you can't put it in a database, and you certainly can't make decisions with it. It's just a pile of digital junk.

Here’s how to turn that junk into something that actually works, using Python and the Pandas library—the standard tool for this kind of job.

Let's imagine we scraped some product data and it looks like this:

import pandas as pd

# This is our messy data
data = {
    'product_name': ['  Awesome Gadget\n', 'Another Thing ', 'Awesome Gadget\n', 'Super Widget '],
    'price': ['$49.99', '50 USD', '$49.99', None],
    'in_stock': [' In Stock ', 'True', ' In Stock ', 'Out of Stock'],
    'reviews': ['1,200 reviews', '500 reviews', '1,200 reviews', '30 reviews']
}
df = pd.DataFrame(data)
print("--- Messy Data ---")
print(df)
--- Messy Data ---
        product_name        price      in_stock        reviews
0  Awesome Gadget\n       $49.99      In Stock   1,200 reviews
1      Another Thing      50 USD          True     500 reviews
2  Awesome Gadget\n       $49.99      In Stock   1,200 reviews
3       Super Widget        None  Out of Stock      30 reviews

It's a complete mess. Let's fix it, step by step.

1. Get Rid of the Junk (Whitespace and Newlines)

First things first, let's trim all the extra spaces and weird newline characters (\n).

# Strip whitespace from all string columns
for col in df.select_dtypes(include=['object']):
    df[col] = df[col].str.strip()

print("--- After Stripping Whitespace ---")
print(df['product_name'])
--- After Stripping Whitespace ---
0    Awesome Gadget
1     Another Thing
2    Awesome Gadget
3      Super Widget
Name: product_name, dtype: object

Simple, but it makes a huge difference.
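One caveat: .str.strip() only trims the ends of each string. If your scraped names also have junk in the middle (double spaces, stray newlines), a regex replace can collapse that first. A small sketch, using made-up values:

```python
import pandas as pd

df = pd.DataFrame({'product_name': ['  Awesome\n Gadget ', 'Another  Thing ']})

# \s+ matches any run of whitespace (spaces, tabs, newlines);
# collapse each run to a single space, then trim the ends
df['product_name'] = (
    df['product_name']
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)
print(df['product_name'].tolist())  # ['Awesome Gadget', 'Another Thing']
```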

2. Normalize the Important Stuff (Prices and Text)

"Normalization" is a fancy word for "making things consistent." A price should just be a number, not a string with currency symbols. A category should be lowercase so "Gadget" and "gadget" are treated the same.

# Normalize the 'price' column to just numbers
df['price'] = df['price'].str.replace('$', '', regex=False).str.replace(' USD', '', regex=False)
# Convert it to a numeric type, making non-numbers (like None) into NaN (Not a Number)
df['price'] = pd.to_numeric(df['price'], errors='coerce')


# Normalize the 'reviews' column to just numbers
df['reviews'] = df['reviews'].str.replace(' reviews', '', regex=False).str.replace(',', '', regex=False)
df['reviews'] = pd.to_numeric(df['reviews'], errors='coerce')

print("--- After Normalizing Price & Reviews ---")
print(df[['price', 'reviews']])
--- After Normalizing Price & Reviews ---
   price  reviews
0  49.99     1200
1  50.00      500
2  49.99     1200
3    NaN       30

Now we have actual numbers we can do math with.
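Chaining .str.replace() calls works when you know every format in advance. If prices show up in formats you didn't plan for, a regex extraction is often more robust: pull out the first number-like token and let everything else become NaN. A sketch, with an illustrative set of formats:

```python
import pandas as pd

prices = pd.Series(['$49.99', '50 USD', 'USD 19', 'Contact for price'])

# Grab the first number-like token; rows with no number become NaN
extracted = prices.str.extract(r'(\d+(?:\.\d+)?)', expand=False)
prices_clean = pd.to_numeric(extracted, errors='coerce')
print(prices_clean.tolist())  # [49.99, 50.0, 19.0, nan]
```

Note this simple pattern doesn't handle thousands separators like "1,200"; strip the commas first, as we did with the reviews column.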

3. Handle the Empty Spaces (Missing Data)

Our price for "Super Widget" is now NaN because it was None to begin with. What do we do? We have two choices:

  • Fill it: Replace the missing value with something sensible, like 0 or the average price.
  • Drop it: If a row is missing critical data (like a price), just delete the entire row.

For this, let's drop any row that doesn't have a price.

df.dropna(subset=['price'], inplace=True)

print("--- After Dropping Missing Prices ---")
print(df)
--- After Dropping Missing Prices ---
     product_name  price    in_stock  reviews
0  Awesome Gadget  49.99    In Stock     1200
1   Another Thing  50.00        True      500
2  Awesome Gadget  49.99    In Stock     1200

The "Super Widget" is gone. Ruthless, but necessary for clean data.
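If dropping feels too ruthless, the "fill it" option from above might look like this. Filling with the column median is one common choice (it's less skewed by outliers than the mean), but it's a judgment call, not a rule:

```python
import pandas as pd

df = pd.DataFrame({
    'product_name': ['Awesome Gadget', 'Another Thing', 'Super Widget'],
    'price': [49.99, 50.00, None],
})

# Fill the missing price with the median of the known prices
median_price = df['price'].median()
df['price'] = df['price'].fillna(median_price)
print(df['price'].tolist())
```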

4. Kill the Clones (Deduplication)

We have two entries for "Awesome Gadget." It's a duplicate. Let's kill it.

df.drop_duplicates(inplace=True)

print("--- After Dropping Duplicates ---")
print(df)
--- After Dropping Duplicates ---
     product_name  price  in_stock  reviews
0  Awesome Gadget  49.99  In Stock     1200
1   Another Thing  50.00      True      500

Now every row is unique.
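One thing to know: drop_duplicates() with no arguments only removes rows that match on every column. If the same product was scraped twice with slightly different prices, those rows would survive. In that case you can deduplicate on the name alone — a sketch, with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'product_name': ['Awesome Gadget', 'Awesome Gadget', 'Super Widget'],
    'price': [49.99, 47.99, 19.99],  # same product, scraped at two prices
})

# Rows count as duplicates when the name matches; keep the last one seen
deduped = df.drop_duplicates(subset=['product_name'], keep='last')
print(deduped)
```

Whether to keep the first or last occurrence depends on your scrape; keep='last' here just favors the more recently scraped row.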

5. Give Everything a Type That Makes Sense

Finally, our in_stock column is a mix of In Stock and True. We should make this a simple boolean (True/False).

# Map the different string values to a boolean; anything NOT in the
# map becomes NaN, which is a handy flag for values you didn't expect
stock_map = {'In Stock': True, 'True': True, 'Out of Stock': False}
df['in_stock'] = df['in_stock'].map(stock_map)

print("--- Final Clean Data ---")
print(df)
--- Final Clean Data ---
     product_name  price  in_stock  reviews
0  Awesome Gadget  49.99      True     1200
1   Another Thing  50.00      True      500

Look at that. From a complete mess to a clean, usable table of data. Now you can actually calculate the average price, find the most-reviewed item, or load it into a database without everything catching fire.
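Before calling it done, it's worth locking in explicit column types and saving the result. A minimal sketch (the dtypes and filename here are just examples):

```python
import pandas as pd

df = pd.DataFrame({
    'product_name': ['Awesome Gadget', 'Another Thing'],
    'price': [49.99, 50.00],
    'in_stock': [True, True],
    'reviews': [1200, 500],
})

# Lock in explicit dtypes so nothing downstream has to guess
df = df.astype({'price': 'float64', 'reviews': 'int64', 'in_stock': 'bool'})
print(df.dtypes)

# index=False keeps pandas' row index out of the file
df.to_csv('clean_products.csv', index=False)
```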

The Bottom Line

Scraping is not just about writing a script that pulls HTML. That's step one. The real value comes from the boring, methodical process of cleaning up the garbage you get back.

Stop hoarding messy CSVs you'll never use. Spend an hour cleaning your data. It’s the difference between having a folder full of junk and having a dataset that actually works.

Tags

data cleaning, pandas, python, data normalization, web scraping

