How to Web Scrape with ChatGPT: Amazon
Amazon’s constantly shifting HTML and aggressive bot detection make traditional scraping:
- Frustrating (selectors break weekly)
- Time-consuming (hours of maintenance)
- Risky (IP bans come fast)
ChatGPT changes everything. Instead of hunting for selectors, you describe what you want and let AI figure out the details. Here’s your complete guide to selector-free Amazon scraping.
Step 1: The New Way to Scrape (No XPaths Needed)
Old Method:
# Fragile code you'll need to update constantly
price = soup.select_one('span.a-price span.a-offscreen').text
New AI-Powered Method:
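# ChatGPT understands what a "price" looks like.
# extract() is a placeholder for your own helper that sends this prompt,
# together with the page HTML, to ChatGPT and returns the parsed reply.
data = extract("""
From this Amazon HTML:
1. Find all product cards
2. For each, extract:
- Name (main bold heading)
- Price (formatted like $19.99)
- Rating (stars out of 5)
- Prime badge (if present)
Return as clean JSON
""")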
Step 2: Build Your AI Scraper in 5 Minutes
1. Get the Page HTML:
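import requests
from fake_useragent import UserAgent

# Send a Chrome User-Agent instead of the default python-requests one
ua = UserAgent()
headers = {'User-Agent': ua.chrome}
url = "https://www.amazon.com/s?k=wireless+earbuds"
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
html = response.text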
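Before you hand that HTML to ChatGPT, it can help to trim it. The sketch below assumes the raw page is too large to fit comfortably in one prompt (check your model's limits) and strips the parts that carry no product data. `shrink_html` and its `max_chars` default are illustrative names and values, not part of any library, and the regex approach is a rough filter rather than a full HTML parser.

```python
import re

# Patterns for content that carries no product data
_NOISE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.DOTALL | re.IGNORECASE)
_COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
_WHITESPACE = re.compile(r"\s+")


def shrink_html(html: str, max_chars: int = 50_000) -> str:
    """Drop scripts, styles, and comments, collapse whitespace, then truncate."""
    html = _NOISE.sub("", html)
    html = _COMMENTS.sub("", html)
    html = _WHITESPACE.sub(" ", html).strip()
    return html[:max_chars]  # max_chars is an arbitrary cap; tune it for your model


sample = (
    '<html><head><style>p{color:red}</style>'
    '<script>var x = "<b>";</script></head>'
    '<body><!-- ad -->\n  <h2>Earbuds</h2>  <span>$19.99</span></body></html>'
)
small = shrink_html(sample)
print(small)
```

The tags that describe each product stay in place, so prompts that refer to headings or prices still have something to work with.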
2. Feed It to ChatGPT with Smart Prompts:
Analyze this Amazon search results page HTML and extract:
1. All product listings (ignore ads/sponsored)
2. For each product:
- Title (most prominent text)
- Current price (look for $ amounts)
- Original price (if discounted)
- Rating (out of 5 stars)
- Number of reviews
- Prime eligibility (true/false)
Format as JSON array.
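Step 1 leans on an `extract()` helper that is never defined, so here is one possible shape for it. Treat it as a minimal sketch rather than an official client: `ask_model` is a hypothetical callable you supply (wrap your own ChatGPT API call in it), and `fake_model` only stands in for it so the example runs offline.

```python
import json
from typing import Callable


def extract(prompt: str, html: str, ask_model: Callable[[str], str]):
    """Send the prompt plus page HTML to a model and parse the JSON reply."""
    reply = ask_model(f"{prompt}\n\nHTML:\n{html}")
    # Replies often carry extra prose or a Markdown fence around the JSON,
    # so keep only the span from the first bracket to the last one.
    starts = [i for i in (reply.find("["), reply.find("{")) if i != -1]
    end = max(reply.rfind("]"), reply.rfind("}"))
    if not starts or end == -1:
        raise ValueError("model reply contained no JSON")
    return json.loads(reply[min(starts):end + 1])


def fake_model(full_prompt: str) -> str:
    # Stand-in for a real ChatGPT call (hypothetical reply text).
    return 'Sure, here is the data:\n[{"title": "Earbuds", "price": 19.99}]'


products = extract("Extract all product listings. Format as JSON array.",
                   "<html></html>", fake_model)
print(products[0]["title"])
```

Passing the model call in as an argument keeps the parsing logic testable without network access or an API key.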
Pro Tip: Add an example of the output you expect for better accuracy:
{
"title": "Sony WF-1000XM4 Wireless Earbuds",
"price": 278.00,
"original_price": 299.99,
"rating": 4.4,
"review_count": 1243,
"prime": true
}
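The same example can double as a checklist for whatever ChatGPT sends back, since a model can drop fields or return a price as a string. The sketch below checks each record against the field names in the example above. `EXPECTED_TYPES` and `validate_product` are illustrative names, and allowing `null` for `original_price` when there is no discount is an assumption.

```python
import json

# Field names taken from the example record; the types are assumptions
EXPECTED_TYPES = {
    "title": str,
    "price": (int, float),
    "original_price": (int, float, type(None)),  # assumed null when not discounted
    "rating": (int, float),
    "review_count": int,
    "prime": bool,
}


def validate_product(record: dict) -> list:
    """Return a list of problems; an empty list means the record looks sane."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and expected is not bool:
            problems.append(f"wrong type for {field}: bool")
        elif not isinstance(value, expected):
            problems.append(f"wrong type for {field}: {type(value).__name__}")
    rating = record.get("rating")
    if isinstance(rating, (int, float)) and not 0 <= rating <= 5:
        problems.append("rating outside 0-5")
    return problems


good = json.loads(
    '{"title": "Sony WF-1000XM4 Wireless Earbuds", "price": 278.00,'
    ' "original_price": 299.99, "rating": 4.4, "review_count": 1243, "prime": true}'
)
bad = {"title": "X", "price": "$19.99", "rating": 7, "prime": True}
print(validate_product(good))
print(validate_product(bad))
```

Records that come back with problems can be logged or sent back to the model for another pass.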
Step 3: Handle Pagination the Smart Way
Ask ChatGPT to extend the scraper:
Update my scraper to:
1. Detect if there's a "Next" button
2. Follow it while:
- Adding random 3-7 second delays
- Rotating User-Agents
3. Stop after 5 pages or when no more results
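ChatGPT will suggest something like this:
from time import sleep
import random
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    # headers is the User-Agent dict built in Step 2
    soup = BeautifulSoup(requests.get(url, headers=headers, timeout=10).text, 'html.parser')
    # scraping logic ...
    # Find the first link whose visible text contains "next"
    next_page = soup.find(lambda tag: tag.name == 'a' and 'next' in tag.text.lower())
    if next_page and next_page.get('href'):
        sleep(random.uniform(3, 7))  # random 3-7 second delay between pages
        return urljoin("https://www.amazon.com", next_page['href'])
    return None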
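That function handles one page and hands back the next URL, but the "stop after 5 pages" rule from the prompt still needs a loop around it. Here is a minimal sketch of that driver. `crawl` is an illustrative name, and `fake_scrape_page` with its example.com URLs is a stand-in so the loop can run without touching Amazon.

```python
from typing import Callable, Optional


def crawl(start_url: str,
          scrape_page: Callable[[str], Optional[str]],
          max_pages: int = 5) -> list:
    """Follow next-page links until max_pages is reached or no page is left."""
    visited = []
    url = start_url
    # Also stop if a "next" link points back to a page already seen
    while url and len(visited) < max_pages and url not in visited:
        visited.append(url)
        url = scrape_page(url)  # returns the next URL, or None on the last page
    return visited


def fake_scrape_page(url: str) -> Optional[str]:
    # Stand-in for the real scrape_page: pretends the results span 8 pages.
    page = int(url.rsplit("=", 1)[1])
    return f"https://example.com/s?page={page + 1}" if page < 8 else None


pages = crawl("https://example.com/s?page=1", fake_scrape_page)
print(len(pages))  # 5
```

Swap `fake_scrape_page` for the real `scrape_page` once the single-page logic works.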
Step 4: Bypass Anti-Bot Measures Like a Human
Ask ChatGPT for a full anti-detection strategy:
Generate a complete anti-detection system for Amazon scraping including:
1. Header rotation
2. Mouse simulation
3. CAPTCHA evasion
4. Proxy rotation
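A response might include something like:
import random
from time import sleep

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

def human_like_interaction(driver):
    # Scroll down by a random amount
    driver.execute_script(f"window.scrollBy(0, {random.randint(200, 800)})")
    # Move the mouse pointer over the page body
    element = driver.find_element(By.TAG_NAME, 'body')
    ActionChains(driver).move_to_element(element).perform()
    # Type the query one character at a time with short random pauses
    search = driver.find_element(By.ID, 'twotabsearchtextbox')
    for char in "headphones":
        search.send_keys(char)
        sleep(random.uniform(0.1, 0.3))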
Step 5: Scale Like a Pro — Use Proxies to Avoid Bans
Scraping Amazon without proxies is asking to be blocked. Even with rotated headers and delays, your IP will eventually get flagged. That’s why smart scrapers use rotating residential proxies like ByteZero.
Here’s what a typical ByteZero proxy string looks like:
resi-bridge-us.bytezero.io:1111:5d7d1958qs-speed-fast:f526fgh975
This follows the format:
host:port:username:password
Split the string and add it to your script:
proxies = {
    "http": "http://5d7d1958qs-speed-fast:f526fgh975@resi-bridge-us.bytezero.io:1111",
    "https": "http://5d7d1958qs-speed-fast:f526fgh975@resi-bridge-us.bytezero.io:1111"
}
response = requests.get(url, headers=headers, proxies=proxies)
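If you rotate through several proxy strings, splitting them by hand gets old fast. The sketch below turns the `host:port:username:password` format into the dictionary that `requests` expects. `proxies_from_string` is an illustrative helper name, not part of `requests` or any proxy provider's tooling.

```python
from urllib.parse import quote


def proxies_from_string(proxy: str) -> dict:
    """Convert 'host:port:username:password' into a requests-style proxies dict."""
    # Split at most 3 times so a password containing ':' stays in one piece
    host, port, username, password = proxy.split(":", 3)
    # Percent-encode credentials so characters like '@' or ':' do not break the URL
    auth = f"{quote(username, safe='')}:{quote(password, safe='')}"
    proxy_url = f"http://{auth}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


proxies = proxies_from_string(
    "resi-bridge-us.bytezero.io:1111:5d7d1958qs-speed-fast:f526fgh975"
)
print(proxies["https"])
```

A string with fewer than four parts raises a `ValueError` at the unpacking step, which is usually what you want for a malformed entry.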
Step 6: Extract Complex Data Without Selectors
Use prompts like:
- Product Variants: Extract color/size options and availability
- Review Analysis: Summarize most common compliments and complaints
- Price Trends: Track price history and discount percentages
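Discount percentages are plain arithmetic, so it can be safer to compute them in code from the extracted prices than to trust the model's math. A minimal sketch, assuming the `price` and `original_price` fields from the Step 2 example (`discount_percent` is an illustrative name):

```python
from typing import Optional


def discount_percent(price: float, original_price: Optional[float]) -> float:
    """Percentage saved relative to the original price, rounded to one decimal."""
    # No original price, or no actual reduction, means no discount
    if not original_price or original_price <= price:
        return 0.0
    return round((original_price - price) / original_price * 100, 1)


print(discount_percent(278.00, 299.99))  # 7.3
```

Storing each run's price alongside a timestamp is one way to build up the price history yourself, since a single page only shows the current price.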
Ethical Considerations
Check robots.txt and Amazon’s terms before you scrape, and keep your request volume low. As rules of thumb:
- At most 1 request every 3–5 seconds
- No more than 100 pages per IP per day
- Use data responsibly
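Limits like these are easier to keep when the code enforces them. The sketch below is one way to do that, using the numbers from the list above as defaults. `RequestBudget` is an illustrative class, its counter covers the lifetime of the object (create a new one each day), and the `clock` and `sleep` parameters exist only so the timing can be swapped out in tests.

```python
import time


class RequestBudget:
    """Enforce a minimum gap between requests and a cap on total pages."""

    def __init__(self, min_interval=3.0, daily_cap=100,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.daily_cap = daily_cap
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any
        self.used = 0

    def wait_turn(self) -> bool:
        """Block until the next request is allowed; return False once the cap is hit."""
        if self.used >= self.daily_cap:
            return False
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
        self.used += 1
        return True


budget = RequestBudget(min_interval=3.0, daily_cap=100)
for url in ["https://www.amazon.com/s?k=wireless+earbuds"]:
    if not budget.wait_turn():
        break
    # response = requests.get(url, headers=headers, timeout=10)
```

Call `wait_turn()` before every request and stop the run as soon as it returns `False`.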
Try It Yourself Right Now
Paste any Amazon HTML into ChatGPT with this prompt:
Extract all product information from this Amazon HTML including:
- Title
- Price
- Rating
- Key features bullet points
Return as structured JSON.
This isn’t just scraping evolution – it’s a revolution. Long live prompt-powered extraction.