Lessons Learned After Crawling and Importing Thousands of Classified Ads into Osclass

A real project story from my side about crawling partner websites, preparing clean datasets, and importing high-volume classifieds into Osclass without killing website performance.

1. June 2026

Recently I worked on a project for one of our large clients to bring content from multiple partner store websites into Osclass. Let us call the client Global Marketplace, and the partner websites Green Store, Blue Market, and Red Shop. On paper it looked simple. In the real world, this was one of those jobs where small details can break everything if you rush 🙂.

In this article I share what really worked for us when using the Listings Crawler Plugin together with the Ad Importer Plugin (CSV, XML, JSON). I also include the mistakes we made and what I would do differently if I started the same migration again.

Short result: with this setup we brought in around 100000 listings within 2 days, while keeping the front office responsive and without negative impact on normal customer traffic.

Project context and why this was not just "import file and done"

Global Marketplace had existing categories, custom fields, a location tree, and active sellers. The partner websites had different structures, naming standards, and data quality. One website had the price in a separate field, another had the price mixed with a currency string, and the third had missing location data in many records.

If we imported everything as-is, the Osclass site would become messy very quickly. Search relevance would be weak, category pages would contain records of mixed quality, and the support team would spend too much time on manual cleanup.

So our first decision was important: the crawler should not be used as a "blind collector". The crawler had to produce a structured and normalized source dataset, and the importer had to apply strict rules before any listing went live.

How we started: small pilot before full scale

I know many people want to start directly with big volume. We did the opposite. First we created a pilot with 300 listings from each partner source. This pilot helped us verify:

selector stability on source websites
data mapping quality (title, description, location, category, images)
deduplication behavior
server load during crawl and import windows
real output quality on search page and item page

This pilot saved us.

We found two critical issues early: duplicated records caused by URL parameters, and broken image URLs from lazy-load attributes.

Crawler setup that worked for us

For crawling we used follow-links mode in most cases, because listing pages and detail pages had different structures. We also tested direct extraction mode on one partner source where all fields were already visible in the list page cards.

Main practical rules we applied:

always validate selectors in analyzer before first production run
define unique key from clean canonical URL, not from title
set request delay to avoid aggressive bursts on partner websites
collect data into storage first, then import in controlled batches
capture both src and data-src for images
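
To make the unique-key and image rules above more concrete, here is a minimal Python sketch. It is not the crawler plugin's internal logic. The list of noise parameters and the helper names canonical_url and pick_image_url are assumptions for illustration only, and the right parameter list depends on each partner website.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters that do not identify a listing.
NOISE_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "sessionid"}


def canonical_url(url: str) -> str:
    """Return a normalized URL that can serve as a deduplication key."""
    parts = urlsplit(url.strip())
    # Keep only parameters that matter, in a stable (sorted) order.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS
    )
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, and drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def pick_image_url(attrs: dict) -> Optional[str]:
    """Prefer the lazy-load attribute, then src, skipping inline placeholders."""
    for key in ("data-src", "src"):
        value = (attrs.get(key) or "").strip()
        if value and not value.startswith("data:"):
            return value
    return None
```

With this kind of normalization, two URLs that differ only in tracking parameters, trailing slash, or fragment produce the same key, which is the case that caused duplicates in our pilot.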

One very important thing: the crawler plugin only prepares data; it does not import directly into the Osclass listing table. This separation was very useful for us because we could inspect the data before publishing anything.

Example of mapping idea

source_url -> unique_id
headline -> title
content_html -> description
price_text -> price + currency parser
city_name -> city (with fallback region)
category_label -> category map table

For category and location mapping we maintained simple lookup rules that did not depend on the source websites' naming. This means "Cars", "Vehicles", and "Auto" from different partners all ended up in one expected Osclass category.
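
The price parser and the category lookup from the mapping above could look roughly like this in Python. This is only a sketch under assumptions: the currency table, the target category name "Vehicles", the fallback "Other", and the rule that a separator followed by one or two digits is a decimal mark are all illustrative choices, not settings taken from the plugins.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional, Tuple

# Hypothetical lookup tables; real ones would cover every partner label.
CURRENCY_SYMBOLS = {"€": "EUR", "$": "USD", "£": "GBP"}
CATEGORY_MAP = {"cars": "Vehicles", "vehicles": "Vehicles", "auto": "Vehicles"}
FALLBACK_CATEGORY = "Other"


def parse_price(price_text: str, default_currency: str = "EUR") -> Tuple[Optional[Decimal], str]:
    """Split a string such as '1 250,50 €' into (amount, currency code)."""
    text = price_text.strip()
    currency = default_currency
    for symbol, code in CURRENCY_SYMBOLS.items():
        if symbol in text or code in text.upper():
            currency = code
            break
    digits = re.sub(r"[^\d.,]", "", text)
    if not digits:
        return None, currency
    # Assumption: a final separator followed by 1-2 digits is the decimal mark.
    match = re.search(r"[.,](\d{1,2})$", digits)
    if match:
        whole = re.sub(r"[.,]", "", digits[: match.start()])
        digits = f"{whole or '0'}.{match.group(1)}"
    else:
        digits = re.sub(r"[.,]", "", digits)
    try:
        return Decimal(digits), currency
    except InvalidOperation:
        return None, currency


def map_category(label: str) -> str:
    """Map a partner label to one target category, with a safe fallback."""
    return CATEGORY_MAP.get(label.strip().lower(), FALLBACK_CATEGORY)
```

Records where parse_price returns None for the amount are good candidates for manual review instead of silent import with a zero price.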

Importer setup: quality gate, not only transport layer

After the crawler output looked good, we moved to Ad Importer. The importer was configured as a strict quality gate. I think this is where many migrations fail, because they treat the importer only as a connector.

Our import profile included:

default language and currency fallback
safe category fallback for unknown labels
email and phone sanity checks
batch limits to protect DB and PHP workers
logs and notification on every run
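
As an illustration of the email and phone sanity checks in that profile, here is a small Python sketch of a pre-import gate. The field names, the simple email pattern, and the 7 to 15 digit phone range are my assumptions for this example; the importer's own validation may work differently.

```python
import re
from typing import List

# Deliberately simple pattern: something@something.tld, no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def gate_errors(record: dict) -> List[str]:
    """Return the reasons a record should be held back (empty list = pass)."""
    errors = []
    if not (record.get("title") or "").strip():
        errors.append("missing title")
    email = (record.get("email") or "").strip()
    if email and not EMAIL_RE.match(email):
        errors.append("invalid email")
    phone = record.get("phone") or ""
    phone_digits = re.sub(r"\D", "", phone)
    # Assumed plausible range for the number of digits in a phone number.
    if phone and not 7 <= len(phone_digits) <= 15:
        errors.append("invalid phone")
    return errors
```

Returning a list of reasons instead of a plain true or false makes the run logs much more useful, because failed records can be grouped by cause.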

We intentionally did not run one massive import call for all records.

Instead we used controlled waves. The first wave imported around 10000 records, the next wave followed after validation, and we moved to larger batches as confidence grew.
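
The wave idea can be sketched in a few lines of Python. This is a generic illustration, assuming the prepared records are available as an iterable of dictionaries; the function name waves and the example sizes are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def waves(records: Iterable[dict], sizes: List[int]) -> Iterator[List[dict]]:
    """Yield batches that follow `sizes`; the last size repeats until done."""
    iterator = iter(records)
    index = 0
    while True:
        size = sizes[min(index, len(sizes) - 1)]
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
        index += 1


# Usage idea: validate each wave before asking for the next one.
# for batch in waves(prepared_records, [10000, 10000, 20000]):
#     import_batch(batch)       # hypothetical import step
#     review_logs_and_site()    # hypothetical validation step
```

Because the generator is lazy, nothing from the next wave is touched until the validation of the previous one is finished.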

Performance lessons from 100000 listings migration

People ask me if 100000 listings can be imported safely. Yes, but only with discipline. The biggest risk is not the importer itself, but the side effects: image downloads, search index update cost, and cache invalidation spikes.

What helped us keep site stable:

run heavy batches in low traffic hours
keep per-run limits realistic to server resources
monitor DB CPU, slow queries, and disk I/O during import windows
separate crawl schedule from import schedule
pause non-critical cron jobs during peak import windows

We also monitored frontend response time every 5 minutes. If it started to rise too much, we reduced the next batch size. This dynamic approach worked better than fixed giant jobs.
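
A simple way to express this feedback loop is a function that proposes the next batch size from the last measured response time. The sketch below is in Python, and every number in it (target of 800 ms, the halving and growth steps, the minimum and maximum) is an assumption for illustration; we tuned this by hand, and suitable values depend on the server.

```python
def next_batch_size(
    current: int,
    response_ms: float,
    target_ms: float = 800.0,
    minimum: int = 500,
    maximum: int = 20000,
) -> int:
    """Propose the next batch size from the latest frontend response time."""
    if response_ms > target_ms * 1.5:
        proposed = current // 2        # site clearly struggling: halve
    elif response_ms > target_ms:
        proposed = current * 4 // 5    # slightly slow: reduce by 20%
    elif response_ms < target_ms * 0.5:
        proposed = current * 5 // 4    # plenty of headroom: grow by 25%
    else:
        proposed = current             # within the comfortable range
    return max(minimum, min(maximum, proposed))
```

The asymmetric steps are intentional in this sketch: the batch shrinks quickly when the site slows down and grows slowly when it recovers.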

Big import projects are not only about data. They are about the operating system, database, cron timing, and business priorities, all at the same time.

Data quality problems we saw in partner feeds

No partner source was perfect. Some records had titles that were too short, many had duplicated template text, some had old phone numbers, and many images were low quality. If we had pushed this directly to production, user trust would have dropped very fast.

Practical fixes we used:

minimum and maximum title length, plus title cleanup rules
description sanitizer to remove noisy blocks
reject records without at least one valid image in selected categories
fallback geolocation when city was missing but region was known
duplicate check by unique id plus fuzzy title similarity for edge cases
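
The last rule, unique id plus fuzzy title similarity, can be illustrated with Python's standard difflib module. This is a minimal sketch: the field names unique_id and title and the 0.9 threshold are assumptions, and the actual duplicate handling in the plugins may differ.

```python
import re
from difflib import SequenceMatcher
from typing import List


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", title.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def is_duplicate(candidate: dict, existing: List[dict], threshold: float = 0.9) -> bool:
    """Exact match on unique_id first, then fuzzy match on normalized title."""
    candidate_title = normalize_title(candidate["title"])
    for item in existing:
        if item["unique_id"] == candidate["unique_id"]:
            return True
        ratio = SequenceMatcher(None, candidate_title, normalize_title(item["title"])).ratio()
        if ratio >= threshold:
            return True
    return False
```

Comparing every candidate against every existing listing does not scale to 100000 records, so in practice the fuzzy comparison should be limited to a small group, for example listings from the same category and city.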

Another thing: the partner websites changed their HTML layout twice during the project. Because of that, we added a quick selector health check before every larger crawl run. That small check prevented silent failures.
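
One possible shape for such a health check, sketched in Python: crawl a handful of known listing URLs, then measure how often each required field comes back non-empty. The field names and the 90% threshold are assumptions for this example, and the extraction step itself is left out because it depends on the crawler.

```python
from typing import Dict, List

# Hypothetical field names produced by the extraction step.
REQUIRED_FIELDS = ("title", "description", "price_text", "image_url")


def selector_health(samples: List[dict], min_fill_rate: float = 0.9) -> Dict[str, float]:
    """Return fields whose fill rate in the sample is below the threshold."""
    if not samples:
        raise ValueError("no sample records extracted; selectors may be broken")
    failing = {}
    for field in REQUIRED_FIELDS:
        filled = sum(1 for record in samples if str(record.get(field) or "").strip())
        rate = filled / len(samples)
        if rate < min_fill_rate:
            failing[field] = rate
    return failing
```

An empty result means the sample looks healthy. Any field in the result is a hint that the partner changed the layout and the selector for that field needs attention before the full crawl starts.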

What I learned about SEO and indexing in this type of migration

For Google visibility, quantity alone is not enough. Imported content must still satisfy quality and intent. Thin duplicate pages can hurt the whole project if you publish blindly.

What we did for better search quality:

blocked low-value duplicates from publishing
kept category taxonomy clean and consistent
ensured listing pages had useful and readable descriptions
kept image alt text meaningful when possible
reviewed crawl/index behavior after each major import wave

The biggest SEO win was not "more pages". The biggest win was a cleaner structure, better relevance in search result pages, and unique, high-quality content. This improved both organic sessions and conversion quality.

Operational checklist I recommend before you go live

Before crawling

verify the server can access source URLs (no blocking, server-level protection, IP ban, etc.)
define legal/partner permission for data usage
prepare selector mapping and fallback selectors
design unique id strategy

Before importing

validate 200-500 sample records manually
test image download behavior and allowed extensions
confirm category and location mappings
set safe batch size and memory/time limits

After import waves

review logs and failed record reasons
check frontend speed and DB health
spot check live listings for quality
adjust next batch size and schedules

Final lessons from this project

If I had to summarize this migration in one sentence: success came from a controlled process, not from one magic setting. We treated crawling and importing as a data engineering workflow, not as a one-click action 🚀.

For the Global Marketplace project, this approach gave us stable scale, better listing quality, and much less manual admin work after launch. If you are planning a similar project, start small, verify everything, and grow in waves. It is slower in the first week, but much faster in the long run.

About the Author

Oliver Bk

My passion is building classifieds marketplaces, automating workflows, and turning messy data into useful products. From PHP, HTML, CSS, and JavaScript to Python, crawlers, imports, and SEO, I enjoy solving technical challenges and sharing lessons learned from real-world projects. Most ideas start with a problem, a cup of coffee, and a curiosity to see how far automation can go.

Osclass, PHP, JavaScript, CSS, Python

46 posts Publishing since 04/2018