Jad Elamrani

Building Birdello: How I Scraped My Way to Thousands of Users Without Spending a Dime on Marketing

A scraper-first job platform for elite firms: detection speed, normalized data, matching, and reliability.

Product Scraping Systems


The Problem

Birdello's official Instagram page.

Traditional job boards miss the best opportunities. Top-tier finance, consulting, and law firms often post positions exclusively on their own career pages, never touching LinkedIn or Indeed. These hidden jobs represent the most coveted positions in the industry.

Job seekers in finance often had to manually visit dozens of different career websites every day just to check for new openings. This process wasted a huge amount of time and energy. At my university, Imperial College London, it was such a big issue that a group of students formed a team of 10 people who would each refresh 5 websites a day, collectively scanning 50 career pages daily, just to track down summer internships.

I built Birdello to solve this. It’s a specialized scraper that monitors 1000+ elite company career pages 24/7, categorizes opportunities using AI, and matches them to user preferences in real time. Speed matters too: the first applicants typically have a much higher chance of making it into the interview pipeline, so Birdello prioritizes rapid detection and instant notifications.

Architecture Overview

I designed a multi-tier system to handle the complexity:

  • Frontend (React Native + Next.js)
  • Backend API (Node.js/Express)
  • Scraping Services (Python + Puppeteer)
  • Database (AWS RDS/S3)
  • AI Processing (fine-tuned BERT model for label classification)

Why this architecture? Each layer owns a single concern. The frontend focuses on UX, the API encapsulates business rules, the scrapers do the messy extraction work, the database keeps relationships clean and queryable, and the AI layer turns raw text into structured signals (roles, opportunities, seniority, sponsorship).

Where the orchestrator file fits: the Node/Puppeteer script is the brain for browser-based collection (and API collection when available). It normalizes HTML, handles “load more” and infinite scroll, dives into detail pages when needed, dedupes, categorizes, summarizes, writes to the DB, exports to Excel for auditing, and ships structured logs to a dashboard.

System overview: scrape, normalize, enrich, store, match, then notify.

The Hard Parts: Web Scraping at Scale

Ensuring the website stays responsive.

Challenge 1: Every Website is Different

Goldman Sachs uses Oracle; Barclays has infinite scroll; some sites bury content in Shadow DOM or iframes.

Solution: The Guidebook System, a per-company JSON map so I can add targets without changing code:

{
  "Goldman Sachs": {
    "URL": "https://www.goldmansachs.com/careers/students/",
    "Structure": {
      "JobListingsSelector": ".gs-card-application",
      "JobTitle": ".gs-card-application__title span",
      "JobLink": ".gs-card-application > a",
      "JobLocation": ".gs-card-application__category-location"
    },
    "CookieButton": "#truste-consent-button",
    "SecondaryScrape": {
      "Description": ".gs-details-list"
    }
  }
}

In code, I merge per-company overrides with reusable selector templates (Workday/Greenhouse). For Shadow DOM/iframes, the orchestrator switches context or uses a querySelectorAllDeep fallback. If jobs are behind “Load more” or infinite scroll, it clicks or scrolls with timeouts and a hard page cap.
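
A minimal sketch of that merge, assuming hypothetical template names and a Platform field on the guidebook entry (the real structure may differ):

// Reusable selector templates for common ATS platforms (placeholder selectors)
const templates = {
  Workday: { JobListingsSelector: '.job-card', JobTitle: '.job-card__title' },
  Greenhouse: { JobListingsSelector: '.opening', JobTitle: '.opening a' }
};

// Per-company overrides from the guidebook JSON win over the platform template
function resolveSelectors(company) {
  const base = templates[company.Platform] || {};
  return { ...base, ...(company.Structure || {}) };
}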

Challenge 2: Modern Websites Use Complex JavaScript

Many career pages render little HTML until their JS runs, or fetch listings via JSON after load.

Solution: Multi-Engine Scraping (choose the least brittle per site).

Speeding up iteration: small internal tools to debug selectors, flows, and edge cases faster.
  1. Puppeteer for heavy JS:
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(company.URL, { waitUntil: 'networkidle2' });

// Handle infinite scroll
if (company.Scrolling?.IsRequired) {
  await autoScroll(page);
}

// Click load more buttons
if (company.LoadMoreButton) {
  await page.click(company.LoadMoreButton);
}
  2. API Integration for Workday/Greenhouse/Eightfold (faster, cleaner):
// Workday API example
const axios = require('axios');

const response = await axios.post(
  'https://company.wd3.myworkdayjobs.com/wday/cxs/company/jobs',
  { appliedFacets: {}, limit: 20, offset: 0 }
);
  3. Cheerio for truly static HTML.
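
A minimal sketch of the static path, assuming the selectors come from the same guidebook entry:

const axios = require('axios');
const cheerio = require('cheerio');

// Fetch raw HTML and parse it without a browser
const { data: html } = await axios.get(company.URL);
const $ = cheerio.load(html);

const jobs = $(company.Structure.JobListingsSelector).map((_, el) => ({
  title: $(el).find(company.Structure.JobTitle).text().trim(),
  link: $(el).find(company.Structure.JobLink).attr('href')
})).get();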

Challenge 3: Anti-Bot Measures

CAPTCHAs, rate-limits, and headless detection are common.

Solution: Human-Like Behavior

// Random, human-ish delays
await page.waitForTimeout(Math.random() * 3000 + 1000);

// Accept cookies like a person would
if (company.CookieButton) {
  await page.click(company.CookieButton);
}

// Real browser UA
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64)...');

I also rotate IPs and use residential proxies on sensitive targets. Network logging (XHR, fetch, or all requests) helps compare against DevTools and decide if an API route is feasible.
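
The network logging mentioned above can hang off Puppeteer's request events; a minimal sketch, filtered to XHR and fetch traffic:

// Log background requests so they can be compared against DevTools
page.on('request', request => {
  const type = request.resourceType();
  if (type === 'xhr' || type === 'fetch') {
    console.log(`${request.method()} ${request.url()}`);
  }
});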


Where the Orchestrator Shines (Problems & Defenses)

  • Inconsistent text: normalize capitalization, collapse whitespace, and parse locations from both title and location fields.
  • Shadow DOM & iframes: deep selectors and iframe context switching.
  • Load more & infinite scroll: visibility checks, disabled-state checks, fallback clicks, and a hard page cap (80); see the sketch after this list.
  • Missing fields: open detail pages in batches (default 10 tabs) and fetch Location, Location2, Description, Description2 with traditional and deep selectors.
  • London-only rule (when trustworthy): if a company exposes reliable locations (structured field or API), filter to London early; otherwise keep everything until you can infer.
  • Duplicate protection: site-side dedupe by Title + Location; DB dedupe by URL or Title + Location depending on platform behavior.
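
A minimal sketch of the load-more loop with the hard page cap (the visibility check and delay values are illustrative):

const MAX_PAGES = 80; // hard cap so a broken button can never loop forever

let pagesLoaded = 0;
while (pagesLoaded < MAX_PAGES && company.LoadMoreButton) {
  const button = await page.$(company.LoadMoreButton);
  if (!button) break;

  // Skip clicking when the button is disabled or not actually visible
  const clickable = await button.evaluate(el => !el.disabled && el.offsetParent !== null);
  if (!clickable) break;

  await button.click();
  await page.waitForTimeout(Math.random() * 2000 + 1000); // let the new cards render
  pagesLoaded++;
}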

Database Design for Complex Relationships

Schema thinking: categories and relationships mattered as soon as the dataset got big.

Jobs can have multiple locations, fit several roles, and map to different opportunity types. I model those as many-to-many relationships. The normalization level I chose is Boyce-Codd Normal Form (BCNF). Here, every determinant is a candidate key: JobID uniquely identifies each job, ApplicationLink is unique, and the composite keys in joblocations and jobroles fully define their relationships. BCNF is a stricter version of Third Normal Form that removes redundancy and prevents update or insertion anomalies by ensuring all functional dependencies are tied to candidate keys. This makes the database cleaner, more consistent, and easier to maintain and scale.

-- Core job entity
CREATE TABLE jobs (
  JobID INT PRIMARY KEY AUTO_INCREMENT,
  CompanyName VARCHAR(500),
  Title VARCHAR(500),
  ApplicationLink VARCHAR(500) UNIQUE,
  Description VARCHAR(500),
  IsActive BOOLEAN DEFAULT TRUE,
  CreatedAt TIMESTAMP
);

-- Many-to-many: Jobs ↔ Locations
CREATE TABLE joblocations (
  JobID INT,
  LocationID INT,
  PRIMARY KEY (JobID, LocationID)
);

-- Many-to-many: Jobs ↔ Roles
CREATE TABLE jobroles (
  JobID INT,
  RoleID INT,
  PRIMARY KEY (JobID, RoleID)
);

Reality: Some ATSs reuse one ApplicationLink for many distinct roles (Salesforce/Workday shells). I flag those with NoDuplicateLinkCheckRequired and switch dedupe to Title + Location. If you enforce UNIQUE(ApplicationLink), consider a composite unique index (CompanyName, Title, UnfilteredLocation) or a derived JobHash.
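
If you go the derived JobHash route, a minimal sketch (the hashing choice and normalization are assumptions):

const crypto = require('crypto');

// Stable dedupe key for ATSs that reuse one shell ApplicationLink
function jobHash(job) {
  const key = [job.CompanyName, job.Title, job.UnfilteredLocation]
    .map(v => (v || '').trim().toLowerCase())
    .join('|');
  return crypto.createHash('sha256').update(key).digest('hex');
}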

AI-Powered Job Intelligence

User preferences: keep the intake short, but structured enough to match well.

Raw postings are inconsistent. Titles like “Vice President - Investment Banking Division, EMEA Coverage” need normalization.

Solution: OpenAI Integration

const OpenAI = require('openai');
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const categorizeJob = async (jobTitle, description) => {
  const prompt = `
  Categorize this job into standardized roles and locations:
  Title: ${jobTitle}
  Description: ${description}

  Return JSON with: role, seniority, locations, requiresSponsorship
  `;
  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: prompt }]
  });
  return JSON.parse(response.choices[0].message.content);
};

I validate AI outputs against controlled vocabularies (valid roles and opportunities), and only store categorized rows when both sides match. I also summarize descriptions for fast mobile scanning.
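
A minimal sketch of that validation, with illustrative vocabularies and assumed field names on the model output:

// Controlled vocabularies (illustrative values, not the production lists)
const VALID_ROLES = ['Investment Banking', 'Consulting', 'Law', 'Trading'];
const VALID_OPPORTUNITIES = ['Internship', 'Graduate Programme', 'Full-Time'];

// Store the categorized row only when both sides match
function isValidCategorization(result) {
  return VALID_ROLES.includes(result.role) &&
         VALID_OPPORTUNITIES.includes(result.opportunity);
}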

Job intelligence pipeline: classification plus simple rules beats brittle heuristics.

Real-Time Matching Algorithm

An early idea: swipe UX. I tested the concept, then decided the final product needed a more “filter and act” workflow.

Users fill a short intake. As new jobs arrive, I match instantly.

Two-tier approach

  1. Rule-based filtering (fast, deterministic)
const matchJob = (job, userProfile) => {
  if (!userProfile.locations.some(loc => job.locations.includes(loc))) return false;
  if (!userProfile.roles.some(role => job.roles.includes(role))) return false;
  if (userProfile.needsSponsorship && !job.sponsorshipAvailable) return false;
  return true;
};
  2. ML-enhanced scoring (learns from clicks and saves)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_relevance_score(job_description, user_preferences):
    vectorizer = TfidfVectorizer()
    # Fit on both texts so the shared vocabulary ends up in both vectors
    vectors = vectorizer.fit_transform([job_description, user_preferences])
    return cosine_similarity(vectors[0], vectors[1])[0][0]

The final ranking blends rules (hard constraints) with a score (soft preferences).
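
A minimal sketch of that blend, reusing matchJob from above and assuming a relevanceScore has already been attached to each job:

// Hard constraints gate the candidate set; the soft score only orders what survives
function rankJobs(jobs, userProfile) {
  return jobs
    .filter(job => matchJob(job, userProfile))
    .sort((a, b) => (b.relevanceScore || 0) - (a.relevanceScore || 0));
}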

Handling Scale and Reliability

Challenge: Scraping 100+ Sites Reliably

Sites go down, change DOMs, or block requests. I designed the scraper to fail softly and recover.

Solution: Robust Error Handling & Health Monitoring

const scrapeCompany = async (company) => {
  try {
    const jobs = await scrapeJobs(company);
    await saveJobs(jobs);
    await logSuccess(company.name);
    return { success: true, count: jobs.length };
  } catch (error) {
    await logError(company.name, error);

    if (company.priority === 'high') {
      await sendAlert(`${company.name} scraping failed`);
    }
    return { success: false, error }; // don’t crash the whole run
  }
};

Operationally, I use retries on first pass (refresh once if no listings appear), Shadow DOM fallbacks, and batch secondary scrapes (default 10 tabs). I track per-company metrics (found vs. new, potential deactivations, and missing-field counts) and feed everything into a dashboard (logs are batched and de-duplicated).
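
The batched secondary scrape looks roughly like this; openDetailPage is a hypothetical helper, and the batch size of 10 comes from the default above:

const BATCH_SIZE = 10; // detail-page tabs opened in parallel

for (let i = 0; i < jobs.length; i += BATCH_SIZE) {
  const batch = jobs.slice(i, i + BATCH_SIZE);
  // Each call opens a tab, pulls the missing fields, and closes the tab
  await Promise.all(batch.map(job => openDetailPage(browser, job)));
}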

Job alert dashboard screenshot.

Challenge: Duplicate Detection

Same job can appear across platforms or be reposted with minor title changes.

Solution: Hybrid Deduping. Use the exact URL when it is reliable, or Title + Location when platforms reuse shell URLs, with an optional fuzzy match for cross-platform duplicates:

const isDuplicate = (newJob, existingJobs) => {
  return existingJobs.some(existing => {
    if (newJob.applicationLink === existing.applicationLink) return true;
    const titleSimilarity = similarity(newJob.title, existing.title); // string-similarity score in [0, 1]
    const sameCompany = newJob.companyName === existing.companyName;
    return titleSimilarity > 0.85 && sameCompany;
  });
};

Operations surface: when scraping breaks, you need observability and quick triage.

Frontend: Mobile-First Design

Mobile dashboard: built for quick decisions, not long browsing sessions.

I ship a React Native app and a Next.js web client.

  • Swipe interface for quick triage (Tinder-style)
  • Push notifications for new matches
  • Offline save + background sync
  • PWA capabilities on web

Why mobile-first? People check roles throughout the day. Mobile push is critical for short-window postings.

The Business Model

I initially planned to charge users for instant notifications.

But speed also creates noise: when alerts are immediate, a lot of people apply everywhere with low intent, which clogs hiring pipelines and makes strong candidates harder to spot.

So I kept Birdello free for seekers and charged companies instead. They pay for signal: fewer, higher-quality applications that are better matched, not just more applicants.

Getting to 3.8K Users Without Marketing

Distribution worked because the product was simple and intuitive.
  1. SEO-optimized category pages (“investment banking jobs London”, “consulting opportunities NYC”).
  2. Word-of-mouth in elite circles (teams share internally).
  3. Seasonal timing (launch aligned with recruiting cycles: Sep full-time, Jan internships).
  4. LinkedIn content with hiring trend insights from my data.
  5. University career centers (early adopters, high sharing velocity).

Lessons Learned

Scraping forced me to build for change, not for happy paths. Career sites shift layouts, load content late, and fail in small unpredictable ways, so reliability comes from graceful fallbacks, clear logging, and quick iteration. When a clean API exists, it is usually worth prioritizing because it reduces brittleness and makes deduping and change detection simpler. On the product side, speed only helps if the alerts stay accurate and relevant, so I biased the system toward curation and correctness instead of raw volume.

The Numbers (18 months)

Over 18 months, Birdello reached 3,800 users, processed 1M+ job listings, and supported thousands of applications, all through word of mouth with $0 spent on paid marketing.

Technical Stack Summary

Backend

  • Node.js / Express API
  • MySQL (Sequelize ORM)
  • Python
  • Puppeteer
  • OpenAI (categorization)
  • Winston + custom logger

Frontend

  • React Native (Expo)
  • Next.js
  • TypeScript
  • Tailwind CSS
  • Expo Push

Infrastructure

  • Vercel (web)
  • AWS RDS (MySQL)
  • AWS S3
  • Cron schedules
  • Residential proxies

Email

  • Mailgun (alerts, auth, confirmations)
  • AWS SES (not approved)

Monitoring

  • Scraper health dashboard
  • Error tracking + alerts
  • Analytics + conversion

Appendix: Orchestrator Logging

The scraper runs at LoggingLevel(1) most days. It captures the few signals you actually need to debug broken scrapes without turning every run into a wall of text.

  • Level 0: Silent. Useful only if you are already confident nothing is failing.
  • Level 1: Milestones, warnings, errors. Default.
  • Level 2: Adds lifecycle noise. Use when a site is behaving oddly.
  • Levels 3-4: Debug firehose: HTML snapshots, selector traces, Shadow DOM fallbacks.

Mode "all" prints everything at the chosen level. "selective" prints only messages explicitly marked as report-worthy. Before anything hits the dashboard, logs are batched and de-duplicated, so repeated failures (for example a flaky “Load more”) do not flood the feed.
