Contents
- The Problem
- Architecture Overview
- The Hard Parts: Web Scraping at Scale
- Database Design for Complex Relationships
- AI-Powered Job Intelligence
- Real-Time Matching Algorithm
- Handling Scale and Reliability
- Frontend: Mobile-First Design
- The Business Model
- Getting to 26K Users Without Marketing
- Lessons Learned
- The Numbers
- Technical Stack Summary
- Appendix: How the Orchestrator Keeps Me Honest
The Problem
Traditional job boards miss the best opportunities. Top-tier finance, consulting, and law firms often post positions exclusively on their own career pages, never touching LinkedIn or Indeed. These “hidden” jobs represent the most coveted positions in the industry.
Job seekers in finance often have to visit dozens of career websites manually every day just to check for new openings, which wastes a huge amount of time and energy. At my university, Imperial College London, the problem was big enough that a group of students formed a team of 10 people who would each refresh 5 websites a day, collectively scanning 50 career pages daily, just to track down summer internships.
I built Birdello to solve this. It’s a specialized scraper that monitors 1000+ elite company career pages 24/7, categorizes opportunities using AI, and matches them to user preferences in real time. Speed matters too: the first applicants typically have a much higher chance of making it into the interview pipeline, so Birdello prioritizes rapid detection and instant notifications.
Architecture Overview
I designed a multi-tier system to handle the complexity:
- Frontend (React Native + Next.js)
- Backend API (Node.js/Express)
- Scraping Services (Python + Puppeteer)
- Database (AWS RDS/S3)
- AI Processing (fine-tuned BERT model for label classification)
Why this architecture? Each layer owns a single concern. The frontend focuses on UX, the API encapsulates business rules, the scrapers do the messy extraction work, the database keeps relationships clean and queryable, and the AI layer turns raw text into structured signals (roles, opportunities, seniority, sponsorship).
Where the orchestrator file fits: the Node/Puppeteer script is the brain for browser-based collection (and API collection when available). It normalizes HTML, handles “load more” and infinite scroll, dives into detail pages when needed, dedupes, categorizes, summarizes, writes to the DB, exports to Excel for auditing, and ships structured logs to a dashboard.
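At a high level, the per-company loop looks roughly like this (a sketch only; helper names such as loadGuidebook, normalizeJob, and dedupeAgainstDb are illustrative, not the actual module layout):
const runOrchestrator = async () => {
  const guidebook = await loadGuidebook();              // per-company JSON maps
  for (const [name, company] of Object.entries(guidebook)) {
    const rawJobs = await scrapeCompany(company);       // Puppeteer, API, or Cheerio
    const cleaned = rawJobs.map(normalizeJob);          // casing, whitespace, locations
    const freshJobs = await dedupeAgainstDb(cleaned);   // URL or Title + Location
    const categorized = await categorizeJobs(freshJobs); // roles, seniority, sponsorship
    await saveJobs(categorized);
    await exportToExcel(name, categorized);             // audit trail
    logger.info(`${name}: ${freshJobs.length} new of ${rawJobs.length} found`);
  }
};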
The Hard Parts: Web Scraping at Scale
Challenge 1: Every Website is Different
Goldman Sachs uses Oracle; Barclays has infinite scroll; some sites bury content in Shadow DOM or iframes.
Solution: The Guidebook System — a JSON per-company map so I can add targets without changing code:
{
  "Goldman Sachs": {
    "URL": "https://www.goldmansachs.com/careers/students/",
    "Structure": {
      "JobListingsSelector": ".gs-card-application",
      "JobTitle": ".gs-card-application__title span",
      "JobLink": ".gs-card-application > a",
      "JobLocation": ".gs-card-application__category-location"
    },
    "CookieButton": "#truste-consent-button",
    "SecondaryScrape": {
      "Description": ".gs-details-list"
    }
  }
}
In code, I merge per-company overrides with reusable selector templates (Workday/Greenhouse). For Shadow DOM/iframes, the orchestrator switches context or uses a querySelectorAllDeep fallback. If jobs are behind “Load more” or infinite scroll, it clicks/scrolls with timeouts and a hard page cap.
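A rough sketch of that merge plus the deep-selector fallback (the Platform key and template shape are assumptions; the real code may rely on a library such as query-selector-shadow-dom instead):
// Merge a reusable ATS template (e.g. Workday, Greenhouse) with the
// per-company guidebook entry; company-specific keys win.
const buildConfig = (company, templates) =>
  ({ ...(templates[company.Platform] || {}), ...company });

// Fallback that also searches open shadow roots when a plain
// querySelectorAll finds nothing. Runs inside the page context,
// so it returns text rather than element handles.
const queryAllDeep = (page, selector) =>
  page.evaluate((sel) => {
    const hits = [];
    const walk = (root) => {
      hits.push(...root.querySelectorAll(sel));
      root.querySelectorAll('*').forEach((el) => {
        if (el.shadowRoot) walk(el.shadowRoot);
      });
    };
    walk(document);
    return hits.map((el) => el.textContent.trim());
  }, selector);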
Challenge 2: Modern Websites Use Complex JavaScript
Many career pages render little HTML until their JS runs, or fetch listings via JSON after load.
Solution: Multi-Engine Scraping (choose the least brittle per site).
- Puppeteer for heavy JS:
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(company.URL, { waitUntil: 'networkidle2' });

// Handle infinite scroll (autoScroll is sketched after this list)
if (company.Scrolling?.IsRequired) {
  await autoScroll(page);
}

// Click load more buttons
if (company.LoadMoreButton) {
  await page.click(company.LoadMoreButton);
}
- API Integration for Workday/Greenhouse/Eightfold (faster, cleaner):
// Workday API example
const axios = require('axios');

const response = await axios.post(
  'https://company.wd3.myworkdayjobs.com/wday/cxs/company/jobs',
  { appliedFacets: {}, limit: 20, offset: 0 }
);
const listings = response.data.jobPostings; // listings typically come back as JSON
- Cheerio for truly static HTML.
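The autoScroll helper referenced above, and the "load more" loop with its hard cap, look roughly like this (timings and the 80-iteration default are illustrative):
// Scroll until the page stops growing or the cap is hit
const autoScroll = async (page, maxRounds = 80) => {
  let previousHeight = 0;
  for (let round = 0; round < maxRounds; round++) {
    const height = await page.evaluate(() => document.body.scrollHeight);
    if (height === previousHeight) break;          // nothing new loaded
    previousHeight = height;
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await new Promise((r) => setTimeout(r, 1500)); // give lazy content time to render
  }
};

// "Load more" variant: click while the button exists, is visible, and is enabled
const clickLoadMore = async (page, selector, maxClicks = 80) => {
  for (let i = 0; i < maxClicks; i++) {
    const button = await page.$(selector);
    if (!button) break;
    const usable = await button.evaluate(
      (el) => !el.disabled && el.offsetParent !== null
    );
    if (!usable) break;
    await button.click();
    await new Promise((r) => setTimeout(r, 1500));
  }
};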
Challenge 3: Anti-Bot Measures
CAPTCHAs, rate limits, and headless detection are common.
Solution: Human-Like Behavior
// Random, human-ish delays
await page.waitForTimeout(Math.random() * 3000 + 1000);

// Accept cookies like a person would
if (company.CookieButton) {
  await page.click(company.CookieButton);
}

// Real browser UA
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64)...');
I also rotate IPs and use residential proxies on sensitive targets. Network logging (XHR/fetch or all requests) helps compare against DevTools and decide if an API route is feasible.
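A minimal version of that network logging (the resource-type filter and console output are assumptions; the real pipeline feeds the dashboard):
// Log XHR/fetch responses to spot a JSON endpoint worth calling directly
const logNetwork = (page) => {
  page.on('response', (response) => {
    const type = response.request().resourceType();
    if (type !== 'xhr' && type !== 'fetch') return;
    console.log(response.status(), response.request().method(), response.url());
  });
};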
Where the Orchestrator Shines (Problems & Defenses)
- Inconsistent text: normalize capitalization, collapse whitespace, and parse locations from both title and location fields.
- Shadow DOM & iframes: deep selectors and iframe context switching.
- Load more & infinite scroll: visibility checks, disabled-state checks, fallback clicks, and a hard page limit (80).
- Missing fields: open detail pages in batches (default 10 tabs) and fetch Location, Location2, Description, Description2 with traditional and deep selectors.
- London-only rule (when trustworthy): if a company exposes reliable locations (structured field or API), filter to London early; otherwise keep everything until you can infer.
- Duplicate protection: site-side dedupe by Title + Location; DB dedupe by URL or Title + Location depending on platform behavior.
Database Design for Complex Relationships
Jobs can have multiple locations, fit several roles, and map to different opportunity types. I model those as many-to-many relationships.
The normalization level I chose is Boyce–Codd Normal Form (BCNF). Here, every determinant is a candidate key: JobID uniquely identifies each job, ApplicationLink is unique, and the composite keys in joblocations and jobroles fully define their relationships. BCNF is a stricter version of Third Normal Form that removes redundancy and prevents update or insertion anomalies by ensuring all functional dependencies are tied to candidate keys. This makes the database cleaner, more consistent, and easier to maintain and scale.
-- Core job entity
CREATE TABLE jobs (
  JobID INT PRIMARY KEY AUTO_INCREMENT,
  CompanyName VARCHAR(500),
  Title VARCHAR(500),
  ApplicationLink VARCHAR(500) UNIQUE,
  Description VARCHAR(500),
  IsActive BOOLEAN DEFAULT TRUE,
  CreatedAt TIMESTAMP
);

-- Many-to-many: Jobs ↔ Locations
CREATE TABLE joblocations (
  JobID INT,
  LocationID INT,
  PRIMARY KEY (JobID, LocationID)
);

-- Many-to-many: Jobs ↔ Roles
CREATE TABLE jobroles (
  JobID INT,
  RoleID INT,
  PRIMARY KEY (JobID, RoleID)
);
Reality: Some ATSs reuse one ApplicationLink for many distinct roles (Salesforce/Workday shells). I flag those with NoDuplicateLinkCheckRequired and switch dedupe to Title + Location. If you enforce UNIQUE(ApplicationLink), consider a composite unique index (CompanyName, Title, UnfilteredLocation) or a derived JobHash.
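One way to derive that JobHash (field names follow the schema above; the normalization rules are my assumption):
const crypto = require('crypto');

// Derived dedupe key for cases where ApplicationLink is reused across roles
const jobHash = (job) =>
  crypto
    .createHash('sha256')
    .update(
      [job.CompanyName, job.Title, job.UnfilteredLocation]
        .map((v) => (v || '').trim().toLowerCase())
        .join('|')
    )
    .digest('hex');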
AI-Powered Job Intelligence
Raw postings are inconsistent. Titles like “Vice President – Investment Banking Division, EMEA Coverage” need normalization.
Solution: OpenAI Integration
const OpenAI = require('openai');
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const categorizeJob = async (jobTitle, description) => {
  const prompt = `
Categorize this job into standardized roles and locations:
Title: ${jobTitle}
Description: ${description}
Return JSON with: role, seniority, locations, requiresSponsorship
`;
  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: prompt }]
  });
  return JSON.parse(response.choices[0].message.content);
};
I validate AI outputs against controlled vocabularies (valid roles/opportunities), and only store categorized rows when both sides match. I also summarize descriptions for fast mobile scanning.
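The validation itself is simple set membership (the vocabulary lists and the opportunityType field name below are illustrative):
// Accept a categorization only if both sides match the controlled vocabularies
const VALID_ROLES = ['Investment Banking', 'Sales & Trading', 'Consulting', 'Law'];
const VALID_OPPORTUNITIES = ['Internship', 'Spring Week', 'Graduate Programme', 'Full-Time'];

const isValidCategorization = (result) =>
  VALID_ROLES.includes(result.role) &&
  VALID_OPPORTUNITIES.includes(result.opportunityType);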
Real-Time Matching Algorithm
Users fill out a short intake form. As new jobs arrive, I match them instantly.
Two-tier approach
- Rule-based filtering (fast, deterministic)
const matchJob = (job, userProfile) => {
  if (!userProfile.locations.some(loc => job.locations.includes(loc))) return false;
  if (!userProfile.roles.some(role => job.roles.includes(role))) return false;
  if (userProfile.needsSponsorship && !job.sponsorshipAvailable) return false;
  return true;
};
- ML-enhanced scoring (learns from clicks/saves)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_relevance_score(job_description, user_preferences):
    # Fit on both documents so the shared vocabulary is captured
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([job_description, user_preferences])
    return cosine_similarity(vectors[0:1], vectors[1:2])[0][0]
The final ranking blends rules (hard constraints) with a score (soft preferences).
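Putting the two together might look like this (matchJob is the rule filter above; the scores map keyed by job ID is an assumption about where the TF-IDF output lands):
// Hard constraints filter; the soft score sorts what survives
const rankJobs = (jobs, userProfile, scores) =>
  jobs
    .filter((job) => matchJob(job, userProfile))
    .map((job) => ({ job, score: scores[job.jobId] ?? 0 }))
    .sort((a, b) => b.score - a.score);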
Handling Scale and Reliability
Challenge: Scraping 100+ Sites Reliably
Sites go down, change DOMs, or block requests. I designed the scraper to fail softly and recover.
Solution: Robust Error Handling & Health Monitoring
const scrapeCompany = async (company) => {
  try {
    const jobs = await scrapeJobs(company);
    await saveJobs(jobs);
    await logSuccess(company.name);
    return { success: true, count: jobs.length };
  } catch (error) {
    await logError(company.name, error);
    if (company.priority === 'high') {
      await sendAlert(`${company.name} scraping failed`);
    }
    return { success: false, error }; // don’t crash the whole run
  }
};
Operationally, I use retries on first pass (refresh once if no listings appear), Shadow DOM fallbacks, and batch secondary scrapes (default 10 tabs). I track per-company metrics—found vs. new, potential deactivations, and missing-field counts—and feed everything into a dashboard (logs are batched and de-duplicated).
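The first-pass retry is about as simple as it sounds (extractListings is a stand-in for the real listing scrape):
// If nothing is found on the first pass, refresh once before giving up
const scrapeWithRetry = async (page, company) => {
  let jobs = await extractListings(page, company);
  if (jobs.length === 0) {
    await page.reload({ waitUntil: 'networkidle2' });
    jobs = await extractListings(page, company);
  }
  return jobs;
};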
Challenge: Duplicate Detection
Same job can appear across platforms or be reposted with minor title changes.
Solution: Hybrid Deduping — exact URL when reliable, or Title+Location when platforms reuse shell URLs. Optional fuzzy match for cross-platform duplicates:
// similarity() is a string-similarity helper (sketched below)
const isDuplicate = (newJob, existingJobs) => {
  return existingJobs.some(existing => {
    if (newJob.applicationLink === existing.applicationLink) return true;
    const titleSimilarity = similarity(newJob.title, existing.title);
    const sameCompany = newJob.companyName === existing.companyName;
    return titleSimilarity > 0.85 && sameCompany;
  });
};
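One way to implement that similarity helper is a bigram Dice coefficient; the production code may use a library instead:
// Character-bigram Dice coefficient: 1.0 for identical strings, 0.0 for disjoint ones
const bigrams = (s) => {
  const text = s.toLowerCase().replace(/\s+/g, ' ');
  const grams = new Map();
  for (let i = 0; i < text.length - 1; i++) {
    const g = text.slice(i, i + 2);
    grams.set(g, (grams.get(g) || 0) + 1);
  }
  return grams;
};

const similarity = (a, b) => {
  const ga = bigrams(a);
  const gb = bigrams(b);
  const size = (m) => [...m.values()].reduce((sum, c) => sum + c, 0);
  let overlap = 0;
  for (const [g, count] of ga) overlap += Math.min(count, gb.get(g) || 0);
  const total = size(ga) + size(gb);
  return total > 0 ? (2 * overlap) / total : 0;
};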
Frontend: Mobile-First Design
I ship a React Native app and a Next.js web client.
- Swipe interface for quick triage (Tinder-style)
- Push notifications for new matches
- Offline save + background sync
- PWA capabilities on web
Why mobile-first? People check roles throughout the day. Mobile push is critical for short-window postings.
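For the push notifications, a server-side send via expo-server-sdk looks roughly like this (token storage and message copy are assumptions):
const { Expo } = require('expo-server-sdk');
const expo = new Expo();

// Send a "new match" push to a user's registered Expo tokens
const notifyNewMatch = async (pushTokens, job) => {
  const messages = pushTokens
    .filter((token) => Expo.isExpoPushToken(token))
    .map((token) => ({
      to: token,
      sound: 'default',
      title: `New match: ${job.companyName}`,
      body: job.title,
      data: { jobId: job.jobId },
    }));
  // Expo requires sends to be batched into chunks
  for (const chunk of expo.chunkPushNotifications(messages)) {
    await expo.sendPushNotificationsAsync(chunk);
  }
};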
The Business Model
Freemium
- Free: 5 matches/day
- Premium: unlimited, advanced filters, priority alerts
- Enterprise: custom company tracking and team features
Revenue
- $19/mo user subscriptions
- $99/mo per-team enterprise
- Priority placement fees for companies
Getting to 26K Users Without Marketing
- SEO-optimized category pages (“investment banking jobs London”, “consulting opportunities NYC”).
- Word-of-mouth in elite circles (teams share internally).
- Seasonal timing (launch aligned with recruiting cycles: Sep full-time, Jan internships).
- LinkedIn content with hiring trend insights from my data.
- University career centers (early adopters, high sharing velocity).
Lessons Learned
Technical
- Over-engineer reliability—scraping is inherently fragile.
- Prefer APIs over HTML when available.
- Real-time processing matters—minutes, not hours.
Business
- Niche markets pay for quality; $19/mo is acceptable for high-value roles.
- Timing amplifies growth (recruiting cycles).
- Network effects inside firms beat ads.
Product
- Mobile notifications = engagement (70% of applications via mobile).
- Curation > volume (5 relevant > 50 random).
- Speed beats perfection—ship alerts fast; refine categorization later.
The Numbers (18 months)
- 26,000 registered users
- 2.3M jobs scraped
- 150,000 applications facilitated
- $0 in paid marketing
- 92% scraper uptime
- $47k MRR peak
Technical Stack Summary
Backend
- Node.js/Express API
- MySQL (Sequelize ORM)
- Python + Puppeteer scrapers
- OpenAI for categorization
- Winston/custom logger
Frontend
- React Native (Expo) mobile
- Next.js (TypeScript) web
- Tailwind CSS
- Push via Expo
Infrastructure
- Vercel for web
- AWS RDS (MySQL) — database hosting
- AWS S3 — file storage (exports, assets, backups)
- Cron-driven schedules for scraping
- Residential proxies for IP rotation
- AWS SES was attempted for transactional email, but the application wasn’t accepted
- Mailgun handles automated email instead (alerts, confirmations, password flows)
Monitoring
- Custom scraper health dashboard
- Error tracking & alerts
- User analytics & conversion
Appendix: How the Orchestrator Keeps Me Honest (Logging in Plain English)
I run the scraper at LoggingLevel(1) most days—clean, useful signals without noise:
- Level 0: silent (not recommended; you’ll miss failures)
- Level 1: milestones, warnings, errors (daily driver)
- Level 2: adds lifecycle chatter (use when something’s “off”)
- Level 3–4: firehose (HTML dumps, selector checks, Shadow DOM fallbacks)
Mode "all" vs "selective" controls whether everything at that level is printed or only “report”-flagged messages. I batch and de-dupe messages before sending them to the dashboard so a flaky “Load more” doesn’t spam 500 identical lines.
Bottom line: When the web misbehaves—and it will—this logging is the difference between a mystery outage and a 10-minute fix.