Contents
- The Problem
- Architecture Overview
- The Hard Parts: Web Scraping at Scale
- Database Design for Complex Relationships
- AI-Powered Job Intelligence
- Real-Time Matching Algorithm
- Handling Scale and Reliability
- Frontend: Mobile-First Design
- The Business Model
- Getting to 26K Users Without Marketing
- Lessons Learned
- The Numbers
- Technical Stack Summary
- Appendix: How the Orchestrator Keeps Me Honest
The Problem
Traditional job boards miss the best opportunities. Top-tier finance, consulting, and law firms often post positions exclusively on their own career pages, never touching LinkedIn or Indeed. These “hidden” jobs represent the most coveted positions in the industry.
Job seekers in finance often have to visit dozens of career websites manually every day just to check for new openings, which wastes a huge amount of time and energy. At my university, Imperial College London, the problem was big enough that a group of students formed a team of 10 people who would each refresh 5 websites a day, collectively scanning 50 career pages daily, just to track down summer internships.
I built Birdello to solve this. It’s a specialized scraper that monitors 1000+ elite company career pages 24/7, categorizes opportunities using AI, and matches them to user preferences in real time. Speed matters too: the first applicants typically have a much higher chance of making it into the interview pipeline, so Birdello prioritizes rapid detection and instant notifications.
Architecture Overview
I designed a multi-tier system to handle the complexity:
- Frontend (React Native + Next.js)
- Backend API (Node.js/Express)
- Scraping Services (Python + Puppeteer)
- Database (AWS RDS/S3)
- AI Processing (fine-tuned BERT model for label classification)
Why this architecture? Each layer owns a single concern. The frontend focuses on UX, the API encapsulates business rules, the scrapers do the messy extraction work, the database keeps relationships clean and queryable, and the AI layer turns raw text into structured signals (roles, opportunities, seniority, sponsorship).
Where the orchestrator file fits: the Node/Puppeteer script is the brain for browser-based collection (and API collection when available). It normalizes HTML, handles “load more” and infinite scroll, dives into detail pages when needed, dedupes, categorizes, summarizes, writes to the DB, exports to Excel for auditing, and ships structured logs to a dashboard.
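At a high level, the per-company loop looks roughly like this (a sketch only; helper names such as loadGuidebook, normalizeJob, and dedupeAgainstDb are illustrative, not the actual module layout):
const runOrchestrator = async () => {
  const guidebook = await loadGuidebook();              // per-company JSON maps
  for (const [name, company] of Object.entries(guidebook)) {
    const rawJobs = await scrapeCompany(company);       // Puppeteer, API, or Cheerio
    const cleaned = rawJobs.map(normalizeJob);          // casing, whitespace, locations
    const freshJobs = await dedupeAgainstDb(cleaned);   // URL or Title + Location
    const categorized = await categorizeJobs(freshJobs); // roles, seniority, sponsorship
    await saveJobs(categorized);
    await exportToExcel(name, categorized);             // audit trail
    logger.info(`${name}: ${freshJobs.length} new of ${rawJobs.length} found`);
  }
};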
The Hard Parts: Web Scraping at Scale
Challenge 1: Every Website is Different
Goldman Sachs uses Oracle; Barclays has infinite scroll; some sites bury content in Shadow DOM or iframes.
Solution: The Guidebook System — a JSON per-company map so I can add targets without changing code:
{
  "Goldman Sachs": {
    "URL": "https://www.goldmansachs.com/careers/students/",
    "Structure": {
      "JobListingsSelector": ".gs-card-application",
      "JobTitle": ".gs-card-application__title span",
      "JobLink": ".gs-card-application > a",
      "JobLocation": ".gs-card-application__category-location"
    },
    "CookieButton": "#truste-consent-button",
    "SecondaryScrape": {
      "Description": ".gs-details-list"
    }
  }
}
In code, I merge per-company overrides with reusable selector templates (Workday/Greenhouse). For Shadow DOM/iframes, the orchestrator switches context or uses a querySelectorAllDeep fallback. If jobs are behind “Load more” or infinite scroll, it clicks/scrolls with timeouts and a hard page cap.
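A rough sketch of that merge plus the deep-selector fallback (the Platform key and template shape are assumptions; the real code may rely on a library such as query-selector-shadow-dom instead):
// Merge a reusable ATS template (e.g. Workday, Greenhouse) with the
// per-company guidebook entry; company-specific keys win.
const buildConfig = (company, templates) =>
  ({ ...(templates[company.Platform] || {}), ...company });

// Fallback that also searches open shadow roots when a plain
// querySelectorAll finds nothing. Runs inside the page context,
// so it returns text rather than element handles.
const queryAllDeep = (page, selector) =>
  page.evaluate((sel) => {
    const hits = [];
    const walk = (root) => {
      hits.push(...root.querySelectorAll(sel));
      root.querySelectorAll('*').forEach((el) => {
        if (el.shadowRoot) walk(el.shadowRoot);
      });
    };
    walk(document);
    return hits.map((el) => el.textContent.trim());
  }, selector);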
Challenge 2: Modern Websites Use Complex JavaScript
Many career pages render little HTML until their JS runs, or fetch listings via JSON after load.
Solution: Multi-Engine Scraping (choose the least brittle per site).
- Puppeteer for heavy JS:
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(company.URL, { waitUntil: 'networkidle2' });

// Handle infinite scroll (autoScroll is sketched after this list)
if (company.Scrolling?.IsRequired) {
  await autoScroll(page);
}

// Click load more buttons
if (company.LoadMoreButton) {
  await page.click(company.LoadMoreButton);
}
- API Integration for Workday/Greenhouse/Eightfold (faster, cleaner):
// Workday API example
const axios = require('axios');

const response = await axios.post(
  'https://company.wd3.myworkdayjobs.com/wday/cxs/company/jobs',
  { appliedFacets: {}, limit: 20, offset: 0 }
);
const listings = response.data.jobPostings; // listings typically come back as JSON
- Cheerio for truly static HTML.
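The autoScroll helper referenced above, and the "load more" loop with its hard cap, look roughly like this (timings and the 80-iteration default are illustrative):
// Scroll until the page stops growing or the cap is hit
const autoScroll = async (page, maxRounds = 80) => {
  let previousHeight = 0;
  for (let round = 0; round < maxRounds; round++) {
    const height = await page.evaluate(() => document.body.scrollHeight);
    if (height === previousHeight) break;          // nothing new loaded
    previousHeight = height;
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await new Promise((r) => setTimeout(r, 1500)); // give lazy content time to render
  }
};

// "Load more" variant: click while the button exists, is visible, and is enabled
const clickLoadMore = async (page, selector, maxClicks = 80) => {
  for (let i = 0; i < maxClicks; i++) {
    const button = await page.$(selector);
    if (!button) break;
    const usable = await button.evaluate(
      (el) => !el.disabled && el.offsetParent !== null
    );
    if (!usable) break;
    await button.click();
    await new Promise((r) => setTimeout(r, 1500));
  }
};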
Challenge 3: Anti-Bot Measures
CAPTCHAs, rate limits, and headless detection are common.
Solution: Human-Like Behavior
// Random, human-ish delays
await page.waitForTimeout(Math.random() * 3000 + 1000);

// Accept cookies like a person would
if (company.CookieButton) {
  await page.click(company.CookieButton);
}

// Real browser UA
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64)...');
I also rotate IPs and use residential proxies on sensitive targets. Network logging (XHR/fetch or all requests) helps compare against DevTools and decide if an API route is feasible.
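A minimal version of that network logging (the resource-type filter and console output are assumptions; the real pipeline feeds the dashboard):
// Log XHR/fetch responses to spot a JSON endpoint worth calling directly
const logNetwork = (page) => {
  page.on('response', (response) => {
    const type = response.request().resourceType();
    if (type !== 'xhr' && type !== 'fetch') return;
    console.log(response.status(), response.request().method(), response.url());
  });
};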
Where the Orchestrator Shines (Problems & Defenses)
- Inconsistent text: normalize capitalization, collapse whitespace, and parse locations from both title and location fields.
- Shadow DOM & iframes: deep selectors and iframe context switching.
- Load more & infinite scroll: visibility checks, disabled-state checks, fallback clicks, and a hard page limit (80).
- Missing fields: open detail pages in batches (default 10 tabs) and fetch Location, Location2, Description, Description2 with traditional and deep selectors.
- London-only rule (when trustworthy): if a company exposes reliable locations (structured field or API), filter to London early; otherwise keep everything until you can infer.
- Duplicate protection: site-side dedupe by Title + Location; DB dedupe by URL or Title + Location depending on platform behavior.
Database Design for Complex Relationships
Jobs can have multiple locations, fit several roles, and map to different opportunity types. I model those as many-to-many relationships.
The normalization level I chose is Boyce–Codd Normal Form (BCNF). Here, every determinant is a candidate key: JobID uniquely identifies each job, ApplicationLink is unique, and the composite keys in joblocations and jobroles fully define their relationships. BCNF is a stricter version of Third Normal Form that removes redundancy and prevents update or insertion anomalies by ensuring all functional dependencies are tied to candidate keys. This makes the database cleaner, more consistent, and easier to maintain and scale.
-- Core job entity
CREATE TABLE jobs (
  JobID INT PRIMARY KEY AUTO_INCREMENT,
  CompanyName VARCHAR(500),
  Title VARCHAR(500),
  ApplicationLink VARCHAR(500) UNIQUE,
  Description VARCHAR(500),
  IsActive BOOLEAN DEFAULT TRUE,
  CreatedAt TIMESTAMP
);

-- Many-to-many: Jobs ↔ Locations
CREATE TABLE joblocations (
  JobID INT,
  LocationID INT,
  PRIMARY KEY (JobID, LocationID)
);

-- Many-to-many: Jobs ↔ Roles
CREATE TABLE jobroles (
  JobID INT,
  RoleID INT,
  PRIMARY KEY (JobID, RoleID)
);
Reality: Some ATSs reuse one ApplicationLink for many distinct roles (Salesforce/Workday shells). I flag those with NoDuplicateLinkCheckRequired and switch dedupe to Title + Location. If you enforce UNIQUE(ApplicationLink), consider a composite unique index (CompanyName, Title, UnfilteredLocation) or a derived JobHash.
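One way to derive that JobHash (field names follow the schema above; the normalization rules are my assumption):
const crypto = require('crypto');

// Derived dedupe key for cases where ApplicationLink is reused across roles
const jobHash = (job) =>
  crypto
    .createHash('sha256')
    .update(
      [job.CompanyName, job.Title, job.UnfilteredLocation]
        .map((v) => (v || '').trim().toLowerCase())
        .join('|')
    )
    .digest('hex');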
AI-Powered Job Intelligence
Raw postings are inconsistent. Titles like “Vice President – Investment Banking Division, EMEA Coverage” need normalization.
Solution: OpenAI Integration
const OpenAI = require('openai');
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const categorizeJob = async (jobTitle, description) => {
  const prompt = `
Categorize this job into standardized roles and locations:
Title: ${jobTitle}
Description: ${description}
Return JSON with: role, seniority, locations, requiresSponsorship
`;
  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: prompt }]
  });
  return JSON.parse(response.choices[0].message.content);
};
I validate AI outputs against controlled vocabularies (valid roles/opportunities), and only store categorized rows when both sides match. I also summarize descriptions for fast mobile scanning.
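The validation itself is simple set membership (the vocabulary lists and the opportunityType field name below are illustrative):
// Accept a categorization only if both sides match the controlled vocabularies
const VALID_ROLES = ['Investment Banking', 'Sales & Trading', 'Consulting', 'Law'];
const VALID_OPPORTUNITIES = ['Internship', 'Spring Week', 'Graduate Programme', 'Full-Time'];

const isValidCategorization = (result) =>
  VALID_ROLES.includes(result.role) &&
  VALID_OPPORTUNITIES.includes(result.opportunityType);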
Real-Time Matching Algorithm
Users fill out a short intake form. As new jobs arrive, I match them instantly.
Two-tier approach
- Rule-based filtering (fast, deterministic)
const matchJob = (job, userProfile) => {
  if (!userProfile.locations.some(loc => job.locations.includes(loc))) return false;
  if (!userProfile.roles.some(role => job.roles.includes(role))) return false;
  if (userProfile.needsSponsorship && !job.sponsorshipAvailable) return false;
  return true;
};
- ML-enhanced scoring (learns from clicks/saves)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_relevance_score(job_description, user_preferences):
    # Fit on both documents so the shared vocabulary is captured
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([job_description, user_preferences])
    return cosine_similarity(vectors[0:1], vectors[1:2])[0][0]
The final ranking blends rules (hard constraints) with a score (soft preferences).
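Putting the two together might look like this (matchJob is the rule filter above; the scores map keyed by job ID is an assumption about where the TF-IDF output lands):
// Hard constraints filter; the soft score sorts what survives
const rankJobs = (jobs, userProfile, scores) =>
  jobs
    .filter((job) => matchJob(job, userProfile))
    .map((job) => ({ job, score: scores[job.jobId] ?? 0 }))
    .sort((a, b) => b.score - a.score);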
Handling Scale and Reliability
Challenge: Scraping 100+ Sites Reliably
Sites go down, change DOMs, or block requests. I designed the scraper to fail softly and recover.
Solution: Robust Error Handling & Health Monitoring
const scrapeCompany = async (company) => {
  try {
    const jobs = await scrapeJobs(company);
    await saveJobs(jobs);
    await logSuccess(company.name);
    return { success: true, count: jobs.length };
  } catch (error) {
    await logError(company.name, error);
    if (company.priority === 'high') {
      await sendAlert(`${company.name} scraping failed`);
    }
    return { success: false, error }; // don’t crash the whole run
  }
};
Operationally, I use retries on first pass (refresh once if no listings appear), Shadow DOM fallbacks, and batch secondary scrapes (default 10 tabs). I track per-company metrics—found vs. new, potential deactivations, and missing-field counts—and feed everything into a dashboard (logs are batched and de-duplicated).
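The first-pass retry is about as simple as it sounds (extractListings is a stand-in for the real listing scrape):
// If nothing is found on the first pass, refresh once before giving up
const scrapeWithRetry = async (page, company) => {
  let jobs = await extractListings(page, company);
  if (jobs.length === 0) {
    await page.reload({ waitUntil: 'networkidle2' });
    jobs = await extractListings(page, company);
  }
  return jobs;
};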
Challenge: Duplicate Detection
Same job can appear across platforms or be reposted with minor title changes.
Solution: Hybrid Deduping — exact URL when reliable, or Title+Location when platforms reuse shell URLs. Optional fuzzy match for cross-platform duplicates:
// similarity() is a string-similarity helper (sketched below)
const isDuplicate = (newJob, existingJobs) => {
  return existingJobs.some(existing => {
    if (newJob.applicationLink === existing.applicationLink) return true;
    const titleSimilarity = similarity(newJob.title, existing.title);
    const sameCompany = newJob.companyName === existing.companyName;
    return titleSimilarity > 0.85 && sameCompany;
  });
};
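One way to implement that similarity helper is a bigram Dice coefficient; the production code may use a library instead:
// Character-bigram Dice coefficient: 1.0 for identical strings, 0.0 for disjoint ones
const bigrams = (s) => {
  const text = s.toLowerCase().replace(/\s+/g, ' ');
  const grams = new Map();
  for (let i = 0; i < text.length - 1; i++) {
    const g = text.slice(i, i + 2);
    grams.set(g, (grams.get(g) || 0) + 1);
  }
  return grams;
};

const similarity = (a, b) => {
  const ga = bigrams(a);
  const gb = bigrams(b);
  const size = (m) => [...m.values()].reduce((sum, c) => sum + c, 0);
  let overlap = 0;
  for (const [g, count] of ga) overlap += Math.min(count, gb.get(g) || 0);
  const total = size(ga) + size(gb);
  return total > 0 ? (2 * overlap) / total : 0;
};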
Frontend: Mobile-First Design
I ship a React Native app and a Next.js web client.
- Swipe interface for quick triage (Tinder-style)
- Push notifications for new matches
- Offline save + background sync
- PWA capabilities on web
Why mobile-first? People check roles throughout the day. Mobile push is critical for short-window postings.
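For the push notifications, a server-side send via expo-server-sdk looks roughly like this (token storage and message copy are assumptions):
const { Expo } = require('expo-server-sdk');
const expo = new Expo();

// Send a "new match" push to a user's registered Expo tokens
const notifyNewMatch = async (pushTokens, job) => {
  const messages = pushTokens
    .filter((token) => Expo.isExpoPushToken(token))
    .map((token) => ({
      to: token,
      sound: 'default',
      title: `New match: ${job.companyName}`,
      body: job.title,
      data: { jobId: job.jobId },
    }));
  // Expo requires sends to be batched into chunks
  for (const chunk of expo.chunkPushNotifications(messages)) {
    await expo.sendPushNotificationsAsync(chunk);
  }
};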
The Business Model
Freemium
- Free: 5 matches/day
- Premium: unlimited, advanced filters, priority alerts
- Enterprise: custom company tracking and team features
Revenue
- $19/mo user subscriptions
- $99/mo per-team enterprise
- Priority placement fees for companies
Getting to 26K Users Without Marketing
- SEO-optimized category pages (“investment banking jobs London”, “consulting opportunities NYC”).
- Word-of-mouth in elite circles (teams share internally).
- Seasonal timing (launch aligned with recruiting cycles: Sep full-time, Jan internships).
- LinkedIn content with hiring trend insights from my data.
- University career centers (early adopters, high sharing velocity).
Lessons Learned
Technical
- Over-engineer reliability—scraping is inherently fragile.
- Prefer APIs over HTML when available.
- Real-time processing matters—minutes, not hours.
Business
- Niche markets pay for quality; $19/mo is acceptable for high-value roles.
- Timing amplifies growth (recruiting cycles).
- Network effects inside firms beat ads.
Product
- Mobile notifications = engagement (70% of applications via mobile).
- Curation > volume (5 relevant > 50 random).
- Speed beats perfection—ship alerts fast; refine categorization later.
The Numbers (18 months)
- 26,000 registered users
- 2.3M jobs scraped
- 150,000 applications facilitated
- $0 in paid marketing
- 92% scraper uptime
- $47k MRR peak
Technical Stack Summary
Backend
- Node.js/Express API
- MySQL (Sequelize ORM)
- Python + Puppeteer scrapers
- OpenAI for categorization
- Winston/custom logger
Frontend
- React Native (Expo) mobile
- Next.js (TypeScript) web
- Tailwind CSS
- Push via Expo
Infrastructure
- Vercel for web
- AWS RDS (MySQL) — database hosting
- AWS S3 — file storage (exports, assets, backups)
- Cron-driven schedules for scraping
- Residential proxies for IP rotation
- AWS SES was attempted for transactional email, but the application wasn’t accepted
- Mailgun handles automated email instead (alerts, confirmations, password flows)
Monitoring
- Custom scraper health dashboard
- Error tracking & alerts
- User analytics & conversion
Appendix: How the Orchestrator Keeps Me Honest (Logging in Plain English)
I run the scraper at LoggingLevel(1) most days—clean, useful signals without noise:
- Level 0: silent (not recommended; you’ll miss failures)
- Level 1: milestones, warnings, errors (daily driver)
- Level 2: adds lifecycle chatter (use when something’s “off”)
- Level 3–4: firehose (HTML dumps, selector checks, Shadow DOM fallbacks)
Mode "all" vs "selective" controls whether everything at that level is printed or only “report”-flagged messages. I batch and de-dupe messages before sending them to the dashboard so a flaky “Load more” doesn’t spam 500 identical lines.
Bottom line: When the web misbehaves—and it will—this logging is the difference between a mystery outage and a 10-minute fix.