Contents
- The Problem
- Architecture Overview
- The Hard Parts: Web Scraping at Scale
- Database Design for Complex Relationships
- AI-Powered Job Intelligence
- Real-Time Matching Algorithm
- Handling Scale and Reliability
- Frontend: Mobile-First Design
- The Business Model
- Getting to 3.8K Users Without Marketing
- Lessons Learned
- The Numbers
- Technical Stack Summary
- Appendix: Orchestrator Logging
The Problem
Traditional job boards miss the best opportunities. Top-tier finance, consulting, and law firms often post positions exclusively on their own career pages, never touching LinkedIn or Indeed. These hidden jobs represent the most coveted positions in the industry.
Job seekers in finance often have to manually visit dozens of career websites every day just to check for new openings, a process that wastes enormous time and energy. At my university, Imperial College London, the problem was serious enough that a group of students formed a ten-person team in which each member refreshed five websites a day, collectively scanning 50 career pages, just to track down summer internships.
I built Birdello to solve this. It’s a specialized scraper that monitors 1000+ elite company career pages 24/7, categorizes opportunities using AI, and matches them to user preferences in real time. Speed matters too: the first applicants typically have a much higher chance of making it into the interview pipeline, so Birdello prioritizes rapid detection and instant notifications.
Architecture Overview
I designed a multi-tier system to handle the complexity:
- Frontend (React Native + Next.js)
- Backend API (Node.js/Express)
- Scraping Services (Python + Puppeteer)
- Database (AWS RDS/S3)
- AI Processing (fine-tuned BERT model for label classification)
Why this architecture? Each layer owns a single concern. The frontend focuses on UX, the API encapsulates business rules, the scrapers do the messy extraction work, the database keeps relationships clean and queryable, and the AI layer turns raw text into structured signals (roles, opportunities, seniority, sponsorship).
Where the orchestrator file fits: the Node/Puppeteer script is the brain for browser-based collection (and API collection when available). It normalizes HTML, handles “load more” and infinite scroll, dives into detail pages when needed, dedupes, categorizes, summarizes, writes to the DB, exports to Excel for auditing, and ships structured logs to a dashboard.
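As a concrete sketch of that normalization step, here is a minimal version of the text cleanup the orchestrator performs before dedupe and storage; the field names are illustrative, not the orchestrator's actual schema:

```javascript
// Minimal sketch of per-job text normalization: collapse whitespace,
// trim, and keep fields predictable before dedupe and storage.
// Field names here are illustrative, not the real orchestrator schema.
function normalizeJob(raw) {
  const clean = (s) => (s || '').replace(/\s+/g, ' ').trim();
  return {
    title: clean(raw.title),
    company: clean(raw.company),
    location: clean(raw.location),
    link: clean(raw.link),
  };
}
```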
The Hard Parts: Web Scraping at Scale
Challenge 1: Every Website is Different
Goldman Sachs uses Oracle; Barclays has infinite scroll; some sites bury content in Shadow DOM or iframes.
Solution: the Guidebook System, a per-company JSON map that lets me add targets without touching code:
{
  "Goldman Sachs": {
    "URL": "https://www.goldmansachs.com/careers/students/",
    "Structure": {
      "JobListingsSelector": ".gs-card-application",
      "JobTitle": ".gs-card-application__title span",
      "JobLink": ".gs-card-application > a",
      "JobLocation": ".gs-card-application__category-location"
    },
    "CookieButton": "#truste-consent-button",
    "SecondaryScrape": {
      "Description": ".gs-details-list"
    }
  }
}
In code, I merge per-company overrides with reusable selector templates (Workday/Greenhouse). For Shadow DOM/iframes, the orchestrator switches context or uses a querySelectorAllDeep fallback. If jobs are behind “Load more” or infinite scroll, it clicks or scrolls with timeouts and a hard page cap.
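The merge itself is a one-liner. A sketch, assuming a `Template` key on the guidebook entry; the template names and selectors are placeholders, not the real guidebook contents:

```javascript
// Sketch of merging per-company overrides with reusable selector
// templates. Template names and selectors are placeholders.
const templates = {
  Workday: {
    JobListingsSelector: 'li.job-row',
    JobTitle: 'a.job-title',
  },
};

function resolveSelectors(company) {
  const base = templates[company.Template] || {};
  // Per-company Structure entries win over the shared template.
  return { ...base, ...(company.Structure || {}) };
}
```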
Challenge 2: Modern Websites Use Complex JavaScript
Many career pages render little HTML until their JS runs, or fetch listings via JSON after load.
Solution: Multi-Engine Scraping (choose the least brittle per site).
- Puppeteer for heavy JS:
const browser = await puppeteer.launch();
const page = await browser.newPage();
// Handle infinite scroll
if (company.Scrolling?.IsRequired) {
  await autoScroll(page);
}
// Click "load more" buttons
if (company.LoadMoreButton) {
  await page.click(company.LoadMoreButton);
}
- API Integration for Workday/Greenhouse/Eightfold (faster, cleaner):
// Workday API example
const response = await axios.post(
  'https://company.wd3.myworkdayjobs.com/wday/cxs/company/jobs',
  { appliedFacets: {}, limit: 20, offset: 0 }
);
- Cheerio for truly static HTML.
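The Puppeteer snippet above calls an `autoScroll` helper without showing it. A minimal sketch of one way to write it, assuming the real helper also applies per-site timeouts and the page cap mentioned later:

```javascript
// Minimal autoScroll sketch: scroll to the bottom, pause for lazy
// content, and stop when the page height stops growing or a step cap
// is reached. The production helper likely adds per-site timeouts.
async function autoScroll(page, { maxSteps = 80, pauseMs = 500 } = {}) {
  let lastHeight = 0;
  for (let i = 0; i < maxSteps; i++) {
    const height = await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
      return document.body.scrollHeight;
    });
    if (height === lastHeight) break; // no new content loaded
    lastHeight = height;
    await new Promise((resolve) => setTimeout(resolve, pauseMs));
  }
  return lastHeight;
}
```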
Challenge 3: Anti-Bot Measures
CAPTCHAs, rate-limits, and headless detection are common.
Solution: Human-Like Behavior
// Random, human-ish delays
await page.waitForTimeout(Math.random() * 3000 + 1000);
// Accept cookies like a person would
if (company.CookieButton) {
  await page.click(company.CookieButton);
}
// Real browser UA
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64)...');
I also rotate IPs and use residential proxies on sensitive targets. Network logging (XHR, fetch, or all requests) helps compare against DevTools and decide if an API route is feasible.
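A sketch of that network-logging hook, using Puppeteer's request events; the filter and log format here are illustrative:

```javascript
// Capture XHR/fetch traffic so responses can be compared against
// DevTools when deciding whether an API route is feasible.
function logNetwork(page, log = console.log) {
  page.on('request', (request) => {
    const type = request.resourceType();
    if (type === 'xhr' || type === 'fetch') {
      log(`${request.method()} ${request.url()}`);
    }
  });
}
```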
Where the Orchestrator Shines (Problems & Defenses)
- Inconsistent text: normalize capitalization, collapse whitespace, and parse locations from both title and location fields.
- Shadow DOM & iframes: deep selectors and iframe context switching.
- Load more & infinite scroll: visibility checks, disabled-state checks, fallback clicks, and a hard page cap (80).
- Missing fields: open detail pages in batches (default 10 tabs) and fetch Location, Location2, Description, Description2 with traditional and deep selectors.
- London-only rule (when trustworthy): if a company exposes reliable locations (structured field or API), filter to London early; otherwise keep everything until you can infer.
- Duplicate protection: site-side dedupe by Title + Location; DB dedupe by URL or Title + Location depending on platform behavior.
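The dedupe split above can be expressed as a single key function. `NoDuplicateLinkCheckRequired` is the flag named later in the text; the rest of this shape is a sketch:

```javascript
// Pick the dedupe key per platform: exact URL when links are unique,
// Title + Location when a platform reuses one shell URL.
function dedupeKey(job, company) {
  if (company.NoDuplicateLinkCheckRequired) {
    return `${job.title.trim().toLowerCase()}|${job.location.trim().toLowerCase()}`;
  }
  return job.applicationLink;
}
```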
Database Design for Complex Relationships
Jobs can have multiple locations, fit several roles, and map to different opportunity types. I model those as many-to-many relationships. The normalization level I chose is Boyce-Codd Normal Form (BCNF). Here, every determinant is a candidate key: JobID uniquely identifies each job, ApplicationLink is unique, and the composite keys in joblocations and jobroles fully define their relationships. BCNF is a stricter version of Third Normal Form that removes redundancy and prevents update or insertion anomalies by ensuring all functional dependencies are tied to candidate keys. This makes the database cleaner, more consistent, and easier to maintain and scale.
-- Core job entity
CREATE TABLE jobs (
  JobID INT PRIMARY KEY AUTO_INCREMENT,
  CompanyName VARCHAR(500),
  Title VARCHAR(500),
  ApplicationLink VARCHAR(500) UNIQUE,
  Description VARCHAR(500),
  IsActive BOOLEAN DEFAULT TRUE,
  CreatedAt TIMESTAMP
);
-- Many-to-many: Jobs ↔ Locations
CREATE TABLE joblocations (
  JobID INT,
  LocationID INT,
  PRIMARY KEY (JobID, LocationID)
);
-- Many-to-many: Jobs ↔ Roles
CREATE TABLE jobroles (
  JobID INT,
  RoleID INT,
  PRIMARY KEY (JobID, RoleID)
);
Reality: Some ATSs reuse one ApplicationLink for many distinct roles (Salesforce/Workday shells). I flag those with NoDuplicateLinkCheckRequired and switch dedupe to Title + Location. If you enforce UNIQUE(ApplicationLink), consider a composite unique index (CompanyName, Title, UnfilteredLocation) or a derived JobHash.
AI-Powered Job Intelligence
Raw postings are inconsistent. Titles like “Vice President - Investment Banking Division, EMEA Coverage” need normalization.
Solution: OpenAI Integration
const categorizeJob = async (jobTitle, description) => {
  const prompt = `
    Categorize this job into standardized roles and locations:
    Title: ${jobTitle}
    Description: ${description}
    Return JSON with: role, seniority, locations, requiresSponsorship
  `;
  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: prompt }]
  });
  return JSON.parse(response.choices[0].message.content);
};
I validate AI outputs against controlled vocabularies (valid roles and opportunities), and only store categorized rows when both sides match. I also summarize descriptions for fast mobile scanning.
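A sketch of that vocabulary gate; the example lists below are illustrative, not the real controlled vocabularies:

```javascript
// A categorized row is stored only when both sides match a controlled
// list, so a hallucinated role or opportunity never reaches the DB.
const VALID_ROLES = new Set(['Investment Banking', 'Consulting', 'Law']);
const VALID_OPPORTUNITIES = new Set(['Internship', 'Graduate', 'Full-Time']);

function isStorable(categorized) {
  return VALID_ROLES.has(categorized.role) &&
         VALID_OPPORTUNITIES.has(categorized.opportunity);
}
```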
Real-Time Matching Algorithm
Users fill a short intake. As new jobs arrive, I match instantly.
Two-tier approach
- Rule-based filtering (fast, deterministic)
const matchJob = (job, userProfile) => {
  if (!userProfile.locations.some(loc => job.locations.includes(loc))) return false;
  if (!userProfile.roles.some(role => job.roles.includes(role))) return false;
  if (userProfile.needsSponsorship && !job.sponsorshipAvailable) return false;
  return true;
};
- ML-enhanced scoring (learns from clicks and saves)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_relevance_score(job_description, user_preferences):
    # Fit on both texts so the vocabulary covers preference terms too;
    # fitting on the job alone would silently drop unseen preference words.
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([job_description, user_preferences])
    return cosine_similarity(vectors[0], vectors[1])[0][0]
The final ranking blends rules (hard constraints) with a score (soft preferences).
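That blend can be sketched in a few lines; `matchFn` and `scoreFn` stand in for `matchJob` and the TF-IDF relevance score, and the shape is illustrative:

```javascript
// Rules act as hard filters; the score only orders what survives.
function rankJobs(jobs, profile, matchFn, scoreFn) {
  return jobs
    .filter((job) => matchFn(job, profile))
    .map((job) => ({ job, score: scoreFn(job, profile) }))
    .sort((a, b) => b.score - a.score);
}
```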
Handling Scale and Reliability
Challenge: Scraping 1,000+ Sites Reliably
Sites go down, change DOMs, or block requests. I designed the scraper to fail softly and recover.
Solution: Robust Error Handling & Health Monitoring
const scrapeCompany = async (company) => {
  try {
    const jobs = await scrapeJobs(company);
    await saveJobs(jobs);
    await logSuccess(company.name);
    return { success: true };
  } catch (error) {
    await logError(company.name, error);
    if (company.priority === 'high') {
      await sendAlert(`${company.name} scraping failed`);
    }
    return { success: false, error }; // don’t crash the whole run
  }
};
Operationally, I use retries on first pass (refresh once if no listings appear), Shadow DOM fallbacks, and batch secondary scrapes (default 10 tabs). I track per-company metrics (found vs. new, potential deactivations, and missing-field counts) and feed everything into a dashboard (logs are batched and de-duplicated).
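A sketch of that per-run rollup, assuming each company scrape returns a small result record; the result shape here is illustrative:

```javascript
// Aggregate per-company results (found vs. new, missing fields,
// failures) into one summary before the dashboard push.
function summarizeRun(results) {
  return results.reduce(
    (acc, r) => {
      acc.found += r.found || 0;
      acc.new += r.new || 0;
      acc.missingFields += r.missingFields || 0;
      if (!r.success) acc.failures.push(r.company);
      return acc;
    },
    { found: 0, new: 0, missingFields: 0, failures: [] }
  );
}
```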
Challenge: Duplicate Detection
Same job can appear across platforms or be reposted with minor title changes.
Solution: hybrid deduping. Use the exact URL when it is reliable, or Title + Location when platforms reuse shell URLs, with an optional fuzzy match for cross-platform duplicates:
const isDuplicate = (newJob, existingJobs) => {
  return existingJobs.some(existing => {
    if (newJob.applicationLink === existing.applicationLink) return true;
    const titleSimilarity = similarity(newJob.title, existing.title);
    const sameCompany = newJob.companyName === existing.companyName;
    return titleSimilarity > 0.85 && sameCompany;
  });
};
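The snippet above calls a `similarity()` helper it never defines. One common choice is a Sørensen-Dice coefficient over character bigrams; this is a sketch, and the production helper may use a different metric:

```javascript
// Sørensen-Dice similarity over character bigrams: 1 for identical
// strings, 0 for strings sharing no bigrams.
function similarity(a, b) {
  const bigrams = (s) => {
    const counts = new Map();
    const t = s.toLowerCase();
    for (let i = 0; i < t.length - 1; i++) {
      const bg = t.slice(i, i + 2);
      counts.set(bg, (counts.get(bg) || 0) + 1);
    }
    return counts;
  };
  const aB = bigrams(a), bB = bigrams(b);
  let overlap = 0;
  for (const [bg, count] of aB) overlap += Math.min(count, bB.get(bg) || 0);
  const total = [...aB.values()].reduce((x, y) => x + y, 0)
              + [...bB.values()].reduce((x, y) => x + y, 0);
  return total === 0 ? 0 : (2 * overlap) / total;
}
```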
Frontend: Mobile-First Design
I ship a React Native app and a Next.js web client.
- Swipe interface for quick triage (Tinder-style)
- Push notifications for new matches
- Offline save + background sync
- PWA capabilities on web
Why mobile-first? People check roles throughout the day. Mobile push is critical for short-window postings.
The Business Model
I initially planned to charge users for instant notifications.
But speed also creates noise: when alerts are immediate, a lot of people apply everywhere with low intent, which clogs hiring pipelines and makes strong candidates harder to spot.
So I kept Birdello free for seekers and charged companies instead. They pay for signal: fewer, higher-quality applications that are better matched, not just more applicants.
Getting to 3.8K Users Without Marketing
- SEO-optimized category pages (“investment banking jobs London”, “consulting opportunities NYC”).
- Word-of-mouth in elite circles (teams share internally).
- Seasonal timing (launch aligned with recruiting cycles: Sep full-time, Jan internships).
- LinkedIn content with hiring trend insights from my data.
- University career centers (early adopters, high sharing velocity).
Lessons Learned
Scraping forced me to build for change, not for happy paths. Career sites shift layouts, load content late, and fail in small unpredictable ways, so reliability comes from graceful fallbacks, clear logging, and quick iteration. When a clean API exists, it is usually worth prioritizing because it reduces brittleness and makes deduping and change detection simpler. On the product side, speed only helps if the alerts stay accurate and relevant, so I biased the system toward curation and correctness instead of raw volume.
The Numbers (18 months)
Over 18 months, Birdello reached 3,800 users, processed 1M+ job listings, and supported thousands of applications, all through word of mouth with $0 spent on paid marketing.
Technical Stack Summary
- Backend: Node.js/Express API; Python + Puppeteer scraping services
- Frontend: React Native (mobile) and Next.js (web, with PWA support)
- Infrastructure: AWS RDS and S3
- Monitoring: structured logs, batched and de-duplicated, shipped to a dashboard
Appendix: Orchestrator Logging
The scraper runs at LoggingLevel(1) most days. It captures the few signals you actually need to debug broken scrapes without turning every run into a wall of text.
Mode "all" prints everything at the chosen level. "selective" prints only messages explicitly marked as report-worthy. Before anything hits the dashboard, logs are batched and de-duplicated, so repeated failures (for example a flaky “Load more”) do not flood the feed.
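That batching/dedup step can be sketched in a few lines; the `(xN)` repeat-count format is illustrative:

```javascript
// Collapse identical messages in one batch into a single entry with a
// repeat count, so a flaky "Load more" cannot flood the dashboard feed.
function batchLogs(messages) {
  const counts = new Map();
  for (const msg of messages) {
    counts.set(msg, (counts.get(msg) || 0) + 1);
  }
  return [...counts.entries()].map(([msg, n]) => (n > 1 ? `${msg} (x${n})` : msg));
}
```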