Why Keyword Matching Fails
Traditional applicant tracking systems (ATS) match resumes by counting keyword occurrences. A resume that repeats the word "Python" five times outscores a resume where Python is demonstrated through a complex distributed data pipeline, even though the latter candidate is clearly stronger. InnoHire.ai solves this with semantic matching.
Stage 1: Document Ingestion & Parsing
Supported input formats: PDF, DOCX, TXT, plain text paste.
The parser extracts the following structured fields from each resume:
- Contact block: name, email, phone, LinkedIn, location
- Work history: company, role, dates, bullet descriptions
- Education: institution, degree, major, graduation year
- Skills: hard skills, tools, languages, frameworks
- Certifications: issuing body, credential name, validity
- Projects: title, stack, description, impact metrics
A hybrid extraction approach is used: rule-based regex handles structured sections (phone numbers, dates, degree keywords), while a fine-tuned NER (Named Entity Recognition) model handles free-form sections like project descriptions and role responsibilities.
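The rule-based half of this approach can be sketched with a few regular expressions. This is a minimal illustration, not InnoHire.ai's actual parser: the patterns, the `extract_structured_fields` helper, and the sample text are all assumptions made for the example, and the NER half is not shown.

```python
import re

# Illustrative patterns only; a production parser needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")
DEGREE_RE = re.compile(
    r"\b(B\.?Sc|M\.?Sc|B\.?Tech|M\.?Tech|MBA|Ph\.?D)\b", re.IGNORECASE
)


def extract_structured_fields(text: str) -> dict:
    """Pull out fields that follow predictable surface patterns."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
        "degrees": [m.group(0) for m in DEGREE_RE.finditer(text)],
    }


# Hypothetical resume header used only for demonstration.
sample = (
    "Jane Doe | jane.doe@example.com | +1 (555) 010-7788\n"
    "M.Sc Computer Science, 2019"
)
fields = extract_structured_fields(sample)
print(fields)
```

Free-form sections such as project descriptions do not follow patterns like these, which is why the text routes them to a trained NER model instead.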
Stage 2: Normalisation
Raw extracted data contains inconsistencies that would corrupt scoring. The normalisation stage resolves these:
- Skill synonyms: "JS" → "JavaScript", "ML" → "Machine Learning", "k8s" → "Kubernetes"
- Date standardisation: "Jan '22 - Present" → 2022-01-01 to today
- Experience duration: computed in months per role and cumulatively per skill domain
- Title normalisation: "Sr. SWE" → "Senior Software Engineer" for clean comparison
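The skill and date rules above might look roughly like the following. It is a sketch under stated assumptions: the synonym table holds only the three pairs named in the list, two-digit years are assumed to mean 20xx, and every function name here is hypothetical rather than part of InnoHire.ai's code.

```python
import re
from datetime import date

# Assumed synonym table; only the pairs named in the text are included.
SKILL_SYNONYMS = {"js": "JavaScript", "ml": "Machine Learning", "k8s": "Kubernetes"}
MONTHS = {m: i for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}
# Splits on a hyphen, en dash, em dash, or the word "to", surrounded by spaces.
RANGE_SEP = re.compile(r"\s+(?:[-\u2013\u2014]|to)\s+")
DATE_TOKEN = re.compile(r"([A-Za-z]{3})[A-Za-z]*\.?\s+'?(\d{4}|\d{2})")


def normalise_skill(raw: str) -> str:
    """Map a known abbreviation to its canonical name; pass others through."""
    cleaned = raw.strip()
    return SKILL_SYNONYMS.get(cleaned.lower(), cleaned)


def parse_month_year(token: str, today: date) -> date:
    """Turn "Jan '22" or "Present" into a concrete date."""
    token = token.strip()
    if token.lower() in {"present", "current", "now"}:
        return today
    match = DATE_TOKEN.fullmatch(token)
    if match is None or match.group(1).lower() not in MONTHS:
        raise ValueError(f"unrecognised date: {token!r}")
    year = int(match.group(2))
    if year < 100:
        year += 2000  # assumption: two-digit years mean 20xx
    return date(year, MONTHS[match.group(1).lower()], 1)


def parse_date_range(raw: str, today: date) -> tuple[date, date]:
    start, end = RANGE_SEP.split(raw, maxsplit=1)
    return parse_month_year(start, today), parse_month_year(end, today)


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)


today = date(2024, 7, 15)  # fixed so the example is reproducible
start, end = parse_date_range("Jan '22 - Present", today)
print(start.isoformat(), months_between(start, end))
```

Passing `today` in explicitly, rather than calling `date.today()` inside the parser, keeps the duration calculation deterministic and easy to test.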
Stage 3: Semantic Embedding
Both the job description and the normalised candidate summary derived from the resume are passed independently through a sentence-transformer model (based on the BERT architecture, fine-tuned on recruitment domain data).
Each document is encoded into a 768-dimensional dense vector that captures semantic meaning rather than surface-level tokens. The model understands that "built REST APIs in Node.js" and "developed backend services using Express" are semantically equivalent, even without shared keywords.
The job description embedding represents the ideal candidate profile. The resume embedding represents the actual candidate profile. How closely the two vectors align, measured via cosine similarity, forms the base match score.
Stage 4: Cosine Similarity Scoring
Cosine similarity measures the angle between two vectors in high-dimensional space. A score of 1.0 means identical direction (perfect match); 0.0 means orthogonal (no relationship).
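For two vectors a and b, cosine similarity is the dot product divided by the product of their lengths: (a · b) / (‖a‖ ‖b‖). The pure-Python sketch below shows the computation on toy three-dimensional vectors; they stand in for the 768-dimensional embeddings, and returning 0.0 for a zero-length vector is a convention chosen for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Toy vectors, not real embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Because the result depends only on direction, a long resume and a short one that describe the same experience can score alike even though their vectors differ in magnitude.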
In practice, InnoHire.ai sees:
- 0.85-1.0: Excellent match. Candidate likely covers >90% of role requirements.
- 0.70-0.85: Strong match with some skill or experience gaps.
- 0.55-0.70: Partial match. Notable gaps but transferable experience exists.
- Below 0.55: Poor alignment. Candidate is unlikely to be a strong fit.
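These bands translate directly into a threshold lookup. The sketch below assumes that each band's lower bound is inclusive, since the ranges as listed share their endpoints; the labels and the `match_band` name are illustrative.

```python
# (lower bound, label) pairs, checked from highest to lowest.
# Assumption: the lower bound of each band is inclusive.
BANDS = [
    (0.85, "excellent"),
    (0.70, "strong"),
    (0.55, "partial"),
]


def match_band(similarity: float) -> str:
    """Map a cosine similarity score to one of the four match bands."""
    for lower_bound, label in BANDS:
        if similarity >= lower_bound:
            return label
    return "poor"


print(match_band(0.81))
```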
Stage 5: Gap Detection
After the base score is computed, a secondary module compares the job description's extracted required skills list against the candidate's normalised skill inventory. Missing skills are flagged as gaps with a gap severity score (critical vs. nice-to-have) based on how prominently they appear in the job description.
This gap list is surfaced to recruiters alongside the match score, and is used to generate targeted screening questions designed to probe whether the candidate can compensate for each gap.
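A simplified version of this comparison is sketched below. It assumes that "prominence" can be approximated by how often a skill is mentioned in the job description, which is a stand-in for whatever signal the real module uses. The `detect_gaps` function, the `critical_mentions` threshold, and the sample job text are all hypothetical.

```python
def detect_gaps(required_skills, candidate_skills, jd_text, critical_mentions=2):
    """Split required skills into matched ones and gaps with a severity label.

    Severity here is a mention-count heuristic (an assumption for this sketch):
    a missing skill named at least `critical_mentions` times is "critical".
    """
    have = {skill.lower() for skill in candidate_skills}
    jd_lower = jd_text.lower()
    matched, gaps = [], {}
    for skill in required_skills:
        if skill.lower() in have:
            matched.append(skill)
        else:
            mentions = jd_lower.count(skill.lower())
            gaps[skill] = (
                "critical" if mentions >= critical_mentions else "nice-to-have"
            )
    return matched, gaps


# Hypothetical job description and skill lists.
jd = (
    "We need Python and Docker. Docker experience in production is required. "
    "CI/CD is a plus."
)
matched, gaps = detect_gaps(
    required_skills=["Python", "Docker", "CI/CD"],
    candidate_skills=["python", "PostgreSQL"],
    jd_text=jd,
)
print(matched, gaps)
```

Comparing lowercased names only works because both sides have already been through the normalisation stage; without it, "k8s" and "Kubernetes" would be reported as a gap.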
Output Format
The engine returns a structured result object per candidate:
{
  "candidateId": "abc123",
  "overall_match_score": 84,
  "semantic_similarity": 0.81,
  "matched_skills": ["Python", "PostgreSQL", "FastAPI"],
  "missing_skills": ["Docker", "CI/CD"],
  "gap_severity": { "Docker": "critical", "CI/CD": "moderate" },
  "experience_years_relevant": 4.5,
  "screening_questions": [...],
  "boolean_search_string": "...",
  "linkedin_outreach_draft": "..."
}
Performance
Average processing time per resume: 1.4 seconds (end-to-end, including parse + embed + score + generate). Bulk processing of 100 resumes completes in under 90 seconds via the async job queue.
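These two figures imply some parallelism: 100 resumes at 1.4 seconds each would take 140 seconds one at a time, so finishing in under 90 seconds needs at least two running concurrently. The sketch below shows that arithmetic and one common way to bound concurrency with `asyncio`. The `process_resume` stub, the `max_concurrency` value, and the short sleep are placeholders, not details of InnoHire.ai's job queue.

```python
import asyncio
import math

# Back-of-envelope check, assuming ideal parallel speed-up.
sequential_seconds = 100 * 1.4
min_workers = math.ceil(sequential_seconds / 90)


async def process_resume(resume_id: str) -> dict:
    """Placeholder for parse + embed + score + generate."""
    await asyncio.sleep(0.01)  # stands in for real work
    return {"candidateId": resume_id}


async def process_batch(resume_ids, max_concurrency: int = 4) -> list:
    """Process resumes concurrently, never more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(resume_id: str) -> dict:
        async with semaphore:
            return await process_resume(resume_id)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(r) for r in resume_ids))


results = asyncio.run(process_batch([f"r{i}" for i in range(10)]))
print(min_workers, len(results))
```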