Overview: The Four Skill Lists Every Evaluation Produces
When InnoHire.ai evaluates a candidate, it doesn't produce a single ranked list of skills; it builds four distinct skill sets, each answering a different question about candidate fit. Every number you see in the match panel, every green chip and red chip in the results, is derived from the intersection and difference of these four lists.
- JD Skills: what the role requires, extracted from the job description.
- Resume Skills: what the candidate claims to have, extracted from the resume.
- Matched Skills: the intersection, i.e. JD skills confirmed present on the resume.
- Missing Skills: the gap, i.e. JD skills not found on the resume after normalization.
"The match score isn't a keyword count. It's the result of a multi-stage pipeline that normalizes, validates, and cross-references skills across a 263-entry taxonomy before deciding what "matched" means.
JD Skills: Parsing the Job Description
The job description parsing step extracts structured data from raw JD text via the parse_jd function. This step identifies the job title, required skills, preferred skills, years of experience required, and industry domain. The resulting jd_skills list typically contains 8–20 skills depending on the JD's specificity.
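Though the article doesn't show parse_jd's actual return type, the fields listed above suggest a structured shape along these lines. The field names below are assumptions for illustration, not the real schema:

```python
from typing import TypedDict

# Hypothetical shape of parse_jd's output; the fields mirror those
# described above (title, required/preferred skills, years, domain)
# but are assumptions, not the production schema.
class ParsedJD(TypedDict):
    title: str
    required_skills: list[str]
    preferred_skills: list[str]
    years_of_experience: int
    domain: str

example: ParsedJD = {
    "title": "Senior Backend Engineer",
    "required_skills": ["Python", "PostgreSQL", "AWS"],
    "preferred_skills": ["Kubernetes"],
    "years_of_experience": 5,
    "domain": "Healthcare",
}
```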
What counts as a JD skill?
JD skills are named technical or methodological competencies required for the role. The parser separates these from noise like "strong communication skills," contract terms, location requirements, and salary information. A blocklist of 40+ noise terms ensures that words like contract, w2, relocation, and years experience are never included in the skill set.
Importantly, JD skills are extracted using the same taxonomy the resume extractor uses, so k8s in a JD is normalized to Kubernetes before any comparison happens, eliminating false negatives caused by abbreviation differences.
JD skill normalization example
A JD that says "experience with k8s and AWS EKS" produces the JD skills Kubernetes and AWS, not the raw strings "k8s" and "AWS EKS". This normalization happens before any matching, so aliases never cause false mismatches.
Resume Skills: The 4-Stage Extraction Pipeline
Extracting skills from a resume is far harder than extracting them from a JD. A JD is structured: skills are usually listed in a clean requirements section. A resume is unstructured: skills appear in bullet points, project descriptions, certifications, and even job titles, in dozens of different phrasings.
InnoHire.ai uses a 4-stage pipeline to extract resume skills:
Exact Matching
Each taxonomy entry (canonical name + all aliases + variants) is searched verbatim in the resume text using regex with token boundaries. Confidence score: 0.90–0.95. This catches 'Python', 'python3', 'py' and maps all three to the canonical 'Python'.
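A minimal sketch of what this stage might look like for a single taxonomy entry; the term list and confidence values below are illustrative, not the real taxonomy data:

```python
import re

# Stand-in for one taxonomy entry; the real pipeline iterates over all
# canonical names, aliases, and variants in SKILLS_TAXONOMY.
PYTHON_TERMS = ["Python", "python3", "py"]

def exact_match(resume_text: str) -> list[tuple[str, float]]:
    """Return (canonical, confidence) pairs for verbatim hits."""
    hits = []
    for term in PYTHON_TERMS:
        # \b token boundaries keep 'py' from matching inside 'numpy'
        if re.search(rf"\b{re.escape(term)}\b", resume_text, re.IGNORECASE):
            conf = 0.95 if term == "Python" else 0.90  # illustrative scores
            hits.append(("Python", conf))
    return hits
```

Multiple hits for the same canonical name are collapsed later by the deduplication pass described at the end of this article.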
Fuzzy Matching
For skills not found exactly, sequence-based similarity scoring is applied at an 85% threshold. This catches typos like 'Postgress' → 'PostgreSQL' and alternate spellings. It is skipped for short terms (≤2 chars) to prevent false positives.
Context-Based Matching
Skills implied by action verbs are detected: e.g., 'built using Django' implies Python even if Python isn't written separately. Patterns like 'verb + using/with/leveraging + tool' trigger context inferences for parent skills.
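A rough sketch of the context pattern; the verb list, the tool-to-parent table, and the 0.70 confidence are all illustrative assumptions:

```python
import re

# Hypothetical tool -> parent-skill table; the real mapping comes from
# the taxonomy's parent-child relationships.
TOOL_PARENT = {"django": "Python", "spring boot": "Java"}

# 'verb + using/with/leveraging + tool' pattern from the description above
PATTERN = re.compile(
    r"\b(built|developed|implemented|deployed)\s+"
    r"(?:\w+\s+)*?(?:using|with|leveraging)\s+([\w .-]+)",
    re.IGNORECASE,
)

def context_matches(text: str) -> list[tuple[str, float]]:
    """Infer parent skills implied by 'verb + using/with + tool' phrases."""
    inferred = []
    for m in PATTERN.finditer(text):
        tool = m.group(2).strip().lower()
        for known, parent in TOOL_PARENT.items():
            if tool.startswith(known):
                inferred.append((parent, 0.70))  # assumed confidence
    return inferred
```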
Semantic Matching
When sentence-transformers are available, embedding-based similarity finds conceptually related skills not covered by the taxonomy. Adds rare or niche skills that exact/fuzzy matching misses. Confidence: 0.50–0.85.
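The production stage uses sentence-transformer embeddings; the toy sketch below substitutes character-bigram vectors purely to illustrate the cosine-similarity gate and the 0.50–0.85 confidence clamp (both the stand-in embedding and the floor value are assumptions):

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy character-bigram 'embedding'; a stand-in for real sentence vectors."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_match(phrase: str, candidates: list[str], floor: float = 0.50):
    """Return (best_skill, confidence clamped to <= 0.85) or None below floor."""
    scored = [(c, cosine(embed(phrase), embed(c))) for c in candidates]
    skill, sim = max(scored, key=lambda x: x[1])
    return (skill, min(sim, 0.85)) if sim >= floor else None
```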
The 263+ Entry Skills Taxonomy
Every skill in the system traces back to a single source: the SKILLS_TAXONOMY, a curated list of SkillEntry objects, each containing a canonical name, category, aliases, parent skill, and spelling variants.
The taxonomy covers 13 skill categories, including:
- Programming Languages: Python, Java, JavaScript, TypeScript, Golang, Rust, Scala, and 10+ more
- Frameworks: React, Angular, Vue.js, Spring Boot, Django, FastAPI, NestJS, and more
- Cloud Platforms: AWS (with 15+ sub-services), Azure, GCP
- DevOps Tools: Docker, Kubernetes, Terraform, Ansible, Jenkins, GitHub Actions, Helm
- Databases: PostgreSQL, MySQL, MongoDB, Redis, Cassandra, DynamoDB, Snowflake
- Data Engineering: Apache Kafka, Spark, Airflow, Databricks, Hadoop
- Domain & Compliance: HIPAA, HL7 FHIR, SOQL, EHR, LOINC, SNOMED, OAuth
- Methodologies: Agile, Scrum, CI/CD, DevOps, SRE, Kanban
- Platforms: Salesforce (with Apex, LWC, SOQL as child skills), SAP, ServiceNow
Parent-child skill relationships
The taxonomy uses a hierarchy. If Spring Boot is found (child of Java), Java is inferred as a context match unless Java is already explicitly listed. If SOQL appears, Salesforce is confirmed. These relationships prevent false negatives for candidates who list frameworks but not their parent language.
Matched Skills: How the Comparison Actually Works
After both JD skills and resume skills are extracted and normalized, the matching step is deceptively simple in code but rich in what it silently handles. Here is the exact logic:
Step 1: Normalize each JD skill via taxonomy
Every JD skill string is run through normalize_skill_via_taxonomy(). This function checks canonical names, aliases, and variants bi-directionally, meaning k8s in the JD matches Kubernetes in the taxonomy, and Kubernetes in the resume also normalizes to the same canonical entry. The comparison therefore happens in canonical-name space, not raw-string space.
Step 2: Build normalized resume skill set
All resume skills are lowercased and normalized the same way. The result is a flat set of canonical skill names. Skills on the blocklist (noise terms) are silently discarded before this set is built.
Step 3: Check membership
For each normalized JD skill, the system checks whether it exists in the normalized resume skill set. If it does, the skill is added to matched_skills; if not, to missing_skills. The symmetry of the normalization step means that a candidate who wrote k8s on their resume matches a JD that wrote Kubernetes: vocabulary differences never create false gaps.
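The three steps above condense into a few lines of code. Here, normalize() stands in for the real normalize_skill_via_taxonomy(), and the alias table is a tiny illustrative subset:

```python
# Illustrative alias table; the real lookup spans the full taxonomy's
# canonical names, aliases, and variants.
ALIASES = {"k8s": "kubernetes", "postgres": "postgresql", "golang": "go"}

def normalize(skill: str) -> str:
    """Stand-in for normalize_skill_via_taxonomy(): canonical, lowercased."""
    s = skill.strip().lower()
    return ALIASES.get(s, s)

def match_skills(jd_skills: list[str], resume_skills: list[str]):
    resume_set = {normalize(s) for s in resume_skills}   # step 2
    matched, missing = [], []
    for skill in jd_skills:                              # steps 1 + 3
        (matched if normalize(skill) in resume_set else missing).append(skill)
    return matched, missing
```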
Missing Skills: Anatomy of a Gap
A skill landing in the missing_skills list means the following is true: the JD required it, the taxonomy normalized it to a canonical name, and no entry in the resume's normalized skill set matched that canonical name, whether directly, via alias, or via variant.
What missing skills are NOT
Missing skills are not a penalty for vocabulary differences. They are not triggered by abbreviation mismatches (thanks to normalization). They are not a measure of overall fit: a candidate can have a 90% match score with 3 missing skills if those skills are non-critical.
What missing skills ARE
Missing skills are a verifiable gap: the role requires this competency, and the resume provides no textual evidence, direct or inferred, of its presence. They become the primary input for:
- Gap Severity Scoring: weighting the missing skill by its importance to the role
- Gap Analysis: categorizing the gap by criticality and providing development recommendations
- Interview Prep: generating Gap Probe questions specifically targeting these missing competencies
The 'Docker vs Kubernetes' edge case
A JD requiring Kubernetes and a resume listing only Docker: Kubernetes lands in missing_skills, Docker in resume_skills. They are related but not the same canonical entry. The Gap Analysis engine (a separate step) handles the "Docker Swarm → Kubernetes" partial-match inference; the skill extractor deliberately keeps these as separate entries for accuracy.
The Hybrid Scoring Model: How Skill Lists Drive the Final Score
The four skill lists feed directly into InnoHire.ai's hybrid scoring model, a weighted combination of algorithmic scoring and GPT-4o LLM scoring.
Algorithmic score
The algorithmic score is calculated from the RequirementMatrix, a structured mapping of JD requirements to resume evidence. The matched/missing skill lists are the primary input: a higher ratio of matched to total JD skills increases the algorithmic score, weighted by skill category (mandatory skills outweigh preferred skills).
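A minimal sketch of the category-weighted ratio described above; the 2:1 mandatory-to-preferred weighting is an assumption for illustration, not the production value:

```python
# Assumed weights: mandatory skills count double relative to preferred.
WEIGHTS = {"mandatory": 2.0, "preferred": 1.0}

def algorithmic_score(jd_skills: dict[str, str], matched: set[str]) -> float:
    """jd_skills maps skill -> 'mandatory' | 'preferred'; returns 0-100."""
    total = sum(WEIGHTS[cat] for cat in jd_skills.values())
    got = sum(WEIGHTS[cat] for skill, cat in jd_skills.items() if skill in matched)
    return 100.0 * got / total if total else 0.0
```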
LLM score
GPT-4o is prompted as a "Senior Technical Recruiter" to evaluate the same JD + resume and return a final_match_percent. The LLM considers context that the algorithm cannot: years of experience, soft skills, role alignment, and industry fit.
Hybrid blend logic
The two scores are combined using a conditional rule:
- If LLM score ≥ 70%: trust the LLM fully. Final score = LLM score.
- If LLM score < 70%: apply hybrid rescue (60% LLM + 40% algorithmic). This prevents the LLM from being too harsh on candidates with strong keyword coverage but less polished formatting.
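The blend rule transcribes directly into code; the 70% cutoff and the 60/40 split come from the rules above:

```python
def hybrid_score(llm: float, algorithmic: float) -> float:
    """Blend LLM and algorithmic scores per the conditional rule above."""
    if llm >= 70.0:
        return llm                            # trust the LLM fully
    return 0.6 * llm + 0.4 * algorithmic      # hybrid rescue below 70%
```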
Why hybrid scoring is fairer than either alone
Pure keyword matching misses context. Pure LLM scoring misses explicit skills. The hybrid model uses the LLM for high-confidence accepts and the algorithm as a safety net for borderline cases, giving every candidate credit for both what they wrote and what it implies.
Noise Filtering, Disambiguation & Deduplication
Before any skill list is finalized, three cleanup passes run:
Noise Filtering (Blocklist)
A 40+ term blocklist removes hiring noise that would otherwise pollute the skill sets: contract, w2, citizen, relocation, degree, bachelor, years experience, day-of-week names, month names, and common soft-skill clichés like "team player" and "detail oriented". Any skill that normalizes to a blocklist term is silently discarded.
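A sketch of the blocklist pass, using an illustrative subset of the 40+ terms; matching happens on the normalized (lowercased) form, as described above:

```python
# Illustrative subset of the blocklist, not the full 40+ term list.
BLOCKLIST = {"contract", "w2", "citizen", "relocation", "degree",
             "bachelor", "years experience", "team player", "detail oriented"}

def filter_noise(skills: list[str]) -> list[str]:
    """Silently discard any skill whose normalized form is a blocklist term."""
    return [s for s in skills if s.strip().lower() not in BLOCKLIST]
```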
Disambiguation
When multiple taxonomy entries match the same text (e.g., a resume says "React.js", which could match both the alias React.js and the canonical React), disambiguation keeps only the canonical form. Similarly, if both Spring and Spring Boot are found, the more specific one (Spring Boot) wins and the generic entry is removed.
Deduplication
If the same canonical skill was matched by multiple methods (e.g., exact + context), only the highest-confidence match is retained. The final skill lists contain unique canonical names only, sorted by confidence score descending.
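The deduplication pass can be sketched as a keep-the-best reduction over (canonical, confidence) pairs, sorted by confidence descending as described above:

```python
def dedupe(matches: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Keep only the highest-confidence match per canonical skill name."""
    best: dict[str, float] = {}
    for canonical, conf in matches:
        best[canonical] = max(conf, best.get(canonical, 0.0))
    # final lists contain unique canonical names, highest confidence first
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```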
The result is a clean, defensible, canonical set of skills: exactly what's shown in the Resume Matcher's match panel, and exactly what drives every downstream feature, from gap analysis and recruiter questions to interview prep and the match score.