What I Learned Building a Job-Matching System in Hebrew: Reversed Text, I/O Psychology, and When to…

What I Learned Building a Job-Matching System in Hebrew: Reversed Text, I/O Psychology, and When to Ditch the LLM

This is Part 2 of a series on building job-matching systems. Part 1 covered why job matching is fundamentally harder than it looks. This post is the technical deep-dive.

In Part 1, I wrote about the surprising complexity of matching people to jobs. The most common question afterward was: “So how did you actually build it?”

This post is my answer. I will walk through the architecture, the Hebrew NLP challenges that nearly broke me, the role that I/O psychology played in the design, and the production lessons I learned the hard way. No code snippets (NDA), but I will describe every meaningful decision and why I made it.

The Architecture: Three Services and a Queue

The system is a three-service pipeline. Job ads and resumes flow through independent parsing services, then converge in a shared semantic matching service.

Job Ad Parser. Takes a raw job posting and extracts structured fields: title, requirements, education level, required licenses, skills, experience level, and more. An LLM does the heavy lifting, constrained by Pydantic schemas and structured output mode.

Resume Parser. Takes a PDF or DOCX, extracts text, detects language, handles Hebrew RTL issues, strips PII, and produces a structured profile.

Semantic Matching Service. Both parsers call this shared service to map free-text fields to standardized taxonomies. “Bachelor’s in Computer Science” needs to map to a canonical education level. “רישיון עורך דין” (attorney license) needs to map to one of 46 professional license codes.

The matching service is a RAG pipeline adapted for taxonomy matching rather than document Q&A. The retrieval stage uses vector similarity search with embeddings stored in PostgreSQL via pgvector, using IVFFlat indexing for approximate nearest neighbor lookup, to narrow thousands of taxonomy entries to the top-k candidates by cosine similarity.

I tested multiple embedding models at 1536 dimensions. OpenAI’s small embedding model performed poorly on Hebrew text. Matches were noisy, confidence scores were unreliable, and the top-k candidates frequently included irrelevant entries. Switching to Gemini Embedding 001 at the same dimensionality was a night-and-day difference. The cosine similarity scores became meaningful, the top-5 candidates were almost always reasonable, and the LLM-as-judge stage went from rejecting most candidates to confirming most of them. My best guess is that Gemini’s multilingual training data includes substantially more Hebrew, but I have not verified this. The practical takeaway: if you are working in a non-English language, do not assume the most popular embedding model is the best one. Test. The reranking stage uses an LLM-as-judge to select the best match from those candidates, or reject all of them if none fit. This is a standard RAG pattern. The retrieval stage handles the computational heavy lifting; the LLM-as-judge handles the nuanced reasoning that cosine similarity alone cannot make.

The whole thing runs on AWS. Lambda for the job ad parser and matching service, ECS Fargate for the resume parser (the PII model is too large for Lambda), SQS queues connecting everything with dead-letter queues for reliability.

The three services communicate asynchronously through SQS, making this effectively a multi-agent orchestration pattern where each agent has a single responsibility. This separation matters. Job ads and resumes have completely different structures, different failure modes, and different scaling patterns. Keeping them independent means I can deploy, debug, and scale each one without touching the others.

Hebrew NLP: The Problems Nobody Warns You About

This is the part I most want to share, because there is almost nothing written about it. If you have only worked with English NLP, you have probably never encountered these problems.

Reversed Text and Final-Form Letter Detection

When you extract text from Hebrew PDFs, you will sometimes get reversed text. Not garbled, not corrupted. Reversed. The word “שלום” (shalom) becomes “םולש”. This happens because of how PDF renderers handle right-to-left text internally. Some store the logical order, some store the visual order, and when your extraction library guesses wrong, every word comes out backwards.

My first instinct was to throw this at an LLM: “Here is some text that might be reversed. Fix it.” It worked, but it was slow, expensive, and sometimes “fixed” text that was not actually broken.

The rule-based approach turned out to be both faster and more reliable, and the key insight is one of the more obscure features of the Hebrew writing system: final-form letters.

Five Hebrew letters have special forms that only appear at the end of a word: ך, ם, ן, ף, ץ. Their regular forms (כ, מ, נ, פ, צ) appear everywhere else. This is a hard rule of Hebrew orthography, not a tendency. When you see a final-form letter at the beginning of a word, that word is almost certainly reversed.

The detector scans all Hebrew words in the document. If more than 10% have final-form letters at the start, the entire document gets a global flip: every word is reversed. This threshold avoids false positives from the occasional OCR artifact while catching genuinely reversed documents.

This is a case I keep coming back to: the boring solution won. The rule-based detector runs in milliseconds, never hallucinates, has zero cost, and has been essentially bulletproof in production. The LLM approach was slower, more expensive, and less reliable. Not every problem needs a model.

PII Detection in Hebrew

English NER models are useless for Hebrew PII. Names like “יוסי כהן” or “נועה לוי” are invisible to models trained on “John Smith.” I use GolemPII-v1, a Hebrew-specific PII model, for detecting and redacting names, ID numbers, phone numbers, and addresses. This is also why the resume parser runs on ECS Fargate instead of Lambda: the transformer-based PII model exceeds Lambda’s size and memory constraints.

Language detection itself is simple: I check the ratio of characters in the Hebrew Unicode range (U+0590 to U+05FF) to total alphabetic characters. Above a threshold, the document is Hebrew and gets routed through the Hebrew-specific pipeline.

I/O Psychology on Top of ML

My PhD is in industrial-organizational psychology, and it shaped the system in ways that pure ML engineering would not have produced. Most job-matching systems are built entirely by engineers. They are technically solid but domain-naive. Adding even basic occupational psychology makes a real difference.

The education taxonomy reflects Israeli reality. Most systems use a simple scale: high school, bachelor’s, master’s, PhD. Israel has additional levels that matter enormously for job matching: vocational certificates (תעודת מקצוע), technician diplomas (טכנאי), and practical engineer degrees (הנדסאי). A practical engineer is not a bachelor’s degree, but it is more than a technician diploma. Getting this hierarchy wrong means bad matches. This is not something an embedding model will learn from training data. It requires knowing the Israeli labor market.

Professional licenses encode minimum education. In Israel, a lawyer must hold at least a law degree. A physician must have an MD. A psychologist must have at least an MA. When the system detects a professional license, it infers a minimum education level even if the resume or job ad does not state it explicitly. This is domain knowledge, plain and simple. No amount of cosine similarity over embedding vectors will capture it.

Skills get split into Listed and Inferred. Listed skills are explicitly mentioned in the text (“Proficient in Python”). Inferred skills are deduced from work history (“Managed a team of 12 engineers” implies people management, hiring, performance reviews). This distinction matters because inferred skills carry less certainty and should be weighted differently in matching. This split comes directly from I/O psychology’s distinction between stated competencies and demonstrated behaviors.

The DynamoDB to PostgreSQL Migration

I started with DynamoDB because it is the default serverless database on AWS. It seemed like the natural choice. In practice, it made the system nearly impossible to operate.

The core problem was observability. With DynamoDB, I could not easily query across all jobs to check processing status. I could not investigate why a particular resume failed without knowing its exact partition key. I could not run ad-hoc queries to find patterns in failures or check how many documents were stuck in a particular state. Every debugging session turned into a scavenger hunt through CloudWatch logs because the database itself was opaque.

When I needed vector similarity search for the RAG pipeline, I migrated to PostgreSQL with pgvector, and the operational improvements went far beyond just having vector search.

With PostgreSQL, I got SQL queries for monitoring. I could check processing status across the entire system with a single query. I could investigate failures by joining job records with their child tables. I could track costs, cache hits, and error rates with straightforward SQL.

I also got relational child tables for normalized data instead of stuffing everything into nested DynamoDB documents. Schema-based dev/prod separation instead of table-name prefixes. And pgvector meant embeddings lived in the same database as everything else, no separate vector store to manage.

The migration was painful. It cost me weeks. But the system became dramatically easier to run, debug, and improve afterward. If I started over, I would begin with PostgreSQL from day one. The DynamoDB detour taught me that for a system you need to actively monitor and debug, operational ergonomics matter more than theoretical scalability.

The Standard Engineering Bits

The rest of the technical stack is solid but unremarkable, and I want to be honest about that.

Pydantic schemas with structured output. Every LLM call has a corresponding Pydantic model that defines the expected response schema. I use structured output mode (JSON mode constrained to the schema) so the model is guaranteed to produce valid, parseable output. This eliminates an entire class of deserialization errors. It is the right approach, but it is also well-documented at this point.

SQS with dead-letter queues. Messages that fail processing go to a DLQ instead of disappearing. Combined with idempotent upserts, this makes the system self-healing. Standard pattern, works great.

Query caching. Every query-to-taxonomy lookup is cached using an MD5 hash with a 30-day TTL. In HR data, the same skills and education levels appear constantly. After a few hundred documents, the cache handles the majority of lookups without any API calls. The cost savings were immediate and significant.

Batching LLM-as-judge calls. Instead of one LLM call per field-to-taxonomy match, I collect all matches within a single SQS message batch and send them in one prompt. This was the biggest cost lever in the system.

Cost tracking per document. Every processed document gets a cost breakdown stored in the database. This is not just accounting. It drives optimization decisions and helps me spot regressions quickly.

Prompt injection defense for resumes. Resumes are adversarial input. People embed instructions hoping to manipulate ATS systems. The system prompt explicitly states to treat all resume text as data, not instructions. Not foolproof, but it catches the obvious cases.

Other Production Lessons

API Gateway to direct Lambda invocation. The original setup routed requests through API Gateway. Cloudflare’s bot protection started blocking some of these requests. Switching to direct Lambda invocation via the AWS SDK eliminated the problem. Not every service needs an HTTP endpoint.

Embedding generation is cheap, LLM-as-judge is not. Gemini embeddings are currently on the free tier, making the vector similarity search stage essentially zero-cost. The LLM-as-judge reranking calls are where the money goes. This is why caching and batching matter so much. Design your cost optimization around the expensive stage.

What I Would Do Differently

PostgreSQL from day one. The DynamoDB detour cost me weeks and taught me a lesson about choosing databases for operational reality rather than theoretical fit.

Earlier investment in query caching. The first version made redundant API calls for identical queries, and the savings from caching were obvious in retrospect.

The RAG-with-LLM-as-judge pattern and the I/O psychology layer, though, I would keep exactly as built. The pattern is not novel, but the combination of vector similarity search, LLM-as-judge reranking, and occupational psychology taxonomies works well for this problem space.

If you are building something similar, especially in a non-English market, I would love to hear about it. The intersection of NLP, domain expertise, and production engineering is where the interesting lessons come from.

Tom Ron, PhD, is a data scientist specializing in AI-powered HR tools. He holds a doctorate in industrial-organizational psychology and a master’s in data science. Learn more at tomron.ai.

What I Learned Building a Job-Matching System in Hebrew: Reversed Text, I/O Psychology, and When to… was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.