
Resume Parsing with LLM to Extract Structured Data | LangChain Tutorial

  • Writer: Hackers Realm
  • Nov 16
  • 7 min read

In today’s data-driven world, companies receive hundreds of resumes for a single job posting. Manually reviewing each one to extract key information — like skills, experience, and education — is time-consuming, error-prone, and inefficient. This is where AI-powered resume parsing comes in. By leveraging natural language processing (NLP) and large language models (LLMs), organizations can automatically convert unstructured resume text into structured, searchable data.

Resume Parsing with LLM

In this tutorial, we’ll explore how to build a Resume Parsing system using LangChain, a powerful framework for building LLM-based applications. You’ll learn how to extract key details such as candidate names, contact information, experience, and skills, and represent them in a structured format — ready for analysis or integration into HR systems.


You can watch the video-based tutorial with a step-by-step explanation down below.


Import Modules

!pip install langchain langchain-community langchain-openai

Install the above packages to continue with the rest of the code.

from pydantic import BaseModel, Field
from typing import List, Optional

from langchain.prompts import PromptTemplate
from langchain.chat_models import init_chat_model
from langchain.output_parsers import PydanticOutputParser
from langchain_community.document_loaders import PyPDFLoader
  • from pydantic import BaseModel, Field –  Used to define a structured schema for our resume data. With Pydantic, we can clearly specify what information we want to extract (like name, email, education, experience, etc.) and validate that the AI model outputs it correctly.

  • from typing import List, Optional – These help us define flexible data types within our schema, such as optional fields or lists (for example, a list of skills or past job experiences).

  • from langchain.prompts import PromptTemplate – A core LangChain component that lets us design a clear, reusable prompt to instruct the AI model on exactly how to extract data from the resume text.

  • from langchain.chat_models import init_chat_model  – This function initializes the large language model (LLM) that will power our resume parser. It connects LangChain to a model like GPT, enabling it to understand and process resume content intelligently.

  • from langchain.output_parsers import PydanticOutputParser – Ensures that the AI’s output follows the structured schema we defined. Instead of getting free-form text, this parser converts the model’s response into clean, machine-readable data. (In this tutorial we ultimately rely on with_structured_output instead, so this import is optional.)

  • from langchain_community.document_loaders import PyPDFLoader – Resumes are most often shared in PDF format, so this loader helps us extract text content directly from PDF files, preparing them for AI processing.



Defining the Resume Data Schema with Pydantic


Once we’ve set up the required modules, the next step is to define a structured format for the data we want to extract from resumes. To do this, we use Pydantic, a powerful Python library that allows us to create data models with built-in validation.

# create class for each main category
class Education(BaseModel):
    university_name: str = Field(..., description='Name of the university')
    degree: str = Field(..., description='Degree Obtained')
    gpa: Optional[float] = Field(None, ge=0, le=10.0, description='GPA (0-10 scale)')

class Experience(BaseModel):
    company_name: Optional[str] = Field(None, description='Name of the company')
    n_years: Optional[int] = Field(None, ge=0, description='Number of years of experience in the company')
    project_name: Optional[str] = Field(None, description='Name of the main project')
    project_description: Optional[str] = Field(None, description='Role and achievements described in the project')
    tech_stack: Optional[str] = Field(None, description='Technologies and tools used in the project')

class Resume(BaseModel):
    name: str = Field(..., description='Full name of the candidate')
    age: Optional[int] = Field(None, ge=0, description='Age of the candidate')
    email: str = Field(..., description='Email Address')
    phone_number: str = Field(..., description='Phone Number')
    experience: Optional[List[Experience]] = Field(None, description='List of professional experience')
    education: Optional[List[Education]] = Field(None, description='Educational background')
    languages: Optional[str] = Field(None, description='Languages known')
  • We define three main Pydantic models — Education, Experience, and Resume — to represent structured information about a candidate.

  • Each class uses fields (Field) with descriptions that tell the AI model exactly what kind of information to extract and how to format it.

  • The Education class includes details such as the university name, degree, and GPA.

  • The Experience class captures key work details like company name, years of experience, project information, and the tech stack used.

  • Finally, the Resume class ties everything together, including the candidate’s personal information (name, age, email, phone) and lists of their education and experience entries.

  • By defining this schema, we give the AI model a clear blueprint for how to organize the extracted information. Instead of returning messy text, the model’s output will now be structured, validated, and ready to use for downstream applications like HR databases or search filters.
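The schema also gives us validation for free. As a quick sketch (the university and degree values below are made-up samples), Pydantic accepts data that satisfies the field constraints and rejects data that violates them:

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class Education(BaseModel):
    university_name: str = Field(..., description='Name of the university')
    degree: str = Field(..., description='Degree Obtained')
    gpa: Optional[float] = Field(None, ge=0, le=10.0, description='GPA (0-10 scale)')

# Well-formed data passes validation
edu = Education(university_name='Example University', degree='B.Sc. Computer Science', gpa=8.5)
print(edu.gpa)

# A GPA outside the 0-10 range violates the le=10.0 constraint
try:
    Education(university_name='Example University', degree='B.Sc.', gpa=42)
except ValidationError as e:
    print('rejected:', len(e.errors()), 'error(s)')
```

The same check will later run automatically on whatever the LLM returns, so malformed extractions fail loudly instead of slipping into your database.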



Creating a Prompt Template for the AI Model


After defining our data schema, the next step is to tell the AI how to extract the information from a resume. To do this, we create a prompt template — a structured instruction that the model will follow when analyzing the resume text.

resume_template = """
You are an AI assistant tasked with extracting structured information from a technical resume.

Extract only the information that corresponds to the fields defined in the Resume class.

Resume Context:
{resume_text}
"""

prompt_template = PromptTemplate(
    template=resume_template,
    input_variables=['resume_text']
)
  • The resume_template defines the exact instructions the AI model should follow.

    • It tells the model that its role is to extract structured information from a resume.

    • It also specifies that only the fields defined in our Resume class (from the previous step) should be extracted — helping keep the output clean and relevant.

    • The placeholder {resume_text} will later be replaced with the actual text extracted from the PDF resume.

  • The PromptTemplate from LangChain then wraps this instruction into a reusable object.

    • By defining input_variables=['resume_text'], we tell LangChain which dynamic part of the prompt will change each time we process a new resume.

  • Essentially, this step builds the communication layer between your AI model and the resume data. The prompt ensures that every resume is processed in a consistent, structured way — with clear guidance on what information to extract.
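Under the hood, this substitution is simple string templating. A stdlib-only sketch of what invoke produces, using an abbreviated template and a made-up resume snippet as stand-ins:

```python
# Abbreviated stand-in for the resume_template defined above
template = """You are an AI assistant tasked with extracting structured information from a technical resume.

Resume Context:
{resume_text}"""

# Made-up snippet standing in for real PDF text
sample_text = "John Doe\njohn@example.com\nSkills: Python, SQL"

# Essentially what prompt_template.invoke({'resume_text': ...}) produces
filled = template.format(resume_text=sample_text)
print(filled)
```

PromptTemplate adds value on top of this by validating the input variables and composing cleanly with models and parsers in LangChain pipelines.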



Initializing the AI Model for Structured Resume Extraction


Now that we’ve defined our data schema and designed the prompt, it’s time to initialize the language model (LLM) that will actually parse and extract the information from resumes.

# initialize the model
model = init_chat_model(
    model='gpt-4o-mini',
    model_provider='openai'
).with_structured_output(Resume, method="function_calling")
  • init_chat_model() is a LangChain function that initializes a chat-based large language model.

    • In this case, we’re using OpenAI’s GPT-4o-mini, a lightweight yet capable model that balances performance and cost — perfect for structured text extraction tasks like resume parsing.

  • The .with_structured_output(Resume, method="function_calling") method ensures that the model’s response is not just plain text but structured according to our Resume Pydantic schema.

    • By using the "function_calling" method, we’re telling the model to generate outputs in a machine-readable format (like JSON) that directly maps to our defined fields — such as name, education, experience, and skills.
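With function calling, LangChain derives a JSON schema from our Pydantic class and asks the model to fill it in rather than write free text. You can inspect that schema yourself; the sketch below uses a trimmed-down version of the Resume class for brevity (assuming Pydantic v2, which provides model_json_schema):

```python
from typing import Optional
from pydantic import BaseModel, Field

# Trimmed-down version of the Resume class, for illustration only
class Resume(BaseModel):
    name: str = Field(..., description='Full name of the candidate')
    age: Optional[int] = Field(None, ge=0, description='Age of the candidate')
    email: str = Field(..., description='Email Address')

schema = Resume.model_json_schema()
print(sorted(schema['properties']))  # field names the model is asked to fill
print(schema['required'])            # fields the model must always provide
```

The Field descriptions end up in this schema too, which is why writing clear descriptions directly improves extraction quality.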



Loading and Preparing the Resume Document


With our model ready, the next step is to load an actual resume and prepare its content for AI processing. Since most resumes are submitted as PDFs, we’ll use LangChain’s PyPDFLoader to extract text directly from the file.

# load the document
file_path = 'data/johndoe_resume.pdf'
loader = PyPDFLoader(file_path)

docs = loader.load()

resume_text = "\n".join([doc.page_content for doc in docs])
len(resume_text)
  • file_path specifies the location of the PDF resume we want to parse. In this example, it’s data/johndoe_resume.pdf.

  • PyPDFLoader(file_path) creates a loader object that can read and extract text from the PDF.

  • loader.load() processes the file and returns a list of document objects — one for each page of the resume.

  • "\n".join([doc.page_content for doc in docs]) combines the text content from all pages into a single string called resume_text. This ensures the AI model receives the complete resume in one go.

  • Finally, len(resume_text) simply checks the length of the extracted text — a quick sanity check to confirm that the content was loaded successfully.
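PDF extraction can introduce stray whitespace, and image-only (scanned) PDFs may yield almost no text at all. A small stdlib helper can normalize the pages and fail fast on empty extractions; the sample pages list below is a made-up stand-in for real PyPDFLoader output:

```python
import re

def clean_resume_text(pages, min_chars=50):
    """Join page texts, collapse whitespace runs, and fail fast on near-empty PDFs."""
    text = "\n".join(pages)
    text = re.sub(r'[ \t]+', ' ', text)      # collapse runs of spaces/tabs
    text = re.sub(r'\n{3,}', '\n\n', text)   # collapse runs of blank lines
    text = text.strip()
    if len(text) < min_chars:
        raise ValueError('Extracted text is suspiciously short - is this a scanned PDF?')
    return text

# Stand-in for [doc.page_content for doc in docs]
pages = ["John  Doe\n\n\n\nSoftware   Engineer, ten years of experience building data tools",
         "Skills: Python, SQL"]
resume_text = clean_resume_text(pages)
print(len(resume_text))
```

Catching an empty extraction here is much cheaper than sending a blank prompt to the model and getting back a hallucinated candidate.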



Inspecting the Extracted Resume Text


Before passing the text to our AI model, it’s always a good idea to verify that the PDF content was extracted correctly. We can do this by printing the text to the console.

print(resume_text)

By printing it out, you can:

  • Confirm that the PDF text has been successfully extracted.

  • Check for any formatting issues or unwanted characters.

  • Ensure that the text includes all sections of the resume — such as the candidate’s name, education, experience, and skills.

This quick check helps prevent errors later in the pipeline. If the printed text looks incomplete or garbled, you can revisit the PDF loader or preprocess the file before sending it to the model.



Running the Resume Parsing Application


Now that we have our prompt, model, and extracted resume text ready, it’s time to bring everything together and let the AI do its work.

# call the application
prompt = prompt_template.invoke({'resume_text': resume_text})
response = model.invoke(prompt)
response.model_dump()
  • prompt = prompt_template.invoke({'resume_text': resume_text}) – This line fills the {resume_text} placeholder in our PromptTemplate with the actual text extracted from the resume. The result is a complete, ready-to-send prompt that instructs the model exactly what to extract.

  • response = model.invoke(prompt) – This sends the filled prompt to our initialized GPT-4o-mini model. The model reads through the resume text, identifies key details like name, email, education, and experience, and returns a structured response that matches the schema we defined earlier with Pydantic.

  • response.model_dump() – Finally, this method converts the model’s structured output into a plain Python dictionary (a JSON-like format). It makes the extracted data easy to read, manipulate, or store in a database or API.
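From here, persisting the result is straightforward. A sketch using a hypothetical dictionary standing in for an actual response.model_dump() (all candidate details below are made up, with field names matching our Resume schema):

```python
import json

# Hypothetical stand-in for response.model_dump()
parsed = {
    'name': 'John Doe',
    'age': None,
    'email': 'john@example.com',
    'phone_number': '+1-555-0100',
    'experience': [{'company_name': 'Acme Corp', 'n_years': 3,
                    'project_name': 'Data Pipeline',
                    'project_description': 'Built ETL jobs',
                    'tech_stack': 'Python, Airflow'}],
    'education': [{'university_name': 'Example University',
                   'degree': 'B.Sc. Computer Science', 'gpa': 8.5}],
    'languages': 'English, Spanish',
}

# Serialize for storage in a file, database column, or API payload
record = json.dumps(parsed, indent=2)

# Round-trip back to a dict when you need to query it
restored = json.loads(record)
print(restored['experience'][0]['company_name'])
```

Because every parsed resume shares the same schema, these records can be loaded directly into an HR database or search index without per-resume cleanup.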



Final Thoughts

  • In this tutorial, we built a complete AI-powered Resume Parsing pipeline using LangChain and OpenAI’s GPT models. Starting from PDF loading to schema definition, prompt creation, model initialization, and structured output extraction, we’ve seen how each component plays a crucial role in transforming unstructured resume text into clean, organized data.

  • By combining LangChain’s prompt management and output parsing with the intelligence of large language models, we can automate one of the most time-consuming parts of recruitment — extracting key details from resumes with high accuracy and consistency.


This approach not only saves time but also creates opportunities for smarter HR analytics, candidate matching, and talent insights.



Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
