Building a PDF Parser for HDFC Bank Statements: From 165 Pages to CSV in Minutes

This content originally appeared on DEV Community and was authored by Vishwaraja Pathi (Vishwa)

🚀 GitHub Repository | ⭐ Star it if you find it useful!

The Problem That Started It All

Picture this: You're an auditor, accountant, or financial analyst staring at a 165-page HDFC Bank statement with 3,602 transactions that need to be converted to CSV format. The manual process would take days, and the risk of errors is enormous.

That's exactly the challenge I faced recently, and it led me to build an open-source solution that I'm excited to share with the community.

The Solution: HDFC PDF to CSV Converter

I created a Python tool that automatically extracts all transactions from HDFC Bank PDF statements and converts them to CSV format with intelligent categorization. Here's what it accomplishes:

  • All 165 pages of the 165-page test statement extracted (100%)
  • 3,602 transactions processed automatically
  • 22 automatic categories (UPI, Foreign Exchange, Salary, etc.)
  • Multi-line narration support for complex transactions
  • Multiple output formats (CSV, Excel, Markdown)
  • Command-line interface for easy automation

Quick Start

# Clone the repository
git clone https://github.com/vishwaraja/hdfc-pdf-converter.git
cd hdfc-pdf-converter

# Install dependencies
pip install -r requirements.txt

# Convert your first PDF (creates ./results/ directory automatically)
python src/hdfc_converter.py your_statement.pdf

Technical Deep Dive

The Tech Stack

# Core dependencies
camelot-py[cv]  # PDF table extraction
pandas          # Data manipulation
PyPDF2          # PDF processing
pdfplumber      # Text extraction

The Challenge: Multi-line Narrations

One of the biggest challenges was handling transactions where the narration spans multiple lines. Here's how I solved it:

def _parse_transaction_row(self, row, page_num):
    """Parse a single transaction row with multi-line support."""
    # Handle multi-line narrations
    narration_parts = []

    # Everything between date and amounts is narration
    narration_start = 1
    narration_end = len(row) - 5

    for i in range(narration_start, narration_end):
        part = str(row.iloc[i]).strip()
        if part and part != 'nan':
            narration_parts.append(part)

    narration = ' '.join(narration_parts)
    return narration
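
The function above joins narration fragments that were split across columns of a single row. A narration can also wrap onto extra rows. Here is a minimal sketch of one way to fold such continuation rows into the preceding transaction, assuming a continuation row has an empty date cell. This is an illustration, not the repository's code: the name `merge_continuation_rows` and the list-of-lists row layout are hypothetical.

```python
def merge_continuation_rows(rows):
    """Fold rows with an empty date cell into the previous transaction.

    Assumed layout (hypothetical): row[0] is the date, row[1] the narration,
    and any remaining cells are amounts or balances.
    """
    merged = []
    for row in rows:
        date = (row[0] or "").strip()
        narration = (row[1] or "").strip()
        if date:
            # A non-empty date starts a new transaction.
            merged.append([date, narration, *row[2:]])
        elif merged and narration:
            # No date: treat the text as a continuation of the last narration.
            merged[-1][1] = f"{merged[-1][1]} {narration}".strip()
    return merged


rows = [
    ["15/07/2020", "UPI payment to", "150.00", ""],
    ["", "merchant", "", ""],
    ["16/07/2020", "Salary credit from company", "", "25000.00"],
]
result = merge_continuation_rows(rows)
```

Rows that have neither a date nor narration text are dropped, which also discards blank spacer rows that table extraction sometimes produces.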

Intelligent Categorization

The tool automatically assigns each transaction to one of 22 categories by matching keywords in the narration. The excerpt below shows three of the rules:

def categorize_transaction(narration):
    narration_lower = str(narration).lower()

    if any(word in narration_lower for word in ['salary', 'payroll', 'betterplace']):
        return 'Salary & Employment'
    elif any(word in narration_lower for word in ['foreign', 'usd', 'eur', 'gbp']):
        return 'Foreign Exchange'
    elif any(word in narration_lower for word in ['upi']):
        return 'UPI Payments'
    # ... and 19 more categories
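
As the number of categories grows, a chain of `elif` branches gets hard to maintain. One possible alternative is a table of rules, sketched below under some assumptions: only the three categories shown above are listed, the `Uncategorized` fallback label is hypothetical, and the word-boundary matching is a variation on the plain substring test used above, not the tool's behavior.

```python
import re

# Ordered (category, keywords) rules; the first match wins.
# Only the three categories shown in the article are listed here.
RULES = [
    ("Salary & Employment", ["salary", "payroll", "betterplace"]),
    ("Foreign Exchange", ["foreign", "usd", "eur", "gbp"]),
    ("UPI Payments", ["upi"]),
]

# Hypothetical fallback label; the tool's real default may differ.
FALLBACK = "Uncategorized"


def categorize(narration, rules=RULES):
    """Return the first category whose keyword appears as a whole word."""
    text = str(narration).lower()
    for category, keywords in rules:
        # \b keeps "eur" from matching inside e.g. "entrepreneur".
        pattern = r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b"
        if re.search(pattern, text):
            return category
    return FALLBACK
```

Whole-word matching is a trade-off: it avoids false hits on short keywords, but it also misses a keyword that is glued to digits or letters, such as `USD100`.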

Real Results

Here's what the tool achieved with my 165-page statement:

Metric                | Result
Total Transactions    | 3,602
Pages Processed       | 165/165 (100%)
Extraction Time       | ~2 minutes
Categories Identified | 22
Data Quality          | 100% valid dates
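
A figure like "100% valid dates" implies that every parsed date is checked. A minimal sketch of such a check, assuming the DD/MM/YYYY format used in the tool's CSV output; the helper names `is_valid_date` and `valid_date_ratio` are hypothetical and not taken from the repository.

```python
from datetime import datetime


def is_valid_date(value, fmt="%d/%m/%Y"):
    """Return True if value parses as a real calendar date in fmt."""
    try:
        datetime.strptime(value.strip(), fmt)
        return True
    except (ValueError, AttributeError):
        # ValueError: wrong format or impossible date; AttributeError: not a string.
        return False


def valid_date_ratio(dates):
    """Fraction of entries that are valid dates (0.0 for an empty list)."""
    if not dates:
        return 0.0
    return sum(is_valid_date(d) for d in dates) / len(dates)
```

Because `strptime` rejects impossible dates such as 31/02/2020, this catches extraction errors that a pattern match on digits and slashes would let through.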

Sample Output

Date,Narration,Category,Withdrawal_Amount,Deposit_Amount
15/07/2020,UPI payment to merchant,UPI Payments,150.00,0.00
16/07/2020,Salary credit from company,Salary & Employment,0.00,25000.00
17/07/2020,Foreign remittance from USA,Foreign Exchange,0.00,50000.00
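
Once the data is in this shape, it can be summarized with the standard library alone. The sketch below totals withdrawals and deposits per category, using the three sample rows above as input. The function name `totals_by_category` is hypothetical, and `Decimal` is used here so that currency amounts are not subject to floating-point rounding.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

SAMPLE = """\
Date,Narration,Category,Withdrawal_Amount,Deposit_Amount
15/07/2020,UPI payment to merchant,UPI Payments,150.00,0.00
16/07/2020,Salary credit from company,Salary & Employment,0.00,25000.00
17/07/2020,Foreign remittance from USA,Foreign Exchange,0.00,50000.00
"""


def totals_by_category(csv_file):
    """Sum withdrawals and deposits per category."""
    totals = defaultdict(
        lambda: {"withdrawn": Decimal("0"), "deposited": Decimal("0")}
    )
    for row in csv.DictReader(csv_file):
        bucket = totals[row["Category"]]
        bucket["withdrawn"] += Decimal(row["Withdrawal_Amount"])
        bucket["deposited"] += Decimal(row["Deposit_Amount"])
    return dict(totals)


totals = totals_by_category(io.StringIO(SAMPLE))
```

To run it on a real export, pass an open file handle for the generated CSV in place of the `io.StringIO` wrapper.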

Usage Examples

Command Line Interface

# Basic usage (creates ./results/ directory automatically)
python src/hdfc_converter.py statement.pdf

# Custom output directory
python src/hdfc_converter.py statement.pdf --output-dir ./my_results

# Verbose logging for debugging
python src/hdfc_converter.py statement.pdf --verbose

# Convert PDF from different directory
python src/hdfc_converter.py /path/to/statements/hdfc_2024.pdf
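
A command line with this shape can be wired up with `argparse`. The sketch below is an assumption about how such an interface might look, not the repository's actual argument handling: the flags `--output-dir` and `--verbose` come from the examples above, while the positional name `pdf_path` and the function `build_parser` are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the command-line examples above."""
    parser = argparse.ArgumentParser(
        description="Convert an HDFC Bank PDF statement to CSV."
    )
    parser.add_argument("pdf_path", help="Path to the PDF statement")
    parser.add_argument(
        "--output-dir",
        default="./results",
        help="Directory for generated files (default: ./results)",
    )
    parser.add_argument(
        "--verbose", action="store_true", help="Enable debug logging"
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["statement.pdf", "--output-dir", "./my_results"]
)
```

`argparse` exposes `--output-dir` as `args.output_dir`, so the value can be passed straight to a converter's `output_dir` parameter.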

Programmatic API

from src.hdfc_converter import HDFCConverter

# Initialize converter
converter = HDFCConverter('statement.pdf', output_dir='./results')

# Convert PDF to CSV
success = converter.convert()

if success:
    print("✅ Conversion completed successfully!")

The Impact

This tool has already saved me hours of manual work and eliminated the risk of transcription errors. But more importantly, it's now available as an open-source solution for the entire community.

Key Benefits for Users:

  • Auditors: Quick conversion of bank statements for analysis
  • Accountants: Automated data entry from PDF statements
  • Fintech Developers: Foundation for building banking tools
  • Data Analysts: Clean CSV data for financial analysis

Open Source and Community

I've made this tool completely open source with:

  • 📚 Comprehensive documentation
  • 🧪 Unit tests and examples
  • 🤝 Contribution guidelines
  • 📋 Issue templates and PR templates
  • 🔄 CI/CD pipeline

🔗 Repository: https://github.com/vishwaraja/hdfc-pdf-converter

What's Next?

I'm excited to see how the community will use and improve this tool. Some potential enhancements:

  • Support for other bank PDF formats
  • A graphical interface (GUI) for non-technical users
  • Cloud processing capabilities
  • Advanced filtering and search features

Lessons Learned

Building this tool taught me several valuable lessons:

  1. PDF parsing is complex - Different banks use different formats
  2. Multi-line data is tricky - Requires careful parsing logic
  3. Categorization needs intelligence - Simple regex isn't enough
  4. Documentation is crucial - Makes tools accessible to others
  5. Open source is powerful - Community feedback improves everything

Get Started

Ready to try it out? Here's how to get started:

# Clone the repository
git clone https://github.com/vishwaraja/hdfc-pdf-converter.git
cd hdfc-pdf-converter

# Install dependencies
pip install -r requirements.txt

# Convert your first PDF
python src/hdfc_converter.py your_statement.pdf

Conclusion

What started as a personal problem-solving exercise became a tool that could benefit the entire developer and financial community. This is the power of open source - turning individual solutions into community resources.

I'd love to hear your thoughts, suggestions, and use cases. Have you faced similar challenges with PDF processing? What other banking tools would be useful to the community?

Have questions about PDF parsing or want to contribute to the project? Leave a comment below - I'd love to discuss!



