Building GreenGovRAG

This content originally appeared on DEV Community and was authored by Sundeep Anand

An Open Source AI Assistant for Australian Environmental Compliance

If you've ever tried to navigate environmental regulations in Australia, you'll know the pain: federal EPBC Act, state EPAs, local council planning schemes, emissions reporting frameworks - all scattered across PDFs, government portals, and legislation websites.

I'm building GreenGovRAG to change that.

The Problem: Fragmented Regulatory Knowledge

Environmental consultants, planners, and ESG analysts spend days or weeks searching for answers to questions like:

Does this wind farm project need federal EPBC Act approval?
What's the biodiversity offset policy for councils in South Australia?
Can I clear native vegetation near Murray Bridge, SA?
How do I report Scope 3 emissions in Victoria?

The information exists - it's public, it's authoritative - but it's impossibly fragmented. Government portals are siloed by jurisdiction. LexisNexis is expensive ($10k-100k/year). General-purpose chatbots like ChatGPT can hallucinate and rarely give precise citations to the underlying legislation.

There had to be a better way.

The Solution: RAG + Geospatial Intelligence

GreenGovRAG is a Retrieval-Augmented Generation (RAG) system purpose-built for Australian environmental and planning regulations.

Core Features

Natural Language Queries
Ask questions like a human, get answers with verifiable citations to official sources.

Location-Aware Filtering
Filter by state, LGA (Local Government Area), or region using geospatial intelligence.

Hybrid Search
Combines BM25 keyword matching with vector similarity for precise retrieval.

Multi-Jurisdictional Coverage
Federal (EPBC Act), State (SA/NSW/VIC legislation), Local (council planning schemes), and Emissions (CER, NGER) in one unified system.

Multi-LLM Support
Works with OpenAI (GPT-4o, GPT-4o-mini), Anthropic (Claude), AWS Bedrock, and Azure OpenAI.

Cloud-Agnostic Deployment
Deploy on AWS, Azure, or run locally with Docker.

Technical Architecture: A Production RAG System

This isn't a tutorial project. It's a production-ready system with real ETL pipelines, monitoring, and cloud deployment. Here's how it works:

1. Document Ingestion & ETL

Sources:

Federal: EPBC Act (environment.gov.au)
State: SA/NSW/VIC legislation, EPA guidelines
Local: Council planning schemes
Emissions: CER emissions data, NGER reports

Pipeline:

Development: Apache Airflow (local UI for testing)
Production: GitHub Actions (scheduled daily runs)

Processing:

PDF parsing (PyMuPDF, layout-aware chunking)
HTML scraping (BeautifulSoup)
Metadata tagging with LLM (auto-extracts jurisdiction, topics, regulatory hierarchy)
Storage: PostgreSQL with pgvector, AWS S3/Azure Blob

2. Text Chunking & Embeddings

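# Semantic chunking with regulatory context preservation
CHUNK_SIZE_TOKENS = (500, 1000)     # target chunk size range, in tokens
CHUNK_OVERLAP_TOKENS = (100, 200)   # overlap range between neighbouring chunks, in tokens
# Note: all-MiniLM-L6-v2 truncates input longer than 256 word pieces by default,
# so chunks at these sizes are only partially embedded unless the model is swapped.
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"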

Each chunk preserves:

Jurisdiction (Federal/State/Local)
Document type (Legislation, Guideline, Planning Scheme)
LGA (Local Government Area)
Regulatory hierarchy (Act → Regulation → Policy)
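To make the overlap and metadata-preservation idea concrete, here is a minimal sketch of a sliding-window chunker. It is not the project's implementation: `Chunk` and `chunk_tokens` are hypothetical names, and it splits on an already-tokenized list rather than using the embedding model's tokenizer.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_tokens(tokens: list[str], size: int, overlap: int, metadata: dict) -> list[Chunk]:
    """Split tokens into overlapping windows, copying the metadata onto each chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        # Each chunk gets its own copy so later edits do not leak between chunks.
        chunks.append(Chunk(" ".join(window), dict(metadata)))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


example = chunk_tokens(
    [str(i) for i in range(10)],
    size=4,
    overlap=1,
    metadata={"jurisdiction": "State", "document_type": "Guideline"},
)
```

With ten tokens, a window of four, and an overlap of one, this yields three chunks, and every chunk carries the jurisdiction and document type of its parent document.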

3. Vector Store & Hybrid Search

Vector Stores:

FAISS (local development)
Qdrant (production - faster for large datasets)

Search Strategy:

# Hybrid search: BM25 + vector similarity
from typing import Optional

def search(query: str, lga_name: Optional[str] = None):
    # 1. BM25 keyword search (handles exact terminology)
    keyword_results = bm25_search(query, top_k=20)

    # 2. Vector similarity search (semantic matching)
    # embed_query is assumed here: any helper that embeds text with the configured model
    query_embedding = embed_query(query)
    vector_results = vector_store.similarity_search(
        query_embedding,
        top_k=20,
        filter={"lga": lga_name} if lga_name else None
    )

    # 3. Combine with reciprocal rank fusion
    merged_results = reciprocal_rank_fusion(
        keyword_results,
        vector_results
    )

    # Return the five highest-ranked results
    return merged_results[:5]
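The `reciprocal_rank_fusion` step is not shown in the snippet above. One common formulation scores each document as the sum of `1 / (k + rank)` over every ranking it appears in. The sketch below assumes rankings are plain lists of document IDs and uses `k = 60`, a widely used default; the project's own version may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings: list[str], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list contribute the most.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion(keyword_hits, vector_hits)
```

Documents that appear in both lists ("a" and "b" here) outrank documents found by only one retriever, which is the property that makes this fusion useful for hybrid search.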

4. Geospatial Intelligence

Location NER (Named Entity Recognition):

# Extract Australian locations from queries
query = "Can I clear vegetation near Murray Bridge, SA?"
locations = location_ner.extract(query)
# => [{"text": "Murray Bridge", "state": "SA", "lga": "Rural City of Murray Bridge"}]

LGA Filtering:

GeoJSON boundaries for all 537 Australian LGAs
Buffer zone queries (planned feature)
Spatial intersection for multi-LGA queries

5. Response Generation with Trust Scoring

# Response enhancement with citation verification
def generate_response(query: str, retrieved_docs: list):
    # 1. LLM generates initial response
    response = llm.generate(query, context=retrieved_docs)

    # 2. Trust score calculation
    trust_score = calculate_trust_score(
        response,
        retrieved_docs,
        factors=[
            "citation_precision",      # Citations match sources?
            "regulatory_hierarchy",    # Cites appropriate authority level?
            "recency",                 # Documents up-to-date?
            "jurisdiction_alignment"   # Correct jurisdiction?
        ]
    )

    # 3. Citation verification
    verified_citations = verify_citations(response, retrieved_docs)

    return {
        "answer": response,
        "trust_score": trust_score,
        "citations": verified_citations,
        "sources": retrieved_docs
    }

Tech Stack

Backend:

Python 3.12
FastAPI (async API with auto-docs)
SQLModel (type-safe ORM)
LangChain (RAG orchestration)

Database:

PostgreSQL with pgvector

Vector Stores:

FAISS (development)
Qdrant (production)

Embeddings:

sentence-transformers/all-MiniLM-L6-v2 (default, configurable)

LLM Providers:

OpenAI (GPT-4o, GPT-4o-mini)
Anthropic (Claude 3)
AWS Bedrock
Azure OpenAI

Frontend:

React + TypeScript (work in progress)

Deployment:

Docker + Docker Compose
AWS: ECS Fargate, RDS PostgreSQL, EC2 Spot (Qdrant), CloudFront, S3
Azure: Container Apps, PostgreSQL Flexible Server, Blob Storage

Deployment Architecture (AWS)

┌─────────────────┐
│   CloudFront    │  (CDN + frontend)
└────────┬────────┘
         │
┌────────▼────────┐
│  API Gateway    │  (HTTP API)
└────────┬────────┘
         │
┌────────▼────────┐
│   ECS Fargate   │  (Backend API)
│   (2 tasks)     │
└────┬─────┬──────┘
     │     │
     │     └──────────┐
     │                │
┌────▼─────┐   ┌─────▼────────┐
│   RDS    │   │  EC2 Spot    │
│PostgreSQL│   │   (Qdrant)   │
└──────────┘   └──────────────┘
     │
┌────▼─────┐
│    S3    │  (Document storage)
└──────────┘

Plugin Architecture: Easy Contributions

One of the best parts of this project is the plugin system for document sources. Adding a new regulation is straightforward:

Example: Adding QLD Vegetation Management Guidelines

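# backend/green_gov_rag/etl/sources/qld_vegetation.py
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

from green_gov_rag.etl.sources.base import BaseDocumentSource
# Document is the project's own document model; its import is omitted here.

class QLDVegetationScraper(BaseDocumentSource):
    def fetch_documents(self) -> list[Document]:
        # Scrape QLD government portal
        response = requests.get(self.config["url"], timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors
        soup = BeautifulSoup(response.content, "html.parser")

        docs = []
        # href=True skips anchors that have no link target
        for link in soup.find_all("a", class_="document-link", href=True):
            # Resolve relative links against the portal URL
            url = urljoin(self.config["url"], link["href"])
            doc = Document(
                content=self.extract_text(url),
                metadata={
                    "source": "QLD Vegetation Management",
                    "jurisdiction": "State",
                    "state": "QLD",
                    "url": url
                }
            )
            docs.append(doc)

        return docs

    def validate_config(self) -> None:
        required = ["url"]
        # Report only the keys that are actually missing
        missing = [k for k in required if k not in self.config]
        if missing:
            raise ValueError(f"Missing config: {missing}")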

Register in config:

# backend/configs/documents_config.yml
sources:
  - type: qld_vegetation
    enabled: true
    config:
      url: https://www.qld.gov.au/environment/vegetation-management

That's it! The ETL pipeline auto-discovers the plugin and starts ingesting documents.
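The post does not show how auto-discovery works internally. One common pattern is a registry keyed by the `type` string from the config, sketched below. Everything here is hypothetical (`SOURCE_REGISTRY`, `register_source`, `ExampleSource`, `build_enabled_sources`), and the config entries are written as Python dictionaries standing in for the parsed YAML.

```python
from typing import Callable

# Maps a config "type" string to the class that handles it.
SOURCE_REGISTRY: dict[str, type] = {}


def register_source(type_name: str) -> Callable[[type], type]:
    """Class decorator that records a source plugin under its config type."""
    def decorator(cls: type) -> type:
        if type_name in SOURCE_REGISTRY:
            raise ValueError(f"duplicate source type: {type_name}")
        SOURCE_REGISTRY[type_name] = cls
        return cls
    return decorator


@register_source("example_source")
class ExampleSource:
    """Stand-in plugin; a real one would subclass the project's base class."""

    def __init__(self, config: dict):
        self.config = config

    def fetch_documents(self) -> list[str]:
        return [f"stub document from {self.config['url']}"]


def build_enabled_sources(entries: list[dict]) -> list:
    """Instantiate a plugin for every enabled entry in the parsed config."""
    sources = []
    for entry in entries:
        if not entry.get("enabled", False):
            continue  # disabled sources are skipped entirely
        cls = SOURCE_REGISTRY.get(entry["type"])
        if cls is None:
            raise KeyError(f"no plugin registered for type {entry['type']!r}")
        sources.append(cls(entry.get("config", {})))
    return sources


entries = [
    {"type": "example_source", "enabled": True, "config": {"url": "https://example.org"}},
    {"type": "example_source", "enabled": False, "config": {"url": "https://example.org/old"}},
]
sources = build_enabled_sources(entries)
```

With a pattern like this, adding a source means writing one class and one config entry, without touching the pipeline code.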

Real-World Use Cases

1. Environmental Impact Assessment Pre-screening

User: Environmental consultant
Query: "Do I need an environmental impact assessment to build a solar farm in regional NSW?"

GreenGovRAG Output:

Summarizes relevant sections from NSW planning portal and EPBC Act
Explains exemption criteria and thresholds
Provides citations to official sources

2. Native Vegetation Clearing Rules

User: Landowner in rural SA
Query: "Can I clear native vegetation near Murray Bridge, SA?"

GreenGovRAG Output:

Retrieves SA Government vegetation clearance policies
Filters to Rural City of Murray Bridge LGA
Returns allowed/disallowed activities and buffer zones

3. Emissions Reporting Compliance

User: Sustainability advisor
Query: "Which emissions standards apply to industrial zones in Greater Sydney?"

GreenGovRAG Output:

Points to NSW EPA and federal requirements
Suggests offsets or sustainable alternatives
Links to energy incentive schemes

Challenges & Lessons Learned

1. PDF Parsing is Hard

Government PDFs come in every flavor: scanned images, multi-column layouts, tables, footnotes. I tried multiple libraries:

PyMuPDF: Fast but struggles with complex layouts
pdfplumber: Better table extraction but slower
llmsherpa (LayoutPDFReader): Best for hierarchical documents but requires external service

Solution: Hybrid approach - detect layout type, route to appropriate parser.
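The routing step might look something like the sketch below. The `PageStats` fields, the thresholds, and the rule order are all assumptions for illustration, and the `"needs_ocr"` route is an added placeholder for scanned pages, since none of the three parsers listed reads image-only PDFs on its own.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Cheap per-document measurements gathered before full parsing (hypothetical)."""
    text_chars: int     # characters of extractable text
    image_count: int    # embedded images
    table_count: int    # detected tables
    column_count: int   # estimated text columns


def choose_parser(stats: PageStats) -> str:
    """Pick a parser name based on the detected layout."""
    if stats.text_chars < 50 and stats.image_count > 0:
        return "needs_ocr"    # almost no text but images present: likely a scan
    if stats.table_count > 0:
        return "pdfplumber"   # better table extraction, at the cost of speed
    if stats.column_count > 1:
        return "llmsherpa"    # hierarchical or multi-column documents
    return "pymupdf"          # fast path for simple single-column text


choice = choose_parser(PageStats(text_chars=4000, image_count=0, table_count=2, column_count=1))
```

Ordering the rules from most to least specialized keeps the fast parser as the default and reserves the slower ones for documents that need them.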

2. Metadata is Everything

Early versions chunked documents naively (500 tokens, no context). This led to:

Chunks without jurisdiction information
Federal rules mixed with local bylaws
No regulatory hierarchy

Solution: LLM-based metadata tagging during ingestion:

metadata = llm.extract_metadata(chunk, schema={
    "jurisdiction": ["Federal", "State", "Local"],
    "state": ["NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT"],
    "lga": str,
    "document_type": ["Act", "Regulation", "Policy", "Guideline"],
    "topics": list[str]
})
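Because LLM output can drift from the requested schema, it helps to validate the returned metadata before storing it. The sketch below checks a metadata dictionary against the allowed values listed above. `validate_metadata` is a hypothetical helper, and treating `state` and `lga` as optional is an assumption (a federal Act, for example, has neither).

```python
ALLOWED_VALUES = {
    "jurisdiction": {"Federal", "State", "Local"},
    "state": {"NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT"},
    "document_type": {"Act", "Regulation", "Policy", "Guideline"},
}
# Assumption: these fields may be None, e.g. federal documents have no state or LGA.
NULLABLE_FIELDS = {"state", "lga"}


def validate_metadata(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata looks valid."""
    problems = []
    for key, allowed in ALLOWED_VALUES.items():
        value = metadata.get(key)
        if value is None and key in NULLABLE_FIELDS:
            continue
        if not (isinstance(value, str) and value in allowed):
            problems.append(f"{key}: unexpected value {value!r}")
    lga = metadata.get("lga")
    if lga is not None and not isinstance(lga, str):
        problems.append("lga: expected a string or None")
    topics = metadata.get("topics")
    if not (isinstance(topics, list) and all(isinstance(t, str) for t in topics)):
        problems.append("topics: expected a list of strings")
    return problems


good = {
    "jurisdiction": "State",
    "state": "SA",
    "lga": "Rural City of Murray Bridge",
    "document_type": "Guideline",
    "topics": ["native vegetation"],
}
issues = validate_metadata(good)
```

Chunks that fail validation could be re-tagged or flagged for review rather than indexed with misleading jurisdiction data.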

3. Trust Scores Matter

Users (especially consultants) need to trust the answers for compliance reporting. I implemented trust scoring based on:

Citation precision (does the answer match the source?)
Regulatory hierarchy (citing Acts over guidelines)
Recency (prioritize recent amendments)
Jurisdiction alignment (federal > state > local)
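As a narrow illustration of the citation-precision factor, the sketch below checks that every citation marker in an answer points at a source that was actually retrieved. It assumes answers cite sources with numbered markers such as `[1]`, which is a hypothetical format, and it does not check whether the cited passage supports the claim; that would need a separate semantic comparison.

```python
import re


def check_citation_markers(answer: str, num_sources: int) -> dict:
    """Classify numbered citation markers as valid or dangling."""
    cited = {int(match) for match in re.findall(r"\[(\d+)\]", answer)}
    valid = sorted(c for c in cited if 1 <= c <= num_sources)
    invalid = sorted(c for c in cited if not 1 <= c <= num_sources)
    # Share of distinct markers that resolve to a retrieved source.
    precision = len(valid) / len(cited) if cited else 0.0
    return {"valid": valid, "invalid": invalid, "precision": precision}


report = check_citation_markers(
    "Approval is required [1]. Offsets apply [3]. See also [1].",
    num_sources=2,
)
```

Here marker `[3]` has no matching source among the two retrieved documents, so half of the distinct citations are dangling and the answer would score poorly on this factor.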

4. GitHub Actions > Airflow for Production ETL

Initially used Airflow for all ETL. It's great for local development (nice UI), but overkill for production:

Resource-heavy (needs separate EC2 instance)
Complex deployment
Over-engineered for daily batch jobs

Switched to GitHub Actions for production:

Scheduled workflow (runs daily at 2 AM UTC)
No infrastructure cost
Easier to maintain

Kept Airflow for local development only (Docker Compose with --profile dev).

What's Next?

Immediate priorities:

Complete React frontend (query interface, LGA map selection)
Expand coverage to all Australian states/territories
Real-time regulatory change monitoring (web scraping + notifications)
Interactive compliance checklist generator

Future enhancements:

User authentication (OAuth2)
Parcel-level geospatial queries (buffer zones, overlays)
Multilingual support (Mandarin, Arabic for multicultural communities)
Integration with government APIs (planning portals, LGA systems)

How You Can Help

GreenGovRAG is open source and community-driven. Ways to contribute:

For Developers

Add new document source plugins (VIC, QLD, WA, TAS, NT regulations)
Improve frontend UI/UX (React, TypeScript)
Write integration tests
Optimize vector search performance

For Domain Experts

Validate query results and provide feedback
Add your state/council's regulations to the ETL pipeline
Suggest new use cases and features

For Advocates

Share with planners, ESG analysts, and researchers
Present at meetups or conferences
Provide feedback via GitHub issues

GitHub Repo: github.com/sdp5/green-gov-rag

Try It Out

Using Docker (Recommended)

git clone https://github.com/sdp5/green-gov-rag.git
cd green-gov-rag/deploy/docker
cp .env.example .env
# Edit .env with your API keys (OpenAI, Anthropic, or AWS Bedrock)
docker compose up -d

Access:

Backend API: http://localhost:8000/docs
Frontend: http://localhost:3000 (WIP)

Query Example (via API)

curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Do I need an EIA for a solar farm in regional NSW?",
    "lga_name": "Dubbo Regional"
  }'

Final Thoughts

Building GreenGovRAG has been a journey in production RAG engineering: from ETL pipelines that don't break, to trust scoring systems that consultants can rely on, to geospatial filtering that makes sense for Australian regulations.

The goal isn't to replace human expertise - it's to make that expertise more efficient. Environmental consultants still need to verify answers, but now they can do in seconds what used to take days.

If you're working on RAG systems, navigating regulatory compliance, or just interested in civic tech - I'd love to hear from you.

Let's make environmental compliance faster, smarter, and more accessible. 🌏
