This content originally appeared on DEV Community and was authored by Linghua
My friend and I have built an open-source ETL framework (CocoIndex) to prepare data for RAG.
🔥 Features:
- Data flow programming
- Custom logic support: you can plug in your own choice of chunking, embedding, and vector stores, and compose your own logic like Lego. We have three examples in the repo for now. In the long run, we also want to support deduplication, reconciliation, etc.
- Incremental updates: we provide state management out of the box to minimize recomputation. Right now, it checks whether a file from a data source has been updated. In the future, it will work at a finer granularity, e.g., at the chunk level.
- Python SDK (Rust core 🦀 with Python bindings 🐍)
- GitHub repo: CocoIndex. We'd appreciate your support with a GitHub star ⭐!
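
To make the pluggable-step and file-level incremental ideas from the list above concrete, here is a minimal sketch in plain Python. It is not CocoIndex's API: `IncrementalIndexer`, `fixed_size_chunker`, and `toy_embedder` are hypothetical names made up for illustration, the "vector store" is just a dict, and change detection is assumed to be a content hash per file.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# Hypothetical signatures for the pluggable steps (not CocoIndex's API).
Chunker = Callable[[str], List[str]]
Embedder = Callable[[str], List[float]]


def fixed_size_chunker(text: str, size: int = 20) -> List[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def toy_embedder(chunk: str) -> List[float]:
    """Stand-in embedding (length, vowel count); swap in a real model."""
    vowels = sum(c in "aeiou" for c in chunk.lower())
    return [float(len(chunk)), float(vowels)]


class IncrementalIndexer:
    """Re-processes a file only when its content hash has changed."""

    def __init__(self, chunker: Chunker, embedder: Embedder) -> None:
        self.chunker = chunker
        self.embedder = embedder
        self.fingerprints: Dict[str, str] = {}  # file name -> content hash
        self.index: Dict[str, List[Tuple[str, List[float]]]] = {}

    def update(self, files: Dict[str, str]) -> List[str]:
        """Process new or changed files and return their names."""
        reprocessed = []
        for name, content in files.items():
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            if self.fingerprints.get(name) == digest:
                continue  # unchanged since the last run: skip recomputation
            chunks = self.chunker(content)
            self.index[name] = [(c, self.embedder(c)) for c in chunks]
            self.fingerprints[name] = digest
            reprocessed.append(name)
        # Forget files that no longer exist in the source.
        for name in list(self.index):
            if name not in files:
                del self.index[name]
                del self.fingerprints[name]
        return reprocessed


indexer = IncrementalIndexer(fixed_size_chunker, toy_embedder)
first = indexer.update({"a.txt": "hello world", "b.txt": "rag pipelines"})
second = indexer.update({"a.txt": "hello world", "b.txt": "rag pipelines, updated"})
print(first)   # ['a.txt', 'b.txt']
print(second)  # ['b.txt']
```

On the second run only `b.txt` is re-chunked and re-embedded, because the stored hash for `a.txt` still matches. Chunk-level granularity would presumably apply the same fingerprinting idea to each chunk instead of each file.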
We are sincerely looking for feedback and would love to learn from your thoughts. Contributors are welcome too if you are interested :) Thank you so much!
Linghua | Sciencx (2025-03-09T02:34:26+00:00) Open-Source ETL to prepare data for RAG 🦀 🐍. Retrieved from https://www.scien.cx/2025/03/09/open-source-etl-to-prepare-data-for-rag-%f0%9f%a6%80-%f0%9f%90%8d/
