💡 Main Idea

We have a PDF of graphs and tiles (dashboard elements) whose contents change
once every 24 hours (200 - 300 images). We need a solution that does **text-to-image**
similarity matching against these images and then passes the matched images along with
the query, so that a **VLM** can hopefully understand them and answer the query.
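
The retrieval step described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: it assumes the query and every dashboard image have already been embedded into the same vector space by one of the models below, and the toy vectors, the file names, and the helpers `cosine`, `top_k_images` and `build_vlm_request` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k_images(query_vec, image_index, k=3):
    """Return the k (image_id, score) pairs most similar to the query."""
    scored = [(cosine(query_vec, vec), image_id)
              for image_id, vec in image_index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(image_id, score) for score, image_id in scored[:k]]


def build_vlm_request(query, ranked):
    """Bundle the query with the retrieved image ids (hypothetical payload)."""
    return {"query": query, "images": [image_id for image_id, _ in ranked]}


# Toy 3-dimensional embeddings standing in for real model output.
image_index = {
    "revenue_chart.png": [0.9, 0.1, 0.0],
    "latency_tile.png": [0.1, 0.9, 0.1],
    "error_rate_graph.png": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of the user's query

ranked = top_k_images(query_vec, image_index, k=2)
request = build_vlm_request("How did revenue change?", ranked)
```

Because the images change every 24 hours, `image_index` would be rebuilt once per day, while the query embedding is computed per request. With only 200 - 300 images, a brute-force scan like this is likely sufficient.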

🤖 Models found

Pros and cons comparison

| | qwen + jina-clip | ColPali by vidore | Visualized-BGE by BAAI | ImageBind by MetaAI |
|---|---|---|---|---|
| Memory requirement | 223M model; can run on CPU, but old and only suitable for smaller text-matching tasks (similar to the CLIP architecture) | 3B model; requires > 40 GB RAM on CPU | < 4 GB RAM required; about 300 MB model | 10 GB RAM required |
| Performance | Equivalent to ImageBind | Almost perfect | Slightly worse | Sometimes works, sometimes doesn't; works most of the time compared to BGE |
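
As a rough sanity check on the memory figures, the weights-only footprint can be estimated from the parameter count. This is a back-of-the-envelope sketch using only the 223M and 3B counts quoted above. The helper `weight_memory_gb` is hypothetical, and the result is a lower bound: activations, image preprocessing and runtime overhead come on top, which is why observed RAM usage can be much higher.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype="fp32"):
    """Weights-only memory in GB (1 GB = 1e9 bytes); excludes activations."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


jina_clip_fp32 = weight_memory_gb(223_000_000)           # about 0.89 GB
colpali_fp32 = weight_memory_gb(3_000_000_000)           # about 12 GB
colpali_fp16 = weight_memory_gb(3_000_000_000, "fp16")   # about 6 GB
```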

🧑‍💻 Code for the approaches below

https://culinda-my.sharepoint.com/:u:/p/somesh/EXIRYlSZ8f9Is2zg33X2RLcBB90QG1IX35drj54EGEGaKw?e=5c7aI3

✅ Solution approaches

🍯 Approach 1: ( jina-clip-v1 )