1 of 24

Major Project | End-term Presentation
Vanshika Mishra

LLMs as Question Answering Chatbot

Sudhir Sharma

2 of 24

Contents

  • Brief Introduction of the project
  • Features Implemented
  • Project Design
  • Methodology Followed
  • Results Obtained
  • Learnings

3 of 24

About Samagra

SamagraX is dedicated to building and shaping open-source population-scale products, platforms, and protocols. The focus is on creating Building Blocks (BBs) and Digital Public Goods (DPGs) that empower governments to leverage technology and data to transform the lives of millions.

Building Blocks (BBs), as defined by GovStack, are software code and applications that are interoperable, scalable, and reusable across various use cases and contexts. They provide essential digital services at scale, ensuring efficient and effective solutions for diverse needs.

Digital Public Goods (DPGs) include open-source software, open data, open AI models, open standards, and open content. These resources adhere to privacy laws and best practices, are designed to do no harm, and contribute to achieving the Sustainable Development Goals (SDGs).

SamagraX operates across both these categories by either building solutions from scratch or enhancing existing BBs and DPGs. The portfolio includes shaping over 10 BBs and 3 DPGs, which have been instrumental in delivering more than 20 products across various domains and states.

The mission of SamagraX is to empower governments with innovative technology and data solutions, driving large-scale positive impact and transforming lives.

4 of 24

Brief Introduction of the Project

  • The project focuses on leveraging LLMs to create an efficient question-answering system for PDFs.
  • The main motivation of this project was to create a robust deep learning architecture that is able to efficiently extract information from the PDF and provide accurate answers to user queries.
  • The end goal is that the user can upload a PDF and pose questions to it, and the system retrieves the closest matching answer.

5 of 24

BASIC OVERVIEW

6 of 24

FEATURES OF THE PROJECT - Ingestion

  1. Text Extraction from Documents
  2. Text Cleaning
  3. Text Tokenization
  4. Coreference Resolution
  5. Testing Indexes in Milvus
  6. Document Ingestion
  7. Sample Dataset Creation (question-answer pairs)

7 of 24

FEATURES OF THE PROJECT - Retrieval

  1. Tokenize the question from the user and clean the input.
  2. Generate the query embedding.
  3. Find the top-n matching documents from the document store.
  4. Evaluate the most suitable model for short-answer and generative answering based on context.
  5. Fine-tune the models on the current dataset.
  6. Pass the top-n documents to the models and extract the responses (see the sketch after this list).
  7. Merge the responses and return them to the end user.
  8. Return the context information to the end user along with the answers.
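A minimal sketch of steps 6-8, assuming a Hugging Face extractive question-answering pipeline over the retrieved documents; the merging strategy shown (keeping the highest-scoring answer while returning all contexts) is an assumption, and the function name is illustrative:

```python
from transformers import pipeline

# Extractive question-answering pipeline (model name is the one selected later in this deck)
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

def answer_from_top_docs(question: str, top_docs: list[str]) -> dict:
    """Run the QA model over each retrieved document and merge the responses."""
    candidates = []
    for doc in top_docs:
        result = qa_pipeline(question=question, context=doc)
        # Keep the answer, its confidence score, and the context it came from
        candidates.append({"answer": result["answer"], "score": result["score"], "context": doc})
    # Merge: pick the highest-scoring answer, but keep every context for the end user
    best = max(candidates, key=lambda c: c["score"])
    return {"answer": best["answer"], "context": best["context"], "all_candidates": candidates}
```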

8 of 24

PROJECT DESIGN

9 of 24

WORKFLOW

10 of 24

Text Extraction, Tokenization and Cleaning

  1. Extract text from the PDF using PyPDF2. This function reads a PDF file and extracts the text from each page.
  2. Split the text into chunks, with separator="\n", chunk_size=1000, chunk_overlap=300, and length_function=len as parameters.
  3. Preprocess the text: convert it to lowercase, remove punctuation, and correct spelling errors.
  4. Coreference resolution processes the text to replace pronouns and other coreferent mentions with their main entities using SpaCy. This improves the clarity and coherence of the text by making it explicit who or what each pronoun refers to (a sketch of this stage follows below).
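A minimal sketch of the extraction, chunking, and cleaning steps, assuming PyPDF2 and LangChain's CharacterTextSplitter; spelling correction and the SpaCy coreference step are omitted here since the exact pipeline components are not specified:

```python
import re
import string

from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter

def extract_text(pdf_path: str) -> str:
    """Read a PDF and concatenate the text of every page."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def split_into_chunks(text: str) -> list[str]:
    """Split the extracted text with the parameters used in the project."""
    splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=300,
        length_function=len,
    )
    return splitter.split_text(text)

def clean(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (spell correction omitted)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()
```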

11 of 24

Sample Dataset Creation

  • Web-scraping articles and books on India and digital public goods and storing them in PDF form.
  • Extracting the text and preprocessing it: converting it to lowercase, removing punctuation, and correcting spelling errors.
  • Forming question-answer pairs from the text and storing them as the ground-truth answer dataset (see the sketch below).
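A minimal sketch of the dataset-creation step, assuming requests and BeautifulSoup for scraping; the URL, file name, and sample question-answer pair are placeholders:

```python
import csv

import requests
from bs4 import BeautifulSoup

# Placeholder URL; the project scraped articles and books on India and digital public goods
url = "https://example.org/article-on-digital-public-goods"
html = requests.get(url, timeout=30).text
article_text = BeautifulSoup(html, "html.parser").get_text(separator="\n")

# Question-answer pairs written against the scraped text form the ground-truth dataset
qa_pairs = [
    {"question": "What are digital public goods?",
     "answer": "Open-source software, open data, open AI models, open standards, and open content."},
]

with open("ground_truth_qa.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "answer"])
    writer.writeheader()
    writer.writerows(qa_pairs)
```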

12 of 24

Testing Milvus Indexes and Ingesting

  • Choosing between different indexing strategies for fast retrieval and low latency.
  • Priority given to a high similarity score.
  • Finally chose HNSW as the indexing strategy and ingested the question-answer pairs into the database (see the sketch below).
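A minimal sketch of the index creation and ingestion with pymilvus, assuming a local Milvus instance; the schema fields, embedding dimension, and HNSW parameters are illustrative:

```python
from pymilvus import (
    Collection, CollectionSchema, FieldSchema, DataType, connections
)

connections.connect(host="localhost", port="19530")  # assumed local Milvus instance

# Schema: an auto-generated id, the question-answer pair, and the embedding vector
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="question", dtype=DataType.VARCHAR, max_length=1024),
    FieldSchema(name="answer", dtype=DataType.VARCHAR, max_length=4096),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),  # dim depends on the embedding model
]
collection = Collection("qa_pairs", CollectionSchema(fields))

# HNSW index, the strategy chosen for fast retrieval and low latency
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "HNSW", "metric_type": "IP",
                  "params": {"M": 16, "efConstruction": 200}},
)

# Placeholder data; real embeddings come from the embedding model
questions = ["what are digital public goods"]
answers = ["open-source software, open data, open ai models, open standards, and open content"]
embeddings = [[0.0] * 384]

collection.insert([questions, answers, embeddings])
collection.load()
```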

13 of 24

MILVUS QUERY RESULTS

The table demonstrates the results of similarity searches using Milvus, based on a dataset of questions and answers about India. The retrieved results include the query question, the answer, and the context from which the answer was derived.

14 of 24

TESTING WHICH LLM IS BEST SUITED?

  1. Make a list of all the models that can support question answering with context.
  2. Find all relevant models on Hugging Face (a loading sketch is shown below).
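A minimal sketch of loading candidate models from the Hugging Face Hub as extractive QA pipelines; the shortlist below is illustrative, apart from distilbert-base-uncased-distilled-squad, which is the model selected later in this deck:

```python
from transformers import pipeline

# Candidate extractive QA models from the Hugging Face Hub (illustrative shortlist)
candidate_models = [
    "distilbert-base-uncased-distilled-squad",
    "deepset/roberta-base-squad2",
    "bert-large-uncased-whole-word-masking-finetuned-squad",
]

pipelines = {name: pipeline("question-answering", model=name) for name in candidate_models}

# Quick sanity check on one candidate
sample = pipelines["distilbert-base-uncased-distilled-squad"](
    question="What are digital public goods?",
    context="Digital Public Goods include open-source software, open data, open AI models, "
            "open standards, and open content.",
)
print(sample["answer"], sample["score"])
```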

15 of 24

TESTING WHICH LLM IS BEST SUITED?

The Testing Metrics are -

  1. Avg F1 Score :- The F1 score is the harmonic mean of precision and recall. It provides a balance between the two metrics, offering a single measure that considers both false positives and false negatives.
  2. Avg BLEU Score :- BLEU focuses on the precision of n-grams, ensuring that the generated answers contain relevant phrases and word sequences from the reference answers.
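A minimal sketch of how these two metrics can be computed, assuming SQuAD-style token overlap for F1 and NLTK for BLEU; the function names are illustrative:

```python
from collections import Counter

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-level F1 between a predicted and a reference answer."""
    pred_tokens, ref_tokens = prediction.lower().split(), reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def bleu(prediction: str, reference: str) -> float:
    """Sentence-level BLEU with smoothing, so short answers do not collapse to zero."""
    return sentence_bleu(
        [reference.lower().split()],
        prediction.lower().split(),
        smoothing_function=SmoothingFunction().method1,
    )
```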

16 of 24

TESTING WHICH LLM IS BEST SUITED?

The Testing Metrics are -

3. Avg ROUGE-1 Score :- ROUGE scores indicate how much of the reference content is covered by the generated answers. Higher ROUGE scores suggest that the model is capturing relevant information from the source documents.

4. Accuracy :- Similarity score between the predicted answer and the ground-truth answer.
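A minimal sketch of these two metrics, assuming the rouge-score package for ROUGE-1 and a sentence-transformers model for the similarity-based accuracy; the specific embedding model is an assumption:

```python
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

rouge = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed similarity model

def rouge1_f(prediction: str, reference: str) -> float:
    """ROUGE-1 F-measure: unigram overlap between the prediction and the reference."""
    return rouge.score(reference, prediction)["rouge1"].fmeasure

def similarity_accuracy(prediction: str, reference: str) -> float:
    """Cosine similarity between sentence embeddings of the prediction and the ground truth."""
    emb = embedder.encode([prediction, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```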

17 of 24

TESTING WHICH LLM IS BEST SUITED?

The Testing Metrics are -

5. Latency :- The time taken by the model to generate an answer for a given query.
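A minimal sketch of the latency measurement, assuming a Hugging Face question-answering pipeline object is passed in:

```python
import time

def measure_latency(qa_pipeline, question: str, context: str) -> float:
    """Wall-clock time (in seconds) the model takes to answer one query."""
    start = time.perf_counter()
    qa_pipeline(question=question, context=context)
    return time.perf_counter() - start
```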

18 of 24

Final Result

First priority was given to accuracy and second to latency. Based on this, distilbert-base-uncased-distilled-squad was selected as the LLM model, and we fine-tuned it to our use case using LangChain.

19 of 24

Getting User Query and Processing it

Step 1 :- Get the user query and pre-process it.

Step 2 :- Form the question embedding and pad it according to our vector database sequence length.

Step 3 :- Pass this through the vector DB to find the top-n documents (see the sketch below).
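A minimal sketch of steps 2 and 3, assuming a sentence-transformers embedding model (an assumption; fixed-dimension dense embeddings are used here, so the padding step is omitted) and the Milvus collection created during ingestion:

```python
from pymilvus import Collection, connections
from sentence_transformers import SentenceTransformer

connections.connect(host="localhost", port="19530")  # assumed local Milvus instance
embedder = SentenceTransformer("all-MiniLM-L6-v2")   # dimension must match the collection schema
collection = Collection("qa_pairs")
collection.load()

def top_n_documents(query: str, n: int = 5):
    """Embed the pre-processed user query and retrieve the closest matches from Milvus."""
    query_embedding = embedder.encode([query])[0].tolist()
    results = collection.search(
        data=[query_embedding],
        anns_field="embedding",
        param={"metric_type": "IP", "params": {"ef": 64}},
        limit=n,
        output_fields=["question", "answer"],
    )
    # Each hit carries the stored answer plus its similarity/distance score
    return [(hit.entity.get("answer"), hit.distance) for hit in results[0]]
```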


20 of 24

Query Result

21 of 24

Future Scope :-

  a. Expansion of Data Sources: Continuously ingest new and diverse articles to expand the knowledge base. Incorporate various types of data such as multimedia content (videos, images) and structured data (tables, graphs) to provide a richer set of information.

  b. Enhanced Model Fine-Tuning: Regularly update and fine-tune the LLMs with new datasets and user feedback to improve accuracy and contextual understanding. Experiment with cutting-edge LLM architectures and fine-tuning techniques to push the boundaries of performance.

  c. Multilingual Support: Extend the system's capabilities to handle queries in multiple languages, making it accessible to a global audience.

22 of 24

Future Scope :-

  d. User Personalization: Develop mechanisms to personalize responses based on individual user preferences, query histories, and contextual factors.

  e. User Feedback Loop: Establish a feedback loop where users can rate the relevance and accuracy of the responses.

  f. Advanced Query Handling: Enhance the system's ability to understand and process complex, multi-faceted questions.

23 of 24

Learnings

  • Studied many new Large Language Models.
  • Conducted testing for the first time.
  • Understood the importance of following best practices while coding, such as maintaining documentation.
  • Learned about fine-tuning LLMs.

24 of 24

THANK YOU FOR THIS INCREDIBLE EXPERIENCE