Reexploring SAcommunity Chatbot for Integration into Website Redevelopment Project

With the rapid advancement of AI technologies, and given the team's limited software development experience during the early stages of the chatbot's development, many aspects of the chatbot software are worth revisiting. In collaboration with the website redevelopment team, efforts are also underway to update the SAcommunity website, which is gradually becoming outdated.

Volunteer Yong Kheng Beh, who previously developed the chatbot’s backend, returned to Connecting Up on January 14, 2025, to refine and enhance the chatbot, focusing on areas she believes can be improved. She brings valuable insights from her internship at YourAnswer Pty Ltd (July 2024 – October 2024), along with her dedication to continuous learning.

This blog post will also serve as documentation of how the code works and how to run it, for future reference.

Data Preprocessing


Figure 1: Flow chart of data_preprocessing.py

First, Yong Kheng aimed to streamline the chatbot’s vector storage update process by enabling it to run with a simple line of code. Since the website is still being updated, she temporarily used a CSV file containing extracted data from the SAcommunity database. As the CSV file had been previously cleaned, she only identified minor data entry errors, extracted them, and provided them to data entry volunteers for correction. The corrected entries, managed by the volunteer coordinator, will ensure smoother data extraction in the future, facilitating a seamless transfer to the new website.

She wrote a Python program to automate index creation using the following command:

python3 data_preprocessing.py --model_name all-MiniLM-L6-v2 --input_file sacommunity.csv
  • input_file: The CSV file containing the data to be indexed.
  • model_name: The embedding model used to convert the data into embeddings for semantic search.
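
As a rough illustration, a command-line interface like this can be wired up with Python's argparse module. This is a sketch only; the defaults shown are assumptions, not necessarily those in data_preprocessing.py:

import argparse

# Minimal sketch of the script's command-line interface (illustrative only).
parser = argparse.ArgumentParser(description="Build the chatbot's vector index from a CSV export.")
parser.add_argument("--model_name", default="all-MiniLM-L6-v2",
                    help="sentence-transformer model used to embed the data")
parser.add_argument("--input_file", default="sacommunity.csv",
                    help="CSV file extracted from the SAcommunity database")
args = parser.parse_args()
print(f"Indexing {args.input_file} with {args.model_name}")
# ...the real script then loads the CSV, embeds it, and saves the index.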


Figure 2: Output of data_preprocessing.py while it runs.

Running this script generates a folder named "database", which contains:

  1. An HNSW index (hnsw_index.bin) storing vectors for all organization details and subjects.
  2. A sentence transformer model (saved in the embedding_model folder).
  3. Two JSON files:
    • organisation_dict.json: A dictionary mapping organization IDs to their corresponding details.
    • subject_dict.json: A dictionary mapping each subject to the organization IDs that list it in their subject category, enabling more accurate search results.
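
A minimal sketch of how these artefacts can be produced with pandas, sentence-transformers, and hnswlib follows. The CSV column names (Organisation_ID, Organisation_name, Details, Subjects) are placeholders rather than the actual schema:

import json
import os

import hnswlib
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv("sacommunity.csv")  # column names below are placeholders
texts = (df["Organisation_name"] + " " + df["Details"]).tolist()
ids = df["Organisation_ID"].tolist()

# Embed every organisation's details with the chosen sentence transformer.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts, show_progress_bar=True)

os.makedirs("database", exist_ok=True)

# Build and save the HNSW index, keyed by organisation ID.
index = hnswlib.Index(space="cosine", dim=model.get_sentence_embedding_dimension())
index.init_index(max_elements=len(texts), ef_construction=200, M=16)
index.add_items(embeddings, ids)
index.save_index("database/hnsw_index.bin")

# Save the embedding model so chatbot.py can embed queries the same way.
model.save("database/embedding_model")

# organisation_dict.json: organisation ID -> full details.
with open("database/organisation_dict.json", "w") as f:
    json.dump({str(i): t for i, t in zip(ids, texts)}, f)

# subject_dict.json: subject -> IDs of organisations listing that subject.
subject_dict = {}
for org_id, subjects in zip(ids, df["Subjects"].fillna("")):
    for subject in subjects.split(";"):
        if subject.strip():
            subject_dict.setdefault(subject.strip().lower(), []).append(org_id)
with open("database/subject_dict.json", "w") as f:
    json.dump(subject_dict, f)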


Figure 3: Saved files after running the data_preprocessing.py program

These files are saved locally and will be used by the chatbot program for semantic search.

This code serves as a prototype for data preprocessing and extraction, as the website rebuild is still ongoing. Changes to the database content and structure may require further adjustments in the future.

Suggestions for Improvement Once the New Website is Live

  • Direct SQL Query for Data Extraction
    Instead of extracting data into a CSV file and processing it separately, a more efficient approach would be to extract data directly from the database using SQL queries (a sketch of this follows Figure 4). However, security concerns will need to be addressed before implementing this method.


Figure 4: Ideal data extraction scenario
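
A minimal sketch of that direct extraction, assuming SQLAlchemy; the connection string, table, and column names are placeholders, since the real engine and schema depend on the rebuilt website:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string: use a read-only account and keep
# credentials out of source control to address the security concerns above.
engine = create_engine("postgresql://readonly_user:password@localhost/sacommunity")

query = """
    SELECT org_id, org_name, details, subjects
    FROM organisations
    WHERE status = 'active'
"""
df = pd.read_sql(query, engine)  # feeds straight into the embedding step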

  • Optimizing Index Creation for Semantic Search
    Currently, the index is created using full organization details and subjects as embeddings. Ideally, the services category should be split into more meaningful chunks to enhance semantic search accuracy. However, given that the dataset contains over 15,000 organizations, each describing its services differently, breaking the data into smaller, meaningful segments could increase memory usage and reduce retrieval speed. A balance between accuracy, memory efficiency, and performance needs to be established (one possible chunking approach is sketched after Figure 5).


Figure 5: Example of ideal chunking method and indexing
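
One possible chunking strategy, sketched below with illustrative numbers: split each services description on sentence boundaries and merge fragments up to a length cap, keeping the organization ID attached to every chunk so scores can be rolled back up per organization:

import re

def chunk_services(org_id, services_text, max_len=200):
    """Split a free-text services description into chunks of at most
    max_len characters, breaking on sentence boundaries where possible."""
    sentences = re.split(r"(?<=[.;])\s+", services_text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_len:
            chunks.append((org_id, current.strip()))
            current = sentence
        else:
            current = f"{current} {sentence}"
    if current.strip():
        chunks.append((org_id, current.strip()))
    return chunks

# Each (org_id, chunk) pair would then be embedded and indexed separately,
# which multiplies the number of vectors, hence the memory/speed trade-off.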

  • Improving Location-Based Search
    To enhance location-based search, a dedicated index could significantly benefit users seeking services within a specific area. Previously, the chatbot relied on the Nominatim geocoder to fetch nearby locations; however, the free service's one-query-per-second limit presents a performance bottleneck (see the sketch after this list). For a nonprofit organization, cost-effective solutions are crucial, so we must consider:
    - Paying for the Google Maps API.
    - Installing a local Nominatim instance (which requires substantial computational resources).
    - Developing an alternative location retrieval method to handle typos and imprecise address matches.
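
For reference, the rate-limited free Nominatim approach can be expressed with geopy, as in this sketch (the user_agent string is a placeholder):

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# Respect the free Nominatim policy of at most one request per second.
geolocator = Nominatim(user_agent="sacommunity-chatbot")  # placeholder user agent
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

location = geocode("Brompton SA 5007, Australia")
if location is not None:
    print(location.latitude, location.longitude)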

The data_preprocessing.py code created by Yong Kheng can be found in the attached file: data_preprocessing.pdf.

Chatbot Program: Semantic Search

After data preparation, the next program, chatbot.py, performs a semantic search against the index, retrieves the relevant data from the dictionaries, calculates cumulative scores, ranks and returns the top 10 results, and feeds them to a Large Language Model (LLM), which, with the help of prompt engineering, produces the desired output.

The index can be built with hnswlib, FAISS, ScaNN, or USearch, which all work quite similarly; the latter three offer more customisation features, while hnswlib is pretty straightforward.
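
A minimal sketch of this retrieval step, reloading the artefacts saved by data_preprocessing.py and using hnswlib; the subject-boost weight is illustrative, not the value used in chatbot.py:

import json

import hnswlib
from sentence_transformers import SentenceTransformer

# Reload the artefacts produced by data_preprocessing.py.
model = SentenceTransformer("database/embedding_model")
index = hnswlib.Index(space="cosine", dim=model.get_sentence_embedding_dimension())
index.load_index("database/hnsw_index.bin")
with open("database/organisation_dict.json") as f:
    organisation_dict = json.load(f)
with open("database/subject_dict.json") as f:
    subject_dict = json.load(f)

def search(query, k=10):
    # Vector similarity: in hnswlib's cosine space, distance = 1 - similarity.
    labels, distances = index.knn_query(model.encode([query]), k=k)
    scores = {str(org_id): 1 - dist for org_id, dist in zip(labels[0], distances[0])}
    # Cumulative scoring: boost organisations whose subjects match the query.
    for word in query.lower().split():
        for org_id in subject_dict.get(word, []):
            scores[str(org_id)] = scores.get(str(org_id), 0) + 0.5  # illustrative weight
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return [(org_id, organisation_dict.get(org_id, "")) for org_id, _ in ranked]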

The choice of LLM will depend on the company's usage and preferences. Yong Kheng is using Gemini for the moment, as it offers a free tier, but the caveat is that data sent to the free version of Gemini is collected to train the model. Therefore, take the utmost care not to send any private or sensitive information to Gemini.
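
A minimal sketch of that final step with the google-generativeai package; the model name, prompt wording, and variable names are illustrative:

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # never hard-code a real key
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model name

user_question = "Where can I find free legal advice in Adelaide?"
retrieved_results = "..."  # top-10 results from the semantic search step

prompt = (
    "You are a helpful assistant for the SAcommunity directory.\n"
    "Answer the question using ONLY the services listed below.\n\n"
    f"Services:\n{retrieved_results}\n\nQuestion: {user_question}"
)
response = model.generate_content(prompt)
print(response.text)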
