Re-exploring the SAcommunity Chatbot for Integration into the Website Redevelopment Project

With the rapid advancement of AI technologies, and given the team's limited software development experience during the early stages of chatbot development, many aspects of the chatbot software are worth revisiting. In collaboration with the website redevelopment team, efforts are also underway to update the SAcommunity website, which has gradually become outdated.
Volunteer Yong Kheng Beh, who previously developed the chatbot’s backend, returned to Connecting Up on January 14, 2025, to refine and enhance the chatbot, focusing on areas she believes can be improved. She brings valuable insights from her internship at YourAnswer Pty Ltd (July 2024 – October 2024), along with her dedication to continuous learning.
This blog post will also serve as documentation of how the code works and how to run it for future reference.
Data Preprocessing
Figure 1: Flow chart of data_preprocessing.py
First, Yong Kheng aimed to streamline the chatbot’s vector storage update process by enabling it to run with a simple line of code. Since the website is still being updated, she temporarily used a CSV file containing extracted data from the SAcommunity database. As the CSV file had been previously cleaned, she only identified minor data entry errors, extracted them, and provided them to data entry volunteers for correction. The corrected entries, managed by the volunteer coordinator, will ensure smoother data extraction in the future, facilitating a seamless transfer to the new website.
She wrote a Python program to automate index creation using the following command:
python3 data_preprocessing.py --model_name all-MiniLM-L6-v2 --input_file sacommunity.csv
- input_file: The CSV file containing the data to be indexed.
- model_name: The embedding model used to convert the data into embeddings for semantic search.
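As a rough sketch of how these options might be wired up, the command-line interface could be defined with argparse along the following lines (the actual argument handling in data_preprocessing.py may differ):

# Minimal CLI sketch, assuming argparse is used; the real
# data_preprocessing.py may handle its arguments differently.
import argparse

parser = argparse.ArgumentParser(description="Build the chatbot's vector index.")
parser.add_argument("--model_name", default="all-MiniLM-L6-v2",
                    help="Sentence-transformer model used to create the embeddings.")
parser.add_argument("--input_file", required=True,
                    help="CSV file containing the SAcommunity data to index.")
args = parser.parse_args()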
Figure 2: How data_preprocessing.py looks when it runs.
Running this script generates a folder named "database", which contains:
- An HNSW index (hnsw_index.bin) storing vectors for all organization details and subjects.
- A sentence transformer model (saved in the embedding_model folder).
- Two JSON files:
  - organisation_dict.json: a dictionary mapping organization IDs to their corresponding details.
  - subject_dict.json: a dictionary mapping each subject to all organization IDs that list that subject in the subject category of their details, enabling more accurate search results.
Figure 3: Saved files after running the data_preprocessing.py program
These files are saved locally and are used by the chatbot program for semantic search.
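To make the workflow concrete, below is a minimal sketch of the kind of pipeline that could produce these files. The CSV column names (Details, Subjects), the semicolon-separated subject format, and the HNSW parameters are assumptions for illustration, not the actual implementation:

# Illustrative indexing pipeline; column names, the subject
# separator, and the HNSW parameters are assumptions.
import json
import os

import hnswlib
import pandas as pd
from sentence_transformers import SentenceTransformer

os.makedirs("database", exist_ok=True)

df = pd.read_csv("sacommunity.csv")              # --input_file
model = SentenceTransformer("all-MiniLM-L6-v2")  # --model_name

# One text per organization: full details plus subjects.
texts = (df["Details"].fillna("") + " " + df["Subjects"].fillna("")).tolist()
embeddings = model.encode(texts, show_progress_bar=True)

# Build and save the HNSW index.
index = hnswlib.Index(space="cosine", dim=embeddings.shape[1])
index.init_index(max_elements=len(texts), ef_construction=200, M=16)
index.add_items(embeddings, df.index.to_numpy())
index.save_index("database/hnsw_index.bin")

# Save the embedding model and the two lookup dictionaries.
model.save("database/embedding_model")

organisation_dict = {int(i): row.to_dict() for i, row in df.iterrows()}
subject_dict = {}
for i, subjects in df["Subjects"].fillna("").items():
    for subject in subjects.split(";"):
        if subject.strip():
            subject_dict.setdefault(subject.strip(), []).append(int(i))

with open("database/organisation_dict.json", "w") as f:
    json.dump(organisation_dict, f)
with open("database/subject_dict.json", "w") as f:
    json.dump(subject_dict, f)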
This code serves as a prototype for data preprocessing and extraction, as the website rebuild is still ongoing. Changes to the database content and structure may require further adjustments in the future.
Suggestions for Improvement Once the New Website is Live
- Direct SQL Query for Data Extraction
Instead of extracting data into a CSV file and processing it separately, a more efficient approach would be to extract data directly from the database using SQL queries. However, security concerns will need to be addressed before implementing this method.
Figure 4: Ideal data extraction scenario
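As a sketch of what direct extraction could look like, assuming a PostgreSQL backend; the connection string, table, and column names are hypothetical, since the new website's schema is not yet finalised:

# Hypothetical direct-extraction sketch; the connection string,
# table, and column names are assumptions, not the real schema.
import pandas as pd
from sqlalchemy import create_engine

# A read-only database account would help address the security concerns.
engine = create_engine("postgresql://readonly_user:password@localhost/sacommunity")

df = pd.read_sql("SELECT org_id, org_name, details, subjects FROM organisations", engine)
# df can then feed the indexing pipeline directly,
# skipping the intermediate CSV file.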
- Optimizing Index Creation for Semantic Search
Currently, the index is created using full organization details and subjects as embeddings. Ideally, the services category should be split into more meaningful chunks to enhance semantic search accuracy. However, given that the dataset contains over 15,000 organizations, and each organization describes its services differently, breaking the data into smaller, meaningful segments could increase memory usage and reduce retrieval speed. A balance between accuracy, memory efficiency, and performance needs to be established.
Figure 5: Example of ideal chunking method and indexing
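As a simple illustration of the trade-off, the sketch below splits a services description on sentence boundaries while keeping each chunk linked to its organization ID; the size limit and the organisation_services dictionary are assumptions, not the actual data structures:

# Illustrative chunking sketch; the splitting rule, size limit,
# and the organisation_services dict are assumptions.
import re

def split_into_chunks(text, max_chars=300):
    """Pack whole sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk keeps a reference to its organization ID, so a matching
# chunk can still resolve back to the full organization record.
chunk_texts, chunk_owner = [], []
for org_id, services_text in organisation_services.items():  # hypothetical dict
    for chunk in split_into_chunks(services_text):
        chunk_texts.append(chunk)
        chunk_owner.append(org_id)

Every extra chunk adds another vector to the index, which is where the memory and retrieval-speed cost mentioned above comes from.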
- Enhancing Location-Based Search
A separate index for location-based search could be beneficial for users looking for services in a specific area. Previously, the chatbot used the Nominatim Geocoder to retrieve nearby locations based on user queries. However, the free version allows only one query per second, which poses a limitation. Cost considerations will be necessary, especially for a nonprofit organization, when deciding whether to:
- Pay for the Google Maps API.
- Upgrade to a paid version of Nominatim Geocoder.
- Develop an alternative solution to improve location retrieval, particularly for handling typos or areas that do not precisely match any address records in the database.
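Whichever option is chosen, the free tier's one-query-per-second limit can at least be respected in code. Below is a minimal sketch using geopy's built-in rate limiter, not the chatbot's actual implementation:

# Minimal rate-limited geocoding sketch using geopy's Nominatim;
# not the chatbot's actual implementation.
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="sacommunity-chatbot")
# Enforce the free tier's one-request-per-second policy.
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

location = geocode("Salisbury, South Australia")
if location is not None:
    print(location.latitude, location.longitude)
else:
    print("No match found; a typo-tolerant fallback would help here.")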
The data_preprocessing Python code created by Yong Kheng can be found in the attached file: data_preprocessing.pdf.
The next step is to update the chatbot program to utilise the latest database and embedding models, and to reduce its dependency on the LangChain library where possible. This update is currently a work in progress, and further improvements will be documented in a future blog post.