In our earlier article, we demonstrated how to build an AI chatbot with the ChatGPT API and assign a role to personalize it. But what if you want to train the AI on your own data? For example, you may have a book, financial data, or a large set of databases, and you wish to search them with ease. In this article, we bring you an easy-to-follow tutorial on how to train an AI chatbot on your custom knowledge base using LangChain and the ChatGPT API. We are using LangChain, GPT Index, and other powerful libraries to train the AI chatbot using OpenAI’s Large Language Model (LLM). So on that note, let’s check out how to train and create an AI chatbot using your own dataset.
Train an AI Chatbot With Custom Knowledge Base Using ChatGPT API, LangChain, and GPT Index (2023)
In this article, we have explained the steps to teach the AI chatbot with your own data in detail. From setting up tools and software to training the AI model, we have included all the instructions in easy-to-understand language. It is highly recommended to follow the instructions from top to bottom without skipping any part.
Notable Points Before You Train AI with Your Own Data
1. You can train the AI chatbot on any platform, whether Windows, macOS, Linux, or ChromeOS. In this article, I’m using Windows 11, but the steps are nearly identical for other platforms.
2. The guide is meant for general users, and the instructions are explained in simple language. So even if you have a cursory knowledge of computers and don’t know how to code, you can easily train and create a Q&A AI chatbot in a few minutes. If you followed our previous ChatGPT bot article, it would be even easier to understand the process.
3. Since we are going to train an AI Chatbot based on our own data, it’s recommended to use a capable computer with a good CPU and GPU. However, you can use any low-end computer for testing purposes, and it will work without any issues. I used a Chromebook to train the AI model using a book with 100 pages (~100MB). However, if you want to train a large set of data running into thousands of pages, it’s strongly recommended to use a powerful computer.
4. Finally, the data set should be in English to get the best results, but according to OpenAI, it will also work with popular international languages like French, Spanish, German, etc. So go ahead and give it a try in your own language.
Set Up the Software Environment to Train an AI Chatbot
As in our previous article, Python and Pip must be installed along with several libraries. In this article, we will set up everything from scratch so new users can also understand the setup process. To give you a brief idea, we will install Python and Pip. After that, we will install Python libraries, which include OpenAI, GPT Index, Gradio, and PyPDF2. Along the way, you will learn what each library does. Again, do not fret over the installation process. It’s pretty straightforward. On that note, let’s jump right in.
1. First off, you need to install Python (Pip) on your computer. Open this link and download the setup file for your platform.
2. Next, run the setup file and make sure to enable the checkbox for “Add Python.exe to PATH.” This is an extremely important step. After that, click on “Install Now” and follow the usual steps to install Python.
3. To check if Python is properly installed, open the Terminal on your computer. I’m using Windows Terminal on Windows, but you can also use Command Prompt. Once here, run the command below, and it will output the Python version. On Linux and macOS, you may have to use python3 --version instead.
python --version
When you install Python, Pip is installed simultaneously on your system. So let’s upgrade it to the latest version. For those unaware, Pip is the package manager for Python. Basically, it lets you install thousands of Python libraries from the Terminal. With Pip, we can install OpenAI, gpt_index, gradio, and PyPDF2 libraries. Here are the steps to follow.
1. Open the Terminal of your choice on your computer. I’m using the Windows Terminal, but you can also use Command Prompt. Now, run the command below to update Pip. Again, you may have to use python3 and pip3 on Linux and macOS.
python -m pip install -U pip
2. To check if Pip was properly installed, run the command below. It will output the version number. If you get any errors, follow our dedicated guide on how to install Pip on Windows to fix PATH-related issues.
pip --version
Install OpenAI, GPT Index, PyPDF2, and Gradio Libraries
Once we have set up Python and Pip, it’s time to install the essential libraries that will help us train an AI chatbot with a custom knowledge base. Here are the steps to follow.
1. Open the Terminal and run the command below to install the OpenAI library. It gives us access to OpenAI’s LLM (large language model), which we will use to train and create the AI chatbot. The code also imports the OpenAI wrapper from the LangChain framework. Note that Linux and macOS users may have to use pip3 instead of pip.
pip install openai
2. Next, let’s install GPT Index, which is also called LlamaIndex. It allows the LLM to connect to the external data that is our knowledge base.
pip install gpt_index
3. After that, install PyPDF2 to parse PDF files. If you want to feed your data in PDF format, this library will help the program read the data effortlessly.
pip install PyPDF2
4. Finally, install the Gradio library. This is meant for creating a simple UI to interact with the trained AI chatbot. We are now done installing all the required libraries to train an AI chatbot.
pip install gradio
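Before moving on, it can help to confirm that the four libraries are actually importable. The snippet below is a minimal sketch and not part of the tutorial code. The helper name check_libraries is made up for this example, and it assumes the import names are the same as the package names installed above (openai, gpt_index, gradio, PyPDF2).

```python
import importlib.util
import sys

# Import names assumed to match the packages installed in the steps above.
REQUIRED = ["openai", "gpt_index", "gradio", "PyPDF2"]


def check_libraries(names):
    """Return a dict mapping each module name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Print the interpreter version and the status of each library.
print("Python version:", sys.version.split()[0])
for name, found in check_libraries(REQUIRED).items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If any library shows up as missing, run its pip install command again from the same Terminal.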
Download a Code Editor
Finally, we need a code editor to edit some of the code. On Windows, I would recommend Notepad++ (Download). Simply download and install the program via the attached link. You can also use VS Code on any platform if you are comfortable with powerful IDEs. Other than VS Code, you can install Sublime Text (Download) on macOS and Linux.
For ChromeOS, you can use the excellent Caret app (Download) to edit the code. We are almost done setting up the software environment, and it’s time to get the OpenAI API key.
Get the OpenAI API Key For Free
Now, to train and create an AI chatbot based on a custom knowledge base, we need to get an API key from OpenAI. The API key will allow you to use OpenAI’s model as the LLM to study your custom data and draw inferences. Currently, OpenAI is offering free API keys with $5 worth of free credit for the first three months to new users. If you created your OpenAI account earlier, you may have free $18 credit in your account. After the free credit is exhausted, you will have to pay for the API access. But for now, it’s available to all users for free.
1. Head to platform.openai.com/signup and create a free account. If you already have an OpenAI account, simply log in.
2. Next, click on your profile in the top-right corner and select “View API keys” from the drop-down menu.
3. Here, click on “Create new secret key” and copy the API key. Do note that you can’t copy or view the entire API key later on. So it’s strongly recommended to copy and paste the API key to a Notepad file immediately.
4. Also, do not share or display the API key in public. It’s a private key meant only for access to your account. You can also delete API keys and create multiple private keys (up to five).
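Since the key must stay private, one option is to keep it out of the script and read it from an environment variable instead. This is only a sketch of that idea and not what the tutorial code below does (it pastes the key directly into the file). The helper name load_api_key is hypothetical, while OPENAI_API_KEY is the variable name the tutorial code itself sets.

```python
import os


def load_api_key(var_name="OPENAI_API_KEY"):
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

With this approach, you would set the variable in the Terminal before running the script, for example with export OPENAI_API_KEY=... on Linux and macOS or set OPENAI_API_KEY=... in Command Prompt.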
Train and Create an AI Chatbot With Custom Knowledge Base
Now that we have set up the software environment and got the API key from OpenAI, let’s train the AI chatbot. Here, we will use the “text-davinci-003” model instead of the latest “gpt-3.5-turbo” model because Davinci works much better for text completion. If you want, you can very well change the model to Turbo to reduce the cost. With that out of the way, let’s jump to the instructions.
Add Your Documents to Train the AI Chatbot
1. First, create a new folder called docs in an accessible location like the Desktop. You can choose another location as well according to your preference. However, keep the folder name docs.
2. Next, move the documents you wish to use for training the AI inside the “docs” folder. You can add multiple text or PDF files (even scanned ones). If you have a large table in Excel, you can export it as a CSV or PDF file and then add it to the “docs” folder. You can even add SQL database files, as explained in this Langchain AI tweet. I haven’t tried many file formats besides the ones mentioned, but you can add them and check on your own. For this article, I am adding one of my articles on NFT in PDF format.
Note: If you have a large document, it will take a longer time to process the data, depending on your CPU and GPU. In addition, it will quickly use your free OpenAI tokens. So in the beginning, start with a small document (30-50 pages or < 100MB files) to understand the process.
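To see at a glance whether anything in the folder is over the size suggested in the note, you can list the files with their sizes. This is a small sketch under my own assumptions: the function name summarize_docs and the 100MB default threshold are chosen for illustration (the threshold simply mirrors the note above).

```python
from pathlib import Path


def summarize_docs(folder="docs", warn_mb=100):
    """Return (file name, size in MB, is_large) for each file in the folder."""
    rows = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            size_mb = path.stat().st_size / (1024 * 1024)
            rows.append((path.name, round(size_mb, 2), size_mb > warn_mb))
    return rows


# Only print a report if a "docs" folder exists next to this script.
if Path("docs").is_dir():
    for name, size_mb, is_large in summarize_docs("docs"):
        flag = "  <-- large, consider starting with a smaller file" if is_large else ""
        print(f"{name}: {size_mb} MB{flag}")
```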
Make the Code Ready
1. Now, launch Notepad++ (or your choice of code editor) and paste the below code into a new file. Once again, I have taken great help from armrrs on Google Colab and tweaked the code to make it compatible with PDF files and create a Gradio interface on top.
from gpt_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
from langchain import OpenAI
import gradio as gr
import sys
import os

# Replace the placeholder with your own OpenAI API key
os.environ["OPENAI_API_KEY"] = 'Your API Key'

def construct_index(directory_path):
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600

    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.7, model_name="text-davinci-003", max_tokens=num_outputs))

    # Read every file in the folder
    documents = SimpleDirectoryReader(directory_path).load_data()

    # Build the vector index and save it next to the script
    index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
    index.save_to_disk('index.json')

    return index

def chatbot(input_text):
    # Reload the saved index and answer the question from it
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    response = index.query(input_text, response_mode="compact")
    return response.response

iface = gr.Interface(fn=chatbot,
                     inputs=gr.inputs.Textbox(lines=7, label="Enter your text"),
                     outputs="text",
                     title="Custom-trained AI Chatbot")

# Build the index once at startup, then start the web UI
index = construct_index("docs")
iface.launch(share=True)
2. This is what the code looks like in the code editor.
3. Next, click on “File” in the top menu and select “Save As…” from the drop-down menu.
4. After that, set the file name to app.py and change the “Save as type” to “All types” from the drop-down menu. Then, save the file to the location where you created the “docs” folder (in my case, it’s the Desktop). You can change the name to your liking, but make sure .py is appended.
5. Make sure the “docs” folder and “app.py” are in the same location, as shown in the screenshot below. The “app.py” file will be outside the “docs” folder and not inside.
6. Come back to the code again in Notepad++. Here, replace Your API Key with the one generated on OpenAI’s website above.
7. Finally, press “Ctrl + S” to save the code. You are now ready to run the code.
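To get a feel for what chunk_size_limit and max_chunk_overlap in the code above are for, here is a rough sketch of overlapping chunking. It is an illustration only and not how GPT Index’s PromptHelper works internally: it counts words instead of tokens, and the function name chunk_words is made up for this example.

```python
def chunk_words(text, chunk_size=600, overlap=20):
    """Split text into word chunks; each chunk repeats the last `overlap` words of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny demo with 10 "words", chunks of 4 words, and 1 word of overlap.
sample = " ".join(str(i) for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears mostly intact in at least one chunk, which is the usual reason for this kind of setting.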
Create ChatGPT AI Bot with Custom Knowledge Base
1. First, open the Terminal and run the command below to move to the Desktop. It’s where I saved the “docs” folder and “app.py” file. If you saved both items in another location, move to that location via the Terminal.
cd Desktop
2. Now, run the command below. Linux and macOS users may have to use python3 instead of python.
python app.py
3. Now, it will start analyzing the document using the OpenAI LLM and start indexing the information. Depending on the file size and your computer’s capability, it will take some time to process the document. Once it’s done, an “index.json” file will be created on the Desktop. If the Terminal is not showing any output, do not worry. It might still be processing the data. For your information, it takes around 10 seconds to process a 30MB document.
4. Once the LLM has processed the data, you will get a few warnings that can be safely ignored. Finally, at the bottom, you will find a local URL. Copy it.
5. Now, paste the copied URL into the web browser, and there you have it. Your custom-trained ChatGPT-powered AI chatbot is ready. To start, you can ask the AI chatbot what the document is about.
6. You can ask further questions, and the ChatGPT bot will answer from the data you provided to the AI. So this is how you can build a custom-trained AI chatbot with your own dataset. You can now train and create an AI chatbot based on any kind of information you want. The possibilities are endless.
7. You can also copy the public URL and share it with your friends and family. The link will be live for 72 hours, but you also need to keep your computer turned on since the server instance is running on your computer.
8. To stop the custom-trained AI chatbot, press “Ctrl + C” in the Terminal window. If it does not work, press “Ctrl + C” again.
9. To restart the AI chatbot server, simply move to the Desktop location again and run the below command. Keep in mind, the local URL will be the same, but the public URL will change after every server restart.
10. If you want to train the AI chatbot with new data, delete the files inside the “docs” folder and add new ones. You can also add multiple files, but feed information on the same subject otherwise you may get an incoherent response.
11. Now, run the code again in the Terminal, and it will create a new “index.json” file. Here, the old “index.json” file will be replaced automatically.
12. To keep track of your tokens, head over to OpenAI’s online dashboard and check how much free credit is left.
13. Lastly, you don’t need to touch the code unless you want to change the API key or the OpenAI model for further customization.
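As written, the code rebuilds “index.json” every time you start it, which uses tokens even when the documents have not changed. Below is a minimal sketch of one way to check whether a rebuild is needed by comparing file modification times. The function name index_is_stale is hypothetical and not part of the tutorial code, and the check does not notice files that were deleted from the “docs” folder.

```python
import os


def index_is_stale(docs_dir="docs", index_path="index.json"):
    """Return True if the index file is missing or older than any file in docs_dir."""
    if not os.path.exists(index_path):
        return True
    index_time = os.path.getmtime(index_path)
    for root, _dirs, files in os.walk(docs_dir):
        for name in files:
            # A document modified after the index was written means a rebuild is due.
            if os.path.getmtime(os.path.join(root, name)) > index_time:
                return True
    return False
```

In app.py, you could then call construct_index("docs") only when index_is_stale() returns True. The chatbot function loads “index.json” from disk on every question, so it keeps working without the rebuild.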
Build a Custom AI Chatbot Using Your Own Data
So this is how you can train an AI chatbot with a custom knowledge base. I have used this code to train the AI on medical books, articles, data tables, and reports from old archives, and it has worked flawlessly. So go ahead and create your own AI chatbot using OpenAI’s Large Language Model and ChatGPT. Anyway, that is all from us. If you are looking for the best ChatGPT alternatives, head to our linked article. And to use ChatGPT on your Apple Watch, follow our in-depth tutorial. Finally, if you are facing any kind of issues, do let us know in the comment section below. We will definitely try to help you out.
Hi Arjun, thanks for the guide.
Step 4 in ‘Create ChatGPT AI Bot with Custom Knowledge Base’ doesn’t produce a URL. I’m using macOS.
here is my output:
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gradio/inputs.py:27: UserWarning: Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gradio/deprecation.py:40: UserWarning: `optional` parameter is deprecated, and it has no effect
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gradio/deprecation.py:40: UserWarning: `numeric` parameter is deprecated, and it has no effect
Traceback (most recent call last):
  File "/Users/iyarbinyamin/Desktop/acks.py", line 37, in <module>
    index = construct_index("docs")
  File "/Users/iyarbinyamin/Desktop/acks.py", line 19, in construct_index
    documents = SimpleDirectoryReader(directory_path).load_data()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gpt_index/readers/file/base.py", line 181, in load_data
    data = parser.parse_file(input_file, errors=self.errors)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gpt_index/readers/file/docs_parser.py", line 38, in parse_file
    page_text = pdf.pages[page].extract_text()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_page.py", line 1851, in extract_text
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_page.py", line 1342, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_cmap.py", line 28, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_cmap.py", line 194, in parse_to_unicode
    cm = prepare_cm(ft)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_cmap.py", line 207, in prepare_cm
    tu = ft["/ToUnicode"]
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/generic/_data_structures.py", line 266, in __getitem__
    return dict.__getitem__(self, key).get_object()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/generic/_base.py", line 259, in get_object
    obj = self.pdf.get_object(self)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_reader.py", line 1269, in get_object
    retval = self._encryption.decrypt_object(
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_encryption.py", line 761, in decrypt_object
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_encryption.py", line 185, in decrypt_object
    obj._data = self.stmCrypt.decrypt(obj._data)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_encryption.py", line 147, in decrypt
    raise DependencyError("PyCryptodome is required for AES algorithm")
PyPDF2.errors.DependencyError: PyCryptodome is required for AES algorithm
Run this: pip3 install PyCryptodome
It works fine, but what about if instead of using the web form I want to send the request using CURL and receive the response back as plain text
Very useful article, I made the mistake of first putting the Python code in the Docs folder as well, before running it which basically gave me nothing as a result, swapped it to the desktop instead with the docs folder having only the PDF and it worked out perfectly.
Why is it using text-davinci-003 and not GPT-3.5, for example (which is cheaper)?
If I place an Excel file in the docs folder so that I can feed it the desired data, would it know how to convert that data to JSON if I don’t teach it coding itself? What I’m wondering is how much does it already know, or do you literally teach it how to code first? Or are we just building on the knowledge ChatGPT already has, which already knows how to code?
No, you will have to provide the data in JSON or CSV format.
You mention adding new docs.
You mention starting over.
What if I build the index file and then need to turn off my machine, do I have to run app.py next time I want to use it?
If so, will this go through the build process again? It seems from what you said that it will.
Or succinctly “Can I preserve my data between sessions without having to run the index building process again?”
Great article btw easy to follow, I’m just sat waiting for it to finish with my large docs folder! 🙂
Hey, you can add # before the index = construct_index("docs") line once you have created the JSON file. This will make the line ineffective. If you want to train data on new material, remove # and run the code again.
How can I get the chatbot to reply to questions outside its trained data set? I want it to draw its knowledge from the data I provide, but I still want it to be a capable chatbot. At this point it is very one-dimensional! Thanks for the help and great article!!!
Thanks for the free source of codes. I tried it, it worked!
But I have a few questions which need your advice please as I couldn’t find them in the post.
1. On training with my custom data by calling the “construct_index” function, will any of my data be stored on the OpenAI Cloud?
2. What will be contained in the index.json file? How will this file be used in the chatbot Q&A process?
3. Will this program connect to OpenAI Cloud in the chatbot Q&A process? If yes for what purpose, and will my data be stored on the OpenAI Cloud?
4. On this:
INFO:gpt_index.token_counter.token_counter:> [query] Total embedding token usage: 10 tokens
INFO:gpt_index.token_counter.token_counter:> [query] Total LLM token usage: 598 tokens
Is it the imported OpenAI library on my server or the OpenAI Cloud which calculates this token usage?
5. Noted that the token used will be synced to the OpenAI portal dashboard in every few minutes, so this program has to run with internet connectivity?
Thanks much in advance! Your advice will be very helpful to me!
Excellent helpful article thank you Arjun. Instead of using a laptop, can this be hosted on a server and on a cpanel hosted site?
Hi Arjun, that’s a great article, thanks for sharing that. I got a concern here, if you got any idea for that, that would be great,
If we are using custom data to train this AI mode, where is the data store for custom data? I mean will custom data be saved into open ai server? or the data will only keep on our server.
No, the data is stored locally in the index.json file. However, OpenAI LLM is used to draw inferences from the dataset. OpenAI has also said that data submitted to the company through the API won’t be used for its AI training.
I have problems with training the model. Can you share a dataset that should work?
Very nice article. I could set up everything within 10 mins on my Mac.
Can we provide URLs instead of PDF or text documents?
You can actually use wget to download files automatically using URLs. You will have to add “wget URL” before you run the construct_index function.
Dude ! This is davinci model , not the chatGPT api (gpt-3.5 turbo ). Don’t use clickbait !
My friend Albus, you can also use “gpt-3.5-turbo” with the same code, as I have mentioned in the article. I chose Davinci because it’s better at text completion, as opposed to Chat completion for which the Turbo model is suitable.
Moreover, Davinci (text-davinci-003) is also an OpenAI GPT model from the same GPT-3.5 family as Turbo, although it is not the model behind ChatGPT itself.