In our earlier article, we demonstrated how to build an AI chatbot with the ChatGPT API and assign a role to personalize it. But what if you want to train the AI on your own data? For example, you may have a book, financial data, or a large database that you wish to search with ease. In this article, we bring you an easy-to-follow tutorial on how to train an AI chatbot on your custom knowledge base with LangChain and the ChatGPT API. We will use LangChain, GPT Index, and other libraries to train the AI chatbot on OpenAI's Large Language Model (LLM). So on that note, let's check out how to train and create an AI chatbot using your own dataset.
Notable Points Before You Train AI with Your Own Data
1. You can train the AI chatbot on any platform, whether Windows, macOS, Linux, or ChromeOS. In this article, I’m using Windows 11, but the steps are nearly identical for other platforms.
2. The guide is meant for general users, and the instructions are explained in simple language. So even if you have a cursory knowledge of computers and don’t know how to code, you can easily train and create a Q&A AI chatbot in a few minutes. If you followed our previous ChatGPT bot article, it would be even easier to understand the process.
3. Since we are going to train an AI Chatbot based on our own data, it’s recommended to use a capable computer with a good CPU and GPU. However, you can use any low-end computer for testing purposes, and it will work without any issues. I used a Chromebook to train the AI model using a book with 100 pages (~100MB). However, if you want to train a large set of data running into thousands of pages, it’s strongly recommended to use a powerful computer.
4. Finally, the data set should be in English to get the best results, but according to OpenAI, it will also work with popular international languages like French, Spanish, German, etc. So go ahead and give it a try in your own language.
Set Up the Software Environment to Train an AI Chatbot
Install Python and Pip
1. First off, you need to install Python along with Pip on your computer by following our linked guide. Make sure to enable the checkbox for “Add Python.exe to PATH” during installation.

2. To check if Python is properly installed, open the Terminal on your computer and run the below commands one by one; each will print its version number. On Linux and macOS, you will have to use python3 instead of python from now onwards.

python --version
pip --version

3. Run the below command to update Pip to the latest version.
python -m pip install -U pip

Install OpenAI, GPT Index, PyPDF2, and Gradio Libraries
1. Open the Terminal and run the below command to install the OpenAI library.
pip install openai

2. Next, let’s install GPT Index.
pip install gpt_index==0.4.24

3. Now, install Langchain by running the below command.
pip install langchain==0.0.148

4. After that, install PyPDF2 and PyCryptodome to parse PDF files.
pip install PyPDF2
pip install PyCryptodome
5. Finally, install the Gradio library. This is meant for creating a simple UI to interact with the trained AI chatbot.
pip install gradio

Download a Code Editor
Finally, we need a code editor to edit some of the code. On Windows, I would recommend Notepad++ (Download). Simply download and install the program via the attached link. You can also use VS Code on any platform if you are comfortable with powerful IDEs. Other than VS Code, you can install Sublime Text (Download) on macOS and Linux.

For ChromeOS, you can use the excellent Caret app (Download) to edit the code. We are almost done setting up the software environment, and it’s time to get the OpenAI API key.
Get the OpenAI API Key For Free
1. Head to OpenAI’s website (visit) and log in. Next, click on “Create new secret key” and copy the API key. Do note that you can’t copy or view the entire API key later on. So it’s recommended to copy and paste the API key to a Notepad file for later use.

2. Next, go to platform.openai.com/account/usage and check if you have enough credit left. If you have exhausted all your free credit, you need to add a payment method to your OpenAI account.

Train and Create an AI Chatbot With Custom Knowledge Base
Add Your Documents to Train the AI Chatbot
1. First, create a new folder called "docs" in an accessible location like the Desktop. You can choose another location as well according to your preference. However, keep the folder name "docs".

2. Next, move the documents for training inside the "docs" folder. You can add multiple text or PDF files (even scanned ones). If you have a large table in Excel, you can export it as a CSV or PDF file and then add it to the "docs" folder. You can also add SQL database files, as explained in this Langchain AI tweet. I haven't tried many file formats besides the mentioned ones, but you can experiment on your own. For this article, I am adding one of my articles on NFTs in PDF format.
Note: If you have a large document, it will take a longer time to process the data, depending on your CPU and GPU. In addition, it will quickly use your free OpenAI tokens. So in the beginning, start with a small document (30-50 pages or < 100MB files) to understand the process.
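To get a feel for how many OpenAI tokens a document will consume before you index it, you can run a rough estimate. This is a hypothetical helper of my own (not part of the tutorial's script), using the common ~4-characters-per-token heuristic and the ada-002 embedding price of $0.0001 per 1,000 tokens at the time of writing:

```python
# Rough token/cost estimate before indexing (hypothetical helper, not part
# of the tutorial's app.py). Assumes ~4 characters per token and the
# ada-002 embedding price of $0.0001 per 1K tokens at the time of writing.
def estimate_embedding_cost(text: str, usd_per_1k_tokens: float = 0.0001):
    tokens = len(text) / 4                 # crude chars-to-tokens approximation
    cost = tokens / 1000 * usd_per_1k_tokens
    return round(tokens), cost

# Example: a 50-page document at roughly 2,000 characters per page
sample = "x" * (50 * 2000)
tokens, cost = estimate_embedding_cost(sample)
print(tokens, f"${cost:.4f}")  # → 25000 $0.0025
```

The numbers are only ballpark figures, but they help you decide whether a document is small enough for a first test run.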

Make the Code Ready
1. Now, open a code editor like Sublime Text or launch Notepad++ and paste the below code. Once again, I have borrowed heavily from armrrs's code on Google Colab and tweaked it to make it compatible with PDF files and to add a Gradio interface on top.
from gpt_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
from langchain.chat_models import ChatOpenAI
import gradio as gr
import sys
import os

os.environ["OPENAI_API_KEY"] = 'Your API Key'

def construct_index(directory_path):
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600

    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
    llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.7, model_name="gpt-3.5-turbo", max_tokens=num_outputs))

    documents = SimpleDirectoryReader(directory_path).load_data()

    index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
    index.save_to_disk('index.json')

    return index

def chatbot(input_text):
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    response = index.query(input_text, response_mode="compact")
    return response.response

iface = gr.Interface(fn=chatbot,
                     inputs=gr.components.Textbox(lines=7, label="Enter your text"),
                     outputs="text",
                     title="Custom-trained AI Chatbot")

index = construct_index("docs")
iface.launch(share=True)
2. Next, click on "File" in the top menu and select "Save As…". After that, set the file name to app.py and change the "Save as type" to "All types". Then, save the file to the location where you created the "docs" folder (in my case, it's the Desktop).

3. Make sure the "docs" folder and "app.py" are in the same location. The "app.py" file must be outside the "docs" folder, not inside it.
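If you want to double-check this layout before running anything, here is a small optional pre-flight script (my own addition, not part of the tutorial); run it from the folder containing "app.py". SimpleDirectoryReader("docs") fails with a FileNotFoundError when the layout is wrong, which is a common stumbling block.

```python
# Optional pre-flight check (not part of the tutorial's app.py):
# confirms a non-empty "docs" folder sits next to the script.
import os

def check_layout(base="."):
    docs = os.path.join(base, "docs")
    problems = []
    if not os.path.isdir(docs):
        problems.append("no 'docs' folder next to app.py")
    elif not os.listdir(docs):
        problems.append("'docs' folder is empty")
    return problems

print(check_layout() or "layout looks good")
```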

4. Come back to the code again in Notepad++. Here, replace Your API Key with the key that you generated above on OpenAI's website.

5. Finally, press “Ctrl + S” to save the code. You are now ready to run the code.
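For intuition, the chunking values in construct_index (chunk_size_limit=600 tokens per chunk, max_chunk_overlap=20 tokens shared between neighbors) determine roughly how many chunks, and therefore how many embedding calls, a document produces. A quick back-of-the-envelope sketch in plain Python, independent of the gpt_index library (it mirrors a simple sliding window, not the library's exact splitter):

```python
# Rough estimate of how many chunks a document of N tokens produces with
# the script's settings (chunk_size_limit=600, max_chunk_overlap=20).
# A sliding-window approximation, not gpt_index's exact splitting logic.
import math

def approx_chunk_count(total_tokens, chunk_size=600, overlap=20):
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap  # each new chunk advances by this much
    return 1 + math.ceil((total_tokens - chunk_size) / step)

print(approx_chunk_count(15000))  # a ~30-page document → 26
```

Larger chunks mean fewer embedding calls but coarser retrieval; the 600/20 defaults are a reasonable middle ground for a first run.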

Create ChatGPT AI Bot with Custom Knowledge Base
1. First, open the Terminal and run the below command to move to the Desktop. It’s where I saved the “docs” folder and “app.py” file.
cd Desktop

2. Now, run the below command.
python app.py

3. It will start indexing the document using the OpenAI LLM model. Depending on the file size, it will take some time to process the document. Once it’s done, an “index.json” file will be created on the Desktop. If the Terminal is not showing any output, do not worry, it might still be processing the data. For your information, it takes around 10 seconds to process a 30MB document.

4. Once the LLM has processed the data, you will find a local URL. Copy it.

5. Now, paste the copied URL into the web browser, and there you have it. Your custom-trained ChatGPT-powered AI chatbot is ready. To start, you can ask the AI chatbot what the document is about.

6. You can ask further questions, and the ChatGPT bot will answer from the data you provided to the AI. So this is how you can build a custom-trained AI chatbot with your own dataset. You can now train and create an AI chatbot based on any kind of information you want.
Manage the Custom AI Chatbot
1. You can copy the public URL and share it with your friends and family. The link will be live for 72 hours, but you also need to keep your computer turned on since the server instance is running on your computer.
2. To stop the custom-trained AI chatbot, press “Ctrl + C” in the Terminal window. If it does not work, press “Ctrl + C” again.

3. To restart the AI chatbot server, simply move to the Desktop location again and run the below command. Keep in mind, the local URL will be the same, but the public URL will change after every server restart.
python app.py

4. If you want to train the AI chatbot with new data, delete the files inside the “docs” folder and add new ones. You can also add multiple files, but make sure to add clean data to get a coherent response.
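The retraining housekeeping described above can be sketched as a small helper (hypothetical, not part of app.py). Note that app.py overwrites "index.json" on each run anyway, so deleting it is just belt-and-braces:

```python
# Sketch of the "retrain with new data" housekeeping (hypothetical helper,
# not in the tutorial's app.py): clear old documents and remove the stale
# index so the next run of app.py rebuilds everything from scratch.
import os
import shutil

def reset_training_data(docs_dir="docs", index_file="index.json"):
    if os.path.isdir(docs_dir):
        shutil.rmtree(docs_dir)   # drop the old documents
    os.makedirs(docs_dir)         # recreate an empty docs folder
    if os.path.exists(index_file):
        os.remove(index_file)     # app.py would overwrite it anyway
```

After running this, copy your new documents into "docs" and run "python app.py" again.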

5. Now, run the code again in the Terminal, and it will create a new “index.json” file. Here, the old “index.json” file will be replaced automatically.
python app.py

6. To keep track of your tokens, head over to OpenAI’s online dashboard and check how much free credit is left.

7. Lastly, you don’t need to touch the code unless you want to change the API key or the OpenAI model for further customization.
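For reference, these are the knobs in app.py most worth touching, collected here as a sketch (in the actual script they are plain variables inside construct_index and the ChatOpenAI call; the apply_overrides helper is my own illustration):

```python
# The handful of values worth tweaking in app.py, gathered in one place.
# In the real script they are separate variables; this dict is illustrative.
settings = {
    "model_name": "gpt-3.5-turbo",  # e.g. "gpt-4" if your API key has access
    "temperature": 0.7,             # lower = more factual, higher = more creative
    "num_outputs": 512,             # max tokens per reply; raise for longer answers
    "chunk_size_limit": 600,        # tokens per indexed chunk
}

def apply_overrides(base, **overrides):
    # Only accept known keys so a typo fails loudly instead of silently.
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"unknown setting(s): {sorted(unknown)}")
    return {**base, **overrides}

longer_replies = apply_overrides(settings, num_outputs=1024, temperature=0.2)
print(longer_replies["num_outputs"])  # → 1024
```

Raising num_outputs is the first thing to try if responses are getting cut off, a complaint that comes up in the comments below as well.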
nice job! how can i add memory to ask follow-up questions? thanks
Hi, thanks for sharing this. It worked for me but the answers were mostly wrong. I just exported my whatsapp chat with a friend and gave that as sample input. When I tried to ask questions related to the chat, it fell flat.
Any idea how to fix the quality or training related issues?
This can only work with a small amount of data. I have two PDF documents now, ~7KB total. It also doesn't show LLM usage. How do I tell it to run on top of the LLM? It currently is only running on the two documents. Thank you
Traceback (most recent call last):
  File "C:\Users\dhruv\OneDrive\Desktop\app.py", line 1, in <module>
    from gpt_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
ModuleNotFoundError: No module named 'gpt_index'
PS C:\Users\dhruv\OneDrive\Desktop>
What to do
pip install gpt_index
Traceback (most recent call last):
  File "C:\Users\RHASH\OneDrive\Desktop\app.py", line 1, in <module>
    from gpt_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
  File "C:\Users\RHASH\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\__init__.py", line 14, in <module>
    from gpt_index.embeddings.langchain import LangchainEmbedding
  File "C:\Users\RHASH\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\embeddings\langchain.py", line 6, in <module>
    from langchain.embeddings.base import Embeddings as LCEmbeddings
  File "C:\Users\RHASH\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\__init__.py", line 6, in <module>
    from langchain.agents import MRKLChain, ReActChain, SelfAskWithSearchChain
  File "C:\Users\RHASH\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\agents\__init__.py", line 2, in <module>
    from langchain.agents.agent import (
  File "C:\Users\RHASH\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\agents\agent.py", line 17, in <module>
    from langchain.chains.base import Chain
  File "C:\Users\RHASH\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\chains\__init__.py", line 16, in <module>
    from langchain.chains.llm_math.base import LLMMathChain
  File "C:\Users\RHASH\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\chains\llm_math\base.py", line 6, in <module>
    import numexpr
  File "C:\Users\RHASH\AppData\Local\Programs\Python\Python311\Lib\site-packages\numexpr\__init__.py", line 24, in <module>
    from numexpr.interpreter import MAX_THREADS, use_vml, __BLOCK_SIZE1__
ImportError: DLL load failed while importing interpreter: The specified module could not be found.
I'm getting this error: ValueError: chunk_overlap_ratio must be a float between 0. and 1. Can someone please help with this? Thanks
How to connect to telegram chat bot AI custom trained?
How to integrate word and powerpoint files ?
Everything is working well, but when I ask what the capital of America is, the AI gives me an exact response, even though this question is not in my PDF?
Which is the best data files format to train the model? Which is best among txt, json, excel, csv files??
How to connect my MongoDB database to train the chatbot?
I can't install gradio in Ubuntu
Has anyone got this working on the newest versions of Langchain and Llama_index??
hey Nick, Are you able to configure it for Llama_index?
I was able to get things working. It couldn’t find gpt_index as pip was installing to the wrong version of python. I had to force the version by using “py -3.11 pip install …”
Now that I have it running, my gradio app doesn’t return any output in the output window. Any ideas?
I had problems all weekend with authentication error. Since you have the most recent comment, checking if you had the same issue and found a work around?
I got mine working but it lies! I believe the word might be “hallucinates”. I added a CSV of a list of 648 pet food products with prices and other parameters. When I asked how many sold products were cat food it said “one”. When I asked how many total products were cat food it said “three”. Looking at the data, this was clearly incorrect.
Same. Can anyone please help with the authentication error?
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/tenacity/__init__.py", line 326, in iter
    raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[]
I also have the same error, any fix?
I added this after the 'import os' line and it worked:

import openai
openai.api_key = '– YOUR API KEY –'
Have all installed and seemed to be working, except I don’t get the url or the .json file. Here is an error I’m getting at the end
    raise ValueError(
ValueError: A single term is larger than the allowed chunk sizes.
Term size: 607
Chunk sizes: 600
Effective chunk size: 600
Any ideas?
Increase the chunk size, or the document you are using contains characters it doesn't like. I found one PDF was causing this error.
I get this error:
INFO:openai:error_code=None error_message='Rate limit reached for default-text-embedding-ada-002 in organization org-{my org id} on tokens per min. Limit: 150000 / min. Current: 1 / min.
Sometimes it even says that current is 0 / min. I do have 5$ credits on my API key.
I have encountered an error:
INFO:openai:error_code=None error_message='Rate limit reached for default-text-embedding-ada-002 in organization org-vAmu8wHZ5zAeIzaSufri1VTN on tokens per min. Limit: 150000 / min. Current: 1 / min. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method.' error_param=None error_type=tokens message='OpenAI API error received' stream_error=False
However, I do have 5$ credits on my OpenAI account. Is there a way to fix it?
Thanks. How do I view the responses the bot makes after running on the server?
Hello! Your article is great and easy to understand! Thank you so much!
– but I have encountered some questions. Can you help me?
1. There is too little text generated by AI, I don’t know how to make the reply longer.
2. AI quickly forgot about the content in my PDF (in my case, PDF is a role-playing introduction, and I want AI to role-play)
Is there any method that allows us to train a chatbot on a custom set of data ONLY? Where it doesn’t have any knowledge from other outside sources?
This is literally what this article does
Getting this error while trying to run python app.py
C:\Users\Red\Desktop>python app.py
Traceback (most recent call last):
  File "C:\Users\Red\Desktop\app.py", line 37, in <module>
    index = construct_index("docs")
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Red\Desktop\app.py", line 19, in construct_index
    documents = SimpleDirectoryReader(directory_path).load_data()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Red\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\readers\file\base.py", line 92, in __init__
    self.input_files = self._add_files(self.input_dir)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Red\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\readers\file\base.py", line 99, in _add_files
    input_files = sorted(input_dir.iterdir())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Red\AppData\Local\Programs\Python\Python311\Lib\pathlib.py", line 931, in iterdir
    for name in os.listdir(self):
                ^^^^^^^^^^^^^^^^
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'docs'
Traceback (most recent call last):
  File "C:\Users\gofly\Desktop\CUSTAI.PY", line 37, in <module>
    index = construct_index("docs")
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gofly\Desktop\CUSTAI.PY", line 19, in construct_index
    documents = SimpleDirectoryReader(directory_path).load_data()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gofly\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\readers\file\base.py", line 92, in __init__
    self.input_files = self._add_files(self.input_dir)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gofly\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\readers\file\base.py", line 99, in _add_files
    input_files = sorted(input_dir.iterdir())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gofly\AppData\Local\Programs\Python\Python311\Lib\pathlib.py", line 931, in iterdir
    for name in os.listdir(self):
                ^^^^^^^^^^^^^^^^
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'docs'
Kindly help me…. Im stuck
did you find the solution?
Same issue here
FileNotFoundError: [Errno 2] No such file or directory: ‘docs’
Anyone solved it?
I am having valueError: chunk_overlap_ratio must be a float between 0. and 1. Can anyone pls help me with this?
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gradio/routes.py", line 437, in run_predict
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gradio/blocks.py", line 1346, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gradio/blocks.py", line 1074, in call_function
    prediction = await anyio.to_thread.run_sync(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/forrester/Central/docs/app.py", line 25, in chatbot
    index = construct_index("docs")
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/forrester/Central/docs/app.py", line 14, in construct_index
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/llama_index/indices/prompt_helper.py", line 72, in __init__
    raise ValueError("chunk_overlap_ratio must be a float between 0. and 1.")
ValueError: chunk_overlap_ratio must be a float between 0. and 1.
Did you resolve it? I am getting the same error
I am having the same error pls someone solve this
thank you so much u saved my life
Hi Arjun,
getting error: from langchain.schema import BaseLanguageModel
ImportError: cannot import name 'BaseLanguageModel' from 'langchain.schema' (C:\Users\SANJEEV\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\schema.py)
Getting the same error, would love to know how to fix that.
Same exact error here.
I´m getting the same error, someone knows how to fix it? I´ve been trying but can´t fix it
pip install llama-index==0.5.6
pip install langchain==0.0.148
getting the same error
Most likely you have newer versions installed, needs some code tinkering.
That it worked for me
pip install langchain==0.0.118
pip install gpt_index==0.4.24
Try to install/update the following libraries, it may work for you
I’ve tried installing the above but still doesn’t work. Now I get the error
ModuleNotFoundError: No module named ‘gpt_index’
Can anyone help with this? 🙂
I get the same error when following these directions, no module named gpt-index.
pip install langchain==0.0.118 worked for me
Nice tutorial! One question, it doesn’t seem to hold context between inputs, one very important feature. Am I doing something wrong or is this a limit in the code when using the API?
I am getting this on Mac, any pointers on how to solve it? frpc_darwin_arm64-v0.2 has been blocked. I am on a Mac M1; I tried with Rosetta but it did not work.
I was able to make this work after some struggles with python libraries and having to chunk my dataset into smaller bits. But, running this on my laptop, the outcome is REALLY slow. Anybody else experience that?
I get this frpc_darwin_arm64-v0.2 has been blocked. is this file harmful?
The app is working, but GPT is not working well; I used a PDF for the test.
Thanks.
Anyone figure out longer response length? I can't seem to find any way to get longer responses. Everything is always cut off partway through the reply.
How to Train an AI Chatbot With Custom Knowledge Base Using ChatGPT API using asp .net core
Is it possible to execute this custom model programmatically instead of via chat?
I had the problem: "ImportError: cannot import name 'BaseLanguageModel' from 'langchain.schema'"
Thanks to the folks above I:
1) uninstalled langchain. (which had been 0.0.181)
( you can check the version with “pip show langchain”)
2) Installed the older version with :
pip install langchain==0.0.132
It worked.
Thanks! this solved it for me too.
Thanks you !!.
I tried to run the Python code using Visual Studio Code on an old Windows machine, and it ran almost flawlessly except for the following error: "ImportError: cannot import name 'BaseLanguageModel' from 'langchain.schema'", which is fixed by running "pip install langchain==0.0.153" in the VS Code Terminal.
Thanks again for sharing this code that works !!!!
If my data is confidential, will this expose my data to open ai?
Of course; you are sending confidential data to an external entity. Read the agreement that you accepted when you signed up with OpenAI.
worked great on my MAC just had to change “langchain==0.0.153”
Also, if you are having any problems, just copy and paste your error code into Chat GPT, and it will work you through it
thanks
Got it working following your steps and works like a charm. I have a question though,
Does adding a new PDF (along with existing ones) require reloading the .py app?
Also, how do I add websites as a source along with PDFs? I could use this code to build a chatbot for Q&A on websites.
Thank you Arjun, it works.
But what if I want it to write a 3,000-word article?
I had a compatibility problem between GPT_index and langchain, which is why the import did not work. (Import could not find BaseLanguageModel.) Could solve it thanks to ChatGPT 🙂 by: pip install langchain==0.0.153
Hi Arjun
Super document.
I get an error message after pasting "python3 app.py" in the Terminal (no index.json is created):
ImportError: cannot import name 'BaseLanguageModel' from 'langchain.schema' (/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/schema.py)
When installing Python (Mac) there was no choice to "Add Python.exe to PATH".
Thank you
I encounter the same problem. No index.json is created.
ImportError: cannot import name ‘BaseLanguageModel’ from ‘langchain.schema’ (C:\Users\arvinpedrosa\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\schema.py)
I was now able to resolve the same issue with the help of ChatGPT, which provided me with the correct code.
Installing the latest versions of the libraries, I had to make the following modifications to the code to get it to work.
```
from llama_index import SimpleDirectoryReader, LLMPredictor, GPTVectorStoreIndex, PromptHelper, ServiceContext, load_index_from_storage, StorageContext
from langchain.chat_models import ChatOpenAI
import argparse
import gradio as gr
import os

os.environ["OPENAI_API_KEY"] = '— API KEY —'

parser = argparse.ArgumentParser(description="Launch chatbot")
parser.add_argument('-t', '-train', action='store_true', help="Train the model")
parser.add_argument('-i', '-input', default='docs', help='Set input directory path')
parser.add_argument('-o', '-output', default='./gpt_store', help="Set output directory path")
args = parser.parse_args()

# define prompt helper
max_input_size = 4096
num_output = 1000  # number of output tokens
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.7, model_name="gpt-3.5-turbo", max_tokens=num_output))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)

def construct_index():
    print("Constructing index...")
    # load in the documents
    docs = SimpleDirectoryReader(args.i).load_data()
    index = GPTVectorStoreIndex.from_documents(docs, service_context=service_context)
    # save index to disk
    index.set_index_id('vector_index')
    index.storage_context.persist(persist_dir=args.o)
    return index

def chatbot(input_text):
    # If not already done, initialize 'index' and 'query_engine'
    if not hasattr(chatbot, "index"):
        # rebuild storage context and load index
        storage_context = StorageContext.from_defaults(persist_dir=args.o)
        chatbot.index = load_index_from_storage(service_context=service_context, storage_context=storage_context, index_id="vector_index")
        # Initialize query engine
        chatbot.query_engine = chatbot.index.as_query_engine()
        print("Context initialized")
    # Submit query
    response = chatbot.query_engine.query(input_text)
    return response.response

iface = gr.Interface(fn=chatbot,
                     inputs=gr.Textbox(lines=7, label="Enter your text"),
                     outputs="text",
                     title="Custom-trained AI Chatbot")

if args.t:
    construct_index()

iface.launch(share=True)
```
Looks it works with new libs, but when querying using the web interface it throws the following error
raise ValueError(f”No existing {__name__} found at {persist_path}.”)
ValueError: No existing llama_index.storage.kvstore.simple_kvstore found at ./gpt_store\docstore.json.
Could it be the way the app is called? I just ran it using "python app.py" in the same dir as the docs dir. Are there any arguments I need to use on the command line?
Thanks!
run first time with python app.py -t to train data first…
Would it be possible for you to send me the code in another way? I tried copy-pasting and i dont think it copied correctly.
I am new to Python and dont know all the syntax and such yet.
I used this code and it generated the following:
File "C:\Users\arvinpedrosa\Desktop\pdx.py", line 1
    “`
    ^
SyntaxError: invalid character '“' (U+201C)
I had to refer to the LlamaIndex 0.6.8 docs and alter the code like this to get it to work:
```
from llama_index import SimpleDirectoryReader, LLMPredictor, GPTVectorStoreIndex, PromptHelper, ServiceContext, load_index_from_storage, StorageContext
from langchain.chat_models import ChatOpenAI
import gradio as gr
import os

os.environ["OPENAI_API_KEY"] = '— API KEY HERE —'

# define prompt helper
max_input_size = 4096
num_output = 512  # number of output tokens
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.7, model_name="gpt-3.5-turbo", max_tokens=num_output))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)

def construct_index(directory_path):
    # load in the documents
    docs = SimpleDirectoryReader(directory_path).load_data()
    index = GPTVectorStoreIndex.from_documents(docs, service_context=service_context)
    # save index to disk
    index.set_index_id('vector_index')
    index.storage_context.persist(persist_dir="./gpt_store")
    return index

def chatbot(input_text):
    # If not already done, initialize 'index' and 'query_engine'
    if not hasattr(chatbot, "index"):
        # rebuild storage context and load index
        storage_context = StorageContext.from_defaults(persist_dir="./gpt_store")
        chatbot.index = load_index_from_storage(service_context=service_context, storage_context=storage_context, index_id="vector_index")
        # Initialize query engine
        chatbot.query_engine = chatbot.index.as_query_engine()
    # Submit query
    response = chatbot.query_engine.query(input_text)
    return response.response

iface = gr.Interface(fn=chatbot,
                     inputs=gr.Textbox(lines=7, label="Enter your text"),
                     outputs="text",
                     title="Custom-trained AI Chatbot")

index = construct_index("docs")  # comment out after 1st run if training docs aren't changing
iface.launch(share=True)
```
Hi Cameron! I'm working with your code sample and am getting: ModuleNotFoundError: No module named 'langchain.base_language'
I’m using: llama-index 0.6.1 and langchain 0.0.194. Any ideas?
TIA!
Would like to run this off cloud…advise?
Great tutorial, how can I do the same but instead on training with localhost docs it’s a website url?
Thanks!
Even I am interested in knowing the same.
If somebody could help, it would be really helpful
I am getting a timeout error. Can anyone please help me figure out the issue?
The error message is showing
"WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(, 'Connection to api.openai.com timed out. (connect timeout=600)')': /v1/engines/text-embedding-ada-002/embeddings"
I’m facing the error
ImportError: cannot import name 'RequestsWrapper' from 'langchain.utilities' (C:\Users\famira\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\utilities\__init__.py)
hello everyone , i’ve got this as an error:
INFO:openai:error_code=None error_message='You exceeded your current quota, please check your plan and billing details.' error_param=None error_type=insufficient_quota message='OpenAI API error received' stream_error=False
Does this mean that i have a problem with my api key ? should i pay or something ?
Thanks in advance!!
I made the application run, but the chatbot is giving me information that doesn't exist in the data I provided in the docs folder.
How can I solve this issue?
I’m having the same issue
I have done all the steps as mentioned, and it works as well, but I want to use this custom-trained bot in my Flutter project. Can you please tell me how to do that?
I am getting this error:
Traceback (most recent call last):
File "/Users/dan/notes/Hackathon/chat_bot_1.py", line 1, in <module>
from gpt_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gpt_index/__init__.py", line 18, in <module>
from gpt_index.indices.common.struct_store.base import SQLDocumentContextBuilder
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gpt_index/indices/__init__.py", line 4, in <module>
from gpt_index.indices.keyword_table.base import GPTKeywordTableIndex
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gpt_index/indices/keyword_table/__init__.py", line 4, in <module>
from gpt_index.indices.keyword_table.base import GPTKeywordTableIndex
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gpt_index/indices/keyword_table/base.py", line 16, in <module>
from gpt_index.indices.base import DOCUMENTS_INPUT, BaseGPTIndex
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gpt_index/indices/base.py", line 23, in <module>
from gpt_index.indices.prompt_helper import PromptHelper
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gpt_index/indices/prompt_helper.py", line 12, in <module>
from gpt_index.langchain_helpers.chain_wrapper import LLMPredictor
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gpt_index/langchain_helpers/chain_wrapper.py", line 13, in <module>
from gpt_index.prompts.base import Prompt
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gpt_index/prompts/__init__.py", line 3, in <module>
from gpt_index.prompts.base import Prompt
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gpt_index/prompts/base.py", line 9, in <module>
from langchain.schema import BaseLanguageModel
ImportError: cannot import name 'BaseLanguageModel' from 'langchain.schema' (/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/schema.py)
Same here. Any idea?
You can fix this issue with the following command
pip install langchain==0.0.132
Thank you so much!
Hi… Can someone help me with the errors below?
C:\Users\DELL\Desktop>python app.py
Traceback (most recent call last):
File "C:\Users\DELL\Desktop\app.py", line 3, in <module>
import gradio as gr
ModuleNotFoundError: No module named 'gradio'
C:\Users\DELL\Desktop>pip install gradio
Collecting gradio
Using cached gradio-3.28.3-py3-none-any.whl (17.3 MB)
Collecting aiofiles (from gradio)
Using cached aiofiles-23.1.0-py3-none-any.whl (14 kB)
Requirement already satisfied: aiohttp in c:\users\dell\appdata\local\programs\python\python311\lib\site-packages (from gradio) (3.8.4)
Collecting altair>=4.2.0 (from gradio)
Using cached altair-4.2.2-py3-none-any.whl (813 kB)
Collecting fastapi (from gradio)
Using cached fastapi-0.95.1-py3-none-any.whl (56 kB)
Collecting ffmpy (from gradio)
Using cached ffmpy-0.3.0.tar.gz (4.8 kB)
Installing build dependencies … done
Getting requirements to build wheel … error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [18 lines of output]
running egg_info
writing ffmpy.egg-info\PKG-INFO
writing dependency_links to ffmpy.egg-info\dependency_links.txt
writing top-level names to ffmpy.egg-info\top_level.txt
reading manifest file 'ffmpy.egg-info\SOURCES.txt'
writing manifest file 'ffmpy.egg-info\SOURCES.txt'
OSError: [Errno 9] Bad file descriptor
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 353, in <module>
main()
File "C:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 349, in main
write_json(json_out, pjoin(control_dir, 'output.json'), indent=2)
File "C:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 31, in write_json
with open(path, 'w', encoding='utf-8') as f:
OSError: [Errno 9] Bad file descriptor
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
C:\Users\DELL\Desktop>
Hey people, has anyone tried to limit the responses to the custom knowledge base only?
Maybe you can change the temperature to 0 (zero) and try.
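Another lever besides temperature is the prompt itself: prepend an instruction telling the model to refuse anything outside the indexed context. A minimal sketch (the wrapper name and the exact refusal wording are just examples, not from the article; pass the result to the script's index.query() instead of the raw question):

```python
REFUSAL = "I'm sorry, I can't answer that from the provided documents."

def guarded_query(question):
    # Hypothetical wrapper: constrain the model to the indexed context
    # and give it an explicit way to decline out-of-scope questions.
    return (
        "Answer ONLY using the indexed documents. If the answer is not "
        f"contained in them, reply exactly: {REFUSAL}\n\n"
        f"Question: {question}"
    )
```

This doesn't guarantee the model never strays, but combined with temperature 0 it noticeably reduces off-topic answers.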
This code is working for plain text files, but I am getting an error in PyPDF2 when trying to use PDF files. I have tried a number of version combinations of the various packages in an attempt to resolve the problem. Any advice on how to address the issue?
Error messages displayed when executing the code:
Traceback (most recent call last):
File "D:\anaconda3\lib\site-packages\PyPDF2\_reader.py", line 1623, in _read_xref_tables_and_trailers
xrefstream = self._read_pdf15_xref_stream(stream)
File "D:\anaconda3\lib\site-packages\PyPDF2\_reader.py", line 1752, in _read_pdf15_xref_stream
self._read_xref_subsections(idx_pairs, get_entry, used_before)
File "D:\anaconda3\lib\site-packages\PyPDF2\_reader.py", line 1808, in _read_xref_subsections
assert start >= last_end
AssertionError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\nnnn\OneDrive\Desktop\DocBot\chatbot.py", line 37, in <module>
index = construct_index("docs")
File "C:\Users\nnnn\OneDrive\Desktop\DocBot\chatbot.py", line 19, in construct_index
documents = SimpleDirectoryReader(directory_path).load_data()
File "D:\anaconda3\lib\site-packages\gpt_index\readers\file\base.py", line 150, in load_data
data = parser.parse_file(input_file, errors=self.errors)
File "D:\anaconda3\lib\site-packages\gpt_index\readers\file\docs_parser.py", line 30, in parse_file
pdf = PyPDF2.PdfReader(fp)
File "D:\anaconda3\lib\site-packages\PyPDF2\_reader.py", line 319, in __init__
self.read(stream)
File "D:\anaconda3\lib\site-packages\PyPDF2\_reader.py", line 1426, in read
self._read_xref_tables_and_trailers(stream, startxref, xref_issue_nr)
File "D:\anaconda3\lib\site-packages\PyPDF2\_reader.py", line 1632, in _read_xref_tables_and_trailers
raise PdfReadError(f"trailer can not be read {e.args}")
PyPDF2.errors.PdfReadError: trailer can not be read ()
Current version of OS and libraries mentioned in this article:
Windows 11
Python – 3.10.11
OpenAI – 0.27.6
GPT Index – 0.4.24
PyPDF2 – 3.0.0
PyCryptodome – 3.17
Gradio – 3.28.3
How can I extend the length of the output? It always shows a ~300 character maximum.
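For the truncation questions in this thread: the tutorial's script sets the output budget in two places, and both need to agree. A hedged sketch of that part (not a guaranteed fix; the names follow the article's code, the imports are kept inside the function because running it needs the tutorial's gpt_index/langchain versions plus an API key):

```python
def make_components(num_outputs=1024):
    """Sketch: raise the output token budget in BOTH places the
    tutorial sets it, so neither one silently truncates the answer."""
    from gpt_index import LLMPredictor, PromptHelper
    from langchain import OpenAI

    max_input_size = 4096
    max_chunk_overlap = 20
    chunk_size_limit = 600

    # PromptHelper reserves room for num_outputs tokens in the prompt...
    prompt_helper = PromptHelper(max_input_size, num_outputs,
                                 max_chunk_overlap,
                                 chunk_size_limit=chunk_size_limit)
    # ...and max_tokens caps what the model may actually generate.
    llm_predictor = LLMPredictor(
        llm=OpenAI(temperature=0.7, model_name="text-davinci-003",
                   max_tokens=num_outputs))
    return llm_predictor, prompt_helper
```

If answers are still cut short after raising both, the model itself may simply be ending its completion early; that is a prompt/model behavior issue rather than a token limit.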
I have this error on macOS:
from gpt_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
ImportError: No module named gpt_index
Getting the following error:
ImportError: cannot import name 'BaseLanguageModel' from 'langchain.schema'
I’m guessing there’s an updated version of this part of the code:
from langchain.chat_models import ChatOpenAI
Any ideas?
I got the same thing. Did you figure it out?
pip install langchain==0.0.132
Hi, thanks for the great work!
Please allow two questions:
1) When indexing 600 RTF files, the result says INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens, INFO:root:> [build_index_from_documents] Total embedding token usage: 28470527 tokens
Why is it that the LLM is not being used here?
2) What's the downside of using gpt_index==0.4.24 against the most current version… and what can we do to make the code work with the most current version?
Thanks
Marc
Hello, good morning
I'm having this problem when trying to run app.py:
File "C:\Users\julio\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\prompts\base.py", line 9, in <module>
from langchain.schema import BaseLanguageModel
ImportError: cannot import name 'BaseLanguageModel' from 'langchain.schema' (C:\Users\julio\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\schema.py)
Somebody have any idea how to solve it?
Just found it out: https://github.com/hwchase17/langchain/issues/1595
Fixed this by running:
pip install langchain==0.0.107
I had the same issue 🙁
I am getting the same error; would appreciate it if someone could help me get unblocked.
needs a specific version of langchain:
pip install langchain==0.0.153
(credit goes to a StackOverflow page I can't find anymore)
also make sure that your gpt-index version is the one mentioned in the article
In the end, the problem for me was that the new version of llama-index doesn't have the GPTSimpleVectorIndex class. You can get around that by installing an older version –> pip install llama-index==0.5.27 and it should work.
To fix this (I had the same problem!): the class name has changed, so you need to edit the base.py file on line 9 as the error suggests. Change line 9 to the following:
from langchain.schema import BaseModel as BaseLanguageModel
Then it should all work!
figured it out
go here
Python\Python311\Lib\site-packages\gpt_index\prompts\base.py
and change this: from langchain.schema import BaseLanguageModel
to this: from langchain.base_language import BaseLanguageModel
I had a similar issue and resolved it by downgrading the langchain version. Code below.
pip install langchain==0.0.153
I had the same issue when installing gpt_index==0.4.24.
I had to go back to the screenshot and install manually the packages.
pip uninstall llama_index
pip uninstall langchain
pip install langchain==0.0.132
pip install openai==0.27.4
pip install tiktoken==0.3.3
pip install langchain==0.0.132
This worked for me.
fixed this with
pip install langchain==0.0.118
and
pip install gpt_index==0.4.24
same issue. Anyone have some ideas?
I have the same error. I am 100% sure I have followed all the instructions as written.
This stack overflow answer worked for me:
https://stackoverflow.com/questions/76153016/gpt-chatbot-not-working-after-using-open-ai-imports-and-langchain
After installing the new libraries I was also prompted to add the transformers library
Then everything worked like a charm
Good Luck!
Running this fixed the BaseLanguageModel error for me:
pip install langchain==0.0.118
Thank you for the very easy to understand tutorial !
Wonderful !
Is there an easy way (tutorial) to make the chatbot I trained public permanently, not just for 72 hours?
That would be fantastic 🙂
Thx !
I would be interested in that as well.
As I understood it, this creates a fine-tuned ChatGPT.
Is this available on OpenAI, or is it always part of the created index?
Thank you very much, working perfectly. How do I change the code so that, instead of reading a preloaded file, it can upload a new PDF file?
Place the new PDF file in the docs folder.
Can something similar be done by using Databricks Dolly 2.0 instead of OpenAI ?
Thanks for the great tutorial! It's very easy to follow and useful. But I have one problem… the output text is being cut off in the middle, for both English and Korean queries. Could you help me fix this issue?
Thanks Arjun for this very helpful article. Can I ask what changes to make to the code to ensure it does not answer any questions outside the scope of the custom knowledge base?
I also have the same question; it would be great if anyone finds the solution.
Yes very interested to know this!
After an initial round of training with a set of docs, how do I add more docs to the training without having to re-train the ai on all the docs? i.e. is it possible to save previous training sessions and then come back later to improve them with new docs without having to start over?
Hi Arjun,
Thank you very much for this tutorial! It works well, I would have only one question: Is there a way to make the maximal length of the answers longer? It cuts the response around 900 characters. I tried different settings, increased num_outputs, tried different models including gpt-4-32k-0314 but no luck. Also printed the response to the log to make sure not the UI truncates it, but the response itself is truncated.
For the thing that I’m working on I would need longer answers, around 4000 characters.
Thank you in advance!
Am running this on a Linux VM and it works. However, I have two concerns:
1) The terminal shows this warning: "None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used." Does this affect the effectiveness of the model?
2) How do I safely exit the terminal and leave the session running? I tried using the screen command but it is not helping either.
This is an excellent tutorial, many thanks. Can you tell me how I can limit the responses to the source content ONLY? And, where the answer is not contained within the created model, have it reply "I'm sorry, I can't answer that"?
Did you get the answer?
Hey Marcus, did you ever find an answer for this? I would like to know too!
Very interested to know this too! Anyone have any ideas?
Thanks! How can we switch between models, and which model do you suggest for the best value for money?
Hello,
I used this solution, and the problem I am facing is that every time I stop the web server, all the trained knowledge vanishes; even if there is only one additional file I need to train on, I have to train the model with all the files combined.
This consumes a lot of tokens. Is there a way to overcome this?
Secondly, whenever I ask it a question after training, it is probably scanning through all the content embeddings, so the tokens consumed are much higher than when I ask the same query in the playground.
How do I reduce the consumption of tokens in both cases?
Thank you for this. I got it to work but it triggers an Avast virus alert whenever I run app.py.
Threats: FileRepMalware and Win64:Malware-gen
File: frpc_windows_amd64_v0.2
Path: C:\Users\user1\AppData\Local\Programs\Python\Python310\Lib\site-packages\gradio\frpc_windows_amd64_v0.2
Anyone getting this?
i am…
I’m getting this as well with a different anti-virus program. Did you guys figure out what’s causing this? It also uninstalled Anaconda from my machine. arg
I am getting this on Mac; any pointers on how to solve it? frpc_darwin_arm64-v0.2 has been blocked. I am on a Mac M1; I tried with Rosetta but it did not work.
you can keep older files and add new ones. it will create a new index.json file based on all files in the docs folder i.e old and new.
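If re-embedding the whole docs folder each time is too expensive, the gpt_index version used in this article also exposes an insert() method on the index, so you can load the saved index.json and embed only the new files. A sketch under that assumption (the function and directory names are illustrative; the imports sit inside the function because actually running it needs the tutorial's package versions and an OpenAI API key):

```python
def add_new_documents(index_path="index.json", new_docs_dir="new_docs"):
    """Sketch: insert new documents into an existing saved index
    instead of rebuilding and re-embedding everything."""
    from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader

    # Load the previously built index rather than constructing a new one
    index = GPTSimpleVectorIndex.load_from_disk(index_path)
    for doc in SimpleDirectoryReader(new_docs_dir).load_data():
        index.insert(doc)  # only the new document is embedded (and billed)
    index.save_to_disk(index_path)
    return index
```

This avoids paying embedding tokens again for documents that are already in the index; only the inserted files are sent to the embeddings endpoint.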
Hi Arjun, I've integrated ChatGPT with WhatsApp and it's fine. After reading your article on training the chatbot, I tried to do the same with the WhatsApp chatbot, but it is not working. Could you help me?
Request URL = https://api.openai.com/v1/completions
key = Content-Type / value = application/json
key = Authorization / value = Bearer sk-********s7
"root":{
"model":"text-davinci-003",
"prompt":"{Prompt}",
"max_tokens":1000,
"temperature":0
}
I am also trying to integrate with WhatsApp but it's not working. I tried exporting a text file of chats and using that to train the model, but that also didn't work. Has anyone else had success?
You did really great work, friend!!!
Bengali brother!
Excellent article. Thanks!
How could we change the settings so we get longer responses?
(changing response_mode="default" and num_outputs = 1024 didn't work!)
Thanks
Hi Arjun Sha,
Can I make the response longer? The current answer is too short to be meaningful. Thanks!
Remus
Thanks Arjun, it worked on my side after 30 minutes of following your steps.
I just have one query on the PDF content build-up, since I would like to build my knowledge base and finally convert it into a PDF.
For the content that we build into the PDF, is there any guideline, or do we just follow the normal PDF structure?
Arjun,
I have built a custom chatbot using the above article, but I am getting weird responses: the bot always greets on each response.
The bot is not giving proper results.
Can you please help me out?
Same problem I am facing. Can't distinguish if it's trained data or generic.
How do I maintain and pass the previous chat history for further chat?
Like chatting on the same topic with the chatbot.
Curious about this too
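The tutorial's script sends each question on its own, so the model has no memory of earlier turns. One simple approach, a hedged sketch (the class name and prompt format are just examples, not from the article): keep a rolling transcript and prepend it to each query before passing it to index.query(). Note the transcript counts against the input token limit, hence the cap on retained turns.

```python
class ChatHistory:
    """Minimal rolling transcript to give the bot conversational context."""
    def __init__(self, max_turns=5):
        self.max_turns = max_turns
        self.turns = []          # list of (question, answer) pairs

    def add(self, question, answer):
        self.turns.append((question, answer))
        self.turns = self.turns[-self.max_turns:]  # drop the oldest turns

    def build_prompt(self, question):
        # Prepend prior Q/A pairs so follow-up questions stay on topic
        history = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.turns)
        return f"{history}\nQ: {question}" if history else f"Q: {question}"
```

After each response comes back, call add() with the question and answer, then use build_prompt() for the next query.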
Hi Arjun,
Very nice article. The steps are precise and worked well for me. Many thanks for the tutorial. However, for clarity, I want to know a few points:
1. Once the index.json file and local URL are created, why do we have to run "python app.py" again and again before using the chatbot? It is very awkward to run the command repeatedly after starting the command prompt. Can we use a batch file to include all the steps and then just double-click it to run the app.py code as well as the local URL?
Please provide the snippet for this if it is possible.
2. How do we make the responses longer? Is the "compact" phrase in the code affecting the output? If yes, what else can we use in place of "compact" for a longer, explanatory response?
3. Can we not save the chats locally in a chat history? What changes need to be made for this in the code or the Gradio interface?
4. For the "adding documents" topic, I want to clarify one thing. If I add a new PDF file to the "docs" folder and run the "python app.py" command, will the chatbot forget the old training and be trained on the new document only? OR is the new training material just added/appended to the old training data? If the latter is true, then we can add data step by step, chapter-wise, and it will be easier to handle.
5. What arrangements and changes in the code are needed for placing the chatbot in my blog (I have no blog yet, but I am thinking of starting one on my subject) so that it will run with a single click?
6. Is there any other publicly available free API, like the ChatGPT API, which can be imported free of cost in the Python code for training the chatbot?
Dear Mr. Arjun, if you take some time to clarify the above queries, it will be immensely helpful to all your audience.
Many thanks again.
Hi, have you found an answer on how to make the responses longer? Thank you!
Thank you for posting this tutorial! I get an error message in my output window when I tell the chat assistant to summarize the document within the docs folder. Could this error be due to a ChatGPT software update? Thanks again!
Getting
INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 8386 tokens
Running on local URL: http://127.0.0.1:7860
Could not create share link. Please check your internet connection or our status page: https://status.gradio.app
Install gradio version 2.9b50 instead and try again. See if it works.
Amazing!!! Finally got it to work after going through everyone's comments – had many similar errors. I just spent $1k on Upwork to do this exact same thing and their solution wasn't as elegant. Thanks – can't wait to expand on this…
Hello Arjun,
Great article. It is really helping me to start building on chatgpt.
I am getting the following error when I run app.py:
ImportError: cannot import name 'Mapping' from 'collections' (/usr/lib/python3.10/collections/__init__.py)
python version: 3.10.10
pip version: 23.0.1
Please suggest
I got the following error message:
ValueError: Encountered text corresponding to disallowed special token ''.
Could you please help?
Hi Arjun, THANKS for making this blog. The problem I'm facing is that when I submit any query in the chatbot,
it gives me an error message.
Can you assist here?
Your response will be appreciated.
Hello, I am getting this error; could you support me?
\CHATP\scripts>python app.py
\CHATP\Lib\site-packages\gradio\inputs.py:27: UserWarning: Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components
warnings.warn(
\CHATP\Lib\site-packages\gradio\deprecation.py:40: UserWarning: `optional` parameter is deprecated, and it has no effect
warnings.warn(value)
\CHATP\Lib\site-packages\gradio\deprecation.py:40: UserWarning: `numeric` parameter is deprecated, and it has no effect
warnings.warn(value)
Traceback (most recent call last):
File "\CHATP\scripts\app.py", line 37, in <module>
index = construct_index("Docs")
^^^^^^^^^^^^^^^^^^^^^^^
File "\CHATP\scripts\app.py", line 21, in construct_index
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "\CHATP\Lib\site-packages\gpt_index\indices\vector_store\vector_indices.py", line 94, in __init__
super().__init__(
File "\CHATP\Lib\site-packages\gpt_index\indices\vector_store\base.py", line 58, in __init__
super().__init__(
TypeError: BaseGPTIndex.__init__() got an unexpected keyword argument 'llm_predictor'
did you manage to solve this issue?
Arjun, thanks for this. Between the code you wrote and fixes in the comments, I was able to get it to work. It works much better with text, however. I had limited success with PDF scanning, even though they were excellent PDF docs.
Do you have any insight into how to properly format text data so it actually learns better? I have a document with a bunch of rules for a residential community (Section 1.13, Rule 3 with sub-rules a,b and c.). Sometimes the Bot can accurately answer my questions about these rules, but generally it’s not very good. Garbage in, garbage out, so it must be my formatting.
Hey Arjun, it's a great article, but I am not able to run the app.py file stored on the desktop. It says Python could not execute the said file: file or directory does not exist.
[ ERROR 2]
C:\Users\Administrator\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\gradio\inputs.py:27: UserWarning: Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components
warnings.warn(
C:\Users\Administrator\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\gradio\deprecation.py:40: UserWarning: `optional` parameter is deprecated, and it has no effect
warnings.warn(value)
C:\Users\Administrator\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\gradio\deprecation.py:40: UserWarning: `numeric` parameter is deprecated, and it has no effect
warnings.warn(value)
Traceback (most recent call last):
File "C:\Users\Administrator\Desktop\app.py", line 37, in <module>
index = construct_index("docs")
File "C:\Users\Administrator\Desktop\app.py", line 21, in construct_index
index = GPTSimpleVectorIndex( documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper )
File "C:\Users\Administrator\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\gpt_index\indices\vector_store\vector_indices.py", line 94, in __init__
super().__init__(
File "C:\Users\Administrator\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\gpt_index\indices\vector_store\base.py", line 58, in __init__
super().__init__(
TypeError: BaseGPTIndex.__init__() got an unexpected keyword argument 'llm_predictor'
Any advice?
Hi Arjun,
nice article.
quick question: How do we change the loopback address to a network adapter so that I can create an internal URL?
Hi. Novice here! This is a great tutorial! I am not getting a index.json file on my desktop. Also no URL and misc errors. Grateful for any suggestions!
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gradio/inputs.py:27: UserWarning: Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components
warnings.warn(
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gradio/deprecation.py:40: UserWarning: `optional` parameter is deprecated, and it has no effect
warnings.warn(value)
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gradio/deprecation.py:40: UserWarning: `numeric` parameter is deprecated, and it has no effect
warnings.warn(value)
Traceback (most recent call last):
File "/Users/christine/Desktop/app.py", line 37, in <module>
index = construct_index("docs")
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/christine/Desktop/app.py", line 19, in construct_index
documents = SimpleDirectoryReader(directory_path).load_data()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gpt_index/readers/file/base.py", line 181, in load_data
data = parser.parse_file(input_file, errors=self.errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gpt_index/readers/file/tabular_parser.py", line 103, in parse_file
df = pd.read_csv(file, **self._pandas_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 605, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
self._engine = self._make_engine(f, self.engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
return mapping[engine](f, **self.options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 79, in __init__
self._reader = parsers.TextReader(src, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pandas/_libs/parsers.pyx", line 547, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 636, in pandas._libs.parsers.TextReader._get_header
File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 1965, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 1420: invalid start byte
Traceback (most recent call last):
File "c:\Users\lenovo\Desktop\chatgpt\app.py", line 42, in <module>
index = construct_index("docs")
^^^^^^^^^^^^^^^^^^^^^^^
File "c:\Users\lenovo\Desktop\chatgpt\app.py", line 21, in construct_index
documents = SimpleDirectoryReader(directory_path).load_data()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\lenovo\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\readers\file\base.py", line 181, in load_data
data = parser.parse_file(input_file, errors=self.errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\lenovo\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\readers\file\tabular_parser.py", line 103, in parse_file
df = pd.read_csv(file, **self._pandas_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\lenovo\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\util\_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\lenovo\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\util\_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\lenovo\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\lenovo\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py", line 605, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\lenovo\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py", line 1442, in __init__
self._engine = self._make_engine(f, self.engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\lenovo\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py", line 1753, in _make_engine
return mapping[engine](f, **self.options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\lenovo\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 79, in __init__
self._reader = parsers.TextReader(src, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pandas\_libs\parsers.pyx", line 547, in pandas._libs.parsers.TextReader.__cinit__
File "pandas\_libs\parsers.pyx", line 636, in pandas._libs.parsers.TextReader._get_header
File "pandas\_libs\parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1965, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 1115: invalid start byte
I got this error while running the code. Please help me.
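The UnicodeDecodeError in the tracebacks above means one of the files in docs (here a CSV) isn't UTF-8; byte 0xa0 is typically a non-breaking space from a Windows-1252 file. One workaround is to re-save the offending file as UTF-8 before indexing. A sketch, assuming the source encoding is cp1252 (check what yours actually is; the function name is just an example):

```python
from pathlib import Path

def reencode_to_utf8(path, source_encoding="cp1252"):
    """Rewrite a file as UTF-8 so pandas/SimpleDirectoryReader can parse it."""
    p = Path(path)
    # Decode with the assumed legacy encoding; replace anything unmappable
    text = p.read_bytes().decode(source_encoding, errors="replace")
    p.write_text(text, encoding="utf-8")
    return text
```

Run it once on the failing file (e.g. reencode_to_utf8("docs/data.csv") for a hypothetical file name), then rebuild the index.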
Arjun, thanks so much for sharing this. You are my hero. Has anyone figured out how to make the response longer? The current answer is too short to be meaningful. Thanks in advance.
Thanks for this solution. It doesn't seem to be working for me, although I am not a developer, so I'm likely missing something simple.
I now get a name error:
File "Desktop\app.py", line 39, in <module>
index = construct_index("docs")
^^^^^^^^^^^^^^^^^^^^^^^
File "Desktop\app.py", line 19, in construct_index
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
^^^^^^^^^^^^^^
NameError: name 'ServiceContext' is not defined. Did you mean: 'service_context'?
Here is what I used:
def construct_index(directory_path):
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.7, model_name="text-davinci-003", max_tokens=num_outputs))
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
    documents = SimpleDirectoryReader(directory_path).load_data()
    index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
    index.save_to_disk('index.json')
    return index
You have to import ServiceContext from gpt_index, Lana.
Thank you!
add ServiceContext in the import statement like below
from gpt_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex, LLMPredictor, PromptHelper, ServiceContext
Thank you!
Hi Arjun, that's a great article and all is up and running. Still, I have a question: it seems to be impossible to read more than one PDF file in the docs folder. Is this correct? How can this be changed to add multiple documents to feed the chatbot?
I am getting this back:
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/gpt_index/indices/vector_store/vector_indices.py", line 94, in __init__
super().__init__(
File "/usr/local/lib/python3.11/site-packages/gpt_index/indices/vector_store/base.py", line 57, in __init__
super().__init__(
TypeError: BaseGPTIndex.__init__() got an unexpected keyword argument 'llm_predictor'
How can I solve it? Thank you
I think I have figured out how to solve the llm_predictor error. Here is my modification to the construct_index function:
def construct_index(directory_path):
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.7, model_name="text-davinci-003", max_tokens=num_outputs))
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
    documents = SimpleDirectoryReader(directory_path).load_data()
    index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
    index.save_to_disk('index.json')
    return index
Note that you also need to add ServiceContext to the gpt_index import statement.
Can I trouble you to elaborate on "add ServiceContext method in gpt_index import call", please?
= = =
Traceback (most recent call last):
File "C:\Users\default\Desktop\app.py", line 59, in
index = construct_index("docs")
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\default\Desktop\app.py", line 38, in construct_index
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
^^^^^^^^^^^^^^
NameError: name 'ServiceContext' is not defined. Did you mean: 'service_context'?
Thanks much for this. Is there something wrong with this?
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
Because I get this error message:
Traceback (most recent call last):
File "C:\Users\ccccc\Desktop\app.py", line 38, in
index = construct_index("docs")
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\ccccc\Desktop\app.py", line 22, in construct_index
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\ccccc\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\indices\vector_store\vector_indices.py", line 94, in __init__
super().__init__(
File "C:\Users\ccccc\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\indices\vector_store\base.py", line 57, in __init__
super().__init__(
TypeError: BaseGPTIndex.__init__() got an unexpected keyword argument 'llm_predictor'
Hello Ken, I faced the same problem. It seems they changed their documentation today.
Instead of:
index = GPTSimpleVectorIndex(
    documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper
)
use this:
index = GPTSimpleVectorIndex.from_documents(documents)
Hope this helps. Please feel free to give feedback if this method doesn't work either 🙂
Does the local data need to be uploaded to ChatGPT via the API?
I mean, in this training model, will all the data be stored on the local server, so ChatGPT doesn't capture anything and store it in the cloud?
Thank you.
Hi Arjun,
I used Dutch PDF files to train with and asked Dutch questions. That worked like a charm.
One question: what happens with the data? In my company, they are afraid it will be used to train ChatGPT and our internal data will become widely available. I could not find information about it.
I am getting something like this:
Traceback (most recent call last):
File "C:\chatgpt\app.py", line 37, in
index = construct_index("docs")
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\chatgpt\app.py", line 21, in construct_index
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\TMS\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\indices\vector_store\vector_indices.py", line 94, in __init__
super().__init__(
File "C:\Users\TMS\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\indices\vector_store\base.py", line 57, in __init__
super().__init__(
TypeError: BaseGPTIndex.__init__() got an unexpected keyword argument 'llm_predictor'
Please let me know what can be done. Thanks in anticipation.
Hi Arjun Sha
Amazing tutorial. When running python app.py, I get this error:
File "C:\Users\USUARIO\Desktop\Chatbot Project\env\lib\site-packages\gpt_index\indices\vector_store\vector_indices.py", line 94, in __init__
super().__init__(
File "C:\Users\USUARIO\Desktop\Chatbot Project\env\lib\site-packages\gpt_index\indices\vector_store\base.py", line 57, in __init__
super().__init__(
TypeError: BaseGPTIndex.__init__() got an unexpected keyword argument 'llm_predictor'
I appreciate your help. Thanks in advance.
Hi, I’m getting the following error, can anyone point me in the right direction for a resolution?
Traceback (most recent call last):
File "C:\Users\Kealy\Desktop\Chatbot\app.py", line 37, in
index = construct_index("docs")
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Kealy\Desktop\Chatbot\app.py", line 21, in construct_index
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Kealy\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\indices\vector_store\vector_indices.py", line 94, in __init__
super().__init__(
File "C:\Users\Kealy\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\indices\vector_store\base.py", line 57, in __init__
super().__init__(
TypeError: BaseGPTIndex.__init__() got an unexpected keyword argument 'llm_predictor'
Had this problem as well.
Change the first line to:
from llama_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex, LLMPredictor, PromptHelper, ServiceContext
Change line 21 from:
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
to:
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
This has been super easy to follow (non-coder here); however, I am stuck on the last step and get the following error:
TypeError: BaseGPTIndex.__init__() got an unexpected keyword argument 'llm_predictor'
Any ideas?
Instead of:
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
use this:
index = GPTSimpleVectorIndex.from_documents(documents)
Hope this helps. Please feel free to give feedback if this method doesn't work either 🙂
I replaced that line and am still getting the error: TypeError: BaseGPTIndex.__init__() got an unexpected keyword argument 'llm_predictor'
Since we removed llm_predictor and prompt_helper, we have pretty much removed everything; we only have this:
def construct_index(directory_path):
    documents = SimpleDirectoryReader(directory_path).load_data()
    index = GPTSimpleVectorIndex.from_documents(documents)
    index.save_to_disk('index.json')
    return index
Which works but gives an empty index.
Hi Arjun, thanks a lot, this worked for me. Can I add some logos to this?
Also, how can I call my own software's APIs from this? For example, I want to download a report of the last week from my login. Can I call my software's APIs in this too?
How do I make it so the chatbot can output longer responses? What if I want it to answer a question in 1000 words for example?
How can I make the responses longer? I have increased "num_outputs", but as far as I can tell, that does not change the output size. I noticed this line in your code: response = index.query(input_text, response_mode="compact"). Is this affecting the response size in the output box? If so, what can be used instead of "compact"? Thank you.
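For what it's worth, here is a sketch of the two settings in this script that most directly affect answer length. The exact behavior depends on your gpt_index version, so treat the numbers and the suggestion about response_mode as assumptions to experiment with:

```python
# num_outputs is passed to the LLM as max_tokens, so it caps how many
# tokens the model may generate; raise it for longer answers.
num_outputs = 1024  # was 512; roughly 750 words of output budget

# The query call in the article pins response_mode="compact", which stuffs
# the retrieved chunks into as few prompts as possible. Trying the default
# mode may give the index more room to synthesize a longer answer:
# response = index.query(input_text)  # instead of response_mode="compact"
```

If answers are still short, also check that your prompt itself asks for a detailed response; the model tends to stop early otherwise.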
OK, got it working here, trained on the complete works of Shakespeare …
Can you give me an example of how to route the inputs and outputs to an IRC channel instead of using Gradio?
I am way too newbie at this.
Hey, I'm not a programmer. How can I extend the length of the output? It always shows a maximum of ~300 characters.
Getting an SSL exception for this. Not sure how to resolve it. I tried adding the certificate to the "Trusted Root Certification Authorities" store, but still no relief. Could you suggest any possible solution?
Error:
requests.exceptions.SSLError: HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): Max retries exceeded with url: /gpt-2/encodings/main/vocab.bpe (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:992)')))
I’m facing the same error. Did you find any solution?
Someone please help with this one. Thanks!
Hello, I am getting this error: ImportError: cannot import name 'BaseOutputParser' from 'langchain.output_parsers'. Does anyone know why?
OK, so I fixed it myself. The last working langchain version is 0.0.118; in 0.0.119 the parser was renamed/removed.
pip3 install langchain==0.0.118
Now it's working.
Yes this worked for me as well
I'm getting this on the line "from langchain.output_parsers import BaseOutputParser as LCOutputParser".
The versions are all the same as in the post. Also, I checked the source code, and it makes sense that it can't find it, so what's going on?
Hi Arjun
Thank you very much for this useful article.
However, I have run into a bit of a challenge.
The script executes and the chatbot interface works; however, when I ask it questions, it provides answers that are not based on the specific articles I provided. They seem to be broad, generic answers.
For example, when I ask it to "list the authors", it cites unknown authors from completely different articles.
1. What could be the reason for this?
2. Where in the script is the input directory it will search for these documents and train on?
3. Can a model trained on a specific dataset be exported?
Thanking you in advance.
Hi Arjun,
This was the best tutorial I’ve looked at, and I’ve been trying to follow a few. Thank you so much. Is there a simple way around step 7, where it could run through the cloud, for example?
Great article! Very helpful! I have the following error (the PDF size is ~94 MB; it works fine if the PDF size is less than 1 MB):
————
(gpt) zoneson@zoneson-sony:~/AI/GPT/training$ python app.py
/home/zoneson/AI/GPT/gpt/lib/python3.10/site-packages/gradio/inputs.py:27: UserWarning: Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components
warnings.warn(
/home/zoneson/AI/GPT/gpt/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: `optional` parameter is deprecated, and it has no effect
warnings.warn(value)
/home/zoneson/AI/GPT/gpt/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: `numeric` parameter is deprecated, and it has no effect
warnings.warn(value)
Traceback (most recent call last):
File "/home/zoneson/AI/GPT/gpt/lib/python3.10/site-packages/tenacity/__init__.py", line 382, in __call__
result = fn(*args, **kwargs)
File "/home/zoneson/AI/GPT/gpt/lib/python3.10/site-packages/gpt_index/embeddings/openai.py", line 142, in get_embeddings
assert len(list_of_text) <= 2048, "The batch size should not be larger than 2048."
AssertionError: The batch size should not be larger than 2048.
The above exception was the direct cause of the following exception:
…
———————————————————-
Is it due to my PC memory size or free tier?
Thanks!
Hey! Did you figure out how to solve this?
Yes, this is a limitation that I have been unable to fix.
It requires some tweaking with the code. You will have to use wget or curl to download new files and keep the docs folder updated.
OpenAI offers $5 of free credit. Once you have exhausted it, you have to buy the API access.
Hi John, you can actually add documents of various topics and file formats in one folder. It will work without any issues.
Try updating Gradio: pip install --upgrade gradio
Dear Arjun,
Many thanks for this super article. The steps were very easy to follow!
I can get the chatbot to work on the local URL, but the code is not giving the public URL, so I can't see it on the web and share it.
Is there an easy solution to get the public URL? Or instructions on how to deploy it on a web interface?
Many thanks for you help!
Gradio might be under heavy load. Try later, and also update Gradio to the latest version: pip install --upgrade gradio
Hi Arjun, thanks for the guide.
Step 4 in 'Create ChatGPT AI Bot with Custom Knowledge Base' doesn't produce a URL. I'm using macOS.
here is my output:
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gradio/inputs.py:27: UserWarning: Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components
warnings.warn(
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gradio/deprecation.py:40: UserWarning: `optional` parameter is deprecated, and it has no effect
warnings.warn(value)
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gradio/deprecation.py:40: UserWarning: `numeric` parameter is deprecated, and it has no effect
warnings.warn(value)
Traceback (most recent call last):
File "/Users/iyarbinyamin/Desktop/acks.py", line 37, in
index = construct_index("docs")
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/iyarbinyamin/Desktop/acks.py", line 19, in construct_index
documents = SimpleDirectoryReader(directory_path).load_data()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gpt_index/readers/file/base.py", line 181, in load_data
data = parser.parse_file(input_file, errors=self.errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/gpt_index/readers/file/docs_parser.py", line 38, in parse_file
page_text = pdf.pages[page].extract_text()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_page.py", line 1851, in extract_text
return self._extract_text(
^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_page.py", line 1342, in _extract_text
cmaps[f] = build_char_map(f, space_width, obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_cmap.py", line 28, in build_char_map
map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_cmap.py", line 194, in parse_to_unicode
cm = prepare_cm(ft)
^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_cmap.py", line 207, in prepare_cm
tu = ft["/ToUnicode"]
~~^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/generic/_data_structures.py", line 266, in __getitem__
return dict.__getitem__(self, key).get_object()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/generic/_base.py", line 259, in get_object
obj = self.pdf.get_object(self)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_reader.py", line 1269, in get_object
retval = self._encryption.decrypt_object(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_encryption.py", line 761, in decrypt_object
return cf.decrypt_object(obj)
^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_encryption.py", line 185, in decrypt_object
obj._data = self.stmCrypt.decrypt(obj._data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_encryption.py", line 147, in decrypt
raise DependencyError("PyCryptodome is required for AES algorithm")
PyPDF2.errors.DependencyError: PyCryptodome is required for AES algorithm
any advice?
Run this: pip3 install PyCryptodome
This solved it for me. Thank you!!!
It works fine, but what if, instead of using the web form, I want to send the request using curl and receive the response back as plain text?
Very useful article. I made the mistake of first putting the Python code in the docs folder as well before running it, which basically gave me nothing as a result. I moved it to the desktop instead, with the docs folder containing only the PDF, and it worked out perfectly.
Why is it using text-davinci-003 and not GPT-3.5, for example (which is cheaper)?
If I place an Excel file in the docs folder to feed it the desired data, would it know how to convert that data to JSON if I don't teach it coding itself? What I'm wondering is how much it already knows. Do you literally teach it how to code first, or are we just building on the knowledge ChatGPT already has, which already knows how to code?
No, you will have to provide the data in JSON or CSV format.
You mention adding new docs.
You mention starting over.
What if I build the index file and then need to turn off my machine? Do I have to run app.py the next time I want to use it?
If so, will it go through the build process again? From what you said, it seems that it will.
Or, succinctly: "Can I preserve my data between sessions without having to run the index-building process again?"
Great article, by the way, easy to follow. I'm just sitting here waiting for it to finish with my large docs folder! 🙂
Thanks.
Hey, you can add # before the index = construct_index("docs") line once you have created the JSON file. This will make the line ineffective. If you want to train on new material, remove the # and run the code again.
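To avoid editing the script each time, a minimal sketch of an automatic check (the gpt_index calls are shown as comments because they require the library; `needs_rebuild` is a hypothetical helper name):

```python
import os

def needs_rebuild(index_path: str = "index.json") -> bool:
    # Only rebuild the index when no saved copy exists on disk.
    return not os.path.exists(index_path)

# In app.py, guard the expensive call with this check:
# if needs_rebuild():
#     index = construct_index("docs")  # builds and saves index.json
# else:
#     index = GPTSimpleVectorIndex.load_from_disk("index.json")
```

This way, the first run builds index.json and every later run just loads it, with no manual commenting required.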
How can I get the chatbot to reply to questions outside its trained dataset? I want it to draw its knowledge from the data I provide, but I still want it to be a capable chatbot. At this point, it is very one-dimensional! Thanks for the help, and great article!!!
Thanks for the free source code. I tried it, and it worked!
But I have a few questions I need your advice on, please, as I couldn't find the answers in the post.
1. When training with my custom data by calling the "construct_index" function, will any of my data be stored in the OpenAI cloud?
2. What is contained in the index.json file? How is this file used in the chatbot Q&A process?
3. Does this program connect to the OpenAI cloud during the chatbot Q&A process? If yes, for what purpose, and will my data be stored in the OpenAI cloud?
4. On this:
INFO:gpt_index.token_counter.token_counter:> [query] Total embedding token usage: 10 tokens
INFO:gpt_index.token_counter.token_counter:> [query] Total LLM token usage: 598 tokens
Is it the imported OpenAI library on my server or the OpenAI cloud that calculates this token usage?
5. I noted that the tokens used are synced to the OpenAI portal dashboard every few minutes, so does this program have to run with internet connectivity?
Thanks much in advance! Your advice will be very helpful to me!
Excellent, helpful article, thank you Arjun. Instead of using a laptop, can this be hosted on a server and on a cPanel-hosted site?
Hi Arjun, that's a great article, thanks for sharing it. I have a concern here; if you have any idea about it, that would be great.
If we are using custom data to train this AI model, where is the custom data stored? I mean, will the custom data be saved on the OpenAI server, or will the data only be kept on our server?
No, the data is stored locally in the index.json file. However, OpenAI LLM is used to draw inferences from the dataset. OpenAI has also said that data submitted to the company through the API won’t be used for its AI training.
I have problems with training the model. Can you share a dataset that should work?
Very nice article. I could set everything up within 10 minutes on my Mac.
Can we provide URLs instead of PDF or text documents?
You can actually use wget to download files automatically using URLs. You will have to add "wget URL" before you run the construct_index function.
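For example, something like this before running app.py (the URL is a placeholder and wget must be installed on your system):

```shell
# Create the docs folder if it doesn't exist, then fetch a file into it.
mkdir -p docs
# wget -P docs "https://example.com/report.pdf"   # placeholder URL
echo "files in docs: $(ls docs | wc -l)"
```

After the download, run the script as usual so construct_index("docs") picks up the new files.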
Thanks for all!
I am new to Python. Can you share an example of the line to add URLs?
Thanks again.
Dude! This is the Davinci model, not the ChatGPT API (gpt-3.5-turbo). Don't use clickbait!
My friend Albus, you can also use "gpt-3.5-turbo" with the same code, as I have mentioned in the article. I chose Davinci because it's better at text completion, as opposed to chat completion, for which the Turbo model is suitable.
Moreover, Davinci is also a GPT model from OpenAI (GPT-3, to be precise), while Turbo is GPT-3.5.
It's working! Can you help me add code so that it will answer in the Slovenian language?
Will this work with GPT-4?