Large Language Models (LLMs) are an advanced type of language model that represents a breakthrough in the field of natural language processing (NLP). These models are designed to understand and generate human-like text by leveraging the power of deep learning algorithms and massive amounts of data.
In this learn-by-building project, I will create a Question Answering System using an LLM through OpenAI’s model APIs. We will dive deeper into the information contained in the ‘Reviews of Universal Studios’ dataset, which includes more than 50,000 reviews of three Universal Studios branches (Florida, Singapore, Japan) posted by visitors on the TripAdvisor website.
Environment Set-up¶
Using LangChain will usually require integrations with one or more model providers, data stores, APIs, etc. For this example, we’ll use OpenAI’s model APIs.
from dotenv import load_dotenv
load_dotenv()
True
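For reference, load_dotenv() reads environment variables from a local .env file in the project directory. A minimal sketch of what that file might contain (the key value is a placeholder; OPENAI_API_KEY is the variable name the OpenAI integration expects):

# .env (value is a placeholder -- replace with your own key)
OPENAI_API_KEY=sk-...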
LangChain Quickstart¶
In LangChain, a Quickstart involves working with three key components: Prompt, Chain, and Agent.
With the Prompt, Chain, and Agent components working together, we can engage in interactive conversations with the language model. The Prompt sets the context or initiates the conversation, the Chain links the prompt and the model into a single workflow, and the Agent decides which actions to take and manages the communication between the user and the language model.
Using these components, we can build dynamic and interactive applications that involve back-and-forth interactions with the language model, allowing us to create conversational agents, chatbots, question-answering systems, and more.
To use the LangChain library with an OpenAI language model, we should:
- Import the required module: import the OpenAI class with the statement from langchain import OpenAI.
- Create an OpenAI instance: create an instance of the OpenAI class and assign it to the variable llm. This instance represents the connection to the OpenAI language model.
- Set the temperature parameter: the temperature parameter is passed to the OpenAI instance during its initialization. Temperature controls the randomness of the language model’s output. A higher value (e.g., 0.9) makes the generated text more diverse and creative, while a lower value (e.g., 0.2) makes it more focused and deterministic.
from langchain import OpenAI
llm = OpenAI(temperature=0.2)
By creating an instance of OpenAI
and setting the desired temperature, we can now use the llm
object to interact with the OpenAI language model. We can pass prompts or messages to the llm
object, receive the generated responses, and customize the behavior of the language model using additional parameters and methods provided by the LangChain library.
By simply providing a prompt in Bahasa (Indonesian language), we can obtain a generated text response in Bahasa as well. This showcases the versatility of language models like LangChain in understanding and generating text in various languages, allowing for multilingual applications and interactions.
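As a quick illustration, here is a minimal sketch of such a prompt (the Indonesian prompt below is an assumed example, not the one from the original run):

# Prompt the model in Bahasa Indonesia; the reply typically comes back in the same language
prompt_id = "Sebutkan tiga wahana yang paling populer di Universal Studios."
print(llm.predict(prompt_id))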
Prompt Templates¶
LLM applications typically utilize a prompt template instead of directly inputting user queries into the LLM. This approach involves incorporating the user input into a larger text context known as a prompt template.
A prompt template is a structured format designed to generate prompts in a consistent manner. It consists of a text string, referred to as the “template,” which can incorporate various parameters provided by the end user to create a dynamic prompt.
The prompt template can include:
- Instructions to guide the language model’s response.
- A set of few-shot examples to assist the language model in generating more accurate and contextually appropriate outputs (see the sketch after this list).
- A question posed to the language model.
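To make the few-shot idea concrete, here is a minimal sketch using LangChain’s FewShotPromptTemplate (the example products and brand names are made up for illustration):

from langchain.prompts import FewShotPromptTemplate, PromptTemplate

# Hypothetical examples pairing a product description with a brand name
examples = [
    {"product": "colorful socks", "name": "Rainbow Feet"},
    {"product": "eco-friendly water bottles", "name": "PureSip"},
]

# How each example is rendered inside the prompt
example_prompt = PromptTemplate(
    input_variables=["product", "name"],
    template="Product: {product}\nBrand name: {name}",
)

few_shot_prompt = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
    prefix="Suggest a good brand name for each product.",
    suffix="Product: {input}\nBrand name:",
    input_variables=["input"],
)

print(few_shot_prompt.format(input="park and resort"))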
In the previous example, the text passed to the model contained instructions to generate a brand name based on a given description. In our application, it would be convenient for users to only provide the description of their company or product without the need to explicitly provide instructions to the model.
To create a prompt template using LangChain, we begin by importing the PromptTemplate
class from the langchain.prompts
module. This class allows us to create and manipulate prompt templates.
from langchain.prompts import PromptTemplate
Create a prompt template: Use the PromptTemplate.from_template()
method to create a PromptTemplate
object from the template string.
In this case, the template string is “What is a good name for a brand that makes {product}?”, where {product}
acts as a placeholder for the product name.
# Create a prompt template
template_prompt = PromptTemplate.from_template("What is a good name for a brand that makes {product}?")
# Format the prompt template
prompt = template_prompt.format(product="park and resort")
# Print the prompt
print(prompt)
What is a good name for a brand that makes park and resort?
Notice that the instruction changes automatically based on user input; this instruction will be passed to llm to generate the response. Let’s get the response generated by the language model (llm) based on the given prompt.
print(llm.predict(prompt))
Paradise Parks & Resorts
Chain¶
Now that we have our model and prompt template, we can combine them by creating a “chain”. Chains provide a mechanism to link or connect multiple components, such as models, prompts, and other chains.
The most common type of chain is an LLMChain, which involves passing the input through a PromptTemplate and then to an LLM. We can create an LLMChain using our existing model and prompt template.
For example, if we want to generate a response using our template, our workflow would be as follows:
- Create the prompt based on input with
template_prompt
prompt = template_prompt.format(product="Paradise Parks & Resorts")
print(prompt)
What is a good name for a brand that makes Paradise Parks & Resorts?
- Generate response from prompt with
llm
print(llm.predict(prompt))
Paradise Getaways.
We can simplify the workflow by chaining (linking) them together with Chains.
# Import LLMChain class from langchain
from langchain.chains import LLMChain
# Chain the prompt template and llm
chain = LLMChain(llm=llm, prompt=template_prompt)
# Execute the chained model and prompt template
print(chain.run('Paradise Parks & Resorts'))
Paradise Getaways.
Agents¶
In more complex workflows, it becomes crucial to have the ability to make decisions and choose actions based on the given context. This is where agents come into play.
Agents utilize a language model to determine which actions to take and in what sequence. They have a set of tools at their disposal, and they continually select, execute, and evaluate these tools until they arrive at the optimal solution. Agents provide a dynamic and adaptable approach to problem-solving within the LangChain framework, allowing for more sophisticated and flexible workflows.
To load an agent in LangChain, you need to consider the following components:
- LLM/Chat model: This refers to the language model that powers the agent. It is responsible for generating responses based on the given input. You can choose from various pre-trained models or use your own custom models.
- Tools: Tools are functions or methods that perform specific tasks within the agent’s workflow. These can include actions like Google Search, Database lookup, Python REPL (Read-Eval-Print Loop), or even other chains. LangChain provides a set of predefined tools with their specifications, which you can refer to in the Tools documentation.
- Agent name: The agent name is a string that identifies a supported agent class. Each agent class is parameterized by the prompt that the language model uses to determine the appropriate action to take. In this context, we will focus on using the standard supported agents, rather than implementing custom agents. You can explore the list of supported agents and their specifications to choose the most suitable one for your application.
For the specific example mentioned, we will utilize the wikipedia
tool to query and retrieve responses based on Wikipedia information. This tool allows the agent to access relevant information from Wikipedia and provide informative responses based on the given input.
Import the required modules: The code starts by importing the necessary modules from LangChain, such as AgentType
, initialize_agent
, and load_tools
. These modules provide the functionalities required to create and configure the agent.
from langchain.agents import AgentType, initialize_agent, load_tools
Define the language model for the agent: In this example, the llm_agent
is initialized with the OpenAI
class, which represents the language model. The temperature
parameter determines the level of randomness in the generated responses.
# The language model we're going to use to control the agent.
llm_agent = OpenAI(temperature=0)
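What remains is wiring the tool and the agent together. A minimal sketch of that step (the agent type, variable names, and example question are assumptions; the wikipedia tool also requires the wikipedia Python package to be installed):

# Load the wikipedia tool and initialize a standard ReAct-style agent
tools = load_tools(["wikipedia"], llm=llm_agent)
agent_wiki = initialize_agent(
    tools,
    llm_agent,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)

# Ask a question the agent can answer by looking things up on Wikipedia
agent_wiki.run("When did Universal Studios Singapore open to the public?")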
Build Question Answering System¶
Introduction to Question-Answer System¶
As we know, LangChain is an open-source library that provides developers with powerful tools for building applications using Large Language Models (LLMs). In our previous example, we saw how we could use an LLM to generate responses based on a given question. However, there may be cases where we need to ask more specific questions related to our business domain. For instance, we might want to ask the LLM about our company’s top revenue-generating product.
LLMs have certain limitations when it comes to specific contextual knowledge, as they are trained on a vast amount of general information. To overcome this limitation, we can provide additional documents or context to the LLM. The idea is to retrieve relevant documents related to our question from a corpus or database and then pass them along with the original question to the LLM. This allows the LLM to generate a response that is informed by the specific information contained in the retrieved documents.
These documents can come from various sources such as databases, PDF files, plain text files, or even information extracted from websites. By connecting and feeding these documents to the LLM, we can build a powerful Question-Answer System that leverages the LLM’s language generation capabilities while incorporating domain-specific knowledge.
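To make this retrieve-then-ask pattern concrete, here is a minimal sketch using LangChain’s FAISS vector store and RetrievalQA chain, reusing the llm instance defined earlier (the sample texts and question are made up for illustration, and the faiss-cpu package is required):

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Hypothetical domain documents; in practice these would come from a database, PDFs, or a website
texts = [
    "Universal Studios Singapore opened in 2010 on Sentosa Island.",
    "The Wizarding World of Harry Potter is one of the most visited areas in the Florida park.",
]

# Index the documents so that passages relevant to a question can be retrieved
vectorstore = FAISS.from_texts(texts, OpenAIEmbeddings())

# The chain retrieves the most relevant documents and passes them to the LLM together with the question
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())
print(qa_chain.run("Where is Universal Studios Singapore located?"))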
In this section, we will explore how to connect and feed a database and text information to an LLM to build a Question-Answer System that can provide contextually relevant answers to specific business-related questions.
Connecting to CSV¶
Structured data is not only stored in database files; it can also be stored in other formats such as .xlsx
and .csv
, which represent data in a tabular form with columns and rows.
To begin, let’s define the file path of our dataset universal_studio_branches.csv
, which contains a bunch of reviews from Universal Studios visitors.
filepath = "universal_studio_branches.csv"
Next, we will create an agent specifically designed for working with CSV data. This agent will allow us to query and retrieve information from the universal_studio_branches.csv
dataset. Since we are using the same LLM model as in the SQL part, there is no need to redefine the LLM. We can utilize the existing LLM model for our CSV agent.
from langchain.agents import create_csv_agent
agent = create_csv_agent(llm, filepath, verbose=True)
agent.run("How many review records are in the dataset")
> Entering new chain...
Thought: I need to know the size of the dataframe
Action: python_repl_ast
Action Input: df.shape
Observation: (50904, 6)
Thought: I now know the final answer
Final Answer: 50,904 review records are in the dataset.
> Finished chain.
'50,904 review records are in the dataset.'
The dataset contains 50,904 review records.
Next, I will ask the model to describe the dataset:
agent.run("Give me a detail info about the dataset")
> Entering new chain...
Thought: I need to know what the dataframe contains
Action: python_repl_ast
Action Input: df.info()
Observation: <class 'pandas.core.frame.DataFrame'>
RangeIndex: 50904 entries, 0 to 50903
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   reviewer      50904 non-null  object
 1   rating        50904 non-null  float64
 2   written_date  50904 non-null  object
 3   title         50904 non-null  object
 4   review_text   50904 non-null  object
 5   branch        50904 non-null  object
dtypes: float64(1), object(5)
memory usage: 2.3+ MB
Thought: I now know the details of the dataset
Final Answer: The dataset contains 50904 entries, with 6 columns. The columns are 'reviewer', 'rating', 'written_date', 'title', 'review_text', and 'branch'. The data types are float64 (1) and object (5).
> Finished chain.
"The dataset contains 50904 entries, with 6 columns. The columns are 'reviewer', 'rating', 'written_date', 'title', 'review_text', and 'branch'. The data types are float64 (1) and object (5)."
agent.run("Give me top 5 best review from the dataset")
> Entering new chain...
Thought: I need to sort the dataframe by rating
Action: python_repl_ast
Action Input: df.sort_values(by='rating', ascending=False).head()
Observation:
           reviewer  rating       written_date  \
50903  sc_myinitial     5.0  February 24, 2010
41325     Tanisha K     5.0    August 25, 2016
21235        Owen M     5.0       May 17, 2015
21234        jaquic     5.0       May 17, 2015
21233    Soliman715     5.0       May 17, 2015

                                   title  \
50903            Excellent Sneak Preview
41325  Travel triangle singapore package
21235                  Top class holiday
21234                        So much Fun
21233        Not enough time in the day.

                                             review_text  \
50903  My group managed to get the tickets for the 16...
41325  We booked a singapore package through travel t...
21235  Great location for your holiday but make sure ...
21234  We had the best day in this Park. The queues w...
21233  We had a '5 days ticket' over a 10 day holiday...

                            branch
50903  Universal Studios Singapore
41325  Universal Studios Singapore
21235    Universal Studios Florida
21234    Universal Studios Florida
21233    Universal Studios Florida
Thought: I now know the top 5 best reviews
Final Answer: The top 5 best reviews from the dataset are:
1. My group managed to get the tickets for the 16... (Universal Studios Singapore)
2. We booked a singapore package through travel t... (Universal Studios Singapore)
3. Great location for your holiday but make sure ... (Universal Studios Florida)
4. We had the best day in this Park. The queues w... (Universal Studios Florida)
5. We had a '5 days ticket' over a 10 day holiday... (Universal Studios Florida)
> Finished chain.
"The top 5 best reviews from the dataset are: \n1. My group managed to get the tickets for the 16... (Universal Studios Singapore) \n2. We booked a singapore package through travel t... (Universal Studios Singapore) \n3. Great location for your holiday but make sure ... (Universal Studios Florida) \n4. We had the best day in this Park. The queues w... (Universal Studios Florida) \n5. We had a '5 days ticket' over a 10 day holiday... (Universal Studios Florida)"
agent.run("Give me top 5 worst review from the dataset")
> Entering new chain...
Thought: I need to find the reviews with the lowest ratings
Action: python_repl_ast
Action Input: df[df['rating'] == df['rating'].min()].head()
Observation:
      reviewer  rating  written_date  \
1          Jon     1.0  May 30, 2021
5         John     1.0  May 28, 2021
8      Chuck N     1.0  May 27, 2021
10      Paul S     1.0  May 26, 2021
13  Kimberly T     1.0  May 24, 2021

                                            title  \
1                            Food is hard to get.
5                          This is not a vacation
8     Greed makes for a terrible guest experience
10                   Same old Orlando experience.
13  Parking and Guest Services TERRIBLE!!!!!!!!!!

                                          review_text  \
1   The food service is horrible. I’m not reviewin...
5   Worst experience I have ever had the rides are...
8   Universal is one thing - Not Disney. Everythin...
10  I'm literally standing in a line for the Hagri...
13  We went to City Walk due to being with our qua...

                       branch
1   Universal Studios Florida
5   Universal Studios Florida
8   Universal Studios Florida
10  Universal Studios Florida
13  Universal Studios Florida
Thought: I now know the top 5 worst reviews
Final Answer: The top 5 worst reviews from the dataset are:
1. Food is hard to get.
2. This is not a vacation
3. Greed makes for a terrible guest experience
4. Same old Orlando experience.
5. Parking and Guest Services TERRIBLE!!!!!!!
> Finished chain.
'The top 5 worst reviews from the dataset are: \n1. Food is hard to get.\n2. This is not a vacation\n3. Greed makes for a terrible guest experience\n4. Same old Orlando experience.\n5. Parking and Guest Services TERRIBLE!!!!!!!'
agent.run("How many percent a positive review from this datasets?")
> Entering new chain...
Thought: I need to calculate the percentage of positive reviews
Action: python_repl_ast
Action Input: df[df['rating'] > 3].count() / df.count()
Observation: reviewer        0.819503
rating          0.819503
written_date    0.819503
title           0.819503
review_text     0.819503
branch          0.819503
dtype: float64
Thought: I now know the final answer
Final Answer: 81.95%
> Finished chain.
'81.95%'
agent.run("How many percent a negative review from this datasets?")
> Entering new chain...
Thought: I need to calculate the percentage of negative reviews
Action: python_repl_ast
Action Input: df[df['rating'] < 3]['rating'].count() / df['rating'].count()
Observation: 0.07777384881345277
Thought: I now know the final answer
Final Answer: 7.78% of the reviews are negative.
> Finished chain.
'7.78% of the reviews are negative.'
agent.run("How many percent a neutral review from this datasets?")
> Entering new chain...
Thought: I need to calculate the percentage of neutral reviews
Action: python_repl_ast
Action Input: df[df['rating'] == 3].shape[0] / df.shape[0]
Observation: 0.10272277227722772
Thought: I now know the final answer
Final Answer: 10.27% of the reviews are neutral.
> Finished chain.
'10.27% of the reviews are neutral.'
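As a quick sanity check (not part of the original notebook), the same percentages can be computed directly with pandas, using the same rating thresholds the agent chose:

import pandas as pd

# Recompute the sentiment shares from the raw ratings
df = pd.read_csv(filepath)
n = len(df)
print(f"positive (rating > 3): {(df['rating'] > 3).sum() / n:.2%}")
print(f"neutral  (rating = 3): {(df['rating'] == 3).sum() / n:.2%}")
print(f"negative (rating < 3): {(df['rating'] < 3).sum() / n:.2%}")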
import matplotlib.pyplot as plt
from textblob import TextBlob
agent.run("can you show me a visualization of the sentiment review from the datasets?")
> Entering new chain...
Thought: I need to create a visualization of the sentiment from the data
Action: python_repl_ast
Action Input: df['rating'].value_counts().plot(kind='bar')
Observation: Axes(0.125,0.11;0.775x0.77)
Thought: I now know the final answer
Final Answer: The visualization of the sentiment review from the datasets is a bar chart showing the count of ratings.
> Finished chain.
'The visualization of the sentiment review from the datasets is a bar chart showing the count of ratings.'
agent.run("see how count of each rating using histogram plot as per branch")
> Entering new chain...
Thought: I need to group the dataframe by branch and rating
Action: python_repl_ast
Action Input: df.groupby(['branch', 'rating']).size().reset_index(name='count').hist(by='branch', column='count')
Observation: [[<Axes: title={'center': 'Universal Studios Florida'}>
  <Axes: title={'center': 'Universal Studios Japan'}>]
 [<Axes: title={'center': 'Universal Studios Singapore'}> <Axes: >]]
Thought: I now know the final answer
Final Answer: The histogram plot shows the count of each rating for each branch.
> Finished chain.
'The histogram plot shows the count of each rating for each branch.'
agent.run("I need see a plot that showing histogram plot with x=rating, y=count, hue=branch")
> Entering new chain...
Thought: I need to use a visualization tool to plot the data
Action: python_repl_ast
Action Input: df.hist(column='rating', by='branch')
Observation: [[<Axes: title={'center': 'Universal Studios Florida'}>
  <Axes: title={'center': 'Universal Studios Japan'}>]
 [<Axes: title={'center': 'Universal Studios Singapore'}> <Axes: >]]
Thought: I now know the final answer
Final Answer: Use the command `df.hist(column='rating', by='branch')` to create a histogram plot with x=rating, y=count, hue=branch.
> Finished chain.
"Use the command `df.hist(column='rating', by='branch')` to create a histogram plot with x=rating, y=count, hue=branch."
agent.run("Please compile each plot above as one plot histogram")
> Entering new chain...
Thought: I need to visualize the data
Action: python_repl_ast
Action Input: df['rating'].hist()
Observation: Axes(0.125,0.11;0.775x0.77)
Thought: I now know the final answer
Final Answer: A histogram of the ratings from the dataframe df can be generated by using the command df['rating'].hist().
> Finished chain.
"A histogram of the ratings from the dataframe df can be generated by using the command df['rating'].hist()."
Summary¶
Through OpenAI’s model, especially with the LangChain Agent component, we can ask questions about the Universal Studios review dataset.
When we asked for information about the dataset, the generative AI returned the same details we would get from running df.info() in Python: the dataset contains 50,904 entries, with 6 columns. The columns are ‘reviewer’, ‘rating’, ‘written_date’, ‘title’, ‘review_text’, and ‘branch’. The data types are float64 (1) and object (5).
When we asked for the top 5 best reviews from the dataset, it said:
The top 5 best reviews from the dataset are:
- My group managed to get the tickets for the 16… (Universal Studios Singapore)
- We booked a singapore package through travel t… (Universal Studios Singapore)
- Great location for your holiday but make sure … (Universal Studios Florida)
- We had the best day in this Park. The queues w… (Universal Studios Florida)
- We had a ‘5 days ticket’ over a 10 day holiday… (Universal Studios Florida)
When we asked for the top 5 worst reviews from the dataset, it said:
The top 5 worst reviews from the dataset are:
- Food is hard to get.
- This is not a vacation
- Greed makes for a terrible guest experience
- Same old Orlando experience.
- Parking and Guest Services TERRIBLE!!!!!!!
We also found that 81.95% of the reviews are positive, 7.78% are negative, and 10.27% are neutral.
We can also ask the generative AI to produce a visualization, but we need to install and import the required libraries first.
The main difficulty in using generative AI lies in translating technical requirements into simple, natural language.