How to Build a Text Generation Application Using GPT-3

At Visage, we promote a culture of learning and sharing, and we encourage all employees to share their great ideas. The following article was written by one of our interns. Lina Bekdouche is completing her end-of-study internship with Visage, where she joined the data team as an Intern Data Scientist. She had the opportunity to work on different AI projects that aim to improve and facilitate Visage's recruiting process in ways that benefit its users.

Introduction

The rise of artificial intelligence in our lives has had an impact on every industry and field, from banking and journalism to medicine and law. One of the most exciting developments in AI is the ability to generate text, which can be used to create unique and interesting content.

In this article, I will explain how we can use the OpenAI GPT-3 model, one of the most sophisticated language models to date, to build a text generation application, and discuss the ways in which you can use it to build your own. I will also talk about some of the limitations of text generation and ways you can overcome them to generate content that is still engaging and human.

The application described in this article was one of the projects developed during my internship at Visage.jobs, a crowdsourcing company that combines human expertise and AI to find the best candidates.

What is GPT-3?

Well actually, the Generative Pre-trained Transformer 3 (GPT-3) is not one single model but a family of models. Each model in the family has a different number of trainable parameters (and different capabilities).

It is the third generation of language prediction models in the GPT-n series created by OpenAI. The most impressive feature of GPT-3 is that it is a meta-learner; it has learned to learn. You can ask it in natural language to perform a new task and it “understands” what it has to do, in a way analogous to how a human would. Each model of the GPT-3 family can perform a huge number of tasks, such as: language translation, question answering, chatbots, code generation, text summarization, text classification, information retrieval, and many others.

We will see later which model is best suited for which task.

How does it work?

Architecture: GPT-3 has the same transformer-based architecture as GPT-2. A transformer adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data to give the model more contextual information.

Training process: GPT-3 was pre-trained in a generative, unsupervised manner. The model is presented with text that is converted into token vectors, which flow through the stacked attention layers, and the model predicts the next word. Early in training its predictions are mostly wrong; we calculate the error in each prediction and update the model so that next time it predicts better. This process then repeats millions of times.

https://cdn-images-1.medium.com/max/1600/0*TdGV-l5PZ5Gb7tY3.gif

Source figure : https://jalammar.github.io/how-gpt3-works-visualizations-animations/

Training dataset: It was trained on huge Internet text datasets totaling 570 GB. When it was released, its largest version was the largest neural network ever trained, with 175 billion parameters (over 100x GPT-2). Thus, it has excellent potential for automation across various industries, from customer service to documentation generation.

https://cdn-images-1.medium.com/max/1600/0*8KPrpzp4PE8hmXxu.png

Datasets used to train GPT-3

GPT-3 engines: We said earlier that GPT-3 is a family of models; let's get to know the different engines in more depth:

Davinci: The most capable engine; it can perform any task the other engines perform, with fewer instructions. It is mostly used for applications requiring a deep understanding of the content, like summarization and creative content generation. However, it is the most expensive model, since it requires a lot of computing resources, and it is also the slowest.

Curie: About 1/10 the price of Davinci, and much faster. Even if it is not as good at complex text analysis, Curie is still extremely powerful and quite capable for many nuanced tasks like complex classification and summarization, as well as question answering and general chatbot services.

Babbage: Good at straightforward tasks like simple classification and semantic search ranking.

Ada: The fastest model; good at parsing text, simple classification, address correction, and keyword extraction.

https://cdn-images-1.medium.com/max/1600/0*MTDQXb-j6WI9_9wP.png

Classification of the different GPT-3 Engines

To find the right model for your application, it is advised to start with the Davinci model to find the right instructions and settings, and then test the faster models to see whether they are up to the task as well.

How to use it

We first need to know that GPT-3 takes as input what we call a prompt, which contains the instruction for the task. The paper mentions that better results are achieved by also showing the model one or a few examples of what it needs to do (called One-Shot and Few-Shot learning respectively).
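To make the idea concrete, here is what a few-shot prompt might look like for our use case; the instruction wording and the examples are hypothetical, not taken from the article's experiments:

```
Generate a catchy subject line for a job offer email.

Job: Backend Engineer at Globex, remote
Subject: Join Globex as a Backend Engineer, from anywhere

Job: Product Designer at Initech, Berlin
Subject: Shape Initech's products as our next Product Designer

Job: Senior Data Scientist at Acme, Paris
Subject:
```

The model then continues the text after the last “Subject:”, which is exactly the completion we want.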

https://cdn-images-1.medium.com/max/1600/0*vaa6msNu7mCp6AoZ.png

Overall workflow of GPT-3

GPT-3 Access

Right now anyone can get their hands on the GPT-3 models: on the OpenAI platform you can simply create a profile and get access to the API. Information about pricing is found on this page. For experimenting with the different models and their applications there is the Playground page.
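Once you have an API key, a minimal completion request looks roughly like this, assuming the openai Python package as it existed at the time (pre-1.0 API); the engine name is just an example:

```python
import os
import openai

# Authenticate with the key from your OpenAI account
openai.api_key = os.environ["OPENAI_API_KEY"]

# Ask the completion endpoint to continue a prompt
response = openai.Completion.create(
    engine="text-davinci-002",  # example engine name from that era
    prompt="Say hello to the readers of this article.",
    max_tokens=20,
)

print(response["choices"][0]["text"].strip())
```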

Building your application

Now that we have a better understanding of the GPT-3 models, we will see the main steps that you should follow to be able to create your own application using GPT-3.

In this article we’re building a Subject Line Generator, the subject lines that we want to generate are for job offer emails sent to potential candidates.

Step 1: Knowing what we want as output

GPT-3 offers multiple endpoints depending on the task you want it to do. Here we want to generate text, so we are using the completion endpoint. You can check out the documentation to find out which endpoint you need: for example, if your task is to classify text, you need the classification endpoint. All endpoints use the same models mentioned earlier; what differs is the input given.

What we want in the generated subject lines and why it’s important:

Generating a good subject line is very important, according to OptinMonster:

  • 47% of recipients decide to open an email based on the subject line
  • 69% of email recipients report email as spam based solely on the subject line.

And even if the email is not opened, people see the subject line and build their perception of the brand according to what was said in it.

But how can we know what kind of subject lines are good? There is no specific answer to this question, but there are some guidelines we can follow to increase our chances of having our email opened:

  • The subject line must accurately reflect the content of the message.
  • Start subject lines with action verbs (get, join, check).
  • Make the subject line clear and easy to read.
  • Keep the subject line to 8 words or fewer.
  • Avoid generic sentences.

We’ll try to evaluate our output according to those guidelines.

Step 2: Getting a database for testing

We have a database that contains the job offers' information (title, company, location, …). We will select the main fields that we want to appear in the subject line, or that would give the model more context, so that it can generate something adapted to each job, insightful and creative.

We will use everything except the country; it's not really necessary in our case.
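To make this concrete, a job record could be flattened into a short context string for the prompt like this; the field names and example values are hypothetical, not Visage's actual schema:

```python
# Hypothetical job record mirroring the kind of fields mentioned above
job = {
    "title": "Senior Data Scientist",
    "company": "Acme",
    "location": "Paris",
}

def job_context(job: dict) -> str:
    """Flatten the selected fields into one line of context for the prompt."""
    return f"{job['title']} at {job['company']}, {job['location']}"

print(job_context(job))  # Senior Data Scientist at Acme, Paris
```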

Step 3: Prompt design

Prompt design is the key element for a good output when it comes to GPT-3. According to the original paper, the model should receive one clear instruction about what it should do, and if the task is complex, we should add examples to the input to show the model what is expected of it.

To see what’s best for our application we tested few different forms of prompts :

  • Zero-shot learner: We give it only an instruction, without any examples, and let the model create.
https://cdn-images-1.medium.com/max/1600/0*UqmjQPHh2nw9YVtQ.png

Python function used for Zero Shot Learning
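As a readable stand-in for the screenshot, here is a minimal sketch of what such a function could look like with the pre-1.0 openai package; the function name, prompt wording, and engine name are my own assumptions:

```python
import openai  # assumes openai.api_key was set as shown earlier

def generate_subject_zero_shot(job_info: str) -> str:
    """Generate a subject line from the instruction alone, with no examples."""
    prompt = (
        "Generate a catchy subject line for a job offer email.\n"
        f"Job: {job_info}\n"
        "Subject:"
    )
    response = openai.Completion.create(
        engine="text-davinci-002",  # example engine name
        prompt=prompt,
        max_tokens=13,
        temperature=0.7,
    )
    return response["choices"][0]["text"].strip()
```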

Results:

Comments: With only the instruction, the model was able to generate correct subject lines. However, the results are not perfect yet:

  • Too long and very generic.
  • Too much information.
  • Not very appealing.
  • Generation of stray “\n” characters.
  • Sometimes the model cuts sentences in half because it's limited in tokens.

We will try other prompts to see if we can achieve the results we want.

  • One-shot learner: We give it the instruction to follow, plus one example of the output we want for a given input.

Python function used for One Shot Learning
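Again as a stand-in for the screenshot, a one-shot version differs only by the single worked example embedded in the prompt; the example text is hypothetical:

```python
def generate_subject_one_shot(job_info: str) -> str:
    """Generate a subject line from the instruction plus one example."""
    prompt = (
        "Generate a catchy subject line for a job offer email.\n\n"
        "Job: Backend Engineer at Globex, remote\n"
        "Subject: Join Globex as a Backend Engineer, from anywhere\n\n"
        f"Job: {job_info}\n"
        "Subject:"
    )
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        max_tokens=13,
        temperature=0.7,
    )
    return response["choices"][0]["text"].strip()
```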

Results:

Comments: The results are definitely better; however, by giving it one example, the model tried to follow it exactly and lacked innovation. The subject lines generated were pretty generic, and we don't want that. Let's try giving it a few different examples and see what it does.

  • Few-shot learner: Instead of one example, we give it several.

Python function used for Few Shots Learning
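A sketch of the few-shot version, again with hypothetical names and examples; note the stop parameter, which is discussed in the next step:

```python
def generate_subject_few_shot(job_info: str, examples: list) -> str:
    """Generate a subject line from the instruction plus several examples.

    `examples` is a list of (job, subject) tuples.
    """
    shots = "\n\n".join(f"Job: {job}\nSubject: {subject}" for job, subject in examples)
    prompt = (
        "Generate a catchy subject line for a job offer email.\n\n"
        f"{shots}\n\n"
        f"Job: {job_info}\n"
        "Subject:"
    )
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        max_tokens=13,
        temperature=0.7,
        stop="\n",  # stop as soon as the model starts a new line
    )
    return response["choices"][0]["text"].strip()
```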

Results:

Comments: I believe these are the results we were looking for: the model understood the task, and the subject lines are appealing and varied. The few-shot learner is the way to go in this case.

Step 4: Setting the parameters

Depending on your use case, you're going to need to set the parameters for better output. Let's go through the different input parameters of the GPT-3 completion endpoint; a consolidated example call follows the list.

  • max_tokens: The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length; most models have a context length of 2048 tokens. You need to be careful not to set a value so small that the model cuts sentences in half, but also not so large that you waste tokens. For our use case we need a max_tokens value between 10 and 15; we set it at 13.

 

  • temperature: This parameter controls the randomness of the generated text. A value of 0 makes the engine deterministic, meaning it will always generate the same output for a given input. A value of 1 makes the engine take the most risks, with a lot of creativity. We want output that is innovative enough, but not so much that it starts getting out of context. The default value is 0.9; we're going to set it at 0.7 so our model doesn't drift out of context or invent information we didn't give it that might not be true.

 

  • top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers only the tokens comprising the top_p probability mass. It is recommended to alter either this parameter or the temperature, but not both. We're going to leave it at its default value of 1.

 

  • presence_penalty: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood of talking about new topics. This parameter is mostly useful for longer text generation; in our case we're going to leave it at 0.

 

  • frequency_penalty: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood of repeating the same line. Like presence_penalty, this one is more useful when dealing with longer text; we're going to leave it at 0 as well.

 

  • logit_bias: Modifies the likelihood of specified tokens appearing in the completion. Accepts a JSON object that maps tokens (specified by their token ID in the GPT tokenizer) to an associated bias value from -100 to 100. Values between -1 and 1 should decrease or increase the likelihood of the selected tokens appearing; values like -100 or 100 should result in a ban or exclusive selection of the tokens. We don't really need it for our use case.

 

  • stop: Tells the model when to stop generating text. For example, if I give it “?”, the model stops the generation once it generates the “?” character, and the character is removed from the output. Notice that in the few-shot experiment I pass “\n” to the stop parameter; that's because we want the model to stop generating if it starts another line.
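Putting it together, the completion call with the values chosen above would look roughly like this; the engine name is an example and `prompt` is the few-shot prompt built earlier:

```python
response = openai.Completion.create(
    engine="text-davinci-002",  # example engine name
    prompt=prompt,              # the few-shot prompt built in Step 3
    max_tokens=13,              # short enough for a subject line, no wasted tokens
    temperature=0.7,            # creative, but unlikely to drift off-topic
    top_p=1,                    # left at default; we tune temperature instead
    presence_penalty=0,         # mostly relevant for longer generations
    frequency_penalty=0,        # likewise left at 0
    stop="\n",                  # stop as soon as a new line starts
)
subject_line = response["choices"][0]["text"].strip()
```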

Step 5: Find out if we can use a faster model

We give the same instruction and parameters to the other models in order to compare their performance.
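A simple way to run this comparison; the engine names below are examples of the four GPT-3 families of the time, not necessarily the exact ones used in the experiment:

```python
# One engine per GPT-3 family, from most to least capable
engines = ["text-davinci-002", "text-curie-001", "text-babbage-001", "text-ada-001"]

for engine in engines:
    response = openai.Completion.create(
        engine=engine,
        prompt=prompt,   # the same few-shot prompt for every engine
        max_tokens=13,
        temperature=0.7,
        stop="\n",
    )
    print(engine, "->", response["choices"][0]["text"].strip())
```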

Results:

Comments:

  • We can clearly see that Davinci gives us the best results. It was able to understand the instruction given and didn't make any errors.
  • With Curie and Babbage we achieved middling results, with some errors like phrases cut in half, no generation at all, and very generic subject lines.
  • The outputs of the Ada model frequently went out of context.

To avoid any errors, we're going to keep using the Davinci model.

Possible improvement with fine-tuning

OpenAI now gives us the possibility to fine-tune the GPT-3 models to get results more suitable for your application: https://beta.openai.com/docs/api-reference/fine-tunes. However, we need an annotated dataset that contains the desired generated text for each prompt.

The difference between few-shot learning and fine-tuning is that with few-shot learning we don't update the weights of the model, but with fine-tuning we do.

https://cdn-images-1.medium.com/max/1600/0*nir9AO23iZhqLzdN.png

Difference between few-shot learning and fine-tuning, from the original paper

In most cases the GPT-3 models give amazing results with few-shot learning, thanks to what the paper calls in-context learning.

https://cdn-images-1.medium.com/max/1600/0*dmaZxX4wCTcQd5j8.png

Learning of GPT-3, from the original paper: the inner loops represent in-context learning.

But sometimes that is not enough, so fine-tuning is needed for the model to adapt better to the task.
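With the legacy fine-tunes API, this meant preparing a JSONL file of prompt/completion pairs and launching a training job from the command line; the file name and the example pairs below are hypothetical:

```
# subject_lines.jsonl: one prompt/completion pair per line
{"prompt": "Job: Senior Data Scientist at Acme, Paris\nSubject:", "completion": " Your next challenge: Senior Data Scientist at Acme\n"}
{"prompt": "Job: Backend Engineer at Globex, remote\nSubject:", "completion": " Join Globex as a Backend Engineer, from anywhere\n"}
```

```
openai api fine_tunes.create -t subject_lines.jsonl -m davinci
```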

Bottom line

GPT-3 represents a new milestone for AI, but it still has serious weaknesses:

  • Cutting phrases in half.
  • Generating out-of-context text.
  • Trouble understanding complex linguistics.
  • Bias: since it was trained on content that humans generated on the internet, there are still issues of bias, fairness, and representation. Thus, GPT-3 can generate prejudiced or stereotyped content.
  • Testers have found that GPT-3 produces a lot of nonsense, even for reasonable questions.

When building your application you need to take this into account, and keep in mind that good prompt design is the key to better results. Test your model on different data and analyze your output to make sure everything is good.

In this article we were able to generate very creative, human-like subject lines in only a few steps, which is very impressive. GPT-3 opens the door to many other applications and use cases.

In Part II we’re going to build the API for our Subject Line Generator.

Bibliography