How to Generate Embeddings with Azure OpenAI

In the ever-evolving field of Natural Language Processing (NLP), embeddings play a central role. These compact vector representations capture the meaning of text, enabling machines to recognize the semantic relationships between words and concepts, even when they are phrased differently. This lets applications perform tasks like document search, text classification, and machine translation based on meaning rather than exact wording.

This guide walks through generating text embeddings with Azure OpenAI. By following the steps, you'll learn how to set up your environment, prepare text data, check it against the model's token limit, and use embeddings for semantic search.

What are Embeddings?

In natural language processing (NLP), embeddings are dense vector representations of text data. Imagine capturing the essence of a document or sentence in a compact array of numbers: that is what an embedding does. These numbers encode semantic meaning, allowing models to understand the relationships between words and concepts, even if the exact phrasing differs.

Key characteristics:

Dense Vectors: Embeddings pack a lot of information into a fixed-length array of numbers (1,536 for text-embedding-ada-002), however long the original text is.
Vector Representation: The information is stored as a series of numbers arranged in a specific order, similar to a mathematical vector.
Semantic Meaning: Embeddings go beyond just keywords. They capture the underlying meaning and relationships within the text.
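
To make these characteristics concrete, here is a minimal sketch using made-up three-dimensional vectors. Real embeddings have far more dimensions and come from a model, so the numbers and phrases below are purely illustrative assumptions. The sketch compares vectors with cosine similarity, a common measure of how closely two vectors point in the same direction.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" (illustrative values, not model output).
toy_embeddings = {
    "tax relief for small firms": [0.9, 0.1, 0.2],
    "tax breaks for businesses": [0.8, 0.2, 0.1],
    "national park maintenance": [0.1, 0.9, 0.3],
}

# Compare every phrase against one query phrase.
query = toy_embeddings["tax breaks for businesses"]
for text, vector in toy_embeddings.items():
    print(f"{cosine_similarity(query, vector):.3f}  {text}")
```

In this toy setup, the two tax-related phrases score higher against each other than either does against the unrelated phrase, even though they share few exact words.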

Why Use Azure OpenAI for Embeddings?

There are several compelling reasons to consider Azure OpenAI for your embedding needs:

Pre-trained Models: Azure OpenAI offers access to pre-trained embedding models, like "text-embedding-ada-002," which have been trained on massive amounts of text data. This saves you the time and resources you would otherwise need to train your own models from scratch.
Text embedding: Azure OpenAI specializes in text and code, providing models designed for natural language understanding. This ensures optimal performance for your text-based embedding tasks.
Ease of Use: Azure OpenAI offers a user-friendly API and tools that simplify generating and using embeddings in your applications.
Scalability: The Azure cloud infrastructure allows you to scale your embedding workloads efficiently as your needs grow.
Security and Compliance: Azure adheres to rigorous security standards and compliance certifications, offering peace of mind for handling sensitive data.

How to Generate Embeddings with Azure OpenAI?

Here's a step-by-step guide on generating embeddings with Azure OpenAI:

Setting Up the Environment

STEP 1: Install Python Libraries

We need Python libraries to interact with Azure OpenAI's functionalities. Here's a list of the primary libraries used in this tutorial:

openai: This library provides the official Python client for interacting with Azure OpenAI's API.
pandas (optional): This versatile library is helpful for data manipulation tasks, especially if you plan to preprocess your data before generating embeddings.
matplotlib or plotly (for data visualization): These libraries can be used to visualize the generated embeddings for better understanding.
scikit-learn (for machine learning tasks): If you plan to use the embeddings for downstream tasks like document classification, scikit-learn offers various machine learning algorithms.

You can install these libraries using the pip package manager:

pip install openai pandas numpy tiktoken num2words matplotlib plotly scikit-learn

STEP 2: Download the BillSum dataset

For illustrative purposes, we will use the BillSum dataset, which consists of US congressional bills. This dataset provides a good example of textual data suitable for generating embeddings.

The BillSum dataset contains legislative text from the 103rd to 115th Congress sessions (1993-2018). This corpus focuses on mid-length legislation, typically from 5,000 to 20,000 characters. You can find more information about the project and the original research paper on the BillSum project's GitHub repository.

We'll work with a sample of the BillSum data stored in a CSV file named bill_sum_data.csv.

You can download it with the command line:

curl "https://raw.githubusercontent.com/Azure-Samples/Azure-OpenAI-Docs-Samples/main/Samples/Tutorials/Embeddings/data/bill_sum_data.csv" --output bill_sum_data.csv

Alternatively, you can download the file from the BillSum project's GitHub repository.

STEP 3: Retrieving your Azure OpenAI Endpoint and key

To interact with the Azure OpenAI API, you'll need two pieces of information:

Azure OpenAI Endpoint URL
Azure OpenAI Access Key

Navigate to your Azure OpenAI resource within the Azure portal.

Locate the Keys & Endpoint section. This section will display your endpoint URL and access keys (KEY1 and KEY2).

STEP 4: Setting up Environment Variables

Environment variables let you keep sensitive information like your Azure OpenAI credentials out of your source code. This keeps your code clean and reduces the risk of accidentally exposing the credentials, for example by committing a key to version control.

Here's how you can set environment variables for your Azure OpenAI endpoint and access key from the Windows command line. Note that setx stores the values for future sessions, so open a new terminal window before running your code:

setx AZURE_OPENAI_API_KEY "REPLACE_WITH_YOUR_KEY_VALUE_HERE"
setx AZURE_OPENAI_ENDPOINT "REPLACE_WITH_YOUR_ENDPOINT_HERE"
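
On the Python side, these variables can be read back and checked before any API call is made. The sketch below is one possible approach, not part of the original tutorial: the helper names load_azure_openai_settings and make_client are hypothetical, and the api_version value is an assumption that should be replaced with a version your resource supports.

```python
import os

REQUIRED_VARS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")

def load_azure_openai_settings(env=None, api_version="2024-02-01"):
    """Collect client settings from environment variables, failing early if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "api_key": env["AZURE_OPENAI_API_KEY"],
        "azure_endpoint": env["AZURE_OPENAI_ENDPOINT"],
        "api_version": api_version,  # assumed value; use one your resource supports
    }

def make_client(settings):
    """Build the client; openai is imported here so the settings check works without it."""
    from openai import AzureOpenAI
    return AzureOpenAI(**settings)
```

Calling make_client(load_azure_openai_settings()) then gives a client object without the key ever appearing in the source file.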

Generating Embeddings with Azure OpenAI

STEP 1: Importing Libraries

We'll begin by importing the necessary Python libraries:

import os
import re
import sys

import numpy as np
import pandas as pd
import requests
import tiktoken
from num2words import num2words
from openai import AzureOpenAI

os: Provides functions for interacting with the operating system.
re: Enables regular expression operations for text manipulation.
requests (optional): Used for making HTTP requests (if needed for data download).
sys (optional): Provides system-specific parameters and functions.
num2words (optional): Converts numbers to words (if applicable to your data).
pandas (pd): A powerful library for data manipulation and analysis. We'll use it to work with our DataFrame.
numpy (np): Provides numerical computing functionalities.
tiktoken: This library helps with text tokenization, breaking down text into smaller units.
AzureOpenAI: The client class from the openai package used to call your Azure OpenAI resource.

STEP 2: Reading the CSV Data

Next, we'll read the BillSum dataset stored in a CSV file named bill_sum_data.csv. The code assumes this file resides in the same directory as your Jupyter Notebook.

import os

import pandas as pd

df = pd.read_csv(os.path.join(os.getcwd(), 'bill_sum_data.csv'))
print(df)

This code snippet reads the CSV file and creates a pandas DataFrame named df containing the data. Running print(df) will display the initial table with all its columns.

STEP 3: Selecting Relevant Columns

The DataFrame likely contains more columns than needed for our task. We'll create a new, smaller DataFrame (df_bills) that focuses on the following columns:

text: The main content of the bill.
summary: A concise overview of the bill.
title: The title of the bill.

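# .copy() gives an independent DataFrame, so later column assignments
# don't trigger pandas' SettingWithCopyWarning
df_bills = df[['text', 'summary', 'title']].copy()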
print(df_bills)

STEP 4: Data Cleaning

Before processing the text data for embedding generation, perform some light cleaning. This might involve removing unnecessary whitespace, fixing punctuation inconsistencies, and preparing the text for tokenization.

Here's a function (normalize_text) that addresses common cleaning tasks:

import re

def normalize_text(s, sep_token=" \n "):
    """
    Clean text by collapsing whitespace and tidying punctuation
    before tokenization. sep_token is unused and kept only so
    existing calls keep working.
    """
    # Collapse runs of whitespace (including newlines) into single spaces
    s = re.sub(r"\s+", " ", s).strip()
    # Drop a stray comma after a full stop; the dot is escaped so it
    # matches a literal period rather than any character
    s = re.sub(r"\. ,", ".", s)
    # Replace doubled periods with a single period
    s = s.replace("..", ".").replace(". .", ".")
    return s.strip()

We'll apply this function to the text column of our df_bills DataFrame.

df_bills['text'] = df_bills['text'].apply(normalize_text)

STEP 5: Handling Document Length

Azure OpenAI's embedding model has a limit on the number of tokens it can process for a single document. We'll use the tiktoken library to get the number of tokens for each bill and remove any entries exceeding the limit (8192 tokens in this case).

from tiktoken import get_encoding

tokenizer = get_encoding("cl100k_base")

df_bills['n_tokens'] = df_bills["text"].apply(lambda x: len(tokenizer.encode(x)))
df_bills = df_bills[df_bills.n_tokens < 8192]

print(len(df_bills))  # This should ideally show the number of bills under the token limit

Understanding Tokenization

Tokenization breaks down text into smaller units for further processing. Depending on the tokenization method, these units can be words, sub-words, characters, or other meaningful elements.

The tiktoken example below shows the tokenization of the first sentence from a bill:
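
As a rough illustration of how the choice of unit changes the result, the sketch below splits the same sentence at three granularities using only the Python standard library. It is a simplification: sub-word tokenizers such as the one provided by tiktoken sit between the word and character levels and are not reproduced here.

```python
import re

sentence = "SECTION 1. SHORT TITLE."

# Word level: split on whitespace; punctuation stays attached to words.
word_tokens = sentence.split()

# Word-and-punctuation level: separate runs of word characters from punctuation.
word_punct_tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Character level: every character, including spaces, is its own token.
char_tokens = list(sentence)

for name, tokens in [
    ("words", word_tokens),
    ("words+punct", word_punct_tokens),
    ("chars", char_tokens),
]:
    print(f"{name:12} {len(tokens):3}  {tokens}")
```

The finer the unit, the more tokens the same text produces, which is why token counts rather than character or word counts are checked against the model's limit.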

"SECTION 1. SHORT TITLE."

Each token is shown as a bytes object (for example, b'SECTION'). Punctuation marks and whitespace also show up in the output, either as tokens of their own or attached to a neighboring word.

# .iloc[0] selects the first remaining row even if the row labeled 0 was filtered out
sample_encode = tokenizer.encode(df_bills["text"].iloc[0])
decode = tokenizer.decode_tokens_bytes(sample_encode)
print(decode)

print(len(decode))  # This should match the value in the first entry of the 'n_tokens' column

The decode variable holds a list of byte strings, one per token. Joined together, they reproduce the original text.

The length of the decode variable (len(decode)) matches the value in the n_tokens column for that bill. This indicates that the full text of this specific bill uses 1466 tokens.

Note:

The n_tokens column is a local, preliminary check. It ensures the text doesn't exceed the model's input token limit (8,192 in this example) before the text is sent to the service for actual tokenization and embedding generation.

When the text is sent to the Azure OpenAI embedding model, the service performs its own tokenization, which is similar to, but not necessarily identical to, the example shown. The tokenized text is then converted into a series of floating-point numbers: the actual embedding.

These embeddings can be stored locally or in an Azure Database to facilitate vector search, a technique for finding similar data points based on their embedding vectors.
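
The steps so far prepare the text but stop short of the API call itself. A minimal sketch of that call follows, under several assumptions: embed_texts is a hypothetical helper name, client is expected to be an openai.AzureOpenAI instance, the deployment is assumed to be named after the model, and the batch size of 16 is a conservative guess rather than a documented limit for your resource.

```python
def embed_texts(client, texts, model="text-embedding-ada-002", batch_size=16):
    """
    Return one embedding (a list of floats) per input text.

    client: an openai.AzureOpenAI instance (assumed).
    model: the name of your embedding deployment (assumed to match the model name).
    batch_size: texts sent per request (assumed value; adjust to your quota).
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        # Sort by index so the results line up with the inputs.
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

With the tutorial's DataFrame, the result could be stored as df_bills["ada_v2"] = pd.Series(embed_texts(client, df_bills["text"].tolist()), index=df_bills.index), which keeps each embedding aligned with its row and uses the column name that the search code expects.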

Using Embeddings for Downstream Tasks

Once embeddings have been generated for our documents (congressional bills in this example) and stored alongside them, let's explore how they can be used for various tasks. The search code in this section expects them in a DataFrame column named ada_v2.

Embeddings offer a powerful tool for document search. Instead of simply matching keywords, they allow us to retrieve documents based on their semantic similarity, meaning documents with similar meanings will be closer in the embedding space.

Introduction to using embeddings for document retrieval

Traditional document search often relies on keyword matching, which can be limited. For instance, a search for "tax breaks for businesses" might miss relevant documents that don't contain those exact keywords but discuss similar concepts.

Embeddings capture the semantic meaning of text, enabling us to find documents that convey similar ideas even if they use different words. This leads to more relevant and nuanced search results.

Code example for finding similar documents based on embedding similarity:

Here's some Python code demonstrating how to find similar documents in our df_bills DataFrame based on a user query:

import os

import numpy as np
from openai import AzureOpenAI

# Client built from the environment variables set earlier.
# The api_version value is an assumption; use one your resource supports.
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

def cosine_similarity(a, b):
    """Calculate the cosine similarity between two embedding vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_embedding(text, model="text-embedding-ada-002"):
    """
    Retrieve the embedding for a given text.
    model must be the name of your embedding deployment.
    """
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def search_docs(df, user_query, top_n=4, to_print=True):
    """Search the DataFrame for the documents most similar to the user query."""
    # 1. Get the embedding for the user query
    embedding = get_embedding(user_query)

    # 2. Calculate cosine similarity between the query embedding and each document embedding
    df["similarities"] = df["ada_v2"].apply(lambda x: cosine_similarity(x, embedding))

    # 3. Sort documents by similarity in descending order and return the top 'top_n' results
    res = df.sort_values("similarities", ascending=False).head(top_n)

    if to_print:
        print(res)

    return res

# search_docs expects document embeddings in an "ada_v2" column; create it if missing
if "ada_v2" not in df_bills.columns:
    df_bills["ada_v2"] = df_bills["text"].apply(get_embedding)

# Example usage: Find documents similar to the query "Can I get information on cable company tax revenue?"
user_query = "Can I get information on cable company tax revenue?"
res = search_docs(df_bills, user_query, top_n=4)

This code defines functions for calculating cosine similarity and retrieving document embeddings. The search_docs function takes a user query, searches for similar documents in the DataFrame based on embedding similarity, and returns the top results.

Other Potential Applications:

Embeddings have a wide range of applications beyond document search. Here are a few examples:

Text Classification: Embeddings can serve as input features for classifying text documents into predefined categories, for tasks such as sentiment analysis (positive or negative reviews) or topic identification (assigning themes to documents in a corpus).
Chatbots: Chatbots can leverage embeddings to understand the intent behind user queries and respond with more relevant and informative answers.
Machine Translation: Machine translation models can benefit from embeddings to capture the semantic meaning of text and generate more accurate and natural translations.
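
To give a sense of how the classification use case could work, here is a minimal nearest-centroid sketch in plain Python. The two-dimensional vectors and the labels are made up for illustration. In practice, the inputs would be real embeddings and a library such as scikit-learn would typically provide the classifier.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]

def classify(embedding, centroids):
    """Return the label whose centroid is closest (Euclidean distance) to the embedding."""
    return min(centroids, key=lambda label: math.dist(embedding, centroids[label]))

# Made-up 2-dimensional vectors standing in for real embeddings.
labeled = {
    "tax": [[0.9, 0.1], [0.8, 0.2]],
    "environment": [[0.1, 0.9], [0.2, 0.7]],
}
centroids = {label: centroid(vectors) for label, vectors in labeled.items()}

# Classify a new, unlabeled vector by its nearest class centroid.
print(classify([0.7, 0.3], centroids))
```

Each class is summarized by the average of its example vectors, and a new document is assigned to whichever average it sits closest to.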

Conclusion

By now, you've gained a solid understanding of how to generate embeddings with Azure OpenAI. These embeddings bridge human language and machine comprehension, empowering AI applications to process and analyze text data. Azure OpenAI provides a user-friendly platform to generate embeddings for your specific needs. With its robust infrastructure and pre-trained models, Azure OpenAI streamlines the process, allowing you to focus on extracting valuable insights from your text data.