Data Ingestion in Azure OpenAI on Your Data

Empower your AI journey with secure data ingestion in Azure OpenAI on Your Data: A guide to integrating your data using Azure Blob Storage and optional Azure Search for organizations.

Azure OpenAI on Your Data empowers you to unlock the true potential of OpenAI's large language models (LLMs) by tailoring them to your specific needs. This innovative service allows you to customize these pre-trained LLMs with your data, enabling them to understand your unique business context, industry jargon, or any other specific requirements you may have. This customization leads to more accurate and relevant responses from the AI model, significantly enhancing its effectiveness for your tasks.

Key Capabilities of Azure OpenAI on Your Data:

Data Ingestion Flexibility: Seamlessly ingest data from various sources like Azure Blob Storage, local file systems, or URLs. This flexibility ensures you can leverage your existing data infrastructure for AI development.
Data Processing with Azure AI Search: Azure AI Search plays a crucial role by indexing and making your data searchable. This empowers the AI model to efficiently find relevant information based on user queries, leading to more accurate and insightful results.
Tailored AI Models: The core strength of Azure OpenAI on Your Data lies in its ability to customize AI models with your data. This customization bridges the gap between generic models and your specific needs, resulting in a more relevant and accurate AI solution.

Data Ingestion in Azure OpenAI on Your Data

Data ingestion, the process of importing your data into Azure OpenAI on Your Data, forms the foundation for customizing powerful AI models. This service provides several flexible methods to ingest data from various sources, ensuring you can leverage your existing data infrastructure for AI development.

Data Sources for Ingestion:

Azure Blob Storage: This highly scalable and reliable cloud storage solution is ideal for storing large amounts of unstructured data like text or binary files. You can upload your data to Azure Blob Storage and seamlessly connect it to Azure OpenAI for ingestion via HTTP or HTTPS. This method is perfect for large datasets you want to utilize for AI model training.
Local Files: For smaller datasets or initial testing purposes, you can directly upload files from your local machine to Azure OpenAI. This provides a quick and convenient way to ingest your data without additional cloud storage setup.
URLs: Azure OpenAI pulls data directly from specified web addresses. This functionality is valuable if your data is hosted on a public web server or if you want to incorporate publicly available data for training your AI model. Leverage this method to integrate external data sources for broader model training.

Data Ingestion in Azure OpenAI on Your Data — Source: Microsoft

Steps 1 and 2 in the image are specifically for uploading files. Downloading URLs to Azure Blob Storage isn't depicted visually, but the subsequent steps (from step 3 onwards) apply equally to data ingested from URLs after it's uploaded to Blob Storage.

Five crucial components orchestrate the data ingestion process:

Azure Search Indexes and Data Sources (2 each): These are created within the Azure Search resource. Indexes store the processed data, while data sources specify the location of the data (Blob Storage in this case).
Custom Skill: A custom skill is created within Azure Search. It essentially acts as an intermediary, receiving documents from the data source and passing them on for preprocessing by Azure OpenAI's preprocessing jobs API.
Chunks Container: A container within Azure Blob Storage is specifically designated to hold the data chunks created during preprocessing.
Scheduled Refresh (Optional): It can trigger the ingestion process automatically at predefined intervals if configured.

Data Preprocessing by Azure OpenAI

Azure OpenAI's preprocessing jobs API plays a central role. It adheres to the Azure Search customer skill web API protocol to process documents queued for ingestion.

Here are the preprocessing steps performed by Azure OpenAI internally:

Document Cracking: The first indexer cracks the documents using a heuristic-based algorithm. This essentially involves splitting the documents into smaller, more manageable units.
Chunking: Azure OpenAI employs a chunking algorithm that considers factors like table layouts and other formatting elements to ensure optimal chunk quality.
Vectorization (Optional): Azure OpenAI utilizes the chosen embedding deployment to vectorize the data chunks. Vectorization represents data points as numerical vectors for efficient search operations.

Once all the data is processed, Azure OpenAI triggers the second indexer, which stores the preprocessed data in the Azure Search service, making it ready for use by the AI models.

Inference architecture

The inference architecture in Azure OpenAI on Your Data refers to the ingested and prepared data to generate responses from the underlying AI model. It essentially bridges the gap between your data and the AI model's ability to leverage that data to perform tasks or answer your questions.

API Call for Interaction: When interacting with the Azure OpenAI model, you initiate the process through an API call (like a chat functionality).

Field Retrieval from Azure Search: The Azure OpenAI service retrieves necessary field information from the Azure Search indexes. This field information is crucial for accurate model performance during inference.

Model Inference:

The retrieved data and query of instructions are used by the AI model.
Based on your customized data and the model's capabilities, the AI model generates a response or completes a task as instructed.
For instance, the model might translate text, answer your questions based on your data, or generate creative text formats tailored to your domain.

Optional Vector Search (if embedding deployment is specified): If an embedding deployment is configured, Azure OpenAI vectorizes your query and sends both the vectorized query and the original text-based query to Azure Search for potentially improved search accuracy.

Importance of Inference Architecture:

Transforms Data into Actionable Insights: The data ingestion architecture prepares your data for use by the AI model. The inference architecture takes prepared data and generates meaningful responses or completes tasks as instructed. Imagine having a vast library of information (your data) but no way to search or extract knowledge from it (without inference).
Tailored Responses Based on Your Data: By incorporating your specific data into the AI model, the inference architecture allows the model to generate relevant responses aligned with your domain or area of focus. For instance, if you've trained the model on customer support data, it can utilize that knowledge during inference to provide more accurate and helpful responses to customer inquiries.
Real-World Application of AI Capabilities: The inference architecture enables you to interact with the AI model and benefit from its capabilities. Without it, the AI model would remain a powerful learning engine but cannot translate its knowledge into practical applications. Inference allows you to leverage the model for tasks like generating creative text formats, translating languages, or answering questions based on your specific data set.

Azure OpenAI Security Measures

Data security remains paramount when working with sensitive information. Azure OpenAI on Your Data prioritizes security throughout the process, offering robust measures to protect your data at every stage, from ingestion to processing and storage. Here's a breakdown of these security features:

Authentication with Microsoft Entra ID (Azure AD): This robust identity and access management service ensures that only authorized users can access and process your data. It acts as a digital gatekeeper, verifying user identities before granting access.
Double Encryption at Rest in Azure Blob Storage: Your data is securely stored within Azure Blob Storage, a highly scalable and reliable cloud storage solution. Data encryption at rest with the AES-256 encryption standard adds a strong layer of protection by default. Additionally, you can utilize customer-managed keys for even greater control over encryption.
Secure Communication with Private Endpoints (Optional): If you've implemented a Virtual Network (VNet) for further isolation, private endpoints can be used during data communication. This creates a secure tunnel within the VNet, ensuring your data bypasses the public internet, minimizing the risk of exposure.
Azure AI Search Security Measures: Azure AI Search offers additional security features like Role-Based Access Control (RBAC) to define user permissions for accessing data sets and search results. Furthermore, document-level access control (optional) allows you to restrict document usage based on user groups within Azure AI Search, ensuring only authorized users can utilize specific data points for AI model training or generation.

Secure Data Processing with Azure AI Search

Azure AI Search is a cloud search service providing a rich search experience over private, heterogeneous content in web, mobile, and enterprise applications. In the context of Azure OpenAI on Your Data, it plays a crucial role in processing the ingested data.

Role of Azure AI Search in Processing Ingested Data:

Imagine a vast library of information. Azure AI Search acts as the powerful indexing system for this library. It transforms your ingested data into a searchable format through indexing. This process involves:

Tokenization: Breaking down text data into smaller meaningful units called tokens. Think of it like creating keywords for easier searching.
Normalization: Standardizing the text data for consistency. This ensures consistent search results regardless of capitalization, punctuation, or other formatting.
Organization: Structuring the indexed data for efficient searching. This allows the AI model to locate the information when responding to user queries.

By transforming your data into a searchable format, Azure AI Search empowers your AI model to deliver accurate and insightful results based on user needs.

Security Aspects of Azure AI Search

Azure AI Search understands the importance of data security. It implements robust security measures to safeguard your information throughout the processing stage:

Role-Based Access Control (RBAC): This fine-grained access control system allows you to define specific permissions for different users or user groups within Azure. In the context of Azure AI Search, you can create roles like "read-only" or "read-write" and assign them to users. This ensures that only authorized users can access and manipulate specific data sets or search results within Azure AI Search.
Document-Level Access Control (Optional): You can implement document-level access control for more granular control (available with Azure AI Search). This allows you to restrict which documents specific user groups can access within your search index. This ensures that only authorized users can utilize data points for AI model training or generation. Imagine a library where certain documents are restricted to specific user groups for confidentiality. Document-level access control offers similar functionality for your data within Azure AI Search.

Supported Data Formats and File Types

Azure OpenAI on Your Data supports the following file types:

.txt: Plain text files
.md: Markdown files
.html: HTML files
.docx: Microsoft Word documents
.pptx: Microsoft PowerPoint presentations
.pdf: PDF documents

Upload Limit and Document Structure Considerations

While Azure OpenAI on Your Data offers flexibility, there are certain upload limitations and document structure considerations to keep in mind:

Upload Limit: There's a maximum size limit of 16 MB for all files uploaded. Additionally, there is a cap of 2 million tokens (meaningful units of text) per file for text and document files,
Document Structure: The structure of your documents can significantly impact the quality of responses generated by the AI model. When converting data from unsupported formats, ensure the conversion process doesn't lead to substantial data loss or introduce unexpected noise (corrupted or irrelevant data) into your information.

Optimizing Data for Better Results:

For files with special formatting like tables, columns, or bullet points, it is advised to use the data preparation script available on GitHub. This script helps pre-process your data for optimal utilization by the AI model. Additionally, consider the available data preparation script for documents and datasets with long stretches of text. This script intelligently chunks your data into smaller segments, leading to more accurate and insightful responses from the AI model.

The script also offers extended support for scanned PDF files and images, allowing you to incorporate various data sources for comprehensive AI model training.

Managed Identities

Managed identities in Azure are a special type of service principal, locked and only used with Azure resources. They are a feature of Microsoft Entra ID that allows your app to easily access other Microsoft Entra-protected resources such as Azure Key Vault. It does not require you to provision or rotate any secrets.

There are two types of managed identities: system-assigned and user-assigned.

System-Assigned Managed Identities: These are automatically generated at the resource level. When you enable a system-assigned managed identity, an identity is created in Azure AD. The identity is tied to the lifecycle of that service instance. When the resource is deleted, Azure automatically deletes the identity for you. By design, only that Azure resource can use this identity to request tokens from Azure AD.
User-Assigned Managed Identities: These are explicitly created and assigned by developers or administrators. User-assigned identities are standalone Azure resources that can be assigned to one or more instances of an Azure service.

Importance of Managed Identities in Service Calls

Managed identities provide several benefits when making service calls:

Eliminate Credential Management: Managed identities eliminate the need to store credentials within the application code, improving security as there are no chances of password leaks.
Automatic Management: Managed identities manage the creation and automatic renewal of the service principal on your behalf.
Fine-Grained Access Control: Managed identities can be used to grant Role Based Access Control (RBAC) roles that are either built-in or custom. This helps provide a finer grain of access to specific functions within the scoped Azure resources.
Resource Cleanup: If leveraging System Assigned Identities, Azure Active Directory will destroy the identity after the underlying Azure resource has been deleted.

Conclusion

Use Azure Blob Storage for secure data housing, and explore Azure Search (optional) for data organization. With Azure OpenAI's preprocessing techniques, your data is prepared for optimal utilization by the AI models, empowering you to unlock powerful AI-driven capabilities.