Dataset Files
Dataset files provide a powerful mechanism for integrating document-based knowledge into your AI applications. By attaching files to datasets, you enable the platform to automatically extract text content, process it into searchable records, and make that information available to your bots through semantic search capabilities.
The file attachment system supports a wide range of document formats and handles the complexity of text extraction, chunking, and vectorization automatically. Once attached and synced, file content becomes instantly searchable within the dataset, allowing your AI agents to access and reference information from documents when responding to user queries.
Understanding File Attachments
When you attach a file to a dataset, you're creating a connection that tells the platform to extract and index the file's content. The attachment system supports different attachment types that control how the file content is processed and stored:
- source: The file serves as a source of knowledge, with its content extracted and stored as dataset records
- reference: The file is referenced but not automatically processed (useful for metadata tracking)
File attachments are persistent connections - once attached, the file remains associated with the dataset until explicitly detached. This allows you to manage your knowledge base by adding or removing document sources as your information needs evolve.
Attaching Files to Datasets
To attach a file to a dataset, you need both a file ID (obtained by uploading a file) and a dataset ID (from creating or fetching a dataset). The attachment operation creates the connection but does not immediately process the file - you'll need to trigger a sync operation separately to extract and index the content.
The type parameter is required and determines how the file is handled:
- source: Most common type - extracts text content from the file and creates searchable records in the dataset. Use this when you want the file's content to be available for AI retrieval and reference. Supported formats include PDF, TXT, DOCX, PPTX, and many others.
- reference: Creates an attachment without content extraction. Useful for tracking which files are associated with a dataset without processing their content, or for files that will be processed through custom mechanisms.
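A minimal attach call is sketched below with a fetch-based client. The base URL and route are illustrative placeholders rather than the documented paths; only the `type` values (`source` and `reference`) come from this guide.

```typescript
// Illustrative sketch only: the base URL and route are placeholders, not the
// documented API paths. Only the `type` values ('source' | 'reference') are
// described in this guide.
const BASE_URL = 'https://api.example.com/v1' // placeholder
const TOKEN = process.env.API_TOKEN ?? ''

async function attachFile(
  datasetId: string,
  fileId: string,
  type: 'source' | 'reference'
) {
  const res = await fetch(`${BASE_URL}/datasets/${datasetId}/files/${fileId}/attach`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ type }),
  })

  if (!res.ok) {
    throw new Error(`Attach failed with status ${res.status}`)
  }

  return res.json()
}

// Attach as a knowledge source so its content is extracted on the next sync.
await attachFile('your-dataset-id', 'your-file-id', 'source')
```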
Supported File Formats
The file attachment system can extract text from numerous document formats:
- Text Documents: TXT, MD (Markdown), RTF
- Office Documents: DOCX, XLSX, PPTX
- PDFs: Both text-based and image-based (with OCR)
- Web Documents: HTML, XML
- Code Files: Most programming language source files
- Data Formats: JSON, CSV, YAML
Attachment Workflow
The complete workflow for making file content available in a dataset involves three steps:
- Upload: First upload the file using the file upload endpoint to get a file ID
- Attach: Create the attachment between the file and dataset (this operation)
- Sync: Trigger synchronization to extract and index the content
Here's a complete example:
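The sketch below walks through the three steps with a fetch-based client. The routes and request field names (for example, the data URL field) are assumptions for illustration; only the upload, attach, sync sequence comes from this guide.

```typescript
// Upload -> attach -> sync, sketched with placeholder routes and field names.
const BASE_URL = 'https://api.example.com/v1' // placeholder
const headers = {
  Authorization: `Bearer ${process.env.API_TOKEN}`,
  'Content-Type': 'application/json',
}
const datasetId = 'your-dataset-id'

// 1. Upload: create the file and provide its content (here as a data URL).
const file = await fetch(`${BASE_URL}/files`, {
  method: 'POST',
  headers,
  body: JSON.stringify({
    name: 'notes.txt',
    dataURL: `data:text/plain;base64,${Buffer.from('Hello, world!').toString('base64')}`,
  }),
}).then((res) => res.json())

// 2. Attach: connect the file to the dataset as a knowledge source.
await fetch(`${BASE_URL}/datasets/${datasetId}/files/${file.id}/attach`, {
  method: 'POST',
  headers,
  body: JSON.stringify({ type: 'source' }),
})

// 3. Sync: extract, chunk, embed, and index the file content into the dataset.
await fetch(`${BASE_URL}/datasets/${datasetId}/files/${file.id}/sync`, {
  method: 'POST',
  headers,
})
```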
Re-attaching Files
If a file is already attached to the dataset, calling attach again will update the attachment type. The existing attachment is automatically removed and recreated with the new type. This allows you to change how a file is processed without manually detaching and reattaching.
Important Considerations
Processing Time: Large documents or complex PDFs may take several minutes to process during sync. The platform handles this asynchronously, so your attach request returns immediately.
File Size Limits: Files are subject to your account's size limits. Very large files (hundreds of MB) should be split into smaller chunks for optimal processing.
Content Updates: If you update the file content (by uploading a new version), you need to trigger a new sync to refresh the dataset records. Attachments don't automatically detect file changes.
Multiple Datasets: A single file can be attached to multiple datasets, allowing you to reuse content across different knowledge bases without duplicating file storage.
Record Source Tracking: Records created from file content include source metadata that references the original file ID, enabling you to track which records came from which documents.
Files can serve as a source of records for your datasets. You can create files, attach them to datasets, and sync them to import records.
Create File
Creating a file is the first step to using it as a data source for your datasets. You can create a file by making a POST request to the following endpoint:
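As a sketch, assuming a fetch-based client, a create request might look like the following. The `/files` route and the `name`/`description` fields are illustrative assumptions, not the documented schema.

```typescript
// Create a file record to be used later as a dataset source. The route and
// field names are illustrative assumptions.
const res = await fetch('https://api.example.com/v1/files', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    name: 'product-manual.pdf',
    description: 'Product manual used as a knowledge source',
  }),
})

const { id: fileId } = await res.json()

console.log('Created file', fileId)
```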
Uploading File Content
There are multiple ways to upload file content to be used as a data source for your datasets.
Upload via JSON URL or Data URL
You can upload a file by providing an HTTP URL or a data URL in a JSON request body. This method is suitable for smaller files (up to 4.5MB).
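Both variants are sketched below with a fetch-based client. The upload route and the `url`/`dataURL` field names are assumptions for illustration.

```typescript
// Upload by HTTP URL or by data URL in a JSON body (suitable up to 4.5MB).
// The route and the `url` / `dataURL` field names are assumptions.
const fileId = 'your-file-id'
const headers = {
  Authorization: `Bearer ${process.env.API_TOKEN}`,
  'Content-Type': 'application/json',
}

// Variant A: point the platform at an HTTP URL it can fetch.
await fetch(`https://api.example.com/v1/files/${fileId}/upload`, {
  method: 'POST',
  headers,
  body: JSON.stringify({ url: 'https://example.com/docs/handbook.pdf' }),
})

// Variant B: embed the content directly as a base64 data URL.
await fetch(`https://api.example.com/v1/files/${fileId}/upload`, {
  method: 'POST',
  headers,
  body: JSON.stringify({
    dataURL: `data:text/plain;base64,${Buffer.from('Hello, world!').toString('base64')}`,
  }),
})
```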
Upload via Multipart/Form-Data
You can upload a file using multipart/form-data. This method is suitable for files up to 4.5MB.
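A sketch using Node's built-in FormData and Blob follows; the route and the form field name are assumptions for illustration.

```typescript
import { readFile } from 'node:fs/promises'

// Multipart upload sketch (suitable up to 4.5MB). The route and form field
// name are assumptions.
const fileId = 'your-file-id'
const bytes = await readFile('./handbook.pdf')

const form = new FormData()
form.append('file', new Blob([bytes], { type: 'application/pdf' }), 'handbook.pdf')

await fetch(`https://api.example.com/v1/files/${fileId}/upload`, {
  method: 'POST',
  // Note: do not set Content-Type manually; fetch adds the multipart boundary.
  headers: { Authorization: `Bearer ${process.env.API_TOKEN}` },
  body: form,
})
```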
Upload via Raw File Stream
You can upload a file by sending the raw file stream in the request body. This method is suitable for files up to 4.5MB.
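A sketch sending the raw bytes as the request body; the route is an assumption for illustration.

```typescript
import { readFile } from 'node:fs/promises'

// Raw body upload sketch (suitable up to 4.5MB). The route is an assumption.
const fileId = 'your-file-id'
const bytes = await readFile('./handbook.pdf')

await fetch(`https://api.example.com/v1/files/${fileId}/upload`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.API_TOKEN}`,
    'Content-Type': 'application/pdf',
  },
  body: bytes,
})
```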
Direct-to-Source Uploads
For larger files or more control over the upload process, you can obtain a pre-signed upload request by providing the file metadata in a JSON request body. You can then use the provided upload request to upload the file directly to the storage service.
The response will include an uploadRequest object with the necessary details to perform the upload. You can then use this uploadRequest to upload the file directly to the storage service.
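The two steps are sketched below under assumptions: the route for requesting the pre-signed upload and the exact shape of the uploadRequest object (url, method, headers) are illustrative, not the documented contract.

```typescript
import { readFile } from 'node:fs/promises'

// Direct-to-source upload sketch. The route and the assumed shape of
// `uploadRequest` ({ url, method, headers }) are illustrative.
const fileId = 'your-file-id'

// 1. Describe the file and ask for a pre-signed upload request.
const { uploadRequest } = await fetch(
  `https://api.example.com/v1/files/${fileId}/upload-request`,
  {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ name: 'handbook.pdf', type: 'application/pdf' }),
  }
).then((res) => res.json())

// 2. Upload the bytes directly to the storage service.
await fetch(uploadRequest.url, {
  method: uploadRequest.method ?? 'PUT',
  headers: uploadRequest.headers ?? {},
  body: await readFile('./handbook.pdf'),
})
```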
Dataset files are the primary way to add content and knowledge to your datasets, enabling AI agents to access and reference specific documents, images, PDFs, text files, and other file types during conversations. Each file attached to a dataset is automatically processed, indexed, and made searchable, allowing the AI to retrieve relevant information when responding to user queries.
Listing Dataset Files
Retrieving the list of files attached to a dataset allows you to inventory all content within a knowledge base, review file metadata, and manage your dataset's content library. The list endpoint provides comprehensive information about each file including its name, description, visibility settings, and timestamps.
To retrieve the files associated with a dataset, send a GET request to the dataset's file list endpoint:
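A minimal sketch follows, with a placeholder route and an assumed `{ items }` response wrapper.

```typescript
// List the files attached to a dataset. The route and response wrapper are
// assumptions for illustration.
const datasetId = 'your-dataset-id'

const res = await fetch(`https://api.example.com/v1/datasets/${datasetId}/files`, {
  headers: { Authorization: `Bearer ${process.env.API_TOKEN}` },
})

const { items } = await res.json()

for (const file of items) {
  console.log(file.id, file.name)
}
```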
Pagination
The endpoint supports cursor-based pagination for efficiently navigating large file collections:
- cursor: Pagination token from the previous response, enabling you to fetch the next page of results
- take: Number of files to retrieve per page (adjust based on your needs)
- order: Sort order, either asc (oldest first) or desc (newest first, default)
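The sketch below pages through a dataset's files using these parameters; the route, the `{ items, cursor }` response shape, and the end-of-pages signal are assumptions.

```typescript
// Walk all pages of a dataset's file list using cursor-based pagination.
// The route and response shape ({ items, cursor }) are assumptions.
const datasetId = 'your-dataset-id'
let cursor: string | undefined

do {
  const params = new URLSearchParams({ take: '50', order: 'desc' })

  if (cursor) {
    params.set('cursor', cursor)
  }

  const page = await fetch(
    `https://api.example.com/v1/datasets/${datasetId}/files?${params}`,
    { headers: { Authorization: `Bearer ${process.env.API_TOKEN}` } }
  ).then((res) => res.json())

  for (const file of page.items) {
    console.log(file.id, file.name)
  }

  cursor = page.cursor // assumed to be absent on the last page
} while (cursor)
```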
Filtering by Metadata
Filter files based on custom metadata fields using deep object notation:
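Deep object notation typically encodes nested keys directly in the query string, for example meta[category]=policies. The parameter name and route below are assumptions for illustration.

```typescript
// Filter attached files by custom metadata via deep object notation in the
// query string. The `meta[...]` parameter name and route are assumptions.
const datasetId = 'your-dataset-id'

const query = new URLSearchParams({
  'meta[category]': 'policies',
  'meta[department]': 'hr',
})

const { items } = await fetch(
  `https://api.example.com/v1/datasets/${datasetId}/files?${query}`,
  { headers: { Authorization: `Bearer ${process.env.API_TOKEN}` } }
).then((res) => res.json())

console.log(`Matched ${items.length} files`)
```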
Metadata filtering enables flexible organization and retrieval based on your own categorization schemes, making it easy to find specific types of content within large datasets.
Response Format
The endpoint returns an array of file objects:
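The fields below reflect what this guide describes (name, description, visibility, metadata, timestamps). The exact property names are assumptions, so treat this as an approximate shape rather than the authoritative schema.

```typescript
// Approximate shape of a file object in the list response. Property names are
// assumptions based on the fields mentioned in this guide.
interface DatasetFile {
  id: string
  name: string
  description?: string
  visibility: 'private' | 'protected' | 'public'
  meta?: Record<string, unknown>
  createdAt: number
  updatedAt: number
}

interface DatasetFileListResponse {
  items: DatasetFile[]
  cursor?: string // present when more pages are available (assumed)
}
```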
File Visibility
Each file has a visibility setting that controls access:
- private: Only accessible to the file owner and explicitly authorized users
- protected: Accessible to users within the same organization or team
- public: Publicly accessible (use with caution for sensitive content)
Streaming Response (JSONL)
For real-time processing of large file lists, you can request the response in JSONL streaming format, where each line is a separate JSON object:
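A sketch of consuming the stream line by line follows; the Accept header value, route, and response framing are assumptions.

```typescript
// Consume a JSONL stream of file objects incrementally. The Accept value and
// route are assumptions.
const datasetId = 'your-dataset-id'

const res = await fetch(`https://api.example.com/v1/datasets/${datasetId}/files`, {
  headers: {
    Authorization: `Bearer ${process.env.API_TOKEN}`,
    Accept: 'application/jsonl',
  },
})

const decoder = new TextDecoder()
let buffer = ''

// Node's fetch body is an async-iterable stream of byte chunks.
for await (const chunk of res.body as unknown as AsyncIterable<Uint8Array>) {
  buffer += decoder.decode(chunk, { stream: true })

  let newline: number
  while ((newline = buffer.indexOf('\n')) >= 0) {
    const line = buffer.slice(0, newline).trim()
    buffer = buffer.slice(newline + 1)

    if (line) {
      console.log(JSON.parse(line)) // one file object per line
    }
  }
}
```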
This format is ideal for processing large file lists incrementally without waiting for the entire response.
Important Notes:
- Only files attached to datasets you own are returned
- File processing status is not included in the list response; check individual file details for processing state
- Deleted files are automatically removed from the list
- The list reflects the current state of file attachments through the DatasetFileAttachment relationship
- File metadata is flexible and can store arbitrary key-value pairs for custom organization
Detaching Files from Datasets
When a file is no longer needed as a knowledge source for a dataset, you can detach it to remove the connection between the file and dataset. The detachment operation provides flexible control over what happens to the content that was extracted from the file, allowing you to either preserve the existing dataset records or clean them up along with the attachment.
Detaching a file is useful when you want to update your dataset's knowledge base by removing outdated information, reorganizing document sources, or simply cleaning up attachments that are no longer relevant. The operation is immediate and can be configured to handle content cleanup automatically.
Basic Detachment
To detach a file without removing its extracted records from the dataset:
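A sketch of the detach call that keeps the extracted records; the route and the placement of the flag in the JSON body are assumptions, while the deleteRecords semantics are described in this guide.

```typescript
// Detach a file but keep the records extracted from it. The route is an
// assumption; the `deleteRecords` flag is described in this guide.
const datasetId = 'your-dataset-id'
const fileId = 'your-file-id'

await fetch(`https://api.example.com/v1/datasets/${datasetId}/files/${fileId}/detach`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ deleteRecords: false }),
})
```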
This removes the attachment relationship while preserving all records that were created from the file's content. The records remain searchable in the dataset and continue to provide knowledge to your AI agents. This option is useful when you want to disconnect a file but keep its information available.
Detachment with Record Deletion
To completely remove both the attachment and all associated content from the dataset:
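As above, a sketch with an assumed route, this time requesting record cleanup as well.

```typescript
// Detach the file and delete every record that was created from its content.
// The route is an assumption; `deleteRecords` is described in this guide.
const datasetId = 'your-dataset-id'
const fileId = 'your-file-id'

await fetch(`https://api.example.com/v1/datasets/${datasetId}/files/${fileId}/detach`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ deleteRecords: true }),
})
```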
This performs a complete cleanup by:
- Identifying all records in the dataset that originated from the file
- Deleting those records from both the database and vector store
- Removing the file attachment
Use this option when you want to fully remove a document's information from the dataset, such as when content becomes outdated, incorrect, or no longer relevant to your AI application.
Record Deletion Process
When deleteRecords is set to true, the system:
- Locates all records with a source matching file:///{fileId}
- Processes deletions in batches of 10 for efficient performance
- Removes records from both the Prisma database and the vector store
- Handles large files with many records without timeout issues
The deletion process runs synchronously but is optimized for performance. For files that generated hundreds or thousands of records, the operation may take several seconds to complete.
Detachment Scenarios
Scenario 1: Updating File Content
When you need to update a document's content, detach with record deletion, then re-attach and sync the updated file. This ensures clean replacement of old content with new:
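The scenario is sketched end to end below. The routes and the URL-based upload field are assumptions, while the detach, upload, attach, sync order follows the steps described here.

```typescript
// Replace a document's content: detach with record deletion, upload the new
// version, re-attach, and sync. Routes and field names are assumptions.
const BASE_URL = 'https://api.example.com/v1' // placeholder
const headers = {
  Authorization: `Bearer ${process.env.API_TOKEN}`,
  'Content-Type': 'application/json',
}
const datasetId = 'your-dataset-id'
const fileId = 'your-file-id'

// 1. Remove the old attachment along with all records derived from the file.
await fetch(`${BASE_URL}/datasets/${datasetId}/files/${fileId}/detach`, {
  method: 'POST',
  headers,
  body: JSON.stringify({ deleteRecords: true }),
})

// 2. Upload the updated content for the same file.
await fetch(`${BASE_URL}/files/${fileId}/upload`, {
  method: 'POST',
  headers,
  body: JSON.stringify({ url: 'https://example.com/docs/handbook-v2.pdf' }),
})

// 3. Re-attach as a source and sync so the new content replaces the old.
await fetch(`${BASE_URL}/datasets/${datasetId}/files/${fileId}/attach`, {
  method: 'POST',
  headers,
  body: JSON.stringify({ type: 'source' }),
})

await fetch(`${BASE_URL}/datasets/${datasetId}/files/${fileId}/sync`, {
  method: 'POST',
  headers,
})
```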
Scenario 2: Reorganizing Knowledge Base
When restructuring datasets, you might detach files without deleting records to preserve knowledge while reorganizing attachments. This is useful when migrating content between datasets or consolidating knowledge sources.
Scenario 3: Content Removal
When information becomes obsolete, confidential, or needs to be removed for compliance reasons, detach with record deletion to ensure complete removal from the AI's accessible knowledge.
Important Considerations
Irreversible Deletion: When deleteRecords is true, the record deletion is permanent and cannot be undone. Ensure you have backups if there's any chance you'll need the content again.
File Preservation: Detaching a file only removes its connection to the dataset. The file itself remains in your account's file storage and can be reattached later or attached to other datasets.
Batch Processing: For files that generated many records, the deletion process handles batching automatically. You don't need to implement any special logic for large documents.
Vector Store Cleanup: Record deletion includes cleanup from the vector store, ensuring embeddings are also removed. This helps maintain vector database efficiency and prevents ghost results in semantic searches.
Multiple Dataset Attachments: If a file is attached to multiple datasets, detaching from one dataset doesn't affect its attachments to other datasets. Each attachment is independent.
Validation and Authorization
The detach operation validates that:
- The attachment exists between the specified file and dataset
- You own both the dataset and the file
- The dataset and file are both accessible and valid
Attempting to detach a non-existent attachment or unauthorized resources will result in appropriate error responses (404 Not Found or 403 Not Authorized).
Best Practice: Before detaching with record deletion, consider exporting dataset records to create a backup. This provides a safety net if you need to restore the content later.
Synchronizing File Content to Datasets
File synchronization is the process that extracts text content from attached files, processes it into searchable records, generates embeddings for semantic search, and indexes everything into the dataset. Unlike attachment which only creates the connection, synchronization performs the actual content extraction and indexing that makes file information accessible to your AI agents.
Synchronization is intentionally a separate operation from attachment to give you complete control over when processing occurs. This design allows you to attach multiple files and then trigger synchronization in batch, avoid unnecessary processing when files are being updated, and manage computational resources efficiently by scheduling sync operations strategically.
Basic Synchronization
To trigger synchronization of an attached file:
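A minimal sketch, with the route as an assumption.

```typescript
// Trigger a sync for an attached file. The route is an assumption; the call
// returns immediately while processing continues in the background.
const datasetId = 'your-dataset-id'
const fileId = 'your-file-id'

const res = await fetch(
  `https://api.example.com/v1/datasets/${datasetId}/files/${fileId}/sync`,
  {
    method: 'POST',
    headers: { Authorization: `Bearer ${process.env.API_TOKEN}` },
  }
)

const { id } = await res.json() // the file ID, as described below

console.log('Sync queued for file', id)
```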
The sync operation is asynchronous and returns immediately with the file ID. The actual content extraction and indexing happens in the background through the dataset processing queue. You can monitor sync progress and completion through the dataset event log or by checking for new records in the dataset.
What Happens During Sync
When you trigger a file sync, the platform performs several complex operations automatically:
1. Content Extraction: The file is analyzed and its text content is extracted. This varies by file type:
- Text files (TXT, MD): Direct content read
- PDFs: Text layer extraction or OCR for image-based PDFs
- Office documents (DOCX, XLSX, PPTX): Content parsing from structured formats
- HTML/XML: Tag stripping and content extraction
- Code files: Source code with syntax preservation
2. Text Chunking: Extracted content is intelligently split into manageable chunks. The chunking algorithm:
- Respects document structure (paragraphs, sections, headings)
- Maintains semantic coherence in each chunk
- Ensures chunks are optimally sized for embedding models
- Preserves context by including overlapping content between chunks
3. Record Creation: Each chunk becomes a dataset record containing:
- The text content
- Source metadata identifying the file: file:///{fileId}
- Positional information (which chunk in the sequence)
- File metadata (name, type, creation date)
4. Embedding Generation: Text chunks are processed through embedding models to create high-dimensional vector representations that capture semantic meaning. These embeddings enable semantic search capabilities.
5. Vector Indexing: Generated embeddings are stored in the vector database with indexes optimized for similarity search, allowing fast retrieval of relevant content during bot conversations.
Monitoring Sync Progress
Since synchronization is asynchronous, you need to monitor its progress:
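One way to do this, sketched under assumptions, is to poll the dataset for records whose source matches the file. The records route and its source filter parameter are hypothetical; the dataset event log is an alternative not shown here.

```typescript
// Poll the dataset until records derived from the file appear, or time out.
// The records route and its `source` filter parameter are hypothetical.
const datasetId = 'your-dataset-id'
const fileId = 'your-file-id'

async function waitForSync(timeoutMs = 10 * 60 * 1000, intervalMs = 15_000) {
  const deadline = Date.now() + timeoutMs
  const source = encodeURIComponent(`file:///${fileId}`)

  while (Date.now() < deadline) {
    const { items } = await fetch(
      `https://api.example.com/v1/datasets/${datasetId}/records?source=${source}`,
      { headers: { Authorization: `Bearer ${process.env.API_TOKEN}` } }
    ).then((res) => res.json())

    if (items?.length) {
      return items.length // sync has produced records for this file
    }

    await new Promise((resolve) => setTimeout(resolve, intervalMs))
  }

  throw new Error('Timed out waiting for sync to produce records')
}

console.log(`Indexed ${await waitForSync()} records`)
```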
Re-synchronization and Updates
If you update a file's content (by uploading a new version), the file attachment doesn't automatically detect the change. You need to manually trigger synchronization again to refresh the dataset with updated content:
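For example, after uploading the new version, issue the same sync call again (route assumed as before).

```typescript
// Re-run the sync after uploading new content for the file. Route assumed.
const datasetId = 'your-dataset-id'
const fileId = 'your-file-id'

await fetch(`https://api.example.com/v1/datasets/${datasetId}/files/${fileId}/sync`, {
  method: 'POST',
  headers: { Authorization: `Bearer ${process.env.API_TOKEN}` },
})
```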
Rate Limiting
The sync endpoint implements rate limiting to prevent excessive queue load:
- Limit: 1 sync request per 2 minutes per file attachment
- Purpose: Prevents accidental rapid re-syncing of the same file
- Scope: Applied per authenticated session
If you need to re-sync multiple files rapidly, stagger your sync requests or wait for the rate limit window to reset. The rate limit applies per file, so you can sync different files concurrently without hitting limits.
Synchronization Performance
Sync processing time varies based on several factors:
- Small Text Files (< 100 KB): Usually process in under 30 seconds
- Medium Documents (100 KB - 1 MB): Typically 1-3 minutes
- Large Documents (1-10 MB): May take 5-15 minutes
- Very Large Files (> 10 MB): Can require 15-30+ minutes
PDF Complexity: Image-heavy PDFs or those requiring OCR take significantly longer than text-based PDFs due to image processing requirements.
Concurrent Processing: Multiple file syncs for the same dataset are processed sequentially to maintain consistency, so queuing delays may occur during bulk operations.
Error Handling
Common sync failures and their causes:
File Format Not Supported: The file type cannot be processed for text extraction. Check that your file format is in the supported list.
Corrupted File: The file cannot be read or parsed. Verify file integrity and try uploading again.
Empty Content: The file contains no extractable text. This can occur with image-only PDFs when OCR fails or with binary files mistakenly attached.
Processing Timeout: Very large or complex files may exceed processing limits. Consider splitting large documents into smaller files.
Storage Limits: Your account's record or storage limits may be reached. Check usage and upgrade if necessary.
Check the dataset event log for detailed error messages when sync operations fail.
Best Practices
Batch Attachments, Then Sync: When adding multiple files, attach them all first, then trigger synchronization. This reduces queue overhead and provides better performance than alternating attach/sync operations.
Schedule Large Syncs: For processing many large files, consider scheduling sync operations during off-peak hours to ensure adequate processing resources and avoid user-facing delays.
Monitor and Validate: After synchronization completes, verify that records were created successfully by checking the dataset record count and performing test searches.
Optimize File Preparation: Clean up documents before upload - remove unnecessary pages, compress images, and eliminate non-textual content to improve processing speed and quality.
Handle Failures Gracefully: Implement retry logic with exponential backoff for sync operations that fail due to temporary issues.
Use Webhooks: Configure webhooks to receive notifications when sync operations complete, enabling event-driven workflows instead of polling.
Integration Patterns
Pattern 1: Bulk Knowledge Base Creation
Upload and sync multiple documents to build a comprehensive knowledge base:
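A sketch of the attach-everything-then-sync approach follows; the routes, field names, and document list are assumptions.

```typescript
// Bulk knowledge-base creation sketch: upload and attach every document first,
// then trigger the syncs. Routes and field names are assumptions.
const BASE_URL = 'https://api.example.com/v1' // placeholder
const headers = {
  Authorization: `Bearer ${process.env.API_TOKEN}`,
  'Content-Type': 'application/json',
}
const datasetId = 'your-dataset-id'

const documents = [
  { name: 'handbook.pdf', url: 'https://example.com/docs/handbook.pdf' },
  { name: 'faq.md', url: 'https://example.com/docs/faq.md' },
]

const fileIds: string[] = []

// Attach everything first...
for (const doc of documents) {
  const file = await fetch(`${BASE_URL}/files`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ name: doc.name }),
  }).then((res) => res.json())

  await fetch(`${BASE_URL}/files/${file.id}/upload`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ url: doc.url }),
  })

  await fetch(`${BASE_URL}/datasets/${datasetId}/files/${file.id}/attach`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ type: 'source' }),
  })

  fileIds.push(file.id)
}

// ...then trigger syncs (the per-file rate limit does not block different files).
for (const fileId of fileIds) {
  await fetch(`${BASE_URL}/datasets/${datasetId}/files/${fileId}/sync`, {
    method: 'POST',
    headers,
  })
}
```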
Pattern 2: Continuous Document Updates
Keep a dataset in sync with an external document repository:
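Sketched below with a hypothetical change-feed helper; the routes, field names, and repository integration are assumptions.

```typescript
// Continuous update sketch: for each changed document, replace the file
// content and re-sync. Routes, field names, and the change feed are assumptions.
const BASE_URL = 'https://api.example.com/v1' // placeholder
const headers = {
  Authorization: `Bearer ${process.env.API_TOKEN}`,
  'Content-Type': 'application/json',
}
const datasetId = 'your-dataset-id'

// Hypothetical helper that maps documents changed in your repository since the
// last run to their already-attached file IDs.
async function fetchChangedDocuments(): Promise<{ fileId: string; url: string }[]> {
  return [] // plug in your repository's change feed here
}

for (const doc of await fetchChangedDocuments()) {
  // Drop stale records so the refreshed content fully replaces the old version.
  await fetch(`${BASE_URL}/datasets/${datasetId}/files/${doc.fileId}/detach`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ deleteRecords: true }),
  })

  await fetch(`${BASE_URL}/files/${doc.fileId}/upload`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ url: doc.url }),
  })

  await fetch(`${BASE_URL}/datasets/${datasetId}/files/${doc.fileId}/attach`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ type: 'source' }),
  })

  await fetch(`${BASE_URL}/datasets/${datasetId}/files/${doc.fileId}/sync`, {
    method: 'POST',
    headers,
  })
}
```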
Important: Synchronization requires an active file attachment. Ensure the file is attached before attempting to sync. Attempting to sync a detached file will result in a 404 Not Found error.