Dataset Files
Dataset files provide a powerful mechanism for integrating document-based knowledge into your AI applications. By attaching files to datasets, you enable the platform to automatically extract text content, process it into searchable records, and make that information available to your bots through semantic search capabilities.
The file attachment system supports a wide range of document formats and handles the complexity of text extraction, chunking, and vectorization automatically. Once attached and synced, file content becomes instantly searchable within the dataset, allowing your AI agents to access and reference information from documents when responding to user queries.
Understanding File Attachments
When you attach a file to a dataset, you're creating a connection that tells the platform to extract and index the file's content. The attachment system supports different attachment types that control how the file content is processed and stored:
- source: The file serves as a source of knowledge, with its content extracted and stored as dataset records
- reference: The file is referenced but not automatically processed (useful for metadata tracking)
File attachments are persistent connections - once attached, the file remains associated with the dataset until explicitly detached. This allows you to manage your knowledge base by adding or removing document sources as your information needs evolve.
Attaching Files to Datasets
To attach a file to a dataset, you need both a file ID (obtained by uploading a file) and a dataset ID (from creating or fetching a dataset). The attachment operation creates the connection but does not immediately process the file - you'll need to trigger a sync operation separately to extract and index the content.
The type parameter is required and determines how the file is handled:
- source: Most common type - extracts text content from the file and creates searchable records in the dataset. Use this when you want the file's content to be available for AI retrieval and reference. Supported formats include PDF, TXT, DOCX, PPTX, and many others.
- reference: Creates an attachment without content extraction. Useful for tracking which files are associated with a dataset without processing their content, or for files that will be processed through custom mechanisms.
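A minimal attach call is sketched below with a fetch-based client. The base URL and route are illustrative placeholders rather than the documented paths; only the `type` values (`source` and `reference`) come from this guide.

```typescript
// Illustrative sketch only: the base URL and route are placeholders, not the
// documented API paths. Only the `type` values ('source' | 'reference') are
// described in this guide.
const BASE_URL = 'https://api.example.com/v1' // placeholder
const TOKEN = process.env.API_TOKEN ?? ''

async function attachFile(
  datasetId: string,
  fileId: string,
  type: 'source' | 'reference'
) {
  const res = await fetch(`${BASE_URL}/datasets/${datasetId}/files/${fileId}/attach`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ type }),
  })

  if (!res.ok) {
    throw new Error(`Attach failed with status ${res.status}`)
  }

  return res.json()
}

// Attach as a knowledge source so its content is extracted on the next sync.
await attachFile('your-dataset-id', 'your-file-id', 'source')
```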
Supported File Formats
The file attachment system can extract text from numerous document formats:
- Text Documents: TXT, MD (Markdown), RTF
- Office Documents: DOCX, XLSX, PPTX
- PDFs: Both text-based and image-based (with OCR)
- Web Documents: HTML, XML
- Code Files: Most programming language source files
- Data Formats: JSON, CSV, YAML
Attachment Workflow
The complete workflow for making file content available in a dataset involves three steps:
- Upload: First upload the file using the file upload endpoint to get a file ID
- Attach: Create the attachment between the file and dataset (this operation)
- Sync: Trigger synchronization to extract and index the content
Here's a complete example:
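The sketch below walks through the three steps with a fetch-based client. The routes and request field names (for example, the data URL field) are assumptions for illustration; only the upload, attach, sync sequence comes from this guide.

```typescript
// Upload -> attach -> sync, sketched with placeholder routes and field names.
const BASE_URL = 'https://api.example.com/v1' // placeholder
const headers = {
  Authorization: `Bearer ${process.env.API_TOKEN}`,
  'Content-Type': 'application/json',
}
const datasetId = 'your-dataset-id'

// 1. Upload: create the file and provide its content (here as a data URL).
const file = await fetch(`${BASE_URL}/files`, {
  method: 'POST',
  headers,
  body: JSON.stringify({
    name: 'notes.txt',
    dataURL: `data:text/plain;base64,${Buffer.from('Hello, world!').toString('base64')}`,
  }),
}).then((res) => res.json())

// 2. Attach: connect the file to the dataset as a knowledge source.
await fetch(`${BASE_URL}/datasets/${datasetId}/files/${file.id}/attach`, {
  method: 'POST',
  headers,
  body: JSON.stringify({ type: 'source' }),
})

// 3. Sync: extract, chunk, embed, and index the file content into the dataset.
await fetch(`${BASE_URL}/datasets/${datasetId}/files/${file.id}/sync`, {
  method: 'POST',
  headers,
})
```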
Re-attaching Files
If a file is already attached to the dataset, calling attach again will update the attachment type. The existing attachment is automatically removed and recreated with the new type. This allows you to change how a file is processed without manually detaching and reattaching.
Important Considerations
Processing Time: Large documents or complex PDFs may take several minutes to process during sync. The platform handles this asynchronously, so your attach request returns immediately.
File Size Limits: Files are subject to your account's size limits. Very large files (hundreds of MB) should be split into smaller chunks for optimal processing.
Content Updates: If you update the file content (by uploading a new version), you need to trigger a new sync to refresh the dataset records. Attachments don't automatically detect file changes.
Multiple Datasets: A single file can be attached to multiple datasets, allowing you to reuse content across different knowledge bases without duplicating file storage.
Record Source Tracking: Records created from file content include source metadata that references the original file ID, enabling you to track which records came from which documents.
Files can serve as a source of records for your datasets. You can create files, attach them to datasets, and sync them to import records.
Create File
Creating a file is the first step to using it as a data source for your datasets. You can create a file by making a POST request to the following endpoint:
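As a sketch, assuming a fetch-based client, a create request might look like the following. The `/files` route and the `name`/`description` fields are illustrative assumptions, not the documented schema.

```typescript
// Create a file record to be used later as a dataset source. The route and
// field names are illustrative assumptions.
const res = await fetch('https://api.example.com/v1/files', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    name: 'product-manual.pdf',
    description: 'Product manual used as a knowledge source',
  }),
})

const { id: fileId } = await res.json()

console.log('Created file', fileId)
```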
Uploading File Content
There are multiple ways to upload file content to be used as a data source for your datasets.
Upload via JSON URL or Data URL
You can upload a file by providing an HTTP URL or a data URL in a JSON request body. This method is suitable for smaller files (up to 4.5MB).
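Both variants are sketched below with a fetch-based client. The upload route and the `url`/`dataURL` field names are assumptions for illustration.

```typescript
// Upload by HTTP URL or by data URL in a JSON body (suitable up to 4.5MB).
// The route and the `url` / `dataURL` field names are assumptions.
const fileId = 'your-file-id'
const headers = {
  Authorization: `Bearer ${process.env.API_TOKEN}`,
  'Content-Type': 'application/json',
}

// Variant A: point the platform at an HTTP URL it can fetch.
await fetch(`https://api.example.com/v1/files/${fileId}/upload`, {
  method: 'POST',
  headers,
  body: JSON.stringify({ url: 'https://example.com/docs/handbook.pdf' }),
})

// Variant B: embed the content directly as a base64 data URL.
await fetch(`https://api.example.com/v1/files/${fileId}/upload`, {
  method: 'POST',
  headers,
  body: JSON.stringify({
    dataURL: `data:text/plain;base64,${Buffer.from('Hello, world!').toString('base64')}`,
  }),
})
```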
Upload via Multipart/Form-Data
You can upload a file using multipart/form-data. This method is suitable for files up to 4.5MB.
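A sketch using Node's built-in FormData and Blob follows; the route and the form field name are assumptions for illustration.

```typescript
import { readFile } from 'node:fs/promises'

// Multipart upload sketch (suitable up to 4.5MB). The route and form field
// name are assumptions.
const fileId = 'your-file-id'
const bytes = await readFile('./handbook.pdf')

const form = new FormData()
form.append('file', new Blob([bytes], { type: 'application/pdf' }), 'handbook.pdf')

await fetch(`https://api.example.com/v1/files/${fileId}/upload`, {
  method: 'POST',
  // Note: do not set Content-Type manually; fetch adds the multipart boundary.
  headers: { Authorization: `Bearer ${process.env.API_TOKEN}` },
  body: form,
})
```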
Upload via Raw File Stream
You can upload a file by sending the raw file stream in the request body. This method is suitable for files up to 4.5MB.
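A sketch sending the raw bytes as the request body; the route is an assumption for illustration.

```typescript
import { readFile } from 'node:fs/promises'

// Raw body upload sketch (suitable up to 4.5MB). The route is an assumption.
const fileId = 'your-file-id'
const bytes = await readFile('./handbook.pdf')

await fetch(`https://api.example.com/v1/files/${fileId}/upload`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.API_TOKEN}`,
    'Content-Type': 'application/pdf',
  },
  body: bytes,
})
```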
Direct-to-Source Uploads
For larger files or more control over the upload process, you can obtain a pre-signed upload request by providing the file metadata in a JSON request body. You can then use the provided upload request to upload the file directly to the storage service.
The response will include an uploadRequest object with the necessary details to perform the upload. You can then use this uploadRequest to upload the file directly to the storage service.
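The two steps are sketched below under assumptions: the route for requesting the pre-signed upload and the exact shape of the uploadRequest object (url, method, headers) are illustrative, not the documented contract.

```typescript
import { readFile } from 'node:fs/promises'

// Direct-to-source upload sketch. The route and the assumed shape of
// `uploadRequest` ({ url, method, headers }) are illustrative.
const fileId = 'your-file-id'

// 1. Describe the file and ask for a pre-signed upload request.
const { uploadRequest } = await fetch(
  `https://api.example.com/v1/files/${fileId}/upload-request`,
  {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ name: 'handbook.pdf', type: 'application/pdf' }),
  }
).then((res) => res.json())

// 2. Upload the bytes directly to the storage service.
await fetch(uploadRequest.url, {
  method: uploadRequest.method ?? 'PUT',
  headers: uploadRequest.headers ?? {},
  body: await readFile('./handbook.pdf'),
})
```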
Dataset files are the primary way to add content and knowledge to your datasets, enabling AI agents to access and reference specific documents, images, PDFs, text files, and other file types during conversations. Each file attached to a dataset is automatically processed, indexed, and made searchable, allowing the AI to retrieve relevant information when responding to user queries.
Listing Dataset Files
Retrieving the list of files attached to a dataset allows you to inventory all content within a knowledge base, review file metadata, and manage your dataset's content library. The list endpoint provides comprehensive information about each file including its name, description, visibility settings, and timestamps.
To retrieve the files associated with a dataset, send a GET request to the dataset's file list endpoint:
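A minimal sketch follows, with a placeholder route and an assumed `{ items }` response wrapper.

```typescript
// List the files attached to a dataset. The route and response wrapper are
// assumptions for illustration.
const datasetId = 'your-dataset-id'

const res = await fetch(`https://api.example.com/v1/datasets/${datasetId}/files`, {
  headers: { Authorization: `Bearer ${process.env.API_TOKEN}` },
})

const { items } = await res.json()

for (const file of items) {
  console.log(file.id, file.name)
}
```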
Pagination
The endpoint supports cursor-based pagination for efficiently navigating large file collections:
- cursor: Pagination token from the previous response, enabling you to fetch the next page of results
- take: Number of files to retrieve per page (adjust based on your needs)
- order: Sort order, either asc (oldest first) or desc (newest first, default)
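The sketch below pages through a dataset's files using these parameters; the route, the `{ items, cursor }` response shape, and the end-of-pages signal are assumptions.

```typescript
// Walk all pages of a dataset's file list using cursor-based pagination.
// The route and response shape ({ items, cursor }) are assumptions.
const datasetId = 'your-dataset-id'
let cursor: string | undefined

do {
  const params = new URLSearchParams({ take: '50', order: 'desc' })

  if (cursor) {
    params.set('cursor', cursor)
  }

  const page = await fetch(
    `https://api.example.com/v1/datasets/${datasetId}/files?${params}`,
    { headers: { Authorization: `Bearer ${process.env.API_TOKEN}` } }
  ).then((res) => res.json())

  for (const file of page.items) {
    console.log(file.id, file.name)
  }

  cursor = page.cursor // assumed to be absent on the last page
} while (cursor)
```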
Filtering by Metadata
Filter files based on custom metadata fields using deep object notation:
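Deep object notation typically encodes nested keys directly in the query string, for example meta[category]=policies. The parameter name and route below are assumptions for illustration.

```typescript
// Filter attached files by custom metadata via deep object notation in the
// query string. The `meta[...]` parameter name and route are assumptions.
const datasetId = 'your-dataset-id'

const query = new URLSearchParams({
  'meta[category]': 'policies',
  'meta[department]': 'hr',
})

const { items } = await fetch(
  `https://api.example.com/v1/datasets/${datasetId}/files?${query}`,
  { headers: { Authorization: `Bearer ${process.env.API_TOKEN}` } }
).then((res) => res.json())

console.log(`Matched ${items.length} files`)
```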
Metadata filtering enables flexible organization and retrieval based on your own categorization schemes, making it easy to find specific types of content within large datasets.
Response Format
The endpoint returns an array of file objects:
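The fields below reflect what this guide describes (name, description, visibility, metadata, timestamps). The exact property names are assumptions, so treat this as an approximate shape rather than the authoritative schema.

```typescript
// Approximate shape of a file object in the list response. Property names are
// assumptions based on the fields mentioned in this guide.
interface DatasetFile {
  id: string
  name: string
  description?: string
  visibility: 'private' | 'protected' | 'public'
  meta?: Record<string, unknown>
  createdAt: number
  updatedAt: number
}

interface DatasetFileListResponse {
  items: DatasetFile[]
  cursor?: string // present when more pages are available (assumed)
}
```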
File Visibility
Each file has a visibility setting that controls access:
- private: Only accessible to the file owner and explicitly authorized users
- protected: Accessible to users within the same organization or team
- public: Publicly accessible (use with caution for sensitive content)
Streaming Response (JSONL)
For real-time processing of large file lists, you can request the response in JSONL streaming format, where each line is a separate JSON object:
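A sketch of consuming the stream line by line follows; the Accept header value, route, and response framing are assumptions.

```typescript
// Consume a JSONL stream of file objects incrementally. The Accept value and
// route are assumptions.
const datasetId = 'your-dataset-id'

const res = await fetch(`https://api.example.com/v1/datasets/${datasetId}/files`, {
  headers: {
    Authorization: `Bearer ${process.env.API_TOKEN}`,
    Accept: 'application/jsonl',
  },
})

const decoder = new TextDecoder()
let buffer = ''

// Node's fetch body is an async-iterable stream of byte chunks.
for await (const chunk of res.body as unknown as AsyncIterable<Uint8Array>) {
  buffer += decoder.decode(chunk, { stream: true })

  let newline: number
  while ((newline = buffer.indexOf('\n')) >= 0) {
    const line = buffer.slice(0, newline).trim()
    buffer = buffer.slice(newline + 1)

    if (line) {
      console.log(JSON.parse(line)) // one file object per line
    }
  }
}
```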
This format is ideal for processing large file lists incrementally without waiting for the entire response.
Important Notes:
- Only files attached to datasets you own are returned
- File processing status is not included in the list response; check individual file details for processing state
- Deleted files are automatically removed from the list
- The list reflects the current state of file attachments through the DatasetFileAttachment relationship
- File metadata is flexible and can store arbitrary key-value pairs for custom organization
Detaching Files from Datasets
When a file is no longer needed as a knowledge source for a dataset, you can detach it to remove the connection between the file and dataset. The detachment operation provides flexible control over what happens to the content that was extracted from the file, allowing you to either preserve the existing dataset records or clean them up along with the attachment.
Detaching a file is useful when you want to update your dataset's knowledge base by removing outdated information, reorganizing document sources, or simply cleaning up attachments that are no longer relevant. The operation is immediate and can be configured to handle content cleanup automatically.
Basic Detachment
To detach a file without removing its extracted records from the dataset:
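A sketch of the detach call that keeps the extracted records; the route and the placement of the flag in the JSON body are assumptions, while the deleteRecords semantics are described in this guide.

```typescript
// Detach a file but keep the records extracted from it. The route is an
// assumption; the `deleteRecords` flag is described in this guide.
const datasetId = 'your-dataset-id'
const fileId = 'your-file-id'

await fetch(`https://api.example.com/v1/datasets/${datasetId}/files/${fileId}/detach`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ deleteRecords: false }),
})
```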
This removes the attachment relationship while preserving all records that were created from the file's content. The records remain searchable in the dataset and continue to provide knowledge to your AI agents. This option is useful when you want to disconnect a file but keep its information available.
Detachment with Record Deletion
To completely remove both the attachment and all associated content from the dataset:
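As above, a sketch with an assumed route, this time requesting record cleanup as well.

```typescript
// Detach the file and delete every record that was created from its content.
// The route is an assumption; `deleteRecords` is described in this guide.
const datasetId = 'your-dataset-id'
const fileId = 'your-file-id'

await fetch(`https://api.example.com/v1/datasets/${datasetId}/files/${fileId}/detach`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ deleteRecords: true }),
})
```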
This performs a complete cleanup by:
- Identifying all records in the dataset that originated from the file
- Deleting those records from both the database and vector store
- Removing the file attachment
Use this option when you want to fully remove a document's information from the dataset, such as when content becomes outdated, incorrect, or no longer relevant to your AI application.
Record Deletion Process
When deleteRecords is set to true, the system:
- Locates all records with a source matching file:///{fileId}
- Processes deletions in batches of 10 for efficient performance
- Removes records from both the Prisma database and the vector store
- Handles large files with many records without timeout issues
The deletion process runs synchronously but is optimized for performance. For files that generated hundreds or thousands of records, the operation may take several seconds to complete.
Detachment Scenarios
Scenario 1: Updating File Content
When you need to update a document's content, detach with record deletion, then re-attach and sync the updated file. This ensures clean replacement of old content with new:
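The scenario is sketched end to end below. The routes and the URL-based upload field are assumptions, while the detach, upload, attach, sync order follows the steps described here.

```typescript
// Replace a document's content: detach with record deletion, upload the new
// version, re-attach, and sync. Routes and field names are assumptions.
const BASE_URL = 'https://api.example.com/v1' // placeholder
const headers = {
  Authorization: `Bearer ${process.env.API_TOKEN}`,
  'Content-Type': 'application/json',
}
const datasetId = 'your-dataset-id'
const fileId = 'your-file-id'

// 1. Remove the old attachment along with all records derived from the file.
await fetch(`${BASE_URL}/datasets/${datasetId}/files/${fileId}/detach`, {
  method: 'POST',
  headers,
  body: JSON.stringify({ deleteRecords: true }),
})

// 2. Upload the updated content for the same file.
await fetch(`${BASE_URL}/files/${fileId}/upload`, {
  method: 'POST',
  headers,
  body: JSON.stringify({ url: 'https://example.com/docs/handbook-v2.pdf' }),
})

// 3. Re-attach as a source and sync so the new content replaces the old.
await fetch(`${BASE_URL}/datasets/${datasetId}/files/${fileId}/attach`, {
  method: 'POST',
  headers,
  body: JSON.stringify({ type: 'source' }),
})

await fetch(`${BASE_URL}/datasets/${datasetId}/files/${fileId}/sync`, {
  method: 'POST',
  headers,
})
```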
Scenario 2: Reorganizing Knowledge Base
When restructuring datasets, you might detach files without deleting records to preserve knowledge while reorganizing attachments. This is useful when migrating content between datasets or consolidating knowledge sources.
Scenario 3: Content Removal
When information becomes obsolete, confidential, or needs to be removed for compliance reasons, detach with record deletion to ensure complete removal from the AI's accessible knowledge.
Important Considerations
Irreversible Deletion: When deleteRecords is true, the record deletion is permanent and cannot be undone. Ensure you have backups if there's any chance you'll need the content again.
File Preservation: Detaching a file only removes its connection to the dataset. The file itself remains in your account's file storage and can be reattached later or attached to other datasets.
Batch Processing: For files that generated many records, the deletion process handles batching automatically. You don't need to implement any special logic for large documents.
Vector Store Cleanup: Record deletion includes cleanup from the vector store, ensuring embeddings are also removed. This helps maintain vector database efficiency and prevents ghost results in semantic searches.
Multiple Dataset Attachments: If a file is attached to multiple datasets, detaching from one dataset doesn't affect its attachments to other datasets. Each attachment is independent.
Validation and Authorization
The detach operation validates that:
- The attachment exists between the specified file and dataset
- You own both the dataset and the file
- The dataset and file are both accessible and valid
Attempting to detach a non-existent attachment or unauthorized resources will result in appropriate error responses (404 Not Found or 403 Not Authorized).
Best Practice: Before detaching with record deletion, consider exporting dataset records to create a backup. This provides a safety net if you need to restore the content later.
Synchronizing File Content to Datasets
File synchronization is the process that extracts text content from attached files, processes it into searchable records, generates embeddings for semantic search, and indexes everything into the dataset. Unlike attachment which only creates the connection, synchronization performs the actual content extraction and indexing that makes file information accessible to your AI agents.
Synchronization is intentionally a separate operation from attachment to give you complete control over when processing occurs. This design allows you to attach multiple files and then trigger synchronization in batch, avoid unnecessary processing when files are being updated, and manage computational resources efficiently by scheduling sync operations strategically.
Basic Synchronization
To trigger synchronization of an attached file:
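A minimal sketch, with the route as an assumption.

```typescript
// Trigger a sync for an attached file. The route is an assumption; the call
// returns immediately while processing continues in the background.
const datasetId = 'your-dataset-id'
const fileId = 'your-file-id'

const res = await fetch(
  `https://api.example.com/v1/datasets/${datasetId}/files/${fileId}/sync`,
  {
    method: 'POST',
    headers: { Authorization: `Bearer ${process.env.API_TOKEN}` },
  }
)

const { id } = await res.json() // the file ID, as described below

console.log('Sync queued for file', id)
```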
The sync operation is asynchronous and returns immediately with the file ID. The actual content extraction and indexing happens in the background through the dataset processing queue. You can monitor sync progress and completion through the dataset event log or by checking for new records in the dataset.
What Happens During Sync
When you trigger a file sync, the platform performs several complex operations automatically:
1. Content Extraction: The file is analyzed and its text content is extracted. This varies by file type:
- Text files (TXT, MD): Direct content read
- PDFs: Text layer extraction or OCR for image-based PDFs
- Office documents (DOCX, XLSX, PPTX): Content parsing from structured formats
- HTML/XML: Tag stripping and content extraction
- Code files: Source code with syntax preservation
2. Text Chunking: Extracted content is intelligently split into manageable chunks. The chunking algorithm:
- Respects document structure (paragraphs, sections, headings)
- Maintains semantic coherence in each chunk
- Ensures chunks are optimally sized for embedding models
- Preserves context by including overlapping content between chunks
3. Record Creation: Each chunk becomes a dataset record containing:
- The text content
- Source metadata identifying the file: file:///{fileId}
- Positional information (which chunk in the sequence)
- File metadata (name, type, creation date)
4. Embedding Generation: Text chunks are processed through embedding models to create high-dimensional vector representations that capture semantic meaning. These embeddings enable semantic search capabilities.
5. Vector Indexing: Generated embeddings are stored in the vector database with indexes optimized for similarity search, allowing fast retrieval of relevant content during bot conversations.
Monitoring Sync Progress
Since synchronization is asynchronous, you need to monitor its progress:
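One way to do this, sketched under assumptions, is to poll the dataset for records whose source matches the file. The records route and its source filter parameter are hypothetical; the dataset event log is an alternative not shown here.

```typescript
// Poll the dataset until records derived from the file appear, or time out.
// The records route and its `source` filter parameter are hypothetical.
const datasetId = 'your-dataset-id'
const fileId = 'your-file-id'

async function waitForSync(timeoutMs = 10 * 60 * 1000, intervalMs = 15_000) {
  const deadline = Date.now() + timeoutMs
  const source = encodeURIComponent(`file:///${fileId}`)

  while (Date.now() < deadline) {
    const { items } = await fetch(
      `https://api.example.com/v1/datasets/${datasetId}/records?source=${source}`,
      { headers: { Authorization: `Bearer ${process.env.API_TOKEN}` } }
    ).then((res) => res.json())

    if (items?.length) {
      return items.length // sync has produced records for this file
    }

    await new Promise((resolve) => setTimeout(resolve, intervalMs))
  }

  throw new Error('Timed out waiting for sync to produce records')
}

console.log(`Indexed ${await waitForSync()} records`)
```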
Re-synchronization and Updates
If you update a file's content (by uploading a new version), the file attachment doesn't automatically detect the change. You need to manually trigger synchronization again to refresh the dataset with updated content:
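For example, after uploading the new version, issue the same sync call again (route assumed as before).

```typescript
// Re-run the sync after uploading new content for the file. Route assumed.
const datasetId = 'your-dataset-id'
const fileId = 'your-file-id'

await fetch(`https://api.example.com/v1/datasets/${datasetId}/files/${fileId}/sync`, {
  method: 'POST',
  headers: { Authorization: `Bearer ${process.env.API_TOKEN}` },
})
```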
Rate Limiting
The sync endpoint implements rate limiting to prevent excessive queue load:
- Limit: 1 sync request per 2 minutes per file attachment
- Purpose: Prevents accidental rapid re-syncing of the same file
- Scope: Applied per authenticated session
If you need to re-sync multiple files rapidly, stagger your sync requests or wait for the rate limit window to reset. The rate limit applies per file, so you can sync different files concurrently without hitting limits.
Synchronization Performance
Sync processing time varies based on several factors:
- Small Text Files (< 100 KB): Usually process in under 30 seconds
- Medium Documents (100 KB - 1 MB): Typically 1-3 minutes
- Large Documents (1-10 MB): May take 5-15 minutes
- Very Large Files (> 10 MB): Can require 15-30+ minutes
PDF Complexity: Image-heavy PDFs or those requiring OCR take significantly longer than text-based PDFs due to image processing requirements.
Concurrent Processing: Multiple file syncs for the same dataset are processed sequentially to maintain consistency, so queuing delays may occur during bulk operations.
Error Handling
Common sync failures and their causes:
File Format Not Supported: The file type cannot be processed for text extraction. Check that your file format is in the supported list.
Corrupted File: The file cannot be read or parsed. Verify file integrity and try uploading again.
Empty Content: The file contains no extractable text. This can occur with image-only PDFs when OCR fails or with binary files mistakenly attached.
Processing Timeout: Very large or complex files may exceed processing limits. Consider splitting large documents into smaller files.
Storage Limits: Your account's record or storage limits may be reached. Check usage and upgrade if necessary.
Check the dataset event log for detailed error messages when sync operations fail.
Best Practices
Batch Attachments, Then Sync: When adding multiple files, attach them all first, then trigger synchronization. This reduces queue overhead and provides better performance than alternating attach/sync operations.
Schedule Large Syncs: For processing many large files, consider scheduling sync operations during off-peak hours to ensure adequate processing resources and avoid user-facing delays.
Monitor and Validate: After synchronization completes, verify that records were created successfully by checking the dataset record count and performing test searches.
Optimize File Preparation: Clean up documents before upload - remove unnecessary pages, compress images, and eliminate non-textual content to improve processing speed and quality.
Handle Failures Gracefully: Implement retry logic with exponential backoff for sync operations that fail due to temporary issues.
Use Webhooks: Configure webhooks to receive notifications when sync operations complete, enabling event-driven workflows instead of polling.
Integration Patterns
Pattern 1: Bulk Knowledge Base Creation
Upload and sync multiple documents to build a comprehensive knowledge base:
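A sketch of the attach-everything-then-sync approach follows; the routes, field names, and document list are assumptions.

```typescript
// Bulk knowledge-base creation sketch: upload and attach every document first,
// then trigger the syncs. Routes and field names are assumptions.
const BASE_URL = 'https://api.example.com/v1' // placeholder
const headers = {
  Authorization: `Bearer ${process.env.API_TOKEN}`,
  'Content-Type': 'application/json',
}
const datasetId = 'your-dataset-id'

const documents = [
  { name: 'handbook.pdf', url: 'https://example.com/docs/handbook.pdf' },
  { name: 'faq.md', url: 'https://example.com/docs/faq.md' },
]

const fileIds: string[] = []

// Attach everything first...
for (const doc of documents) {
  const file = await fetch(`${BASE_URL}/files`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ name: doc.name }),
  }).then((res) => res.json())

  await fetch(`${BASE_URL}/files/${file.id}/upload`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ url: doc.url }),
  })

  await fetch(`${BASE_URL}/datasets/${datasetId}/files/${file.id}/attach`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ type: 'source' }),
  })

  fileIds.push(file.id)
}

// ...then trigger syncs (the per-file rate limit does not block different files).
for (const fileId of fileIds) {
  await fetch(`${BASE_URL}/datasets/${datasetId}/files/${fileId}/sync`, {
    method: 'POST',
    headers,
  })
}
```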
Pattern 2: Continuous Document Updates
Keep a dataset in sync with an external document repository:
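Sketched below with a hypothetical change-feed helper; the routes, field names, and repository integration are assumptions.

```typescript
// Continuous update sketch: for each changed document, replace the file
// content and re-sync. Routes, field names, and the change feed are assumptions.
const BASE_URL = 'https://api.example.com/v1' // placeholder
const headers = {
  Authorization: `Bearer ${process.env.API_TOKEN}`,
  'Content-Type': 'application/json',
}
const datasetId = 'your-dataset-id'

// Hypothetical helper that maps documents changed in your repository since the
// last run to their already-attached file IDs.
async function fetchChangedDocuments(): Promise<{ fileId: string; url: string }[]> {
  return [] // plug in your repository's change feed here
}

for (const doc of await fetchChangedDocuments()) {
  // Drop stale records so the refreshed content fully replaces the old version.
  await fetch(`${BASE_URL}/datasets/${datasetId}/files/${doc.fileId}/detach`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ deleteRecords: true }),
  })

  await fetch(`${BASE_URL}/files/${doc.fileId}/upload`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ url: doc.url }),
  })

  await fetch(`${BASE_URL}/datasets/${datasetId}/files/${doc.fileId}/attach`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ type: 'source' }),
  })

  await fetch(`${BASE_URL}/datasets/${datasetId}/files/${doc.fileId}/sync`, {
    method: 'POST',
    headers,
  })
}
```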
Important: Synchronization requires an active file attachment. Ensure the file is attached before attempting to sync. Attempting to sync a detached file will result in a 404 Not Found error.