Indexing Attachments for Vector Search

Overview

Attachments in RavenDB

  • Attachments in RavenDB allow you to associate binary files with your JSON documents.
    You can use attachments to store images, PDFs, videos, text files, or any other format.

  • Attachments are stored separately from documents, reducing document size and avoiding unnecessary duplication. They are stored as binary data, regardless of content type.

  • Attachments are handled as streams, allowing efficient upload and retrieval.
    Learn more in: What are attachments.

You can index attachment content in a vector field within a static-index,
enabling vector search on text or numerical data that is stored in the attachments:

  • Attachments with TEXT:

    • During indexing, RavenDB processes the text into a single embedding per attachment using the built-in
      bge-micro-v2 model.
  • Attachments with NUMERICAL data:

    • While attachments can store any file type, RavenDB does not generate embeddings from images, videos, or other non-textual content.
      Each attachment must contain a single precomputed embedding vector, generated externally.
    • RavenDB indexes the embedding vector stored in the attachment and can apply quantization (e.g., index it in Int8 format) if this is configured.
    • All embeddings indexed within the same vector field of the static-index must have the same dimension, to keep indexing and search consistent. They must also be created using the same model.
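
An externally generated embedding has to be written into the attachment as bytes. As a minimal sketch, assuming the attachment holds the vector as raw 32-bit floats in the machine's native byte order (little-endian on common platforms), the conversion could look like this. The `EmbeddingBytes` class and its method names are hypothetical helpers, not part of the RavenDB client:

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical helper (not a RavenDB API): converts an embedding
// to and from raw 32-bit float bytes.
public static class EmbeddingBytes
{
    // Reinterpret the float array as bytes (4 bytes per value, native byte order).
    public static byte[] ToRawBytes(float[] embedding)
    {
        return MemoryMarshal.AsBytes(embedding.AsSpan()).ToArray();
    }

    // Inverse conversion, useful for verifying what was written.
    public static float[] FromRawBytes(byte[] bytes)
    {
        if (bytes.Length % sizeof(float) != 0)
            throw new ArgumentException("Length must be a multiple of 4 bytes.", nameof(bytes));

        return MemoryMarshal.Cast<byte, float>(bytes.AsSpan()).ToArray();
    }

    // Wrap the raw bytes in a stream, since attachments are handled as streams.
    public static MemoryStream ToStream(float[] embedding)
    {
        return new MemoryStream(ToRawBytes(embedding));
    }
}
```

The stream returned by `ToStream` could then be stored as an attachment on the document through the session's attachments API. Check the client API reference for the exact call in your version.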

Indexing TEXT attachments

  • The following index defines a vector field named VectorFromAttachment.

  • It indexes embeddings generated from the text content of the description.txt attachment.
    This applies to all Company documents that contain an attachment with that name.

public class Companies_ByVector_FromTextAttachment :
    AbstractIndexCreationTask<Company, Companies_ByVector_FromTextAttachment.IndexEntry>
{
    public class IndexEntry()
    {
        // This index-field will hold embeddings
        // generated from the TEXT in the attachments.
        public object VectorFromAttachment { get; set; }
    }

    public Companies_ByVector_FromTextAttachment()
    {
        Map = companies => from company in companies

            // Load the attachment from the document (ensure it is not null)
            let attachment = LoadAttachment(company, "description.txt")
            where attachment != null

            select new IndexEntry()
            {
                // Index the text content from the attachment in the vector field
                VectorFromAttachment =
                    CreateVector(attachment.GetContentAsString(Encoding.UTF8))
            };

        // Configure the vector field:
        VectorIndexes.Add(x => x.VectorFromAttachment,
            new VectorOptions()
            {
                // Specify 'Text' as the source format
                SourceEmbeddingType = VectorEmbeddingType.Text,
                // Specify the desired destination format within the index
                DestinationEmbeddingType = VectorEmbeddingType.Single
            });

        SearchEngineType = Raven.Client.Documents.Indexes.SearchEngineType.Corax;
    }
}

Execute a vector search using the index:
Results will include Company documents whose attachment contains text similar to "chinese food".

var relevantCompanies = session
    .Query<Companies_ByVector_FromTextAttachment.IndexEntry,
        Companies_ByVector_FromTextAttachment>()
    .VectorSearch(
        field => field
            .WithField(x => x.VectorFromAttachment),
        searchTerm => searchTerm
            .ByText("chinese food"), 0.8f)
    .Customize(x => x.WaitForNonStaleResults())
    .OfType<Company>()
    .ToList();
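
To give a feel for what "similar" means in a vector search, the sketch below computes cosine similarity between two embeddings and compares it to a threshold. It rests on two assumptions that the text above does not state: that the `0.8f` argument in the query acts as a minimum similarity, and that cosine similarity is the measure in use. `SimilaritySketch` is a hypothetical illustration, not how RavenDB scores results internally:

```csharp
using System;

// Illustration only: a plain cosine-similarity check between two embeddings.
public static class SimilaritySketch
{
    // Returns a value in [-1, 1]; 1 means the vectors point in the same direction.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * (double)b[i];
            normA += a[i] * (double)a[i];
            normB += b[i] * (double)b[i];
        }

        // A zero vector has no direction; treat it as not similar to anything.
        if (normA == 0 || normB == 0)
            return 0;

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // True when the two vectors are at least as similar as the given threshold.
    public static bool PassesThreshold(float[] a, float[] b, double minimumSimilarity)
    {
        return CosineSimilarity(a, b) >= minimumSimilarity;
    }
}
```

Under these assumptions, a higher threshold returns fewer but closer matches, and a lower one returns more but looser matches.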

You can now extract the text from the attachments of the resulting documents:

// Extract text from the attachment of the first resulting document
// ================================================================

// Retrieve the attachment stream
var company = relevantCompanies[0];
var attachmentResult = session.Advanced.Attachments.Get(company, "description.txt");
var attStream = attachmentResult.Stream;

// Read the attachment content into memory and decode it as a UTF-8 string
var ms = new MemoryStream();
attStream.CopyTo(ms);
string attachmentText = Encoding.UTF8.GetString(ms.ToArray());
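
The same decoding can also be done with a `StreamReader`, which reads text directly from the stream without the intermediate `MemoryStream`. The following is a sketch only: `AttachmentText.ReadAllText` is a hypothetical helper, and any readable stream can stand in for the attachment stream:

```csharp
using System.IO;
using System.Text;

// Hypothetical helper: decodes a stream's full content as UTF-8 text.
public static class AttachmentText
{
    public static string ReadAllText(Stream stream)
    {
        // leaveOpen: true, so the caller stays responsible for disposing the stream.
        using var reader = new StreamReader(
            stream,
            Encoding.UTF8,
            detectEncodingFromByteOrderMarks: true,
            bufferSize: 4096,
            leaveOpen: true);

        return reader.ReadToEnd();
    }
}
```

One difference from `Encoding.UTF8.GetString` is that `StreamReader` skips a leading UTF-8 byte order mark if the attachment has one, so the returned text does not start with an invisible `U+FEFF` character.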

Indexing NUMERICAL attachments

LINQ index

  • The following index defines a vector field named VectorFromAttachment.

  • It indexes embeddings generated from the numerical data stored in the vector.raw attachment.
    This applies to all Company documents that contain an attachment with that name.

  • Each attachment contains raw numerical data in 32-bit floating-point format.

public class Companies_ByVector_FromNumericalAttachment :
    AbstractIndexCreationTask<Company, Companies_ByVector_FromNumericalAttachment.IndexEntry>
{
    public class IndexEntry()
    {
        // This index-field will hold embeddings
        // generated from the NUMERICAL content in the attachments.
        public object VectorFromAttachment { get; set; }
    }

    public Companies_ByVector_FromNumericalAttachment()
    {
        Map = companies => from company in companies

            // Load the attachment from the document (ensure it is not null)
            let attachment = LoadAttachment(company, "vector.raw")
            where attachment != null

            select new IndexEntry
            {
                // Index the attachment's content in the vector field
                VectorFromAttachment = CreateVector(attachment.GetContentAsStream())
            };

        // Configure the vector field:
        VectorIndexes.Add(x => x.VectorFromAttachment,
            new VectorOptions()
            {
                // Define the source embedding type
                SourceEmbeddingType = VectorEmbeddingType.Single,
                // Define the desired destination format within the index
                DestinationEmbeddingType = VectorEmbeddingType.Single
            });

        SearchEngineType = Raven.Client.Documents.Indexes.SearchEngineType.Corax;
    }
}

Execute a vector search using the index:
Results will include Company documents whose attachment contains vectors similar to the query vector.

var similarCompanies = session
    .Query<Companies_ByVector_FromNumericalAttachment.IndexEntry,
        Companies_ByVector_FromNumericalAttachment>()
    .VectorSearch(
        field => field
            .WithField(x => x.VectorFromAttachment),
        queryVector => queryVector
            .ByEmbedding(new float[] { 0.1f, 0.2f, 0.3f, 0.4f }))
    .Customize(x => x.WaitForNonStaleResults())
    .OfType<Company>()
    .ToList();
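
The overview notes that RavenDB can quantize a numerical embedding, for example by indexing it in Int8 format. To give a rough sense of what storing a 32-bit float vector as 8-bit integers involves, here is a generic scalar quantization sketch. It is not RavenDB's quantization algorithm, which this article does not describe, and `QuantizationSketch` is a hypothetical name:

```csharp
using System;

// Illustration only: generic symmetric scalar quantization of floats to signed bytes.
public static class QuantizationSketch
{
    // Maps each value to the range [-127, 127], scaled by the largest absolute value.
    public static sbyte[] ToInt8(float[] embedding, out float scale)
    {
        float maxAbs = 0f;
        foreach (float x in embedding)
            maxAbs = Math.Max(maxAbs, Math.Abs(x));

        // Avoid dividing by zero for an all-zero vector.
        scale = maxAbs == 0f ? 1f : maxAbs / 127f;

        var result = new sbyte[embedding.Length];
        for (int i = 0; i < embedding.Length; i++)
        {
            double q = Math.Round(embedding[i] / scale);
            result[i] = (sbyte)Math.Clamp(q, -127.0, 127.0);
        }
        return result;
    }

    // Approximate reconstruction of the original values.
    public static float[] FromInt8(sbyte[] quantized, float scale)
    {
        var result = new float[quantized.Length];
        for (int i = 0; i < quantized.Length; i++)
            result[i] = quantized[i] * scale;
        return result;
    }
}
```

The trade-off is general to this kind of scheme: each value takes one byte instead of four, at the cost of a small rounding error per component.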

JS index

  • The following is the JavaScript index format equivalent to the LINQ index shown above.

  • The main difference is that JavaScript indexes do not support getContentAsStream() on attachment objects:

    • Because of this, embedding vectors must be stored in attachments as Base64-encoded strings.
    • Use getContentAsString() to retrieve the attachment content as a string, as shown in this example.

public class Companies_ByVector_FromNumericalAttachment_JS :
    AbstractJavaScriptIndexCreationTask
{
    public Companies_ByVector_FromNumericalAttachment_JS()
    {
        Maps = new HashSet<string>()
        {
            @"map('Companies', function (company) {

                var attachment = loadAttachment(company, 'vector_base64.raw');
                if (!attachment) return null;

                return {
                    VectorFromAttachment: createVector(attachment.getContentAsString('utf8'))
                };
            })"
        };

        Fields = new();
        Fields.Add("VectorFromAttachment", new IndexFieldOptions()
        {
            Vector = new VectorOptions()
            {
                SourceEmbeddingType = VectorEmbeddingType.Single,
                DestinationEmbeddingType = VectorEmbeddingType.Single
            }
        });

        SearchEngineType = Raven.Client.Documents.Indexes.SearchEngineType.Corax;
    }
}

Execute a vector search using the index:
Results will include Company documents whose attachment contains vectors similar to the query vector.

var similarCompanies = session.Advanced
    .RawQuery<Company>(@"
        from index 'Companies/ByVector/FromNumericalAttachment/JS'
        where vector.search(VectorFromAttachment, $queryVector)")
    .AddParameter("queryVector", new float[] { 0.1f, 0.2f, 0.3f, 0.4f })
    .WaitForNonStaleResults()
    .ToList();
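
The JavaScript index above reads the attachment as a Base64 string, so the attachment content has to be prepared in that form. A minimal sketch follows, assuming the Base64 text encodes the embedding's raw 32-bit float bytes in native byte order and is stored as UTF-8 text. `Base64Embedding` and its methods are hypothetical helpers, not part of the RavenDB client:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper: Base64 text form of an embedding's raw float bytes.
public static class Base64Embedding
{
    // Encode the float values (4 bytes each) as a Base64 string.
    public static string Encode(float[] embedding)
    {
        ReadOnlySpan<byte> raw = MemoryMarshal.AsBytes(embedding.AsSpan());
        return Convert.ToBase64String(raw);
    }

    // Decode a Base64 string back into float values, for verification.
    public static float[] Decode(string base64)
    {
        byte[] raw = Convert.FromBase64String(base64);
        if (raw.Length % sizeof(float) != 0)
            throw new FormatException("Decoded length must be a multiple of 4 bytes.");

        return MemoryMarshal.Cast<byte, float>(raw.AsSpan()).ToArray();
    }

    // The bytes to place in the attachment: the Base64 text, stored as UTF-8.
    public static byte[] ToAttachmentBytes(float[] embedding)
    {
        return Encoding.UTF8.GetBytes(Encode(embedding));
    }
}
```

Base64 text is roughly a third larger than the raw bytes it encodes, so the raw format used by the LINQ index is the more compact choice when getContentAsStream() is available.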

Indexing ALL attachments

  • The following index defines a vector field named VectorFromAttachment.

  • It indexes embeddings generated from the numerical data stored in ALL attachments of all Company documents.

public class Companies_ByVector_AllAttachments :
    AbstractIndexCreationTask<Company, Companies_ByVector_AllAttachments.IndexEntry>
{
    public class IndexEntry()
    {
        // This index-field will hold embeddings
        // generated from the NUMERICAL content of ALL attachments.
        public object VectorFromAttachment { get; set; }
    }

    public Companies_ByVector_AllAttachments()
    {
        Map = companies => from company in companies

            // Load ALL attachments from the document
            let attachments = LoadAttachments(company)

            select new IndexEntry
            {
                // Index the attachments content in the vector field
                VectorFromAttachment = CreateVector(
                    attachments.Select(e => e.GetContentAsStream()))
            };

        // Configure the vector field:
        VectorIndexes.Add(x => x.VectorFromAttachment,
            new VectorOptions()
            {
                SourceEmbeddingType = VectorEmbeddingType.Single,
                DestinationEmbeddingType = VectorEmbeddingType.Single
            });

        SearchEngineType = Raven.Client.Documents.Indexes.SearchEngineType.Corax;
    }
}

Execute a vector search using the index:
Results will include Company documents whose attachments contain vectors similar to the query vector.

var similarCompanies = session
    .Query<Companies_ByVector_AllAttachments.IndexEntry,
        Companies_ByVector_AllAttachments>()
    .VectorSearch(
        field => field
            .WithField(x => x.VectorFromAttachment),
        queryVector => queryVector
            .ByEmbedding(new float[] { -0.1f, 0.2f, -0.7f, -0.8f }))
    .Customize(x => x.WaitForNonStaleResults())
    .OfType<Company>()
    .ToList();
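
Because this index feeds every attachment of every Company document into one vector field, the overview's requirement that all embeddings in a field share the same dimension applies to all of those attachments. A check along the following lines could be run before uploading. It is a sketch that assumes each attachment holds raw 32-bit floats, and `DimensionCheck` is a hypothetical helper:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pre-upload check: all attachments must hold vectors of equal dimension.
public static class DimensionCheck
{
    // Takes (attachment name, raw content) pairs.
    // Returns the shared dimension, or 0 when there are no attachments.
    public static int EnsureSameDimension(IEnumerable<KeyValuePair<string, byte[]>> attachments)
    {
        int? expected = null;

        foreach (var pair in attachments)
        {
            // Raw float32 content must be a whole number of 4-byte values.
            if (pair.Value.Length % sizeof(float) != 0)
                throw new InvalidOperationException(
                    $"Attachment '{pair.Key}' is not a whole number of 32-bit floats.");

            int dimension = pair.Value.Length / sizeof(float);

            if (expected == null)
                expected = dimension;
            else if (dimension != expected.Value)
                throw new InvalidOperationException(
                    $"Attachment '{pair.Key}' has dimension {dimension}, expected {expected.Value}.");
        }

        return expected ?? 0;
    }
}
```

A document that also carries unrelated attachments, such as an image or a text file, would fail a check like this, which is a reason to index ALL attachments only when every attachment is known to hold an embedding.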