Semantic Document Search

Supported formats: PDF, Word, Excel, PowerPoint, EPUB, ODT, ZIP, and 200+ text formats

Frequently Asked Questions

Is my data sent to a server?

No, absolutely not. Everything happens entirely in your browser. Your files are processed locally using WebAssembly and JavaScript. No data is transmitted to any server - your documents never leave your device.

This means you can safely use this tool with confidential or sensitive documents. Once the page and the AI model have loaded, the search keeps working even if you disconnect from the internet.

What is Semantic Search and how does it work?

Semantic Search goes beyond traditional keyword matching by understanding the meaning of your query. It applies retrieval techniques from RAG (Retrieval-Augmented Generation), so you can find relevant information in your documents by asking natural language questions.

Here's how it works:

Processing: When you add documents, they are split into smaller chunks and converted into mathematical representations called "embeddings".
Searching: When you type a question, it is also converted into an embedding.
Matching: The system finds the document chunks whose embeddings are most similar to your question.
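
The three steps above can be sketched roughly as follows. This is an illustration only, not the tool's actual code: toy_embed is a hypothetical stand-in that hashes words into a small vector, so it captures word overlap rather than meaning, and the function names, chunk size, and vector size are all assumptions.

```python
import hashlib
import math

def toy_embed(text, dims=256):
    """Hypothetical stand-in for the embedding model: hashes each word into
    one of `dims` buckets and returns a unit-length vector. It captures only
    word overlap, not meaning."""
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(documents, chunk_size=400, embed=toy_embed):
    """Processing: split each document into chunks and embed every chunk."""
    index = []
    for doc in documents:
        for start in range(0, len(doc), chunk_size):
            chunk = doc[start:start + chunk_size]
            index.append((chunk, embed(chunk)))
    return index

def search(index, question, top_k=3, embed=toy_embed):
    """Searching and Matching: embed the question, then rank chunks by similarity."""
    q = embed(question)
    # Vectors are unit length, so the dot product equals the cosine similarity.
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

docs = [
    "Exceptions should be caught close to where they occur.",
    "The quarterly report lists revenue by region.",
]
index = build_index(docs)
results = search(index, "where should exceptions be caught")
```

With a real embedding model in place of toy_embed, the same flow would also rank chunks that share no words with the question.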

This allows you to search by meaning rather than exact keywords. For example, searching for "how to handle errors" will find content about "exception handling" or "error management" even if the words of your query never appear in the document.

What file types are supported?

This tool supports a wide range of file formats:

Office Documents: PDF, Word (.docx), Excel (.xlsx), PowerPoint (.pptx), OpenDocument (.odt)
E-Books: EPUB
Archives: ZIP (all supported files inside are automatically extracted)
Vector Databases: .vsdb (pre-computed search databases, see below)
Text Documents: Plain text (.txt), Markdown (.md), HTML, XML
Code: JavaScript, TypeScript, Python, Java, C/C++, Rust, Go, Ruby, PHP, and 200+ more
Data: JSON, YAML, CSV, TOML, INI configuration files

Media files such as images, audio, and video are not supported. All document processing happens entirely in your browser - no server required.

What is a .vsdb file?

A .vsdb (Vector Search DataBase) file is a compact binary format that stores your documents together with their pre-computed embeddings. It allows you to export your fully indexed document collection and import it later without having to re-generate the embeddings - which can take a long time for large collections.

How to use it:

Export: After indexing your documents, click the "Export .vsdb" button to download the database file.
Import: Drag & drop the .vsdb file onto the upload area, or use the file picker. The documents and embeddings are restored instantly - no model download or computation needed.
Combine: You can import multiple .vsdb files to merge several document collections into one. Duplicate documents (same content) are automatically detected and merged.

The file includes a compatibility header that records which embedding model was used. If you import a .vsdb file created with a different model, you'll see a warning - but you can still proceed if you choose to.
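
The combine step and the model warning could work along these lines. The real .vsdb binary layout is not described here, so this sketch uses plain dictionaries for already-parsed databases; the field names, the SHA-256 content hash for duplicate detection, and the warning text are all assumptions.

```python
import hashlib

def content_key(text):
    """Identify a document by a hash of its content (an assumed dedup strategy)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def merge_databases(databases, current_model):
    """Merge several already-parsed databases into one collection.

    Each database is a hypothetical dict of the form
    {"model": str, "documents": [{"text": str, "embeddings": list}]}.
    Returns the merged documents plus one warning per model mismatch.
    """
    merged = {}
    warnings = []
    for db in databases:
        if db["model"] != current_model:
            warnings.append(
                f"Database was built with {db['model']}, not {current_model}"
            )
        for doc in db["documents"]:
            # setdefault keeps the first copy of each distinct content.
            merged.setdefault(content_key(doc["text"]), doc)
    return list(merged.values()), warnings

db_a = {"model": "model-a", "documents": [
    {"text": "Install guide", "embeddings": [[0.1, 0.2]]},
    {"text": "Release notes", "embeddings": [[0.3, 0.4]]},
]}
db_b = {"model": "model-b", "documents": [
    {"text": "Install guide", "embeddings": [[0.1, 0.2]]},
]}
documents, warnings = merge_databases([db_a, db_b], current_model="model-a")
```

Here the duplicate "Install guide" is kept once, and the second database triggers a warning instead of being rejected, which matches the "you can still proceed" behavior described above.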

What is the maximum file size?

There is no strict file size limit, but practical limits depend on your device:

Browser memory: Very large files may cause your browser to slow down or crash
Recommended: Individual files under 10 MB work best
Total size: The combined size of all documents should stay under about 100 MB for smooth performance

If you experience slowdowns, try closing other browser tabs or using fewer/smaller files.

Why does adding documents require a download?

When you first add documents, the page downloads an AI model (approximately 130 MB) that runs entirely in your browser. The model used is paraphrase-multilingual-MiniLM-L12-v2, a sentence transformer that supports over 50 languages including English and German.

This model converts text into mathematical representations (embeddings) that capture semantic meaning, enabling you to search by concept rather than just keywords.

Good news: The model is cached in your browser, so future visits won't require another download.

How does the relevance scoring work?

The search uses a hybrid approach that combines two techniques:

Semantic Search (60%): Understands the meaning of your query. Finds content that is conceptually similar, even if the exact words don't match.
Keyword Search / BM25 (40%): Looks for exact word matches. Ensures that documents containing your specific search terms are boosted.

The relevance percentage shown for each result is this combined score. A document with a specific keyword match might appear high in results even if the semantic similarity alone would be low.
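
As a rough sketch of this weighting: the code below computes standard BM25 scores and blends them 60/40 with semantic scores. The k1 and b values are common defaults rather than the tool's known settings, the semantic scores in the example are made up, and rescaling BM25 by its maximum is an assumption about how the two scores are brought onto the same 0 to 1 range.

```python
import math
from collections import Counter

def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score every chunk against the query with the standard BM25 formula."""
    tokenized = [chunk.lower().split() for chunk in chunks]
    n = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in set(query.lower().split()):
            df = sum(1 for t in tokenized if term in t)  # chunks containing the term
            if df == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            tf = counts[term]
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

def hybrid_scores(semantic, keyword, w_semantic=0.6, w_keyword=0.4):
    """Blend per-chunk scores; BM25 is rescaled to 0..1 by its maximum (assumed)."""
    top = max(keyword) or 1.0
    return [w_semantic * s + w_keyword * (k / top) for s, k in zip(semantic, keyword)]

chunks = [
    "error handling in python",
    "exception management strategies",
    "quarterly revenue report",
]
semantic = [0.30, 0.75, 0.10]  # made-up semantic similarities for illustration
keyword = bm25_scores("error handling", chunks)
combined = hybrid_scores(semantic, keyword)
```

In this example the first chunk has the lowest-but-one semantic score, yet its exact keyword match lifts it above the second chunk in the combined ranking.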

What do 'Precise' and 'Context' mean?

Documents are automatically split into chunks at two granularity levels:

Precise: Smaller chunks (~400 characters) that are better for finding specific terms or short phrases. Ideal for keyword-focused searches.
Context: Larger chunks (~1200 characters) that capture more surrounding context. Better for conceptual or thematic searches.

The search automatically considers both granularities and adjusts scoring based on your query length: short queries favor precise chunks, longer questions favor context chunks.
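
A simplified sketch of the two granularities and the query-length adjustment is shown below. The fixed-width splitting, the word-count limits, and the weight values are all hypothetical; the tool may split at sentence boundaries or weight the granularities differently.

```python
def split_into_chunks(text, size):
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def chunk_document(text, precise_size=400, context_size=1200):
    """Produce both granularities for one document."""
    return {
        "precise": split_into_chunks(text, precise_size),
        "context": split_into_chunks(text, context_size),
    }

def granularity_weights(query, short_limit=3, long_limit=8):
    """Hypothetical weighting: short queries favor precise chunks, long
    queries favor context chunks, with a linear blend in between."""
    n_words = len(query.split())
    t = (n_words - short_limit) / (long_limit - short_limit)
    t = min(max(t, 0.0), 1.0)  # clamp to the 0..1 range
    return {"precise": 1.0 - 0.2 * t, "context": 0.8 + 0.2 * t}

both = chunk_document("x" * 3000)
short_query = granularity_weights("errors")
long_query = granularity_weights(
    "how should I deal with errors raised during file import"
)
```

A chunk's score would then be multiplied by the weight for its granularity, so a one-word query nudges precise chunks upward and a full question nudges context chunks upward.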

Why does a result appear without matching keywords?

This happens because 60% of the relevance score comes from semantic search, which works on meaning rather than exact words. The AI model converts your query and every document chunk into a 384-dimensional mathematical vector (called an "embedding"). Chunks whose vectors point in a similar direction are considered relevant.

This means a search for "how to handle errors" can surface a chunk about "exception management strategies" - even though no single word matches. The model learned during training that these concepts are related.
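
"Pointing in a similar direction" is commonly measured with cosine similarity. The sketch below assumes that is the measure used here and works on tiny 3-dimensional vectors as stand-ins for the real 384-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors:
    close to 1.0 for the same direction, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction
    return dot / (norm_a * norm_b)

query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # twice as long, same direction
unrelated = [0.0, 0.0, 5.0]       # orthogonal to the query

high = cosine_similarity(query, same_direction)
low = cosine_similarity(query, unrelated)
```

Only the direction matters, not the length of the vector: the doubled vector scores as a near-perfect match, while the orthogonal one scores zero.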

To investigate why a specific chunk was selected, check the Chunks tab:

High Semantic + low BM25: The chunk is conceptually similar to your query but uses different words. Here the AI model is matching on meaning.
Low Semantic + high BM25: The chunk contains your exact search terms but may be about a different topic. The keyword match is driving the score.
Both high: The strongest matches - both meaning and keywords align.

The embeddings compress entire text passages into abstract numeric features. Unlike keyword matching, there is no way to attribute the semantic score to individual words - it represents the overall meaning similarity between your query and the chunk as a whole.

Why does my search return no results?

Results appear only if their relevance score exceeds a minimum threshold. If nothing shows up, try these tips:

Use specific keywords: Include terms that are likely to appear in your documents.
Try different phrasings: The semantic search understands synonyms, but exact terms get an additional boost.
Keep queries focused: Very long or vague queries may dilute the relevance score.
Check the Chunks tab: Switch to the "Chunks" tab to see all chunks with their scores, including those below the threshold.
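
The threshold behaves like a simple filter over the scored chunks. In this sketch the cutoff of 0.3 is a hypothetical value, since the tool's actual threshold is not stated here.

```python
def visible_results(scored_chunks, threshold=0.3):
    """Keep only chunks whose combined score exceeds the threshold, best first."""
    kept = [(score, text) for score, text in scored_chunks if score > threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)

all_chunks = [
    (0.12, "quarterly revenue report"),
    (0.58, "error handling in python"),
    (0.45, "exception management strategies"),
]
shown = visible_results(all_chunks)
nothing = visible_results([(0.12, "quarterly revenue report")])
```

When every chunk scores below the cutoff, the result list is empty even though the chunks were scored, which is why the Chunks tab can still show them.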

Is this a production-ready solution?

This tool is a proof of concept demonstrating that sophisticated semantic search can run entirely in the browser without any server infrastructure. It showcases the potential of client-side AI for privacy-preserving document search.

For enterprise-grade implementations, multimodal textualization would be a typical enhancement - using OCR and vision models to extract and describe content from images, diagrams, charts, and scanned documents, making visual information searchable alongside text.

That said, this implementation already demonstrates the core principles and can handle real-world document collections effectively within browser constraints.