CodeMaster is a search engine built to process and rank a large collection of DSA (data structures and algorithms) problems. It relies on two established text-retrieval techniques, TF-IDF and BM25, to rank documents by relevance to the user's query. It also handles edge cases such as misspellings, camel-cased terms, and number words, so that imperfectly typed queries still return useful results.
- Efficient Search Mechanism: Uses TF-IDF and BM25 algorithms to rank documents by relevance to user queries.
- Advanced Text Processing: Handles complex cases like spelling corrections, camel casing, lemmatization, and numeric equivalences.
- Real-Time Ranking: Scores and ranks documents dynamically, prioritizing trusted sources and showing the top 10 results.
- Traditional String Matching: Provides a basic approach but struggles with scalability.
- Term Frequency (TF): Counts keyword occurrences but tends to favor longer documents.
- Inverse Document Frequency (IDF): Adjusts scores by emphasizing rare terms across documents.
- TF-IDF: Combines TF and IDF to determine the importance of keywords effectively.
- Normalization: Accounts for document length to remove bias.
- BM25: Saturates the contribution of repeated terms and normalizes by document length, so no single keyword dominates the score.
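
The progression from raw term counts to BM25 can be sketched in a few functions. This is a minimal illustration, not the project's actual scoring code: the function names are hypothetical, documents are assumed to be arrays of already-tokenized terms, the IDF uses the common smoothed BM25 form, and `k1 = 1.5` and `b = 0.75` are conventional defaults rather than values taken from CodeMaster.

```javascript
// Hypothetical helpers; each document is an array of preprocessed tokens.

// Raw count of a term in one document.
function termFrequency(term, tokens) {
  let count = 0;
  for (const token of tokens) {
    if (token === term) count += 1;
  }
  return count;
}

// Smoothed IDF: rare terms get larger weights, and the value stays positive.
function idf(term, docs) {
  const df = docs.filter((tokens) => tokens.includes(term)).length;
  return Math.log((docs.length - df + 0.5) / (df + 0.5) + 1);
}

// TF-IDF with the count divided by document length to reduce length bias.
function tfidf(term, tokens, docs) {
  if (tokens.length === 0) return 0;
  return (termFrequency(term, tokens) / tokens.length) * idf(term, docs);
}

// BM25: k1 caps the gain from repeated terms, b controls length normalization.
// Both defaults are assumptions, not values from this project.
function bm25(queryTerms, tokens, docs, k1 = 1.5, b = 0.75) {
  const avgLength = docs.reduce((sum, d) => sum + d.length, 0) / docs.length;
  let score = 0;
  for (const term of queryTerms) {
    const tf = termFrequency(term, tokens);
    if (tf === 0) continue;
    const lengthNorm = 1 - b + b * (tokens.length / avgLength);
    score += (idf(term, docs) * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
  }
  return score;
}

// Example: score a small, made-up corpus against the query "two sum".
const sampleDocs = [
  ["two", "sum", "array", "hash", "map"],
  ["binary", "tree", "traversal"],
  ["sum", "of", "subarray", "sum", "sum"],
];
for (const tokens of sampleDocs) {
  console.log(tokens.join(" "), bm25(["two", "sum"], tokens, sampleDocs));
}
```

Because of the `k1` term in the denominator, a document that repeats "sum" three times scores higher than one that mentions it once, but by less than a factor of three.
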
- Preprocessing: Cleans query text by removing stopwords, punctuation, and applying lowercase transformation.
- Keyword Extraction: Identifies relevant terms while ignoring unrelated content.
- Edge Cases:
  - Spelling corrections.
  - Numeric conversions (e.g., "two" to "2").
  - Splitting camel-cased words (e.g., "twoSum" to "two sum").
  - Lemmatization for accurate word matching.
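
The preprocessing steps above could look roughly like the following. It is a sketch under stated assumptions: the stopword list and number-word table are tiny hypothetical samples, the function names are invented for illustration, and spelling correction and lemmatization are left out because they normally depend on a dictionary or an external library.

```javascript
// Small illustrative tables; a real deployment would use fuller lists.
const STOPWORDS = new Set(["a", "an", "the", "is", "of", "to", "in", "for", "and"]);
const NUMBER_WORDS = new Map([
  ["zero", "0"], ["one", "1"], ["two", "2"], ["three", "3"], ["four", "4"],
  ["five", "5"], ["six", "6"], ["seven", "7"], ["eight", "8"], ["nine", "9"],
]);

// Insert a space at each lower-to-upper boundary: "twoSum" -> "two Sum".
function splitCamelCase(text) {
  return text.replace(/([a-z0-9])([A-Z])/g, "$1 $2");
}

// Split camel case first (it needs the original casing), then lowercase,
// strip punctuation, drop stopwords, and map number words to digits.
function preprocessQuery(query) {
  return splitCamelCase(query)
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((token) => token.length > 0 && !STOPWORDS.has(token))
    .map((token) => NUMBER_WORDS.get(token) ?? token);
}

console.log(preprocessQuery("Find the twoSum problem!"));
```

Running the same normalization over both the indexed documents and the incoming query keeps the two sides comparable, so "twoSum", "Two Sum", and "2 sum" all reduce to the same tokens.
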
- Query Processing: Extracts and processes query keywords, then computes TF-IDF scores.
- Cosine Similarity: Measures the closeness of the query and documents.
- Title Similarity: Adds weight to documents with titles that closely match the query.
- Ranking: Applies BM25 for ranking and displays the top 10 results.
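
A rough sketch of the similarity and top-10 steps follows. The source does not say how the cosine, title, and BM25 scores are combined, so the additive `titleWeight` boost here is purely an assumption, as are the function names and the document shape (`id`, `vector`, `titleTerms`). Vectors are assumed to be `Map` objects from term to TF-IDF weight.

```javascript
// Cosine similarity between two sparse vectors stored as Map<term, weight>.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [term, weight] of a) {
    normA += weight * weight;
    dot += weight * (b.get(term) ?? 0);
  }
  for (const weight of b.values()) {
    normB += weight * weight;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical title measure: fraction of query terms found in the title.
function titleOverlap(queryTerms, titleTerms) {
  if (queryTerms.length === 0) return 0;
  const title = new Set(titleTerms);
  return queryTerms.filter((term) => title.has(term)).length / queryTerms.length;
}

// Score every document, sort descending, and keep the first `limit` results.
// titleWeight = 0.3 is an illustrative value, not one from this project.
function rankDocuments(queryVector, queryTerms, documents, titleWeight = 0.3, limit = 10) {
  return documents
    .map((doc) => ({
      id: doc.id,
      score:
        cosineSimilarity(queryVector, doc.vector) +
        titleWeight * titleOverlap(queryTerms, doc.titleTerms),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}

// Example with two made-up documents that differ only in their titles.
const exampleQuery = new Map([["2", 1], ["sum", 1]]);
const exampleDocs = [
  { id: "a", vector: new Map([["sum", 1]]), titleTerms: ["subarray", "total"] },
  { id: "b", vector: new Map([["sum", 1]]), titleTerms: ["2", "sum"] },
];
console.log(rankDocuments(exampleQuery, ["2", "sum"], exampleDocs));
```
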
- Stopword Removal: Eliminates common, irrelevant words (e.g., "is," "the").
- Punctuation Removal: Avoids mismatches by stripping punctuation.
- Spell Checking: Corrects misspelled query terms.
- Lemmatization: Matches words by reducing them to their base forms (e.g., "running" to "run").
- Number-Word Conversion: Treats numeric forms and their word equivalents as identical.
- CamelCase Handling: Splits camel-cased words for better keyword matching.
- Title Similarity: Considers document titles as part of the ranking process.
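
Spell checking is the least obvious of these steps, so here is one way it might work. The source does not name the correction method; this sketch assumes a Levenshtein edit-distance lookup against the index vocabulary, with hypothetical function names and an assumed cutoff of two edits.

```javascript
// Levenshtein distance using a single rolling row of the DP table.
function editDistance(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i += 1) {
    let diagonal = row[0]; // value from the previous row, previous column
    row[0] = i;
    for (let j = 1; j <= b.length; j += 1) {
      const above = row[j]; // value from the previous row, same column
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diagonal + cost);
      diagonal = above;
    }
  }
  return row[b.length];
}

// Return the closest vocabulary word within maxDistance edits,
// or the original term when nothing is close enough.
function correctTerm(term, vocabulary, maxDistance = 2) {
  if (vocabulary.includes(term)) return term;
  let best = term;
  let bestDistance = maxDistance + 1;
  for (const word of vocabulary) {
    const distance = editDistance(term, word);
    if (distance < bestDistance) {
      best = word;
      bestDistance = distance;
    }
  }
  return best;
}

console.log(correctTerm("binray", ["binary", "tree", "graph"]));
```

Scanning the whole vocabulary is linear in its size, which is acceptable for a sketch; a larger index would likely want a precomputed candidate structure instead.
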
- RAM-Based Indexing: Stores TF-IDF values in files and loads them into memory during startup, enabling faster searches and improved response times.
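
The load-at-startup idea can be illustrated as follows. The on-disk format is not described in the source, so this sketch assumes a single JSON file holding an inverted index of the form term to (document id to TF-IDF weight); the function names and file layout are hypothetical.

```javascript
const fs = require("fs");

// Write the index (Map<term, Map<docId, weight>>) to disk as nested arrays,
// since JSON cannot represent Map objects directly.
function saveIndex(filePath, index) {
  const serializable = [...index].map(([term, postings]) => [term, [...postings]]);
  fs.writeFileSync(filePath, JSON.stringify(serializable));
}

// Read the file once at startup and rebuild the in-memory Maps.
function loadIndex(filePath) {
  const raw = JSON.parse(fs.readFileSync(filePath, "utf8"));
  return new Map(raw.map(([term, postings]) => [term, new Map(postings)]));
}

// Look up a term's postings without touching the disk.
function lookup(index, term) {
  return index.get(term) ?? new Map();
}
```

The synchronous read is a deliberate choice for startup code: it runs once before the server accepts requests, after which every query is answered from memory.
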
- Install Dependencies: Ensure all required packages and libraries are installed.
- Start the Application: Launch the search engine and input your queries to retrieve ranked results.
- Explore Results: View the top 10 most relevant documents based on your search.
- Backend: Node.js
- Frontend: EJS template engine
- Text Processing: TF-IDF, BM25
- Storage: RAM-based indexing for fast access
- Add support for more data sources and DSA platforms.
- Optimize the scoring algorithms for even larger datasets.
- Implement a user-friendly interface for better accessibility.