INKWWW/Hadoop-MapReduce
Big Data\MapReduce
Hadoop
############################################
1. Top 10 CommonWords in multiple files
Hadoop + MapReduce
- Read multiple files from HDFS in parallel and count the 'CommonWords' (words that appear in all of these files), excluding 'StopWords'.
- Sort these 'CommonWords' in descending order of frequency.
- Finally, pick the 10 most frequent 'CommonWords' from the sorted list generated in the previous step.
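The three steps above can be sketched in memory, without Hadoop, to show the data flow. This is only an illustration under stated assumptions, not the repository's implementation: `STOP_WORDS`, `map_phase`, `reduce_phase`, and `top_common_words` are hypothetical names, and the count of a common word is assumed to be the smallest count it has in any single file (summing the counts would be an equally plausible reading).

```python
from collections import Counter

# Hypothetical stop-word list; a real job would load it from a file.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "on"}


def map_phase(text):
    """Mapper: emit (word, 1) for every token that is not a stop word."""
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'()")
        if word and word not in STOP_WORDS:
            yield word, 1


def reduce_phase(pairs):
    """Reducer: sum the emitted counts per word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts


def top_common_words(documents, k=10):
    """Return the k most frequent words that occur in every document."""
    if not documents:
        return []
    per_file = [reduce_phase(map_phase(text)) for text in documents]
    # A word is "common" only if it appears in all files.
    common = set(per_file[0]).intersection(*per_file[1:])
    # Assumption: a common word's frequency is its minimum count over the files.
    freq = {w: min(counts[w] for counts in per_file) for w in common}
    # Sort by descending frequency, breaking ties alphabetically.
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]
```

For example, `top_common_words(["cat cat cat dog dog bird", "cat cat cat dog fish the"])` keeps only `cat` and `dog`, because `bird` and `fish` each occur in one file and `the` is a stop word.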
2. Find Top-K most similar documents
Hadoop + MapReduce + TF-IDF
Exact matching + Ranked Retrieval Models + Bag-of-Words model + TF-IDF
- Read multiple files from HDFS and use the 'Bag-of-Words' model to compute the frequency of every word in each file, excluding 'StopWords'.
- Compute the TF-IDF of every word w.r.t. each document.
- Normalize the TF-IDF of every word w.r.t. each document.
- Compute the relevance of every document w.r.t. the query words --> Ranked Retrieval Models (mentioned above).
- Sort the documents according to their relevance to the query words.
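A minimal in-memory sketch of this pipeline follows, again without Hadoop. The README does not say which TF-IDF variant is used, so the sketch assumes log-scaled term frequency `1 + ln(tf)`, inverse document frequency `ln(N / df)`, L2 normalization per document, and relevance as the sum of the normalized weights of the query words. All function names and the stop-word list are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real job would load it from a file.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "on"}


def tokenize(text):
    """Bag-of-words tokenizer: lowercase, strip punctuation, drop stop words."""
    words = (t.strip(".,;:!?\"'()") for t in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def tf_idf_vectors(documents):
    """Map each document name to its L2-normalized TF-IDF weights."""
    tfs = {name: Counter(tokenize(text)) for name, text in documents.items()}
    n_docs = len(tfs)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for tf in tfs.values():
        df.update(tf.keys())
    vectors = {}
    for name, tf in tfs.items():
        # Assumed weighting: (1 + ln tf) * ln(N / df).
        weights = {w: (1 + math.log(c)) * math.log(n_docs / df[w])
                   for w, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        vectors[name] = {w: (v / norm if norm else 0.0)
                         for w, v in weights.items()}
    return vectors


def top_k_documents(documents, query, k):
    """Rank documents by summed normalized TF-IDF of the query words."""
    vectors = tf_idf_vectors(documents)
    terms = set(tokenize(query))
    scores = {name: sum(vec.get(t, 0.0) for t in terms)
              for name, vec in vectors.items()}
    # Sort by descending relevance, breaking ties by document name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

With `documents = {"d1": "hadoop mapreduce hadoop", "d2": "spark streaming", "d3": "hadoop spark"}` and the query `"hadoop mapreduce"`, `d1` ranks first because it contains both query words, `d3` second, and `d2` scores zero.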