top of page

How to Find Highest Repeating Word from a File in Java?



In this Tutorial we will Learn to find the duplicate word which has occurred a maximum number of times in a file. You can also print the frequency of words from highest to lowest because you have the Map, which contains the word and their count in sorted order. All you need to do is iterate over each entry of Map and print the keys and values.


The most important part of this solution is sorting all entries. Since Map.Entry doesn't implement the Comparable interface, we need to write our own custom Comparator to sort the entries.


Algorithm:

STEP 1: START

STEP 2: DEFINE String line, word = ""

STEP 3: SET count =0, maxCount =0

STEP 4: DEFINE ArrayList<String> words

STEP 5: USE File Reader to open file in read mode.

STEP 6: READ line from file

STEP 7: By looping, CONVERT each line into lower case.

STEP 8: REMOVE the punctuation marks.

STEP 9: SPLIT the lines and STORE in array string[].

STEP 10: ADD all words generated in previous step into words.

STEP 11: SET i=0.REPEAT STEP 12 to STEP 17 UNTIL i<words.size()

STEP 12: SET count =1

STEP 13: SET j=i+1.REPEAT STEP 14 to 15 STEP UNTIL j<words.size()

STEP 14: IF(words.get(i).equals(words.get(j))) then count = count+1.

STEP 15: j = j+1

STEP 16: IF count>maxCount then maxCount = count word =words.get(i)

STEP 17: i=i+1

STEP 18: PRINT word

STEP 19: END


Java Program to Print word and their counts from a File

Here is our complete Java program to find and print the world their count from a given file. It also prints the world with the highest and lowest occurrence characters from the file as asked in given problem.

import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.regex.Pattern;
/**
 * Java program to find count of repeated words in a file.
 *
 * @author
 */public class Problem 
 {

    public static void main(String args[]) 
    {
        Map<String, Integer> wordMap = buildWordMap("C:/temp/words.txt");
        List<Entry<String, Integer>> list 
                  = sortByValueInDecreasingOrder(wordMap);
        System.out.println("List of repeated word from file and their 
                                                            count");
        for (Map.Entry<String, Integer> entry : list) 
        {
            if (entry.getValue() > 1) {
                System.out.println(entry.getKey() + " => "
                            + entry.getValue());
            }
        }
    }

    public static Map<String, Integer> buildWordMap(String fileName) 
    {
        // Using diamond operator for clean code
        Map<String, Integer> wordMap = new HashMap<>();
        
        // Using try-with-resource statement for automatic resource management
        try (FileInputStream fis = new FileInputStream(fileName);
                DataInputStream dis = new DataInputStream(fis);
                BufferedReader br = new BufferedReader(
                                     new InputStreamReader(dis))) {
        // words are separated by whitespace
        Pattern pattern = Pattern.compile("\\s+");
        String line = null;
        while ((line = br.readLine()) != null) 
        {
            // do this if case sensitivity is not required i.e. Java = 
            java
                line = line.toLowerCase();
                String[] words = pattern.split(line);
                for (String word : words) 
                {
                    if (wordMap.containsKey(word)) 
                    {
                        wordMap.put(word, (wordMap.get(word) + 1));
                    } else {
                        wordMap.put(word, 1);
                    }
                }
            }
        } 
        catch (IOException ioex)         
            {
            ioex.printStackTrace();
        }
        return wordMap;
    }

    public static List<Entry<String, Integer>> 
    sortByValueInDecreasingOrder( Map<String, Integer> wordMap) 
    {
        Set<Entry<String, Integer>> entries = wordMap.entrySet();
        List<Entry<String, Integer>> list = new ArrayList<>(entries);
        Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() 
        {
            @Overridepublic int compare(Map.Entry<String, Integer> o1, 
                               Map.Entry<String, Integer> o2) {
                return (o2.getValue()).compareTo(o1.getValue());
            }
        });
        return list;
    }
}

Output:

List of repeated word from file and their count
its => 2
of => 2
programming => 2
java => 2
language => 2


Tips:

If your writing code on interviews make sure they are production quality code, which means you must handle as many errors as possible, you must write unit tests, you must comment on the code and you do proper resource management. Here are a couple of more points to remember:

  1. Close files and streams once you are through with it. If you are in Java 7, just use the try-with-resource statement.

  2. Since the size of the file is not specified, the interviewer may grill you on cases like What happens if the file is large? With a large file, your program will run out of memory and throw java.lang.OutOfMemory: Java Heap space. - One solution for this is to do this task in chunks like first read 20% content, find the maximum repeated word on that, then read the next 20% content, and find repeated maximum by taking the previous maximum into consideration. This way, you don't need to store all words in memory and you can process any arbitrary length file.

  3. Always use Generics for type safety. It will ensure that your program is correct at compile time rather than throwing Type related exceptions at runtime.



Resource: Java67, Wikipedia, Javapoint


The Tech Platform

0 comments
bottom of page