All computer programs and scripts, including this website and our word count function, run on code that can be converted into instructions that a computer can understand (1s and 0s). An algorithm is a set of unambiguous instructions that achieves a specific goal and can be represented by code. In our case, the algorithm finds the word count of a text of any length.
The easiest way to count words is to look at what separates the words, so this is what we can make our algorithm do.
Step 1. Define a list of separators, such as space, a comma, semi-colon and any other symbol you want, then store them in the program (you can store them as an array for instance if you are using a programming language like Java).
Step 2. Define a state that has two possible values, true and false. Initially, this value will be set to true. Define an integer variable called word count and initialize it to zero. The idea is that when the state is false, we are currently scanning a word, and when it is true, we are not.
Step 3. Get the input text. This process is different depending on how you want to get the text, from importing a txt document into the program or getting it off a website. The easiest way to do this in the code is to hard code it inside the program itself by defining a string variable, then setting it to the full text that you want counted. Setting up a scanner to take in input is also relatively easy.
Step 4. Using a loop, go over the full string to be counted unit by unit. (In computer code, a string is what we call any piece of text.)
Inside the loop, apply the following rules in this order of priority:
If the next unit is a separator, set the state as “true”. To check this, compare the unit being scanned against the list of separators from Step 1; a separate helper method that checks the unit against every element of the separator array works well here.
If the next unit is not a separator as defined in Step 1, and the state is “true”, then set the state as “false” and increment the word counter by 1. This means adding 1 to whatever value the word count variable currently holds.
Go to the next unit and repeat the process until the for loop ends.
Step 5. Once the loop has finished running, return the word count to the user.
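To see how the state flag behaves, the steps above can be traced over a short string. The following is only a rough sketch: the class name StateTrace, the helper isSeparator, and the three-character separator list are hypothetical choices made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class StateTrace {
    // Separators from Step 1 (an assumed, minimal list for this sketch).
    static final char[] SEPARATORS = {' ', '.', ','};

    // Returns true if c appears in the separator list.
    static boolean isSeparator(char c) {
        for (char s : SEPARATORS) {
            if (s == c) {
                return true;
            }
        }
        return false;
    }

    // Returns one line per character: the character, the state after
    // processing it, and the running word count.
    static List<String> trace(String text) {
        List<String> lines = new ArrayList<>();
        boolean state = true; // true = not inside a word (Step 2)
        int wordCount = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isSeparator(c)) {
                state = true;          // rule 1: a separator ends any word
            } else if (state) {
                state = false;         // rule 2: first character of a new word
                wordCount++;
            }
            lines.add("'" + c + "' state=" + state + " count=" + wordCount);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : trace("Hi, you")) {
            System.out.println(line);
        }
    }
}
```

For the input "Hi, you", the count goes up only on 'H' and 'y', the two places where a non-separator follows a separator (or the start of the text), so the final count is 2.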
Now that you understand the algorithm and the logic behind it, you can turn it into pseudo code! Pseudo code is ‘code’ that is mostly readable in plain English but shares the same structure as code written in a real programming language. Programmers write pseudo code because it is easy to write without worrying about the nitty-gritty details of syntax, and it helps them plan what to do when they write the real code.
Pseudo code for word count:
//Comments have a double slash before them, like this sentence!
//This is the list of separators; I have included only the comma, period and empty space
//in the example below.
Define array separators = [' ', '.', ',']
//This is the state, where true means we are not currently counting text, and false means
//we are.
Define boolean state = true
//This is the main method, which is where we put in our input text, and we expect to get
//an integer value back out; said integer value will be our word count
Method int countWords(String input) {
    //When a sentence begins, but before we have scanned anything, we are not counting
    //a word yet, so we have to set the state to true first.
    state = true
    //This variable is our word count that we will give to the user once the program
    //has finished running
    int word_count = 0
    //Scan through all elements in our text individually one by one;
    //i is just a variable that we use to scan through the text
    For (int i = 0; i < input.length; i++) {
        //If the next element is a separator, set the state as true.
        If (SeparatorArrayContains(input.elementAt(i)) or input.elementAt(i) == 'new_line' or input.elementAt(i) == 'new_tab') {
            state = true
        }
        //On the other hand, if the next element is not in our list of separators, and the state is currently true,
        //then we set the state as false and increment our word_count variable by one.
        Else if (state == true) {
            state = false
            word_count = word_count + 1
        }
    }
    //At this point, the for loop has stopped running and we have the final word count.
    return word_count
}
You will also need a method called SeparatorArrayContains that can take in a parameter and see if it is counted as a separator or not.
Method Boolean SeparatorArrayContains(elementToCheck) {
    boolean found = false; //We set the default result to false
    //Scan through the entire array of separators we have defined earlier
    for (int k = 0; k < separators.length; k++) {
        //If we successfully match the element we are checking to a value in the
        //separator array
        if (separators[k] == elementToCheck) {
            found = true;
        }
    }
    return found; //Once we have scanned through the entire array, return the result
}
Once you have the above down, you can call this method in the main part of your program by putting in the text you want to count as a parameter, like so:
countWords("Test sentence we want to count!")
If you want to be fancy, there are plenty of ways to either take in user input as the text you want to count, or to import a text file. We will not be going over that here, but there are plenty of tutorials online on how to do something like that.
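For readers who want a starting point anyway, here is a minimal Java sketch of both approaches. It is an illustration under assumptions rather than a complete solution: the class name InputSketch and the methods readAll and readFile are hypothetical, and the text is assumed to be UTF-8.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Scanner;

class InputSketch {
    // Reads everything from a stream (for example System.in) into one string.
    // A newline is appended after every line, including the last one.
    static String readAll(InputStream in) {
        Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name());
        StringBuilder text = new StringBuilder();
        while (scanner.hasNextLine()) {
            text.append(scanner.nextLine()).append('\n');
        }
        return text.toString();
    }

    // Reads a whole text file into one string, assuming UTF-8 encoding.
    static String readFile(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // With a file name argument, read that file; otherwise read typed or piped input.
        String text = (args.length > 0)
                ? readFile(Paths.get(args[0]))
                : readAll(System.in);
        // The resulting string is what you would hand to your countWords method.
        System.out.println("Read " + text.length() + " characters");
    }
}
```

Either method returns a plain string, so the counting code does not need to know where the text came from.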
Please note that the above pseudo code may require tweaking to give accurate results. Remember to test the code thoroughly once you have written it in the programming language of your choice.
Finally, this is what the code looks like when you write it in Java. No comments here, since you should know what is going on from reading the pseudo code!
public class WordCount {
    static char[] separators = {' ', '.', ','};

    static int countWords(String str) {
        boolean state = true;
        int wordCount = 0;
        for (int i = 0; i < str.length(); i++) {
            if (SeparatorArrayContains(str.charAt(i)) || str.charAt(i) == '\n' || str.charAt(i) == '\t') {
                state = true;
            }
            else if (state == true) {
                state = false;
                wordCount++;
            }
        }
        return wordCount;
    }

    static boolean SeparatorArrayContains(char c) {
        boolean found = false;
        for (int k = 0; k < separators.length; k++) {
            if (separators[k] == c) {
                found = true;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String str = "one two three four five ,,,,,,, six";
        System.out.println("Number of words : " + countWords(str));
    }
}
If you are interested in seeing more word count algorithms coded in various programming languages, check out: https://www.geeksforgeeks.org/count-words-in-a-given-string/
Note that the above code is structured in a similar manner to the Java code in the link, but adds the ability for the programmer to choose the list of separators.
Many may think that having your own word count program might be pointless if there are already ways to find a word count using popular programs such as Microsoft Office, Google Docs, or just simply using this website.
However, we know that different office programs will return different word counts. If you have a specific definition of what counts as a word and what does not, then writing your own word count code would be the only way to count the words in your document short of counting them by hand! This would be especially useful if you must adhere to strict standards for word counts (as in the case of legal documents, for instance).
Remember that once you have written your code, you will need some way to compile and run it. Fortunately, many lightweight development environments (such as DrJava) are portable and can fit onto any portable storage device, so you can run them whenever you want.
First, you would have to find out all the rules for counting words according to the organization that regulates the count. After that, you can use those rules to create an algorithm that counts words accordingly. Once you have the algorithm, you can convert it into computer code in whatever programming language you have learned, such as Python, Java, or C++.
You will likely have to test this algorithm many times over with different text that you have hand counted. Obviously, we would use very small sections of text that would be easy to hand count to begin with! Remember to test strange edge cases as well as normal text, so that you can find all the instances where your algorithm’s logic is not correct, figure out why, then fix the issue.
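One way to organize this kind of testing is a small table of hand-counted snippets that the program checks itself against. The sketch below assumes the separator list used earlier (space, period, comma, newline and tab); the class name WordCountChecks and the specific test cases are hypothetical examples, not a complete test suite.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WordCountChecks {
    // Assumed separator list; replace it with the rules you need to follow.
    static final char[] SEPARATORS = {' ', '.', ',', '\n', '\t'};

    static boolean isSeparator(char c) {
        for (char s : SEPARATORS) {
            if (s == c) {
                return true;
            }
        }
        return false;
    }

    // Same state-flag algorithm as described in the article.
    static int countWords(String text) {
        boolean state = true;
        int wordCount = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isSeparator(text.charAt(i))) {
                state = true;
            } else if (state) {
                state = false;
                wordCount++;
            }
        }
        return wordCount;
    }

    // Short snippets paired with the word count obtained by counting by hand.
    static Map<String, Integer> handCounted() {
        Map<String, Integer> cases = new LinkedHashMap<>();
        cases.put("", 0);                            // empty text
        cases.put("   ", 0);                         // separators only
        cases.put("one", 1);
        cases.put("one two", 2);
        cases.put("  leading and trailing  ", 3);    // extra spaces at both ends
        cases.put("one,two", 2);                     // no space after the comma
        cases.put("tabs\tand\nnewlines", 3);
        cases.put("one two three four five ,,,,,,, six", 6);
        return cases;
    }

    // Prints every mismatch and returns how many cases failed.
    static int failures() {
        int failed = 0;
        for (Map.Entry<String, Integer> e : handCounted().entrySet()) {
            int actual = countWords(e.getKey());
            if (actual != e.getValue()) {
                failed++;
                System.out.println("MISMATCH for \"" + e.getKey() + "\": expected "
                        + e.getValue() + ", got " + actual);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        System.out.println(failures() + " case(s) failed");
    }
}
```

When a case fails, the printed mismatch shows which snippet to step through by hand to find the faulty rule.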
Say you are a lawyer citing a statute called 18 U.S.C. 1001. Google Docs counts that as five words. However, the courts say that is counted as three words. Now if you cite a statute with subsections, such as 18 U.S.C. 924(a)(1)(B), the courts also say that is three words, but Google Docs now counts that as eight words.
What has happened here? In Google Docs' case, we can see that it has counted 18, U, S, C, 924, a, 1, and B all as separate words. This is because it treats the period, the parentheses and the space as separators. To change this, we can simply take the pseudo code above and edit the list of separators so that only the space is a separator. Following that logic, we can see that we now have only three words. Note that Microsoft Office and countofwords.com both return three words for the above example, which is what that particular court says is the proper result.
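The effect of changing the separator list can be sketched in Java as follows. The "broad" list (space, period and parentheses) is an assumption chosen to imitate the behaviour described above, not Google Docs' actual rule set, and the class name CitationCount is hypothetical.

```java
class CitationCount {
    // Counts words in text, treating every character in separators as a word boundary.
    static int countWords(String text, char[] separators) {
        boolean state = true; // true = not currently inside a word
        int wordCount = 0;
        for (int i = 0; i < text.length(); i++) {
            if (contains(separators, text.charAt(i))) {
                state = true;
            } else if (state) {
                state = false;
                wordCount++;
            }
        }
        return wordCount;
    }

    // Returns true if c is one of the given separators.
    static boolean contains(char[] separators, char c) {
        for (char s : separators) {
            if (s == c) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        char[] broad = {' ', '.', '(', ')'}; // assumed list imitating the higher counts
        char[] spaceOnly = {' '};            // the edited list: only the space separates words
        String[] citations = {"18 U.S.C. 1001", "18 U.S.C. 924(a)(1)(B)"};
        for (String citation : citations) {
            System.out.println(citation
                    + " -> broad: " + countWords(citation, broad)
                    + ", space only: " + countWords(citation, spaceOnly));
        }
    }
}
```

With the broad list, the two citations come out as 5 and 8 words; with the space-only list, both come out as 3.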