[Solved] For every possible pair of two unique words in the file, print out the count of occurrences of that pair [closed]


It seems inconceivable that the purpose of your assignment determining the frequency of word-pairs in a file would be to have you wrap a piped-string of shell utilities in a system call. What does that possibly teach you about C? That a system function exists that allows shell access? Well, it does, and you can, lesson done, nothing learned.

It seems far more likely that the intent was for you to understand the use of structures to hold collections of related data in a single object, or at a minimum, array or pointer indexing to check for pairs of adjacent words within a file. Of the two usual approaches, a struct or index arithmetic, the struct is far more beneficial. Something simple to hold a pair of words and the number of times that pair is seen is all you need, e.g.:

enum { MAXC = 32, MAXP = 100 };

typedef struct {
    char w1[MAXC];
    char w2[MAXC];
    size_t freq;
} wordpair;

(note, the enum simply defines the constants MAXC (32) and MAXP (100) for maximum characters per-word, and maximum pairs to record. You could use two #define statements to the same end)

You can declare an array of the wordpair struct, where each element holds a pair of words, w1 and w2, and the number of times that pair is seen in freq. The array of structs can be treated like any other array: sorted, searched, etc.

To analyze the file, read the first two words into the first struct and save a pointer to the second word. Then read each remaining word and check whether the pair formed by the pointed-to word and the new word already exists. If it does, simply update the number of times it has been seen; if it doesn’t, add a new pair. In either case, update the pointer to point to the word just read, and repeat.

Below is a short example that will check the pair occurrence for the words in all filenames given as arguments on the command line (e.g. ./progname file1 file2 ...). If no file is given, the code will read from stdin by default.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { MAXC = 32, MAXP = 100 };

typedef struct {
    char w1[MAXC];
    char w2[MAXC];
    size_t freq;
} wordpair;

size_t get_pair_freq (wordpair *words, FILE *fp);
int compare (const void *a, const void *b);

int main (int argc, char **argv) {

    /* initialize variables & open file or stdin for reading */
    wordpair words[MAXP] = {{"", "", 0}};
    size_t i, idx = 0;
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;

    if (!fp) {
        fprintf (stderr, "error: file open failed '%s'.\n", argv[1]);
        return 1;
    }

    /* read from file given, or from stdin (default) */
    idx = get_pair_freq (words, fp);

    /* read each remaining file given on command line */
    for (i = 2; i < (size_t)argc; i++)
    {   if (fp && fp != stdin) { fclose (fp); fp = NULL; }
        /* open file for reading */
        if (!(fp = fopen (argv[i], "r"))) {
            fprintf (stderr, "error: file open failed '%s'.\n",
                        argv[i]);
            continue;
        }

        /* check 'idx' against MAXP */
        if ((idx += get_pair_freq (words, fp)) == MAXP)
            break;
    }
    if (fp && fp != stdin) fclose (fp);

    /* sort words alphabetically */
    qsort (words, idx, sizeof *words, compare);

    /* output the frequency of word pairs */
    printf ("\nthe occurrence of words pairs are:\n\n");
    for (i = 0; i < idx; i++) {
        char pair[MAXC * 2] = "";
        sprintf (pair, "%s:%s", words[i].w1, words[i].w2);
        printf ("  %-32s : %zu\n", pair, words[i].freq);
    }

    return 0;
}

/* read words from fp, adding each adjacent pair to pairs[] (or bumping
 * its freq if already stored). returns the number of NEW pairs added.
 * relies on unused slots in pairs[] having freq == 0.
 */
size_t get_pair_freq (wordpair *pairs, FILE *fp)
{
    char w1[MAXC] = "", w2[MAXC] = "";
    /* field width is MAXC - 1 to leave room for the terminating '\0' */
    const char *fmt1 = " %31[^ ,.\t\n]%*c";
    const char *fmt2 = " %31[^ ,.\t\n]%*[^A-Za-z0-9]%31[^ ,.\t\n]%*c";
    const char *w1p = w1;       /* first word of the current pair */
    int nw = 0;
    size_t i, idx = 0, start;

    /* skip pairs stored by earlier calls (e.g. from a previous file) */
    while (idx < MAXP && pairs[idx].freq)
        idx++;
    start = idx;

    /* read 1st 2 words of the file */
    if ((nw = fscanf (fp, fmt2, w1, w2)) != 2) {
        if (!nw) fprintf (stderr, "error: file read error.\n");
        return 0;
    }

    do {
        /* check against all pairs in struct */
        for (i = 0; i < idx; i++)
            if (strcmp (pairs[i].w1, w1p) == 0 &&
                strcmp (pairs[i].w2, w2) == 0)
                break;              /* pair already exists */

        if (i == idx) {             /* not found, add new pair */
            if (idx == MAXP) {      /* check 'idx' against MAXP */
                fprintf (stderr, "warning: MAXP pairs exceeded.\n");
                break;
            }
            strcpy (pairs[idx].w1, w1p);
            strcpy (pairs[idx].w2, w2);
            idx++;
        }
        pairs[i].freq++;            /* update frequency for pair */
        w1p = pairs[i].w2;          /* word just read is the next w1 */

    } while (fscanf (fp, fmt1, w2) == 1);   /* read each remaining word */

    return idx - start;
}

/* qsort compare function: order by w1, then by w2 */
int compare (const void *a, const void *b)
{
    const wordpair *pa = a, *pb = b;
    int cmp = strcmp (pa->w1, pb->w1);

    return cmp ? cmp : strcmp (pa->w2, pb->w2);
}

Use/Output

Given your example of "Hi how are you are you.", it produces the desired results (sorted in strcmp order, which compares character codes, so in ASCII the capitalized "Hi" sorts before "are").

$ echo "Hi how are you are you." | ./bin/file_word_pairs

the occurrence of words pairs are:

  Hi:how                           : 1
  are:you                          : 2
  how:are                          : 1
  you:are                          : 1

(there is no requirement that you sort the results, but it makes lookup/confirmation a lot easier with longer files)

Removing qsort (pairs are listed in the order they were first seen)

$ echo "Hi how are you are you." | ./bin/file_word_pairs

the occurrence of words pairs are:

  Hi:how                           : 1
  how:are                          : 1
  are:you                          : 2
  you:are                          : 1

While you are free to attempt to use your system version, why not take the time to learn how to approach the problem in C? If you want to learn how to do it through a system call, take a Linux course, as doing it in that manner has very little to do with C.

Look it over, look up the functions that are new to you in the man pages, and then ask about anything you don’t understand.
