It just occurred to me that I used to run into this sort of issue when I worked on a screen-scraping project. The key is that the downloaded HTML sources sometimes contain non-printable characters which are not whitespace characters either, so trim() does not touch them. They are also very difficult to spot when you copy-paste the text into a browser. I assume that this could have happened to you.
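For reference, String.trim() only strips leading and trailing characters whose code points are at or below U+0020, so something like a non-breaking space survives it. A quick way to see what is really in the string is to print each character's code point; the snippet below is a minimal sketch of that idea (the sample string is made up):

String s = "value\u00A0";                 // trailing non-breaking space, looks like an ordinary space
System.out.println(s.trim().length());    // still 6: trim() only strips chars at or below U+0020
for (int i = 0; i < s.length(); i++) {
    System.out.printf("index %d: U+%04X%n", i, (int) s.charAt(i));
}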
If my assumption is correct then you’ve got two choices:
- Use a binary reader to figure out what those characters are, then delete them with String.replace() (both snippets are exercised in the usage sketch after the second option). E.g.:
private static String cutCharacters(String fromHtml) {
    String result = fromHtml; // work on a local copy of the input
    char[] problematicCharacters = {'\000', '\001', '\003'}; // this could be a private static final constant too
    for (char ch : problematicCharacters) {
        // String has no replace(char, String) overload, so convert the char first
        result = result.replace(String.valueOf(ch), "");
    }
    return result;
}
- If you find some sort of recurring pattern in the HTML to be parsed, then you can use regexes and substrings to cut out the unwanted parts. E.g.:
// requires java.util.regex.Pattern and java.util.regex.Matcher
private String getImportantParts(String fromHtml) {
    Pattern p = Pattern.compile("(\\w*\\s*)"); // this could be a private static final constant as well
    Matcher m = p.matcher(fromHtml);
    StringBuilder buff = new StringBuilder();
    while (m.find()) {
        buff.append(m.group(1));
    }
    return buff.toString().trim();
}
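A rough usage sketch for both helpers, assuming they sit in the same class as the calling code (the input string is made up for illustration):

String raw = "foo\u0001bar baz!";

System.out.println(cutCharacters(raw));
// prints "foobar baz!" - only the listed control characters are removed

System.out.println(getImportantParts(raw));
// prints "foobar baz" - only word characters and the whitespace following them are kept,
// so both the control character and the '!' are dropped

Note that the first helper removes only the exact characters you list, while the regex version rebuilds the string from the matching parts, which is why the '!' also disappears in the second call.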