[Solved] Trim() in Java not working the way I expect? [duplicate]


It just occurred to me that I used to have this sort of issue when I worked on a screen-scraping project. The key is that downloaded HTML sources sometimes contain non-printable characters that are not whitespace characters either, so trim() may leave them in place. They are also very difficult to copy and paste into a browser. I assume this could have happened to you.
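To illustrate what I mean (a minimal sketch, not taken from your code): `String.trim()` only strips leading and trailing characters whose code is U+0020 or lower. An invisible character above that range, such as the non-breaking space U+00A0 that `&nbsp;` usually decodes to, survives. The class name `TrimDemo`, the helper `hasLeftoverNbsp` and the sample string are made up for this example.

```java
class TrimDemo {
    // Returns true if, after trim(), the string still starts or ends
    // with a non-breaking space (U+00A0), which trim() does not remove.
    static boolean hasLeftoverNbsp(String s) {
        String t = s.trim();
        return !t.isEmpty()
                && (t.charAt(0) == '\u00A0' || t.charAt(t.length() - 1) == '\u00A0');
    }

    public static void main(String[] args) {
        String scraped = "\u00A0 price: 42 \u00A0"; // hypothetical scraped text
        String trimmed = scraped.trim();
        // Both ends are U+00A0, so trim() removes nothing here.
        System.out.println(trimmed.length() == scraped.length());
        System.out.println(hasLeftoverNbsp(scraped));
    }
}
```

If this is what is going on in your case, the string only looks untrimmed; the "space" you see at the end is simply a different character.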

If my assumption is correct then you’ve got two choices:

  1. Use a binary reader to figure out what those characters are, then delete them with String.replace(). E.g.:

    private static String cutCharacters(String fromHtml) {
        String result = fromHtml;
        char[] problematicCharacters = {'\000', '\001', '\003'}; // this could be a private static final constant too
        for (char ch : problematicCharacters) {
            // replace(char, char) cannot delete a character, so use the CharSequence overload
            result = result.replace(String.valueOf(ch), "");
        }
        return result;
    }
    
  2. If you find a recurring pattern in the HTML to be parsed, you can use regexes and substrings to cut out the unwanted parts. E.g.:

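    // needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
    private String getImportantParts(String fromHtml) {
        // Keeps runs of word characters plus the whitespace right after them;
        // everything else (punctuation, non-printable characters) is dropped.
        Pattern p = Pattern.compile("(\\w*\\s*)"); // this could be a private static final constant as well
        Matcher m = p.matcher(fromHtml);
        StringBuilder buff = new StringBuilder();
        while (m.find()) {
            buff.append(m.group(1));
        }
        return buff.toString().trim();
    }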
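If you don't have a binary reader at hand, the two options above can also be sketched in plain Java: first print the code points so that the hidden characters become visible, then remove them by Unicode category instead of listing each one. This is only a sketch under assumptions: the class and method names (`InvisibleChars`, `dump`, `stripInvisible`) are hypothetical, and I assume it is acceptable to drop every control (`Cc`) and format (`Cf`) character and to turn non-breaking spaces into ordinary spaces.

```java
class InvisibleChars {
    // Lists every code point of s as U+XXXX so hidden characters become visible.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // Assumption: removing all control (Cc) and format (Cf) characters is fine.
    // Cc includes tab and newline, so this is too aggressive for multi-line text.
    static String stripInvisible(String s) {
        return s.replaceAll("[\\p{Cc}\\p{Cf}]", "") // drop control/format characters
                .replace('\u00A0', ' ')              // non-breaking space -> plain space
                .trim();
    }

    public static void main(String[] args) {
        String scraped = "\u200Bfoo\u0001bar\u00A0"; // hypothetical scraped text
        System.out.println(dump(scraped));
        System.out.println("[" + stripInvisible(scraped) + "]");
    }
}
```

Once `dump` has shown you which characters are actually in the string, you may prefer to narrow the character class down to just those, so that nothing you want to keep gets removed.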
    
