It just occurred to me that I used to run into this sort of issue when I worked on a screen-scraping project. The key is that the downloaded HTML sources sometimes contain non-printable characters which are not whitespace characters either, so trim() does not touch them. They are also very difficult to spot when you copy-paste the text into a browser. I assume that this could have happened to you.
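For reference, String.trim() only strips leading and trailing characters whose code points are at or below U+0020, so something like a non-breaking space survives it. A quick way to see what is really in the string is to print each character's code point; the snippet below is a minimal sketch of that idea (the sample string is made up):

String s = "value\u00A0";                 // trailing non-breaking space, looks like an ordinary space
System.out.println(s.trim().length());    // still 6: trim() only strips chars at or below U+0020
for (int i = 0; i < s.length(); i++) {
    System.out.printf("index %d: U+%04X%n", i, (int) s.charAt(i));
}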
If my assumption is correct then you’ve got two choices:
- Use a binary reader to figure out what those characters are, then delete them with String.replace() (both snippets are exercised in the usage sketch after the second option). E.g.:
private static String cutCharacters(String fromHtml) {
    String result = fromHtml; // work on a local copy of the input
    char[] problematicCharacters = {'\000', '\001', '\003'}; // this could be a private static final constant too
    for (char ch : problematicCharacters) {
        // String has no replace(char, String) overload, so convert the char first
        result = result.replace(String.valueOf(ch), "");
    }
    return result;
}
- If you find some sort of recurring pattern in the HTML to be parsed, then you can use regexes and substrings to cut out the unwanted parts. E.g.:
// requires java.util.regex.Pattern and java.util.regex.Matcher
private String getImportantParts(String fromHtml) {
    Pattern p = Pattern.compile("(\\w*\\s*)"); // this could be a private static final constant as well
    Matcher m = p.matcher(fromHtml);
    StringBuilder buff = new StringBuilder();
    while (m.find()) {
        buff.append(m.group(1));
    }
    return buff.toString().trim();
}
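A rough usage sketch for both helpers, assuming they sit in the same class as the calling code (the input string is made up for illustration):

String raw = "foo\u0001bar baz!";

System.out.println(cutCharacters(raw));
// prints "foobar baz!" - only the listed control characters are removed

System.out.println(getImportantParts(raw));
// prints "foobar baz" - only word characters and the whitespace following them are kept,
// so both the control character and the '!' are dropped

Note that the first helper removes only the exact characters you list, while the regex version rebuilds the string from the matching parts, which is why the '!' also disappears in the second call.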