[Solved] Process unstructured and multiple line CSV in hadoop


Because you’re dealing with multi-line data, you cannot use the plain TextInputFormat to read it, since that format treats every line as a separate record. You need a custom InputFormat for CSV files instead.

Currently there is no built-in way of processing multi-line CSV files in Hadoop (see https://issues.apache.org/jira/browse/MAPREDUCE-2208), but luckily there’s some code on GitHub you can try: https://github.com/mvallebr/CSVInputFormat.
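To illustrate why line-based reading breaks on this kind of data, here is a minimal sketch in plain Java with no Hadoop dependency. The class `RecordSplitDemo`, the method `splitRecords`, and the sample data are made up for this example and are not taken from the linked project. It only shows the idea of treating newlines inside double quotes as part of a record; a real InputFormat also has to deal with input split boundaries, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Hadoop or CSVInputFormat.
class RecordSplitDemo {

    // Splits text into records at newlines that are outside double quotes.
    // Assumption: quotes are balanced; backslash-escaped quotes are not handled.
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // toggle on every quote character
            }
            if (c == '\n' && !inQuotes) {
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // Two logical records; the first contains a newline inside quotes.
        String data = "1;\"first line\nsecond line\"\n2;\"single line\"\n";
        System.out.println(data.split("\n").length);   // prints 3 (line-based view)
        System.out.println(splitRecords(data).size()); // prints 2 (quote-aware view)
    }
}
```

Note that a single unterminated quote leaves `inQuotes` set for the rest of the input, so everything after it collapses into one record. That is why the broken quotations need to be dealt with as well.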

As far as the unterminated quotations are concerned, it might be necessary to pre-process the data and clean it up first. One simple rule would be to escape a quotation mark (") if there is no separator directly before or after it:

  • escape: a"b => a\"b
  • leave unchanged: a;"b and a";b
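The rule above could be sketched as a small pre-processing step in plain Java. The class `QuoteEscaper` and the method `escapeStrayQuotes` are hypothetical names for this example. The sketch assumes `;` is the separator (as in the examples), treats a quote at the very start or end of a line as a legitimate field delimiter, and skips quotes that are already preceded by a backslash.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class QuoteEscaper {

    // A quote with a non-separator character on both sides.
    // Assumption: ';' is the field separator; line breaks and an existing
    // backslash before the quote also count as "do not escape".
    private static final Pattern STRAY_QUOTE =
            Pattern.compile("(?<=[^;\\r\\n\\\\])\"(?=[^;\\r\\n])");

    static String escapeStrayQuotes(String line) {
        // quoteReplacement keeps the backslash literal in the replacement.
        return STRAY_QUOTE.matcher(line)
                .replaceAll(Matcher.quoteReplacement("\\\""));
    }

    public static void main(String[] args) {
        System.out.println(escapeStrayQuotes("a\"b"));  // prints a\"b
        System.out.println(escapeStrayQuotes("a;\"b")); // prints a;"b
        System.out.println(escapeStrayQuotes("a\";b")); // prints a";b
    }
}
```

This heuristic also escapes doubled quotes (`""`) inside a field, so data that already uses that CSV escaping style would need a different rule.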

Another option would be to fix the application that produces the invalid CSV so that it escapes the data properly.
