[Solved] What should I use to perform similarity functions on a 200-column, 12-million-row dataset? [closed]

After getting suggestions from a couple of friends, I looked up the documentation on Elasticsearch. It seems like the perfect tool for my use case: it’s built for exactly this kind of search/retrieval workload, shards easily, and handles huge datasets. Here’s what should be done: store each row in a document, with the key elements being … Read more
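As a rough sketch of that row-per-document idea, and of one of Elasticsearch’s built-in similarity queries (more_like_this), here is a hedged example using the official Python client. The host, the index name `rows`, and the column fields are placeholder assumptions, not details from the original question:

```python
# Sketch: bulk-index each row of the dataset as one Elasticsearch document,
# then run a similarity-style lookup. Assumes a local cluster at
# localhost:9200 and the `elasticsearch` client (pip install elasticsearch).
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def row_to_action(row_id, row):
    """Turn one row (a dict of column values) into a bulk-index action."""
    return {
        "_index": "rows",   # hypothetical index name
        "_id": row_id,      # reuse the row's primary key as the document id
        "_source": row,     # each column becomes a document field
    }

rows = [
    {"col_1": 0.42, "col_2": "abc def"},   # stand-ins for real 200-column rows
    {"col_1": 0.17, "col_2": "abc xyz"},
]

# helpers.bulk streams actions in batches, which matters at 12M rows.
helpers.bulk(es, (row_to_action(i, r) for i, r in enumerate(rows)))

# more_like_this finds documents textually similar to a given document.
hits = es.search(index="rows", query={
    "more_like_this": {
        "fields": ["col_2"],
        "like": [{"_index": "rows", "_id": 0}],
        "min_term_freq": 1,
        "min_doc_freq": 1,
    },
})
```

Note that more_like_this covers text similarity; if the 200 columns are numeric, a dense_vector field with a kNN search would likely be the closer fit.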

[Solved] Could you please tell me where I would find the output for the MapReduce program WordMedian in Hadoop? Is it stored in a directory in HDFS?

Could you please tell me where I would find the output for the MapReduce program WordMedian in Hadoop? Is it stored in a directory in HDFS?

[Solved] Process unstructured and multi-line CSV in Hadoop

Because you’re dealing with multi-line data, you cannot use a simple TextInputFormat to access your data; you need a custom InputFormat for CSV files. Currently there is no built-in way of processing multi-line CSV files in Hadoop (see https://issues.apache.org/jira/browse/MAPREDUCE-2208), but luckily there is some code on GitHub you can try: https://github.com/mvallebr/CSVInputFormat. As far … Read more
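The underlying problem is easy to reproduce outside Hadoop: a line-oriented reader (which is what TextInputFormat effectively is) cuts a quoted multi-line field in half, while a CSV-aware parser keeps the record intact. A small Python illustration of the difference:

```python
# Illustration: why line-oriented splitting (TextInputFormat's model)
# breaks on CSV records that contain embedded newlines.
import csv
import io

data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Naive line splitting cuts the quoted field in two -- this is what
# happens when each "line" is handed to a separate map call.
print(data.splitlines())
# ['id,comment', '1,"first line', 'second line"', '2,plain']

# A CSV-aware parser keeps the multi-line record whole, which is what a
# custom InputFormat (e.g. the CSVInputFormat linked above) has to do.
for record in csv.reader(io.StringIO(data)):
    print(record)
# ['id', 'comment']
# ['1', 'first line\nsecond line']
# ['2', 'plain']
```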

[Solved] MapReduce to Spark

This is a very broad question, but the short of it is: create an RDD of the input data; call map with your mapper code, outputting key-value pairs; call reduceByKey with your reducer code; and write the resulting RDD to disk. Spark is more flexible than MapReduce: there is a great variety of methods that you … Read more
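As an illustration of those steps, here is a hedged PySpark sketch of the classic word-count MapReduce job; the input and output paths are placeholders:

```python
# Sketch: the MapReduce word-count pattern expressed as the RDD steps above.
# Assumes a working PySpark installation; paths are placeholders.
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("wordcount-sketch"))

lines = sc.textFile("hdfs:///input/text")         # 1. RDD of the input data
pairs = (lines
         .flatMap(lambda line: line.split())      # 2. "mapper": one record -> many
         .map(lambda word: (word, 1)))            #    key-value pairs
counts = pairs.reduceByKey(lambda a, b: a + b)    # 3. "reducer": combine by key
counts.saveAsTextFile("hdfs:///output/counts")    # 4. write the resulting RDD

sc.stop()
```

One detail worth noting: flatMap, not map, is what reproduces a mapper’s ability to emit zero or more records per input.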