[Solved] Downloading 1,000+ files fast?


Update

Jimi pointed out in a comment that DownloadFileAsync is an event-driven call and is not awaitable. There is, however, a WebClient.DownloadFileTaskAsync version, which is the appropriate one to use in this example: it is awaitable and returns a Task. Its documentation describes it as:

Downloads the specified resource to a local file as an asynchronous
operation using a task object.

Original answer

I know I could run multiple threads or even parallel but what’s the
best way

Yes, you can make it parallel and stay in control of the resources you use.

I’m not too worried about speed as long as it isn’t as slow as right
now, but I don’t want to overpower the device’s resources such as CPU
trying to speed it up

You should be able to achieve this and configure this fairly well.


OK, so there are many ways to do this. Here are some things to think about:

  • You have 1000s of IO bound tasks (as opposed to CPU bound tasks).
  • With this many files, you want some sort of parallelism and to be able to configure the number of concurrent tasks.
  • You will want to do this in an async / await pattern so you're not wasting system resources blocking threads while they wait on IO completion, or smashing your CPU.

Some immediate solutions:

  • Tasks and Task.WhenAll in an async / await pattern. This is a great approach, however it's a little bit trickier to limit concurrent tasks.
  • Parallel.ForEach and Parallel.For. These have a nice way to limit concurrent workloads, but they're just not suited to IO bound tasks.
  • Another option you might consider is TPL Dataflow (the Dataflow library built on the Task Parallel Library). I have come to like these libraries a lot lately as they can give you the best of both worlds.

Please note: there are many other approaches.
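
For the first option, one common way to cap the number of concurrent tasks is to use a SemaphoreSlim as an asynchronous gate. The following is only a minimal sketch under that assumption; the Throttled class and its ForEachAsync method are hypothetical names made up for illustration, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Throttled
{
    // Hypothetical helper: runs body for every item, with at most
    // maxConcurrency invocations in flight at any one time.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> body)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = items.Select(async item =>
            {
                await gate.WaitAsync();   // wait asynchronously for a free slot
                try
                {
                    await body(item);
                }
                finally
                {
                    gate.Release();       // free the slot even if body throws
                }
            }).ToList();                  // ToList starts each task exactly once

            // Completes once every item has finished; rethrows a failure if any
            await Task.WhenAll(tasks);
        }
    }
}
```

With the WorkLoad type shown further down, a call might look like `await Throttled.ForEachAsync(workloads, 50, w => MyMethodAsync(w));`. No thread is blocked while an item waits for its slot, because WaitAsync is awaited rather than called synchronously.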

Parallel.ForEach uses the thread pool, and IO bound operations will block those threads while they wait for a device to respond, tying up resources. A general rule of thumb here is:

  • If you have CPU bound code, Parallel.ForEach is appropriate;
  • Though if you have IO bound code, Asynchrony is appropriate.

In this case, downloading a file is clearly IO bound, there is an awaitable DownloadFileTaskAsync version, and there are 1000 files to download, so you are best off using the async / await pattern with some type of limit on concurrent tasks.


Here is a very basic example of how you might achieve this:

Given

public class WorkLoad
{
    public string Url {get;set;}
    public string FileName {get;set;}

}

Dataflow example

// ActionBlock and ExecutionDataflowBlockOptions live in the
// System.Threading.Tasks.Dataflow namespace (a NuGet package on .NET Framework)
public async Task DoWorkLoads(List<WorkLoad> workloads)
{
    var options = new ExecutionDataflowBlockOptions
    {
        // add pepper and salt to taste
        MaxDegreeOfParallelism = 50
    };

    // create an action block
    var block = new ActionBlock<WorkLoad>(MyMethodAsync, options);

    // queue them up
    foreach (var workLoad in workloads)
        block.Post(workLoad);

    // signal that no more items are coming, then wait for them to finish
    block.Complete();
    await block.Completion;
}

...

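// Notice we are using the async / await pattern
public async Task MyMethodAsync(WorkLoad workLoad)
{
    try
    {
        Console.WriteLine("Downloading: " + workLoad.Url);

        // WebClient does not support concurrent operations on one instance,
        // so each download gets its own client
        using (var client = new WebClient())
        {
            // DownloadFileTaskAsync returns a Task (DownloadFileAsync is event based)
            await client.DownloadFileTaskAsync(workLoad.Url, workLoad.FileName);
        }
    }
    catch (Exception)
    {
        // probably best to add some error handling here
    }
}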
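
To call DoWorkLoads you first need a list of WorkLoad items. The following is a small sketch of one way to build that list from plain URLs; WorkLoadBuilder and FromUrls are hypothetical names, and deriving the local file name from the last segment of the URL path is an assumption that only holds when those segments are unique.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Same shape as the WorkLoad class above, repeated so the sketch compiles on its own
public class WorkLoad
{
    public string Url { get; set; }
    public string FileName { get; set; }
}

public static class WorkLoadBuilder
{
    // Hypothetical helper: maps each URL to a WorkLoad whose FileName is
    // targetDirectory plus the last segment of the URL path.
    // Assumption: URLs are absolute and their last segments do not collide.
    public static List<WorkLoad> FromUrls(IEnumerable<string> urls, string targetDirectory)
    {
        return urls.Select(url => new WorkLoad
        {
            Url = url,
            FileName = Path.Combine(
                targetDirectory,
                Path.GetFileName(new Uri(url).LocalPath))
        }).ToList();
    }
}
```

The resulting list can then be passed straight to DoWorkLoads. If two URLs end in the same file name, the later download would overwrite the earlier one, so a real implementation would need its own naming scheme.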

Summary

This approach gives you asynchrony and a configurable MaxDegreeOfParallelism, it doesn't waste resources, and it lets IO be IO.

Disclaimer: Dataflow may not be where you want to be, I just thought I'd give you some more information.

Disclaimer 2: the above code has not been tested. I would seriously consider researching this technology first and doing your own due diligence thoroughly.


Loosely related demo here
