[Solved] Mongodb aggregate query, or too complex?


Though it should have been made more clear in your question, your output sample from source suggests that you are looking for:

  • Total count of messages per “uid”
  • Distinct count of values in “to”
  • Distinct count of values in “from”
  • Summary of counts per “hour” for each “uid”

This is all possible in a single aggregation statement, and it just takes some careful management of the distinct lists and then some manipulation to map results for each hour in a 24 hour period.
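The pipelines below assume source documents shaped roughly like this ( field names are taken from the question; the values here are purely illustrative ):

```javascript
// Hypothetical sample document shape; only "uid", "from", "to" and
// "timestamp" are referenced by the pipelines below. Values are made up.
var sample = {
    "uid": 1000000,
    "from": "aaa",
    "to": "bbb",
    "timestamp": new Date("2016-01-01T18:30:00Z")
};
```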

The best approach here is aided by operators introduced in MongoDB 3.2:

db.collection.aggregate([
    // First group by hour within "uid" and keep distinct "to" and "from"
    { "$group": {
        "_id": {
            "uid": "$uid",
            "time": { "$hour": "$timestamp" }
        },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "count": { "$sum": 1 }
    }},

    // Roll-up to "uid" and keep each hour in an array
    { "$group": {
        "_id": "$_id.uid",
        "total": { "$sum": "$count" },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "temp_hours": { 
            "$push": {
                "index": "$_id.time",
                "count": "$count"
            }
        }
     }},

     // Getting distinct "to" and "from" requires a double unwind of arrays
     { "$unwind": "$to" },
     { "$unwind": "$to" },
     { "$unwind": "$from" },
     { "$unwind": "$from" },

     // And then adding back to sets for distinct
     { "$group": {
        "_id": "$_id",
        "total": { "$first": "$total" },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "temp_hours": { "$first": "$temp_hours" }
     }},

     // Map out for each hour and count size of distinct lists
     { "$project": {
        "count": "$total",
        "from_count": { "$size": "$from" },
        "to_count": { "$size": "$to" },
        "hours": {
            "$map": {
                "input": [
                     0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                     12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
                 ],
                 "as": "el",
                 "in": {
                      "$ifNull": [
                          { "$arrayElemAt": [
                              { "$map": {
                                  "input": { "$filter": {
                                     "input": "$temp_hours",
                                     "as": "tmp",
                                     "cond": {
                                         "$eq": [ "$$el", "$$tmp.index" ]
                                     }
                                  }},
                                 "as": "out",
                                 "in": "$$out.count"
                              }},
                              0
                          ]},
                          0
                      ]
                 }
            }
        }
     }},

     // Optionally sort in "uid" order
     { "$sort": { "_id": 1 } }
 ])

Prior to MongoDB 3.2, you need to get a bit more involved to map the array content for all hours in the day:

db.collection.aggregate([

    // First group by hour within "uid" and keep distinct "to" and "from"
    { "$group": {
        "_id": {
            "uid": "$uid",
            "time": { "$hour": "$timestamp" }
        },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "count": { "$sum": 1 }
    }},

    // Roll-up to "uid" and keep each hour in an array
    { "$group": {
        "_id": "$_id.uid",
        "total": { "$sum": "$count" },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "temp_hours": { 
            "$push": {
                "index": "$_id.time",
                "count": "$count"
            }
        }
     }},

     // Getting distinct "to" and "from" requires a double unwind of arrays
     { "$unwind": "$to" },
     { "$unwind": "$to" },
     { "$unwind": "$from" },
     { "$unwind": "$from" },

     // And then adding back to sets for distinct, also adding the indexes array
     { "$group": {
        "_id": "$_id",
        "total": { "$first": "$total" },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "temp_hours": { "$first": "$temp_hours" },
        "indexes": { "$first": { "$literal": [
                     0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                     12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
        ] } }
     }},

     // Denormalize both arrays
     { "$unwind": "$temp_hours" },
     { "$unwind": "$indexes" },

     // Marry up the index entries and keep either the value or 0
     // Note you are normalizing the double unwind to distinct index
     { "$group": {
         "_id": {
             "_id": "$_id",
             "index": "$indexes"
         },
         "total": { "$first": "$total" }, 
         "from": { "$first": "$from" },
         "to": { "$first": "$to" },
         "count": {
             "$max": {
                 "$cond": [
                     { "$eq": [ "$indexes", "$temp_hours.index" ] },
                     "$temp_hours.count",
                     0
                 ]
             }
         }
     }},

     // Sort to keep index order - !!Important!!         
     { "$sort": { "_id": 1 } },

     // Put the hours into the array and get sizes for other results
     { "$group": {
         "_id": "$_id._id",
         "count": { "$first": "$total" },
         "from_count": { "$first": { "$size": "$from" } },
         "to_count": { "$first": { "$size": "$to" } },
         "hours": { "$push": "$count" }
     }},

     // Optionally sort in "uid" order
     { "$sort": { "_id": 1 } }
])

To break that down, both approaches here follow the same basic steps, with the only real difference occurring in the mapping of “hours” for the 24 hour period.

In the first aggregation $group stage, the objective is to get results per hour present in the data, for each “uid” value. The simple $hour date aggregation operator helps obtain this value as part of the grouping key.
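In plain JavaScript terms, the compound grouping key built in that first stage can be pictured like this ( a sketch only; $hour reads the UTC hour from a BSON date, much like getUTCHours() ):

```javascript
// Sketch of the compound grouping key from the first $group stage.
// $hour extracts the UTC hour from a BSON date.
function groupKey(doc) {
    return { uid: doc.uid, time: doc.timestamp.getUTCHours() };
}

var key = groupKey({ uid: 1000000, timestamp: new Date("2016-01-01T18:30:00Z") });
// key is { uid: 1000000, time: 18 }
```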

The $addToSet operations are a sort of “mini-group” in themselves, and this allows keeping the “distinct sets” for each of the “to” and “from” values whilst essentially still grouping per hour.

The next $group is more “organizational”, as the recorded “counts” for each hour are kept in an array whilst rolling up all the data to just be grouped per “uid”. This basically gives you all the “data” you really need for the result, but of course the $addToSet operations here are just adding “arrays within arrays” of the distinct sets determined per hour.

In order to get these values as truly distinct lists per “uid” only, it is necessary to deconstruct each array using $unwind and then finally group back as just the distinct “sets”. The same $addToSet compacts this, and the $first operations just take the “first” values of the other fields, which are already all the same for the target “per uid” data. We are happy with those, so just keep them as they are.
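As a plain JavaScript analogy, the double $unwind followed by $addToSet amounts to flattening an array of per-hour sets and de-duplicating ( values here are made up ):

```javascript
// After the second $group, "to" is an array of per-hour distinct arrays,
// e.g. one small set for each hour that had messages:
var to = [ [ "aaa", "bbb" ], [ "bbb", "ccc" ] ];

// The double $unwind flattens both levels; $addToSet then de-duplicates.
var distinctTo = [];
to.forEach(function(hourSet) {
    hourSet.forEach(function(value) {
        if (distinctTo.indexOf(value) === -1) distinctTo.push(value);
    });
});
// distinctTo now holds the truly distinct values: "aaa", "bbb", "ccc"
```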

The final stage(s) here are essentially “cosmetic” in nature and can equally be achieved in client side code. Since there is no data present for every single hour interval, it needs to be mapped into an array of values representing each hour. The two approaches here vary on the capabilities of the operators available between versions.

In the MongoDB 3.2 release, there are $filter and $arrayElemAt operators that effectively allow you to create the logic to “transpose” an input source of all possible index positions ( 24 hours ) into the values already determined for the counts from those hours in the available data. This is basically a “direct lookup” of values already recorded for each available hour to see if it exists; where it does, the count is transposed into the full array. Where it is not present, a default value of 0 is used in place.
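That $map / $filter / $arrayElemAt combination behaves much like this plain JavaScript sketch ( the "temp_hours" content is made up ):

```javascript
// "temp_hours" as produced earlier; only hours with data appear.
var temp_hours = [ { "index": 18, "count": 4 } ];

// For each of the 24 hours, look up a recorded count, defaulting to 0.
// This mirrors $map over 0..23 with $filter, $arrayElemAt and $ifNull.
var hours = [];
for (var h = 0; h < 24; h++) {
    var matched = temp_hours.filter(function(tmp) {
        return tmp.index === h;           // the $filter "cond"
    }).map(function(out) {
        return out.count;                 // the inner $map "in"
    });
    hours.push(matched.length ? matched[0] : 0);   // $arrayElemAt / $ifNull
}
// hours[18] is 4 and every other position is 0
```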

Without those operators, doing this “match up” essentially means de-normalizing both arrays ( the recorded data and the full 24 positions ) in order to compare and transpose. This is what is happening in the second approach with a simple comparison of the “index” values to see if there was a result for that hour. The $max operator here is mainly used because of the two $unwind statements, where each recorded value from the source data is going to be reproduced for every possible index position. This “compacts” down to just the values that are wanted per “index hour”.
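That de-normalize and compare step can likewise be sketched in plain JavaScript as a cross product of the two arrays, taking the maximum per index ( the recorded counts here are made up ):

```javascript
// Recorded per-hour counts; only hours with data appear.
var temp_hours = [ { "index": 18, "count": 4 }, { "index": 20, "count": 2 } ];

// The two $unwind stages pair every index with every temp_hours entry;
// $cond emits the count on a match or 0, and $max keeps the match per index.
var counts = [];
for (var idx = 0; idx < 24; idx++) {
    var max = 0;
    temp_hours.forEach(function(entry) {
        var value = ( entry.index === idx ) ? entry.count : 0;
        if (value > max) max = value;
    });
    counts.push(max);
}
// counts[18] is 4, counts[20] is 2, all other positions are 0
```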

In that latter approach it then becomes important to $sort on the grouping _id value. This is because it contains the “index” position, and that is going to be needed when moving this content back into an array you expect to be ordered, which is of course the final $group stage here where the ordered positions are put into an array with $push.

Back to the “distinct lists”, the $size operator is used in all cases to determine the “length” and therefore “count” of distinct values in the lists for “to” and “from”. This is the only real constraint on MongoDB 2.6 at least, but can otherwise be replaced with simply “unwinding” each array individually and then grouping back on the _id already present in order to count the array entries in each set. It’s a basic process, but as you should see the $size operator is the better option here for overall performance.
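Should you be on MongoDB 2.4, where $size is not available, that alternative looks roughly like this fragment ( a sketch only, replacing the $size usage in the stages above, with field names matching the earlier pipeline ):

```javascript
// Fragment only: count each distinct set by unwinding and grouping,
// instead of using $size. Slots in before the final "hours" mapping.
var sizeReplacement = [
    { "$unwind": "$to" },
    { "$group": {
        "_id": "$_id",
        "total": { "$first": "$total" },
        "from": { "$first": "$from" },
        "to_count": { "$sum": 1 },
        "temp_hours": { "$first": "$temp_hours" }
    }},
    { "$unwind": "$from" },
    { "$group": {
        "_id": "$_id",
        "total": { "$first": "$total" },
        "to_count": { "$first": "$to_count" },
        "from_count": { "$sum": 1 },
        "temp_hours": { "$first": "$temp_hours" }
    }}
];
```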


As a final note, your concluded output is a little off, as possibly the entry with “ddd” in “from” was intended to also be “ddd” in “to”, but is instead recorded as “bbb”. This changes the distinct count for “to” in the third “uid” grouping down by one entry. Of course, the logical results given the source data are sound:

{ "_id" : 1000000, "count" : 3, "from_count" : 2, "to_count" : 2, "hours" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0 ] }
{ "_id" : 2000000, "count" : 2, "from_count" : 1, "to_count" : 1, "hours" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0 ] }
{ "_id" : 3000000, "count" : 5, "from_count" : 5, "to_count" : 4, "hours" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0 ] }

N.B. The source also has a typo, with the delimiter interposed as : instead of a comma right after the timestamp on all lines.
