Merge collections without duplicates in C#
Posted by Jonas Elfström Thu, 18 Oct 2012 21:28:00 GMT
Your task is to merge two lists of objects. The resulting collection should be without duplicates, based on a certain property on the objects.
The lists are populated with very simple Person objects.
1 2 3 4 5 |
class Person
{
public int Number { get; set; }
public string Name { get; set; }
} |
C# programmers often turn to LINQ, and they should! With LINQ you could end up with something like this:
1 2 3 |
var merged = new List<Person>(list1); merged.AddRange(list2.Where(p2 => list1.All(p1 => p1.Number != p2.Number))); |
It's not all that obvious how it works and it also has performance problems. This solution has to iterate over list1
for every object in list2
. Not every object every time but on average half of them (if the lists contains no duplicates within themselves). If the lists contains a 1000 objects each there's a good chance that list1
will be iterated 500.000 times. That kind of work does not come for free.
We need to get rid of this nested looping. I was pretty sure a dictionary based solution would do the trick.
1 2 3 4 5 6 |
var dict = list2.ToDictionary(p => p.Number); foreach (var person in list1) { dict[person.Number] = person; } var merged = dict.Values.ToList(); |
This solution converts list2
to a dictionary and then it loops a single time over list1
. It either adds to the dictionary or replaces the value of list2
if it was a duplicate. This seemed to be a much more efficient algorithm.
In C# there's not just Dictionary<TKey, TValue> but we also have HashSet. A HashSet is a collection that contains no duplicates. It also has the UnionWith
method and that is exactly what we are looking for. The problem is that it compares object to object and we want to compare based on a property.
Fortunately the HashSet methods honors the default equality comparer. Such a beast could look like this.
1 2 3 4 5 6 7 8 9 10 11 12 |
class PersonComparer : IEqualityComparer<Person> { public bool Equals(Person p1, Person p2) { return p1.Number == p2.Number; } public int GetHashCode(Person p) { return p.Number; } } |
Then just add it as the second parameter to the constructor.
1 2 3 |
var hs = new HashSet<Person>(list1, new PersonComparer()); hs.UnionWith(list2); var merged = hs.ToList(); |
Set theory tells us that this merging is actually a union of the two sets (without duplicates) and LINQ happens to have a Union
method.
var merged = list1.Union(list2, new PersonComparer()); |
By that we've come full circle and it's time to see how the different solutions performs. I wrote a small test program that creates two lists of objects with 1000 objects in each. In each list 500 of the objects property, that we base the merge on, has the same value.
Lists and LINQ merge: 4820ms Dictionary merge: 16ms HashSet and IEqualityComparer: 20ms LINQ Union and IEqualityComparer: 24ms
The first solution is, in this case, 300 times slower than the fastest one!
I prefer LINQ Union because it communicates intent very clearly.