A customer had two processes that used CreateMutex
to create the mutex or obtain a handle to the existing one.
But the customer found that sometimes, the call to
CreateMutex
fails, and GetLastError
reports that the reason is
ERROR_ACCESS_DENIED
.
What could cause that?
Specifically, the two processes are a UI process and a service.
Both processes use CreateMutex
to create or
access the mutex, passing NULL
as the security attributes.
Okay, so the issue is that the two processes are running
under different identities with different privileges.
Even though you think you are creating the mutex in both places
with the same security attributes
(because you're passing NULL
both times),
the effect of the NULL
is different depending on who
is calling.
What's probably happening is that most of the time, the mutex is created by the UI process first, so the mutex gets the access control list that grants access to the user under which the UI process is running, and to the SYSTEM account. The service is running under the SYSTEM account, so it gets access.
But once in a while, it's the service that creates the mutex first. In that case, the access control list on the mutex grants access only to the SYSTEM account, in which case the UI process cannot access the mutex.
The customer reported that they saw this behavior only on builds 14393 and higher and were wondering what could be causing it. My guess is that something is happening on build 14393 that wakes up the service sooner than it used to, causing the service to be the one to create the mutex, and that's what's preventing the UI process from gaining access.
In these sorts of cases, where two processes with different permissions need to share a securable object, the general principle is to have the high-permission process create the object and set the permissions so that the low-permission process can access it. If you let the low-permission process be the one that creates the mutex, then you have a potential squatting attack.
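For concreteness, here is one way the high-permission process could do that. This is a sketch only: the SDDL string and mutex name are invented for illustration, and error handling is abbreviated.

```cpp
// Sketch: the high-permission process (here, the service) creates the mutex
// with an explicit DACL so the UI process can open it no matter who starts
// first. The SDDL string and mutex name are illustrative, not from the
// original article.
#include <windows.h>
#include <sddl.h>

HANDLE CreateSharedMutex()
{
    SECURITY_ATTRIBUTES sa{ sizeof(sa), nullptr, FALSE };
    // Grant GENERIC_ALL to SYSTEM (SY) and to interactive users (IU).
    if (!ConvertStringSecurityDescriptorToSecurityDescriptorW(
            L"D:(A;;GA;;;SY)(A;;GA;;;IU)", SDDL_REVISION_1,
            &sa.lpSecurityDescriptor, nullptr)) {
        return nullptr;
    }
    HANDLE mutex = CreateMutexW(&sa, FALSE, L"Global\\LitwareSharedMutex");
    LocalFree(sa.lpSecurityDescriptor);
    return mutex;
}
```

Now it doesn't matter which process runs first; the DACL is the same either way.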
The most old-fashioned way of declaring a page to be uninteresting
is to free it.
The catch with that is that freeing the memory with
the VirtualFree
function
and the
MEM_RELEASE
flag
frees the entire allocation, not individual pages.
If you allocated a
64KB
chunk of memory,
then you have to release the whole thing.
You can't release half of it.
But all is not lost.
Because while you cannot free a single page from a
larger allocation,
you can decommit it,
which is almost as good.
Decommitting a page is like freeing it, except that the address
space is still reserved.
To decommit a page,
call
VirtualFree
with the
MEM_DECOMMIT
flag.
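A quick sketch of the difference (not from the original article; it assumes a 4KB page size and the usual 64KB allocation granularity):

```cpp
// Sketch: you cannot MEM_RELEASE a single page out of a larger allocation,
// but you can MEM_DECOMMIT it. Assumes a 4KB page size.
#include <windows.h>

void demo()
{
    // Reserve and commit a 64KB block (sixteen 4KB pages).
    auto p = static_cast<BYTE*>(VirtualAlloc(nullptr, 65536,
        MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
    if (!p) return;

    // Decommit just the second page; the address space stays reserved.
    VirtualFree(p + 4096, 4096, MEM_DECOMMIT);

    // Releasing requires the base address and a size of zero:
    // the entire allocation goes away at once.
    VirtualFree(p, 0, MEM_RELEASE);
}
```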
For quite some time, those were the only tools you had available.
Around the Windows NT 4 era,
a new trick arrived on the scene:
You could
VirtualUnlock
memory that was already unlocked
in order to remove it from your working set.
This was a trick, because it took what used to be a programming
error and gave it additional meaning,
but in a way that didn't break backward compatibility
because the contractual behavior of the memory did not change:
The contents of the memory remain valid and the program is still
free to access it at any time.
The new behavior is that unlocking unlocked memory also takes it out
of the process's working set,
so that it becomes a prime candidate for being paged out
and used to satisfy another memory allocation.
The fact that it preserved contractual behavior means that
you could scatter
VirtualUnlock
calls randomly throughout the
program and have no effect on correctness.
It might run slower (or faster), but it will still run.
Around the Windows 2000 era,
the
MEM_RESET
flag was added to
VirtualAlloc
.
If you pass this flag, this tells the memory manager
that the memory in question is no longer interesting
to your program, and the memory manager is free to discard
it without saving the contents.
The memory itself remains accessible to the program,
and reading it before the memory gets discarded will return
the old values.
On the other hand, if the memory manager decides that it needs
to evict the memory (in order to satisfy a memory request
elsewhere),
it will throw away the contents without saving it,
and then turn the page into a demand-zero-initialized page.
Later, if your program tries to access the memory,
it will see a page full of zeroes.
Windows 8 added the
MEM_RESET_UNDO
flag
which says,
"Hey, um, I changed my mind. I don't want you to discard
the contents of the memory after all."
If the memory hasn't yet been discarded, then it is "rescued"
and behaves like normal memory again.
But if the memory has already been discarded,
then the memory manager will say,
"Sorry, too late."
And then at some point, I don't know exactly when, my colleague Adrian added code that checks whether a page of memory is all zeroes before paging it out, and turns it into a demand-zero-initialized page if so. So another way to say that you are not interested in a page of memory is to explicitly zero it. That causes it to turn into a demand-zero-initialized page at page-out time, which avoids the I/O of writing a page full of zeroes to disk. This is another one of those things that has no effect on the programming model; it's just an optimization. If you are running on a system that doesn't perform this optimization, everything still behaves the same as before, just a little slower.
Note that writing the zeroes to the page does have its own side effects. (Well, aside from the obvious side effect of, y'know, filling the page with zeroes.) Writing to the page will set both the Dirty and Accessed bits in the page table, which will bring it into the process's working set, and therefore will reduce its likelihood of being selected for eviction. In other words, zeroing out the page "resets the clock" on the eviction calendar. Therefore, if you're going to do this, do it as soon as you're done with the memory.
In Windows 8.1 we got the function
OfferVirtualMemory
which mixes in a few new wrinkles.
First of all, when you call
OfferVirtualMemory
,
you pass a flag that says how much you don't care about
this memory:
You can say that you totally don't care,
you mostly don't care,
you sort of don't care,
or you have no opinion on the concept of caring.
Okay, formally, what you're doing is saying how to prioritize the memory for discarding. At one extreme, you can make it a prime candidate for discarding. At the other extreme, you can say, "No special priority here. Just prioritize it according to the standard rules, as if it were plain old regular process memory."
The other wrinkle to the
OfferVirtualMemory
function is that once you offer the memory,
it is no longer accessible to your program.
Trying to access memory that has been offered will take
an access violation.
If you later decide that you want the memory back,
you can call
ReclaimVirtualMemory
,
which will try to bring the memory back into your process.
If it fails, then the contents are garbage.
There's also a companion function
DiscardVirtualMemory
which forces an immediate discard
and leaves the page contents undefined.
It's the equivalent of calling
OfferVirtualMemory,
then calling
ReclaimVirtualMemory,
and forcing the reclaim to fail.
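A sketch of how the offer/reclaim dance might look. The function names are invented, error handling is abbreviated, and as I read the documentation, ReclaimVirtualMemory reports ERROR_BUSY when the memory was reclaimed but the contents had already been discarded.

```cpp
// Sketch only: park a cache buffer while idle, try to recover it before reuse.
#include <windows.h>

bool ParkBuffer(void* buffer, SIZE_T size)
{
    // Make the buffer a prime candidate for discarding.
    return OfferVirtualMemory(buffer, size, VmOfferPriorityVeryLow)
        == ERROR_SUCCESS;
}

bool RecoverBuffer(void* buffer, SIZE_T size)
{
    // While offered, touching the buffer takes an access violation,
    // so reclaim it before using it again.
    DWORD result = ReclaimVirtualMemory(buffer, size);
    if (result == ERROR_SUCCESS) return true;  // contents intact
    if (result == ERROR_BUSY) return false;    // contents discarded: garbage
    return false;                              // reclaim failed outright
}
```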
Okay, so here we go with the table.
| Technique | Address space still reserved? | Memory accessible? | Removed from working set? | Can control eviction priority? | Previous contents recoverable? | Contents if recovery failed |
|---|---|---|---|---|---|---|
| VirtualFree + MEM_RELEASE | N | N | Y | N | N | N/A |
| VirtualFree + MEM_DECOMMIT | Y | N | Y | N | N | N/A |
| VirtualUnlock | Y | Y | Y | N | Y | N/A |
| VirtualAlloc + MEM_RESET | Y | Y | N¹ | N | Y until eviction | Zeroes |
| ZeroMemory | Y | Y | N | N | N | Zeroes |
| DiscardVirtualMemory | Y | Y | Y | N | N | Garbage |
| OfferVirtualMemory | Y | N | Y | Y | Y until eviction | Garbage |
Bonus chatter:
The flip side of discarding memory is prefetching it.
I've
discussed
the
PrefetchVirtualMemory
before,
so I'll leave it at a mention this time.
(And here's a
non-mention.)
¹
The fact that
MEM_RESET
does not remove the page
from the working set is not actually mentioned
in the documentation for the
MEM_RESET
flag.
Instead, it's mentioned in the documentation
for the
OfferVirtualMemory
function,
and in a sort of backhanded way:
"Note that offering and reclaiming virtual memory is similar to using the MEM_RESET and MEM_RESET_UNDO memory allocation flags, except that OfferVirtualMemory removes the memory from the process working set and restricts access to the offered pages until they are reclaimed."
A customer wanted to use the functions in the VersionHelpers.h
header file,
like IsWindows10OrGreater,
while maintaining the same source code for both
Windows 7 and Windows 10 targets.
Their plan was to LoadLibrary
for kernel32.dll
,
and then
GetProcAddress
for
IsWindows10OrGreater
,
but they found that the
GetProcAddress
call
always failed,
even on Windows 10.
They looked in kernelbase.dll
and
ntdll.dll
; no luck there either.
How is it possible that Windows 10 doesn't know whether it
is Windows 10?
The customer investigated further and found that
when their test program called
IsWindows10OrGreater
,
there was no call to LoadLibrary
at all!
0:000> k
 # ChildEBP RetAddr
00 0133f9fc 00c01806 ntdll!VerSetConditionMask+0x14
01 0133fc2c 00c01739 Test!IsWindowsVersionOrGreater+0xa6
02 0133fd0c 00c01a33 Test!IsWindows10OrGreater+0x29
03 0133fe10 00c022be Test!main+0x23
04 0133fe24 00c02120 Test!invoke_main+0x1e
05 0133fe7c 00c01fbd Test!__scrt_common_main_seh+0x150
06 0133fe84 00c022d8 Test!__scrt_common_main+0xd
07 0133fe8c 77a962c4 Test!mainCRTStartup+0x8
08 0133fea0 77bd0609 KERNEL32!BaseThreadInitThunk+0x24
09 0133fee8 77bd05d4 ntdll!__RtlUserThreadStart+0x2f
0a 0133fef8 00000000 ntdll!_RtlUserThreadStart+0x1b
The customer wanted to know how to call
functions like
IsWindows10OrGreater
dynamically.
The reason the customer cannot find the function
IsWindows10OrGreater
in kernel32.dll
or any other DLL is simple:
The function was inside you all along.
The functions in the VersionHelpers.h
header
file are all inline functions.
They are not exported anywhere.
These functions do the grunt work of figuring out the
operating system so you don't have to write the
version detection code yourself
(and invariably mess up)
by building the appropriate query and calling
VerifyVersionInfo
,
which has been available since Windows 2000
(possibly longer).
If you think about it, the answer must be like that,
for how could kernel32.dll
export all of these specific version-checking functions?
The Windows 7 version of kernel32.dll
would have
to be clairvoyant
and have exports for all of these functions like
IsWindows10OrGreater
,
which would be quite a feat.
Presumably the implementations would
simply be hardcoded to return either
TRUE
or FALSE
, as appropriate.
(I guess you could imagine that each version of Windows
exports only the functions for which it returns TRUE
,
and the absence of the function implies that the
corresponding version is not installed.)
So just go ahead and use the functions in
VersionHelpers.h
.
They will always work and give you an answer.
(Well, unless you're targeting systems earlier
than Windows 2000,
but if you're doing that, then you probably aren't
too interested in version detection
since your customer that is still running Windows NT 4.0
is unlikely ever to upgrade.)
Bonus chatter: Note that the operating system version check does raise its own question: "Why are you doing an operating system version check at all?" Because that sort of thing gives the application compatibility team the heebie-jeebies. We asked, but the customer never did answer that question.
Today we'll write apply_reverse_permutation,
in which each element in the indices
represents where the
element should move to rather than where it comes from.
This exercise is easier than the forward permutation case¹ because we can maintain a very simple invariant: At all times, the state of the input variables describes the same final result. All we do is take a step closer to that result at each swap.
template<typename Iter1, typename Iter2>
void apply_reverse_permutation(
    Iter1 first, Iter1 last, Iter2 indices)
{
    using std::swap;
    using T = typename std::iterator_traits<Iter1>::value_type;
    using Diff = typename std::iterator_traits<Iter2>::value_type;
    Diff length = std::distance(first, last);
    for (Diff i = 0; i < length; i++) {
        while (i != indices[i]) {
            Diff next = indices[i];
            swap(first[i], first[next]);
            swap(indices[i], indices[next]);
        }
    }
}
The idea here is that we have a "working position" that starts at the beginning of the collection. We study the element in the working position and move it to its final destination by swapping it with whatever happens to be there right now. We also swap the bookkeeping so that we also remember where that swapped element needs to go eventually. As a result of the swap, you now have a new element in the working position. Repeat as long as the element in the working position is not in the correct position. If you have a proper permutation, then eventually you will reach the end of the cycle and the element in the working position is one that wants to be there. You can now move the working position to the next position until you have moved through the entire collection.
The complexity of this algorithm is O(N) because each swap operation moves one element to its final destination, and no element appears on the left-hand side of a swap operation more than once. (An element may end up swapping at most twice: once when it swaps out of its old position into the working position, and again when it swaps out of the working position to its final position.)
Fans of tail recursion can rewrite this function tail-recursively, which might be instructive. (Or it might not. At least it'll be fun, if rewriting functions to be tail-recursive is your idea of fun.)
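To see the reverse-permutation convention in action, here is a small self-contained check. The helper name and sample data are invented for illustration; the point is that the result r satisfies r[indices[i]] == v[i].

```cpp
#include <cassert>
#include <iterator>
#include <utility>
#include <vector>

template<typename Iter1, typename Iter2>
void apply_reverse_permutation(Iter1 first, Iter1 last, Iter2 indices)
{
    using std::swap;
    using Diff = typename std::iterator_traits<Iter2>::value_type;
    Diff length = std::distance(first, last);
    for (Diff i = 0; i < length; i++) {
        while (i != indices[i]) {
            Diff next = indices[i];
            swap(first[i], first[next]);
            swap(indices[i], indices[next]);
        }
    }
}

// Demo: 'a' moves to slot 2, 'b' to slot 0, 'c' to slot 3, 'd' to slot 1.
std::vector<char> demo_reverse_permutation()
{
    std::vector<char> v = { 'a', 'b', 'c', 'd' };
    std::vector<int> indices = { 2, 0, 3, 1 };
    apply_reverse_permutation(v.begin(), v.end(), indices.begin());
    return v;
}
```

Running the demo produces { 'b', 'd', 'a', 'c' }: element 0 ('a') ended up in slot 2, exactly as indices[0] requested.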
As before, we can add error checking while preserving the same useful exit conditions: If an error occurs, the original collection and the indices are permuted in an unspecified way.
As before,
there are two error cases.
One is that an index is out of range.
That's easy to check.
The other is that an index appears more than once.
This could be detected in a number of ways.
One way is to detect that we have swapped more than
length
items through the working position,
because the length of a cycle in a permutation
cannot exceed the number
of elements.
But I'm going to use the same technique we used for the
forward permutation:
When we realize that we are about to swap an element that
has already been swapped into position.
In other words, if the element at the destination already
thinks that it's at the correct destination,
then we have an error because
two elements both want to go to the same
destination.
template<typename Iter1, typename Iter2>
void apply_reverse_permutation(
    Iter1 first, Iter1 last, Iter2 indices)
{
    using std::swap;
    using T = typename std::iterator_traits<Iter1>::value_type;
    using Diff = typename std::iterator_traits<Iter2>::value_type;
    Diff length = std::distance(first, last);
    for (Diff i = 0; i < length; i++) {
        while (i != indices[i]) {
            Diff next = indices[i];
            if (next < 0 || next >= length) {
                throw std::range_error("Invalid index in permutation");
            }
            if (next == indices[next]) {
                throw std::range_error("Not a permutation");
            }
            swap(first[i], first[next]);
            swap(indices[i], indices[next]);
        }
    }
}
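Here's a quick self-contained exercise of the two error paths (helper name and sample inputs are invented for illustration):

```cpp
#include <cassert>
#include <iterator>
#include <stdexcept>
#include <utility>
#include <vector>

template<typename Iter1, typename Iter2>
void apply_reverse_permutation_checked(Iter1 first, Iter1 last, Iter2 indices)
{
    using std::swap;
    using Diff = typename std::iterator_traits<Iter2>::value_type;
    Diff length = std::distance(first, last);
    for (Diff i = 0; i < length; i++) {
        while (i != indices[i]) {
            Diff next = indices[i];
            if (next < 0 || next >= length) {
                throw std::range_error("Invalid index in permutation");
            }
            if (next == indices[next]) {
                throw std::range_error("Not a permutation");
            }
            swap(first[i], first[next]);
            swap(indices[i], indices[next]);
        }
    }
}

// Returns true if applying the given indices to a three-element vector throws.
bool reverse_permutation_throws(std::vector<int> indices)
{
    std::vector<char> v = { 'a', 'b', 'c' };
    try {
        apply_reverse_permutation_checked(v.begin(), v.end(), indices.begin());
        return false;
    } catch (const std::range_error&) {
        return true;
    }
}
```

A valid permutation like { 2, 0, 1 } sails through; { 5, 0, 1 } trips the range check, and { 1, 1, 2 } (two elements headed for slot 1) trips the duplicate check.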
¹ This is different from the forward permutation, where the work of rotating the elements through a cycle leaves the inputs in a temporarily unstable state. We saw last time that before we could report an error, we had to restore some state, and the state that we restored didn't even correspond meaningfully to the intermediate state. It merely corresponded to the original state in a very weakly-specified way.
Our apply_permutation
function
assumes that the integers form a valid permutation.
Let's add error checking.
There are two ways the integers could fail to be a permutation: One is that the collection includes a value that is out of range. The other problem case is that all the values are in range, but a value appears more than once. We can detect that when we encounter a single-element cycle when we expected a longer cycle. (Another way of looking at it is that we detect the error when we discover that we're about to move an item for the second time, because the permutation application algorithm is supposed to move each item only once.)
template<typename Iter1, typename Iter2>
void apply_permutation(
    Iter1 first, Iter1 last, Iter2 indices)
{
    using std::swap;
    using T = typename std::iterator_traits<Iter1>::value_type;
    using Diff = typename std::iterator_traits<Iter2>::value_type;
    Diff length = std::distance(first, last);
    for (Diff i = 0; i < length; i++) {
        Diff current = i;
        while (i != indices[current]) {
            Diff next = indices[current];
            if (next < 0 || next >= length) {
                throw std::range_error("Invalid index in permutation");
            }
            if (next == current) {
                throw std::range_error("Not a permutation");
            }
            swap(first[current], first[next]);
            indices[current] = current;
            current = next;
        }
        indices[current] = current;
    }
}
(I added the typename
keyword at the suggestion of
commenter ildjarn.
And I used std::distance
to calculate the distance
between two iterators.
The second change was not technically necessary because
std::distance
is defined as subtraction when the
iterators are randomaccess, but if you're going to go with the
standard library, you may as well go all the way, right?)
I switched to the swapping version of the algorithm because
that allows me to ensure a useful exit condition in the
case of exception:
If an exception occurs, the elements in
[first, last)
have been permuted in an
unspecified manner.
Even though the resulting order is unspecified,
you at least know that no items were lost.
It's the same set of items, just in some other order.
The indices, on the other hand, are left in an unspecified state.
They won't be a permutation of the original indices.
But wait, we can even restore the indices
to a permutation of their former selves:¹
We can take the duplicate index and drop it back into
indices[i]
.
That entry was optimistically set to the value we expected to
find when we reached the end of the cycle.
If we never find that value, then we can put the value we
actually found into that slot, thereby correcting our optimistic
assumption.
template<typename Iter1, typename Iter2>
void apply_permutation(
    Iter1 first, Iter1 last, Iter2 indices)
{
    using std::swap;
    using T = typename std::iterator_traits<Iter1>::value_type;
    using Diff = typename std::iterator_traits<Iter2>::value_type;
    Diff length = std::distance(first, last);
    for (Diff i = 0; i < length; i++) {
        Diff current = i;
        while (i != indices[current]) {
            Diff next = indices[current];
            if (next < 0 || next >= length) {
                indices[i] = next;
                throw std::range_error("Invalid index in permutation");
            }
            if (next == current) {
                indices[i] = next;
                throw std::range_error("Not a permutation");
            }
            swap(first[current], first[next]);
            indices[current] = current;
            current = next;
        }
        indices[current] = current;
    }
}
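We can check the "permutation of their former selves" claim directly: after a failed apply, the index array should hold the same multiset of values it started with. The helper name and sample input are invented for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <stdexcept>
#include <utility>
#include <vector>

template<typename Iter1, typename Iter2>
void apply_permutation_checked(Iter1 first, Iter1 last, Iter2 indices)
{
    using std::swap;
    using Diff = typename std::iterator_traits<Iter2>::value_type;
    Diff length = std::distance(first, last);
    for (Diff i = 0; i < length; i++) {
        Diff current = i;
        while (i != indices[current]) {
            Diff next = indices[current];
            if (next < 0 || next >= length) {
                indices[i] = next; // put the duplicate back before throwing
                throw std::range_error("Invalid index in permutation");
            }
            if (next == current) {
                indices[i] = next; // put the duplicate back before throwing
                throw std::range_error("Not a permutation");
            }
            swap(first[current], first[next]);
            indices[current] = current;
            current = next;
        }
        indices[current] = current;
    }
}

// Apply a bad permutation, and report whether the index array still holds
// the same multiset of values after the exception.
bool indices_preserved_after_failure()
{
    std::vector<char> v = { 'a', 'b', 'c' };
    std::vector<int> indices = { 1, 1, 2 }; // duplicate: not a permutation
    std::vector<int> original = indices;
    try {
        apply_permutation_checked(v.begin(), v.end(), indices.begin());
        return false; // should have thrown
    } catch (const std::range_error&) {
        std::sort(indices.begin(), indices.end());
        std::sort(original.begin(), original.end());
        return indices == original;
    }
}
```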
¹ This is valuable because it improves post-mortem debuggability: You can inspect the indices to look for the out-of-range or duplicate index.
Let's look at the running time of the
apply_permutation
function.
Here it is again:
template<typename T>
void apply_permutation(
    std::vector<T>& v,
    std::vector<int>& indices)
{
    using std::swap; // to permit Koenig lookup
    for (size_t i = 0; i < indices.size(); i++) {
        auto current = i;
        while (i != indices[current]) {
            auto next = indices[current];
            swap(v[current], v[next]);
            indices[current] = current;
            current = next;
        }
        indices[current] = current;
    }
}
The outer for
loop runs N times;
that's easy to see.
The maximum number of iterations of the
inner while
loop is a bit less obvious,
but if you understood the discussion,
you'll see that it runs at most N times
because that's the maximum length of a cycle in the
permutation.
(Of course, this analysis requires that the indices
be a permutation of 0 … N − 1.)
Therefore,
a naïve analysis would conclude that this
has worst-case running time of O(N²)
because the outer for
loop runs N times,
and the inner while
loop also runs N times
in the worst case.
But it's not actually that bad. The complexity is only O(N), because not all of the worstcase scenarios can exist simultaneously.
One way to notice this is to observe that each item
moves only once, namely to its final position.
Once an item is in the correct position, it never
moves again.
In terms of indices, observe that each swap corresponds
to an assignment indices[current] = current
.
Therefore, each entry in the index array gets set to its
final value.
And the while
loop doesn't iterate at all
if indices[current] == current
,
so when we revisit an element that has already moved
to its final location, we do nothing.
Since each item moves at most once,
and there are N items, then the total number
of iterations of the inner while
loop
is at most N.
Another way of looking at this is that the
while
loop walks through every cycle.
But mathematics tells us that permutations decompose
into disjoint cycles,
so
the number of elements involved in the cycles cannot
exceed the total number of elements.
Either way, the conclusion is that there are at most
N iterations of the inner while
loop
in total.
Combine this with the fixed overhead of N iterations
of the for
loop,
and you see that the total running time complexity is
O(N).
(We can determine via inspection that the algorithm consumes constant additional space.)
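To make the O(N) bound concrete, here is an instrumented version that counts inner-loop iterations, applied to the worst case of a single big cycle (a rotation). The counter and helper names are my additions for illustration, not part of the original function.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// apply_permutation with a counter for inner-loop iterations (= swaps).
std::size_t apply_permutation_counting(
    std::vector<int>& v, std::vector<std::size_t>& indices)
{
    using std::swap;
    std::size_t iterations = 0;
    for (std::size_t i = 0; i < indices.size(); i++) {
        auto current = i;
        while (i != indices[current]) {
            auto next = indices[current];
            swap(v[current], v[next]);
            indices[current] = current;
            current = next;
            ++iterations;
        }
        indices[current] = current;
    }
    return iterations;
}

// One big N-cycle: indices[i] = (i + 1) mod N, i.e. a rotation.
std::size_t swap_count_for_rotation(std::size_t n)
{
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 0);
    std::vector<std::size_t> indices(n);
    for (std::size_t i = 0; i < n; i++) indices[i] = (i + 1) % n;
    return apply_permutation_counting(v, indices);
}
```

Even though the rotation pushes every element through one giant cycle, the inner loop runs only N − 1 times in total (all of them during the first outer iteration), not N times per outer iteration.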
template<typename Iter, typename UnaryOperation, typename Compare>
void sort_by(Iter first, Iter last, UnaryOperation op, Compare comp)
{
    using T = typename std::iterator_traits<Iter>::value_type;
    std::sort(first, last, [&](T& a, T& b) {
        return comp(op(a), op(b));
    });
}
The idea here is that you give a unary operator op
that
produces a sort key, and we sort the items by that key according
to the comparer.
For example, you might say
std::vector<Person> v = ...;

// Sort by last name
sort_by(v.begin(), v.end(),
        [](const Person& p) { return p.LastName; },
        std::less<std::string>());
The first functional selects the thing we are sorting by (here, the last name), and the second functional selects how we are sorting (here, in ascending order).
This technique works okay if the unary operator (the key generator) is simple, such as the one we have here. But if generating the key is expensive, then we will want to cache the keys rather than evaluating them over and over. So let's do it:
template<typename Iter, typename UnaryOperation, typename Compare>
void sort_by_with_caching(Iter first, Iter last, UnaryOperation op, Compare comp)
{
    using Diff = typename std::iterator_traits<Iter>::difference_type;
    using T = typename std::iterator_traits<Iter>::value_type;
    using Key = decltype(op(std::declval<T>()));
    using Pair = std::pair<T, Key>;

    Diff length = std::distance(first, last);
    std::vector<Pair> pairs;
    pairs.reserve(length);
    std::transform(first, last, std::back_inserter(pairs), [&](T& t) {
        auto key = op(t); // compute the key before moving out of t
        return std::make_pair(std::move(t), std::move(key));
    });
    std::sort(pairs.begin(), pairs.end(), [&](const Pair& a, const Pair& b) {
        return comp(a.second, b.second);
    });
    std::transform(pairs.begin(), pairs.end(), first, [](Pair& p) {
        return std::move(p.first);
    });
}
The above is a fairly literal translation of the Schwartzian transform (also known more conventionally as the decorate-sort-undecorate pattern) into C++. You augment each item to be sorted with its corresponding key.¹ You then sort by the key.² And then you throw away the keys, leaving the original items.
We use std::move
to move the items out of the
original collection into our temporary vector of pairs,
then we sort the pairs by the key,
and then we move the items from our pairs back to the
original collection.
The hope is that the object is efficiently movable,
so these move operations are very inexpensive.
But maybe the objects being sorted aren't efficiently movable.
Or maybe (horrors) the keys aren't efficiently movable.
We can use the trick from the
sort_minimize_copies
function to sort the items with minimal moving.
template<typename Iter, typename UnaryOperation, typename Compare>
void sort_by_with_caching(Iter first, Iter last, UnaryOperation op, Compare comp)
{
    using Diff = typename std::iterator_traits<Iter>::difference_type;
    using T = typename std::iterator_traits<Iter>::value_type;
    using Key = decltype(op(std::declval<T>()));

    Diff length = std::distance(first, last);
    std::vector<Key> keys;
    keys.reserve(length);
    std::transform(first, last, std::back_inserter(keys), [&](T& t) {
        return op(t);
    });

    std::vector<Diff> indices(length);
    std::iota(indices.begin(), indices.end(), static_cast<Diff>(0));
    std::sort(indices.begin(), indices.end(), [&](Diff a, Diff b) {
        return comp(keys[a], keys[b]);
    });

    apply_permutation(first, last, indices.begin());
}

template<typename Iter, typename UnaryOperation>
void sort_by_with_caching(Iter first, Iter last, UnaryOperation op)
{
    sort_by_with_caching(first, last, op, std::less<>());
}
We create two helper arrays. One holds the keys corresponding to the items, and the other holds the indices. The keys are in a parallel array with the original collection and do not move during sorting. Instead, we sort the indices. Once we finish the sort, we apply the permutation to the original items to move them to their final positions.
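Here is the cached-key idea as a small self-contained smoke test: sort strings by length, sorting indices against a parked key array. The helper name and sample data are invented, and for brevity the final step copies into a fresh vector rather than applying the permutation in place.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

std::vector<std::string> demo_sort_by_length()
{
    std::vector<std::string> v = { "kiwi", "fig", "banana" };

    // Each key is computed exactly once and parked in a parallel array.
    std::vector<std::size_t> keys(v.size());
    std::transform(v.begin(), v.end(), keys.begin(),
                   [](const std::string& s) { return s.size(); });

    // Sort indices by key; the strings themselves don't move yet.
    std::vector<std::size_t> indices(v.size());
    std::iota(indices.begin(), indices.end(), std::size_t{ 0 });
    std::sort(indices.begin(), indices.end(),
              [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

    // Forward-permutation semantics: result[i] = v[indices[i]].
    std::vector<std::string> result(v.size());
    for (std::size_t i = 0; i < v.size(); i++) {
        result[i] = v[indices[i]];
    }
    return result;
}
```

Each string's length is computed once, no matter how many comparisons the sort performs.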
Okay, so that's what I was trying to get at: Sorting a vector by a key, with caching. If there's already a standard function to do this, please let me know.³
¹ The algorithm does assume that the key can consistently be generated from the item, and in particular that it depends only on the item and not on the item with which it is being compared.
²
If we wanted to show off sort_by
,
the call to std::sort
could have been replaced with
sort_by(pairs.begin(), pairs.end(), [](const Pair& p) { return p.second; }, comp);
³ I would like to point out that I arrived at this particular algorithm only after going down a dead end of having only a parallel key array. The idea was that I would sort the items and keys together by using a proxy iterator that represented both the original item and its key. The thing I had trouble working out was how to structure the proxy iterator so that it knew when its contents had been moved out, so it could move the real objects. I probably could have gotten it to work eventually, but then I realized I could avoid the entire hassle by sorting indices instead.
Today we're going to make use of the apply_permutation
function
that we beat to death for the first part of this week.
Suppose you are sorting a collection of objects with the property that copying and moving them is expensive. (Okay, so in practice, moving is not expensive, so let's say that the object is not movable at all.) You want to minimize the number of copies.
The typical solution for this is to perform an indirect sort: Instead of moving expensive things around, use an inexpensive thing (like an integer) to represent the expensive item, and sort the inexpensive things. Once you know where everything ends up, you can move the expensive things just once.
template<typename Iter, typename Compare>
void sort_minimize_copies(Iter first, Iter last, Compare cmp)
{
    using Diff = typename std::iterator_traits<Iter>::difference_type;
    Diff length = last - first;
    std::vector<Diff> indices(length);
    std::iota(indices.begin(), indices.end(), static_cast<Diff>(0));
    std::sort(indices.begin(), indices.end(), [&](Diff a, Diff b) {
        return cmp(first[a], first[b]);
    });
    apply_permutation(first, last, indices.begin());
}

template<typename Iter>
void sort_minimize_copies(Iter first, Iter last)
{
    return sort_minimize_copies(first, last, std::less<>());
}
We use std::iterator_traits
to tell us
what to use to represent indices,
then we create a vector of those indices.
(The difference type is required to be an integral type,
so we didn't have to play any funny games like
first - first
to get the null index.
We could just write 0.)
We then sort the indices
by using the indices to reference
the original collection.
(We also provide an overload that sorts by
<
.)
This performs an indirect sort,
where we are sorting the original collection,
but doing so by manipulating indices rather
than the actual objects.
Once we have the indices we need,
we can use the
apply_permutation
function
to rearrange the original items according
to the indices.
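Putting the pieces together, here's a self-contained sketch of the indirect sort driving the moving version of apply_permutation (helper name and sample data invented for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <iterator>
#include <numeric>
#include <utility>
#include <vector>

template<typename Iter1, typename Iter2>
void apply_permutation(Iter1 first, Iter1 last, Iter2 indices)
{
    using T = typename std::iterator_traits<Iter1>::value_type;
    using Diff = typename std::iterator_traits<Iter2>::value_type;
    Diff length = last - first;
    for (Diff i = 0; i < length; i++) {
        Diff current = i;
        if (i != indices[current]) {
            T t{std::move(first[i])};
            while (i != indices[current]) {
                Diff next = indices[current];
                first[current] = std::move(first[next]);
                indices[current] = current;
                current = next;
            }
            first[current] = std::move(t);
            indices[current] = current;
        }
    }
}

template<typename Iter, typename Compare>
void sort_minimize_copies(Iter first, Iter last, Compare cmp)
{
    using Diff = typename std::iterator_traits<Iter>::difference_type;
    Diff length = last - first;
    std::vector<Diff> indices(length);
    std::iota(indices.begin(), indices.end(), static_cast<Diff>(0));
    std::sort(indices.begin(), indices.end(), [&](Diff a, Diff b) {
        return cmp(first[a], first[b]);
    });
    apply_permutation(first, last, indices.begin());
}

std::vector<int> demo_sort()
{
    std::vector<int> v = { 31, 41, 5, 9, 26, 2 };
    sort_minimize_copies(v.begin(), v.end(), std::less<>());
    return v;
}
```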
We'll wrap up next time with another kind of sorting.
We've spent the past several days studying the apply_permutation
function
and arguing pros and cons of various implementation
choices.
Today we're going to look at generalization.
One of the things you are taught in mathematics is that after you've proved something, you should try to strengthen the conclusion and weaken the hypotheses. Can we apply that principle here?
I don't see much that can be done to strengthen the conclusion, but I see a way to weaken the hypotheses: The inputs don't actually have to be vectors. Anything that supports random access will do. So let's use a random access iterator.
And the indices don't have to be integers. Anything that can be used to index the random access iterator will do. So let's not require it to be an integer; we'll take whatever it is.
template<typename Iter1, typename Iter2>
void apply_permutation(
    Iter1 first, Iter1 last, Iter2 indices)
{
    using T = typename std::iterator_traits<Iter1>::value_type;
    using Diff = typename std::iterator_traits<Iter2>::value_type;
    Diff length = last - first;
    for (Diff i = 0; i < length; i++) {
        Diff current = i;
        if (i != indices[current]) {
            T t{std::move(first[i])};
            while (i != indices[current]) {
                Diff next = indices[current];
                first[current] = std::move(first[next]);
                indices[current] = current;
                current = next;
            }
            first[current] = std::move(t);
            indices[current] = current;
        }
    }
}
Note that we used std::iterator_traits
to determine the appropriate types for the indices and the
underlying type.
This is significant when the iterator returns a proxy type
(such as the infamous vector<bool>
).
Another observation is that the indices
don't have to be in the range [0, N − 1],
as long as we can map the values into that range.
But we don't need to generalize that, because that
can already be generalized in another way:
By creating a custom iterator whose *
operator
returns a proxy object that does the conversion.
Okay, I think I've run out of things to write about this
apply_permutation
function.
But we'll use it later.
Exercise:
Write an apply_inverse_permutation
which
applies the inverse of the specified permutation:
Instead of each element of the indices
telling you where the item comes from,
it specifies where the item goes to.
In other words, if v
is a copy of the
original vector and v2
is a copy of the
result vector,
then v2[indices[i]] = v[i]
.
Today we're going to revisit the apply_permutation
function
by wondering which version is better:
the moving version or the swapping version.
I'm not certain I have the answer,
but here's my analysis.
The first observation is that the standard swap function performs three move operations. It basically goes like this:
template<class T>
void std::swap(T& a, T& b)
{
    T t{std::move(a)};
    a = std::move(b);
    b = std::move(t);
}
So if you're counting move operations, you need to count a swap as three moves.
But wait, you say. It is legal for types to provide a custom swap operation. However, even those custom swap operations are still going to perform three move operations.¹ The customization is just to reduce the memory requirements. While the standard swap will move the entire instance into a temporary, a custom swap will move individual members.
struct sample { std::string x, y, z; };
In the above example,
assuming the obvious definition of the move assignment operator,
the standard swap would
move all three strings from the first sample
into a temporary sample
,
then move the three strings from the second sample
into the first,
and then move the three strings from the temporary sample
into the second.
But a custom swap would look like this:
void swap(sample& a, sample& b)
{
    swap(a.x, b.x);
    swap(a.y, b.y);
    swap(a.z, b.z);
}
This version performs three swaps consecutively. The total number of move operations is the same; they just happen in a different order.
Standard swap:

    t.x = std::move(a.x);
    t.y = std::move(a.y);
    t.z = std::move(a.z);
    a.x = std::move(b.x);
    a.y = std::move(b.y);
    a.z = std::move(b.z);
    b.x = std::move(t.x);
    b.y = std::move(t.y);
    b.z = std::move(t.z);

Custom swap:

    t = std::move(a.x);
    a.x = std::move(b.x);
    b.x = std::move(t);
    t = std::move(a.y);
    a.y = std::move(b.y);
    b.y = std::move(t);
    t = std::move(a.z);
    a.z = std::move(b.z);
    b.z = std::move(t);
The member-by-member swap of the custom swap function will probably exhibit better locality than the full-class swap used by the standard swap. The member-by-member swap also requires fewer temporary resources than the full-class swap (here: one string compared to three).
Okay, so either way, a swap costs three moves.
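You can verify the three-moves claim with a small instrumented type (the type and helper names are invented for this demo):

```cpp
#include <cassert>
#include <utility>

// An illustrative type that counts its move operations, so we can observe
// what std::swap costs.
struct MoveCounter {
    static inline int moves = 0;
    MoveCounter() = default;
    MoveCounter(MoveCounter&&) noexcept { ++moves; }
    MoveCounter& operator=(MoveCounter&&) noexcept { ++moves; return *this; }
};

int moves_per_swap()
{
    MoveCounter a, b;
    MoveCounter::moves = 0;
    std::swap(a, b); // one move construction plus two move assignments
    return MoveCounter::moves;
}
```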
Therefore, if we are just counting moves,
the swapping version of
apply_permutation
performs
almost three times as many move operations
as the explicit-temporary version.
The counterargument to "too many move operations" is that move operations are relatively inexpensive. A typical move operation transfers ownership of resources from one instance to another. No new allocation needs to be done; the existing allocation just needs to be transferred across. So counting your move operations is like counting pennies: Even if you manage to save a hundred of them, that's still only one dollar.
But I think the winning argument for moving rather than swapping is the copyablebutnotmovable object. If the object doesn't have a move constructor/move assignment operator, but it does have a copy constructor/copy assignment operator, then the algorithm will still work, but it will fall back to using the copy operation in the absence of a move operation.
And copy operations are not cheap.
So now instead of saving pennies, we are saving dollars, and those dollars quickly add up. So this argues in favor of the moving version rather than the swapping version.²
Like I said at the start, this is my analysis. It could be wrong. Let me know.
Next time, we'll look at how this function could be generalized.
¹ Yes, there may be super-optimized custom swaps that actually perform less work than the standard swap, but I think those types of custom swaps are relatively uncommon.
²
On the third hand (fourth, fifth? how many am I up to?)
if the object is copyable but not movable,
but it also has a custom swap function,
then that swap function is going to be much less expensive
than copying.
(Because the custom swap function is going to exchange
contents rather than making three expensive copies.)
You'll encounter objects of this ilk if they predate
C++11, since it is C++11 that introduced the concept
of movability.
So now, if you have an object that is copyable,
efficiently swappable, and not movable, you are better
off using the swapping version again.
Another case where the swapping version may be better
is if the vector uses a proxy iterator, such as
vector<bool>
.
So now I'm not sure any more. Maybe the way to go is to do compiletime detection of whether the object has a custom swap function. If so, then use the swapping version. If not, then use the moving version.