In the previous post, I’ve come up with this interview question:
In a given .NET string, assume there are line breaks in standard \r\n form (basically Environment.NewLine).
Write a method that inserts a space between two consecutive line breaks to separate any two line breaks from each other.
Update: Rik Hemsley has posted an absolutely fantastic summary in form of a Visual Studio solution with all the answers from the comments, unit-tests and benchmarks:
Jordan Terrell’s RegExp-based solution seems to be the fastest one.
Thanks Rik! This is very helpful!
Source of the problem
First I’ll explain how I came up with the problem. I was writing some code that generated HTML, specifically, some text inside the <PRE> tag, interspersed with <BR /> tags. I’m not an expert in HTML, that’s why it came as a surprise to me that if you have multiple consecutive <BR /> elements, the browser will only render one line break (well, at least the browser I checked it on – IE8). My code generated the <BR /> tags from Environment.NewLine substrings in the source text. So, to preserve empty lines in the resulting HTML, I had to intersperse the line breaks with at least one space – then IE would keep the empty lines when rendering <PRE> contents. I didn’t keep direct line breaks because some HTML viewers that I care about don’t render them correctly.
My original solution
Anyway, I’ve gotten a lot of great responses that contain a lot of food for thought. Mostly I was amazed how the responses actually reflected my own thinking as I was coming up with a solution. My ultimate solution (which was also correctly suggested by ram) was this:
public static string IntersperseLineBreaks(string source)
string result = source.Replace("\n\r", "\n \r");
A temporary local variable here is for your debugging convenience until my team comes up with a way to inspect the return value of a method in the debugger.
Ambiguous problem statement
Of course, banko correctly noted that the problem is stated ambiguosly, since Environment.NewLine is “\r\n” on Windows, and “\n” on Unix. Besides, as Jordan Terrell mentions, the behavior is unclear on source texts that contain wild mixes like “\r\n\r” and “\r\n\n”. Jordan also suggests that I include unit-tests to specify correct solutions.
Well, I apologize for a problem statement that is not specified well enough (I simply didn’t have time to be that thorough!). I didn’t do it on purpose, but now when I think about it, most real-world requirements are ambiguous and not well specified. It is always interesting to see how a candidate deals with ambiguity and unclear requirements. So be prepared to expect unclear problem statements in your interviews and deal with them gracefully.
There are several good strategies to impress the interviewer in this case, the simplest being to ask clarifying questions in the places where you think the requirements are unclear. Let them know that you are trying to explicitly understand and model the requirements in your head and are filling in the gaps.
Another approach is to explicitly identify and parameterize the variability in the requirements and give a generic solution that works in all cases. Finally, if the problem statement is unclear, you can totally assert, assume or demand requirements that make your life easier. For instance, I remember I said “let’s assume that C# has tuples” during my interview at Microsoft. The interviewer laughed and said OK. I produced a nice solution that used tuples, and then he asked how would I change it to not use tuples. But that’s a totally different question 😉
If I were an interviewer (and I’ve never been one yet) and I would happen to ask this question, I guess I would look at several things in answers:
- Code reuse (e.g. my solution above relies on the Replace implementation in the framework vs. reimplementing a custom version of it for this particular problem, which might be error prone). However the point of interview questions is usually to see how people implement algorithms, so I guess for a sorting question an answer like List<T>.Sort() would rate very high on code reuse, but pretty much useless otherwise 🙂
- Correctness – some people gave similar solutions to mine, which call replace twice, but are more resilient to “weird input streams” with non-normalized line breaks. Such solutions will be twice as slow (since they will do full string replacement twice), but if a candidate understands the trade-offs between performance and correctness, I’m totally fine with it.
- Performance – writing your own StringBuilder-based implementation of Replace specifically suited for this particular problem is totally OK. In fact, some people gave very efficient solutions using this approach (e.g. Jared Parsons (a developer who I work with, who sits in the office down the hall) had built a custom ‘replacer’ state machine in F#) and Thomas Levesque correctly used a StringBuilder to avoid unnecessary string allocations. Some people fell into the trap of using string concatenation instead of a StringBuilder, which is by far less efficient, since it allocates ~ O(n^2) memory vs. O(n log n) memory allocated by StringBuilder. I would want a .NET candidate to understand how strings work in .NET, what are allocations, how GC works, and why string concatenation is bad when used in a loop.
- Style – having a concise declarative version is great when performance is not critical. Having readable code that is expressive and easy to maintain can sometimes trump performance. Moreover, usually people who are able to express their thoughts in code concisely will likely also master the skill of performance tuning when necessary.
So I guess it’s impossible to declare a single winner because different solutions win when different assessment criteria are used. In terms of readability, getting things done and reusing as much as possible, I like ram’s solution (and mine :P), even if it won’t work in certain scenarios (I’m OK with making assumptions about the source text). In terms of algorithm implementation, other solutions are certainly much better, those that use a StringBuilder or build the resulting array char-by-char. Finally, in terms of expressiveness I really liked some declarative solutions (using LINQ), even if they’re less effective (e.g. use string concatenation).
string.Replace(“\r\n\r\n”, “\r\n \r\n”)
One solution deserves some special explanation since it nicely demonstrates the core of the problem. The Replace method continues matching substrings after the end of the replacement, so if your replacement patterns overlap, it won’t replace the second one:
To overcome this problem, some people suggested the following:
str = str.Replace("\r\n\r\n", "\r\n \r\n");
Note that this loop will never run more than twice (this is where I goofed myself). GSEJ corrected me, pointing out that only two passes are necessary – first pass leaves a maximum of two consecutive line breaks, second pass eliminates them all.
However we notice that in the replacement pattern (“\r\n\r\n”, “\r\n \r\n”), the start and the end are the same (hence the first and last chars overlap with previous and next occurrences of the pattern). The pattern is redundant, so we can make it shorter and only replace the crux of the pattern: Replace(“\n\r”, “\n \r”):
This pattern is non-redundant and doesn’t overlap with the neighbors, hence no second pass required.
Well, anyway, thanks everyone for your answers! It was truly interesting for me and hopefully educational and entertaining for some of you.