When working with Get-Content and Out-File on the same file in a pipeline, the following is worth keeping in mind:
    # Inspect the contents of a file that is not empty
    PS> Get-Content .\Test.txt
    blah blah

    # Read the contents of the file and write back to it
    PS> Get-Content .\Test.txt | Out-File Test.Txt

    # Inspect the contents again and note that it's empty
    PS> Get-Content .\Test.txt
    PS>
Notice how the file had contents initially but the action of reading the contents and writing back to the file erased everything?
My understanding (from page 187 of Bruce Payette’s “PowerShell in Action” book) is that this happens because the Out-File cmdlet runs before the Get-Content cmdlet, so the file is emptied before its contents can be read.
That doesn’t make much sense at first glance, so here’s an attempt to make some sense of it.
I think the first piece of the puzzle is that a PowerShell pipeline runs as a single process (from section 2.5 of the book). That is to say, while it seems like there will be two processes in the above pipeline, in reality there is only one. Each cmdlet in the pipeline is split into three clauses: BeginProcessing, ProcessRecord, and EndProcessing. First the BeginProcessing clause is run for all cmdlets in the pipeline. Next the ProcessRecord clause is run for the first cmdlet; if it produces an output, that output is passed to the ProcessRecord clause of the second cmdlet, and if the second cmdlet produces an output, it is passed on to the third cmdlet, and so on. This happens for each output produced (by each cmdlet in the pipeline), and once all this completes the EndProcessing clauses are run for all cmdlets.
I couldn’t find much info on what the BeginProcessing, ProcessRecord, and EndProcessing clauses are (not that I tried very hard either), but I did find documentation on the Cmdlet class and learnt that this class defines three input processing methods, namely BeginProcessing, ProcessRecord, and EndProcessing. Any PowerShell cmdlet that processes records overrides at least one of these methods.
So now I understand better what happens during pipeline processing. A single process is created and that process (1) invokes the BeginProcessing methods of all the cmdlets in the pipeline; (2) then invokes the ProcessRecord method of each cmdlet in the order they appear in the pipeline, passing the output from a cmdlet that’s earlier in the pipeline to the one following it; and lastly (3) invokes the EndProcessing methods of all cmdlets in the pipeline.
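The ordering described above can be observed with script functions, whose begin, process, and end blocks play the same role as the three methods of a compiled cmdlet. The following is a minimal sketch; the function name Trace-Stage and the $script:log list are made up for illustration and are not part of PowerShell.

```powershell
# Hypothetical helper: a pipeline stage that records each phase it enters.
# $script:log is an illustrative list used only to capture the order of events.
$script:log = [System.Collections.Generic.List[string]]::new()

function Trace-Stage {
    param(
        [Parameter(Mandatory)] [string] $Name,
        [Parameter(ValueFromPipeline)] $InputObject
    )
    begin   { $script:log.Add("$Name begin") }
    process {
        $script:log.Add("$Name process $InputObject")
        $InputObject                      # pass the object on to the next stage
    }
    end     { $script:log.Add("$Name end") }
}

# Two objects flowing through two stages.
1, 2 | Trace-Stage -Name First | Trace-Stage -Name Second | Out-Null

$script:log
```

The log should show both begin entries first, then each object travelling through First and Second in turn, and finally both end entries. No stage processes any input until every stage has run its begin block.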
Next I tried to find out more about these input processing methods in the context of the Get-Content and Out-File cmdlets. I learnt that while Out-File overrides all three methods, Get-Content overrides only two: the BeginProcessing method is not overridden by Get-Content.
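One way to check this kind of claim on your own machine is reflection. The sketch below is an assumption-laden helper (the function name Get-OverriddenProcessingMethod is made up), and it only reports where each method is declared for the PowerShell version it runs on, so the results may differ from what is described above.

```powershell
# Hypothetical helper: reports which class declares each input processing method.
# A method counts as overridden when it is declared somewhere other than the
# base System.Management.Automation.Cmdlet class.
function Get-OverriddenProcessingMethod {
    param([Parameter(Mandatory)] [string] $CmdletName)

    $type  = (Get-Command -Name $CmdletName -CommandType Cmdlet).ImplementingType
    # The three methods are protected, hence NonPublic.
    $flags = [System.Reflection.BindingFlags]'Instance, NonPublic'

    foreach ($methodName in 'BeginProcessing', 'ProcessRecord', 'EndProcessing') {
        $info = $type.GetMethod($methodName, $flags)
        [pscustomobject]@{
            Cmdlet     = $CmdletName
            Method     = $methodName
            Overridden = $info.DeclaringType -ne [System.Management.Automation.Cmdlet]
            DeclaredBy = $info.DeclaringType.Name
        }
    }
}

'Get-Content', 'Out-File' | ForEach-Object { Get-OverriddenProcessingMethod -CmdletName $_ }
```

An override may live in a base class of the cmdlet rather than in the cmdlet class itself, which is why the sketch reports DeclaredBy as well.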
Now it makes sense why the above pipeline behaves the way it does. Since Get-Content does not define a BeginProcessing method but Out-File does, Out-File does its work first: its BeginProcessing method runs before any ProcessRecord method and truncates the file. Only then does Get-Content read the (now empty) file and pass its contents on.
Moral of the story: always read the contents of the file into a variable first and then pipe the variable to Out-File. Or read the contents of the file but pipe the output to a different file.
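Both workarounds can be sketched as follows. The temp-file path is made up for the demo, and the parenthesised form is an addition that the text above does not mention: wrapping Get-Content in parentheses makes PowerShell finish reading the file before the pipeline starts, so it acts like the variable approach.

```powershell
# Illustrative scratch file; the name is an assumption made for this demo.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'pipeline-demo.txt'
'blah blah' | Out-File -FilePath $path

# Option 1: read everything into a variable, then write it back.
$lines = Get-Content -Path $path
$lines | Out-File -FilePath $path

# Option 2 (addition): parentheses force the read to complete first.
(Get-Content -Path $path) | Out-File -FilePath $path

# Option 3: stream straight through, but into a different file.
Get-Content -Path $path | Out-File -FilePath "$path.copy"

Get-Content -Path $path
```

After all three options have run, the original file should still contain its single line, because Out-File never opens the file while Get-Content still has to read from it.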
As a variant of the above example, check this out:
    # Inspect the contents of a file that is not empty
    PS> Get-Content .\Test.txt
    blah blah

    # Read the contents of the file and write back to it. However, choose to append instead of overwrite.
    PS> Get-Content .\Test.txt | Out-File -Append Test.Txt
    # This goes into a loop so have to press Ctrl+C to break out.

    # Inspect the contents again
    PS> Get-Content .\Test.txt
    blah blah
    blah blah
    blah blah
    blah blah
    ...
This time the pipeline goes into a loop. Why?
I would have expected that (1) Out-File opens the file as usual but doesn’t delete the contents, (2) then Get-Content opens the file and outputs the contents to Out-File, which in turn (3) appends them to the file, thus terminating the pipeline and leaving the file with the initial line repeated once. But it goes into a loop instead, and the initial line is repeated until I terminate the loop. I was not sure what was happening here.
Update: I asked this question on Stack Overflow and got an answer. Steps 1-3 are as I expected, with the difference that after step 3 the pipeline does not terminate; instead (4) Get-Content sees the appended line as additional data in the file and outputs that too, leading to steps 3 & 4 repeating over and over again. The important thing to keep in mind is that the operation isn’t sequential. I was thinking Get-Content outputs the first line and closes the file, but it does not. I suppose the file would have been closed if there was a delay, but what happens is that Out-File writes to it while Get-Content is still reading, and so Get-Content treats the appended line as part of the file and outputs that too. The two cmdlets run side by side.
It’s important to keep this in mind when dealing with pipelines.
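For the append case, here is a minimal sketch of one way to avoid the loop, assuming the goal is simply to duplicate the existing lines once. The scratch-file path is made up, and the parentheses are my addition: they make Get-Content read the whole file and close it before Out-File starts appending.

```powershell
# Illustrative scratch file; the name is an assumption made for this demo.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'append-demo.txt'
'blah blah' | Out-File -FilePath $path

# The read completes (and the file is closed) before anything is appended,
# so Get-Content never sees the lines that Out-File adds.
(Get-Content -Path $path) | Out-File -FilePath $path -Append

Get-Content -Path $path
```

With the read and the write separated like this, the file should end up with the original line exactly twice and the pipeline terminates on its own.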