List and test Hypertext links in a Word document


[Updated 03/09/2017] As a PFE, I write a lot of documentation for customers, and sometimes training materials as well. This material often lives for several years, so I have to update it regularly. Good documentation inevitably points to related links, and nothing is more tedious than checking them one by one to see whether they are still valid (especially when the document or training contains several hundred hyperlinks). To that end, I wrote the following script. It lists all the hyperlinks found in one or more Word documents (document name, page, URI, link text) and can optionally report the HTTP status of each link along with the title of the target page. The script is available here.

 

The result will be a CSV file similar to this one:

[Image: Get-WordHyperLinks]

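The columns have the following meaning:

  • TextToDisplay: The link text as displayed in the document
  • Link: The URI the link points to
  • StatusCode: The status code of the HTTP response (only filled in when the -Status switch is used)
  • Page: The page of the document on which the link appears
  • Document: The full file path of the Word document
  • Title: The title of the web page, taken from the HTML <title> element (only filled in when the -Status switch is used)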
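Once the CSV file exists, these columns make it easy to isolate the links that need attention. The following is only a minimal sketch: it assumes that a link is worth reviewing when its StatusCode is empty or different from 200, and both the helper name Select-BrokenLink and the sample rows are made up for illustration (they are not part of the script).

```powershell
# Hypothetical helper (not part of the original script):
# keeps the rows whose StatusCode is missing or different from 200.
function Select-BrokenLink {
    param(
        [Parameter(ValueFromPipeline = $true)]
        [psobject]$Row
    )
    process {
        # Import-Csv returns every value as a string, hence the casts
        if ([string]::IsNullOrEmpty([string]$Row.StatusCode) -or [int]$Row.StatusCode -ne 200) {
            $Row
        }
    }
}

# Made-up rows standing in for the output of Import-Csv
$Rows = @(
    [pscustomobject]@{ Document = 'C:\Docs\Sample.docx'; Page = '3'; Link = 'https://example.com/ok';      TextToDisplay = 'Working page'; StatusCode = '200'; Title = 'OK' }
    [pscustomobject]@{ Document = 'C:\Docs\Sample.docx'; Page = '7'; Link = 'https://example.com/missing'; TextToDisplay = 'Missing page'; StatusCode = '404'; Title = '' }
    [pscustomobject]@{ Document = 'C:\Docs\Sample.docx'; Page = '9'; Link = 'https://example.invalid/';    TextToDisplay = 'Unreachable';  StatusCode = '';    Title = '' }
)

$Broken = @($Rows | Select-BrokenLink)
$Broken | Sort-Object -Property Document, { [int]$_.Page } | Format-Table -Property Document, Page, Link, StatusCode
```

With a real result file, the sample rows would be replaced by something like Import-Csv -Path $CSVFile | Select-BrokenLink.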

 

#requires -version 4

#region function definitions
Function Remove-Ref
{
    <#
        .SYNOPSIS
        Releases a COM Object

        .DESCRIPTION
        Releases a COM Object

        .PARAMETER ref
        The COM Object to release

        .EXAMPLE
        $Word=new-object -ComObject "Word.Application"
        ...
        Remove-Ref ($Word)
    #>
    [CmdletBinding()]
    param
    (
        [Object] $ref
    )

    $null = Remove-Variable -Name $ref -ErrorAction SilentlyContinue
    #Releasing the COM object until its reference count drops to zero
    while ([System.Runtime.InteropServices.Marshal]::ReleaseComObject([System.__ComObject]$ref) -gt 0) { }
    [System.GC]::Collect()
    [System.GC]::WaitForPendingFinalizers()
}

Function Get-WordHyperLinks
{
    <#
        .SYNOPSIS
        Returns a collection of objects related to the Hypertext links (Properties : "Link", "Document", "Page", "Title" (optional), "StatusCode" (optional)) inside a Word Document

        .DESCRIPTION
        Returns a collection of objects related to the Hypertext links inside a Word Document

        .PARAMETER FullName
        The Word document to analyze specified by its full name

        .PARAMETER Visible
        A switch to specify if the Word application will be visible during the processing

        .PARAMETER Status
        A switch to specify if we should query the URL and get the HTTP Status (the processing will take more time)

        .EXAMPLE
        $WordHyperLinks = (Get-ChildItem "*.docx" | Get-WordHyperLinks -Verbose -Visible -Status)
        Will return a collection of hypertext links contained inside all Word documents.
        The output will be verbose
        The Word application will be visible
        The HTTP status will be returned (the processing will take more time)

        .EXAMPLE
        $WordHyperLinks = Get-WordHyperLinks -FullName "Sales.docx","HR.docx"
        Will return a collection of hypertext links contained inside the two given Word documents.
        The Word application will be invisible
    #>
    [CmdletBinding()]
    Param(
        #The Word document to process
        [Parameter(Mandatory = $True,HelpMessage = 'Please specify the path of a valid Word document', ValueFromPipeline = $True, ValueFromPipelineByPropertyName = $True)]
        [ValidateScript({
            (Test-Path -Path $_ -PathType Leaf) -and ($_ -match '\.docx?$')
        })]
        [string[]]$FullName,

        #To display the Word application
        [parameter(Mandatory = $false)]
        [switch]$Visible,

        #To get the HTTP status of the link (and the title of the page)
        [parameter(Mandatory = $false)]
        [switch]$Status
    )
    begin
    {
        #For Microsoft Word
        $null = Add-Type -AssemblyName Microsoft.Office.Interop.Word
        $wdActiveEndPageNumber = [Enum]::Parse([Microsoft.Office.Interop.Word.WdInformation], 'wdActiveEndPageNumber')
        $wdDoNotSaveChanges = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveOptions], 'wdDoNotSaveChanges')
        $ConfirmConversion = $false
        $ReadOnly = $True

        #Opening the word application
        Write-Verbose -Message 'Running the Word application ...'
        $Word = New-Object -ComObject 'Word.Application'
        #To make the Word application visible (or not)
        $Word.Visible = $Visible
        $Word.Application.DisplayAlerts = $false

        #To store the hypertext links
        $WordHyperLinks = @()
    }
    process
    {
        #For all files passed as argument outside a pipeline context
        Foreach ($CurrentFullName in $FullName)
        {
            $CurrentFullName = Get-Item -Path $CurrentFullName
            #Getting the fullname of the processed Word document
            $CurrentWordDocumentFullName = $CurrentFullName.FullName
            #Getting the name of the processed Word document
            $CurrentTrainerWordDocumentName = $CurrentFullName.Name
            Write-Host -Object ('Processing {0} ...' -f $CurrentWordDocumentFullName)

            Write-Verbose -Message 'Opening the Word document ...'
            $OpenDoc = $Word.Documents.OpenNoRepairDialog($CurrentWordDocumentFullName, $ConfirmConversion, $ReadOnly)

            #Getting the hypertext links (wrapped in @() so that Count is reliable even with a single link)
            $HyperLinks = @($OpenDoc.Hyperlinks | Where-Object -FilterScript { $_.Address })
            $Index = 0
            #Going through the hyperlink collection
            ForEach ($CurrentHyperLink in $HyperLinks)
            {
                $Index++
                Write-Progress -Activity "[$($Index)/$($HyperLinks.Count)] Processing $($CurrentHyperLink.Address)" -Status "Percent : $('{0:N0}' -f $($Index/($HyperLinks.Count) * 100)) %" -PercentComplete ($Index /$HyperLinks.Count * 100)
                #To Select/highlight the link
                $CurrentHyperLink.Range.Select()
                $Selection = $Word.Selection
                #To get the page number
                $PageNumber = $Selection.Information($wdActiveEndPageNumber)
                Write-Verbose -Message "$($CurrentHyperLink.Address) - $($CurrentHyperLink.TextToDisplay) - $PageNumber"

                #We process only http or https links
                if ($CurrentHyperLink.Address -match '^https?://')
                {
                    if ($Status)
                    {
                        #Resetting the values so that a failed request does not reuse those of the previous link
                        $Response = $null
                        $StatusCode = $null
                        $Title = $null
                        Try
                        {
                            # The HEAD method can return some 404 HTTP status instead of HTTP 200
                            $Response = Invoke-WebRequest -Method Get -Uri $CurrentHyperLink.Address -UseBasicParsing -ErrorAction Stop
                            $StatusCode = [int]$Response.StatusCode
                            # Bug : https://connect.microsoft.com/PowerShell/feedbackdetail/view/1557783/invoke-webrequest-hangs-in-some-cases-unless-usebasicparsing-is-used)
                            # Workaround http://www.networksteve.com/forum/topic.php/Invoke-WebRequest_hangs_in_some_cases,_unless_-UseBasicParsing_i/?TopicId=77984&Posts=2
                            # $Title = $Response.ParsedHTML.title
                            #The title of the HTML document (only set when a <title> element is actually found)
                            if (($Response.RawContent -replace "`n", '') -match '<title>(?<title>.*)</title>')
                            {
                                $Title = $matches['title']
                            }
                        }
                        #If an exception occurs (Page Not found for instance : HTTP/404)
                        catch
                        {
                            #Getting the status code
                            if ($_.Exception.Response.StatusCode.Value__)
                            {
                                $StatusCode = $_.Exception.Response.StatusCode.Value__
                            }
                        }

                        # Storing hyperlink information inside an array
                        $WordHyperLinks += (New-Object -TypeName PSObject -Property @{
                            #The document full path
                            Document = $CurrentWordDocumentFullName
                            #The page number
                            Page = $PageNumber
                            #The link URI
                            Link = $CurrentHyperLink.Address
                            #The text link in the document
                            TextToDisplay = $CurrentHyperLink.TextToDisplay
                            #The HTTP Status Code
                            StatusCode = $StatusCode
                            #The title of the HTML document
                            Title = $Title
                        })
                        Write-Verbose -Message $("[$CurrentTrainerWordDocumentName][Page {0:D3}][Added] $($CurrentHyperLink.Address) ($StatusCode - $Title)" -f $PageNumber)
                    }
                    else
                    {
                        # Storing hyperlink information inside an array
                        $WordHyperLinks += (New-Object -TypeName PSObject -Property @{
                            #The document full path
                            Document = $CurrentWordDocumentFullName
                            #The page number
                            Page = $PageNumber
                            #The link URI
                            Link = $CurrentHyperLink.Address
                            #The text link in the document
                            TextToDisplay = $CurrentHyperLink.TextToDisplay
                        })
                        Write-Verbose -Message $("[$CurrentTrainerWordDocumentName][Page {0:D3}][Added] $($CurrentHyperLink.Address)" -f $PageNumber)
                    }
                }
            }
            Write-Verbose -Message 'Closing the Word document ...'
            $OpenDoc.Close($wdDoNotSaveChanges)
            Remove-Ref -ref ($OpenDoc)
        }
    }
    end
    {
        Write-Progress -Activity 'Completed !' -Status 'Completed !' -Completed
        Write-Verbose -Message 'Exiting the Word application ...'
        $null = $Word.Quit()
        Remove-Ref -ref ($Word)
        return $WordHyperLinks
    }
}
#endregion

Clear-Host
$CurrentScript = $MyInvocation.MyCommand.Path
# Getting the directory of this script
$CurrentDir = Split-Path -Path $CurrentScript -Parent
# Creating CSV file name based on the script full file path and by appending the timestamp at the end of the file name
$CSVFile = $CurrentScript.replace((Get-Item -Path $CurrentScript).Extension, '_'+$(Get-Date -Format 'yyyyMMddTHHmmss')+'.csv')

$WordHyperLinks = Get-ChildItem -Path $CurrentDir -Filter '*.docx' | Get-WordHyperLinks -Verbose -Visible -Status
#$WordHyperLinks = Get-WordHyperLinks -FullName "C:\datasheet_fr-FR.docx", "C:\datasheet_en-US.docx"
$WordHyperLinks | Export-Csv -Path $CSVFile -Force -NoTypeInformation
Write-Host -Object "Results are available in '$CSVFile'"
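The title extraction step can be tried on its own, without Word or a network connection. The sketch below is a variation on what the script does, not the script's own code: it assumes a hypothetical helper named Get-HtmlTitle, uses a case-insensitive, non-greedy pattern, and collapses the whitespace left by line breaks inside the element. The sample HTML is made up.

```powershell
# Hypothetical helper (not part of the original script):
# returns the text of the first <title> element found in raw HTML, or $null.
function Get-HtmlTitle {
    param(
        [Parameter(Mandatory = $true)]
        [AllowEmptyString()]
        [string]$Html
    )
    # (?i) ignores case (<TITLE>), (?s) lets '.' span line breaks,
    # and '.*?' stops at the first closing tag instead of the last one
    $Match = [regex]::Match($Html, '(?is)<title\b[^>]*>(?<title>.*?)</title>')
    if ($Match.Success) {
        # Collapsing runs of whitespace (including line breaks) into single spaces
        return ($Match.Groups['title'].Value -replace '\s+', ' ').Trim()
    }
    return $null
}

# Made-up sample page with an upper-case, multi-line title
$SampleHtml = "<html><head><TITLE>`n  Example`n  Domain</TITLE></head><body></body></html>"
Get-HtmlTitle -Html $SampleHtml
```

Here the call returns the string Example Domain. A page without a title element yields $null, so the Title column stays empty instead of carrying over a previous match.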

 

The script file is also available at the TechNet Script Center repository, at: https://gallery.technet.microsoft.com/List-and-test-Hypertext-ffef3413 

Laurent.
