Differences

This shows you the differences between two versions of the page.

public:nnels:etext:regex [2018/07/11 15:31]
leah.brochu
public:nnels:etext:regex [2022/04/11 14:02] (current)
rachel.osolen
Line 23: Line 23:
 [[https://support.office.com/en-ca/article/Find-and-replace-text-and-other-data-in-your-Word-2010-files-c6728c16-469e-43cd-afe4-7708c6c779b7?ui=en-US&rs=en-CA&ad=CA#__toc282774574|Using wildcards in Microsoft Word]] (this is similar to regular expressions, but Word has a lot of its own syntax) [[https://support.office.com/en-ca/article/Find-and-replace-text-and-other-data-in-your-Word-2010-files-c6728c16-469e-43cd-afe4-7708c6c779b7?ui=en-US&rs=en-CA&ad=CA#__toc282774574|Using wildcards in Microsoft Word]] (this is similar to regular expressions, but Word has a lot of its own syntax)
    
-  * Word has a lot of options to find letters (^$) and numbers (^#) but these only work with the wildcard option //off// (which it is by default). Only turn the wildcard option on if you're using regex options. Read the info page carefully on when things apply with the wildcard option on/off.+  * Word has a lot of options to find letters (^$) and numbers (^#) when using the non-regex [[public:nnels:etext:find-and-replace|Find & Replace]], but these only work with the wildcard option //off// (which it is by default). Only turn the wildcard option on if you're using regex options. Read the info page carefully on when things apply with the wildcard option on/off.
  
   * A lot of the codes for special characters (e.g. page break) are under the "Special..." button.   * A lot of the codes for special characters (e.g. page break) are under the "Special..." button.
Line 77: Line 77:
  
 <WRAP center round box 80%> <WRAP center round box 80%>
-**PROBLEM:** OCR converted some "1" digits to "i/I" letters, resulting in dates like "i984" or numbers like "3i".+**PROBLEM**: Hyphens that break a single word (not a word split over two lines). 
 + 
 +**SOLUTION**: Replace with the same text minus the hyphen. 
 + 
 +Find: ''([a-z])-([a-z])'' 
 + 
 +Replace with: ''\1\2'' 
 + 
 +Using a-z restricts what it finds to lowercase. 
 + 
 +You will likely have to do it again for lines that end with a comma, and possibly en and em dash. Look through your document for patterns of anything else it might have missed. 
 +</WRAP> 
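
For anyone checking this pattern outside Word, the following is a minimal Python sketch of the same Find/Replace pair. The function name and the sample sentence are hypothetical and only serve as illustration; the pattern and replacement are the ones given above.

```python
import re

# Same pattern as the Find field above: a lowercase letter on each side of a hyphen.
HYPHEN_INSIDE_WORD = re.compile(r"([a-z])-([a-z])")


def remove_inner_hyphens(text: str) -> str:
    """Drop a hyphen that sits between two lowercase letters."""
    # \1 and \2 put the two captured letters back without the hyphen.
    return HYPHEN_INSIDE_WORD.sub(r"\1\2", text)


sample = "The photo-graph showed Jean-Paul in 1914-18."  # hypothetical sample text
print(remove_inner_hyphens(sample))
# prints: The photograph showed Jean-Paul in 1914-18.
```

Because the pattern only looks at lowercase letters, "Jean-Paul" and "1914-18" are left alone. Legitimately hyphenated lowercase words (for example "well-known") would also be joined, so it is safer to review matches than to replace everything blindly.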
 + 
 +---- 
 + 
 +<WRAP center round box 80%> 
 +**PROBLEM:** OCR converted some "1" digits to "i/I" letters, resulting in dates like "i984" or numbers like "3I".
  
 **SOLUTION:** Replace "i/I"s that come immediately before or after a number with "1"s. This will be done in two steps **SOLUTION:** Replace "i/I"s that come immediately before or after a number with "1"s. This will be done in two steps
Line 90: Line 106:
  
 ---- ----
 +
  
 <WRAP center round box 80%> <WRAP center round box 80%>
-**PROBLEM**: There are extra paragraph breaks. We want to keep the real paragraph breaks and remove the fake extra paragraph breaks.   
  
-**SOLUTION**: Use MS Word's find and replace to remove the extra paragraph breaks using special Word symbols.+**PROBLEM:** OCR did not recognize spaces around quotation marks.  
 +  * Example A: As one of Montgomery's British staff officers later put ''it,"I'' feel Monty was astonishing in his relationship with all the Dominion troops. 
 +  * Example B: The "nasty little ''troublemaker,"as'' Montgomery was widely known in the British army... 
 +This problem has an added complexity; the pattern has two different solutions: 
 +  * Example A will need to say: ... later put ''it, "I'' feel Monty... (or, comma-space-quotation mark) 
 +  * Example B will need to say: The "nasty little troublemaker''," as'' Montgomery... (or, comma-quotation mark-space)
  
-Find: ''^p^p'' (you can also search for more than two paragraph breaks, i.e. ''^p^p^p'')+**SOLUTIONS:** 
 +Example A:\\  
 + 
 +Find: ''([,])(["])([A-Za-z])''\\  
 +Replace: ''\1 \2\3'' 
 + 
 +Example B: 
 + 
 +Find: ''([,])(["])([A-Za-z])''\\  
 +Replace: ''\1\2 \3'' 
 + 
 +Notes:  
 +  * You will **not** be able to use "replace all" in this situation. You will need to keep hitting ''Find Next'' and replacing the pattern with the appropriate solution. 
 +  * You will also need to re-do this, searching for periods instead of commas.
  
-Replace with: ''^p'' 
 </WRAP> </WRAP>
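
As a rough illustration of the two replacements for the quotation-mark problem, here is a minimal Python sketch. The function names are hypothetical, and the letter class is written as ''[A-Za-z]'' on the assumption that only letters should match (a range such as ''[A-z]'' also covers a few punctuation characters that sit between "Z" and "a"). It assumes straight double quotes; curly quotes would need their own pattern.

```python
import re

# Comma, straight double quote, then a letter, captured as three groups.
PATTERN = re.compile(r'([,])(["])([A-Za-z])')


def space_before_quote(text: str) -> str:
    """Example A: a quotation opens after the comma (comma-space-quote)."""
    return PATTERN.sub(r"\1 \2\3", text)


def space_after_quote(text: str) -> str:
    """Example B: a quotation closes before the next word (comma-quote-space)."""
    return PATTERN.sub(r"\1\2 \3", text)


print(space_before_quote('later put it,"I feel Monty was astonishing'))
# prints: later put it, "I feel Monty was astonishing
print(space_after_quote('The "nasty little troublemaker,"as Montgomery'))
# prints: The "nasty little troublemaker," as Montgomery
```

Both functions use the same search pattern, which is why the choice between them has to be made match by match rather than with a single "replace all".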
  
 ---- ----
 +
  
 <WRAP center round box 80%> <WRAP center round box 80%>
-**PROBLEM**: There are newlines/line breaks (↵) instead of paragraph marks (¶).+**PROBLEM**: There are extra paragraph breaks. We want to keep the real paragraph breaks and remove the fake extra paragraph breaks. 
  
-**SOLUTION**: Find and remove all line breaks and replace with a single paragraph break.+**SOLUTION**: See: [[public:nnels:etext:find-and-replace|Find & Replace]] 
 +</WRAP>
  
-Find: ''^m''+----
  
-Replace with: ''^p''+<WRAP center round box 80%> 
 +**PROBLEM**: There are newlines/line breaks (↵) instead of paragraph marks (¶).
  
-In LibreOffice, replace all ''\n'' with ''\p'' to convert them to paragraphs.+**SOLUTION**: See: [[public:nnels:etext:find-and-replace|Find & Replace]]
 </WRAP> </WRAP>
  
Line 121: Line 157:
 ''231(paragraph break)MacG_9781770494220_5p_all_r1.indd 231(paragraph break)10/27/14 11:56 AM(paragraph break)'' ''231(paragraph break)MacG_9781770494220_5p_all_r1.indd 231(paragraph break)10/27/14 11:56 AM(paragraph break)''
  
-**SOLUTION**: Without using wildcards: +**SOLUTION**: See: [[public:nnels:etext:find-and-replace|Find & Replace]]
- +
-Find:  ''^#^#^#^pMacG_9781770494220_5p_all_r1.indd ^#^#^#^p10/27/14 11:56 AM^p'' +
- +
-Replace with: nothing. If you're doing a paginated title, replace with page breaks. +
- +
-You will need to remove one ^# at the beginning and one after the .indd to match 2-digit page numbers, and remove one more of each to match single-digit page numbers. The following screenshot shows an example with a 1-digit page number, followed by the Find string used to isolate all such instances.  +
- +
-<WRAP center round box 60%> +
- +
-{{:nnels:documentation:content:production:screen_shot_2015-08-06_at_6.10.55_pm.png?300|}} +
- +
-Find: ^#^pMacG_9781770494220_5p_all_r1.indd ^#^p10/27/14 11:56 AM^p +
-</WRAP> +
- +
-You will also need to do it without the leading ^#^p to catch footer text that does not have a page number with it.+
 </WRAP> </WRAP>
  
Line 152: Line 173:
   * ''\p.+\s+[0-9OoIil]{1,3}\p'' ### Detect bad line breaks ###   * ''\p.+\s+[0-9OoIil]{1,3}\p'' ### Detect bad line breaks ###
   * ''[^\."?!]$''   * ''[^\."?!]$''
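
The last pattern in the list above can be tried out in Python as well. The sketch below is only an illustration under stated assumptions: the function name and sample text are hypothetical, and the text is checked one line at a time rather than inside a word processor.

```python
import re

# A line whose last character is not a period, double quote,
# question mark or exclamation mark.
SUSPECT_LINE_END = re.compile(r'[^\."?!]$')


def suspect_lines(text: str) -> list[str]:
    """Return the lines that do not end in sentence-final punctuation."""
    return [line for line in text.splitlines() if SUSPECT_LINE_END.search(line)]


sample = 'It was late.\nThe troops moved\nnorth at dawn.\n"Why?"'  # hypothetical text
print(suspect_lines(sample))
# prints: ['The troops moved']
```

Lines flagged this way are only candidates for a bad break: headings and list items also end without sentence punctuation, so each hit still needs a manual look.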
 +
 +
 +[[public:nnels:etext:start|Return to main eText Page]]
  
public/nnels/etext/regex.1531348264.txt.gz · Last modified: 2018/07/11 15:31 by leah.brochu