Differences

This shows you the differences between two versions of the page.

public:nnels:etext:regex [2017/04/03 15:49]
farrah.little [Conversion Fixes]
public:nnels:etext:regex [2022/04/11 14:02] (current)
rachel.osolen
Line 1: Line 1:
 ====== Regular Expressions ======
 Regular expressions (aka regex) are useful for finding and replacing patterns of text. Typical uses are replacing headers/footers with page breaks (or simply removing them) and removing the mid-word or mid-sentence line breaks that are common when text is converted from a PDF.
 +
 +With regex, you can define patterns of text in a number of different ways, but the most commonly used ones for our purposes are **Ranges** and **Groups**. For more information about others, you can take a look at [[https://wordmvp.com/FAQs/General/UsingWildcards.htm|this helpful webpage]]:
 +  * Ranges
 +    * Square brackets are always used in pairs and are used to identify //specific characters// or //ranges of characters//. You can use any character or series of characters in a range [ ], including the space character. For example:
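 +      * [A-Z] will find any upper case letter;
 +      * [a-z] will find any lower case letter;
 +      * [A-z] will find any letter (upper or lower case) in Word. In most other regex engines this range also matches a few punctuation characters that sit between "Z" and "a" (such as ''['' and ''_''), so [A-Za-z] is the safer choice there;
 +      * [0-9] will find any single digit;
 +      * [abc] will find any one of the letters a, b, or c;
 +      * [F] will find upper case “F”;
 +      * [Fred] will find any one of the letters F, r, e, or d (a single character, not the whole word "Fred").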
 +  * Groups
 +    * Round brackets are used in pairs to enclose //groups//. For example:
 +      * ''([A-Z][A-Z])-([0-9])'' will find any two capital letters followed by a hyphen and a single digit, like ''BB-8'' or ''LY-5''.
 +    * Each group is addressed by number in the replacement. In the Replace field, \1 represents the first group, \2 represents the second group, and so on. For example:
 +      * If you wanted to remove the hyphen from "BB-8", you would enter ''\1\2'' (i.e., the two groups with nothing between them) into the Replace field. Or, if you wanted to change the hyphen to a space, you would enter ''\1 \2'' (i.e., the two groups with a space between them) into the Replace field.
 +      * Another example: ''(John) (Smith)'' replaced by ''\2 \1'' (note the spaces in the search and replace strings) will produce ''Smith John''.
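The group examples above can also be tried outside Word. The following is a minimal sketch in Python's ''re'' module, not part of the Word workflow; the function names are made up for illustration, and Python's syntax is assumed to match Word's for these particular patterns (ranges, groups, and ''\1''/''\2'' back-references).

```python
import re

# Two capital letters, a hyphen, and one digit, captured as two groups.
CODE_PATTERN = re.compile(r"([A-Z][A-Z])-([0-9])")


def remove_hyphen(text: str) -> str:
    """Put both groups back with nothing between them (Replace: \\1\\2)."""
    return CODE_PATTERN.sub(r"\1\2", text)


def hyphen_to_space(text: str) -> str:
    """Put both groups back with a space between them (Replace: \\1 \\2)."""
    return CODE_PATTERN.sub(r"\1 \2", text)


def swap_names(text: str) -> str:
    """Swap the two captured words (Find: (John) (Smith), Replace: \\2 \\1)."""
    return re.sub(r"(John) (Smith)", r"\2 \1", text)


if __name__ == "__main__":
    print(remove_hyphen("BB-8 and LY-5"))
    print(hyphen_to_space("BB-8 and LY-5"))
    print(swap_names("John Smith"))
```

Running the script prints the three replaced strings, which makes it easy to check a pattern on a short sample before using it on a whole document.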
  
 ====Tips====
Line 6: Line 23:
 [[https://support.office.com/en-ca/article/Find-and-replace-text-and-other-data-in-your-Word-2010-files-c6728c16-469e-43cd-afe4-7708c6c779b7?ui=en-US&rs=en-CA&ad=CA#__toc282774574|Using wildcards in Microsoft Word]] (this is similar to regular expressions, but Word has a lot of its own syntax)

-  * Word has a lot of options to find letters (^$) and numbers (^#) but these only work with the wildcard option //off// (which it is by default). Only turn the wildcard option on if you're using regex options. Read the info page carefully on when things apply with the wildcard option on/off.
 +  * Word has a lot of options to find letters (^$) and numbers (^#) when using the non-regex [[public:nnels:etext:find-and-replace|Find & Replace]], but these only work with the wildcard option //off// (which it is by default). Only turn the wildcard option on if you're using regex options. Read the info page carefully on when things apply with the wildcard option on/off.

   * A lot of the codes for special characters (e.g. page break) are under the "Special..." button.
Line 19: Line 36:
 The following fixes assume you are using Word, unless otherwise stated.
  
 +<note>Contribute your problems and regex solutions below. Attach your screenshots of both the problem and solution.</note>
 +
 +----
 +
 +<WRAP center round box 80%>
 **PROBLEM**: Each line ends with a paragraph break.

-**SOLUTION**: There is no single solution to this, but the typical pattern is to search for the pattern not a period, followed by paragraph break, followed by letter and replace with the same thing minus the paragraph break.
 +**SOLUTION**: There is no single solution to this, but the typical pattern is to search for the pattern: ''not a period'', followed by ''paragraph break'', followed by ''letter'' and replace with the same thing minus the paragraph break.

 In Word, this will only work with wildcards turned on.
Line 29: Line 51:
 Replace with: ''\1\2''

-This looks for the pattern: any-letter space paragraph-break any-letter
 +This looks for the pattern: ''any-letter'' ''space'' ''paragraph-break'' ''any-letter''

 The parentheses are used to group what it finds, so \1 refers to the first "any-letter" group and \2 refers to the second "any-letter" group.

 In this way, you are putting back exactly what it found minus the paragraph break.
 +</WRAP>
 +
 +----
  
 +<WRAP center round box 80%>
 **PROBLEM**: Hyphenated words that break over two lines.
  
Line 46: Line 72:
  
 You will likely have to do it again for lines that end with a comma, and possibly en and em dash. Look through your document for patterns of anything else it might have missed.
 +</WRAP>
  
-**PROBLEM**: There are extra paragraph breaks. We want to keep the real paragraph breaks and remove the fake extra paragraph breaks.
 +----

-**SOLUTION**: Use MS Word's find and replace to remove the extra paragraph breaks using special Word symbols.
 +<WRAP center round box 80%>
 +**PROBLEM**: A hyphen breaks up a single word on one line (not over two lines).

-Find: ''^p^p'' (you can also search for more than 2 paragraph breaks, i.e. ''^p^p^p'')
 +**SOLUTION**: Replace with the same text minus the hyphen.

-Replace with: ''^p''
 +Find: ''([a-z])-([a-z])''

-**PROBLEM**: There are newlines/line breaks (↵) instead of paragraph marks (¶).
 +Replace with: ''\1\2''

-**SOLUTION**: Find and remove all line breaks and replace with single paragraph break.
 +Using a-z restricts what it finds to lowercase.

-Find: ''^m''
 +You will likely have to do it again for lines that end with a comma, and possibly en and em dash. Look through your document for patterns of anything else it might have missed.
 +</WRAP>
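As a rough illustration of the in-word hyphen fix, here is a minimal Python sketch of the same Find/Replace pair. The function name is hypothetical, and Python's ''re'' module stands in for Word's wildcard search, so treat it as a way to test the pattern rather than as the production workflow.

```python
import re


def join_hyphenated(text: str) -> str:
    """Remove a hyphen that sits directly between two lowercase letters.

    Mirrors Find: ([a-z])-([a-z]) with Replace: \\1\\2. Uppercase letters and
    digits are left alone because the ranges only cover a-z.
    """
    return re.sub(r"([a-z])-([a-z])", r"\1\2", text)


if __name__ == "__main__":
    print(join_hyphenated("a con-version error in BB-8"))
```

Note that this pattern cannot tell a stray hyphen from a legitimate one: a word such as "well-known" would also lose its hyphen, so it is worth reviewing matches rather than replacing everything blindly.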
  
-Replace with: ''^p''
 +----

-In LibreOffice, replace all ''\n'' with ''\p'' to convert them to paragraphs.
 +<WRAP center round box 80%>
 +**PROBLEM:** OCR converted some "1" digits to "i/I" letters, resulting in dates like "i984" or numbers like "3I".

-<note important>If you understand the deleted notes below, please attach a screenshot of the problem and of the solution!</note>
 +**SOLUTION:** Replace "i/I"s that come immediately before or after a number with "1"s. This will be done in two steps:

-<del>Check to see if there is a paragraph marker at the end of each line, if so, there is a multi-step process to clean them up:
-    - Paragraphs will be separated by a blank line. replace those with a unique set of characters that won't be in the text, e.g. ''\p\p'' -> ''%%%%''
-    - If the lines all end with a space, replace all ''\p'' with nothing, otherwise replace them with a single space
-    - Finally, replace all ''%%%%'' with ''\p''
-  * If the lines wrap properly but there is still a blank line between paragraphs, then a simple replace ''\p\p'' with ''\p'' will suffice, rather than the above procedure.</del>
 +  -  
 +    - Find: ''([iI])([0-9])'' This will find both lower and upper case "i"s, immediately followed by a digit.
 +    - Replace: ''1\2'' This replaces the first group ''([iI])'' with the number **1**, and leaves the second group ''([0-9])'' as is.
 +  -  
 +    - Find: ''([0-9])([iI])'' This will find a digit immediately followed by the letter i (e.g., 3i).
 +    - Replace: ''\11'' This leaves the first group ''([0-9])'' as is, and replaces the second group ''([iI])'' with the number **1**.
 +</WRAP>
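The two-step OCR fix can be sketched in Python as follows. This is an illustration only, with a hypothetical function name. One difference from Word is worth noting: in Python's ''re'' module the replacement ''\11'' would be read as a reference to group 11, so the second step uses the unambiguous form ''\g<1>1'' instead.

```python
import re


def fix_ocr_ones(text: str) -> str:
    """Replace i/I with 1 when it sits directly before or after a digit."""
    # Step 1: an "i" or "I" immediately followed by a digit, e.g. "i984".
    # Group 1 is replaced by a literal 1; group 2 (the digit) is kept.
    text = re.sub(r"([iI])([0-9])", r"1\2", text)
    # Step 2: a digit immediately followed by "i" or "I", e.g. "3I".
    # \g<1> keeps the digit; the trailing 1 is the literal replacement.
    text = re.sub(r"([0-9])([iI])", r"\g<1>1", text)
    return text


if __name__ == "__main__":
    print(fix_ocr_ones("In i984 there were 3I entries."))
```

Because each step only matches an i/I that touches a digit, ordinary words such as "In" are left untouched.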
  
 +----
  
-<del>We have to convert the double paragraphs breaks into something else unique, remove the single paragraph breaks and then convert the unique characters that were double paragraph breaks into new single paragraph breaks. It is best to do this at the beginning of the text correction stage as it appears to mess with existing formatting styles. 
-  -  Find and replace all double paragraphs 
-    * initiate a find for, ^p^p 
-  - Replace with a unique symbol or code, eg, ' xswedc ' 
-    * (I found placing a space before and after helps make it even more unique and avoid it bunching up with other double paragraphs) this isn't anything special about these letters, other than that they are a unique string of letters we can search on later 
-  - Find and replace all remaining single paragraphs, find = ^p, replace =  [single keyboard space] 
-  - Find and replace all the double paragraphs you previously changed into a special symbol or code and change back to a single paragraph 
-  - Find and remove all line breaks, change into double or single paragraphs instead (find = ^m, replace = ^p )</del> 
  
-<note important>The below problem and solution need to be clarified. If it makes sense to you, attach a screenshot of the problem and solution!</note>
 +<WRAP center round box 80%>

-**PROBLEM**: Running headers. Example, where the first three numbers and the three numbers after the filename is the page number:
-''231(paragraph break)MacG_9781770494220_5p_all_r1.indd 231(paragraph break)10/27/14 11:56 AM(paragraph break)''
 +**PROBLEM:** OCR did not recognize spaces around quotation marks.
 +  * Example A: As one of Montgomery's British staff officers later put ''it,"I'' feel Monty was astonishing in his relationship with all the Dominion troops.
 +  * Example B: The "nasty little ''troublemaker,"as'' Montgomery was widely known in the British army...
 +This problem has an added complexity: the same pattern needs two different fixes.
 +  * Example A will need to say: ... later put ''it, "I'' feel Monty... (or, comma-space-quotation mark)
 +  * Example B will need to say: The "nasty little troublemaker''," as'' Montgomery... (or, comma-quotation mark-space)

-**SOLUTION**: Without using wildcards:
 +**SOLUTIONS:**
 +Example A:\\

-Find:  ''^#^#^#^pMacG_9781770494220_5p_all_r1.indd ^#^#^#^p10/27/14 11:56 AM^p''
 +Find: ''([,])(["])([A-z])''\\
 +Replace: ''\1 \2\3''

-Replace with: nothing. If you're doing a paginated title, replace with page breaks.
 +Example B:

-You will need to remove one of the ^# at the beginning and after the .indd to remove it for 2 digit page numbers, and one last time for single digit page numbers. The following screenshot is an example with a 1-digit page number (see below), followed by the command used to isolate all such instances.
 +Find: ''([,])(["])([A-z])''\\
 +Replace: ''\1\2 \3''

-<WRAP center round box 60%>
 +Notes:
 +  * You will **not** be able to use "replace all" in this situation. You will need to keep hitting ''Find Next'' and replacing the pattern with the appropriate solution.
 +  * You will also need to re-do this, searching for periods instead of commas.

-{{:nnels:documentation:content:production:screen_shot_2015-08-06_at_6.10.55_pm.png?300|}}
 +</WRAP>
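The two quotation-mark fixes can be compared side by side with a small Python sketch. The function names are hypothetical, and the letter range is written as ''[A-Za-z]'' because Python's ''[A-z]'' would also match some punctuation. Each function replaces only the first match (''count=1''), loosely mirroring the Find Next workflow, since a person still has to decide which fix applies.

```python
import re

# A comma, a straight double quote, and a letter with no spaces between them.
QUOTE_PATTERN = re.compile(r'([,])(["])([A-Za-z])')


def fix_opening_quote(text: str) -> str:
    """Example A: the quote opens a quotation, so the space goes before it."""
    return QUOTE_PATTERN.sub(r"\1 \2\3", text, count=1)


def fix_closing_quote(text: str) -> str:
    """Example B: the quote closes a quotation, so the space goes after it."""
    return QUOTE_PATTERN.sub(r"\1\2 \3", text, count=1)


if __name__ == "__main__":
    print(fix_opening_quote('later put it,"I feel Monty was astonishing'))
    print(fix_closing_quote('The "nasty little troublemaker,"as Montgomery'))
```

The same sketch could be repeated with a period in place of the comma in the first group to cover sentences that end inside a quotation.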
  
-Find: ^#^pMacG_9781770494220_5p_all_r1.indd ^#^p10/27/14 11:56 AM^p
 +----
 + 
 + 
 +<WRAP center round box 80%> 
 +**PROBLEM**: There are extra paragraph breaks. We want to keep the real paragraph breaks and remove the fake extra paragraph breaks.
 +
 +**SOLUTION**: See: [[public:nnels:etext:find-and-replace|Find & Replace]]
 </WRAP> </WRAP>
  
-You will also need to do it with the leading ^#^p to catch the footer text that do not have any page numbers with it.
 +----
 + 
 +<WRAP center round box 80%> 
 +**PROBLEM**: There are newlines/line breaks (↵) instead of paragraph marks (¶). 
 + 
 +**SOLUTION**: See: [[public:nnels:etext:find-and-replace|Find & Replace]] 
 +</WRAP> 
 + 
 +---- 
 + 
 +<WRAP center round box 80%> 
 +**PROBLEM**: Running headers. Example, where the first three digits and the three digits after the filename are the page number:
 +''231(paragraph break)MacG_9781770494220_5p_all_r1.indd 231(paragraph break)10/27/14 11:56 AM(paragraph break)'' 
 + 
 +**SOLUTION**: See: [[public:nnels:etext:find-and-replace|Find & Replace]] 
 +</WRAP>
  
 In LibreOffice:
Line 118: Line 173:
   * ''\p.+\s+[0-9OoIil]{1,3}\p'' ### Detect bad line breaks ###
   * ''[^\."?!]$''
 +
 +
 +[[public:nnels:etext:start|Return to main eText Page]]
 +
public/nnels/etext/regex.1491259750.txt.gz · Last modified: 2017/04/03 15:49 by farrah.little