Differences

This shows you the differences between two versions of the page.

--- public:nnels:etext:regex [2018/07/11 15:31]
leah.brochu
+++ public:nnels:etext:regex [2022/04/11 14:02] (current)
rachel.osolen
@@ Line 23: / Line 23: @@
 [[https://support.office.com/en-ca/article/Find-and-replace-text-and-other-data-in-your-Word-2010-files-c6728c16-469e-43cd-afe4-7708c6c779b7?ui=en-US&rs=en-CA&ad=CA#__toc282774574|Using wildcards in Microsoft Word]] (this is similar to regular expressions, but Word has a lot of its own syntax)
-  * Word has a lot of options to find letters (^$) and numbers (^#) but these only work with the wildcard option //off// (which it is by default). Only turn the wildcard option on if you're using regex options. Read the info page carefully on when things apply with the wildcard option on/off.
+  * Word has a lot of options to find letters (^$) and numbers (^#) when using the non-regex [[public:nnels:etext:find-and-replace|Find & Replace]], but these only work with the wildcard option //off// (which it is by default). Only turn the wildcard option on if you're using regex options. Read the info page carefully on when things apply with the wildcard option on/off.
   * A lot of the codes for special characters (e.g. page break) are under the "Special..." button.
@@ Line 77: / Line 77: @@
 <WRAP center round box 80%>
-**PROBLEM:** OCR converted some "1" digits to "i/I" letters, resulting in dates like "i984" or numbers like "3i".
+**PROBLEM**: A stray hyphen splits a single word in the middle of a line (not a word hyphenated across two lines).
+**SOLUTION**: Replace with the same text minus the hyphen.
+Find: ''([a-z])-([a-z])''
+Replace with: ''\1\2''
+Using a-z restricts what it finds to lowercase.
+You will likely have to do it again for lines that end with a comma, and possibly en and em dash. Look through your document for patterns of anything else it might have missed.
+</WRAP>
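As a rough illustration outside of Word, the same find/replace pair can be tried in Python. This is only a sketch: the function name and the sample sentence are made up for the example, and Python's `re` module stands in for Word's wildcard search.

```python
import re

# Same pattern as the Find box: lowercase letter, hyphen, lowercase letter.
HYPHEN_IN_WORD = re.compile(r"([a-z])-([a-z])")

def remove_inner_hyphens(text: str) -> str:
    """Drop a hyphen that sits between two lowercase letters."""
    # \1\2 keeps both captured letters and leaves out the hyphen.
    return HYPHEN_IN_WORD.sub(r"\1\2", text)

# Hypothetical sample text, not taken from a real title.
sample = "The gov-ernment met in Mid-Atlantic waters."
print(remove_inner_hyphens(sample))
```

Because the pattern cannot tell an OCR error from a genuine compound, it would also turn "well-known" into "wellknown", so stepping through matches one at a time is safer than replacing everything at once.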
+----
+<WRAP center round box 80%>
+**PROBLEM:** OCR converted some "1" digits to "i/I" letters, resulting in dates like "i984" or numbers like "3I".
 **SOLUTION:** Replace "i/I"s that come immediately before or after a number with "1"s. This will be done in two steps
@@ Line 90: / Line 106: @@
 ----
 <WRAP center round box 80%>
-**PROBLEM**: There are extra paragraph breaks. We want to keep the real paragraph breaks and remove the fake extra paragraph breaks.
-**SOLUTION**: Use MS Word's find and replace to remove the extra paragraph breaks using special Word symbols.
+**PROBLEM:** OCR did not recognize spaces around quotation marks.
+  * Example A: As one of Montgomery's British staff officers later put ''it,"I'' feel Monty was astonishing in his relationship with all the Dominion troops.
+  * Example B: The "nasty little ''troublemaker,"as'' Montgomery was widely known in the British army...
+This problem has an added complexity; the pattern has two different solutions:
+  * Example A will need to say: ... later put ''it, "I'' feel Monty... (or, comma-space-quotation mark)
+  * Example B will need to say: The "nasty little troublemaker''," as'' Montgomery... (or, comma-quotation mark-space)
-Find: ''^p^p'' (you can also search for more than 2 paragraph breaks, i.e. ''^p^p^p'')
+**SOLUTIONS:**
+Example A:\\
+Find: ''([,])(["])([A-Za-z])''\\
+Replace: ''\1 \2\3''\\
+Example B:\\
+Find: ''([,])(["])([A-Za-z])''\\
+Replace: ''\1\2 \3''
+Notes:
+  * You will **not** be able to use "replace all" in this situation. You will need to keep hitting ''Find Next'' and replacing the pattern with the appropriate solution.
+  * You will also need to re-do this, searching for periods instead of commas.
-Replace with: ''^p''
 </WRAP>
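The two replacements for the quotation-mark problem can be compared side by side in a short Python sketch. The function names are hypothetical, Python's `re` module stands in for Word's wildcard search, and the letter class is written as `[A-Za-z]` on the assumption that only letters should match after the quotation mark.

```python
import re

# Comma, straight double quote, then a letter, each in its own group.
COMMA_QUOTE_LETTER = re.compile(r'([,])(["])([A-Za-z])')

def fix_opening_quote(text: str) -> str:
    """Example A: the quote opens new speech, so the space goes before it."""
    return COMMA_QUOTE_LETTER.sub(r'\1 \2\3', text)

def fix_closing_quote(text: str) -> str:
    """Example B: the quote closes a phrase, so the space goes after it."""
    return COMMA_QUOTE_LETTER.sub(r'\1\2 \3', text)

example_a = 'later put it,"I feel Monty was astonishing'
example_b = 'The "nasty little troublemaker,"as Montgomery was widely known'

print(fix_opening_quote(example_a))
print(fix_closing_quote(example_b))
```

Both functions share one Find pattern, which is why the choice between them has to be made by a person looking at each match rather than by a single "replace all".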
 ----
 <WRAP center round box 80%>
-**PROBLEM**: There are newlines/line breaks (↵) instead of paragraph marks (¶).
+**PROBLEM**: There are extra paragraph breaks. We want to keep the real paragraph breaks and remove the fake extra paragraph breaks.
-**SOLUTION**: Find and remove all line breaks and replace with a single paragraph break.
+**SOLUTION**: See: [[public:nnels:etext:find-and-replace|Find & Replace]]
+</WRAP>
-Find: ''^m''
+----
-Replace with: ''^p''
+<WRAP center round box 80%>
+**PROBLEM**: There are newlines/line breaks (↵) instead of paragraph marks (¶).
-In LibreOffice, replace all ''\n'' with ''\p'' to convert them to paragraphs.
+**SOLUTION**: See: [[public:nnels:etext:find-and-replace|Find & Replace]]
 </WRAP>
@@ Line 121: / Line 157: @@
 ''231(paragraph break)MacG_9781770494220_5p_all_r1.indd 231(paragraph break)10/27/14 11:56 AM(paragraph break)''
-**SOLUTION**: Without using wildcards:
+**SOLUTION**: See: [[public:nnels:etext:find-and-replace|Find & Replace]]
-Find:  ''^#^#^#^pMacG_9781770494220_5p_all_r1.indd ^#^#^#^p10/27/14 11:56 AM^p''
-Replace with: nothing. If you're doing a paginated title, replace with page breaks.
-You will need to remove one of the ^# at the beginning and after the .indd to remove it for 2 digit page numbers, and one last time for single digit page numbers. The following screenshot is an example with a 1-digit page number (see below), followed by the command used to isolate all such instances.
-<WRAP center round box 60%>
-{{:nnels:documentation:content:production:screen_shot_2015-08-06_at_6.10.55_pm.png?300|}}
-Find: ^#^pMacG_9781770494220_5p_all_r1.indd ^#^p10/27/14 11:56 AM^p
-</WRAP>
-You will also need to do it with the leading ^#^p to catch the footer text that do not have any page numbers with it.
 </WRAP>
@@ Line 152: / Line 173: @@
   * ''\p.+\s+[0-9OoIil]{1,3}\p'' ### Detect bad line breaks ###
   * ''[^\."?!]$''
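The last pattern in the list above can be tried on plain text with a small Python sketch. It reads the pattern as a check for lines that end without sentence-ending punctuation, which is an interpretation rather than something the page spells out; the function name and sample text are invented for the example.

```python
import re

# A final character that is not a period, double quote, question mark,
# or exclamation mark.
NO_END_PUNCTUATION = re.compile(r'[^\."?!]$')

def lines_without_end_punctuation(text: str) -> list[str]:
    """Return the lines whose last character is not . " ? or !"""
    return [line for line in text.splitlines() if NO_END_PUNCTUATION.search(line)]

# Hypothetical sample: the first line is broken mid-sentence.
sample = 'He walked to the\nstore.\n"Stop!"\nChapter One'
for line in lines_without_end_punctuation(sample):
    print(line)
```

Headings such as "Chapter One" are reported as well, so the matches are candidates to review, not a list of confirmed errors.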
+[[public:nnels:etext:start|Return to main eText Page]]
