:~/Scraping E-mails with Regex in R (8/31/2022)


I recently needed to scrape some e-mails. This is an easy job with R's built-in regex functions. For demonstration, let's scrape the e-mails of all the Gregs in the Economics department at Vanderbilt. Start by visiting the economics faculty webpage at https://as.vanderbilt.edu/economics/people/. The source of this page is really clean: each faculty member has their own line in the source, so we can work through it line by line.
pageSource <- readLines('https://as.vanderbilt.edu/economics/people/')
matchingLines <- pageSource[grep("Greg", pageSource)]
emails <- regexec("mailto\\:(.*?)\\'", matchingLines)
emails <- sapply(regmatches(matchingLines, emails), "[[", 2)
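The first line reads the page source into a character vector, with one element per line of the page. The second line pulls out all the lines that contain the string "Greg". The third line looks within these matching lines for the string "mailto:" followed by anything and then terminated by a single quote (the "\\" before the colon and before the quote are just escapes). The "(.*?)" part captures that "anything" into a capture group, and the "?" makes the match lazy, so it stops at the first single quote rather than the last one on the line. The fourth line pulls out those captured groups, revealing just the e-mails. Now we have the e-mails: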
[1] "gregory.w.huffman@vanderbilt.edu"
[2] "g.leo@vanderbilt.edu"
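
If you want to try the same steps without hitting the live page, here is a minimal offline sketch. The sample lines and the example.edu addresses are made up for illustration and are not real faculty data. It also swaps the lazy "(.*?)" for a negated character class "([^']*)", which should capture the same text here, and uses vapply so that a "Greg" line without a mailto link gives NA instead of an error.

```r
# Made-up sample lines standing in for the page source (hypothetical data)
sampleLines <- c(
  "<li>Greg Example <a href='mailto:greg.example@example.edu'>email</a></li>",
  "<li>Pat Sample <a href='mailto:pat.sample@example.edu'>email</a></li>",
  "<li>Gregory Placeholder <a href='mailto:g.placeholder@example.edu'>email</a></li>",
  "<li>Greg Nolink (no e-mail listed)</li>"
)

# Keep only the lines that contain "Greg"
matchingLines <- sampleLines[grep("Greg", sampleLines)]

# Match "mailto:" and capture everything up to the next single quote
matches <- regexec("mailto:([^']*)'", matchingLines)
parts <- regmatches(matchingLines, matches)

# Each element of parts is c(full match, captured group), or empty if no match
emails <- vapply(
  parts,
  function(p) if (length(p) >= 2) p[2] else NA_character_,
  character(1)
)

print(emails)
```

With sapply(parts, "[[", 2), as in the original four lines, the last sample line would raise a "subscript out of bounds" error because it has no match. That is fine when every matching line is known to have an e-mail link, as on the page above.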