:~/Scraping E-mails (8/31/2022)


A colleague recently contacted me about scraping some e-mails. Let's pretend this colleague wanted to get the e-mails of all the Gregs in the Economics department at Vanderbilt. I started by visiting the economics faculty webpage (https://as.vanderbilt.edu/economics/people/). The source of this page is really clean: each faculty member has their own line in the source. I managed to scrape the Greg e-mails with this 4-liner in R that uses no external packages:
pageSource <- readLines('https://as.vanderbilt.edu/economics/people/')
matchingLines <- pageSource[grep("Greg",pageSource)]
emails <- regexec("mailto\\:(.*?)\\'",matchingLines)
emails <- sapply(regmatches(matchingLines,emails),"[[",2)
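The first line reads the page source into a character vector, one element per line of the source. The second line pulls out all the lines that contain the string "Greg". The third line looks within these matching lines for the string "mailto:" followed by anything and then terminated by a single quote (the backslashes in the pattern only escape the colon and the quote). It captures that "anything" into a "capture group". The fourth line pulls out those captured groups: for each line, regmatches returns the full match followed by the capture group, so "[[" with index 2 picks out the captured text, revealing just the e-mails. Now we have the e-mails: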
[1] "gregory.w.huffman@vanderbilt.edu"
[2] "g.leo@vanderbilt.edu"
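
To see what the capture group step is doing without hitting the network, here is a small sketch that runs the same idea on a few made-up lines. The names and addresses are placeholders I invented, not the real page source, and the pattern drops the backslashes since neither the colon nor the single quote is a regex metacharacter.

```r
# Made-up lines standing in for the page source (NOT the real Vanderbilt HTML).
pageSource <- c(
  "<li>Greg Example <a href='mailto:greg.example@example.edu'>e-mail</a></li>",
  "<li>Alice Sample <a href='mailto:alice.sample@example.edu'>e-mail</a></li>",
  "<li>Gregory Placeholder <a href='mailto:g.placeholder@example.edu'>e-mail</a></li>"
)

# Keep only the lines that mention "Greg".
matchingLines <- pageSource[grep("Greg", pageSource)]

# regexec() locates the full match and the capture group in each line;
# regmatches() turns those positions into the matched text.
matches <- regmatches(matchingLines, regexec("mailto:(.*?)'", matchingLines))

# Each list element holds two strings: the full match, then capture group 1.
print(matches[[1]])

# Take the second entry of each element, i.e. the captured address.
emails <- sapply(matches, "[[", 2)
print(emails)
```

One thing to watch for: if a line contains "Greg" but no mailto link, its entry in matches is empty and the "[[" call with index 2 will fail, so on a messier page it may be worth dropping empty entries first.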