Assignment 4

Web Scraping - Job Postings

STA141B, Spring 2025

Due: May 31, 11pm

Submit via Canvas

Consider the Web site https://jobs-in-data.com. There, one can specify a job title and, optionally, a location. Clicking Search displays a page of results of matching job postings. At the bottom of the page, there are items to click to move to the next and other pages of results.

(See the screenshots below.)

In this assignment, we will programmatically fetch information about available job postings for different search terms and geographical regions.

Task 1

Your first task is to write functions that

• accept a job description, any other required information, and optionally the location,

• retrieve all the matching jobs across all result pages, and

• create a data.frame with a row for each job posting and variables describing that job.
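One possible shape for these functions is sketched below, under stated assumptions. The names get_all_jobs, fetch_page, fake_fetch, max_pages and delay are hypothetical and do not come from the site or any package. The sketch also assumes that asking for a page past the last one yields no postings; if the site behaves differently, following the "next" link at the bottom of each page is an alternative stopping rule.

```r
# Sketch: collect every page of results for one query into a single data.frame.
# `fetch_page(query, location, page)` is a hypothetical helper you would write;
# it must return a data.frame of postings for that page (zero rows when the
# page number is past the last page).
get_all_jobs <- function(query, location = NULL, fetch_page,
                         max_pages = 100, delay = 1) {
  pages <- list()
  for (page in seq_len(max_pages)) {
    res <- fetch_page(query, location, page)
    if (is.null(res) || nrow(res) == 0) break   # assumed end-of-results signal
    res$query <- rep(query, nrow(res))          # remember which search produced the row
    pages[[length(pages) + 1L]] <- res
    Sys.sleep(delay)                            # pause between HTTP requests
  }
  if (length(pages) == 0) return(data.frame())
  do.call(rbind, pages)
}

# Stand-in fetcher for illustration only: three fake pages, two jobs each.
fake_fetch <- function(query, location, page) {
  if (page > 3) return(data.frame())
  data.frame(title = paste(query, "job", (page - 1) * 2 + 1:2),
             company = c("A", "B"),
             stringsAsFactors = FALSE)
}

jobs <- get_all_jobs("data scientist", fetch_page = fake_fetch, delay = 0)
```

Passing the page fetcher in as an argument keeps the paging logic separate from the HTML or JSON parsing, so each part can be tested on its own.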

Use these functions to search for job titles/roles such as

• data scientist

• statistician

• data analyst

• programmer

• machine learning

• artificial intelligence

Compare the results for the different titles.

Task 2

Using the results for “data scientist”, fetch the Web page for each of the full job descriptions. (Note that this can take 15 minutes or more.)

• Get the section titles in each document

— Are there some section titles that are common to many job postings? Do they suggest any structure we may be able to leverage to find information?

• Get all the words in regular text (i.e., not in JavaScript script or CSS style nodes).

• Remove common “stop-words”. See the stopwords package.

• Which words are common to these job postings?

• Find common technical words/phrases.
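The steps in this list could be sketched with xml2 roughly as follows. This is an illustration, not a prescribed solution. The function names get_section_titles, get_words and word_counts are hypothetical, and the sketch assumes that section titles appear in h1 to h4 elements, which may not hold for every posting. The stop-word list is passed in as an argument; in practice it could come from stopwords::stopwords("en").

```r
library(xml2)

# Headings that plausibly act as section titles (assumes h1-h4 are used).
get_section_titles <- function(doc) {
  nodes <- xml_find_all(doc, "//h1 | //h2 | //h3 | //h4")
  trimws(xml_text(nodes))
}

# Words from regular text: skip text nodes inside <script> and <style>.
get_words <- function(doc) {
  nodes <- xml_find_all(doc,
    "//body//text()[not(ancestor::script) and not(ancestor::style)]")
  txt <- tolower(paste(xml_text(nodes), collapse = " "))
  # Keep + and # so that tokens such as "c++" and "c#" survive the split.
  words <- unlist(strsplit(txt, "[^a-z0-9+#]+"))
  words[nzchar(words)]
}

# Frequency table after dropping stop-words, most frequent first.
# In practice: stop_words <- stopwords::stopwords("en")
word_counts <- function(words, stop_words) {
  sort(table(words[!words %in% stop_words]), decreasing = TRUE)
}

# Small made-up document standing in for one job posting.
html <- "<html><head><style>p {color: red}</style></head><body>
  <h2>Requirements</h2>
  <p>Experience with Python and SQL. Python is required.</p>
  <script>var tracking = 1;</script>
  </body></html>"
doc <- read_html(html)
titles <- get_section_titles(doc)
counts <- word_counts(get_words(doc), stop_words = c("with", "and", "is"))
```

Applying get_section_titles to every posting and tabulating the combined result is one way to see which titles recur across postings.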

Bonus: Find required and preferred educational levels in the job postings.
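A rough starting point for the bonus is pattern matching on individual lines of a posting, for example the text of each list item or paragraph. The function find_education and its patterns are hypothetical and deliberately simple: they are guesses rather than an exhaustive list, and they will produce false positives (for instance, "master" also matches "Scrum Master").

```r
# Sketch: flag lines that mention a degree and whether the line reads as a
# preference rather than a requirement. Patterns are illustrative guesses.
find_education <- function(lines) {
  patterns <- c(bachelor = "\\bbachelor",
                master   = "\\bmaster",
                phd      = "\\bph\\.?\\s?d\\b|\\bdoctora")
  rows <- lapply(names(patterns), function(level) {
    hits <- lines[grepl(patterns[[level]], lines, ignore.case = TRUE, perl = TRUE)]
    if (length(hits) == 0) return(NULL)
    data.frame(level = level,
               preferred = grepl("\\b(preferred|plus|desired)\\b", hits,
                                 ignore.case = TRUE, perl = TRUE),
               line = hits,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)   # NULL when no line mentions a degree
}

# Made-up lines standing in for one posting.
lines <- c("Bachelor's degree in Statistics required.",
           "Master's or Ph.D. preferred.",
           "5 years of SQL experience.")
edu <- find_education(lines)
```

Working line by line avoids splitting sentences on periods, which would break abbreviations such as "Ph.D.".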

Be careful not to perform too many HTTP queries in rapid succession. Consider caching the full job posting documents.
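A minimal sketch of caching with a pause between requests, using only base R, might look like this. The name fetch_cached, the cache directory layout and the default delay of 2 seconds are assumptions chosen for illustration; the fetch argument can be swapped for a function built on httr or RCurl.

```r
# Sketch: return the cached copy of a page if one exists; otherwise wait,
# fetch it, store it on disk, and return it.
fetch_cached <- function(url, cache_dir = "cache",
                         fetch = function(u) paste(readLines(u, warn = FALSE),
                                                   collapse = "\n"),
                         delay = 2) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  # File name derived from the URL (very long URLs may need hashing instead).
  path <- file.path(cache_dir, paste0(gsub("[^A-Za-z0-9]+", "_", url), ".html"))
  if (file.exists(path)) {
    return(paste(readLines(path, warn = FALSE), collapse = "\n"))
  }
  Sys.sleep(delay)          # only real requests are delayed, not cache hits
  html <- fetch(url)
  writeLines(html, path)
  html
}

# Illustration with a stand-in fetcher that counts how often it is called.
calls <- 0
fake <- function(u) { calls <<- calls + 1; "<html><body>job</body></html>" }
demo_dir <- tempfile("cache")
first  <- fetch_cached("https://example.com/job/1", demo_dir, fake, delay = 0)
second <- fetch_cached("https://example.com/job/1", demo_dir, fake, delay = 0)
```

Because the second call is served from disk, rerunning the analysis after a crash or a code change does not repeat the HTTP queries.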

Report

Describe the approach you used to get the data for Task 1, including the steps you took to fetch the results and create the data.frame for a given search query.

Interpret the results, comparing them for different search queries.

Describe how you got the section titles and the words in the regular text in Task 2.

Interpret the results.

Useful Packages and Functions

• readLines() - may work for some URLs

• jsonlite, RJSONIO - for transforming from and to JSON

• xml2, XML - for parsing HTML

• RCurl, httr - for making HTTP requests

• stopwords - for common words that we often exclude to focus on more interesting words.

• HAR (https://github.com/duncantl/HAR) - an R package (not on CRAN) that reads a HAR (HTTP Archive) file into an R data.frame.