Objective
Design and implement a Python-based scraper that, given an input CSV of books, automatically locates each title on Goodreads.com, iterates through every review page, and extracts full review text and metadata.
Task Description
Book Identification
o Read the provided goodreads_list.csv file (columns include book_id, title, author)
o For each row, programmatically search Goodreads (e.g., via an HTTP request to the search endpoint or a simple HTML form submission) to find the matching book page URL.
o Ensure correct matching by verifying both title and author. (High-fidelity fuzzy matching will likely recover far more correct matches than exact string comparison; tune the similarity threshold by trial and error so that wrong title/author matches and missed books are both kept to a minimum. See the lookup sketch below.)
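As a starting point, the sketch below shows one way to query the public Goodreads search page and pick the best candidate using difflib fuzzy matching. The search URL, the User-Agent string, the 0.85 threshold, and the CSS selectors for result rows are assumptions and must be verified against the live site's markup.

```python
# Sketch only: the search URL, User-Agent, and CSS selectors are assumptions
# and must be checked against the live Goodreads markup.
from typing import Optional
import difflib

import requests
from bs4 import BeautifulSoup

SEARCH_URL = "https://www.goodreads.com/search"     # assumed public search endpoint
HEADERS = {"User-Agent": "course-project-scraper"}  # hypothetical identifier


def similarity(a: str, b: str) -> float:
    """Case-insensitive fuzzy similarity in [0, 1]."""
    return difflib.SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def find_book_url(title: str, author: str, threshold: float = 0.85) -> Optional[str]:
    """Return the best-matching book page URL, or None if no result clears the threshold."""
    resp = requests.get(SEARCH_URL, params={"q": f"{title} {author}"},
                        headers=HEADERS, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    best_url, best_score = None, 0.0
    # Assumed result-row markup; adjust the selectors after inspecting the page.
    for row in soup.select("tr[itemtype='http://schema.org/Book']"):
        title_el = row.select_one("a.bookTitle")
        author_el = row.select_one("a.authorName")
        if not title_el or not author_el:
            continue
        # Require both title and author to match; score on the weaker of the two.
        score = min(similarity(title, title_el.get_text()),
                    similarity(author, author_el.get_text()))
        if score > best_score:
            best_score = score
            best_url = "https://www.goodreads.com" + title_el.get("href", "")
    return best_url if best_score >= threshold else None
```

Treat the threshold as a tuning parameter rather than a fixed value; raising it reduces wrong matches at the cost of more missed books.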
Review Scraping
o Navigate all paginated review pages for each book (a pagination and extraction sketch follows this list).
o Extract for each review:
§ Review text (full body)
§ Review rating (1–5 stars)
§ Reviewer ID (user profile link or numeric ID)
§ Upvotes (“likes” on the review)
§ Downvotes (if exposed)
§ Review date
§ Any additional metadata easily collectible (e.g., user’s shelf tags, number of comments)
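The sketch below illustrates the shape of the pagination loop and per-review field extraction. Goodreads may render reviews client-side with JavaScript, in which case requests would be replaced by a browser-automation tool such as Selenium or Playwright; the ?page= query parameter and every selector here are assumptions used only to show the structure.

```python
# Sketch only: the ?page= parameter and all selectors are assumptions.
# If reviews are rendered with JavaScript, swap requests for Selenium/Playwright.
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "course-project-scraper"}  # hypothetical identifier


def _text(node, selector: str) -> str:
    """Return stripped text of the first match, or '' if the element is absent."""
    el = node.select_one(selector)
    return el.get_text(strip=True) if el else ""


def scrape_reviews(book_url: str, delay: float = 1.5, max_pages: int = 500):
    """Yield one dict per review across the book's paginated review pages."""
    for page in range(1, max_pages + 1):
        resp = requests.get(book_url, params={"page": page}, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        reviews = soup.select("div.review")   # assumed per-review container
        if not reviews:
            break                             # ran past the last page

        for rev in reviews:
            yield {
                "review_text": _text(rev, ".reviewText"),
                "review_rating": len(rev.select("span.staticStar.p10")),  # assumed star markup
                "reviewer_ID": (rev.select_one("a.user") or {}).get("href", ""),
                "review_upvotes": _text(rev, ".likesCount"),
                "review_downvotes": "",       # not exposed in the assumed markup
                "review_date": _text(rev, "a.reviewDate"),
            }
        time.sleep(delay)                     # polite spacing between page requests
```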
Data Aggregation
o Combine the scraped data into a single flat CSV with these columns (an assembly sketch follows the column list):
book_id, author, [other original columns], review_text, review_rating, reviewer_ID, review_upvotes, review_downvotes, review_date, [any other metadata]
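A minimal assembly sketch, reusing the hypothetical find_book_url and scrape_reviews helpers from the sketches above, might flatten everything with pandas:

```python
# Sketch only: find_book_url and scrape_reviews are the hypothetical helpers
# from the lookup and pagination sketches above.
import pandas as pd


def build_output(books_csv: str = "goodreads_list.csv",
                 out_csv: str = "reviews_output.csv") -> None:
    """Flatten all scraped reviews into one CSV, carrying the original book columns."""
    books = pd.read_csv(books_csv)
    rows = []
    for _, book in books.iterrows():
        url = find_book_url(book["title"], book["author"])
        if url is None:
            continue                                   # log and skip unmatched titles
        for review in scrape_reviews(url):
            rows.append({**book.to_dict(), **review})  # original columns + review metadata
    pd.DataFrame(rows).to_csv(out_csv, index=False)
```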
Benchmark & Compliance
o Use the UCSD Goodreads dataset description as inspiration:
https://cseweb.ucsd.edu/~jmcauley/datasets/goodreads.html
o Your scraper must respect Goodreads’ robots.txt and implement polite rate-limiting (1–2 sec/request); a compliance sketch follows below.
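A minimal compliance sketch, assuming a requests.Session and a hypothetical project-specific User-Agent, consults robots.txt once and pauses between requests:

```python
# Sketch only: the User-Agent string is a hypothetical project identifier.
import time
import urllib.robotparser

import requests

USER_AGENT = "course-project-scraper"

# Fetch and parse robots.txt once at startup.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.goodreads.com/robots.txt")
rp.read()


def polite_get(session: requests.Session, url: str, min_delay: float = 1.5, **kwargs):
    """Fetch url only if robots.txt allows it, then pause before returning."""
    if not rp.can_fetch(USER_AGENT, url):
        raise PermissionError(f"robots.txt disallows fetching {url}")
    resp = session.get(url, headers={"User-Agent": USER_AGENT}, timeout=30, **kwargs)
    time.sleep(min_delay)                 # keep 1-2 seconds between requests
    return resp
```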
Deliverables
1. Code
o A well-documented Python script or Jupyter notebook (.py or .ipynb) that performs:
§ Book lookup on Goodreads
§ Paginated review extraction
§ CSV assembly
2. Data Output
o A single reviews_output.csv containing all rows and columns specified above.
3. Methodology Write-Up
o Book matching: Explain how you converted each title/author into a Goodreads URL (e.g., search API vs. HTML scraping, string-matching logic).
o Review navigation: Describe how your script discovers and iterates through review pages.
o Challenges & Solutions: Note any anti-scraping measures encountered (e.g., dynamic loading, CAPTCHAs) and how you addressed them.
o GenAI usage: If you used generative AI tools (Claude, ChatGPT, etc.), please describe which parts of your work used AI assistance.
Evaluation Criteria
· Accuracy: Correct URL matching and complete review coverage.
· Robustness: Handles pagination, missing data, and network errors gracefully.
· Code Quality: Readability, modularity, and inline comments.
· Documentation: Clarity of the methodology write-up.
· Compliance: Polite scraping practices and respect for site policies.