Sign In

4 Ways to Crawl Websites Using Google Sheets

황철희
Category
Empty
These days, especially because of AI, it seems that more information gathering is needed. It is not a special technology, but it is a difficult, annoying, and inconvenient task to ask developers to do. However, the method of crawling websites using Google Sheets and GPT is easy to do even if you do not know how to develop, so it is widely used. This article summarizes the four representative methods of crawling websites using Google Sheets and GPT and their advantages.

1. Using the IMPORTXML function

You can retrieve static HTML data from a webpage using XPath.
=IMPORTXML("웹페이지 URL", "XPath 표현식")
Example
Get Naver Finance gold price data:
=IMPORTXML("https://finance.naver.com/marketindex/", "//span[@class='value']")

Merit

• Data can be easily retrieved.
•Can be used directly within Google Sheets.

Disadvantage

•Data dynamically loaded with JavaScript cannot be retrieved.
•You must understand XPath.

2. Using the IMPORTHTML function

You can import HTML table or list data.

How to use

=IMPORTHTML("웹페이지 URL", "table" 또는 "list", 테이블/목록 번호)
Example
Get table data from Wikipedia:
=IMPORTHTML("[https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)](https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations))", "table", 1)

3. Using the IMPORTDATA function

Import CSV or TSV data.

How to use

=IMPORTDATA("데이터 URL")
Example
Importing CSV file data from Google Finance:
=IMPORTDATA("http://www.example.com/data.csv")

4. Using Google Apps Script

You can parse dynamic data from HTML and insert it into Google Sheets using JavaScript.
How to use
1.
Open Apps Script :
•In Google Sheets, click Extensions > Apps Script.
2.
Writing a script :
Example: Code that fetches data from a specific webpage.
function fetchData() {
  const url = "https://m.stock.naver.com/marketindex/metals/M04020000";
  const response = UrlFetchApp.fetch(url);
  const html = response.getContentText();

  // 정규식으로 데이터 추출
  const regex = /<strong class="DetailInfo_price__InDYQ">([\d,]+)<\/strong>/;
  const match = html.match(regex);

  if (match && match[1]) {
    const price = match[1].replace(/,/g, ""); // 쉼표 제거
    const sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("금시세") || SpreadsheetApp.getActiveSpreadsheet().insertSheet("금시세");
    sheet.getRange(1, 1).setValue(`금 시세: ${price}`);
  } else {
    Logger.log("데이터를 가져오지 못했습니다.");
  }
}
3.
Run the script :
•Runs a function written in Apps Script (e.g. fetchData).
•Data is updated in Google Sheets.

Merit

•Can process dynamic HTML data.
•Complex data parsing is possible.

Disadvantage

•Knowledge of JavaScript and regular expressions is required.
•Maintenance may be required.

5. Use external API

If an API is provided instead of a webpage, data can be retrieved more reliably.
How to use
1.
Writing Apps Script :
function fetchAPIData() {
  const url = "https://api.example.com/data";
  const options = {
    method: "get",
    headers: {
      "Authorization": "Bearer YOUR_API_KEY"
    }
  };
  const response = UrlFetchApp.fetch(url, options);
  const data = JSON.parse(response.getContentText());
  const sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("API 데이터") || SpreadsheetApp.getActiveSpreadsheet().insertSheet("API 데이터");

  sheet.getRange(1, 1).setValue(data.someKey);
}
2.
Run the script :
•When you run the above function, the API data is reflected in Google Sheets.

Merit

•Data can be retrieved stably.
•Can dynamically update the latest information.

Disadvantage

•API key or authentication may be required.
Comparative Summary
Suggestion
Static data : Use IMPORTXML or IMPORTHTML.
Dynamic data : Leveraging Apps Script.
API provided data : Using an external API is the most reliable method.
I've tried several other methods, but using Google Sheets seems to be the easiest way to crawl without any code.
Subscribe to '황철희 홈페이지'
Subscribe
👍