• Getting Data From a Website Using Selenium

    %%shell

    cat > /etc/apt/sources.list.d/debian.list <<'EOF'
    deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
    deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
    deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
    EOF

    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A

    apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
    apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
    apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg

    cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
    Package: *
    Pin: release a=eoan
    Pin-Priority: 500

    Package: *
    Pin: origin "deb.debian.org"
    Pin-Priority: 300

    Package: chromium*
    Pin: origin "deb.debian.org"
    Pin-Priority: 700
    EOF

    apt-get update
    apt-get install chromium chromium-driver

    import time

    import pandas as pd
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException, TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    def web_driver():
        # headless Chromium configured for a container environment such as Colab
        options = webdriver.ChromeOptions()
        options.add_argument("--verbose")
        options.add_argument("--no-sandbox")
        options.add_argument("--headless")
        options.add_argument("--disable-gpu")
        options.add_argument("--window-size=1920,1200")
        options.add_argument("--disable-dev-shm-usage")
        driver = webdriver.Chrome(options=options)
        return driver

    url = "https://www.imdb.com/title/tt3371366/reviews?ref_=tt_urv"

    driver = web_driver()
    driver.get(url)

    while True:
        try:
            load_more_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.ID, "load-more-trigger")))
            load_more_button.click()
            time.sleep(2)  # wait for the reviews to load
        except TimeoutException:
            # WebDriverWait raises TimeoutException once the button is no longer clickable
            break

    reviews = []
    ratings = []
    kept_elements = []  # review elements that had both a text and a rating
    review_elements = driver.find_elements(By.CLASS_NAME, "lister-item-content")
    for review in review_elements:
        try:
            # extract the text and rating of the review
            content = review.find_element(By.CLASS_NAME, "content")
            text = content.find_element(By.CLASS_NAME, "text").text
            rating = content.find_element(By.CLASS_NAME, "ipl-rating-star__rating").text
            reviews.append(text)
            ratings.append(rating)
            kept_elements.append(review)
        except NoSuchElementException:
            # skip over the review if any element is not found
            continue

    reviews_list = []
    # pair each kept element with its own text and rating; reviews.index(review)
    # would point at the wrong element whenever an earlier review was skipped
    for element, review, rating in zip(kept_elements, reviews, ratings):
        try:
            title = element.find_element(By.CLASS_NAME, "title").text
            reviews_list.append({"title": title, "rating": rating, "text": review})
        except NoSuchElementException:
            continue

    imdb_reviews = pd.DataFrame(reviews_list)

    import os

    directory = 'drive/MyDrive/IMDB'
    if not os.path.exists(directory):
        os.makedirs(directory)

    imdb_reviews.to_csv('drive/MyDrive/transformers_reviews.csv')

    import numpy as np

    # treat empty reviews as missing values and drop them
    imdb_reviews['text'] = imdb_reviews['text'].replace("", np.nan)
    imdb_reviews.dropna(inplace=True)
    imdb_reviews['text'] = imdb_reviews['text'].str.lower()
    spec_chars = ["±", "@", "#", "$", "%", "^",
                  "&", "*", "(", ")", "_", "+", "=",
                  "-", "/", ">", "<", "?",
                  "~", "`", "'", "[", "]", "|", "}",
                  "{", '"', ".", ",", "!", ";"]

    for char in spec_chars:
        # regex=False so that characters such as "(" or "$" are matched literally
        imdb_reviews["text"] = imdb_reviews["text"].str.replace(char, "", regex=False)
    imdb_reviews['text'] = imdb_reviews['text'].str.replace('\n', "", regex=False)
    # drop non-ASCII characters; the result has to be assigned back to take effect
    imdb_reviews["text"] = imdb_reviews["text"].apply(lambda x: x.encode('ascii', 'ignore').decode('ascii'))
    imdb_reviews["text"] = imdb_reviews["text"].str.replace(r'\d+', "", regex=True)  # Remove numbers using regex
    imdb_reviews["rating"] = imdb_reviews["rating"].apply(lambda x: x.split("/")[0])

    This code is a Python script for scraping reviews of a movie from the IMDB website using Selenium. The reviews are saved in a CSV file and preprocessed using Numpy and Pandas.

    The script starts by writing the Debian Buster repositories to a file in the sources.list.d directory with cat and importing their signing keys with apt-key. It then writes an apt preferences file that gives the Debian repository priority for the chromium packages, and installs chromium and chromium-driver using apt-get commands.

    Next, the script imports the necessary libraries for scraping and pre-processing the reviews. It defines a function, web_driver(), to initialize a Selenium web driver with the required options. It then sets the URL of the movie reviews page and opens it using the web driver.

    The script clicks the “Load More” button repeatedly, until the button is no longer available, to load all the reviews on the page. It then extracts the text and rating of each review using Selenium and stores them in two lists, reviews and ratings.

    The script then creates a list of dictionaries, reviews_list, to store the extracted reviews, ratings, and titles. It uses a loop to iterate over the reviews and ratings lists, extracts the title of each review, and appends a dictionary with the title, rating, and text to reviews_list.

    The script converts the reviews_list to a Pandas DataFrame, imdb_reviews, and saves it to a CSV file using the to_csv method. It then preprocesses the reviews by removing special characters, newlines, and numbers using Numpy and Pandas methods.

    Finally, the preprocessed reviews are saved to the ‘text’ column of the imdb_reviews DataFrame, which can be used for further analysis.
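
    The cleaning steps can also be tried on a single string without pandas. The following is a minimal sketch using only the standard library; the function name `clean_text` and the sample review are made up for illustration, and the character set is assumed to match `spec_chars` in the script.

```python
import re

# Assumed to be the same characters as spec_chars in the script
SPEC_CHARS = "±@#$%^&*()_+=-/><?~`'[]|}{\".,!;"

def clean_text(text):
    """Apply the script's cleaning steps to one review string."""
    text = text.lower()
    # delete every special character in one pass
    text = text.translate(str.maketrans("", "", SPEC_CHARS))
    # newlines are deleted, not replaced by a space
    text = text.replace("\n", "")
    # drop non-ASCII characters such as accented letters
    text = text.encode("ascii", "ignore").decode("ascii")
    # remove runs of digits
    return re.sub(r"\d+", "", text)

print(clean_text("Loved it!\nA solid 9/10 (café scene was great)."))
```

    Because newlines are deleted rather than replaced by a space, the words on either side of a line break end up joined, which is worth keeping in mind before tokenizing the text.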

  • Programmatic Interpretation of the Big Mac Index and Tall Latte Index

    Jiseok Oh

    Background information

     <About Big Mac index>

    “The Big Mac Index is published by The Economist as an informal way of measuring the purchasing power parity (PPP) between two currencies and provides a test of the extent to which market exchange rates result in goods costing the same in different countries. It ‘seeks to make exchange-rate theory a bit more digestible.’ The index, created in 1986, takes its name from the Big Mac, a hamburger sold at McDonald’s restaurants.”

     For example, a Big Mac costs 4,500 won in Korea, which is 3.84 dollars, and 5.3 dollars in the US. With the exchange rate between Korea and the US given as 1,145 won per dollar, the won is 27.5% (1 - 3.84/5.3) undervalued against the dollar.

     <About Tall Latte index>

    The Tall Latte Index, like the Big Mac Index, compares the prices of Starbucks Tall Lattes, which are sold around the world, to gauge a country’s economic level. The calculation method is the same as for the Big Mac index.

    Purpose 

     The goal is to implement the Big Mac index and the Tall Latte index in a program and to verify its results against the published indexes.

    Programming Design and Practice in C for Currency-Related Indexes

    Key assumption

     Currency values rise and fall in real time.

     Although currency values change in real time, each exchange rate can be expressed against the key currency, the dollar.

     As currency-related indexes, the Big Mac Index and Tall Latte Index can be used to reflect currency value according to exchange rates.

    Programming Procedure

     <Search equation of Big Mac index>

      Values

    • MU : Monetary unit for which the user wants to know the Big Mac index
    • MC : Big Mac price in the country whose Big Mac index the user wants to know
    • ER : Exchange Rate
    • 5 : US Big Mac price

      Equation

    • Big Mac index = $\frac{MC}{ER \times 5}$

    <Search equation of Tall Latte index>

     Values

    • MU : Monetary unit for which the user wants to know the Tall Latte index
    • MC : Tall Latte price in the country whose Tall Latte index the user wants to know
    • ER : Exchange Rate
    • 3.85 : US Tall Latte price

     Equation

    • Tall Latte index = $\frac{MC}{ER \times 3.65}$
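
    Both equations have the same shape, so they can be sketched as one function. This is an illustrative Python sketch, not the C program described in this report: it reads each equation as the local price MC divided by the product of ER and the US price, and it takes the US price as a parameter instead of fixing it. The names `price_index` and `describe` are hypothetical, and the sample figures are the South Korean values quoted in the Result section.

```python
def price_index(mc, er, us_price):
    """Return MC / (ER * US price): the local price in dollars over the US price."""
    return mc / (er * us_price)

def describe(index):
    """Express an index as a percentage over or under the dollar."""
    percent = abs(index - 1) * 100
    direction = "overvalued" if index >= 1 else "undervalued"
    return f"{percent:.1f}% {direction}"

# 4,500 won per Big Mac, 1,097.35 won per dollar, US price of 5.66 dollars
index = price_index(4500, 1097.35, 5.66)
print(round(index, 4), describe(index))
```

    An index of exactly 1 means the item costs the same in both countries once converted to dollars.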

    <Make Big Mac index and Tall Latte index calculator with C language>

     Values

    • reply : Records the user’s reply
    • MU : Monetary unit for which the user wants to know the Big Mac index or Tall Latte index
    • MC : Big Mac or Tall Latte price in the country whose index the user wants to know
    • ER : Exchange Rate
    • FIN : Final calculated value

    Functions

      BIG : Calculate Big Mac index

      TALL : Calculate Tall Latte index

      Main : Prints the questions for the user and calls function BIG or TALL

    Final code

      <Function Main>

    Figure 1. Main Function

    In Main (Figure 1), the program asks the user a few questions. The user answers with 1 or 2; any other character produces an error message.

    The reply variable records the user’s answer, which determines whether the Big Mac index calculator or the Tall Latte index calculator is used. Once everything in the Main function is done, it calls the BIG function or the TALL function.

    <Function BIG>

    If the user selects the Big Mac route in the main function, the BIG function (Figure 2) is called. The BIG function provides the Big Mac index calculator and asks a few questions to obtain the Big Mac index.

    After calculating the Big Mac index, the BIG function reports the currency as overvalued against the dollar if the index is 1 or more, and as undervalued if it is less than 1.

    Figure 2. BIG Function

    <Function TALL>

    If the user selects the Tall Latte route in the main function, the TALL function (Figure 3) is called. Like the BIG function, the TALL function derives the Tall Latte index after a few questions.

    Like the BIG function, the TALL function produces a different output depending on whether the Tall Latte index is 1 or more, or less than 1.

    Figure 3. TALL Function
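
    The control flow described for Main, BIG, and TALL can be sketched as follows. This is a Python illustration of the structure only, not the C code from the figures: the answers are passed in as arguments so the sketch runs without prompting, the helper `report` is hypothetical, the message wording follows the calculator output quoted in the Result section, and the fixed US prices are assumptions taken from the equations (5 for the Big Mac, 3.65 for the Tall Latte).

```python
US_BIG_MAC = 5.0       # assumed fixed US Big Mac price, from the Big Mac equation
US_TALL_LATTE = 3.65   # assumed fixed US Tall Latte price, from the Tall Latte equation

def report(mu, index):
    """Format the result message for monetary unit mu."""
    percent = abs(index - 1) * 100
    direction = "higher" if index >= 1 else "lower"
    return f"{mu} was esteemed {percent:.4f}% {direction} than USD"

def BIG(mu, mc, er):
    """Big Mac index calculator."""
    return report(mu, mc / (er * US_BIG_MAC))

def TALL(mu, mc, er):
    """Tall Latte index calculator."""
    return report(mu, mc / (er * US_TALL_LATTE))

def main(reply, mu, mc, er):
    """Dispatch on the user's reply: "1" for Big Mac, "2" for Tall Latte."""
    if reply == "1":
        return BIG(mu, mc, er)
    if reply == "2":
        return TALL(mu, mc, er)
    return "Error: please answer with 1 or 2"

print(main("1", "KRW", 4500, 1097.35))
print(main("3", "KRW", 4500, 1097.35))
```

    Keeping the message formatting in one helper means BIG and TALL differ only in the US price they divide by.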

    Result

     Big Mac index calculating

    Real index

    A Big Mac costs 4,500 won in South Korea and US$5.66 in the United States. The implied exchange rate is 795.05. The difference between this and the actual exchange rate, 1,097.35, suggests the South Korean won is 27.5% undervalued.
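
    The implied exchange rate is the rate at which a Big Mac would cost the same in both countries, so the figures above can be checked with two divisions. The function names in this short Python sketch are hypothetical.

```python
def implied_exchange_rate(local_price, us_price):
    """Exchange rate at which the item would cost the same in both countries."""
    return local_price / us_price

def valuation_percent(implied_rate, actual_rate):
    """Negative values mean the local currency is undervalued against the dollar."""
    return (implied_rate / actual_rate - 1) * 100

implied = implied_exchange_rate(4500, 5.66)
print(round(implied, 2))                               # 795.05
print(round(valuation_percent(implied, 1097.35), 1))   # -27.5
```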

    In Big Mac index calculator

    KRW was esteemed 27.5247% lower than USD

    Tall Latte index calculating

    Real index

    Tall Latte index in South Korea : 7% overvalued against USD

    In Tall Latte index calculator

    KRW was esteemed 7.3332% higher than USD

    Discussion

     The current program uses fixed US prices for the Big Mac and the Tall Latte. How can the program be developed if these prices change? Whenever the US prices change, the program has to be modified, and it makes little sense to keep editing a fixed value. The US prices of the Big Mac and Tall Latte can therefore be taken as input variables. In the future, by connecting to a real-time database, the program could display the Big Mac Index and Tall Latte Index when the user simply types a country name.
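
    One way this could look is sketched below in Python. The table `BIG_MAC_TABLE` is a hypothetical in-memory stand-in for the real-time database, holding only the South Korean figures from the Result section, and the US price is an argument rather than a constant.

```python
# Hypothetical stand-in for a real-time database.
# Each entry: (monetary unit, local Big Mac price, local currency per dollar).
BIG_MAC_TABLE = {
    "South Korea": ("KRW", 4500, 1097.35),
}

def big_mac_index(country, us_price, table=BIG_MAC_TABLE):
    """Look up a country by name and compute its Big Mac index."""
    if country not in table:
        raise KeyError(f"no price data for {country!r}")
    mu, mc, er = table[country]
    return mu, mc / (er * us_price)

mu, index = big_mac_index("South Korea", us_price=5.66)
print(mu, round(index, 4))
```

    With this layout, a change in the US price or the addition of a country only touches the data, not the calculation.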

    Conclusion

      This experiment showed that the value of money can be compared using a specific product. In modern society, the exchange rate changes in real time, but the prices of goods do not. Therefore, by comparing the price of a product with the exchange rate, you can see in which currency a product is cheaper to buy.

     This applies especially to software. Because software is downloaded rather than physically bought, it is easier to benefit from exchange rates. In one past case, Microsoft sold Windows at a 90% discount in the Czech Republic, and people from other countries connected to the Czech server to buy Windows. In this way, software can be purchased more easily and efficiently, without physically traveling.

  • Hello!

    Welcome to Ben’s Science Research! This is my first post.