Saturday, April 6, 2024

Legal news scraping

The code snippet below helps with scraping data from the Legal News website.

Some steps require manual interaction, so the snippet is separated into notebook cells. The # ********** lines mark the beginning and end of each cell.

import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

url = 'https://www.legalnews.com/Home/Login'
pw = ''
un = ''

path = r"chromedriver.exe"

service = Service(executable_path=path)
driver = webdriver.Chrome(service=service)

driver.get(url)
time.sleep(2)

# Login...
driver.find_element(By.XPATH, '//*[@id="email"]').send_keys(un)
driver.find_element(By.XPATH, '//*[@id="password"]').send_keys(pw)
driver.find_element(By.XPATH, '//*[@id="btnlogin"]').click()

time.sleep(3)
driver.find_element(By.XPATH, '//*[@id="top-right"]/div/a[1]').click()
# **********************************************
# -------------------
# Manually click to switch the page to the table view... then run the next cell
# -------------------
# **********************************************

dfList = []
# **********************************************


html_data = driver.page_source
soup = BeautifulSoup(html_data, 'html.parser')

# Read table to df...
tb = pd.read_html(html_data)

# Extract URL of rows...
prod_title = soup.find_all('tr') # ['data-href']

noticeURL_list = []
for t in prod_title:
    try:
        noticeURL_list.append(f"https://www.legalnews.com{t['data-href']}")
    except Exception:
        pass
    
tb[1]['URL'] = noticeURL_list

# Make df a familiar dataframe... :)
df = tb[1]

dfList.append(df)
# **********************************************


# Click 'Next Page' btn...  *** CLICK ON PAGE 2 MANUALLY TO AVOID ERROR ***
i = 2
p = 2
for x in range(10):
    print(f'Clicking on page... {i}')
    
    driver.find_element(By.XPATH, f'//*[@id="divListView"]/div[1]/div[1]/a[{p}]').click()
    time.sleep(2)

    html_data = driver.page_source
    soup = BeautifulSoup(html_data, 'html.parser')

    # Read table to df...
    tb = pd.read_html(html_data)

    # Extract URL of rows...
    prod_title = soup.find_all('tr') # ['data-href']

    noticeURL_list = []
    for t in prod_title:
        try:
            noticeURL_list.append(f"https://www.legalnews.com{t['data-href']}")
        except Exception:
            pass

    tb[1]['URL'] = noticeURL_list

    # Make df a familiar dataframe... :)
    df = tb[1]
    dfList.append(df)
    i = i+1
    
print('Done...')

# **********************************************

df2 = pd.concat(dfList).drop_duplicates()
df2.to_excel('LegalNews__April2024_Table1.xlsx', index=False)
df2


That is it!

Wednesday, April 3, 2024

Downloading US Census Tracts

According to Wikipedia, a census tract (also known as a census area, census district, or meshblock) is a geographic region defined for the purpose of taking a census. Sometimes these coincide with the limits of cities, towns, or other administrative areas, and several tracts commonly exist within a county.

US census tracts can be downloaded from the United States Census Bureau, and they are provided in TIGER/Line format. TIGER stands for "Topologically Integrated Geographic Encoding and Referencing"; TIGER/Line is the format the Census Bureau uses to describe land attributes such as roads, buildings, rivers, and lakes, as well as areas such as census tracts.
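If you prefer to grab the tracts programmatically, here is a minimal sketch using geopandas. The TIGER2023 URL pattern and the Alabama state FIPS code '01' below are assumptions; check the Census Bureau download page for the vintage and state you need.

import urllib.request
import geopandas as gpd # pip install geopandas

# Assumed URL pattern for the 2023 TIGER/Line census tracts; '01' is Alabama's state FIPS code
tracts_url = "https://www2.census.gov/geo/tiger/TIGER2023/TRACT/tl_2023_01_tract.zip"
local_zip = "tl_2023_01_tract.zip"

urllib.request.urlretrieve(tracts_url, local_zip) # download the zipped shapefile
tracts_gdf = gpd.read_file(local_zip)             # geopandas reads the zip directly

print(tracts_gdf.shape)
tracts_gdf.plot(figsize=(8, 10))                  # quick preview of the state's tracts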



The state of Alabama will look like this:-


This map layer can then be joined with any tabular record for further analysis.
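Continuing from the sketch above, a tabular record keyed on the tract GEOID column can be merged onto the layer like so (the CSV file name and its extra columns are hypothetical):

import pandas as pd

# Hypothetical tabular file with a GEOID column matching the tract layer
stats_df = pd.read_csv("tract_statistics.csv", dtype={"GEOID": str})

# Attach the tabular attributes to the tract geometries
joined_gdf = tracts_gdf.merge(stats_df, on="GEOID", how="left")
print(joined_gdf.head(2))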

Saturday, March 30, 2024

Python script to Group and count zip code

The provided data is an Excel file, as seen below. The task is to group the table by the zip code column and count the number of records in each group.



import pandas as pd

f = r"C:\Users\path_to_file\fileName.xlsx"

zipcode_df = pd.read_excel(f)
# zipcode_df.head(2)

print(f"Number of unique zip codes: {len(zipcode_df['ZIP5'].unique())}")

# Group the dataframe by the zip code column using the groupby() method
group_df = zipcode_df.groupby('ZIP5')

# Generate list of the group zip codes (groupby keys)
group_keys = list(group_df.groups.keys())

# Loop over the keys and save each group to excel file
for s in group_keys:
    save_df = group_df.get_group(s)
    
    print(s, save_df.shape[0])
    
    # make the output file name, e.g: "Colorado Zipcodes\<zip code>.xlsx"
    s_name = 'Colorado Zipcodes\\' + str(s) + '.xlsx'
    save_df.to_excel(s_name, index=None)
    
    # break
    
print('Done')

Further analysis can then continue from here.
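For instance, the record count per zip code can also be written out as a single summary spreadsheet (the output file name below is just a placeholder):

# Count the number of records per zip code and save the summary to one file
counts_df = zipcode_df.groupby('ZIP5').size().reset_index(name='RecordCount')
counts_df.to_excel('zipcode_counts.xlsx', index=False) # placeholder output name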

That is it!

Saturday, March 9, 2024

Python API for Generative AI using Google's Gemini and OpenAI ChatGPT

The general working principle of both APIs is to send a 'prompt' and get back a 'response'.


Gemini Python API

Generate your API key from https://aistudio.google.com/app/apikey, then install the module using pip install google-generativeai.


Import and configure it like so:-

# Get your API key here: https://aistudio.google.com/app/apikey

import google.generativeai as genai # pip install google-generativeai

gemini_apiKey = 'Your_API_Key' 
genai.configure(api_key=gemini_apiKey)

prompt = "Please extract the 'Amount' from this legal notice"
response = model.generate_content(prompt).text



ChatGPT Python API

First you need to upgrade your account at https://chat.openai.com/, then generate an API key from https://platform.openai.com/api-keys

To install the OpenAI module, use pip: pip install openai

# Get your API key here: https://platform.openai.com/api-keys
# Upgrade your OpenAI account: https://chat.openai.com/

from openai import OpenAI # pip install openai



openai_key = 'YourAPIKey'


client = OpenAI(
    api_key=openai_key
)

prompt = "Whats the most popular ski resort in Europe?"

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt
        },
    ],
    model="gpt-3.5-turbo"
)

print(chat_completion.choices[0].message.content)


Thanks for reading.

Sunday, February 25, 2024

Using pyTesseract to extract text from picture

Here we have screenshots of text from web pages, as seen below.

We need to extract specific text from the images (in this case, lines that contain the 'Address' or 'Location' strings), so we make use of PIL and pytesseract.




import glob
import pytesseract
from PIL import Image
# Download tesseract.exe: https://github.com/UB-Mannheim/tesseract/wiki
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


# Extract text from image...
images = glob.glob(fr'C:\Users\`HYJ7\Documents\Jupyter_Notebooks\Naveda Company Scrapping\imgs\*.png')
len(images)

# Read image using PIL and extract text using pyTesseract
# Read img...
image = Image.open(images[67]) # pick one screenshot from the list
# Extract text...
extracted_text = pytesseract.image_to_string(image)
clean_txt = extracted_text.strip().split('\n')

for c in clean_txt:
    if any(substring in c for substring in ['Address', 'Location']):
        print(c)

print('Done...')
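
To run the same extraction over every screenshot in the folder instead of a single one, a simple loop over the images list will do (same assumptions as above):

# Repeat the extraction for every screenshot in the folder
for img_path in images:
    text_lines = pytesseract.image_to_string(Image.open(img_path)).strip().split('\n')
    for line in text_lines:
        if any(substring in line for substring in ['Address', 'Location']):
            print(img_path, '->', line)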




That is it!


Wednesday, February 14, 2024

Making a fantasy contour map for background screens

If you have ever wanted to make custom contour images to use as your desktop background, as I usually do, then you have come to the right post.

I will show you how to make a background contour image like the one below without having to obtain X, Y, Z data and without using a sophisticated GIS tool.


The software we will use is the free graphics program called 'Inkscape'. Download and install it, then follow these steps:-

  1. Launch the Inkscape software
  2. Draw a shape (example: rectangle, square, circle, polygon etc)
  3. Go to 'Fill and Stroke'
  4. Select 'Pattern Fill' >> 'Nature patterns'
  5. Select the pattern named 'Terrain'


Then play around with Scale X, Scale Y, Orientation, Offset X, Offset Y, Gap X, Gap Y, Color, etc.






The final result will depend on your creativity.

Thank you for reading!

Friday, February 9, 2024

Extract images from PDF file

The code snippet below will read a PDF file and extract the images on every page into a folder.

The file names are structured like so: Image-{x}_{index}.png, where x is the PDF page number and index is a counter that increments to keep the names unique within each page.

from spire.pdf.common import *
from spire.pdf import *


# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile(r"dermatology-atlas-for-skin-color_compress.pdf")

for x in range(0, 305): # 305 is the expected number of pages
    print('Processing...', x)
    # Get a specific page
    page = doc.Pages[x]

    # Extract images from the page
    images = []
    for image in page.ExtractImages():
        images.append(image)

    # Save images to specified location with specified format extension
    index = 0
    for image in images:
        imageFileName = f'image_for_PDF/Image-{x}_{index}.png'
        index += 1
        image.Save(imageFileName, ImageFormat.get_Png())
        
doc.Close()
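
One thing to watch: image.Save() will typically fail if the output folder does not exist, so create it beforehand (assuming the same image_for_PDF folder name used above):

import os
os.makedirs('image_for_PDF', exist_ok=True) # make sure the output folder exists before running the extraction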


The output result of the PDF file: dermatology-atlas-for-skin-color_compress.pdf is as shown below:-

That is it!

Monday, February 5, 2024

Algorithm to Shift last two elements of a python list and pad in between

In this scenario, I have a list that should always contain six items. If it contains fewer than six items, always move the last two items to the end and pad the space in between with empty strings.

For example, this list ['A', 'B', 'E', 'F'] will become: ['A', 'B', '', '', 'E', 'F']

The code:

# Shift last two elements of a list and pad in between with a '<empty string>'
input_list = ['A', 'B', 'E', 'F']
print('Original list: ', input_list)

last_two_elem = input_list[-2:] # Get last two elements
del input_list[-2:] # Delete last two elements
print(f'Last two elements of the list: {last_two_elem}, the list after removing last two elements: {input_list}')

# Pad the list up to the target length minus the two saved elements...
pad_value = 4 # for a 6-item target list: 6 - 2 saved elements
input_list += [''] * (pad_value - len(input_list))
print(f'The list after padding: {input_list}')

# Extend the list with the last two elements...
input_list.extend(last_two_elem)
print(f'The list after extending: {input_list}')
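
Wrapped up as a reusable function, the same idea looks like this (a sketch that assumes the target length is always six):

def shift_and_pad(items, target_len=6):
    """Move the last two items to the end of a list padded to target_len."""
    last_two = items[-2:]                        # save the last two elements
    head = items[:-2]                            # everything before them
    head += [''] * (target_len - 2 - len(head))  # pad the middle with empty strings
    return head + last_two

print(shift_and_pad(['A', 'B', 'E', 'F'])) # ['A', 'B', '', '', 'E', 'F']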



That is it!

Tuesday, December 5, 2023

Saving screenshot of web element using selenium

It is common to save screenshots of web pages using Selenium, but is it really possible to save a screenshot of a single web element on a page?

Recently, I found myself in a situation where I had to save a screenshot of an image map from several web pages.


Let's see how it is done.

I want to save the screenshot of the SVG map on this web page as seen below. 


There are a couple of ways to get this done, including screenshotting the entire browser window or downloading the SVG map to a local directory. None of these suit this specific use case, so we have to screenshot just the map element and save it locally as a PNG image file. This is exactly what we want and, fortunately, Selenium can do it for us as seen below:-

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
url = 'https://www.europages.co.uk/EURO-EXPORT-PROJECTS/00000005495550-001.html'
path = r"chromedriver-win64\chromedriver.exe"

service = Service(executable_path=path)
driver = webdriver.Chrome(service=service)

driver.get(url)

# Take screenshot of entire window...
# driver.save_screenshot('ss.png')


# Take screenshot of just the map element....
canvas_map = driver.find_element(By.XPATH, '//*[@id="app"]/div[1]/div[1]/div/div[2]/div[1]/div/div[2]/section[5]/div[1]/h3')
canvas_map.screenshot('map_file_2.png')

The saved file looks like so...
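
Since the task involved several pages, the same two lines can sit inside a loop that reuses the driver opened above (the URL list and output file names below are just placeholders):

import time

# Placeholder list of pages to capture; reuse the driver created above
urls = [
    'https://www.europages.co.uk/EURO-EXPORT-PROJECTS/00000005495550-001.html',
]

for n, page_url in enumerate(urls):
    driver.get(page_url)
    time.sleep(2) # give the page a moment to render
    canvas_map = driver.find_element(By.XPATH, '//*[@id="app"]/div[1]/div[1]/div/div[2]/div[1]/div/div[2]/section[5]/div[1]/h3')
    canvas_map.screenshot(f'map_file_{n}.png') # one PNG per page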

Happy coding!