Figuring out a study plan with Web Scraping

Sankey Diagram Example

The first time I went to the University website to browse the courses, I had difficulty working out which course I should take first. Although the courses are grouped into categories, the prerequisites are not explicitly displayed. I had to click through each subject, read its description, and note the prerequisites on paper.

I think a simple list that shows each course name with its prerequisites would help students plan their studies, so I decided to scrape this data from the school website and create a more user-friendly visualization.

To do this, I will use BeautifulSoup, a widely used HTML parsing library, together with requests to download the pages.

import requests
from bs4 import BeautifulSoup

First, we want a list of all majors. I will start with all master's programs in the School of Computing and Digital Media.

BASE_URL = 'https://www.cdm.depaul.edu'
link = BASE_URL + '/academics/Pages/MastersDegrees.aspx'

r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')

links = []

# Programs with several concentrations list them in a dropdown menu.
for ul in soup.find_all("ul", {"class": "dropdown-menu"}):
    for a in ul.find_all('a', href=True):
        links.append(BASE_URL + a['href'])

# Programs without a dropdown link to their requirements page directly.
for item in soup.find_all("a", {"class": "btn-requirements"}):
    href = item.get('href', '')
    if href.startswith('/academics/Pages/Current/Requirements-'):
        links.append(BASE_URL + href)

links

Here is the list of URLs of the requirements pages:

https://www.cdm.depaul.edu/academics/Pages/Current/Requirements-MA-In-Animation-Motion-Graphics.aspx
https://www.cdm.depaul.edu/academics/Pages/Current/Requirements-MA-In-Animation-Technical-Artist.aspx
https://www.cdm.depaul.edu/academics/Pages/Current/Requirements-MA-In-Animation-Traditional-Animation.aspx
https://www.cdm.depaul.edu/academics/Pages/Current/Requirements-MA-In-Animation-3D-Animation.aspx
https://www.cdm.depaul.edu/academics/Pages/Current/Requirements-MS-in-Cybersecurity-Computer-Security.aspx
https://www.cdm.depaul.edu/academics/Pages/Current/Requirements-MS-in-Cybersecurity-Compliance.aspx
https://www.cdm.depaul.edu/academics/Pages/Current/Requirements-MS-in-Cybersecurity-Networking-and-Infrastructure.aspx
https://www.cdm.depaul.edu/academics/Pages/Current/Requirements-MS-In-Data-Science-Computational-Methods.aspx
...
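
The link-building step can also be tried without touching the network. The sketch below is only an illustration: `to_requirement_urls` and the two constants are hypothetical names, and it assumes every `href` is site-relative, as in the list above. It keeps only requirement pages and drops repeated links while preserving their order.

```python
from urllib.parse import urljoin

# Assumed values, taken from the URLs shown above.
BASE_URL = "https://www.cdm.depaul.edu"
REQUIREMENTS_PREFIX = "/academics/Pages/Current/Requirements-"


def to_requirement_urls(hrefs):
    """Return absolute requirement-page URLs, de-duplicated in first-seen order."""
    seen = set()
    urls = []
    for href in hrefs:
        # Skip anything that is not a requirements page.
        if not href.startswith(REQUIREMENTS_PREFIX):
            continue
        url = urljoin(BASE_URL, href)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Using `urljoin` instead of string concatenation avoids doubled or missing slashes if the base URL ever gains a trailing slash.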

Next, we get the list of subjects for one program:

links = []
subjects = []
link = 'https://www.cdm.depaul.edu/academics/Pages/Current/Requirements-MS-In-Data-Science-Computational-Methods.aspx'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
# Each course code (e.g. "CSC 555") sits in a table cell with this class.
tds = soup.find_all("td", {"class": "CDMExtendedCourseInfo"})
for td in tds:
    subject, courseno = td.text.split()
    # Build the course catalogue URL for this course.
    links.append('https://www.cdm.depaul.edu/academics/pages/courseinfo.aspx?Subject=' + subject + '&CatalogNbr=' + courseno)
    subjects.append(td.text)
subjects

['IT 403', 'CSC 412', 'CSC 401', 'DSC 423', 'CSC 555', 'CSC 521', 'CSC 575', 'CSC 578', 'DSC 425', 'DSC 433', 'CSC 452', 'DSC 465', 'DSC 478', 'CSC 481', 'CSC 482', 'DSC 480', 'CSC 521', 'CSC 528', 'DSC 540', 'CSC 543', 'CSC 555', 'CSC 575', 'CSC 576', 'CSC 577', 'CSC 578', 'CSC 594', 'CSC 598', 'DSC 484', 'GEO 441', 'GEO 442', 'GPH 565', 'HCI 512', 'IPD 451', 'IS 549', 'IS 550', 'IS 574', 'IS 578', 'MGT 559', 'MKT 555', 'MKT 530', 'MKT 534', 'MKT 595']
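
Some courses appear more than once in this list (for example CSC 555 and CSC 578). A minimal sketch of removing the repeats while keeping the page order, and of building the catalogue URL with proper query encoding, could look like this. `unique_courses` and `course_url` are hypothetical helper names, not part of the original code.

```python
from urllib.parse import urlencode

# Catalogue endpoint as used in the code above.
COURSE_INFO_URL = "https://www.cdm.depaul.edu/academics/pages/courseinfo.aspx"


def unique_courses(courses):
    """Drop repeated course codes while keeping first-seen order."""
    return list(dict.fromkeys(courses))


def course_url(course):
    """Build the catalogue URL for a code such as 'CSC 555'."""
    subject, catalog_nbr = course.split()
    query = urlencode({"Subject": subject, "CatalogNbr": catalog_nbr})
    return COURSE_INFO_URL + "?" + query
```

`dict.fromkeys` keeps insertion order, so the result stays in the order the courses appear on the page, which a plain `set` does not guarantee.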

Now we can query the course catalogue for the prerequisites.

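def get_course_req(link):
    """Return the unique course codes listed on a program's requirements page."""
    subjects = []
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    print(soup.title.text.strip())
    tds = soup.find_all("td", {"class": "CDMExtendedCourseInfo"})
    for td in tds:
        subject, courseno = td.text.split()
        # Note: this appends to the module-level `links` list defined earlier.
        links.append('https://www.cdm.depaul.edu/academics/pages/courseinfo.aspx?Subject=' + subject + '&CatalogNbr=' + courseno)
        subjects.append(td.text)
    # set() removes duplicates but does not preserve the page order.
    return list(set(subjects))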

link='https://www.cdm.depaul.edu/academics/Pages/Current/Requirements-MS-In-Computational-Finance.aspx'
get_course_req(link)

def get_prereq(subjects):
    """Return [course, prerequisites] pairs for each course code in subjects."""
    PRE = []
    for subject in subjects:
        Subject, CatalogNbr = subject.split()
        link = 'https://www.cdm.depaul.edu/academics/pages/courseinfo.aspx?Subject=' + Subject + '&CatalogNbr=' + CatalogNbr
        r = requests.get(link)
        soup = BeautifulSoup(r.text, 'html.parser')
        pageContent = soup.find("div", {"class": "pageContent"})
        # Guard against pages that lack the expected div.
        paragraph = pageContent.find('p') if pageContent else None
        prereq = []
        if paragraph:
            text = paragraph.text.lower()
            start = text.rfind("prerequisite")
            # rfind returns -1 when the word is absent; 0 is a valid match.
            if start >= 0 and 'none' not in text[start:].split():
                # prereq_parser is not shown in this post.
                prereq = prereq_parser(text[start:].upper(), subjects)
        PRE.append([subject, prereq])
    return PRE
                
print(get_prereq(['CSC 540','CSC 471','CSC 242']))
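
The text-handling part of `get_prereq` can be tried on plain strings, without downloading anything. The sketch below uses a hypothetical helper, `extract_prereq_clause`, that mirrors the `rfind("prerequisite")` approach. It assumes that a match at position 0 should count and that "none" may be followed by punctuation; both are guesses about how the catalogue text is written.

```python
def extract_prereq_clause(description):
    """Return the upper-cased text from the last 'prerequisite' onward, or None."""
    lowered = description.lower()
    start = lowered.rfind("prerequisite")
    if start == -1:
        # The description never mentions prerequisites.
        return None
    clause = lowered[start:]
    # Strip surrounding punctuation so that "none." still counts as "none".
    words = [word.strip(".,;:()") for word in clause.split()]
    if "none" in words:
        return None
    return clause.upper()
```

For example, `extract_prereq_clause("Prerequisite: None.")` returns `None`, while a description ending in "PREREQUISITE(S): CSC 471" yields that clause in upper case.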

This gives the prerequisites as a list of [course, prerequisites] pairs: [['CSC 540', ['CSC 471']], ['CSC 471', []], ['CSC 242', []]]
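
`get_prereq` calls `prereq_parser`, which is not defined in this post. The following is a hypothetical stand-in, not the original function: it assumes course codes look like two to four letters followed by three digits, and it keeps only the codes that are in the `subjects` list, which is consistent with the output above.

```python
import re

# Assumed course-code pattern, e.g. "CSC 471" or "IT 403".
COURSE_CODE = re.compile(r"\b([A-Z]{2,4})\s*(\d{3})\b")


def prereq_parser(text, subjects):
    """Return the course codes mentioned in text that are also in subjects."""
    known = set(subjects)
    found = []
    for subject, number in COURSE_CODE.findall(text.upper()):
        code = f"{subject} {number}"
        # Keep only courses from the program, and list each one once.
        if code in known and code not in found:
            found.append(code)
    return found
```

This sketch ignores the logic between codes ("and", "or", "or instructor consent"), so it only records which courses are mentioned, not how they combine.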

The final step is to visualize this in a Sankey diagram.
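
The post does not say which tool drew the diagram. Many Sankey tools take the links as parallel lists of source index, target index, and value (Plotly's `go.Sankey` is one example), so a minimal sketch of reshaping the `get_prereq` output into that form might look like this. `to_sankey_links` is a hypothetical helper, and giving every link a value of 1 is an assumption.

```python
def to_sankey_links(pre):
    """Convert [course, prerequisites] pairs into labels and indexed links."""
    labels = []
    index = {}

    def node_id(course):
        # Assign the next free index the first time a course is seen.
        if course not in index:
            index[course] = len(labels)
            labels.append(course)
        return index[course]

    sources, targets, values = [], [], []
    for course, prereqs in pre:
        for prereq in prereqs:
            # Each link runs from the prerequisite to the course that needs it.
            sources.append(node_id(prereq))
            targets.append(node_id(course))
            values.append(1)
    return labels, sources, targets, values
```

Courses with no prerequisite links in either direction are left out, since they would appear as unconnected nodes.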

[Figure: Sankey diagram of prerequisite links. Each link runs from a prerequisite to the course that requires it.]

CSC 412 → CSC 578
DSC 478 → CSC 578
CSC 401 → DSC 465
IT 403 → DSC 465
CSC 401 → DSC 478
CSC 401 → CSC 452
IT 403 → DSC 423
IT 403 → HCI 512
DSC 423 → DSC 425
CSC 578 → DSC 540
DSC 484 → DSC 540
CSC 481 → CSC 482
DSC 478 → CSC 555
CSC 401 → CSC 555
CSC 412 → CSC 577
CSC 401 → DSC 433
IT 403 → DSC 433
IT 403 → DSC 484
CSC 481 → CSC 528
CSC 412 → CSC 481
DSC 423 → DSC 480