Been a year.

Now that I’m wrapping up my contract at Centre College, it’s time to get back to professional development (this job kept me BUSY for the past year!). I decided my first priority would be to find a way to wrap up that 538 project. It’s a shame: I worked so hard on it, and I do have something to show for it, but I haven’t managed to put it on a resumé or anything. Today, it seemed, I had to refamiliarize myself with everything.

Opened up Jupyter. Turns out my ProblemStatment file alone is self-contained and impressive enough. It’s the intro to the Erdös Institute Data Science Boot Camp project I tried to do last spring. The code scrapes metadata from the features pages on fivethirtyeight.com. The highlight of the code is the function I wrote to scrape the number of comments. I have a folder with output files; the most recent has the metadata from 1026 posts. It’s nice and neat, so I figure there must be a simple way to present this metadata on a webpage.
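The comment-count function was the fiddly part, so here’s something in its spirit. To be clear, the HTML sample and the `comment-count` class below are made up for illustration – they are not FiveThirtyEight’s actual markup, and the real function had to fetch live pages first.

```python
import re

# Hypothetical snippet of a post page's HTML -- not FiveThirtyEight's real markup
html = '<a class="comment-count" href="#comments">127 Comments</a>'

def scrape_comment_count(page_html):
    """Pull the integer comment count out of a post page's HTML."""
    match = re.search(r'class="comment-count"[^>]*>(\d+)\s+Comments', page_html)
    return int(match.group(1)) if match else 0

print(scrape_comment_count(html))  # 127
```

A regex is fine for a one-off scrape like this; for anything messier, a real HTML parser is the safer bet.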

Read More

Fixed some issues.

The other night I finally figured out what was wrong with the pictures on this blog! They weren’t loading in my posts, and then recently my avatar stopped loading, too. I had been meaning/trying to fix this for a long time. It turned out to be pretty easy to fix.

Read More

My first data challenge.

As part of a technical interview workshop with Erdös Institute, this morning I did a data challenge to present at the problem session tomorrow. The details and instructions are in the pdf file in my SpaceshipTitanicDataChallenge repository. I gave myself 3 hours to do it, and barely got through cleaning the data. Toward the end of the 3 hours, I realized and learned some things I could’ve done to get further along, but I just ran out of time. But it was interesting putting my Python abilities to the test in a timed setting. I don’t think I know Python as well as I thought I did. Or maybe I already had a realistic idea of how proficient I am – my application for the boot camp TA position was near perfect, in my opinion, but it also took 6 hours to complete (and Matt doesn’t know that). Now I want to try more data challenges to see if I can get better.

Read More

Personal website.

This past weekend I decided to make a personal website. Whenever I’ve had a professional website through an employer, I lost the site when I switched jobs, so I wanted a permanent place for all of my materials. I used GitHub Pages to host the site. Here is the url.

Read More

Subdirectory.

I am trying to change the url of my blog to wh33les.github.io/blog but I can’t figure it out. I followed the directions for Solution 2 on this page but it didn’t work. I also edited the _config.yml file by changing the url to https://wh33les.github.io/blog and baseurl to “/blog”. I also noticed in that file I have mathjax set to true, so I don’t know why I’ve also had problems rendering LaTeX in my blog posts.
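For reference, the relevant lines would look something like this, assuming a standard GitHub Pages Jekyll setup (exact keys can vary by theme):

```yaml
# _config.yml -- project site served at https://wh33les.github.io/blog
url: "https://wh33les.github.io"   # domain only, no path
baseurl: "/blog"                   # the subdirectory
```

One common gotcha is putting the subdirectory in `url` as well as `baseurl`: `url` is normally just the domain, and internal links then need to be prefixed with `{{ site.baseurl }}` to resolve correctly.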

Read More

Machine learning: $k$-nearest neighbors

I spent the last couple of days working on the application for the TA position at this May’s data science bootcamp. Most of it was very basic questions, like writing one-line commands and diagnosing and fixing code with errors. One question tested list comprehension (I still need to do a post on that – maybe tomorrow?). Then near the end was a problem where I had to apply the $k$-nearest neighbors ($k$NN) algorithm to some manufactured data! No machine learning background required, according to the application, since there was documentation for the Python module that would be used. It was challenging – like I said, I spent two days on it – but I did learn something.
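The application used an existing Python module, but the idea behind $k$NN is simple enough to sketch from scratch (the toy data below is made up, not the application’s manufactured data):

```python
from collections import Counter
import math

def knn_predict(train, labels, point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(p, point), lbl) for p, lbl in zip(train, labels)
    )
    votes = Counter(lbl for _, lbl in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D data: two well-separated clusters
train = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train, labels, (0.5, 0.5)))  # a
print(knn_predict(train, labels, (5.5, 5.5)))  # b
```

That’s really the whole algorithm: measure distances, take the $k$ closest, vote. The library versions mostly add speed (spatial indexes) and options like distance weighting.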

Read More

Cleaning JSON data.

On Monday I started a new project for the Erdös Institute’s data visualization minicourse. For the project I plan to build a dashboard using my Fitbit data for the past year and d3.js, a JavaScript library. It was daunting at first: every time I see JavaScript I get intimidated, and even when I watched Matt’s d3.js videos I had trouble following what was going on. So I thought that by using d3.js in my project I could get more comfortable with JavaScript.
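The Python side of the cleaning looks roughly like this. The field names below are invented stand-ins – real Fitbit exports nest things differently depending on the data type – but the pattern of flattening nested JSON into flat rows that d3 can bind to is the same:

```python
import json

# A made-up record in the shape of a Fitbit export entry --
# the real field names differ by data type
raw = '''[
  {"dateTime": "2023-01-01", "value": {"steps": "8421", "restingHeartRate": "61"}},
  {"dateTime": "2023-01-02", "value": {"steps": "10233", "restingHeartRate": "60"}}
]'''

def flatten(records):
    """Flatten nested Fitbit-style records into flat rows for d3 to bind to."""
    return [
        {"date": r["dateTime"],
         "steps": int(r["value"]["steps"]),
         "resting_hr": int(r["value"]["restingHeartRate"])}
        for r in records
    ]

rows = flatten(json.loads(raw))
print(rows[0])  # {'date': '2023-01-01', 'steps': 8421, 'resting_hr': 61}
```

Note the numeric values arrive as strings and need converting, which is typical of these exports.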

Read More

Machine learning: PCA I

PCA stands for Principal Component Analysis. It’s what’s called an unsupervised learning algorithm and it’s a (clever!) way of reducing the number of dimensions in a data set. As a bonus, it reveals correlations between features.
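Here’s a bare-bones sketch of the idea on a made-up two-feature data set (real PCA code would use numpy or scikit-learn; for two features the leading eigenpair of the covariance matrix has a closed form):

```python
import math

# Toy data: two strongly correlated features (made up)
xs = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
ys = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Sample covariance matrix [[a, b], [b, c]]
a = sum((x - mx) ** 2 for x in xs) / (n - 1)
c = sum((y - my) ** 2 for y in ys) / (n - 1)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Leading eigenvalue and (unit) eigenvector of a symmetric 2x2 matrix
lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
vx, vy = b, lam - a
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

# Project each centered point onto the first principal component:
# one number per point instead of two, i.e. the dimension reduction
scores = [(x - mx) * vx + (y - my) * vy for x, y in zip(xs, ys)]
print(round(lam / (a + c), 3))  # fraction of total variance explained
```

On this data the first component captures over 90% of the variance, which is exactly why dropping the second dimension loses so little, and the large off-diagonal covariance `b` is the correlation bonus mentioned above.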

Read More

More scraping obstacles.

Matt told me I really need at least 500 posts’ worth of data to proceed with the data exploration, so I’ve been at it since Tuesday. It’s so frustrating: I thought I had my scraping code finalized on Saturday, and yet it hasn’t worked well enough to get me the complete data set. Here are some of the obstacles:

Read More

Preprocessing!

Yesterday I moved to the next phase of my project, data cleaning and preprocessing. I really didn’t have to do too much cleaning! Every time I thought of a way to make the data prettier I just incorporated it into the data gathering code.

Read More

New project.

This past week has been spent working on the new project. Coming up with a new topic seemed daunting, so I took the rest of last Friday off and started looking on Saturday. But after getting all that experience scraping data on the last project attempt, I felt my options for where to get data were more open. I got the idea to analyze the posts of Fox News and MSNBC against their comment counts and write an algorithm to predict which topics would get more traffic. Unfortunately, MSNBC doesn’t have a comments section. I read that a lot of news sites are getting rid of comments, under the rationale that a toxic comments section will discredit the article, and when I tried other news sites that seemed to be true. Ultimately I decided to just get my data from the 538 politics section. It has comments, and the posts already have tags, so I don’t need to do any advanced NLP to get the topics.

Read More

Conclusions from cleaning.

This week I learned a lot while trying to clean my project data. I fixed the loop I made with the user inputs. At first it wouldn’t change the names of the columns, even when I added the .rename command. It turns out I needed to include the keyword argument inplace=True, which makes the command modify the data frame itself. I had been confused about that, because I thought the default for that argument was True – it’s actually False, which is why the code wouldn’t work without it. I also added a few more print commands to the loop for debugging purposes. I’m pretty proud of the loop, so I’ve included it below, comments and all.

# Loop that will prompt me to delete or rename each column
# (reverse order so deletions don't shift the indices still to come)
for i in reversed(range(len(columns))):
    # Prompt to delete
    print("Column index:", i)
    print("Column name:", columns[i])
    delete_choice = input("Delete column (y or n)?  ")
    # Make sure the input is valid
    while delete_choice not in ("y", "n"):
        delete_choice = input("Enter y or n.  ")
    # Delete if yes
    if delete_choice == "y":
        del master_data_cleaned[columns[i]]
        print()
    # Prompt to rename if no
    else:
        rename_choice = input("Rename the column (y or n)?  ")
        # Make sure the input is valid
        while rename_choice not in ("y", "n"):
            rename_choice = input("Enter y or n.  ")
        # Rename if yes
        if rename_choice == "y":
            new_name = input("Enter the new name: ")
            master_data_cleaned.rename(columns={columns[i]: new_name}, inplace=True)
            # Verify the name was changed
            new_columns = master_data_cleaned.columns.values.tolist()
            master_index = new_columns.index(new_name)
            print("Now new_columns[" + str(master_index) + "] = " + new_columns[master_index] + ".")
            print()
        # Pass if no
        else:
            print()

Once I got the loop completely working, I started to go through it and edit each of the 198 columns, but partway in I started to think this really isn’t the most efficient way to clean this data – maybe there’s still a better way. I also noticed that even with the English translations, many of the column titles were still not descriptive enough for me to use in data analysis, so it felt like I was throwing out a lot of data.

Read More

Data cleaning.

Today and last night I worked more on the bootcamp project. My first goal was to scrape the table with the column names for the first file from Kaggle. It turned out to be a little more involved, because the table was not made using HTML; it was made using Markdown inside a script element. It took a while, but I was able to pull the text from within that script element, then use the split method to break it on the newline character and get each row of the table. Then I used split again to create two lists, one for each column of the table. After that it didn’t take long to rename the columns in my master data file according to the values in the table.
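The splitting step looked roughly like this. The table text below is a stand-in with invented column codes, not the actual Kaggle content, but the two-pass split is the same idea:

```python
# Stand-in for the Markdown table text pulled from the script element
table_text = """| variable | description |
| --- | --- |
| Dem_Age | Age of respondent |
| Dem_Gender | Gender of respondent |"""

# First split: one string per table row
rows = table_text.split("\n")

# Second split: cells within each row; skip the header and |---|---| rows
codes, descriptions = [], []
for row in rows[2:]:
    # Splitting on "|" leaves empty strings at both ends of the list
    cells = [cell.strip() for cell in row.split("|")]
    codes.append(cells[1])
    descriptions.append(cells[2])

print(codes)         # ['Dem_Age', 'Dem_Gender']
print(descriptions)  # ['Age of respondent', 'Gender of respondent']
```

With the two lists in hand, renaming is just `dict(zip(codes, descriptions))` passed to `.rename`.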

Read More

Merging two data sets.

Today I spent some more time working on the data science boot camp project. I read on the Kaggle page that the hash column did give identifiers for the survey respondents, so I spent time working on merging the data between the two files.
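The merge boils down to joining rows on the shared hash column. Here’s a minimal stdlib sketch of the idea with invented column names (in a notebook, pandas.merge on the hash column does this in one line):

```python
# Two made-up survey extracts keyed by the same respondent hash
survey_a = [
    {"hash": "a1f3", "age": 34},
    {"hash": "b7c2", "age": 51},
]
survey_b = [
    {"hash": "a1f3", "anxiety_score": 7},
    {"hash": "d9e4", "anxiety_score": 3},
]

# Index the second survey by hash, then inner-join: keep only
# respondents who appear in both surveys
by_hash = {row["hash"]: row for row in survey_b}
merged = [
    {**row_a, **by_hash[row_a["hash"]]}
    for row_a in survey_a
    if row_a["hash"] in by_hash
]
print(merged)  # [{'hash': 'a1f3', 'age': 34, 'anxiety_score': 7}]
```

The inner join silently drops respondents who answered only one survey, which is worth counting before and after the merge to see how much data it costs.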

Read More

Translating using python.

Today I started working on a project for the Erdös Institute’s data science bootcamp that I attended last fall. The first task is to find a data set, state a question that can be answered using the data, then identify stakeholders and key performance indicators (KPIs). I decided to use a trending data set from Kaggle on the correlation between drug use and mental health during the COVID-19 pandemic. The data is from two surveys conducted during spring and summer of 2020.

Read More

Back to the blog.

Today I saw a repository in my GitHub account that I didn’t recognize and found out it’s a blog I forgot I had created in the summer of 2017. How did I forget about that? I read all the posts and realized there’s a lot about programming that I knew back then and can’t recall now. The blog lasted only about a month, so I guess I must’ve learned all that stuff that summer. That gives me hope that I can get more proficient with my coding abilities in the next month or so.

Read More

Test drive!

As an exercise, go back to the 10 June 2017 post and zoom in until the screen capture of my C snippet illustrating an array becomes readable. Assuming the all-caps comments are filled in with correct code, or filling them in yourself, what do you expect will be the result of running this little program? I could’ve sworn I’d already tested this using an online compiler but yesterday I found out I was wrong.

Read More

My online presence.

I was thinking about a blog post re: more general stuff, like the privacy risks of using a browser extension like Honey and then inviting my Facebook friends to download it, too.

Read More

Structures in C.

I am working on a page that lists each of the following data structures:

  • array list/vector
  • linked list
  • queue
  • stack
  • hashtable
  • binary search tree
  • priority queue/heap
Read More

Comments about comments.

At some point soon I would like to figure out how to enable comments; it’s something to do with registering with Disqus. Meanwhile, I’m aware some computer files should be stored locally rather than somewhere like Dropbox or in a Public or All Users folder. I might crowdsource this on Facebook – the question is, which files are appropriate for storing in the cloud, and which are better stored on an internal hard drive? I learned today, while spending three hours trying to compile the McDowell LaTeX resumé template, that it makes a very big difference whether or not you run MiKTeX as an administrator when updating or installing packages. The issue is that depending on who you update as, Administrator vs. User, the files that get downloaded go to wildly different places on the hard drive. In fact, from what I can tell, 90% of compile errors result from the compiler not being able to find the files it needs, because of this issue.

Read More

Studying notes.

I used the settings on this webpage by The Hyperbolics for my Anki flashcards. TIL that factorials overtake powers of 10 at n = 25, since 25! ≈ 1.55 × 10²⁵ > 10²⁵. I also learned the updated way to align images in HTML: before, I was using the attribute align="right"; it is now class="wrap align-right", where the specifications are in the CSS style file. How do you make smart quotes in Markdown? In this case I was able to just nest them in a code environment.

Read More

I'm up and running!

It’s a first blog post! By me, Wheeles. I got this blogging page going via an article I found just from browsing around the GitHub support pages, written by Barry Clark.

Read More