Scraper

This article documents Scraper, a very ad hoc program that reads an article and downloads all referenced Gyazo images into this project's assets folder, ensuring that articles will not lose their images if the Gyazo versions are ever deleted.

The project itself can be found on my GitHub. At this writing there is an intentionally failing test, because I have changed the code and have not yet tested it full round trip.

Background

I have figured out that when Gyazo displays your image at some address like https://gyazo.com/deadbeef, the actual image file will be found at https://i.gyazo.com/deadbeef.ext, where ext will be png or jpg, or perhaps some other format. It appears that Gyazo decides randomly whether to use png or jpg, though doubtless there is some rationale.
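
For concreteness, here is a small sketch of that transformation. It is not part of the Scraper code; candidate_image_urls is a name I am inventing here, and the fixed png/jpg list is just an assumption based on the behavior described above.

def candidate_image_urls(page_url):
    """Turn a Gyazo page URL into the direct image URLs it might map to.
    e.g. 'https://gyazo.com/deadbeef' ->
         ['https://i.gyazo.com/deadbeef.png', 'https://i.gyazo.com/deadbeef.jpg']
    """
    image_id = page_url.rstrip('/').split('/')[-1]
    return [f'https://i.gyazo.com/{image_id}.{ext}' for ext in ('png', 'jpg')]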

When editing a lesson taken from a chat in SL, where the participants have pasted Gyazo links into the text, I first manually changed all those references to markdown image links, converting, say,

DS: Here's my picture: https://gyazo.com/fee

That would get hand-edited to something like:

DS: Here's my picture:

![image](https://i.gyazo.com/fee.png)

That’s the markdown for an in-line image link. I would determine png vs jpg by trial and error: I first used a Sublime Text regex to convert them all to reference png, then displayed the article and fixed up the ones that were wrong. You can also, on a Gyazo page, have your browser open the image in a new tab, which will give you the proper address.
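
If you would rather not guess, a short requests-based probe can do the trial and error for you. This is only a sketch of my manual process, not something the Scraper does; find_real_extension is a hypothetical helper.

import requests

def find_real_extension(image_id):
    """Try png, then jpg, returning the first direct URL that answers 200."""
    for ext in ('png', 'jpg'):
        url = f'https://i.gyazo.com/{image_id}.{ext}'
        if requests.head(url, allow_redirects=True).status_code == 200:
            return url
    return None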

Gyazo claims that they plan to keep all the images forever and that they are in it for the long term. Nonetheless, it seemed worthwhile to pull down all the images and put them into this site’s assets, just in case. Plus it seems like an interesting program to write.

Development

I started with a few tests to learn how to use Python’s os.path and requests modules to manipulate file paths and issue HTTP requests, respectively. Some tests actually check results, and some amount to convenient ways to run something that I verify manually. For example:

    @pytest.mark.skip("no point writing all the time")
    def test_write(self):
        pic = 'https://janetrossini.github.io/assets/rgn4.png'
        response = requests.get(pic)
        assert response.status_code == 200
        content = response.content
        # assert False
        desktop = '~/Desktop'
        true_desktop = os.path.expanduser(desktop)
        output_name = 'image_name.png'
        write_to = os.path.join(true_desktop, output_name)
        with open(write_to, 'wb') as handler:
            handler.write(content)

The above test, when not marked to skip, reads a random image and writes it to my desktop. I verified by eye that it was the right image. Other tests let me experiment and re-learn how to expand '~' to get the user's home path, how to open and read files, and so on.
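
Those learning tests were roughly of this shape. The exact assertions below are my reconstruction, not the originals:

    def test_expand_user(self):
        # re-learning: expanduser turns '~' into the real home folder
        desktop = os.path.expanduser('~/Desktop')
        assert '~' not in desktop
        assert desktop.endswith('Desktop')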

The initial version of the scraper was pretty nasty and I evolved it to its current state. At this writing I do not consider it fit for general use but it might be useful as a basis for solving a similar problem. Let’s consider the classes we have.

Scraper

The Scraper class is the core working class. As written, it knows that it is going to read from this site’s articles and store into this site’s assets folder. The scraper is created by passing it an article’s file name, for example:

    @pytest.mark.skip("no point writing all the time")
    def test_do_whole_article(self):
        article = '2024-06-28-bump.md'
        scraper = Scraper(article)
        scraper.make_plan()
        scraper.execute_plan()

We’ll review the code, which currently is flagged not to do any actual work. Along the way, I’ll try to test it round trip. I am hopeful that the docstrings will do most of the explanation, but I’ll add a few words here as well.

class Scraper:
    def __init__(self, article_name):
        """Set up for provided article name.
        Note that flags are initialized to do no web access and no file writing.
        """
        self.article_name = article_name
        base_path = '~/Documents/GitHub/janetrossini.github.io/_posts'
        self.read_path = os.path.expanduser(base_path)
        self.article_path = os.path.join(self.read_path, self.article_name)
        self.save_path = '~/Documents/GitHub/janetrossini.github.io/assets'
        self.do_read = False
        self.do_save = False

    def get_pic_names(self):
        """Read every line of the article looking for picture names to fetch.
        """
        pic_names = []
        with open(self.article_path, 'r') as fp:
            for line in fp.readlines():
                self.append_pic_name(line, pic_names)
        return pic_names

    def append_pic_name(self, line, pic_names):
        """For lines containing '[image]',
        pull out the image name and add it to the collection.
        The line will look like ![image](https://etc.png)
        The regex fetches everything inside the parens,
        which should be the url of the picture.
        """
        if '[image]' not in line:
            return
        url_regex = r'\((.*)\)'
        url_result = re.search(url_regex, line)
        url = url_result.group(1)
        if url:
            pic_names.append(url)

    def make_plan(self):
        """Convert list of picture names to an instance of Plan.
        A Plan is little more than a collection of PicWriter instances.
        The PicWriter does all the work, as we'll see."""
        plan = Plan()
        for name in self.get_pic_names():
            writer = PicWriter(name, self.save_path)
            plan.append(writer)
        return plan

    def execute_plan(self):
        """Execute the plan.
        Flags must be set True if you want anything done.
        Probably this method should accept the desired flags,
        at least as things stand now.
        """
        plan = self.make_plan()
        plan.execute(self.do_read, self.do_save)

Currently we create the Scraper, tell it to make the plan and then tell it to execute the plan. A more reasonable implementation might just create and go.
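
A “create and go” version might be nothing more than a small convenience function like the one below. scrape_article is a name I am inventing here; it is not in the current code:

def scrape_article(article_name, do_read=True, do_save=True):
    """Sketch of 'create and go': build the Scraper, set the flags, run the plan."""
    scraper = Scraper(article_name)
    scraper.do_read = do_read
    scraper.do_save = do_save
    scraper.execute_plan()  # execute_plan builds the Plan itself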

Plan

class Plan:
    def __init__(self):
        """Just a simple collection of PicWriter instances.
        Might become more clever in a later version.
        Build this way because we want to have a place for `execute`.
        """
        self.pic_writers = []

    def append(self, pic_writer):
        """Add a PicWriter to our list"""
        self.pic_writers.append(pic_writer)

    def execute(self, do_read, do_save):
        """Execute all the PicWriters in the Plan.
        This is what causes the file to be scanned
        and the pictures retrieved.
        """
        for writer in self.pic_writers:
            writer.execute(do_read, do_save)

Nothing really to see here. This is just a specialized collection with an execute method.

PicWriter

This is the class that knows how to read a URL and write a file. It might make sense to have two classes here, since there are two distinct things being done.

class PicWriter:
    """Given one picture url,
    read that url's contents,
    and write the contents to the provided path,
    with the same name as found in the url.
    e.g. 'https://i.gyazo.com/deadbeef.png'
    """
    def __init__(self, pic_url, path):
        """Initialize with file name and path to save file to.
        regex selects everything after `.com/` as image file name.
        """
        self.pic_url = pic_url
        file_regex = r'i\.gyazo\.com/(.*)'
        result = re.search(file_regex, pic_url)
        if not result:
            raise ValueError('cannot find image file name')
        else:
            save_name = result.group(1)
            self.save_address = os.path.join(path, save_name)

    def __repr__(self):
        """debug print for convenience"""
        return f'PicWriter({self.pic_url=} {self.save_address=})'

    def execute(self, do_read, do_save):
        """depending on flags:
        do_read: read the saved url (requests.get) and
        do_save: save the contents
        """
        if do_read:
            response = requests.get(self.pic_url)
            if response.status_code != 200:
                raise FileNotFoundError
        else:
            response = None
        self.save_content(response, do_save)

    def save_content(self, response, do_save):
        """Save the file if flag is True,
        and if we receive a non-None response.
        We call this method even if we got no good response,
        because it prints the trace information.
        """
        expanded = os.path.expanduser(self.save_address)
        print(f'save file: {expanded=}')
        if not do_save:
            print(f"not written {do_save=}")
        elif response is None:
            print(f'not written, response is None')
        else:
            content = response.content  # this is the actual .png / .jpg data
            with open(expanded, "wb") as handler:
                handler.write(content)
                print("written")

The rubber meets the road in execute, which tries to read the URL, and save_content, which writes it out. If the do_read flag is off, we pass None to save_content anyway, because it prints the trace information; if the read itself fails to return a 200 status, execute raises FileNotFoundError.

I think the save_content method is a bit nasty, but it does need to sort out the cases. If one were to undertake a cleanup or to make the program more general, one might improve the method.
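
One possible cleanup, using early returns to separate the “why not” cases from the actual write, might look roughly like this. It is an untested sketch, not the current code:

    def save_content(self, response, do_save):
        """Sketch: report why nothing was written, otherwise write the file."""
        expanded = os.path.expanduser(self.save_address)
        print(f'save file: {expanded=}')
        if not do_save:
            print(f'not written {do_save=}')
            return
        if response is None:
            print('not written, response is None')
            return
        with open(expanded, 'wb') as handler:
            handler.write(response.content)  # the actual .png / .jpg data
        print('written')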

As things stand this instant, nothing ever sets the do-flags to True, so we have never done a round trip on this code. Let’s fix that:

    @pytest.mark.skip("no point writing all the time")
    def test_do_whole_article(self):
        article = '2024-06-28-bump.md'
        scraper = Scraper(article)
        scraper.make_plan()
        scraper.do_read = True
        scraper.do_save = True
        scraper.execute_plan()

With the skip turned off, this reads and saves the file, verified by visually checking the file date. Woot! It still works! A sufficiently clever person would check the file date in the test, and maybe even the content, but not today.
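
A version that checks the file date mechanically might look something like this (it also needs an import of time). The asset path and the fee.png file name are assumptions for illustration, borrowed from the earlier example:

    def test_image_is_freshly_written(self):
        # Sketch: assert the saved image was written during this test run.
        start = time.time()
        scraper = Scraper('2024-06-28-bump.md')
        scraper.do_read = True
        scraper.do_save = True
        scraper.execute_plan()
        saved = os.path.expanduser(
            '~/Documents/GitHub/janetrossini.github.io/assets/fee.png')
        assert os.path.getmtime(saved) >= start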

I can remove the failing test that says we’ve never gone end to end.

Conclusion … for now

The program works. I have added a test-file.md to this site, which can be set up for full testing of the process at some future date. Since it works as it stands and there are no new files yet to process, I’ll defer that work until it makes more sense to do it.