I love automation. I've been programming my computer for years to do my job for me. However, this is the first time that I have automated a student job.
Due to the Covid pandemic, many freelance journalists were getting much less work than before. To help them, the Association of Professional Journalists (AJP) launched an initiative: each freelance journalist could write an article, publish it on the association's website and get paid for it. My job was to format the articles properly so that they could be published online.
The structure of the folder I received was as follows:
Folder I received
├── Journalist 1
│ ├── article.docx
│ ├── image 1.jpg
│ ├── image 2.jpg
│ └── image 3.jpg
├── Journalist 2
│ └── ...
├── Journalist 3
│ └── ...
└── ...
I didn't want to run the program by hand every time I needed to edit an article or an image, so I decided to automate finding the files as well. You can easily do this in Python with os.scandir(<path>) and then filter the files by their extension.
import os

DIRECTORY: str = "<path_to_the_folder_I_received>"
# str.endswith() expects a tuple of extensions, not a list
image_extensions: tuple = ('.png', '.jpg', '.jpeg', '.tiff', '.bmp', '.gif')

# list of directories named after journalists
directory_list: list = [d for d in os.scandir(DIRECTORY) if d.is_dir()]

for dir in directory_list:
    journalist_name: str = dir.name
    # get the lists of articles & images
    docx_file_list: list = [f for f in os.scandir(dir) if f.name.endswith(".docx")]
    images_list: list = [f for f in os.scandir(dir) if f.name.lower().endswith(image_extensions)]
On the first day, I learned how the Word (.docx) files I received had to be formatted. Here is the target format:
{{{**subtitle}}}
{{question}}
What this looks like in Python:
import docx
# open the document
doc = docx.Document("<doc_path>")
# format the file both as a regular article and as an interview
article: str = format_article(doc)
interview: str = format_interview(doc)
# remove or add white space where appropriate
article = correct_white_space_article(article)
# save the results to text files
with open(dir.path + "\\article.txt", 'w', encoding='utf8', errors='ignore') as f:
    f.write(article)
with open(dir.path + "\\interview.txt", 'w', encoding='utf8', errors='ignore') as f:
    f.write(interview)
Since I couldn't find an easy way to distinguish interviews from regular articles, I decided to create both formats in parallel. When publishing the article on the association's website, I chose the most appropriate one.
Of course, I had to write these functions myself. Since I don't want to go into too much technical detail in this article, I'll leave their implementation aside.
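That said, to give a rough idea, here is a minimal sketch of what format_article and format_interview could look like. It is not my actual implementation: it assumes subtitles can be recognised by a "Heading" paragraph style and questions by a trailing question mark.

# minimal sketch, not the real implementation
def format_article(doc) -> str:
    lines = []
    for paragraph in doc.paragraphs:
        text = paragraph.text.strip()
        if not text:
            continue
        # assumption: subtitles use a "Heading" style in the .docx file
        if paragraph.style.name.startswith("Heading"):
            lines.append("{{{**" + text + "}}}")
        else:
            lines.append(text)
    return "\n\n".join(lines)

def format_interview(doc) -> str:
    lines = []
    for paragraph in doc.paragraphs:
        text = paragraph.text.strip()
        if not text:
            continue
        # assumption: a question is a paragraph ending with a question mark
        if text.endswith("?"):
            lines.append("{{" + text + "}}")
        else:
            lines.append(text)
    return "\n\n".join(lines)

python-docx exposes each paragraph's text and style name, which is enough for this kind of pattern matching.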
As for the images, my job was to reduce their size and pixel density. We had to do this because the smaller the image, the faster it loads. This improves both user experience and SEO (search engine ranking).
from PIL import Image
# open the image and compute the new size (1000 px wide, same aspect ratio)
img = Image.open("<image_path>")
new_size = tuple(int(x / img.size[0] * 1000) for x in img.size)
# resize and reduce pixel density (dpi)
# (Image.ANTIALIAS was renamed to Image.LANCZOS in recent Pillow versions)
img.resize(new_size, Image.LANCZOS).save("<new_path>", dpi=(72, 72))
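Plugged into the folder loop from earlier, the same few lines take care of every image a journalist sent in. The _resized suffix in the output path below is just an illustrative naming choice.

import os
from PIL import Image

# runs inside the loop over journalists: resize every image in images_list
for entry in images_list:
    img = Image.open(entry.path)
    new_size = tuple(int(x / img.size[0] * 1000) for x in img.size)
    name, ext = os.path.splitext(entry.name)
    new_path = os.path.join(dir.path, f"{name}_resized{ext}")
    img.resize(new_size, Image.LANCZOS).save(new_path, dpi=(72, 72))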
Some images have their description and copyright information in the image metadata. I wrote a script to put all of this metadata together into one file. After uploading the files to the website, I just had to add the description manually. I could also have scripted this task. However, I would have wasted more time automating it than doing it myself.
Metadata display in Gimp
My boss told me that only the Exif and IPTC data were important, so I extracted only those. Since this metadata is stored as raw bytes in the image, I also needed to decode it to UTF-8 and remove illegal characters.
from PIL import Image, IptcImagePlugin

metadata_list: list = []
for image in images_list:
    img = Image.open(image.path)
    # extract the Exif and IPTC metadata
    try:
        metadata_iptc: dict = IptcImagePlugin.getiptcinfo(img)
        metadata_exif: dict = dict(img.getexif())
    except Exception:
        metadata_iptc, metadata_exif = None, None
    metadata_text = img.filename + '\n'
    try:
        metadata_text += '\n'.join(
            f"{key} -- " + item.decode('utf8', errors='ignore').replace('\x00', '') if type(item) is bytes else
            f"{key} -- {item}"
            for key, item in metadata_exif.items()
        )
    except Exception:
        pass
    try:
        metadata_text += '\n' + '\n'.join(
            f"{key} -- " + item.decode('utf8', errors='ignore').replace('\x00', '') if type(item) is bytes else
            f"{key} -- {item}"
            for key, item in metadata_iptc.items()
        )
    except Exception:
        pass
    metadata_list.append(metadata_text)

# save the metadata of all images in a single file
with open(dir.path + "\\metadata.txt", 'w', encoding='utf8', errors='ignore') as f:
    f.write("\n\n----------------\n\n".join(metadata_list))
Automation saved me over 50% of my time. It didn't make me any more money, but it did spare me the most repetitive and boring parts of the job.