CFP: IndieWeb: A better vision for the web

I applied with this talk to PyCon Slovakia, but was not accepted.

Title: IndieWeb: A better vision for the web

Abstract:

The original vision of the web included the possibility of everyone publishing on the internet. But these days most people publish only on platforms, where they do not even control their continued ability to participate, let alone anything else.

But there is also a different version of the internet.

IndieWeb is the part of the web made up of personal sites on their owners' own domains. It allows people to have their own corner of the web that they control, which brings forth a lot of otherwise untapped energy. It also allows them to interact with any part of the internet they want.

In this introduction, I will present what IndieWeb is, along with some of its interesting principles and solutions for a more decentralized, open and creative web.

Bio:

I am a software engineer at LeanIX. I come from a cognitive science and economics background, so I am more likely to be interested in using programming to answer social science questions, in data analysis and automation, than in going into the nitty-gritty details of Kubernetes. I am also interested in how people and technology relate to one another.

CFP: The overview of the clustering methods on the example of Fanfiction tags

I applied with this talk to PyCon Slovakia, but was not accepted.

Title: The overview of the clustering methods on the example of Fanfiction tags

Abstract:

There are a lot of ways to group data together: topic models, factor analysis, k-means clustering and many more, each with its own specifics.

But in an actual analysis, it is hard to know which one to use and how to tune it for each case. Most of them also have parameters that affect the final result.

In this talk, I would like to show their differences by checking how different methods and settings change the results when clustering fanfiction tags.
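
As a small taste of what the talk would cover, here is a minimal sketch (the tags and the cluster counts are just an illustration, using scikit-learn) of how changing a single parameter already changes the grouping:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# A handful of made-up fanfiction tags to cluster.
tags = [
    "fluff", "angst", "hurt/comfort", "slow burn",
    "alternate universe", "alternate universe - coffee shop",
    "angst with a happy ending",
]

vectors = TfidfVectorizer().fit_transform(tags)

# The same data, clustered with two different values of k.
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
    print(k, labels)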

Bio:

I am a software engineer at LeanIX. I come from a cognitive science and economics background, so I am more likely to be interested in using programming to answer social science questions, in data analysis and automation, than in going into the nitty-gritty details of Kubernetes. I am also interested in how people and technology relate to one another.

Removing the Password from a Password-Protected PDF

Because of Slovenian personal data protection laws, they are no longer allowed to send me my payslips without password protection. But when I save them to disk myself as a backup, I want to store them without the password. Otherwise, who knows how many years from now, I will not remember what the password for each file is. Especially if these passwords change over time - for example, when I change jobs.

Since I have already looked up how to do this a few times, I decided to write it down as a blog post. That way, next time I will at least know where to look.

On Linux, this can be done with the following command.

qpdf --password="geslo" --decrypt input.pdf output.pdf

The program is in the Arch Linux package repository, so I do not have to worry about where to get it, or about installing the other programs it needs to work properly (its dependencies).
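
And if there are already many saved files to clean up, a short script can run the same command over all of them. A minimal sketch (it assumes qpdf is installed and that all the files in the folder share the same password):

import subprocess
from pathlib import Path

password = "geslo"  # the same password as in the command above

# Decrypt every PDF in the backup folder into a new file next to it.
for pdf in Path("payslips").glob("*.pdf"):
    output = pdf.with_name(pdf.stem + "-no-password.pdf")
    subprocess.run(
        ["qpdf", f"--password={password}", "--decrypt", str(pdf), str(output)],
        check=True,
    )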

What scraping taught me about the importance of semantic HTML

I had this talk accepted at the online version of the TECH(K)NOW conference. You can watch the video, but if you are like me and prefer text to audio/video, you can read the blog version of the talk here.

Introduction

Why talk about scraping from the semantic HTML perspective? It is true that semantic HTML is much more frequently mentioned in topics connected with accessibility and so on. Usually when people talk about scraping, it is because they want to stop other people from getting the data from their site - or, in rarer cases, because they want to make it as easy as possible.

Also, my main job at work is to get the data we need, be it through an API or, when that is not available, through scraping. So I am a lot more familiar with this aspect than with some others.

What is Semantic HTML

But let me first start with what semantic HTML actually is. Semantics is the subdiscipline of linguistics that deals with meaning. It applies on multiple levels, from words to phrases, paragraphs and so on.

So when we talk about semantic HTML, we talk about HTML that is meaningful. This goes from using the right HTML element (there are over 100 of them) to using the right attributes and so on.

What might be different from the linguistic perspective is that HTML has to be meaningfully understood by at least two very different types of users: the people who read and write HTML, and the machines (like browsers) that read it. Human languages, on the other hand, are generally used to communicate with other humans - and humans are a lot more similar to each other than machines and humans are.

Semantic HTML in Scraping

I will be the first to admit that semantic HTML does help with scraping data. Knowing which things are durable and which change all the time can really help in writing more stable scraping code - even for a junior, which is what I was when I started at my current job.

Semantic HTML is also important because it allows many different types of devices to still access the data. If a browser does not understand an HTML element, it just ignores it. This is not true of JavaScript, where using a feature not supported by an older browser can make the site inaccessible. HTML has built-in graceful degradation.

First Advice I got

When I first used scraping for my job - not my first time scraping, but still - the advice that I got was:

  • never use Angular attributes for any element search
  • ARIA attributes are among the most stable; use them

The reason for the Angular one was that they change way too quickly, and are therefore a really bad target to use. On the other hand, ARIA attributes almost never change. While they are not that frequent - since an HTML element with the same built-in role should be used where one exists - I have never had to change an element selector because somebody changed or deleted an ARIA attribute.
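
As a small sketch of what this looks like in practice (the HTML snippet and the names in it are made up, using BeautifulSoup):

from bs4 import BeautifulSoup

html = """
<div class="ng-tns-c42-7 _ngcontent-xyz">
  <span role="status" aria-label="Sync finished">Done</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Brittle: this Angular-generated class is likely to change on the next build.
by_class = soup.find("div", class_="ng-tns-c42-7")

# More stable: the ARIA attribute describes what the element means.
by_aria = soup.find(attrs={"aria-label": "Sync finished"})

print(by_class is not None, by_aria.get_text())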

Semantic elements

Now let us go to the first thing that I discovered through scraping. There are still a lot of sites that are not using the semantically best-suited HTML elements. There is such a huge variety to choose from: classical ones that have existed from the very beginning, like paragraph or header, really self-descriptive ones, like navigation or footer, and really specific ones like time or ruby.

Thankfully, most of the pages that I need to deal with do not just use divs inside of divs.

The most frequent failure that I saw is that inputs are still often used as buttons. So the element would be an input with type submit, or, in another case, a link instead of a button. Navigations are often lists wrapped in divs rather than a navigation element. Tables are usually tables, but there are still examples where they are just lists, or lists of divs. Or tables are used both for layout and for data - so a data table inside a layout table.

One of the more interesting cases was when a file was downloaded not by clicking on a link or button, but by submitting a form with no input fields.
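
In the scraping code, this variety means that the same action can hide behind several different elements, so the search has to cover all of them. A minimal sketch (the HTML is a made-up example):

from bs4 import BeautifulSoup

html = '<form action="/report"><input type="submit" value="Download"></form>'
soup = BeautifulSoup(html, "html.parser")

# The "download button" might be a real button, an input[type=submit] or a styled link.
control = (
    soup.find("button")
    or soup.find("input", attrs={"type": "submit"})
    or soup.find("a", string="Download")
)
print(control)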

ARIA

I do not encounter ARIA that frequently, whether sites use semantic HTML or not. I would expect less ARIA with semantic HTML because of the first rule of ARIA:

If you can use a native HTML element or attribute with the semantics and behavior you require already built in, instead of re-purposing an element and adding an ARIA role, state or property to make it accessible, then do so. https://www.w3.org/TR/using-aria/#rule1

It also seems that the more ARIA is present, the more accessibility problems the site has on average.

Meaningful attributes

Attributes can also be more or less meaningful. There is a trend I see on sites where class names are just random letters and numbers. They also seem to be the same across sites. Is this because they are usually up to four characters long, and there are not a lot of combinations? Or maybe they are the result of some template or framework?

Another frequent one is class names that start with 'css' followed by random letters and numbers. There were a couple of cases where classes had names like that, the entire site was divs only, and there were almost no other attributes. These are a pain in the butt to deal with.

But in some cases, classes and other HTML attributes can be really descriptive, in a semantic way as well. They actually tell you what is inside the element. I am thankful for every site that is created this way.

Attributes are also often used by third-party developers to create an experience of the site that is more suited to each user. Not having consistent class names can turn into an endless whack-a-mole game between users, third-party developers and website developers.

A lot of the time, platforms want to push their own agenda and make sure third-party developers cannot create apps or scripts like that. An example of this would be how both Tumblr and Facebook made it harder for third-party developers to improve the site experience. This could be an app that allows hiding posts on the site, either to hide spoilers for a show, or to hide content that could trigger epilepsy or PTSD.

Or sometimes it can be for more personal reasons. Maybe you do not agree with algorithmic changes to the Twitter timeline. Even I hide the sidebar on StackOverflow and related sites, because it is just too distracting for me. I also hide the main element of sites that I think I visit too much compared to the value I get from them, like HackerNews.

But this always makes the site's structure less semantic, and can therefore break other devices, like screen readers.

As a side note, sites do sometimes use the same id for multiple HTML elements, so relying on ids is not always a good solution either.

Random changes

Looking at the changes that made us correct our code, these are the most frequent ones:

  • random name changes for classes
  • semantic HTML elements being replaced by div elements
  • inputs being changed to buttons

The move to div elements seems to be going in the wrong direction. I do not really understand why this happens. Is there some group of people advocating for more widespread usage of divs?

But the move from inputs to buttons gives me some hope for the future.

Non-URL based navigation

My biggest annoyance in scraping is actually sites that use JavaScript to generate the navigation and offer no meaningful direct URL navigation. Please, please, please, do not hide links. These navigations usually hide links behind JavaScript and only show them after a certain button is clicked.

This is not just an annoyance for scraping. A website that does not allow me to bookmark a location, but forces me to go through the click ritual each time, is one I would be really annoyed with - and most likely would not use for long.

Using API calls instead

All the problems with HTML can be avoided by just using the same API calls the site would be making anyway. So instead of letting the site get the response, create the HTML with JavaScript, and then parsing that HTML, the HTTP response can be parsed directly.

This can be used a surprising amount of the time. I guess almost nobody likes doing backend rendering anymore? At least when it comes to paid internet services - I am sure a lot of blogs are still not a frontend framework and a bunch of API calls.

Usually the HTTP responses change less frequently than the HTML. But I have noticed a couple of sites where the reverse is true. So it depends on the site.
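
As a sketch of the idea (the URL and the field names are invented; it assumes the page fills its list from a JSON endpoint, and uses the requests library):

import requests

# Call the JSON endpoint the page itself uses, instead of parsing generated HTML.
response = requests.get(
    "https://example.com/api/v1/items",
    params={"page": 1},
    timeout=30,
)
response.raise_for_status()

for item in response.json()["items"]:
    print(item["name"])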

Why have semantic HTML

Even if site owners try to stop the scrapers, the people that want the data will find a way. It will just make their code less readable, because it will be clicking on a lot of divs and then searching for even more divs inside of divs. So not having semantic HTML does not really do a great job of stopping them.

But there are a lot of other reasons, why semantic HTML is important.

Accessibility

The first one, and I think the one most frequently mentioned, is accessibility. Semantic HTML can help many different devices access webpages: old browsers, text-only browsers, screen readers and so on.

Screen readers have a lot of ways to navigate a site, which only come into effect if the HTML is used in a semantic way. For example, they can easily find the navigation elements, and they can navigate by headings, by links or by paragraphs. This allows blind users to skim the page, just like a sighted person would - assuming that these elements are used, of course.

Another example could be people who are visually impaired. Using custom CSS can help people who are color blind, require higher contrast, or have other needs. Whether the site is in light mode or dark mode can have an effect on readability for some people. A dark color scheme and reader mode can help with vertigo.

I will admit that, while I do not have any such condition, I use custom CSS to have everything white on a black background. I have just found that I can read more easily with this scheme.

Also, if we provide semantic HTML that can be opened on older devices, then we can keep using those older devices, which is more environmentally friendly as well.

HTML is API

The second reason that I would like to talk about is having one source of truth.

When I took the database course at the Faculty of Economics, we spent a lot of time on the concept of writing each piece of data to the database only once. Data should not be saved in two places, as then it is not clear which place has the correct data.

So the best way to have an API is to make everybody parse just your HTML. If one needs to keep two versions up to date, one of them can lag behind and the problems can stay undetected for longer - especially if one is used less often than the other.

Except in rare cases, the HTML shown to the end user is usually the main representation of the data - even though I have to admit that I do not always think of it this way.

One interesting development in this space is microformats. These are HTML add-ons that help with parsing specific things, like recipes, events or RSVPs. They can then be parsed with many different microformats parsers.

These are then used by different groups, for example the IndieWeb, to keep track of RSVPs to events or to send comments from one site to another.
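
For example, an event marked up with the h-event microformat can be read by any microformats2 parser. A minimal sketch with the Python mf2py library (the HTML snippet is a made-up example):

import mf2py

html = """
<div class="h-event">
  <span class="p-name">Semantic HTML meetup</span>
  <time class="dt-start" datetime="2022-03-01">1 March 2022</time>
</div>
"""

# Returns a dictionary with the parsed microformats items.
parsed = mf2py.parse(doc=html)
event = parsed["items"][0]
print(event["type"])                # ['h-event']
print(event["properties"]["name"])  # ['Semantic HTML meetup']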

Internet Principles

The last point for semantic HTML that I would like to talk about is that it is the right thing to do.

There are a lot of actors trying to push the internet in their preferred direction. Each of them has an effect on the shape the internet is taking. Below, I would like to mention some of the principles that I think are pushing the internet in a user-centric direction.

I would first like to start with the Web Platform Design Principles.

The first one is the 'Put user needs first (Priority of Constituencies)' rule. The shortened text is below.

If a trade-off needs to be made, always put user needs above all. [...]

The internet is for end users: any change made to the web platform has the potential to affect vast numbers of people, and may have a profound impact on any person’s life.

User needs come before the needs of web page authors, which come before the needs of user agent implementors, which come before the needs of specification writers, which come before theoretical purity.

So just because something would be a more beautiful solution, if it would break multiple sites, it will not be accepted, since end users and website developers are more important than the people writing the specification.

And the second one that I want to highlight is the principle of 'It should be safe to visit a web page'. The shortened text is below.

When adding new features, design them to preserve the user expectation that visiting a web page is generally safe.

The Web is named for its hyperlinked structure. In order for the web to remain vibrant, users need to be able to expect that merely visiting any given link won’t have implications for the security of their computer, or for any essential aspects of their privacy.

This one is, in my opinion, a more specific version of the previous principle. It also has the added component that the web should be a safe place for people to be. We can discuss whether the web is currently a safe place - in my opinion, it depends on which parts of the internet one visits.

And the third one is the principle of 'Support the full range of devices and platforms (Media Independence)'. The shortened text is below.

As much as possible, ensure that features on the web work across different input and output devices, screen sizes, interaction modes, platforms, and media.

One of the main values of the Web is that it’s extremely flexible: a Web page may be viewed on virtually any consumer computing device at a very wide range of screen sizes, may be used to generate printed media, and may be interacted with in a large number of different ways.

No matter what sort of devices come in the future, the web should be accessible from both old and new devices. Even if the experience is not the same, as long as the same goals can be reached, this is fine.

I experience this one every time I power up my 12-year-old tablet and try to surf the internet. The experience is... mostly unworkable. I can just go to my laptop and do it there, but some other people might not be able to.

I would then like to continue with the HTML Design Principles. There are a couple of them that I would like to highlight in particular.

The first is the priority of constituencies, which is similar to the one from the web platform design principles. The same is true for the principle of media independence.

The last one is accessibility, with the text below:

Design features to be accessible to users with disabilities. Access by everyone regardless of ability is essential. This does not mean that features should be omitted entirely if not all users can make full use of them, but alternate mechanisms should be provided.

Examples of this would be providing alt text for images, subtitles or transcripts for audio and video, or the ability to use not only a mouse, but also a touch screen or a keyboard.

There are also organizations that try to push for principles that put end users up front. One of them is the Internet Engineering Task Force with their 'The Internet is for End Users' document.

Conclusion

And this is the note that I would like to leave everybody with. The internet should be a safe place for everybody. That does not mean that site developers have to handle every case themselves (though it would help), but they should at least allow the users to make the site better for themselves.

And one of the ways to do this is to use semantic HTML. Neither stopping scrapers nor inconveniencing developers is a good enough reason not to use it.

When Solving Problems, do you Start from Solution or Resources?

When I was attending a meeting at work, I noticed something interesting: how people approach problems from different perspectives.

Let me give a bit of background first. The product that I work on helps with optimizing SaaS. So my team's job is to get as much useful data as possible from as many SaaS services as possible. Here there is always a tension between what we can get and what data would be the most useful for decision making. There is also a tension between going through the official API, which often has more limited data but generally works for everybody who connects it, and logging into the account and scraping the data, with all the problems scraping entails.

Now, the way I usually attempt to do it is that I first see what data we can even get. Only after seeing this would I start thinking about the usefulness of the data. So my job is then to optimize the value of the data that is available.

But we now have a new teammate (?) on the team. He gathered us together to explain a new way of optimization that would greatly help the customers. It was something to do with how to optimize bundles of different products. Fine. Except that in the end we figured out that we do not have this data for a lot of cases where it would be useful. This was even true for one of the cases that he gave us as an example (JetBrains). He looked down when he realized that we could not do this for some of the product bundles that are quite widespread in companies - like Adobe.

This is something that I have been thinking about ever since I read the Software Crisis. There was an interesting quote that led me to research more about it. The quote is below.

It's about finding something you are likely to accomplish with your current resources. As opposed to attempting to do something and then finding out whether you can accomplish it or not.

What they were talking about is two different types of problem solving.

  1. Causal: how can we implement this specific solution to this specific problem?
  2. Effectual: what problems can we solve with the skills and resources we have at hand?

So the difference is whether the goal or the resources take center stage. In the example above, I would be a case of effectual problem solving, while my coworker would be a better example of a causal problem solver.

In my own life, I think I see examples of effectual problem solving a lot more frequently than causal problem solving. I am much more likely to cook with whatever I have available than to think about what I want to eat. Even the job that I currently have - I applied because one of the people working there told me to apply. My public speaking is also just seeing the option and trying it, since applying to speak at a technical conference only costs me a bit of time.

Though taking the two master's courses as a way to get into cognitive science was a display of causal problem solving.

There are a couple of things that are more in line with effectual reasoning than with causal reasoning:

  • just do something and see the results
  • what is the worst that can happen -> just do it, if it has a small downside
  • search for win-win situations with other people

In the programming world, I think this kind of thinking can also be seen in the following quote from the Software Crisis:

Don't be afraid to not make software.

There are hypotheses that what makes entrepreneurs entrepreneurial is effectual reasoning, which is why smaller companies can be better at this than bigger ones - both because they do not have as many resources and because they can be more flexible.

I actually think that, personality-wise, there is a difference between causal and effectual reasoning.

From the Big Five model, my hypothesis is that effectual reasoning is more connected with high openness. Openness deals with how far-reaching the connections are that the mind can make between different concepts. It is also connected with creativity, and seems to be the closest to what is needed to solve problems in this way.

Causal reasoning is connected with high conscientiousness. Conscientiousness is connected both with consistency in processes and with how much work people are willing to put in to achieve their goals. Having goals from the start might also be connected with extroversion, but what is done with them is, in my opinion, connected with conscientiousness.

From the perspective of Jung's functions, effectual reasoning seems to be more connected with the extroverted perceiving functions. Extroverted sensing deals with what is right in front of me and using it. Extroverted intuition is connected with creating connections from what is present in reality. Both of them start from reality.

Causal reasoning seems to be more connected with the introverted perceiving functions. Introverted intuition always starts with one right answer or goal and one right path to it. Introverted sensing deals with the correct processes, based on historical experience. Both are much more focused on the right way to do things than on what is available at the moment.

Personality-wise, it seems like I might prefer effectual reasoning because it is closer to who I am. Something to keep in mind when I work with people with different preferences.

Saving a PDF to a file - Akamai Invoice API

In recent weeks, I have been working on a weird problem. I could see the correct file and save it if I did the workflow in Postman. But when I tried to do it in JavaScript code, everything I tried gave me an empty file.

My first thought was that it had something to do with encoding, or with string vs. buffer, but none of the normal Node.js things helped - like using Buffer.from().toString() in every possible variation. I also tried changing the encoding when saving the file.

I also tried adding some headers, to see if that would give me the data in the form that I wanted.

I asked two of my coworkers for help. One of them then found the solution, which was to add the encoding: null option to the EdgeGrid calls. So the final code sort of looked like this:

eg
  .auth({
    path: `invoicing-api/v3/contracts/${contractId}/invoices/${invoiceId}/download?fileFormat=PDF`,
    method: 'GET',
    encoding: null, // the fix: keep the response body as a raw buffer instead of a string
  })
  .send((error, pdfResponse) => {
    // pdfResponse now holds the raw PDF bytes, ready to be written to a file
  });

Way too many hours went into a fix that ended up being just one line of code. They have EdgeGrid as part of authentication, so I wish there were also EdgeGrid examples in the endpoint documentation.

Instead we get the direct endpoint calls, with a comment that authentication happens somewhere else. Having just that one example would have saved me way too much time.

I wonder if this is an Akamai-specific solution, or whether other APIs also need something like that.