What scraping taught me about the importance of semantic HTML

I had a talk accepted at the online edition of the TECH(K)NOW conference. You can watch the video, but if you are like me and prefer text to audio or video, you can read the blog version of the talk here.

Introduction

Why talk about scraping from the semantic HTML perspective? It is true that semantic HTML comes up much more often in topics connected with accessibility. Usually, when people talk about scraping, it is because they want to stop other people from getting data from their site or, in rarer cases, because they want to make this as easy as possible.

Also, my main job is to get the data we need, either through an API or, when one is not available, through scraping. So I am a lot more familiar with this aspect than with some others.

What is Semantic HTML

But let me first start with what semantic HTML actually is. Semantics is the subdiscipline of linguistics that deals with meaning. It applies on multiple levels, from words to phrases, paragraphs and so on.

So when we talk about semantic HTML, we talk about HTML that is meaningful. This ranges from using the right HTML element (there are over 100 HTML elements) to using the right attributes and so on.

What might be different from the linguistic perspective is that HTML has to be meaningfully understood by at least two very different types of users: the people who read and write HTML, and the machines (like browsers) that read it. Most users of human languages, on the other hand, use them to communicate with other humans, who are a lot more similar to each other than machines and humans are.

Semantic HTML in Scraping

I will be the first to admit that semantic HTML does help with scraping data. And knowing which things are durable and which change all the time can really help in writing more stable scraping code, even for a junior, like I was when I started at my current job.

Semantic HTML is also important because it allows many different types of devices to still access the data. If a browser does not understand an HTML element, it ignores the unknown tag and still shows the content. This is not true of JavaScript, where using a feature not supported by an older browser can make the site inaccessible. HTML has built-in graceful degradation.

First Advice I got

When I first used scraping for my job (not my first time scraping, but still), the advice I got was:

  • never use the Angular tags for any element search
  • ARIA attributes are among the most stable targets, so use them

The reason for the Angular one was that they change way too quickly and are therefore a really bad target. ARIA attributes, on the other hand, almost never change. They are not that frequent, since an HTML element with the same role should be used instead whenever one exists, but I have never had to change an element description because somebody changed or deleted an ARIA attribute.
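To make the advice concrete, here is a minimal sketch in Python, using only the standard library. The markup is hypothetical, and I am assuming that the "Angular tags" look like the generated _ngcontent attribute in the sample. The element is located through its aria-label, so the generated attribute and the random class name can change without breaking the search.

```python
from html.parser import HTMLParser


class AriaLabelFinder(HTMLParser):
    """Collect every element whose aria-label equals the given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.matches = []  # list of (tag, attributes dict)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("aria-label") == self.label:
            self.matches.append((tag, attrs))


# Hypothetical markup: the _ngcontent attribute and the class names are the
# kind of generated values that tend to change between releases.
SAMPLE = """
<div _ngcontent-abc-c42 class="x9f2">
  <button _ngcontent-abc-c42 aria-label="Download report" class="k3j1">
    <span>DL</span>
  </button>
</div>
"""

finder = AriaLabelFinder("Download report")
finder.feed(SAMPLE)
tag, attrs = finder.matches[0]
print(tag, attrs["class"])
```

The same idea carries over to whichever scraping library is in use: the stable attribute goes into the selector, the generated ones stay out of it.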

Semantic elements

Now let us go to the first thing I discovered through scraping. There are still a lot of sites that do not use the semantically best-suited HTML elements. There is a huge variety to choose from: classic ones that have existed from the very beginning, like paragraph or heading; really self-descriptive ones, like navigation or footer; and really specific ones, like time or ruby.

Thankfully, most of the pages I need to deal with do not just use divs inside of divs.

The most frequent failure I saw is that inputs are still often used as buttons: the element is an input with the type submit. In other cases, links are used instead of buttons. Navigation is often a list wrapped in divs rather than a navigation element. Tables are usually tables, but there are still examples where they are just lists or lists of divs. Or sites use tables for both layout and data, so a data table sits inside a layout table.

Some of the more interesting cases were ones where a file is downloaded not by clicking on a link or a button, but by clicking on a form with no input fields.
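Because of this variety, a scraper that looks for "the button" often has to accept several shapes. The following is a rough sketch under my own assumptions (the sample markup and the class name ButtonLikeFinder are made up for illustration): it treats real buttons, inputs of type submit or button, and links with role="button" as button-like.

```python
from html.parser import HTMLParser


class ButtonLikeFinder(HTMLParser):
    """Record every element that acts as a button, in document order."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        input_type = (attrs.get("type") or "").lower()
        if tag == "button":
            self.found.append("button")
        elif tag == "input" and input_type in {"submit", "button"}:
            self.found.append(f"input[type={input_type}]")
        elif tag == "a" and attrs.get("role") == "button":
            self.found.append("a[role=button]")


# Hypothetical form mixing the three variants with one ordinary link.
SAMPLE = """
<form>
  <input type="submit" value="Save">
  <button type="button">Cancel</button>
  <a href="#" role="button">Delete</a>
  <a href="/help">Help</a>
</form>
"""

finder = ButtonLikeFinder()
finder.feed(SAMPLE)
print(finder.found)
```

A link that behaves like a button but carries no role attribute would still slip through; for those, the only option left is usually matching on the visible text.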

ARIA

I do not encounter ARIA that frequently, neither on sites that use semantic HTML nor on those that do not. I would expect less ARIA with semantic HTML because of the first rule of ARIA:

If you can use a native HTML element or attribute with the semantics and behavior you require already built in, instead of re-purposing an element and adding an ARIA role, state or property to make it accessible, then do so. https://www.w3.org/TR/using-aria/#rule1

It also seems that the more ARIA is present, the more accessibility problems the site has on average.

Meaningful attributes

Attributes can also be more or less meaningful. There is a trend on the sites I see that class names are just random letters and numbers. They also seem to be the same across sites. Is this because they are usually up to 4 characters long, and there are not a lot of combinations? Or maybe they are the result of some template or framework?

Another frequent one is class names that start with CSS, followed by random letters and numbers. There were a couple of cases where the classes had names like that, the entire site was divs only, and there were almost no other attributes. These are a pain in the butt to deal with.

But in some cases, classes and other HTML attributes can be really descriptive, also in a semantic way. They actually tell you what is inside the element. I am thankful for every site that is built this way.

Attributes are also often used by third-party developers to create an experience of the site that is better suited to each user. Not having consistent class names can turn into an endless whack-a-mole game between users, third-party developers and website developers.

Platforms often want to push their own agenda and make sure third-party developers cannot create apps or scripts like that. An example would be how both Tumblr and Facebook made it harder for third-party developers to improve the site experience. This could be an app that lets users hide posts on the site, either to hide spoilers for a show or to hide content that could trigger epilepsy or PTSD.

Or sometimes it can be for more personal reasons. Maybe you do not agree with the algorithmic changes to the Twitter timeline. Even I hide the sidebar on StackOverflow and related sites, because it is just too distracting for me. I also hide the main element of sites that I think I visit too much compared to the value I get, like HackerNews.

But making the markup harder to target always makes the site less semantic in its structure, and can therefore break other tools, like screen readers.

As a side note, sites do use the same id for multiple HTML elements, so ids are not always a good solution either.
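Before relying on an id, it can be worth checking whether it really is unique on the page. This is a small sketch with the standard library; the sample markup and the helper name duplicate_ids are assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCounter(HTMLParser):
    """Count how often each id value appears in a document."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids[value] += 1


def duplicate_ids(html):
    """Return a dict of id -> count for ids used more than once."""
    counter = IdCounter()
    counter.feed(html)
    return {value: n for value, n in counter.ids.items() if n > 1}


# Hypothetical page where "price" is reused for two different elements.
SAMPLE = """
<div id="price">10</div>
<span id="price">12</span>
<p id="title">Example</p>
"""

duplicates = duplicate_ids(SAMPLE)
print(duplicates)
```

If the id of interest shows up in the result, it needs to be combined with something else, such as the tag name or the parent element.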

Random changes

When looking at the changes that made us correct our code, these are the most frequent ones:

  • random name changes for the classes
  • semantic HTML elements being replaced by div elements
  • inputs changing to buttons

The move to the div element seems to be going in the wrong direction. I do not really understand why this happens. Is there some group of people advocating for more widespread usage of divs?

But the move from inputs to buttons gives me some hope for the future.

Non-URL based navigation

My biggest annoyance in scraping is actually sites that use JavaScript to generate their navigation and offer no meaningful direct URL navigation. Please, please, please, do not hide links. These navigations usually hide links behind JavaScript and only show them once a certain button is clicked.

This is not just an annoyance for scraping. A website that does not allow me to bookmark a location, but forces me to go through the click ritual each time, is one I would be really annoyed with. And most likely I would not use it for long.

Using API calls instead

All the problems with the HTML can be avoided by just using the same API calls the site would be making anyway. So instead of the site getting the response and creating the HTML with JavaScript, and the scraper then parsing that HTML, the HTTP response can be parsed directly.

This works surprisingly often. I guess almost nobody likes doing backend rendering anymore? At least when it comes to paid internet services; I am sure a lot of blogs are still not a frontend framework and a bunch of API calls.

Usually the HTTP responses change less frequently than the HTML. But I noticed a couple of sites where the reverse is true. So it depends on the site.
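As an illustration, here is roughly what parsing the response directly can look like. The JSON body below is invented; on a real site it would be whatever the page's own API call returns (visible in the browser's network tab), and the field names items and title are assumptions.

```python
import json

# Hypothetical response body, standing in for what the site's own
# API call would return. The structure is made up for this example.
SAMPLE_BODY = """
{"items": [
  {"id": 1, "title": "First post", "published": "2021-03-01"},
  {"id": 2, "title": "Second post", "published": "2021-03-08"}
], "next_page": null}
"""


def extract_titles(body):
    """Pull the titles straight out of the JSON, with no HTML involved."""
    payload = json.loads(body)
    return [item["title"] for item in payload.get("items", [])]


titles = extract_titles(SAMPLE_BODY)
print(titles)
```

There are no class names or nested divs to track here, only the keys of the response, which is why this tends to break less often.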

Why have semantic HTML

Even if site owners try to stop scrapers, the people who want the data will find a way. It will just make their code less readable, because it will be clicking on a lot of divs and then searching for even more divs inside of divs. So not having semantic HTML does not really do a great job of stopping them.

But there are a lot of other reasons why semantic HTML is important.

Accessibility

The first one, and I think the one most frequently mentioned, is accessibility. Semantic HTML can help many different devices access web pages, from old browsers to text-only browsers to screen readers and so on.

Screen readers have a lot of ways to navigate a site, which only come into effect if the HTML is used in a semantic way. For example, they can easily find the navigation elements, and they can navigate by headings, by links or by paragraphs. This allows blind people to skim the page, just like a sighted person would. Assuming that these elements are used, of course.
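The same heading-based navigation is available to any program. The sketch below is my own illustration, not how any particular screen reader works: it builds a page outline from h1 to h6 elements. In the hypothetical sample, the div that only looks like a title does not appear in the outline at all.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineParser(HTMLParser):
    """Build a list of (level, text) pairs from real heading elements."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._current = None  # heading tag that is currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._current = tag
            self._text = []

    def handle_data(self, data):
        if self._current:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._text).strip()
            self.outline.append((int(tag[1]), text))
            self._current = None


# Hypothetical page: the first "title" is a styled div, not a heading.
SAMPLE = """
<div class="title">Looks like a heading</div>
<h1>Semantic HTML</h1>
<p>Intro</p>
<h2>Accessibility</h2>
<h2>HTML is <em>API</em></h2>
"""

parser = OutlineParser()
parser.feed(SAMPLE)
print(parser.outline)
```

If a page is built only from divs, the outline comes back empty, which is about what a user navigating by headings gets as well.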

Another example could be people who are visually impaired. Custom CSS can help people who are color blind, who need higher contrast, or who need something else. Whether the site is in light mode or dark mode can affect readability for some people. A dark color scheme and reader mode can help with vertigo.

I will admit that, while I do not have any such condition, I use custom CSS to make everything white on a black background. I just found out that I can read more easily with this scheme.

Also, if we provide semantic HTML that can be opened on older devices, then we can keep using those devices, which is also more environmentally friendly.

HTML is API

The second reason that I would like to talk about is having one source of truth.

When I took the database course at the Faculty of Economics, we spent a lot of time on the concept of writing each piece of data in the database only once. The data should not be saved in two places, because then it is not clear which place has the correct data.

So the best way to have an API is to make everybody parse just your HTML. If one needs to keep two versions up to date, one of them can lag behind and problems can stay undetected for longer, especially if one is used less often than the other.

Except in rare cases, the HTML shown to the end user is the main representation of the data, even though I have to admit that I do not always think of it this way.

One interesting development in this space is microformats. These are conventions layered on top of HTML, mostly agreed-upon class names, that help with parsing specific things, like recipes, events or RSVPs. The marked-up HTML can then be parsed with many different microformats parsers.

These are then used by different groups, for example the IndieWeb, to keep track of RSVPs to events or to send comments from one site to another.
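To give a feeling for how this works, here is a deliberately tiny sketch that reads a few properties from an event marked up with microformats2-style class names (h-event, p-name, dt-start, p-location). It is not a full parser: it assumes a single event, no void elements such as img or br, and the sample markup is made up. A real microformats parser handles far more cases.

```python
from html.parser import HTMLParser


class EventParser(HTMLParser):
    """Read p-* text and dt-* datetime values from one marked-up event."""

    def __init__(self):
        super().__init__()
        self.props = {}
        self._stack = []  # property name (or None) for every open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = None
        for cls in (attrs.get("class") or "").split():
            if cls.startswith("dt-") and attrs.get("datetime"):
                # Dates come from the machine-readable datetime attribute.
                self.props[cls[3:]] = attrs["datetime"]
            elif cls.startswith("p-"):
                prop = cls[2:]
                self.props.setdefault(prop, "")
        self._stack.append(prop)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text belongs to every p-* property that is currently open.
        for prop in self._stack:
            if prop:
                self.props[prop] += data


# Hypothetical event markup.
SAMPLE = """
<div class="h-event">
  <h1 class="p-name">IndieWeb meetup</h1>
  <time class="dt-start" datetime="2021-05-01T18:00">1 May, 6pm</time>
  <span class="p-location">Online</span>
</div>
"""

parser = EventParser()
parser.feed(SAMPLE)
print(parser.props)
```

The point is that the page a person reads and the data a program extracts come from the same HTML, so there is only one version to keep up to date.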

Internet Principles

The last point for semantic HTML that I would like to talk about is that it is the right thing to do.

There are a lot of actors trying to push the internet in their preferred direction. Each of them has an effect on the shape the internet is taking. Below, I would like to mention some of the principles that I think are pushing the internet in a user-centric direction.

I would first like to start with the Web Platform Design Principles.

The first one is the 'Put user needs first (Priority of Constituencies)' rule. The shortened text is below.

If a trade-off needs to be made, always put user needs above all. [...]

The internet is for end users: any change made to the web platform has the potential to affect vast numbers of people, and may have a profound impact on any person’s life.

User needs come before the needs of web page authors, which come before than the needs of user agent implementors, which come before than the needs of specification writers, which come before theoretical purity.

So even if something would be a more beautiful solution, it would not be accepted if it broke multiple sites, since end users and website developers are more important than the people writing the specification.

The second one that I want to highlight is the principle 'It should be safe to visit a web page'. The shortened text is below.

When adding new features, design them to preserve the user expectation that visiting a web page is generally safe.

The Web is named for its hyperlinked structure. In order for the web to remain vibrant, users need to be able to expect that merely visiting any given link won’t have implications for the security of their computer, or for any essential aspects of their privacy.

This one is, in my opinion, a more specific version of the previous principle. It also has the added component that the web should be a safe place for people to be. We can discuss whether the web currently is a safe place. In my opinion, it depends on which parts of the internet one visits.

And the third one is the principle 'Support the full range of devices and platforms (Media Independence)'. The shortened text is below.

As much as possible, ensure that features on the web work across different input and output devices, screen sizes, interaction modes, platforms, and media.

One of the main values of the Web is that it’s extremely flexible: a Web page may be viewed on virtually any consumer computing device at a very wide range of screen sizes, may be used to generate printed media, and may be interacted with in a large number of different ways.

It does not matter what sort of devices come in the future: the web should be accessible from both old and new devices. Even if the experience is not the same, as long as the same goals can be reached, this is fine.

I experience this one every time I power up my 12-year-old tablet and try to surf the internet. The experience is... mostly unworkable. I can just go to my laptop and do it there, but some other people might not be able to.

I would then like to continue with the HTML Design Principles. There are a couple of design principles that I would like to highlight in particular.

The first is the priority of constituencies, which is similar to the one from the web platform design principles. The same is true for the principle of media independence.

The last one is accessibility, with the text below:

Design features to be accessible to users with disabilities. Access by everyone regardless of ability is essential. This does not mean that features should be omitted entirely if not all users can make full use of them, but alternate mechanisms should be provided.

Examples of this would be providing alt text for images, subtitles or transcripts for audio and video, or the ability to use not only a mouse but also a touch screen or a keyboard.

There are also organizations that try to push for principles that put end users first. One of them is the Internet Engineering Task Force, with their document The Internet is for End Users.

Conclusion

And this is the note that I would like to leave everybody with. The internet should be a safe place for everybody. That does not mean that site developers have to account for every case (though it would help), but they should at least allow users to make it better for themselves.

And one of the ways to do this is to use semantic HTML. Because neither stopping scrapers nor inconveniencing developers is a good enough reason not to use it.