What scraping taught me about the importance of semantic HTML

My talk was accepted to the online edition of the TECH(K)NOW conference. This is the draft of the content I am going to present there (and it will be the place for the finished version, after I record the talk).

What is semantic

!! What is the semantic use of HTML + description + a couple of examples

!! working definition of semantics is “enough meaning to result in an action” [citation https://www.oreilly.com/library/view/mining-the-social/9781449368180/ch08.html]

Why scraping

Problems noticed while scraping

!! Have examples of differently structured HTML that you noticed at work !! Are there any examples from my work? Check the code

ng tags are bad, ARIA attributes are good

The first advice I got

Use the right semantic element

Button for search (BAD): #FiltersDiv input[name=go]

Button to accept invitation (BAD): input[value="Accept invitation"]

Link for file download (BAD): div.action-header form:not(.checkoutForm):not(.buy-link)

Navigation (BAD): ul.pagination a.endless_page_link

Div instead of a table
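To make the point concrete, here is a minimal sketch using only Python's standard `html.parser`. The two HTML snippets are made-up illustrations (not taken from any real site), and `ButtonFinder` and `find_buttons` are hypothetical helper names. When the page uses a real button element, the scraper can ask for "all buttons" and does not need to know anything about ids or class names.

```python
from html.parser import HTMLParser


class ButtonFinder(HTMLParser):
    """Collect the visible text of every <button> element."""

    def __init__(self):
        super().__init__()
        self.buttons = []
        self._inside = False  # True while we are inside a <button>
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self._inside = True
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "button" and self._inside:
            self.buttons.append("".join(self._text).strip())
            self._inside = False


def find_buttons(html):
    finder = ButtonFinder()
    finder.feed(html)
    finder.close()
    return finder.buttons


# Hypothetical markup: the same search form, written two ways.
semantic = '<form><input name="q"><button type="submit">Search</button></form>'
non_semantic = '<div id="FiltersDiv"><input name="q"><div class="btn go">Search</div></div>'

print(find_buttons(semantic))      # ['Search']
print(find_buttons(non_semantic))  # []
```

In the second snippet the only way to find the control is to guess at the id and class names, which is exactly what the BAD selectors above had to do.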

Use descriptive class and id names

Error message (BAD): div.py5.max-width-2.mx-auto.big.line-height-3 > div

Error message (GOOD): div.alert-warning

Error message (GOOD): div[data-ref="account.billing.no-billing-access"]

Error message (GOOD): Button to accept invitation input[value="Accept invitation"]

Total (BAD): div.nobr

Next page (BAD): table[role="presentation"] > tbody > tr > td > a > i.cnqr-arrow-ww

Next page link (BAD): body > table > tbody > tr:nth-child(2) > td > table > tbody > tr > td > table > tbody > tr > td > center > form > table > tbody > tr:nth-child(1) > td > table.TblBgColor > tbody > tr:nth-child(2) > td > table > tbody > tr > td > font > b + a
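A small sketch of why the descriptive hook matters, again with only the standard library. Both HTML snippets and the message text are invented for illustration, and `AttrTextFinder` and `find_text` are hypothetical helpers. The idea: a redesign changes the wrapper element and its utility classes, but a lookup by the descriptive `data-ref` attribute keeps working.

```python
from html.parser import HTMLParser


class AttrTextFinder(HTMLParser):
    """Collect the text of the first element whose attribute equals a value."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr, self.value = attr, value
        self.text = None   # stays None until a matching element is closed
        self._tag = None   # tag name of the matched element
        self._depth = 0    # same-named tags currently open inside the match
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._tag is None and self.text is None:
            if dict(attrs).get(self.attr) == self.value:
                self._tag, self._depth = tag, 1
        elif tag == self._tag:
            self._depth += 1

    def handle_data(self, data):
        if self._tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self.text = "".join(self._chunks).strip()
                self._tag = None


def find_text(html, attr, value):
    finder = AttrTextFinder(attr, value)
    finder.feed(html)
    finder.close()
    return finder.text


# Hypothetical page before and after a visual redesign.
before = ('<div class="py5 max-width-2 mx-auto">'
          '<div data-ref="account.billing.no-billing-access">No billing access</div></div>')
after = ('<section class="p4 mx-auto">'
         '<div data-ref="account.billing.no-billing-access">No billing access</div></section>')

ref = "account.billing.no-billing-access"
print(find_text(before, "data-ref", ref))  # No billing access
print(find_text(after, "data-ref", ref))   # No billing access
print(find_text(after, "class", "py5 max-width-2 mx-auto"))  # None
```

The lookup by presentational classes breaks as soon as the styling changes, while the descriptive attribute survives because it names what the element is, not how it looks.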

random class name changes

ARIA attributes missing

Navigation that is not URL-based (don't hide links)

Navigation button (BAD): div.responsive-wrapper__container.team-mngmt > div.tabs__list > span:nth-child(2)

Navigation button (BAD): div.selectedState:not(.helpBtn)

Navigation buttons (BAD): #invoices-info > div.c-row > a

Navigation buttons (BAD): div.selectedState:not(.helpBtn) -> div.item.settings -> a.row.billing
We have nav and button elements
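As a rough illustration of what real links inside a nav element give you, here is a sketch that collects every href found inside nav. The markup is invented (including the `go(2)` click handler), and `NavLinkCollector` and `nav_links` are hypothetical names. With span-based tabs that only react to scripts, there is simply no URL to collect.

```python
from html.parser import HTMLParser


class NavLinkCollector(HTMLParser):
    """Collect the href of every <a> that sits inside a <nav>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._nav_depth = 0  # > 0 while inside at least one <nav>

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self._nav_depth += 1
        elif tag == "a" and self._nav_depth:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "nav" and self._nav_depth:
            self._nav_depth -= 1


def nav_links(html):
    collector = NavLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical markup: URL-based navigation vs. script-only tabs.
with_links = ('<nav><a href="/settings">Settings</a>'
              '<a href="/settings/billing">Billing</a></nav>')
script_only = ('<div class="tabs__list"><span>Settings</span>'
               '<span onclick="go(2)">Billing</span></div>')

print(nav_links(with_links))   # ['/settings', '/settings/billing']
print(nav_links(script_only))  # []
```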

Why is semantic HTML important

!! accessibility !! There are cognitive impairments that can make websites browsable only in reader mode: https://alistapart.com/article/accessibility-for-vestibular/ !! Screen readers: what screen readers use (https://alistapart.com/article/conversational-semantics/): links, interaction, header traversal !! configurability !! semantic web

!! include in the speech: the web is for users -> from the standards

Clean up your Twitter: https://calumryan.com/articles/cleanup-your-twitter-timeline


HTML Design Principles

This document describes how HTML itself is designed, but it is good reading for anyone writing for the web: https://www.w3.org/TR/html-design-principles/

Degrade Gracefully

Even if markup is not recognized (by old or less capable user agents), it should degrade gracefully (maybe by providing additional semantic information in the elements) -> it means that no new element is introduced without thinking about this

Examples: Proposed new multimedia elements like canvas or video allow fallback content. Older user agents will show the fallback, while user agents supporting canvas or video will show the multimedia content.
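The fallback behaviour can be imitated with a toy text renderer. This is only a sketch of the idea and not how any real browser works: `TextRenderer`, `render`, the `supports_video` flag and the `talk.webm` file name are all assumptions made for the example. The "old" user agent does not know the video element, so it ignores the tag and shows the text inside it; the "new" one shows the video and skips the fallback.

```python
from html.parser import HTMLParser


class TextRenderer(HTMLParser):
    """Toy user agent that renders a page as plain text."""

    def __init__(self, supports_video):
        super().__init__()
        self.supports_video = supports_video
        self.output = []
        self._in_video = 0  # > 0 while inside a <video> element

    def handle_starttag(self, tag, attrs):
        if tag == "video":
            self._in_video += 1
            if self.supports_video:
                # Stand-in for actually playing the video.
                self.output.append("[video: %s]" % dict(attrs).get("src"))

    def handle_endtag(self, tag):
        if tag == "video" and self._in_video:
            self._in_video -= 1

    def handle_data(self, data):
        # A user agent that supports video hides the fallback content.
        if self._in_video and self.supports_video:
            return
        text = data.strip()
        if text:
            self.output.append(text)


def render(html, supports_video):
    renderer = TextRenderer(supports_video)
    renderer.feed(html)
    renderer.close()
    return " ".join(renderer.output)


# Hypothetical page with fallback content inside the video element.
page = ('<p>Talk recording:</p>'
        '<video src="talk.webm">Your browser cannot play this video; '
        '<a href="talk.webm">download it</a></video>')

print(render(page, supports_video=True))
# Talk recording: [video: talk.webm]
print(render(page, supports_video=False))
# Talk recording: Your browser cannot play this video; download it
```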

Priority of Constituencies

On the web, end users are the priority. Citation: "In case of conflict, consider users over authors over implementors over specifiers over theoretical purity. In other words costs or difficulties to the user should be given more weight than costs to authors; which in turn should be given more weight than costs to implementors; which should be given more weight than costs to authors of the spec itself, which should be given more weight than those proposing changes for theoretical reasons alone. Of course, it is preferred to make things better for multiple constituencies at once."

Structure over presentation

Citation: "HTML should allow separation of content and presentation. For this reason, markup that expresses structure is usually preferred to purely presentational markup."

Media independence

Features should, when possible, work across different platforms, devices, and media. This should not be taken to mean that a feature should be omitted just because some media or platforms can't support it. For example, interactive features should not be omitted merely because they can not be represented in a printed document.

HTML should be accessible

Citation: Design features to be accessible to users with disabilities. Access by everyone regardless of ability is essential. This does not mean that features should be omitted entirely if not all users can make full use of them, but alternate mechanisms should be provided.

The image in an img may not be visible to blind users, but that is a reason to provide alternate text, not to leave out images.
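Checking for missing alternate text is easy to automate. Below is a minimal sketch with the standard library; the file names are invented and `MissingAltFinder` and `images_without_alt` are hypothetical helpers. It only flags images with no alt attribute at all, because an empty alt="" is a valid way to mark a purely decorative image.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> that has no alt attribute at all."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing <img ... /> tags.
        attrs = dict(attrs)
        if tag == "img" and "alt" not in attrs:
            self.missing.append(attrs.get("src"))


def images_without_alt(html):
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing


# Hypothetical markup: described image, undescribed image, decorative image.
html = ('<img src="chart.png" alt="Sales by month">'
        '<img src="logo.png">'
        '<img src="spacer.gif" alt="">')

print(images_without_alt(html))  # ['logo.png']
```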

The backbone of accessibility is semantic HTML, which at its heart means using HTML elements as they were intended to be used. https://www.customerservant.com/in-reply-to-jgmac1106-regarding-indieweb-and-a11y/

Extending semantic HTML

Microformats are a way to embed the semantics of information in HTML in a backward-compatible way [citation from https://www.oreilly.com/library/view/mining-the-social/9781449368180/ch08.html]

Example of calendar:

My talk at Online
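To show what the extra class names buy a consumer, here is a simplified sketch that reads an event marked up with the microformats2 h-event vocabulary (`h-event`, `p-name`, `dt-start`). The event name and date are placeholders, not the real talk details, and `EventParser` and `parse_event` are hypothetical helpers. It is far from a full microformats parser: it assumes one event per page, does not check that properties are nested inside the h-event root, and expects the name to contain no nested tags.

```python
from html.parser import HTMLParser


class EventParser(HTMLParser):
    """Read p-name and dt-start from h-event markup (very simplified)."""

    def __init__(self):
        super().__init__()
        self.event = {}
        self._capture = None  # property whose text is being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "p-name" in classes:
            self._capture = "name"
            self.event["name"] = ""
        if "dt-start" in classes and attrs.get("datetime"):
            self.event["start"] = attrs["datetime"]

    def handle_data(self, data):
        if self._capture:
            self.event[self._capture] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current capture.
        self._capture = None


def parse_event(html):
    parser = EventParser()
    parser.feed(html)
    parser.close()
    return parser.event


# Placeholder event, marked up with microformats2 class names.
markup = ('<div class="h-event">'
          '<span class="p-name">Example talk</span> on '
          '<time class="dt-start" datetime="2000-01-01T10:00">1 January</time>'
          '</div>')

print(parse_event(markup))
# {'name': 'Example talk', 'start': '2000-01-01T10:00'}
```

A browser that knows nothing about microformats still renders the same text, which is the backward-compatible part.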




Older browsers

!! progressive enhancement !! HTML can also not be recognized in older browsers !! lang attribute (IE6 and IE7) !! video

Good practices

!! good use -> microformats and IndieWeb !! HTML as an API